The Digital Labyrinth

Chapter 1

Trouble In the Cathedral

Why are many large software projects either ‘functional failures’ or downright disasters? Why does software development seem so different from other forms of engineering design? Is software just an example of a general class of problem domains where good design is difficult or even impossible?

Software is in crisis. Wherever we look, information technology projects are out of control. They frequently overrun their budgets and timescales, often to the brink of cancellation. Even if they make it into operation, complex computer systems are often plagued with insidious bugs, unplanned performance problems and unwanted behaviour ranging from annoying “features” to critical, life-threatening faults.

Today software is embedded in the infrastructure of our daily lives: the car we drive, the elevator we ride, the machines that cook our food or wash our clothes. Alexa listens for our whispered commands for knowledge, light and heat. From air traffic control networks, aircraft landing systems and nuclear power plant controllers to heart pacemakers, we stake our lives on the integrity of the software that drives them. Our very identity, our political and economic rights and privileges, are defined by patterns in digital information systems. Recent history is full of expensive and salient examples of what can go horribly wrong with large-scale IT projects.

After a career spent building complex software systems, my original motivation for this book was to understand the root causes of computing failure and to explore some possibly better methods for computer engineering. In the course of research, however, the scope has expanded hugely to include the most generalised notion of design itself, its difficulties and limitations. Design lies at the heart of the universe. Our very existence demonstrates that there are indeed some very powerful methods available to us in exploring astronomically large spaces of possibility. Can we harness these hidden strategies to improve the quality and robustness of our engineering products, our pharmaceuticals, our economics and perhaps even our systems of politics, business and aesthetics?

Some IT Horror Stories

The new international airport planned for Denver in 1994 was to be a wonder of modern engineering. Twice the size of Manhattan, with 10 times the capacity of Heathrow, the airport is big enough to land three jets simultaneously in bad weather. A fundamental part of the new infrastructure was an automated baggage-handling system. An impressive 4,000 computer-controlled "telecars" would route luggage along 21 miles of steel track between the counters, gates and claim areas of 20 different airlines. Some 5,000 electric eyes, 400 radio receivers and 56 bar-code scanners would be controlled by a network of 100 separate computers. The airport's grand opening, originally scheduled for Halloween, was postponed until December to allow the software contractor time to eradicate thousands of bugs in the $193-million system. The deadline slipped from December to March, and then to May. Finally in June, losing money at over $1 million a day, the airport's builders were forced to admit that they could not predict when the baggage system would ever be stable enough for the airport to open. The plug was finally pulled on the whole system in 2005.

In 1987 the California Department of Motor Vehicles initiated a project to merge the state's driver and vehicle registration systems. The original project plan was to roll out convenient one-stop renewal kiosks within 4 years. Instead the DMV saw the projected cost explode to 6.5 times the expected price and the delivery date extend to 1998. In December 1994 the project was cancelled, with an uninsured loss of $44.3 million to the state of California.

Yet another well-publicised public disaster was the FAA's upgrade for the aging US air traffic control system. The Advanced Automation System (AAS) comprised more than a million lines of computer code distributed across hundreds of computers and embedded into new and sophisticated hardware. This is a safety-critical system that must respond to unpredictable real-time events with foolproof reliability. The FAA had budgeted to pay $500 per line of computer code, some five times the industry average for well-managed development processes. They ended up paying $700 to $900 per line of code because, according to a report on the project by the Center for Naval Analyses, IBM's "cost estimation and development process tracking used inappropriate data, were performed inconsistently and were routinely ignored". An internal FAA report concluded that the price became so astronomical because "on average every line of code developed needs to be rewritten once." In June 1994 two of the four major parts of the programme were cancelled and a third scaled down after tests showed that, despite soaring costs, half the completed system was unreliable. The $144 million spent on these failed programmes is dwarfed by the $1.4 billion invested to deploy new workstation software for air-traffic controllers. That project also spiralled out of control and was delivered over 5 years late.

These examples are all US-based, but similar government-funded computer fiascos abound in all advanced countries: in the United Kingdom, for example, from the National Health Service to the Ministry of Defence and the Inland Revenue.

Studies have shown that for every six new large-scale IT systems that are put into operation, two others are cancelled. The average software development project overshoots its schedule by half, with larger projects typically faring even worse. Some three quarters of all large systems are "operating failures" that either do not function as intended or are not used at all. A 1994 survey of 24 leading companies by IBM discovered that 55 percent of large distributed systems projects cost more than expected, 68 percent overran their schedules and 88 percent had to be substantially redesigned. A similar study in 1995 by the Standish Group, a research organization in Massachusetts, estimated that cancelled software projects in the US amounted to some $81 billion per annum. Only about 16% of software and IT projects were completed on time and on budget, while a staggering 31% were deemed complete failures (cancelled). Furthermore, few projects delivered what was ordered: only 61% of the originally specified requirements were available when the few successful systems went live.

Comparison with Other Engineering Disciplines

How do these statistics stack up against similar sized projects in engineering and construction? 

While it is generally true that large engineering projects overrun on time and cost, they are far more successful in delivering to specification and budget. More significantly, physical engineering systems, such as jet airliners or nuclear power plants, show massively better reliability and fault tolerance than large software systems of comparable complexity. If airliners crashed with the same frequency as the operating system in your PC, would you ever consider taking that trip abroad this summer?

Let's face it – the state of software engineering is such that our economic livelihood and even our very lives are increasingly in jeopardy. We must first recognise that computer science has made huge strides over the last fifty years, developing sophisticated analysis methodologies, testing strategies and object-oriented programming tools – not to mention artificial intelligence. And yet there seems to be some elusive difference in character between the worlds of software development and those of applied engineering, chemistry and physics. After decades of research and development by some of the brightest people on the planet, software development still has more in common with the arts and crafts movement than with science.

Even the measurement of productivity in terms of ‘lines of code per development day’ is notoriously misleading unless the entire software lifecycle is taken into account. Today our software architects have more in common with the builders of the great medieval cathedrals than with the engineers at Boeing or the architects of the Sydney Opera House.

The initial aim of this book was to explore exactly what the underlying problem is with software engineering. Just why is civil and mechanical engineering so overwhelmingly successful compared with software design? What is it about the physical engineering disciplines that makes them seemingly inherently less difficult to get right first time? In exploring these issues, however, the question becomes a much broader and more fundamental one – just what is the process of design, and how does it change as we move from one domain of application, such as building a bridge, to another, such as a real-time software system?

To answer this question we need to devise a model of the design process itself which can be applied across all disciplines. In doing so we shall discover that, as in many things, Mother Nature has beaten us to it. The result of a blind algorithmic process, evolution by natural selection, living systems display apparently smart design and components of seemingly irreducible complexity. From the design of the mammalian eye (far surpassed, however, by the eyes of an octopus) to the exquisite molecular machines inside eukaryote cells, we see fine-tuned design which is efficient, robust, resilient and endlessly repeatable. Exploring how evolution discovers such solutions in an effectively infinite space of possible designs will provide insight into how we might begin to build more reliable software systems. However, the applications of recent work in evolutionary biology go far beyond engineering design – showing us how we might better organise our business and social systems.

Beauvais Cathedral and Other Disasters

Before we turn again to software, let's look back at our architectural and engineering legacy. The path towards the confident, soaring structures that we see around us today has not always run smoothly.

The builders of the great cathedral at Beauvais in northern France were the engineering visionaries of their time. They conceived a slender stone structure, full of light and space, soaring above the surrounding wooden houses, taller and more graceful than anything yet built by man. The nave would be fully 48m high, 5m higher than that of the cathedral in neighbouring Amiens. The cathedral was to be a monument to the greater glory of God, a symbol of the wealth and power of the church and of the artistic and technological skills of the local artisans. Driven by religious zeal, ambition and pride in their skill, the medieval architects strove to build ever larger, higher and wider cathedrals. Having only a rule-of-thumb understanding of how their materials worked, they had no sound mathematical methods for predicting how their design would perform once built.

Construction at Beauvais began in 1247, and the choir was successfully completed around 1280. Early in the morning of January 17, 1284, the great arching vault spanning the choir suddenly collapsed. Tons of stone and timber crashed into the choir space below, leaving the cathedral and the engineering vision that inspired it in ruins. The disaster at Beauvais was the single most devastating failure of design in the entire programme of medieval cathedral building and was the impetus for a host of architectural innovations that were to follow.

The gothic architecture we see surviving today is the result of a challenge in engineering design: how to span in stone ever-wider vaults from ever-greater heights. From 1100 onward, a series of technological innovations allowed visionary architects to play with light and space in ways hitherto undreamed of. The pointed arch allowed higher and wider openings than were possible with the round ‘Roman’ arch. Systems of stone ribs allowed vaults to be made of lighter, thinner stone and the walls to accommodate ever-larger windows. A further major innovation, the flying buttress, absorbed the outward thrust of the walls, making it possible to reduce a cathedral’s external masonry shell to an open skeleton.

The centuries following Beauvais have seen many other catastrophic failures in engineering and architecture, some more famous than others. 

In 1879 the Tay Bridge collapsed into the Firth of Tay at Dundee, catapulting a train, 6 carriages and 75 people to their deaths in the swirling waters. The collapse of the bridge, opened only 19 months earlier and passed as safe by the Board of Trade, sent shock waves through the Victorian engineering profession and the general public. The disaster is one of the most famous bridge failures and remains the worst structural engineering failure in the British Isles.

On the afternoon of August 29, 1907, steelworkers perched high on the partially constructed south arm of the Quebec Bridge awaited the final whistle for the day. With a span of eighteen hundred feet when completed, the design was the most ambitious cantilever bridge then attempted. With a loud report like a cannon shot, two compression chords in the south anchor arm of the bridge failed without warning. With the steel structure tortured far beyond its design limits, the nineteen thousand tons of the south anchor and cantilever arms and the partially completed centre span collapsed onto the banks of the St. Lawrence River. Of the eighty-six men on the bridge when it failed, only eleven survived.

More recent times have seen spectacular failures of complex engineering systems. On January 28, 1986, the space shuttle Challenger rose from its launchpad into a cloudless sky. It was a bright, crisp morning with a ground temperature of 36°F, some 17°F lower than the previous lowest launch temperature. One minute after launch a flickering flame appeared at a joint in the shell of the right-hand solid rocket booster. Thirteen seconds later the main fuel tank ruptured, releasing hundreds of tons of liquid hydrogen and liquid oxygen into the air. In the explosion that followed, the Challenger was ripped apart, the crew of seven plummeting to their deaths in the Atlantic. The images of the flaming debris were seared forever into the minds of the television audience, the confidence and pride of the American nation in its technology shaken to its very core. The disaster suspended the shuttle programme for over two years and prompted an investigatory commission to be set up by the President.

The Challenger disaster was ultimately a failure not of engineering design but of the decision-making and management processes within NASA. The design faults in the O-ring seals of the solid rocket boosters, and the implications for operation at low temperature, were well known to the design engineers at sub-contractor Thiokol and had been documented and communicated to NASA. However, for reasons of political and economic expediency, pressure was brought to bear on the engineers to approve a launch in operating conditions that all concerned knew to be dangerous. What followed was as inevitable as night following day.

On reflection, surely what is remarkable is not that engineering disasters occur at all but how relatively few projects end in catastrophic failure. Our world is not filled with crumbling, collapsing structures and erratic, unreliable machines. What is remarkable is the manifest success of engineering in building a highly coherent world of soaring architecture and highly complex machines that not only work but that we unquestioningly trust with our lives. From microelectronics to digital communications, nanotechnologies and new gene technologies, our society is reaping unprecedented benefit from the fruits of engineering science. Whilst failures in engineering design can and do occur, their frequency is astoundingly low given the scale, ambition and complexity of many of the projects now undertaken.

The last five hundred years have seen immense advancement in science, engineering and architecture. Wherever we look we see testament to the power of the engineer to bend nature to new ends, to shape new materials into structures that are not only functional but also beautiful and inspiring. Since the Renaissance, our understanding of the forces that govern the physical world has been underpinned by explanatory laws of immense predictive power. Western scientific empiricism has yielded knowledge which, while described through beautiful mathematical abstractions, has immediate and widespread practical application. The rough-and-ready rules of thumb of the cathedral builders have been replaced by the precise tools of mathematical analysis and materials science. The flowering of physics and chemistry has allowed our knowledge of materials, forces and motion to be deployed in precise and reproducible ways in countless fields of application. The classical mechanics and differential calculus of Newton allowed us to model and predict the behaviour of the world around us and of the heavens beyond. The unravelling of the molecular structure of matter in the last century has given new insight into the properties of materials and the means to devise new, stronger ones. The insights into the fundamental nature of matter provided by quantum mechanics (now over a century old) have enabled us to create undreamt-of conceptions, from microprocessors, lasers, GPS and mobile phones to NMR scanners.

The rapidly expanding urban populations of the industrial revolution created a host of new problems that demanded solutions supported by new technologies and new political and economic systems. Food production on the land became mechanised to feed the urban masses. Transport networks linked isolated communities across continents, and global communications systems shrank the planet to a village. The invention of high-tensile steel and reinforced concrete gave rise to new corporate cityscapes that dwarfed the achievements of the medieval cathedral builders. This coupling between sociological and economic demands and their solution through technological innovation forms an autocatalytic system. Industrial history is a series of paradigm shifts: water giving way to steam, steam to gas and internal combustion; analogue electronics to digital, and soon to quantum computing.

So, given this seemingly unstoppable march of progress and knowledge, we must return to our original question. Why have many of the techniques which have proven so powerful and so successful in other branches of engineering failed so signally in the realm of computer science?

The Fortuitous Nature of Physical Engineering

This massive progress and achievement in engineering was possible only because of a number of fortuitous properties of the universe and the way in which mathematics allows us to describe it. Arguably, the technological world is the result of a fruitful isomorphism between the external world and the corresponding mental model evolved within the human mind.

The first lucky break is that, to a high degree of approximation, the universe as considered by the engineer behaves as a linear, classical system. To tolerances sufficient for many engineering applications, effects are proportional to their causes – small changes in the initial conditions give rise to proportionately small changes in the final state as the system evolves over time. Furthermore, the relationship between the initial conditions and the final state of the system can be characterised exactly in the language of mathematics.

This is not to say that the universe is not abundant with highly non-linear, chaotic systems in which infinitesimal changes in initial conditions give rise to enormously amplified and not easily predicted effects. This is best exemplified by the famous ‘butterfly effect’, where tiny differences in air pressure in one part of the world (perhaps caused by the flapping of a butterfly’s wing) can cause huge differences in weather systems on the other side of the world. These ‘inconvenient’ aspects of the world – aerodynamic and fluid turbulence, weather systems, even the stock market – have until recently been largely ignored by the classical engineer. Firstly, this is because, until the advent of numerical modelling on computers, the mathematics of chaotic systems was intractable. Secondly, huge swathes of profitable ‘application space’ lie within the region of ordered, quasi-linear behaviour where non-linear phenomena, such as turbulence, can be sufficiently approximated by a few differential equations. In most of engineering, we can safely turn a blind eye to chaos – its effects are lost in the noise.
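The butterfly effect is easy to demonstrate. The sketch below is a minimal Python illustration; the logistic map and the particular starting values are chosen purely for illustration and are not drawn from the text. It iterates the same simple non-linear rule from two starting points that differ by one part in a billion.

```python
# Sensitive dependence on initial conditions, illustrated with the logistic map
# x_{n+1} = r * x_n * (1 - x_n) in its chaotic regime (r = 4).
# The map and the parameters are illustrative stand-ins for any chaotic system.

def trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map from x0 and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = trajectory(0.200000000)   # one starting value
b = trajectory(0.200000001)   # differs by one part in a billion

for n in (0, 10, 20, 30, 40, 50):
    print(f"step {n:2d}: |difference| = {abs(a[n] - b[n]):.3e}")
# After a few dozen iterations the two trajectories bear no resemblance to
# each other; no rounding of the initial measurement is fine enough to keep
# them together.
```

In a quasi-linear regime the two runs would stay a billionth apart indefinitely; here the gap grows until it is as large as the system itself, which is precisely why long-range weather forecasting is so hard.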

Why is this linearity so critical to the success of engineering? Today almost all engineering projects are designed through the use of models – at first physical and now largely virtual – which allow the design to be tested within a representation of the problem domain. To a greater or lesser extent, the behaviour of the model can be extrapolated to predict how the full-size version will perform. We don’t build full-size jet airliners without first testing the aerodynamics with a small model in a wind tunnel. This obviously has a dramatic effect on the costs, timescales and safety of building large, complex systems. The application of static and dynamic mechanics, ballistics and hydraulics is testament to this underlying linearity. It seems that not only is God an Englishman, he is also a staunch Newtonian whose creed is the Lagrangian.

But surely the universe is not a classical linear system? The cosy model of the world as conceived by Newton was finally blown apart in the early twentieth century by the development of quantum mechanics and general relativity. Peel apart the atomic heart of matter and what we see is not the tiny solar system of the Rutherford atom but a ghost world of virtual particles, wave packets and quantised energy states. It is a world where exact knowledge is forever denied us and we must work not with tolerances but with probabilities. Newton’s laws of motion successfully predict the motion of the stars and planets to a very high degree of accuracy, but show increasing anomalies in the presence of strong gravitational fields. In contrast, quantum electrodynamics, developed by Richard Feynman and others, is our most astonishingly precise scientific theory – able to calculate some outcomes with an accuracy of up to 14 decimal places.

Both quantum mechanics and relativity have been assimilated into modern engineering applications very successfully. Indeed, many fundamental advances have been possible only because of our new understanding of the quantum universe – lasers, semiconductors, silicon chips and superconducting magnets, for example. Similarly with relativity: the calculations used to navigate a spacecraft around the solar system must routinely take into account the curvature of Einsteinian space due to gravity. The GPS we use every day would be inaccurate by kilometres if it did not adjust for the dilation of time in Earth’s gravity well.
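The size of that GPS error is easy to estimate. The figure of roughly 38 microseconds per day – the commonly quoted net offset of a GPS satellite clock once gravitational blueshift and orbital time dilation are combined – is assumed here rather than derived; the sketch simply converts it into a ranging error.

```python
# Rough estimate of the positional error that would accumulate if GPS
# ignored relativistic clock corrections.
# Assumption: the net clock offset is ~38 microseconds per day, the figure
# usually quoted for GPS satellites (a gravitational speed-up of ~45 us/day
# minus ~7 us/day of velocity time dilation).

SPEED_OF_LIGHT = 299_792_458          # metres per second
CLOCK_DRIFT_PER_DAY = 38e-6           # seconds of clock error per day

error_metres_per_day = SPEED_OF_LIGHT * CLOCK_DRIFT_PER_DAY
print(f"Ranging error after one day: {error_metres_per_day / 1000:.1f} km")
print(f"Ranging error per hour:      {error_metres_per_day / 24 / 1000:.2f} km")
```

Because position fixes are obtained by timing radio signals, a clock error translates directly into distance at the speed of light: about 11 km of error per day, or roughly half a kilometre every hour – hence ‘inaccurate by kilometres’ within a single day’s operation.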

The second major driver towards the success of classical engineering concerns the nature of the human mind itself. Why have humans uniquely been able to build complex mental models of the world with the power to improve the individual’s chances of survival, of success within the social group and, later, of conceiving solutions to technological problems? 

The evolution of the brain in higher primates and larger predators has been heavily influenced by the fact that they live in highly social communities. Monkeys, apes, the big cats and hunting dogs live in societies where each member of the group is known as an individual with specific skills, traits and a position within the pecking order. In contrast, the social insects behave not as societies of individuals but as a single composite ‘super-organism’ comprising millions of indistinguishable ‘cells’. Within the lion pride or the ape colony, the key skills needed for success within the social group are the abilities to empathise and communicate. It is of immense evolutionary benefit to be able to predict the behaviour of other animals: fellow tribe members, rival groups, predators and prey.

The higher social animals react to problem situations not with simple conditioned responses but with rational behaviour resulting from an inner simulation of possible solutions and the selection of the best perceived outcome. Rather than resorting to a trial-and-error search for the best response – a highly risky strategy in which a fatal choice might be made before the preferred move is found – the cognitive species are able to perform this search in the virtual space of their minds. As Karl Popper says, we can “let our ideas die in our stead”. Animals able to survive long enough to reproduce and pass on this successful trait will have highly developed faculties for visualisation, planning, empathy and communication.

The final key factor in the success of classical engineering is the power of reductionism. Physical systems behave very much as the sum of their parts, allowing a complex design problem to be decomposed into smaller, simpler sub-problems that can be solved in isolation. Provided these component sub-systems interface appropriately, they can be treated as ‘black boxes’ whose internal mechanisms and structure are largely irrelevant to the whole picture. In computing parlance, such sub-systems are said to be ‘loosely coupled’. Again, this is only possible because (1) such components mostly interact with each other in a quasi-linear fashion, and (2) we have the kinds of minds able to visualise a solution decomposed into abstract components and functions. Without the ability to reduce a complex design to sub-problems, all but the simplest applications would be beyond our mental capacity to model in all their internal complexity. How would our ability to build, say, a modern jet airliner be affected if the aerodynamicists modelling the wings needed to take into account the details of the sanitary or cabin-heating system? The difficulty of the design task would be increased by many orders of magnitude without the ability to divide and conquer the problem space.
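In software, the ‘black box’ discipline is expressed through interfaces: a component depends only on what its neighbours promise, never on how they work inside. The sketch below is purely illustrative – the Altimeter interface and both implementations are hypothetical, not taken from any real system – and shows how a control routine can remain ignorant of whether its altitude reading comes from a pressure sensor or a radar echo.

```python
# A hypothetical illustration of loose coupling: the control routine depends
# only on the Altimeter interface, not on any implementation's internals.
from typing import Protocol


class Altimeter(Protocol):
    def altitude_m(self) -> float:
        """Return the current altitude in metres."""
        ...


class BarometricAltimeter:
    """Black box #1: derives altitude from air pressure."""
    def __init__(self, pressure_hpa: float):
        self.pressure_hpa = pressure_hpa

    def altitude_m(self) -> float:
        # Simplified barometric formula; callers never see this detail.
        return 44330.0 * (1.0 - (self.pressure_hpa / 1013.25) ** 0.1903)


class RadarAltimeter:
    """Black box #2: derives altitude from a radio echo delay."""
    def __init__(self, echo_delay_s: float):
        self.echo_delay_s = echo_delay_s

    def altitude_m(self) -> float:
        return 299_792_458 * self.echo_delay_s / 2.0


def hold_altitude(sensor: Altimeter, target_m: float) -> str:
    """Works with any component that satisfies the Altimeter interface."""
    return "climb" if sensor.altitude_m() < target_m else "descend"


print(hold_altitude(BarometricAltimeter(pressure_hpa=900.0), target_m=1000.0))
print(hold_altitude(RadarAltimeter(echo_delay_s=6.0e-6), target_m=1000.0))
```

Either implementation can be redesigned, replaced or tested in isolation without touching the control routine – exactly the property that lets physical engineering divide a design into independent sub-problems.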

Towards a Model of Design Itself

In order to begin to understand why software development encounters fundamentally different difficulties, we must first understand exactly what the process of design entails. What is design? Can we abstract the design process in a way that allows us to understand better how designing a suspension bridge and designing a real-time airline booking system are similar, and yet encounter radically different issues?

Before we can explore the difficulty of computer design, we must first be clear about what we mean by computation and computers, without being side-tracked by the technical minutiae of modern machines. This is the topic of the next chapter.