Big Software

"The grateful moon has granted the city of Lalage a rarer privilege: to grow in lightness" - Italo Calvino

A number of software projects I had the pleasure of working on were what I later came to think of as big software. They had common qualities that led their development teams to work in a certain way, perpetuating these characteristics in a cycle. These qualities should be thought of as umbrella terms: not every big software system has each and every one of them, and none of them is strictly required. In the following, I would like to describe these qualities, and how they relate to each other.

There is an undeniable attractiveness, or at least a maze-like quality, to big software. If one did not have to change it to satisfy clients, keeping the whole edifice running in the process, diving into big software could even be considered a fun and revealing exercise. Navigating the parallel paths, conflicts, bolted-on suburbs and dead ends, one could learn about the individual tendencies and social tensions that led to such a system. But this would be software archeology, fundamentally different from maintaining and extending.

Changing big software is like writing as Philip Roth describes it: "In most professions there's a beginning, a middle, and an end. With writing, it's always beginning again." Every change opens a new can of worms, and closing it is only temporary. The conflicts and tears are discovered only when change is attempted, and the change introduces new ones itself, because it pulls the software in yet another direction, in yet another manner. Reconciling the various demands on the code is impossible, as the chance to do so is perpetually delayed. In professional programming work, there is little more satisfying than refactoring big software with proper (frequently archeological) knowledge, and enjoying the simplifications and the dead code deletions that result.

All sophisticated software is unpredictable, becoming a complex system in the limit. Big software turns complexity into an art form. Any factor (code, operating environment, external systems) can have cascading effects on the behavior of the system. The only way to reliably find out what the system will do is to run it on the live production system with real data, not because production is particularly reliable, but because it is what matters to the users. Even then, the behavior will change from one moment to the next, because of subtle changes in the environment. Effects in big software are nonlocal and disproportionate. A developer can never be certain that what she thinks is the core location of a certain functionality is the only relevant place to look. Simple changes to the environment or code might cause ripple effects.

Delivering big software is a complicated process that depends on many other software components, online resources, and special conditions. The time it takes to deliver a change to users bears no correlation to the size of the change. Since the effects of even minor changes cannot be foreseen, complex testing mechanisms that take a long time to run exercise all of the software, for every feature and regression. Many security checks, themselves complicated because of their target, are built into the delivery mechanism, which makes the build long and fragile. These mechanisms cannot be forgone, however, because they are the last barrier to the application disintegrating on delivery, or at least they are perceived to be. Even if the change to be deployed is tiny, it takes hours, if not days, to deliver, because the baseline for integration and deployment is big.

Once in operation, big software is difficult to observe. In order to navigate its immense complexity, very detailed logs are emitted. Understanding and evaluating these logs becomes a domain of its own, with its own independent logic. There is a fallacy hiding here, similar to that of expecting badly written code to become more comprehensible through comments: the belief that a complex system can be understood through a large amount of logs. The same diffuse, uncoordinated approach is applied to error handling. In order to fight the reliability demons, code is written in a very defensive manner. Default values are used for missing data; errors are caught and handled in different ways in multiple frames. These practices serve to hide errors, in that they are not perceived unless brought up by the clients. In case one of these errors actually manages to surface, its root cause has to be assembled from a number of code locations. As per the Fundamental Failure Mode Theorem, complex systems usually operate in a failure mode; the reflection of this theorem in big software is that big software has ambiguous error conditions. Distinguishing correct functionality from incorrect is difficult.
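A minimal sketch of how this layered defensiveness hides errors, in Python with invented names (none of this is from any particular codebase): every layer supplies a default or swallows the exception, so no single location admits to the failure.

```python
# Hypothetical sketch: each layer defends itself, and together they hide the error.

def fetch_price(product_id, catalog):
    # Layer 1: a missing product silently becomes a zero price.
    return catalog.get(product_id, {}).get("price", 0.0)

def compute_total(order, catalog):
    total = 0.0
    for item in order["items"]:
        try:
            total += fetch_price(item["product_id"], catalog) * item["quantity"]
        except KeyError:
            # Layer 2: a malformed item is skipped, not reported.
            continue
    return total

def handle_request(order, catalog):
    try:
        return {"status": "ok", "total": compute_total(order, catalog)}
    except Exception:
        # Layer 3: any remaining failure is flattened into a generic response.
        return {"status": "ok", "total": 0.0}

# A missing product and a malformed item both pass silently:
catalog = {"p1": {"price": 9.99}}
order = {"items": [{"product_id": "p1", "quantity": 2},
                   {"product_id": "p2", "quantity": 1},   # not in catalog
                   {"quantity": 3}]}                      # no product_id
print(handle_request(order, catalog))  # {'status': 'ok', 'total': 19.98}
```

The client receives a plausible-looking total; when the wrong number is finally questioned, the cause has to be reassembled from all three layers at once.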

For the reasons listed, integrating new code with big software is a royal pain. This leads to a mindset of not undoing work, of letting things chug along as long as there is no urgent reason to rip them out. After a while, it becomes practically impossible to remove things. Thus it is difficult to scale big software up, but scaling down is even more difficult: big cannot scale down. Regarding resources or scope, big software will not accept any limits. This is also the fundamental source of big's fragility. As big grows, the impact necessary to cause a failure becomes smaller relative to its size. As it cannot scale down, however, the impact threshold does not go down, even when the system is doing less, in terms of load or functionality. That is, even when the system is used less, for fewer functions, it will keep breaking as often, and need the same amount of maintenance.

A second-order quality of big software results from the way big systems that work with each other store data. Big software usually interfaces with multiple other big systems. These systems share a lot of representational data, things that are supposed to correspond to a shared reality out in the world. These "facts" frequently diverge from each other, not because they change due to transmission or storage errors, but because differences in representation lead, over time, to differences in content. Take e-commerce, for example. There are no two e-commerce systems in the world that represent an order in even remotely similar ways. Some store consumer data in independent tables, tying these to orders, whereas others store all such data on the order itself. The product information is either line- or item-based, and the primary reference for a product can be one of many different formats. To account for the mismatches between the different storage formats, logic is applied to transform data when it crosses system boundaries. This logic is neither static nor lossless. It changes over time, and data transported from system X to system Y cannot be transported back into its original format. As data is shuffled from one big system to another, their representations of reality become multi-faceted, rich, and correct and incorrect at the same time.
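As a sketch of such a lossy boundary crossing, with invented schemas for two hypothetical systems X and Y: X keeps the consumer in its own table and items as individual lines, while Y inlines the consumer and aggregates quantities per product.

```python
# Hypothetical sketch of a lossy cross-system transform (all schemas invented).

# System X: consumer lives in its own table, items are individual lines.
order_x = {
    "customer_id": "C-42",
    "lines": [
        {"sku": "A1", "qty": 1},
        {"sku": "A1", "qty": 1},   # two separate lines for the same product
        {"sku": "B7", "qty": 3},
    ],
}
customers_x = {"C-42": {"name": "J. Doe", "segment": "retail"}}

def x_to_y(order, customers):
    """Transform an X order into Y's format: consumer data is inlined,
    line structure is collapsed into per-SKU quantities."""
    items = {}
    for line in order["lines"]:
        items[line["sku"]] = items.get(line["sku"], 0) + line["qty"]
    return {
        "customer_name": customers[order["customer_id"]]["name"],
        "items": [{"sku": sku, "qty": qty} for sku, qty in items.items()],
    }

order_y = x_to_y(order_x, customers_x)
# order_y has no customer_id, no segment, and no trace of the two A1 lines.
print(order_y)
# {'customer_name': 'J. Doe', 'items': [{'sku': 'A1', 'qty': 2}, {'sku': 'B7', 'qty': 3}]}
```

A reverse transform would have to invent a consumer record and a line structure that X never expressed in Y; run the data through a few such boundaries and its relation to the original becomes exactly the multi-faceted, correct-and-incorrect picture described above.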

Considering all the negative and weird things that accompany it, the surprising thing about big software is that there is a lot of it running, keeping customers moderately happy. It has also made a decent amount of money for some people. The inescapable conclusion is that big software still gets most of the job done, most of the time, and its clients are happy. For me, as a developer, the more relevant question is why we have to work with such systems. Sometimes one simply has to work on big software. It might be legacy software that has to keep on running, maybe of one's own doing. It is not infrequent that a team outgrows its methods and tools, getting stuck in a system that served as a ladder to climb to a deeper design understanding. There are also cases where a team (or, more frequently in this case, individual developers) has no problem working on big software, despite recognizing the issues. There is a certain joy in working with big software, as alluded to in the beginning. It gives the programmer a sense of working on something big, complex, beyond the capabilities of others. The bug fixes and feature changes are as big as the software itself. Meeting this challenge provides its own satisfaction. What is forgotten in the day-to-day effort, however, is that it is not possible to grow in lightness in such big steps.