Lessons from Legacy

The last two years of my working life have been spent on an e-commerce application that is mainly occupied with coordinating inventory items, orders and shipments. The main user interface of this application is a REST API tied to an Angular frontend. The same API is also used by various middleware applications that sync with other e-commerce applications. Our company has since moved on to a new product, but we still have a customer using this legacy system, so I took on the duty of keeping it running. As I dug deeper into it to make improvements, a couple of decisions we made at the time stood out as having proven suboptimal. It is one thing to read about general software design principles in the abstract, and another to see them demonstrated on a live, growing system in which you have a stake. Here is my attempt at making my observations as concrete as possible.

Don't make a straitjacket out of constraints

When it comes to securing the integrity of your data, PostgreSQL can do wonders, with features such as constraints, triggers, enumerations, and normalization, the sine qua non of relational databases. Add SQLAlchemy to the mix, and anything is possible. For example, with a feature called composite secondary joins, SQLAlchemy allows you to present a join across multiple tables as a field on a model, making complicated normalization schemes possible. On the Python end, you are accessing a simple entity attribute, while in the background PostgreSQL is hopping across foreign keys to make sure that, for example, the total available quantity for a product is the sum of its inventories across different storage cells. This is all well and good as long as you are aware of the trade-offs. If you rely on the database too much, as we did, a number of dangers lurk around the corner. The first is performance. Accessing an instance attribute is a cheap operation in Python, and one does not think twice about it while coding. If there is a complicated join behind such an access, however, you might be creating a time sink that gets deeper as the database grows. The ugly truth about slow database queries is that they rarely get caught in development or testing, because the database has little data and few connections. Production is where database performance is diagnosed, and avoiding that diagnosis in the first place is not a bad idea.
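
To make the trade-off concrete, here is a minimal sketch of the general idea: a model attribute that looks like a plain field in Python but is computed by the database through a join and an aggregate. It uses SQLAlchemy's column_property with a correlated subquery rather than the composite secondary join feature itself, and all model and column names are hypothetical, not taken from our codebase.

    # Hypothetical models: Product.available_quantity looks like a simple
    # attribute, but every load of it runs an aggregate over another table.
    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine, func, select
    from sqlalchemy.orm import Session, column_property, declarative_base

    Base = declarative_base()

    class Inventory(Base):
        __tablename__ = "inventory"
        id = Column(Integer, primary_key=True)
        product_id = Column(Integer, ForeignKey("product.id"), nullable=False)
        storage_cell = Column(String, nullable=False)
        quantity = Column(Integer, nullable=False, default=0)

    class Product(Base):
        __tablename__ = "product"
        id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False)
        # Cheap-looking attribute access in Python, a correlated aggregate
        # query over the inventory table in the database.
        available_quantity = column_property(
            select(func.coalesce(func.sum(Inventory.quantity), 0))
            .where(Inventory.product_id == id)
            .correlate_except(Inventory)
            .scalar_subquery()
        )

    engine = create_engine("sqlite://")  # stand-in for PostgreSQL in this sketch
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        product = Product(name="widget")
        session.add(product)
        session.flush()
        session.add_all([
            Inventory(product_id=product.id, storage_cell="A1", quantity=3),
            Inventory(product_id=product.id, storage_cell="B2", quantity=5),
        ])
        session.commit()
        print(session.get(Product, product.id).available_quantity)  # 8

Whether this indirection is worth it depends entirely on how often the attribute is touched and how large the underlying tables grow.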

More impactful than performance, however, was how difficult our excessive use of constraints made it to change things at the database level. This is a textbook case of tight coupling: we propagated constraints from the application into the database, where they didn't really belong. Generally speaking, leaving room for action on the database gives you the freedom to sidestep otherwise complicated issues by modifying data. When you fixate everything with constraints, enumerations and the like, you give this freedom away. For example, we used enumerations to limit the kinds of connectors (our name for the async jobs that were queued for execution). Since we already had these enumerations, we used them further to denote the sources of orders. I recently found out that this dependency made removing connectors pretty much impossible. I could purge the code for an unused connector, but its name had to stay in the enumeration, because there were orders created by that connector, and their source field carried its name. Another affected area was deployment. Due to our excessive use of triggers, running migrations required dropping and then recreating all triggers. This made deployment longer and more error-prone, something I will touch upon later.
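
A hypothetical sketch of the coupling described above: a single Python Enum backs both the connector registry and the source column on orders, so the database-level enum type has to keep every historical value alive. The names are made up for illustration.

    import enum

    from sqlalchemy import Column, Enum, Integer
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class ConnectorType(enum.Enum):
        AMAZON_SYNC = "amazon_sync"
        EBAY_SYNC = "ebay_sync"
        LEGACY_IMPORT = "legacy_import"  # connector code was deleted long ago...

    class Order(Base):
        __tablename__ = "orders"
        id = Column(Integer, primary_key=True)
        # ...but old rows still carry the value, and PostgreSQL has no
        # ALTER TYPE ... DROP VALUE, so the enum member has to stay.
        source = Column(Enum(ConnectorType, name="connector_type"), nullable=False)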

As general principles, I would venture to extract the following:

  • Anything that changes regularly at the code level should be secured only at that level, and not in the database.

  • Anything that requires locking entire tables on modification is unacceptable as a consistency feature.

Make sure that securing your data does not turn into an exercise in rigidity. DB constraints are fine when used sparingly, but you should always be conscious of the trade-offs you are making.

Have a strategy for growing business logic

Just in case you haven't recognized it yet: Organizing code is difficult. A decent portion of a developer's time goes into figuring out where a certain line of code goes. One serious mistake that we made was distributing business logic around the code base, instead of centralizing it. The logic for routing orders to consumers, for example, was in the module that contained the queued jobs, whereas the code that was responsible for changing the state of orders was in the API module. Among the consequences of disorganized business logic were the following:

  • We frequently had a really hard time deciding on the correct location for changing or adding functionality. Conversely, finding the code responsible for an existing feature was never straightforward.

  • Tests for the various units of code manipulating the same DB model were complicated. Since the logic for state changes and processes was spread all over the place, the DB state had to be created for each test. If we had abstracted the business logic into code that didn't need database objects but could work with plain Python instances, testing would have been easier, faster and less error-prone (see the sketch after this list).

  • Mentally tracking a process was an exercise in compiling code in your head. A module was never really complete without inserting code from some other module in your mind's eye. I would sometimes even temporarily copy-paste code just to make things easier to follow.
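
Here is the kind of abstraction the second point refers to: a minimal, hypothetical sketch of a state transition expressed as plain Python, so that it can be unit tested without creating any database state (pytest is assumed as the test runner).

    from dataclasses import dataclass

    import pytest  # assumed test runner; any framework would do

    class InvalidTransition(Exception):
        pass

    # Allowed state transitions for an order; names are illustrative.
    ALLOWED = {
        "new": {"reserved", "cancelled"},
        "reserved": {"shipped", "cancelled"},
        "shipped": set(),
        "cancelled": set(),
    }

    @dataclass
    class OrderState:
        status: str = "new"

    def transition(order: OrderState, target: str) -> OrderState:
        """Pure business rule: no session, no HTTP request, trivially testable."""
        if target not in ALLOWED[order.status]:
            raise InvalidTransition(f"{order.status} -> {target}")
        order.status = target
        return order

    def test_cannot_ship_unreserved_order():
        # No database fixtures needed to exercise the rule.
        with pytest.raises(InvalidTransition):
            transition(OrderState(), "shipped")

The persistence layer then only has to call transition() and save the result; the rule itself lives in one place and the tests for it run in milliseconds.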

As soon as you start building business logic, you should have a clear idea of how you want to keep it coherent, unified and testable. These orientation points should also keep your code open to discovery and modification.

How you bring code to users is as important as how you build it

Writing the code is half the work of creating a feature. The rest is actually bringing it to the user. This involves testing, merging into the main branch, and deploying to production. There were a lot of things we got right here; we used a very simple integration approach with pull requests merged into master, and integrated testing. There were also some things we got wrong. The first of these was slow testing. This is something most of the industry gets wrong, in my opinion. We as developers have gotten used to test suites that run in tens of minutes, and accept this as a necessary evil of large code bases. Our test suite was also rather long-running; some test runs took more than half an hour to complete. The result was distraction, prolonged deployment times, the occasional skipping of test runs, and overall developer dissatisfaction. A long-running test suite also breaks frequently for reasons unrelated to the code, reasons more to do with infrastructure or annoying race conditions. Such failures have led to the biggest frustrations of my life as a developer. Just as I'm trying to concentrate on a difficult issue, a one-liner commit seemingly breaks the test run. Such things sometimes cost me a whole day, and they usually turned out to be due to unrelated issues like the minor version of a CLI program having changed. Through such breakage and long waiting times, the test suite stops being a tool that gives feedback on correctness. It becomes a nuisance, a bureaucratic process that has to be completed in order to deploy code. If I had a say in a new project, I would put a hard limit of 10 minutes on test runs, and work very hard to keep to that limit.
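
If such a limit is to be more than a good intention, it has to be enforced somewhere. The following is a rough, hypothetical sketch of a CI wrapper that fails the build when the suite exceeds the budget and uses pytest's --durations flag to point at the slowest tests; the 10-minute figure is the budget argued for above.

    # Hypothetical CI wrapper: run the suite, report the slowest tests, and
    # fail the build if the wall-clock time exceeds the budget.
    import subprocess
    import sys
    import time

    BUDGET_SECONDS = 10 * 60

    start = time.monotonic()
    result = subprocess.run([sys.executable, "-m", "pytest", "--durations=10"])
    elapsed = time.monotonic() - start

    if result.returncode != 0:
        sys.exit(result.returncode)
    if elapsed > BUDGET_SECONDS:
        print(f"Test suite took {elapsed:.0f}s, budget is {BUDGET_SECONDS}s")
        sys.exit(1)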

The next step in delivering code is deploying it to production. Again, we weren't doing badly in this area, but it could have been better. Deploying code did not take long, but it was prone to breaking, and the fixes required better devops skills, which not enough people on the team possessed at the time. I know that this is much more often said than done, but I have to join the chorus: having a dead simple and robust deployment process should be the highest priority of a team that runs web applications to earn its daily bread.

Be very, very careful with end-to-end tests

Keeping with the testing theme, one thing we really regretted was going overboard with end-to-end (e2e) tests. These were implemented using the test runner for Angular, and fired up Chrome to click through many frontend features, with the Angular frontend running against an actual backend. The e2e tests looked like a real wonder weapon at the beginning. Look, you can actually run the whole application! Can you get a better guarantee that everything is OK and you can deploy? With this thinking, we wrote e2e tests for every corner case of a given functionality. Once we had accumulated a decent number of such tests, however, they turned out to be rather problematic. We faced two principal difficulties. First, one had to set up the exact environment in the database so that the frontend could find what it was looking for. Second, navigating the interface was painful even with very precise setup, because one had to find the exact selector for each element (buttons, inputs, etc.) to be manipulated. These selectors also broke very easily when the HTML structure changed; adding a single new list to a page meant fixing the selectors in all the tests for that page.

I know that frontend testing is a difficult area, and won't pretend that I have the solution to it. There was one lesson that surfaced in our review discussions, however: It would have been a better idea to abstract away some of the JS and unit test it without the DOM or the backend application. The browser has many complexities that are difficult to control in a completely automated environment, and it is best to avoid these when you can.

Don't leave transparency to your tools

We had to do our fair share of firefighting, which went like this: the customer calls us and tells us that they are getting weird responses from our API, or maybe no responses at all, which puts us into scrambling mode. We go to the logs for details, but our application does not tell us much, so we have to check the logs of the web server, the message queue and the database. From the information they spit out, we piece together what the issue is and deploy a fix. The fact that we had to consult the infrastructure to find out what exactly went wrong was the very reason our customer was calling us in the first place, instead of us being alerted to the issue by our own application.

When I discussed this topic with our current CTO, who has a lot of experience with large systems, she remarked that if you leave the job of reporting on your platform to infrastructure components such as the database or the message queue, developers will go to those components to understand system behavior. This makes debugging and performance monitoring an indirect process, in which you have to interpret data coming from these components to draw conclusions about what is happening in your own code. The more sensible thing to do is build in proper logging and monitoring from the beginning, so that your system is transparent, i.e. it tells you what's going on during both normal and faulty operation.
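
As a minimal sketch of what that transparency can look like at the code level, the following logs the start, the outcome and the duration of an operation, with an identifier that can be correlated across components. The function and field names are illustrative, not from the original codebase.

    import logging
    import time
    import uuid

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
    logger = logging.getLogger("orders")

    def route_order(order_id: int) -> None:
        # A request id makes it possible to correlate lines across components.
        request_id = uuid.uuid4().hex
        logger.info("route_order start order_id=%s request_id=%s", order_id, request_id)
        start = time.monotonic()
        try:
            ...  # the actual routing logic would go here
        except Exception:
            logger.exception("route_order failed order_id=%s request_id=%s", order_id, request_id)
            raise
        logger.info(
            "route_order done order_id=%s request_id=%s duration_ms=%.1f",
            order_id, request_id, (time.monotonic() - start) * 1000,
        )

    route_order(42)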

Focus on your core technology

Every large application relies on at least one component of the infrastructure as its core. Depending on the kind of application, this core can be different. Data-driven web applications nearly always have an RDBMS at their center, and ours was no exception: in our case it was PostgreSQL. PostgreSQL is an incredible piece of technology. It is robust, feature-complete, fast, and somehow still evolving. Despite relying on it so heavily, I have to admit that our knowledge of PostgreSQL was relatively limited, in the sense that we were not using many important features properly. We had proper replication in place, and were using triggers and constraints as explained above, but our use of indexing, for example, was really limited. I discovered this when I took the time to examine the index usage of our database (which I documented in another blog post), and created a number of indexes that improved the most frequent queries. The result was our customer representative telling me that the app was so fast he thought something was wrong. Indexing is obviously not a specialty of PostgreSQL; it is a feature of every RDBMS. It is therefore database theory that we should have tried to understand a bit better, and you can do this much more effectively when you are working on a data-heavy application that should have been performing better yesterday.
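
The kind of check described above can be as simple as asking PostgreSQL's statistics views which tables are scanned sequentially far more often than they are scanned via an index, which is often a hint that an index is missing. This is a hedged sketch with a placeholder connection string and an arbitrary threshold, not the exact queries from the blog post mentioned above.

    from sqlalchemy import create_engine, text

    # Placeholder connection string; point it at your own database.
    engine = create_engine("postgresql+psycopg2://user:password@localhost/appdb")

    # Tables with many sequential scans and few index scans deserve a closer
    # look; the 10x ratio is an arbitrary starting point.
    QUERY = text("""
        SELECT relname, seq_scan, idx_scan
        FROM pg_stat_user_tables
        WHERE seq_scan > 10 * COALESCE(idx_scan, 0)
        ORDER BY seq_scan DESC
    """)

    with engine.connect() as conn:
        for relname, seq_scan, idx_scan in conn.execute(QUERY):
            print(f"{relname}: {seq_scan} seq scans vs {idx_scan} index scans")

    # Once a hot column is identified, the fix is usually a plain index, e.g.
    # CREATE INDEX CONCURRENTLY ix_orders_created_at ON orders (created_at);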

Other core technologies such as message queues or key-value stores all have their tricky corners which, when understood properly, can help you navigate difficult situations and buy you some time when you desperately need it. Most important, though, is understanding the theory common to all such systems, and the common algorithms that exist to navigate their constraints. Take the time to read the documentation, and inform yourself not only about the specifics but also about the general theory.

Nothing beats incrementally improved systems

To finish on a positive note, I should mention that our application has been improving continuously, getting more robust and performant as we keep working on it incrementally, attacking one issue at a time. Continuous attention and incremental improvement beat rewriting a large codebase, provided it is not utter spaghetti code, or written without a sound understanding of best practices and the programming model of the language(s) used. This was one of the biggest advantages of our team: all developers were experienced and diligent students of Python and/or JS, the two languages our application was built with. Every developer strove to write clear, idiomatic Python and JS, and there was constant exchange on how to do the right thing. The end result is a codebase that can still evolve, even with a single developer left working on it.