There is a certain counter-intuitiveness, even a conflict, at the heart of the basic tenets of DevOps. Beyond the basic practices of storing infrastructure as code and automating or scripting as much as possible, DevOps requires organizations to bring code to users as frequently as possible, in small increments. Only by making the delivery process as quick, easy and safe as possible can one create a platform that is reliable, secure and of maximum value to customers. So you have to frequently make modifications to live systems that might break them, and this is supposed to be better than doing the same thing less frequently and much more carefully. This book, written by people who made huge contributions to the DevOps world (among other things, coining the term itself), aims to explain and “sell” this contradiction to tech workers and managers. It is not explicitly stated, but the focus is mostly on big companies that might not think of themselves as purely tech companies. This focus is understandable, as most startups are already on the DevOps bandwagon. Nevertheless, the book goes beyond proselytizing, gathering a lot of DevOps wisdom in one handy volume.
The aforementioned conflict (or a version thereof) is actually a core principle not only of DevOps, but also of lean manufacturing, which served as an inspiration for agile and lean methods. In lean manufacturing, the core conflict is between ensuring on-time shipments to customers and cutting costs. In IT, the core conflict is between delivering fresh code to users and keeping the application platform running securely. Traditionally, the ops team wants to keep everything in place and secure, introducing new checks and conditions for deploying. The app team, on the other hand, wants to constantly change the live code and try things out. The solution to this fundamental conflict is a change of mentality that involves pervasive automated testing, frequent deployments, pervasive telemetry, small and incremental changes, and multiple channels of feedback. The details of this mentality, which I also take to be the core of the DevOps movement, are discussed in the rest of the book.
The core process that has to be optimized using DevOps methods is called the technology value stream, which is defined as “the process required to convert a business hypothesis into a technology-enabled service that delivers value to the customer” (p. 8). Companies that can improve their technology value streams can take hypotheses to the market faster and more reliably, thereby running more experiments than their competition for the same amount of money. These improvements fall into one of “three ways”, a way being a collection of principles around the same topic. The first way, the principles of flow, concerns how work items flow from business to development to ops, and finally to the customer. The aim here is to “make work visible, reduce batch sizes and intervals of work, build in quality by preventing defects from being passed to downstream work centers, and constantly optimize for global goals” (p. 11). The second way, the principles of feedback, concerns the flow of feedback in the opposite direction, back from the customer to business development. The last way, the principles of continual learning and experimentation, aims at the “creation of a generative, high-trust culture that supports a dynamic, disciplined, and scientific approach to experimentation and risk-taking” (p. 12).
The discussions of these three ways are bookended by a number of chapters on “Where to start”, and another group of chapters on information security and compliance. Unfortunately, the “where to start” chapters are mostly about introducing or maintaining DevOps practices in large companies with established ways of doing things. These tips range from the political (how to find allies that support your work, or get people on your side) to the organizational (how to set up teams and organize the workflow) to the tactical. They are not particularly interesting for people who are already used to working in the DevOps way, or for small teams that have to work with, for example, embedded ops engineers anyway. I can understand the need for detailed handling of these topics, but it does make the book a tad unattractive for me.
The “ways” are discussed using clear “commandments”, such as “make infrastructure easier to rebuild than to repair” for the first way, or “enable peer review of changes” for the second way. The discussions of such commandments consist of background information that justifies and explains them, processes built around them, and recommendations for concrete technologies or techniques to use. There are also many case studies where the practices are championed by people who instituted them in their organizations. Concepts that are related to the commandments are also explained, giving the reader ample topics to dig deeper into by just googling. For the experienced DevOps practitioner, there won’t be many surprises regarding the commandments and practices, but there are many eye-opening contexts and comparisons, especially in conjunction with other practices, or even altogether different industries. One example is the Andon cord used in manufacturing to stop the production line when there is a defect in the products or the process. Workers are encouraged to pull it whenever they see fit, after which the line manager and the worker have a minute to fix the issue, or the production line is stopped. After a while, as the production line is improved, the number of cord pulls drops. Instead of being satisfied with the status quo, however, Toyota plant managers tighten the tolerances to increase the cord pulls once more. I found this idea quite fascinating, and would like to put it to use at the first opportunity.
The ops conflict I mentioned in the beginning is also a recurring theme, especially in the chapters on the second way. A version of the base conflict is the question of how to react to production issues which, in hindsight, could have been prevented with more pre-deployment checks. The naive approach would be to introduce these extra checks as a part of the deployment process. Incorrectly applied, however, such checks can actually lead to more errors in the future. This will happen if more friction is added to the deployment process, especially when the decision to deploy is in the hands of people other than those who did the work. The effect of such interventions is that batch sizes grow, leading to riskier deployments, and the engineers get no feedback on their changes from the deployment process. The decisive factor is whether the company culture is “low-trust, command-and-control” or an open, trusting one. Another example of what might be considered a counter-intuitive tendency in some large DevOps teams is the focus on mean time to recovery (MTTR) instead of mean time between failures (MTBF). Simple logic would dictate that companies that need reliable online services would put all their resources into avoiding failure. Failure is simply unavoidable, however, and it is much wiser to put significant effort into diagnosis and recovery.
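To make the MTTR argument a little more concrete, here is a rough sketch of my own (not taken from the book) using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The function name and the figures of 720 and 4 hours are hypothetical and chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 720 hours, 4 hours to recover.
baseline = availability(720, 4)

# Option A: fail half as often (double the MTBF).
double_mtbf = availability(1440, 4)

# Option B: recover twice as fast (halve the MTTR).
half_mttr = availability(720, 2)

print(f"baseline:    {baseline:.4%}")
print(f"double MTBF: {double_mtbf:.4%}")
print(f"half MTTR:   {half_mttr:.4%}")
```

Under this simple model, halving the recovery time buys the same availability as halving the failure rate. Which of the two is cheaper to achieve depends on the system, but the arithmetic shows why investing in diagnosis and recovery is not a second-best option.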
Despite being wordy, which might be expected of a handbook, the DevOps Handbook is worth reading from cover to cover for those who are in responsible positions in IT organizations, especially in large companies. I will definitely be going back to it to refresh my memory on the whys and hows of running highly available services.