In this piece I would like to describe a practice I adopted a few years ago, after seeing how effectively colleagues were applying it. The idea behind what I will call structured debugging is very simple: document, in an interactive environment, every step of the effort to debug a complicated bug, in such a way that your thought process and deductions can be followed and verified. Despite its usefulness, simplicity and effectiveness, I rarely see structured debugging practiced. It has helped me with complex bugs on multiple occasions, especially in distributed applications, and it also produces artefacts that have value of their own, independent of the debugging.
The first step of structured debugging is deciding where to document your progress. If you are happy working with the bug tracker at your disposal, you can document your work as comments on the bug ticket. If you prefer working in your editor, as I do, I would advise you to create a markdown file named after the ID of the bug. You can then switch to this buffer quickly using the bug ID, and once you are done, or want to notify others of your work in progress, you can copy and paste the contents into the ticket. Every half-decent bug tracker accepts markdown input these days; by editing the report in your editor as markdown, you get the best of both worlds: local shortcuts and utilities, plus clickable links and decent formatting for code snippets.
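The editor-based setup can be sketched as a small helper that creates the report file for a given bug ID. This is only a minimal sketch: the section names, the `BUG-1234` style ID and the function names are assumptions made for illustration, not a prescribed layout.

```python
from datetime import date
from pathlib import Path

# Hypothetical section names that mirror a typical debug loop; adapt freely.
SECTIONS = (
    "Symptoms",
    "Information gathered",
    "Conjectures",
    "Tests and results",
    "Conclusion",
)


def report_skeleton(bug_id: str, title: str, started: date) -> str:
    """Return the markdown text of an empty debugging report."""
    lines = [f"# {bug_id}: {title}", "", f"Started: {started.isoformat()}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "_nothing yet_", ""]
    return "\n".join(lines)


def create_report(directory: Path, bug_id: str, title: str, started: date) -> Path:
    """Create <bug_id>.md in the directory unless it already exists."""
    path = directory / f"{bug_id}.md"
    if not path.exists():  # never overwrite a report that is in progress
        path.write_text(report_skeleton(bug_id, title, started), encoding="utf-8")
    return path
```

Calling `create_report(Path("."), "BUG-1234", "Orders stuck in processing", date.today())` would leave a `BUG-1234.md` in the current directory, which the editor can then find by bug ID.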
Once you have picked the documentation environment, proceed systematically, following the standard debug loop: gather information, set up conjectures, test them, repeat. What you want to achieve while iterating over this loop is a reproducible narrative: document each step in such a way that any other developer acquainted with the code base can open the ticket, follow the comments, repeat any commands, and arrive at the same conclusions as you did. When gathering information, it is common to use SQL queries, for example, or even simple scripts that join data from multiple sources. Include all of these in your report, together with the results as they were when you ran them. One nice side effect of presenting this information for others is that you take the time to make the queries as informative and simple as possible, for example by using joins instead of multiple queries in SQL. To gather information from your colleagues, you can tag them in the bug ticket, so that they can write their responses there, enriching the bug hunt.
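Recording a query together with its result at the time it was run can be automated with a few lines. The sketch below uses Python's built-in `sqlite3` module purely so that it is self-contained; the `orders` and `payments` tables, their columns and the helper name `query_to_markdown` are hypothetical, and a real investigation would run against whatever database driver the system uses.

```python
import sqlite3
from datetime import datetime

# Built this way so the example can itself be pasted inside a fenced block.
FENCE = "`" * 3


def query_to_markdown(conn, sql, params, ran_at):
    """Run a query and return the SQL plus its result as a markdown snippet."""
    cursor = conn.execute(sql, params)
    headers = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    lines = [
        f"Ran at {ran_at.isoformat()} with parameters {params!r}:",
        "",
        f"{FENCE}sql",
        sql.strip(),
        FENCE,
        "",
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(str(value) for value in row) + " |" for row in rows]
    lines += ["", f"{len(rows)} row(s)"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical data set: one order whose payment went through.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
    conn.execute("CREATE TABLE payments (order_id INTEGER, state TEXT)")
    conn.execute("INSERT INTO orders VALUES (42, 'processing')")
    conn.execute("INSERT INTO payments VALUES (42, 'captured')")
    # A single join instead of two separate lookups.
    sql = """
        SELECT o.id AS order_id, o.status AS status, p.state AS payment_state
        FROM orders o JOIN payments p ON p.order_id = o.id
        WHERE o.id = ?
    """
    print(query_to_markdown(conn, sql, (42,), datetime.now()))
```

The printed snippet can be pasted into the report as is, so a reader sees both the exact query and the data it returned at that moment, and can rerun it to compare.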
The most relevant source of information when debugging live systems is logs. In the old days, web application logs were either stored as text files on servers, rotated and zipped regularly, or piped to syslog. Both made it difficult to access them in a linkable form. More recently, however, dedicated log analysis systems have become more and more widely used. All of these (or nearly all; CloudWatch doesn't allow linking to a single line) have means of linking to individual lines, time windows or the results of specific queries. Instead of just copy-pasting the relevant log lines, or in addition to that, consider using these links. Readers can open them and try alternative searches. Another important source of information, especially for complicated bugs, is diagrams that explain workflows, relationships or complex constellations. A timeline of events, for example, can be explained with a sequence diagram, which is much better than convoluted text. When the difficulty of the bug warrants it, such diagrams are a great addition to the narrative.
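One way to get such a timeline without drawing it by hand is to generate diagram text from the events you have already collected. The following is a rough sketch that emits Mermaid `sequenceDiagram` syntax; whether your bug tracker renders Mermaid is an assumption to check, and the `Event` type, the component names and the messages are all hypothetical. It also assumes component names without spaces and messages without special characters such as `;` or `#`.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    at: datetime    # when the event was logged
    sender: str     # component that sent the message
    receiver: str   # component that received it
    message: str    # short description taken from the logs


def to_mermaid(events):
    """Render events, ordered by time, as Mermaid sequence diagram text."""
    ordered = sorted(events, key=lambda event: event.at)
    lines = ["sequenceDiagram"]
    for event in ordered:
        # Offsets relative to the first event keep the labels short.
        offset = (event.at - ordered[0].at).total_seconds()
        lines.append(
            f"    {event.sender}->>{event.receiver}: +{offset:.1f}s {event.message}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    events = [
        Event(datetime(2024, 1, 2, 12, 0, 1, 500000), "queue", "worker", "deliver order 42"),
        Event(datetime(2024, 1, 2, 12, 0, 0), "api", "queue", "enqueue order 42"),
    ]
    print(to_mermaid(events))
```

Because the events are sorted before rendering, they can be appended to the list in whatever order they were discovered during the investigation.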
Structured debugging can cause significant extra work, but it has major benefits. Most importantly, the end result will leave little doubt as to whether anything was missed. The conclusions you draw will not be based on conjectures and assumptions, but on concrete data and tooling, open for everyone to read and verify. Furthermore, the artefacts resulting from structured debugging are valuable in their own right. More than once I have seen the tools used in such debugging incorporated into internal products, such as SQL queries turned into internal web pages and reports, or Kibana searches added to dashboards as graphs. In case the bug proves tougher than you thought, or something more important comes up in the meantime, the report will prove invaluable: once the bug is taken up again, you or anyone else can read the report and easily pick up where you left off. Last but not least, this method makes visible to the team what gaps in visibility and diagnostics exist, gaps that make analyzing and linking the whole system harder or incomplete.