Code complexity, modules and DSLs

Any idea, no matter how often iterated, is worth reading if it's stated succinctly enough. On this note, I would like to present my ideas about code size, modularization, testing and domain-specific languages.

In Code's Worst Enemy, Steve Yegge argues that code size is not a feature; it's a liability that you have to fight against. There is even empirical evidence that code size is an indicator of bug count. Unfortunately, keeping line count under a certain limit is not particularly easy, especially in certain languages (such as Java; a 500 kloc project that Yegge wrote alone serves as an example). This is not only because Java is verbose; it is also a cultural thing, where the solutions touted by experts against code bloat (e.g. design patterns) actually lead to even more code.

What weapons do you have against code size? IDEs are one, but they are problematic in their own right, as Yegge argues. IDEs do not limit code size; they help you come to terms with it. Being relatively heavy bundles of software themselves, they have their computational limits; it is not unusual to have a code base too big for Eclipse. So if you're relying on the features of an IDE to navigate a code base that has become orders of magnitude bigger than what you can get to grips with, you're essentially pushing the threshold into the future, delaying the onset of the problem.

Skipping IDEs and similar technical solutions for navigating and refactoring code, what weapons do programmers have in their arsenals to fight code size? First weapon: do not code. Take an open source project that fits your purposes, and either use it as is, or fork it. Forking used to be relatively painful, since open source projects either had their own VCS repositories, or, even if they were on SourceForge, it was extremely difficult to keep your fork in step with the mainline due to the vagaries of Subversion. Now, thanks to the advent of distributed VCSs and hosts for them (Bitbucket, GitHub), it has become much easier to have your fork and keep it up to date too. Example: at work, we needed a CMS for our website, and instead of writing one and having to maintain it, we picked the Django CMS. One more good thing about using open source code is that you get to use any plugins people have developed for it; we had to write nothing but configuration to get various plugins for Django CMS to work with our website.

What if there are no open source projects that fit your needs, and you have to code it yourself? Here comes the second weapon: modularize. If a part of your code base starts looking like it could live without the rest, pull it out, put it in its own package (jar, gem, egg, deb, whatever), and make it installable on its own. In the best case, open source the code; if you hit a sweet spot, people might even use and improve it, which moves you, to a certain degree, back to the first weapon. One of the most important advantages of creating standalone modules is that you can test this code to the last line using coverage tools. Such testing turns an externalized package into a reliable piece of infrastructure, instead of a part of your code base which you have to support.

Both of these steps require that you can actually modularize your code. What are you to do if the essential complexity of your program increases, and you end up with a closely interconnected set of modules which cannot be split along stable interfaces? The problem is so complex that you have to write huge amounts of code to solve it, and your repository keeps getting bigger no matter how succinct your code is and how hard you squeeze to make it smaller. This is where the third weapon comes into play: domain-specific languages. Solve your problems from the bottom up; create syntactic facilities in your language of choice to decrease code size, building tools that give you a better handle on the problems.

ORMs are a good example of DSLs; SQLAlchemy, the popular Python ORM, is in essence a domain-specific language for accessing the capabilities of relational DBs. A concrete example of a DSL that my colleagues recently built at work involved the specification of a set of products using logical operations on individual product dimensions. Let's say you have a business domain with products differing along the dimensions of size, color, and weight, and you want to be able to succinctly specify certain combinations of products. In Python, a few lines of code suffice to create a domain-specific representation of such statements, one that is easy to understand and enables short descriptions of complicated statements using the built-in features of the language:

Using this module, you could do things like the following:
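The listings that belong here are not reproduced in this copy. As a stand-in, here is a minimal sketch of how such a module and its use might look, assuming that products are plain records and that a specification is a predicate built by overloading Python's comparison and bitwise operators. The names Product, Spec, Dimension and select, as well as the sample catalog, are hypothetical; this is not the code referred to above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """Hypothetical product with the three dimensions from the text."""
    size: str
    color: str
    weight: float  # assumed unit: kilograms


class Spec:
    """A predicate over products; combine with & (and), | (or), ~ (not)."""

    def __init__(self, predicate):
        self._predicate = predicate

    def __call__(self, product):
        return self._predicate(product)

    def __and__(self, other):
        return Spec(lambda p: self(p) and other(p))

    def __or__(self, other):
        return Spec(lambda p: self(p) or other(p))

    def __invert__(self):
        return Spec(lambda p: not self(p))


class Dimension:
    """Names a product attribute; comparisons on it build Spec objects."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return Spec(lambda p: getattr(p, self.name) == value)

    def __ne__(self, value):
        return Spec(lambda p: getattr(p, self.name) != value)

    def __lt__(self, value):
        return Spec(lambda p: getattr(p, self.name) < value)

    def __gt__(self, value):
        return Spec(lambda p: getattr(p, self.name) > value)


def select(spec, products):
    """Return the products that satisfy the given specification."""
    return [p for p in products if spec(p)]


# Usage: declare the dimensions once, then write specifications with them.
size, color, weight = Dimension("size"), Dimension("color"), Dimension("weight")

catalog = [
    Product("S", "red", 0.4),
    Product("M", "red", 1.2),
    Product("L", "blue", 2.5),
    Product("L", "red", 3.0),
]

light_red = (color == "red") & (weight < 2.0)
large_not_blue = (size == "L") & ~(color == "blue")

print(select(light_red, catalog))
print(select(light_red | large_not_blue, catalog))
```

The parentheses around each comparison are required, since & and | bind more tightly than == and < in Python. That is one of the prices of borrowing the host language's operators instead of writing a parser.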

To what extent domain-specific languages can be used depends on what programming language you are using, of course. The ability to extend your language to your liking (bottom-up coding) has been a major selling point for Lispy languages: begin by building the language that fits your needs. Modern languages all provide similar capabilities to a certain extent. Why are DSLs so important, and why do I believe that simple solutions like the one above will lead to less code? The reason is that DSLs, when done right, allow you to create compact abstractions without becoming cryptic. And the better your tools fit the domain, the less cognitive load complex solutions cause, leading to easier-to-maintain code and fewer bugs.

Where does testing fit in this picture? When there is a bug that is well hidden, you don't suspect that it lies in one of the system libraries that you use, or in the programming language runtime. You know that these are used in wildly different contexts, so they have been tested thoroughly and can be relied on as infrastructure. Testing allows you to achieve the same thing with your own code. Using coverage tools, you can make sure that a module is thoroughly tested, thereby practically taking it out of your code base and turning it into infrastructure that you can rely on.
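To make this a bit more concrete, here is a small sketch of what testing a module "to the last line" could look like with Python's built-in unittest. The clamp function is a hypothetical stand-in for code pulled out into its own package; the point is only that every branch, including the error path, gets its own test.

```python
import unittest


def clamp(value, low, high):
    """Hypothetical helper standing in for an extracted module's code."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


class ClampTest(unittest.TestCase):
    def test_value_inside_range_is_unchanged(self):
        self.assertEqual(clamp(5, 0, 10), 5)

    def test_value_below_range_is_raised_to_low(self):
        self.assertEqual(clamp(-3, 0, 10), 0)

    def test_value_above_range_is_lowered_to_high(self):
        self.assertEqual(clamp(42, 0, 10), 10)

    def test_inverted_range_is_rejected(self):
        # Without this test, the raise statement would never be executed.
        with self.assertRaises(ValueError):
            clamp(1, 10, 0)
```

The tests run with python -m unittest. Assuming the third-party coverage.py tool is installed, coverage run -m unittest followed by coverage report -m lists any lines the tests never executed.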