Real World Machine Learning


Before machine learning became such a lucrative branch of computer science, there was only a single kind of book on the topic: the theoretical treatise that assumed clean data and presented algorithms as mathematical formulas. It was the reader's task to take care of the dirty business of loading the data from a source, cleaning it up, implementing the algorithms, and then debugging them. There were a couple of books that provided examples in Java, or Matlab if you were lucky, but dealing with the everyday details of data science was usually left as an exercise for the reader. Now that machine learning has become a part of everyday applied work, more effort is going into making these necessary steps accessible to developers. This book aims to fill that gap, showing the reader the tools, tips, and tricks data scientists use in their daily work. It introduces not only the computational tools but also the concepts and, of course, various algorithms. The theoretical load is rather low; only a couple of mathematical formulas appear, and most algorithms are simply compared to each other in terms of the advantages they offer. The fundamental concepts of machine learning are given succinct and very practical definitions, and then demonstrated in the code samples.

The programming language of choice is Python, which is a huge plus for this book. Python is a relatively simple language that can be learned quickly by those who know other languages, and it offers a great collection of libraries for data scientists. Most of the exciting new tools created in the ML space, especially in Deep Learning, are for Python, which makes it a good bet for books like this. Another good decision is to use Jupyter notebooks, which enable developers to write Python code in a literate style, mixing code with text and displaying the output in either textual or graphical form. The only problem with the code samples is that they are sometimes not idiomatic Python. One annoying thing, for example, is that they contain meaningless variable names such as i, t, thr, and tpr, e.g. in the code samples for Chapter 4. Also, all code samples require importing all members from numpy, which leads to collisions with other symbols that are also supposed to be imported without qualification, costing me some debugging time.
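
To illustrate the kind of collision a star import of numpy can cause, here is a minimal sketch. It is not taken from the book's code; the variable names are my own, and it assumes numpy is installed. The built-in sum and numpy's sum read their second positional argument differently, so the same call gives different results depending on which function the bare name sum resolves to. The sketch uses a qualified import, so both functions remain reachable.

```python
import numpy as np

values = [0, 1, 2, 3, 4]

# Built-in sum: the second argument is the start value,
# so -1 is added to the total of the list.
builtin_total = sum(values, -1)

# numpy.sum: the second positional argument is the axis,
# so -1 means "reduce over the last axis".
# After a star import of numpy, a bare sum(...) call
# would resolve to this function instead of the built-in.
numpy_total = np.sum(values, -1)

print("built-in sum:", builtin_total)
print("numpy sum:   ", numpy_total)
```

Importing numpy under a short alias, as above, keeps the origin of every name visible and avoids this class of bug altogether.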

Some chapters are missing important information. For example, the programming task in Chapter 5 involves modelling whether users are interested in event recommendations on an online event site, but I haven’t been able to reach the accuracy claimed in the book no matter what I did. A short search on the book forum revealed that I wasn’t alone in having trouble. It would have been a good idea to put the solution code in the online repository and leave it out of the book, so that readers can take a stab at the task themselves and fall back on the online solution if they get stuck. My hunch is that the data the book points to is no longer the same as when the task was created, leading to a difference in results. This theory could be validated if the solution were available. A similar situation arises in Chapter 6, where data from an open source is used for the tasks, but there is no mention of how to join the various tables to arrive at the form used. The data I downloaded does not resemble what is described in the book (columns are missing), and I would really like to know how the authors actually got it into the form they describe. Keeping in mind that the aim of this book is to explain actual real-world data wrangling, it would have made sense to publish the code that massages the data into a form that is useful for machine learning.