An Introduction to Cython, the Secret Python Extension with Superpowers

Published on 21.02.2019

Cython is one of the best kept secrets of Python. It extends Python in a direction that addresses many of the shortcomings of the language and the platform, such as execution speed, GIL-free concurrency, absence of type checking and not creating an executable. It is a mature tool with a number of widely used packages that are written in it, such as spaCy, uvloop, and significant parts of scikit-learn, Numpy and Pandas. It smoothly hooks into the latter two, giving you access to underlying data structures in a straightforward way. All these superpowers come with the baggage of certain parts of C, however, which makes becoming proficient in Cython a bit steep for those who don’t know C. In this tutorial, I will give an overview of working with Cython, focusing on the parts of C that are relevant.

The requirements for running the code samples in this tutorial are Python 3 (preferably 3.6), a C compiler (GCC or Clang should do), and virtualenv or pipenv. Once you have these, getting Cython is as easy as creating a pipenv or a virtualenv, and within that environment, running pip install Cython.

What is Cython, and what is it not?

Cython is built on the fact that Python the language (or at least the CPython implementation of it) is built on top of a C API that is instrumented through an intermediate language. The way CPython works is by compiling Python code into a bytecode representation, and then executing the result on the virtual machine at runtime. Individual instructions of this bytecode consist of an opcode and a reference to any arguments. You can see the existing opcodes in the file opcode.h. In the main evaluation loop of the runtime, the opcode of an instruction is used to determine what to do next, in the form of one big switch statement. For example, if the opcode is BUILD_LIST, specifying the construction of a new list, the PyList_New function is called with the appropriate arguments. Obviously, there is a great deal of work the Python runtime is doing around this evaluation loop, such as garbage collection and error handling. Another large piece of the runtime work pertains to the dynamic nature of Python, where the actual methods to run have to be figured out at runtime, based on the object structure available. For example, when you try to multiply two numbers, Python has to figure out whether these are floats, integers etc., or the multiplication operator has been overloaded as with the string type. In many other languages, explicit type information helps the compiler figure this out at compile time, leading to faster compiled code. This distinction is known as early vs. late binding, and is the source of one of the major performance gains one can achieve by using Cython.

Cython makes use of the architectural organization of Python by translating (or ‘transpiling’, as it is now called) a Python file into the C equivalent of what the Python runtime would be doing, and compiling this into machine code - this can be a Python extension which can be dynamically loaded, or an actual executable. The resulting module makes calls to the Python runtime in order to deal with things like above mentioned dispatch, which means that straightforward Python code will not be executing much faster than it would anyway. The speed difference becomes significant when you code using Cython-specific constructs that are transpiled directly to their C equivalents, thereby avoiding the Python runtime. We will see what these constructs are in a minute, but let’s start with a simple example to get used to working with the Cython toolchain.

Starting off

The first thing you need to do is to install Cython, obviously. Please refer to the official Cython documentation for the installation instructions. You will also need a C compiler; GCC on Linux and Clang on Mac should do. Once you have have these two, we can start with the usual hello world. Save the following in a file named hello_world.pyx (or simply clone and use the sample code repository):

def say_hello():
    print("Hello world from Cython!")

And then execute the command cython hello_world.pyx. You should end up with a file named hello_world.c. This file can be compiled into a shared library with the following command:

gcc -shared -fPIC `pkg-config --cflags python-3.6m` hello_world.c -o hello_world.so

Your Python extension should now be the ready, in the form of a file named hello_world.so. You should be able to drop into a Python shell in the same directory with this file, import it with a simple import hello_world, and then run hello_world.say_hello(), the output of which should be “Hello world from Cython!”. Voila, your first Cython extension.

Now let’s have a quick look at the hello_world.c file. The file is pretty big, and fortunately, you don’t need to understand any of it to work with Cython. Still, it’s interesting to have a look at what Cython did to your Python code. Go ahead and search for Hello world from Cython in this file (there is a much easier way to compare the generated code with the original, which we will see in a minute). You will see that Cython has generated C functions for the high-level hello_cython and the inner print functions, and also annotated these by marking in comments what they correspond to in the original code. The call to print has been moved into a separate function (called __pyx_pf_11hello_world_hello_cython on my computer). It makes a call to __Pyx_PrintOne, which is a wrapper for the Python print, and deals with various error conditions and return values using C functions and macros.

Easier compilation of Cython extensions

There is a much easier way to turn Cython code into native modules, involving the distutils core module that is responsible for building and installing modules in Python. This will allow us to delegate the trans- & compilation to Cython and distutils, and be able to import the file like a normal Python module. In order to do so, you need to create a setup.py file with the following contents in the same directory as hello_cython.pyx:

from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules = cythonize("*.pyx", annotate=True))

The role of the annotate argument will be explained in a bit. After running the command python setup.py build_ext --inplace to build the extension module, you should be able to import hello_cython and call hello_cython.say_hello(), getting the same result as above. From now on, whenever you make a change to a Cython file, you need to run the above command, which will build all the Cython files that have changed.

Python is valid Cython

As you can see with the hello world example, Cython accepts and processes correctly the Python you are used to writing everyday. It was mentioned above that there is an easy way to view the C code generated by Cython from the input, and also that the annotate argument would be explained later. If you compiled hello_world.pyx using distutils as explained above, you should now see a file named hello_cython.html in the same directory. This file contains a visual display of the Python code Cython has transformed, with the lines that require access to the Python interpreter colored in yellow. The darker the background yellow of a line, the more Python runtime interaction it contains. There is also a plus sign to the left of every yellow-tinged line, expanding to the C code that was generated by Cython as translation. Generally, the aim when converting Python to Cython for purposes of optimization is to decrease the number of yellow lines, or at least lighten the hue of their yellow. This will ensure that your code is as close as possible to a C version, and thus –probably– faster.

The simple things: Variables, Functions, Loops

Now let’s take some Python code that is somewhat slow, turn it into Cython, and make it faster by annotating it with Cython-specific statements, decreasing the amount of yellow in the annotation. We will use this approach to write a fast version of the Sieve of Erastothenes, with which one can find the prime numbers up to a certain limit. Here is an implementation in pure Python:

import math

def sieve(up_to):
    primes = [True for _ in range(up_to+1)]
    primes[0] = primes[1] = False
    upper_limit = int(math.sqrt(up_to))
    for i in range(2, upper_limit+1):
        if not primes[i]:
            continue
        for j in range(2*i, up_to+1, i):
            primes[j] = False
    return [x for x in range(2, up_to+1) if primes[x]]

Let’s find out how fast this code is, by saving it in the file sieve_python.py, and benchmarking it with the handy timeit module, as follows:

python -m timeit "import sieve_python; sieve_python.sieve(200000)"

On my computer, the output is 10 loops, best of 3: 27.9 msec per loop. Admittedly, this does not look like a slow way of computing 17984 primes (computers are freaking fast), but it already gives us a good starting point.

Small note on timeit: The timeit module will run the piece of code you are timing in loops that increase in number of steps, until execution time exceeds 0.2 seconds. So don’t be surprised if different attempts lead to different number of loops. From here on, I will omit the complete output, and report only the best time for each timing run. Also, in the following, benchmarking and timing will be used to mean the same thing, namely measuring the running time of a piece of code.

Small note on benchmarking: Since building the extensions and benchmarking a module is an operation we will repeat frequently in the following, I have added a bash script named benchmark.sh that does this for you. It accepts the name of the module as its single argument, as in ./benchmark.sh sieve_python.

First attempt at Cythonizing

Now let’s copy the contents of sieve_python.py to a Cython file named e.g. sieve_naive.pyx (also available in the sample code repo), build it as an extension, and benchmark it with ./benchmark.sh sieve_naive. On my computer, this leads to an average runtime of 17.2 msec, which is already an improvement over the pure Python version. Still, an improvement of 40% is not really worth our efforts. If we have a look at the annotation file in sieve_naive.html, we can see that there is Python runtime interaction pretty much on every line, except for the line with continue, which is a keyword in C, too. In order to optimize the naive Cython code, we need to convert all of these lines to Cython-specific code that would be translated to pure C. Without further ado, here is the properly cythonized version, available as sieve_cython.pyx in the sample repo (discussion of the changes will follow):

from libc.math cimport sqrt
from libc.stdlib cimport malloc, free

def sieve(up_to):
    cdef bint *primes = _sieve(up_to)
    response = [x for x in range(up_to+1) if primes[x]]
    free(primes)
    return response

cdef bint *_sieve(int up_to):
    cdef int i, j
    cdef bint *primes = <bint *>malloc((up_to+1) * sizeof(bint))
    for i in range(up_to+1):
        primes[i] = 1
    primes[0] = primes[1] = False
    cdef int upper_limit = int(sqrt(up_to))
    for i in range(2, upper_limit+1):
        if not primes[i]:
            continue
        j = 2*i
        while j < up_to + 1:
            primes[j] = False
            j += i
    return primes

When we benchmark this new version with ./benchmark.sh sieve_cython, we achieve a considerable improvement in performance: 4.24 msec, which is 6.5 times faster than the Python version. Now let’s dive into the Cython features we used to achieve this speed-up.

1. Interface separation

Cython offers two fundamental ways of defining functions: With the usual def (Python functions) vs. with cdef (C functions). The common pattern of organizing Cython code is separating the interface and computation functions, adn then writing them as Python and C functions respectively. Python functions have the exact same facilities as in pure Python, and are accessible from the Python runtime. They are mostly responsible for any conversion of arguments & return values to and from C types, in addition to Python-specific things like exception handling. You can nevertheless use type-annotated variables as arguments or within them (such as the cdef int *primes above). C functions, on the other hand, which are declared using the Cython-specific cdef keyword, are transpiled directly to C functions. These functions (from here on called cdef functions) can be used exactly in the way Python ones are declared (optional arguments, Python objects as argument and return types etc.), but these lead to runtime interaction. In order to circumvent this, cdef functions can be annotated with types in the signature and in the variables used. The _sieve function from the above code example, for example, does not accept, return or process any Python objects; all arguments and variables are annotated with C types, as well as the return value.

cdef functions have a couple of oddities you need to consider, however. The default return type of cdef functions is Python object, and they will convert whatever is returned into one. It therefore makes sense to declare some return type to avoid Python runtime interaction. Also, within them, rules of the C world are valid, meaning that integer division and overflow function differently. Especially if your code is doing intensive numeric calculation, you should watch out for these catches.

A look at the annotation file for sieve_cython.html is rather revealing. In sieve_cython.html, you can see pretty clearly that the _sieve function has been completely converted to C, with no yellow lines. The four-line sieve function, on the other hand, is yellow except for the call to free. If you expand the third line of this function (line 6 in the whole file), you can get a very good view of what Cython is doing in the background. The one-line Python call to create a new list from a comprehension on a range calls has led to 58 lines of error handling, resource management and Python interaction. The above mentioned PyList_New, for example, is called to create a new list.

There is a third type of function you can define with cpdef instead of cdef, which is a hybrid between Python and C functions. In the background, Cython will actually define two functions, a Python and a C one. Calls from Python runtime will be directed to the Python function, whereas calls from within Cython-generated code will be directed to the C function. Therefore, you will be getting the benefit of C optimization when your code is called from other C functions, whereas the Python function will still be available as an interface.

2. Type annotations

The type annotations used in Cython are relatively straightforward, especially if you are acquainted with C types. Any C type can be used as a valid type, including pointer types. Variables with C types have to be defined using the cdef keyword, which can be done in both kinds of functions we have seen. One special type used also in sieve_cython.pyx is bint, which is a normal int in C code, but is converted to the Python boolean when necessary. If a variable is defined in the usual Python way without type information, it’s assumed to be a Python object. When you build a Python object out of C types, as on line 9 of sieve_cython.pyx, Cython will do certain kinds of type conversions on the fly, without asking for more information. Similar conversions are possible also the other way around, from Python to C. You can see which conversions are possible, and how they work in the documentation on type conversions.

3. C libraries

Instead of importing and using the Python math module, sieve_cython.pyx uses the C math library, which makes type conversions unnecessary. Cython facilitates this step by providing the C libraries as Python-like imports, as in from libc.math cimport sqrt. The same is valid for the dynamic memory management routines malloc and free, which are discussed next.

4. Memory management

In C-land, memory demands much more of the programmer. The difficulty arises from the interaction of dynamic memory management and the stack vs. heap distinction. Data allocated on the stack is automatically managed, i.e. removed when the reference goes out of scope. If you want to keep a reference to a piece of data after its initial context is gone, however, you will need to allocate it using the notorious malloc, and then return the memory back with free. In our case, we need to keep a reference to the primes list after _sieve is done; this is the reason it is created using malloc, which takes as input the amount of memory that needs to be set aside, and returns a pointer to it. The topics of dynamic memory management and pointers are very C-specific, but they are relatively straightforward, so any decent book on C should provide enough information to set you up for their use in Cython.

5. Loops

Python has numerous facilities for patterns that are handled using a single tool in C, the for loop. In the pure Python implementation of the sieve, we have three loops: A list comprehension (line 7 of sieve_naive.pyx and two for-in loops, all three with a range call. In C, all of these have to be written using a for loop with auxiliary variables and all, and Cython does its best to convert them properly, both the loop and the range call. In the very first case, we don’t really care that much about what the code is converted to, as it’s an interface function, and a list comprehension is mixed with type conversion and range, but I would like to note that the call to range is of a simple form, with a start and an end index. The second call to range (on line 10) is also of the same form. This form can be converted relatively easily into a for loop by Cython. The last use of range, however, involves a step argument (on line 13), which means that it would require Python runtime interaction, which we want to avoid. For this reason, in the Cython-optimized version, this third call, together with the for loop, has been turned into a while loop with explicit loop counter increment to achieve the same functionality.

Defining and Collecting New Types with Cython

The above example is relatively straightforward, since it uses only built-in Python and C types. To delve into more complex use cases, I will use as an example the 8-puzzle that is a simple task used to demonstrate search algorithms. The 8-puzzle involves integers from 1 to 8 placed on a grid, with one empty cell. Moves can be made by sliding one number vertically or horizontally to the empty cell. The board is considered “solved” if the numbers are in order, and the last cell is the empty one. In the sample repository, you can see the solutions I already wrote in Python and C. Both require a start state as the single argument, for which there is a sample in the file state.txt. In both implementations, state of the puzzle board is represented as a 2-dimensional array of integers, with 0 representing the empty cell. Breadth-first search is used to find a sequence of moves that solves the puzzle, printing this sequence in reverse. The set of board states that were already seen is represented as a trie, as membership in this set is checked frequently, and must be approriately fast.

Let’s start off with benchmarking. As we are comparing an executable with a Python script in this step, the timeit module won’t cut it; we will have to resort to a more general solution. The most comprehensive Linux tool for this purpose is perf, but I should warn you that on Ubuntu it depends on kernel tools specific to the kernel version, which caused quite some headache for me, so proceed at your own risk. With that out of the way, here is how to benchmark the python script:

    perf stat -r 10 python eight.py start.txt > /dev/null

On the last line, I can see that it took 0.201 seconds on average on my computer. I also gave pypy a try for completeness’ sake, simply replacing python with pypy in the above command, which resulted in 0.360 seconds on average, surprisingly slower than Python itself. The C code delivers the goodies, however: When benchmarked, the average runtime with the same input is 8 msec, nearly 30 times faster than Python. The question is now whether we can get close to it using Cython.

First iteration of eight puzzle in Cython

As with the previous example, we will start by simply copying the Python solution to eight_cython.pyx, and using it as a starting point. In order to get this file compiled into an executable, we need to use the cython CLI command instead of the extension building mechanism, and pass it an extra argument, as in cython eight_cython.pyx --embed. Don’t forget adding --annotate to the mix if you want the annotation file. The resulting eight_cython.c file can be compiled into an executable with the following command (given that you have Python 3.6 and the relevant development package installed):

gcc `pkg-config --cflags python-3.6m` eight_cython.c -lpython3.6m -o eight_cython

When benchmarked, this file already leads to a significant improvement, clocking at 200 msec, 23 msec less than letting the Python runtime execute the script. There is a lot to be done to speed this up, however, which I went ahead and did already. You can see the results in the file eight_cython.pyx. This version is considerably faster: With an average runtime of 43 msec, it is five times faster than the pure Python version. There were two major improvements that were used to achieve this speedup, explained in the following.

1. Class vs Struct

The main data structure in the Python version is the State class that wraps the integer array and provides the necessary methods for generating children, checking final status etc. Cython can definitely deal with this class, as we saw in the naive attempt, but it will be approximately as slow as the Python version, so we have to convert it to something else. We have two options what this something can be: C structs and extension types. The former is the plain old C struct that can be declared in Cython using the cdef struct construct. Here is an example from eight_cython.pyx:

cdef struct BoardPosition:
    int row
    int column

Extension types, on the other hand, are a Python-runtime related construct. Their instances are proper Python objects, managed by the Python garbage collector, and they can be subclassed in the usual way by other extension types or Python classes. You cannot add new attributes at runtime, however, as these are stored directly on the C struct that represents an instance object. The attributes are also not accessible to other classes or functions unless explicitly declared to be so. The way special methods of extension types function also differs in subtle ways; you should check in the list whether your expectations are met before you implement one. In eight_cython.pyx, the State class is declared as an extension type using the cdef keyword, and it has the attributes board, _zero_index and parent:

cdef class State:
    cdef int **board
    cdef BoardPosition _zero_index
    cdef public State parent

As you can see, the board 2-dimensional array is declared exactly the way in C, but the reference to the parent State is not a pointer, as it would have been in C. Cython manages references to Python objects (extension types or normal classes) as pointers in the background; you don’t have to declare them as such. The State extension type has methods defined with either def or cdef; the same conditions as above are valid, but you cannot declare any special methods (those wrapped in double underscores) with cdef. Also, it is not possible to turn cdef methods into properties with the usual @property decorator (as would have been convenient with the zero_index method).

One special method of State, __cinit__ is worth special attention. This method is called once when an extension type is allocated, and it is the place where code that allocates any further C data structures belongs. In our case, the board array is allocated here. There is a corresponding __dealloc__ special method where you can free resources allocated in __cinit__. This is also implemented in State for completeness sake.

2. C arrays versus lists

In the State.children method, we need to collect structures for representing the two following things:

Possible ways of swapping positions on the eight board, depending on where the empty cell is (as BoardPosition)
Child states that result from these swaps (as State).

When it comes to storing such structures, the simplest option is using the tried and proven Python list. Conveniently, basic Python collection types (list, dict, tuple and set) can be used as a type in cdef functions. The problem with the list structure, however, is that it leads to Python runtime interaction, and is accordingly slow. Therefore, in performance-criticial situations, it is advisable to use C arrays instead. It is the bread and butter of C programming to allocate arrays of structs and iterate over these in every which way possible, and it is not any more difficult in Cython to do so; you can see how it is done with the array of BoardPosiion structs in the State.children method. The situation is more complicated with the child states, however, as these are represented with the State extension type. Since extension types are managed by the Python garbage collector, and they can be referred to only as pointers to Python objects, their storage in arrays as C pointers is rather complicated, involving casts and calls into the reference counting mechanism of Python. For this reason, in order not to complicate the code too much, I opted for the simpler list type in the first iteration. There will later be a more detailed discussion of storing pointers to extension types in arrays.

3. Using DEF to declare a constant

Cython allows C-style constants with the DEF directive. As with C, any values defined this way are replaced within the code at compile time. you have to be careful, though, as only the basic types int, long, float, bytes and unicode can be declared as constants. The number of rows and columns on the eight board is declared in eight_cython.pyx as a constant with DEF SIZE = 3. Cython also allows conditional compilation with the IF directive.

Profiling Cython

Comparing the performance of the C version (9 msec) with eight_cython.pyx (43 msec), we can see that there is still room for improvement to reach C level performance. This is also obvious in the annotation file, which is deep yellow in many parts. To find out which bottleneck to tackle next, however, the annotation file is not enough, as it shows us only how much Python runtime interaction a line of code causes, and not how much it contributes to the total runtime. Profiling is what we need, and fortunately, Cython makes this extremely easy. You need to only add the following compiler directive to the top of a pyx file to make Cython generate profiling data which can be processed by the standard Python profilers:

# cython: profile=True

Another difference to normal operation is that you need to run the code under profiling as a module, and not from the command line, which means that the code has to be compiled as a module with the usual python setup.py build_ext --inplace. Now we are ready to profile our code by starting a Python shell and running the following:

import cProfile
import eight_cython
cProfile.run("eight_cython.main('start.txt')", sort='time')

And here is the output I have received:

     403088 function calls in 0.097 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    33453    0.018    0.000    0.022    0.000 eight_cython.pyx:59(add_or_get_child)
    44935    0.017    0.000    0.021    0.000 eight_cython.pyx:67(get_child)
    1    0.009    0.009    0.097    0.097 eight_cython.pyx:202(search)
     3717    0.008    0.000    0.012    0.000 eight_cython.pyx:163(children)
     6413    0.008    0.000    0.037    0.000 eight_cython.pyx:88(contains)
    44935    0.008    0.000    0.029    0.000 eight_cython.pyx:67(get_child (wrapper))
[..snip..]

The runtime increased a little due to the profiling overhead, but we now have ample information on which functions are causing the most execution overhead, namely the get_child and add_or_get_child methods of the TrieNode extension class. This is not surprising, as this code, used for checking whether a state has already been seen, is guaranteed to be executed for every evaluated state. On top of that, it uses the Python list type, leading to huge overhead in interaction with the runtime. This can also be seen in the annotation file in which the two TrieNode methods are a very deep shade of yellow. If we want to speed up our code, we need to tackle the way the board trie is built and used; we will now see how.

Replacing extension types with C structs

The difficulty we are facing in the optimization of the TrieNode class is that its methods need to return lists of TrieNode instances. As already mentioned, interaction with the Python list class requires a lot of runtime interaction, so the way to speed up list iteration is to get rid of it, and use C arrays instead. There are two ways of going about this. The first is turning TrieNode into a C struct, and storing pointers to instances of it in the array. This is the path we will take, and we will have a look at the alternative, using pointers to Python objects, later. The file eight_cython_improved.pyx contains the implementation where TrieNode is now a struct, defined as follows:

cdef struct TrieNode:
    int value
    TrieNode **children

Code that used to be methods on the TrieNode cdef class is now packed into individual functions, all prefixed with trie, and accepting a pointer to a TrieNode as the first argument. This is a common pattern of organizing C code, somewhat similar to object oriented coding. If you compare this new trie code with the implementation in eight.c, you can see that the Cython version is pretty much a line-by-line translation, except for the semicolons, and with an extra cast of the malloc return value, which Cython needs for typechecking. Another C pattern that is used in both eight.c and this improved Cython version is marking the end of a list with a NULL value. The end of the list of children of a TrieNode is marked with NULL, a keyword in Cython that is synonymous with its meaning in C.

This improved version benchmarks at 33 msec, shaving off 10 more msec from the previous version. Is there any other big win we can score, as with the trie optimization? Once more, the answer lies in profiling. Here is the result of profiling eight_cython_improved.pyx:

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.009    0.009    0.054    0.054 eight_cython_improved.pyx:206(search)
  3717    0.008    0.000    0.013    0.000 eight_cython_improved.pyx:167(children)
  6413    0.006    0.000    0.011    0.000 eight_cython_improved.pyx:93(trie_contains_board)
 44935    0.005    0.000    0.005    0.000 eight_cython_improved.pyx:77(trie_get_child)
 33453    0.005    0.000    0.009    0.000 eight_cython_improved.pyx:64(trie_add_or_get_child)
  3717    0.005    0.000    0.013    0.000 eight_cython_improved.pyx:86(trie_add_board)

There are two more bottlenecks that promise significant improvements: the search function, and the children method of State. These two are connected, however, in that the children method returns a list, and the search function iterates over its result. According to the annotation file, this iteration is the most intensive interaction with the Python runtime in search. We might achieve further speedup by doing the same kind of refactoring we did with the TrieNode structure, namely by turning State into a struct and its methods into functions that accept a State as an argument. The template is already available in eight.c, but instead of this easy fix, I wanted to give the alternative a try, namely using a C array to collect pointers to Python objects. The result is in the file eight_cython_improved.pyx. As you can see there, a ChildSet struct is necessary to return the number of children and the actual pointers to the children:

cdef struct ChildSet:
    PyObject **children
    int count

The PyObject is the C struct that Python uses to keep track of objects; you can use it in Cython to refer to arbitrary objects. It is your responsibility, however, to manage reference counts when you create a reference to an object, as with the pointer in the child_set.children array on line 198. This is the reason for the call to Py_XINCREF on line 196. PyObject, Py_XINCREF and Py_XDECREF are imported from the same package with the following line at the top of the file:

from cpython.ref cimport PyObject, Py_XINCREF, Py_XDECREF

Py_XDECREF is later used to decrement the reference count, on line 233. Another thing explicitly done in a number of places is casting between the objects and pointers. When we are adding the child State object into the array on line 198, we are casting it to a PyObject *. Later, when this entity needs to be accessed as a State object again, it is cast back again on line 231. Building and timing the version with object pointers, I get an average runtime of 32 msec, which is the same with the version with extension types in Python lists. Apparently, the load of the casts and reference counting balance out the wins from interacting with the list, leading to zero improvement.

Going Further

The material presented here covers only the basics of Cython. There are numerous other features, explained in detail in the official documentation and some other resources, that allow Cython to interact with other C or Python libraries, generate better optimized code, or let you tune the interaction with the Python runtime to your needs. Here are some resources that can help you get further in general Cython or in specific areas that interest you:

The Basic Tutorial in cython documentation is another good starting point, but it leaves off at a point where confusions arise if you try to write useful code.
The Language Basics page is worth going through before embarking on any significant Cython project, as it touches on all Cython features, with links to further documentation.
Typed Memoryviews allow Cython code to interact with memory buffers of uniform data types, such as Numpy arrays or built-in Python array types. This feature is based on the buffer protocol, the C-level infrastructure that lays out the groundwork for shared data buffers in Python.
Cython also allows for easy and GIL-free parallelism using OpenMP with the cython.parallel package.
The book Cython - A Guide for Python Programmers is an in-depth discussion of Cython, with all the ins and outs and corner cases. I would highly recommend it if you are going to work extensively with Cython.
The Day of the EXE Is Upon Us is an excellent talk given by Brandon Rhodes at PyCon 2014. Among others, it touches upon the complicated distinctions between interpreted and compiled code (surprise: even x86 assembly is interpreted), why Python is slow, and why Cython is incredible.