An Introduction to Cython, the Secret Python Extension with Superpowers
Published on
Cython is one of the best kept secrets of Python. It extends Python in a direction that addresses many of the shortcomings of the language and the platform, such as execution speed, GIL-free concurrency, absence of type checking and not creating an executable. It is a mature tool with a number of widely used packages that are written in it, such as spaCy, uvloop, and significant parts of scikit-learn, Numpy and Pandas. It smoothly hooks into the latter two, giving you access to underlying data structures in a straightforward way. All these superpowers come with the baggage of certain parts of C, however, which makes becoming proficient in Cython a bit steep for those who don’t know C. In this tutorial, I will give an overview of working with Cython, focusing on the parts of C that are relevant.
The requirements for running the code samples in this tutorial are Python 3
(preferably 3.6), a C compiler (GCC or Clang should do), and virtualenv or
pipenv. Once you have these, getting Cython is as easy as creating a pipenv or a
virtualenv, and within that environment, running pip install Cython
.
What is Cython, and what is it not?
Cython is built on the fact that Python the language (or at least the CPython
implementation of it) is built on top of a C API that is instrumented through an
intermediate language. The way CPython works is by compiling Python code into a
bytecode
representation, and then executing the result on the virtual machine at runtime.
Individual instructions of this bytecode consist of an opcode and a reference to
any arguments. You can see the existing opcodes in the file
opcode.h. In
the main evaluation loop of the runtime, the opcode of an instruction is used to
determine what to do next, in the form of one big switch
statement.
For example, if the opcode is BUILD_LIST
, specifying the construction of a new
list, the PyList_New
function is
called with
the appropriate arguments. Obviously, there is a great deal of work the Python
runtime is doing around this evaluation loop, such as garbage collection and
error handling. Another large piece of the runtime work pertains to the dynamic
nature of Python, where the actual methods to run have to be figured out at
runtime, based on the object structure available. For example, when you try to
multiply two numbers, Python has to figure out whether these are floats,
integers etc., or the multiplication operator has been overloaded as with the
string type. In many other languages, explicit type information helps the
compiler figure this out at compile time, leading to faster compiled code. This
distinction is known as early vs. late
binding,
and is the source of one of the major performance gains one can achieve by using
Cython.
Cython makes use of the architectural organization of Python by translating (or ‘transpiling’, as it is now called) a Python file into the C equivalent of what the Python runtime would be doing, and compiling this into machine code - this can be a Python extension which can be dynamically loaded, or an actual executable. The resulting module makes calls to the Python runtime in order to deal with things like above mentioned dispatch, which means that straightforward Python code will not be executing much faster than it would anyway. The speed difference becomes significant when you code using Cython-specific constructs that are transpiled directly to their C equivalents, thereby avoiding the Python runtime. We will see what these constructs are in a minute, but let’s start with a simple example to get used to working with the Cython toolchain.
Starting off
The first thing you need to do is to install Cython, obviously. Please refer to
the official Cython
documentation
for the installation instructions. You will also need a C compiler; GCC on Linux
and Clang on Mac should do. Once you have have these two, we can start with the
usual hello world. Save the following in a file named hello_world.pyx
(or
simply clone and use the sample code
repository):
def say_hello():
print("Hello world from Cython!")
And then execute the command cython hello_world.pyx
. You should end up with a
file named hello_world.c
. This file can be compiled into a shared library with
the following command:
gcc -shared -fPIC `pkg-config --cflags python-3.6m` hello_world.c -o hello_world.so
Your Python extension should now be the ready, in the form of a file named
hello_world.so
. You should be able to drop into a Python shell in the same
directory with this file, import it with a simple import hello_world
, and then
run hello_world.say_hello()
, the output of which should be “Hello world from
Cython!”. Voila, your first Cython extension.
Now let’s have a quick look at the hello_world.c
file. The file is pretty big,
and fortunately, you don’t need to understand any of it to work with Cython.
Still, it’s interesting to have a look at what Cython did to your Python code.
Go ahead and search for Hello world from Cython
in this file (there is a much
easier way to compare the generated code with the original, which we will see in
a minute). You will see that Cython has generated C functions for the high-level
hello_cython
and the inner print
functions, and also annotated these by
marking in comments what they correspond to in the original code. The call to
print
has been moved into a separate function (called
__pyx_pf_11hello_world_hello_cython
on my computer). It makes a call to
__Pyx_PrintOne
, which is a wrapper for the Python print
, and deals with
various error conditions and return values using C functions and macros.
Easier compilation of Cython extensions
There is a much easier way to turn Cython code into native modules, involving
the distutils core module that is responsible for building and installing
modules in Python. This will allow us to delegate the trans- & compilation to
Cython and distutils, and be able to import the file like a normal Python
module. In order to do so, you need to create a setup.py
file with the
following contents in the same directory as hello_cython.pyx
:
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules = cythonize("*.pyx", annotate=True))
The role of the annotate
argument will be explained in a bit. After running
the command python setup.py build_ext --inplace
to build the extension module,
you should be able to import hello_cython
and call hello_cython.say_hello()
,
getting the same result as above. From now on, whenever you make a change to a
Cython file, you need to run the above command, which will build all the Cython
files that have changed.
Python is valid Cython
As you can see with the hello world example, Cython accepts and processes
correctly the Python you are used to writing everyday. It was mentioned above
that there is an easy way to view the C code generated by Cython from the input,
and also that the annotate
argument would be explained later. If you compiled
hello_world.pyx
using distutils as explained above, you should now see a file
named hello_cython.html
in the same directory. This file contains a visual
display of the Python code Cython has transformed, with the lines that require
access to the Python interpreter colored in yellow. The darker the background
yellow of a line, the more Python runtime interaction it contains. There is also
a plus sign to the left of every yellow-tinged line, expanding to the C code
that was generated by Cython as translation. Generally, the aim when converting
Python to Cython for purposes of optimization is to decrease the number of
yellow lines, or at least lighten the hue of their yellow. This will ensure that
your code is as close as possible to a C version, and thus
–probably– faster.
The simple things: Variables, Functions, Loops
Now let’s take some Python code that is somewhat slow, turn it into Cython, and make it faster by annotating it with Cython-specific statements, decreasing the amount of yellow in the annotation. We will use this approach to write a fast version of the Sieve of Erastothenes, with which one can find the prime numbers up to a certain limit. Here is an implementation in pure Python:
import math
def sieve(up_to):
primes = [True for _ in range(up_to+1)]
primes[0] = primes[1] = False
upper_limit = int(math.sqrt(up_to))
for i in range(2, upper_limit+1):
if not primes[i]:
continue
for j in range(2*i, up_to+1, i):
primes[j] = False
return [x for x in range(2, up_to+1) if primes[x]]
Let’s find out how fast this code is, by saving it in the file
sieve_python.py
, and benchmarking it with the handy timeit
module, as
follows:
python -m timeit "import sieve_python; sieve_python.sieve(200000)"
On my computer, the output is 10 loops, best of 3: 27.9 msec per loop
.
Admittedly, this does not look like a slow way of computing 17984 primes
(computers are freaking fast), but it already gives us a good starting point.
Small note on timeit: The timeit module will run the piece of code you are timing in loops that increase in number of steps, until execution time exceeds 0.2 seconds. So don’t be surprised if different attempts lead to different number of loops. From here on, I will omit the complete output, and report only the best time for each timing run. Also, in the following, benchmarking and timing will be used to mean the same thing, namely measuring the running time of a piece of code.
Small note on benchmarking: Since building the extensions and benchmarking a
module is an operation we will repeat frequently in the following, I have added
a bash script named benchmark.sh
that does this for you. It accepts the name
of the module as its single argument, as in ./benchmark.sh sieve_python
.
First attempt at Cythonizing
Now let’s copy the contents of sieve_python.py
to a Cython file named e.g.
sieve_naive.pyx
(also available in the sample code repo), build it as an
extension, and benchmark it with ./benchmark.sh sieve_naive
. On my computer,
this leads to an average runtime of 17.2 msec, which is already an improvement
over the pure Python version. Still, an improvement of 40% is not really worth
our efforts. If we have a look at the annotation file in sieve_naive.html
, we
can see that there is Python runtime interaction pretty much on every line,
except for the line with continue
, which is a keyword in C, too. In order to
optimize the naive Cython code, we need to convert all of these lines to
Cython-specific code that would be translated to pure C. Without further ado,
here is the properly cythonized version, available as sieve_cython.pyx
in the
sample repo (discussion of the changes will follow):
from libc.math cimport sqrt
from libc.stdlib cimport malloc, free
def sieve(up_to):
cdef bint *primes = _sieve(up_to)
response = [x for x in range(up_to+1) if primes[x]]
free(primes)
return response
cdef bint *_sieve(int up_to):
cdef int i, j
cdef bint *primes = <bint *>malloc((up_to+1) * sizeof(bint))
for i in range(up_to+1):
primes[i] = 1
primes[0] = primes[1] = False
cdef int upper_limit = int(sqrt(up_to))
for i in range(2, upper_limit+1):
if not primes[i]:
continue
j = 2*i
while j < up_to + 1:
primes[j] = False
j += i
return primes
When we benchmark this new version with ./benchmark.sh sieve_cython
, we
achieve a considerable improvement in performance: 4.24 msec, which is 6.5 times
faster than the Python version. Now let’s dive into the Cython features we used
to achieve this speed-up.
1. Interface separation
Cython offers two fundamental ways of defining
functions:
With the usual def (Python functions) vs. with cdef (C functions). The common
pattern of organizing Cython code is separating the interface and computation
functions, adn then writing them as Python and C functions respectively. Python
functions have the exact same facilities as in pure Python, and are accessible
from the Python runtime. They are mostly responsible for any conversion of
arguments & return values to and from C types, in addition to Python-specific
things like exception handling. You can nevertheless use type-annotated
variables as arguments or within them (such as the cdef int *primes
above). C
functions, on the other hand, which are declared using the Cython-specific
cdef
keyword, are transpiled directly to C functions. These functions (from
here on called cdef functions) can be used exactly in the way Python ones are
declared (optional arguments, Python objects as argument and return types etc.),
but these lead to runtime interaction. In order to circumvent this, cdef
functions can be annotated with types in the signature and in the variables
used. The _sieve
function from the above code example, for example, does not
accept, return or process any Python objects; all arguments and variables are
annotated with C types, as well as the return value.
cdef functions have a couple of oddities you need to consider, however. The default return type of cdef functions is Python object, and they will convert whatever is returned into one. It therefore makes sense to declare some return type to avoid Python runtime interaction. Also, within them, rules of the C world are valid, meaning that integer division and overflow function differently. Especially if your code is doing intensive numeric calculation, you should watch out for these catches.
A look at the annotation file for sieve_cython.html
is rather revealing. In
sieve_cython.html
, you can see pretty clearly that the _sieve
function has
been completely converted to C, with no yellow lines. The four-line sieve
function, on the other hand, is yellow except for the call to free
. If you
expand the third line of this function (line 6 in the whole file), you can get a
very good view of what Cython is doing in the background. The one-line Python
call to create a new list from a comprehension on a range calls has led to 58
lines of error handling, resource management and Python interaction. The above
mentioned PyList_New
, for example, is called to create a new list.
There is a third type of function you can define with cpdef
instead of cdef
,
which is a hybrid between Python and C functions. In the background, Cython will
actually define two functions, a Python and a C one. Calls from Python runtime
will be directed to the Python function, whereas calls from within
Cython-generated code will be directed to the C function. Therefore, you will be
getting the benefit of C optimization when your code is called from other C
functions, whereas the Python function will still be available as an interface.
2. Type annotations
The type annotations used in Cython are relatively
straightforward,
especially if you are acquainted with C types. Any C type can be used as a valid
type, including pointer types. Variables with C types have to be defined using
the cdef
keyword, which can be done in both kinds of functions we have seen.
One special type used also in sieve_cython.pyx
is bint
, which is a normal
int
in C code, but is converted to the Python boolean when necessary. If a
variable is defined in the usual Python way without type information, it’s
assumed to be a Python object. When you build a Python object out of C types, as
on line 9 of sieve_cython.pyx
, Cython will do certain kinds of type
conversions on the fly, without asking for more information. Similar conversions
are possible also the other way around, from Python to C. You can see which
conversions are possible, and how they work in the documentation on type
conversions.
3. C libraries
Instead of importing and using the Python math module, sieve_cython.pyx
uses
the C math library, which
makes type conversions unnecessary. Cython facilitates this step by providing
the C libraries as Python-like imports, as in from libc.math cimport sqrt
. The
same is valid for the dynamic memory management routines malloc and free, which
are discussed next.
4. Memory management
In C-land, memory demands much more of the programmer. The difficulty arises
from the interaction of dynamic memory
management and the
stack vs. heap
distinction.
Data allocated on the stack is automatically managed, i.e. removed when the
reference goes out of scope. If you want to keep a reference to a piece of data
after its initial context is gone, however, you will need to allocate it using
the notorious malloc
, and then return the memory back with free
. In our
case, we need to keep a reference to the primes
list after _sieve
is done;
this is the reason it is created using malloc
, which takes as input the amount
of memory that needs to be set aside, and returns a pointer to it. The topics of
dynamic memory management and pointers are very C-specific, but they are
relatively straightforward, so any decent book on C should provide enough
information to set you up for their use in Cython.
5. Loops
Python has numerous facilities for patterns that are handled using a single tool
in C, the for loop. In the pure Python implementation of the sieve, we have
three loops: A list comprehension (line 7 of sieve_naive.pyx
and two for-in
loops, all three with a range
call. In C, all of these have to be written
using a for loop with auxiliary variables and all, and Cython does its best to
convert them properly, both the loop and the range
call. In the very first
case, we don’t really care that much about what the code is converted to, as
it’s an interface function, and a list comprehension is mixed with type
conversion and range, but I would like to note that the call to range
is of a
simple form, with a start and an end index. The second call to range
(on line
10) is also of the same form. This form can be converted relatively
easily
into a for loop by Cython. The last use of range, however, involves a step
argument (on line 13), which means that it would require Python runtime
interaction, which we want to avoid. For this reason, in the Cython-optimized
version, this third call, together with the for loop, has been turned into a
while loop with explicit loop counter increment to achieve the same
functionality.
Defining and Collecting New Types with Cython
The above example is relatively straightforward, since it uses only built-in
Python and C types. To delve into more complex use cases, I will use as an
example the 8-puzzle that is a simple task used to demonstrate search
algorithms. The 8-puzzle involves integers from 1 to 8 placed on a grid, with
one empty cell. Moves can be made by sliding one number vertically or
horizontally to the empty cell. The board is considered “solved” if the numbers
are in order, and the last cell is the empty one. In the sample repository, you
can see the solutions I already wrote in Python and C. Both require a start
state as the single argument, for which there is a sample in the file
state.txt
. In both implementations, state of the puzzle board is represented
as a 2-dimensional array of integers, with 0 representing the empty cell.
Breadth-first search is used to find a sequence of moves that solves the puzzle,
printing this sequence in reverse. The set of board states that were already
seen is represented as a trie, as membership in this set is checked frequently,
and must be approriately fast.
Let’s start off with benchmarking. As we are comparing an executable with a
Python script in this step, the timeit
module won’t cut it; we will have to
resort to a more general solution. The most comprehensive Linux tool for this
purpose is perf
, but I should warn you that on Ubuntu it depends on kernel
tools specific to the kernel version, which caused quite some headache for me,
so proceed at your own risk. With that out of the way, here is how to benchmark
the python script:
perf stat -r 10 python eight.py start.txt > /dev/null
On the last line, I can see that it took 0.201 seconds on average on my
computer. I also gave pypy a try for completeness’ sake, simply replacing
python
with pypy
in the above command, which resulted in 0.360 seconds on
average, surprisingly slower than Python itself. The C code delivers the
goodies, however: When benchmarked, the average runtime with the same input is 8
msec, nearly 30 times faster than Python. The question is now whether we can get
close to it using Cython.
First iteration of eight puzzle in Cython
As with the previous example, we will start by simply copying the Python
solution to eight_cython.pyx
, and using it as a starting point. In order to
get this file compiled into an executable, we need to use the cython
CLI
command instead of the extension building mechanism, and pass it an extra
argument, as in cython eight_cython.pyx --embed
. Don’t forget adding
--annotate
to the mix if you want the annotation file. The resulting
eight_cython.c
file can be compiled into an executable with the following
command (given that you have Python 3.6 and the relevant development package
installed):
gcc `pkg-config --cflags python-3.6m` eight_cython.c -lpython3.6m -o eight_cython
When benchmarked, this file already leads to a significant improvement, clocking
at 200 msec, 23 msec less than letting the Python runtime execute the script.
There is a lot to be done to speed this up, however, which I went ahead and did
already. You can see the results in the file eight_cython.pyx
. This version is
considerably faster: With an average runtime of 43 msec, it is five times faster
than the pure Python version. There were two major improvements that were used
to achieve this speedup, explained in the following.
1. Class vs Struct
The main data structure in the Python version is the State
class that wraps
the integer array and provides the necessary methods for generating children,
checking final status etc. Cython can definitely deal with this class, as we saw
in the naive attempt, but it will be approximately as slow as the Python
version, so we have to convert it to something else. We have two options what
this something can be: C structs and extension types. The former is the plain
old C struct that can be declared in Cython using the cdef struct
construct.
Here is an example from eight_cython.pyx
:
cdef struct BoardPosition:
int row
int column
Extension
types,
on the other hand, are a Python-runtime related construct. Their instances are
proper Python objects, managed by the Python garbage collector, and they can be
subclassed in the usual way by other extension types or Python classes. You
cannot add new attributes at runtime, however, as these are stored directly on
the C struct that represents an instance object. The attributes are also not
accessible to other classes or functions unless explicitly declared to be so.
The way special methods of extension types function also differs in subtle
ways;
you should check in the list whether your expectations are met before you
implement one. In eight_cython.pyx
, the State
class is declared as an
extension type using the cdef
keyword, and it has the attributes board
,
_zero_index
and parent
:
cdef class State:
cdef int **board
cdef BoardPosition _zero_index
cdef public State parent
As you can see, the board
2-dimensional array is declared exactly the way in
C, but the reference to the parent State
is not a pointer, as it would have
been in C. Cython manages references to Python objects (extension types or
normal classes) as pointers in the background; you don’t have to declare them as
such. The State
extension type has methods defined with either def
or
cdef
; the same conditions as above are valid, but you cannot declare any
special methods (those wrapped in double underscores) with cdef
. Also, it is
not possible to turn cdef
methods into properties with the usual @property
decorator (as would have been convenient with the zero_index
method).
One special method of State
, __cinit__
is worth special attention. This
method is called once when an extension type is allocated, and it is the place
where code that allocates any further C data structures belongs. In our case,
the board
array is allocated here. There is a corresponding __dealloc__
special method where you can free resources allocated in __cinit__
. This is
also implemented in State
for completeness sake.
2. C arrays versus lists
In the State.children
method, we need to collect structures for representing
the two following things:
-
Possible ways of swapping positions on the eight board, depending on where the empty cell is (as
BoardPosition
) -
Child states that result from these swaps (as
State
).
When it comes to storing such structures, the simplest option is using the tried
and proven Python list. Conveniently, basic Python collection types (list, dict,
tuple and set) can be used as a type in cdef
functions. The problem with the
list structure, however, is that it leads to Python runtime interaction, and is
accordingly slow. Therefore, in performance-criticial situations, it is
advisable to use C arrays instead. It is the bread and butter of C programming
to allocate arrays of structs and iterate over these in every which way
possible, and it is not any more difficult in Cython to do so; you can see how
it is done with the array of BoardPosiion
structs in the State.children
method. The situation is more complicated with the child states, however, as
these are represented with the State
extension type. Since extension types are
managed by the Python garbage collector, and they can be referred to only as
pointers to Python objects, their storage in arrays as C pointers is rather
complicated, involving casts and calls into the reference counting mechanism of
Python. For this reason, in order not to complicate the code too much, I opted
for the simpler list type in the first iteration. There will later be a more
detailed discussion of storing pointers to extension types in arrays.
3. Using DEF to declare a constant
Cython allows C-style constants with the DEF
directive.
As with C, any values defined this way are replaced within the code at compile
time. you have to be careful, though, as only the basic types int, long, float,
bytes and unicode can be declared as constants. The number of rows and columns
on the eight board is declared in eight_cython.pyx
as a constant with DEF
SIZE = 3
. Cython also allows conditional compilation with the IF
directive.
Profiling Cython
Comparing the performance of the C version (9 msec) with eight_cython.pyx
(43
msec), we can see that there is still room for improvement to reach C level
performance. This is also obvious in the annotation file, which is deep yellow
in many parts. To find out which bottleneck to tackle next, however, the
annotation file is not enough, as it shows us only how much Python runtime
interaction a line of code causes, and not how much it contributes to the total
runtime. Profiling is what we need, and fortunately, Cython makes this extremely
easy. You need to only add the following compiler directive to the top of a pyx
file to make Cython generate profiling data which can be processed by the
standard Python profilers:
# cython: profile=True
Another difference to normal operation is that you need to run the code under
profiling as a module, and not from the command line, which means that the code
has to be compiled as a module with the usual python setup.py build_ext
--inplace
. Now we are ready to profile our code by starting a Python shell and
running the following:
import cProfile
import eight_cython
cProfile.run("eight_cython.main('start.txt')", sort='time')
And here is the output I have received:
403088 function calls in 0.097 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
33453 0.018 0.000 0.022 0.000 eight_cython.pyx:59(add_or_get_child)
44935 0.017 0.000 0.021 0.000 eight_cython.pyx:67(get_child)
1 0.009 0.009 0.097 0.097 eight_cython.pyx:202(search)
3717 0.008 0.000 0.012 0.000 eight_cython.pyx:163(children)
6413 0.008 0.000 0.037 0.000 eight_cython.pyx:88(contains)
44935 0.008 0.000 0.029 0.000 eight_cython.pyx:67(get_child (wrapper))
[..snip..]
The runtime increased a little due to the profiling overhead, but we now have
ample information on which functions are causing the most execution overhead,
namely the get_child
and add_or_get_child
methods of the TrieNode
extension class. This is not surprising, as this code, used for checking whether
a state has already been seen, is guaranteed to be executed for every evaluated
state. On top of that, it uses the Python list
type, leading to huge overhead
in interaction with the runtime. This can also be seen in the annotation file in
which the two TrieNode
methods are a very deep shade of yellow. If we want to
speed up our code, we need to tackle the way the board trie is built and used;
we will now see how.
Replacing extension types with C structs
The difficulty we are facing in the optimization of the TrieNode
class is that
its methods need to return lists of TrieNode
instances. As already mentioned,
interaction with the Python list class requires a lot of runtime interaction, so
the way to speed up list iteration is to get rid of it, and use C arrays
instead. There are two ways of going about this. The first is turning TrieNode
into a C struct, and storing pointers to instances of it in the array. This is
the path we will take, and we will have a look at the alternative, using
pointers to Python objects, later. The file eight_cython_improved.pyx
contains
the implementation where TrieNode
is now a struct, defined as follows:
cdef struct TrieNode:
int value
TrieNode **children
Code that used to be methods on the TrieNode
cdef class is now packed into
individual functions, all prefixed with trie
, and accepting a pointer to a
TrieNode
as the first argument. This is a common pattern of organizing C code,
somewhat similar to object oriented coding. If you compare this new trie code
with the implementation in eight.c
, you can see that the Cython version is
pretty much a line-by-line translation, except for the semicolons, and with an
extra cast of the malloc return value, which Cython needs for typechecking.
Another C pattern that is used in both eight.c
and this improved Cython
version is marking the end of a list with a NULL
value. The end of the list of
children of a TrieNode
is marked with NULL
, a keyword in Cython that is
synonymous with its meaning in C.
This improved version benchmarks at 33 msec, shaving off 10 more msec from the
previous version. Is there any other big win we can score, as with the trie
optimization? Once more, the answer lies in profiling. Here is the result of
profiling eight_cython_improved.pyx
:
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.009 0.009 0.054 0.054 eight_cython_improved.pyx:206(search)
3717 0.008 0.000 0.013 0.000 eight_cython_improved.pyx:167(children)
6413 0.006 0.000 0.011 0.000 eight_cython_improved.pyx:93(trie_contains_board)
44935 0.005 0.000 0.005 0.000 eight_cython_improved.pyx:77(trie_get_child)
33453 0.005 0.000 0.009 0.000 eight_cython_improved.pyx:64(trie_add_or_get_child)
3717 0.005 0.000 0.013 0.000 eight_cython_improved.pyx:86(trie_add_board)
There are two more bottlenecks that promise significant improvements: the
search
function, and the children
method of State
. These two are
connected, however, in that the children
method returns a list, and the
search
function iterates over its result. According to the annotation file,
this iteration is the most intensive interaction with the Python runtime in
search
. We might achieve further speedup by doing the same kind of refactoring
we did with the TrieNode
structure, namely by turning State
into a struct
and its methods into functions that accept a State
as an argument. The
template is already available in eight.c
, but instead of this easy fix, I
wanted to give the alternative a try, namely using a C array to collect pointers
to Python objects. The result is in the file eight_cython_improved.pyx
. As you
can see there, a ChildSet
struct is necessary to return the number of children
and the actual pointers to the children:
cdef struct ChildSet:
PyObject **children
int count
The
PyObject
is the C struct that Python uses to keep track of objects; you can use it in
Cython to refer to arbitrary objects. It is your responsibility, however, to
manage reference counts when you create a reference to an object, as with the
pointer in the child_set.children
array on line 198. This is the reason for
the call to Py_XINCREF
on line 196. PyObject
, Py_XINCREF
and Py_XDECREF
are imported from the same package with the following line at the top of the
file:
from cpython.ref cimport PyObject, Py_XINCREF, Py_XDECREF
Py_XDECREF
is later used to decrement the reference count, on line 233.
Another thing explicitly done in a number of places is casting between the
objects and pointers. When we are adding the child State
object into the array
on line 198, we are casting it to a PyObject *
. Later, when this entity needs
to be accessed as a State
object again, it is cast back again on line 231.
Building and timing the version with object pointers, I get an average runtime
of 32 msec, which is the same with the version with extension types in Python
lists. Apparently, the load of the casts and reference counting balance out the
wins from interacting with the list, leading to zero improvement.
Going Further
The material presented here covers only the basics of Cython. There are numerous other features, explained in detail in the official documentation and some other resources, that allow Cython to interact with other C or Python libraries, generate better optimized code, or let you tune the interaction with the Python runtime to your needs. Here are some resources that can help you get further in general Cython or in specific areas that interest you:
-
The Basic Tutorial in cython documentation is another good starting point, but it leaves off at a point where confusions arise if you try to write useful code.
-
The Language Basics page is worth going through before embarking on any significant Cython project, as it touches on all Cython features, with links to further documentation.
-
Typed Memoryviews allow Cython code to interact with memory buffers of uniform data types, such as Numpy arrays or built-in Python array types. This feature is based on the buffer protocol, the C-level infrastructure that lays out the groundwork for shared data buffers in Python.
-
Cython also allows for easy and GIL-free parallelism using OpenMP with the cython.parallel package.
-
The book Cython - A Guide for Python Programmers is an in-depth discussion of Cython, with all the ins and outs and corner cases. I would highly recommend it if you are going to work extensively with Cython.
-
The Day of the EXE Is Upon Us is an excellent talk given by Brandon Rhodes at PyCon 2014. Among others, it touches upon the complicated distinctions between interpreted and compiled code (surprise: even x86 assembly is interpreted), why Python is slow, and why Cython is incredible.