Cython is one of the best kept secrets of Python. It extends Python in a direction that addresses many of the shortcomings of the language and the platform, such as execution speed, GIL-free concurrency, absence of type checking and not creating an executable. It is a mature tool with a number of widely used packages that are written in it, such as spaCy, uvloop, and significant parts of scikit-learn, Numpy and Pandas. It smoothly hooks into the latter two, giving you access to underlying data structures in a straightforward way. All these superpowers come with the baggage of certain parts of C, however, which makes becoming proficient in Cython a bit steep for those who don't know C. In this tutorial, I will give an overview of working with Cython, focusing on the parts of C that are relevant.
The requirements for running the code samples in this tutorial are Python 3 (preferably 3.6), a C compiler (GCC or Clang should do), and virtualenv or pipenv. Once you have these, getting Cython is as easy as creating a pipenv or a virtualenv, and within that environment, running
pip install Cython.
What is Cython, and what is it not?
Cython is built on the fact that Python the language (or at least the CPython implementation of it) is built on top of a C API that is instrumented through an intermediate language. The way CPython works is by compiling Python code into a bytecode representation, and then executing the result on the virtual machine at runtime. Individual instructions of this bytecode consist of an opcode and a reference to any arguments. You can see the existing opcodes in the file opcode.h. In the main evaluation loop of the runtime, the opcode of an instruction is used to determine what to do next, in the form of one big switch statement. For example, if the opcode is
BUILD_LIST, specifying the construction of a new list, the
PyList_New function is called with the appropriate arguments. Obviously, there is a great deal of work the Python runtime is doing around this evaluation loop, such as garbage collection and error handling. Another large piece of the runtime work pertains to the dynamic nature of Python, where the actual methods to run have to be figured out at runtime, based on the object structure available. For example, when you try to multiply two numbers, Python has to figure out whether these are floats, integers etc., or the multiplication operator has been overloaded as with the string type. In many other languages, explicit type information helps the compiler figure this out at compile time, leading to faster compiled code. This distinction is known as early vs. late binding, and is the source of one of the major performance gains one can achieve by using Cython.
Cython makes use of the architectural organization of Python by translating (or 'transpiling', as it is now called) a Python file into the C equivalent of what the Python runtime would be doing, and compiling this into machine code - this can be a Python extension which can be dynamically loaded, or an actual executable. The resulting module makes calls to the Python runtime in order to deal with things like above mentioned dispatch, which means that straightforward Python code will not be executing much faster than it would anyway. The speed difference becomes significant when you code using Cython-specific constructs that are transpiled directly to their C equivalents, thereby avoiding the Python runtime. We will see what these constructs are in a minute, but let's start with a simple example to get used to working with the Cython toolchain.
The first thing you need to do is to install Cython, obviously. Please refer to the official Cython documentation for the installation instructions. You will also need a C compiler; GCC on Linux and Clang on Mac should do. Once you have have these two, we can start with the usual hello world. Save the following in a file named
hello_world.pyx (or simply clone and use the sample code repository):
def say_hello(): print("Hello world from Cython!")
And then execute the command
cython hello_world.pyx. You should end up with a file named
hello_world.c. This file can be compiled into a shared library with the following command:
gcc -shared -fPIC `pkg-config --cflags python-3.6m` hello_world.c -o hello_world.so
Your Python extension should now be the ready, in the form of a file named
hello_world.so. You should be able to drop into a Python shell in the same directory with this file, import it with a simple
import hello_world, and then run
hello_world.say_hello(), the output of which should be "Hello world from Cython!". Voila, your first Cython extension.
Now let's have a quick look at the
hello_world.c file. The file is pretty big, and fortunately, you don't need to understand any of it to work with Cython. Still, it's interesting to have a look at what Cython did to your Python code. Go ahead and search for
Hello world from Cython in this file (there is a much easier way to compare the generated code with the original, which we will see in a minute). You will see that Cython has generated C functions for the high-level
hello_cython and the inner
__pyx_pf_11hello_world_hello_cython on my computer). It makes a call to
__Pyx_PrintOne, which is a wrapper for the Python
Easier compilation of Cython extensions
There is a much easier way to turn Cython code into native modules, involving the distutils core module that is responsible for building and installing modules in Python. This will allow us to delegate the trans- & compilation to Cython and distutils, and be able to import the file like a normal Python module. In order to do so, you need to create a
setup.py file with the following contents in the same directory as
from distutils.core import setup from Cython.Build import cythonize setup(ext_modules = cythonize("*.pyx", annotate=True))
The role of the
annotate argument will be explained in a bit. After running the command
python setup.py build_ext --inplace to build the extension module, you should be able to import
hello_cython and call
hello_cython.say_hello(), getting the same result as above. From now on, whenever you make a change to a Cython file, you need to run the above command, which will build all the Cython files that have changed.
Python is valid Cython
As you can see with the hello world example, Cython accepts and processes correctly the Python you are used to writing everyday. It was mentioned above that there is an easy way to view the C code generated by Cython from the input, and also that the
annotate argument would be explained later. If you compiled
hello_world.pyx using distutils as explained above, you should now see a file named
hello_cython.html in the same directory. This file contains a visual display of the Python code Cython has transformed, with the lines that require access to the Python interpreter colored in yellow. The darker the background yellow of a line, the more Python runtime interaction it contains. There is also a plus sign to the left of every yellow-tinged line, expanding to the C code that was generated by Cython as translation. Generally, the aim when converting Python to Cython for purposes of optimization is to decrease the number of yellow lines, or at least lighten the hue of their yellow. This will ensure that your code is as close as possible to a C version, and thus –probably– faster.
The simple things: Variables, Functions, Loops
Now let's take some Python code that is somewhat slow, turn it into Cython, and make it faster by annotating it with Cython-specific statements, decreasing the amount of yellow in the annotation. We will use this approach to write a fast version of the Sieve of Erastothenes, with which one can find the prime numbers up to a certain limit. Here is an implementation in pure Python:
import math def sieve(up_to): primes = [True for _ in range(up_to+1)] primes = primes = False upper_limit = int(math.sqrt(up_to)) for i in range(2, upper_limit+1): if not primes[i]: continue for j in range(2*i, up_to+1, i): primes[j] = False return [x for x in range(2, up_to+1) if primes[x]]
Let's find out how fast this code is, by saving it in the file
sieve_python.py, and benchmarking it with the handy
timeit module, as follows:
python -m timeit "import sieve_python; sieve_python.sieve(200000)"
On my computer, the output is
10 loops, best of 3: 27.9 msec per loop. Admittedly, this does not look like a slow way of computing 17984 primes (computers are freaking fast), but it already gives us a good starting point.
Small note on timeit: The timeit module will run the piece of code you are timing in loops that increase in number of steps, until execution time exceeds 0.2 seconds. So don't be surprised if different attempts lead to different number of loops. From here on, I will omit the complete output, and report only the best time for each timing run. Also, in the following, benchmarking and timing will be used to mean the same thing, namely measuring the running time of a piece of code.
Small note on benchmarking: Since building the extensions and benchmarking a module is an operation we will repeat frequently in the following, I have added a bash script named
benchmark.sh that does this for you. It accepts the name of the module as its single argument, as in
First attempt at Cythonizing
Now let's copy the contents of
sieve_python.py to a Cython file named e.g.
sieve_naive.pyx (also available in the sample code repo), build it as an extension, and benchmark it with
./benchmark.sh sieve_naive. On my computer, this leads to an average runtime of 17.2 msec, which is already an improvement over the pure Python version. Still, an improvement of 40% is not really worth our efforts. If we have a look at the annotation file in
sieve_naive.html, we can see that there is Python runtime interaction pretty much on every line, except for the line with
continue, which is a keyword in C, too. In order to optimize the naive Cython code, we need to convert all of these lines to Cython-specific code that would be translated to pure C. Without further ado, here is the properly cythonized version, available as
sieve_cython.pyx in the sample repo (discussion of the changes will follow):
from libc.math cimport sqrt from libc.stdlib cimport malloc, free def sieve(up_to): cdef bint *primes = _sieve(up_to) response = [x for x in range(up_to+1) if primes[x]] free(primes) return response cdef bint *_sieve(int up_to): cdef int i, j cdef bint *primes = <bint *>malloc((up_to+1) * sizeof(bint)) for i in range(up_to+1): primes[i] = 1 primes = primes = False cdef int upper_limit = int(sqrt(up_to)) for i in range(2, upper_limit+1): if not primes[i]: continue j = 2*i while j < up_to + 1: primes[j] = False j += i return primes
When we benchmark this new version with
./benchmark.sh sieve_cython, we achieve a considerable improvement in performance: 4.24 msec, 6.5 faster than the Python version. The Cython features we used to achieve this speed-up start right now.
1. Interface separation
Cython offers two fundamental ways of defining functions: With the usual def (Python functions) vs. with cdef (C functions). The common pattern of organizing Cython code is separating the interface and computation functions, writing them as Python and C functions respectively. Python functions have the exact same facilities as in pure Python, and are accessible from the Python runtime. They are mostly responsible for any conversion of arguments & return values to and from C types, in addition to Python-specific things like exception handling. You can nevertheless use type-annotated variables as arguments or within them (such as the
cdef int *primes above). C functions, on the other hand, which are declared using the Cython-specific
cdef keyword, are transpiled directly to C functions. These functions (from here on called cdef functions) can be used exactly in the way Python ones are declared (optional arguments, Python objects as argument and return types etc.), but these lead to runtime interaction. In order to circumvent this, cdef functions can be annotated with types in the signature and in the variables used. The
_sieve function from the above code example, for example, does not accept, return or process any Python objects; all arguments and variables are annotated with C types, as well as the return value.
cdef functions have a couple of oddities you need to consider, however. The default return type of cdef functions is Python object, and they will convert whatever is returned into one. It therefore makes sense to declare some return type to avoid Python runtime interaction. Also, within them, rules of the C world are valid, meaning that integer division and overflow function differently. Especially if your code is doing intensive numeric calculation, you should watch out for these catches.
A look at the annotation file for
sieve_cython.html is rather revealing. In
sieve_cython.html, you can see pretty clearly that the
_sieve function has been completely converted to C, with no yellow lines. The four-line
sieve function, on the other hand, is yellow except for the call to
free. If you expand the third line of this function (line 6 in the whole file), you can get a very good view of what Cython is doing in the background. The one-line Python call to create a new list from a comprehension on a range calls has led to 58 lines of error handling, resource management and Python interaction. The above mentioned
PyList_New, for example, is called to create a new list.
There is a third type of function you can define with
cpdef instead of
cdef, which is a hybrid between Python and C functions. In the background, Cython will actually define two functions, a Python and a C one. Calls from Python runtime will be directed to the Python function, whereas calls from within Cython-generated code will be directed to the C function. Therefore, you will be getting the benefit of C optimization when your code is called from other C functions, whereas the Python function will still be available as an interface.
2. Type annotations
The type annotations used in Cython are relatively straightforward, especially if you are acquainted with C types. Any C type can be used as a valid type, including pointer types. Variables with C types have to be defined using the
cdef keyword, which can be done in both kinds of functions we have seen. One special type used also in
bint, which is a normal
int in C code, but is converted to the Python boolean when necessary. If a variable is defined in the usual Python way without type information, it's assumed to be a Python object. When you build a Python object out of C types, as on line 9 of
sieve_cython.pyx, Cython will do certain kinds of type conversions on the fly, without asking for more information. Similar conversions are possible also the other way around, from Python to C. You can see which conversions are possible, and how they work in the documentation on type conversions.
3. C libraries
Instead of importing and using the Python math module,
sieve_cython.pyx uses the C math library, which makes type conversions unnecessary. Cython facilitates this step by providing the C libraries as Python-like imports, as in
from libc.math cimport sqrt. The same is valid for the dynamic memory management routines malloc and free, which are discussed next.
4. Memory management
In C-land, memory demands much more of the programmer. The difficulty arises from the interaction of dynamic memory management and the stack vs. heap distinction. Data allocated on the stack is automatically managed, i.e. removed when the reference goes out of scope. If you want to keep a reference to a piece of data after its initial context is gone, however, you will need to allocate it using the notorious
malloc, and then return the memory back with
free. In our case, we need to keep a reference to the
primes list after
_sieve is done; this is the reason it is created using
malloc, which takes as input the amount of memory that needs to be set aside, and returns a pointer to it. The topics of dynamic memory management and pointers are very C-specific, but they are relatively straightforward, so any decent book on C should provide enough information to set you up for their use in Cython.
Python has numerous facilities for patterns that are handled using a single tool in C, the for loop. In the pure Python implementation of the sieve, we have three loops: A list comprehension (line 7 of
sieve_naive.pyx and two for-in loops, all three with a
range call. In C, all of these have to be written using a for loop with auxiliary variables and all, and Cython does its best to convert them properly, both the loop and the
range call. In the very first case, we don't really care that much about what the code is converted to, as it's an interface function, and a list comprehension is mixed with type conversion and range, but I would like to note that the call to
range is of a simple form, with a start and an end index. The second call to
range (on line 10) is also of the same form. This form can be converted relatively easily into a for loop by Cython. The last use of range, however, involves a step argument (on line 13), which means that it would require Python runtime interaction, which we want to avoid. For this reason, in the Cython-optimized version, this third call, together with the for loop, has been turned into a while loop with explicit loop counter increment to achieve the same functionality.
Defining and Collecting New Types with Cython
The above example is relatively straightforward, since it uses only built-in
Python and C types. To delve into more complex use cases, I will use as an
example the 8-puzzle that is a simple task used to demonstrate search
algorithms. The 8-puzzle involves integers from 1 to 8 placed on a grid, with
one empty cell. Moves can be made by sliding one number vertically or
horizontally to the empty cell. The board is considered "solved" if the numbers
are in order, and the last cell is the empty one. In the sample repository, you
can see the solutions I already wrote in Python and C. Both require a start
state as the single argument, for which there is a sample in the file
state.txt. In both implementations, state of the puzzle board is represented
as a 2-dimensional array of integers, with 0 representing the empty cell.
Breadth-first search is used to find a sequence of moves that solves the puzzle,
printing this sequence in reverse. The set of board states that were already
seen is represented as a trie, as membership in this set is checked frequently,
and must be approriately fast.
Let's start off with benchmarking. As we are comparing an executable with a
Python script in this step, the
timeit module won't cut it; we will have to
resort to a more general solution. The most comprehensive Linux tool for this
perf, but I should warn you that on Ubuntu it depends on kernel
tools specific to the kernel version, which caused quite some headache for me,
so proceed at your own risk. With that out of the way, here is how to benchmark
the python script:
perf stat -r 10 python eight.py start.txt > /dev/null
On the last line, I can see that it took 0.201 seconds on average on my
computer. I also gave pypy a try for completeness' sake, simply replacing
pypy in the above command, which resulted in 0.360 seconds on
average, surprisingly slower than Python itself. The C code delivers the
goodies, however: When benchmarked, the average runtime with the same input is 8
msec, nearly 30 times faster than Python. The question is now whether we can get
close to it using Cython.
First iteration of eight puzzle in Cython
As with the previous example, we will start by simply copying the Python solution to
eight_cython.pyx, and using it as a starting point. In order to get this file compiled into an executable, we need to use the
cython CLI command instead of the extension building mechanism, and pass it an extra argument, as in
cython eight_cython.pyx --embed. Don't forget adding
--annotate to the mix if you want the annotation file. The resulting
eight_cython.c file can be compiled into an executable with the following command (given that you have Python 3.6 and the relevant development package installed):
gcc `pkg-config --cflags python-3.6m` eight_cython.c -lpython3.6m -o eight_cython
When benchmarked, this file already leads to a significant improvement, clocking at 200 msec, 23 msec less than letting the Python runtime execute the script. There is a lot to be done to speed this up, however, which I went ahead and did already. You can see the results in the file
eight_cython.pyx. This version is considerably faster: With an average runtime of 43 msec, it is five times faster than the pure Python version. There were two major improvements that were used to achieve this speedup, explained in the following.
1. Class vs Struct
The main data structure in the Python version is the
State class that wraps the integer array and provides the necessary methods for generating children, checking final status etc. Cython can definitely deal with this class, as we saw in the naive attempt, but it will be approximately as slow as the Python version, so we have to convert it to something else. We have two options what this something can be: C structs and extension types. The former is the plain old C struct that can be declared in Cython using the
cdef struct construct. Here is an example from
cdef struct BoardPosition: int row int column
Extension types, on the other hand, are a Python-runtime related construct. Their instances are proper Python objects, managed by the Python garbage collector, and they can be subclassed in the usual way by other extension types or Python classes. You cannot add new attributes at runtime, however, as these are stored directly on the C struct that represents an instance object. The attributes are also not accessible to other classes or functions unless explicitly declared to be so. The way special methods of extension types function also differs in subtle ways; you should check in the list whether your expectations are met before you implement one. In
State class is declared as an extension type using the
cdef keyword, and it has the attributes
cdef class State: cdef int **board cdef BoardPosition _zero_index cdef public State parent
As you can see, the
board 2-dimensional array is declared exactly the way in C, but the reference to the parent
State is not a pointer, as it would have been in C. Cython manages references to Python objects (extension types or normal classes) as pointers in the background; you don't have to declare them as such. The
State extension type has methods defined with either
cdef; the same conditions as above are valid, but you cannot declare any special methods (those wrapped in double underscores) with
cdef. Also, it is not possible to turn
cdef methods into properties with the usual
@property decorator (as would have been convenient with the
One special method of
__cinit__ is worth special attention. This method is called once when an extension type is allocated, and it is the place where code that allocates any further C data structures belongs. In our case, the
board array is allocated here. There is a corresponding
__dealloc__ special method where you can free resources allocated in
__cinit__. This is also implemented in
State for completeness sake.
2. C arrays versus lists
State.children method, we need to collect structures for representing the two following things:
Possible ways of swapping positions on the eight board, depending on where the empty cell is (as
Child states that result from these swaps (as
When it comes to storing such structures, the simplest option is using the tried and proven Python list. Conveniently, basic Python collection types (list, dict, tuple and set) can be used as a type in
cdef functions. The problem with the list structure, however, is that it leads to Python runtime interaction, and is accordingly slow. Therefore, in performance-criticial situations, it is advisable to use C arrays instead. It is the bread and butter of C programming to allocate arrays of structs and iterate over these in every which way possible, and it is not any more difficult in Cython to do so; you can see how it is done with the array of
BoardPosiion structs in the
State.children method. The situation is more complicated with the child states, however, as these are represented with the
State extension type. Since extension types are managed by the Python garbage collector, and they can be referred to only as pointers to Python objects, their storage in arrays as C pointers is rather complicated, involving casts and calls into the reference counting mechanism of Python. For this reason, in order not to complicate the code too much, I opted for the simpler list type in the first iteration. There will later be a more detailed discussion of storing pointers to extension types in arrays.
3. Using DEF to declare a constant
Cython allows C-style constants with the
DEF directive. As with C, any values defined this way are replaced within the code at compile time. you have to be careful, though, as only the basic types int, long, float, bytes and unicode can be declared as constants. The number of rows and columns on the eight board is declared in
eight_cython.pyx as a constant with
DEF SIZE = 3. Cython also allows conditional compilation with the
Comparing the performance of the C version (9 msec) with
eight_cython.pyx (43 msec), we can see that there is still room for improvement to reach C level performance. This is also obvious in the annotation file, which is deep yellow in many parts. To find out which bottleneck to tackle next, however, the annotation file is not enough, as it shows us only how much Python runtime interaction a line of code causes, and not how much it contributes to the total runtime. Profiling is what we need, and fortunately, Cython makes this extremely easy. You need to only add the following compiler directive to the top of a pyx file to make Cython generate profiling data which can be processed by the standard Python profilers:
# cython: profile=True
Another difference to normal operation is that you need to run the code under profiling as a module, and not from the command line, which means that the code has to be compiled as a module with the usual
python setup.py build_ext --inplace. Now we are ready to profile our code by starting a Python shell and running the following:
import cProfile import eight_cython cProfile.run("eight_cython.main('start.txt')", sort='time')
And here is the output I have received:
403088 function calls in 0.097 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 33453 0.018 0.000 0.022 0.000 eight_cython.pyx:59(add_or_get_child) 44935 0.017 0.000 0.021 0.000 eight_cython.pyx:67(get_child) 1 0.009 0.009 0.097 0.097 eight_cython.pyx:202(search) 3717 0.008 0.000 0.012 0.000 eight_cython.pyx:163(children) 6413 0.008 0.000 0.037 0.000 eight_cython.pyx:88(contains) 44935 0.008 0.000 0.029 0.000 eight_cython.pyx:67(get_child (wrapper)) [..snip..]
The runtime increased a little due to the profiling overhead, but we now have ample information on which functions are causing the most execution overhead, namely the
add_or_get_child methods of the
TrieNode extension class. This is not surprising, as this code, used for checking whether a state has already been seen, is guaranteed to be executed for every evaluated state. On top of that, it uses the Python
list type, leading to huge overhead in interaction with the runtime. This can also be seen in the annotation file in which the two
TrieNode methods are a very deep shade of yellow. If we want to speed up our code, we need to tackle the way the board trie is built and used; we will now see how.
Replacing extension types with C structs
The difficulty we are facing in the optimization of the
TrieNode class is that its methods need to return lists of
TrieNode instances. As already mentioned, interaction with the Python list class requires a lot of runtime interaction, so the way to speed up list iteration is to get rid of it, and use C arrays instead. There are two ways of going about this. The first is turning
TrieNode into a C struct, and storing pointers to instances of it in the array. This is the path we will take, and we will have a look at the alternative, using pointers to Python objects, later. The file
eight_cython_improved.pyx contains the implementation where
TrieNode is now a struct, defined as follows:
cdef struct TrieNode: int value TrieNode **children
Code that used to be methods on the
TrieNode cdef class is now packed into individual functions, all prefixed with
trie, and accepting a pointer to a
TrieNode as the first argument. This is a common pattern of organizing C code, somewhat similar to object oriented coding. If you compare this new trie code with the implementation in
eight.c, you can see that the Cython version is pretty much a line-by-line translation, except for the semicolons, and with an extra cast of the malloc return value, which Cython needs for typechecking. Another C pattern that is used in both
eight.c and this improved Cython version is marking the end of a list with a
NULL value. The end of the list of children of a
TrieNode is marked with
NULL, a keyword in Cython that is synonymous with its meaning in C.
This improved version benchmarks at 33 msec, shaving off 10 more msec from the previous version. Is there any other big win we can score, as with the trie optimization? Once more, the answer lies in profiling. Here is the result of profiling
Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.009 0.009 0.054 0.054 eight_cython_improved.pyx:206(search) 3717 0.008 0.000 0.013 0.000 eight_cython_improved.pyx:167(children) 6413 0.006 0.000 0.011 0.000 eight_cython_improved.pyx:93(trie_contains_board) 44935 0.005 0.000 0.005 0.000 eight_cython_improved.pyx:77(trie_get_child) 33453 0.005 0.000 0.009 0.000 eight_cython_improved.pyx:64(trie_add_or_get_child) 3717 0.005 0.000 0.013 0.000 eight_cython_improved.pyx:86(trie_add_board)
There are two more bottlenecks that promise significant improvements: the
search function, and the
children method of
State. These two are connected, however, in that the
children method returns a list, and the
search function iterates over its result. According to the annotation file, this iteration is the most intensive interaction with the Python runtime in
search. We might achieve further speedup by doing the same kind of refactoring we did with the
TrieNode structure, namely by turning
State into a struct and its methods into functions that accept a
State as an argument. The template is already available in
eight.c, but instead of this easy fix, I wanted to give the alternative a try, namely using a C array to collect pointers to Python objects. The result is in the file
eight_cython_improved.pyx. As you can see there, a
ChildSet struct is necessary to return the number of children and the actual pointers to the children:
cdef struct ChildSet: PyObject **children int count
PyObject is the C struct that Python uses to keep track of objects; you can use it in Cython to refer to arbitrary objects. It is your responsibility, however, to manage reference counts when you create a reference to an object, as with the pointer in the
child_set.children array on line 198. This is the reason for the call to
Py_XINCREF on line 196.
Py_XDECREF are imported from the same package with the following line at the top of the file:
from cpython.ref cimport PyObject, Py_XINCREF, Py_XDECREF
Py_XDECREF is later used to decrement the reference count, on line 233. Another thing explicitly done in a number of places is casting between the objects and pointers. When we are adding the child
State object into the array on line 198, we are casting it to a
PyObject *. Later, when this entity needs to be accessed as a
State object again, it is cast back again on line 231. Building and timing the version with object pointers, I get an average runtime of 32 msec, which is the same with the version with extension types in Python lists. Apparently, the load of the casts and reference counting balance out the wins from interacting with the list, leading to zero improvement.
The material presented here covers only the basics of Cython. There are numerous other features, explained in detail in the official documentation and some other resources, that allow Cython to interact with other C or Python libraries, generate better optimized code, or let you tune the interaction with the Python runtime to your needs. Here are some resources that can help you get further in general Cython or in specific areas that interest you:
The Basic Tutorial in cython documentation is another good starting point, but it leaves off at a point where confusions arise if you try to write useful code.
The Language Basics page is worth going through before embarking on any significant Cython project, as it touches on all Cython features, with links to further documentation.
Typed Memoryviews allow Cython code to interact with memory buffers of uniform data types, such as Numpy arrays or built-in Python array types. This feature is based on the buffer protocol, the C-level infrastructure that lays out the groundwork for shared data buffers in Python.
Cython also allows for easy and GIL-free parallelism using OpenMP with the cython.parallel package.
The book Cython - A Guide for Python Programmers is an in-depth discussion of Cython, with all the ins and outs and corner cases. I would highly recommend it if you are going to work extensively with Cython.
The Day of the EXE Is Upon Us is an excellent talk given by Brandon Rhodes at PyCon 2014. Among others, it touches upon the complicated distinctions between interpreted and compiled code (surprise: even x86 assembly is interpreted), why Python is slow, and why Cython is incredible.