Monday, November 24, 2008

Pyrex for dummies

Pyrex is a very powerful tool if you need to speed up some Python code or connect it with some C code. My main motivation to use it was the ability to distribute only the compiled files and not also a C compiler etc.

Unfortunately, the learning curve is a bit steep if you want to do anything non-trivial (if you are not a skilled programmer, that is). I find it very difficult to read and understand just about anything about Python C API and related topics. So any attempt to learn new tricks means spending hours with Google and trying to find an example of usage which is simple enough for me to understand.

What follows is a few steps which I generally follow when rewriting Python into Pyrex code.
  • Profile your code. Start with rewriting the slowest parts of your code (unless you want to know how fast you can make everything).

  • Try to compile unchanged Python code with Pyrex. This way you will start with a working code and if something goes wrong you probably broke it with later changes.

  • Declare loop control variables as C types:
    cdef int i
    and change Python for loops into Pyrex for loops
    for i from 0 <= i < 10:

  • Numpy array data should be accessed directly. It's fairly easy to do now, that I finally know how to do this (see further).

  • Additional speedup may be achieved by eliminating or minimising Python function calls and/or replacing them with C function calls.

  • To help understanding your functions, add at least one line to the docstring such as:
    """myfunc( N, array x) -> array y"""
    Pyrex (or C) function parameters are not accessible as in Python functions by using help(myfunc) therefore you must explicitly write it to the docstring.

These steps have so far done the trick for my purposes. I essentially only need fast maths code so I have no idea about other areas. What I may have to learn later is some string stuff but so far I had neither guts nor reason to try it. This means I'm using pure Python strings in my Pyrex modules.

Now some selected details as promised:
To use some standard C function you have to declare it before use. So, if I e.g. need asin() from the math.h library I put this at the beginning of the Pyrex module:
cdef extern from "math.h":
double asin(double)

Using Numpy C API to access array data directly was tough to learn, this is my current way:
  • put to the beginning of the script:
    cimport c_numpy
    for this to work you have to copy 'c_numpy.pxd' file from 'Python\Lib\site-packages\numpy\doc\pyrex' into the directory with your script (there is a warning about future removal, I hope the same-named file in '\doc\cython' will work as well).

  • initialize numpy by :
    c_numpy.import_array()

  • declare new numpy array:
    cdef c_numpy.ndarray x

  • create numpy array:
    x = numpy.zeros(10)
    There is another (possibly faster) way of creating new arrays but this is what i use now (I also do not know the other way, should have made a note...).

  • declare pointer to store the adress of the numpy array data:
    cdef double *xdata

  • copy data adress to your pointer:
    xdata = <double *>x.data
    The x.data is a char poiter to the first number in your array. I have no idea what this means (why char?).

  • you may now index xdata to get desired element value:
    xdata[6] = 12.54
    or
    tempvar = xdata[1]

  • you may declare the numpy array in the same way during function declaration:
    def myfunc(double step, c_numpy.ndarray x):
    ...

Using the Numpy C API is more cumbersome than just indexing numpy arrays but the code speedup is often significant.
Looking through my Pyrex modules this should in essence be all that is needed to get started with Pyrex.

4 comments:

Nancy Savino said...
This comment has been removed by a blog administrator.
diffusing thoughts said...

#this assumes cython has access to numpy.pxd
from numpy cimport NPY_DOUBLE, NPY_OWNDATA, npy_intp

cdef extern from "arrayobject.h":
cdef object PyArray_SimpleNewFromData(int nd, npy_intp *dims,\
int typenum, void *data)

cdef npy_intp SIZE # the lenght of an 1D array, if nD
this should be an array of lenght n
import_array()
cdef np.ndarray n_src = PyArray_SimpleNewFromData(1, &SIZE, NPY_DOUBLE, {void*}c_src) # {} should be <>
#c_src is the malloced contiguous c-array
n_src.flags = n_src.flags|(NPY_OWNDATA) # this sets the ownership bit, i believe that this makes the array deallocate the c_src memory after its refcount = 0
return n_src

R.L. said...

diffusing thoughts, thank you for your helpful comment. That's exactly what I needed. I'll have to do some profiling and tests but so far it seems to work well. I'm not sure I will use it in production code though. It is a lot more writing and array creation isn't usually the bottleneck.

diffusing thoughts said...

Creation is not a bottleneck (I checked it:P) and I will definitely remove those calls as soon as the cython features with respect to numpy are settled. This is a response to my email on the cython-dev list from Dag who did all this remarkable stuff with cython/numpy.

"""
Your code inspired me to make some additions to the buffer code, up in
the devel branch now. Your calls to PyArray_SimpleNewFromData should no
longer be necesarry, instead you can (if you so wish, of course...) do

cdef np.ndarray[unsigned int, mode="c"] ridx = np.empty((k,), dtype='I')
...
for 0 <= i < k:
ridx[i] = some_c_array[i]
"""

the dtypes are explained in the numpy.pxd file also.