Python for scientific computing: Where to start
Python is a general-purpose programming language. NumPy is a “package” (group of Python programs and definitions) for storing and manipulating numerical data. By using NumPy and related packages (like SciPy for statistics, optimization, etc., Matplotlib for making graphs, Pandas for organizing and processing complex datasets), Python gets scientific programming capabilities similar to MATLAB, but with various advantages, one of which is the price difference (Python is free and open-source).
My suggestion: Use Spyder
To run and edit scientific programs in Python / NumPy, a great way to start is to use Spyder, a visual interface similar to MATLAB where you can run commands, edit and debug programs, check the values of variables and the definitions of functions, etc.
Unfortunately, it’s not enough to download Spyder by itself. First, you need Python itself, PyQt [a library that the Spyder interface uses], and also NumPy, SciPy, etc. Moreover, the components have to be compatible with each other, and they have to be installed properly so that they can interact with each other. This multi-step installation process is not rocket science, but I get the impression that it can be confusing and error-prone. As just one example, Mac OS X ships with Python installed, but the NumPy installation instructions advise against using that version; you have to do a fresh installation.
Luckily, in recent years, installation has become significantly easier: You can now install all the components together automatically in one step. The Spyder installation page lists one-step installations of Spyder/Python/everything on different platforms. For Windows, they suggest using Python(x,y), WinPython, or Anaconda. For Mac OS X, they suggest the Mac Spyder stand-alone app, or Anaconda. For Linux, they have links to the appropriate packages.
More about Python(x,y) and WinPython: Both are open-source (and non-commercially-supported) ways to conveniently install Python, Spyder, NumPy, etc. on Windows. Python(x,y) is older and much better-known. The advantages of WinPython are (1) you can run it off a USB stick without installing anything (if you want to), and (2) it can run in 64-bit mode on 64-bit Windows, whereas Python(x,y) runs there only in 32-bit mode and so will have worse performance. Anyway, I have used both and they’re both great. They’re very fast to install: For example, you install Python(x,y) with a couple of clicks, and then you can run Spyder via Start menu -> Python(x,y) -> Spyder -> Spyder.
More about company-sponsored Python distributions: This category includes Anaconda (from Continuum Analytics Corp.) [mentioned above], Enthought (from Enthought Inc.), and ActivePython (from ActiveState Software Corp.). All three of these have free versions and paid versions. But the free versions are perfectly functional; the main reason to pay is for professional technical support. (And even if you pay, the price is about ten times lower than MATLAB’s.) The distinctions among these three, as far as I can tell, are: Anaconda is generally focused on “big data” analysis for math and science, including running code on GPUs, on parallel architectures, etc. ActivePython is for programmers in general, not just science. Enthought is generally focused on math and science computing, is popular among students, and (unlike the other two) does not include Spyder (you can install it yourself, but it takes some effort).
More about Mac options: Before the Spyder stand-alone app was released in 2013, the traditional way to install Spyder / Python / etc. on Mac OS X was MacPorts. See how to install MacPorts. Don’t be surprised if the process takes an hour or two.
More about Linux options: This is very easy: you just install the appropriate package in the standard way. For example, in Ubuntu, you would open the Ubuntu Software Center, search for “Spyder”, and click “Install”.
Another option for Linux users is pip, Python’s package installer. This program will download and install Python packages, including Spyder. (pip exists on Windows and Mac too, but there it can normally only install simple packages, because installing the others sometimes requires a C compiler etc.) One reason to use pip is to get the most recent versions of the software; another reason is if you’re installing into a virtualenv.
For example, here is how to install Spyder using pip in Ubuntu. Before you’re ready to use pip, open a terminal and run the command
sudo apt-get install python-pip python-dev build-essential python-qt4 python-numpy python-scipy python-matplotlib
This tells the Linux package manager to install seven Ubuntu packages: The first three make pip work; the fourth (python-qt4) is a Python package that cannot be installed with pip; and the last three are Python packages that are tricky to install with pip.
Now we’re ready for pip. Open a terminal and run the command
sudo pip install sphinx
to download and install the Python package sphinx, then do the same thing eight more times but replace sphinx by spyder, pyflakes, rope, pylint, psutil, ipython, pyzmq, pygments. Now you have Spyder, its required prerequisites, and all its optional extra features.
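Whichever installation route you take, a quick sanity check is to ask Python whether it can import the packages you expect. The following is a minimal sketch, not part of any installer; the helper name check_modules and the list of module names are assumptions for illustration, so adjust the names to whatever you installed.

```python
import importlib


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


# Assumed module names; replace them with the packages you actually installed.
for name, ok in sorted(check_modules(["numpy", "scipy", "matplotlib"]).items()):
    print("%-12s %s" % (name, "OK" if ok else "MISSING"))
```

A module reported as MISSING usually means it was installed into a different Python than the one you are running.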
Alternatives that do not involve Spyder
Spyder is great, but you can certainly program in Python without it. There are many other IDEs (Integrated Development Environments) that work with Python. So, if you don’t want Spyder in particular, but still want to install Python, NumPy, SciPy, etc., in a convenient one-step way, here are a few options:
Enthought was discussed above. Another option is Sage, which is discussed at length below. Another option for Mac OS X users is the Scipy Superpack.
Another option (Windows only) is Portable Python. Like WinPython, it is a self-contained folder, so you can (optionally) run it directly from a USB drive without installing anything. It comes with the essential scientific packages (NumPy etc.), and a very nice IDE similar to Spyder (PyScripter). On the download page, you’ll see that the “Python 2” version has NumPy / SciPy / Matplotlib, while the “Python 3” version does not. Be sure to get the Python 2 version. Which brings us to…
Python 2 or Python 3?
The Python language changed between Python 2 and Python 3. For scientific computing, the changes were relatively minor, but they make most Python 2 code fail to work in Python 3. So you have to pick which one to run. As of this writing (May 2013), scientists should probably use Python 2, not Python 3. Most widely-used packages, including NumPy / SciPy / Matplotlib / Spyder, are available for both Python 2.7 and Python 3.x, but the majority of random Python code you’ll find online will work in 2.7 and not in Python 3.x; you would have to edit the code before using it. (Luckily that’s usually not too hard.) In any case, there is nothing wrong with Python 2.7, and almost all available Python code works in Python 2.7, so that is probably your best bet.
Actually, there was one change from Python 2 to Python 3 that was a desperately-needed improvement for scientists: The treatment of integer division. In Python 2, an integer divided by an integer is always rounded down to another integer, e.g. 10 / 3 = 3. In Python 3 they fixed it: 10 / 3 = 3.333…. (On the rare occasions that you want floor integer division, Python 3 lets you do it with a different symbol, “//”.) Luckily, Python 2 users need not suffer: Just start all your modules with the magic line
from __future__ import division
This lets you use the improved Python 3 division rules in a Python 2 program. It also will make it easier to update your code for Python 3 if you want to do that someday in the future. (There are other “future imports” too, and using those along with other tricks, you can write code that works in both Python 2.7 and Python 3.)
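Here is a small sketch of the two division operators side by side. It assumes you are running Python 3, or a Python 2 module whose first statement is the future-import line; the variable names are just for illustration.

```python
# Assumes Python 3, or a Python 2 module that starts with
# "from __future__ import division".
true_quotient = 10 / 3      # true division: about 3.333
floor_quotient = 10 // 3    # floor division: 3
negative_floor = -10 // 3   # floor division rounds down, so this is -4, not -3

print(true_quotient, floor_quotient, negative_floor)
```

Note that “//” rounds toward negative infinity rather than toward zero, which matters as soon as negative numbers are involved.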
“Notebook” interface
A nice idea popularized by Mathematica (among other programs) is a “notebook” interface, where you can run and re-run commands, and interlace nicely-formatted text, code, outputs, and graphs into a single document.
Python has that too: The IPython notebook. This link is a blog post written entirely in an IPython notebook.
Alternatively, if you install Sage (see below), you get a notebook as the default interface.
If you come from a background in math (like me) and maybe have been using Mathematica and Maple for many years, you may have the idea that you can do 100% of your programming and calculating in a notebook interface. Bad idea! When you use a real IDE like Spyder, you get delightful features like syntax checking, pop-up information telling you function definitions and parameters, code analysis and debugging as you type, compatibility with version control systems, etc. etc. Unless you are doing straightforward calculations, you should invest the time to get familiar with a real IDE.
After you install…
Of course, you also need to know the language. Here is the official Python tutorial. You do NOT need to read every word before you can start doing scientific computing. For example, in small scientific computing projects, you will rarely want to use exception handling, so skip chapter 8. You will rarely want to use classes (object-oriented programming), so skip chapter 9. (If you have a collection of properties or parameters, you can store them in a dictionary. You don’t need a custom class.)
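As a minimal sketch of the dictionary approach, here is how a handful of parameters might be stored and passed around. The parameter names and the describe function are made up for illustration.

```python
# Hypothetical simulation parameters, kept together in a plain dictionary.
params = {"wavelength_nm": 532.0, "num_steps": 1000, "label": "run-A"}


def describe(p):
    """Build a one-line summary from a parameter dictionary."""
    return "%s: %d steps at %.1f nm" % (p["label"], p["num_steps"], p["wavelength_nm"])


params["num_steps"] = 2000   # changing a setting is just an assignment
print(describe(params))      # prints "run-A: 2000 steps at 532.0 nm"
```

Functions take the whole dictionary as one argument, so adding a new parameter later does not change any function signatures.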
Here is a NumPy tutorial. Here is a NumPy overview for MATLAB users. Here is a “course” on NumPy / SciPy / PyTables / Matplotlib. Here is a broad-based overview of MATLAB versus Python. Here is my own introduction to Python for scientific computing (PowerPoint slides). [Note: I made this presentation shortly after I learned the language, so sorry for any mistakes.]
Don’t reinvent the wheel! The SciPy website has a very-incomplete list of scientific routines and packages you can use, and mailing lists you can join. The Python website has a much longer and more complete package list: List of all packages (not just science), and List of packages tagged as “Scientific/Engineering”. These are analogous to MATLAB’s “file exchange”.
Many scientists do not receive formal training in practical programming. Therefore there are many “best practices” which scientists rarely follow but which are second-nature to professional software developers, such as version control [beyond just saving periodic backup copies], testing and assertions [beyond just seeing whether the program runs and whether the output is plausible; hint: use the assert statement as often as possible!], writing clear, commented, and maintainable code, avoiding premature optimization, etc. Although it takes some time to master these practices, it is extremely worthwhile, even for a casual and infrequent programmer. This is not specifically a Python issue; in fact, I think that an advantage of Python (compared to MATLAB etc.) is its healthy “cultural pressure” to follow these good practices. There is an organization called Software Carpentry which is trying to bring good programming practices to scientists. You can go through their online lessons and video lectures.
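To make the hint about assertions concrete, here is a small sketch of a function that checks its own inputs and output. The function normalize and its checks are invented for illustration; the point is only that each assert documents an assumption and stops the program as soon as that assumption is violated.

```python
def normalize(weights):
    """Scale a list of non-negative weights so that they sum to 1."""
    assert len(weights) > 0, "need at least one weight"
    assert all(w >= 0 for w in weights), "weights must be non-negative"
    total = sum(weights)
    assert total > 0, "weights must not all be zero"
    result = [w / float(total) for w in weights]
    # Check the output too, not just the inputs.
    assert abs(sum(result) - 1.0) < 1e-9, "normalized weights should sum to 1"
    return result


print(normalize([1, 1, 2]))   # prints [0.25, 0.25, 0.5]
```

A failed assert raises AssertionError with the message you supplied, which is usually far easier to track down than a wrong number that surfaces three functions later.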
Appendix 1: Python program speed, and Cython
Sometimes a Python program might run too slowly to be useful. This can almost always be fixed by improving the program. For example, maybe you’re doing a calculation that is actually already implemented in NumPy or SciPy. This article by MathWorks (the makers of MATLAB) lists other common causes of slowness, many of which are also applicable to Python / NumPy. The most important advice is “vectorization”: writing your code in such a way that you’re storing lots of numbers in arrays, and manipulating them all at once using built-in array operations, rather than accessing the numbers one at a time. Vectorized NumPy code will probably run at a similar speed to vectorized MATLAB code, and a similar speed to well-written FORTRAN or C code. This can easily be hundreds of times faster than non-vectorized NumPy code (which in turn runs at a similar speed to non-vectorized MATLAB code).
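As a rough illustration of what vectorization means in practice, here are two ways to compute the same sum of squares. This is only a sketch with made-up function names; how large the speed difference is depends on your machine and array size, so time it yourself rather than trusting any particular number.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Non-vectorized: handle one number at a time in a Python loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorized(values):
    """Vectorized: a single built-in array operation over all the numbers."""
    return float(np.dot(values, values))


data = np.arange(1000.0)   # the numbers 0.0, 1.0, ..., 999.0
print(sum_of_squares_loop(data), sum_of_squares_vectorized(data))
```

Both functions return the same value; the second one hands the whole loop to NumPy's compiled code instead of running it in the Python interpreter.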
BUT, having said that, maybe vectorization is impossible (or prohibitively time-consuming and inconvenient), and the slowness is inevitable. For example, maybe you have no choice but to perform some arithmetic inside a loop that gets repeated billions of times. In this situation, Cython is an excellent tool. All you need to do is add or change a few lines in your program, and you can make basic operations (arithmetic, variable assignments, etc.) run hundreds of times faster…the same speed as if you had written it in C. And the parts of the program that are already sufficiently fast do not need to be modified at all; they can still be written in regular Python.
Although Cython code will run at the same speed as C, it is much easier to write reliable Cython code than C code. For example, Cython can check for integer-overflow errors, divide-by-zero errors, array-out-of-bounds errors, etc. and alert you when they occur. (These checks will slow down the code by maybe 30% compared to C, but that is usually a worthwhile tradeoff, and anyway you can turn off those checks once your code is perfect.) Cython can take care of memory allocation and deallocation. Cython lets you use a different symbol for rounded-integer-division versus normal division. In C, all of these issues create countless hours of frustration. Finally, and most importantly, Cython lets you use Python for the non-performance-critical parts of the code, such as defining methods to call the code and graph the output.
Here is a typical example of a Cython program modification:
Python or Slow Cython:
count = 23
count += 10
Fast Cython:
cdef int count = 23
count += 10
The part cdef int count means “The variable count should be treated as a C integer typed variable”. It makes the statement count += 10 run extremely fast, because Cython can convert it into a single line of C code. Without the cdef int part, the same statement would be converted into many lines of C code: “Does count exist as a local variable? If not, is it a global variable? OK, I’ve found it. Does count have a legal += operation? Yes, here it is. Is 10 a legal input for that operation? Yes. OK, I will do the operation…” (Yes, Python does this kind of stuff under-the-hood for every line of code it runs! MATLAB and other high-level languages do that too, whereas normal C or Fortran code only does those checks during compiling.)
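To get a rough feel for this under-the-hood work, you can ask the standard-library dis module to list the interpreter instructions behind these two statements. This is only a sketch, assuming Python 3.4 or later (where dis.get_instructions exists). The function name bump is made up for illustration, the exact instruction names differ between Python versions, and what is listed is the bytecode that the regular interpreter runs, not the C code that Cython generates.

```python
import dis


def bump():
    count = 23
    count += 10
    return count


# Each entry is one interpreter step (load a value, add, store the result, ...).
# Every such step also involves type lookups that happen while the program runs.
instructions = [ins.opname for ins in dis.get_instructions(bump)]
print(len(instructions), "instructions:", instructions)
```

Even this two-line function expands into several separate steps, each of which the interpreter has to dispatch at run time.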
[Another unrelated application of Cython is that it's an alternative to SWIG, i.e. a way to call C or C++ code from Python.]
You need a very slight familiarity with the C language to use Cython, maybe what you would learn on the first day of a course in C, or the first ten pages of a book about C. If you know what a “double” or an “int” is in C, what a “header file” is, and what the word “compile” means, then you are probably capable of writing Cython code.
Writing Cython code: See official documentation. Example of how to use Cython in a NumPy program. [That example uses "ndarray" to access numpy arrays in Cython. In Cython version 0.16 or later, you also have the (even more convenient) option of using "Typed Memoryviews" to access numpy arrays in Cython.]
Running Cython code: See the relevant section of the official documentation. (There are various methods, but the “pyximport” method is easiest for getting started.) How to use Cython within Python(x,y) and more generally in Windows … actually this link may be useful for other systems too.
Debugging Cython code: To get the Cython speedup, you want all the operations that get repeated billions of times to be fast C operations, not slow Python operations. Cython produces an annotated HTML file telling you which lines of your code are which, so that you can fix anything you missed. How to get that HTML file. (This file usually ends up in the same folder as the module you wrote, or else in a hidden folder somewhere called “.pyxbld”.) If there is a bright-yellow-highlighted line that gets repeated billions of times, you should try to fix it. (Double-click the line number to see details.) On the other hand, the slightly-yellow-highlights are not a big deal, even on a line repeated billions of times. These represent things like array-bounds-checking and divide-by-zero-checking that only slightly slow down the code (maybe 30% slowdown, not 30000%) and are probably worth keeping as discussed above.
For what it’s worth, here is my current setup for day-to-day Cython coding in Ubuntu / Spyder: I added the following code to my Spyder startup script:
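# Compile .pyx modules automatically on import, with support for reload()
import pyximport; pyximport.install(reload_support=True)
# Ask Cython to write the annotated HTML file for every module it compiles
import Cython.Compiler.Options
Cython.Compiler.Options.annotate = True
# With parentheses, this reminder works in both Python 2 and Python 3
print('Note to self: Cython HTMLs are in the folder /home/steve/.pyxbld/temp.linux-x86_64-2.7/pyrex/')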
and then in Spyder, if I’m working on the Cython module blah.pyx, I test and run it using import blah and reload(blah) and blah.some_function(567). Same as a normal Python module.
On Windows / Python(x,y) / Spyder, I have something similar, but with a more complicated setup command as discussed here:
import numpy, pyximport
pyximport.install(setup_args={"script_args":["--compiler=mingw32"], "include_dirs":numpy.get_include()}, reload_support=True)
import Cython.Compiler.Options
Cython.Compiler.Options.annotate = True
Appendix 2: Sage
Sage is a (mainly) mathematics program built on python (and cython), starting out as competition to Mathematica, Maple, etc. Well, it can be used more generally, but it’s easy to tell that Sage was mainly written by and for mathematicians. For example, if you type x = sin(2), x is NOT (by default) rounded off to a floating-point decimal; it is stored as an exact expression, and you can later display x with a million decimal places. (This is the expected, common-sense behavior if you’re a mathematician, but it’s an unexpected, annoying complication if you’re an engineer.) Likewise, Sage has lots of capabilities in symbolic math, obscure mathematical objects, etc.
If you are not a mathematician, you can browse the list of Sage components (various open-source packages) to see if there is anything you might want to use. Although you can download any of these components yourself, the great strength of Sage is getting all these components to work together smoothly out-of-the-box. The output of one component can immediately be the input of another. So if you plan to write code that simultaneously uses two or more of the Sage components, you may save yourself a lot of headaches by using Sage. (Especially if the components are written in different programming languages, and/or use different data formats, etc. etc.)
Sage comes with a “notebook” environment where you can run calculations and write code. See discussion of notebooks above. You are not obligated to use the Sage notebook if you don’t want to; you can write code using any IDE or source code editor, and then load it into a Sage notebook for testing.
