Python for scientific computing: Where to start

(Last update: July 2014. Email me if you see errors or omissions.)

This page is a guide for how to install Python and start using it for scientific computing.

But first: What is Python and why should I use it?

Python is a general-purpose programming language. NumPy is a “package” (group of Python programs and definitions) for storing and manipulating numerical data. By using NumPy and related packages (like SciPy for statistics, optimization, etc., Matplotlib for making graphs, Pandas for organizing and processing complex datasets), Python gets scientific programming capabilities similar to MATLAB, but with various advantages.

What advantages? You can read some blog posts. From my perspective, the most important advantages are: (1) Python is free and open-source (there is no annoying license manager, etc.); (2) Compared to Python, the MATLAB programming language is much more annoying to use (in too many ways to list); (3) There is excellent Python code on the internet for doing anything a computer can do (crawling the internet, parsing text, managing databases, reading PDFs, you name it), whereas the universe of MATLAB is much more limited to science; (4) Cython is great (see below); (5) Knowing Python is a valuable career skill (this is true for MATLAB too).

Installation suggestion: Use Spyder

Screenshot of me using Spyder

To run and edit scientific programs in Python / NumPy, a great way to start is to use Spyder, a visual interface similar to MATLAB where you can run commands, edit and debug programs, check the values of variables and the definitions of functions, etc.

Until recently, installing Spyder was an annoying multi-step process (first download and install Python itself, then download and install NumPy, SciPy, PyQt, etc., then finally download and install Spyder).

Luckily, in recent years, installation has become significantly easier: You can now download and install all the components together automatically in one step. The Spyder installation page lists one-step installations of Spyder/Python/everything on different platforms. For Windows, they suggest using Python(x,y), WinPython, or Anaconda. For Mac OS X, they suggest the Mac Spyder stand-alone app, or Anaconda. For Linux, they have links to the appropriate packages.

That’s all you need to know to install Python. But if you’re curious, go to the bottom (Appendix 4) for a more thorough list and discussion of all the ways to install Python.

Should I install Python 2 or Python 3?

The Python language changed between Python 2 and Python 3. For scientific computing, the changes were relatively minor, but there are enough incompatible changes that if you find code written in Python 2, it usually cannot run in Python 3, and vice-versa. (I mean, it usually cannot run directly. It’s usually not too hard to edit the code so that it can run.)

The good news is, most widely-used packages, including NumPy / SciPy / Matplotlib / Spyder, are available for both Python 2 and Python 3. The bad news is, if you find some random python code online, it is probably available in only one of the versions, 2 or 3. So it matters: Which one should you install and use?

If you are part of a group that exchanges Python code, you should obviously use whatever version of Python your collaborators are using.

If not, you should obviously use whatever version of Python is most compatible with other Python code you’re likely to use. This usually means Python 2, especially in scientific computing. (Python 3 was quite rare in scientific computing before 2013.)

Otherwise, other things equal, you might as well use Python 3, because in a few more years that’s what everyone will be using.

I said above that the changes from Python 2 to Python 3 were minor, but actually there was one change which was a desperately-needed improvement: The rules for integer division. In Python 2, an integer divided by an integer is always rounded down to another integer, e.g. 5 / 2 = 2. In Python 3 they fixed it: 5 / 2 = 2.5. (On the rare occasions that you want floor integer division, Python 3 lets you do it with a different symbol, “//”.) Luckily, Python 2 users need not suffer: Just start all your modules with the magic line

from __future__ import division

This lets you use the improved Python 3 division rules in a Python 2 program. It also will make it easier to update your code for Python 3 if you want to do that someday in the future. (There are other “future imports” too, and using those along with other tricks, you can write code that works in both Python 2.7 and Python 3.)
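As a quick illustration of the two division operators, here is a minimal sketch. It assumes Python 3; in Python 2 you would get the same arithmetic by putting the future import as the first statement of the module. The variable names are only for illustration.

```python
# In Python 2, the following line must be the first statement in the module:
#     from __future__ import division
# In Python 3 it is unnecessary, because true division is already the default.

true_quotient = 5 / 2      # true division keeps the fractional part
floor_quotient = 5 // 2    # floor division rounds down to an integer
negative_floor = -5 // 2   # "rounds down" means toward minus infinity, not toward zero

# Under Python 3 this prints: 2.5 2 -3
print(true_quotient, floor_quotient, negative_floor)
```

Note the last case: floor division of a negative number gives -3 rather than -2, which occasionally surprises people coming from C.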

“Notebook” interface

A nice idea popularized by Mathematica (among other programs) is a “notebook” interface, where you can run and re-run commands, and interlace nicely-formatted text, code, outputs, and graphs into a single document.

Python has that too: The IPython notebook. This link is a document I wrote entirely in an IPython notebook (then I saved it as HTML). More information.

Alternatively, if you install Sage (see below), you get a notebook as the default interface.

If you come from a background in math (like me) and maybe have been using Mathematica and Maple for many years, you may have the idea that you can do 100% of your programming and calculating in a notebook interface. Bad idea! When you use a real IDE like Spyder, you get delightful features like syntax checking, pop-up information telling you function definitions and parameters, code analysis and debugging as you type, compatibility with version control systems, etc. etc. Unless you are doing really straightforward calculations, you should invest the time to get familiar with a real IDE.

Getting started with Python in general…

Of course, you need to know the language. Here is the official Python tutorial. You do NOT need to read every word before you can start doing scientific computing. For example, in small scientific computing projects, you will rarely want to use exception handling, so skip chapter 8. You will rarely want to use classes (object-oriented programming), so skip chapter 9. (If you have a collection of properties or parameters, you can store them in a dictionary. You don’t need a custom class.) If you don’t like the official Python tutorial, you can find many other Python tutorials online. The most popular is Learn Python the Hard Way.
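To make the dictionary suggestion concrete, here is a small sketch. The parameter names and the describe function are hypothetical, invented for this example only.

```python
# Hypothetical parameters for a small simulation, kept in a plain dictionary.
params = {
    "temperature_K": 300.0,
    "num_steps": 1000,
    "material": "silicon",
}


def describe(p):
    """Build a one-line summary from a parameter dictionary."""
    return "{material}: {num_steps} steps at {temperature_K} K".format(**p)


# Make a modified copy for a second run without touching the original.
hot_params = dict(params, temperature_K=450.0)

print(describe(params))      # silicon: 1000 steps at 300.0 K
print(describe(hot_params))  # silicon: 1000 steps at 450.0 K
```

For a handful of settings passed between a few functions, this is usually all the structure you need.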

Here is a NumPy tutorial. Here is a NumPy overview for MATLAB users. Here is a “course” on NumPy / SciPy / PyTables / Matplotlib. Here is a broad-based overview of MATLAB versus Python. Here is my own introduction to Python for scientific computing (PowerPoint slides). [Note: I made this presentation shortly after I learned the language, so sorry for any mistakes.]

Don’t reinvent the wheel! The SciPy website has a very-incomplete list of scientific routines and packages you can use, and mailing lists you can join. The Python website has a much longer and more complete package list: List of all packages (not just science), and List of packages tagged as “Scientific/Engineering”. These are analogous to MATLAB’s “file exchange”.

…And getting started with Spyder specifically

When you open Spyder for the first time, you’ll see that the bottom-right quarter of the screen is called a Console, and you can type Python code into it. Type 3+4, press enter, it should say 7. The left half of the screen is where you can type Python code into files. There is an almost-empty file already there. Go to the end of that file, press Enter a couple times, then type print 3+4 (for Python 2) or print(3+4) (for Python 3). Press the green triangle Play button (top of the screen; use the default settings). When the script runs, it should print out 7 in the Console.

Congratulations, you have now learned two ways to run Python code in Spyder! Here is the third way: Create a folder somewhere on your computer, and call it PythonScripts. In Spyder, go to Tools –> PYTHONPATH Manager, and add PythonScripts as a new path. Exit and restart Spyder, then create a new Python file. As above, type print 3+4 (or print(3+4) for Python 3) in this file, and save the file under the name hello.py in the PythonScripts folder. Now go to the Console (bottom right), type import hello, and press Enter; it should print 7. Congratulations, you have created a Python module (that’s what hello.py is) and run the module in the standard way. (This third method is important because it is the only method that works when one Python script wants to call another. This third method is an official part of the Python language, while the first two methods are specific to Spyder.)
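The import mechanism behind this third method can be sketched without Spyder at all. The example below is only an illustration under some assumptions: a temporary folder stands in for PythonScripts, adding it to sys.path stands in for the PYTHONPATH Manager, and the variable answer is an extra name I added to show that the importing script can use what the module defines.

```python
import importlib
import os
import sys
import tempfile

# A temporary folder stands in for the PythonScripts folder.
scripts_dir = tempfile.mkdtemp()

# Write a tiny module. `answer` is an extra name added for illustration.
with open(os.path.join(scripts_dir, "hello.py"), "w") as f:
    f.write("answer = 3 + 4\nprint(answer)\n")

# Roughly what the PYTHONPATH Manager arranges: put the folder on the search path.
sys.path.insert(0, scripts_dir)
importlib.invalidate_caches()  # make sure the freshly written file is noticed

# Equivalent to typing `import hello`; the module body runs once and prints 7.
hello = importlib.import_module("hello")

# Names defined in the module are now available to the importing script.
print(hello.answer + 1)
```

Importing the same module a second time does not run its body again, which is why reloading comes up later when editing a module interactively.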

If you have a version of Spyder older than 2.3, there is a confusing aspect of this. The third method of running Python code is noticeably different from the first two methods: For example, try replacing print 3+4 with print pi or print 2/5. See the difference? With the third method, print pi fails and print 2/5 prints 0, because the third method runs the Python code in a standard, plain Python environment, whereas the first two methods run the Python code after running the “Spyder startup script”, which conveniently defines many math and science functions for you (including pi), and also switches to Python 3 division rules (see above). If you want the third method to work like the other two methods, just add the two lines from __future__ import division and from pylab import * at the top of the file which you are editing. Those are the exact commands used by the Spyder startup script. (To view the entire Spyder startup script, type scientific in the console, or go to Tools–>Preferences–>Console–>Advanced settings, or on some systems it’s Python–>Preferences–>Console–>Advanced settings, to find where the Spyder startup script is stored on your computer.) Recognizing how confusing this is, the Spyder developers changed the defaults in Spyder 2.3 so that consoles always start in a standard, plain Python environment.

For more introduction to Spyder, try this link.

Appendix 1: Python program speed, and Cython

Sometimes a Python program might run too slowly to be useful. This can almost always be fixed by improving the program. For example, maybe you’re doing a calculation that is actually already implemented in NumPy or SciPy. (Important: Never try to speed up a program without first profiling it; otherwise you won’t know which part of the calculation is the slow part! In Spyder, press “F10” to profile your script with the standard Python profiler.)
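Outside Spyder, the same standard profiler is available as the cProfile module. Here is a minimal sketch, assuming Python 3; the functions slow_sum_of_squares and main are made-up stand-ins for your own code.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop, standing in for the slow part of a program."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    return slow_sum_of_squares(200000)


# Collect timing data while main() runs.
profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Format the report: sort by cumulative time and show the top few entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

The function that dominates the cumulative-time column is the one worth optimizing; everything else can usually be left alone.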

This article by MATLAB lists other common causes of slowness, many of which are also applicable for Python / NumPy. The most important advice is “vectorization” — writing your code in such a way that you’re storing lots of numbers in arrays, and manipulating them all at once using built-in array operations, rather than accessing the numbers one at a time. Vectorized NumPy code will probably run at a similar speed to vectorized MATLAB code, and a similar speed to well-written FORTRAN or C code. This speed can easily be hundreds of times faster than non-vectorized NumPy code (which in turn is a similar speed to non-vectorized MATLAB code).
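Here is what vectorization looks like in practice, as a small sketch assuming NumPy is installed. The two function names are hypothetical; both compute the same sum of squares.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Non-vectorized: visit the numbers one at a time in a Python loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorized(values):
    """Vectorized: square the whole array at once, then add it up."""
    return np.sum(values * values)


values = np.linspace(0.0, 1.0, 100000)
loop_total = sum_of_squares_loop(values)
vector_total = sum_of_squares_vectorized(values)

# The two results agree up to floating-point rounding.
print(loop_total, vector_total)
```

To see the speed difference on your own machine, time each function with the standard timeit module; the exact ratio depends on the array size and the hardware.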

BUT, having said that, maybe vectorization is impossible (or prohibitively time-consuming and inconvenient), and the slowness is inevitable. For example, maybe you have no choice but to perform some arithmetic inside a loop that gets repeated billions of times. In this situation, Cython is an excellent tool. All you need to do is add or change a few lines in your program, and you can make basic operations (arithmetic, variable assignments, etc.) run hundreds of times faster…the same speed as if you had written it in C. And the parts of the program that are already sufficiently fast do not need to be modified at all; they can still be written in regular Python.

Although Cython code will run at the same speed as C, it is much much easier to write reliable Cython code than C code. For example, Cython can check for integer-overflow errors, divide-by-zero errors, array-out-of-bounds errors, etc. and alert you when they occur. (These checks will slow down the code by maybe 30% compared to C, but that is usually a worthwhile tradeoff, and anyways you can turn off those checks once your code is perfect.) Cython can take care of memory allocation and deallocation. Cython lets you use a different symbol for rounded-integer-division versus normal division. All these things create countless hours of frustration for C coders. Finally, and most importantly, Cython lets you use Python for the non-performance-critical parts of the code, such as defining methods to call the code and graph the output.

Here is a typical example of a Cython program modification:

Python or Slow Cython:
count = 23
count += 10

Fast Cython:
cdef int count = 23
count += 10

The part cdef int count means “The variable count should be treated as a C integer typed variable”. It makes the statement count += 10 run extremely fast, because Cython can convert it into a single line of C code. Without the cdef int part, the same statement would be converted into many many lines of C code: “Does count exist as a local variable? If not, is it a global variable? OK, I’ve found it. Does count have a legal += operation? Yes, here it is. Is 10 a legal input for that operation? Yes. OK, I will do the operation…” (Yes, Python does this kind of stuff under-the-hood for every line of code it runs! MATLAB and other high-level languages do that too, whereas normal C or Fortran code only do those checks during compiling.)

[Another unrelated application of Cython is that it’s an alternative to SWIG, i.e. a way to call C or C++ code from Python.]

You need a very slight familiarity with the C language to use Cython, maybe what you would learn on the first day of a course in C, or the first ten pages of a book about C. If you know what is a “double” or an “int” in C, what is a “header file” in C, what does the word “compile” mean, then you are probably capable of writing Cython code.

Writing Cython code: See official documentation. Example of how to use Cython in a NumPy program. [That example uses “ndarray” to access numpy arrays in Cython. In Cython version 0.16 or later, you also have the (even more convenient) option of using “Typed Memoryviews” to access numpy arrays in Cython.]

Running Cython code: See section of official documentation (There are various methods, but the “pyximport” method is easiest for getting started.) How to use Cython within Python(x,y) and more generally in Windows … actually this link may be useful for other systems too.

Checking Cython code: To get the Cython speedup, you want all the operations that get repeated billions of times to be fast C operations, not slow python operations. Cython produces an annotated HTML file telling you which lines of your code are which, so that you can fix anything you missed. How to get that HTML file. (This file usually ends up in the same folder as the module you wrote, or else in a hidden folder somewhere called “.pyxbuild”) If there is a bright-yellow-highlighted line that gets repeated billions of times, you should try to fix it. (Double-click the line number to see details.) On the other hand, the slightly-yellow-highlights are not a big deal, even on a line repeated billions of times. These represent things like array-bounds-checking and divide-by-zero-checking that only slightly slow down the code (maybe 30% slowdown, not 30000%) and are probably worth keeping as discussed above.

For what it’s worth, here is my current setup for day-to-day Cython coding in Ubuntu / Spyder: I added the following code to my Spyder startup script:

import pyximport; pyximport.install(reload_support=True)
import Cython.Compiler.Options
Cython.Compiler.Options.annotate = True
print 'Note to self: Cython HTMLs are in the folder /home/steve/.pyxbld/temp.linux-x86_64-2.7/pyrex/'

and then in Spyder, if I’m working on the Cython module blah.pyx, I test and run it using import blah and reload(blah) and blah.some_function(567). Same as a normal Python module.

On Windows / Python(x,y) / Spyder, I have something similar, but with a more complicated setup command as discussed here:

import numpy, pyximport
pyximport.install(setup_args={"script_args":["--compiler=mingw32"], "include_dirs":numpy.get_include()}, reload_support=True)
import Cython.Compiler.Options
Cython.Compiler.Options.annotate = True

Appendix 2: Sage

Sage is a (mainly) mathematics program built on Python (and Cython), which started out as competition to Mathematica, Maple, etc. Well, it can be used more generally, but it’s easy to tell that Sage was mainly written by and for mathematicians. For example, if you type x = sin(2), x is NOT (by default) rounded off to a floating-point decimal; it is stored as an exact expression, and you can later display x with a million decimal places. (This is the expected, common-sense behavior if you’re a mathematician, but it’s an unexpected, annoying complication if you’re an engineer.) Likewise, Sage has lots of capabilities in symbolic math, obscure mathematical objects, etc.

If you are not a mathematician, you can browse the list of Sage components (various open-source packages) to see if there is anything you might want to use. Although you can download any of these components yourself, the great strength of Sage is getting all these components to work together smoothly out-of-the-box. The output of one component can immediately be the input of another. So if you plan to write code that simultaneously uses two or more of the Sage components, you may save yourself a lot of headaches by using Sage. (Especially if the components are written in different programming languages, and/or use different data formats, etc. etc.)

Sage comes with a “notebook” environment where you can run calculations and write code. See discussion of notebooks above. You are not obligated to use the Sage notebook if you don’t want to; you can write code using any IDE or source code editor, and then load it into a Sage notebook for testing.

Appendix 3: Software engineering

Most scientists write software (at least a little bit), but most scientists do not know anything about “software engineering”, i.e. the practical aspects of writing good, correct software quickly. Even if you hardly ever write software, it is worth your time to learn a few basics of software engineering. These include: How and why to use revision control software (“git” or “mercurial”); How and why to write tests and assertions into your code (for example, use Python’s assert statement as often as possible!); How and why to write clearly and use comments; Why to avoid premature optimization; etc. etc.
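For example, here is a sketch of what assertions and a tiny test can look like in a scientific script. The function celsius_to_kelvin and its test are hypothetical, chosen only to keep the example short.

```python
ABSOLUTE_ZERO_C = -273.15


def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin."""
    # Check the input: catches sign errors and unit mix-ups early.
    assert temp_c >= ABSOLUTE_ZERO_C, "temperature below absolute zero: %r" % temp_c
    temp_k = temp_c - ABSOLUTE_ZERO_C
    # Check the output: a kelvin temperature can never be negative.
    assert temp_k >= 0
    return temp_k


def test_celsius_to_kelvin():
    """A tiny test: known answers, plus one input that must be rejected."""
    assert celsius_to_kelvin(0.0) == 273.15
    assert celsius_to_kelvin(ABSOLUTE_ZERO_C) == 0.0
    try:
        celsius_to_kelvin(-300.0)
    except AssertionError:
        pass  # expected: the input check fired
    else:
        raise RuntimeError("an impossible temperature was accepted")


test_celsius_to_kelvin()
print("all tests passed")
```

Re-running the test after every change takes a second and tells you immediately if an edit broke something that used to work.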

People get the idea that software engineering is something used only by big professional teams working on big professional software projects. Not true! Even if I am spending a few hours writing a little script for my own personal use, I will still use revision control, testing, assertions, comments, etc. Once you get used to these things, you can’t live without them! It is well worth the time to learn these things, even for a casual and infrequent programmer.

If you’re using Python (rather than science-specific programs like MATLAB, Octave, LabVIEW, etc.), you are already at an advantage because Python is more widely used by professional software engineers, and those people create resources and pressure for learning software engineering. For example, there is an organization called Software Carpentry which is trying to teach software engineering to scientists. You can go through their online lessons and video lectures. Their lessons apply to most programming languages … but all of their examples are in Python!

Appendix 4: More comprehensive discussion of all the different ways to install Python

More about company-sponsored Python distributions: This category includes Anaconda (from Continuum Analytics Corp.) [mentioned above], and Enthought (from Enthought Inc.), and ActivePython (from ActiveState Software Corp.). All three of these have free versions and paid versions. But the free versions are perfectly functional; the main reason to pay is for professional technical support. (And even if you pay, the price is 10X lower than MATLAB.) Out of these, I suggest using Anaconda because I’ve heard the most good things about it, and it is “endorsed” by the Spyder installation page. (A close second place would be Enthought.) The distinctions among these three, as far as I can tell, are: Anaconda is generally focused on “big data” analysis for math and science, including running code on GPUs, on parallel architectures, etc. etc. ActivePython is for programmers in general, not just science. Enthought is generally focused on math and science. For Anaconda and ActivePython, Spyder is included, while Enthought comes with a different but equally good IDE called “Enthought Canopy”.

More about Python(x,y) and WinPython: Both are open-source (and non-commercially-supported) ways to conveniently install Python, Spyder, NumPy, etc. on Windows. Python(x,y) is older and much better-known. The advantages of WinPython are (1) You can run it off a USB-stick without installing anything (if you want to), (2) WinPython can run in 64-bit mode on 64-bit Windows, whereas Python(x,y) runs there only in 32-bit mode (as of this writing), (3) WinPython is available for Python 2 and 3, while Python(x,y) is only available for Python 2 (as of this writing). The disadvantage of WinPython is that sometimes you might need to install a complicated Python library, which might involve not just “pure” Python code but also C code, weird installation requirements, etc. These complicated libraries can occasionally be tricky to install “from scratch”. Therefore Python(x,y) has prepared a long list of popular libraries in this category which are either included or can be installed with one click. WinPython has some of those, but not as many as Python(x,y). (The company-sponsored distributions above have even more still.) Anyway, I have used both Python(x,y) and WinPython and they’re both great; I’ve never had any problems. They’re both very fast to install: For example, you install Python(x,y) with a couple clicks, and then you can run Spyder via Start menu –> Python(x,y) –> Spyder –> Spyder.

More about Mac options: Before the Spyder stand-alone app was released in 2013, the traditional way to install Spyder / Python / etc. on Mac OS X was MacPorts. See how to install MacPorts. Don’t be surprised if the process takes an hour or two.

More about Linux options: This is very very easy, you just install the appropriate package in the standard way. For example, in Ubuntu, you would open the Ubuntu Software Center, search for “Spyder”, and click “Install”. Another option for Linux users is pip, part of Python. This program will download and install Python packages, including Spyder. (Pip exists in Windows or Mac too, but there, it can normally only install simple packages, because it sometimes requires a C compiler etc.) One reason to use pip is to get the most recent versions of the software; another reason is if you’re installing into a virtualenv. For example, here is how to install Spyder using pip in Ubuntu: Before you’re ready to use pip, open a terminal and run the command

sudo apt-get install python-pip python-dev build-essential python-qt4 python-numpy python-scipy python-matplotlib

This tells the Linux package manager to install seven Ubuntu packages: The first three make pip work; the fourth (python-qt4) is a Python package that cannot be installed with pip; and the last three are Python packages that are tricky to install with pip. Now we’re ready for pip. Open a terminal and run the command

sudo pip install sphinx

to download and install the Python package sphinx, then do the same thing eight more times, replacing sphinx in turn by spyder, pyflakes, rope, pylint, psutil, ipython, pyzmq, and pygments. Now you have Spyder, its required prerequisites, and all its optional extra features.
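After the pip installs, you may want to confirm that everything is importable. The sketch below is one way to do that, assuming Python 3.4 or later (for importlib.util). The mapping from pip names to import names is my assumption: in particular, pyzmq is imported as zmq, ipython as IPython, and old Spyder releases used the import name spyderlib rather than spyder.

```python
import importlib.util

# pip package name -> name used in an import statement (assumed; see note above)
IMPORT_NAMES = {
    "sphinx": "sphinx",
    "spyder": "spyder",
    "pyflakes": "pyflakes",
    "rope": "rope",
    "pylint": "pylint",
    "psutil": "psutil",
    "ipython": "IPython",
    "pyzmq": "zmq",
    "pygments": "pygments",
}


def installed_packages(import_names):
    """Return {pip_name: True/False} without actually importing anything."""
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in import_names.items()
    }


status = installed_packages(IMPORT_NAMES)
for pip_name in sorted(status):
    print("%-10s %s" % (pip_name, "ok" if status[pip_name] else "MISSING"))
```

Anything reported as MISSING can be installed with another sudo pip install command for that package.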

Python installation options that do not come with Spyder: Spyder is just one of many interfaces to help you program in Python. If you don’t want Spyder in particular, but still want to install Python, NumPy, SciPy, etc., in a convenient one-step way, here are a few options. (1) Enthought was discussed above. (2) Sage is discussed at length above (Appendix 2). (3) Mac OS X users can try the Scipy Superpack. (4) Windows users can try Portable Python.