Python for scientific computing: Where to start

(Last update: December 2016. Email me if you see errors and omissions.)

This page is a guide for how to install Python and start using it for scientific computing.

But first: What is Python and why should I use it?

Python is a general-purpose programming language. NumPy is a “package” (group of Python programs and definitions) for storing and manipulating numerical data. By using NumPy and related packages (like SciPy for statistics, optimization, etc., Matplotlib for making graphs, Pandas for organizing and processing complex datasets), Python gets scientific programming capabilities similar to MATLAB, but with various advantages.

What advantages? You can read some blog posts. From my perspective, the most important advantages are: (1) Python is free and open-source (there is no annoying license manager, etc.); (2) Compared to Python, the MATLAB programming language is much more annoying to use (in too many ways to list); (3) There is excellent Python code on the internet for doing anything a computer can do (crawling the internet, parsing text, managing databases, reading PDFs, you name it), whereas the universe of MATLAB is much more limited to science; (4) Numba & Cython are great (see below); (5) Knowing Python is a valuable career skill (this is true for MATLAB too).

To begin: Install Python and Spyder

Screenshot of me using Spyder. Right now, the file editor is on the left, an interactive Python shell is on the bottom right, and function definitions (inputs, outputs, examples, etc.) pop up in the top-right whenever I type a function name (even a function I just invented myself).

To run and edit scientific programs in Python / NumPy, a great way to start is to use Spyder, a visual interface similar to MATLAB where you can run commands, edit and debug programs, check the values of variables and the definitions of functions, etc.

Way back in the day, installing Spyder used to be an annoying multi-step process (first download and install Python itself, then download and install NumPy, SciPy, PyQt, etc., then finally download and install Spyder).

Luckily, installation is now very easy: You can download and install all the components together automatically in one step. For Windows or Mac, your best bet is to download and install Anaconda, which comes with everything you need. In Linux, install Spyder through your package manager, e.g. Ubuntu Software. (See Spyder installation page for package listings). NumPy etc. will also be automatically installed at the same time.

That’s all you need to know to install Python. But if you’re curious, go to the bottom (Appendix 4) for a more thorough list and discussion of all the ways to install Python.

Should I install Python 2 or Python 3?

The Python language changed between Python 2 and Python 3. For scientific computing, the changes were relatively minor, but there are enough incompatible changes that if you find code written in Python 2, it usually cannot run in Python 3, and vice-versa. (I mean, it usually cannot run directly. It’s usually not too hard to edit the code so that it can run.)

The good news is, most widely-used packages, including NumPy / SciPy / Matplotlib / Spyder, are available for both Python 2 and Python 3. The bad news is, if you find some random python code online, it is probably available in only one of the versions, 2 or 3. So it matters: Which one should you install and use?

  • If you are part of a group that exchanges Python code, you should obviously use whatever version of Python your collaborators are using.
  • If not, you should obviously use whatever version of Python is most compatible with other Python code you’re likely to use. If you know any Python programmers in your subfield of science, ask them.
  • Otherwise, other things equal, you might as well use Python 3, because that’s already what most people are using, and you’ll need to switch sooner or later.

…Well, I said above that the changes from Python 2 to Python 3 were minor, but there are actually two changes in Python 3 that make a big difference for scientists.

  • First, in a desperately-needed improvement, Python 3 changed the rules for integer division. In Python 2, an integer divided by an integer is always rounded down to another integer, e.g. 5 / 2 = 2. In Python 3 they fixed it: 5 / 2 = 2.5. (On the rare occasions that you want floor integer division, Python 3 lets you do it with a different symbol, “//”.) Luckily, Python 2 users need not suffer excessively: Just start all your modules with the magic line
    from __future__ import division
    This lets you use the improved Python 3 division rules in a Python 2 program. Use it every time you write Python 2 code—no exceptions.
  • Second, since late 2015 (Python 3.5 & NumPy 1.10), Python 3 has a better notation for matrix multiplication (link): A @ B, rather than the previous dot(A,B). Think of the @ sign as like a big dot, or use the mnemonic “AT sign multiplies mATrices”. (A * B has always been element-by-element multiplication, and still is.) This feature makes matrix calculations much easier to read and debug.
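
To make these two changes concrete, here is a minimal sketch, assuming Python 3.5 or later and NumPy 1.10 or later. The matrices A and B are made-up values chosen only for illustration.

```python
import numpy as np

# Division: "/" is true division in Python 3, "//" is floor division.
print(5 / 2)    # 2.5
print(5 // 2)   # 2

# Hypothetical 2x2 matrices, just for illustration.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

matrix_product = A @ B        # matrix multiplication, same result as np.dot(A, B)
elementwise_product = A * B   # element-by-element multiplication

print(matrix_product)
print(elementwise_product)
```

In Python 2, the first print would show 2 rather than 2.5 unless the module starts with the __future__ import described above.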

“Notebook” interface

A nice idea popularized by Mathematica (among other programs) is a “notebook” interface, where you can run and re-run commands, and interlace nicely-formatted text, code, outputs, and graphs into a single document.

Python has that too: The Jupyter notebook (previously called “IPython notebook”). This link is a document I wrote entirely in a Jupyter notebook (then I exported it to HTML).

Alternatively, if you install Sage (see below), you get a notebook as the default interface.

If you come from a background in math (like me) and maybe have been using Mathematica and Maple for many years, you may have the idea that you can do 100% of your programming and calculating in a notebook interface. Bad idea! When you use a real IDE like Spyder, you get delightful features like syntax checking, pop-up information telling you function definitions and parameters, code analysis and debugging as you type, better compatibility with version control systems, etc. etc. Unless you are doing really straightforward calculations, you should invest the time to get familiar with a real IDE.

Getting started with Python in general…

Of course, you need to know the language. Here is the official Python tutorial. You do NOT need to read every word before you can start doing scientific computing. For example, in small scientific computing projects, you will rarely want to use exception handling, so skip chapter 8. You will rarely want to use classes (object-oriented programming), so skip chapter 9. (If you have a collection of properties or parameters, you can store them in a dictionary. You don’t need a custom class.) If you don’t like the official Python tutorial, you can find many other Python tutorials online. A popular one is Learn Python the Hard Way. I have also heard good things about Google’s short Python class.
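
As a rough sketch of the dictionary-instead-of-class idea: the parameter names and the describe function below are hypothetical, invented only to show the pattern.

```python
# Hypothetical parameters for a small simulation; the names are made up.
params = {
    "temperature_K": 300.0,
    "num_steps": 1000,
    "label": "test run",
}

def describe(p):
    """Return a one-line summary of a parameter dictionary."""
    return "{label}: {num_steps} steps at {temperature_K} K".format(**p)

# Make a variant with one parameter changed; the original is left untouched.
hot_params = dict(params, temperature_K=450.0)

print(describe(params))
print(describe(hot_params))
```

Passing one dictionary around is usually enough for a small script; a custom class starts to pay off only when the parameters need behavior of their own.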

Those are for general Python. Next, you need to learn NumPy / SciPy etc. specifically. My friend made a series of Jupyter notebooks here including a quick overview and intro to NumPy and SciPy, Matplotlib, Pandas, and scikit-learn, with links to further resources. Some others: Here is a NumPy tutorial. Here is a NumPy overview for MATLAB users. Here is a guide to NumPy / SciPy / PyTables / Matplotlib. Here is a broad-based overview of MATLAB versus Python.

Don’t reinvent the wheel! The SciPy website has a very-incomplete list of scientific routines and packages you can use, and mailing lists you can join. The Python website has a much longer and more complete package list: List of all packages (not just science), and List of packages tagged as “Scientific/Engineering”. These are analogous to MATLAB’s “file exchange”.

…And getting started with Spyder specifically

When you open Spyder for the first time, you’ll see that the bottom-right quarter of the screen is called a Console, and you can type Python code into it. Type 3+4, press enter, it should say 7. The left half of the screen is where you can type Python code into files. There is an almost-empty file already there. Go to the end of that file, press Enter a couple times, then type print 3+4 (for Python 2) or print(3+4) (for Python 3). Press the green triangle Play button (top of the screen; use the default settings). When the script runs, it should print out 7 in the Console.

Congratulations, you have now learned two ways to run Python code in Spyder! Here is the third way: Create a folder somewhere on your computer, and call it PythonScripts. In Spyder, go to Tools –> PYTHONPATH Manager, and add PythonScripts as a new path. Exit and restart Spyder, then create a new Python file. As above, type print 3+4 (Python 2) or print(3+4) (Python 3) in this file, and save the file under the name hello.py in the PythonScripts folder. Now go to the Console (bottom right), type import hello, and press Enter: it should print 7. Congratulations, you have created a Python module (that’s what hello.py is) and run the module in the standard way. (This third method is important because it is the only method that works when one Python script wants to call another. This third method is an official part of the Python language, while the first two methods are specific to Spyder.)
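
For the curious, here is a rough sketch of what the third method does behind the scenes, written for Python 3. It uses a temporary folder as a stand-in for PythonScripts, and the variable total inside the file is an addition for this sketch only (the file described above just prints).

```python
import importlib
import os
import sys
import tempfile

# Stand-in for the PythonScripts folder (a temporary folder is an assumption
# of this sketch, so that nothing permanent is created on disk).
scripts_folder = tempfile.mkdtemp()

# Stand-in for hello.py. The variable "total" is added for illustration.
with open(os.path.join(scripts_folder, "hello.py"), "w") as f:
    f.write("total = 3 + 4\nprint(total)\n")

# Roughly what the PYTHONPATH Manager arranges: the folder is put on the
# list of places where "import" looks for modules.
sys.path.insert(0, scripts_folder)
importlib.invalidate_caches()

import hello          # runs the file once, so this prints 7
print(hello.total)    # the module's variables are now available as hello.<name>
```

Note that a second import hello in the same session does not re-run the file; Python remembers modules it has already loaded.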

Tips on using Spyder:

“Console” vs “IPython console”: In the bottom right, there are probably tabs for both “Console” and “IPython console”. Unless you’re doing something fancy, these are basically the same … except that each has its own settings menu (each under Preferences), with some different defaults. So you can use either of them.

If you want graphs in their own window in an IPython console: Under Preferences –> iPython console –> Graphics –> Graphics backend, the default “inline” will display little graphs right in the ipython console window. You are probably better off with one of the other backend options (like “Qt”), which displays graphs in their own window, letting you zoom, pan, export, etc.

Startup scripts: In the preference menu you can specify a startup script, i.e. a set of commands that are run when you open a console. So in the “IPython console” preferences you can specify a startup script for IPython consoles, and in the “Console” preferences you can specify a startup script for consoles.

For more introduction to Spyder, try this link.

Appendix 1: Python program speed, Numba, and Cython

Sometimes a Python program might run too slowly to be useful. This can almost always be fixed by improving the program. For example, maybe you’re doing a calculation that is actually already implemented in NumPy or SciPy. (Important: Never try to speed up a program without first profiling it; otherwise you won’t know which part of the calculation is the slow part! In Spyder, press “F10” to profile your script with the standard Python profiler.)
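
Outside Spyder, the same standard profiler can be run by hand. Here is a minimal sketch using the cProfile and pstats modules from the standard library; the function being profiled is a deliberately naive, made-up example.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately naive loop, used only as something to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def main():
    return slow_sum_of_squares(200000)

# Record timing information while main() runs.
profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists each function with its call count and time, which is usually enough to see where the slow part is before changing any code.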

This article by MATLAB lists other common causes of slowness, many of which are also applicable for Python / NumPy. The most important advice is “vectorization” — writing your code in such a way that you’re storing lots of numbers in arrays, and manipulating them all at once using built-in array operations, rather than accessing the numbers one at a time. Vectorized NumPy code will probably run at a similar speed to vectorized MATLAB code, and a similar speed to well-written FORTRAN or C code. This speed can easily be hundreds of times faster than non-vectorized NumPy code (which in turn is a similar speed to non-vectorized MATLAB code).
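
Here is a small sketch of what vectorization means in practice: the same sum of squares computed one number at a time and then with a single array operation. The data and function names are made up for illustration, and the timings will vary from machine to machine, so none are quoted here.

```python
import time

import numpy as np

# Hypothetical data: 100,000 evenly spaced samples between 0 and 1.
x = np.linspace(0.0, 1.0, 100000)

def sum_of_squares_loop(values):
    """Non-vectorized: touch the numbers one at a time."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_of_squares_vectorized(values):
    """Vectorized: one built-in array operation handles all the numbers."""
    return float(np.dot(values, values))

start = time.perf_counter()
loop_result = sum_of_squares_loop(x)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized_result = sum_of_squares_vectorized(x)
vectorized_seconds = time.perf_counter() - start

print("loop:       %.6f s" % loop_seconds)
print("vectorized: %.6f s" % vectorized_seconds)
```

Both versions give the same answer up to floating-point rounding; only the time taken differs.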

BUT, having said that, maybe vectorization is impossible (or prohibitively time-consuming and inconvenient), and the slowness is inevitable. For example, maybe you have no choice but to perform some arithmetic inside a loop that gets repeated billions of times. Then there are two things you can try: Numba and Cython. Each of them makes basic operations (arithmetic, variable assignments, etc.) run hundreds of times faster…the same speed as if you had written it in C, or perhaps even faster.

Numba: Of the two choices, Numba is apparently much much simpler to use and also works better, at least according to this blog post. You install Numba and tell it to speed up the relevant functions, and that’s it. You’re done. Numba will translate your Python / NumPy code into very fast machine-language code. See Numba tips for more details. You can also buy a “professional” version of Numba which can automatically create code for GPU computations and various other things.
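
As a sketch of what “tell it to speed up the relevant functions” looks like, the example below decorates a loop-heavy function with Numba’s njit decorator. The function itself (counting steps of the Collatz iteration) is just a made-up stand-in for arithmetic in a loop that is awkward to vectorize, and the fallback for a missing Numba install is an addition so the sketch still runs, only slowly.

```python
try:
    from numba import njit
except ImportError:
    # Fallback (an assumption of this sketch): without Numba installed,
    # the decorator does nothing and the function runs as plain Python.
    def njit(func):
        return func

@njit
def count_collatz_steps(n):
    """Number of Collatz steps needed to reach 1, starting from n >= 1."""
    steps = 0
    while n != 1:
        if n % 2 == 0:
            n = n // 2
        else:
            n = 3 * n + 1
        steps += 1
    return steps

print(count_collatz_steps(27))   # 111
```

The first call includes compilation time, so Numba only pays off when the function is called many times or does a lot of work per call.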

Cython: The other good option is Cython. If Numba is so great, why use Cython at all? Well, the last time I used Cython, Numba didn’t exist yet. But even today there are good reasons for some people to use Cython. (1) Cython makes it very very easy to call other C / C++ code as a subroutine of your Python code (i.e., Cython is an alternative to SWIG), (2) Cython gives you full control over what’s going on, which I presume is useful for people who need fine control over memory usage, parallel processing, etc. etc., (3) maybe Numba cannot compile your code because of whatever reason, I don’t know.

Anyway, if you want to use Cython, it’s not hard. All you need to do is add or change a few lines in your program. The parts of the program that are already sufficiently fast do not need to be modified at all; they can still be written in regular Python.

Although Cython code will run at the same speed as C, it is much much easier to write reliable Cython code than C code. For example, Cython can check for integer-overflow errors, divide-by-zero errors, array-out-of-bounds errors, etc. and alert you when they occur. (These checks will slow down the code by maybe 30% compared to C, but that is usually a worthwhile tradeoff, and anyways you can turn off those checks once your code is perfect.) Cython can take care of memory allocation and deallocation. Cython lets you use a different symbol for rounded-integer-division versus normal division. All these things create countless hours of frustration for C coders. Finally, and most importantly, Cython lets you use Python for the non-performance-critical parts of the code, such as defining methods to call the code and graph the output.

Here is a typical example of a Cython program modification:

Python or Slow Cython:
count = 23
count += 10

Fast Cython:
cdef int count = 23
count += 10

The part cdef int count means “The variable count should be treated as a C integer typed variable”. It makes the statement count += 10 run extremely fast, because Cython can convert it into a single line of C code. Without the cdef int part, the same statement would be converted into many many lines of C code: “Does count exist as a local variable? If not, is it a global variable? OK, I’ve found it. Does count have a legal += operation? Yes, here it is. Is 10 a legal input for that operation? Yes. OK, I will do the operation…” (Yes, Python does this kind of stuff under-the-hood for every line of code it runs! MATLAB and other high-level languages do that too, whereas normal C or Fortran code does those checks only during compiling.)
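
You can get a rough feel for this under-the-hood work from plain Python, using the standard dis module to list the interpreter instructions behind count += 10. This is only a loosely related illustration: it shows Python bytecode, not the C code that Cython generates, and the exact instruction names differ between Python versions, so no output is quoted here. The function bump is made up for the example.

```python
import dis

def bump(count):
    count += 10
    return count

# Print the interpreter instructions that implement the function body.
dis.dis(bump)

# The same information as a list, e.g. for counting the instructions.
instructions = list(dis.get_instructions(bump))
print(len(instructions), "instructions for a two-line function")
```

Each of those instructions is handled by the interpreter at run time, with the type of count looked up every time, which is the overhead that the cdef int declaration removes.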

You need a very slight familiarity with the C language to use Cython, maybe what you would learn on the first day of a course in C, or the first ten pages of a book about C. If you know what a “double” or an “int” is in C, what a “header file” is, and what the word “compile” means, then you are probably capable of writing Cython code.

Writing Cython code: See official documentation. Example of how to use Cython in a NumPy program. [That example uses “ndarray” to access numpy arrays in Cython. In Cython version 0.16 or later, you also have the (even more convenient) option of using “Typed Memoryviews” to access numpy arrays in Cython.]

Running Cython code: This page shows how to install and run Cython within Anaconda (among other things). After it’s installed: In IPython (a.k.a. Jupyter) notebooks, run cython code by simply starting a cell with %%cython (e.g. see cells 6-7 here). In normal Python, it’s only slightly more complicated: See section of official documentation. There are various methods, but the “pyximport” method is easiest for getting started.

Checking Cython code: To get the Cython speedup, you want all the operations that get repeated billions of times to be fast C operations, not slow python operations. Cython produces an annotated HTML file telling you which lines of your code are which, so that you can fix anything you missed. How to get that HTML file.  (This file usually ends up in the same folder as the module you wrote, or else in a hidden folder somewhere called “.pyxbuild”) (Or in IPython (a.k.a. Jupyter) notebooks, follow the example of cells 6-7 here.) If there is a bright-yellow-highlighted line that gets repeated billions of times, you should try to fix it. (Double-click the line number to see details.) On the other hand, the slightly-yellow-highlights are not a big deal, even on a line repeated billions of times. These represent things like array-bounds-checking and divide-by-zero-checking that only slightly slow down the code (maybe 30% slowdown, not 30000%) and are probably worth keeping as discussed above.

For what it’s worth, here was my setup for day-to-day Cython coding in Ubuntu / Spyder: I added the following code to my Spyder startup script:

import pyximport; pyximport.install(reload_support=True)
import Cython.Compiler.Options
Cython.Compiler.Options.annotate = True
print 'Note to self: Cython HTMLs are in the folder /home/steve/.pyxbld/temp.linux-x86_64-2.7/pyrex/'

and then in Spyder, if I’m working on the Cython module blah.pyx, I test and run it using import blah and reload(blah) and blah.some_function(567). Same as a normal Python module.

On Windows / Python(x,y) / Spyder, I had something similar, but with a more complicated setup command as discussed here:

import numpy, pyximport
pyximport.install(setup_args={"script_args":["--compiler=mingw32"], "include_dirs":numpy.get_include()}, reload_support=True)
import Cython.Compiler.Options
Cython.Compiler.Options.annotate = True

Appendix 2: Sage

Sage is a (mainly) mathematics program built on Python (and Cython), starting out as competition to Mathematica, Maple, etc. Well, it can be used more generally, but it’s easy to tell that Sage was mainly written by and for mathematicians. For example, if you type x = sin(2), x is NOT (by default) rounded off to a floating-point decimal; it is stored as an exact expression, and you can later display x with a million decimal places. (This is the expected, common-sense behavior if you’re a mathematician, but it’s an unexpected, annoying complication if you’re an engineer.) Likewise, Sage has lots of capabilities in symbolic math, obscure mathematical objects, etc.

If you are not a mathematician, you can browse the list of Sage components (various open-source packages) to see if there is anything you might want to use. Although you can download any of these components yourself, the great strength of Sage is getting all these components to work together smoothly out-of-the-box. The output of one component can immediately be the input of another. So if you plan to write code that simultaneously uses two or more of the Sage components, you may save yourself a lot of headaches by using Sage. (Especially if the components are written in different programming languages, and/or use different data formats, etc. etc.)

Sage comes with a “notebook” environment where you can run calculations and write code. See discussion of notebooks above. You are not obligated to use the Sage notebook if you don’t want to; you can write code using any IDE or source code editor, and then load it into a Sage notebook for testing.

Sage is easy to install in Linux, the usual way. For Windows, rather than installing Sage, think about using SageMathCloud. You run it directly in your browser, and the calculations are done in the cloud. SageMathCloud is actually really neat: In addition to running Sage, you can also write LaTeX documents and IPython (a.k.a. Jupyter) notebooks, among other things, directly in your browser.

Appendix 3: Software engineering

Most scientists write software (at least a little bit), but most scientists do not know anything about “software engineering”, i.e. the practical aspects of writing good, correct software quickly. Even if you hardly ever write software, it is worth your time to learn a few basics of software engineering. These include: How and why to use revision control software (“git” or “mercurial”); How and why to write tests and assertions into your code (for example, use Python’s assert statement as often as possible!); How and why to write clearly and use comments; why to avoid premature optimization, etc. etc.
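
As a small sketch of what assertions and tests look like in practice: the function and test below are made-up examples, not a recommended testing framework.

```python
import math

def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    # Assertion: fail loudly and early if the input makes no sense.
    assert len(values) > 0, "mean() needs at least one value"
    return sum(values) / len(values)

def test_mean():
    """A tiny test: known inputs with known answers."""
    assert mean([2.0, 4.0]) == 3.0
    # Compare floating-point results with a tolerance, not with ==.
    assert math.isclose(mean([0.1, 0.2, 0.3]), 0.2)

test_mean()
print("all tests passed")
```

If an assertion fails, Python stops with an AssertionError pointing at the exact line, which is far easier to debug than a wrong number in a graph three steps later.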

People get the idea that software engineering is something used only by big professional teams working on big professional software projects. Not true! Even if I am spending a few hours writing a little script for my own personal use, I will still use revision control, testing, assertions, comments, etc. Once you get used to these things, you can’t live without them! It is well worth the time to learn these things, even for a casual and infrequent programmer.

If you’re using Python (rather than science-specific programs like MATLAB, Octave, LabVIEW, etc.), you are already at an advantage because Python is more widely used by professional software engineers, and those people create resources and pressure for learning software engineering. For example, there is an organization called Software Carpentry which is trying to teach software engineering to scientists. You can go through their online lessons and video lectures. Their lessons apply to most programming languages … but all of their examples are in Python!

Appendix 4: More comprehensive discussion of all the different ways to install Python


More about company-sponsored Python distributions: This category includes Anaconda (from Continuum Analytics Corp.) [mentioned above], and Enthought (from Enthought Inc.), and ActivePython (from ActiveState Software Corp.). All three of these have free versions and paid versions. But the free versions are perfectly functional; the main reason to pay is for professional technical support. (And even if you pay, the price is 10X lower than MATLAB.) Out of these, I suggest using Anaconda because I’ve heard the most good things about it, and it is “endorsed” by the Spyder installation page. (A close second place would be Enthought.) The distinctions among these three, as far as I can tell, are: Anaconda is generally focused on “big data” analysis for math and science, including running code on GPUs, on parallel architectures, etc. etc. ActivePython is for programmers in general, not just science. Enthought is generally focused on math and science. For Anaconda and ActivePython, Spyder is included, while Enthought comes with a different but equally good IDE called “Enthought Canopy”.

More about Python(x,y) and WinPython: Both are open-source (and non-commercially-supported) ways to conveniently install Python, Spyder, NumPy, etc. on Windows. Python(x,y) is older and much better-known. The advantages of WinPython are (1) You can run it off a USB-stick without installing anything (if you want to), (2) WinPython has a 64-bit version, whereas on 64-bit Windows you can use Python(x,y) but only in 32-bit mode (as of this writing), (3) WinPython is available for Python 2 and 3, while Python(x,y) is only available for Python 2 (as of this writing). The disadvantage of WinPython is that sometimes you might need to install a complicated Python library, which might involve not just “pure” Python code but also C code, weird installation requirements, etc. These complicated libraries can occasionally be tricky to install “from scratch”. Therefore Python(x,y) has prepared a long list of popular libraries in this category which are either included or can be installed with one click. WinPython has some of those, but not as many as Python(x,y). (The company-sponsored distributions above have even more still.) Anyway, I have used both Python(x,y) and WinPython and they’re both great; I’ve never had any problems. They’re both very fast to install: For example, you install Python(x,y) with a couple clicks, and then you can run Spyder via Start menu –> Python(x,y) –> Spyder –> Spyder.

More about Mac options: Before the Spyder stand-alone app was released in 2013, the traditional way to install Spyder / Python / etc. on Mac OS X was MacPorts. See how to install MacPorts. Don’t be surprised if the process takes an hour or two.

More about Linux options: This is very very easy: you just install the appropriate package in the standard way. For example, in Ubuntu, you would open Ubuntu Software, search for “Spyder” (Python 2) or “Spyder3” (Python 3), and click “Install”. The other good option is Anaconda. Finally, you can alternatively install Spyder using pip, which comes with Python, but I do not recommend this for beginners. It often doesn’t work, for tricky reasons.

Python installation options that do not come with Spyder: Spyder is just one of many interfaces to help you program in Python. If you don’t want Spyder in particular, but still want to install Python, NumPy, SciPy, etc., in a convenient one-step way, here are a few options. (1) Pyzo is available for Windows, Mac, and Linux. Like WinPython, you can run Pyzo off a USB-stick without installing anything (if you want to). (2) Enthought was discussed above. (3) Sage is discussed at length above (Appendix 2). (4) Mac OS X users can try the Scipy Superpack.