Python

Python for scientific computing: Where to start

THIS PAGE IS NO LONGER MAINTAINED AND OUT OF DATE.

I first made it in the 2000s, when installing Python was an obscure, unpopular, and technically-complicated thing for scientists to do. I kept updating it through around 2018. These days, Python is extremely widespread in science—if you’re having trouble using Python for science, flag down your nearest fellow scientist and ask them for help. Or search the internet for tons of up-to-date getting-started resources.

But first: What is Python and why should I use it?

This page is a guide for how to install Python and start using it for scientific computing.

Python is a general-purpose programming language—one of the most popular in the world. Python has extensive capabilities for scientific computing, mainly via a few very popular add-on “packages” like NumPy for numerical data and matrices, SciPy for statistics, optimization, special functions, etc., Matplotlib for plotting, Pandas for data-frame tools (kinda like in R), and a few others. Thanks to these tools, Python has scientific programming capabilities similar to MATLAB, but with various advantages.

What advantages? You can read some blog posts. From my perspective, the most important advantages are:

1. Python is free and open-source (there is no annoying license manager, etc.);

2. Compared to Python, the MATLAB programming language is much more annoying to use (in MATLAB, every statement prints its result unless you end it with a semicolon, each function you want to call from elsewhere needs its own file, the code winds up unreadable for dozens of reasons, it’s a hassle to have function arguments with defaults or labels, many essential common functions are unavailable unless you have all the toolbox licenses, you can’t use Greek letters as variable names or even write them in comments (!), etc. etc. etc.);

3. There is excellent Python code on the internet for doing almost anything a computer can do (crawling the internet, parsing text, managing databases, reading PDFs, you name it), whereas the universe of MATLAB is much more limited to science;

4. For performance-sensitive code and compiling to C, I have personally found Numba & Cython (see below) to be comparable to or better than the analogous MATLAB tools (though I don’t have too much experience with either);

5. Knowing Python is a valuable career skill (probably more so than MATLAB, these days);

6. More and more of your current and future colleagues know Python—e.g. it is currently the most popular language for college intro programming courses—and if they don’t, it is easy to learn, with abundant learning resources available (see below). So if you’re writing code that needs to be understood and/or used by many people of diverse backgrounds, Python is probably the best choice.

7. Relatedly, if there is some random specialized scientific program or algorithm you need to use, it is more and more likely to be written in Python, or to have an interface with Python.

To begin: Install Python and Spyder

To run and edit scientific programs in Python / NumPy / SciPy / etc., a great way to start is to use Spyder, a visual interface (IDE) similar to MATLAB where you can run commands, edit and debug programs, check the values of variables and the definitions of functions, etc.

WAY back in the early 2000s, installing Python was an annoying multi-step process: First install Python, then install NumPy, then install SciPy ….

But today, installation is very easy: You can download and install everything you need automatically in one step. Regardless of whether you’re using Windows or Mac or Linux, your best bet is to download and install Anaconda. It comes with everything that you’re likely to need, plus a “conda” tool that can download & install any of a large selection of more obscure packages later on.

That’s all you need to know to install Python. But if you’re curious, go to the bottom (Appendix 4) for a more thorough list and discussion of all the ways to install Python.

(If you’re on a shared or public computer and cannot install anything, you can run Python off a USB stick—some options are in Appendix 4—or the next best thing is CoCalc, which lets you run Python / NumPy / etc. through your browser for free.)

Should I install Python 2 or Python 3?

You should use Python 3, unless you have very specific reasons not to. (If your coworkers are all using Python 2, then obviously you should too.) In general, almost everyone is using Python 3 now, and the vast majority of Python code and resources that you’ll find are for Python 3.

If you originally learned Python in version 2, make sure you’re taking advantage of all the neat new features of Python 3 for scientists. I especially appreciate the A @ B notation for matrix multiplication, and the ability to freely use Greek and other characters in the code, e.g. θ = π/4, or xlabel("Thickness (Å)") etc. (How to take advantage of this in practice.)
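Here is a small sketch of those two Python 3 features together, assuming NumPy is installed. The names θ, R, and v are made up for the illustration; it rotates a vector by 45 degrees.

```python
import numpy as np

# Greek letters are legal identifiers in Python 3.
π = np.pi
θ = π / 4

# 2x2 rotation matrix for the angle θ.
R = np.array([[np.cos(θ), -np.sin(θ)],
              [np.sin(θ),  np.cos(θ)]])
v = np.array([1.0, 0.0])

# A @ B is matrix multiplication (Python 3.5 and later).
rotated = R @ v
print(rotated)
```

In Python 2 the same product would have to be written as np.dot(R, v), and the Greek variable names would be a syntax error.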

“Notebook” interface

A nice idea popularized by Mathematica (among other programs) is a “notebook” interface, where you can run and re-run commands, and interlace nicely-formatted text, code, outputs, and graphs into a single document.

Python has that too: The Jupyter notebook (previously called “IPython notebook”). This link is a document I wrote entirely in a Jupyter notebook (then I exported it to HTML).

Alternatively, if you install Sage (see below), you get a notebook as the default interface.

If you come from a background in math (like me) and maybe have been using Mathematica and Maple for many years, you may have the idea that you can do 100% of your programming and calculating in a notebook interface. Bad idea! When you use a real IDE like Spyder, you get delightful features like syntax checking, pop-up information telling you function definitions and parameters, code analysis and debugging as you type, better compatibility with version control systems, etc. etc. Unless you are doing really straightforward calculations, you should invest the time to get familiar with a real IDE.

Getting started with Python in general…

Of course, you need to know the language. Here is the official Python tutorial. You do NOT need to read every word before you can start doing scientific computing. For example, in small scientific computing projects, you will rarely want to use exception handling, so skip chapter 8. You will rarely want to use classes (object-oriented programming), so skip chapter 9. (If you have a collection of properties or parameters, you can store them in a dictionary. You don’t need a custom class.) If you don’t like the official Python tutorial, you can find many other Python tutorials online. A popular one is Learn Python the Hard Way. I have also heard good things about Google’s short Python class.
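To make the dictionary suggestion concrete, here is a minimal sketch. The parameter names and values are hypothetical, invented only for this example.

```python
# Hypothetical experiment parameters, stored in a plain dictionary.
params = {
    "wavelength_nm": 632.8,
    "thickness_nm": 150.0,
    "refractive_index": 1.46,
}

def optical_path_nm(p):
    """Return the optical path length (index times thickness) in nm."""
    return p["refractive_index"] * p["thickness_nm"]

print(optical_path_nm(params))

# Adding or changing a parameter later is a one-line edit.
params["temperature_K"] = 295.0
```

For a small script this is usually enough. A custom class starts to pay off only when the parameters need their own behavior or validation.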

Those are for general Python. Next, you need to learn NumPy / SciPy etc. specifically. My friend made a series of Jupyter notebooks here including a quick overview and intro to NumPy and SciPy, Matplotlib, pandas, and scikit-learn, with links to further resources. Some others: Here is a NumPy tutorial. Here is a NumPy overview for MATLAB users. Here is a guide to NumPy / SciPy / PyTables / Matplotlib. Here is a broad-based overview of MATLAB versus Python.

Don’t reinvent the wheel! The SciPy website has a very-incomplete list of scientific routines and packages you can use, and mailing lists you can join. The Python website has a much longer and more complete package list: List of all packages (not just science), and List of packages tagged as “Scientific/Engineering”. These are analogous to MATLAB’s “file exchange”.

…And getting started with Spyder specifically

When you open Spyder for the first time, you’ll see that the bottom-right quarter of the screen is called “IPython Console”, and you can type Python code into it. Type 3+4, press enter, it should say 7. The left half of the screen is where you can type Python code into files. There is an almost-empty file already there. Go to the end of that file, press Enter a couple times, then type print(3+4) (for Python 3). Press the green triangle Play button (top of the screen; use the default settings). When the script runs, it should print out 7 in the Console.

Congratulations, you have now learned two ways to run Python code in Spyder! Here is the third way: Create a folder somewhere on your computer, and call it PythonScripts. In Spyder, go to Tools –> PYTHONPATH Manager, and add PythonScripts as a new path. Exit and restart Spyder, then create a new Python file. As above, type print(3+4) in this file, and save the file under the name hello.py in the PythonScripts folder. Now go to the IPython Console (bottom right), type import hello, and press Enter; it should print 7. Congratulations, you have created a Python module (that’s what hello.py is) and run it in the standard way. (This third method is important because it is the only method that works when one Python file wants to call another. This third method is an official part of the Python language, while the first two methods are specific to Spyder.)
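If you want to see what import is doing without touching any Spyder settings, the sketch below imitates the steps above in plain Python. It is only an illustration: it writes hello.py into a temporary folder instead of PythonScripts, it adds that folder to sys.path by hand (roughly what the PYTHONPATH Manager arranges for you), and it adds an extra variable called answer that is not in the original hello.py.

```python
import importlib
import os
import sys
import tempfile

# Stand-in for the PythonScripts folder (a temporary folder, for illustration).
scripts_dir = tempfile.mkdtemp()
with open(os.path.join(scripts_dir, "hello.py"), "w") as f:
    f.write("print(3+4)\nanswer = 3 + 4\n")

# Rough equivalent of adding the folder in the PYTHONPATH Manager.
sys.path.insert(0, scripts_dir)

import hello             # runs the file once, so it prints 7
import hello             # already loaded: nothing is printed
importlib.reload(hello)  # re-runs the file, so it prints 7 again
print(hello.answer)
```

Notice that the second import does nothing. Python runs a module only the first time it is imported, so after editing a module you need importlib.reload (or a fresh console) to pick up the changes.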

Tips on using Spyder:

If you want graphs in their own window: Under Tools –> Preferences –> IPython console –> Graphics –> Graphics backend, the default “inline” will display little graphs right in the ipython console window. I prefer “Automatic”, which displays graphs in their own window, letting you zoom, pan, export, etc.

Startup scripts: In the preference menu you can specify a startup script, i.e. a set of commands that are run when you open a console. Go to Tools –> Preferences –> IPython console –> Startup.

For more introduction to Spyder, try this link.

Appendix 1: Python program speed, Numba, and Cython

Sometimes a Python program might run too slowly to be useful. This can almost always be fixed by improving the program. For example, maybe you’re doing a calculation that is actually already implemented in NumPy or SciPy. (Important: Never try to speed up a program without first profiling it; otherwise you won’t know which part of the calculation is the slow part! In Spyder, press “F10” to profile your script with the standard Python profiler.)
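Outside Spyder, you can run the same standard profiler yourself with the cProfile and pstats modules. This is a minimal sketch; the function slow_sum_of_squares is a made-up example of a slow spot, not something from a real program.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately slow: a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def main():
    return slow_sum_of_squares(200_000)

profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Show the five entries with the largest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists each function with its call count and time, which tells you where the effort of speeding things up should go.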

This article by MATLAB lists other common causes of slowness, many of which are also applicable to Python / NumPy. The most important advice is “vectorization”: writing your code in such a way that you’re storing lots of numbers in arrays, and manipulating them all at once using built-in array operations, rather than accessing the numbers one at a time. Vectorized NumPy code will probably run at a similar speed to vectorized MATLAB code, and a similar speed to well-written FORTRAN or C code. It can easily be hundreds of times faster than non-vectorized NumPy code (which in turn runs at a similar speed to non-vectorized MATLAB code).

BUT, having said that, maybe vectorization is impossible (or prohibitively time-consuming and inconvenient), and the slowness is inevitable. For example, maybe you have no choice but to perform some arithmetic inside a loop that gets repeated billions of times. Then there are two things you can try: Numba and Cython. Each of them can make basic operations (arithmetic, variable assignments, etc.) run hundreds of times faster…the same speed as if you had written it in C, or perhaps even faster.
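As a sketch of what vectorization looks like in practice (assuming NumPy is installed), here is the same root-mean-square calculation written both ways. The function names are just illustrative.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100_000)

# Non-vectorized: touch the numbers one at a time in a Python loop.
def rms_loop(values):
    total = 0.0
    for v in values:
        total += v * v
    return (total / len(values)) ** 0.5

# Vectorized: one array expression, with the loop done inside NumPy.
def rms_vectorized(values):
    return np.sqrt(np.mean(values ** 2))

print(rms_loop(x), rms_vectorized(x))
```

Both functions give the same answer up to rounding. Timing them (for example with %timeit in the IPython console) is the easiest way to see the speed difference on your own machine.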

Numba: Of the two choices, Numba is apparently much much simpler to use and also works better, at least according to this blog post. You install Numba and tell it to speed up the relevant functions, and that’s it. You’re done. Numba will translate your Python / NumPy code into very fast machine-language code. It even includes tools to auto-generate code for GPU computations. (The latter used to be a paid add-on but is now free.)
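Here is roughly what “tell it to speed up the relevant functions” looks like: a sketch using Numba’s njit decorator on a loop that is awkward to vectorize. The function count_inside_circle is a made-up example, and the try/except fallback is my own addition so that the sketch still runs (slowly) if Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (slowly) without Numba installed.
    def njit(func):
        return func

@njit
def count_inside_circle(xs, ys):
    """Count points (x, y) with x^2 + y^2 <= 1 using an explicit loop."""
    count = 0
    for i in range(xs.shape[0]):
        if xs[i] * xs[i] + ys[i] * ys[i] <= 1.0:
            count += 1
    return count

rng = np.random.default_rng(0)
xs = rng.random(100_000)
ys = rng.random(100_000)

# Monte Carlo estimate of pi from the fraction of points inside the circle.
print(4.0 * count_inside_circle(xs, ys) / xs.size)
```

The first call includes the compilation time, so time the second call if you want to measure the speedup.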

Cython: The other good option is Cython. If Numba is so great, why use Cython at all? Well, the last time I used Cython, Numba didn’t exist yet. But even today there are good reasons for some people to use Cython. (1) Cython makes it very very easy to call other C / C++ code as a subroutine of your Python code (i.e., Cython is an alternative to SWIG), (2) Cython gives you full control over what’s going on, which I presume is useful for people who need fine control over memory usage, parallel processing, etc. etc., (3) maybe Numba cannot compile your code because of whatever reason, I don’t know.

Anyway, if you want to use Cython, it’s not hard. All you need to do is add or change a few lines in your program. The parts of the program that are already sufficiently fast do not need to be modified at all; they can still be written in regular Python.

Although Cython code will run at the same speed as C, it is much much easier to write reliable Cython code than C code. For example, Cython can check for integer-overflow errors, divide-by-zero errors, array-out-of-bounds errors, etc. and alert you when they occur. (These checks will slow down the code by maybe 30% compared to C, but that is usually a worthwhile tradeoff, and anyways you can turn off those checks once your code is perfect.) Cython can take care of memory allocation and deallocation. Cython lets you use a different symbol for rounded-integer-division versus normal division. All these things create countless hours of frustration for C coders. Finally, and most importantly, Cython lets you use Python for the non-performance-critical parts of the code, such as defining methods to call the code and graph the output.

Here is a typical example of a Cython program modification:

Python or Slow Cython:
count = 23
count += 10

Fast Cython:
cdef int count = 23
count += 10

The part cdef int count means “The variable count should be treated as a C integer typed variable”. It makes the statement count += 10 run extremely fast, because Cython can convert it into a single line of C code. Without the cdef int part, the same statement would be converted into many many lines of C code: “Does count exist as a local variable? If not, is it a global variable? OK, I’ve found it. Does count have a legal += operation? Yes, here it is. Is 10 a legal input for that operation? Yes. OK, I will do the operation…” (Yes, Python does this kind of stuff under-the-hood for every line of code it runs! MATLAB and other high-level languages do that too, whereas in normal C or Fortran code those checks are done only once, during compiling.)

You need a very slight familiarity with the C language to use Cython, maybe what you would learn on the first day of a course in C, or the first ten pages of a book about C. If you know what a “double” or an “int” is in C, what a “header file” is, and what the word “compile” means, then you are probably capable of writing Cython code.
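You can get a glimpse of that under-the-hood work in plain Python (not Cython) with the standard dis module, which lists the bytecode instructions the interpreter steps through. This is only a rough illustration: the instruction names differ between Python versions, and the type lookups described above happen inside the add instruction rather than appearing as separate lines.

```python
import dis

def bump():
    count = 23
    count += 10
    return count

# Print the bytecode that the interpreter steps through for bump().
# The exact instruction names vary between Python versions.
dis.dis(bump)
print(bump())
```

Even this two-line function becomes several load, add, and store instructions, each interpreted at run time.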

Writing Cython code: See official documentation. Example of how to use Cython in a NumPy program. [That example uses “ndarray” to access numpy arrays in Cython. In Cython version 0.16 or later, you also have the (even more convenient) option of using “Typed Memoryviews” to access numpy arrays in Cython.]

Running Cython code: This page shows how to install and run Cython within Anaconda (among other things). After it’s installed: In IPython (a.k.a. Jupyter) notebooks, run cython code by simply starting a cell with %%cython (e.g. see cells 6-7 here). In normal Python, it’s only slightly more complicated: See section of official documentation. There are various methods, but the “pyximport” method is easiest for getting started.

Checking Cython code: To get the Cython speedup, you want all the operations that get repeated billions of times to be fast C operations, not slow Python operations. Cython produces an annotated HTML file telling you which lines of your code are which, so that you can fix anything you missed. How to get that HTML file. (This file usually ends up in the same folder as the module you wrote, or else in a hidden folder somewhere called “.pyxbld”.) (Or in IPython (a.k.a. Jupyter) notebooks, follow the example of cells 6-7 here.) If there is a bright-yellow-highlighted line that gets repeated billions of times, you should try to fix it. (Double-click the line number to see details.) On the other hand, the slightly-yellow highlights are not a big deal, even on a line repeated billions of times. These represent things like array-bounds-checking and divide-by-zero-checking that only slightly slow down the code (maybe 30% slowdown, not 30000%) and are probably worth keeping as discussed above.

For what it’s worth, here was my setup for day-to-day Cython coding in Ubuntu / Spyder: I added the following code to my Spyder startup script:

import pyximport; pyximport.install(reload_support=True)
import Cython.Compiler.Options
Cython.Compiler.Options.annotate = True
print('Note to self: Cython HTMLs are in the folder /home/steve/.pyxbld/temp.linux-x86_64-2.7/pyrex/')

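and then in Spyder, if I’m working on the Cython module blah.pyx, I test and run it using import blah and reload(blah) and blah.some_function(567), the same as for a normal Python module. (In Python 3, reload is no longer a built-in function; get it with from importlib import reload.)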

On Windows / Python(x,y) / Spyder, I had something similar, but with a more complicated setup command as discussed here:

import numpy, pyximport
pyximport.install(setup_args={"script_args":["--compiler=mingw32"], "include_dirs":numpy.get_include()}, reload_support=True)
import Cython.Compiler.Options
Cython.Compiler.Options.annotate = True

Appendix 2: Sage

Sage is a (mainly) mathematics program built on Python (and Cython), starting out as competition to Mathematica, Maple, etc. Well, it can be used more generally, but it’s easy to tell that Sage was mainly written by and for mathematicians. For example, if you type x = sin(2), x is NOT (by default) rounded off to a floating-point decimal; it is stored as an exact expression, and you can later display x with a million decimal places. (This is the expected, common-sense behavior if you’re a mathematician, but it’s an unexpected, annoying complication if you’re an engineer.) Likewise, Sage has lots of capabilities in symbolic math, obscure mathematical objects, etc.

If you are not a mathematician, you can browse the list of Sage components (various open-source packages) to see if there is anything you might want to use. Although you can download any of these components yourself, the great strength of Sage is getting all these components to work together smoothly out-of-the-box. The output of one component can immediately be the input of another. So if you plan to write code that simultaneously uses two or more of the Sage components, you may save yourself a lot of headaches by using Sage. (Especially if the components are written in different programming languages, and/or use different data formats, etc. etc.)

Sage comes with a “notebook” environment where you can run calculations and write code. See discussion of notebooks above. You are not obligated to use the Sage notebook if you don’t want to; you can write code using any IDE or source code editor, and then load it into a Sage notebook for testing.

Sage is easy to install in Linux, the usual way. For Windows, rather than installing Sage, think about using CoCalc (formerly SageMathCloud). You run it directly in your browser, and the calculations are done in the cloud. It’s really neat: Directly in your browser, and for free, you can not only run Sage, but also write LaTeX documents, use Python and R (via a Jupyter notebook), etc. etc.

Appendix 3: Software engineering

Most scientists write software (at least a little bit), but most scientists do not know anything about “software engineering”, i.e. the practical aspects of writing good, correct software quickly. Even if you hardly ever write software, it is worth your time to learn a few basics of software engineering. These include: how and why to use revision control software (“git” or “mercurial”); how and why to write tests and assertions into your code (for example, use Python’s assert statement as often as possible!); how and why to write clearly and use comments; why to avoid premature optimization; etc. etc.
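For instance, here is a small sketch of what assertions and a test can look like in a short script. The function normalize is a made-up example.

```python
def normalize(weights):
    """Scale a list of non-negative weights so that they sum to 1."""
    assert len(weights) > 0, "need at least one weight"
    assert all(w >= 0 for w in weights), "weights must be non-negative"
    total = sum(weights)
    assert total > 0, "weights must not all be zero"
    result = [w / total for w in weights]
    # Sanity check on the output, allowing for floating-point rounding.
    assert abs(sum(result) - 1.0) < 1e-9
    return result

def test_normalize():
    # A tiny test with an answer that is easy to check by hand.
    assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]

test_normalize()
print(normalize([3, 1]))
```

If a later change breaks normalize, or some other code passes it bad input, the script stops immediately with a clear message instead of quietly producing wrong numbers.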

People get the idea that software engineering is something used only by big professional teams working on big professional software projects. Not true! Even if I am spending a few hours writing a little script for my own personal use, I will still use revision control, testing, assertions, comments, etc. Once you get used to these things, you can’t live without them! It is well worth the time to learn these things, even for a casual and infrequent programmer.

If you’re using Python (rather than science-specific programs like MATLAB, Octave, LabVIEW, etc.), you are already at an advantage because Python is more widely used by professional software engineers, and those people create resources and pressure for learning software engineering. For example, there is an organization called Software Carpentry which is trying to teach software engineering to scientists. You can go through their online lessons and video lectures. Their lessons apply to most programming languages … but all of their examples are in Python!

Appendix 4: More comprehensive discussion of all the different ways to install Python

If you have a public or shared computer and cannot install anything, you can run WinPython directly off a USB stick (Windows only) (see below), or else you should use your web browser with Python running in the cloud. I recommend the CoCalc Python cloud service, because they have an option for free private access, and they have lots of other useful software also available, like Sage (see above), LaTeX, etc. Besides CoCalc, there is also Anaconda Cloud, which is free but only if your work is public. If you use either CoCalc or Anaconda Cloud regularly, you may well want to buy a premium subscription; it is not expensive.

More about company-sponsored Python distributions: This category includes Anaconda (from Continuum Analytics Corp.) [mentioned above], and Enthought (from Enthought Inc.), and ActivePython (from ActiveState Software Corp.). All three of these have free versions and paid versions. But the free versions are perfectly functional; the main reason to pay is for professional technical support. (And even if you pay, the price is 10X lower than MATLAB.) Out of these, I suggest using Anaconda because I’ve heard the most good things about it, and it is “endorsed” by the Spyder installation page. (A close second place would be Enthought.) The distinctions among these three, as far as I can tell, are: Anaconda is generally focused on “big data” analysis for math and science, including running code on GPUs, on parallel architectures, etc. etc. ActivePython is for programmers in general, not just science. Enthought is generally focused on math and science. For Anaconda and ActivePython, Spyder is included, while Enthought comes with a different but equally good IDE called “Enthought Canopy”.

More about Python(x,y) and WinPython: Both are open-source (and non-commercially-supported) ways to conveniently install Python, Spyder, NumPy, etc. on Windows. Python(x,y) does not seem very active, and in particular it is not available in Python 3, so is not recommended. WinPython is nice though, and has the special feature that you can run it off a USB-stick without installing anything (if you want to). The disadvantage of WinPython is that sometimes you might need to install a complicated Python library, which might involve not just “pure” Python code but also C code, weird installation requirements, etc. These complicated libraries can occasionally be tricky to install “from scratch”. Anaconda has a quite extensive collection of libraries that can be easily installed with their “conda” tool, whereas WinPython has many fewer. Sometimes you need to use the Python Unofficial Binaries page here.

More about Mac options: Way back in the day, the only way to install Spyder / Python / etc. on a Mac was MacPorts. See how to install MacPorts. But I hear it is slower and less user-friendly than just installing Anaconda.

More about Linux options: This is very very easy: you just install the appropriate package in the standard way. For example, in Ubuntu, you would open Ubuntu Software, search for “Spyder” (Python 2) or “Spyder3” (Python 3), and click “Install”. The other good option is Anaconda. Finally, you can alternatively install Spyder using pip, the package installer that comes with Python, but I do not recommend this for beginners. It often doesn’t work, for tricky reasons.

Python installation options that do not come with Spyder: Spyder is just one of many interfaces to help you program in Python. If you don’t want Spyder in particular, but still want to install Python, NumPy, SciPy, etc., in a convenient one-step way, here are a few options. (1) Pyzo is available for Windows, Mac, and Linux. (2) Enthought was discussed above. (3) Sage is discussed at length above (Appendix 2). (4) Mac OS X users can try the Scipy Superpack.