Update (Jan 28, 2014): Revised the installation steps to follow the recommended virtualenv source install.
I’ve been using scikit-learn over the past few weeks on a project. While developing and analyzing the data, I just needed to get work done without the hassle of a complex installation, and the Ubuntu image on EC2 provided just that. Now that the project is ready to be deployed, I need to install scikit-learn on the default Amazon Linux AMI. As I learned, installing scikit-learn is not trivial. It only has two dependencies, but those dependencies have dependencies of their own, and you have to sift through the documentation of at least five packages to truly understand what needs to be installed and in what order. So I decided to brush up on my writing skills, dust off the old blog, and pen a simple guide that I can reference later. I’ll explain what the scikit-learn dependencies are and how to install them on the Amazon Linux AMI, specifically image ami-1624987f.
- C compiler (gcc)
- Fortran compiler (gfortran)
- Python header files (2.4.x - 3.2.x)
- BLAS or LAPACK (strongly recommended)
- Complete LAPACK library
Here is where it got confusing for me. NumPy optionally requires BLAS or LAPACK (very strongly recommended, from what I can tell). SciPy requires LAPACK, but NumPy does not if BLAS is installed. By deduction, the sensible choice seems to be to install LAPACK, which both can use, and we’re all set.
Not so fast. Almost all the documentation says to install ATLAS as a substitute for BLAS. (Where did ATLAS come from in all this?) It’s also recommended to install an optimized LAPACK with a machine-specific BLAS library. What does that all mean? If you’re really interested, you can read my attempt to figure all this out at the bottom of this post. For now, let’s just get down to installing all these dependencies and start using scikit-learn.
The really daring can run the full install script directly from the gist.
I’ll continue with an explanation of each step in the gist. Let’s install ATLAS, LAPACK, the Python header files, and a C++ compiler.
lapack-devel depends on blas-devel, which in turn depends on the Fortran compiler, so they are both pulled in automatically. Next, install virtualenv and create a virtual Python install to keep all our packages separate from the default machine install.
Activate the new Python 2.7 virtualenv.
Now install NumPy; it should find and use the optimized linear algebra libraries.
Verify that NumPy found the optimized linear algebra libraries.
If you don’t see output for atlas_threads_info, blas_opt_info, atlas_blas_threads_info, or lapack_opt_info, then NumPy did not find the ATLAS libraries. If you’re seeing output similar to the following, it’s probably not what you want.
NumPy is installed but will not use the ATLAS libraries. At this point it’s best to start over from step 1 and make sure atlas-sse3-devel and lapack-devel are installed. I recommend removing the sk-learn (virtualenv) directory and creating the virtualenv again; this makes sure NumPy gets reinstalled from scratch and no old version lingers around to confuse things.
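The check described above can also be scripted. This is only a minimal sketch, assuming the older NumPy output layout in which `numpy.show_config()` prints each section as an unindented `name:` line followed by indented `key = value` lines, or by `NOT AVAILABLE` when nothing was found. Newer NumPy releases print a different layout, so treat it as illustrative. The helper names (`available_sections`, `missing_atlas_sections`, `numpy_config_text`) are made up for this example, and it needs Python 3.4+ for `contextlib.redirect_stdout`.

```python
import contextlib
import io

# Section names this post looks for in the NumPy build configuration.
ATLAS_SECTIONS = (
    "atlas_threads_info",
    "blas_opt_info",
    "atlas_blas_threads_info",
    "lapack_opt_info",
)


def available_sections(config_text):
    """Return the names of sections that list at least one 'key = value' entry.

    Assumes the legacy layout: an unindented 'name:' header line followed by
    indented 'key = value' lines, or an indented 'NOT AVAILABLE' line.
    """
    found = set()
    current = None
    for line in config_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if not line[0].isspace() and line.rstrip().endswith(":"):
            current = line.strip()[:-1]  # new section header
        elif current and "=" in line:
            found.add(current)  # section has real build information
    return found


def missing_atlas_sections(config_text):
    """List the sections from this post that are absent or NOT AVAILABLE."""
    found = available_sections(config_text)
    return [name for name in ATLAS_SECTIONS if name not in found]


def numpy_config_text():
    """Capture what numpy.show_config() prints (NumPy is imported lazily)."""
    import numpy

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()
```

Inside the activated virtualenv, `print(missing_atlas_sections(numpy_config_text()))` would then list whichever of the four sections are missing or unavailable.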
Once NumPy is successfully installed and linked to the ATLAS libraries, continue by installing SciPy and scikit-learn.
scikit-learn is now installed! Let’s run the scikit-learn tests to verify that everything is installed correctly. First, install nose.
Then run the tests from a directory outside the source tree.
That’s it. We now have scikit-learn installed and ready to go on EC2.
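As a quick sanity check on top of the test suite, something like the following sketch reports which versions ended up in the active environment. It assumes Python 3.8+ (for `importlib.metadata`), which is newer than the Python 2.7 virtualenv used in this post, and `installed_versions` is a hypothetical helper name, not part of any of these packages.

```python
from importlib import metadata

# Distribution names as published on PyPI.
PACKAGES = ("numpy", "scipy", "scikit-learn")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(name, version or "NOT INSTALLED")
```

Run it with the virtualenv's own interpreter so that it inspects the virtualenv rather than the system Python.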
This is where I ramble a little as I try to record, for future reference, what all these acronyms mean.
- BLAS (Basic Linear Algebra Subprograms)
Routines that provide standard building blocks for performing basic vector and matrix operations
- LAPACK (Linear Algebra PACKage)
Routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems
- ATLAS (Automatically Tuned Linear Algebra Software)
Complete optimized implementation of the BLAS API and a small subset of the LAPACK API
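To make the split between the first two a little more concrete, here is a toy sketch in pure Python. It is not how BLAS or LAPACK are implemented (those are heavily optimized Fortran and C libraries). It only shows the kind of work each layer does: `axpy` mirrors a basic vector building block, and `solve` is the kind of higher-level problem LAPACK handles, built on top of that block. The function `solve` and its structure are my own illustration.

```python
def axpy(alpha, x, y):
    """BLAS-style building block: return alpha * x + y for two vectors."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def solve(a, b):
    """LAPACK-style problem: solve a * x = b for x.

    Uses Gaussian elimination with partial pivoting. 'a' is a list of rows
    and 'b' a list of right-hand-side values; neither is modified.
    """
    n = len(b)
    # Augmented matrix [a | b], copied so the caller's data stays intact.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            # The row update is itself an axpy: row += (-factor) * pivot_row.
            m[row] = axpy(-factor, m[col], m[row])
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for row in reversed(range(n)):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x
```

For example, `solve([[2, 1], [1, 3]], [3, 5])` works out the `x` that satisfies both equations, and every elimination step inside it is just a call to `axpy`.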
I pick up at the point where we know we need to install BLAS and LAPACK as a prerequisite to installing SciPy. ATLAS is the recommended BLAS implementation, as it provides a complete, machine-optimized implementation of the BLAS libraries. We still need to install a full version of LAPACK for SciPy to be satisfied, since ATLAS only provides a small subset of the LAPACK API.
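The install order that falls out of all this can be sketched as a small dependency graph. The graph below is my own simplification of what this post describes: ATLAS stands in as the BLAS node, and which node each compiler hangs off is an assumption. It uses `graphlib` from the standard library, which requires Python 3.9+.

```python
from graphlib import TopologicalSorter

# Each key depends on everything in its set.
# Simplified from this post; compiler placement is an assumption.
dependencies = {
    "scikit-learn": {"NumPy", "SciPy"},
    "SciPy": {"NumPy", "LAPACK", "Fortran compiler"},
    "NumPy": {"C compiler", "Python headers", "BLAS"},
    "LAPACK": {"BLAS", "Fortran compiler"},
}

# static_order() yields every dependency before the things that need it.
order = list(TopologicalSorter(dependencies).static_order())

for step, name in enumerate(order, start=1):
    print(step, name)
```

The exact order among unrelated items can vary from run to run, but BLAS always comes before LAPACK, NumPy before SciPy, and scikit-learn last.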
It seems so simple now that I write it down, but I had to hunt through various message groups for it all to make sense.