Reproducible Jupyter Notebooks with Docker
Reproducing the computational steps in your own or somebody else's notebook is fraught with peril, primarily because there's no way to capture environment information (OS, dependencies, etc.) in the notebook itself. In this blog post, we'll show you how to capture that environment information in a Docker image and how to run notebooks as a Docker container.
When to use docker with Jupyter
- If your notebook relies on specific Python packages
- If your notebook has OS-level dependencies, e.g. C libraries
- If you want to package data files as part of the executable environment
- If you’d like to specify some environment variables (static or runtime)
- If you need to run your notebook on cloud machines
If the answer to any of the above is yes, then you should consider packaging your notebooks as a Docker image.
Docker Overview
Today, Docker containers are the standard format for running software in a fully specified environment, right down to the OS. Before we proceed, here are the basic building blocks of the Docker ecosystem you need to understand:
- Image: A Docker image is the executable package that contains the complete environment, including the OS, all the files, installed libraries, and so on.
- Container: A running instance of an image is called a Docker container.
- Registry: A place to store and share Docker images.
- Dockerfile: A file that specifies how a Docker image should be built — which OS, libraries, files, environment variables, and so on the image should contain.
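These building blocks fit together in a simple lifecycle. A quick sketch, assuming a Dockerfile in the current directory and an illustrative image name of my-notebooks (requires a running Docker daemon):

```shell
# Dockerfile -> Image: build an image from the Dockerfile in this directory
docker build -t my-notebooks .

# Image -> Container: run the image as a container
docker run my-notebooks

# Image -> Registry: push the image so others can pull it
docker push my-notebooks
```

We'll walk through each of these commands in detail below.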
Prerequisites
Install Docker locally.
Steps to dockerize your notebooks
Let's say you've authored a notebook and it runs fine locally. Now we'll go through the steps to run it as a Docker container.
- First, we need to create a Dockerfile. There are ready-to-use Dockerfiles for executing Jupyter notebooks; for our purposes, the following contents are enough.
```dockerfile
# Let's extend from the base notebook
# https://github.com/jupyter/docker-stacks/blob/master/base-notebook
FROM jupyter/base-notebook

# Install required packages on top of the base Jupyter image
RUN pip install --no-cache-dir \
    scipy \
    numpy \
    pandas \
    scikit-learn \
    matplotlib \
    tensorflow

# Copy all files (current directory onwards) into the image
COPY . /
```
- Put the above Dockerfile at the root of the directory containing your notebooks. If you want to include data files in the Docker image, keep them alongside your notebooks (since we copy the entire folder into the image).
- Modify the Dockerfile to suit your needs — add/remove packages, pin package versions, set environment variables, and so on. See the reference documentation for writing Dockerfiles.
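For reproducibility, it's worth pinning exact package versions and baking any static environment variables into the image. A sketch of what that could look like — the version numbers and the DATA_DIR variable below are purely illustrative, and /home/jovyan/work is the working directory convention used by the jupyter/docker-stacks images:

```dockerfile
FROM jupyter/base-notebook

# Pin exact versions so every build resolves the same dependencies
# (version numbers are placeholders -- substitute your own)
RUN pip install --no-cache-dir \
    numpy==1.16.4 \
    pandas==0.24.2 \
    scikit-learn==0.21.2

# Static environment variable baked into the image (illustrative)
ENV DATA_DIR=/home/jovyan/work/data

# Copy notebooks and data into the default working directory
COPY . /home/jovyan/work
```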
- Build the Docker image:

```shell
>> docker build .
Sending build context to Docker daemon  600.1kB
Step 1/3 : FROM jupyter/base-notebook
latest: Pulling from jupyter/base-notebook
5b7339215d1d: Pull complete
14ca88e9f672: Pull complete
a31c3b1caad4: Pull complete
...
...
Step 3/3 : COPY ./ ./
 ---> 7e742e969855
Successfully built 7e742e969855
```
- Now that your Docker image is built, you can see it with:

```shell
>> docker images
REPOSITORY    TAG       IMAGE ID        CREATED          SIZE
<none>        <none>    7e742e969855    19 minutes ago   1.36GB
```
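The <none> repository and tag appear because we didn't name the image. Passing -t at build time gives the image a human-readable name, so you don't have to copy image IDs around. A sketch, using my-notebooks as an example name:

```shell
# Build with a name and tag instead of relying on the image ID
docker build -t my-notebooks:latest .

# Run by name
docker run -p 8888:8888 my-notebooks:latest
```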
- Start the Jupyter server with:

```shell
>> docker run -p 8888:8888 7e742e969855
```

The above command runs the image we built earlier and binds the container's Jupyter port 8888 to port 8888 of the host machine. Please note that 7e742e969855 is the image ID in my case; replace it with your own image ID from the build step. Your notebooks are now accessible at http://127.0.0.1:8888 as usual.
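Runtime environment variables (mentioned earlier as a reason to use Docker) can be passed with -e, and mounting your working directory as a volume makes changes made inside the container persist on the host. A sketch — MY_API_KEY is a hypothetical variable, and /home/jovyan/work is the working directory used by the jupyter/docker-stacks images:

```shell
docker run -p 8888:8888 \
    -e MY_API_KEY=secret-value \
    -v "$PWD":/home/jovyan/work \
    7e742e969855
```

Without the volume mount, any edits you make to notebooks inside the container are lost when the container is removed.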
Conclusion
One might wonder what the benefit of dockerized Jupyter is when you can run Jupyter locally without any of the Docker jazz. Fair question. Here are the two main benefits:
- Once created, you can share the Dockerfile along with your notebooks, and anyone (including future you) can run the notebooks simply with docker build + docker run. The entire environment — OS, libraries, data files — will be recreated exactly as intended.
- You can push the Docker image to a registry (e.g. hub.docker.com), and anyone can pull the image and execute the notebooks directly. It also makes it super convenient for you to execute the notebooks on a cloud machine (docker pull + docker run).
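Pushing to Docker Hub, for example, is a tag-and-push. A sketch, where yourusername stands in for your Docker Hub account and 7e742e969855 is the image ID from the build step:

```shell
# Log in to Docker Hub
docker login

# Tag the local image with your registry namespace
docker tag 7e742e969855 yourusername/my-notebooks:latest

# Push it to the registry
docker push yourusername/my-notebooks:latest
```

Anyone can then reproduce the environment with docker run -p 8888:8888 yourusername/my-notebooks:latest.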
As a best practice, always commit these Dockerfiles along with your notebooks to a version control system such as GitHub or GitLab. That's all! If you found this tutorial useful, do check out ReviewNB for your Jupyter notebook code reviews.