Reproducible Jupyter Notebooks with Docker
Reproducing the computational steps in your own or somebody else's notebook is fraught with peril, primarily because there's no way to capture environment information (OS, dependencies, etc.) in the notebook itself. In this blog post, we'll show you how to capture that environment information in a Docker image and how to run notebooks as a Docker container.
When to use docker with Jupyter
- If your notebook relies on specific Python packages
- If your notebook has OS-level dependencies, e.g. C libraries
- If you want to package data files as part of the executable environment
- If you’d like to specify some environment variables (static or runtime)
- If you need to run your notebook on cloud machines
If the answer to any of the above is yes, then you should consider packaging your notebooks as a Docker image.
Docker Overview
Today, Docker containers are the standard format for running software in a fully specified environment, right down to the OS. Before we proceed, here are the basic building blocks of the Docker ecosystem you need to understand:
- Image: A Docker image is the executable package that contains the complete environment, including the OS, all the files, installed libraries, and so on.
- Container: A running instance of an image is called a Docker container.
- Registry: A place to store and share Docker images.
- Dockerfile: A file that specifies how a Docker image should be built — which OS, libraries, files, environment variables, and so on the image should contain.
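These building blocks fit together in a simple lifecycle. A quick sketch, assuming a Dockerfile in the current directory and an illustrative image name of my-notebooks (requires a running Docker daemon):

```shell
# Dockerfile -> Image: build an image from the Dockerfile in this directory
docker build -t my-notebooks .

# Image -> Container: run the image as a container
docker run my-notebooks

# Image -> Registry: push the image so others can pull it
docker push my-notebooks
```

We'll walk through each of these commands in detail below.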
Prerequisites
Install Docker locally.
Steps to dockerize your notebooks
Let's say you've authored a notebook and it runs fine locally. Now we'll go through the steps to run it as a Docker container.
- First, we need to create a Dockerfile. There are ready-to-use Dockerfiles for executing Jupyter notebooks; for our purposes, the following contents are enough.
```dockerfile
# Let's extend from the base notebook
# https://github.com/jupyter/docker-stacks/blob/master/base-notebook
FROM jupyter/base-notebook

# Install required packages on top of the base Jupyter image
RUN pip install --no-cache-dir \
    scipy \
    numpy \
    pandas \
    scikit-learn \
    matplotlib \
    tensorflow

# Copy all files (current directory onwards) into the image
COPY . /
```
- Put the above Dockerfile at the root of the directory containing your notebooks. If you want to include data files in the Docker image, keep them alongside your notebooks (since we copy the entire folder into the image).
- Modify the Dockerfile to suit your needs — add/remove packages, pin package versions, set environment variables, and so on. See the reference documentation for writing Dockerfiles.
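For reproducibility, it's worth pinning exact package versions and baking any static environment variables into the image. A sketch of what that could look like — the version numbers and the DATA_DIR variable below are purely illustrative, and /home/jovyan/work is the working directory convention used by the jupyter/docker-stacks images:

```dockerfile
FROM jupyter/base-notebook

# Pin exact versions so every build resolves the same dependencies
# (version numbers are placeholders -- substitute your own)
RUN pip install --no-cache-dir \
    numpy==1.16.4 \
    pandas==0.24.2 \
    scikit-learn==0.21.2

# Static environment variable baked into the image (illustrative)
ENV DATA_DIR=/home/jovyan/work/data

# Copy notebooks and data into the default working directory
COPY . /home/jovyan/work
```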
- Build the Docker image:

```shell
>> docker build .
Sending build context to Docker daemon  600.1kB
Step 1/3 : FROM jupyter/base-notebook
latest: Pulling from jupyter/base-notebook
5b7339215d1d: Pull complete
14ca88e9f672: Pull complete
a31c3b1caad4: Pull complete
...
...
Step 3/3 : COPY ./ ./
 ---> 7e742e969855
Successfully built 7e742e969855
```
- Now that your Docker image is built, you can see it with:

```shell
>> docker images
REPOSITORY    TAG       IMAGE ID        CREATED          SIZE
<none>        <none>    7e742e969855    19 minutes ago   1.36GB
```
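The <none> repository and tag appear because we didn't name the image. Passing -t at build time gives the image a human-readable name, so you don't have to copy image IDs around. A sketch, using my-notebooks as an example name:

```shell
# Build with a name and tag instead of relying on the image ID
docker build -t my-notebooks:latest .

# Run by name
docker run -p 8888:8888 my-notebooks:latest
```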
- Start the Jupyter server with:

```shell
>> docker run -p 8888:8888 7e742e969855
```

The above command runs the image we built earlier and binds the container's Jupyter port 8888 to port 8888 of the host machine. Please note that 7e742e969855 is the image ID in my case; replace it with your own image ID from the build step. Your notebooks are now accessible at http://127.0.0.1:8888 as usual.
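Runtime environment variables (mentioned earlier as a reason to use Docker) can be passed with -e, and mounting your working directory as a volume makes changes made inside the container persist on the host. A sketch — MY_API_KEY is a hypothetical variable, and /home/jovyan/work is the working directory used by the jupyter/docker-stacks images:

```shell
docker run -p 8888:8888 \
    -e MY_API_KEY=secret-value \
    -v "$PWD":/home/jovyan/work \
    7e742e969855
```

Without the volume mount, any edits you make to notebooks inside the container are lost when the container is removed.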
Conclusion
One might wonder what the benefit of dockerized Jupyter is when you can run Jupyter locally without any of the Docker jazz. Fair question. Here are the two main benefits:
- Once created, you can share the Dockerfile along with your notebooks, and anyone (including future you) can run the notebooks simply with docker build + docker run. The entire environment — OS, libraries, data files — will be recreated exactly as intended.
- You can push the Docker image to a registry (e.g. hub.docker.com), and anyone can pull the image and execute the notebooks directly. It also makes it super convenient for you to execute the notebooks on a cloud machine (docker pull + docker run).
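Pushing to Docker Hub, for example, is a tag-and-push. A sketch, where yourusername stands in for your Docker Hub account and 7e742e969855 is the image ID from the build step:

```shell
# Log in to Docker Hub
docker login

# Tag the local image with your registry namespace
docker tag 7e742e969855 yourusername/my-notebooks:latest

# Push it to the registry
docker push yourusername/my-notebooks:latest
```

Anyone can then reproduce the environment with docker run -p 8888:8888 yourusername/my-notebooks:latest.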
As a best practice, always commit these Dockerfiles along with your notebooks to a version control system such as GitHub or GitLab. That's all! If you found this tutorial useful, do check out ReviewNB for your Jupyter notebook code reviews.