Reproducible Jupyter Notebooks with Docker

Reproducing the computational steps in your own or somebody else’s notebook is fraught with peril, primarily because a notebook has no way to capture environment information (OS, dependencies, etc.). In this blog post, we are going to show you how to capture that environment information in a Docker image and how to run notebooks as a Docker container.

When to use docker with Jupyter

  • If your notebook relies on specific Python packages
  • If your notebook has OS-level dependencies, e.g. C libraries
  • If you want to package data files as part of the executable environment
  • If you’d like to specify some environment variables (static or runtime)
  • If you need to run your notebook on cloud machines

If the answer to any of the above is yes, you should consider packaging your notebooks as a Docker image.

Docker Overview

Today, Docker containers are the de facto standard for running software in a fully specified environment, right down to the OS. Before we proceed, here are the basic building blocks of the Docker ecosystem you need to understand:

  • Image: A Docker image is the executable package that contains the complete environment, including the OS, all the files, installed libraries, and so on.
  • Container: A running instance of an image is called a Docker container.
  • Registry: A place to store and share Docker images (e.g. Docker Hub).
  • Dockerfile: A text file that specifies how the Docker image should be built: which libraries, base OS, files, environment variables, and so on the image should contain.

Prerequisites

Install Docker locally (Docker Desktop on macOS/Windows, Docker Engine on Linux).

Steps to dockerize your notebooks

Let’s say you authored a notebook and it’s running fine locally. Now we’ll go through the steps to run it as a docker container.

  1. First we need to create a Dockerfile. There are ready-to-use Dockerfiles for executing Jupyter notebooks in the jupyter/docker-stacks project. For our purposes, the following contents are enough.

     # Let's extend from the base notebook
     # https://github.com/jupyter/docker-stacks/blob/master/base-notebook
     FROM jupyter/base-notebook
    
     # Install required packages on top of base Jupyter image
     RUN pip install --no-cache-dir \
       scipy \
       numpy \
       pandas \
       scikit-learn \
       matplotlib \
       tensorflow
    
     # Copy all files (current directory onwards) into the image, into the
     # notebook user's home (jupyter/base-notebook runs as user jovyan)
     COPY --chown=jovyan:users . /home/jovyan/
    
  2. Save the above as a file named Dockerfile at the root of the directory containing your notebooks. If you want to include data files in the Docker image, keep them alongside your notebooks (since we copy the entire folder into the image).

  3. Modify the Dockerfile to suit your needs: add or remove packages, pin package versions, set environment variables, and so on. See the Dockerfile reference documentation for the full syntax.
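
     For example, pinning exact package versions and setting a static environment variable keeps rebuilds reproducible even years later. A minimal sketch (the version numbers and DATA_DIR variable below are illustrative, not recommendations):

     # Same base image as before
     FROM jupyter/base-notebook

     # Pin exact versions so every rebuild installs the same packages
     # (versions shown are examples; use the ones your notebook was tested with)
     RUN pip install --no-cache-dir \
       numpy==1.24.4 \
       pandas==2.0.3 \
       scikit-learn==1.3.0

     # A static environment variable, available to the notebook at runtime
     ENV DATA_DIR=/home/jovyan/data

     # Copy notebooks and data into the notebook user's home
     COPY --chown=jovyan:users . /home/jovyan/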

  4. Build the docker image

     >> docker build .
     Sending build context to Docker daemon  600.1kB
     Step 1/3 : FROM jupyter/base-notebook
     latest: Pulling from jupyter/base-notebook
     5b7339215d1d: Pull complete
     14ca88e9f672: Pull complete
     a31c3b1caad4: Pull complete
     ...
     ...
     Step 3/3 : COPY ./ ./
     ---> 7e742e969855
     Successfully built 7e742e969855
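
     A side note on the build command: building without a tag leaves the image unnamed, which is why it shows up as <none> in the next step. You can give it a human-readable name with the -t flag (my-notebooks is just an example name):

     >> docker build -t my-notebooks:latest .
     >> docker run -p 8888:8888 my-notebooks:latest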
    
  5. Now that your Docker image is built, you can list it with,

     >> docker images
     REPOSITORY                   TAG                 IMAGE ID            CREATED             SIZE
     <none>                       <none>              7e742e969855        19 minutes ago    1.36GB
    
  6. You can start the Jupyter server with,

     >> docker run -p 8888:8888 7e742e969855
    

    The above command runs the image we built earlier and binds the container’s Jupyter port 8888 to port 8888 of the host machine we are running this command on. Please note that 7e742e969855 is the image ID on my machine; replace it with your own image ID from step 4.

  7. Your notebooks are now accessible at http://127.0.0.1:8888 as usual (use the token printed in the container logs to log in).
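
     One caveat: changes you make inside the container live only in the container’s filesystem and are lost when the container is removed. If you want edits written back to your host copy of the notebooks, mount your notebook directory as a volume (the /home/jovyan/work path is the work folder used by the jupyter/base-notebook image):

     >> docker run -p 8888:8888 -v "$(pwd)":/home/jovyan/work 7e742e969855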

Conclusion

One might wonder: what’s the benefit of running dockerized Jupyter when you can run Jupyter locally without any of the Docker jazz? Fair question. Here are the two main benefits:

  • Once created, you can share the Dockerfile along with your notebooks, and anyone (including future you) can run the notebooks simply with docker build + docker run. The entire environment, including the OS, libraries, and data files, will be recreated exactly as intended.

  • You can push the Docker image to a registry (e.g. hub.docker.com), and anyone can pull the image and execute the notebooks directly. It also makes it super convenient for you to execute the notebooks on a cloud machine (docker pull + docker run).
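
Pushing to a registry is just a tag-and-push sequence. A sketch, assuming a Docker Hub account (your-username and my-notebooks are placeholders for your own names):

   >> docker login
   >> docker tag 7e742e969855 your-username/my-notebooks:latest
   >> docker push your-username/my-notebooks:latest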

As a best practice, always commit these Dockerfiles along with your notebooks to a version control system such as GitHub or GitLab. That’s all! If you find the tutorial useful, do check out ReviewNB for your Jupyter notebook code reviews.