Jupyter notebooks are fantastic in many ways, but collaborating on them is not easy. In this article, we’ll look at the tools you can use to make notebooks play nicely with modern version control systems like git!
Why is Jupyter version control so hard?
The software world has converged on git as its version control tool of choice. Git is designed to work primarily with human-readable text files. A Jupyter notebook, by contrast, is a rich JSON document with source code, markdown, HTML, and images all rolled into a single .ipynb file.
Git doesn’t handle rich documents like notebooks very well. For example, a git merge of a long, nested JSON document is practically impossible for a human to resolve, and the git diff for a base64-encoded image string is horrible (shown below).
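To make the problem concrete, here is a minimal sketch using only Python's standard library. It builds two copies of a tiny notebook-shaped dictionary (the `make_notebook` and `line_diff` helpers are made up for this illustration, and the layout is a simplified version of the real .ipynb format) and runs a plain line-based diff over the serialized JSON, which is roughly what a text-oriented tool sees.

```python
import difflib
import json


def make_notebook(execution_count, output_text):
    """Build a minimal notebook-shaped dict with a single code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {
                        "name": "stdout",
                        "output_type": "stream",
                        "text": [output_text],
                    }
                ],
                "source": ["print(total)"],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def line_diff(before, after):
    """Return only the added/removed lines of a text diff of the JSON."""
    a = json.dumps(before, indent=1, sort_keys=True).splitlines()
    b = json.dumps(after, indent=1, sort_keys=True).splitlines()
    return [
        line
        for line in difflib.unified_diff(a, b, lineterm="", n=0)
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Same source code, simply re-executed: the counter and the output differ.
first_run = make_notebook(1, "41\n")
second_run = make_notebook(7, "42\n")
changed = line_diff(first_run, second_run)
print("\n".join(changed))
```

Nobody edited the code in the cell, yet the diff still reports changes in the execution counter and the stored output. Scale that up to dozens of cells with embedded images and the plain-text diff quickly stops being readable.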
What’s required from notebook version control?
Here’s what we need from a modern version control system:
- Ability to create checkpoints / commits
- Quickly checkout any of the past notebook versions
- See what changed from one version to another (a.k.a. visual diff for notebooks)
- Let multiple people work on a single notebook, with easy merge conflict resolution
- Ability to provide feedback & ask questions about a specific notebook cell
That’s our wishlist! This blog post introduces the important tools that can help you achieve it.
Disclaimer: I’m the author of two of the tools listed below (ReviewNB & GitPlus) but this is an unbiased review of all the useful tools in this space.
nbdime is an open-source library for diffing and merging notebooks locally. You can set it up to work with your local git client so that the git diff and git merge commands use nbdime for .ipynb files. With nbdime you can:
- Run git diff to see how a notebook has changed before committing
- Easily merge remote changes with your locally edited notebook
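The idea behind a notebook-aware diff can be sketched in a few lines of Python. This is not nbdime's actual algorithm, just an illustration under simplifying assumptions: the hypothetical helpers `cell_sources` and `changed_cells` compare notebooks cell by cell and ignore outputs and execution counters entirely. For the real git integration, nbdime's documentation describes an `nbdime config-git --enable` command.

```python
import difflib


def cell_sources(notebook):
    """Return each cell's source as one string, ignoring outputs and counters."""
    return ["".join(cell.get("source", [])) for cell in notebook.get("cells", [])]


def changed_cells(before, after):
    """List (operation, old_sources, new_sources) for cells whose source differs."""
    old, new = cell_sources(before), cell_sources(after)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


before = {
    "cells": [
        {"cell_type": "code", "source": ["x = 1\n"], "execution_count": 1, "outputs": []},
        {
            "cell_type": "code",
            "source": ["print(x)"],
            "execution_count": 2,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1\n"]}],
        },
    ]
}
after = {
    "cells": [
        {"cell_type": "code", "source": ["x = 2\n"], "execution_count": 5, "outputs": []},
        {
            "cell_type": "code",
            "source": ["print(x)"],
            "execution_count": 6,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        },
    ]
}

result = changed_cells(before, after)
print(result)
```

Only the first cell is reported as changed, even though the execution counters and the second cell's output also differ. Working at the level of cells rather than raw JSON lines is what makes notebook diffs and merges manageable.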
JupyterLab Git Extensions
The following JupyterLab extensions are useful for notebook version control. You can install them in your local JupyterLab.
- jupyterlab-git can be used to browse repositories, look at visual diffs of changed files, and push your commits
- GitPlus can be used to push commits and create pull requests on GitHub directly from the JupyterLab UI
SageMaker is a managed service from AWS that gives you access to hosted JupyterLab. It integrates with GitHub repositories so you can clone your public or private repositories into the SageMaker instance. It uses the jupyterlab-git extension so you can commit your notebooks to GitHub.
With SageMaker you can spin up a powerful EC2 instance with a few clicks to train your models. The downside is that you are always using expensive cloud compute, even for tasks that can easily be done on your local machine, such as exploring data or editing notebooks. A healthy balance of local Jupyter sessions with sparing SageMaker usage when you really need powerful cloud compute is ideal for most people.
Another popular option is Google Colab. It has decent GitHub integration that lets you open a specific notebook in a GitHub repository. You can also commit any changes back to the repository.
Colab provides limited free GPU usage, and you can upgrade to Colab Pro for higher usage limits. Some of the limitations are:
- Colab Pro is a paid offering but does not provide any resource guarantee in terms of GPU time & type.
- Everything is centered around a single notebook file; you do not clone the entire repository or have access to the local file system. Colab seems to suggest using Google Drive as your virtual file system, which can slow down training (fetching data files over a network call vs. having them available locally on the disk).
ReviewNB is a GitHub App that shows visual diffs for any notebook commit or pull request. Additionally, you can write comments on a specific notebook cell to provide feedback or ask questions to your teammates.
It’s free for open source repositories but requires a paid plan for private repositories. The ReviewNB app has been verified by GitHub & approved for sale on the GitHub Marketplace.
No single tool fits every need when it comes to Jupyter notebook version control & collaboration. But your team can leverage the following purpose-built tools to have a solid notebook workflow:
- GitHub to store notebooks
- jupyterlab-git extension for git commands & rich local diffs
- GitPlus extension to create pull requests
- ReviewNB for notebook code reviews
- AWS SageMaker if & when you need to run notebooks on a large cloud instance
That’s all for now! Happy Hacking!