At ReviewNB, we’ve studied hundreds of teams successfully collaborating on Jupyter Notebooks via GitHub Pull Requests. More than 100,000 notebooks have been reviewed on our platform by cutting-edge data science teams from Amazon, Microsoft, Google, NASA JPL and many others.
While Git is a proven way of collaborating on software projects, there are some rough edges when it comes to using Git with Jupyter Notebooks. In this article, we’re going to share tools, tips & workflows used by the best data science teams in the world to collaborate on their Jupyter Notebooks.
Notebook Collaboration Workflow
Since notebooks contain rich media (images, plots, audio, widgets), it’s very effective to collaborate directly on notebooks. Teammates can ask clarifying questions, suggest improvements or offer feedback in the context of a notebook cell. This type of collaboration is difficult on a simple .py file on GitHub as it doesn’t contain any outputs.
Notebook Usage Patterns
We’ve seen teams mainly use notebooks in one of the following ways:
Scratchpad Mode
An individual user will use notebooks to run some rough analysis or explore data on their own machine. There’s no sharing, collaboration or version control. It’s good for quick one-off analysis, but anything learned has to be rewritten in another format for sharing (doc, pdf, .py etc.).
Stripped Mode
Some teams strip all the outputs from notebooks or convert notebooks to markdown so as to make the notebook “git friendly”. It’s a poor collaboration practice since the power of notebooks comes from their ability to hold markdown, code and output all in one place. If you’re not going to store outputs, why even use notebooks? You might as well move to plain Python files with comments in them.
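To make concrete what stripping throws away, here is a minimal sketch using only the Python standard library. The notebook below is a hand-written, simplified nbformat 4 structure and `strip_outputs` is a hypothetical helper written for this illustration; it is not the implementation of any particular stripping tool.

```python
import copy
import json

# A tiny hand-written notebook (simplified nbformat 4 JSON) with one executed code cell.
NOTEBOOK = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Sales analysis"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "source": ["total = 19 + 23\n", "total"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "metadata": {},
                    "data": {"text/plain": ["42"]},
                }
            ],
        },
    ],
}


def strip_outputs(notebook):
    """Return a copy of the notebook with outputs and execution counts removed."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


stripped = strip_outputs(NOTEBOOK)

# The code survives, but the result a reviewer would want to see is gone.
print(json.dumps(stripped["cells"][1], indent=1))
```

After stripping, the code cell still contains its source, but a reviewer can no longer see what the code produced without re-running it.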
Full Notebook Mode
Many teams will push the entire notebook file to GitHub along with all the outputs and rich media in it. They use specialized tools like nbdime & ReviewNB to look at rich notebook diffs. In this mode, the notebook code review becomes much more natural & in-depth as users can see plots, graphs, tables, images & play with interactive widgets inline. These teams can then employ the same software engineering rigour (pull request approval for merge-to-master) in data science projects.
Once an org embraces the full notebook format, it tends to start using more purpose-built tools & libraries (e.g. Papermill, Voila) to get the most out of its notebooks. E.g. Netflix is running 150,000+ daily jobs via notebook-based execution. We’ve seen these types of orgs be ahead of the curve in terms of technology prowess & adoption. They also get the highest return on their investment in Jupyter notebooks.
Tools of the trade
nbdime provides tools for diff’ing & merging notebooks in your local environment. You can run the nbdiff or nbdiff-web commands to see notebook diffs on the command line or in a web browser respectively. nbmerge supports three-way merge of notebooks with automatic conflict resolution. Some points to note regarding nbdime:
- It can’t render a pull request diff (nbdiff only supports commits or direct file names)
- There’s no way to write comments or provide feedback (to be fair, nbdime was not built for this purpose)
- nbdime is great for local notebook diff’ing & is the only tool out there to resolve notebook merge conflicts
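As a rough sketch of how these commands might be scripted, the snippet below builds the command lines for nbdiff, nbdiff-web and nbmerge from Python. The helper functions and file names are hypothetical, made up for this illustration; only the command names and their positional notebook arguments come from nbdime. Nothing is executed unless you call `run_if_installed` yourself with notebooks that exist on disk.

```python
import shutil
import subprocess


def nbdiff_command(base, remote, web=False):
    """Build the argv for a terminal (nbdiff) or browser (nbdiff-web) diff."""
    tool = "nbdiff-web" if web else "nbdiff"
    return [tool, base, remote]


def nbmerge_command(base, local, remote):
    """Build the argv for a three-way merge of base, local and remote notebooks."""
    return ["nbmerge", base, local, remote]


def run_if_installed(argv):
    """Run the command only when the tool is on PATH; return None otherwise."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True)


# Hypothetical file names, used only to show the shape of each command.
diff_argv = nbdiff_command("analysis_v1.ipynb", "analysis_v2.ipynb")
web_argv = nbdiff_command("analysis_v1.ipynb", "analysis_v2.ipynb", web=True)
merge_argv = nbmerge_command("base.ipynb", "mine.ipynb", "theirs.ipynb")

for argv in (diff_argv, web_argv, merge_argv):
    print(" ".join(argv))
```

Keeping command construction separate from execution makes the wrapper easy to check without nbdime installed, and `run_if_installed` simply returns None on machines where the tool is missing.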
JupyterLab Git Extensions
The following JupyterLab extensions are useful for notebook version control. You can install them in your local JupyterLab.
- jupyterlab-git can be used to browse repositories, look at visual diffs of changed files, and push your commits
- GitPlus can be used to push commits and create pull requests on GitHub directly from the JupyterLab UI
ReviewNB
ReviewNB is a GitHub App that shows visual diffs for any notebook commit or pull request. You can write comments on a specific notebook cell to give feedback or ask your teammates questions.
It’s free for open source repositories but requires a paid plan for private repositories. The ReviewNB app has been verified by GitHub & approved for sale on GitHub Marketplace.
We saw how teams use notebooks in different modes: Scratchpad, Stripped, and Full Notebook. We also learned about the most common tools used for Jupyter notebook & GitHub integration: nbdime, ReviewNB & JupyterLab extensions. We hope this helps you choose the right workflow & tools for your team.