GitHub’s pull request system is great for code reviews & collaboration, but working with Git can be tricky if you’re using Jupyter Notebooks.
Notebook diffs can be awkward to parse or unusably long if they contain metadata and plot binaries. Differences in metadata can cause merge conflicts that are difficult to visualize and difficult to fix. And when you are reviewing code in a pull request, the notebook diffs are a mess of raw notebook code snippets, making them near impossible to meaningfully comment on inline the way you can with code files. Git just doesn’t work well for notebooks!
Thankfully, there are tools to streamline your Git-Jupyter workflow. We’ll take a look at four of them: nbdev, nbdime, Jupytext, and ReviewNB.
nbdev solves two of the biggest Git-Jupyter problems:
- It strips unnecessary metadata from the notebook, thereby reducing diff noise.
- It helps to visualize and resolve Git merge conflicts for notebooks.
The cleaning step strips out unnecessary metadata and Python reprs from your notebook, including cell run order counts, which Python kernel was used to run the notebook, and memory addresses of Matplotlib plots. These can all change from notebook run to notebook run, and may differ across development environments.
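To make the idea of cleaning concrete, here is a minimal sketch using only the Python standard library. It is an illustration of the kind of fields described above, not nbdev's actual implementation; the function name `strip_volatile_fields` and the sample notebook are hypothetical.

```python
import copy
import json
import re

# Matches the memory address part of reprs such as "<Line2D at 0x7f3a2c1b4d90>".
ADDRESS = re.compile(r" at 0x[0-9a-fA-F]+")


def strip_volatile_fields(notebook: dict) -> dict:
    """Return a copy of a notebook dict without run-specific fields.

    Hypothetical helper for illustration; nbdev's own cleaning logic differs.
    """
    cleaned = copy.deepcopy(notebook)
    # The kernel used to run the notebook can differ between environments.
    cleaned.get("metadata", {}).pop("kernelspec", None)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        # Cell run order counts change every time the notebook is re-run.
        cell["execution_count"] = None
        for output in cell.get("outputs", []):
            if "execution_count" in output:
                output["execution_count"] = None
            text = output.get("data", {}).get("text/plain")
            # Notebook JSON stores text either as a list of lines or a string.
            if isinstance(text, list):
                output["data"]["text/plain"] = [ADDRESS.sub("", t) for t in text]
            elif isinstance(text, str):
                output["data"]["text/plain"] = ADDRESS.sub("", text)
    return cleaned


notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["plt.plot([1, 2, 3])"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "metadata": {},
                    "data": {
                        "text/plain": ["[<matplotlib.lines.Line2D at 0x7f3a2c1b4d90>]"]
                    },
                }
            ],
        }
    ],
}

cleaned = strip_volatile_fields(notebook)
print(json.dumps(cleaned["cells"][0]["outputs"][0]["data"]))
```

Two runs of the same notebook on different machines would produce identical JSON after a pass like this, which is what keeps the diff quiet.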
Merge conflicts with Jupyter Notebooks can be caused by code or metadata, and can sometimes leave a notebook in such a state that it is no longer a valid notebook and can’t be opened as one. If you can’t open a notebook as a notebook, it’s tricky to fix a merge conflict. nbdev automatically tries to fix harmless merge conflicts (for example, metadata changes). If nbdev can’t auto-merge, it will show the Git conflict marker in a valid notebook JSON format, so you can manually resolve merge conflicts inside the Jupyter IDE.
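The reason a conflicted notebook won't open is that Git writes its conflict markers straight into the JSON. The sketch below shows this with a made-up, heavily shortened notebook file, followed by a deliberately naive resolution that keeps the local side of a metadata-only conflict. The helper names are hypothetical and this is not how nbdev resolves conflicts internally.

```python
import json

# A shortened notebook file as Git might leave it after a metadata conflict.
conflicted = '''{
 "cells": [],
<<<<<<< HEAD
 "metadata": {"kernelspec": {"name": "python3"}},
=======
 "metadata": {"kernelspec": {"name": "conda-env"}},
>>>>>>> feature-branch
 "nbformat": 4,
 "nbformat_minor": 5
}'''


def is_valid_json(text: str) -> bool:
    """Report whether the text parses as JSON at all."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def keep_our_side(text: str) -> str:
    """Drop conflict markers and the incoming side, keeping the local side."""
    kept, in_theirs = [], False
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            continue
        if line.startswith("======="):
            in_theirs = True
            continue
        if line.startswith(">>>>>>>"):
            in_theirs = False
            continue
        if not in_theirs:
            kept.append(line)
    return "\n".join(kept)


resolved = keep_our_side(conflicted)
print(is_valid_json(conflicted), is_valid_json(resolved))
```

Picking one side blindly is only safe when the conflict is harmless, as with the kernel name here. A conflict in cell source still needs a human to decide.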
nbdime is a dedicated Git diff and merge tool for Jupyter Notebooks. Like nbdev, nbdime tackles the metadata problem in Git diffs and merges. However, instead of cleaning the notebook metadata, nbdime changes the diffs themselves to give more useful output.
The terminal output of the nbdime diff is composed of a neat, cell-based list that shortens binary image diffs to a single line, making command-line Jupyter diffs usable (unlike a direct Git diff, which is rendered unusable by metadata diffs and long plot diffs).
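As a rough picture of what a cell-based diff looks like, here is a small standard-library sketch. It compares two notebooks cell by cell and collapses any PNG output to a one-line placeholder. This is a simplified illustration of the idea, not nbdime's algorithm; the function names are hypothetical, and the sketch assumes both notebooks have the same number of cells in the same order.

```python
import difflib


def summarize_output(output: dict) -> str:
    """Describe one cell output in a single line."""
    data = output.get("data", {})
    if "image/png" in data:
        # Replace the base64 payload with a short placeholder.
        return f"<image/png, {len(data['image/png'])} base64 chars>"
    # Rich outputs keep text under data; stream outputs keep it under "text".
    return "".join(data.get("text/plain", output.get("text", "")))


def cell_lines(cell: dict) -> list[str]:
    """Flatten a cell into source lines plus one summary line per output."""
    lines = "".join(cell.get("source", "")).splitlines()
    lines += [f"output: {summarize_output(o)}" for o in cell.get("outputs", [])]
    return lines


def diff_cells(old: dict, new: dict) -> list[str]:
    """List changed lines grouped by cell index (cells are paired by position)."""
    report = []
    for index, (a, b) in enumerate(zip(old["cells"], new["cells"])):
        changes = [
            line
            for line in difflib.ndiff(cell_lines(a), cell_lines(b))
            if line[:2] in ("- ", "+ ")  # skip unchanged lines and "? " hints
        ]
        if changes:
            report.append(f"cell {index}:")
            report.extend("  " + line for line in changes)
    return report


def plot_cell(code: str, png: str) -> dict:
    """Build a code cell with a single image output (sample data only)."""
    return {
        "cell_type": "code",
        "source": ["import matplotlib.pyplot as plt\n", code],
        "outputs": [{"output_type": "display_data", "data": {"image/png": png}}],
    }


old = {"cells": [plot_cell("plt.plot([1, 2, 3])", "iVBOR" + "A" * 5000)]}
new = {"cells": [plot_cell("plt.plot([1, 4, 9])", "iVBOR" + "B" * 6000)]}

report = diff_cells(old, new)
print("\n".join(report))
```

A raw Git diff of the same change would include thousands of base64 characters. Here the plot shows up as one removed line and one added line.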
But the biggest selling point of nbdime is its web-based diffs, which open in a browser with the two notebooks displayed side by side. Differences — including differences in output — are highlighted so that notebook changes are easier to evaluate.
Installing nbdime will add a git nbdiff button to your Jupyter toolbar so you can check your diff with the click of a button from inside your notebook. What a pleasure!
Jupytext takes a different approach to the Git-Jupyter problem. Instead of changing the notebook in any way, it creates a new representation of the same notebook in a Git-friendly format (like Markdown, for example). Jupytext maintains a two-way sync between the notebook & its corresponding Markdown file, so any changes to the notebook are reflected in the Markdown file and vice versa.
You can check diffs in the Markdown version of your notebook since Markdown versions remove most of the metadata. Markdown diffs are visually easy to review in GitHub pull requests, and you can comment on them inline the same way as with other code files.
One drawback of the Jupytext method is that the Markdown files remove cell output. So if you want to check how a plot has changed, for example, or the contents of a Pandas DataFrame, Jupytext can’t help you.
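The trade-off is easy to see in a small sketch. The function below turns a notebook dict into Markdown text by writing only the cell sources, so the output never makes it into the file. It is a simplified stand-in written with the standard library, not Jupytext's converter, and the Markdown that Jupytext itself writes may carry extra header information that this sketch leaves out.

```python
# Built programmatically so the backticks don't clash with this code block.
FENCE = "`" * 3


def notebook_to_markdown(notebook: dict, language: str = "python") -> str:
    """Render a notebook as Markdown, keeping sources and dropping outputs.

    Hypothetical helper for illustration only.
    """
    parts = []
    for cell in notebook["cells"]:
        source = "".join(cell.get("source", ""))
        if cell["cell_type"] == "markdown":
            parts.append(source)
        elif cell["cell_type"] == "code":
            # Only the source is written; cell["outputs"] is never read.
            parts.append(f"{FENCE}{language}\n{source}\n{FENCE}")
    return "\n\n".join(parts) + "\n"


notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Sales analysis"]},
        {
            "cell_type": "code",
            "source": ["df.head()"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["  region  revenue\n", "0  north     1200"]},
                }
            ],
        },
    ]
}

markdown = notebook_to_markdown(notebook)
print(markdown)
```

The resulting text diffs cleanly line by line, but the DataFrame preview is gone, so a reviewer sees that `df.head()` was called and nothing about what it returned.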
Installing Jupytext adds a new Jupytext option in the File menu of your Jupyter Notebook. You can choose to pair your notebook with Markdown as well as a list of other formats, including Python scripts.
ReviewNB tackles the problem of doing Git pull requests on notebooks from the opposite direction. Unlike the other tools here, which run locally, ReviewNB is a SaaS solution that integrates directly with online version control platforms like GitHub & Bitbucket.
You can push your notebook to GitHub and then launch the ReviewNB app from within your pull request. This will take you to a visual diff interface where you can compare notebooks side-by-side and add comments inline. Comments will sync back to GitHub, where they will be part of your PR review, and your colleagues can add their comments in return. This allows for thorough, line-by-line pull request reviews on Jupyter Notebooks.
Each of the tools we’ve looked at here brings something different to the table. You may find that the best solution for you is to use a couple of these tools together.
| |nbdev|nbdime|Jupytext|ReviewNB|
|---|---|---|---|---|
|Helps with local diffs|✅|✅|✅| |
|Helps with Git merge|✅|✅| | |
|Preserves cell output|✅|✅| |✅|
|Provides online diff for GitHub PRs / commits| | | |✅|
|Converts Notebook to Markdown| | |✅| |
nbdev uses hooks to automate cleaning your notebooks of metadata and provides notebook-friendly Git merging functionality. If you have several people working within the same Jupyter Notebook and merge conflicts are likely, this tool will make your life easier.
For better local diffs of your unstaged notebook changes, nbdime is the tool you need. If you are working on some analysis, for example, and want to check all of the changes you have made before sharing the analysis or pushing the updated work to GitHub, the nbdime visual diffs are a good solution.
Jupytext is a good first-order solution for collaborating on notebooks: save your notebook as Markdown, push to GitHub, and your team can check and comment on your notebook in Markdown format. You will lose your outputs, though. In some cases, that’s OK, and if you are working with sensitive data, having outputs automatically removed might even be a bonus.
Then we have ReviewNB, which provides convenient visual diffs in a web app, keeps the outputs, and allows inline commenting. For collaborative work on Jupyter Notebooks, ReviewNB is a great option.