Data scientists and analysts worldwide breathed a sigh of relief at the recent news that GitHub is introducing rich diffs for Jupyter Notebooks. The feature is not yet generally available, so if you want to try it out you can sign up here for early access.
Version control and code review of Jupyter Notebooks are notoriously tricky. One reason is that the under-the-hood JSON format of Jupyter Notebooks makes it non-trivial to compare them, so the standard version-control process of checking diffs doesn’t work well. Command-line notebook diffs are awful to work with, and prior to this new feature, GitHub diffs for notebooks were equally bad.
The GitHub diffs were just raw diffs of the underlying notebook JSON—nearly impossible to meaningfully compare or comment on. The new rich diff feature changes this, with context-aware diffs that display differences in code in notebook cells and differences in cell output. This addresses a real challenge faced by Jupyter Notebook users, making notebooks much easier to collaborate on.
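To make the raw-JSON problem concrete, here is a minimal sketch that builds two versions of a one-cell notebook and diffs their serialized JSON with Python’s standard `difflib`. The notebook dictionaries are simplified, hand-written stand-ins for real `.ipynb` files (an assumption for illustration), and the helper names `make_notebook` and `raw_json_diff` are hypothetical.

```python
import difflib
import json


def make_notebook(source, output_text, execution_count):
    """Build a simplified, nbformat-4-style notebook dict with one code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {
                        "name": "stdout",
                        "output_type": "stream",
                        "text": [output_text],
                    }
                ],
                "source": [source],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


def raw_json_diff(old_nb, new_nb):
    """Return a unified diff of the two notebooks' serialized JSON, line by line."""
    old_lines = json.dumps(old_nb, indent=1, sort_keys=True).splitlines()
    new_lines = json.dumps(new_nb, indent=1, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, "old.ipynb", "new.ipynb", lineterm="")
    )


# One edited line of code, re-run later in the session.
old = make_notebook("print(1 + 1)", "2\n", 1)
new = make_notebook("print(1 + 2)", "3\n", 7)

diff_lines = raw_json_diff(old, new)
print("\n".join(diff_lines))
```

Even in this tiny case, a single code edit shows up as three separate JSON changes (the source, the stored output, and the execution counter), all wrapped in brackets and quoting that a reviewer has to read past.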
If you don’t already have your notebook on GitHub, create a repository for yourself and push your notebook to it. You can follow the instructions here if you are new to GitHub.
We want to explore the new diff functionality, so once you have pushed your notebook, merge it to your main branch. Next, create a new branch, make a change to your notebook, push it to GitHub, and create a pull request. If you haven’t discovered it yet, the JupyterLab Git extension provides a built-in interface for interacting with GitHub. You can use it to manage your branching and pull-request flow from within JupyterLab.
Now check the Files changed tab of your pull request to see the diff between your original notebook and the changes you made. If you have signed up for the notebook rich diff feature, you will see a diff that looks something like the one in this pull request:
We have a helpful diff!
The GitHub rich diff displays code changes grouped by the cells the code is in. Figures that have changed are rendered as part of the rich diff and shown side by side, so it’s easy to compare them visually.
The diff also sensibly hides metadata. The metadata and output diffs are collapsible, so if you do want to see them you can. But by default, metadata diffs are collapsed because we aren’t usually interested in them.
All of this is built into the usual GitHub pull request interface, so you don’t need any other setup or tooling to access it.
This is a big jump in functionality from the awkward raw JSON diffs that GitHub used to provide. These diffs can be meaningfully used to review your pull requests.
The not-so-good
Unfortunately, there are still a few drawbacks to the GitHub rich diff functionality:
- You can’t comment on diffs.
- Large notebook diffs sometimes don’t render.
- Matplotlib memory address output causes unnecessary diff content.
- Interactive components don’t render.
GitHub hasn’t provided a way to comment on the diffs, so that aspect of reviews is still awkward. Reviewing a colleague’s work often calls for line-by-line comments, so this missing functionality is a big minus. We suggest a couple of workarounds here.
The rich diffs can also fail on large notebooks, even if the actual differences you are reviewing are small. So if you are working in large notebooks, the GitHub diffs won’t be of much use.
Matplotlib memory addresses
Being able to see figure differences side by side is a big win for reviewing notebooks. But Matplotlib memory addresses, which typically change between runs, are still an annoyance. The changed address in a cell’s text output is interpreted as part of the diff, leading to diff blocks for plots you haven’t actually changed.
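One possible way to reduce this noise before committing is to mask the addresses in the notebook’s stored text output. The sketch below is an illustration under stated assumptions, not something GitHub or Matplotlib provides: the function name `mask_addresses` and the placeholder `0x...` are hypothetical, and the sample notebook is a hand-written, simplified dictionary.

```python
import re

# Matches the address part of reprs such as "<matplotlib.lines.Line2D at 0x7f8b1c2d3a90>".
ADDRESS_PATTERN = re.compile(r" at 0x[0-9a-fA-F]+")
PLACEHOLDER = " at 0x..."  # hypothetical placeholder, chosen for this example


def mask_addresses(notebook):
    """Replace memory addresses in text/plain outputs with a fixed placeholder.

    `notebook` is a dict as loaded from an .ipynb file with json.load.
    It is modified in place and also returned.
    """
    for cell in notebook.get("cells", []):
        for output in cell.get("outputs", []):
            data = output.get("data", {})
            text = data.get("text/plain")
            if text is None:
                continue  # e.g. stream outputs, which have no "data" field
            if isinstance(text, str):
                data["text/plain"] = ADDRESS_PATTERN.sub(PLACEHOLDER, text)
            else:
                # Multi-line text is stored as a list of strings.
                data["text/plain"] = [ADDRESS_PATTERN.sub(PLACEHOLDER, line) for line in text]
    return notebook


# Simplified stand-in for a notebook with one plotting cell.
sample = {
    "cells": [
        {
            "cell_type": "code",
            "source": ["plt.plot(x, y)"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["[<matplotlib.lines.Line2D at 0x7f8b1c2d3a90>]"]},
                }
            ],
        }
    ]
}

masked = mask_addresses(sample)
print(masked["cells"][0]["outputs"][0]["data"]["text/plain"][0])
```

In a live notebook, ending the plotting line with a semicolon or finishing the cell with `plt.show()` usually keeps this text output from being stored in the first place.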
|Rich Notebook Diffs||✅||✅|
|Jupyter Notebook Commenting Support||✅|
|Renders Large Notebook Diffs||✅|
|Interactive (HTML/JS) Output Rendering||✅|
|Shows Metadata Changes||✅|
As happy as we are about the built-in GitHub rich diffs, for a fully featured notebook diff experience, the best option is still ReviewNB. ReviewNB integrates directly with your GitHub repositories, lets you comment on notebooks inline, doesn’t fail for large notebook diffs, and renders your interactive visualizations. For seamless collaboration on versioned Jupyter Notebooks, ReviewNB is the solution.