How to use Git / GitHub with Jupyter Notebook

This article is Git 101 for Jupyter users who are not familiar with Git / GitHub. It’s a hands-on tutorial & is meant to be comprehensive. Feel free to skip a section if you are already familiar with it. By the end, you’ll be able to:

  • Push your notebooks to a GitHub repository in the cloud
  • Start versioning your notebooks + learn how to revert to a specific notebook version
  • Get feedback & discuss notebook changes with your peers
  • Easily share your notebooks for others to view

Create GitHub Account

If you don’t have a GitHub account, please create one at github.com.

Setup Git
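
The setup steps are not spelled out here, so the following is a minimal sketch of a typical first-time configuration, assuming Git is already installed. The name and email address are placeholders, not values from this article.

```shell
# Confirm that Git is installed and on your PATH
git --version

# Tell Git who you are; this identity is recorded in every commit you make.
# "Your Name" and "you@example.com" are placeholders: substitute your own.
git config --global user.name "Your Name"
git config --global user.email "you@example.com"

# Read the values back to check that they were saved
git config --global user.name
git config --global user.email
```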

Create New Repository

A GitHub repository is like your supercharged folder in the cloud. You can store files (notebooks, data, source code), look at historical changes to these files, open issues, discuss changes and much more. People typically create one repository per project.

Let’s go ahead & create a repository on GitHub. Once it is created, you’ll land on a page that shows the repository URL; copy it.

Clone Repository

Let’s clone the GitHub repository onto our machine by running the following in the terminal. It creates a projectA directory on our machine that is linked to the amit1rrr/projectA repository on GitHub.

    >> git clone https://github.com/amit1rrr/projectA.git
    Cloning into 'projectA'...
    warning: You appear to have cloned an empty repository.
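
To see what “linked” means, you can list the remotes of a clone. The sketch below is self-contained so it can be tried anywhere: it clones a local bare repository that stands in for GitHub (an assumption made purely for illustration). Inside your real projectA clone, only the last command is needed.

```shell
# Work in a throwaway directory
cd "$(mktemp -d)"

# A bare repository acting as a local stand-in for the GitHub repository
git init -q --bare projectA.git

# Clone it, just like cloning from a GitHub URL
git clone projectA.git projectA
cd projectA

# List the remotes: "origin" points at the place you cloned from
git remote -v
```

In a clone made from GitHub, origin shows the https://github.com/... URL instead of a local path. This is the remote that the git push commands later in the article send commits to.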

Push Notebooks to GitHub

Our repository is empty right now, so let’s push some notebooks to it. We copy two notebooks to the directory where we cloned the projectA repository:

    >> cp /some/path/analysis1.ipynb /path/of/projectA/
    >> cp /some/path/scratch.ipynb /path/of/projectA/

Let’s say we want to push analysis1.ipynb to GitHub. We first need to tell the local Git client to start tracking the file:

    >> git add analysis1.ipynb

You can check which files are being tracked with git status.
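
As a rough illustration of what git status reports at this point, the sketch below recreates the situation in a throwaway repository. The notebook files are empty placeholders (an assumption for the demo); in your projectA clone you would only run the last command.

```shell
# Throwaway repository so the sketch can be run anywhere
cd "$(mktemp -d)"
git init -q

# Placeholder files standing in for the two notebooks
echo '{}' > analysis1.ipynb
echo '{}' > scratch.ipynb

# Stage only the notebook we intend to push
git add analysis1.ipynb

# Show staged, modified and untracked files
git status
```

Here scratch.ipynb appears under “Untracked files” because it was never passed to git add.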

git status lists analysis1.ipynb under “Changes to be committed”, so it’s being tracked by our local Git client. Now let’s commit the changes:

    # -m flag is used to provide a human friendly message describing the change
    >> git commit -m "Adds customer data analysis notebook"

A commit creates a checkpoint (or version) that you can revert to at any time. Finally, push this commit to GitHub:

    >> git push

Now you can visit the repository page on GitHub to see your commits.

Develop in a Branch

Let’s say you are working on a large project spanning multiple days, but you want to periodically push work-in-progress (WIP) checkpoints (“commits”). The way to do that is to create a feature branch. Each repository has a default branch (called “master” in this article; newer GitHub repositories default to “main”) that stores the most up-to-date versions of completed work. Each member of your team can create their own feature branches to store their WIP commits. When the work in a feature branch is ready to be shared, they can create a pull request for peer review & subsequently merge the feature branch into master. Let’s unpack that with concrete steps.

Say I’m about to start working on a new project to analyse customer data. First, I will create a new branch:

    >> git checkout -b customer_data_insights

Then I’ll create/edit some notebooks & other files to do the actual analysis. When I’m ready to commit my WIP, I’ll do the usual git add, git commit, git push. The first git push fails with an error saying that the current branch has no upstream branch, since the branch does not exist on GitHub yet.
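
The failure can be reproduced without touching GitHub. This is a sketch under stated assumptions: a local bare repository plays the role of the GitHub remote, and the notebook name and identity values are made up for the demo.

```shell
cd "$(mktemp -d)"
git init -q --bare remote.git            # local stand-in for GitHub
git clone -q remote.git projectA
cd projectA
git config user.name "Your Name"         # placeholder identity for the demo
git config user.email "you@example.com"

# Create the feature branch and commit some work in progress
git checkout -q -b customer_data_insights
echo '{}' > insights.ipynb               # placeholder notebook
git add insights.ipynb
git commit -q -m "Adds WIP customer insights notebook"

# The branch exists only locally, so Git refuses to push and
# suggests a command with --set-upstream instead
git push || echo "git push failed: no upstream branch yet"
```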

To fix this, copy the command suggested in the error message, which pushes the branch and links it to GitHub:

    >> git push --set-upstream origin customer_data_insights

This creates the branch on GitHub and pushes your commits to it. From then on, a plain git push is enough to push new commits to this branch.

Create Pull Request

Let’s say you’ve been working on a feature branch for a while and it’s ready for prime time. Most likely, you’d want to first share it with your peers & get their feedback before merging it into the master branch. That’s what pull requests are for.

You can create pull requests from the GitHub UI. Go to your project page -> Pull requests tab -> click “New pull request”.

Choose which branch you’d like to merge into master. Verify commits & list of files changed. Click “Create pull request”.

On the next page, provide a title & briefly describe your changes, then hit “Create pull request” again.

Comment on Jupyter Notebooks in a GitHub Pull Request

GitHub pull requests are fantastic for peer review: they let you see changes side by side, write comments to seek clarification or provide feedback, and finally merge the changes once approved. One limitation is that GitHub shows raw JSON diffs for notebooks, which are hard to read because notebooks embed images, plots and other rich output as encoded data.

You can use ReviewNB to solve this problem. It shows you visual diffs & lets you comment on any notebook cell. Once your changes are approved, you can merge them from the GitHub UI.

Alternatively, run git merge followed by git push from the command line.
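
The command-line route can be sketched end to end as follows. To keep it runnable anywhere, a local bare repository stands in for GitHub and the notebooks are placeholders (both assumptions for illustration). The default branch name is read from Git rather than hard-coded, since it may be master or main on your machine. In a real project only the last three commands are needed.

```shell
cd "$(mktemp -d)"
git init -q --bare remote.git            # local stand-in for GitHub
git clone -q remote.git projectA
cd projectA
git config user.name "Your Name"         # placeholder identity for the demo
git config user.email "you@example.com"
default_branch=$(git symbolic-ref --short HEAD)   # "master" in this article

# One commit on the default branch, pushed to the remote
echo '{}' > analysis1.ipynb
git add analysis1.ipynb
git commit -q -m "Adds customer data analysis notebook"
git push -q --set-upstream origin "$default_branch"

# One commit on the feature branch
git checkout -q -b customer_data_insights
echo '{}' > insights.ipynb
git add insights.ipynb
git commit -q -m "Adds customer insights notebook"

# Merge the feature branch into the default branch and publish the result
git checkout -q "$default_branch"
git merge customer_data_insights
git push
```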

Revert to a specific notebook version

If you want to temporarily go back to a commit, inspect the files, and then return to where you were, simply check out the desired commit. When you are done, run “git checkout master” to go back to the current state.
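
A sketch of that round trip, using a throwaway repository with two made-up versions of a notebook file. The file contents are placeholders, and the branch name is read from Git because it may be master or main depending on your setup.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Your Name"         # placeholder identity for the demo
git config user.email "you@example.com"
default_branch=$(git symbolic-ref --short HEAD)

# Two commits, so there is an older version to go back to
echo 'version 1' > analysis1.ipynb
git add analysis1.ipynb
git commit -q -m "First version"
echo 'version 2' > analysis1.ipynb
git commit -q -a -m "Second version"

# Find the commit to visit; HEAD~1 is the commit just before the latest one
git log --oneline
old_commit=$(git rev-parse HEAD~1)

# Temporarily go back: files now look as they did at that commit
git checkout -q "$old_commit"
cat analysis1.ipynb                      # prints: version 1

# Return to the current state
git checkout -q "$default_branch"
cat analysis1.ipynb                      # prints: version 2
```

In practice you would copy the commit hash from git log (or from GitHub) instead of using HEAD~1.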

If you want to actually revert to an old state and make some changes there, you can start a new branch from that commit.

    >> git checkout -b old-state f33939cd63004e3e67b111f7bcb350ffd2b0608a

You can also browse old commits on GitHub by going to Your project page -> Commits. Open the desired commit and click “View File” to see the notebook status at that commit.

Share Notebooks

When you browse notebooks in your repository on GitHub, it renders them as HTML, so it’s very convenient to share read-only links to a notebook. If it’s a private repository, the person you are sharing the link with needs to have a GitHub account and permission to access your repository.

For security reasons, GitHub does not run any JavaScript in the notebook. You can use nbviewer if your notebook contains interactive widgets and the like.

Conclusion

If you are new to Git, it can take some time to get used to all the commands. But it’s a proven way of collaborating on software projects & is widely used in data science work as well. You can combine it with ReviewNB to remove some of the kinks in the workflow.

Happy Hacking!