Thu 25 December 2014

tl;dr: this is a post on how to use IPython Notebooks in git repositories. You don't actually need to read the post. All the code is in this gist. Just download it and follow the instructions.

Pretty much everyone and their grandmother has started using the IPython notebook for basically everything, and for good reason: notebooks are great.

Notebooks don't play particularly well with version control, though. Imagine this: you've got this cool open source package, and you want to write some examples for it. IPython notebooks seem like the way to go, so you write an example notebook, which you save, add to your git repository and commit. But, it turns out, your package uses random numbers (all the best packages do). Two days later, you re-open the notebook, re-run some of the code cells by accident, and now all the output has changed because of the random numbers, so your git index is a mess with lots of red M's everywhere. Then you think, "gee (because, apparently, you say 'gee'), it would be great if git magically knew it should only look at the input cells!".

This post is a recipe for using IPython notebooks in git. It's not wholly original. In fact, it's an extension of this answer on Stack Overflow.

We are going to write a script that tells git to:

  • ignore the output cells
  • ignore prompt numbers

We don't necessarily want to do this for every notebook, though. Sometimes it makes sense to keep the output. We'll specify in the notebook metadata (more on that in a second) that git needs to ignore the output and prompt numbers for this particular notebook.

We'll do this using git filters. Filters let you specify how to process a file when going from the working directory to the repository and vice versa. Filters expect to read the input file from stdin and spit the filtered version out to stdout. Note that they don't actually change the file in the working directory at all, only the way it is represented in git.
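
To make the stdin/stdout contract concrete, here is a minimal sketch of a filter that changes nothing. The names run_filter and identity are made up for illustration, and the in-memory streams only stand in for the pipes that git would set up.

```python
import io
import sys


def run_filter(transform, stdin=sys.stdin, stdout=sys.stdout):
    """Read the whole file from stdin, transform it, write the result to stdout."""
    stdout.write(transform(stdin.read()))


def identity(text):
    """A filter that changes nothing (the same effect as `cat`)."""
    return text


# Simulate what git does: pipe the file contents in, collect the filtered version.
source = io.StringIO('{"cells": []}')
sink = io.StringIO()
run_filter(identity, stdin=source, stdout=sink)
print(sink.getvalue())
```

A real filter would swap identity for a function that rewrites the notebook text, and would be called with the default streams.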

Git filters

[Figure: filter drawing]

Filters come in two flavours:

  1. clean filters, which are applied when git needs to interpret a file in the working directory (when running git add, or to decide what has changed when running git status or git diff). We want this filter to exclude all output cells.
  2. smudge filters, which are applied when git needs to re-construct the working directory, for example following a git clone command. It would be great if this filter could, somehow, re-construct the original output, but that might be difficult without running the notebook, so we'll let this filter do nothing.

We want the clean filter to:

  1. Check the notebook metadata to see if it should filter it,
  2. If so, print a new version of the notebook to stdout without any output cells or prompt numbers.

The notebook is just a JSON document, so it's easy to parse. The script below does exactly this. You can download a version here.

In []:
#!/usr/bin/env python

import sys
import json

nb = sys.stdin.read() # Read the notebook from stdin.

# First, check the metadata for the following JSON block:
# "git" : {
#     "suppress_outputs" : true
# }
json_in = json.loads(nb)
nb_metadata = json_in["metadata"]
suppress_output = False
if "git" in nb_metadata:
    if "suppress_outputs" in nb_metadata["git"] and nb_metadata["git"]["suppress_outputs"]:
        suppress_output = True

if not suppress_output:
    # Metadata tells us not to suppress output:
    # simply send notebook, as is, to stdout.
    sys.stdout.write(nb)
    sys.exit()
 
# Work out which IPython version wrote the notebook.
# nbformat is one more than the IPython major version (nbformat 3 <-> IPython 2).
ipy_version = int(json_in["nbformat"]) - 1
 
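def strip_output_from_cell(cell):
    """
    Takes a notebook cell and removes its outputs and its prompt number
    ("prompt_number" in nbformat 3, "execution_count" in nbformat 4).
    """
    if "outputs" in cell:
        cell["outputs"] = []
    if "prompt_number" in cell:
        del cell["prompt_number"]
    if "execution_count" in cell:
        # nbformat 4 code cells must keep this key, so reset it to null.
        cell["execution_count"] = None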
 
# Process the notebook
if ipy_version == 2:
    for sheet in json_in["worksheets"]:
        for cell in sheet["cells"]:
            strip_output_from_cell(cell)
else:
    for cell in json_in["cells"]:
        strip_output_from_cell(cell)

# Dump the processed notebook to stdout.
json.dump(json_in, sys.stdout, sort_keys=True, indent=1, separators=(",",": "))

This script reads a notebook from stdin and outputs it to stdout, which is what git filters expect. Before we move on to how to use the script, let's take a brief detour through IPython notebook metadata.
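
To see what the filter does without involving git, here is a rough sketch that wraps the same logic in a function and applies it to a tiny hand-written notebook. The function name strip_outputs and the sample notebook (including its fake output value) are made up for illustration. The sample uses the nbformat 3 layout with worksheets.

```python
import json


def strip_outputs(nb_text):
    """Return nb_text without outputs or prompt numbers, if the metadata opts in."""
    nb = json.loads(nb_text)
    if not nb.get("metadata", {}).get("git", {}).get("suppress_outputs", False):
        return nb_text  # Not opted in: pass the notebook through unchanged.
    # nbformat 3 nests cells inside worksheets; later formats keep them at the top level.
    sheets = nb["worksheets"] if int(nb["nbformat"]) == 3 else [nb]
    for sheet in sheets:
        for cell in sheet["cells"]:
            if "outputs" in cell:
                cell["outputs"] = []
            cell.pop("prompt_number", None)
    return json.dumps(nb, sort_keys=True, indent=1, separators=(",", ": "))


# A hypothetical one-cell notebook whose output would change on every run.
sample = {
    "metadata": {"name": "", "git": {"suppress_outputs": True}},
    "nbformat": 3,
    "nbformat_minor": 0,
    "worksheets": [{
        "cells": [{
            "cell_type": "code",
            "input": ["import random\n", "random.random()"],
            "language": "python",
            "outputs": [{"output_type": "pyout", "prompt_number": 1, "text": ["0.42"]}],
            "prompt_number": 1,
        }]
    }],
}

cleaned = json.loads(strip_outputs(json.dumps(sample)))
cell = cleaned["worksheets"][0]["cells"][0]
print(cell["outputs"], "prompt_number" in cell)  # prints: [] False
```

The input of the cell survives untouched, which is the whole point: git only ever sees the code.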

Notebook metadata

You can associate arbitrary data with an IPython notebook through its metadata. The metadata is a JSON document that you can access by clicking Edit > Edit Notebook Metadata. We will use notebook metadata to tell git to suppress outputs (or not). The script outlined above checks the notebook metadata for a "git" field. Adding

"git" : { "suppress_outputs" : true },

to the notebook metadata will tell the git filter to strip output cells. The full metadata will now look like:

{
    "name" : "",
    "git" : { "suppress_outputs" : true },
    "signature" : "some long string"
}
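
If you have many notebooks, clicking through the menu for each one gets tedious. Since a notebook is just JSON, the flag can also be set from a script. The following is a sketch only: the function name enable_output_suppression is made up, and it rewrites the file in place, so it is probably wise to run it while the notebook is not open in the browser.

```python
import json


def enable_output_suppression(path):
    """Set metadata.git.suppress_outputs to true in the notebook file at `path`."""
    with open(path) as f:
        nb = json.load(f)
    # Create the "metadata" and "git" blocks if they are missing, then set the flag.
    nb.setdefault("metadata", {}).setdefault("git", {})["suppress_outputs"] = True
    with open(path, "w") as f:
        json.dump(nb, f, sort_keys=True, indent=1, separators=(",", ": "))


# Example (hypothetical file name):
# enable_output_suppression("examples/demo.ipynb")
```

Other metadata fields, such as "name", are left as they were.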

Enabling git filters

So we have a script in place to strip outputs and prompts, and we understand how to edit the notebook metadata to tell the script to do this. All that we now need to do is tell git that it needs to use the script as a clean filter for *.ipynb files.

First, download the script and save it to a directory on your PATH. For future reference, I will assume that you saved the file as ~/scripts/ipynb_drop_output (so ~/scripts needs to be on your PATH). Make the file executable with chmod u+x ~/scripts/ipynb_drop_output.

We can tell git to use a filter for IPython notebooks by editing one of the gitattributes files. Which file you need to edit depends on whether you want the filter to apply:

  • to all of your repositories: you want ~/.config/git/attributes.
  • in a particular repository: you want .gitattributes in the repository's root directory.

Whichever file you edit, you need to add the line *.ipynb filter=clean_ipynb. This tells git to use the filter clean_ipynb for any file with the extension .ipynb. Here, clean_ipynb is just the name of the filter. It can be whatever you want.
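
If you would rather script this step, here is a small sketch that appends the line only when it is not already there, so it is safe to run more than once. The helper name add_ipynb_filter is made up. The default path assumes you run it from the repository's root directory.

```python
import os

ATTRIBUTE_LINE = "*.ipynb filter=clean_ipynb"


def add_ipynb_filter(attributes_path=".gitattributes"):
    """Append the filter line to a gitattributes file unless it is already present.

    Returns True if the file was changed, False otherwise.
    """
    lines = []
    if os.path.exists(attributes_path):
        with open(attributes_path) as f:
            lines = f.read().splitlines()
    if ATTRIBUTE_LINE in (line.strip() for line in lines):
        return False  # Nothing to do.
    lines.append(ATTRIBUTE_LINE)
    with open(attributes_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return True
```

Any lines already in the file are kept as they are.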

We now need to edit the git configuration so that git knows what the filter clean_ipynb is. The easiest is to type the following commands in a terminal:

$ git config --global filter.clean_ipynb.clean ipynb_drop_output

$ git config --global filter.clean_ipynb.smudge cat

These apply the settings to all repositories for your user account. To apply them to a specific repository instead, use the same commands but without the --global option (from anywhere inside the repository).

Note that we just use cat for the smudge filter. The smudge filter therefore does nothing.

Storing the filter configuration with a repository

You might want to bundle the filter and various options associated with it with a repository when you distribute it. This is what I'm planning on doing for gmaps, for instance.

To do this, you need to:

  • bundle the ipynb_drop_output script with your repository,
  • put *.ipynb filter=clean_ipynb in the file .gitattributes in the top directory of your repository,
  • create a file .gitconfig in the top directory of your repository,
  • run the following commands in the top directory of your repository:
    $ git config --file .gitconfig filter.clean_ipynb.clean ipynb_drop_output
    $ git config --file .gitconfig filter.clean_ipynb.smudge cat

You can then add .gitattributes and .gitconfig to the repository.

Each time a user clones the repository, she will need to:

  • make sure ipynb_drop_output is both executable and on the system path,
  • run git config --add include.path /path/to/repository/.gitconfig from anywhere inside the git repository so that git knows to look at the .gitconfig file.
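
These two steps could be wrapped in a small setup helper shipped with the repository. What follows is only a sketch under a few assumptions: the function names are made up, the script is assumed to be called ipynb_drop_output as above, and git is assumed to be installed.

```python
import os
import shutil
import subprocess


def find_filter(name="ipynb_drop_output", path=None):
    """Return the full path of the filter script if it is executable and can be
    found on the search path (PATH by default), else None."""
    return shutil.which(name, path=path)


def include_repo_config(repo_dir, config_name=".gitconfig"):
    """Tell the clone's own git config to include the bundled config file."""
    config_path = os.path.join(os.path.abspath(repo_dir), config_name)
    subprocess.check_call(
        ["git", "config", "--add", "include.path", config_path],
        cwd=repo_dir,
    )


# Example, run from the top directory of a fresh clone:
# if find_filter() is None:
#     raise SystemExit("ipynb_drop_output is not executable or not on the PATH")
# include_repo_config(".")
```

Checking for the script first matters because a missing filter is easy to overlook until notebooks with outputs start showing up in commits.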

Conclusion

This post describes a sensible way of including IPython notebooks in git. Of course, it might not do exactly what you want, but it should be quite easy to adapt the code to something closer to what you need.
