CleanPy: Automatically Format and Tidy Jupyter Notebooks
Python has some great tools for improving code quality, and I find myself reusing them often enough that bundling them within a single "meta" application seemed justified. The following details the rationale, workings and distribution of clean_py, a package capable of cleaning .ipynb and .py source code.
Overview
Code quality is important. Good quality code should be easy to read, maintain and extend, from your perspective and that of your team members. While the tools currently available can't correct for fundamentally bad design (though perhaps this will change in the future), there's still a lot they can do around code readability and cleanliness. With this in mind, some of my favourite code quality packages include:
- Black, for code formatting. Code formatting can be a prickly topic and I am aware of some of the controversies surrounding black, chief among them being how opinionated the package is (naturally) and some differences around specific PEP 8 elements like single/double quotations. Nevertheless, black has found traction within some pretty large projects, and its out-of-the-box usability is extremely appealing. Notable alternatives include yapf and autopep8.
- Autoflake, for removing unused imports and variables, courtesy of pyflakes. Notable alternatives include flake8 (though it flags issues rather than fixing them).
- Isort, for alphabetizing and segmenting imports.
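All three ship with their own CLIs; a quick sketch of invoking each directly against a hypothetical my_module.py:
# format in place with black
black my_module.py
# strip unused imports and variables in place with autoflake
autoflake --in-place --remove-all-unused-imports --remove-unused-variables my_module.py
# alphabetize and segment imports with isort
isort my_module.py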
Some common themes running through these libraries are:
- They all actively change the input source code, as distinct from just flagging errors and suggesting fixes.
- They're complementary to one another, fulfilling different purposes, mainly WRT source code and dependencies.
- They're currently all geared to work only with .py source files.
With these libraries in hand, I set about combining them into a simple CLI application.
CleanPy Under the Hood
Programmatically Calling Dependent Libraries
What's required is to trace back through each library's CLI application and find the relevant code which processes the source code as a string, so that we can integrate this code within a larger program. Isort reduces to a single parameterized call to SortImports. Autoflake similarly reduces to a single parameterized function, fix_code, whose application was forked to account for notebook cells (details below). Black was a little trickier, requiring a specific configuration to be passed to the format_file_contents function (something I found difficult to locate within the 4000-line main file, yikes).
from autoflake import fix_code
from black import (
    DEFAULT_LINE_LENGTH,
    PY36_VERSIONS,
    FileMode,
    NothingChanged,
    format_file_contents,
)
from isort import SortImports


def clean_python_code(
    python_source, isort=True, black=True, autoflake=True, is_notebook_cell=False
):
    # run source code string through autoflake, isort, and black
    formatted_source = python_source
    if autoflake:
        # within notebook cells, do not remove unused imports
        # (RE: notebook cell dependencies)
        formatted_source = fix_code(
            formatted_source,
            expand_star_imports=True,
            remove_all_unused_imports=not is_notebook_cell,
            remove_duplicate_keys=True,
            remove_unused_variables=True,
        )
    if isort:
        formatted_source = SortImports(file_contents=formatted_source).output
    if black:
        mode = FileMode(
            target_versions=PY36_VERSIONS,
            line_length=DEFAULT_LINE_LENGTH,
            is_pyi=False,
            string_normalization=True,
        )
        try:
            formatted_source = format_file_contents(
                formatted_source, fast=True, mode=mode
            )
        except NothingChanged:
            pass
    return formatted_source
isort, black and autoflake generally apply equally to standard .py source as well as the .py source code within individual .ipynb cells. The only exception is that we disable the import removal usually handled by autoflake, as it's often unclear where/how imports are used within jupyter notebooks, and each call to clean_python_code is applied to each cell independently.
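As a quick illustration, a single call runs a source string through all three tools (the messy_source string here is just a made-up example):
messy_source = "import sys\nimport os\nx=os.getcwd(  )\nprint( x )\n"

# the unused `import sys` is removed, imports are sorted, spacing is normalized
print(clean_python_code(messy_source))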
Parsing Jupyter Notebooks
There are many problems with Jupyter notebooks; they can become a state-riddled rat's nest, they're a nightmare to version, and they're extremely "fragile" in the sense that small changes to the underlying JSON will break the notebook (RE: merge conflicts). Nevertheless, they're an integral part of the Data Science and Machine Learning ecosystem and I find myself using them most days.
What we want to do is parse the underlying "source JSON" of a jupyter notebook; specifically, the "cells" entry containing the python source code. This is as simple as loading the notebook JSON with python's standard json library and indexing into the "cells" value.
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing ./dev.py\n"
]
}
],
"source": [
"print('hello world')\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Example "source JSON" for a jupyter notebook. We're interested in the top-most "cells" attribute, which returns a list of dictionaries which map to each notebook cell. Within each of these dictionaries, we're interested in the "source" value of "code" cells, which is a new-line delimited list of python source code.
With a list of dictionaries corresponding to the source code of each notebook cell, we'd like to map the same clean_python_code function that we apply to .py source strings over the python source strings within each of the notebook cells. This gets a little messy, as we have to join a list of source strings back together and wrap the whole thing in try/except statements for tolerance, but it's very achievable nonetheless.
from json import dump, load
from multiprocessing import Pool

# module-level worker pool (assumed setup; not shown in the original snippet)
pool = Pool()


def clean_ipynb_cell(cell_dict):
    # clean a single cell within a jupyter notebook
    if cell_dict["cell_type"] == "code":
        try:
            clean_lines = clean_python_code(
                "".join(cell_dict["source"]), is_notebook_cell=True
            ).split("\n")
            if len(clean_lines) == 1 and clean_lines[0] == "":
                clean_lines = []
            else:
                clean_lines[:-1] = [
                    clean_line + "\n" for clean_line in clean_lines[:-1]
                ]
            cell_dict["source"] = clean_lines
            return cell_dict
        except Exception:
            # return the original cell dict if cleaning fails
            return cell_dict
    else:
        return cell_dict


def clean_ipynb(
    ipynb_file_path, clear_output=True, autoflake=True, isort=True, black=True
):
    # load, clean and write .ipynb source in-place, back to original file
    if clear_output:
        clear_ipynb_output(ipynb_file_path)
    with open(ipynb_file_path) as ipynb_file:
        ipynb_dict = load(ipynb_file)
    # map the cleaning operation across cells via the worker pool
    processed_cells = pool.map(clean_ipynb_cell, ipynb_dict["cells"])
    ipynb_dict["cells"] = processed_cells
    with open(ipynb_file_path, "w") as ipynb_file:
        dump(ipynb_dict, ipynb_file, indent=1)
        ipynb_file.write("\n")
Mapping the cleaning function across each cell within the jupyter notebook. We also flag that we're processing a notebook cell (to ensure imports aren't removed), and we optionally specify a clear_output flag within the broader clean_ipynb function, to remove notebook artefacts.
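The clear_ipynb_output helper isn't shown above; a minimal sketch of what it might do (an assumption on my part: stripping outputs and execution counts in place):
from json import dump, load


def clear_ipynb_output(ipynb_file_path):
    # hypothetical sketch: reset outputs/execution counts for all code cells
    with open(ipynb_file_path) as ipynb_file:
        ipynb_dict = load(ipynb_file)
    for cell_dict in ipynb_dict["cells"]:
        if cell_dict["cell_type"] == "code":
            cell_dict["outputs"] = []
            cell_dict["execution_count"] = None
    with open(ipynb_file_path, "w") as ipynb_file:
        dump(ipynb_dict, ipynb_file, indent=1)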
Putting it All Together
So we now have code that is capable of parsing and formatting .py and .ipynb source code. What's left is to wrap the whole thing in a minimal CLI application. Inspired by the brevity of spacy's source code examples, I decided to use plac for arg parsing and wasabi for pretty logging (thanks Ines!).
import glob
from pathlib import Path

import plac
from wasabi import Printer

from .clean_py import clean_ipynb, clean_py

msg = Printer()


@plac.annotations(
    path=("File or dir to clean", "positional", None, str),
    py=("Apply to .py source", "option", None, bool),
    ipynb=("Apply to .ipynb source", "option", None, bool),
    autoflake=("Apply autoflake to source", "option", None, bool),
    isort=("Apply isort to source", "option", None, bool),
    black=("Apply black to source", "option", None, bool),
)
def main(path, py=True, ipynb=True, autoflake=True, isort=True, black=True):
    path = Path(path)
    if not path.exists():
        raise ValueError("Provide a valid path to a file or directory")
    if path.is_dir():
        msg.info(f"Recursively cleaning directory: {path}")
        if py:
            # recursively apply to all .py source within dir
            for e in glob.iglob(path.as_posix() + "/**/*.py", recursive=True):
                try:
                    msg.info(f"Cleaning file: {e}")
                    clean_py(e, autoflake, isort, black)
                except Exception:
                    msg.fail(f"Unable to clean file: {e}")
        if ipynb:
            # recursively apply to all .ipynb source within dir
            for e in glob.iglob(path.as_posix() + "/**/*.ipynb", recursive=True):
                try:
                    msg.info(f"Cleaning file: {e}")
                    # keyword args avoid clashing with clean_ipynb's clear_output parameter
                    clean_ipynb(e, autoflake=autoflake, isort=isort, black=black)
                except Exception:
                    msg.fail(f"Unable to clean file: {e}")
    if path.is_file():
        msg.info(f"Cleaning file: {path}")
        if path.suffix not in [".py", ".ipynb"]:
            raise ValueError("Ensure a valid .py or .ipynb path is provided")
        if py and path.suffix == ".py":
            clean_py(path, autoflake, isort, black)
        if ipynb and path.suffix == ".ipynb":
            clean_ipynb(path, autoflake=autoflake, isort=isort, black=black)


def main_wrapper():
    plac.call(main)
I've also included recursive directory traversal as an explicit part of the app, as I often just want to "lint" an entire repository/directory of .py and .ipynb files and not think about it too much. Of course, the application can lint individual files if required as well!
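For example (assuming the package is installed and the clean_py entry point is on your path):
# recursively clean every .py and .ipynb file within a repository
clean_py ./some_repo

# or clean a single file
clean_py ./some_repo/some_notebook.ipynb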
Forking clean_ipynb
I first came across clean_py in its original form, clean_ipynb, a few months ago. I initially forked the repo and added some additional functionality: multi-threaded notebook cleaning, autoflake support, and a reworked CLI; these changes were merged back into master. Lately, however, I noticed that the project has done away with some of this functionality and is also moving towards including Julia support. And this is great! Though my understanding of Julia is exactly zero, I'm sure this is a useful addition for Julia users. Respectfully, I think that a tool should just do one thing well (clean Python) and that my previous work is sufficiently distinct and justified as a standalone project.
So now I'm in a situation where I have a fork of the main repo, developed under a standard MIT License, but I'd like to create a new project from the current state of this fork, ideally preserving the dev history around it. After doing some reading here, here and here, I'm of the impression that possible grounds for creating a new project, as distinct from maintaining a fork, include:
- You want to use the project as a starting point for some other project with a completely different goal.
- When your version of the project will not be regularly merging in changes from the original upstream branch.
In addition, the proper etiquette surrounding the licensing and attribution of these derivative projects is to:
- Properly attribute the original author and project (informally) within the README
- Properly attribute the original author (formally) within a revised MIT license
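For the second point, the usual convention (a hypothetical example, not the actual notice from either repo) is to stack copyright lines within the license:
MIT License

Copyright (c) 2019 <original author>
Copyright (c) 2020 <fork author>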
In terms of repo management, this meant creating a new GitHub repo (clean_py), and changing the remote of the old repo (clean_ipynb) to target this new repo via:
git remote set-url origin https://github.com/samhardyhey/clean_py.git
Distributing clean_py
I've never uploaded and distributed a package within the PyPI repository before; the process always seemed complex and "far away" to me, insomuch as my projects were not of a sufficient quality to distribute. And for personal projects which you don't want to share with the world, you can probably get away with just using GitHub as your PyPI server. I suppose I have slightly more ambition for clean_py, as I feel it would be useful to many more people than myself, and so I set about distributing the package on PyPI. The process is remarkably straightforward, and an excellent technical guide exists here detailing the process WRT the Test PyPI and Main PyPI repos, so I'll save some words.
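The flow that guide describes boils down to a handful of commands (a sketch, assuming a standard setup.py and twine installed):
# build source and wheel distributions
python setup.py sdist bdist_wheel
# dry run against the Test PyPI repo
twine upload --repository-url https://test.pypi.org/legacy/ dist/*
# upload to the main PyPI repo
twine upload dist/*
At long last, we can pip install via: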
pip install clean-py
And we can now remove notebook artefacts and apply black, autoflake and isort to a notebook's source code via:
clean_py ./some_notebook.ipynb
More sophisticated usage instructions are available within the repo. As always, feel free to critique, fork and modify clean_py however you see fit. Happy linting!