How To Work With YAML in Jupyter Notebooks

9 minute read

You might use YAML in a Jupyter notebook to manage deployment configuration files or to automate and track machine learning experimentation, but YAML doesn’t have to be used for configuration only – in some cases, full datasets are stored in YAML format.

Let’s take a look at how you can work with YAML in Jupyter.

YAML parsing / writing tools

PyYAML

You can use the pyyaml package to read and write YAML using Python.

Let’s say you have a YAML config file transaction_clustering.yaml, and in it you’ve saved some of the steps you used in a data science experiment:

---
- step: scaling
  method: robust
  parameters:
    with_centering: true 

- step: clustering
  method: kmeans
  parameters:
    n_clusters: 6
    max_iter: 10000
    algorithm: elkan

You can read the file in with the pyyaml shorthand functions load or safe_load. The safe_load function prevents arbitrary Python code from being run as the YAML is loaded, so it is usually preferred.
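
To see the difference, here's a minimal sketch of what safe_load refuses to do. The document below uses pyyaml's Python-specific tags; loading it with yaml.unsafe_load would actually run the shell command:

import yaml

# a YAML document that asks pyyaml to call os.system while loading;
# safe_load raises an error instead of executing it
doc = "!!python/object/apply:os.system ['echo this should not run']"

try:
    yaml.safe_load(doc)
except yaml.constructor.ConstructorError as err:
    print(err)

With that in mind, let's read the experiment file in: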

import yaml

experiment_details = yaml.safe_load(open('transaction_clustering.yaml'))
print(experiment_details)
[{'step': 'scaling',
  'method': 'robust',
  'parameters': {'with_centering': True}},
 {'step': 'clustering',
  'method': 'kmeans',
  'parameters': {'n_clusters': 6, 'max_iter': 10000, 'algorithm': 'elkan'}}]

The data in the YAML file has been read into a Python list of dictionaries, which you can now use as normal:

print(experiment_details[1]['parameters']['max_iter'])
10000

You can also save Python objects to YAML strings:

print(yaml.safe_dump(experiment_details[1]))
method: kmeans
parameters:
  algorithm: elkan
  max_iter: 10000
  n_clusters: 6
step: clustering

And as YAML files:

yaml.safe_dump(experiment_details[1], open('clustering_details.yaml', 'w'))

Notice here that we’re using the safe version of the dump operation, which only supports basic YAML tags.
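
If you try to dump something beyond those basic tags, such as an arbitrary Python object, safe_dump refuses. A minimal sketch:

import yaml

# an arbitrary Python object, which has no basic-YAML representation
class Experiment:
    pass

try:
    yaml.safe_dump(Experiment())
except yaml.representer.RepresenterError as err:
    print(err)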

The yaml shorthand functions do have some limited configurability. For example, you can configure the dumping functions to include document markers, which aren’t included by default:

print(yaml.safe_dump(experiment_details[1], explicit_start=True))
---
method: kmeans
parameters:
  algorithm: elkan
  max_iter: 10000
  n_clusters: 6
step: clustering

If you’d like more configurability than the shorthand functions provide, you can use the pyyaml loader and dumper classes directly.
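
For example, here's a minimal sketch that passes the loader and dumper classes explicitly, along with a couple of pyyaml's dump options:

import yaml

with open('transaction_clustering.yaml') as f:
    experiment_details = yaml.load(f, Loader=yaml.SafeLoader)

# the full dump function exposes more options than the shorthands,
# e.g. preserving key order and forcing block style
print(yaml.dump(
    experiment_details,
    Dumper=yaml.SafeDumper,
    sort_keys=False,
    default_flow_style=False,
))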

yamlmagic

Another convenient Python package for working with YAML is yamlmagic (installable with pip), which provides a cell magic that parses the YAML in a notebook cell into a Python object.

First we load the extension:

%load_ext yamlmagic

Then the yaml cell magic turns a whole cell into a YAML-interpreting cell. The magic must be the first line of its own cell, and its argument names the variable that the parsed YAML will be bound to:

%%yaml clustering_details

step: clustering
method: kmeans
parameters:
  n_clusters: 6
  max_iter: 10000
  algorithm: elkan

We now have this data, entered in YAML format, in the Python variable clustering_details:

print(clustering_details)
{'step': 'clustering',
 'method': 'kmeans',
 'parameters': {'n_clusters': 6, 'max_iter': 10000, 'algorithm': 'elkan'}}

writefile

You can use the built-in writefile magic to save YAML-formatted data into a file.

For example:

%%writefile clustering_details.yaml

step: clustering
method: kmeans
parameters:
  n_clusters: 6
  max_iter: 10000
  algorithm: elkan

Here we save our clustering parameter data into the YAML file clustering_details.yaml, which we can now use elsewhere. You could just use an editor to save these parameters into a file, but if you’re already doing analysis or modelling in a Jupyter notebook, it’s convenient to continue working in the notebook.
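
If you want to check that the file you've written parses as you expect, a quick sanity check is to read it straight back with pyyaml:

import yaml

print(yaml.safe_load(open('clustering_details.yaml')))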

Jupyter and YAML use cases

Let’s take a look at some of the contexts in which data scientists and data engineers may come across YAML files.

Configuration management for DevOps, MLOps, and DataOps

YAML is used widely in DevOps, MLOps and DataOps for orchestration, deployment configuration, data pipeline configuration, and ML pipeline configuration.

The Seldon documentation uses the writefile Jupyter magic to save graph-level metadata. You may find yourself designing your Seldon graphs in a Jupyter notebook and setting up the YAML files there as part of your development and testing process.

You can also imagine opening a Kubernetes cronjob YAML file to check when the cronjob is run:

yaml.safe_load(open('daily_clustering.yaml'))
{'apiVersion': 'batch/v1',
 'kind': 'CronJob',
 'metadata': {'name': 'clustering-daily'},
 'spec': {'schedule': '30 8 * * *',
  'jobTemplate': {'spec': {'template': {'spec': {'containers': [{'name': 'clustering',
        'image': 'clustering:latest',
        'command': ['run-clustering']}],
      'restartPolicy': 'OnFailure'}}}}}}

You could potentially also programmatically duplicate and edit these configuration files:

clustering_cron = yaml.safe_load(open("daily_clustering.yaml"))

for hour in (13, 14, 15, 16):
    clustering_cron["spec"]["schedule"] = f"30 {hour} * * "
    yaml.safe_dump(
        clustering_cron, open(f"daily_clustering_{hour}h30m.yaml", "w"), sort_keys=False
    )

Here we create four new cronjob definition files, which will run at 13:30, 14:30, 15:30, and 16:30 respectively.

Experiment automation and tracking

Data science preprocessing pipelines and machine learning models usually have many parameters and hyperparameters. YAML can be used to track the parameters used in individual experiments, or as part of hyperparameter tuning steps where modelling is performed and evaluated for many configurations.

Storing these kinds of parameters and hyperparameters in YAML has several benefits: it makes them human-readable and easy to edit by hand, it allows them to be version controlled, and it decouples parameters from code, so that the same code can be re-run on different parameter sets, enabling automation.

For experimentation tracking, you could use YAML files to track parameters of experiments in a version controlled folder:

transaction_clustering
 |_ eda.ipynb
 |_ transaction_clustering.ipynb
 |_ experiment_parameters_20220404.yaml
 |_ experiment_parameters_20220408.yaml

Here each YAML file could contain parameters for a particular version of your experiment:

date: 2022-04-04
data: iris.csv

clustering:
  method: kmeans
  fields:
    - sepal_length
    - sepal_width
    - petal_length
    - petal_width
  parameters:
    n_clusters: 3
    max_iter: 100
    algorithm: elkan
    random_state: 22

Then this configuration YAML would be read in and used as variables in your modelling code in transaction_clustering.ipynb:

import pandas as pd
from sklearn.cluster import KMeans
import yaml

config = yaml.safe_load(open('experiment_parameters_20220404.yaml'))

print(config['date'])
df = pd.read_csv(config['data'])

clustering_config = config['clustering']

if clustering_config['method'] == 'kmeans':
    kmeans_parameters = clustering_config['parameters']
    clustering = KMeans(
        n_clusters=kmeans_parameters['n_clusters'],
        max_iter=kmeans_parameters['max_iter'],
        algorithm=kmeans_parameters['algorithm'],
        random_state=kmeans_parameters['random_state']
    )
    y = clustering.fit_predict(df[clustering_config['fields']])
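
Since the keys under parameters in the config file match KMeans's keyword arguments exactly, you could equivalently unpack the dictionary, assuming the YAML contains only valid KMeans arguments:

clustering = KMeans(**kmeans_parameters)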

Note how easy it would be to re-run the experiment set up in transaction_clustering.ipynb using a different YAML config file with different parameters. You could run this experiment using many different config files, and potentially different datasets, and at the end of each notebook save the results of the experiment to a file.
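
As a sketch of that last step, you might dump a small summary next to the config files (the result fields here are hypothetical, and the code assumes the config and y variables from above):

# cast numpy integers to plain ints so safe_dump can represent them
results = {
    'config_file': 'experiment_parameters_20220404.yaml',
    'n_samples': int(len(y)),
    'cluster_sizes': {int(label): int((y == label).sum()) for label in set(y)},
}

with open('experiment_results_20220404.yaml', 'w') as f:
    yaml.safe_dump(results, f, sort_keys=False)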

A similar approach can be used for hyperparameter search configuration. For example, Google AI Platform's hyperparameter tuning service can use YAML files to define the hyperparameter search space for use with various machine learning packages.

Data storage

YAML can also be used to store data. Like JSON, it’s a useful format for nested data that can’t be represented well in a tabular format like CSV.

As an example, let’s have a look at some cricket match data provided by Cricsheet. First we download Indian Premier League data from the matches page and read it in:

ipl_data = yaml.safe_load(open('ipl/1082591.yaml'))
print(ipl_data.keys())
dict_keys(['meta', 'info', 'innings'])

We can navigate the data to the results of each ball in each innings:

ipl_data['innings'][0]

Then we can do some munging of the data into a format readable by pandas:

import pandas as pd

# flatten each {ball_number: delivery_details} mapping into one row-like dict
deliveries = ipl_data['innings'][0]['1st innings']['deliveries']
flattened_innings = [{'ball': ball, **details}
                     for delivery in deliveries
                     for ball, details in delivery.items()]
df = pd.DataFrame(flattened_innings)
print(df.head())

[Image: the first rows of the flattened 1st innings deliveries DataFrame]
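
From here, ordinary pandas operations apply. For example, here's a sketch of totalling runs per batsman; it assumes Cricsheet's per-delivery schema, in which each delivery has a batsman field and a nested runs mapping with a total:

# 'runs' is still a nested dict after flattening, so extract the total first
df['runs_total'] = df['runs'].apply(lambda r: r['total'])
print(df.groupby('batsman')['runs_total'].sum().sort_values(ascending=False))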

Using YAML outside of a Jupyter Notebook

We’ve looked at a couple of Python modules that allow us to create, edit, and import data from YAML files within a notebook. But often it’s more convenient to work with YAML outside of a Jupyter notebook.

Text editors

If you’re already working in a Jupyter or Jupyter Lab environment, it’s easy enough to open a text file within the environment and create or edit your YAML there.

But you don’t have to stay in Jupyter – widely used editors like Visual Studio Code and Sublime Text let you create and edit YAML files, and also come with syntax highlighting that makes YAML even easier to read. Many editors also have more advanced YAML plugins available that work with specific flavours of YAML and offer auto-completion and syntax verification. If you often work with large YAML files, these can be invaluable.

Terminal-based editors like Vim, Emacs, and Nano all work as YAML editors too. If you need to log in to remote machines to check or edit YAML, you will likely need to use one of these.

yamllint

Linters are programs that check the correctness of your code. yamllint is a linter for YAML that checks if your YAML is syntactically correct and also does some cosmetic checking (for trailing whitespace, for example). It is worth running it on your YAML files before using them or committing them to a version control system.

Say we have an indentation error in our YAML:

---
- step: scaling
  method: robust
  parameters:
    with_centering: true 

- step: clustering
   method: kmeans
   parameters:
    n_clusters: 6
    max_iter: 10000
    algorithm: elkan

That isn’t an easy error to spot by eye. The larger or more complex a YAML file is, the more difficult it is to pick up errors, so the automated checking done by linters can save time.

yamllint is a Python package that can be installed with pip, and then run from the command line. If we run it for the YAML above, it tells us about the error:

$ yamllint transaction_clustering_with_error.yaml

transaction_clustering_with_error.yaml
  5:25      error    trailing spaces  (trailing-spaces)
  8:10      error    syntax error: mapping values are not allowed here (syntax)

It also picked up a trailing whitespace character in line 5, something we wouldn’t have noticed otherwise.

Sometimes yamllint is stricter than you may want it to be. For example, by default it flags a missing document start marker --- on the first line of a YAML file (as a warning). This can clash with the default dump behavior of pyyaml, which omits the start marker. But the yamllint configuration is adjustable through a custom configuration file, so you can set it to ignore rules you don’t want to be checked.

If you want yamllint to pass YAML created by the default pyyaml dump and safe_dump functions, you can use the following custom configuration:

rules:
  document-start: disable
  indentation:
    indent-sequences: false

Without this config, running yamllint on one of the cronjob files we generated earlier will raise issues:

$ yamllint daily_clustering_16h30m.yaml

daily_clustering_16h30m.yaml
  1:1       warning  missing document start "---"  (document-start)
  12:11     error    wrong indentation: expected 12 but found 10  (indentation)
  15:11     error    wrong indentation: expected 12 but found 10  (indentation)

If we use the config file, these warnings and errors won’t be raised:

$ yamllint daily_clustering_16h30m.yaml -c yamllint_config.yaml

How permissive or conservative you want your linting to be is up to you, and can depend on the applications you are using your YAML with.

Command line YAML navigators

The command line tools yq and shyaml allow you to navigate YAML from the command line. Let’s first look at yq:

$ yq . transaction_clustering.yaml   

This prints the full contents of our YAML file. To pull out elements of a list, we use indices:

$ yq '.[0]' transaction_clustering.yaml   

{
  "step": "scaling",
  "method": "robust",
  "parameters": {
    "with_centering": true
  }
}

And we use dots to navigate down levels:

$ yq '.[1].method' transaction_clustering.yaml   

"kmeans"

While yq is a Python package installable with pip, it is actually a wrapper for the JSON processor jq, which you will have to install too. Installation instructions can be found on the yq PyPI site.

An alternative to yq is shyaml, which is also a pip-installable Python package. The shyaml script only reads from stdin (unlike yq, which can read from stdin and from files), so the syntax to use is:

$ cat transaction_clustering.yaml | shyaml get-value '0.method'

robust 

$ cat transaction_clustering.yaml | shyaml get-value '1.parameters.n_clusters'

6

$ cat transaction_clustering.yaml | shyaml get-type '1.parameters.n_clusters'

int

Round up

Jupyter notebooks are a convenient and interactive way for data scientists to work with YAML, but one of their downsides is that they’re tricky to review.

ReviewNB solves this problem by supporting notebook diffs and comment threads, which help us spot problems (it’s easy to make mistakes with nested YAML!) and enhance collaboration on notebook development.

ReviewNB