Intro to Bioinformatics Engineering, Part 3: Jupyter Notebook to Nextflow Pipeline
Turning your most-used Jupyter Notebook into a pipeline
This article is part of our Intro to Bioinformatics Engineering series, where we’ve been exploring best practices and practical tips for how to build for bioinformatics. In Part 1, we covered the why and when of building pipelines at a high level. Here, we’ll provide a practical example of building your first pipeline.
You have great Jupyter notebooks you reuse constantly for data analysis…but which version did you use to make the graph in last month’s presentation? How will you reprocess the last 20 datasets with your newest version? And how can your teammates use this algorithm for their datasets? Using a pipeline can solve these problems – here’s how to get your notebook code into a Nextflow pipeline.
Introduction
If you’ve analyzed data using Python, you’ve probably used Jupyter notebooks. When you get the very first pieces of data from your initial experiments for a new project, a Jupyter notebook lets you explore and iterate quickly. As you refine your experiments and the resulting data gets more standardized, you might come up with a definitive Jupyter notebook that you use over and over with different input files. You’ve gone from bushwhacking to following a paved trail with signposts and a map. This is great! You don’t need to put in a ton of new effort every time you get new data.
As your notebook becomes a crucial part of your workflow and your colleagues’ workflows, you might start to notice a few issues creeping in. You make some improvements to your algorithm, but now the version of the notebook you used to make past graphs isn’t documented anywhere. Or you want to run the new version of the analysis on the last 20 datasets, and it’d be much quicker if you could parallelize. And you want to make sure that other people who are using your algorithm start using the updated version.
It’s time to turn your Jupyter notebook into your first pipeline. In this post, we’ll show you how.
Pipelines are written in workflow definition languages, such as Nextflow
Workflow definition languages provide a structured framework for describing and orchestrating the series of computational tasks needed to handle and analyze data. For data engineering in general, these tasks typically encompass data extraction, transformation, loading, and analysis.
There are many workflow definition languages that data and bioinformatics engineers use to write computational pipelines. We’ll use Nextflow as an example in this article because it is especially popular with bioinformaticians and computational biologists. As free open-source software, Nextflow is supported by a large community of developers. There’s also nf-core, a large library of open-source bioinformatics pipelines written in Nextflow.
A Nextflow pipeline consists of one or more modules or processes. A Nextflow process allows you to execute a script, which can be written in any popular scripting language, including Bash, Python, and R.
Let’s get to it!
Here are the steps for turning code from your Jupyter notebook into a Nextflow pipeline:
1. Jupyter notebook. You already have a trusty Jupyter notebook for data processing and analysis. In this example, we’ll use a notebook written in Python that performs a simple image thresholding task.
2. Python script. Turn your notebook into one or more executable scripts. If this is your first Nextflow pipeline, you may want to write one script instead of splitting your workflow into multiple modules. In this example, we’ll make one script.
3. Nextflow pipeline. Write a Nextflow pipeline that executes your script(s) for you.
Example code and images are available on our GitHub.
Jupyter notebook
Let’s take a look at an example of a Jupyter notebook that takes in images, segments them, and saves the resulting masks. Here’s the code for reference (we’ll break it down in the next section):
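(The code block below is a representative sketch rather than the exact notebook from our GitHub: it hardcodes input_dir and output_dir as placeholder paths and applies the same normalize-and-threshold logic as the script we derive in the next section.)

import os
import glob

import tqdm
import numpy as np
import skimage
import cv2

# Paths the user edits before each run (placeholders)
input_dir = "/path/to/input_images"
output_dir = "/path/to/output_masks"

# Make sure the output directory exists
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

# Normalize and threshold every TIF in the input directory,
# then save the resulting mask to the output directory
for img_path in tqdm.tqdm(glob.glob(os.path.join(input_dir, "*.tif"))):
    # Read in
    img = skimage.io.imread(img_path)

    # Normalize to the 0-255 range
    norm = np.zeros_like(img)
    cv2.normalize(img, norm, 0, 255, cv2.NORM_MINMAX)

    # Threshold with Otsu's method to get a binary mask
    _, mask = cv2.threshold(norm, 0, 255, cv2.THRESH_OTSU)

    # Save with a "_preprocessed" suffix
    basename = os.path.basename(img_path)
    extension_idx = basename.rfind(".")
    fname = os.path.join(output_dir, f"{basename[:extension_idx]}_preprocessed.tif")
    skimage.io.imsave(fname, mask, check_contrast=False)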
Notable features
Inputs
The notebook takes two inputs that the user must change every time they want to run the notebook: input_dir, the path to the directory where the input images are stored, and output_dir, the path to the directory where the processed images should be saved.
Imports
Depending on the user’s Python environment, tqdm, numpy, scikit-image (skimage), and opencv-python (cv2) may not be installed. Additionally, the notebook does not by default store any information about what versions of each package are installed.
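If you want a quick record of the environment a notebook ran in, one option is to print the installed package versions in a cell. Here’s a minimal sketch using the standard library (Python 3.8+):

from importlib.metadata import version

# Print the installed version of each dependency the notebook uses
for pkg in ("tqdm", "numpy", "scikit-image", "opencv-python"):
    print(pkg, version(pkg))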
Python script
The first step in building the Nextflow pipeline is to turn the Jupyter notebook into one or more executable scripts. As a first pass, we’ll turn our example notebook into a single script:
#!/usr/bin/env python3
import argparse
import os
import glob
import tqdm
import numpy as np
import skimage
import cv2
def preprocess(img_path: str) -> np.ndarray:
    """
    Reads in, normalizes, and thresholds a single image.
    Returns np.ndarray of the preprocessed image.
    """
    # Read in
    img = skimage.io.imread(img_path)
    # Normalize to the 0-255 range
    norm = np.zeros_like(img)
    cv2.normalize(img, norm, 0, 255, cv2.NORM_MINMAX)
    # Threshold the normalized image with Otsu's method
    _, thresh = cv2.threshold(norm, 0, 255, cv2.THRESH_OTSU)
    return thresh


def main():
    # Instantiate argument parser
    parser = argparse.ArgumentParser(
        prog='example_script',
        description="Loads images and preprocesses by normalizing and thresholding."
    )
    # Add arguments to the argument parser
    parser.add_argument(
        'input_dir',
        type=str,
        help="Directory containing input images."
    )
    parser.add_argument(
        '-o',
        '--output_dir',
        dest='output_dir',
        type=str,
        required=True,
        help="Directory to which to save outputs."
    )
    # Run argument parser and extract data
    args = parser.parse_args()
    all_image_paths = glob.glob(os.path.join(args.input_dir, "*.tif"))

    # Make sure output directory exists
    if not os.path.exists(args.output_dir):
        os.mkdir(args.output_dir)

    # Preprocess all the images in the input directory
    # and write out to the output directory
    for path in tqdm.tqdm(all_image_paths):
        # Apply preprocessing
        processed_img = preprocess(path)
        # Save with a "_preprocessed" suffix in the output directory
        basename = os.path.basename(path)
        extension_idx = basename.rfind(".")
        fname = os.path.join(
            args.output_dir,
            f"{basename[:extension_idx]}_preprocessed.tif"
        )
        skimage.io.imsave(fname, processed_img, check_contrast=False)


if __name__ == "__main__":
    main()
Notable features
Shebang
The shebang #!/usr/bin/env python3 indicates the interpreter that the program loader should use to run the script (Python 3 in this case).
Command line arguments
We use the argparse library to create an argument parser so that the script can take command line arguments as inputs.
If we were to run this script on its own, the usage would be:
./example_script.py <input_dir> -o <output_dir>
Versioning
After you have your script, you can check it into GitHub for version control. You can iterate on it and push new versions, and if you’ve shared the repository with colleagues, they can pull in your changes. To maximize reproducibility, you can add a requirements.txt file with package versions, or create a Docker container that others can run your script in.
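For example, a requirements.txt for this script might pin the four third-party packages it imports. The version numbers below are placeholders; use whatever pip freeze reports in your working environment:

tqdm==4.66.4
numpy==1.26.4
scikit-image==0.22.0
opencv-python==4.9.0.80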
If you want to go further and put your script into a pipeline, read on:
Nextflow pipeline
Now, we want to write a pipeline that will run the script we wrote in the last step.
We create a directory with the following structure:
example_nextflow_pipeline
├── main.nf                   [1]
├── nextflow.config           [2]
├── bin
│   └── example_script.py     [3]
└── modules
    └── preprocessing
        └── main.nf           [4]
To run the Nextflow pipeline, using the command line, change directory to example_nextflow_pipeline, then run:
nextflow run main.nf --input_dir <input_dir> --output_dir <output_dir>
Now, we’ll go through each component in detail.
[1] main.nf
This is the main pipeline script:
include { PREPROCESS_IMAGES } from './modules/preprocessing'
log.info """\
    EXAMPLE PIPELINE
    ---------------------
    input_dir:  ${params.input_dir}
    output_dir: ${params.output_dir}
    """
    .stripIndent(true)

workflow {
    PREPROCESS_IMAGES ( params.input_dir )
}
The first line imports the PREPROCESS_IMAGES process, which is contained in modules/preprocessing/main.nf.
The log.info block outputs run information to the console.
Finally, we have the workflow block, which calls the PREPROCESS_IMAGES process with the input_dir that was specified in the command line. params contains the command line arguments, which are anything specified like this: --<argument> value.
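You can also give params default values so the pipeline runs without flags. Here’s a minimal sketch of what that could look like at the top of main.nf; the default paths are hypothetical, and command line flags still override them:

// Hypothetical defaults; --input_dir / --output_dir on the command line override these
params.input_dir  = "$projectDir/data"
params.output_dir = "$projectDir/results"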
[2] nextflow.config
This is the Nextflow configuration file:
process.container = "<docker_image:tag>"
docker.enabled = true
docker.runOptions = '-u $(id -u):$(id -g) -v /Users:/Users'
In this example, we specify that processes should run in the given container. Additionally, we specify that Docker should always be used when executing the pipeline, and give some options for Docker.
We’re only scratching the surface here, though this suffices for our simple example pipeline — for more information on Nextflow configuration, refer to the Nextflow documentation.
[3] bin/example_script.py
This is the example script that we wrote above.
Note that you need to make the script executable in order for Nextflow to run it. In Unix-like systems, you can do this in the command line by changing to the bin directory and running:
chmod +x example_script.py
[4] modules/preprocessing/main.nf
This is the module that contains the process that the Nextflow pipeline will execute:
process PREPROCESS_IMAGES {
    publishDir params.output_dir, mode: 'copy'

    input:
    path process_input_dir

    output:
    path('*')

    script:
    """
    example_script.py \\
        ${process_input_dir} \\
        -o "./"
    """
}
All files produced by the process script are stored in a work directory. The publishDir directive indicates that the output files of this process (specified in the output block) should be published to output_dir, which we specified as a command line argument.
The input block defines the input channels of a process, similar to function arguments. Inputs are specified by a qualifier (the type of data) and a name. The name is similar to a variable name. In our example, the input is a path.
The output block defines the output channels of a process. These can be accessed by downstream processes, or published to the directory specified by the publishDir directive. Here, the outputs are all the paths to files produced by the process. You can be as specific as you want here; e.g., you could specify path('*.tif') to emit only TIF files if other types were produced as well, or path('image006.tif') to emit only that single file.
The script block defines the script that the process executes. In this case, we are running the example_script.py script, passing the process_input_dir process input as the path to the directory with the input files, and the current working directory ./ as the directory to which to write the output files (the output files are then published according to the publishDir directive).
For more information on Nextflow processes, refer to the Nextflow Documentation.
Notable features
Containerization is automatically supported
A container is an isolated virtual environment for your code. It allows you to run your code in a reproducible way by having the same packages and the same versions of your packages installed in the environment every time. One of the most common container platforms is Docker. To learn more about containers and their application to bioinformatics, make sure to catch up with the previous post in our series.
In the Nextflow configuration, we specified a global Docker container in which the whole pipeline should run. If we had multiple process modules, Nextflow also would allow us to specify a Docker container for each module, if they had different requirements.
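For instance, per-module containers can be set with process selectors in nextflow.config. Here’s a sketch with hypothetical image names and a hypothetical second process:

process {
    withName: 'PREPROCESS_IMAGES' {
        container = '<python_image:tag>'   // hypothetical Python image
    }
    withName: 'SOME_OTHER_PROCESS' {
        container = '<r_image:tag>'        // hypothetical R image for a second module
    }
}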
You could use Docker with Jupyter notebooks on your own, but Nextflow makes it easy to always use the same environment each time the pipeline runs.
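For example, you could launch a notebook server inside one of the community-maintained Jupyter Docker images (jupyter/scipy-notebook is shown here; you would still need to install opencv-python and any other missing packages yourself):

docker run --rm -p 8888:8888 -v "$(pwd)":/home/jovyan/work jupyter/scipy-notebook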
Versioning
We didn’t explicitly touch on this in our tour through the example Nextflow pipeline, but we can implement version control for the pipeline through Nextflow’s integration with GitHub (along with BitBucket and GitLab). That way, if you’re sharing the pipeline with colleagues, they can always be up to date (or explicitly run a past version, if that’s what suits their needs).
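For example, if the pipeline lives in a GitHub repository, colleagues can run it directly by name and pin a specific tagged revision with the -r flag (the organization name and tag below are placeholders):

nextflow run <github_org>/example_nextflow_pipeline -r v1.0.0 --input_dir <input_dir> --output_dir <output_dir>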
Wrapping up
Nextflow is a powerful tool for creating data processing and analysis pipelines. By simplifying containerization and versioning, it helps you to increase the reproducibility and portability of your code. This post should help you get started with writing pipelines by turning a Jupyter notebook you use over and over into a script, and then into a Nextflow pipeline.
We only included one process in our simple example pipeline here. In future posts, we’ll highlight how (and why) you can make your pipelines modular and scalable, and how to run them in the cloud.
Lealia Xiong is a Senior Applications Engineer at Mantle. Her favorite organism is Hypsibius exemplaris.