Intro to Bioinformatics Engineering, Part 4: Running in Production
Preparing Bioinformatics Pipelines for Scale
This article is the last in our Intro to Bioinformatics Engineering series, where we’ve been discussing best practices and practical tips for building bioinformatics software. In Part 3, we walked through a hands-on example of transitioning a Jupyter notebook into a Nextflow workflow, a first step toward creating a pipeline.
But is the pipeline reliable? Will it be easy to update? Is it efficient? In this article, we will cover further considerations for creating a high-quality production pipeline.
Introduction
As a bioinformatics engineer, a typical day often involves getting a new set of data from the lab, determining what insights you need to extract, writing the code to generate those insights, and presenting metrics, graphs, and results to the rest of the team. Since you are constantly working with new data for new experiments, you must adapt to changes and produce results that reflect the data and meaningfully impact the next experiment.
At some point, your experimentation may solidify around an essential process. Whether you do Nanopore sequencing every week or run a standardized flow cytometry analyte panel daily, you realize you are running some of the same processes every time. This is when (as described in Part 1) you are going from the hike to the road, so now you have to build the road!
In this article, we will discuss some of the best practices for developing production pipelines your teammates will trust and you can easily maintain.
When and when not to consider a pipeline
As a recap from Part 1 of this series, creating a production-ready pipeline is not always the right direction for the task at hand. Creating a reliable and scalable pipeline takes time. If you spend that time and then run the pipeline only once, it would have been faster and easier to run the code in a Jupyter notebook.
Here are a couple of reasons you might consider productionizing a pipeline:
The size of the data is massive
If you work with single-cell sequencing, you know that preprocessing from BCL or FASTQ files to a count matrix is done through a set of standard pipelines, no matter how many times you run it. This is because, even for the smallest datasets, the compute time can take hours and each dataset can produce GBs to TBs of data.
It is run by multiple different people
If you need to share the code and environment with many different people, it's worth considering creating a standard pipeline. This will allow everyone to run the same version in the same environment, no matter who is running it, allowing for effective comparison of results.
What is a WDL?
A Workflow Definition Language (WDL) is a specialized language for defining, managing, and automating complex computational workflows. These languages provide a standardized, human-readable format for specifying the sequence of tasks, dependencies, inputs, and outputs involved in a data processing pipeline.
The syntax of a WDL is typically intuitive and straightforward. This readability promotes collaboration and ensures that workflows are easy to understand, share, and modify. By explicitly defining each component of a workflow, WDLs help eliminate ambiguity, ensuring precise and consistent execution.
WDLs support parallel execution and efficient resource management, allowing workflows to scale effectively across various computing environments, such as local clusters, cloud platforms, or high-performance computing (HPC) systems. Additionally, WDLs can provide enhanced reproducibility. Putting your script in a git repository may allow you to track which version of the code you ran, but do you know what environment it ran in? What versions of the packages you had installed? A workflow definition language can track both the code and the environment.
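For example, here is a minimal sketch of a Nextflow process that pins its environment next to the code it runs; the process name, container tag, and command are hypothetical placeholders:
process COUNT_READS {
    // Pinning a specific image tag records the exact tool versions used.
    container 'quay.io/biocontainers/seqkit:2.8.2--h9ee0642_0'

    input:
    path fastq

    output:
    path "${fastq.simpleName}.stats.txt"

    script:
    """
    seqkit stats ${fastq} > ${fastq.simpleName}.stats.txt
    """
}
Anyone who checks out the same revision of the repository and runs this process, whether on a laptop, an HPC cluster, or the cloud, executes the same command inside the same pinned environment.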
Picking a WDL
There are many different workflow management systems out there, each developed for specific use cases; so many that it is almost overwhelming. A long list of workflow languages can be found here. It’s not even unheard of for a company to build its own in-house; for example, GRAIL has reflow and Insitro has redun.
The most commonly used WDLs in bioinformatics are Airflow, Nextflow, Snakemake, and Cromwell.
There are a few things to consider when picking which WDL to use:
Architecture support
Different languages work best for running on cloud vs. on HPC clusters, with differing levels of support for cloud infrastructure and containerization.
Scalability
Consider how the language will scale to more samples and to larger samples.
Community support
All of these tools are open source. Picking one that is actively maintained and well-documented can ease a lot of pressure. Strong community support is also a good indicator of how easy it will be to find engineers with experience in the tool.
Flexibility
Bioinformatics tools are written in many languages and must run in many environments. The flexibility to accommodate different kinds of workflows can be a huge advantage.
Nextflow and Snakemake are growing in popularity and are heavily supported by the bioinformatics community. Because of this, they are generally the best choice if you are setting up a new tech stack.
Scaling your pipeline
Clarifying the code
When writing bioinformatics analyses, you often end up with a long script that does many things. The first step to productionizing and scaling your script is to clean up the code into a structure that is ready to scale.
Below are some tricks we have used over the years to help with that process.
1. Removing “magic numbers” / creating arguments
In software development, a “magic number” is a value (a number or a string) used in your code without any indication of its meaning. These values are often a sign that something should be a named variable. Collecting them and assigning each a name at the top of your script provides clarity to the reader.
For example, see this snippet:
def main():
    # ...
    imgs = ["input_images/img1.png", "input_images/img2.png"]
    for img_path in imgs:
        # ...
        img = skimage.io.imread(img_path)
        norm = np.zeros_like(img)
        cv2.normalize(img, norm, 0, 255, cv2.NORM_MINMAX)
        # ...
We can convert this to the following, taking the image paths as an input argument and turning 0, 255, and the normalization type into named constants.
# Constants
NORM_MIN = 0
NORM_MAX = 255
DEFAULT_NORM_TYPE = cv2.NORM_MINMAX

def main():
    # Arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument('input_imgs', nargs='+',
                        help='List of input image paths')
    args = parser.parse_args()

    for img_path in args.input_imgs:
        # ...
        img = skimage.io.imread(img_path)
        norm = np.zeros_like(img)
        cv2.normalize(img, norm, NORM_MIN, NORM_MAX, DEFAULT_NORM_TYPE)
        # ...
This is particularly critical in life science, where a “magic number” might have been difficult to determine and may dramatically impact results. Is it a metric from a publication? Which one? Is it something you experimentally tuned over time? Is it the value from a control from a different experiment? Documenting this well will help you update this code if the number changes and will reduce the chance of a teammate misinterpreting the number and introducing a bug.
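As a hypothetical illustration, a short comment recording where each value came from goes a long way (the cell-diameter constant and its provenance below are made up for the example):
# Full intensity range for 8-bit images; would become 0-65535 if we
# switch to 16-bit acquisition.
NORM_MIN = 0
NORM_MAX = 255

# Minimum cell diameter in pixels, tuned on the June calibration slides.
MIN_CELL_DIAMETER_PX = 15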
2. Single responsibility principle
Each function should perform a single action. For example, we can split the code into a function that processes the input directory and a function that evaluates the model:
def process_input_dir(input_image_dir):
    input_image_list = glob.glob(os.path.join(input_image_dir, "*"))
    input_image_list.sort()
    return [skimage.io.imread(f) for f in input_image_list]

def eval_model(imgs, cyto_channel, nucl_channel):
    # `model` is assumed to be defined elsewhere (e.g., a Cellpose model).
    channels = [cyto_channel, nucl_channel]
    return model.eval(imgs, diameter=None, channels=channels)

def main():
    # Argument parsing stays in main.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'input_image_dir',
        type=str,
        help="Input image directory.")
    parser.add_argument(
        '-c',
        '--cytoplasm',
        dest='cyto_channel',
        type=int,
        help="Integer representing the cytoplasm channel. Grayscale=0, R=1, G=2, B=3.")
    parser.add_argument(
        '-n',
        '--nucleus',
        dest='nucl_channel',
        type=int,
        help="Integer representing the nucleus channel. None=0, Grayscale=0, R=1, G=2, B=3.")
    args = parser.parse_args()

    imgs = process_input_dir(args.input_image_dir)
    masks, flows, styles, diams = eval_model(imgs, args.cyto_channel,
                                             args.nucl_channel)
3. Function within loops
If you perform the same action within a loop, extract the code into its own function:
def process_img(img_path):
    # ...
    img = skimage.io.imread(img_path)
    norm = np.zeros_like(img)
    cv2.normalize(img, norm, NORM_MIN, NORM_MAX, DEFAULT_NORM_TYPE)
    # ...

def main():
    # Arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument('input_imgs', nargs='+',
                        help='List of input image paths')
    args = parser.parse_args()

    for img_path in args.input_imgs:
        process_img(img_path)
4. Repetitive code
Create a function for any piece of code that is repeated.
def apply_threshold(data, column, threshold):
    return data[data[column] < threshold]

df = apply_threshold(data, 'p_value', 0.05)
df = apply_threshold(df, 'score', 0.1)
Documentation and version control
Having well-documented code allows the next person who edits it to quickly learn the code and make improvements. Each function name should be a verb that describes in simple terms what the function does, with a comment or docstring at the top describing the inputs and outputs.
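For example, a minimal sketch of how the normalization step from the earlier snippets might be documented (the function itself is hypothetical):
def normalize_image(img):
    """Scale the pixel intensities of `img` into the range NORM_MIN..NORM_MAX.

    Inputs:  img - a NumPy array, as returned by skimage.io.imread.
    Outputs: a new array of the same shape with normalized intensities.
    """
    norm = np.zeros_like(img)
    cv2.normalize(img, norm, NORM_MIN, NORM_MAX, DEFAULT_NORM_TYPE)
    return norm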
Storing each pipeline in its own Git repository allows for a clear separation of concerns. Adding a README.md at the base of the repository with instructions on how to run the pipeline and its expected inputs and outputs can also be helpful.
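A minimal README sketch for a hypothetical pipeline might look something like this (the pipeline name, paths, and output layout are illustrative; the --inputDir parameter matches the Nextflow example later in this article):
# scrnaseq-pipeline

## Running
nextflow run main.nf --inputDir /path/to/fastqs

## Inputs
Paired-end FASTQ files named <sample>_1.fastq.gz and <sample>_2.fastq.gz.

## Outputs
Per-sample results written to the results/ directory.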
Parallelization
Parallelizing a bioinformatics pipeline enhances performance by distributing tasks across multiple processors or machines, which is essential for large datasets and time-sensitive projects.
Parallelization in software can take many different forms:
Task level: Processing multiple tasks in parallel. This is common within data engineering workflows, where data flows into a system that must process the data for many different use cases.
Data level: Processing the same task on different pieces of data in parallel. This is common when you need to process large data all at once.
Data-level parallelism is the most common use case in bioinformatics, where you receive hundreds of samples at once and must preprocess them all to an analyzable state.
The following Nextflow workflow takes in a channel of FASTQ files, groups them into read 1 / read 2 pairs, and feeds them through PROCESS_A and PROCESS_B simultaneously. The workflow fans these steps out across the maximum allowed compute.
// Create a channel for input FASTQ files
Channel
    .fromFilePairs("${params.inputDir}/*_{1,2}.fastq.gz")
    .set { read_pairs_ch }

workflow {
    PROCESS_A(read_pairs_ch)
    PROCESS_B(read_pairs_ch)
    PROCESS_C(PROCESS_A.out)
}
In this example, PROCESS_A will run in parallel for each pair of input FASTQ files, potentially processing multiple samples simultaneously.
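What PROCESS_A actually does is up to you; here is a hypothetical sketch of a read-trimming step, where each FASTQ pair becomes an independent task that the executor can schedule in parallel (the tool choice and resource values are illustrative):
process PROCESS_A {
    // Per-task resources; the executor launches as many tasks in parallel
    // as the available compute allows.
    cpus 4
    memory '8 GB'

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("${sample_id}_trimmed_*.fastq.gz")

    script:
    """
    fastp -i ${reads[0]} -I ${reads[1]} \\
          -o ${sample_id}_trimmed_1.fastq.gz -O ${sample_id}_trimmed_2.fastq.gz
    """
}
Pointing the same workflow at a larger executor (an HPC scheduler or a cloud batch service) scales it out without changing the process definition.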
Conclusion
This is the final article in our Intro to Bioinformatics Engineering series. We’ve covered everything from a high-level look at what it means to be a bioinformatics engineer to practical tips on containerization and building robust pipelines.
We hope you have enjoyed and learned something to take away into your next project!
Madeline Schade is the CTO and Co-Founder of Mantle. Her favorite organism is Carcharodon carcharias.