The aim of this post is to present a few cases where containers offer significant advantages for the bioinformatician, as well as to share practical insight from a software engineering perspective on how to leverage Docker like a professional. While the first post in our Intro to Bioinformatics Engineering series presented a conceptual model for building scientific workflows, this article offers tactical approaches for building bioinformatics software with containers.
A Brief History of Containers
The past decade saw containers become a permanent part of the software engineer’s toolkit. The latest in a series of virtualization technologies that began with the virtual machine, containers proved a popular building block for software applications, providing functional isolation and clean interfaces to hardware and operating system resources like CPU, RAM, networking, and storage. Cloud providers and other software vendors quickly took advantage of the technology (later standardized through the Open Container Initiative) and developed an ecosystem of new tools. One product in particular, Docker, became the de facto standard for building and running containers thanks to its simple configuration language and well-built tooling.
Engineers found that containers enabled powerful new workflows. They could run the same code nearly anywhere, across different architectures, because the operating system became an interface rather than a hard dependency. Software was reproducible because all of its dependencies were baked into the container image. In addition, containers could start up without booting a separate guest operating system, speeding up launch times and enabling rapid scaling.
More than a decade after Docker first appeared, containers are finding their way into the bioinformatician’s toolkit. Analyzing large multi-omic datasets and applying new machine learning techniques demand sophisticated software, and today that software is increasingly built on containers.
Applications for Bioinformatics
One of the core benefits of using Docker is reproducibility – in practical terms, your code, packaged in a container image, will behave the same in five years as it does today. Imagine your favorite computer game from childhood still being playable on your new laptop, or a Python script from graduate school that still runs even though the libraries it depends on have changed many times since. That is what Docker enables.
Case Study: Sharing An Analysis With Colleagues
You’ve done all the hard work – you’ve completed a rigorous statistical analysis of your biological dataset and found a few surprising results. You’ve even gone the extra mile – you use Git to version control your code, and you store your datasets in the cloud so that they are backed up and accessible to your collaborators.
You reach out to your colleague and have her pull your code and data; you give her a list of Python dependencies to install and have her run the code. To your dismay, she messages you back with an error message:
read_csv() got an unexpected keyword argument 'error_bad_lines'
“But…” you cry out, “it worked on my machine!”
A cursory Google search reveals that this particular argument to `read_csv()` was deprecated in pandas 1.3 and removed entirely in pandas 2.0; you were running an older 1.x release on your machine, while your colleague had upgraded to 2.x. An easy bug to sympathize with, but one that could have been avoided had you shared a Docker container instead of your code directly.
You decide to package your code into a container using Docker. You read that you need a requirements.txt file to explicitly list your Python dependencies and their versions. It looks something like:
pandas==2.2.2
numpy==1.26.2
scanpy==1.9.3
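If you are not sure exactly which versions your analysis currently uses, pip can report them for you; one quick way to capture them (and then trim the list down to your direct dependencies) looks like this:
# Record the exact package versions installed in your current environment
pip freeze > requirements.txt
# Or inspect a single package before pinning it by hand
pip show pandas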
Within your Dockerfile, you copy over the requirements.txt file and your updated Python script and install the dependencies:
FROM python:3.11
# Set the working directory inside the container
WORKDIR /src
# Copy and install pip requirements
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy your python script
COPY main.py .
# Run this command when the container runs
CMD ["python3", "main.py"]
You pull up an old blog post that taught you how to build your container and store it in a repository for your colleague to access (see below). You message your colleague with instructions for how to pull and run the container, which she does successfully. She sees the results of your analysis and is confident that you will win the Nobel Prize.
Case Study 2: Running Computational Pipelines in the Cloud
Another core benefit of using Docker is the ability to run your code “anywhere.” As mentioned earlier, containers are the latest in a series of virtualization technologies that began with the virtual machine (or, one could argue, the operating system itself). Virtual machines broke the link between hardware and the host operating system. This link was replaced with a piece of software called a ‘hypervisor’, which provided virtual interfaces for hardware resources like CPU and RAM. This technology allows you to run multiple operating systems on top of the same hardware.
Containers take this one step further. Operating systems are cumbersome to install and don’t fully abstract away networking or storage. Technologies like Docker provide something called a “container engine” that sits on top of the host operating system to share operating-system-level resources with multiple isolated applications.
As a computational biologist in an industry lab, you are tired of running the same script every few days to generate results for your scientists. You research pipeline-running software and learn about an open-source tool called Nextflow.
You do the difficult work of understanding the Nextflow model and refactoring your scripts to take in standard inputs and produce consistent outputs. You understand where you can modularize your pipeline to take advantage of parallelization. You learn from the documentation that you can leverage containers to create consistent script environments.
You quickly realize the new capabilities you possess – simply by learning a little about container registries and Dockerfiles, you can now run your scripts in production-ready, highly parallel Nextflow pipelines on your on-prem hardware or in the cloud. All you need to do to support a large influx of data is click a few buttons to increase the compute capacity of your cloud pipeline, which you set up in a day by following the Nextflow documentation.
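As a rough sketch of what that looks like in practice, once your processes point at a container image, a single command launches the whole pipeline with Docker enabled (the script name and image tag here are illustrative):
# Run the pipeline, executing each process inside the specified Docker image
nextflow run main.nf -with-docker my_registry/my_tools:1.0.0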
Using containers allowed you to run your pipeline on different architectures with little overhead and allowed you to rapidly scale with just a few clicks. What a time to be alive.
A Software Engineer’s Real-world Advice For Using Docker Effectively
Use a Container Registry to Version Your Images
Just as repositories like GitHub and GitLab provide a place to store and access code, container registries allow developers to store images and binaries. Previously, you would have to manually ensure your production instance was configured exactly like your local workstation, checking dependencies and configuration files line by line. Using a container registry allows you to access the same environment across the cloud, your local machine, or your specialized hardware; all that is needed is an operating system with Docker installed and a network connection.
Many software vendors and cloud providers offer a container registry service, differentiating on access controls, price, and integrations with other services. Below, we present a common workflow using AWS Elastic Container Registry (ECR) and Docker to demonstrate how you might build, push, and access a container using a registry.
#!/bin/bash
# Build the Docker image for a specific platform and label it as 'my_server'
docker build --platform linux/amd64 -f path/to/Dockerfile -t my_server .
# Log in to Amazon ECR
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com
# Tag the 'my_server' image with a label that specifies where it will be pushed
# and what version it is (1.0.0)
docker tag my_server:latest $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/my_server_repository:1.0.0
# Push the 'my_server' image to the specified ECR repository
docker push $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/my_server_repository:1.0.0
There are a few subtleties in the above script that demonstrate the power of this workflow.
In the first step, we built the image locally and labeled it `my_server`; by default, Docker assigns this image a tag called `latest`, so we reference the local image as `my_server:latest`.
In the second step, we use the AWS CLI to get our login credentials for ECR (this requires us to be locally authenticated against AWS). We then provide those credentials to our local Docker client to pull and push images to ECR. Think of this like having a registered ssh key on your local machine so you can pull and push to GitHub.
In the third and fourth steps, we re-tag our local image with a fully qualified name that includes the registry’s domain, so that our Docker client knows where to push the image. The exact format varies across container registry providers, but the structure is similar, usually `<registry_domain_name>/<repository_name>:<version>`.
With just a few lines of code, we now have versioned images that we can pull anywhere. Note that it is the compiled image that is versioned here, not the source code. Much like having development and production branches for code, you may want to have development and production repositories: the former accessible to developers, while the latter may only be accessible within a CI/CD pipeline. This ensures that production images are built in a consistent environment and only from validated code that has been merged into your production branch.
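On any other machine with Docker installed and network access to the registry, retrieving that exact build is a single command (reusing the same placeholder account and repository names as above):
# Pull the exact image version that was pushed from the build machine
docker pull $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/my_server_repository:1.0.0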
Optimize your Dockerfile
It may be enough to copy/paste examples from the internet or a chatbot to get started writing your first Dockerfile, but understanding a few of the underlying mechanics pays dividends. Below are a few strategies to help you optimize your Dockerfile.
Caching
Software engineers often talk about containers being composed of layers. Think of Photoshop: the final image is a composition of independent layers, one for the background, one for the foreground, and one for intricate styling. Docker containers are constructed in the same way. There is a base layer, which can be an operating system interface or a language runtime, and intermediate layers that define new behavior. When building images, Docker creates a new layer for each instruction in the Dockerfile.
To optimize build times, Docker caches these layers and only rebuilds a layer when it detects a change. Because each layer builds on the ones beneath it, a change also invalidates every layer that follows: if a container has 7 layers and Docker detects a change to layer 3, it rebuilds layers 3 through 7 while reading the first two from its cache.
Consider the following Dockerfile snippet:
FROM python:3.11
# Copy and install pip requirements
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt
# Copy the application code
COPY src/ /app/src/
COPY main.py /app/
...
In this example, we first copy and install the Python requirements before we copy our application code. In a common local development workflow, developers would iterate on the application code and may repeatedly build the image locally to test it end-to-end. If we had to re-install the dependencies for each build, iterating would take significantly longer. Therefore, it matters where we install the dependencies in our Dockerfile. If we had copied and installed the Python requirements after we copied our application code, Docker would see that the application code had changed and would invalidate the cache for all subsequent instructions, thereby forcing a reinstall for each build. We can iterate significantly faster by simply installing Python requirements before we copy the application code.
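For contrast, here is a sketch of the same snippet with the copy steps reversed, showing the ordering to avoid: any edit to the application code now invalidates the cached dependency layer and forces a full reinstall.
FROM python:3.11
# Copying the frequently-changing application code first...
COPY src/ /app/src/
COPY main.py /app/
# ...means this expensive install re-runs on every code change
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt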
Use Environment Variables (Wisely)
In containerized applications, environment variables provide a powerful way to configure applications at runtime. Unlike monolithic servers where environment variables are shared across the entire host, containers isolate these variables, eliminating conflicts and race conditions. This isolation allows each container to serve a single functional purpose with its own specific configuration, enhancing security and manageability.
Consider the following Dockerfile snippet:
FROM golang:1.20
ENV APP_ENV=development
ENV DB_URL=""
...
A developer running this container would pass in arguments for their local database, enabling local development while ensuring access only to resources they have been explicitly granted.
docker run -e DB_URL="your_database_url" my_image
Environment variables can be used for various purposes. We use them to adapt our application to different environments (development, testing, production) and pass configuration information and, when appropriate, secrets. If you decide to store sensitive information in environment variables, it is important to understand a few principles to keep that information secure.
First, secrets should not be stored unencrypted; that means you should not hard-code them in your Dockerfile. Instead, keep them in an encrypted secret store that you access at runtime, such as AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault.
Second, remember that you can retrieve secrets at runtime in different ways. You could programmatically access the secret store from within your application instead of passing a variable, and if you run on a cloud provider, you may be able to delegate some permissions to them (for example, AWS IAM handles access to other AWS services for you).
Third, remember that anyone with superuser access to the host operating system running your containers can generally inspect those containers, including their environment variables; be careful where you deploy production services.
If you decide to pass sensitive information as an environment variable, here is an example of how to mount the secret at runtime using AWS Secrets Manager:
# Fetch the secret payload from AWS Secrets Manager
SECRET_JSON=$(aws secretsmanager get-secret-value --secret-id my_secret_id --query SecretString --output text)
# Pass the individual values to the container as environment variables
docker run \
    -e DB_URL="your_database_url" \
    -e SECRET_KEY="$(jq -r '.SECRET_KEY' <<< "$SECRET_JSON")" \
    -e AWS_REGION="us-west-2" \
    -e APP_ENV=production \
    my_image
This command might be run as part of the launch script for your production service in an environment that has been explicitly granted the correct permissions, making it more secure.
Prevent Bloat
A common criticism of Docker is the significant disk space required by images, often ranging from a few hundred megabytes to a gigabyte. Preventing image sizes from getting too large requires careful optimization. Here are a few best practices:
Use Minimal Base Images
A minimal language-runtime image (such as python:3.11-slim or golang:1.20-alpine) is far lighter than a full-fledged OS image (like ‘ubuntu’ or ‘debian’) with a toolchain installed on top of it. Unless you need a package that requires access to OS internals, opt for a minimal runtime image. Use official images from trusted sources whenever possible, as they are generally better maintained and more secure than the alternatives.
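In the earlier Python examples, that can be as small a change as swapping the base image for one of the official slim variants:
# Same interpreter, far fewer preinstalled system packages
FROM python:3.11-slim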
Use a `.dockerignore` File
Create a `.dockerignore` file to exclude unnecessary files and directories from the build context (the set of files sent to the Docker daemon during build).
For example:
.git
node_modules
tmp
Combine Commands
It’s important to remember that Docker creates a separate layer for each instruction. Although combining commands may sacrifice some readability, it can save significant space. It is also a good idea to delete package caches and other vestigial build artifacts within the same RUN instruction, so they never persist into the final image.
RUN apt-get update && apt-get install -y \
        package1 \
        package2 \
        package3 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
Conclusion
Learning any new technology takes time, but at Mantle, we believe the gains from using containers far outweigh the effort required to learn them. We hope that we’ve provided some practical insight into how containers are used by professional engineers so that you can get up to speed as quickly as possible.
If you want to learn more about how to get the most from your multi-omic data, stay tuned for more bioinformatics tips in upcoming posts in our Intro to Bioinformatics Engineering series and beyond.
Aakash Shah is a Senior Software Engineer at Mantle. His favorite organism is Ornithorhynchus anatinus.