Cloud Computing for Life Science: The Way Forward
Leveraging the cloud to accelerate biotech innovation
In the rapidly evolving world of biotechnology, companies face the ongoing challenge of choosing the most efficient computing infrastructure to support their advanced research and development efforts. This blog post will explore why cloud computing is increasingly becoming favored over on-premises solutions, particularly for biotech companies seeking scalability, flexibility, and cost-effectiveness.
Understanding the needs of biotech companies
Biotech companies are at the forefront of scientific innovation, requiring substantial computational resources for tasks like genomic sequencing, drug discovery, and data analysis. These tasks demand high computing power and the ability to scale resources up or down based on the dynamic nature of research projects.
Cloud computing is particularly beneficial for biotech companies as it offers unparalleled scalability and flexibility. With cloud computing, organizations can easily scale up their computing resources to handle large datasets and complex calculations during peak times, such as when running intensive genomic analyses or large-scale simulations. Conversely, they can scale down during periods of lower demand, optimizing costs and resource usage. Additionally, cloud services provide access to cutting-edge technologies and infrastructure without the need for substantial upfront investments, enabling biotech companies to stay at the forefront of innovation while maintaining financial efficiency.
With these benefits in mind, comparing the practical aspects of running workflows on the cloud versus on-premises is essential. For instance, running the nf-core RNA-seq pipeline on AWS offers dynamic scalability, enabling biotech companies to handle peak computational loads without the need for significant upfront hardware investments. In contrast, on-premises solutions require substantial initial capital expenditure for hardware and ongoing maintenance costs. While on-premises infrastructure might provide lower latency and more control over physical resources, it lacks the flexibility and cost optimization offered by cloud services. By evaluating the costs and benefits of each approach, biotech companies can make informed decisions about their data infrastructure. The next section will explore how to estimate the cost of running nf-core RNA-seq on AWS compared to on-premises solutions, providing insights into the factors that influence these expenses and strategies for optimizing them.
Comparing the costs of compute
Running nf-core RNA-seq on AWS
When running bioinformatics pipelines like nf-core RNA-seq on the cloud, understanding the associated costs is crucial for budgeting and resource allocation. Below is the cost estimation for running the nf-core RNA-seq pipeline on AWS using memory-optimized instances with a 100GB dataset.
Assumptions and setup
For our cost estimation, we assume the following setup:
Instance type: Memory-optimized r5.large AWS Elastic Compute Cloud (EC2) instance
Number of samples: 10 samples
Data size per sample: 100 GB
Runtime: Approximately 10 hours per sample
Storage needs: Temporary storage to stage samples
Data transfer: Depends on long-term storage setup; within AWS incurs no additional transfer costs
Cost breakdown
Compute costs:
Using the r5.large EC2 instance in the US East (N. Virginia) region, which costs approximately $0.126 per hour, we calculate the compute costs:
10 hours × $0.126/hour = $1.26 per sample
Storage costs:
A general-purpose SSD (gp2) volume on AWS Elastic Block Store (EBS) costs $0.10 per GB per month. We assume 200 GB of temporary EBS storage per instance to stage each 100 GB sample.
Assuming 1 day of usage: 200 GB × $0.10/GB-month × (1/30 month) ≈ $0.67 per sample
Total cost
Compute: $1.26 / sample
EBS: $0.67 / sample
Total: $1.93 per sample
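Under the assumptions above, the per-sample estimate can be reproduced in a few lines of Python. The rates are the example figures from this post; actual AWS pricing varies by region and changes over time:

```python
# Rough per-sample AWS cost model for the setup described above.
R5_LARGE_HOURLY = 0.126   # USD/hour, r5.large on-demand, us-east-1
GP2_GB_MONTH = 0.10       # USD per GB-month, EBS gp2
HOURS_PER_SAMPLE = 10     # pipeline runtime per sample
EBS_GB = 200              # temporary staging volume per instance
EBS_DAYS = 1              # volume lifetime per sample

compute = HOURS_PER_SAMPLE * R5_LARGE_HOURLY        # $1.26
ebs = EBS_GB * GP2_GB_MONTH * (EBS_DAYS / 30)       # ≈ $0.67
cost_per_sample = compute + ebs                     # ≈ $1.93
```

Adjusting the constants (e.g., a larger instance or longer runtime) immediately shows how sensitive the per-sample cost is to each assumption.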
Building an on-prem machine to run nf-core RNA-seq
The summary table provides a detailed breakdown of the costs of building a high-performance PC for running the nf-core RNA-seq pipeline, including the estimated annual maintenance and upgrade costs.
Here is the summary of the costs:
Initial build cost: $2,200
Annual maintenance and upgrade costs:
• Annual maintenance: $200
• Annual upgrades: $300
Total annual cost: $500
Total cost over 3 years:
• Initial cost: $2,200
• Annual costs for 3 years: $500 × 3 = $1,500
Total cost over 3 years: $2,200 + $1,500 = $3,700
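The same three-year total cost of ownership can be expressed as a small calculation, using the example build and upkeep figures from this post:

```python
# Three-year total cost of ownership for the example PC build.
INITIAL_BUILD = 2200        # USD, one-time hardware cost
ANNUAL_MAINTENANCE = 200    # USD/year
ANNUAL_UPGRADES = 300       # USD/year
YEARS = 3

annual_cost = ANNUAL_MAINTENANCE + ANNUAL_UPGRADES   # $500/year
pc_total_3yr = INITIAL_BUILD + annual_cost * YEARS   # $3,700
```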
Break-even point analysis: AWS vs. PC build for RNA-seq
The plot above incorporates the estimated annual maintenance and hardware upgrade costs for a 3-year period, providing a comprehensive comparison between running the nf-core RNA-seq pipeline on AWS and building a dedicated PC.
Key points:
AWS cost:
The cost of processing 100 GB samples on AWS is $1.93 per sample.
PC build cost:
Initial investment: $2,200
Annual maintenance and upgrades: $500 per year
Total cost over 3 years: $3,700
The break-even point for a 3-year period occurs at approximately 1,917 samples. This means if you process more than 1,917 samples over 3 years, building a PC/on-prem machine becomes more cost-effective than using AWS. This is equivalent to running ~1.75 samples per day.
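The break-even figure follows directly from dividing the PC's three-year total by the AWS per-sample cost. A minimal sketch, using the estimates from this post:

```python
# Break-even: the number of samples at which the PC build's
# 3-year total matches AWS pay-per-sample pricing.
AWS_COST_PER_SAMPLE = 1.93   # USD, from the estimate above
PC_TOTAL_3YR = 3700          # USD, build + 3 years of upkeep

break_even_samples = PC_TOTAL_3YR / AWS_COST_PER_SAMPLE   # ≈ 1,917
samples_per_day = break_even_samples / (3 * 365)          # ≈ 1.75
```

Below roughly 1.75 samples per day of sustained throughput, the pay-as-you-go model wins; above it, the dedicated machine does.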
Comparing the costs of storage
Many biotechnology companies hesitate to migrate to cloud storage due to concerns over cost. However, a detailed comparison reveals that leveraging cloud services like AWS S3 Glacier for raw data and AWS S3 Standard for pipeline outputs can be more cost-effective than maintaining on-premises storage. On-premises storage requires a significant initial investment in hardware, such as a 100TB server costing around $4,799.99, plus additional hardware for backups and ongoing expenses for maintenance, power, and cooling.
In contrast, AWS S3 Glacier offers long-term data archiving at a low cost of $0.00099 per GB per month, making it possible to store 95TB for approximately $3,385.80 over three years. AWS S3 Standard provides immediate access to frequently used data at $0.023 per GB per month, costing around $4,140.00 for 5TB over the same period. Thus, the total cost of using AWS S3 Glacier and S3 Standard for 100TB over three years is roughly $7,525.80. Included in this price is data replication across multiple AWS Availability Zones, which protects against data loss.
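These storage figures can be reproduced with the per-GB-month rates quoted above (using decimal units, 1 TB = 1,000 GB, as in the totals in this post; published S3 pricing may differ by region and over time):

```python
# Three-year cost of the 95 TB Glacier / 5 TB S3 Standard split.
GLACIER_GB_MONTH = 0.00099   # USD per GB-month, S3 Glacier
STANDARD_GB_MONTH = 0.023    # USD per GB-month, S3 Standard
MONTHS = 36                  # 3 years

glacier_cost = 95_000 * GLACIER_GB_MONTH * MONTHS    # $3,385.80
standard_cost = 5_000 * STANDARD_GB_MONTH * MONTHS   # $4,140.00
storage_total_3yr = glacier_cost + standard_cost     # $7,525.80
```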
Splitting storage between AWS S3 Glacier and S3 Standard
In bioinformatics, the data generated and used in analyses can be broadly categorized into raw data and processed output data. Here’s an explanation for the reasoning behind splitting cloud storage into 95TB for raw data and 5TB for pipeline output files:
1. Raw data (95TB)
• Volume: Most of the data (95%) used in bioinformatics is raw data, such as high-throughput sequencing data, which are typically large files. This can include BCL and FASTQ files from RNA-seq, DNA-seq, or other omics technologies.
• Usage: Raw data is essential for the initial stages of analysis and must be stored securely and reliably. However, once processed, it is not frequently accessed.
• Storage solution: Raw data can be stored in AWS S3 Glacier, which is a cost-effective storage solution for long-term archival. S3 Glacier is suitable for data that does not need to be accessed frequently but must be retained for compliance or future re-analysis.
2. Processed output data (5TB)
• Volume: Processed data, such as count matrices, variant call files (VCFs), and other summary files, constitutes a smaller portion of the total data (5%). These files are significantly smaller than the raw data files but are critical for downstream analysis.
• Usage: These files are used frequently for various downstream analyses, visualization, and reporting. They need to be readily accessible to researchers for further analysis and interpretation.
• Storage solution: Processed output files should be stored in a more accessible and faster cloud storage solution, such as AWS S3 Standard, to ensure quick and easy access for ongoing research and analysis.
Summary
The split between 95TB for raw data and 5TB for processed output data reflects the different storage needs based on the usage patterns:
• Raw data: Stored in AWS S3 Glacier for cost-effective, long-term storage. This data is not frequently accessed but needs to be retained.
• Processed output data: Stored in AWS S3 Standard for quick and frequent access required for ongoing bioinformatics analysis and research.
This approach optimizes storage costs while ensuring that the necessary data is available, without compromising accessibility or data integrity.
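In practice, the split can be automated with a simple routing rule that assigns a storage class by file type. The suffix lists below are illustrative assumptions for this sketch, not a standard; with a client library such as boto3, the returned value could be passed as the storage class when uploading:

```python
# Illustrative routing rule: decide which S3 storage class a
# pipeline file belongs in, based on raw vs. processed data.
# Suffix lists are example assumptions, not an exhaustive standard.
RAW_SUFFIXES = (".fastq.gz", ".fq.gz", ".bcl")

def storage_class(filename: str) -> str:
    """Return the S3 storage class for a bioinformatics file."""
    if filename.endswith(RAW_SUFFIXES):
        return "GLACIER"    # archival: raw sequencing data
    return "STANDARD"       # frequent access: VCFs, count matrices, reports
```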
Limitations of on-prem computing
Traditionally, biotech firms have relied on on-premises infrastructure. However, this comes with significant drawbacks:
High capital expenditure: Setting up and maintaining on-prem infrastructure requires a hefty initial investment, often hundreds of thousands of dollars for state-of-the-art servers and data storage solutions.
Scalability issues: Scaling on-prem infrastructure can be slow and costly. It often involves purchasing additional hardware that might not be used at full capacity, leading to inefficiencies.
Maintenance and upgrades: On-prem systems require ongoing maintenance by highly skilled IT staff, adding to operational costs. Keeping up with the latest technology also necessitates regular hardware upgrades.
Benefits of cloud computing for biotech
Cloud computing offers several advantages that align perfectly with the needs of the biotech sector:
Flexibility and scalability: Cloud services provide the ability to quickly scale computing resources up or down. This is crucial for biotech firms that may need to ramp up resources for large-scale experiments or dial them back during less intensive periods.
Cost-effectiveness: With cloud computing, companies pay only for the resources they use. This "pay-as-you-go" model can lead to significant cost savings compared to the fixed costs associated with maintaining on-prem infrastructure.
Advanced technologies and collaboration: Cloud providers often offer advanced analytics and machine learning services that can be integrated seamlessly into biotech workflows. Additionally, the cloud facilitates easier data sharing and collaboration across distributed teams, which is essential for today's remote work environment.
Built-in compliance: Cloud services with built-in compliance features, such as AWS HealthLake or Google Cloud’s healthcare solutions, help ensure adherence to industry regulations like HIPAA and GDPR, protecting sensitive patient and research data.
Conclusion
Industry studies suggest that companies adopting cloud computing can reduce their IT expenses by 30-50% compared to maintaining on-premises infrastructure. Additionally, the agility of cloud computing significantly accelerates the time-to-market for new scientific findings or drugs, enhancing potential revenue generation. As these companies continue to push the boundaries of scientific research, the selection of computing infrastructure becomes paramount. Cloud computing offers a more flexible and cost-effective solution than traditional systems, strengthening a company’s ability to innovate and collaborate globally. This not only helps firms stay ahead of the competition but also establishes cloud computing as a strategic imperative for biotech firms determined to lead in innovation and efficiency.
Patriss Moradi is a Senior Software Engineer at Mantle. His favorite organism is Vulpes vulpes.