<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[MantleBio]]></title><description><![CDATA[Connecting Computation and Biology]]></description><link>https://blog.mantlebio.com</link><image><url>https://substackcdn.com/image/fetch/$s_!Miey!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda06f1f-f053-4719-bc64-291befb58629_1200x1200.png</url><title>MantleBio</title><link>https://blog.mantlebio.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 21 Apr 2026 10:00:57 GMT</lastBuildDate><atom:link href="https://blog.mantlebio.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[MantleBio, Inc.]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[mantlebio@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[mantlebio@substack.com]]></itunes:email><itunes:name><![CDATA[Emily Damato]]></itunes:name></itunes:owner><itunes:author><![CDATA[Emily Damato]]></itunes:author><googleplay:owner><![CDATA[mantlebio@substack.com]]></googleplay:owner><googleplay:email><![CDATA[mantlebio@substack.com]]></googleplay:email><googleplay:author><![CDATA[Emily Damato]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Collate acquires MantleBio]]></title><description><![CDATA[We are thrilled to announce that Collate has acquired MantleBio to accelerate the development of AI solutions for the life sciences.]]></description><link>https://blog.mantlebio.com/p/collate-acquires-mantlebio</link><guid isPermaLink="false">https://blog.mantlebio.com/p/collate-acquires-mantlebio</guid><dc:creator><![CDATA[Emily 
Damato]]></dc:creator><pubDate>Mon, 11 Aug 2025 19:26:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/700510b6-eba3-4935-bf62-6616c39b3609_1456x1040.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We are thrilled to announce that <a href="https://collate.com/">Collate</a> has acquired <a href="https://www.mantlebio.com/">MantleBio</a> to accelerate the development of AI solutions for the life sciences.</p><p>Collate and MantleBio share a mission of developing technologies that help bring life-saving innovations to patients faster. Together, they are building an AI-first platform that streamlines the life science product lifecycle from concept to market.</p><h2>About MantleBio</h2><p>Biotech companies generate vast amounts of complex data: a single DNA sequencing experiment can create 7 TB of files. However, the majority of biotech companies lacked the tools necessary to fully leverage this data.</p><p>Founded in 2023, MantleBio developed a modern data engineering platform purpose-built for life sciences. MantleBio's platform enabled research teams to organize and analyze complex biological data in one centralized location while integrating seamlessly with existing research tools like electronic lab notebooks, instruments, and databases. This infrastructure helped teams accelerate discovery by making critical research data more accessible and actionable across the entire organization.</p><p>Emily Damato and Madeline Schade founded MantleBio to make the best data engineering technology accessible to all life science research. They brought over a decade of experience in biotech and big tech, including software development at GRAIL, Google, ArsenalBio, MIT, and the Broad Institute.</p><p>Following a $5 million funding round led by Y Combinator and Initialized, MantleBio quickly scaled their team and product capabilities. 
Mantle's platform has helped scientists process complex multi-omics experiments, prepare data for machine learning, and accelerate critical analysis workflows. The platform enabled research teams to focus on discovery, transforming how life science teams perform data-driven research.</p><h2>AI for Every Step of Life Sciences Innovation</h2><p>At every step of research and development, life science companies generate tremendous amounts of data and documentation. This administrative burden creates bottlenecks that delay new technologies from reaching patients.</p><p>Recent advances in AI technology now make it possible to streamline these processes at scale. Collate leverages artificial intelligence to automate the extensive documentation burdens faced by life science companies. Organizations developing drugs and medical devices require vast amounts of paperwork across their entire operations&#8212;from initial research through clinical trials, manufacturing, and regulatory submissions. By applying generative AI to these workflows, companies can significantly reduce manual documentation work and accelerate the timeline for bringing critical treatments to market.</p><p>MantleBio's team brings expertise in developing software specifically for life sciences, with deep understanding of the unique challenges that R&amp;D and clinical teams face daily. Their experience building data infrastructure for complex biological research, combined with years working directly with scientists and regulatory requirements, makes them ideal partners in advancing Collate's mission.</p><p>Together, we're building AI solutions that understand both the science and the compliance needs of modern life sciences development, accelerating the path from discovery to patients.</p><h2>Thank You</h2><p>Thank you to our partners, investors, team, and the scientific community who pursued this mission with us. 
Together, we will continue to develop technology that helps bring life science innovations to the patients who need them.</p><p><strong>Please stay in touch,</strong></p><p>Emily Damato</p><p><a href="mailto:emily@mantlebio.com">emily@mantlebio.com</a></p><p>Madeline Schade</p><p><a href="mailto:madeline@mantlebio.com">madeline@mantlebio.com</a></p>]]></content:encoded></item><item><title><![CDATA[Designing for Life Science – from 0 to 1]]></title><description><![CDATA[Learnings from the design process]]></description><link>https://blog.mantlebio.com/p/designing-for-life-science-from-0</link><guid isPermaLink="false">https://blog.mantlebio.com/p/designing-for-life-science-from-0</guid><dc:creator><![CDATA[Esmeralda Nava]]></dc:creator><pubDate>Thu, 29 Aug 2024 19:59:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8ecc8e3c-9809-46de-80ee-2e3366fd61ae_5824x4192.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Designing for life science can be particularly tricky because scientists often deal with vast amounts of unstructured data and need tools that can handle complex workflows. As the founding frontend engineer and UX designer, I designed and implemented a frontend that meets these unique needs. This blog post will walk you through the design process and key findings along the path from 0 to 1.</p><h1>Understanding the problem</h1><p>The first step in the design process is to clearly understand the problem. Scientific research generates vast amounts of unstructured data. Computational biologists and bioinformaticians create pipelines to process these types of data, and scientists run these pipelines to analyze their research. Bench scientists and computational biologists may be on the same team or on different teams.</p><h1>User research</h1><p>At Mantle, our user research involved shadowing scientists working in the lab, allowing us to understand their workflows and challenges firsthand. 
By asking scientists to walk us through their processes, we gained valuable insights. Our team&#8217;s strong science background played a crucial role in these visits by facilitating in-depth discussions.</p><p>A key finding was that scientists generate data of different types (using different instruments or assays) for different samples, and they prefer to group data by sample or sample property. For example, instead of categorizing by data type (e.g. all flow cytometry data), they organized datasets related to a specific sample, which could involve multiple data types (e.g. sample 123&#8217;s flow cytometry, sequencing, and ELISA data). This insight helped us design a data management system that aligns with their natural organizational habits, making their workflows more intuitive and efficient.</p><h1>Define the problem and ideate</h1><p>After brainstorming with the team, we identified critical "how might we" questions that guided our design process:</p><ul><li><p>How might we make searching and filtering for datasets fast and intuitive?</p></li><li><p>How might we make pipeline versioning and releases easy?</p></li><li><p>How might we better communicate the science and progress to biotech leaders?</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8G-w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8G-w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png 424w, 
https://substackcdn.com/image/fetch/$s_!8G-w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png 848w, https://substackcdn.com/image/fetch/$s_!8G-w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!8G-w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8G-w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png" width="1456" height="951" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:951,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:773687,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8G-w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png 424w, 
https://substackcdn.com/image/fetch/$s_!8G-w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png 848w, https://substackcdn.com/image/fetch/$s_!8G-w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!8G-w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5985a01a-7bf5-4613-9805-98c927c0ff0c_1596x1042.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Mantle team members brainstorming &#8220;how might we&#8221; questions.</figcaption></figure></div><p>These questions helped us define the core problems: scientists struggle with unstructured data and need an intuitive way to search and filter datasets. They also require an easy method for pipeline versioning and releases. Their leaders need clearer communication of scientific progress. Addressing these issues became our primary goal as we began ideating solutions to enhance data management, streamline workflows, and improve overall efficiency in the lab.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IcpB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IcpB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png 424w, https://substackcdn.com/image/fetch/$s_!IcpB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png 848w, https://substackcdn.com/image/fetch/$s_!IcpB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png 1272w, 
https://substackcdn.com/image/fetch/$s_!IcpB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IcpB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png" width="1456" height="683" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:683,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:880077,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IcpB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png 424w, https://substackcdn.com/image/fetch/$s_!IcpB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png 848w, https://substackcdn.com/image/fetch/$s_!IcpB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png 1272w, 
https://substackcdn.com/image/fetch/$s_!IcpB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0665fce9-b832-4809-91a0-e7e06102a1c2_6912x3240.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">User personas.</figcaption></figure></div><h1>Design and prototype</h1><p>Working at a fast-paced startup like Mantle, our approach is to learn rapidly by releasing products and gathering feedback directly from our users. 
We don't have the luxury of conducting in-depth user studies, creating detailed user journeys, or spending a lot of time on medium-fidelity prototypes. Instead, we prioritize solutions that meet the minimum viable requirements, are easy and quick to build, and can be launched as fast as possible.</p><p>Here are some solutions we focused on that target each of our personas.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Th3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Th3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png 424w, https://substackcdn.com/image/fetch/$s_!4Th3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png 848w, https://substackcdn.com/image/fetch/$s_!4Th3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png 1272w, https://substackcdn.com/image/fetch/$s_!4Th3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Th3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png" width="1456" height="683" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:683,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:417795,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Th3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png 424w, https://substackcdn.com/image/fetch/$s_!4Th3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png 848w, https://substackcdn.com/image/fetch/$s_!4Th3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png 1272w, https://substackcdn.com/image/fetch/$s_!4Th3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fe3656e-4173-43a5-bf57-e48313e367cd_3456x1620.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4sYu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4sYu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png 424w, https://substackcdn.com/image/fetch/$s_!4sYu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png 848w, 
https://substackcdn.com/image/fetch/$s_!4sYu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png 1272w, https://substackcdn.com/image/fetch/$s_!4sYu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4sYu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png" width="1456" height="355" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:355,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2949728,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4sYu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png 424w, https://substackcdn.com/image/fetch/$s_!4sYu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png 848w, 
https://substackcdn.com/image/fetch/$s_!4sYu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png 1272w, https://substackcdn.com/image/fetch/$s_!4sYu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90815618-b9ff-4849-871e-33b1823a6504_5574x1360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Proposed solutions.</figcaption></figure></div><p>Once we settled on promising ideas, I created high-fidelity prototypes on Figma.</p><h2>Data management</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NF_D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NF_D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png 424w, https://substackcdn.com/image/fetch/$s_!NF_D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png 848w, https://substackcdn.com/image/fetch/$s_!NF_D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NF_D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NF_D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png" width="1440" height="915" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:915,&quot;width&quot;:1440,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:158719,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NF_D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png 424w, https://substackcdn.com/image/fetch/$s_!NF_D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png 848w, https://substackcdn.com/image/fetch/$s_!NF_D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NF_D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25c13a3f-2be2-450e-9a76-20325af0f789_1440x915.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">High-fidelity prototype of data management page.</figcaption></figure></div><p>Goals:</p><ul><li><p>Easily search and filter by creator, usage, and modification date.</p></li><li><p>Save frequently used filters for quick access.</p></li><li><p>Identify and organize datasets by type (e.g., Flow, Sequencing, Image).</p></li><li><p>View IDs, names, 
creators, and last modified dates.</p></li><li><p>Show additional dataset properties for more detailed information.</p></li><li><p>Quickly upload new datasets.</p></li><li><p>Star important datasets or archive inactive ones.</p></li></ul><h2>Pipeline versioning</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vMxk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vMxk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png 424w, https://substackcdn.com/image/fetch/$s_!vMxk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png 848w, https://substackcdn.com/image/fetch/$s_!vMxk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png 1272w, https://substackcdn.com/image/fetch/$s_!vMxk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vMxk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png" width="1456" height="911" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:911,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:309947,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vMxk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png 424w, https://substackcdn.com/image/fetch/$s_!vMxk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png 848w, https://substackcdn.com/image/fetch/$s_!vMxk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png 1272w, https://substackcdn.com/image/fetch/$s_!vMxk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e014673-0e4f-4637-b5d1-c54d2560cf40_2880x1802.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">High-fidelity prototype of pipeline page.</figcaption></figure></div><p>Goals:</p><ul><li><p>Easily switch between different pipeline versions from a dropdown menu.</p></li><li><p>View and manage all pipeline runs, including details such as run ID, name, date, status, creator, and updater.</p></li><li><p>Quickly see the status of each run (e.g., running, completed, queued, error).</p></li><li><p>Access detailed information about the current pipeline version, including repository name, GitHub URL, config info, and description.</p></li></ul><p>Import templates and create new runs with the click of a button.</p><h2>Dashboard</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!klhL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!klhL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png 424w, https://substackcdn.com/image/fetch/$s_!klhL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png 848w, https://substackcdn.com/image/fetch/$s_!klhL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png 1272w, https://substackcdn.com/image/fetch/$s_!klhL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!klhL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png" width="1456" height="925" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:925,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1039425,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!klhL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png 424w, https://substackcdn.com/image/fetch/$s_!klhL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png 848w, https://substackcdn.com/image/fetch/$s_!klhL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png 1272w, https://substackcdn.com/image/fetch/$s_!klhL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06889841-0cb3-4dec-b5ac-c1f3510de5ec_5760x3660.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">High-fidelity prototype of dashboard page.</figcaption></figure></div><p>Goals:</p><ul><li><p>Displays key pipelines with their last update, status indicators, and quick access to all pipelines.</p></li><li><p>Lists data types with a tabbed interface for quick navigation, showing recent dataset entries and creation dates.</p></li><li><p>Easy access to create a pipeline run, upload datasets, import data types, and import pipelines.&nbsp;</p></li><li><p>Lists recent analyses with creation dates and quick links to detailed views.</p></li><li><p>Overview of ongoing and successful runs, providing a snapshot of current activity.</p></li></ul><h1>Implementation</h1><p>As both a designer and a coder, I am aware of the challenges that arise during the implementation phase and keep them in mind when designing. At Mantle, we leverage Ant Design for our UI components. Although Ant Design has some design limitations, it allows us to implement features rapidly, which is crucial in our fast-paced environment.</p><p>To further enhance our workflow, I utilize ChatGPT for problem-solving and ideation. For those interested in improving their prompt engineering skills, consider exploring <a href="https://www.promptingguide.ai/">dedicated courses</a>.&nbsp;</p><p>My approach involves setting the context for ChatGPT as an expert in the relevant field and then posing specific questions related to that expertise. This method has been instrumental in quickly overcoming complex challenges and accelerating our development process.</p><h1>Continuous improvement</h1><p>The design journey never truly ends. 
We plan to conduct further usability tests, iterate on our designs, and uncover new user insights. Continuous improvement will help us stay aligned with the evolving needs of scientists and computational biologists, ultimately driving more impactful scientific discoveries. Our journey is just beginning, and we are excited to continue refining and expanding our platform to support the future of biotech.</p><div><hr></div><p><em>If you&#8217;d like to try Mantle, sign up for a free account <a href="https://mantlebio.com/get-started/">here</a>.</em></p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[Intro to Bioinformatics Engineering, Part 4: Running in Production]]></title><description><![CDATA[Preparing Bioinformatics Pipelines for Scale]]></description><link>https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering-dea</link><guid isPermaLink="false">https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering-dea</guid><dc:creator><![CDATA[Madeline Schade]]></dc:creator><pubDate>Fri, 28 Jun 2024 19:08:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/36206256-24e3-49e0-b157-3424beda854c_5824x4192.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article is the last in our Intro to Bioinformatics Engineering series, where we&#8217;ve been discussing best practices and practical tips for building bioinformatics software. In <a href="https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering-74f">Part 3</a>, we covered a hands-on example of transitioning a Jupyter notebook to a Nextflow workflow, a first step toward creating a pipeline.</em></p><p><em>But is the pipeline reliable? Will it be easy to update? Is it efficient? In this article, we cover further considerations for creating a high-quality production pipeline.</em></p><h1>Introduction</h1><p>As a bioinformatics engineer, a typical day often involves getting a new set of data from the lab, determining what insights you need to draw, writing the code to generate those insights, and presenting metrics, graphs, and results to the rest of the team. Since you are constantly working with new data for new experiments, you must adapt to changes and produce results that reflect the data and meaningfully impact the next experiment.</p><p>At some point, some of that experimentation may solidify around an essential process. 
Whether you do Nanopore sequencing every week or use a standardized flow cytometry analyte panel daily, you realize you are running some of the same processes every time. This is when (as described in <a href="https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering">Part 1</a>) you are going from a hike to the road, so now you have to build the road!</p><p>In this article, we will discuss some of the best practices for developing production pipelines that your teammates will trust and that you can easily maintain.</p><h1>When and when not to consider a pipeline</h1><p>As a recap from Part 1 of this series, creating a production-ready pipeline is not always the right direction for a given task. Creating a reliable and scalable pipeline takes time. If you invest that time and then run the pipeline only once, running the code in a Jupyter notebook would have been faster and easier.</p><p>Here are a couple of reasons you might consider productionizing a pipeline:</p><ul><li><p><strong>The size of the data is massive</strong></p><p>If you work with single-cell sequencing, you know that the preprocessing from BCL or FASTQ files to a count matrix is done through sets of standard pipelines, no matter how many times you run it. This is because even for the smallest datasets, the computation can take hours and each dataset can produce GBs to TBs of data.&nbsp;</p></li><li><p><strong>It is run by multiple different people</strong></p><p>If you need to share the code and environment with many different people, it's worth considering creating a standard pipeline. This will allow everyone to run the same version in the same environment, no matter who is running it, allowing for effective comparison of results.</p></li></ul><h1>What is a WDL?</h1><p>A Workflow Definition Language (WDL) is a specialized scripting language that defines, manages, and automates complex computational workflows. 
These languages provide a standardized, human-readable format for specifying the sequence of tasks, dependencies, inputs, and outputs involved in data processing pipelines.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ktP_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ktP_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png 424w, https://substackcdn.com/image/fetch/$s_!ktP_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png 848w, https://substackcdn.com/image/fetch/$s_!ktP_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png 1272w, https://substackcdn.com/image/fetch/$s_!ktP_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ktP_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png" width="1103" height="796" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1103,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:145390,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ktP_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png 424w, https://substackcdn.com/image/fetch/$s_!ktP_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png 848w, https://substackcdn.com/image/fetch/$s_!ktP_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png 1272w, https://substackcdn.com/image/fetch/$s_!ktP_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb037bbbe-d858-45b9-8410-8e6d5ee634ad_1103x796.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The syntax of a WDL is typically intuitive and straightforward. This readability promotes collaboration and ensures that workflows are easy to understand, share, and modify. By explicitly defining each component of a workflow, WDLs help eliminate ambiguity, ensuring precise and consistent execution.</p><p>WDLs support parallel execution and efficient resource management, allowing workflows to scale effectively across various computing environments, such as local clusters, cloud platforms, or high-performance computing (HPC) systems. Additionally, WDLs can provide enhanced reproducibility. Putting your script in a git repository may allow you to track which version of the code you ran, but do you know what environment it ran in? What versions of the packages you had installed? 
A workflow definition language can track both the code and the environment.</p><h1>Picking a WDL</h1><p>There are many different workflow management systems out there that were developed for specific use cases, so many that it&#8217;s almost overwhelming. A long list of workflow languages can be found <a href="https://github.com/pditommaso/awesome-pipeline">here</a>. It&#8217;s not even unheard of for a company to build its own in-house; for example, GRAIL has <a href="https://github.com/grailbio/reflow">reflow</a> and Insitro has <a href="https://github.com/insitro/redun">redun</a>.</p><p>The most commonly used WDLs in bioinformatics are Airflow, Nextflow, Snakemake, and Cromwell.</p><p>There are a few things to consider when&nbsp;picking which WDL to use:</p><ul><li><p><strong>Architecture support</strong></p><p>Different languages work best for running on cloud vs. on HPC clusters, with differing levels of support for cloud infrastructure and containerization.</p></li><li><p><strong>Scalability</strong></p><p>Consider how well the language will scale to more samples and to larger samples.</p></li><li><p><strong>Community support</strong></p><p>All of these tools are open source. Picking one that is actively maintained and well-documented can ease a lot of pressure. This can also be a good indicator of how easy it will be to find engineers with experience in the tool.</p></li><li><p><strong>Flexibility</strong></p><p>Bioinformatics tools require many languages and must be run in multiple locations. 
The flexibility to run different workflows can be a huge advantage.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W4Kg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W4Kg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png 424w, https://substackcdn.com/image/fetch/$s_!W4Kg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png 848w, https://substackcdn.com/image/fetch/$s_!W4Kg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png 1272w, https://substackcdn.com/image/fetch/$s_!W4Kg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W4Kg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png" width="1247" height="488" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:488,&quot;width&quot;:1247,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136074,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W4Kg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png 424w, https://substackcdn.com/image/fetch/$s_!W4Kg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png 848w, https://substackcdn.com/image/fetch/$s_!W4Kg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png 1272w, https://substackcdn.com/image/fetch/$s_!W4Kg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5fe65-7b45-44f1-9bce-7cbf61cbf14b_1247x488.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Nextflow and Snakemake are growing in popularity and are heavily supported by the bioinformatics community. Because of this, they are generally the best choice if you are setting up a new tech stack. </p><h1>Scaling your pipeline</h1><h2>Clarifying the code</h2><p>When writing bioinformatics analyses, you often end up with a long script that does many things. The first step to productionizing and scaling your script is to clean the code to a structure ready for scale.</p><p>Below are some tricks we have used over the years to help with that process.</p><h3>1. Removing &#8220;magic numbers&#8221; / creating arguments</h3><p>In software development, a &#8220;magic number&#8221; means using a value (a number or a string) within your code without specifying its meaning. These are often an indication that something is a variable within your code. 
Collecting each of these and assigning a name to the variable at the top of your script can provide clarity to the reader.</p><p>For example, see this snippet:</p><pre><code>def main():
  # ...
  imgs = ["input_images/img1.png", "input_images/img2.png"]
  for img in imgs:
    # ...
    img = skimage.io.imread("input_images/img1.png")
    norm = np.zeros_like(img)
    cv2.normalize(img, norm, 0, 255, cv2.NORM_MINMAX)
    # ...</code></pre><p>We can convert this to the following, taking the image paths as a command-line argument and turning 0, 255, and the normalization type into named constants.</p><pre><code>import argparse

import cv2
import numpy as np
import skimage.io

# Constants
NORM_MIN = 0
NORM_MAX = 255
DEFAULT_NORM_TYPE = cv2.NORM_MINMAX

def main():
  # Arguments.
  parser = argparse.ArgumentParser()
  parser.add_argument('input_imgs', nargs='+',
                      help='List of input image paths')
  args = parser.parse_args()

  for img_path in args.input_imgs:
    # ...
    img = skimage.io.imread(img_path)
    norm = np.zeros_like(img)
    cv2.normalize(img, norm, NORM_MIN, NORM_MAX, DEFAULT_NORM_TYPE)
    # ...</code></pre><p>This is particularly critical in life science, where a &#8220;magic number&#8221; might have been difficult to determine and may dramatically impact results. Is it a metric from a publication? Which one? Is it something you experimentally tuned over time? Is it the value from a control from a different experiment? Documenting this well will help you update this code if the number changes and will reduce the chance of a teammate misinterpreting the number and introducing a bug.</p><h3>2. <strong>Single responsibility principle</strong></h3><p>Each function should perform a single action. For example, from the code above, we can split to have a function for processing inputs and a function for evaluating the model:</p><pre><code>def process_input_dir(input_dir):
  input_image_list = glob.glob(os.path.join(args.input_image_dir, "*"))
  input_image_list.sort()
  return [io.imread(f) for f in input_image_list]


def eval_model(imgs, cyto_channel, nucl_channel):
  channels = [cyto_channel, nucl_channel]
  return model.eval(imgs, diameter=None, channels=channels)


def main():
  # argparsing stays in main
  parser = argparse.ArgumentParser()
  parser.add_argument(
        'input_image_dir',
        type=str,
        help="Input image directory.")

  parser.add_argument(
        '-c',
        '--cytoplasm',
        dest='cyto_channel',
        type=int,
        help="Integer representing the cytoplasm channel. Grayscale=0, R=1, G=2, B=3.")

  parser.add_argument(
        '-n',
        '--nucleus',
        dest='nucl_channel',
        type=int,
        help="Integer representing the nucleus channel. None=0, Grayscale=0, R=1, G=2, B=3.")

  args = parser.parse_args()

  imgs = process_input_dir(args.input_image_dir)

  masks, flows, styles, diams = eval_model(imgs, args.cyto_channel,
  args.nucl_channel)
</code></pre><h3>3. <strong>Function within loops</strong></h3><p>If you perform the same action within a loop, extract the code into its own function:</p><pre><code>def process_img(img_path):
  # ...
  img = skimage.io.imread(img_path)
  norm = np.zeros_like(img)
  cv2.normalize(img, norm, NORM_MIN, NORM_MAX, DEFAULT_NORM_TYPE)
  # ...


def main():
  # Arguments.
  parser = argparse.ArgumentParser()
  parser.add_argument('input_imgs', nargs='+',
                      help='List of input image paths')
  args = parser.parse_args()

  for img_path in args.input_imgs:
    process_img(img_path)</code></pre><h3>4. Repetitive code</h3><p>Create a function for any piece of code that is repeated.</p><pre><code>def apply_threshold(data, column, threshold):
  return data[data[column] &lt; threshold]

df = apply_threshold(data, 'p_value', 0.05)
df = apply_threshold(df, 'score', 0.1)</code></pre><h2>Documentation and version control</h2><p>Well-documented code lets the next person who edits it get up to speed quickly and make improvements. Each function name should be a verb that describes in simple terms what the function does, with a comment at the top describing the inputs and outputs.</p><p>Storing each pipeline in its own Git repository allows for a clear separation of concerns. Adding a README.md at the root of the repository with instructions for running the pipeline and its expected inputs and outputs can also be helpful.</p><h2>Parallelization</h2><p>Parallelizing a bioinformatics pipeline enhances performance by distributing tasks across multiple processors or machines, which is essential for large datasets and time-sensitive projects.</p><p>Parallelization in software can take different forms:</p><ul><li><p><strong>Task level: </strong>Processing multiple tasks in parallel. This is common within data engineering workflows, where data flows into a system that must process the data for many different use cases.</p></li><li><p><strong>Data level: </strong>Processing the same task on different pieces of data in parallel. This is common when you need to process large data all at once.</p></li></ul><p>Data-level parallelism is the most common use case in bioinformatics, where you receive hundreds of samples at once and must pre-process all of them into an analyzable state.</p><p>The following Nextflow workflow takes in a channel of FASTQ files grouped into read 1 and read 2 pairs. Each pair is passed to PROCESS_A and PROCESS_B, which can run simultaneously, and PROCESS_C then runs on the output of PROCESS_A. The workflow will fan out these steps up to the maximum allowed compute.</p><pre><code>// Create a channel for input FASTQ files
Channel
  .fromFilePairs("${params.inputDir}/*_{1,2}.fastq.gz")
  .set {read_pairs_ch}

workflow {
  PROCESS_A(read_pairs_ch)

  PROCESS_B(read_pairs_ch)

  PROCESS_C(PROCESS_A.out)
}</code></pre><p>In this example, PROCESS_A will run in parallel for each pair of input FASTQ files, potentially processing multiple samples simultaneously.</p><h1>Conclusion</h1><p>This is the final article in our Intro to Bioinformatics series. We&#8217;ve covered everything from a high-level view of what it means to be a bioinformatics engineer to practical tips on containerization and building robust pipelines.</p><p>We hope you have enjoyed the series and learned something to take into your next project!</p><p><em>Madeline Schade is the CTO and Co-Founder of Mantle. Her favorite organism is </em><a href="https://en.wikipedia.org/wiki/Great_white_shark">Carcharodon carcharias</a>.</p>]]></content:encoded></item><item><title><![CDATA[Intro to Bioinformatics Engineering, Part 3: Jupyter Notebook to Nextflow Pipeline]]></title><description><![CDATA[Turning your most-used Jupyter Notebook into a pipeline]]></description><link>https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering-74f</link><guid isPermaLink="false">https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering-74f</guid><dc:creator><![CDATA[Lealia Xiong]]></dc:creator><pubDate>Fri, 14 Jun 2024 17:01:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/639dd639-8c06-4325-b268-f18f3a5b347b_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article is part of our Intro to Bioinformatics Engineering series, where we&#8217;ve been exploring best practices and practical tips for how to build for bioinformatics. 
In <a href="https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering">Part 1</a>, we covered the why and when of building pipelines at a high level. Here, we&#8217;ll provide a practical example of building your first pipeline.</em></p><p>You have great Jupyter notebooks you reuse constantly for data analysis&#8230;but which version did you use to make the graph in last month&#8217;s presentation? How will you reprocess the last 20 datasets with your newest version? And how can your teammates use this algorithm for their datasets? Using a pipeline can solve these problems &#8211; here&#8217;s how to get your notebook code into a Nextflow pipeline.</p><h1>Introduction</h1><p>If you&#8217;ve analyzed data using Python, you&#8217;ve probably used Jupyter notebooks. When you get the very first pieces of data from your initial experiments for a new project, a Jupyter notebook lets you explore and iterate quickly. As you refine your experiments and the resulting data gets more standardized, you might come up with a definitive Jupyter notebook that you use over and over with different input files. 
You&#8217;ve gone from bushwhacking to following <a href="https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering">a paved trail with signposts and a map</a>. This is great! You don&#8217;t need to put in a ton of new effort every time you get new data.</p><p>As your notebook becomes a crucial part of your workflow and your colleagues&#8217; workflows, you might start to notice a few issues creeping in. You make some improvements to your algorithm, but now the version of the notebook you used to make past graphs is not documented anywhere. Or, you want to run the new version of the analysis on the last 20 datasets, and it&#8217;d be way quicker if you could parallelize. And you want to make sure that other people who are using your algorithm start using the updated version.</p><p>It&#8217;s time to turn your Jupyter notebook into your first pipeline. In this post, we&#8217;ll show you how.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yPj-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yPj-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!yPj-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png 848w, 
https://substackcdn.com/image/fetch/$s_!yPj-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!yPj-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yPj-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75617,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yPj-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!yPj-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png 848w, 
https://substackcdn.com/image/fetch/$s_!yPj-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!yPj-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ca26f7-53e7-4683-bf8d-a76a13f7cddd_1600x900.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Pipelines are written in workflow definition languages, such as Nextflow</h2><p>Workflow definition languages provide a 
structured framework for describing and orchestrating the series of computational tasks needed to handle and analyze data. For data engineering in general, these tasks typically encompass data extraction, transformation, loading, and analysis.</p><p>There are <a href="https://github.com/pditommaso/awesome-pipeline">many workflow definition languages</a> that data and bioinformatics engineers use to write computational pipelines. We&#8217;ll use Nextflow as an example in this article because it is especially popular with bioinformaticians and computational biologists. As free open-source software, Nextflow is supported by a large community of developers. There&#8217;s also <a href="https://nf-co.re/">nf-core</a>, a large library of open-source bioinformatics pipelines written in Nextflow.</p><p>A Nextflow pipeline consists of one or more modules or processes. A Nextflow process allows you to execute a script, which can be written in any popular scripting language, including Bash, Python, and R.</p><h1>Let&#8217;s get to it!</h1><p>Here are the steps for turning code from your Jupyter notebook into a Nextflow pipeline:</p><ol><li><p><strong>Jupyter notebook<br> </strong>You already have a trusty Jupyter notebook for data processing and analysis. In this example, we&#8217;ll use a notebook written in Python that performs a simple image thresholding task.</p></li><li><p><strong>Python script<br></strong>Turn your notebook into one or more executable scripts. If this is your first Nextflow pipeline, you may want to write one script instead of splitting your workflow into multiple modules. 
In this example, we&#8217;ll make one script.</p></li><li><p><strong>Nextflow pipeline<br></strong>Write a Nextflow pipeline that executes your script(s) for you.</p></li></ol><p>Example code and images are available on our <a href="https://github.com/mantlebio/jupyter-to-nextflow">GitHub</a>.</p><h2>Jupyter notebook</h2><p>Let&#8217;s take a look at an example of a Jupyter notebook that takes in images, segments them, and saves the resulting masks. Here&#8217;s the code for reference (we&#8217;ll break it down in the next section):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yJEx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yJEx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png 424w, https://substackcdn.com/image/fetch/$s_!yJEx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png 848w, https://substackcdn.com/image/fetch/$s_!yJEx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png 1272w, https://substackcdn.com/image/fetch/$s_!yJEx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!yJEx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png" width="1456" height="2749" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2749,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1303029,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yJEx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png 424w, https://substackcdn.com/image/fetch/$s_!yJEx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png 848w, https://substackcdn.com/image/fetch/$s_!yJEx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png 1272w, https://substackcdn.com/image/fetch/$s_!yJEx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b3fc53-28f3-4e33-9b27-aa7b2d463cda_1854x3500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Notable features</h3><h4>Inputs</h4><p>The notebook takes two inputs that the user must change every time they want to run the notebook: <code>input_dir</code>, the path to the directory where the input images are stored, and <code>output_dir</code>, the path to the directory where the processed images should be saved.</p><h4>Imports</h4><p>Depending on the user&#8217;s Python environment, <code>tqdm</code>, <code>numpy</code>, <code>scikit-image</code> (<code>skimage</code>), and <code>opencv-python</code> (<code>cv2</code>) may not be installed. 
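</p><p>A quick way to check this from inside the notebook is to ask Python which of these packages are installed, and at what versions. The cell below is a minimal sketch and not part of the original notebook: the helper name <code>get_installed_versions</code> and the <code>REQUIRED_PACKAGES</code> list are assumptions made for illustration, and it relies only on the standard-library <code>importlib.metadata</code> module (Python 3.8 or later).</p>

```python
from importlib import metadata

# Hypothetical list for illustration. These are PyPI distribution names,
# which differ from the import names for scikit-image (skimage) and
# opencv-python (cv2). OpenCV may also be installed under another
# distribution name, such as opencv-python-headless.
REQUIRED_PACKAGES = ["tqdm", "numpy", "scikit-image", "opencv-python"]


def get_installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in get_installed_versions(REQUIRED_PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

<p>Running a cell like this only reports the state of the current environment; it does not install anything or pin versions for other users.</p><p>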
Additionally, the notebook does not by default store any information about what versions of each package are installed.</p><h2>Python script</h2><p>The first step in building the Nextflow pipeline is to turn the Jupyter notebook into one or more executable scripts. As a first pass, we&#8217;ll turn our example notebook into a single script:</p><pre><code>#!/usr/bin/env python3

import argparse
import os
import glob
import tqdm
import numpy as np
import skimage.io
import cv2

def preprocess(img_path: str) -&gt; np.ndarray:
    """
    Reads in, normalizes, and thresholds a single image.
    Returns np.ndarray of preprocessed image.
    """

    # Read in
    img = skimage.io.imread(img_path)

    # Normalize
    norm = np.zeros_like(img)
    cv2.normalize(img, norm, 0, 255, cv2.NORM_MINMAX)

    # Threshold
    _, thresh = cv2.threshold(norm, 0, 255, cv2.THRESH_OTSU)

    return thresh

def main():

    # Instantiate argument parser
    parser = argparse.ArgumentParser(
        prog='example_script',
        description="Loads images and preprocesses by normalizing and thresholding."
    )
    # Add arguments to the argument parser
    parser.add_argument(
        'input_dir',
        type=str,
        help="Directory containing input images."
    )
    parser.add_argument(
        '-o',
        '--output_dir',
        dest='output_dir',
        type=str,
        required=True,
        help="Directory to which to save outputs."
    )
    # Run argument parser and extract data
    args = parser.parse_args()

    all_image_paths = glob.glob(os.path.join(args.input_dir, "*.tif"))

    # Make sure output directory exists
    os.makedirs(args.output_dir, exist_ok=True)

    # Preprocess all the images in the input directory
    # and write out to the output directory
    for path in all_image_paths:

        # Apply preprocessing
        processed_img = preprocess(path)

        # Save
        basename = os.path.basename(path)
        extension_idx = basename.rfind(".")
        fname = os.path.join(
            args.output_dir, 
            f"{basename[:extension_idx]}_preprocessed.tif"
        )
        skimage.io.imsave(fname, processed_img, check_contrast=False)

if __name__ == "__main__":
    main()</code></pre><h3>Notable features</h3><h4>Shebang</h4><p>The <a href="https://en.wikipedia.org/wiki/Shebang_(Unix)">shebang</a> <code>#!/usr/bin/env python3</code> indicates the interpreter that the program loader should use to run the script (Python3 in this case).</p><h4>Command line arguments</h4><p>We use the <code>argparse</code> <a href="https://docs.python.org/3/library/argparse.html">library</a> to create an argument parser so that the script can take command line arguments as inputs.</p><p>If we were to run this script on its own, the usage would be:</p><pre><code>./example_script.py &lt;input_dir&gt; -o &lt;output_dir&gt;</code></pre><h4>Versioning</h4><p>After you have your script, you can check it into GitHub for version control. You can iterate on it and push new versions, and if you&#8217;ve shared the repository with colleagues, they can pull in your changes. To maximize reproducibility, you can add a <code>requirements.txt</code> file with package versions, or create a Docker container that others can run your script in.</p><p>If you want to go further and put your script into a pipeline, read on:</p><h2>Nextflow pipeline</h2><p>Now, we want to write a pipeline that will run the script we wrote in the last step.</p><p>We create a directory with the following structure:</p><pre><code>example_nextflow_pipeline
&#9474;&#9472;&#9472;&#9472; main.nf [1]
&#9474;&#9472;&#9472;&#9472; nextflow.config [2]
&#9474;
&#9474;&#9472;&#9472;&#9472; bin
&#9474;    &#9492;&#9472;&#9472;&#9472; example_script.py [3]
&#9474;
&#9492;&#9472;&#9472;&#9472; modules
     &#9492;&#9472;&#9472;&#9472; preprocessing
          &#9492;&#9472;&#9472;&#9472;main.nf [4]</code></pre><p>To run the Nextflow pipeline, using the command line, change directory to <code>example_nextflow_pipeline</code>, then run:</p><pre><code>nextflow run main.nf --input_dir &lt;input_dir&gt; --output_dir &lt;output_dir&gt;</code></pre><p>Now, we&#8217;ll go through each component in detail.</p><h3>[1] <code>main.nf</code></h3><p>This is the main pipeline script:</p><pre><code>include { PREPROCESS_IMAGES } from './modules/preprocessing'

log.info """\
    EXAMPLE PIPELINE
    ---------------------
    input_dir: ${params.input_dir}
    output_dir: ${params.output_dir}
"""
.stripIndent(true)

workflow {
    PREPROCESS_IMAGES ( params.input_dir )
}</code></pre><p>The first line imports the <code>PREPROCESS_IMAGES</code> process, which is contained in <code>modules/preprocessing/main.nf</code>.</p><p>The next line outputs information to the console.</p><p>Finally, we have the workflow block, which calls the <code>PREPROCESS_IMAGES</code> process with the input <code>input_dir</code> that was specified in the command line.</p><p><code>params</code> contains the command line arguments, which are anything specified like this: <code>--&lt;argument&gt; value</code>. </p><h3>[2] <code>nextflow.config</code></h3><p>This is the Nextflow configuration file:</p><pre><code>process.container = "&lt;docker_image:tag&gt;"
docker.enabled = true
docker.runOptions = '-u $(id -u):$(id -g) -v /Users:/Users'</code></pre><p>In this example, we specify that processes should run in the given container. Additionally, we specify that Docker should always be used when executing the pipeline, and pass Docker run options that run the container as the current user and group and mount the host&#8217;s <code>/Users</code> directory into the container.</p><p>We&#8217;re only scratching the surface here, though this suffices for our simple example pipeline. For more information on Nextflow configuration, refer to the <a href="https://www.nextflow.io/docs/latest/config.html">Nextflow documentation</a>.</p><h3>[3] <code>bin/example_script.py</code></h3><p>This is the example script that we wrote above.</p><p>Note that you need to make the script executable in order for Nextflow to run it. In Unix-like systems, you can do this in the command line by changing to the <code>bin</code> directory and running:</p><pre><code>chmod +x example_script.py</code></pre><h3>[4] <code>modules/preprocessing/main.nf</code></h3><p>This is the module that contains the process that the Nextflow pipeline will execute:</p><pre><code>process PREPROCESS_IMAGES {

    publishDir params.output_dir, mode: 'copy'
    
    input:
    path process_input_dir

    output:
    path('*')

    script:
    """
    example_script.py \\
    ${process_input_dir} \\
    -o "./"
    """
}</code></pre><p>All files produced by the process script are stored in a work directory. The <code>publishDir</code> directive indicates that the output files of this process (specified in the <code>output</code> block) should be published to <code>output_dir</code>, which we specified as a command line argument.</p><p>The <code>input</code> block defines the input channels of a process, similar to function arguments. Inputs are specified by a qualifier (the type of data) and a name. The name is similar to a variable name. In our example, the input is a path named <code>process_input_dir</code>.</p><p>The <code>output</code> block defines the output channels of a process. These can be accessed by downstream processes, or published to the directory specified by the <code>publishDir</code> directive. Here, the outputs are all the paths to files produced by the process. You can be as specific as you want here; for example, you could specify <code>path('*.tif')</code> to emit only TIF files if other types were produced as well, or <code>path('image006.tif')</code> to emit only that single file.</p><p>The <code>script</code> block defines the script that the process executes. In this case, we are running the <code>example_script.py</code> script, with the <code>process_input_dir</code> process input as the path to the directory with the input files, and with the current working directory <code>./</code> as the directory to which to write the output files (the output files are then published according to the <code>publishDir</code> directive).</p><p>For more information on Nextflow processes, refer to the <a href="https://www.nextflow.io/docs/latest/process.html#processes">Nextflow Documentation</a>.</p><h3>Notable features</h3><h4>Containerization is automatically supported</h4><p>A container is an isolated virtual environment for your code. It allows you to run your code in a reproducible way by having the same packages and the same versions of your packages installed in the environment every time. 
One of the most common container platforms is <a href="https://docs.docker.com/guides/get-started/">Docker</a>. To learn more about containers and their application to bioinformatics, make sure to catch up with the previous post in our series:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;61fc89c4-0c19-40c7-b7e1-fda0bce636af&quot;,&quot;caption&quot;:&quot;The aim of this post is to present a few cases where containers offer significant advantages for the bioinformatician, as well as to share practical insight from a software engineering perspective for how to leverage Docker like a professional. While&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Intro to Bioinformatics Engineering, Part 2: Docker&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:198109610,&quot;name&quot;:&quot;Aakash Shah&quot;,&quot;bio&quot;:&quot;Software Engineer&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/daace69e-eaff-494d-a963-dccbb3c92fd8_1170x1101.jpeg&quot;,&quot;is_guest&quot;:true,&quot;bestseller_tier&quot;:null,&quot;primaryPublicationSubscribeUrl&quot;:&quot;https://blog.mantlebio.com/subscribe?&quot;,&quot;primaryPublicationUrl&quot;:&quot;https://blog.mantlebio.com&quot;,&quot;primaryPublicationName&quot;:&quot;MantleBio&quot;,&quot;primaryPublicationId&quot;:2425618}],&quot;post_date&quot;:&quot;2024-06-07T17:53:13.232Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0746ed6b-cbdf-4d33-85a0-6e2f385dc3a7_1456x1048.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering-681&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:145417898,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&
quot;:null,&quot;publication_name&quot;:&quot;MantleBio&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda06f1f-f053-4719-bc64-291befb58629_1200x1200.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>In the Nextflow configuration, we specified a global Docker container in which the whole pipeline should run. If we had multiple process modules, Nextflow also would allow us to specify a Docker container for each module, if they had different requirements.</p><p>You could use Docker with Jupyter notebooks on your own, but Nextflow makes it easy to always use the same environment each time the pipeline runs.</p><h4>Versioning</h4><p>We didn&#8217;t explicitly touch on this in our tour through the example Nextflow pipeline, but we can implement version control for the pipeline through <a href="https://www.nextflow.io/docs/latest/sharing.html">Nextflow&#8217;s integration with GitHub</a> (along with BitBucket and GitLab). That way, if you&#8217;re sharing the pipeline with colleagues, they can always be up to date (or explicitly run a past version, if that&#8217;s what suits their needs).</p><h1>Wrapping up</h1><p>Nextflow is a powerful tool for creating data processing and analysis pipelines. By simplifying containerization and versioning, it helps you to increase the reproducibility and portability of your code. This post should help you get started with writing pipelines through turning a Jupyter notebook you use over and over into a script and then a Nextflow pipeline.</p><p>We only included one process in our simple example pipeline here. 
In future posts, we&#8217;ll highlight how (and why) you can make your pipelines modular and scalable, and how to run them in the cloud.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.mantlebio.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.mantlebio.com/subscribe?"><span>Subscribe now</span></a></p><p><em>Lealia Xiong is a Senior Applications Engineer at Mantle. Her favorite organism is </em><a href="https://en.wikipedia.org/wiki/Hypsibius_dujardini">Hypsibius exemplaris</a>.</p>]]></content:encoded></item><item><title><![CDATA[Intro to Bioinformatics Engineering, Part 2: Docker]]></title><description><![CDATA[Run Anywhere, Scale Fast, Reproduce Results]]></description><link>https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering-681</link><guid isPermaLink="false">https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering-681</guid><dc:creator><![CDATA[Aakash Shah]]></dc:creator><pubDate>Fri, 07 Jun 2024 17:53:13 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0746ed6b-cbdf-4d33-85a0-6e2f385dc3a7_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The aim of this post is to present a few cases where containers offer significant advantages for the bioinformatician, as well as to share practical insight from a software engineering perspective for how to leverage Docker like a professional. 
While <a href="https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering">the first post in our Intro to Bioinformatics Engineering series</a> presented a conceptual model for building scientific workflows, this article offers tactical approaches for building bioinformatics software with containers.</p><h1>A Brief History of Containers</h1><p>This decade saw containers become a permanent part of the software engineer&#8217;s toolkit. The latest in a series of virtualization technologies that began with the virtual machine, containers proved a popular building block for software applications, providing functional isolation and clean interfaces for hardware and operating system resources like CPU, RAM, networking, and storage. Cloud providers and other software vendors quickly took advantage of this new specification, the Open Container Initiative, and developed an ecosystem of new technologies. One product in particular, Docker, became the de-facto standard for containers due to its simple configuration language and well-built tooling.&nbsp;</p><p>Engineers found that containers enabled powerful new workflows. 
They could run the same code nearly anywhere, on multiple different architectures, because they could treat the operating system as an interface instead of a hard dependency. Software was reproducible because all software dependencies were baked into the container. In addition, containers could start up without rebooting the underlying operating system, speeding up launch times and enabling rapid scaling.</p><p>More than a decade after Docker began, containers are finding their way into the bioinformatician&#8217;s toolkit. Analyzing large multi-omic datasets and applying new machine learning techniques demand sophisticated software, and today, containers are what make it up.</p><h1>Applications for Bioinformatics</h1><p>One of the core benefits of using Docker is <em>reproducibility</em> &#8211; in practical terms: your code, in a compiled container, will behave the same in five years as it does today. Imagine your favorite computer game from childhood being available to you on your new laptop, or a Python script from graduate school that still runs despite your colleague changing their library. That is what Docker enables.</p><h2><strong>Case Study: Sharing An Analysis With Colleagues</strong></h2><p>You&#8217;ve done all the hard work &#8211; you&#8217;ve completed a rigorous statistical analysis of your biological dataset and found a few surprising results. You&#8217;ve even done some extra stuff &#8211; you use Git to version control your code, and you store your datasets in the cloud so that they are backed up and accessible to your collaborators.&nbsp;</p><p>You reach out to your colleague and have her pull your code and data; you give her a list of Python dependencies to install and have her run the code. 
To your dismay, she messages you back with an error message:&nbsp;</p><pre><code>read_csv() got an unexpected keyword argument 'error_bad_lines'</code></pre><p>&#8220;But&#8230;&#8221; you cry out, &#8220;it worked on my machine!&#8221;</p><p>A cursory Google search reveals that this particular argument to `read_csv()` was deprecated in pandas 1.3.0 and removed in 2.0.0; you were running version 1.3.0 on your computer while your colleague had upgraded to a 2.x release. A sympathetic bug, but one that could have been avoided had you shared a Docker container instead of your code directly.</p><p>You decide to package your code into a container using Docker. You read that you need a requirements.txt file to explicitly list your Python dependencies and their versions. It looks something like:</p><pre><code>pandas==2.2.2
numpy==1.26.2
scanpy==1.9.3</code></pre><p>Within your Dockerfile, you copy over the requirements.txt file and your updated Python script and install the dependencies:</p><pre><code>FROM python:3.11

# Copy and install pip requirements
COPY requirements.txt /src/
RUN pip install -r /src/requirements.txt

# Copy your python script and run from the directory that contains it
COPY main.py /src/
WORKDIR /src

# Run this command when the container runs
CMD ["python3", "main.py"]</code></pre><p>You pull up an old blog post that taught you how to build your container and store it in a repository for your colleague to access (see below). You message your colleague with instructions for how to pull and run the container, which she does successfully. She sees the results of your analysis and is confident that you will win the Nobel Prize.</p><h2>Case Study 2: Running Computational Pipelines in the Cloud</h2><p>Another core benefit of using Docker is the ability to run your code &#8220;anywhere.&#8221; As mentioned earlier, containers are the latest in a series of virtualization technologies that began with the virtual machine (or, one could argue, the operating system itself). Virtual machines broke the link between hardware and the host operating system. This link was replaced with a piece of software called a &#8216;hypervisor&#8217;, which provided virtual interfaces for hardware resources like CPU and RAM. This technology allows you to run multiple operating systems on top of the same hardware.&nbsp;</p><p>Containers take this one step further. Operating systems are cumbersome to install and don&#8217;t fully abstract away networking or storage. 
Technologies like Docker provide something called a &#8220;container engine&#8221; that sits on top of the host operating system to share operating-system-level resources with multiple isolated applications.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_98Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb73a9cb-adca-4d73-854b-859817742739_1864x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_98Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb73a9cb-adca-4d73-854b-859817742739_1864x708.png 424w, https://substackcdn.com/image/fetch/$s_!_98Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb73a9cb-adca-4d73-854b-859817742739_1864x708.png 848w, https://substackcdn.com/image/fetch/$s_!_98Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb73a9cb-adca-4d73-854b-859817742739_1864x708.png 1272w, https://substackcdn.com/image/fetch/$s_!_98Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb73a9cb-adca-4d73-854b-859817742739_1864x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_98Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb73a9cb-adca-4d73-854b-859817742739_1864x708.png" width="1456" height="553" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb73a9cb-adca-4d73-854b-859817742739_1864x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:553,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:159122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_98Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb73a9cb-adca-4d73-854b-859817742739_1864x708.png 424w, https://substackcdn.com/image/fetch/$s_!_98Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb73a9cb-adca-4d73-854b-859817742739_1864x708.png 848w, https://substackcdn.com/image/fetch/$s_!_98Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb73a9cb-adca-4d73-854b-859817742739_1864x708.png 1272w, https://substackcdn.com/image/fetch/$s_!_98Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb73a9cb-adca-4d73-854b-859817742739_1864x708.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">https://www.atlassian.com/microservices/cloud-computing/containers-vs-vms</figcaption></figure></div><p>As a computational biologist in an industry lab, you are tired of running the same script every few days to generate results for your scientists. You research pipeline-running software and learn about an open-source tool called <a href="https://nextflow.io/docs/latest/index.html">Nextflow</a>.&nbsp;</p><p>You do the difficult work of understanding the Nextflow model and refactoring your scripts to take in standard inputs and produce consistent outputs. You understand where you can modularize your pipeline to take advantage of parallelization. 
You learn from the <a href="https://nextflow.io/docs/latest/container.html">documentation</a> that you can leverage containers to create consistent script environments.&nbsp;</p><p>You realize quickly the new capabilities you possess &#8211; simply by learning a little bit about container registries and Dockerfiles, you can now run your scripts in production ready and highly parallel Nextflow pipelines on your on-prem hardware or in the cloud. All you need to do to support a large influx of data is click a few buttons to increase the compute capacity of your cloud pipeline, which you set up in a day by following the Nextflow documentation.&nbsp;</p><p>Using containers allowed you to run your pipeline on different architectures with little overhead and allowed you to rapidly scale with just a few clicks. What a time to be alive.</p><h1>A Software Engineer&#8217;s Real-world Advice For Using Docker Effectively</h1><h2>Use a Container Registry to Version Your Images</h2><p>Just as repositories like GitHub and GitLab provide a place to store and access code, container registries allow developers to store images and binaries. Previously, you would have to manually ensure your production instance was configured exactly like your local workstation, checking dependencies and configuration files line by line. Using a container registry allows you to access the same environment across the cloud, your local machine, or your specialized hardware; all that is needed is an operating system with Docker installed and a network connection.</p><p>Many software vendors and cloud providers offer a container registry service, differentiating on access controls, price, and integrations with other services. Below, we present a common workflow using AWS Elastic Container Registry (ECR) and Docker to demonstrate how you might build, push, and access a container using a registry.</p><pre><code>#!/bin/bash

# Build the Docker image for a specific platform and label it as 'my_server'
docker build --platform linux/amd64 -f path/to/Dockerfile -t my_server .
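
# Optional check, added here as an illustration rather than part of the original workflow:
# list the image that was just built. Because -t was given no explicit tag, the TAG column
# is expected to read 'latest'.
docker image ls my_server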

# Log in to Amazon ECR
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com

# Tag the 'my_server' image with a label that specifies where it will be pushed
# and what version it is (1.0.0)
docker tag my_server:latest $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/my_server_repository:1.0.0

# Push the 'my_server' image to the specified ECR repository
docker push $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/my_server_repository:1.0.0</code></pre><p>There are a few subtleties in the above script that demonstrate the power of this workflow.</p><ul><li><p>In the first step, we built the image locally and labeled it `my_server`; by default, Docker assigns this image a tag called `latest`, so we reference the local image as `my_server:latest`.&nbsp;</p></li></ul><ul><li><p>In the second step, we use the AWS CLI to get our login credentials for ECR (this requires us to be locally authenticated against AWS). We then provide those credentials to our local Docker client to pull and push images to ECR. Think of this like having a registered ssh key on your local machine so you can pull and push to GitHub.</p></li></ul><ul><li><p>In the third and fourth steps, we re-label our local image with a fully qualified domain name so that our Docker client knows where to push the image. This step will vary across container registry providers, but the structure is similar, usually `&lt;registry_domain_name.url&gt;/&lt;repository_name&gt;:&lt;version&gt;`</p></li></ul><p>With just a few lines of code, we now have versioned images that we can pull anywhere. Note that the source code is not versioned, just the compiled image. Much like having development and production branches for code, you may want to have a development and production repository. The former would be accessible to developers, while the latter may only be accessible within a CI/CD pipeline. This ensures that production images are being built in a consistent environment and that it is only building images from validated code that has been merged into your production branch.</p><h2><strong>Optimize your Dockerfile</strong></h2><p>It may be enough to copy/paste examples from the internet or a chatbot to get started writing your first Dockerfile, but understanding a few of the underlying mechanics pays dividends. 
Below are a few strategies to help you optimize your Dockerfile.</p><h3>Caching</h3><p>Software engineers often talk about containers being composed of layers. Think of Photoshop: the final image is a composition of independent layers, one for the background, one for the foreground, and one for intricate styling. Docker containers are constructed in the same way. There is a base layer, which can be an operating system interface or a language runtime, and intermediate layers that define new behavior. When building images, Docker creates a new layer for each instruction in the Dockerfile.&nbsp;</p><p>To optimize build times, Docker caches these layers, and only re-builds the layer if it detects a change. Suppose a container has 7 layers; if Docker detects a change to layer 3, it will rebuild all layers from 3-7 while reading the first two layers from its cache.</p><p>Consider the following Dockerfile snippet:</p><pre><code>FROM python:3.11

# Copy and install pip requirements
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt

# Copy the application code
COPY src/ /app/src/
COPY main.py /app/
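
# For contrast (an illustrative anti-pattern, left commented out; not from the original
# Dockerfile): copying everything first would invalidate the cached install layer on
# every source edit, so pip would re-run on each build.
#   COPY . /app/
#   RUN pip install -r /app/requirements.txt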

...</code></pre><p>In this example, we first copy and install the Python requirements <em>before</em> we copy our application code. In a common local development workflow, developers would iterate on the application code and may repeatedly build the image locally to test it end-to-end. If we had to re-install the dependencies for each build, iterating would take significantly longer. Therefore, it matters where we install the dependencies in our Dockerfile. If we had copied and installed the Python requirements <em>after</em> we copied our application code, Docker would see that the application code had changed and would invalidate the cache for all subsequent instructions, thereby forcing a reinstall for each build. We can iterate significantly faster by simply installing Python requirements before we copy the application code.&nbsp;</p><h3><strong>Use Environment Variables (Wisely)</strong></h3><p>In containerized applications, environment variables provide a powerful way to configure applications at runtime. Unlike monolithic servers where environment variables are shared across the entire host, containers isolate these variables, eliminating conflicts and race conditions. This isolation allows each container to serve a single functional purpose with its own specific configuration, enhancing security and manageability.</p><p>Consider the following Dockerfile snippet:</p><pre><code>FROM golang:1.20

ENV APP_ENV=development
ENV DB_URL=""

...</code></pre><p>A developer running this container would pass in the URL of their local database, enabling local development while ensuring access only to resources they have been explicitly granted.</p><pre><code>docker run -e DB_URL="your_database_url" my_image</code></pre><p>Environment variables can be used for various purposes. We use them to adapt our application to different environments (development, testing, production) and to pass configuration information and, when appropriate, secrets. If you decide to store sensitive information in environment variables, it is important to understand a few principles to keep that information secure.</p><p>First, secrets should not be stored unencrypted; that means you should not hard-code them in your Dockerfile and should instead store them in an encrypted secret store that you access at runtime, such as AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault.</p><p>Second, remember that you can retrieve secrets at runtime in different ways. You could programmatically access the secret store within your application instead of using a variable, and if you run on a cloud provider, you may find opportunities to delegate some permissions to them (for example, AWS IAM handles AWS internal access for you).&nbsp;</p><p>Third, remember that the host operating system that runs your containers generally has superuser permissions for running containers; be careful where you deploy production services.</p><p>If you decide to pass sensitive information as an environment variable, here is an example of how to inject the secret at runtime using AWS Secrets Manager:</p><pre><code># Fetch the secret (stored as JSON) and extract the SECRET_KEY field with jq
SECRET_KEY="$(aws secretsmanager get-secret-value --secret-id my_secret_id --query SecretString --output text | jq -r '.SECRET_KEY')"

docker run -e DB_URL="your_database_url" \
           -e SECRET_KEY="$SECRET_KEY" \
           -e AWS_REGION="us-west-2" \
           -e APP_ENV=production \
           my_image</code></pre><p>This command might be run as part of the launch script for your production service in an environment that has been explicitly granted the correct permissions, making it more secure.&nbsp;</p><h3><strong>Prevent Bloat</strong></h3><p>A common criticism of Docker is the significant disk space required by images, often ranging from a few hundred megabytes to a gigabyte. Preventing image sizes from getting too large requires careful optimization. Here are a few best practices:</p><ol><li><p><strong>Use Minimal Base Images</strong></p><p>The default tags of language runtime images (like python:3.11 or golang:1.20) are themselves built on a full Debian userland and bundle build tools, so they can be considerably larger than a bare OS image (like &#8216;ubuntu&#8217; or &#8216;debian&#8217;). Unless you need those extras, opt for a slim variant of the language runtime, such as python:3.11-slim or golang:1.20-alpine. Use official images from trusted sources whenever possible as they are generally better maintained and more secure than the alternatives.</p></li><li><p><strong>Use a `.dockerignore` File</strong></p><p>Create a `.dockerignore` file to exclude unnecessary files and directories from the build context (the set of files sent to the Docker daemon during build).</p><p>For example:</p><pre><code>.git
node_modules
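# Assumed additions for a Python analysis project; adjust to your repository layout
**/__pycache__
**/*.pyc
.venv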
tmp</code></pre></li><li><p><strong>Combine Commands</strong></p><p>It&#8217;s important to remember that Docker creates a separate layer for each instruction. Although combining commands may sacrifice readability, it can save significant space. It is also a good idea to delete vestigial build artifacts and unnecessary files in the same RUN instruction that created them, because files removed in a later layer still take up space in the earlier one.</p><pre><code>RUN apt-get update &amp;&amp; apt-get install -y \
    package1 \
    package2 \
    package3 &amp;&amp; \
    apt-get clean &amp;&amp; rm -rf /var/lib/apt/lists/*</code></pre></li></ol><h1>Conclusion</h1><p>Learning any new technology takes time, but at Mantle, we believe the gains from using containers are disproportionate to the effort required to learn. We hope that we&#8217;ve provided some practical insight into how containers are used by professional engineers so that you can get up to speed as quickly as possible.</p><p>If you want to learn more about how to get the most from your multi-omic data, stay tuned for more bioinformatics tips in upcoming posts in our Intro to Bioinformatics Engineering series and beyond.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.mantlebio.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.mantlebio.com/subscribe?"><span>Subscribe now</span></a></p><p><em>Aakash Shah is a Senior Software Engineer at Mantle. His favorite organism is </em><a href="https://en.wikipedia.org/wiki/Platypus">Ornithorhynchus anatinus</a><em>.</em></p>]]></content:encoded></item><item><title><![CDATA[Intro to Bioinformatics Engineering, Part 1: The Purpose of Pipelines]]></title><description><![CDATA[When, why, and how to build a bioinformatics pipeline]]></description><link>https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering</link><guid isPermaLink="false">https://blog.mantlebio.com/p/intro-to-bioinformatics-engineering</guid><dc:creator><![CDATA[Emily Damato]]></dc:creator><pubDate>Fri, 31 May 2024 18:06:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6e8c3265-259a-4c66-834c-c9b631be1b41_5824x4192.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the first article in our Intro to Bioinformatics Engineering series. 
In these articles, we will discuss some of the foundational blocks for building in bioinformatics. While these articles assume familiarity with programming and biology, they are written to be accessible for folks new to this field. We hope you enjoy!</em></p><h1>Introduction: Hiking or Driving</h1><p>Analyzing new data is a bit like hiking in unmapped woods. You&#8217;re hoping to find something, but you might not know what that something is or where it will be. Armed with a compass and a blank map (or whining laptop and Jupyter Notebook), you set out to explore. If you find something interesting you might make a note, and if you find something exciting you might start a new trail.</p><p>Running a production bioinformatics pipeline is more like driving on a road. You (hopefully) know where you are going, and it will be faster than walking. Most drivers on the road are not the engineers who built it. Your priorities differ from a hiker&#8217;s; it&#8217;s more important to travel quickly and reliably than to investigate something off the route.</p><p>For a traveler, the tradeoffs between trail and road are usually clear. 
But the considerations become far more complicated if, instead, you are a civil engineer.</p><p>In this post, I will discuss what makes an amazing bioinformatics engineer, what they might consider when &#8220;city planning,&#8221; and a few of the most common patterns we have seen for developing a successful bioinformatics system.&nbsp;</p><h1>Why and When to Build a Pipeline</h1><p>&#8220;Analysis&#8221; and &#8220;pipeline&#8221; can mean many things and are sometimes used interchangeably. Without claiming that these are the only definitions, for the sake of this article, they mean the following:</p><p><strong>Analysis</strong> (hiking): Code that is written and run in a single setting. Examples: analyzing data in a Jupyter Notebook, editing and running an R script locally.</p><p><strong>Pipeline</strong> (driving): Code that is written ahead of time and used at runtime. Examples: running a Python script with Snakemake, using 10x&#8217;s Cell Ranger Count, running Illumina&#8217;s BCL2FASTQ.</p><p>End-to-end data processing often requires a combination of analyses and pipelines.</p><p>As an example, processing single cell transcriptomics data often begins with the raw output of a DNA sequencer (BCL).</p><ol><li><p>The BCL2FASTQ pipeline converts the raw observations of the DNA sequencer into human-readable files containing DNA sequences and quality scores (FASTQ).</p></li><li><p>The Cell Ranger COUNT pipeline groups the reads in the FASTQ file by cell and by gene to create a table of gene counts per cell (Count Matrix).</p></li><li><p>To understand and visualize this data, someone does an analysis of the count matrix using the Scanpy package in a Jupyter Notebook. 
After gaining understanding, they create a visualization (UMAP) of the data to communicate their insights to others.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QPcA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QPcA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png 424w, https://substackcdn.com/image/fetch/$s_!QPcA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png 848w, https://substackcdn.com/image/fetch/$s_!QPcA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png 1272w, https://substackcdn.com/image/fetch/$s_!QPcA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QPcA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png" width="1456" height="685" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:199020,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QPcA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png 424w, https://substackcdn.com/image/fetch/$s_!QPcA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png 848w, https://substackcdn.com/image/fetch/$s_!QPcA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png 1272w, https://substackcdn.com/image/fetch/$s_!QPcA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b623fc8-a8dc-41b0-9ff3-ef93ed945de2_3750x1765.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This example captures a few benefits of pipelines:</p><ul><li><p><strong>Shareable.</strong> A well-built pipeline can be written by a domain expert and then run by anyone. Illumina and 10x developed BCL2FASTQ and Cell Ranger COUNT respectively, enabling many scientists to analyze data without developing them from scratch.</p></li><li><p><strong>Consistent.</strong> If two different people analyze the same dataset, they will likely make slightly different choices. This may be as superficial as different line colors in a graph or as significant as choosing different statistical tests. Using versioned pipelines makes processing consistent and results comparable, and may facilitate interoperability with other systems.</p></li><li><p><strong>Testable.</strong> Almost everyone who works with sequencing data has used the output of BCL2FASTQ and trusts that the pipeline worked correctly. 
Because the pipeline has been benchmarked and tested across many use cases, it is now highly reliable and trustworthy.</p></li><li><p><strong>Scalable.</strong> Suppose someone asks you to process thousands of BCL files with the process above. Although the analysis step will likely remain time-consuming, the pipeline steps can be automated and run in parallel.</p></li><li><p><strong>Reproducible.</strong> If you capture the input, environment, and pipeline version, you can reproduce a pipeline run. Analyses can also be reproducible, but that can require meticulous manual documentation.</p></li></ul><p>But there are also advantages to analyses:</p><ul><li><p><strong>Flexible.</strong> Flexibility naturally trades off against consistency. Everyone loves being able to change something quickly. No one likes it when a coworker&#8217;s quick change introduces a bug and blocks operations. If you want to move fast (and break things), an analysis is the answer. In an analysis, it is much easier to pull in additional files and packages on the go.</p></li><li><p><strong>No &#8220;Upfront Cost&#8221;.</strong> A road must be planned and paved before others can drive on it. Similarly, building a pipeline requires more upfront work than using an analysis. An analysis only needs to support one exact use case; a pipeline should be designed to handle a range of values and inputs.</p></li><li><p><strong>No Maintenance Cost.</strong> Roads must be maintained to be usable. Pipelines are not built in a vacuum; requirements, inputs, and infrastructure will change, and the pipeline will need to change to keep up. If you live somewhere with cold winters, imagine driving on a road that has not been repaved after a few freeze-thaw cycles. It is similarly unpleasant to use a pipeline that has not been updated since a faster or better tool became available.</p></li></ul><p>In the example above, when would you replace the Count Matrix analysis with a pipeline? 
If you build a pipeline you only run once, you&#8217;ve wasted effort and slowed progress. If you rerun an analysis by hand hundreds of times, the results may be too inconsistent to analyze in aggregate.</p><p>How to weigh the tradeoffs depends on your priorities and capabilities. The faster you can write a pipeline and the more easily you can maintain it, the sooner you should build one. If you expect requirements to change frequently, it may be better to use analyses for longer. By combining pipelines and analyses in your workflows, you can have the best of both worlds.</p><h1>Common Pipeline Patterns</h1><p>If you&#8217;re working with a complex analysis, it may be unclear whether and how you should convert it to a pipeline. Here are a few of the most common patterns we have seen for creating pipelines.</p><h2>The Last Mile</h2><p>Many people experience the last-mile problem while commuting: the train takes you almost to the office but doesn&#8217;t quite get you there, or you park in a structure a few blocks from your workplace.</p><p>Here are some symptoms that your workflow is suffering from the last-mile problem:</p><ul><li><p>You feel like it&#8217;s &#8220;almost&#8221; a pipeline, but you want to be able to edit the code for the last step regularly.</p></li><li><p>There is an &#8220;intermediate file&#8221; that your pipeline produces.</p><ul><li><p>You find yourself opening that intermediate file in a Notebook for an additional analysis.</p></li><li><p>Your script takes that intermediate file as an optional input and, when it is provided, bypasses a significant section of code.</p></li></ul></li><li><p>You are frequently asked to reprocess data with a small change to a final step.</p></li></ul><p>In this case, the analysis typically begins with data cleaning or preprocessing and ends with visualization or postprocessing. 
While the first step is fairly stable, the second step changes frequently and is still under development.</p><p>Solution:</p><ul><li><p>Create a pipeline for the &#8220;preprocessing&#8221; step.</p><ul><li><p>If an intermediate file is already generated, use it as the checkpoint. If your workflow has clear preprocessing logic but no intermediate file, find a standardized file you can write out after preprocessing. If you&#8217;re using Python, an easy option may be to <a href="https://docs.python.org/3/library/pickle.html">pickle</a> the intermediate object, though pickle files are Python-specific and should only be loaded from sources you trust.</p></li></ul></li><li><p>The &#8220;postprocessing&#8221; step may be either a pipeline or an analysis.</p><ul><li><p>This depends on how frequently you update this step.</p></li><li><p>A common pattern is to start with an analysis and then convert it to a pipeline when it is stable. If you need to do a bespoke analysis, you can always open your intermediate file in a Jupyter Notebook. More on that below.</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gHGd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gHGd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png 424w, https://substackcdn.com/image/fetch/$s_!gHGd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png 848w, 
https://substackcdn.com/image/fetch/$s_!gHGd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png 1272w, https://substackcdn.com/image/fetch/$s_!gHGd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gHGd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png" width="1456" height="685" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:130220,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gHGd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png 424w, https://substackcdn.com/image/fetch/$s_!gHGd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png 848w, 
https://substackcdn.com/image/fetch/$s_!gHGd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png 1272w, https://substackcdn.com/image/fetch/$s_!gHGd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46108574-6e5f-4c9d-b50d-25bee62a0d20_3750x1765.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Multiple Destinations</h2><p>&#8220;Multiple Destinations&#8221; is very similar to &#8220;The Last Mile.&#8221; The key difference 
is whether the pipeline always produces outputs of the same general type. In Multiple Destinations, it does not: the outputs depend on the input. Here are a few other symptoms of the Multiple Destinations pattern:</p><ul><li><p>There are large sections of code that only run when certain conditions are met. For example, perhaps there is a graph that is only produced for data from a specific CRO, or there are different QC checks for different organisms.</p></li><li><p>You are frequently changing the script for certain types of input, but for other types of input you can run the analysis as-is.</p></li></ul><p>Solution:</p><ul><li><p>Go through the code, line by line if necessary, and determine which sections run for ALL inputs.</p></li><li><p>Extract this logic into its own function or pipeline. This may take some refactoring, and it will also require identifying or creating an intermediate file, as in the Last Mile pattern.</p></li><li><p>Examine the code that does not run for all inputs. What determines when each section runs? Is it the data type of the input? Is it a property, like the organism? Is it whether the raw data passes or fails a QC check? Make a list of these categories.</p></li><li><p>Look at the processing and output for each category. You may find that some categories have a well-established analysis that can be converted to a pipeline while others are still exploratory.</p></li></ul><p><strong>Example:</strong></p><p>Suppose you are processing videos of zebrafish to study behavior. You have been running experiments on feeding behavior, and now you are beginning to study social behavior. Your script always begins by converting the video into wireframe models of the fish. Then, depending on the type of experiment, either feeding or social behavior metrics are extracted from the wireframes.</p><p><em>Solution</em>:</p><p>Create a pipeline that takes any video and produces wireframes. Create a second pipeline that takes wireframes and calculates feeding metrics. 
Since you are still developing the social behavior experiments, continue to analyze that data in a Jupyter Notebook until the protocol is refined. Now, if your coworkers want to start analyzing a new type of behavior, they can use your preprocessing pipeline!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eYTw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eYTw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png 424w, https://substackcdn.com/image/fetch/$s_!eYTw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png 848w, https://substackcdn.com/image/fetch/$s_!eYTw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png 1272w, https://substackcdn.com/image/fetch/$s_!eYTw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eYTw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png" width="1456" height="685" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:273067,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eYTw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png 424w, https://substackcdn.com/image/fetch/$s_!eYTw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png 848w, https://substackcdn.com/image/fetch/$s_!eYTw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png 1272w, https://substackcdn.com/image/fetch/$s_!eYTw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ab714a-0f11-4882-a88e-d46c33bdb4da_3750x1765.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h2>The First Mile</h2><p>The most common story for the First Mile pattern is the following:</p><p>You&#8217;d like to do an analysis involving many datasets. These datasets are all of the same general type, but they come from multiple collaborators, CROs, instrument types, public databases, or protocols. You are frequently editing your script because it does not quite work for a new dataset, or you are manually tweaking the dataset to make it a valid input to your pipeline.</p><p>The solution is basically the same as for the Multiple Destinations pattern, but reversed. First, determine the part of your script that runs for all data, and separate it into a pipeline. The input to this pipeline may be obvious, or defining it may take a little creativity.</p><p>Once you have decided on the input to the pipeline, your next goal becomes clear. For each new raw dataset, convert the dataset into that input. Depending on your data, this might mean writing an analysis for every dataset! 
Or maybe you will find that some datasets are similar enough to share a preprocessing pipeline. Either way, you do not need to edit the processing pipeline for each new dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_f32!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_f32!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png 424w, https://substackcdn.com/image/fetch/$s_!_f32!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png 848w, https://substackcdn.com/image/fetch/$s_!_f32!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png 1272w, https://substackcdn.com/image/fetch/$s_!_f32!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_f32!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png" width="1456" height="685" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:145255,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_f32!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png 424w, https://substackcdn.com/image/fetch/$s_!_f32!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png 848w, https://substackcdn.com/image/fetch/$s_!_f32!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png 1272w, https://substackcdn.com/image/fetch/$s_!_f32!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ceb351f-bdd9-4357-8802-9841ff8ed6df_3750x1765.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h1>Pipeline Building is One Part of Bioinformatics Engineering</h1><p>When should a city build a new highway? Where should the road go? Is a highway worth the investment? Can the road be widened as the city grows in the future?</p><p>As in civil engineering, great bioinformatics engineering means building reliable, scalable, and ready-for-change systems.</p><p>Bioinformatics engineers are not the only ones who do bioinformatics engineering. It is often a critical skill for computational biologists, data engineers, software engineers, scientists, and research engineers.</p><p>Bioinformatics engineering overlaps with computational biology and software engineering. 
While it&#8217;s common for one person to wear multiple hats, it's helpful to consider these as separate skills.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UGrc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UGrc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!UGrc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!UGrc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!UGrc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UGrc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UGrc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!UGrc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!UGrc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!UGrc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e5a483-31be-404e-a9ff-0327e3799ca4_1600x900.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h1>Advice for Teams</h1><p>Great bioinformatics engineers are often people who love working in a collaborative environment. Commonly, the creator of the data, the developer of the algorithm, and the stakeholder for the results are three different people. Here are a few thoughts on effective collaboration with these groups:</p><ul><li><p>If you ask someone where they would like a road, they will likely ask for it to be a straight line from their starting point to their destination. Unfortunately, if this were done for everyone, the countless overlapping roads would be nearly impossible to maintain or change. Requests from users are often focused on the fastest solution to the current problem; bioinformatics engineers need to balance this with efficiency and sustainability. It&#8217;s important that a team is structured to allow scientists and engineers to collaborate on project requirements.</p></li><li><p>When developing a new pipeline, it&#8217;s important to have clearly communicated requirements from the teams that will use it. 
I know not every engineer loves documentation, but a shared design doc can prevent timelines from slipping due to misunderstood requirements. Consider versioning the document and having one or two stakeholders &#8220;sign off&#8221; on every new version.</p></li><li><p>Be careful not to convert an analysis into a pipeline too early; you wouldn&#8217;t want to build a road that never gets used. A good rule of thumb is to convert an analysis to a pipeline only after it has been run at least five times without significant changes. This has the additional benefit of providing test data to validate the pipeline.</p></li></ul><h1>Bioinformatics is a BLAST</h1><p>I had a great time writing this, and I hope you enjoyed reading it! We have more posts planned for our Intro to Bioinformatics Engineering Series, including guides for leveraging essential tools, tips for being ready to scale, example developer workflows, and more. Please subscribe below if you would like to be notified, or visit us <a href="https://blog.mantlebio.com/">here</a> for more bioinformatics tips.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.mantlebio.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.mantlebio.com/subscribe?"><span>Subscribe now</span></a></p><p><em>Emily Damato is the CEO and Co-Founder of Mantle. 
Her favorite organism is </em><a href="https://en.wikipedia.org/wiki/Drosophila_melanogaster">Drosophila melanogaster</a><em>.</em></p>]]></content:encoded></item><item><title><![CDATA[Cloud Computing for Life Science: The Way Forward]]></title><description><![CDATA[Leveraging the cloud to accelerate biotech innovation]]></description><link>https://blog.mantlebio.com/p/cloud-computing-for-life-science</link><guid isPermaLink="false">https://blog.mantlebio.com/p/cloud-computing-for-life-science</guid><dc:creator><![CDATA[Patriss Moradi]]></dc:creator><pubDate>Wed, 29 May 2024 21:02:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c2ff39a8-88bb-408b-97e6-25e9cf828fcd_5824x4192.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the rapidly evolving world of biotechnology, companies face the ongoing challenge of choosing the most efficient computing infrastructure to support their research and development efforts. This post explores why cloud computing is increasingly favored over on-premises solutions, particularly by biotech companies seeking scalability, flexibility, and cost-effectiveness.</p><h1><strong>Understanding the needs of biotech companies</strong></h1><p>Biotech companies are at the forefront of scientific innovation, requiring substantial computational resources for tasks like genomic sequencing, drug discovery, and data analysis. These tasks demand high computing power and the ability to scale resources up or down based on the dynamic nature of research projects. Cloud computing is particularly well suited to these needs because resources can be provisioned and released on demand. With cloud computing, organizations can scale up their computing resources to handle large datasets and complex calculations during peak times, such as when running intensive genomic analyses or large-scale simulations. 
Conversely, they can scale down during periods of lower demand, optimizing costs and resource usage. Additionally, cloud services provide access to cutting-edge technologies and infrastructure without the need for substantial upfront investments, enabling biotech companies to stay at the forefront of innovation while maintaining financial efficiency.</p><p>With these benefits in mind, comparing the practical aspects of running workflows on the cloud versus on-premises is essential. For instance, running the <a href="https://github.com/nf-core/rnaseq">nf-core RNA-seq</a> pipeline on AWS offers dynamic scalability, enabling biotech companies to handle peak computational loads without the need for significant upfront hardware investments. In contrast, on-premises solutions require substantial initial capital expenditure for hardware and ongoing maintenance costs. While on-premises infrastructure might provide lower latency and more control over physical resources, it lacks the flexibility and cost optimization offered by cloud services. By evaluating the costs and benefits of each approach, biotech companies can make informed decisions about their data infrastructure. The next section will explore how to estimate the cost of running nf-core RNA-seq on AWS compared to on-premises solutions, providing insights into the factors that influence these expenses and strategies for optimizing them.</p><h1><strong>Comparing the costs of compute</strong></h1><h2><strong>Running nf-core RNA-seq on AWS</strong></h2><p>When running bioinformatics pipelines like <a href="https://github.com/nf-core/rnaseq">nf-core RNA-seq</a> on the cloud, understanding the associated costs is crucial for budgeting and resource allocation. 
Below is a cost estimate for running the nf-core RNA-seq pipeline on AWS using a memory-optimized instance with 100 GB of data per sample.</p><h3><strong>Assumptions and setup</strong></h3><p>For our cost estimate, we assume the following setup:</p><ul><li><p><strong>Instance type</strong>: Memory-optimized <strong><a href="https://aws.amazon.com/ec2/instance-types/r5/">r5.large</a></strong> AWS Elastic Compute Cloud (EC2) instance</p></li><li><p><strong>Number of samples</strong>: 10 samples</p></li><li><p><strong>Data size per sample</strong>: 100 GB</p></li><li><p><strong>Runtime</strong>: Approximately 10 hours per sample</p></li><li><p><strong>Storage needs</strong>: Temporary storage to stage samples</p></li><li><p><strong>Data transfer</strong>: Depends on the long-term storage setup; transfers within the same AWS Region generally incur no additional cost</p></li></ul><h3><strong>Cost breakdown</strong></h3><h4><strong>Compute costs:</strong></h4><p>Using the <strong>r5.large</strong> EC2 instance in the US East (N. Virginia) region, which costs approximately $0.126 per hour on demand, we calculate the compute costs:</p><ul><li><p>10 hours &#215; $0.126/hour = $1.26 per sample</p></li></ul><h4><strong>Storage costs</strong>:</h4><p>A general-purpose SSD (gp2) volume on Amazon Elastic Block Store (EBS) costs $0.10 per GB per month. 
Let&#8217;s assume 200 GB per instance for 100 GB samples for temporary storage.</p><ul><li><p>Assuming 1 day of usage: 200 GB &#215; $0.10/GB-month &#215; (1/30 month) &#8776; $0.67 per sample</p></li></ul><h3><strong>Total cost</strong></h3><ul><li><p>Compute: $1.26 / sample</p></li><li><p>EBS: $0.67 / sample</p></li></ul><p><strong>Total</strong>: <strong>$1.93 per sample</strong></p><h2><strong>Building an on-prem machine to run nf-core RNA-seq</strong></h2><p>The summary table provides a detailed breakdown of the costs of building a high-performance PC for running the nf-core RNA-seq pipeline, including the estimated annual maintenance and upgrade costs.</p><p>Here is the summary of the costs:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VE0u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VE0u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png 424w, https://substackcdn.com/image/fetch/$s_!VE0u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png 848w, https://substackcdn.com/image/fetch/$s_!VE0u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png 1272w, 
https://substackcdn.com/image/fetch/$s_!VE0u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VE0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png" width="452" height="693.7341772151899" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1940,&quot;width&quot;:1264,&quot;resizeWidth&quot;:452,&quot;bytes&quot;:80722,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VE0u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png 424w, https://substackcdn.com/image/fetch/$s_!VE0u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png 848w, https://substackcdn.com/image/fetch/$s_!VE0u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png 1272w, 
https://substackcdn.com/image/fetch/$s_!VE0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff6bec5e-5a9e-462a-a938-0f56d7403b21_1264x1940.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Initial build cost: </strong>$2,200</p><p><strong>Annual maintenance and upgrade costs:</strong></p><blockquote><p>&#8226; Annual maintenance: $200</p><p>&#8226; Annual upgrades: $300</p></blockquote><p><strong>Total annual cost:</strong> $500</p><p><strong>Total cost over 3 years:</strong></p><blockquote><p>&#8226; Initial Cost: $2,200</p><p>&#8226; 
Annual Costs for 3 Years: $500 &#215; 3 = $1,500</p></blockquote><p>Total Cost Over 3 Years: $2,200 + $1,500 = <strong>$3,700</strong></p><h2><strong>Break-even point analysis: AWS vs. PC build for RNA-seq</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ozHP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ozHP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png 424w, https://substackcdn.com/image/fetch/$s_!ozHP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png 848w, https://substackcdn.com/image/fetch/$s_!ozHP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png 1272w, https://substackcdn.com/image/fetch/$s_!ozHP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ozHP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png" width="1456" height="1093" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1093,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:189493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ozHP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png 424w, https://substackcdn.com/image/fetch/$s_!ozHP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png 848w, https://substackcdn.com/image/fetch/$s_!ozHP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png 1272w, https://substackcdn.com/image/fetch/$s_!ozHP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ebed89-d7aa-4bf3-ab8e-c031b090f3d1_2072x1556.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The plot above incorporates the estimated annual maintenance and hardware upgrade costs for a 3-year period, providing a comprehensive comparison between running the nf-core RNA-seq pipeline on AWS and building a dedicated PC.</p><p>Key points:</p><ul><li><p>AWS cost:</p><ul><li><p>The cost of processing 100 GB samples on AWS is $1.93 per sample.</p></li></ul></li><li><p>PC build cost:</p><ul><li><p>Initial investment: $2,200</p></li><li><p>Annual maintenance and upgrades: $500 per year</p></li><li><p>Total cost over 3 years: $3,700</p></li></ul></li></ul><p>The break-even point for a 3-year period occurs at approximately 1,917 samples. This means if you process more than 1,917 samples over 3 years, building a PC/on-prem machine becomes more cost-effective than using AWS. This is equivalent to running ~1.75 samples per day.</p><h1>Comparing the costs of storage</h1><p>Many biotechnology companies hesitate to migrate to cloud storage due to concerns over storage costs. 
However, a detailed comparison shows that using AWS S3 Glacier for raw data and AWS S3 Standard for pipeline outputs can be more cost-effective than maintaining on-premises storage. On-premises storage requires a significant initial hardware investment, such as a <a href="https://www.amazon.com/SanDisk-Professional-144TB-G-RAID-Shuttle/dp/B0961J28NG/ref=sr_1_5?dib=eyJ2IjoiMSJ9.SqOSOEC6MoAiLeSoX7riAKKTLHqsGqtK4nFfroxpE_743qcsVew6qAAyJb9yvEcZcmEFHYnzK3xQ9-qY_Syt20Nh-w-fsfkygZKkO8oh-NOSbM5XDnBPd5Hg7-DQ68VztLjmJyqIPBolfSsRlVYS0dyuJrFt22-Cvqwow7lhntHcI-YAErKb1z-QYnr37RoTUzDRB3qbaaDz3y9B9hngo2LiH8KJo7lZTf-sPkKpjsc.Y3rKb81W8mmaGIL4CrE3pYUq61Be8r61bkCS5qm0fm0&amp;dib_tag=se&amp;keywords=100tb%2Bhard%2Bdrive&amp;qid=1716937020&amp;sr=8-5&amp;th=1">100TB server costing around $4,799.99</a>, plus additional hardware for backup storage and ongoing expenses for maintenance, power, and cooling.</p><p>In contrast, <a href="https://www.cloudforecast.io/blog/amazon-s3-pricing-and-optimization-guide/">AWS S3 Glacier</a> offers long-term data archiving for as little as $0.00099 per GB per month (the rate for its Deep Archive storage class), making it possible to store 95TB for approximately $3,385.80 over three years (95,000 GB &#215; $0.00099 &#215; 36 months). AWS S3 Standard provides immediate access to frequently used data at $0.023 per GB per month, costing around $4,140.00 for 5TB over the same period. Thus, the total cost of using AWS S3 Glacier and S3 Standard for 100TB over three years is roughly $7,525.80. 
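</p><p>As a rough check on the arithmetic above, here is a minimal Python sketch that recomputes the three-year storage totals from the per-GB monthly rates quoted in this post. The function name <code>storage_cost</code> and the constants are illustrative and not part of any AWS tool. The sketch assumes decimal terabytes (1 TB = 1,000 GB) and a fixed volume held for the full 36 months, and it ignores retrieval, request, and data transfer charges.</p><pre><code>
```python
# Illustrative storage-cost arithmetic using the rates quoted in this post.
GB_PER_TB = 1000  # assumption: decimal terabytes, which matches the totals in the text
MONTHS = 36       # three years


def storage_cost(terabytes, price_per_gb_month, months):
    """Return the cost in dollars of holding a fixed volume for `months` months."""
    return terabytes * GB_PER_TB * price_per_gb_month * months


archive_cost = storage_cost(95, 0.00099, MONTHS)  # raw data in the archive tier
standard_cost = storage_cost(5, 0.023, MONTHS)    # pipeline outputs in S3 Standard
total_cost = archive_cost + standard_cost

print(f"Archive tier (95 TB): ${archive_cost:,.2f}")
print(f"S3 Standard (5 TB):   ${standard_cost:,.2f}")
print(f"Total over 3 years:   ${total_cost:,.2f}")
```
</code></pre><p>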
Included in this price is data replication across multiple AWS Availability Zones, which protects against data loss.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!syxS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!syxS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png 424w, https://substackcdn.com/image/fetch/$s_!syxS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png 848w, https://substackcdn.com/image/fetch/$s_!syxS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png 1272w, https://substackcdn.com/image/fetch/$s_!syxS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!syxS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png" width="1456" height="777" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60665,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!syxS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png 424w, https://substackcdn.com/image/fetch/$s_!syxS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png 848w, https://substackcdn.com/image/fetch/$s_!syxS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png 1272w, https://substackcdn.com/image/fetch/$s_!syxS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5c0b88-e1f0-4f8d-82aa-b9c00af3d1c7_1844x984.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Splitting storage between AWS S3 Glacier and S3 Standard</strong></h3><p>In bioinformatics, the data generated and used in analyses can be broadly categorized into raw data and processed output data. Here&#8217;s an explanation for the reasoning behind splitting cloud storage into 95TB for raw data and 5TB for pipeline output files:</p><h4><strong>1. Raw data (95TB)</strong></h4><p>&#8226; <strong>Volume</strong>: Most of the data (95%) used in bioinformatics is raw data, such as high-throughput sequencing data, which are typically large files. This can include BCL and FASTQ files from RNA-seq, DNA-seq, or other omics technologies.</p><p>&#8226; <strong>Usage</strong>: Raw data is essential for the initial stages of analysis and must be stored securely and reliably. 
However, once processed, it is not frequently accessed.</p><p>&#8226; <strong>Storage solution</strong>: Raw data can be stored in AWS S3 Glacier, which is a cost-effective storage solution for long-term archival. S3 Glacier is suitable for data that does not need to be accessed frequently but must be retained for compliance or future re-analysis.</p><h4><strong>2. Processed output data (5TB)</strong></h4><p>&#8226; <strong>Volume</strong>: Processed data, such as count matrices, variant call format (VCF) files, and other summary files, constitutes a smaller portion of the total data (5%). These files are significantly smaller than the raw data files but are critical for downstream analysis.</p><p>&#8226; <strong>Usage</strong>: These files are used frequently for various downstream analyses, visualization, and reporting. They need to be readily accessible to researchers for further analysis and interpretation.</p><p>&#8226; <strong>Storage solution</strong>: Processed output files should be stored in a more accessible and faster cloud storage solution, such as AWS S3 Standard, to ensure quick and easy access for ongoing research and analysis.</p><h4><strong>Summary</strong></h4><p>The split between 95TB for raw data and 5TB for processed output data reflects the different storage needs based on the usage patterns:</p><p>&#8226; Raw data: Stored in AWS S3 Glacier for cost-effective, long-term storage. This data is not frequently accessed but needs to be retained.</p><p>&#8226; Processed output data: Stored in AWS S3 Standard for quick and frequent access required for ongoing bioinformatics analysis and research.</p><p>This approach optimizes storage costs while ensuring that the necessary data is available, without compromising accessibility or data integrity.</p><h1><strong>Limitations of on-prem computing</strong></h1><p>Traditionally, biotech firms have relied on on-premises infrastructure. 
However, this comes with significant drawbacks:</p><ul><li><p><strong>High capital expenditure:</strong> Setting up and maintaining on-prem infrastructure requires a hefty initial investment, often hundreds of thousands of dollars for state-of-the-art servers and data storage solutions.</p></li><li><p><strong>Scalability issues:</strong> Scaling on-prem infrastructure can be slow and costly. It often involves purchasing additional hardware that might not be used at full capacity, leading to inefficiencies.</p></li><li><p><strong>Maintenance and upgrades:</strong> On-prem systems require ongoing maintenance by highly skilled IT staff, adding to operational costs. Keeping up with the latest technology also necessitates regular hardware upgrades.</p></li></ul><h1><strong>Benefits of cloud computing for biotech</strong></h1><p>Cloud computing offers several advantages that align perfectly with the needs of the biotech sector:</p><ul><li><p><strong>Flexibility and scalability:</strong> Cloud services provide the ability to quickly scale computing resources up or down. This is crucial for biotech firms that may need to ramp up resources for large-scale experiments or dial them back during less intensive periods.</p></li><li><p><strong>Cost-effectiveness:</strong> With cloud computing, companies pay only for the resources they use. This "pay-as-you-go" model can lead to significant cost savings compared to the fixed costs associated with maintaining on-prem infrastructure.</p></li><li><p><strong>Advanced technologies and collaboration:</strong> Cloud providers often offer advanced analytics and machine learning services that can be integrated seamlessly into biotech workflows. 
Additionally, the cloud facilitates easier data sharing and collaboration across distributed teams, which is essential for today's remote work environment.</p></li><li><p><strong>Built-in compliance: </strong>Cloud services with built-in compliance features, such as AWS HealthLake or Google Cloud&#8217;s healthcare solutions, help ensure adherence to industry regulations like HIPAA and GDPR, protecting sensitive patient and research data.</p></li></ul><h1><strong>Conclusion</strong></h1><p>Studies demonstrate that companies adopting cloud computing can reduce their IT expenses by 30-50% compared to maintaining on-premises infrastructure. Additionally, the agility of cloud computing significantly accelerates the time-to-market for new scientific findings or drugs, enhancing potential revenue generation. As these companies continue to push the boundaries of scientific research, the selection of computing infrastructure becomes paramount. Cloud computing offers a more flexible and cost-effective solution than traditional systems, boosting a company&#8217;s ability to innovate and collaborate globally. This not only helps companies outpace the competition but also establishes cloud computing as a strategic imperative for biotech firms determined to lead in innovation and efficiency.</p><p><em>Patriss Moradi is a Senior Software Engineer at Mantle. 
His favorite organism is </em><a href="https://en.wikipedia.org/wiki/Red_fox">Vulpes vulpes</a><em>.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.mantlebio.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.mantlebio.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[ELN Breakdown: Scale, Scope, and Science]]></title><description><![CDATA[Choosing and using the perfect ELN]]></description><link>https://blog.mantlebio.com/p/eln-breakdown-scale-scope-and-science</link><guid isPermaLink="false">https://blog.mantlebio.com/p/eln-breakdown-scale-scope-and-science</guid><dc:creator><![CDATA[Emily Damato]]></dc:creator><pubDate>Thu, 14 Mar 2024 20:50:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4f0e0d80-c37e-4945-8bfe-72fe4ea98dd4_5824x4192.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>What ELN should I use?</h1><p><em>You&#8217;re on an R&amp;D team that records lab notes in Google Docs. This was fine at first, but now that the group has grown it can feel overwhelming to try to find a detail from a few months ago. This solution is reaching a breaking point, but when should you change and what should you use?</em></p><p>Almost every biotech company I know has been in this position at some point. In the last ten years of being a software engineer in life science, I have been asked for ELN advice countless times. The problem is that &#8220;ELN&#8221; has come to mean different things to different people and there is no one answer that is perfect for everyone.</p><h2>What is an ELN?</h2><p><em>&#8220;I was hoping you would just tell me what ELN to use and I could go back to research&#8221;</em></p><p><em>&#8211; Almost everyone</em></p><p>An ELN is an electronic lab notebook. 
A Google Doc used for lab notes counts as an ELN, and it has many advantages over the paper notebook I kept in a locked drawer during my first research project:</p><ul><li><p><strong>Version Controlled</strong></p><ul><li><p>Timestamped changes can help you track how an SOP changed over time as well as prove ownership in an IP dispute.</p></li></ul></li><li><p><strong>Shareable</strong></p><ul><li><p>Documents can be easily shared with teammates, making research more transparent and collaborative.</p></li></ul></li><li><p><strong>Searchable</strong></p><ul><li><p>Searching past records can help you find details about an experiment you performed three years ago or see if a teammate previously explored a particular topic.</p></li></ul></li></ul><p>When you are doing exploratory research where every experiment is unique, the blank-canvas Google Doc will let you move quickly. But if you are looking for a new ELN, you have likely felt the limitations of a blank sheet.</p><p>When you perform the same experiment 5 or 500 times, lack of structure can lead to inconsistency that will make comparing results difficult or impossible. While this was never something a paper notebook could provide, we have come to expect a modern ELN to have structured data.</p><h2>What is the right amount of structure?</h2><p>If you try to capture data from a 200-sample experiment in a Google Doc, it creates chaos; if you try to capture a one-off experiment in a database, it creates unnecessary work. So what is the right amount of structure? Here is how I think about it:
Here is how I think about it:</p><ul><li><p>1 - 5 samples / assays</p><ul><li><p>Do something completely unstructured, but not untracked.</p></li><li><p>This phase is like finding a new hiking trail: you might not know where you are going the first time, but you still need to draw a map to retrace your path in the future.</p></li><li><p>Examples: Google Doc, Benchling Entry, Notion page</p></li></ul></li><li><p>5 - 50</p><ul><li><p>This should be structured, but there should also be a way to capture unstructured notes.&nbsp;</p></li><li><p>This phase is like going on a hike: you should be following a route, but there may be unexpected events that require you to go off trail.</p></li><li><p>Examples: Google Sheets, Benchling Entry + Registry Tables, Notion Table</p></li></ul></li><li><p>50 - 500</p><ul><li><p>This should be structured, and structured carefully. Someone with data engineering experience and someone with detailed domain knowledge should both contribute to the design.</p></li><li><p>This phase is like driving on a road. Most of the decisions will be made when you pave the road, and driving down it should be very consistent. This is often when lab automation equipment is purchased.</p></li><li><p>Examples: Benchling, Notion, AirTable</p><ul><li><p>500 is the upper limit of what many traditional ELNs support. If you do not know the limit of a particular ELN, you can ask the customer service team for benchmark data or run a benchmarking experiment yourself.</p></li></ul></li></ul></li><li><p>500 +</p><ul><li><p>This is very well structured. At this point you want to have versioned SOPs (or even automation!) and versioned data schemas to track structure changes over time.</p></li><li><p>This phase is the highway. 
It should be similar to the road, but with fewer potholes and more lanes.</p></li><li><p>Examples: Airtable, software databases (e.g., Postgres)</p></li></ul></li></ul><p>Most research groups have different projects that require different levels of structure, and that&#8217;s okay. It&#8217;s nice if you have one tool for everything, but it is much more important to have the right tool for each thing. Many companies have two data systems, one for low-throughput experiments and one for high-throughput experiments. The good news is you can focus on the 1-500 problem now and solve the 500+ problem later.</p><h2>What does biology have to do with this?</h2><p>There are two categories of ELN / data tools:</p><ul><li><p>Domain Agnostic</p><ul><li><p>Examples: Notion, Airtable</p></li></ul></li><li><p>Biology Specific</p><ul><li><p>Examples: Benchling, Scispot</p></li></ul></li></ul><p>Generally speaking, domain-agnostic tools will be cheaper, but you will need to do more work to configure them for your needs. There are also biology-specific features that the domain-agnostic tools do not provide out of the box, including:</p><ul><li><p>Customer support services with biology experience</p><ul><li><p>E.g. pre-configured schemas for in vivo experiments, advice on how to set up a reagent tracking database</p></li></ul></li><li><p>Analysis and visualization tools for biological data</p><ul><li><p>E.g. a plasmid viewer, BLAST tool, small molecule database</p></li></ul></li><li><p>Life science regulatory compliance</p><ul><li><p>E.g. HIPAA, GxP</p></li></ul></li></ul><p>Whether to use a domain-agnostic or biology-specific tool depends on your requirements and how much time you are willing to invest in setting up the tool. In general, I don&#8217;t recommend using a domain-agnostic tool unless someone on your team has data engineering or software experience.</p><h1>I chose an ELN. 
Am I done now?</h1><p>Unfortunately, no.</p><p>Your research is unique &#8211; that is why you are doing it. Even a biology-specific tool like Benchling will need to be customized for your research. As your research evolves over time, you will need to keep your ELN up to date. Scientists will need to learn to use it, adopt it into their workflow, and enter data consistently.</p><p><em><strong>Maintaining your ELN is likely to cost more than the ELN itself</strong></em></p><p>Biotech companies that ask for initial implementation help from applications engineers or data engineers tend to have a better long-term ELN experience. Biotech companies with more than 50 people tend to hire an informatics specialist or data manager to be responsible for the ELN and other information systems.</p><p>Don&#8217;t forget to factor usability and maintainability into the price. A poorly implemented ELN that slows down your research and takes multiple employees to maintain will cost your company far more than any SaaS subscription.</p><h1>(Self Promotion) What about computational biology?</h1><p>Computational research needs an ELN too &#8211; and that&#8217;s why we built Mantle.</p><p>Traditional ELN systems are designed for research done at the bench. This is great for recording sample preparation for a Western blot, but not for capturing the analysis of single-cell transcriptomics FASTQ files.</p><p><a href="https://mantlebio.com/">Mantle</a> connects &#8220;big biological data&#8221; (FASTQs, PDB files, micrographs, etc.), analysis notebooks (Jupyter, RStudio), and bioinformatics pipelines (Nextflow). With automated workflows, you can run a DNA sequencer, go grab coffee, and receive reproducible results in your inbox. 
We integrate with the wet lab ELNs above to keep the entire lab connected.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mantlebio.com/see-a-demo/&quot;,&quot;text&quot;:&quot;Learn More&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://mantlebio.com/see-a-demo/"><span>Learn More</span></a></p><h1>How do you manage data?</h1><p>What ELN do you use? We&#8217;d love to hear what has worked for you and what lessons you&#8217;ve learned along the way. Thank you!</p>]]></content:encoded></item></channel></rss>