Assembling a Pandemic: My First Genome Assembly

Walking through the process of reconstructing the complete SARS-CoV-2 genome from millions of short DNA reads using SPAdes and Galaxy

The Problem: 2.5 Million Pieces of a Viral Puzzle

Imagine I dumped a 30,000-piece jigsaw puzzle on your table, but there's a catch: I first shredded each piece into dozens of tiny fragments, mixed them all together, and then asked you to reconstruct the original image. Oh, and some fragments are damaged, some are duplicates, and about 10% don't even belong to this puzzle—they're contamination from other puzzles.

Welcome to genome assembly.

When I downloaded dataset SRR10971381 from NCBI (raw sequencing data from a COVID-19 patient in Wuhan, China, collected in early January 2020), I wasn't getting a nice, neat SARS-CoV-2 genome sequence. I was getting 2,512,682 short DNA reads, each only 150 base pairs long. The complete viral genome is about 29,903 nucleotides. That's far more sequence than a genome this small strictly needs (the finished assembly ended up with roughly 200x coverage), but it arrives as tiny, overlapping fragments that have to be computationally stitched back together.

This is my story of learning genome assembly, one of the most fundamental skills in bioinformatics, by reconstructing the genome of the virus that caused a global pandemic.

What I Knew (and Didn't Know)

What My Wet-Lab Background Taught Me:

From my undergraduate work on the ATP8B1 gene, I understood:
- DNA sequencing basics: Sanger sequencing gives you clean, long reads (~800 bp)
- PCR products: I'd sequence a single amplicon and get a readable chromatogram
- The concept: DNA → sequencer → sequence data

I'd seen sequencing results that looked like this:

ATCGATCGATCGATCG...

Clean, ordered, interpretable.

What I Didn't Understand:

Next-generation sequencing (Illumina) is completely different:
- Millions of short reads instead of one long read
- Paired-end sequencing (reads from both ends of DNA fragments)
- Raw FASTQ files filled with quality scores I'd never seen
- No "correct order" of reads (that's what assembly does)
- Computational intensity that would melt my laptop

When I first opened the SRR10971381 dataset, I expected something manageable. What I got was a 1.2 GB file containing sequences like:

@SRR10971381.1
NGCGGATTTCGTCACCCACGTGGCGGATGTCATAGGTTATAATAATATTCGTATGGCGGCG
+
#8ACCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@SRR10971381.2
ACCCCCGATCGATCGATCGATCGGGGATCGGGATCGGGATCGGGATCGGGACCCCCGATCG
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

Repeated 2.5 million times.

I had no idea what I was looking at. The '@' symbols, the '+' symbols, those cryptic quality strings—this was a completely different language. And somehow, I was supposed to turn this incomprehensible mess into a complete viral genome.

The Jigsaw Puzzle Analogy (Actually Perfect)

Before diving into the technical process, let me explain why genome assembly really is like solving a jigsaw puzzle:

Traditional Jigsaw Puzzle:

Pieces: Each has unique shape and image
Borders: Clear edges help you start
Image: Visual guide (the box cover)
Strategy: Group by color, connect obvious matches
End result: One complete picture

Genome Assembly Puzzle:

Pieces: Short DNA reads (150 bp each)
Borders: No clear start/end markers
Image: Unknown (that's what you're trying to discover!)
Strategy: Find overlapping sequences
End result: Complete genome sequence

But here's where it gets harder than any puzzle you've done:

Challenge 1: Repeats
Imagine your puzzle has large sections with identical patterns. How do you know which blue sky piece goes where when all the sky looks the same? DNA genomes have repeated sequences—sometimes the exact same 100 bp sequence appears in multiple locations. The assembler has to figure out if two reads matching that sequence belong to the same location or different ones.

Challenge 2: No Edge Pieces
You can't easily identify the beginning or end of the genome. The SARS-CoV-2 genome is linear, but there are no obvious "edge pieces" to start with.

Challenge 3: Errors and Contamination
Some puzzle pieces are damaged (sequencing errors). Some pieces are from completely different puzzles (bacterial DNA contamination from the patient sample). You need to identify and exclude these.

Challenge 4: Coverage Depth
Instead of one copy of each piece, you have ~200 copies of each region (because of high sequencing coverage). Sounds helpful, but now you need to figure out which reads are true duplicates versus actual genomic repeats.

Challenge 5: Computational Scale
You can't manually inspect 2.5 million pieces. The human brain can't hold that much information. You need algorithms—programs that can systematically find overlaps and build the genome.

This is why genome assembly is computationally intensive and why tools like SPAdes exist.

My First Attempt: Understanding the Input Data

The course started gently: "Download the SARS-CoV-2 dataset and upload it to Galaxy."

Galaxy is a web-based platform that lets you run bioinformatics tools through a graphical interface—no command line required. Perfect for beginners like me.

Step 1: Understanding FASTQ Format

Each sequencing read in my file looked like this:

@SRR10971381.1 1 length=150
NGCGGATTTCGTCACCCACGTGGCGGATGTCATAGGTTATAATAATATTCGTATGGCGGCG
+
#8ACCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Breaking this down:
- Line 1 (@): Read identifier (this is read #1)
- Line 2: The actual DNA sequence (A, T, C, G, and sometimes N for ambiguous)
- Line 3 (+): Separator
- Line 4: Quality scores (higher letters = better sequencing quality)
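The four-line layout can be read with a few lines of Python. This is a minimal sketch, not what Galaxy does internally: the names `FastqRecord` and `read_fastq` are my own, the two records are made-up toy data, and the parser assumes every record occupies exactly four lines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every four lines of a FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Keep only the identifier: drop the '@' and anything after the first space.
        yield FastqRecord(header[1:].split()[0], sequence, quality)


# Toy input (hypothetical reads, shortened to 8 bases each).
toy_file = io.StringIO(
    "@read1 1 length=8\nNGCGGATT\n+\n#8ACCGGG\n"
    "@read2 2 length=8\nACCCCCGA\n+\nCCCFFFFF\n"
)
records = list(read_fastq(toy_file))
for record in records:
    print(record.read_id, record.sequence, record.quality)
```

For a real file you would pass `open("SRR10971381_1.fastq")` instead of the `StringIO` object.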

That quality line was confusing at first. Those characters are Phred quality scores encoded in ASCII. Assuming the standard Phred+33 encoding, 'J' is Q41, which is very high quality (error probability of about 0.008%), while '#' is Q2, which is terrible (the base is more likely wrong than right). That first 'N' in the sequence? The '#' in the quality line tells you the sequencer wasn't confident about that base.
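To make the encoding concrete, here is a small sketch that converts quality characters into Phred scores and error probabilities. It assumes the Phred+33 offset used by current Illumina FASTQ files; the function names are mine.

```python
def phred_score(char: str, offset: int = 33) -> int:
    """Phred quality Q for one quality character (Phred+33 assumed)."""
    return ord(char) - offset


def error_probability(char: str, offset: int = 33) -> float:
    """Probability that the base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score(char, offset) / 10)


for symbol in "#8CFJ":
    q = phred_score(symbol)
    p = error_probability(symbol)
    print(f"{symbol!r}: Q{q}, P(error) = {p:.6f}")
```

Every 10 points of Q is a tenfold drop in error probability, which is why the jump from '#' to 'J' matters so much.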

First lesson learned: Not all sequencing data is equal quality. Some reads are pristine, some are garbage.

Step 2: Paired-End Sequencing

The dataset wasn't just one file. It was TWO files:
- SRR10971381_1.fastq (forward reads)
- SRR10971381_2.fastq (reverse reads)

This is paired-end sequencing. Here's how it works:

DNA fragment from virus: [-----~500 bp fragment-----]
Sequence both ends:
Forward: ATCGAT... (150 bp from left end)
Reverse: ...GCTAGC (150 bp from right end)

The assembler knows these two reads come from the same DNA molecule, roughly 500 bp apart. This is incredibly valuable information! It's like having two puzzle pieces that you know must be about one foot apart in the final picture.
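A toy simulation makes the geometry easier to see. This is a sketch under simple assumptions: a random 500 bp fragment stands in for real viral sequence, reads are error-free, and the function names are my own. One detail worth knowing is that the second read comes off the opposite strand, so it shows up in the file as the reverse complement of the fragment's right end.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_reads(fragment: str, read_length: int) -> tuple[str, str]:
    """Simulate error-free paired-end reads from one fragment."""
    forward = fragment[:read_length]
    reverse = reverse_complement(fragment[-read_length:])
    return forward, reverse


rng = random.Random(0)  # fixed seed so the example is repeatable
fragment = "".join(rng.choice("ACGT") for _ in range(500))
forward, reverse = paired_reads(fragment, 150)

print("forward starts:", forward[:20])
print("reverse starts:", reverse[:20])
print("unsequenced middle:", len(fragment) - 2 * 150, "bp")
```

The assembler never sees the middle 200 bp of this fragment, but it does know the two reads face each other and sit about one fragment length apart.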

Second lesson learned: Paired-end data is like having contextual clues that constrain the solution space.

The Assembly Algorithm: De Bruijn Graphs

Before running SPAdes, the course explained the underlying algorithm: de Bruijn graphs. This was the moment bioinformatics started feeling like computer science.

The Naive Approach (Doesn't Work):

My first instinct was: "Why not just compare every read to every other read, find overlaps, and connect them?"

Problem: That's 2.5 million × 2.5 million = 6.25 trillion comparisons. Computationally impossible.

The Clever Approach (De Bruijn Graphs):

Instead of comparing entire reads, SPAdes breaks each read into k-mers—short sequences of length k.

To illustrate, take k=25 (25-nucleotide chunks). The actual SPAdes run used k = 21, 33, and 55.

Example read:

ATCGATCGATCGATCGATCGATCGAT (26 bp)

Breaking into k-mers (k=25):

ATCGATCGATCGATCGATCGATCGA  (positions 1-25)
 TCGATCGATCGATCGATCGATCGAT  (positions 2-26)
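In code, k-mer decomposition is a sliding window. A minimal sketch (the function name `kmers` is mine, and the read is a 26 bp toy sequence):

```python
def kmers(read: str, k: int) -> list[str]:
    """All overlapping substrings of length k, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


toy_read = "ATCGATCGATCGATCGATCGATCGAT"  # 26 bp
for position, kmer in enumerate(kmers(toy_read, 25), start=1):
    print(position, kmer)

# A read of length L yields L - k + 1 k-mers.
print(len(kmers("A" * 150, 25)), "k-mers from one 150 bp read at k=25")
```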

Now do this for all 2.5 million reads. You get millions of k-mers, but many are duplicates (high coverage!). The de Bruijn graph:

Nodes: Each unique k-mer is a node
Edges: If the last k-1 bases of k-mer A match the first k-1 bases of k-mer B, draw an edge from A to B
Path: The genome is a path through this graph
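The nodes-and-edges description above can be turned into a toy assembler. This is a heavily simplified sketch, not how SPAdes is implemented: it uses a made-up 14 bp "genome", error-free reads, a single k, no reverse complements, and it only works when the graph is one unbranched path. All function names are my own.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Nodes are unique k-mers; A -> B when A's last k-1 bases equal B's first k-1."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    by_prefix = defaultdict(list)
    for node in sorted(nodes):
        by_prefix[node[:-1]].append(node)
    return {node: by_prefix.get(node[1:], []) for node in nodes}


def walk_unambiguous_path(graph: dict[str, list[str]]) -> str:
    """Spell the sequence along a graph that is a single unbranched path."""
    has_incoming = {succ for succs in graph.values() for succ in succs}
    starts = sorted(node for node in graph if node not in has_incoming)
    if len(starts) != 1:
        raise ValueError("expected exactly one start node in this toy example")
    node = starts[0]
    sequence = node
    visited = {node}
    while len(graph[node]) == 1:
        node = graph[node][0]
        if node in visited:  # guard against cycles
            break
        visited.add(node)
        sequence += node[-1]  # each step adds one new base
    return sequence


# Three overlapping reads from a hypothetical genome, deliberately out of order.
toy_reads = ["TGCAATCC", "ATGGCGTG", "GCGTGCAA"]
toy_graph = build_graph(toy_reads, k=4)
assembled = walk_unambiguous_path(toy_graph)
print(assembled)
```

Note that the order of the reads never matters: the graph is built from the set of k-mers, and the sequence falls out of following the edges. Repeats are exactly the cases where a node has more than one successor and this naive walk stops.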

Why This is Brilliant:

Instead of comparing giant reads, you're comparing tiny k-mers. Instead of quadratic complexity (n²), you get near-linear complexity. Suddenly, assembly becomes computationally feasible.

The Aha Moment:

When I first saw this explained, I thought: "This is like Scrabble!"

In Scrabble, you build words by connecting tiles that share letters. In genome assembly, you build genomes by connecting k-mers that share sequences. The difference? The genome "word" is 30,000 letters long and you have 2.5 million tiles to work with.

Third lesson learned: The right data structure (graph) makes impossible problems solvable.

Actually Running SPAdes

Galaxy made this deceptively simple:

Galaxy Interface:
1. Tool: "SPAdes"
2. Input dataset 1: SRR10971381_1.fastq (forward reads)
3. Input dataset 2: SRR10971381_2.fastq (reverse reads)
4. K-mer sizes: 21, 33, 55 (SPAdes automatically tries multiple)
5. Click: "Execute"
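I ran everything through Galaxy, but for anyone who prefers a terminal, the same settings map onto a `spades.py` call. The sketch below only builds and prints the command, and runs it if SPAdes happens to be installed and the input files exist. The output directory name `spades_output` is a placeholder of mine, and I'm assuming a local SPAdes install, which the course did not require.

```python
import os
import shutil
import subprocess


def spades_command(forward, reverse, outdir, kmer_sizes=(21, 33, 55)):
    """Build the argument list for a paired-end SPAdes run."""
    return [
        "spades.py",
        "-1", forward,   # forward reads
        "-2", reverse,   # reverse reads
        "-k", ",".join(str(k) for k in kmer_sizes),
        "-o", outdir,    # output directory
    ]


command = spades_command(
    "SRR10971381_1.fastq", "SRR10971381_2.fastq", "spades_output"
)
print(" ".join(command))

# Only launch the assembly when the tool and both inputs are actually available.
inputs_present = all(os.path.exists(path) for path in (command[2], command[4]))
if shutil.which("spades.py") and inputs_present:
    subprocess.run(command, check=True)
```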

Then I waited. And waited. And waited.

Assembly runtime: ~45 minutes.

While it ran, I learned what was happening inside:

The SPAdes Pipeline:

Step 1: Error Correction
SPAdes first uses BayesHammer to correct sequencing errors. Remember those low-quality bases (N, #)? The algorithm uses the fact that most of the genome has 200x coverage. If 199 reads say "ATCG" and one says "ATCA" at the same position, that single read probably has an error.
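The "199 against 1" intuition can be sketched with plain k-mer counting. To be clear, this is not BayesHammer, whose statistical model is considerably more sophisticated. It is just a count threshold on toy reads, with names and the cutoff chosen by me for illustration.

```python
from collections import Counter


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(counts: Counter, min_count: int) -> set[str]:
    """K-mers seen fewer than min_count times are likely sequencing errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}


true_read = "ATCGGCTAAGCT"
bad_read = "ATCGGCTAAGCA"  # same read with the last base miscalled
toy_reads = [true_read] * 199 + [bad_read]

counts = count_kmers(toy_reads, k=4)
suspicious = weak_kmers(counts, min_count=3)  # threshold is an arbitrary choice
print(suspicious)
print("AGCT seen", counts["AGCT"], "times; AGCA seen", counts["AGCA"], "time")
```

At high coverage, genuine k-mers pile up and error k-mers stay rare, which is what makes the correction possible in the first place.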

Step 2: K-mer Graph Construction
Build de Bruijn graphs with multiple k-mer sizes (21, 33, 55). Why multiple? Small k-mers resolve repeats poorly but handle low-coverage regions well. Large k-mers resolve repeats well but require high coverage. Using multiple k-mer sizes gives you the best of both worlds.

Step 3: Graph Simplification
The graph has millions of nodes and edges. SPAdes removes:
- Dead-end branches (sequencing errors)
- Bubbles (heterogeneity or errors)
- Tips (incomplete coverage at genome ends)

Step 4: Repeat Resolution
This is the hardest part. When the graph has ambiguous paths (caused by repeats), SPAdes uses paired-end information to figure out which path is correct. Remember: you know read pairs are ~500 bp apart. That constraint resolves many ambiguities.

Step 5: Contig Generation
Finally, SPAdes outputs contigs—continuous stretches of assembled sequence.

The Results: Opening Pandora's Box (or the Assembly File)

When SPAdes finished, I got several output files. The most important was contigs.fasta:

>NODE_1_length_29903_cov_193.456
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
...
[continues for 29,903 nucleotides]

>NODE_2_length_184_cov_12.743
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
...
[shorter contigs]

Initial reaction: Pure joy. I HAD ASSEMBLED A GENOME.

Immediate questions:
- Is this the full genome? (Yes: 29,903 bp matches known SARS-CoV-2 length!)
- Is it one piece? (Mostly: NODE_1 is the main genome)
- What are these other NODEs? (Smaller contigs, possibly contamination or assembly artifacts)
- Is it correct? (Need to validate...)

Understanding Assembly Quality: The N50 Metric

To assess assembly quality, I ran QUAST (Quality Assessment Tool for Genome Assemblies).

QUAST gave me statistics like this:

Number of contigs: 3
Total length: 30,121 bp
Largest contig: 29,903 bp
N50: 29,903 bp
GC%: 37.97%

What is N50?

This confused me at first. Here's the simplest explanation:

Definition: Order all your contigs by length (longest to shortest). Start adding their lengths together. N50 is the length of the contig where you've reached 50% of the total assembly length.

Example:
- Contigs: 100 bp, 80 bp, 50 bp, 30 bp
- Total length: 260 bp
- Add: 100... 180 (reached 50% of 260)...
- N50 = 80 bp (the contig where we crossed 50%)
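The same calculation in Python, as a minimal sketch (QUAST computes this for you; the function name is mine):

```python
def n50(lengths: list[int]) -> int:
    """Length of the contig at which the running total reaches half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


print(n50([100, 80, 50, 30]))  # the worked example above
```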

Why it matters:
- High N50 = few large contigs = good assembly
- Low N50 = many small contigs = fragmented assembly

Our result: N50 = 29,903 bp

This is PERFECT. It means our largest contig accounts for >50% of the assembly, because it IS essentially the entire genome in one piece.

NG50 (The Reference-Based Version):

If you know the expected genome size (for SARS-CoV-2: ~30,000 bp), NG50 uses that instead of total assembly length. Same concept, just referenced to known biology.
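A sketch of NG50 shows the one difference: the halfway mark comes from the expected genome size rather than from the assembly itself. The genome sizes used below (400 bp and 600 bp) are made-up values for the toy contigs, and the function name is mine.

```python
from typing import Optional


def ng50(lengths: list[int], genome_size: int) -> Optional[int]:
    """Like N50, but the 50% target is half the expected genome size."""
    half = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None  # the assembly covers less than half the expected genome


toy_contigs = [100, 80, 50, 30]  # 260 bp in total
print(ng50(toy_contigs, genome_size=400))
print(ng50(toy_contigs, genome_size=600))
```

When the assembly is much smaller than the expected genome, NG50 may not exist at all, which is itself a useful warning sign that N50 alone would hide.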

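Fourth lesson learned: N50 is the most widely quoted measure of assembly contiguity. High = few large contigs, low = fragmented mess. It says nothing about whether the sequence is correct, which is why validation comes next.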

Validating the Assembly: Did We Get It Right?

Having a 29,903 bp contig is great, but is it the CORRECT sequence? Time for validation.

Method 1: BLAST Against Known Genomes

I took our assembled contig and BLASTed it against NCBI's nucleotide database (limiting to viruses):

Top hit:

Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1
Query Cover: 100%
Percent Identity: 99.99%

Translation: Our assembly is nearly identical to the reference SARS-CoV-2 genome from Wuhan. The 0.01% difference is likely real biological variation (this is a different patient) or minor sequencing errors.

Method 2: Check for Closest Relative

Next question: What's the closest known relative of SARS-CoV-2?

I BLASTed against all viruses BEFORE January 2020 (excluding SARS-CoV-2 itself):

Top hit:

Bat coronavirus RaTG13
Query Cover: 98%
Percent Identity: 96.1%

Significance: This confirmed what scientists reported in early 2020—SARS-CoV-2's closest known relative is a bat coronavirus, sharing 96% sequence identity. That ~4% difference represents years of evolution and adaptation, including the mutations that allowed human transmission.

Method 3: GC Content Check

SARS-CoV-2 should have ~38% GC content (percentage of bases that are G or C vs. A or T).

Our assembly: 37.97%

Close enough! This is another sanity check. Wildly different GC% would suggest contamination or assembly error.
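GC content is simple enough to compute by hand. A minimal sketch (function name mine; ambiguous bases such as N are left out of the denominator, which is my choice and may differ slightly from how QUAST counts):

```python
def gc_content(sequence: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100 * gc / total


print(gc_content("ATGC"))
print(gc_content("GGCCNN"))
```

Running it on the full NODE_1 sequence from contigs.fasta should land near the figure QUAST reported.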

Fifth lesson learned: Assembly is just the first step. Validation is critical—BLAST, reference comparison, sanity checks.

What Could Go Wrong: Assembly Failure Modes

Not all assemblies work this well. Here's what can go wrong:

Problem 1: Low Coverage

If you only have 10x coverage instead of 200x, the graph has gaps. You get many short contigs instead of one long one.

Solution: Sequence deeper (more data).

Problem 2: Contamination

If your sample has bacterial DNA mixed in, you get extra contigs that don't belong.

Our dataset: The small extra contigs (NODE_2, at 184 bp and ~12x coverage, and NODE_3) were likely bacterial contamination or human RNA that co-extracted with the virus.

Solution: Filter by coverage depth (keep high coverage contigs) or BLAST contigs to identify source.
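Because SPAdes writes length and coverage into each contig name, a coverage filter can work from the headers alone. This is a sketch: the 50x cutoff is an arbitrary value I picked for illustration, and the regular expression assumes the `NODE_<n>_length_<bp>_cov_<depth>` naming shown in the contigs.fasta excerpt earlier.

```python
import re

HEADER_PATTERN = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_header(header: str) -> tuple[int, int, float]:
    """Extract (node number, length in bp, coverage) from a SPAdes contig name."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError(f"not a SPAdes-style header: {header!r}")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


headers = [
    ">NODE_1_length_29903_cov_193.456",
    ">NODE_2_length_184_cov_12.743",
]

MIN_COVERAGE = 50.0  # assumed cutoff, tune for your own data
kept = [h for h in headers if parse_header(h)[2] >= MIN_COVERAGE]
print(kept)
```

Low coverage alone doesn't prove contamination, so BLASTing the discarded contigs is still worth doing.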

Problem 3: Repeats Longer Than Read Length

If a repeat is 500 bp and your reads are only 150 bp, you can't span it. The assembler may collapse repeats (report one copy when there are multiple) or break the assembly.

SARS-CoV-2 advantage: Small genome, few long repeats. Easy to assemble.

Solution: Long-read sequencing (PacBio, Nanopore) or paired-end with longer insert sizes.

Problem 4: High Error Rates

Low-quality sequencing produces wrong k-mers, confusing the graph.

Solution: Quality filtering before assembly, or use error-correction (like SPAdes' BayesHammer).

The Moment of Wonder: Genome in Hand

After validation, I had the complete SARS-CoV-2 genome sequence from a Wuhan patient, January 2020—one of the earliest pandemic samples.

This single file, 29,903 letters of genetic code, contained:
- Instructions for the viral proteins (about 11 annotated genes, which yield more proteins than that once the large ORF1ab polyprotein is cleaved)
- The spike protein that binds human cells
- The RNA polymerase that copies the viral genome
- Mutations that distinguish this virus from bat coronaviruses
- Clues about where this pandemic began

And I had reconstructed it from millions of tiny, disordered DNA fragments.

That feeling was surreal.

In my wet-lab work, I amplified and sequenced a single gene (ATP8B1, ~2,000 bp). It took months and required PCR optimization, gel troubleshooting, Sanger sequencing. Here, I assembled an ENTIRE VIRAL GENOME (29,903 bp) in under an hour of compute time from publicly available data.

The power of computational biology hit me viscerally.

Comparing Assembly Strategies: De Novo vs. Reference-Guided

There are actually two main approaches to genome assembly:

De Novo Assembly (What We Did):

No prior knowledge of the genome sequence
Build from scratch using only read overlaps
Harder computationally
Required for novel organisms or novel variants

Reference-Guided Assembly:

Use known reference genome (e.g., published SARS-CoV-2 Wuhan-Hu-1)
Align reads to reference
Call variants where your sample differs
Easier and faster
Misses large structural changes

Why we did de novo:
1. Educational value: teaches the fundamental algorithm
2. What if this were a completely new virus?
3. Doesn't bias toward reference sequence

When to use reference-guided:
- Studying human genomes (reference exists)
- SNP calling in populations
- Variant analysis
- Speed matters

Key Concepts I Finally Understood

After completing the assembly, several concepts clicked:

1. K-mer Size is a Trade-off

Small k (e.g., 21): Better for low coverage, worse for repeats
Large k (e.g., 55): Better for repeats, worse for low coverage
SPAdes uses multiple k-mers to get both benefits

2. Coverage Depth Matters (With Diminishing Returns)

10x coverage: Barely sufficient, many gaps
50x coverage: Good, few gaps
200x coverage: Excellent, can correct errors
1000x coverage: Overkill, wastes resources

Our 200x coverage was the sweet spot.
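Average coverage is just total sequenced bases divided by genome size, so it's easy to work out how many reads each tier above would take. A sketch, assuming 150 bp reads that all come from the 29,903 bp genome (function names mine):

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 29_903
READ_LENGTH = 150

for target in (10, 50, 200, 1000):
    needed = reads_needed(target, READ_LENGTH, GENOME_SIZE)
    print(f"{target:>5}x needs about {needed:,} reads")

# Upper bound if every read in the dataset were viral.
upper_bound = mean_coverage(2_512_682, READ_LENGTH, GENOME_SIZE)
print(f"all reads viral: {upper_bound:,.0f}x")
```

If all 2,512,682 reads were viral, the average depth would be above 12,000x. The roughly 193x that SPAdes reported for NODE_1 therefore suggests that only a small share of the reads come from the virus itself. That is my inference from the arithmetic, not something I measured.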

3. Paired-End Info is Gold

The knowledge that read pairs are ~500 bp apart resolves so many ambiguities. It's like having scaffolding for your puzzle—you know spatial constraints.

4. Graph Algorithms Are Beautiful

De Bruijn graphs convert sequence assembly into a graph traversal problem. It's the computational breakthrough that made modern genomics possible.

5. Assembly is Lossy

You can never be 100% certain the assembly is perfect. Some regions are ambiguous. Some repeats might be misplaced. Validation is essential.

What I'd Do Differently

Looking back, here's what I wish I'd known:

1. Visualize the Graph
SPAdes can output the assembly graph. Tools like Bandage let you visualize it. Seeing the actual de Bruijn graph would have made the algorithm much more concrete.

2. Try Different Parameters
I used default settings (k=21,33,55). What if I used k=25,35,45? How would results differ? Experimentation teaches more than following instructions.

3. Assemble Related Viruses
I could have downloaded SARS-CoV-1 data and assembled it for comparison. Would it be harder (more divergent from reference) or easier (smaller genome)?

4. Quantify Errors
Where exactly were the differences between my assembly and the reference? Were they in specific genes? Random? Clustered? This would reveal biological insights.

5. Assembly QC Beyond N50
Metrics like:
- L50: Number of contigs comprising 50% of assembly
- BUSCO: Completeness based on expected genes
- Misassemblies: Structural errors
- Coverage uniformity: Are some regions under-represented?

The Bigger Picture: Why Assembly Matters

Genome assembly isn't just an academic exercise. It's foundational to:

1. Pandemic Response

SARS-CoV-2 was sequenced and shared globally within weeks of discovery. This enabled:
- Diagnostic test development (PCR primers)
- Vaccine design (spike protein sequence)
- Variant tracking (Omicron, Delta, etc.)
- Epidemiological tracing

All depended on accurate, rapid genome assembly.

2. Personalized Medicine

Assembling patient tumor genomes identifies mutations driving cancer. Guides targeted therapy selection.

3. Biodiversity & Conservation

Assembling endangered species genomes helps:
- Understand population genetics
- Guide breeding programs
- Identify adaptive traits

4. Agricultural Improvement

Crop genome assembly enables:
- Marker-assisted breeding
- Disease resistance identification
- Yield optimization

5. Evolutionary Biology

Comparing assembled genomes across species reveals:
- How organisms adapt
- When species diverged
- What makes us human

Every one of these applications requires assembling genomes from fragmented sequencing reads. The algorithm I learned—de Bruijn graphs, k-mer decomposition, graph traversal—underpins all modern genomics.

My Assembly Journey: Key Takeaways

Week 1: Confused by FASTQ format, overwhelmed by file sizes, no idea what k-mers were.

Week 2: Understanding de Bruijn graphs conceptually, but still treating SPAdes as a black box.

Week 3: Successfully assembled SARS-CoV-2, validated with BLAST, understood output files.

Week 4: Could explain the algorithm to classmates, appreciate trade-offs, design assembly strategies.

Today: Comfortable with assembly concepts, ready to tackle more complex genomes (bacteria, eukaryotes), understand when to use de novo vs. reference-guided.

Most important lesson: Genome assembly converts experimental data (sequencing reads) into biological knowledge (genome sequence) through clever algorithms. Understanding the algorithm makes you a better bioinformatician than just knowing which buttons to click.

Resources That Helped Me Learn

Tools:
- SPAdes: De novo assembler (what we used)
- QUAST: Assembly quality assessment
- Galaxy: Web-based platform (beginner-friendly)
- Bandage: Graph visualization
- BLAST: Sequence similarity search

Tutorials:
- UC San Diego Bioinformatics Specialization (Coursera)
- SPAdes documentation: https://github.com/ablab/spades
- Ben Langmead's Algorithms for DNA Sequencing (Coursera)

Papers:
- Bankevich et al. (2012): SPAdes algorithm
- Miller et al. (2010): Assembly complexity
- Bradnam et al. (2013): Assemblathon 2 assembly comparison

Concepts to Master:
- K-mers and de Bruijn graphs
- Paired-end sequencing
- Coverage depth
- N50/NG50 statistics
- Graph traversal algorithms

What's Next: Beyond SARS-CoV-2

Now that I've assembled a 30kb viral genome, what's next?

Short-term goals:
- Assemble bacterial genomes (~5 Mb, over 150x larger)
- Work with metagenomic data (mixed species)
- Try long-read assembly (PacBio/Nanopore)
- Learn hybrid assembly (short + long reads)

Medium-term goals:
- Assemble eukaryotic genomes (gigabases!)
- Understand scaffolding and gap-filling
- Compare assemblers (SPAdes vs. MEGAHIT vs. Flye)
- Contribute to open-source assembly tools

Dream project: Assemble a novel virus genome from environmental samples—something that's never been sequenced before. Use the same skills I learned with SARS-CoV-2, but for discovery rather than replication.

Final Thoughts: The Jigsaw Puzzle Completed

When I started, genome assembly seemed impossibly complex—how could anyone reconstruct a genome from millions of fragments?

Now I understand: it's not magic, it's mathematics. It's graph theory, probabilistic error correction, and constraint satisfaction. It's turning a biological problem (what's the genome sequence?) into a computational problem (what's the optimal path through this graph?).

The jigsaw puzzle analogy holds:
- Sequencing reads are puzzle pieces
- K-mers are the matching edges
- De Bruijn graph is the solution strategy
- The assembled genome is the completed picture

But unlike a jigsaw puzzle, the picture we're revealing isn't a landscape or portrait—it's the genetic blueprint of life itself. Each assembly is a window into evolution, adaptation, and the molecular machinery that makes organisms work.

My first genome assembly—SARS-CoV-2, January 2020, 29,903 nucleotides—represents more than just a technical accomplishment. It represents understanding how computational tools convert raw data into biological knowledge. It represents joining the community of scientists who can read life's code and use it to answer questions, fight diseases, and push the boundaries of what we know.

Next time someone asks me "What's genome assembly?"

I'll say: "It's solving the world's largest jigsaw puzzle—except the pieces are microscopic, there are millions of them, you don't know what the picture looks like, and when you're done, you've decoded the instructions for building a living organism."

And then I'll show them my SARS-CoV-2 assembly and explain how I did it.

Try It Yourself: A Step-by-Step Guide

Want to assemble your own genome? Here's how:

Prerequisites:

Computer with internet connection
Galaxy account (free): https://usegalaxy.org
2-3 hours of time

Steps:

1. Get Data
- Go to NCBI SRA: https://www.ncbi.nlm.nih.gov/sra
- Search: SRR10971381
- Download FASTQ files (or use Galaxy's SRA toolkit)

2. Upload to Galaxy
- Click "Upload Data"
- Select files: SRR10971381_1.fastq and SRR10971381_2.fastq

3. Run SPAdes
- Tools → Assembly → SPAdes
- Dataset 1: forward reads
- Dataset 2: reverse reads
- Execute!

4. Run QUAST
- Tools → Assembly → QUAST
- Input: SPAdes contigs output
- Execute!

5. Validate with BLAST
- Tools → NCBI BLAST+ → blastn
- Query: largest contig
- Database: NCBI nt (nucleotide)
- Examine results!

6. Celebrate
- You've assembled a pandemic genome! 🎉

Syed Muhammad Ali Shirazi is a bioinformatics researcher specializing in viral genomics and computational biology. He combines wet-lab molecular biology expertise with computational analysis to bridge experimental and theoretical approaches in genomics.

Questions about genome assembly? Stuck on a particular concept? Drop a comment or reach out—I'm happy to help explain!