Sequencing run S0592 complete! Final step: Analysis

Sunday, December 17, 2023

After 16 cycles of incorporation, imaging, and cleaving, sequencing run S0592 is complete! All that’s left is to run the data through the analysis pipeline.

The first step is the Image Pre-Processing Module. The raw sequencing images are fed through this module, and the output is a folder containing all of the pre-processed and registered images. Pre-processing cleans up the raw images to ensure that sequencing clusters can be identified at each cycle and that any interfering surface imperfections or debris are excluded from analysis.

S0592 Image Pre-Processing Input:

Raw Images

S0592 Image Pre-Processing Output:

Pre-Processed Images

The Pre-Processed Images are used as input for step two of the analysis pipeline: ROI Detection, Intensity Extraction, and Auto-Dictionary. This step of the pipeline includes defining all of the sequencing clusters in the images (referred to as “ROIs”, or Regions of Interest); extracting the intensity values from each sequencing cluster (ROI) across all 4 channels in all 16 cycles of sequencing; and defining the color composition for A, C, G, and T (referred to as the dictionary) using the signal across all 4 channels for a representative A, C, G, and T cluster.
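To make the intensity extraction idea concrete, here is a minimal sketch in Python. It is not the pipeline's actual code: the array layout (cycles, channels, height, width), the circular ROI shape, the `radius` parameter, and the function name `extract_roi_intensities` are all assumptions made for illustration.

```python
import numpy as np

def extract_roi_intensities(images, centers, radius=2):
    """Mean intensity per ROI, per cycle, per channel.

    images:  array of shape (n_cycles, n_channels, height, width) (assumed layout)
    centers: list of (row, col) ROI centers (hypothetical; the real pipeline
             detects ROIs itself)
    Returns an array of shape (n_rois, n_cycles, n_channels).
    """
    images = np.asarray(images, dtype=float)
    n_cycles, n_channels, height, width = images.shape
    rows, cols = np.ogrid[:height, :width]
    out = np.zeros((len(centers), n_cycles, n_channels))
    for i, (r, c) in enumerate(centers):
        # Circular mask around the ROI center (an assumed ROI shape).
        mask = (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
        # Boolean indexing on the last two axes keeps only the ROI pixels.
        out[i] = images[:, :, mask].mean(axis=-1)
    return out

# Synthetic stack: 16 cycles, 4 channels, one bright spot in channel index 2.
stack = np.zeros((16, 4, 20, 20))
stack[:, 2, 8:13, 8:13] = 5.0
intensities = extract_roi_intensities(stack, centers=[(10, 10), (3, 3)])
print(intensities.shape)
```

Each ROI ends up with one value per channel per cycle, which is the kind of table that a file like Cluster_intensities.csv would hold.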

The output of interest from the ROI Detection, Intensity Extraction, and Auto-Dictionary step in the pipeline is a .csv file. This file contains the coordinates and signal data from all 4 channels for each ROI and the 4-base dictionary across all 16 cycles. We expect our dictionary to reflect the dominating signal for each of the four Lightning Terminators™:

Lightning Terminator™ (base)	Dominating Signal
LTdA	590, “orange”
LTdC	525, “green”
LTdG	445, “blue”
LTdT	645, “red”
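A quick sanity check on a dictionary is to confirm that each base's strongest channel matches the table above. The sketch below assumes a channel order of 445, 525, 590, 645 and uses made-up dictionary values; the names `CHANNELS`, `EXPECTED`, and `dominant_channels` are hypothetical and not part of the pipeline.

```python
import numpy as np

# Assumed channel order for this sketch; the labels come from the table above.
CHANNELS = [445, 525, 590, 645]
EXPECTED = {"A": 590, "C": 525, "G": 445, "T": 645}

def dominant_channels(dictionary):
    """Return the channel with the highest signal for each base."""
    return {base: CHANNELS[int(np.argmax(signal))]
            for base, signal in dictionary.items()}

# Made-up dictionary values, for illustration only (not from run S0592).
example_dictionary = {
    "A": [0.05, 0.20, 1.00, 0.30],
    "C": [0.10, 1.00, 0.25, 0.05],
    "G": [1.00, 0.15, 0.05, 0.02],
    "T": [0.02, 0.05, 0.35, 1.00],
}

observed = dominant_channels(example_dictionary)
for base in "ACGT":
    status = "ok" if observed[base] == EXPECTED[base] else "unexpected"
    print(base, observed[base], status)
```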

ROI Detection, Intensity Extraction, and Auto-Dictionary Input:

Pre-Processed Images

ROI Detection, Intensity Extraction, and Auto-Dictionary Output:

Cluster_intensities.csv

The third module the data is fed through is Color Transformation. This algorithm does a few important things to the Cluster_intensities.csv data:

Compares the unique color composition of each sequencing cluster to the dictionary for A, C, G, and T. We expect high signal in one channel for each of the 4 bases.
Calculates the relative amount of each base in each cluster using the measured color composition at cycle 1 compared to the dictionary
Repeats this process for each cycle of sequencing for all identified clusters
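The steps above can be sketched as a small unmixing problem: treat each cluster's measured 4-channel signal as a mix of the four dictionary entries and solve for the relative amounts. The source does not say how Color Transformation does this internally, so the least-squares fit, the clipping of negative values, and the dictionary numbers below are all assumptions for illustration.

```python
import numpy as np

BASES = "ACGT"

# Made-up dictionary: one row per base (A, C, G, T), one column per channel.
D = np.array([
    [0.05, 0.20, 1.00, 0.30],  # A
    [0.10, 1.00, 0.25, 0.05],  # C
    [1.00, 0.15, 0.05, 0.02],  # G
    [0.02, 0.05, 0.35, 1.00],  # T
])

def color_transform(intensities, dictionary):
    """Estimate relative base amounts for one cluster.

    intensities: (n_cycles, 4 channels) measured signal
    dictionary:  (4 bases, 4 channels) reference color compositions
    Returns (n_cycles, 4 bases), each row normalized to sum to 1.
    """
    intensities = np.asarray(intensities, dtype=float)
    # Solve dictionary.T @ amounts = measured, for every cycle at once.
    amounts, *_ = np.linalg.lstsq(dictionary.T, intensities.T, rcond=None)
    # Negative amounts are not physical; clip them (a simplification).
    amounts = np.clip(amounts.T, 0.0, None)
    totals = amounts.sum(axis=1, keepdims=True)
    return np.divide(amounts, totals, out=np.zeros_like(amounts), where=totals > 0)

# Cycle 1: pure T. Cycle 2: mostly G with leftover T and a little C.
measured = np.array([
    D[3],
    0.6 * D[2] + 0.3 * D[3] + 0.1 * D[1],
])
for cycle, row in enumerate(color_transform(measured, D), start=1):
    print(cycle, dict(zip(BASES, np.round(row, 3))))
```

A constrained solver (for example non-negative least squares) would be a natural alternative to clipping; plain least squares keeps the sketch short.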

Color Transformation Input:

Cluster_intensities.csv

Color Transformation Output:

color_transformed_spots.csv

At cycle one of sequencing, we expect clusters to have high amounts of one Lightning Terminator. At cycles greater than one, however, clusters will have lower purity as dephasing accumulates. Here’s an example of Color Transformation results for one cluster across 5 cycles:

At cycle 1, this cluster has LTdT (red) incorporation, so when we get to basecalling, this cluster will definitely be a “T”. At cycle 2, things are a little less obvious. It looks like some “T” signal remains from cycle 1. In addition, there is blue in our color-transformed results, indicating that a G was incorporated. There’s also a very small amount of green (C) present. To make a basecall at cycle 2 and beyond, it will be helpful to put this data through the Dephasing algorithm.

Looking at the color transformed data above, it is clear our data has been impacted by dephasing. Our sequencing clusters each have thousands of copies of the same DNA molecule being sequenced. At cycle 1, when we should not see any dephasing, all of the sequencing primers will be extended to the first base in the template DNA. Dephasing can manifest in many ways using our LED-TIRF Transformer along with Lightning Terminators™, but dephasing will always fall into one of two categories (although we will see both in the same clusters!):

Lead: when some extended sequencing primers are at a position in the template greater than the number of cycles.

ex.: At cycle 7 of sequencing, every sequencing primer in the cluster should be extended to the 7th base in the sequencing template. If we see evidence that some primers have extended to the 8th, or even 9th base in the template, there is lead.
Lag: when some extended sequencing primers are at a position in the template less than the number of cycles.

ex.: At cycle 7 of sequencing, every sequencing primer in the cluster should be extended to the 7th base in the sequencing template. If we see evidence that some primers have only extended to the 5th or 6th base, there is lag.
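To see how lead and lag build up over a run, here is a toy model. It assumes every primer sits at base 1 after cycle 1, and that at each later cycle a primer either fails to extend (lag), extends by two bases (lead), or extends by one base as intended. The per-cycle rates of 5% lag and 2% lead are invented for illustration, and this is not the model used by the Dephasing and Basecaller module.

```python
import numpy as np

def position_distribution(n_cycles, lag=0.05, lead=0.02):
    """Fraction of primers at each template position after each cycle.

    Row c-1 is cycle c; column p-1 is template position p.
    """
    n_positions = 2 * n_cycles  # enough room: at most 2 bases per cycle
    dist = np.zeros((n_cycles, n_positions))
    dist[0, 0] = 1.0  # cycle 1: every primer is at position 1
    for c in range(1, n_cycles):
        prev = dist[c - 1]
        cur = dist[c]  # view into dist, updated in place
        cur += lag * prev                        # lag: no extension this cycle
        cur[1:] += (1 - lag - lead) * prev[:-1]  # in phase: extend by one base
        cur[2:] += lead * prev[:-2]              # lead: extend by two bases
    return dist

dist = position_distribution(7)
cycle7 = dist[6]
print("in phase (7th base):", round(cycle7[6], 3))
print("lagging (before 7th):", round(cycle7[:6].sum(), 3))
print("leading (past 7th):", round(cycle7[7:].sum(), 3))
```

Even with small per-cycle rates, the in-phase fraction shrinks every cycle, which is why purity drops as the run goes on.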

Luckily, the final step in our analysis pipeline attempts to account for lead and lag in order to generate basecalls, assign quality scores, and align the data to a reference. The Dephasing and Basecaller module attempts to find the best-fit model of lead and lag parameters for the color-transformed data. The dephasing parameters are used to generate basecalls, each of which is given a quality score. Finally, the basecalls are aligned to a reference, and since we’re working with known sequencing templates, we can also determine the error rate. This module has numerous outputs, one of which is sequencing data in FASTQ format. The FASTQ can be fed into some of 454 Bio’s Open Source analysis modules or existing bioinformatics tools, or you can develop your own workflows to share with the 454 Bio Open Source Community!
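For anyone building their own workflow, it helps to know what a FASTQ record looks like. The sketch below formats basecalls and per-base error probabilities as a four-line record using the standard Phred+33 quality encoding. How the module actually derives its quality scores is not described here, so the error probabilities, the cap at Q40, and the read name `roi_1` are assumptions.

```python
import math

def phred_quality(p_error, max_q=40):
    """Convert an error probability to a Phred score, Q = -10 * log10(p)."""
    p_error = max(p_error, 1e-10)  # avoid log10(0)
    q = int(round(-10 * math.log10(p_error)))
    return max(0, min(q, max_q))   # max_q cap is an assumption of this sketch

def fastq_record(read_id, bases, error_probs):
    """Format one read as a four-line FASTQ record (Phred+33 encoding)."""
    if len(bases) != len(error_probs):
        raise ValueError("need one error probability per base")
    quals = "".join(chr(phred_quality(p) + 33) for p in error_probs)
    return f"@{read_id}\n{bases}\n+\n{quals}\n"

# Hypothetical read name and error probabilities, for illustration only.
print(fastq_record("roi_1", "ACGT", [0.001, 0.01, 0.1, 0.5]), end="")
```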

Dephasing and Basecaller (Mel’s Code) Input:

color_transformed_spots.csv

Dephasing and Basecaller (Mel’s Code) Output:

Run Analysis