Analysis software

Introduction

This section outlines the data analysis pipeline for the “One pot” Lightning Terminators™ (LT) sequencing process. During each sequencing cycle, the Transformer MK3 prototype activates its LED excitation sources in sequence and captures and stores four monochrome raw images, one for each fluorescent dye-labeled LT (Atto647-LT-dT, AF594-LT-dA, AF532-LT-dC, AF488-LT-dG). The total number of images generated is therefore the number of cycles multiplied by four, one image per color channel per cycle.

Image Processing

The image processing module performs several critical functions:

  • Reading and Cropping: Loads the raw images and crops them.
  • Noise Reduction: Image binning and filtering.
  • Correction Procedures: Magnification correction, background normalization across the color channels, illumination uniformity correction, and cycle-to-cycle image registration (translation, rotation).
  • Defect Removal: Processes the well-registered images to eliminate surface scattering defects.
  • ROI Detection and Intensity Extraction: Utilizes the first cycle image set to identify each DNA cluster’s centroid (Region of Interest, ROI) and extracts intensity values at each cycle for subsequent analysis.
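As an illustration of the binning step listed above, here is a minimal NumPy sketch. It assumes that binning means summing non-overlapping pixel blocks; the function name `bin_image` and the default factor of 2 are hypothetical and are not taken from ClusterSeqIP_v2.py.

```python
import numpy as np

def bin_image(image: np.ndarray, factor: int = 2) -> np.ndarray:
    """Sum non-overlapping factor x factor pixel blocks of a 2-D image."""
    rows, cols = image.shape
    # Trim edge pixels that do not fill a complete block.
    rows -= rows % factor
    cols -= cols % factor
    trimmed = image[:rows, :cols].astype(np.float64)
    # Give each block its own pair of axes, then sum over the in-block axes.
    blocks = trimmed.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.sum(axis=(1, 3))

# Small demonstration on a 4 x 4 ramp image.
raw = np.arange(16).reshape(4, 4)
binned = bin_image(raw, factor=2)
print(binned)
```

Summing (rather than averaging) keeps the total photon count of each block, which raises the per-pixel signal at the cost of spatial resolution.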

Below is a set of processed raw images demonstrating these steps.

Color Transformation

This section introduces our newly-developed algorithm designed to assess the ‘color purity level’ of each cluster in a given cycle.

The algorithm evaluates the unique fluorescence signature of each dye and accounts for the optical cross-talk among the four color channels. In the initial cycle, images are presumed to be nearly ‘pure’, typically exhibiting only a single dye per cluster spot. In subsequent cycles, however, dephasing can leave a mixture of dyes within a cluster spot by the end of the cycle. The color transformation algorithm quantifies the proportion of each dye present at the end of each cycle.
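The idea of unmixing cross-talk can be sketched as a small linear least-squares problem. This is only an illustration under stated assumptions, not the algorithm implemented in color_transform.py: the 4 x 4 `dictionary` values are invented, the base order A, C, G, T along the rows is arbitrary, and the function name `color_purity` is hypothetical.

```python
import numpy as np

# Hypothetical dictionary: row i is the signal that a pure cluster of base i
# produces in the four channels (off-diagonal terms model optical cross-talk).
BASES = ["A", "C", "G", "T"]
dictionary = np.array([
    [1.00, 0.20, 0.02, 0.00],
    [0.15, 1.00, 0.10, 0.01],
    [0.01, 0.25, 1.00, 0.05],
    [0.00, 0.02, 0.10, 1.00],
])

def color_purity(intensity, dictionary, background=0.0):
    """Estimate the fraction of each dye contributing to one cluster's signal."""
    signal = np.asarray(intensity, dtype=float) - background
    # Model: signal = dictionary.T @ amounts; solve for amounts by least squares.
    amounts, *_ = np.linalg.lstsq(dictionary.T, signal, rcond=None)
    amounts = np.clip(amounts, 0.0, None)  # negative dye amounts are unphysical
    total = amounts.sum()
    return amounts / total if total > 0 else amounts

# A cluster that is 80% A and 20% G, as might occur after some dephasing.
mixed = 0.8 * dictionary[0] + 0.2 * dictionary[2]
purity = color_purity(mixed, dictionary)
print(dict(zip(BASES, purity.round(3))))
```

Clipping followed by renormalization is a simple stand-in for a properly constrained solver; a non-negative least-squares routine would be a natural alternative.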

The example below shows how the “color purity” of a typical cluster (each color represents one base: A, C, G, or T) changes from cycle 1 to cycle 5.

Dephasing Correction and Basecalling

Dephasing is a significant issue in cyclic reversible termination sequencing. It refers to the loss of synchrony among the DNA strands of a cluster that are being sequenced simultaneously. In our one-pot sequencing, all DNA strands in a given cluster are expected to incorporate the same nucleotide at each cycle. However, factors such as insufficient UV cleavage or polymerase fall-off can cause some strands to fail to incorporate a nucleotide in a given cycle (a ‘lag’), while other factors, such as the generation of “darkbase” (non-terminated LTs), can cause some strands to incorporate more than one nucleotide (a ‘lead’ or ‘carry-forward’). The result is a progressive loss of synchrony, or “phasing”, over time.

Our dephasing correction algorithm uses computational methods to model the expected signal from the color transformed clusters, detect and minimize the deviations, and apply corrections to the sequencing readouts to improve the accuracy of the base call.
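To make the lag/lead picture concrete, the following sketch implements a generic textbook-style phasing model and inverts it by least squares. It is an assumption-laden illustration, not the correction algorithm shipped with this pipeline: the function names, the per-cycle probabilities `lag` and `lead`, and the geometric `decay` term are all hypothetical choices.

```python
import numpy as np

def phasing_matrix(n_cycles, lag, lead, decay=0.0):
    """Row c holds the fraction of strands at each template position in cycle c.

    lag   -- probability that a strand fails to extend in a cycle
    lead  -- probability that a strand extends by two bases in a cycle
    decay -- fraction of signal lost per cycle
    """
    dist = np.zeros(n_cycles + 1)  # position 0 = nothing incorporated yet
    dist[0] = 1.0
    rows = np.zeros((n_cycles, n_cycles))
    for c in range(n_cycles):
        new = lag * dist                            # strands that stay put
        new[1:] += (1.0 - lag - lead) * dist[:-1]   # normal one-base step
        new[2:] += lead * dist[:-2]                 # two-base step (carry-forward)
        dist = new
        # A strand at position p shows the dye of base p; strands that have not
        # started, or that ran past the modelled read length, are ignored.
        rows[c] = dist[1:] * (1.0 - decay) ** c
    return rows

def correct_dephasing(observed, lag, lead, decay=0.0):
    """Recover per-position base signals from dephased per-cycle signals."""
    P = phasing_matrix(observed.shape[0], lag, lead, decay)
    corrected, *_ = np.linalg.lstsq(P, observed, rcond=None)
    return corrected

# Simulate a 12-base read (0=A, 1=C, 2=G, 3=T), dephase it, then correct it.
seq = [0, 1, 2, 3, 3, 2, 1, 0, 0, 2, 1, 3]
truth = np.eye(4)[seq]                                   # one-hot, cycles x bases
observed = phasing_matrix(len(seq), 0.05, 0.02, 0.01) @ truth
corrected = correct_dephasing(observed, 0.05, 0.02, 0.01)
calls = corrected.argmax(axis=1)
print(calls.tolist())
```

In this toy setting the parameters used for correction are the same ones used for simulation, so the inversion is exact. With real data the parameters have to be estimated, and noise makes a regularized or constrained fit preferable to a plain inverse.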

Setup

Install Python 3.9 or later.

Libraries: numpy, roifile, matplotlib, opencv, pandas, scipy, scikit-learn, scikit-image, joblib (parallel computing), PyImageJ (which also requires OpenJDK 11 and Maven) https://py.imagej.net/en/latest/Install.html

Download and install Fiji.app from https://imagej.net/software/fiji/. Make sure that the Fiji “plugins” folder contains “…\plugins\Descriptor_based_registration-2.1.8.jar”; this file can be found in the image processing module.

Usage

Image Processing:

ClusterSeqIP_v2.py

Functionality:
ClusterSeqIP_v2.py preprocesses raw images from a sequencing run and outputs processed images into a designated folder, preparing them for color transformation. Key operations include image renaming, filtering, binning, background normalization, magnification correction, image registration, cropping, and illumination correction.

Execution Instructions:

  1. Ensure Fiji.app is downloaded and installed.
  2. Verify that PyImageJ, OpenJDK 11, and Maven are properly installed.
  3. In the script, set JAVA_HOME to the correct directory, for example:
    os.environ['JAVA_HOME'] = 'C:\\Program Files\\Microsoft\\jdk-11.0.21.9-hotspot'
    
  4. Specify the Fiji path:
     fijipath='...your directory/Fiji.app'
    
  5. Run the script in the terminal, passing the path to your raw image folder. Make sure this folder contains no non-.tif files:
& C:/Users/...your directory/python.exe "...your directory/ClusterSeqIP_v2.py" -i "/your_path_to/raw_image_folder"
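The prerequisites in the steps above can be checked before launching the script. The helper below is a hypothetical convenience sketch that uses only the Python standard library; it is not part of ClusterSeqIP_v2.py, and the function name `preflight` is an assumption. The plugin file name is the one given in the Setup section.

```python
import os
from pathlib import Path

# File name taken from the Setup section of this document.
PLUGIN_JAR = "Descriptor_based_registration-2.1.8.jar"

def preflight(raw_folder, fiji_path):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    java_home = os.environ.get("JAVA_HOME", "")
    if not java_home or not Path(java_home).is_dir():
        problems.append("JAVA_HOME is unset or does not point to a directory")
    if not (Path(fiji_path) / "plugins" / PLUGIN_JAR).is_file():
        problems.append(f"{PLUGIN_JAR} not found in the Fiji plugins folder")
    raw = Path(raw_folder)
    if not raw.is_dir():
        problems.append(f"raw image folder not found: {raw}")
    else:
        # Anything that is not a .tif file (including subfolders) is reported.
        stray = sorted(p.name for p in raw.iterdir() if p.suffix.lower() != ".tif")
        if stray:
            problems.append("non-.tif entries in raw image folder: " + ", ".join(stray))
    return problems
```

Calling `preflight` with your own raw image folder and Fiji.app directory and printing the returned list gives a quick summary of what still needs fixing.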

Note: Python will call Fiji to perform image registration. This requires users to manually adjust the cropping rectangle to the center and set the threshold to approximately 0.08.

Output:

  • 2_preprocess: Stores preprocessed images for debugging and manual registration.
  • 2_Regis: Contains registered images for end-users to visualize and check registration quality.
  • 2_processed_final: Holds the final processed images for the next step of color transformation. Pass this directory to DoG_v1.py once you confirm all images are properly processed.

DoG_v1.py

Functionality: DoG_v1.py processes the final image files to perform cluster spot detection using the Difference of Gaussians (DoG) method. It extracts intensity values from the processed images, applies a Chastity filter, and utilizes several other distance/intensity-based filters to eliminate unwanted spots. Finally, the script identifies dictionaries for all four base colors and outputs the intensities, cluster centroids, and dictionary as a CSV file, which is essential for the subsequent color transformation process.
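For readers unfamiliar with the Difference of Gaussians method, the sketch below shows the general technique using SciPy. It is not the code of DoG_v1.py: the function name `detect_spots_dog`, the two sigma values, the 3 x 3 peak neighbourhood, and the threshold are all assumptions chosen for the synthetic example.

```python
import numpy as np
from scipy import ndimage

def detect_spots_dog(image, sigma_small=1.0, sigma_large=2.0, threshold=0.05):
    """Return (row, col) coordinates of local maxima in a DoG-filtered image."""
    img = image.astype(float)
    # Subtracting a wide blur from a narrow blur acts as a band-pass filter
    # that enhances blob-like features and suppresses smooth background.
    dog = (ndimage.gaussian_filter(img, sigma_small)
           - ndimage.gaussian_filter(img, sigma_large))
    # A pixel is a spot centre if it is the maximum of its 3 x 3 neighbourhood
    # and its DoG response exceeds the threshold.
    is_peak = (dog == ndimage.maximum_filter(dog, size=3)) & (dog > threshold)
    return np.argwhere(is_peak)

# Synthetic test image: two blurred point sources on a dark background.
image = np.zeros((50, 60))
image[10, 10] = image[30, 40] = 100.0
image = ndimage.gaussian_filter(image, 1.5)
spots = detect_spots_dog(image, threshold=0.5)
print(spots.tolist())
```

On real images the threshold and sigma values would need tuning to the cluster size and noise level, and additional distance- or intensity-based filters (as described above) would remove spurious detections.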

Execution Instructions:

  1. Confirm that the folder 2_processed_final exists and contains all properly processed images. The total number of images should be a multiple of four, corresponding to the four fluorescent channels.
  2. Execute the script in the terminal, specifying the path to your 2_processed_final image folder:
& C:/Users/...your directory/python.exe "...your directory/DoG_v1.py" -i "/your_path_to/2_processed_final"
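The multiple-of-four check in step 1 can be expressed in a few lines. This sketch assumes that images are numbered consecutively with four images per cycle, as the first-cycle file names 00001-00004.tif suggest; which dye each position within a cycle corresponds to is not stated here, so channels are reported only as numbers 1 to 4. Both function names are hypothetical.

```python
def count_cycles(n_images, channels_per_cycle=4):
    """Return the number of complete cycles, or raise if the count is inconsistent."""
    if n_images == 0 or n_images % channels_per_cycle:
        raise ValueError(
            f"{n_images} images is not a positive multiple of {channels_per_cycle}"
        )
    return n_images // channels_per_cycle

def cycle_and_channel(image_index, channels_per_cycle=4):
    """Map a 1-based image index to a 1-based (cycle, channel) pair."""
    cycle, channel = divmod(image_index - 1, channels_per_cycle)
    return cycle + 1, channel + 1

print(count_cycles(20))        # 20 images form 5 cycles
print(cycle_and_channel(5))    # image 00005 is the first channel of cycle 2
```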

Output:

  • Binary_Mask: Stores the binary mask of auto-detected cluster spots using the first cycle of the sequencing run (images 00001-00004.tif), and the final mask used for intensity extraction after all filtering processes (Mask_afterChastity.tif).
  • Cluster_intensities.csv: Contains the spot ID, the centroid of each spot, the Gaussian integrated intensity values of each spot across all channels and cycles of sequencing, and the auto-detected cluster medians chosen as a dictionary for A, C, G, T, and global background intensity.
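As background on the Chastity filter mentioned above, the sketch below uses the definition common in Illumina-style pipelines: the brightest channel intensity divided by the sum of the brightest and second-brightest. Whether DoG_v1.py uses exactly this definition is an assumption, and the 0.6 cut-off and the function name `chastity` are illustrative only.

```python
import numpy as np

def chastity(intensities):
    """Per-spot chastity for an (n_spots, 4) array of channel intensities."""
    ordered = np.sort(np.asarray(intensities, dtype=float), axis=1)
    brightest, second = ordered[:, -1], ordered[:, -2]
    denom = brightest + second
    # Spots with no positive signal get a chastity of 0 so that they fail the filter.
    return np.divide(brightest, denom, out=np.zeros_like(denom), where=denom > 0)

spots = np.array([
    [900.0, 50.0, 30.0, 20.0],   # one dominant channel: a clean spot
    [400.0, 380.0, 10.0, 5.0],   # two similar channels: likely mixed or overlapping
])
scores = chastity(spots)
keep = scores >= 0.6             # illustrative threshold
print(scores.round(3), keep)
```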

Color Transformation:

color_transform.py

Functionality: color_transform.py processes the intensity values extracted from each cluster spot. Using the selected dictionaries, it solves the transformation matrix and converts the intensity data into color purity levels for each fluorescence channel. In essence, it translates intensity values affected by optical cross-talk into the respective color purities of the A, C, G, and T bases.

Execution Instructions:

  1. Ensure the file Cluster_intensities.csv is present and accessible.
  2. Execute the script in the terminal, specifying the path to your Cluster_intensities.csv:
& C:/Users/...your directory/python.exe "...your directory/color_transform.py" "/your_path_to/Cluster_intensities.csv"

Output: color_transformed_spots.csv: This CSV file contains the proportion of each color (A, C, G, T) for each spot across different sequencing cycles. It is a crucial file for subsequent steps in dephasing correction and base calling.

Dephasing Correction and Basecalling:

default_analysis.sh

Functionality: default_analysis.sh is a Bash shell script that executes a series of commands to invoke various Python files. These files perform crucial tasks such as dephasing correction (addressing incomplete extension, carry-forward, and signal decay), base calling, mapping sequences to ground truth, evaluating mapping quality scores, and generating an HTML-based run report for end-users to visualize sequencing results.

Execution Instructions:

  1. Ensure the file color_transformed_spots.csv is present and accessible.
  2. For first-time users, open default_analysis.sh, review, and modify the directory paths as needed.
  3. Execute the script in the terminal, making sure that color_transformed_spots.csv is in the same directory as default_analysis.sh:
bash ...your directory/default_analysis.sh your_output_name fast

or

bash ...your directory/default_analysis.sh your_output_name

Please note that the “fast” version uses fixed, predetermined parameters for incomplete extension, carry-forward, and signal decay, whereas the standard version (without “fast”) performs a grid search and error function fitting for each individual cluster.
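The difference between fixed parameters and a per-cluster grid search can be illustrated with a toy model. The sketch below is an assumption throughout and does not reproduce the scripts called by default_analysis.sh: it uses a simple lag/lead phasing model without signal decay, a sum-of-squares error function, and hypothetical function names and grid values.

```python
import itertools
import numpy as np

def phasing_matrix(n_cycles, lag, lead):
    """Row c: fraction of strands at each template position during cycle c."""
    dist = np.zeros(n_cycles + 1)
    dist[0] = 1.0
    rows = np.zeros((n_cycles, n_cycles))
    for c in range(n_cycles):
        new = lag * dist                            # no extension this cycle
        new[1:] += (1.0 - lag - lead) * dist[:-1]   # normal one-base extension
        new[2:] += lead * dist[:-2]                 # two-base extension
        dist = new
        rows[c] = dist[1:]
    return rows

def fit_error(observed, lag, lead):
    """Squared residual between the data and the model re-simulated from the calls."""
    P = phasing_matrix(observed.shape[0], lag, lead)
    corrected, *_ = np.linalg.lstsq(P, observed, rcond=None)
    calls = np.eye(observed.shape[1])[corrected.argmax(axis=1)]  # one-hot base calls
    return float(np.sum((P @ calls - observed) ** 2))

def grid_search(observed, lags, leads):
    """Return the (lag, lead) pair on the grid with the smallest fit error."""
    return min(itertools.product(lags, leads), key=lambda p: fit_error(observed, *p))

# One simulated cluster, dephased with lag = 0.06 and lead = 0.02.
truth = np.eye(4)[[0, 2, 1, 3, 3, 0, 1, 2, 2, 1]]
observed = phasing_matrix(10, 0.06, 0.02) @ truth
best = grid_search(observed, lags=[0.0, 0.03, 0.06, 0.09], leads=[0.0, 0.02, 0.04])
print(best)
```

A “fast” mode in this picture would skip `grid_search` and call the correction once with preset values, trading per-cluster accuracy for speed.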

Output: Some of the key files include:

  • your_output_name.html
  • your_output_name.fastq
  • your_output_name.txt.filtered

Please refer to our blog post for more details of how to run the entire data analysis pipeline for an exemplary data set.