Transcript Identification with IsoSeq3
IsoSeq is a composable workflow of existing tools and algorithms, combined with a new clustering technique, which makes it possible to process the ever-increasing yield of PacBio machines. Starting from subreads or CCS reads, this tool identifies transcripts in PacBio single-molecule sequencing data. The IsoSeq pipeline is made up of five steps:
Circular consensus sequence calling: Each sequencing run is processed by the ccs software to generate one representative circular consensus sequence (CCS) for each ZMW (Zero-mode Waveguide).
Primer removal and demultiplexing: Removal of primers and identification of barcodes is performed using lima.
Refine: Poly(A) tails are trimmed, and concatemers are identified and removed.
Clustering: Clustering using hierarchical n*log(n) alignment and iterative cluster merging.
Polishing (optional): Generate per base QVs for transcript consensus sequences and improve results.
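The five steps above can be pictured as a chain of command-line calls. The following is a minimal sketch, not the exact way this tool invokes the pipeline: the command names and options (ccs, lima --isoseq, isoseq3 refine --require-polya, isoseq3 cluster, isoseq3 polish) are taken from the public PacBio command-line tools and should be checked against the installed version. All file names, the primer_pair value and the function name build_isoseq_commands are hypothetical. The sketch only builds and prints the commands; it does not run them.

```python
import shlex
from typing import List


def build_isoseq_commands(
    subreads: str,
    primers: str,
    prefix: str,
    primer_pair: str = "primer_5p--primer_3p",  # hypothetical primer names
    polish: bool = False,
) -> List[List[str]]:
    """Return one command line per pipeline step, in execution order."""
    ccs_bam = f"{prefix}.ccs.bam"
    fl_bam = f"{prefix}.fl.bam"
    # Assumption: lima inserts the detected primer pair into the output name.
    demux_bam = f"{prefix}.fl.{primer_pair}.bam"
    flnc_bam = f"{prefix}.flnc.bam"
    clustered_bam = f"{prefix}.clustered.bam"

    commands = [
        # 1. Circular consensus sequence calling
        ["ccs", subreads, ccs_bam],
        # 2. Primer removal and demultiplexing
        ["lima", "--isoseq", ccs_bam, primers, fl_bam],
        # 3. Refine: poly(A) trimming and concatemer removal
        ["isoseq3", "refine", "--require-polya", demux_bam, primers, flnc_bam],
        # 4. Clustering
        ["isoseq3", "cluster", flnc_bam, clustered_bam],
    ]
    if polish:
        # 5. Optional polishing, which needs the original subreads
        polished_bam = f"{prefix}.polished.bam"
        commands.append(["isoseq3", "polish", clustered_bam, subreads, polished_bam])
    return commands


if __name__ == "__main__":
    for cmd in build_isoseq_commands(
        "movie.subreads.bam", "primers.fasta", "movie", polish=True
    ):
        print(shlex.join(cmd))
```

Keeping each step as a separate list entry mirrors the composable design described above: if only CCS reads are available, the first entry can simply be dropped.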
Please cite IsoSeq as:
IsoSeq v3. Scalable De Novo Isoform Discovery. Töpfer, A. and Tseng, E. 2020. https://github.com/PacificBiosciences/IsoSeq
Input PacBio Reads: Select the files containing PacBio sequencing reads. Both subreads and CCS are allowed.
Note: If only CCS reads are provided, the circular consensus sequence calling step will not be performed.
Primers/Barcodes File: Specify a FASTA file with primer or barcoded primer sequences.
Figure 1: Input Data Page
Circular Consensus Sequence Calling
Minimum Passes: Minimum number of full-length subreads required to generate CCS for a ZMW.
Minimum SNR: Minimum SNR of subreads to use for generating CCS.
Minimum Length: Minimum draft length.
Skip Polishing: Only output the initial draft template (faster, less accurate). This option affects CCS generation only and is unrelated to the optional polishing step at the end of the pipeline.
Minimum Predicted Accuracy: Establish the minimum predicted accuracy (0 - 1).
Primer Removal and Demultiplexing
Minimum Score: Reads below the minimum barcode score are removed from downstream analysis.
Minimum End Score: Minimum end barcode score threshold is applied to the individual leading and trailing ends.
Minimum Signal Increase: The minimal score difference, between first and combined, required to call a barcode pair different.
Minimum Score Lead: The minimal score lead required to call a barcode pair significant.
Peek Guess: Try to infer the used barcodes subset, by peeking at the first 50000 ZMWs, whitelisting barcode pairs with more than 10 counts and mean score >= 45. Check this option to remove spurious false-positive signals.
Figure 2: Configuration 1 Page
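The Peek Guess rule described above (look at the first 50000 ZMWs, then keep barcode pairs with more than 10 counts and a mean score of at least 45) can be illustrated with a short sketch. This is a simplified reading of the rule and not lima's actual implementation; the function name peek_guess_whitelist and the input layout (one barcode pair and score per ZMW) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

BarcodePair = Tuple[str, str]


def peek_guess_whitelist(
    calls: Iterable[Tuple[BarcodePair, float]],
    peek: int = 50000,
    min_count: int = 10,
    min_mean_score: float = 45.0,
) -> Set[BarcodePair]:
    """Infer the barcode pairs in use from the first `peek` ZMWs.

    `calls` yields one (barcode pair, barcode score) tuple per ZMW.
    """
    counts: Dict[BarcodePair, int] = defaultdict(int)
    score_sums: Dict[BarcodePair, float] = defaultdict(float)

    for index, (pair, score) in enumerate(calls):
        if index >= peek:
            break  # only the first `peek` ZMWs are inspected
        counts[pair] += 1
        score_sums[pair] += score

    # Keep pairs seen more than `min_count` times with a high enough mean score.
    return {
        pair
        for pair, n in counts.items()
        if n > min_count and score_sums[pair] / n >= min_mean_score
    }
```

Pairs that appear only a handful of times, or only with low scores, are treated as spurious and excluded from the whitelist, which is why this option helps remove false-positive signals.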
Remove Poly(A) Tails: Check this option if your sample has poly(A) tails. This filters for FL reads that have a poly(A) tail with at least the number of base pairs set in the following parameter. It removes identified tails.
Minimum Poly(A) Tail Length: Establish the minimum poly(A) tail length.
Filter by RQ: Discard reads whose RQ value does not exceed the minimum set in the following parameter.
Minimum CCS RQ: Establish the minimum predicted accuracy (0 - 1).
POA Coverage: Maximum number of CCS reads used for Partial Order Alignment (POA) consensus.
Use CCS QVs: Use the quality values (QVs) of the CCS reads. If this option is checked, the POA Coverage is set to 100.
Perform Polishing: In this optional step, results can be improved by generating per base QVs for transcript consensus sequences. Note that this step is very time consuming and can only be applied if all input libraries are subreads. In most cases, the extra quality and QVs are not needed.
RQ Cutoff: RQ cutoff for fastx output.
Coverage: Maximum number of subreads used for polishing.
Figure 3: Configuration 2 Page
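To make the Remove Poly(A) Tails and Minimum Poly(A) Tail Length options more concrete, here is a minimal sketch of the idea: a read is kept only if it ends in a run of A bases of at least the minimum length, and that run is then removed. It is a deliberate simplification that matches exact A runs only; the refine step itself may tolerate sequencing errors within the tail. The function name trim_polya and the default of 20 bases are illustrative assumptions.

```python
from typing import Optional


def trim_polya(seq: str, min_polya_length: int = 20) -> Optional[str]:
    """Trim a trailing poly(A) tail.

    Returns the sequence without its tail, or None when the tail is
    shorter than `min_polya_length` (the read is then filtered out).
    """
    trimmed = seq.rstrip("Aa")  # drop the trailing run of A bases
    tail_length = len(seq) - len(trimmed)
    if tail_length < min_polya_length:
        return None
    return trimmed


if __name__ == "__main__":
    reads = {
        "read_with_tail": "ACGTACGT" + "A" * 25,  # made-up example reads
        "read_without_tail": "ACGTACGTAAA",
    }
    for name, seq in reads.items():
        print(name, trim_polya(seq))
```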
Additional Files: Output consensus sequences are returned in FASTA format. However, you can select additional formats for the output sequences. If the Use CCS QVs option is checked, two files of each type are returned: one containing sequences with predicted accuracy >= 0.99 (hq), and the other containing the remaining sequences (lq). Please select the desired output formats:
BAM: Sequences are returned in BAM format, along with their PacBio BAM index file (bam.pbi).
FASTQ: Sequences are returned in FASTQ format. This output is only available if the polishing step is performed.
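The hq/lq split mentioned above is a simple threshold on predicted accuracy. A minimal sketch, assuming each transcript comes with a predicted accuracy between 0 and 1 (the function name split_by_accuracy and the dictionary input are hypothetical):

```python
from typing import Dict, List, Tuple


def split_by_accuracy(
    accuracies: Dict[str, float], threshold: float = 0.99
) -> Tuple[List[str], List[str]]:
    """Split transcript IDs into high-quality (hq) and low-quality (lq) lists."""
    hq = [tid for tid, acc in accuracies.items() if acc >= threshold]
    lq = [tid for tid, acc in accuracies.items() if acc < threshold]
    return hq, lq
```

Every transcript lands in exactly one of the two lists, so the hq and lq files together contain the full set of consensus sequences.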
Consensus Transcripts: Select a destination folder to save output files. The following files are saved in this directory:
FASTA file(s) containing consensus transcripts.
Additional files, if they are selected (e.g. BAM file(s)).
TSV report containing the read ID and the read type that contributed to each consensus transcript.
Figure 4: Output Data Page
The main output is the clustered/polished .fasta file. It contains the transcripts identified from the input data. Additional BAM and FASTQ files contain the same information in a different format.
The report.csv file contains information about how many PacBio reads have contributed to the reconstruction of each transcript.
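The number of reads supporting each transcript can be recovered from this report with a few lines of code. The sketch below assumes that the file has a header row with a cluster_id column naming the transcript and one row per contributing read; the column name, the delimiter and the sample rows are assumptions and should be checked against the actual file.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def reads_per_transcript(handle: TextIO, delimiter: str = ",") -> Counter:
    """Count how many reads contributed to each transcript.

    Assumes one row per read and a `cluster_id` column (an assumption).
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return Counter(row["cluster_id"] for row in reader)


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = io.StringIO(
        "cluster_id,read_id,read_type\n"
        "transcript/0,read_1,FL\n"
        "transcript/0,read_2,FL\n"
        "transcript/1,read_3,FL\n"
    )
    for transcript, n_reads in reads_per_transcript(sample).items():
        print(transcript, n_reads)
```

If the report is tab-separated, pass delimiter="\t" instead.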
In addition, a summary report and several charts are generated with complementary information. The report shows a summary of the IsoSeq results (Figure 5).
Figure 5: Summary Report
A report for each input sample can also be opened (Figure 6). These reports contain additional details about the processing of each sample.
Figure 6: Per Sample Report
Three types of charts are generated:
Length Distribution: Shows the distribution of the lengths of the resulting transcripts (Figure 7).
Figure 7: Length Distribution Chart
Coverage Distribution: Shows the distribution of the coverage, that is, the number of reads supporting each transcript (Figure 8).
Figure 8: Coverage Distribution Chart
Subreads Distribution: Shows the distribution of subreads supporting the resulting transcripts (Figure 9). This chart is only generated if the polishing step is applied.
Figure 9: Subreads Distribution