Long reads analysis - IsoSeq
Introduction
IsoSeq v3 contains the newest tools to identify transcripts in PacBio single-molecule sequencing data. A composable workflow of existing tools and algorithms, combined with a new clustering technique, allows processing the ever-increasing yield of PacBio machines with similar performance to IsoSeq versions 1 and 2.
Dataset Description
Long reads of the full-length transcriptome of the COLO829T melanoma cell line.
Organism: Homo sapiens
Instrument: PacBio
Layout: PacBio Single Molecule, Real-Time (SMRT) Sequencing
Publication
Tseng, E., Galvin, B., Hon, T., Kloosterman, W. P., & Ashby, M. (2019). Full length transcriptome sequencing of melanoma cell line complements long read sequencing assessment of genomic rearrangements.
Original Data
Data can be downloaded from:
https://downloads-ap.pacbcloud.com/public/dataset/Melanoma2019_IsoSeq/subreads/COLO829T/
Bioinformatic Analysis
Application
Input
Be aware that Circular Consensus Sequence (CCS) calling is a very time- and compute-intensive step (12 h or more). When exploring this dataset, it is recommended to start directly from the CCS output.
PacBio Subreads (BAM format):
Primers file (FASTA format):
Parameters
Circular Consensus Sequence Calling
Minimum Passes: 3
Minimum SNR: 2.5
Minimum Length: 10
Skip Polishing: False
Minimum Predicted Accuracy: 0.9
Primer Removal and Demultiplexing
Minimum Score: 0
Minimum End Score: 0
Minimum Signal Increase: 10
Minimum Score Lead: 10
Peek Guess: True
Refine
Remove Poly(A) Tails: True
Minimum Poly(A) Tail Length: 20
Filter by RQ: False
Clustering
POA Coverage: 100
Use CCS QVs: True
Polishing
Perform polishing: True
RQ Cutoff: 0.99
Coverage: 60
Final output
Additional files: BAM, FASTQ
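For readers who prefer the command line, the parameters above can be sketched as calls to the PacBio tools ccs, lima and isoseq3. This is only a rough mapping under stated assumptions: the pairing of each setting with a flag is our reading of the tools' help text and may differ between tool versions (newer isoseq3 releases, for example, no longer ship a polish step), and all file names, including the lima output name, are hypothetical. The sketch only builds and prints the commands; it does not run them.

```python
import shlex
from typing import Dict, List


def build_isoseq3_commands(subreads: str, primers: str) -> Dict[str, List[str]]:
    """Return one argument list per workflow step; nothing is executed.

    Flag names are assumptions based on the ccs, lima and isoseq3 help
    text and should be checked against the installed versions.
    Intermediate file names are hypothetical.
    """
    # lima names its output after the primer pair found in the FASTA file,
    # so this name is a placeholder.
    demux = "demux.primer_5p--primer_3p.bam"
    return {
        # Minimum Passes 3, Minimum SNR 2.5, Minimum Length 10,
        # Minimum Predicted Accuracy 0.9, polishing not skipped.
        "ccs": ["ccs", subreads, "ccs.bam",
                "--min-passes", "3", "--min-snr", "2.5",
                "--min-length", "10", "--min-rq", "0.9"],
        # Primer removal and demultiplexing with Peek Guess enabled.
        "lima": ["lima", "ccs.bam", primers, "demux.bam",
                 "--isoseq", "--peek-guess",
                 "--min-score", "0", "--min-end-score", "0",
                 "--min-signal-increase", "10", "--min-score-lead", "10"],
        # Remove poly(A) tails of at least 20 bases; no RQ filter.
        "refine": ["isoseq3", "refine", demux, primers, "flnc.bam",
                   "--require-polya", "--min-polya-length", "20"],
        # POA Coverage 100, Use CCS QVs enabled.
        "cluster": ["isoseq3", "cluster", "flnc.bam", "unpolished.bam",
                    "--poa-cov", "100", "--use-qvs"],
        # RQ Cutoff 0.99, Coverage 60.
        "polish": ["isoseq3", "polish", "unpolished.bam", subreads,
                   "polished.bam",
                   "--rq-cutoff", "0.99", "--coverage", "60"],
    }


if __name__ == "__main__":
    # Hypothetical input names; replace with real paths.
    commands = build_isoseq3_commands("subreads.bam", "primers.fasta")
    for step, argv in commands.items():
        print(f"# {step}")
        print(shlex.join(argv))
```

Keeping each step as an argument list makes it straightforward to hand the same lists to subprocess.run once the flags have been verified locally.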
Execution Time
12 h per subreads BAM file.
Output
BAM: Sequences are returned in BAM format, along with their PacBio BAM index file (bam.pbi):
polished.bam
polished.bam.pbi
FASTQ: Sequences are returned in FASTQ format. This output is only available if the polishing step is performed:
polished.hq.fastq.gz
polished.lq.fastq.gz
FASTA file(s) containing consensus transcripts:
polished.hq.fasta.gz
polished.lq.fasta.gz
CSV report containing the read ID and the read type that contributed to each consensus transcript (report.csv)
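As an illustration of how the report can be used, the following minimal sketch counts how many reads contributed to each consensus transcript. It assumes the report is comma-separated with a header row naming the columns cluster_id, read_id and read_type; those column names are an assumption and should be checked against the actual file. The example rows are made up, and the delimiter parameter can be changed if the file turns out to be tab-separated.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO


def reads_per_transcript(handle: TextIO,
                         id_column: str = "cluster_id",
                         delimiter: str = ",") -> Dict[str, int]:
    """Count the reads listed for each consensus transcript.

    The default column name and delimiter are assumptions about the
    report layout, not values confirmed by this document.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    counts = Counter(row[id_column] for row in reader)
    return dict(counts)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    example = io.StringIO(
        "cluster_id,read_id,read_type\n"
        "transcript/0,movie/1/ccs,FL\n"
        "transcript/0,movie/2/ccs,FL\n"
        "transcript/1,movie/3/ccs,FL\n"
    )
    for transcript, n_reads in reads_per_transcript(example).items():
        print(transcript, n_reads)
```

To run it on a real file, open report.csv with open("report.csv", newline="") and pass the handle to reads_per_transcript.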