Long reads analysis - IsoSeq

Introduction

IsoSeq v3 contains the newest tools to identify transcripts in PacBio single-molecule sequencing data. Its composable workflow of existing tools and algorithms, combined with a new clustering technique, makes it possible to process the ever-increasing yield of PacBio instruments with performance similar to IsoSeq versions 1 and 2.

Dataset Description

Long reads of the full-length transcriptome of the COLO829T melanoma cell line.

  • Organism: Homo sapiens

  • Instrument: PacBio

  • Layout: PacBio Single Molecule, Real-Time (SMRT) Sequencing

Publication

Tseng, E., Galvin, B., Hon, T., Kloosterman, W. P., & Ashby, M. (2019). Full length transcriptome sequencing of melanoma cell line complements long read sequencing assessment of genomic rearrangements.

Abstract

Transcriptome sequencing has proven to be an important tool for understanding the biological changes in cancer genomes including the consequences of structural rearrangements. Short-read sequencing has been the method of choice, as the high throughput at low cost allows for transcript quantitation and the detection of even rare transcripts. However, the reads are generally too short to reconstruct complete isoforms. Conversely, long-read sequencing can provide unambiguous full-length isoforms, but lower throughput has complicated quantitation and high RNA input requirements have made working with cancer samples challenging.

Recently, the COLO 829 cell line was sequenced to 50-fold coverage with PacBio Single Molecule, Real-Time (SMRT) Sequencing. To validate and extend the findings from this effort, we have generated long-read transcriptome data using an updated PacBio Iso-Seq method, the results of which will be shared at the AACR 2019 General Meeting. With this complementary transcriptome data, we demonstrate how recent innovations in the PacBio Iso-Seq method sample preparation and sequencing chemistry have made long-read sequencing of cancer transcriptomes more practical. In particular, library preparation has been simplified and throughput has increased. The improved protocol has reduced sample prep time from several days to one day while reducing the sample input requirements ten-fold. In addition, the incorporation of unique molecular identifier (UMI) tags into the workflow has improved the bioinformatics analysis. Yield has also increased, with 3.0 sequencing chemistry typically delivering >30 Gb per SMRT Cell 1M. By integrating long and short read data, we demonstrate that the Iso-Seq method is a practical tool for annotating cancer genomes with high-quality transcript information.

Original Data

Data can be downloaded from:

https://downloads-ap.pacbcloud.com/public/dataset/Melanoma2019_IsoSeq/subreads/COLO829T/

Bioinformatic Analysis

Application

IsoSeq

Input

Circular Consensus Sequence (CCS) calling is a very time- and computation-intensive step (12 h or more). When exploring this dataset, it is recommended to start directly from the CCS output.

Parameters

Circular Consensus Sequence Calling

  • Minimum Passes: 3

  • Minimum SNR: 2.5

  • Minimum Length: 10

  • Skip Polishing: False

  • Minimum Predicted Accuracy: 0.9
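
As a rough illustration, the CCS parameters above could be passed to the standalone ccs command-line tool along the following lines. This is a minimal sketch and not the command the platform actually runs: the flag spellings are those of recent ccs releases (older releases used different names, so check them against ccs --help), and the file names are hypothetical placeholders.

```python
import shlex


def build_ccs_command(subreads_bam, ccs_bam, min_passes=3, min_snr=2.5,
                      min_length=10, min_rq=0.9, skip_polish=False):
    """Build the ccs argument list for the parameters listed above.

    Flag names are an assumption based on recent ccs releases.
    "Minimum Predicted Accuracy" is mapped to --min-rq here.
    """
    cmd = [
        "ccs", subreads_bam, ccs_bam,
        "--min-passes", str(min_passes),
        "--min-snr", str(min_snr),
        "--min-length", str(min_length),
        "--min-rq", str(min_rq),
    ]
    if skip_polish:
        # "Skip Polishing: False" means this flag is left out by default.
        cmd.append("--skip-polish")
    return cmd


# Hypothetical file names; nothing is executed, the command is only printed.
print(shlex.join(build_ccs_command("movie.subreads.bam", "movie.ccs.bam")))
```

Building the argument list without running it makes it easy to review the command before committing to a step that takes 12 h or more.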

Primer Removal and Demultiplexing

  • Minimum Score: 0

  • Minimum End Score: 0

  • Minimum Signal Increase: 10

  • Minimum Score Lead: 10

  • Peek Guess: True

Refine

  • Remove Poly(A) Tails: True

  • Minimum Poly(A) Tail Length: 20

  • Filter by RQ: False

Clustering

  • POA Coverage: 100

  • Use CCS QVs: True

Polishing

  • Perform polishing: True

  • RQ Cutoff: 0.99

  • Coverage: 60

Final output

  • Additional files: BAM, FASTQ
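
The steps after CCS calling (primer removal, refine, clustering, polishing) can be sketched in the same way. The block below only assembles argument lists for lima and isoseq3 and is an illustration under several assumptions: the flag spellings follow lima and the isoseq3 releases that still ship a separate polish step, the --isoseq flag comes from the public IsoSeq workflow rather than from this page, "Remove Poly(A) Tails: True" is mapped to --require-polya, and all file names are hypothetical. lima appends the detected primer pair to its output name, so the BAM passed to refine is supplied by the caller.

```python
import shlex


def build_isoseq_commands(ccs_bam, primers_fasta, fl_bam, subreads_bam):
    """Return one argument list per step, in execution order.

    fl_bam is the per-primer-pair BAM that lima writes; its exact name
    depends on the primer names, so it is passed in, not guessed.
    """
    flnc_bam = "flnc.bam"              # hypothetical intermediate name
    unpolished_bam = "unpolished.bam"  # hypothetical intermediate name
    polished_bam = "polished.bam"      # matches the output listed on this page
    return {
        "lima": ["lima", ccs_bam, primers_fasta, "demux.bam",
                 "--isoseq", "--peek-guess",
                 "--min-score", "0", "--min-end-score", "0",
                 "--min-signal-increase", "10", "--min-score-lead", "10"],
        # "Filter by RQ: False": no RQ threshold is passed to refine.
        "refine": ["isoseq3", "refine", fl_bam, primers_fasta, flnc_bam,
                   "--require-polya", "--min-polya-length", "20"],
        "cluster": ["isoseq3", "cluster", flnc_bam, unpolished_bam,
                    "--poa-cov", "100", "--use-qvs"],
        "polish": ["isoseq3", "polish", unpolished_bam, subreads_bam,
                   polished_bam, "--rq-cutoff", "0.99", "--coverage", "60"],
    }


# Hypothetical input names; the commands are printed, not executed.
commands = build_isoseq_commands("movie.ccs.bam", "primers.fasta",
                                 "demux.primer_5p--primer_3p.bam",
                                 "movie.subreads.bam")
for step, cmd in commands.items():
    print(f"{step}: {shlex.join(cmd)}")
```

Each list could be handed to subprocess.run once the flags have been checked against the --help output of the installed versions.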

Execution Time

12 h per subreads BAM file.

Output

  • BAM: Sequences are returned in BAM format, along with their PacBio BAM index file (bam.pbi):

    • polished.bam

    • polished.bam.pbi

  • FASTQ: Sequences are returned in FASTQ format. This output is only available if the polishing step is performed:

    • polished.hq.fastq.gz

    • polished.lq.fastq.gz

  • FASTA file(s) containing consensus transcripts:

    • polished.hq.fasta.gz

    • polished.lq.fasta.gz

  • CSV report listing, for each consensus transcript, the ID and type of every read that contributed to it (report.csv)
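
A quick sanity check of the FASTQ output is to count transcripts and bases. The sketch below uses only the Python standard library and assumes plain four-line FASTQ records (sequence and quality not wrapped across lines); the function name summarize_fastq is introduced here for illustration only.

```python
import gzip


def summarize_fastq(path):
    """Count records and bases in a gzip-compressed, four-line FASTQ file."""
    n_reads = 0
    n_bases = 0
    longest = 0
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between records
            if not header.startswith("@"):
                raise ValueError(f"unexpected FASTQ header: {header!r}")
            sequence = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality string
            n_reads += 1
            n_bases += len(sequence)
            longest = max(longest, len(sequence))
    return {"reads": n_reads, "bases": n_bases, "longest": longest}
```

Calling summarize_fastq("polished.hq.fastq.gz") and summarize_fastq("polished.lq.fastq.gz") gives a first impression of how the consensus transcripts split between the two files.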