Repeat Masking

Introduction

Repetitive DNA sequences are abundant in a broad range of species. The term repeat is used to describe two different types of sequences: low-complexity sequences, such as homopolymeric runs of nucleotides, and transposable elements, such as viruses, long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs). Eukaryotic genomes can be very repeat-rich: for example, 47% of the human genome is thought to consist of repeats. Adequate repeat annotation should be part of every genome annotation project.

Repeat identification and masking usually precede the gene prediction and annotation phase. The term 'masking' means transforming every nucleotide identified as a repeat to an 'N', an 'X', or a lowercase a, t, g or c (the latter is known as soft masking). The masking step signals to downstream sequence alignment and gene prediction tools that these regions are repeats. Identifying repeats is complicated by the fact that repeats are often poorly conserved; thus, accurate repeat detection usually requires a repeat library for the species of interest. In addition, the borders of repeats are usually ill-defined: repeats often insert within other repeats, and frequently only fragments (or fragments within fragments) are present, so complete elements are found quite rarely.
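The difference between hard masking (N or X) and soft masking (lowercase) can be illustrated with a minimal Python sketch. This is not how RepeatMasker is implemented; the function name `mask_sequence` and the 0-based, half-open repeat coordinates are assumptions made for the example only.

```python
def mask_sequence(seq, repeats, mode="soft"):
    """Mask repeat intervals in a DNA sequence.

    repeats: list of (start, end) tuples, 0-based and half-open (assumed
             convention for this sketch).
    mode:    "soft" lowercases repeat bases; "N" or "X" replaces them.
    """
    if mode not in ("soft", "N", "X"):
        raise ValueError("mode must be 'soft', 'N' or 'X'")
    # Start from an all-uppercase sequence so lowercase only marks repeats.
    chars = list(seq.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if mode == "soft" else mode
    return "".join(chars)


seq = "ACGTTTTTTTTACGGA"
repeats = [(3, 11)]  # hypothetical repeat call covering the run of Ts
print(mask_sequence(seq, repeats, "soft"))  # soft masking
print(mask_sequence(seq, repeats, "N"))     # hard masking with N
```

Soft masking keeps the original bases recoverable, which is why some downstream tools prefer it, while hard masking discards them.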

Users should carefully review the output of this step, since failure to mask genome sequences can be catastrophic. Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotation. Worse still, many transposon open reading frames (ORFs) look like true host genes to gene predictors, causing portions of transposon ORFs to be added as additional exons to gene predictions, completely corrupting the final gene annotations. Good repeat masking is thus crucial for the accurate annotation of protein-coding genes.

This application is based on RepeatMasker, a program that screens DNA sequences and detects transposable elements, satellites, and low-complexity DNA sequences. The output of the program is a detailed annotation of the repeats present in the query sequence, as well as a modified version of the query sequence in which all the annotated repeats have been masked. RepeatMasker uses a sequence search engine to perform its search for repeats. In OmicsBox, RMBlast and HMMER are supported. RepeatMasker also uses Tandem Repeats Finder (TRF) to detect tandem repeats.

RepeatMasker comes with the Dfam database. Dfam is an open collection of DNA transposable element (TE) sequence alignments, hidden Markov models (HMMs), consensus sequences, and genome annotations. It is organized as a collection of multiple sequence alignments, each containing a set of representative members of a specific transposable element family. These alignments (seed alignments) are used to generate HMMs and consensus sequences for each family. The Dfam website gives information about each family and provides genome annotations for a collection of core genomes. The release described here (Dfam 3.0) contains 6,235 TE families covering five core organisms (human, mouse, zebrafish, fruit fly, and nematode) and a growing number of additional species.

To supplement this database, OmicsBox accepts custom libraries as well as the RepeatMasker edition of RepBase. RepBase is a database of representative repetitive sequences from eukaryotic species. Users can download the RepeatMasker library file from the Genetic Information Research Institute (GIRI) website after requesting an account.

Please cite RepeatMasker as:

Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-4.0. 2013-2015 <http://www.repeatmasker.org>.

Run Repeat Masking

This functionality can be found under Genome Analysis → Repeat Masking. The wizard lets you select the input files and adjust the analysis parameters (Figures 1 to 4).

Input

  • Input FASTA: Select the file that contains the DNA sequences to be masked. Input sequences must be in FASTA format. 

Figure 1: Input Page

Basic Configuration

  • Search Configuration: Select the search engine to perform the search for repeats.
    • RMBlast: A RepeatMasker-compatible version of the NCBI BLAST tool suite.
    • HMMER: Uses the nhmmer program to search sequences against the Dfam database.
  • Repeat Database: RepeatMasker works with these databases:
    • Dfam: It is a database of transposable elements included in the application, so it is not necessary to provide any additional file. 
    • Custom: Allows providing a custom library of sequences to be masked in the query. The library file needs to contain repetitive elements in FASTA format. The recommended format for IDs in a custom library is ">repeatname#class/subclass". 
    • RepBase: We highly recommend obtaining the RepeatMasker edition of RepBase. Searches are optimized to use this database and can produce higher-quality annotations. To obtain the RepeatMasker edition of RepBase, go to the Genetic Information Research Institute (GIRI) website. This option expects an EMBL file as the database file.

This functionality is compatible with the RepBase RepeatMasker editions 20181026 and 20170127. Make sure you provide one of these versions.


  • Database File: If the selected repeat database requires it, select the file containing the database to search against.
    • Custom: The library file needs to contain sequences in FASTA format. The recommended format for IDs in a custom library is ">repeatname#class/subclass". 
    • RepBase: EMBL file downloaded from the Genetic Information Research Institute (GIRI) website.
  • Species: Specify the species or clade of the input sequence. The species name must be a valid NCBI Taxonomy Database species name and must be present in the RepeatMasker repeat database. Note that if HMMER is selected as the search engine, the Dfam database only contains information about human, mouse, zebrafish, fruit fly, and nematode.
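Before providing a custom library, it can help to check that its FASTA headers follow the recommended ">repeatname#class/subclass" format. The following is a minimal Python sketch of such a check; the function names and the decision to treat the subclass as optional are assumptions for illustration, not part of OmicsBox or RepeatMasker.

```python
import re

# >repeatname#class/subclass ; the "/subclass" part is treated as optional here.
HEADER_RE = re.compile(
    r"^>(?P<name>[^#\s]+)#(?P<cls>[^/\s]+)(?:/(?P<subclass>\S+))?"
)


def parse_repeat_header(line):
    """Return (name, class, subclass) or None if the header does not match."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return match.group("name"), match.group("cls"), match.group("subclass")


def find_bad_headers(fasta_lines):
    """Return the FASTA header lines that do not follow the recommended format."""
    return [
        line.rstrip("\n")
        for line in fasta_lines
        if line.startswith(">") and parse_repeat_header(line) is None
    ]


library = [
    ">AluY#SINE/Alu",          # hypothetical, well-formed entry
    "GGCCGGGCGCGGTGGCTCA",
    ">my_repeat no class",     # hypothetical, malformed entry
    "ACGTACGTACGT",
]
print(find_bad_headers(library))
```

Headers reported by `find_bad_headers` would still be read as FASTA, but their repeats would lack a class and subclass in the annotation.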

Figure 2: Basic Configuration Data Page

Advanced Configuration

  • RMBlast Options. Speed/Sensitivity: Select the sensitivity of the search. The more sensitive the search, the longer the processing time:
    • Rush: About 10% less sensitive and 4-10 times faster than the default option (quick searches are fine under most circumstances).  
    • Quick: 5-10% less sensitive, 2-5 times faster than default. 
    • Slow: 0-5% more sensitive, 2-3 times slower than default. 
  • RMBlast Options. Apply Divergence Cutoff: This option masks only those repeats that are less divergent from the consensus than a specific percentage. 
  • Masking Options: Select how sequences will be masked. Repetitive elements can be replaced by N, by X, or by lower case. Note that some downstream applications require a specific type of masking. 
  • Only Alu elements: Masks only Alus (and 7SLRNA, SVA and LTR5). This option only works for primate DNA. 
  • Type of repeat: Select the type of repeats that the algorithm will detect and mask: interspersed repeats only, simple repeats and low-complexity DNA only, or both. 
  • Not mask RNA genes: By default, RepeatMasker screens for matches to small pol III transcribed RNAs (mostly tRNAs and snRNAs) because of their close similarity to SINEs and the abundance of some of their pseudogenes. Check this option to leave small RNA gene sequences unmasked. 
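For readers who also use standalone RepeatMasker, the wizard options above correspond roughly to its command-line flags. The Python sketch below builds such an argument list. The flags themselves (-engine, -species, -lib, -qq, -q, -s, -div, -x, -xsmall, -alu, -int, -noint, -norna, -gff) are standard RepeatMasker options, but the mapping to the wizard fields and the function name are assumptions; how OmicsBox invokes RepeatMasker internally is not documented here.

```python
def build_repeatmasker_args(fasta, engine="rmblast", species=None, lib=None,
                            speed="default", divergence=None, mask="N",
                            alu_only=False, repeat_type="both", no_rna=False):
    """Return an argument list for a standalone RepeatMasker run (sketch)."""
    speed_flags = {"rush": "-qq", "quick": "-q", "default": None, "slow": "-s"}
    # Standalone RepeatMasker masks with N unless -x or -xsmall is given.
    mask_flags = {"N": None, "X": "-x", "lowercase": "-xsmall"}
    type_flags = {"both": None, "interspersed": "-int", "simple": "-noint"}

    args = ["RepeatMasker", "-engine", engine]
    # This sketch passes either a custom library or a species, not both.
    if lib is not None:
        args += ["-lib", lib]
    elif species is not None:
        args += ["-species", species]
    if speed_flags[speed]:
        args.append(speed_flags[speed])
    if divergence is not None:
        args += ["-div", str(divergence)]  # mask only repeats below this % divergence
    if mask_flags[mask]:
        args.append(mask_flags[mask])
    if alu_only:
        args.append("-alu")
    if type_flags[repeat_type]:
        args.append(type_flags[repeat_type])
    if no_rna:
        args.append("-norna")
    args += ["-gff", fasta]  # also write repeat coordinates as GFF
    return args


# Hypothetical input file name, for illustration only.
print(" ".join(build_repeatmasker_args("genome.fa", species="human",
                                       speed="quick", mask="lowercase",
                                       no_rna=True)))
```

The resulting list could be passed to `subprocess.run` on a machine where RepeatMasker is installed; the sketch only assembles the arguments and does not run anything.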

Figure 3: Advanced Configuration

Output

  • Output FASTA: Select a file where the masked sequences will be placed.  

Figure 4: Output page

Results

The Repeat Masking process returns the masked sequences in FASTA format and the location of the detected repeats in GFF format (Figure 5 and Figure 6). The repeat sequences found during the procedure are replaced by X, N or lowercase (according to the selected masking option), so the output FASTA contains the same sequences as the input FASTA, but with the nucleotides that belong to a repetitive element masked. The coordinates and strand, as well as the class and subclass, of each repetitive element are annotated in the output GFF.
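As a quick sanity check on a masked FASTA, the fraction of masked bases can be estimated directly from the sequence text. The Python sketch below counts lowercase, N and X characters; the function name is hypothetical, and the count is only an approximation because any N already present in the input assembly (for example, gaps) is counted as masked too.

```python
def masked_fraction(fasta_text):
    """Return the fraction of bases that are lowercase, 'N' or 'X'."""
    total = 0
    masked = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        masked += sum(1 for base in line if base.islower() or base in "NX")
    return masked / total if total else 0.0


# Hypothetical soft-masked record, for illustration only.
example = ">seq1\nACGTacgt\n"
print(masked_fraction(example))
```

For a real output file, the same function can be applied to the text read from the file selected as Output FASTA.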


Figure 5: Masked sequences

Figure 6: Output GFF with the repetitive elements coordinates.

In addition to the resulting FASTA and GFF files, a report and a chart are generated. The report shows a summary of the Repeat Masking results (Figure 7). This page contains information about the input sequence data and a results overview. The Results Overview table shows the number of elements, the length occupied, and the percentage of sequence covered by each repeat class and subclass.

The Repeat Distribution chart (Figure 8) shows the percentage of sequence covered by each repeat class. 
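The figures in the Results Overview table and the Repeat Distribution chart can be approximated from a list of repeat annotations. The Python sketch below assumes the annotations have already been read into (class, start, end) tuples with 1-based, inclusive coordinates, as in GFF; the function name and the record layout are assumptions, and overlapping annotations are counted twice in this simple version.

```python
from collections import defaultdict


def summarize_repeats(records, total_length):
    """Summarize repeat annotations per class.

    records:      iterable of (repeat_class, start, end), 1-based inclusive.
    total_length: total number of bases in the input sequences.
    Returns {class: {"count", "length", "percent"}}.
    """
    stats = defaultdict(lambda: {"count": 0, "length": 0})
    for repeat_class, start, end in records:
        stats[repeat_class]["count"] += 1
        stats[repeat_class]["length"] += end - start + 1
    return {
        cls: {
            "count": s["count"],
            "length": s["length"],
            "percent": 100.0 * s["length"] / total_length,
        }
        for cls, s in stats.items()
    }


# Hypothetical annotations on 10,000 bp of sequence, for illustration only.
records = [("SINE/Alu", 1, 300), ("SINE/Alu", 501, 800), ("LINE/L1", 1001, 1500)]
for cls, row in summarize_repeats(records, 10000).items():
    print(cls, row["count"], row["length"], row["percent"])
```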


Figure 7: Summary Report

Figure 8: Repeat Distribution Pie Chart