[maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)

Thu Feb 5 09:27:41 MST 2015

There is no requirement or even specification in MAKER that transcript data be sequenced or processed in any specific way.  How this is done is completely dependent on the user and the project (different strategies work better in different organisms).  The only requirement for MAKER is that you have some form of transcript evidence.  The utility of transcript evidence is primarily for identification of introns and splice sites.  If evidence is supplied as FASTA sequence, then it will be aligned around splice sites using exonerate (a splice-aware aligner).  If the evidence is processed via another tool, you are expected to supply it in GFF3 format with the evidence already aligned around splice sites (examples of this include Cufflinks and BLAT).  How you generate these FASTA files or GFF3 files is up to you.  Remember that final models are not strictly based on the transcript data (it’s just one piece of the puzzle).  Rather, final models will be generated from the combination of transcript alignments, protein alignments, and the HMMs from algorithms such as SNAP and Augustus, which must have been trained on the species being annotated.
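
As a rough illustration of the FASTA versus GFF3 distinction, here is a minimal Python sketch that sorts transcript evidence files under the maker_opts.ctl keys they would normally go under (est for FASTA that MAKER aligns itself, est_gff for pre-aligned GFF3).  The file names, the extension lists, and the helper functions are hypothetical, and it assumes multiple files are given as a comma-separated list, so check it against your own control file.

```python
from pathlib import Path

# Assumption: these extension sets are illustrative, not a MAKER rule.
FASTA_EXTS = {".fa", ".fasta", ".fna"}
GFF_EXTS = {".gff", ".gff3"}


def transcript_evidence_options(paths):
    """Group transcript evidence files by maker_opts.ctl key.

    FASTA files go under `est` (MAKER aligns them with exonerate);
    GFF3 files go under `est_gff` (already aligned by another tool).
    """
    options = {"est": [], "est_gff": []}
    for p in paths:
        ext = Path(p).suffix.lower()
        if ext in FASTA_EXTS:
            options["est"].append(p)
        elif ext in GFF_EXTS:
            options["est_gff"].append(p)
        else:
            raise ValueError(f"unrecognised evidence file type: {p}")
    return options


def render_ctl_lines(options):
    """Render each key as a `key=file1,file2` control-file line."""
    return [f"{key}={','.join(files)}" for key, files in options.items()]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    opts = transcript_evidence_options(
        ["trinity_assembly.fasta", "cufflinks_transcripts.gff3"]
    )
    print("\n".join(render_ctl_lines(opts)))
```

The point of the split is only that MAKER treats the two inputs differently: sequence gets aligned for you, while GFF3 is taken as already spliced alignments.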

As for the most effective methods for processing mRNA-seq data, that depends on many factors.  But in general, read mapping to the genome will result in greater sensitivity but lower specificity.  There will be a lot of false alignments and alignments of background transcription (>98% of the genome is expressed at detectable levels, so it is not just the genes).  Assembly-based methods, on the other hand, result in very high specificity but lower sensitivity.  Assembly-based methods can also resolve overlapping UTRs and genes overlapping on opposite strands better than the alignment-based methods (Trinity’s jaccard_clip option, for example).  A loss in sensitivity can also be compensated for with matched protein data and well-trained HMMs.

Which analysis method is more effective will be determined by factors such as gene density, average intron length, repeat content, etc.  You can try both methods and see which appears to work better for your organism.  Both have their trade-offs.
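
One way to put numbers on "which appears to work better" is to compare the introns each method supports against a set of introns you trust (for example, from manually curated genes).  The sketch below is only an illustration: the function name is hypothetical, the coordinates are made-up toy values, and specificity is used in the annotation sense of TP / (TP + FP).

```python
def intron_accuracy(predicted, trusted):
    """Return (sensitivity, specificity) for a set of predicted introns.

    Each intron is a (seqid, start, end, strand) tuple.
    Sensitivity = TP / number of trusted introns.
    Specificity = TP / number of predicted introns (annotation usage).
    """
    predicted, trusted = set(predicted), set(trusted)
    true_pos = len(predicted & trusted)
    sensitivity = true_pos / len(trusted) if trusted else 0.0
    specificity = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy coordinates, invented purely to show the calculation.
    trusted = [("chr1", 100, 200, "+"), ("chr1", 300, 400, "+"),
               ("chr1", 500, 600, "+"), ("chr1", 700, 800, "+")]
    # Mapping-style result: finds everything, plus spurious introns.
    mapped = trusted + [("chr1", 900, 950, "-"), ("chr1", 1000, 1100, "-"),
                        ("chr1", 1200, 1300, "+"), ("chr1", 1400, 1500, "+")]
    # Assembly-style result: misses one intron, but nothing spurious.
    assembled = trusted[:3]
    print("mapping :", intron_accuracy(mapped, trusted))
    print("assembly:", intron_accuracy(assembled, trusted))
```

With these toy inputs the mapping-style set scores high on sensitivity and low on specificity, and the assembly-style set the reverse, which is the trade-off described above.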

For sequencing, because of the physics involved in RNA folding, fragmentation of the transcriptome improves sensitivity but can also preclude the use of long paired-end reads (because the fragments end up being too short).  Short reads also reduce specificity and can align more randomly to the genome (but the gain in sensitivity from fragmentation usually far outweighs the loss of specificity from the short reads).  Larger fragments of RNA, or no fragmentation of the RNA, allow for longer reads and paired-end sequencing but will result in lower sensitivity and recovery of the transcriptome.  This is because the RNA folding that occurs on larger fragments physically blocks reverse transcriptase.  You will also end up recovering mostly the 5’ end of genes, because the inhibition of reverse transcriptase results in a 5’ bias to the sequencing reaction.  The read coverage graphs end up looking like ski slopes, and longer genes will not get coverage on the 3’ end.
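
If you want to check your own libraries for this ski-slope pattern, a crude measure is the ratio of mean read depth near the 5’ end of a transcript to mean depth near the 3’ end.  The following is a minimal sketch under stated assumptions: the function name and the 25% window are arbitrary choices of mine, and the depth values are made-up numbers, not real data.

```python
import math


def end_coverage_ratio(coverage, fraction=0.25):
    """Ratio of mean depth in the 5' window to mean depth in the 3' window.

    `coverage` is per-base read depth ordered 5' -> 3' along a transcript.
    A ratio near 1 suggests even coverage; a large ratio suggests 5' bias.
    Returns inf if only the 3' window is empty, nan if both are empty.
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    window = max(1, int(len(coverage) * fraction))
    if len(coverage) < 2 * window:
        raise ValueError("coverage is too short for the requested window")
    five_prime = sum(coverage[:window]) / window
    three_prime = sum(coverage[-window:]) / window
    if three_prime == 0:
        return math.inf if five_prime > 0 else math.nan
    return five_prime / three_prime


if __name__ == "__main__":
    # Made-up depth profiles, for illustration only.
    ski_slope = [40, 36, 30, 24, 16, 10, 4, 0]
    flat = [20] * 8
    print(end_coverage_ratio(ski_slope))
    print(end_coverage_ratio(flat))
```

In practice you would compute this over many transcripts and look at how the ratio changes with transcript length, since the effect described above is strongest on long genes.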

There are different strategies for transcriptome sequencing and sample preparation to deal with strandedness and sequencing bias.  Each has its trade-offs, and what strategy you use depends on what you plan on using the sequenced reads for, as well as the structure of that organism’s genome and transcriptome.

—Carson

> On Feb 5, 2015, at 7:37 AM, Van Hoeck Arne <avhoeck at SCKCEN.BE> wrote:
> 
> Dear,
>  
> I have read the manuscript on the MAKER-P tool. I’m finishing the last part of my plant genome assembly, and MAKER-P will be used to determine gene annotations. I have read the wiki tutorials and your manuscript, but one question comes to mind about evidence sources when RNAseq data is available for gene discovery (as we have):
>  
> Why do you perform a de novo RNA assembly with individual RNAseq runs? As far as I could understand from your paper (a tool kit for the rapid creation,…), the RNAseq reads were short single reads. This implies that it is not easy for de novo RNA assemblers (like Trinity) to find high-quality genes, since a lot of reads will be lost because of the low coverage. If 2*200 PE reads were used for RNAseq, this would create more genes.
>  
> Therefore, as an alternative: why not map the RNAseq reads to your reference genome, extract this as a FASTA file, and discard all the reads shorter than 150 bp? If you concatenate all the RNAseq data into one file before the mapping, all your expressed RNA is mapped and included in the FASTA file that you need for gene discovery.
>  
> In my opinion, the low-expressed reads will be lost during the assembly. Or am I overlooking something?
> We have sequenced our genome with PE Illumina data. For DE analyses, we did 48 single-read 50 bp runs at lower coverage.
> Therefore, is assembling the RNAseq data still better than mapping the data to our reference genome?
>  
> PS: can I add a question on the Google group? I couldn’t start a new topic.
>  
> Thanks in advance,
> Arne Van Hoeck