<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class="">I find that erring on the side of specificity works better for most annotation projects.  But this is not always true, and you can run a few large contigs through an alignment approach like cufflinks and an assembly approach like trinity, then compare the results to decide which appears to perform better.  You also need to take into account the ultimate goal of the project.  Some projects want to annotate absolutely everything and don’t care about false positives, while others want to maximize specificity and care more about avoiding bad models.  Often this has to do with some planned downstream experiment that would be adversely affected by one or the other.</div><div class=""><br class=""></div><div class="">I tend to prefer high specificity because MAKER’s automated approach to re-annotation means that if evidence ever presents itself later on that a real gene is missing, then that evidence automatically supports inclusion of the gene in the next automated release of the genome. False models, however, tend to persist and are harder to get rid of, even though they lack any supporting evidence. These false models produced by sensitivity-focused approaches then tend to poison downstream experiments and lead to more time being wasted by researchers.  This is seen a lot in plant genomes, where transposons and pseudogenes tend to pollute genome releases for historical reasons.  Basically, once they are in a genome release, the burden of proof for removing them becomes higher than if they had never been included in the first place.  Researchers unaware of this may find they have been studying a transposon for weeks or months because some expression or variant analysis early on listed it as a candidate gene for some desired phenotype. 
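<div class=""><br class=""></div><div class="">As a rough illustration of the trial run on a few large contigs mentioned above, here is a minimal Python sketch that picks the longest contigs out of an assembly FASTA so they can be run through both approaches. It is not part of MAKER; the function names (read_fasta, largest_contigs, write_fasta) and the demo records are hypothetical, and it assumes a plain, uncompressed FASTA read line by line.</div><pre class="">
```python
import heapq


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def largest_contigs(lines, n=3):
    """Return the n longest (header, sequence) records, longest first."""
    return heapq.nlargest(n, read_fasta(lines), key=lambda rec: len(rec[1]))


def write_fasta(records, width=60):
    """Format records as FASTA text with fixed-width sequence lines."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical toy assembly; a real run would pass an open file handle.
    demo = [">c1", "ACGT", ">c2", "ACGTACGTAC", "GT", ">c3", "ACGTAC"]
    print(write_fasta(largest_contigs(demo, n=2)), end="")
```
</pre><div class="">In practice you would pass an open file handle for the assembly instead of the demo list and write the result to a new FASTA file, which can then serve as the genome input for both trial runs.</div>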
</div><div class=""><br class=""></div><div class="">MAKER can handle several hundred thousand contigs in the assembly, but in general contigs smaller than 10 kb will not be annotatable (although smaller contigs can be used for gene-dense organisms with short introns).  It is better to exclude these short contigs from the analysis for processing efficiency. </div><div class=""><br class=""></div><div class="">—Carson</div><div class=""><br class=""></div><div class=""><br class=""></div><br class=""><div><blockquote type="cite" class=""><div class="">On Feb 5, 2015, at 9:52 AM, Van Hoeck Arne &lt;<a href="mailto:avhoeck@SCKCEN.BE" class="">avhoeck@SCKCEN.BE</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="WordSection1" style="page: WordSection1; font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><div style="margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class="">Thanks for this comprehensive and clear answer, Carson.<o:p class=""></o:p></span></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class=""> </span></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class="">So in conclusion, it’s better to make a concise file with very accurate transcripts (assembly method) instead of a large file of possible transcripts (mapping RNAseq data to the reference), which will contain more 
false positives.<o:p class=""></o:p></span></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class=""> </span></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class="">Another small question: can MAKER handle a lot of contigs (around 10,000), or is it better to make artificial chromosomes by pasting contigs together, separated by a certain number of N’s (let’s say 1000 > exon length)?<o:p class=""></o:p></span></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class=""> </span></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class="">Thanks a lot for your quick response.<o:p class=""></o:p></span></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class="">Arne<o:p class=""></o:p></span></div><div class=""><div style="border-style: solid none none; border-top-color: rgb(181, 196, 223); border-top-width: 1pt; padding: 3pt 0cm 0cm;" class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><b class=""><span style="font-size: 10pt; font-family: Tahoma, sans-serif;" class="">From:</span></b><span style="font-size: 10pt; font-family: Tahoma, sans-serif;" class=""><span class="Apple-converted-space"> </span>Carson Holt [<a href="mailto:carsonhh@gmail.com" style="color: purple; 
text-decoration: underline;" class="">mailto:carsonhh@gmail.com</a>]<span class="Apple-converted-space"> </span><br class=""><b class="">Sent:</b><span class="Apple-converted-space"> </span>Thursday, 5 February 2015 17:28<br class=""><b class="">To:</b><span class="Apple-converted-space"> </span>Van Hoeck Arne<br class=""><b class="">Cc:</b><span class="Apple-converted-space"> </span><a href="mailto:maker-devel@yandell-lab.org" style="color: purple; text-decoration: underline;" class="">maker-devel@yandell-lab.org</a><br class=""><b class="">Subject:</b><span class="Apple-converted-space"> </span>Re: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)<o:p class=""></o:p></span></div></div></div><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><o:p class=""> </o:p></div><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class="">There is no requirement or even specification in MAKER that transcript data be sequenced or processed in any specific way.  How this is done is completely dependent on the user and the project (different strategies work better in different organisms).  The only requirement for MAKER is that you have some form of transcript evidence.  The utility of transcript evidence is primarily for identification of introns and splice sites.  If evidence is supplied as FASTA sequence, then it will be aligned around splice sites using exonerate (a splice-aware aligner).  If the evidence is processed via another tool, you are expected to supply it in GFF3 format with the evidence already aligned around splice sites (examples of this include cufflinks and BLAT).  How you generate these FASTA files or GFF3 files is up to you.   Remember that final models are not strictly based on the transcript data (it’s just a piece of the puzzle). 
Rather, final models will be generated from the combination of transcript alignments, protein alignments, and the HMMs from algorithms such as SNAP and Augustus that must have been trained on the species being annotated.<o:p class=""></o:p></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><o:p class=""> </o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class="">With respect to the most effective methods for processing mRNA-seq data, that depends on many factors. But in general, read mapping to the genome will result in greater sensitivity but lower specificity.  There will be a lot of false alignments and alignments of background transcription (>98% of the genome is expressed at detectable levels, so it is not just the genes).  Assembly-based methods, on the other hand, result in very high specificity but lower sensitivity.  Assembly-based methods can also better resolve issues of overlapping UTRs and genes overlapping on opposite strands than the alignment-based methods (trinity’s jaccard_clip option for example).  A loss in sensitivity can also be compensated for with matched protein data and well-trained HMMs.<o:p class=""></o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><o:p class=""> </o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class="">Which analysis methods are more effective will be determined by factors such as gene density, average intron lengths, repeat content, etc.  You can try both methods and see which appears to work better for your organism.  
Both have their trade-offs.<o:p class=""></o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><o:p class=""> </o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class="">For sequencing, because of the physics involved in RNA folding, fragmentation of the transcriptome improves sensitivity but can also preclude the use of long paired-end reads (this is because fragments end up being too short). Short reads also reduce specificity and can align more randomly to the genome (but the gain in sensitivity from fragmentation usually far outweighs the loss of specificity from the short reads).  Larger fragments of RNA, or no fragmentation of the RNA, allow for longer reads and paired-end sequencing but will result in lower sensitivity and recovery of the transcriptome. This is because RNA folding, which occurs on larger fragments, physically blocks reverse transcriptase.  You will also end up recovering mostly the 5’ end of genes because the inhibition of reverse transcriptase results in a 5’ bias to the sequencing reaction. The read coverage graphs end up looking like ski slopes, and longer genes will not get coverage on the 3’ end.<o:p class=""></o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><o:p class=""> </o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class="">There are different strategies for transcriptome sequencing and sample preparation to deal with strandedness and sequencing bias.  
Each has its trade-offs, and what strategy you use depends on what you plan on using the sequenced reads for as well as the structure of that organism’s genome and transcriptome.<o:p class=""></o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><o:p class=""> </o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class="">—Carson<o:p class=""></o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><o:p class=""> </o:p></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><o:p class=""> </o:p></div></div><div class=""><blockquote style="margin-top: 5pt; margin-bottom: 5pt;" class=""><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class="">On Feb 5, 2015, at 7:37 AM, Van Hoeck Arne &lt;<a href="mailto:avhoeck@SCKCEN.BE" style="color: purple; text-decoration: underline;" class="">avhoeck@SCKCEN.BE</a>&gt; wrote:<o:p class=""></o:p></div></div><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><o:p class=""> </o:p></div><div class=""><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">Dear,</span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""> </span><span style="font-size: 11pt; font-family: Calibri, 
sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">I have read the manuscript on the MAKER-P tool. I’m finishing the last part of my plant genome assembly, and MAKER-P will be used to determine gene annotations. I have read the wiki tutorials and your manuscript, but one question comes to mind about evidence sources when RNAseq data is available for gene discovery (as we have):</span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""> </span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">Why do you perform a de novo RNA assembly with individual RNAseq runs? As I understand from your paper (a tool kit for the rapid creation,…), the RNAseq reads were short single reads. This implies that it is not easy for de novo RNA assemblers (like trinity) to find high-quality genes, since a lot of reads will be lost because of the low coverage. 
If 2*200 PE reads were used for RNAseq, this would recover more genes.</span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""> </span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">Therefore, as an alternative: why not map the RNAseq reads to your reference genome, extract the result as a FASTA file, and cut off all the reads shorter than 150 bp? If you concatenate all the RNAseq data into one file before the mapping, all your expressed RNA is mapped and included in the FASTA file that you need for gene discovery.</span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""> </span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">In my opinion, the low-expressed reads will be lost during the assembly. 
Or am I overlooking something?</span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">We have sequenced our genome with PE Illumina data. For DE analyses, we ran 48 single-read 50 bp runs at lower coverage.</span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">Therefore, is assembling the RNAseq data still better than mapping the data to our reference genome?</span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""> </span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">PS: can I add a question on the Google group? 
I couldn’t start a new topic</span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""> </span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">Thanks in advance,</span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span lang="NL-BE" style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">Arne Van Hoeck</span><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><o:p class=""></o:p></span></div></div><div class=""><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""> <o:p class=""></o:p></span></div></div><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class=""><br style="orphans: auto; text-align: start; widows: auto; -webkit-text-stroke-width: 0px; word-spacing: 0px;" class=""><br class=""></span><o:p class=""></o:p></div><div style="margin: 0cm 0cm 0.0001pt 36pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class="">_______________________________________________<br class="">maker-devel mailing list<br class=""></span><a href="mailto:maker-devel@box290.bluehost.com" style="color: purple; text-decoration: underline;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif; color: purple;" class="">maker-devel@box290.bluehost.com</span></a><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class=""><br class=""></span><a href="http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org" style="color: purple; text-decoration: underline;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif; color: purple;" class="">http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org</span></a></div></div></blockquote></div></div></div></blockquote></div><br class=""></body></html>