<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
Hi Qihua,
<div class=""><br class="">
</div>
<div class="">If you are using the online version of SOBA, I would suggest switching to the command-line version, found here: <a href="https://github.com/The-Sequence-Ontology/SOBA" class="">https://github.com/The-Sequence-Ontology/SOBA</a>, as it is more flexible for
the kinds of analyses you are talking about.</div>
<div class=""><br class="">
</div>
<div class="">If you use ‘footprint’ as the --data_type argument, you should get the nucleotide count for collapsed features that you are talking about. In addition, I suggest you take a look at bedtools (<a href="http://bedtools.readthedocs.io/en/latest/index.html" class="">http://bedtools.readthedocs.io/en/latest/index.html</a>),
for example bedtools merge, as a flexible way to generate the kind of merged features you want. You can then pass that output through SOBAcl for counting, graphing and reporting.</div>
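<div class=""><br class="">
</div>
<div class="">To make the footprint idea concrete, here is a minimal Python sketch of a non-redundant nucleotide count computed directly from GFF3 lines. It is not the actual SOBA or bedtools code. The function names are made up for illustration, and filtering on a source value of "repeatmasker" is an assumption about how your Maker GFF3 labels those features, so check column 2 of your own file.</div>
<pre class="">
```python
from collections import defaultdict


def footprint(intervals):
    """Count distinct bases covered by 1-based, inclusive (start, end) intervals."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # No overlap with the open block: bank it and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the open block instead of double counting.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered


def gff3_footprint(lines, source=None):
    """Sum per-sequence footprints of GFF3 feature lines, optionally for one source."""
    per_seq = defaultdict(list)
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines
        if source is not None and cols[1] != source:
            continue
        # Intervals are grouped by seqid (column 1) so that features on
        # different sequences are never merged with each other.
        per_seq[cols[0]].append((int(cols[3]), int(cols[4])))
    return sum(footprint(ivs) for ivs in per_seq.values())


# Toy input (assumed source label): two overlapping repeats and one
# separate repeat on chr1, plus one repeat on chr2.
example = [
    "##gff-version 3",
    "chr1\trepeatmasker\tmatch\t1\t100\t.\t+\t.\tID=a",
    "chr1\trepeatmasker\tmatch\t51\t150\t.\t+\t.\tID=b",
    "chr1\trepeatmasker\tmatch\t201\t210\t.\t-\t.\tID=c",
    "chr2\trepeatmasker\tmatch\t1\t10\t.\t+\t.\tID=d",
]
print(gff3_footprint(example, source="repeatmasker"))  # 170, not the summed 220
```
</pre>
<div class="">Dividing the result by the total assembly length gives a footprint percentage that can be compared with the masked fraction RepeatMasker reports.</div>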
<div class=""><br class="">
</div>
<div class="">Finally, if you want a great deal of flexibility in manipulating and reporting on GFF3 files, beyond the scope of SOBA and/or bedtools, I suggest you take a look at the GAL library (<a href="https://github.com/The-Sequence-Ontology/GAL" class="">https://github.com/The-Sequence-Ontology/GAL</a>), if
you don’t mind writing some Perl code.</div>
<div class=""><br class="">
</div>
<div class="">Regards,</div>
<div class=""><br class="">
</div>
<div class="">Barry</div>
<div class=""><br class="">
</div>
<div class="">
<div>
<blockquote type="cite" class="">
<div class="">On Feb 20, 2017, at 1:34 PM, Qihua Liang &lt;<a href="mailto:qlian003@ucr.edu" class="">qlian003@ucr.edu</a>&gt; wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">Hi Carson,<br class="">
<br class="">
Thanks for your reply! Now I understand the minimum feature lengths that SOBA reports for the Maker gene models in the GFF3.<br class="">
<br class="">
I am also using SOBA to calculate statistics for other sources in the GFF3 file, and I have found another strange thing, this time about the RepeatMasker annotation and its footprint percentage.
<br class="">
<br class="">
Previously, I ran RepeatMasker once outside of Maker with my_trained.lib (the same library used in Maker), and the output report showed ~42% of bases masked.
<br class="">
In running Maker, I provided both “model_org=all” and “rmlib=my_trained.lib”. Under these settings, RepeatMasker should be run twice, and the merged results of the two runs should be the RepeatMasker output in the GFF3. I therefore expected the bases masked by RepeatMasker
in the GFF3 to be more than 42%. <br class="">
<br class="">
But in the SOBA calculation, the footprint percentage is only ~18%. According to the SOBA paper, footprint is calculated as the “non-redundant nucleotide count of all features of a given type”. I assumed that when SOBA calculates the footprint of RepeatMasker features in
the GFF3, it counts the same thing as the “masked bps” reported by RepeatMasker itself. <br class="">
<br class="">
When Maker “combines” the 2 runs of RepeatMasker, does it merge the 2 RepeatMasker results or keep only their overlap?
<br class="">
Besides, instead of using SOBA, are there any up-to-date accessory scripts in Maker for calculating statistics of the annotations?<br class="">
<br class="">
Thanks<br class="">
Qihua<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Feb 19, 2017, at 10:05 PM, Carson Holt &lt;<a href="mailto:carsonhh@gmail.com" class="">carsonhh@gmail.com</a>&gt; wrote:<br class="">
<br class="">
In GFF3, the CDS and UTR lengths are actually the merge of all CDS or UTR features, but SOBA is reporting each part individually, which may be causing your confusion. This is because SOBA reports per-feature statistics and not merged-feature statistics.<br class="">
<br class="">
CDSs do not have to take up entire exons. For example, start/stop codons may cross splice sites and be split across exons (very common). The result is that each part of the split CDS becomes a separate feature, and SOBA will treat each one separately.
So a single bp CDS here is not abnormal, since the remaining part of the CDS continues on the next exon as a separate line. The exact same is true for UTR.<br class="">
<br class="">
If you want the merged length of the UTR and CDS, it is best to pull that info out of the _QI= part of the GFF3 attributes for each mRNA.<br class="">
<br class="">
What about single bp exons? Those cannot occur unless you gave an input GFF3 with predictions that have single bp exons. Predictors like SNAP and Augustus just won’t produce them, with one exception: they can potentially produce them for the first/last
exon. This is not because the exon is 1 bp, but rather because the predictor only reports the CDS part of the exon. As a result, the start/stop codon may have only 1 bp overlapping that exon, but once you add UTR the exon will extend from that point and will
no longer be 1 bp in length. But if the UTR never gets added, then you can be left with a partial initial/terminal exon.<br class="">
<br class="">
However, more than likely what you are seeing is just related to how SOBA reports individual feature line stats as opposed to merged stats for CDS and UTR.<br class="">
<br class="">
Thanks,<br class="">
Carson<br class="">
<br class="">
<blockquote type="cite" class="">On Feb 18, 2017, at 9:43 AM, Qihua Liang &lt;<a href="mailto:qlian003@ucr.edu" class="">qlian003@ucr.edu</a>&gt; wrote:<br class="">
<br class="">
Dear Maker development team,<br class="">
<br class="">
I used the SOBA website to calculate statistics for a Maker annotation, and I found that the minimum length of some Maker features, such as CDS, exon, and 5’ and 3’ UTR, can be as short as 1 bp. Features with a length
of 1 bp are confusing. When Maker combines different gene models and makes such predictions, how does it accept such abnormal exon/CDS lengths? And are there any parameters in the bopt.ctl or evm.ctl to avoid such abnormal gene models?<br class="">
<br class="">
Thanks<br class="">
Qihua<br class="">
_______________________________________________<br class="">
maker-devel mailing list<br class="">
<a href="mailto:maker-devel@box290.bluehost.com" class="">maker-devel@box290.bluehost.com</a><br class="">
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org<br class="">
</blockquote>
<br class="">
</blockquote>
<br class="">
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</body>
</html>