[maker-devel] SOBA statistics of Maker annotation
Qihua Liang
qlian003 at ucr.edu
Sat Feb 25 10:14:04 MST 2017
Thank you Barry and Carson!
I compared the SOBA statistics for the RepeatMasker footprint with the report generated by running RepeatMasker alone, and I got two different percentages of repeats masked. Running RepeatMasker alone with myTrained.lib masks 42% of bases, but within the Maker GFF3 the percentage of repeats masked is only ~18%. What might cause this difference?
Thanks
Qihua
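
The footprint discussed in this thread (the non-redundant base count of a feature type) can be recomputed directly from the GFF3 as a cross-check. The following is a minimal sketch, not a MAKER or SOBA script: the function name `footprint` is hypothetical, and the default source/type values ("repeatmasker" / "match") are an assumption about how the repeat features are labelled in the file, so adjust them to match columns 2 and 3 of your own GFF3.

```python
from collections import defaultdict


def footprint(gff3_lines, source="repeatmasker", ftype="match"):
    """Count non-redundant bases covered by features of one source/type.

    `source` and `ftype` are compared against GFF3 columns 2 and 3; the
    defaults are an assumption and may need changing for your file.
    """
    intervals = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[1] != source or cols[2] != ftype:
            continue
        intervals[cols[0]].append((int(cols[3]), int(cols[4])))

    covered = 0
    for spans in intervals.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:
                # Overlapping or abutting: extend the current merged block.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
    return covered


# Toy input: two overlapping repeats, one separate repeat, one gene.
lines = [
    "chr1\trepeatmasker\tmatch\t1\t100\t.\t+\t.\tID=a",
    "chr1\trepeatmasker\tmatch\t51\t150\t.\t+\t.\tID=b",
    "chr1\trepeatmasker\tmatch\t301\t400\t.\t+\t.\tID=c",
    "chr1\tmaker\tgene\t1\t1000\t.\t+\t.\tID=g",
]
print(footprint(lines))
```

Dividing the result by the genome length gives a footprint percentage; the denominator should be the same genome length that the report being compared against uses.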
> On Feb 21, 2017, at 1:34 PM, Carson Holt <carsonhh at gmail.com> wrote:
>
> MAKER merges overlapping RepeatMasker results into a single longer feature.
>
> —Carson
>
>
>> On Feb 20, 2017, at 1:34 PM, Qihua Liang <qlian003 at ucr.edu> wrote:
>>
>> Hi Carson,
>>
>> Thanks for your reply! Now I understand the minimum feature lengths that SOBA reports for Maker gene models in GFF3.
>>
>> I am also using SOBA to calculate statistics for other sources in the GFF3 file, and I have found something else strange about the RepeatMasker annotation and its footprint percentage.
>>
>> Previously, I ran RepeatMasker outside of Maker once, with my_trained.lib (the same library used in Maker), and the output report showed ~42% of bases masked.
>> When running Maker, I provided both "model_org=all" and "rmlib=my_trained.lib". Under these settings, RepeatMasker should be run twice, and the merged results of the two runs should be the RepeatMasker output in the GFF3. I expected the bases masked by RepeatMasker in the GFF3 to be more than 42%.
>>
>> But in the SOBA calculation, the footprint percentage is only ~18%. According to the SOBA paper, the footprint is calculated as the "non-redundant nucleotide count of all features of a given type". I assume that when SOBA calculates the footprint of RepeatMasker features in the GFF3, it counts the same bases that RepeatMasker itself reports as masked.
>>
>> When Maker "combines" the two runs of RepeatMasker, does it merge the two result sets or keep only where they overlap?
>> Also, besides SOBA, are there any accessory scripts included with Maker for calculating annotation statistics?
>>
>> Thanks
>> Qihua
>>
>>
>>> On Feb 19, 2017, at 10:05 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>
>>> In GFF3, the CDS and UTR lengths are actually the merge of all CDS or UTR features, but SOBA reports each part individually, which may be causing your confusion. This is because SOBA reports per-feature statistics, not merged-feature statistics.
>>>
>>> CDSs do not have to take up entire exons. For example, start/stop codons may cross splice sites and be split across exons (very common). The result is that each part of the split CDS becomes a separate feature, and SOBA treats each one separately. So a single-bp CDS here is not abnormal, since the remaining part of the CDS continues on the next exon as a separate line. The exact same is true for UTR.
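
To see the difference between per-feature and merged lengths, the CDS lines can be summed per parent mRNA. This is a minimal sketch rather than a MAKER script; the function name `cds_lengths_by_parent` is hypothetical, and it assumes each CDS line carries a standard GFF3 `Parent` attribute.

```python
from collections import defaultdict


def cds_lengths_by_parent(gff3_lines):
    """Sum the lengths of all CDS lines that share a Parent (mRNA) ID."""
    totals = defaultdict(int)
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        length = int(cols[4]) - int(cols[3]) + 1  # GFF3 coordinates are inclusive
        # A CDS line may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                totals[parent] += length
    return dict(totals)


# Toy input: a start codon split across a splice site gives a 1 bp CDS line.
lines = [
    "chr1\tmaker\tCDS\t200\t200\t.\t+\t0\tID=mRNA1:cds;Parent=mRNA1",
    "chr1\tmaker\tCDS\t301\t449\t.\t+\t2\tID=mRNA1:cds;Parent=mRNA1",
]
print(cds_lengths_by_parent(lines))
```

In the toy input, a per-feature report would list a minimum CDS length of 1 bp, while the merged CDS for the transcript is 150 bp.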
>>>
>>> If you want the merged length of the UTR and CDS, it is best to pull that info out of the _QI= part of the GFF3 attributes for each mRNA.
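
Pulling values out of `_QI` could look roughly like the sketch below. The field positions are an assumption based on the commonly documented layout of `_QI` (nine pipe-separated values, with 5' UTR length first, exon count seventh, 3' UTR length eighth, and protein length ninth); please verify them against the documentation for your MAKER version. The function name `parse_qi` and the example attribute string are hypothetical.

```python
def parse_qi(attributes):
    """Extract selected _QI fields from column 9 of an mRNA line.

    Field positions are an assumption (see the note above the code);
    check them against your MAKER version before relying on them.
    """
    attrs = dict(kv.split("=", 1) for kv in attributes.split(";") if "=" in kv)
    fields = attrs["_QI"].split("|")
    if len(fields) != 9:
        raise ValueError("expected 9 _QI fields, got %d" % len(fields))
    return {
        "five_prime_utr_len": int(fields[0]),
        "num_exons": int(fields[6]),
        "three_prime_utr_len": int(fields[7]),
        "protein_len": int(fields[8]),
    }


# Made-up attribute string, for illustration only.
example = "ID=mRNA1;Parent=gene1;_AED=0.10;_QI=25|1|1|1|0|0|3|120|210"
print(parse_qi(example))
```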
>>>
>>> What about single-bp exons? Those cannot occur unless you gave an input GFF3 with predictions that have single-bp exons. Predictors like SNAP and Augustus just won’t produce them, with one exception: they can potentially produce them for the first/last exon. This is not because the exon is 1 bp, but because the predictor only reports the CDS part of the exon. As a result, the start/stop codon may have only 1 bp overlapping that exon, but once you add UTR the exon will extend from that point and will no longer be 1 bp in length. If the UTR never gets added, though, you can be left with a partial initial/terminal exon.
>>>
>>> However, more than likely what you are seeing is just related to how SOBA reports individual feature-line stats as opposed to merged stats for CDS and UTR.
>>>
>>> Thanks,
>>> Carson
>>>
>>>> On Feb 18, 2017, at 9:43 AM, Qihua Liang <qlian003 at ucr.edu> wrote:
>>>>
>>>> Dear Maker development team,
>>>>
>>>> I used the SOBA website to calculate statistics for my Maker annotation, and I found that the minimum length of some Maker features, such as CDS, exon, and 5’ and 3’ UTR, can be as short as 1 bp. Features only 1 bp long are confusing. When Maker combines different gene models to make its predictions, how can it accept such abnormal exon/CDS lengths? And are there any parameters in bopt.ctl or evm.ctl to avoid such abnormal gene models?
>>>>
>>>> Thanks
>>>> Qihua
>>>
>>
>