[maker-devel] clarification on creating a standard build
Valerie Soza
vsoza at uw.edu
Thu Mar 29 12:42:28 MDT 2018
Hi MAKER community,
I was having issues grepping IDs from the InterProScan .tsv file against my all.gff file because the abinit genes in the all.gff file are called processed genes in the .all.maker.non_overlapping_ab_initio.proteins.fasta file, which gets propagated into the .tsv file.
I solved this issue by replacing "processed" with "abinit" in the .tsv file and then grepping the all.gff file with these IDs to create pred_gff.
sed s/\-processed\-/\-abinit\-/g Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
Then I extracted only the IDs from the .tsv file to grep against the all.gff file.
cut -f 1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
I was having non-uinque top level ID errors when I tried to run maker with pred_gff, and I realized that IDs were duplicated in the .tsv file. So I went back to my list of IDs and only extraced unique IDs to grep.
sort Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
Then I redid my grepping and maker ran beautifully. I tried both the options recommended by Carson, but then wound up going with option #2 cause I am lazy and now I have a standard build :)
-Valerie
> On Mar 26, 2018, at 11:49 AM, Valerie Soza <vsoza at uw.edu> wrote:
>
> Hi Carson
>
> Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below.
>
> I created the .gff file by this command:
> gff3_merge -d Rwill7_master_datastore_index.log
>
> I created the .fasta files by this command:
> fasta_merge -d Rwill7_master_datastore_index.log
>
> I ran InterProScan with this command:
> interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>
> When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below:
>
> $ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv
>
> snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 d146190e642a740520c9
> 7a782a74fe32 356 Pfam PF13365 Trypsin-like peptidase domain 77 2281.4E-17 T 20-03-2018
>
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff
> #no results
>
> There is no "processed-gene" with this ID in the Rwill7.all.gff file:
>
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff
>
> LG12_ordered_scaffold_85 maker gene 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8
> LG12_ordered_scaffold_85 maker mRNA 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200
> LG12_ordered_scaffold_85 maker exon 63727 63768 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker exon 64269 64340 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker exon 64896 65000 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker exon 65268 65327 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker exon 65716 65915 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker exon 66930 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 63727 63768 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 64269 64340 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 64896 65000 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 65268 65327 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 65716 65915 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 66930 67053 . + 1 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
>
> However, there are Names and Targets called "abinit-gene”, not "processed-gene”, that appear to have this gene number in the .gff file:
>
> $ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff
>
> #some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?
>
> LG12_ordered_scaffold_85 snap_masked match 101798 108141 35.366 + ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
> LG12_ordered_scaffold_85 snap_masked match_part 101798 102633 35.236 ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
> LG12_ordered_scaffold_85 snap_masked match_part 107907 108141 0.130 ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235
>
> So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this “abinit-gene” from the .gff file was present:
>
> $ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> #no results using the “abinit-gene” Name from the .gff file
>
> versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:
>
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>> snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356
>
> I think the issue is that the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta is calling these “abinit-genes” in the .gff file as “processed-genes” in the .fasta file, which is then propagated into the .tsv file. Is my interpretation correct?
>
> If so, all I would have to do is grep the .gff file replacing “processed” with “abinit”, correct?
>
> Thanks for your help.
>
> -Valerie
>
>> On Mar 23, 2018, at 10:20 AM, Carson Holt <carsonhh at gmail.com> wrote:
>>
>> You will get the ID from the IntrepProscan report. Then you can take that ID and grep for it in the MAKER gff3.
>>
>> All ab initio predictions have match/match_part features created for them the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
>>
>> You then have two alternate ways to get those models into your dataset.
>>
>> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
>>
>> That will simply take the pred_gff match/match_part values and turn them into a nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.
>>
>> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
>>
>> This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
>>
>> —Carson
>>
>>
>>
>>> On Mar 20, 2018, at 6:48 PM, Valerie Soza <vsoza at uw.edu> wrote:
>>>
>>> Hi MAKER community
>>>
>>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>>>
>>> I can’t reply to this original thread, but I am trying to follow Carson’s suggestion for a standard build using this protocol instead now:
>>> "One note I’d like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDSwith proper reading frame). You can then merge the resulting GFF3's and fasta files.”
>>>
>>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.
>>>
>>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>>>
>>> Thanks.
>>>
>>> -Valerie
>>>
>>> Valerie Soza, Ph.D.
>>> c/o Hall Lab
>>> Department of Biology
>>> University of Washington
>>> Johnson Hall 202A
>>> Box 351800
>>> Seattle, WA 98195-1800
>>> 206-543-6740
>>> http://staff.washington.edu/vsoza/
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
>
Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/
More information about the maker-devel
mailing list