[maker-devel] clarification on creating a standard build

Valerie Soza vsoza at uw.edu
Thu Mar 29 12:42:28 MDT 2018


Hi MAKER community,

I was having issues grepping IDs from the InterProScan .tsv file against my all.gff file because the abinit genes in the all.gff file are called processed genes in the .all.maker.non_overlapping_ab_initio.proteins.fasta file, which gets propagated into the .tsv file.

I solved this issue by replacing "processed" with "abinit" in the .tsv file and then grepping the all.gff file with these IDs to create pred_gff. 

sed s/\-processed\-/\-abinit\-/g Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

Then I extracted only the IDs from the .tsv file to grep against the all.gff file.

cut -f 1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

I was having non-uinque top level ID errors when I tried to run maker with pred_gff, and I realized that IDs were duplicated in the .tsv file. So I went back to my list of IDs and only extraced unique IDs to grep.

sort Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

Then I redid my grepping and maker ran beautifully. I tried both the options recommended by Carson, but then wound up going with option #2 cause I am lazy and now I have a standard build :)

-Valerie


> On Mar 26, 2018, at 11:49 AM, Valerie Soza <vsoza at uw.edu> wrote:
> 
> Hi Carson
> 
> Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below.
> 
> I created the .gff file by this command:
> gff3_merge -d Rwill7_master_datastore_index.log
> 
> I created the .fasta files by this command:
> fasta_merge -d Rwill7_master_datastore_index.log
> 
> I ran InterProScan with this command:
> interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> 
> When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below:
> 
> $ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv
> 
> snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1	d146190e642a740520c9
> 7a782a74fe32	356	Pfam	PF13365	Trypsin-like peptidase domain	77	2281.4E-17	T	20-03-2018
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff
> #no results
> 
> There is no "processed-gene" with this ID in the Rwill7.all.gff file:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff
> 
> LG12_ordered_scaffold_85	maker	gene	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8
> LG12_ordered_scaffold_85	maker	mRNA	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200
> LG12_ordered_scaffold_85	maker	exon	63727	63768	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	64269	64340	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	64896	65000	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	65268	65327	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	65716	65915	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	66930	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	63727	63768	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	64269	64340	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	64896	65000	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	65268	65327	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	65716	65915	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	66930	67053	.	+	1	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> 
> However, there are Names and Targets called "abinit-gene”, not "processed-gene”, that appear to have this gene number in the .gff file:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff
> 
> #some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?
> 
> LG12_ordered_scaffold_85	snap_masked	match	101798	108141	35.366	+	ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
> LG12_ordered_scaffold_85	snap_masked	match_part	101798	102633	35.236	ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
> LG12_ordered_scaffold_85	snap_masked	match_part	107907	108141	0.130	ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235
> 
> So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this “abinit-gene” from the .gff file was present:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> #no results using the “abinit-gene” Name from the .gff file
> 
> versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>> snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356
> 
> I think the issue is that the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta is calling these “abinit-genes” in the .gff file as “processed-genes” in the .fasta file, which is then propagated into the .tsv file. Is my interpretation correct?
> 
> If so, all I would have to do is grep the .gff file replacing “processed” with “abinit”, correct?
> 
> Thanks for your help.
> 
> -Valerie
> 
>> On Mar 23, 2018, at 10:20 AM, Carson Holt <carsonhh at gmail.com> wrote:
>> 
>> You will get the ID from the IntrepProscan report. Then you can take that ID and grep for it in the MAKER gff3.
>> 
>> All ab initio predictions have match/match_part features created for them the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
>> 
>> You then have two alternate ways to get those models into your dataset.
>> 
>> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
>> 
>> That will simply take the pred_gff match/match_part values and turn them into a nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.
>> 
>> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
>> 
>> This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
>> 
>> —Carson
>> 
>> 
>> 
>>> On Mar 20, 2018, at 6:48 PM, Valerie Soza <vsoza at uw.edu> wrote:
>>> 
>>> Hi MAKER community
>>> 
>>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>>> 
>>> I can’t reply to this original thread, but I am trying to follow Carson’s suggestion for a standard build using this protocol instead now:
>>> "One note I’d like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDSwith proper reading frame). You can then merge the resulting GFF3's and fasta files.”
>>> 
>>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 
>>> 
>>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>>> 
>>> Thanks.
>>> 
>>> -Valerie
>>> 
>>> Valerie Soza, Ph.D.
>>> c/o Hall Lab
>>> Department of Biology
>>> University of Washington
>>> Johnson Hall 202A
>>> Box 351800
>>> Seattle, WA 98195-1800
>>> 206-543-6740
>>> http://staff.washington.edu/vsoza/
>>> 
>>> 
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>> 
> 
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/





More information about the maker-devel mailing list