From carsonhh at gmail.com Mon Jun 2 10:10:30 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 02 Jun 2014 09:10:30 -0600
Subject: [maker-devel] Precomputed alignments
In-Reply-To:
References:
Message-ID:

With the Target and Gap attributes you get slightly better behavior on
filtering when you specify the blast_depth=X parameter in the
maker_bopts.ctl file (keeps only the X best hits). They will also affect
the eAED score, since it takes reading frame into account (so no Gap
attribute means no assumption of reading frame). Otherwise they are only
beneficial for seeing the alignment in a viewer, as some viewers can
recover the alignment when those values are specified. If you are not
using blast_depth or trying to view the alignments in a viewer, they
don't really do anything. MAKER will just assume a perfect match across
the specified regions.

--Carson

From: Daniel Standage
Date: Saturday, May 31, 2014 at 9:23 AM
To: Maker Mailing List
Subject: [maker-devel] Precomputed alignments

Hello again!

About a year ago I asked about using precomputed alignments with Maker.
The thread quickly took a different direction as we tried to track down
other issues, and I never got the thread back on its original track.

So, to return to the original question, what exactly is required when
providing pre-computed alignments in GFF3 format? For example, does it
affect Maker's behavior whether a score is given? The "Target"
attribute? The "Gap" attribute?

Thanks!

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
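For readers unfamiliar with the Target and Gap attributes discussed in the
message above, here is a minimal sketch of what they look like in a GFF3
alignment feature. The attribute syntax follows the GFF3 specification
(Target=<id> <start> <end>, and Gap as space-separated operations such as
M8 D3 M6). Everything else is an assumption made for illustration: the
sequence IDs, coordinates, the EST_match feature type, and the helper names
gap_lengths and match_line are hypothetical, the sketch only covers
nucleotide-to-nucleotide alignments, and none of it is taken from MAKER's
code or states which feature types MAKER expects.

```python
import re


def gap_lengths(gap):
    """Return (reference_span, target_span) implied by a GFF3 Gap string.

    Per the GFF3 specification, M consumes both sequences, D consumes only
    the reference, and I consumes only the target. The frameshift codes
    F and R are ignored in this sketch.
    """
    ref = tgt = 0
    for op, count in re.findall(r"([MID])(\d+)", gap):
        n = int(count)
        if op in "MD":
            ref += n
        if op in "MI":
            tgt += n
    return ref, tgt


def match_line(seqid, source, ftype, start, end, score, strand,
               match_id, target, tstart, tend, gap=None):
    """Format one tab-separated GFF3 line with Target and optional Gap."""
    attrs = ["ID={}".format(match_id),
             "Target={} {} {}".format(target, tstart, tend)]
    if gap:
        attrs.append("Gap={}".format(gap))
    cols = [seqid, source, ftype, str(start), str(end),
            str(score), strand, ".", ";".join(attrs)]
    return "\t".join(cols)


if __name__ == "__main__":
    # Hypothetical alignment of an EST to positions 1..23 of "chr3".
    gap = "M8 D3 M6 I1 M6"
    ref_span, tgt_span = gap_lengths(gap)
    # The Gap string should be consistent with both coordinate ranges.
    assert ref_span == 23 - 1 + 1
    assert tgt_span == 21 - 1 + 1
    print(match_line("chr3", "example", "EST_match", 1, 23, ".", "+",
                     "match1", "EST23", 1, 21, gap))
```

A check like the one in the main block (Gap operations adding up to the
reference and target spans) may be a cheap sanity test before handing
precomputed alignments to any downstream tool.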
URL: From carsonhh at gmail.com Mon Jun 2 10:23:25 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 02 Jun 2014 09:23:25 -0600 Subject: [maker-devel] tRNAscan and map_gff_ids Message-ID: I've now patched the current download to fix this and a plus strand spliced tRNA bug. --Carson On 5/20/14, 1:17 PM, "Fields, Christopher J" wrote: >I found a problem with some tRNAscan output using MAKER 2.31.5. I had a >full MAKER data set (run initially using MAKER 2.31.5) that I mapped IDs >for. This was then run as follows, with the requisite error: > >-system-specific-4.1$ map_gff_ids id.map Zalbi.all.gff3 >Nested quantifiers in regex; marked by <-- HERE in >m/trnascan-KB913038.1-noncoding-Undet_??? <-- HERE -gene-79.0/ at >/home/groups/hpcbio/apps/maker/maker-2.31.5/bin/map_gff_ids line 111, ><$IN> line 3067590. > >The problematic lines: > >---------------------------------------------- >-system-specific-4.1$ grep "???" Zalbi.all.gff3 >KB913038.1 maker gene 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene >-79.0 >KB913038.1 maker tRNA 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0-tRNA-1;Parent=trnascan-KB913038.1-noncoding-Undet >_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA >-1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|70|0 >KB913038.1 maker exon 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0-tRNA-1:exon:2193;Parent=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0-tRNA-1 >KB913039.1 maker gene 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene >-72.0 >KB913039.1 maker tRNA 21710152 21710224 . - . 
ID=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0-tRNA-1;Parent=trnascan-KB913039.1-noncoding-Undet >_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA >-1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|74|0 >KB913039.1 maker exon 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0-tRNA-1:exon:4036;Parent=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0-tRNA-1 >---------------------------------------------- > >I managed to get it going by using the following modifications (regex >quotemeta) in map_gff_ids (lines 107-112): > > for my $id (@map_ids) { > # Only if the value (or the portion preceding > # the first colon) is equal to the map key. > next unless ($value eq $id || $value =~ /^\Q$id\E:/); > $value =~ s/\Q$id\E/$map{$id}/ unless($tag eq 'Name' && $id !~ >/\-gene\-\d+\.\d+|^CG\:|^....\:|^[^\:]+\:temp\d+\:/); > } > >I?m guessing there may be a similar problem with map_fasta_ids? > >chris >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From cjfields at illinois.edu Mon Jun 2 11:45:09 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 2 Jun 2014 16:45:09 +0000 Subject: [maker-devel] tRNAscan and map_gff_ids In-Reply-To: References: Message-ID: <007A79A7-8C68-4AFC-AC4F-451194D4CD29@illinois.edu> Thanks Carson! chris On Jun 2, 2014, at 10:23 AM, Carson Holt wrote: > I've now patched the current download to fix this and a plus strand > spliced tRNA bug. > > --Carson > > > On 5/20/14, 1:17 PM, "Fields, Christopher J" wrote: > >> I found a problem with some tRNAscan output using MAKER 2.31.5. I had a >> full MAKER data set (run initially using MAKER 2.31.5) that I mapped IDs >> for. 
This was then run as follows, with the requisite error: >> >> -system-specific-4.1$ map_gff_ids id.map Zalbi.all.gff3 >> Nested quantifiers in regex; marked by <-- HERE in >> m/trnascan-KB913038.1-noncoding-Undet_??? <-- HERE -gene-79.0/ at >> /home/groups/hpcbio/apps/maker/maker-2.31.5/bin/map_gff_ids line 111, >> <$IN> line 3067590. >> >> The problematic lines: >> >> ---------------------------------------------- >> -system-specific-4.1$ grep "???" Zalbi.all.gff3 >> KB913038.1 maker gene 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene >> -79.0 >> KB913038.1 maker tRNA 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0-tRNA-1;Parent=trnascan-KB913038.1-noncoding-Undet >> _???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA >> -1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|70|0 >> KB913038.1 maker exon 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0-tRNA-1:exon:2193;Parent=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0-tRNA-1 >> KB913039.1 maker gene 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene >> -72.0 >> KB913039.1 maker tRNA 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0-tRNA-1;Parent=trnascan-KB913039.1-noncoding-Undet >> _???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA >> -1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|74|0 >> KB913039.1 maker exon 21710152 21710224 . - . 
ID=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0-tRNA-1:exon:4036;Parent=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0-tRNA-1 >> ---------------------------------------------- >> >> I managed to get it going by using the following modifications (regex >> quotemeta) in map_gff_ids (lines 107-112): >> >> for my $id (@map_ids) { >> # Only if the value (or the portion preceding >> # the first colon) is equal to the map key. >> next unless ($value eq $id || $value =~ /^\Q$id\E:/); >> $value =~ s/\Q$id\E/$map{$id}/ unless($tag eq 'Name' && $id !~ >> /\-gene\-\d+\.\d+|^CG\:|^....\:|^[^\:]+\:temp\d+\:/); >> } >> >> I?m guessing there may be a similar problem with map_fasta_ids? >> >> chris >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From anthony.bretaudeau at rennes.inra.fr Tue Jun 3 03:38:31 2014 From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau) Date: Tue, 03 Jun 2014 10:38:31 +0200 Subject: [maker-devel] Merging 2 annotations Message-ID: <538D8987.4090606@rennes.inra.fr> Hello, I am working on the annotation of an insect genome, and I have 2 gff files: -an automatic annotation (done by another lab, with something else than maker, ~20000genes) -a manually curated annotation (with webapollo, ~1500 genes) From this, I would like to produce a single gff combining the 2. I'd like to keep all the manually curated models, and only the automatic ones that have no equivalent in the manually curated gff. Is it possible to do something like this with maker? I guess I could play with the model_gff option, but I'm not sure how exactly I could use it. 
Thank you for your help Regards Anthony From shpeng at shou.edu.cn Mon Jun 2 21:30:17 2014 From: shpeng at shou.edu.cn (=?UTF-8?B?5b2t5Y+45Y2O?=) Date: Tue, 3 Jun 2014 10:30:17 +0800 (GMT+08:00) Subject: [maker-devel] Maker can not run repeatmasker Message-ID: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> Using the example data, I can not run maker successfully. Please see the following: [root at c0105 test3]# ls -l total 72 -rw-r--r--. 1 root root 32712 May 29 02:46 dpp_contig.fasta -rw-r--r--. 1 root root 19138 May 29 02:46 dpp_est.fasta -rw-r--r--. 1 root root 3045 May 29 02:46 dpp_protein.fasta -rw-r--r--. 1 root root 1413 May 29 02:46 maker_bopts.ctl -rw-r--r--. 1 root root 1288 May 29 02:46 maker_exe.ctl -rw-r--r--. 1 root root 4630 May 29 04:27 maker_opts.ctl [root at c0105 test3]# maker STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... STATUS: Setting up database for any GFF3 input... A data structure will be created for you at: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore To access files for individual sequences use the datastore index: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_master_datastore_index.log STATUS: Now running MAKER... examining contents of the fasta file and run log --Next Contig-- #--------------------------------------------------------------------- Now starting the contig!! SeqID: contig-dpp-500-500 Length: 32156 #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. 
#--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased #--------------------------------------------------------------------- Now retrying the contig!! SeqID: contig-dpp-500-500 Length: 32156 Tries: 2!! #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. 
#--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased Maker is now finished!!! Start_time: 1401761680 End_time: 1401761688 Elapsed: 8 Thnaks. Sihua -------------- next part -------------- An HTML attachment was scrubbed... URL: From janphilipoyen at gmail.com Tue Jun 3 10:07:17 2014 From: janphilipoyen at gmail.com (=?UTF-8?Q?Jan_Philip_=C3=98yen?=) Date: Tue, 3 Jun 2014 17:07:17 +0200 Subject: [maker-devel] AED scores and thresholds: Not filtering? Message-ID: Hello, We are currently working on annotating arthropod genomes with the maker pipeline. We have tried to filter out genes with low support using the AED thresholds using AED_threshold=0.5. However, we still find genes without scores and most importantly genes with scores above the set threshold (even AED=1). Is there a known issue which could cause this? 
We can supply the control files as well as gffs, but would not like to make them public at this stage. Best Regards, Jan Philip Oeyen ZFMK / ZMB / University of Bonn -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jun 3 10:10:27 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 03 Jun 2014 09:10:27 -0600 Subject: [maker-devel] Maker can not run repeatmasker In-Reply-To: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> References: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> Message-ID: The message is basically saying that RepeatMasker is not installed correctly. Follow the instructions here --> http://www.repeatmasker.org/RMDownload.html --Carson From: ??? Date: Monday, June 2, 2014 at 8:30 PM To: Subject: [maker-devel] Maker can not run repeatmasker Using the example data, I can not run maker successfully. Please see the following: [root at c0105 test3]# ls -l total 72 -rw-r--r--. 1 root root 32712 May 29 02:46 dpp_contig.fasta -rw-r--r--. 1 root root 19138 May 29 02:46 dpp_est.fasta -rw-r--r--. 1 root root 3045 May 29 02:46 dpp_protein.fasta -rw-r--r--. 1 root root 1413 May 29 02:46 maker_bopts.ctl -rw-r--r--. 1 root root 1288 May 29 02:46 maker_exe.ctl -rw-r--r--. 1 root root 4630 May 29 04:27 maker_opts.ctl [root at c0105 test3]# maker STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... STATUS: Setting up database for any GFF3 input... A data structure will be created for you at: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore To access files for individual sequences use the datastore index: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_master_datas tore_index.log STATUS: Now running MAKER... examining contents of the fasta file and run log --Next Contig-- #--------------------------------------------------------------------- Now starting the contig!! 
SeqID: contig-dpp-500-500 Length: 32156 #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.al l.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVo id.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased #--------------------------------------------------------------------- Now retrying the contig!! SeqID: contig-dpp-500-500 Length: 32156 Tries: 2!! #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. 
#--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.al l.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVo id.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased Maker is now finished!!! Start_time: 1401761680 End_time: 1401761688 Elapsed: 8 Thnaks. Sihua _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jun 3 10:51:44 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 03 Jun 2014 09:51:44 -0600 Subject: [maker-devel] AED scores and thresholds: Not filtering? In-Reply-To: References: Message-ID: No. It should use whichever is lower the AED or eAED score. The only exception is model_gff results. Those are always kept. 
Also note that the filter is for the entire gene, not just individual
splice forms, if you have alternative splicing.

If you want, I can take a look to see if there is anything non-obvious.
You would have to send me the final GFF3 and the maker_opts.ctl file.

--Carson

From: Jan Philip Øyen
Date: Tuesday, June 3, 2014 at 9:07 AM
To:
Subject: [maker-devel] AED scores and thresholds: Not filtering?

Hello,

We are currently working on annotating arthropod genomes with the maker
pipeline. We have tried to filter out genes with low support by setting
AED_threshold=0.5. However, we still find genes without scores and, most
importantly, genes with scores above the set threshold (even AED=1).

Is there a known issue which could cause this? We can supply the control
files as well as gffs, but would not like to make them public at this
stage.

Best Regards,
Jan Philip Oeyen
ZFMK / ZMB / University of Bonn

From carsonhh at gmail.com Tue Jun 3 11:15:46 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 03 Jun 2014 10:15:46 -0600
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <538D8987.4090606@rennes.inra.fr>
References: <538D8987.4090606@rennes.inra.fr>
Message-ID:

You can give the manually curated ones to model_gff and the other ones
to pred_gff. Then set keep_preds=1. The model_gff results always get
kept, even without evidence support; the pred_gff results will be kept
even without evidence support because you set keep_preds=1, but
model_gff results will take precedence.
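The outcome described above (every curated model kept, automatic models
kept only where no curated model takes precedence) can be pictured with a
small sketch. This is not MAKER's implementation, and the thread does not
say how MAKER decides that two models compete. The sketch assumes that
"any overlap on the same sequence and strand" is the criterion, and the
Gene tuple, function names, and gene IDs are all made up for illustration.

```python
from collections import namedtuple

# Hypothetical minimal gene record; real GFF3 genes carry more fields.
Gene = namedtuple("Gene", "seqid start end strand gene_id")


def overlaps(a, b):
    """True when two genes share at least one base on the same
    sequence and strand (an assumed criterion, see the text above)."""
    return (a.seqid == b.seqid and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def merge_annotations(curated, automatic):
    """Keep every curated gene, plus each automatic gene that overlaps
    no curated gene."""
    kept = list(curated)
    for gene in automatic:
        if not any(overlaps(gene, c) for c in curated):
            kept.append(gene)
    return kept


if __name__ == "__main__":
    curated = [Gene("chr1", 100, 500, "+", "cur1")]
    automatic = [
        Gene("chr1", 400, 900, "+", "auto1"),    # overlaps cur1: dropped
        Gene("chr1", 1000, 1500, "+", "auto2"),  # no overlap: kept
        Gene("chr1", 200, 300, "-", "auto3"),    # other strand: kept
    ]
    for gene in merge_annotations(curated, automatic):
        print(gene.gene_id)
```

In an actual MAKER run the same intent is expressed through the three
settings named in the reply (model_gff, pred_gff, keep_preds=1) rather
than through code like this.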
--Carson


On 6/3/14, 2:38 AM, "Anthony Bretaudeau" wrote:

>Hello,
>
>I am working on the annotation of an insect genome, and I have 2 gff
>files:
>-an automatic annotation (done by another lab, with something other than
>maker, ~20000 genes)
>-a manually curated annotation (with webapollo, ~1500 genes)
>
>From this, I would like to produce a single gff combining the 2. I'd
>like to keep all the manually curated models, and only the automatic
>ones that have no equivalent in the manually curated gff.
>
>Is it possible to do something like this with maker? I guess I could
>play with the model_gff option, but I'm not sure how exactly I could use
>it.
>
>Thank you for your help
>Regards
>
>Anthony

From carsonhh at gmail.com Tue Jun 3 21:20:20 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 03 Jun 2014 20:20:20 -0600
Subject: [maker-devel] Short Introns
In-Reply-To:
References:
Message-ID:

I think you may be best off using WebApollo to manually annotate the few
hundred short-intron ones. It's not that fun to do, but you should be
able to get them all in a couple of days by yourself, or in under a day
if you had a helper.

--Carson


On 5/15/14, 11:15 AM, "Mack, Brian" wrote:

>Hi, I examined the genes that had introns less than 10 bp that were being
>flagged by tbl2asn and I noticed that all 438 of them were genes called
>by SNAP. Also they were found in the CDS and not the UTR. It seems
>strange that all of the genes that have these short introns are from SNAP
>when only about one third of the final gene models are from SNAP. I've
>examined the evidence for a handful of these genes and the short introns
>do not seem supported by the evidence. Has anybody else had short intron
>issues with SNAP?
>
>Brian
>
>-----Original Message-----
>From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf
>Of Carson Holt
>Sent: Friday, April 18, 2014 10:36 AM
>To: UMD Bioinformatics; maker-devel at yandell-lab.org
>Subject: Re: [maker-devel] Short Introns
>
>Look at the name of those genes. The original name will let you know
>where it came from because it will contain augustus, genemark, snap, etc.
>You will also want to open up the contig containing those genes in a
>viewer like apollo
>(http://weatherby.genetics.utah.edu/apollo/apollo.tar.gz). See if the
>short intron is part of the CDS or UTR. If it's UTR, then it has
>evidence support from an EST, which either means there are problems with
>the EST/cDNA evidence or it's real. For those, even if they are real you
>can just trim them off. If it's part of the CDS, then investigate
>whether it is suggested by EST or protein evidence, or if the ab initio
>predictor called it (sometimes the ab initio predictor calls things to
>force an ORF to work). This can sometimes be indicative of assembly
>issues in that region.
>
>--Carson
>
>
>On 4/18/14, 7:14 AM, "UMD Bioinformatics"
>wrote:
>
>>Hello,
>>
>>We are preparing two submissions for NCBI, nightmare. However some of
>>our MAKER gene models have short introns that are being flagged by
>>NCBI. In one species we have >400 introns smaller than 20bp which is
>>almost biologically impossible. I know we can set max intron length in
>>the opts.ctl file but can we set a minimum intron length?
>>
>>I saw yesterday's posts that mention this is a result of the external ab
>>initio predictors but I didn't see an indication as to which predictor
>>and how to change that setting.
>> >>from yesterday: >>*These are just short introns (intron size is under control of the ab >>initio >>predictors) --> 438 ERROR: SEQ_FEAT.ShortIntron >> >>Cheers >>Ian >> >> >>_______________________________________________ >>maker-devel mailing list >>maker-devel at box290.bluehost.com >>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > > >This electronic message contains information generated by the USDA solely >for the intended recipients. Any unauthorized interception of this >message or the use or disclosure of the information it contains may >violate the law and subject the violator to civil or criminal penalties. >If you believe you have received this message in error, please notify the >sender and delete the email immediately. From sujaikumar at gmail.com Wed Jun 4 07:26:09 2014 From: sujaikumar at gmail.com (Sujai) Date: Wed, 4 Jun 2014 13:26:09 +0100 Subject: [maker-devel] Augustus compilation Message-ID: Hi all I've installed older versions of Maker (up to 2.28) before successfully. I was trying to install maker 2.31.6 on a new cluster and decided to use the built in installers for the dependencies. Unfortunately ./Build augustuc gives this error: Unpacking augustus tarball... Configuring augustus... 
g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o genbank.o genbank.cc -I../include g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o properties.o properties.cc -I../include properties.cc: In static member function 'static void Properties::init(int, char**)': properties.cc:349:25: error: 'boost::filesystem::path' has no member named 'native' configPath = cpath.native(); ^ properties.cc: In function 'boost::filesystem::path findLocationOfSelfBinary()': properties.cc:615:10: error: 'read_symlink' is not a member of 'boost::filesystem' bpath = boost::filesystem::read_symlink(bpath); ^ make: *** [properties.o] Error 1 ERROR: Failed installing augustus, now cleaning installation path... You may need to install augustus manually. ---- Would anyone have any suggestions for how to fix this? I've tried editing the ../exe/augustus-3.0.2/src/Makefile line: LIBS = -lboost_iostreams -lboost_system -lboost_filesystem to add the path to my system boost lib: LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib -lboost_iostreams -lboost_system -lboost_filesystem and then running make from inside ../exe/augustus-3.0.2/src but I get the same error again From mike.thon at gmail.com Wed Jun 4 08:31:30 2014 From: mike.thon at gmail.com (Michael Thon) Date: Wed, 4 Jun 2014 15:31:30 +0200 Subject: [maker-devel] Augustus compilation In-Reply-To: References: Message-ID: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com> Hi - Yes it the latest version of augustus needs the boost library. If you?re on linux you should be able to install it through the package manager. its called libboost-dev or some such thing. -Mike On Jun 4, 2014, at 2:26 PM, Sujai wrote: > Hi all > > I've installed older versions of Maker (up to 2.28) before successfully. > > I was trying to install maker 2.31.6 on a new cluster and decided to > use the built in installers for the dependencies. > > Unfortunately > > ./Build augustuc > > gives this error: > > Unpacking augustus tarball... 
> Configuring augustus... > g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o genbank.o > genbank.cc -I../include > g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o properties.o > properties.cc -I../include > properties.cc: In static member function 'static void > Properties::init(int, char**)': > properties.cc:349:25: error: 'boost::filesystem::path' has no member > named 'native' > configPath = cpath.native(); > ^ > properties.cc: In function 'boost::filesystem::path findLocationOfSelfBinary()': > properties.cc:615:10: error: 'read_symlink' is not a member of > 'boost::filesystem' > bpath = boost::filesystem::read_symlink(bpath); > ^ > make: *** [properties.o] Error 1 > > ERROR: Failed installing augustus, now cleaning installation path... > You may need to install augustus manually. > > ---- > > Would anyone have any suggestions for how to fix this? I've tried > editing the ../exe/augustus-3.0.2/src/Makefile line: > > LIBS = -lboost_iostreams -lboost_system -lboost_filesystem > > to add the path to my system boost lib: > > LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib > -lboost_iostreams -lboost_system -lboost_filesystem > > and then running make from inside ../exe/augustus-3.0.2/src but I get > the same error again > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From sujaikumar at gmail.com Wed Jun 4 08:34:50 2014 From: sujaikumar at gmail.com (Sujai) Date: Wed, 4 Jun 2014 14:34:50 +0100 Subject: [maker-devel] Augustus compilation In-Reply-To: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com> References: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com> Message-ID: Hi Mike Thanks for the super prompt response. I am on a cluster where I can't install libboost-dev. 
However, boost is on the cluster (as I wrote, it is compiled in the /system/software/linux-x86_64/lib/boost/1_55_0/lib directory) so is my modification to the Makefile below correct, or is there something else I need to do? LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib -lboost_iostreams -lboost_system -lboost_filesystem Cheers, - Sujai On 4 June 2014 14:31, Michael Thon wrote: > Hi - Yes it the latest version of augustus needs the boost library. If you're on linux you should be able to install it through the package manager. its called libboost-dev or some such thing. > > -Mike > > On Jun 4, 2014, at 2:26 PM, Sujai wrote: > >> Hi all >> >> I've installed older versions of Maker (up to 2.28) before successfully. >> >> I was trying to install maker 2.31.6 on a new cluster and decided to >> use the built in installers for the dependencies. >> >> Unfortunately >> >> ./Build augustuc >> >> gives this error: >> >> Unpacking augustus tarball... >> Configuring augustus... >> g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o genbank.o >> genbank.cc -I../include >> g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o properties.o >> properties.cc -I../include >> properties.cc: In static member function 'static void >> Properties::init(int, char**)': >> properties.cc:349:25: error: 'boost::filesystem::path' has no member >> named 'native' >> configPath = cpath.native(); >> ^ >> properties.cc: In function 'boost::filesystem::path findLocationOfSelfBinary()': >> properties.cc:615:10: error: 'read_symlink' is not a member of >> 'boost::filesystem' >> bpath = boost::filesystem::read_symlink(bpath); >> ^ >> make: *** [properties.o] Error 1 >> >> ERROR: Failed installing augustus, now cleaning installation path... >> You may need to install augustus manually. >> >> ---- >> >> Would anyone have any suggestions for how to fix this? 
I've tried
>> editing the ../exe/augustus-3.0.2/src/Makefile line:
>>
>> LIBS = -lboost_iostreams -lboost_system -lboost_filesystem
>>
>> to add the path to my system boost lib:
>>
>> LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib
>> -lboost_iostreams -lboost_system -lboost_filesystem
>>
>> and then running make from inside ../exe/augustus-3.0.2/src but I get
>> the same error again

From daniel.standage at gmail.com Wed Jun 4 14:03:27 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Wed, 4 Jun 2014 15:03:27 -0400
Subject: [maker-devel] Filtering of ab initio gene models
Message-ID:

Thanks everyone for your responses recently!

The reason for my recent flurry of email activity is that I'm seeing
some unexpected trends when running the new version of Maker with
precomputed alignments. Compared with an annotation I did a while ago
(Maker 2.10, Maker-computed alignments), this new annotation has a
substantial number of new genes annotated. If I compare distributions of
AED scores between the old and new annotation, it's clear that the new
annotation has a lot more low-quality models. If I look at new gene
models that do not overlap with any gene model from the old annotation,
the likelihood that it's a low-quality model is much higher.

I decided to run a little experiment. I annotated a scaffold first using
Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same
pre-computed transcript and protein alignments and the same (latest)
version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted
44 genes while Maker 2.31.3 predicted 63. If we group gene models into
loci by overlap, there are 33 loci with gene models from both 2.10 and
2.31.3, 1 locus with only models from 2.10, and 28 loci with only models
from 2.31.3.
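The "group gene models into loci by overlap" comparison described above
can be sketched roughly as follows. This is only one plausible reading of
that step, not the script actually used for the experiment: it assumes
that models on the same sequence are chained into one locus whenever
their coordinates overlap, that strand is ignored, and that each model is
a simple (seqid, start, end, label) tuple. The function names and the
example coordinates are hypothetical.

```python
from collections import Counter


def group_into_loci(models):
    """Cluster (seqid, start, end, label) gene models into loci.

    Models on the same sequence whose intervals overlap, directly or
    through a chain of overlapping models, land in the same locus.
    """
    loci = []
    current, current_seq, current_end = [], None, None
    # Sorting by seqid, then start, lets one pass find all clusters.
    for seqid, start, end, label in sorted(models):
        if current and seqid == current_seq and start <= current_end:
            current.append((seqid, start, end, label))
            current_end = max(current_end, end)
        else:
            if current:
                loci.append(current)
            current = [(seqid, start, end, label)]
            current_seq, current_end = seqid, end
    if current:
        loci.append(current)
    return loci


def tally_by_source(loci):
    """Count loci by the set of annotation labels present in each."""
    return Counter("+".join(sorted({m[3] for m in locus}))
                   for locus in loci)


if __name__ == "__main__":
    # Made-up coordinates; labels stand for the two MAKER versions.
    models = [
        ("scaf1", 100, 500, "2.10"),
        ("scaf1", 450, 900, "2.31.3"),
        ("scaf1", 1000, 1200, "2.31.3"),
        ("scaf1", 2000, 2500, "2.10"),
        ("scaf1", 2400, 2600, "2.31.3"),
        ("scaf1", 3000, 3100, "2.10"),
    ]
    for key, count in sorted(tally_by_source(group_into_loci(models)).items()):
        print(key, count)
```

Counts of the "both", "old only", and "new only" kind, as quoted in the
message, would fall out of tally_by_source under these assumptions.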
Before this experiment, I assumed the issue was related to providing pre-computed alignments in GFF3 format and perhaps violating some important assumption. However, this experiment makes me wonder whether there have been changes to how Maker filters *ab initio* gene models between version 2.10 and version 2.31.3. Do you have any ideas? If it would help, I could put together a small data set that reproduces the behavior I just described.

Thanks!

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

From carsonhh at gmail.com Wed Jun 4 14:09:29 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 04 Jun 2014 13:09:29 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Sure, that would be helpful. One question: do you provide the Gap attribute in your precomputed alignments? Having or not having that attribute affects the eAED score, which takes reading frame into account, and may cause some things to be kept that normally would be dropped, because MAKER won't be able to take the points of mismatch of the alignment into account (it just assumes match everywhere).

--Carson

From: Daniel Standage
Date: Wednesday, June 4, 2014 at 1:03 PM
To: Maker Mailing List
Subject: [maker-devel] Filtering of ab initio gene models

Thanks everyone for your responses recently!

The reason for my recent flurry of email activity is that I'm seeing some unexpected trends when running the new version of Maker with precomputed alignments. Compared with an annotation I did a while ago (Maker 2.10, Maker-computed alignments), this new annotation has a substantial number of new genes annotated. If I compare distributions of AED scores between the old and new annotation, it's clear that the new annotation has a lot more low-quality models.
If I look at new gene models that do not overlap with any gene model from the old annotation, the likelihood that it's a low-quality model is much higher.

I decided to run a little experiment. I annotated a scaffold first using Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same pre-computed transcript and protein alignments and the same (latest) version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 locus with only models from 2.10, and 28 loci with only models from 2.31.3.

Before this experiment, I assumed the issue was related to providing pre-computed alignments in GFF3 format and perhaps violating some important assumption. However, this experiment makes me wonder whether there have been changes to how Maker filters ab initio gene models between version 2.10 and version 2.31.3. Do you have any ideas? If it would help, I could put together a small data set that reproduces the behavior I just described.

Thanks!

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From daniel.standage at gmail.com Wed Jun 4 14:11:44 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Wed, 4 Jun 2014 15:11:44 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

I do not provide Gap or Target attributes in the GFF3. Will this affect the AED as well, or just the eAED?

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote:
> Sure. that would be helpful.
One question. Do you provide the Gap > attribute in your precomputed alignments? Having or not having that > attribute affects the eAED score which takes reading frame into account, > and may cause some things to be kept that normally would be dropped, > because MAKER won't be able to take the points of mismatch of the alignment > into account (it just assumes match everywhere). > > --Carson > > > From: Daniel Standage > Date: Wednesday, June 4, 2014 at 1:03 PM > To: Maker Mailing List > Subject: [maker-devel] Filtering of ab initio gene models > > Thanks everyone for your responses recently! > > The reason for my recent flurry of email activity is that I'm seeing some > unexpected trends when running the new version of Maker with precomputed > alignments. Compared with an annotation I did a while ago (Maker 2.10, > Maker-computed alignments), this new annotation has a substantial number of > new genes annotated. If I compare distributions of AED scores between the > old and new annotation, it's clear that the new annotation has a lot more > low-quality models. If I look at new gene models that do not overlap with > any gene model from the old annotation, the likelihood that it's a > low-quality model is much higher. > > I decided to run a little experiment. I annotated a scaffold first using > Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same > pre-computed transcript and protein alignments and the same (latest) > version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted > 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci > by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 > locus with only models from 2.10, and 28 loci with only models from 2.31.3. > > Before this experiment, I assumed the issue was related to providing > pre-computed alignments in GFF3 format and perhaps violating some important > assumption. 
However, this experiment makes me wonder whether there have
> been changes to how Maker filters *ab initio* gene models between version
> 2.10 and version 2.31.3? Do you have any ideas? If it would help, I could
> put together a small data set that reproduces the behavior I just described.
>
> Thanks!
>
> --
> Daniel S. Standage
> Ph.D. Candidate
> Computational Genome Science Laboratory
> Indiana University
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

From carsonhh at gmail.com Wed Jun 4 14:17:34 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 04 Jun 2014 13:17:34 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Just eAED, but eAED can affect selection of ab initio results. For example, it covers reading frame match of protein evidence, which also affects whether evidence from single_exon=1 and genes with single-exon protein evidence get kept. There is also the assumption that your alignments in GFF3 are correctly spliced (like BLAT does). So giving blastn results as precomputed est_gff would create a lot of noise, since MAKER normally ignores blastn hits as evidence and uses them only to seed the polished exonerate alignments.

--Carson

From: Daniel Standage
Date: Wednesday, June 4, 2014 at 1:11 PM
To: Carson Holt
Cc: Maker Mailing List
Subject: Re: [maker-devel] Filtering of ab initio gene models

I do not provide Gap or Target attributes in the GFF3. Will this affect the AED as well, or just the eAED?

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote:
> Sure. that would be helpful. One question. Do you provide the Gap attribute
> in your precomputed alignments?
Having or not having that attribute affects > the eAED score which takes reading frame into account, and may cause some > things to be kept that normally would be dropped, because MAKER won't be able > to take the points of mismatch of the alignment into account (it just assumes > match everywhere). > > --Carson > > > From: Daniel Standage > Date: Wednesday, June 4, 2014 at 1:03 PM > To: Maker Mailing List > Subject: [maker-devel] Filtering of ab initio gene models > > Thanks everyone for your responses recently! > > The reason for my recent flurry of email activity is that I'm seeing some > unexpected trends when running the new version of Maker with precomputed > alignments. Compared with an annotation I did a while ago (Maker 2.10, > Maker-computed alignments), this new annotation has a substantial number of > new genes annotated. If I compare distributions of AED scores between the old > and new annotation, it's clear that the new annotation has a lot more > low-quality models. If I look at new gene models that do not overlap with any > gene model from the old annotation, the likelihood that it's a low-quality > model is much higher. > > I decided to run a little experiment. I annotated a scaffold first using Maker > 2.10 and then using Maker 2.31.3. I both cases, I used the same pre-computed > transcript and protein alignments and the same (latest) version of SNAP as the > only ab initio predictor. Maker 2.10 predicted 44 genes while Maker 2.31.3 > predicted 63. If we group gene models into loci by overlap, there are 33 loci > with gene models from both 2.10 and 2.31.3, 1 locus with only models from > 2.10, and 28 loci with only models from 2.31.3. > > Before this experiment, I assumed the issue was related to providing > pre-computed alignments in GFF3 format and perhaps violating some important > assumption. 
However, this experiment makes me wonder whether there have been
> changes to how Maker filters ab initio gene models between version 2.10 and
> version 2.31.3? Do you have any ideas? If it would help, I could put together
> a small data set that reproduces the behavior I just described.
>
> Thanks!
>
> --
> Daniel S. Standage
> Ph.D. Candidate
> Computational Genome Science Laboratory
> Indiana University
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From ranjani at uga.edu Thu Jun 5 10:49:36 2014
From: ranjani at uga.edu (Sivaranjani Namasivayam)
Date: Thu, 5 Jun 2014 15:49:36 +0000
Subject: [maker-devel] protein2genome gene models from protein gff
Message-ID: <1401983375868.65464@uga.edu>

Hi,

I am trying to predict gene models from protein evidence, using the parameter protein2genome set to 1. I get gene models predicted if I provide the proteins as a fasta file, but not as gff (I want to use the gff format to avoid running the blastx step again). Is this expected? In the case of transcriptome evidence with est2genome set to 1, I get gene models predicted with both fasta and gff formats.

Thanks,
Ranjani

From Brian.Mack at ARS.USDA.GOV Thu Jun 5 12:56:04 2014
From: Brian.Mack at ARS.USDA.GOV (Mack, Brian)
Date: Thu, 5 Jun 2014 17:56:04 +0000
Subject: [maker-devel] missing start and stop codons
Message-ID:

I've been looking at the start and stop codons of the maker predicted transcripts after I stripped off the UTR with the fasta_tool script, and 7% of the transcripts do not have ATG as a start codon. Also, 2% do not have TAA, TAG, or TGA as a stop codon. Could these be real start and stop codons and just rare variants, or should I consider these incomplete genes?
If I was to turn on the "always_complete" option in Maker, what would that actually do?

Thanks,
Brian

This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.

From carsonhh at gmail.com Thu Jun 5 13:01:24 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 12:01:24 -0600
Subject: [maker-devel] missing start and stop codons
Message-ID:

They are incomplete genes. There are many reasons why this happens in new assemblies. You can turn always_complete on to try to force a complete model, but what is added or subtracted to get a start and stop codon may not be biologically correct. It's just forced canonical.

Also make sure to use the latest MAKER version. Versions 2.29 and earlier didn't correct for the BioPerl codon table, which allows for an extra non-canonical start codon. Now MAKER exports a strict canonical table to BioPerl, so 'M' is the only start.

--Carson

From: "Mack, Brian"
Date: Thursday, June 5, 2014 at 11:56 AM
To: "maker-devel at yandell-lab.org"
Subject: [maker-devel] missing start and stop codons

I've been looking at the start and stop codons of the maker predicted transcripts after I stripped off the utr with the fasta_tool script and 7% of the transcripts do not have ATG as a start codon. Also 2% do not have TAA, TAG, or TGA as a stop codon. Could these be real start and stop codons and just rare variants, or should I consider these incomplete genes? If I was to turn on the "always_complete" option in Maker what would that actually do?
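
The start/stop check described in this question can be sketched as below. It is only a rough illustration under stated assumptions: the function names are hypothetical, the input is assumed to be CDS sequences with UTRs already removed (for example the sequences obtained after stripping UTRs), and only the canonical ATG start and the TAA/TAG/TGA stops from the message are tested.

```python
# Canonical stop codons, as listed in the message above.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_status(cds):
    """Report whether a CDS has a canonical start, a canonical stop,
    and a length that is a multiple of three."""
    seq = cds.upper()
    return {
        "has_start": seq.startswith("ATG"),
        "has_stop": len(seq) >= 3 and seq[-3:] in STOP_CODONS,
        "in_frame": len(seq) % 3 == 0,
    }


def summarize(cds_by_id):
    """Return the IDs of sequences lacking a canonical start or stop.

    cds_by_id is a hypothetical dict mapping transcript ID to CDS string.
    """
    missing_start = sorted(i for i, s in cds_by_id.items()
                           if not codon_status(s)["has_start"])
    missing_stop = sorted(i for i, s in cds_by_id.items()
                          if not codon_status(s)["has_stop"])
    return missing_start, missing_stop
```

Transcripts reported by summarize() are candidates for being incomplete models; the check says nothing about why the start or stop is missing.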
Thanks,
Brian
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Thu Jun 5 13:08:20 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 12:08:20 -0600
Subject: [maker-devel] protein2genome gene models from protein gff
Message-ID:

est_gff assumes the alignments are spliced correctly. The protein2genome option also makes that assumption, but with a little less confidence that the user always provides splice-aware alignments, so in some instances (like protein2genome=1) it may not pass them forward as guaranteed splice-aware alignments.

--Carson

From: Sivaranjani Namasivayam
Date: Thursday, June 5, 2014 at 9:49 AM
To: "maker-devel at yandell-lab.org"
Subject: [maker-devel] protein2genome gene models from protein gff

Hi,

I am trying to predict gene models from protein evidence, using the parameter protein2genome set to 1. I get gene models predicted if I provide the proteins as a fasta file, but not as gff (I want to use a gff format to avoid the blastx step again). Is this expected? In case of transcriptome evidence and est2genome set to 1, I get gene models predicted with both fasta and gff formats.
Thanks,
Ranjani
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Thu Jun 5 13:24:03 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 12:24:03 -0600
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To: <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

Like I said, the predictors do the best they can, so there is probably something about these regions that requires exons to be dropped or added to make the CDS, reading frame, or start/stop work. In several ant genomes we saw something like this caused by incorrect homopolymers in the assembly, which forced the predictor to slightly alter the intron/exon structure because otherwise the reading frame made no sense (the EST alignments were used to confirm that the assembly homopolymers were incorrect - lots of bad single base pair deletions).

The way hints work is as follows. At the simplest level, ab initio predictors are calculating the probability of being in different states (intergenic, intron, exon, etc.). The hints increase the probability of being in the intron state where MAKER gives an intron hint, or of being in an exon/CDS state where MAKER gives an exon/CDS hint. So this bends the likelihood of the ab initio gene predictor to call something similar in structure to the evidence overlapping it. That being said, if there is strong enough signal from something else in the sequence, or my hints won't work with the splice sites in the region, or the reading frame breaks, then no amount of hints can force augustus to make that model.

--Carson

On 6/5/14, 2:15 AM, "Marc Höppner" wrote:

>Hi,
>
>thanks for the feedback.
I spent some more time on this and am still >somewhat unsatisfied with the whole thing? > >A few comments: > >I quite frequently see augustus and in extension Maker including exons >that are not supported by EST/Protein evidence and are not critical for >the gene model (i.e. I can take them out and still get a proper CDS). >Maybe I don?t know enough about how Maker creates hints and more >importantly what role these hints play for augustus, but I cannot really >see a great effect (any, really) on the gene models even if both ESTs and >proteins contradict an augustus gene model and the surplus exon is >non-essential. > >(all evidence is provided as fasta files, protein2genome and est2genome >are set to 0) > >As for the repeat library, I suppose this is a critical point. I am using >repeats from a closely related species via Repeatmasker, modelled and >filtered repeats from RepeatModeler and repeats derived from a >high-coverage 454 data set. Not sure what else I can do to improve that. > >As for evidence, I am using the curated reference proteome from a related >species (<5 Mio years), unprot_swissprot and high-quality ESTs + 454 >reads. I don?t think it gets a whole lot better, in terms of what data >can be used. > >So in summary, I just don?t get where I want to using Augustus and Maker >- specifically, the gene models are full of weird, unsupported artefacts >despite manually curating > 850 models for training. I suppose I was >hoping for some secret trick to improve on this - but I guess there is >none? Actually, if I only do a pure evidence build (seeing that my input >data is very high quality), it looks better - which sort of goes against >the premise of Maker :/ > >Regards, > >Marc > > > > >Marc P. 
Hoeppner, PhD >Team Leader >Department for Medical Biochemistry and Microbiology >Uppsala University, Sweden >marc.hoeppner at bils.se > >On 27 May 2014, at 17:25, Carson Holt wrote: > >> Extra exons can be required for predictors to make sense of a region >>(they >> do the best they can). This can be due to imperfect assemblies or >> repeats. For plants the repeat database is the the one thing that will >> most affect the annotation quality. You may need to spend some time >> building the best repeat library you can. The repeat library is the >>next >> most important thing next to training the predictor, because they >>confuse >> the predictor (sometimes a lot) causing it to behave oddly in those >> regions (because repeats do encode real protein and protein fragments). >> Also when running now with MAKER make sure to include the entire >>proteome >> of a related species and not just UniProt, and you will get better >> performance. Now that you have Augustus trained, using it inside of >>MAKER >> with an improved repeat library and additional protein evidence should >> give it the feedback that will allow it to perform better than it would >> with just naked ab initio prediction. >> >> Thanks, >> Carson >> >> >> On 5/27/14, 2:12 AM, "Marc H?ppner" wrote: >> >>> Hi, >>> >>> I wanted to get some feedback regarding the training of ab-initio gene >>> finders - it?s not strictly Maker related, but I suppose there are many >>> people on this list that have encountered and solved this issue in one >>> way or another. >>> >>> Specifically, I am trying to train Augustus (and possibly SNAP) for a >>> plant genome. This has always been a very frustrating process for me, >>>but >>> while I have a better idea now how to do it, I still don?t get the sort >>> of accuracy that I am hoping for. 
A quick run-through of my process; >>> >>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>> Sanger-sequenced EST data >>> >>> Filtered for Models with an AED <= 0.3 >>> >>> Loaded that into WebApollo, together with an existing reference >>> annotation and the evidence tracks >>> >>> Manually curated/selected 750 gene models using the following rules: >>> - Must have start/stop codon >>> - Most have as many exons as possible >>> - Must agree with evidence >>> - Must be >= 2kb part from other gene models (provided as flanking >>> regions for augustus to train intergenic sequence) >>> >>> From these models, I created a GBK file, split it into 650 (train) and >>> 100 (test) models and created a new profile using the documented >>> procedure. >>> >>> But: >>> >>> While the naked ab-init models created through maker get a lot of genes >>> ?sort of right?, I still see too many issues to be really satisfied. >>> Problems include: >>> >>> - random exon calls which are not supported by any line of evidence (~1 >>> per gene model, I would guess) >>> - poor congruency with some gene models (especially ones not used for >>> training/testing) >>> >>> Is there any best-practice guide on how to improve this? The Augustus >>> website is unfortunately quite poor on detail? My impression so far is >>> that ramping up the number of training models isn?t really doing too >>>much >>> beyond a certain point (tried 400, 500 and 750). >>> >>> Regards, >>> >>> Marc >>> >>> >>> Marc P. 
Hoeppner, PhD
>>> Team Leader
>>> BILS Genome Annotation Platform
>>> Department for Medical Biochemistry and Microbiology
>>> Uppsala University, Sweden
>>> marc.hoeppner at bils.se
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

From carsonhh at gmail.com Thu Jun 5 13:28:55 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 12:28:55 -0600
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To:
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

One thing you might want to try is adding another predictor like SNAP together with Augustus and then processing the MAKER results using EVM. We actually have a collaboration with the EVM group to produce a MAKER-EVM pipeline (MAKER 3.0). EVM will produce consensus models using the predictions and the evidence in the MAKER GFF3, which are generally better than just SNAP and Augustus with hints, so it might be able to remove some of the artifacts you are worried about.

--Carson

On 6/5/14, 12:24 PM, "Carson Holt" wrote:

>Like I said. The predictors do the best they can, so there is probably
>something about the regions to make the CDS, reading frame, or start/stop
>work that requires exons to be dropped or added. In several ant genomes
>we saw something like this caused by incorrect homopolymers in the
>assembly which force the predictor to slightly alter the intron/exon
>structure because otherwise the reading frame made no sense (the EST
>alignments were used to confirm that the assembly homopolymers were
>incorrect - lots of bad single base pair deletions).
> >The way hints work is as follows. At the simplest level ab initio >predictors are calculating the probability of being in different states >(intergenic, intron, exon, etc.). The hints increase the probability of >being in the intron state where MAKER gives an intron hint or being in an >exon/CDS state when MAKER gives an exon/CDS hint. So this bends the >likelihood of the ab intio gene predictor to call something similar in >structure to the evidence overlapping it. That being said, if there is >strong enough signal from something else in the sequence or my hints won't >work with the splice sites in the region or the reading frame breaks, then >no amount of hints can force augustus to make that model. > >--Carson > > > >On 6/5/14, 2:15 AM, "Marc H?ppner" wrote: > >>Hi, >> >>thanks for the feedback. I spent some more time on this and am still >>somewhat unsatisfied with the whole thing? >> >>A few comments: >> >>I quite frequently see augustus and in extension Maker including exons >>that are not supported by EST/Protein evidence and are not critical for >>the gene model (i.e. I can take them out and still get a proper CDS). >>Maybe I don?t know enough about how Maker creates hints and more >>importantly what role these hints play for augustus, but I cannot really >>see a great effect (any, really) on the gene models even if both ESTs and >>proteins contradict an augustus gene model and the surplus exon is >>non-essential. >> >>(all evidence is provided as fasta files, protein2genome and est2genome >>are set to 0) >> >>As for the repeat library, I suppose this is a critical point. I am using >>repeats from a closely related species via Repeatmasker, modelled and >>filtered repeats from RepeatModeler and repeats derived from a >>high-coverage 454 data set. Not sure what else I can do to improve that. >> >>As for evidence, I am using the curated reference proteome from a related >>species (<5 Mio years), unprot_swissprot and high-quality ESTs + 454 >>reads. 
I don?t think it gets a whole lot better, in terms of what data >>can be used. >> >>So in summary, I just don?t get where I want to using Augustus and Maker >>- specifically, the gene models are full of weird, unsupported artefacts >>despite manually curating > 850 models for training. I suppose I was >>hoping for some secret trick to improve on this - but I guess there is >>none? Actually, if I only do a pure evidence build (seeing that my input >>data is very high quality), it looks better - which sort of goes against >>the premise of Maker :/ >> >>Regards, >> >>Marc >> >> >> >> >>Marc P. Hoeppner, PhD >>Team Leader >>Department for Medical Biochemistry and Microbiology >>Uppsala University, Sweden >>marc.hoeppner at bils.se >> >>On 27 May 2014, at 17:25, Carson Holt wrote: >> >>> Extra exons can be required for predictors to make sense of a region >>>(they >>> do the best they can). This can be due to imperfect assemblies or >>> repeats. For plants the repeat database is the the one thing that will >>> most affect the annotation quality. You may need to spend some time >>> building the best repeat library you can. The repeat library is the >>>next >>> most important thing next to training the predictor, because they >>>confuse >>> the predictor (sometimes a lot) causing it to behave oddly in those >>> regions (because repeats do encode real protein and protein fragments). >>> Also when running now with MAKER make sure to include the entire >>>proteome >>> of a related species and not just UniProt, and you will get better >>> performance. Now that you have Augustus trained, using it inside of >>>MAKER >>> with an improved repeat library and additional protein evidence should >>> give it the feedback that will allow it to perform better than it would >>> with just naked ab initio prediction. 
>>> >>> Thanks, >>> Carson >>> >>> >>> On 5/27/14, 2:12 AM, "Marc H?ppner" wrote: >>> >>>> Hi, >>>> >>>> I wanted to get some feedback regarding the training of ab-initio gene >>>> finders - it?s not strictly Maker related, but I suppose there are >>>>many >>>> people on this list that have encountered and solved this issue in one >>>> way or another. >>>> >>>> Specifically, I am trying to train Augustus (and possibly SNAP) for a >>>> plant genome. This has always been a very frustrating process for me, >>>>but >>>> while I have a better idea now how to do it, I still don?t get the >>>>sort >>>> of accuracy that I am hoping for. A quick run-through of my process; >>>> >>>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>>> Sanger-sequenced EST data >>>> >>>> Filtered for Models with an AED <= 0.3 >>>> >>>> Loaded that into WebApollo, together with an existing reference >>>> annotation and the evidence tracks >>>> >>>> Manually curated/selected 750 gene models using the following rules: >>>> - Must have start/stop codon >>>> - Most have as many exons as possible >>>> - Must agree with evidence >>>> - Must be >= 2kb part from other gene models (provided as flanking >>>> regions for augustus to train intergenic sequence) >>>> >>>> From these models, I created a GBK file, split it into 650 (train) >>>>and >>>> 100 (test) models and created a new profile using the documented >>>> procedure. >>>> >>>> But: >>>> >>>> While the naked ab-init models created through maker get a lot of >>>>genes >>>> ?sort of right?, I still see too many issues to be really satisfied. >>>> Problems include: >>>> >>>> - random exon calls which are not supported by any line of evidence >>>>(~1 >>>> per gene model, I would guess) >>>> - poor congruency with some gene models (especially ones not used for >>>> training/testing) >>>> >>>> Is there any best-practice guide on how to improve this? The Augustus >>>> website is unfortunately quite poor on detail? 
My impression so far is
>>>> that ramping up the number of training models isn't really doing too much
>>>> beyond a certain point (tried 400, 500 and 750).
>>>>
>>>> Regards,
>>>>
>>>> Marc
>>>>
>>>>
>>>> Marc P. Hoeppner, PhD
>>>> Team Leader
>>>> BILS Genome Annotation Platform
>>>> Department for Medical Biochemistry and Microbiology
>>>> Uppsala University, Sweden
>>>> marc.hoeppner at bils.se
>>>>
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>

From marc.hoeppner at bils.se Thu Jun 5 03:15:55 2014
From: marc.hoeppner at bils.se (Marc Höppner)
Date: Thu, 5 Jun 2014 10:15:55 +0200
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To:
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se>
Message-ID: <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>

Hi,

thanks for the feedback. I spent some more time on this and am still somewhat unsatisfied with the whole thing...

A few comments:

I quite frequently see augustus, and by extension Maker, including exons that are not supported by EST/protein evidence and are not critical for the gene model (i.e. I can take them out and still get a proper CDS). Maybe I don't know enough about how Maker creates hints and, more importantly, what role these hints play for augustus, but I cannot really see a great effect (any, really) on the gene models even if both ESTs and proteins contradict an augustus gene model and the surplus exon is non-essential.

(all evidence is provided as fasta files, protein2genome and est2genome are set to 0)

As for the repeat library, I suppose this is a critical point.
I am using repeats from a closely related species via RepeatMasker, modelled and filtered repeats from RepeatModeler and repeats derived from a high-coverage 454 data set. Not sure what else I can do to improve that.

As for evidence, I am using the curated reference proteome from a related species (<5 million years), uniprot_swissprot and high-quality ESTs + 454 reads. I don't think it gets a whole lot better, in terms of what data can be used.

So in summary, I just don't get where I want to using Augustus and Maker - specifically, the gene models are full of weird, unsupported artefacts despite manually curating > 850 models for training. I suppose I was hoping for some secret trick to improve on this - but I guess there is none? Actually, if I only do a pure evidence build (seeing that my input data is very high quality), it looks better - which sort of goes against the premise of Maker :/

Regards,

Marc

Marc P. Hoeppner, PhD
Team Leader
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at bils.se

On 27 May 2014, at 17:25, Carson Holt wrote:

> Extra exons can be required for predictors to make sense of a region (they
> do the best they can). This can be due to imperfect assemblies or
> repeats. For plants the repeat database is the one thing that will
> most affect the annotation quality. You may need to spend some time
> building the best repeat library you can. The repeat library is the next
> most important thing next to training the predictor, because they confuse
> the predictor (sometimes a lot) causing it to behave oddly in those
> regions (because repeats do encode real protein and protein fragments).
> Also when running now with MAKER make sure to include the entire proteome
> of a related species and not just UniProt, and you will get better
> performance.
> Now that you have Augustus trained, using it inside of MAKER
> with an improved repeat library and additional protein evidence should
> give it the feedback that will allow it to perform better than it would
> with just naked ab initio prediction.
>
> Thanks,
> Carson

From fbarreto at ucsd.edu Thu Jun 5 14:01:05 2014
From: fbarreto at ucsd.edu (Felipe Barreto)
Date: Thu, 5 Jun 2014 12:01:05 -0700
Subject: [maker-devel] Generating GFF with selected tracks
Message-ID:

Hi, all,

I would like to produce a gff file that contains Maker gene models AND repeats. I know that using gff3_merge with -g will generate one with only the gene models, but I didn't see any options for adding additional tracks.

The way I did this was to use the Re-annotation section in the control file. I provided the original full gff file in maker_gff, and turned on the rm_pass and model_pass. All other options in the control file were turned off. This seemed to work, though it also added a 'model_gff:maker' track, which is not a problem for me. I compared a few new and original scaffolds in Apollo, and all seem to match perfectly.
But since I cannot check the whole genome, I was wondering if what I did was appropriate. Are all the gene models (and their names) and repeat alignments identical between the new and original files? Or is Maker potentially changing a few things since it's treated as a new run?

Thanks!

--
Felipe Barreto

From carsonhh at gmail.com Thu Jun 5 14:02:36 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 13:02:36 -0600
Subject: [maker-devel] protein2genome gene models from protein gff
In-Reply-To: <1401994595132.44761@uga.edu>
References: <1401994595132.44761@uga.edu>
Message-ID:

That's what I'd do. But really protein2genome=1 is just meant to get enough rough gene models to train a gene predictor. You don't need to run it across the whole genome. But if you do, when you run again after training the gene predictor, MAKER will detect the old BLAST jobs and they won't have to rerun on the second MAKER run.

--Carson

From: Sivaranjani Namasivayam
Date: Thursday, June 5, 2014 at 12:56 PM
To: Carson Holt
Subject: RE: [maker-devel] protein2genome gene models from protein gff

So what would you suggest is the best way to get protein2genome predictions? Use fasta sequences, instead of gff?

Thanks,
Ranjani

From: Carson Holt
Sent: Thursday, June 05, 2014 2:08 PM
To: Sivaranjani Namasivayam; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] protein2genome gene models from protein gff

est_gff assumes the alignments are spliced correctly. The protein2genome option also makes that assumption but with a little less confidence that the user always provides splice aware alignments, so in some instances (like protein2genome=1) it may not pass them forward as guaranteed splice aware alignments.
--Carson

From: Sivaranjani Namasivayam
Date: Thursday, June 5, 2014 at 9:49 AM
To: "maker-devel at yandell-lab.org"
Subject: [maker-devel] protein2genome gene models from protein gff

Hi,

I am trying to predict gene models from protein evidence, using the parameter protein2genome set to 1. I get gene models predicted if I provide the proteins as a fasta file, but not as gff (I want to use a gff format to avoid the blastx step again). Is this expected? In case of transcriptome evidence and est2genome set to 1, I get gene models predicted with both fasta and gff formats.

Thanks,
Ranjani

From carsonhh at gmail.com Thu Jun 5 14:05:30 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 13:05:30 -0600
Subject: [maker-devel] Generating GFF with selected tracks
In-Reply-To:
References:
Message-ID:

gff3_merge just merges any two GFF3 files. So if you have two files just give both of them to it.

Example --> gff3_merge maker_genes.gff repeats.gff

Also if all you are trying to do is filter out certain feature types from the file, just use grep instead.

Example --> grep -v -P "\tpred_gff\t" maker.gff

Thanks,
Carson

From dence at genetics.utah.edu Thu Jun 5 14:08:08 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Thu, 5 Jun 2014 19:08:08 +0000
Subject: [maker-devel] Generating GFF with selected tracks
In-Reply-To:
References:
Message-ID:

Hi Felipe, I seem to remember that some of the gene model names did change when I did things similar to what you described. I think that you could accomplish the same thing with some cat and grep commands on the full gff. That would avoid the trouble of rerunning maker. Something like "cat full.gff | grep -P "\trepeatrunner\t" > tmp.gff " would get you started.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From fbarreto at ucsd.edu Thu Jun 5 15:07:51 2014
From: fbarreto at ucsd.edu (Felipe Barreto)
Date: Thu, 5 Jun 2014 13:07:51 -0700
Subject: [maker-devel] Generating GFF with selected tracks
In-Reply-To:
References:
Message-ID:

OK, I see. I will just use grep to extract the desired features from the full.gff and merge them with gff3_merge. Don't know why I was making it more complicated. I guess I don't understand gff formats very well quite yet.

Thanks yet again!

--
Felipe Barreto
Post-doctoral Scholar
Scripps Institution of Oceanography
University of California, San Diego
La Jolla, CA 92093
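For anyone who would rather script the extraction step than chain grep calls, here is a minimal Python sketch of the same idea: keep only the feature lines whose source column (column 2) is in a chosen set. The function name `filter_by_source` and the sample lines are illustrative assumptions, not part of MAKER; the `maker` and `repeatrunner` source labels are the ones mentioned in this thread.

```python
def filter_by_source(lines, wanted_sources):
    """Return GFF3 feature lines whose source column (column 2) is in wanted_sources.

    Comment and pragma lines are skipped, and reading stops at the ##FASTA
    section so that embedded sequence is never scanned.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) == 9 and columns[1] in wanted_sources:
            kept.append(line)
    return kept


# Tiny made-up example; real input would be the lines of the full MAKER GFF3.
sample = [
    "##gff-version 3",
    "scaffold1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "scaffold1\trepeatrunner\tprotein_match\t1200\t1500\t.\t-\t.\tID=rep1",
    "scaffold1\tblastx\tprotein_match\t100\t400\t.\t+\t.\tID=hit1",
    "##FASTA",
    ">scaffold1",
    "ACGT",
]

selected = filter_by_source(sample, {"maker", "repeatrunner"})
print("##gff-version 3")
print("\n".join(selected))
```

Unlike a bare `grep -P "\tmaker\t"`, this only looks at column 2, so a matching string elsewhere on the line cannot pull in an unwanted feature.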
From daniel.standage at gmail.com Fri Jun 6 11:33:06 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 12:33:06 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Another question: is there documentation anywhere for the naming conventions of the genes annotated by Maker? Of course it's easy to spot genes based on a particular *ab initio* gene predictor, as the names are in the IDs. But what is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs "maker-$seqid-snap-gene"?

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage wrote:

> I have attached data for a small 18kb region with a handful of genes, as
> well as the corresponding maker_opts.ctl file. (This is a smaller and
> different data set than what I was looking at yesterday, with a more
> well-defined problem).
>
> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400
> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a
> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes
> have transcript support: will Maker report overlapping genes under any
> conditions? And even if Maker is forced to choose only a single gene to
> report, why would the model from 4125 to 6400 ever be reported in place of
> the one from 6111 to 8345, especially since this is provided in the
> model_gff file?
>
> Even when transcript TSA024184 is included, Maker 2.10 reports the
> high-confidence gene from 6111 to 8345.
>
> Any light you could shed would be helpful. Thanks!
>
>
> --
> Daniel S. Standage
> Ph.D. Candidate
> Computational Genome Science Laboratory
> Indiana University
>
>
> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote:
>
>> Just eAED, but eAED can affect selection of ab initio results.
>> For example reading frame match of protein evidence which also affects whether
>> evidence from single_exon=1 and genes with single_exon protein evidence get
>> kept. There is also the assumption that your alignments in GFF3 are
>> correctly spliced (like BLAT does). So giving blastn results as
>> precomputed est_gff would create a lot of noise, since maker ignores blastn
>> and is using it only to seed the polished exonerate alignments.
>>
>> --Carson
>>
>>
>> From: Daniel Standage
>> Date: Wednesday, June 4, 2014 at 1:11 PM
>> To: Carson Holt
>> Cc: Maker Mailing List
>> Subject: Re: [maker-devel] Filtering of ab initio gene models
>>
>> I do not provide Gap or Target attributes in the GFF3. Will this affect
>> the AED as well, or just the eAED?
>>
>>
>> --
>> Daniel S. Standage
>> Ph.D. Candidate
>> Computational Genome Science Laboratory
>> Indiana University
>>
>>
>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote:
>>
>>> Sure. That would be helpful. One question. Do you provide the Gap
>>> attribute in your precomputed alignments? Having or not having that
>>> attribute affects the eAED score which takes reading frame into account,
>>> and may cause some things to be kept that normally would be dropped,
>>> because MAKER won't be able to take the points of mismatch of the alignment
>>> into account (it just assumes match everywhere).
>>>
>>> --Carson
>>>
>>>
>>> From: Daniel Standage
>>> Date: Wednesday, June 4, 2014 at 1:03 PM
>>> To: Maker Mailing List
>>> Subject: [maker-devel] Filtering of ab initio gene models
>>>
>>> Thanks everyone for your responses recently!
>>>
>>> The reason for my recent flurry of email activity is that I'm seeing
>>> some unexpected trends when running the new version of Maker with
>>> precomputed alignments. Compared with an annotation I did a while ago
>>> (Maker 2.10, Maker-computed alignments), this new annotation has a
>>> substantial number of new genes annotated.
>>> If I compare distributions of
>>> AED scores between the old and new annotation, it's clear that the new
>>> annotation has a lot more low-quality models. If I look at new gene models
>>> that do not overlap with any gene model from the old annotation, the
>>> likelihood that it's a low-quality model is much higher.
>>>
>>> I decided to run a little experiment. I annotated a scaffold first using
>>> Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same
>>> pre-computed transcript and protein alignments and the same (latest)
>>> version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted
>>> 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci
>>> by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1
>>> locus with only models from 2.10, and 28 loci with only models from 2.31.3.
>>>
>>> Before this experiment, I assumed the issue was related to providing
>>> pre-computed alignments in GFF3 format and perhaps violating some important
>>> assumption. However, this experiment makes me wonder whether there have
>>> been changes to how Maker filters *ab initio* gene models between
>>> version 2.10 and version 2.31.3? Do you have any ideas? If it would help, I
>>> could put together a small data set that reproduces the behavior I just
>>> described.
>>>
>>> Thanks!
>>>
>>> --
>>> Daniel S. Standage
>>> Ph.D. Candidate
>>> Computational Genome Science Laboratory
>>> Indiana University
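Since the quoted thread turns on whether precomputed alignments carry the Target and Gap attributes, a quick check can be scripted. The sketch below is a minimal, hedged example: it only reports which of the two attributes are absent from a GFF3 feature line. Target and Gap are attribute names defined by the GFF3 specification; the function names and the sample lines are made up for illustration, and nothing here reproduces how MAKER itself reads these attributes.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def missing_alignment_attributes(gff3_line):
    """Return which of Target and Gap are absent from one GFF3 feature line."""
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns")
    attributes = parse_attributes(columns[8])
    return [name for name in ("Target", "Gap") if name not in attributes]


# Made-up alignment features: one without and one with the two attributes.
bare = "scaffold1\tblat\tmatch_part\t100\t200\t.\t+\t.\tID=m1;Parent=m0"
full = "scaffold1\tblat\tmatch_part\t100\t200\t.\t+\t.\tID=m1;Parent=m0;Target=EST1 1 101 +;Gap=M101"

print(missing_alignment_attributes(bare))
print(missing_alignment_attributes(full))
```

Running this over every match_part line of an est_gff or protein_gff file would show at a glance whether the no-Gap behavior Carson describes (perfect match assumed across the region) applies to that data set.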
URL: From carsonhh at gmail.com Fri Jun 6 11:39:43 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 06 Jun 2014 10:39:43 -0600 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked sequence without hints (i.e. the ab initio call). maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER. In both cases MAKER is allowed to add UTR to the model (hence the 'processed' tag). --Carson From: Daniel Standage Date: Friday, June 6, 2014 at 10:33 AM To: Carson Holt Cc: Maker Mailing List , Volker Brendel Subject: Re: [maker-devel] Filtering of ab initio gene models Another question: is there documentation anywhere for the naming conventions of the genes annotated by Maker? Of course it's easy to spot genes based on a particular ab initio gene predictor, as the names are in the IDs. But what is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs "maker-$seqid-snap-gene"? Thanks, Daniel -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage wrote: > I have attached data for a small 18kb region with a handful of genes, as well > as the corresponding maker_opts.ctl file. (This is a smaller and different > data set than what I was looking at yesterday, with a more well-defined > problem). > > With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400 with > an AED of 0.23. If you exclude transcript TSA024184, Maker reports a different > gene from 6111 to 8345 with an AED of 0.01. Both of these genes have > transcript support: will Maker report overlapping genes under any conditions? > And even if Maker is forced to choose only a single gene to report, why would > the model from 4125 to 6400 ever be reported in place of the one from 6111 to > 8345, especially since this is provided in the model_gff file? 
> > Even when transcript TSA024184 is included, Maker 2.10 reports the > high-confidence gene from 611 to 8345. > > Any light you could shed would be helpful. Thanks! > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote: >> Just eAED, but eAED can affects selection of ab initio results. For example >> reading frame match of protein evidence which also affects whether evidence >> from single_exon=1 and genes with single_exon protein evidence get kept. >> There is also the assumption that your alignments in GFF3 are are correctly >> spliced (like BLAT does). So giving blastn results as precomputed est_gff >> would create a lot of noise, since maker ignores blastn and is using it only >> to seed the polished exonerate alignments. >> >> --Carson >> >> >> From: Daniel Standage >> Date: Wednesday, June 4, 2014 at 1:11 PM >> To: Carson Holt >> Cc: Maker Mailing List >> Subject: Re: [maker-devel] Filtering of ab initio gene models >> >> I do not provide Gap or Target attributes in the GFF3. Will this affect the >> AED as well, or just the eAED? >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: >>> Sure. that would be helpful. One question. Do you provide the Gap >>> attribute in your precomputed alignments? Having or not having that >>> attribute affects the eAED score which takes reading frame into account, and >>> may cause some things to be kept that normally would be dropped, because >>> MAKER won't be able to take the points of mismatch of the alignment into >>> account (it just assumes match everywhere). 
>>> >>> --Carson >>> >>> >>> From: Daniel Standage >>> Date: Wednesday, June 4, 2014 at 1:03 PM >>> To: Maker Mailing List >>> Subject: [maker-devel] Filtering of ab initio gene models >>> >>> Thanks everyone for your responses recently! >>> >>> The reason for my recent flurry of email activity is that I'm seeing some >>> unexpected trends when running the new version of Maker with precomputed >>> alignments. Compared with an annotation I did a while ago (Maker 2.10, >>> Maker-computed alignments), this new annotation has a substantial number of >>> new genes annotated. If I compare distributions of AED scores between the >>> old and new annotation, it's clear that the new annotation has a lot more >>> low-quality models. If I look at new gene models that do not overlap with >>> any gene model from the old annotation, the likelihood that it's a >>> low-quality model is much higher. >>> >>> I decided to run a little experiment. I annotated a scaffold first using >>> Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same >>> pre-computed transcript and protein alignments and the same (latest) version >>> of SNAP as the only ab initio predictor. Maker 2.10 predicted 44 genes while >>> Maker 2.31.3 predicted 63. If we group gene models into loci by overlap, >>> there are 33 loci with gene models from both 2.10 and 2.31.3, 1 locus with >>> only models from 2.10, and 28 loci with only models from 2.31.3. >>> >>> Before this experiment, I assumed the issue was related to providing >>> pre-computed alignments in GFF3 format and perhaps violating some important >>> assumption. However, this experiment makes me wonder whether there have been >>> changes to how Maker filters ab initio gene models between version 2.10 and >>> version 2.31.3? Do you have any ideas? If it would help, I could put >>> together a small data set that reproduces the behavior I just described. >>> >>> Thanks! >>> >>> -- >>> Daniel S. Standage >>> Ph.D. 
Candidate >>> Computational Genome Science Laboratory >>> Indiana University >>> _______________________________________________ maker-devel mailing list >>> maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo/m >>> aker-devel_yandell-lab.org >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.standage at gmail.com Fri Jun 6 11:46:41 2014 From: daniel.standage at gmail.com (Daniel Standage) Date: Fri, 6 Jun 2014 12:46:41 -0400 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: Good to know, thanks. If multiple *ab initio* predictors inform a single annotation, how does Maker decide which one will be included in the gene's ID? Given your quick response just now, I wanted to confirm that you got the message and data set I sent yesterday. I received an email saying the size of my message required list admin approval to be distributed, but since you were also a direct recipient of the email I didn't worry about it too much. Thanks again! -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote: > snap_masked-$seqid-processed-gene was produced by SNAP on the repeat > masked sequence without hints (i.e. the ab initio call). > maker-$seqid-snap-gene was produced by SNAP after receiving hints from > MAKER. > > In both cases MAKER is allowed to add UTR to the model (hence the > 'processed' tag). > > --Carson > > > From: Daniel Standage > Date: Friday, June 6, 2014 at 10:33 AM > To: Carson Holt > Cc: Maker Mailing List , Volker Brendel < > vbrendel at indiana.edu> > > Subject: Re: [maker-devel] Filtering of ab initio gene models > > Another question: is there documentation anywhere for the naming > conventions of the genes annotated by Maker? Of course it's easy to spot > genes based on a particular *ab initio* gene predictor, as the names are > in the IDs. 
But what is the significance of, say, > "snap_masked-$seqid-processed-gene" in a gene ID vs > "maker-$seqid-snap-gene"? > > Thanks, > Daniel > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage > wrote: > >> I have attached data for a small 18kb region with a handful of genes, as >> well as the corresponding maker_opts.ctl file. (This is a smaller and >> different data set than what I was looking at yesterday, with a more >> well-defined problem). >> >> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400 >> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a >> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes >> have transcript support: will Maker report overlapping genes under any >> conditions? And even if Maker is forced to choose only a single gene to >> report, why would the model from 4125 to 6400 ever be reported in place of >> the one from 6111 to 8345, especially since this is provided in the >> model_gff file? >> >> Even when transcript TSA024184 is included, Maker 2.10 reports the >> high-confidence gene from 611 to 8345. >> >> Any light you could shed would be helpful. Thanks! >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote: >> >>> Just eAED, but eAED can affects selection of ab initio results. For >>> example reading frame match of protein evidence which also affects whether >>> evidence from single_exon=1 and genes with single_exon protein evidence get >>> kept. There is also the assumption that your alignments in GFF3 are are >>> correctly spliced (like BLAT does). So giving blastn results as >>> precomputed est_gff would create a lot of noise, since maker ignores blastn >>> and is using it only to seed the polished exonerate alignments. 
>>>
>>> --Carson
>>>
>>>
>>> From: Daniel Standage
>>> Date: Wednesday, June 4, 2014 at 1:11 PM
>>> To: Carson Holt
>>> Cc: Maker Mailing List
>>> Subject: Re: [maker-devel] Filtering of ab initio gene models
>>>
>>> I do not provide Gap or Target attributes in the GFF3. Will this affect
>>> the AED as well, or just the eAED?
>>>
>>>
>>> --
>>> Daniel S. Standage
>>> Ph.D. Candidate
>>> Computational Genome Science Laboratory
>>> Indiana University
>>>
>>>
>>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote:
>>>
>>>> Sure, that would be helpful. One question. Do you provide the Gap
>>>> attribute in your precomputed alignments? Having or not having that
>>>> attribute affects the eAED score which takes reading frame into account,
>>>> and may cause some things to be kept that normally would be dropped,
>>>> because MAKER won't be able to take the points of mismatch of the alignment
>>>> into account (it just assumes match everywhere).
>>>>
>>>> --Carson
>>>>
>>>>
>>>> From: Daniel Standage
>>>> Date: Wednesday, June 4, 2014 at 1:03 PM
>>>> To: Maker Mailing List
>>>> Subject: [maker-devel] Filtering of ab initio gene models
>>>>
>>>> Thanks everyone for your responses recently!
>>>>
>>>> The reason for my recent flurry of email activity is that I'm seeing
>>>> some unexpected trends when running the new version of Maker with
>>>> precomputed alignments. Compared with an annotation I did a while ago
>>>> (Maker 2.10, Maker-computed alignments), this new annotation has a
>>>> substantial number of new genes annotated. If I compare distributions of
>>>> AED scores between the old and new annotation, it's clear that the new
>>>> annotation has a lot more low-quality models. If I look at new gene models
>>>> that do not overlap with any gene model from the old annotation, the
>>>> likelihood that it's a low-quality model is much higher.
>>>>
>>>> I decided to run a little experiment. I annotated a scaffold first
>>>> using Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same
>>>> pre-computed transcript and protein alignments and the same (latest)
>>>> version of SNAP as the only *ab initio* predictor. Maker 2.10
>>>> predicted 44 genes while Maker 2.31.3 predicted 63. If we group gene models
>>>> into loci by overlap, there are 33 loci with gene models from both 2.10 and
>>>> 2.31.3, 1 locus with only models from 2.10, and 28 loci with only models
>>>> from 2.31.3.
>>>>
>>>> Before this experiment, I assumed the issue was related to providing
>>>> pre-computed alignments in GFF3 format and perhaps violating some important
>>>> assumption. However, this experiment makes me wonder whether there have
>>>> been changes to how Maker filters *ab initio* gene models between
>>>> version 2.10 and version 2.31.3? Do you have any ideas? If it would help, I
>>>> could put together a small data set that reproduces the behavior I just
>>>> described.
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Daniel S. Standage
>>>> Ph.D. Candidate
>>>> Computational Genome Science Laboratory
>>>> Indiana University

From carsonhh at gmail.com Fri Jun 6 11:56:38 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 06 Jun 2014 10:56:38 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

I got the e-mail. Thanks for the test set.

Multiple ab initio predictors don't inform a single annotation; rather, one must be chosen from the pool of available models (i.e. it has to be SNAP, or Augustus, or GeneMark). They all supply their own ab initio as well as hint based prediction, and then the one with best evidence match (measured by AED) is kept (it's like a competition that only one predictor can win).
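As a rough illustration of the competition described above, here is a minimal sketch. It assumes that a lower AED means a closer match to the evidence; the candidate labels and scores are invented for the example, and `pick_winner` is a hypothetical helper, not part of MAKER.

```python
# Hypothetical candidates for one locus: (source, AED). Lower AED means a
# closer match to the evidence; the sources and scores are made up.
candidates = [
    ("snap ab initio", 0.35),
    ("snap hint-based", 0.12),
    ("augustus ab initio", 0.20),
    ("augustus hint-based", 0.08),
    ("genemark ab initio", 0.41),
]


def pick_winner(models):
    """Return the single (source, aed) pair with the lowest AED."""
    return min(models, key=lambda model: model[1])


winner = pick_winner(candidates)
print(f"kept: {winner[0]} (AED {winner[1]:.2f})")
```

The point of the sketch is that the winning model comes whole from one predictor; nothing is merged across predictors.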
If you want a consensus model instead, you can take MAKER results in GFF3 format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is a collaboration with the EVM group and will have this option, but for now users can just split the MAKER GFF3 by evidence types and give it to EVM. EVM then produces consensus models based on the GFF3 content.

--Carson

From daniel.standage at gmail.com Fri Jun 6 11:59:16 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 12:59:16 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

This helps, thanks.

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

From daniel.standage at gmail.com Fri Jun 6 13:38:23 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 14:38:23 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Carson et al,

Your feedback so far has been very helpful, and we are grateful for the time you have taken to respond!

We're still trying to understand the precise procedure by which competing models are chosen. You mentioned that a single model must be chosen (via AED) from a pool of available models: are these pools constructed by overlap? It is not uncommon in our experience to see overlapping genes reported by Maker, although for the most part it appears these overlapping genes don't have CDS overlap.

Looking more closely at the Maker 2.10 output from the test data we sent yesterday, we also noted that exclusion of the transcript in question also had an effect on the interval exon structure (exon 7717-7776 becomes exon 7737-7776) of a downstream model with which it overlaps 3 nucleotides.

And still unclear to us is how the model_gff data fits in with all this. From my previous searching of the list archives I was under the impression that these models would be given substantial weight in the prediction process, and would only be altered if a considerably better model could be identified. Our experience with this small data set, though, is that which overlapping gene is reported, and which corresponding exon structure is selected, is dependent on very slight changes in the evidence.

Thanks,
Daniel

--
Daniel S. Standage
Ph.D.
Candidate
Computational Genome Science Laboratory
Indiana University

From carsonhh at gmail.com Fri Jun 6 13:51:18 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 06 Jun 2014 12:51:18 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

There can be overlapping models if you have multiple gene predictors. Also the hint based models will overlap the ab initio models, but you never get to see them (they are not kept in the evidence because they are confusing and really not useful unless they are chosen as the best model). So they will overlap the ab initio models, but you may never get to see them.

All models regardless of location and overlap get sorted by their AED score. The best model is then kept from the list. Then the next, then the next. If the next best model overlaps a model that has already come off the list (which means the other model has a better AED score), then it gets skipped, and the next best one in the list gets added to the non-overlapping space.
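The sorting and skipping procedure just described can be sketched roughly as follows. This is an illustration under stated assumptions, not MAKER's actual code: the `Model` class, its field names, and the plain coordinate-overlap test are all hypothetical, and the tie-break toward model_gff entries reflects the rule stated further on in this message. The first two coordinate/AED pairs are borrowed from the test case earlier in the thread; the third model is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A candidate gene model; field names are illustrative, not MAKER's."""
    name: str
    start: int                    # 1-based inclusive start
    end: int                      # 1-based inclusive end
    aed: float                    # lower is a better evidence match
    from_model_gff: bool = False  # True if supplied via model_gff


def overlaps(a: Model, b: Model) -> bool:
    """Plain coordinate overlap (assumption: MAKER's real test may differ)."""
    return a.start <= b.end and b.start <= a.end


def select_models(candidates: list[Model]) -> list[Model]:
    """Greedy pass: best AED first, skip anything overlapping a kept model."""
    # Sort by AED ascending; on ties, model_gff entries sort first.
    ranked = sorted(candidates, key=lambda m: (m.aed, not m.from_model_gff))
    kept: list[Model] = []
    for model in ranked:
        if any(overlaps(model, k) for k in kept):
            continue  # a better-scoring model already occupies this space
        kept.append(model)
    return sorted(kept, key=lambda m: m.start)


if __name__ == "__main__":
    pool = [
        Model("model_A", 4125, 6400, 0.23),
        Model("model_B", 6111, 8345, 0.01),
        Model("model_C", 9000, 9800, 0.40),  # invented third model
    ]
    for m in select_models(pool):
        print(m.name, m.start, m.end, m.aed)
```

With this pool, model_B is taken first, model_A is skipped because it overlaps model_B, and model_C fills the remaining non-overlapping space.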
The result is that the final models will be non-redundant and non-overlapping, but if you look at the evidence alignments you will find ab initio models different than the MAKER models that were rejected and do not overlap the final models.

model_gff competes just like any other model with AED. Ties always go to model_gff, and if there is a region where no model gets chosen (they all have AED of 1) and a model_gff entry will fit (even with an AED score of 1), then it will be chosen, because model_gff entries do not need evidence support to end up in the final annotations.

--Carson

From: Daniel Standage
Date: Friday, June 6, 2014 at 12:38 PM
To: Carson Holt
Cc: Maker Mailing List, Volker Brendel
Subject: Re: [maker-devel] Filtering of ab initio gene models

Carson et al,

Your feedback so far has been very helpful, and we are grateful for the time you have taken to respond!

We're still trying to understand the precise procedure by which competing models are chosen. You mentioned that a single model must be chosen (via AED) from a pool of available models: are these pools constructed by overlap? It is not uncommon in our experience to see overlapping genes reported by Maker, although for the most part it appears these overlapping genes don't have CDS overlap.

Looking more closely at the Maker 2.10 output from the test data we sent yesterday, we also noted that exclusion of the transcript in question also had an effect on the interval exon structure (exon 7717-7776 becomes exon 7737-7776) of a downstream model with which it overlaps 3 nucleotides.

And still unclear to us is how the model_gff data fits in with all this. From my previous searching of the list archives I was under the impression that these models would be given substantial weight in the prediction process, and would only be altered if a considerably better model could be identified.
Our experience with this small data set, though, is that which overlapping gene is reported, and which corresponding exon structure is selected, is dependent on very slight changes in the evidence. Thanks, Daniel -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Fri, Jun 6, 2014 at 12:59 PM, Daniel Standage wrote: > This helps, thanks. > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Fri, Jun 6, 2014 at 12:56 PM, Carson Holt wrote: >> I got the e-mail. Thanks for the test set. >> >> Multiple ab initio predictors don't inform a single annotation, rather one >> must be chosen from the pool of available models (I.e. it has to be SNAP or >> Augustus, or GeneMark). They all supply their own ab initio as well as hint >> based prediction, and then the one with best evidence match (measured by AED) >> is kept (it's like a competition that only one predictor can win). >> >> If you want a consensus model instead, you can take MAKER results in GFF3 >> format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is a >> collaboration with the EVM group and will have this option, but for now users >> can just split the MAKER GFF3 by evidence types and give it to EVM. EVM then >> produces consensus models based on the GFF3 content. >> >> --Carson >> >> From: Daniel Standage >> Date: Friday, June 6, 2014 at 10:46 AM >> >> To: Carson Holt >> Cc: Maker Mailing List , Volker Brendel >> >> Subject: Re: [maker-devel] Filtering of ab initio gene models >> >> Good to know, thanks. If multiple ab initio predictors inform a single >> annotation, how does Maker decide which one will be included in the gene's >> ID? >> >> Given your quick response just now, I wanted to confirm that you got the >> message and data set I sent yesterday. 
I received an email saying the size of >> my message required list admin approval to be distributed, but since you were >> also a direct recipient of the email I didn't worry about it too much. >> >> Thanks again! >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote: >>> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked >>> sequence without hints (i.e. the ab initio call). >>> maker-$seqid-snap-gene was produced by SNAP after receiving hints from >>> MAKER. >>> >>> In both cases MAKER is allowed to add UTR to the model (hence the >>> 'processed' tag). >>> >>> --Carson >>> >>> >>> From: Daniel Standage >>> Date: Friday, June 6, 2014 at 10:33 AM >>> To: Carson Holt >>> Cc: Maker Mailing List , Volker Brendel >>> >>> >>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>> >>> Another question: is there documentation anywhere for the naming conventions >>> of the genes annotated by Maker? Of course it's easy to spot genes based on >>> a particular ab initio gene predictor, as the names are in the IDs. But what >>> is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene >>> ID vs "maker-$seqid-snap-gene"? >>> >>> Thanks, >>> Daniel >>> >>> >>> -- >>> Daniel S. Standage >>> Ph.D. Candidate >>> Computational Genome Science Laboratory >>> Indiana University >>> >>> >>> On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage >>> wrote: >>>> I have attached data for a small 18kb region with a handful of genes, as >>>> well as the corresponding maker_opts.ctl file. (This is a smaller and >>>> different data set than what I was looking at yesterday, with a more >>>> well-defined problem). >>>> >>>> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400 >>>> with an AED of 0.23. 
If you exclude transcript TSA024184, Maker reports a
>>>> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes
>>>> have transcript support: will Maker report overlapping genes under any
>>>> conditions? And even if Maker is forced to choose only a single gene to
>>>> report, why would the model from 4125 to 6400 ever be reported in place of
>>>> the one from 6111 to 8345, especially since this is provided in the
>>>> model_gff file?
>>>>
>>>> Even when transcript TSA024184 is included, Maker 2.10 reports the
>>>> high-confidence gene from 6111 to 8345.
>>>>
>>>> Any light you could shed would be helpful. Thanks!
>>>>
>>>> --
>>>> Daniel S. Standage
>>>> Ph.D. Candidate
>>>> Computational Genome Science Laboratory
>>>> Indiana University
>>>>
>>>> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote:
>>>>> Just eAED, but eAED can affect selection of ab initio results. For
>>>>> example, reading frame match of protein evidence, which also affects whether
>>>>> evidence from single_exon=1 and genes with single_exon protein evidence
>>>>> get kept. There is also the assumption that your alignments in GFF3
>>>>> are correctly spliced (like BLAT does). So giving blastn results as
>>>>> precomputed est_gff would create a lot of noise, since maker ignores
>>>>> blastn and is using it only to seed the polished exonerate alignments.
>>>>>
>>>>> --Carson
>>>>>
>>>>> From: Daniel Standage
>>>>> Date: Wednesday, June 4, 2014 at 1:11 PM
>>>>> To: Carson Holt
>>>>> Cc: Maker Mailing List
>>>>> Subject: Re: [maker-devel] Filtering of ab initio gene models
>>>>>
>>>>> I do not provide Gap or Target attributes in the GFF3. Will this affect
>>>>> the AED as well, or just the eAED?
>>>>>
>>>>> --
>>>>> Daniel S. Standage
>>>>> Ph.D. Candidate
>>>>> Computational Genome Science Laboratory
>>>>> Indiana University
>>>>>
>>>>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote:
>>>>>> Sure, that would be helpful. One question.
Do you provide the Gap
>>>>>> attribute in your precomputed alignments? Having or not having that
>>>>>> attribute affects the eAED score which takes reading frame into account,
>>>>>> and may cause some things to be kept that normally would be dropped,
>>>>>> because MAKER won't be able to take the points of mismatch of the
>>>>>> alignment into account (it just assumes match everywhere).
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>> From: Daniel Standage
>>>>>> Date: Wednesday, June 4, 2014 at 1:03 PM
>>>>>> To: Maker Mailing List
>>>>>> Subject: [maker-devel] Filtering of ab initio gene models
>>>>>>
>>>>>> Thanks everyone for your responses recently!
>>>>>>
>>>>>> The reason for my recent flurry of email activity is that I'm seeing some
>>>>>> unexpected trends when running the new version of Maker with precomputed
>>>>>> alignments. Compared with an annotation I did a while ago (Maker 2.10,
>>>>>> Maker-computed alignments), this new annotation has a substantial number
>>>>>> of new genes annotated. If I compare distributions of AED scores between
>>>>>> the old and new annotation, it's clear that the new annotation has a lot
>>>>>> more low-quality models. If I look at new gene models that do not overlap
>>>>>> with any gene model from the old annotation, the likelihood that it's a
>>>>>> low-quality model is much higher.
>>>>>>
>>>>>> I decided to run a little experiment. I annotated a scaffold first using
>>>>>> Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same
>>>>>> pre-computed transcript and protein alignments and the same (latest)
>>>>>> version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44
>>>>>> genes while Maker 2.31.3 predicted 63. If we group gene models into loci
>>>>>> by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3,
>>>>>> 1 locus with only models from 2.10, and 28 loci with only models from
>>>>>> 2.31.3.
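The locus grouping described in the quoted paragraph above can be sketched in a few lines. This is a minimal Python sketch, not the procedure actually used in the experiment: the function names, the `(start, end, source)` tuple layout, and the coordinates are all made up for illustration, and strand is ignored.

```python
from collections import Counter


def group_into_loci(models):
    """Cluster (start, end, source) gene models into loci.

    A locus is a maximal run of models linked by coordinate overlap.
    Coordinates are treated as 1-based inclusive, as in GFF3.
    """
    loci = []
    current, current_end = [], None
    for start, end, source in sorted(models):
        if current and start <= current_end:
            # Overlaps the locus being built: extend it.
            current.append((start, end, source))
            current_end = max(current_end, end)
        else:
            # No overlap: close the previous locus and start a new one.
            if current:
                loci.append(current)
            current, current_end = [(start, end, source)], end
    if current:
        loci.append(current)
    return loci


def tally_by_source(loci):
    """Count loci by the set of sources that contributed models to them."""
    return Counter(
        "+".join(sorted({source for _, _, source in locus})) for locus in loci
    )


# Hypothetical coordinates, for illustration only.
models = [
    (100, 500, "2.10"), (120, 480, "2.31.3"),
    (900, 1500, "2.31.3"),
    (2000, 2600, "2.10"),
    (3000, 3400, "2.31.3"), (3350, 3900, "2.31.3"),
]
loci = group_into_loci(models)
tally = tally_by_source(loci)
print(tally)
```

A tally of this kind is what yields counts such as "loci with models from both versions" versus "loci with models from only one version".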
>>>>>>
>>>>>> Before this experiment, I assumed the issue was related to providing
>>>>>> pre-computed alignments in GFF3 format and perhaps violating some
>>>>>> important assumption. However, this experiment makes me wonder whether
>>>>>> there have been changes to how Maker filters ab initio gene models
>>>>>> between version 2.10 and version 2.31.3? Do you have any ideas? If it
>>>>>> would help, I could put together a small data set that reproduces the
>>>>>> behavior I just described.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> Daniel S. Standage
>>>>>> Ph.D. Candidate
>>>>>> Computational Genome Science Laboratory
>>>>>> Indiana University

From daniel.standage at gmail.com Fri Jun 6 18:58:26 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 19:58:26 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To: <53923808.7030401@indiana.edu>
References: <53923808.7030401@indiana.edu>
Message-ID:

In the example sent previously, transcript TSA024184 overlaps with the 3' end of our gene model's CDS by 3 nucleotides. If I manually change the transcript's end coordinate (6400 to 6100) so that there are two separate non-overlapping evidence clusters, two models are reported as expected. But I can even get both models reported with a much smaller change (6400 to 6395), where the UTRs still overlap but the CDS does not overlap with the UTR.

The 5' end of our gene model's CDS also overlaps with another transcript. Maker has no problem reporting both of these gene models though, probably since they're on different strands?
So correct me if I'm wrong, but it appears that Maker will report overlapping gene models if they are on opposite strands or if no CDS is involved in the overlap. Is there any way this behavior can be configured?

On another note, we're considering your suggestion to integrate EVM with Maker. One possibility discussed is to run Maker 4 separate times (once for each of Augustus, GeneMark, SNAP, and our model_gff models), each time with all our transcript/protein evidence, prior to consensus modeling with EVM. Would that provide any benefit over running Maker a single time with all prediction sources simultaneously?

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

From vbrendel at indiana.edu Fri Jun 6 16:52:08 2014
From: vbrendel at indiana.edu (Volker Brendel)
Date: Fri, 06 Jun 2014 16:52:08 -0500
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID: <53923808.7030401@indiana.edu>

Hi Carson,
is there a way of allowing MAKER to add UTRs to our external models (supplied by the pred_gff or model_gff tag)? This seems to be one problem we are running into. Our external models are high quality, but CDS only. Thus their score gets knocked down relative to ab initio predictions with added UTRs.

Daniel will have more questions/observations later with regard to overlapping gene models (we definitely need to allow gene models to overlap in the UTRs, because transcript evidence clearly shows such negative intergenic spaces).

Thanks for all your help!
Volker

On 6/6/2014 11:39 AM, Carson Holt wrote:
> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat
> masked sequence without hints (i.e. the ab initio call).
> maker-$seqid-snap-gene was produced by SNAP after receiving hints from
> MAKER.
>
> In both cases MAKER is allowed to add UTR to the model (hence the
> 'processed' tag).
> > --Carson > > > From: Daniel Standage > > Date: Friday, June 6, 2014 at 10:33 AM > To: Carson Holt > > Cc: Maker Mailing List >, Volker Brendel > > > Subject: Re: [maker-devel] Filtering of ab initio gene models > > Another question: is there documentation anywhere for the naming > conventions of the genes annotated by Maker? Of course it's easy to > spot genes based on a particular /ab initio/ gene predictor, as the > names are in the IDs. But what is the significance of, say, > "snap_masked-$seqid-processed-gene" in a gene ID vs > "maker-$seqid-snap-gene"? > > Thanks, > Daniel > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage > > wrote: > > I have attached data for a small 18kb region with a handful of > genes, as well as the corresponding maker_opts.ctl file. (This is > a smaller and different data set than what I was looking at > yesterday, with a more well-defined problem). > > With the data files as is, Maker 2.31.3 reports a model from 4125 > to 6400 with an AED of 0.23. If you exclude transcript TSA024184, > Maker reports a different gene from 6111 to 8345 with an AED of > 0.01. Both of these genes have transcript support: will Maker > report overlapping genes under any conditions? And even if Maker > is forced to choose only a single gene to report, why would the > model from 4125 to 6400 ever be reported in place of the one from > 6111 to 8345, especially since this is provided in the model_gff file? > > Even when transcript TSA024184 is included, Maker 2.10 reports the > high-confidence gene from 611 to 8345. > > Any light you could shed would be helpful. Thanks! > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt > wrote: > > Just eAED, but eAED can affects selection of ab initio > results. 
For example reading frame match of protein evidence > which also affects whether evidence from single_exon=1 and > genes with single_exon protein evidence get kept. There is > also the assumption that your alignments in GFF3 are are > correctly spliced (like BLAT does). So giving blastn results > as precomputed est_gff would create a lot of noise, since > maker ignores blastn and is using it only to seed the polished > exonerate alignments. > > --Carson > > > From: Daniel Standage > > Date: Wednesday, June 4, 2014 at 1:11 PM > To: Carson Holt > > Cc: Maker Mailing List > > Subject: Re: [maker-devel] Filtering of ab initio gene models > > I do not provide Gap or Target attributes in the GFF3. Will > this affect the AED as well, or just the eAED? > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt > > wrote: > > Sure. that would be helpful. One question. Do you > provide the Gap attribute in your precomputed alignments? > Having or not having that attribute affects the eAED > score which takes reading frame into account, and may > cause some things to be kept that normally would be > dropped, because MAKER won't be able to take the points of > mismatch of the alignment into account (it just assumes > match everywhere). > > --Carson > > > From: Daniel Standage > > Date: Wednesday, June 4, 2014 at 1:03 PM > To: Maker Mailing List > > Subject: [maker-devel] Filtering of ab initio gene models > > Thanks everyone for your responses recently! > > The reason for my recent flurry of email activity is that > I'm seeing some unexpected trends when running the new > version of Maker with precomputed alignments. Compared > with an annotation I did a while ago (Maker 2.10, > Maker-computed alignments), this new annotation has a > substantial number of new genes annotated. 
If I compare > distributions of AED scores between the old and new > annotation, it's clear that the new annotation has a lot > more low-quality models. If I look at new gene models that > do not overlap with any gene model from the old > annotation, the likelihood that it's a low-quality model > is much higher. > > I decided to run a little experiment. I annotated a > scaffold first using Maker 2.10 and then using Maker > 2.31.3. I both cases, I used the same pre-computed > transcript and protein alignments and the same (latest) > version of SNAP as the only /ab initio/ predictor. Maker > 2.10 predicted 44 genes while Maker 2.31.3 predicted 63. > If we group gene models into loci by overlap, there are 33 > loci with gene models from both 2.10 and 2.31.3, 1 locus > with only models from 2.10, and 28 loci with only models > from 2.31.3. > > Before this experiment, I assumed the issue was related to > providing pre-computed alignments in GFF3 format and > perhaps violating some important assumption. However, this > experiment makes me wonder whether there have been changes > to how Maker filters /ab initio/ gene models between > version 2.10 and version 2.31.3? Do you have any ideas? If > it would help, I could put together a small data set that > reproduces the behavior I just described. > > Thanks! > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > _______________________________________________ > maker-devel mailing list maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > > -- Volker Brendel Professor of Biology and Computer Science Indiana University Department of Biology & School of Informatics and Computing Simon Hall 205C 212 South Hawthorne Drive Bloomington, IN 47405-7003 Tel.: (812) 855-7074 http://brendelgroup.org/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From carsonhh at gmail.com Sat Jun 7 15:03:18 2014 From: carsonhh at gmail.com (Carson Holt) Date: Sat, 07 Jun 2014 14:03:18 -0600 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: <53923808.7030401@indiana.edu> Message-ID: The problem in the example you sent is the geneseqer entries in the GFF3 you are passing in. It is causing merge of gene clusters. The result is that UTR is being over extended and is overlapping on the models (and probably some models get merged). As you noticed you can't have overlapping models on the same strand. If you set score_preds=1 in the maker_opts.ctl file it will give you AED scores for the rejected ab initio models. You will notice that none of them score better than 0.23. One thing you can do is set correct_est_fusion=1. This tries to correct for erroneous EST/transcript evidence that leads to over extend UTR and false gene merging. You will see in the attached image that is trims back the overlapping 3' and 5' UTR for the overlapping gene models, given that MAKER believes the evidence leading to the overlap is likely low confidence and is a false merge of regions. I think much of your geneseqer input is more of a problem than a help for the annotation. Many seem to be spurious alignments. --Carson From: Daniel Standage Date: Friday, June 6, 2014 at 5:58 PM To: Volker Brendel Cc: Carson Holt , Maker Mailing List Subject: Re: [maker-devel] Filtering of ab initio gene models In the example sent previously, transcript TSA024184 overlaps with the 3' end of our gene model's CDS by 3 nucleotides. If I manually change the transcript's end coordinate (6400 to 6100) so that there are two separate non-overlapping evidence clusters, two models are reported as expected. But I can even get both models reported with a much smaller change (6400 to 6395), where the UTRs still overlap but the CDS does not overlap with the UTR. The 5' end of our gene model's CDS also overlaps with another transcript. 
Maker has no problem reporting both of these gene models though, probably since they're on different strands? So correct me if I'm wrong, but it appears that Maker will report overlapping gene models if they are on opposite strands or if no CDS is involved in the overlap. Is there any way this behavior can be configured? On another note, we're considering your suggestion to integrate EVM with Maker. One possibility discussed is to run Maker 4 separate times (once for each of Augustus, GeneMark, SNAP, and our model_gff models), each time with all our transcript/protein evidence, prior to consensus modeling with EVM. Would that provide any benefit over running Maker a single time with all prediction sources simultaneously? Thanks, Daniel -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Fri, Jun 6, 2014 at 5:52 PM, Volker Brendel wrote: > > Hi Carson, > is there a way of allowing MAKER to add UTRs to our external models (supplied > by the pred_gff or model_gff tag)? This seems to be one problem we are > running into. Our external models are high quality, but CDS only. Thus their > score gets knocked down relative to ab initio predictions with added UTRs. > > Daniel will have more questions/observations later with regard to overlapping > gene models (we definitely need to allow gene models to overlap in the UTRs, > because transcript evidence clearly shows such negative intergenic spaces). > > Thanks for all your help! > Volker > > > > On 6/6/2014 11:39 AM, Carson Holt wrote: > > >> >> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked >> sequence without hints (i.e. the ab initio call). >> >> maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER. >> >> >> >> >> In both cases MAKER is allowed to add UTR to the model (hence the 'processed' >> tag). 
>> >> >> >> >> --Carson >> >> >> >> >> >> >> >> From: Daniel Standage >> Date: Friday, June 6, 2014 at 10:33 AM >> To: Carson Holt >> Cc: Maker Mailing List , Volker Brendel >> >> Subject: Re: [maker-devel] Filtering of ab initio gene models >> >> >> >> >> >> >> >> Another question: is there documentation anywhere for the naming conventions >> of the genes annotated by Maker? Of course it's easy to spot genes based on a >> particular ab initio gene predictor, as the names are in the IDs. But what is >> the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs >> "maker-$seqid-snap-gene"? >> >> >> Thanks, >> >> Daniel >> >> >> >> >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> >> >> >> On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage >> wrote: >> >>> >>> >>> >>> I have attached data for a small 18kb region with a handful of genes, as >>> well as the corresponding maker_opts.ctl file. (This is a smaller and >>> different data set than what I was looking at yesterday, with a more >>> well-defined problem). >>> >>> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400 >>> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a >>> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes >>> have transcript support: will Maker report overlapping genes under any >>> conditions? And even if Maker is forced to choose only a single gene to >>> report, why would the model from 4125 to 6400 ever be reported in place of >>> the one from 6111 to 8345, especially since this is provided in the >>> model_gff file? >>> >>> >>> Even when transcript TSA024184 is included, Maker 2.10 reports the >>> high-confidence gene from 611 to 8345. >>> >>> >>> Any light you could shed would be helpful. Thanks! >>> >>> >>> >>> >>> >>> >>> >>> -- >>> Daniel S. Standage >>> Ph.D. 
Candidate >>> Computational Genome Science Laboratory >>> Indiana University
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6246A771-E6A9-4875-9362-DC8A7A5BC9C4.png
Type: image/png
Size: 48365 bytes
Desc: not available

From carsonhh at gmail.com Sat Jun 7 15:07:41 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:07:41 -0600
Subject: [maker-devel] Filtering of ab initio gene models
References: <53923808.7030401@indiana.edu>

Example (attached) of geneseqer GFF3 input causing problems. Notice that all the geneseqer features are almost exact representations of the transposon; they are essentially reintroducing all the noise that repeat masking tried to remove (they are giving hints to the gene predictor to try and call the transposon as a gene).
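To check whether a set of precomputed alignments mostly mirrors repeat-masked regions, as in the transposon example above, one can measure how much of each evidence feature is covered by repeat intervals. The following is a minimal Python sketch, not part of MAKER: the function names, the (start, end) tuple layout, and the 0.8 threshold are assumptions made for illustration.

```python
def covered_fraction(feature, repeats):
    """Fraction of a (start, end) feature covered by repeat intervals.

    Coordinates are 1-based and inclusive, as in GFF3. Repeat intervals may
    overlap each other; overlapping positions are counted only once.
    """
    start, end = feature
    # Clip each repeat to the feature and sort by start coordinate.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in repeats
        if s <= end and e >= start
    )
    covered = 0
    cursor = start - 1  # last position that has already been counted
    for s, e in clipped:
        s = max(s, cursor + 1)
        if e >= s:
            covered += e - s + 1
            cursor = e
    return covered / (end - start + 1)


def flag_repeat_like(features, repeats, threshold=0.8):
    """Return IDs of features whose repeat coverage is at least `threshold`.

    `features` maps a feature ID to its (start, end) span on one sequence.
    The threshold is an arbitrary illustrative cutoff.
    """
    return [
        feature_id
        for feature_id, span in features.items()
        if covered_fraction(span, repeats) >= threshold
    ]
```

Features flagged this way would be candidates for removal from the evidence GFF3 before rerunning the annotation; the cutoff would need tuning for a real data set.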
--Carson

From: Carson Holt
Date: Saturday, June 7, 2014 at 2:03 PM
To: Daniel Standage, Volker Brendel
Cc: Maker Mailing List
Subject: Re: [maker-devel] Filtering of ab initio gene models

The problem in the example you sent is the geneseqer entries in the GFF3 you are passing in. They are causing gene clusters to merge. The result is that UTR is being overextended and is overlapping on the models (and probably some models get merged). As you noticed, you can't have overlapping models on the same strand. If you set score_preds=1 in the maker_opts.ctl file, it will give you AED scores for the rejected ab initio models. You will notice that none of them score better than 0.23.

One thing you can do is set correct_est_fusion=1. This tries to correct for erroneous EST/transcript evidence that leads to overextended UTR and false gene merging. You will see in the attached image that it trims back the overlapping 3' and 5' UTR for the overlapping gene models, given that MAKER believes the evidence leading to the overlap is likely low confidence and is a false merge of regions.

I think much of your geneseqer input is more of a problem than a help for the annotation. Many seem to be spurious alignments.

--Carson

From: Daniel Standage
Date: Friday, June 6, 2014 at 5:58 PM
To: Volker Brendel
Cc: Carson Holt, Maker Mailing List
Subject: Re: [maker-devel] Filtering of ab initio gene models

In the example sent previously, transcript TSA024184 overlaps with the 3' end of our gene model's CDS by 3 nucleotides. If I manually change the transcript's end coordinate (6400 to 6100) so that there are two separate non-overlapping evidence clusters, two models are reported as expected. But I can even get both models reported with a much smaller change (6400 to 6395), where the UTRs still overlap but the CDS does not overlap with the UTR. The 5' end of our gene model's CDS also overlaps with another transcript.
Maker has no problem reporting both of these gene models though, probably since they're on different strands? So correct me if I'm wrong, but it appears that Maker will report overlapping gene models if they are on opposite strands or if no CDS is involved in the overlap. Is there any way this behavior can be configured?

On another note, we're considering your suggestion to integrate EVM with Maker. One possibility discussed is to run Maker 4 separate times (once for each of Augustus, GeneMark, SNAP, and our model_gff models), each time with all our transcript/protein evidence, prior to consensus modeling with EVM. Would that provide any benefit over running Maker a single time with all prediction sources simultaneously?

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 5:52 PM, Volker Brendel wrote:
> Hi Carson,
> is there a way of allowing MAKER to add UTRs to our external models (supplied
> by the pred_gff or model_gff tag)? This seems to be one problem we are
> running into. Our external models are high quality, but CDS only. Thus their
> score gets knocked down relative to ab initio predictions with added UTRs.
>
> Daniel will have more questions/observations later with regard to overlapping
> gene models (we definitely need to allow gene models to overlap in the UTRs,
> because transcript evidence clearly shows such negative intergenic spaces).
>
> Thanks for all your help!
> Volker
>
> On 6/6/2014 11:39 AM, Carson Holt wrote:
>> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked
>> sequence without hints (i.e. the ab initio call).
>>
>> maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER.
>>
>> In both cases MAKER is allowed to add UTR to the model (hence the 'processed'
>> tag).
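The two ID patterns described in the quoted reply can be told apart mechanically. Below is a minimal Python sketch that classifies a gene ID as an unhinted ab initio call or a hint-based call. The regular expressions generalize the two SNAP examples to other predictor names, which is an assumption; `classify_gene_id` and the sample IDs are hypothetical and not part of MAKER.

```python
import re

# "<predictor>_masked-<seqid>-processed-gene...": ab initio call on the
# repeat-masked sequence, made without hints.
AB_INITIO = re.compile(
    r"^(?P<predictor>[A-Za-z0-9]+)_masked-(?P<seqid>.+)-processed-gene"
)
# "maker-<seqid>-<predictor>-gene...": call made by the predictor after
# receiving hints from MAKER.
HINTED = re.compile(
    r"^maker-(?P<seqid>.+)-(?P<predictor>[A-Za-z0-9]+)-gene"
)


def classify_gene_id(gene_id):
    """Return (kind, predictor, seqid) for a MAKER-style gene ID.

    `kind` is "ab_initio", "hinted", or "other" when neither pattern matches.
    """
    match = AB_INITIO.match(gene_id)
    if match:
        return ("ab_initio", match.group("predictor"), match.group("seqid"))
    match = HINTED.match(gene_id)
    if match:
        return ("hinted", match.group("predictor"), match.group("seqid"))
    return ("other", None, None)
```

Counting the two kinds across a GFF3 file gives a quick view of how many final models came from hinted versus unhinted predictions.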
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6246A771-E6A9-4875-9362-DC8A7A5BC9C4.png
Type: image/png
Size: 48365 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 48C1E0B9-001D-44C9-8D8E-37A52E4A17E8.png
Type: image/png
Size: 6592 bytes
Desc: not available

From carsonhh at gmail.com Sat Jun 7 15:11:43 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:11:43 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To: <53923808.7030401@indiana.edu>
References: <53923808.7030401@indiana.edu>

If you give input as pred_gff, set keep_preds=1, and then give MAKER EST evidence to work with, then MAKER will just pass through the pred_gff data you gave it with UTR added. Set correct_est_fusion=1 if your input contains false merges across regions (common from mRNA-seq results).
This will trim overlapping UTR caused by the improperly merged EST evidence.

--Carson

From: Volker Brendel
Date: Friday, June 6, 2014 at 3:52 PM
To: Carson Holt, Daniel Standage
Cc: Maker Mailing List
Subject: Re: [maker-devel] Filtering of ab initio gene models

Hi Carson,
is there a way of allowing MAKER to add UTRs to our external models (supplied by the pred_gff or model_gff tag)? This seems to be one problem we are running into. Our external models are high quality, but CDS only. Thus their score gets knocked down relative to ab initio predictions with added UTRs.

Daniel will have more questions/observations later with regard to overlapping gene models (we definitely need to allow gene models to overlap in the UTRs, because transcript evidence clearly shows such negative intergenic spaces).

Thanks for all your help!
Volker

--
Volker Brendel
Professor of Biology and Computer Science
Indiana University
Department of Biology & School of Informatics and Computing
Simon Hall 205C
212 South Hawthorne Drive
Bloomington, IN 47405-7003
Tel.: (812) 855-7074 http://brendelgroup.org/

From carsonhh at gmail.com Sat Jun 7 15:16:29 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:16:29 -0600
Subject: [maker-devel] Filtering of ab initio gene models

Also MAKER 2.10 has a number of bugs with how UTR is generated and hints are generated for the ab initio predictors (it's several years out of date). I don't think it checks for reading frame match when determining protein overlap match either. So no surprise that some models will be different from the current MAKER version.

--Carson

From: Daniel Standage
Date: Friday, June 6, 2014 at 12:38 PM
To: Carson Holt
Cc: Maker Mailing List, Volker Brendel
Subject: Re: [maker-devel] Filtering of ab initio gene models

Carson et al,

Your feedback so far has been very helpful, and we are grateful for the time you have taken to respond!

We're still trying to understand the precise procedure by which competing models are chosen. You mentioned that a single model must be chosen (via AED) from a pool of available models: are these pools constructed by overlap? It is not uncommon in our experience to see overlapping genes reported by Maker, although for the most part it appears these overlapping genes don't have CDS overlap.
Looking more closely at the Maker 2.10 output from the test data we sent yesterday, we also noted that exclusion of the transcript in question also had an effect on the interval exon structure (exon 7717-7776 becomes exon 7737-7776) of a downstream model with which it overlaps by 3 nucleotides.

And still unclear to us is how the model_gff data fits in with all this. From my previous searching of the list archives I was under the impression that these models would be given substantial weight in the prediction process, and would only be altered if a considerably better model could be identified. Our experience with this small data set, though, is that which overlapping gene is reported, and which corresponding exon structure is selected, is dependent on very slight changes in the evidence.

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 12:59 PM, Daniel Standage wrote:
> This helps, thanks.
>
> --
> Daniel S. Standage
> Ph.D. Candidate
> Computational Genome Science Laboratory
> Indiana University
>
> On Fri, Jun 6, 2014 at 12:56 PM, Carson Holt wrote:
>> I got the e-mail. Thanks for the test set.
>>
>> Multiple ab initio predictors don't inform a single annotation, rather one
>> must be chosen from the pool of available models (i.e., it has to be SNAP or
>> Augustus, or GeneMark). They all supply their own ab initio as well as hint
>> based prediction, and then the one with best evidence match (measured by AED)
>> is kept (it's like a competition that only one predictor can win).
>>
>> If you want a consensus model instead, you can take MAKER results in GFF3
>> format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is a
>> collaboration with the EVM group and will have this option, but for now users
>> can just split the MAKER GFF3 by evidence types and give it to EVM. EVM then
>> produces consensus models based on the GFF3 content.
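Splitting a GFF3 file "by evidence types" can be sketched as grouping feature lines on the source column (column 2). The Python below is a minimal illustration under that assumption; `split_gff3_by_source` is a hypothetical helper, and whatever further formatting EVM expects for its inputs is not covered here.

```python
from collections import defaultdict


def split_gff3_by_source(lines):
    """Group GFF3 feature lines by their source column (column 2).

    Comment and directive lines are skipped, and reading stops at a
    ##FASTA directive so that embedded sequence is never treated as
    feature data. Returns a dict mapping source name to a list of lines.
    """
    by_source = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        by_source[columns[1]].append(line)
    return dict(by_source)
```

Each group can then be written to its own file, for example one file per predictor and one per alignment source, before being handed to a downstream tool.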
>> >> --Carson >> >> From: Daniel Standage >> Date: Friday, June 6, 2014 at 10:46 AM >> >> To: Carson Holt >> Cc: Maker Mailing List , Volker Brendel >> >> Subject: Re: [maker-devel] Filtering of ab initio gene models >> >> Good to know, thanks. If multiple ab initio predictors inform a single >> annotation, how does Maker decide which one will be included in the gene's >> ID? >> >> Given your quick response just now, I wanted to confirm that you got the >> message and data set I sent yesterday. I received an email saying the size of >> my message required list admin approval to be distributed, but since you were >> also a direct recipient of the email I didn't worry about it too much. >> >> Thanks again! >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote: >>> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked >>> sequence without hints (i.e. the ab initio call). >>> maker-$seqid-snap-gene was produced by SNAP after receiving hints from >>> MAKER. >>> >>> In both cases MAKER is allowed to add UTR to the model (hence the >>> 'processed' tag). >>> >>> --Carson >>> >>> >>> From: Daniel Standage >>> Date: Friday, June 6, 2014 at 10:33 AM >>> To: Carson Holt >>> Cc: Maker Mailing List , Volker Brendel >>> >>> >>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>> >>> Another question: is there documentation anywhere for the naming conventions >>> of the genes annotated by Maker? Of course it's easy to spot genes based on >>> a particular ab initio gene predictor, as the names are in the IDs. But what >>> is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene >>> ID vs "maker-$seqid-snap-gene"? >>> >>> Thanks, >>> Daniel >>> >>> >>> -- >>> Daniel S. Standage >>> Ph.D. 
Candidate
>>> Computational Genome Science Laboratory
>>> Indiana University
>>>
>>> On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage wrote:
>>>> I have attached data for a small 18kb region with a handful of genes, as
>>>> well as the corresponding maker_opts.ctl file. (This is a smaller and
>>>> different data set than what I was looking at yesterday, with a more
>>>> well-defined problem).
>>>>
>>>> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400
>>>> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a
>>>> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes
>>>> have transcript support: will Maker report overlapping genes under any
>>>> conditions? And even if Maker is forced to choose only a single gene to
>>>> report, why would the model from 4125 to 6400 ever be reported in place of
>>>> the one from 6111 to 8345, especially since this is provided in the
>>>> model_gff file?
>>>>
>>>> Even when transcript TSA024184 is included, Maker 2.10 reports the
>>>> high-confidence gene from 6111 to 8345.
>>>>
>>>> Any light you could shed would be helpful. Thanks!
>>>>
>>>> --
>>>> Daniel S. Standage
>>>> Ph.D. Candidate
>>>> Computational Genome Science Laboratory
>>>> Indiana University
>>>>
>>>> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote:
>>>>> Just eAED, but eAED can affect selection of ab initio results. For
>>>>> example reading frame match of protein evidence which also affects whether
>>>>> evidence from single_exon=1 and genes with single_exon protein evidence
>>>>> get kept. There is also the assumption that your alignments in GFF3 are
>>>>> correctly spliced (like BLAT does). So giving blastn results as
>>>>> precomputed est_gff would create a lot of noise, since maker ignores
>>>>> blastn and is using it only to seed the polished exonerate alignments.
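Because the presence of the Target and Gap attributes changes how MAKER treats precomputed alignments, it can help to check which features in an evidence GFF3 actually carry them. This is a minimal Python sketch of such a check; the helper names and the sample lines are hypothetical, and only the standard GFF3 column layout is assumed.

```python
def parse_attributes(column9):
    """Parse a GFF3 attributes column ("key=value;key=value") into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def missing_alignment_attributes(gff3_line):
    """Return which of "Target" and "Gap" are absent from one feature line."""
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    attributes = parse_attributes(columns[8])
    return [name for name in ("Target", "Gap") if name not in attributes]
```

Running this over every alignment feature line and tallying the results shows at a glance whether an evidence file was exported without alignment detail.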
>>>>>
>>>>> --Carson
>>>>>
>>>>> From: Daniel Standage
>>>>> Date: Wednesday, June 4, 2014 at 1:11 PM
>>>>> To: Carson Holt
>>>>> Cc: Maker Mailing List
>>>>> Subject: Re: [maker-devel] Filtering of ab initio gene models
>>>>>
>>>>> I do not provide Gap or Target attributes in the GFF3. Will this affect
>>>>> the AED as well, or just the eAED?
>>>>>
>>>>> --
>>>>> Daniel S. Standage
>>>>> Ph.D. Candidate
>>>>> Computational Genome Science Laboratory
>>>>> Indiana University
>>>>>
>>>>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote:
>>>>>> Sure, that would be helpful. One question: do you provide the Gap
>>>>>> attribute in your precomputed alignments? Having or not having that
>>>>>> attribute affects the eAED score which takes reading frame into account,
>>>>>> and may cause some things to be kept that normally would be dropped,
>>>>>> because MAKER won't be able to take the points of mismatch of the
>>>>>> alignment into account (it just assumes match everywhere).
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>> From: Daniel Standage
>>>>>> Date: Wednesday, June 4, 2014 at 1:03 PM
>>>>>> To: Maker Mailing List
>>>>>> Subject: [maker-devel] Filtering of ab initio gene models
>>>>>>
>>>>>> Thanks everyone for your responses recently!
>>>>>>
>>>>>> The reason for my recent flurry of email activity is that I'm seeing some
>>>>>> unexpected trends when running the new version of Maker with precomputed
>>>>>> alignments. Compared with an annotation I did a while ago (Maker 2.10,
>>>>>> Maker-computed alignments), this new annotation has a substantial number
>>>>>> of new genes annotated. If I compare distributions of AED scores between
>>>>>> the old and new annotation, it's clear that the new annotation has a lot
>>>>>> more low-quality models. If I look at new gene models that do not overlap
>>>>>> with any gene model from the old annotation, the likelihood that it's a
>>>>>> low-quality model is much higher.
>>>>>>
>>>>>> I decided to run a little experiment. I annotated a scaffold first using
>>>>>> Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same
>>>>>> pre-computed transcript and protein alignments and the same (latest)
>>>>>> version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44
>>>>>> genes while Maker 2.31.3 predicted 63. If we group gene models into loci
>>>>>> by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3,
>>>>>> 1 locus with only models from 2.10, and 28 loci with only models from
>>>>>> 2.31.3.
>>>>>>
>>>>>> Before this experiment, I assumed the issue was related to providing
>>>>>> pre-computed alignments in GFF3 format and perhaps violating some
>>>>>> important assumption. However, this experiment makes me wonder whether
>>>>>> there have been changes to how Maker filters ab initio gene models
>>>>>> between version 2.10 and version 2.31.3? Do you have any ideas? If it
>>>>>> would help, I could put together a small data set that reproduces the
>>>>>> behavior I just described.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> Daniel S. Standage
>>>>>> Ph.D. Candidate
>>>>>> Computational Genome Science Laboratory
>>>>>> Indiana University

From marc.hoeppner at imbim.uu.se Mon Jun 9 03:48:01 2014
From: marc.hoeppner at imbim.uu.se (Marc Höppner)
Date: Mon, 9 Jun 2014 08:48:01 +0000
Subject: [maker-devel] Repeatmasked genome
Message-ID: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se>

Hi,

this may be an odd question, but I was wondering where, if at all, Maker reports the repeat-masked genome sequence?
As far as I can tell the fasta sequences included in the gff annotation are unmasked (?) or at least not softmasked. I guess it wouldn?t be too hard to take the repeat masker features and use them to soft mask the assembly, but still... Regards, Marc Marc P. Hoeppner, PhD Department for Medical Biochemistry and Microbiology Uppsala University, Sweden marc.hoeppner at imbim.uu.se From dence at genetics.utah.edu Mon Jun 9 10:22:13 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Mon, 9 Jun 2014 15:22:13 +0000 Subject: [maker-devel] Repeatmasked genome In-Reply-To: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se> References: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se> Message-ID: <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu> Hi Marc, The masked genome sequence is stored in the "theVoid" directory for each scaffold. There are files "query.fasta", "query.masked.fasta", and "query.masked.gff". The masked sequence is in the query.masked.fasta file and the genomic locations of the masked regions are stored in the query.masked.gff. ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 9, 2014, at 2:48 AM, Marc H?ppner > wrote: Hi, this may be an odd question, but I was wondering where, if at all, Maker reports the repeat-masked genome sequence? As far as I can tell the fasta sequences included in the gff annotation are unmasked (?) or at least not softmasked. I guess it wouldn?t be too hard to take the repeat masker features and use them to soft mask the assembly, but still... Regards, Marc Marc P. 
Hoeppner, PhD
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at imbim.uu.se

From carsonhh at gmail.com Mon Jun 9 11:11:23 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 09 Jun 2014 10:11:23 -0600
Subject: [maker-devel] Repeatmasked genome
In-Reply-To: <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>
References: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se> <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>
Message-ID:

Yes. Those are all temporary files that (if you still have them) you can use to get at the masked fasta directly. Otherwise you can just use the features in the GFF3 file to remask the regions.

--Carson

From: Daniel Ence
Date: Monday, June 9, 2014 at 9:22 AM
To: Marc Höppner
Cc: "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Repeatmasked genome

Hi Marc,

The masked genome sequence is stored in the "theVoid" directory for each scaffold. There are files "query.fasta", "query.masked.fasta", and "query.masked.gff". The masked sequence is in the query.masked.fasta file and the genomic locations of the masked regions are stored in the query.masked.gff.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 9, 2014, at 2:48 AM, Marc Höppner wrote:

> Hi,
>
> this may be an odd question, but I was wondering where, if at all, Maker
> reports the repeat-masked genome sequence? As far as I can tell the fasta
> sequences included in the gff annotation are unmasked (?) or at least not
> softmasked.
I guess it wouldn't be too hard to take the repeat masker features
> and use them to soft mask the assembly, but still...
>
> Regards,
>
> Marc
>
> Marc P. Hoeppner, PhD
> Department for Medical Biochemistry and Microbiology
> Uppsala University, Sweden
> marc.hoeppner at imbim.uu.se

From cynsb1987 at gmail.com Mon Jun 9 23:22:47 2014
From: cynsb1987 at gmail.com (hueytyng)
Date: Tue, 10 Jun 2014 14:22:47 +1000
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level
Message-ID:

Hi Carson,

I run Maker 2.31 on my assembled contigs. From the "master_datastore_index.log", 529 contigs run through, before I get the error below. Maker halts after this.

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: 1AL_NODE_4659_length_8657_cov_8.758115
Length: 8739
#---------------------------------------------------------------------
setting up GFF3 output and fasta chunks
ERROR: This MpiChunk is not part of this level
 at /ws/ws-group/app/maker/bin/../lib/Process/MpiTiers.pm line 439.
 Process::MpiTiers::update_chunk(Process::MpiTiers=HASH(0x63e64c0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 1537
 main::update_chunk(ARRAY(0x40c69d0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 929
--> rank=4, hostname=safs-raijen
deleted:0 hits
deleted:0 hits

Attached is my maker_opts.ctl file.

Thank you,
Jenny
From carsonhh at gmail.com Wed Jun 11 09:29:44 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 11 Jun 2014 08:29:44 -0600
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level
In-Reply-To:
References:
Message-ID:

The cause of this is most likely a corrupt MPI message. It could be random (it happens with MPI messages), in which case it should succeed on retry. It could mean you need to reinstall your MPI communicator, or give fewer nodes to mpiexec when running your job (MPICH2 starts having communication issues after around 100 processes for example, even sooner on some systems). It may also mean that you set MAKER up with one communicator during the installation (like MPICH2) and then used mpiexec from another communicator to launch the job (OpenMPI for example, or even a different version of MPICH2).

Make sure you are not using MVAPICH2, because MAKER won't work with MVAPICH2. Also, if you are using OpenMPI, you must preload libmpi.so, otherwise shared libraries won't work and it will fail while running MAKER. To do that you have to export the following environment variable -->

export LD_PRELOAD=/lib/libmpi.so #replace with the location of OpenMPI

Also, because a corrupt message has the chance to cause other issues, you may want to completely delete the folder for the failed contig (look in the datastore_index.log to see where that folder is). Also make sure you are using the latest version of MAKER, because it has been vetted on OpenMPI using 8000+ cpus. Earlier versions (i.e. 2.28 and below) may have issues on OpenMPI or on some systems with slow NFS storage or limited memory.

--Carson

From: hueytyng
Date: Monday, June 9, 2014 at 10:22 PM
To:
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level

Hi Carson,

I run Maker 2.31 on my assembled contigs.
From the "master_datastore_index.log", 529 contigs run through, before I get the error below. Maker halts after this.

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: 1AL_NODE_4659_length_8657_cov_8.758115
Length: 8739
#---------------------------------------------------------------------
setting up GFF3 output and fasta chunks
ERROR: This MpiChunk is not part of this level
 at /ws/ws-group/app/maker/bin/../lib/Process/MpiTiers.pm line 439.
 Process::MpiTiers::update_chunk(Process::MpiTiers=HASH(0x63e64c0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 1537
 main::update_chunk(ARRAY(0x40c69d0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 929
--> rank=4, hostname=safs-raijen
deleted:0 hits
deleted:0 hits

Attached is my maker_opts.ctl file.

Thank you,
Jenny

From sjackman at gmail.com Wed Jun 11 15:44:41 2014
From: sjackman at gmail.com (Shaun Jackman)
Date: Wed, 11 Jun 2014 13:44:41 -0700
Subject: [maker-devel] Alternate translation table
Message-ID:

Hi, Carson.

I'm annotating a plastid genome. It has spliced genes so I'm using organism_type=eukaryotic. Its translation table however is 11 (Bacteria, Archaea, prokaryotic viruses and chloroplast proteins). Is it possible to change the translation table?

I'm not doing any ab initio gene prediction, only homology-based annotation using protein for coding genes and est for non-coding genes. The sequences are from a very closely related species (99% identity).

Cheers,
Shaun
From carsonhh at gmail.com Wed Jun 11 16:01:23 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 11 Jun 2014 15:01:23 -0600
Subject: [maker-devel] Alternate translation table
In-Reply-To:
References:
Message-ID:

Sorry. MAKER doesn't have an alternate codon table option.

--Carson

From: Shaun Jackman
Reply-To: Shaun Jackman
Date: Wednesday, June 11, 2014 at 2:44 PM
To: "maker-devel at yandell-lab.org"
Subject: [maker-devel] Alternate translation table

Hi, Carson.

I'm annotating a plastid genome. It has spliced genes so I'm using organism_type=eukaryotic. Its translation table however is 11 (Bacteria, Archaea, prokaryotic viruses and chloroplast proteins). Is it possible to change the translation table?

I'm not doing any ab initio gene prediction, only homology-based annotation using protein for coding genes and est for non-coding genes. The sequences are from a very closely related species (99% identity).

Cheers,
Shaun

From anthony.bretaudeau at rennes.inra.fr Thu Jun 12 08:00:48 2014
From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau)
Date: Thu, 12 Jun 2014 15:00:48 +0200
Subject: [maker-devel] Merging 2 annotations
In-Reply-To:
References: <538D8987.4090606@rennes.inra.fr>
Message-ID: <5399A480.10808@rennes.inra.fr>

Thank you, it works fine!

A little question which is related: I set the map_forward option to 1, but it seems to work only for the model_gff gff. Is there a way to make it keep the original IDs also for the pred_gff file?

Thank you
Anthony

On 03/06/2014 18:15, Carson Holt wrote:
> You can give the manually curated ones to model_gff and the other ones to
> pred_gff. Then set keep_preds=1.
The model_gff results always get kept
> even without evidence support, the pred_gff will be kept even without
> evidence support because you set keep_preds=1, but model_gff results will
> take precedence.
>
> --Carson
>
>
> On 6/3/14, 2:38 AM, "Anthony Bretaudeau" wrote:
>
>> Hello,
>>
>> I am working on the annotation of an insect genome, and I have 2 gff
>> files:
>> -an automatic annotation (done by another lab, with something else than
>> maker, ~20000 genes)
>> -a manually curated annotation (with webapollo, ~1500 genes)
>>
>> From this, I would like to produce a single gff combining the 2. I'd
>> like to keep all the manually curated models, and only the automatic
>> ones that have no equivalent in the manually curated gff.
>>
>> Is it possible to do something like this with maker? I guess I could
>> play with the model_gff option, but I'm not sure how exactly I could use
>> it.
>>
>> Thank you for your help
>> Regards
>>
>> Anthony

From dence at genetics.utah.edu Thu Jun 12 10:50:05 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Thu, 12 Jun 2014 15:50:05 +0000
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <5399A480.10808@rennes.inra.fr>
References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr>
Message-ID: <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu>

Hi Anthony,

So I think that the gene ID gets changed in the process of promoting things from pred_gff to gene models. If you know which predictions you want to keep, then you can select those out and pass them to model_gff.
~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 12, 2014, at 7:00 AM, Anthony Bretaudeau wrote:

A little question which is related: I set the map_forward option to 1, but it seems to work only for the model_gff gff. Is there a way to make it keep the original IDs also for the pred_gff file?

From anthony.bretaudeau at rennes.inra.fr Thu Jun 12 11:17:11 2014
From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau)
Date: Thu, 12 Jun 2014 18:17:11 +0200
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu>
References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr> <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu>
Message-ID: <5399D287.1090505@rennes.inra.fr>

An HTML attachment was scrubbed...

From carsonhh at gmail.com Thu Jun 12 11:23:06 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Jun 2014 10:23:06 -0600
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <5399D287.1090505@rennes.inra.fr>
References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr> <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu> <5399D287.1090505@rennes.inra.fr>
Message-ID:

This might be a roundabout way to get them to have the names unaltered. Give the pred_gff ones to est_gff. Still give the model_gff ones to model_gff. Set est2genome=1 and single_exon=1. Then add this line to the control file: est_forward=1. This is normally used to move transcripts forward onto new assemblies with names being drawn from the alignment, but by telling MAKER that these are ESTs instead of predictions and setting the appropriate values, it will think it's moving transcripts forward, and the final result may be what you want.
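
The settings named in the paragraph above could be collected in maker_opts.ctl roughly as follows. This is only a sketch of that workaround: the two file names are hypothetical placeholders, and the thread does not say whether the prediction GFF needs any conversion before MAKER will accept it as est_gff.

```
#-----excerpt from maker_opts.ctl (sketch, file names are placeholders)
est_gff=automatic_annotation.gff #the models previously given to pred_gff, now passed as if they were ESTs
model_gff=curated_models.gff #the manually curated models, unchanged from before
est2genome=1 #infer gene models directly from the est_gff features
single_exon=1 #also consider single exon evidence
est_forward=1 #not present in the default control file, added by hand as described above
```

With this setup pred_gff is left empty, since the same features are now supplied through est_gff.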
--Carson

From: Anthony Bretaudeau
Date: Thursday, June 12, 2014 at 10:17 AM
To: Daniel Ence
Cc: Carson Holt
Subject: Re: [maker-devel] Merging 2 annotations

Yes, I think that's why the ids get changed. But I don't know which predictions I want to keep as I'm using maker to only keep the ones that are not equivalent to the models that are in the model_gff.

Anthony

On 12/06/2014 17:50, Daniel Ence wrote:
> Hi Anthony, So I think that the gene ID gets changed in the process of
> promoting things from pred_gff to gene models. If you know which predictions
> you want to keep, then you can select those out and pass them to model_gff.
>
> ~Daniel
>
> Daniel Ence
> Graduate Student
> dence at genetics.utah.edu
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
> On Jun 12, 2014, at 7:00 AM, Anthony Bretaudeau wrote:
>
>> A little question which is related: I set the map_forward option to 1, but it
>> seems to work only for the model_gff gff. Is there a way to make it keep the
>> original IDs also for the pred_gff file?

From sjackman at gmail.com Thu Jun 12 16:58:16 2014
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 12 Jun 2014 14:58:16 -0700
Subject: [maker-devel] Poor Exonerate gene model
Message-ID:

Hi, Carson.

I have a case where MAKER is choosing a poor gene model when a better model exists. The two genes, psaA and psaB, are adjacent and are similar (37% exonerate score). BLASTX finds only the correct alignments of psaA and psaB. When exonerate is run, it also finds poor alignments of psaA to psaB and psaB to psaA. The result is that MAKER chooses the correct model for psaB, but picks the poor psaB model for psaA. Increasing ep_score_limit from 20 to 40 works around the issue. I think MAKER could make a better choice in this situation without that hint.
See the attached screen shots. The first is ep_score_limit=20 and the second ep_score_limit=40. I've attached the evidence GFF.

Cheers,
Shaun

From saad.arif at tuebingen.mpg.de Fri Jun 13 06:03:38 2014
From: saad.arif at tuebingen.mpg.de (Saad Arif)
Date: Fri, 13 Jun 2014 13:03:38 +0200
Subject: [maker-devel] Help with updating an annotation
Message-ID:

Dear All,

I would like to use Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to current annotation) and was wondering if anyone had experience with this and/or suggestions to my questions.

Briefly:

I have tophat splice junctions from RNAseq data or alternatively cufflinks generated transcript models (FASTA format) that i want to use as my new data (est_gff or est).

I want to provide the current Ensembl annotation for gene prediction but i want this annotation to remain unchanged. Hence, i'm not sure if i should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction or is this just a subset of pred_gff that remain unaltered? Can we provide the same annotation for both options (pred_gff and model_gff)?

Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated.
Thanks in advance,
saad

From carsonhh at gmail.com Fri Jun 13 11:59:46 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 13 Jun 2014 10:59:46 -0600
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References:
Message-ID:

Use the cufflinks instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old model onto the new models.

Thanks,
Carson

On 6/13/14, 5:03 AM, "Saad Arif" wrote:

>Dear All,
>
>I would like to use Maker pipeline to expand a current annotation (new
>isoforms and novel genes with respect to current annotation) and was
>wondering if anyone had experience with this and/or suggestions to my
>questions.
>
>Briefly:
>
> I have tophat splice junctions from RNAseq data or alternatively
>cufflinks generated transcript models (FASTA format) that i want to use
>as my new data (est_gff or est).
>
>I want to provide the current Ensembl annotation for gene prediction but
>i want this annotation to remain unchanged. Hence, i'm not sure if i
>should provide this annotation as pred_gff
> or model_gff. Can the model_gff be used for gene prediction or is this
>just a subset of pred_gff that remain unaltered? Can we provide the same
>annotation for both options (pred_gff and model_gff)?
>
>
>
>Importantly, my main goal is to use the new RNAseq data to add more
>isoforms and (any) novel genes to the existing Ensembl annotation. Any
>thoughts or suggestions on how to go about this would be sincerely
>appreciated.
>
>
>Thanks in advance,
>saad

From juefish at gmail.com Tue Jun 17 15:54:51 2014
From: juefish at gmail.com (Nathaniel Jue)
Date: Tue, 17 Jun 2014 16:54:51 -0400
Subject: [maker-devel] issue with forks module
Message-ID:

I've been running into all kinds of issues with the implementation of forks in GMOD. I repeatedly get this error when running an MPI run of Maker:

Use of each() on hash after insertion without resetting hash iterator results in undefined behavior at /data2/local_installs/perls/perl-5.18.1/lib/site_perl/5.18.1/x86_64-linux/forks.pm line 1736.

I had to install an alternative version of perl with perlbrew but everything seemed to work on the test data sets. Any thoughts on what might be the issue? Or if this is even an issue that would affect results?

Thanks,
Nate

From carsonhh at gmail.com Tue Jun 17 16:09:55 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 17 Jun 2014 15:09:55 -0600
Subject: [maker-devel] issue with forks module
In-Reply-To:
References:
Message-ID:

There is a change in Perl 5.18 that makes the forks.pm module incompatible. The forks.pm module maintainers have yet to update their module to resolve the issue, so it only works on Perl versions prior to 5.18. One workaround is to manually edit forks.pm line 1736 yourself.

Change it from this -->

    $write = each %WRITE;

To this (make sure to include the {} brackets) -->

    {
        no warnings qw(internal);
        $write = each %WRITE;
    }

--Carson

From: Nathaniel Jue
Date: Tuesday, June 17, 2014 at 2:54 PM
To:
Subject: [maker-devel] issue with forks module

I've been running into all kinds of issues with the implementation of forks in GMOD.
I repeatedly get this error when running an MPI run of Maker:

Use of each() on hash after insertion without resetting hash iterator results in undefined behavior at /data2/local_installs/perls/perl-5.18.1/lib/site_perl/5.18.1/x86_64-linux/forks.pm line 1736.

I had to install an alternative version of perl with perlbrew but everything seemed to work on the test data sets. Any thoughts on what might be the issue? Or if this is even an issue that would affect results?

Thanks,
Nate

From saad.arif at tuebingen.mpg.de Wed Jun 18 06:09:48 2014
From: saad.arif at tuebingen.mpg.de (Saad Arif)
Date: Wed, 18 Jun 2014 12:09:48 +0100
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References:
Message-ID: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de>

Thank you for the response.

I still have one question though, with these options:

est_gff=cufflinksout.GFF
model_gff=ensembl reference.GFF

What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks output)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding (new genes) and improving of an existing annotation?

Any feedback on this would be greatly appreciated.

saad

On 13 Jun 2014, at 17:59, Carson Holt wrote:

> Use the cufflinks instead of the tophat features (tophat tends to be
> really noisy). Give the existing models to model_gff (they will then
> always be kept unless something better is found). There is no option to
> keep models and then just add isoforms.
The model_gff input will either
> be kept as is (unchanged), or replaced with an updated model suggested by
> the evidence (the updated model may contain multiple isoforms though), and
> map_forward=1 can be used to pull names forward from the old model onto
> the new models.
>
> Thanks,
> Carson
>
>
> On 6/13/14, 5:03 AM, "Saad Arif" wrote:
>
>> Dear All,
>>
>> I would like to use Maker pipeline to expand a current annotation (new
>> isoforms and novel genes with respect to current annotation) and was
>> wondering if anyone had experience with this and/or suggestions to my
>> questions.
>>
>> Briefly:
>>
>> I have tophat splice junctions from RNAseq data or alternatively
>> cufflinks generated transcript models (FASTA format) that i want to use
>> as my new data (est_gff or est).
>>
>> I want to provide the current Ensembl annotation for gene prediction but
>> i want this annotation to remain unchanged. Hence, i'm not sure if i
>> should provide this annotation as pred_gff
>> or model_gff. Can the model_gff be used for gene prediction or is this
>> just a subset of pred_gff that remain unaltered? Can we provide the same
>> annotation for both options (pred_gff and model_gff)?
>>
>>
>>
>> Importantly, my main goal is to use the new RNAseq data to add more
>> isoforms and (any) novel genes to the existing Ensembl annotation. Any
>> thoughts or suggestions on how to go about this would be sincerely
>> appreciated.
>>
>>
>> Thanks in advance,
>> saad

From dence at genetics.utah.edu Wed Jun 18 11:21:19 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Wed, 18 Jun 2014 16:21:19 +0000
Subject: [maker-devel] Help with updating an annotation
In-Reply-To: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de>
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de>
Message-ID: <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu>

Hi Saad,

Maker doesn't view EST or protein evidence as gene models in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With its default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output.

One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it:

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

Let me know if that helps, or if you have more questions.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 18, 2014, at 5:09 AM, Saad Arif wrote:

Thank you for the response.
I still have one question though, with these options:

est_gff=cufflinksout.GFF
model_gff=ensembl reference.GFF

What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks output)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding (new genes) and improving of an existing annotation?

Any feedback on this would be greatly appreciated.

saad

On 13 Jun 2014, at 17:59, Carson Holt wrote:

Use the cufflinks instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old model onto the new models.

Thanks,
Carson

On 6/13/14, 5:03 AM, "Saad Arif" wrote:

Dear All,

I would like to use Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to current annotation) and was wondering if anyone had experience with this and/or suggestions to my questions.

Briefly:

I have tophat splice junctions from RNAseq data or alternatively cufflinks generated transcript models (FASTA format) that i want to use as my new data (est_gff or est).

I want to provide the current Ensembl annotation for gene prediction but i want this annotation to remain unchanged. Hence, i'm not sure if i should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction or is this just a subset of pred_gff that remain unaltered? Can we provide the same annotation for both options (pred_gff and model_gff)?
Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated.

Thanks in advance,
saad

From dence at genetics.utah.edu Wed Jun 18 12:04:26 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Wed, 18 Jun 2014 17:04:26 +0000
Subject: [maker-devel] Help with updating an annotation
In-Reply-To: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID:

Hi Saad,

That seems to be right to me. You'll do one run of MAKER with the cufflinks output and est2genome turned on and train SNAP on that output. You can repeat this as many times as you want, but in my experience you don't gain much in predictive power beyond two rounds of training. Next, you'll turn on SNAP and turn off est2genome, but still include the cufflinks and proteome evidence and the Ensembl models.

The other ab initio predictors that maker can use internally (genemark and augustus) are worth looking into also. Genemark does a self-training thing, but can take a couple of days depending on how large your genome is. Augustus takes a lot of time and effort to train, but comes with many prebuilt training files. If one of its prebuilt files is close to your species of interest, you can just use that instead.
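
The two stages described above could look roughly like this in maker_opts.ctl. This is a sketch only: every file name is a hypothetical placeholder, and the SNAP training step that happens between the two runs (covered in the linked tutorial) is not shown.

```
#-----round 1 (sketch): gene models straight from the cufflinks evidence, used only to train SNAP
est_gff=cufflinks.gff #placeholder name for the cufflinks transcripts
est2genome=1 #promote est_gff evidence to gene models for training purposes
snaphmm= #left empty, no trained SNAP model exists yet

#-----round 2 (sketch): SNAP on, est2genome off, all evidence and the existing models included
est_gff=cufflinks.gff #same cufflinks evidence as in round 1
protein=related_proteomes.fasta #placeholder name for the proteomes of related species
model_gff=ensembl.gff #placeholder name for the existing Ensembl models
snaphmm=snap_round1.hmm #placeholder name for the HMM trained on the round 1 output
est2genome=0 #evidence now only supports predictions, it is not promoted directly
#augustus_species= #optional: name of a prebuilt Augustus species model close to the target species
#gmhmm= #optional: a self-trained GeneMark model file
```

The two blocks are meant as two separate control files, one per MAKER run, not as a single file.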
~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 18, 2014, at 10:42 AM, Saad Arif wrote:

Thanks Daniel. I think it's more clear to me now. So If I understand correctly now:

I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my ensembl ref passed to model_gff) for regions overlapping my cufflinks output via the pred_gff option? Since i'm interested in unannotated regions, i'm also passing in reference proteomes of closely related species as protein homology evidence.

As such i should be able to keep only evidence supported predictions (for regions not present in my model_gff and/or better supported models for present regions) from my pred_gff and merge them with Ensembl annotations from the model_gff? Let me know if i'm still missing something here.

Thanks in advance.

best,
Saad

On 18 Jun 2014, at 17:21, Daniel Ence wrote:

Hi Saad,

Maker doesn't view EST or protein evidence as gene models in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With its default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output.
One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors Let me know if that helps, or if you have more question ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: Thank you for the response. I still have one question though, with these options: est_GFF=cufflinksout.GFF modle_GFF= ensembl reference.GFF What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding (new genes) and improving of an existing annotation? Any feedback on this would be greatly appreciated. saad On 13 Jun 2014, at 17:59, Carson Holt wrote: Use the cufflinks instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old model onto the new models. Thansk, Carson On 6/13/14, 5:03 AM, "Saad Arif" > wrote: Dear All, I would like to use Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to current annotation) and was wondering if anyone had experience with this and or suggestions to my questions. 
Briefly: I have TopHat splice junctions from RNAseq data or, alternatively, Cufflinks-generated transcript models (FASTA format) that I want to use as my new data (est_gff or est). I want to provide the current Ensembl annotation for gene prediction, but I want this annotation to remain unchanged. Hence, I'm not sure if I should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction, or is this just a subset of pred_gff that remains unaltered? Can we provide the same annotation for both options (pred_gff and model_gff)?

Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated.

Thanks in advance,
saad

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From a.priyam at qmul.ac.uk Wed Jun 18 12:44:34 2014
From: a.priyam at qmul.ac.uk (Anurag Priyam)
Date: Wed, 18 Jun 2014 23:14:34 +0530
Subject: [maker-devel] errors in final gff
Message-ID:

Hi,

I compiled all annotations generated by MAKER into a single GFF file using the gff3_merge script distributed with MAKER. While formatting this GFF for use with JBrowse, I found a few errors:

1. Three instances where two features were assigned the same ID.
2. One instance where a group of three subfeatures refers to a non-existent parent.

Here is the relevant portion of the GFF file:
https://gist.github.com/yeban/ffaf5cd419639dd073a7

I worked around the issue temporarily for the job at hand, but I am left wondering why these errors would creep in.
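
For anyone who wants to look for the same two problems in their own merged file, a minimal sketch of such a check might look like the following. It is not part of MAKER or JBrowse; the function names and the demo lines are made up for illustration. It assumes a tab-separated GFF3 with ID and Parent attributes in column 9, and it skips CDS lines when counting IDs, because GFF3 allows one ID to be shared by the several lines of a discontinuous feature.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> raw value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def check_gff3(lines, multiline_types=("CDS",)):
    """Return (duplicate_ids, orphan_parents) for an iterable of GFF3 lines.

    IDs on feature types listed in multiline_types are not counted as
    duplicates, since such features may legitimately span several lines.
    """
    id_counts = Counter()
    seen_ids = set()
    parents = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # sequence section, no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
            if cols[2] not in multiline_types:
                id_counts[attrs["ID"]] += 1
        if "Parent" in attrs:
            parents.update(attrs["Parent"].split(","))
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    orphans = sorted(p for p in parents if p not in seen_ids)
    return duplicates, orphans


if __name__ == "__main__":
    demo = [
        "##gff-version 3\n",
        "chr1\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1\n",
        "chr1\tmaker\tmRNA\t1\t100\t.\t+\t.\tID=g1;Parent=g1\n",
        "chr1\tmaker\texon\t1\t100\t.\t+\t.\tID=e1;Parent=m2\n",
    ]
    print(check_gff3(demo))
```

A reported duplicate is only a candidate for inspection, since other multi-line feature types may also share an ID legitimately.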
-- Priyam From carsonhh at gmail.com Wed Jun 18 13:11:49 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 18 Jun 2014 12:11:49 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: What MAKER version are you using? --Carson On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >Hi, > >I compiled all annotations generated by MAKER into a single GFF file >using the gff3_merge script distributed with MAKER. While formatting >this GFF for use with JBrowse, I found a few errors: > >1. Three instances where two features were assigned the same id. >2. One instance where a group of three subfeatures refer to a >non-existent parent. > >Here is the relevant portion of the GFF file: >https://gist.github.com/yeban/ffaf5cd419639dd073a7 > >I worked around the issue temporarily for the job at hand, but I am >left wondering why would these errors creep in. > >-- Priyam > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Wed Jun 18 16:33:08 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 18 Jun 2014 15:33:08 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Are you passing in old data via GFF3? --Carson On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >It's version 2.31. > >-- Priyam > >On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt wrote: >> What MAKER version are you using? >> >> --Carson >> >> >> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >> >>>Hi, >>> >>>I compiled all annotations generated by MAKER into a single GFF file >>>using the gff3_merge script distributed with MAKER. While formatting >>>this GFF for use with JBrowse, I found a few errors: >>> >>>1. Three instances where two features were assigned the same id. >>>2. One instance where a group of three subfeatures refer to a >>>non-existent parent. 
>>> >>>Here is the relevant portion of the GFF file: >>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>> >>>I worked around the issue temporarily for the job at hand, but I am >>>left wondering why would these errors creep in. >>> >>>-- Priyam >>> >>>_______________________________________________ >>>maker-devel mailing list >>>maker-devel at box290.bluehost.com >>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> From mhinsley at ebi.ac.uk Thu Jun 19 04:07:32 2014 From: mhinsley at ebi.ac.uk (Malcolm Hinsley) Date: Thu, 19 Jun 2014 10:07:32 +0100 Subject: [maker-devel] 'not a SCALAR reference' error In-Reply-To: References: Message-ID: <53A2A854.3000009@ebi.ac.uk> Hi I'm running maker 2.31 with mpich 3 and have run once with est and protein2genome, then trained augustus and snap and run the first iteration of ab-initio predictors, which finished cleanly with no errors/ failures. Having retrained augustus and snap I'm trying to run maker -a using the same augustus species and snap.hmm pathname... previously this has worked fine. I get a lot of errors like this (it looks like every scaffold fails): doing repeat masking ERROR: Not a SCALAR reference at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Fasta.pm line 382 thread 1. Fasta::_formatSeq(FastaSeq=HASH(0x4426298), 60) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Fasta.pm line 369 thread 1 Fasta::toFastaRef(">scaffold29 CHUNK number:0 size:100000 offset:0", REF(0x42e48f0)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/FastaChunk.pm line 217 thread 1 FastaChunk::fasta_ref(FastaChunk=HASH(0x42c76a8)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/FastaChunk.pm line 168 thread 1 FastaChunk::write_file(FastaChunk=HASH(0x42c76a8), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) 
called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/GI.pm line 3138 thread 1 GI::repeatmask(FastaChunk=HASH(0x42c76a8), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"..., "scaffold29", "", "/nfs/production/panda/ensemblgenomes/external/RepeatMasker-op"..., "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/c."..., 1, runlog=HASH(0x430e730)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 785 thread 1 Process::MpiChunk::__ANON__() called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Error.pm line 415 thread 1 eval {...} called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Error.pm line 407 thread 1 Error::subs::try(CODE(0x4437a90), HASH(0x4425fc8)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 4215 thread 1 Process::MpiChunk::_go(Process::MpiChunk=HASH(0x426e0a0), "run", HASH(0x42a5410), 0, 1) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 341 thread 1 Process::MpiChunk::run(Process::MpiChunk=HASH(0x426e0a0), 29) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker line 1457 thread 1 main::node_thread("/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) called at /nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 799 thread 1 eval {...} called at /nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 799 thread 1 threads::new("threads", CODE(0x4168d70), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) 
called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker line 917 thread 1 --> rank=29, hostname=ebi5-229.ebi.ac.uk ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:scaffold29 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:scaffold29 I see from the mailing list that there's a known issue w/ forks..pm (which is at the bottom of this stack) relating to perl 5.18, but I'm running 5.14. Any ideas? On 17/06/14 22:09, Carson Holt wrote: > There is a change in Perl 5.18 that makes the forks.pm module incompatible. > The forks.pm model maintainers have yet to update their module to resolve > the issue, so it only works on perl version prior to 5.18. > One work around it to manually edit forks.pm line 1736 yourself. > > Change it from this --> > $write = each %WRITE; > > To this (make sure to include the {} brackets)--> > { > no warnings qw(internal); > $write = each %WRITE; > } > > --Carson > -- malcolm hinsley | EnsEMBL Genomes | +44 (0)1223 49 4669 European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD United Kingdom From rbharris at uw.edu Thu Jun 19 14:07:36 2014 From: rbharris at uw.edu (Rebecca Harris) Date: Thu, 19 Jun 2014 14:07:36 -0500 Subject: [maker-devel] Fwd: iprscan2gff3 In-Reply-To: References: Message-ID: Hi, I'm trying to use the iprscan2gff3 script to update my final gff3 file with annotations from Interproscan 5. I'm getting a bunch of errors similar to another user but do not see how their issue was resolved: https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ I ran Interproscan on my ab initio predictions, then converted the xml to raw format. When I run iprscan2gff3 I get the errors: Use of uninitialized value $name in hash element at /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line 1090. 
Thanks,
Rebecca

From dence at genetics.utah.edu Thu Jun 19 15:44:46 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Thu, 19 Jun 2014 20:44:46 +0000
Subject: [maker-devel] Fwd: iprscan2gff3
In-Reply-To:
References:
Message-ID:

Hi Rebecca,

I looked at the conversation you linked to, and it seems that Carson resolved those parsing issues in an update to MAKER. What version of MAKER are you using?

Also, in that same conversation Carson said that those errors wouldn't affect the output (because the script was parsing the mRNA features fine, but giving errors on the gene features). Does the output that you get from iprscan2gff3 seem complete?

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 19, 2014, at 1:07 PM, Rebecca Harris wrote:

Hi,

I'm trying to use the iprscan2gff3 script to update my final gff3 file with annotations from Interproscan 5. I'm getting a bunch of errors similar to another user but do not see how their issue was resolved:
https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ

I ran Interproscan on my ab initio predictions, then converted the xml to raw format. When I run iprscan2gff3 I get the errors:

Use of uninitialized value $name in hash element at /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line 1090.

Thanks,
Rebecca

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From carsonhh at gmail.com Thu Jun 19 15:47:27 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 19 Jun 2014 14:47:27 -0600 Subject: [maker-devel] Fwd: iprscan2gff3 In-Reply-To: References: Message-ID: Also make sure there are gene/mRNA features in your GFF3 for your iprscan results. If you used the ab initio calls (which will be match/match_part features in the GFF3) as your input to iprscan, then you will need to upgrade them to gene/mRNA features before the script will add domains to them. --Carson From: Daniel Ence Date: Thursday, June 19, 2014 at 2:44 PM To: Rebecca Harris Cc: "maker-devel at yandell-lab.org" Subject: Re: [maker-devel] Fwd: iprscan2gff3 Hi Rebecca, I at the conversation you linked to and it seems that Carson resolved the those parsing issues in an update to maker. What version of maker are you using? Also, in that same conversation Carson said that those errors wouldn't affect the output (because the script was parsing the mRNA features fine, but giving errors on the gene features). Does the output that you get from iprscan2gff3 seem complete? ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 19, 2014, at 1:07 PM, Rebecca Harris wrote: > Hi, > > I'm trying to use the iprscan2gff3 script to update my final gff3 file with > annotations from Interproscan 5. I'm getting a bunch of errors similar to > another user but do not see how their issue was resolved: > https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-deve > l/MykTxL2Da64/5yrO3e6WBHUJ > > I ran Interproscan on my ab initio predictions, then converted the xml to raw > format. When I run iprscan2gff3 I get the errors: > Use of uninitialized value $name in hash element at > /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line 1090. 
> > Thanks, > Rebecca > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From rbharris at uw.edu Thu Jun 19 16:22:34 2014 From: rbharris at uw.edu (Rebecca Harris) Date: Thu, 19 Jun 2014 14:22:34 -0700 Subject: [maker-devel] Fwd: iprscan2gff3 In-Reply-To: References: Message-ID: Hey, Thanks for the reply. The problem was that I didn't upgrade the matches to gene/mRNA features before running the ipr_upgrade_gff3 script. R On Thu, Jun 19, 2014 at 1:47 PM, Carson Holt wrote: > Also make sure there are gene/mRNA features in your GFF3 for your iprscan > results. If you used the ab initio calls (which will be match/match_part > features in the GFF3) as your input to iprscan, then you will need to > upgrade them to gene/mRNA features before the script will add domains to > them. > > --Carson > > > From: Daniel Ence > Date: Thursday, June 19, 2014 at 2:44 PM > To: Rebecca Harris > Cc: "maker-devel at yandell-lab.org" > Subject: Re: [maker-devel] Fwd: iprscan2gff3 > > Hi Rebecca, I at the conversation you linked to and it seems that Carson > resolved the those parsing issues in an update to maker. What version of > maker are you using? > > Also, in that same conversation Carson said that those errors wouldn't > affect the output (because the script was parsing the mRNA features fine, > but giving errors on the gene features). Does the output that you get from > iprscan2gff3 seem complete? 
> > ~Daniel > > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 19, 2014, at 1:07 PM, Rebecca Harris > wrote: > > Hi, > > I'm trying to use the iprscan2gff3 script to update my final gff3 file > with annotations from Interproscan 5. I'm getting a bunch of errors similar > to another user but do not see how their issue was resolved: > > https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ > > I ran Interproscan on my ab initio predictions, then converted the xml to > raw format. When I run iprscan2gff3 I get the errors: > > Use of uninitialized value $name in hash element at > /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line > 1090. > Thanks, > Rebecca > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > _______________________________________________ maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.priyam at qmul.ac.uk Thu Jun 19 17:11:36 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Fri, 20 Jun 2014 03:41:36 +0530 Subject: [maker-devel] migrating annotations from old to new assembly Message-ID: Is it possible to migrate annotations from an old assembly to a new assembly using MAKER? Perhaps by setting est= to transcripts (spliced? or unspliced?) from the previous assembly and genome= to the new assembly? 
Maybe ask MAKER to use exonerate instead of BLASTN so splice junctions are accounted for better? -- Priyam From carsonhh at gmail.com Thu Jun 19 17:16:01 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 19 Jun 2014 16:16:01 -0600 Subject: [maker-devel] migrating annotations from old to new assembly In-Reply-To: References: Message-ID: Here you go, this is covered in a previous post --> https://groups.google.com/forum/#!searchin/maker-devel/est_forward/maker-de vel/q9fxXGKO8mk/0ATwhJvZeI4J --Carson On 6/19/14, 4:11 PM, "Anurag Priyam" wrote: >Is it possible to migrate annotations from an old assembly to a new >assembly using MAKER? > >Perhaps by setting est= to transcripts (spliced? or unspliced?) from >the previous assembly and genome= to the new assembly? Maybe ask MAKER >to use exonerate instead of BLASTN so splice junctions are accounted >for better? > >-- Priyam > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From a.priyam at qmul.ac.uk Thu Jun 19 17:19:22 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Fri, 20 Jun 2014 03:49:22 +0530 Subject: [maker-devel] migrating annotations from old to new assembly In-Reply-To: References: Message-ID: Wow! Thanks :). I apologise that I didn't look through the archives before asking. -- Priyam On Fri, Jun 20, 2014 at 3:46 AM, Carson Holt wrote: > Here you go, this is covered in a previous post --> > https://groups.google.com/forum/#!searchin/maker-devel/est_forward/maker-de > vel/q9fxXGKO8mk/0ATwhJvZeI4J > > > --Carson > > > > On 6/19/14, 4:11 PM, "Anurag Priyam" wrote: > >>Is it possible to migrate annotations from an old assembly to a new >>assembly using MAKER? >> >>Perhaps by setting est= to transcripts (spliced? or unspliced?) from >>the previous assembly and genome= to the new assembly? 
Maybe ask MAKER >>to use exonerate instead of BLASTN so splice junctions are accounted >>for better? >> >>-- Priyam >> >>_______________________________________________ >>maker-devel mailing list >>maker-devel at box290.bluehost.com >>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From saad.arif at tuebingen.mpg.de Wed Jun 18 11:42:17 2014 From: saad.arif at tuebingen.mpg.de (Saad Arif) Date: Wed, 18 Jun 2014 17:42:17 +0100 Subject: [maker-devel] Help with updating an annotation In-Reply-To: <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> Message-ID: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de> Thanks Daniel. I think it's more clear to me now. So If I understand correctly now: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence.Alternatively, I can provide alternative ab inito predictions (for regions not present in my ensembl ref passed to model_GFF) for regions overlapping my cufflinks output via the pred_GFF option? Since i'm interested in unannotated regions, i'm also passing in reference proteomes of closely related species as protein homology evidence. As such i should be able to keep, only evidence supported predictions (for regions not present in my model_GFF and or better supported models for present regions) from my pred_GFF and merge them with Ensembl annotations from the model_GFF? Let me know if i'm still missing something here. Thanks in advance. best, Saad On 18 Jun 2014, at 17:21, Daniel Ence wrote: > Hi Saad, > > Maker doesn't view EST or protein evidence as a gene model in themselves. There's a good reason for this. 
Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With it's default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). > > You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output. > > One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors > > Let me know if that helps, or if you have more question > > > ~Daniel > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: > >> Thank you for the response. I still have one question though, with these options: >> >> est_GFF=cufflinksout.GFF >> >> modle_GFF= ensembl reference.GFF >> >> What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? >> Is there a simple way to combine adding (new genes) and improving of an existing annotation? >> >> Any feedback on this would be greatly appreciated. >> >> saad >> >> On 13 Jun 2014, at 17:59, Carson Holt wrote: >> >>> Use the cufflinks instead of the tophat features (tophat tends to be >>> really noisy). Give the existing models to model_gff (they will then >>> always be kept unless something better is found). 
There is no option to >>> keep models and then just add isoforms. The model_gff input will either >>> be kept as is (unchanged), or replaced with an updated model suggested by >>> the evidence (the updated model may contain multiple isoforms though), and >>> map_forward=1 can be used to pull names forward from the old model onto >>> the new models. >>> >>> Thansk, >>> Carson >>> >>> >>> On 6/13/14, 5:03 AM, "Saad Arif" wrote: >>> >>>> Dear All, >>>> >>>> I would like to use Maker pipeline to expand a current annotation (new >>>> isoforms and novel genes with respect to current annotation) and was >>>> wondering if anyone had experience with this and or suggestions to my >>>> questions. >>>> >>>> Briefly: >>>> >>>> I have tophat splice junctions from RNAseq data or alternatively >>>> cufflinks generated transcript models (fasts format) that i want to use >>>> as my new data (est_gff or est). >>>> >>>> I want to provide the current Ensembl annotation for gene prediction but >>>> i want this annotation to remain unchanged. Hence, i?m not sure if i >>>> should provide this annotation as pred_gff >>>> or model_gff. Can the model_gff be used for gene prediction or is this >>>> just a subset of pred_gff that remain unaltered? Can we provide the same >>>> annotation for both options (pred_ and mod_gff)? >>>> >>>> >>>> >>>> Importantly, my main goal is to use the new RNAseq data to add more >>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any >>>> thoughts or suggestions on how to go about this would be sincerely >>>> appreciated. 
>>>> >>>> >>>> Thanks in advance, >>>> saad >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >>> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anurag08priyam at gmail.com Wed Jun 18 13:15:52 2014 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Wed, 18 Jun 2014 23:45:52 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: It's version 2.31. -- Priyam On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt wrote: > What MAKER version are you using? > > --Carson > > > On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: > >>Hi, >> >>I compiled all annotations generated by MAKER into a single GFF file >>using the gff3_merge script distributed with MAKER. While formatting >>this GFF for use with JBrowse, I found a few errors: >> >>1. Three instances where two features were assigned the same id. >>2. One instance where a group of three subfeatures refer to a >>non-existent parent. >> >>Here is the relevant portion of the GFF file: >>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >> >>I worked around the issue temporarily for the job at hand, but I am >>left wondering why would these errors creep in. 
>> >>-- Priyam >> >>_______________________________________________ >>maker-devel mailing list >>maker-devel at box290.bluehost.com >>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From rajesh.bommareddy at tu-harburg.de Thu Jun 19 03:08:45 2014 From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy) Date: Thu, 19 Jun 2014 10:08:45 +0200 Subject: [maker-devel] Maker control files Message-ID: <53A29A8D.5010709@tu-harburg.de> Dear Sir/Madam I have installed Maker on Linux. I have tested the installation using maker -h. It looks fine. But i cannot create the control files using /maker/bin ./maker -CTL I get the following error: Couldnot create maker_opts.ctl control files not found Can you please help me with possible reason and solution Thanks and Regards -- M.Sc. Rajesh Reddy Bommareddy Institute of Bioprocess and Biosystems Engineering Hamburg University of Technology(TUHH) Denikestrasse 15, 20171 Hamburg GERMANY. e.mail: rajesh.bommareddy at tu-harburg.de Phone:+4940428784011 Mobile:+4917663673522 From dence at genetics.utah.edu Fri Jun 20 16:20:47 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Fri, 20 Jun 2014 21:20:47 +0000 Subject: [maker-devel] Maker control files In-Reply-To: <53A29A8D.5010709@tu-harburg.de> References: <53A29A8D.5010709@tu-harburg.de> Message-ID: <51B8C254-A912-4CF6-B0E3-5C66E6E3E9AE@genetics.utah.edu> Hi Rajesh, Do you have write permissions in the directory where you're running maker? Also, I can't tell whether you're doing one command or two commands? If you do "maker" and there's no control files, then you'll get the "control files not found" error, but if you do ./maker -CTL and don't have permission to write to the install directory (which isn't unusual) then you'll get the "Could not create maker_opts.ctl" error. 
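
The permission issue described above can be checked before calling maker -CTL. The following is a rough sketch, not something shipped with MAKER: both function names are hypothetical, and it assumes the maker executable is on PATH and that the control files should be written into a project directory of your own rather than into the install directory.

```python
import os
import shutil
import subprocess


def can_create_control_files(directory):
    """True if directory exists and the current user may write files into it."""
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)


def generate_control_files(workdir):
    """Run `maker -CTL` inside workdir so the .ctl files are created there."""
    if not can_create_control_files(workdir):
        raise PermissionError(f"cannot write control files in {workdir}")
    maker = shutil.which("maker")  # assumption: maker is on PATH
    if maker is None:
        raise FileNotFoundError("maker executable not found on PATH")
    subprocess.run([maker, "-CTL"], cwd=workdir, check=True)


if __name__ == "__main__":
    here = os.getcwd()
    if can_create_control_files(here):
        print(f"{here} is writable; maker -CTL should be able to write here")
    else:
        print(f"{here} is not writable; change to a directory you own first")
```

Running maker itself from that same directory afterwards should also avoid the "control files not found" message, since MAKER looks for the .ctl files in the current working directory.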
Thanks,
Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 19, 2014, at 2:08 AM, Rajesh Reddy Bommareddy wrote:

Dear Sir/Madam,

I have installed MAKER on Linux and tested the installation using maker -h. It looks fine, but I cannot create the control files using

/maker/bin ./maker -CTL

I get the following error:

Couldnot create maker_opts.ctl control files not found

Can you please help me with a possible reason and solution?

Thanks and Regards
--
M.Sc. Rajesh Reddy Bommareddy
Institute of Bioprocess and Biosystems Engineering
Hamburg University of Technology (TUHH)
Denikestrasse 15, 20171 Hamburg
GERMANY.
e.mail: rajesh.bommareddy at tu-harburg.de
Phone:+4940428784011
Mobile:+4917663673522

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Fri Jun 20 16:42:13 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Jun 2014 15:42:13 -0600
Subject: [maker-devel] Help with updating an annotation
In-Reply-To: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID:

"I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?"

Not exactly. You need to supply an HMM for SNAP or a species file for Augustus, etc. MAKER doesn't generate gene predictions, SNAP does. You cannot get updated models unless you've provided a way for those models to be updated.
MAKER will provide SNAP/Augustus with hints to make them perform better based on the evidence, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also, if you provide models in GFF3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints - just an immutable GFF3 file).

If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from the GMOD 2014 tutorial) -->
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

--Carson

From: Saad Arif
Date: Wednesday, June 18, 2014 at 10:42 AM
To: Daniel Ence
Cc: ""
Subject: Re: [maker-devel] Help with updating an annotation

Thanks Daniel. I think it's clearer to me now. So if I understand correctly: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in MAKER with SNAP using my cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my Ensembl reference passed to model_gff) for regions overlapping my cufflinks output via the pred_gff option?

Since I'm interested in unannotated regions, I'm also passing in reference proteomes of closely related species as protein homology evidence. As such, I should be able to keep only evidence-supported predictions (for regions not present in my model_gff and/or better-supported models for present regions) from my pred_gff and merge them with Ensembl annotations from the model_gff?

Let me know if I'm still missing something here. Thanks in advance.

best,
Saad

On 18 Jun 2014, at 17:21, Daniel Ence wrote:

> Hi Saad,
>
> Maker doesn't view EST or protein evidence as a gene model in themselves.
> There's a good reason for this.
Aligners like blast don't guarantee complete > gene models, with accurate start and stop codons and splice sites. With it's > default settings maker won't make a gene model unless there's evidence that > overlaps an ab-initio prediction (or something from the pred_gff option). > > You can use est2genome to promote everything from the est_gff option to a gene > model, but this will probably give you many spurious results. What you're > saying with est2genome is, "Everything that this tool found is a complete gene > model." I don't think that's true even for cufflinks output. > > One of the gene predictors that can run internally is snap. It's really easy > to train; here's a link to a tutorial for training it: > http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMO > D_Online_Training_2014#Training_ab_initio_Gene_Predictors > > Let me know if that helps, or if you have more question > > > ~Daniel > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: > >> Thank you for the response. I still have one question though, with these >> options: >> >> est_GFF=cufflinksout.GFF >> >> modle_GFF= ensembl reference.GFF >> >> What happens to cufflinks assembled transcripts that are not confined to >> current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to >> prepare ab initio gene predictions for each of these predicted 'new' genes? >> Is there a simple way to combine adding (new genes) and improving of an >> existing annotation? >> >> Any feedback on this would be greatly appreciated. >> >> saad >> >> On 13 Jun 2014, at 17:59, Carson Holt wrote: >> >>> Use the cufflinks instead of the tophat features (tophat tends to be >>> really noisy). Give the existing models to model_gff (they will then >>> always be kept unless something better is found). 
There is no option to >>> keep models and then just add isoforms. The model_gff input will either >>> be kept as is (unchanged), or replaced with an updated model suggested by >>> the evidence (the updated model may contain multiple isoforms though), and >>> map_forward=1 can be used to pull names forward from the old model onto >>> the new models. >>> >>> Thansk, >>> Carson >>> >>> >>> On 6/13/14, 5:03 AM, "Saad Arif" wrote: >>> >>>> Dear All, >>>> >>>> I would like to use Maker pipeline to expand a current annotation (new >>>> isoforms and novel genes with respect to current annotation) and was >>>> wondering if anyone had experience with this and or suggestions to my >>>> questions. >>>> >>>> Briefly: >>>> >>>> I have tophat splice junctions from RNAseq data or alternatively >>>> cufflinks generated transcript models (fasts format) that i want to use >>>> as my new data (est_gff or est). >>>> >>>> I want to provide the current Ensembl annotation for gene prediction but >>>> i want this annotation to remain unchanged. Hence, i?m not sure if i >>>> should provide this annotation as pred_gff >>>> or model_gff. Can the model_gff be used for gene prediction or is this >>>> just a subset of pred_gff that remain unaltered? Can we provide the same >>>> annotation for both options (pred_ and mod_gff)? >>>> >>>> >>>> >>>> Importantly, my main goal is to use the new RNAseq data to add more >>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any >>>> thoughts or suggestions on how to go about this would be sincerely >>>> appreciated. 
>>>> >>>> >>>> Thanks in advance, >>>> saad >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >>> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Fri Jun 20 16:46:59 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Jun 2014 15:46:59 -0600 Subject: [maker-devel] 'not a SCALAR reference' error In-Reply-To: <53A2A854.3000009@ebi.ac.uk> References: <53A2A854.3000009@ebi.ac.uk> Message-ID: Make sure you are using the latest version of MAKER, 2.31.6. Also you may have to use MPICH2. MPICH3 is actually a different MPI protocol and I have not had success running MAKER with it. --Carson On 6/19/14, 3:07 AM, "Malcolm Hinsley" wrote: >Hi > >I'm running maker 2.31 with mpich 3 and have run once with est and >protein2genome, then trained augustus and snap and run the first >iteration of ab-initio predictors, which finished cleanly with no >errors/ failures. > >Having retrained augustus and snap I'm trying to run maker -a using the >same augustus species and snap.hmm pathname... previously this has >worked fine. > > >I get a lot of errors like this (it looks like every scaffold fails): > >doing repeat masking >ERROR: Not a SCALAR reference > at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Fasta.pm >line 382 thread 1.
> Fasta::_formatSeq(FastaSeq=HASH(0x4426298), 60) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Fasta.pm >line 369 thread 1 > Fasta::toFastaRef(">scaffold29 CHUNK number:0 size:100000 >offset:0", REF(0x42e48f0)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/FastaChunk.pm >line 217 thread 1 > FastaChunk::fasta_ref(FastaChunk=HASH(0x42c76a8)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/FastaChunk.pm >line 168 thread 1 > FastaChunk::write_file(FastaChunk=HASH(0x42c76a8), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) >called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/GI.pm >line 3138 thread 1 > GI::repeatmask(FastaChunk=HASH(0x42c76a8), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"..., >"scaffold29", "", >"/nfs/production/panda/ensemblgenomes/external/RepeatMasker-op"..., >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/c."..., 1, >runlog=HASH(0x430e730)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 785 thread 1 > Process::MpiChunk::__ANON__() called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Error.pm >line 415 thread 1 > eval {...} called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Error.pm >line 407 thread 1 > Error::subs::try(CODE(0x4437a90), HASH(0x4425fc8)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 4215 thread 1 > Process::MpiChunk::_go(Process::MpiChunk=HASH(0x426e0a0), >"run", HASH(0x42a5410), 0, 1) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 341 thread 1 > Process::MpiChunk::run(Process::MpiChunk=HASH(0x426e0a0), 29) >called at 
>/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker >line >1457 thread 1 >main::node_thread("/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/ma >ker/v8"...) >called at >/nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/ >x86_64-linux-thread-multi/forks.pm >line 799 thread 1 > eval {...} called at >/nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/ >x86_64-linux-thread-multi/forks.pm >line 799 thread 1 > threads::new("threads", CODE(0x4168d70), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) >called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker >line >917 thread 1 >--> rank=29, hostname=ebi5-229.ebi.ac.uk >ERROR: Failed while doing repeat masking >ERROR: Chunk failed at level:0, tier_type:1 >FAILED CONTIG:scaffold29 > >ERROR: Chunk failed at level:2, tier_type:0 >FAILED CONTIG:scaffold29 > > >I see from the mailing list that there's a known issue w/ forks..pm >(which is at the bottom of this stack) relating to perl 5.18, but I'm >running 5.14. > > >Any ideas? > > > > > >On 17/06/14 22:09, Carson Holt wrote: >> There is a change in Perl 5.18 that makes the forks.pm module >>incompatible. >> The forks.pm model maintainers have yet to update their module to >>resolve >> the issue, so it only works on perl version prior to 5.18. >> One work around it to manually edit forks.pm line 1736 yourself. 
>> >> Change it from this --> >> $write = each %WRITE; >> >> To this (make sure to include the {} brackets)--> >> { >> no warnings qw(internal); >> $write = each %WRITE; >> } >> >> --Carson >> > >-- >malcolm hinsley | EnsEMBL Genomes | +44 (0)1223 49 4669 >European Bioinformatics Institute (EMBL-EBI) >European Molecular Biology Laboratory >Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD >United Kingdom > > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Fri Jun 20 16:50:38 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Jun 2014 15:50:38 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Did you use est_forward? Also, in the example you showed, all the IDs are unique (one says hit and the other hsp in the ID, so they are different). Could you find the non-unique IDs causing the error? --Carson On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >I used est_gff= option, which refers to a GFF file generated by >cufflinks2gff3. The erroneous annotations didn't come from this GFF. > >-- Priyam > >On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt wrote: >> Are you passing in old data via GFF3? >> >> --Carson >> >> >> On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >> >>>It's version 2.31. >>> >>>-- Priyam >>> >>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>wrote: >>>> What MAKER version are you using? >>>> >>>> --Carson >>>> >>>> >>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>> >>>>>Hi, >>>>> >>>>>I compiled all annotations generated by MAKER into a single GFF file >>>>>using the gff3_merge script distributed with MAKER. While formatting >>>>>this GFF for use with JBrowse, I found a few errors: >>>>> >>>>>1. Three instances where two features were assigned the same id. >>>>>2. One instance where a group of three subfeatures refer to a >>>>>non-existent parent.
>>>>> >>>>>Here is the relevant portion of the GFF file: >>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>> >>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>left wondering why would these errors creep in. >>>>> >>>>>-- Priyam >>>>> >>>>>_______________________________________________ >>>>>maker-devel mailing list >>>>>maker-devel at box290.bluehost.com >>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.or >>>>>g >>>> >>>> >> >> From carsonhh at gmail.com Fri Jun 20 16:56:46 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Jun 2014 15:56:46 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Also note that ID= must be unique. Name= does not have to be, and won't be if the same protein or repeat element aligns to more than one location for example. Thanks, Carson On 6/20/14, 3:50 PM, "Carson Holt" wrote: >did you use est_forward? Also in the example you showed all the IDs are >unique (one says hit and the other hsp in the ID, so they are different)? >Could you find the non-uunique IDs causing the error? > >--Carson > > >On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: > >>I used est_gff= option, which refers to a GFF file generated by >>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >> >>-- Priyam >> >>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt wrote: >>> Are you passing in old data via GFF3? >>> >>> --Carson >>> >>> >>> On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >>> >>>>It's version 2.31. >>>> >>>>-- Priyam >>>> >>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>wrote: >>>>> What MAKER version are you using? >>>>> >>>>> --Carson >>>>> >>>>> >>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>> >>>>>>Hi, >>>>>> >>>>>>I compiled all annotations generated by MAKER into a single GFF file >>>>>>using the gff3_merge script distributed with MAKER. While formatting >>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>> >>>>>>1. 
Three instances where two features were assigned the same id. >>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>non-existent parent. >>>>>> >>>>>>Here is the relevant portion of the GFF file: >>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>> >>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>left wondering why would these errors creep in. >>>>>> >>>>>>-- Priyam >>>>>> >>>>>>_______________________________________________ >>>>>>maker-devel mailing list >>>>>>maker-devel at box290.bluehost.com >>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o >>>>>>r >>>>>>g >>>>> >>>>> >>> >>> > > From a.priyam at qmul.ac.uk Tue Jun 24 13:56:41 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Wed, 25 Jun 2014 00:26:41 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: I am sorry. I have updated the gist - https://gist.github.com/yeban/ffaf5cd419639dd073a7. 1. The first two chunks contain the annotations with duplicate ids. (4 rows) 2. The last chunk contains the annotations that refer to a non-existent parent. And what looks like an incomplete line of annotation (I forgot to state this in my original email). No, I didn't use est_forward. I am not passing in any old data via GFF3. -- Priyam On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: > Also note that ID= must be unique. Name= does not have to be, and won't be > if the same protein or repeat element aligns to more than one location for > example. > > Thanks, > Carson > > > On 6/20/14, 3:50 PM, "Carson Holt" wrote: > >>did you use est_forward? Also in the example you showed all the IDs are >>unique (one says hit and the other hsp in the ID, so they are different)? >>Could you find the non-uunique IDs causing the error? >> >>--Carson >> >> >>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >> >>>I used est_gff= option, which refers to a GFF file generated by >>>cufflinks2gff3. 
The erroneous annotations didn't come from this GFF. >>> >>>-- Priyam >>> >>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt wrote: >>>> Are you passing in old data via GFF3? >>>> >>>> --Carson >>>> >>>> >>>> On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >>>> >>>>>It's version 2.31. >>>>> >>>>>-- Priyam >>>>> >>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>wrote: >>>>>> What MAKER version are you using? >>>>>> >>>>>> --Carson >>>>>> >>>>>> >>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>>> >>>>>>>Hi, >>>>>>> >>>>>>>I compiled all annotations generated by MAKER into a single GFF file >>>>>>>using the gff3_merge script distributed with MAKER. While formatting >>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>> >>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>non-existent parent. >>>>>>> >>>>>>>Here is the relevant portion of the GFF file: >>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>> >>>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>>left wondering why would these errors creep in. >>>>>>> >>>>>>>-- Priyam >>>>>>> >>>>>>>_______________________________________________ >>>>>>>maker-devel mailing list >>>>>>>maker-devel at box290.bluehost.com >>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o >>>>>>>r >>>>>>>g >>>>>> >>>>>> >>>> >>>> >> >> > > From carsonhh at gmail.com Tue Jun 24 15:05:00 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 24 Jun 2014 14:05:00 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12 The value 1026 is held in a global iterator, so it cannot repeat the same value during the life of the process. And 1.3.0.12 is generated from the point in the code the ID is being generated. 
This means that two distinct processes had to write to the same file at the same point in the code, which should normally be impossible. However, there are ways to make this happen. First, if you turn file locks off (the -nolock option) and then run MAKER multiple times on the same dataset, you can get process collisions (because you disabled the locks that stop this). If your NFS file system does not support hard links (FhGFS for example) then you cannot lock the files (which is the same as setting -nolock). Or you have other serious IO failures over NFS. Note that by NFS I mean your network-mounted storage. The last example you give shows the preceding line being truncated. This suggests that two processes are trying to write to the same file simultaneously (inserting lines in between other lines), or serious IO failures are occurring where writes are not completing but true is being returned for the operations (can happen on unreliable NFS implementations). So in summary, either your NFS storage implementation is giving IO errors, you have run MAKER with -nolock set and then started MAKER multiple times in the same directory (process collisions), or your NFS implementation doesn't support hard links and won't allow MAKER to lock files (process collisions). If it is one of the latter two, you will have to make sure you never start MAKER more than once simultaneously on the same dataset. You can still run via MPI for parallelization, but you won't be able to start a second MPI process while the first one is still running. Thanks, Carson On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: >I am sorry. I have updated the gist - >https://gist.github.com/yeban/ffaf5cd419639dd073a7. >1. The first two chunks contain the annotations with duplicate ids. (4 >rows) >2. The last chunk contains the annotations that refer to a >non-existent parent. And what looks like an incomplete line of >annotation (I forgot to state this in my original email). > >No, I didn't use est_forward.
I am not passing in any old data via GFF3. > >-- Priyam > >On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >> Also note that ID= must be unique. Name= does not have to be, and won't >>be >> if the same protein or repeat element aligns to more than one location >>for >> example. >> >> Thanks, >> Carson >> >> >> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >> >>>did you use est_forward? Also in the example you showed all the IDs are >>>unique (one says hit and the other hsp in the ID, so they are >>>different)? >>>Could you find the non-uunique IDs causing the error? >>> >>>--Carson >>> >>> >>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>> >>>>I used est_gff= option, which refers to a GFF file generated by >>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >>>> >>>>-- Priyam >>>> >>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>wrote: >>>>> Are you passing in old data via GFF3? >>>>> >>>>> --Carson >>>>> >>>>> >>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>wrote: >>>>> >>>>>>It's version 2.31. >>>>>> >>>>>>-- Priyam >>>>>> >>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>wrote: >>>>>>> What MAKER version are you using? >>>>>>> >>>>>>> --Carson >>>>>>> >>>>>>> >>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>>>> >>>>>>>>Hi, >>>>>>>> >>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>file >>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>formatting >>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>> >>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>non-existent parent. >>>>>>>> >>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>> >>>>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>>>left wondering why would these errors creep in. 
>>>>>>>> >>>>>>>>-- Priyam >>>>>>>> >>>>>>>>_______________________________________________ >>>>>>>>maker-devel mailing list >>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab >>>>>>>>.o >>>>>>>>r >>>>>>>>g >>>>>>> >>>>>>> >>>>> >>>>> >>> >>> >> >> From a.priyam at qmul.ac.uk Wed Jun 25 16:11:22 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Thu, 26 Jun 2014 02:41:22 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Mmm ... I didn't use -nolock option. But I did launch some 10 MAKER processes in the same directory. I feel it's unlikely that my file system doesn't allow hardlinks because a few processes quit earlier than the others, saying something to the tune of "Another MAKER process is processing this scaffold already." I remember one process in particular had _just_ crashed. I don't remember how: I might have Ctrl-C'ed by mistake instead of detaching screen? admin killed it? temporary system glitch? Could this have caused the same issue? -- Priyam On Wed, Jun 25, 2014 at 1:35 AM, Carson Holt wrote: > Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12 > > The value 1026 is held in a global iterator, so it cannot repeat the same > value during the life of the process. And 1.3.0.12 is generated from the > point in the code the ID is being generated. This means that two distinct > processses had to write to the same file at the same point in the code, > which should normally be impossible. > > However, there are ways to make this happen. First if you turn file locks > off (-nolock) option and then run MAKER multiple times on the same dataset > you can get process collisions (because you disabled the locks that stop > this). If your NFS file system does not support hard links (FhGFS for > example) then you cannot lock the files (which is the same as setting > -nolock). Or you have other serious IO failures over NFS. 
Note that NFS > is your Network Mounted Storage. > > The last example you give shows the preceding line being truncated. This > suggests that two processes are trying to write to the same file > simultaneously (inserting lines in between other lines), or serious IO > failures are occurring where writes are not completing but true is being > returned for the operations (can happen on unreliable NFS implementations). > > So in summary either your NFS storage implementation is giving IO errors, > you have run MAKER with -nolock set and then started MAKER multiple times > in the same directory (process collisions), or your NFS implementation > doesn't support hardlinks and won't allow MAKER to lock files (process > collisions). If it is one of the latter two, you will have to make sure > you never start MAKER more than once simultaneously on the same dataset. > You can still run via MPI fro parallelization, but you won't be able to > start a second MPI process while the first one is still running. > > Thanks, > Carson > > > On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: > >>I am sorry. I have updated the gist - >>https://gist.github.com/yeban/ffaf5cd419639dd073a7. >>1. The first two chunks contain the annotations with duplicate ids. (4 >>rows) >>2. The last chunk contains the annotations that refer to a >>non-existent parent. And what looks like an incomplete line of >>annotation (I forgot to state this in my original email). >> >>No, I didn't use est_forward. I am not passing in any old data via GFF3. >> >>-- Priyam >> >>On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >>> Also note that ID= must be unique. Name= does not have to be, and won't >>>be >>> if the same protein or repeat element aligns to more than one location >>>for >>> example. >>> >>> Thanks, >>> Carson >>> >>> >>> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >>> >>>>did you use est_forward? 
Also in the example you showed all the IDs are >>>>unique (one says hit and the other hsp in the ID, so they are >>>>different)? >>>>Could you find the non-uunique IDs causing the error? >>>> >>>>--Carson >>>> >>>> >>>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>>> >>>>>I used est_gff= option, which refers to a GFF file generated by >>>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >>>>> >>>>>-- Priyam >>>>> >>>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>>wrote: >>>>>> Are you passing in old data via GFF3? >>>>>> >>>>>> --Carson >>>>>> >>>>>> >>>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>>wrote: >>>>>> >>>>>>>It's version 2.31. >>>>>>> >>>>>>>-- Priyam >>>>>>> >>>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>>wrote: >>>>>>>> What MAKER version are you using? >>>>>>>> >>>>>>>> --Carson >>>>>>>> >>>>>>>> >>>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>>>>> >>>>>>>>>Hi, >>>>>>>>> >>>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>>file >>>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>>formatting >>>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>>> >>>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>>non-existent parent. >>>>>>>>> >>>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>>> >>>>>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>>>>left wondering why would these errors creep in. 
>>>>>>>>> >>>>>>>>>-- Priyam >>>>>>>>> >>>>>>>>>_______________________________________________ >>>>>>>>>maker-devel mailing list >>>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab >>>>>>>>>.o >>>>>>>>>r >>>>>>>>>g >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>> >>>> >>> >>> > > From carsonhh at gmail.com Wed Jun 25 16:26:45 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 25 Jun 2014 15:26:45 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Maybe if it died in a weird way some of the processes could have continued briefly without active locks, but I'd more likely attribute this to NFS weirdness. Because of how network storage works, some implementations take shortcuts (like returning success on an IO operation even though it has not completed and may even fail later on). Or an IO operation can be buffered and completed several seconds later (the process that called the write operation may not even be active anymore). This is extremely common on NFS. You should probably just start MAKER fewer times in the same directory on your system. You may also want to start a single MAKER job (you should use MPI to parallelize it though), and use the -a flag. This will cause that job to just rebuild the current GFF3 and FASTA files. That way you can clean up your current results without having to rerun everything. It should run relatively quickly since MAKER will be able to make use of the existing BLAST reports etc. that are already there (exonerate will run again though, but it shouldn't take too long). --Carson On 6/25/14, 3:11 PM, "Anurag Priyam" wrote: >Mmm ... I didn't use -nolock option. But I did launch some 10 MAKER >processes in the same directory.
> >I feel it's unlikely that my file system doesn't allow hardlinks >because a few processes quit earlier than the others, saying something >to the tune of "Another MAKER process is processing this scaffold >already." > >I remember one process in particular had _just_ crashed. I don't >remember how: I might have Ctrl-C'ed by mistake instead of detaching >screen? admin killed it? temporary system glitch? Could this have >caused the same issue? > >-- Priyam > > >On Wed, Jun 25, 2014 at 1:35 AM, Carson Holt wrote: >> Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12 >> >> The value 1026 is held in a global iterator, so it cannot repeat the >>same >> value during the life of the process. And 1.3.0.12 is generated from the >> point in the code the ID is being generated. This means that two >>distinct >> processses had to write to the same file at the same point in the code, >> which should normally be impossible. >> >> However, there are ways to make this happen. First if you turn file >>locks >> off (-nolock) option and then run MAKER multiple times on the same >>dataset >> you can get process collisions (because you disabled the locks that stop >> this). If your NFS file system does not support hard links (FhGFS for >> example) then you cannot lock the files (which is the same as setting >> -nolock). Or you have other serious IO failures over NFS. Note that NFS >> is your Network Mounted Storage. >> >> The last example you give shows the preceding line being truncated. >>This >> suggests that two processes are trying to write to the same file >> simultaneously (inserting lines in between other lines), or serious IO >> failures are occurring where writes are not completing but true is being >> returned for the operations (can happen on unreliable NFS >>implementations). 
>> >> So in summary either your NFS storage implementation is giving IO >>errors, >> you have run MAKER with -nolock set and then started MAKER multiple >>times >> in the same directory (process collisions), or your NFS implementation >> doesn't support hardlinks and won't allow MAKER to lock files (process >> collisions). If it is one of the latter two, you will have to make sure >> you never start MAKER more than once simultaneously on the same dataset. >> You can still run via MPI fro parallelization, but you won't be able to >> start a second MPI process while the first one is still running. >> >> Thanks, >> Carson >> >> >> On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: >> >>>I am sorry. I have updated the gist - >>>https://gist.github.com/yeban/ffaf5cd419639dd073a7. >>>1. The first two chunks contain the annotations with duplicate ids. (4 >>>rows) >>>2. The last chunk contains the annotations that refer to a >>>non-existent parent. And what looks like an incomplete line of >>>annotation (I forgot to state this in my original email). >>> >>>No, I didn't use est_forward. I am not passing in any old data via GFF3. >>> >>>-- Priyam >>> >>>On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >>>> Also note that ID= must be unique. Name= does not have to be, and >>>>won't >>>>be >>>> if the same protein or repeat element aligns to more than one location >>>>for >>>> example. >>>> >>>> Thanks, >>>> Carson >>>> >>>> >>>> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >>>> >>>>>did you use est_forward? Also in the example you showed all the IDs >>>>>are >>>>>unique (one says hit and the other hsp in the ID, so they are >>>>>different)? >>>>>Could you find the non-uunique IDs causing the error? >>>>> >>>>>--Carson >>>>> >>>>> >>>>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>>>> >>>>>>I used est_gff= option, which refers to a GFF file generated by >>>>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. 
>>>>>> >>>>>>-- Priyam >>>>>> >>>>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>>>wrote: >>>>>>> Are you passing in old data via GFF3? >>>>>>> >>>>>>> --Carson >>>>>>> >>>>>>> >>>>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>>>wrote: >>>>>>> >>>>>>>>It's version 2.31. >>>>>>>> >>>>>>>>-- Priyam >>>>>>>> >>>>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>>>wrote: >>>>>>>>> What MAKER version are you using? >>>>>>>>> >>>>>>>>> --Carson >>>>>>>>> >>>>>>>>> >>>>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" >>>>>>>>>wrote: >>>>>>>>> >>>>>>>>>>Hi, >>>>>>>>>> >>>>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>>>file >>>>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>>>formatting >>>>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>>>> >>>>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>>>non-existent parent. >>>>>>>>>> >>>>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>>>> >>>>>>>>>>I worked around the issue temporarily for the job at hand, but I >>>>>>>>>>am >>>>>>>>>>left wondering why would these errors creep in. >>>>>>>>>> >>>>>>>>>>-- Priyam >>>>>>>>>> >>>>>>>>>>_______________________________________________ >>>>>>>>>>maker-devel mailing list >>>>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-l >>>>>>>>>>ab >>>>>>>>>>.o >>>>>>>>>>r >>>>>>>>>>g >>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> >>>> >>>> >> >> From a.priyam at qmul.ac.uk Wed Jun 25 16:38:17 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Thu, 26 Jun 2014 03:08:17 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: -a option looks like just the thing I need. I will forward concerns about NFS to our IT team. 
And definitely use MPI for parallelisation next time. Thanks a lot :). -- Priyam On Thu, Jun 26, 2014 at 2:56 AM, Carson Holt wrote: > Maybe if it died in a weird way some of the processes could have continued > briefly without active locks, but I'd more likely attribute this to NFS > weirdness. Because of how network storage works, some implementations > take shortcuts (like returning success on an IO operation even though it > has not completed and may even fail later on). Or an IO operation can be > buffered and completed several seconds later (the process that called the > write operation may not even be active anymore). This is extremely common > on NFS. You should probably just start MAKER fewer times in the same > directory on your system. You may also want to start a single MAKER job > (you should use MPI to parallelize it though), and use the -a flag. This > will cause that job just to just rebuild the current GFF3 and FASTA files. > That way you can clean up your current results without having to rerun > everything. It should run relatively quickly since MAKER will be able to > make use of the existing BLAST reports etc. that are already there > (exonerate will run again though, but it shouldn't take too long). > > --Carson > > > On 6/25/14, 3:11 PM, "Anurag Priyam" wrote: > >>Mmm ... I didn't use -nolock option. But I did launch some 10 MAKER >>processes in the same directory. >> >>I feel it's unlikely that my file system doesn't allow hardlinks >>because a few processes quit earlier than the others, saying something >>to the tune of "Another MAKER process is processing this scaffold >>already." >> >>I remember one process in particular had _just_ crashed. I don't >>remember how: I might have Ctrl-C'ed by mistake instead of detaching >>screen? admin killed it? temporary system glitch? Could this have >>caused the same issue? >> >>-- Priyam >> >> >>On Wed, Jun 25, 2014 at 1:35 AM, Carson Holt wrote: >>> Thanks. 
For the first two --> scaffold00002:hit:1026:1.3.0.12 >>> >>> The value 1026 is held in a global iterator, so it cannot repeat the >>>same >>> value during the life of the process. And 1.3.0.12 is generated from the >>> point in the code where the ID is being generated. This means that two >>>distinct >>> processes had to write to the same file at the same point in the code, >>> which should normally be impossible. >>> >>> However, there are ways to make this happen. First, if you turn file >>>locks >>> off (the -nolock option) and then run MAKER multiple times on the same >>>dataset >>> you can get process collisions (because you disabled the locks that stop >>> this). If your NFS file system does not support hard links (FhGFS for >>> example) then you cannot lock the files (which is the same as setting >>> -nolock). Or you have other serious IO failures over NFS. Note that NFS >>> here means your network-mounted storage. >>> >>> The last example you give shows the preceding line being truncated. >>>This >>> suggests that two processes are trying to write to the same file >>> simultaneously (inserting lines in between other lines), or serious IO >>> failures are occurring where writes are not completing but true is being >>> returned for the operations (can happen on unreliable NFS >>>implementations). >>> >>> So in summary, either your NFS storage implementation is giving IO >>>errors, >>> you have run MAKER with -nolock set and then started MAKER multiple >>>times >>> in the same directory (process collisions), or your NFS implementation >>> doesn't support hardlinks and won't allow MAKER to lock files (process >>> collisions). If it is one of the latter two, you will have to make sure >>> you never start MAKER more than once simultaneously on the same dataset. >>> You can still run via MPI for parallelization, but you won't be able to >>> start a second MPI process while the first one is still running.
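[Editorial sketch, not part of the original message.] The three symptoms discussed in this thread (repeated ID= values, Parent= values that point at no feature, and truncated lines) can be checked for before loading a merged GFF3 into a browser. The Python below is a minimal sketch under stated assumptions: the function names `parse_attributes` and `check_gff3` are hypothetical, it is not part of MAKER, and it assumes tab-separated nine-column GFF3. Note that the GFF3 format itself lets a discontinuous feature (CDS segments, for example) share one ID across several lines, so reported duplicates are candidates for review rather than proven errors.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of tag -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            tag, value = field.split("=", 1)
            attrs[tag] = value
    return attrs


def check_gff3(lines):
    """Return (duplicate_ids, orphan_parents, malformed_line_numbers).

    `lines` is any iterable of GFF3 lines, such as an open file handle.
    Line numbers are 1-based.
    """
    id_counts = Counter()
    parents = []
    malformed = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            # A truncated write typically shows up as a short line.
            malformed.append(lineno)
            continue
        attrs = parse_attributes(columns[8])
        if "ID" in attrs:
            id_counts[attrs["ID"]] += 1
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parents.append(parent)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    orphans = sorted({p for p in parents if p not in id_counts})
    return duplicates, orphans, malformed
```

A typical use would be `with open("merged.gff3") as fh: print(check_gff3(fh))`, where `merged.gff3` is a placeholder file name.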
>>> >>> Thanks, >>> Carson >>> >>> >>> On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: >>> >>>>I am sorry. I have updated the gist - >>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7. >>>>1. The first two chunks contain the annotations with duplicate ids. (4 >>>>rows) >>>>2. The last chunk contains the annotations that refer to a >>>>non-existent parent. And what looks like an incomplete line of >>>>annotation (I forgot to state this in my original email). >>>> >>>>No, I didn't use est_forward. I am not passing in any old data via GFF3. >>>> >>>>-- Priyam >>>> >>>>On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >>>>> Also note that ID= must be unique. Name= does not have to be, and >>>>>won't >>>>>be >>>>> if the same protein or repeat element aligns to more than one location >>>>>for >>>>> example. >>>>> >>>>> Thanks, >>>>> Carson >>>>> >>>>> >>>>> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >>>>> >>>>>>did you use est_forward? Also in the example you showed all the IDs >>>>>>are >>>>>>unique (one says hit and the other hsp in the ID, so they are >>>>>>different)? >>>>>>Could you find the non-uunique IDs causing the error? >>>>>> >>>>>>--Carson >>>>>> >>>>>> >>>>>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>>>>> >>>>>>>I used est_gff= option, which refers to a GFF file generated by >>>>>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >>>>>>> >>>>>>>-- Priyam >>>>>>> >>>>>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>>>>wrote: >>>>>>>> Are you passing in old data via GFF3? >>>>>>>> >>>>>>>> --Carson >>>>>>>> >>>>>>>> >>>>>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>>>>wrote: >>>>>>>> >>>>>>>>>It's version 2.31. >>>>>>>>> >>>>>>>>>-- Priyam >>>>>>>>> >>>>>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>>>>wrote: >>>>>>>>>> What MAKER version are you using? 
>>>>>>>>>> >>>>>>>>>> --Carson >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" >>>>>>>>>>wrote: >>>>>>>>>> >>>>>>>>>>>Hi, >>>>>>>>>>> >>>>>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>>>>file >>>>>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>>>>formatting >>>>>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>>>>> >>>>>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>>>>non-existent parent. >>>>>>>>>>> >>>>>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>>>>> >>>>>>>>>>>I worked around the issue temporarily for the job at hand, but I >>>>>>>>>>>am >>>>>>>>>>>left wondering why would these errors creep in. >>>>>>>>>>> >>>>>>>>>>>-- Priyam >>>>>>>>>>> >>>>>>>>>>>_______________________________________________ >>>>>>>>>>>maker-devel mailing list >>>>>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-l >>>>>>>>>>>ab >>>>>>>>>>>.o >>>>>>>>>>>r >>>>>>>>>>>g >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>> >>> > > From rajesh.bommareddy at tu-harburg.de Mon Jun 30 05:18:12 2014 From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy) Date: Mon, 30 Jun 2014 12:18:12 +0200 Subject: [maker-devel] Maker gene prediction Message-ID: <53B13964.3060608@tu-harburg.de> Dear Sir/Madam I have a general question regarding gene prediction and annotation in Maker. For example, I have a new sequence of a yeast strain, and i have to predict and annotate the genome. Of,course i know EST's from the same organism will help me to predict the genes accurately, but when i want to use EST or RNA transcripts from a closely related organism, how can i do that in Maker and how accurate will be the prediction ?. 
Is the produced prediction and annotation valid ? How do i check this ? Thank you and Regards -- M.Sc. Rajesh Reddy Bommareddy Institute of Bioprocess and Biosystems Engineering Hamburg University of Technology(TUHH) Denikestrasse 15, 20171 Hamburg GERMANY. e.mail: rajesh.bommareddy at tu-harburg.de Phone:+4940428784011 Mobile:+4917663673522 From carsonhh at gmail.com Mon Jun 30 12:34:23 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 30 Jun 2014 11:34:23 -0600 Subject: [maker-devel] Maker gene prediction In-Reply-To: <53B13964.3060608@tu-harburg.de> References: <53B13964.3060608@tu-harburg.de> Message-ID: You can supply ESTs from a related organism to the alt_est= option. Note this runs really slow because it has to be translated in all 6 reading frames (target and query), and will be less sensitive (larger threshold for alignments to become statistically significant). So if you have protein evidence from a related species, use that instead of the EST evidence from a related species. With respect to accuracy, the alignment evidence that suggests the annotation is also the experimental evidence that supports an annotations accuracy (so it is kind of a circular argument). But the alignment evidence does provide a correlative measurement. Things with lower AED scores better match the evidence and should be considered as higher confidence, while genes with higher AED scores represent genes that have lower confidence (this correlation is very well supported across many many organisms). You should be aware of what is considered realistic with genome annotation. In general for newly sequenced organisms, a genome wide accuracy of greater than 80% is considered extremely well annotated (but can't directly be measured except retrospectively - i.e. once you have a future more complete assembly and more experimental evidence to compare to). 
Only a handful of genomes that have legions of curators working over a decade (drosophila for example) have accuracies of greater than 90%. --Carson On 6/30/14, 4:18 AM, "Rajesh Reddy Bommareddy" wrote: >Dear Sir/Madam > >I have a general question regarding gene prediction and annotation in >Maker. > >For example, I have a new sequence of a yeast strain, and i have to >predict and annotate the genome. Of,course i know EST's from the same >organism will help me to predict the genes accurately, but when i want >to use EST or RNA transcripts from a closely related organism, how can i >do that in Maker and how accurate will be the prediction ?. Is the >produced prediction and annotation valid ? How do i check this ? > >Thank you and Regards >-- >M.Sc. Rajesh Reddy Bommareddy >Institute of Bioprocess and Biosystems Engineering >Hamburg University of Technology(TUHH) >Denikestrasse 15, 20171 Hamburg >GERMANY. >e.mail: rajesh.bommareddy at tu-harburg.de >Phone:+4940428784011 >Mobile:+4917663673522 > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Mon Jun 2 09:10:30 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 02 Jun 2014 09:10:30 -0600 Subject: [maker-devel] Precomputed alignments In-Reply-To: References: Message-ID: With the Target and Gap attribute you get slightly better behavior on filtering when you specify the blast_depth=X parameter in the maker_bopts.ctl file (keeps only X best hits). They will also affect the eAED score since it takes reading frame into account (so no Gap attribute means no assumption of reading frame). Otherwise they are only beneficial for seeing the alignment in a viewer as some viewers can recover the alignment when those values are specified. If you are not using blast_depth or trying to view the alignments in a viewer they don't really do anything. 
MAKER will just assume perfect match across the specified regions. --Carson From: Daniel Standage Date: Saturday, May 31, 2014 at 9:23 AM To: Maker Mailing List Subject: [maker-devel] Precomputed alignments Hello again! About a year ago I asked about using precomputed alignments with Maker. The thread quickly took a different direction as we tried to track down other issues, and I never got the thread back on its original track. So, to return to the original question, what exactly is required when providing pre-computed alignments in GFF3 format? For example, does it affect Maker's behavior whether a score is given? The "Target" attribute? The "Gap" attribute? Thanks! -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Jun 2 09:23:25 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 02 Jun 2014 09:23:25 -0600 Subject: [maker-devel] tRNAscan and map_gff_ids Message-ID: I've now patched the current download to fix this and a plus strand spliced tRNA bug. --Carson On 5/20/14, 1:17 PM, "Fields, Christopher J" wrote: >I found a problem with some tRNAscan output using MAKER 2.31.5. I had a >full MAKER data set (run initially using MAKER 2.31.5) that I mapped IDs >for. This was then run as follows, with the requisite error: > >-system-specific-4.1$ map_gff_ids id.map Zalbi.all.gff3 >Nested quantifiers in regex; marked by <-- HERE in >m/trnascan-KB913038.1-noncoding-Undet_??? <-- HERE -gene-79.0/ at >/home/groups/hpcbio/apps/maker/maker-2.31.5/bin/map_gff_ids line 111, ><$IN> line 3067590. > >The problematic lines: > >---------------------------------------------- >-system-specific-4.1$ grep "???" 
Zalbi.all.gff3 >KB913038.1 maker gene 23847890 23847958 . - . ID=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0 >KB913038.1 maker tRNA 23847890 23847958 . - . ID=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA-1;Parent=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA-1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|70|0 >KB913038.1 maker exon 23847890 23847958 . - . ID=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA-1:exon:2193;Parent=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA-1 >KB913039.1 maker gene 21710152 21710224 . - . ID=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0 >KB913039.1 maker tRNA 21710152 21710224 . - . ID=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA-1;Parent=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA-1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|74|0 >KB913039.1 maker exon 21710152 21710224 . - . ID=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA-1:exon:4036;Parent=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA-1 >---------------------------------------------- > >I managed to get it going by using the following modifications (regex >quotemeta) in map_gff_ids (lines 107-112): > > for my $id (@map_ids) { > # Only if the value (or the portion preceding > # the first colon) is equal to the map key. > next unless ($value eq $id || $value =~ /^\Q$id\E:/); > $value =~ s/\Q$id\E/$map{$id}/ unless($tag eq 'Name' && $id !~ /\-gene\-\d+\.\d+|^CG\:|^....\:|^[^\:]+\:temp\d+\:/); > } > >I'm guessing there may be a similar problem with map_fasta_ids?
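[Editorial sketch, not part of the original message.] The failure above comes from interpolating an ID that contains regex metacharacters ("???") straight into a pattern; the quoted patch fixes it with Perl's \Q...\E quoting. The same idea in Python uses `re.escape`. This is a minimal illustration only: `remap_value` and the replacement ID `ZALBI_000001` are hypothetical, and the sketch deliberately leaves out the Name-tag special case from the Perl patch.

```python
import re


def remap_value(value, id_map):
    """Replace a leading old ID in `value` with its mapped ID.

    A key matches when `value` equals it exactly, or when `value` starts
    with the key followed by a colon (the same condition as the patch).
    """
    for old_id, new_id in id_map.items():
        # re.escape plays the role of Perl's \Q...\E: every metacharacter
        # in the ID (here the "???" and the dots) is matched literally.
        pattern = re.compile("^" + re.escape(old_id) + "(?=:|$)")
        if pattern.search(value):
            # A function replacement keeps backslashes in new_id literal.
            return pattern.sub(lambda match: new_id, value, count=1)
    return value
```

For example, with `id_map = {"trnascan-KB913038.1-noncoding-Undet_???-gene-79.0": "ZALBI_000001"}`, the exon ID ending in `:exon:2193` on the plain gene ID would have its prefix swapped, while the `-tRNA-1` ID is left alone until its own key is looked up. Compiling the unescaped ID directly raises `re.error`, which is the Python counterpart of the "Nested quantifiers" message.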
> >chris >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From cjfields at illinois.edu Mon Jun 2 10:45:09 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 2 Jun 2014 16:45:09 +0000 Subject: [maker-devel] tRNAscan and map_gff_ids In-Reply-To: References: Message-ID: <007A79A7-8C68-4AFC-AC4F-451194D4CD29@illinois.edu> Thanks Carson! chris On Jun 2, 2014, at 10:23 AM, Carson Holt wrote: > I've now patched the current download to fix this and a plus strand > spliced tRNA bug. > > --Carson > > > On 5/20/14, 1:17 PM, "Fields, Christopher J" wrote: > >> I found a problem with some tRNAscan output using MAKER 2.31.5. I had a >> full MAKER data set (run initially using MAKER 2.31.5) that I mapped IDs >> for. This was then run as follows, with the requisite error: >> >> -system-specific-4.1$ map_gff_ids id.map Zalbi.all.gff3 >> Nested quantifiers in regex; marked by <-- HERE in >> m/trnascan-KB913038.1-noncoding-Undet_??? <-- HERE -gene-79.0/ at >> /home/groups/hpcbio/apps/maker/maker-2.31.5/bin/map_gff_ids line 111, >> <$IN> line 3067590. >> >> The problematic lines: >> >> ---------------------------------------------- >> -system-specific-4.1$ grep "???" Zalbi.all.gff3 >> KB913038.1 maker gene 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene >> -79.0 >> KB913038.1 maker tRNA 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0-tRNA-1;Parent=trnascan-KB913038.1-noncoding-Undet >> _???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA >> -1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|70|0 >> KB913038.1 maker exon 23847890 23847958 . - . 
ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0-tRNA-1:exon:2193;Parent=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0-tRNA-1 >> KB913039.1 maker gene 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene >> -72.0 >> KB913039.1 maker tRNA 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0-tRNA-1;Parent=trnascan-KB913039.1-noncoding-Undet >> _???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA >> -1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|74|0 >> KB913039.1 maker exon 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0-tRNA-1:exon:4036;Parent=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0-tRNA-1 >> ---------------------------------------------- >> >> I managed to get it going by using the following modifications (regex >> quotemeta) in map_gff_ids (lines 107-112): >> >> for my $id (@map_ids) { >> # Only if the value (or the portion preceding >> # the first colon) is equal to the map key. >> next unless ($value eq $id || $value =~ /^\Q$id\E:/); >> $value =~ s/\Q$id\E/$map{$id}/ unless($tag eq 'Name' && $id !~ >> /\-gene\-\d+\.\d+|^CG\:|^....\:|^[^\:]+\:temp\d+\:/); >> } >> >> I?m guessing there may be a similar problem with map_fasta_ids? 
>> >> chris >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From anthony.bretaudeau at rennes.inra.fr Tue Jun 3 02:38:31 2014 From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau) Date: Tue, 03 Jun 2014 10:38:31 +0200 Subject: [maker-devel] Merging 2 annotations Message-ID: <538D8987.4090606@rennes.inra.fr> Hello, I am working on the annotation of an insect genome, and I have 2 gff files: -an automatic annotation (done by another lab, with something else than maker, ~20000genes) -a manually curated annotation (with webapollo, ~1500 genes) From this, I would like to produce a single gff combining the 2. I'd like to keep all the manually curated models, and only the automatic ones that have no equivalent in the manually curated gff. Is it possible to do something like this with maker? I guess I could play with the model_gff option, but I'm not sure how exactly I could use it. Thank you for your help Regards Anthony From shpeng at shou.edu.cn Mon Jun 2 20:30:17 2014 From: shpeng at shou.edu.cn (=?UTF-8?B?5b2t5Y+45Y2O?=) Date: Tue, 3 Jun 2014 10:30:17 +0800 (GMT+08:00) Subject: [maker-devel] Maker can not run repeatmasker Message-ID: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> Using the example data, I can not run maker successfully. Please see the following: [root at c0105 test3]# ls -l total 72 -rw-r--r--. 1 root root 32712 May 29 02:46 dpp_contig.fasta -rw-r--r--. 1 root root 19138 May 29 02:46 dpp_est.fasta -rw-r--r--. 1 root root 3045 May 29 02:46 dpp_protein.fasta -rw-r--r--. 1 root root 1413 May 29 02:46 maker_bopts.ctl -rw-r--r--. 1 root root 1288 May 29 02:46 maker_exe.ctl -rw-r--r--. 1 root root 4630 May 29 04:27 maker_opts.ctl [root at c0105 test3]# maker STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... STATUS: Setting up database for any GFF3 input... 
A data structure will be created for you at: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore To access files for individual sequences use the datastore index: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_master_datastore_index.log STATUS: Now running MAKER... examining contents of the fasta file and run log --Next Contig-- #--------------------------------------------------------------------- Now starting the contig!! SeqID: contig-dpp-500-500 Length: 32156 #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased #--------------------------------------------------------------------- Now retrying the contig!! 
SeqID: contig-dpp-500-500 Length: 32156 Tries: 2!! #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased Maker is now finished!!! Start_time: 1401761680 End_time: 1401761688 Elapsed: 8 Thnaks. Sihua -------------- next part -------------- An HTML attachment was scrubbed... URL: From janphilipoyen at gmail.com Tue Jun 3 09:07:17 2014 From: janphilipoyen at gmail.com (=?UTF-8?Q?Jan_Philip_=C3=98yen?=) Date: Tue, 3 Jun 2014 17:07:17 +0200 Subject: [maker-devel] AED scores and thresholds: Not filtering? Message-ID: Hello, We are currently working on annotating arthropod genomes with the maker pipeline. 
We have tried to filter out genes with low support using the AED thresholds using AED_threshold=0.5. However, we still find genes without scores and most importantly genes with scores above the set threshold (even AED=1). Is there a known issue which could cause this? We can supply the control files as well as gffs, but would not like to make them public at this stage. Best Regards, Jan Philip Oeyen ZFMK / ZMB / University of Bonn -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jun 3 09:10:27 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 03 Jun 2014 09:10:27 -0600 Subject: [maker-devel] Maker can not run repeatmasker In-Reply-To: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> References: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> Message-ID: The message is basically saying that RepeatMasker is not installed correctly. Follow the instructions here --> http://www.repeatmasker.org/RMDownload.html --Carson From: ??? Date: Monday, June 2, 2014 at 8:30 PM To: Subject: [maker-devel] Maker can not run repeatmasker Using the example data, I can not run maker successfully. Please see the following: [root at c0105 test3]# ls -l total 72 -rw-r--r--. 1 root root 32712 May 29 02:46 dpp_contig.fasta -rw-r--r--. 1 root root 19138 May 29 02:46 dpp_est.fasta -rw-r--r--. 1 root root 3045 May 29 02:46 dpp_protein.fasta -rw-r--r--. 1 root root 1413 May 29 02:46 maker_bopts.ctl -rw-r--r--. 1 root root 1288 May 29 02:46 maker_exe.ctl -rw-r--r--. 1 root root 4630 May 29 04:27 maker_opts.ctl [root at c0105 test3]# maker STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... STATUS: Setting up database for any GFF3 input... 
A data structure will be created for you at: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore To access files for individual sequences use the datastore index: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_master_datas tore_index.log STATUS: Now running MAKER... examining contents of the fasta file and run log --Next Contig-- #--------------------------------------------------------------------- Now starting the contig!! SeqID: contig-dpp-500-500 Length: 32156 #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.al l.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVo id.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased #--------------------------------------------------------------------- Now retrying the contig!! 
SeqID: contig-dpp-500-500 Length: 32156 Tries: 2!! #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.al l.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVo id.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased Maker is now finished!!! Start_time: 1401761680 End_time: 1401761688 Elapsed: 8 Thnaks. Sihua _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jun 3 09:51:44 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 03 Jun 2014 09:51:44 -0600 Subject: [maker-devel] AED scores and thresholds: Not filtering? In-Reply-To: References: Message-ID: No. 
It should use whichever is lower, the AED or the eAED score. The only exception is model_gff results. Those are always kept. Also note that the filter applies to the entire gene, not just to individual splice forms if you have alternate splicing. If you want, I can take a look to see if there is anything non-obvious. You would have to send me the final GFF3 and the maker_opts.ctl file. --Carson From: Jan Philip Øyen Date: Tuesday, June 3, 2014 at 9:07 AM To: Subject: [maker-devel] AED scores and thresholds: Not filtering? Hello, We are currently working on annotating arthropod genomes with the maker pipeline. We have tried to filter out genes with low support by setting AED_threshold=0.5. However, we still find genes without scores and, most importantly, genes with scores above the set threshold (even AED=1). Is there a known issue which could cause this? We can supply the control files as well as gffs, but would not like to make them public at this stage. Best Regards, Jan Philip Oeyen ZFMK / ZMB / University of Bonn From carsonhh at gmail.com Tue Jun 3 10:15:46 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 03 Jun 2014 10:15:46 -0600 Subject: [maker-devel] Merging 2 annotations In-Reply-To: <538D8987.4090606@rennes.inra.fr> References: <538D8987.4090606@rennes.inra.fr> Message-ID: You can give the manually curated ones to model_gff and the other ones to pred_gff. Then set keep_preds=1. The model_gff results always get kept even without evidence support. The pred_gff results will also be kept without evidence support because you set keep_preds=1, but model_gff results will take precedence.
--Carson On 6/3/14, 2:38 AM, "Anthony Bretaudeau" wrote: >Hello, > >I am working on the annotation of an insect genome, and I have 2 gff >files: >-an automatic annotation (done by another lab, with something else than >maker, ~20000 genes) >-a manually curated annotation (with webapollo, ~1500 genes) > > From this, I would like to produce a single gff combining the 2. I'd >like to keep all the manually curated models, and only the automatic >ones that have no equivalent in the manually curated gff. > >Is it possible to do something like this with maker? I guess I could >play with the model_gff option, but I'm not sure how exactly I could use >it. > >Thank you for your help >Regards > >Anthony From carsonhh at gmail.com Tue Jun 3 20:20:20 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 03 Jun 2014 20:20:20 -0600 Subject: [maker-devel] Short Introns In-Reply-To: References: Message-ID: I think you may be best off using WebApollo to manually annotate the few hundred genes with short introns. It's not that fun to do, but you should be able to get through them all in a couple of days by yourself, or in under a day if you had a helper. --Carson On 5/15/14, 11:15 AM, "Mack, Brian" wrote: >Hi, I examined the genes that had introns less than 10 bp that were being >flagged by tbl2asn and I noticed that all 438 of them were genes called >by SNAP. Also they were found in the CDS and not the UTR. It seems >strange that all of the genes that have these short introns are from SNAP >when only about one third of the final gene models are from SNAP. I've >examined the evidence for a handful of these genes and the short introns >do not seem supported by the evidence. Has anybody else had short intron >issues with SNAP?
>
>Brian
>
>-----Original Message-----
>From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf
>Of Carson Holt
>Sent: Friday, April 18, 2014 10:36 AM
>To: UMD Bioinformatics; maker-devel at yandell-lab.org
>Subject: Re: [maker-devel] Short Introns
>
>Look at the name of those genes. The original name will let you know
>where it came from because it will contain augustus, genemark, snap, etc.
>You will also want to open up the contig containing those genes in a
>viewer like apollo
>(http://weatherby.genetics.utah.edu/apollo/apollo.tar.gz). See if the
>short intron is part of the CDS or UTR. If it's UTR, then it has
>evidence support from an EST, which either means there are problems with
>the EST/cDNA evidence or it's real. For those, even if they are real you
>can just trim them off. If it's part of the CDS, then investigate
>whether it is suggested by EST or protein evidence, or if the ab initio
>predictor called it (sometimes the ab initio predictor calls things to
>force an ORF to work). This can sometimes be indicative of assembly
>issues in that region.
>
>--Carson
>
>
>On 4/18/14, 7:14 AM, "UMD Bioinformatics"
>wrote:
>
>>Hello,
>>
>>We are preparing two submissions for NCBI, nightmare. However some of
>>our MAKER gene models have short introns that are being flagged by
>>NCBI. In one species we have >400 introns smaller than 20bp which is
>>almost biologically impossible. I know we can set max intron length in
>>the opts.ctl file but can we set a minimum intron length?
>>
>>I saw yesterday's posts that mention this is a result of the external ab
>>initio predictors but I didn't see an indication as to which predictor
>>and how to change that setting.
>>
>>from yesterday:
>>*These are just short introns (intron size is under control of the ab
>>initio
>>predictors) --> 438 ERROR: SEQ_FEAT.ShortIntron
>>
>>Cheers
>>Ian
>
>This electronic message contains information generated by the USDA solely
>for the intended recipients. Any unauthorized interception of this
>message or the use or disclosure of the information it contains may
>violate the law and subject the violator to civil or criminal penalties.
>If you believe you have received this message in error, please notify the
>sender and delete the email immediately.

From sujaikumar at gmail.com Wed Jun 4 06:26:09 2014
From: sujaikumar at gmail.com (Sujai)
Date: Wed, 4 Jun 2014 13:26:09 +0100
Subject: [maker-devel] Augustus compilation
Message-ID:

Hi all

I've installed older versions of Maker (up to 2.28) successfully before.

I was trying to install maker 2.31.6 on a new cluster and decided to use the built-in installers for the dependencies.

Unfortunately

./Build augustus

gives this error:

Unpacking augustus tarball...
Configuring augustus...
g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o genbank.o genbank.cc -I../include
g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o properties.o properties.cc -I../include
properties.cc: In static member function 'static void Properties::init(int, char**)':
properties.cc:349:25: error: 'boost::filesystem::path' has no member named 'native'
    configPath = cpath.native();
                 ^
properties.cc: In function 'boost::filesystem::path findLocationOfSelfBinary()':
properties.cc:615:10: error: 'read_symlink' is not a member of 'boost::filesystem'
    bpath = boost::filesystem::read_symlink(bpath);
    ^
make: *** [properties.o] Error 1

ERROR: Failed installing augustus, now cleaning installation path...
You may need to install augustus manually.

----

Would anyone have any suggestions for how to fix this? I've tried editing the ../exe/augustus-3.0.2/src/Makefile line:

LIBS = -lboost_iostreams -lboost_system -lboost_filesystem

to add the path to my system boost lib:

LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib -lboost_iostreams -lboost_system -lboost_filesystem

and then running make from inside ../exe/augustus-3.0.2/src, but I get the same error again.

From mike.thon at gmail.com Wed Jun 4 07:31:30 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Wed, 4 Jun 2014 15:31:30 +0200
Subject: [maker-devel] Augustus compilation
In-Reply-To:
References:
Message-ID: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com>

Hi - Yes, the latest version of augustus needs the boost library. If you're on Linux you should be able to install it through the package manager. It's called libboost-dev or some such thing.

-Mike

From sujaikumar at gmail.com Wed Jun 4 07:34:50 2014
From: sujaikumar at gmail.com (Sujai)
Date: Wed, 4 Jun 2014 14:34:50 +0100
Subject: [maker-devel] Augustus compilation
In-Reply-To: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com>
References: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com>
Message-ID:

Hi Mike

Thanks for the super prompt response. I am on a cluster where I can't install libboost-dev.
However, boost is on the cluster (as I wrote, it is compiled in the /system/software/linux-x86_64/lib/boost/1_55_0/lib directory), so is my modification to the Makefile below correct, or is there something else I need to do?

LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib -lboost_iostreams -lboost_system -lboost_filesystem

Cheers,
- Sujai

From daniel.standage at gmail.com Wed Jun 4 13:03:27 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Wed, 4 Jun 2014 15:03:27 -0400
Subject: [maker-devel] Filtering of ab initio gene models
Message-ID:

Thanks everyone for your responses recently!

The reason for my recent flurry of email activity is that I'm seeing some unexpected trends when running the new version of Maker with precomputed alignments. Compared with an annotation I did a while ago (Maker 2.10, Maker-computed alignments), this new annotation has a substantial number of new genes annotated. If I compare distributions of AED scores between the old and new annotation, it's clear that the new annotation has a lot more low-quality models. If I look at new gene models that do not overlap with any gene model from the old annotation, the likelihood that it's a low-quality model is much higher.

I decided to run a little experiment. I annotated a scaffold first using Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same pre-computed transcript and protein alignments and the same (latest) version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 locus with only models from 2.10, and 28 loci with only models from 2.31.3.
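
The grouping of gene models into loci by overlap, as described above, can be sketched as follows. This is a hypothetical illustration rather than the script used for the experiment: the tuple layout, the source labels "2.10" and "2.31.3", and the choice to ignore strand are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (seqid, start, end, source); coordinates are 1-based and inclusive, as in GFF3.
# The tuple layout and the source labels are assumptions made for this sketch.
Gene = Tuple[str, int, int, str]


def group_into_loci(genes: Iterable[Gene]) -> List[List[Gene]]:
    """Merge gene models on the same sequence into loci wherever they overlap.

    Strand is ignored here, which is a simplifying assumption.
    """
    loci: List[List[Gene]] = []
    current: List[Gene] = []
    current_seqid = None
    current_end = 0
    # Sorting by sequence and start lets a single pass find every overlap.
    for gene in sorted(genes, key=lambda g: (g[0], g[1], g[2])):
        seqid, start, end, _source = gene
        if current and seqid == current_seqid and start <= current_end:
            # Overlaps the running locus: extend it.
            current.append(gene)
            current_end = max(current_end, end)
        else:
            # No overlap (or a new sequence): close the locus and start another.
            if current:
                loci.append(current)
            current = [gene]
            current_seqid, current_end = seqid, end
    if current:
        loci.append(current)
    return loci


def summarize(loci: List[List[Gene]], old: str = "2.10", new: str = "2.31.3") -> Counter:
    """Count loci with models from both annotations, only the old, or only the new."""
    counts: Counter = Counter()
    for locus in loci:
        sources = {gene[3] for gene in locus}
        if old in sources and new in sources:
            counts["both"] += 1
        elif old in sources:
            counts["old_only"] += 1
        elif new in sources:
            counts["new_only"] += 1
    return counts


if __name__ == "__main__":
    # Toy coordinates invented for the example; not data from the experiment.
    toy = [
        ("scaf1", 100, 500, "2.10"),
        ("scaf1", 400, 900, "2.31.3"),
        ("scaf1", 1000, 1200, "2.31.3"),
        ("scaf2", 50, 80, "2.10"),
    ]
    print(summarize(group_into_loci(toy)))
```

Reading the gene features of each MAKER GFF3 into such tuples and passing the combined list to `group_into_loci` gives the per-locus comparison; whether overlap should be strand-aware depends on how strictly a locus is defined.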
Before this experiment, I assumed the issue was related to providing pre-computed alignments in GFF3 format and perhaps violating some important assumption. However, this experiment makes me wonder whether there have been changes to how Maker filters *ab initio* gene models between version 2.10 and version 2.31.3. Do you have any ideas? If it would help, I could put together a small data set that reproduces the behavior I just described.

Thanks!

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

From carsonhh at gmail.com Wed Jun 4 13:09:29 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 04 Jun 2014 13:09:29 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Sure, that would be helpful. One question: do you provide the Gap attribute in your precomputed alignments? Having or not having that attribute affects the eAED score, which takes reading frame into account, and may cause some things to be kept that normally would be dropped, because MAKER won't be able to take the points of mismatch of the alignment into account (it just assumes match everywhere).

--Carson

From daniel.standage at gmail.com Wed Jun 4 13:11:44 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Wed, 4 Jun 2014 15:11:44 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

I do not provide Gap or Target attributes in the GFF3. Will this affect the AED as well, or just the eAED?

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

From carsonhh at gmail.com Wed Jun 4 13:17:34 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 04 Jun 2014 13:17:34 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Just eAED, but eAED can affect the selection of ab initio results. One example is the reading-frame match of protein evidence, which also affects whether evidence from single_exon=1 and genes with single-exon protein evidence get kept. There is also the assumption that your alignments in GFF3 are correctly spliced (like BLAT does). So giving blastn results as precomputed est_gff would create a lot of noise, since MAKER otherwise ignores blastn hits and uses them only to seed the polished exonerate alignments.

--Carson

From ranjani at uga.edu Thu Jun 5 09:49:36 2014
From: ranjani at uga.edu (Sivaranjani Namasivayam)
Date: Thu, 5 Jun 2014 15:49:36 +0000
Subject: [maker-devel] protein2genome gene models from protein gff
Message-ID: <1401983375868.65464@uga.edu>

Hi,

I am trying to predict gene models from protein evidence, using the parameter protein2genome set to 1. I get gene models predicted if I provide the proteins as a fasta file, but not as gff (I want to use a gff format to avoid the blastx step again). Is this expected?

In the case of transcriptome evidence and est2genome set to 1, I get gene models predicted with both fasta and gff formats.

Thanks,
Ranjani

From Brian.Mack at ARS.USDA.GOV Thu Jun 5 11:56:04 2014
From: Brian.Mack at ARS.USDA.GOV (Mack, Brian)
Date: Thu, 5 Jun 2014 17:56:04 +0000
Subject: [maker-devel] missing start and stop codons
Message-ID:

I've been looking at the start and stop codons of the maker predicted transcripts after I stripped off the UTR with the fasta_tool script, and 7% of the transcripts do not have ATG as a start codon. Also 2% do not have TAA, TAG, or TGA as a stop codon. Could these be real start and stop codons and just rare variants, or should I consider these incomplete genes?
If I were to turn on the "always_complete" option in Maker, what would that actually do?

Thanks,
Brian

From carsonhh at gmail.com Thu Jun 5 12:01:24 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 12:01:24 -0600
Subject: [maker-devel] missing start and stop codons
Message-ID:

They are incomplete genes; there are many reasons why this happens in new assemblies. You can turn always_complete on to try to force a complete model, but what is added or subtracted to get a start and stop codon may not be biologically correct. It's just forced canonical. Also make sure to use the latest MAKER version. Versions 2.29 and earlier didn't correct for the BioPerl codon table, which allows for an extra non-canonical start codon. Now MAKER exports a strict canonical table to BioPerl so 'M' is the only start.

--Carson

From carsonhh at gmail.com Thu Jun 5 12:08:20 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 12:08:20 -0600
Subject: [maker-devel] protein2genome gene models from protein gff
Message-ID:

est_gff assumes the alignments are spliced correctly. The protein2genome option also makes that assumption, but with a little less confidence that the user always provides splice-aware alignments, so in some instances (like protein2genome=1) it may not pass them forward as guaranteed splice-aware alignments.

--Carson

From carsonhh at gmail.com Thu Jun 5 12:24:03 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 12:24:03 -0600
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To: <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

Like I said, the predictors do the best they can, so there is probably something about the regions that requires exons to be dropped or added to make the CDS, reading frame, or start/stop work. In several ant genomes we saw something like this caused by incorrect homopolymers in the assembly, which force the predictor to slightly alter the intron/exon structure because otherwise the reading frame made no sense (the EST alignments were used to confirm that the assembly homopolymers were incorrect - lots of bad single base pair deletions).

The way hints work is as follows. At the simplest level ab initio predictors are calculating the probability of being in different states (intergenic, intron, exon, etc.). The hints increase the probability of being in the intron state where MAKER gives an intron hint or being in an exon/CDS state when MAKER gives an exon/CDS hint. So this bends the likelihood of the ab initio gene predictor to call something similar in structure to the evidence overlapping it. That being said, if there is strong enough signal from something else in the sequence, or my hints won't work with the splice sites in the region, or the reading frame breaks, then no amount of hints can force augustus to make that model.

--Carson

On 6/5/14, 2:15 AM, "Marc Höppner" wrote:

>Hi,
>
>thanks for the feedback.
I spent some more time on this and am still
>somewhat unsatisfied with the whole thing...
>
>A few comments:
>
>I quite frequently see augustus and in extension Maker including exons
>that are not supported by EST/Protein evidence and are not critical for
>the gene model (i.e. I can take them out and still get a proper CDS).
>Maybe I don't know enough about how Maker creates hints and more
>importantly what role these hints play for augustus, but I cannot really
>see a great effect (any, really) on the gene models even if both ESTs and
>proteins contradict an augustus gene model and the surplus exon is
>non-essential.
>
>(all evidence is provided as fasta files, protein2genome and est2genome
>are set to 0)
>
>As for the repeat library, I suppose this is a critical point. I am using
>repeats from a closely related species via Repeatmasker, modelled and
>filtered repeats from RepeatModeler and repeats derived from a
>high-coverage 454 data set. Not sure what else I can do to improve that.
>
>As for evidence, I am using the curated reference proteome from a related
>species (<5 Mio years), unprot_swissprot and high-quality ESTs + 454
>reads. I don't think it gets a whole lot better, in terms of what data
>can be used.
>
>So in summary, I just don't get where I want to using Augustus and Maker
>- specifically, the gene models are full of weird, unsupported artefacts
>despite manually curating > 850 models for training. I suppose I was
>hoping for some secret trick to improve on this - but I guess there is
>none? Actually, if I only do a pure evidence build (seeing that my input
>data is very high quality), it looks better - which sort of goes against
>the premise of Maker :/
>
>Regards,
>
>Marc
>
>
>Marc P.
Hoeppner, PhD
>Team Leader
>Department for Medical Biochemistry and Microbiology
>Uppsala University, Sweden
>marc.hoeppner at bils.se
>
>On 27 May 2014, at 17:25, Carson Holt wrote:
>
>> Extra exons can be required for predictors to make sense of a region (they
>> do the best they can). This can be due to imperfect assemblies or
>> repeats. For plants the repeat database is the one thing that will
>> most affect the annotation quality. You may need to spend some time
>> building the best repeat library you can. The repeat library is the next
>> most important thing next to training the predictor, because they confuse
>> the predictor (sometimes a lot) causing it to behave oddly in those
>> regions (because repeats do encode real protein and protein fragments).
>> Also when running now with MAKER make sure to include the entire proteome
>> of a related species and not just UniProt, and you will get better
>> performance. Now that you have Augustus trained, using it inside of MAKER
>> with an improved repeat library and additional protein evidence should
>> give it the feedback that will allow it to perform better than it would
>> with just naked ab initio prediction.
>>
>> Thanks,
>> Carson
>>
>>
>> On 5/27/14, 2:12 AM, "Marc Höppner" wrote:
>>
>>> Hi,
>>>
>>> I wanted to get some feedback regarding the training of ab-initio gene
>>> finders - it's not strictly Maker related, but I suppose there are many
>>> people on this list that have encountered and solved this issue in one
>>> way or another.
>>>
>>> Specifically, I am trying to train Augustus (and possibly SNAP) for a
>>> plant genome. This has always been a very frustrating process for me, but
>>> while I have a better idea now how to do it, I still don't get the sort
>>> of accuracy that I am hoping for. A quick run-through of my process:
A quick run-through of my process; >>> >>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>> Sanger-sequenced EST data >>> >>> Filtered for Models with an AED <= 0.3 >>> >>> Loaded that into WebApollo, together with an existing reference >>> annotation and the evidence tracks >>> >>> Manually curated/selected 750 gene models using the following rules: >>> - Must have start/stop codon >>> - Most have as many exons as possible >>> - Must agree with evidence >>> - Must be >= 2kb part from other gene models (provided as flanking >>> regions for augustus to train intergenic sequence) >>> >>> From these models, I created a GBK file, split it into 650 (train) and >>> 100 (test) models and created a new profile using the documented >>> procedure. >>> >>> But: >>> >>> While the naked ab-init models created through maker get a lot of genes >>> ?sort of right?, I still see too many issues to be really satisfied. >>> Problems include: >>> >>> - random exon calls which are not supported by any line of evidence (~1 >>> per gene model, I would guess) >>> - poor congruency with some gene models (especially ones not used for >>> training/testing) >>> >>> Is there any best-practice guide on how to improve this? The Augustus >>> website is unfortunately quite poor on detail? My impression so far is >>> that ramping up the number of training models isn?t really doing too >>>much >>> beyond a certain point (tried 400, 500 and 750). >>> >>> Regards, >>> >>> Marc >>> >>> >>> Marc P. 
Hoeppner, PhD >>> Team Leader >>> BILS Genome Annotation Platform >>> Department for Medical Biochemistry and Microbiology >>> Uppsala University, Sweden >>> marc.hoeppner at bils.se From carsonhh at gmail.com Thu Jun 5 12:28:55 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 05 Jun 2014 12:28:55 -0600 Subject: [maker-devel] Some questions regarding ab-initio training In-Reply-To: References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se> Message-ID: One thing you might want to try is adding another predictor like SNAP together with Augustus and then processing the MAKER results using EVM. We actually have a collaboration with the EVM group to produce a MAKER-EVM pipeline (MAKER 3.0). EVM will produce consensus models using the predictions and the evidence in the MAKER GFF3 which are generally better than just SNAP and Augustus with hints, so it might be able to remove some of the artifacts you are worried about. --Carson On 6/5/14, 12:24 PM, "Carson Holt" wrote: >Like I said, the predictors do the best they can, so there is probably >something about the regions to make the CDS, reading frame, or start/stop >work that requires exons to be dropped or added. In several ant genomes >we saw something like this caused by incorrect homopolymers in the >assembly which force the predictor to slightly alter the intron/exon >structure because otherwise the reading frame made no sense (the EST >alignments were used to confirm that the assembly homopolymers were >incorrect - lots of bad single base pair deletions).
> >The way hints work is as follows. At the simplest level ab initio >predictors are calculating the probability of being in different states >(intergenic, intron, exon, etc.). The hints increase the probability of >being in the intron state where MAKER gives an intron hint or being in an >exon/CDS state when MAKER gives an exon/CDS hint. So this bends the >likelihood of the ab initio gene predictor to call something similar in >structure to the evidence overlapping it. That being said, if there is >strong enough signal from something else in the sequence or my hints won't >work with the splice sites in the region or the reading frame breaks, then >no amount of hints can force augustus to make that model. > >--Carson > > > >On 6/5/14, 2:15 AM, "Marc Höppner" wrote: > >>Hi, >> >>thanks for the feedback. I spent some more time on this and am still >>somewhat unsatisfied with the whole thing... >> >>A few comments: >> >>I quite frequently see augustus and by extension Maker including exons >>that are not supported by EST/Protein evidence and are not critical for >>the gene model (i.e. I can take them out and still get a proper CDS). >>Maybe I don't know enough about how Maker creates hints and more >>importantly what role these hints play for augustus, but I cannot really >>see a great effect (any, really) on the gene models even if both ESTs and >>proteins contradict an augustus gene model and the surplus exon is >>non-essential. >> >>(all evidence is provided as fasta files, protein2genome and est2genome >>are set to 0) >> >>As for the repeat library, I suppose this is a critical point. I am using >>repeats from a closely related species via RepeatMasker, modelled and >>filtered repeats from RepeatModeler and repeats derived from a >>high-coverage 454 data set. Not sure what else I can do to improve that. >> >>As for evidence, I am using the curated reference proteome from a related >>species (<5 million years), uniprot_swissprot and high-quality ESTs + 454 >>reads.
I don't think it gets a whole lot better, in terms of what data >>can be used. >> >>So in summary, I just don't get where I want to be using Augustus and Maker >>- specifically, the gene models are full of weird, unsupported artefacts >>despite manually curating > 850 models for training. I suppose I was >>hoping for some secret trick to improve on this - but I guess there is >>none? Actually, if I only do a pure evidence build (seeing that my input >>data is very high quality), it looks better - which sort of goes against >>the premise of Maker :/ >> >>Regards, >> >>Marc >> >> >> >> >>Marc P. Hoeppner, PhD >>Team Leader >>Department for Medical Biochemistry and Microbiology >>Uppsala University, Sweden >>marc.hoeppner at bils.se >> >>On 27 May 2014, at 17:25, Carson Holt wrote: >> >>> Extra exons can be required for predictors to make sense of a region >>>(they >>> do the best they can). This can be due to imperfect assemblies or >>> repeats. For plants the repeat database is the one thing that will >>> most affect the annotation quality. You may need to spend some time >>> building the best repeat library you can. The repeat library is the >>>next >>> most important thing next to training the predictor, because they >>>confuse >>> the predictor (sometimes a lot) causing it to behave oddly in those >>> regions (because repeats do encode real protein and protein fragments). >>> Also when running now with MAKER make sure to include the entire >>>proteome >>> of a related species and not just UniProt, and you will get better >>> performance. Now that you have Augustus trained, using it inside of >>>MAKER >>> with an improved repeat library and additional protein evidence should >>> give it the feedback that will allow it to perform better than it would >>> with just naked ab initio prediction.
>>> >>> Thanks, >>> Carson >>> >>> >>> On 5/27/14, 2:12 AM, "Marc Höppner" wrote: >>> >>>> Hi, >>>> >>>> I wanted to get some feedback regarding the training of ab-initio gene >>>> finders - it's not strictly Maker related, but I suppose there are >>>>many >>>> people on this list that have encountered and solved this issue in one >>>> way or another. >>>> >>>> Specifically, I am trying to train Augustus (and possibly SNAP) for a >>>> plant genome. This has always been a very frustrating process for me, >>>>but >>>> while I have a better idea now how to do it, I still don't get the >>>>sort >>>> of accuracy that I am hoping for. A quick run-through of my process: >>>> >>>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>>> Sanger-sequenced EST data >>>> >>>> Filtered for Models with an AED <= 0.3 >>>> >>>> Loaded that into WebApollo, together with an existing reference >>>> annotation and the evidence tracks >>>> >>>> Manually curated/selected 750 gene models using the following rules: >>>> - Must have start/stop codon >>>> - Must have as many exons as possible >>>> - Must agree with evidence >>>> - Must be >= 2kb apart from other gene models (provided as flanking >>>> regions for augustus to train intergenic sequence) >>>> >>>> From these models, I created a GBK file, split it into 650 (train) >>>>and >>>> 100 (test) models and created a new profile using the documented >>>> procedure. >>>> >>>> But: >>>> >>>> While the naked ab-init models created through maker get a lot of >>>>genes >>>> "sort of right", I still see too many issues to be really satisfied. >>>> Problems include: >>>> >>>> - random exon calls which are not supported by any line of evidence >>>>(~1 >>>> per gene model, I would guess) >>>> - poor congruency with some gene models (especially ones not used for >>>> training/testing) >>>> >>>> Is there any best-practice guide on how to improve this? The Augustus >>>> website is unfortunately quite poor on detail...
My impression so far is >>>> that ramping up the number of training models isn't really doing too >>>>much >>>> beyond a certain point (tried 400, 500 and 750). >>>> >>>> Regards, >>>> >>>> Marc >>>> >>>> >>>> Marc P. Hoeppner, PhD >>>> Team Leader >>>> BILS Genome Annotation Platform >>>> Department for Medical Biochemistry and Microbiology >>>> Uppsala University, Sweden >>>> marc.hoeppner at bils.se From marc.hoeppner at bils.se Thu Jun 5 02:15:55 2014 From: marc.hoeppner at bils.se (Marc Höppner) Date: Thu, 5 Jun 2014 10:15:55 +0200 Subject: [maker-devel] Some questions regarding ab-initio training In-Reply-To: References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> Message-ID: <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se> Hi, thanks for the feedback. I spent some more time on this and am still somewhat unsatisfied with the whole thing... A few comments: I quite frequently see augustus and by extension Maker including exons that are not supported by EST/Protein evidence and are not critical for the gene model (i.e. I can take them out and still get a proper CDS). Maybe I don't know enough about how Maker creates hints and more importantly what role these hints play for augustus, but I cannot really see a great effect (any, really) on the gene models even if both ESTs and proteins contradict an augustus gene model and the surplus exon is non-essential. (all evidence is provided as fasta files, protein2genome and est2genome are set to 0) As for the repeat library, I suppose this is a critical point.
I am using repeats from a closely related species via RepeatMasker, modelled and filtered repeats from RepeatModeler and repeats derived from a high-coverage 454 data set. Not sure what else I can do to improve that. As for evidence, I am using the curated reference proteome from a related species (<5 million years), uniprot_swissprot and high-quality ESTs + 454 reads. I don't think it gets a whole lot better, in terms of what data can be used. So in summary, I just don't get where I want to be using Augustus and Maker - specifically, the gene models are full of weird, unsupported artefacts despite manually curating > 850 models for training. I suppose I was hoping for some secret trick to improve on this - but I guess there is none? Actually, if I only do a pure evidence build (seeing that my input data is very high quality), it looks better - which sort of goes against the premise of Maker :/ Regards, Marc Marc P. Hoeppner, PhD Team Leader Department for Medical Biochemistry and Microbiology Uppsala University, Sweden marc.hoeppner at bils.se On 27 May 2014, at 17:25, Carson Holt wrote: > Extra exons can be required for predictors to make sense of a region (they > do the best they can). This can be due to imperfect assemblies or > repeats. For plants the repeat database is the one thing that will > most affect the annotation quality. You may need to spend some time > building the best repeat library you can. The repeat library is the next > most important thing next to training the predictor, because they confuse > the predictor (sometimes a lot) causing it to behave oddly in those > regions (because repeats do encode real protein and protein fragments). > Also when running now with MAKER make sure to include the entire proteome > of a related species and not just UniProt, and you will get better > performance.
Now that you have Augustus trained, using it inside of MAKER > with an improved repeat library and additional protein evidence should > give it the feedback that will allow it to perform better than it would > with just naked ab initio prediction. > > Thanks, > Carson > > > On 5/27/14, 2:12 AM, "Marc Höppner" wrote: > >> Hi, >> >> I wanted to get some feedback regarding the training of ab-initio gene >> finders - it's not strictly Maker related, but I suppose there are many >> people on this list that have encountered and solved this issue in one >> way or another. >> >> Specifically, I am trying to train Augustus (and possibly SNAP) for a >> plant genome. This has always been a very frustrating process for me, but >> while I have a better idea now how to do it, I still don't get the sort >> of accuracy that I am hoping for. A quick run-through of my process: >> >> Evidence build with maker on level 1 and 2 proteins from Uniprot + >> Sanger-sequenced EST data >> >> Filtered for Models with an AED <= 0.3 >> >> Loaded that into WebApollo, together with an existing reference >> annotation and the evidence tracks >> >> Manually curated/selected 750 gene models using the following rules: >> - Must have start/stop codon >> - Must have as many exons as possible >> - Must agree with evidence >> - Must be >= 2kb apart from other gene models (provided as flanking >> regions for augustus to train intergenic sequence) >> >> From these models, I created a GBK file, split it into 650 (train) and >> 100 (test) models and created a new profile using the documented >> procedure. >> >> But: >> >> While the naked ab-init models created through maker get a lot of genes >> "sort of right", I still see too many issues to be really satisfied.
>> Problems include: >> >> - random exon calls which are not supported by any line of evidence (~1 >> per gene model, I would guess) >> - poor congruency with some gene models (especially ones not used for >> training/testing) >> >> Is there any best-practice guide on how to improve this? The Augustus >> website is unfortunately quite poor on detail... My impression so far is >> that ramping up the number of training models isn't really doing too much >> beyond a certain point (tried 400, 500 and 750). >> >> Regards, >> >> Marc >> >> >> Marc P. Hoeppner, PhD >> Team Leader >> BILS Genome Annotation Platform >> Department for Medical Biochemistry and Microbiology >> Uppsala University, Sweden >> marc.hoeppner at bils.se From fbarreto at ucsd.edu Thu Jun 5 13:01:05 2014 From: fbarreto at ucsd.edu (Felipe Barreto) Date: Thu, 5 Jun 2014 12:01:05 -0700 Subject: [maker-devel] Generating GFF with selected tracks Message-ID: Hi, all, I would like to produce a gff file that contains Maker gene models AND repeats. I know that using gff3_merge with -g will generate one with only the gene models, but I didn't see any options for adding additional tracks. The way I did this was to use the Re-annotation section in the control file. I provided the original full gff file in maker_gff, and turned on the rm_pass and model_pass. All other options in the control file were turned off. This seemed to work, though it also added a 'model_gff:maker' track, which is not a problem for me. I compared a few new and original scaffolds in Apollo, and all seem to match perfectly.
But since I cannot check the whole genome, I was wondering if what I did was appropriate. Are all the gene models (and their names) and repeat alignments identical between the new and original files? Or is Maker potentially changing a few things since it's treated as a new run? Thanks! -- Felipe Barreto From carsonhh at gmail.com Thu Jun 5 13:02:36 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 05 Jun 2014 13:02:36 -0600 Subject: [maker-devel] protein2genome gene models from protein gff In-Reply-To: <1401994595132.44761@uga.edu> References: <1401994595132.44761@uga.edu> Message-ID: That's what I'd do. But really protein2genome=1 is just meant to get enough rough gene models to train a gene predictor. You don't need to run it across the whole genome. But if you do, when you run again after training the gene predictor, MAKER will detect the old BLAST jobs and they won't have to rerun on the second MAKER run. --Carson From: Sivaranjani Namasivayam Date: Thursday, June 5, 2014 at 12:56 PM To: Carson Holt Subject: RE: [maker-devel] protein2genome gene models from protein gff So what would you suggest is the best way to get protein2genome predictions? Use fasta sequences, instead of gff? Thanks, Ranjani From: Carson Holt Sent: Thursday, June 05, 2014 2:08 PM To: Sivaranjani Namasivayam; maker-devel at yandell-lab.org Subject: Re: [maker-devel] protein2genome gene models from protein gff est_gff assumes the alignments are spliced correctly. The protein2genome option also makes that assumption but with a little less confidence that the user always provides splice aware alignments, so in some instances (like protein2genome=1) it may not pass them forward as guaranteed splice aware alignments.
--Carson From: Sivaranjani Namasivayam Date: Thursday, June 5, 2014 at 9:49 AM To: "maker-devel at yandell-lab.org" Subject: [maker-devel] protein2genome gene models from protein gff Hi, I am trying to predict gene models from protein evidence, using the parameter protein2genome set to 1. I get gene models predicted if I provide the proteins as a fasta file, but not as gff (I want to use a gff format to avoid the blastx step again). Is this expected? In case of transcriptome evidence and est2genome set to 1, I get gene models predicted with both fasta and gff formats. Thanks, Ranjani From carsonhh at gmail.com Thu Jun 5 13:05:30 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 05 Jun 2014 13:05:30 -0600 Subject: [maker-devel] Generating GFF with selected tracks In-Reply-To: References: Message-ID: gff3_merge just merges any two GFF3 files. So if you have two files just give both of them to it. Example --> gff3_merge maker_genes.gff repeats.gff Also if all you are trying to do is filter out certain feature types from the file, just use grep instead. Example --> grep -v -P "\tpred_gff\t" maker.gff Thanks, Carson From: Felipe Barreto Date: Thursday, June 5, 2014 at 1:01 PM To: MAKER group Subject: [maker-devel] Generating GFF with selected tracks Hi, all, I would like to produce a gff file that contains Maker gene models AND repeats. I know that using gff3_merge with -g will generate one with only the gene models, but I didn't see any options for adding additional tracks. The way I did this was to use the Re-annotation section in the control file. I provided the original full gff file in maker_gff, and turned on the rm_pass and model_pass. All other options in the control file were turned off.
This seemed to work, though it also added a 'model_gff:maker' track, which is not a problem for me. I compared a few new and original scaffolds in Apollo, and all seem to match perfectly. But since I cannot check the whole genome, I was wondering if what I did was appropriate. Are all the gene models (and their names) and repeat alignments identical between the new and original files? Or is Maker potentially changing a few things since it's treated as a new run? Thanks! -- Felipe Barreto From dence at genetics.utah.edu Thu Jun 5 13:08:08 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Thu, 5 Jun 2014 19:08:08 +0000 Subject: [maker-devel] Generating GFF with selected tracks In-Reply-To: References: Message-ID: Hi Felipe, I seem to remember that some of the gene model names did change when I did things similar to what you described. I think that you could accomplish the same thing with some cat and grep commands on the full gff. That would avoid the trouble of rerunning maker. Something like "cat full.gff | grep -P "\trepeatrunner\t" > tmp.gff " would get you started. ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 5, 2014, at 12:01 PM, Felipe Barreto wrote: Hi, all, I would like to produce a gff file that contains Maker gene models AND repeats. I know that using gff3_merge with -g will generate one with only the gene models, but I didn't see any options for adding additional tracks. The way I did this was to use the Re-annotation section in the control file. I provided the original full gff file in maker_gff, and turned on the rm_pass and model_pass.
All other options in the control file were turned off. This seemed to work, though it also added a 'model_gff:maker' track, which is not a problem for me. I compared a few new and original scaffolds in Apollo, and all seem to match perfectly. But since I cannot check the whole genome, I was wondering if what I did was appropriate. Are all the gene models (and their names) and repeat alignments identical between the new and original files? Or is Maker potentially changing a few things since it's treated as a new run? Thanks! -- Felipe Barreto From fbarreto at ucsd.edu Thu Jun 5 14:07:51 2014 From: fbarreto at ucsd.edu (Felipe Barreto) Date: Thu, 5 Jun 2014 13:07:51 -0700 Subject: [maker-devel] Generating GFF with selected tracks In-Reply-To: References: Message-ID: OK, I see. I will just use grep to extract the desired features from the full.gff and merge them with gff3_merge. Don't know why I was making it more complicated. I guess I don't understand gff formats very well quite yet. Thanks yet again! On Thu, Jun 5, 2014 at 12:08 PM, Daniel Ence wrote: > Hi Felipe, I seem to remember that some of the gene model names did > change when I did things similar to what you described. I think that you > could accomplish the same thing with some cat and grep commands on the full > gff. That would avoid the trouble of rerunning maker. Something like "cat > full.gff | grep -P "\trepeatrunner\t" > tmp.gff " would get you started.
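The grep-based selection discussed in this thread can be sketched in one pass over the source column of the GFF3. This is only an illustration under stated assumptions: the file names and the sample rows are made up, and the source tags `maker` and `repeatmasker` are assumed alongside the `repeatrunner` tag mentioned above, so it is worth checking which tags a given file actually contains (for example with `cut -f2 full.gff | sort -u`) before relying on them.

```shell
#!/bin/sh
# Tiny stand-in for a full MAKER GFF3 (hypothetical rows, tab-separated).
printf '##gff-version 3\n' > full.gff
printf 'scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n' >> full.gff
printf 'scaf1\trepeatmasker\tmatch\t1000\t1200\t.\t+\t.\tID=r1\n' >> full.gff
printf 'scaf1\trepeatrunner\tprotein_match\t1500\t1700\t.\t+\t.\tID=r2\n' >> full.gff
printf 'scaf1\tblastx\tprotein_match\t100\t300\t.\t+\t.\tID=b1\n' >> full.gff

# Keep comment/header lines plus rows whose source column (field 2) is one of
# the assumed gene-model or repeat tracks; every other track is dropped.
awk -F'\t' '/^#/ || $2 == "maker" || $2 == "repeatmasker" || $2 == "repeatrunner"' \
    full.gff > genes_and_repeats.gff

# Summarise which tracks survived the filter.
cut -f2 genes_and_repeats.gff | grep -v '^#' | sort | uniq -c
```

Because the rows are copied unchanged from the original file, gene names and coordinates stay exactly as MAKER wrote them, which is the concern raised about re-running MAKER. A file that carries an embedded FASTA section would need that part handled separately.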
> > ~Daniel > > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 5, 2014, at 12:01 PM, Felipe Barreto > wrote: > > Hi, all, > > I would like to produce a gff file that contains Maker gene models AND > repeats. I know that using gff3_merge with -g will generate one with only > the gene models, but I didn't see any options for adding additional tracks. > > The way I did this was to use the Re-annotation section in the control > file. I provided the original full gff file in maker_gff, and turned on > the rm_pass and model_pass. All other options in the control file were > turned off. This seemed to work, though it also added a 'model_gff:maker' > track, which is not a problem for me. I compared a few new and original > scaffolds in Apollo, and all seem to match perfectly. But since I cannot > check the whole genome, I was wondering if what I did was appropriate. Are > all the gene models (and their names) and repeat alignments identical > between the new and original files? Or is Maker potentially changing a few > things since it's treated as a new run? > > Thanks! > > -- > Felipe Barreto > > -- Felipe Barreto Post-doctoral Scholar Scripps Institution of Oceanography University of California, San Diego La Jolla, CA 92093 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From daniel.standage at gmail.com Fri Jun 6 10:33:06 2014 From: daniel.standage at gmail.com (Daniel Standage) Date: Fri, 6 Jun 2014 12:33:06 -0400 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: Another question: is there documentation anywhere for the naming conventions of the genes annotated by Maker? Of course it's easy to spot genes based on a particular *ab initio* gene predictor, as the names are in the IDs. But what is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs "maker-$seqid-snap-gene"? Thanks, Daniel -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage wrote: > I have attached data for a small 18kb region with a handful of genes, as > well as the corresponding maker_opts.ctl file. (This is a smaller and > different data set than what I was looking at yesterday, with a more > well-defined problem). > > With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400 > with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a > different gene from 6111 to 8345 with an AED of 0.01. Both of these genes > have transcript support: will Maker report overlapping genes under any > conditions? And even if Maker is forced to choose only a single gene to > report, why would the model from 4125 to 6400 ever be reported in place of > the one from 6111 to 8345, especially since this is provided in the > model_gff file? > > Even when transcript TSA024184 is included, Maker 2.10 reports the > high-confidence gene from 611 to 8345. > > Any light you could shed would be helpful. Thanks! > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote: > >> Just eAED, but eAED can affect selection of ab initio results.
For >> example reading frame match of protein evidence which also affects whether >> evidence from single_exon=1 and genes with single_exon protein evidence get >> kept. There is also the assumption that your alignments in GFF3 are >> correctly spliced (like BLAT does). So giving blastn results as >> precomputed est_gff would create a lot of noise, since maker ignores blastn >> and is using it only to seed the polished exonerate alignments. >> >> --Carson >> >> >> From: Daniel Standage >> Date: Wednesday, June 4, 2014 at 1:11 PM >> To: Carson Holt >> Cc: Maker Mailing List >> Subject: Re: [maker-devel] Filtering of ab initio gene models >> >> I do not provide Gap or Target attributes in the GFF3. Will this affect >> the AED as well, or just the eAED? >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: >> >>> Sure, that would be helpful. One question. Do you provide the Gap >>> attribute in your precomputed alignments? Having or not having that >>> attribute affects the eAED score which takes reading frame into account, >>> and may cause some things to be kept that normally would be dropped, >>> because MAKER won't be able to take the points of mismatch of the alignment >>> into account (it just assumes match everywhere). >>> >>> --Carson >>> >>> >>> From: Daniel Standage >>> Date: Wednesday, June 4, 2014 at 1:03 PM >>> To: Maker Mailing List >>> Subject: [maker-devel] Filtering of ab initio gene models >>> >>> Thanks everyone for your responses recently! >>> >>> The reason for my recent flurry of email activity is that I'm seeing >>> some unexpected trends when running the new version of Maker with >>> precomputed alignments. Compared with an annotation I did a while ago >>> (Maker 2.10, Maker-computed alignments), this new annotation has a >>> substantial number of new genes annotated.
If I compare distributions of >>> AED scores between the old and new annotation, it's clear that the new >>> annotation has a lot more low-quality models. If I look at new gene models >>> that do not overlap with any gene model from the old annotation, the >>> likelihood that it's a low-quality model is much higher. >>> >>> I decided to run a little experiment. I annotated a scaffold first using >>> Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same >>> pre-computed transcript and protein alignments and the same (latest) >>> version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted >>> 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci >>> by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 >>> locus with only models from 2.10, and 28 loci with only models from 2.31.3. >>> >>> Before this experiment, I assumed the issue was related to providing >>> pre-computed alignments in GFF3 format and perhaps violating some important >>> assumption. However, this experiment makes me wonder whether there have >>> been changes to how Maker filters *ab initio* gene models between >>> version 2.10 and version 2.31.3? Do you have any ideas? If it would help, I >>> could put together a small data set that reproduces the behavior I just >>> described. >>> >>> Thanks! >>> >>> -- >>> Daniel S. Standage >>> Ph.D. Candidate >>> Computational Genome Science Laboratory >>> Indiana University -------------- next part -------------- An HTML attachment was scrubbed...
URL: From carsonhh at gmail.com Fri Jun 6 10:39:43 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 06 Jun 2014 10:39:43 -0600 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked sequence without hints (i.e. the ab initio call). maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER. In both cases MAKER is allowed to add UTR to the model (hence the 'processed' tag). --Carson From: Daniel Standage Date: Friday, June 6, 2014 at 10:33 AM To: Carson Holt Cc: Maker Mailing List , Volker Brendel Subject: Re: [maker-devel] Filtering of ab initio gene models Another question: is there documentation anywhere for the naming conventions of the genes annotated by Maker? Of course it's easy to spot genes based on a particular ab initio gene predictor, as the names are in the IDs. But what is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs "maker-$seqid-snap-gene"? Thanks, Daniel -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage wrote: > I have attached data for a small 18kb region with a handful of genes, as well > as the corresponding maker_opts.ctl file. (This is a smaller and different > data set than what I was looking at yesterday, with a more well-defined > problem). > > With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400 with > an AED of 0.23. If you exclude transcript TSA024184, Maker reports a different > gene from 6111 to 8345 with an AED of 0.01. Both of these genes have > transcript support: will Maker report overlapping genes under any conditions? > And even if Maker is forced to choose only a single gene to report, why would > the model from 4125 to 6400 ever be reported in place of the one from 6111 to > 8345, especially since this is provided in the model_gff file? 
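The naming convention described in Carson's reply above (`snap_masked-$seqid-processed-gene` for the unhinted ab initio call, `maker-$seqid-snap-gene` for the SNAP call made with MAKER hints) could be used to tally how many reported genes fall into each class. The following is a minimal sketch, assuming the gene ID is the first attribute in column 9; the file name `genes.gff` and the three sample rows are hypothetical, and only the SNAP patterns from this thread are covered.

```shell
#!/bin/sh
# Hypothetical gene rows using the two ID styles discussed in the thread.
printf 'scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=maker-scaf1-snap-gene-0.1\n' > genes.gff
printf 'scaf1\tmaker\tgene\t2000\t2900\t.\t-\t.\tID=snap_masked-scaf1-processed-gene-0.2\n' >> genes.gff
printf 'scaf1\tmaker\tgene\t4000\t4900\t.\t+\t.\tID=maker-scaf1-snap-gene-0.3\n' >> genes.gff

# Classify each gene feature by the shape of its ID attribute.
awk -F'\t' '$3 == "gene" {
    if ($9 ~ /^ID=maker-.*-snap-gene/)                 hinted++
    else if ($9 ~ /^ID=snap_masked-.*-processed-gene/) abinitio++
    else                                               other++
}
END { printf "hinted=%d ab_initio=%d other=%d\n", hinted, abinitio, other }' \
    genes.gff > id_counts.txt

cat id_counts.txt
```

A high share of unhinted ab initio IDs among the final models may be one way to spot the kind of weakly supported genes discussed in this thread, though that interpretation is an assumption rather than something MAKER reports directly.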

From daniel.standage at gmail.com Fri Jun 6 10:46:41 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 12:46:41 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Good to know, thanks. If multiple *ab initio* predictors inform a single
annotation, how does Maker decide which one will be included in the gene's ID?

Given your quick response just now, I wanted to confirm that you got the
message and data set I sent yesterday. I received an email saying the size of
my message required list admin approval to be distributed, but since you were
also a direct recipient of the email I didn't worry about it too much.

Thanks again!

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote:
> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat
> masked sequence without hints (i.e. the ab initio call).
> maker-$seqid-snap-gene was produced by SNAP after receiving hints from
> MAKER.
>
> In both cases MAKER is allowed to add UTR to the model (hence the
> 'processed' tag).
>
> --Carson
>
> From: Daniel Standage
> Date: Friday, June 6, 2014 at 10:33 AM
> To: Carson Holt
> Cc: Maker Mailing List, Volker Brendel <vbrendel at indiana.edu>
> Subject: Re: [maker-devel] Filtering of ab initio gene models
>
> Another question: is there documentation anywhere for the naming
> conventions of the genes annotated by Maker? Of course it's easy to spot
> genes based on a particular *ab initio* gene predictor, as the names are
> in the IDs.

From carsonhh at gmail.com Fri Jun 6 10:56:38 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 06 Jun 2014 10:56:38 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

I got the e-mail. Thanks for the test set.

Multiple ab initio predictors don't inform a single annotation, rather one
must be chosen from the pool of available models (i.e. it has to be SNAP,
Augustus, or GeneMark). They all supply their own ab initio as well as
hint-based predictions, and then the one with the best evidence match
(measured by AED) is kept (it's like a competition that only one predictor
can win).
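
The competition described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not MAKER's actual code: the `Candidate` record, the `pick_winner` helper, and the example IDs and AED values are all hypothetical. It assumes the usual AED convention, where 0 means the model agrees completely with the evidence and 1 means no evidence support, so the lowest AED wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One competing gene model at a locus (hypothetical record, not a MAKER type)."""
    model_id: str
    predictor: str
    aed: float  # 0.0 = full agreement with evidence, 1.0 = no support


def pick_winner(candidates):
    """Return the candidate with the lowest AED; ties keep the first one seen."""
    if not candidates:
        raise ValueError("no candidate models at this locus")
    return min(candidates, key=lambda c: c.aed)


# Made-up candidates for a single locus: one ab initio call and two hinted calls.
locus = [
    Candidate("snap_masked-scaf1-processed-gene-0.1", "snap (ab initio)", 0.23),
    Candidate("maker-scaf1-snap-gene-0.2", "snap (hinted)", 0.01),
    Candidate("maker-scaf1-augustus-gene-0.3", "augustus (hinted)", 0.10),
]

winner = pick_winner(locus)
print(winner.model_id, winner.aed)
```

Only one model survives per locus in this sketch, which mirrors the "only one predictor can win" description; how MAKER groups models into a locus and breaks ties is not covered here.
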

If you want a consensus model instead, you can take MAKER results in GFF3
format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is a
collaboration with the EVM group and will have this option, but for now users
can just split the MAKER GFF3 by evidence types and give it to EVM. EVM then
produces consensus models based on the GFF3 content.

--Carson

From: Daniel Standage
Date: Friday, June 6, 2014 at 10:46 AM
To: Carson Holt
Cc: Maker Mailing List, Volker Brendel
Subject: Re: [maker-devel] Filtering of ab initio gene models

Good to know, thanks. If multiple ab initio predictors inform a single
annotation, how does Maker decide which one will be included in the gene's ID?

Given your quick response just now, I wanted to confirm that you got the
message and data set I sent yesterday. I received an email saying the size of
my message required list admin approval to be distributed, but since you were
also a direct recipient of the email I didn't worry about it too much.

Thanks again!

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote:
> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked
> sequence without hints (i.e. the ab initio call).
> maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER.
>
> In both cases MAKER is allowed to add UTR to the model (hence the 'processed'
> tag).
>
> --Carson
>
> From: Daniel Standage
> Date: Friday, June 6, 2014 at 10:33 AM
> To: Carson Holt
> Cc: Maker Mailing List, Volker Brendel
> Subject: Re: [maker-devel] Filtering of ab initio gene models
>
> Another question: is there documentation anywhere for the naming conventions
> of the genes annotated by Maker? Of course it's easy to spot genes based on a
> particular ab initio gene predictor, as the names are in the IDs.

From daniel.standage at gmail.com Fri Jun 6 10:59:16 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 12:59:16 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

This helps, thanks.

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 12:56 PM, Carson Holt wrote:
> I got the e-mail. Thanks for the test set.
>
> Multiple *ab initio* predictors don't inform a single annotation, rather
> one must be chosen from the pool of available models (I.e. it has to be
> SNAP or Augustus, or GeneMark).

From daniel.standage at gmail.com Fri Jun 6 12:38:23 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 14:38:23 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Carson et al,

Your feedback so far has been very helpful, and we are grateful for the time
you have taken to respond!

We're still trying to understand the precise procedure by which competing
models are chosen. You mentioned that a single model must be chosen (via AED)
from a pool of available models: are these pools constructed by overlap? It is
not uncommon in our experience to see overlapping genes reported by Maker,
although for the most part it appears these overlapping genes don't have CDS
overlap.

Looking more closely at the Maker 2.10 output from the test data we sent
yesterday, we also noted that exclusion of the transcript in question also had
an effect on the interval exon structure (exon 7717-7776 becomes exon
7737-7776) of a downstream model with which it overlaps by 3 nucleotides.

And still unclear to us is how the model_gff data fits in with all this. From
my previous searching of the list archives I was under the impression that
these models would be given substantial weight in the prediction process, and
would only be altered if a considerably better model could be identified. Our
experience with this small data set, though, is that which overlapping gene is
reported, and which corresponding exon structure is selected, is dependent on
very slight changes in the evidence.

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 12:59 PM, Daniel Standage wrote:
> This helps, thanks.
>
> --
> Daniel S. Standage
> Ph.D. Candidate
> Computational Genome Science Laboratory
> Indiana University
>
> On Fri, Jun 6, 2014 at 12:56 PM, Carson Holt wrote:
>> I got the e-mail. Thanks for the test set.
>>
>> Multiple *ab initio* predictors don't inform a single annotation, rather
>> one must be chosen from the pool of available models (I.e. it has to be
>> SNAP or Augustus, or GeneMark). They all supply their own *ab initio*
>> as well as hint based prediction, and then the one with best evidence match
>> (measured by AED) is kept (it's like a competition that only one predictor
>> can win).
>>
>> If you want a consensus model instead, you can take MAKER results in GFF3
>> format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is
>> a collaboration with the EVM group and will have this option, but for now
>> users can just split the MAKER GFF3 by evidence types and give it to EVM.
>> EVM then produces consensus models based on the GFF3 content.
>>
>> --Carson
>>
>> From: Daniel Standage
>> Date: Friday, June 6, 2014 at 10:46 AM
>> To: Carson Holt
>> Cc: Maker Mailing List, Volker Brendel <vbrendel at indiana.edu>
>> Subject: Re: [maker-devel] Filtering of ab initio gene models
>>
>> Good to know, thanks. If multiple *ab initio* predictors inform a single
>> annotation, how does Maker decide which one will be included in the gene's
>> ID?
>>
>> Given your quick response just now, I wanted to confirm that you got the
>> message and data set I sent yesterday. I received an email saying the size
>> of my message required list admin approval to be distributed, but since you
>> were also a direct recipient of the email I didn't worry about it too much.
>>
>> Thanks again!
>>>>>>
>>>>>> Before this experiment, I assumed the issue was related to providing
>>>>>> pre-computed alignments in GFF3 format and perhaps violating some important
>>>>>> assumption. However, this experiment makes me wonder whether there have
>>>>>> been changes to how Maker filters *ab initio* gene models between
>>>>>> version 2.10 and version 2.31.3? Do you have any ideas? If it would help, I
>>>>>> could put together a small data set that reproduces the behavior I just
>>>>>> described.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> Daniel S. Standage
>>>>>> Ph.D. Candidate
>>>>>> Computational Genome Science Laboratory
>>>>>> Indiana University

From carsonhh at gmail.com Fri Jun 6 12:51:18 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 06 Jun 2014 12:51:18 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

There can be overlapping models if you have multiple gene predictors. Also,
the hint based models will overlap the ab initio models, but you may never get
to see them: they are not kept in the evidence, because they are confusing and
really not useful unless they are chosen as the best model.

All models, regardless of location and overlap, get sorted by their AED score
(lower is better). The best model is kept first, then the next best, and so
on. If the next best model overlaps a model that has already come off the list
(which means the other model has a better AED score), then it gets skipped,
and the next best one in the list gets added to the non-overlapping space.
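
The sort-then-skip procedure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MAKER's implementation: the `Model` record, its fields, and the functions `overlaps` and `select_models` are hypothetical names, the coordinates and scores are made up, and the sketch assumes that only models on the same strand compete for the same space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """Hypothetical gene-model record (not a MAKER data structure)."""
    name: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "+" or "-"
    aed: float    # 0.0 = best agreement with evidence, 1.0 = no support


def overlaps(a: Model, b: Model) -> bool:
    """Assumption: only same-strand models compete for the same space."""
    return a.strand == b.strand and a.start <= b.end and b.start <= a.end


def select_models(candidates):
    """Sort by AED, then keep each model unless it overlaps a kept one."""
    kept = []
    for model in sorted(candidates, key=lambda m: m.aed):
        if any(overlaps(model, k) for k in kept):
            continue  # a better-scoring model already occupies this space
        kept.append(model)
    return sorted(kept, key=lambda m: m.start)


# Made-up coordinates and scores, for illustration only.
candidates = [
    Model("model-A", 100, 900, "+", 0.23),
    Model("model-B", 800, 1500, "+", 0.01),
    Model("model-C", 1600, 2000, "+", 0.40),
    Model("model-D", 850, 1200, "-", 0.50),
]
selected = select_models(candidates)
print([m.name for m in selected])
```

In this toy input, model-A is skipped because it overlaps the better-scoring model-B on the same strand, while model-D survives despite its worse score because it lies on the opposite strand.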
The result is that the final models will be non-redundant and non-overlapping,
but if you look at the evidence alignments you will find ab initio models
different than the MAKER models that were rejected and do not overlap the
final models.

model_gff entries compete on AED just like any other model. Ties always go to
model_gff, and if there is a region where no model gets chosen (they all have
an AED of 1) and a model_gff entry will fit (even with an AED score of 1), then
it will be chosen, because model_gff entries do not need evidence support to
end up in the final annotations.

--Carson

From: Daniel Standage
Date: Friday, June 6, 2014 at 12:38 PM
To: Carson Holt
Cc: Maker Mailing List, Volker Brendel
Subject: Re: [maker-devel] Filtering of ab initio gene models

Carson et al,

Your feedback so far has been very helpful, and we are grateful for the time
you have taken to respond!

We're still trying to understand the precise procedure by which competing
models are chosen. You mentioned that a single model must be chosen (via AED)
from a pool of available models: are these pools constructed by overlap? It is
not uncommon in our experience to see overlapping genes reported by Maker,
although for the most part it appears these overlapping genes don't have CDS
overlap. Looking more closely at the Maker 2.10 output from the test data we
sent yesterday, we also noted that exclusion of the transcript in question had
an effect on the interval exon structure (exon 7717-7776 becomes exon
7737-7776) of a downstream model with which it overlaps 3 nucleotides.

And still unclear to us is how the model_gff data fits in with all this. From
my previous searching of the list archives I was under the impression that
these models would be given substantial weight in the prediction process, and
would only be altered if a considerably better model could be identified. Our
experience with this small data set, though, is that which overlapping gene is
reported, and which corresponding exon structure is selected, depends on very
slight changes in the evidence.

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

From daniel.standage at gmail.com Fri Jun 6 17:58:26 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 19:58:26 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To: <53923808.7030401@indiana.edu>
References: <53923808.7030401@indiana.edu>
Message-ID:

In the example sent previously, transcript TSA024184 overlaps with the 3' end
of our gene model's CDS by 3 nucleotides. If I manually change the
transcript's end coordinate (6400 to 6100) so that there are two separate
non-overlapping evidence clusters, two models are reported as expected. But I
can even get both models reported with a much smaller change (6400 to 6395),
where the UTRs still overlap but the CDS does not overlap with the UTR.

The 5' end of our gene model's CDS also overlaps with another transcript.
Maker has no problem reporting both of these gene models though, probably
since they're on different strands?
So correct me if I'm wrong, but it appears that Maker will report overlapping
gene models if they are on opposite strands or if no CDS is involved in the
overlap. Is there any way this behavior can be configured?

On another note, we're considering your suggestion to integrate EVM with
Maker. One possibility discussed is to run Maker 4 separate times (once for
each of Augustus, GeneMark, SNAP, and our model_gff models), each time with
all our transcript/protein evidence, prior to consensus modeling with EVM.
Would that provide any benefit over running Maker a single time with all
prediction sources simultaneously?

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 5:52 PM, Volker Brendel wrote:

> Hi Carson,
> is there a way of allowing MAKER to add UTRs to our external models
> (supplied by the pred_gff or model_gff tag)? This seems to be one problem
> we are running into. Our external models are high quality, but CDS only.
> Thus their score gets knocked down relative to ab initio predictions with
> added UTRs.
>
> Daniel will have more questions/observations later with regard to
> overlapping gene models (we definitely need to allow gene models to overlap
> in the UTRs, because transcript evidence clearly shows such negative
> intergenic spaces).
>
> Thanks for all your help!
> Volker

From vbrendel at indiana.edu Fri Jun 6 15:52:08 2014
From: vbrendel at indiana.edu (Volker Brendel)
Date: Fri, 06 Jun 2014 16:52:08 -0500
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID: <53923808.7030401@indiana.edu>

Hi Carson,

is there a way of allowing MAKER to add UTRs to our external models (supplied
by the pred_gff or model_gff tag)? This seems to be one problem we are running
into. Our external models are high quality, but CDS only. Thus their score
gets knocked down relative to ab initio predictions with added UTRs.

Daniel will have more questions/observations later with regard to overlapping
gene models (we definitely need to allow gene models to overlap in the UTRs,
because transcript evidence clearly shows such negative intergenic spaces).

Thanks for all your help!
Volker

--
Volker Brendel
Professor of Biology and Computer Science
Indiana University
Department of Biology & School of Informatics and Computing
Simon Hall 205C
212 South Hawthorne Drive
Bloomington, IN 47405-7003
Tel.: (812) 855-7074
http://brendelgroup.org/

From carsonhh at gmail.com Sat Jun 7 14:03:18 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:03:18 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References: <53923808.7030401@indiana.edu>
Message-ID:

The problem in the example you sent is the geneseqer entries in the GFF3 you
are passing in. They are causing gene clusters to merge. The result is that
UTR is being over-extended and overlaps on the models (and probably some
models get merged). As you noticed, you can't have overlapping models on the
same strand. If you set score_preds=1 in the maker_opts.ctl file it will give
you AED scores for the rejected ab initio models. You will notice that none of
them score better than 0.23.

One thing you can do is set correct_est_fusion=1. This tries to correct for
erroneous EST/transcript evidence that leads to over-extended UTR and false
gene merging. You will see in the attached image that it trims back the
overlapping 3' and 5' UTR for the overlapping gene models, given that MAKER
believes the evidence leading to the overlap is likely low confidence and is a
false merge of regions.

I think much of your geneseqer input is more of a problem than a help for the
annotation. Many seem to be spurious alignments.

--Carson

From: Daniel Standage
Date: Friday, June 6, 2014 at 5:58 PM
To: Volker Brendel
Cc: Carson Holt, Maker Mailing List
Subject: Re: [maker-devel] Filtering of ab initio gene models

In the example sent previously, transcript TSA024184 overlaps with the 3' end
of our gene model's CDS by 3 nucleotides. If I manually change the
transcript's end coordinate (6400 to 6100) so that there are two separate
non-overlapping evidence clusters, two models are reported as expected. But I
can even get both models reported with a much smaller change (6400 to 6395),
where the UTRs still overlap but the CDS does not overlap with the UTR.

The 5' end of our gene model's CDS also overlaps with another transcript.
Maker has no problem reporting both of these gene models though, probably since they're on different strands? So correct me if I'm wrong, but it appears that Maker will report overlapping gene models if they are on opposite strands or if no CDS is involved in the overlap. Is there any way this behavior can be configured? On another note, we're considering your suggestion to integrate EVM with Maker. One possibility discussed is to run Maker 4 separate times (once for each of Augustus, GeneMark, SNAP, and our model_gff models), each time with all our transcript/protein evidence, prior to consensus modeling with EVM. Would that provide any benefit over running Maker a single time with all prediction sources simultaneously? Thanks, Daniel -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Fri, Jun 6, 2014 at 5:52 PM, Volker Brendel wrote: > > Hi Carson, > is there a way of allowing MAKER to add UTRs to our external models (supplied > by the pred_gff or model_gff tag)? This seems to be one problem we are > running into. Our external models are high quality, but CDS only. Thus their > score gets knocked down relative to ab initio predictions with added UTRs. > > Daniel will have more questions/observations later with regard to overlapping > gene models (we definitely need to allow gene models to overlap in the UTRs, > because transcript evidence clearly shows such negative intergenic spaces). > > Thanks for all your help! > Volker > > > > On 6/6/2014 11:39 AM, Carson Holt wrote: > > >> >> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked >> sequence without hints (i.e. the ab initio call). >> >> maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER. >> >> >> >> >> In both cases MAKER is allowed to add UTR to the model (hence the 'processed' >> tag). 
>> >> >> >> >> --Carson >> >> >> >> >> >> >> >> From: Daniel Standage >> Date: Friday, June 6, 2014 at 10:33 AM >> To: Carson Holt >> Cc: Maker Mailing List , Volker Brendel >> >> Subject: Re: [maker-devel] Filtering of ab initio gene models >> >> >> >> >> >> >> >> Another question: is there documentation anywhere for the naming conventions >> of the genes annotated by Maker? Of course it's easy to spot genes based on a >> particular ab initio gene predictor, as the names are in the IDs. But what is >> the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs >> "maker-$seqid-snap-gene"? >> >> >> Thanks, >> >> Daniel >> >> >> >> >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> >> >> >> On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage >> wrote: >> >>> >>> >>> >>> I have attached data for a small 18kb region with a handful of genes, as >>> well as the corresponding maker_opts.ctl file. (This is a smaller and >>> different data set than what I was looking at yesterday, with a more >>> well-defined problem). >>> >>> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400 >>> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a >>> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes >>> have transcript support: will Maker report overlapping genes under any >>> conditions? And even if Maker is forced to choose only a single gene to >>> report, why would the model from 4125 to 6400 ever be reported in place of >>> the one from 6111 to 8345, especially since this is provided in the >>> model_gff file? >>> >>> >>> Even when transcript TSA024184 is included, Maker 2.10 reports the >>> high-confidence gene from 611 to 8345. >>> >>> >>> Any light you could shed would be helpful. Thanks! >>> >>> >>> >>> >>> >>> >>> >>> -- >>> Daniel S. Standage >>> Ph.D. 
Candidate >>> Computational Genome Science Laboratory >>> Indiana University >>> >>> >>> >>> >>> >>> >>> >>> >>> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote: >>> >>>> >>>> >>>> Just eAED, but eAED can affects selection of ab initio results. For >>>> example reading frame match of protein evidence which also affects whether >>>> evidence from single_exon=1 and genes with single_exon protein evidence get >>>> kept. There is also the assumption that your alignments in GFF3 are are >>>> correctly spliced (like BLAT does). So giving blastn results as >>>> precomputed est_gff would create a lot of noise, since maker ignores blastn >>>> and is using it only to seed the polished exonerate alignments. >>>> >>>> >>>> >>>> >>>> --Carson >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> From: Daniel Standage >>>> Date: Wednesday, June 4, 2014 at 1:11 PM >>>> To: Carson Holt >>>> Cc: Maker Mailing List >>>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> I do not provide Gap or Target attributes in the GFF3. Will this affect the >>>> AED as well, or just the eAED? >>>> >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> Daniel S. Standage >>>> Ph.D. Candidate >>>> Computational Genome Science Laboratory >>>> Indiana University >>>> >>>> >>>> >>>> >>>> >>>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: >>>> >>>>> >>>>> >>>>> Sure. that would be helpful. One question. Do you provide the Gap >>>>> attribute in your precomputed alignments? Having or not having that >>>>> attribute affects the eAED score which takes reading frame into account, >>>>> and may cause some things to be kept that normally would be dropped, >>>>> because MAKER won't be able to take the points of mismatch of the >>>>> alignment into account (it just assumes match everywhere). 
>>>>> >>>>> >>>>> >>>>> >>>>> --Carson >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> From: Daniel Standage >>>>> Date: Wednesday, June 4, 2014 at 1:03 PM >>>>> To: Maker Mailing List >>>>> Subject: [maker-devel] Filtering of ab initio gene models >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Thanks everyone for your responses recently! >>>>> >>>>> >>>>> The reason for my recent flurry of email activity is that I'm seeing some >>>>> unexpected trends when running the new version of Maker with precomputed >>>>> alignments. Compared with an annotation I did a while ago (Maker 2.10, >>>>> Maker-computed alignments), this new annotation has a substantial number >>>>> of new genes annotated. If I compare distributions of AED scores between >>>>> the old and new annotation, it's clear that the new annotation has a lot >>>>> more low-quality models. If I look at new gene models that do not overlap >>>>> with any gene model from the old annotation, the likelihood that it's a >>>>> low-quality model is much higher. >>>>> >>>>> >>>>> I decided to run a little experiment. I annotated a scaffold first using >>>>> Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same >>>>> pre-computed transcript and protein alignments and the same (latest) >>>>> version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44 >>>>> genes while Maker 2.31.3 predicted 63. If we group gene models into loci >>>>> by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, >>>>> 1 locus with only models from 2.10, and 28 loci with only models from >>>>> 2.31.3. >>>>> >>>>> >>>>> Before this experiment, I assumed the issue was related to providing >>>>> pre-computed alignments in GFF3 format and perhaps violating some >>>>> important assumption. However, this experiment makes me wonder whether >>>>> there have been changes to how Maker filters ab initio gene models between >>>>> version 2.10 and version 2.31.3? Do you have any ideas? 
If it would help, >>>>> I could put together a small data set that reproduces the behavior I just >>>>> described. >>>>> >>>>> Thanks! >>>>> >>>>> -- >>>>> Daniel S. Standage >>>>> Ph.D. Candidate >>>>> Computational Genome Science Laboratory >>>>> Indiana University >>>>> >>>>> _______________________________________________ maker-devel mailing list >>>>> maker-devel at box290.bluehost.com >>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>>>> >>>> >>> >> > > -- > Volker Brendel > Professor of Biology and Computer Science > Indiana University > Department of Biology & School of Informatics and Computing > Simon Hall 205C > 212 South Hawthorne Drive > Bloomington, IN 47405-7003 > > Tel.: (812) 855-7074 http://brendelgroup.org/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 6246A771-E6A9-4875-9362-DC8A7A5BC9C4.png Type: image/png Size: 48365 bytes Desc: not available URL: From carsonhh at gmail.com Sat Jun 7 14:07:41 2014 From: carsonhh at gmail.com (Carson Holt) Date: Sat, 07 Jun 2014 14:07:41 -0600 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: <53923808.7030401@indiana.edu> Message-ID: Example (attached) of geneseqer GFF3 input causing problems. Notice that all the geneseqer features are almost exact representations of the transposon; they are essentially reintroducing all the noise that repeat masking tried to remove (they are giving hints to the gene predictor to try to call the transposon as a gene).
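One way to look for this kind of problem before running MAKER is to measure how much of each evidence alignment falls inside annotated repeats. The sketch below is only an illustration under stated assumptions: it is not part of MAKER, the function names and the 0.9 cutoff are hypothetical, and intervals are assumed to be 1-based inclusive (start, end) pairs on a single sequence, as in GFF3.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def repeat_fraction(feature, repeats):
    """Fraction of a feature's bases that lie inside any repeat interval."""
    f_start, f_end = feature
    covered = 0
    for r_start, r_end in merge_intervals(repeats):
        overlap = min(f_end, r_end) - max(f_start, r_start) + 1
        if overlap > 0:
            covered += overlap
    return covered / (f_end - f_start + 1)


def flag_repeat_like(features, repeats, cutoff=0.9):
    """Return features whose repeat-covered fraction is >= cutoff.

    The cutoff is a hypothetical value chosen for illustration only.
    """
    return [f for f in features if repeat_fraction(f, repeats) >= cutoff]


# Hypothetical coordinates, not taken from the thread.
repeats = [(1000, 2000), (1500, 2500)]
evidence = [(1100, 2400), (2400, 3400), (5000, 6000)]
flagged = flag_repeat_like(evidence, repeats)
print(flagged)
```

Features flagged this way would still need to be inspected by eye, since a high overlap with repeats does not by itself prove that an alignment is spurious.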
--Carson -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 6246A771-E6A9-4875-9362-DC8A7A5BC9C4.png Type: image/png Size: 48365 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 48C1E0B9-001D-44C9-8D8E-37A52E4A17E8.png Type: image/png Size: 6592 bytes Desc: not available URL: From carsonhh at gmail.com Sat Jun 7 14:11:43 2014 From: carsonhh at gmail.com (Carson Holt) Date: Sat, 07 Jun 2014 14:11:43 -0600 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: <53923808.7030401@indiana.edu> References: <53923808.7030401@indiana.edu> Message-ID: If you give input as pred_gff, set keep_preds=1, and then give MAKER EST evidence to work with then MAKER will just pass_through the pred_gff data you gave it with UTR added. Set correct_est_fusion=1 if your input contains false merges across regions (common from mRNA-seq results).
This will trim overlapping UTR caused by the improperly merged EST evidence. --Carson From: Volker Brendel Date: Friday, June 6, 2014 at 3:52 PM To: Carson Holt , Daniel Standage Cc: Maker Mailing List Subject: Re: [maker-devel] Filtering of ab initio gene models Hi Carson, is there a way of allowing MAKER to add UTRs to our external models (supplied by the pred_gff or model_gff tag)? This seems to be one problem we are running into. Our external models are high quality, but CDS only. Thus their score gets knocked down relative to ab initio predictions with added UTRs. Daniel will have more questions/observations later with regard to overlapping gene models (we definitely need to allow gene models to overlap in the UTRs, because transcript evidence clearly shows such negative intergenic spaces). Thanks for all your help! Volker On 6/6/2014 11:39 AM, Carson Holt wrote: > > snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked > sequence without hints (i.e. the ab initio call). > > maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER. > > > > > In both cases MAKER is allowed to add UTR to the model (hence the 'processed' > tag). > > > > > --Carson > > > > > > > > From: Daniel Standage > Date: Friday, June 6, 2014 at 10:33 AM > To: Carson Holt > Cc: Maker Mailing List , Volker Brendel > > Subject: Re: [maker-devel] Filtering of ab initio gene models > > > > > > > > Another question: is there documentation anywhere for the naming conventions > of the genes annotated by Maker? Of course it's easy to spot genes based on a > particular ab initio gene predictor, as the names are in the IDs. But what is > the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs > "maker-$seqid-snap-gene"? > > > Thanks, > > Daniel > > > > > > > -- > Daniel S. Standage > Ph.D. 
Candidate > Computational Genome Science Laboratory > Indiana University > > > > > > On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage > wrote: > >> >> >> >> I have attached data for a small 18kb region with a handful of genes, as well >> as the corresponding maker_opts.ctl file. (This is a smaller and different >> data set than what I was looking at yesterday, with a more well-defined >> problem). >> >> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400 >> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a >> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes >> have transcript support: will Maker report overlapping genes under any >> conditions? And even if Maker is forced to choose only a single gene to >> report, why would the model from 4125 to 6400 ever be reported in place of >> the one from 6111 to 8345, especially since this is provided in the model_gff >> file? >> >> >> Even when transcript TSA024184 is included, Maker 2.10 reports the >> high-confidence gene from 611 to 8345. >> >> >> Any light you could shed would be helpful. Thanks! >> >> >> >> >> >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> >> >> >> >> >> >> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote: >> >>> >>> >>> Just eAED, but eAED can affects selection of ab initio results. For example >>> reading frame match of protein evidence which also affects whether evidence >>> from single_exon=1 and genes with single_exon protein evidence get kept. >>> There is also the assumption that your alignments in GFF3 are are correctly >>> spliced (like BLAT does). So giving blastn results as precomputed est_gff >>> would create a lot of noise, since maker ignores blastn and is using it only >>> to seed the polished exonerate alignments. 
>>> >>> >>> >>> >>> --Carson >>> >>> >>> >>> >>> >>> >>> >>> From: Daniel Standage >>> Date: Wednesday, June 4, 2014 at 1:11 PM >>> To: Carson Holt >>> Cc: Maker Mailing List >>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>> >>> >>> >>> >>> >>> >>> >>> I do not provide Gap or Target attributes in the GFF3. Will this affect the >>> AED as well, or just the eAED? >>> >>> >>> >>> >>> >>> >>> -- >>> Daniel S. Standage >>> Ph.D. Candidate >>> Computational Genome Science Laboratory >>> Indiana University >>> >>> >>> >>> >>> >>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: >>> >>>> >>>> >>>> Sure. that would be helpful. One question. Do you provide the Gap >>>> attribute in your precomputed alignments? Having or not having that >>>> attribute affects the eAED score which takes reading frame into account, >>>> and may cause some things to be kept that normally would be dropped, >>>> because MAKER won't be able to take the points of mismatch of the alignment >>>> into account (it just assumes match everywhere). >>>> >>>> >>>> >>>> >>>> --Carson >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> From: Daniel Standage >>>> Date: Wednesday, June 4, 2014 at 1:03 PM >>>> To: Maker Mailing List >>>> Subject: [maker-devel] Filtering of ab initio gene models >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> Thanks everyone for your responses recently! >>>> >>>> >>>> The reason for my recent flurry of email activity is that I'm seeing some >>>> unexpected trends when running the new version of Maker with precomputed >>>> alignments. Compared with an annotation I did a while ago (Maker 2.10, >>>> Maker-computed alignments), this new annotation has a substantial number of >>>> new genes annotated. If I compare distributions of AED scores between the >>>> old and new annotation, it's clear that the new annotation has a lot more >>>> low-quality models. 
If I look at new gene models that do not overlap with >>>> any gene model from the old annotation, the likelihood that it's a >>>> low-quality model is much higher. >>>> >>>> >>>> I decided to run a little experiment. I annotated a scaffold first using >>>> Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same >>>> pre-computed transcript and protein alignments and the same (latest) >>>> version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44 >>>> genes while Maker 2.31.3 predicted 63. If we group gene models into loci by >>>> overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 >>>> locus with only models from 2.10, and 28 loci with only models from 2.31.3. >>>> >>>> >>>> Before this experiment, I assumed the issue was related to providing >>>> pre-computed alignments in GFF3 format and perhaps violating some important >>>> assumption. However, this experiment makes me wonder whether there have >>>> been changes to how Maker filters ab initio gene models between version >>>> 2.10 and version 2.31.3? Do you have any ideas? If it would help, I could >>>> put together a small data set that reproduces the behavior I just >>>> described. >>>> >>>> Thanks! >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> Daniel S. Standage >>>> Ph.D. 
Candidate >>>> Computational Genome Science Laboratory >>>> Indiana University >>>> >>>> _______________________________________________ maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>>> >>> >> > -- Volker Brendel Professor of Biology and Computer Science Indiana University Department of Biology & School of Informatics and Computing Simon Hall 205C 212 South Hawthorne Drive Bloomington, IN 47405-7003 Tel.: (812) 855-7074 http://brendelgroup.org/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Sat Jun 7 14:16:29 2014 From: carsonhh at gmail.com (Carson Holt) Date: Sat, 07 Jun 2014 14:16:29 -0600 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: Also, MAKER 2.10 has a number of bugs in how UTR is generated and how hints are generated for the ab initio predictors (it's several years out of date). I don't think it checks for reading frame match when determining protein overlap match either. So it is no surprise that some models differ from those produced by the current MAKER version. --Carson From: Daniel Standage Date: Friday, June 6, 2014 at 12:38 PM To: Carson Holt Cc: Maker Mailing List , Volker Brendel Subject: Re: [maker-devel] Filtering of ab initio gene models Carson et al, Your feedback so far has been very helpful, and we are grateful for the time you have taken to respond! We're still trying to understand the precise procedure by which competing models are chosen. You mentioned that a single model must be chosen (via AED) from a pool of available models: are these pools constructed by overlap? It is not uncommon in our experience to see overlapping genes reported by Maker, although for the most part it appears these overlapping genes don't have CDS overlap.
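Grouping models into pools by overlap, as the question above describes, can be sketched as a single sweep over sorted coordinates. This is only an illustration of the general idea and does not claim to reproduce MAKER's internal clustering; the tuple layout, the function name `group_into_loci`, the strands in the example, and the choice to cluster each strand separately are all assumptions.

```python
from collections import defaultdict


def group_into_loci(models, by_strand=True):
    """Cluster gene models whose 1-based inclusive coordinates overlap.

    models: iterable of (model_id, start, end, strand) tuples on one sequence.
    Returns a list of loci, each a list of model IDs.
    """
    buckets = defaultdict(list)
    for model_id, start, end, strand in models:
        key = strand if by_strand else "."
        buckets[key].append((start, end, model_id))

    loci = []
    for key in sorted(buckets):
        current, current_end = [], None
        for start, end, model_id in sorted(buckets[key]):
            if current and start <= current_end:
                # Overlaps the running locus: extend it.
                current.append(model_id)
                current_end = max(current_end, end)
            else:
                # No overlap: close the previous locus and start a new one.
                if current:
                    loci.append(current)
                current, current_end = [model_id], end
        if current:
            loci.append(current)
    return loci


models = [
    ("model_a", 4125, 6400, "+"),  # coordinates from the thread; strand assumed
    ("model_b", 6111, 8345, "+"),  # coordinates from the thread; strand assumed
    ("model_c", 9000, 9500, "-"),  # hypothetical
]
loci = group_into_loci(models)
print(loci)
```

Setting `by_strand=False` pools models from both strands together, which is the difference the opposite-strand observation in this thread turns on.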
Looking more closely at the Maker 2.10 output from the test data we sent yesterday, we also noted that exclusion of the transcript in question also had an effect on the interval exon structure (exon 7717-7776 becomes exon 7737-7776) of a downstream model with which it overlaps 3 nucleotides. And still unclear to us is how the model_gff data fits in with all this. From my previous searching of the list archives I was under the impression that these models would be given substantial weight in the prediction process, and would only be altered if a considerably better model could be identified. Our experience with this small data set, though, is that which overlapping gene is reported, and which corresponding exon structure is selected, is dependent on very slight changes in the evidence. Thanks, Daniel -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Fri, Jun 6, 2014 at 12:59 PM, Daniel Standage wrote: > This helps, thanks. > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Fri, Jun 6, 2014 at 12:56 PM, Carson Holt wrote: >> I got the e-mail. Thanks for the test set. >> >> Multiple ab initio predictors don't inform a single annotation, rather one >> must be chosen from the pool of available models (I.e. it has to be SNAP or >> Augustus, or GeneMark). They all supply their own ab initio as well as hint >> based prediction, and then the one with best evidence match (measured by AED) >> is kept (it's like a competition that only one predictor can win). >> >> If you want a consensus model instead, you can take MAKER results in GFF3 >> format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is a >> collaboration with the EVM group and will have this option, but for now users >> can just split the MAKER GFF3 by evidence types and give it to EVM. EVM then >> produces consensus models based on the GFF3 content.
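Splitting a MAKER GFF3 file by evidence type can be sketched as grouping feature lines on the second (source) column. The code below is a minimal illustration, not the procedure used by the MAKER or EVM developers; the function name is hypothetical, the source names in the example are illustrative, and whether EVM accepts the resulting groups without further conversion is not addressed here.

```python
from collections import defaultdict


def split_gff3_by_source(lines):
    """Group GFF3 feature lines by their source column (column 2).

    Comment/pragma lines and blank lines are skipped, and parsing stops at a
    '##FASTA' directive so embedded sequence is not mistaken for features.
    """
    by_source = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed nine-column feature line
        by_source[columns[1]].append(line)
    return dict(by_source)


# Hypothetical feature lines for illustration.
example = [
    "##gff-version 3",
    "scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "scaf1\test2genome\texpressed_sequence_match\t100\t400\t.\t+\t.\tID=e1",
    "scaf1\tsnap_masked\tmatch\t120\t880\t.\t+\t.\tID=s1",
    "##FASTA",
    ">scaf1",
]
groups = split_gff3_by_source(example)
for source, features in groups.items():
    print(source, len(features))
```

Each group could then be written to its own file, one per evidence type.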
>> >> --Carson >> >> From: Daniel Standage >> Date: Friday, June 6, 2014 at 10:46 AM >> >> To: Carson Holt >> Cc: Maker Mailing List , Volker Brendel >> >> Subject: Re: [maker-devel] Filtering of ab initio gene models >> >> Good to know, thanks. If multiple ab initio predictors inform a single >> annotation, how does Maker decide which one will be included in the gene's >> ID? >> >> Given your quick response just now, I wanted to confirm that you got the >> message and data set I sent yesterday. I received an email saying the size of >> my message required list admin approval to be distributed, but since you were >> also a direct recipient of the email I didn't worry about it too much. >> >> Thanks again! >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote: >>> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked >>> sequence without hints (i.e. the ab initio call). >>> maker-$seqid-snap-gene was produced by SNAP after receiving hints from >>> MAKER. >>> >>> In both cases MAKER is allowed to add UTR to the model (hence the >>> 'processed' tag). >>> >>> --Carson >>> >>> >>> From: Daniel Standage >>> Date: Friday, June 6, 2014 at 10:33 AM >>> To: Carson Holt >>> Cc: Maker Mailing List , Volker Brendel >>> >>> >>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>> >>> Another question: is there documentation anywhere for the naming conventions >>> of the genes annotated by Maker? Of course it's easy to spot genes based on >>> a particular ab initio gene predictor, as the names are in the IDs. But what >>> is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene >>> ID vs "maker-$seqid-snap-gene"? >>> >>> Thanks, >>> Daniel >>> >>> >>> -- >>> Daniel S. Standage >>> Ph.D. 
Candidate >>> Computational Genome Science Laboratory >>> Indiana University >>> >>> >>> On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage >>> wrote: >>>> I have attached data for a small 18kb region with a handful of genes, as >>>> well as the corresponding maker_opts.ctl file. (This is a smaller and >>>> different data set than what I was looking at yesterday, with a more >>>> well-defined problem). >>>> >>>> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400 >>>> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a >>>> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes >>>> have transcript support: will Maker report overlapping genes under any >>>> conditions? And even if Maker is forced to choose only a single gene to >>>> report, why would the model from 4125 to 6400 ever be reported in place of >>>> the one from 6111 to 8345, especially since this is provided in the >>>> model_gff file? >>>> >>>> Even when transcript TSA024184 is included, Maker 2.10 reports the >>>> high-confidence gene from 611 to 8345. >>>> >>>> Any light you could shed would be helpful. Thanks! >>>> >>>> >>>> -- >>>> Daniel S. Standage >>>> Ph.D. Candidate >>>> Computational Genome Science Laboratory >>>> Indiana University >>>> >>>> >>>> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote: >>>>> Just eAED, but eAED can affects selection of ab initio results. For >>>>> example reading frame match of protein evidence which also affects whether >>>>> evidence from single_exon=1 and genes with single_exon protein evidence >>>>> get kept. There is also the assumption that your alignments in GFF3 are >>>>> are correctly spliced (like BLAT does). So giving blastn results as >>>>> precomputed est_gff would create a lot of noise, since maker ignores >>>>> blastn and is using it only to seed the polished exonerate alignments. 
>>>>> >>>>> --Carson >>>>> >>>>> >>>>> From: Daniel Standage >>>>> Date: Wednesday, June 4, 2014 at 1:11 PM >>>>> To: Carson Holt >>>>> Cc: Maker Mailing List >>>>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>>>> >>>>> I do not provide Gap or Target attributes in the GFF3. Will this affect >>>>> the AED as well, or just the eAED? >>>>> >>>>> >>>>> -- >>>>> Daniel S. Standage >>>>> Ph.D. Candidate >>>>> Computational Genome Science Laboratory >>>>> Indiana University >>>>> >>>>> >>>>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: >>>>>> Sure. that would be helpful. One question. Do you provide the Gap >>>>>> attribute in your precomputed alignments? Having or not having that >>>>>> attribute affects the eAED score which takes reading frame into account, >>>>>> and may cause some things to be kept that normally would be dropped, >>>>>> because MAKER won't be able to take the points of mismatch of the >>>>>> alignment into account (it just assumes match everywhere). >>>>>> >>>>>> --Carson >>>>>> >>>>>> >>>>>> From: Daniel Standage >>>>>> Date: Wednesday, June 4, 2014 at 1:03 PM >>>>>> To: Maker Mailing List >>>>>> Subject: [maker-devel] Filtering of ab initio gene models >>>>>> >>>>>> Thanks everyone for your responses recently! >>>>>> >>>>>> The reason for my recent flurry of email activity is that I'm seeing some >>>>>> unexpected trends when running the new version of Maker with precomputed >>>>>> alignments. Compared with an annotation I did a while ago (Maker 2.10, >>>>>> Maker-computed alignments), this new annotation has a substantial number >>>>>> of new genes annotated. If I compare distributions of AED scores between >>>>>> the old and new annotation, it's clear that the new annotation has a lot >>>>>> more low-quality models. If I look at new gene models that do not overlap >>>>>> with any gene model from the old annotation, the likelihood that it's a >>>>>> low-quality model is much higher. 
>>>>>> >>>>>> I decided to run a little experiment. I annotated a scaffold first using >>>>>> Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same >>>>>> pre-computed transcript and protein alignments and the same (latest) >>>>>> version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44 >>>>>> genes while Maker 2.31.3 predicted 63. If we group gene models into loci >>>>>> by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, >>>>>> 1 locus with only models from 2.10, and 28 loci with only models from >>>>>> 2.31.3. >>>>>> >>>>>> Before this experiment, I assumed the issue was related to providing >>>>>> pre-computed alignments in GFF3 format and perhaps violating some >>>>>> important assumption. However, this experiment makes me wonder whether >>>>>> there have been changes to how Maker filters ab initio gene models >>>>>> between version 2.10 and version 2.31.3? Do you have any ideas? If it >>>>>> would help, I could put together a small data set that reproduces the >>>>>> behavior I just described. >>>>>> >>>>>> Thanks! >>>>>> >>>>>> -- >>>>>> Daniel S. Standage >>>>>> Ph.D. Candidate >>>>>> Computational Genome Science Laboratory >>>>>> Indiana University >>>>>> _______________________________________________ maker-devel mailing list >>>>>> maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinf >>>>>> o/maker-devel_yandell-lab.org >>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.hoeppner at imbim.uu.se Mon Jun 9 02:48:01 2014 From: marc.hoeppner at imbim.uu.se (=?Windows-1252?Q?Marc_H=F6ppner?=) Date: Mon, 9 Jun 2014 08:48:01 +0000 Subject: [maker-devel] Repeatmasked genome Message-ID: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se> Hi, this may be an odd question, but I was wondering where, if at all, Maker reports the repeat-masked genome sequence? 
As far as I can tell the fasta sequences included in the gff annotation are unmasked (?) or at least not softmasked. I guess it wouldn't be too hard to take the repeat masker features and use them to soft mask the assembly, but still...

Regards,

Marc

Marc P. Hoeppner, PhD
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at imbim.uu.se

From dence at genetics.utah.edu Mon Jun 9 09:22:13 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 9 Jun 2014 15:22:13 +0000
Subject: [maker-devel] Repeatmasked genome
In-Reply-To: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se>
References: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se>
Message-ID: <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>

Hi Marc,

The masked genome sequence is stored in the "theVoid" directory for each scaffold. There are files "query.fasta", "query.masked.fasta", and "query.masked.gff". The masked sequence is in the query.masked.fasta file and the genomic locations of the masked regions are stored in the query.masked.gff.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From carsonhh at gmail.com Mon Jun 9 10:11:23 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 09 Jun 2014 10:11:23 -0600
Subject: [maker-devel] Repeatmasked genome
In-Reply-To: <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>
References: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se> <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>
Message-ID:

Yes. Those are all temporary files that (if you still have them) you can use to get at the masked fasta directly. Otherwise you can just use the features in the GFF3 file to remask the regions.

--Carson

From cynsb1987 at gmail.com Mon Jun 9 22:22:47 2014
From: cynsb1987 at gmail.com (hueytyng)
Date: Tue, 10 Jun 2014 14:22:47 +1000
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level
Message-ID:

Hi Carson,

I run Maker 2.31 on my assembled contigs. From the "master_datastore_index.log", 529 contigs run through, before I get the error below. Maker halts after this.

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: 1AL_NODE_4659_length_8657_cov_8.758115
Length: 8739
#---------------------------------------------------------------------

setting up GFF3 output and fasta chunks
ERROR: This MpiChunk is not part of this level
 at /ws/ws-group/app/maker/bin/../lib/Process/MpiTiers.pm line 439.
    Process::MpiTiers::update_chunk(Process::MpiTiers=HASH(0x63e64c0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 1537
    main::update_chunk(ARRAY(0x40c69d0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 929
--> rank=4, hostname=safs-raijen
deleted:0 hits
deleted:0 hits

Attached is my maker_opts.ctl file.

Thank you,
Jenny
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4931 bytes
Desc: not available

From carsonhh at gmail.com Wed Jun 11 08:29:44 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 11 Jun 2014 08:29:44 -0600
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level
In-Reply-To:
References:
Message-ID:

The cause of this is most likely a corrupt MPI message. It could be random (it happens with MPI messages), in which case it should succeed on retry. It could mean you need to reinstall your MPI communicator, or give fewer nodes to mpiexec when running your job (MPICH2 starts having communication issues after around 100 processes for example, even sooner on some systems). It may also mean that you set MAKER up with one communicator during the installation (like MPICH2) and then used mpiexec from another communicator to launch the job (OpenMPI for example or even a different version of MPICH2).

Make sure you are not using MVAPICH2 because MAKER won't work with MVAPICH2. Also if you are using OpenMPI, you must preload libmpi.so or otherwise shared libraries won't work and it will fail while running MAKER. To do that you have to export the following environmental variable -->

export LD_PRELOAD=/lib/libmpi.so #replace with the location of OpenMPI

Also because a corrupt message has the chance to cause other issues, you may want to completely delete the folder for the failed contig (look in the datastore_index.log to see where that folder is).

Also make sure you are using the latest version of MAKER because it has been vetted on OpenMPI using 8000+ cpus. Earlier versions (i.e. 2.28 and below) may have issues on OpenMPI or on some systems with slow NFS storage or limited memory.

--Carson

From sjackman at gmail.com Wed Jun 11 14:44:41 2014
From: sjackman at gmail.com (Shaun Jackman)
Date: Wed, 11 Jun 2014 13:44:41 -0700
Subject: [maker-devel] Alternate translation table
Message-ID:

Hi, Carson.

I'm annotating a plastid genome. It has spliced genes so I'm using organism_type=eukaryotic. Its translation table however is 11 (Bacteria, Archaea, prokaryotic viruses and chloroplast proteins). Is it possible to change the translation table?

I'm not doing any ab initio gene prediction, only homology-based annotation using protein for coding genes and est for non-coding genes. The sequences are from a very closely related species (99% identity).

Cheers,
Shaun
From carsonhh at gmail.com Wed Jun 11 15:01:23 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 11 Jun 2014 15:01:23 -0600
Subject: [maker-devel] Alternate translation table
In-Reply-To:
References:
Message-ID:

Sorry. MAKER doesn't have an alternate codon table option.

--Carson

From anthony.bretaudeau at rennes.inra.fr Thu Jun 12 07:00:48 2014
From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau)
Date: Thu, 12 Jun 2014 15:00:48 +0200
Subject: [maker-devel] Merging 2 annotations
In-Reply-To:
References: <538D8987.4090606@rennes.inra.fr>
Message-ID: <5399A480.10808@rennes.inra.fr>

Thank you, it works fine!

A little question which is related: I set the map_forward option to 1, but it seems to work only for the model_gff gff. Is there a way to make it keep the original IDs also for the pred_gff file?

Thank you
Anthony

On 03/06/2014 18:15, Carson Holt wrote:
> You can give the manually curated ones to model_gff and the other ones to
> pred_gff. Then set keep_preds=1.
> The model_gff results always get kept
> even without evidence support, the pred_gff will be kept even without
> evidence support because you set keep_preds=1, but model_gff results will
> take precedence.
>
> --Carson
>
> On 6/3/14, 2:38 AM, "Anthony Bretaudeau" wrote:
>
>> Hello,
>>
>> I am working on the annotation of an insect genome, and I have 2 gff
>> files:
>> -an automatic annotation (done by another lab, with something else than
>> maker, ~20000 genes)
>> -a manually curated annotation (with webapollo, ~1500 genes)
>>
>> From this, I would like to produce a single gff combining the 2. I'd
>> like to keep all the manually curated models, and only the automatic
>> ones that have no equivalent in the manually curated gff.
>>
>> Is it possible to do something like this with maker? I guess I could
>> play with the model_gff option, but I'm not sure how exactly I could use
>> it.
>>
>> Thank you for your help
>> Regards
>>
>> Anthony

From dence at genetics.utah.edu Thu Jun 12 09:50:05 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Thu, 12 Jun 2014 15:50:05 +0000
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <5399A480.10808@rennes.inra.fr>
References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr>
Message-ID: <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu>

Hi Anthony,

So I think that the gene ID gets changed in the process of promoting things from pred_gff to gene models. If you know which predictions you want to keep, then you can select those out and pass them to model_gff.
~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 12, 2014, at 7:00 AM, Anthony Bretaudeau wrote:

> A little question which is related: I set the map_forward option to 1, but it
> seems to work only for the model_gff gff. Is there a way to make it keep the
> original IDs also for the pred_gff file?

From anthony.bretaudeau at rennes.inra.fr Thu Jun 12 10:17:11 2014
From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau)
Date: Thu, 12 Jun 2014 18:17:11 +0200
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu>
References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr> <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu>
Message-ID: <5399D287.1090505@rennes.inra.fr>

An HTML attachment was scrubbed...

From carsonhh at gmail.com Thu Jun 12 10:23:06 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Jun 2014 10:23:06 -0600
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <5399D287.1090505@rennes.inra.fr>
References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr> <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu> <5399D287.1090505@rennes.inra.fr>
Message-ID:

This might be a roundabout way to get them to have the names unaltered. Give the pred_gff ones to est_gff. Still give the model_gff ones to model_gff. Set est2genome=1 and single_exon=1. Then add this line to the control file: est_forward=1. This is normally used to move transcripts forward onto new assemblies with names being drawn from the alignment, but by telling MAKER that these are ESTs instead of predictions and setting the appropriate values, it will think it's moving transcripts forward, and the final result may be what you want.
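
The settings named above could be collected in maker_opts.ctl roughly as follows. This is only a sketch of that workaround, not a tested configuration: both file names are placeholders, and the thread does not say whether the automatic models must first be converted to alignment-style (match/match_part) features before est_gff will accept them.

```
#-----sketch only: file names below are hypothetical placeholders
model_gff=curated_models.gff3   #manually curated models, given as before
est_gff=automatic_models.gff3   #automatic predictions, passed in as if they were ESTs
est2genome=1                    #build gene models directly from the est_gff features
single_exon=1                   #also accept single-exon features
est_forward=1                   #added by hand, as described above
```

Because est_forward=1 is a line the message says to add yourself, expect to append it rather than find it already in the generated control file.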
--Carson

From: Anthony Bretaudeau
Date: Thursday, June 12, 2014 at 10:17 AM
To: Daniel Ence
Cc: Carson Holt
Subject: Re: [maker-devel] Merging 2 annotations

Yes, I think that's why the ids get changed. But I don't know which predictions I want to keep as I'm using maker to only keep the ones that are not equivalent to the models that are in the model_gff.

Anthony

From sjackman at gmail.com Thu Jun 12 15:58:16 2014
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 12 Jun 2014 14:58:16 -0700
Subject: [maker-devel] Poor Exonerate gene model
Message-ID:

Hi, Carson.

I have a case where MAKER is choosing a poor gene model when a better model exists. The two genes, psaA and psaB, are adjacent and are similar (37% exonerate score). BLASTX finds only the correct alignments of psaA and psaB. When exonerate is run, it also finds poor alignments of psaA to psaB and psaB to psaA. The result is that MAKER chooses the correct model for psaB, but picks the poor psaB model for psaA. Increasing ep_score_limit from 20 to 40 works around the issue. I think MAKER could make a better choice in this situation without that hint.
See the attached screen shots. The first is ep_score_limit=20 and the second ep_score_limit=40. I've attached the evidence GFF.

Cheers,
Shaun

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 86112 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 90074 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1.gff.gz
Type: application/x-gzip
Size: 57657 bytes
Desc: not available

From saad.arif at tuebingen.mpg.de Fri Jun 13 05:03:38 2014
From: saad.arif at tuebingen.mpg.de (Saad Arif)
Date: Fri, 13 Jun 2014 13:03:38 +0200
Subject: [maker-devel] Help with updating an annotation
Message-ID:

Dear All,

I would like to use the Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to the current annotation) and was wondering if anyone had experience with this and/or suggestions regarding my questions.

Briefly:

I have tophat splice junctions from RNAseq data or alternatively cufflinks generated transcript models (fasta format) that I want to use as my new data (est_gff or est).

I want to provide the current Ensembl annotation for gene prediction but I want this annotation to remain unchanged. Hence, I'm not sure if I should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction or is this just a subset of pred_gff that remains unaltered? Can we provide the same annotation for both options (pred_gff and model_gff)?

Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated.
Thanks in advance, saad From carsonhh at gmail.com Fri Jun 13 10:59:46 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 13 Jun 2014 10:59:46 -0600 Subject: [maker-devel] Help with updating an annotation In-Reply-To: References: Message-ID: Use the cufflinks instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old model onto the new models. Thansk, Carson On 6/13/14, 5:03 AM, "Saad Arif" wrote: >Dear All, > >I would like to use Maker pipeline to expand a current annotation (new >isoforms and novel genes with respect to current annotation) and was >wondering if anyone had experience with this and or suggestions to my >questions. > >Briefly: > > I have tophat splice junctions from RNAseq data or alternatively >cufflinks generated transcript models (fasts format) that i want to use >as my new data (est_gff or est). > >I want to provide the current Ensembl annotation for gene prediction but >i want this annotation to remain unchanged. Hence, i?m not sure if i >should provide this annotation as pred_gff > or model_gff. Can the model_gff be used for gene prediction or is this >just a subset of pred_gff that remain unaltered? Can we provide the same >annotation for both options (pred_ and mod_gff)? > > > >Importantly, my main goal is to use the new RNAseq data to add more >isoforms and (any) novel genes to the existing Ensembl annotation. Any >thoughts or suggestions on how to go about this would be sincerely >appreciated. 
> > >Thanks in advance, >saad > > > > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From juefish at gmail.com Tue Jun 17 14:54:51 2014 From: juefish at gmail.com (Nathaniel Jue) Date: Tue, 17 Jun 2014 16:54:51 -0400 Subject: [maker-devel] issue with forks module Message-ID: I've been running into all kinds of issues with the implementation of forks in GMOD. I repeatedly get this error when running an MPI run of Maker: Use of each() on hash after insertion without resetting hash iterator results in undefined behavior at /data2/local_installs/perls/perl-5.18.1/lib/site_perl/5.18.1/x86_64-linux/ forks.pm line 1736. I had to install an alternative version of perl with perlbrew but everything seemed to work on the test data sets. Any thoughts on what might be the issue? Or if this is even an issue that would affect results? Thanks, Nate -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jun 17 15:09:55 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 17 Jun 2014 15:09:55 -0600 Subject: [maker-devel] issue with forks module In-Reply-To: References: Message-ID: There is a change in Perl 5.18 that makes the forks.pm module incompatible. The forks.pm model maintainers have yet to update their module to resolve the issue, so it only works on perl version prior to 5.18. One work around it to manually edit forks.pm line 1736 yourself. Change it from this --> $write = each %WRITE; To this (make sure to include the {} brackets)--> { no warnings qw(internal); $write = each %WRITE; } --Carson From: Nathaniel Jue Date: Tuesday, June 17, 2014 at 2:54 PM To: Subject: [maker-devel] issue with forks module I've been running into all kinds of issues with the implementation of forks in GMOD. 
I repeatedly get this error when running an MPI run of Maker: Use of each() on hash after insertion without resetting hash iterator results in undefined behavior at /data2/local_installs/perls/perl-5.18.1/lib/site_perl/5.18.1/x86_64-linux/fo rks.pm line 1736. I had to install an alternative version of perl with perlbrew but everything seemed to work on the test data sets. Any thoughts on what might be the issue? Or if this is even an issue that would affect results? Thanks, Nate _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From saad.arif at tuebingen.mpg.de Wed Jun 18 05:09:48 2014 From: saad.arif at tuebingen.mpg.de (Saad Arif) Date: Wed, 18 Jun 2014 12:09:48 +0100 Subject: [maker-devel] Help with updating an annotation In-Reply-To: References: Message-ID: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> Thank you for the response. I still have one question though, with these options: est_GFF=cufflinksout.GFF modle_GFF= ensembl reference.GFF What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding (new genes) and improving of an existing annotation? Any feedback on this would be greatly appreciated. saad On 13 Jun 2014, at 17:59, Carson Holt wrote: > Use the cufflinks instead of the tophat features (tophat tends to be > really noisy). Give the existing models to model_gff (they will then > always be kept unless something better is found). There is no option to > keep models and then just add isoforms. 
The model_gff input will either > be kept as is (unchanged), or replaced with an updated model suggested by > the evidence (the updated model may contain multiple isoforms though), and > map_forward=1 can be used to pull names forward from the old model onto > the new models. > > Thansk, > Carson > > > On 6/13/14, 5:03 AM, "Saad Arif" wrote: > >> Dear All, >> >> I would like to use Maker pipeline to expand a current annotation (new >> isoforms and novel genes with respect to current annotation) and was >> wondering if anyone had experience with this and or suggestions to my >> questions. >> >> Briefly: >> >> I have tophat splice junctions from RNAseq data or alternatively >> cufflinks generated transcript models (fasts format) that i want to use >> as my new data (est_gff or est). >> >> I want to provide the current Ensembl annotation for gene prediction but >> i want this annotation to remain unchanged. Hence, i?m not sure if i >> should provide this annotation as pred_gff >> or model_gff. Can the model_gff be used for gene prediction or is this >> just a subset of pred_gff that remain unaltered? Can we provide the same >> annotation for both options (pred_ and mod_gff)? >> >> >> >> Importantly, my main goal is to use the new RNAseq data to add more >> isoforms and (any) novel genes to the existing Ensembl annotation. Any >> thoughts or suggestions on how to go about this would be sincerely >> appreciated. 
>>
>>
>> Thanks in advance,
>> saad
>
>

From dence at genetics.utah.edu Wed Jun 18 10:21:19 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Wed, 18 Jun 2014 16:21:19 +0000
Subject: [maker-devel] Help with updating an annotation
In-Reply-To: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de>
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de>
Message-ID: <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu>

Hi Saad,

Maker doesn't view EST or protein evidence as gene models in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With its default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option).

You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output.

One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

Let me know if that helps, or if you have more questions.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 18, 2014, at 5:09 AM, Saad Arif wrote:

> Thank you for the response. I still have one question though, with these options:
>
> est_GFF=cufflinksout.GFF
>
> model_GFF= ensembl reference.GFF
>
> What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks output)? Would I have to prepare ab initio gene predictions for each of these predicted 'new' genes?
> Is there a simple way to combine adding new genes and improving an existing annotation?
>
> Any feedback on this would be greatly appreciated.
>
> saad

From dence at genetics.utah.edu Wed Jun 18 11:04:26 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Wed, 18 Jun 2014 17:04:26 +0000
Subject: [maker-devel] Help with updating an annotation
In-Reply-To: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID:

Hi Saad,

That seems to be right to me. You'll do one run of MAKER with the cufflinks output and est2genome turned on and train SNAP on that output. You can repeat this as many times as you want, but in my experience you don't gain much in predictive power beyond two rounds of training. Next, you'll turn on SNAP and turn off est2genome, but still include the cufflinks and proteome evidence and the Ensembl models.

The other ab initio predictors that maker can use internally (genemark and augustus) are worth looking into also. Genemark is self-training, but can take a couple of days depending on how large your genome is. Augustus takes a lot of time and effort to train, but comes with many prebuilt training files. If one of its prebuilt files is close to your species of interest, you can just use that instead.
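
The two passes described above can be sketched as control-file settings. This is only an illustration, not a tested configuration: the option names (est_gff, protein, model_gff, est2genome, snaphmm) are the ones discussed in this thread, every file name is a hypothetical placeholder, and leaving the protein and Ensembl inputs out of the training pass is an assumption.

```python
# Sketch of the two MAKER passes as maker_opts.ctl-style key=value settings.
# Every file name below is a hypothetical placeholder.

def training_pass():
    """Pass 1: promote cufflinks alignments to gene models for SNAP training."""
    return {
        "est_gff": "cufflinks.gff",   # hypothetical cufflinks output
        "est2genome": 1,              # treat EST alignments as gene models
        "snaphmm": "",                # SNAP has not been trained yet
    }


def annotation_pass(snap_hmm):
    """Pass 2: SNAP on, est2genome off, evidence and existing models supplied."""
    return {
        "est_gff": "cufflinks.gff",            # same hypothetical evidence
        "protein": "related_proteomes.fasta",  # hypothetical proteome file
        "model_gff": "ensembl.gff",            # hypothetical existing models
        "snaphmm": snap_hmm,                   # HMM trained on pass 1 output
        "est2genome": 0,
    }


def render(options):
    """Format a settings dict as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in options.items())


if __name__ == "__main__":
    print(render(training_pass()))
    print()
    print(render(annotation_pass("snap_round1.hmm")))
```

The rendered lines would still have to be merged by hand into a real maker_opts.ctl; the sketch only makes explicit which switches change between the passes.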

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From a.priyam at qmul.ac.uk Wed Jun 18 11:44:34 2014
From: a.priyam at qmul.ac.uk (Anurag Priyam)
Date: Wed, 18 Jun 2014 23:14:34 +0530
Subject: [maker-devel] errors in final gff
Message-ID:

Hi,

I compiled all annotations generated by MAKER into a single GFF file using the gff3_merge script distributed with MAKER. While formatting this GFF for use with JBrowse, I found a few errors:

1. Three instances where two features were assigned the same id.
2. One instance where a group of three subfeatures refer to a non-existent parent.

Here is the relevant portion of the GFF file:
https://gist.github.com/yeban/ffaf5cd419639dd073a7

I worked around the issue temporarily for the job at hand, but I am left wondering why these errors would creep in.
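
For anyone wanting to check a merged file for the two problems listed above, here is a minimal sketch. It is not part of MAKER; the function names are made up for illustration, and it assumes plain tab-separated GFF3 with ID and Parent attributes in column 9. Note that GFF3 allows one ID on several lines for discontinuous features (multi-line CDS, for example), so a reported duplicate is something to inspect, not proof of an error.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (no URL-decoding)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def check_gff3(lines):
    """Return (duplicate_ids, orphans).

    duplicate_ids: sorted IDs that appear on more than one feature line.
    orphans: (child ID, parent ID) pairs whose parent ID is never defined.
    """
    id_counts = Counter()
    parent_refs = []
    for line in lines:
        if line.startswith("##FASTA"):
            break  # sequence section, no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID", "")
        if feature_id:
            id_counts[feature_id] += 1
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parent_refs.append((feature_id, parent))
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    orphans = [(c, p) for c, p in parent_refs if p not in id_counts]
    return duplicates, orphans


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        dups, orphans = check_gff3(handle)
    print("duplicate IDs:", dups)
    print("missing parents:", orphans)
```

Parents are resolved only after the whole file has been read, so a child listed before its parent is not reported as an orphan.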

-- Priyam

From carsonhh at gmail.com Wed Jun 18 12:11:49 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 18 Jun 2014 12:11:49 -0600
Subject: [maker-devel] errors in final gff
In-Reply-To:
References:
Message-ID:

What MAKER version are you using?

--Carson

From carsonhh at gmail.com Wed Jun 18 15:33:08 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 18 Jun 2014 15:33:08 -0600
Subject: [maker-devel] errors in final gff
In-Reply-To:
References:
Message-ID:

Are you passing in old data via GFF3?

--Carson

From mhinsley at ebi.ac.uk Thu Jun 19 03:07:32 2014
From: mhinsley at ebi.ac.uk (Malcolm Hinsley)
Date: Thu, 19 Jun 2014 10:07:32 +0100
Subject: [maker-devel] 'not a SCALAR reference' error
In-Reply-To:
References:
Message-ID: <53A2A854.3000009@ebi.ac.uk>

Hi

I'm running maker 2.31 with mpich 3 and have run once with est2genome and protein2genome, then trained augustus and snap and run the first iteration of ab-initio predictors, which finished cleanly with no errors/failures.

Having retrained augustus and snap I'm trying to run maker -a using the same augustus species and snap.hmm pathname... previously this has worked fine.

I get a lot of errors like this (it looks like every scaffold fails):

doing repeat masking
ERROR: Not a SCALAR reference at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Fasta.pm line 382 thread 1.
    Fasta::_formatSeq(FastaSeq=HASH(0x4426298), 60) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Fasta.pm line 369 thread 1
    Fasta::toFastaRef(">scaffold29 CHUNK number:0 size:100000 offset:0", REF(0x42e48f0)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/FastaChunk.pm line 217 thread 1
    FastaChunk::fasta_ref(FastaChunk=HASH(0x42c76a8)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/FastaChunk.pm line 168 thread 1
    FastaChunk::write_file(FastaChunk=HASH(0x42c76a8), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/GI.pm line 3138 thread 1
    GI::repeatmask(FastaChunk=HASH(0x42c76a8), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"..., "scaffold29", "", "/nfs/production/panda/ensemblgenomes/external/RepeatMasker-op"..., "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/c."..., 1, runlog=HASH(0x430e730)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 785 thread 1
    Process::MpiChunk::__ANON__() called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Error.pm line 415 thread 1
    eval {...} called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Error.pm line 407 thread 1
    Error::subs::try(CODE(0x4437a90), HASH(0x4425fc8)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 4215 thread 1
    Process::MpiChunk::_go(Process::MpiChunk=HASH(0x426e0a0), "run", HASH(0x42a5410), 0, 1) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 341 thread 1
    Process::MpiChunk::run(Process::MpiChunk=HASH(0x426e0a0), 29) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker line 1457 thread 1
    main::node_thread("/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) called at /nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 799 thread 1
    eval {...} called at /nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 799 thread 1
    threads::new("threads", CODE(0x4168d70), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker line 917 thread 1
--> rank=29, hostname=ebi5-229.ebi.ac.uk
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:scaffold29

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:scaffold29

I see from the mailing list that there's a known issue with forks.pm (which is at the bottom of this stack) relating to perl 5.18, but I'm running 5.14.

Any ideas?

On 17/06/14 22:09, Carson Holt wrote:
> There is a change in Perl 5.18 that makes the forks.pm module incompatible.
> The forks.pm module maintainers have yet to update their module to resolve
> the issue, so it only works on Perl versions prior to 5.18.
> One workaround is to manually edit forks.pm line 1736 yourself.
>
> Change it from this -->
> $write = each %WRITE;
>
> To this (make sure to include the {} brackets)-->
> {
> no warnings qw(internal);
> $write = each %WRITE;
> }
>
> --Carson
>

--
malcolm hinsley | EnsEMBL Genomes | +44 (0)1223 49 4669
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD
United Kingdom

From rbharris at uw.edu Thu Jun 19 13:07:36 2014
From: rbharris at uw.edu (Rebecca Harris)
Date: Thu, 19 Jun 2014 14:07:36 -0500
Subject: [maker-devel] Fwd: iprscan2gff3
In-Reply-To:
References:
Message-ID:

Hi,

I'm trying to use the iprscan2gff3 script to update my final gff3 file with annotations from Interproscan 5. I'm getting a bunch of errors similar to another user but do not see how their issue was resolved:
https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ

I ran Interproscan on my ab initio predictions, then converted the xml to raw format. When I run iprscan2gff3 I get the errors:

Use of uninitialized value $name in hash element at /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line 1090.

Thanks,
Rebecca

From dence at genetics.utah.edu Thu Jun 19 14:44:46 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Thu, 19 Jun 2014 20:44:46 +0000
Subject: [maker-devel] Fwd: iprscan2gff3
In-Reply-To:
References:
Message-ID:

Hi Rebecca,

I looked at the conversation you linked to and it seems that Carson resolved those parsing issues in an update to maker. What version of maker are you using?

Also, in that same conversation Carson said that those errors wouldn't affect the output (because the script was parsing the mRNA features fine, but giving errors on the gene features). Does the output that you get from iprscan2gff3 seem complete?

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From carsonhh at gmail.com Thu Jun 19 14:47:27 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 19 Jun 2014 14:47:27 -0600
Subject: [maker-devel] Fwd: iprscan2gff3
In-Reply-To:
References:
Message-ID:

Also make sure there are gene/mRNA features in your GFF3 for your iprscan results. If you used the ab initio calls (which will be match/match_part features in the GFF3) as your input to iprscan, then you will need to upgrade them to gene/mRNA features before the script will add domains to them.

--Carson

From rbharris at uw.edu Thu Jun 19 15:22:34 2014
From: rbharris at uw.edu (Rebecca Harris)
Date: Thu, 19 Jun 2014 14:22:34 -0700
Subject: [maker-devel] Fwd: iprscan2gff3
In-Reply-To:
References:
Message-ID:

Hey,

Thanks for the reply. The problem was that I didn't upgrade the matches to gene/mRNA features before running the ipr_update_gff script.

R

From a.priyam at qmul.ac.uk Thu Jun 19 16:11:36 2014
From: a.priyam at qmul.ac.uk (Anurag Priyam)
Date: Fri, 20 Jun 2014 03:41:36 +0530
Subject: [maker-devel] migrating annotations from old to new assembly
Message-ID:

Is it possible to migrate annotations from an old assembly to a new assembly using MAKER?

Perhaps by setting est= to transcripts (spliced? or unspliced?) from the previous assembly and genome= to the new assembly?
Maybe ask MAKER to use exonerate instead of BLASTN so splice junctions are accounted for better? -- Priyam From carsonhh at gmail.com Thu Jun 19 16:16:01 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 19 Jun 2014 16:16:01 -0600 Subject: [maker-devel] migrating annotations from old to new assembly In-Reply-To: References: Message-ID: Here you go, this is covered in a previous post --> https://groups.google.com/forum/#!searchin/maker-devel/est_forward/maker-de vel/q9fxXGKO8mk/0ATwhJvZeI4J --Carson On 6/19/14, 4:11 PM, "Anurag Priyam" wrote: >Is it possible to migrate annotations from an old assembly to a new >assembly using MAKER? > >Perhaps by setting est= to transcripts (spliced? or unspliced?) from >the previous assembly and genome= to the new assembly? Maybe ask MAKER >to use exonerate instead of BLASTN so splice junctions are accounted >for better? > >-- Priyam > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From a.priyam at qmul.ac.uk Thu Jun 19 16:19:22 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Fri, 20 Jun 2014 03:49:22 +0530 Subject: [maker-devel] migrating annotations from old to new assembly In-Reply-To: References: Message-ID: Wow! Thanks :). I apologise that I didn't look through the archives before asking. -- Priyam On Fri, Jun 20, 2014 at 3:46 AM, Carson Holt wrote: > Here you go, this is covered in a previous post --> > https://groups.google.com/forum/#!searchin/maker-devel/est_forward/maker-de > vel/q9fxXGKO8mk/0ATwhJvZeI4J > > > --Carson > > > > On 6/19/14, 4:11 PM, "Anurag Priyam" wrote: > >>Is it possible to migrate annotations from an old assembly to a new >>assembly using MAKER? >> >>Perhaps by setting est= to transcripts (spliced? or unspliced?) from >>the previous assembly and genome= to the new assembly? 
Maybe ask MAKER >>to use exonerate instead of BLASTN so splice junctions are accounted >>for better? >> >>-- Priyam >> >>_______________________________________________ >>maker-devel mailing list >>maker-devel at box290.bluehost.com >>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From saad.arif at tuebingen.mpg.de Wed Jun 18 10:42:17 2014 From: saad.arif at tuebingen.mpg.de (Saad Arif) Date: Wed, 18 Jun 2014 17:42:17 +0100 Subject: [maker-devel] Help with updating an annotation In-Reply-To: <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> Message-ID: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de> Thanks Daniel. I think it's more clear to me now. So If I understand correctly now: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence.Alternatively, I can provide alternative ab inito predictions (for regions not present in my ensembl ref passed to model_GFF) for regions overlapping my cufflinks output via the pred_GFF option? Since i'm interested in unannotated regions, i'm also passing in reference proteomes of closely related species as protein homology evidence. As such i should be able to keep, only evidence supported predictions (for regions not present in my model_GFF and or better supported models for present regions) from my pred_GFF and merge them with Ensembl annotations from the model_GFF? Let me know if i'm still missing something here. Thanks in advance. best, Saad On 18 Jun 2014, at 17:21, Daniel Ence wrote: > Hi Saad, > > Maker doesn't view EST or protein evidence as a gene model in themselves. There's a good reason for this. 
Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With it's default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). > > You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output. > > One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors > > Let me know if that helps, or if you have more question > > > ~Daniel > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: > >> Thank you for the response. I still have one question though, with these options: >> >> est_GFF=cufflinksout.GFF >> >> modle_GFF= ensembl reference.GFF >> >> What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? >> Is there a simple way to combine adding (new genes) and improving of an existing annotation? >> >> Any feedback on this would be greatly appreciated. >> >> saad >> >> On 13 Jun 2014, at 17:59, Carson Holt wrote: >> >>> Use the cufflinks instead of the tophat features (tophat tends to be >>> really noisy). Give the existing models to model_gff (they will then >>> always be kept unless something better is found). 
There is no option to >>> keep models and then just add isoforms. The model_gff input will either >>> be kept as is (unchanged), or replaced with an updated model suggested by >>> the evidence (the updated model may contain multiple isoforms though), and >>> map_forward=1 can be used to pull names forward from the old model onto >>> the new models. >>> >>> Thansk, >>> Carson >>> >>> >>> On 6/13/14, 5:03 AM, "Saad Arif" wrote: >>> >>>> Dear All, >>>> >>>> I would like to use Maker pipeline to expand a current annotation (new >>>> isoforms and novel genes with respect to current annotation) and was >>>> wondering if anyone had experience with this and or suggestions to my >>>> questions. >>>> >>>> Briefly: >>>> >>>> I have tophat splice junctions from RNAseq data or alternatively >>>> cufflinks generated transcript models (fasts format) that i want to use >>>> as my new data (est_gff or est). >>>> >>>> I want to provide the current Ensembl annotation for gene prediction but >>>> i want this annotation to remain unchanged. Hence, i?m not sure if i >>>> should provide this annotation as pred_gff >>>> or model_gff. Can the model_gff be used for gene prediction or is this >>>> just a subset of pred_gff that remain unaltered? Can we provide the same >>>> annotation for both options (pred_ and mod_gff)? >>>> >>>> >>>> >>>> Importantly, my main goal is to use the new RNAseq data to add more >>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any >>>> thoughts or suggestions on how to go about this would be sincerely >>>> appreciated. 
>>>> >>>> >>>> Thanks in advance, >>>> saad >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >>> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anurag08priyam at gmail.com Wed Jun 18 12:15:52 2014 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Wed, 18 Jun 2014 23:45:52 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: It's version 2.31. -- Priyam On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt wrote: > What MAKER version are you using? > > --Carson > > > On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: > >>Hi, >> >>I compiled all annotations generated by MAKER into a single GFF file >>using the gff3_merge script distributed with MAKER. While formatting >>this GFF for use with JBrowse, I found a few errors: >> >>1. Three instances where two features were assigned the same id. >>2. One instance where a group of three subfeatures refer to a >>non-existent parent. >> >>Here is the relevant portion of the GFF file: >>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >> >>I worked around the issue temporarily for the job at hand, but I am >>left wondering why would these errors creep in. 
>>
>>-- Priyam
>
>

From rajesh.bommareddy at tu-harburg.de Thu Jun 19 02:08:45 2014
From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy)
Date: Thu, 19 Jun 2014 10:08:45 +0200
Subject: [maker-devel] Maker control files
Message-ID: <53A29A8D.5010709@tu-harburg.de>

Dear Sir/Madam,

I have installed MAKER on Linux and tested the installation using maker -h. It looks fine, but I cannot create the control files using /maker/bin ./maker -CTL

I get the following error:

Couldnot create maker_opts.ctl control files not found

Can you please help me with a possible reason and solution?

Thanks and Regards
--
M.Sc. Rajesh Reddy Bommareddy
Institute of Bioprocess and Biosystems Engineering
Hamburg University of Technology(TUHH)
Denikestrasse 15, 20171 Hamburg
GERMANY.
e.mail: rajesh.bommareddy at tu-harburg.de
Phone:+4940428784011
Mobile:+4917663673522

From dence at genetics.utah.edu Fri Jun 20 15:20:47 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Fri, 20 Jun 2014 21:20:47 +0000
Subject: [maker-devel] Maker control files
In-Reply-To: <53A29A8D.5010709@tu-harburg.de>
References: <53A29A8D.5010709@tu-harburg.de>
Message-ID: <51B8C254-A912-4CF6-B0E3-5C66E6E3E9AE@genetics.utah.edu>

Hi Rajesh,

Do you have write permissions in the directory where you're running maker? Also, I can't tell whether you're running one command or two. If you run "maker" and there are no control files, you'll get the "control files not found" error. If you run ./maker -CTL and don't have permission to write to the install directory (which isn't unusual), you'll get the "Could not create maker_opts.ctl" error.
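One way to tell the two cases apart is to check whether the working directory is writable before generating the control files. The sketch below is only an illustration: `can_write_control_files` is a hypothetical helper, not part of MAKER, and it assumes that `maker -CTL` writes the control files into the directory it is run from, as the error above suggests.

```python
import os
import tempfile


def can_write_control_files(directory="."):
    """Return True if a new file can actually be created in `directory`.

    Hypothetical pre-flight check, not part of MAKER. Assumption: `maker -CTL`
    writes its control files into the current working directory.
    """
    if not os.path.isdir(directory):
        return False
    try:
        # Creating (and automatically deleting) a temporary file tests the
        # real write path instead of only inspecting permission bits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError:
        return False
    return True


if __name__ == "__main__":
    workdir = os.getcwd()
    if can_write_control_files(workdir):
        print(f"{workdir} is writable; control files can be generated here.")
    else:
        print(f"{workdir} is not writable; change to a directory you own first.")
```

If the check fails in the install directory, changing to a project directory you own and calling maker by its full path from there is the usual way around it.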
Thanks,
Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 19, 2014, at 2:08 AM, Rajesh Reddy Bommareddy wrote:

> Dear Sir/Madam,
>
> I have installed MAKER on Linux and tested the installation using maker -h. It looks fine, but I cannot create the control files using /maker/bin ./maker -CTL
>
> I get the following error:
>
> Couldnot create maker_opts.ctl control files not found
>
> Can you please help me with a possible reason and solution?
>
> Thanks and Regards
> --
> M.Sc. Rajesh Reddy Bommareddy
> Institute of Bioprocess and Biosystems Engineering
> Hamburg University of Technology(TUHH)
> Denikestrasse 15, 20171 Hamburg
> GERMANY.
> e.mail: rajesh.bommareddy at tu-harburg.de
> Phone:+4940428784011
> Mobile:+4917663673522

From carsonhh at gmail.com Fri Jun 20 15:42:13 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Jun 2014 15:42:13 -0600
Subject: [maker-devel] Help with updating an annotation
In-Reply-To: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID:

"I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?"

Not exactly. You need to supply an HMM for SNAP, a species file for Augustus, etc. MAKER doesn't generate gene predictions; SNAP does. You cannot get updated models unless you've provided a way for those models to be updated.
MAKER will provide SNAP/Augustus with hints based on the evidence to make them perform better, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also, if you provide models in GFF3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints, just an immutable GFF3 file).

If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from the GMOD 2014 tutorial) --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

--Carson

From: Saad Arif
Date: Wednesday, June 18, 2014 at 10:42 AM
To: Daniel Ence
Subject: Re: [maker-devel] Help with updating an annotation

Thanks Daniel. I think it's clearer to me now. So, if I understand correctly:

I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my ensembl ref passed to model_GFF) for regions overlapping my cufflinks output via the pred_GFF option?

Since I'm interested in unannotated regions, I'm also passing in reference proteomes of closely related species as protein homology evidence.

As such, I should be able to keep only evidence-supported predictions (for regions not present in my model_GFF and/or better-supported models for present regions) from my pred_GFF and merge them with Ensembl annotations from the model_GFF?

Let me know if I'm still missing something here. Thanks in advance.

best,
Saad

On 18 Jun 2014, at 17:21, Daniel Ence wrote:

> Hi Saad,
>
> Maker doesn't view EST or protein evidence as a gene model in themselves.
> There's a good reason for this.
Aligners like blast don't guarantee complete > gene models, with accurate start and stop codons and splice sites. With it's > default settings maker won't make a gene model unless there's evidence that > overlaps an ab-initio prediction (or something from the pred_gff option). > > You can use est2genome to promote everything from the est_gff option to a gene > model, but this will probably give you many spurious results. What you're > saying with est2genome is, "Everything that this tool found is a complete gene > model." I don't think that's true even for cufflinks output. > > One of the gene predictors that can run internally is snap. It's really easy > to train; here's a link to a tutorial for training it: > http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMO > D_Online_Training_2014#Training_ab_initio_Gene_Predictors > > Let me know if that helps, or if you have more question > > > ~Daniel > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: > >> Thank you for the response. I still have one question though, with these >> options: >> >> est_GFF=cufflinksout.GFF >> >> modle_GFF= ensembl reference.GFF >> >> What happens to cufflinks assembled transcripts that are not confined to >> current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to >> prepare ab initio gene predictions for each of these predicted 'new' genes? >> Is there a simple way to combine adding (new genes) and improving of an >> existing annotation? >> >> Any feedback on this would be greatly appreciated. >> >> saad >> >> On 13 Jun 2014, at 17:59, Carson Holt wrote: >> >>> Use the cufflinks instead of the tophat features (tophat tends to be >>> really noisy). Give the existing models to model_gff (they will then >>> always be kept unless something better is found). 
There is no option to >>> keep models and then just add isoforms. The model_gff input will either >>> be kept as is (unchanged), or replaced with an updated model suggested by >>> the evidence (the updated model may contain multiple isoforms though), and >>> map_forward=1 can be used to pull names forward from the old model onto >>> the new models. >>> >>> Thansk, >>> Carson >>> >>> >>> On 6/13/14, 5:03 AM, "Saad Arif" wrote: >>> >>>> Dear All, >>>> >>>> I would like to use Maker pipeline to expand a current annotation (new >>>> isoforms and novel genes with respect to current annotation) and was >>>> wondering if anyone had experience with this and or suggestions to my >>>> questions. >>>> >>>> Briefly: >>>> >>>> I have tophat splice junctions from RNAseq data or alternatively >>>> cufflinks generated transcript models (fasts format) that i want to use >>>> as my new data (est_gff or est). >>>> >>>> I want to provide the current Ensembl annotation for gene prediction but >>>> i want this annotation to remain unchanged. Hence, i?m not sure if i >>>> should provide this annotation as pred_gff >>>> or model_gff. Can the model_gff be used for gene prediction or is this >>>> just a subset of pred_gff that remain unaltered? Can we provide the same >>>> annotation for both options (pred_ and mod_gff)? >>>> >>>> >>>> >>>> Importantly, my main goal is to use the new RNAseq data to add more >>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any >>>> thoughts or suggestions on how to go about this would be sincerely >>>> appreciated. 
>>>>
>>>>
>>>> Thanks in advance,
>>>> saad

From carsonhh at gmail.com Fri Jun 20 15:46:59 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Jun 2014 15:46:59 -0600
Subject: [maker-devel] 'not a SCALAR reference' error
In-Reply-To: <53A2A854.3000009@ebi.ac.uk>
References: <53A2A854.3000009@ebi.ac.uk>
Message-ID:

Make sure you are using the latest version of MAKER, 2.31.6. Also, you may have to use MPICH2. MPICH3 is actually a different MPI protocol and I have not had success running MAKER with it.

--Carson

On 6/19/14, 3:07 AM, "Malcolm Hinsley" wrote:

>Hi
>
>I'm running maker 2.31 with mpich 3 and have run once with est and
>protein2genome, then trained augustus and snap and run the first
>iteration of ab-initio predictors, which finished cleanly with no
>errors/failures.
>
>Having retrained augustus and snap I'm trying to run maker -a using the
>same augustus species and snap.hmm pathname... previously this has
>worked fine.
>
>
>I get a lot of errors like this (it looks like every scaffold fails):
>
>doing repeat masking
>ERROR: Not a SCALAR reference
> at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Fasta.pm line 382 thread 1.
> Fasta::_formatSeq(FastaSeq=HASH(0x4426298), 60) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Fasta.pm >line 369 thread 1 > Fasta::toFastaRef(">scaffold29 CHUNK number:0 size:100000 >offset:0", REF(0x42e48f0)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/FastaChunk.pm >line 217 thread 1 > FastaChunk::fasta_ref(FastaChunk=HASH(0x42c76a8)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/FastaChunk.pm >line 168 thread 1 > FastaChunk::write_file(FastaChunk=HASH(0x42c76a8), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) >called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/GI.pm >line 3138 thread 1 > GI::repeatmask(FastaChunk=HASH(0x42c76a8), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"..., >"scaffold29", "", >"/nfs/production/panda/ensemblgenomes/external/RepeatMasker-op"..., >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/c."..., 1, >runlog=HASH(0x430e730)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 785 thread 1 > Process::MpiChunk::__ANON__() called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Error.pm >line 415 thread 1 > eval {...} called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Error.pm >line 407 thread 1 > Error::subs::try(CODE(0x4437a90), HASH(0x4425fc8)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 4215 thread 1 > Process::MpiChunk::_go(Process::MpiChunk=HASH(0x426e0a0), >"run", HASH(0x42a5410), 0, 1) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 341 thread 1 > Process::MpiChunk::run(Process::MpiChunk=HASH(0x426e0a0), 29) >called at 
>/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker >line >1457 thread 1 >main::node_thread("/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/ma >ker/v8"...) >called at >/nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/ >x86_64-linux-thread-multi/forks.pm >line 799 thread 1 > eval {...} called at >/nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/ >x86_64-linux-thread-multi/forks.pm >line 799 thread 1 > threads::new("threads", CODE(0x4168d70), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) >called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker >line >917 thread 1 >--> rank=29, hostname=ebi5-229.ebi.ac.uk >ERROR: Failed while doing repeat masking >ERROR: Chunk failed at level:0, tier_type:1 >FAILED CONTIG:scaffold29 > >ERROR: Chunk failed at level:2, tier_type:0 >FAILED CONTIG:scaffold29 > > >I see from the mailing list that there's a known issue w/ forks..pm >(which is at the bottom of this stack) relating to perl 5.18, but I'm >running 5.14. > > >Any ideas? > > > > > >On 17/06/14 22:09, Carson Holt wrote: >> There is a change in Perl 5.18 that makes the forks.pm module >>incompatible. >> The forks.pm model maintainers have yet to update their module to >>resolve >> the issue, so it only works on perl version prior to 5.18. >> One work around it to manually edit forks.pm line 1736 yourself. 
>>
>> Change it from this -->
>>     $write = each %WRITE;
>>
>> To this (make sure to include the {} brackets) -->
>>     {
>>         no warnings qw(internal);
>>         $write = each %WRITE;
>>     }
>>
>> --Carson
>>
>
>--
>malcolm hinsley | EnsEMBL Genomes | +44 (0)1223 49 4669
>European Bioinformatics Institute (EMBL-EBI)
>European Molecular Biology Laboratory
>Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD
>United Kingdom

From carsonhh at gmail.com Fri Jun 20 15:50:38 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Jun 2014 15:50:38 -0600
Subject: [maker-devel] errors in final gff
In-Reply-To:
References:
Message-ID:

Did you use est_forward? Also, in the example you showed, all the IDs are unique (one says hit and the other hsp in the ID, so they are different). Could you find the non-unique IDs causing the error?

--Carson

On 6/19/14, 2:05 AM, "Anurag Priyam" wrote:

>I used est_gff= option, which refers to a GFF file generated by
>cufflinks2gff3. The erroneous annotations didn't come from this GFF.
>
>-- Priyam
>
>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt wrote:
>> Are you passing in old data via GFF3?
>>
>> --Carson
>>
>>
>> On 6/18/14, 12:15 PM, "Anurag Priyam" wrote:
>>
>>>It's version 2.31.
>>>
>>>-- Priyam
>>>
>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt
>>>wrote:
>>>> What MAKER version are you using?
>>>>
>>>> --Carson
>>>>
>>>>
>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote:
>>>>
>>>>>Hi,
>>>>>
>>>>>I compiled all annotations generated by MAKER into a single GFF file
>>>>>using the gff3_merge script distributed with MAKER. While formatting
>>>>>this GFF for use with JBrowse, I found a few errors:
>>>>>
>>>>>1. Three instances where two features were assigned the same id.
>>>>>2. One instance where a group of three subfeatures refer to a
>>>>>non-existent parent.
>>>>> >>>>>Here is the relevant portion of the GFF file: >>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>> >>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>left wondering why would these errors creep in. >>>>> >>>>>-- Priyam >>>>> >>>>>_______________________________________________ >>>>>maker-devel mailing list >>>>>maker-devel at box290.bluehost.com >>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.or >>>>>g >>>> >>>> >> >> From carsonhh at gmail.com Fri Jun 20 15:56:46 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Jun 2014 15:56:46 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Also note that ID= must be unique. Name= does not have to be, and won't be if the same protein or repeat element aligns to more than one location for example. Thanks, Carson On 6/20/14, 3:50 PM, "Carson Holt" wrote: >did you use est_forward? Also in the example you showed all the IDs are >unique (one says hit and the other hsp in the ID, so they are different)? >Could you find the non-uunique IDs causing the error? > >--Carson > > >On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: > >>I used est_gff= option, which refers to a GFF file generated by >>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >> >>-- Priyam >> >>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt wrote: >>> Are you passing in old data via GFF3? >>> >>> --Carson >>> >>> >>> On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >>> >>>>It's version 2.31. >>>> >>>>-- Priyam >>>> >>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>wrote: >>>>> What MAKER version are you using? >>>>> >>>>> --Carson >>>>> >>>>> >>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>> >>>>>>Hi, >>>>>> >>>>>>I compiled all annotations generated by MAKER into a single GFF file >>>>>>using the gff3_merge script distributed with MAKER. While formatting >>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>> >>>>>>1. 
Three instances where two features were assigned the same id. >>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>non-existent parent. >>>>>> >>>>>>Here is the relevant portion of the GFF file: >>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>> >>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>left wondering why would these errors creep in. >>>>>> >>>>>>-- Priyam >>>>>> >>>>>>_______________________________________________ >>>>>>maker-devel mailing list >>>>>>maker-devel at box290.bluehost.com >>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o >>>>>>r >>>>>>g >>>>> >>>>> >>> >>> > > From a.priyam at qmul.ac.uk Tue Jun 24 12:56:41 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Wed, 25 Jun 2014 00:26:41 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: I am sorry. I have updated the gist - https://gist.github.com/yeban/ffaf5cd419639dd073a7. 1. The first two chunks contain the annotations with duplicate ids. (4 rows) 2. The last chunk contains the annotations that refer to a non-existent parent. And what looks like an incomplete line of annotation (I forgot to state this in my original email). No, I didn't use est_forward. I am not passing in any old data via GFF3. -- Priyam On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: > Also note that ID= must be unique. Name= does not have to be, and won't be > if the same protein or repeat element aligns to more than one location for > example. > > Thanks, > Carson > > > On 6/20/14, 3:50 PM, "Carson Holt" wrote: > >>did you use est_forward? Also in the example you showed all the IDs are >>unique (one says hit and the other hsp in the ID, so they are different)? >>Could you find the non-uunique IDs causing the error? >> >>--Carson >> >> >>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >> >>>I used est_gff= option, which refers to a GFF file generated by >>>cufflinks2gff3. 
The erroneous annotations didn't come from this GFF. >>> >>>-- Priyam >>> >>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt wrote: >>>> Are you passing in old data via GFF3? >>>> >>>> --Carson >>>> >>>> >>>> On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >>>> >>>>>It's version 2.31. >>>>> >>>>>-- Priyam >>>>> >>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>wrote: >>>>>> What MAKER version are you using? >>>>>> >>>>>> --Carson >>>>>> >>>>>> >>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>>> >>>>>>>Hi, >>>>>>> >>>>>>>I compiled all annotations generated by MAKER into a single GFF file >>>>>>>using the gff3_merge script distributed with MAKER. While formatting >>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>> >>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>non-existent parent. >>>>>>> >>>>>>>Here is the relevant portion of the GFF file: >>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>> >>>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>>left wondering why would these errors creep in. >>>>>>> >>>>>>>-- Priyam >>>>>>> >>>>>>>_______________________________________________ >>>>>>>maker-devel mailing list >>>>>>>maker-devel at box290.bluehost.com >>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o >>>>>>>r >>>>>>>g >>>>>> >>>>>> >>>> >>>> >> >> > > From carsonhh at gmail.com Tue Jun 24 14:05:00 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 24 Jun 2014 14:05:00 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12 The value 1026 is held in a global iterator, so it cannot repeat the same value during the life of the process. And 1.3.0.12 is generated from the point in the code the ID is being generated. 
This means that two distinct processes had to write to the same file at the same point in the code, which should normally be impossible.

However, there are ways to make this happen. First, if you turn file locks off (the -nolock option) and then run MAKER multiple times on the same dataset, you can get process collisions (because you disabled the locks that prevent this). Second, if your NFS file system does not support hard links (FhGFS, for example), MAKER cannot lock the files, which is the same as setting -nolock. Third, you may have other serious IO failures over NFS. Note that by NFS I mean your network-mounted storage.

The last example you give shows the preceding line being truncated. This suggests either that two processes are trying to write to the same file simultaneously (inserting lines in between other lines), or that serious IO failures are occurring where writes do not complete but success is still returned for the operation (this can happen on unreliable NFS implementations).

So in summary, either your NFS storage implementation is giving IO errors, you ran MAKER with -nolock set and then started MAKER multiple times in the same directory (process collisions), or your NFS implementation doesn't support hard links and won't allow MAKER to lock files (process collisions). If it is one of the latter two, you will have to make sure you never start MAKER more than once simultaneously on the same dataset. You can still run via MPI for parallelization, but you won't be able to start a second MPI process while the first one is still running.

Thanks,
Carson

On 6/24/14, 12:56 PM, "Anurag Priyam" wrote:

>I am sorry. I have updated the gist -
>https://gist.github.com/yeban/ffaf5cd419639dd073a7.
>1. The first two chunks contain the annotations with duplicate ids. (4
>rows)
>2. The last chunk contains the annotations that refer to a
>non-existent parent. And what looks like an incomplete line of
>annotation (I forgot to state this in my original email).
>
>No, I didn't use est_forward.
I am not passing in any old data via GFF3. > >-- Priyam > >On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >> Also note that ID= must be unique. Name= does not have to be, and won't >>be >> if the same protein or repeat element aligns to more than one location >>for >> example. >> >> Thanks, >> Carson >> >> >> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >> >>>did you use est_forward? Also in the example you showed all the IDs are >>>unique (one says hit and the other hsp in the ID, so they are >>>different)? >>>Could you find the non-uunique IDs causing the error? >>> >>>--Carson >>> >>> >>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>> >>>>I used est_gff= option, which refers to a GFF file generated by >>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >>>> >>>>-- Priyam >>>> >>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>wrote: >>>>> Are you passing in old data via GFF3? >>>>> >>>>> --Carson >>>>> >>>>> >>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>wrote: >>>>> >>>>>>It's version 2.31. >>>>>> >>>>>>-- Priyam >>>>>> >>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>wrote: >>>>>>> What MAKER version are you using? >>>>>>> >>>>>>> --Carson >>>>>>> >>>>>>> >>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>>>> >>>>>>>>Hi, >>>>>>>> >>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>file >>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>formatting >>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>> >>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>non-existent parent. >>>>>>>> >>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>> >>>>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>>>left wondering why would these errors creep in. 
>>>>>>>> >>>>>>>>-- Priyam >>>>>>>> >>>>>>>>_______________________________________________ >>>>>>>>maker-devel mailing list >>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab >>>>>>>>.o >>>>>>>>r >>>>>>>>g >>>>>>> >>>>>>> >>>>> >>>>> >>> >>> >> >> From a.priyam at qmul.ac.uk Wed Jun 25 15:11:22 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Thu, 26 Jun 2014 02:41:22 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Mmm ... I didn't use -nolock option. But I did launch some 10 MAKER processes in the same directory. I feel it's unlikely that my file system doesn't allow hardlinks because a few processes quit earlier than the others, saying something to the tune of "Another MAKER process is processing this scaffold already." I remember one process in particular had _just_ crashed. I don't remember how: I might have Ctrl-C'ed by mistake instead of detaching screen? admin killed it? temporary system glitch? Could this have caused the same issue? -- Priyam On Wed, Jun 25, 2014 at 1:35 AM, Carson Holt wrote: > Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12 > > The value 1026 is held in a global iterator, so it cannot repeat the same > value during the life of the process. And 1.3.0.12 is generated from the > point in the code the ID is being generated. This means that two distinct > processses had to write to the same file at the same point in the code, > which should normally be impossible. > > However, there are ways to make this happen. First if you turn file locks > off (-nolock) option and then run MAKER multiple times on the same dataset > you can get process collisions (because you disabled the locks that stop > this). If your NFS file system does not support hard links (FhGFS for > example) then you cannot lock the files (which is the same as setting > -nolock). Or you have other serious IO failures over NFS. 
Note that NFS > is your Network Mounted Storage. > > The last example you give shows the preceding line being truncated. This > suggests that two processes are trying to write to the same file > simultaneously (inserting lines in between other lines), or serious IO > failures are occurring where writes are not completing but true is being > returned for the operations (can happen on unreliable NFS implementations). > > So in summary either your NFS storage implementation is giving IO errors, > you have run MAKER with -nolock set and then started MAKER multiple times > in the same directory (process collisions), or your NFS implementation > doesn't support hardlinks and won't allow MAKER to lock files (process > collisions). If it is one of the latter two, you will have to make sure > you never start MAKER more than once simultaneously on the same dataset. > You can still run via MPI fro parallelization, but you won't be able to > start a second MPI process while the first one is still running. > > Thanks, > Carson > > > On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: > >>I am sorry. I have updated the gist - >>https://gist.github.com/yeban/ffaf5cd419639dd073a7. >>1. The first two chunks contain the annotations with duplicate ids. (4 >>rows) >>2. The last chunk contains the annotations that refer to a >>non-existent parent. And what looks like an incomplete line of >>annotation (I forgot to state this in my original email). >> >>No, I didn't use est_forward. I am not passing in any old data via GFF3. >> >>-- Priyam >> >>On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >>> Also note that ID= must be unique. Name= does not have to be, and won't >>>be >>> if the same protein or repeat element aligns to more than one location >>>for >>> example. >>> >>> Thanks, >>> Carson >>> >>> >>> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >>> >>>>did you use est_forward? 
Also in the example you showed all the IDs are >>>>unique (one says hit and the other hsp in the ID, so they are >>>>different)? >>>>Could you find the non-uunique IDs causing the error? >>>> >>>>--Carson >>>> >>>> >>>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>>> >>>>>I used est_gff= option, which refers to a GFF file generated by >>>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >>>>> >>>>>-- Priyam >>>>> >>>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>>wrote: >>>>>> Are you passing in old data via GFF3? >>>>>> >>>>>> --Carson >>>>>> >>>>>> >>>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>>wrote: >>>>>> >>>>>>>It's version 2.31. >>>>>>> >>>>>>>-- Priyam >>>>>>> >>>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>>wrote: >>>>>>>> What MAKER version are you using? >>>>>>>> >>>>>>>> --Carson >>>>>>>> >>>>>>>> >>>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>>>>> >>>>>>>>>Hi, >>>>>>>>> >>>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>>file >>>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>>formatting >>>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>>> >>>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>>non-existent parent. >>>>>>>>> >>>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>>> >>>>>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>>>>left wondering why would these errors creep in. 
>>>>>>>>> >>>>>>>>>-- Priyam >>>>>>>>> >>>>>>>>>_______________________________________________ >>>>>>>>>maker-devel mailing list >>>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab >>>>>>>>>.o >>>>>>>>>r >>>>>>>>>g >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>> >>>> >>> >>> > > From carsonhh at gmail.com Wed Jun 25 15:26:45 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 25 Jun 2014 15:26:45 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Maybe if it died in a weird way some of the processes could have continued briefly without active locks, but I'd more likely attribute this to NFS weirdness. Because of how network storage works, some implementations take shortcuts (like returning success on an IO operation even though it has not completed and may even fail later on). Or an IO operation can be buffered and completed several seconds later (the process that called the write operation may not even be active anymore). This is extremely common on NFS. You should probably just start MAKER fewer times in the same directory on your system. You may also want to start a single MAKER job (you should use MPI to parallelize it though), and use the -a flag. This will cause that job to just rebuild the current GFF3 and FASTA files. That way you can clean up your current results without having to rerun everything. It should run relatively quickly since MAKER will be able to make use of the existing BLAST reports etc. that are already there (exonerate will run again though, but it shouldn't take too long). --Carson On 6/25/14, 3:11 PM, "Anurag Priyam" wrote: >Mmm ... I didn't use -nolock option. But I did launch some 10 MAKER >processes in the same directory.
> >I feel it's unlikely that my file system doesn't allow hardlinks >because a few processes quit earlier than the others, saying something >to the tune of "Another MAKER process is processing this scaffold >already." > >I remember one process in particular had _just_ crashed. I don't >remember how: I might have Ctrl-C'ed by mistake instead of detaching >screen? admin killed it? temporary system glitch? Could this have >caused the same issue? > >-- Priyam > > >On Wed, Jun 25, 2014 at 1:35 AM, Carson Holt wrote: >> Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12 >> >> The value 1026 is held in a global iterator, so it cannot repeat the >>same >> value during the life of the process. And 1.3.0.12 is generated from the >> point in the code the ID is being generated. This means that two >>distinct >> processses had to write to the same file at the same point in the code, >> which should normally be impossible. >> >> However, there are ways to make this happen. First if you turn file >>locks >> off (-nolock) option and then run MAKER multiple times on the same >>dataset >> you can get process collisions (because you disabled the locks that stop >> this). If your NFS file system does not support hard links (FhGFS for >> example) then you cannot lock the files (which is the same as setting >> -nolock). Or you have other serious IO failures over NFS. Note that NFS >> is your Network Mounted Storage. >> >> The last example you give shows the preceding line being truncated. >>This >> suggests that two processes are trying to write to the same file >> simultaneously (inserting lines in between other lines), or serious IO >> failures are occurring where writes are not completing but true is being >> returned for the operations (can happen on unreliable NFS >>implementations). 
>> >> So in summary either your NFS storage implementation is giving IO >>errors, >> you have run MAKER with -nolock set and then started MAKER multiple >>times >> in the same directory (process collisions), or your NFS implementation >> doesn't support hardlinks and won't allow MAKER to lock files (process >> collisions). If it is one of the latter two, you will have to make sure >> you never start MAKER more than once simultaneously on the same dataset. >> You can still run via MPI fro parallelization, but you won't be able to >> start a second MPI process while the first one is still running. >> >> Thanks, >> Carson >> >> >> On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: >> >>>I am sorry. I have updated the gist - >>>https://gist.github.com/yeban/ffaf5cd419639dd073a7. >>>1. The first two chunks contain the annotations with duplicate ids. (4 >>>rows) >>>2. The last chunk contains the annotations that refer to a >>>non-existent parent. And what looks like an incomplete line of >>>annotation (I forgot to state this in my original email). >>> >>>No, I didn't use est_forward. I am not passing in any old data via GFF3. >>> >>>-- Priyam >>> >>>On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >>>> Also note that ID= must be unique. Name= does not have to be, and >>>>won't >>>>be >>>> if the same protein or repeat element aligns to more than one location >>>>for >>>> example. >>>> >>>> Thanks, >>>> Carson >>>> >>>> >>>> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >>>> >>>>>did you use est_forward? Also in the example you showed all the IDs >>>>>are >>>>>unique (one says hit and the other hsp in the ID, so they are >>>>>different)? >>>>>Could you find the non-uunique IDs causing the error? >>>>> >>>>>--Carson >>>>> >>>>> >>>>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>>>> >>>>>>I used est_gff= option, which refers to a GFF file generated by >>>>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. 
>>>>>> >>>>>>-- Priyam >>>>>> >>>>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>>>wrote: >>>>>>> Are you passing in old data via GFF3? >>>>>>> >>>>>>> --Carson >>>>>>> >>>>>>> >>>>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>>>wrote: >>>>>>> >>>>>>>>It's version 2.31. >>>>>>>> >>>>>>>>-- Priyam >>>>>>>> >>>>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>>>wrote: >>>>>>>>> What MAKER version are you using? >>>>>>>>> >>>>>>>>> --Carson >>>>>>>>> >>>>>>>>> >>>>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" >>>>>>>>>wrote: >>>>>>>>> >>>>>>>>>>Hi, >>>>>>>>>> >>>>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>>>file >>>>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>>>formatting >>>>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>>>> >>>>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>>>non-existent parent. >>>>>>>>>> >>>>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>>>> >>>>>>>>>>I worked around the issue temporarily for the job at hand, but I >>>>>>>>>>am >>>>>>>>>>left wondering why would these errors creep in. >>>>>>>>>> >>>>>>>>>>-- Priyam >>>>>>>>>> >>>>>>>>>>_______________________________________________ >>>>>>>>>>maker-devel mailing list >>>>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-l >>>>>>>>>>ab >>>>>>>>>>.o >>>>>>>>>>r >>>>>>>>>>g >>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> >>>> >>>> >> >> From a.priyam at qmul.ac.uk Wed Jun 25 15:38:17 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Thu, 26 Jun 2014 03:08:17 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: -a option looks like just the thing I need. I will forward concerns about NFS to our IT team. 
And definitely use MPI for parallelisation next time. Thanks a lot :). -- Priyam On Thu, Jun 26, 2014 at 2:56 AM, Carson Holt wrote: > Maybe if it died in a weird way some of the processes could have continued > briefly without active locks, but I'd more likely attribute this to NFS > weirdness. Because of how network storage works, some implementations > take shortcuts (like returning success on an IO operation even though it > has not completed and may even fail later on). Or an IO operation can be > buffered and completed several seconds later (the process that called the > write operation may not even be active anymore). This is extremely common > on NFS. You should probably just start MAKER fewer times in the same > directory on your system. You may also want to start a single MAKER job > (you should use MPI to parallelize it though), and use the -a flag. This > will cause that job to just rebuild the current GFF3 and FASTA files. > That way you can clean up your current results without having to rerun > everything. It should run relatively quickly since MAKER will be able to > make use of the existing BLAST reports etc. that are already there > (exonerate will run again though, but it shouldn't take too long). > > --Carson > > > On 6/25/14, 3:11 PM, "Anurag Priyam" wrote: > >>Mmm ... I didn't use -nolock option. But I did launch some 10 MAKER >>processes in the same directory. >> >>I feel it's unlikely that my file system doesn't allow hardlinks >>because a few processes quit earlier than the others, saying something >>to the tune of "Another MAKER process is processing this scaffold >>already." >> >>I remember one process in particular had _just_ crashed. I don't >>remember how: I might have Ctrl-C'ed by mistake instead of detaching >>screen? admin killed it? temporary system glitch? Could this have >>caused the same issue? >> >>-- Priyam >> >> >>On Wed, Jun 25, 2014 at 1:35 AM, Carson Holt wrote: >>> Thanks.
For the first two --> scaffold00002:hit:1026:1.3.0.12 >>> >>> The value 1026 is held in a global iterator, so it cannot repeat the >>>same >>> value during the life of the process. And 1.3.0.12 is generated from the >>> point in the code the ID is being generated. This means that two >>>distinct >>> processses had to write to the same file at the same point in the code, >>> which should normally be impossible. >>> >>> However, there are ways to make this happen. First if you turn file >>>locks >>> off (-nolock) option and then run MAKER multiple times on the same >>>dataset >>> you can get process collisions (because you disabled the locks that stop >>> this). If your NFS file system does not support hard links (FhGFS for >>> example) then you cannot lock the files (which is the same as setting >>> -nolock). Or you have other serious IO failures over NFS. Note that NFS >>> is your Network Mounted Storage. >>> >>> The last example you give shows the preceding line being truncated. >>>This >>> suggests that two processes are trying to write to the same file >>> simultaneously (inserting lines in between other lines), or serious IO >>> failures are occurring where writes are not completing but true is being >>> returned for the operations (can happen on unreliable NFS >>>implementations). >>> >>> So in summary either your NFS storage implementation is giving IO >>>errors, >>> you have run MAKER with -nolock set and then started MAKER multiple >>>times >>> in the same directory (process collisions), or your NFS implementation >>> doesn't support hardlinks and won't allow MAKER to lock files (process >>> collisions). If it is one of the latter two, you will have to make sure >>> you never start MAKER more than once simultaneously on the same dataset. >>> You can still run via MPI fro parallelization, but you won't be able to >>> start a second MPI process while the first one is still running. 
>>> >>> Thanks, >>> Carson >>> >>> >>> On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: >>> >>>>I am sorry. I have updated the gist - >>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7. >>>>1. The first two chunks contain the annotations with duplicate ids. (4 >>>>rows) >>>>2. The last chunk contains the annotations that refer to a >>>>non-existent parent. And what looks like an incomplete line of >>>>annotation (I forgot to state this in my original email). >>>> >>>>No, I didn't use est_forward. I am not passing in any old data via GFF3. >>>> >>>>-- Priyam >>>> >>>>On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >>>>> Also note that ID= must be unique. Name= does not have to be, and >>>>>won't >>>>>be >>>>> if the same protein or repeat element aligns to more than one location >>>>>for >>>>> example. >>>>> >>>>> Thanks, >>>>> Carson >>>>> >>>>> >>>>> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >>>>> >>>>>>did you use est_forward? Also in the example you showed all the IDs >>>>>>are >>>>>>unique (one says hit and the other hsp in the ID, so they are >>>>>>different)? >>>>>>Could you find the non-uunique IDs causing the error? >>>>>> >>>>>>--Carson >>>>>> >>>>>> >>>>>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>>>>> >>>>>>>I used est_gff= option, which refers to a GFF file generated by >>>>>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >>>>>>> >>>>>>>-- Priyam >>>>>>> >>>>>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>>>>wrote: >>>>>>>> Are you passing in old data via GFF3? >>>>>>>> >>>>>>>> --Carson >>>>>>>> >>>>>>>> >>>>>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>>>>wrote: >>>>>>>> >>>>>>>>>It's version 2.31. >>>>>>>>> >>>>>>>>>-- Priyam >>>>>>>>> >>>>>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>>>>wrote: >>>>>>>>>> What MAKER version are you using? 
>>>>>>>>>> >>>>>>>>>> --Carson >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" >>>>>>>>>>wrote: >>>>>>>>>> >>>>>>>>>>>Hi, >>>>>>>>>>> >>>>>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>>>>file >>>>>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>>>>formatting >>>>>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>>>>> >>>>>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>>>>non-existent parent. >>>>>>>>>>> >>>>>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>>>>> >>>>>>>>>>>I worked around the issue temporarily for the job at hand, but I >>>>>>>>>>>am >>>>>>>>>>>left wondering why would these errors creep in. >>>>>>>>>>> >>>>>>>>>>>-- Priyam >>>>>>>>>>> >>>>>>>>>>>_______________________________________________ >>>>>>>>>>>maker-devel mailing list >>>>>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-l >>>>>>>>>>>ab >>>>>>>>>>>.o >>>>>>>>>>>r >>>>>>>>>>>g >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>> >>> > > From rajesh.bommareddy at tu-harburg.de Mon Jun 30 04:18:12 2014 From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy) Date: Mon, 30 Jun 2014 12:18:12 +0200 Subject: [maker-devel] Maker gene prediction Message-ID: <53B13964.3060608@tu-harburg.de> Dear Sir/Madam I have a general question regarding gene prediction and annotation in Maker. For example, I have a new sequence of a yeast strain, and i have to predict and annotate the genome. Of,course i know EST's from the same organism will help me to predict the genes accurately, but when i want to use EST or RNA transcripts from a closely related organism, how can i do that in Maker and how accurate will be the prediction ?. 
Is the produced prediction and annotation valid ? How do i check this ? Thank you and Regards -- M.Sc. Rajesh Reddy Bommareddy Institute of Bioprocess and Biosystems Engineering Hamburg University of Technology(TUHH) Denikestrasse 15, 20171 Hamburg GERMANY. e.mail: rajesh.bommareddy at tu-harburg.de Phone:+4940428784011 Mobile:+4917663673522 From carsonhh at gmail.com Mon Jun 30 11:34:23 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 30 Jun 2014 11:34:23 -0600 Subject: [maker-devel] Maker gene prediction In-Reply-To: <53B13964.3060608@tu-harburg.de> References: <53B13964.3060608@tu-harburg.de> Message-ID: You can supply ESTs from a related organism to the alt_est= option. Note that this runs really slowly because both the target and the query have to be translated in all six reading frames, and it will be less sensitive (there is a larger threshold for alignments to become statistically significant). So if you have protein evidence from a related species, use that instead of EST evidence from a related species. With respect to accuracy, the alignment evidence that suggests the annotation is also the experimental evidence that supports an annotation's accuracy (so it is kind of a circular argument). But the alignment evidence does provide a correlative measurement. Genes with lower AED scores better match the evidence and should be considered higher confidence, while genes with higher AED scores have lower confidence (this correlation is very well supported across many, many organisms). You should be aware of what is considered realistic with genome annotation. In general, for newly sequenced organisms, a genome-wide accuracy of greater than 80% is considered extremely well annotated (but this can't be measured directly except retrospectively, i.e. once you have a future, more complete assembly and more experimental evidence to compare to).
Only a handful of genomes that have legions of curators working over a decade (drosophila for example) have accuracies of greater than 90%. --Carson On 6/30/14, 4:18 AM, "Rajesh Reddy Bommareddy" wrote: >Dear Sir/Madam > >I have a general question regarding gene prediction and annotation in >Maker. > >For example, I have a new sequence of a yeast strain, and i have to >predict and annotate the genome. Of,course i know EST's from the same >organism will help me to predict the genes accurately, but when i want >to use EST or RNA transcripts from a closely related organism, how can i >do that in Maker and how accurate will be the prediction ?. Is the >produced prediction and annotation valid ? How do i check this ? > >Thank you and Regards >-- >M.Sc. Rajesh Reddy Bommareddy >Institute of Bioprocess and Biosystems Engineering >Hamburg University of Technology(TUHH) >Denikestrasse 15, 20171 Hamburg >GERMANY. >e.mail: rajesh.bommareddy at tu-harburg.de >Phone:+4940428784011 >Mobile:+4917663673522 > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Mon Jun 2 09:10:30 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 02 Jun 2014 09:10:30 -0600 Subject: [maker-devel] Precomputed alignments In-Reply-To: References: Message-ID: With the Target and Gap attribute you get slightly better behavior on filtering when you specify the blast_depth=X parameter in the maker_bopts.ctl file (keeps only X best hits). They will also affect the eAED score since it takes reading frame into account (so no Gap attribute means no assumption of reading frame). Otherwise they are only beneficial for seeing the alignment in a viewer as some viewers can recover the alignment when those values are specified. If you are not using blast_depth or trying to view the alignments in a viewer they don't really do anything. 
MAKER will just assume perfect match across the specified regions. --Carson From: Daniel Standage Date: Saturday, May 31, 2014 at 9:23 AM To: Maker Mailing List Subject: [maker-devel] Precomputed alignments Hello again! About a year ago I asked about using precomputed alignments with Maker. The thread quickly took a different direction as we tried to track down other issues, and I never got the thread back on its original track. So, to return to the original question, what exactly is required when providing pre-computed alignments in GFF3 format? For example, does it affect Maker's behavior whether a score is given? The "Target" attribute? The "Gap" attribute? Thanks! -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Jun 2 09:23:25 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 02 Jun 2014 09:23:25 -0600 Subject: [maker-devel] tRNAscan and map_gff_ids Message-ID: I've now patched the current download to fix this and a plus strand spliced tRNA bug. --Carson On 5/20/14, 1:17 PM, "Fields, Christopher J" wrote: >I found a problem with some tRNAscan output using MAKER 2.31.5. I had a >full MAKER data set (run initially using MAKER 2.31.5) that I mapped IDs >for. This was then run as follows, with the requisite error: > >-system-specific-4.1$ map_gff_ids id.map Zalbi.all.gff3 >Nested quantifiers in regex; marked by <-- HERE in >m/trnascan-KB913038.1-noncoding-Undet_??? <-- HERE -gene-79.0/ at >/home/groups/hpcbio/apps/maker/maker-2.31.5/bin/map_gff_ids line 111, ><$IN> line 3067590. > >The problematic lines: > >---------------------------------------------- >-system-specific-4.1$ grep "???" 
Zalbi.all.gff3 >KB913038.1 maker gene 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene >-79.0 >KB913038.1 maker tRNA 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0-tRNA-1;Parent=trnascan-KB913038.1-noncoding-Undet >_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA >-1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|70|0 >KB913038.1 maker exon 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0-tRNA-1:exon:2193;Parent=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0-tRNA-1 >KB913039.1 maker gene 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene >-72.0 >KB913039.1 maker tRNA 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0-tRNA-1;Parent=trnascan-KB913039.1-noncoding-Undet >_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA >-1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|74|0 >KB913039.1 maker exon 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0-tRNA-1:exon:4036;Parent=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0-tRNA-1 >---------------------------------------------- > >I managed to get it going by using the following modifications (regex >quotemeta) in map_gff_ids (lines 107-112): > > for my $id (@map_ids) { > # Only if the value (or the portion preceding > # the first colon) is equal to the map key. > next unless ($value eq $id || $value =~ /^\Q$id\E:/); > $value =~ s/\Q$id\E/$map{$id}/ unless($tag eq 'Name' && $id !~ >/\-gene\-\d+\.\d+|^CG\:|^....\:|^[^\:]+\:temp\d+\:/); > } > >I'm guessing there may be a similar problem with map_fasta_ids?
> >chris >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From cjfields at illinois.edu Mon Jun 2 10:45:09 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 2 Jun 2014 16:45:09 +0000 Subject: [maker-devel] tRNAscan and map_gff_ids In-Reply-To: References: Message-ID: <007A79A7-8C68-4AFC-AC4F-451194D4CD29@illinois.edu> Thanks Carson! chris On Jun 2, 2014, at 10:23 AM, Carson Holt wrote: > I've now patched the current download to fix this and a plus strand > spliced tRNA bug. > > --Carson > > > On 5/20/14, 1:17 PM, "Fields, Christopher J" wrote: > >> I found a problem with some tRNAscan output using MAKER 2.31.5. I had a >> full MAKER data set (run initially using MAKER 2.31.5) that I mapped IDs >> for. This was then run as follows, with the requisite error: >> >> -system-specific-4.1$ map_gff_ids id.map Zalbi.all.gff3 >> Nested quantifiers in regex; marked by <-- HERE in >> m/trnascan-KB913038.1-noncoding-Undet_??? <-- HERE -gene-79.0/ at >> /home/groups/hpcbio/apps/maker/maker-2.31.5/bin/map_gff_ids line 111, >> <$IN> line 3067590. >> >> The problematic lines: >> >> ---------------------------------------------- >> -system-specific-4.1$ grep "???" Zalbi.all.gff3 >> KB913038.1 maker gene 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene >> -79.0 >> KB913038.1 maker tRNA 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0-tRNA-1;Parent=trnascan-KB913038.1-noncoding-Undet >> _???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA >> -1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|70|0 >> KB913038.1 maker exon 23847890 23847958 . - . 
ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0-tRNA-1:exon:2193;Parent=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0-tRNA-1 >> KB913039.1 maker gene 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene >> -72.0 >> KB913039.1 maker tRNA 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0-tRNA-1;Parent=trnascan-KB913039.1-noncoding-Undet >> _???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA >> -1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|74|0 >> KB913039.1 maker exon 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0-tRNA-1:exon:4036;Parent=trnascan-KB913039.1-nonco >> ding-Undet_???-gene-72.0-tRNA-1 >> ---------------------------------------------- >> >> I managed to get it going by using the following modifications (regex >> quotemeta) in map_gff_ids (lines 107-112): >> >> for my $id (@map_ids) { >> # Only if the value (or the portion preceding >> # the first colon) is equal to the map key. >> next unless ($value eq $id || $value =~ /^\Q$id\E:/); >> $value =~ s/\Q$id\E/$map{$id}/ unless($tag eq 'Name' && $id !~ >> /\-gene\-\d+\.\d+|^CG\:|^....\:|^[^\:]+\:temp\d+\:/); >> } >> >> I'm guessing there may be a similar problem with map_fasta_ids?
>> >> chris >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From anthony.bretaudeau at rennes.inra.fr Tue Jun 3 02:38:31 2014 From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau) Date: Tue, 03 Jun 2014 10:38:31 +0200 Subject: [maker-devel] Merging 2 annotations Message-ID: <538D8987.4090606@rennes.inra.fr> Hello, I am working on the annotation of an insect genome, and I have 2 gff files: -an automatic annotation (done by another lab, with something other than MAKER, ~20000 genes) -a manually curated annotation (with webapollo, ~1500 genes) From this, I would like to produce a single gff combining the 2. I'd like to keep all the manually curated models, and only the automatic ones that have no equivalent in the manually curated gff. Is it possible to do something like this with maker? I guess I could play with the model_gff option, but I'm not sure how exactly I could use it. Thank you for your help Regards Anthony From shpeng at shou.edu.cn Mon Jun 2 20:30:17 2014 From: shpeng at shou.edu.cn (彭司华) Date: Tue, 3 Jun 2014 10:30:17 +0800 (GMT+08:00) Subject: [maker-devel] Maker can not run repeatmasker Message-ID: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> Using the example data, I can not run maker successfully. Please see the following: [root at c0105 test3]# ls -l total 72 -rw-r--r--. 1 root root 32712 May 29 02:46 dpp_contig.fasta -rw-r--r--. 1 root root 19138 May 29 02:46 dpp_est.fasta -rw-r--r--. 1 root root 3045 May 29 02:46 dpp_protein.fasta -rw-r--r--. 1 root root 1413 May 29 02:46 maker_bopts.ctl -rw-r--r--. 1 root root 1288 May 29 02:46 maker_exe.ctl -rw-r--r--. 1 root root 4630 May 29 04:27 maker_opts.ctl [root at c0105 test3]# maker STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... STATUS: Setting up database for any GFF3 input...
A data structure will be created for you at: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore To access files for individual sequences use the datastore index: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_master_datastore_index.log STATUS: Now running MAKER... examining contents of the fasta file and run log --Next Contig-- #--------------------------------------------------------------------- Now starting the contig!! SeqID: contig-dpp-500-500 Length: 32156 #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased #--------------------------------------------------------------------- Now retrying the contig!! 
SeqID: contig-dpp-500-500 Length: 32156 Tries: 2!! #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased Maker is now finished!!! Start_time: 1401761680 End_time: 1401761688 Elapsed: 8 Thanks. Sihua -------------- next part -------------- An HTML attachment was scrubbed... URL: From janphilipoyen at gmail.com Tue Jun 3 09:07:17 2014 From: janphilipoyen at gmail.com (Jan Philip Øyen) Date: Tue, 3 Jun 2014 17:07:17 +0200 Subject: [maker-devel] AED scores and thresholds: Not filtering? Message-ID: Hello, We are currently working on annotating arthropod genomes with the MAKER pipeline.
We have tried to filter out genes with low support by setting AED_threshold=0.5. However, we still find genes without scores and, most importantly, genes with scores above the set threshold (even AED=1). Is there a known issue which could cause this? We can supply the control files as well as GFFs, but would not like to make them public at this stage. Best Regards, Jan Philip Oeyen ZFMK / ZMB / University of Bonn -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jun 3 09:10:27 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 03 Jun 2014 09:10:27 -0600 Subject: [maker-devel] Maker can not run repeatmasker In-Reply-To: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> References: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> Message-ID: The message is basically saying that RepeatMasker is not installed correctly. Follow the instructions here --> http://www.repeatmasker.org/RMDownload.html --Carson From: 彭司华 Date: Monday, June 2, 2014 at 8:30 PM To: Subject: [maker-devel] Maker can not run repeatmasker Using the example data, I can not run maker successfully. Please see the following: [root at c0105 test3]# ls -l total 72 -rw-r--r--. 1 root root 32712 May 29 02:46 dpp_contig.fasta -rw-r--r--. 1 root root 19138 May 29 02:46 dpp_est.fasta -rw-r--r--. 1 root root 3045 May 29 02:46 dpp_protein.fasta -rw-r--r--. 1 root root 1413 May 29 02:46 maker_bopts.ctl -rw-r--r--. 1 root root 1288 May 29 02:46 maker_exe.ctl -rw-r--r--. 1 root root 4630 May 29 04:27 maker_opts.ctl [root at c0105 test3]# maker STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... STATUS: Setting up database for any GFF3 input...
A data structure will be created for you at: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore To access files for individual sequences use the datastore index: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_master_datas tore_index.log STATUS: Now running MAKER... examining contents of the fasta file and run log --Next Contig-- #--------------------------------------------------------------------- Now starting the contig!! SeqID: contig-dpp-500-500 Length: 32156 #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.al l.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVo id.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased #--------------------------------------------------------------------- Now retrying the contig!! 
SeqID: contig-dpp-500-500 Length: 32156 Tries: 2!! #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.al l.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVo id.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased Maker is now finished!!! Start_time: 1401761680 End_time: 1401761688 Elapsed: 8 Thnaks. Sihua _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jun 3 09:51:44 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 03 Jun 2014 09:51:44 -0600 Subject: [maker-devel] AED scores and thresholds: Not filtering? In-Reply-To: References: Message-ID: No. 
It should use whichever is lower, the AED or the eAED score. The only exception is model_gff results. Those are always kept. Also note that the filter is applied to the entire gene, not to individual splice forms, if you have alternative splicing. If you want, I can take a look to see if there is anything non-obvious. You would have to send me the final GFF3 and the maker_opts.ctl file.

--Carson

From: Jan Philip Øyen
Date: Tuesday, June 3, 2014 at 9:07 AM
To:
Subject: [maker-devel] AED scores and thresholds: Not filtering?

Hello,

We are currently working on annotating arthropod genomes with the maker pipeline. We have tried to filter out genes with low support by setting AED_threshold=0.5. However, we still find genes without scores and, most importantly, genes with scores above the set threshold (even AED=1). Is there a known issue which could cause this? We can supply the control files as well as gffs, but would not like to make them public at this stage.

Best Regards,
Jan Philip Oeyen
ZFMK / ZMB / University of Bonn

From carsonhh at gmail.com Tue Jun 3 10:15:46 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 03 Jun 2014 10:15:46 -0600
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <538D8987.4090606@rennes.inra.fr>
References: <538D8987.4090606@rennes.inra.fr>
Message-ID:

You can give the manually curated ones to model_gff and the other ones to pred_gff. Then set keep_preds=1. The model_gff results always get kept even without evidence support; the pred_gff results will be kept even without evidence support because you set keep_preds=1, but model_gff results will take precedence.
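
[Editorial illustration] The precedence rule described above can be sketched in a few lines of Python. This is only a toy model of the outcome Anthony asked for (keep every curated model, keep an automatic model only when nothing curated overlaps it). It is not MAKER's implementation, which may resolve overlaps differently, and the names Gene, overlaps and merge_with_precedence are hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    """Minimal gene record; coordinates are 1-based and inclusive, as in GFF3."""
    seqid: str
    start: int
    end: int
    strand: str
    source: str  # hypothetical label: "curated" or "auto"


def overlaps(a: Gene, b: Gene) -> bool:
    """True if two genes share at least one base on the same sequence and strand."""
    return (a.seqid == b.seqid and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def merge_with_precedence(curated: List[Gene], automatic: List[Gene]) -> List[Gene]:
    """Keep all curated genes; keep an automatic gene only if no curated gene overlaps it.

    Assumption: "no equivalent" is approximated here by "no overlap".
    """
    kept = list(curated)
    for gene in automatic:
        if not any(overlaps(gene, c) for c in curated):
            kept.append(gene)
    return kept


if __name__ == "__main__":
    curated = [Gene("scaffold1", 100, 500, "+", "curated")]
    automatic = [
        Gene("scaffold1", 450, 900, "+", "auto"),    # overlaps a curated gene, dropped
        Gene("scaffold1", 2000, 2500, "+", "auto"),  # no curated equivalent, kept
    ]
    for gene in merge_with_precedence(curated, automatic):
        print(gene)
```

In a real run the same effect is requested through the control file options named in the message (model_gff, pred_gff, keep_preds=1) rather than through a script like this.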
--Carson

On 6/3/14, 2:38 AM, "Anthony Bretaudeau" wrote:

>Hello,
>
>I am working on the annotation of an insect genome, and I have 2 gff
>files:
>-an automatic annotation (done by another lab, with something else than
>maker, ~20000 genes)
>-a manually curated annotation (with webapollo, ~1500 genes)
>
> From this, I would like to produce a single gff combining the 2. I'd
>like to keep all the manually curated models, and only the automatic
>ones that have no equivalent in the manually curated gff.
>
>Is it possible to do something like this with maker? I guess I could
>play with the model_gff option, but I'm not sure how exactly I could use
>it.
>
>Thank you for your help
>Regards
>
>Anthony

From carsonhh at gmail.com Tue Jun 3 20:20:20 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 03 Jun 2014 20:20:20 -0600
Subject: [maker-devel] Short Introns
In-Reply-To:
References:
Message-ID:

I think you may be best off using WebApollo to manually annotate the few hundred genes with short introns. It's not that fun to do, but you should be able to get through them all in a couple of days by yourself, or in under a day if you had a helper.

--Carson

On 5/15/14, 11:15 AM, "Mack, Brian" wrote:

>Hi, I examined the genes that had introns less than 10 bp that were being
>flagged by tbl2asn and I noticed that all 438 of them were genes called
>by SNAP. Also they were found in the CDS and not the UTR. It seems
>strange that all of the genes that have these short introns are from SNAP
>when only about one third of the final gene models are from SNAP. I've
>examined the evidence for a handful of these genes and the short introns
>do not seem supported by the evidence. Has anybody else had short intron
>issues with SNAP?
> >Brian > >-----Original Message----- >From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf >Of Carson Holt >Sent: Friday, April 18, 2014 10:36 AM >To: UMD Bioinformatics; maker-devel at yandell-lab.org >Subject: Re: [maker-devel] Short Introns > >Look at the name of those genes. The original name will let you know >where it came from because it will contain, augustus, genemark, snap, etc. > You will also want to open up the contig containing those geens in a >viewer like apollo >(http://weatherby.genetics.utah.edu/apollo/apollo.tar.gz). See if the >short intron is part of the CDS or UTR. If it's UTR then, it has >evidence support from an EST, which either means there are problems with >the EST/cDNA evidence or it's real. For those, even if they are real you >can just trim them off. If it's part of the CDS, then investigate >whether it is suggested by EST or protein evidence, or if the ab initio >predictor called it (sometime the ab initio predictor calls things to >force an ORF to work). This can sometimes be indicative of assembly >issues in that region. > >--Carson > > >On 4/18/14, 7:14 AM, "UMD Bioinformatics" >wrote: > >>Hello, >> >>We are preparing two submission for NCBI, nightmare. However some of >>our MAKER gene models have short introns that are being flagged by >>NCBI. In one species we have >400 introns smaller then 20bp which is >>almost biologically impossible. I know we can set max intron length in >>the opts.ctl file but can we set a minimum intron length? >> >>I saw yesterdays posts that mention this is a result of the external ab >>initio predictors but I didn?t see an indication as to which predictor >>and how to change that setting. 
>> >>from yesterday: >>*These are just short introns (intron size is under control of the ab >>initio >>predictors) --> 438 ERROR: SEQ_FEAT.ShortIntron >> >>Cheers >>Ian >> >> >>_______________________________________________ >>maker-devel mailing list >>maker-devel at box290.bluehost.com >>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > > >This electronic message contains information generated by the USDA solely >for the intended recipients. Any unauthorized interception of this >message or the use or disclosure of the information it contains may >violate the law and subject the violator to civil or criminal penalties. >If you believe you have received this message in error, please notify the >sender and delete the email immediately. From sujaikumar at gmail.com Wed Jun 4 06:26:09 2014 From: sujaikumar at gmail.com (Sujai) Date: Wed, 4 Jun 2014 13:26:09 +0100 Subject: [maker-devel] Augustus compilation Message-ID: Hi all I've installed older versions of Maker (up to 2.28) before successfully. I was trying to install maker 2.31.6 on a new cluster and decided to use the built in installers for the dependencies. Unfortunately ./Build augustuc gives this error: Unpacking augustus tarball... Configuring augustus... 
g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o genbank.o genbank.cc -I../include
g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o properties.o properties.cc -I../include
properties.cc: In static member function 'static void Properties::init(int, char**)':
properties.cc:349:25: error: 'boost::filesystem::path' has no member named 'native'
 configPath = cpath.native();
 ^
properties.cc: In function 'boost::filesystem::path findLocationOfSelfBinary()':
properties.cc:615:10: error: 'read_symlink' is not a member of 'boost::filesystem'
 bpath = boost::filesystem::read_symlink(bpath);
 ^
make: *** [properties.o] Error 1

ERROR: Failed installing augustus, now cleaning installation path...
You may need to install augustus manually.

----

Would anyone have any suggestions for how to fix this? I've tried editing the ../exe/augustus-3.0.2/src/Makefile line:

LIBS = -lboost_iostreams -lboost_system -lboost_filesystem

to add the path to my system boost lib:

LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib -lboost_iostreams -lboost_system -lboost_filesystem

and then running make from inside ../exe/augustus-3.0.2/src, but I get the same error again.

From mike.thon at gmail.com Wed Jun 4 07:31:30 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Wed, 4 Jun 2014 15:31:30 +0200
Subject: [maker-devel] Augustus compilation
In-Reply-To:
References:
Message-ID: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com>

Hi - Yes, the latest version of Augustus needs the Boost library. If you're on Linux you should be able to install it through the package manager. It's called libboost-dev or something similar.

-Mike

On Jun 4, 2014, at 2:26 PM, Sujai wrote:

> Hi all
>
> I've installed older versions of Maker (up to 2.28) before successfully.
>
> I was trying to install maker 2.31.6 on a new cluster and decided to
> use the built in installers for the dependencies.
>
> Unfortunately
>
> ./Build augustuc
>
> gives this error:
>
> Unpacking augustus tarball...
> Configuring augustus... > g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o genbank.o > genbank.cc -I../include > g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o properties.o > properties.cc -I../include > properties.cc: In static member function 'static void > Properties::init(int, char**)': > properties.cc:349:25: error: 'boost::filesystem::path' has no member > named 'native' > configPath = cpath.native(); > ^ > properties.cc: In function 'boost::filesystem::path findLocationOfSelfBinary()': > properties.cc:615:10: error: 'read_symlink' is not a member of > 'boost::filesystem' > bpath = boost::filesystem::read_symlink(bpath); > ^ > make: *** [properties.o] Error 1 > > ERROR: Failed installing augustus, now cleaning installation path... > You may need to install augustus manually. > > ---- > > Would anyone have any suggestions for how to fix this? I've tried > editing the ../exe/augustus-3.0.2/src/Makefile line: > > LIBS = -lboost_iostreams -lboost_system -lboost_filesystem > > to add the path to my system boost lib: > > LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib > -lboost_iostreams -lboost_system -lboost_filesystem > > and then running make from inside ../exe/augustus-3.0.2/src but I get > the same error again > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From sujaikumar at gmail.com Wed Jun 4 07:34:50 2014 From: sujaikumar at gmail.com (Sujai) Date: Wed, 4 Jun 2014 14:34:50 +0100 Subject: [maker-devel] Augustus compilation In-Reply-To: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com> References: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com> Message-ID: Hi Mike Thanks for the super prompt response. I am on a cluster where I can't install libboost-dev. 
However, boost is on the cluster (as I wrote, it is compiled in the /system/software/linux-x86_64/lib/boost/1_55_0/lib directory) so is my modification to the Makefile below correct, or is there something else I need to do? LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib -lboost_iostreams -lboost_system -lboost_filesystem Cheers, - Sujai On 4 June 2014 14:31, Michael Thon wrote: > Hi - Yes it the latest version of augustus needs the boost library. If you're on linux you should be able to install it through the package manager. its called libboost-dev or some such thing. > > -Mike > > On Jun 4, 2014, at 2:26 PM, Sujai wrote: > >> Hi all >> >> I've installed older versions of Maker (up to 2.28) before successfully. >> >> I was trying to install maker 2.31.6 on a new cluster and decided to >> use the built in installers for the dependencies. >> >> Unfortunately >> >> ./Build augustuc >> >> gives this error: >> >> Unpacking augustus tarball... >> Configuring augustus... >> g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o genbank.o >> genbank.cc -I../include >> g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o properties.o >> properties.cc -I../include >> properties.cc: In static member function 'static void >> Properties::init(int, char**)': >> properties.cc:349:25: error: 'boost::filesystem::path' has no member >> named 'native' >> configPath = cpath.native(); >> ^ >> properties.cc: In function 'boost::filesystem::path findLocationOfSelfBinary()': >> properties.cc:615:10: error: 'read_symlink' is not a member of >> 'boost::filesystem' >> bpath = boost::filesystem::read_symlink(bpath); >> ^ >> make: *** [properties.o] Error 1 >> >> ERROR: Failed installing augustus, now cleaning installation path... >> You may need to install augustus manually. >> >> ---- >> >> Would anyone have any suggestions for how to fix this? 
I've tried
>> editing the ../exe/augustus-3.0.2/src/Makefile line:
>>
>> LIBS = -lboost_iostreams -lboost_system -lboost_filesystem
>>
>> to add the path to my system boost lib:
>>
>> LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib
>> -lboost_iostreams -lboost_system -lboost_filesystem
>>
>> and then running make from inside ../exe/augustus-3.0.2/src but I get
>> the same error again

From daniel.standage at gmail.com Wed Jun 4 13:03:27 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Wed, 4 Jun 2014 15:03:27 -0400
Subject: [maker-devel] Filtering of ab initio gene models
Message-ID:

Thanks everyone for your responses recently!

The reason for my recent flurry of email activity is that I'm seeing some unexpected trends when running the new version of Maker with precomputed alignments. Compared with an annotation I did a while ago (Maker 2.10, Maker-computed alignments), this new annotation has a substantial number of new genes annotated. If I compare distributions of AED scores between the old and new annotation, it's clear that the new annotation has a lot more low-quality models. If I look at new gene models that do not overlap with any gene model from the old annotation, the likelihood that it's a low-quality model is much higher.

I decided to run a little experiment. I annotated a scaffold first using Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same pre-computed transcript and protein alignments and the same (latest) version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 locus with only models from 2.10, and 28 loci with only models from 2.31.3.
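
[Editorial illustration] The "group gene models into loci by overlap" step mentioned above can be sketched as a simple interval sweep. This is a guess at the kind of comparison described, not the script Daniel used. It assumes all models lie on one scaffold, ignores strand, and uses the hypothetical labels "old" (Maker 2.10) and "new" (Maker 2.31.3). The coordinates in the usage example are invented and do not reproduce the counts reported in the message.

```python
from typing import Dict, List, Tuple

# (start, end, label) with 1-based inclusive coordinates on a single scaffold
Interval = Tuple[int, int, str]


def group_into_loci(intervals: List[Interval]) -> List[List[str]]:
    """Merge overlapping intervals into loci; return the labels found in each locus."""
    loci: List[List[str]] = []
    current: List[str] = []
    current_end = 0
    for start, end, label in sorted(intervals):
        if current and start <= current_end:
            # Overlaps the locus being built: extend it.
            current.append(label)
            current_end = max(current_end, end)
        else:
            # Starts a new locus.
            if current:
                loci.append(current)
            current = [label]
            current_end = end
    if current:
        loci.append(current)
    return loci


def summarize(loci: List[List[str]]) -> Dict[str, int]:
    """Count loci with models from both annotations, only the old one, or only the new one."""
    counts = {"both": 0, "only_old": 0, "only_new": 0}
    for labels in loci:
        has_old = "old" in labels
        has_new = "new" in labels
        if has_old and has_new:
            counts["both"] += 1
        elif has_old:
            counts["only_old"] += 1
        else:
            counts["only_new"] += 1
    return counts


if __name__ == "__main__":
    models = [
        (100, 500, "old"), (450, 900, "new"),      # shared locus
        (2000, 2500, "new"),                       # new annotation only
        (3000, 3400, "old"), (3300, 3600, "new"),  # shared locus
        (5000, 5200, "old"),                       # old annotation only
    ]
    print(summarize(group_into_loci(models)))
```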
Before this experiment, I assumed the issue was related to providing pre-computed alignments in GFF3 format and perhaps violating some important assumption. However, this experiment makes me wonder whether there have been changes to how Maker filters *ab initio* gene models between version 2.10 and version 2.31.3? Do you have any ideas? If it would help, I could put together a small data set that reproduces the behavior I just described. Thanks! -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Wed Jun 4 13:09:29 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 04 Jun 2014 13:09:29 -0600 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: Sure. that would be helpful. One question. Do you provide the Gap attribute in your precomputed alignments? Having or not having that attribute affects the eAED score which takes reading frame into account, and may cause some things to be kept that normally would be dropped, because MAKER won't be able to take the points of mismatch of the alignment into account (it just assumes match everywhere). --Carson From: Daniel Standage Date: Wednesday, June 4, 2014 at 1:03 PM To: Maker Mailing List Subject: [maker-devel] Filtering of ab initio gene models Thanks everyone for your responses recently! The reason for my recent flurry of email activity is that I'm seeing some unexpected trends when running the new version of Maker with precomputed alignments. Compared with an annotation I did a while ago (Maker 2.10, Maker-computed alignments), this new annotation has a substantial number of new genes annotated. If I compare distributions of AED scores between the old and new annotation, it's clear that the new annotation has a lot more low-quality models. 
If I look at new gene models that do not overlap with any gene model from the old annotation, the likelihood that it's a low-quality model is much higher. I decided to run a little experiment. I annotated a scaffold first using Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same pre-computed transcript and protein alignments and the same (latest) version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 locus with only models from 2.10, and 28 loci with only models from 2.31.3. Before this experiment, I assumed the issue was related to providing pre-computed alignments in GFF3 format and perhaps violating some important assumption. However, this experiment makes me wonder whether there have been changes to how Maker filters ab initio gene models between version 2.10 and version 2.31.3? Do you have any ideas? If it would help, I could put together a small data set that reproduces the behavior I just described. Thanks! -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.standage at gmail.com Wed Jun 4 13:11:44 2014 From: daniel.standage at gmail.com (Daniel Standage) Date: Wed, 4 Jun 2014 15:11:44 -0400 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: I do not provide Gap or Target attributes in the GFF3. Will this affect the AED as well, or just the eAED? -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: > Sure. that would be helpful. 
One question. Do you provide the Gap > attribute in your precomputed alignments? Having or not having that > attribute affects the eAED score which takes reading frame into account, > and may cause some things to be kept that normally would be dropped, > because MAKER won't be able to take the points of mismatch of the alignment > into account (it just assumes match everywhere). > > --Carson > > > From: Daniel Standage > Date: Wednesday, June 4, 2014 at 1:03 PM > To: Maker Mailing List > Subject: [maker-devel] Filtering of ab initio gene models > > Thanks everyone for your responses recently! > > The reason for my recent flurry of email activity is that I'm seeing some > unexpected trends when running the new version of Maker with precomputed > alignments. Compared with an annotation I did a while ago (Maker 2.10, > Maker-computed alignments), this new annotation has a substantial number of > new genes annotated. If I compare distributions of AED scores between the > old and new annotation, it's clear that the new annotation has a lot more > low-quality models. If I look at new gene models that do not overlap with > any gene model from the old annotation, the likelihood that it's a > low-quality model is much higher. > > I decided to run a little experiment. I annotated a scaffold first using > Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same > pre-computed transcript and protein alignments and the same (latest) > version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted > 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci > by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 > locus with only models from 2.10, and 28 loci with only models from 2.31.3. > > Before this experiment, I assumed the issue was related to providing > pre-computed alignments in GFF3 format and perhaps violating some important > assumption. 
However, this experiment makes me wonder whether there have
> been changes to how Maker filters *ab initio* gene models between version
> 2.10 and version 2.31.3? Do you have any ideas? If it would help, I could
> put together a small data set that reproduces the behavior I just described.
>
> Thanks!
>
> --
> Daniel S. Standage
> Ph.D. Candidate
> Computational Genome Science Laboratory
> Indiana University

From carsonhh at gmail.com Wed Jun 4 13:17:34 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 04 Jun 2014 13:17:34 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Just eAED, but eAED can affect the selection of ab initio results. For example, it covers the reading-frame match of protein evidence, which also affects whether evidence from single_exon=1 and genes with single-exon protein evidence get kept. There is also the assumption that your alignments in GFF3 are correctly spliced (as BLAT alignments are). So giving blastn results as precomputed est_gff would create a lot of noise, since MAKER ignores blastn and uses it only to seed the polished exonerate alignments.

--Carson

From: Daniel Standage
Date: Wednesday, June 4, 2014 at 1:11 PM
To: Carson Holt
Cc: Maker Mailing List
Subject: Re: [maker-devel] Filtering of ab initio gene models

I do not provide Gap or Target attributes in the GFF3. Will this affect the AED as well, or just the eAED?

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote:

> Sure. that would be helpful. One question. Do you provide the Gap attribute
> in your precomputed alignments?
Having or not having that attribute affects > the eAED score which takes reading frame into account, and may cause some > things to be kept that normally would be dropped, because MAKER won't be able > to take the points of mismatch of the alignment into account (it just assumes > match everywhere). > > --Carson > > > From: Daniel Standage > Date: Wednesday, June 4, 2014 at 1:03 PM > To: Maker Mailing List > Subject: [maker-devel] Filtering of ab initio gene models > > Thanks everyone for your responses recently! > > The reason for my recent flurry of email activity is that I'm seeing some > unexpected trends when running the new version of Maker with precomputed > alignments. Compared with an annotation I did a while ago (Maker 2.10, > Maker-computed alignments), this new annotation has a substantial number of > new genes annotated. If I compare distributions of AED scores between the old > and new annotation, it's clear that the new annotation has a lot more > low-quality models. If I look at new gene models that do not overlap with any > gene model from the old annotation, the likelihood that it's a low-quality > model is much higher. > > I decided to run a little experiment. I annotated a scaffold first using Maker > 2.10 and then using Maker 2.31.3. I both cases, I used the same pre-computed > transcript and protein alignments and the same (latest) version of SNAP as the > only ab initio predictor. Maker 2.10 predicted 44 genes while Maker 2.31.3 > predicted 63. If we group gene models into loci by overlap, there are 33 loci > with gene models from both 2.10 and 2.31.3, 1 locus with only models from > 2.10, and 28 loci with only models from 2.31.3. > > Before this experiment, I assumed the issue was related to providing > pre-computed alignments in GFF3 format and perhaps violating some important > assumption. 
However, this experiment makes me wonder whether there have been > changes to how Maker filters ab initio gene models between version 2.10 and > version 2.31.3? Do you have any ideas? If it would help, I could put together > a small data set that reproduces the behavior I just described. > > Thanks! > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > _______________________________________________ maker-devel mailing list > maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo/mak > er-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From ranjani at uga.edu Thu Jun 5 09:49:36 2014 From: ranjani at uga.edu (Sivaranjani Namasivayam) Date: Thu, 5 Jun 2014 15:49:36 +0000 Subject: [maker-devel] protein2genome gene models from protein gff Message-ID: <1401983375868.65464@uga.edu> Hi, I am trying to predict gene models from protein evidence, using the parameter protein2genome set to 1. I get gene models predicted if I provide the proteins as a fasta file, but not as gff (I want to use a gff format to avoid the blastx step again). Is this expected? In case of transcriptome evidence and est2genome set to 1, I get gene models predicted with both fasta and gff formats. Thanks, Ranjani -------------- next part -------------- An HTML attachment was scrubbed... URL: From Brian.Mack at ARS.USDA.GOV Thu Jun 5 11:56:04 2014 From: Brian.Mack at ARS.USDA.GOV (Mack, Brian) Date: Thu, 5 Jun 2014 17:56:04 +0000 Subject: [maker-devel] missing start and stop codons Message-ID: I've been looking at the start and stop codons of the maker predicted transcripts after I stripped off the utr with the fasta_tool script and 7% of the transcripts do not have ATG as a start codon. Also 2% do not have TAA, TAG, or TGA as a stop codon. Could these be real start and stop codons and just rare variants, or should I consider these incomplete genes? 
If I was to turn on the "always_complete" option in Maker what would that actually do?

Thanks,
Brian

From carsonhh at gmail.com Thu Jun 5 12:01:24 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 12:01:24 -0600
Subject: [maker-devel] missing start and stop codons
Message-ID:

They are incomplete genes; there are many reasons why this happens in new assemblies. You can turn always_complete on to try to force a complete model, but what is added or subtracted to get a start and stop codon may not be biologically correct. It's just forced canonical. Also make sure to use the latest MAKER version. Versions 2.29 and earlier didn't correct for the BioPerl codon table, which allows for an extra non-canonical start codon. Now MAKER exports a strict canonical table to BioPerl, so 'M' is the only start.

--Carson

From: "Mack, Brian"
Date: Thursday, June 5, 2014 at 11:56 AM
To: "maker-devel at yandell-lab.org"
Subject: [maker-devel] missing start and stop codons

I've been looking at the start and stop codons of the maker predicted transcripts after I stripped off the utr with the fasta_tool script and 7% of the transcripts do not have ATG as a start codon. Also 2% do not have TAA, TAG, or TGA as a stop codon. Could these be real start and stop codons and just rare variants, or should I consider these incomplete genes? If I was to turn on the "always_complete" option in Maker what would that actually do?
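
[Editorial illustration] The check described in the question (does a UTR-stripped transcript begin with ATG and end with TAA, TAG, or TGA?) can be sketched as below. This is a minimal example under stated assumptions, not part of MAKER: the function name classify_cds is hypothetical, the input is assumed to be a plain CDS string already stripped of UTR, and "complete" is defined here as canonical start, canonical stop, and a length that is a multiple of three.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_cds(cds: str) -> dict:
    """Report whether a coding sequence has a canonical start and stop codon."""
    seq = cds.strip().upper()
    has_start = seq.startswith("ATG")
    has_stop = len(seq) >= 3 and seq[-3:] in STOP_CODONS
    in_frame = len(seq) % 3 == 0
    return {
        "has_start": has_start,
        "has_stop": has_stop,
        "in_frame": in_frame,
        # Assumption: a model missing either codon is treated as incomplete.
        "complete": has_start and has_stop and in_frame,
    }


if __name__ == "__main__":
    for name, seq in [("full", "ATGAAACCCTAA"),
                      ("no_start", "CTGAAACCCTAA"),
                      ("no_stop", "ATGAAACCC")]:
        print(name, classify_cds(seq))
```

Counting the records where has_start or has_stop is false over a transcript FASTA would give percentages comparable to the 7% and 2% quoted above.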
Thanks, Brian This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately. _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Thu Jun 5 12:08:20 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 05 Jun 2014 12:08:20 -0600 Subject: [maker-devel] protein2genome gene models from protein gff Message-ID: est_gff assumes the alignments are spliced correctly. The protein2genome option also makes that assumption but with a little less confidence that the user always provides splice aware alignments, so in some instances (like protein2genome=1) it may not pass them forward as guaranteed splice aware alignments. --Carson From: Sivaranjani Namasivayam Date: Thursday, June 5, 2014 at 9:49 AM To: "maker-devel at yandell-lab.org" Subject: [maker-devel] protein2genome gene models from protein gff Hi, I am trying to predict gene models from protein evidence, using the parameter protein2genome set to 1. I get gene models predicted if I provide the proteins as a fasta file, but not as gff (I want to use a gff format to avoid the blastx step again). Is this expected? In case of transcriptome evidence and est2genome set to 1, I get gene models predicted with both fasta and gff formats. 
Thanks,
Ranjani

From carsonhh at gmail.com Thu Jun 5 12:24:03 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 12:24:03 -0600
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To: <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

Like I said. The predictors do the best they can, so there is probably something about the regions that requires exons to be dropped or added to make the CDS, reading frame, or start/stop work. In several ant genomes we saw something like this caused by incorrect homopolymers in the assembly, which forced the predictor to slightly alter the intron/exon structure because otherwise the reading frame made no sense (the EST alignments were used to confirm that the assembly homopolymers were incorrect - lots of bad single base pair deletions).

The way hints work is as follows. At the simplest level, ab initio predictors are calculating the probability of being in different states (intergenic, intron, exon, etc.). The hints increase the probability of being in the intron state where MAKER gives an intron hint, or of being in an exon/CDS state where MAKER gives an exon/CDS hint. So this bends the likelihood of the ab initio gene predictor to call something similar in structure to the evidence overlapping it. That being said, if there is strong enough signal from something else in the sequence, or my hints won't work with the splice sites in the region, or the reading frame breaks, then no amount of hints can force Augustus to make that model.

--Carson

From carsonhh at gmail.com Thu Jun 5 12:28:55 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 12:28:55 -0600
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To:
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

One thing you might want to try is adding another predictor like SNAP together with Augustus and then processing the MAKER results using EVM. We actually have a collaboration with the EVM group to produce a MAKER-EVM pipeline (MAKER 3.0). EVM will produce consensus models using the predictions and the evidence in the MAKER GFF3, which are generally better than just SNAP and Augustus with hints, so it might be able to remove some of the artifacts you are worried about.

--Carson

From marc.hoeppner at bils.se Thu Jun 5 02:15:55 2014
From: marc.hoeppner at bils.se (Marc Höppner)
Date: Thu, 5 Jun 2014 10:15:55 +0200
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To:
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se>
Message-ID: <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>

Hi,

thanks for the feedback. I spent some more time on this and am still somewhat unsatisfied with the whole thing...

A few comments:

I quite frequently see Augustus, and by extension Maker, including exons that are not supported by EST/protein evidence and are not critical for the gene model (i.e. I can take them out and still get a proper CDS). Maybe I don't know enough about how Maker creates hints and, more importantly, what role these hints play for Augustus, but I cannot really see a great effect (any, really) on the gene models even if both ESTs and proteins contradict an Augustus gene model and the surplus exon is non-essential.

(all evidence is provided as fasta files, protein2genome and est2genome are set to 0)

As for the repeat library, I suppose this is a critical point.
I am using repeats from a closely related species via RepeatMasker, modelled and filtered repeats from RepeatModeler, and repeats derived from a high-coverage 454 data set. Not sure what else I can do to improve that.

As for evidence, I am using the curated reference proteome from a related species (<5 million years), uniprot_swissprot and high-quality ESTs + 454 reads. I don't think it gets a whole lot better, in terms of what data can be used.

So in summary, I just don't get where I want to using Augustus and Maker - specifically, the gene models are full of weird, unsupported artefacts despite manually curating > 850 models for training. I suppose I was hoping for some secret trick to improve on this - but I guess there is none? Actually, if I only do a pure evidence build (seeing that my input data is very high quality), it looks better - which sort of goes against the premise of Maker :/

Regards,

Marc

Marc P. Hoeppner, PhD
Team Leader
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at bils.se

On 27 May 2014, at 17:25, Carson Holt wrote:

> Extra exons can be required for predictors to make sense of a region (they
> do the best they can). This can be due to imperfect assemblies or
> repeats. For plants the repeat database is the one thing that will
> most affect the annotation quality. You may need to spend some time
> building the best repeat library you can. The repeat library is the
> most important thing after training the predictor, because repeats confuse
> the predictor (sometimes a lot), causing it to behave oddly in those
> regions (because repeats do encode real protein and protein fragments).
> Also, when running now with MAKER, make sure to include the entire proteome
> of a related species and not just UniProt, and you will get better
> performance. Now that you have Augustus trained, using it inside of MAKER
> with an improved repeat library and additional protein evidence should
> give it the feedback that will allow it to perform better than it would
> with just naked ab initio prediction.
>
> Thanks,
> Carson
>
>
> On 5/27/14, 2:12 AM, "Marc Höppner" wrote:
>
>> Hi,
>>
>> I wanted to get some feedback regarding the training of ab-initio gene
>> finders - it's not strictly Maker related, but I suppose there are many
>> people on this list that have encountered and solved this issue in one
>> way or another.
>>
>> Specifically, I am trying to train Augustus (and possibly SNAP) for a
>> plant genome. This has always been a very frustrating process for me, but
>> while I have a better idea now how to do it, I still don't get the sort
>> of accuracy that I am hoping for. A quick run-through of my process:
>>
>> Evidence build with maker on level 1 and 2 proteins from Uniprot +
>> Sanger-sequenced EST data
>>
>> Filtered for models with an AED <= 0.3
>>
>> Loaded that into WebApollo, together with an existing reference
>> annotation and the evidence tracks
>>
>> Manually curated/selected 750 gene models using the following rules:
>> - Must have start/stop codon
>> - Must have as many exons as possible
>> - Must agree with evidence
>> - Must be >= 2kb apart from other gene models (provided as flanking
>> regions for augustus to train intergenic sequence)
>>
>> From these models, I created a GBK file, split it into 650 (train) and
>> 100 (test) models and created a new profile using the documented
>> procedure.
>>
>> But:
>>
>> While the naked ab-init models created through maker get a lot of genes
>> "sort of right", I still see too many issues to be really satisfied.
>> Problems include:
>>
>> - random exon calls which are not supported by any line of evidence (~1
>> per gene model, I would guess)
>> - poor congruency with some gene models (especially ones not used for
>> training/testing)
>>
>> Is there any best-practice guide on how to improve this? The Augustus
>> website is unfortunately quite poor on detail... My impression so far is
>> that ramping up the number of training models isn't really doing too much
>> beyond a certain point (tried 400, 500 and 750).
>>
>> Regards,
>>
>> Marc
>>
>>
>> Marc P. Hoeppner, PhD
>> Team Leader
>> BILS Genome Annotation Platform
>> Department for Medical Biochemistry and Microbiology
>> Uppsala University, Sweden
>> marc.hoeppner at bils.se

From fbarreto at ucsd.edu Thu Jun 5 13:01:05 2014
From: fbarreto at ucsd.edu (Felipe Barreto)
Date: Thu, 5 Jun 2014 12:01:05 -0700
Subject: [maker-devel] Generating GFF with selected tracks
Message-ID:

Hi, all,

I would like to produce a gff file that contains Maker gene models AND repeats. I know that using gff3_merge with -g will generate one with only the gene models, but I didn't see any options for adding additional tracks.

The way I did this was to use the Re-annotation section in the control file. I provided the original full gff file in maker_gff, and turned on the rm_pass and model_pass. All other options in the control file were turned off. This seemed to work, though it also added a 'model_gff:maker' track, which is not a problem for me. I compared a few new and original scaffolds in Apollo, and all seem to match perfectly.
But since I cannot check the whole genome, I was wondering if what I did was appropriate. Are all the gene models (and their names) and repeat alignments identical between the new and original files? Or is Maker potentially changing a few things since it's treated as a new run?

Thanks!

--
Felipe Barreto

From carsonhh at gmail.com Thu Jun 5 13:02:36 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 13:02:36 -0600
Subject: [maker-devel] protein2genome gene models from protein gff
In-Reply-To: <1401994595132.44761@uga.edu>
References: <1401994595132.44761@uga.edu>
Message-ID:

That's what I'd do. But really protein2genome=1 is just meant to get enough rough gene models to train a gene predictor. You don't need to run it across the whole genome. But if you do, when you run again after training the gene predictor, MAKER will detect the old BLAST jobs and they won't have to be rerun on the second MAKER run.

--Carson

From: Sivaranjani Namasivayam
Date: Thursday, June 5, 2014 at 12:56 PM
To: Carson Holt
Subject: RE: [maker-devel] protein2genome gene models from protein gff

So what would you suggest is the best way to get protein2genome predictions? Use fasta sequences, instead of gff?

Thanks,
Ranjani

From carsonhh at gmail.com Thu Jun 5 13:05:30 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 05 Jun 2014 13:05:30 -0600
Subject: [maker-devel] Generating GFF with selected tracks
In-Reply-To:
References:
Message-ID:

gff3_merge just merges any two GFF3 files. So if you have two files, just give both of them to it.

Example --> gff3_merge maker_genes.gff repeats.gff

Also, if all you are trying to do is filter out certain feature types from the file, just use grep instead.

Example --> grep -v -P "\tpred_gff\t" maker.gff

Thanks,
Carson

From dence at genetics.utah.edu Thu Jun 5 13:08:08 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Thu, 5 Jun 2014 19:08:08 +0000
Subject: [maker-devel] Generating GFF with selected tracks
In-Reply-To:
References:
Message-ID:

Hi Felipe,

I seem to remember that some of the gene model names did change when I did things similar to what you described. I think that you could accomplish the same thing with some cat and grep commands on the full gff. That would avoid the trouble of rerunning maker. Something like this would get you started:

cat full.gff | grep -P "\trepeatrunner\t" > tmp.gff

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From fbarreto at ucsd.edu Thu Jun 5 14:07:51 2014
From: fbarreto at ucsd.edu (Felipe Barreto)
Date: Thu, 5 Jun 2014 13:07:51 -0700
Subject: [maker-devel] Generating GFF with selected tracks
In-Reply-To:
References:
Message-ID:

OK, I see. I will just use grep to extract the desired features from the full.gff and merge them with gff3_merge. Don't know why I was making it more complicated. I guess I don't understand gff formats very well quite yet.

Thanks yet again!

--
Felipe Barreto
Post-doctoral Scholar
Scripps Institution of Oceanography
University of California, San Diego
La Jolla, CA 92093
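The grep-and-merge approach this thread settles on can be sketched roughly as follows. This is only an illustration under stated assumptions, not MAKER's own tooling: the sample records are made up, and the source names tested in column 2 (maker, repeatmasker, repeatrunner) are assumptions about what a given MAKER GFF3 contains, so check them against your own file first. awk stands in for grep -P here so that the match is tied to the source column rather than to any tab-delimited field.

```shell
# Build a tiny stand-in for the full MAKER GFF3 (hypothetical records).
printf '##gff-version 3\n' > full.gff
printf 'scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n' >> full.gff
printf 'scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n' >> full.gff
printf 'scaf1\trepeatmasker\tmatch\t1200\t1500\t.\t+\t.\tID=rep1\n' >> full.gff
printf 'scaf1\trepeatrunner\tprotein_match\t2000\t2400\t.\t-\t.\tID=rep2\n' >> full.gff
printf 'scaf1\tblastx\tprotein_match\t100\t400\t.\t+\t.\tID=hit1\n' >> full.gff
printf 'scaf1\tsnap_masked\tmatch\t100\t900\t.\t+\t.\tID=pred1\n' >> full.gff

# Keep the version header plus every feature whose source (column 2)
# is one of the wanted tracks. The source names are assumptions.
awk -F '\t' '
    /^##gff-version/ { print; next }
    $2 == "maker" || $2 == "repeatmasker" || $2 == "repeatrunner" { print }
' full.gff > genes_and_repeats.gff
```

Because nothing is re-run, the gene model names and coordinates in genes_and_repeats.gff are exactly those of the original file. If the gene models and repeats are extracted into separate files instead, they can be combined with gff3_merge as in Carson's example.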
URL:
From daniel.standage at gmail.com Fri Jun 6 10:33:06 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 12:33:06 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Another question: is there documentation anywhere for the naming conventions of the genes annotated by Maker? Of course it's easy to spot genes based on a particular *ab initio* gene predictor, as the names are in the IDs. But what is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs "maker-$seqid-snap-gene"?

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage wrote:

> I have attached data for a small 18kb region with a handful of genes, as
> well as the corresponding maker_opts.ctl file. (This is a smaller and
> different data set than what I was looking at yesterday, with a more
> well-defined problem).
>
> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400
> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a
> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes
> have transcript support: will Maker report overlapping genes under any
> conditions? And even if Maker is forced to choose only a single gene to
> report, why would the model from 4125 to 6400 ever be reported in place of
> the one from 6111 to 8345, especially since this is provided in the
> model_gff file?
>
> Even when transcript TSA024184 is included, Maker 2.10 reports the
> high-confidence gene from 6111 to 8345.
>
> Any light you could shed would be helpful. Thanks!
>
>
> --
> Daniel S. Standage
> Ph.D. Candidate
> Computational Genome Science Laboratory
> Indiana University
>
>
> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote:
>
>> Just eAED, but eAED can affect selection of ab initio results. For
>> example reading frame match of protein evidence which also affects whether
>> evidence from single_exon=1 and genes with single_exon protein evidence get
>> kept. There is also the assumption that your alignments in GFF3 are
>> correctly spliced (like BLAT does). So giving blastn results as
>> precomputed est_gff would create a lot of noise, since maker ignores blastn
>> and is using it only to seed the polished exonerate alignments.
>>
>> --Carson
>>
>>
>> From: Daniel Standage
>> Date: Wednesday, June 4, 2014 at 1:11 PM
>> To: Carson Holt
>> Cc: Maker Mailing List
>> Subject: Re: [maker-devel] Filtering of ab initio gene models
>>
>> I do not provide Gap or Target attributes in the GFF3. Will this affect
>> the AED as well, or just the eAED?
>>
>>
>> --
>> Daniel S. Standage
>> Ph.D. Candidate
>> Computational Genome Science Laboratory
>> Indiana University
>>
>>
>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote:
>>
>>> Sure. That would be helpful. One question. Do you provide the Gap
>>> attribute in your precomputed alignments? Having or not having that
>>> attribute affects the eAED score which takes reading frame into account,
>>> and may cause some things to be kept that normally would be dropped,
>>> because MAKER won't be able to take the points of mismatch of the alignment
>>> into account (it just assumes match everywhere).
>>>
>>> --Carson
>>>
>>>
>>> From: Daniel Standage
>>> Date: Wednesday, June 4, 2014 at 1:03 PM
>>> To: Maker Mailing List
>>> Subject: [maker-devel] Filtering of ab initio gene models
>>>
>>> Thanks everyone for your responses recently!
>>>
>>> The reason for my recent flurry of email activity is that I'm seeing
>>> some unexpected trends when running the new version of Maker with
>>> precomputed alignments. Compared with an annotation I did a while ago
>>> (Maker 2.10, Maker-computed alignments), this new annotation has a
>>> substantial number of new genes annotated. If I compare distributions of
>>> AED scores between the old and new annotation, it's clear that the new
>>> annotation has a lot more low-quality models. If I look at new gene models
>>> that do not overlap with any gene model from the old annotation, the
>>> likelihood that it's a low-quality model is much higher.
>>>
>>> I decided to run a little experiment. I annotated a scaffold first using
>>> Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same
>>> pre-computed transcript and protein alignments and the same (latest)
>>> version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted
>>> 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci
>>> by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1
>>> locus with only models from 2.10, and 28 loci with only models from 2.31.3.
>>>
>>> Before this experiment, I assumed the issue was related to providing
>>> pre-computed alignments in GFF3 format and perhaps violating some important
>>> assumption. However, this experiment makes me wonder whether there have
>>> been changes to how Maker filters *ab initio* gene models between
>>> version 2.10 and version 2.31.3? Do you have any ideas? If it would help, I
>>> could put together a small data set that reproduces the behavior I just
>>> described.
>>>
>>> Thanks!
>>>
>>> --
>>> Daniel S. Standage
>>> Ph.D. Candidate
>>> Computational Genome Science Laboratory
>>> Indiana University
>>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Fri Jun 6 10:39:43 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 06 Jun 2014 10:39:43 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked sequence without hints (i.e. the ab initio call). maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER.

In both cases MAKER is allowed to add UTR to the model (hence the 'processed' tag).

--Carson

From daniel.standage at gmail.com Fri Jun 6 10:46:41 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 12:46:41 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Good to know, thanks. If multiple *ab initio* predictors inform a single annotation, how does Maker decide which one will be included in the gene's ID?

Given your quick response just now, I wanted to confirm that you got the message and data set I sent yesterday. I received an email saying the size of my message required list admin approval to be distributed, but since you were also a direct recipient of the email I didn't worry about it too much.

Thanks again!

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

From carsonhh at gmail.com Fri Jun 6 10:56:38 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 06 Jun 2014 10:56:38 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

I got the e-mail. Thanks for the test set.

Multiple ab initio predictors don't inform a single annotation, rather one must be chosen from the pool of available models (I.e. it has to be SNAP or Augustus, or GeneMark). They all supply their own ab initio as well as hint based prediction, and then the one with best evidence match (measured by AED) is kept (it's like a competition that only one predictor can win).
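The "competition" described above can be pictured as taking the minimum AED over a pool of candidate models at a locus, since a lower AED means closer agreement with the evidence. The Python sketch below illustrates only that idea and is not MAKER's actual implementation. The `CandidateModel` class, the `pick_best_model` function, the candidate names and the 0.15 value are hypothetical; the 0.23 and 0.01 values echo the AED scores mentioned earlier in this thread. Real selection may involve further criteria (eAED, tie handling, filtering) that are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class CandidateModel:
    """One candidate gene model at a locus (hypothetical structure)."""
    name: str
    predictor: str
    aed: float  # 0.0 = full agreement with evidence, 1.0 = no support


def pick_best_model(pool):
    """Return the candidate with the lowest AED; the first one wins on ties."""
    if not pool:
        raise ValueError("empty candidate pool")
    return min(pool, key=lambda model: model.aed)


# Hypothetical pool: ab initio and hint-based calls competing at one locus.
pool = [
    CandidateModel("snap_masked-scaffold_1-processed-gene-0.1", "snap", 0.23),
    CandidateModel("maker-scaffold_1-snap-gene-0.2", "snap", 0.01),
    CandidateModel("maker-scaffold_1-augustus-gene-0.3", "augustus", 0.15),
]

winner = pick_best_model(pool)
```

Under these assumptions only one candidate survives per pool, which matches the point that the predictors are not blended into a consensus.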
If you want a consensus model instead, you can take MAKER results in GFF3 format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is a collaboration with the EVM group and will have this option, but for now users can just split the MAKER GFF3 by evidence types and give it to EVM. EVM then produces consensus models based on the GFF3 content. --Carson From: Daniel Standage Date: Friday, June 6, 2014 at 10:46 AM To: Carson Holt Cc: Maker Mailing List , Volker Brendel Subject: Re: [maker-devel] Filtering of ab initio gene models Good to know, thanks. If multiple ab initio predictors inform a single annotation, how does Maker decide which one will be included in the gene's ID? Given your quick response just now, I wanted to confirm that you got the message and data set I sent yesterday. I received an email saying the size of my message required list admin approval to be distributed, but since you were also a direct recipient of the email I didn't worry about it too much. Thanks again! -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote: > snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked > sequence without hints (i.e. the ab initio call). > maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER. > > In both cases MAKER is allowed to add UTR to the model (hence the 'processed' > tag). > > --Carson > > > From: Daniel Standage > Date: Friday, June 6, 2014 at 10:33 AM > To: Carson Holt > Cc: Maker Mailing List , Volker Brendel > > > Subject: Re: [maker-devel] Filtering of ab initio gene models > > Another question: is there documentation anywhere for the naming conventions > of the genes annotated by Maker? Of course it's easy to spot genes based on a > particular ab initio gene predictor, as the names are in the IDs. 
But what is > the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs > "maker-$seqid-snap-gene"? > > Thanks, > Daniel > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage > wrote: >> I have attached data for a small 18kb region with a handful of genes, as well >> as the corresponding maker_opts.ctl file. (This is a smaller and different >> data set than what I was looking at yesterday, with a more well-defined >> problem). >> >> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400 >> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a >> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes >> have transcript support: will Maker report overlapping genes under any >> conditions? And even if Maker is forced to choose only a single gene to >> report, why would the model from 4125 to 6400 ever be reported in place of >> the one from 6111 to 8345, especially since this is provided in the model_gff >> file? >> >> Even when transcript TSA024184 is included, Maker 2.10 reports the >> high-confidence gene from 611 to 8345. >> >> Any light you could shed would be helpful. Thanks! >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote: >>> Just eAED, but eAED can affects selection of ab initio results. For example >>> reading frame match of protein evidence which also affects whether evidence >>> from single_exon=1 and genes with single_exon protein evidence get kept. >>> There is also the assumption that your alignments in GFF3 are are correctly >>> spliced (like BLAT does). So giving blastn results as precomputed est_gff >>> would create a lot of noise, since maker ignores blastn and is using it only >>> to seed the polished exonerate alignments. 
>>> >>> --Carson >>> >>> >>> From: Daniel Standage >>> Date: Wednesday, June 4, 2014 at 1:11 PM >>> To: Carson Holt >>> Cc: Maker Mailing List >>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>> >>> I do not provide Gap or Target attributes in the GFF3. Will this affect the >>> AED as well, or just the eAED? >>> >>> >>> -- >>> Daniel S. Standage >>> Ph.D. Candidate >>> Computational Genome Science Laboratory >>> Indiana University >>> >>> >>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: >>>> Sure. that would be helpful. One question. Do you provide the Gap >>>> attribute in your precomputed alignments? Having or not having that >>>> attribute affects the eAED score which takes reading frame into account, >>>> and may cause some things to be kept that normally would be dropped, >>>> because MAKER won't be able to take the points of mismatch of the alignment >>>> into account (it just assumes match everywhere). >>>> >>>> --Carson >>>> >>>> >>>> From: Daniel Standage >>>> Date: Wednesday, June 4, 2014 at 1:03 PM >>>> To: Maker Mailing List >>>> Subject: [maker-devel] Filtering of ab initio gene models >>>> >>>> Thanks everyone for your responses recently! >>>> >>>> The reason for my recent flurry of email activity is that I'm seeing some >>>> unexpected trends when running the new version of Maker with precomputed >>>> alignments. Compared with an annotation I did a while ago (Maker 2.10, >>>> Maker-computed alignments), this new annotation has a substantial number of >>>> new genes annotated. If I compare distributions of AED scores between the >>>> old and new annotation, it's clear that the new annotation has a lot more >>>> low-quality models. If I look at new gene models that do not overlap with >>>> any gene model from the old annotation, the likelihood that it's a >>>> low-quality model is much higher. >>>> >>>> I decided to run a little experiment. I annotated a scaffold first using >>>> Maker 2.10 and then using Maker 2.31.3. 
I both cases, I used the same >>>> pre-computed transcript and protein alignments and the same (latest) >>>> version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44 >>>> genes while Maker 2.31.3 predicted 63. If we group gene models into loci by >>>> overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 >>>> locus with only models from 2.10, and 28 loci with only models from 2.31.3. >>>> >>>> Before this experiment, I assumed the issue was related to providing >>>> pre-computed alignments in GFF3 format and perhaps violating some important >>>> assumption. However, this experiment makes me wonder whether there have >>>> been changes to how Maker filters ab initio gene models between version >>>> 2.10 and version 2.31.3? Do you have any ideas? If it would help, I could >>>> put together a small data set that reproduces the behavior I just >>>> described. >>>> >>>> Thanks! >>>> >>>> -- >>>> Daniel S. Standage >>>> Ph.D. Candidate >>>> Computational Genome Science Laboratory >>>> Indiana University >>>> _______________________________________________ maker-devel mailing list >>>> maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo/ >>>> maker-devel_yandell-lab.org >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.standage at gmail.com Fri Jun 6 10:59:16 2014 From: daniel.standage at gmail.com (Daniel Standage) Date: Fri, 6 Jun 2014 12:59:16 -0400 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: This helps, thanks. -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Fri, Jun 6, 2014 at 12:56 PM, Carson Holt wrote: > I got the e-mail. Thanks for the test set. > > Multiple *ab initio* predictors don't inform a single annotation, rather > one must be chosen from the pool of available models (I.e. it has to be > SNAP or Augustus, or GeneMark). 
They all supply their own *ab initio* as > well as hint based prediction, and then the one with best evidence match > (measured by AED) is kept (it's like a competition that only one predictor > can win). > > If you want a consensus model instead, you can take MAKER results in GFF3 > format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is > a collaboration with the EVM group and will have this option, but for now > users can just split the MAKER GFF3 by evidence types and give it to EVM. > EVM then produces consensus models based on the GFF3 content. > > --Carson > > From: Daniel Standage > Date: Friday, June 6, 2014 at 10:46 AM > > To: Carson Holt > Cc: Maker Mailing List , Volker Brendel < > vbrendel at indiana.edu> > Subject: Re: [maker-devel] Filtering of ab initio gene models > > Good to know, thanks. If multiple *ab initio* predictors inform a single > annotation, how does Maker decide which one will be included in the gene's > ID? > > Given your quick response just now, I wanted to confirm that you got the > message and data set I sent yesterday. I received an email saying the size > of my message required list admin approval to be distributed, but since you > were also a direct recipient of the email I didn't worry about it too much. > > Thanks again! > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote: > >> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat >> masked sequence without hints (i.e. the ab initio call). >> maker-$seqid-snap-gene was produced by SNAP after receiving hints from >> MAKER. >> >> In both cases MAKER is allowed to add UTR to the model (hence the >> 'processed' tag). 
From daniel.standage at gmail.com Fri Jun 6 12:38:23 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 14:38:23 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Carson et al,

Your feedback so far has been very helpful, and we are grateful for the time you have taken to respond!

We're still trying to understand the precise procedure by which competing models are chosen. You mentioned that a single model must be chosen (via AED) from a pool of available models: are these pools constructed by overlap? It is not uncommon in our experience to see overlapping genes reported by Maker, although for the most part it appears these overlapping genes don't have CDS overlap. Looking more closely at the Maker 2.10 output from the test data we sent yesterday, we also noted that exclusion of the transcript in question had an effect on the interval exon structure (exon 7717-7776 becomes exon 7737-7776) of a downstream model with which it overlaps 3 nucleotides.

And still unclear to us is how the model_gff data fits in with all this. From my previous searching of the list archives I was under the impression that these models would be given substantial weight in the prediction process, and would only be altered if a considerably better model could be identified. Our experience with this small data set, though, is that which overlapping gene is reported, and which corresponding exon structure is selected, is dependent on very slight changes in the evidence.

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University
From carsonhh at gmail.com Fri Jun 6 12:51:18 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 06 Jun 2014 12:51:18 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

There can be overlapping models if you have multiple gene predictors. Also, the hint-based models will overlap the ab initio models, but you never get to see them (they are not kept in the evidence because they are confusing and really not useful unless they are chosen as the best model). So they will overlap the ab initio models, but you may never get to see them.

All models, regardless of location and overlap, get sorted by their AED score. The best model is then kept from the list. Then the next, then the next. If the next best model overlaps a model that has already come off the list (which means the other model has a better AED score), then it gets skipped, and the next best one in the list gets added to the non-overlapping space.
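The selection procedure described above can be sketched roughly as follows. This is only an illustration, not MAKER code: the `Model` class and the `overlaps` and `select_models` functions are hypothetical names, the coordinates and AED values are made up, and the overlap test is assumed to be plain coordinate overlap on a single sequence (MAKER's real test may also take strand or CDS boundaries into account).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A candidate gene model (hypothetical structure, for illustration only)."""
    name: str
    start: int
    end: int
    aed: float  # lower AED means better agreement with the evidence


def overlaps(a: Model, b: Model) -> bool:
    # Assumption: two models overlap if their coordinate intervals intersect.
    return a.start <= b.end and b.start <= a.end


def select_models(models):
    """Greedy selection: best AED first, skipping anything that overlaps a kept model."""
    kept = []
    for candidate in sorted(models, key=lambda m: m.aed):
        if not any(overlaps(candidate, k) for k in kept):
            kept.append(candidate)
    # Report the survivors in genomic order.
    return sorted(kept, key=lambda m: m.start)


if __name__ == "__main__":
    candidates = [
        Model("A", 100, 500, 0.10),
        Model("B", 400, 900, 0.05),
        Model("C", 1000, 1500, 0.60),
    ]
    for m in select_models(candidates):
        print(m.name, m.start, m.end, m.aed)
```

In this toy input, B comes off the sorted list first, A is skipped because it overlaps B, and C is kept because it falls in otherwise unoccupied space.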
The result is that the final models will be non-redundant and non-overlapping, but if you look at the evidence alignments you will find ab initio models different from the MAKER models that were rejected and do not overlap the final models.

model_gff competes just like any other model with AED. Ties always go to model_gff, and if there is a region where no model gets chosen (they all have AED of 1) and a model_gff entry will fit (even with an AED score of 1), then it will be chosen, because model_gff entries do not need evidence support to end up in the final annotations.

--Carson

From daniel.standage at gmail.com Fri Jun 6 17:58:26 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 19:58:26 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To: <53923808.7030401@indiana.edu>
References: <53923808.7030401@indiana.edu>
Message-ID:

In the example sent previously, transcript TSA024184 overlaps with the 3' end of our gene model's CDS by 3 nucleotides. If I manually change the transcript's end coordinate (6400 to 6100) so that there are two separate non-overlapping evidence clusters, two models are reported as expected. But I can even get both models reported with a much smaller change (6400 to 6395), where the UTRs still overlap but the CDS does not overlap with the UTR.

The 5' end of our gene model's CDS also overlaps with another transcript. Maker has no problem reporting both of these gene models though, probably since they're on different strands?
So correct me if I'm wrong, but it appears that Maker will report overlapping gene models if they are on opposite strands or if no CDS is involved in the overlap. Is there any way this behavior can be configured?

On another note, we're considering your suggestion to integrate EVM with Maker. One possibility discussed is to run Maker 4 separate times (once for each of Augustus, GeneMark, SNAP, and our model_gff models), each time with all our transcript/protein evidence, prior to consensus modeling with EVM. Would that provide any benefit over running Maker a single time with all prediction sources simultaneously?

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

From vbrendel at indiana.edu Fri Jun 6 15:52:08 2014
From: vbrendel at indiana.edu (Volker Brendel)
Date: Fri, 06 Jun 2014 16:52:08 -0500
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID: <53923808.7030401@indiana.edu>

Hi Carson,

is there a way of allowing MAKER to add UTRs to our external models (supplied by the pred_gff or model_gff tag)? This seems to be one problem we are running into. Our external models are high quality, but CDS only. Thus their score gets knocked down relative to ab initio predictions with added UTRs.

Daniel will have more questions/observations later with regard to overlapping gene models (we definitely need to allow gene models to overlap in the UTRs, because transcript evidence clearly shows such negative intergenic spaces).

Thanks for all your help!
Volker

On 6/6/2014 11:39 AM, Carson Holt wrote:
> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat
> masked sequence without hints (i.e. the ab initio call).
> maker-$seqid-snap-gene was produced by SNAP after receiving hints from
> MAKER.
>
> In both cases MAKER is allowed to add UTR to the model (hence the
> 'processed' tag).
> > --Carson > > > From: Daniel Standage > > Date: Friday, June 6, 2014 at 10:33 AM > To: Carson Holt > > Cc: Maker Mailing List >, Volker Brendel > > > Subject: Re: [maker-devel] Filtering of ab initio gene models > > Another question: is there documentation anywhere for the naming > conventions of the genes annotated by Maker? Of course it's easy to > spot genes based on a particular /ab initio/ gene predictor, as the > names are in the IDs. But what is the significance of, say, > "snap_masked-$seqid-processed-gene" in a gene ID vs > "maker-$seqid-snap-gene"? > > Thanks, > Daniel > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage > > wrote: > > I have attached data for a small 18kb region with a handful of > genes, as well as the corresponding maker_opts.ctl file. (This is > a smaller and different data set than what I was looking at > yesterday, with a more well-defined problem). > > With the data files as is, Maker 2.31.3 reports a model from 4125 > to 6400 with an AED of 0.23. If you exclude transcript TSA024184, > Maker reports a different gene from 6111 to 8345 with an AED of > 0.01. Both of these genes have transcript support: will Maker > report overlapping genes under any conditions? And even if Maker > is forced to choose only a single gene to report, why would the > model from 4125 to 6400 ever be reported in place of the one from > 6111 to 8345, especially since this is provided in the model_gff file? > > Even when transcript TSA024184 is included, Maker 2.10 reports the > high-confidence gene from 611 to 8345. > > Any light you could shed would be helpful. Thanks! > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt > wrote: > > Just eAED, but eAED can affects selection of ab initio > results. 
For example reading frame match of protein evidence > which also affects whether evidence from single_exon=1 and > genes with single_exon protein evidence get kept. There is > also the assumption that your alignments in GFF3 are are > correctly spliced (like BLAT does). So giving blastn results > as precomputed est_gff would create a lot of noise, since > maker ignores blastn and is using it only to seed the polished > exonerate alignments. > > --Carson > > > From: Daniel Standage > > Date: Wednesday, June 4, 2014 at 1:11 PM > To: Carson Holt > > Cc: Maker Mailing List > > Subject: Re: [maker-devel] Filtering of ab initio gene models > > I do not provide Gap or Target attributes in the GFF3. Will > this affect the AED as well, or just the eAED? > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt > > wrote: > > Sure. that would be helpful. One question. Do you > provide the Gap attribute in your precomputed alignments? > Having or not having that attribute affects the eAED > score which takes reading frame into account, and may > cause some things to be kept that normally would be > dropped, because MAKER won't be able to take the points of > mismatch of the alignment into account (it just assumes > match everywhere). > > --Carson > > > From: Daniel Standage > > Date: Wednesday, June 4, 2014 at 1:03 PM > To: Maker Mailing List > > Subject: [maker-devel] Filtering of ab initio gene models > > Thanks everyone for your responses recently! > > The reason for my recent flurry of email activity is that > I'm seeing some unexpected trends when running the new > version of Maker with precomputed alignments. Compared > with an annotation I did a while ago (Maker 2.10, > Maker-computed alignments), this new annotation has a > substantial number of new genes annotated. 
If I compare > distributions of AED scores between the old and new > annotation, it's clear that the new annotation has a lot > more low-quality models. If I look at new gene models that > do not overlap with any gene model from the old > annotation, the likelihood that it's a low-quality model > is much higher. > > I decided to run a little experiment. I annotated a > scaffold first using Maker 2.10 and then using Maker > 2.31.3. I both cases, I used the same pre-computed > transcript and protein alignments and the same (latest) > version of SNAP as the only /ab initio/ predictor. Maker > 2.10 predicted 44 genes while Maker 2.31.3 predicted 63. > If we group gene models into loci by overlap, there are 33 > loci with gene models from both 2.10 and 2.31.3, 1 locus > with only models from 2.10, and 28 loci with only models > from 2.31.3. > > Before this experiment, I assumed the issue was related to > providing pre-computed alignments in GFF3 format and > perhaps violating some important assumption. However, this > experiment makes me wonder whether there have been changes > to how Maker filters /ab initio/ gene models between > version 2.10 and version 2.31.3? Do you have any ideas? If > it would help, I could put together a small data set that > reproduces the behavior I just described. > > Thanks! > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > _______________________________________________ > maker-devel mailing list maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > > -- Volker Brendel Professor of Biology and Computer Science Indiana University Department of Biology & School of Informatics and Computing Simon Hall 205C 212 South Hawthorne Drive Bloomington, IN 47405-7003 Tel.: (812) 855-7074 http://brendelgroup.org/ -------------- next part -------------- An HTML attachment was scrubbed... 
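
The locus grouping in Daniel's quoted experiment above (gene models from the 2.10 and 2.31.3 runs merged into loci by overlap) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout, the run labels, and the function names are hypothetical, strand is ignored, all models are taken to lie on one scaffold, and the coordinates are made up rather than taken from the thread's data.

```python
from typing import Dict, Iterable, List, Tuple

# (run label, start, end) with 1-based inclusive coordinates; layout is an assumption.
Model = Tuple[str, int, int]


def group_into_loci(models: Iterable[Model]) -> List[List[Model]]:
    """Merge models whose coordinates overlap into loci (single linkage)."""
    loci: List[List[Model]] = []
    locus_end = None
    for model in sorted(models, key=lambda m: (m[1], m[2])):
        _, start, end = model
        if locus_end is not None and start <= locus_end:
            # Overlaps the current locus: extend it.
            loci[-1].append(model)
            locus_end = max(locus_end, end)
        else:
            # No overlap: start a new locus.
            loci.append([model])
            locus_end = end
    return loci


def summarize(loci: List[List[Model]], old: str = "2.10", new: str = "2.31.3") -> Dict[str, int]:
    """Count loci with models from both runs, or from only one of them."""
    counts = {"both": 0, old: 0, new: 0}
    for locus in loci:
        labels = {label for label, _, _ in locus}
        if old in labels and new in labels:
            counts["both"] += 1
        elif old in labels:
            counts[old] += 1
        else:
            counts[new] += 1
    return counts


# Made-up coordinates, not data from the thread.
models = [
    ("2.10", 100, 500),
    ("2.31.3", 450, 900),
    ("2.31.3", 2000, 2500),
    ("2.10", 3000, 3500),
]
loci = group_into_loci(models)
print(len(loci), summarize(loci))
```

Grouping by strand before calling `group_into_loci` would be the natural refinement if models on opposite strands should not share a locus.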
From carsonhh at gmail.com Sat Jun 7 14:03:18 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:03:18 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References: <53923808.7030401@indiana.edu>
Message-ID:

The problem in the example you sent is the geneseqer entries in the GFF3
you are passing in. They are causing gene clusters to merge. The result is
that UTR is being over-extended and is overlapping on the models (and
probably some models get merged). As you noticed, you can't have overlapping
models on the same strand. If you set score_preds=1 in the maker_opts.ctl
file, it will give you AED scores for the rejected ab initio models. You
will notice that none of them score better than 0.23.

One thing you can do is set correct_est_fusion=1. This tries to correct for
erroneous EST/transcript evidence that leads to over-extended UTR and false
gene merging. You will see in the attached image that it trims back the
overlapping 3' and 5' UTR for the overlapping gene models, given that MAKER
believes the evidence leading to the overlap is likely low confidence and is
a false merge of regions.

I think much of your geneseqer input is more of a problem than a help for
the annotation. Many seem to be spurious alignments.

--Carson

From: Daniel Standage
Date: Friday, June 6, 2014 at 5:58 PM
To: Volker Brendel
Cc: Carson Holt, Maker Mailing List
Subject: Re: [maker-devel] Filtering of ab initio gene models

In the example sent previously, transcript TSA024184 overlaps with the 3'
end of our gene model's CDS by 3 nucleotides. If I manually change the
transcript's end coordinate (6400 to 6100) so that there are two separate
non-overlapping evidence clusters, two models are reported as expected. But
I can even get both models reported with a much smaller change (6400 to
6395), where the UTRs still overlap but the CDS does not overlap with the
UTR.

The 5' end of our gene model's CDS also overlaps with another transcript.
Maker has no problem reporting both of these gene models though, probably
since they're on different strands? So correct me if I'm wrong, but it
appears that Maker will report overlapping gene models if they are on
opposite strands or if no CDS is involved in the overlap. Is there any way
this behavior can be configured?

On another note, we're considering your suggestion to integrate EVM with
Maker. One possibility discussed is to run Maker 4 separate times (once for
each of Augustus, GeneMark, SNAP, and our model_gff models), each time with
all our transcript/protein evidence, prior to consensus modeling with EVM.
Would that provide any benefit over running Maker a single time with all
prediction sources simultaneously?

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6246A771-E6A9-4875-9362-DC8A7A5BC9C4.png
Type: image/png
Size: 48365 bytes
Desc: not available
URL:

From carsonhh at gmail.com Sat Jun 7 14:07:41 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:07:41 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References: <53923808.7030401@indiana.edu>
Message-ID:

Example (attached) of geneseqer GFF3 input causing problems. Notice that
all the geneseqer features are almost exact representations of the
transposon; they are essentially reintroducing all the noise that repeat
masking tried to remove (they are giving hints to the gene predictor to try
and call the transposon as a gene).
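
One way to spot precomputed alignments like these before handing them to MAKER is to measure how much of each feature is covered by annotated repeats. The sketch below is a hypothetical pre-filter, not something MAKER does itself; the function names, the interval layout (1-based inclusive start/end pairs on one sequence), and the 0.8 cutoff are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def merge_intervals(intervals: Sequence[Interval]) -> List[Interval]:
    """Collapse overlapping or adjacent intervals so no base is counted twice."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def repeat_fraction(feature: Interval, repeats: Sequence[Interval]) -> float:
    """Fraction of the feature's bases that fall inside any repeat interval."""
    start, end = feature
    covered = 0
    for r_start, r_end in merge_intervals(repeats):
        overlap = min(end, r_end) - max(start, r_start) + 1
        if overlap > 0:
            covered += overlap
    return covered / (end - start + 1)


# Made-up coordinates: an alignment that mostly sits on a masked repeat.
alignment = (100, 199)
repeats = [(90, 149), (140, 179)]
fraction = repeat_fraction(alignment, repeats)
REPEAT_CUTOFF = 0.8  # arbitrary threshold chosen for this sketch
if fraction >= REPEAT_CUTOFF:
    print(f"flag alignment {alignment}: {fraction:.0%} of it is repeat")
```

Flagged features could then be inspected in a viewer or dropped from the evidence GFF3, depending on how much one trusts the repeat annotation.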

--Carson

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 48C1E0B9-001D-44C9-8D8E-37A52E4A17E8.png
Type: image/png
Size: 6592 bytes
Desc: not available
URL:

From carsonhh at gmail.com Sat Jun 7 14:11:43 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:11:43 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To: <53923808.7030401@indiana.edu>
References: <53923808.7030401@indiana.edu>
Message-ID:

If you give input as pred_gff, set keep_preds=1, and then give MAKER EST
evidence to work with, MAKER will just pass through the pred_gff data you
gave it with UTR added. Set correct_est_fusion=1 if your input contains
false merges across regions (common in mRNA-seq results).
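
For anyone scripting these changes, here is a minimal Python sketch that rewrites `key=value` lines in control-file text while keeping trailing `#` comments. The option names come from this thread; the helper name `set_ctl_options`, the excerpt, its comment text, and the file name `external_models.gff3` are made up for illustration and are not the real default maker_opts.ctl. Keys that are not found are reported rather than appended, since where MAKER expects them is not something this sketch can know.

```python
import re
from typing import Dict, List, Tuple

CTL_LINE = re.compile(r"^(\w+)=([^#]*)(#.*)?$")


def set_ctl_options(ctl_text: str, updates: Dict[str, object]) -> Tuple[str, List[str]]:
    """Return (patched text, keys not found) for a control file given as text."""
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        match = CTL_LINE.match(line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            new_line = f"{key}={remaining.pop(key)}"
            if comment:
                new_line += " " + comment  # keep the original trailing comment
            out.append(new_line)
        else:
            out.append(line)  # untouched: comments, blank lines, other options
    return "\n".join(out) + "\n", list(remaining)


# Illustrative excerpt only; the comment text here is invented.
excerpt = (
    "#-----Gene Prediction\n"
    "pred_gff= #(comment text)\n"
    "keep_preds=0 #(comment text)\n"
    "correct_est_fusion=0 #(comment text)\n"
)
patched, missing = set_ctl_options(
    excerpt,
    {
        "pred_gff": "external_models.gff3",  # hypothetical file name
        "keep_preds": 1,
        "correct_est_fusion": 1,
        "score_preds": 1,  # named in this thread; absent from the excerpt above
    },
)
print(patched)
print("not found:", missing)
```

Of the settings patched here, `correct_est_fusion=1` is the one aimed at false merges.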
This will trim overlapping UTR caused by the improperly merged EST evidence. --Carson From: Volker Brendel Date: Friday, June 6, 2014 at 3:52 PM To: Carson Holt , Daniel Standage Cc: Maker Mailing List Subject: Re: [maker-devel] Filtering of ab initio gene models Hi Carson, is there a way of allowing MAKER to add UTRs to our external models (supplied by the pred_gff or model_gff tag)? This seems to be one problem we are running into. Our external models are high quality, but CDS only. Thus their score gets knocked down relative to ab initio predictions with added UTRs. Daniel will have more questions/observations later with regard to overlapping gene models (we definitely need to allow gene models to overlap in the UTRs, because transcript evidence clearly shows such negative intergenic spaces). Thanks for all your help! Volker On 6/6/2014 11:39 AM, Carson Holt wrote: > > snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked > sequence without hints (i.e. the ab initio call). > > maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER. > > > > > In both cases MAKER is allowed to add UTR to the model (hence the 'processed' > tag). > > > > > --Carson > > > > > > > > From: Daniel Standage > Date: Friday, June 6, 2014 at 10:33 AM > To: Carson Holt > Cc: Maker Mailing List , Volker Brendel > > Subject: Re: [maker-devel] Filtering of ab initio gene models > > > > > > > > Another question: is there documentation anywhere for the naming conventions > of the genes annotated by Maker? Of course it's easy to spot genes based on a > particular ab initio gene predictor, as the names are in the IDs. But what is > the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs > "maker-$seqid-snap-gene"? > > > Thanks, > > Daniel > > > > > > > -- > Daniel S. Standage > Ph.D. 
Candidate
>>>> Computational Genome Science Laboratory
>>>> Indiana University

--
Volker Brendel
Professor of Biology and Computer Science
Indiana University
Department of Biology & School of Informatics and Computing
Simon Hall 205C
212 South Hawthorne Drive
Bloomington, IN 47405-7003
Tel.: (812) 855-7074
http://brendelgroup.org/

From carsonhh at gmail.com Sat Jun 7 14:16:29 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:16:29 -0600
Subject: [maker-devel] Filtering of ab initio gene models

Also, MAKER 2.10 has a number of bugs in how UTR is generated and how hints are generated for the ab initio predictors (it's several years out of date). I don't think it checks for reading frame match when determining protein overlap match either. So it is no surprise that some models will be different from the current MAKER version.

--Carson

From: Daniel Standage
Date: Friday, June 6, 2014 at 12:38 PM
To: Carson Holt
Cc: Maker Mailing List, Volker Brendel
Subject: Re: [maker-devel] Filtering of ab initio gene models

Carson et al,

Your feedback so far has been very helpful, and we are grateful for the time you have taken to respond!

We're still trying to understand the precise procedure by which competing models are chosen. You mentioned that a single model must be chosen (via AED) from a pool of available models: are these pools constructed by overlap? It is not uncommon in our experience to see overlapping genes reported by Maker, although for the most part it appears these overlapping genes don't have CDS overlap.
Looking more closely at the Maker 2.10 output from the test data we sent yesterday, we also noted that exclusion of the transcript in question also had an effect on the interval exon structure (exon 7717-7776 becomes exon 7737-7776) of a downstream model with which it overlaps 3 nucleotides.

And still unclear to us is how the model_gff data fits in with all this. From my previous searching of the list archives I was under the impression that these models would be given substantial weight in the prediction process, and would only be altered if a considerably better model could be identified. Our experience with this small data set, though, is that which overlapping gene is reported, and which corresponding exon structure is selected, is dependent on very slight changes in the evidence.

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 12:59 PM, Daniel Standage wrote:
> This helps, thanks.
>
> --
> Daniel S. Standage
> Ph.D. Candidate
> Computational Genome Science Laboratory
> Indiana University
>
> On Fri, Jun 6, 2014 at 12:56 PM, Carson Holt wrote:
>> I got the e-mail. Thanks for the test set.
>>
>> Multiple ab initio predictors don't inform a single annotation, rather one
>> must be chosen from the pool of available models (i.e. it has to be SNAP or
>> Augustus, or GeneMark). They all supply their own ab initio as well as hint
>> based prediction, and then the one with best evidence match (measured by AED)
>> is kept (it's like a competition that only one predictor can win).
>>
>> If you want a consensus model instead, you can take MAKER results in GFF3
>> format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is a
>> collaboration with the EVM group and will have this option, but for now users
>> can just split the MAKER GFF3 by evidence types and give it to EVM. EVM then
>> produces consensus models based on the GFF3 content.
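The "split the MAKER GFF3 by evidence types" step mentioned above could be sketched roughly as follows. This is only an illustration, assuming that the evidence type is read from the GFF3 source column (column 2, e.g. maker, snap_masked, est2genome); the function name is made up here, and any further conversion EVM may need for its own input files is not covered.

```python
from collections import defaultdict


def split_gff3_by_source(lines):
    """Group GFF3 feature lines by their source column (column 2).

    Hypothetical helper for illustration; it is not part of MAKER or EVM.
    """
    groups = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        # Stop at the embedded sequence section, if the file has one.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, directives, and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns.
        if len(fields) != 9:
            continue
        groups[fields[1]].append(line)
    return dict(groups)
```

Each group can then be written to its own file, one per evidence type, before handing the files to the downstream tool.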
>> >> --Carson >> >> From: Daniel Standage >> Date: Friday, June 6, 2014 at 10:46 AM >> >> To: Carson Holt >> Cc: Maker Mailing List , Volker Brendel >> >> Subject: Re: [maker-devel] Filtering of ab initio gene models >> >> Good to know, thanks. If multiple ab initio predictors inform a single >> annotation, how does Maker decide which one will be included in the gene's >> ID? >> >> Given your quick response just now, I wanted to confirm that you got the >> message and data set I sent yesterday. I received an email saying the size of >> my message required list admin approval to be distributed, but since you were >> also a direct recipient of the email I didn't worry about it too much. >> >> Thanks again! >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote: >>> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked >>> sequence without hints (i.e. the ab initio call). >>> maker-$seqid-snap-gene was produced by SNAP after receiving hints from >>> MAKER. >>> >>> In both cases MAKER is allowed to add UTR to the model (hence the >>> 'processed' tag). >>> >>> --Carson >>> >>> >>> From: Daniel Standage >>> Date: Friday, June 6, 2014 at 10:33 AM >>> To: Carson Holt >>> Cc: Maker Mailing List , Volker Brendel >>> >>> >>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>> >>> Another question: is there documentation anywhere for the naming conventions >>> of the genes annotated by Maker? Of course it's easy to spot genes based on >>> a particular ab initio gene predictor, as the names are in the IDs. But what >>> is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene >>> ID vs "maker-$seqid-snap-gene"? >>> >>> Thanks, >>> Daniel >>> >>> >>> -- >>> Daniel S. Standage >>> Ph.D. 
Candidate
>>> Computational Genome Science Laboratory
>>> Indiana University
>>>
>>> On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage wrote:
>>>> I have attached data for a small 18kb region with a handful of genes, as
>>>> well as the corresponding maker_opts.ctl file. (This is a smaller and
>>>> different data set than what I was looking at yesterday, with a more
>>>> well-defined problem).
>>>>
>>>> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400
>>>> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a
>>>> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes
>>>> have transcript support: will Maker report overlapping genes under any
>>>> conditions? And even if Maker is forced to choose only a single gene to
>>>> report, why would the model from 4125 to 6400 ever be reported in place of
>>>> the one from 6111 to 8345, especially since this is provided in the
>>>> model_gff file?
>>>>
>>>> Even when transcript TSA024184 is included, Maker 2.10 reports the
>>>> high-confidence gene from 6111 to 8345.
>>>>
>>>> Any light you could shed would be helpful. Thanks!
>>>>
>>>> --
>>>> Daniel S. Standage
>>>> Ph.D. Candidate
>>>> Computational Genome Science Laboratory
>>>> Indiana University
>>>>
>>>> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote:
>>>>> Just eAED, but eAED can affect selection of ab initio results. For
>>>>> example, reading frame match of protein evidence, which also affects whether
>>>>> evidence from single_exon=1 and genes with single_exon protein evidence
>>>>> get kept. There is also the assumption that your alignments in GFF3 are
>>>>> correctly spliced (like BLAT does). So giving blastn results as
>>>>> precomputed est_gff would create a lot of noise, since maker ignores
>>>>> blastn and is using it only to seed the polished exonerate alignments.
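For readers following the AED/eAED discussion, here is a minimal sketch of how a plain AED value can be computed at the nucleotide level, assuming the usual definition AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the model and SP is the fraction of model bases covered by evidence. The function names and the interval representation are assumptions for illustration; this is not MAKER's own code, and the reading-frame handling that distinguishes eAED is not shown.

```python
def covered_positions(intervals):
    """Return the set of 1-based positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def aed(model_exons, evidence_blocks):
    """Annotation edit distance between a gene model and its overlapping evidence.

    Illustrative only: 0.0 means the model and the evidence cover exactly the
    same bases, 1.0 means they share none (or one side is empty).
    """
    model = covered_positions(model_exons)
    evidence = covered_positions(evidence_blocks)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # evidence bases explained by the model
    specificity = shared / len(model)     # model bases supported by evidence
    return 1.0 - (sensitivity + specificity) / 2.0
```

Under this definition a model with no alignment support scores 1.0, which is consistent with the _AED=1.00 values that show up on unsupported features.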
>>>>> >>>>> --Carson >>>>> >>>>> >>>>> From: Daniel Standage >>>>> Date: Wednesday, June 4, 2014 at 1:11 PM >>>>> To: Carson Holt >>>>> Cc: Maker Mailing List >>>>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>>>> >>>>> I do not provide Gap or Target attributes in the GFF3. Will this affect >>>>> the AED as well, or just the eAED? >>>>> >>>>> >>>>> -- >>>>> Daniel S. Standage >>>>> Ph.D. Candidate >>>>> Computational Genome Science Laboratory >>>>> Indiana University >>>>> >>>>> >>>>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: >>>>>> Sure. that would be helpful. One question. Do you provide the Gap >>>>>> attribute in your precomputed alignments? Having or not having that >>>>>> attribute affects the eAED score which takes reading frame into account, >>>>>> and may cause some things to be kept that normally would be dropped, >>>>>> because MAKER won't be able to take the points of mismatch of the >>>>>> alignment into account (it just assumes match everywhere). >>>>>> >>>>>> --Carson >>>>>> >>>>>> >>>>>> From: Daniel Standage >>>>>> Date: Wednesday, June 4, 2014 at 1:03 PM >>>>>> To: Maker Mailing List >>>>>> Subject: [maker-devel] Filtering of ab initio gene models >>>>>> >>>>>> Thanks everyone for your responses recently! >>>>>> >>>>>> The reason for my recent flurry of email activity is that I'm seeing some >>>>>> unexpected trends when running the new version of Maker with precomputed >>>>>> alignments. Compared with an annotation I did a while ago (Maker 2.10, >>>>>> Maker-computed alignments), this new annotation has a substantial number >>>>>> of new genes annotated. If I compare distributions of AED scores between >>>>>> the old and new annotation, it's clear that the new annotation has a lot >>>>>> more low-quality models. If I look at new gene models that do not overlap >>>>>> with any gene model from the old annotation, the likelihood that it's a >>>>>> low-quality model is much higher. 
>>>>>>
>>>>>> I decided to run a little experiment. I annotated a scaffold first using
>>>>>> Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same
>>>>>> pre-computed transcript and protein alignments and the same (latest)
>>>>>> version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44
>>>>>> genes while Maker 2.31.3 predicted 63. If we group gene models into loci
>>>>>> by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3,
>>>>>> 1 locus with only models from 2.10, and 28 loci with only models from
>>>>>> 2.31.3.
>>>>>>
>>>>>> Before this experiment, I assumed the issue was related to providing
>>>>>> pre-computed alignments in GFF3 format and perhaps violating some
>>>>>> important assumption. However, this experiment makes me wonder whether
>>>>>> there have been changes to how Maker filters ab initio gene models
>>>>>> between version 2.10 and version 2.31.3? Do you have any ideas? If it
>>>>>> would help, I could put together a small data set that reproduces the
>>>>>> behavior I just described.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> Daniel S. Standage
>>>>>> Ph.D. Candidate
>>>>>> Computational Genome Science Laboratory
>>>>>> Indiana University

From marc.hoeppner at imbim.uu.se Mon Jun 9 02:48:01 2014
From: marc.hoeppner at imbim.uu.se (Marc Höppner)
Date: Mon, 9 Jun 2014 08:48:01 +0000
Subject: [maker-devel] Repeatmasked genome
Message-ID: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se>

Hi,

this may be an odd question, but I was wondering where, if at all, Maker reports the repeat-masked genome sequence?
As far as I can tell the fasta sequences included in the gff annotation are unmasked (?) or at least not softmasked. I guess it wouldn't be too hard to take the repeat masker features and use them to soft mask the assembly, but still...

Regards,

Marc

Marc P. Hoeppner, PhD
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at imbim.uu.se

From dence at genetics.utah.edu Mon Jun 9 09:22:13 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 9 Jun 2014 15:22:13 +0000
Subject: [maker-devel] Repeatmasked genome
In-Reply-To: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se>
References: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se>
Message-ID: <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>

Hi Marc,

The masked genome sequence is stored in the "theVoid" directory for each scaffold. There are files "query.fasta", "query.masked.fasta", and "query.masked.gff". The masked sequence is in the query.masked.fasta file and the genomic locations of the masked regions are stored in the query.masked.gff.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 9, 2014, at 2:48 AM, Marc Höppner wrote:

Hi,

this may be an odd question, but I was wondering where, if at all, Maker reports the repeat-masked genome sequence? As far as I can tell the fasta sequences included in the gff annotation are unmasked (?) or at least not softmasked. I guess it wouldn't be too hard to take the repeat masker features and use them to soft mask the assembly, but still...

Regards,

Marc

Marc P.
Hoeppner, PhD
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at imbim.uu.se

From carsonhh at gmail.com Mon Jun 9 10:11:23 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 09 Jun 2014 10:11:23 -0600
Subject: [maker-devel] Repeatmasked genome
In-Reply-To: <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>
References: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se> <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>

Yes. Those are all temporary files that (if you still have them) you can use to get at the masked fasta directly. Otherwise you can just use the features in the GFF3 file to remask the regions.

--Carson

From: Daniel Ence
Date: Monday, June 9, 2014 at 9:22 AM
To: Marc Höppner
Cc: "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Repeatmasked genome

Hi Marc,

The masked genome sequence is stored in the "theVoid" directory for each scaffold. There are files "query.fasta", "query.masked.fasta", and "query.masked.gff". The masked sequence is in the query.masked.fasta file and the genomic locations of the masked regions are stored in the query.masked.gff.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 9, 2014, at 2:48 AM, Marc Höppner wrote:
> Hi,
>
> this may be an odd question, but I was wondering where, if at all, Maker
> reports the repeat-masked genome sequence? As far as I can tell the fasta
> sequences included in the gff annotation are unmasked (?) or at least not
> softmasked.
I guess it wouldn't be too hard to take the repeat masker features
> and use them to soft mask the assembly, but still...
>
> Regards,
>
> Marc
>
> Marc P. Hoeppner, PhD
>
> Department for Medical Biochemistry and Microbiology
> Uppsala University, Sweden
> marc.hoeppner at imbim.uu.se

From cynsb1987 at gmail.com Mon Jun 9 22:22:47 2014
From: cynsb1987 at gmail.com (hueytyng)
Date: Tue, 10 Jun 2014 14:22:47 +1000
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level

Hi Carson,

I run Maker 2.31 on my assembled contigs. From the "master_datastore_index.log", 529 contigs run through, before I get the error below. Maker halts after this.

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: 1AL_NODE_4659_length_8657_cov_8.758115
Length: 8739
#---------------------------------------------------------------------
setting up GFF3 output and fasta chunks
ERROR: This MpiChunk is not part of this level
 at /ws/ws-group/app/maker/bin/../lib/Process/MpiTiers.pm line 439.
 Process::MpiTiers::update_chunk(Process::MpiTiers=HASH(0x63e64c0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 1537
 main::update_chunk(ARRAY(0x40c69d0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 929
--> rank=4, hostname=safs-raijen
deleted:0 hits
deleted:0 hits

Attached is my maker_opts.ctl file.

Thank you,
Jenny

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4932 bytes
Desc: not available

From carsonhh at gmail.com Wed Jun 11 08:29:44 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 11 Jun 2014 08:29:44 -0600
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level

The cause of this is most likely a corrupt MPI message. It could be random (it happens with MPI messages), in which case it should succeed on retry. It could mean you need to reinstall your MPI communicator, or give fewer nodes to mpiexec when running your job (MPICH2 starts having communication issues after around 100 processes for example - even sooner on some systems). It may also mean that you set MAKER up with one communicator during the installation (like MPICH2) and then used mpiexec from another communicator to launch the job (OpenMPI for example or even a different version of MPICH2).

Make sure you are not using MVAPICH2, because MAKER won't work with MVAPICH2. Also if you are using OpenMPI, you must preload libmpi.so, otherwise shared libraries won't work and it will fail while running MAKER. To do that you have to export the following environment variable:

export LD_PRELOAD=/lib/libmpi.so #replace with the location of OpenMPI

Also, because a corrupt message has the chance to cause other issues, you may want to completely delete the folder for the failed contig (look in the datastore_index.log to see where that folder is). Also make sure you are using the latest version of MAKER because it has been vetted on OpenMPI using 8000+ cpus. Earlier versions (i.e. 2.28 and below) may have issues on OpenMPI or on some systems with slow NFS storage or limited memory.

--Carson

From: hueytyng
Date: Monday, June 9, 2014 at 10:22 PM
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level

Hi Carson,

I run Maker 2.31 on my assembled contigs.
From the "master_datastore_index.log", 529 contigs run through, before I get the error below. Maker halts after this.

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: 1AL_NODE_4659_length_8657_cov_8.758115
Length: 8739
#---------------------------------------------------------------------
setting up GFF3 output and fasta chunks
ERROR: This MpiChunk is not part of this level
 at /ws/ws-group/app/maker/bin/../lib/Process/MpiTiers.pm line 439.
 Process::MpiTiers::update_chunk(Process::MpiTiers=HASH(0x63e64c0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 1537
 main::update_chunk(ARRAY(0x40c69d0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 929
--> rank=4, hostname=safs-raijen
deleted:0 hits
deleted:0 hits

Attached is my maker_opts.ctl file.

Thank you,
Jenny

From sjackman at gmail.com Wed Jun 11 14:44:41 2014
From: sjackman at gmail.com (Shaun Jackman)
Date: Wed, 11 Jun 2014 13:44:41 -0700
Subject: [maker-devel] Alternate translation table

Hi, Carson.

I'm annotating a plastid genome. It has spliced genes so I'm using organism_type=eukaryotic. Its translation table however is 11 (Bacteria, Archaea, prokaryotic viruses and chloroplast proteins). Is it possible to change the translation table?

I'm not doing any ab initio gene prediction, only homology-based annotation using protein for coding genes and est for non-coding genes. The sequences are from a very closely related species (99% identity).

Cheers,
Shaun

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Wed Jun 11 15:01:23 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 11 Jun 2014 15:01:23 -0600
Subject: [maker-devel] Alternate translation table

Sorry. MAKER doesn't have an alternate codon table option.

--Carson

From: Shaun Jackman
Reply-To: Shaun Jackman
Date: Wednesday, June 11, 2014 at 2:44 PM
To: "maker-devel at yandell-lab.org"
Subject: [maker-devel] Alternate translation table

Hi, Carson.

I'm annotating a plastid genome. It has spliced genes so I'm using organism_type=eukaryotic. Its translation table however is 11 (Bacteria, Archaea, prokaryotic viruses and chloroplast proteins). Is it possible to change the translation table?

I'm not doing any ab initio gene prediction, only homology-based annotation using protein for coding genes and est for non-coding genes. The sequences are from a very closely related species (99% identity).

Cheers,
Shaun

From anthony.bretaudeau at rennes.inra.fr Thu Jun 12 07:00:48 2014
From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau)
Date: Thu, 12 Jun 2014 15:00:48 +0200
Subject: [maker-devel] Merging 2 annotations
References: <538D8987.4090606@rennes.inra.fr>
Message-ID: <5399A480.10808@rennes.inra.fr>

Thank you, it works fine!

A little question which is related: I set the map_forward option to 1, but it seems to work only for the model_gff gff. Is there a way to make it keep the original IDs also for the pred_gff file?

Thank you
Anthony

On 03/06/2014 18:15, Carson Holt wrote:
> You can give the manually curated ones to model_gff and the other ones to
> pred_gff. Then set keep_preds=1.
The model_gff results always get kept
> even without evidence support, the pred_gff will be kept even without
> evidence support because you set keep_preds=1, but model_gff results will
> take precedence.
>
> --Carson
>
> On 6/3/14, 2:38 AM, "Anthony Bretaudeau" wrote:
>
>> Hello,
>>
>> I am working on the annotation of an insect genome, and I have 2 gff
>> files:
>> -an automatic annotation (done by another lab, with something else than
>> maker, ~20000 genes)
>> -a manually curated annotation (with webapollo, ~1500 genes)
>>
>> From this, I would like to produce a single gff combining the 2. I'd
>> like to keep all the manually curated models, and only the automatic
>> ones that have no equivalent in the manually curated gff.
>>
>> Is it possible to do something like this with maker? I guess I could
>> play with the model_gff option, but I'm not sure how exactly I could use
>> it.
>>
>> Thank you for your help
>> Regards
>>
>> Anthony

From dence at genetics.utah.edu Thu Jun 12 09:50:05 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Thu, 12 Jun 2014 15:50:05 +0000
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <5399A480.10808@rennes.inra.fr>
References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr>
Message-ID: <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu>

Hi Anthony,

So I think that the gene ID gets changed in the process of promoting things from pred_gff to gene models. If you know which predictions you want to keep, then you can select those out and pass them to model_gff.
~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 12, 2014, at 7:00 AM, Anthony Bretaudeau wrote:

A little question which is related: I set the map_forward option to 1, but it seems to work only for the model_gff gff. Is there a way to make it keep the original IDs also for the pred_gff file?

From anthony.bretaudeau at rennes.inra.fr Thu Jun 12 10:17:11 2014
From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau)
Date: Thu, 12 Jun 2014 18:17:11 +0200
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu>
References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr> <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu>
Message-ID: <5399D287.1090505@rennes.inra.fr>

An HTML attachment was scrubbed...

From carsonhh at gmail.com Thu Jun 12 10:23:06 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Jun 2014 10:23:06 -0600
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <5399D287.1090505@rennes.inra.fr>
References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr> <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu> <5399D287.1090505@rennes.inra.fr>

This might be a roundabout way to get them to have the names unaltered. Give the pred_gff ones to est_gff. Still give the model_gff ones to model_gff. Set est2genome=1 and single_exon=1. Then add this line to the control file: est_forward=1. This is normally used to move transcripts forward onto new assemblies with names being drawn from the alignment, but by telling MAKER that these are ESTs instead of predictions and setting the appropriate values, it will think it's moving transcripts forward, and the final result may be what you want.
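As a rough sketch of the settings described above, the relevant maker_opts.ctl lines might look like the following. The two file names are placeholders and not from the thread, and est_forward is assumed to be absent from the default control file, so it is added by hand as described.

```
#-----hypothetical maker_opts.ctl excerpt (file names are placeholders)
model_gff=curated_models.gff3  #manually curated models
est_gff=automatic_models.gff3  #automatic annotation, passed in as if it were EST alignments
est2genome=1                   #infer gene models directly from the EST-type evidence
single_exon=1                  #also consider single exon evidence
est_forward=1                  #added by hand; names are drawn from the alignments
```

Whether the resulting IDs come through exactly as wanted would need to be checked on a small test scaffold first.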
--Carson From: Anthony Bretaudeau Date: Thursday, June 12, 2014 at 10:17 AM To: Daniel Ence Cc: Carson Holt , "" Subject: Re: [maker-devel] Merging 2 annotations Yes, I think that's why the ids get changed. But I don't know which predictions I want to keep as I'm using maker to only keep the ones that are not equivalent to the models that are in the model_gff. Anthony On 12/06/2014 17:50, Daniel Ence wrote: > Hi Anthony, So I think that the gene ID gets changed in the process of > promoting things from pred_gff to gene models. If you know which predictions > you want to keep, then you can select those out and pass them to model_gff. > > > > ~Daniel > > > > > > > > Daniel Ence > > Graduate Student > > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > > > > On Jun 12, 2014, at 7:00 AM, Anthony Bretaudeau > > > wrote: > > >> A little question which is related: I set the map_forward option to 1, but it >> seems to work only for the model_gff gff. Is there a way to make it keep the >> original IDs also for the pred_gff file? >> > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sjackman at gmail.com Thu Jun 12 15:58:16 2014 From: sjackman at gmail.com (Shaun Jackman) Date: Thu, 12 Jun 2014 14:58:16 -0700 Subject: [maker-devel] Poor Exonerate gene model Message-ID: Hi, Carson. I have a case where MAKER is choosing a poor gene model when a better model exists. The two genes, psaA and psaB, are adjacent and are similar (37% exonerate score). BLASTX finds only the correct alignments of psaA and psaB. When exonerate is run, it also finds poor alignments of psaA to psaB and psaB to psaA. The result is that MAKER chooses the correct model for psaB, but picks the poor psaB model for psaA. Increasing ep_score_limit from 20 to 40 works around the issue. I think MAKER could make a better choice in this situation without that hint. 
See the attached screen shots. The first is ep_score_limit=20 and the second ep_score_limit=40. I've attached the evidence GFF.

Cheers,
Shaun

[image: Inline images 1]
[image: Inline images 3]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 86112 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 90074 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1.gff.gz
Type: application/x-gzip
Size: 57657 bytes
Desc: not available

From saad.arif at tuebingen.mpg.de Fri Jun 13 05:03:38 2014
From: saad.arif at tuebingen.mpg.de (Saad Arif)
Date: Fri, 13 Jun 2014 13:03:38 +0200
Subject: [maker-devel] Help with updating an annotation

Dear All,

I would like to use the Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to the current annotation) and was wondering if anyone had experience with this and/or suggestions for my questions.

Briefly:

I have tophat splice junctions from RNAseq data or alternatively cufflinks generated transcript models (FASTA format) that I want to use as my new data (est_gff or est).

I want to provide the current Ensembl annotation for gene prediction but I want this annotation to remain unchanged. Hence, I'm not sure if I should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction or is this just a subset of pred_gff that remain unaltered? Can we provide the same annotation for both options (pred_gff and model_gff)?

Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated.
Thanks in advance,
saad

From carsonhh at gmail.com Fri Jun 13 10:59:46 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 13 Jun 2014 10:59:46 -0600
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References:
Message-ID:

Use the cufflinks features instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged) or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms, though), and map_forward=1 can be used to pull names forward from the old models onto the new models.

Thanks,
Carson

On 6/13/14, 5:03 AM, "Saad Arif" wrote:

>Dear All,
>
>I would like to use Maker pipeline to expand a current annotation (new
>isoforms and novel genes with respect to current annotation) and was
>wondering if anyone had experience with this and or suggestions to my
>questions.
>
>Briefly:
>
> I have tophat splice junctions from RNAseq data or alternatively
>cufflinks generated transcript models (FASTA format) that i want to use
>as my new data (est_gff or est).
>
>I want to provide the current Ensembl annotation for gene prediction but
>i want this annotation to remain unchanged. Hence, I'm not sure if i
>should provide this annotation as pred_gff
> or model_gff. Can the model_gff be used for gene prediction or is this
>just a subset of pred_gff that remain unaltered? Can we provide the same
>annotation for both options (pred_ and mod_gff)?
>
>Importantly, my main goal is to use the new RNAseq data to add more
>isoforms and (any) novel genes to the existing Ensembl annotation. Any
>thoughts or suggestions on how to go about this would be sincerely
>appreciated.
>
>Thanks in advance,
>saad

From juefish at gmail.com Tue Jun 17 14:54:51 2014
From: juefish at gmail.com (Nathaniel Jue)
Date: Tue, 17 Jun 2014 16:54:51 -0400
Subject: [maker-devel] issue with forks module
Message-ID:

I've been running into all kinds of issues with the implementation of forks in GMOD. I repeatedly get this error when running an MPI run of Maker:

Use of each() on hash after insertion without resetting hash iterator results in undefined behavior at /data2/local_installs/perls/perl-5.18.1/lib/site_perl/5.18.1/x86_64-linux/forks.pm line 1736.

I had to install an alternative version of perl with perlbrew, but everything seemed to work on the test data sets. Any thoughts on what might be the issue? Or is this even an issue that would affect results?

Thanks,
Nate

From carsonhh at gmail.com Tue Jun 17 15:09:55 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 17 Jun 2014 15:09:55 -0600
Subject: [maker-devel] issue with forks module
In-Reply-To:
References:
Message-ID:

There is a change in Perl 5.18 that makes the forks.pm module incompatible. The forks.pm module maintainers have yet to update their module to resolve the issue, so it only works on Perl versions prior to 5.18. One workaround is to manually edit forks.pm line 1736 yourself.

Change it from this -->

    $write = each %WRITE;

To this (make sure to include the {} brackets) -->

    {
        no warnings qw(internal);
        $write = each %WRITE;
    }

--Carson

From: Nathaniel Jue
Date: Tuesday, June 17, 2014 at 2:54 PM
To:
Subject: [maker-devel] issue with forks module

I've been running into all kinds of issues with the implementation of forks in GMOD.
I repeatedly get this error when running an MPI run of Maker:

Use of each() on hash after insertion without resetting hash iterator results in undefined behavior at /data2/local_installs/perls/perl-5.18.1/lib/site_perl/5.18.1/x86_64-linux/forks.pm line 1736.

I had to install an alternative version of perl with perlbrew but everything seemed to work on the test data sets. Any thoughts on what might be the issue? Or if this is even an issue that would affect results?

Thanks,
Nate

From saad.arif at tuebingen.mpg.de Wed Jun 18 05:09:48 2014
From: saad.arif at tuebingen.mpg.de (Saad Arif)
Date: Wed, 18 Jun 2014 12:09:48 +0100
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References:
Message-ID: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de>

Thank you for the response. I still have one question, though. With these options:

est_gff=cufflinksout.GFF
model_gff=ensembl reference.GFF

What happens to cufflinks-assembled transcripts that are not confined to current gene loci (i.e. novel genes in the cufflinks output)? Would I have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding new genes with improving an existing annotation?

Any feedback on this would be greatly appreciated.

saad

On 13 Jun 2014, at 17:59, Carson Holt wrote:

> Use the cufflinks instead of the tophat features (tophat tends to be
> really noisy). Give the existing models to model_gff (they will then
> always be kept unless something better is found). There is no option to
> keep models and then just add isoforms.
The model_gff input will either > be kept as is (unchanged), or replaced with an updated model suggested by > the evidence (the updated model may contain multiple isoforms though), and > map_forward=1 can be used to pull names forward from the old model onto > the new models. > > Thansk, > Carson > > > On 6/13/14, 5:03 AM, "Saad Arif" wrote: > >> Dear All, >> >> I would like to use Maker pipeline to expand a current annotation (new >> isoforms and novel genes with respect to current annotation) and was >> wondering if anyone had experience with this and or suggestions to my >> questions. >> >> Briefly: >> >> I have tophat splice junctions from RNAseq data or alternatively >> cufflinks generated transcript models (fasts format) that i want to use >> as my new data (est_gff or est). >> >> I want to provide the current Ensembl annotation for gene prediction but >> i want this annotation to remain unchanged. Hence, i?m not sure if i >> should provide this annotation as pred_gff >> or model_gff. Can the model_gff be used for gene prediction or is this >> just a subset of pred_gff that remain unaltered? Can we provide the same >> annotation for both options (pred_ and mod_gff)? >> >> >> >> Importantly, my main goal is to use the new RNAseq data to add more >> isoforms and (any) novel genes to the existing Ensembl annotation. Any >> thoughts or suggestions on how to go about this would be sincerely >> appreciated. 
>>
>> Thanks in advance,
>> saad

From dence at genetics.utah.edu Wed Jun 18 10:21:19 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Wed, 18 Jun 2014 16:21:19 +0000
Subject: [maker-devel] Help with updating an annotation
In-Reply-To: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de>
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de>
Message-ID: <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu>

Hi Saad,

MAKER doesn't treat EST or protein evidence as gene models in themselves. There's a good reason for this: aligners like BLAST don't guarantee complete gene models with accurate start and stop codons and splice sites. With its default settings, MAKER won't make a gene model unless there's evidence that overlaps an ab initio prediction (or something from the pred_gff option).

You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output.

One of the gene predictors that MAKER can run internally is SNAP. It's really easy to train; here's a link to a tutorial for training it:

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

Let me know if that helps, or if you have more questions.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 18, 2014, at 5:09 AM, Saad Arif wrote:

Thank you for the response.
I still have one question though, with these options: est_GFF=cufflinksout.GFF modle_GFF= ensembl reference.GFF What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding (new genes) and improving of an existing annotation? Any feedback on this would be greatly appreciated. saad On 13 Jun 2014, at 17:59, Carson Holt wrote: Use the cufflinks instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old model onto the new models. Thansk, Carson On 6/13/14, 5:03 AM, "Saad Arif" > wrote: Dear All, I would like to use Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to current annotation) and was wondering if anyone had experience with this and or suggestions to my questions. Briefly: I have tophat splice junctions from RNAseq data or alternatively cufflinks generated transcript models (fasts format) that i want to use as my new data (est_gff or est). I want to provide the current Ensembl annotation for gene prediction but i want this annotation to remain unchanged. Hence, i?m not sure if i should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction or is this just a subset of pred_gff that remain unaltered? Can we provide the same annotation for both options (pred_ and mod_gff)? 
Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated. Thanks in advance, saad _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From dence at genetics.utah.edu Wed Jun 18 11:04:26 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Wed, 18 Jun 2014 17:04:26 +0000 Subject: [maker-devel] Help with updating an annotation In-Reply-To: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de> References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de> Message-ID: Hi Saad, That seems to be right to me. You'll do one run of MAKER with the cufflinks output and est2genome turned on and train SNAP on that output. You can repeat this as many times as you want, but in my experience you don't gain much in predictive power beyond two rounds of training. Next, you'll turn on SNAP and turn off est2genome, but still include the cufflinks and proteome evidence and the ensemble models. The other ab initio predictors that maker can use internally (genemark and augustus) are worth looking into also. Genemark does a self-training thing, but can take a couple of days depending on how large your genome is. Augustus takes a lot of time and effort to train, but comes with many prebuilt training files. If one of its prebuilt files is close to your species of interest, you can just use that instead. 
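For readers following along, the two passes described above (train SNAP from an est2genome run, then rerun with SNAP on and est2genome off) can be sketched as two sets of overrides for maker_opts.ctl. This is only an illustration, not the authors' procedure: the option names are the ones used in this thread plus snaphmm, while every file name and the apply_overrides helper are hypothetical placeholders.

```python
# Hypothetical sketch: per-round overrides for maker_opts.ctl.
# All file names below are placeholders, not values from the thread.

ROUND_1 = {
    # Training pass: promote transcript evidence directly to gene models.
    "est_gff": "cufflinks.gff3",
    "est2genome": "1",
    "snaphmm": "",
}

ROUND_2 = {
    # Annotation pass: SNAP on, est2genome off, evidence and models kept.
    "est_gff": "cufflinks.gff3",
    "protein": "related_proteomes.fasta",
    "model_gff": "ensembl_reference.gff3",
    "est2genome": "0",
    "snaphmm": "round1_snap.hmm",
    "map_forward": "1",
}


def apply_overrides(ctl_text, overrides):
    """Rewrite matching key=value lines, keeping any trailing #comment."""
    out = []
    for line in ctl_text.splitlines():
        key, sep, rest = line.partition("=")
        if sep and key.strip() in overrides:
            comment = ""
            if "#" in rest:
                comment = " #" + rest.split("#", 1)[1]
            line = f"{key.strip()}={overrides[key.strip()]}{comment}"
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    template = (
        "est2genome=0 #infer gene predictions directly from ESTs\n"
        "snaphmm= #SNAP HMM file\n"
    )
    print(apply_overrides(template, ROUND_2))
```

Keys that are absent from the template are simply left out, so the helper is meant to be run against a full control file generated by MAKER itself.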
~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 18, 2014, at 10:42 AM, Saad Arif > wrote: Thanks Daniel. I think it's more clear to me now. So If I understand correctly now: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence.Alternatively, I can provide alternative ab inito predictions (for regions not present in my ensembl ref passed to model_GFF) for regions overlapping my cufflinks output via the pred_GFF option? Since i'm interested in unannotated regions, i'm also passing in reference proteomes of closely related species as protein homology evidence. As such i should be able to keep, only evidence supported predictions (for regions not present in my model_GFF and or better supported models for present regions) from my pred_GFF and merge them with Ensembl annotations from the model_GFF? Let me know if i'm still missing something here. Thanks in advance. best, Saad On 18 Jun 2014, at 17:21, Daniel Ence wrote: Hi Saad, Maker doesn't view EST or protein evidence as a gene model in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With it's default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output. 
One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors Let me know if that helps, or if you have more question ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: Thank you for the response. I still have one question though, with these options: est_GFF=cufflinksout.GFF modle_GFF= ensembl reference.GFF What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding (new genes) and improving of an existing annotation? Any feedback on this would be greatly appreciated. saad On 13 Jun 2014, at 17:59, Carson Holt wrote: Use the cufflinks instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old model onto the new models. Thansk, Carson On 6/13/14, 5:03 AM, "Saad Arif" > wrote: Dear All, I would like to use Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to current annotation) and was wondering if anyone had experience with this and or suggestions to my questions. 
Briefly: I have tophat splice junctions from RNAseq data or alternatively cufflinks generated transcript models (fasts format) that i want to use as my new data (est_gff or est). I want to provide the current Ensembl annotation for gene prediction but i want this annotation to remain unchanged. Hence, i?m not sure if i should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction or is this just a subset of pred_gff that remain unaltered? Can we provide the same annotation for both options (pred_ and mod_gff)? Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated. Thanks in advance, saad _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.priyam at qmul.ac.uk Wed Jun 18 11:44:34 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Wed, 18 Jun 2014 23:14:34 +0530 Subject: [maker-devel] errors in final gff Message-ID: Hi, I compiled all annotations generated by MAKER into a single GFF file using the gff3_merge script distributed with MAKER. While formatting this GFF for use with JBrowse, I found a few errors: 1. Three instances where two features were assigned the same id. 2. One instance where a group of three subfeatures refer to a non-existent parent. Here is the relevant portion of the GFF file: https://gist.github.com/yeban/ffaf5cd419639dd073a7 I worked around the issue temporarily for the job at hand, but I am left wondering why would these errors creep in. 
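The two problems listed above (repeated IDs and subfeatures pointing at a missing parent) can be screened for with a short script. The following is a minimal sketch, not part of MAKER or JBrowse; the function name check_gff3 is made up for illustration. Note that GFF3 allows one ID to span several lines for discontinuous features such as CDS, so a reported duplicate is a candidate to inspect rather than a certain error.

```python
from collections import Counter


def check_gff3(lines):
    """Return (duplicate_ids, orphan_parents) for an iterable of GFF3 lines."""
    id_counts = Counter()
    parent_refs = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs:
            id_counts[attrs["ID"]] += 1
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parent_refs.add(parent)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    orphans = sorted(p for p in parent_refs if p not in id_counts)
    return duplicates, orphans


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        dups, orphans = check_gff3(handle)
    print("IDs seen more than once:", dups)
    print("Parents never defined:", orphans)
```

Run it on the merged file, for example `python check_gff3.py merged.gff3` (file name hypothetical), before loading the annotations into a browser.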
-- Priyam From carsonhh at gmail.com Wed Jun 18 12:11:49 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 18 Jun 2014 12:11:49 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: What MAKER version are you using? --Carson On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >Hi, > >I compiled all annotations generated by MAKER into a single GFF file >using the gff3_merge script distributed with MAKER. While formatting >this GFF for use with JBrowse, I found a few errors: > >1. Three instances where two features were assigned the same id. >2. One instance where a group of three subfeatures refer to a >non-existent parent. > >Here is the relevant portion of the GFF file: >https://gist.github.com/yeban/ffaf5cd419639dd073a7 > >I worked around the issue temporarily for the job at hand, but I am >left wondering why would these errors creep in. > >-- Priyam > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Wed Jun 18 15:33:08 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 18 Jun 2014 15:33:08 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Are you passing in old data via GFF3? --Carson On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >It's version 2.31. > >-- Priyam > >On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt wrote: >> What MAKER version are you using? >> >> --Carson >> >> >> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >> >>>Hi, >>> >>>I compiled all annotations generated by MAKER into a single GFF file >>>using the gff3_merge script distributed with MAKER. While formatting >>>this GFF for use with JBrowse, I found a few errors: >>> >>>1. Three instances where two features were assigned the same id. >>>2. One instance where a group of three subfeatures refer to a >>>non-existent parent. 
>>> >>>Here is the relevant portion of the GFF file: >>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>> >>>I worked around the issue temporarily for the job at hand, but I am >>>left wondering why would these errors creep in. >>> >>>-- Priyam >>> >>>_______________________________________________ >>>maker-devel mailing list >>>maker-devel at box290.bluehost.com >>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> From mhinsley at ebi.ac.uk Thu Jun 19 03:07:32 2014 From: mhinsley at ebi.ac.uk (Malcolm Hinsley) Date: Thu, 19 Jun 2014 10:07:32 +0100 Subject: [maker-devel] 'not a SCALAR reference' error In-Reply-To: References: Message-ID: <53A2A854.3000009@ebi.ac.uk> Hi I'm running maker 2.31 with mpich 3 and have run once with est and protein2genome, then trained augustus and snap and run the first iteration of ab-initio predictors, which finished cleanly with no errors/ failures. Having retrained augustus and snap I'm trying to run maker -a using the same augustus species and snap.hmm pathname... previously this has worked fine. I get a lot of errors like this (it looks like every scaffold fails): doing repeat masking ERROR: Not a SCALAR reference at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Fasta.pm line 382 thread 1. Fasta::_formatSeq(FastaSeq=HASH(0x4426298), 60) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Fasta.pm line 369 thread 1 Fasta::toFastaRef(">scaffold29 CHUNK number:0 size:100000 offset:0", REF(0x42e48f0)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/FastaChunk.pm line 217 thread 1 FastaChunk::fasta_ref(FastaChunk=HASH(0x42c76a8)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/FastaChunk.pm line 168 thread 1 FastaChunk::write_file(FastaChunk=HASH(0x42c76a8), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) 
called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/GI.pm line 3138 thread 1 GI::repeatmask(FastaChunk=HASH(0x42c76a8), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"..., "scaffold29", "", "/nfs/production/panda/ensemblgenomes/external/RepeatMasker-op"..., "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/c."..., 1, runlog=HASH(0x430e730)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 785 thread 1 Process::MpiChunk::__ANON__() called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Error.pm line 415 thread 1 eval {...} called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Error.pm line 407 thread 1 Error::subs::try(CODE(0x4437a90), HASH(0x4425fc8)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 4215 thread 1 Process::MpiChunk::_go(Process::MpiChunk=HASH(0x426e0a0), "run", HASH(0x42a5410), 0, 1) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 341 thread 1 Process::MpiChunk::run(Process::MpiChunk=HASH(0x426e0a0), 29) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker line 1457 thread 1 main::node_thread("/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) called at /nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 799 thread 1 eval {...} called at /nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 799 thread 1 threads::new("threads", CODE(0x4168d70), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) 
called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker line 917 thread 1 --> rank=29, hostname=ebi5-229.ebi.ac.uk ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:scaffold29 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:scaffold29 I see from the mailing list that there's a known issue w/ forks..pm (which is at the bottom of this stack) relating to perl 5.18, but I'm running 5.14. Any ideas? On 17/06/14 22:09, Carson Holt wrote: > There is a change in Perl 5.18 that makes the forks.pm module incompatible. > The forks.pm model maintainers have yet to update their module to resolve > the issue, so it only works on perl version prior to 5.18. > One work around it to manually edit forks.pm line 1736 yourself. > > Change it from this --> > $write = each %WRITE; > > To this (make sure to include the {} brackets)--> > { > no warnings qw(internal); > $write = each %WRITE; > } > > --Carson > -- malcolm hinsley | EnsEMBL Genomes | +44 (0)1223 49 4669 European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD United Kingdom From rbharris at uw.edu Thu Jun 19 13:07:36 2014 From: rbharris at uw.edu (Rebecca Harris) Date: Thu, 19 Jun 2014 14:07:36 -0500 Subject: [maker-devel] Fwd: iprscan2gff3 In-Reply-To: References: Message-ID: Hi, I'm trying to use the iprscan2gff3 script to update my final gff3 file with annotations from Interproscan 5. I'm getting a bunch of errors similar to another user but do not see how their issue was resolved: https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ I ran Interproscan on my ab initio predictions, then converted the xml to raw format. When I run iprscan2gff3 I get the errors: Use of uninitialized value $name in hash element at /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line 1090. 
Thanks,
Rebecca

From dence at genetics.utah.edu Thu Jun 19 14:44:46 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Thu, 19 Jun 2014 20:44:46 +0000
Subject: [maker-devel] Fwd: iprscan2gff3
In-Reply-To:
References:
Message-ID:

Hi Rebecca,

I looked at the conversation you linked to, and it seems that Carson resolved those parsing issues in an update to MAKER. What version of MAKER are you using?

Also, in that same conversation Carson said that those errors wouldn't affect the output (because the script was parsing the mRNA features fine, but giving errors on the gene features). Does the output that you get from iprscan2gff3 seem complete?

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 19, 2014, at 1:07 PM, Rebecca Harris wrote:

> Hi,
>
> I'm trying to use the iprscan2gff3 script to update my final gff3 file with annotations from Interproscan 5. I'm getting a bunch of errors similar to another user but do not see how their issue was resolved:
> https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ
>
> I ran Interproscan on my ab initio predictions, then converted the xml to raw format. When I run iprscan2gff3 I get the errors:
> Use of uninitialized value $name in hash element at /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line 1090.
>
> Thanks,
> Rebecca

-------------- next part -------------- An HTML attachment was scrubbed...
URL: From carsonhh at gmail.com Thu Jun 19 14:47:27 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 19 Jun 2014 14:47:27 -0600 Subject: [maker-devel] Fwd: iprscan2gff3 In-Reply-To: References: Message-ID: Also make sure there are gene/mRNA features in your GFF3 for your iprscan results. If you used the ab initio calls (which will be match/match_part features in the GFF3) as your input to iprscan, then you will need to upgrade them to gene/mRNA features before the script will add domains to them. --Carson From: Daniel Ence Date: Thursday, June 19, 2014 at 2:44 PM To: Rebecca Harris Cc: "maker-devel at yandell-lab.org" Subject: Re: [maker-devel] Fwd: iprscan2gff3 Hi Rebecca, I at the conversation you linked to and it seems that Carson resolved the those parsing issues in an update to maker. What version of maker are you using? Also, in that same conversation Carson said that those errors wouldn't affect the output (because the script was parsing the mRNA features fine, but giving errors on the gene features). Does the output that you get from iprscan2gff3 seem complete? ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 19, 2014, at 1:07 PM, Rebecca Harris wrote: > Hi, > > I'm trying to use the iprscan2gff3 script to update my final gff3 file with > annotations from Interproscan 5. I'm getting a bunch of errors similar to > another user but do not see how their issue was resolved: > https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-deve > l/MykTxL2Da64/5yrO3e6WBHUJ > > I ran Interproscan on my ab initio predictions, then converted the xml to raw > format. When I run iprscan2gff3 I get the errors: > Use of uninitialized value $name in hash element at > /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line 1090. 
> > Thanks, > Rebecca > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From rbharris at uw.edu Thu Jun 19 15:22:34 2014 From: rbharris at uw.edu (Rebecca Harris) Date: Thu, 19 Jun 2014 14:22:34 -0700 Subject: [maker-devel] Fwd: iprscan2gff3 In-Reply-To: References: Message-ID: Hey, Thanks for the reply. The problem was that I didn't upgrade the matches to gene/mRNA features before running the ipr_upgrade_gff3 script. R On Thu, Jun 19, 2014 at 1:47 PM, Carson Holt wrote: > Also make sure there are gene/mRNA features in your GFF3 for your iprscan > results. If you used the ab initio calls (which will be match/match_part > features in the GFF3) as your input to iprscan, then you will need to > upgrade them to gene/mRNA features before the script will add domains to > them. > > --Carson > > > From: Daniel Ence > Date: Thursday, June 19, 2014 at 2:44 PM > To: Rebecca Harris > Cc: "maker-devel at yandell-lab.org" > Subject: Re: [maker-devel] Fwd: iprscan2gff3 > > Hi Rebecca, I at the conversation you linked to and it seems that Carson > resolved the those parsing issues in an update to maker. What version of > maker are you using? > > Also, in that same conversation Carson said that those errors wouldn't > affect the output (because the script was parsing the mRNA features fine, > but giving errors on the gene features). Does the output that you get from > iprscan2gff3 seem complete? 
> > ~Daniel > > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 19, 2014, at 1:07 PM, Rebecca Harris > wrote: > > Hi, > > I'm trying to use the iprscan2gff3 script to update my final gff3 file > with annotations from Interproscan 5. I'm getting a bunch of errors similar > to another user but do not see how their issue was resolved: > > https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ > > I ran Interproscan on my ab initio predictions, then converted the xml to > raw format. When I run iprscan2gff3 I get the errors: > > Use of uninitialized value $name in hash element at > /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line > 1090. > Thanks, > Rebecca > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > _______________________________________________ maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.priyam at qmul.ac.uk Thu Jun 19 16:11:36 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Fri, 20 Jun 2014 03:41:36 +0530 Subject: [maker-devel] migrating annotations from old to new assembly Message-ID: Is it possible to migrate annotations from an old assembly to a new assembly using MAKER? Perhaps by setting est= to transcripts (spliced? or unspliced?) from the previous assembly and genome= to the new assembly? 
Maybe ask MAKER to use exonerate instead of BLASTN so splice junctions are accounted for better? -- Priyam From carsonhh at gmail.com Thu Jun 19 16:16:01 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 19 Jun 2014 16:16:01 -0600 Subject: [maker-devel] migrating annotations from old to new assembly In-Reply-To: References: Message-ID: Here you go, this is covered in a previous post --> https://groups.google.com/forum/#!searchin/maker-devel/est_forward/maker-devel/q9fxXGKO8mk/0ATwhJvZeI4J --Carson On 6/19/14, 4:11 PM, "Anurag Priyam" wrote: >Is it possible to migrate annotations from an old assembly to a new >assembly using MAKER? > >Perhaps by setting est= to transcripts (spliced? or unspliced?) from >the previous assembly and genome= to the new assembly? Maybe ask MAKER >to use exonerate instead of BLASTN so splice junctions are accounted >for better? > >-- Priyam > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From a.priyam at qmul.ac.uk Thu Jun 19 16:19:22 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Fri, 20 Jun 2014 03:49:22 +0530 Subject: [maker-devel] migrating annotations from old to new assembly In-Reply-To: References: Message-ID: Wow! Thanks :). I apologise that I didn't look through the archives before asking. -- Priyam On Fri, Jun 20, 2014 at 3:46 AM, Carson Holt wrote: > Here you go, this is covered in a previous post --> > https://groups.google.com/forum/#!searchin/maker-devel/est_forward/maker-devel/q9fxXGKO8mk/0ATwhJvZeI4J > > > --Carson > > > > On 6/19/14, 4:11 PM, "Anurag Priyam" wrote: > >>Is it possible to migrate annotations from an old assembly to a new >>assembly using MAKER? >> >>Perhaps by setting est= to transcripts (spliced? or unspliced?) from >>the previous assembly and genome= to the new assembly?
Maybe ask MAKER >>to use exonerate instead of BLASTN so splice junctions are accounted >>for better? >> >>-- Priyam >> >>_______________________________________________ >>maker-devel mailing list >>maker-devel at box290.bluehost.com >>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From saad.arif at tuebingen.mpg.de Wed Jun 18 10:42:17 2014 From: saad.arif at tuebingen.mpg.de (Saad Arif) Date: Wed, 18 Jun 2014 17:42:17 +0100 Subject: [maker-devel] Help with updating an annotation In-Reply-To: <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> Message-ID: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de> Thanks Daniel. I think it's more clear to me now. So If I understand correctly now: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence.Alternatively, I can provide alternative ab inito predictions (for regions not present in my ensembl ref passed to model_GFF) for regions overlapping my cufflinks output via the pred_GFF option? Since i'm interested in unannotated regions, i'm also passing in reference proteomes of closely related species as protein homology evidence. As such i should be able to keep, only evidence supported predictions (for regions not present in my model_GFF and or better supported models for present regions) from my pred_GFF and merge them with Ensembl annotations from the model_GFF? Let me know if i'm still missing something here. Thanks in advance. best, Saad On 18 Jun 2014, at 17:21, Daniel Ence wrote: > Hi Saad, > > Maker doesn't view EST or protein evidence as a gene model in themselves. There's a good reason for this. 
Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With it's default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). > > You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output. > > One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors > > Let me know if that helps, or if you have more question > > > ~Daniel > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: > >> Thank you for the response. I still have one question though, with these options: >> >> est_GFF=cufflinksout.GFF >> >> modle_GFF= ensembl reference.GFF >> >> What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? >> Is there a simple way to combine adding (new genes) and improving of an existing annotation? >> >> Any feedback on this would be greatly appreciated. >> >> saad >> >> On 13 Jun 2014, at 17:59, Carson Holt wrote: >> >>> Use the cufflinks instead of the tophat features (tophat tends to be >>> really noisy). Give the existing models to model_gff (they will then >>> always be kept unless something better is found). 
There is no option to >>> keep models and then just add isoforms. The model_gff input will either >>> be kept as is (unchanged), or replaced with an updated model suggested by >>> the evidence (the updated model may contain multiple isoforms though), and >>> map_forward=1 can be used to pull names forward from the old model onto >>> the new models. >>> >>> Thansk, >>> Carson >>> >>> >>> On 6/13/14, 5:03 AM, "Saad Arif" wrote: >>> >>>> Dear All, >>>> >>>> I would like to use Maker pipeline to expand a current annotation (new >>>> isoforms and novel genes with respect to current annotation) and was >>>> wondering if anyone had experience with this and or suggestions to my >>>> questions. >>>> >>>> Briefly: >>>> >>>> I have tophat splice junctions from RNAseq data or alternatively >>>> cufflinks generated transcript models (fasts format) that i want to use >>>> as my new data (est_gff or est). >>>> >>>> I want to provide the current Ensembl annotation for gene prediction but >>>> i want this annotation to remain unchanged. Hence, i?m not sure if i >>>> should provide this annotation as pred_gff >>>> or model_gff. Can the model_gff be used for gene prediction or is this >>>> just a subset of pred_gff that remain unaltered? Can we provide the same >>>> annotation for both options (pred_ and mod_gff)? >>>> >>>> >>>> >>>> Importantly, my main goal is to use the new RNAseq data to add more >>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any >>>> thoughts or suggestions on how to go about this would be sincerely >>>> appreciated. 
>>>> >>>> >>>> Thanks in advance, >>>> saad >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >>> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anurag08priyam at gmail.com Wed Jun 18 12:15:52 2014 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Wed, 18 Jun 2014 23:45:52 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: It's version 2.31. -- Priyam On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt wrote: > What MAKER version are you using? > > --Carson > > > On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: > >>Hi, >> >>I compiled all annotations generated by MAKER into a single GFF file >>using the gff3_merge script distributed with MAKER. While formatting >>this GFF for use with JBrowse, I found a few errors: >> >>1. Three instances where two features were assigned the same id. >>2. One instance where a group of three subfeatures refer to a >>non-existent parent. >> >>Here is the relevant portion of the GFF file: >>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >> >>I worked around the issue temporarily for the job at hand, but I am >>left wondering why would these errors creep in. 
>> >>-- Priyam >> >>_______________________________________________ >>maker-devel mailing list >>maker-devel at box290.bluehost.com >>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From rajesh.bommareddy at tu-harburg.de Thu Jun 19 02:08:45 2014 From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy) Date: Thu, 19 Jun 2014 10:08:45 +0200 Subject: [maker-devel] Maker control files Message-ID: <53A29A8D.5010709@tu-harburg.de> Dear Sir/Madam, I have installed Maker on Linux. I have tested the installation using maker -h. It looks fine. But I cannot create the control files using /maker/bin ./maker -CTL I get the following error: Couldnot create maker_opts.ctl control files not found Can you please help me with a possible reason and solution? Thanks and Regards -- M.Sc. Rajesh Reddy Bommareddy Institute of Bioprocess and Biosystems Engineering Hamburg University of Technology(TUHH) Denikestrasse 15, 20171 Hamburg GERMANY. e.mail: rajesh.bommareddy at tu-harburg.de Phone:+4940428784011 Mobile:+4917663673522 From dence at genetics.utah.edu Fri Jun 20 15:20:47 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Fri, 20 Jun 2014 21:20:47 +0000 Subject: [maker-devel] Maker control files In-Reply-To: <53A29A8D.5010709@tu-harburg.de> References: <53A29A8D.5010709@tu-harburg.de> Message-ID: <51B8C254-A912-4CF6-B0E3-5C66E6E3E9AE@genetics.utah.edu> Hi Rajesh, Do you have write permissions in the directory where you're running maker? Also, I can't tell whether you're doing one command or two commands. If you do "maker" and there are no control files, then you'll get the "control files not found" error, but if you do ./maker -CTL and don't have permission to write to the install directory (which isn't unusual) then you'll get the "Could not create maker_opts.ctl" error.
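A minimal sketch of the second case above, assuming a hypothetical working directory (~/maker_run, which is not from the original message) and assuming the maker executable is on PATH: generate the control files from a directory you can write to, rather than from inside the install's bin directory.

```shell
# Sketch only: the directory name is hypothetical; override it with WORKDIR.
workdir="${WORKDIR:-$HOME/maker_run}"

# Work from a directory you own instead of the MAKER install directory.
mkdir -p "$workdir" && cd "$workdir"

# `maker -CTL` writes its control files into the current directory,
# so check that the current directory is writable before calling it.
if [ -w . ]; then
    ctl_status="writable"
else
    ctl_status="not writable"
fi
echo "current directory is $ctl_status"

# Only call MAKER if it can be found; otherwise just report.
if command -v maker >/dev/null 2>&1; then
    maker -CTL        # generate the control files here
    ls maker_*.ctl    # list whatever control files were created
else
    echo "maker not found on PATH; add the install's bin directory to PATH first"
fi
```

If the directory check reports "not writable", the "Could not create maker_opts.ctl" error described above is the expected outcome.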
Thanks, Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 19, 2014, at 2:08 AM, Rajesh Reddy Bommareddy wrote: Dear Sir/Madam, I have installed Maker on Linux. I have tested the installation using maker -h. It looks fine. But I cannot create the control files using /maker/bin ./maker -CTL I get the following error: Couldnot create maker_opts.ctl control files not found Can you please help me with a possible reason and solution? Thanks and Regards -- M.Sc. Rajesh Reddy Bommareddy Institute of Bioprocess and Biosystems Engineering Hamburg University of Technology(TUHH) Denikestrasse 15, 20171 Hamburg GERMANY. e.mail: rajesh.bommareddy at tu-harburg.de Phone:+4940428784011 Mobile:+4917663673522 From carsonhh at gmail.com Fri Jun 20 15:42:13 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Jun 2014 15:42:13 -0600 Subject: [maker-devel] Help with updating an annotation In-Reply-To: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de> References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de> Message-ID: "I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?" Not exactly. You need to supply an HMM for SNAP or a species file for Augustus, etc. MAKER doesn't generate gene predictions, SNAP does. You cannot get updated models unless you've provided a way for those models to be updated.
MAKER will provide SNAP/Augustus with hints to make them perform better based on the evidence, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also, if you provide models in GFF3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints - just an immutable GFF3 file). If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from GMOD 2014 tutorial) --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors --Carson From: Saad Arif Date: Wednesday, June 18, 2014 at 10:42 AM To: Daniel Ence Cc: "" Subject: Re: [maker-devel] Help with updating an annotation Thanks Daniel. I think it's more clear to me now. So If I understand correctly now: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my ensembl ref passed to model_GFF) for regions overlapping my cufflinks output via the pred_GFF option? Since i'm interested in unannotated regions, i'm also passing in reference proteomes of closely related species as protein homology evidence. As such i should be able to keep, only evidence supported predictions (for regions not present in my model_GFF and or better supported models for present regions) from my pred_GFF and merge them with Ensembl annotations from the model_GFF? Let me know if i'm still missing something here. Thanks in advance. best, Saad On 18 Jun 2014, at 17:21, Daniel Ence wrote: > Hi Saad, > > Maker doesn't view EST or protein evidence as a gene model in themselves. > There's a good reason for this.
Aligners like blast don't guarantee complete > gene models, with accurate start and stop codons and splice sites. With it's > default settings maker won't make a gene model unless there's evidence that > overlaps an ab-initio prediction (or something from the pred_gff option). > > You can use est2genome to promote everything from the est_gff option to a gene > model, but this will probably give you many spurious results. What you're > saying with est2genome is, "Everything that this tool found is a complete gene > model." I don't think that's true even for cufflinks output. > > One of the gene predictors that can run internally is snap. It's really easy > to train; here's a link to a tutorial for training it: > http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMO > D_Online_Training_2014#Training_ab_initio_Gene_Predictors > > Let me know if that helps, or if you have more question > > > ~Daniel > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: > >> Thank you for the response. I still have one question though, with these >> options: >> >> est_GFF=cufflinksout.GFF >> >> modle_GFF= ensembl reference.GFF >> >> What happens to cufflinks assembled transcripts that are not confined to >> current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to >> prepare ab initio gene predictions for each of these predicted 'new' genes? >> Is there a simple way to combine adding (new genes) and improving of an >> existing annotation? >> >> Any feedback on this would be greatly appreciated. >> >> saad >> >> On 13 Jun 2014, at 17:59, Carson Holt wrote: >> >>> Use the cufflinks instead of the tophat features (tophat tends to be >>> really noisy). Give the existing models to model_gff (they will then >>> always be kept unless something better is found). 
There is no option to >>> keep models and then just add isoforms. The model_gff input will either >>> be kept as is (unchanged), or replaced with an updated model suggested by >>> the evidence (the updated model may contain multiple isoforms though), and >>> map_forward=1 can be used to pull names forward from the old model onto >>> the new models. >>> >>> Thansk, >>> Carson >>> >>> >>> On 6/13/14, 5:03 AM, "Saad Arif" wrote: >>> >>>> Dear All, >>>> >>>> I would like to use Maker pipeline to expand a current annotation (new >>>> isoforms and novel genes with respect to current annotation) and was >>>> wondering if anyone had experience with this and or suggestions to my >>>> questions. >>>> >>>> Briefly: >>>> >>>> I have tophat splice junctions from RNAseq data or alternatively >>>> cufflinks generated transcript models (fasts format) that i want to use >>>> as my new data (est_gff or est). >>>> >>>> I want to provide the current Ensembl annotation for gene prediction but >>>> i want this annotation to remain unchanged. Hence, i?m not sure if i >>>> should provide this annotation as pred_gff >>>> or model_gff. Can the model_gff be used for gene prediction or is this >>>> just a subset of pred_gff that remain unaltered? Can we provide the same >>>> annotation for both options (pred_ and mod_gff)? >>>> >>>> >>>> >>>> Importantly, my main goal is to use the new RNAseq data to add more >>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any >>>> thoughts or suggestions on how to go about this would be sincerely >>>> appreciated. 
>>>> >>>> >>>> Thanks in advance, >>>> saad >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >>> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > From carsonhh at gmail.com Fri Jun 20 15:46:59 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Jun 2014 15:46:59 -0600 Subject: [maker-devel] 'not a SCALAR reference' error In-Reply-To: <53A2A854.3000009@ebi.ac.uk> References: <53A2A854.3000009@ebi.ac.uk> Message-ID: Make sure you are using the latest version of MAKER, 2.31.6. Also you may have to use MPICH2. MPICH3 is actually a different MPI protocol and I have not had success running MAKER with it. --Carson On 6/19/14, 3:07 AM, "Malcolm Hinsley" wrote: >Hi > >I'm running maker 2.31 with mpich 3 and have run once with est and >protein2genome, then trained augustus and snap and run the first >iteration of ab-initio predictors, which finished cleanly with no >errors/ failures. > >Having retrained augustus and snap I'm trying to run maker -a using the >same augustus species and snap.hmm pathname... previously this has >worked fine. > > >I get a lot of errors like this (it looks like every scaffold fails): > >doing repeat masking >ERROR: Not a SCALAR reference > at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Fasta.pm >line 382 thread 1.
> Fasta::_formatSeq(FastaSeq=HASH(0x4426298), 60) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Fasta.pm >line 369 thread 1 > Fasta::toFastaRef(">scaffold29 CHUNK number:0 size:100000 >offset:0", REF(0x42e48f0)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/FastaChunk.pm >line 217 thread 1 > FastaChunk::fasta_ref(FastaChunk=HASH(0x42c76a8)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/FastaChunk.pm >line 168 thread 1 > FastaChunk::write_file(FastaChunk=HASH(0x42c76a8), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) >called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/GI.pm >line 3138 thread 1 > GI::repeatmask(FastaChunk=HASH(0x42c76a8), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"..., >"scaffold29", "", >"/nfs/production/panda/ensemblgenomes/external/RepeatMasker-op"..., >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/c."..., 1, >runlog=HASH(0x430e730)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 785 thread 1 > Process::MpiChunk::__ANON__() called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Error.pm >line 415 thread 1 > eval {...} called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Error.pm >line 407 thread 1 > Error::subs::try(CODE(0x4437a90), HASH(0x4425fc8)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 4215 thread 1 > Process::MpiChunk::_go(Process::MpiChunk=HASH(0x426e0a0), >"run", HASH(0x42a5410), 0, 1) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 341 thread 1 > Process::MpiChunk::run(Process::MpiChunk=HASH(0x426e0a0), 29) >called at 
>/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker >line >1457 thread 1 >main::node_thread("/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/ma >ker/v8"...) >called at >/nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/ >x86_64-linux-thread-multi/forks.pm >line 799 thread 1 > eval {...} called at >/nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/ >x86_64-linux-thread-multi/forks.pm >line 799 thread 1 > threads::new("threads", CODE(0x4168d70), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) >called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker >line >917 thread 1 >--> rank=29, hostname=ebi5-229.ebi.ac.uk >ERROR: Failed while doing repeat masking >ERROR: Chunk failed at level:0, tier_type:1 >FAILED CONTIG:scaffold29 > >ERROR: Chunk failed at level:2, tier_type:0 >FAILED CONTIG:scaffold29 > > >I see from the mailing list that there's a known issue w/ forks..pm >(which is at the bottom of this stack) relating to perl 5.18, but I'm >running 5.14. > > >Any ideas? > > > > > >On 17/06/14 22:09, Carson Holt wrote: >> There is a change in Perl 5.18 that makes the forks.pm module >>incompatible. >> The forks.pm model maintainers have yet to update their module to >>resolve >> the issue, so it only works on perl version prior to 5.18. >> One work around it to manually edit forks.pm line 1736 yourself. 
>> >> Change it from this --> >> $write = each %WRITE; >> >> To this (make sure to include the {} brackets)--> >> { >> no warnings qw(internal); >> $write = each %WRITE; >> } >> >> --Carson >> > >-- >malcolm hinsley | EnsEMBL Genomes | +44 (0)1223 49 4669 >European Bioinformatics Institute (EMBL-EBI) >European Molecular Biology Laboratory >Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD >United Kingdom > > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Fri Jun 20 15:50:38 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Jun 2014 15:50:38 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Did you use est_forward? Also, in the example you showed, all the IDs are unique (one says hit and the other hsp in the ID, so they are different). Could you find the non-unique IDs causing the error? --Carson On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >I used est_gff= option, which refers to a GFF file generated by >cufflinks2gff3. The erroneous annotations didn't come from this GFF. > >-- Priyam > >On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt wrote: >> Are you passing in old data via GFF3? >> >> --Carson >> >> >> On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >> >>>It's version 2.31. >>> >>>-- Priyam >>> >>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>wrote: >>>> What MAKER version are you using? >>>> >>>> --Carson >>>> >>>> >>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>> >>>>>Hi, >>>>> >>>>>I compiled all annotations generated by MAKER into a single GFF file >>>>>using the gff3_merge script distributed with MAKER. While formatting >>>>>this GFF for use with JBrowse, I found a few errors: >>>>> >>>>>1. Three instances where two features were assigned the same id. >>>>>2. One instance where a group of three subfeatures refer to a >>>>>non-existent parent.
>>>>> >>>>>Here is the relevant portion of the GFF file: >>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>> >>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>left wondering why would these errors creep in. >>>>> >>>>>-- Priyam >>>>> >>>>>_______________________________________________ >>>>>maker-devel mailing list >>>>>maker-devel at box290.bluehost.com >>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.or >>>>>g >>>> >>>> >> >> From carsonhh at gmail.com Fri Jun 20 15:56:46 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Jun 2014 15:56:46 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Also note that ID= must be unique. Name= does not have to be, and won't be if the same protein or repeat element aligns to more than one location for example. Thanks, Carson On 6/20/14, 3:50 PM, "Carson Holt" wrote: >did you use est_forward? Also in the example you showed all the IDs are >unique (one says hit and the other hsp in the ID, so they are different)? >Could you find the non-uunique IDs causing the error? > >--Carson > > >On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: > >>I used est_gff= option, which refers to a GFF file generated by >>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >> >>-- Priyam >> >>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt wrote: >>> Are you passing in old data via GFF3? >>> >>> --Carson >>> >>> >>> On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >>> >>>>It's version 2.31. >>>> >>>>-- Priyam >>>> >>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>wrote: >>>>> What MAKER version are you using? >>>>> >>>>> --Carson >>>>> >>>>> >>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>> >>>>>>Hi, >>>>>> >>>>>>I compiled all annotations generated by MAKER into a single GFF file >>>>>>using the gff3_merge script distributed with MAKER. While formatting >>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>> >>>>>>1. 
Three instances where two features were assigned the same id. >>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>non-existent parent. >>>>>> >>>>>>Here is the relevant portion of the GFF file: >>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>> >>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>left wondering why would these errors creep in. >>>>>> >>>>>>-- Priyam >>>>>> >>>>>>_______________________________________________ >>>>>>maker-devel mailing list >>>>>>maker-devel at box290.bluehost.com >>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o >>>>>>r >>>>>>g >>>>> >>>>> >>> >>> > > From a.priyam at qmul.ac.uk Tue Jun 24 12:56:41 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Wed, 25 Jun 2014 00:26:41 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: I am sorry. I have updated the gist - https://gist.github.com/yeban/ffaf5cd419639dd073a7. 1. The first two chunks contain the annotations with duplicate ids. (4 rows) 2. The last chunk contains the annotations that refer to a non-existent parent. And what looks like an incomplete line of annotation (I forgot to state this in my original email). No, I didn't use est_forward. I am not passing in any old data via GFF3. -- Priyam On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: > Also note that ID= must be unique. Name= does not have to be, and won't be > if the same protein or repeat element aligns to more than one location for > example. > > Thanks, > Carson > > > On 6/20/14, 3:50 PM, "Carson Holt" wrote: > >>did you use est_forward? Also in the example you showed all the IDs are >>unique (one says hit and the other hsp in the ID, so they are different)? >>Could you find the non-uunique IDs causing the error? >> >>--Carson >> >> >>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >> >>>I used est_gff= option, which refers to a GFF file generated by >>>cufflinks2gff3. 
The erroneous annotations didn't come from this GFF. >>> >>>-- Priyam >>> >>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt wrote: >>>> Are you passing in old data via GFF3? >>>> >>>> --Carson >>>> >>>> >>>> On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >>>> >>>>>It's version 2.31. >>>>> >>>>>-- Priyam >>>>> >>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>wrote: >>>>>> What MAKER version are you using? >>>>>> >>>>>> --Carson >>>>>> >>>>>> >>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>>> >>>>>>>Hi, >>>>>>> >>>>>>>I compiled all annotations generated by MAKER into a single GFF file >>>>>>>using the gff3_merge script distributed with MAKER. While formatting >>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>> >>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>non-existent parent. >>>>>>> >>>>>>>Here is the relevant portion of the GFF file: >>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>> >>>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>>left wondering why would these errors creep in. >>>>>>> >>>>>>>-- Priyam >>>>>>> >>>>>>>_______________________________________________ >>>>>>>maker-devel mailing list >>>>>>>maker-devel at box290.bluehost.com >>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o >>>>>>>r >>>>>>>g >>>>>> >>>>>> >>>> >>>> >> >> > > From carsonhh at gmail.com Tue Jun 24 14:05:00 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 24 Jun 2014 14:05:00 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12 The value 1026 is held in a global iterator, so it cannot repeat the same value during the life of the process. And 1.3.0.12 is generated from the point in the code the ID is being generated. 
This means that two distinct processes had to write to the same file at the same point in the code, which should normally be impossible. However, there are ways to make this happen. First, if you turn file locks off (the -nolock option) and then run MAKER multiple times on the same dataset, you can get process collisions (because you disabled the locks that stop this). Second, if your NFS file system does not support hard links (FhGFS for example), then you cannot lock the files (which is the same as setting -nolock). Or you have other serious IO failures over NFS. Note that by NFS I mean your network-mounted storage. The last example you give shows the preceding line being truncated. This suggests that two processes are trying to write to the same file simultaneously (inserting lines in between other lines), or that serious IO failures are occurring where writes are not completing but true is being returned for the operations (this can happen on unreliable NFS implementations). So in summary, either your NFS storage implementation is giving IO errors, you have run MAKER with -nolock set and then started MAKER multiple times in the same directory (process collisions), or your NFS implementation doesn't support hard links and won't allow MAKER to lock files (process collisions). If it is one of the latter two, you will have to make sure you never start MAKER more than once simultaneously on the same dataset. You can still run via MPI for parallelization, but you won't be able to start a second MPI process while the first one is still running. Thanks, Carson On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: >I am sorry. I have updated the gist - >https://gist.github.com/yeban/ffaf5cd419639dd073a7. >1. The first two chunks contain the annotations with duplicate ids. (4 >rows) >2. The last chunk contains the annotations that refer to a >non-existent parent. And what looks like an incomplete line of >annotation (I forgot to state this in my original email). > >No, I didn't use est_forward.
I am not passing in any old data via GFF3. > >-- Priyam > >On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >> Also note that ID= must be unique. Name= does not have to be, and won't >>be >> if the same protein or repeat element aligns to more than one location >>for >> example. >> >> Thanks, >> Carson >> >> >> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >> >>>did you use est_forward? Also in the example you showed all the IDs are >>>unique (one says hit and the other hsp in the ID, so they are >>>different)? >>>Could you find the non-uunique IDs causing the error? >>> >>>--Carson >>> >>> >>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>> >>>>I used est_gff= option, which refers to a GFF file generated by >>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >>>> >>>>-- Priyam >>>> >>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>wrote: >>>>> Are you passing in old data via GFF3? >>>>> >>>>> --Carson >>>>> >>>>> >>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>wrote: >>>>> >>>>>>It's version 2.31. >>>>>> >>>>>>-- Priyam >>>>>> >>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>wrote: >>>>>>> What MAKER version are you using? >>>>>>> >>>>>>> --Carson >>>>>>> >>>>>>> >>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>>>> >>>>>>>>Hi, >>>>>>>> >>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>file >>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>formatting >>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>> >>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>non-existent parent. >>>>>>>> >>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>> >>>>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>>>left wondering why would these errors creep in. 
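Both error types described above (repeated ID= values and Parent= values that point to no feature) can be checked mechanically before a merged file is loaded into a viewer. What follows is a minimal Python sketch and not part of MAKER; the function name check_gff3 is made up for illustration. It assumes tab-separated, nine-column GFF3 and stops at a ##FASTA directive. It flags every repeated ID, even though the GFF3 specification lets a discontinuous feature reuse one ID across several lines, so each hit still needs a manual look. Lines with the wrong number of columns are reported by line number, which would also surface a truncated line like the one in the gist.

```python
import collections


def check_gff3(lines):
    """Return (duplicate_ids, orphan_parents, malformed_line_numbers)."""
    id_counts = collections.Counter()
    parents = set()
    malformed = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # blank line, comment, or directive
        columns = line.split("\t")
        if len(columns) != 9:
            malformed.append(number)  # e.g. a truncated write
            continue
        attributes = {}
        for field in columns[8].split(";"):
            if "=" in field:
                key, value = field.split("=", 1)
                attributes[key.strip()] = value.strip()
        if "ID" in attributes:
            id_counts[attributes["ID"]] += 1
        if "Parent" in attributes:
            # A feature may list several parents separated by commas.
            parents.update(attributes["Parent"].split(","))
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    orphans = sorted(p for p in parents if p not in id_counts)
    return duplicates, orphans, malformed
```

It could be called on an open file handle, for example check_gff3(open("merged.gff3")), where merged.gff3 is a placeholder file name.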
>>>>>>>> >>>>>>>>-- Priyam >>>>>>>> >>>>>>>>_______________________________________________ >>>>>>>>maker-devel mailing list >>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab >>>>>>>>.o >>>>>>>>r >>>>>>>>g >>>>>>> >>>>>>> >>>>> >>>>> >>> >>> >> >> From a.priyam at qmul.ac.uk Wed Jun 25 15:11:22 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Thu, 26 Jun 2014 02:41:22 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Mmm ... I didn't use -nolock option. But I did launch some 10 MAKER processes in the same directory. I feel it's unlikely that my file system doesn't allow hardlinks because a few processes quit earlier than the others, saying something to the tune of "Another MAKER process is processing this scaffold already." I remember one process in particular had _just_ crashed. I don't remember how: I might have Ctrl-C'ed by mistake instead of detaching screen? admin killed it? temporary system glitch? Could this have caused the same issue? -- Priyam On Wed, Jun 25, 2014 at 1:35 AM, Carson Holt wrote: > Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12 > > The value 1026 is held in a global iterator, so it cannot repeat the same > value during the life of the process. And 1.3.0.12 is generated from the > point in the code the ID is being generated. This means that two distinct > processses had to write to the same file at the same point in the code, > which should normally be impossible. > > However, there are ways to make this happen. First if you turn file locks > off (-nolock) option and then run MAKER multiple times on the same dataset > you can get process collisions (because you disabled the locks that stop > this). If your NFS file system does not support hard links (FhGFS for > example) then you cannot lock the files (which is the same as setting > -nolock). Or you have other serious IO failures over NFS. 
Note that NFS > is your Network Mounted Storage. > > The last example you give shows the preceding line being truncated. This > suggests that two processes are trying to write to the same file > simultaneously (inserting lines in between other lines), or serious IO > failures are occurring where writes are not completing but true is being > returned for the operations (can happen on unreliable NFS implementations). > > So in summary either your NFS storage implementation is giving IO errors, > you have run MAKER with -nolock set and then started MAKER multiple times > in the same directory (process collisions), or your NFS implementation > doesn't support hardlinks and won't allow MAKER to lock files (process > collisions). If it is one of the latter two, you will have to make sure > you never start MAKER more than once simultaneously on the same dataset. > You can still run via MPI fro parallelization, but you won't be able to > start a second MPI process while the first one is still running. > > Thanks, > Carson > > > On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: > >>I am sorry. I have updated the gist - >>https://gist.github.com/yeban/ffaf5cd419639dd073a7. >>1. The first two chunks contain the annotations with duplicate ids. (4 >>rows) >>2. The last chunk contains the annotations that refer to a >>non-existent parent. And what looks like an incomplete line of >>annotation (I forgot to state this in my original email). >> >>No, I didn't use est_forward. I am not passing in any old data via GFF3. >> >>-- Priyam >> >>On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >>> Also note that ID= must be unique. Name= does not have to be, and won't >>>be >>> if the same protein or repeat element aligns to more than one location >>>for >>> example. >>> >>> Thanks, >>> Carson >>> >>> >>> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >>> >>>>did you use est_forward? 
Also in the example you showed all the IDs are >>>>unique (one says hit and the other hsp in the ID, so they are >>>>different)? >>>>Could you find the non-uunique IDs causing the error? >>>> >>>>--Carson >>>> >>>> >>>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>>> >>>>>I used est_gff= option, which refers to a GFF file generated by >>>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >>>>> >>>>>-- Priyam >>>>> >>>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>>wrote: >>>>>> Are you passing in old data via GFF3? >>>>>> >>>>>> --Carson >>>>>> >>>>>> >>>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>>wrote: >>>>>> >>>>>>>It's version 2.31. >>>>>>> >>>>>>>-- Priyam >>>>>>> >>>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>>wrote: >>>>>>>> What MAKER version are you using? >>>>>>>> >>>>>>>> --Carson >>>>>>>> >>>>>>>> >>>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >>>>>>>> >>>>>>>>>Hi, >>>>>>>>> >>>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>>file >>>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>>formatting >>>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>>> >>>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>>non-existent parent. >>>>>>>>> >>>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>>> >>>>>>>>>I worked around the issue temporarily for the job at hand, but I am >>>>>>>>>left wondering why would these errors creep in. 
>>>>>>>>> >>>>>>>>>-- Priyam >>>>>>>>> >>>>>>>>>_______________________________________________ >>>>>>>>>maker-devel mailing list >>>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab >>>>>>>>>.o >>>>>>>>>r >>>>>>>>>g >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>> >>>> >>> >>> > > From carsonhh at gmail.com Wed Jun 25 15:26:45 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 25 Jun 2014 15:26:45 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Maybe if it died in a weird way some of the processes could have continued briefly without active locks, but I'd more likely attribute this to NFS weirdness. Because of how network storage works, some implementations take shortcuts (like returning success on an IO operation even though it has not completed and may even fail later on). Or an IO operation can be buffered and completed several seconds later (the process that called the write operation may not even be active anymore). This is extremely common on NFS. You should probably just start MAKER fewer times in the same directory on your system. You may also want to start a single MAKER job (you should use MPI to parallelize it though), and use the -a flag. This will cause that job to just rebuild the current GFF3 and FASTA files. That way you can clean up your current results without having to rerun everything. It should run relatively quickly since MAKER will be able to make use of the existing BLAST reports etc. that are already there (exonerate will run again though, but it shouldn't take too long). --Carson On 6/25/14, 3:11 PM, "Anurag Priyam" wrote: >Mmm ... I didn't use -nolock option. But I did launch some 10 MAKER >processes in the same directory.
> >I feel it's unlikely that my file system doesn't allow hardlinks >because a few processes quit earlier than the others, saying something >to the tune of "Another MAKER process is processing this scaffold >already." > >I remember one process in particular had _just_ crashed. I don't >remember how: I might have Ctrl-C'ed by mistake instead of detaching >screen? admin killed it? temporary system glitch? Could this have >caused the same issue? > >-- Priyam > > >On Wed, Jun 25, 2014 at 1:35 AM, Carson Holt wrote: >> Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12 >> >> The value 1026 is held in a global iterator, so it cannot repeat the >>same >> value during the life of the process. And 1.3.0.12 is generated from the >> point in the code the ID is being generated. This means that two >>distinct >> processses had to write to the same file at the same point in the code, >> which should normally be impossible. >> >> However, there are ways to make this happen. First if you turn file >>locks >> off (-nolock) option and then run MAKER multiple times on the same >>dataset >> you can get process collisions (because you disabled the locks that stop >> this). If your NFS file system does not support hard links (FhGFS for >> example) then you cannot lock the files (which is the same as setting >> -nolock). Or you have other serious IO failures over NFS. Note that NFS >> is your Network Mounted Storage. >> >> The last example you give shows the preceding line being truncated. >>This >> suggests that two processes are trying to write to the same file >> simultaneously (inserting lines in between other lines), or serious IO >> failures are occurring where writes are not completing but true is being >> returned for the operations (can happen on unreliable NFS >>implementations). 
>> >> So in summary either your NFS storage implementation is giving IO >>errors, >> you have run MAKER with -nolock set and then started MAKER multiple >>times >> in the same directory (process collisions), or your NFS implementation >> doesn't support hardlinks and won't allow MAKER to lock files (process >> collisions). If it is one of the latter two, you will have to make sure >> you never start MAKER more than once simultaneously on the same dataset. >> You can still run via MPI fro parallelization, but you won't be able to >> start a second MPI process while the first one is still running. >> >> Thanks, >> Carson >> >> >> On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: >> >>>I am sorry. I have updated the gist - >>>https://gist.github.com/yeban/ffaf5cd419639dd073a7. >>>1. The first two chunks contain the annotations with duplicate ids. (4 >>>rows) >>>2. The last chunk contains the annotations that refer to a >>>non-existent parent. And what looks like an incomplete line of >>>annotation (I forgot to state this in my original email). >>> >>>No, I didn't use est_forward. I am not passing in any old data via GFF3. >>> >>>-- Priyam >>> >>>On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >>>> Also note that ID= must be unique. Name= does not have to be, and >>>>won't >>>>be >>>> if the same protein or repeat element aligns to more than one location >>>>for >>>> example. >>>> >>>> Thanks, >>>> Carson >>>> >>>> >>>> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >>>> >>>>>did you use est_forward? Also in the example you showed all the IDs >>>>>are >>>>>unique (one says hit and the other hsp in the ID, so they are >>>>>different)? >>>>>Could you find the non-uunique IDs causing the error? >>>>> >>>>>--Carson >>>>> >>>>> >>>>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>>>> >>>>>>I used est_gff= option, which refers to a GFF file generated by >>>>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. 
>>>>>> >>>>>>-- Priyam >>>>>> >>>>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>>>wrote: >>>>>>> Are you passing in old data via GFF3? >>>>>>> >>>>>>> --Carson >>>>>>> >>>>>>> >>>>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>>>wrote: >>>>>>> >>>>>>>>It's version 2.31. >>>>>>>> >>>>>>>>-- Priyam >>>>>>>> >>>>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>>>wrote: >>>>>>>>> What MAKER version are you using? >>>>>>>>> >>>>>>>>> --Carson >>>>>>>>> >>>>>>>>> >>>>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" >>>>>>>>>wrote: >>>>>>>>> >>>>>>>>>>Hi, >>>>>>>>>> >>>>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>>>file >>>>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>>>formatting >>>>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>>>> >>>>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>>>non-existent parent. >>>>>>>>>> >>>>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>>>> >>>>>>>>>>I worked around the issue temporarily for the job at hand, but I >>>>>>>>>>am >>>>>>>>>>left wondering why would these errors creep in. >>>>>>>>>> >>>>>>>>>>-- Priyam >>>>>>>>>> >>>>>>>>>>_______________________________________________ >>>>>>>>>>maker-devel mailing list >>>>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-l >>>>>>>>>>ab >>>>>>>>>>.o >>>>>>>>>>r >>>>>>>>>>g >>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> >>>> >>>> >> >> From a.priyam at qmul.ac.uk Wed Jun 25 15:38:17 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Thu, 26 Jun 2014 03:08:17 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: -a option looks like just the thing I need. I will forward concerns about NFS to our IT team. 
And definitely use MPI for parallelisation next time. Thanks a lot :). -- Priyam On Thu, Jun 26, 2014 at 2:56 AM, Carson Holt wrote: > Maybe if it died in a weird way some of the processes could have continued > briefly without active locks, but I'd more likely attribute this to NFS > weirdness. Because of how network storage works, some implementations > take shortcuts (like returning success on an IO operation even though it > has not completed and may even fail later on). Or an IO operation can be > buffered and completed several seconds later (the process that called the > write operation may not even be active anymore). This is extremely common > on NFS. You should probably just start MAKER fewer times in the same > directory on your system. You may also want to start a single MAKER job > (you should use MPI to parallelize it though), and use the -a flag. This > will cause that job just to just rebuild the current GFF3 and FASTA files. > That way you can clean up your current results without having to rerun > everything. It should run relatively quickly since MAKER will be able to > make use of the existing BLAST reports etc. that are already there > (exonerate will run again though, but it shouldn't take too long). > > --Carson > > > On 6/25/14, 3:11 PM, "Anurag Priyam" wrote: > >>Mmm ... I didn't use -nolock option. But I did launch some 10 MAKER >>processes in the same directory. >> >>I feel it's unlikely that my file system doesn't allow hardlinks >>because a few processes quit earlier than the others, saying something >>to the tune of "Another MAKER process is processing this scaffold >>already." >> >>I remember one process in particular had _just_ crashed. I don't >>remember how: I might have Ctrl-C'ed by mistake instead of detaching >>screen? admin killed it? temporary system glitch? Could this have >>caused the same issue? >> >>-- Priyam >> >> >>On Wed, Jun 25, 2014 at 1:35 AM, Carson Holt wrote: >>> Thanks. 
For the first two --> scaffold00002:hit:1026:1.3.0.12 >>> >>> The value 1026 is held in a global iterator, so it cannot repeat the >>>same >>> value during the life of the process. And 1.3.0.12 is generated from the >>> point in the code the ID is being generated. This means that two >>>distinct >>> processses had to write to the same file at the same point in the code, >>> which should normally be impossible. >>> >>> However, there are ways to make this happen. First if you turn file >>>locks >>> off (-nolock) option and then run MAKER multiple times on the same >>>dataset >>> you can get process collisions (because you disabled the locks that stop >>> this). If your NFS file system does not support hard links (FhGFS for >>> example) then you cannot lock the files (which is the same as setting >>> -nolock). Or you have other serious IO failures over NFS. Note that NFS >>> is your Network Mounted Storage. >>> >>> The last example you give shows the preceding line being truncated. >>>This >>> suggests that two processes are trying to write to the same file >>> simultaneously (inserting lines in between other lines), or serious IO >>> failures are occurring where writes are not completing but true is being >>> returned for the operations (can happen on unreliable NFS >>>implementations). >>> >>> So in summary either your NFS storage implementation is giving IO >>>errors, >>> you have run MAKER with -nolock set and then started MAKER multiple >>>times >>> in the same directory (process collisions), or your NFS implementation >>> doesn't support hardlinks and won't allow MAKER to lock files (process >>> collisions). If it is one of the latter two, you will have to make sure >>> you never start MAKER more than once simultaneously on the same dataset. >>> You can still run via MPI fro parallelization, but you won't be able to >>> start a second MPI process while the first one is still running. 
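Why hard links matter for locking can be illustrated with the classic link-based lock file technique for network file systems. The Python sketch below only illustrates that general technique. It is not MAKER's locking code, and the names try_lock and release_lock are hypothetical. The idea: write a uniquely named file, try to hard-link it to the shared lock name, then check the link count of the unique file instead of trusting the return value of the link call. On a file system without hard link support the link step always fails, so no process can ever take the lock this way.

```python
import os
import socket
import tempfile


def try_lock(lock_path):
    """Try to take lock_path using a hard link; return True on success."""
    directory = os.path.dirname(os.path.abspath(lock_path))
    # A file name unique to this attempt, in the same directory as the lock.
    fd, unique = tempfile.mkstemp(prefix=".lock.", dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(f"{socket.gethostname()}:{os.getpid()}\n")
        try:
            os.link(unique, lock_path)  # fails if lock_path already exists
        except OSError:
            # Lock already held, or hard links are not supported here.
            pass
        # A link count of 2 proves the link was created, even if the
        # reply to the link call was lost or misreported.
        return os.stat(unique).st_nlink == 2
    finally:
        os.unlink(unique)  # the lock name itself survives on success


def release_lock(lock_path):
    """Release a lock previously taken with try_lock."""
    os.unlink(lock_path)
```

A second call to try_lock on the same path returns False until release_lock is called, which is the behaviour that keeps two processes from working on the same contig.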
>>> >>> Thanks, >>> Carson >>> >>> >>> On 6/24/14, 12:56 PM, "Anurag Priyam" wrote: >>> >>>>I am sorry. I have updated the gist - >>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7. >>>>1. The first two chunks contain the annotations with duplicate ids. (4 >>>>rows) >>>>2. The last chunk contains the annotations that refer to a >>>>non-existent parent. And what looks like an incomplete line of >>>>annotation (I forgot to state this in my original email). >>>> >>>>No, I didn't use est_forward. I am not passing in any old data via GFF3. >>>> >>>>-- Priyam >>>> >>>>On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote: >>>>> Also note that ID= must be unique. Name= does not have to be, and >>>>>won't >>>>>be >>>>> if the same protein or repeat element aligns to more than one location >>>>>for >>>>> example. >>>>> >>>>> Thanks, >>>>> Carson >>>>> >>>>> >>>>> On 6/20/14, 3:50 PM, "Carson Holt" wrote: >>>>> >>>>>>did you use est_forward? Also in the example you showed all the IDs >>>>>>are >>>>>>unique (one says hit and the other hsp in the ID, so they are >>>>>>different)? >>>>>>Could you find the non-uunique IDs causing the error? >>>>>> >>>>>>--Carson >>>>>> >>>>>> >>>>>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote: >>>>>> >>>>>>>I used est_gff= option, which refers to a GFF file generated by >>>>>>>cufflinks2gff3. The erroneous annotations didn't come from this GFF. >>>>>>> >>>>>>>-- Priyam >>>>>>> >>>>>>>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt >>>>>>>wrote: >>>>>>>> Are you passing in old data via GFF3? >>>>>>>> >>>>>>>> --Carson >>>>>>>> >>>>>>>> >>>>>>>> On 6/18/14, 12:15 PM, "Anurag Priyam" >>>>>>>>wrote: >>>>>>>> >>>>>>>>>It's version 2.31. >>>>>>>>> >>>>>>>>>-- Priyam >>>>>>>>> >>>>>>>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt >>>>>>>>>wrote: >>>>>>>>>> What MAKER version are you using? 
>>>>>>>>>> >>>>>>>>>> --Carson >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" >>>>>>>>>>wrote: >>>>>>>>>> >>>>>>>>>>>Hi, >>>>>>>>>>> >>>>>>>>>>>I compiled all annotations generated by MAKER into a single GFF >>>>>>>>>>>file >>>>>>>>>>>using the gff3_merge script distributed with MAKER. While >>>>>>>>>>>formatting >>>>>>>>>>>this GFF for use with JBrowse, I found a few errors: >>>>>>>>>>> >>>>>>>>>>>1. Three instances where two features were assigned the same id. >>>>>>>>>>>2. One instance where a group of three subfeatures refer to a >>>>>>>>>>>non-existent parent. >>>>>>>>>>> >>>>>>>>>>>Here is the relevant portion of the GFF file: >>>>>>>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>>>>>>>>>> >>>>>>>>>>>I worked around the issue temporarily for the job at hand, but I >>>>>>>>>>>am >>>>>>>>>>>left wondering why would these errors creep in. >>>>>>>>>>> >>>>>>>>>>>-- Priyam >>>>>>>>>>> >>>>>>>>>>>_______________________________________________ >>>>>>>>>>>maker-devel mailing list >>>>>>>>>>>maker-devel at box290.bluehost.com >>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-l >>>>>>>>>>>ab >>>>>>>>>>>.o >>>>>>>>>>>r >>>>>>>>>>>g >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>> >>> > > From rajesh.bommareddy at tu-harburg.de Mon Jun 30 04:18:12 2014 From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy) Date: Mon, 30 Jun 2014 12:18:12 +0200 Subject: [maker-devel] Maker gene prediction Message-ID: <53B13964.3060608@tu-harburg.de> Dear Sir/Madam, I have a general question regarding gene prediction and annotation in MAKER. For example, I have a new sequence of a yeast strain, and I have to predict and annotate the genome. Of course, I know ESTs from the same organism will help me to predict the genes accurately, but when I want to use EST or RNA transcripts from a closely related organism, how can I do that in MAKER, and how accurate will the prediction be?
Is the produced prediction and annotation valid? How do I check this? Thank you and Regards -- M.Sc. Rajesh Reddy Bommareddy Institute of Bioprocess and Biosystems Engineering Hamburg University of Technology (TUHH) Denikestrasse 15, 20171 Hamburg GERMANY. e.mail: rajesh.bommareddy at tu-harburg.de Phone:+4940428784011 Mobile:+4917663673522 From carsonhh at gmail.com Mon Jun 30 11:34:23 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 30 Jun 2014 11:34:23 -0600 Subject: [maker-devel] Maker gene prediction In-Reply-To: <53B13964.3060608@tu-harburg.de> References: <53B13964.3060608@tu-harburg.de> Message-ID: You can supply ESTs from a related organism to the alt_est= option. Note that this runs very slowly because the sequences have to be translated in all 6 reading frames (target and query), and it will be less sensitive (larger threshold for alignments to become statistically significant). So if you have protein evidence from a related species, use that instead of the EST evidence from a related species. With respect to accuracy, the alignment evidence that suggests the annotation is also the experimental evidence that supports an annotation's accuracy (so it is kind of a circular argument). But the alignment evidence does provide a correlative measurement. Annotations with lower AED scores better match the evidence and should be considered as higher confidence, while genes with higher AED scores represent genes that have lower confidence (this correlation is very well supported across many, many organisms). You should be aware of what is considered realistic with genome annotation. In general, for newly sequenced organisms, a genome-wide accuracy of greater than 80% is considered extremely well annotated (but can't directly be measured except retrospectively, i.e. once you have a future more complete assembly and more experimental evidence to compare to).
Only a handful of genomes that have legions of curators working over a decade (drosophila for example) have accuracies of greater than 90%. --Carson On 6/30/14, 4:18 AM, "Rajesh Reddy Bommareddy" wrote: >Dear Sir/Madam > >I have a general question regarding gene prediction and annotation in >Maker. > >For example, I have a new sequence of a yeast strain, and i have to >predict and annotate the genome. Of,course i know EST's from the same >organism will help me to predict the genes accurately, but when i want >to use EST or RNA transcripts from a closely related organism, how can i >do that in Maker and how accurate will be the prediction ?. Is the >produced prediction and annotation valid ? How do i check this ? > >Thank you and Regards >-- >M.Sc. Rajesh Reddy Bommareddy >Institute of Bioprocess and Biosystems Engineering >Hamburg University of Technology(TUHH) >Denikestrasse 15, 20171 Hamburg >GERMANY. >e.mail: rajesh.bommareddy at tu-harburg.de >Phone:+4940428784011 >Mobile:+4917663673522 > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Mon Jun 2 09:10:30 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 02 Jun 2014 09:10:30 -0600 Subject: [maker-devel] Precomputed alignments In-Reply-To: References: Message-ID: With the Target and Gap attribute you get slightly better behavior on filtering when you specify the blast_depth=X parameter in the maker_bopts.ctl file (keeps only X best hits). They will also affect the eAED score since it takes reading frame into account (so no Gap attribute means no assumption of reading frame). Otherwise they are only beneficial for seeing the alignment in a viewer as some viewers can recover the alignment when those values are specified. If you are not using blast_depth or trying to view the alignments in a viewer they don't really do anything. 
MAKER will just assume perfect match across the specified regions. --Carson From: Daniel Standage Date: Saturday, May 31, 2014 at 9:23 AM To: Maker Mailing List Subject: [maker-devel] Precomputed alignments Hello again! About a year ago I asked about using precomputed alignments with Maker. The thread quickly took a different direction as we tried to track down other issues, and I never got the thread back on its original track. So, to return to the original question, what exactly is required when providing pre-computed alignments in GFF3 format? For example, does it affect Maker's behavior whether a score is given? The "Target" attribute? The "Gap" attribute? Thanks! -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Jun 2 09:23:25 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 02 Jun 2014 09:23:25 -0600 Subject: [maker-devel] tRNAscan and map_gff_ids Message-ID: I've now patched the current download to fix this and a plus strand spliced tRNA bug. --Carson On 5/20/14, 1:17 PM, "Fields, Christopher J" wrote: >I found a problem with some tRNAscan output using MAKER 2.31.5. I had a >full MAKER data set (run initially using MAKER 2.31.5) that I mapped IDs >for. This was then run as follows, with the requisite error: > >-system-specific-4.1$ map_gff_ids id.map Zalbi.all.gff3 >Nested quantifiers in regex; marked by <-- HERE in >m/trnascan-KB913038.1-noncoding-Undet_??? <-- HERE -gene-79.0/ at >/home/groups/hpcbio/apps/maker/maker-2.31.5/bin/map_gff_ids line 111, ><$IN> line 3067590. > >The problematic lines: > >---------------------------------------------- >-system-specific-4.1$ grep "???" 
Zalbi.all.gff3 >KB913038.1 maker gene 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene >-79.0 >KB913038.1 maker tRNA 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0-tRNA-1;Parent=trnascan-KB913038.1-noncoding-Undet >_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA >-1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|70|0 >KB913038.1 maker exon 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0-tRNA-1:exon:2193;Parent=trnascan-KB913038.1-nonco >ding-Undet_???-gene-79.0-tRNA-1 >KB913039.1 maker gene 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene >-72.0 >KB913039.1 maker tRNA 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0-tRNA-1;Parent=trnascan-KB913039.1-noncoding-Undet >_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA >-1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|74|0 >KB913039.1 maker exon 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0-tRNA-1:exon:4036;Parent=trnascan-KB913039.1-nonco >ding-Undet_???-gene-72.0-tRNA-1 >---------------------------------------------- > >I managed to get it going by using the following modifications (regex >quotemeta) in map_gff_ids (lines 107-112): > > for my $id (@map_ids) { > # Only if the value (or the portion preceding > # the first colon) is equal to the map key. > next unless ($value eq $id || $value =~ /^\Q$id\E:/); > $value =~ s/\Q$id\E/$map{$id}/ unless($tag eq 'Name' && $id !~ >/\-gene\-\d+\.\d+|^CG\:|^....\:|^[^\:]+\:temp\d+\:/); > } > >I'm guessing there may be a similar problem with map_fasta_ids?
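The failure reported above is a general one: an ID containing characters such as "?" or "." is being used as a regular expression instead of as literal text. A rough Python analogue of the Perl \Q...\E fix is sketched below. It is not the map_gff_ids code; the helper names and the replacement ID ZALBI_00001 in the usage note are invented for illustration. re.escape plays the role of Perl's quotemeta, and the replacement is passed through a function so that backslashes in the new ID are not interpreted either.

```python
import re


def remap_id(value, old_id, new_id):
    """Replace old_id inside value, treating old_id as literal text."""
    # re.escape neutralises regex metacharacters such as '?' and '.'.
    return re.sub(re.escape(old_id), lambda _match: new_id, value)


def compiles_unescaped(text):
    """Report whether text is accepted as a regex when used unescaped."""
    try:
        re.compile(text)
        return True
    except re.error:
        return False
```

For example, remap_id("trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA-1", "trnascan-KB913038.1-noncoding-Undet_???-gene-79.0", "ZALBI_00001") yields "ZALBI_00001-tRNA-1", while compiles_unescaped on the same old ID reports that the raw string is rejected as a pattern, mirroring the "Nested quantifiers" error from Perl.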
> >chris >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From cjfields at illinois.edu Mon Jun 2 10:45:09 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 2 Jun 2014 16:45:09 +0000 Subject: [maker-devel] tRNAscan and map_gff_ids In-Reply-To: References: Message-ID: <007A79A7-8C68-4AFC-AC4F-451194D4CD29@illinois.edu> Thanks Carson! chris On Jun 2, 2014, at 10:23 AM, Carson Holt wrote: > I've now patched the current download to fix this and a plus strand > spliced tRNA bug. > > --Carson > > > On 5/20/14, 1:17 PM, "Fields, Christopher J" wrote: > >> I found a problem with some tRNAscan output using MAKER 2.31.5. I had a >> full MAKER data set (run initially using MAKER 2.31.5) that I mapped IDs >> for. This was then run as follows, with the requisite error: >> >> -system-specific-4.1$ map_gff_ids id.map Zalbi.all.gff3 >> Nested quantifiers in regex; marked by <-- HERE in >> m/trnascan-KB913038.1-noncoding-Undet_??? <-- HERE -gene-79.0/ at >> /home/groups/hpcbio/apps/maker/maker-2.31.5/bin/map_gff_ids line 111, >> <$IN> line 3067590. >> >> The problematic lines: >> >> ---------------------------------------------- >> -system-specific-4.1$ grep "???" Zalbi.all.gff3 >> KB913038.1 maker gene 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene >> -79.0 >> KB913038.1 maker tRNA 23847890 23847958 . - . ID=trnascan-KB913038.1-nonco >> ding-Undet_???-gene-79.0-tRNA-1;Parent=trnascan-KB913038.1-noncoding-Undet >> _???-gene-79.0;Name=trnascan-KB913038.1-noncoding-Undet_???-gene-79.0-tRNA >> -1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|70|0 >> KB913038.1 maker exon 23847890 23847958 . - . 
ID=trnascan-KB913038.1-nonco
>> ding-Undet_???-gene-79.0-tRNA-1:exon:2193;Parent=trnascan-KB913038.1-nonco
>> ding-Undet_???-gene-79.0-tRNA-1
>> KB913039.1 maker gene 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco
>> ding-Undet_???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene
>> -72.0
>> KB913039.1 maker tRNA 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco
>> ding-Undet_???-gene-72.0-tRNA-1;Parent=trnascan-KB913039.1-noncoding-Undet
>> _???-gene-72.0;Name=trnascan-KB913039.1-noncoding-Undet_???-gene-72.0-tRNA
>> -1;_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|74|0
>> KB913039.1 maker exon 21710152 21710224 . - . ID=trnascan-KB913039.1-nonco
>> ding-Undet_???-gene-72.0-tRNA-1:exon:4036;Parent=trnascan-KB913039.1-nonco
>> ding-Undet_???-gene-72.0-tRNA-1
>> ----------------------------------------------
>>
>> I managed to get it going by using the following modifications (regex
>> quotemeta) in map_gff_ids (lines 107-112):
>>
>> for my $id (@map_ids) {
>> # Only if the value (or the portion preceding
>> # the first colon) is equal to the map key.
>> next unless ($value eq $id || $value =~ /^\Q$id\E:/);
>> $value =~ s/\Q$id\E/$map{$id}/ unless($tag eq 'Name' && $id !~
>> /\-gene\-\d+\.\d+|^CG\:|^....\:|^[^\:]+\:temp\d+\:/);
>> }
>>
>> I'm guessing there may be a similar problem with map_fasta_ids?
>> >> chris >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From anthony.bretaudeau at rennes.inra.fr Tue Jun 3 02:38:31 2014 From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau) Date: Tue, 03 Jun 2014 10:38:31 +0200 Subject: [maker-devel] Merging 2 annotations Message-ID: <538D8987.4090606@rennes.inra.fr> Hello, I am working on the annotation of an insect genome, and I have 2 gff files: -an automatic annotation (done by another lab, with something else than maker, ~20000genes) -a manually curated annotation (with webapollo, ~1500 genes) From this, I would like to produce a single gff combining the 2. I'd like to keep all the manually curated models, and only the automatic ones that have no equivalent in the manually curated gff. Is it possible to do something like this with maker? I guess I could play with the model_gff option, but I'm not sure how exactly I could use it. Thank you for your help Regards Anthony From shpeng at shou.edu.cn Mon Jun 2 20:30:17 2014 From: shpeng at shou.edu.cn (=?UTF-8?B?5b2t5Y+45Y2O?=) Date: Tue, 3 Jun 2014 10:30:17 +0800 (GMT+08:00) Subject: [maker-devel] Maker can not run repeatmasker Message-ID: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> Using the example data, I can not run maker successfully. Please see the following: [root at c0105 test3]# ls -l total 72 -rw-r--r--. 1 root root 32712 May 29 02:46 dpp_contig.fasta -rw-r--r--. 1 root root 19138 May 29 02:46 dpp_est.fasta -rw-r--r--. 1 root root 3045 May 29 02:46 dpp_protein.fasta -rw-r--r--. 1 root root 1413 May 29 02:46 maker_bopts.ctl -rw-r--r--. 1 root root 1288 May 29 02:46 maker_exe.ctl -rw-r--r--. 1 root root 4630 May 29 04:27 maker_opts.ctl [root at c0105 test3]# maker STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... STATUS: Setting up database for any GFF3 input... 
A data structure will be created for you at: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore To access files for individual sequences use the datastore index: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_master_datastore_index.log STATUS: Now running MAKER... examining contents of the fasta file and run log --Next Contig-- #--------------------------------------------------------------------- Now starting the contig!! SeqID: contig-dpp-500-500 Length: 32156 #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased #--------------------------------------------------------------------- Now retrying the contig!! 
SeqID: contig-dpp-500-500 Length: 32156 Tries: 2!! #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased Maker is now finished!!! Start_time: 1401761680 End_time: 1401761688 Elapsed: 8 Thnaks. Sihua -------------- next part -------------- An HTML attachment was scrubbed... URL: From janphilipoyen at gmail.com Tue Jun 3 09:07:17 2014 From: janphilipoyen at gmail.com (=?UTF-8?Q?Jan_Philip_=C3=98yen?=) Date: Tue, 3 Jun 2014 17:07:17 +0200 Subject: [maker-devel] AED scores and thresholds: Not filtering? Message-ID: Hello, We are currently working on annotating arthropod genomes with the maker pipeline. 
We have tried to filter out genes with low support using the AED thresholds using AED_threshold=0.5. However, we still find genes without scores and most importantly genes with scores above the set threshold (even AED=1). Is there a known issue which could cause this? We can supply the control files as well as gffs, but would not like to make them public at this stage. Best Regards, Jan Philip Oeyen ZFMK / ZMB / University of Bonn -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jun 3 09:10:27 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 03 Jun 2014 09:10:27 -0600 Subject: [maker-devel] Maker can not run repeatmasker In-Reply-To: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> References: <61cfff7f.1d4.1465f901862.Coremail.shpeng@shou.edu.cn> Message-ID: The message is basically saying that RepeatMasker is not installed correctly. Follow the instructions here --> http://www.repeatmasker.org/RMDownload.html --Carson From: ??? Date: Monday, June 2, 2014 at 8:30 PM To: Subject: [maker-devel] Maker can not run repeatmasker Using the example data, I can not run maker successfully. Please see the following: [root at c0105 test3]# ls -l total 72 -rw-r--r--. 1 root root 32712 May 29 02:46 dpp_contig.fasta -rw-r--r--. 1 root root 19138 May 29 02:46 dpp_est.fasta -rw-r--r--. 1 root root 3045 May 29 02:46 dpp_protein.fasta -rw-r--r--. 1 root root 1413 May 29 02:46 maker_bopts.ctl -rw-r--r--. 1 root root 1288 May 29 02:46 maker_exe.ctl -rw-r--r--. 1 root root 4630 May 29 04:27 maker_opts.ctl [root at c0105 test3]# maker STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... STATUS: Setting up database for any GFF3 input... 
A data structure will be created for you at: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore To access files for individual sequences use the datastore index: /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_master_datas tore_index.log STATUS: Now running MAKER... examining contents of the fasta file and run log --Next Contig-- #--------------------------------------------------------------------- Now starting the contig!! SeqID: contig-dpp-500-500 Length: 32156 #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.al l.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVo id.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased #--------------------------------------------------------------------- Now retrying the contig!! 
SeqID: contig-dpp-500-500 Length: 32156 Tries: 2!! #--------------------------------------------------------------------- setting up GFF3 output and fasta chunks doing repeat masking running repeat masker. #--------- command -------------# Widget::RepeatMasker: cd /tmp/maker_g7CIeW; /usr/local/RepeatMasker/RepeatMasker /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0/contig-dpp-500-500.0.al l.rb -species all -dir /home/fastq/annotation/test3/dpp_contig.maker.output/dpp_contig_datastore/05 /1F/contig-dpp-500-500//theVoid.contig-dpp-500-500/0 -pa 1 #-------------------------------# which: no cross_match in (/usr/local/bin) CrossmatchSearchEngine::setPathToEngine( /usr/local/bin/cross_match ): Program does not exist! at /usr/local/RepeatMasker/RepeatMasker line 519. ERROR: RepeatMasker failed --> rank=NA, hostname=c0105 ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 ERROR: Chunk failed at level:2, tier_type:0 FAILED CONTIG:contig-dpp-500-500 examining contents of the fasta file and run log --Next Contig-- Processing run.log file... MAKER WARNING: The file dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500//theVo id.contig-dpp-500-500/0/contig-dpp-500-500.0.all.rb.out did not finish on the last run and must be erased Maker is now finished!!! Start_time: 1401761680 End_time: 1401761688 Elapsed: 8 Thnaks. Sihua _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jun 3 09:51:44 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 03 Jun 2014 09:51:44 -0600 Subject: [maker-devel] AED scores and thresholds: Not filtering? In-Reply-To: References: Message-ID: No. 
It should use whichever is lower, the AED or the eAED score. The only exception is model_gff results; those are always kept. Also note that the filter applies to the entire gene, not just to individual splice forms, if you have alternative splicing. If you want, I can take a look to see whether there is anything non-obvious. You would have to send me the final GFF3 and the maker_opts.ctl file.

--Carson

From: Jan Philip Øyen
Date: Tuesday, June 3, 2014 at 9:07 AM
To:
Subject: [maker-devel] AED scores and thresholds: Not filtering?

Hello,

We are currently working on annotating arthropod genomes with the maker pipeline. We have tried to filter out genes with low support using the AED thresholds using AED_threshold=0.5. However, we still find genes without scores and most importantly genes with scores above the set threshold (even AED=1). Is there a known issue which could cause this? We can supply the control files as well as gffs, but would not like to make them public at this stage.

Best Regards,
Jan Philip Oeyen
ZFMK / ZMB / University of Bonn

From carsonhh at gmail.com Tue Jun 3 10:15:46 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 03 Jun 2014 10:15:46 -0600
Subject: [maker-devel] Merging 2 annotations
In-Reply-To: <538D8987.4090606@rennes.inra.fr>
References: <538D8987.4090606@rennes.inra.fr>
Message-ID:

You can give the manually curated ones to model_gff and the other ones to pred_gff. Then set keep_preds=1. The model_gff results are always kept, even without evidence support; the pred_gff results will also be kept without evidence support because you set keep_preds=1, but model_gff results will take precedence.
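As a rough sketch of the settings described above, the relevant lines in maker_opts.ctl might look something like the following. The two GFF3 file names are placeholders invented for illustration (they do not come from the thread), and every other option in the file is assumed to stay as it was:

```
model_gff=curated_webapollo.gff3 #placeholder file name: manually curated models, always kept
pred_gff=automatic_annotation.gff3 #placeholder file name: automatic models, passed in as predictions
keep_preds=1 #keep predictions even when they have no evidence support
```

With this arrangement the automatic models should survive only where they do not compete with a curated model, since model_gff results take precedence.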
--Carson

On 6/3/14, 2:38 AM, "Anthony Bretaudeau" wrote:

>Hello,
>
>I am working on the annotation of an insect genome, and I have 2 gff
>files:
>-an automatic annotation (done by another lab, with something else than
>maker, ~20000 genes)
>-a manually curated annotation (with webapollo, ~1500 genes)
>
> From this, I would like to produce a single gff combining the 2. I'd
>like to keep all the manually curated models, and only the automatic
>ones that have no equivalent in the manually curated gff.
>
>Is it possible to do something like this with maker? I guess I could
>play with the model_gff option, but I'm not sure how exactly I could use
>it.
>
>Thank you for your help
>Regards
>
>Anthony

From carsonhh at gmail.com Tue Jun 3 20:20:20 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 03 Jun 2014 20:20:20 -0600
Subject: [maker-devel] Short Introns
In-Reply-To:
References:
Message-ID:

I think you may be best off using WebApollo to manually annotate the few hundred short-intron genes. It's not that fun to do, but you should be able to get through them all in a couple of days by yourself, or in under a day if you had a helper.

--Carson

On 5/15/14, 11:15 AM, "Mack, Brian" wrote:

>Hi, I examined the genes that had introns less than 10 bp that were being
>flagged by tbl2asn and I noticed that all 438 of them were genes called
>by SNAP. Also they were found in the CDS and not the UTR. It seems
>strange that all of the genes that have these short introns are from SNAP
>when only about one third of the final gene models are from SNAP. I've
>examined the evidence for a handful of these genes and the short introns
>do not seem supported by the evidence. Has anybody else had short intron
>issues with SNAP?
> >Brian > >-----Original Message----- >From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf >Of Carson Holt >Sent: Friday, April 18, 2014 10:36 AM >To: UMD Bioinformatics; maker-devel at yandell-lab.org >Subject: Re: [maker-devel] Short Introns > >Look at the name of those genes. The original name will let you know >where it came from because it will contain, augustus, genemark, snap, etc. > You will also want to open up the contig containing those geens in a >viewer like apollo >(http://weatherby.genetics.utah.edu/apollo/apollo.tar.gz). See if the >short intron is part of the CDS or UTR. If it's UTR then, it has >evidence support from an EST, which either means there are problems with >the EST/cDNA evidence or it's real. For those, even if they are real you >can just trim them off. If it's part of the CDS, then investigate >whether it is suggested by EST or protein evidence, or if the ab initio >predictor called it (sometime the ab initio predictor calls things to >force an ORF to work). This can sometimes be indicative of assembly >issues in that region. > >--Carson > > >On 4/18/14, 7:14 AM, "UMD Bioinformatics" >wrote: > >>Hello, >> >>We are preparing two submission for NCBI, nightmare. However some of >>our MAKER gene models have short introns that are being flagged by >>NCBI. In one species we have >400 introns smaller then 20bp which is >>almost biologically impossible. I know we can set max intron length in >>the opts.ctl file but can we set a minimum intron length? >> >>I saw yesterdays posts that mention this is a result of the external ab >>initio predictors but I didn?t see an indication as to which predictor >>and how to change that setting. 
>> >>from yesterday: >>*These are just short introns (intron size is under control of the ab >>initio >>predictors) --> 438 ERROR: SEQ_FEAT.ShortIntron >> >>Cheers >>Ian >> >> >>_______________________________________________ >>maker-devel mailing list >>maker-devel at box290.bluehost.com >>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > > >This electronic message contains information generated by the USDA solely >for the intended recipients. Any unauthorized interception of this >message or the use or disclosure of the information it contains may >violate the law and subject the violator to civil or criminal penalties. >If you believe you have received this message in error, please notify the >sender and delete the email immediately. From sujaikumar at gmail.com Wed Jun 4 06:26:09 2014 From: sujaikumar at gmail.com (Sujai) Date: Wed, 4 Jun 2014 13:26:09 +0100 Subject: [maker-devel] Augustus compilation Message-ID: Hi all I've installed older versions of Maker (up to 2.28) before successfully. I was trying to install maker 2.31.6 on a new cluster and decided to use the built in installers for the dependencies. Unfortunately ./Build augustuc gives this error: Unpacking augustus tarball... Configuring augustus... 
g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o genbank.o genbank.cc -I../include
g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o properties.o properties.cc -I../include
properties.cc: In static member function 'static void Properties::init(int, char**)':
properties.cc:349:25: error: 'boost::filesystem::path' has no member named 'native'
configPath = cpath.native();
^
properties.cc: In function 'boost::filesystem::path findLocationOfSelfBinary()':
properties.cc:615:10: error: 'read_symlink' is not a member of 'boost::filesystem'
bpath = boost::filesystem::read_symlink(bpath);
^
make: *** [properties.o] Error 1

ERROR: Failed installing augustus, now cleaning installation path...
You may need to install augustus manually.

----

Would anyone have any suggestions for how to fix this? I've tried editing the ../exe/augustus-3.0.2/src/Makefile line:

LIBS = -lboost_iostreams -lboost_system -lboost_filesystem

to add the path to my system boost lib:

LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib -lboost_iostreams -lboost_system -lboost_filesystem

and then running make from inside ../exe/augustus-3.0.2/src, but I get the same error again.

From mike.thon at gmail.com Wed Jun 4 07:31:30 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Wed, 4 Jun 2014 15:31:30 +0200
Subject: [maker-devel] Augustus compilation
In-Reply-To:
References:
Message-ID: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com>

Hi - Yes, the latest version of augustus needs the boost library. If you're on linux you should be able to install it through the package manager. It's called libboost-dev or some such thing.

-Mike

On Jun 4, 2014, at 2:26 PM, Sujai wrote:

> Hi all
>
> I've installed older versions of Maker (up to 2.28) before successfully.
>
> I was trying to install maker 2.31.6 on a new cluster and decided to
> use the built in installers for the dependencies.
>
> Unfortunately
>
> ./Build augustuc
>
> gives this error:
>
> Unpacking augustus tarball...
> Configuring augustus... > g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o genbank.o > genbank.cc -I../include > g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o properties.o > properties.cc -I../include > properties.cc: In static member function 'static void > Properties::init(int, char**)': > properties.cc:349:25: error: 'boost::filesystem::path' has no member > named 'native' > configPath = cpath.native(); > ^ > properties.cc: In function 'boost::filesystem::path findLocationOfSelfBinary()': > properties.cc:615:10: error: 'read_symlink' is not a member of > 'boost::filesystem' > bpath = boost::filesystem::read_symlink(bpath); > ^ > make: *** [properties.o] Error 1 > > ERROR: Failed installing augustus, now cleaning installation path... > You may need to install augustus manually. > > ---- > > Would anyone have any suggestions for how to fix this? I've tried > editing the ../exe/augustus-3.0.2/src/Makefile line: > > LIBS = -lboost_iostreams -lboost_system -lboost_filesystem > > to add the path to my system boost lib: > > LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib > -lboost_iostreams -lboost_system -lboost_filesystem > > and then running make from inside ../exe/augustus-3.0.2/src but I get > the same error again > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From sujaikumar at gmail.com Wed Jun 4 07:34:50 2014 From: sujaikumar at gmail.com (Sujai) Date: Wed, 4 Jun 2014 14:34:50 +0100 Subject: [maker-devel] Augustus compilation In-Reply-To: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com> References: <51F9E919-5679-49A9-A34C-06DB21060669@gmail.com> Message-ID: Hi Mike Thanks for the super prompt response. I am on a cluster where I can't install libboost-dev. 
However, boost is on the cluster (as I wrote, it is compiled in the /system/software/linux-x86_64/lib/boost/1_55_0/lib directory) so is my modification to the Makefile below correct, or is there something else I need to do? LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib -lboost_iostreams -lboost_system -lboost_filesystem Cheers, - Sujai On 4 June 2014 14:31, Michael Thon wrote: > Hi - Yes it the latest version of augustus needs the boost library. If you're on linux you should be able to install it through the package manager. its called libboost-dev or some such thing. > > -Mike > > On Jun 4, 2014, at 2:26 PM, Sujai wrote: > >> Hi all >> >> I've installed older versions of Maker (up to 2.28) before successfully. >> >> I was trying to install maker 2.31.6 on a new cluster and decided to >> use the built in installers for the dependencies. >> >> Unfortunately >> >> ./Build augustuc >> >> gives this error: >> >> Unpacking augustus tarball... >> Configuring augustus... >> g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o genbank.o >> genbank.cc -I../include >> g++ -c -Wall -Wno-sign-compare -ansi -pedantic -O2 -o properties.o >> properties.cc -I../include >> properties.cc: In static member function 'static void >> Properties::init(int, char**)': >> properties.cc:349:25: error: 'boost::filesystem::path' has no member >> named 'native' >> configPath = cpath.native(); >> ^ >> properties.cc: In function 'boost::filesystem::path findLocationOfSelfBinary()': >> properties.cc:615:10: error: 'read_symlink' is not a member of >> 'boost::filesystem' >> bpath = boost::filesystem::read_symlink(bpath); >> ^ >> make: *** [properties.o] Error 1 >> >> ERROR: Failed installing augustus, now cleaning installation path... >> You may need to install augustus manually. >> >> ---- >> >> Would anyone have any suggestions for how to fix this? 
I've tried >> editing the ../exe/augustus-3.0.2/src/Makefile line: >> >> LIBS = -lboost_iostreams -lboost_system -lboost_filesystem >> >> to add the path to my system boost lib: >> >> LIBS = -L/system/software/linux-x86_64/lib/boost/1_55_0/lib >> -lboost_iostreams -lboost_system -lboost_filesystem >> >> and then running make from inside ../exe/augustus-3.0.2/src but I get >> the same error again >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > From daniel.standage at gmail.com Wed Jun 4 13:03:27 2014 From: daniel.standage at gmail.com (Daniel Standage) Date: Wed, 4 Jun 2014 15:03:27 -0400 Subject: [maker-devel] Filtering of ab initio gene models Message-ID: Thanks everyone for your responses recently! The reason for my recent flurry of email activity is that I'm seeing some unexpected trends when running the new version of Maker with precomputed alignments. Compared with an annotation I did a while ago (Maker 2.10, Maker-computed alignments), this new annotation has a substantial number of new genes annotated. If I compare distributions of AED scores between the old and new annotation, it's clear that the new annotation has a lot more low-quality models. If I look at new gene models that do not overlap with any gene model from the old annotation, the likelihood that it's a low-quality model is much higher. I decided to run a little experiment. I annotated a scaffold first using Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same pre-computed transcript and protein alignments and the same (latest) version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 locus with only models from 2.10, and 28 loci with only models from 2.31.3. 
Before this experiment, I assumed the issue was related to providing pre-computed alignments in GFF3 format and perhaps violating some important assumption. However, this experiment makes me wonder whether there have been changes to how Maker filters *ab initio* gene models between version 2.10 and version 2.31.3? Do you have any ideas? If it would help, I could put together a small data set that reproduces the behavior I just described. Thanks! -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Wed Jun 4 13:09:29 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 04 Jun 2014 13:09:29 -0600 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: Sure. that would be helpful. One question. Do you provide the Gap attribute in your precomputed alignments? Having or not having that attribute affects the eAED score which takes reading frame into account, and may cause some things to be kept that normally would be dropped, because MAKER won't be able to take the points of mismatch of the alignment into account (it just assumes match everywhere). --Carson From: Daniel Standage Date: Wednesday, June 4, 2014 at 1:03 PM To: Maker Mailing List Subject: [maker-devel] Filtering of ab initio gene models Thanks everyone for your responses recently! The reason for my recent flurry of email activity is that I'm seeing some unexpected trends when running the new version of Maker with precomputed alignments. Compared with an annotation I did a while ago (Maker 2.10, Maker-computed alignments), this new annotation has a substantial number of new genes annotated. If I compare distributions of AED scores between the old and new annotation, it's clear that the new annotation has a lot more low-quality models. 
If I look at new gene models that do not overlap with any gene model from the old annotation, the likelihood that it's a low-quality model is much higher. I decided to run a little experiment. I annotated a scaffold first using Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same pre-computed transcript and protein alignments and the same (latest) version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 locus with only models from 2.10, and 28 loci with only models from 2.31.3. Before this experiment, I assumed the issue was related to providing pre-computed alignments in GFF3 format and perhaps violating some important assumption. However, this experiment makes me wonder whether there have been changes to how Maker filters ab initio gene models between version 2.10 and version 2.31.3? Do you have any ideas? If it would help, I could put together a small data set that reproduces the behavior I just described. Thanks! -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.standage at gmail.com Wed Jun 4 13:11:44 2014 From: daniel.standage at gmail.com (Daniel Standage) Date: Wed, 4 Jun 2014 15:11:44 -0400 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: I do not provide Gap or Target attributes in the GFF3. Will this affect the AED as well, or just the eAED? -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: > Sure. that would be helpful. 
One question. Do you provide the Gap > attribute in your precomputed alignments? Having or not having that > attribute affects the eAED score which takes reading frame into account, > and may cause some things to be kept that normally would be dropped, > because MAKER won't be able to take the points of mismatch of the alignment > into account (it just assumes match everywhere). > > --Carson > > > From: Daniel Standage > Date: Wednesday, June 4, 2014 at 1:03 PM > To: Maker Mailing List > Subject: [maker-devel] Filtering of ab initio gene models > > Thanks everyone for your responses recently! > > The reason for my recent flurry of email activity is that I'm seeing some > unexpected trends when running the new version of Maker with precomputed > alignments. Compared with an annotation I did a while ago (Maker 2.10, > Maker-computed alignments), this new annotation has a substantial number of > new genes annotated. If I compare distributions of AED scores between the > old and new annotation, it's clear that the new annotation has a lot more > low-quality models. If I look at new gene models that do not overlap with > any gene model from the old annotation, the likelihood that it's a > low-quality model is much higher. > > I decided to run a little experiment. I annotated a scaffold first using > Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same > pre-computed transcript and protein alignments and the same (latest) > version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted > 44 genes while Maker 2.31.3 predicted 63. If we group gene models into loci > by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3, 1 > locus with only models from 2.10, and 28 loci with only models from 2.31.3. > > Before this experiment, I assumed the issue was related to providing > pre-computed alignments in GFF3 format and perhaps violating some important > assumption. 
However, this experiment makes me wonder whether there have > been changes to how Maker filters *ab initio* gene models between version > 2.10 and version 2.31.3? Do you have any ideas? If it would help, I could > put together a small data set that reproduces the behavior I just described. > > Thanks! > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University

From carsonhh at gmail.com Wed Jun 4 13:17:34 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 04 Jun 2014 13:17:34 -0600 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: Just eAED, but eAED can affect the selection of ab initio results. For example, it considers the reading-frame match of protein evidence, which also affects whether evidence from single_exon=1 and genes with single-exon protein evidence get kept. There is also the assumption that your alignments in GFF3 are correctly spliced (like the ones BLAT produces). So giving blastn results as precomputed est_gff would create a lot of noise, since MAKER ignores blastn and uses it only to seed the polished exonerate alignments. --Carson From: Daniel Standage Date: Wednesday, June 4, 2014 at 1:11 PM To: Carson Holt Cc: Maker Mailing List Subject: Re: [maker-devel] Filtering of ab initio gene models I do not provide Gap or Target attributes in the GFF3. Will this affect the AED as well, or just the eAED? -- Daniel S. Standage Ph.D. Candidate Computational Genome Science Laboratory Indiana University On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: > Sure, that would be helpful. One question. Do you provide the Gap attribute > in your precomputed alignments?
Having or not having that attribute affects > the eAED score, which takes reading frame into account, and may cause some > things to be kept that normally would be dropped, because MAKER won't be able > to take the points of mismatch of the alignment into account (it just assumes > match everywhere). > > --Carson > > > From: Daniel Standage > Date: Wednesday, June 4, 2014 at 1:03 PM > To: Maker Mailing List > Subject: [maker-devel] Filtering of ab initio gene models > > Thanks everyone for your responses recently! > > The reason for my recent flurry of email activity is that I'm seeing some > unexpected trends when running the new version of Maker with precomputed > alignments. Compared with an annotation I did a while ago (Maker 2.10, > Maker-computed alignments), this new annotation has a substantial number of > new genes annotated. If I compare distributions of AED scores between the old > and new annotation, it's clear that the new annotation has a lot more > low-quality models. If I look at new gene models that do not overlap with any > gene model from the old annotation, the likelihood that it's a low-quality > model is much higher. > > I decided to run a little experiment. I annotated a scaffold first using Maker > 2.10 and then using Maker 2.31.3. In both cases, I used the same pre-computed > transcript and protein alignments and the same (latest) version of SNAP as the > only ab initio predictor. Maker 2.10 predicted 44 genes while Maker 2.31.3 > predicted 63. If we group gene models into loci by overlap, there are 33 loci > with gene models from both 2.10 and 2.31.3, 1 locus with only models from > 2.10, and 28 loci with only models from 2.31.3. > > Before this experiment, I assumed the issue was related to providing > pre-computed alignments in GFF3 format and perhaps violating some important > assumption.
However, this experiment makes me wonder whether there have been > changes to how Maker filters ab initio gene models between version 2.10 and > version 2.31.3? Do you have any ideas? If it would help, I could put together > a small data set that reproduces the behavior I just described. > > Thanks! > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University

From ranjani at uga.edu Thu Jun 5 09:49:36 2014 From: ranjani at uga.edu (Sivaranjani Namasivayam) Date: Thu, 5 Jun 2014 15:49:36 +0000 Subject: [maker-devel] protein2genome gene models from protein gff Message-ID: <1401983375868.65464@uga.edu> Hi, I am trying to predict gene models from protein evidence, using the parameter protein2genome set to 1. I get gene models predicted if I provide the proteins as a fasta file, but not as gff (I want to use a gff format to avoid the blastx step again). Is this expected? In case of transcriptome evidence and est2genome set to 1, I get gene models predicted with both fasta and gff formats. Thanks, Ranjani

From Brian.Mack at ARS.USDA.GOV Thu Jun 5 11:56:04 2014 From: Brian.Mack at ARS.USDA.GOV (Mack, Brian) Date: Thu, 5 Jun 2014 17:56:04 +0000 Subject: [maker-devel] missing start and stop codons Message-ID: I've been looking at the start and stop codons of the maker predicted transcripts after I stripped off the utr with the fasta_tool script and 7% of the transcripts do not have ATG as a start codon. Also 2% do not have TAA, TAG, or TGA as a stop codon. Could these be real start and stop codons and just rare variants, or should I consider these incomplete genes?
If I was to turn on the "always_complete" option in Maker what would that actually do? Thanks, Brian This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.

From carsonhh at gmail.com Thu Jun 5 12:01:24 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 05 Jun 2014 12:01:24 -0600 Subject: [maker-devel] missing start and stop codons Message-ID: They are incomplete genes; there are many reasons why this happens in new assemblies. You can turn always_complete on to try to force complete models, but what is added or subtracted to get a start and stop codon may not be biologically correct. It's just forced canonical. Also make sure to use the latest MAKER version. Versions 2.29 and earlier didn't correct for the BioPerl codon table, which allows for an extra non-canonical start codon. Now MAKER exports a strict canonical table to BioPerl so 'M' is the only start. --Carson From: "Mack, Brian" Date: Thursday, June 5, 2014 at 11:56 AM To: "maker-devel at yandell-lab.org" Subject: [maker-devel] missing start and stop codons I've been looking at the start and stop codons of the maker predicted transcripts after I stripped off the utr with the fasta_tool script and 7% of the transcripts do not have ATG as a start codon. Also 2% do not have TAA, TAG, or TGA as a stop codon. Could these be real start and stop codons and just rare variants, or should I consider these incomplete genes? If I was to turn on the "always_complete" option in Maker what would that actually do?
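For anyone who wants to repeat the start/stop codon check described in this thread, here is a minimal sketch in Python. It is only an illustration, not part of MAKER: the function names (parse_fasta, classify_cds, summarize) are made up for this example, it assumes the UTR-stripped transcripts are available as plain FASTA text, and it only tests for ATG and the three standard stop codons (no alternative genetic codes, no check that the length is a multiple of three).

```python
# Hypothetical helpers (not part of MAKER): count CDS sequences that start
# with ATG and end with a standard stop codon (TAA, TAG, TGA).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Use the first word of the header as the record name.
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def classify_cds(seq):
    """Return (has_start, has_stop) for one CDS sequence."""
    seq = seq.upper()
    return seq.startswith("ATG"), seq[-3:] in STOP_CODONS


def summarize(records):
    """Tally records by which of start/stop codon they have."""
    counts = {"complete": 0, "no_start": 0, "no_stop": 0, "neither": 0}
    for _name, seq in records:
        has_start, has_stop = classify_cds(seq)
        if has_start and has_stop:
            key = "complete"
        elif has_stop:
            key = "no_start"
        elif has_start:
            key = "no_stop"
        else:
            key = "neither"
        counts[key] += 1
    return counts
```

To use it on a file, pass an open file handle to parse_fasta and the result to summarize, for example summarize(parse_fasta(open("cds.fasta"))), where cds.fasta is a placeholder file name. Models counted under anything other than "complete" are the ones to treat as incomplete.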
Thanks, Brian

From carsonhh at gmail.com Thu Jun 5 12:08:20 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 05 Jun 2014 12:08:20 -0600 Subject: [maker-devel] protein2genome gene models from protein gff Message-ID: est_gff assumes the alignments are spliced correctly. The protein2genome option also makes that assumption but with a little less confidence that the user always provides splice-aware alignments, so in some instances (like protein2genome=1) it may not pass them forward as guaranteed splice-aware alignments. --Carson From: Sivaranjani Namasivayam Date: Thursday, June 5, 2014 at 9:49 AM To: "maker-devel at yandell-lab.org" Subject: [maker-devel] protein2genome gene models from protein gff Hi, I am trying to predict gene models from protein evidence, using the parameter protein2genome set to 1. I get gene models predicted if I provide the proteins as a fasta file, but not as gff (I want to use a gff format to avoid the blastx step again). Is this expected? In case of transcriptome evidence and est2genome set to 1, I get gene models predicted with both fasta and gff formats.
Thanks, Ranjani

From carsonhh at gmail.com Thu Jun 5 12:24:03 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 05 Jun 2014 12:24:03 -0600 Subject: [maker-devel] Some questions regarding ab-initio training In-Reply-To: <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se> References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se> Message-ID: Like I said, the predictors do the best they can, so there is probably something about those regions that requires exons to be dropped or added to make the CDS, reading frame, or start/stop work. In several ant genomes we saw something like this caused by incorrect homopolymers in the assembly which force the predictor to slightly alter the intron/exon structure because otherwise the reading frame made no sense (the EST alignments were used to confirm that the assembly homopolymers were incorrect - lots of bad single base pair deletions). The way hints work is as follows. At the simplest level ab initio predictors are calculating the probability of being in different states (intergenic, intron, exon, etc.). The hints increase the probability of being in the intron state where MAKER gives an intron hint or being in an exon/CDS state when MAKER gives an exon/CDS hint. So this bends the likelihood of the ab initio gene predictor to call something similar in structure to the evidence overlapping it. That being said, if there is strong enough signal from something else in the sequence or my hints won't work with the splice sites in the region or the reading frame breaks, then no amount of hints can force Augustus to make that model. --Carson On 6/5/14, 2:15 AM, "Marc Höppner" wrote: >Hi, > >thanks for the feedback.
I spent some more time on this and am still >somewhat unsatisfied with the whole thing... > >A few comments: > >I quite frequently see Augustus and by extension Maker including exons >that are not supported by EST/Protein evidence and are not critical for >the gene model (i.e. I can take them out and still get a proper CDS). >Maybe I don't know enough about how Maker creates hints and more >importantly what role these hints play for Augustus, but I cannot really >see a great effect (any, really) on the gene models even if both ESTs and >proteins contradict an Augustus gene model and the surplus exon is >non-essential. > >(all evidence is provided as fasta files, protein2genome and est2genome >are set to 0) > >As for the repeat library, I suppose this is a critical point. I am using >repeats from a closely related species via RepeatMasker, modelled and >filtered repeats from RepeatModeler and repeats derived from a >high-coverage 454 data set. Not sure what else I can do to improve that. > >As for evidence, I am using the curated reference proteome from a related >species (<5 million years), uniprot_swissprot and high-quality ESTs + 454 >reads. I don't think it gets a whole lot better, in terms of what data >can be used. > >So in summary, I just don't get where I want to using Augustus and Maker >- specifically, the gene models are full of weird, unsupported artefacts >despite manually curating > 850 models for training. I suppose I was >hoping for some secret trick to improve on this - but I guess there is >none? Actually, if I only do a pure evidence build (seeing that my input >data is very high quality), it looks better - which sort of goes against >the premise of Maker :/ > >Regards, > >Marc > > > > >Marc P.
Hoeppner, PhD >Team Leader >Department for Medical Biochemistry and Microbiology >Uppsala University, Sweden >marc.hoeppner at bils.se > >On 27 May 2014, at 17:25, Carson Holt wrote: > >> Extra exons can be required for predictors to make sense of a region (they >> do the best they can). This can be due to imperfect assemblies or >> repeats. For plants the repeat database is the one thing that will >> most affect the annotation quality. You may need to spend some time >> building the best repeat library you can. The repeat library is the next >> most important thing after training the predictor, because repeats confuse >> the predictor (sometimes a lot), causing it to behave oddly in those >> regions (because repeats do encode real protein and protein fragments). >> Also, when running MAKER now, make sure to include the entire proteome >> of a related species and not just UniProt, and you will get better >> performance. Now that you have Augustus trained, using it inside of MAKER >> with an improved repeat library and additional protein evidence should >> give it the feedback that will allow it to perform better than it would >> with just naked ab initio prediction. >> >> Thanks, >> Carson >> >> >> On 5/27/14, 2:12 AM, "Marc Höppner" wrote: >> >>> Hi, >>> >>> I wanted to get some feedback regarding the training of ab-initio gene >>> finders - it's not strictly Maker related, but I suppose there are many >>> people on this list that have encountered and solved this issue in one >>> way or another. >>> >>> Specifically, I am trying to train Augustus (and possibly SNAP) for a >>> plant genome. This has always been a very frustrating process for me, but >>> while I have a better idea now how to do it, I still don't get the sort >>> of accuracy that I am hoping for.
A quick run-through of my process: >>> >>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>> Sanger-sequenced EST data >>> >>> Filtered for Models with an AED <= 0.3 >>> >>> Loaded that into WebApollo, together with an existing reference >>> annotation and the evidence tracks >>> >>> Manually curated/selected 750 gene models using the following rules: >>> - Must have start/stop codon >>> - Must have as many exons as possible >>> - Must agree with evidence >>> - Must be >= 2 kb apart from other gene models (provided as flanking >>> regions for Augustus to train intergenic sequence) >>> >>> From these models, I created a GBK file, split it into 650 (train) and >>> 100 (test) models and created a new profile using the documented >>> procedure. >>> >>> But: >>> >>> While the naked ab-init models created through maker get a lot of genes >>> "sort of right", I still see too many issues to be really satisfied. >>> Problems include: >>> >>> - random exon calls which are not supported by any line of evidence (~1 >>> per gene model, I would guess) >>> - poor congruency with some gene models (especially ones not used for >>> training/testing) >>> >>> Is there any best-practice guide on how to improve this? The Augustus >>> website is unfortunately quite poor on detail... My impression so far is >>> that ramping up the number of training models isn't really doing too much >>> beyond a certain point (tried 400, 500 and 750). >>> >>> Regards, >>> >>> Marc >>> >>> >>> Marc P.
Hoeppner, PhD >>> Team Leader >>> BILS Genome Annotation Platform >>> Department for Medical Biochemistry and Microbiology >>> Uppsala University, Sweden >>> marc.hoeppner at bils.se

From carsonhh at gmail.com Thu Jun 5 12:28:55 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 05 Jun 2014 12:28:55 -0600 Subject: [maker-devel] Some questions regarding ab-initio training In-Reply-To: References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se> Message-ID: One thing you might want to try is adding another predictor like SNAP together with Augustus and then processing the MAKER results using EVM. We actually have a collaboration with the EVM group to produce a MAKER-EVM pipeline (MAKER 3.0). EVM will produce consensus models using the predictions and the evidence in the MAKER GFF3, which are generally better than just SNAP and Augustus with hints, so it might be able to remove some of the artifacts you are worried about. --Carson On 6/5/14, 12:24 PM, "Carson Holt" wrote: >Like I said, the predictors do the best they can, so there is probably >something about those regions that requires exons to be dropped or added >to make the CDS, reading frame, or start/stop work. In several ant genomes >we saw something like this caused by incorrect homopolymers in the >assembly which force the predictor to slightly alter the intron/exon >structure because otherwise the reading frame made no sense (the EST >alignments were used to confirm that the assembly homopolymers were >incorrect - lots of bad single base pair deletions).
> >The way hints work is as follows. At the simplest level ab initio >predictors are calculating the probability of being in different states >(intergenic, intron, exon, etc.). The hints increase the probability of >being in the intron state where MAKER gives an intron hint or being in an >exon/CDS state when MAKER gives an exon/CDS hint. So this bends the >likelihood of the ab initio gene predictor to call something similar in >structure to the evidence overlapping it. That being said, if there is >strong enough signal from something else in the sequence or my hints won't >work with the splice sites in the region or the reading frame breaks, then >no amount of hints can force Augustus to make that model. > >--Carson > > > >On 6/5/14, 2:15 AM, "Marc Höppner" wrote: > >>Hi, >> >>thanks for the feedback. I spent some more time on this and am still >>somewhat unsatisfied with the whole thing... >> >>A few comments: >> >>I quite frequently see Augustus and by extension Maker including exons >>that are not supported by EST/Protein evidence and are not critical for >>the gene model (i.e. I can take them out and still get a proper CDS). >>Maybe I don't know enough about how Maker creates hints and more >>importantly what role these hints play for Augustus, but I cannot really >>see a great effect (any, really) on the gene models even if both ESTs and >>proteins contradict an Augustus gene model and the surplus exon is >>non-essential. >> >>(all evidence is provided as fasta files, protein2genome and est2genome >>are set to 0) >> >>As for the repeat library, I suppose this is a critical point. I am using >>repeats from a closely related species via RepeatMasker, modelled and >>filtered repeats from RepeatModeler and repeats derived from a >>high-coverage 454 data set. Not sure what else I can do to improve that. >> >>As for evidence, I am using the curated reference proteome from a related >>species (<5 million years), uniprot_swissprot and high-quality ESTs + 454 >>reads.
I don't think it gets a whole lot better, in terms of what data >>can be used. >> >>So in summary, I just don't get where I want to using Augustus and Maker >>- specifically, the gene models are full of weird, unsupported artefacts >>despite manually curating > 850 models for training. I suppose I was >>hoping for some secret trick to improve on this - but I guess there is >>none? Actually, if I only do a pure evidence build (seeing that my input >>data is very high quality), it looks better - which sort of goes against >>the premise of Maker :/ >> >>Regards, >> >>Marc >> >> >> >> >>Marc P. Hoeppner, PhD >>Team Leader >>Department for Medical Biochemistry and Microbiology >>Uppsala University, Sweden >>marc.hoeppner at bils.se >> >>On 27 May 2014, at 17:25, Carson Holt wrote: >> >>> Extra exons can be required for predictors to make sense of a region (they >>> do the best they can). This can be due to imperfect assemblies or >>> repeats. For plants the repeat database is the one thing that will >>> most affect the annotation quality. You may need to spend some time >>> building the best repeat library you can. The repeat library is the next >>> most important thing after training the predictor, because repeats confuse >>> the predictor (sometimes a lot), causing it to behave oddly in those >>> regions (because repeats do encode real protein and protein fragments). >>> Also, when running MAKER now, make sure to include the entire proteome >>> of a related species and not just UniProt, and you will get better >>> performance. Now that you have Augustus trained, using it inside of MAKER >>> with an improved repeat library and additional protein evidence should >>> give it the feedback that will allow it to perform better than it would >>> with just naked ab initio prediction.
>>> >>> Thanks, >>> Carson >>> >>> >>> On 5/27/14, 2:12 AM, "Marc Höppner" wrote: >>> >>>> Hi, >>>> >>>> I wanted to get some feedback regarding the training of ab-initio gene >>>> finders - it's not strictly Maker related, but I suppose there are many >>>> people on this list that have encountered and solved this issue in one >>>> way or another. >>>> >>>> Specifically, I am trying to train Augustus (and possibly SNAP) for a >>>> plant genome. This has always been a very frustrating process for me, but >>>> while I have a better idea now how to do it, I still don't get the sort >>>> of accuracy that I am hoping for. A quick run-through of my process: >>>> >>>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>>> Sanger-sequenced EST data >>>> >>>> Filtered for Models with an AED <= 0.3 >>>> >>>> Loaded that into WebApollo, together with an existing reference >>>> annotation and the evidence tracks >>>> >>>> Manually curated/selected 750 gene models using the following rules: >>>> - Must have start/stop codon >>>> - Must have as many exons as possible >>>> - Must agree with evidence >>>> - Must be >= 2 kb apart from other gene models (provided as flanking >>>> regions for Augustus to train intergenic sequence) >>>> >>>> From these models, I created a GBK file, split it into 650 (train) and >>>> 100 (test) models and created a new profile using the documented >>>> procedure. >>>> >>>> But: >>>> >>>> While the naked ab-init models created through maker get a lot of genes >>>> "sort of right", I still see too many issues to be really satisfied. >>>> Problems include: >>>> >>>> - random exon calls which are not supported by any line of evidence (~1 >>>> per gene model, I would guess) >>>> - poor congruency with some gene models (especially ones not used for >>>> training/testing) >>>> >>>> Is there any best-practice guide on how to improve this? The Augustus >>>> website is unfortunately quite poor on detail...
My impression so far is >>>> that ramping up the number of training models isn't really doing too much >>>> beyond a certain point (tried 400, 500 and 750). >>>> >>>> Regards, >>>> >>>> Marc >>>> >>>> >>>> Marc P. Hoeppner, PhD >>>> Team Leader >>>> BILS Genome Annotation Platform >>>> Department for Medical Biochemistry and Microbiology >>>> Uppsala University, Sweden >>>> marc.hoeppner at bils.se

From marc.hoeppner at bils.se Thu Jun 5 02:15:55 2014 From: marc.hoeppner at bils.se (Marc Höppner) Date: Thu, 5 Jun 2014 10:15:55 +0200 Subject: [maker-devel] Some questions regarding ab-initio training In-Reply-To: References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> Message-ID: <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se> Hi, thanks for the feedback. I spent some more time on this and am still somewhat unsatisfied with the whole thing... A few comments: I quite frequently see Augustus and by extension Maker including exons that are not supported by EST/Protein evidence and are not critical for the gene model (i.e. I can take them out and still get a proper CDS). Maybe I don't know enough about how Maker creates hints and more importantly what role these hints play for Augustus, but I cannot really see a great effect (any, really) on the gene models even if both ESTs and proteins contradict an Augustus gene model and the surplus exon is non-essential. (all evidence is provided as fasta files, protein2genome and est2genome are set to 0) As for the repeat library, I suppose this is a critical point.
I am using repeats from a closely related species via RepeatMasker, modelled and filtered repeats from RepeatModeler and repeats derived from a high-coverage 454 data set. Not sure what else I can do to improve that. As for evidence, I am using the curated reference proteome from a related species (<5 million years), uniprot_swissprot and high-quality ESTs + 454 reads. I don't think it gets a whole lot better, in terms of what data can be used. So in summary, I just don't get where I want to using Augustus and Maker - specifically, the gene models are full of weird, unsupported artefacts despite manually curating > 850 models for training. I suppose I was hoping for some secret trick to improve on this - but I guess there is none? Actually, if I only do a pure evidence build (seeing that my input data is very high quality), it looks better - which sort of goes against the premise of Maker :/ Regards, Marc Marc P. Hoeppner, PhD Team Leader Department for Medical Biochemistry and Microbiology Uppsala University, Sweden marc.hoeppner at bils.se On 27 May 2014, at 17:25, Carson Holt wrote: > Extra exons can be required for predictors to make sense of a region (they > do the best they can). This can be due to imperfect assemblies or > repeats. For plants the repeat database is the one thing that will > most affect the annotation quality. You may need to spend some time > building the best repeat library you can. The repeat library is the next > most important thing after training the predictor, because repeats confuse > the predictor (sometimes a lot), causing it to behave oddly in those > regions (because repeats do encode real protein and protein fragments). > Also, when running MAKER now, make sure to include the entire proteome > of a related species and not just UniProt, and you will get better > performance.
Now that you have Augustus trained, using it inside of MAKER > with an improved repeat library and additional protein evidence should > give it the feedback that will allow it to perform better than it would > with just naked ab initio prediction. > > Thanks, > Carson > > > On 5/27/14, 2:12 AM, "Marc Höppner" wrote: > >> Hi, >> >> I wanted to get some feedback regarding the training of ab-initio gene >> finders - it's not strictly Maker related, but I suppose there are many >> people on this list that have encountered and solved this issue in one >> way or another. >> >> Specifically, I am trying to train Augustus (and possibly SNAP) for a >> plant genome. This has always been a very frustrating process for me, but >> while I have a better idea now how to do it, I still don't get the sort >> of accuracy that I am hoping for. A quick run-through of my process: >> >> Evidence build with maker on level 1 and 2 proteins from Uniprot + >> Sanger-sequenced EST data >> >> Filtered for Models with an AED <= 0.3 >> >> Loaded that into WebApollo, together with an existing reference >> annotation and the evidence tracks >> >> Manually curated/selected 750 gene models using the following rules: >> - Must have start/stop codon >> - Must have as many exons as possible >> - Must agree with evidence >> - Must be >= 2 kb apart from other gene models (provided as flanking >> regions for Augustus to train intergenic sequence) >> >> From these models, I created a GBK file, split it into 650 (train) and >> 100 (test) models and created a new profile using the documented >> procedure. >> >> But: >> >> While the naked ab-init models created through maker get a lot of genes >> "sort of right", I still see too many issues to be really satisfied.
>> Problems include: >> >> - random exon calls which are not supported by any line of evidence (~1 >> per gene model, I would guess) >> - poor congruency with some gene models (especially ones not used for >> training/testing) >> >> Is there any best-practice guide on how to improve this? The Augustus >> website is unfortunately quite poor on detail... My impression so far is >> that ramping up the number of training models isn't really doing too much >> beyond a certain point (tried 400, 500 and 750). >> >> Regards, >> >> Marc >> >> >> Marc P. Hoeppner, PhD >> Team Leader >> BILS Genome Annotation Platform >> Department for Medical Biochemistry and Microbiology >> Uppsala University, Sweden >> marc.hoeppner at bils.se

From fbarreto at ucsd.edu Thu Jun 5 13:01:05 2014 From: fbarreto at ucsd.edu (Felipe Barreto) Date: Thu, 5 Jun 2014 12:01:05 -0700 Subject: [maker-devel] Generating GFF with selected tracks Message-ID: Hi, all, I would like to produce a gff file that contains Maker gene models AND repeats. I know that using gff3_merge with -g will generate one with only the gene models, but I didn't see any options for adding additional tracks. The way I did this was to use the Re-annotation section in the control file. I provided the original full gff file in maker_gff, and turned on the rm_pass and model_pass. All other options in the control file were turned off. This seemed to work, though it also added a 'model_gff:maker' track, which is not a problem for me. I compared a few new and original scaffolds in Apollo, and all seem to match perfectly.
But since I cannot check the whole genome, I was wondering if what I did was appropriate. Are all the gene models (and their names) and repeat alignments identical between the new and original files? Or is Maker potentially changing a few things since it's treated as a new run? Thanks! -- Felipe Barreto

From carsonhh at gmail.com Thu Jun 5 13:02:36 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 05 Jun 2014 13:02:36 -0600 Subject: [maker-devel] protein2genome gene models from protein gff In-Reply-To: <1401994595132.44761@uga.edu> References: <1401994595132.44761@uga.edu> Message-ID: That's what I'd do. But really protein2genome=1 is just meant to get enough rough gene models to train a gene predictor. You don't need to run it across the whole genome. But if you do, when you run again after training the gene predictor, MAKER will detect the old BLAST jobs and they won't have to rerun on the second MAKER run. --Carson From: Sivaranjani Namasivayam Date: Thursday, June 5, 2014 at 12:56 PM To: Carson Holt Subject: RE: [maker-devel] protein2genome gene models from protein gff So what would you suggest is the best way to get protein2genome predictions? Use fasta sequences, instead of gff? Thanks, Ranjani From: Carson Holt Sent: Thursday, June 05, 2014 2:08 PM To: Sivaranjani Namasivayam; maker-devel at yandell-lab.org Subject: Re: [maker-devel] protein2genome gene models from protein gff est_gff assumes the alignments are spliced correctly. The protein2genome option also makes that assumption but with a little less confidence that the user always provides splice aware alignments, so in some instances (like protein2genome=1) it may not pass them forward as guaranteed splice aware alignments.
--Carson From: Sivaranjani Namasivayam Date: Thursday, June 5, 2014 at 9:49 AM To: "maker-devel at yandell-lab.org" Subject: [maker-devel] protein2genome gene models from protein gff Hi, I am trying to predict gene models from protein evidence, using the parameter protein2genome set to 1. I get gene models predicted if I provide the proteins as a fasta file, but not as gff (I want to use a gff format to avoid the blastx step again). Is this expected? In case of transcriptome evidence and est2genome set to 1, I get gene models predicted with both fasta and gff formats. Thanks, Ranjani _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Thu Jun 5 13:05:30 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 05 Jun 2014 13:05:30 -0600 Subject: [maker-devel] Generating GFF with selected tracks In-Reply-To: References: Message-ID: gff3_merge just merges any two GFF3 files. So if you have two files just give both of them to it. Example --> gff3_merge maker_genes.gff repeats.gff Also if all you are trying to do is filter out certain feature types from the file, just use grep instead. Example --> grep -v -P "\tpred_gff\t" maker.gff Thanks, Carson From: Felipe Barreto Date: Thursday, June 5, 2014 at 1:01 PM To: MAKER group Subject: [maker-devel] Generating GFF with selected tracks Hi, all, I would like to produce a gff file that contains Maker gene models AND repeats. I know that using gff3_merge with -g will generate one with only the gene models, but I didn't see any options for adding additional tracks. The way I did this was to use the Re-annotation section in the control file. I provided the original full gff file in maker_gff, and turned on the rm_pass and model_pass. All other options in the control file were turned off. 
This seemed to work, though it also added a 'model_gff:maker' track, which is not a problem for me. I compared a few new and original scaffolds in Apollo, and all seem to match perfectly. But since I cannot check the whole genome, I was wondering if what I did was appropriate. Are all the gene models (and their names) and repeat alignments identical between the new and original files? Or is Maker potentially changing a few things since it's treated as a new run? Thanks! -- Felipe Barreto _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From dence at genetics.utah.edu Thu Jun 5 13:08:08 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Thu, 5 Jun 2014 19:08:08 +0000 Subject: [maker-devel] Generating GFF with selected tracks In-Reply-To: References: Message-ID: Hi Felipe, I seem to remember that some of the gene model names did change when I did things similar to what you described. I think that you could accomplish the same thing with some cat and grep commands on the full gff. That would avoid the trouble of rerunning maker. Something like "cat full.gff | grep -P "\trepeatrunner\t" > tmp.gff " would get you started. ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 5, 2014, at 12:01 PM, Felipe Barreto > wrote: Hi, all, I would like to produce a gff file that contains Maker gene models AND repeats. I know that using gff3_merge with -g will generate one with only the gene models, but I didn't see any options for adding additional tracks. The way I did this was to use the Re-annotation section in the control file. I provided the original full gff file in maker_gff, and turned on the rm_pass and model_pass. 
All other options in the control file were turned off. This seemed to work, though it also added a 'model_gff:maker' track, which is not a problem for me. I compared a few new and original scaffolds in Apollo, and all seem to match perfectly. But since I cannot check the whole genome, I was wondering if what I did was appropriate. Are all the gene models (and their names) and repeat alignments identical between the new and original files? Or is Maker potentially changing a few things since it's treated as a new run? Thanks! -- Felipe Barreto _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From fbarreto at ucsd.edu Thu Jun 5 14:07:51 2014 From: fbarreto at ucsd.edu (Felipe Barreto) Date: Thu, 5 Jun 2014 13:07:51 -0700 Subject: [maker-devel] Generating GFF with selected tracks In-Reply-To: References: Message-ID: OK, I see. I will just use grep to extract the desired features from the full.gff and merge them with gff3_merge. Don't know why I was making it more complicated. I guess I don't understand gff formats very well quite yet. Thanks yet again! On Thu, Jun 5, 2014 at 12:08 PM, Daniel Ence wrote: > Hi Felipe, I seem to remember that some of the gene model names did > change when I did things similar to what you described. I think that you > could accomplish the same thing with some cat and grep commands on the full > gff. That would avoid the trouble of rerunning maker. Something like "cat > full.gff | grep -P "\trepeatrunner\t" > tmp.gff " would get you started. 
> > ~Daniel > > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 5, 2014, at 12:01 PM, Felipe Barreto > wrote: > > Hi, all, > > I would like to produce a gff file that contains Maker gene models AND > repeats. I know that using gff3_merge with -g will generate one with only > the gene models, but I didn't see any options for adding additional tracks. > > The way I did this was to use the Re-annotation section in the control > file. I provided the original full gff file in maker_gff, and turned on > the rm_pass and model_pass. All other options in the control file were > turned off. This seemed to work, though it also added a 'model_gff:maker' > track, which is not a problem for me. I compared a few new and original > scaffolds in Apollo, and all seem to match perfectly. But since I cannot > check the whole genome, I was wondering if what I did was appropriate. Are > all the gene models (and their names) and repeat alignments identical > between the new and original files? Or is Maker potentially changing a few > things since it's treated as a new run? > > Thanks! > > -- > Felipe Barreto > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > -- Felipe Barreto Post-doctoral Scholar Scripps Institution of Oceanography University of California, San Diego La Jolla, CA 92093 -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From daniel.standage at gmail.com Fri Jun 6 10:33:06 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 12:33:06 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Another question: is there documentation anywhere for the naming conventions of the genes annotated by Maker? Of course it's easy to spot genes based on a particular *ab initio* gene predictor, as the names are in the IDs. But what is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs "maker-$seqid-snap-gene"?

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage wrote:

> I have attached data for a small 18kb region with a handful of genes, as
> well as the corresponding maker_opts.ctl file. (This is a smaller and
> different data set than what I was looking at yesterday, with a more
> well-defined problem).
>
> With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400
> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a
> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes
> have transcript support: will Maker report overlapping genes under any
> conditions? And even if Maker is forced to choose only a single gene to
> report, why would the model from 4125 to 6400 ever be reported in place of
> the one from 6111 to 8345, especially since this is provided in the
> model_gff file?
>
> Even when transcript TSA024184 is included, Maker 2.10 reports the
> high-confidence gene from 6111 to 8345.
>
> Any light you could shed would be helpful. Thanks!
>
> --
> Daniel S. Standage
> Ph.D. Candidate
> Computational Genome Science Laboratory
> Indiana University
>
> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote:
>
>> Just eAED, but eAED can affect selection of ab initio results. For
>> example, reading frame match of protein evidence, which also affects
>> whether evidence from single_exon=1 and genes with single_exon protein
>> evidence get kept. There is also the assumption that your alignments in
>> GFF3 are correctly spliced (like BLAT does). So giving blastn results as
>> precomputed est_gff would create a lot of noise, since maker ignores
>> blastn and is using it only to seed the polished exonerate alignments.
>>
>> --Carson
>>
>> From: Daniel Standage
>> Date: Wednesday, June 4, 2014 at 1:11 PM
>> To: Carson Holt
>> Cc: Maker Mailing List
>> Subject: Re: [maker-devel] Filtering of ab initio gene models
>>
>> I do not provide Gap or Target attributes in the GFF3. Will this affect
>> the AED as well, or just the eAED?
>>
>> --
>> Daniel S. Standage
>> Ph.D. Candidate
>> Computational Genome Science Laboratory
>> Indiana University
>>
>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote:
>>
>>> Sure, that would be helpful. One question. Do you provide the Gap
>>> attribute in your precomputed alignments? Having or not having that
>>> attribute affects the eAED score which takes reading frame into account,
>>> and may cause some things to be kept that normally would be dropped,
>>> because MAKER won't be able to take the points of mismatch of the
>>> alignment into account (it just assumes match everywhere).
>>>
>>> --Carson
>>>
>>> From: Daniel Standage
>>> Date: Wednesday, June 4, 2014 at 1:03 PM
>>> To: Maker Mailing List
>>> Subject: [maker-devel] Filtering of ab initio gene models
>>>
>>> Thanks everyone for your responses recently!
>>>
>>> The reason for my recent flurry of email activity is that I'm seeing
>>> some unexpected trends when running the new version of Maker with
>>> precomputed alignments. Compared with an annotation I did a while ago
>>> (Maker 2.10, Maker-computed alignments), this new annotation has a
>>> substantial number of new genes annotated. If I compare distributions of
>>> AED scores between the old and new annotation, it's clear that the new
>>> annotation has a lot more low-quality models. If I look at new gene
>>> models that do not overlap with any gene model from the old annotation,
>>> the likelihood that it's a low-quality model is much higher.
>>>
>>> I decided to run a little experiment. I annotated a scaffold first using
>>> Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same
>>> pre-computed transcript and protein alignments and the same (latest)
>>> version of SNAP as the only *ab initio* predictor. Maker 2.10 predicted
>>> 44 genes while Maker 2.31.3 predicted 63. If we group gene models into
>>> loci by overlap, there are 33 loci with gene models from both 2.10 and
>>> 2.31.3, 1 locus with only models from 2.10, and 28 loci with only models
>>> from 2.31.3.
>>>
>>> Before this experiment, I assumed the issue was related to providing
>>> pre-computed alignments in GFF3 format and perhaps violating some
>>> important assumption. However, this experiment makes me wonder whether
>>> there have been changes to how Maker filters *ab initio* gene models
>>> between version 2.10 and version 2.31.3? Do you have any ideas? If it
>>> would help, I could put together a small data set that reproduces the
>>> behavior I just described.
>>>
>>> Thanks!
>>>
>>> --
>>> Daniel S. Standage
>>> Ph.D. Candidate
>>> Computational Genome Science Laboratory
>>> Indiana University

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Fri Jun 6 10:39:43 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 06 Jun 2014 10:39:43 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked sequence without hints (i.e. the ab initio call). maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER.

In both cases MAKER is allowed to add UTR to the model (hence the 'processed' tag).

--Carson

From daniel.standage at gmail.com Fri Jun 6 10:46:41 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 12:46:41 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Good to know, thanks. If multiple *ab initio* predictors inform a single annotation, how does Maker decide which one will be included in the gene's ID?

Given your quick response just now, I wanted to confirm that you got the message and data set I sent yesterday. I received an email saying the size of my message required list admin approval to be distributed, but since you were also a direct recipient of the email I didn't worry about it too much.

Thanks again!

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

From carsonhh at gmail.com Fri Jun 6 10:56:38 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 06 Jun 2014 10:56:38 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

I got the e-mail. Thanks for the test set.

Multiple ab initio predictors don't inform a single annotation; rather, one must be chosen from the pool of available models (i.e. it has to be SNAP, or Augustus, or GeneMark). They all supply their own ab initio as well as hint-based predictions, and then the one with the best evidence match (measured by AED) is kept (it's like a competition that only one predictor can win).
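
The selection described above can be sketched roughly as follows. This is only an illustration of "lowest AED wins", not MAKER's actual code: the `Candidate` class, the `pick_winner` helper, the source labels and the AED values are all made up for the example. The only thing taken from MAKER is that AED runs from 0 (model agrees perfectly with the evidence) to 1 (no support), so the best match is the smallest value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One competing gene model at a locus (hypothetical structure)."""
    source: str  # illustrative label, e.g. ab initio vs. hint-based call
    aed: float   # 0.0 = perfect agreement with evidence, 1.0 = no support


def pick_winner(candidates):
    """Return the single candidate with the lowest AED.

    Ties go to whichever candidate comes first in the list; how MAKER
    itself breaks ties is not covered by this sketch.
    """
    if not candidates:
        raise ValueError("no candidate models at this locus")
    return min(candidates, key=lambda c: c.aed)


# Made-up pool: each predictor contributes an ab initio call and a
# hint-based call, and exactly one model survives.
locus = [
    Candidate("snap_masked", 0.23),
    Candidate("maker-snap", 0.01),
    Candidate("augustus_masked", 0.15),
    Candidate("maker-augustus", 0.08),
]
winner = pick_winner(locus)
print(winner.source, winner.aed)
```

Note that nothing is averaged or merged between predictors here: the output is always one of the input models unchanged, which is the difference from a consensus approach.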
If you want a consensus model instead, you can take MAKER results in GFF3 format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is a collaboration with the EVM group and will have this option, but for now users can just split the MAKER GFF3 by evidence types and give it to EVM. EVM then produces consensus models based on the GFF3 content.

--Carson

From daniel.standage at gmail.com Fri Jun 6 10:59:16 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 12:59:16 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

This helps, thanks.

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 12:56 PM, Carson Holt wrote: > I got the e-mail. Thanks for the test set. > > Multiple *ab initio* predictors don't inform a single annotation, rather > one must be chosen from the pool of available models (I.e. it has to be > SNAP or Augustus, or GeneMark).
They all supply their own *ab initio* as > well as hint based prediction, and then the one with best evidence match > (measured by AED) is kept (it's like a competition that only one predictor > can win). > > If you want a consensus model instead, you can take MAKER results in GFF3 > format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is > a collaboration with the EVM group and will have this option, but for now > users can just split the MAKER GFF3 by evidence types and give it to EVM. > EVM then produces consensus models based on the GFF3 content. > > --Carson > > From: Daniel Standage > Date: Friday, June 6, 2014 at 10:46 AM > > To: Carson Holt > Cc: Maker Mailing List , Volker Brendel < > vbrendel at indiana.edu> > Subject: Re: [maker-devel] Filtering of ab initio gene models > > Good to know, thanks. If multiple *ab initio* predictors inform a single > annotation, how does Maker decide which one will be included in the gene's > ID? > > Given your quick response just now, I wanted to confirm that you got the > message and data set I sent yesterday. I received an email saying the size > of my message required list admin approval to be distributed, but since you > were also a direct recipient of the email I didn't worry about it too much. > > Thanks again! > > > -- > Daniel S. Standage > Ph.D. Candidate > Computational Genome Science Laboratory > Indiana University > > > On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote: > >> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat >> masked sequence without hints (i.e. the ab initio call). >> maker-$seqid-snap-gene was produced by SNAP after receiving hints from >> MAKER. >> >> In both cases MAKER is allowed to add UTR to the model (hence the >> 'processed' tag). 
>> >> --Carson >> >> >> From: Daniel Standage >> Date: Friday, June 6, 2014 at 10:33 AM >> To: Carson Holt >> Cc: Maker Mailing List , Volker Brendel < >> vbrendel at indiana.edu> >> >> Subject: Re: [maker-devel] Filtering of ab initio gene models >> >> Another question: is there documentation anywhere for the naming >> conventions of the genes annotated by Maker? Of course it's easy to spot >> genes based on a particular *ab initio* gene predictor, as the names are >> in the IDs. But what is the significance of, say, >> "snap_masked-$seqid-processed-gene" in a gene ID vs >> "maker-$seqid-snap-gene"? >> >> Thanks, >> Daniel >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage < >> daniel.standage at gmail.com> wrote: >> >>> I have attached data for a small 18kb region with a handful of genes, as >>> well as the corresponding maker_opts.ctl file. (This is a smaller and >>> different data set than what I was looking at yesterday, with a more >>> well-defined problem). >>> >>> With the data files as is, Maker 2.31.3 reports a model from 4125 to >>> 6400 with an AED of 0.23. If you exclude transcript TSA024184, Maker >>> reports a different gene from 6111 to 8345 with an AED of 0.01. Both of >>> these genes have transcript support: will Maker report overlapping genes >>> under any conditions? And even if Maker is forced to choose only a single >>> gene to report, why would the model from 4125 to 6400 ever be reported in >>> place of the one from 6111 to 8345, especially since this is provided in >>> the model_gff file? >>> >>> Even when transcript TSA024184 is included, Maker 2.10 reports the >>> high-confidence gene from 611 to 8345. >>> >>> Any light you could shed would be helpful. Thanks! >>> >>> >>> -- >>> Daniel S. Standage >>> Ph.D. 
Candidate >>> Computational Genome Science Laboratory >>> Indiana University >>> >>> >>> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt wrote: >>> >>>> Just eAED, but eAED can affects selection of ab initio results. For >>>> example reading frame match of protein evidence which also affects whether >>>> evidence from single_exon=1 and genes with single_exon protein evidence get >>>> kept. There is also the assumption that your alignments in GFF3 are are >>>> correctly spliced (like BLAT does). So giving blastn results as >>>> precomputed est_gff would create a lot of noise, since maker ignores blastn >>>> and is using it only to seed the polished exonerate alignments. >>>> >>>> --Carson >>>> >>>> >>>> From: Daniel Standage >>>> Date: Wednesday, June 4, 2014 at 1:11 PM >>>> To: Carson Holt >>>> Cc: Maker Mailing List >>>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>>> >>>> I do not provide Gap or Target attributes in the GFF3. Will this affect >>>> the AED as well, or just the eAED? >>>> >>>> >>>> -- >>>> Daniel S. Standage >>>> Ph.D. Candidate >>>> Computational Genome Science Laboratory >>>> Indiana University >>>> >>>> >>>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt wrote: >>>> >>>>> Sure. that would be helpful. One question. Do you provide the Gap >>>>> attribute in your precomputed alignments? Having or not having that >>>>> attribute affects the eAED score which takes reading frame into account, >>>>> and may cause some things to be kept that normally would be dropped, >>>>> because MAKER won't be able to take the points of mismatch of the alignment >>>>> into account (it just assumes match everywhere). >>>>> >>>>> --Carson >>>>> >>>>> >>>>> From: Daniel Standage >>>>> Date: Wednesday, June 4, 2014 at 1:03 PM >>>>> To: Maker Mailing List >>>>> Subject: [maker-devel] Filtering of ab initio gene models >>>>> >>>>> Thanks everyone for your responses recently! 
>>>>> >>>>> The reason for my recent flurry of email activity is that I'm seeing >>>>> some unexpected trends when running the new version of Maker with >>>>> precomputed alignments. Compared with an annotation I did a while ago >>>>> (Maker 2.10, Maker-computed alignments), this new annotation has a >>>>> substantial number of new genes annotated. If I compare distributions of >>>>> AED scores between the old and new annotation, it's clear that the new >>>>> annotation has a lot more low-quality models. If I look at new gene models >>>>> that do not overlap with any gene model from the old annotation, the >>>>> likelihood that it's a low-quality model is much higher. >>>>> >>>>> I decided to run a little experiment. I annotated a scaffold first >>>>> using Maker 2.10 and then using Maker 2.31.3. I both cases, I used the same >>>>> pre-computed transcript and protein alignments and the same (latest) >>>>> version of SNAP as the only *ab initio* predictor. Maker 2.10 >>>>> predicted 44 genes while Maker 2.31.3 predicted 63. If we group gene models >>>>> into loci by overlap, there are 33 loci with gene models from both 2.10 and >>>>> 2.31.3, 1 locus with only models from 2.10, and 28 loci with only models >>>>> from 2.31.3. >>>>> >>>>> Before this experiment, I assumed the issue was related to providing >>>>> pre-computed alignments in GFF3 format and perhaps violating some important >>>>> assumption. However, this experiment makes me wonder whether there have >>>>> been changes to how Maker filters *ab initio* gene models between >>>>> version 2.10 and version 2.31.3? Do you have any ideas? If it would help, I >>>>> could put together a small data set that reproduces the behavior I just >>>>> described. >>>>> >>>>> Thanks! >>>>> >>>>> -- >>>>> Daniel S. Standage >>>>> Ph.D. 
Candidate >>>>> Computational Genome Science Laboratory >>>>> Indiana University >>>>> _______________________________________________ maker-devel mailing >>>>> list maker-devel at box290.bluehost.com >>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>>>> >>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.standage at gmail.com Fri Jun 6 12:38:23 2014 From: daniel.standage at gmail.com (Daniel Standage) Date: Fri, 6 Jun 2014 14:38:23 -0400 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: References: Message-ID: Carson et al, Your feedback so far has been very helpful, and we are grateful for the time you have taken to respond! We're still trying to understand the precise procedure by which competing models are chosen. You mentioned that a single model must be chosen (via AED) from a pool of available models: are these pools constructed by overlap? It is not uncommon in our experience to see overlapping genes reported by Maker, although for the most part it appears these overlapping genes don't have CDS overlap. Looking more closely at the Maker 2.10 output from the test data we sent yesterday, we also noted that exclusion of the transcript in question also had an effect on the interval exon structure (exon 7717-7776 becomes exon 7737-7776) of a downstream model with which it overlaps 3 nucleotides. And still unclear to us is how the model_gff data fits in with all this. >From my previous searching of the list archives I was under the impression that these models would be given substantial weight in the prediction process, and would only be altered if a considerably better model could be identified. Our experience with this small data set, though, is that which overlapping gene is reported, and which corresponding exon structure is selected, is dependent on very slight changes in the evidence. Thanks, Daniel -- Daniel S. Standage Ph.D. 
Candidate
Computational Genome Science Laboratory
Indiana University

From carsonhh at gmail.com Fri Jun 6 12:51:18 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 06 Jun 2014 12:51:18 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

There can be overlapping models if you have multiple gene predictors. Also, the hint-based models will overlap the ab initio models, but you never get to see them (they are not kept in the evidence because they are confusing and really not useful unless they are chosen as the best model).

All models, regardless of location and overlap, get sorted by their AED score. The best model is then kept from the list. Then the next, then the next. If the next best model overlaps a model that has already come off the list (which means the other model has a better AED score), then it gets skipped, and the next best one in the list gets added to the non-overlapping space.
The result is that the final models will be non-redundant and non-overlapping, but if you look at the evidence alignments you will find ab initio models different than the MAKER models that were rejected and do not overlap the final models.

model_gff competes just like any other model with AED. Ties always go to model_gff, and if there is a region where no model gets chosen (they all have AED of 1) and a model_gff entry will fit (even with an AED score of 1), then it will be chosen, because model_gff do not need evidence support to end up in the final annotations.

--Carson
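For readers following along, the selection procedure Carson describes can be sketched roughly as below. This is only an illustration in Python, not MAKER's actual code. The names `Model`, `overlaps`, `select_models`, and `from_model_gff` are hypothetical. The sketch assumes that "overlap" means any overlap of genomic coordinates (it ignores strand and the CDS/UTR distinction), and that a model with an AED of 1 is dropped unless it came from model_gff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A candidate gene model (hypothetical structure, for illustration only)."""
    name: str
    start: int
    end: int
    aed: float                    # 0.0 = perfect evidence match, 1.0 = no support
    from_model_gff: bool = False  # True if the model was supplied via model_gff


def overlaps(a: Model, b: Model) -> bool:
    """Assumption: plain coordinate overlap, ignoring strand and CDS vs. UTR."""
    return a.start <= b.end and b.start <= a.end


def select_models(candidates):
    """Greedy pick: best AED first, skipping anything that overlaps a kept model."""
    # Sort by AED ascending; on a tie, model_gff entries sort first.
    ranked = sorted(candidates, key=lambda m: (m.aed, not m.from_model_gff))
    kept = []
    for m in ranked:
        # Assumption: unsupported models (AED 1) survive only if from model_gff.
        if m.aed >= 1.0 and not m.from_model_gff:
            continue
        if any(overlaps(m, k) for k in kept):
            continue
        kept.append(m)
    return sorted(kept, key=lambda m: m.start)


if __name__ == "__main__":
    pool = [
        Model("snap-1", 100, 500, 0.30),
        Model("aug-1", 400, 900, 0.10),
        Model("ref-1", 1000, 1500, 1.0, from_model_gff=True),
        Model("snap-2", 1000, 1400, 1.0),
        Model("ref-2", 2000, 2500, 0.20, from_model_gff=True),
        Model("snap-3", 2000, 2600, 0.20),
    ]
    for m in select_models(pool):
        print(m.name, m.start, m.end, m.aed)
```

In this toy pool, the better-scoring of two overlapping predictions is kept, a model_gff entry wins an AED tie, and a model_gff entry with an AED of 1 fills a region where nothing else qualifies. Whether MAKER treats strand or CDS-only overlap differently is not covered by this sketch.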
From daniel.standage at gmail.com Fri Jun 6 17:58:26 2014
From: daniel.standage at gmail.com (Daniel Standage)
Date: Fri, 6 Jun 2014 19:58:26 -0400
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To: <53923808.7030401@indiana.edu>
References: <53923808.7030401@indiana.edu>
Message-ID:

In the example sent previously, transcript TSA024184 overlaps with the 3' end of our gene model's CDS by 3 nucleotides. If I manually change the transcript's end coordinate (6400 to 6100) so that there are two separate non-overlapping evidence clusters, two models are reported as expected. But I can even get both models reported with a much smaller change (6400 to 6395), where the UTRs still overlap but the CDS does not overlap with the UTR.

The 5' end of our gene model's CDS also overlaps with another transcript. Maker has no problem reporting both of these gene models though, probably since they're on different strands?
So correct me if I'm wrong, but it appears that Maker will report overlapping gene models if they are on opposite strands or if no CDS is involved in the overlap. Is there any way this behavior can be configured?

On another note, we're considering your suggestion to integrate EVM with Maker. One possibility discussed is to run Maker 4 separate times (once for each of Augustus, GeneMark, SNAP, and our model_gff models), each time with all our transcript/protein evidence, prior to consensus modeling with EVM. Would that provide any benefit over running Maker a single time with all prediction sources simultaneously?

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University
From vbrendel at indiana.edu Fri Jun 6 15:52:08 2014
From: vbrendel at indiana.edu (Volker Brendel)
Date: Fri, 06 Jun 2014 16:52:08 -0500
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID: <53923808.7030401@indiana.edu>

Hi Carson,
is there a way of allowing MAKER to add UTRs to our external models (supplied by the pred_gff or model_gff tag)? This seems to be one problem we are running into. Our external models are high quality, but CDS only. Thus their score gets knocked down relative to ab initio predictions with added UTRs.

Daniel will have more questions/observations later with regard to overlapping gene models (we definitely need to allow gene models to overlap in the UTRs, because transcript evidence clearly shows such negative intergenic spaces).

Thanks for all your help!
Volker
--
Volker Brendel
Professor of Biology and Computer Science
Indiana University
Department of Biology & School of Informatics and Computing
Simon Hall 205C
212 South Hawthorne Drive
Bloomington, IN 47405-7003
Tel.: (812) 855-7074
http://brendelgroup.org/

From carsonhh at gmail.com Sat Jun 7 14:03:18 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:03:18 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References: <53923808.7030401@indiana.edu>
Message-ID:

The problem in the example you sent is the geneseqer entries in the GFF3 you are passing in. They are causing gene clusters to merge. The result is that UTR is being overextended and is overlapping on the models (and probably some models get merged). As you noticed, you can't have overlapping models on the same strand.

If you set score_preds=1 in the maker_opts.ctl file it will give you AED scores for the rejected ab initio models. You will notice that none of them score better than 0.23.

One thing you can do is set correct_est_fusion=1. This tries to correct for erroneous EST/transcript evidence that leads to overextended UTR and false gene merging. You will see in the attached image that it trims back the overlapping 3' and 5' UTR for the overlapping gene models, given that MAKER believes the evidence leading to the overlap is likely low confidence and is a false merge of regions.

I think much of your geneseqer input is more of a problem than a help for the annotation. Many seem to be spurious alignments.

--Carson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6246A771-E6A9-4875-9362-DC8A7A5BC9C4.png
Type: image/png
Size: 48365 bytes
Desc: not available
URL:

From carsonhh at gmail.com Sat Jun 7 14:07:41 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:07:41 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References: <53923808.7030401@indiana.edu>
Message-ID:

Example (attached) of geneseqer GFF3 input causing problems. Notice that all the geneseqer features are almost exact representations of the transposon; they are essentially reintroducing all the noise that repeat masking tried to remove (they are giving hints to the gene predictor to try and call the transposon as a gene).
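
[One way to look for this kind of evidence before handing it to MAKER is to check how much of each precomputed alignment is covered by annotated repeats. The following is only a minimal sketch and not part of MAKER: it assumes both the evidence and the repeat annotations are available as 9-column GFF3, and the function names and the 0.8 coverage cutoff are hypothetical choices for illustration.]

```python
from collections import defaultdict


def parse_gff3(lines):
    """Yield (seqid, start, end, attributes) for each 9-column GFF3 feature line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines rather than guess at their meaning
        yield cols[0], int(cols[3]), int(cols[4]), cols[8]


def covered_fraction(start, end, intervals):
    """Fraction of the closed interval [start, end] covered by `intervals`."""
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in intervals
        if s <= end and e >= start
    )
    covered, cursor = 0, start - 1  # cursor = last base already counted
    for s, e in clipped:
        s = max(s, cursor + 1)
        if e >= s:
            covered += e - s + 1
            cursor = e
    return covered / (end - start + 1)


def flag_repeat_like(evidence_lines, repeat_lines, min_fraction=0.8):
    """Return evidence features that are mostly covered by repeat features.

    min_fraction is an arbitrary illustrative cutoff, not a MAKER setting.
    """
    repeats = defaultdict(list)
    for seqid, start, end, _ in parse_gff3(repeat_lines):
        repeats[seqid].append((start, end))
    flagged = []
    for seqid, start, end, attrs in parse_gff3(evidence_lines):
        if covered_fraction(start, end, repeats[seqid]) >= min_fraction:
            flagged.append((seqid, start, end, attrs))
    return flagged
```

[Features returned by `flag_repeat_like` would be candidates for manual review or removal from the evidence file; whether to drop them is a judgment call that depends on the repeat library used.]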
--Carson
If it would help, >>>>> I could put together a small data set that reproduces the behavior I just >>>>> described. >>>>> >>>>> Thanks! >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Daniel S. Standage >>>>> Ph.D. Candidate >>>>> Computational Genome Science Laboratory >>>>> Indiana University >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ maker-devel mailing list >>>>> maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo >>>>> /maker-devel_yandell-lab.org >>>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>> >>> >>> >>> >>> >>> >> >> >> >> > > > -- > Volker Brendel > Professor of Biology and Computer Science > Indiana University > Department of Biology & School of Informatics and Computing > Simon Hall 205C > 212 South Hawthorne Drive > Bloomington, IN 47405-7003 > > Tel.: (812) 855-7074 http://brendelgroup.org/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 6246A771-E6A9-4875-9362-DC8A7A5BC9C4.png Type: image/png Size: 48365 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 48C1E0B9-001D-44C9-8D8E-37A52E4A17E8.png Type: image/png Size: 6592 bytes Desc: not available URL: From carsonhh at gmail.com Sat Jun 7 14:11:43 2014 From: carsonhh at gmail.com (Carson Holt) Date: Sat, 07 Jun 2014 14:11:43 -0600 Subject: [maker-devel] Filtering of ab initio gene models In-Reply-To: <53923808.7030401@indiana.edu> References: <53923808.7030401@indiana.edu> Message-ID: If you give input as pred_gff, set keep_preds=1, and then give MAKER EST evidence to work with then MAKER will just pass_through the pred_gff data you gave it with UTR added. Set correct_est_fusion=1 if your input contains false merges across regions (common from mRNA-seq results). 
This will trim overlapping UTR caused by the improperly merged EST evidence. --Carson From: Volker Brendel Date: Friday, June 6, 2014 at 3:52 PM To: Carson Holt , Daniel Standage Cc: Maker Mailing List Subject: Re: [maker-devel] Filtering of ab initio gene models Hi Carson, is there a way of allowing MAKER to add UTRs to our external models (supplied by the pred_gff or model_gff tag)? This seems to be one problem we are running into. Our external models are high quality, but CDS only. Thus their score gets knocked down relative to ab initio predictions with added UTRs. Daniel will have more questions/observations later with regard to overlapping gene models (we definitely need to allow gene models to overlap in the UTRs, because transcript evidence clearly shows such negative intergenic spaces). Thanks for all your help! Volker On 6/6/2014 11:39 AM, Carson Holt wrote: > > snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked > sequence without hints (i.e. the ab initio call). > > maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER. > > > > > In both cases MAKER is allowed to add UTR to the model (hence the 'processed' > tag). > > > > > --Carson > > > > > > > > From: Daniel Standage > Date: Friday, June 6, 2014 at 10:33 AM > To: Carson Holt > Cc: Maker Mailing List , Volker Brendel > > Subject: Re: [maker-devel] Filtering of ab initio gene models > > > > > > > > Another question: is there documentation anywhere for the naming conventions > of the genes annotated by Maker? Of course it's easy to spot genes based on a > particular ab initio gene predictor, as the names are in the IDs. But what is > the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs > "maker-$seqid-snap-gene"? > > > Thanks, > > Daniel > > > > > > > -- > Daniel S. Standage > Ph.D. 
Candidate
>>>> Computational Genome Science Laboratory
>>>> Indiana University
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

--
Volker Brendel
Professor of Biology and Computer Science
Indiana University
Department of Biology & School of Informatics and Computing
Simon Hall 205C
212 South Hawthorne Drive
Bloomington, IN 47405-7003

Tel.: (812) 855-7074 http://brendelgroup.org/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Sat Jun 7 14:16:29 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 07 Jun 2014 14:16:29 -0600
Subject: [maker-devel] Filtering of ab initio gene models
In-Reply-To:
References:
Message-ID:

Also, MAKER 2.10 has a number of bugs in how UTR is generated and how hints are generated for the ab initio predictors (it's several years out of date). I don't think it checks for reading frame match when determining protein overlap match either. So it is no surprise that some models will be different from the current MAKER version.

--Carson

From: Daniel Standage
Date: Friday, June 6, 2014 at 12:38 PM
To: Carson Holt
Cc: Maker Mailing List , Volker Brendel
Subject: Re: [maker-devel] Filtering of ab initio gene models

Carson et al,

Your feedback so far has been very helpful, and we are grateful for the time you have taken to respond!

We're still trying to understand the precise procedure by which competing models are chosen. You mentioned that a single model must be chosen (via AED) from a pool of available models: are these pools constructed by overlap? It is not uncommon in our experience to see overlapping genes reported by Maker, although for the most part it appears these overlapping genes don't have CDS overlap.
Looking more closely at the Maker 2.10 output from the test data we sent yesterday, we also noted that exclusion of the transcript in question also had an effect on the interval exon structure (exon 7717-7776 becomes exon 7737-7776) of a downstream model with which it overlaps 3 nucleotides.

And still unclear to us is how the model_gff data fits in with all this. From my previous searching of the list archives I was under the impression that these models would be given substantial weight in the prediction process, and would only be altered if a considerably better model could be identified. Our experience with this small data set, though, is that which overlapping gene is reported, and which corresponding exon structure is selected, is dependent on very slight changes in the evidence.

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 12:59 PM, Daniel Standage wrote:
> This helps, thanks.
>
> --
> Daniel S. Standage
> Ph.D. Candidate
> Computational Genome Science Laboratory
> Indiana University
>
> On Fri, Jun 6, 2014 at 12:56 PM, Carson Holt wrote:
>> I got the e-mail. Thanks for the test set.
>>
>> Multiple ab initio predictors don't inform a single annotation; rather, one
>> must be chosen from the pool of available models (i.e. it has to be SNAP or
>> Augustus, or GeneMark). They all supply their own ab initio as well as
>> hint-based prediction, and then the one with the best evidence match
>> (measured by AED) is kept (it's like a competition that only one predictor
>> can win).
>>
>> If you want a consensus model instead, you can take MAKER results in GFF3
>> format and give them to Evidence Modeler (EVM). The upcoming MAKER 3.0 is a
>> collaboration with the EVM group and will have this option, but for now users
>> can just split the MAKER GFF3 by evidence types and give it to EVM. EVM then
>> produces consensus models based on the GFF3 content.
>> >> --Carson >> >> From: Daniel Standage >> Date: Friday, June 6, 2014 at 10:46 AM >> >> To: Carson Holt >> Cc: Maker Mailing List , Volker Brendel >> >> Subject: Re: [maker-devel] Filtering of ab initio gene models >> >> Good to know, thanks. If multiple ab initio predictors inform a single >> annotation, how does Maker decide which one will be included in the gene's >> ID? >> >> Given your quick response just now, I wanted to confirm that you got the >> message and data set I sent yesterday. I received an email saying the size of >> my message required list admin approval to be distributed, but since you were >> also a direct recipient of the email I didn't worry about it too much. >> >> Thanks again! >> >> >> -- >> Daniel S. Standage >> Ph.D. Candidate >> Computational Genome Science Laboratory >> Indiana University >> >> >> On Fri, Jun 6, 2014 at 12:39 PM, Carson Holt wrote: >>> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked >>> sequence without hints (i.e. the ab initio call). >>> maker-$seqid-snap-gene was produced by SNAP after receiving hints from >>> MAKER. >>> >>> In both cases MAKER is allowed to add UTR to the model (hence the >>> 'processed' tag). >>> >>> --Carson >>> >>> >>> From: Daniel Standage >>> Date: Friday, June 6, 2014 at 10:33 AM >>> To: Carson Holt >>> Cc: Maker Mailing List , Volker Brendel >>> >>> >>> Subject: Re: [maker-devel] Filtering of ab initio gene models >>> >>> Another question: is there documentation anywhere for the naming conventions >>> of the genes annotated by Maker? Of course it's easy to spot genes based on >>> a particular ab initio gene predictor, as the names are in the IDs. But what >>> is the significance of, say, "snap_masked-$seqid-processed-gene" in a gene >>> ID vs "maker-$seqid-snap-gene"? >>> >>> Thanks, >>> Daniel >>> >>> >>> -- >>> Daniel S. Standage >>> Ph.D. 
Candidate
>>> Computational Genome Science Laboratory
>>> Indiana University
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From marc.hoeppner at imbim.uu.se Mon Jun 9 02:48:01 2014
From: marc.hoeppner at imbim.uu.se (Marc Höppner)
Date: Mon, 9 Jun 2014 08:48:01 +0000
Subject: [maker-devel] Repeatmasked genome
Message-ID: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se>

Hi,

this may be an odd question, but I was wondering where, if at all, Maker reports the repeat-masked genome sequence?
As far as I can tell the fasta sequences included in the gff annotation are unmasked (?) or at least not softmasked. I guess it wouldn't be too hard to take the repeat masker features and use them to soft mask the assembly, but still...

Regards,

Marc

Marc P. Hoeppner, PhD
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at imbim.uu.se

From dence at genetics.utah.edu Mon Jun 9 09:22:13 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 9 Jun 2014 15:22:13 +0000
Subject: [maker-devel] Repeatmasked genome
In-Reply-To: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se>
References: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se>
Message-ID: <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>

Hi Marc,

The masked genome sequence is stored in the "theVoid" directory for each scaffold. There are files "query.fasta", "query.masked.fasta", and "query.masked.gff". The masked sequence is in the query.masked.fasta file and the genomic locations of the masked regions are stored in the query.masked.gff.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 9, 2014, at 2:48 AM, Marc Höppner wrote:

Hi,

this may be an odd question, but I was wondering where, if at all, Maker reports the repeat-masked genome sequence? As far as I can tell the fasta sequences included in the gff annotation are unmasked (?) or at least not softmasked. I guess it wouldn't be too hard to take the repeat masker features and use them to soft mask the assembly, but still...

Regards,

Marc

Marc P. Hoeppner, PhD
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at imbim.uu.se

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Mon Jun 9 10:11:23 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 09 Jun 2014 10:11:23 -0600
Subject: [maker-devel] Repeatmasked genome
In-Reply-To: <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>
References: <734D87F8-FDA7-49C0-8A76-DC4BD3866F5D@imbim.uu.se> <5FB241C6-535F-45EF-A218-253B63CADBCF@genetics.utah.edu>
Message-ID:

Yes. Those are all temporary files that (if you still have them) you can use to get at the masked fasta directly. Otherwise you can just use the features in the GFF3 file to remask the regions.

--Carson

From: Daniel Ence
Date: Monday, June 9, 2014 at 9:22 AM
To: Marc Höppner
Cc: "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Repeatmasked genome

Hi Marc,

The masked genome sequence is stored in the "theVoid" directory for each scaffold. There are files "query.fasta", "query.masked.fasta", and "query.masked.gff". The masked sequence is in the query.masked.fasta file and the genomic locations of the masked regions are stored in the query.masked.gff.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 9, 2014, at 2:48 AM, Marc Höppner wrote:

> Hi,
>
> this may be an odd question, but I was wondering where, if at all, Maker
> reports the repeat-masked genome sequence? As far as I can tell the fasta
> sequences included in the gff annotation are unmasked (?) or at least not
> softmasked. I guess it wouldn't be too hard to take the repeat masker features
> and use them to soft mask the assembly, but still...
>
> Regards,
>
> Marc
>
> Marc P. Hoeppner, PhD
>
> Department for Medical Biochemistry and Microbiology
> Uppsala University, Sweden
> marc.hoeppner at imbim.uu.se
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From cynsb1987 at gmail.com Mon Jun 9 22:22:47 2014
From: cynsb1987 at gmail.com (hueytyng)
Date: Tue, 10 Jun 2014 14:22:47 +1000
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level
Message-ID:

Hi Carson,

I ran Maker 2.31 on my assembled contigs. According to the "master_datastore_index.log", 529 contigs ran through before I got the error below. Maker halts after this.

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: 1AL_NODE_4659_length_8657_cov_8.758115
Length: 8739
#---------------------------------------------------------------------

setting up GFF3 output and fasta chunks
ERROR: This MpiChunk is not part of this level
 at /ws/ws-group/app/maker/bin/../lib/Process/MpiTiers.pm line 439.
 Process::MpiTiers::update_chunk(Process::MpiTiers=HASH(0x63e64c0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 1537
 main::update_chunk(ARRAY(0x40c69d0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 929
--> rank=4, hostname=safs-raijen
deleted:0 hits
deleted:0 hits

Attached is my maker_opts.ctl file.

Thank you,
Jenny
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4932 bytes
Desc: not available
URL:
From carsonhh at gmail.com Wed Jun 11 08:29:44 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 11 Jun 2014 08:29:44 -0600
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level
In-Reply-To:
References:
Message-ID:

The cause of this is most likely a corrupt MPI message. It could be random (it happens with MPI messages), in which case it should succeed on retry. It could mean you need to reinstall your MPI communicator, or give fewer nodes to mpiexec when running your job (MPICH2 starts having communication issues after around 100 processes, for example, and even sooner on some systems). It may also mean that you set MAKER up with one communicator during the installation (like MPICH2) and then used mpiexec from another communicator to launch the job (OpenMPI for example, or even a different version of MPICH2).

Make sure you are not using MVAPICH2, because MAKER won't work with MVAPICH2. Also, if you are using OpenMPI, you must preload libmpi.so, otherwise shared libraries won't work and it will fail while running MAKER. To do that you have to export the following environment variable:

export LD_PRELOAD=/lib/libmpi.so #replace with the location of OpenMPI

Also, because a corrupt message has the chance to cause other issues, you may want to completely delete the folder for the failed contig (look in the datastore_index.log to see where that folder is).

Also make sure you are using the latest version of MAKER, because it has been vetted on OpenMPI using 8000+ cpus. Earlier versions (i.e. 2.28 and below) may have issues on OpenMPI or on some systems with slow NFS storage or limited memory.

--Carson

From: hueytyng
Date: Monday, June 9, 2014 at 10:22 PM
To:
Subject: [maker-devel] ERROR: This MpiChunk is not part of this level

Hi Carson,

I ran Maker 2.31 on my assembled contigs.
According to the "master_datastore_index.log", 529 contigs ran through before I got the error below. Maker halts after this.

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: 1AL_NODE_4659_length_8657_cov_8.758115
Length: 8739
#---------------------------------------------------------------------

setting up GFF3 output and fasta chunks
ERROR: This MpiChunk is not part of this level
 at /ws/ws-group/app/maker/bin/../lib/Process/MpiTiers.pm line 439.
 Process::MpiTiers::update_chunk(Process::MpiTiers=HASH(0x63e64c0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 1537
 main::update_chunk(ARRAY(0x40c69d0), Process::MpiChunk=HASH(0x6155138)) called at /ws/ws-group/bin/maker line 929
--> rank=4, hostname=safs-raijen
deleted:0 hits
deleted:0 hits

Attached is my maker_opts.ctl file.

Thank you,
Jenny

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sjackman at gmail.com Wed Jun 11 14:44:41 2014
From: sjackman at gmail.com (Shaun Jackman)
Date: Wed, 11 Jun 2014 13:44:41 -0700
Subject: [maker-devel] Alternate translation table
Message-ID:

Hi, Carson.

I'm annotating a plastid genome. It has spliced genes, so I'm using organism_type=eukaryotic. Its translation table, however, is 11 (Bacteria, Archaea, prokaryotic viruses and chloroplast proteins). Is it possible to change the translation table?

I'm not doing any ab initio gene prediction, only homology-based annotation using protein for coding genes and est for non-coding genes. The sequences are from a very closely related species (99% identity).

Cheers,
Shaun
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From carsonhh at gmail.com Wed Jun 11 15:01:23 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 11 Jun 2014 15:01:23 -0600 Subject: [maker-devel] Alternate translation table In-Reply-To: References: Message-ID: Sorry. MAKER doesn't have an alternate codon table option. --Carson From: Shaun Jackman Reply-To: Shaun Jackman Date: Wednesday, June 11, 2014 at 2:44 PM To: "maker-devel at yandell-lab.org" Subject: [maker-devel] Alternate translation table Hi, Carson. I?m annotating a plastid genome. It has spliced genes so I?m using organism_type=eukaryotic. Its translation table however is 11 (Bacteria, Archaea, prokaryotic viruses and chloroplast proteins). Is it possible to change the translation table? I?m not doing any ab initio gene prediction, only homology-based annotation using protein for coding genes and est for non-coding genes. The sequences are from a very closely related species (99% identity). Cheers, Shaun? _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From anthony.bretaudeau at rennes.inra.fr Thu Jun 12 07:00:48 2014 From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau) Date: Thu, 12 Jun 2014 15:00:48 +0200 Subject: [maker-devel] Merging 2 annotations In-Reply-To: References: <538D8987.4090606@rennes.inra.fr> Message-ID: <5399A480.10808@rennes.inra.fr> Thank you, it works fine! A little question which is related: I set the map_forward option to 1, but it seems to work only for the model_gff gff. Is there a way to make it keep the original IDs also for the pred_gff file? Thank you Anthony On 03/06/2014 18:15, Carson Holt wrote: > You can give the manually curate ones to model_gff and the other ones to > pred_gff. Then set keep_preds=1. 
The model_gff resuls always get kept > even without evidence support, the pred_gff will be kept even without > evidence support because you set keep_preds=1, but model_gff results will > take precedence. > > --Carson > > > On 6/3/14, 2:38 AM, "Anthony Bretaudeau" > wrote: > >> Hello, >> >> I am working on the annotation of an insect genome, and I have 2 gff >> files: >> -an automatic annotation (done by another lab, with something else than >> maker, ~20000genes) >> -a manually curated annotation (with webapollo, ~1500 genes) >> >> From this, I would like to produce a single gff combining the 2. I'd >> like to keep all the manually curated models, and only the automatic >> ones that have no equivalent in the manually curated gff. >> >> Is it possible to do something like this with maker? I guess I could >> play with the model_gff option, but I'm not sure how exactly I could use >> it. >> >> Thank you for your help >> Regards >> >> Anthony >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > From dence at genetics.utah.edu Thu Jun 12 09:50:05 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Thu, 12 Jun 2014 15:50:05 +0000 Subject: [maker-devel] Merging 2 annotations In-Reply-To: <5399A480.10808@rennes.inra.fr> References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr> Message-ID: <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu> Hi Anthony, So I think that the gene ID gets changed in the process of promoting things from pred_gff to gene models. If you know which predictions you want to keep, then you can select those out and pass them to model_gff. 
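Selecting a known subset of predictions, as suggested above, could be sketched roughly as follows. This is only an illustration in Python and not part of MAKER: the helper name select_models is hypothetical, and it assumes the predictions are already in gene/mRNA/exon form, that features are linked through ID and Parent attributes, and that parents appear before their children in the file.

```python
def select_models(lines, keep_ids):
    """Yield GFF3 feature lines for the IDs in keep_ids and their descendants.

    Assumption: every parent feature is listed before its children.
    """
    kept = set(keep_ids)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        feature_id = attrs.get("ID")
        parents = [p for p in attrs.get("Parent", "").split(",") if p]
        if feature_id in kept or any(p in kept for p in parents):
            if feature_id:
                kept.add(feature_id)  # so that grandchildren are kept too
            yield line
```

The selected lines could then be written to a new file and given to model_gff, while the remaining predictions stay in pred_gff.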
~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 12, 2014, at 7:00 AM, Anthony Bretaudeau wrote: A little question which is related: I set the map_forward option to 1, but it seems to work only for the model_gff gff. Is there a way to make it keep the original IDs also for the pred_gff file?
From anthony.bretaudeau at rennes.inra.fr Thu Jun 12 10:17:11 2014 From: anthony.bretaudeau at rennes.inra.fr (Anthony Bretaudeau) Date: Thu, 12 Jun 2014 18:17:11 +0200 Subject: [maker-devel] Merging 2 annotations In-Reply-To: <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu> References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr> <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu> Message-ID: <5399D287.1090505@rennes.inra.fr> An HTML attachment was scrubbed... URL:
From carsonhh at gmail.com Thu Jun 12 10:23:06 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 12 Jun 2014 10:23:06 -0600 Subject: [maker-devel] Merging 2 annotations In-Reply-To: <5399D287.1090505@rennes.inra.fr> References: <538D8987.4090606@rennes.inra.fr> <5399A480.10808@rennes.inra.fr> <8914E245-54DB-4260-8FA8-35FFE5D71F6F@genetics.utah.edu> <5399D287.1090505@rennes.inra.fr> Message-ID: This might be a roundabout way to keep the names unaltered. Give the pred_gff models to est_gff, and still give the model_gff models to model_gff. Set est2genome=1 and single_exon=1, then add this line to the control file: est_forward=1. This option is normally used to move transcripts forward onto new assemblies, with names being drawn from the alignment. By telling MAKER that these are ESTs instead of predictions and setting the appropriate values, it will think it's moving transcripts forward, and the final result may be what you want. 
--Carson From: Anthony Bretaudeau Date: Thursday, June 12, 2014 at 10:17 AM To: Daniel Ence Cc: Carson Holt , "" Subject: Re: [maker-devel] Merging 2 annotations Yes, I think that's why the ids get changed. But I don't know which predictions I want to keep as I'm using maker to only keep the ones that are not equivalent to the models that are in the model_gff. Anthony On 12/06/2014 17:50, Daniel Ence wrote: > Hi Anthony, So I think that the gene ID gets changed in the process of > promoting things from pred_gff to gene models. If you know which predictions > you want to keep, then you can select those out and pass them to model_gff. > > > > ~Daniel > > > > > > > > Daniel Ence > > Graduate Student > > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > > > > On Jun 12, 2014, at 7:00 AM, Anthony Bretaudeau > > > wrote: > > >> A little question which is related: I set the map_forward option to 1, but it >> seems to work only for the model_gff gff. Is there a way to make it keep the >> original IDs also for the pred_gff file? >> > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sjackman at gmail.com Thu Jun 12 15:58:16 2014 From: sjackman at gmail.com (Shaun Jackman) Date: Thu, 12 Jun 2014 14:58:16 -0700 Subject: [maker-devel] Poor Exonerate gene model Message-ID: Hi, Carson. I have a case where MAKER is choosing a poor gene model when a better model exists. The two genes, psaA and psaB, are adjacent and are similar (37% exonerate score). BLASTX finds only the correct alignments of psaA and psaB. When exonerate is run, it also finds poor alignments of psaA to psaB and psaB to psaA. The result is that MAKER chooses the correct model for psaB, but picks the poor psaB model for psaA. Increasing ep_score_limit from 20 to 40 works around the issue. I think MAKER could make a better choice in this situation without that hint. 
See the attached screenshots. The first is ep_score_limit=20 and the second ep_score_limit=40. I've attached the evidence GFF. Cheers, Shaun [image: Inline images 1] [image: Inline images 3] -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 86112 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 90074 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1.gff.gz Type: application/x-gzip Size: 57657 bytes Desc: not available URL:
From saad.arif at tuebingen.mpg.de Fri Jun 13 05:03:38 2014 From: saad.arif at tuebingen.mpg.de (Saad Arif) Date: Fri, 13 Jun 2014 13:03:38 +0200 Subject: [maker-devel] Help with updating an annotation Message-ID: Dear All, I would like to use the Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to the current annotation) and was wondering if anyone had experience with this and/or suggestions regarding my questions. Briefly: I have tophat splice junctions from RNAseq data or, alternatively, cufflinks-generated transcript models (FASTA format) that I want to use as my new data (est_gff or est). I want to provide the current Ensembl annotation for gene prediction, but I want this annotation to remain unchanged. Hence, I'm not sure if I should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction, or is this just a subset of pred_gff that remains unaltered? Can we provide the same annotation for both options (pred_gff and model_gff)? Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated. 
Thanks in advance, saad
From carsonhh at gmail.com Fri Jun 13 10:59:46 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 13 Jun 2014 10:59:46 -0600 Subject: [maker-devel] Help with updating an annotation In-Reply-To: References: Message-ID: Use the cufflinks features instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged) or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms, though), and map_forward=1 can be used to pull names forward from the old model onto the new models. Thanks, Carson On 6/13/14, 5:03 AM, "Saad Arif" wrote: >Dear All, > >I would like to use Maker pipeline to expand a current annotation (new >isoforms and novel genes with respect to current annotation) and was >wondering if anyone had experience with this and or suggestions to my >questions. > >Briefly: > > I have tophat splice junctions from RNAseq data or alternatively >cufflinks generated transcript models (fasts format) that i want to use >as my new data (est_gff or est). > >I want to provide the current Ensembl annotation for gene prediction but >i want this annotation to remain unchanged. Hence, i'm not sure if i >should provide this annotation as pred_gff > or model_gff. Can the model_gff be used for gene prediction or is this >just a subset of pred_gff that remain unaltered? Can we provide the same >annotation for both options (pred_ and mod_gff)? > > > >Importantly, my main goal is to use the new RNAseq data to add more >isoforms and (any) novel genes to the existing Ensembl annotation. Any >thoughts or suggestions on how to go about this would be sincerely >appreciated. 
> > >Thanks in advance, >saad
From juefish at gmail.com Tue Jun 17 14:54:51 2014 From: juefish at gmail.com (Nathaniel Jue) Date: Tue, 17 Jun 2014 16:54:51 -0400 Subject: [maker-devel] issue with forks module Message-ID: I've been running into all kinds of issues with the implementation of forks in GMOD. I repeatedly get this error when running an MPI run of Maker: Use of each() on hash after insertion without resetting hash iterator results in undefined behavior at /data2/local_installs/perls/perl-5.18.1/lib/site_perl/5.18.1/x86_64-linux/forks.pm line 1736. I had to install an alternative version of perl with perlbrew, but everything seemed to work on the test data sets. Any thoughts on what might be the issue? Or if this is even an issue that would affect results? Thanks, Nate
From carsonhh at gmail.com Tue Jun 17 15:09:55 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 17 Jun 2014 15:09:55 -0600 Subject: [maker-devel] issue with forks module In-Reply-To: References: Message-ID: There is a change in Perl 5.18 that makes the forks.pm module incompatible. The forks.pm module maintainers have yet to update their module to resolve the issue, so it only works on Perl versions prior to 5.18. One workaround is to manually edit forks.pm line 1736 yourself. Change it from this --> $write = each %WRITE; To this (make sure to include the {} brackets) --> { no warnings qw(internal); $write = each %WRITE; } --Carson From: Nathaniel Jue Date: Tuesday, June 17, 2014 at 2:54 PM To: Subject: [maker-devel] issue with forks module I've been running into all kinds of issues with the implementation of forks in GMOD. 
I repeatedly get this error when running an MPI run of Maker: Use of each() on hash after insertion without resetting hash iterator results in undefined behavior at /data2/local_installs/perls/perl-5.18.1/lib/site_perl/5.18.1/x86_64-linux/fo rks.pm line 1736. I had to install an alternative version of perl with perlbrew but everything seemed to work on the test data sets. Any thoughts on what might be the issue? Or if this is even an issue that would affect results? Thanks, Nate _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From saad.arif at tuebingen.mpg.de Wed Jun 18 05:09:48 2014 From: saad.arif at tuebingen.mpg.de (Saad Arif) Date: Wed, 18 Jun 2014 12:09:48 +0100 Subject: [maker-devel] Help with updating an annotation In-Reply-To: References: Message-ID: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> Thank you for the response. I still have one question though, with these options: est_GFF=cufflinksout.GFF modle_GFF= ensembl reference.GFF What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding (new genes) and improving of an existing annotation? Any feedback on this would be greatly appreciated. saad On 13 Jun 2014, at 17:59, Carson Holt wrote: > Use the cufflinks instead of the tophat features (tophat tends to be > really noisy). Give the existing models to model_gff (they will then > always be kept unless something better is found). There is no option to > keep models and then just add isoforms. 
The model_gff input will either > be kept as is (unchanged), or replaced with an updated model suggested by > the evidence (the updated model may contain multiple isoforms though), and > map_forward=1 can be used to pull names forward from the old model onto > the new models. > > Thansk, > Carson > > > On 6/13/14, 5:03 AM, "Saad Arif" wrote: > >> Dear All, >> >> I would like to use Maker pipeline to expand a current annotation (new >> isoforms and novel genes with respect to current annotation) and was >> wondering if anyone had experience with this and or suggestions to my >> questions. >> >> Briefly: >> >> I have tophat splice junctions from RNAseq data or alternatively >> cufflinks generated transcript models (fasts format) that i want to use >> as my new data (est_gff or est). >> >> I want to provide the current Ensembl annotation for gene prediction but >> i want this annotation to remain unchanged. Hence, i?m not sure if i >> should provide this annotation as pred_gff >> or model_gff. Can the model_gff be used for gene prediction or is this >> just a subset of pred_gff that remain unaltered? Can we provide the same >> annotation for both options (pred_ and mod_gff)? >> >> >> >> Importantly, my main goal is to use the new RNAseq data to add more >> isoforms and (any) novel genes to the existing Ensembl annotation. Any >> thoughts or suggestions on how to go about this would be sincerely >> appreciated. 
>> >> >> Thanks in advance, >> saad
From dence at genetics.utah.edu Wed Jun 18 10:21:19 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Wed, 18 Jun 2014 16:21:19 +0000 Subject: [maker-devel] Help with updating an annotation In-Reply-To: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> Message-ID: <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> Hi Saad, MAKER doesn't treat EST or protein evidence as gene models in themselves. There's a good reason for this: aligners like blast don't guarantee complete gene models with accurate start and stop codons and splice sites. With its default settings, MAKER won't make a gene model unless there's evidence that overlaps an ab initio prediction (or something from the pred_gff option). You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output. One of the gene predictors that can run internally is SNAP. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors Let me know if that helps, or if you have more questions. ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 18, 2014, at 5:09 AM, Saad Arif wrote: Thank you for the response. 
I still have one question though, with these options: est_GFF=cufflinksout.GFF modle_GFF= ensembl reference.GFF What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding (new genes) and improving of an existing annotation? Any feedback on this would be greatly appreciated. saad On 13 Jun 2014, at 17:59, Carson Holt wrote: Use the cufflinks instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old model onto the new models. Thansk, Carson On 6/13/14, 5:03 AM, "Saad Arif" > wrote: Dear All, I would like to use Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to current annotation) and was wondering if anyone had experience with this and or suggestions to my questions. Briefly: I have tophat splice junctions from RNAseq data or alternatively cufflinks generated transcript models (fasts format) that i want to use as my new data (est_gff or est). I want to provide the current Ensembl annotation for gene prediction but i want this annotation to remain unchanged. Hence, i?m not sure if i should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction or is this just a subset of pred_gff that remain unaltered? Can we provide the same annotation for both options (pred_ and mod_gff)? 
Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated. Thanks in advance, saad _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From dence at genetics.utah.edu Wed Jun 18 11:04:26 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Wed, 18 Jun 2014 17:04:26 +0000 Subject: [maker-devel] Help with updating an annotation In-Reply-To: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de> References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de> Message-ID: Hi Saad, That seems to be right to me. You'll do one run of MAKER with the cufflinks output and est2genome turned on and train SNAP on that output. You can repeat this as many times as you want, but in my experience you don't gain much in predictive power beyond two rounds of training. Next, you'll turn on SNAP and turn off est2genome, but still include the cufflinks and proteome evidence and the ensemble models. The other ab initio predictors that maker can use internally (genemark and augustus) are worth looking into also. Genemark does a self-training thing, but can take a couple of days depending on how large your genome is. Augustus takes a lot of time and effort to train, but comes with many prebuilt training files. If one of its prebuilt files is close to your species of interest, you can just use that instead. 
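Moving from the training round to the prediction round, as described above, comes down to changing a few settings in maker_opts.ctl. The following is a minimal sketch in Python of making that change programmatically. It is not part of MAKER: the helper name set_ctl_options and the file name round1.hmm are hypothetical, the control file is assumed to consist of "key=value #comment" lines, and snaphmm is assumed to be the option that points MAKER at a trained SNAP HMM.

```python
def set_ctl_options(text, updates):
    """Return control-file text with the given key=value settings replaced.

    Trailing "#comment" text on each edited line is preserved.
    """
    out = []
    seen = set()
    for line in text.splitlines():
        key, sep, rest = line.partition("=")
        if sep and key in updates and not line.startswith("#"):
            _, hash_mark, comment = rest.partition("#")
            new_line = f"{key}={updates[key]}"
            if hash_mark:
                new_line += f" #{comment}"
            out.append(new_line)
            seen.add(key)
        else:
            out.append(line)  # untouched option, comment or blank line
    missing = set(updates) - seen
    if missing:
        # refuse to silently ignore a misspelled option name
        raise KeyError(f"options not found: {sorted(missing)}")
    return "\n".join(out) + "\n"
```

For the second round one might call set_ctl_options(ctl_text, {"est2genome": 0, "snaphmm": "round1.hmm"}) and leave the est_gff, protein and model_gff entries as they were.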
~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 18, 2014, at 10:42 AM, Saad Arif > wrote: Thanks Daniel. I think it's more clear to me now. So If I understand correctly now: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence.Alternatively, I can provide alternative ab inito predictions (for regions not present in my ensembl ref passed to model_GFF) for regions overlapping my cufflinks output via the pred_GFF option? Since i'm interested in unannotated regions, i'm also passing in reference proteomes of closely related species as protein homology evidence. As such i should be able to keep, only evidence supported predictions (for regions not present in my model_GFF and or better supported models for present regions) from my pred_GFF and merge them with Ensembl annotations from the model_GFF? Let me know if i'm still missing something here. Thanks in advance. best, Saad On 18 Jun 2014, at 17:21, Daniel Ence wrote: Hi Saad, Maker doesn't view EST or protein evidence as a gene model in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With it's default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output. 
One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors Let me know if that helps, or if you have more question ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: Thank you for the response. I still have one question though, with these options: est_GFF=cufflinksout.GFF modle_GFF= ensembl reference.GFF What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding (new genes) and improving of an existing annotation? Any feedback on this would be greatly appreciated. saad On 13 Jun 2014, at 17:59, Carson Holt wrote: Use the cufflinks instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old model onto the new models. Thansk, Carson On 6/13/14, 5:03 AM, "Saad Arif" > wrote: Dear All, I would like to use Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to current annotation) and was wondering if anyone had experience with this and or suggestions to my questions. 
Briefly: I have tophat splice junctions from RNAseq data or alternatively cufflinks generated transcript models (fasts format) that i want to use as my new data (est_gff or est). I want to provide the current Ensembl annotation for gene prediction but i want this annotation to remain unchanged. Hence, i?m not sure if i should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction or is this just a subset of pred_gff that remain unaltered? Can we provide the same annotation for both options (pred_ and mod_gff)? Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated. Thanks in advance, saad _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.priyam at qmul.ac.uk Wed Jun 18 11:44:34 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Wed, 18 Jun 2014 23:14:34 +0530 Subject: [maker-devel] errors in final gff Message-ID: Hi, I compiled all annotations generated by MAKER into a single GFF file using the gff3_merge script distributed with MAKER. While formatting this GFF for use with JBrowse, I found a few errors: 1. Three instances where two features were assigned the same id. 2. One instance where a group of three subfeatures refer to a non-existent parent. Here is the relevant portion of the GFF file: https://gist.github.com/yeban/ffaf5cd419639dd073a7 I worked around the issue temporarily for the job at hand, but I am left wondering why would these errors creep in. 
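Both kinds of error described above (duplicate IDs and Parent references to features that do not exist) can be detected before a merged GFF3 is loaded into a browser. The following is a minimal sketch in Python and not a MAKER script: the function name check_gff3 is hypothetical, and it assumes standard ID= and Parent= attributes in column 9. Because GFF3 allows one discontinuous feature (typically a CDS) to repeat its ID across lines, those types are exempt from the duplicate check by default.

```python
from collections import Counter


def check_gff3(lines, multiline_types=("CDS",)):
    """Return (duplicate_ids, orphan_parents) for an iterable of GFF3 lines."""
    id_counts = Counter()
    parent_refs = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        feature_id = attrs.get("ID")
        if feature_id is not None:
            if cols[2] in multiline_types:
                # a repeated ID is legal here, so count it at most once
                id_counts[feature_id] = max(id_counts[feature_id], 1)
            else:
                id_counts[feature_id] += 1
        parent_refs.extend(p for p in attrs.get("Parent", "").split(",") if p)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    orphans = sorted({p for p in parent_refs if p not in id_counts})
    return duplicates, orphans
```

Run on the merged file, for example with check_gff3(open("merged.gff")) where the file name is hypothetical, it returns the two lists for manual review.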
-- Priyam From carsonhh at gmail.com Wed Jun 18 12:11:49 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 18 Jun 2014 12:11:49 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: What MAKER version are you using? --Carson On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >Hi, > >I compiled all annotations generated by MAKER into a single GFF file >using the gff3_merge script distributed with MAKER. While formatting >this GFF for use with JBrowse, I found a few errors: > >1. Three instances where two features were assigned the same id. >2. One instance where a group of three subfeatures refer to a >non-existent parent. > >Here is the relevant portion of the GFF file: >https://gist.github.com/yeban/ffaf5cd419639dd073a7 > >I worked around the issue temporarily for the job at hand, but I am >left wondering why would these errors creep in. > >-- Priyam > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Wed Jun 18 15:33:08 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 18 Jun 2014 15:33:08 -0600 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: Are you passing in old data via GFF3? --Carson On 6/18/14, 12:15 PM, "Anurag Priyam" wrote: >It's version 2.31. > >-- Priyam > >On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt wrote: >> What MAKER version are you using? >> >> --Carson >> >> >> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: >> >>>Hi, >>> >>>I compiled all annotations generated by MAKER into a single GFF file >>>using the gff3_merge script distributed with MAKER. While formatting >>>this GFF for use with JBrowse, I found a few errors: >>> >>>1. Three instances where two features were assigned the same id. >>>2. One instance where a group of three subfeatures refer to a >>>non-existent parent. 
>>> >>>Here is the relevant portion of the GFF file: >>>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >>> >>>I worked around the issue temporarily for the job at hand, but I am >>>left wondering why would these errors creep in. >>> >>>-- Priyam >>> >>>_______________________________________________ >>>maker-devel mailing list >>>maker-devel at box290.bluehost.com >>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> From mhinsley at ebi.ac.uk Thu Jun 19 03:07:32 2014 From: mhinsley at ebi.ac.uk (Malcolm Hinsley) Date: Thu, 19 Jun 2014 10:07:32 +0100 Subject: [maker-devel] 'not a SCALAR reference' error In-Reply-To: References: Message-ID: <53A2A854.3000009@ebi.ac.uk> Hi I'm running maker 2.31 with mpich 3 and have run once with est and protein2genome, then trained augustus and snap and run the first iteration of ab-initio predictors, which finished cleanly with no errors/ failures. Having retrained augustus and snap I'm trying to run maker -a using the same augustus species and snap.hmm pathname... previously this has worked fine. I get a lot of errors like this (it looks like every scaffold fails): doing repeat masking ERROR: Not a SCALAR reference at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Fasta.pm line 382 thread 1. Fasta::_formatSeq(FastaSeq=HASH(0x4426298), 60) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Fasta.pm line 369 thread 1 Fasta::toFastaRef(">scaffold29 CHUNK number:0 size:100000 offset:0", REF(0x42e48f0)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/FastaChunk.pm line 217 thread 1 FastaChunk::fasta_ref(FastaChunk=HASH(0x42c76a8)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/FastaChunk.pm line 168 thread 1 FastaChunk::write_file(FastaChunk=HASH(0x42c76a8), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) 
called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/GI.pm line 3138 thread 1 GI::repeatmask(FastaChunk=HASH(0x42c76a8), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"..., "scaffold29", "", "/nfs/production/panda/ensemblgenomes/external/RepeatMasker-op"..., "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/c."..., 1, runlog=HASH(0x430e730)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 785 thread 1 Process::MpiChunk::__ANON__() called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Error.pm line 415 thread 1 eval {...} called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Error.pm line 407 thread 1 Error::subs::try(CODE(0x4437a90), HASH(0x4425fc8)) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 4215 thread 1 Process::MpiChunk::_go(Process::MpiChunk=HASH(0x426e0a0), "run", HASH(0x42a5410), 0, 1) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Process/MpiChunk.pm line 341 thread 1 Process::MpiChunk::run(Process::MpiChunk=HASH(0x426e0a0), 29) called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker line 1457 thread 1 main::node_thread("/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) called at /nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 799 thread 1 eval {...} called at /nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 799 thread 1 threads::new("threads", CODE(0x4168d70), "/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) 
called at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker line 917 thread 1
--> rank=29, hostname=ebi5-229.ebi.ac.uk
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:scaffold29

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:scaffold29

I see from the mailing list that there's a known issue with forks.pm (which is at the bottom of this stack) relating to Perl 5.18, but I'm running 5.14.

Any ideas?

On 17/06/14 22:09, Carson Holt wrote:
> There is a change in Perl 5.18 that makes the forks.pm module incompatible.
> The forks.pm module maintainers have yet to update their module to resolve
> the issue, so it only works on Perl versions prior to 5.18.
> One workaround is to manually edit forks.pm line 1736 yourself.
>
> Change it from this -->
> $write = each %WRITE;
>
> To this (make sure to include the {} brackets) -->
> {
>     no warnings qw(internal);
>     $write = each %WRITE;
> }
>
> --Carson
>

-- 
malcolm hinsley | EnsEMBL Genomes | +44 (0)1223 49 4669
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD
United Kingdom

From rbharris at uw.edu  Thu Jun 19 13:07:36 2014
From: rbharris at uw.edu (Rebecca Harris)
Date: Thu, 19 Jun 2014 14:07:36 -0500
Subject: [maker-devel] Fwd: iprscan2gff3
In-Reply-To:
References:
Message-ID:

Hi,

I'm trying to use the iprscan2gff3 script to update my final GFF3 file with annotations from InterProScan 5. I'm getting a bunch of errors similar to those another user reported, but I do not see how their issue was resolved:
https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ

I ran InterProScan on my ab initio predictions, then converted the XML to raw format. When I run iprscan2gff3 I get the errors:

Use of uninitialized value $name in hash element at /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line 1090.
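One quick sanity check in a situation like this is to tally the feature types (column 3) in the GFF3 file, for instance to see whether it holds gene/mRNA features or only match/match_part features. The following is a minimal Python sketch, not part of MAKER; the function name and the sample records are made up for illustration.

```python
from collections import Counter


def count_feature_types(lines):
    """Count the values in column 3 (feature type) of GFF3 lines.

    Hypothetical helper for illustration; it is not a MAKER script.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # An embedded FASTA section ends the feature table.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, directives and comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns.
        if len(cols) != 9:
            continue
        counts[cols[2]] += 1
    return counts


if __name__ == "__main__":
    # Made-up records, only to show the shape of the input.
    sample = [
        "##gff-version 3",
        "scaffold1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
        "scaffold1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1",
        "scaffold1\tsnap_masked\tmatch\t2000\t2500\t.\t-\t.\tID=m1",
        "scaffold1\tsnap_masked\tmatch_part\t2000\t2500\t.\t-\t.\tID=m1:p1;Parent=m1",
    ]
    for feature_type, n in sorted(count_feature_types(sample).items()):
        print(feature_type, n)
```

To run it on a real file, pass an open file handle instead of the sample list, e.g. `count_feature_types(open("my.gff3"))`, where the file name is a placeholder.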
Thanks,
Rebecca

From dence at genetics.utah.edu  Thu Jun 19 14:44:46 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Thu, 19 Jun 2014 20:44:46 +0000
Subject: [maker-devel] Fwd: iprscan2gff3
In-Reply-To:
References:
Message-ID:

Hi Rebecca,

I looked at the conversation you linked to, and it seems that Carson resolved those parsing issues in an update to MAKER. What version of MAKER are you using?

Also, in that same conversation Carson said that those errors wouldn't affect the output (because the script was parsing the mRNA features fine, but giving errors on the gene features). Does the output that you get from iprscan2gff3 seem complete?

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 19, 2014, at 1:07 PM, Rebecca Harris wrote:

Hi,

I'm trying to use the iprscan2gff3 script to update my final gff3 file with annotations from Interproscan 5. I'm getting a bunch of errors similar to another user but do not see how their issue was resolved:
https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ

I ran Interproscan on my ab initio predictions, then converted the xml to raw format. When I run iprscan2gff3 I get the errors:

Use of uninitialized value $name in hash element at /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line 1090.

Thanks,
Rebecca

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com  Thu Jun 19 14:47:27 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 19 Jun 2014 14:47:27 -0600
Subject: [maker-devel] Fwd: iprscan2gff3
In-Reply-To:
References:
Message-ID:

Also make sure there are gene/mRNA features in your GFF3 for your iprscan results. If you used the ab initio calls (which will be match/match_part features in the GFF3) as your input to iprscan, then you will need to upgrade them to gene/mRNA features before the script will add domains to them.

--Carson

From: Daniel Ence
Date: Thursday, June 19, 2014 at 2:44 PM
To: Rebecca Harris
Cc: "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Fwd: iprscan2gff3

Hi Rebecca,

I looked at the conversation you linked to, and it seems that Carson resolved those parsing issues in an update to MAKER. What version of MAKER are you using?

Also, in that same conversation Carson said that those errors wouldn't affect the output (because the script was parsing the mRNA features fine, but giving errors on the gene features). Does the output that you get from iprscan2gff3 seem complete?

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 19, 2014, at 1:07 PM, Rebecca Harris wrote:

> Hi,
>
> I'm trying to use the iprscan2gff3 script to update my final gff3 file with
> annotations from Interproscan 5. I'm getting a bunch of errors similar to
> another user but do not see how their issue was resolved:
> https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ
>
> I ran Interproscan on my ab initio predictions, then converted the xml to raw
> format. When I run iprscan2gff3 I get the errors:
> Use of uninitialized value $name in hash element at
> /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line 1090.
>
> Thanks,
> Rebecca
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From rbharris at uw.edu  Thu Jun 19 15:22:34 2014
From: rbharris at uw.edu (Rebecca Harris)
Date: Thu, 19 Jun 2014 14:22:34 -0700
Subject: [maker-devel] Fwd: iprscan2gff3
In-Reply-To:
References:
Message-ID:

Hey,

Thanks for the reply. The problem was that I didn't upgrade the matches to gene/mRNA features before running the ipr_update_gff script.

R

On Thu, Jun 19, 2014 at 1:47 PM, Carson Holt wrote:

> Also make sure there are gene/mRNA features in your GFF3 for your iprscan
> results. If you used the ab initio calls (which will be match/match_part
> features in the GFF3) as your input to iprscan, then you will need to
> upgrade them to gene/mRNA features before the script will add domains to
> them.
>
> --Carson
>
>
> From: Daniel Ence
> Date: Thursday, June 19, 2014 at 2:44 PM
> To: Rebecca Harris
> Cc: "maker-devel at yandell-lab.org"
> Subject: Re: [maker-devel] Fwd: iprscan2gff3
>
> Hi Rebecca, I looked at the conversation you linked to and it seems that Carson
> resolved those parsing issues in an update to maker. What version of
> maker are you using?
>
> Also, in that same conversation Carson said that those errors wouldn't
> affect the output (because the script was parsing the mRNA features fine,
> but giving errors on the gene features). Does the output that you get from
> iprscan2gff3 seem complete?
> > ~Daniel > > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 19, 2014, at 1:07 PM, Rebecca Harris > wrote: > > Hi, > > I'm trying to use the iprscan2gff3 script to update my final gff3 file > with annotations from Interproscan 5. I'm getting a bunch of errors similar > to another user but do not see how their issue was resolved: > > https://groups.google.com/forum/#!searchin/maker-devel/iprscan2gff3/maker-devel/MykTxL2Da64/5yrO3e6WBHUJ > > I ran Interproscan on my ab initio predictions, then converted the xml to > raw format. When I run iprscan2gff3 I get the errors: > > Use of uninitialized value $name in hash element at > /gscratch/leache/rbharris/maker/bin/ipr_update_gff line 107, <$IN> line > 1090. > Thanks, > Rebecca > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > _______________________________________________ maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.priyam at qmul.ac.uk Thu Jun 19 16:11:36 2014 From: a.priyam at qmul.ac.uk (Anurag Priyam) Date: Fri, 20 Jun 2014 03:41:36 +0530 Subject: [maker-devel] migrating annotations from old to new assembly Message-ID: Is it possible to migrate annotations from an old assembly to a new assembly using MAKER? Perhaps by setting est= to transcripts (spliced? or unspliced?) from the previous assembly and genome= to the new assembly? 
Maybe ask MAKER to use exonerate instead of BLASTN so splice junctions are accounted for better?

-- Priyam

From carsonhh at gmail.com  Thu Jun 19 16:16:01 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 19 Jun 2014 16:16:01 -0600
Subject: [maker-devel] migrating annotations from old to new assembly
In-Reply-To:
References:
Message-ID:

Here you go, this is covered in a previous post -->
https://groups.google.com/forum/#!searchin/maker-devel/est_forward/maker-devel/q9fxXGKO8mk/0ATwhJvZeI4J

--Carson

On 6/19/14, 4:11 PM, "Anurag Priyam" wrote:

>Is it possible to migrate annotations from an old assembly to a new
>assembly using MAKER?
>
>Perhaps by setting est= to transcripts (spliced? or unspliced?) from
>the previous assembly and genome= to the new assembly? Maybe ask MAKER
>to use exonerate instead of BLASTN so splice junctions are accounted
>for better?
>
>-- Priyam
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From a.priyam at qmul.ac.uk  Thu Jun 19 16:19:22 2014
From: a.priyam at qmul.ac.uk (Anurag Priyam)
Date: Fri, 20 Jun 2014 03:49:22 +0530
Subject: [maker-devel] migrating annotations from old to new assembly
In-Reply-To:
References:
Message-ID:

Wow! Thanks :). I apologise that I didn't look through the archives before asking.

-- Priyam

On Fri, Jun 20, 2014 at 3:46 AM, Carson Holt wrote:
> Here you go, this is covered in a previous post -->
> https://groups.google.com/forum/#!searchin/maker-devel/est_forward/maker-devel/q9fxXGKO8mk/0ATwhJvZeI4J
>
>
> --Carson
>
>
>
> On 6/19/14, 4:11 PM, "Anurag Priyam" wrote:
>
>>Is it possible to migrate annotations from an old assembly to a new
>>assembly using MAKER?
>>
>>Perhaps by setting est= to transcripts (spliced? or unspliced?) from
>>the previous assembly and genome= to the new assembly?
Maybe ask MAKER
>>to use exonerate instead of BLASTN so splice junctions are accounted
>>for better?
>>
>>-- Priyam
>>
>>_______________________________________________
>>maker-devel mailing list
>>maker-devel at box290.bluehost.com
>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From saad.arif at tuebingen.mpg.de  Wed Jun 18 10:42:17 2014
From: saad.arif at tuebingen.mpg.de (Saad Arif)
Date: Wed, 18 Jun 2014 17:42:17 +0100
Subject: [maker-devel] Help with updating an annotation
In-Reply-To: <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu>
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu>
Message-ID: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>

Thanks Daniel. I think it's clearer to me now.

So, if I understand correctly:

I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in MAKER with SNAP, using my Cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my Ensembl reference passed to model_gff) for regions overlapping my Cufflinks output via the pred_gff option?

Since I'm interested in unannotated regions, I'm also passing in reference proteomes of closely related species as protein homology evidence.

As such, I should be able to keep only evidence-supported predictions from my pred_gff (for regions not present in my model_gff, and/or better-supported models for regions that are present) and merge them with the Ensembl annotations from model_gff?

Let me know if I'm still missing something here.

Thanks in advance.

best,
Saad

On 18 Jun 2014, at 17:21, Daniel Ence wrote:

> Hi Saad,
>
> Maker doesn't view EST or protein evidence as a gene model in themselves. There's a good reason for this.
Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With it's default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). > > You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output. > > One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors > > Let me know if that helps, or if you have more question > > > ~Daniel > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: > >> Thank you for the response. I still have one question though, with these options: >> >> est_GFF=cufflinksout.GFF >> >> modle_GFF= ensembl reference.GFF >> >> What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to prepare ab initio gene predictions for each of these predicted 'new' genes? >> Is there a simple way to combine adding (new genes) and improving of an existing annotation? >> >> Any feedback on this would be greatly appreciated. >> >> saad >> >> On 13 Jun 2014, at 17:59, Carson Holt wrote: >> >>> Use the cufflinks instead of the tophat features (tophat tends to be >>> really noisy). Give the existing models to model_gff (they will then >>> always be kept unless something better is found). 
There is no option to >>> keep models and then just add isoforms. The model_gff input will either >>> be kept as is (unchanged), or replaced with an updated model suggested by >>> the evidence (the updated model may contain multiple isoforms though), and >>> map_forward=1 can be used to pull names forward from the old model onto >>> the new models. >>> >>> Thansk, >>> Carson >>> >>> >>> On 6/13/14, 5:03 AM, "Saad Arif" wrote: >>> >>>> Dear All, >>>> >>>> I would like to use Maker pipeline to expand a current annotation (new >>>> isoforms and novel genes with respect to current annotation) and was >>>> wondering if anyone had experience with this and or suggestions to my >>>> questions. >>>> >>>> Briefly: >>>> >>>> I have tophat splice junctions from RNAseq data or alternatively >>>> cufflinks generated transcript models (fasts format) that i want to use >>>> as my new data (est_gff or est). >>>> >>>> I want to provide the current Ensembl annotation for gene prediction but >>>> i want this annotation to remain unchanged. Hence, i?m not sure if i >>>> should provide this annotation as pred_gff >>>> or model_gff. Can the model_gff be used for gene prediction or is this >>>> just a subset of pred_gff that remain unaltered? Can we provide the same >>>> annotation for both options (pred_ and mod_gff)? >>>> >>>> >>>> >>>> Importantly, my main goal is to use the new RNAseq data to add more >>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any >>>> thoughts or suggestions on how to go about this would be sincerely >>>> appreciated. 
>>>> >>>> >>>> Thanks in advance, >>>> saad >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >>> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anurag08priyam at gmail.com Wed Jun 18 12:15:52 2014 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Wed, 18 Jun 2014 23:45:52 +0530 Subject: [maker-devel] errors in final gff In-Reply-To: References: Message-ID: It's version 2.31. -- Priyam On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt wrote: > What MAKER version are you using? > > --Carson > > > On 6/18/14, 11:44 AM, "Anurag Priyam" wrote: > >>Hi, >> >>I compiled all annotations generated by MAKER into a single GFF file >>using the gff3_merge script distributed with MAKER. While formatting >>this GFF for use with JBrowse, I found a few errors: >> >>1. Three instances where two features were assigned the same id. >>2. One instance where a group of three subfeatures refer to a >>non-existent parent. >> >>Here is the relevant portion of the GFF file: >>https://gist.github.com/yeban/ffaf5cd419639dd073a7 >> >>I worked around the issue temporarily for the job at hand, but I am >>left wondering why would these errors creep in. 
>>
>>-- Priyam
>>
>>_______________________________________________
>>maker-devel mailing list
>>maker-devel at box290.bluehost.com
>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From rajesh.bommareddy at tu-harburg.de  Thu Jun 19 02:08:45 2014
From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy)
Date: Thu, 19 Jun 2014 10:08:45 +0200
Subject: [maker-devel] Maker control files
Message-ID: <53A29A8D.5010709@tu-harburg.de>

Dear Sir/Madam

I have installed MAKER on Linux. I have tested the installation using maker -h, and it looks fine.

But I cannot create the control files using

/maker/bin ./maker -CTL

I get the following error:

Couldnot create maker_opts.ctl
control files not found

Can you please help me with a possible reason and solution?

Thanks and Regards

-- 
M.Sc. Rajesh Reddy Bommareddy
Institute of Bioprocess and Biosystems Engineering
Hamburg University of Technology (TUHH)
Denikestrasse 15, 20171 Hamburg
GERMANY.
e.mail: rajesh.bommareddy at tu-harburg.de
Phone:+4940428784011
Mobile:+4917663673522

From dence at genetics.utah.edu  Fri Jun 20 15:20:47 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Fri, 20 Jun 2014 21:20:47 +0000
Subject: [maker-devel] Maker control files
In-Reply-To: <53A29A8D.5010709@tu-harburg.de>
References: <53A29A8D.5010709@tu-harburg.de>
Message-ID: <51B8C254-A912-4CF6-B0E3-5C66E6E3E9AE@genetics.utah.edu>

Hi Rajesh,

Do you have write permissions in the directory where you're running maker?

Also, I can't tell whether you're running one command or two. If you run "maker" and there are no control files, then you'll get the "control files not found" error, but if you run ./maker -CTL and don't have permission to write to the install directory (which isn't unusual), then you'll get the "Could not create maker_opts.ctl" error.
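The write-permission check described above can be sketched in a few lines of Python. This is only an illustration, assuming that maker -CTL writes its control files into the current working directory; the helper name is hypothetical and nothing here is part of MAKER.

```python
import os
import tempfile


def can_write_here(directory="."):
    """Return True if a file can actually be created in `directory`.

    Hypothetical helper for illustration; it is not a MAKER function.
    """
    try:
        # Creating (and automatically deleting) a real temporary file is a
        # more direct test than os.access(), which only inspects permission
        # bits and can be misleading on some network filesystems.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        return False


if __name__ == "__main__":
    cwd = os.getcwd()
    if can_write_here(cwd):
        print(cwd + " is writable; control files can be created here.")
    else:
        print(cwd + " is not writable; change to a directory you own first.")
```

If the check fails inside the install directory, running the same maker -CTL command from a working directory you own (with the maker executable on your PATH or called by its full path) is the usual way around it.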
Thanks,
Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 19, 2014, at 2:08 AM, Rajesh Reddy Bommareddy wrote:

Dear Sir/Madam

I have installed Maker on Linux. I have tested the installation using maker -h. It looks fine.

But i cannot create the control files using

/maker/bin ./maker -CTL

I get the following error:

Couldnot create maker_opts.ctl
control files not found

Can you please help me with possible reason and solution

Thanks and Regards

-- 
M.Sc. Rajesh Reddy Bommareddy
Institute of Bioprocess and Biosystems Engineering
Hamburg University of Technology(TUHH)
Denikestrasse 15, 20171 Hamburg
GERMANY.
e.mail: rajesh.bommareddy at tu-harburg.de
Phone:+4940428784011
Mobile:+4917663673522

From carsonhh at gmail.com  Fri Jun 20 15:42:13 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Jun 2014 15:42:13 -0600
Subject: [maker-devel] Help with updating an annotation
In-Reply-To: <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID:

"I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?"

Not exactly. You need to supply an HMM for SNAP, or a species file for Augustus, etc. MAKER doesn't generate gene predictions; SNAP does. You cannot get updated models unless you've provided a way for those models to be updated.
MAKER will provide SNAP/Augustus with hints to make them perform better based on the evidence, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also, if you provide models in GFF3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints, just an immutable GFF3 file).

If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from the GMOD 2014 tutorial) -->
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

--Carson

From: Saad Arif
Date: Wednesday, June 18, 2014 at 10:42 AM
To: Daniel Ence
Subject: Re: [maker-devel] Help with updating an annotation

Thanks Daniel. I think it's clearer to me now.

So, if I understand correctly:

I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in MAKER with SNAP, using my Cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my Ensembl reference passed to model_gff) for regions overlapping my Cufflinks output via the pred_gff option?

Since I'm interested in unannotated regions, I'm also passing in reference proteomes of closely related species as protein homology evidence.

As such, I should be able to keep only evidence-supported predictions from my pred_gff (for regions not present in my model_gff, and/or better-supported models for regions that are present) and merge them with the Ensembl annotations from model_gff?

Let me know if I'm still missing something here.

Thanks in advance.

best,
Saad

On 18 Jun 2014, at 17:21, Daniel Ence wrote:

> Hi Saad,
>
> Maker doesn't view EST or protein evidence as a gene model in themselves.
> There's a good reason for this.
Aligners like blast don't guarantee complete > gene models, with accurate start and stop codons and splice sites. With it's > default settings maker won't make a gene model unless there's evidence that > overlaps an ab-initio prediction (or something from the pred_gff option). > > You can use est2genome to promote everything from the est_gff option to a gene > model, but this will probably give you many spurious results. What you're > saying with est2genome is, "Everything that this tool found is a complete gene > model." I don't think that's true even for cufflinks output. > > One of the gene predictors that can run internally is snap. It's really easy > to train; here's a link to a tutorial for training it: > http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMO > D_Online_Training_2014#Training_ab_initio_Gene_Predictors > > Let me know if that helps, or if you have more question > > > ~Daniel > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jun 18, 2014, at 5:09 AM, Saad Arif > wrote: > >> Thank you for the response. I still have one question though, with these >> options: >> >> est_GFF=cufflinksout.GFF >> >> modle_GFF= ensembl reference.GFF >> >> What happens to cufflinks assembled transcripts that are not confined to >> current gene loci (i.e. novel genes in cufflinks ouput)? Would i have to >> prepare ab initio gene predictions for each of these predicted 'new' genes? >> Is there a simple way to combine adding (new genes) and improving of an >> existing annotation? >> >> Any feedback on this would be greatly appreciated. >> >> saad >> >> On 13 Jun 2014, at 17:59, Carson Holt wrote: >> >>> Use the cufflinks instead of the tophat features (tophat tends to be >>> really noisy). Give the existing models to model_gff (they will then >>> always be kept unless something better is found). 
There is no option to >>> keep models and then just add isoforms. The model_gff input will either >>> be kept as is (unchanged), or replaced with an updated model suggested by >>> the evidence (the updated model may contain multiple isoforms though), and >>> map_forward=1 can be used to pull names forward from the old model onto >>> the new models. >>> >>> Thansk, >>> Carson >>> >>> >>> On 6/13/14, 5:03 AM, "Saad Arif" wrote: >>> >>>> Dear All, >>>> >>>> I would like to use Maker pipeline to expand a current annotation (new >>>> isoforms and novel genes with respect to current annotation) and was >>>> wondering if anyone had experience with this and or suggestions to my >>>> questions. >>>> >>>> Briefly: >>>> >>>> I have tophat splice junctions from RNAseq data or alternatively >>>> cufflinks generated transcript models (fasts format) that i want to use >>>> as my new data (est_gff or est). >>>> >>>> I want to provide the current Ensembl annotation for gene prediction but >>>> i want this annotation to remain unchanged. Hence, i?m not sure if i >>>> should provide this annotation as pred_gff >>>> or model_gff. Can the model_gff be used for gene prediction or is this >>>> just a subset of pred_gff that remain unaltered? Can we provide the same >>>> annotation for both options (pred_ and mod_gff)? >>>> >>>> >>>> >>>> Importantly, my main goal is to use the new RNAseq data to add more >>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any >>>> thoughts or suggestions on how to go about this would be sincerely >>>> appreciated. 
>>>>
>>>> Thanks in advance,
>>>> saad
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>>
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

From carsonhh at gmail.com  Fri Jun 20 15:46:59 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Jun 2014 15:46:59 -0600
Subject: [maker-devel] 'not a SCALAR reference' error
In-Reply-To: <53A2A854.3000009@ebi.ac.uk>
References: <53A2A854.3000009@ebi.ac.uk>
Message-ID:

Make sure you are using the latest version of MAKER, 2.31.6. Also, you may have to use MPICH2. MPICH3 is actually a different MPI protocol, and I have not had success running MAKER with it.

--Carson

On 6/19/14, 3:07 AM, "Malcolm Hinsley" wrote:

>Hi
>
>I'm running maker 2.31 with mpich 3 and have run once with est and
>protein2genome, then trained augustus and snap and run the first
>iteration of ab-initio predictors, which finished cleanly with no
>errors/ failures.
>
>Having retrained augustus and snap I'm trying to run maker -a using the
>same augustus species and snap.hmm pathname... previously this has
>worked fine.
>
>
>I get a lot of errors like this (it looks like every scaffold fails):
>
>doing repeat masking
>ERROR: Not a SCALAR reference
> at /nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib/Fasta.pm line 382 thread 1.
> Fasta::_formatSeq(FastaSeq=HASH(0x4426298), 60) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Fasta.pm >line 369 thread 1 > Fasta::toFastaRef(">scaffold29 CHUNK number:0 size:100000 >offset:0", REF(0x42e48f0)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/FastaChunk.pm >line 217 thread 1 > FastaChunk::fasta_ref(FastaChunk=HASH(0x42c76a8)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/FastaChunk.pm >line 168 thread 1 > FastaChunk::write_file(FastaChunk=HASH(0x42c76a8), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) >called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/GI.pm >line 3138 thread 1 > GI::repeatmask(FastaChunk=HASH(0x42c76a8), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"..., >"scaffold29", "", >"/nfs/production/panda/ensemblgenomes/external/RepeatMasker-op"..., >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/c."..., 1, >runlog=HASH(0x430e730)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 785 thread 1 > Process::MpiChunk::__ANON__() called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Error.pm >line 415 thread 1 > eval {...} called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Error.pm >line 407 thread 1 > Error::subs::try(CODE(0x4437a90), HASH(0x4425fc8)) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 4215 thread 1 > Process::MpiChunk::_go(Process::MpiChunk=HASH(0x426e0a0), >"run", HASH(0x42a5410), 0, 1) called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/../lib >/Process/MpiChunk.pm >line 341 thread 1 > Process::MpiChunk::run(Process::MpiChunk=HASH(0x426e0a0), 29) >called at 
>/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker >line >1457 thread 1 >main::node_thread("/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/ma >ker/v8"...) >called at >/nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/ >x86_64-linux-thread-multi/forks.pm >line 799 thread 1 > eval {...} called at >/nfs/panda/ensemblgenomes/perl/perlbrew/perls/5.14.2/lib/site_perl/5.14.2/ >x86_64-linux-thread-multi/forks.pm >line 799 thread 1 > threads::new("threads", CODE(0x4168d70), >"/gpfs/nobackup/ensembl_genomes/mhinsley/c.sonorensis/maker/v8"...) >called at >/nfs/production/panda/ensemblgenomes/external/maker/2.31_mpich3/bin/maker >line >917 thread 1 >--> rank=29, hostname=ebi5-229.ebi.ac.uk >ERROR: Failed while doing repeat masking >ERROR: Chunk failed at level:0, tier_type:1 >FAILED CONTIG:scaffold29 > >ERROR: Chunk failed at level:2, tier_type:0 >FAILED CONTIG:scaffold29 > > >I see from the mailing list that there's a known issue w/ forks..pm >(which is at the bottom of this stack) relating to perl 5.18, but I'm >running 5.14. > > >Any ideas? > > > > > >On 17/06/14 22:09, Carson Holt wrote: >> There is a change in Perl 5.18 that makes the forks.pm module >>incompatible. >> The forks.pm model maintainers have yet to update their module to >>resolve >> the issue, so it only works on perl version prior to 5.18. >> One work around it to manually edit forks.pm line 1736 yourself. 
>>
>> Change it from this -->
>> $write = each %WRITE;
>>
>> To this (make sure to include the {} brackets)-->
>> {
>>     no warnings qw(internal);
>>     $write = each %WRITE;
>> }
>>
>> --Carson
>>
>
>--
>malcolm hinsley | EnsEMBL Genomes | +44 (0)1223 49 4669
>European Bioinformatics Institute (EMBL-EBI)
>European Molecular Biology Laboratory
>Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD
>United Kingdom

From carsonhh at gmail.com Fri Jun 20 15:50:38 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Jun 2014 15:50:38 -0600
Subject: [maker-devel] errors in final gff
In-Reply-To:
References:
Message-ID:

Did you use est_forward? Also, in the example you showed, all the IDs are
unique (one says hit and the other hsp in the ID, so they are different).
Could you find the non-unique IDs causing the error?

--Carson

On 6/19/14, 2:05 AM, "Anurag Priyam" wrote:

>I used est_gff= option, which refers to a GFF file generated by
>cufflinks2gff3. The erroneous annotations didn't come from this GFF.
>
>-- Priyam
>
>On Thu, Jun 19, 2014 at 3:03 AM, Carson Holt wrote:
>> Are you passing in old data via GFF3?
>>
>> --Carson
>>
>>
>> On 6/18/14, 12:15 PM, "Anurag Priyam" wrote:
>>
>>>It's version 2.31.
>>>
>>>-- Priyam
>>>
>>>On Wed, Jun 18, 2014 at 11:41 PM, Carson Holt wrote:
>>>> What MAKER version are you using?
>>>>
>>>> --Carson
>>>>
>>>>
>>>> On 6/18/14, 11:44 AM, "Anurag Priyam" wrote:
>>>>
>>>>>Hi,
>>>>>
>>>>>I compiled all annotations generated by MAKER into a single GFF file
>>>>>using the gff3_merge script distributed with MAKER. While formatting
>>>>>this GFF for use with JBrowse, I found a few errors:
>>>>>
>>>>>1. Three instances where two features were assigned the same id.
>>>>>2. One instance where a group of three subfeatures refer to a
>>>>>non-existent parent.
>>>>>
>>>>>Here is the relevant portion of the GFF file:
>>>>>https://gist.github.com/yeban/ffaf5cd419639dd073a7
>>>>>
>>>>>I worked around the issue temporarily for the job at hand, but I am
>>>>>left wondering why would these errors creep in.
>>>>>
>>>>>-- Priyam

From carsonhh at gmail.com Fri Jun 20 15:56:46 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Jun 2014 15:56:46 -0600
Subject: [maker-devel] errors in final gff
In-Reply-To:
References:
Message-ID:

Also note that ID= must be unique. Name= does not have to be, and won't be
if the same protein or repeat element aligns to more than one location,
for example.

Thanks,
Carson

On 6/20/14, 3:50 PM, "Carson Holt" wrote:

>did you use est_forward? Also in the example you showed all the IDs are
>unique (one says hit and the other hsp in the ID, so they are different)?
>Could you find the non-unique IDs causing the error?
>
>--Carson

From a.priyam at qmul.ac.uk Tue Jun 24 12:56:41 2014
From: a.priyam at qmul.ac.uk (Anurag Priyam)
Date: Wed, 25 Jun 2014 00:26:41 +0530
Subject: [maker-devel] errors in final gff
In-Reply-To:
References:
Message-ID:

I am sorry. I have updated the gist -
https://gist.github.com/yeban/ffaf5cd419639dd073a7.

1. The first two chunks contain the annotations with duplicate ids (4 rows).
2. The last chunk contains the annotations that refer to a non-existent
   parent, and what looks like an incomplete line of annotation (I forgot
   to state this in my original email).

No, I didn't use est_forward. I am not passing in any old data via GFF3.

-- Priyam

On Sat, Jun 21, 2014 at 3:26 AM, Carson Holt wrote:
> Also note that ID= must be unique. Name= does not have to be, and won't be
> if the same protein or repeat element aligns to more than one location for
> example.
>
> Thanks,
> Carson
>
>
> On 6/20/14, 3:50 PM, "Carson Holt" wrote:
>
>>did you use est_forward? Also in the example you showed all the IDs are
>>unique (one says hit and the other hsp in the ID, so they are different)?
>>Could you find the non-unique IDs causing the error?
>>
>>--Carson
>>
>>
>>On 6/19/14, 2:05 AM, "Anurag Priyam" wrote:
>>
>>>I used est_gff= option, which refers to a GFF file generated by
>>>cufflinks2gff3.
>>>The erroneous annotations didn't come from this GFF.
>>>
>>>-- Priyam

From carsonhh at gmail.com Tue Jun 24 14:05:00 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Jun 2014 14:05:00 -0600
Subject: [maker-devel] errors in final gff
In-Reply-To:
References:
Message-ID:

Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12

The value 1026 is held in a global iterator, so it cannot repeat the same
value during the life of the process. And 1.3.0.12 is generated from the
point in the code where the ID is being generated. This means that two
distinct processes had to write to the same file at the same point in the
code, which should normally be impossible.

However, there are ways to make this happen. First, if you turn file locks
off (the -nolock option) and then run MAKER multiple times on the same
dataset, you can get process collisions (because you disabled the locks
that stop this). Second, if your NFS file system does not support hard
links (FhGFS, for example), then you cannot lock the files (which is the
same as setting -nolock). Or you have other serious IO failures over NFS.
Note that by NFS I mean your network-mounted storage.

The last example you give shows the preceding line being truncated. This
suggests that two processes are trying to write to the same file
simultaneously (inserting lines in between other lines), or that serious IO
failures are occurring where writes are not completing but true is being
returned for the operations (this can happen on unreliable NFS
implementations).

So in summary, either your NFS storage implementation is giving IO errors,
you have run MAKER with -nolock set and then started MAKER multiple times
in the same directory (process collisions), or your NFS implementation
doesn't support hard links and won't allow MAKER to lock files (process
collisions). If it is one of the latter two, you will have to make sure
you never start MAKER more than once simultaneously on the same dataset.
You can still run via MPI for parallelization, but you won't be able to
start a second MPI process while the first one is still running.

Thanks,
Carson

On 6/24/14, 12:56 PM, "Anurag Priyam" wrote:

>I am sorry. I have updated the gist -
>https://gist.github.com/yeban/ffaf5cd419639dd073a7.
>1. The first two chunks contain the annotations with duplicate ids. (4
>rows)
>2. The last chunk contains the annotations that refer to a
>non-existent parent. And what looks like an incomplete line of
>annotation (I forgot to state this in my original email).
>
>No, I didn't use est_forward. I am not passing in any old data via GFF3.
>
>-- Priyam

From a.priyam at qmul.ac.uk Wed Jun 25 15:11:22 2014
From: a.priyam at qmul.ac.uk (Anurag Priyam)
Date: Thu, 26 Jun 2014 02:41:22 +0530
Subject: [maker-devel] errors in final gff
In-Reply-To:
References:
Message-ID:

Mmm ... I didn't use the -nolock option. But I did launch some 10 MAKER
processes in the same directory.

I feel it's unlikely that my file system doesn't allow hard links,
because a few processes quit earlier than the others, saying something
to the tune of "Another MAKER process is processing this scaffold
already."

I remember one process in particular had _just_ crashed. I don't
remember how: I might have Ctrl-C'ed by mistake instead of detaching
screen? admin killed it? temporary system glitch? Could this have
caused the same issue?

-- Priyam

On Wed, Jun 25, 2014 at 1:35 AM, Carson Holt wrote:
> Thanks. For the first two --> scaffold00002:hit:1026:1.3.0.12
>
> The value 1026 is held in a global iterator, so it cannot repeat the same
> value during the life of the process. And 1.3.0.12 is generated from the
> point in the code the ID is being generated. This means that two distinct
> processes had to write to the same file at the same point in the code,
> which should normally be impossible.
>
> However, there are ways to make this happen. First if you turn file locks
> off (-nolock) option and then run MAKER multiple times on the same dataset
> you can get process collisions (because you disabled the locks that stop
> this). If your NFS file system does not support hard links (FhGFS for
> example) then you cannot lock the files (which is the same as setting
> -nolock). Or you have other serious IO failures over NFS.
> Note that NFS is your Network Mounted Storage.
>
> The last example you give shows the preceding line being truncated. This
> suggests that two processes are trying to write to the same file
> simultaneously (inserting lines in between other lines), or serious IO
> failures are occurring where writes are not completing but true is being
> returned for the operations (can happen on unreliable NFS implementations).
>
> So in summary either your NFS storage implementation is giving IO errors,
> you have run MAKER with -nolock set and then started MAKER multiple times
> in the same directory (process collisions), or your NFS implementation
> doesn't support hardlinks and won't allow MAKER to lock files (process
> collisions). If it is one of the latter two, you will have to make sure
> you never start MAKER more than once simultaneously on the same dataset.
> You can still run via MPI for parallelization, but you won't be able to
> start a second MPI process while the first one is still running.
>
> Thanks,
> Carson

From carsonhh at gmail.com Wed Jun 25 15:26:45 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 25 Jun 2014 15:26:45 -0600
Subject: [maker-devel] errors in final gff
In-Reply-To:
References:
Message-ID:

Maybe if it died in a weird way some of the processes could have continued
briefly without active locks, but I'd more likely attribute this to NFS
weirdness. Because of how network storage works, some implementations
take shortcuts (like returning success on an IO operation even though it
has not completed and may even fail later on). Or an IO operation can be
buffered and completed several seconds later (the process that called the
write operation may not even be active anymore). This is extremely common
on NFS. You should probably just start MAKER fewer times in the same
directory on your system. You may also want to start a single MAKER job
(you should use MPI to parallelize it though), and use the -a flag. This
will cause that job to just rebuild the current GFF3 and FASTA files.
That way you can clean up your current results without having to rerun
everything. It should run relatively quickly since MAKER will be able to
make use of the existing BLAST reports etc. that are already there
(exonerate will run again, but it shouldn't take too long).

--Carson

On 6/25/14, 3:11 PM, "Anurag Priyam" wrote:

>Mmm ... I didn't use -nolock option. But I did launch some 10 MAKER
>processes in the same directory.
>
>I feel it's unlikely that my file system doesn't allow hardlinks
>because a few processes quit earlier than the others, saying something
>to the tune of "Another MAKER process is processing this scaffold
>already."
>
>I remember one process in particular had _just_ crashed. I don't
>remember how: I might have Ctrl-C'ed by mistake instead of detaching
>screen? admin killed it? temporary system glitch? Could this have
>caused the same issue?
>
>-- Priyam

From a.priyam at qmul.ac.uk Wed Jun 25 15:38:17 2014
From: a.priyam at qmul.ac.uk (Anurag Priyam)
Date: Thu, 26 Jun 2014 03:08:17 +0530
Subject: [maker-devel] errors in final gff
In-Reply-To:
References:
Message-ID:

The -a option looks like just the thing I need. I will forward the concerns
about NFS to our IT team, and will definitely use MPI for parallelisation
next time. Thanks a lot :).

-- Priyam

On Thu, Jun 26, 2014 at 2:56 AM, Carson Holt wrote:
> Maybe if it died in a weird way some of the processes could have continued
> briefly without active locks, but I'd more likely attribute this to NFS
> weirdness. Because of how network storage works, some implementations
> take shortcuts (like returning success on an IO operation even though it
> has not completed and may even fail later on). Or an IO operation can be
> buffered and completed several seconds later (the process that called the
> write operation may not even be active anymore). This is extremely common
> on NFS. You should probably just start MAKER fewer times in the same
> directory on your system. You may also want to start a single MAKER job
> (you should use MPI to parallelize it though), and use the -a flag. This
> will cause that job to just rebuild the current GFF3 and FASTA files.
> That way you can clean up your current results without having to rerun
> everything. It should run relatively quickly since MAKER will be able to
> make use of the existing BLAST reports etc. that are already there
> (exonerate will run again though, but it shouldn't take too long).
>
> --Carson

From rajesh.bommareddy at tu-harburg.de Mon Jun 30 04:18:12 2014
From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy)
Date: Mon, 30 Jun 2014 12:18:12 +0200
Subject: [maker-devel] Maker gene prediction
Message-ID: <53B13964.3060608@tu-harburg.de>

Dear Sir/Madam

I have a general question regarding gene prediction and annotation in
Maker.

For example, I have a new sequence of a yeast strain, and I have to
predict and annotate the genome. Of course, I know ESTs from the same
organism will help me to predict the genes accurately, but when I want
to use EST or RNA transcripts from a closely related organism, how can I
do that in Maker, and how accurate will the prediction be?
Is the produced prediction and annotation valid? How do I check this?

Thank you and Regards
--
M.Sc. Rajesh Reddy Bommareddy
Institute of Bioprocess and Biosystems Engineering
Hamburg University of Technology (TUHH)
Denikestrasse 15, 20171 Hamburg
GERMANY.
e.mail: rajesh.bommareddy at tu-harburg.de
Phone:+4940428784011
Mobile:+4917663673522

From carsonhh at gmail.com Mon Jun 30 11:34:23 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 30 Jun 2014 11:34:23 -0600
Subject: [maker-devel] Maker gene prediction
In-Reply-To: <53B13964.3060608@tu-harburg.de>
References: <53B13964.3060608@tu-harburg.de>
Message-ID:

You can supply ESTs from a related organism to the alt_est= option. Note that this runs very slowly because both target and query have to be translated in all 6 reading frames, and it will be less sensitive (a larger threshold is needed for alignments to become statistically significant). So if you have protein evidence from a related species, use that instead of the EST evidence from a related species.

With respect to accuracy, the alignment evidence that suggests the annotation is also the experimental evidence that supports an annotation's accuracy (so it is kind of a circular argument). But the alignment evidence does provide a correlative measurement. Genes with lower AED scores better match the evidence and should be considered higher confidence, while genes with higher AED scores should be considered lower confidence (this correlation is very well supported across many organisms).

You should be aware of what is considered realistic with genome annotation. In general, for newly sequenced organisms, a genome-wide accuracy of greater than 80% is considered extremely well annotated (but this can't be measured directly, only retrospectively, i.e. once you have a future, more complete assembly and more experimental evidence to compare to).
Only a handful of genomes that have had legions of curators working for over a decade (Drosophila, for example) have accuracies of greater than 90%.

--Carson

On 6/30/14, 4:18 AM, "Rajesh Reddy Bommareddy" wrote:

>Dear Sir/Madam
>
>I have a general question regarding gene prediction and annotation in
>Maker.
>
>For example, I have a new sequence of a yeast strain, and I have to
>predict and annotate the genome. Of course, I know ESTs from the same
>organism will help me to predict the genes accurately, but when I want
>to use EST or RNA transcripts from a closely related organism, how can I
>do that in Maker, and how accurate will the prediction be? Is the
>produced prediction and annotation valid? How do I check this?
>
>Thank you and Regards
>--
>M.Sc. Rajesh Reddy Bommareddy
>Institute of Bioprocess and Biosystems Engineering
>Hamburg University of Technology (TUHH)
>Denikestrasse 15, 20171 Hamburg
>GERMANY.
>e.mail: rajesh.bommareddy at tu-harburg.de
>Phone:+4940428784011
>Mobile:+4917663673522
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
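
[Editorial sketch, not part of the original thread.] The point in the reply above, that lower AED scores indicate better agreement with the evidence, can be turned into a rough triage of a MAKER GFF3 file. The Python below is a minimal sketch under stated assumptions: it assumes the _AED attribute is present on mRNA rows (the tRNA rows quoted earlier in this archive carry _AED and _eAED in column 9), the 0.5 cutoff is an arbitrary illustrative value rather than a MAKER recommendation, and the function names and example records are hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def split_by_aed(gff3_lines, max_aed=0.5):
    """Return (confident, doubtful) lists of (ID, AED) tuples for mRNA rows.

    max_aed is a hypothetical cutoff chosen for illustration only.
    """
    confident, doubtful = [], []
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature rows
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only transcript-level rows are inspected here
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs:
            continue  # row carries no AED score
        aed = float(attrs["_AED"])
        record = (attrs.get("ID", "?"), aed)
        if aed <= max_aed:
            confident.append(record)
        else:
            doubtful.append(record)
    return confident, doubtful


# Made-up, tab-separated example rows; not taken from any real MAKER run.
EXAMPLE = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=gene1",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-mRNA-1;Parent=gene1;_AED=0.12;_eAED=0.15",
    "chr1\tmaker\tmRNA\t2000\t2600\t.\t-\t.\tID=gene2-mRNA-1;Parent=gene2;_AED=0.87;_eAED=0.90",
]

confident, doubtful = split_by_aed(EXAMPLE, max_aed=0.5)
print("better supported by evidence:", confident)
print("weaker evidence support:", doubtful)
```

Such a split only ranks annotations by how well they agree with the supplied evidence; as noted in the reply, it is a correlative measure, not a direct measurement of accuracy.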