From carsonhh at gmail.com Mon Aug 3 14:10:48 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 3 Aug 2015 13:10:48 -0600
Subject: [maker-devel] Short introns
In-Reply-To:
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
 <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>
Message-ID: <8548067E-FCBB-4E7D-A45F-70D6A1F62BF6@gmail.com>

Thanks for the patch. I plan on adding the min_intron option as well as a snoscan bug fix to the stable release of MAKER soon. No plans for a GitHub move at present.

--Carson

> On Jul 30, 2015, at 5:21 PM, Shaun Jackman wrote:
>
> Hi, Carson.
>
> Increasing $min_intron to 250 did the trick! Thanks for the tip. For this organellar genome, all the introns (at least that I've found) are group II self-splicing introns, so they have a much larger minimum size (about 990 bp) than introns in other genomes. https://en.wikipedia.org/wiki/Group_II_intron
> Increasing $min_intron worked perfectly for all but one gene. Decreasing --codongapopen and --codongapextend rescued that one gene, but I'll probably just leave the defaults and annotate that one gene by hand.
>
> Note that $min_intron had no effect on protein2genome without the following patch. Will there be a stable release soon that includes this patch and the min_intron option? I'm preparing a manuscript for submission, and I'd love to be able to refer to a stable version of MAKER in the manuscript that includes this feature.
>
> Have you considered moving the MAKER development to GitHub?
>
> Thanks again.
> Cheers,
> Shaun
>
> diff --git a/protein.pm.orig b/protein.pm
> --- a/protein.pm.orig
> +++ b/protein.pm
> @@ -94,11 +94,11 @@ sub runExonerate {
>
>      my $command = "$exe -q $q_file -t $t_file -Q protein -T dna ";
>      $command .= "-m protein2genome --softmasktarget ";
> +    $command .= " --minintron $min_intron --maxintron $max_intron --showcigar";
>      $command .= " --percent $percent";
>      if ($matrix) {
>          $command .= " --proteinsubmat $matrix";
>      }
> -    $command .= " --showcigar ";
>      $command .= " > $o_file";
>
>      my $w = new Widget::exonerate::protein2genome();
>
> --
> http://sjackman.ca/
> On 2015-July-16 at 13:10:13, Carson Holt (carsonhh at gmail.com) wrote:
>
>> I can add it to the development version.
>>
>> --Carson
>>
>>> On Jul 16, 2015, at 1:11 PM, Shaun Jackman wrote:
>>>
>>> Hi, Carson.
>>>
>>> One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp. The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG. The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33.
>>>
>>> I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary.
>>>
>>> I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest.
>>>
>>> I'll try tweaking the $min_intron parameter. Could this parameter be added to the maker_opts.ctl configuration file?
>>>
>>> Thanks for your help, Carson. Cheers,
>>> Shaun
>>>
>>> --
>>> http://sjackman.ca/
>>> On 2015-July-16 at 10:55:32, Carson Holt (carsonhh at gmail.com) wrote:
>>>
>>>> Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion, because the polished alignment will have valid splice sites. It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However, if it looks more like a forced mapping around non-canonical splice sites (which shouldn't actually produce protein2genome results), then I might support the idea that it's an insertion. A 250 bp intron or even a 100 bp intron doesn't really seem that short to me. The lower range seen in fungi, for example (which have very short introns), can get close to about 20 bp. I guess it's possible that the protein evidence you are using contains an intron that isn't really there, which results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site).
>>>>
>>>> The minimum intron given to exonerate for polishing is 20. It's hard coded, and you would have to manually edit it.
>>>>
>>>> Line 1534 in maker/lib/GI.pm --> my $min_intron = 20;
>>>>
>>>> --Carson
>>>>
>>>>> On Jul 15, 2015, at 6:36 PM, Shaun Jackman wrote:
>>>>>
>>>>> Hi, Carson.
>>>>>
>>>>> I'm using protein evidence and protein2genome alone, without ab initio gene finders, to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER?
>>>>>
>>>>> 6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes.
>>>>>
>>>>> Thanks,
>>>>> Shaun
>>>>>
>>>>> --
>>>>> http://sjackman.ca/
>>>>>
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com
>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From palmierinico at gmail.com Tue Aug 4 09:46:38 2015
From: palmierinico at gmail.com (Nicola Palmieri)
Date: Tue, 4 Aug 2015 16:46:38 +0200
Subject: [maker-devel] MAKER: single exons not predicted
Message-ID:

Dear MAKER developers,

I am using MAKER to annotate a new protozoan species (Cystoisospora suis). I am trying to incorporate TopHat/Cufflinks evidence from RNA-Seq, but some genes still do not get annotated.

I attached a screenshot from IGV, in which I show separately Augustus, Cufflinks and then various runs of MAKER incorporating TopHat and/or Cufflinks as evidence. The single exon genes are consistently not predicted (e.g. the gene under the label g.38.t1). I have also played around with the single_exon option. I didn't find any similar post in the wiki.

I would really appreciate some hints on how to proceed.

Kind regards,
Nicola

--
Nicola Palmieri
Postdoctoral fellow
Institut für Parasitologie
Vetmeduni Vienna
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2015-08-04 at 16.39.26.png
Type: image/png
Size: 230184 bytes
Desc: not available
URL:

From dence at genetics.utah.edu Tue Aug 4 11:03:03 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Tue, 4 Aug 2015 16:03:03 +0000
Subject: [maker-devel] MAKER: single exons not predicted
In-Reply-To:
References:
Message-ID:

Hi Nicola, I don't think that Augustus will predict single-exon genes in most cases. That's probably responsible for your results.
~Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From carsonhh at gmail.com Tue Aug 4 11:37:41 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 4 Aug 2015 10:37:41 -0600
Subject: [maker-devel] MAKER: single exons not predicted
In-Reply-To:
References:
Message-ID: <62FFE93E-E54B-4F8B-AE16-9C63CB43DF9E@gmail.com>

Most single exon alignments from ESTs or mRNA-seq will actually be spurious in nature. It's the reason why single exon alignments are ignored by default. Also, as Daniel mentioned, gene predictors do not like to call single exon genes. MAKER gets around this by supplying hints to the predictor based on alignment evidence, which increases the probabilities of the HMM used by the predictor within a given region. But given the spurious nature of single exon nucleotide alignments, you will also need protein alignment support for single exon genes for this to work.

Also, the region you gave as an example does not appear to be an open reading frame. In fact, almost the entire region that overlaps the Augustus call is labeled as UTR. It looks like a classic spurious alignment, likely an unmasked repetitive element.

--Carson
From xh33 at georgetown.edu Thu Aug 20 10:37:19 2015
From: xh33 at georgetown.edu (Xin Huang)
Date: Thu, 20 Aug 2015 11:37:19 -0400
Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser
Message-ID:

Dear Developer,

I was trying to use a mixed GFF3 file to do some analysis. It includes genome-level gene predictions and manually curated annotations. I replaced the gene predictions with MAKER annotations where appropriate. A sample of the blended GFF3 file is as follows:

scaffold10370 GLEAN mRNA 3918 18376 0.920154 + . ID=CCG000529.1;source_id=Aalb_GLEAN_10002648;
scaffold10370 GLEAN CDS 3918 3983 . + 0 Parent=CCG000529.1;
scaffold10370 GLEAN CDS 4521 4598 . + 0 Parent=CCG000529.1;
scaffold10370 GLEAN CDS 17407 17516 . + 0 Parent=CCG000529.1;
scaffold10370 GLEAN CDS 17588 17767 . + 1 Parent=CCG000529.1;
scaffold10370 GLEAN CDS 18238 18376 . + 1 Parent=CCG000529.1;
scaffold10370 maker gene 32452 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=AAEL004146
scaffold10370 maker mRNA 32452 54508 44204 - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;aED=0.00;eAED=0.00;qI=376|1|1|1|0|0|5|316|548
scaffold10370 maker exon 32452 33011 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:exon:15;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker exon 35140 35388 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker exon 35455 36119 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker exon 36443 36777 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker exon 53979 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker five_prime_UTR 54133 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:five_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 53979 54132 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 36443 36777 . - 2 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 35455 36119 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 35140 35388 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 32768 33011 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker three_prime_UTR 32452 32767 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:three_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2

This file does not pass the GFF3 online validator from GenomeTools, which gives the following error message:

GenomeTools error: Parent "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" on line 1456 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/Aalbo.gff" was not defined (via "ID=")

And when I tried to use gff3toGenePred to convert the GFF3 to the refFlat format, I got the following parsing errors (samples):

Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12" Parent attribute

Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11" Parent attribute

Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10" Parent attribute

Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9" Parent attribute

Can't find annotation record "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-2" referenced by "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-1:exon:114" Parent attribute

Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:118" Parent attribute

Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:119" Parent attribute

Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:120" Parent attribute

Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:121" Parent attribute

What could have been the issue in the annotation file or the way I used the file? Any feedback will be highly appreciated!

Sincerely,

Xin Huang

From carsonhh at gmail.com Tue Aug 25 11:51:59 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 25 Aug 2015 10:51:59 -0600
Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser
In-Reply-To:
References:
Message-ID:

You have edited the file in a way that broke it. Gene maker-scaffold10370-exonerate_est2genome-gene-0.0 previously had two separate transcripts (mRNA-1 and mRNA-2). You have deleted mRNA-1, but some of its child exons still exist. Notice that some exons have two parents, and you have deleted one of the parents without removing the parent relationship.

--Carson

From xh33 at georgetown.edu Tue Aug 25 12:04:37 2015
From: xh33 at georgetown.edu (Xin Huang)
Date: Tue, 25 Aug 2015 13:04:37 -0400
Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser
In-Reply-To:
References:
Message-ID:

Thank you very much for your response. Yes, that is the issue, and after it had been fixed, the file went through the program just fine.

Thanks,
Xin
From jcornel3 at asu.edu Tue Aug 25 13:38:05 2015
From: jcornel3 at asu.edu (John Cornelius)
Date: Tue, 25 Aug 2015 11:38:05 -0700
Subject: [maker-devel] CEGMA vs. initial MAKER output for SNAP training
Message-ID:

Hello, I was wondering if it would be better to generate the first HMM for SNAP training from CEGMA results, or if I should use the output from my initial MAKER run? Thanks.

--
John Cornelius
MCB PhD Candidate
Arizona State University

From carsonhh at gmail.com Tue Aug 25 13:45:52 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 25 Aug 2015 12:45:52 -0600
Subject: [maker-devel] CEGMA vs. initial MAKER output for SNAP training
In-Reply-To:
References:
Message-ID: <73C81426-2B40-42D9-8150-F152F18B23FD@gmail.com>

Either works. Both just count as a starting point (you just need something). Then you will run MAKER once more with this new HMM and do a final round of bootstrap training using the results of that last MAKER run. Regardless of what you start with, results will converge when you do the last bootstrap training step.
?Carson > On Aug 25, 2015, at 12:38 PM, John Cornelius wrote: > > Hello, I was wondering if it would be better to generate the first hmm for SNAP training from CEGMA results or if I should use the output from my initial MAKER run? Thanks. > > -- > John Cornelius > MCB PhD Candidate > Arizona State University > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From johnbracht at gmail.com Thu Aug 27 11:36:41 2015 From: johnbracht at gmail.com (John Bracht) Date: Thu, 27 Aug 2015 12:36:41 -0400 Subject: [maker-devel] maker on a nematode: few novel proteins Message-ID: Hi Carson, all, I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas. Briefly, I'm annotating a novel nematode genome that appears to have a reduced genome (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data so I'm running ab-initio predictors SNAP and Augustus supplemented with lots of protein hints: swiss-prot and a combined library of 28 nematode proteomes. Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the swiss-prot database to the genome (no nematode proteins used in this round), which output a .gff file with swiss-prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker. Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly). 
We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from wormbase) and inputting the gff file containing the swiss-prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins.

Maker Round 2: The output of Round 1 was then filtered for protein hints giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus. We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the swiss-prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc (some introns are on the small side but that's also consistent with a reduced genome size).

My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast-match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast-match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training.
On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast).

In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 is relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the high repetitive nature of this genome the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are not obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense) but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins.

For the record, the N50 of our assembly is about 50kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes.

So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any and all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful.

Thanks,
John
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4729 bytes
Desc: not available
URL:

From yincl2013 at 126.com Sun Aug 30 03:52:09 2015
From: yincl2013 at 126.com (尹传林)
Date: Sun, 30 Aug 2015 16:52:09 +0800
Subject: [maker-devel] maker problem
Message-ID: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com>

Hello,

When I used MAKER in a genome annotation pipeline, I ran into a serious problem. My genome has 7282 scaffolds; all of them succeeded except one, which failed. I do not know why. I have checked the log, but I still can't fix the problem, so I need your help and would appreciate it very much.

Here is part of the main error log:

WARNING: Cannot find >0, trying to re-index the fasta.
stop here:0
ERROR: Fasta index error
 at /disk/yinchuanlin/software/maker/maker/bin/../lib/GI.pm line 1622.
GI::polish_exonerate(FastaChunk=HASH(0x7fe0263563b8), FastaSeq=HASH(0x3f7b280), 2988503, ">contig1", ARRAY(0x3d3ad80), ARRAY(0x3d17ed0), "/disk/yinchuanlin/01_pingguodue_project/02_OMIGA/09_maker_wor"..., "e", "/disk/yinchuanlin/software/maker/exonerate-2.2.0-x86_64/bin/e"..., ...)
called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 1986
Process::MpiChunk::__ANON__() called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 415
eval {...} called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 407
Error::subs::try(CODE(0x3f381b8), HASH(0x3f62380)) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 4224
Process::MpiChunk::_go(Process::MpiChunk=HASH(0x3e30020), "run", HASH(0x7fe026352410), 2, 3) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 341
Process::MpiChunk::run(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 357
Process::MpiChunk::run_all(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287
Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3d43310), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287
Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3f28b28), 0) called at /disk/yinchuanlin/software/maker/maker/bin/maker line 686
--> rank=NA, hostname=node7
ERROR: Failed while polishig ESTs
ERROR: Chunk failed at level:2, tier_type:3
FAILED CONTIG:contig1

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:contig1

Thanks!

Best regards,
Ian
Department of Entomology, College of Plant Protection, Nanjing Agricultural University
No. 1, Weigang Road, Xuanwu District, Nanjing, Jiangsu 210095, China
From carsonhh at gmail.com Sun Aug 30 12:09:24 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:09:24 -0600
Subject: [maker-devel] maker problem
In-Reply-To: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com>
References: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com>
Message-ID:

Separate out just that contig into a separate file and run it by itself. You may also want to rename the contig. Don't call it 0. Having the name simply be a zero has the potential to cause any number of problems in MAKER or any of the programs MAKER uses, because 0 all by itself means "false" in most computer languages. Try contig_0 instead. It looks like the first error may be the BioPerl indexer failing because the contig is named 0.

--Carson
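A minimal Python sketch of the renaming step suggested above, assuming the problem sequence ID is exactly `0`. The function name `rename_zero_ids` and the `contig_` prefix are illustrative choices and not part of MAKER; the rewritten FASTA would then be given to MAKER in place of the original.

```python
import io


def rename_zero_ids(fasta_in, fasta_out, prefix="contig_"):
    """Copy a FASTA stream, prefixing any sequence ID that is exactly "0".

    Returns a list of (old_id, new_id) pairs so the renaming can be logged.
    Only the ID (first word of the header) is changed; any description
    after it and all sequence lines are written through unchanged.
    """
    renamed = []
    for line in fasta_in:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            parts = header.split(None, 1)
            seq_id = parts[0] if parts else ""
            desc = parts[1] if len(parts) > 1 else ""
            if seq_id == "0":
                new_id = prefix + seq_id
                renamed.append((seq_id, new_id))
                seq_id = new_id
            line = ">" + seq_id + (" " + desc if desc else "") + "\n"
        fasta_out.write(line)
    return renamed


if __name__ == "__main__":
    # Tiny in-memory example; real use would open the genome FASTA files.
    src = io.StringIO(">0\nACGT\n>contig1 example\nGGCC\n")
    dst = io.StringIO()
    print(rename_zero_ids(src, dst))
    print(dst.getvalue(), end="")
```

If other tools in the pipeline already refer to the old name, the returned list of pairs gives a record of what was changed.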
From carsonhh at gmail.com Sun Aug 30 12:31:38 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:31:38 -0600
Subject: [maker-devel] maker on a nematode: few novel proteins
In-Reply-To:
References:
Message-ID:

Hi John,

MAKER requires that at least some form of evidence support each model. This is because ab initio predictors like Augustus and SNAP tend to overpredict, in extreme cases by as much as 10-fold. So if you do not have any mRNA-based evidence, the chance that you would find something novel is extremely limited. This may be further complicated by the fact that nematodes tend to be highly divergent at the amino acid level, more than would be expected from their evolutionary relationship to other organisms. Their molecular clock is highly accelerated, and even making a phylogenetic tree using nematodes is a nightmare because of the extremely long branch lengths they generate. So you may miss a lot of orthologs simply because proteins won't align across such divergence. A classic example is smg-5 in C. elegans, which will not align to its supposed orthologs in other eukaryotes at the amino acid level, but appears to have a conserved function and domain structure that is only detectable at a higher level.

You may be able to rescue a number of rejected gene models by using InterProScan to identify protein domains from the non_overlapping_ab_initio FASTA files. In the absence of mRNA evidence, that really would be the best way of capturing as much as possible.

Also keep in mind that if the nematode you are working with is parasitic, then its evolution will be dominated by genome reduction as opposed to evolving new and novel genes. So having a smaller gene count with fewer novel genes will be expected. In fact the major trend in all Deuterostome evolution tends to be gene loss as opposed to gene gain.
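The InterProScan rescue step described above could be sketched in Python roughly as follows. This assumes InterProScan was run on the ab initio protein FASTA with tab-separated (TSV) output, where column 1 is the protein ID and column 4 is the analysis name (for example Pfam). The function name `ids_with_domains` is hypothetical, and how the selected models are handed back to MAKER is not shown here.

```python
import io


def ids_with_domains(tsv_handle, analyses=("Pfam",)):
    """Return the set of protein IDs with at least one hit from `analyses`.

    Assumes InterProScan TSV layout: column 1 = protein accession,
    column 4 = analysis (member database) name.
    """
    keep = set()
    for line in tsv_handle:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip blank or malformed lines
        if cols[3] in analyses:
            keep.add(cols[0])
    return keep


if __name__ == "__main__":
    # Two made-up rows: one Pfam hit, one coiled-coil prediction.
    sample = (
        "pred1\tmd5a\t120\tPfam\tPF00069\tPkinase\t5\t110\t1e-20\tT\t30-08-2015\n"
        "pred2\tmd5b\t80\tCoils\tCoil\t\t1\t20\t-\tT\t30-08-2015\n"
    )
    print(sorted(ids_with_domains(io.StringIO(sample))))
```

The resulting IDs are only a shortlist of ab initio predictions that carry a recognizable domain; they still deserve manual inspection before being added to the final annotation set.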

Thanks,
Carson

From carsonhh at gmail.com Sun Aug 30 12:35:43 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:35:43 -0600
Subject: [maker-devel] maker on a nematode: few novel proteins
In-Reply-To:
References:
Message-ID: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com>

Sorry, I meant to say Protostome, not Deuterostome (last sentence).

--Carson
URL: From carsonhh at gmail.com Sun Aug 30 12:44:24 2015 From: carsonhh at gmail.com (Carson Holt) Date: Sun, 30 Aug 2015 11:44:24 -0600 Subject: [maker-devel] maker on a nematode: few novel proteins In-Reply-To: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com> References: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com> Message-ID: Should have said. The major trend in all Protostome evolution tends to be gene loss as opposed to gene gain. And the process of gene loss is more pronounced in the Ecdysozoa (insects and nematodes) than in the Lophotrochozoa (mollusks and flat worms). ?Carson > On Aug 30, 2015, at 11:35 AM, Carson Holt wrote: > > Sorry I meant to say Protostome not Deuterostomes (last sentence). > > ?Carson > >> On Aug 30, 2015, at 11:31 AM, Carson Holt > wrote: >> >> Hi John, >> >> MAKER requires that at least some form of evidence support each model. This is because ab initio predictors like Augustus and SNAP tend to over predict, in extreme cases by as much as 10 fold. So if you do not have any mRNA based evidence the chance that you would find something novel is extremely limited. This may be further complicated by the fact that nematodes tend to be highly divergent at the amino acid level, more than would be expected from their evolutionary relationship to other organisms. Their molecular clock is highly accelerated, and even making a phylogenic tree using nematodes is a nightmare because of the extremely long branch lengths they generate. So you may miss a lot of orthologs simply because proteins won?t align across such divergence. A classic example is smg-5 in C. elegans which will not align to its supposed orthologs in other eukaryotes at the amino acid level, but appears to have a conserved function and domain structure that is only detectable at a higher level. >> >> You may be able to rescue a number of rejected gene models by using InterProScan to identify protein domains from the non-overlapping-ab-inits fasta files. 
In the absence of mRNA evidence, that really would be the best way of capturing as much as possible. >> >> Also keep in mind that if the nematode you are working with is parasitic, then it?s evolution will be dominated by genome reduction as apposed to evolving new and novel genes. So having a smaller gene count with fewer novel genes will be expected. In fact the major trend in all Deterstome evolution tends to be gene loss as opposed to gene gain. >> >> Thanks, >> Carson >> >> >> >>> On Aug 27, 2015, at 10:36 AM, John Bracht > wrote: >>> >>> Hi Carson, all, >>> >>> I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas. >>> >>> Briefly, I'm annotating a novel nematode genome that appears to have a reduced genome (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data so I'm running ab-initio predictors SNAP and Augustus supplemented with lots of protein hints: swiss-prot and a combined library of 28 nematode proteomes. >>> >>> Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the swiss-prot database to the genome (no nematode proteins used in this round), which output a .gff file with swiss-prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker. >>> >>> Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly). 
We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from wormbase) and inputting the gff file containing the swiss-prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins. >>> >>> Maker Round 2: The output of Round 1 was then filtered for protein hints giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus. We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the swiss-prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc (some introns are on the small side but that's also consistent with a reduced genome size). >>> >>> My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast-match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast-match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training. 
On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast). >>> >>> In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 is relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the high repetitive nature of this genome the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are not obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense) but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins. >>> >>> For the record, the N50 of our assembly is about 50kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes. >>> >>> So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any an all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful. 
>>>
>>> Thanks,
>>> John
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>

From ccheng at jcvi.org Mon Aug 31 09:47:24 2015
From: ccheng at jcvi.org (Cheng, Chia-Yi)
Date: Mon, 31 Aug 2015 10:47:24 -0400
Subject: [maker-devel] AED scores from MAKER pipeline - deterministic or not?
Message-ID:

Hello MAKER team,

We at JCVI have been using MAKER (2.31.8) to calculate the AED of Arabidopsis gene models. We provided the annotation set as "model_gff", with the evidence files in "protein_gff" and "est_gff". All the other settings were default. One issue I've noticed is that the AED scores do not seem to be deterministic. When I compare the AED scores from two runs using identical control files, ~1,000 (out of 35,385) gene models have different AED scores. The difference between the two sets of AED scores ranges from 0.01 to 1.00.

I looked into several gene models with a larger difference, i.e. AED = 0.00 in run 1 and AED = 1.00 in run 2, and noticed a disagreement in the QI:

Run 1: _AED=0.00;_eAED=-0.00;_QI=0|-1|0|1|-1|0|1|0|344
Run 2: _AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|0|344

The discrepancy in the 4th column seems to suggest the evidence file was not used properly in run 2. I'm not sure what may have caused this, as both runs used the same input. A snapshot of the evidence files is pasted at the end of the email in case it is needed.

Please let me know if more info is needed. Any help is appreciated. Thank you.

Chia-Yi

RNA-seq evidence file:
Chr1 assembler-aerial2_pasa cDNA_match 3624 5927 . + . ID=aerial2_align_161343;Target=asmbl_1 1082 1234 +%2Casmbl_1 692 1081 +%2Casmbl_1 572 691 +%2Casmbl_1 1 290 +%2Casmbl_1 291 571 +%2Casmbl_1 1235 1723 +
Chr1 assembler-aerial2_pasa match_part 3624 3913 . + . ID=aerial2_align_161343-1;Parent=aerial2_align_161343
Chr1 assembler-aerial2_pasa match_part 3996 4276 . + . ID=aerial2_align_161343-2;Parent=aerial2_align_161343

EST evidence file:
Chr1 est2genome expressed_sequence_match 5470 5899 2150 - . ID=Chr1:hit:213:3.2.0.0;Name=gi|19829901|gb|AV795918|RAFL08-19-M04
Chr1 est2genome match_part 5470 5899 2150 - . ID=Chr1:hsp:500:3.2.0.0;Parent=Chr1:hit:213:3.2.0.0;Target=gi|19829901|gb|AV795918|RAFL08-19-M04 2 431 +;Gap=M430

Protein evidence file:
Chr1 protein2genome protein_match 3760 5284 727 + . ID=Chr1:hit:202:3.10.0.0;Name=UniRef90_M4EWW1
Chr1 protein2genome match_part 3760 3913 727 + . ID=Chr1:hsp:488:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 1 50;Gap=M31 D1 M19 F1
Chr1 protein2genome match_part 3996 4276 727 + . ID=Chr1:hsp:489:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 51 144;Gap=R1 M23 D1 M28 D1 M36 I2 M5

From carsonhh at gmail.com Mon Aug 31 10:08:34 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Aug 2015 09:08:34 -0600
Subject: [maker-devel] AED scores from MAKER pipeline - deterministic or not?
In-Reply-To:
References:
Message-ID: <81B272E9-3439-49C6-96E0-674B03F87569@gmail.com>

I would have to see the actual GFF3 files (the full files, including the FASTA at the end). Give me both GFF3 files and the coordinates of the gene in question.

My first guess is that you had the single_exon= filter set to different values on each run. The gene in question is an unspliced single exon gene (based on the QI), your primary piece of evidence appears to be a single exon EST, and the only value that changes in the QI is the exon overlap. Single exon evidence will be ignored by default for the AED calculation unless you have single_exon set to 1.
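
To see which QI columns differ between two runs for a given gene model, a small script along these lines could help. This is only a sketch: the function names are made up for illustration and are not part of MAKER, and it assumes the attributes are written with the keys _AED and _QI exactly as in the two lines from the original report.

```python
def parse_attributes(column9):
    """Split a GFF3-style attribute string (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def qi_differences(attrs_a, attrs_b):
    """Return the 1-based QI column numbers whose values differ."""
    qi_a = attrs_a["_QI"].split("|")
    qi_b = attrs_b["_QI"].split("|")
    return [i + 1 for i, (a, b) in enumerate(zip(qi_a, qi_b)) if a != b]


# The two attribute strings reported for the same gene model in run 1 and run 2.
run1 = parse_attributes("_AED=0.00;_eAED=-0.00;_QI=0|-1|0|1|-1|0|1|0|344")
run2 = parse_attributes("_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|0|344")

aed_changed = float(run1["_AED"]) != float(run2["_AED"])
changed_columns = qi_differences(run1, run2)
print(aed_changed, changed_columns)
```

For the two strings above, only the 4th QI column differs, which matches the observation that the exon overlap value is the one that changed.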
Thanks, Carson > On Aug 31, 2015, at 8:47 AM, Cheng, Chia-Yi wrote: > > Hello MAKER team, > > We at JCVI have been using MAKER (2.31.8) to calculate the AED of Arabidopsis gene models. We provided the annotation set as ?model_gff? with evidence file in ?protein_gff? and ?est_gff?. All the other settings were default. One issue I?ve noticed was that the AED scores did not seem to be deterministic. When I compare the AED scores from two runs using identical control files, ~1,000 (out of 35,385) gene models had different AED scores. The difference between two sets of AED scores could range from 0.01 to 1.00. > > I looked into several gene models with lager difference, i.e. AED = 0.00 in run 1 and AED = 1.00 in run 2, and noticed a disagreement in the QI: > > Run 1: _AED=0.00;_eAED=-0.00;_QI=0|-1|0|1|-1|0|1|0|344 > Run 2: _AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|0|344 > > The discrepancy in the 4th column seemed to suggest the evidence file was not used properly in run 2. I?m not sure what may have caused as both runs have used the same input. A snapshot of the evidence files are pasted in the end of the email in case needed. > > Please let me know if more info is needed. Any help is appreciated. Thank you. > > Chia-Yi > > > RNA-seq evidence file: > Chr1 assembler-aerial2_pasa cDNA_match 3624 5927 . + . ID=aerial2_align_161343;Target=asmbl_1 1082 1234 +%2Casmbl_1 692 1081 +%2Casmbl_1 572 691 +%2Casmbl_1 1 290 +%2Casmbl_1 291 571 +%2Casmbl_1 1235 1723 + > Chr1 assembler-aerial2_pasa match_part 3624 3913 . + . ID=aerial2_align_161343-1;Parent=aerial2_align_161343 > Chr1 assembler-aerial2_pasa match_part 3996 4276 . + . ID=aerial2_align_161343-2;Parent=aerial2_align_161343 > > EST evidence file: > Chr1 est2genome expressed_sequence_match 5470 5899 2150 - . ID=Chr1:hit:213:3.2.0.0;Name=gi|19829901|gb|AV795918|RAFL08-19-M04 > Chr1 est2genome match_part 5470 5899 2150 - . 
ID=Chr1:hsp:500:3.2.0.0;Parent=Chr1:hit:213:3.2.0.0;Target=gi|19829901|gb|AV795918|RAFL08-19-M04 2 431 +;Gap=M430 > > Protein evidence file: > Chr1 protein2genome protein_match 3760 5284 727 + . ID=Chr1:hit:202:3.10.0.0;Name=UniRef90_M4EWW1 > Chr1 protein2genome match_part 3760 3913 727 + . ID=Chr1:hsp:488:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 1 50;Gap=M31 D1 M19 F1 > Chr1 protein2genome match_part 3996 4276 727 + . ID=Chr1:hsp:489:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 51 144;Gap=R1 M23 D1 M28 D1 M36 I2 M5 > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Aug 3 13:10:48 2015 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 3 Aug 2015 13:10:48 -0600 Subject: [maker-devel] Short introns In-Reply-To: References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com> <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com> Message-ID: <8548067E-FCBB-4E7D-A45F-70D6A1F62BF6@gmail.com> Thanks for the patch. I plan on adding the min_intron option as well as a snoscan bug fix to the stable release of MAKER soon. No plans for a GitHub move at present. ?Carson > On Jul 30, 2015, at 5:21 PM, Shaun Jackman wrote: > > Hi, Carson. > > Increasing $min_intron to 250 did the trick! Thanks for the tip. For this organellar genome, all the introns (at least that I?ve found) are group II self-splicing introns, so they have a much larger minimum size (about 990 bp) than introns in other genomes. https://en.wikipedia.org/wiki/Group_II_intron > Increasing $min_intron worked perfectly for all but one gene. Decreasing --codongapopen and --codongapextend rescued that one gene, but I?ll probably just leave the defaults and annotate that one gene by hand. 
>
> Note that $min_intron had no effect on protein2genome without the following patch. Will there be a stable release soon that includes this patch and the min_intron option? I'm preparing a manuscript for submission, and I'd love to be able to refer to a stable version of MAKER in the manuscript that includes this feature.
>
> Have you considered moving the MAKER development to GitHub?
>
> Thanks again. Cheers,
> Shaun
>
> diff --git a/protein.pm.orig b/protein.pm
> --- a/protein.pm.orig
> +++ b/protein.pm
> @@ -94,11 +94,11 @@ sub runExonerate {
>
>      my $command = "$exe -q $q_file -t $t_file -Q protein -T dna ";
>      $command .= "-m protein2genome --softmasktarget ";
> +    $command .= " --minintron $min_intron --maxintron $max_intron --showcigar";
>      $command .= " --percent $percent";
>      if ($matrix) {
>      $command .= " --proteinsubmat $matrix";
>      }
> -    $command .= " --showcigar ";
>      $command .= " > $o_file";
>
>      my $w = new Widget::exonerate::protein2genome();
>
>
>
> --
> http://sjackman.ca/
> On 2015-July-16 at 13:10:13 , Carson Holt (carsonhh at gmail.com ) wrote:
>
>> I can add it to the development version.
>>
>> --Carson
>>
>>
>>> On Jul 16, 2015, at 1:11 PM, Shaun Jackman wrote:
>>>
>>> Hi, Carson.
>>>
>>> One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp. The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG. The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33.
>>>
>>> I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary.
>>>
>>> I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest.
>>>
>>> I'll try tweaking the $min_intron parameter.
Could this parameter be added to the maker_opts.ctl configuration file?
>>>
>>> Thanks for your help, Carson. Cheers,
>>> Shaun
>>>
>>>
>>>
>>>
>>> --
>>> http://sjackman.ca/
>>> On 2015-July-16 at 10:55:32 , Carson Holt (carsonhh at gmail.com ) wrote:
>>>
>>>> Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion because the polished alignment will have valid splice sites. It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However if it looks more like a forced mapping around non-canonical splice sites (which shouldn't actually produce protein2genome results) then I might support the idea that it's an insertion. A 250 bp intron or even a 100 bp intron doesn't really seem that short to me. The lower range seen in fungi for example (which have very short introns) can get close to about 20 bp. I guess it's possible that the protein evidence you are using contains an intron that isn't really there, that results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site).
>>>>
>>>> The minimum intron given to exonerate for polishing is 20. It's hard coded, and you would have to manually edit it.
>>>>
>>>> Line 1534 in maker/lib/GI.pm --> my $min_intron = 20;
>>>>
>>>> --Carson
>>>>
>>>>
>>>>> On Jul 15, 2015, at 6:36 PM, Shaun Jackman wrote:
>>>>>
>>>>> Hi, Carson.
>>>>>
>>>>> I'm using protein evidence and protein2genome alone without ab initio gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER?
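
As a quick sanity check of the ten intron sizes reported in this thread, a sketch like the following could show which of them fall under a chosen minimum intron size and which are multiples of 3. The thresholds 250 and 20 are the values discussed in the thread; the function name is hypothetical and this is not how MAKER or exonerate filter introns internally.

```python
def classify_introns(sizes, min_intron):
    """Split intron sizes into those below min_intron and those at or above it."""
    too_short = [s for s in sizes if s < min_intron]
    kept = [s for s in sizes if s >= min_intron]
    return too_short, kept


# Sizes (bp) of the ten questionable introns listed in the thread.
questionable = [33, 111, 84, 30, 219, 186, 51, 30, 45, 33]

# With a minimum of 250 bp, every questionable intron is below the cutoff.
too_short, kept = classify_introns(questionable, min_intron=250)

# With the hard-coded minimum of 20 bp, none of them is below the cutoff.
too_short_default, _ = classify_introns(questionable, min_intron=20)

# Introns whose length is a multiple of 3 keep the reading frame intact.
in_frame = [s for s in questionable if s % 3 == 0]

print(len(too_short), len(kept), len(too_short_default), len(in_frame))
```

All ten sizes are multiples of 3 and all are below 250 bp, which is consistent with the description of them as in-frame insertions rather than introns.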
>>>>>
>>>>> 6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes.
>>>>>
>>>>> Thanks,
>>>>> Shaun
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> http://sjackman.ca/
>>>>>
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com
>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>>
>>>
>>
>
>

From palmierinico at gmail.com Tue Aug 4 08:46:38 2015
From: palmierinico at gmail.com (Nicola Palmieri)
Date: Tue, 4 Aug 2015 16:46:38 +0200
Subject: [maker-devel] MAKER: single exons not predicted
Message-ID:

Dear MAKER developers,

I am using Maker to annotate a new protozoan species (Cystoisospora suis). I am trying to incorporate TopHat/Cufflinks evidence from RNA-Seq, but some genes still do not get annotated.

I attached a screenshot from IGV, in which I use separately Augustus, Cufflinks and then various runs of Maker incorporating TopHat and/or Cufflinks as evidence. The single exon genes are consistently not predicted (i.e., gene under the label g.38.t1). I have also played around with the single_exon option. I didn't find any similar post in the wiki.

I would really appreciate some hints on how to proceed.

Kind regards,
Nicola

--
Nicola Palmieri
Postdoctoral fellow
Institut für Parasitologie
Vetmeduni Vienna
-------------- next part -------------- A non-text attachment was scrubbed...
Name: Screen Shot 2015-08-04 at 16.39.26.png
Type: image/png
Size: 230184 bytes
Desc: not available
URL:

From dence at genetics.utah.edu Tue Aug 4 10:03:03 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Tue, 4 Aug 2015 16:03:03 +0000
Subject: [maker-devel] MAKER: single exons not predicted
In-Reply-To:
References:
Message-ID:

Hi Nicola, I don't think that Augustus will predict single-exon genes in most cases. That's probably responsible for your results.

~Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Aug 4, 2015, at 8:46 AM, Nicola Palmieri wrote:

Dear MAKER developers,

I am using Maker to annotate a new protozoan species (Cystoisospora suis). I am trying to incorporate TopHat/Cufflinks evidence from RNA-Seq, but some genes still do not get annotated.

I attached a screenshot from IGV, in which I use separately Augustus, Cufflinks and then various runs of Maker incorporating TopHat and/or Cufflinks as evidence. The single exon genes are consistently not predicted (i.e., gene under the label g.38.t1). I have also played around with the single_exon option. I didn't find any similar post in the wiki.

I would really appreciate some hints on how to proceed.

Kind regards,
Nicola

--
Nicola Palmieri
Postdoctoral fellow
Institut für Parasitologie
Vetmeduni Vienna
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Tue Aug 4 10:37:41 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 4 Aug 2015 10:37:41 -0600
Subject: [maker-devel] MAKER: single exons not predicted
In-Reply-To:
References:
Message-ID: <62FFE93E-E54B-4F8B-AE16-9C63CB43DF9E@gmail.com>

Most single exon alignments from ESTs or mRNA-seq will actually be spurious in nature. It's the reason why single exon alignments are ignored by default. Also as Daniel mentioned, gene predictors do not like to call single exon genes. Maker gets around this by supplying hints to the predictor based on alignment evidence, which increases the probabilities of the HMM used by the predictor within a given region, but given the spurious nature of single exon nucleotide alignments, you will also need protein alignment support for single exon genes for this to work.

Also the region you gave as an example does not appear to be an open reading frame. In fact almost the entire region that overlaps the Augustus call is labeled as UTR. It looks like a classic spurious alignment, likely an unmasked repetitive element.

--Carson

> On Aug 4, 2015, at 10:03 AM, Daniel Ence wrote:
>
> Hi Nicola, I don't think that Augustus will predict single-exon genes in most cases. That's probably responsible for your results.
>
> ~Daniel
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
>> On Aug 4, 2015, at 8:46 AM, Nicola Palmieri wrote:
>>
>> Dear MAKER developers,
>>
>> I am using Maker to annotate a new protozoan species (Cystoisospora suis). I am trying to incorporate TopHat/Cufflinks evidence from RNA-Seq, but some genes still do not get annotated.
>>
>> I attached a screenshot from IGV, in which I use separately Augustus, Cufflinks and then various runs of Maker incorporating TopHat and/or Cufflinks as evidence. The single exon genes are consistently not predicted (i.e., gene under the label g.38.t1).
I have also played around with the single_exon option. I didn't find any similar post in the wiki. >> >> I would really appreciate some hints on how to proceed. >> >> Kind regards, >> Nicola >> >> -- >> Nicola Palmieri >> Postdoctoral fellow >> Institut f?r Parasitologie >> Vetmeduni Vienna >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From xh33 at georgetown.edu Thu Aug 20 09:37:19 2015 From: xh33 at georgetown.edu (Xin Huang) Date: Thu, 20 Aug 2015 11:37:19 -0400 Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser Message-ID: Dear Developer, I was trying to use a mixed GFF3 file to do some analysis, which includes genome-level gene predictions and manually curated annotations. I replaced the gene prediction by Maker annotations where appropriate. And a sample of the blended GFF3 file is as follows: scaffold10370 GLEAN mRNA 3918 18376 0.920154 + . ID=CCG000529.1;source_id=Aalb_GLEAN_10002648; scaffold10370 GLEAN CDS 3918 3983 . + 0 Parent=CCG000529.1; scaffold10370 GLEAN CDS 4521 4598 . + 0 Parent=CCG000529.1; scaffold10370 GLEAN CDS 17407 17516 . + 0 Parent=CCG000529.1; scaffold10370 GLEAN CDS 17588 17767 . + 1 Parent=CCG000529.1; scaffold10370 GLEAN CDS 18238 18376 . + 1 Parent=CCG000529.1; scaffold10370 maker gene 32452 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=AAEL004146 scaffold10370 maker mRNA 32452 54508 44204 - . 
ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;aED=0.00;eAED=0.00;qI=376|1|1|1|0|0|5|316|548 scaffold10370 maker exon 32452 33011 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:exon:15;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker exon 35140 35388 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker exon 35455 36119 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker exon 36443 36777 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker exon 53979 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker five_prime_UTR 54133 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:five_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker CDS 53979 54132 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker CDS 36443 36777 . - 2 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker CDS 35455 36119 . 
- 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 35140 35388 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 32768 33011 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker three_prime_UTR 32452 32767 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:three_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2

This file does not pass the GFF3 online validator from GenomeTools, which gives the following error message:

GenomeTools error: Parent "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" on line 1456 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/Aalbo.gff" was not defined (via "ID=")

And when I tried to use gff3toGenePred to convert GFF3 to the refFlat format, I got the following parsing errors (samples):

Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12" Parent attribute
Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11" Parent attribute
Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10" Parent attribute
Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9" Parent attribute
Can't find annotation record "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-2" referenced by "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-1:exon:114" Parent attribute
Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:118" Parent attribute
Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:119" Parent attribute
Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:120" Parent attribute
Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:121" Parent attribute

What could have been the issue in the annotation file or the way I used the file? Any feedback will be highly appreciated!

Sincerely,
Xin Huang

From carsonhh at gmail.com Tue Aug 25 10:51:59 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 25 Aug 2015 10:51:59 -0600
Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser
In-Reply-To:
References:
Message-ID:

You have edited the file in a way that broke it.
Gene maker-scaffold10370-exonerate_est2genome-gene-0.0 previously had two separate transcripts (mRNA-1 and mRNA-2). You have deleted mRNA-1, but some of its child exons still exist. Notice that some exons have two parents, and you have deleted one of the parents without removing the parent relationship.

--Carson

> On Aug 20, 2015, at 9:37 AM, Xin Huang wrote:
>
> Dear Developer,
>
> I was trying to use a mixed GFF3 file to do some analysis, which includes genome-level gene predictions and manually curated annotations. I replaced the gene prediction by Maker annotations where appropriate.
And a sample of the blended GFF3 file is as follows: > > scaffold10370 GLEAN mRNA 3918 18376 0.920154 + . ID=CCG000529.1;source_id=Aalb_GLEAN_10002648; > scaffold10370 GLEAN CDS 3918 3983 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 4521 4598 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17407 17516 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17588 17767 . + 1 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 18238 18376 . + 1 Parent=CCG000529.1; > scaffold10370 maker gene 32452 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=AAEL004146 > scaffold10370 maker mRNA 32452 54508 44204 - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;aED=0.00;eAED=0.00;qI=376|1|1|1|0|0|5|316|548 > scaffold10370 maker exon 32452 33011 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:exon:15;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35140 35388 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35455 36119 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 36443 36777 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 53979 54508 . - . 
ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker five_prime_UTR 54133 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:five_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 53979 54132 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 36443 36777 . - 2 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35455 36119 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35140 35388 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 32768 33011 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker three_prime_UTR 32452 32767 . - . 
ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:three_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > > This file does not pass the GFF3 online validator from GenomeTools with the following error message: > > GenomeTools error: Parent "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" on line 1456 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/Aalbo.gff " was not defined (via "ID=") > > And when I tried to use gff3toGenePred to convert GFF3 to the refflat format, I got the following parsing errors (samples): > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12" Parent attribute > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11" Parent attribute > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10" Parent attribute > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9" Parent attribute > > Can't find annotation record "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-2" referenced by "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-1:exon:114" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:118" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:119" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" 
referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:120" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:121" Parent attribute > > What could have been the issue in the annotation file or the way I used the file? Any feedback will be highly appreciated! > > Sincerely, > > Xin Huang > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From xh33 at georgetown.edu Tue Aug 25 11:04:37 2015 From: xh33 at georgetown.edu (Xin Huang) Date: Tue, 25 Aug 2015 13:04:37 -0400 Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser In-Reply-To: References: Message-ID: Thank you very much for your response. Yes, that is the issue, and after it had been fixed, the file went through the program just fine. Thanks, Xin On Tue, Aug 25, 2015 at 12:51 PM, Carson Holt wrote: > You have edited the file in a way that broke it. > Gene maker-scaffold10370-exonerate_est2genome-gene-0.0 previously had two > separate transcript sequences (mRNA-1 and mRNA-2). You have deleted > mRNA-1, but some of it?s children exons still exist. Notice that some exons > have two parents, and you have deleted one of the parents without removing > the parent relationship. > > ?Carson > > > > On Aug 20, 2015, at 9:37 AM, Xin Huang wrote: > > Dear Developer, > > I was trying to use a mixed GFF3 file to do some analysis, which includes > genome-level gene predictions and manually curated annotations. I replaced > the gene prediction by Maker annotations where appropriate. And a sample of > the blended GFF3 file is as follows: > > scaffold10370 GLEAN mRNA 3918 18376 0.920154 + . 
> ID=CCG000529.1;source_id=Aalb_GLEAN_10002648; > scaffold10370 GLEAN CDS 3918 3983 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 4521 4598 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17407 17516 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17588 17767 . + 1 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 18238 18376 . + 1 Parent=CCG000529.1; > scaffold10370 maker gene 32452 54508 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=AAEL004146 > scaffold10370 maker mRNA 32452 54508 44204 - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;aED=0.00;eAED=0.00;qI=376|1|1|1|0|0|5|316|548 > scaffold10370 maker exon 32452 33011 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:exon:15;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35140 35388 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35455 36119 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 36443 36777 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 53979 54508 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker five_prime_UTR 54133 54508 . - . 
> ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:five_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 53979 54132 . - 0 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 36443 36777 . - 2 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35455 36119 . - 0 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35140 35388 . - 1 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 32768 33011 . - 1 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker three_prime_UTR 32452 32767 . - . 
> ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:three_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > > This file does not pass the GFF3 online validator from GenomeTools with > the following error message: > > GenomeTools error: Parent > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" on line 1456 in > file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/Aalbo.gff" was > not defined (via "ID=") > > And when I tried to use gff3toGenePred to convert GFF3 to the refflat > format, I got the following parsing errors (samples): > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12" Parent > attribute > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11" Parent > attribute > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10" Parent > attribute > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9" Parent > attribute > > Can't find annotation record > "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-2" referenced by > "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-1:exon:114" Parent > attribute > > Can't find annotation record > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:118" Parent > attribute > > Can't find annotation record > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:119" Parent > attribute > > Can't find annotation record > 
"maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:120" Parent > attribute > > Can't find annotation record > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:121" Parent > attribute > > What could have been the issue in the annotation file or the way I used > the file? Any feedback will be highly appreciated! > > Sincerely, > > Xin Huang > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcornel3 at asu.edu Tue Aug 25 12:38:05 2015 From: jcornel3 at asu.edu (John Cornelius) Date: Tue, 25 Aug 2015 11:38:05 -0700 Subject: [maker-devel] CEGMA vs. initial MAKER output for SNAP training Message-ID: Hello, I was wondering if it would be better to generate the first hmm for SNAP training from CEGMA results or if I should use the output from my initial MAKER run? Thanks. -- John Cornelius MCB PhD Candidate Arizona State University -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Aug 25 12:45:52 2015 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 25 Aug 2015 12:45:52 -0600 Subject: [maker-devel] CEGMA vs. initial MAKER output for SNAP training In-Reply-To: References: Message-ID: <73C81426-2B40-42D9-8150-F152F18B23FD@gmail.com> Either works. Both just count as a starting point (you just need something). Then you will run MAKER once more with this new HMM and do a final round of bootstrap training using the results of that last MAKER run. Regardless of what you start with, results will converge when you do the last bootstrap training step. 
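One round of the SNAP retraining that this bootstrap procedure relies on can be sketched as follows. This is only a sketch, assuming the command sequence commonly given in the MAKER tutorial (maker2zff, fathom, forge, hmm-assembler.pl). The intermediate file names, the 1000 bp flank and the function name `snap_training_commands` are assumptions for illustration, so check them against your installed SNAP and MAKER versions. The function only builds the command lines; it does not run anything.

```python
from typing import List


def snap_training_commands(maker_gff: str, hmm_name: str) -> List[List[str]]:
    """Build the command lines for one round of SNAP training.

    maker_gff: merged GFF3 from the previous MAKER run (hypothetical path).
    hmm_name:  label for the new HMM (hypothetical).
    The stdout of the last command is what would be saved as <hmm_name>.hmm.
    """
    return [
        # Convert MAKER gene models to ZFF (expected to write genome.ann / genome.dna).
        ["maker2zff", maker_gff],
        # Sort models into categories, keeping 1000 bp of flanking sequence.
        ["fathom", "-categorize", "1000", "genome.ann", "genome.dna"],
        # Export the unique models on the plus strand.
        ["fathom", "-export", "1000", "-plus", "uni.ann", "uni.dna"],
        # Estimate the model parameters in the current directory.
        ["forge", "export.ann", "export.dna"],
        # Assemble the HMM; redirect stdout to a file when actually running it.
        ["hmm-assembler.pl", hmm_name, "."],
    ]


if __name__ == "__main__":
    for cmd in snap_training_commands("round1.all.gff", "my_species"):
        print(" ".join(cmd))
```

Building the same list once for the GFF3 from the starting run and once more for the GFF3 from the second MAKER run gives the two training steps described above.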
-Carson > On Aug 25, 2015, at 12:38 PM, John Cornelius wrote: > > Hello, I was wondering if it would be better to generate the first hmm for SNAP training from CEGMA results or if I should use the output from my initial MAKER run? Thanks. > > -- > John Cornelius > MCB PhD Candidate > Arizona State University > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From johnbracht at gmail.com Thu Aug 27 10:36:41 2015 From: johnbracht at gmail.com (John Bracht) Date: Thu, 27 Aug 2015 12:36:41 -0400 Subject: [maker-devel] maker on a nematode: few novel proteins Message-ID: Hi Carson, all, I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas. Briefly, I'm annotating a novel nematode genome that appears to have a reduced genome (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data so I'm running ab-initio predictors SNAP and Augustus supplemented with lots of protein hints: swiss-prot and a combined library of 28 nematode proteomes. Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the swiss-prot database to the genome (no nematode proteins used in this round), which output a .gff file with swiss-prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker. Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly). 
We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from wormbase) and inputting the gff file containing the swiss-prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins. Maker Round 2: The output of Round 1 was then filtered for protein hints giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus. We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the swiss-prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc (some introns are on the small side but that's also consistent with a reduced genome size). My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast-match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast-match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training. 
On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast). In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 proteins are relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the highly repetitive nature of this genome the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are no obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the set of 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense) but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins. For the record, the N50 of our assembly is about 50kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes. So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any and all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: maker_opts.ctl Type: application/octet-stream Size: 4729 bytes Desc: not available URL: From yincl2013 at 126.com Sun Aug 30 02:52:09 2015 From: yincl2013 at 126.com (尹传林) Date: Sun, 30 Aug 2015 16:52:09 +0800 Subject: [maker-devel] maker problem Message-ID: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com> Hello, While using MAKER in a genome annotation pipeline, I ran into a serious problem. My genome has 7282 scaffolds; all of them succeeded except one scaffold, which failed. I do not know why. I have checked the log, but I still can't fix the problem. So I need your help, and I would appreciate it very much. Here is part of the main error log: WARNING: Cannot find >0, trying to re-index the fasta. stop here:0 ERROR: Fasta index error at /disk/yinchuanlin/software/maker/maker/bin/../lib/GI.pm line 1622. GI::polish_exonerate(FastaChunk=HASH(0x7fe0263563b8), FastaSeq=HASH(0x3f7b280), 2988503, ">contig1", ARRAY(0x3d3ad80), ARRAY(0x3d17ed0), "/disk/yinchuanlin/01_pingguodue_project/02_OMIGA/09_maker_wor"..., "e", "/disk/yinchuanlin/software/maker/exonerate-2.2.0-x86_64/bin/e"..., ...) 
called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 1986 Process::MpiChunk::__ANON__() called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 415 eval {...} called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 407 Error::subs::try(CODE(0x3f381b8), HASH(0x3f62380)) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 4224 Process::MpiChunk::_go(Process::MpiChunk=HASH(0x3e30020), "run", HASH(0x7fe026352410), 2, 3) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 341 Process::MpiChunk::run(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 357 Process::MpiChunk::run_all(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287 Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3d43310), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287 Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3f28b28), 0) called at /disk/yinchuanlin/software/maker/maker/bin/maker line 686 --> rank=NA, hostname=node7 ERROR: Failed while polishig ESTs ERROR: Chunk failed at level:2, tier_type:3 FAILED CONTIG:contig1 ERROR: Chunk failed at level:4, tier_type:0 FAILED CONTIG:contig1 Thanks! Best regards, Ian Department of Entomology, College of Plant Protection, Nanjing Agricultural University No. 1, Weigang Road, Xuanwu District, Nanjing, Jiangsu 210095, China -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From carsonhh at gmail.com Sun Aug 30 11:09:24 2015 From: carsonhh at gmail.com (Carson Holt) Date: Sun, 30 Aug 2015 11:09:24 -0600 Subject: [maker-devel] maker problem In-Reply-To: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com> References: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com> Message-ID: Separate out just that contig into a separate file and run it by itself. You may also want to rename the contig. Don't call it 0. Having the name simply be a zero has the potential to cause any number of problems in MAKER or any of the programs MAKER uses because 0 all by itself means 'false' in most computer languages. Try contig_0 instead. It looks like the first error may be the BioPerl indexer failing because the contig is named 0. -Carson > On Aug 30, 2015, at 2:52 AM, 尹传林 wrote: > > Hello, > > While using MAKER in a genome annotation pipeline, I ran into a serious problem. My genome has 7282 scaffolds; all of them succeeded except one scaffold, which failed. I do not know why. I have checked the log, but I still can't fix the problem. So I need your help, and I would appreciate it very much. > > Here is part of the main error log: > WARNING: Cannot find >0, trying to re-index the fasta. > stop here:0 > ERROR: Fasta index error > at /disk/yinchuanlin/software/maker/maker/bin/../lib/GI.pm line 1622. > GI::polish_exonerate(FastaChunk=HASH(0x7fe0263563b8), FastaSeq=HASH(0x3f7b280), 2988503, ">contig1", ARRAY(0x3d3ad80), ARRAY(0x3d17ed0), "/disk/yinchuanlin/01_pingguodue_project/02_OMIGA/09_maker_wor"..., "e", "/disk/yinchuanlin/software/maker/exonerate-2.2.0-x86_64/bin/e"..., ...) 
called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 1986 > Process::MpiChunk::__ANON__() called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 415 > eval {...} called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 407 > Error::subs::try(CODE(0x3f381b8), HASH(0x3f62380)) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 4224 > Process::MpiChunk::_go(Process::MpiChunk=HASH(0x3e30020), "run", HASH(0x7fe026352410), 2, 3) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 341 > Process::MpiChunk::run(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 357 > Process::MpiChunk::run_all(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287 > Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3d43310), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287 > Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3f28b28), 0) called at /disk/yinchuanlin/software/maker/maker/bin/maker line 686 > --> rank=NA, hostname=node7 > ERROR: Failed while polishig ESTs > ERROR: Chunk failed at level:2, tier_type:3 > FAILED CONTIG:contig1 > > ERROR: Chunk failed at level:4, tier_type:0 > FAILED CONTIG:contig1 > > Thanks! > > Best regards, > Ian > Department of Entomology, College of Plant Protection, Nanjing Agricultural University > No. 1, Weigang Road, Xuanwu District, Nanjing, Jiangsu 210095, China > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From carsonhh at gmail.com Sun Aug 30 11:31:38 2015 From: carsonhh at gmail.com (Carson Holt) Date: Sun, 30 Aug 2015 11:31:38 -0600 Subject: [maker-devel] maker on a nematode: few novel proteins In-Reply-To: References: Message-ID: Hi John, MAKER requires that at least some form of evidence support each model. This is because ab initio predictors like Augustus and SNAP tend to overpredict, in extreme cases by as much as 10-fold. So if you do not have any mRNA-based evidence, the chance that you would find something novel is extremely limited. This may be further complicated by the fact that nematodes tend to be highly divergent at the amino acid level, more than would be expected from their evolutionary relationship to other organisms. Their molecular clock is highly accelerated, and even making a phylogenetic tree using nematodes is a nightmare because of the extremely long branch lengths they generate. So you may miss a lot of orthologs simply because proteins won't align across such divergence. A classic example is smg-5 in C. elegans, which will not align to its supposed orthologs in other eukaryotes at the amino acid level, but appears to have a conserved function and domain structure that is only detectable at a higher level. You may be able to rescue a number of rejected gene models by using InterProScan to identify protein domains from the non-overlapping-ab-inits fasta files. In the absence of mRNA evidence, that really would be the best way of capturing as much as possible. Also keep in mind that if the nematode you are working with is parasitic, then its evolution will be dominated by genome reduction as opposed to evolving new and novel genes. So having a smaller gene count with fewer novel genes would be expected. In fact the major trend in all Deuterostome evolution tends to be gene loss as opposed to gene gain. 
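The InterProScan rescue step mentioned above could be approached roughly like this. It is a minimal sketch, assuming InterProScan was run on the rejected ab initio proteins with its tab-separated (TSV) output, in which the first column is the protein ID and the fourth column is the analysis name (for example Pfam). The function name `ids_with_domain` and the example protein IDs are made up for illustration, and how the selected models are fed back into MAKER is not covered here.

```python
from typing import Iterable, Set


def ids_with_domain(tsv_lines: Iterable[str], analysis: str = "Pfam") -> Set[str]:
    """Collect protein IDs with at least one hit from the given analysis.

    Assumes InterProScan TSV rows: column 1 = protein ID, column 4 = analysis.
    """
    supported: Set[str] = set()
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a well-formed result row
        if fields[3] == analysis:
            supported.add(fields[0])
    return supported


if __name__ == "__main__":
    # Hypothetical rows in InterProScan TSV layout.
    example = [
        "snap-contig1-abinit-gene-0.1-mRNA-1\tabc\t210\tPfam\tPF00069\t"
        "Protein kinase domain\t10\t200\t1.2E-30\tT\t30-08-2015",
        "snap-contig1-abinit-gene-0.2-mRNA-1\tdef\t150\tCoils\tCoil\tCoil\t"
        "5\t40\t-\tT\t30-08-2015",
    ]
    print(sorted(ids_with_domain(example)))
```

The resulting IDs are only a shortlist of rejected models that carry a recognizable domain and so may deserve a closer look.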
Thanks, Carson > On Aug 27, 2015, at 10:36 AM, John Bracht wrote: > > Hi Carson, all, > > I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas. > > Briefly, I'm annotating a novel nematode genome that appears to have a reduced genome (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data so I'm running ab-initio predictors SNAP and Augustus supplemented with lots of protein hints: swiss-prot and a combined library of 28 nematode proteomes. > > Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the swiss-prot database to the genome (no nematode proteins used in this round), which output a .gff file with swiss-prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker. > > Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly). We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from wormbase) and inputting the gff file containing the swiss-prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins. > > Maker Round 2: The output of Round 1 was then filtered for protein hints giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus. 
We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the swiss-prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc (some introns are on the small side but that's also consistent with a reduced genome size). > > My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast-match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast-match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training. On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast). 
> > In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 proteins are relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the highly repetitive nature of this genome the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are no obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the set of 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense) but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins. > > For the record, the N50 of our assembly is about 50kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes. > > So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any and all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful. > > Thanks, > John > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From carsonhh at gmail.com Sun Aug 30 11:35:43 2015 From: carsonhh at gmail.com (Carson Holt) Date: Sun, 30 Aug 2015 11:35:43 -0600 Subject: [maker-devel] maker on a nematode: few novel proteins In-Reply-To: References: Message-ID: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com> Sorry, I meant to say Protostome, not Deuterostome (last sentence). -Carson > On Aug 30, 2015, at 11:31 AM, Carson Holt wrote: > > Hi John, > > MAKER requires that at least some form of evidence support each model. This is because ab initio predictors like Augustus and SNAP tend to overpredict, in extreme cases by as much as 10-fold. So if you do not have any mRNA-based evidence, the chance that you would find something novel is extremely limited. This may be further complicated by the fact that nematodes tend to be highly divergent at the amino acid level, more than would be expected from their evolutionary relationship to other organisms. Their molecular clock is highly accelerated, and even making a phylogenetic tree using nematodes is a nightmare because of the extremely long branch lengths they generate. So you may miss a lot of orthologs simply because proteins won't align across such divergence. A classic example is smg-5 in C. elegans, which will not align to its supposed orthologs in other eukaryotes at the amino acid level, but appears to have a conserved function and domain structure that is only detectable at a higher level. > > You may be able to rescue a number of rejected gene models by using InterProScan to identify protein domains from the non-overlapping-ab-inits fasta files. In the absence of mRNA evidence, that really would be the best way of capturing as much as possible. > > Also keep in mind that if the nematode you are working with is parasitic, then its evolution will be dominated by genome reduction as opposed to evolving new and novel genes. So having a smaller gene count with fewer novel genes would be expected. 
In fact the major trend in all Deuterostome evolution tends to be gene loss as opposed to gene gain. > > Thanks, > Carson > > > >> On Aug 27, 2015, at 10:36 AM, John Bracht wrote: >> >> Hi Carson, all, >> >> I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas. >> >> Briefly, I'm annotating a novel nematode genome that appears to have a reduced genome (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data so I'm running ab-initio predictors SNAP and Augustus supplemented with lots of protein hints: swiss-prot and a combined library of 28 nematode proteomes. >> >> Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the swiss-prot database to the genome (no nematode proteins used in this round), which output a .gff file with swiss-prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker. >> >> Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly). We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from wormbase) and inputting the gff file containing the swiss-prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins. >> >> Maker Round 2: The output of Round 1 was then filtered for protein hints giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus. 
We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the swiss-prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc (some introns are on the small side but that's also consistent with a reduced genome size). >> >> My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast-match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast-match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training. On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast). 
>> >> In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 proteins are relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the highly repetitive nature of this genome the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are no obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the set of 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense) but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins. >> >> For the record, the N50 of our assembly is about 50kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes. >> >> So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any and all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful. >> >> Thanks, >> John >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From carsonhh at gmail.com Sun Aug 30 11:44:24 2015 From: carsonhh at gmail.com (Carson Holt) Date: Sun, 30 Aug 2015 11:44:24 -0600 Subject: [maker-devel] maker on a nematode: few novel proteins In-Reply-To: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com> References: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com> Message-ID: I should have said: the major trend in all Protostome evolution tends to be gene loss as opposed to gene gain. And the process of gene loss is more pronounced in the Ecdysozoa (insects and nematodes) than in the Lophotrochozoa (mollusks and flatworms). -Carson > On Aug 30, 2015, at 11:35 AM, Carson Holt wrote: > > Sorry, I meant to say Protostome, not Deuterostome (last sentence). > > -Carson > >> On Aug 30, 2015, at 11:31 AM, Carson Holt wrote: >> >> Hi John, >> >> MAKER requires that at least some form of evidence support each model. This is because ab initio predictors like Augustus and SNAP tend to overpredict, in extreme cases by as much as 10-fold. So if you do not have any mRNA-based evidence, the chance that you would find something novel is extremely limited. This may be further complicated by the fact that nematodes tend to be highly divergent at the amino acid level, more than would be expected from their evolutionary relationship to other organisms. Their molecular clock is highly accelerated, and even making a phylogenetic tree using nematodes is a nightmare because of the extremely long branch lengths they generate. So you may miss a lot of orthologs simply because proteins won't align across such divergence. A classic example is smg-5 in C. elegans, which will not align to its supposed orthologs in other eukaryotes at the amino acid level, but appears to have a conserved function and domain structure that is only detectable at a higher level. >> >> You may be able to rescue a number of rejected gene models by using InterProScan to identify protein domains from the non-overlapping-ab-inits fasta files. 
In the absence of mRNA evidence, that really would be the best way of capturing as much as possible. >> >> Also keep in mind that if the nematode you are working with is parasitic, then its evolution will be dominated by genome reduction as opposed to evolving new and novel genes. So having a smaller gene count with fewer novel genes will be expected. In fact the major trend in all Deuterostome evolution tends to be gene loss as opposed to gene gain. >> >> Thanks, >> Carson >> >> >> >>> On Aug 27, 2015, at 10:36 AM, John Bracht wrote: >>> >>> Hi Carson, all, >>> >>> I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas. >>> >>> Briefly, I'm annotating a novel nematode genome that appears to have a reduced genome (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data so I'm running ab-initio predictors SNAP and Augustus supplemented with lots of protein hints: swiss-prot and a combined library of 28 nematode proteomes. >>> >>> Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the swiss-prot database to the genome (no nematode proteins used in this round), which output a .gff file with swiss-prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker. >>> >>> Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly).
We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from wormbase) and inputting the gff file containing the swiss-prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins. >>> >>> Maker Round 2: The output of Round 1 was then filtered for protein hints giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus. We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the swiss-prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc (some introns are on the small side but that's also consistent with a reduced genome size). >>> >>> My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast-match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast-match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training. 
On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast). >>> >>> In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 is relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the highly repetitive nature of this genome, the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are no obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense) but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins. >>> >>> For the record, the N50 of our assembly is about 50kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes. >>> >>> So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any and all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful.
>>> >>> Thanks, >>> John >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccheng at jcvi.org Mon Aug 31 08:47:24 2015 From: ccheng at jcvi.org (Cheng, Chia-Yi) Date: Mon, 31 Aug 2015 10:47:24 -0400 Subject: [maker-devel] AED scores from MAKER pipeline - deterministic or not? Message-ID: Hello MAKER team, We at JCVI have been using MAKER (2.31.8) to calculate the AED of Arabidopsis gene models. We provided the annotation set as "model_gff" with evidence files in "protein_gff" and "est_gff". All the other settings were default. One issue I've noticed was that the AED scores did not seem to be deterministic. When I compare the AED scores from two runs using identical control files, ~1,000 (out of 35,385) gene models had different AED scores. The difference between two sets of AED scores could range from 0.01 to 1.00. I looked into several gene models with larger differences, i.e. AED = 0.00 in run 1 and AED = 1.00 in run 2, and noticed a disagreement in the QI: Run 1: _AED=0.00;_eAED=-0.00;_QI=0|-1|0|1|-1|0|1|0|344 Run 2: _AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|0|344 The discrepancy in the 4th column seemed to suggest the evidence file was not used properly in run 2. I'm not sure what may have caused this, as both runs used the same input. A snapshot of the evidence files is pasted at the end of the email in case it is needed. Please let me know if more info is needed. Any help is appreciated. Thank you. Chia-Yi RNA-seq evidence file: Chr1 assembler-aerial2_pasa cDNA_match 3624 5927 . + . ID=aerial2_align_161343;Target=asmbl_1 1082 1234 +%2Casmbl_1 692 1081 +%2Casmbl_1 572 691 +%2Casmbl_1 1 290 +%2Casmbl_1 291 571 +%2Casmbl_1 1235 1723 + Chr1 assembler-aerial2_pasa match_part 3624 3913 . + .
ID=aerial2_align_161343-1;Parent=aerial2_align_161343 Chr1 assembler-aerial2_pasa match_part 3996 4276 . + . ID=aerial2_align_161343-2;Parent=aerial2_align_161343 EST evidence file: Chr1 est2genome expressed_sequence_match 5470 5899 2150 - . ID=Chr1:hit:213:3.2.0.0;Name=gi|19829901|gb|AV795918|RAFL08-19-M04 Chr1 est2genome match_part 5470 5899 2150 - . ID=Chr1:hsp:500:3.2.0.0;Parent=Chr1:hit:213:3.2.0.0;Target=gi|19829901|gb|AV795918|RAFL08-19-M04 2 431 +;Gap=M430 Protein evidence file: Chr1 protein2genome protein_match 3760 5284 727 + . ID=Chr1:hit:202:3.10.0.0;Name=UniRef90_M4EWW1 Chr1 protein2genome match_part 3760 3913 727 + . ID=Chr1:hsp:488:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 1 50;Gap=M31 D1 M19 F1 Chr1 protein2genome match_part 3996 4276 727 + . ID=Chr1:hsp:489:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 51 144;Gap=R1 M23 D1 M28 D1 M36 I2 M5 -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Aug 31 09:08:34 2015 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 31 Aug 2015 09:08:34 -0600 Subject: [maker-devel] AED scores from MAKER pipeline - deterministic or not? In-Reply-To: References: Message-ID: <81B272E9-3439-49C6-96E0-674B03F87569@gmail.com> I would have to see the actual GFF3 files (full file including the FASTA at the end). Give me both GFF3 files and the coordinates of the gene in question. My first guess is that you had the single_exon= filter set to different values on each run. The gene in question is an unspliced single exon gene (based on the QI), your primary piece of evidence appears to be a single exon EST, and the only value that changes in the QI is the exon overlap. Single exon evidence will be ignored by default for the AED calculation unless you have single_exon set to 1.
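To see exactly which QI field changed between the two runs, the two strings from the report can be compared field by field. What follows is a minimal Python sketch and not part of MAKER: the function names (parse_qi, diff_qi) are hypothetical, and the field labels paraphrase the QI description in the MAKER documentation, so they should be treated as assumptions rather than official names.

```python
# Labels for the nine pipe-separated QI fields. These are paraphrased
# from the MAKER documentation and are assumptions, not official names.
QI_FIELDS = [
    "five_prime_utr_length",
    "frac_splice_sites_confirmed_by_est",
    "frac_exons_overlapping_est",
    "frac_exons_overlapping_est_or_protein",
    "frac_splice_sites_confirmed_by_ab_initio",
    "frac_exons_overlapping_ab_initio",
    "exon_count",
    "three_prime_utr_length",
    "protein_length",
]


def parse_qi(qi):
    """Split a QI string such as '0|-1|0|1|-1|0|1|0|344' into labeled fields."""
    values = qi.split("|")
    if len(values) != len(QI_FIELDS):
        raise ValueError(
            "expected %d QI fields, got %d" % (len(QI_FIELDS), len(values))
        )
    return dict(zip(QI_FIELDS, values))


def diff_qi(qi_a, qi_b):
    """Return {field: (value_a, value_b)} for every field that differs."""
    a, b = parse_qi(qi_a), parse_qi(qi_b)
    return {name: (a[name], b[name]) for name in QI_FIELDS if a[name] != b[name]}


# The two QI strings quoted in the report.
run1 = "0|-1|0|1|-1|0|1|0|344"
run2 = "0|-1|0|0|-1|0|1|0|344"

print(diff_qi(run1, run2))
# prints {'frac_exons_overlapping_est_or_protein': ('1', '0')}
```

On these two strings only the fourth field differs, which is the exon-overlap value discussed above; the exon count of 1 is what marks the model as a single exon gene.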
Thanks, Carson > On Aug 31, 2015, at 8:47 AM, Cheng, Chia-Yi wrote: > > Hello MAKER team, > > We at JCVI have been using MAKER (2.31.8) to calculate the AED of Arabidopsis gene models. We provided the annotation set as "model_gff" with evidence files in "protein_gff" and "est_gff". All the other settings were default. One issue I've noticed was that the AED scores did not seem to be deterministic. When I compare the AED scores from two runs using identical control files, ~1,000 (out of 35,385) gene models had different AED scores. The difference between two sets of AED scores could range from 0.01 to 1.00. > > I looked into several gene models with larger differences, i.e. AED = 0.00 in run 1 and AED = 1.00 in run 2, and noticed a disagreement in the QI: > > Run 1: _AED=0.00;_eAED=-0.00;_QI=0|-1|0|1|-1|0|1|0|344 > Run 2: _AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|0|344 > > The discrepancy in the 4th column seemed to suggest the evidence file was not used properly in run 2. I'm not sure what may have caused this, as both runs used the same input. A snapshot of the evidence files is pasted at the end of the email in case it is needed. > > Please let me know if more info is needed. Any help is appreciated. Thank you. > > Chia-Yi > > > RNA-seq evidence file: > Chr1 assembler-aerial2_pasa cDNA_match 3624 5927 . + . ID=aerial2_align_161343;Target=asmbl_1 1082 1234 +%2Casmbl_1 692 1081 +%2Casmbl_1 572 691 +%2Casmbl_1 1 290 +%2Casmbl_1 291 571 +%2Casmbl_1 1235 1723 + > Chr1 assembler-aerial2_pasa match_part 3624 3913 . + . ID=aerial2_align_161343-1;Parent=aerial2_align_161343 > Chr1 assembler-aerial2_pasa match_part 3996 4276 . + . ID=aerial2_align_161343-2;Parent=aerial2_align_161343 > > EST evidence file: > Chr1 est2genome expressed_sequence_match 5470 5899 2150 - . ID=Chr1:hit:213:3.2.0.0;Name=gi|19829901|gb|AV795918|RAFL08-19-M04 > Chr1 est2genome match_part 5470 5899 2150 - .
ID=Chr1:hsp:500:3.2.0.0;Parent=Chr1:hit:213:3.2.0.0;Target=gi|19829901|gb|AV795918|RAFL08-19-M04 2 431 +;Gap=M430 > > Protein evidence file: > Chr1 protein2genome protein_match 3760 5284 727 + . ID=Chr1:hit:202:3.10.0.0;Name=UniRef90_M4EWW1 > Chr1 protein2genome match_part 3760 3913 727 + . ID=Chr1:hsp:488:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 1 50;Gap=M31 D1 M19 F1 > Chr1 protein2genome match_part 3996 4276 727 + . ID=Chr1:hsp:489:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 51 144;Gap=R1 M23 D1 M28 D1 M36 I2 M5 > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Aug 3 13:10:48 2015 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 3 Aug 2015 13:10:48 -0600 Subject: [maker-devel] Short introns In-Reply-To: References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com> <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com> Message-ID: <8548067E-FCBB-4E7D-A45F-70D6A1F62BF6@gmail.com> Thanks for the patch. I plan on adding the min_intron option as well as a snoscan bug fix to the stable release of MAKER soon. No plans for a GitHub move at present. --Carson > On Jul 30, 2015, at 5:21 PM, Shaun Jackman wrote: > > Hi, Carson. > > Increasing $min_intron to 250 did the trick! Thanks for the tip. For this organellar genome, all the introns (at least that I've found) are group II self-splicing introns, so they have a much larger minimum size (about 990 bp) than introns in other genomes. https://en.wikipedia.org/wiki/Group_II_intron > Increasing $min_intron worked perfectly for all but one gene. Decreasing --codongapopen and --codongapextend rescued that one gene, but I'll probably just leave the defaults and annotate that one gene by hand.
> > Note that $min_intron had no effect on protein2genome without the following patch. Will there be a stable release soon that includes this patch and the min_intron option? I'm preparing a manuscript for submission, and I'd love to be able to refer to a stable version of MAKER in the manuscript that includes this feature. > > Have you considered moving the MAKER development to GitHub? > > Thanks again. Cheers, > Shaun > > diff --git a/protein.pm.orig b/protein.pm > --- a/protein.pm.orig > +++ b/protein.pm > @@ -94,11 +94,11 @@ sub runExonerate { > > my $command = "$exe -q $q_file -t $t_file -Q protein -T dna "; > $command .= "-m protein2genome --softmasktarget "; > + $command .= " --minintron $min_intron --maxintron $max_intron --showcigar"; > $command .= " --percent $percent"; > if ($matrix) { > $command .= " --proteinsubmat $matrix"; > } > - $command .= " --showcigar "; > $command .= " > $o_file"; > > my $w = new Widget::exonerate::protein2genome(); > > > > -- > http://sjackman.ca/ > On 2015-July-16 at 13:10:13 , Carson Holt (carsonhh at gmail.com ) wrote: > >> I can add it to the development version. >> >> --Carson >> >> >>> On Jul 16, 2015, at 1:11 PM, Shaun Jackman wrote: >>> >>> Hi, Carson. >>> >>> One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp. The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG. The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33. >>> >>> I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary. >>> >>> I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest. >>> >>> I'll try tweaking the $min_intron parameter.
Could this parameter be added to the maker_opts.ctl configuration file? >>> >>> Thanks for your help, Carson. Cheers, >>> Shaun >>> >>> >>> >>> >>> -- >>> http://sjackman.ca/ >>> On 2015-July-16 at 10:55:32 , Carson Holt (carsonhh at gmail.com ) wrote: >>> >>>> Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion because the polished alignment will have valid splice sites. It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However if it looks more like a forced mapping around non-canonical splice sites (which shouldn't actually produce protein2genome results) then I might support the idea that it's an insertion. A 250 bp intron, or even a 100 bp one, doesn't really seem that short to me. The lower range seen in fungi for example (which have very short introns) can get close to about 20 bp. I guess it's possible that the protein evidence you are using contains an intron that isn't really there, that results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site). >>>> >>>> The minimum intron given to exonerate for polishing is 20. It's hard coded, and you would have to manually edit it. >>>> >>>> Line 1534 in maker/lib/GI.pm --> my $min_intron = 20; >>>> >>>> --Carson >>>> >>>> >>>>> On Jul 15, 2015, at 6:36 PM, Shaun Jackman wrote: >>>>> >>>>> Hi, Carson. >>>>> >>>>> I'm using protein evidence and protein2genome alone without ab initio gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER?
>>>>> >>>>> 6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes. >>>>> >>>>> Thanks, >>>>> Shaun >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> http://sjackman.ca/ >>>>> >>>>> _______________________________________________ >>>>> maker-devel mailing list >>>>> maker-devel at box290.bluehost.com >>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>>> >>> >>> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From palmierinico at gmail.com Tue Aug 4 08:46:38 2015 From: palmierinico at gmail.com (Nicola Palmieri) Date: Tue, 4 Aug 2015 16:46:38 +0200 Subject: [maker-devel] MAKER: single exons not predicted Message-ID: Dear MAKER developers, I am using Maker to annotate a new protozoan species (Cystoisospora suis). I am trying to incorporate TopHat/Cufflinks evidence from RNA-Seq but some genes still do not get annotated. I attached a screenshot from IGV, in which I use separately Augustus, Cufflinks and then various runs of Maker incorporating TopHat and/or Cufflinks as evidence. The single exon genes are constantly not predicted (i.e. the gene under the label g.38.t1). I have also played around with the single_exon option. I didn't find any similar post in the wiki. I would really appreciate some hints on how to proceed. Kind regards, Nicola -- Nicola Palmieri Postdoctoral fellow Institut für Parasitologie Vetmeduni Vienna -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: Screen Shot 2015-08-04 at 16.39.26.png Type: image/png Size: 230184 bytes Desc: not available URL: From dence at genetics.utah.edu Tue Aug 4 10:03:03 2015 From: dence at genetics.utah.edu (Daniel Ence) Date: Tue, 4 Aug 2015 16:03:03 +0000 Subject: [maker-devel] MAKER: single exons not predicted In-Reply-To: References: Message-ID: Hi Nicola, I don't think that Augustus will predict single-exon genes in most cases. That's probably responsible for your results. ~Daniel Daniel Ence Graduate Student Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Aug 4, 2015, at 8:46 AM, Nicola Palmieri wrote: Dear MAKER developers, I am using Maker to annotate a new protozoan species (Cystoisospora suis). I am trying to incorporate TopHat/Cufflinks evidence from RNA-Seq but some genes still do not get annotated. I attached a screenshot from IGV, in which I use separately Augustus, Cufflinks and then various runs of Maker incorporating TopHat and/or Cufflinks as evidence. The single exon genes are constantly not predicted (i.e. the gene under the label g.38.t1). I have also played around with the single_exon option. I didn't find any similar post in the wiki. I would really appreciate some hints on how to proceed. Kind regards, Nicola -- Nicola Palmieri Postdoctoral fellow Institut für Parasitologie Vetmeduni Vienna _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed...
URL: From carsonhh at gmail.com Tue Aug 4 10:37:41 2015 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 4 Aug 2015 10:37:41 -0600 Subject: [maker-devel] MAKER: single exons not predicted In-Reply-To: References: Message-ID: <62FFE93E-E54B-4F8B-AE16-9C63CB43DF9E@gmail.com> Most single exon alignments from ESTs or mRNA-seq will actually be spurious in nature. It's the reason why single exon alignments are ignored by default. Also as Daniel mentioned, gene predictors do not like to call single exon genes. Maker gets around this by supplying hints to the predictor based on alignment evidence which increases the probabilities of the HMM used by the predictor within a given region, but given the spurious nature of single exon nucleotide alignments, you will also need protein alignment support for single exon genes for this to work. Also the region you gave as an example does not appear to be an open reading frame. In fact almost the entire region that overlaps the Augustus call is labeled as UTR. It looks like a classic spurious alignment. Likely an unmasked repetitive element. --Carson > On Aug 4, 2015, at 10:03 AM, Daniel Ence wrote: > > Hi Nicola, I don't think that Augustus will predict single-exon genes in most cases. That's probably responsible for your results. > > ~Daniel > > Daniel Ence > Graduate Student > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > >> On Aug 4, 2015, at 8:46 AM, Nicola Palmieri wrote: >> >> Dear MAKER developers, >> >> I am using Maker to annotate a new protozoan species (Cystoisospora suis). I am trying to incorporate TopHat/Cufflinks evidence from RNA-Seq but some genes still do not get annotated. >> >> I attached a screenshot from IGV, in which I use separately Augustus, Cufflinks and then various runs of Maker incorporating TopHat and/or Cufflinks as evidence. The single exon genes are constantly not predicted (i.e. the gene under the label g.38.t1).
I have also played around with the single_exon option. I didn't find any similar post in the wiki. >> >> I would really appreciate some hints on how to proceed. >> >> Kind regards, >> Nicola >> >> -- >> Nicola Palmieri >> Postdoctoral fellow >> Institut für Parasitologie >> Vetmeduni Vienna >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From xh33 at georgetown.edu Thu Aug 20 09:37:19 2015 From: xh33 at georgetown.edu (Xin Huang) Date: Thu, 20 Aug 2015 11:37:19 -0400 Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser Message-ID: Dear Developer, I was trying to use a mixed GFF3 file to do some analysis, which includes genome-level gene predictions and manually curated annotations. I replaced the gene prediction by Maker annotations where appropriate. And a sample of the blended GFF3 file is as follows: scaffold10370 GLEAN mRNA 3918 18376 0.920154 + . ID=CCG000529.1;source_id=Aalb_GLEAN_10002648; scaffold10370 GLEAN CDS 3918 3983 . + 0 Parent=CCG000529.1; scaffold10370 GLEAN CDS 4521 4598 . + 0 Parent=CCG000529.1; scaffold10370 GLEAN CDS 17407 17516 . + 0 Parent=CCG000529.1; scaffold10370 GLEAN CDS 17588 17767 . + 1 Parent=CCG000529.1; scaffold10370 GLEAN CDS 18238 18376 . + 1 Parent=CCG000529.1; scaffold10370 maker gene 32452 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=AAEL004146 scaffold10370 maker mRNA 32452 54508 44204 - .
ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;aED=0.00;eAED=0.00;qI=376|1|1|1|0|0|5|316|548 scaffold10370 maker exon 32452 33011 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:exon:15;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker exon 35140 35388 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker exon 35455 36119 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker exon 36443 36777 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker exon 53979 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker five_prime_UTR 54133 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:five_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker CDS 53979 54132 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker CDS 36443 36777 . - 2 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker CDS 35455 36119 . 
- 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker CDS 35140 35388 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker CDS 32768 33011 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 scaffold10370 maker three_prime_UTR 32452 32767 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:three_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 This file does not pass the GFF3 online validator from GenomeTools with the following error message: GenomeTools error: Parent "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" on line 1456 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/Aalbo.gff" was not defined (via "ID=") And when I tried to use gff3toGenePred to convert GFF3 to the refflat format, I got the following parsing errors (samples): Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12" Parent attribute Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11" Parent attribute Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10" Parent attribute Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9" Parent attribute Can't find annotation record "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-2" referenced by
"maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-1:exon:114" Parent attribute Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:118" Parent attribute Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:119" Parent attribute Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:120" Parent attribute Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:121" Parent attribute What could have been the issue in the annotation file or the way I used the file? Any feedback will be highly appreciated! Sincerely, Xin Huang -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Aug 25 10:51:59 2015 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 25 Aug 2015 10:51:59 -0600 Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser In-Reply-To: References: Message-ID: You have edited the file in a way that broke it. Gene maker-scaffold10370-exonerate_est2genome-gene-0.0 previously had two separate transcript sequences (mRNA-1 and mRNA-2). You have deleted mRNA-1, but some of its child exons still exist. Notice that some exons have two parents, and you have deleted one of the parents without removing the parent relationship. --Carson > On Aug 20, 2015, at 9:37 AM, Xin Huang wrote: > > Dear Developer, > > I was trying to use a mixed GFF3 file to do some analysis, which includes genome-level gene predictions and manually curated annotations. I replaced the gene prediction by Maker annotations where appropriate.
And a sample of the blended GFF3 file is as follows: > > scaffold10370 GLEAN mRNA 3918 18376 0.920154 + . ID=CCG000529.1;source_id=Aalb_GLEAN_10002648; > scaffold10370 GLEAN CDS 3918 3983 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 4521 4598 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17407 17516 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17588 17767 . + 1 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 18238 18376 . + 1 Parent=CCG000529.1; > scaffold10370 maker gene 32452 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=AAEL004146 > scaffold10370 maker mRNA 32452 54508 44204 - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;aED=0.00;eAED=0.00;qI=376|1|1|1|0|0|5|316|548 > scaffold10370 maker exon 32452 33011 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:exon:15;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35140 35388 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35455 36119 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 36443 36777 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 53979 54508 . - . 
ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker five_prime_UTR 54133 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:five_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 53979 54132 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 36443 36777 . - 2 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35455 36119 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35140 35388 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 32768 33011 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker three_prime_UTR 32452 32767 . - . 
ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:three_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > > This file does not pass the GFF3 online validator from GenomeTools with the following error message: > > GenomeTools error: Parent "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" on line 1456 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/Aalbo.gff " was not defined (via "ID=") > > And when I tried to use gff3toGenePred to convert GFF3 to the refflat format, I got the following parsing errors (samples): > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12" Parent attribute > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11" Parent attribute > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10" Parent attribute > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9" Parent attribute > > Can't find annotation record "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-2" referenced by "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-1:exon:114" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:118" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:119" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" 
referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:120" Parent attribute
>
> Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:121" Parent attribute
>
> What could have been the issue in the annotation file or the way I used the file? Any feedback will be highly appreciated!
>
> Sincerely,
>
> Xin Huang
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From xh33 at georgetown.edu Tue Aug 25 11:04:37 2015
From: xh33 at georgetown.edu (Xin Huang)
Date: Tue, 25 Aug 2015 13:04:37 -0400
Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser

Thank you very much for your response. Yes, that was the issue, and after I fixed it, the file went through the program just fine.

Thanks,
Xin

On Tue, Aug 25, 2015 at 12:51 PM, Carson Holt wrote:

> You have edited the file in a way that broke it.
> Gene maker-scaffold10370-exonerate_est2genome-gene-0.0 previously had two
> separate transcript sequences (mRNA-1 and mRNA-2). You have deleted
> mRNA-1, but some of its child exons still exist. Notice that some exons
> have two parents, and you have deleted one of the parents without removing
> the parent relationship.
>
> --Carson
>
>
>
> On Aug 20, 2015, at 9:37 AM, Xin Huang wrote:
>
> Dear Developer,
>
> I was trying to use a mixed GFF3 file to do some analysis, which includes
> genome-level gene predictions and manually curated annotations. I replaced
> the gene prediction by Maker annotations where appropriate. And a sample of
> the blended GFF3 file is as follows:
>
> scaffold10370 GLEAN mRNA 3918 18376 0.920154 + .
> ID=CCG000529.1;source_id=Aalb_GLEAN_10002648; > scaffold10370 GLEAN CDS 3918 3983 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 4521 4598 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17407 17516 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17588 17767 . + 1 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 18238 18376 . + 1 Parent=CCG000529.1; > scaffold10370 maker gene 32452 54508 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=AAEL004146 > scaffold10370 maker mRNA 32452 54508 44204 - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;aED=0.00;eAED=0.00;qI=376|1|1|1|0|0|5|316|548 > scaffold10370 maker exon 32452 33011 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:exon:15;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35140 35388 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35455 36119 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 36443 36777 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 53979 54508 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker five_prime_UTR 54133 54508 . - . 
> ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:five_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 53979 54132 . - 0 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 36443 36777 . - 2 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35455 36119 . - 0 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35140 35388 . - 1 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 32768 33011 . - 1 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker three_prime_UTR 32452 32767 . - . 
> ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:three_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > > This file does not pass the GFF3 online validator from GenomeTools with > the following error message: > > GenomeTools error: Parent > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" on line 1456 in > file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/Aalbo.gff" was > not defined (via "ID=") > > And when I tried to use gff3toGenePred to convert GFF3 to the refflat > format, I got the following parsing errors (samples): > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12" Parent > attribute > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11" Parent > attribute > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10" Parent > attribute > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9" Parent > attribute > > Can't find annotation record > "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-2" referenced by > "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-1:exon:114" Parent > attribute > > Can't find annotation record > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:118" Parent > attribute > > Can't find annotation record > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:119" Parent > attribute > > Can't find annotation record > 
"maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:120" Parent > attribute > > Can't find annotation record > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:121" Parent > attribute > > What could have been the issue in the annotation file or the way I used > the file? Any feedback will be highly appreciated! > > Sincerely, > > Xin Huang > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcornel3 at asu.edu Tue Aug 25 12:38:05 2015 From: jcornel3 at asu.edu (John Cornelius) Date: Tue, 25 Aug 2015 11:38:05 -0700 Subject: [maker-devel] CEGMA vs. initial MAKER output for SNAP training Message-ID: Hello, I was wondering if it would be better to generate the first hmm for SNAP training from CEGMA results or if I should use the output from my initial MAKER run? Thanks. -- John Cornelius MCB PhD Candidate Arizona State University -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Aug 25 12:45:52 2015 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 25 Aug 2015 12:45:52 -0600 Subject: [maker-devel] CEGMA vs. initial MAKER output for SNAP training In-Reply-To: References: Message-ID: <73C81426-2B40-42D9-8150-F152F18B23FD@gmail.com> Either works. Both just count as a starting point (you just need something). Then you will run MAKER once more with this new HMM and do a final round of bootstrap training using the results of that last MAKER run. Regardless of what you start with, results will converge when you do the last bootstrap training step. 
--Carson

> On Aug 25, 2015, at 12:38 PM, John Cornelius wrote:
>
> Hello, I was wondering if it would be better to generate the first hmm for SNAP training from CEGMA results or if I should use the output from my initial MAKER run? Thanks.
>
> --
> John Cornelius
> MCB PhD Candidate
> Arizona State University

From johnbracht at gmail.com Thu Aug 27 10:36:41 2015
From: johnbracht at gmail.com (John Bracht)
Date: Thu, 27 Aug 2015 12:36:41 -0400
Subject: [maker-devel] maker on a nematode: few novel proteins

Hi Carson, all,

I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas.

Briefly, I'm annotating a novel nematode that appears to have a reduced genome (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data, so I'm running the ab initio predictors SNAP and Augustus supplemented with lots of protein hints: swiss-prot and a combined library of 28 nematode proteomes.

Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the swiss-prot database to the genome (no nematode proteins used in this round), which output a .gff file with swiss-prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker.

Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly).
We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from wormbase) and inputting the gff file containing the swiss-prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins. Maker Round 2: The output of Round 1 was then filtered for protein hints giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus. We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the swiss-prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc (some introns are on the small side but that's also consistent with a reduced genome size). My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast-match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast-match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training. 
On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast).

In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 is relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the highly repetitive nature of this genome, the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are no obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the set of 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense), but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins.

For the record, the N50 of our assembly is about 50 kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes.

So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any and all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful.

Thanks,
John

-------------- next part -------------- A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4730 bytes
Desc: not available

From yincl2013 at 126.com Sun Aug 30 02:52:09 2015
From: yincl2013 at 126.com (尹传林)
Date: Sun, 30 Aug 2015 16:52:09 +0800
Subject: [maker-devel] maker problem
Message-ID: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com>

Hello,

When running the MAKER genome annotation pipeline, I ran into a serious problem. My genome has 7282 scaffolds; all of them succeeded except one, which failed. I do not know why. I have checked the log, but I still can't fix the problem, so I need your help and would appreciate it very much.

Here is part of the main error log:

WARNING: Cannot find >0, trying to re-index the fasta.
stop here:0
ERROR: Fasta index error
at /disk/yinchuanlin/software/maker/maker/bin/../lib/GI.pm line 1622.
GI::polish_exonerate(FastaChunk=HASH(0x7fe0263563b8), FastaSeq=HASH(0x3f7b280), 2988503, ">contig1", ARRAY(0x3d3ad80), ARRAY(0x3d17ed0), "/disk/yinchuanlin/01_pingguodue_project/02_OMIGA/09_maker_wor"..., "e", "/disk/yinchuanlin/software/maker/exonerate-2.2.0-x86_64/bin/e"..., ...)
called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 1986 Process::MpiChunk::__ANON__() called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 415 eval {...} called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 407 Error::subs::try(CODE(0x3f381b8), HASH(0x3f62380)) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 4224 Process::MpiChunk::_go(Process::MpiChunk=HASH(0x3e30020), "run", HASH(0x7fe026352410), 2, 3) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 341 Process::MpiChunk::run(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 357 Process::MpiChunk::run_all(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287 Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3d43310), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287 Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3f28b28), 0) called at /disk/yinchuanlin/software/maker/maker/bin/maker line 686 --> rank=NA, hostname=node7 ERROR: Failed while polishig ESTs ERROR: Chunk failed at level:2, tier_type:3 FAILED CONTIG:contig1 ERROR: Chunk failed at level:4, tier_type:0 FAILED CONTIG:contig1 Thanks! Best regards, Ian Department of Entomology, College of Plant Protection, Nanjing Agricultural University No. 1, Weigang Road, Xuanwu District, Nanjing, Jiangsu 210095, China -------------- next part -------------- An HTML attachment was scrubbed... 
From carsonhh at gmail.com Sun Aug 30 11:09:24 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:09:24 -0600
Subject: [maker-devel] maker problem
In-Reply-To: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com>
References: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com>

Separate out just that contig into its own file and run it by itself.

You may also want to rename the contig. Don't call it 0. A name that is simply a zero can cause any number of problems in MAKER or in any of the programs MAKER uses, because a bare 0 evaluates as 'false' in many programming languages, Perl included. Try contig_0 instead. It looks like the first error may be the BioPerl indexer failing because the contig is named 0.

--Carson

> On Aug 30, 2015, at 2:52 AM, 尹传林 wrote:
>
> Hello,
>
> When running the MAKER genome annotation pipeline, I ran into a serious problem. My genome has 7282 scaffolds; all of them succeeded except one, which failed. I do not know why. I have checked the log, but I still can't fix the problem, so I need your help and would appreciate it very much.
>
> Here is part of the main error log:
> WARNING: Cannot find >0, trying to re-index the fasta.
> stop here:0
> ERROR: Fasta index error
> at /disk/yinchuanlin/software/maker/maker/bin/../lib/GI.pm line 1622.
> GI::polish_exonerate(FastaChunk=HASH(0x7fe0263563b8), FastaSeq=HASH(0x3f7b280), 2988503, ">contig1", ARRAY(0x3d3ad80), ARRAY(0x3d17ed0), "/disk/yinchuanlin/01_pingguodue_project/02_OMIGA/09_maker_wor"..., "e", "/disk/yinchuanlin/software/maker/exonerate-2.2.0-x86_64/bin/e"..., ...)
called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 1986 > Process::MpiChunk::__ANON__() called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 415 > eval {...} called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 407 > Error::subs::try(CODE(0x3f381b8), HASH(0x3f62380)) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 4224 > Process::MpiChunk::_go(Process::MpiChunk=HASH(0x3e30020), "run", HASH(0x7fe026352410), 2, 3) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 341 > Process::MpiChunk::run(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 357 > Process::MpiChunk::run_all(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287 > Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3d43310), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287 > Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3f28b28), 0) called at /disk/yinchuanlin/software/maker/maker/bin/maker line 686 > --> rank=NA, hostname=node7 > ERROR: Failed while polishig ESTs > ERROR: Chunk failed at level:2, tier_type:3 > FAILED CONTIG:contig1 > > ERROR: Chunk failed at level:4, tier_type:0 > FAILED CONTIG:contig1 > > Thanks! > > Best regards, > Ian > Department of Entomology, College of Plant Protection, Nanjing Agricultural University > No. 1, Weigang Road, Xuanwu District, Nanjing, Jiangsu 210095, China > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... 
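
The renaming advice in this thread can be checked before a run with a few lines of Python. This is only a sketch under assumptions: it treats any sequence ID made up purely of digits (such as a bare 0) as risky, and the function names and the contig_ prefix are illustrative choices, not anything MAKER itself provides. Whether the failing contig in the log above really was named 0 is Carson's reading of the warning, not something the log confirms.

```python
def split_header(line):
    """Split a FASTA header line into (sequence ID, rest of the description)."""
    parts = line[1:].split(None, 1)
    seq_id = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    return seq_id, rest


def numeric_ids(fasta_text):
    """Return the sequence IDs that consist only of digits, e.g. '0'."""
    found = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            seq_id, _ = split_header(line)
            if seq_id.isdigit():
                found.append(seq_id)
    return found


def prefix_numeric_ids(fasta_text, prefix="contig_"):
    """Return the FASTA text with digit-only IDs renamed to prefix + ID."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            seq_id, rest = split_header(line)
            if seq_id.isdigit():
                seq_id = prefix + seq_id
            line = ">" + seq_id + (" " + rest if rest else "")
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Tiny made-up input: one digit-only ID and one ordinary ID.
    example = ">0 first scaffold\nACGT\n>contig1\nGGCC\n"
    print(numeric_ids(example))
    print(prefix_numeric_ids(example), end="")
```

Renaming has to be applied consistently: any evidence or GFF3 files that refer to the old sequence names would need the same change.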
From carsonhh at gmail.com Sun Aug 30 11:31:38 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:31:38 -0600
Subject: [maker-devel] maker on a nematode: few novel proteins

Hi John,

MAKER requires that at least some form of evidence support each model. This is because ab initio predictors like Augustus and SNAP tend to overpredict, in extreme cases by as much as 10-fold. So if you do not have any mRNA-based evidence, the chance that you would find something novel is extremely limited. This may be further complicated by the fact that nematodes tend to be highly divergent at the amino acid level, more than would be expected from their evolutionary relationship to other organisms. Their molecular clock is highly accelerated, and even making a phylogenetic tree using nematodes is a nightmare because of the extremely long branch lengths they generate. So you may miss a lot of orthologs simply because proteins won't align across such divergence. A classic example is smg-5 in C. elegans, which will not align to its supposed orthologs in other eukaryotes at the amino acid level, but appears to have a conserved function and domain structure that is only detectable at a higher level.

You may be able to rescue a number of rejected gene models by using InterProScan to identify protein domains from the non-overlapping-ab-inits fasta files. In the absence of mRNA evidence, that really would be the best way of capturing as much as possible.

Also keep in mind that if the nematode you are working with is parasitic, then its evolution will be dominated by genome reduction as opposed to evolving new and novel genes. So having a smaller gene count with fewer novel genes will be expected. In fact the major trend in all Deuterostome evolution tends to be gene loss as opposed to gene gain.
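
The InterProScan suggestion above amounts to asking which ab initio models carry a recognizable domain. A minimal Python sketch of that filtering step follows. It assumes InterProScan's tab-separated output, in which the first column is the protein ID and the fourth is the analysis name (for example Pfam); the function name and the example IDs are made up for illustration, and how the selected models are then passed back to MAKER is not shown here.

```python
def proteins_with_domain(tsv_text, analyses=("Pfam",)):
    """Return IDs of proteins with at least one hit from the given analyses.

    Assumes InterProScan TSV rows: column 1 is the protein ID and
    column 4 is the analysis that produced the hit (e.g. "Pfam").
    """
    supported = set()
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 4 and fields[3] in analyses:
            supported.add(fields[0])
    return supported


if __name__ == "__main__":
    # Two made-up models: only model_A has a Pfam hit.
    example = (
        "model_A\tmd5a\t250\tPfam\tPF00069\tProtein kinase domain"
        "\t10\t240\t1.2E-30\tT\t30-08-2015\n"
        "model_B\tmd5b\t180\tCoils\tCoil\tCoil"
        "\t20\t40\t-\tT\t30-08-2015\n"
    )
    print(sorted(proteins_with_domain(example)))
```

The resulting ID list is only a candidate set: a domain hit makes a model more plausible but does not by itself confirm the gene structure.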
Thanks, Carson > On Aug 27, 2015, at 10:36 AM, John Bracht wrote: > > Hi Carson, all, > > I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas. > > Briefly, I'm annotating a novel nematode genome that appears to have a reduced genome (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data so I'm running ab-initio predictors SNAP and Augustus supplemented with lots of protein hints: swiss-prot and a combined library of 28 nematode proteomes. > > Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the swiss-prot database to the genome (no nematode proteins used in this round), which output a .gff file with swiss-prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker. > > Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly). We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from wormbase) and inputting the gff file containing the swiss-prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins. > > Maker Round 2: The output of Round 1 was then filtered for protein hints giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus. 
We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the swiss-prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc (some introns are on the small side but that's also consistent with a reduced genome size). > > My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast-match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast-match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training. On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast). 
> > In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 is relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the high repetitive nature of this genome the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are not obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense) but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins. > > For the record, the N50 of our assembly is about 50kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes. > > So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any an all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful. > > Thanks, > John > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... 
From carsonhh at gmail.com Sun Aug 30 11:35:43 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:35:43 -0600
Subject: [maker-devel] maker on a nematode: few novel proteins
Message-ID: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com>

Sorry, I meant to say Protostome, not Deuterostome (last sentence).

--Carson

> On Aug 30, 2015, at 11:31 AM, Carson Holt wrote:
>
> Hi John,
>
> MAKER requires that at least some form of evidence support each model. This is because ab initio predictors like Augustus and SNAP tend to overpredict, in extreme cases by as much as 10-fold. So if you do not have any mRNA-based evidence, the chance that you would find something novel is extremely limited. This may be further complicated by the fact that nematodes tend to be highly divergent at the amino acid level, more than would be expected from their evolutionary relationship to other organisms. Their molecular clock is highly accelerated, and even making a phylogenetic tree using nematodes is a nightmare because of the extremely long branch lengths they generate. So you may miss a lot of orthologs simply because proteins won't align across such divergence. A classic example is smg-5 in C. elegans, which will not align to its supposed orthologs in other eukaryotes at the amino acid level, but appears to have a conserved function and domain structure that is only detectable at a higher level.
>
> You may be able to rescue a number of rejected gene models by using InterProScan to identify protein domains from the non-overlapping-ab-inits fasta files. In the absence of mRNA evidence, that really would be the best way of capturing as much as possible.
>
> Also keep in mind that if the nematode you are working with is parasitic, then its evolution will be dominated by genome reduction as opposed to evolving new and novel genes. So having a smaller gene count with fewer novel genes will be expected.
> In fact the major trend in all Deuterostome evolution tends to be gene loss as opposed to gene gain.
>
> Thanks,
> Carson
>
>> On Aug 27, 2015, at 10:36 AM, John Bracht wrote:
>>
>> Hi Carson, all,
>>
>> I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas.
>>
>> Briefly, I'm annotating a novel nematode genome that appears to be reduced (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data, so I'm running the ab initio predictors SNAP and Augustus supplemented with lots of protein hints: swiss-prot and a combined library of 28 nematode proteomes.
>>
>> Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the swiss-prot database to the genome (no nematode proteins used in this round), which output a .gff file with swiss-prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker.
>>
>> Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly). We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from wormbase) and inputting the gff file containing the swiss-prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins.
>>
>> Maker Round 2: The output of Round 1 was then filtered for protein hints, giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus.
>> We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the swiss-prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced, but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc. (some introns are on the small side but that's also consistent with a reduced genome size).
>>
>> My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast-match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast-match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training. On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast).
>>
>> In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 is relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the highly repetitive nature of this genome, the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are no obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the set of 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense), but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins.
>>
>> For the record, the N50 of our assembly is about 50kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes.
>>
>> So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any and all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful.
>>
>> Thanks,
>> John
-------------- next part --------------
An HTML attachment was scrubbed...
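John's count of 'novel' proteins (queries with no blastp match at e-value 1e-5) can be sketched as follows. This is a minimal illustration, not the script used in the thread: it assumes BLAST tabular output (`-outfmt 6`, where column 1 is the query ID and column 11 is the e-value), assumes the cutoff is applied as "e-value at or below 1e-5", and uses a hypothetical helper name `novel_queries` with made-up toy records.

```python
def novel_queries(query_ids, blast_tab_lines, max_evalue=1e-5):
    """Return the query IDs that have no BLAST hit at or below max_evalue.

    Assumes tab-separated BLAST output (-outfmt 6): column 1 is the query ID
    and column 11 is the e-value.
    """
    matched = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if float(cols[10]) <= max_evalue:
            matched.add(cols[0])
    return [q for q in query_ids if q not in matched]


# Hypothetical toy input: three predicted proteins and two tabular hits.
queries = ["prot1", "prot2", "prot3"]
hits = [
    "prot1\tCE0001\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "prot2\tCE0002\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
]
print(novel_queries(queries, hits))  # ['prot2', 'prot3']
```

Here `prot2` counts as novel because its only hit is weaker than the cutoff, and `prot3` because it has no hit at all.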
From carsonhh at gmail.com Sun Aug 30 11:44:24 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:44:24 -0600
Subject: [maker-devel] maker on a nematode: few novel proteins
In-Reply-To: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com>
References: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com>
Message-ID:

Should have said: the major trend in all Protostome evolution tends to be gene loss as opposed to gene gain. And the process of gene loss is more pronounced in the Ecdysozoa (insects and nematodes) than in the Lophotrochozoa (mollusks and flatworms).

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...

From ccheng at jcvi.org Mon Aug 31 08:47:24 2015
From: ccheng at jcvi.org (Cheng, Chia-Yi)
Date: Mon, 31 Aug 2015 10:47:24 -0400
Subject: [maker-devel] AED scores from MAKER pipeline - deterministic or not?
Message-ID:

Hello MAKER team,

We at JCVI have been using MAKER (2.31.8) to calculate the AED of Arabidopsis gene models. We provided the annotation set as "model_gff" with evidence files in "protein_gff" and "est_gff". All the other settings were default. One issue I've noticed is that the AED scores do not seem to be deterministic. When I compare the AED scores from two runs using identical control files, ~1,000 (out of 35,385) gene models have different AED scores. The difference between the two sets of AED scores ranges from 0.01 to 1.00.

I looked into several gene models with larger differences, e.g. AED = 0.00 in run 1 and AED = 1.00 in run 2, and noticed a disagreement in the QI:

Run 1: _AED=0.00;_eAED=-0.00;_QI=0|-1|0|1|-1|0|1|0|344
Run 2: _AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|0|344

The discrepancy in the 4th column seems to suggest the evidence file was not used properly in run 2. I'm not sure what may have caused this, as both runs used the same input. A snapshot of the evidence files is pasted at the end of this email in case it is needed.

Please let me know if more info is needed. Any help is appreciated. Thank you.

Chia-Yi

RNA-seq evidence file:
Chr1 assembler-aerial2_pasa cDNA_match 3624 5927 . + . ID=aerial2_align_161343;Target=asmbl_1 1082 1234 +%2Casmbl_1 692 1081 +%2Casmbl_1 572 691 +%2Casmbl_1 1 290 +%2Casmbl_1 291 571 +%2Casmbl_1 1235 1723 +
Chr1 assembler-aerial2_pasa match_part 3624 3913 . + .
ID=aerial2_align_161343-1;Parent=aerial2_align_161343
Chr1 assembler-aerial2_pasa match_part 3996 4276 . + . ID=aerial2_align_161343-2;Parent=aerial2_align_161343

EST evidence file:
Chr1 est2genome expressed_sequence_match 5470 5899 2150 - . ID=Chr1:hit:213:3.2.0.0;Name=gi|19829901|gb|AV795918|RAFL08-19-M04
Chr1 est2genome match_part 5470 5899 2150 - . ID=Chr1:hsp:500:3.2.0.0;Parent=Chr1:hit:213:3.2.0.0;Target=gi|19829901|gb|AV795918|RAFL08-19-M04 2 431 +;Gap=M430

Protein evidence file:
Chr1 protein2genome protein_match 3760 5284 727 + . ID=Chr1:hit:202:3.10.0.0;Name=UniRef90_M4EWW1
Chr1 protein2genome match_part 3760 3913 727 + . ID=Chr1:hsp:488:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 1 50;Gap=M31 D1 M19 F1
Chr1 protein2genome match_part 3996 4276 727 + . ID=Chr1:hsp:489:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 51 144;Gap=R1 M23 D1 M28 D1 M36 I2 M5
-------------- next part --------------
An HTML attachment was scrubbed...

From carsonhh at gmail.com Mon Aug 31 09:08:34 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Aug 2015 09:08:34 -0600
Subject: [maker-devel] AED scores from MAKER pipeline - deterministic or not?
In-Reply-To:
References:
Message-ID: <81B272E9-3439-49C6-96E0-674B03F87569@gmail.com>

I would have to see the actual GFF3 files (the full files, including the FASTA at the end). Give me both GFF3 files and the coordinates of the gene in question.

My first guess is that you had the single_exon= filter set to different values on each run. The gene in question is an unspliced single exon gene (based on the QI), your primary piece of evidence appears to be a single exon EST, and the only value that changes in the QI is the exon overlap. Single exon evidence will be ignored by default for the AED calculation unless you have single_exon set to 1.
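The observation that only one QI value changes between the two runs can be checked mechanically. The following is a minimal sketch, not MAKER code: it splits two `_QI` strings on `|` and reports which 1-based columns differ. The function name `qi_diff` is hypothetical, the two strings are the ones quoted in this thread, and the sketch assigns no meaning to the columns beyond what the thread states.

```python
def qi_diff(qi_a, qi_b):
    """Return the 1-based QI columns whose values differ between two runs."""
    fields_a = qi_a.split("|")
    fields_b = qi_b.split("|")
    if len(fields_a) != len(fields_b):
        raise ValueError("QI strings have different numbers of columns")
    return [
        i
        for i, (a, b) in enumerate(zip(fields_a, fields_b), start=1)
        if a != b
    ]


# QI strings reported for the same gene model in run 1 and run 2.
run1 = "0|-1|0|1|-1|0|1|0|344"
run2 = "0|-1|0|0|-1|0|1|0|344"
print(qi_diff(run1, run2))  # [4]
```

Running the same comparison over every model whose AED differs would show whether the 4th column is always the one that changes, which is what a differing single_exon setting would be expected to produce.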
Thanks,
Carson
-------------- next part --------------
An HTML attachment was scrubbed...

From palmierinico at gmail.com Tue Aug 4 08:46:38 2015
From: palmierinico at gmail.com (Nicola Palmieri)
Date: Tue, 4 Aug 2015 16:46:38 +0200
Subject: [maker-devel] MAKER: single exons not predicted
Message-ID:

Dear MAKER developers,

I am using Maker to annotate a new protozoan species (Cystoisospora suis). I am trying to incorporate TopHat/Cufflinks evidence from RNA-Seq, but some genes still do not get annotated.

I attached a screenshot from IGV, in which I use Augustus and Cufflinks separately, and then various runs of Maker incorporating TopHat and/or Cufflinks as evidence. The single exon genes are consistently not predicted (e.g. the gene under the label g.38.t1). I have also played around with the single_exon option. I didn't find any similar post in the wiki.

I would really appreciate some hints on how to proceed.

Kind regards,
Nicola

--
Nicola Palmieri
Postdoctoral fellow
Institut für Parasitologie
Vetmeduni Vienna
-------------- next part --------------
An HTML attachment was scrubbed...
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2015-08-04 at 16.39.26.png
Type: image/png
Size: 230184 bytes
Desc: not available

From dence at genetics.utah.edu Tue Aug 4 10:03:03 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Tue, 4 Aug 2015 16:03:03 +0000
Subject: [maker-devel] MAKER: single exons not predicted
In-Reply-To:
References:
Message-ID:

Hi Nicola,

I don't think that Augustus will predict single-exon genes in most cases. That's probably responsible for your results.

~Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
-------------- next part --------------
An HTML attachment was scrubbed...
From carsonhh at gmail.com Tue Aug 4 10:37:41 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 4 Aug 2015 10:37:41 -0600
Subject: [maker-devel] MAKER: single exons not predicted
In-Reply-To:
References:
Message-ID: <62FFE93E-E54B-4F8B-AE16-9C63CB43DF9E@gmail.com>

Most single exon alignments from ESTs or mRNA-seq will actually be spurious in nature. That is the reason why single exon alignments are ignored by default. Also, as Daniel mentioned, gene predictors do not like to call single exon genes. Maker gets around this by supplying hints to the predictor based on alignment evidence, which increases the probabilities of the HMM used by the predictor within a given region, but given the spurious nature of single exon nucleotide alignments, you will also need protein alignment support for single exon genes for this to work.

Also, the region you gave as an example does not appear to be an open reading frame. In fact, almost the entire region that overlaps the Augustus call is labeled as UTR. It looks like a classic spurious alignment, likely an unmasked repetitive element.

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...

From xh33 at georgetown.edu Thu Aug 20 09:37:19 2015
From: xh33 at georgetown.edu (Xin Huang)
Date: Thu, 20 Aug 2015 11:37:19 -0400
Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser
Message-ID:

Dear Developer,

I was trying to use a mixed GFF3 file to do some analysis, which includes genome-level gene predictions and manually curated annotations. I replaced the gene predictions with Maker annotations where appropriate. A sample of the blended GFF3 file is as follows:

scaffold10370 GLEAN mRNA 3918 18376 0.920154 + . ID=CCG000529.1;source_id=Aalb_GLEAN_10002648;
scaffold10370 GLEAN CDS 3918 3983 . + 0 Parent=CCG000529.1;
scaffold10370 GLEAN CDS 4521 4598 . + 0 Parent=CCG000529.1;
scaffold10370 GLEAN CDS 17407 17516 . + 0 Parent=CCG000529.1;
scaffold10370 GLEAN CDS 17588 17767 . + 1 Parent=CCG000529.1;
scaffold10370 GLEAN CDS 18238 18376 . + 1 Parent=CCG000529.1;
scaffold10370 maker gene 32452 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=AAEL004146
scaffold10370 maker mRNA 32452 54508 44204 - .
ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;aED=0.00;eAED=0.00;qI=376|1|1|1|0|0|5|316|548
scaffold10370 maker exon 32452 33011 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:exon:15;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker exon 35140 35388 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker exon 35455 36119 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker exon 36443 36777 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker exon 53979 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker five_prime_UTR 54133 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:five_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 53979 54132 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 36443 36777 . - 2 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 35455 36119 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 35140 35388 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker CDS 32768 33011 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2
scaffold10370 maker three_prime_UTR 32452 32767 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:three_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2

This file does not pass the GFF3 online validator from GenomeTools, which gives the following error message:

GenomeTools error: Parent "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" on line 1456 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/Aalbo.gff" was not defined (via "ID=")

And when I tried to use gff3toGenePred to convert the GFF3 to the refFlat format, I got the following parsing errors (samples):

Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12" Parent attribute
Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11" Parent attribute
Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10" Parent attribute
Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9" Parent attribute
Can't find annotation record "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-2" referenced by "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-1:exon:114" Parent attribute
Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:118" Parent attribute
Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:119" Parent attribute
Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:120" Parent attribute
Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:121" Parent attribute

What could have been the issue in the annotation file or the way I used the file? Any feedback will be highly appreciated!

Sincerely,
Xin Huang
-------------- next part --------------
An HTML attachment was scrubbed...

From carsonhh at gmail.com Tue Aug 25 10:51:59 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 25 Aug 2015 10:51:59 -0600
Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser
In-Reply-To:
References:
Message-ID:

You have edited the file in a way that broke it. Gene maker-scaffold10370-exonerate_est2genome-gene-0.0 previously had two separate transcripts (mRNA-1 and mRNA-2). You have deleted mRNA-1, but some of its child exons still exist. Notice that some exons have two parents, and you have deleted one of the parents without removing the parent relationship.

--Carson

> On Aug 20, 2015, at 9:37 AM, Xin Huang wrote:
>
> Dear Developer,
>
> I was trying to use a mixed GFF3 file to do some analysis, which includes genome-level gene predictions and manually curated annotations. I replaced the gene prediction by Maker annotations where appropriate.
And a sample of the blended GFF3 file is as follows: > > scaffold10370 GLEAN mRNA 3918 18376 0.920154 + . ID=CCG000529.1;source_id=Aalb_GLEAN_10002648; > scaffold10370 GLEAN CDS 3918 3983 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 4521 4598 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17407 17516 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17588 17767 . + 1 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 18238 18376 . + 1 Parent=CCG000529.1; > scaffold10370 maker gene 32452 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=AAEL004146 > scaffold10370 maker mRNA 32452 54508 44204 - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;aED=0.00;eAED=0.00;qI=376|1|1|1|0|0|5|316|548 > scaffold10370 maker exon 32452 33011 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:exon:15;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35140 35388 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35455 36119 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 36443 36777 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 53979 54508 . - . 
ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker five_prime_UTR 54133 54508 . - . ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:five_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 53979 54132 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 36443 36777 . - 2 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35455 36119 . - 0 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35140 35388 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 32768 33011 . - 1 ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker three_prime_UTR 32452 32767 . - . 
ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:three_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > > This file does not pass the GFF3 online validator from GenomeTools with the following error message: > > GenomeTools error: Parent "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" on line 1456 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/Aalbo.gff " was not defined (via "ID=") > > And when I tried to use gff3toGenePred to convert GFF3 to the refflat format, I got the following parsing errors (samples): > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12" Parent attribute > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11" Parent attribute > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10" Parent attribute > > Can't find annotation record "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9" Parent attribute > > Can't find annotation record "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-2" referenced by "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-1:exon:114" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:118" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:119" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" 
referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:120" Parent attribute > > Can't find annotation record "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:121" Parent attribute > > What could have been the issue in the annotation file or the way I used the file? Any feedback will be highly appreciated! > > Sincerely, > > Xin Huang > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From xh33 at georgetown.edu Tue Aug 25 11:04:37 2015 From: xh33 at georgetown.edu (Xin Huang) Date: Tue, 25 Aug 2015 13:04:37 -0400 Subject: [maker-devel] Maker annotation file not recognized by gff3toGenePred of UCSC Genome Browser In-Reply-To: References: Message-ID: Thank you very much for your response. Yes, that is the issue, and after it had been fixed, the file went through the program just fine. Thanks, Xin On Tue, Aug 25, 2015 at 12:51 PM, Carson Holt wrote: > You have edited the file in a way that broke it. > Gene maker-scaffold10370-exonerate_est2genome-gene-0.0 previously had two > separate transcript sequences (mRNA-1 and mRNA-2). You have deleted > mRNA-1, but some of it?s children exons still exist. Notice that some exons > have two parents, and you have deleted one of the parents without removing > the parent relationship. > > ?Carson > > > > On Aug 20, 2015, at 9:37 AM, Xin Huang wrote: > > Dear Developer, > > I was trying to use a mixed GFF3 file to do some analysis, which includes > genome-level gene predictions and manually curated annotations. I replaced > the gene prediction by Maker annotations where appropriate. And a sample of > the blended GFF3 file is as follows: > > scaffold10370 GLEAN mRNA 3918 18376 0.920154 + . 
> ID=CCG000529.1;source_id=Aalb_GLEAN_10002648; > scaffold10370 GLEAN CDS 3918 3983 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 4521 4598 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17407 17516 . + 0 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 17588 17767 . + 1 Parent=CCG000529.1; > scaffold10370 GLEAN CDS 18238 18376 . + 1 Parent=CCG000529.1; > scaffold10370 maker gene 32452 54508 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=AAEL004146 > scaffold10370 maker mRNA 32452 54508 44204 - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0;Name=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2;aED=0.00;eAED=0.00;qI=376|1|1|1|0|0|5|316|548 > scaffold10370 maker exon 32452 33011 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:exon:15;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35140 35388 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 35455 36119 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 36443 36777 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker exon 53979 54508 . - . > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1,maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker five_prime_UTR 54133 54508 . - . 
> ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:five_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 53979 54132 . - 0 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 36443 36777 . - 2 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35455 36119 . - 0 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 35140 35388 . - 1 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker CDS 32768 33011 . - 1 > ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:cds;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > scaffold10370 maker three_prime_UTR 32452 32767 . - . 
> ID=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2:three_prime_utr;Parent=maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-2 > > This file does not pass the GFF3 online validator from GenomeTools with > the following error message: > > GenomeTools error: Parent > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" on line 1456 in > file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/Aalbo.gff" was > not defined (via "ID=") > > And when I tried to use gff3toGenePred to convert GFF3 to the refflat > format, I got the following parsing errors (samples): > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:12" Parent > attribute > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:11" Parent > attribute > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:10" Parent > attribute > > Can't find annotation record > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold10370-exonerate_est2genome-gene-0.0-mRNA-1:exon:9" Parent > attribute > > Can't find annotation record > "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-2" referenced by > "maker-scaffold1193-exonerate_est2genome-gene-2.0-mRNA-1:exon:114" Parent > attribute > > Can't find annotation record > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:118" Parent > attribute > > Can't find annotation record > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:119" Parent > attribute > > Can't find annotation record > 
"maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:120" Parent > attribute > > Can't find annotation record > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1" referenced by > "maker-scaffold16459-exonerate_est2genome-gene-0.0-mRNA-1:exon:121" Parent > attribute > > What could have been the issue in the annotation file or the way I used > the file? Any feedback will be highly appreciated! > > Sincerely, > > Xin Huang > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcornel3 at asu.edu Tue Aug 25 12:38:05 2015 From: jcornel3 at asu.edu (John Cornelius) Date: Tue, 25 Aug 2015 11:38:05 -0700 Subject: [maker-devel] CEGMA vs. initial MAKER output for SNAP training Message-ID: Hello, I was wondering if it would be better to generate the first hmm for SNAP training from CEGMA results or if I should use the output from my initial MAKER run? Thanks. -- John Cornelius MCB PhD Candidate Arizona State University -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Aug 25 12:45:52 2015 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 25 Aug 2015 12:45:52 -0600 Subject: [maker-devel] CEGMA vs. initial MAKER output for SNAP training In-Reply-To: References: Message-ID: <73C81426-2B40-42D9-8150-F152F18B23FD@gmail.com> Either works. Both just count as a starting point (you just need something). Then you will run MAKER once more with this new HMM and do a final round of bootstrap training using the results of that last MAKER run. Regardless of what you start with, results will converge when you do the last bootstrap training step. 
--Carson

From johnbracht at gmail.com Thu Aug 27 10:36:41 2015
From: johnbracht at gmail.com (John Bracht)
Date: Thu, 27 Aug 2015 12:36:41 -0400
Subject: [maker-devel] maker on a nematode: few novel proteins

Hi Carson, all,

I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas.

Briefly, I'm annotating a novel nematode genome that appears to have a reduced genome (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data so I'm running ab-initio predictors SNAP and Augustus supplemented with lots of protein hints: swiss-prot and a combined library of 28 nematode proteomes.

Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the swiss-prot database to the genome (no nematode proteins used in this round), which output a .gff file with swiss-prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker.

Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly). We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from wormbase) and inputting the gff file containing the swiss-prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins.

Maker Round 2: The output of Round 1 was then filtered for protein hints giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus. We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the swiss-prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc (some introns are on the small side but that's also consistent with a reduced genome size).

My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast-match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast-match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training. On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast).

In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 is relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the high repetitive nature of this genome the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are not obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense) but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins.

For the record, the N50 of our assembly is about 50kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes.

So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any and all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful.

Thanks,
John
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4730 bytes
Desc: not available
URL:

From yincl2013 at 126.com Sun Aug 30 02:52:09 2015
From: yincl2013 at 126.com (尹传林)
Date: Sun, 30 Aug 2015 16:52:09 +0800
Subject: [maker-devel] maker problem
Message-ID: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com>

Hello,

When I used MAKER for a genome annotation pipeline, I ran into a serious problem. My genome has 7282 scaffolds; all of them succeeded except one, which failed. I do not know why. I have checked the log, but I still can't fix the problem, so I need your help and would appreciate it very much.

Here is part of the main error log:

WARNING: Cannot find >0, trying to re-index the fasta.
stop here:0
ERROR: Fasta index error
 at /disk/yinchuanlin/software/maker/maker/bin/../lib/GI.pm line 1622.
 GI::polish_exonerate(FastaChunk=HASH(0x7fe0263563b8), FastaSeq=HASH(0x3f7b280), 2988503, ">contig1", ARRAY(0x3d3ad80), ARRAY(0x3d17ed0), "/disk/yinchuanlin/01_pingguodue_project/02_OMIGA/09_maker_wor"..., "e", "/disk/yinchuanlin/software/maker/exonerate-2.2.0-x86_64/bin/e"..., ...) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 1986
 Process::MpiChunk::__ANON__() called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 415
 eval {...} called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Error.pm line 407
 Error::subs::try(CODE(0x3f381b8), HASH(0x3f62380)) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 4224
 Process::MpiChunk::_go(Process::MpiChunk=HASH(0x3e30020), "run", HASH(0x7fe026352410), 2, 3) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 341
 Process::MpiChunk::run(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiChunk.pm line 357
 Process::MpiChunk::run_all(Process::MpiChunk=HASH(0x3e30020), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287
 Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3d43310), 0) called at /disk/yinchuanlin/software/maker/maker/bin/../lib/Process/MpiTiers.pm line 287
 Process::MpiTiers::run_all(Process::MpiTiers=HASH(0x3f28b28), 0) called at /disk/yinchuanlin/software/maker/maker/bin/maker line 686
--> rank=NA, hostname=node7
ERROR: Failed while polishig ESTs
ERROR: Chunk failed at level:2, tier_type:3
FAILED CONTIG:contig1

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:contig1

Thanks!

Best regards,
Ian
Department of Entomology, College of Plant Protection, Nanjing Agricultural University
No. 1, Weigang Road, Xuanwu District, Nanjing, Jiangsu 210095, China

From carsonhh at gmail.com Sun Aug 30 11:09:24 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:09:24 -0600
Subject: [maker-devel] maker problem
In-Reply-To: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com>
References: <2f67a261.1e724d.14f7dce99f8.Coremail.yincl2013@126.com>

Separate out just that contig into a separate file and run it by itself. You may also want to rename the contig. Don't call it 0. Having the name simply be a zero has the potential to cause any number of problems in MAKER or any of the programs MAKER uses because 0 all by itself means "false" in most computer languages. Try contig_0 instead. It looks like the first error may be the BioPerl indexer failing because the contig is named 0.

--Carson

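The renaming suggested in this reply (do not leave a contig called 0; try contig_0 instead) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rename_fasta_ids` and the `contig_` prefix are made up here, the ID is taken to be the first space-delimited token of the header, and any evidence or GFF files that refer to the old name would have to be updated to match.

```python
import io


def rename_fasta_ids(lines, prefix="contig_"):
    """Yield FASTA lines, prefixing every purely numeric sequence ID.

    The ID is assumed to be the first space-delimited token after '>'.
    Sequence lines and non-numeric IDs pass through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            seq_id, _, description = header.partition(" ")
            if seq_id.isdigit():
                # A bare "0" evaluates as false in Perl, so give it a prefix.
                seq_id = prefix + seq_id
            line = ">" + seq_id + (" " + description if description else "") + "\n"
        yield line


if __name__ == "__main__":
    # Hypothetical two-record FASTA used only for demonstration.
    demo = io.StringIO(">0 first scaffold\nACGT\n>scaffold7\nGGCC\n")
    print("".join(rename_fasta_ids(demo)), end="")
```

Writing the renamed records to a new file, rather than editing the genome FASTA in place, keeps the original available for comparison.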
From carsonhh at gmail.com Sun Aug 30 11:31:38 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:31:38 -0600
Subject: [maker-devel] maker on a nematode: few novel proteins

Hi John,

MAKER requires that at least some form of evidence support each model. This is because ab initio predictors like Augustus and SNAP tend to overpredict, in extreme cases by as much as 10-fold. So if you do not have any mRNA-based evidence, the chance that you would find something novel is extremely limited.

This may be further complicated by the fact that nematodes tend to be highly divergent at the amino acid level, more than would be expected from their evolutionary relationship to other organisms. Their molecular clock is highly accelerated, and even making a phylogenetic tree using nematodes is a nightmare because of the extremely long branch lengths they generate. So you may miss a lot of orthologs simply because proteins won't align across such divergence. A classic example is smg-5 in C. elegans, which will not align to its supposed orthologs in other eukaryotes at the amino acid level, but appears to have a conserved function and domain structure that is only detectable at a higher level.

You may be able to rescue a number of rejected gene models by using InterProScan to identify protein domains from the non-overlapping-ab-inits fasta files. In the absence of mRNA evidence, that really would be the best way of capturing as much as possible.

Also keep in mind that if the nematode you are working with is parasitic, then its evolution will be dominated by genome reduction as opposed to evolving new and novel genes. So having a smaller gene count with fewer novel genes will be expected. In fact the major trend in all Deuterostome evolution tends to be gene loss as opposed to gene gain.

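As a rough illustration of the domain-based rescue idea above, the sketch below collects the IDs of rejected ab initio proteins that have at least one domain hit in an InterProScan result table. It assumes InterProScan's tab-separated output, with the protein accession in column 1 and the analysis name (for example Pfam) in column 4; the function name and the protein IDs in the demo rows are hypothetical, and how the selected models are then fed back into MAKER is not covered here.

```python
def ids_with_domain_hits(tsv_lines, analyses=("Pfam",)):
    """Return the set of protein IDs with at least one hit from `analyses`.

    Each line is assumed to be one InterProScan TSV row:
    column 1 = protein accession, column 4 = analysis name.
    Rows with fewer than five columns are skipped.
    """
    supported = set()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        if fields[3] in analyses:
            supported.add(fields[0])
    return supported


if __name__ == "__main__":
    # Hypothetical rows; real input would be the InterProScan TSV produced
    # from the non-overlapping ab initio protein FASTA.
    demo = [
        "snap-ctg1-abinit-gene-0.1-mRNA-1\tmd5a\t310\tPfam\tPF00069\t"
        "Protein kinase domain\t12\t280\t1.2E-40\tT\t30-08-2015\n",
        "snap-ctg1-abinit-gene-0.2-mRNA-1\tmd5b\t150\tCoils\tCoil\t"
        "Coil\t20\t40\t-\tT\t30-08-2015\n",
    ]
    print(sorted(ids_with_domain_hits(demo)))
```

The resulting ID list is only a starting point for deciding which rejected models are worth a second look; a single domain hit does not by itself confirm a gene model.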
Thanks,
Carson

> On Aug 27, 2015, at 10:36 AM, John Bracht wrote:
>
> Hi Carson, all,
>
> I am a new user of Maker and have been overall quite pleased with it. Let me say also that the information I've gleaned on this forum has been extremely helpful and important in learning how to successfully use the software. I'm at a point where I'm planning next steps and hoping to solicit suggestions and ideas.
>
> Briefly, I'm annotating a novel nematode genome that appears to be reduced (~60 Mb, but about 24% repetitive, with a custom RepeatModeler library that we generated). For technical reasons I do not have access to RNA data, so I'm running the ab initio predictors SNAP and Augustus supplemented with lots of protein hints: Swiss-Prot and a combined library of 28 nematode proteomes.
>
> Maker Round 0: My student and I initially ran Maker without any active predictors, just to align the Swiss-Prot database to the genome (no nematode proteins used in this round), which output a .gff file with Swiss-Prot alignments, but no predicted proteins as SNAP and Augustus were turned off. We consider this 'round 0' of Maker.
>
> Maker Round 1: We then trained SNAP with CEGMA output (which showed the assembly is 98% complete and identified 243 eukaryotic orthologs in our genome assembly). We proceeded with Maker round 1, run with CEGMA-trained SNAP, supplemented with a combined library of 28 nematode proteomes (protein predictions obtained from WormBase) and inputting the gff file containing the Swiss-Prot alignments from 'round 0' of Maker (protein_pass set to 1). Round 1 produced about 10,000 proteins.
>
> Maker Round 2: The output of Round 1 was then filtered for protein hints, giving a training set of about 3,000 proteins that we used in re-training SNAP, and in training Augustus. We then re-ran Maker without protein hints but with SNAP and Augustus re-trained; we did however feed in the .gff file produced in Round 1, containing all protein alignments (protein_pass set to 1). This gff file has all the Swiss-Prot alignments as well as the 28 nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At first I thought this was only half of what should be produced but now I'm re-evaluating that notion. I've been comparing known domains and protein families and the numbers are quite comparable across nematodes. There is a clear 'core' of conserved nematode proteins (about 5,000) and with C. elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My student and I have manually inspected a number of predictions and they appear quite reasonable, with good numbers of introns etc. (some introns are on the small side but that's also consistent with a reduced genome size).
>
> My uncertainty arises from this: I took the 9,374 proteins and performed blastp against the 28 nematode proteome database from our training, and found that only 72 are 'novel', lacking a blast match (e-value 1e-5). It is widely reported in the literature that new nematode projects often identify about 30% novel proteins with no blast match to any other organism. I am concerned that we are missing these novel proteins, perhaps due to our heavy reliance on 'known' proteins as hints in Maker training. On the other hand, novel proteins could have been predicted as easily as other proteins by Maker, as to my knowledge there is no requirement that protein hints support a given prediction. (About 18% of our proteins have no Pfam domain predictions, so in some sense they are novel, but they're matching other 'unknown' proteins in other nematodes by blast).
>
> In sum, I'm wondering if we should re-evaluate some step of our Maker pipeline or whether we are likely to be on safe ground concluding that the 9,374 is relatively representative of the full proteome of this organism. Interestingly, given the small genome size, there simply isn't much room for more proteins to be predicted--once you account for the highly repetitive nature of this genome the gene density is slightly higher than that of C. elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are quite densely populated with predictions and there are no obvious 'gaps' where new predictions might be derived. From these observations I am tending toward the idea that the set of 9,374 proteins is relatively complete and that a lack of novel proteins is actually a scientific finding in this organism (it lives in an unusual environment where this might make sense) but I'd like to be sure we are seeing a real phenomenon, and not some weird artifact of the way we predicted the proteins.
>
> For the record, the N50 of our assembly is about 50 kb, so I don't think we're missing a lot of genes due to fragmentation, and at any rate that shouldn't preferentially affect 'novel' genes more than 'known' genes.
>
> So: where are the novel proteins? Should we amend our Maker pipeline? Thanks for any and all ideas, questions, or comments! I'm attaching our maker_opts.ctl file from the last run (Round 2) in case it is helpful.
>
> Thanks,
> John
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
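The novelty check John describes (proteins with no blastp match at e-value 1e-5) could be counted along these lines. This is a rough sketch under stated assumptions, not the pipeline actually used: it assumes blastp was run with the standard 12-column tabular output (`-outfmt 6`), where column 1 is the query ID and column 11 the e-value, and the function name and inputs are hypothetical.

```python
from typing import Iterable, List


def novel_proteins(
    query_ids: Iterable[str],
    blast_tab_lines: Iterable[str],
    max_evalue: float = 1e-5,
) -> List[str]:
    """Return query IDs that have no hit at or below max_evalue.

    Assumes BLAST tabular columns: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    matched = set()
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            matched.add(qseqid)
    # Preserve the input order of the queries
    return [q for q in query_ids if q not in matched]
```

For scale, 72 unmatched proteins out of 9,374 is a little under 1%, compared with the roughly 30% John cites from the literature.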
URL:
From carsonhh at gmail.com Sun Aug 30 11:35:43 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:35:43 -0600
Subject: [maker-devel] maker on a nematode: few novel proteins
In-Reply-To:
References:
Message-ID: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com>

Sorry I meant to say Protostome not Deuterostomes (last sentence).

--Carson

> On Aug 30, 2015, at 11:31 AM, Carson Holt wrote:
>
> In fact the major trend in all Deuterostome evolution tends to be gene loss as opposed to gene gain.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Sun Aug 30 11:44:24 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 30 Aug 2015 11:44:24 -0600
Subject: [maker-devel] maker on a nematode: few novel proteins
In-Reply-To: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com>
References: <7881AA14-B1D3-4F18-8847-9D53AE27154B@gmail.com>
Message-ID:

Should have said: the major trend in all Protostome evolution tends to be gene loss as opposed to gene gain. And the process of gene loss is more pronounced in the Ecdysozoa (insects and nematodes) than in the Lophotrochozoa (mollusks and flatworms).

--Carson

> On Aug 30, 2015, at 11:35 AM, Carson Holt wrote:
>
> Sorry I meant to say Protostome not Deuterostomes (last sentence).
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From ccheng at jcvi.org Mon Aug 31 08:47:24 2015
From: ccheng at jcvi.org (Cheng, Chia-Yi)
Date: Mon, 31 Aug 2015 10:47:24 -0400
Subject: [maker-devel] AED scores from MAKER pipeline - deterministic or not?
Message-ID:

Hello MAKER team,

We at JCVI have been using MAKER (2.31.8) to calculate the AED of Arabidopsis gene models. We provided the annotation set as "model_gff" with evidence files in "protein_gff" and "est_gff". All the other settings were default. One issue I've noticed is that the AED scores do not seem to be deterministic. When I compare the AED scores from two runs using identical control files, ~1,000 (out of 35,385) gene models have different AED scores. The difference between the two sets of AED scores ranges from 0.01 to 1.00.

I looked into several gene models with larger differences, i.e. AED = 0.00 in run 1 and AED = 1.00 in run 2, and noticed a disagreement in the QI:

Run 1: _AED=0.00;_eAED=-0.00;_QI=0|-1|0|1|-1|0|1|0|344
Run 2: _AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|0|344

The discrepancy in the 4th column seems to suggest the evidence file was not used properly in run 2. I'm not sure what may have caused this, as both runs used the same input. A snapshot of the evidence files is pasted at the end of the email in case it is needed.

Please let me know if more info is needed. Any help is appreciated. Thank you.

Chia-Yi

RNA-seq evidence file:
Chr1 assembler-aerial2_pasa cDNA_match 3624 5927 . + . ID=aerial2_align_161343;Target=asmbl_1 1082 1234 +%2Casmbl_1 692 1081 +%2Casmbl_1 572 691 +%2Casmbl_1 1 290 +%2Casmbl_1 291 571 +%2Casmbl_1 1235 1723 +
Chr1 assembler-aerial2_pasa match_part 3624 3913 . + . ID=aerial2_align_161343-1;Parent=aerial2_align_161343
Chr1 assembler-aerial2_pasa match_part 3996 4276 . + . ID=aerial2_align_161343-2;Parent=aerial2_align_161343

EST evidence file:
Chr1 est2genome expressed_sequence_match 5470 5899 2150 - . ID=Chr1:hit:213:3.2.0.0;Name=gi|19829901|gb|AV795918|RAFL08-19-M04
Chr1 est2genome match_part 5470 5899 2150 - . ID=Chr1:hsp:500:3.2.0.0;Parent=Chr1:hit:213:3.2.0.0;Target=gi|19829901|gb|AV795918|RAFL08-19-M04 2 431 +;Gap=M430

Protein evidence file:
Chr1 protein2genome protein_match 3760 5284 727 + . ID=Chr1:hit:202:3.10.0.0;Name=UniRef90_M4EWW1
Chr1 protein2genome match_part 3760 3913 727 + . ID=Chr1:hsp:488:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 1 50;Gap=M31 D1 M19 F1
Chr1 protein2genome match_part 3996 4276 727 + . ID=Chr1:hsp:489:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 51 144;Gap=R1 M23 D1 M28 D1 M36 I2 M5
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Mon Aug 31 09:08:34 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Aug 2015 09:08:34 -0600
Subject: [maker-devel] AED scores from MAKER pipeline - deterministic or not?
In-Reply-To:
References:
Message-ID: <81B272E9-3439-49C6-96E0-674B03F87569@gmail.com>

I would have to see the actual GFF3 files (the full files, including the FASTA at the end). Give me both GFF3 files and the coordinates of the gene in question.

My first guess is that you had the single_exon= filter set to different values on each run. The gene in question is an unspliced single-exon gene (based on the QI), your primary piece of evidence appears to be a single-exon EST, and the only value that changes in the QI is the exon overlap. Single-exon evidence will be ignored by default for the AED calculation unless you have single_exon set to 1.

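To see which QI value differs between the two runs quoted in this thread, the attribute strings can be split and compared field by field. The sketch below is illustrative only: the field labels are names chosen here, based on my reading of how MAKER describes the nine QI positions, so they are an assumption and should be checked against the MAKER documentation. The function names are hypothetical as well.

```python
from typing import Dict, Tuple

# Assumed labels for the nine pipe-separated QI positions (not MAKER identifiers).
QI_FIELDS = (
    "five_prime_utr_length",
    "frac_splice_sites_confirmed_by_est",
    "frac_exons_overlapping_est",
    "frac_exons_overlapping_est_or_protein",
    "frac_splice_sites_confirmed_by_ab_initio",
    "frac_exons_overlapping_ab_initio",
    "exon_count",
    "three_prime_utr_length",
    "protein_length",
)


def parse_qi(attributes: str) -> Dict[str, str]:
    """Extract the _QI attribute from a GFF3 attribute string as a labelled dict."""
    for item in attributes.split(";"):
        key, _, value = item.partition("=")
        if key.strip() == "_QI":
            values = value.strip().split("|")
            if len(values) != len(QI_FIELDS):
                raise ValueError(f"expected {len(QI_FIELDS)} QI fields, got {len(values)}")
            return dict(zip(QI_FIELDS, values))
    raise ValueError("no _QI attribute found")


def qi_differences(attrs_a: str, attrs_b: str) -> Dict[str, Tuple[str, str]]:
    """Return the QI fields whose values differ between two attribute strings."""
    a, b = parse_qi(attrs_a), parse_qi(attrs_b)
    return {name: (a[name], b[name]) for name in QI_FIELDS if a[name] != b[name]}


run1 = "_AED=0.00;_eAED=-0.00;_QI=0|-1|0|1|-1|0|1|0|344"
run2 = "_AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|0|344"
print(qi_differences(run1, run2))
```

Applied to the two strings from the thread, only the fourth position differs (1 in run 1, 0 in run 2), while the seventh position is 1 in both, which matches the reading that this is a single-exon model whose evidence overlap was counted in one run and not the other.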
Thanks,
Carson

> On Aug 31, 2015, at 8:47 AM, Cheng, Chia-Yi wrote:
>
> Run 1: _AED=0.00;_eAED=-0.00;_QI=0|-1|0|1|-1|0|1|0|344
> Run 2: _AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|0|344
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
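To collect the gene models whose AED changed between two runs, so that their coordinates and GFF3 records can be sent along as requested, something like the following could be used. It is a minimal sketch with assumptions: it expects tab-delimited MAKER GFF3 output in which mRNA features carry `ID` and `_AED` attributes, it stops reading at the `##FASTA` section, and the function names and tolerance value are hypothetical choices made here.

```python
from typing import Dict, Iterable, Tuple


def aed_by_mrna(gff_lines: Iterable[str]) -> Dict[str, float]:
    """Map mRNA ID to AED for every mRNA feature that carries an _AED attribute."""
    scores: Dict[str, float] = {}
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        if "ID" in attrs and "_AED" in attrs:
            scores[attrs["ID"]] = float(attrs["_AED"])
    return scores


def changed_aed(
    run1_lines: Iterable[str],
    run2_lines: Iterable[str],
    tolerance: float = 0.005,
) -> Dict[str, Tuple[float, float]]:
    """Return mRNA IDs present in both runs whose AED differs by more than tolerance."""
    a, b = aed_by_mrna(run1_lines), aed_by_mrna(run2_lines)
    return {
        mrna: (a[mrna], b[mrna])
        for mrna in sorted(a.keys() & b.keys())
        if abs(a[mrna] - b[mrna]) > tolerance
    }
```

With real output, each argument would be an open file handle for one run's GFF3 file; the tolerance only guards against rounding noise in the two-decimal AED values.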