From robert.king at rothamsted.ac.uk Thu Oct 6 06:30:49 2016
From: robert.king at rothamsted.ac.uk (Robert King)
Date: Thu, 6 Oct 2016 11:30:49 +0000
Subject: [maker-devel] ATG strict start codon usage query
Message-ID:

Hi,

I'm using the latest version of Maker2, but when I run it I get CTG and TTG as start codons, which I don't want. From reading earlier threads, the BioPerl CodonTable.pm has been changed to allow a strict setting so that only ATG is used. My question is how to invoke this functionality. I've looked in the maker ctrl files and the maker command-line options but don't see how to make it use only ATG as the start codon. Can you please advise?

Best wishes
Rob

From carsonhh at gmail.com Thu Oct 6 11:08:00 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 6 Oct 2016 10:08:00 -0600
Subject: [maker-devel] ATG strict start codon usage query
In-Reply-To:
References:
Message-ID: <786A1E40-6261-43C8-AA84-4AD0EF45BC9F@gmail.com>

Make sure you are using the latest maker version (2.31.8 - since about 2014).
Make sure you are not using GFF3 files as input to MAKER (otherwise you will use whatever codon is in the GFF3).
Make sure your BioPerl is up to date (CPAN version, not the BioPerl live version).

With respect to behavior, MAKER by default will keep whatever start codon was used by the ab initio predictor, and will only search for a different one if you set always_complete=1.

--Carson

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From mohamed.amine.chebbi at univ-poitiers.fr Mon Oct 10 04:43:21 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Mon, 10 Oct 2016 11:43:21 +0200
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
Message-ID: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>

Hi!

I'm using the latest version of Maker2 to annotate an arthropod genome. First, I ran RepeatModeler to create an rmlib for Maker, then I followed two independent annotation strategies on the same assembly:
1- Passing through Maker all the repeats collected by RepeatModeler (identified repeats in Repbase + unknown models).
2- Passing through Maker only the identified repeats.

Both annotations ran successfully. The first annotation gives me 19048 genes against 22931 from the second one. Now I'm looking for a way to merge the two annotation gff files without doing a re-annotation, taking the best-supported, non-redundant gene models.

So, do you think that configuring the maker options as below could resolve this issue:

maker_gff=1-mask-all.gff,2-mask-onlyKnown.gff #MAKER derived GFF3 file
est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=1 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no

--
Mohamed Amine CHEBBI, PhD Student
Université
de Poitiers

From carsonhh at gmail.com Tue Oct 11 15:05:50 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 11 Oct 2016 14:05:50 -0600
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
Message-ID:

Masking doesn't just affect the gene models, but also evidence alignment and thus scoring. So merging in this way would not make much sense, as the second, less masked set would always score better because it has more evidence alignments permitted by the lack of masking (not necessarily real, but drawn in by repeats). The result would be that any attempt at a merge would almost exclusively select genes from the second set, because they would always score higher.

--Carson

From aravindp at imcb.a-star.edu.sg Mon Oct 17 01:45:59 2016
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Mon, 17 Oct 2016 06:45:59 +0000
Subject: [maker-devel] Maker MPI installation error and IO error for serial version
Message-ID:

Hi,

I'm trying to install Maker in my cluster account. I have installed all the dependencies, but there are two issues for which I would like to get a solution. I searched the forums but found nothing that helped.

1. MPI installation
"flock: Function not implemented error" at src/lib/Parallel/Application/MPI.pm line 256.

./Build install

Configuring MAKER with MPI support
flock: Function not implemented
 at /scratch/tools/maker_mpi/src/lib/Parallel/Application/MPI.pm line 256.
 Parallel::Application::MPI::_bind("/app/openmpi/1.10.3/intel_java/bin/mpicc", "/app/openmpi/1.10.3/intel_java/include", "blib", "") called at /scratch/users/astar/imcb/aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 277
 MAKER::Build::ACTION_build(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010
 Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "build") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1993
 Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0), "build") called at /aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 469
 MAKER::Build::ACTION_install(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010
 Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "install") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1998
 Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0)) called at ./Build line 69

2. When I run a serial version of Maker, I get the following errors in the "makerlog.e" file.

DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 109.
DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 111.
DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 113.
DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 191.
DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 390.

Please help me with these errors as early as possible. I have double-checked all the dependencies and the file paths given while running Maker.
Awaiting your reply!

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http:/www.imcb.a-star.edu.sg/

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.

From mark.ebbert at gmail.com Thu Oct 13 16:57:50 2016
From: mark.ebbert at gmail.com (Mark Ebbert)
Date: Thu, 13 Oct 2016 14:57:50 -0700
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
Message-ID: <57fffd715f83340001fcf47d@polymail.io>

Hi,

I've been working with maker off and on for several months with varying success. It worked great the first time I ran it, but ever since, it fails every run without any specific errors. It just says that one of the processes failed. I've been limping along by running the following command to remove any locks and then restarting:

find . -name *.NFSLock* -exec rm {} \;

This has been working, but for some reason maker started over from the beginning and lost all of the previous work! I don't even know where to start investigating. Should I nuke the whole maker directory structure and start from scratch? Maybe something got corrupted?

I had already deleted the log files before I realized maker started over, because the log files get way too big.

I really appreciate your help!

Mark T. W. Ebbert
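
The "flock: Function not implemented" error and the leftover *.NFSLock* files reported above both come down to file locking on a shared filesystem. A minimal sketch, assuming a Unix system with Python 3, for probing whether flock() can be taken in a given directory before installing or running there. The function name flock_supported is made up for illustration and is not part of MAKER; a passing probe only shows that the flock system call is available, not that MAKER's own File::NFSLock or SQLite locking will behave on that mount.

```python
import fcntl
import os
import tempfile


def flock_supported(directory: str) -> bool:
    """Return True if an exclusive flock() succeeds on a temp file in `directory`.

    Hypothetical helper for illustration only; it is not MAKER code.
    """
    # Create a scratch file on the filesystem we want to test.
    fd, path = tempfile.mkstemp(prefix="flock_probe_", dir=directory)
    try:
        try:
            # Non-blocking exclusive lock; a filesystem without flock support
            # typically raises OSError here (e.g. "Function not implemented").
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        # Always clean up the probe file, whatever the outcome.
        os.close(fd)
        os.remove(path)
```

Calling flock_supported("/scratch") (or whichever directory holds the MAKER working files) before a run gives a quick yes/no to pass on to the cluster administrator.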

From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 12 04:44:48 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (chebbi mohamed amine)
Date: Wed, 12 Oct 2016 11:44:48 +0200 (CEST)
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
Message-ID: <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>

Thank you Carson for your quick response.

Sorry, I have another question concerning Augustus training. You previously posted on the mailing list a link to an explanation of the Augustus training steps: http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html. Unfortunately the link doesn't work anymore. Could you instead explain how to filter the gff file produced by the first run of Maker to get the best full-length ORFs as a set of gene models to train Augustus?

Best,
Amine

From carsonhh at gmail.com Mon Oct 17 13:17:17 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 12:17:17 -0600
Subject: [maker-devel] Maker MPI installation error and IO error for serial version
In-Reply-To:
References:
Message-ID:

It's saying your system has no flock (file locking). For NFS mounts this is usually a configuration set by the administrator. At the very least they can enable lock emulation in NFS, which is what your scratch seems to be. Unfortunately SQLite will not work without this.

You can still get MAKER to install with MPI by removing the lock used during setup (do this by editing line 210 of .../maker/src/lib/Parallel/Application/MPI.pm).

Turn this -->
$lock = new File::NFSLock("$loc/_MPI", 'EX', 300, 40) while(!$lock);

To this (i.e. comment out line 210) -->
#$lock = new File::NFSLock("$loc/_MPI", 'EX', 300, 40) while(!$lock);

However, there is no workaround for the SQLite IO error. It requires that your administrator enable locks or lock emulation (for example, setting nolock,local_lock=all will cause the system to emulate locks on NFS locally). So while they are not exactly real locks, they won't fail.

Thanks,
Carson

From carson.holt at genetics.utah.edu Mon Oct 17 13:25:54 2016
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Mon, 17 Oct 2016 18:25:54 +0000
Subject: [maker-devel] question about Maker2
In-Reply-To:
References: <56F4066F.4000803@fgcz.ethz.ch>
 <01AB4222AE1B7E41A3B5CAEC445F192B3F71EB84@MBX115.d.ethz.ch>
 <3470AFC0-7B3A-485C-A86E-C7DE5A341C3C@genetics.utah.edu>
 <57270F57.50208@fgcz.ethz.ch>
 <5A09C696-CBD0-4DA9-8CB6-B994981E00D3@genetics.utah.edu>
 <01AB4222AE1B7E41A3B5CAEC445F192B3F747251@MBX115.d.ethz.ch>
 <89F7DE68-6FFF-4E17-B867-8E699D3DE986@genetics.utah.edu>
 <01AB4222AE1B7E41A3B5CAEC445F192B3F752945@MBX215.d.ethz.ch>
 <1DB8E975-3E54-455D-8852-2DD2937B2FCF@genetics.utah.edu>
Message-ID: <8D22D8B2-73DC-4276-8B2D-BDEF8ECDFBE7@genetics.utah.edu>

> what is the difference between files
>
> 1) ContigXXX.maker.non_overlapping_ab_initio.proteins.fasta

Non-redundant, non-overlapping models (i.e. the subset of snap/augustus models that do not overlap a final MAKER-selected model).

> and
>
> 2) ContigXXX.maker.augustus_masked.proteins.fasta

Contains all raw augustus models called without hints (i.e. the equivalent of just running Augustus on its own).

> None of these should have EST info (as the sequence headers are
>
> 1) augustus_masked-1-processed-gene-

This was a raw augustus model that may or may not have had UTR added using EST info (i.e. the model came straight from Augustus, so no hints were used to produce it, but MAKER did try to add UTR).

> and
>
> 2) augustus_masked-1-abinit-gene-

Model straight from Augustus. No hints, and no MAKER attempt to add UTR. These are raw unmodified models and will never be in the final selected set.

> so no "maker-XXX")

maker-XXX means it was a hint-derived model and not a raw Augustus model.

> Should file 2 just be ignored and 1) be kept alongside the maker file, where EST/protein evidence is incorporated?

Ignore all the abinit files. They are for reference purposes only. The non-overlapping file can be used to see what was rejected but does not overlap a current model (i.e. you may be able to find a handful of false negatives that can be rescued with domain analysis using something like InterProScan).

--Carson

> Thanks,
>
> G
>
> On 5/18/16 11:31 AM, Carson Holt wrote:
>> Hi Giancarlo,
>>
>> There was no image attached. If you can, just send me the contig GFF3, and I can look at it in apollo (which lets me manipulate reading frame and display splice sites). Then I can tell you more. Basically the gene models are the result of an HMM for gene patterns plus hints to alter probability around evidence-suggested sites. If there is any issue with the reading frame (it can be a single bp assembly error) then no amount of hints can force a broken CDS to be coding, and the predictor will do the best it can to still produce a workable model (i.e. truncate exons, skip exons, etc). Also, if your mRNA-seq is not aligned correctly around a canonical splice site (i.e. overhang beyond the splice acceptor) then that hint may be ignored.
>>
>> --Carson
>>
>>
>>> On May 17, 2016, at 4:50 AM, Russo Giancarlo wrote:
>>>
>>> Hi Carson, thanks again for all your answers.
>>> A (hopefully) final question: in the image attached you can see an IGV sashimi plot of RNA-seq data, with the annotated gene derived from Maker; what could be the reason that the two bits on the sides (UTRs?), which show high coverage from the RNA-seq data and plenty of splice junctions with the neighbouring exons, are completely missing from the gene model?
>>>
>>> In this run I used a closely related species from the augustus database for gene prediction, RNA-seq-based de novo assembled transcripts as EST evidence, and protein sequences from the same closely related species. I masked using a customized library built following the guidelines in the tutorial.
>>>
>>> Thanks,
>>> Giancarlo
>>>
>>> Giancarlo Russo, Ph.D.
>>> Functional Genomics Center Zurich
>>> ETH Zurich / University of Zurich
>>> Winterthurerstrasse 190 / Y32 H66
>>> CH-8057 Zurich
>>>
>>> Phone: +41 44 635 3964
>>> Fax: +41 44 635 3922
>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>> http://www.fgcz.ch
>>> ________________________________________
>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>> Sent: 09 May 2016 18:02
>>> To: Russo Giancarlo
>>> Subject: Re: question about Maker2
>>>
>>> For training gene predictors with protein and EST --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>>>
>>> If reusing MAKER results I don't recommend GFF3 passthrough. The GFF3 option is for getting non-MAKER-sourced results into MAKER. You will actually lose some functionality by passing in MAKER-sourced results as GFF3 (MAKER can't do things with GFF3 that it can do with self-generated data).
>>>
>>> It is best to just rerun MAKER in the same directory; it will reuse previous reports it finds in the datastore.
>>>
>>> --Carson
>>>
>>>
>>>
>>>> On May 3, 2016, at 2:08 AM, Russo Giancarlo wrote:
>>>>
>>>> OK, thanks a lot, now it is clear.
>>>>
>>>> About the passthrough procedure, would you have any particular advice on what would be the best strategy to run it?
>>>> I have tried an existing organism in Augustus but the results were not too good.
>>>>
>>>> I have both EST and protein evidence, so I thought I could use EST to infer ab-initio and produce a first annotation and then run a second pass using the first gff maker file as ab-initio.
>>>>
>>>> Any advice would be appreciated.
>>>>
>>>> Best and thanks again.
>>>> Giancarlo
>>>>
>>>> Giancarlo Russo, Ph.D.
>>>> Functional Genomics Center Zurich
>>>> ETH Zurich / University of Zurich
>>>> Winterthurerstrasse 190 / Y32 H66
>>>> CH-8057 Zurich
>>>>
>>>> Phone: +41 44 635 3964
>>>> Fax: +41 44 635 3922
>>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>>> http://www.fgcz.ch
>>>> ________________________________________
>>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>>> Sent: 02 May 2016 18:16
>>>> To: Russo Giancarlo
>>>> Subject: Re: question about Maker2
>>>>
>>>> As part of the MAKER job, it runs Snap and Augustus on their own before aligning evidence and generating hints for the later run. The Contig2.maker.augustus.transcripts.fasta entries are just the results of that uninformed Augustus run. They are not the final gene models; they are just the raw uninformed Augustus models. They are there for reference purposes only. They are what you would have gotten by just running Augustus directly on the assembly without any additional input (i.e. what Augustus would have produced on its own outside of MAKER).
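
The header conventions discussed in this thread can be told apart mechanically. A minimal sketch, assuming IDs shaped like the two examples quoted in the thread (augustus_masked-scaffold11899-abinit-gene-0.6 and maker-scaffold11899-augustus-gene-0.6). The regular expression, the function name classify_model, and the returned labels are illustrative guesses based only on those examples and on the descriptions given in the replies; none of this is MAKER's own code.

```python
import re

# Pattern inferred from the example IDs quoted in this thread (an assumption):
# <top level>-<sequence id>-<internal source>-gene-<chunk>.<iterator>
ID_PATTERN = re.compile(
    r"^(?P<top>[^-]+)-(?P<seqid>.+)-(?P<source>[^-]+)-gene-"
    r"(?P<chunk>\d+)\.(?P<iterator>\d+)"
)


def classify_model(model_id: str) -> str:
    """Label a gene/transcript ID according to the naming described in the thread."""
    match = ID_PATTERN.match(model_id)
    if match is None:
        return "unrecognized"
    if match.group("top") == "maker":
        # "maker-XXX means it was a hint-derived model"
        return "hint-derived MAKER model"
    if match.group("source") == "abinit":
        # Raw predictor output, kept for reference only
        return "raw ab initio model"
    if match.group("source") == "processed":
        # Raw predictor model on which MAKER tried to add UTR
        return "ab initio model, UTR attempted"
    return "unrecognized"
```

Run over the headers of a transcripts or proteins FASTA file, a function like this gives a quick tally of how many sequences are final hint-derived models and how many are reference-only ab initio calls.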
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>>> On May 2, 2016, at 2:27 AM, giancarlo.russo wrote:
>>>>>
>>>>> Hi Carson,
>>>>> sorry to bother you again, I still don't understand the difference between
>>>>>
>>>>> 1) Contig2.maker.augustus.transcripts.fasta
>>>>> and
>>>>> 2) Contig2.maker.transcripts.fasta
>>>>>
>>>>> If 1) contains the transcripts "Produced by maker sending hints to
>>>>> augustus to modify scoring against the HMM",
>>>>> and these hints are derived from EST/protein evidence, what extra
>>>>> information is used/extra steps are performed to produce 2)?
>>>>>
>>>>> Also, how is a passthrough using a first-pass, maker-produced gff
>>>>> annotation file best done?
>>>>> Should this gff file be used for ab-initio gene models that are then
>>>>> corrected by EST and protein evidence?
>>>>> Does it make sense to use augustus when a first-pass gff file is
>>>>> available? Do these two options (ab-initio based on first-pass gff and
>>>>> augustus switched on) exclude each other?
>>>>>
>>>>> Thanks again for your time and help.
>>>>>
>>>>> Best,
>>>>> G
>>>>> On 29/03/16 17:42, Carson Holt wrote:
>>>>>> Yes. The ESTs generate hints as to both intron location and exon location. The protein alignments generate CDS location hints. Each algorithm has different ways to feed in hints, with Augustus being the most advanced. It allows separate bonuses for partial vs exact matches, and you can optionally link hints so they have to be matched as a group. It also offers many other hint types, like splice donor and acceptor hints. However, we really only use the intron, exon, and CDS hints. We also use the partial match bonus.
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>>> On Mar 29, 2016, at 7:50 AM, Russo Giancarlo wrote:
>>>>>>>
>>>>>>> Hi Carson, thanks a lot for your answer.
>>>>>>>
>>>>>>> So let's see if I get it correctly.
>>>>>>> In the final datastore I have the fasta files named
>>>>>>>
>>>>>>> 1)Contig2.maker.augustus.transcripts.fasta
>>>>>>> 2)Contig2.maker.non_overlapping_ab_initio.transcripts.fasta
>>>>>>> 3)Contig2.maker.transcripts.fasta
>>>>>>>
>>>>>>> 1) contains the transcripts "Produced by maker sending hints to augustus to modify scoring against the HMM"
>>>>>>> 2) contains the transcripts predicted only by the ab initio algorithm (e.g. augustus)
>>>>>>> 3) contains the transcripts with a full gene model based on ab initio + EST and/or PROTEIN
>>>>>>>
>>>>>>> However, what "hints" are sent by maker to augustus? If these are EST/PROTEIN hints, then what is the difference between 1) and 3)?
>>>>>>>
>>>>>>> Thanks again for your help and sorry for bothering.
>>>>>>>
>>>>>>> Best,
>>>>>>> Giancarlo
>>>>>>>
>>>>>>> Giancarlo Russo, Ph.D.
>>>>>>> Functional Genomics Center Zurich
>>>>>>> ETH Zurich / University of Zurich
>>>>>>> Winterthurerstrasse 190 / Y32 H66
>>>>>>> CH-8057 Zurich
>>>>>>>
>>>>>>> Phone: +41 44 635 3964
>>>>>>> Fax: +41 44 635 3922
>>>>>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>>>>>> http://www.fgcz.ch
>>>>>>> ________________________________________
>>>>>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>>>>>> Sent: 24 March 2016 21:56
>>>>>>> To: maker-devel
>>>>>>> Cc: Russo Giancarlo; Mark Yandell
>>>>>>> Subject: Re: question about Maker2
>>>>>>>
>>>>>>> Hi Giancarlo,
>>>>>>>
>>>>>>> Anything listed as something like maker-*-augustus was a result of MAKER sending hints to augustus, and anything like augustus-*-abinit was the result of augustus run directly from the HMM without hints.
>>>>>>>
>>>>>>> Here is more detail on the format -->
>>>>>>> - - -gene- -
>>>>>>>
>>>>>>> Top level possibilities:
>>>>>>> maker #maker generated model
>>>>>>> snap_masked #snap run on masked sequence
>>>>>>> augustus_masked #augustus run on masked sequence
>>>>>>> etc.
>>>>>>>
>>>>>>> Internal source:
>>>>>>> abinit #ab initio model direct from HMM
>>>>>>> snap #hints provided to SNAP (alters scoring)
>>>>>>> augustus #hints provided to augustus (alters scoring)
>>>>>>>
>>>>>>> Then chunk and iterator are just there to generate a unique ID.
>>>>>>>
>>>>>>>
>>>>>>> Example:
>>>>>>> augustus_masked-scaffold11899-abinit-gene-0.6 #Produced by Augustus on masked sequence using raw HMM (no MAKER intervention).
>>>>>>> maker-scaffold11899-augustus-gene-0.6 #Produced by maker sending hints to augustus to modify scoring against the HMM
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 3/24/16, 9:23 AM, "giancarlo.russo"
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Dear Mike,
>>>>>>>>>
>>>>>>>>> first of all thanks for taking care of and sharing Maker; as part of the
>>>>>>>>> community I appreciate it.
>>>>>>>>>
>>>>>>>>> I have a question about the nomenclature of the annotation in the output
>>>>>>>>> file:
>>>>>>>>> what is the difference between genes named
>>>>>>>>>
>>>>>>>>> maker-Contig-XXX
>>>>>>>>> and those named
>>>>>>>>> augustus-Contig-XXX-processed genes
>>>>>>>>> ?
>>>>>>>>>
>>>>>>>>> Please find attached the maker_opts file I have used for my annotation.
>>>>>>>>> I was under the impression that the ab-initio related prefixes would be
>>>>>>>>> present only in the genes which are not marked as "maker" in column 3 of
>>>>>>>>> the gff file (i.e., those
>>>>>>>>> with both ab-initio and EST evidence)
>>>>>>>>>
>>>>>>>>> Is there something I am missing?
>>>>>>>>>
>>>>>>>>> Thanks a lot in advance,
>>>>>>>>> Giancarlo
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Giancarlo Russo, Ph.D.
>>>>>>>>> Functional Genomics Center Zurich
>>>>>>>>> Y32 H66
>>>>>>>>> Winterthurerstr. 190
>>>>>>>>> 8057 Zurich
>>>>>>>>> SWITZERLAND
>>>>>>>>> Phone: +41 44 635 39 64
>>>>>>>>> Fax: +41 44 635 39 22
>>>>>>>>> E-Mail: giancarlo.russo at fgcz.ethz.ch
>>>>>>>>>
>>>>>>>>
>>>>> --
>>>>> Giancarlo Russo, Ph.D.
>>>>> Functional Genomics Center Zurich
>>>>> Y32 H66
>>>>> Winterthurerstr. 190
>>>>> 8057 Zurich
>>>>> SWITZERLAND
>>>>> Phone: +41 44 635 39 64
>>>>> Fax: +41 44 635 39 22
>>>>> E-Mail: giancarlo.russo at fgcz.ethz.ch
>>>>>
>
> --
> Giancarlo Russo, Ph.D.
> Functional Genomics Center Zurich
> Winterthurerstrasse 190
> 8057 Zurich (CH)
> Phone: +41 044 635 3964
> Fax: +41 044 635 3922
>

From carsonhh at gmail.com Mon Oct 17 13:35:52 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 12:35:52 -0600
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <57fffd715f83340001fcf47d@polymail.io>
References: <57fffd715f83340001fcf47d@polymail.io>
Message-ID: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>

If you made a change that affects downstream steps, MAKER erases the affected intermediate files and recalculates. It's possible that you erased required checkpoint files, so MAKER thinks a change has been made that requires some things to be rerun.

Also, if the STDERR output is too big, set -quiet or -qq (really quiet) on the command line. In general the error you see at the end is not the cause. The real error is further back in the log. MAKER tries to recover/retry, so the final failure you see is basically MAKER saying, "I give up." But the original cause is further back in the log, often behind the output of other MAKER threads that are writing to the log simultaneously. If you have 100 CPUs writing to the same output log, you may bury the real error behind the output of other threads (the log is not truly linear), so you have to look further back.

If you use the beta, you can also specify -nolock, but be warned that the locks themselves are important to avoid file corruption (e.g. if you accidentally launch MAKER twice).

--Carson

From annabel.beichman at gmail.com Mon Oct 17 14:20:37 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Mon, 17 Oct 2016 12:20:37 -0700
Subject: [maker-devel] Too many genes?
Message-ID:

Hi Carson et al.,

Thanks so much for such a great pipeline, tutorials and advice pages.

I have just finished four rounds of annotation in Maker on the sea otter genome, which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding.

Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat.

Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.

Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb.

I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after Round II, 25847 after Round III, and 25031 after Round IV).

However, this final gene count (25,031) still seems too high to me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes, 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes.

22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter's closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00; when I visualize them in JBrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold).

I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have BLAST hits to SwissProt? Would you recommend any additional filters on gene length, etc.?
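
The AED/eAED cutoff mentioned in this message can be applied mechanically to a MAKER GFF3 file. What follows is only a minimal sketch, assuming the mRNA features carry `_AED` and `_eAED` attributes in column 9 (as MAKER output normally does); the function names and the example records are hypothetical and are not taken from the thread.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def filter_mrna_by_aed(gff_lines, max_aed=0.5, max_eaed=0.5):
    """Yield IDs of mRNA features whose _AED and _eAED are both below the cutoffs.

    Assumption: scores live in the _AED/_eAED attributes of mRNA rows.
    Rows without parsable scores are skipped rather than guessed at.
    """
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature rows
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        try:
            aed = float(attrs["_AED"])
            eaed = float(attrs["_eAED"])
        except (KeyError, ValueError):
            continue
        if aed < max_aed and eaed < max_eaed:
            yield attrs.get("ID", "")


# Made-up records, for illustration only.
example = [
    "##gff-version 3\n",
    "scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1;Name=g1\n",
    "scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.08;_eAED=0.17\n",
    "scaf1\tmaker\tmRNA\t2000\t2900\t.\t+\t.\tID=g2-RA;Parent=g2;_AED=0.62;_eAED=0.40\n",
]
print(list(filter_mrna_by_aed(example)))
```

An AED filter on its own only says how well a model agrees with its evidence; it does not distinguish real genes from well-supported repeats or fragments, so it is probably best combined with the other checks discussed in the replies.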

Not sure if this is significant, but one thing I've noticed is that many of the genes with BLAST hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold:

ScbS9RH_101185  30796   38760   +  ELUT_00004195-RA  ELUT_00004195  Name=ELUT_00004195-RA  0.08  0.17  Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185  42617   51087   +  ELUT_00004196-RA  ELUT_00004196  Name=ELUT_00004196-RA  0.25  0.26  Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185  87006   87827   +  ELUT_00004198-RA  ELUT_00004198  Name=ELUT_00004198-RA  0.18  0.18  Similar to Ano3: Anoctamin-3 (Mus musculus)
ScbS9RH_101185  110043  122523  +  ELUT_00004199-RA  ELUT_00004199  Name=ELUT_00004199-RA  0.09  0.09  Similar to ANO3: Anoctamin-3 (Homo sapiens)

Thank you all so much for your help and advice!

[I also want to report an odd behavior that may be specific to our server: when the number of scaffolds being annotated using maker drops below the number of cores (e.g. using openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno's paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]

Best wishes,
Annabel Beichman
Wayne Lab/Lohmueller Lab
Ecology & Evolutionary Biology
UCLA
Annabelbeichman.com

From carsonhh at gmail.com Mon Oct 17 15:11:52 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 14:11:52 -0600
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <58052fc8a2cc1400014626fe@polymail.io>
References: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com> <58052fc8a2cc1400014626fe@polymail.io>
Message-ID:

MAKER should automatically try and salvage things on restart (that is the purpose of the checkpoint files).
You can set clean_try=1 if you want. It will then delete failed contigs before retrying on any failure.

--Carson

From mark.ebbert at gmail.com Mon Oct 17 15:09:52 2016
From: mark.ebbert at gmail.com (Mark Ebbert)
Date: Mon, 17 Oct 2016 13:09:52 -0700
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>
References: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>
Message-ID: <58052fc8a2cc1400014626fe@polymail.io>

Thanks Carson,

I've been restarting it using the same commands several times in a row. Unless that "find" command has the potential to modify any important files, then I don't think I modified anything. All I ran was:

"find . -name *.NFSLock* -exec rm {} \;"
"sbatch maker.slurm"

I'm inclined to nuke it all and start over. Is it possible to salvage previous work, or is it all gone?

Mark T. W. Ebbert
Please note my new email address: mark.ebbert at gmail.com

From carsonhh at gmail.com Mon Oct 17 15:25:32 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 14:25:32 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To:
References:
Message-ID: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>

Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another - SNAP tends to have issues with mammalian genomes for example). Also when looking at the contig, if you see contig consisting of only single exon genes then you may have some prokaryotic contamination (they assemble as independent gene dense contigs - so a good thing to look at if gene counts are high). Finally high gene counts can mean that repeats are still under masked (repeats encode real proteins like transposases).

You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain).
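
One of the checks suggested above, looking for contigs that consist only of single-exon genes, can be roughed out from the GFF3 before opening a browser. The following is a minimal sketch, not part of MAKER: it assumes standard GFF3 rows where mRNA features carry an `ID` and exon features carry a `Parent` pointing at that ID, it counts transcripts rather than genes, and the function names and example records are hypothetical.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def single_exon_fraction(gff_lines):
    """Return {contig: (n_transcripts, fraction_with_exactly_one_exon)}."""
    contig_of = {}                 # transcript ID -> contig name
    exon_count = defaultdict(int)  # transcript ID -> number of exon rows
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature rows
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] == "mRNA" and "ID" in attrs:
            contig_of[attrs["ID"]] = cols[0]
        elif cols[2] == "exon":
            # An exon may list several parents, separated by commas.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    exon_count[parent] += 1
    totals = defaultdict(int)
    singles = defaultdict(int)
    for transcript, contig in contig_of.items():
        totals[contig] += 1
        if exon_count[transcript] == 1:
            singles[contig] += 1
    return {c: (totals[c], singles[c] / totals[c]) for c in totals}


# Made-up records, for illustration only.
example = [
    "ctgA\tmaker\tmRNA\t1\t500\t.\t+\t.\tID=t1;Parent=g1\n",
    "ctgA\tmaker\texon\t1\t500\t.\t+\t.\tID=t1:1;Parent=t1\n",
    "ctgA\tmaker\tmRNA\t800\t1500\t.\t-\t.\tID=t2;Parent=g2\n",
    "ctgA\tmaker\texon\t800\t1500\t.\t-\t.\tID=t2:1;Parent=t2\n",
    "ctgB\tmaker\tmRNA\t1\t9000\t.\t+\t.\tID=t3;Parent=g3\n",
    "ctgB\tmaker\texon\t1\t200\t.\t+\t.\tID=t3:1;Parent=t3\n",
    "ctgB\tmaker\texon\t4000\t4300\t.\t+\t.\tID=t3:2;Parent=t3\n",
    "ctgB\tmaker\texon\t8800\t9000\t.\t+\t.\tID=t3:3;Parent=t3\n",
]
for contig, (n, frac) in sorted(single_exon_fraction(example).items()):
    print(contig, n, frac)
```

Contigs with many transcripts and a fraction near 1.0 would be the ones to inspect first; a high fraction alone is only a hint, since short contigs and fragmented genes can produce the same pattern.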

--Carson

From annabel.beichman at gmail.com Mon Oct 17 18:13:07 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Mon, 17 Oct 2016 16:13:07 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>
Message-ID: <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com>

Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP as it definitely doesn't show great concordance compared to Augustus.

Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in repeatmodeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta

My interproscan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I'm on the right track.

Thanks so much again,

~ Annabel

From carsonhh at gmail.com Mon Oct 17 19:09:52 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 18:09:52 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com>
Message-ID: <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com>

It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want.

--Carson

From carsonhh at gmail.com Sun Oct 23 18:25:34 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 23 Oct 2016 17:25:34 -0600
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To: <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

It's unfortunate the archived GMOD post is gone, because I always used it for my own reference. If I remember right, the main point was that Jason Stajich wrote a tool to convert SNAP's ZFF format to a GenBank format suitable for Augustus training. This meant you could use the maker2zff script that comes with MAKER, then use Jason's tool to convert the result for Augustus training.

Tool to convert SNAP training ZFF to an Augustus training input file ->
https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Since the post is gone, you could use the documentation provided with his tool, and then maybe a generic Augustus training guide like the following, to design a path forward ->
http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html

--Carson

From xvazquezc at gmail.com Sun Oct 23 18:49:53 2016
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 24 Oct 2016 10:49:53 +1100
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

If it's of any help, I had these notes in my old protocol (before I started to do the training with BUSCO):

> For Augustus, we need the script "zff2augustus_gbk.pl". This will take the
> export.dna generated by fathom and generate a *.gb file that will be used
> as the "training gene structure file" in a new training submission in
> WebAugustus. Remember to give it a new name in the submission, e.g.
> MYGENOME_v2, or Maker won't see the difference (same name):
> perl PATH/TO/SCRIPT/zff2augustus_gbk.pl > MYGENOME.train.gb

As said, you could also do the training with BUSCO with the --long option.
It has a dataset specific to arthropods. But if you have EST data, you'll probably do better with the other method, as it lets you supply the ESTs for a more accurate training.

--
Xabier Vázquez-Campos, PhD
Research Associate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From jill711021 at gmail.com Sun Oct 23 22:32:38 2016
From: jill711021 at gmail.com (王一凡)
Date: Mon, 24 Oct 2016 11:32:38 +0800
Subject: [maker-devel] maker -error
Message-ID:

Dear sir,

I am trying to run GeneMark-ES and Maker to annotate a fungal genome. When I run gm_es.pl, the script terminates with the following error:

> Must input more than one data point! at /home/myname/Applications/GeneMarkES/parse_ET.pl line 213.
> Invalid regression data
> error on call: /home/myname/Applications/GeneMarkES/parse_ET.pl --section ET_C --cfg /home/myname/projectX/Maker/GeneMark/run.cfg --v

After searching and asking around, I still have no idea how to deal with it. Do you have any idea? Thank you for your time!

From carsonhh at gmail.com Mon Oct 24 17:41:04 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 24 Oct 2016 16:41:04 -0600
Subject: [maker-devel] maker -error
In-Reply-To:
References:
Message-ID: <65B4147C-B28C-40EB-9004-F93D821AF1C7@gmail.com>

That is a GeneMark internal error. I'd recommend running GeneMark by itself (outside of MAKER) on whatever contig it failed on. If the error reproduces, you can post the error and the test dataset to the GeneMark developers.

--Carson

From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 26 03:32:52 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (chebbi mohamed amine)
Date: Wed, 26 Oct 2016 10:32:52 +0200 (CEST)
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID: <1581157450.4030281.1477470772694.JavaMail.zimbra@univ-poitiers.fr>

Thank you very much for your help.

Best,
Mohamed

From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 26 08:09:33 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine Chebbi)
Date: Wed, 26 Oct 2016 15:09:33 +0200 (CEST)
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
Message-ID: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>

Hi!

I have tried three rounds of annotation in Maker on a non-model arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and Illumina reads. As suggested in the tutorial, in the first round I ran Maker with repeat masking to generate gene models using transcript (Trinity assembly) and protein (SwissProt) evidence.
Then the Maker models were used twice in a bootstrap fashion to retrain SNAP. The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.

However, the AED profile (attached) doesn't seem satisfactory. So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that filtering for good transcripts could improve the results? If yes, which criteria should be taken into account?

Thank you.

Best,
Amine
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AED-Graph.pdf
Type: application/pdf
Size: 5327 bytes
Desc: not available
URL:

From michael.s.campbell1 at gmail.com Wed Oct 26 13:00:08 2016
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Wed, 26 Oct 2016 14:00:08 -0400
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

Hi Amine,

I haven't seen that pattern in a CDF plot of AED before. Is there a possibility that the x and y axes are switched in the plot?

Thanks,
Mike

From carsonhh at gmail.com Wed Oct 26 13:04:20 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 26 Oct 2016 12:04:20 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
Message-ID: <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>

Your AED curve looks fine. The first run (using protein2genome or est2genome, I assume) will always have a really low overall AED because the models are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always artificially look good). The protein2genome or est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve; the value is in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further assess the models by adding protein domains using InterProScan: you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).

Your overall gene counts are a little high, though, for an arthropod (14,000-19,000 genes would be expected, as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa).
However, your gene counts can be explained by insufficient repeat masking (you can add a RepeatModeler-generated library to the existing settings to help with this), by poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with stricter assembly parameters, including the --jaccard_clip option in Trinity), or simply by assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split, which results in inflated gene counts).

Finally, manually look at the most gene-dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene-dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so it is easily blacklisted and can be identified visually (always gene dense and single exon).

Thanks,
Carson
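
The gene-density check described above (gene_density = gene_count / contig_length, then looking for contigs that are gene dense and overwhelmingly single exon) could be pre-screened with a short script before opening a browser. What follows is only a minimal Python sketch and not part of MAKER. The function name gene_density is hypothetical. It assumes the final models appear as gene, mRNA, and exon features in the GFF3, and it approximates each contig's length by the largest end coordinate seen on that contig.

```python
from collections import defaultdict


def gene_density(gff_lines):
    """Rank contigs by gene density.

    Returns a list of (contig, genes_per_kb, single_exon_fraction),
    most gene-dense contig first. All names here are illustrative.
    """
    length = {}                 # contig -> largest end coordinate seen
    genes = defaultdict(int)    # contig -> number of gene features
    exons = defaultdict(int)    # mRNA ID -> number of exon features
    mrna_contig = {}            # mRNA ID -> contig

    for line in gff_lines:
        if line.startswith("##FASTA"):
            break               # embedded sequence follows; stop parsing
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, ftype, end, attrs = cols[0], cols[2], int(cols[4]), cols[8]
        tags = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        length[seqid] = max(length.get(seqid, 0), end)
        if ftype == "gene":
            genes[seqid] += 1
        elif ftype == "mRNA" and "ID" in tags:
            mrna_contig[tags["ID"]] = seqid
        elif ftype == "exon":
            # An exon may list several parents, separated by commas.
            for parent in tags.get("Parent", "").split(","):
                exons[parent] += 1

    # Count transcripts and single-exon transcripts per contig.
    total = defaultdict(int)
    single = defaultdict(int)
    for mrna, contig in mrna_contig.items():
        total[contig] += 1
        if exons[mrna] == 1:
            single[contig] += 1

    rows = []
    for contig, n_genes in genes.items():
        per_kb = n_genes / (length[contig] / 1000)
        frac = single[contig] / total[contig] if total[contig] else 0.0
        rows.append((contig, per_kb, frac))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

Passing an open file handle for the GFF3 works, since the function only iterates over lines. Contigs near the top of the list with a single-exon fraction close to 1.0 would be the first candidates to inspect for contamination; the script does not decide that on its own.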

From carsonhh at gmail.com Wed Oct 26 13:06:36 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 26 Oct 2016 12:06:36 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>
Message-ID: <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>

Sorry. I also assumed X and Y were flipped when I looked at it. Now that I have read the labels, your AED curve would be weird unless the X and Y are flipped in your figure.

--Carson
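
To check a figure such as ">80% of models with AED under 0.5" directly, and to avoid any ambiguity about which axis is which, the cumulative fraction can be tabulated from the GFF3 itself. The following is a minimal Python sketch under stated assumptions: each mRNA feature is assumed to carry an _AED=<value> attribute in column 9 (MAKER-produced GFF3 normally does; change the tag argument if yours differs), and the function name is hypothetical. When this table is plotted, AED goes on the x axis and the cumulative fraction of transcripts on the y axis, so the curve can only rise from left to right.

```python
def aed_cumulative_fraction(gff_lines, step=0.05, tag="_AED"):
    """Tabulate the fraction of mRNAs with AED at or below each cutoff.

    Returns [(cutoff, fraction), ...] for cutoffs 0, step, 2*step, ..., 1.
    Returns an empty list if no mRNA carries the requested tag.
    """
    values = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break               # embedded sequence follows; stop parsing
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        for kv in cols[8].split(";"):
            key, _, val = kv.partition("=")
            if key == tag:
                values.append(float(val))

    if not values:
        return []

    n_steps = round(1 / step)
    table = []
    for i in range(n_steps + 1):
        cutoff = i * step
        # Small tolerance so that values equal to the cutoff are counted
        # despite floating-point rounding in i * step.
        below = sum(1 for v in values if v <= cutoff + 1e-9)
        table.append((cutoff, below / len(values)))
    return table
```

Setting tag="_eAED" would tabulate eAED in the same way, again assuming that attribute is present on the mRNA lines.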

From jason.stajich at gmail.com Wed Oct 26 20:26:26 2016
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 26 Oct 2016 18:26:26 -0700
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

Yes, thanks for re-sharing. Maybe we should write this up into a clearer tutorial. I go back and forth on how to make this easier and more automated.

Jason

--
Jason Stajich
jason.stajich at gmail.com

From michael.s.campbell1 at gmail.com Thu Oct 27 08:21:01 2016
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 27 Oct 2016 09:21:01 -0400
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To:
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
Message-ID: <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>

I think that if you train any further you will run the risk of overtraining. Setting alt_splice to 1 will add transcripts but not genes, so the gene count is going to be related to the training of the gene finder. I would recommend looking at a few of your large scaffolds in a genome browser. I would also recommend adding a second gene predictor such as Augustus. When multiple predictors are used and the models they predict converge, you can have more confidence in the gene prediction.

For the masking, you can make a species-specific repeat library like Carson suggested to see if the gene count comes down a little. If you are concerned about masking duplicated genes, you can do a couple of things. You can filter the repeat library based on known proteins. You can also set a copy number minimum for the masking and only include repeats that are present more than 10 times in the genome. Here are a couple of URLs for making species-specific repeat libraries:

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Basic

Take care,
Mike

> On Oct 27, 2016, at 5:54 AM, Mohamed Amine CHEBBI wrote:
>
> Sorry, the X and Y axes were switched in the plot due to a mishandling. Please find attached the corrected AED graph.
>
> Round 3 (red curve) shows a slightly higher overall AED than the second round (green curve) and more genes (22931 compared to 22547 in round 2). Do you think that I should stop at the second round?
>
> I didn't specify in the previous email that the repeat masking was done in Maker using Repbase and only the RepeatModeler models with an assigned identity. I left the unknown RepeatModeler library out of the masking. In fact, we expect a high rate of segmental and gene duplication in the genome, which could explain the high overall count of genes found by Maker.
>
> On the other hand, the high number of genes may also be explained by the fact that I activated the alt_splice=1 option to find alternative splicing. Do you think that was a good idea?
>
> Thank you very much for your time.
>
> Best,
> Amine
>
> On 26/10/2016 at 20:06, Carson Holt wrote:
>> Sorry. I also assumed X and Y were flipped when I looked at it. Now that I read the labels, your AED curve would be weird unless the X and Y are flipped in your figure.
>>
>> --Carson
>>
>>> On Oct 26, 2016, at 12:04 PM, Carson Holt wrote:
>>>
>>> Your AED curve looks fine. The first run (using protein2genome or est2genome I assume) will always have really low overall AED because the models are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always artificially look good). The protein2genome or est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve - value in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further look at models by adding protein domains using InterProScan, in which case you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).
>>>
>>> Your overall gene counts are a little high though for an arthropod (14,000-19,000 genes would be expected, as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa). However, your gene counts can be explained by either insufficient repeat masking (you can add a RepeatModeler-generated library to the existing settings to help with this), poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with stricter assembly parameters, including the jaccard-clip option in Trinity), or it is just the result of assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split, which results in inflated gene counts).
>>>
>>> Finally, manually look at the most gene-dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene-dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so it is easily blacklisted and can be identified visually (always gene dense and single exon).
>>>
>>> Thanks,
>>> Carson
>>>
>>>> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi < mohamed.amine.chebbi at univ-poitiers.fr > wrote:
>>>>
>>>> Hi!
>>>> I have tried three rounds of annotation in Maker on a non-model arthropod genome (1.7Gb), which is a hybrid assembly of PacBio and Illumina reads.
>>>> As suggested in the tutorial, in the first round I ran Maker with repeat masking to generate gene models using transcript (Trinity assembly) and protein (Swiss-Prot) evidence. Then the Maker models were used twice in a bootstrap fashion to retrain SNAP.
>>>> The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.
>>>>
>>>> However, the AED profile (attached) doesn't seem to be satisfactory.
>>>> So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that filtering good transcripts could improve results? If yes, which criteria should be taken into account?
>>>> Thank you.
>>>>
>>>> Best;
>>>> Amine
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>
> --
> Mohamed Amine CHEBBI, PhD Student
> Université de Poitiers
> Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
> Equipe Ecologie Evolution Symbiose
> Bât. B8-B35 - 5 Rue Albert Turpin
> TSA 51106
> F-86022 Poitiers Cedex 9
> FRANCE
> Lab website: http://ecoevol.labo.univ-poitiers.fr/

From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 04:54:31 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Thu, 27 Oct 2016 11:54:31 +0200
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
Message-ID:

Sorry, the X and Y axes were switched in the plot due to a mishandling. Please find attached the corrected AED graph.

Round 3 (red curve) shows a slightly higher overall AED than the second round (green curve) and more genes (22931 compared to 22547 in round 2). Do you think that I should stop at the second round?

I didn't specify in the previous email that the repeat masking was done in Maker using Repbase and only the RepeatModeler models with an assigned identity. I left the unknown RepeatModeler library out of the masking. In fact, we expect a high rate of segmental and gene duplication in the genome, which could explain the high overall count of genes found by Maker.

On the other hand, the high number of genes may also be explained by the fact that I activated the alt_splice=1 option to find alternative splicing. Do you think that was a good idea?

Thank you very much for your time.

Best,
Amine

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât.
B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: AED-Graph.pdf
Type: application/pdf
Size: 5301 bytes
Desc: not available
URL:

From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 09:34:02 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Thu, 27 Oct 2016 16:34:02 +0200
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
Message-ID:

Thank you Michael for your response.

As you suggested, I will use Augustus and SNAP, both trained with the assembled transcripts in a bootstrap fashion.

For the masking, I intend to adapt Carson's strategy:

- Collect the RepeatModeler repeats.lib.
- Search the sequences in Modelerunknown.lib against a transposase database (derived from the RepeatMasker package and Kennedy et al. (2011)) and consider sequences matching transposases as transposons.
- Exclude gene fragments from both the known and unknown repeats.
- As I'm concerned about gene duplications, remove the remaining sequences in the unknown library that are present fewer than 10 times.

Thank you again for your time; I remain open to any suggestions.

Best,
Amine

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât. B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/

From carsonhh at gmail.com Thu Oct 27 10:08:15 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 27 Oct 2016 09:08:15 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To:
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
Message-ID:

I do believe that you are getting a number of false-positive genes because of under-masking. So taking a more careful strategy (i.e. using the suggestions given by Michael) should mitigate that. You will have to decide how aggressive to be with the repeat masking (i.e. sensitivity/specificity balance).

I would however turn off alt_splice. It has a very high threshold for how clean and complete mRNA alignments and repeat masking have to be in order to function correctly (the reason why the default is off). So given the filtering being done to pull back on repeat masking, your data likely does not meet that threshold. It won't really produce more genes, but you will get many spurious alternate transcripts.

Also, for the gene count, make sure not to count from the FASTA; that is the transcript count. You have to count the "gene" feature lines in the GFF3 to get the gene count, i.e. -->
grep -P -c "\tgene\t" models.gff

--Carson
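The counting advice in Carson's message, together with the gene_density = gene_count / contig_length check mentioned earlier in the thread, can be sketched in Python. This is only a minimal sketch under stated assumptions, not part of MAKER: the function name, the sample GFF3 lines, and the contig-length dictionary are hypothetical, and it assumes a tab-delimited GFF3 file whose third column holds the feature type.

```python
from collections import Counter


def gene_density(gff3_lines, contig_lengths):
    """Count 'gene' features per contig and divide by contig length.

    gff3_lines: iterable of GFF3 lines (e.g. an open file handle).
    contig_lengths: dict mapping contig name -> length in bp (hypothetical
    input; it could be built from a FASTA index, for example).
    Returns {contig: (gene_count, genes_per_bp)} for contigs of known length.
    """
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        # Column 1 is the contig name, column 3 the feature type.
        if len(cols) >= 3 and cols[2] == "gene":
            counts[cols[0]] += 1
    return {
        contig: (n, n / contig_lengths[contig])
        for contig, n in counts.items()
        if contig in contig_lengths
    }


if __name__ == "__main__":
    # Tiny made-up example; real input would be the MAKER GFF3 output.
    example = [
        "##gff-version 3",
        "contigA\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1",
        "contigA\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1",
        "contigA\tmaker\tgene\t1200\t1800\t.\t-\t.\tID=g2",
        "contigB\tmaker\tgene\t10\t400\t.\t+\t.\tID=g3",
    ]
    lengths = {"contigA": 2000, "contigB": 500}
    density = gene_density(example, lengths)
    # Genes are counted, not transcripts: the mRNA line is ignored.
    print("genes counted:", sum(n for n, _ in density.values()))
    # Most gene-dense contigs first, as candidates for manual inspection.
    for contig, (n, d) in sorted(density.items(), key=lambda kv: -kv[1][1]):
        print(contig, n, d)
```

Sorting the result by density gives a short list of contigs to open first in Apollo or IGV when checking for single-exon, gene-dense contamination.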
URL: From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 10:22:08 2016 From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI) Date: Thu, 27 Oct 2016 17:22:08 +0200 Subject: [maker-devel] Filter transcripts to improve annotation quality ? In-Reply-To: References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com> Message-ID: <69dcf9e0-b736-3f79-082d-1ec2d6d04467@univ-poitiers.fr> Indeed the gene count has been done by the command grep -P -c "\tgene\t" models.gff. I would be careful about repeats, however in the strategy I'm not convinced by the step of searching the sequencesin Modelerunknown.lib against a transposase database, as it has been done yet by the RepeatModeler against the repbase . So I think skip this step. A last question, how to create a Protein database excluding the transposases. Thank you again. Best, Amine Le 27/10/2016 ? 17:08, Carson Holt a ?crit : > not to cou -- Mohamed Amine CHEBBI, PhD Student Universit? de Poitiers Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267 Equipe Ecologie Evolution Symbiose B?t. B8-B35 - 5 Rue Albert Turpin TSA 51106 F-86022 Poitiers Cedex 9 FRANCE Lab website: http://ecoevol.labo.univ-poitiers.fr/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From scott at scottcain.net Fri Oct 28 15:57:07 2016 From: scott at scottcain.net (Scott Cain) Date: Fri, 28 Oct 2016 16:57:07 -0400 Subject: [maker-devel] Call for GMOD talks at PAG Message-ID: Hi, I am pleased to announce a call for talks to be given at the Plant and Animal Genomes conference this January in the GMOD workshop on Wednesday, January 18th. Any talks that involve the development or use of GMOD software are welcome. 
In particular this year, I'd really like to highlight plugins for the various GMOD software packages that support them, like JBrowse, Galaxy and Tripal (of course, Galaxy and Tripal have their own sessions, so you should consider submitting to them too).

Please get an abstract, brief summary or a vague title to me as soon as possible so I can start getting it put together. Also, if you'd like to be a co-organizer, please drop me a line about that too. I might be able to get you some meeting-related niceties for not very much work.

For more information about PAG, see: http://www.intlpag.org

Thanks, and I look forward to seeing you in January,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                         scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)        216-392-3087
Ontario Institute for Cancer Research

From annabel.beichman at gmail.com Fri Oct 28 18:11:11 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Fri, 28 Oct 2016 16:11:11 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
Message-ID: <97D8C047-69C2-4379-AF5C-3E6DAAADA51C@gmail.com>

Re-sending this to the list without attachments, as they were too large.

Cheers,
Annabel

> On Oct 28, 2016, at 4:04 PM, Annabel Beichman wrote:
>
> Hi Carson,
> Re-running Maker without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you!
>
> However, I'm still seeing an odd pattern that I wonder if you have any ideas about:
>
> For the set of ~6000 genes that do not have orthologs in the ferret, but do have start/stop codons and are below AED/eAED of 0.5, I am seeing duplication of BLAST annotations for ~2,600 of the gene models, particularly gene models that are in a row on a scaffold. I've thrown the genes with duplicate BLAST annotations into the attached Excel file so you can see the patterns I'm describing.
> For example, there is a similar annotation for two genes in a row on a scaffold, both of which have low AED/eAED scores and start/stop codons (also visualized in the attached JBrowse screenshot):
>
> Scaffold       Start  Stop    Strand  GeneID         mRNALength  #Exons  BlastInfo
> ScbS9RH_82700  41318  49503   +       ELUT_00017706  8185        3       Similar to Cdh13: Cadherin-13 (Mus musculus)
> ScbS9RH_82700  99358  103910  +       ELUT_00017707  4552        3       Similar to Cdh13: Cadherin-13 (Mus musculus)
>
> I am trying to filter out false positive gene models as I make my exome capture design, so I wondered if you had any tips on what might be going on here. Paralogs? Artifacts of the assembly? Is the gene with the most exons likely to be the original gene? Should I filter sets of duplicates by those that have IPR domains?
>
> Secondly, I also notice 250 of these repeat genes are annotated as 40S or 60S ribosomal protein genes. Do you expect to see this many (I know there are usually many rDNA genes), or could this number be inflated due to ribosomal RNA in the RNA-seq reads? (I carried out poly-A selection prior to sequencing.)
>
> Thanks so much again for your help!
>
> ~ Annabel
>
>> On Oct 17, 2016, at 5:09 PM, Carson Holt wrote:
>>
>> It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want.
>>
>> --Carson
>>
>>> On Oct 17, 2016, at 5:13 PM, Annabel Beichman wrote:
>>>
>>> Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP, as it definitely doesn't show great concordance compared to Augustus.
>>>
>>> Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in RepeatModeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta
>>>
>>> My InterProScan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I'm on the right track.
>>>
>>> Thanks so much again,
>>>
>>> ~ Annabel
>>>
>>>> On Oct 17, 2016, at 1:25 PM, Carson Holt wrote:
>>>>
>>>> Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally, pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another - SNAP tends to have issues with mammalian genomes, for example). Also, when looking at the contigs, if you see contigs consisting of only single exon genes, then you may have some prokaryotic contamination (they assemble as independent gene dense contigs, so this is a good thing to look at if gene counts are high). Finally, high gene counts can mean that repeats are still under-masked (repeats encode real proteins like transposases).
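For anyone who wants to script the density check described above before opening a browser, here is a minimal Python sketch. It is not part of MAKER: the function name contig_gene_stats and the contig_lengths argument are made up for illustration, and it assumes a GFF3 file in which gene, mRNA and exon features carry the usual ID and Parent attributes. It ranks contigs by genes per kb and reports the fraction of genes whose transcripts all have a single exon, which is the pattern to look for when prokaryotic contamination is suspected.

```python
from collections import defaultdict


def parse_attrs(field):
    """Turn a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def contig_gene_stats(gff_lines, contig_lengths):
    """Rank contigs by gene density and report their single-exon gene fraction.

    gff_lines: iterable of GFF3 lines (assumed gene/mRNA/exon hierarchy).
    contig_lengths: dict of contig name -> length in bp (hypothetical input,
    e.g. taken from a FASTA index); contigs missing from it are skipped.
    """
    genes_on = defaultdict(set)   # contig -> gene IDs
    gene_of = {}                  # mRNA ID -> gene ID
    exon_count = defaultdict(int) # mRNA ID -> number of exons

    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows, no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        contig, ftype, attrs = cols[0], cols[2], parse_attrs(cols[8])
        if ftype == "gene" and "ID" in attrs:
            genes_on[contig].add(attrs["ID"])
        elif ftype == "mRNA" and "ID" in attrs and "Parent" in attrs:
            gene_of[attrs["ID"]] = attrs["Parent"]
        elif ftype == "exon" and "Parent" in attrs:
            for parent in attrs["Parent"].split(","):
                exon_count[parent] += 1

    # Largest exon count over all transcripts of each gene.
    max_exons = defaultdict(int)
    for mrna, gene in gene_of.items():
        max_exons[gene] = max(max_exons[gene], exon_count[mrna])

    rows = []
    for contig, genes in genes_on.items():
        if contig not in contig_lengths:
            continue
        single = sum(1 for g in genes if max_exons[g] == 1)
        rows.append({
            "contig": contig,
            "genes": len(genes),
            "genes_per_kb": 1000 * len(genes) / contig_lengths[contig],
            "single_exon_fraction": single / len(genes),
        })
    rows.sort(key=lambda r: r["genes_per_kb"], reverse=True)
    return rows
```

Contigs at the top of the list with a single-exon fraction close to 1 would be the first ones to inspect by eye; the script only prioritizes, it does not decide what is contamination.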
>>>>
>>>> You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain).
>>>>
>>>> --Carson
>>>>
>>>>> On Oct 17, 2016, at 1:20 PM, Annabel Beichman wrote:
>>>>>
>>>>> Hi Carson et al.,
>>>>>
>>>>> Thanks so much for such a great pipeline, tutorials and advice pages.
>>>>>
>>>>> I have just finished four rounds of annotation in Maker on the sea otter genome, which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding.
>>>>>
>>>>> Rounds I & II: In the first two rounds, I trained Augustus and SNAP on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat.
>>>>>
>>>>> Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.
>>>>>
>>>>> Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run Maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb.
>>>>>
>>>>> I found my number of genes dropping each round, as you suggest they should (47465 after round I, 27289 after round II, 25847 after round III, and 25031 after round IV).
>>>>>
>>>>> However, this final gene count (25,031) still seems too high to me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes, 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes.
>>>>>
>>>>> 22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter's closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00; when I visualize them in JBrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold).
>>>>>
>>>>> I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have BLAST hits to SwissProt. Would you recommend any additional filters on gene length, etc.?
>>>>>
>>>>> Not sure if this is significant, but one thing I've noticed is that many of the genes with BLAST hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold:
>>>>> ScbS9RH_101185  30796   38760   +  ELUT_00004195-RA  ELUT_00004195  Name=ELUT_00004195-RA  0.08  0.17  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>> ScbS9RH_101185  42617   51087   +  ELUT_00004196-RA  ELUT_00004196  Name=ELUT_00004196-RA  0.25  0.26  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>> ScbS9RH_101185  87006   87827   +  ELUT_00004198-RA  ELUT_00004198  Name=ELUT_00004198-RA  0.18  0.18  Similar to Ano3: Anoctamin-3 (Mus musculus)
>>>>> ScbS9RH_101185  110043  122523  +  ELUT_00004199-RA  ELUT_00004199  Name=ELUT_00004199-RA  0.09  0.09  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>>
>>>>> Thank you all so much for your help and advice!
>>>>>
>>>>> [I also want to report an odd behavior that may be specific to our server: when the number of scaffolds being annotated using Maker drops below the number of cores (e.g. using OpenMPI with 45 cores available, but there are only 44 scaffolds left), Maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno's paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]
>>>>>
>>>>> Best wishes,
>>>>> Annabel Beichman
>>>>> Wayne Lab/Lohmueller Lab
>>>>> Ecology & Evolutionary Biology
>>>>> UCLA
>>>>> Annabelbeichman.com

From carsonhh at gmail.com Fri Oct 28 18:23:00 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:23:00 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
Message-ID: <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com>

You need to look at some of the contigs in a browser. Look at the most gene dense ones first (density = gene_count/contig_length). You may have prokaryotic contamination if you are seeing a lot of contigs containing primarily single exon gene models. Also make sure you left model_org=all set after adding the species-specific library (the species-specific library is there to supplement RepBase, not to replace it).

Some locations where you are seeing neighboring genes with similar BLAST hits (Cadherin) may in fact be one gene that was split, either because the evidence clusters insufficiently (perhaps the max intron size is set too low in the control files), or because the assembly has runs of NNNN that do not permit the gene predictor to create a spanning model (not uncommon). If you are using Apollo to view the genes, you can zoom in around evidence alignments until you see the sequence, and often you will see clusters of NNNN in the sequence around evidence HSP breakpoints.

--Carson

From carsonhh at gmail.com Fri Oct 28 18:27:59 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:27:59 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com>
Message-ID: <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>

Also, if you labeled putative function using BLAST results, make sure you set the expect value sufficiently low to filter out false homology. Otherwise you will be labeling from the best hit simply because it is the best one, even though it may in fact have a very poor score. The threshold value should never be higher than 1e-6. You can go all the way down to 1e-10 if necessary.

--Carson

> On Oct 28, 2016, at 5:23 PM, Carson Holt wrote:
>
> You need to look at some of the contigs in a browser. Look at the most gene dense ones first (density = gene_count/contig_length).
You may have prokaryiotic contamination if you are seeing a lot of contigs containing primarily single exon gene models. Also make sure you still left model_org=all on after adding the species specific library (the species specific library is to supplement RepBase as opposed to replace it). > > Some locations where you are seeing neighboring genes with similar blast hits (Cadherin) may infact be one gene that was split, either because evidence insufficiently clusters (perhaps the max intron size is set too low in the control files), or perhaps the assembly has runs of NNNN that do not permit the gene predictor to create a spanning model (not uncommon). If you are using Apollo to view the genes you can zoom in around evidence alignments until you see the sequence, and often you will see clusters of NNNN in the sequence around evidence HSP breakpoints. > > ?Carson > > > >> On Oct 28, 2016, at 5:04 PM, Annabel Beichman wrote: >> >> Hi Carson, >> Re-running Maker without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you! >> >> However, I?m still seeing an odd pattern that I wonder if you have any ideas about: >> >> For the set of ~6000 genes that do not have orthologs in the ferret, but do have start/stop codons and are below AED/eAED of 0.5, I am seeing duplication of BLAST annotations for ~2,600 of the gene models, particularly gene models that are in a row on a scaffold. I?ve thrown the genes with duplicate blast annotations into the attached excel file so you can see the patterns I?m describing. 
>> For example, there is a similar annotation for two genes in a row on a scaffold, both of which have low AED/eAED scores and start/stop codons (also visualized in attached Jbrowse screenshot): >> >> Scaffold Start Stop Strand GeneID mRNALength #Exons BlastInfo >> ScbS9RH_82700 41318 49503 + ELUT_00017706 8185 3 Similar to Cdh13: Cadherin-13 (Mus musculus) >> ScbS9RH_82700 99358 103910 + ELUT_00017707 4552 3 Similar to Cdh13: Cadherin-13 (Mus musculus) >> >> I am trying to filter out false positive gene models as I make my exome capture design so wondered if you had any tips on what might be going on here. Paralogs? Artifacts of the assembly? Is the gene with the most exons likely to be the original gene? Should I filter sets of duplicates by those that have IPR domains? >> >> Secondly, I also notice 250 of these repeat genes are annotated as 40S or 60S ribosomal protein genes. Do you expect to see this many (I know there are usually many rDNA genes) or could this number be inflated due to ribosomal RNA in the RNA-seq reads? (I carried out poly-A selection prior to sequencing) >> >> Thanks so much again for your help! >> >> ~ Annabel >> >>> On Oct 17, 2016, at 5:09 PM, Carson Holt wrote: >>> >>> It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want. >>> >>> ?Carson >>> >>> >>> >>>> On Oct 17, 2016, at 5:13 PM, Annabel Beichman wrote: >>>> >>>> Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP as it definitely doesn?t show great concordance compared to Augustus. >>>> >>>> Do you have any additional recommendations for improving my repeat masking? 
I have already made a custom repeat library in repeatmodeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta >>>> >>>> My interproscan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I?m on the right track. >>>> >>>> Thanks so much again, >>>> >>>> ~ Annabel >>>> >>>> >>>>> On Oct 17, 2016, at 1:25 PM, Carson Holt wrote: >>>>> >>>>> Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another - SNAP tends to have issues with mammalian genomes for example). Also when looking at the contig, if you see contig consisting of only single exon genes then you may have some prokaryotic contamination (they assemble as independent gene dense contigs - so a good thing to look at if gene counts are high). Finally high gene counts can mean that repeats are still under masked (repeats encode real proteins like transposases). >>>>> >>>>> You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain). 
From annabel.beichman at gmail.com Fri Oct 28 18:36:03 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Fri, 28 Oct 2016 16:36:03 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>
Message-ID: <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>

Thank you so much, Carson, for such a rapid reply!

I have checked the prokaryotic issue and it looks okay: my most gene-dense contigs all have multi-exon genes. I will re-blast with a more stringent cutoff as well. I think your theory about the NNNN runs might be spot on. The assembly is by Dovetail Genomics, and they insert many NNNN runs as they join contigs together into the long scaffolds, which would disrupt the gene models. Is there any way to salvage the genes that are split around the NNNN runs? Or should I just leave them out of my analyses?
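One way to check the scaffolding-gap idea before opening a browser is to list where the N runs fall and whether any of them sits between two neighbouring gene models. The following is only a minimal sketch in Python: the function names `find_n_runs` and `n_runs_between` and the toy scaffold are made up for illustration, and nothing here is part of MAKER.

```python
import re


def find_n_runs(sequence, min_len=1):
    """Return 1-based inclusive (start, end) coordinates of every run of
    N/n that is at least min_len bases long."""
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]


def n_runs_between(runs, left_end, right_start):
    """Keep only the runs lying strictly between the end of one gene model
    and the start of the next (all coordinates 1-based)."""
    return [(s, e) for (s, e) in runs if s > left_end and e < right_start]


# Toy scaffold (hypothetical): two short "genes" separated by six Ns.
scaffold = "ATGGCC" + "NNNNNN" + "GGATAA"
runs = find_n_runs(scaffold, min_len=3)
print(runs)
print(n_runs_between(runs, left_end=6, right_start=13))
```

For a real assembly the sequence would be read from the scaffold FASTA and the two coordinates taken from adjacent gene features in the MAKER GFF3. A gap between two models with the same BLAST description is a hint, not proof, that one gene was split.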
Thanks again,
~ Annabel

From carsonhh at gmail.com Fri Oct 28 18:49:27 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:49:27 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com> <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>
Message-ID: <07C987F9-1354-4DB6-A63F-9B23F2006871@gmail.com>

The NNNN runs preclude both alignment and prediction, so unless they fall within an intron, the result is a split model. (Often a run of Ns is only a few base pairs long, but if it occurs in an exon you can't really work around it.) The predictors work by maximizing a score, so the ab initio predictor ends up finding some way of terminating the model around the Ns that scores well even though it does not reflect the biology. Sometimes you can try to force things in manually (non-canonical splice sites, etc.) if it is an important gene. Web-Apollo even allows you to insert SNPs and indels to correct the ORF, but it's a labor-intensive manual process.

So, the short answer: check in a browser whether you see these N runs at the breakpoints. If you do have them, you will have to decide how to handle the split models depending on the analysis (perhaps take the longer one?).
Take some time just viewing alignments and models to get a feel for how evidence and gene models should correlate. There really is no substitute for manual visual review.

--Carson

From jacques.dainat at bils.se Mon Oct 31 05:51:29 2016
From: jacques.dainat at bils.se (Jacques Dainat)
Date: Mon, 31 Oct 2016 11:51:29 +0100
Subject: [maker-devel] est_gff input does not provide any gene model
Message-ID:

Hello,

I usually feed Maker with Cufflinks output through the est_gff parameter; combined with est2genome=1, this gives me the output I want. This time I fed Maker with Stringtie output instead, but no gene models are predicted with the est2genome parameter.

Is there an explanation? Is it due to the GFF3 format differences between these two files?

Cufflinks output example:
Pnalgiovense_4592 Cufflinks match 363 977 17.844829 - . ID=1:s3_c1_r1.4.2;Name=1:s3_c1_r1.4.2;
Pnalgiovense_4592 Cufflinks match_part 363 666 17.844829 - . ID=1:s3_c1_r1.4.2:exon-1;Name=1:s3_c1_r1.4.2;Parent=1:s3_c1_r1.4.2;Target=1:s3_c1_r1.4.2 1 304 +;
Pnalgiovense_4592 Cufflinks match_part 743 977 17.844829 - . ID=1:s3_c1_r1.4.2:exon-2;Name=1:s3_c1_r1.4.2;Parent=1:s3_c1_r1.4.2;Target=1:s3_c1_r1.4.2 305 539 +;

Stringtie output example:
Pnalgiovense_112 StringTie gene 20 1256 1000 + . ID=HtMm_All.12253;cov=8.028295;fPKM=1.214491;gene_id=HtMm_All.12253;tPM=2.706611;transcript_id=HtMm_All.12253.1
Pnalgiovense_112 StringTie mRNA 20 1256 1000 + . ID=HtMm_All.12253.1;Parent=HtMm_All.12253;cov=8.028295;fPKM=1.214491;gene_id=HtMm_All.12253;tPM=2.706611;transcript_id=HtMm_All.12253.1
Pnalgiovense_112 StringTie exon 20 1256 1000 + . ID=HtMm_All.12253.1-exon-1;Parent=HtMm_All.12253.1;cov=8.028295;exon_number=1;gene_id=HtMm_All.12253;transcript_id=HtMm_All.12253.1

If the Stringtie output is the problem, how can I fix it? Is it enough to remove the gene features and change mRNA to match and exon to match_part?

Best regards,

Jacques Dainat, PhD
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service

Address: (room E10:4204 - last floor)
Uppsala University, BMC
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: 01 84 71 46 25

From carsonhh at gmail.com Mon Oct 31 22:24:03 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Oct 2016 21:24:03 -0600
Subject: [maker-devel] est_gff input does not provide any gene model
In-Reply-To:
References:
Message-ID:

Evidence such as est_gff has to follow the alignment format used by GFF3 (i.e. match/match_part), whereas you are providing gene models (i.e. gene/mRNA/exon/CDS). Note that match/match_part are two-level features, whereas gene models have three levels. You need to reformat to match/match_part.

--Carson

From allisonfuiten at gmail.com Mon Oct 31 19:34:23 2016
From: allisonfuiten at gmail.com (Allison Fuiten)
Date: Mon, 31 Oct 2016 17:34:23 -0700
Subject: [maker-devel] InterProScan protein domain & AED physical evidence filtering
Message-ID:

Hello MAKER google group,

For the final round of a MAKER annotation of a de novo plant genome assembly, I ran MAKER twice: once with keep_preds=0, which annotated 20,284 genes, and once with keep_preds=1, which annotated 34,055 genes.
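Gene counts like these can be broken down by evidence support and by domain content with a few lines of Python. The sketch below is not a MAKER accessory script. It assumes that each mRNA feature carries its score in an `_AED` attribute and that InterProScan hits merged into the GFF3 appear as `InterPro:` entries in `Dbxref`; both should be checked against the actual file. The function names and the three sample records are invented for illustration.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute column (key=value;key=value) into a dict."""
    fields = (f.split("=", 1) for f in column9.strip().split(";") if "=" in f)
    return {key: value for key, value in fields}


def summarize_mrnas(gff_lines):
    """Count mRNA features with evidence (AED < 1), with an InterPro
    cross-reference, and with neither."""
    counts = {"total": 0, "evidence": 0, "domain": 0, "neither": 0}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: a missing _AED is treated as 1 (no evidence).
        has_evidence = float(attrs.get("_AED", "1")) < 1.0
        has_domain = "InterPro:" in attrs.get("Dbxref", "")
        counts["total"] += 1
        counts["evidence"] += int(has_evidence)
        counts["domain"] += int(has_domain)
        counts["neither"] += int(not (has_evidence or has_domain))
    return counts


# Three made-up mRNA records (tab-separated, as GFF3 requires).
sample = [
    "\t".join(["chr1", "maker", "mRNA", "100", "900", ".", "+", ".",
               "ID=g1-RA;Parent=g1;_AED=0.10;Dbxref=InterPro:IPR000001"]),
    "\t".join(["chr1", "maker", "mRNA", "2000", "2900", ".", "-", ".",
               "ID=g2-RA;Parent=g2;_AED=1.00;Dbxref=InterPro:IPR000719,Pfam:PF00069"]),
    "\t".join(["chr1", "maker", "mRNA", "5000", "5600", ".", "+", ".",
               "ID=g3-RA;Parent=g3;_AED=1.00"]),
]
print(summarize_mrnas(sample))
```

Models counted under "neither" are the candidates for removal; collecting their IDs instead of counting them would give a list to filter the GFF3 with.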
I ran the 34,055 genes (the keep_preds=1 set) through InterProScan to search the MAKER predictions for protein domain content and added this IPRScan output into the MAKER gff file with the ipr_update_gff accessory script. The game plan is to go through the 34,055 genes and remove any gene model that doesn't have either protein domain content or physical evidence. I am counting genes that have an AED=1 as the genes that don't have physical evidence.

I have two questions:

1. I count 11,762 genes that have AED=1.0 in the keep_preds=1 annotation set, which leaves me with 22,293 genes that I'm assuming have some physical evidence (34,055-11,762=22,293). But when I ran MAKER with keep_preds=0 originally, I only count 20,284 genes. What are the extra ~2,000 genes that are being annotated in the keep_preds=1 run that have an AED score of less than 1.0, but are not being annotated in the keep_preds=0 run?

2. My second question is whether there is an accessory script available that will remove genes that lack both IPRScan protein domains and physical evidence (i.e. keep only genes with a domain or with AED < 1). This type of gene removal was mentioned in a previous post from 2012 (https://groups.google.com/forum/#!searchin/maker-devel/sorry$20there$27s$20not$20a$20script$20prepackaged$20with$20MAKER$20for$20that$20yet.%7Csort:relevance/maker-devel/VaoXWlGHOjs/EElr_otrK8QJ) and I was just wondering if since then someone has written a script that will do this for me.

If anyone could offer me any feedback, that would be greatly appreciated!

Thank you,
Allison

From robert.king at rothamsted.ac.uk Thu Oct 6 05:30:49 2016
From: robert.king at rothamsted.ac.uk (Robert King)
Date: Thu, 6 Oct 2016 11:30:49 +0000
Subject: [maker-devel] ATG strict start codon usage query
Message-ID:

Hi,

I'm using the latest version of Maker2, but when I run it I get CTG and TTG as start codons, which I don't want. From reading earlier threads, the BioPerl CodonTable.pm has been changed to allow a strict setting so that only ATG is used. My question is how to invoke this functionality. I've looked in the MAKER control files and the maker command-line options but don't see how to get it to use only ATG as the start codon. Can you please advise?

Best wishes
Rob

From carsonhh at gmail.com Thu Oct 6 10:08:00 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 6 Oct 2016 10:08:00 -0600
Subject: [maker-devel] ATG strict start codon usage query
In-Reply-To:
References:
Message-ID: <786A1E40-6261-43C8-AA84-4AD0EF45BC9F@gmail.com>

Make sure you are using the latest MAKER version (2.31.8 - since about 2014). Make sure you are not using GFF3 files as input to MAKER (otherwise you will use whatever codon is in the GFF3). Make sure your BioPerl is up to date (CPAN version, not the BioPerl live version).

With respect to behavior, MAKER by default will keep whatever start codon was used by the ab initio predictor, and only search for a different one if you set always_complete=1.

--Carson
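Whether a run still yields non-ATG starts can be checked by tallying the first codon of every predicted coding sequence. The following is only a minimal Python sketch under stated assumptions: the input is a FASTA of CDS sequences beginning at the start codon (MAKER's transcript FASTA may include 5' UTR, so it would not qualify as-is), and the helper names read_fasta and tally_start_codons are hypothetical, not part of MAKER or BioPerl.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def tally_start_codons(lines):
    """Count the first three bases (upper-cased) of each CDS record.

    Assumption: every record is a CDS that begins at its start codon.
    """
    return Counter(
        seq[:3].upper() for _, seq in read_fasta(lines) if len(seq) >= 3
    )
```

Passing an open file handle for such a CDS FASTA to tally_start_codons returns a Counter; any key other than ATG points at models worth inspecting.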
From mohamed.amine.chebbi at univ-poitiers.fr Mon Oct 10 03:43:21 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Mon, 10 Oct 2016 11:43:21 +0200
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
Message-ID: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>

Hi!

I'm using the latest version of Maker2 to annotate an arthropod genome. First, I ran RepeatModeler to create the rmlib for MAKER, then I followed two independent annotation strategies on the same assembly:

1- Passing through MAKER all the repeats collected by RepeatModeler (identified repeats in Repbase + unknown models).
2- Passing through MAKER only the identified repeats.

Both annotations ran successfully. The first annotation gives me 19048 genes against 22931 for the second one. Now I'm looking for a way to merge the two annotation gff files without doing a re-annotation, taking the best-supported, non-redundant gene models.

So, do you think that configuring the maker options as below could resolve this issue:

maker_gff=1-mask-all.gff,2-mask-onlyKnown.gff #MAKER derived GFF3 file
est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=1 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers

From carsonhh at gmail.com Tue Oct 11 14:05:50 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 11 Oct 2016 14:05:50 -0600
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
Message-ID:

Masking doesn't just affect the gene models, but also evidence alignment and thus scoring. So merging in this way would not make much sense, as the second, less masked set would always score better because it has more evidence alignments permitted by the lack of masking (not necessarily real, but drawn in by repeats). The result would be that any attempt at a merge would almost exclusively end up selecting genes from the second set, since they would always score higher.

--Carson

From aravindp at imcb.a-star.edu.sg Mon Oct 17 00:45:59 2016
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Mon, 17 Oct 2016 06:45:59 +0000
Subject: [maker-devel] Maker MPI installation error and IO error for serial version
Message-ID:

Hi,

I'm trying to install Maker in my cluster account. I have installed all the dependencies, but there are two issues for which I would like a solution. I looked for one in the forums without success.

1. MPI installation: "flock: Function not implemented" error at src/lib/Parallel/Application/MPI.pm line 256.

./Build install

Configuring MAKER with MPI support
flock: Function not implemented
 at /scratch/tools/maker_mpi/src/lib/Parallel/Application/MPI.pm line 256.
 Parallel::Application::MPI::_bind("/app/openmpi/1.10.3/intel_java/bin/mpicc", "/app/openmpi/1.10.3/intel_java/include", "blib", "") called at /scratch/users/astar/imcb/aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 277
 MAKER::Build::ACTION_build(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010
 Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "build") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1993
 Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0), "build") called at /aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 469
 MAKER::Build::ACTION_install(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010
 Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "install") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1998
 Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0)) called at ./Build line 69

2. When I run a serial version of Maker, I get the following errors in the "makerlog.e" file.

DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 109.
DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 111.
DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 113.
DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 191.
DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 390.

Please help me with these errors as early as possible. I have double-checked all the dependencies and the file paths given while running Maker.
Awaiting your reply!

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http:/www.imcb.a-star.edu.sg/

From mark.ebbert at gmail.com Thu Oct 13 15:57:50 2016
From: mark.ebbert at gmail.com (Mark Ebbert)
Date: Thu, 13 Oct 2016 14:57:50 -0700
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
Message-ID: <57fffd715f83340001fcf47d@polymail.io>

Hi,

I've been working with maker for several months off and on with varying success. It worked great the first time I ran it, but ever since, it fails every run without any specific errors. It just says that one of the processes failed. I've been limping along by running the following command to remove any locks and restarting:

find . -name *.NFSLock* -exec rm {} \;

This has been working, but for some reason maker started over from the beginning and lost all of the previous work! I don't even know where to start investigating. Should I nuke the whole maker directory structure and start from scratch? Maybe something got corrupted?

I already deleted the log files before I realized maker had started over, because the log files get way too big.

I really appreciate your help!

Mark T. W. Ebbert
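The cleanup command above passes an unquoted glob to find, so the shell may expand *.NFSLock* before find ever sees it. A more cautious variant is sketched below in Python, purely as an illustration: the pattern is taken from the command above, while the function names and the dry_run flag are hypothetical and not part of MAKER. It only considers regular files and lists them without deleting unless asked to, and it is assumed to be run only while no MAKER process is working in the directory.

```python
import fnmatch
import os


def find_lock_files(root, pattern="*.NFSLock*"):
    """Return sorted paths of files under root whose name matches pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def remove_lock_files(root, dry_run=True):
    """List matching lock files; delete them only when dry_run is False."""
    paths = find_lock_files(root)
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

Calling remove_lock_files on the run directory first with the default dry_run=True gives a chance to review the list before anything is removed.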
From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 12 03:44:48 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (chebbi mohamed amine)
Date: Wed, 12 Oct 2016 11:44:48 +0200 (CEST)
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
Message-ID: <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>

Thank you Carson for your quick response.

Sorry, I have another question concerning Augustus training. You previously posted to the mailing list a link to an explanation of the Augustus training steps: http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html . Unfortunately the link doesn't work anymore. Could you instead explain how to filter the gff file produced by the first run of Maker to get the best full-length ORFs as a set of gene models for training Augustus?

Best,
Amine

From carsonhh at gmail.com Mon Oct 17 12:17:17 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 12:17:17 -0600
Subject: [maker-devel] Maker MPI installation error and IO error for serial version
In-Reply-To:
References:
Message-ID:

It's saying your system has no flock (file locking). For NFS mounts this is usually a configuration choice made by the administrator. At the very least they can enable lock emulation in NFS, which is what your scratch seems to be. Unfortunately SQLite will not work without this.

You can still get MAKER to install with MPI by removing the lock used during setup (do this by editing line 210 of .../maker/src/lib/Parallel/Application/MPI.pm).

Turn this -->
$lock = new File::NFSLock("$loc/_MPI", 'EX', 300, 40) while(!$lock);

To this (i.e. comment out line 210) -->
#$lock = new File::NFSLock("$loc/_MPI", 'EX', 300, 40) while(!$lock);

However, there is no workaround for the SQLite IO error. It requires that your administrator enable locks or lock emulation (for example, setting nolock,local_lock=all will cause the system to emulate locks on NFS locally). So while these are not exactly real locks, they won't fail.

Thanks,
Carson

From carson.holt at genetics.utah.edu Mon Oct 17 12:25:54 2016
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Mon, 17 Oct 2016 18:25:54 +0000
Subject: [maker-devel] question about Maker2
In-Reply-To:
References: <56F4066F.4000803@fgcz.ethz.ch> <01AB4222AE1B7E41A3B5CAEC445F192B3F71EB84@MBX115.d.ethz.ch> <3470AFC0-7B3A-485C-A86E-C7DE5A341C3C@genetics.utah.edu> <57270F57.50208@fgcz.ethz.ch> <5A09C696-CBD0-4DA9-8CB6-B994981E00D3@genetics.utah.edu> <01AB4222AE1B7E41A3B5CAEC445F192B3F747251@MBX115.d.ethz.ch> <89F7DE68-6FFF-4E17-B867-8E699D3DE986@genetics.utah.edu> <01AB4222AE1B7E41A3B5CAEC445F192B3F752945@MBX215.d.ethz.ch> <1DB8E975-3E54-455D-8852-2DD2937B2FCF@genetics.utah.edu>
Message-ID: <8D22D8B2-73DC-4276-8B2D-BDEF8ECDFBE7@genetics.utah.edu>

> what is the difference between files
>
> 1) ContigXXX.maker.non_overlapping_ab_initio.proteins.fasta

Non-redundant, non-overlapping models (i.e. the subset of snap/augustus models that do not overlap a final MAKER selected model).

> and
>
> 2) ContigXXX.maker.augustus_masked.proteins.fasta

Contains all raw augustus models called without hints (i.e. the equivalent of just running Augustus on its own).
> None of these should have EST info (as the sequences headers are
>
> 1) augustus_masked-1-processed-gene-

This was a raw Augustus model that may or may not have had UTR added using EST info (i.e. the model came straight from Augustus, so no hints were used to produce the model, but MAKER did try to add UTR).

> and
>
> 2) augustus_masked-1-abinit-gene-

Model straight from Augustus. No hints, and no MAKER attempt to add UTR. These are raw unmodified models and will never be in the final selected set.

> so no "maker-XXX)

maker-XXX means it was a hint-derived model and not a raw Augustus model.

> Should file 2 just be ignored and 1) be kept aside the maker file, where EST/protein evidence is incorporated?

Ignore all the abinit files. They are for reference purposes only. The non-overlapping file can be used to see what was rejected and does not overlap a current model (i.e. you may be able to find a handful of false negatives that can be rescued with domain analysis using something like InterProScan).

--Carson

> Thanks,
>
> G
>
> On 5/18/16 11:31 AM, Carson Holt wrote:
>> Hi Giancarlo,
>>
>> There was no image attached. If you can, just send me the contig GFF3, and I can look at it in Apollo (which lets me manipulate reading frame and display splice sites). Then I can tell you more. Basically the gene models are the result of an HMM for gene patterns plus hints to alter probability around evidence-suggested sites. If there is any issue with the reading frame (it can be a single-bp assembly error) then no amount of hints can force a broken CDS to be coding, and the predictor will do the best it can to still produce a workable model (i.e. truncate exons, skip exons, etc.). Also, if your mRNA-seq is not aligned correctly around a canonical splice site (i.e. overhang beyond the splice acceptor) then that hint may be ignored.
>>
>> --Carson
>>
>>
>>> On May 17, 2016, at 4:50 AM, Russo Giancarlo wrote:
>>>
>>> Hi Carson, thanks again for all your answers.
>>> A (hopefully) final question: in the image attached you can see an IGV sashimi plot of RNA-seq data, with the annotated gene derived from Maker; what could be the reason that in the gene model the two bits on the sides (UTRs?), which show high coverage from the RNA-seq data and plenty of splice junctions with the neighbouring exons, are completely missing?
>>>
>>> In this run I have used a closely related species from the augustus database for gene prediction, RNA-seq based de novo assembled transcripts as EST and protein sequences from the same closely related species. I have masked using a customized library built following the guidelines in the tutorial.
>>>
>>> Thanks,
>>> Giancarlo
>>>
>>> Giancarlo Russo, Ph.D.
>>> Functional Genomics Center Zurich
>>> ETH Zurich / University of Zurich
>>> Winterthurerstrasse 190 / Y32 H66
>>> CH-8057 Zurich
>>>
>>> Phone: +41 44 635 3964
>>> Fax: +41 44 635 3922
>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>> http://www.fgcz.ch
>>> ________________________________________
>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>> Sent: 09 May 2016 18:02
>>> To: Russo Giancarlo
>>> Subject: Re: question about Maker2
>>>
>>> For training gene predictors with protein and EST --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>>>
>>> If reusing MAKER results I don't recommend GFF3 passthrough. The GFF3 option is to get non-MAKER-sourced results into MAKER. You will actually lose some functionality by passing in MAKER-sourced results as GFF3 (MAKER can't do things with GFF3 that it can do with self-generated data).
>>>
>>> It is best to just rerun MAKER in the same directory; it will reuse previous reports it finds in the datastore.
>>>
>>> --Carson
>>>
>>>
>>>
>>>> On May 3, 2016, at 2:08 AM, Russo Giancarlo wrote:
>>>>
>>>> OK, thanks a lot, now it is clear.
>>>>
>>>> About the passthrough procedure, would you have any particular advice on what would be the best strategy to run it?
>>>> I have tried an existing organism in Augustus but the results were not too good.
>>>>
>>>> I have both EST and protein evidence, so I thought I could use EST to infer ab-initio and produce a first annotation and then run a second-pass using the first gff maker file as ab-initio.
>>>>
>>>> Any advice would be appreciated.
>>>>
>>>> Best and thanks again.
>>>> Giancarlo
>>>>
>>>> Giancarlo Russo, Ph.D.
>>>> Functional Genomics Center Zurich
>>>> ETH Zurich / University of Zurich
>>>> Winterthurerstrasse 190 / Y32 H66
>>>> CH-8057 Zurich
>>>>
>>>> Phone: +41 44 635 3964
>>>> Fax: +41 44 635 3922
>>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>>> http://www.fgcz.ch
>>>> ________________________________________
>>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>>> Sent: 02 May 2016 18:16
>>>> To: Russo Giancarlo
>>>> Subject: Re: question about Maker2
>>>>
>>>> As part of the MAKER job, it runs Snap and Augustus on their own before aligning evidence and generating hints for the later run. The Contig2.maker.augustus.transcripts.fasta are just the results of that uninformed Augustus run. They are not the final gene models, they are just the raw uninformed Augustus models. They are there for reference purposes only. They are what you would have gotten by just running Augustus directly on the assembly without any additional input (i.e. what Augustus would have produced on its own outside of MAKER).
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>>> On May 2, 2016, at 2:27 AM, giancarlo.russo wrote:
>>>>>
>>>>> Hi Carson,
>>>>> sorry to bother you again, I still don't understand the difference between
>>>>>
>>>>> 1) Contig2.maker.augustus.transcripts.fasta
>>>>> and
>>>>> 2) Contig2.maker.transcripts.fasta
>>>>>
>>>>> If 1) contains the transcripts "Produced by maker sending hints to
>>>>> augustus to modify scoring against the HMM",
>>>>> , and these hints are derived from EST/protein evidence, what extra
>>>>> information is used/extra steps are performed to produce 3) ?
>>>>>
>>>>> Also, how is a passthrough using a first pass, maker-produced gff
>>>>> annotation file is best done?
>>>>> Should this gff file be used for ab-initio gene models that are then
>>>>> corrected EST and protein evidence?
>>>>> Does it make sense to use augustus when a first pass gff file is
>>>>> available? Do these two options (ab-initio based on first pass gff and
>>>>> augustus switched on) exclude each other?
>>>>>
>>>>> Thanks again for your time and help.
>>>>>
>>>>> Best,
>>>>> G
>>>>> On 29/03/16 17:42, Carson Holt wrote:
>>>>>> Yes. The ESTs generate hints as to both intron location and exon location. The protein alignments generate CDS location hints. Each algorithm has different ways to feed hints, with Augustus being the most advanced. It allows separate bonuses for partial vs exact matches, and you can optionally link hints so they have to be matched as a group. It also offers many other hint types, like splice donor and acceptor hints. However we really only use the intron, exon, and CDS hints. We also use the partial match bonus.
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>>> On Mar 29, 2016, at 7:50 AM, Russo Giancarlo wrote:
>>>>>>>
>>>>>>> Hi Carson, thanks a lot for your answer.
>>>>>>>
>>>>>>> So let's see if I get it correctly.
>>>>>>> In the final datastore I have the fasta files named
>>>>>>>
>>>>>>> 1) Contig2.maker.augustus.transcripts.fasta
>>>>>>> 2) Contig2.maker.non_overlapping_ab_initio.transcripts.fasta
>>>>>>> 3) Contig2.maker.transcripts.fasta
>>>>>>>
>>>>>>> 1) contains the transcripts "Produced by maker sending hints to augustus to modify scoring against the HMM"
>>>>>>> 2) contains the transcripts predicted only by the ab initio algorithm (e.g. augustus)
>>>>>>> 3) contains the transcripts with a full gene model based on ab initio + EST and/or PROTEIN
>>>>>>>
>>>>>>> However, what "hints" are sent by maker to augustus? If these are EST/PROTEIN hints, then what is the difference between 1) and 3) ?
>>>>>>>
>>>>>>> Thanks again for your help and sorry for bothering.
>>>>>>>
>>>>>>> Best,
>>>>>>> Giancarlo
>>>>>>>
>>>>>>> Giancarlo Russo, Ph.D.
>>>>>>> Functional Genomics Center Zurich
>>>>>>> ETH Zurich / University of Zurich
>>>>>>> Winterthurerstrasse 190 / Y32 H66
>>>>>>> CH-8057 Zurich
>>>>>>>
>>>>>>> Phone: +41 44 635 3964
>>>>>>> Fax: +41 44 635 3922
>>>>>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>>>>>> http://www.fgcz.ch
>>>>>>> ________________________________________
>>>>>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>>>>>> Sent: 24 March 2016 21:56
>>>>>>> To: maker-devel
>>>>>>> Cc: Russo Giancarlo; Mark Yandell
>>>>>>> Subject: Re: question about Maker2
>>>>>>>
>>>>>>> Hi Giancarlo,
>>>>>>>
>>>>>>> Anything listed as something like maker-*-augustus was a result of MAKER sending hints to augustus, and anything like augustus-*-abinit was the result of augustus run directly from the HMM without hints.
>>>>>>>
>>>>>>> Here is more detail on the format -->
>>>>>>> - - -gene- -
>>>>>>>
>>>>>>> Top level possibilities:
>>>>>>> maker #maker generated model
>>>>>>> snap_masked #snap run on masked sequence
>>>>>>> augustus_masked #augustus run on masked sequence
>>>>>>> etc.
>>>>>>>
>>>>>>> Internal source:
>>>>>>> abinit #ab initio model direct from HMM
>>>>>>> snap #hints provided to SNAP (alters scoring)
>>>>>>> augustus #hints provided to augustus (alters scoring)
>>>>>>>
>>>>>>> Then chunk and iterator are just to generate a unique ID.
>>>>>>>
>>>>>>>
>>>>>>> Example:
>>>>>>> augustus_masked-scaffold11899-abinit-gene-0.6 #Produced by Augustus on masked sequence using raw HMM (no MAKER intervention).
>>>>>>> maker-scaffold11899-augustus-gene-0.6 #Produced by maker sending hints to augustus to modify scoring against the HMM
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 3/24/16, 9:23 AM, "giancarlo.russo"
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Dear Mike,
>>>>>>>>>
>>>>>>>>> first of all thanks for taking care and sharing Maker, as part of the
>>>>>>>>> community I appreciate it.
>>>>>>>>>
>>>>>>>>> I have a question about the nomenclature of the annotation in the output
>>>>>>>>> file:
>>>>>>>>> what is the difference between genes named
>>>>>>>>>
>>>>>>>>> maker-Contig-XXX
>>>>>>>>> and those named
>>>>>>>>> augustus-Contig-XXX-processed genes
>>>>>>>>> ?
>>>>>>>>>
>>>>>>>>> Please find attached the maker_opts file I have used for my annotation.
>>>>>>>>> I was under the impression that the ab-initio related prefixes would be
>>>>>>>>> present only in the genes which are not marked as "maker" in column 3 of
>>>>>>>>> the gff file (i.e., those
>>>>>>>>> with both ab-initio and EST evidence)
>>>>>>>>>
>>>>>>>>> Is there something I am missing?
>>>>>>>>>
>>>>>>>>> Thanks a lot in advance,
>>>>>>>>> Giancarlo
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Giancarlo Russo, Ph.D.
>>>>>>>>> Functional Genomics Center Zurich
>>>>>>>>> Y32 H66
>>>>>>>>> Winterthurerstr. 190
>>>>>>>>> 8057 Zurich
>>>>>>>>> SWITZERLAND
>>>>>>>>> Phone: +41 44 635 39 64
>>>>>>>>> Fax: +41 44 635 39 22
>>>>>>>>> E-Mail: giancarlo.russo at fgcz.ethz.ch
>>>>>>>>>
>>>>>>>>
>>>>> --
>>>>> Giancarlo Russo, Ph.D.
>>>>> Functional Genomics Center Zurich >>>>> Y32 H66 >>>>> Winterthurerstr. 190 >>>>> 8057 Zurich >>>>> SWITZERLAND >>>>> Phone: +41 44 635 39 64 >>>>> Fax: +41 44 635 39 22 >>>>> E-Mail: giancarlo.russo at fgcz.ethz.ch >>>>> > > -- > Giancarlo Russo, Ph.D. > Functional Genomics Center Zurich > Winterthurerstrasse 190 > 8057 Zurich (CH) > Phone: +41 044 635 3964 > Fax: +41 044 635 3922 > From carsonhh at gmail.com Mon Oct 17 12:35:52 2016 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 17 Oct 2016 12:35:52 -0600 Subject: [maker-devel] Maker regularly fails and just lost all of the previous work! In-Reply-To: <57fffd715f83340001fcf47d@polymail.io> References: <57fffd715f83340001fcf47d@polymail.io> Message-ID: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com> If you made a change that affects downstream steps, MAKER erases affected intermediate files, and recalculates. It?s possible that you erased required checkpoiunt files, so MAKER thinks a change has been made that requires some things to be rerun. Also if the STDERR is too big. Set -quiet or -qq (really quiet) on the command line. In general the error you see at the end is not the cause. The real error is further back in the log. MAKER tries to recover/retry, so the final failure you see is basically MAKER saying, I give up. But the original cause is further back in the log often behind the output of other MAKER threads that are writing to the log simultaneously. Iif you have 100 CPUs writing to the same output log, you may bury the real error behind the output of other threads (the log is not truly linear), so you have to look further back. If you use the beta, you can also specify -nolock, but be warned that the locks themselves are important to avoid file corruption (i.e. you accidentally launch MAKER twice). ?Carson > On Oct 13, 2016, at 3:57 PM, Mark Ebbert wrote: > > > Hi, > > I?ve been working with maker for several months off and on with varying success. 
> It worked great the first time I ran it, but ever since, it fails every run without any specific errors. Just says that one of the processes failed. I've been limping along by just running the following command to remove any locks and re-starting: "find . -name *.NFSLock* -exec rm {} \;"
>
> This has been working, but for some reason maker started over from the beginning and lost all of the previous work! I don't even know where to start interrogating. Should I nuke the whole maker directory structure and start from scratch? Maybe something got corrupted?
>
> I already deleted the log files before I realized maker started over because the log files get way too big.
>
> I really appreciate your help!
>
> Mark T. W. Ebbert
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From annabel.beichman at gmail.com Mon Oct 17 13:20:37 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Mon, 17 Oct 2016 12:20:37 -0700
Subject: [maker-devel] Too many genes?
Message-ID:

Hi Carson et al.,

Thanks so much for such a great pipeline, tutorials and advice pages.

I have just finished four rounds of annotation in Maker on the sea otter genome which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding.

Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat.

Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.

Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb.

I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after Round II, 25847 after Round III, and 25031 after Round IV).

However, this final gene count (25,031) still seems too high to me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes, 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes.

22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter's closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00; when I visualize them in Jbrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold).

I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, while also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs but do have Blast hits to SwissProt. Would you recommend any additional filters on gene length, etc.?
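
A minimal sketch of the AED/eAED and length filter discussed above, assuming the annotation is in MAKER's GFF3 output and that each mRNA feature carries `_AED` and `_eAED` attributes in column 9 (an assumption about the file layout, not something stated in this thread). The function names, the `min_length` parameter and the demo records are hypothetical; "length" here is the genomic span of the mRNA feature, not the CDS length.

```python
def parse_attributes(column9):
    """Turn 'ID=x;_AED=0.08' into {'ID': 'x', '_AED': '0.08'}."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def confident_mrnas(gff_lines, max_aed=0.5, min_length=0):
    """Return IDs of mRNA features with _AED and _eAED <= max_aed
    and a genomic span of at least min_length bases."""
    kept = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no more features
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "mRNA":
            continue
        attributes = parse_attributes(columns[8])
        if "_AED" not in attributes or "_eAED" not in attributes:
            continue  # no score recorded, so the model cannot be judged here
        span = int(columns[4]) - int(columns[3]) + 1
        if (float(attributes["_AED"]) <= max_aed
                and float(attributes["_eAED"]) <= max_aed
                and span >= min_length):
            kept.append(attributes.get("ID", ""))
    return kept


# Hypothetical demo records (not taken from the sea otter annotation).
demo = [
    "##gff-version 3\n",
    "ctg1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.08;_eAED=0.17\n",
    "ctg1\tmaker\tmRNA\t2000\t2300\t.\t+\t.\tID=g2-RA;Parent=g2;_AED=0.62;_eAED=0.40\n",
    "ctg1\tmaker\texon\t100\t900\t.\t+\t.\tID=g1-RA:exon:0;Parent=g1-RA\n",
]
print(confident_mrnas(demo))
```

On a real file the same function can be fed an open file handle, for example `confident_mrnas(open("annotation.gff"))`, where the file name is a placeholder.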

Not sure if this is significant, but one thing I've noticed is that many of the genes with Blast hits in SwissProt but no ferret orthologs have several similar genes in a row along the same scaffold:

ScbS9RH_101185  30796   38760   +  ELUT_00004195-RA  ELUT_00004195  Name=ELUT_00004195-RA  0.08  0.17  Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185  42617   51087   +  ELUT_00004196-RA  ELUT_00004196  Name=ELUT_00004196-RA  0.25  0.26  Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185  87006   87827   +  ELUT_00004198-RA  ELUT_00004198  Name=ELUT_00004198-RA  0.18  0.18  Similar to Ano3: Anoctamin-3 (Mus musculus)
ScbS9RH_101185  110043  122523  +  ELUT_00004199-RA  ELUT_00004199  Name=ELUT_00004199-RA  0.09  0.09  Similar to ANO3: Anoctamin-3 (Homo sapiens)

Thank you all so much for your help and advice!

[I also want to report an odd behavior that may be specific to our server: when the number of scaffolds being annotated using maker drops below the number of cores (e.g. using openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno's paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]

Best wishes,
Annabel Beichman
Wayne Lab/Lohmueller Lab
Ecology & Evolutionary Biology
UCLA
Annabelbeichman.com

From carsonhh at gmail.com Mon Oct 17 14:11:52 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 14:11:52 -0600
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <58052fc8a2cc1400014626fe@polymail.io>
References: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com> <58052fc8a2cc1400014626fe@polymail.io>
Message-ID:

MAKER should automatically try and salvage things on restart (that is the purpose of the checkpoint files).
You can set clean_try=1 if you want. It will then delete failed contigs before retrying on any failure.

--Carson

> On Oct 17, 2016, at 2:09 PM, Mark Ebbert wrote:
>
> Thanks Carson,
>
> I've been restarting it using the same commands several times in a row. Unless that "find" command has the potential to modify any important files, I don't think I modified anything. All I ran was:
>
> "find . -name *.NFSLock* -exec rm {} \;"
> "sbatch maker.slurm"
>
> I'm inclined to nuke it all and start over. Is it possible to salvage previous work, or is it all gone?
>
> Mark T. W. Ebbert
> Please note my new email address: mark.ebbert at gmail.com

From mark.ebbert at gmail.com Mon Oct 17 14:09:52 2016
From: mark.ebbert at gmail.com (Mark Ebbert)
Date: Mon, 17 Oct 2016 13:09:52 -0700
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>
References: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>
Message-ID: <58052fc8a2cc1400014626fe@polymail.io>

Thanks Carson,

I've been restarting it using the same commands several times in a row. Unless that "find" command has the potential to modify any important files, I don't think I modified anything. All I ran was:

"find . -name *.NFSLock* -exec rm {} \;"
"sbatch maker.slurm"

I'm inclined to nuke it all and start over. Is it possible to salvage previous work, or is it all gone?

Mark T. W. Ebbert
Please note my new email address: mark.ebbert at gmail.com

From carsonhh at gmail.com Mon Oct 17 14:25:32 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 14:25:32 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To:
References:
Message-ID: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>

Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs, or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Next, pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop that predictor (sometimes a predictor does not work well on a particular genome for one reason or another; SNAP tends to have issues with mammalian genomes, for example). Also, when looking at the contigs, if you see a contig consisting of only single-exon genes then you may have some prokaryotic contamination (prokaryotic sequences assemble as independent gene-dense contigs, so this is a good thing to look at if gene counts are high). Finally, high gene counts can mean that repeats are still under-masked (repeats encode real proteins like transposases).

You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain).
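
The InterProScan check suggested above can be sketched as follows. This assumes InterProScan was run with tab-separated (TSV) output, where column 1 is the protein accession and column 4 is the analysis name (for example Pfam); the function name `domain_fraction`, its arguments and the demo rows are hypothetical and only illustrate the counting.

```python
def domain_fraction(protein_ids, interproscan_rows, analysis=None):
    """Fraction of protein_ids with at least one InterProScan hit.

    If `analysis` is given (e.g. "Pfam"), only rows from that member
    database count as a hit.
    """
    wanted = set(protein_ids)
    if not wanted:
        return 0.0
    with_domain = set()
    for row in interproscan_rows:
        columns = row.rstrip("\n").split("\t")
        if len(columns) < 4:
            continue  # blank or malformed row
        accession, source = columns[0], columns[3]
        if accession in wanted and (analysis is None or source == analysis):
            with_domain.add(accession)
    return len(with_domain) / len(wanted)


# Hypothetical demo data: four annotated proteins, hits for two of them.
demo_ids = ["p1", "p2", "p3", "p4"]
demo_rows = [
    "p1\tmd5a\t300\tPfam\tPF00001\tdesc\t10\t200\t1e-20\tT\t17-10-2016\n",
    "p1\tmd5a\t300\tSMART\tSM00001\tdesc\t10\t200\t1e-5\tT\t17-10-2016\n",
    "p3\tmd5c\t150\tSMART\tSM00002\tdesc\t5\t90\t1e-5\tT\t17-10-2016\n",
]
print(domain_fraction(demo_ids, demo_rows))
print(domain_fraction(demo_ids, demo_rows, analysis="Pfam"))
```

The list of protein IDs would normally come from the headers of the protein FASTA that was given to InterProScan, so that models with no hit at all still count in the denominator.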
?Carson > On Oct 17, 2016, at 1:20 PM, Annabel Beichman wrote: > > Hi Carson et al., > > Thanks so much for such a great pipeline, tutorials and advice pages. > > I have just finished four rounds of annotation in Maker on the sea otter genome which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding. > > Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat. > > Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb. > > Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb. > > I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after round II, 25847 after round III, and 25031 after round IV). > > However, this final gene count (25,031) still seems to high too me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes; 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes. > > 22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter?s closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00 ? when I visualize them in Jbrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? 
I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold). > > I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have Blast hits to SwissProt? Would you recommend any additional filters on gene length, etc.? > > > Not sure if this is significant, but one thing I?ve noticed is that many of the genes with Blast hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold: > ScbS9RH_101185 30796 38760 + ELUT_00004195-RA ELUT_00004195 Name=ELUT_00004195-RA 0.08 0.17 Similar to ANO3: Anoctamin-3 (Homo sapiens) > ScbS9RH_101185 42617 51087 + ELUT_00004196-RA ELUT_00004196 Name=ELUT_00004196-RA 0.25 0.26 Similar to ANO3: Anoctamin-3 (Homo sapiens) > ScbS9RH_101185 87006 87827 + ELUT_00004198-RA ELUT_00004198 Name=ELUT_00004198-RA 0.18 0.18 Similar to Ano3: Anoctamin-3 (Mus musculus) > ScbS9RH_101185 110043 122523 + ELUT_00004199-RA ELUT_00004199 Name=ELUT_00004199-RA 0.09 0.09 Similar to ANO3: Anoctamin-3 (Homo sapiens) > > Thank you all so much for your help and advice! > > [I also want to report an odd behavior, that may be specific to our server ? when the number of scaffolds being annotated using maker drops below the number of cores (e.g. usning openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. 
I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno?s paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.] > > Best wishes, > Annabel Beichman > Wayne Lab/Lohmueller Lab > Ecology & Evolutionary Biology > UCLA > Annabelbeichman.com > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From annabel.beichman at gmail.com Mon Oct 17 17:13:07 2016 From: annabel.beichman at gmail.com (Annabel Beichman) Date: Mon, 17 Oct 2016 16:13:07 -0700 Subject: [maker-devel] Too many genes? In-Reply-To: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> Message-ID: <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP as it definitely doesn?t show great concordance compared to Augustus. Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in repeatmodeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta My interproscan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I?m on the right track. Thanks so much again, ~ Annabel > On Oct 17, 2016, at 1:25 PM, Carson Holt wrote: > > Better training and repeat masking will result in fewer false positive gene calls. 
Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another - SNAP tends to have issues with mammalian genomes for example). Also when looking at the contig, if you see contig consisting of only single exon genes then you may have some prokaryotic contamination (they assemble as independent gene dense contigs - so a good thing to look at if gene counts are high). Finally high gene counts can mean that repeats are still under masked (repeats encode real proteins like transposases). > > You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain). > > ?Carson > > > >> On Oct 17, 2016, at 1:20 PM, Annabel Beichman wrote: >> >> Hi Carson et al., >> >> Thanks so much for such a great pipeline, tutorials and advice pages. >> >> I have just finished four rounds of annotation in Maker on the sea otter genome which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding. >> >> Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat. >> >> Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb. 
>> >> Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb. >> >> I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after round II, 25847 after round III, and 25031 after round IV). >> >> However, this final gene count (25,031) still seems to high too me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes; 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes. >> >> 22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter?s closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00 ? when I visualize them in Jbrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold). >> >> I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. 
But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have Blast hits to SwissProt? Would you recommend any additional filters on gene length, etc.? >> >> >> Not sure if this is significant, but one thing I?ve noticed is that many of the genes with Blast hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold: >> ScbS9RH_101185 30796 38760 + ELUT_00004195-RA ELUT_00004195 Name=ELUT_00004195-RA 0.08 0.17 Similar to ANO3: Anoctamin-3 (Homo sapiens) >> ScbS9RH_101185 42617 51087 + ELUT_00004196-RA ELUT_00004196 Name=ELUT_00004196-RA 0.25 0.26 Similar to ANO3: Anoctamin-3 (Homo sapiens) >> ScbS9RH_101185 87006 87827 + ELUT_00004198-RA ELUT_00004198 Name=ELUT_00004198-RA 0.18 0.18 Similar to Ano3: Anoctamin-3 (Mus musculus) >> ScbS9RH_101185 110043 122523 + ELUT_00004199-RA ELUT_00004199 Name=ELUT_00004199-RA 0.09 0.09 Similar to ANO3: Anoctamin-3 (Homo sapiens) >> >> Thank you all so much for your help and advice! >> >> [I also want to report an odd behavior, that may be specific to our server ? when the number of scaffolds being annotated using maker drops below the number of cores (e.g. usning openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno?s paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.] 
>> >> Best wishes, >> Annabel Beichman >> Wayne Lab/Lohmueller Lab >> Ecology & Evolutionary Biology >> UCLA >> Annabelbeichman.com >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > From carsonhh at gmail.com Mon Oct 17 18:09:52 2016 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 17 Oct 2016 18:09:52 -0600 Subject: [maker-devel] Too many genes? In-Reply-To: <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> Message-ID: <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want. ?Carson > On Oct 17, 2016, at 5:13 PM, Annabel Beichman wrote: > > Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP as it definitely doesn?t show great concordance compared to Augustus. > > Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in repeatmodeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta > > My interproscan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I?m on the right track. > > Thanks so much again, > > ~ Annabel > > >> On Oct 17, 2016, at 1:25 PM, Carson Holt wrote: >> >> Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). 
Fragmented genes tend to lack start or stop codons. Finally pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another - SNAP tends to have issues with mammalian genomes for example). Also when looking at the contig, if you see contig consisting of only single exon genes then you may have some prokaryotic contamination (they assemble as independent gene dense contigs - so a good thing to look at if gene counts are high). Finally high gene counts can mean that repeats are still under masked (repeats encode real proteins like transposases). >> >> You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain). >> >> ?Carson >> >> >> >>> On Oct 17, 2016, at 1:20 PM, Annabel Beichman wrote: >>> >>> Hi Carson et al., >>> >>> Thanks so much for such a great pipeline, tutorials and advice pages. >>> >>> I have just finished four rounds of annotation in Maker on the sea otter genome which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding. >>> >>> Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat. >>> >>> Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb. >>> >>> Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb. 
>>> >>> I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after round II, 25847 after round III, and 25031 after round IV). >>> >>> However, this final gene count (25,031) still seems to high too me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes; 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes. >>> >>> 22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter?s closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00 ? when I visualize them in Jbrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold). >>> >>> I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have Blast hits to SwissProt? Would you recommend any additional filters on gene length, etc.? 
>>> >>> >>> Not sure if this is significant, but one thing I?ve noticed is that many of the genes with Blast hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold: >>> ScbS9RH_101185 30796 38760 + ELUT_00004195-RA ELUT_00004195 Name=ELUT_00004195-RA 0.08 0.17 Similar to ANO3: Anoctamin-3 (Homo sapiens) >>> ScbS9RH_101185 42617 51087 + ELUT_00004196-RA ELUT_00004196 Name=ELUT_00004196-RA 0.25 0.26 Similar to ANO3: Anoctamin-3 (Homo sapiens) >>> ScbS9RH_101185 87006 87827 + ELUT_00004198-RA ELUT_00004198 Name=ELUT_00004198-RA 0.18 0.18 Similar to Ano3: Anoctamin-3 (Mus musculus) >>> ScbS9RH_101185 110043 122523 + ELUT_00004199-RA ELUT_00004199 Name=ELUT_00004199-RA 0.09 0.09 Similar to ANO3: Anoctamin-3 (Homo sapiens) >>> >>> Thank you all so much for your help and advice! >>> >>> [I also want to report an odd behavior, that may be specific to our server ? when the number of scaffolds being annotated using maker drops below the number of cores (e.g. usning openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno?s paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.] >>> >>> Best wishes, >>> Annabel Beichman >>> Wayne Lab/Lohmueller Lab >>> Ecology & Evolutionary Biology >>> UCLA >>> Annabelbeichman.com >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> > From carsonhh at gmail.com Sun Oct 23 17:25:34 2016 From: carsonhh at gmail.com (Carson Holt) Date: Sun, 23 Oct 2016 17:25:34 -0600 Subject: [maker-devel] Combining and merging two Maker annotation gff files ? 
In-Reply-To: <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

It's unfortunate the archived GMOD post is gone, because I always used it for my own reference. If I remember right, the main point was that Jason Stajich wrote a tool to convert SNAP's ZFF format to a GenBank format suitable for Augustus training. This meant you could use the maker2zff script that comes with MAKER, then use Jason's tool to convert the result for Augustus training.

Tool to convert SNAP training ZFF to an Augustus training input file -->
https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Since the post is gone, you could use the documentation provided with his tool, and then maybe a generic Augustus training guide like the following, to design a path forward -->
http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html

--Carson

> On Oct 12, 2016, at 3:44 AM, chebbi mohamed amine wrote:
>
> Thank you Carson for your quick response. Sorry, I have another question concerning Augustus training. You previously posted on the mailing list a link to an explanation of the Augustus training steps: http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html. Unfortunately the link doesn't work anymore. Otherwise, could you explain how to filter the GFF file produced by the first run of MAKER to get the best full-length ORFs as a set of gene models to train Augustus?
>
> Best,
> Amine
>
> From: "Carson Holt"
> To: "Mohamed Amine CHEBBI"
> Cc: maker-devel at yandell-lab.org
> Sent: Tuesday, 11 October 2016 22:05:50
> Subject: Re: [maker-devel] Combining and merging two Maker annotation gff files ?
>
> Masking doesn't just affect the gene models, but also evidence alignment and thus scoring. So merging in this way would not make much sense, as the second, less masked set would always score better because it has more evidence alignments permitted by the lack of masking (not necessarily real, but drawn in by repeats).
>
> The result would be that any attempt at a merge would almost exclusively result in genes from the second set scoring higher.
>
> --Carson

From xvazquezc at gmail.com Sun Oct 23 17:49:53 2016
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 24 Oct 2016 10:49:53 +1100
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

If it's of any help, I had these notes in my old protocol (before I started to do the training with BUSCO):

> For Augustus, we need the script "zff2augustus_gbk.pl". This will take the export.dna generated by fathom and generate a *.gb file that will be used as the "training gene structure file" in a new training submission in WebAugustus. Remember to give it a new name in the submission, e.g. MYGENOME_v2, or MAKER won't see the difference (same name):
> perl PATH/TO/SCRIPT/zff2augustus_gbk.pl > MYGENOME.train.gb

As said, you could also do the training with BUSCO with the --long option.
It has a dataset specific for arthropods. But if you have EST data, you'll probably do better with the other method, as it allows you to enter the ESTs for a more accurate training.

--
Xabier Vázquez-Campos, PhD
Research Associate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From jill711021 at gmail.com Sun Oct 23 21:32:38 2016
From: jill711021 at gmail.com (王一凡)
Date: Mon, 24 Oct 2016 11:32:38 +0800
Subject: [maker-devel] maker -error
Message-ID:

Dear sir,

I am trying to run GeneMark-ES and MAKER to annotate a fungal genome. When I use gm_es.pl, the script terminates with an error with the following description:

> Must input more than one data point! at /home/myname/Applications/GeneMarkES/parse_ET.pl line 213.
> Invalid regression data
> error on call: /home/myname/Applications/GeneMarkES/parse_ET.pl --section ET_C --cfg /home/myname/projectX/Maker/GeneMark/run.cfg --v

After searching and asking around, I still have no idea how to deal with it. Do you have any idea? Thank you for your time!

From carsonhh at gmail.com Mon Oct 24 16:41:04 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 24 Oct 2016 16:41:04 -0600
Subject: [maker-devel] maker -error
In-Reply-To:
References:
Message-ID: <65B4147C-B28C-40EB-9004-F93D821AF1C7@gmail.com>

That is a GeneMark internal error. I'd recommend running GeneMark by itself (outside of MAKER) on whatever contig it failed on; then, if the error reproduces, you can post the error and the test dataset to the GeneMark developers.

--Carson

From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 26 02:32:52 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (chebbi mohamed amine)
Date: Wed, 26 Oct 2016 10:32:52 +0200 (CEST)
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID: <1581157450.4030281.1477470772694.JavaMail.zimbra@univ-poitiers.fr>

Thank you very much for your help.

Best,
Mohamed

From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 26 07:09:33 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine Chebbi)
Date: Wed, 26 Oct 2016 15:09:33 +0200 (CEST)
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
Message-ID: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>

Hi!

I have tried three rounds of annotation in MAKER on a non-model arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and Illumina reads. As suggested in the tutorial, in the first round I ran MAKER with repeat masking to generate gene models using transcript (Trinity assembly) and protein (SwissProt) evidence.
Then the MAKER models were used twice in a bootstrap fashion to retrain SNAP. The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.

However, the AED profile (attached) doesn't seem to be satisfactory. So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that filtering for good transcripts could improve the results? If yes, which criteria should be taken into account?

Thank you.

Best,
Amine

-------------- next part --------------
A non-text attachment was scrubbed...
Name: AED-Graph.pdf
Type: application/pdf
Size: 5327 bytes
Desc: not available

From michael.s.campbell1 at gmail.com Wed Oct 26 12:00:08 2016
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Wed, 26 Oct 2016 14:00:08 -0400
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

Hi Amine,

I haven't seen that pattern in a CDF plot of AED before. Is there a possibility that the x and y axes are switched in the plot?

Thanks,
Mike
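For anyone who wants to check the orientation of such a plot, here is a minimal sketch of how the cumulative AED fractions can be tabulated directly from a MAKER GFF3 file. It assumes that each mRNA feature carries an `_AED=` attribute in column 9, as MAKER writes it; the function names are made up for illustration and are not part of MAKER. In a correctly oriented plot, AED is on the x axis and the cumulative fraction of models is on the y axis, so the curve never decreases from left to right.

```python
import re

# Matches the "_AED=<value>" attribute in GFF3 column 9. It does not
# match the separate "_eAED=" attribute, because there an "e" sits
# between the underscore and "AED".
AED_PATTERN = re.compile(r"_AED=([0-9.]+)")


def read_aed_values(gff_lines):
    """Collect one AED value per mRNA feature from an iterable of GFF3 lines."""
    values = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue
        match = AED_PATTERN.search(fields[8])
        if match:
            values.append(float(match.group(1)))
    return values


def cumulative_fraction(values, thresholds):
    """Return (threshold, fraction of values <= threshold) pairs.

    An empty input gives an empty list rather than dividing by zero.
    """
    if not values:
        return []
    total = len(values)
    return [(t, sum(v <= t for v in values) / total) for t in thresholds]
```

Calling `cumulative_fraction(read_aed_values(open("my_annotation.gff")), [0.25, 0.5, 0.75, 1.0])` (the file name is a placeholder) gives the points of the curve; the fraction at 0.5 is the "share of models with AED under 0.5" figure that is usually quoted.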
From carsonhh at gmail.com Wed Oct 26 12:04:20 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 26 Oct 2016 12:04:20 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
Message-ID: <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>

Your AED curve looks fine. The first run (using protein2genome or est2genome, I assume) will always have a really low overall AED because the models are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always artificially look good). The protein2genome and est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve; the value is set in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further evaluate the models by adding protein domains using InterProScan; you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).

Your overall gene counts are a little high for an arthropod, though (14,000-19,000 genes would be expected, as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa). However, your gene counts can be explained by insufficient repeat masking (you can add a RepeatModeler-generated library to the existing settings to help with this), by poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with stricter assembly parameters, including the jaccard-clip option in Trinity), or simply by assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split, which results in inflated gene counts).

Finally, manually look at the most gene-dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene-dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so it is easily blacklisted and can be identified visually (always gene dense and single exon).

Thanks,
Carson
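The gene-density check described above could be scripted roughly as follows before opening a browser. This is only a sketch under stated assumptions, not part of MAKER: the function name is hypothetical, `contig_lengths` is assumed to be a dict of contig name to length that you build yourself (for example from a FASTA index), and the input is assumed to be standard GFF3 in which exons point to their mRNA through the `Parent` attribute.

```python
from collections import defaultdict


def gene_density_report(gff_lines, contig_lengths):
    """Rank contigs by gene_density = gene_count / contig_length.

    Returns (contig, density, single_exon_fraction) tuples, most
    gene-dense contig first. single_exon_fraction is the share of
    mRNAs on that contig that have exactly one exon.
    """
    genes_per_contig = defaultdict(int)
    mrna_contig = {}                     # mRNA ID -> contig name
    exons_per_mrna = defaultdict(int)    # mRNA ID -> exon count

    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        contig, feature_type = fields[0], fields[2]
        attrs = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        if feature_type == "gene":
            genes_per_contig[contig] += 1
        elif feature_type == "mRNA" and "ID" in attrs:
            mrna_contig[attrs["ID"]] = contig
        elif feature_type == "exon":
            # An exon shared by several isoforms lists all parents.
            for parent in attrs.get("Parent", "").split(","):
                exons_per_mrna[parent] += 1

    mrnas_total = defaultdict(int)
    mrnas_single_exon = defaultdict(int)
    for mrna_id, contig in mrna_contig.items():
        mrnas_total[contig] += 1
        if exons_per_mrna.get(mrna_id, 0) == 1:
            mrnas_single_exon[contig] += 1

    report = []
    for contig, gene_count in genes_per_contig.items():
        density = gene_count / contig_lengths[contig]
        total = mrnas_total[contig]
        single_fraction = mrnas_single_exon[contig] / total if total else 0.0
        report.append((contig, density, single_fraction))

    report.sort(key=lambda row: row[1], reverse=True)
    return report
```

Contigs at the top of the report with a single-exon fraction near 1.0 are the ones worth inspecting first for possible prokaryotic contamination; the report itself does not prove contamination, it only tells you where to look.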
From carsonhh at gmail.com Wed Oct 26 12:06:36 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 26 Oct 2016 12:06:36 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>
Message-ID: <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>

Sorry. I also assumed X and Y were flipped when I looked at it. Now that I have read the labels, your AED curve would be weird unless the X and Y are flipped in your figure.

--Carson

From jason.stajich at gmail.com Wed Oct 26 19:26:26 2016
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 26 Oct 2016 18:26:26 -0700
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

Yes, thanks for re-sharing. Maybe we should write this up into a clearer tutorial. I go back and forth on how to make this easier and automated.

Jason

--
Jason Stajich
jason.stajich at gmail.com

From michael.s.campbell1 at gmail.com Thu Oct 27 07:21:01 2016
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 27 Oct 2016 09:21:01 -0400
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To:
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
 <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>
 <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
Message-ID: <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>

I think that if you train any further you will run the risk of overtraining. Setting alt_splice to 1 will add transcripts but not genes, so the gene count is going to be related to the training of the gene finder. I would recommend looking at a few of your large scaffolds in a genome browser. I would also recommend adding a second gene predictor such as Augustus. When multiple predictors are used and the models they predict converge, you can have more confidence in the gene prediction.

For the masking, you can make a species-specific repeat library like Carson suggested to see if the gene count comes down a little. If you are concerned about masking duplicated genes, you can do a couple of things. You can filter the repeat library based on known proteins. You can also set a copy number minimum for the masking and only include repeats that are present more than 10 times in the genome. Here are a couple of URLs for making species-specific repeat libraries:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Basic

Take care,
Mike

From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 03:54:31 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Thu, 27 Oct 2016 11:54:31 +0200
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
 <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>
 <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
Message-ID:

Sorry, the X and Y axes were switched in the plot by mistake. Please find attached the corrected AED graph.

Round 3 (red curve) shows a slightly higher overall AED than round 2 (green curve) and more genes (22931 compared to 22547 in round 2). Do you think that I should stop at the second round?

I didn't specify in the previous email that the repeat masking was done in Maker using Repbase and only the RepeatModeler models with an identified classification. I left the unknown library from RepeatModeler unmasked. In fact, we expect a high rate of segmental and gene duplication in the genome, which could explain the high overall count of genes found by Maker.

On the other hand, the high number of genes may also be explained by the fact that I activated the alt_splice=1 option to find alternative splicing. Do you think that was a good idea?

Thank you very much for your time.

Best,
Amine

On 26/10/2016 at 20:06, Carson Holt wrote:
> Sorry. I also assumed X and Y were flipped when I looked at it. Now that
> I read the labels, your AED curve would be weird unless the X and Y are
> flipped in your figure.
>
> --Carson
>
>> On Oct 26, 2016, at 12:04 PM, Carson Holt wrote:
>>
>> Your AED curve looks fine. The first run (using protein2genome or
>> est2genome, I assume) will always have really low overall AED because
>> the models are exact copies of the protein/transcript alignments (so AED
>> is meaningless there because it will always artificially look good). The
>> protein2genome or est2genome models also have a hard end-to-end coverage
>> filtering cutoff of 0.5 when generated (apparent in the curve - value in
>> maker_bopts.ctl). The next runs with SNAP show >80% of models with AED
>> under 0.5, so it looks good. You can further look at models by adding
>> protein domains using InterProScan, in which case you would expect
>> 70-80% of models to contain a recognizable InterPro domain (false and
>> bad models will result in very low overall domain content).
>>
>> Your overall gene counts are a little high though for an arthropod
>> (14,000-19,000 genes would be expected, as gene loss rather than gene
>> gain is the primary evolutionary force in the Ecdysozoa). However, your
>> gene counts can be explained by either insufficient repeat masking (you
>> can add a RepeatModeler generated library to the existing settings to
>> help with this), poor mRNA-seq assembly or a lot of noise in the RNA-seq
>> (this can be helped with more strict assembly parameters including the
>> jaccard-clip option in Trinity), or it is just the result of assembly
>> fragmentation (if you have a lot of contigs or runs of NNNN in the
>> assembly, then many genes will be split, which results in inflated gene
>> counts).
>>
>> Finally, manually look at the most gene dense contigs in a browser like
>> Apollo or IGV (gene_density = gene_count / contig_length). If the most
>> gene dense contigs are overwhelmingly single exon, then you may need to
>> filter out some prokaryotic assembly contamination (not uncommon). If
>> you have contamination, it will assemble as independent contigs, so it
>> is easily blacklisted and can be identified visually (always gene dense
>> and single exon).
>>
>> Thanks,
>> Carson
>>
>>> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi wrote:
>>>
>>> Hi!
>>> I have tried three rounds of annotation in Maker on a non-model
>>> arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and
>>> Illumina reads.
>>> As suggested in the tutorial, in the first round I ran Maker with
>>> repeat masking to generate gene models using transcript (Trinity
>>> assembly) and protein (Swiss-Prot) evidence. Then the Maker models were
>>> used twice in a bootstrap fashion to retrain SNAP.
>>> The number of genes drops from 29207 in round 1 to 22547 in round 2,
>>> then increases slightly to 22931 in round 3.
>>>
>>> However, the AED profile (attached) doesn't seem to be satisfactory.
>>> So I wonder if you could suggest a good strategy to improve the
>>> annotation quality. Do you think that filtering good transcripts could
>>> improve results? If yes, which criteria should be taken into account?
>>> Thank you.
>>>
>>> Best;
>>> Amine

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât. B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AED-Graph.pdf
Type: application/pdf
Size: 5301 bytes
Desc: not available
URL:

From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 08:34:02 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Thu, 27 Oct 2016 16:34:02 +0200
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
 <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>
 <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
 <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
Message-ID:

Thank you Michael for your response.

As you suggested, I will use Augustus and SNAP, both trained on the assembled transcripts in a bootstrap fashion.

For the masking, I intend to adapt Carson's strategy:

- Collect the RepeatModeler repeats.lib.
- Search the sequences in Modelerunknown.lib against a transposase database (derived from the RepeatMasker package and Kennedy et al. (2011)) and consider sequences matching transposases as transposons.
- Exclude gene fragments from both the known and unknown repeats.
- As I'm concerned about gene duplications, remove the remaining sequences in the unknown lib that are present fewer than 10 times.

Thank you again for your time, and I remain open to any suggestions.

Best,
Amine

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât. B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/

From carsonhh at gmail.com Thu Oct 27 09:08:15 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 27 Oct 2016 09:08:15 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To:
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
 <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>
 <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
 <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
Message-ID:

I do believe that you are getting a number of false positive genes because of under-masking. So taking a more careful strategy (i.e. using the suggestions given by Michael) should mitigate that. You will have to decide how aggressive to be with the repeat masking (i.e. the sensitivity/specificity balance).

I would, however, turn off alt_splice. It has a very high threshold for how clean and complete the mRNA alignments and repeat masking have to be in order to function correctly (which is why the default is off). So given the filtering being done to pull back on repeat masking, it likely does not meet that threshold. It won't really produce more genes, but you will get many spurious alternate transcripts.

Also, for the gene count, make sure not to count from the FASTA; that is the transcript count. You have to count the "gene" feature lines in the GFF3 to get the gene count, i.e. ->

    grep -P -c "\tgene\t" models.gff

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
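
The two checks discussed in this thread, counting "gene" feature lines rather than transcripts and ranking contigs by gene_density = gene_count / contig_length, can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `gene_stats`, the `contig_lengths` dictionary (contig ID to length in bp, e.g. read from a FASTA index) and the example GFF3 lines are hypothetical and not part of MAKER.

```python
from collections import Counter


def gene_stats(gff_lines, contig_lengths):
    """Count 'gene' features per contig and derive gene density.

    gff_lines: iterable of GFF3 lines (strings).
    contig_lengths: dict of contig ID -> length in bp (hypothetical input).
    Returns (total_gene_count, {contig: genes_per_bp}).
    """
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no feature lines after this
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        # Column 3 is the feature type; only 'gene' rows are counted,
        # mirroring: grep -P -c "\tgene\t" models.gff
        if len(fields) >= 3 and fields[2] == "gene":
            counts[fields[0]] += 1
    density = {
        contig: counts[contig] / contig_lengths[contig]
        for contig in counts
        if contig in contig_lengths
    }
    return sum(counts.values()), density


# Illustrative input: two genes, one of which has two transcripts.
example = [
    "##gff-version 3",
    "ctg1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "ctg1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1",
    "ctg1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-2;Parent=g1",
    "ctg2\tmaker\tgene\t50\t450\t.\t-\t.\tID=g2",
]
total, density = gene_stats(example, {"ctg1": 10000, "ctg2": 2000})
print("genes:", total)
for contig, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(contig, value)
```

In the example the mRNA rows are ignored, so alternative transcripts do not inflate the gene count, and the contigs are printed from most to least gene dense, which is the order in which one might inspect them in a browser for single-exon, possibly contaminant, contigs.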
From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 09:22:08 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Thu, 27 Oct 2016 17:22:08 +0200
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To:
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
Message-ID: <69dcf9e0-b736-3f79-082d-1ec2d6d04467@univ-poitiers.fr>

Indeed, the gene count was obtained with the command grep -P -c "\tgene\t" models.gff.
I will be careful about repeats. However, I'm not convinced by the step in the strategy that searches the sequences in Modelerunknown.lib against a transposase database, since RepeatModeler has already compared them against RepBase. So I think I will skip this step.
One last question: how do I create a protein database that excludes the transposases?
Thank you again.

Best,
Amine

On 27/10/2016 at 17:08, Carson Holt wrote:
> not to cou

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât. B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/

From scott at scottcain.net Fri Oct 28 14:57:07 2016
From: scott at scottcain.net (Scott Cain)
Date: Fri, 28 Oct 2016 16:57:07 -0400
Subject: [maker-devel] Call for GMOD talks at PAG
Message-ID:

Hi,

I am pleased to announce a call for talks to be given at the Plant and Animal Genomes conference this January in the GMOD workshop on Wednesday, January 18th. Any talks that involve the development or use of GMOD software are welcome.
In particular this year, I'd really like to highlight plugins for the various GMOD software packages that support them, like JBrowse, Galaxy and Tripal (of course, Galaxy and Tripal have their own sessions, so you should consider submitting to them too).

Please get an abstract, brief summary or a vague title to me as soon as possible so I can start putting the workshop together. Also, if you'd like to be a co-organizer, please drop me a line about that too. I might be able to get you some meeting-related niceties for not very much work.

For more information about PAG, see: http://www.intlpag.org

Thanks and I look forward to seeing you in January,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research

From annabel.beichman at gmail.com Fri Oct 28 17:11:11 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Fri, 28 Oct 2016 16:11:11 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
Message-ID: <97D8C047-69C2-4379-AF5C-3E6DAAADA51C@gmail.com>

Re-sending this to the list without attachments, as they were too large.

Cheers,
Annabel

> On Oct 28, 2016, at 4:04 PM, Annabel Beichman wrote:
>
> Hi Carson,
> Re-running Maker without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you!
>
> However, I'm still seeing an odd pattern that I wonder if you have any ideas about:
>
> For the set of ~6000 genes that do not have orthologs in the ferret, but do have start/stop codons and are below AED/eAED of 0.5, I am seeing duplication of BLAST annotations for ~2,600 of the gene models, particularly gene models that are in a row on a scaffold. I've thrown the genes with duplicate BLAST annotations into the attached Excel file so you can see the patterns I'm describing.
> For example, there is a similar annotation for two genes in a row on a scaffold, both of which have low AED/eAED scores and start/stop codons (also visualized in the attached JBrowse screenshot):
>
> Scaffold       Start  Stop    Strand  GeneID         mRNALength  #Exons  BlastInfo
> ScbS9RH_82700  41318  49503   +       ELUT_00017706  8185        3       Similar to Cdh13: Cadherin-13 (Mus musculus)
> ScbS9RH_82700  99358  103910  +       ELUT_00017707  4552        3       Similar to Cdh13: Cadherin-13 (Mus musculus)
>
> I am trying to filter out false positive gene models as I make my exome capture design, so I wondered if you had any tips on what might be going on here. Paralogs? Artifacts of the assembly? Is the gene with the most exons likely to be the original gene? Should I filter sets of duplicates by those that have IPR domains?
>
> Secondly, I also notice that 250 of these repeated genes are annotated as 40S or 60S ribosomal protein genes. Do you expect to see this many (I know there are usually many rDNA genes), or could this number be inflated due to ribosomal RNA in the RNA-seq reads? (I carried out poly-A selection prior to sequencing.)
>
> Thanks so much again for your help!
>
> ~ Annabel
>
>> On Oct 17, 2016, at 5:09 PM, Carson Holt wrote:
>>
>> It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want.
>>
>> --Carson
>>
>>
>>
>>> On Oct 17, 2016, at 5:13 PM, Annabel Beichman wrote:
>>>
>>> Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP as it definitely doesn't show great concordance compared to Augustus.
>>>
>>> Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in RepeatModeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta
>>>
>>> My InterProScan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I'm on the right track.
>>>
>>> Thanks so much again,
>>>
>>> ~ Annabel
>>>
>>>
>>>> On Oct 17, 2016, at 1:25 PM, Carson Holt wrote:
>>>>
>>>> Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally, pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another - SNAP tends to have issues with mammalian genomes, for example). Also, when looking at the contigs, if you see contigs consisting of only single exon genes then you may have some prokaryotic contamination (they assemble as independent gene dense contigs - so a good thing to look at if gene counts are high). Finally, high gene counts can mean that repeats are still under masked (repeats encode real proteins like transposases).
>>>>
>>>> You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain).
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>>> On Oct 17, 2016, at 1:20 PM, Annabel Beichman wrote:
>>>>>
>>>>> Hi Carson et al.,
>>>>>
>>>>> Thanks so much for such a great pipeline, tutorials and advice pages.
>>>>>
>>>>> I have just finished four rounds of annotation in Maker on the sea otter genome, which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding.
>>>>>
>>>>> Rounds I & II: In the first two rounds, I trained Augustus and SNAP on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat.
>>>>>
>>>>> Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.
>>>>>
>>>>> Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run Maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb.
>>>>>
>>>>> I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after Round II, 25847 after Round III, and 25031 after Round IV).
>>>>>
>>>>> However, this final gene count (25,031) still seems too high to me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes, 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes.
>>>>>
>>>>> 22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e-value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter's closest relative (e-value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00; when I visualize them in JBrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold).
>>>>>
>>>>> I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have BLAST hits to SwissProt. Would you recommend any additional filters on gene length, etc.?
>>>>>
>>>>>
>>>>> Not sure if this is significant, but one thing I've noticed is that many of the genes with BLAST hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold:
>>>>> ScbS9RH_101185  30796   38760   +  ELUT_00004195-RA  ELUT_00004195  Name=ELUT_00004195-RA  0.08  0.17  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>> ScbS9RH_101185  42617   51087   +  ELUT_00004196-RA  ELUT_00004196  Name=ELUT_00004196-RA  0.25  0.26  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>> ScbS9RH_101185  87006   87827   +  ELUT_00004198-RA  ELUT_00004198  Name=ELUT_00004198-RA  0.18  0.18  Similar to Ano3: Anoctamin-3 (Mus musculus)
>>>>> ScbS9RH_101185  110043  122523  +  ELUT_00004199-RA  ELUT_00004199  Name=ELUT_00004199-RA  0.09  0.09  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>>
>>>>> Thank you all so much for your help and advice!
>>>>>
>>>>> [I also want to report an odd behavior that may be specific to our server: when the number of scaffolds being annotated using Maker drops below the number of cores (e.g. using OpenMPI with 45 cores available, but there are only 44 scaffolds left), Maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno's paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]
>>>>>
>>>>> Best wishes,
>>>>> Annabel Beichman
>>>>> Wayne Lab/Lohmueller Lab
>>>>> Ecology & Evolutionary Biology
>>>>> UCLA
>>>>> Annabelbeichman.com
>>>>
>>>
>>
>

From carsonhh at gmail.com Fri Oct 28 17:23:00 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:23:00 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
Message-ID: <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com>

You need to look at some of the contigs in a browser. Look at the most gene dense ones first (density = gene_count/contig_length). You may have prokaryotic contamination if you are seeing a lot of contigs containing primarily single exon gene models. Also make sure you still left model_org=all on after adding the species specific library (the species specific library is to supplement RepBase as opposed to replace it).

Some locations where you are seeing neighboring genes with similar blast hits (Cadherin) may in fact be one gene that was split, either because evidence insufficiently clusters (perhaps the max intron size is set too low in the control files), or perhaps the assembly has runs of NNNN that do not permit the gene predictor to create a spanning model (not uncommon). If you are using Apollo to view the genes you can zoom in around evidence alignments until you see the sequence, and often you will see clusters of NNNN in the sequence around evidence HSP breakpoints.

--Carson

From carsonhh at gmail.com Fri Oct 28 17:27:59 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:27:59 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com>
Message-ID: <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>

Also if you labeled putative function using BLAST results, make sure you set the expect value sufficiently low to filter out false homology. Otherwise you will be labeling off the best hit, which may in fact have a very poor score, but because it's the best one. The threshold value should never be higher than 1e-6. You can go all the way down to 1e-10 if necessary.

--Carson

From annabel.beichman at gmail.com Fri Oct 28 17:36:03 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Fri, 28 Oct 2016 16:36:03 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>
Message-ID: <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>

Thank you so much, Carson, for such a rapid reply! I have checked the prokaryotic issue and it looks okay: my most gene-dense contigs all have multi-exon genes. I will re-blast with a more stringent cutoff as well.

I think your theory about the NNNNNNs might be spot on. The assembly is by Dovetail Genomics and they insert many NNNNNs as they join contigs together into the long scaffolds, which would disrupt the gene models. Is there any way to salvage the genes that are split around the NNNNs? Or should I just leave them out of my analyses?
Thanks again,
~ Annabel

> On Oct 28, 2016, at 4:27 PM, Carson Holt wrote:
>
> Also, if you labeled putative function using BLAST results, make sure you set the expect value sufficiently low to filter out false homology. Otherwise you will be labeling off the best hit, which may in fact have a very poor score and is used only because it's the best one. The threshold value should never be higher than 1e-6. You can go all the way down to 1e-10 if necessary.
>
> --Carson
>
>> On Oct 28, 2016, at 5:23 PM, Carson Holt wrote:
>>
>> You need to look at some of the contigs in a browser. Look at the most gene-dense ones first (density = gene_count/contig_length). You may have prokaryotic contamination if you are seeing a lot of contigs containing primarily single-exon gene models. Also make sure you left model_org=all on after adding the species-specific library (the species-specific library is there to supplement RepBase, not to replace it).
>>
>> Some locations where you are seeing neighboring genes with similar BLAST hits (Cadherin) may in fact be one gene that was split, either because the evidence clusters insufficiently (perhaps the max intron size is set too low in the control files), or because the assembly has runs of NNNN that do not permit the gene predictor to create a spanning model (not uncommon). If you are using Apollo to view the genes, you can zoom in around evidence alignments until you see the sequence, and often you will see clusters of NNNN in the sequence around evidence HSP breakpoints.
>>
>> --Carson
>>
>>> On Oct 28, 2016, at 5:04 PM, Annabel Beichman wrote:
>>>
>>> Hi Carson,
>>> Re-running MAKER without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you!
>>>
>>> However, I'm still seeing an odd pattern that I wonder if you have any ideas about:
>>>
>>> For the set of ~6,000 genes that do not have orthologs in the ferret, but do have start/stop codons and are below an AED/eAED of 0.5, I am seeing duplication of BLAST annotations for ~2,600 of the gene models, particularly gene models that are in a row on a scaffold. I've put the genes with duplicate BLAST annotations into the attached Excel file so you can see the patterns I'm describing.
>>> For example, there is a similar annotation for two genes in a row on a scaffold, both of which have low AED/eAED scores and start/stop codons (also visualized in the attached JBrowse screenshot):
>>>
>>> Scaffold Start Stop Strand GeneID mRNALength #Exons BlastInfo
>>> ScbS9RH_82700 41318 49503 + ELUT_00017706 8185 3 Similar to Cdh13: Cadherin-13 (Mus musculus)
>>> ScbS9RH_82700 99358 103910 + ELUT_00017707 4552 3 Similar to Cdh13: Cadherin-13 (Mus musculus)
>>>
>>> I am trying to filter out false-positive gene models as I make my exome capture design, so I wondered if you had any tips on what might be going on here. Paralogs? Artifacts of the assembly? Is the gene with the most exons likely to be the original gene? Should I filter sets of duplicates by those that have IPR domains?
>>>
>>> Secondly, I also notice that 250 of these repeat genes are annotated as 40S or 60S ribosomal protein genes. Do you expect to see this many (I know there are usually many rDNA genes), or could this number be inflated due to ribosomal RNA in the RNA-seq reads? (I carried out poly-A selection prior to sequencing.)
>>>
>>> Thanks so much again for your help!
>>>
>>> ~ Annabel
>>>
>>>> On Oct 17, 2016, at 5:09 PM, Carson Holt wrote:
>>>>
>>>> It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want.
>>>>
>>>> --Carson
>>>>
>>>>> On Oct 17, 2016, at 5:13 PM, Annabel Beichman wrote:
>>>>>
>>>>> Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP, as it definitely doesn't show great concordance compared to Augustus.
>>>>>
>>>>> Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in RepeatModeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta
>>>>>
>>>>> My InterProScan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I'm on the right track.
>>>>>
>>>>> Thanks so much again,
>>>>>
>>>>> ~ Annabel
>>>>>
>>>>>> On Oct 17, 2016, at 1:25 PM, Carson Holt wrote:
>>>>>>
>>>>>> Better training and repeat masking will result in fewer false-positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs, or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally, pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another; SNAP tends to have issues with mammalian genomes, for example). Also, when looking at the contigs, if you see contigs consisting of only single-exon genes, then you may have some prokaryotic contamination (they assemble as independent gene-dense contigs, so this is a good thing to look at if gene counts are high).
>>>>>> Finally, high gene counts can mean that repeats are still under-masked (repeats encode real proteins like transposases).
>>>>>>
>>>>>> You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain).
>>>>>>
>>>>>> --Carson

From carsonhh at gmail.com Fri Oct 28 17:49:27 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:49:27 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com> <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>
Message-ID: <07C987F9-1354-4DB6-A63F-9B23F2006871@gmail.com>

Runs of NNNN preclude both alignment and prediction, so unless they occur in an intron, the result is a split model (many times a run of NNN may be just a few base pairs long, but if it occurs in an exon, you can't really work around it). The predictors work off of a maximum score, so the ab initio predictor ends up finding some way of terminating the model around the NNNs that scores well even though it does not reflect the biology. Sometimes you can try to force things in manually (non-canonical splice sites etc.) if it is an important gene (Web-Apollo even allows you to insert SNPs and indels to correct the ORF, but it's a labor-intensive manual process).

So, short answer: you should check in a browser whether you are seeing these. If you do have them, then you will have to decide how to handle them depending on the analysis (perhaps take the longer one?). Take some time just viewing alignments and models to get a feel for how evidence and gene models should correlate. There really is no substitute for visual manual review.

--Carson
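
The point above about runs of N splitting gene models can be checked roughly before opening a browser. The following is a minimal Python sketch, not part of MAKER: the function names `find_n_runs` and `gaps_between`, the toy scaffold, and the gene coordinates are all hypothetical and only illustrate the idea of asking whether an assembly gap sits between two neighboring models that share a BLAST hit.

```python
import re

def find_n_runs(seq, min_len=1):
    """Return (start, end) 1-based inclusive coordinates of each run of N/n."""
    return [(m.start() + 1, m.end())
            for m in re.finditer(r"[Nn]+", seq)
            if m.end() - m.start() >= min_len]

def gaps_between(n_runs, left_end, right_start):
    """Return the N runs lying strictly between two neighboring gene models."""
    return [(s, e) for s, e in n_runs if s > left_end and e < right_start]

# Toy scaffold (hypothetical): two "genes" at 1-10 and 21-30,
# separated by a stretch that contains a 5 bp run of Ns.
scaffold = "ACGTACGTAC" + "GGGNNNNNCC" + "TTGACCATGA"
runs = find_n_runs(scaffold)
print(runs)                        # [(14, 18)]
print(gaps_between(runs, 10, 21))  # [(14, 18)]
```

A non-empty result for a pair of adjacent models with the same annotation would only be a hint that they may be one split gene; as the message above says, the alignments still need to be reviewed by eye.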
From jacques.dainat at bils.se Mon Oct 31 04:51:29 2016
From: jacques.dainat at bils.se (Jacques Dainat)
Date: Mon, 31 Oct 2016 11:51:29 +0100
Subject: [maker-devel] est_gff input does not provide any gene model
Message-ID:

Hello,

I usually use Cufflinks output to feed MAKER through the est_gff parameter; combined with the est2genome=1 parameter, I get the output I want.
This time I used StringTie output to feed MAKER, but no gene model is predicted with the est2genome parameter.

Any explanation? Is it due to the GFF3 format differences between these two files?

Cufflinks output example:
Pnalgiovense_4592 Cufflinks match 363 977 17.844829 - . ID=1:s3_c1_r1.4.2;Name=1:s3_c1_r1.4.2;
Pnalgiovense_4592 Cufflinks match_part 363 666 17.844829 - . ID=1:s3_c1_r1.4.2:exon-1;Name=1:s3_c1_r1.4.2;Parent=1:s3_c1_r1.4.2;Target=1:s3_c1_r1.4.2 1 304 +;
Pnalgiovense_4592 Cufflinks match_part 743 977 17.844829 - . ID=1:s3_c1_r1.4.2:exon-2;Name=1:s3_c1_r1.4.2;Parent=1:s3_c1_r1.4.2;Target=1:s3_c1_r1.4.2 305 539 +;

StringTie output example:
Pnalgiovense_112 StringTie gene 20 1256 1000 + . ID=HtMm_All.12253;cov=8.028295;fPKM=1.214491;gene_id=HtMm_All.12253;tPM=2.706611;transcript_id=HtMm_All.12253.1
Pnalgiovense_112 StringTie mRNA 20 1256 1000 + . ID=HtMm_All.12253.1;Parent=HtMm_All.12253;cov=8.028295;fPKM=1.214491;gene_id=HtMm_All.12253;tPM=2.706611;transcript_id=HtMm_All.12253.1
Pnalgiovense_112 StringTie exon 20 1256 1000 + . ID=HtMm_All.12253.1-exon-1;Parent=HtMm_All.12253.1;cov=8.028295;exon_number=1;gene_id=HtMm_All.12253;transcript_id=HtMm_All.12253.1

If it's the StringTie output that is problematic, how can I fix it? Is it enough to remove the gene features and change mRNA to match and exon to match_part?

Best regards,

Jacques Dainat, PhD
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service

Address: (room E10:4204 - last floor)
Uppsala University, BMC
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: 01 84 71 46 25

From carsonhh at gmail.com Mon Oct 31 21:24:03 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Oct 2016 21:24:03 -0600
Subject: [maker-devel] est_gff input does not provide any gene model
In-Reply-To:
References:
Message-ID:

Evidence such as est_gff has to follow the alignment format used by GFF3 (i.e. match/match_part), whereas you are providing gene models (i.e. gene/mRNA/exon/CDS). Note that match/match_part are two-level features, whereas gene models have three levels. You need to reformat to match/match_part.

--Carson
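
The reformatting described above (three-level gene/mRNA/exon to two-level match/match_part) could be sketched in Python as follows. This is an illustration under stated assumptions, not a script shipped with MAKER: the function name `gene_models_to_matches` is hypothetical, the input is assumed to be tab-delimited GFF3 with one Parent per exon, and the sketch does not write the Target attribute that appears in the Cufflinks example, since the thread does not say whether MAKER requires it.

```python
def gene_models_to_matches(gff_lines):
    """Convert gene/mRNA/exon GFF3 lines to two-level match/match_part lines."""
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        # Parse column 9 into a key -> value dictionary.
        attrs = dict(kv.split("=", 1)
                     for kv in cols[8].strip(";").split(";") if "=" in kv)
        if cols[2] in ("mRNA", "transcript"):
            # The transcript becomes the top-level alignment feature.
            cols[2] = "match"
            cols[8] = "ID={0};Name={0}".format(attrs["ID"])
        elif cols[2] == "exon":
            # Each exon becomes a child of that alignment feature.
            cols[2] = "match_part"
            cols[8] = "ID={};Parent={}".format(attrs["ID"], attrs["Parent"])
        else:
            continue  # drop gene, CDS and any other feature types
        out.append("\t".join(cols))
    return out

# The StringTie records from the message above, with attributes shortened.
example = [
    "\t".join(["Pnalgiovense_112", "StringTie", "gene", "20", "1256", "1000",
               "+", ".", "ID=HtMm_All.12253;gene_id=HtMm_All.12253"]),
    "\t".join(["Pnalgiovense_112", "StringTie", "mRNA", "20", "1256", "1000",
               "+", ".", "ID=HtMm_All.12253.1;Parent=HtMm_All.12253"]),
    "\t".join(["Pnalgiovense_112", "StringTie", "exon", "20", "1256", "1000",
               "+", ".", "ID=HtMm_All.12253.1-exon-1;Parent=HtMm_All.12253.1"]),
]
for record in gene_models_to_matches(example):
    print(record)
```

The gene line is dropped, the mRNA line is relabeled as match, and the exon line is relabeled as match_part with its Parent pointing at the former mRNA ID, which mirrors the structure of the Cufflinks example.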
From allisonfuiten at gmail.com Mon Oct 31 18:34:23 2016
From: allisonfuiten at gmail.com (Allison Fuiten)
Date: Mon, 31 Oct 2016 17:34:23 -0700
Subject: [maker-devel] InterProScan protein domain & AED physical evidence filtering
Message-ID:

Hello MAKER google group,

For the final round of a MAKER annotation for a de novo plant genome assembly, I ran MAKER twice: once with keep_preds=0, which annotated 20,284 genes, and once with keep_preds=1, which annotated 34,055 genes. I ran the 34,055 genes (the keep_preds=1 set) through InterProScan to search the MAKER predictions for protein domain content and added this IPRScan output into the MAKER GFF file with the ipr_update_gff accessory script.

The game plan is to go through the 34,055 genes and remove any gene model that has neither protein domain content nor physical evidence. I am counting genes that have AED=1 as the genes that don't have physical evidence.

I have two questions:

1. I count 11,762 genes that have AED=1.0 in the keep_preds=1 annotation set, which leaves me with 22,293 genes that I'm assuming have some physical evidence (34,055-11,762=22,293). But when I ran MAKER with keep_preds=0 originally, I only counted 20,284 genes. What are the extra ~2,000 genes that are annotated in the keep_preds=1 run with an AED score of less than 1.0, but are not annotated in the keep_preds=0 run?

2. My second question is whether there is an accessory script available that will remove genes that lack both IPRScan protein domains and physical evidence (AED < 1). This type of gene removal was mentioned in a previous post from 2012 (https://groups.google.com/forum/#!searchin/maker-devel/sorry$20there$27s$20not$20a$20script$20prepackaged$20with$20MAKER$20for$20that$20yet.%7Csort:relevance/maker-devel/VaoXWlGHOjs/EElr_otrK8QJ), and I was just wondering if someone has since written a script that will do this for me.

If anyone could offer me any feedback, that would be greatly appreciated!

Thank you,
Allison
URL: From mohamed.amine.chebbi at univ-poitiers.fr Mon Oct 10 03:43:21 2016 From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI) Date: Mon, 10 Oct 2016 11:43:21 +0200 Subject: [maker-devel] Combining and merging two Maker annotation gff files ? Message-ID: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> Hi! I?m using the latest version of Maker2 to annotate an arthropod genome. First, I have run RepeatModeler to create rmlib for Maker, then I have followed two independent annotation strategies on the same assembly : 1- Passing throw Maker all the repeats collected by RepeatModeler ( Identified repeats in the Repbase + Unkown Models). 2- Passing throw Maker only the identified repeats. Both annotations work successfully. The first annotation gives me 19048 genes against 22931 done by the second one. Know, I'm seeing for a mean to merge the two annotation gff files without _doing a re-annotation _and by taking the best and non redundant supported gene models . So, do you think that configuring the maker options as below, could resolve this issue : maker_gff=1-mask-all.gff,2-mask-onlyKnown.gff #MAKER derived GFF3 file #MAKER derived GFF3 file est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no pred_pass=1 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no -- Mohamed Amine CHEBBI, PhD Student Universit? de Poitiers -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Oct 11 14:05:50 2016 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 11 Oct 2016 14:05:50 -0600 Subject: [maker-devel] Combining and merging two Maker annotation gff files ? 
In-Reply-To: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
Message-ID:

Masking doesn't just affect the gene models, but also evidence alignment and thus scoring. So merging in this way would not make much sense, as the second, less masked set would always score better because it has more evidence alignments permitted by the lack of masking (not necessarily real, but drawn in by repeats). The result would be that any attempt at a merge would almost exclusively select genes from the second set, because they would always score higher.

--Carson

> On Oct 10, 2016, at 3:43 AM, Mohamed Amine CHEBBI wrote:
>
> Hi!
>
> I'm using the latest version of Maker2 to annotate an arthropod genome. First, I have run RepeatModeler to create rmlib for Maker, then I have followed two independent annotation strategies on the same assembly :
> 1- Passing through Maker all the repeats collected by RepeatModeler ( Identified repeats in the Repbase + Unknown Models).
> 2- Passing through Maker only the identified repeats.
>
> Both annotations work successfully. The first annotation gives me 19048 genes against 22931 done by the second one. Now, I'm looking for a way to merge the two annotation gff files without doing a re-annotation and by taking the best and non redundant supported gene models .
> > So, do you think that configuring the maker options as below, could resolve this issue : > maker_gff=1-mask-all.gff,2-mask-onlyKnown.gff #MAKER derived GFF3 file > #MAKER derived GFF3 file > est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no > altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no > protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no > rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no > model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no > pred_pass=1 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no > other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no > > -- > Mohamed Amine CHEBBI, PhD Student > Universit? de Poitiers > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From aravindp at imcb.a-star.edu.sg Mon Oct 17 00:45:59 2016 From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD) Date: Mon, 17 Oct 2016 06:45:59 +0000 Subject: [maker-devel] Maker MPI installation error and IO error for serial version Message-ID: Hi, I'm trying to install Maker in my cluster account. I have installed all the dependencies. But, there are two issues for which I would like to get a solution. I tried to find it from the forums but helpless. 1. MPI installation "flock: Function not implemented error" at src/lib/Parallel/Application/MPI.pm line 256. ./Build install Configuring MAKER with MPI support flock: Function not implemented at /scratch/tools/maker_mpi/src/lib/Parallel/Application/MPI.pm line 256. 
Parallel::Application::MPI::_bind("/app/openmpi/1.10.3/intel_java/bin/mpicc", "/app/openmpi/1.10.3/intel_java/include", "blib", "") called at /scratch/users/astar/imcb/aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 277 MAKER::Build::ACTION_build(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010 Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "build") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1993 Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0), "build") called at /aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 469 MAKER::Build::ACTION_install(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010 Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "install") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1998 Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0)) called at ./Build line 69 2. When I run a serial version of Maker, I get an error as follow in the "makerlog.e" file. DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 109. DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 111. DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 113. DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 191. DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 390. Please help me with these errors as early as possible. I have double checked for all the dependencies and the file paths given while running Maker. Awaiting your reply! 
Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673:: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http:/www.imcb.a-star.edu.sg/

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.png
Type: image/png
Size: 18239 bytes
Desc: image002.png
URL:

From mark.ebbert at gmail.com Thu Oct 13 15:57:50 2016
From: mark.ebbert at gmail.com (Mark Ebbert)
Date: Thu, 13 Oct 2016 14:57:50 -0700
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
Message-ID: <57fffd715f83340001fcf47d@polymail.io>

Hi,

I've been working with maker for several months off and on with varying success. It worked great the first time I ran it, but ever since, it fails every run without any specific errors. It just says that one of the processes failed. I've been limping along by just running the following command to remove any locks and re-starting: "find . -name *.NFSLock* -exec rm {} \;"

This has been working, but for some reason maker started over from the beginning and lost all of the previous work! I don't even know where to start interrogating. Should I nuke the whole maker directory structure and start from scratch? Maybe something got corrupted??

I already deleted the log files before I realized maker started over because the log files get way too big.

I really appreciate your help!

Mark T. W. Ebbert
-------------- next part --------------
An HTML attachment was scrubbed...
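
A note for readers who want to inspect leftover lock files before removing anything: in the find command quoted above, the *.NFSLock* pattern is unquoted, so the shell may expand it before find ever sees it; quoting the pattern avoids that. The following is only a minimal sketch in Python, not part of MAKER. It assumes that leftover locks are entries whose names contain ".NFSLock" (as in the command above), and the function name find_nfs_locks is made up for illustration. It lists candidates and deletes nothing.

```python
import os


def find_nfs_locks(root):
    """Return sorted paths under root whose name contains '.NFSLock'.

    Assumption: leftover lock entries are recognisable by that substring,
    mirroring the pattern used in the find command above.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Check both files and directories, as `find -name` would.
        for name in filenames + dirnames:
            if ".NFSLock" in name:
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


if __name__ == "__main__":
    # Dry run: print what would be affected, remove nothing.
    for path in find_nfs_locks("."):
        print(path)
```

Reviewing the list while no MAKER process is running seems safer than deleting blindly, since removing a lock that is still in use could let two processes write to the same files.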
URL: From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 12 03:44:48 2016 From: mohamed.amine.chebbi at univ-poitiers.fr (chebbi mohamed amine) Date: Wed, 12 Oct 2016 11:44:48 +0200 (CEST) Subject: [maker-devel] Combining and merging two Maker annotation gff files ? In-Reply-To: References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> Message-ID: <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr> Thank you Carson for your quick response. Sorry, I have another question concerning Augustus Training. You posted previously in the mailing list a link to an explanation of Augustus training steps http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.htm l . Unfortunately the link doesn't work anymore. Otherwise could you explain how to filter the gff file produced by the first run of Maker to get best full length ORF as a set of gene models to train Augustus ? Best, Amine De: "chebbi mohamed amine" ?: "Carson Holt" Cc: maker-devel at yandell-lab.org Envoy?: Mercredi 12 Octobre 2016 11:44:21 Objet: Re: [maker-devel] Combining and merging two Maker annotation gff files ? Thank you Carson for your quick response. Sorry, I have another question concerning Augustus Training. You posted previously in the mailing list a link to an explanation of Augustus training steps http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.htm l . Unfortunately the link doesn't work anymore. Otherwise could you explain how to filter the gff file produced by the first run of Maker to get best full length ORF as a set of gene models to train Augustus ? De: "Carson Holt" ?: "Mohamed Amine CHEBBI" Cc: maker-devel at yandell-lab.org Envoy?: Mardi 11 Octobre 2016 22:05:50 Objet: Re: [maker-devel] Combining and merging two Maker annotation gff files ? Masking doesn?t just affect the gene models, but also evidence alignment and thus scoring. 
So merging in this way would not make much sense as the second less masked set would always score better because it has more evidence alignments permitted by the lack of masking (not necessarily real, but drawn in by repeats). The result would be that any attempt of a merge would almost exclusively result in all genes from the second set always scoring higher. ?Carson On Oct 10, 2016, at 3:43 AM, Mohamed Amine CHEBBI < mohamed.amine.chebbi at univ-poitiers.fr > wrote: Hi! I?m using the latest version of Maker2 to annotate an arthropod genome. First, I have run RepeatModeler to create rmlib for Maker, then I have followed two independent annotation strategies on the same assembly : 1- Passing throw Maker all the repeats collected by RepeatModeler ( Identified repeats in the Repbase + Unkown Models). 2- Passing throw Maker only the identified repeats. Both annotations work successfully. The first annotation gives me 19048 genes against 22931 done by the second one. Know, I'm seeing for a mean to merge the two annotation gff files without doing a re-annotation and by taking the best and non redundant supported gene models . So, do you think that configuring the maker options as below, could resolve this issue : maker_gff=1-mask-all.gff,2-mask-onlyKnown.gff #MAKER derived GFF3 file #MAKER derived GFF3 file est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no pred_pass=1 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no -- Mohamed Amine CHEBBI, PhD Student Universit? 
de Poitiers
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Oct 17 12:17:17 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 12:17:17 -0600
Subject: [maker-devel] Maker MPI installation error and IO error for serial version
In-Reply-To:
References:
Message-ID:

It's saying your system has no flock (file locking). For NFS mounts this is usually a configuration by the administrator. At the very least they can enable lock emulation in NFS which is what your scratch seems to be. Unfortunately SQLite will not work without this.

You can still get MAKER to install with MPI by removing the lock used during setup (do this by editing line 210 of .../maker/src/lib/Parallel/Application/MPI.pm).

Turn this -->
$lock = new File::NFSLock("$loc/_MPI", 'EX', 300, 40) while(!$lock);

To this (i.e. comment out line 210) -->
#$lock = new File::NFSLock("$loc/_MPI", 'EX', 300, 40) while(!$lock);

However there is no work around for the SQLite IO error. It requires that your administrator enable locks or lock emulation (for example setting nolock,local_lock=all will cause the system to emulate locks on NFS locally). So while not exactly a real lock, they won't fail.

Thanks,
Carson

> On Oct 17, 2016, at 12:45 AM, Aravind PRASAD wrote:
>
> Hi,
>
> I'm trying to install Maker in my cluster account. I have installed all the dependencies. But, there are two issues for which I would like to get a solution. I tried to find it from the forums but helpless.
> 1. MPI installation
> "flock: Function not implemented error" at src/lib/Parallel/Application/MPI.pm line 256.
> ./Build install
>
> Configuring MAKER with MPI support
> flock: Function not implemented
> at /scratch/tools/maker_mpi/src/lib/Parallel/Application/MPI.pm line 256.
> Parallel::Application::MPI::_bind("/app/openmpi/1.10.3/intel_java/bin/mpicc", "/app/openmpi/1.10.3/intel_java/include", "blib", "") called at /scratch/users/astar/imcb/aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 277 > MAKER::Build::ACTION_build(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010 > Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "build") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1993 > Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0), "build") called at /aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 469 > MAKER::Build::ACTION_install(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010 > Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "install") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1998 > Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0)) called at ./Build line 69 > > 2. When I run a serial version of Maker, I get an error as follow in the ?makerlog.e? file. > > DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 109. > DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 111. > DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 113. > DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 191. > DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 390. > > > Please help me with these errors as early as possible. I have double checked for all the dependencies and the file paths given while running Maker. > Awaiting your reply! 
> > > > Regards, > Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institue of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673:: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http:/www.imcb.a-star.edu.sg/ > > > > > > > > Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carson.holt at genetics.utah.edu Mon Oct 17 12:25:54 2016 From: carson.holt at genetics.utah.edu (Carson Holt) Date: Mon, 17 Oct 2016 18:25:54 +0000 Subject: [maker-devel] question about Maker2 In-Reply-To: References: <56F4066F.4000803@fgcz.ethz.ch> <01AB4222AE1B7E41A3B5CAEC445F192B3F71EB84@MBX115.d.ethz.ch> <3470AFC0-7B3A-485C-A86E-C7DE5A341C3C@genetics.utah.edu> <57270F57.50208@fgcz.ethz.ch> <5A09C696-CBD0-4DA9-8CB6-B994981E00D3@genetics.utah.edu> <01AB4222AE1B7E41A3B5CAEC445F192B3F747251@MBX115.d.ethz.ch> <89F7DE68-6FFF-4E17-B867-8E699D3DE986@genetics.utah.edu> <01AB4222AE1B7E41A3B5CAEC445F192B3F752945@MBX215.d.ethz.ch> <1DB8E975-3E54-455D-8852-2DD2937B2FCF@genetics.utah.edu> Message-ID: <8D22D8B2-73DC-4276-8B2D-BDEF8ECDFBE7@genetics.utah.edu> > what is the difference between files > > 1) ContigXXX.maker.non_overlapping_ab_initio.proteins.fasta Non-redundant non-overlapping models (i.e. subset of snap/augustus models that do not overlap a final MAKER selected model). > and > > 2 )ContigXXX.maker.augustus_masked.proteins.fasta Contains all raw augustus models called without hints (i.e. the equivalent of just running Augustus on it?s own). 
> None of these should have EST info (as the sequences headers are
>
> 1) augustus_masked-1-processed-gene-

This was a raw Augustus model that may or may not have UTR added using EST info (i.e. the model came straight from Augustus, so no hints were used to produce the model, but MAKER did try to add UTR).

> and
>
> 2) augustus_masked-1-abinit-gene-

Model straight from Augustus. No hints, and no MAKER attempt to add UTR. These are raw unmodified models and will never be in the final selected set.

> so no "maker-XXX)

maker-XXX means it was a hint derived model and not a raw Augustus model.

> Should file 2 just be ignored and 1) be kept aside the maker file, where EST/protein evidence is incorporated?

Ignore all the abinit files. They are for reference purposes only. The non-overlapping file can be used to see what was rejected and does not overlap a current model (i.e. you may be able to find a handful of false negatives that can be rescued with domain analysis using something like InterProScan).

--Carson

> Thanks,
>
> G
>
> On 5/18/16 11:31 AM, Carson Holt wrote:
>> Hi Giancarlo,
>>
>> There was no image attached. If you can, just send me the contig GFF3, and I can look at it in Apollo (which lets me manipulate reading frame and display splice sites). Then I can tell you more. Basically the gene models are the result of an HMM for gene patterns plus hints to alter probability around evidence suggested sites. If there is any issue with the reading frame (can be a single bp assembly error) then no amount of hints can force a broken CDS to be coding, and the predictor will do the best it can to still produce a workable model (i.e. truncate exons, skip exons, etc). Also if your mRNA-seq is not aligned correctly around a canonical splice site (i.e. overhang beyond splice acceptor) then that hint may be ignored.
>>
>> --Carson
>>
>>
>>> On May 17, 2016, at 4:50 AM, Russo Giancarlo wrote:
>>>
>>> Hi Carson, thanks again for all your answers.
>>> A (hopefully) final question: in the attached image you can see an IGV sashimi plot of RNA-seq data, with the annotated gene derived from Maker; what could be the reason that in the gene model the two bits on the sides (UTRs?), which show high coverage from the RNA-seq data and plenty of splice junctions with the neighbouring exons, are completely missing?
>>>
>>> In this run I have used a closely related species from the Augustus database for gene prediction, RNA-seq based de novo assembled transcripts as EST, and protein sequences from the same closely related species. I have masked using a customized library built following the guidelines in the tutorial.
>>>
>>> Thanks,
>>> Giancarlo
>>>
>>> Giancarlo Russo, Ph.D.
>>> Functional Genomics Center Zurich
>>> ETH Zurich / University of Zurich
>>> Winterthurerstrasse 190 / Y32 H66
>>> CH-8057 Zurich
>>>
>>> Phone: +41 44 635 3964
>>> Fax: +41 44 635 3922
>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>> http://www.fgcz.ch
>>> ________________________________________
>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>> Sent: 09 May 2016 18:02
>>> To: Russo Giancarlo
>>> Subject: Re: question about Maker2
>>>
>>> For training gene predictors with protein and EST --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>>>
>>> If reusing MAKER results I don't recommend GFF3 passthrough. The GFF3 option is to get non-MAKER-sourced results into MAKER. You will actually lose some functionality by passing in MAKER-sourced results as GFF3 (MAKER can't do things with GFF3 that it can do with self-generated data).
>>>
>>> It is best to just rerun MAKER in the same directory; it will reuse previous reports it finds in the datastore.
>>>
>>> --Carson
>>>
>>>
>>>
>>>> On May 3, 2016, at 2:08 AM, Russo Giancarlo wrote:
>>>>
>>>> OK, thanks a lot, now it is clear.
>>>> >>>> About the passthrough procedure, would you have any particular advice on what would be the best strategy to run it? >>>> I have tried an existing organism in Augustus but the results were not too good. >>>> >>>> I have both EST and protein evidence, so I thought I could use EST to infer ab-initio and produce a first annotation and then run a second-pass using the first gff maker file as ab-initio. >>>> >>>> Any advice would be appreciated. >>>> >>>> Best and thanks again. >>>> Giancarlo >>>> >>>> Giancarlo Russo, Ph.D. >>>> Functional Genomics Center Zurich >>>> ETH Zurich / University of Zurich >>>> Winterthurerstrasse 190 / Y32 H66 >>>> CH-8057 Zurich >>>> >>>> Phone: +41 44 635 3964 >>>> Fax: +41 44 635 3922 >>>> e-mail: giancarlo.russo at fgcz.ethz.ch >>>> http://www.fgcz.ch >>>> ________________________________________ >>>> From: Carson Holt [carson.holt at genetics.utah.edu] >>>> Sent: 02 May 2016 18:16 >>>> To: Russo Giancarlo >>>> Subject: Re: question about Maker2 >>>> >>>> As part of the MAEKR job, it runs Snap and Augustus on their own before aligning evidence and generating hints for the later run. The Contig2.maker.augustus.transcripts.fasta are just the results of that uninformed Augustus run. They are not the final gene models, they are just the raw uninformed Augustus models. They are there for reference purposes only. They are what you would have gotten by just running Augustus directly on the assembly without any additional input (i.e. what Augustus would have produced on it?s own outside of MAKER). 
>>>> >>>> ?Carson >>>> >>>> >>>> >>>>> On May 2, 2016, at 2:27 AM, giancarlo.russo wrote: >>>>> >>>>> Hi Carson, >>>>> sorry to bother you again, I still don't understand the difference between >>>>> >>>>> 1) Contig2.maker.augustus.transcripts.fasta >>>>> and >>>>> 2) Contig2.maker.transcripts.fasta >>>>> >>>>> If 1) contains the transcripts "Produced by maker sending hints to >>>>> augustus to modify scoring against the HMM", >>>>> , and these hints are derived from EST/protein evidence, what extra >>>>> information is used/extra steps are performed to produce 3) ? >>>>> >>>>> Also, how is a passthrough using a first pass, maker-produced gff >>>>> annotation file is best done? >>>>> Should this gff file be used for ab-initio gene models that are then >>>>> corrected EST and protein evidence? >>>>> Does it make sense to use augustus when a first pass gff file is >>>>> available? Do these two options (ab-initio based on first pass gff and >>>>> augustus switched on) exclude each other? >>>>> >>>>> Thanks again for your time and help. >>>>> >>>>> Best, >>>>> G >>>>> On 29/03/16 17:42, Carson Holt wrote: >>>>>> Yes. The EST?s generate both hints as to intron location and exon location. The protein alignments generate CDS location hints. Each algorithm has different ways to feed hints with Augustus being the most advanced. It allows separate bonuses for partial vs exact matches, and you can optionally link hints so they have to be matched as a group. It also offerer many other hint types like splice donor and acceptor hints. However we really only use the intron, exon, and CDS hints. We also use the partial match bonus. >>>>>> >>>>>> ?Carson >>>>>> >>>>>> >>>>>>> On Mar 29, 2016, at 7:50 AM, Russo Giancarlo wrote: >>>>>>> >>>>>>> Hi Carson, thanks a lot for your answer. >>>>>>> >>>>>>> So let's see if I get it correctly. 
>>>>>>> In the final datastore I have the fasta files named >>>>>>> >>>>>>> 1)Contig2.maker.augustus.transcripts.fasta >>>>>>> 2)Contig2.maker.non_overlapping_ab_initio.transcripts.fasta >>>>>>> 3)Contig2.maker.transcripts.fasta >>>>>>> >>>>>>> 1) contains the transcripts "Produced by maker sending hints to augustus to modify scoring against the HMM" >>>>>>> 2) contains the transcripts predicted only by the ab initio algorithm (e.g. augustus) >>>>>>> 3) contains the transcripts with a full gene model based on ab initio + EST and/or PROTEIN >>>>>>> >>>>>>> However, what "hints" are sent by maker to augustus? If these are EST/PROTEIN hints, then what is the difference between 1) and 3) ? >>>>>>> >>>>>>> Thanks again for your help and sorry for bothering. >>>>>>> >>>>>>> Best, >>>>>>> Giancarlo >>>>>>> >>>>>>> Giancarlo Russo, Ph.D. >>>>>>> Functional Genomics Center Zurich >>>>>>> ETH Zurich / University of Zurich >>>>>>> Winterthurerstrasse 190 / Y32 H66 >>>>>>> CH-8057 Zurich >>>>>>> >>>>>>> Phone: +41 44 635 3964 >>>>>>> Fax: +41 44 635 3922 >>>>>>> e-mail: giancarlo.russo at fgcz.ethz.ch >>>>>>> http://www.fgcz.ch >>>>>>> ________________________________________ >>>>>>> From: Carson Holt [carson.holt at genetics.utah.edu] >>>>>>> Sent: 24 March 2016 21:56 >>>>>>> To: maker-devel >>>>>>> Cc: Russo Giancarlo; Mark Yandell >>>>>>> Subject: Re: question about Maker2 >>>>>>> >>>>>>> Hi Giancarlo, >>>>>>> >>>>>>> Anything listed as something like maker-*-augustus was a result of MAKER sending hints to augustus, and anything like augustus-*-abinit was the result of augustus run directly from the HMM without hints. >>>>>>> >>>>>>> Here is more detail on the format ?> >>>>>>> - - -gene- - >>>>>>> >>>>>>> Top level possibilities: >>>>>>> maker #maker generated model >>>>>>> snap_masked #snap run on masked sequence >>>>>>> augustus_masked #augustus run on masked sequence >>>>>>> etc. 
>>>>>>> >>>>>>> Internal source: >>>>>>> abinit #ab initio model direct from HMM >>>>>>> snap #hints provided to SNAP (alters scoring) >>>>>>> augustus #hints provided to augustus (alters scoring) >>>>>>> >>>>>>> Then chunk and iterator are just to generate a uniq ID. >>>>>>> >>>>>>> >>>>>>> Example: >>>>>>> augustus_masked-scaffold11899-abinit-gene-0.6 #Produced by Augustus on masked sequence using raw HMM (no MAKER intervention). >>>>>>> maker-scaffold11899-augustus-gene-0.6 #Produced by maker sending hints to augustus to modify scoring against the HMM >>>>>>> >>>>>>> ?Carson >>>>>>> >>>>>>> >>>>>>> >>>>>>>> On 3/24/16, 9:23 AM, "giancarlo.russo" >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Dear Mike, >>>>>>>>> >>>>>>>>> first of all thanks for taking care and sharing Maker, as part of the >>>>>>>>> community I appreciate it. >>>>>>>>> >>>>>>>>> I have a question about the nomenclature of the annotation in the output >>>>>>>>> file: >>>>>>>>> what is the difference between genes named >>>>>>>>> >>>>>>>>> maker-Contig-XXX >>>>>>>>> and those named >>>>>>>>> augustus-Contig-XXX-processed genes >>>>>>>>> ? >>>>>>>>> >>>>>>>>> Please find attached the maker_opts file I have used for my annotation. >>>>>>>>> I was under the impression that the ab-initio related prefixes would be >>>>>>>>> present only in the genes which are not marked as "maker" in column 3 of >>>>>>>>> the gff file (i.e., those >>>>>>>>> with both ab-initio and EST evidence) >>>>>>>>> >>>>>>>>> Is there something I am missing? >>>>>>>>> >>>>>>>>> Thanks a lot in advance, >>>>>>>>> Giancarlo >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Giancarlo Russo, Ph.D. >>>>>>>>> Functional Genomics Center Zurich >>>>>>>>> Y32 H66 >>>>>>>>> Winterthurerstr. 190 >>>>>>>>> 8057 Zurich >>>>>>>>> SWITZERLAND >>>>>>>>> Phone: +41 44 635 39 64 >>>>>>>>> Fax: +41 44 635 39 22 >>>>>>>>> E-Mail: giancarlo.russo at fgcz.ethz.ch >>>>>>>>> >>>>>>>> >>>>> -- >>>>> Giancarlo Russo, Ph.D. 
>>>>> Functional Genomics Center Zurich
>>>>> Y32 H66
>>>>> Winterthurerstr. 190
>>>>> 8057 Zurich
>>>>> SWITZERLAND
>>>>> Phone: +41 44 635 39 64
>>>>> Fax: +41 44 635 39 22
>>>>> E-Mail: giancarlo.russo at fgcz.ethz.ch
>>>>>
>
> --
> Giancarlo Russo, Ph.D.
> Functional Genomics Center Zurich
> Winterthurerstrasse 190
> 8057 Zurich (CH)
> Phone: +41 044 635 3964
> Fax: +41 044 635 3922
>

From carsonhh at gmail.com Mon Oct 17 12:35:52 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 12:35:52 -0600
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <57fffd715f83340001fcf47d@polymail.io>
References: <57fffd715f83340001fcf47d@polymail.io>
Message-ID: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>

If you made a change that affects downstream steps, MAKER erases affected intermediate files and recalculates. It's possible that you erased required checkpoint files, so MAKER thinks a change has been made that requires some things to be rerun.

Also, if the STDERR is too big, set -quiet or -qq (really quiet) on the command line.

In general the error you see at the end is not the cause. The real error is further back in the log. MAKER tries to recover/retry, so the final failure you see is basically MAKER saying, I give up. But the original cause is further back in the log, often behind the output of other MAKER threads that are writing to the log simultaneously. If you have 100 CPUs writing to the same output log, you may bury the real error behind the output of other threads (the log is not truly linear), so you have to look further back.

If you use the beta, you can also specify -nolock, but be warned that the locks themselves are important to avoid file corruption (e.g. if you accidentally launch MAKER twice).

--Carson

> On Oct 13, 2016, at 3:57 PM, Mark Ebbert wrote:
>
>
> Hi,
>
> I've been working with maker for several months off and on with varying success.
> It worked great the first time I ran it, but ever since, it fails every run without any specific errors. It just says that one of the processes failed. I've been limping along by just running the following command to remove any locks and re-starting: "find . -name *.NFSLock* -exec rm {} \;"
>
> This has been working, but for some reason maker started over from the beginning and lost all of the previous work! I don't even know where to start interrogating. Should I nuke the whole maker directory structure and start from scratch? Maybe something got corrupted?
>
> I already deleted the log files before I realized maker started over because the log files get way too big.
>
> I really appreciate your help!
>
> Mark T. W. Ebbert

From annabel.beichman at gmail.com Mon Oct 17 13:20:37 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Mon, 17 Oct 2016 12:20:37 -0700
Subject: [maker-devel] Too many genes?
Message-ID:

Hi Carson et al.,

Thanks so much for such a great pipeline, tutorials and advice pages.

I have just finished four rounds of annotation in Maker on the sea otter genome, which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding.

Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat.

Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.

Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb.

I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after Round II, 25847 after Round III, and 25031 after Round IV).

However, this final gene count (25,031) still seems too high to me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes, 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes.

22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter's closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00; when I visualize them in JBrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold).

I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have BLAST hits to SwissProt. Would you recommend any additional filters on gene length, etc.?

Not sure if this is significant, but one thing I've noticed is that many of the genes with BLAST hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold:

ScbS9RH_101185 30796 38760 + ELUT_00004195-RA ELUT_00004195 Name=ELUT_00004195-RA 0.08 0.17 Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185 42617 51087 + ELUT_00004196-RA ELUT_00004196 Name=ELUT_00004196-RA 0.25 0.26 Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185 87006 87827 + ELUT_00004198-RA ELUT_00004198 Name=ELUT_00004198-RA 0.18 0.18 Similar to Ano3: Anoctamin-3 (Mus musculus)
ScbS9RH_101185 110043 122523 + ELUT_00004199-RA ELUT_00004199 Name=ELUT_00004199-RA 0.09 0.09 Similar to ANO3: Anoctamin-3 (Homo sapiens)

Thank you all so much for your help and advice!

[I also want to report an odd behavior that may be specific to our server: when the number of scaffolds being annotated using maker drops below the number of cores (e.g. using openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno's paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]

Best wishes,
Annabel Beichman
Wayne Lab/Lohmueller Lab
Ecology & Evolutionary Biology
UCLA
Annabelbeichman.com

From carsonhh at gmail.com Mon Oct 17 14:11:52 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 14:11:52 -0600
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <58052fc8a2cc1400014626fe@polymail.io>
References: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com> <58052fc8a2cc1400014626fe@polymail.io>
Message-ID:

MAKER should automatically try and salvage things on restart (that is the purpose of the checkpoint files).
You can set clean_try=1 if you want. It will then delete failed contigs before retrying on any failure.

--Carson

From mark.ebbert at gmail.com Mon Oct 17 14:09:52 2016
From: mark.ebbert at gmail.com (Mark Ebbert)
Date: Mon, 17 Oct 2016 13:09:52 -0700
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>
References: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>
Message-ID: <58052fc8a2cc1400014626fe@polymail.io>

Thanks Carson,

I've been restarting it using the same commands several times in a row. Unless that "find" command has the potential to modify any important files, I don't think I modified anything. All I ran was:

"find . -name *.NFSLock* -exec rm {} \;"
"sbatch maker.slurm"

I'm inclined to nuke it all and start over. Is it possible to salvage previous work, or is it all gone?

Mark T. W.
Ebbert
Please note my new email address: mark.ebbert at gmail.com

From carsonhh at gmail.com Mon Oct 17 14:25:32 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 14:25:32 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To:
References:
Message-ID: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>

Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally, pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another; SNAP tends to have issues with mammalian genomes, for example). Also, when looking at the contigs, if you see contigs consisting of only single-exon genes then you may have some prokaryotic contamination (they assemble as independent gene-dense contigs, so this is a good thing to look at if gene counts are high). Finally, high gene counts can mean that repeats are still under-masked (repeats encode real proteins like transposases).

You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain).
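
The domain-fraction check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `domain_fraction` and the toy protein IDs are made up for the example, and the input is assumed to be InterProScan's tab-separated output, with the protein accession in column 1 and the analysis name (e.g. Pfam) in column 4.

```python
def domain_fraction(protein_ids, interpro_tsv_lines, analysis=None):
    """Return the fraction of proteins with at least one InterProScan hit.

    protein_ids: every protein ID in the annotation (hypothetical input).
    interpro_tsv_lines: lines of InterProScan TSV output; column 1 is assumed
        to be the protein accession and column 4 the analysis name.
    analysis: if given (e.g. "Pfam"), only hits from that analysis count.
    """
    wanted = set(protein_ids)
    if not wanted:
        return 0.0
    with_hit = set()
    for line in interpro_tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        accession, source = fields[0], fields[3]
        if analysis is not None and source != analysis:
            continue
        if accession in wanted:
            with_hit.add(accession)
    return len(with_hit) / len(wanted)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    ids = ["geneA-RA", "geneB-RA", "geneC-RA", "geneD-RA"]
    hits = [
        "geneA-RA\tmd5\t300\tPfam\tPF00001\tdomain one",
        "geneA-RA\tmd5\t300\tPfam\tPF00002\tdomain two",
        "geneB-RA\tmd5\t250\tSMART\tSM00001\tdomain three",
    ]
    print(domain_fraction(ids, hits))                   # 0.5
    print(domain_fraction(ids, hits, analysis="Pfam"))  # 0.25
```

A result well below the ~75-85% range would be one hint that the gene set still contains false positives; the same counting could be restricted to genes under a chosen AED cutoff.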

--Carson

From annabel.beichman at gmail.com Mon Oct 17 17:13:07 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Mon, 17 Oct 2016 16:13:07 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>
Message-ID: <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com>

Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP as it definitely doesn't show great concordance compared to Augustus.

Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in RepeatModeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta

My InterProScan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I'm on the right track.

Thanks so much again,

~ Annabel

From carsonhh at gmail.com Mon Oct 17 18:09:52 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 18:09:52 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com>
Message-ID: <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com>

It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want.

--Carson

From carsonhh at gmail.com Sun Oct 23 17:25:34 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 23 Oct 2016 17:25:34 -0600
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To: <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

It's unfortunate the archived GMOD post is gone, because I always used it for my own reference. If I remember right, the main point was that Jason Stajich wrote a tool to convert SNAP's ZFF format to a GenBank format suitable for Augustus training. This meant you could use the maker2zff script that comes with MAKER, then use Jason's tool to convert for Augustus training.

Tool to convert SNAP training ZFF to an Augustus training input file -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Since the post is gone, you could use the documentation provided with his tool and then maybe a generic Augustus training guide like the following to design a path forward -> http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html

--Carson

> On Oct 12, 2016, at 3:44 AM, chebbi mohamed amine wrote:
>
> Thank you Carson for your quick response. Sorry, I have another question concerning Augustus training. You previously posted in the mailing list a link to an explanation of the Augustus training steps: http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html. Unfortunately the link doesn't work anymore. Otherwise, could you explain how to filter the GFF file produced by the first run of MAKER to get the best full-length ORFs as a set of gene models to train Augustus?
>
> Best,
> Amine
>
> From: "Carson Holt"
> To: "Mohamed Amine CHEBBI"
> Cc: maker-devel at yandell-lab.org
> Sent: Tuesday, 11 October 2016 22:05:50
> Subject: Re: [maker-devel] Combining and merging two Maker annotation gff files ?
>
> Masking doesn't just affect the gene models, but also evidence alignment and thus scoring. So merging in this way would not make much sense, as the second, less masked set would always score better because it has more evidence alignments permitted by the lack of masking (not necessarily real, but drawn in by repeats).
>
> The result would be that any attempt at a merge would almost exclusively result in all genes from the second set scoring higher.
>
> --Carson
>
> On Oct 10, 2016, at 3:43 AM, Mohamed Amine CHEBBI wrote:
>
> Hi!
>
> I'm using the latest version of Maker2 to annotate an arthropod genome. First, I ran RepeatModeler to create the rmlib for MAKER, then I followed two independent annotation strategies on the same assembly:
> 1- Passing through MAKER all the repeats collected by RepeatModeler (identified repeats in Repbase + unknown models).
> 2- Passing through MAKER only the identified repeats.
>
> Both annotations ran successfully. The first annotation gives me 19048 genes against 22931 for the second one. Now I'm looking for a way to merge the two annotation GFF files without doing a re-annotation, taking the best-supported, non-redundant gene models.
>
> So, do you think that configuring the MAKER options as below could resolve this issue:
> maker_gff=1-mask-all.gff,2-mask-onlyKnown.gff #MAKER derived GFF3 file
> est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no
> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
> protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
> rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no
> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
> pred_pass=1 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
> other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no
>
> --
> Mohamed Amine CHEBBI, PhD Student
> Université de Poitiers

From xvazquezc at gmail.com Sun Oct 23 17:49:53 2016
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 24 Oct 2016 10:49:53 +1100
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

If it's of any help, I had these notes in my old protocol (before I started doing the training with BUSCO):

> For Augustus, we need the script "zff2augustus_gbk.pl". This will take the export.dna generated by fathom and generate a *.gb file that will be used as the "training gene structure file" in a new training submission in WebAugustus. Remember to give it a new name in the submission, e.g. MYGENOME_v2, or MAKER won't see the difference (same name):
> perl PATH/TO/SCRIPT/zff2augustus_gbk.pl > MYGENOME.train.gb

As said, you could also do the training with BUSCO with the --long option.
It has a dataset specific to arthropods. But if you have EST data, you'll probably do better with the other method, as it lets you supply the ESTs for more accurate training.

--
Xabier Vázquez-Campos, PhD
Research Associate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From jill711021 at gmail.com Sun Oct 23 21:32:38 2016
From: jill711021 at gmail.com (王一凡)
Date: Mon, 24 Oct 2016 11:32:38 +0800
Subject: [maker-devel] maker -error
Message-ID:

Dear sir,

I am trying to run GeneMark-ES and MAKER to annotate a fungal genome. When I run gm_es.pl, the script terminates with the following error:

> Must input more than one data point! at /home/myname/Applications/GeneMarkES/parse_ET.pl line 213.
> Invalid regression data
> error on call: /home/myname/Applications/GeneMarkES/parse_ET.pl --section ET_C --cfg /home/myname/projectX/Maker/GeneMark/run.cfg --v

After searching and asking around, I still have no idea how to deal with it. Do you have any idea? Thank you for your time!

From carsonhh at gmail.com Mon Oct 24 16:41:04 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 24 Oct 2016 16:41:04 -0600
Subject: [maker-devel] maker -error
In-Reply-To:
References:
Message-ID: <65B4147C-B28C-40EB-9004-F93D821AF1C7@gmail.com>

That is a GeneMark internal error. I'd recommend running GeneMark by itself (outside of MAKER) on whatever contig it failed on. If the error reproduces, you can post the error and the test dataset to the GeneMark developers.

--Carson

From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 26 02:32:52 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (chebbi mohamed amine)
Date: Wed, 26 Oct 2016 10:32:52 +0200 (CEST)
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID: <1581157450.4030281.1477470772694.JavaMail.zimbra@univ-poitiers.fr>

Thank you very much for your help.

Best,
Mohamed

From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 26 07:09:33 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine Chebbi)
Date: Wed, 26 Oct 2016 15:09:33 +0200 (CEST)
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
Message-ID: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>

Hi!

I have tried three rounds of annotation in MAKER on a non-model arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and Illumina reads. As suggested in the tutorial, in the first round I ran MAKER with repeat masking to generate gene models using transcript (Trinity assembly) and protein (SwissProt) evidence.
Then the MAKER models were used twice in a bootstrap fashion to retrain SNAP. The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.

However, the AED profile (attached) doesn't seem satisfactory. So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that filtering for good transcripts could improve the results? If yes, which criteria should be taken into account?

Thank you.

Best,
Amine

-------------- next part --------------
A non-text attachment was scrubbed...
Name: AED-Graph.pdf
Type: application/pdf
Size: 5328 bytes
Desc: not available
URL:

From michael.s.campbell1 at gmail.com Wed Oct 26 12:00:08 2016
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Wed, 26 Oct 2016 14:00:08 -0400
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

Hi Amine,

I haven't seen that pattern in a CFD plot of AED before. Is there a possibility that the x and y axes are switched in the plot?

Thanks,
Mike

From carsonhh at gmail.com Wed Oct 26 12:04:20 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 26 Oct 2016 12:04:20 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
Message-ID: <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>

Your AED curve looks fine. The first run (using protein2genome or est2genome, I assume) will always have really low overall AED because the models are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always artificially look good). The protein2genome or est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve; the value is set in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further assess the models by adding protein domains using InterProScan, where you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).

Your overall gene counts are a little high, though, for an arthropod (14,000-19,000 genes would be expected, as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa).
However your gene counts can be explained by either insufficient repeat masking (you can add a RepeatModeler generated library to the existing settings to help with this), poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with more strict assembly parameters including the jaccard-clip option in trinity), or it is just the result of assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split which results in inflated gene counts). Finally manually look at the most gene dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so is easily blacklisted and can be identified visually (always gene dense and single exon). Thanks, Carson > On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi wrote: > > Hi ! > I have tried three rounds of annotation in Maker on a non model arthropod genome (1.7Gb) which is an hybrid assembly of Pacbio and illumina reads. > As suggested in the tutorial, I ran in the first round Maker with repeat masking to generate gene models using transcript (Trinity assembly) and protein (swissprot) evidence. Then Maker models were used twice in a bootstrap fashion to retrain SNAP. > The number of genes drops from 29207 in the round 1 to 22547 in the round 2 then increases slightly to 22931 in the round 3. > > However, the AED profile (attached) don't seem to be satisfactory. > So I wonder if you could let me a good strategy to improve the annotation quality. Do you think that filtering good transcripts could improve results. If yes , which criteria should be taken into account ? > Thank you. 
> > Best; > Amine > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Wed Oct 26 12:06:36 2016 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 26 Oct 2016 12:06:36 -0600 Subject: [maker-devel] Filter transcripts to improve annotation quality ? In-Reply-To: <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> Message-ID: <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> Sorry. I also assumed X and Y was flipped when I looked at it. Now I read the labels, your AED curve would be weird unless the X and Y are flipped in your figure. ?Carson > On Oct 26, 2016, at 12:04 PM, Carson Holt wrote: > > Your AED curve looks fine. The first run (using protein2genome or est2genome I assume) will always have really low overall AED because they are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always artificially look good). The protein2genome or est2genome modles also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve - value in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further look at models by adding protein domains using InterProScan in which you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content). > > Your overall gene counts are a little high though for an arthropod (14,000-19,000 genes would be expected as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa). 
However your gene counts can be explained by either insufficient repeat masking (you can add a RepeatModeler generated library to the existing settings to help with this), poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with more strict assembly parameters including the jaccard-clip option in trinity), or it is just the result of assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split which results in inflated gene counts). > > Finally manually look at the most gene dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so is easily blacklisted and can be identified visually (always gene dense and single exon). > > Thanks, > Carson > > > > >> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi > wrote: >> >> Hi ! >> I have tried three rounds of annotation in Maker on a non model arthropod genome (1.7Gb) which is an hybrid assembly of Pacbio and illumina reads. >> As suggested in the tutorial, I ran in the first round Maker with repeat masking to generate gene models using transcript (Trinity assembly) and protein (swissprot) evidence. Then Maker models were used twice in a bootstrap fashion to retrain SNAP. >> The number of genes drops from 29207 in the round 1 to 22547 in the round 2 then increases slightly to 22931 in the round 3. >> >> However, the AED profile (attached) don't seem to be satisfactory. >> So I wonder if you could let me a good strategy to improve the annotation quality. Do you think that filtering good transcripts could improve results. If yes , which criteria should be taken into account ? >> Thank you. 
>> >> Best; >> Amine >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jason.stajich at gmail.com Wed Oct 26 19:26:26 2016 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 26 Oct 2016 18:26:26 -0700 Subject: [maker-devel] Combining and merging two Maker annotation gff files ? In-Reply-To: References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr> Message-ID: Yes thanks for re-sharing. Maybe we should write this up into a clearer tutorial - I go back and forth on how to make this easier and automated. Jason On Sunday, October 23, 2016, Xabier V?zquez-Campos wrote: > If it's of any help I had this notes on my old protocol (before I started > to do the training with BUSCO): > > For Augustus, we need the script "zff2augustus_gbk.pl". This will take >> the export.dna generated by fathom and generate a *.gb file that will be >> used as "training gene structure file" in a new training submission in >> WebAugustus, but remember to give it a new name in the submission, e.g. >> MYGENOME_v2, or Maker won't see the difference (same name): >> perl PATH/TO/SCRIPT/zff2augustus_gbk.pl > MYGENOME.train.gb >> > > As said, you could also do the training with BUSCO with the --long option. > It has a dataset specific for arthropods. But if you have EST data you'll > probably do better with the other method, as it allows to enter the EST for > a more accurate training. > > On 24 October 2016 at 10:25, Carson Holt > wrote: > >> It?s unfortunate the archived GMOD post is gone, because I always used it >> for my own reference. 
If I remember right, the main point was that Jason >> Stajich wrote a tool to convert Snap?s ZFF format to a Genbank format >> suitable for Augustus training. This meant you could use the maker2zff >> script that came with MAKER, then use Jason?s tool to convert for Augustus >> training. >> >> Tool to convert SNAP training ZFF to Augustus trining input file ?> >> https://github.com/hyphaltip/genome-scripts/blob/master/gene >> _prediction/zff2augustus_gbk.pl >> >> >> Since the post is gone, you could use that documentation provided with >> his tool and then maybe a generic Augustus training guide like the >> following to design a path forward ?> >> http://www.molecularevolution.org/molevolfiles/exercises/aug >> ustus/training.html >> >> ?Carson >> >> >> On Oct 12, 2016, at 3:44 AM, chebbi mohamed amine < >> mohamed.amine.chebbi at univ-poitiers.fr >> > >> wrote: >> >> Thank you Carson for your quick response. Sorry, I have another question >> concerning Augustus Training. You posted previously in the mailing list a >> link to an explanation of Augustus training steps >> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.htm >> l. >> Unfortunately the link doesn't work anymore. Otherwise could you explain >> how to filter the gff file produced by the first run of Maker to get best >> full length ORF as a set of gene models to train Augustus ? >> >> Best, >> Amine >> >> ------------------------------ >> *De: *"chebbi mohamed amine" > > >> *?: *"Carson Holt" > > >> *Cc: *maker-devel at yandell-lab.org >> >> *Envoy?: *Mercredi 12 Octobre 2016 11:44:21 >> *Objet: *Re: [maker-devel] Combining and merging two Maker annotation >> gff files ? >> >> Thank you Carson for your quick response. Sorry, I have another question >> concerning Augustus Training. You posted previously in the mailing list a >> link to an explanation of Augustus training steps >> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.htm >> l. >> Unfortunately the link doesn't work anymore. 
>> Otherwise, could you explain
>> how to filter the GFF file produced by the first run of Maker to get the best
>> full-length ORFs as a set of gene models to train Augustus?
>>
>> Best,
>> Amine
>>
>> ------------------------------
>> *From: *"chebbi mohamed amine"
>> *To: *"Carson Holt"
>> *Cc: *maker-devel at yandell-lab.org
>> *Sent: *Wednesday 12 October 2016 11:44:21
>> *Subject: *Re: [maker-devel] Combining and merging two Maker annotation
>> gff files ?
>>
>> Thank you Carson for your quick response. Sorry, I have another question
>> concerning Augustus training. You previously posted in the mailing list a
>> link to an explanation of the Augustus training steps:
>> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
>> Unfortunately the link doesn't work anymore. Otherwise, could you explain
>> how to filter the GFF file produced by the first run of Maker to get the best
>> full-length ORFs as a set of gene models to train Augustus?
>>
>>
>> ------------------------------
>> *From: *"Carson Holt"
>> *To: *"Mohamed Amine CHEBBI"
>> *Cc: *maker-devel at yandell-lab.org
>> *Sent: *Tuesday 11 October 2016 22:05:50
>> *Subject: *Re: [maker-devel] Combining and merging two Maker annotation
>> gff files ?
>>
>> Masking doesn't just affect the gene models, but also evidence alignment
>> and thus scoring. So merging in this way would not make much sense, as the
>> second, less masked set would always score better because it has more
>> evidence alignments permitted by the lack of masking (not necessarily real,
>> but drawn in by repeats).
>>
>> The result would be that any attempt at a merge would almost exclusively
>> result in all genes from the second set always scoring higher.
>>
>> --Carson
>>
>>
>>
>> On Oct 10, 2016, at 3:43 AM, Mohamed Amine CHEBBI <mohamed.amine.chebbi at univ-poitiers.fr> wrote:
>>
>> Hi!
>>
>> I'm using the latest version of Maker2 to annotate an arthropod genome.
>> First, I ran RepeatModeler to create the rmlib for Maker, then I
>> followed two independent annotation strategies on the same assembly:
>> 1- Passing through Maker all the repeats collected by RepeatModeler
>> (identified repeats in Repbase + unknown models).
>> 2- Passing through Maker only the identified repeats.
>>
>> Both annotations ran successfully. The first annotation gives me 19048
>> genes against 22931 for the second one. Now, I'm looking for a way to
>> merge the two annotation GFF files without doing a re-annotation and by
>> taking the best-supported, non-redundant gene models.
>>
>> So, do you think that configuring the maker options as below could
>> resolve this issue:
>> maker_gff=1-mask-all.gff,2-mask-onlyKnown.gff #MAKER derived GFF3 file
>> est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no
>> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
>> protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
>> rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no
>> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
>> pred_pass=1 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
>> other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no
>>
>> --
>> Mohamed Amine CHEBBI, PhD Student
>> Université de Poitiers
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> Water Research Centre
> School of Civil and Environmental Engineering
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>

--
Jason Stajich
jason.stajich at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From michael.s.campbell1 at gmail.com Thu Oct 27 07:21:01 2016
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 27 Oct 2016 09:21:01 -0400
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To:
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
Message-ID: <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>

I think that if you train any further you will run the risk of overtraining. Setting alt_splice to 1 will add transcripts but not genes, so the gene count is going to be related to the training of the gene finder. I would recommend looking at a few of your large scaffolds in a genome browser. I would also recommend adding a second gene predictor such as Augustus. When multiple predictors are used and the models they predict converge, you can have more confidence in the gene prediction.

For the masking, you can make a species-specific repeat library like Carson suggested to see if the gene count comes down a little. If you are concerned about masking duplicated genes you can do a couple of things. You can filter the repeat library based on known proteins. You can also set a copy number minimum for the masking and only include repeats that are present more than 10 times in the genome. Here are a couple of URLs for making species-specific repeat libraries:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Basic

Take care,
Mike

> On Oct 27, 2016, at 5:54 AM, Mohamed Amine CHEBBI wrote:
>
> Sorry, the X and Y were switched in the plot due to a mishandling. Please find attached now the correct AED graph.
>
> Round 3 (red curve) shows a slightly higher overall AED than the second round (green curve) and more genes (22931 compared to 22547 in round 2). Do you think that I should stop at the second round?
>
> I didn't specify in the previous email that the repeat masking was done in Maker using Repbase and only the models found by RepeatModeler that were identified.
> I left the unknown library of RepeatModeler unmasked. In fact, we expect a high rate of segmental and gene duplication in the genome, which could explain the high overall count of genes found by Maker.
>
> On the other hand, the high number of genes may also be explained by the fact that I activated the alt_splice=1 option to find alternative splicing. Do you think that it was a good idea?
>
> Thank you very much for your time.
>
> Best,
> Amine
>
> On 26/10/2016 at 20:06, Carson Holt wrote:
>> Sorry. I also assumed X and Y were flipped when I looked at it. Now that I read the labels, your AED curve would be weird unless the X and Y are flipped in your figure.
>>
>> --Carson
>>
>>
>>> On Oct 26, 2016, at 12:04 PM, Carson Holt wrote:
>>>
>>> Your AED curve looks fine. The first run (using protein2genome or est2genome I assume) will always have really low overall AED because they are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always artificially look good). The protein2genome or est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve - value in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further look at models by adding protein domains using InterProScan, in which you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).
>>>
>>> Your overall gene counts are a little high though for an arthropod (14,000-19,000 genes would be expected as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa).
>>> However your gene counts can be explained by either insufficient repeat masking (you can add a RepeatModeler generated library to the existing settings to help with this), poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with more strict assembly parameters including the jaccard-clip option in Trinity), or it is just the result of assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split which results in inflated gene counts).
>>>
>>> Finally, manually look at the most gene dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so it is easily blacklisted and can be identified visually (always gene dense and single exon).
>>>
>>> Thanks,
>>> Carson
>>>
>>>
>>>> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi <mohamed.amine.chebbi at univ-poitiers.fr> wrote:
>>>>
>>>> Hi!
>>>> I have tried three rounds of annotation in Maker on a non-model arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and Illumina reads.
>>>> As suggested in the tutorial, I ran Maker in the first round with repeat masking to generate gene models using transcript (Trinity assembly) and protein (Swiss-Prot) evidence. Then the Maker models were used twice in a bootstrap fashion to retrain SNAP.
>>>> The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.
>>>>
>>>> However, the AED profile (attached) doesn't seem to be satisfactory.
>>>> So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that filtering for good transcripts could improve the results? If yes, which criteria should be taken into account?
>>>> Thank you.
>>>>
>>>> Best;
>>>> Amine
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>
> --
> Mohamed Amine CHEBBI, PhD Student
> Université de Poitiers
> Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
> Equipe Ecologie Evolution Symbiose
> Bât. B8-B35 - 5 Rue Albert Turpin
> TSA 51106
> F-86022 Poitiers Cedex 9
> FRANCE
> Lab website: http://ecoevol.labo.univ-poitiers.fr/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 03:54:31 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Thu, 27 Oct 2016 11:54:31 +0200
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
Message-ID:

Sorry, the X and Y were switched in the plot due to a mishandling. Please find attached now the correct AED graph.

Round 3 (red curve) shows a slightly higher overall AED than the second round (green curve) and more genes (22931 compared to 22547 in round 2). Do you think that I should stop at the second round?

I didn't specify in the previous email that the repeat masking was done in Maker using Repbase and only the models found by RepeatModeler that were identified. I left the unknown library of RepeatModeler unmasked. In fact, we expect a high rate of segmental and gene duplication in the genome, which could explain the high overall count of genes found by Maker.

On the other hand, the high number of genes may also be explained by the fact that I activated the alt_splice=1 option to find alternative splicing. Do you think that it was a good idea?
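
The round-to-round AED comparison discussed here can also be tabulated rather than read off a plot. The following is a minimal sketch, not something posted in the thread: it assumes the GFF3 comes from MAKER, which writes an `_AED` attribute on mRNA features, and the function names and cutoffs are illustrative choices only.

```python
from io import StringIO


def read_aed_values(handle):
    """Collect the _AED value of every mRNA feature in a MAKER GFF3 stream."""
    values = []
    for line in handle:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no more feature lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key == "_AED":  # exact match, so _eAED is ignored
                values.append(float(value))
    return values


def cumulative_fraction(values, cutoffs=(0.25, 0.5, 0.75, 1.0)):
    """Fraction of transcripts whose AED is at or below each cutoff."""
    if not values:
        return {c: 0.0 for c in cutoffs}
    return {c: sum(v <= c for v in values) / len(values) for c in cutoffs}


if __name__ == "__main__":
    # Tiny made-up GFF3 fragment, for illustration only.
    demo = StringIO(
        "##gff-version 3\n"
        "ctg1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1\n"
        "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.10;_eAED=0.12\n"
        "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-RB;Parent=g1;_AED=0.60;_eAED=0.61\n"
    )
    print(cumulative_fraction(read_aed_values(demo)))
```

Running this on the GFF3 of each round and comparing the fraction at a cutoff such as 0.5 gives the same information as the curves, in numbers that are easy to paste into a reply.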

Thank you very much for your time.

Best,
Amine

On 26/10/2016 at 20:06, Carson Holt wrote:
> Sorry. I also assumed X and Y were flipped when I looked at it. Now that I read the labels, your AED curve would be weird unless the X and Y are flipped in your figure.
>
> --Carson
>
>
>> On Oct 26, 2016, at 12:04 PM, Carson Holt wrote:
>>
>> Your AED curve looks fine. The first run (using protein2genome or est2genome I assume) will always have really low overall AED because they are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always artificially look good). The protein2genome or est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve - value in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further look at models by adding protein domains using InterProScan, in which you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).
>>
>> Your overall gene counts are a little high though for an arthropod (14,000-19,000 genes would be expected as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa). However your gene counts can be explained by either insufficient repeat masking (you can add a RepeatModeler generated library to the existing settings to help with this), poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with more strict assembly parameters including the jaccard-clip option in Trinity), or it is just the result of assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split which results in inflated gene counts).
>>
>> Finally, manually look at the most gene dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so it is easily blacklisted and can be identified visually (always gene dense and single exon).
>>
>> Thanks,
>> Carson
>>
>>
>>> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi wrote:
>>>
>>> Hi!
>>> I have tried three rounds of annotation in Maker on a non-model arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and Illumina reads.
>>> As suggested in the tutorial, I ran Maker in the first round with repeat masking to generate gene models using transcript (Trinity assembly) and protein (Swiss-Prot) evidence. Then the Maker models were used twice in a bootstrap fashion to retrain SNAP.
>>> The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.
>>>
>>> However, the AED profile (attached) doesn't seem to be satisfactory.
>>> So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that filtering for good transcripts could improve the results? If yes, which criteria should be taken into account?
>>> Thank you.
>>>
>>> Best;
>>> Amine
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât.
B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AED-Graph.pdf
Type: application/pdf
Size: 5302 bytes
Desc: not available
URL:

From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 08:34:02 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Thu, 27 Oct 2016 16:34:02 +0200
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
Message-ID:

Thank you Michael for your response.

As you suggested, I would use Augustus and SNAP, both trained on the assembled transcripts in a bootstrap fashion.

For the masking, I intend to adapt Carson's strategy:

- Collecting the RepeatModeler repeats.lib
- Searching the sequences in Modelerunknown.lib against a transposase database (derived from the RepeatMasker package and Kennedy et al. (2011)) and considering sequences matching transposases as transposons.
- Exclusion of gene fragments in both known and unknown repeats
- As I'm concerned about gene duplications, the remaining sequences in the unknown lib that are present fewer than 10 times will be removed.

Thank you again for your time and I remain open to any suggestion.

Best,
Amine

On 27/10/2016 at 15:21, Michael Campbell wrote:
> I think that if you train any further you will run the risk of overtraining. Setting alt_splice to 1 will add transcripts but not genes, so the gene count is going to be related to the training of the gene finder. I would recommend looking at a few of your large scaffolds in a genome browser.
> I would also recommend adding a second gene predictor such as Augustus. When multiple predictors are used and the models they predict converge, you can have more confidence in the gene prediction.
>
> For the masking, you can make a species-specific repeat library like Carson suggested to see if the gene count comes down a little. If you are concerned about masking duplicated genes you can do a couple of things. You can filter the repeat library based on known proteins. You can also set a copy number minimum for the masking and only include repeats that are present more than 10 times in the genome. Here are a couple of URLs for making species-specific repeat libraries:
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Basic
>
> Take care,
> Mike
>
>> On Oct 27, 2016, at 5:54 AM, Mohamed Amine CHEBBI wrote:
>>
>> Sorry, the X and Y were switched in the plot due to a mishandling. Please find attached now the correct AED graph.
>>
>> Round 3 (red curve) shows a slightly higher overall AED than the second round (green curve) and more genes (22931 compared to 22547 in round 2). Do you think that I should stop at the second round?
>>
>> I didn't specify in the previous email that the repeat masking was done in Maker using Repbase and only the models found by RepeatModeler that were identified. I left the unknown library of RepeatModeler unmasked. In fact, we expect a high rate of segmental and gene duplication in the genome, which could explain the high overall count of genes found by Maker.
>>
>> On the other hand, the high number of genes may also be explained by the fact that I activated the alt_splice=1 option to find alternative splicing. Do you think that it was a good idea?
>>
>> Thank you very much for your time.
>>
>> Best,
>> Amine
>>
>> On 26/10/2016 at 20:06, Carson Holt wrote:
>>> Sorry. I also assumed X and Y were flipped when I looked at it. Now that I read the labels, your AED curve would be weird unless the X and Y are flipped in your figure.
>>>
>>> --Carson
>>>
>>>
>>>> On Oct 26, 2016, at 12:04 PM, Carson Holt wrote:
>>>>
>>>> Your AED curve looks fine. The first run (using protein2genome or est2genome I assume) will always have really low overall AED because they are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always artificially look good). The protein2genome or est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve - value in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further look at models by adding protein domains using InterProScan, in which you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).
>>>>
>>>> Your overall gene counts are a little high though for an arthropod (14,000-19,000 genes would be expected as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa). However your gene counts can be explained by either insufficient repeat masking (you can add a RepeatModeler generated library to the existing settings to help with this), poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with more strict assembly parameters including the jaccard-clip option in Trinity), or it is just the result of assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split which results in inflated gene counts).
>>>>
>>>> Finally, manually look at the most gene dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so it is easily blacklisted and can be identified visually (always gene dense and single exon).
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>>
>>>>> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi wrote:
>>>>>
>>>>> Hi!
>>>>> I have tried three rounds of annotation in Maker on a non-model arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and Illumina reads.
>>>>> As suggested in the tutorial, I ran Maker in the first round with repeat masking to generate gene models using transcript (Trinity assembly) and protein (Swiss-Prot) evidence. Then the Maker models were used twice in a bootstrap fashion to retrain SNAP.
>>>>> The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.
>>>>>
>>>>> However, the AED profile (attached) doesn't seem to be satisfactory.
>>>>> So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that filtering for good transcripts could improve the results? If yes, which criteria should be taken into account?
>>>>> Thank you.
>>>>>
>>>>> Best;
>>>>> Amine
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com
>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>>
>>
>> --
>> Mohamed Amine CHEBBI, PhD Student
>> Université de Poitiers
>> Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
>> Equipe Ecologie Evolution Symbiose
>> Bât.
>> B8-B35 - 5 Rue Albert Turpin
>> TSA 51106
>> F-86022 Poitiers Cedex 9
>> FRANCE
>> Lab website: http://ecoevol.labo.univ-poitiers.fr/
>>
>

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât. B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Thu Oct 27 09:08:15 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 27 Oct 2016 09:08:15 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To:
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
Message-ID:

I do believe that you are getting a number of false positive genes because of under-masking. So taking a more careful strategy (i.e. using the suggestions given by Michael) should mitigate that. You will have to decide how aggressive to be with the repeat masking (i.e. sensitivity/specificity balance).

I would however turn off alt_splice. It has a very high threshold for how clean and complete mRNA alignments and repeat masking have to be in order to function correctly (the reason why the default is off). So given the filtering being done to pull back on repeat masking, it likely does not meet that threshold. It won't really produce more genes, but you will get many spurious alternate transcripts.

Also for the gene count, make sure not to count from the FASTA; that is the transcript count. You have to count the "gene" feature lines in the GFF3 to get the gene count, i.e. --> grep -P -c "\tgene\t" models.gff

--Carson

> On Oct 27, 2016, at 8:34 AM, Mohamed Amine CHEBBI wrote:
>
> Thank you Michael for your response.
>
> As you suggested, I would use Augustus and SNAP, both trained on the assembled transcripts in a bootstrap fashion.
>
> For the masking, I intend to adapt Carson's strategy:
>
> - Collecting the RepeatModeler repeats.lib
> - Searching the sequences in Modelerunknown.lib against a transposase database (derived from the RepeatMasker package and Kennedy et al. (2011)) and considering sequences matching transposases as transposons.
> - Exclusion of gene fragments in both known and unknown repeats
> - As I'm concerned about gene duplications, the remaining sequences in the unknown lib that are present fewer than 10 times will be removed.
>
> Thank you again for your time and I remain open to any suggestion.
>
> Best,
> Amine
>
> On 27/10/2016 at 15:21, Michael Campbell wrote:
>> I think that if you train any further you will run the risk of overtraining. Setting alt_splice to 1 will add transcripts but not genes, so the gene count is going to be related to the training of the gene finder. I would recommend looking at a few of your large scaffolds in a genome browser. I would also recommend adding a second gene predictor such as Augustus. When multiple predictors are used and the models they predict converge, you can have more confidence in the gene prediction.
>>
>> For the masking, you can make a species-specific repeat library like Carson suggested to see if the gene count comes down a little. If you are concerned about masking duplicated genes you can do a couple of things. You can filter the repeat library based on known proteins. You can also set a copy number minimum for the masking and only include repeats that are present more than 10 times in the genome.
>> Here are a couple of URLs for making species-specific repeat libraries:
>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced
>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Basic
>>
>> Take care,
>> Mike
>>
>>> On Oct 27, 2016, at 5:54 AM, Mohamed Amine CHEBBI wrote:
>>>
>>> Sorry, the X and Y were switched in the plot due to a mishandling. Please find attached now the correct AED graph.
>>>
>>> Round 3 (red curve) shows a slightly higher overall AED than the second round (green curve) and more genes (22931 compared to 22547 in round 2). Do you think that I should stop at the second round?
>>>
>>> I didn't specify in the previous email that the repeat masking was done in Maker using Repbase and only the models found by RepeatModeler that were identified. I left the unknown library of RepeatModeler unmasked. In fact, we expect a high rate of segmental and gene duplication in the genome, which could explain the high overall count of genes found by Maker.
>>>
>>> On the other hand, the high number of genes may also be explained by the fact that I activated the alt_splice=1 option to find alternative splicing. Do you think that it was a good idea?
>>>
>>> Thank you very much for your time.
>>>
>>> Best,
>>> Amine
>>>
>>> On 26/10/2016 at 20:06, Carson Holt wrote:
>>>> Sorry. I also assumed X and Y were flipped when I looked at it. Now that I read the labels, your AED curve would be weird unless the X and Y are flipped in your figure.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>> On Oct 26, 2016, at 12:04 PM, Carson Holt wrote:
>>>>>
>>>>> Your AED curve looks fine. The first run (using protein2genome or est2genome I assume) will always have really low overall AED because they are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always artificially look good).
>>>>> The protein2genome or est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve - value in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further look at models by adding protein domains using InterProScan, in which you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).
>>>>>
>>>>> Your overall gene counts are a little high though for an arthropod (14,000-19,000 genes would be expected as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa). However your gene counts can be explained by either insufficient repeat masking (you can add a RepeatModeler generated library to the existing settings to help with this), poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with more strict assembly parameters including the jaccard-clip option in Trinity), or it is just the result of assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split which results in inflated gene counts).
>>>>>
>>>>> Finally, manually look at the most gene dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so it is easily blacklisted and can be identified visually (always gene dense and single exon).
>>>>>
>>>>> Thanks,
>>>>> Carson
>>>>>
>>>>>
>>>>>> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi <mohamed.amine.chebbi at univ-poitiers.fr> wrote:
>>>>>>
>>>>>> Hi!
>>>>>> I have tried three rounds of annotation in Maker on a non-model arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and Illumina reads.
>>>>>> As suggested in the tutorial, I ran Maker in the first round with repeat masking to generate gene models using transcript (Trinity assembly) and protein (Swiss-Prot) evidence. Then the Maker models were used twice in a bootstrap fashion to retrain SNAP.
>>>>>> The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.
>>>>>>
>>>>>> However, the AED profile (attached) doesn't seem to be satisfactory.
>>>>>> So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that filtering for good transcripts could improve the results? If yes, which criteria should be taken into account?
>>>>>> Thank you.
>>>>>>
>>>>>> Best;
>>>>>> Amine
>>>>>> _______________________________________________
>>>>>> maker-devel mailing list
>>>>>> maker-devel at box290.bluehost.com
>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>
>>>>
>>>
>>> --
>>> Mohamed Amine CHEBBI, PhD Student
>>> Université de Poitiers
>>> Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
>>> Equipe Ecologie Evolution Symbiose
>>> Bât. B8-B35 - 5 Rue Albert Turpin
>>> TSA 51106
>>> F-86022 Poitiers Cedex 9
>>> FRANCE
>>> Lab website: http://ecoevol.labo.univ-poitiers.fr/
>>
>
> --
> Mohamed Amine CHEBBI, PhD Student
> Université de Poitiers
> Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
> Equipe Ecologie Evolution Symbiose
> Bât. B8-B35 - 5 Rue Albert Turpin
> TSA 51106
> F-86022 Poitiers Cedex 9
> FRANCE
> Lab website: http://ecoevol.labo.univ-poitiers.fr/
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
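
Two pieces of advice from this thread, counting "gene" feature lines in the GFF3 and ranking contigs by gene_density = gene_count / contig_length, can be combined in a few lines. This is a minimal sketch under stated assumptions rather than anything posted on the list: it treats column 3 equal to `gene` as the equivalent of the `grep -P -c "\tgene\t"` count, and the contig lengths are passed in as a plain dictionary (hypothetical values below; in practice they would come from the assembly FASTA).

```python
from collections import Counter
from io import StringIO


def count_genes_per_contig(handle):
    """Count GFF3 feature lines whose type column is exactly 'gene', per contig."""
    counts = Counter()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no more feature lines
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            counts[cols[0]] += 1
    return counts


def gene_density(counts, contig_lengths):
    """Return (contig, gene_count / contig_length) pairs, most gene dense first."""
    density = {
        contig: counts.get(contig, 0) / length
        for contig, length in contig_lengths.items()
        if length > 0
    }
    return sorted(density.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Tiny made-up GFF3 fragment and contig lengths, for illustration only.
    demo = StringIO(
        "##gff-version 3\n"
        "ctg1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1\n"
        "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-RA;Parent=g1\n"
        "ctg2\tmaker\tgene\t5\t50\t.\t-\t.\tID=g2\n"
        "ctg2\tmaker\tgene\t60\t90\t.\t-\t.\tID=g3\n"
    )
    counts = count_genes_per_contig(demo)
    print("total genes:", sum(counts.values()))
    for contig, dens in gene_density(counts, {"ctg1": 1000, "ctg2": 100}):
        print(contig, dens)
```

The contigs at the top of the ranking are the ones worth opening first in Apollo or IGV when checking for single-exon, gene-dense contamination.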
URL: From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 09:22:08 2016 From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI) Date: Thu, 27 Oct 2016 17:22:08 +0200 Subject: [maker-devel] Filter transcripts to improve annotation quality ? In-Reply-To: References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com> Message-ID: <69dcf9e0-b736-3f79-082d-1ec2d6d04467@univ-poitiers.fr> Indeed the gene count has been done by the command grep -P -c "\tgene\t" models.gff. I would be careful about repeats, however in the strategy I'm not convinced by the step of searching the sequencesin Modelerunknown.lib against a transposase database, as it has been done yet by the RepeatModeler against the repbase . So I think skip this step. A last question, how to create a Protein database excluding the transposases. Thank you again. Best, Amine Le 27/10/2016 ? 17:08, Carson Holt a ?crit : > not to cou -- Mohamed Amine CHEBBI, PhD Student Universit? de Poitiers Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267 Equipe Ecologie Evolution Symbiose B?t. B8-B35 - 5 Rue Albert Turpin TSA 51106 F-86022 Poitiers Cedex 9 FRANCE Lab website: http://ecoevol.labo.univ-poitiers.fr/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From scott at scottcain.net Fri Oct 28 14:57:07 2016 From: scott at scottcain.net (Scott Cain) Date: Fri, 28 Oct 2016 16:57:07 -0400 Subject: [maker-devel] Call for GMOD talks at PAG Message-ID: Hi, I am pleased to announce a call for talks to be given at the Plant and Animal Genomes conference this January in the GMOD workshop on Wednesday, January 18th. Any talks that involve the development or use of GMOD software are welcome. 
In particular this year, I'd really like to highlight plugins for the various GMOD software packages that support them, like JBrowse, Galaxy and Tripal (of course, Galaxy and Tripal have their own sessions, so you should consider submitting to them too).

Please get an abstract, brief summary or a vague title to me as soon as possible so I can start putting the workshop together. Also, if you'd like to be a co-organizer, please drop me a line about that too. I might be able to get you some meeting-related niceties for not very much work.

For more information about PAG, see:

http://www.intlpag.org

Thanks, and I look forward to seeing you in January,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research

From annabel.beichman at gmail.com Fri Oct 28 17:11:11 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Fri, 28 Oct 2016 16:11:11 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>
 <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com>
 <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com>
 <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
Message-ID: <97D8C047-69C2-4379-AF5C-3E6DAAADA51C@gmail.com>

Re-sending this to the list without attachments, as they were too large.

Cheers,
Annabel

> On Oct 28, 2016, at 4:04 PM, Annabel Beichman wrote:
>
> Hi Carson,
> Re-running Maker without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you!
>
> However, I'm still seeing an odd pattern that I wonder if you have any ideas about:
>
> For the set of ~6000 genes that do not have orthologs in the ferret, but do have start/stop codons and are below an AED/eAED of 0.5, I am seeing duplication of BLAST annotations for ~2,600 of the gene models, particularly gene models that are in a row on a scaffold. I've put the genes with duplicate BLAST annotations into the attached Excel file so you can see the patterns I'm describing.
> For example, there is a similar annotation for two genes in a row on a scaffold, both of which have low AED/eAED scores and start/stop codons (also visualized in the attached JBrowse screenshot):
>
> Scaffold       Start  Stop    Strand  GeneID         mRNALength  #Exons  BlastInfo
> ScbS9RH_82700  41318  49503   +       ELUT_00017706  8185        3       Similar to Cdh13: Cadherin-13 (Mus musculus)
> ScbS9RH_82700  99358  103910  +       ELUT_00017707  4552        3       Similar to Cdh13: Cadherin-13 (Mus musculus)
>
> I am trying to filter out false positive gene models as I make my exome capture design, so I wondered if you had any tips on what might be going on here. Paralogs? Artifacts of the assembly? Is the gene with the most exons likely to be the original gene? Should I filter sets of duplicates by those that have IPR domains?
>
> Secondly, I also notice that 250 of these repeat genes are annotated as 40S or 60S ribosomal protein genes. Do you expect to see this many (I know there are usually many rDNA genes), or could this number be inflated due to ribosomal RNA in the RNA-seq reads? (I carried out poly-A selection prior to sequencing.)
>
> Thanks so much again for your help!
>
> ~ Annabel
>
>> On Oct 17, 2016, at 5:09 PM, Carson Holt wrote:
>>
>> It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want.
>>
>> --Carson
>>
>>> On Oct 17, 2016, at 5:13 PM, Annabel Beichman wrote:
>>>
>>> Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP, as it definitely doesn't show great concordance compared to Augustus.
>>>
>>> Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in RepeatModeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta
>>>
>>> My InterProScan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I'm on the right track.
>>>
>>> Thanks so much again,
>>>
>>> ~ Annabel
>>>
>>>> On Oct 17, 2016, at 1:25 PM, Carson Holt wrote:
>>>>
>>>> Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally, pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another; SNAP tends to have issues with mammalian genomes, for example). Also, when looking at the contigs, if you see contigs consisting of only single exon genes, then you may have some prokaryotic contamination (they assemble as independent gene dense contigs, so this is a good thing to look at if gene counts are high). Finally, high gene counts can mean that repeats are still under-masked (repeats encode real proteins like transposases).
>>>>
>>>> You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain).
>>>>
>>>> --Carson
>>>>
>>>>> On Oct 17, 2016, at 1:20 PM, Annabel Beichman wrote:
>>>>>
>>>>> Hi Carson et al.,
>>>>>
>>>>> Thanks so much for such a great pipeline, tutorials and advice pages.
>>>>>
>>>>> I have just finished four rounds of annotation in Maker on the sea otter genome, which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding.
>>>>>
>>>>> Rounds I & II: In the first two rounds, I trained Augustus and SNAP on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat.
>>>>>
>>>>> Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.
>>>>>
>>>>> Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run Maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb.
>>>>>
>>>>> I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after Round II, 25847 after Round III, and 25031 after Round IV).
>>>>>
>>>>> However, this final gene count (25,031) still seems too high to me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes, 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes.
>>>>>
>>>>> 22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter's closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00; when I visualize them in JBrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold).
>>>>>
>>>>> I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, while also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have BLAST hits to SwissProt. Would you recommend any additional filters on gene length, etc.?
>>>>>
>>>>> Not sure if this is significant, but one thing I've noticed is that many of the genes with BLAST hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold:
>>>>>
>>>>> ScbS9RH_101185  30796   38760   +  ELUT_00004195-RA  ELUT_00004195  Name=ELUT_00004195-RA  0.08  0.17  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>> ScbS9RH_101185  42617   51087   +  ELUT_00004196-RA  ELUT_00004196  Name=ELUT_00004196-RA  0.25  0.26  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>> ScbS9RH_101185  87006   87827   +  ELUT_00004198-RA  ELUT_00004198  Name=ELUT_00004198-RA  0.18  0.18  Similar to Ano3: Anoctamin-3 (Mus musculus)
>>>>> ScbS9RH_101185  110043  122523  +  ELUT_00004199-RA  ELUT_00004199  Name=ELUT_00004199-RA  0.09  0.09  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>>
>>>>> Thank you all so much for your help and advice!
>>>>>
>>>>> [I also want to report an odd behavior that may be specific to our server: when the number of scaffolds being annotated using Maker drops below the number of cores (e.g. using OpenMPI with 45 cores available, but there are only 44 scaffolds left), Maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno's paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]
>>>>>
>>>>> Best wishes,
>>>>> Annabel Beichman
>>>>> Wayne Lab/Lohmueller Lab
>>>>> Ecology & Evolutionary Biology
>>>>> UCLA
>>>>> Annabelbeichman.com

From carsonhh at gmail.com Fri Oct 28 17:23:00 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:23:00 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> Message-ID: <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> You need to look at some of the contigs in a browser. Look at the most gene dense ones first (density = gene_count/contig_length). You may have prokaryiotic contamination if you are seeing a lot of contigs containing primarily single exon gene models. Also make sure you still left model_org=all on after adding the species specific library (the species specific library is to supplement RepBase as opposed to replace it). Some locations where you are seeing neighboring genes with similar blast hits (Cadherin) may infact be one gene that was split, either because evidence insufficiently clusters (perhaps the max intron size is set too low in the control files), or perhaps the assembly has runs of NNNN that do not permit the gene predictor to create a spanning model (not uncommon). If you are using Apollo to view the genes you can zoom in around evidence alignments until you see the sequence, and often you will see clusters of NNNN in the sequence around evidence HSP breakpoints. ?Carson > On Oct 28, 2016, at 5:04 PM, Annabel Beichman wrote: > > Hi Carson, > Re-running Maker without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you! > > However, I?m still seeing an odd pattern that I wonder if you have any ideas about: > > For the set of ~6000 genes that do not have orthologs in the ferret, but do have start/stop codons and are below AED/eAED of 0.5, I am seeing duplication of BLAST annotations for ~2,600 of the gene models, particularly gene models that are in a row on a scaffold. 
I?ve thrown the genes with duplicate blast annotations into the attached excel file so you can see the patterns I?m describing. > For example, there is a similar annotation for two genes in a row on a scaffold, both of which have low AED/eAED scores and start/stop codons (also visualized in attached Jbrowse screenshot): > > Scaffold Start Stop Strand GeneID mRNALength #Exons BlastInfo > ScbS9RH_82700 41318 49503 + ELUT_00017706 8185 3 Similar to Cdh13: Cadherin-13 (Mus musculus) > ScbS9RH_82700 99358 103910 + ELUT_00017707 4552 3 Similar to Cdh13: Cadherin-13 (Mus musculus) > > I am trying to filter out false positive gene models as I make my exome capture design so wondered if you had any tips on what might be going on here. Paralogs? Artifacts of the assembly? Is the gene with the most exons likely to be the original gene? Should I filter sets of duplicates by those that have IPR domains? > > Secondly, I also notice 250 of these repeat genes are annotated as 40S or 60S ribosomal protein genes. Do you expect to see this many (I know there are usually many rDNA genes) or could this number be inflated due to ribosomal RNA in the RNA-seq reads? (I carried out poly-A selection prior to sequencing) > > Thanks so much again for your help! > > ~ Annabel > >> On Oct 17, 2016, at 5:09 PM, Carson Holt wrote: >> >> It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want. >> >> ?Carson >> >> >> >>> On Oct 17, 2016, at 5:13 PM, Annabel Beichman wrote: >>> >>> Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP as it definitely doesn?t show great concordance compared to Augustus. >>> >>> Do you have any additional recommendations for improving my repeat masking? 
I have already made a custom repeat library in repeatmodeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta >>> >>> My interproscan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I?m on the right track. >>> >>> Thanks so much again, >>> >>> ~ Annabel >>> >>> >>>> On Oct 17, 2016, at 1:25 PM, Carson Holt wrote: >>>> >>>> Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another - SNAP tends to have issues with mammalian genomes for example). Also when looking at the contig, if you see contig consisting of only single exon genes then you may have some prokaryotic contamination (they assemble as independent gene dense contigs - so a good thing to look at if gene counts are high). Finally high gene counts can mean that repeats are still under masked (repeats encode real proteins like transposases). >>>> >>>> You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain). 
>>>> >>>> ?Carson >>>> >>>> >>>> >>>>> On Oct 17, 2016, at 1:20 PM, Annabel Beichman wrote: >>>>> >>>>> Hi Carson et al., >>>>> >>>>> Thanks so much for such a great pipeline, tutorials and advice pages. >>>>> >>>>> I have just finished four rounds of annotation in Maker on the sea otter genome which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding. >>>>> >>>>> Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat. >>>>> >>>>> Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb. >>>>> >>>>> Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb. >>>>> >>>>> I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after round II, 25847 after round III, and 25031 after round IV). >>>>> >>>>> However, this final gene count (25,031) still seems to high too me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes; 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes. >>>>> >>>>> 22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter?s closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00 ? 
when I visualize them in Jbrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold). >>>>> >>>>> I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have Blast hits to SwissProt? Would you recommend any additional filters on gene length, etc.? >>>>> >>>>> >>>>> Not sure if this is significant, but one thing I?ve noticed is that many of the genes with Blast hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold: >>>>> ScbS9RH_101185 30796 38760 + ELUT_00004195-RA ELUT_00004195 Name=ELUT_00004195-RA 0.08 0.17 Similar to ANO3: Anoctamin-3 (Homo sapiens) >>>>> ScbS9RH_101185 42617 51087 + ELUT_00004196-RA ELUT_00004196 Name=ELUT_00004196-RA 0.25 0.26 Similar to ANO3: Anoctamin-3 (Homo sapiens) >>>>> ScbS9RH_101185 87006 87827 + ELUT_00004198-RA ELUT_00004198 Name=ELUT_00004198-RA 0.18 0.18 Similar to Ano3: Anoctamin-3 (Mus musculus) >>>>> ScbS9RH_101185 110043 122523 + ELUT_00004199-RA ELUT_00004199 Name=ELUT_00004199-RA 0.09 0.09 Similar to ANO3: Anoctamin-3 (Homo sapiens) >>>>> >>>>> Thank you all so much for your help and advice! >>>>> >>>>> [I also want to report an odd behavior, that may be specific to our server ? when the number of scaffolds being annotated using maker drops below the number of cores (e.g. 
usning openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno?s paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.] >>>>> >>>>> Best wishes, >>>>> Annabel Beichman >>>>> Wayne Lab/Lohmueller Lab >>>>> Ecology & Evolutionary Biology >>>>> UCLA >>>>> Annabelbeichman.com >>>>> _______________________________________________ >>>>> maker-devel mailing list >>>>> maker-devel at box290.bluehost.com >>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>>> >>> >> > From carsonhh at gmail.com Fri Oct 28 17:27:59 2016 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 28 Oct 2016 17:27:59 -0600 Subject: [maker-devel] Too many genes? In-Reply-To: <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> Message-ID: <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com> Also if you labeled putative function using BLAST results, make sure you set the expect value sufficiently low to filter out false homology. Otherwise you will be labeling off the best hit, which may in fact have a very poor score, but because it?s the best one. The threshold value should never be higher than 1e-6. You can go all the way down to 1e-10 if necessary. ?Carson > On Oct 28, 2016, at 5:23 PM, Carson Holt wrote: > > You need to look at some of the contigs in a browser. Look at the most gene dense ones first (density = gene_count/contig_length). 
You may have prokaryiotic contamination if you are seeing a lot of contigs containing primarily single exon gene models. Also make sure you still left model_org=all on after adding the species specific library (the species specific library is to supplement RepBase as opposed to replace it). > > Some locations where you are seeing neighboring genes with similar blast hits (Cadherin) may infact be one gene that was split, either because evidence insufficiently clusters (perhaps the max intron size is set too low in the control files), or perhaps the assembly has runs of NNNN that do not permit the gene predictor to create a spanning model (not uncommon). If you are using Apollo to view the genes you can zoom in around evidence alignments until you see the sequence, and often you will see clusters of NNNN in the sequence around evidence HSP breakpoints. > > ?Carson > > > >> On Oct 28, 2016, at 5:04 PM, Annabel Beichman wrote: >> >> Hi Carson, >> Re-running Maker without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you! >> >> However, I?m still seeing an odd pattern that I wonder if you have any ideas about: >> >> For the set of ~6000 genes that do not have orthologs in the ferret, but do have start/stop codons and are below AED/eAED of 0.5, I am seeing duplication of BLAST annotations for ~2,600 of the gene models, particularly gene models that are in a row on a scaffold. I?ve thrown the genes with duplicate blast annotations into the attached excel file so you can see the patterns I?m describing. 
>> For example, there is a similar annotation for two genes in a row on a scaffold, both of which have low AED/eAED scores and start/stop codons (also visualized in attached Jbrowse screenshot): >> >> Scaffold Start Stop Strand GeneID mRNALength #Exons BlastInfo >> ScbS9RH_82700 41318 49503 + ELUT_00017706 8185 3 Similar to Cdh13: Cadherin-13 (Mus musculus) >> ScbS9RH_82700 99358 103910 + ELUT_00017707 4552 3 Similar to Cdh13: Cadherin-13 (Mus musculus) >> >> I am trying to filter out false positive gene models as I make my exome capture design so wondered if you had any tips on what might be going on here. Paralogs? Artifacts of the assembly? Is the gene with the most exons likely to be the original gene? Should I filter sets of duplicates by those that have IPR domains? >> >> Secondly, I also notice 250 of these repeat genes are annotated as 40S or 60S ribosomal protein genes. Do you expect to see this many (I know there are usually many rDNA genes) or could this number be inflated due to ribosomal RNA in the RNA-seq reads? (I carried out poly-A selection prior to sequencing) >> >> Thanks so much again for your help! >> >> ~ Annabel >> >>> On Oct 17, 2016, at 5:09 PM, Carson Holt wrote: >>> >>> It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want. >>> >>> ?Carson >>> >>> >>> >>>> On Oct 17, 2016, at 5:13 PM, Annabel Beichman wrote: >>>> >>>> Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP as it definitely doesn?t show great concordance compared to Augustus. >>>> >>>> Do you have any additional recommendations for improving my repeat masking? 
I have already made a custom repeat library in repeatmodeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta >>>> >>>> My interproscan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I?m on the right track. >>>> >>>> Thanks so much again, >>>> >>>> ~ Annabel >>>> >>>> >>>>> On Oct 17, 2016, at 1:25 PM, Carson Holt wrote: >>>>> >>>>> Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another - SNAP tends to have issues with mammalian genomes for example). Also when looking at the contig, if you see contig consisting of only single exon genes then you may have some prokaryotic contamination (they assemble as independent gene dense contigs - so a good thing to look at if gene counts are high). Finally high gene counts can mean that repeats are still under masked (repeats encode real proteins like transposases). >>>>> >>>>> You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain). 
>>>>> >>>>> ?Carson >>>>> >>>>> >>>>> >>>>>> On Oct 17, 2016, at 1:20 PM, Annabel Beichman wrote: >>>>>> >>>>>> Hi Carson et al., >>>>>> >>>>>> Thanks so much for such a great pipeline, tutorials and advice pages. >>>>>> >>>>>> I have just finished four rounds of annotation in Maker on the sea otter genome which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding. >>>>>> >>>>>> Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat. >>>>>> >>>>>> Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb. >>>>>> >>>>>> Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb. >>>>>> >>>>>> I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after round II, 25847 after round III, and 25031 after round IV). >>>>>> >>>>>> However, this final gene count (25,031) still seems to high too me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes; 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes. >>>>>> >>>>>> 22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter?s closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00 ? 
when I visualize them in Jbrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold). >>>>>> >>>>>> I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have Blast hits to SwissProt? Would you recommend any additional filters on gene length, etc.? >>>>>> >>>>>> >>>>>> Not sure if this is significant, but one thing I?ve noticed is that many of the genes with Blast hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold: >>>>>> ScbS9RH_101185 30796 38760 + ELUT_00004195-RA ELUT_00004195 Name=ELUT_00004195-RA 0.08 0.17 Similar to ANO3: Anoctamin-3 (Homo sapiens) >>>>>> ScbS9RH_101185 42617 51087 + ELUT_00004196-RA ELUT_00004196 Name=ELUT_00004196-RA 0.25 0.26 Similar to ANO3: Anoctamin-3 (Homo sapiens) >>>>>> ScbS9RH_101185 87006 87827 + ELUT_00004198-RA ELUT_00004198 Name=ELUT_00004198-RA 0.18 0.18 Similar to Ano3: Anoctamin-3 (Mus musculus) >>>>>> ScbS9RH_101185 110043 122523 + ELUT_00004199-RA ELUT_00004199 Name=ELUT_00004199-RA 0.09 0.09 Similar to ANO3: Anoctamin-3 (Homo sapiens) >>>>>> >>>>>> Thank you all so much for your help and advice! >>>>>> >>>>>> [I also want to report an odd behavior, that may be specific to our server ? when the number of scaffolds being annotated using maker drops below the number of cores (e.g. 
using openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno's paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]
>>>>>>
>>>>>> Best wishes,
>>>>>> Annabel Beichman
>>>>>> Wayne Lab/Lohmueller Lab
>>>>>> Ecology & Evolutionary Biology
>>>>>> UCLA
>>>>>> Annabelbeichman.com
>>>>>> _______________________________________________
>>>>>> maker-devel mailing list
>>>>>> maker-devel at box290.bluehost.com
>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From annabel.beichman at gmail.com Fri Oct 28 17:36:03 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Fri, 28 Oct 2016 16:36:03 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>
Message-ID: <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>

Thank you so much, Carson, for such a rapid reply!

I have checked the prokaryotic issue and it looks okay - my most gene-dense contigs all have multi-exon genes. I will re-blast with a more stringent cutoff as well. I think your theory about the NNNNNNs might be spot on. The assembly is by Dovetail Genomics and they insert many NNNNNs as they join contigs together into the long scaffolds, which would disrupt the gene models. Is there any way to salvage the genes that are split around the NNNNs? Or should I just leave them out of my analyses?

Thanks again,
~ Annabel

From carsonhh at gmail.com Fri Oct 28 17:49:27 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:49:27 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com> <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>
Message-ID: <07C987F9-1354-4DB6-A63F-9B23F2006871@gmail.com>

The NNNN's preclude both alignment and prediction, so unless they occur in an intron, the result is a split model (many times runs of NNN may just be a few base pairs long, but if they occur in the exon, you can't really work around it). The predictors work off of a maximum score, so the ab initio predictor ends up finding some way of terminating the model around the NNN's that scores well even though it does not reflect the biology. Sometimes you can try and force things in manually (non-canonical splice sites etc.) if it is an important gene (Web-Apollo even allows you to insert SNPs and INDELS to correct the ORF, but it's a labor intensive manual process).

So, short answer: you should investigate whether you see these in a browser. If you do have them, then you will have to decide how to handle them depending on the analysis (perhaps take the longer one?).
Take some time just viewing alignments and models to get a feel for how evidence and gene models should correlate. There really is no substitute for visual manual review.

--Carson

From jacques.dainat at bils.se Mon Oct 31 04:51:29 2016
From: jacques.dainat at bils.se (Jacques Dainat)
Date: Mon, 31 Oct 2016 11:51:29 +0100
Subject: [maker-devel] est_gff input does not provide any gene model
Message-ID:

Hello,

I usually use Cufflinks output to feed Maker through the est_gff parameter; combined with the est2genome=1 parameter, I get the wanted output.
This time I used StringTie output to feed Maker, but no gene model is predicted with the est2genome parameter.

Any explanation? Is it due to the GFF3 format differences between these two files?

Cufflinks output example:
Pnalgiovense_4592 Cufflinks match 363 977 17.844829 - . ID=1:s3_c1_r1.4.2;Name=1:s3_c1_r1.4.2;
Pnalgiovense_4592 Cufflinks match_part 363 666 17.844829 - . ID=1:s3_c1_r1.4.2:exon-1;Name=1:s3_c1_r1.4.2;Parent=1:s3_c1_r1.4.2;Target=1:s3_c1_r1.4.2 1 304 +;
Pnalgiovense_4592 Cufflinks match_part 743 977 17.844829 - . ID=1:s3_c1_r1.4.2:exon-2;Name=1:s3_c1_r1.4.2;Parent=1:s3_c1_r1.4.2;Target=1:s3_c1_r1.4.2 305 539 +;

StringTie output example:
Pnalgiovense_112 StringTie gene 20 1256 1000 + . ID=HtMm_All.12253;cov=8.028295;fPKM=1.214491;gene_id=HtMm_All.12253;tPM=2.706611;transcript_id=HtMm_All.12253.1
Pnalgiovense_112 StringTie mRNA 20 1256 1000 + . ID=HtMm_All.12253.1;Parent=HtMm_All.12253;cov=8.028295;fPKM=1.214491;gene_id=HtMm_All.12253;tPM=2.706611;transcript_id=HtMm_All.12253.1
Pnalgiovense_112 StringTie exon 20 1256 1000 + . ID=HtMm_All.12253.1-exon-1;Parent=HtMm_All.12253.1;cov=8.028295;exon_number=1;gene_id=HtMm_All.12253;transcript_id=HtMm_All.12253.1

If it's the StringTie output that is problematic, how can I fix it? Is it enough to remove the gene features, change mRNA to match, and change exon to match_part?

Best regards,

Jacques Dainat, PhD
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service

Address: (room E10:4204 - last floor)
Uppsala University, BMC
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: 01 84 71 46 25
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Oct 31 21:24:03 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Oct 2016 21:24:03 -0600
Subject: [maker-devel] est_gff input does not provide any gene model
In-Reply-To:
References:
Message-ID:

Evidence such as est_gff has to follow the alignment format used by GFF3 (i.e. match/match_part), whereas you are providing gene models (i.e. gene/mRNA/exon/CDS). Note that match/match_part are two-level features, whereas gene models have three levels. You need to reformat to match/match_part.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From allisonfuiten at gmail.com Mon Oct 31 18:34:23 2016
From: allisonfuiten at gmail.com (Allison Fuiten)
Date: Mon, 31 Oct 2016 17:34:23 -0700
Subject: [maker-devel] InterProScan protein domain & AED physical evidence filtering
Message-ID:

Hello MAKER google group,

For the final round of a MAKER annotation for a de novo plant genome assembly, I ran MAKER twice: once with keep_preds=0, which annotated 20,284 genes, and once with keep_preds=1, which annotated 34,055 genes.
I ran the 34,055 genes (the keep_preds=1 set) through InterProScan to search the MAKER predictions for protein domain content and added this IPRScan output into the MAKER gff file with the ipr_update_gff accessory script. The game plan is to go through the 34,055 genes and remove any gene model that doesn't have either protein domain content or physical evidence. I am counting genes that have an AED=1 as the genes that don't have physical evidence.

I have two questions:

1. I count 11,762 genes that have AED=1.0 in the keep_preds=1 annotation set, which leaves me with 22,293 genes that I'm assuming have some physical evidence (34,055-11,762=22,293). But when I ran MAKER with keep_preds=0 originally, I only count 20,284 genes. What are the extra ~2,000 genes that are being annotated in the keep_preds=1 run that have an AED score of less than 1.0, but are not being annotated in the keep_preds=0 run?

2. My second question is whether there is an accessory script available that will remove genes that lack either the IPRScan protein domains or physical evidence (AED < 1). This type of gene removal was mentioned in a previous post from 2012 (https://groups.google.com/forum/#!searchin/maker-devel/sorry$20there$27s$20not$20a$20script$20prepackaged$20with$20MAKER$20for$20that$20yet.%7Csort:relevance/maker-devel/VaoXWlGHOjs/EElr_otrK8QJ) and I was just wondering if someone has since written a script that will do this for me.

If anyone could offer me any feedback, that would be greatly appreciated!

Thank you,
Allison
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mohamed.amine.chebbi at univ-poitiers.fr Mon Oct 10 03:43:21 2016 From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI) Date: Mon, 10 Oct 2016 11:43:21 +0200 Subject: [maker-devel] Combining and merging two Maker annotation gff files ? Message-ID: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> Hi! I?m using the latest version of Maker2 to annotate an arthropod genome. First, I have run RepeatModeler to create rmlib for Maker, then I have followed two independent annotation strategies on the same assembly : 1- Passing throw Maker all the repeats collected by RepeatModeler ( Identified repeats in the Repbase + Unkown Models). 2- Passing throw Maker only the identified repeats. Both annotations work successfully. The first annotation gives me 19048 genes against 22931 done by the second one. Know, I'm seeing for a mean to merge the two annotation gff files without _doing a re-annotation _and by taking the best and non redundant supported gene models . So, do you think that configuring the maker options as below, could resolve this issue : maker_gff=1-mask-all.gff,2-mask-onlyKnown.gff #MAKER derived GFF3 file #MAKER derived GFF3 file est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no pred_pass=1 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no -- Mohamed Amine CHEBBI, PhD Student Universit? de Poitiers -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Oct 11 14:05:50 2016 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 11 Oct 2016 14:05:50 -0600 Subject: [maker-devel] Combining and merging two Maker annotation gff files ? 
In-Reply-To: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
Message-ID:

Masking doesn't just affect the gene models, but also evidence alignment and thus scoring. So merging in this way would not make much sense, as the second, less masked set would always score better because it has more evidence alignments permitted by the lack of masking (not necessarily real, but drawn in by repeats). The result would be that any attempt at a merge would almost exclusively result in all genes from the second set always scoring higher.

--Carson

> On Oct 10, 2016, at 3:43 AM, Mohamed Amine CHEBBI wrote:
>
> Hi!
>
> I'm using the latest version of Maker2 to annotate an arthropod genome. First, I ran RepeatModeler to create rmlib for Maker, then I followed two independent annotation strategies on the same assembly:
> 1- Passing through Maker all the repeats collected by RepeatModeler (identified repeats in Repbase + unknown models).
> 2- Passing through Maker only the identified repeats.
>
> Both annotations work successfully. The first annotation gives me 19048 genes against 22931 for the second one. Now, I'm looking for a way to merge the two annotation gff files without doing a re-annotation, taking the best-supported, non-redundant gene models.
>
> So, do you think that configuring the maker options as below could resolve this issue:
> maker_gff=1-mask-all.gff,2-mask-onlyKnown.gff #MAKER derived GFF3 file
> #MAKER derived GFF3 file
> est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no
> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
> protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
> rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no
> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
> pred_pass=1 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>
> --
> Mohamed Amine CHEBBI, PhD Student
> Université de Poitiers

From aravindp at imcb.a-star.edu.sg Mon Oct 17 00:45:59 2016
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Mon, 17 Oct 2016 06:45:59 +0000
Subject: [maker-devel] Maker MPI installation error and IO error for serial version
Message-ID:

Hi,

I'm trying to install Maker in my cluster account. I have installed all the dependencies, but there are two issues for which I would like to get a solution. I tried to find one in the forums, without success.

1. MPI installation
"flock: Function not implemented error" at src/lib/Parallel/Application/MPI.pm line 256.

./Build install

Configuring MAKER with MPI support
flock: Function not implemented
at /scratch/tools/maker_mpi/src/lib/Parallel/Application/MPI.pm line 256.
Parallel::Application::MPI::_bind("/app/openmpi/1.10.3/intel_java/bin/mpicc", "/app/openmpi/1.10.3/intel_java/include", "blib", "") called at /scratch/users/astar/imcb/aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 277
MAKER::Build::ACTION_build(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010
Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "build") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1993
Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0), "build") called at /aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 469
MAKER::Build::ACTION_install(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010
Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "install") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1998
Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0)) called at ./Build line 69

2. When I run a serial version of Maker, I get the following errors in the "makerlog.e" file.

DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 109.
DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 111.
DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 113.
DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 191.
DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 390.

Please help me with these errors as early as possible. I have double-checked all the dependencies and the file paths given while running Maker.
Awaiting your reply!
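
Both errors above involve file locking on the /scratch mount. For readers who hit the same messages, one way to check whether a directory supports flock at all is sketched below in Python. This is only a sketch under stated assumptions: the function name supports_flock is hypothetical, the probe is not part of MAKER, and it only reports whether a lock can be taken from the node where it runs.

```python
import errno
import fcntl
import os
import tempfile


def supports_flock(directory):
    """Return True if an exclusive flock() can be taken on a file in `directory`.

    Hypothetical helper for illustration; it creates and removes a small
    probe file inside `directory`.
    """
    fd, path = tempfile.mkstemp(prefix="flock_probe_", dir=directory)
    try:
        try:
            # Non-blocking exclusive lock on the freshly created probe file.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            # ENOSYS is the errno behind "Function not implemented";
            # ENOLCK and EOPNOTSUPP are other ways a mount can refuse locks.
            if exc.errno in (errno.ENOSYS, errno.ENOLCK, errno.EOPNOTSUPP):
                return False
            raise
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        # Always clean up the probe file, whether or not locking worked.
        os.close(fd)
        os.unlink(path)
```

Calling supports_flock("/scratch") on a compute node (the path is taken from the trace above) would return False if the mount refuses locks in one of these ways, which is the kind of result worth passing on to the cluster administrator.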

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institue of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http:/www.imcb.a-star.edu.sg/

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.

From mark.ebbert at gmail.com Thu Oct 13 15:57:50 2016
From: mark.ebbert at gmail.com (Mark Ebbert)
Date: Thu, 13 Oct 2016 14:57:50 -0700
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
Message-ID: <57fffd715f83340001fcf47d@polymail.io>

Hi,

I've been working with maker for several months off and on with varying success. It worked great the first time I ran it, but ever since, it fails every run without any specific errors. It just says that one of the processes failed. I've been limping along by just running the following command to remove any locks and re-starting: "find . -name *.NFSLock* -exec rm {} \;"

This has been working, but for some reason maker started over from the beginning and lost all of the previous work! I don't even know where to start interrogating. Should I nuke the whole maker directory structure and start from scratch? Maybe something got corrupted?

I already deleted the log files before I realized maker started over, because the log files get way too big.

I really appreciate your help!

Mark T. W. Ebbert

From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 12 03:44:48 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (chebbi mohamed amine)
Date: Wed, 12 Oct 2016 11:44:48 +0200 (CEST)
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr>
Message-ID: <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>

Thank you Carson for your quick response.

Sorry, I have another question concerning Augustus training. You previously posted on the mailing list a link to an explanation of the Augustus training steps: http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html . Unfortunately the link doesn't work anymore. Otherwise, could you explain how to filter the gff file produced by the first run of Maker to get the best full-length ORFs as a set of gene models to train Augustus?

Best,
Amine

From: "Carson Holt"
To: "Mohamed Amine CHEBBI"
Cc: maker-devel at yandell-lab.org
Sent: Tuesday 11 October 2016 22:05:50
Subject: Re: [maker-devel] Combining and merging two Maker annotation gff files ?

Masking doesn't just affect the gene models, but also evidence alignment and thus scoring. So merging in this way would not make much sense, as the second, less masked set would always score better because it has more evidence alignments permitted by the lack of masking (not necessarily real, but drawn in by repeats). The result would be that any attempt at a merge would almost exclusively result in all genes from the second set always scoring higher.

--Carson

On Oct 10, 2016, at 3:43 AM, Mohamed Amine CHEBBI < mohamed.amine.chebbi at univ-poitiers.fr > wrote:

Hi!

I'm using the latest version of Maker2 to annotate an arthropod genome. First, I ran RepeatModeler to create rmlib for Maker, then I followed two independent annotation strategies on the same assembly:
1- Passing through Maker all the repeats collected by RepeatModeler (identified repeats in Repbase + unknown models).
2- Passing through Maker only the identified repeats.

Both annotations work successfully. The first annotation gives me 19048 genes against 22931 for the second one. Now, I'm looking for a way to merge the two annotation gff files without doing a re-annotation, taking the best-supported, non-redundant gene models.

So, do you think that configuring the maker options as below could resolve this issue:
maker_gff=1-mask-all.gff,2-mask-onlyKnown.gff #MAKER derived GFF3 file
#MAKER derived GFF3 file
est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=1 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers

From carsonhh at gmail.com Mon Oct 17 12:17:17 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 12:17:17 -0600
Subject: [maker-devel] Maker MPI installation error and IO error for serial version
In-Reply-To:
References:
Message-ID:

It's saying your system has no flock (file locking). For NFS mounts this is usually a configuration set by the administrator. At the very least they can enable lock emulation in NFS, which is what your scratch seems to be. Unfortunately SQLite will not work without this.

You can still get MAKER to install with MPI by removing the lock used during setup (do this by editing line 210 of .../maker/src/lib/Parallel/Application/MPI.pm).

Turn this -->
$lock = new File::NFSLock("$loc/_MPI", 'EX', 300, 40) while(!$lock);

To this (i.e. comment out line 210) -->
#$lock = new File::NFSLock("$loc/_MPI", 'EX', 300, 40) while(!$lock);

However, there is no workaround for the SQLite IO error. It requires that your administrator enable locks or lock emulation (for example, setting nolock,local_lock=all will cause the system to emulate locks on NFS locally). So while they are not exactly real locks, they won't fail.

Thanks,
Carson

> On Oct 17, 2016, at 12:45 AM, Aravind PRASAD wrote:
>
> Hi,
>
> I'm trying to install Maker in my cluster account. I have installed all the dependencies, but there are two issues for which I would like to get a solution. I tried to find one in the forums, without success.
> 1. MPI installation
> "flock: Function not implemented error" at src/lib/Parallel/Application/MPI.pm line 256.
> ./Build install
>
> Configuring MAKER with MPI support
> flock: Function not implemented
> at /scratch/tools/maker_mpi/src/lib/Parallel/Application/MPI.pm line 256.
> Parallel::Application::MPI::_bind("/app/openmpi/1.10.3/intel_java/bin/mpicc", "/app/openmpi/1.10.3/intel_java/include", "blib", "") called at /scratch/users/astar/imcb/aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 277
> MAKER::Build::ACTION_build(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010
> Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "build") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1993
> Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0), "build") called at /aravindp/tools/maker_mpi/src/inc/lib/MAKER/Build.pm line 469
> MAKER::Build::ACTION_install(MAKER::Build=HASH(0x1618ac0)) called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 2010
> Module::Build::Base::_call_action(MAKER::Build=HASH(0x1618ac0), "install") called at /scratch/tools/myperl/lib/perl5/Module/Build/Base.pm line 1998
> Module::Build::Base::dispatch(MAKER::Build=HASH(0x1618ac0)) called at ./Build line 69
>
> 2. When I run a serial version of Maker, I get the following errors in the "makerlog.e" file.
>
> DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 109.
> DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 111.
> DBD::SQLite::db do failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 113.
> DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 191.
> DBD::SQLite::db selectcol_arrayref failed: disk I/O error at /scratch/tools/maker/bin/../lib/GFFDB.pm line 390.
>
> Please help me with these errors as early as possible. I have double-checked all the dependencies and the file paths given while running Maker.
> Awaiting your reply!
>
> Regards,
> Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institue of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
> 61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http:/www.imcb.a-star.edu.sg/
>
> Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.

From carson.holt at genetics.utah.edu Mon Oct 17 12:25:54 2016
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Mon, 17 Oct 2016 18:25:54 +0000
Subject: [maker-devel] question about Maker2
In-Reply-To:
References: <56F4066F.4000803@fgcz.ethz.ch> <01AB4222AE1B7E41A3B5CAEC445F192B3F71EB84@MBX115.d.ethz.ch> <3470AFC0-7B3A-485C-A86E-C7DE5A341C3C@genetics.utah.edu> <57270F57.50208@fgcz.ethz.ch> <5A09C696-CBD0-4DA9-8CB6-B994981E00D3@genetics.utah.edu> <01AB4222AE1B7E41A3B5CAEC445F192B3F747251@MBX115.d.ethz.ch> <89F7DE68-6FFF-4E17-B867-8E699D3DE986@genetics.utah.edu> <01AB4222AE1B7E41A3B5CAEC445F192B3F752945@MBX215.d.ethz.ch> <1DB8E975-3E54-455D-8852-2DD2937B2FCF@genetics.utah.edu>
Message-ID: <8D22D8B2-73DC-4276-8B2D-BDEF8ECDFBE7@genetics.utah.edu>

> what is the difference between files
>
> 1) ContigXXX.maker.non_overlapping_ab_initio.proteins.fasta

Non-redundant non-overlapping models (i.e. the subset of snap/augustus models that do not overlap a final MAKER selected model).

> and
>
> 2) ContigXXX.maker.augustus_masked.proteins.fasta

Contains all raw augustus models called without hints (i.e. the equivalent of just running Augustus on its own).

> None of these should have EST info (as the sequences headers are
>
> 1) augustus_masked-1-processed-gene-

This was a raw augustus model that may or may not have UTR added using EST info (i.e. the model came straight from Augustus, so no hints were used to produce the model, but MAKER did try to add UTR).

> and
>
> 2) augustus_masked-1-abinit-gene-

Model straight from Augustus. No hints, and no MAKER attempt to add UTR. These are raw unmodified models and will never be in the final selected set.

> so no "maker-XXX)

maker-XXX means it was a hint-derived model and not a raw Augustus model.

> Should file 2 just be ignored and 1) be kept aside the maker file, where EST/protein evidence is incorporated?

Ignore all the abinit files. They are for reference purposes only. The non-overlapping file can be used to see what was rejected and does not overlap a current model (i.e. you may be able to find a handful of false negatives that can be rescued with domain analysis using something like InterProScan).

--Carson

> Thanks,
>
> G
>
> On 5/18/16 11:31 AM, Carson Holt wrote:
>> Hi Giancarlo,
>>
>> There was no image attached. If you can, just send me the contig GFF3, and I can look at it in apollo (which lets me manipulate reading frame and display splice sites). Then I can tell you more. Basically the gene models are the result of an HMM for gene patterns plus hints to alter probability around evidence-suggested sites. If there is any issue with the reading frame (can be a single bp assembly error) then no amount of hints can force a broken CDS to be coding, and the predictor will do the best it can to still produce a workable model (i.e. truncate exons, skip exons, etc). Also, if your mRNA-seq is not aligned correctly around a canonical splice site (i.e. overhang beyond splice acceptor) then that hint may be ignored.
>>
>> --Carson
>>
>>> On May 17, 2016, at 4:50 AM, Russo Giancarlo wrote:
>>>
>>> Hi Carson, thanks again for all your answers.
>>> A (hopefully) final question: in the image attached you can see an IGV sashimi plot of RNA-seq data, with the annotated gene derived from Maker; what could be the reason that in the gene model the two bits on the sides (UTRs?), which show high coverage from the RNA-seq data and plenty of splice junctions with the neighbouring exons, are completely missing?
>>>
>>> In this run I have used a closely related species from the augustus database for gene prediction, RNA-seq based de novo assembled transcripts as EST, and protein sequences from the same closely related species. I have masked using a customized library built following the guidelines in the tutorial.
>>>
>>> Thanks,
>>> Giancarlo
>>>
>>> Giancarlo Russo, Ph.D.
>>> Functional Genomics Center Zurich
>>> ETH Zurich / University of Zurich
>>> Winterthurerstrasse 190 / Y32 H66
>>> CH-8057 Zurich
>>>
>>> Phone: +41 44 635 3964
>>> Fax: +41 44 635 3922
>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>> http://www.fgcz.ch
>>> ________________________________________
>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>> Sent: 09 May 2016 18:02
>>> To: Russo Giancarlo
>>> Subject: Re: question about Maker2
>>>
>>> For training gene predictors with protein and EST --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>>>
>>> If reusing MAKER results I don't recommend GFF3 passthrough. The GFF3 option is for getting non-MAKER-sourced results into MAKER. You will actually lose some functionality by passing in MAKER-sourced results as GFF3 (MAKER can't do things with GFF3 that it can do with self-generated data).
>>>
>>> It is best to just rerun MAKER in the same directory; it will reuse previous reports it finds in the datastore.
>>>
>>> --Carson
>>>
>>>> On May 3, 2016, at 2:08 AM, Russo Giancarlo wrote:
>>>>
>>>> OK, thanks a lot, now it is clear.
>>>>
>>>> About the passthrough procedure, would you have any particular advice on what would be the best strategy to run it?
>>>> I have tried an existing organism in Augustus but the results were not too good.
>>>>
>>>> I have both EST and protein evidence, so I thought I could use EST to infer ab-initio and produce a first annotation and then run a second pass using the first gff maker file as ab-initio.
>>>>
>>>> Any advice would be appreciated.
>>>>
>>>> Best and thanks again.
>>>> Giancarlo
>>>> ________________________________________
>>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>>> Sent: 02 May 2016 18:16
>>>> To: Russo Giancarlo
>>>> Subject: Re: question about Maker2
>>>>
>>>> As part of the MAKER job, it runs Snap and Augustus on their own before aligning evidence and generating hints for the later run. The Contig2.maker.augustus.transcripts.fasta are just the results of that uninformed Augustus run. They are not the final gene models, they are just the raw uninformed Augustus models. They are there for reference purposes only. They are what you would have gotten by just running Augustus directly on the assembly without any additional input (i.e. what Augustus would have produced on its own outside of MAKER).
>>>>
>>>> --Carson
>>>>
>>>>> On May 2, 2016, at 2:27 AM, giancarlo.russo wrote:
>>>>>
>>>>> Hi Carson,
>>>>> sorry to bother you again, I still don't understand the difference between
>>>>>
>>>>> 1) Contig2.maker.augustus.transcripts.fasta
>>>>> and
>>>>> 2) Contig2.maker.transcripts.fasta
>>>>>
>>>>> If 1) contains the transcripts "Produced by maker sending hints to augustus to modify scoring against the HMM", and these hints are derived from EST/protein evidence, what extra information is used/extra steps are performed to produce 3) ?
>>>>>
>>>>> Also, how is a passthrough using a first-pass, maker-produced gff annotation file best done?
>>>>> Should this gff file be used for ab-initio gene models that are then corrected with EST and protein evidence?
>>>>> Does it make sense to use augustus when a first-pass gff file is available? Do these two options (ab-initio based on first-pass gff and augustus switched on) exclude each other?
>>>>>
>>>>> Thanks again for your time and help.
>>>>>
>>>>> Best,
>>>>> G
>>>>> On 29/03/16 17:42, Carson Holt wrote:
>>>>>> Yes. The ESTs generate hints as to both intron location and exon location. The protein alignments generate CDS location hints. Each algorithm has different ways to feed hints, with Augustus being the most advanced. It allows separate bonuses for partial vs exact matches, and you can optionally link hints so they have to be matched as a group. It also offers many other hint types like splice donor and acceptor hints. However we really only use the intron, exon, and CDS hints. We also use the partial match bonus.
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>> On Mar 29, 2016, at 7:50 AM, Russo Giancarlo wrote:
>>>>>>>
>>>>>>> Hi Carson, thanks a lot for your answer.
>>>>>>>
>>>>>>> So let's see if I get it correctly.
>>>>>>> In the final datastore I have the fasta files named
>>>>>>>
>>>>>>> 1)Contig2.maker.augustus.transcripts.fasta
>>>>>>> 2)Contig2.maker.non_overlapping_ab_initio.transcripts.fasta
>>>>>>> 3)Contig2.maker.transcripts.fasta
>>>>>>>
>>>>>>> 1) contains the transcripts "Produced by maker sending hints to augustus to modify scoring against the HMM"
>>>>>>> 2) contains the transcripts predicted only by the ab initio algorithm (e.g. augustus)
>>>>>>> 3) contains the transcripts with a full gene model based on ab initio + EST and/or PROTEIN
>>>>>>>
>>>>>>> However, what "hints" are sent by maker to augustus? If these are EST/PROTEIN hints, then what is the difference between 1) and 3) ?
>>>>>>>
>>>>>>> Thanks again for your help and sorry for bothering.
>>>>>>>
>>>>>>> Best,
>>>>>>> Giancarlo
>>>>>>> ________________________________________
>>>>>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>>>>>> Sent: 24 March 2016 21:56
>>>>>>> To: maker-devel
>>>>>>> Cc: Russo Giancarlo; Mark Yandell
>>>>>>> Subject: Re: question about Maker2
>>>>>>>
>>>>>>> Hi Giancarlo,
>>>>>>>
>>>>>>> Anything listed as something like maker-*-augustus was a result of MAKER sending hints to augustus, and anything like augustus-*-abinit was the result of augustus run directly from the HMM without hints.
>>>>>>>
>>>>>>> Here is more detail on the format -->
>>>>>>> - - -gene- -
>>>>>>>
>>>>>>> Top level possibilities:
>>>>>>> maker #maker generated model
>>>>>>> snap_masked #snap run on masked sequence
>>>>>>> augustus_masked #augustus run on masked sequence
>>>>>>> etc.
>>>>>>>
>>>>>>> Internal source:
>>>>>>> abinit #ab initio model direct from HMM
>>>>>>> snap #hints provided to SNAP (alters scoring)
>>>>>>> augustus #hints provided to augustus (alters scoring)
>>>>>>>
>>>>>>> Then chunk and iterator are just to generate a unique ID.
>>>>>>>
>>>>>>> Example:
>>>>>>> augustus_masked-scaffold11899-abinit-gene-0.6 #Produced by Augustus on masked sequence using raw HMM (no MAKER intervention).
>>>>>>> maker-scaffold11899-augustus-gene-0.6 #Produced by maker sending hints to augustus to modify scoring against the HMM
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>> On 3/24/16, 9:23 AM, "giancarlo.russo" wrote:
>>>>>>>>
>>>>>>>>> Dear Mike,
>>>>>>>>>
>>>>>>>>> first of all thanks for taking care of and sharing Maker; as part of the community I appreciate it.
>>>>>>>>>
>>>>>>>>> I have a question about the nomenclature of the annotation in the output file:
>>>>>>>>> what is the difference between genes named
>>>>>>>>>
>>>>>>>>> maker-Contig-XXX
>>>>>>>>> and those named
>>>>>>>>> augustus-Contig-XXX-processed genes
>>>>>>>>> ?
>>>>>>>>>
>>>>>>>>> Please find attached the maker_opts file I have used for my annotation.
>>>>>>>>> I was under the impression that the ab-initio related prefixes would be present only in the genes which are not marked as "maker" in column 3 of the gff file (i.e., those with both ab-initio and EST evidence)
>>>>>>>>>
>>>>>>>>> Is there something I am missing?
>>>>>>>>>
>>>>>>>>> Thanks a lot in advance,
>>>>>>>>> Giancarlo

From carsonhh at gmail.com Mon Oct 17 12:35:52 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 12:35:52 -0600
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <57fffd715f83340001fcf47d@polymail.io>
References: <57fffd715f83340001fcf47d@polymail.io>
Message-ID: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>

If you made a change that affects downstream steps, MAKER erases affected intermediate files and recalculates. It's possible that you erased required checkpoint files, so MAKER thinks a change has been made that requires some things to be rerun.

Also, if the STDERR is too big, set -quiet or -qq (really quiet) on the command line.

In general the error you see at the end is not the cause. The real error is further back in the log. MAKER tries to recover/retry, so the final failure you see is basically MAKER saying, I give up. But the original cause is further back in the log, often behind the output of other MAKER threads that are writing to the log simultaneously. If you have 100 CPUs writing to the same output log, you may bury the real error behind the output of other threads (the log is not truly linear), so you have to look further back.

If you use the beta, you can also specify -nolock, but be warned that the locks themselves are important to avoid file corruption (e.g. if you accidentally launch MAKER twice).

--Carson

> On Oct 13, 2016, at 3:57 PM, Mark Ebbert wrote:
>
> Hi,
>
> I've been working with maker for several months off and on with varying success.
> It worked great the first time I ran it, but ever since, it fails every run without any specific errors. It just says that one of the processes failed. I've been limping along by just running the following command to remove any locks and restarting: "find . -name *.NFSLock* -exec rm {} \;"
>
> This has been working, but for some reason maker started over from the beginning and lost all of the previous work! I don't even know where to start investigating. Should I nuke the whole maker directory structure and start from scratch? Maybe something got corrupted?
>
> I already deleted the log files before I realized maker started over, because the log files get way too big.
>
> I really appreciate your help!
>
> Mark T. W. Ebbert

From annabel.beichman at gmail.com  Mon Oct 17 13:20:37 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Mon, 17 Oct 2016 12:20:37 -0700
Subject: [maker-devel] Too many genes?
Message-ID:

Hi Carson et al.,

Thanks so much for such a great pipeline, tutorials and advice pages.

I have just finished four rounds of annotation in Maker on the sea otter genome, which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding.

Rounds I & II: In the first two rounds, I trained Augustus and SNAP on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat.

Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.

Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb.

I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after Round II, 25847 after Round III, and 25031 after Round IV).

However, this final gene count (25,031) still seems too high to me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes, 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes.

22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter's closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00; when I visualize them in JBrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold).

I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, while also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs but do have BLAST hits to SwissProt. Would you recommend any additional filters on gene length, etc.?

Not sure if this is significant, but one thing I've noticed is that the genes with BLAST hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold:

ScbS9RH_101185   30796   38760  +  ELUT_00004195-RA  ELUT_00004195  Name=ELUT_00004195-RA  0.08  0.17  Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185   42617   51087  +  ELUT_00004196-RA  ELUT_00004196  Name=ELUT_00004196-RA  0.25  0.26  Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185   87006   87827  +  ELUT_00004198-RA  ELUT_00004198  Name=ELUT_00004198-RA  0.18  0.18  Similar to Ano3: Anoctamin-3 (Mus musculus)
ScbS9RH_101185  110043  122523  +  ELUT_00004199-RA  ELUT_00004199  Name=ELUT_00004199-RA  0.09  0.09  Similar to ANO3: Anoctamin-3 (Homo sapiens)

Thank you all so much for your help and advice!

[I also want to report an odd behavior that may be specific to our server: when the number of scaffolds being annotated using maker drops below the number of cores (e.g. using openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno's paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]

Best wishes,
Annabel Beichman
Wayne Lab/Lohmueller Lab
Ecology & Evolutionary Biology
UCLA
Annabelbeichman.com

From carsonhh at gmail.com  Mon Oct 17 14:11:52 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 14:11:52 -0600
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <58052fc8a2cc1400014626fe@polymail.io>
References: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com> <58052fc8a2cc1400014626fe@polymail.io>
Message-ID:

MAKER should automatically try to salvage things on restart (that is the purpose of the checkpoint files).
You can set clean_try=1 if you want. It will then delete failed contigs before retrying on any failure.

--Carson

From mark.ebbert at gmail.com  Mon Oct 17 14:09:52 2016
From: mark.ebbert at gmail.com (Mark Ebbert)
Date: Mon, 17 Oct 2016 13:09:52 -0700
Subject: [maker-devel] Maker regularly fails and just lost all of the previous work!
In-Reply-To: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>
References: <2DE93768-4E3D-4F22-AB39-020EB88570C6@gmail.com>
Message-ID: <58052fc8a2cc1400014626fe@polymail.io>

Thanks Carson,

I've been restarting it using the same commands several times in a row. Unless that "find" command has the potential to modify any important files, I don't think I modified anything. All I ran was:

"find . -name *.NFSLock* -exec rm {} \;"
"sbatch maker.slurm"

I'm inclined to nuke it all and start over. Is it possible to salvage previous work, or is it all gone?

Mark T. W. Ebbert
Please note my new email address: mark.ebbert at gmail.com

From carsonhh at gmail.com  Mon Oct 17 14:25:32 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 14:25:32 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To:
References:
Message-ID: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>

Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs, or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Next, pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another; SNAP tends to have issues with mammalian genomes, for example). Also, when looking at the contigs, if you see a contig consisting of only single-exon genes, then you may have some prokaryotic contamination (it assembles as independent gene-dense contigs, so it is a good thing to look at if gene counts are high). Finally, high gene counts can mean that repeats are still under-masked (repeats encode real proteins like transposases).

You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain).
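
The InterProScan check above can be sketched roughly as follows. This is only an illustration, not part of MAKER or InterProScan: it assumes the tab-separated InterProScan 5 output in which column 1 is the protein ID and column 12 is the InterPro accession (missing or "-" when a match has no InterPro entry). The function name, the protein IDs and the accessions in the demo rows are all hypothetical.

```python
def interpro_fraction(tsv_lines, all_protein_ids):
    """Fraction of proteins with at least one InterPro (IPR) accession.

    tsv_lines: iterable of lines from an InterProScan TSV file (assumed layout).
    all_protein_ids: every protein ID that was scanned, with or without hits.
    """
    wanted = set(all_protein_ids)
    annotated = set()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 12 (index 11) holds the InterPro accession when one exists.
        has_ipr = len(fields) > 11 and fields[11].startswith("IPR")
        if has_ipr and fields[0] in wanted:
            annotated.add(fields[0])
    return len(annotated) / len(wanted) if wanted else 0.0


if __name__ == "__main__":
    # Hypothetical rows; real input would be the lines of the .tsv output file.
    demo = [
        "geneA-RA\tmd5\t300\tPfam\tPF99999\tdemo\t10\t200\t1e-20\tT\t17-10-2016\tIPR999999\tdemo domain",
        "geneB-RA\tmd5\t150\tPfam\tPF99998\tdemo\t5\t90\t1e-5\tT\t17-10-2016",
        "geneC-RA\tmd5\t220\tPfam\tPF99997\tdemo\t5\t90\t1e-5\tT\t17-10-2016\t-\t-",
    ]
    ids = ["geneA-RA", "geneB-RA", "geneC-RA", "geneD-RA"]
    print(interpro_fraction(demo, ids))
```

The ID list has to come from the full protein FASTA rather than from the TSV itself, since proteins without any match never appear in the TSV. If a gene has several transcripts, the IDs would need to be collapsed to gene level first to get a per-gene fraction.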

--Carson

From annabel.beichman at gmail.com  Mon Oct 17 17:13:07 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Mon, 17 Oct 2016 16:13:07 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com>
Message-ID: <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com>

Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP, as it definitely doesn't show great concordance compared to Augustus.

Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in RepeatModeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta

My InterProScan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I'm on the right track.

Thanks so much again,

~ Annabel

From carsonhh at gmail.com  Mon Oct 17 18:09:52 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Oct 2016 18:09:52 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com>
Message-ID: <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com>

It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want.

--Carson

From carsonhh at gmail.com  Sun Oct 23 17:25:34 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 23 Oct 2016 17:25:34 -0600
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To: <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

It's unfortunate the archived GMOD post is gone, because I always used it for my own reference. If I remember right, the main point was that Jason Stajich wrote a tool to convert SNAP's ZFF format to a GenBank format suitable for Augustus training. This meant you could use the maker2zff script that comes with MAKER, then use Jason's tool to convert the output for Augustus training.

Tool to convert SNAP training ZFF to an Augustus training input file -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Since the post is gone, you could use the documentation provided with his tool and then maybe a generic Augustus training guide like the following to design a path forward -> http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html

--Carson

> On Oct 12, 2016, at 3:44 AM, chebbi mohamed amine wrote:
>
> Thank you Carson for your quick response. Sorry, I have another question concerning Augustus training. You posted previously in the mailing list a link to an explanation of the Augustus training steps: http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html. Unfortunately the link doesn't work anymore. Otherwise, could you explain how to filter the GFF file produced by the first run of Maker to get the best full-length ORFs as a set of gene models to train Augustus?
>
> Best,
> Amine
>
> From: "Carson Holt"
> To: "Mohamed Amine CHEBBI"
> Cc: maker-devel at yandell-lab.org
> Sent: Tuesday 11 October 2016 22:05:50
> Subject: Re: [maker-devel] Combining and merging two Maker annotation gff files ?
>
> Masking doesn't just affect the gene models, but also evidence alignment and thus scoring. So merging in this way would not make much sense, as the second, less masked set would always score better because it has more evidence alignments permitted by the lack of masking (not necessarily real, but drawn in by repeats).
>
> The result would be that any attempt at a merge would almost exclusively keep genes from the second set, because they would always score higher.
>
> --Carson

From xvazquezc at gmail.com Sun Oct 23 17:49:53 2016
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 24 Oct 2016 10:49:53 +1100
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

If it's of any help, I had these notes in my old protocol (before I started to do the training with BUSCO):

> For Augustus, we need the script "zff2augustus_gbk.pl". This will take the export.dna generated by fathom and generate a *.gb file that will be used as the "training gene structure file" in a new training submission in WebAugustus. Remember to give it a new name in the submission, e.g. MYGENOME_v2, or Maker won't see the difference (same name):
> perl PATH/TO/SCRIPT/zff2augustus_gbk.pl > MYGENOME.train.gb

As said, you could also do the training with BUSCO with the --long option.
It has a dataset specific for arthropods. But if you have EST data you'll probably do better with the other method, as it allows you to enter the ESTs for more accurate training.

--
Xabier Vázquez-Campos, PhD
Research Associate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From jill711021 at gmail.com Sun Oct 23 21:32:38 2016
From: jill711021 at gmail.com (王一凡)
Date: Mon, 24 Oct 2016 11:32:38 +0800
Subject: [maker-devel] maker -error
Message-ID:

Dear sir,

I am trying to run GeneMark-ES and Maker to annotate a fungal genome. When I run gm_es.pl, the script terminates with the following error:

> Must input more than one data point! at /home/myname/Applications/GeneMarkES/parse_ET.pl line 213.
> Invalid regression data
> error on call: /home/myname/Applications/GeneMarkES/parse_ET.pl --section ET_C --cfg /home/myname/projectX/Maker/GeneMark/run.cfg --v

After searching and asking around, I still have no idea how to deal with it. Do you have any idea? Thank you for your time!

From carsonhh at gmail.com Mon Oct 24 16:41:04 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 24 Oct 2016 16:41:04 -0600
Subject: [maker-devel] maker -error
In-Reply-To:
References:
Message-ID: <65B4147C-B28C-40EB-9004-F93D821AF1C7@gmail.com>

That is a GeneMark internal error. I'd recommend running it by itself (outside of MAKER) on whatever contig it failed on; then, if the error reproduces, you can post the error and the test dataset to the GeneMark developers.

--Carson

From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 26 02:32:52 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (chebbi mohamed amine)
Date: Wed, 26 Oct 2016 10:32:52 +0200 (CEST)
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID: <1581157450.4030281.1477470772694.JavaMail.zimbra@univ-poitiers.fr>

Thank you very much for your help.

Best,
Mohamed

From mohamed.amine.chebbi at univ-poitiers.fr Wed Oct 26 07:09:33 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine Chebbi)
Date: Wed, 26 Oct 2016 15:09:33 +0200 (CEST)
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
Message-ID: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>

Hi!

I have tried three rounds of annotation in Maker on a non-model arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and Illumina reads. As suggested in the tutorial, in the first round I ran Maker with repeat masking to generate gene models using transcript (Trinity assembly) and protein (SwissProt) evidence.
Then Maker models were used twice in a bootstrap fashion to retrain SNAP. The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.

However, the AED profile (attached) doesn't seem to be satisfactory. So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that keeping only good transcripts could improve the results? If yes, which criteria should be taken into account?

Thank you.

Best,
Amine

-------------- next part --------------
A non-text attachment was scrubbed...
Name: AED-Graph.pdf
Type: application/pdf
Size: 5328 bytes
Desc: not available

From michael.s.campbell1 at gmail.com Wed Oct 26 12:00:08 2016
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Wed, 26 Oct 2016 14:00:08 -0400
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

Hi Amine,

I haven't seen that pattern in a CFD plot of AED before. Is there a possibility that the x and y axes are switched in the plot?

Thanks,
Mike

From carsonhh at gmail.com Wed Oct 26 12:04:20 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 26 Oct 2016 12:04:20 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr>
Message-ID: <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>

Your AED curve looks fine. The first run (using protein2genome or est2genome, I assume) will always have really low overall AED because the models are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always look artificially good). The protein2genome or est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve; the value is set in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further assess the models by adding protein domains using InterProScan; you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).

Your overall gene counts are a little high though for an arthropod (14,000-19,000 genes would be expected, as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa).
However, your gene counts can be explained by insufficient repeat masking (you can add a RepeatModeler-generated library to the existing settings to help with this), by poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with stricter assembly parameters, including the jaccard-clip option in Trinity), or simply by assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split, which results in inflated gene counts).

Finally, manually look at the most gene-dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene-dense contigs are overwhelmingly single-exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so it is easily blacklisted and can be identified visually (always gene dense and single exon).

Thanks,
Carson

From carsonhh at gmail.com Wed Oct 26 12:06:36 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 26 Oct 2016 12:06:36 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com>
Message-ID: <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>

Sorry. I also assumed X and Y were flipped when I looked at it. Now that I have read the labels, your AED curve would be weird unless the X and Y are flipped in your figure.

--Carson

From jason.stajich at gmail.com Wed Oct 26 19:26:26 2016
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 26 Oct 2016 18:26:26 -0700
Subject: [maker-devel] Combining and merging two Maker annotation gff files ?
In-Reply-To:
References: <331db87e-3ae4-34e1-241c-a4875783e1ac@univ-poitiers.fr> <980094649.600573.1476265488779.JavaMail.zimbra@univ-poitiers.fr>
Message-ID:

Yes, thanks for re-sharing. Maybe we should write this up into a clearer tutorial. I go back and forth on how to make this easier and automated.

Jason

--
Jason Stajich
jason.stajich at gmail.com

From michael.s.campbell1 at gmail.com Thu Oct 27 07:21:01 2016
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 27 Oct 2016 09:21:01 -0400
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To:
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
Message-ID: <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>

I think that if you train any further you will run the risk of overtraining. Setting alt_splice to 1 will add transcripts but not genes, so the gene count is going to be related to the training of the gene finder. I would recommend looking at a few of your large scaffolds in a genome browser. I would also recommend adding a second gene predictor such as Augustus. When multiple predictors are used and the models they predict converge, you can have more confidence in the gene prediction.

For the masking, you can make a species-specific repeat library like Carson suggested to see if the gene count comes down a little. If you are concerned about masking duplicated genes, you can do a couple of things. You can filter the repeat library based on known proteins. You can also set a copy number minimum for the masking and only include repeats that are present more than 10 times in the genome. Here are a couple of URLs for making species-specific repeat libraries:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Basic

Take care,
Mike

> On Oct 27, 2016, at 5:54 AM, Mohamed Amine CHEBBI wrote:
>
> Sorry, the X and Y were switched in the plot due to a mishandling. Please find attached the correct AED graph.
>
> Round 3 (red curve) shows a slightly higher overall AED than the second round (green curve) and more genes (22931 compared to 22547 in round 2). Do you think that I should stop at the second round?
>
> I didn't specify in the previous email that the repeat masking was done in Maker using Repbase and only the models found by RepeatModeler that have identities. I left the unknown library of RepeatModeler unmasked. In fact, we expect a high rate of segmental and gene duplication in the genome, which could explain the high overall count of genes found by Maker.
>
> On the other hand, the high number of genes may also be explained by the fact that I activated the alt_splice=1 option to find alternative splicing. Do you think that it was a good idea?
>
> Thank you very much for your time.
>
> Best,
> Amine
>
> On 26/10/2016 at 20:06, Carson Holt wrote:
>> Sorry. I also assumed X and Y was flipped when I looked at it. Now I read the labels, your AED curve would be weird unless the X and Y are flipped in your figure.
>>
>> --Carson
>>
>>> On Oct 26, 2016, at 12:04 PM, Carson Holt wrote:
>>>
>>> Your AED curve looks fine. The first run (using protein2genome or est2genome I assume) will always have really low overall AED because the models are exact copies of the protein/transcript alignments (so AED is meaningless there because it will always artificially look good). The protein2genome or est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve - value in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can further look at models by adding protein domains using InterProScan, in which case you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).
>>>
>>> Your overall gene counts are a little high though for an arthropod (14,000-19,000 genes would be expected, as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa). However, your gene counts can be explained by either insufficient repeat masking (you can add a RepeatModeler generated library to the existing settings to help with this), poor mRNA-seq assembly or a lot of noise in the RNA-seq (this can be helped with more strict assembly parameters including the jaccard-clip option in Trinity), or it is just the result of assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, then many genes will be split, which results in inflated gene counts).
>>>
>>> Finally, manually look at the most gene dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). If you have contamination, it will assemble as independent contigs, so it is easily blacklisted and can be identified visually (always gene dense and single exon).
>>>
>>> Thanks,
>>> Carson
>>>
>>>> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi <mohamed.amine.chebbi at univ-poitiers.fr> wrote:
>>>>
>>>> Hi!
>>>> I have tried three rounds of annotation in Maker on a non-model arthropod genome (1.7Gb), which is a hybrid assembly of PacBio and Illumina reads.
>>>> As suggested in the tutorial, in the first round I ran Maker with repeat masking to generate gene models using transcript (Trinity assembly) and protein (SwissProt) evidence. Then Maker models were used twice in a bootstrap fashion to retrain SNAP.
>>>> The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.
>>>>
>>>> However, the AED profile (attached) doesn't seem to be satisfactory.
>>>> So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that filtering good transcripts could improve results? If yes, which criteria should be taken into account?
>>>> Thank you.
>>>>
>>>> Best;
>>>> Amine
>
> --
> Mohamed Amine CHEBBI, PhD Student
> Université de Poitiers
> Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
> Equipe Ecologie Evolution Symbiose
> Bât. B8-B35 - 5 Rue Albert Turpin
> TSA 51106
> F-86022 Poitiers Cedex 9
> FRANCE
> Lab website: http://ecoevol.labo.univ-poitiers.fr/

From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 03:54:31 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Thu, 27 Oct 2016 11:54:31 +0200
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com>
Message-ID:

Sorry, the X and Y were switched in the plot due to a mishandling. Please find attached the correct AED graph.

Round 3 (red curve) shows a slightly higher overall AED than the second round (green curve) and more genes (22931 compared to 22547 in round 2). Do you think that I should stop at the second round?

I didn't specify in the previous email that the repeat masking was done in Maker using Repbase and only the models found by RepeatModeler that have identities. I left the unknown library of RepeatModeler unmasked. In fact, we expect a high rate of segmental and gene duplication in the genome, which could explain the high overall count of genes found by Maker.

On the other hand, the high number of genes may also be explained by the fact that I activated the alt_splice=1 option to find alternative splicing. Do you think that it was a good idea?

Thank you very much for your time.

Best,
Amine

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât.
B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: AED-Graph.pdf
Type: application/pdf
Size: 5302 bytes
Desc: not available
URL:

From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 08:34:02 2016
From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI)
Date: Thu, 27 Oct 2016 16:34:02 +0200
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To: <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
Message-ID:

Thank you Michael for your response.

As you suggested, I will use Augustus and SNAP, both trained on the assembled transcripts in a bootstrap fashion.

For the masking, I intend to adapt Carson's strategy:

- Collecting the RepeatModeler repeats.lib
- Searching the sequences in Modelerunknown.lib against a transposase database (derived from the RepeatMasker package and Kennedy et al (2011)) and considering sequences matching transposases as transposons.
- Excluding gene fragments from both the known and unknown repeats
- As I'm concerned about gene duplications, the remaining sequences in the unknown lib that are present fewer than 10 times will be removed.

Thank you again for your time; I remain open to any suggestion.

Best,
Amine

--
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât. B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/

From carsonhh at gmail.com Thu Oct 27 09:08:15 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 27 Oct 2016 09:08:15 -0600
Subject: [maker-devel] Filter transcripts to improve annotation quality ?
In-Reply-To:
References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com>
Message-ID:

I do believe that you are getting a number of false positive genes because of under-masking. So taking a more careful strategy (i.e. using the suggestions given by Michael) should mitigate that. You will have to decide how aggressive to be with the repeat masking (i.e. sensitivity/specificity balance).

I would however turn off alt_splice. It has a very high threshold for how clean and complete mRNA alignments and repeat masking have to be in order to function correctly (reason why default is off). So given the filtering being done to pull back on repeat masking, it likely does not meet that threshold. It won't really produce more genes, but you will get many spurious alternate transcripts.

Also, for the gene count, make sure not to count from the fasta; that is the transcript count. You have to count the "gene" feature lines in the GFF3 to get the gene count, i.e. -->
grep -P -c "\tgene\t" models.gff

--Carson
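The gene-count check in Carson's message above (grep -P -c "\tgene\t" models.gff), together with the gene_density = gene_count / contig_length figure mentioned earlier in this thread, can be sketched in Python roughly as follows. This is only a minimal illustration and not part of MAKER: the function names gene_counts and gene_density and the contig_lengths mapping are made up for the example, and it assumes a tab-delimited GFF3 in which gene loci carry the feature type "gene" in column 3.

```python
from collections import Counter


def gene_counts(gff_lines):
    """Count 'gene' features per sequence ID in GFF3 text lines."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        # Column 1 is the sequence ID, column 3 the feature type.
        if len(cols) >= 3 and cols[2] == "gene":
            counts[cols[0]] += 1
    return counts


def gene_density(counts, contig_lengths):
    """Return gene_count / contig_length for each contig (hypothetical helper)."""
    return {
        contig: counts.get(contig, 0) / length
        for contig, length in contig_lengths.items()
        if length > 0
    }


if __name__ == "__main__":
    # Tiny made-up example: one gene with two transcripts, one single gene.
    example = [
        "##gff-version 3\n",
        "ctg1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n",
        "ctg1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1\n",
        "ctg1\tmaker\tmRNA\t100\t700\t.\t+\t.\tID=g1-mRNA-2;Parent=g1\n",
        "ctg2\tmaker\tgene\t50\t400\t.\t-\t.\tID=g2\n",
    ]
    counts = gene_counts(example)
    print("genes:", sum(counts.values()))
    print(gene_density(counts, {"ctg1": 1000, "ctg2": 500}))
```

Counting "gene" rows rather than mRNA rows (or FASTA entries) keeps alternative transcripts from inflating the total, and sorting the density values gives a shortlist of contigs to inspect in a browser for the gene-dense, single-exon pattern described above.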
URL: From mohamed.amine.chebbi at univ-poitiers.fr Thu Oct 27 09:22:08 2016 From: mohamed.amine.chebbi at univ-poitiers.fr (Mohamed Amine CHEBBI) Date: Thu, 27 Oct 2016 17:22:08 +0200 Subject: [maker-devel] Filter transcripts to improve annotation quality ? In-Reply-To: References: <2098382382.4146797.1477487373881.JavaMail.zimbra@univ-poitiers.fr> <9A45E0F5-EB27-491F-8713-39D0EB06547A@gmail.com> <3EA2EC84-9B2A-4631-97F8-44D774E67468@gmail.com> <8935E6BD-FDEC-464B-B174-94649CB42D63@gmail.com> Message-ID: <69dcf9e0-b736-3f79-082d-1ec2d6d04467@univ-poitiers.fr> Indeed the gene count has been done by the command grep -P -c "\tgene\t" models.gff. I would be careful about repeats, however in the strategy I'm not convinced by the step of searching the sequencesin Modelerunknown.lib against a transposase database, as it has been done yet by the RepeatModeler against the repbase . So I think skip this step. A last question, how to create a Protein database excluding the transposases. Thank you again. Best, Amine Le 27/10/2016 ? 17:08, Carson Holt a ?crit : > not to cou -- Mohamed Amine CHEBBI, PhD Student Universit? de Poitiers Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267 Equipe Ecologie Evolution Symbiose B?t. B8-B35 - 5 Rue Albert Turpin TSA 51106 F-86022 Poitiers Cedex 9 FRANCE Lab website: http://ecoevol.labo.univ-poitiers.fr/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From scott at scottcain.net Fri Oct 28 14:57:07 2016 From: scott at scottcain.net (Scott Cain) Date: Fri, 28 Oct 2016 16:57:07 -0400 Subject: [maker-devel] Call for GMOD talks at PAG Message-ID: Hi, I am pleased to announce a call for talks to be given at the Plant and Animal Genomes conference this January in the GMOD workshop on Wednesday, January 18th. Any talks that involve the development or use of GMOD software are welcome. 
In particular this year, I'd really like to highlight plugins for the various GMOD software packages that support them, like JBrowse, Galaxy and Tripal (of course, Galaxy and Tripal have their own sessions, so you should consider submitting to them too).

Please get an abstract, brief summary or a vague title to me as soon as possible so I can start putting it together. Also, if you'd like to be a co-organizer, please drop me a line about that too. I might be able to get you some meeting-related niceties for not very much work.

For more information about PAG, see: http://www.intlpag.org

Thanks, and I look forward to seeing you in January,
Scott

--
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research

From annabel.beichman at gmail.com Fri Oct 28 17:11:11 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Fri, 28 Oct 2016 16:11:11 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
Message-ID: <97D8C047-69C2-4379-AF5C-3E6DAAADA51C@gmail.com>

Re-sending this to the list without attachments, as they were too large.

Cheers,
Annabel

> On Oct 28, 2016, at 4:04 PM, Annabel Beichman wrote:
>
> Hi Carson,
> Re-running Maker without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you!
>
> However, I'm still seeing an odd pattern that I wonder if you have any ideas about:
>
> For the set of ~6000 genes that do not have orthologs in the ferret, but do have start/stop codons and are below AED/eAED of 0.5, I am seeing duplication of BLAST annotations for ~2,600 of the gene models, particularly gene models that are in a row on a scaffold. I've put the genes with duplicate BLAST annotations into the attached Excel file so you can see the patterns I'm describing.
> For example, there is a similar annotation for two genes in a row on a scaffold, both of which have low AED/eAED scores and start/stop codons (also visualized in the attached JBrowse screenshot):
>
> Scaffold       Start  Stop    Strand  GeneID         mRNALength  #Exons  BlastInfo
> ScbS9RH_82700  41318  49503   +       ELUT_00017706  8185        3       Similar to Cdh13: Cadherin-13 (Mus musculus)
> ScbS9RH_82700  99358  103910  +       ELUT_00017707  4552        3       Similar to Cdh13: Cadherin-13 (Mus musculus)
>
> I am trying to filter out false positive gene models as I make my exome capture design, so I wondered if you had any tips on what might be going on here. Paralogs? Artifacts of the assembly? Is the gene with the most exons likely to be the original gene? Should I filter sets of duplicates by those that have IPR domains?
>
> Secondly, I also notice 250 of these repeat genes are annotated as 40S or 60S ribosomal protein genes. Do you expect to see this many (I know there are usually many rDNA genes), or could this number be inflated due to ribosomal RNA in the RNA-seq reads? (I carried out poly-A selection prior to sequencing.)
>
> Thanks so much again for your help!
>
> ~ Annabel
>
>> On Oct 17, 2016, at 5:09 PM, Carson Holt wrote:
>>
>> It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want.
>>
>> --Carson
>>
>>> On Oct 17, 2016, at 5:13 PM, Annabel Beichman wrote:
>>>
>>> Thank you so much for all these suggestions, Carson!
I will give them a try, particularly dropping SNAP, as it definitely doesn't show great concordance compared to Augustus.
>>>
>>> Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in RepeatModeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta
>>>
>>> My InterProScan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I'm on the right track.
>>>
>>> Thanks so much again,
>>>
>>> ~ Annabel
>>>
>>>> On Oct 17, 2016, at 1:25 PM, Carson Holt wrote:
>>>>
>>>> Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally, pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another; SNAP tends to have issues with mammalian genomes, for example). Also, when looking at the contigs, if you see contigs consisting of only single-exon genes then you may have some prokaryotic contamination (they assemble as independent gene-dense contigs, so this is a good thing to look at if gene counts are high). Finally, high gene counts can mean that repeats are still under-masked (repeats encode real proteins like transposases).
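The contig check described above (rank contigs by gene density, i.e. gene count divided by contig length, then see whether the densest ones are dominated by single-exon models) could be sketched roughly as follows. This is only an illustrative Python sketch, not part of MAKER: the function names rank_contigs and parse_attributes, and the assumption that the input is standard GFF3 with gene, mRNA and exon features linked by ID/Parent attributes, are mine.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn a GFF3 attribute string such as 'ID=m1;Parent=g1' into a dict."""
    return dict(kv.split("=", 1) for kv in column9.strip().split(";") if "=" in kv)


def rank_contigs(gff_lines, contig_lengths):
    """Return (contig, gene_density, single_exon_fraction), densest contig first.

    gff_lines: iterable of GFF3 lines (hypothetical input, e.g. an open file).
    contig_lengths: dict mapping contig name to its length in bases.
    """
    gene_counts = defaultdict(int)   # contig -> number of gene features
    mrna_contig = {}                 # mRNA ID -> contig it sits on
    exon_counts = defaultdict(int)   # mRNA ID -> number of exon features

    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        contig, feature = cols[0], cols[2]
        attrs = parse_attributes(cols[8])
        if feature == "gene":
            gene_counts[contig] += 1
        elif feature == "mRNA" and "ID" in attrs:
            mrna_contig[attrs["ID"]] = contig
        elif feature == "exon":
            # An exon may list several parents separated by commas.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    exon_counts[parent] += 1

    total = defaultdict(int)    # contig -> number of mRNAs
    single = defaultdict(int)   # contig -> number of single-exon mRNAs
    for mrna_id, contig in mrna_contig.items():
        total[contig] += 1
        if exon_counts[mrna_id] == 1:
            single[contig] += 1

    ranked = []
    for contig, n_genes in gene_counts.items():
        if contig not in contig_lengths:
            continue  # no length known, so density cannot be computed
        density = n_genes / contig_lengths[contig]
        fraction = single[contig] / total[contig] if total[contig] else 0.0
        ranked.append((contig, density, fraction))
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked
```

Contigs near the top of the list with a single-exon fraction close to 1.0 would be the first ones to open in a browser; the numbers alone do not prove contamination.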
>>>>
>>>> You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well-annotated genome will have ~75-85% of genes with an InterPro domain).
>>>>
>>>> --Carson
>>>>
>>>>> On Oct 17, 2016, at 1:20 PM, Annabel Beichman wrote:
>>>>>
>>>>> Hi Carson et al.,
>>>>>
>>>>> Thanks so much for such a great pipeline, tutorials and advice pages.
>>>>>
>>>>> I have just finished four rounds of annotation in Maker on the sea otter genome, which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding.
>>>>>
>>>>> Rounds I & II: In the first two rounds, I trained Augustus and SNAP on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat.
>>>>>
>>>>> Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.
>>>>>
>>>>> Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-ran Maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb.
>>>>>
>>>>> I found my number of genes dropping each round, as you suggest they should (47465 after round I, 27289 after round II, 25847 after round III, and 25031 after round IV).
>>>>>
>>>>> However, this final gene count (25,031) still seems too high to me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes, 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes.
>>>>>
>>>>> 22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e-value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter's closest relative (e-value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00; when I visualize them in JBrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold).
>>>>>
>>>>> I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have BLAST hits to SwissProt. Would you recommend any additional filters on gene length, etc.?
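One of the filters mentioned in this thread, keeping only models whose AED and eAED are both below 0.5, could be sketched as below. This is a minimal Python sketch under stated assumptions: it assumes the scores are stored on mRNA features in the MAKER GFF3 as the attributes _AED and _eAED, and the function names filter_by_aed and parse_attributes are hypothetical, not MAKER utilities.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string such as 'ID=m1;_AED=0.08' into a dict."""
    return dict(kv.split("=", 1) for kv in column9.strip().split(";") if "=" in kv)


def filter_by_aed(gff_lines, max_aed=0.5, max_eaed=0.5):
    """Return IDs of mRNA features with AED < max_aed and eAED < max_eaed.

    Assumes the scores are stored as _AED and _eAED attributes on mRNA lines.
    Features without both attributes are skipped rather than guessed at.
    """
    kept = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        try:
            aed = float(attrs["_AED"])
            eaed = float(attrs["_eAED"])
        except (KeyError, ValueError):
            continue
        if aed < max_aed and eaed < max_eaed:
            kept.append(attrs.get("ID", ""))
    return kept
```

A score filter like this only says how well a model agrees with its evidence; it would still need to be combined with the other checks discussed here (start/stop codons, domain content, orthology).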
>>>>>
>>>>> Not sure if this is significant, but one thing I've noticed is that many of the genes with BLAST hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold:
>>>>> ScbS9RH_101185  30796   38760   +  ELUT_00004195-RA  ELUT_00004195  Name=ELUT_00004195-RA  0.08  0.17  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>> ScbS9RH_101185  42617   51087   +  ELUT_00004196-RA  ELUT_00004196  Name=ELUT_00004196-RA  0.25  0.26  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>> ScbS9RH_101185  87006   87827   +  ELUT_00004198-RA  ELUT_00004198  Name=ELUT_00004198-RA  0.18  0.18  Similar to Ano3: Anoctamin-3 (Mus musculus)
>>>>> ScbS9RH_101185  110043  122523  +  ELUT_00004199-RA  ELUT_00004199  Name=ELUT_00004199-RA  0.09  0.09  Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>>>>
>>>>> Thank you all so much for your help and advice!
>>>>>
>>>>> [I also want to report an odd behavior that may be specific to our server: when the number of scaffolds being annotated using Maker drops below the number of cores (e.g. using OpenMPI with 45 cores available, but there are only 44 scaffolds left), Maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno's paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]
>>>>>
>>>>> Best wishes,
>>>>> Annabel Beichman
>>>>> Wayne Lab/Lohmueller Lab
>>>>> Ecology & Evolutionary Biology
>>>>> UCLA
>>>>> Annabelbeichman.com

From carsonhh at gmail.com Fri Oct 28 17:23:00 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:23:00 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com>
Message-ID: <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com>

You need to look at some of the contigs in a browser. Look at the most gene-dense ones first (density = gene_count/contig_length). You may have prokaryotic contamination if you are seeing a lot of contigs containing primarily single-exon gene models. Also make sure you still have model_org=all set after adding the species-specific library (the species-specific library is meant to supplement RepBase, not replace it).

Some locations where you are seeing neighboring genes with similar BLAST hits (Cadherin) may in fact be one gene that was split, either because the evidence clusters insufficiently (perhaps the max intron size is set too low in the control files), or because the assembly has runs of NNNN that do not permit the gene predictor to create a spanning model (not uncommon). If you are using Apollo to view the genes, you can zoom in around evidence alignments until you see the sequence, and often you will see clusters of NNNN in the sequence around evidence HSP breakpoints.

--Carson

From carsonhh at gmail.com Fri Oct 28 17:27:59 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:27:59 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com>
Message-ID: <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>

Also, if you labeled putative function using BLAST results, make sure you set the expect value sufficiently low to filter out false homology. Otherwise you will be labeling from the best hit simply because it is the best one, even though it may in fact have a very poor score. The threshold value should never be higher than 1e-6. You can go all the way down to 1e-10 if necessary.

--Carson

From annabel.beichman at gmail.com Fri Oct 28 17:36:03 2016
From: annabel.beichman at gmail.com (Annabel Beichman)
Date: Fri, 28 Oct 2016 16:36:03 -0700
Subject: [maker-devel] Too many genes?
In-Reply-To: <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com>
Message-ID: <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>

Thank you so much, Carson, for such a rapid reply!

I have checked the prokaryotic issue and it looks okay: my most gene-dense contigs all have multi-exon genes. I will re-blast with a more stringent cutoff as well.

I think your theory about the NNNNNNs might be spot on. The assembly is by Dovetail Genomics, and they insert many NNNNNs as they join contigs together into the long scaffolds, which would disrupt the gene models. Is there any way to salvage the genes that are split around the NNNNs? Or should I just leave them out of my analyses?
Thanks again, ~ Annabel > On Oct 28, 2016, at 4:27 PM, Carson Holt wrote: > > Also if you labeled putative function using BLAST results, make sure you set the expect value sufficiently low to filter out false homology. Otherwise you will be labeling off the best hit, which may in fact have a very poor score, but because it?s the best one. The threshold value should never be higher than 1e-6. You can go all the way down to 1e-10 if necessary. > > ?Carson > > > >> On Oct 28, 2016, at 5:23 PM, Carson Holt wrote: >> >> You need to look at some of the contigs in a browser. Look at the most gene dense ones first (density = gene_count/contig_length). You may have prokaryiotic contamination if you are seeing a lot of contigs containing primarily single exon gene models. Also make sure you still left model_org=all on after adding the species specific library (the species specific library is to supplement RepBase as opposed to replace it). >> >> Some locations where you are seeing neighboring genes with similar blast hits (Cadherin) may infact be one gene that was split, either because evidence insufficiently clusters (perhaps the max intron size is set too low in the control files), or perhaps the assembly has runs of NNNN that do not permit the gene predictor to create a spanning model (not uncommon). If you are using Apollo to view the genes you can zoom in around evidence alignments until you see the sequence, and often you will see clusters of NNNN in the sequence around evidence HSP breakpoints. >> >> ?Carson >> >> >> >>> On Oct 28, 2016, at 5:04 PM, Annabel Beichman wrote: >>> >>> Hi Carson, >>> Re-running Maker without SNAP definitely improved things, as did filtering out fragmented genes without start/stop codons. Thank you! 
From carsonhh at gmail.com Fri Oct 28 17:49:27 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 28 Oct 2016 17:49:27 -0600
Subject: [maker-devel] Too many genes?
In-Reply-To: <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>
References: <5B57B88B-33CC-4707-83D0-0C47A71EF9C0@gmail.com> <8B659CCF-E427-4AD7-81C3-1C7871C6BF5B@gmail.com> <9BC689D0-F233-46EA-969F-76101533FFA7@gmail.com> <3F5EF76F-050F-429C-9850-E452CD6BB3A9@gmail.com> <1616B9D6-1FED-47A7-897E-2F88914871C8@gmail.com> <2663F796-A997-49AF-9B1F-2A28AB3B8D6E@gmail.com> <326237DD-7A6A-4A09-AEC4-346734F7F39C@gmail.com>
Message-ID: <07C987F9-1354-4DB6-A63F-9B23F2006871@gmail.com>

The NNNNs preclude both alignment and prediction, so unless they fall within an intron, the result is a split model (often a run of NNN is only a few base pairs long, but if it falls within an exon, you can't really work around it). The predictors work off of a maximum score, so the ab initio predictor ends up finding some way of terminating the model around the NNNs that scores well even though it does not reflect the biology. Sometimes you can try to force things in manually (non-canonical splice sites, etc.) if it is an important gene (Web-Apollo even allows you to insert SNPs and indels to correct the ORF, but it's a labor-intensive manual process).

So, the short answer: you should check in a browser whether you see these. If you do have them, then you will have to decide how to handle them depending on the analysis (perhaps take the longer one?). Take some time just viewing alignments and models to get a feel for how evidence and gene models should correlate. There really is no substitute for visual manual review.

--Carson

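As a complement to viewing scaffolds in a browser, the split-model situation discussed in this thread (neighboring gene models separated by a run of N) can be pre-screened with a short script. What follows is only a minimal sketch in Python, not part of MAKER. It assumes you have the assembly as plain FASTA and have already extracted (start, end, strand, gene_id) tuples per scaffold yourself. The function names read_fasta, find_n_runs and split_candidates are hypothetical, invented for this illustration.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Gene = Tuple[int, int, str, str]  # (start, end, strand, gene_id), 1-based inclusive


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_n_runs(seq: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) coordinates of each run of N."""
    return [
        (m.start() + 1, m.end())
        for m in re.finditer(r"[Nn]+", seq)
        if m.end() - m.start() >= min_len
    ]


def split_candidates(genes: List[Gene], n_runs: List[Tuple[int, int]]):
    """Report neighboring same-strand genes with an N run lying between them.

    These pairs are only candidates for one gene split in two; whether they
    really are one gene still has to be judged from the evidence alignments.
    """
    ordered = sorted(genes)
    hits = []
    for (s1, e1, strand1, id1), (s2, e2, strand2, id2) in zip(ordered, ordered[1:]):
        if strand1 != strand2:
            continue
        gaps = [(a, b) for a, b in n_runs if a > e1 and b < s2]
        if gaps:
            hits.append((id1, id2, gaps))
    return hits
```

In practice one would call read_fasta on the open assembly file, run find_n_runs on each scaffold, and pass that scaffold's gene tuples to split_candidates. Pairs that also share the same BLAST description are the ones most worth opening in a browser first.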
From jacques.dainat at bils.se Mon Oct 31 04:51:29 2016
From: jacques.dainat at bils.se (Jacques Dainat)
Date: Mon, 31 Oct 2016 11:51:29 +0100
Subject: [maker-devel] est_gff input does not provide any gene model
Message-ID:

Hello,

I usually use Cufflinks output to feed MAKER through the est_gff parameter; combined with est2genome=1, I get the output I want.
This time I used StringTie output to feed MAKER, but no gene models are predicted with the est2genome parameter.

Any explanation? Is it due to the GFF3 format differences between these two files?

Cufflinks output example:
Pnalgiovense_4592 Cufflinks match 363 977 17.844829 - . ID=1:s3_c1_r1.4.2;Name=1:s3_c1_r1.4.2;
Pnalgiovense_4592 Cufflinks match_part 363 666 17.844829 - . ID=1:s3_c1_r1.4.2:exon-1;Name=1:s3_c1_r1.4.2;Parent=1:s3_c1_r1.4.2;Target=1:s3_c1_r1.4.2 1 304 +;
Pnalgiovense_4592 Cufflinks match_part 743 977 17.844829 - . ID=1:s3_c1_r1.4.2:exon-2;Name=1:s3_c1_r1.4.2;Parent=1:s3_c1_r1.4.2;Target=1:s3_c1_r1.4.2 305 539 +;

StringTie output example:
Pnalgiovense_112 StringTie gene 20 1256 1000 + . ID=HtMm_All.12253;cov=8.028295;fPKM=1.214491;gene_id=HtMm_All.12253;tPM=2.706611;transcript_id=HtMm_All.12253.1
Pnalgiovense_112 StringTie mRNA 20 1256 1000 + . ID=HtMm_All.12253.1;Parent=HtMm_All.12253;cov=8.028295;fPKM=1.214491;gene_id=HtMm_All.12253;tPM=2.706611;transcript_id=HtMm_All.12253.1
Pnalgiovense_112 StringTie exon 20 1256 1000 + . ID=HtMm_All.12253.1-exon-1;Parent=HtMm_All.12253.1;cov=8.028295;exon_number=1;gene_id=HtMm_All.12253;transcript_id=HtMm_All.12253.1

If it's the StringTie output that is problematic, how can I fix it? Is it enough to remove the gene features, change mRNA to match, and change exon to match_part?

Best regards,

Jacques Dainat, PhD
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service

Address: (room E10:4204 - last floor)
Uppsala University, BMC
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: 01 84 71 46 25

From carsonhh at gmail.com Mon Oct 31 21:24:03 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Oct 2016 21:24:03 -0600
Subject: [maker-devel] est_gff input does not provide any gene model
In-Reply-To:
References:
Message-ID:

Evidence such as est_gff has to follow the alignment format used by GFF3 (i.e. match/match_part), whereas you are providing gene models (i.e. gene/mRNA/exon/CDS). Note that match/match_part is a two-level feature hierarchy, whereas gene models have three levels. You need to reformat to match/match_part.

--Carson

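The reformatting discussed in this thread (three-level gene/mRNA/exon models turned into two-level match/match_part alignments) could be sketched roughly as below. This is an illustrative Python sketch, not a script that ships with MAKER, and the function name models_to_matches is hypothetical. It assumes tab-delimited 9-column GFF3 in which each exon has exactly one Parent. It builds the ID, Name and Target attributes by imitating the Cufflinks example shown in the thread (Target offsets accumulated in order of genomic start). Whether MAKER needs every one of those attributes is an assumption, not something stated in the thread.

```python
def parse_attrs(field):
    """Parse a GFF3 attribute column into a dict (no URL-decoding)."""
    attrs = {}
    for item in field.strip().strip(";").split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def models_to_matches(gff_lines):
    """Convert mRNA/exon rows to match/match_part rows; gene rows are dropped."""
    transcripts = {}  # transcript ID -> columns of its mRNA row
    exons = {}        # transcript ID -> list of exon rows (as column lists)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attrs(cols[8])
        if cols[2] in ("mRNA", "transcript"):
            transcripts[attrs["ID"]] = cols
        elif cols[2] == "exon":
            # Assumption: a single Parent per exon (no comma-separated list).
            exons.setdefault(attrs["Parent"], []).append(cols)

    out = []
    for tid, cols in transcripts.items():
        out.append("\t".join(cols[:2] + ["match"] + cols[3:8] + [f"ID={tid};Name={tid}"]))
        offset = 0
        parts = sorted(exons.get(tid, []), key=lambda c: int(c[3]))
        for i, ex in enumerate(parts, start=1):
            length = int(ex[4]) - int(ex[3]) + 1
            attr = (
                f"ID={tid}:exon-{i};Name={tid};Parent={tid};"
                f"Target={tid} {offset + 1} {offset + length} +"
            )
            out.append("\t".join(ex[:2] + ["match_part"] + ex[3:8] + [attr]))
            offset += length
    return out
```

The StringTie-specific attributes (cov, fPKM, tPM, gene_id, transcript_id) are discarded here for simplicity. It is worth loading the converted file in a browser next to the original to confirm the coordinates are unchanged before using it as est_gff input.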
From allisonfuiten at gmail.com Mon Oct 31 18:34:23 2016
From: allisonfuiten at gmail.com (Allison Fuiten)
Date: Mon, 31 Oct 2016 17:34:23 -0700
Subject: [maker-devel] InterProScan protein domain & AED physical evidence filtering
Message-ID:

Hello MAKER Google group,

For the final round of a MAKER annotation of a de novo plant genome assembly, I ran MAKER twice: once with keep_preds=0, which annotated 20,284 genes, and once with keep_preds=1, which annotated 34,055 genes.

I ran the 34,055 genes (the keep_preds=1 set) through InterProScan to search the MAKER predictions for protein domain content and added the InterProScan output to the MAKER GFF file with the ipr_update_gff accessory script. The plan is to go through the 34,055 genes and remove any gene model that has neither protein domain content nor physical evidence. I am counting genes with AED=1 as the genes that have no physical evidence. I have two questions:

1. I count 11,762 genes with AED=1.0 in the keep_preds=1 annotation set, which leaves 22,293 genes that I'm assuming have some physical evidence (34,055 - 11,762 = 22,293). But when I originally ran MAKER with keep_preds=0, I counted only 20,284 genes. What are the extra ~2,000 genes that are annotated in the keep_preds=1 run with an AED score of less than 1.0 but are not annotated in the keep_preds=0 run?

2. Is there an accessory script available that will remove genes that lack both InterProScan protein domains and physical evidence (AED < 1)? This type of gene removal was mentioned in a previous post from 2012 (https://groups.google.com/forum/#!searchin/maker-devel/sorry$20there$27s$20not$20a$20script$20prepackaged$20with$20MAKER$20for$20that$20yet.%7Csort:relevance/maker-devel/VaoXWlGHOjs/EElr_otrK8QJ), and I was wondering whether someone has since written a script that will do this for me.

If anyone could offer me any feedback, that would be greatly appreciated!

Thank you,
Allison
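For the filtering described in this message (dropping gene models that have neither a protein domain nor physical evidence), a rough Python sketch follows. It is not a MAKER accessory script, and the function name filter_maker_gff is hypothetical. It relies on assumptions you should verify against your own file: that gene models have "maker" in column 2, that each mRNA row carries an _AED attribute, that ipr_update_gff records domains as Dbxref entries beginning with "InterPro:" or "Pfam:", and that parent features appear before their children.

```python
def parse_attrs(field):
    """Parse a GFF3 attribute column into a dict (no URL-decoding)."""
    attrs = {}
    for item in field.strip().strip(";").split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def filter_maker_gff(gff_lines, max_aed=1.0):
    """Keep genes with at least one mRNA that has AED < max_aed or a domain.

    Rows that are not 'maker' gene-model rows (comments, evidence alignments,
    FASTA lines) are passed through unchanged.
    """
    rows = [line.rstrip("\n") for line in gff_lines]

    # Pass 1: find genes supported by at least one of their transcripts.
    supported_genes = set()
    for row in rows:
        cols = row.split("\t")
        if len(cols) != 9 or cols[1] != "maker" or cols[2] != "mRNA":
            continue
        attrs = parse_attrs(cols[8])
        has_evidence = float(attrs.get("_AED", "1")) < max_aed
        has_domain = any(
            ref.startswith(("InterPro:", "Pfam:"))
            for ref in attrs.get("Dbxref", "").split(",")
        )
        if has_evidence or has_domain:
            supported_genes.update(attrs.get("Parent", "").split(","))

    # Pass 2: emit kept genes and everything descending from them.
    kept_ids = set()
    out = []
    for row in rows:
        cols = row.split("\t")
        if len(cols) != 9 or cols[1] != "maker":
            out.append(row)
            continue
        attrs = parse_attrs(cols[8])
        if cols[2] == "gene":
            keep = attrs.get("ID") in supported_genes
        else:
            parents = attrs.get("Parent", "").split(",")
            keep = any(parent in kept_ids for parent in parents)
        if keep:
            kept_ids.add(attrs.get("ID"))
            out.append(row)
    return out
```

Note the design choice: a gene is kept whole if any one of its transcripts is supported, so unsupported alternative transcripts of a kept gene remain in the output. If you would rather filter per transcript, the second pass would need to test each mRNA individually.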