[maker-devel] spliting genome for annotation

Thu Jun 27 09:42:26 MDT 2013

Correct.  The level of splitting is going to be limited by the largest
contig.  The job holding the largest contig will then be your slowest job,
but the total runtime depends on how much splitting you can achieve.
Splitting into 10 jobs and running them all simultaneously will make total
run time roughly 1/10 as long.  You can use the -base flag with MAKER to
make all jobs write to the same directory.  Use the -g flag to specify a
different input fasta file for each job (then they can all share the same
control files).  When all jobs complete, you will then need to run maker
once more using the original assembly fasta and the -dsindex flag to get
MAKER to clean up the datastore log file (it is rebuilt to index all
contigs).  That only takes 2 minutes to run.
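
As a rough sketch of the workflow just described, the per-chunk jobs and the
final re-indexing run could be assembled as below.  The -base, -g and
-dsindex flags are the ones named above; the chunk file names, the base name
"my_genome", the helper name build_maker_commands, and passing -base to the
final run as well are illustrative assumptions.  The commands are only
printed here, not executed, and the shared control files are assumed to sit
in the working directory.

```python
def build_maker_commands(chunk_fastas, assembly_fasta, base):
    """Return (per-chunk commands, final re-index command) as argument lists.

    Hypothetical helper: it only assembles command lines, it does not run MAKER.
    """
    # One job per chunk: same -base so every job writes to the same output
    # directory, a different -g input fasta for each job.
    chunk_cmds = [["maker", "-base", base, "-g", fasta] for fasta in chunk_fastas]
    # Final pass on the original assembly with -dsindex to rebuild the
    # datastore index (assumption: the same -base is passed here too).
    reindex_cmd = ["maker", "-base", base, "-g", assembly_fasta, "-dsindex"]
    return chunk_cmds, reindex_cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    chunks = [f"chunks/part_{i}.fasta" for i in range(3)]
    cmds, final = build_maker_commands(chunks, "assembly.fasta", "my_genome")
    for cmd in cmds:
        print(" ".join(cmd))   # submit each of these as a separate cluster job
    print(" ".join(final))     # run this once after all jobs have finished
```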

You can use the fasta_tool utility that comes with MAKER to conveniently
split the input assembly fasta.
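
fasta_tool's own options are not reproduced here.  Purely to illustrate what
a length-balanced split looks like, the following self-contained Python
sketch (not part of MAKER; the function names read_fasta and split_balanced
are made up for this example) places contigs, longest first, into whichever
chunk currently holds the fewest bases.

```python
import heapq


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def split_balanced(records, n_chunks):
    """Greedy split: put each contig, longest first, into the lightest chunk."""
    heap = [(0, i) for i in range(n_chunks)]  # (total bases so far, chunk index)
    heapq.heapify(heap)
    chunks = [[] for _ in range(n_chunks)]
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        total, i = heapq.heappop(heap)
        chunks[i].append((header, seq))
        heapq.heappush(heap, (total + len(seq), i))
    return chunks
```

Each resulting chunk would then be written to its own fasta file and passed
to a separate job.  Note that no chunk can be smaller than the largest
contig, which is why that contig limits how far splitting helps.
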

MAKER does not train the gene predictors for you, and the hints it gives are
on a per-gene basis, so splitting contigs has no effect on that.  For
initial training of gene predictors, run MAKER on about 10-30 Mb of your
largest contigs and use either the protein2genome or est2genome prediction
option to build gene models to train the predictors on.  You will need to
train Augustus or SNAP yourself using those models and each tool's own
documentation.  If training SNAP, you can use maker2zff to convert to SNAP's
training format.  You can also use the tool CEGMA from Ian Korf's lab to
train SNAP.  Use the cegma2zff script that comes with MAKER to do the
conversion for training input.
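
One possible way to pick the "10-30 Mb of your largest contigs" is sketched
below.  This is an illustration only: the function name
pick_training_contigs and the 20 Mb default are assumptions chosen from
within the range given above, and the contig lengths are expected to come
from your own assembly.

```python
def pick_training_contigs(lengths, target_bp=20_000_000):
    """Return (contig IDs, total bases), taking the longest contigs first
    until at least target_bp bases have been collected.

    lengths: dict mapping contig ID -> length in bases.
    """
    picked, total = [], 0
    for name, length in sorted(lengths.items(), key=lambda kv: kv[1], reverse=True):
        if total >= target_bp:
            break
        picked.append(name)
        total += length
    return picked, total


if __name__ == "__main__":
    # Toy lengths, for illustration only.
    toy = {"a": 5, "b": 9, "c": 3, "d": 7}
    print(pick_training_contigs(toy, target_bp=15))
```

The selected contigs would then be written to a separate fasta file and used
as the input for the initial training run.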

If you have questions once you start training, just send them to the list.

Thanks,
Carson

From:  Daniel Lawson <lawson at ebi.ac.uk>
Date:  Thursday, 27 June, 2013 9:37 AM
To:  <michel.moser at ips.unibe.ch>
Cc:  <maker-devel at yandell-lab.org>
Subject:  Re: [maker-devel] spliting genome for annotation

Michel,

It is about the size of your scaffolds rather than the whole genome.
Presumably you don't have 1.2 Gb of contiguous sequence. If you have long
scaffolds then the compute time will be constrained by the time taken to
process the largest scaffold.
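
To put a number on that constraint, here is a back-of-the-envelope sketch.
It assumes, purely for illustration, that run time is roughly proportional
to sequence length, which will not hold exactly in practice; the function
name best_case_speedup and the toy scaffold lengths are made up.

```python
def best_case_speedup(scaffold_lengths):
    """Upper bound on the speedup from splitting, under the assumption that
    run time scales with sequence length: no job can finish sooner than the
    one holding the largest scaffold."""
    return sum(scaffold_lengths) / max(scaffold_lengths)


if __name__ == "__main__":
    # Toy scaffold lengths in Mb, for illustration only.
    print(best_case_speedup([40, 30, 20, 10]))
```

Under that assumption, adding more jobs than this ratio would not shorten
the overall run any further.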

regards
Dan

On 27 June 2013 14:33,  <michel.moser at ips.unibe.ch> wrote:
> Dear Maker-developers
> 
> If I understood correctly, in order to increase speed and reduce needed
> resources one can split the genome into chunks and annotate each chunk
> separately.
> (I would really like to use that, as I am working with a 1.2 Gbasepair
> draft genome and can't use MPI on the computing cluster.)
> I am a bit worried about how this might affect the annotation, as the
> gene predictor would get trained quite differently for each chunk, right?
> Or is there communication between the chunks using the -base option of
> maker?
> 
> Could you maybe name some pros and cons of splitting your genome for the
> annotation with maker?
> 
> Thank you very much,
> Michel
> 
> 
> 
> 
> ________________________________________
> From: Moser, Michel (IPS)
> Sent: Thursday, 27 June 2013 15:24
> To: Carson Holt
> Subject: RE: [maker-devel] start position for some genes results
> 
> ________________________________________
> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of
> Carson Holt [carsonhh at gmail.com]
> Sent: Wednesday, 26 June 2013 04:02
> To: Jingjing Jin; maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] start position for some genes results
> 
> The failure you are seeing occurs in the initialization stage, before
> reaching any of the changes that would have been introduced by 2.28.  Try
> running the test data that comes with MAKER.  Does it fail as well?
> 
> --Carson
> 
> 
> 
> From: Jingjing Jin <jjin01 at mail.rockefeller.edu>
> Date: Tuesday, 25 June, 2013 9:53 PM
> To: Carson Holt <carsonhh at gmail.com>,
> "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
> Subject: RE: [maker-devel] start position for some genes results
> 
> Yes, this is the real name.
> 
> There is also no ":" in the name.
> 
> I used the same file with maker 2.27 and had no problem.
> 
> I am not sure what is wrong with the new version.
> 
> Jingjing
> 
> 
> ________________________________
> From: Carson Holt [carsonhh at gmail.com]
> Sent: Tuesday, June 25, 2013 9:47 PM
> To: Jingjing Jin; maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] start position for some genes results
> 
> Could you check your input genome file for the sequence
> "processed_tobacco_genome_sequences_c1", make sure that it has in fact that
> exact name, and that there are no ':' characters in the name, because they
> can confuse the BioPerl fasta indexer.
> 
> --Carson
> 
> 
> From: Jingjing Jin <jjin01 at mail.rockefeller.edu>
> Date: Tuesday, 25 June, 2013 9:30 PM
> To: Carson Holt <carsonhh at gmail.com>,
> "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
> Subject: RE: [maker-devel] start position for some genes results
> 
> Dear Carson,
> 
> 
> I am so sorry. The problem is still here.
> 
> STATUS: Parsing control files...
> STATUS: Processing and indexing input FASTA files...
> STATUS: Setting up database for any GFF3 input...
> A data structure will be created for you at:
> /home/jingjing/project/tobacco/Nicotiana_tabacum/maker.2.28/1/tobacco_seq_1.maker.output/tobacco_seq_1_datastore
> 
> To access files for individual sequences use the datastore index:
> /home/jingjing/project/tobacco/Nicotiana_tabacum/maker.2.28/1/tobacco_seq_1.maker.output/tobacco_seq_1_master_datastore_index.log
> 
> STATUS: Now running MAKER...
> WARNING: Cannot find >processed_tobacco_genome_sequences_c1, trying to re-index the fasta.
> stop here: processed_tobacco_genome_sequences_c1
> ERROR: Fasta index error
>  at /home/jingjing/software/maker.2.28/maker/bin/../lib/Process/MpiChunk.pm line 239.
>         Process::MpiChunk::_prepare('Process::MpiChunk=HASH(0x4e16178)', 'HASH(0x4e10810)', 0) called at /home/jingjing/software/maker.2.28/maker/bin/../lib/Process/MpiTiers.pm line 73
>         Process::MpiTiers::__ANON__() called at /home/jingjing/software/maker.2.28/maker/bin/../lib/Error.pm line 415
>         eval {...} called at /home/jingjing/software/maker.2.28/maker/bin/../lib/Error.pm line 407
>         Error::subs::try('CODE(0x4e19100)', 'HASH(0x4e1bd58)') called at /home/jingjing/software/maker.2.28/maker/bin/../lib/Process/MpiTiers.pm line 79
>         Process::MpiTiers::_prepare('Process::MpiTiers=HASH(0x4e16e68)') called at /home/jingjing/software/maker.2.28/maker/bin/../lib/Process/MpiTiers.pm line 56
>         Process::MpiTiers::new('Process::MpiTiers', 'HASH(0x4e16ad8)', 0, 'Process::MpiChunk') called at /home/jingjing/software/maker.2.28/maker/bin/./maker line 650
> --> rank=NA, hostname=ChuaServer1
> ERROR: Failed in tier preparation
> WARNING: You must always set a rank before running MpiTiers
> FATAL: argument `seq_id` does not exist in MpiTier object
> 
> ________________________________
> From: Carson Holt [carsonhh at gmail.com]
> Sent: Tuesday, June 25, 2013 8:55 PM
> To: Jingjing Jin; maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] start position for some genes results
> 
> Delete the mpi_blastdb directory before starting, to make sure all indexes get
> rebuilt.  Also make sure you are not setting TMP= to a network mounted
> location.
> 
> --Carson
> 
> 
> From: Jingjing Jin <jjin01 at mail.rockefeller.edu>
> Date: Tuesday, 25 June, 2013 8:53 PM
> To: Carson Holt <carsonhh at gmail.com>,
> "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
> Subject: RE: [maker-devel] start position for some genes results
> 
> Dear Carson,
> 
> When I use the new version of maker, I have another problem like this:
> 
> jingjing at ChuaServer1:~/project/$
> /home/jingjing/software/maker.2.28/maker/bin/./maker
> STATUS: Parsing control files...
> STATUS: Processing and indexing input FASTA files...
> STATUS: Setting up database for any GFF3 input...
> A data structure will be created for you at:
> /home/jingjing/project/tobacco/Nicotiana_tabacum/maker.2.28/1/tobacco_seq_1.maker.output/tobacco_seq_1_datastore
> 
> To access files for individual sequences use the datastore index:
> /home/jingjing/project/tobacco/Nicotiana_tabacum/maker.2.28/1/tobacco_seq_1.maker.output/tobacco_seq_1_master_datastore_index.log
> 
> STATUS: Now running MAKER...
> WARNING: Cannot find >processed_tobacco_genome_sequences_c1, trying to re-index the fasta.
> stop here: processed_tobacco_genome_sequences_c1
> ERROR: Fasta index error
> 
> 
> Do you know how to fix this problem with the new version?
> 
> Thanks!
> 
> Jingjing
> 
> 
> 
> ________________________________
> From: Carson Holt [carsonhh at gmail.com]
> Sent: Tuesday, June 25, 2013 6:55 PM
> To: Jingjing Jin; maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] start position for some genes results
> 
> What MAKER version are you using?  This should be fixed in the current 2.28.
> It only happened under a very specific set of circumstances, but I remember
> fixing it. So let me know if you are using 2.28.
> 
> --Carson
> 
> 
> 
> From: Jingjing Jin <jjin01 at mail.rockefeller.edu>
> Date: Tuesday, 25 June, 2013 5:13 PM
> To: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
> Subject: [maker-devel] start position for some genes results
> 
> Dear all,
> 
> I found something strange about the coordinates in my final results.
> 
> For example, the start position of this final gene model:
> 
> c124062 maker   gene    -1      507     .       -       .       ID=maker-c124062-snap-gene-0.2;Name=maker-c124062-snap-gene-0.2
> 
> 
> Its start position is -1.
> 
> Does someone know why the start position is  -1?
> 
> Is there something wrong?
> 
> Thanks!
> 
> Jingjing
> 
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 

-- 
Ensembl Genomes | VectorBase | i5K insect genome initiative