[maker-devel] altest without MPI?

Wed Jun 19 19:05:49 MDT 2013

The throughput is based on contig length, so long contigs will take longer
than short contigs.  Any contig less than 10kb is mostly useless for
annotation purposes (so you can filter those from your 800,000 right
away). Take your contigs that finish, and sum up their length to get a
better estimate of how long it will take to complete running.  Most
genomes can complete in a few days on a multi-core machine.  Bigger
genomes or bigger datasets take longer.  (Note that altest evidence takes
3-4x longer to align than proteins.)
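
As a rough illustration of that estimate, here is a minimal Python sketch.
It assumes you have recorded the length and wall-clock hours of each contig
that has already finished, and that run time scales roughly linearly with
contig length.  The function names and the 10kb cutoff parameter are
hypothetical helpers for this example, not part of MAKER.

```python
def read_fasta_lengths(lines):
    """Yield (name, length_bp) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        elif line and name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def estimate_remaining_hours(finished, remaining_lengths, min_len=10_000):
    """Extrapolate run time from contigs that have already finished.

    finished          -- list of (length_bp, hours) for completed contigs
    remaining_lengths -- lengths (bp) of contigs still to be run
    min_len           -- contigs shorter than this are dropped (assumed 10kb)
    """
    done_bp = sum(bp for bp, _ in finished)
    done_hours = sum(hours for _, hours in finished)
    if done_bp == 0:
        raise ValueError("need at least one finished contig with length > 0")
    hours_per_bp = done_hours / done_bp  # assumes time is linear in length
    todo_bp = sum(bp for bp in remaining_lengths if bp >= min_len)
    return todo_bp * hours_per_bp
```

The result is in single-CPU hours if the finished contigs were timed on
one CPU; divide by the number of CPUs for a rough wall-clock figure.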

The advantage of proteins is that the species do not have to be closely
related.  Nucleotide sequence diverges quickly and proteins slowly (that's
why proteins are used for phylogenetic trees).

A good strategy would be to get ~10Mb of sequence (use your longest
contigs).  Run with chicken, turkey, and pigeon proteins.  Use the
protein2genome option to generate annotations.  Those annotations should
now be sufficient to train SNAP and Augustus.  Then you can finish by
running all your contigs with the same dataset (protein2genome now turned
off), using the newly trained SNAP and Augustus files along with any altest
files you want to use. Note that the size of the dataset will determine
the total run time.
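
One way to pick that ~10Mb training subset is sketched below in Python.
It only chooses contig names by length; pulling the sequences out and
running MAKER on them is left to your usual tools.  The function name and
the dictionary input are assumptions made for this example.

```python
def pick_training_contigs(lengths, target_bp=10_000_000):
    """Choose the longest contigs until their total length reaches target_bp.

    lengths   -- dict mapping contig name to length in bp
    target_bp -- amount of sequence wanted for training (assumed ~10Mb)
    Returns (list_of_names, total_bp_selected).
    """
    chosen, total = [], 0
    # Longest first, so the training set uses as few contigs as possible.
    for name, bp in sorted(lengths.items(), key=lambda kv: kv[1], reverse=True):
        if total >= target_bp:
            break
        chosen.append(name)
        total += bp
    return chosen, total
```

If the assembly holds less than the target in total, the function simply
returns every contig along with the total it managed to collect.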

To get things to run faster, you can also run on your university's
computer cluster (then you will have hundreds of cpus available to you).
The Purdue cluster supports MPI, and with 30-50 cpus you could annotate
even large genomes in a reasonable time.  Alternatively you can request a
startup account at XSEDE, an NSF-funded computer resource open to all US
institutions.  A startup allocation with 50,000 cpu hours only takes 2
weeks to approve. You should request an allocation on the Lonestar cluster
if you go that route; it has 64,000 cpus. I was able to annotate the maize
genome (which is a very large genome at over 2 gigabases).  I used
abnormally large EST and protein datasets (~4 gigabases of evidence, which
is much more than a normal annotation job), and it completed in under 3
hours on 2,100 cpus.
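
To put those numbers side by side, here is a small Python sketch of the
cpu-hour arithmetic.  It assumes an allocation is charged simply as wall
hours times cpus, which may not match how a given cluster actually bills
usage; the function name is made up for this example.

```python
def cpu_hours(wall_hours, cpus):
    """CPU hours consumed, assuming a simple wall_hours * cpus charge."""
    return wall_hours * cpus


# Upper bound for the maize run: under 3 hours on 2,100 cpus.
maize_cpu_hours = cpu_hours(3, 2100)

# Share of a 50,000 cpu-hour startup allocation that run would use.
startup_allocation = 50_000
fraction_used = maize_cpu_hours / startup_allocation
```

Under that assumption the maize run comes to at most 6,300 cpu hours,
about 13% of a startup allocation.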

--Carson

On 13-06-19 5:12 PM, "Jacqueline R M Doyle" <jmdoyle at purdue.edu> wrote:

>Hi Carson (and whoever else might be reading this!)
>
>Thanks so much, I think splitting the files up using fasta_tool will
>definitely move things along.  I did a trial version with altest this
>weekend, and seemed to be averaging about an hour a scaffold (with 1
>cpu).  I'm a little concerned, as we have ~800,000 scaffolds.  Does this
>seem like a reasonable estimate of the time it should take to annotate
>one sequence?  Could I be missing something in my maker_opts file?
>
>Let me back up for just a minute and describe the project a little more
>generally.  As I mentioned before, we have no protein sequences or ESTs
>for our species of interest, which is an avian species.  I could
>potentially use proteins from chicken or turkey, but neither is closely
>related to our species.  Time is a bit of an issue... do you have any
>thoughts on how much time per scaffold it should take to annotate using
>protein2genome?  If chicken and turkey are not closely related, is it
>worth the time investment?
>
>Let me finish by saying I think MAKER is wonderful, and I really
>appreciate the discussions on this group.
>
>Best wishes, Jackie