[maker-devel] AED calculations using the MAKER pipeline

Wed Mar 20 07:51:29 MDT 2013

In the current MAKER download there is an issue with GFF3 passthrough:
everything is done at the very last step.  This of course leads to a
memory spike and a very slow last step, which seems to be similar to what
you are describing.  It should be resolved in what will become version
2.28.  I can give you access to the pre-release code so you can check
that the issue is resolved for you.  I'll send details in a separate
e-mail.

Also, MAKER prints a ### line after every ~100,000 bp of assembly it
processes.  You can ignore these lines, but they actually have a meaning
in GFF3: everything between two ### lines is fully resolved.  This allows
programs that read GFF3 to parallelize file loading, or to load just
sections of a file, because they can rapidly identify "safe chunks".
Without them the entire file must be loaded into memory in order to be
certain that all feature parts are there (as there is no requirement for
sorting or order in GFF3).
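
As an illustration of the "safe chunk" idea, here is a minimal Python
sketch of a reader that hands back one ###-delimited block at a time.
The function name iter_gff3_chunks is made up for this example and is not
part of MAKER.  It assumes each ### sits alone on its own line, and it
stops at a ##FASTA directive, after which only sequence data follows.

```python
from typing import Iterable, Iterator, List


def iter_gff3_chunks(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield lists of feature lines, one list per ###-delimited chunk.

    Hypothetical helper for illustration only (not a MAKER function).
    """
    chunk: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "##FASTA":
            # Everything after this directive is sequence, not features.
            break
        if line.strip() == "###":
            # All features seen so far are complete; emit them.
            if chunk:
                yield chunk
                chunk = []
            continue
        if not line or line.startswith("#"):
            # Skip blank lines, comments and other directives.
            continue
        chunk.append(line)
    if chunk:
        # Trailing features that were not followed by a ### line.
        yield chunk


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py some_file.gff
    with open(sys.argv[1]) as handle:
        for number, features in enumerate(iter_gff3_chunks(handle), 1):
            print(number, len(features))
```

Each chunk can be processed and discarded before the next one is read, so
memory use stays bounded by the largest chunk rather than the whole file.
Empty chunks (two ### lines in a row) are simply skipped.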

The log.child files will always be empty unless you run an analysis such
as snap or blast.

Thanks,
Carson

On 13-03-20 9:05 AM, "Krishnakumar, Vivek" <vKrishna at jcvi.org> wrote:

>Hi,
>
>We have been using the MAKER pipeline here at JCVI to calculate AED
>scores by feeding in our annotation set as `model_gff` and the protein
>and EST evidence as `protein_gff` and `est_gff` respectively. Here is the
>issue we are having:
>
>When running the above pipeline with protein2genome and est2genome
>evidence generated earlier by MAKER, there are no problems calculating
>the AED score. Normally this pipeline takes a little over 12 hours to
>complete.
>
>But if we use our own evidence, AAT and Genewise aligned proteins for
>`protein_gff` and PASA assembled ESTs for `est_gff`, the same pipeline
>runs very slowly and the intermediary *.gff.ann file has many chunks
>(separated by '###') that are completely empty. Our evidence is formatted
>in the same way as est2genome or protein2genome (GFF file with
>"expressed_sequence_match::match_part" or "protein_match::match_part"
>features respectively).
>
>The input to my pipeline is 8 chromosomes, ~2200 scaffolds and I use the
>default `max_dna_len` parameter used to split the large assemblies into
>chunks.
>
>Investigating the master_datastore.log shows me that the scaffolds run
>through without any issues and the chromosomes are still being processed.
>For any of the chromosomes, investigating the 'run.log' file, one level
>above 'theVoid' shows me how many "final.section" jobs were started and
>how many finished. And in the case of all the chromosomes, it tells me
>that everything that was started has finished. And the 'log.child.*'
>files within `theVoid` are all empty. Also within `theVoid`, I'm noticing
>that the "raw.section" and "evidence_*.gff" files are not empty. But one
>thing that is surprising is that of all the "final.section" files, only
>the one pertaining to the last chunk is very large (proportional to the
>size of the evidence); the rest are all exactly the same size (exactly
>331 bytes).
>
>I'm running MAKER in MPI mode spawning 48 processes on a high memory
>machine with 64 available cores and 1TB of RAM.
>
>I hope I've been able to explain my situation clearly in this email.
>
>Any help is appreciated.
>Thank you.
>
>Vivek