[maker-devel] Further split genome questions
Carson Holt
carsonhh at gmail.com
Thu Aug 14 09:46:44 MDT 2014
What version of MAKER are you using? I'd also need to see the GFF3 files
before the merge. You may also need to turn off map_forward since you are
passing in GFF3 with MAKER names, creating new models with MAKER names and
then moving names from old models forward onto new ones (which may force
names to be used twice).
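For illustration, the relevant maker_opts.ctl settings would look roughly like this (a sketch only; the genome and GFF3 file names are placeholders, not your actual files):

```
genome=scaffolds_part1.fasta    # placeholder: the scaffold subset for this run
maker_gff=previous_run.all.gff3 # placeholder: GFF3 produced by the earlier MAKER run
model_pass=1                    # reuse gene models from maker_gff as evidence
map_forward=0                   # do not map old names forward onto new models
```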
--Carson
On 8/14/14, 9:40 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>
>Thank you so much!
>
>However, I'm still struggling, I'm afraid: I tried this 'two-step merging'
>approach with a subset of scaffolds and got duplicate IDs.
>
>Here is what I did:
>- divided the input scaffolds into two files
>- ran maker separately on these files (-> separate output dirs)
>-- additional input: maker-generated gff3 from the previous (single) run
>-- repeatmasking, snaphmm, gmhmm, augustus_species are given
>-- map_forward=0 / 1 (I tried both, to the same effect)
>- ran gff3_merge twice using the index log (once per output dir)
>- ran gff3_merge on these two gff3 files
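In commands, the two-step merge above would look roughly like this (directory names, fasta names, and datastore index paths are placeholders for the actual run layout):

```shell
# Run maker on each scaffold subset in its own directory (placeholder names):
(cd run_part1 && maker -g scaffolds_part1.fasta maker_opts.ctl maker_bopts.ctl maker_exe.ctl)
(cd run_part2 && maker -g scaffolds_part2.fasta maker_opts.ctl maker_bopts.ctl maker_exe.ctl)

# Merge each run's per-contig GFF3s via its datastore index log:
gff3_merge -d run_part1/part1.maker.output/part1_master_datastore_index.log -o part1.gff3
gff3_merge -d run_part2/part2.maker.output/part2_master_datastore_index.log -o part2.gff3

# Then merge the two merged files:
gff3_merge -o merged_all.gff3 part1.gff3 part2.gff3
```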
>
>$ grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort | uniq -c | sort -n | tail
> 2 ID=snap_masked-scf7180005140699-processed-gene-0.19
> 2 ID=snap_masked-scf7180005140699-processed-gene-0.22
> 2 ID=snap_masked-scf7180005140699-processed-gene-1.36
> 2 ID=snap_masked-scf7180005140713-processed-gene-0.4
> 2 ID=snap_masked-scf7180005140744-processed-gene-0.4
> 2 ID=snap_masked-scf7180005140744-processed-gene-0.6
> 2 ID=snap_masked-scf7180005140754-processed-gene-0.14
> 2 ID=snap_masked-scf7180005140754-processed-gene-0.15
> 2 ID=snap_masked-scf7180005140754-processed-gene-0.19
> 2 ID=snap_masked-scf7180005181475-processed-gene-0.3
>
>$ grep snap_masked-scf7180005181475-processed-gene-0.3 merged_all.gff3 | grep "\sgene"
>scf7180005181475 maker gene 9050 9385 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>scf7180005181475 maker gene 846 1088 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>
>- found duplicates! i.e., the same ID for gene annotations in different
>areas of the same scaffold (of 655 gene annotations, 51 appear twice)
>-- this happens not only with gene annotations, but also with CDS and mRNA
>annotations, as far as I can see (here, in one example, non-overlapping but
>nearby CDS snippets got the same ID).
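The same duplicate-ID scan can be extended to mRNA and CDS features; here it is run on a small synthetic GFF3 (toy data with one deliberately duplicated gene ID, not the real merged_all.gff3):

```shell
# Build a tiny synthetic GFF3 containing one duplicated gene ID:
printf '%s\t%s\t%s\t%s\t%s\t.\t+\t.\t%s\n' \
  scf1 maker gene 100 500  'ID=gene-0.1;Name=gene-0.1' \
  scf1 maker mRNA 100 500  'ID=mRNA-0.1;Parent=gene-0.1' \
  scf1 maker gene 900 1300 'ID=gene-0.1;Name=gene-0.1' \
  scf2 maker gene 50  400  'ID=gene-0.2;Name=gene-0.2' > toy.gff3

# Count each ID per feature type and report those seen more than once:
awk -F'\t' '$3=="gene" || $3=="mRNA" || $3=="CDS" {
    split($9, a, ";")            # a[1] is the ID=... attribute
    n[$3 " " a[1]]++
}
END {
    for (k in n) if (n[k] > 1) print n[k], k
}' toy.gff3
# prints: 2 gene ID=gene-0.1
```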
>
>
>I suspected this might have to do with the map_forward flag, but I get the
>same problem either way (with genes at the same locations).
>I attached one of the ctl files in case you want to have a look; the other
>is analogous. Do you need anything else?
>
>What did I miss? This should not happen, right?
>
>
>
>
>On Wed, 13 Aug 2014 15:52:34 +0000
> Carson Holt <carsonhh at gmail.com> wrote:
>>Yes. One cpu will have several processes; most are helper processes that
>>will use 0% CPU almost all of the time (for example, there is a shared
>>variable manager process that launches with MAKER but will also be listed
>>as 'maker' under top because it is technically a child process and not a
>>separate script). Also, system calls will launch a new process that will
>>use all the CPU while the calling process drops to 0% CPU until it
>>finishes.
>>
>>Yes. Your explanation is correct. You then use gff3_merge to merge the
>>GFF3 files.
>>
>>--Carson
>>
>>
>>
>>On 8/13/14, 3:32 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>>
>>>
>>>Our admin counts processes. Do I understand you right that one CPU
>>>handles several processes?
>>>
>>>I'm still confused by the different directories (and I made a mistake
>>>when asking last time; I wanted to say 'If I do NOT start the jobs in the
>>>same directory...').
>>>So, if I start each piece of a genome in its own directory (for example),
>>>then it gets a unique basename (because the output will be separate from
>>>all other pieces anyway), and I will not run dsindex but instead use
>>>gff3_merge for each piece's output and then once again to merge all
>>>resulting gff3 files?
>>>
>>>Hope I got you right :)
>>>
>>>Thanks for your help!
>>>Jeanne
>>>
>>>
>>>
>>>On Wed, 6 Aug 2014 15:45:56 +0000
>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>Is your admin counting processes or CPU usage? Each system call creates
>>>>a separate process, so you can expect multiple processes but only a
>>>>single CPU of usage per instance. Use different directories if you are
>>>>running that many jobs. You can concatenate the separate results when
>>>>you're done. Use the gff3_merge script to help concatenate the separate
>>>>GFF3 files generated from separate jobs.
>>>>
>>>>--Carson
>>>>
>>>>Sent from my iPhone
>>>>
>>>>> On Aug 6, 2014, at 9:33 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de>
>>>>>wrote:
>>>>>
>>>>>
>>>>>
>>>>> We are using MPI as well; each of the 20 parts gets assigned 4
>>>>> threads. Our admin reports, however, that the processes seem to
>>>>> assemble more threads than they are allowed. It is not Blast (which is
>>>>> set to 1 cpu in the opts.ctl). Do you have a suggestion why?
>>>>>
>>>>> If I start the jobs in the same directory, how can I make sure they
>>>>> write to the same directory (as, I think, is required to put the
>>>>> pieces together in the end)? Does -basename take paths?
>>>>>
>>>>>
>>>>> On Wed, 6 Aug 2014 15:12:50 +0000
>>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>> I think the freezing is because you are starting too many
>>>>>> simultaneous jobs. You should try to use MPI to parallelize instead.
>>>>>> The concurrent-job way of doing things can start to cause problems if
>>>>>> you are running 10 or more jobs in the same directory. You could try
>>>>>> splitting them into different directories.
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>>> On Aug 6, 2014, at 9:01 AM, "Jeanne Wilbrandt"
>>>>>>><j.wilbrandt at zfmk.de>
>>>>>>>wrote:
>>>>>>>
>>>>>>>
>>>>>>> aha, so this explains that.
>>>>>>> Daniel, the average is 5930.37 bp, but ranging from ~50 to more than
>>>>>>> 60,000, with roughly half of the sequences being shorter than 3,000 bp.
>>>>>>>
>>>>>>> What do you think about this weird 'I am running but not really
>>>>>>> doing anything' behavior?
>>>>>>>
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>> Jeanne
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 6 Aug 2014 14:16:52 +0000
>>>>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>>> If you are starting and restarting, or running multiple jobs, then
>>>>>>>> the log can be partially rebuilt. On rebuild only the FINISHED
>>>>>>>> entries are added. If there is a GFF3 result file for the contig,
>>>>>>>> then it is FINISHED. FASTA files will only exist for the contigs
>>>>>>>> that have gene models. Small contigs will rarely contain models.
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>>> On Aug 6, 2014, at 6:40 AM, "Jeanne Wilbrandt"
>>>>>>>>><j.wilbrandt at zfmk.de> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Carson,
>>>>>>>>>
>>>>>>>>> I ran into more conspicuous behavior running maker 2.31 on a
>>>>>>>>> genome which is split into 20 parts, using the -g flag and the
>>>>>>>>> same basename.
>>>>>>>>> Most of the jobs ran simultaneously on the same node; 17 seemed to
>>>>>>>>> finish normally, while the remaining three seemed to be stalled
>>>>>>>>> and produced 0 B of output. Do you have any suggestion why this is
>>>>>>>>> happening?
>>>>>>>>>
>>>>>>>>> After I stopped these stalled jobs, I checked the index.log and
>>>>>>>>> found that of 38,384 mentioned scaffolds, 154 appear only once in
>>>>>>>>> the log. The surprise is that 2/3 of these appear only as FINISHED
>>>>>>>>> (the rest only started). There are no models for these 'finished'
>>>>>>>>> scaffolds stored in the .db, and they are distributed over all
>>>>>>>>> parts of the genome (i.e., each of the 20 jobs contained scaffolds
>>>>>>>>> that 'did not start' but 'finished').
>>>>>>>>> Should this be an issue of concern?
>>>>>>>>> It might be an NFS lock problem, as NFS is heavily loaded, but the
>>>>>>>>> NFS files look good, so we suspect something fishy going on...
>>>>>>>>>
>>>>>>>>> Hope you can help,
>>>>>>>>> best wishes,
>>>>>>>>> Jeanne Wilbrandt
>>>>>>>>>
>>>>>>>>> zmb // ZFMK // University of Bonn
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> maker-devel mailing list
>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>
>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>
>>>
>>
>>
>
More information about the maker-devel mailing list