[maker-devel] Further split genome questions

Carson Holt carsonhh at gmail.com
Thu Aug 14 09:57:39 MDT 2014


For the file you just sent me, is that from the first run with
map_forward=0 or with map_forward=1?

--Carson

On 8/14/14, 9:53 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:

>
>It is version 2.31.
>
>My first try was done with map_forward=0, and (I just noticed) the
>duplicates are already present in the separate gff3s in this case as
>well (one is attached).
> 
>Does this have something to do with the gff3 from the first run that I fed it?
>
>
>
>
>On Thu, 14 Aug 2014 15:46:44 +0000
> Carson Holt <carsonhh at gmail.com> wrote:
>>What version of MAKER are you using? I'd also need to see the GFF3
>>files before the merge. You may also need to turn off map_forward,
>>since you are passing in GFF3 with MAKER names, creating new models
>>with MAKER names, and then moving names from old models forward onto
>>new ones (which may force names to be used twice).
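>>
>>For example, the relevant lines in maker_opts.ctl might look like this
>>(the paths are placeholders):
>>
>>genome=/path/to/part1.fasta  #the scaffolds for this run
>>maker_gff=/path/to/first_run.all.gff  #re-use results from the previous run
>>map_forward=0  #do not carry old names forward onto new models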
>>
>>--Carson
>>
>>
>>On 8/14/14, 9:40 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>>
>>>
>>>Thank you so much!
>>>
>>>However, I'm still struggling, I'm afraid: I tried this 'two-step
>>>merging' approach with a subset of scaffolds and got duplicate IDs.
>>>
>>>Here is what I did:
>>>- divided the input scaffolds into two files
>>>- ran maker separately on these files (-> separate output dirs)
>>>-- additional input: the maker-generated gff3 from a previous (single) run
>>>-- repeatmasking, snaphmm, gmhmm, and augustus_species are given
>>>-- map_forward=0 / 1 (I tried both, with the same effect)
>>>- ran gff3_merge twice, once per run, using the index logs
>>>- ran gff3_merge on the two resulting gff3 files (rough command sketch below)
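>>>
>>>Roughly, in commands (base names abbreviated):
>>>
>>>$ maker -base part1 ...  # run 1, first half of the scaffolds
>>>$ maker -base part2 ...  # run 2, second half
>>>$ gff3_merge -d part1.maker.output/part1_master_datastore_index.log -o part1.gff3
>>>$ gff3_merge -d part2.maker.output/part2_master_datastore_index.log -o part2.gff3
>>>$ gff3_merge -o merged_all.gff3 part1.gff3 part2.gff3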
>>>
>>>$ grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort | uniq -c | sort -n | tail
>>>      2 ID=snap_masked-scf7180005140699-processed-gene-0.19
>>>      2 ID=snap_masked-scf7180005140699-processed-gene-0.22
>>>      2 ID=snap_masked-scf7180005140699-processed-gene-1.36
>>>      2 ID=snap_masked-scf7180005140713-processed-gene-0.4
>>>      2 ID=snap_masked-scf7180005140744-processed-gene-0.4
>>>      2 ID=snap_masked-scf7180005140744-processed-gene-0.6
>>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.14
>>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.15
>>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.19
>>>      2 ID=snap_masked-scf7180005181475-processed-gene-0.3
>>>
>>>$ grep snap_masked-scf7180005181475-processed-gene-0.3 merged_all.gff3 | grep "\sgene"
>>>scf7180005181475	maker	gene	9050	9385	.	-	.	ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>>scf7180005181475	maker	gene	846	1088	.	-	.	ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>>
>>>- found duplicates! i.e., the same ID for gene annotations in
>>>different areas of the same scaffold (of 655 gene annotations, 51
>>>appear twice)
>>>-- this happens not only with gene, but also with CDS and mRNA
>>>annotations, as far as I can see (here, in one example,
>>>non-overlapping but close CDS snippets got the same ID).
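>>>
>>>An analogous check shows it for mRNA IDs, e.g.:
>>>$ grep -P "\tmRNA\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort | uniq -c | sort -n | tail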
>>>
>>>
>>>I suspected this might have to do with the map_forward flag, but I
>>>get the same problem again (with genes at the same locations).
>>>I attached one of the ctl files in case you want to have a look; the
>>>other is analogous. Do you need anything else?
>>>
>>>What did I miss? This should not happen, right?
>>>
>>>
>>>
>>>
>>>On Wed, 13 Aug 2014 15:52:34 +0000
>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>Yes. One cpu will have several processes; most are helper processes
>>>>that use 0% CPU almost all of the time (for example, there is a
>>>>shared variable manager process that launches with MAKER but will
>>>>also be called 'maker' under top, because it is technically a child
>>>>of MAKER and not a separate script). Also, system calls launch a new
>>>>process that uses all the CPU while the calling process drops to 0%
>>>>CPU until it finishes.
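>>>>
>>>>A quick way to see this (plain ps, nothing MAKER-specific):
>>>>
>>>>$ ps -C maker -o pid,ppid,pcpu,comm
>>>>
>>>>Most of the 'maker' entries should sit near 0% CPU; only one per
>>>>instance does real work at any moment.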
>>>>
>>>>Yes. Your explanation is correct. You then use gff3_merge to merge
>>>>the resulting GFF3 files.
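>>>>
>>>>A minimal sketch (base names are placeholders):
>>>>
>>>>$ gff3_merge -d <base>_master_datastore_index.log -o <base>.gff3
>>>>$ gff3_merge -o all.gff3 part1.gff3 part2.gff3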
>>>>
>>>>--Carson
>>>>
>>>>
>>>>
>>>>On 8/13/14, 3:32 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>>>>
>>>>>
>>>>>Our admin counts processes. Do I understand you correctly that one
>>>>>CPU handles several processes?
>>>>>
>>>>>I'm still confused by the different directories (and I made a
>>>>>mistake when asking last time; I wanted to say 'If I do NOT start
>>>>>the jobs in the same directory...').
>>>>>So, if I start each piece of the genome in its own directory (for
>>>>>example), then it gets a unique basename (because the output will
>>>>>be separate from all other pieces anyway), and I will not run
>>>>>dsindex but instead use gff3_merge on each piece's output and then
>>>>>once more to merge all resulting gff3 files?
>>>>>
>>>>>Hope I got you right :)
>>>>>
>>>>>Thanks for your help!
>>>>>Jeanne
>>>>>
>>>>>
>>>>>
>>>>>On Wed, 6 Aug 2014 15:45:56 +0000
>>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>Is your admin counting processes or cpu usage? Each system call
>>>>>>creates a separate process, so you can expect multiple processes,
>>>>>>but only a single cpu of usage per instance. Use different
>>>>>>directories if you are running that many jobs. You can concatenate
>>>>>>the separate results when you're done. Use the gff3_merge script
>>>>>>to help concatenate the separate GFF3 files generated by the
>>>>>>separate jobs.
>>>>>>
>>>>>>--Carson
>>>>>>
>>>>>>Sent from my iPhone
>>>>>>
>>>>>>> On Aug 6, 2014, at 9:33 AM, "Jeanne Wilbrandt"
>>>>>>><j.wilbrandt at zfmk.de>
>>>>>>>wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> We are using MPI as well; each of the 20 parts gets assigned 4
>>>>>>> threads. Our admin reports, however, that the processes seem to
>>>>>>> spawn more threads than they are allowed. It is not Blast (which
>>>>>>> is set to 1 cpu in the opts.ctl). Do you have a suggestion why?
>>>>>>> 
>>>>>>> If I start the jobs in the same directory, how can I make sure
>>>>>>> they write to the same directory (as, I think, is required to
>>>>>>> put the pieces together in the end)? Does -basename take paths?
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, 6 Aug 2014 15:12:50 +0000
>>>>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>>> I think the freezing is because you are starting too many
>>>>>>>> simultaneous jobs. You should try to use MPI to parallelize
>>>>>>>> instead. The concurrent-job way of doing things can start to
>>>>>>>> cause problems if you are running 10 or more jobs in the same
>>>>>>>> directory. You could try splitting them into different
>>>>>>>> directories.
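>>>>>>>> 
>>>>>>>> For example (assuming MAKER was built with MPI support):
>>>>>>>> 
>>>>>>>> $ mpiexec -n 20 maker -base mygenome
>>>>>>>> 
>>>>>>>> instead of launching 20 separate maker processes by hand.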
>>>>>>>> 
>>>>>>>> --Carson
>>>>>>>> 
>>>>>>>> Sent from my iPhone
>>>>>>>> 
>>>>>>>>> On Aug 6, 2014, at 9:01 AM, "Jeanne Wilbrandt"
>>>>>>>>><j.wilbrandt at zfmk.de>
>>>>>>>>>wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Aha, so that explains it.
>>>>>>>>> Daniel, the average is 5930.37 bp, ranging from ~50 to more
>>>>>>>>> than 60,000, with roughly half of the sequences being shorter
>>>>>>>>> than 3,000 bp.
>>>>>>>>> 
>>>>>>>>> What do you think about this weird 'I am running but not
>>>>>>>>> really doing anything' behavior?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks a lot!
>>>>>>>>> Jeanne
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, 6 Aug 2014 14:16:52 +0000
>>>>>>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>>>>> If you are starting and restarting, or running multiple jobs,
>>>>>>>>>> then the log can be partially rebuilt. On rebuild, only the
>>>>>>>>>> FINISHED entries are added. If there is a GFF3 result file
>>>>>>>>>> for a contig, then it is FINISHED. FASTA files will only
>>>>>>>>>> exist for the contigs that have gene models, and small
>>>>>>>>>> contigs will rarely contain models.
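>>>>>>>>>> 
>>>>>>>>>> A quick sanity check is to count entries per state in the
>>>>>>>>>> index log (assuming the usual three tab-separated columns):
>>>>>>>>>> 
>>>>>>>>>> $ cut -f3 <base>_master_datastore_index.log | sort | uniq -c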
>>>>>>>>>> 
>>>>>>>>>> --Carson
>>>>>>>>>> 
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>> 
>>>>>>>>>>> On Aug 6, 2014, at 6:40 AM, "Jeanne Wilbrandt"
>>>>>>>>>>><j.wilbrandt at zfmk.de> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hi Carson,
>>>>>>>>>>> 
>>>>>>>>>>> I ran into more conspicuous behavior running maker 2.31 on
>>>>>>>>>>> a genome which is split into 20 parts, using the -g flag and
>>>>>>>>>>> the same basename.
>>>>>>>>>>> Most of the jobs ran simultaneously on the same node; 17
>>>>>>>>>>> seemed to finish normally, while the remaining three seemed
>>>>>>>>>>> to be stalled and produced 0 B of output. Do you have any
>>>>>>>>>>> suggestion why this is happening?
>>>>>>>>>>> 
>>>>>>>>>>> After I stopped these stalled jobs, I checked the index.log
>>>>>>>>>>> and found that of 38,384 mentioned scaffolds, 154 appear
>>>>>>>>>>> only once in the log. The surprise is that 2/3 of these
>>>>>>>>>>> appear only as FINISHED (the rest only started). There are
>>>>>>>>>>> no models for these 'finished' scaffolds stored in the .db,
>>>>>>>>>>> and they are distributed over all parts of the genome (i.e.,
>>>>>>>>>>> each of the 20 jobs contained scaffolds that 'did not start'
>>>>>>>>>>> but 'finished').
>>>>>>>>>>> Should this be an issue of concern?
>>>>>>>>>>> It might be an NFS lock problem, as NFS is heavily loaded,
>>>>>>>>>>> but the NFS files look good, so we suspect something fishy
>>>>>>>>>>> is going on...
>>>>>>>>>>> 
>>>>>>>>>>> Hope you can help,
>>>>>>>>>>> best wishes,
>>>>>>>>>>> Jeanne Wilbrandt
>>>>>>>>>>> 
>>>>>>>>>>> zmb // ZFMK // University of Bonn
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>> 
>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>> 
>>>>>
>>>>
>>>>
>>>
>>
>>
>






More information about the maker-devel mailing list