[maker-devel] Further split genome questions
Jeanne Wilbrandt
j.wilbrandt at zfmk.de
Thu Aug 14 09:53:38 MDT 2014
It is version 2.31.
My first try was done with map_forward=0, and (I just noticed) the duplicates
are already present in the separate gff3s in this case as well (one is attached).
Does this have something to do with the first-run gff3 I fed it?
On Thu, 14 Aug 2014 15:46:44 +0000
Carson Holt <carsonhh at gmail.com> wrote:
>What version of MAKER are you using? I'd also need to see the GFF3 files
>before the merge. You may also need to turn off map_forward since you are
>passing in GFF3 with MAKER names, creating new models with MAKER names and
>then moving names from old models forward onto new ones (which may force
>names to be used twice).
>
>--Carson
>
>
>On 8/14/14, 9:40 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>
>>
>>Thank you so much!
>>
>>However, I'm still struggling, I'm afraid: I tried this 'two-step
>>merging' approach with a subset of scaffolds and got duplicate IDs.
>>
>>Here is what I did:
>>- divided the input scaffolds into two files
>>- ran maker separately on these files (-> separate output dirs)
>>-- additional input: maker-generated gff3 from previous (singular) run
>>-- repeatmasking, snaphmm, gmhmm, augustus_species are given
>>-- map_forward=0 / 1 (I tried both, to the same effect)
>>- gff3_merge two times using index-log
>>- gff3_merge these two gff3 files
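Spelled out as commands, the merge steps above would look roughly like this (directory and file names are placeholders, assuming MAKER's default <basename>_master_datastore_index.log layout):

```shell
# Step 1: merge each run's per-contig GFF3s via its datastore index log
gff3_merge -d part1.maker.output/part1_master_datastore_index.log -o part1.gff3
gff3_merge -d part2.maker.output/part2_master_datastore_index.log -o part2.gff3

# Step 2: merge the two per-run GFF3 files into one
gff3_merge -o merged_all.gff3 part1.gff3 part2.gff3
```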
>>
>>$ grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort |
>>uniq -c | sort -n | tail
>> 2 ID=snap_masked-scf7180005140699-processed-gene-0.19
>> 2 ID=snap_masked-scf7180005140699-processed-gene-0.22
>> 2 ID=snap_masked-scf7180005140699-processed-gene-1.36
>> 2 ID=snap_masked-scf7180005140713-processed-gene-0.4
>> 2 ID=snap_masked-scf7180005140744-processed-gene-0.4
>> 2 ID=snap_masked-scf7180005140744-processed-gene-0.6
>> 2 ID=snap_masked-scf7180005140754-processed-gene-0.14
>> 2 ID=snap_masked-scf7180005140754-processed-gene-0.15
>> 2 ID=snap_masked-scf7180005140754-processed-gene-0.19
>> 2 ID=snap_masked-scf7180005181475-processed-gene-0.3
>>
>>$ grep snap_masked-scf7180005181475-processed-gene-0.3 merged_all.gff3 |
>>grep "\sgene"
>>scf7180005181475 maker gene 9050 9385 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>scf7180005181475 maker gene 846 1088 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>
>>- found duplicates! I.e., the same ID for gene annotations in different
>>areas of the same scaffold (of 655 gene annotations, 51 appear twice)
>>-- this happens not only with gene, but also with CDS and mRNA
>>annotations, as far as I can see (here, in one example, non-overlapping
>>but close CDS snippets got the same ID).
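To check whether mRNA and CDS IDs are duplicated in the same way as the gene IDs, the grep pipeline above can be looped over feature types. A minimal, self-contained sketch (sample.gff3 is a tiny made-up stand-in for merged_all.gff3; GNU grep's -P is assumed):

```shell
# Build a tiny fake GFF3 with one deliberately duplicated gene ID
printf 'scf1\tmaker\tgene\t100\t500\t.\t+\t.\tID=g1;Name=g1\n'   >  sample.gff3
printf 'scf1\tmaker\tmRNA\t100\t500\t.\t+\t.\tID=m1;Parent=g1\n' >> sample.gff3
printf 'scf1\tmaker\tgene\t900\t1300\t.\t+\t.\tID=g1;Name=g1\n'  >> sample.gff3

# For each feature type, count IDs that occur more than once (uniq -d)
for type in gene mRNA CDS; do
  dups=$(grep -P "\t${type}\t" sample.gff3 | cut -f9 | cut -f1 -d ';' \
         | sort | uniq -d | wc -l)
  echo "${type}: ${dups} duplicated ID(s)"
done
```

On this sample it reports one duplicated gene ID and none for mRNA or CDS.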
>>
>>
>>I suspected this might have to do with the map_forward flag, but I get
>>the same problem again (with genes at the same locations).
>>I attached one of the ctl files for you in case you want to have a look;
>>the other is analogous. Do you need anything else?
>>
>>What did I miss? This should not happen, right?
>>
>>
>>
>>
>>On Wed, 13 Aug 2014 15:52:34 +0000
>> Carson Holt <carsonhh at gmail.com> wrote:
>>>Yes. One CPU will have several processes; most are helper processes that
>>>will use 0% CPU almost all of the time (for example, there is a shared
>>>variable manager process that will launch with MAKER but will also be
>>>called 'maker' under top because it is technically its child and not a
>>>separate script). Also, system calls will launch a new process that will
>>>use all the CPU while the calling process drops to 0% CPU until it
>>>finishes.
>>>
>>>Yes. Your explanation is correct. You then use gff3_merge to merge the
>>>GFF3 file.
>>>
>>>--Carson
>>>
>>>
>>>
>>>On 8/13/14, 3:32 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>>>
>>>>
>>>>Our admin counts processes. Do I understand you right that one CPU
>>>>handles several processes?
>>>>
>>>>I'm still confused by the different directories (and I made a mistake
>>>>when asking last time; I wanted to say 'If I do NOT start the jobs in
>>>>the same directory...').
>>>>So, if I start each piece of a genome in its own directory (for
>>>>example), then it gets a unique basename (because the output will be
>>>>separate from all other pieces anyway), and I will not run dsindex but
>>>>instead use gff3_merge for each piece's output and then once again to
>>>>merge all resulting gff3 files?
>>>>
>>>>Hope I got you right :)
>>>>
>>>>Thanks for your help!
>>>>Jeanne
>>>>
>>>>
>>>>
>>>>On Wed, 6 Aug 2014 15:45:56 +0000
>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>Is your admin counting processes or CPU usage? Each system call creates
>>>>>a separate process, so you can expect multiple processes, but only a
>>>>>single CPU of usage per instance. Use different directories if you are
>>>>>running that many jobs. You can concatenate the separate results when
>>>>>you're done. Use the gff3_merge script to help concatenate the separate
>>>>>GFF3 files generated from separate jobs.
>>>>>
>>>>>--Carson
>>>>>
>>>>>Sent from my iPhone
>>>>>
>>>>>> On Aug 6, 2014, at 9:33 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de>
>>>>>>wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> We are using MPI as well; each of the 20 parts gets assigned 4
>>>>>>threads. Our admin reports, however, that the processes seem to spawn
>>>>>>more threads than they are allowed. It is not BLAST (which is set to 1
>>>>>>cpu in the opts.ctl). Do you have a suggestion why?
>>>>>>
>>>>>> If I start the jobs in the same directory, how can I make sure they
>>>>>>write to the same directory (as, I think, is required to put the
>>>>>>pieces together in the end)? Does -basename take paths?
>>>>>>
>>>>>>
>>>>>> On Wed, 6 Aug 2014 15:12:50 +0000
>>>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>> I think the freezing is because you are starting too many
>>>>>>>simultaneous jobs. You should try to use MPI to parallelize instead.
>>>>>>>The concurrent-job way of doing things can start to cause problems if
>>>>>>>you are running 10 or more jobs in the same directory. You could try
>>>>>>>splitting them into different directories.
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Aug 6, 2014, at 9:01 AM, "Jeanne Wilbrandt"
>>>>>>>><j.wilbrandt at zfmk.de>
>>>>>>>>wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> aha, so this explains that.
>>>>>>>> Daniel, the average is 5930.37 bp, but ranging from ~50 to more
>>>>>>>>than 60,000, with roughly half of the sequences being shorter than
>>>>>>>>3,000 bp.
>>>>>>>>
>>>>>>>> What do you think about this weird 'I am running but not really
>>>>>>>>doing anything' behavior?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks a lot!
>>>>>>>> Jeanne
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 6 Aug 2014 14:16:52 +0000
>>>>>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>>>> If you are starting and restarting, or running multiple jobs, then
>>>>>>>>>the log can be partially rebuilt. On rebuild, only the FINISHED
>>>>>>>>>entries are added. If there is a GFF3 result file for the contig,
>>>>>>>>>then it is FINISHED. FASTA files will only exist for the contigs
>>>>>>>>>that have gene models. Small contigs will rarely contain models.
>>>>>>>>>
>>>>>>>>> --Carson
>>>>>>>>>
>>>>>>>>> Sent from my iPhone
>>>>>>>>>
>>>>>>>>>> On Aug 6, 2014, at 6:40 AM, "Jeanne Wilbrandt"
>>>>>>>>>><j.wilbrandt at zfmk.de> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Carson,
>>>>>>>>>>
>>>>>>>>>> I ran into more conspicuous behavior running maker 2.31 on a
>>>>>>>>>>genome which is split into 20 parts, using the -g flag and the
>>>>>>>>>>same basename.
>>>>>>>>>> Most of the jobs ran simultaneously on the same node; 17 seemed
>>>>>>>>>>to finish normally, while the remaining three seemed to be stalled
>>>>>>>>>>and produced 0 B of output. Do you have any suggestion why this is
>>>>>>>>>>happening?
>>>>>>>>>>
>>>>>>>>>> After I stopped these stalled jobs, I checked the index.log and
>>>>>>>>>>found that of 38,384 mentioned scaffolds, 154 appear only once in
>>>>>>>>>>the log. The surprise is that 2/3 of these only appear as FINISHED
>>>>>>>>>>(the rest only started). There are no models for these 'finished'
>>>>>>>>>>scaffolds stored in the .db, and they are distributed over all
>>>>>>>>>>parts of the genome (i.e., each of the 20 jobs contained scaffolds
>>>>>>>>>>that 'did not start' but 'finished').
>>>>>>>>>> Should this be an issue of concern?
>>>>>>>>>> It might be an NFS lock problem, as NFS is heavily loaded, but
>>>>>>>>>>the NFS files look good, so we suspect something fishy going on...
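The index.log check described above can be sketched as a small pipeline. This is an illustration against a made-up log; the three-column 'seq_id <tab> datastore_dir <tab> STATUS' layout is an assumption about the master_datastore_index.log format:

```shell
# Fake index.log: scfA started and finished, scfB only FINISHED, scfC only STARTED
printf 'scfA\tdirA\tSTARTED\nscfA\tdirA\tFINISHED\n' >  index.log
printf 'scfB\tdirB\tFINISHED\n'                      >> index.log
printf 'scfC\tdirC\tSTARTED\n'                       >> index.log

# Scaffolds that appear only once in the log (uniq -u keeps singletons)
cut -f1 index.log | sort | uniq -u > once.txt

# Scaffolds with a FINISHED entry
awk -F'\t' '$3 == "FINISHED" {print $1}' index.log | sort -u > finished.txt

# Scaffolds logged exactly once, and that once as FINISHED (the suspicious case)
comm -12 once.txt finished.txt
```

On this toy log the last command prints only scfB, the scaffold that 'finished' without ever starting.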
>>>>>>>>>>
>>>>>>>>>> Hope you can help,
>>>>>>>>>> best wishes,
>>>>>>>>>> Jeanne Wilbrandt
>>>>>>>>>>
>>>>>>>>>> zmb // ZFMK // University of Bonn
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> maker-devel mailing list
>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>
>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>
>>>>
>>>
>>>
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: splitrun_problem_01_all.gff3
Type: application/octet-stream
Size: 4967463 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20140814/cd152a47/attachment-0002.obj>
More information about the maker-devel mailing list