[maker-devel] Further split genome questions
Jeanne Wilbrandt
j.wilbrandt at zfmk.de
Thu Aug 14 09:53:38 MDT 2014
It is version 2.31.
My first try was done with map_forward=0, and (I just noticed) the duplicates
are already present in the separate gff3s in this case as well (one is attached).
Does this have something to do with the first-run gff3 I fed it?
On Thu, 14 Aug 2014 15:46:44 +0000
Carson Holt <carsonhh at gmail.com> wrote:
>What version of MAKER are you using? I'd also need to see the GFF3 files
>before the merge. You may also need to turn off map_forward since you are
>passing in GFF3 with MAKER names, creating new models with MAKER names and
>then moving names from old models forward onto new ones (which may force
>names to be used twice).
>
>--Carson
>
>
>On 8/14/14, 9:40 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>
>>
>>Thank you so much!
>>
>>However, I'm still struggling, I'm afraid: I tried this 'two-step
>>merging' approach with a subset of scaffolds and got duplicate IDs.
>>
>>Here is what I did:
>>- divided the input scaffolds into two files
>>- ran maker separately on these files (-> separate output dirs)
>>-- additional input: maker-generated gff3 from previous (singular) run
>>-- repeatmasking, snaphmm, gmhmm, augustus_species are given
>>-- map_forward=0 / 1 (I tried both, to the same effect)
>>- gff3_merge two times using index-log
>>- gff3_merge these two gff3 files
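Spelled out as commands, the merge steps above would look roughly like this (directory and file names are placeholders, assuming MAKER's default <basename>_master_datastore_index.log layout):

```shell
# Step 1: merge each run's per-contig GFF3s via its datastore index log
gff3_merge -d part1.maker.output/part1_master_datastore_index.log -o part1.gff3
gff3_merge -d part2.maker.output/part2_master_datastore_index.log -o part2.gff3

# Step 2: merge the two per-run GFF3 files into one
gff3_merge -o merged_all.gff3 part1.gff3 part2.gff3
```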
>>
>>$ grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort |
>>uniq -c | sort -n | tail
>> 2 ID=snap_masked-scf7180005140699-processed-gene-0.19
>> 2 ID=snap_masked-scf7180005140699-processed-gene-0.22
>> 2 ID=snap_masked-scf7180005140699-processed-gene-1.36
>> 2 ID=snap_masked-scf7180005140713-processed-gene-0.4
>> 2 ID=snap_masked-scf7180005140744-processed-gene-0.4
>> 2 ID=snap_masked-scf7180005140744-processed-gene-0.6
>> 2 ID=snap_masked-scf7180005140754-processed-gene-0.14
>> 2 ID=snap_masked-scf7180005140754-processed-gene-0.15
>> 2 ID=snap_masked-scf7180005140754-processed-gene-0.19
>> 2 ID=snap_masked-scf7180005181475-processed-gene-0.3
>>
>>$ grep snap_masked-scf7180005181475-processed-gene-0.3 merged_all.gff3 |
>>grep "\sgene"
>>scf7180005181475 maker gene 9050 9385 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>scf7180005181475 maker gene 846 1088 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>
>>- found duplicates! I.e., the same ID for gene annotations in different
>>areas of the same scaffold (of 655 gene annotations, 51 appear twice)
>>-- this happens not only with gene, but also with CDS and mRNA
>>annotations, as far as I can see (here, in one example, non-overlapping
>>but close CDS snippets got the same ID).
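To check whether mRNA and CDS IDs are duplicated in the same way as the gene IDs, the grep pipeline above can be looped over feature types. A minimal, self-contained sketch (sample.gff3 is a tiny made-up stand-in for merged_all.gff3; GNU grep's -P is assumed):

```shell
# Build a tiny fake GFF3 with one deliberately duplicated gene ID
printf 'scf1\tmaker\tgene\t100\t500\t.\t+\t.\tID=g1;Name=g1\n'   >  sample.gff3
printf 'scf1\tmaker\tmRNA\t100\t500\t.\t+\t.\tID=m1;Parent=g1\n' >> sample.gff3
printf 'scf1\tmaker\tgene\t900\t1300\t.\t+\t.\tID=g1;Name=g1\n'  >> sample.gff3

# For each feature type, count IDs that occur more than once (uniq -d)
for type in gene mRNA CDS; do
  dups=$(grep -P "\t${type}\t" sample.gff3 | cut -f9 | cut -f1 -d ';' \
         | sort | uniq -d | wc -l)
  echo "${type}: ${dups} duplicated ID(s)"
done
```

On this sample it reports one duplicated gene ID and none for mRNA or CDS.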
>>
>>
>>I suspected this might have to do with the map_forward flag, but I get
>>the same problem again (with genes at the same locations).
>>I attached one of the ctl files for you in case you want to have a look;
>>the other is analogous. Do you need anything else?
>>
>>What did I miss? This should not happen, right?
>>
>>
>>
>>
>>On Wed, 13 Aug 2014 15:52:34 +0000
>> Carson Holt <carsonhh at gmail.com> wrote:
>>>Yes. One CPU will have several processes; most are helper processes that
>>>will use 0% CPU almost all of the time (for example, there is a shared
>>>variable manager process that will launch with MAKER but will also be
>>>called 'maker' under top because it is technically its child and not a
>>>separate script). Also, system calls will launch a new process that will
>>>use all the CPU while the calling process drops to 0% CPU until it
>>>finishes.
>>>
>>>Yes. Your explanation is correct. You then use gff3_merge to merge the
>>>GFF3 file.
>>>
>>>--Carson
>>>
>>>
>>>
>>>On 8/13/14, 3:32 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>>>
>>>>
>>>>Our admin counts processes. Do I understand you right that one CPU
>>>>handles several processes?
>>>>
>>>>I'm still confused by the different directories (and I made a mistake
>>>>when asking last time; I wanted to say 'If I do NOT start the jobs in
>>>>the same directory...').
>>>>So, if I start each piece of a genome in its own directory (for
>>>>example), then it gets a unique basename (because the output will be
>>>>separate from all other pieces anyway), and I will not run dsindex but
>>>>instead use gff3_merge for each piece's output and then once again to
>>>>merge all resulting gff3 files?
>>>>
>>>>Hope I got you right :)
>>>>
>>>>Thanks for your help!
>>>>Jeanne
>>>>
>>>>
>>>>
>>>>On Wed, 6 Aug 2014 15:45:56 +0000
>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>Is your admin counting processes or CPU usage? Each system call creates
>>>>>a separate process, so you can expect multiple processes, but only a
>>>>>single CPU of usage per instance. Use different directories if you are
>>>>>running that many jobs. You can concatenate the separate results when
>>>>>you're done. Use the gff3_merge script to help concatenate the separate
>>>>>GFF3 files generated from separate jobs.
>>>>>
>>>>>--Carson
>>>>>
>>>>>Sent from my iPhone
>>>>>
>>>>>> On Aug 6, 2014, at 9:33 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de>
>>>>>>wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> We are using MPI as well; each of the 20 parts gets assigned 4
>>>>>>threads. Our admin reports, however, that the processes seem to spawn
>>>>>>more threads than they are allowed. It is not BLAST (which is set to 1
>>>>>>cpu in the opts.ctl). Do you have a suggestion why?
>>>>>>
>>>>>> If I start the jobs in the same directory, how can I make sure they
>>>>>>write to the same directory (as, I think, is required to put the
>>>>>>pieces together in the end)? Does -basename take paths?
>>>>>>
>>>>>>
>>>>>> On Wed, 6 Aug 2014 15:12:50 +0000
>>>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>> I think the freezing is because you are starting too many
>>>>>>>simultaneous jobs. You should try to use MPI to parallelize instead.
>>>>>>>The concurrent-job way of doing things can start to cause problems if
>>>>>>>you are running 10 or more jobs in the same directory. You could try
>>>>>>>splitting them into different directories.
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Aug 6, 2014, at 9:01 AM, "Jeanne Wilbrandt"
>>>>>>>><j.wilbrandt at zfmk.de>
>>>>>>>>wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> aha, so this explains that.
>>>>>>>> Daniel, the average is 5930.37 bp, but ranging from ~50 to more
>>>>>>>>than 60,000, with roughly half of the sequences being shorter than
>>>>>>>>3,000 bp.
>>>>>>>>
>>>>>>>> What do you think about this weird 'I am running but not really
>>>>>>>>doing anything' behavior?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks a lot!
>>>>>>>> Jeanne
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 6 Aug 2014 14:16:52 +0000
>>>>>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>>>> If you are starting and restarting, or running multiple jobs, then
>>>>>>>>>the log can be partially rebuilt. On rebuild, only the FINISHED
>>>>>>>>>entries are added. If there is a GFF3 result file for the contig,
>>>>>>>>>then it is FINISHED. FASTA files will only exist for the contigs
>>>>>>>>>that have gene models. Small contigs will rarely contain models.
>>>>>>>>>
>>>>>>>>> --Carson
>>>>>>>>>
>>>>>>>>> Sent from my iPhone
>>>>>>>>>
>>>>>>>>>> On Aug 6, 2014, at 6:40 AM, "Jeanne Wilbrandt"
>>>>>>>>>><j.wilbrandt at zfmk.de> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Carson,
>>>>>>>>>>
>>>>>>>>>> I ran into more conspicuous behavior running maker 2.31 on a
>>>>>>>>>>genome which is split into 20 parts, using the -g flag and the
>>>>>>>>>>same basename.
>>>>>>>>>> Most of the jobs ran simultaneously on the same node; 17 seemed
>>>>>>>>>>to finish normally, while the remaining three seemed to be stalled
>>>>>>>>>>and produced 0 B of output. Do you have any suggestion why this is
>>>>>>>>>>happening?
>>>>>>>>>>
>>>>>>>>>> After I stopped these stalled jobs, I checked the index.log and
>>>>>>>>>>found that of 38,384 mentioned scaffolds, 154 appear only once in
>>>>>>>>>>the log. The surprise is that 2/3 of these only appear as FINISHED
>>>>>>>>>>(the rest only started). There are no models for these 'finished'
>>>>>>>>>>scaffolds stored in the .db, and they are distributed over all
>>>>>>>>>>parts of the genome (i.e., each of the 20 jobs contained scaffolds
>>>>>>>>>>that 'did not start' but 'finished').
>>>>>>>>>> Should this be an issue of concern?
>>>>>>>>>> It might be an NFS lock problem, as NFS is heavily loaded, but
>>>>>>>>>>the NFS files look good, so we suspect something fishy going on...
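The index.log check described above can be sketched as a small pipeline. This is an illustration against a made-up log; the three-column 'seq_id <tab> datastore_dir <tab> STATUS' layout is an assumption about the master_datastore_index.log format:

```shell
# Fake index.log: scfA started and finished, scfB only FINISHED, scfC only STARTED
printf 'scfA\tdirA\tSTARTED\nscfA\tdirA\tFINISHED\n' >  index.log
printf 'scfB\tdirB\tFINISHED\n'                      >> index.log
printf 'scfC\tdirC\tSTARTED\n'                       >> index.log

# Scaffolds that appear only once in the log (uniq -u keeps singletons)
cut -f1 index.log | sort | uniq -u > once.txt

# Scaffolds with a FINISHED entry
awk -F'\t' '$3 == "FINISHED" {print $1}' index.log | sort -u > finished.txt

# Scaffolds logged exactly once, and that once as FINISHED (the suspicious case)
comm -12 once.txt finished.txt
```

On this toy log the last command prints only scfB, the scaffold that 'finished' without ever starting.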
>>>>>>>>>>
>>>>>>>>>> Hope you can help,
>>>>>>>>>> best wishes,
>>>>>>>>>> Jeanne Wilbrandt
>>>>>>>>>>
>>>>>>>>>> zmb // ZFMK // University of Bonn
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> maker-devel mailing list
>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>
>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>
>>>>
>>>
>>>
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: splitrun_problem_01_all.gff3
Type: application/octet-stream
Size: 4967463 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20140814/cd152a47/attachment-0002.obj>
More information about the maker-devel mailing list