[maker-devel] Further split genome questions

Jeanne Wilbrandt j.wilbrandt at zfmk.de
Thu Aug 14 09:40:04 MDT 2014


Thank you so much! 

However, I'm still struggling, I'm afraid: I tried this 'two-step merging' approach with
a subset of scaffolds and got duplicate IDs.

Here is what I did (a command sketch follows the list):
- divided the input scaffolds into two files
- ran maker separately on these files (-> separate output dirs)
-- additional input: maker-generated gff3 from a previous (single) run
-- repeatmasking, snaphmm, gmhmm, augustus_species are given
-- map_forward=0 / 1 (I tried both, to the same effect)
- ran gff3_merge twice, once per output dir, using the datastore index log
- ran gff3_merge on the two resulting gff3 files
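In case it helps, a minimal sketch of the commands (directory and file
names are placeholders, not my real paths; each run dir holds its own
copies of the three ctl files):

$ (cd run1 && maker -g ../scaffolds_part1.fa -base part1 \
     maker_opts.ctl maker_bopts.ctl maker_exe.ctl)
$ (cd run2 && maker -g ../scaffolds_part2.fa -base part2 \
     maker_opts.ctl maker_bopts.ctl maker_exe.ctl)
$ gff3_merge -o part1.all.gff3 \
     -d run1/part1.maker.output/part1_master_datastore_index.log
$ gff3_merge -o part2.all.gff3 \
     -d run2/part2.maker.output/part2_master_datastore_index.log
$ gff3_merge -o merged_all.gff3 part1.all.gff3 part2.all.gff3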

$ grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort | uniq -c | sort -n | tail
      2 ID=snap_masked-scf7180005140699-processed-gene-0.19
      2 ID=snap_masked-scf7180005140699-processed-gene-0.22
      2 ID=snap_masked-scf7180005140699-processed-gene-1.36
      2 ID=snap_masked-scf7180005140713-processed-gene-0.4
      2 ID=snap_masked-scf7180005140744-processed-gene-0.4
      2 ID=snap_masked-scf7180005140744-processed-gene-0.6
      2 ID=snap_masked-scf7180005140754-processed-gene-0.14
      2 ID=snap_masked-scf7180005140754-processed-gene-0.15
      2 ID=snap_masked-scf7180005140754-processed-gene-0.19
      2 ID=snap_masked-scf7180005181475-processed-gene-0.3

$ grep snap_masked-scf7180005181475-processed-gene-0.3 merged_all.gff3 | grep "\sgene"
scf7180005181475	maker	gene	9050	9385	.	-	.	ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
scf7180005181475	maker	gene	846	1088	.	-	.	ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3

- found duplicates! i.e., the same ID on gene annotations in different regions of the same
scaffold (of 655 gene annotations, 51 appear twice; see the counting one-liners below)
-- this happens not only with gene, but also with CDS and mRNA annotations, as far as I
can see (in one example, non-overlapping but nearby CDS snippets got the same ID).
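(The counts above can be reproduced with one-liners like these; the same
pipeline with mRNA in place of gene checks the mRNA annotations:)

$ grep -cP "\tgene\t" merged_all.gff3        # total gene lines: 655
$ grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" \
    | sort | uniq -d | wc -l                 # IDs seen more than once: 51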


I suspected this might have to do with the map_forward flag, but I get the
same problem either way (with genes at the same locations).
I attached one of the ctl files in case you want to have a look; the other
is analogous (the relevant lines are excerpted below). Do you need
anything else?
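For quick reference, the relevant opts lines look like this (values are
placeholders here; the attachment has the real paths):

maker_gff=previous_run.all.gff3 #maker-derived GFF3 from the earlier single run
model_org=all #RepeatMasker repeat masking
snaphmm=snap_trained.hmm #trained SNAP HMM
gmhmm=gm_trained.mod #trained GeneMark HMM
augustus_species=my_species #trained AUGUSTUS parameter set
map_forward=0 #also tried 1; duplicate IDs appear either way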

What did I miss? This should not happen, right?




On Wed, 13 Aug 2014 15:52:34 +0000
 Carson Holt <carsonhh at gmail.com> wrote:
>Yes. One CPU will have several processes; most are helper processes that
>will use 0% CPU almost all of the time (for example, there is a shared
>variable manager process that launches with MAKER but will also be listed
>as 'maker' under top because it is technically a child process and not a
>separate script).  Also, system calls will launch a new process that will
>use all of the CPU while the calling process drops to 0% CPU until it
>finishes.
>
>Yes.  Your explanation is correct. You then use gff3_merge to merge the
>GFF3 files.
>
>--Carson
>
>
>
>On 8/13/14, 3:32 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>
>>
>>Our admin counts processes. Do I understand you correctly that one CPU
>>handles several processes?
>>
>>I'm still confused by the different directories (and I made a mistake
>>when asking last time; I wanted to say 'If I do NOT start the jobs in
>>the same directory...').
>>So, if I start each piece of a genome in its own directory (for
>>example), then it gets a unique basename (because the output will be
>>separate from all other pieces anyway), and I will not run dsindex but
>>instead use gff3_merge on each piece's output and then once more to
>>merge all the resulting gff3 files?
>>
>>Hope I got you right :)
>>
>>Thanks for your help!
>>Jeanne
>>
>>
>>
>>On Wed, 6 Aug 2014 15:45:56 +0000
>> Carson Holt <carsonhh at gmail.com> wrote:
>>>Is your admin counting processes or CPU usage?  Each system call
>>>creates a separate process, so you can expect multiple processes but
>>>only a single CPU of usage per instance.  Use different directories if
>>>you are running that many jobs.  You can concatenate the separate
>>>results when you're done.  Use the gff3_merge script to concatenate
>>>the separate GFF3 files generated from separate jobs.
>>>
>>>--Carson
>>>
>>>Sent from my iPhone
>>>
>>>> On Aug 6, 2014, at 9:33 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de>
>>>>wrote:
>>>> 
>>>> 
>>>> 
>>>> We are using MPI as well; each of the 20 parts gets assigned 4
>>>> threads. Our admin reports, however, that the processes seem to spawn
>>>> more threads than they are allowed. It is not Blast (which is set to
>>>> 1 cpu in the opts.ctl). Do you have a suggestion why?
>>>> 
>>>> If I start the jobs in the same directory, how can I make sure they
>>>> write to the same directory (as I think is required to put the pieces
>>>> together in the end)? Does -basename take paths?
>>>> 
>>>> 
>>>> On Wed, 6 Aug 2014 15:12:50 +0000
>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>> I think the freezing is because you are starting too many
>>>>> simultaneous jobs.  You should try using MPI to parallelize instead.
>>>>> The concurrent-job way of doing things can start to cause problems
>>>>> if you are running 10 or more jobs in the same directory. You could
>>>>> try splitting them into different directories.
>>>>> 
>>>>> --Carson
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On Aug 6, 2014, at 9:01 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de>
>>>>>>wrote:
>>>>>> 
>>>>>> 
>>>>>> Aha, so this explains that.
>>>>>> Daniel, the average is 5930.37 bp, ranging from ~50 to more than
>>>>>> 60,000 bp, with roughly half of the sequences shorter than 3,000 bp.
>>>>>> 
>>>>>> What do you think about this weird 'I am running but not really
>>>>>> doing anything' behavior?
>>>>>> 
>>>>>> 
>>>>>> Thanks a lot!
>>>>>> Jeanne
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, 6 Aug 2014 14:16:52 +0000
>>>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>> If you are starting and restarting, or running multiple jobs, then
>>>>>>> the log can be partially rebuilt.  On rebuild, only the FINISHED
>>>>>>> entries are added.  If there is a GFF3 result file for the contig,
>>>>>>> then it is FINISHED. FASTA files will only exist for the contigs
>>>>>>> that have gene models. Small contigs will rarely contain models.
>>>>>>> 
>>>>>>> --Carson
>>>>>>> 
>>>>>>> Sent from my iPhone
>>>>>>> 
>>>>>>>> On Aug 6, 2014, at 6:40 AM, "Jeanne Wilbrandt"
>>>>>>>><j.wilbrandt at zfmk.de> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Carson, 
>>>>>>>> 
>>>>>>>> I ran into more puzzling behavior running maker 2.31 on a genome
>>>>>>>> which is split into 20 parts, using the -g flag and the same
>>>>>>>> basename.
>>>>>>>> Most of the jobs ran simultaneously on the same node; 17 seemed to
>>>>>>>> finish normally, while the remaining three seemed to be stalled and
>>>>>>>> produced 0 B of output. Do you have any suggestion why this is
>>>>>>>> happening?
>>>>>>>> 
>>>>>>>> After I stopped these stalled jobs, I checked the index.log and
>>>>>>>> found that of the 38,384 scaffolds mentioned, 154 appear only once
>>>>>>>> in the log. The surprise is that 2/3 of these appear only as
>>>>>>>> FINISHED (the rest only as STARTED). There are no models for these
>>>>>>>> 'finished' scaffolds stored in the .db, and they are distributed
>>>>>>>> over all parts of the genome (i.e., each of the 20 jobs contained
>>>>>>>> scaffolds that 'did not start' but 'finished').
>>>>>>>> Should this be an issue of concern?
>>>>>>>> It might be an NFS lock problem, as NFS is heavily loaded, but the
>>>>>>>> NFS files look good, so we suspect something fishy is going on...
>>>>>>>> 
>>>>>>>> Hope you can help,
>>>>>>>> best wishes,
>>>>>>>> Jeanne Wilbrandt
>>>>>>>> 
>>>>>>>> zmb // ZFMK // University of Bonn
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>> 
>>
>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_Lclav_splitrun_problem_01_mapfwd.ctl
Type: application/octet-stream
Size: 5859 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20140814/58ce0be1/attachment-0003.obj>

