[maker-devel] Further split genome questions
Daniel Ence
dence at genetics.utah.edu
Wed Aug 13 09:46:59 MDT 2014
Hi Jeanne,
I believe that's right. You can pass gff3_merge either a list of GFF3 files or a MAKER-created datastore index file. To compile the pieces of each of your different runs, give gff3_merge that run's datastore index file. Then, to put the resulting GFF3 files together, pass gff3_merge the list of GFF3 files you want to merge.
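For example, the two-step merge might look something like this (the directory names and basenames below are illustrative, not taken from your setup; adjust the paths to match your own layout):

```shell
# Step 1: for each run, merge that run's results from its datastore index.
# Assumes each piece ran in its own directory, part_01 ... part_20, each
# with MAKER's usual <basename>.maker.output layout.
for d in part_*; do
    gff3_merge -d "$d"/*.maker.output/*_master_datastore_index.log \
               -o "$d/$d.gff"
done

# Step 2: merge the per-piece GFF3 files into one combined annotation.
gff3_merge -o all_parts.gff part_*/part_*.gff
```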
~Daniel
Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
________________________________________
From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of Jeanne Wilbrandt [j.wilbrandt at zfmk.de]
Sent: Wednesday, August 13, 2014 3:32 AM
To: Carson Holt; Wilbrandt Jeanne
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Further split genome questions
Our admin counts processes. Do I understand you right that one CPU handles several
processes?
I'm still confused by the different directories (and I made a mistake when asking last
time; I meant to say 'If I do NOT start the jobs in the same directory...').
So, if I start each piece of the genome in its own directory (for example), then it
gets a unique basename (because the output will be separate from all other pieces
anyway), and I will not run dsindex but instead use gff3_merge on each piece's
output, and then once again to merge all resulting GFF3 files?
Hope I got you right :)
Thanks for your help!
Jeanne
On Wed, 6 Aug 2014 15:45:56 +0000
Carson Holt <carsonhh at gmail.com> wrote:
>Is your admin counting processes or CPU usage? Each system call creates a separate
>process, so you can expect multiple processes, but only a single CPU of usage per
>instance. Use different directories if you are running that many jobs. You can
>concatenate the separate results when you're done. Use the gff3_merge script to help
>concatenate the separate GFF3 files generated from separate jobs.
>
>--Carson
>
>Sent from my iPhone
>
>> On Aug 6, 2014, at 9:33 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>>
>>
>>
>> We are using MPI as well; each of the 20 parts gets assigned 4 threads. Our admin
>> reports, however, that the processes seem to spawn more threads than they are
>> allowed. It is not BLAST (which is set to 1 CPU in the opts.ctl). Do you have a
>> suggestion why?
>>
>> If I start the jobs in the same directory, how can I make sure they write to the
>> same directory (as I think is required to put the pieces together in the end)?
>> Does -basename take paths?
>>
>>
>> On Wed, 6 Aug 2014 15:12:50 +0000
>> Carson Holt <carsonhh at gmail.com> wrote:
>>> I think the freezing is because you are starting too many simultaneous jobs. You
>>> should try to use MPI to parallelize instead. The concurrent-job way of doing
>>> things can start to cause problems if you are running 10 or more jobs in the same
>>> directory. You could try splitting them into different directories.
>>>
>>> --Carson
>>>
>>> Sent from my iPhone
>>>
>>>> On Aug 6, 2014, at 9:01 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>>>>
>>>>
>>>> Aha, so this explains that.
>>>> Daniel, the average is 5930.37 bp, but ranging from ~50 to more than 60,000 bp,
>>>> roughly half of the sequences being shorter than 3,000 bp.
>>>>
>>>> What do you think about this weird 'I am running but not really doing
>>>> anything' behavior?
>>>>
>>>>
>>>> Thanks a lot!
>>>> Jeanne
>>>>
>>>>
>>>>
>>>> On Wed, 6 Aug 2014 14:16:52 +0000
>>>> Carson Holt <carsonhh at gmail.com> wrote:
>>>>> If you are starting and restarting, or running multiple jobs, then the log can be
>>>>> partially rebuilt. On rebuild, only the FINISHED entries are added. If there is a
>>>>> GFF3 result file for the contig, then it is FINISHED. FASTA files will only exist
>>>>> for the contigs that have gene models. Small contigs will rarely contain models.
>>>>>
>>>>> --Carson
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On Aug 6, 2014, at 6:40 AM, "Jeanne Wilbrandt" <j.wilbrandt at zfmk.de> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Carson,
>>>>>>
>>>>>> I ran into more puzzling behavior running MAKER 2.31 on a genome which is split
>>>>>> into 20 parts, using the -g flag and the same basename.
>>>>>> Most of the jobs ran simultaneously on the same node; 17 seemed to finish
>>>>>> normally, while the remaining three seemed to be stalled and produced 0 B of
>>>>>> output. Do you have any suggestion why this is happening?
>>>>>>
>>>>>> After I stopped these stalled jobs, I checked the index.log and found that of
>>>>>> 38,384 mentioned scaffolds, 154 appear only once in the log. The surprise is
>>>>>> that 2/3 of these appear only as FINISHED (the rest only started). There are no
>>>>>> models for these 'finished' scaffolds stored in the .db, and they are
>>>>>> distributed over all parts of the genome (i.e., each of the 20 jobs contained
>>>>>> scaffolds that 'did not start' but 'finished').
>>>>>> Should this be an issue of concern?
>>>>>> It might be an NFS lock problem, as NFS is heavily loaded, but the NFS files
>>>>>> look good, so we suspect something fishy going on...
>>>>>>
>>>>>> Hope you can help,
>>>>>> best wishes,
>>>>>> Jeanne Wilbrandt
>>>>>>
>>>>>> zmb // ZFMK // University of Bonn
>>>>>>
>>>>>>
>>>>>>
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org