[maker-devel] maker-devel Digest, Vol 74, Issue 17

Carson Holt carsonhh at gmail.com
Fri Sep 5 09:37:02 MDT 2014


The partial lines are symptoms of writing data to a slow NFS-mounted
drive.  If NFS can't get a response for a write operation, it returns
success (even though the write hasn't actually succeeded) and then
continues to wait for the operation to really complete.  This is called
asynchronous writing.  It improves performance by optimistically returning
success on all operations rather than waiting to see whether each one
really succeeded.  If you have a slow or overloaded NFS mount, though, you
can get a number of failures with no indication that they failed, except
for the fact that some files are missing content or have partial lines.
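
To illustrate the asynchronous-write point in general terms, here is a
minimal Python sketch. It is not how MAKER itself writes files, and the
function name write_durably is made up for this example. The idea is that
a program can ask the operating system to flush a file before trusting it;
os.fsync() raises OSError if the flush reports a failure. Whether a given
NFS mount reports deferred errors at that point depends on the mount
options and the server, and the read-back may be served from a client
cache, so treat it as a cheap sanity check rather than a guarantee.

```python
import os
import tempfile


def write_durably(path, text):
    """Write text to path and return only after the OS reports it flushed.

    Raises OSError if the write, flush, or fsync step reports a failure,
    or if the content read back differs from what was written.
    """
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
        handle.flush()             # push Python's buffer to the OS
        os.fsync(handle.fileno())  # ask the OS to commit to storage
    # Read back as a cheap sanity check against truncated content.
    with open(path, "r", encoding="utf-8") as handle:
        if handle.read() != text:
            raise OSError(f"content mismatch after writing {path}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.gff")
        write_durably(target, "##gff-version 3\n")
        print("written and verified:", target)
```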

When this happens, you need to run MAKER with the -a flag on fewer CPUs to
rebuild the GFF3 files. Fewer CPUs reduces the IO burden.  Or if you can
find which contigs have partial GFF3 lines, you can delete just those
along with the datastore index log file and then launch maker without any
flags to let it recompute just those contigs.
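
One way to find which contigs have partial GFF3 lines is to scan the
merged file for feature lines that do not have nine tab-separated columns.
The Python sketch below does that. The function name malformed_contigs is
hypothetical, and attributing a damaged line to the most recent intact
contig name is only a guess that assumes lines for one contig are grouped
together in the file.

```python
import sys
from collections import Counter


def malformed_contigs(lines):
    """Count suspicious GFF3 feature lines per contig.

    A feature line is treated as suspicious when it does not have exactly
    nine tab-separated columns or its start/end columns are not integers.
    A damaged line is attributed to the contig of the last intact line
    seen before it, or to "<unknown>" if there was none.
    """
    bad = Counter()
    last_contig = "<unknown>"
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # blank lines, pragmas, and ### separators
        if line.startswith(">"):
            break  # start of an embedded FASTA section
        cols = line.split("\t")
        ok = len(cols) == 9 and cols[3].isdigit() and cols[4].isdigit()
        if ok:
            last_contig = cols[0]
        else:
            bad[last_contig] += 1
    return bad


if __name__ == "__main__":
    # Usage: python check_gff3.py maker.gff3
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for contig, count in malformed_contigs(handle).most_common():
            print(f"{contig}\t{count}")
```

The contigs it prints are candidates to inspect by hand before deleting
anything from the datastore.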

Another possible cause is also NFS related.  If you are running MAKER
multiple times in the same working directory, and a slow NFS mount doesn't
allow maker to properly lock files, then two maker jobs can try to
compute the same contig simultaneously.  Simultaneous writing of files can
then cause IDs to be duplicated and some lines to be munged, as lines from
one process arrive in the file in the middle of lines from another process
(creating a jumble of characters and partial lines).  If this is the case,
start a single maker job on fewer CPUs using the -a flag to rebuild the
GFF3 files.

Repeated gene/mRNA IDs can also be caused by gff3_passthrough when you are
passing in GFF3 files with already assigned IDs (that may be used
elsewhere).  Are you using GFF3 passthrough?

Features that will not have unique ID= tags are CDS, three_prime_UTR, and
five_prime_UTR features (these are considered discontinuous features, so
the lines that make up one feature share a single ID).
You can see examples here --> http://www.sequenceontology.org/gff3.shtml

Also Name= attributes are not required to be unique.
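
To check whether gene/mRNA IDs really are repeated, something like the
following Python sketch can list IDs that occur on more than one line while
skipping the feature types that are allowed to share an ID. The function
name duplicated_ids and the constant SHARED_ID_TYPES are made up for this
example; it only inspects ID=, not Name=.

```python
import sys
from collections import defaultdict

# Feature types whose lines may legitimately share one ID (compared
# case-insensitively, since files differ in how they spell the UTR types).
SHARED_ID_TYPES = {"cds", "three_prime_utr", "five_prime_utr"}


def duplicated_ids(lines):
    """Return {ID: [line numbers]} for IDs used on more than one line,
    ignoring feature types that are allowed to repeat an ID."""
    seen = defaultdict(list)
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # blank lines, pragmas, and ### separators
        if line.startswith(">"):
            break  # start of an embedded FASTA section
        cols = line.split("\t")
        if len(cols) != 9 or cols[2].lower() in SHARED_ID_TYPES:
            continue  # malformed lines are skipped here, not reported
        for field in cols[8].split(";"):
            if field.startswith("ID="):
                seen[field[3:]].append(number)
                break
    return {fid: nums for fid, nums in seen.items() if len(nums) > 1}


if __name__ == "__main__":
    # Usage: python dup_ids.py maker.gff3
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for fid, nums in duplicated_ids(handle).items():
            print(f"{fid}\tlines {', '.join(map(str, nums))}")
```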

Thanks,
Carson






On 9/5/14, 8:43 AM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]"
<nguyenan at mail.nih.gov> wrote:

>Hi,
>
>I finished running MAKER as suggested above.
>Then I ran gff3_merge.pl to retrieve only MAKER annotation using -n -g
>options. I called the output file maker.gff3
> 
>In maker.gff3 I found some invalid data (it does not conform to the GFF3
>format), e.g.
>
>###
>2 +
>###
>
>OR
>
>###
>.Contig1:hsp:72378:1.3.0.0;Parent=c209800247.Contig1:hit:30214:1.3.0.0;Target=species:tRNA-Asn-AAC|genus:tRNA 1 75 +
>###
>
>OR some gene (or mRNA) IDs are not unique. This means they can be found
>multiple times with different values within maker.gff3.
>
>How could this happen? As I understand it, mRNA IDs in a .gff3 file must
>be unique.
>
>Thanks
>Anh-Dao
> 
>
> 
>
>
>On 7/18/14 2:00 PM, "maker-devel-request at yandell-lab.org"
><maker-devel-request at yandell-lab.org> wrote:
>
>>
>>Message: 1
>>Date: Fri, 18 Jul 2014 11:04:09 -0600
>>From: Carson Holt <carsonhh at gmail.com>
>>To: "Nguyen, Anh-Dao (NIH/NHGRI) [C]" <nguyenan at mail.nih.gov>,	Daniel
>>	Ence <dence at genetics.utah.edu>
>>Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>>Subject: Re: [maker-devel] Maker_opts.ctl
>>Message-ID: <CFEEAF84.DCAF%carsonhh at gmail.com>
>>Content-Type: text/plain;	charset="UTF-8"
>>
>>It should just be 'fgenesh'.  If it's not there you can still just give
>>the GFF3.
>>
>>--Carson
>>
>>
>>On 7/17/14, 8:19 AM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]"
>><nguyenan at mail.nih.gov> wrote:
>>
>>>I am not sure which fgenesh executable file I should use.
>>>
>>>fgenesh= #location of fgenesh executable
>>>
>>>When I run FGENESH++, I need to run the run_pipe.pl script. Of course,
>>>you need to specify a list of other executable programs (such as ppd,
>>>ppdn+, etc.)
>>>
>>>Anh-Dao
>>>
>>>
>>>On 7/16/14 3:32 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>>>
>>>>'all' will use the whole of RepBase, or you can do 'metazoa' like your
>>>>previous run.  Then provide the RepeatModeler file to rmlib=
>>>>
>>>>--Carson
>>>>
>>>>
>>>>
>>>>On 7/16/14, 1:28 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]"
>>>><nguyenan at mail.nih.gov> wrote:
>>>>
>>>>>By default, model_org=all. Can I use the de novo repeat library
>>>>>predicted by RepeatModeler for the rmlib option?
>>>>>
>>>>>Anh-Dao
>>>>>
>>>>>
>>>>>
>>>>>On 7/16/14 3:17 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>>>>>
>>>>>>No.  You can provide both to MAKER. The options are model_org= and
>>>>>>rmlib=.  If you let MAKER handle repeat masking, it will
>>>>>>differentiate repeat types and use soft masking for some and hard
>>>>>>masking for others.  This increases sensitivity of evidence
>>>>>>alignments while still maintaining specificity.
>>>>>>
>>>>>>--Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>>On 7/16/14, 1:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]"
>>>>>><nguyenan at mail.nih.gov> wrote:
>>>>>>
>>>>>>>I will run Augustus and FGENESH++ inside of MAKER using the
>>>>>>>parameter files for Augustus.
>>>>>>>I could also run RepeatMasker inside of MAKER. However, I ran RM
>>>>>>>using two options: -lib (de novo) and -species (known). I got ~45%
>>>>>>>repeats via de novo and ~4% repeats via known options. As I
>>>>>>>understood, RM inside of MAKER uses only the RepBase repeat library
>>>>>>>and the RepeatRunner protein database.
>>>>>>>
>>>>>>>Anh-Dao
>>>>>>>
>>>>>>>
>>>>>>>On 7/16/14 2:36 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>>>>>>>
>>>>>>>>When you ran Augustus separately, it should have created the
>>>>>>>>parameters needed to run it.  Now you should be able to run it
>>>>>>>>inside of MAKER using the species name you just created.
>>>>>>>>
>>>>>>>>I'd also recommend letting MAKER run RepeatMasker for you rather
>>>>>>>>than giving it the results as GFF3.
>>>>>>>>
>>>>>>>>--Carson
>>>>>>>>
>>>>>>>>
>>>>>>>>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]"
>>>>>>>><nguyenan at mail.nih.gov> wrote:
>>>>>>>>
>>>>>>>>>Thanks Daniel for your quick response.
>>>>>>>>>
>>>>>>>>>I did not use the parameter file of another organism when running
>>>>>>>>>Augustus. I created the parameter file for the genome following
>>>>>>>>>their instructions.
>>>>>>>>>There were multiple steps to train and run Augustus (Creating gene
>>>>>>>>>structures for training AUGUSTUS with CEGMA => parameter file will
>>>>>>>>>be created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>>>>>>>>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>>>>>>>>>As I mentioned, the reason I ran Augustus separately is that
>>>>>>>>>Augustus had not been trained for that genome (no parameter file
>>>>>>>>>existed). Otherwise I would have run Augustus inside MAKER.
>>>>>>>>> 
>>>>>>>>>You suggested using the rm_gff option to specify RepeatMasker
>>>>>>>>>output (of course, I will convert it to .gff3 formatted files).
>>>>>>>>>Can I submit two RM .gff3 files, separated by a comma?
>>>>>>>>>
>>>>>>>>>Anh-Dao
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>On 7/16/14 2:13 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>>>>>>>>
>>>>>>>>>>Hi Anh-Dao,
>>>>>>>>>>
>>>>>>>>>>In the maker_opts.ctl file, there are options for est and protein
>>>>>>>>>>evidence. You'll put all of your fasta est files together in a
>>>>>>>>>>comma-separated list in the "est" option, and all of your fasta
>>>>>>>>>>protein files in a comma-separated list for the "protein" option.
>>>>>>>>>>
>>>>>>>>>>You'll specify the SNAP and Genemark files in their respective
>>>>>>>>>>options in the control file and pass the augustus and fgenesh
>>>>>>>>>>predictions in the "pred_gff" option.
>>>>>>>>>>
>>>>>>>>>>If you have the RepeatMasker output in gff3 format you can give
>>>>>>>>>>it to maker with the "rm_gff" option.
>>>>>>>>>>
>>>>>>>>>>If you've converted the cufflinks output to gff3, you can give it
>>>>>>>>>>to maker with the "est_gff" option. I'm pretty sure Trinity only
>>>>>>>>>>gives fasta output, so you would put that in the "est" option,
>>>>>>>>>>along with all the other est fasta files.
>>>>>>>>>>
>>>>>>>>>>If Augustus isn't trained for your particular organism, then you
>>>>>>>>>>can use another organism that augustus is already trained for.
>>>>>>>>>>The list of species that augustus has parameter files for is in
>>>>>>>>>>the README.txt that came with Augustus. I really recommend that
>>>>>>>>>>you run Augustus from inside maker, because then you get all the
>>>>>>>>>>benefits of maker passing ext-based hints to augustus at runtime,
>>>>>>>>>>which can really improve Augustus' predictive ability.
>>>>>>>>>>
>>>>>>>>>>When you ran the augustus gene prediction separately, did you use
>>>>>>>>>>another organism's parameter file?
>>>>>>>>>>
>>>>>>>>>>Thanks,
>>>>>>>>>>Daniel
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>>>>>>>>>><nguyenan at mail.nih.gov> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I would like to conduct a genome annotation and have the
>>>>>>>>>>> following data:
>>>>>>>>>>> - Two separate RepeatMasker outputs (using -lib and -species
>>>>>>>>>>>   options)
>>>>>>>>>>> - ESTs and RACE (fasta)
>>>>>>>>>>> - proteins (fasta)
>>>>>>>>>>> - proteins of related organisms (fasta)
>>>>>>>>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to
>>>>>>>>>>>   convert to ZFF format, etc.)
>>>>>>>>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>>>>>>>>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to
>>>>>>>>>>>   convert the outputs to .gff3 files. I ran the Augustus gene
>>>>>>>>>>>   prediction separately because the genome has never been
>>>>>>>>>>>   trained for Augustus.
>>>>>>>>>>> - Cufflinks and Trinity from RNA-Seq
>>>>>>>>>>> 
>>>>>>>>>>> Could you please let me know how I can specify parameters in
>>>>>>>>>>> the maker_opts.ctl file?
>>>>>>>>>>> Or do you have other suggestions to re-do the data listed
>>>>>>>>>>> above?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks.
>>>>>>>>>>> Anh-Dao
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>>
>
>
