From kai.kamm at ecolevol.de  Thu Mar  5 10:47:02 2015
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Thu, 05 Mar 2015 17:47:02 +0100
Subject: [maker-devel] Better resolve conflicting gene models
Message-ID: <54F88886.9010004@ecolevol.de>

Hello, thanks for your previous advice.

(Btw, how can one reply to an existing thread such that the reply will 
be added to the same thread?)


I am trying to find the best parameters with Maker for the annotation of 
my genome. I have run Maker with several combinations of parameters and 
predictors on my three biggest scaffolds and looked at the results in 
Jbrowse. Overall most predictions seem fine, but there are some genes 
with conflicts and I have no idea why.

I have:

- 100Mb assembled genome
- Trinity RNAseq assembly
- Cufflinks data (in my case it does not seem to be as messy as suggested, 
but rather a good complement to the Trinity data)
- protein evidence (related and unrelated species)
- repeat library from RepeatModeler


Gene predictors used:

- Augustus trained with transcripts from related species: seems to 
perform fine

- SNAP: no convergence with Augustus even after second training. Dropped 
it because it predicted lots of additional low quality transcripts and 
sometimes disrupted final Maker transcripts.

- Genemark: converged with Augustus after training (introns received 
from TopHat2 output). Tends to predict some additional transcripts 
(compared to Augustus). Few (but some) of these are covered by evidence 
and thus become final Maker transcripts.


So the combination of Augustus and Genemark seems optimal. In general 
both perform well in Maker and tend to predict the same transcripts.

However, I still observe some problems in the behavior of Maker which I 
don't understand:

Example 1: One of the predictors predicts a small additional exon at the 
start which is also covered by protein or EST data. But sometimes Maker 
chooses the other predictor's model for the final transcript. Mostly 
these are minor differences, but I don't understand this behavior.

Example 2: there are some extreme cases like an Augustus prediction with 
17 exons which are all covered by Trinity and cufflinks isoforms. 
Genemark instead predicts two separate small genes with 2 and 4 exons 
respectively. The resulting final transcript has 7 exons and the 
additional evidence from the trinity and cufflinks data is treated as UTR.


So I thought Augustus seemed a little more accurate and ran Maker only 
with Augustus to resolve such conflicts, even though I would lose the 
few additional transcripts from Genemark.

This is what happened:

- The gene in Example 2 now has all the 17 exons. This is good!

- Sadly another gene with several exons, which was formerly predicted by 
both Augustus and Genemark and is also covered by cufflinks and trinity 
transcripts, now consists only of two small exons in the final 
transcript, even though Augustus still predicts the same exons and the 
same evidence is present. Only the Genemark prediction, which was 
almost identical to Augustus, is absent. I don't understand this at all.

I don't worry about the minor differences. The extreme cases are rare, 
about two genes in a hundred, but I don't understand the behavior. I 
thought that in case of conflicting models Maker would choose the one 
that best fits the evidence. With most conflicts this is evidently what 
happens, because the majority of the final models look OK. But not in 
the cases mentioned above, and I don't understand why.

Is there any parameter I missed to better resolve such conflicts?

Best


From bmoore at genetics.utah.edu  Thu Mar  5 18:20:52 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Fri, 6 Mar 2015 00:20:52 +0000
Subject: [maker-devel] Maker Software Question
In-Reply-To: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
References: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
Message-ID: <E682D1C1-B792-498E-88C9-D9349E9548C8@genetics.utah.edu>

Hi Chris,

I don't know if others have responded to you already, but I just found this e-mail unanswered in the depths of my inbox, so apologies for the lack of a reply.

I'm going to forward your e-mail along to the full MAKER mailing list so that you'll get the benefit of responses from the full group of MAKER developers.

MAKER does include a couple of scripts that add functional annotation to MAKER GFF3 output.  This process is described in the recent paper:

Campbell MS, Holt C, Moore B, Yandell M. Genome Annotation and Curation Using
MAKER and MAKER-P. Curr Protoc Bioinformatics.

http://www.ncbi.nlm.nih.gov/pubmed/25501943

Mike, do you have a PDF of the final print version of that paper that you could send directly to Christopher?

B

On Jan 16, 2015, at 8:38 AM, Seabury, Christopher <CSeabury at cvm.tamu.edu> wrote:

Dear Colleagues,

I would like to quickly ask about a specific routine/possible function in MAKER.
Previously, we have essentially made home-made versions of MAKER by way of
multi-step programming.  At present we are exploring MAKER but are wondering
if MAKER has the ability to populate the GFF with gene/protein ID information.
As an initial experiment, we gave MAKER cDNA RefSeqs, plus corresponding protein RefSeqs,
and a reference, but do not see the gene/protein ID in the GFF.  Is there a subroutine
for this, or an option we have missed?


Thanks and Kind Regards,


Christopher M. Seabury PhD
Associate Professor
Department of Veterinary Pathobiology
College of Veterinary Medicine
Texas A&M University
College Station, TX 77843-4467
cseabury at cvm.tamu.edu
Mobile: 979-492-6400


From bmoore at genetics.utah.edu  Mon Mar  9 13:12:10 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Mon, 9 Mar 2015 18:12:10 +0000
Subject: [maker-devel] Does the maker google forum works? -[Doubt]
	maker2zff line 109
In-Reply-To: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
References: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
Message-ID: <13826169-DFBB-4E32-AFB1-A01CB3EEE0A4@genetics.utah.edu>

Hi Javier,

The Google forum is only a mirror of the actual mailing list, which allows the list to be archived by Google (better Google search results for MAKER users), so posting is not allowed there.  Please join the official MAKER mailing list at:

http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Thanks,

B

On Mar 9, 2015, at 12:07 PM, FRANCISCO JAVIER SANCHEZ GARCIA <javiersg at um.es> wrote:


Good morning. I am having trouble posting messages to the MAKER help forum on Google Groups. I don't know whether the problem is on my side or whether it is a problem with the permissions, but I can't see the red button for new threads. Anyway, I will try to describe my problem with maker2zff, which does not work. My version is v2.31.8.

../maker/marker_v2.31.8/maker/bin/maker2zff   ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, <GFF> line 1922870.

I read something about problematic characters in the ID, but I don't know whether that applies in my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html


Regards.
Thanks in advance.


Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo 30100
Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
      +34 868 888 031 (laboratory-work phone)

http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014


From javiersg at um.es  Mon Mar  9 17:27:00 2015
From: javiersg at um.es (FRANCISCO JAVIER SANCHEZ GARCIA)
Date: Mon, 09 Mar 2015 23:27:00 +0100
Subject: [maker-devel] [Doubt] maker2zff line 109
Message-ID: <20150309232700.Horde.39Aj_iELfhGei1Bv5JG1fw6@webmail.um.es>

Good evening everyone. I will try to describe my problem with maker2zff, which
does not work. My version is v2.31.8. The GFF line cited in the error message,
which says that it cannot find the file or directory, is the last line of the
GFF file.

../maker/marker_v2.31.8/maker/bin/maker2zff   ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, <GFF> line 1922870.

I read something about problematic characters in the ID, but I don't
know whether that applies in my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html

Regards.
Thanks in advance.

Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo 30100
Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
       +34 868 888 031 (laboratory-work phone)

http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014

From carsonhh at gmail.com  Thu Mar 12 14:50:44 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 13:50:44 -0600
Subject: [maker-devel] Better resolve conflicting gene models
In-Reply-To: <54F88886.9010004@ecolevol.de>
References: <54F88886.9010004@ecolevol.de>
Message-ID: <7B8D93E6-DEE7-43D7-BFBF-38E474311A6C@gmail.com>

Sorry for the slow reply.


> how can one reply to an existing thread such that the reply will be added to the same thread?

Threads show up under the Google archive, but that is really just an artifact of how things are archived. The mailing list itself doesn't have threads, just a subject line like any other e-mail. But because we are archiving to a message board, any messages with the same subject get turned into parts of the same thread.


> Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictor's model for the final transcript. Mostly these are minor differences, but I don't understand this behavior.

The gene chosen by MAKER is the one that best matches the evidence.  This is measured by a numeric value called AED (lower means a better match).  If a model predicts an exon or a base pair that is not supported by an evidence alignment, then it will be penalized.  If a model fails to predict a base pair that is supported by evidence, then it will also be penalized.  The differences in your results are because you changed something between runs: either you added or removed a predictor (which gives MAKER a different set of models to choose from) or the evidence changed (different AED score).  Whichever model has the lowest AED score will always be kept, so some change you made resulted either in a different set of available models to choose from or in a different AED score.

Also be aware that models cannot overlap on the same strand, so if you made a change that results in a model being removed from consideration, you may have removed overlap that was precluding another model with a slightly higher AED from being chosen.
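
To make the AED comparison concrete, here is a minimal Python sketch. It assumes the commonly cited definition AED = 1 - (SN + SP)/2, with base-level sensitivity and specificity measured against the evidence. The function names and the toy coordinates are hypothetical, and MAKER's own calculation runs on its internal evidence clusters and may differ in detail.

```python
def bases(exons):
    """Expand 1-based inclusive (start, end) exon intervals into a set of positions."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def aed(model_exons, evidence_exons):
    """Annotation edit distance: 0 = perfect agreement, 1 = no agreement."""
    model = bases(model_exons)
    evidence = bases(evidence_exons)
    shared = len(model & evidence)
    if shared == 0:
        return 1.0
    sensitivity = shared / len(evidence)  # evidence bases the model recovers
    specificity = shared / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Toy locus (hypothetical coordinates): evidence supports a small first exon
# plus two larger exons.
evidence = [(100, 150), (300, 500), (700, 900)]
model_a = [(100, 150), (300, 500), (700, 900)]  # includes the small first exon
model_b = [(300, 500), (700, 900)]              # misses the small first exon
model_c = [(300, 500), (700, 1000)]             # misses it and overshoots the evidence

for name, model in [("a", model_a), ("b", model_b), ("c", model_c)]:
    print(name, round(aed(model, evidence), 3))
```

In this toy case the model that matches the evidence exactly scores best, the model that misses the small exon is penalized a little, and the model that also extends beyond the evidence is penalized more.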

> 
> Example 2: there are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR.
> 
> - Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript. 
> Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent which was almost identical to Augustus. This I completely don't understand.

The model chosen will always be the one with the lowest AED.  The changes you describe are all based off of a model with lower AED supplanting another, which may have opened up a region that was previously overlapped by a longer gene with a poor AED score.
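
The interaction between "lowest AED wins" and "no overlap on the same strand" can be illustrated with a deliberately simplified Python sketch. The greedy procedure, the `Model` type, and all names and scores are assumptions made for illustration. This is not MAKER's actual selection code.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    start: int
    end: int
    strand: str
    aed: float


def overlaps(a, b):
    """True if two models share at least one base on the same strand."""
    return a.strand == b.strand and a.start <= b.end and b.start <= a.end


def choose_models(candidates):
    """Greedy pick: lowest AED first, skipping anything that overlaps an earlier pick."""
    chosen = []
    for model in sorted(candidates, key=lambda m: m.aed):
        if not any(overlaps(model, kept) for kept in chosen):
            chosen.append(model)
    return sorted(chosen, key=lambda m: m.start)


# Hypothetical locus: one long model versus two short models in the same region.
long_model = Model("long_model", 1000, 9000, "+", 0.20)
short_a = Model("short_a", 1000, 2000, "+", 0.10)
short_b = Model("short_b", 6000, 9000, "+", 0.25)

both_predictors = choose_models([long_model, short_a, short_b])
one_predictor = choose_models([long_model])
print([m.name for m in both_predictors])
print([m.name for m in one_predictor])
```

In the toy data the short model with the best AED blocks the long model, which in turn leaves room for a second short model with a worse AED. Removing the short models from consideration lets the long model through, even though nothing about the evidence changed.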

I would also recommend not including cufflinks output if you have trinity data.  Cufflinks results often falsely bridge regions and will cause gene mergers by falsely extending evidence clusters. This makes it appear as if genes should extend into regions where they shouldn't.  Given those hints, predictors will then try and predict at least something in those regions, and if they do the models will get a good AED score since they are supported by evidence even if it is false evidence.

--Carson


From carsonhh at gmail.com  Thu Mar 12 15:03:11 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 14:03:11 -0600
Subject: [maker-devel] maker-devel post from avhoeck@sckcen.be requires
	approval
In-Reply-To: <mailman.5.1426178330.13228.maker-devel_yandell-lab.org@box290.bluehost.com>
References: <mailman.5.1426178330.13228.maker-devel_yandell-lab.org@box290.bluehost.com>
Message-ID: <73D91049-1CB8-4C59-A9E5-58374FCB0789@gmail.com>

Hi Arne,

The genes found by CEGMA are very short genes, so while those genes may be identifiable, at least partially, on shorter contigs, other genes will be far longer.  So the 87% value you get from CEGMA is probably what you are going to be able to find (even partially) for the entire genome. Remember that gene size, when you include the introns, can actually be quite large, and gene predictors need flanking signals from the sequence several hundred bp upstream and downstream to make a prediction, so a gene that is only partially contained in a contig is un-annotatable for ab initio predictors. Unless the organism has a high gene density or very few introns, you usually won't be able to find a gene on contigs smaller than ~10 kb.
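
A quick way to see how much of an assembly falls under the ~10 kb rule of thumb is to summarize the contig lengths. The following Python sketch is only an illustration: the function names and the toy lengths are hypothetical, and the 10,000 bp default simply mirrors the approximate cutoff mentioned above.

```python
def n50(lengths):
    """Contig length at which the running total (longest first) reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def fraction_in_long_contigs(lengths, min_length=10_000):
    """Fraction of all assembled bases that sit in contigs of at least min_length."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= min_length) / total


# Toy assembly (hypothetical contig lengths in bp).
toy_lengths = [50_000, 30_000, 20_000, 8_000, 5_000, 2_000]
print("N50:", n50(toy_lengths))
print("fraction in contigs >= 10 kb:", round(fraction_in_long_contigs(toy_lengths), 3))
```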

--Carson



From avhoeck at SCKCEN.BE  Thu Mar 12 11:38:42 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 12 Mar 2015 16:38:42 +0000
Subject: [maker-devel] TACC lonestar and N50 value
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, of which only about 3/4 (333 Mbp) is in contigs longer than 10 kbp. CEGMA reported that 87% of the genes were found, whereas 94% were partially identified.  You said last time that contigs smaller than 10 kbp are not ideal for annotation and that it is preferable to discard them. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)

Besides, I could successfully run MAKER via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh because permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne


Consider the environment before you print

SCK•CEN Disclaimer: http://www.sckcen.be/en/e-mail_disclaimer

From mtollis at asu.edu  Fri Mar 13 16:48:46 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Fri, 13 Mar 2015 14:48:46 -0700
Subject: [maker-devel] Question about pre-masked genome.
Message-ID: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>

Hello,
I have a genome that has been well-annotated for species-specific repeats
(using RepeatModeler, trf, and RepeatMasker), and was wondering if I could
simply use the soft-masked assembly for maker annotation, in order to skip
the sometimes cumbersome repeatmasking step. Is this generally looked upon
as doable, or should I just stick with using the unmasked assembly and the
repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could
not confirm that the masked assembly (i.e., lowercase letters) was
maintained in the scaffolds as viewed in the maker output directory.
Thanks,
Marc

-- 
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu

website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com

From dence at genetics.utah.edu  Fri Mar 13 19:14:52 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sat, 14 Mar 2015 00:14:52 +0000
Subject: [maker-devel] Question about pre-masked genome.
In-Reply-To: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>
References: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>
Message-ID: <51DE3D97-10DB-4FF9-A793-EEA051C705BD@genetics.utah.edu>

Hi Marc,

Several groups have been doing what you described with fungal genomes for a while and it seems to be working for them.  With the soft-masked genome, MAKER will still allow EST alignments to extend into the masked regions; if it were a hard-masked genome, this wouldn't be possible.

Let us know how it works out though!
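
For checking whether soft-masking (lowercase letters) is present in a given FASTA file, a small Python sketch such as the following may help. It only measures the lowercase fraction per sequence. The function name and the toy sequences are hypothetical, and the thread does not state which MAKER output files retain the original case.

```python
import io


def softmasked_fraction(fasta_handle):
    """Return {sequence_id: fraction of sequence letters that are lowercase}."""
    counts = {}
    seq_id = None
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            seq_id = header[0] if header else ""
            counts[seq_id] = [0, 0]  # [lowercase letters, all letters]
        elif seq_id is not None:
            counts[seq_id][0] += sum(1 for ch in line if ch.islower())
            counts[seq_id][1] += sum(1 for ch in line if ch.isalpha())
    return {sid: (low / total if total else 0.0) for sid, (low, total) in counts.items()}


# Toy input (hypothetical); for a real file, pass an open file handle instead.
toy = io.StringIO(">scaffold_1 demo\nACGTacgtacgtACGT\nacgtACGT\n>scaffold_2\nACGTACGT\n")
fractions = softmasked_fraction(toy)
print(fractions)
```

Running the same function on the input assembly and on a scaffold FASTA from the output directory would show whether the lowercase fraction is the same in both.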

Thanks,
Daniel



From mtollis at asu.edu  Sun Mar 15 09:19:37 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Sun, 15 Mar 2015 07:19:37 -0700
Subject: [maker-devel] control file for SNAP training
Message-ID: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>

This is a question about process, and to make sure I am doing things right
(when time is of the essence, some mistakes can set you back weeks).

I have run maker on my de novo vertebrate genome, using only the predicted
proteome from a congener (well-studied and available on Ensembl), and
generated the HMM for the first round of SNAP training. As per the 2014
tutorial, I edited the control file for this step as follows: I added the
path to the .hmm file, and set protein2genome to 0.

When I run maker, however, I notice that in addition to snap, it is still
running blastx and exonerate. I noticed that this is because I did not
remove (or "comment out") the path to the protein.fa in the control file
(the output looks markedly different when I do comment out the protein file,
and I can't even tell if it's running snap in this instance).

Is it simply using exonerate to place the ab initio predictions on the
scaffolds (meaning that protein2genome=1 just tells maker to make
evidence-based annotations)? Did I do this correctly, or should I also remove
the protein.fa from the control file for SNAP training?
-- 
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu

website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com

From steinj at cshl.edu  Mon Mar 16 08:29:36 2015
From: steinj at cshl.edu (Stein, Joshua)
Date: Mon, 16 Mar 2015 13:29:36 +0000
Subject: [maker-devel] TACC lonestar and N50 value
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
Message-ID: <0A0DBB88-EC95-43B7-AA75-B7858B67A8F4@cshl.edu>

Hi Arne,

I have experience with iPlant resources and with MAKER-P.  I would encourage you to try the Atmosphere image (7888b8e1-c006-4794-82d9-4c940ddbf4c6).  You can request a large instance (up to 16 CPUs and 128 GB memory) and run in MPI mode to distribute the work.  Please see this tutorial, which includes information on running in MPI mode:  https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial.

You can also access the TACC Lonestar installation using the iPlant Discovery Environment.  There is an app called "MAKER-P-Lonestar-Small-Genomes 2.3".  Although it is advertised as appropriate for "small" genomes, I think there is a good chance that it will work for 450 Mb.  This is a new resource and the iPlant team would value any feedback and benchmarks on how the system is working.  Depending on how this goes, there are plans to roll out additional apps intended for larger genomes.  Here is a tutorial: https://pods.iplantcollaborative.org/wiki/display/sciplant/Tutorial+for+running+MAKER-P+on+TACC-Lonestar+from+iPlant+Discovery+Environment

Regarding contig sizes, though not ideal, you can include contigs smaller than 10kbp in your run.  Plant genes tend to be more compact than vertebrate genes so you ought to be able to recover annotations on the smaller contigs, though keep an eye out for truncated genes.

Best,
Josh


Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From mtollis at asu.edu  Tue Mar 17 16:26:44 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Tue, 17 Mar 2015 14:26:44 -0700
Subject: [maker-devel] control file for SNAP training
In-Reply-To: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
References: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
Message-ID: <CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>

I answered my own question:
There is no need to align the proteins again, which takes too long.
So, I used the GFF file produced by gff3_merge from the log file of the first
run (the one with just protein2genome). Then, after generating the .hmm
file, I put it in my control file, along with protein2genome=0, removed the
protein.fasta, pointed maker_gff at that GFF file, and set protein_pass=1.
The output now shows that only snap is running, and no blastx and exonerate -
a relief because it is much faster!
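
The control-file changes described above could be scripted roughly as follows. This is a sketch under assumptions: the helper `set_ctl_options`, the file names, and the abbreviated comments in the toy control file are hypothetical, and `snaphmm` is assumed to be the option that holds the path to the .hmm file. Only the `key=value #comment` line layout and the option names are taken from MAKER's control file.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return control-file text with the given `key=value` lines rewritten, comments kept."""
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key, comment = match.group(1), match.group(3) or ""
            value = remaining.pop(key)
            line = f"{key}={value} {comment}".rstrip()
        out.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Toy excerpt of a first-round control file (comments paraphrased, not verbatim).
first_round = """\
maker_gff= #MAKER derived GFF3 file
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
protein=congener_proteins.fa #protein sequence file in fasta format
snaphmm= #SNAP HMM file
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
"""

second_round = set_ctl_options(first_round, {
    "maker_gff": "round1.all.gff",   # hypothetical merged GFF3 from the first run
    "protein_pass": "1",             # reuse the protein alignments stored in that GFF3
    "protein": "",                   # no protein FASTA, so nothing is re-aligned
    "snaphmm": "round1.snap.hmm",    # hypothetical name of the trained HMM
    "protein2genome": "0",
})
print(second_round)
```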


-- 
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu

website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com

From carsonhh at gmail.com  Tue Mar 17 21:47:50 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 17 Mar 2015 20:47:50 -0600
Subject: [maker-devel] control file for SNAP training
In-Reply-To: <CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>
References: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
	<CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>
Message-ID: <AADABBD3-04F1-49BF-B261-4B316EF60D2B@gmail.com>

You can also add an additional protein file if you want more evidence than what is already in the GFF3. This makes re-annotation, and adding evidence to an already annotated genome, really easy.

--Carson



From Brian.Mack at ARS.USDA.GOV  Fri Mar 20 08:17:09 2015
From: Brian.Mack at ARS.USDA.GOV (Mack, Brian)
Date: Fri, 20 Mar 2015 13:17:09 +0000
Subject: [maker-devel] est2genome wrong strand
Message-ID: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>

Hi, I've noticed what seems to be a flipping of the strand of some of my transcripts from est2genome. I assembled my directional RNA-seq reads using Trinity. I've copied an example below. The blastn within MAKER shows the transcript aligning on the positive strand, as does blastn against my genome in SequenceServer. But est2genome shows the strand as negative. I've noticed this for quite a few transcripts while examining them in WebApollo. Any ideas what might be causing this?


Thanks,

Brian


Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496
>contig_69 <http://10.114.143.20:4567/get_sequence/?id=contig_69&db=/home/brian/blastdb/af70_20130423_id-modified.assembly.txt>
Length=108040

 Score =  1043 bits (1156),  Expect = 0.0
 Identities = 589/592 (99%), Gaps = 3/592 (1%)
 Strand=Plus/Plus

Query  24      TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  83
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105546  TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  105605

Query  84      CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  142
               |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105606  CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  105665


69           blastn    expressed_sequence_match         105546  106137  559        +             .               ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3
69           blastn    match_part         105546  106137  559        +             .               ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359
69           est2genome       expressed_sequence_match         105546  106137  2909      -              .               ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3
69           est2genome       match_part         105546  106137  2909      -              .               ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359

From carsonhh at gmail.com  Fri Mar 20 09:54:28 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Mar 2015 08:54:28 -0600
Subject: [maker-devel] est2genome wrong strand
In-Reply-To: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>
References: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>
Message-ID: <C32539B3-CF24-4C99-9897-605FE8C8CCB8@gmail.com>

Hi Brian,

Multi-exon ESTs are stranded by their splice sites, which can only work on one strand. Single-exon ESTs, on the other hand, are stranded by their open reading frame. The standard chemistry used in most sequencing does not allow for strand-specific sequence. There are technologies that do, but unless you used one of those, re-stranding single-exon ESTs based on ORF is the best way to figure out where single-exon alignments should go (not a completely reliable method, but it works most of the time).

I believe Trinity tries to determine strand the same way (but it is unreliable there too). For example, since your alignment is not an exact match to the genomic sequence (a single-bp deletion in the alignment), the best open reading frame in the transcript is not the same as the best open reading frame in the genomic sequence (so one of them likely contains an error). MAKER, since it is annotating the genome, logically re-strands it to the genome, while Trinity, being unaware of the genome, strands it to the transcript.

Because single-exon alignments are very unreliable, they are ignored in MAKER by default. They will not be used as hints for gene predictors and can only be used to support a gene if there is also protein evidence or a single-exon ab initio prediction at the same location to support it (even then, this will only happen if you set single_exon=1 in the control files).
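
As a rough illustration of what stranding by open reading frame means, here is a toy Python sketch. It is not MAKER's or Trinity's actual code: it assumes that "best ORF" can be approximated by the longest stop-free run of codons in any of the three frames, and all function names are made up for the example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_length(seq):
    """Longest stop-free run, in codons, over the three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0  # a stop codon ends the current open stretch
            else:
                run += 1
                best = max(best, run)
    return best


def guess_strand(seq):
    """Pick the strand with the longer open stretch; '.' when tied."""
    fwd = longest_orf_length(seq)
    rev = longest_orf_length(reverse_complement(seq))
    if fwd == rev:
        return "."
    return "+" if fwd > rev else "-"


# Made-up sequence: open in forward frame 0, stops in every reverse frame.
toy = "ATGTTACTTAGCGCTTACGCCGCCGCCGCC"
print(guess_strand(toy), guess_strand(reverse_complement(toy)))
```

Because the decision rests on which strand happens to have the longer open stretch, a single indel that shifts the frame can change the answer, which is why the transcript and the genomic copy of the same locus can end up on opposite strands.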

--Carson



From xvazquezc at gmail.com  Sat Mar 21 22:27:27 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Sun, 22 Mar 2015 14:27:27 +1100
Subject: [maker-devel] annotation stats: repeats
Message-ID: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>

Hi all,

I was wondering how I can get data about the repeat content of the genome
from maker, if possible, broken down by type of repeat: RE, transposons,
simple repeats, low complexity repeats.

Thank you in advance,

Xabier

-- 
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From dence at genetics.utah.edu  Sun Mar 22 00:56:06 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sun, 22 Mar 2015 05:56:06 +0000
Subject: [maker-devel] annotation stats: repeats
In-Reply-To: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>
References: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>
Message-ID: <4FFB960C-CF1A-440F-B767-36EC369D2B58@genetics.utah.edu>

Hi Xabier, MAKER offers several options for repeat masking. There's a file of repetitive element proteins that is run with RepeatRunner, and you can also run RepeatMasker with a custom library and/or one of the default RepeatMasker libraries.

The results from these programs are in the GFF3 files that MAKER generates. Depending on the library that you give RepeatMasker, the repeats might be classified by transposable element family and as simple vs. complex repeats. These are probably a good place to start, but you'll have to extract that data from the GFF3 file and compile it.
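
One way to do that extraction is a short Python tally over the GFF3. Treat this as a sketch under assumptions: it assumes repeat features carry "repeatmasker" or "repeatrunner" in the source column, that child features are typed match_part, and that the Name attribute holds whatever classification the library provides. Check a few lines of your own file before relying on it.

```python
from collections import defaultdict
from urllib.parse import unquote

# Assumed values of the GFF3 source column for repeat features.
REPEAT_SOURCES = {"repeatmasker", "repeatrunner"}


def repeat_summary(gff_lines):
    """Sum feature lengths (bp) per (source, Name) for repeat features.

    Child 'match_part' rows are skipped so each hit is counted once.
    Overlapping hits are not merged, so totals can overcount.
    """
    totals = defaultdict(int)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section, no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        source, ftype = cols[1], cols[2]
        if source not in REPEAT_SOURCES or ftype == "match_part":
            continue
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = unquote(attrs.get("Name", "unknown"))
        totals[(source, name)] += end - start + 1
    return dict(totals)
```

Calling repeat_summary on an open file handle for the merged GFF3 returns a dictionary that can be sorted by total length; dividing by the assembly size gives a rough percentage per repeat name.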

Let us know whether that helps.

Thanks,
Daniel



From panos.ioannidis at gmail.com  Tue Mar 24 03:29:14 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 09:29:14 +0100
Subject: [maker-devel] Augustus retraining
Message-ID: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>

Hello All,

I'm trying to retrain Augustus using EST data from the same species and
realized that quite a few of the gene models I get based on EST data are
incomplete (i.e. no start and/or stop codon).

Now, when I get to the "etraining" step in Augustus retraining (right after
the time-consuming "optimize_augustus.pl" step), I get a warning for each
gene that doesn't contain a start or stop codon.

.....
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1
transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon
does not begin with start codon but with acg
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1
transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon
doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
....

Does anyone know whether training is compromised by such incomplete gene
models? Do you usually exclude them from the training set?

Oh, and by the way, the best guide to retraining Augustus is here
<http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>.
The official
<http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html> web
page isn't bad, but doesn't explain certain things in detail.

Thanks,
Panos

From xvazquezc at gmail.com  Tue Mar 24 07:06:25 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 24 Mar 2015 23:06:25 +1100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
Message-ID: <CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>

Hi Panos,

Have you tried using webAugustus for the (re)training? I found it very
convenient for generating the models for Augustus.

Cheers,



-- 
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From panos.ioannidis at gmail.com  Tue Mar 24 07:24:45 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 13:24:45 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
Message-ID: <CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>

Hi Xabier,

Thanks for your quick reply!

No, I haven't used WebAugustus, but I just checked it out and it looks like
my training set is too big (~300 Mbp), so I can't even upload it!

Anyway, I prefer to train it locally because I have better control over
each step. Also, I have done the entire training procedure with fewer genes,
but didn't get a good gene-level sensitivity (~5%). So now I'm trying to
replicate it using more of my scaffolds, but it appears I get a lot more
incomplete models from exonerate (run through Maker).

P



From carsonhh at gmail.com  Tue Mar 24 09:14:51 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:14:51 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
Message-ID: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

Hi Panos,

ESTs and mRNA-seq assemblies will by their nature be partial. After a first round of training, you can run MAKER together with protein and EST evidence and the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what you would get from just running Augustus on its own. Then take these gene models and use them to retrain Augustus. This is the standard bootstrap retraining procedure, and it can be repeated as needed.

More info on bootstrap training here (info is for SNAP, but the procedure is similar for Augustus) -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
Here is an excellent explanation of Augustus training -> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
And here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training, so just take that output and convert it for Augustus) -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Finally, you can also manually edit the GFF3 file in Apollo (the legacy standalone version is easier to use), and then convert that file for bootstrap training.

--Carson



From panos.ioannidis at gmail.com  Tue Mar 24 09:31:04 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 15:31:04 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID: <CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>

Hi Carson,

So you think it's okay to include incomplete gene models when training
Augustus?

I'll certainly try the bootstrap method you're suggesting. Even though I
did it for SNAP, for some weird reason I forgot it for Augustus :p Do you
think, however, that I can get a big improvement in gene-level sensitivity?
Currently, I have only 6%...

Thanks,
Panos


On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt <carsonhh at gmail.com> wrote:

> Hi Panos,
>
> EST?s and mRNA-seq assemblies will bey their nature be partial.  After a
> first round of training you can run MAKER together with protein and EST
> evidence and the newly trained Augustus species file.  Because MAKER gives
> hints to Augustus as it runs, the models it produces will be improved over
> what it would get from just running Augustus on it?s own.  Then take these
> gene models and use them to retrain Augustus.  This is the standard
> bootstrap retraining procedure, and can be repeated as needed.
>
> More info on bootstrap training here (info is for SNAP but procedure is
> similar to Augustus) ?>  http://weatherby.genetics.
> utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_
> Online_Training_2014#Training_ab_initio_Gene_Predictors
> Here is an excellent explanation of Augustus training ?>
> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
> and here are tools to convert SNAP training files to Augustus training
> files (MAKER comes with a tool that converts GFF3 for SNAP training so just
> take that and convert it for Augustus)?> https://github.com/
> hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
>
> Finally you can also manually edit the GFF3 file in Apollo (easier to use
> the legacy stand alone version), and then convert that file for bootstrap
> training.
>
> ?Carson
>
>
> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis <panos.ioannidis at gmail.com>
> wrote:
>
> Hi Xabier,
>
> Thanks for your quick reply!
>
> No, I haven't used WebAugustus, but I just checked it out and it looks
> like my training set is too big (~300 Mbp), so I can't even upload it!
>
> Anyway, I prefer to train it locally because I have better control over
> each step. Also, I have done the entire training procedure with less genes,
> but didn't get a good gene-level sensitivity (~5%). So now I'm trying to
> replicate it using more of my scaffolds, but as it appears I get a lot more
> incomplete models from exonerate (run through Maker).
>
> P
>
>
>
> On Tue, Mar 24, 2015 at 1:06 PM, Xabier V?zquez Campos <
> xvazquezc at gmail.com> wrote:
>
>> Hi Panos,
>>
>> Have you tried using webAugustus for the (re)training? I found it very
>> convenient for generating the models for Augustus.
>>
>> Cheers,
>>
>> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis <panos.ioannidis at gmail.com>:
>>
>>> Hello All,
>>>
>>> I'm trying to retrain Augustus using EST data from the same species and
>>> realized that quite a few of the gene models I get based on EST data are
>>> incomplete (i.e. no start and/or stop codon).
>>>
>>> Now, when I get to the "etraining" step in Augustus retraining (right
>>> after the time-consuming "optimize_augustus.pl" step), I get a warning
>>> for each gene that doesn't contain a start or stop codon.
>>>
>>> .....
>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1
>>> transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon
>>> does not begin with start codon but with acg
>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1
>>> transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon
>>> doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
>>> ....
>>>
>>> Does anyone know whether training is compromised by such incomplete gene
>>> models? Do you usually exclude them from the training set?
>>>
>>> Oh, and by the way, the best guide to retraining Augustus is here
>>> <http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>.
>>> The official
>>> <http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html> web
>>> page isn't bad, but doesn't explain in detail certain things.
>>>
>>> Thanks,
>>> Panos
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>>
>>
>>
>> --
>> Xabier Vázquez Campos
>> *PhD Candidate*
>> Water Research Centre
>> School of Civil and Environmental Engineering
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>>
>

From carsonhh at gmail.com  Tue Mar 24 09:39:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:39:20 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
Message-ID: <C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>

On your first round it is fine.  It gives the predictor enough to work with, and on the second round you use the improved models. When you say 6% sensitivity, is that Augustus running on its own?  If it's inside of MAKER, that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also, is that the gene-level, exon-level, or nucleotide-level sensitivity?  If you are looking at the gene-level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide-level sensitivity/specificity metrics.  The gene- and exon-level metrics are basically meaningless (unless it's Drosophila, which is the only species annotated correctly enough to use them).

--Carson
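
For illustration, the gap between the gene-level and nucleotide-level measures can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not how Augustus or MAKER compute their reports: each gene is assumed to be a single transcript given as a list of (start, end) coding exons on one strand, "specificity" is taken in the gene-prediction sense of TP / (TP + FP), and all function names and coordinates are hypothetical.

```python
def covered_bases(genes):
    """Return the set of base positions covered by any exon (1-based, inclusive)."""
    bases = set()
    for exons in genes:
        for start, end in exons:
            bases.update(range(start, end + 1))
    return bases


def nucleotide_level(reference, predicted):
    """Sensitivity = TP / reference bases; specificity = TP / predicted bases."""
    ref, pred = covered_bases(reference), covered_bases(predicted)
    tp = len(ref & pred)
    return tp / len(ref), tp / len(pred)


def gene_level_sensitivity(reference, predicted):
    """Fraction of reference genes that are reproduced exactly, exon for exon."""
    pred = {tuple(sorted(exons)) for exons in predicted}
    hits = sum(1 for exons in reference if tuple(sorted(exons)) in pred)
    return hits / len(reference)


# Toy data (made up): two reference genes; the prediction gets one exon
# boundary wrong by 3 bp and is otherwise identical.
reference = [[(1, 100), (201, 300)], [(501, 600)]]
predicted = [[(1, 100), (201, 297)], [(501, 600)]]

nt_sn, nt_sp = nucleotide_level(reference, predicted)
gene_sn = gene_level_sensitivity(reference, predicted)
print(nt_sn, nt_sp, gene_sn)
```

In this toy case a single 3 bp boundary error leaves nucleotide-level sensitivity at 0.99 while gene-level sensitivity drops to 0.5, which is the kind of effect that keeps the gene-level number low when the reference models are themselves imperfect.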


> On Mar 24, 2015, at 8:31 AM, Panos Ioannidis <panos.ioannidis at gmail.com> wrote:
> 
> Hi Carson,
> 
> So you think it's okay to include incomplete gene models when training Augustus?
> 
> I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%...
> 
> Thanks,
> Panos
> 
> 
> On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt <carsonhh at gmail.com> wrote:
> Hi Panos,
> 
> ESTs and mRNA-seq assemblies will by their nature be partial.  After a first round of training you can run MAKER together with protein and EST evidence and the newly trained Augustus species file.  Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what you would get from just running Augustus on its own.  Then take these gene models and use them to retrain Augustus.  This is the standard bootstrap retraining procedure, and it can be repeated as needed.
> 
> More info on bootstrap training here (info is for SNAP but the procedure is similar for Augustus) -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
> Here is an excellent explanation of Augustus training -> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
> and here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training, so just take that output and convert it for Augustus) -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
> 
> Finally you can also manually edit the GFF3 file in Apollo (easier to use the legacy stand alone version), and then convert that file for bootstrap training.
> 
> --Carson
> 
> 
>> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis <panos.ioannidis at gmail.com> wrote:
>> 
>> Hi Xabier,
>> 
>> Thanks for your quick reply!
>> 
>> No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!
>> 
>> Anyway, I prefer to train it locally because I have better control over each step. Also, I have done the entire training procedure with fewer genes, but didn't get a good gene-level sensitivity (~5%). So now I'm trying to replicate it using more of my scaffolds, but it appears that I get a lot more incomplete models from exonerate (run through Maker).
>> 
>> P
>> 
>> 
>> 

From panos.ioannidis at gmail.com  Tue Mar 24 10:05:54 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 16:05:54 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
	<C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
Message-ID: <CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>

Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level
is 88%. I only mentioned gene-level because that's the only metric
mentioned on the Augustus web site.

I got these numbers outside of Maker. Actually, I only used Maker to
generate the gff files needed to start the training (ran it using only EST
evidence and only on a subset of my assembly, using this
<http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>
as a guide).

Now, I've started running the second round of training, as you suggested.
Since, however, I don't have data from closely related species, I'm only
using Uniref50 as protein evidence.

P


From carsonhh at gmail.com  Tue Mar 24 10:38:08 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 09:38:08 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
	<C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
	<CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>
Message-ID: <4EA5A1F6-2950-4D65-A59C-0F3848C86C02@gmail.com>

I'd pick a couple of species that are as closely related as you can find.  Proteins will still align over large evolutionary distances, but you need a breadth of protein data that UniRef and even databases like Swiss-Prot won't have (those databases are usually a little too conservative).

The gene level sensitivity/specificity values make certain assumptions about the models you are comparing with.  Unfortunately the only species these assumptions hold true for is Drosophila and maybe C. elegans to a point.  This is the reason you will rarely see species other than those two used for the gene and exon sensitivity/specificity metrics.

Thanks,
Carson



From alicebdennis at gmail.com  Thu Mar 26 05:34:26 2015
From: alicebdennis at gmail.com (Alice Dennis)
Date: Thu, 26 Mar 2015 11:34:26 +0100
Subject: [maker-devel] iterative Maker2
In-Reply-To: <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch>
	<1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID: <CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>

Hello again,

I posted a while ago about a genome I'm running through the Maker2
pipeline. I was concerned because my results were still changing with
3 and 4 iterations.

Following the very useful advice of Carson (below), I've made a few
modifications (adding a RepeatModeler run, using a big protein
database), but my gene predictions are still changing between the 3rd
and 4th iterations. Perhaps this is ok, but these increasing gene
lengths make me worry that I haven't built stable models.

Here is the short version of what I've done.
1. Run RepeatModeler, but this only produced 47 sequences in the
resulting .fasta... so that seemed a bit small.

2. Run Maker2 using:
- RepeatModeler output + "model_org=all" and "softmask=1" in the
Repeat Masking section.
- protein evidence from 2 distantly related species AND all of Uniprot
- ests from a different strain of my species (a parasitoid wasp)
- the .hmm from Nasonia, one of the 2 distantly related species whose
proteome I also provided as protein evidence
- my assembled genome of 1,509 scaffolds.

3. After this, I did three subsequent rounds of Maker2 (cleverly named
Rounds 2, 3 and 4). Each one used the same input, except the Nasonia
.hmm was replaced by a SNAP-generated .hmm from the previous round.
Also, est2genome and protein2genome were changed from 1 to 0 in all
runs after the first.
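
The change made between rounds in step 3 can be sketched as a small Python helper. This is a minimal sketch, assuming the control file consists of `key=value` lines with optional trailing `#` comments; `snaphmm` is assumed here to be the option holding the SNAP HMM path, and the file names and comment texts are hypothetical.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with each matching `key=value` line rewritten.

    Trailing `#comment` text on a line is preserved. Keys that do not
    occur in the file are ignored rather than appended.
    """
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in updates:
            key, comment = match.group(1), match.group(3) or ""
            line = f"{key}={updates[key]} {comment}".rstrip()
        out.append(line)
    return "\n".join(out) + "\n"


# Round 1 settings (illustrative values only).
round1 = (
    "snaphmm=nasonia.hmm #SNAP HMM file\n"
    "est2genome=1 #infer gene predictions directly from ESTs\n"
    "protein2genome=1 #infer predictions from protein homology\n"
)

# Later rounds: swap in the HMM trained on the previous round and turn
# off the direct evidence-to-gene-model options.
round2 = set_ctl_options(
    round1,
    {"snaphmm": "round1_snap.hmm", "est2genome": 0, "protein2genome": 0},
)
print(round2)
```

Generating each round's control file from the previous one this way makes it easier to confirm that the HMM and the two `*2genome` switches are the only things that differ between rounds.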

Here are some results:
Round1: 14,647 genes, average length 2,491
Round2: 12,158 genes, average length 3,760
Round3: 13,515 genes, average length 3,090
Round4: 12,169 genes, average length 3,918

This is a bit confusing because the number of genes predicted goes up
and down, as do their lengths. I've double-checked the dates of my
files, and they are all labeled such that I don't think anything could
be swapped.
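
The per-round numbers above can also be recomputed directly from each round's GFF3 output, which is one way to rule out swapped files. The following is a minimal sketch, assuming standard nine-column GFF3 with `gene` in the third column; whether it reproduces exactly the statistics quoted above (for example gene length versus transcript length) is an assumption, and the sample records are made up.

```python
def gene_stats(gff_lines):
    """Count `gene` features and return (count, mean gene length in bp)."""
    lengths = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        if cols[2] == "gene":
            start, end = int(cols[3]), int(cols[4])
            lengths.append(end - start + 1)  # GFF3 is 1-based, inclusive
    count = len(lengths)
    mean = sum(lengths) / count if count else 0.0
    return count, mean


# Made-up records standing in for one round's output.
sample = [
    "##gff-version 3\n",
    "scaffold1\tmaker\tgene\t1001\t3000\t.\t+\t.\tID=gene1\n",
    "scaffold1\tmaker\tmRNA\t1001\t3000\t.\t+\t.\tID=gene1-mRNA-1;Parent=gene1\n",
    "scaffold2\tmaker\tgene\t501\t1500\t.\t-\t.\tID=gene2\n",
]
count, mean_length = gene_stats(sample)
print(count, mean_length)
```

For a real run, the same function can be given an open file handle for each round's merged GFF3 file, so that the four rounds are summarized in exactly the same way.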

So my questions are:
Is this an indication that my models are unstable and I shouldn't
trust these predictions?
Is the decreasing number of genes, together with their increasing
length, perhaps a good thing?
How do I know when to stop if genes keep getting longer?


Thanks very much,
Alice


On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt <carsonhh at gmail.com> wrote:
> The gene models are actually produced by SNAP, Augustus, or whatever gene
> predictor you are using, so if you change the HMM every round, then the
> models will change too.  But I have one concern.  You are using a very
> sparse protein evidence dataset.  The protein dataset is very important to
> MAKER's performance, and for iterative training of the ab initio
> predictors.  Normally after the second iteration, additional training should
> not be beneficial, but if you are getting wildly different results in the 3rd
> and 4th rounds, then you probably aren't getting enough good models to
> train with.
>
> For a protein dataset you should be using the entire proteome from a
> minimum of two related species and perhaps all of UniProt/Swiss-Prot to get
> a broad protein database.  Don't use the proteins extracted by CEGMA and
> HaMSTr.  CEGMA can be used to guide the first HMM creation (cegma2zff script
> that comes with MAKER), but don't give the proteins to MAKER as evidence;
> also the HaMSTr results will be redundant with the ESTs.  You need proteins
> from related species to look for homology not found in the EST dataset.
>
> Also repeat masking is important for any genome and has a huge effect on ab
> initio predictor performance.  Make sure you run something like
> RepeatModeler to look for species specific repeats that will not already be
> in RepBase.  Then add those results to the rmlib= option in the maker
> control files.
>
> Thanks,
> Carson
>
>
>
>
> On Dec 12, 2014, at 7:10 AM, Dennis, Alice <Alice.Dennis at eawag.ch> wrote:
>
> Hi all,
>
> I am a relatively new user to Maker2, and I'm looking for advice on running
> many iterations of the same dataset in Maker2.
>
> I have a relatively small genome (~124 MB) from a wasp that is assembled
> into ~1,500 scaffolds. I have run several iterations of Maker2 by
> re-generating .hmms in SNAP and feeding them into the next round, and my
> gene predictions keep increasing (in number and in size).  The only thing
> that changes at each round is the .hmm.
> The evidence that I give is:
> -          de novo assembled ESTs from a different strain of the same
> species (70,000 contigs; I am currently working on improving this assembly
> with the hope that this will be helpful here)
> -          610 proteins extracted from the genome scaffolds using CEGMA and
> HaMSTr
>
> For my 1st iteration, I used the Nasonia .hmm from SNAP, and the
> est2genome/protein2genome option.
>
> For the 2nd, 3rd and 4th rounds I have used .hmms generated from the
> previous round, all without the est2genome/protein2genome option. All other
> files are the same as in the original run.
>
> As I understand it, after the second round, nothing should change in Maker2.
> But the differences are obvious between runs. Some entirely new exons are
> annotated. For example, just counting "exon" in the .gff file gives me
> 73,000 after the third iteration and 96,000 after the fourth! Actually the
> biggest leap in this number is between the third and fourth round. I can
> also see that many features are longer when I look at the files in Geneious.
>
> Is this sort of change possible after the second round of Maker2? Is there
> something I have done wrong in my runs, or am I understanding this output
> incorrectly?
>
> Thank you,
> Alice
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>


-- 


Alice Dennis
alicebdennis at gmail.com

Postdoctoral Researcher
Institute for Integrative Biology, ETH Zürich & EAWAG
Überlandstrasse 133
P.O. Box 611
8600 Dübendorf, Switzerland

https://adennis5.wordpress.com/


From michael.s.campbell1 at gmail.com  Thu Mar 26 10:50:41 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 26 Mar 2015 09:50:41 -0600
Subject: [maker-devel] iterative Maker2
In-Reply-To: <CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch>
	<1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
	<CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>
Message-ID: <CAAi6vWXnyyFkTVD9tc-QGxSBCBenTy5QyTM6ReVqDveXQA0FTg@mail.gmail.com>

Hi Alice,

In my experience, fewer but longer genes are generally a good thing (and very
normal), resulting from the merging of split models and the extension of
incomplete models. I find it helpful to load the annotations and evidence
into a browser to get a visual idea of what is happening.

Mike

On Thu, Mar 26, 2015 at 4:34 AM, Alice Dennis <alicebdennis at gmail.com>
wrote:

> Hello again,
>
> I posted a while ago about a genome I'm running through the Maker2
> pipeline. I was concerned because my results were still changing with
> 3 and 4 iterations.
>
> Following the very useful advice of Carson (below), I've made a few
> modifications (adding a RepeatModeler run, using a big protein
> database), but my gene predictions are still changing between the 3rd
> and 4th iterations. Perhaps this is ok, but these increasing gene
> lengths make me worry that I haven't built stable models.
>
> Here is the short version of what I've done.
> 1. Run RepeatModeler, but this only produced 47 sequences in the
> resulting .fasta... so that seemed a bit small.
>
> 2. Run Maker2 using:
> - RepeatModeler output + "model_org=all" and "softmask=1" in the
> Repeat Masking section.
> - protein evidence from 2 distantly related species AND all of Uniprot
> - ests from a different strain of my species (a parasitoid wasp)
> - the .hmm from Nasonia, one of the 2 distantly related species whose
> proteome I also provided as protein evidence
> - my assembled genome of 1,509 scaffolds.
>
> 3. After this, I did three subsequent rounds of Maker2 (cleverly named
> Rounds 2, 3 and 4). Each one used the same input, except the Nasonia
> .hmm was replaced by a SNAP generated .hmm from the previous round.
> Also, the est2genome and protein2genome was changed from 1 to 0 in all
> runs after the first.
>
> Here are some results:
> Round1: 14,647 genes, average length 2,491
> Round2: 12,158 genes, average length 3,760
> Round3: 13,515 genes, average length 3,090
> Round4: 12,169 genes, average length 3,918
>
> This is a bit confusing because the number of genes predicted goes up
> and down, as does their lengths. I've doubly checked the dates of my
> files, and they are all labeled such that I don't think anything could
> be swapped.
>
> So my questions are:
> Is this an indication that my models are unstable and I shouldn't
> trust these predictions?
> Is the decreasing number of genes, while also getting longer perhaps a
> good thing?
> How do I know when to stop if genes keep getting longer?
>
>
> Thanks very much,
> Alice
>
>
> On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt <carsonhh at gmail.com> wrote:
> > The gene models are actually produced by SNAP, Augustus, or whatever gene
> > predictor you are using, so if you change the HMM every round, then the
> > models will change too.  But I have one concern.  You are using a very
> > sparse protein evidence dataset.  The protein dataset is very important to
> > MAKER's performance, and for iterative training of the ab initio
> > predictors.  Normally after the second iteration, additional training should
> > not be beneficial, but if you are getting wildly different results on 3rd
> > and 4th round, then you probably aren't getting sufficient good models to
> > train with.
> >
> > For a protein dataset you should be using the entire proteome from a
> > minimum of two related species and perhaps all of UniProt/Swiss-Prot to get
> > a broad protein database.  Don't use the proteins extracted by CEGMA and
> > HaMSTr.  CEGMA can be used to guide the first HMM creation (cegma2zff script
> > that comes with MAKER), but don't give the proteins to MAKER as evidence;
> > also, the HaMSTr results will be redundant with the ESTs.  You need proteins
> > from related species to look for homology not found in the EST dataset.
> >
> > Also repeat masking is important for any genome and has a huge effect on ab
> > initio predictor performance.  Make sure you run something like
> > RepeatModeler to look for species specific repeats that will not already be
> > in RepBase.  Then add those results to the rmlib= option in the maker
> > control files.
> >
> > Thanks,
> > Carson
> >
> >
> >
> >
> > On Dec 12, 2014, at 7:10 AM, Dennis, Alice <Alice.Dennis at eawag.ch>
> wrote:
> >
> > Hi all,
> >
> > I am a relatively new user to Maker2, and I'm looking for advice on running
> > many iterations of the same dataset in Maker2.
> >
> > I have a relatively small genome (~124 MB) from a wasp that is assembled
> > into ~1,500 scaffolds. I have run several iterations of Maker2 by
> > re-generating .hmms in SNAP and feeding them into the next round, and my
> > gene predictions keep increasing (in number and in size).  The only thing
> > that changes at each round is the .hmm.
> > This is the evidence that I give:
> > -          de novo assembled ESTs from a different strain of the same
> > species (70,000 contigs; I am currently working on improving this assembly
> > with the hope that this will be helpful here)
> > -          610 proteins extracted from the genome scaffolds using CEGMA and
> > HaMSTr
> >
> > For my 1st iteration, I used the Nasonia .hmm from SNAP, and the
> > est2genome/protein2genome option.
> >
> > For the 2nd, 3rd and 4th rounds I have used .hmms generated from the
> > previous round, all without the est2genome/protein2genome option. All other
> > files are the same as in the original run.
> >
> > As I understand it, after the second round, nothing should change in Maker2.
> > But the differences are obvious between runs. Some entirely new exons are
> > annotated. For example, just counting "exon" in the .gff file gives me
> > 73,000 after the third iteration and 96,000 after the fourth! Actually the
> > biggest leap in this number is between the third and fourth round. I can
> > also see that many features are longer when I look at the files in Geneious.
> >
> > Is this sort of change possible after the second round of Maker2? Is there
> > something I have done wrong in my runs, or am I understanding this output
> > incorrectly?
> >
> > Thank you,
> > Alice
> >
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at box290.bluehost.com
> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> >
> >
>
>
>
> --
>
>
> Alice Dennis
> alicebdennis at gmail.com
>
> Postdoctoral Researcher
> Institute for Integrative Biology, ETH Zürich & EAWAG
> Überlandstrasse 133
> P.O. Box 611
> 8600 Dübendorf, Switzerland
>
> https://adennis5.wordpress.com/
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>


-- 
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From rens.holmer at wur.nl  Mon Mar 30 01:12:20 2015
From: rens.holmer at wur.nl (Holmer, Rens)
Date: Mon, 30 Mar 2015 06:12:20 +0000
Subject: [maker-devel] Incorporating cufflinks in maker
Message-ID: <42E0168F-C672-4B7F-97D4-98442B825BF9@wur.nl>

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq data into our annotation. Currently I see two options:

1. Provide the cufflinks output as EST-gff
2. Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,

Rens Holmer


From goutham.atla at gmail.com  Sat Mar 28 00:37:08 2015
From: goutham.atla at gmail.com (Goutham atla)
Date: Sat, 28 Mar 2015 11:07:08 +0530
Subject: [maker-devel] Annotating Cufflinks GTF with Maker
Message-ID: <CALU8LA4CwLD8qm5f==xKSjZoCw+9Ajd=RCD62LkHTdBYbuajig@mail.gmail.com>

Dear All,

I have a draft genome for my organism of interest and around 150G of
100bp paired-end RNA-Seq data from different conditions. This organism has
Ensembl annotations, but very few.

My goal is to analyze differential splicing between two
conditions. For this I need good annotations in GTF format at the isoform
level. I am interested in using the Splicing Analysis Kit
<http://cbcb.umd.edu/software/spanki/>

For now, I have aligned one sample to the genome using TopHat2 and then used
Cufflinks to generate a de novo GTF file. In neither step did I use
the available GTF with its very few annotations.

The GTF file generated by Cufflinks should be annotated so that the
function of each transcript is known. So I am interested in adding annotations to
the GTF file generated by Cufflinks. What is the best way of doing this?

Or is there any better way of getting a GTF file, like that of Ensembl,
from my data?

I have looked at Trinotate, but it is more about functional annotation and
expression studies.


Regards,

-- 
Goutham Atla

From avhoeck at SCKCEN.BE  Mon Mar 30 11:11:16 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Mon, 30 Mar 2015 16:11:16 +0000
Subject: [maker-devel] comments on Incorporating cufflinks in maker
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF11826E1A@MAILSRV3.sck.be>

Dear Rens and Carlson,
I would like to comment on Rens' question about adding transcript data to the annotation. Since I have no access to Google Groups, I am trying to contact you both by mail, hopefully with the correct addresses.

I have tested MAKER-P with multiple parameters and optimized 2 gene prediction tools (SNAP and Augustus). Both give similar results, with MAKER finding around 17000 gene annotations. I also have RNA-seq samples, and as a test case I processed Cufflinks output with TransDecoder to select gene annotations. With this approach I end up with 25000 gene annotations. More or less all the genes that MAKER selected were also selected by the TransDecoder approach. So is it wise to include the information on the 8000 missing genes, based on optimized gene models? There will probably be some false positives among these 8000 genes, but with good criteria and models, it may be possible for MAKER to find more gene annotations.

Best regards
Arne

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. Currently I see two options:

Provide the cufflinks output as EST-gff
Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,

Rens Holmer



From kai.kamm at ecolevol.de  Thu Mar  5 09:47:02 2015
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Thu, 05 Mar 2015 17:47:02 +0100
Subject: [maker-devel] Better resolve conflicting gene models
Message-ID: <54F88886.9010004@ecolevol.de>

Hello, thanks for your previous advice.

(Btw, how can one reply to an existing thread such that the reply will 
be added to the same thread?)


I am trying to find the best parameters with Maker for the annotation of 
my genome. I have run Maker with several combinations of parameters and 
predictors on my three biggest scaffolds and looked at the results in 
Jbrowse. Overall most predictions seem fine, but there are some genes 
with conflicts and I have no idea why.

I have:

- 100Mb assembled genome
- Trinity RNAseq assembly
- cufflinks data (in my case they don't seem to be messy as suggested, but rather
a good complement to the trinity data)
- protein evidence (related and unrelated species)
- repeat library from repeat modeler


Gene predictors used:

- Augustus trained with transcripts from related species: seems to 
perform fine

- SNAP: no convergence with Augustus even after second training. Dropped 
it because it predicted lots of additional low quality transcripts and 
sometimes disrupted final Maker transcripts.

- Genemark: converged with Augustus after training (introns received 
from TopHat2 output). Tends to predict some additional transcripts 
(compared to Augustus). Few (but some) of these are covered by evidence 
and thus become final Maker transcripts.


So the combination of Augustus and Genemark seems optimal. In general 
both perform well in Maker and tend to predict the same transcripts.

However, I still observe some problems in the behavior of Maker which I 
don't understand:

Example 1: One of the predictors predicts a small additional exon at the
start which is also covered by protein or EST data. But sometimes Maker
chooses the other predictor's model for the final transcript. Mostly
these are minor differences, but I don't understand this behavior.

Example 2: there are some extreme cases like an Augustus prediction with 
17 exons which are all covered by Trinity and cufflinks isoforms. 
Genemark instead predicts two separate small genes with 2 and 4 exons 
respectively. The resulting final transcript has 7 exons and the 
additional evidence from the trinity and cufflinks data is treated as UTR.


So I thought Augustus seemed a little more accurate and ran Maker only
with Augustus to resolve such conflicts, even though I would lose the
few additional transcripts from Genemark.

This is what happened:

- The gene in Example 2 now has all the 17 exons. This is good!

- Sadly another gene with several exons, which was formerly predicted by 
both Augustus and Genemark and is also covered by cufflinks and trinity 
transcripts, now consists only of two small exons in the final 
transcript. Even though Augustus still predicts the same exons and the 
same evidence is present - only the Genemark prediction is absent which 
was almost identical to Augustus. This I completely don't understand.

I don't worry about the minor differences. The extreme cases are like 
two genes in a hundred and I don't understand the behavior. I was 
thinking that in case of conflicting models Maker will choose the one 
that best fits the evidence. Obviously with most conflicts this is what 
happens, because the majority of the final models look OK. But not the 
above mentioned cases and I don't understand why?

Is there any parameter I missed to better resolve such conflicts?

Best


From bmoore at genetics.utah.edu  Thu Mar  5 17:20:52 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Fri, 6 Mar 2015 00:20:52 +0000
Subject: [maker-devel] Maker Software Question
In-Reply-To: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
References: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
Message-ID: <E682D1C1-B792-498E-88C9-D9349E9548C8@genetics.utah.edu>

Hi Chris,

I don't know if others have responded to you already, but I just found this e-mail unanswered in the depths of my inbox, so apologies for the late reply.

I'm going to forward your e-mail along to the full MAKER mailing list so that you'll get the benefit of responses from the full group of MAKER developers.

MAKER does include a couple of scripts that add functional annotation to MAKER GFF3 output.  This process is described in the recent paper:

Campbell MS, Holt C, Moore B, Yandell M. Genome Annotation and Curation Using
MAKER and MAKER-P. Curr Protoc Bioinformatics.

http://www.ncbi.nlm.nih.gov/pubmed/25501943

Mike, do you have a PDF of the final print version that you could send directly to Christopher?

B

On Jan 16, 2015, at 8:38 AM, Seabury, Christopher <CSeabury at cvm.tamu.edu<mailto:CSeabury at cvm.tamu.edu>> wrote:

Dear Colleagues,

I would like to quickly ask about a specific routine/possible function in MAKER.
Previously, we have essentially made home-made versions of maker by way of
Multi-step programming.   At present we are exploring MAKER but are wondering
IF MAKER has the ability to populate the GFF with GENE/Protein ID information?
As an initial experiment, we gave MAKER cDNA refseqs, plus corresponding Protein refseqs,
And a reference, but do not see the GENE/Protein ID in the GFF.  Is there a subroutine
For this, or option we have missed?


Thanks and Kind Regards,


Christopher M. Seabury PhD
Associate Professor
Department of Veterinary Pathobiology
College of Veterinary Medicine
Texas A&M University
College Station, TX 77843-4467
cseabury at cvm.tamu.edu<mailto:cseabury at cvm.tamu.edu>
Mobile: 979-492-6400


From bmoore at genetics.utah.edu  Mon Mar  9 12:12:10 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Mon, 9 Mar 2015 18:12:10 +0000
Subject: [maker-devel] Does the maker google forum works? -[Doubt]
	maker2zff line 109
In-Reply-To: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
References: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
Message-ID: <13826169-DFBB-4E32-AFB1-A01CB3EEE0A4@genetics.utah.edu>

Hi Javier,

The Google forum is only a mirror of the actual mailing list, which lets the list be archived by Google (better Google search results for MAKER users), so posting is not allowed there.  Please join the official MAKER mailing list at:

http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Thanks,

B

On Mar 9, 2015, at 12:07 PM, FRANCISCO JAVIER SANCHEZ GARCIA <javiersg at um.es<mailto:javiersg at um.es>> wrote:


Good morning. I am having a problem writing messages in the MAKER help forum on Google Groups. I don't know whether the problem is on my side or with the permissions, but I can't see the red button for new threads. Anyway, I will try to show my problem with maker2zff, which does not work. My version is v2.31.8.

../maker/marker_v2.31.8/maker/bin/maker2zff   ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, <GFF> line 1922870.

I read something about problematic characters in the ID, but I don't know if that applies to my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html


Regards.
Thanks in advance.


Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo 30100
Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
      +34 868 888 031 (laboratory-work phone)

http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014


From javiersg at um.es  Mon Mar  9 16:27:00 2015
From: javiersg at um.es (FRANCISCO JAVIER SANCHEZ GARCIA)
Date: Mon, 09 Mar 2015 23:27:00 +0100
Subject: [maker-devel] [Doubt] maker2zff line 109
Message-ID: <20150309232700.Horde.39Aj_iELfhGei1Bv5JG1fw6@webmail.um.es>

Good evening everyone. I will try to show my problem with maker2zff, which
does not work. My version is v2.31.8. The GFF line number reported in the
error is the last line of the GFF file; the error says that it cannot find
the file or directory.

../maker/marker_v2.31.8/maker/bin/maker2zff   ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, <GFF> line 1922870.

I read something about problematic characters in the ID, but I don't
know if that applies to my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html

Regards.
Thanks in advance.

Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo 30100
Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
       +34 868 888 031 (laboratory-work phone)

http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014

From carsonhh at gmail.com  Thu Mar 12 13:50:44 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 13:50:44 -0600
Subject: [maker-devel] Better resolve conflicting gene models
In-Reply-To: <54F88886.9010004@ecolevol.de>
References: <54F88886.9010004@ecolevol.de>
Message-ID: <7B8D93E6-DEE7-43D7-BFBF-38E474311A6C@gmail.com>

Sorry for the slow reply.


> how can one reply to an existing thread such that the reply will be added to the same thread?

Threads show up under the Google archive, but that is really just an artifact of how things are archived. The mailing list itself doesn't have threads, just a subject line like any other e-mail. But because we are archiving to a message board, any messages with the same subject get turned into parts of the same thread.


> Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictors model for the final transcript. Mostly these are minor differences but I don't understand this behavior?

The gene model chosen by MAKER is the one that best matches the evidence.  The match is measured by a numeric value called AED (annotation edit distance; lower means a better match).  If a model predicts an exon or a base pair that is not supported by an evidence alignment, then it will be penalized.  If a model fails to predict a base pair that is supported by evidence, then it will also be penalized.  The differences in your results are because you changed something between runs: either you added or removed a predictor (which gives MAKER a different set of models to choose from) or the evidence changed (different AED score).  Whichever model has the lowest AED score will always be kept, so some change you made resulted either in a different set of available models to choose from or in a different AED score.
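
To make the two penalties concrete, here is a minimal sketch of an AED-style
score at the base-pair level. It assumes the commonly cited form
AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered
by the model and SP is the fraction of model bases covered by evidence.
The function names and coordinates are made up for illustration, and MAKER's
own calculation may differ in detail (for example in how evidence
alignments are combined).

```python
def covered_bases(intervals):
    """Set of positions covered by inclusive 1-based (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(model_exons, evidence_alignments):
    """Toy AED: 0.0 is perfect agreement with evidence, 1.0 is no agreement."""
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_alignments)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # evidence bases the model recovers
    specificity = shared / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


evidence = [(1, 100), (201, 300)]                  # two evidence-supported exons
exact_model = [(1, 100), (201, 300)]               # matches the evidence
missing_exon = [(201, 300)]                        # misses the first exon
extra_exon = [(1, 100), (201, 300), (401, 500)]    # adds an unsupported exon

for name, model in [("exact", exact_model),
                    ("missing exon", missing_exon),
                    ("extra exon", extra_exon)]:
    print(name, round(aed(model, evidence), 3))
```

In this toy case both the model that misses a supported exon and the model
that adds an unsupported one score worse than the exact match, which is the
behavior described above.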

Also be aware that models cannot overlap on the same strand, so if you made a change that results in a model being removed from consideration, you may have removed overlap that was precluding another model with a slightly higher AED from being chosen.
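
The effect of removing a model can be illustrated with a toy selection
routine. This is only a sketch under stated assumptions: it uses a simple
greedy rule (take models in order of increasing AED, skip any that overlap
an already chosen model on the same strand), which is not necessarily how
MAKER resolves overlaps internally. The model names, coordinates and AED
values are invented.

```python
def choose_models(candidates):
    """Greedy toy selection: lowest AED first, no same-strand overlaps."""
    chosen = []
    for cand in sorted(candidates, key=lambda c: c["aed"]):
        clash = any(
            cand["strand"] == kept["strand"]
            and cand["start"] <= kept["end"]
            and kept["start"] <= cand["end"]
            for kept in chosen
        )
        if not clash:
            chosen.append(cand)
    return sorted(chosen, key=lambda c: c["start"])


genemark_model = {"name": "genemark-1", "strand": "+",
                  "start": 1000, "end": 5000, "aed": 0.10}
augustus_model = {"name": "augustus-1", "strand": "+",
                  "start": 4000, "end": 9000, "aed": 0.15}

# With both predictors, the lower-AED model wins and blocks the other one.
both = [m["name"] for m in choose_models([genemark_model, augustus_model])]

# Without the first predictor, the overlapping model is free to be chosen.
augustus_only = [m["name"] for m in choose_models([augustus_model])]

print(both, augustus_only)
```

The point of the sketch is only that dropping one predictor changes which
regions are blocked, so a model with a slightly higher AED can surface.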

> 
> Example 2: there are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR.
> 
> - Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript. 
> Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent which was almost identical to Augustus. This I completely don't understand.

The model chosen will always be the one with the lowest AED.  The changes you describe are all based off of a model with lower AED supplanting another, which may have opened up a region that was previously overlapped by a longer gene with a poor AED score.

I would also recommend not including cufflinks output if you have trinity data.  Cufflinks results often falsely bridge regions and will cause gene mergers by falsely extending evidence clusters. This makes it appear as if genes should extend into regions where they shouldn't.  Given those hints, predictors will then try to predict at least something in those regions, and if they do, the models will get a good AED score since they are supported by evidence, even if it is false evidence.

--Carson


From carsonhh at gmail.com  Thu Mar 12 14:03:11 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 14:03:11 -0600
Subject: [maker-devel] maker-devel post from avhoeck@sckcen.be requires
	approval
In-Reply-To: <mailman.5.1426178330.13228.maker-devel_yandell-lab.org@box290.bluehost.com>
References: <mailman.5.1426178330.13228.maker-devel_yandell-lab.org@box290.bluehost.com>
Message-ID: <73D91049-1CB8-4C59-A9E5-58374FCB0789@gmail.com>

Hi Arne,

The genes found by CEGMA are very short genes, so while those genes might be identifiable, at least partially, on shorter contigs, other genes will be far longer.  So the 87% value you get from CEGMA is probably what you are going to be able to find (even partially) for the entire genome. Remember that gene size, when you include the introns, can actually be quite large, and gene predictors need flanking signals from the sequence several hundred bp upstream and downstream to make a prediction, so having a gene only partially contained in a contig makes that gene un-annotatable for ab initio predictors. Unless the organism has a high gene density or very few introns, you usually won't be able to find a gene on contigs smaller than ~10kb.
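
To see how much of an assembly sits in contigs above that rough 10 kb mark,
a short script along these lines can help. It is a minimal sketch: the
function names are made up, the 10,000 bp threshold is just the rule of
thumb from the paragraph above, and the example lengths are invented.

```python
def contig_lengths(fasta_lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)           # start a new contig
        elif line and lengths:
            lengths[-1] += len(line)    # extend the current contig
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def fraction_in_long_contigs(lengths, min_len=10_000):
    """Fraction of all bases that lie in contigs of at least min_len."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(l for l in lengths if l >= min_len) / total


lengths = [40_000, 25_000, 15_000, 8_000, 7_000, 5_000]  # invented example
print(n50(lengths), fraction_in_long_contigs(lengths))
```

For a real assembly, contig_lengths can be fed an open FASTA file handle and
its result passed to the other two functions.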

--Carson


> On Mar 12, 2015, at 10:38 AM
> 
> From: Van Hoeck Arne <avhoeck at SCKCEN.BE <mailto:avhoeck at SCKCEN.BE>>
> To: "maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>" <maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>>
> Subject: TACC lonestar and N50 value
> Date: March 12, 2015 at 10:38:42 AM MDT
> 
> 
> Dear MAKER developer,
> 
> We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, of which only about 3/4 (333 Mbp) is in contigs longer than 10 kbp. CEGMA reported that 87% of the genes were found, whereas 94% were partially identified.  You said last time that contigs smaller than 10 kbp are not ideal for annotating and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)
> 
> Besides, I could successfully run MAKER via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh, since permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?
> 
> Best regards
> Arne

From avhoeck at SCKCEN.BE  Thu Mar 12 10:38:42 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 12 Mar 2015 16:38:42 +0000
Subject: [maker-devel] TACC lonestar and N50 value
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, of which only about 3/4 (333 Mbp) is in contigs longer than 10 kbp. CEGMA reported that 87% of the genes were found, whereas 94% were partially identified.  You said last time that contigs smaller than 10 kbp are not ideal for annotating and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)

Besides, I could successfully run MAKER via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh, since permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne



From mtollis at asu.edu  Fri Mar 13 15:48:46 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Fri, 13 Mar 2015 14:48:46 -0700
Subject: [maker-devel] Question about pre-masked genome.
Message-ID: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>

Hello,
I have a genome that has been well-annotated for species-specific repeats
(using RepeatModeler, trf, and RepeatMasker), and was wondering if I could
simply use the soft-masked assembly for maker annotation, in order to skip
the sometimes cumbersome repeatmasking step. Is this generally looked upon
as doable, or should I just stick with using the unmasked assembly and the
repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could
not confirm that the masked assembly (i.e., lowercase letters) was
maintained in the scaffolds as viewed in the maker output directory.
Thanks,
Marc

-- 
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu

*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com

From dence at genetics.utah.edu  Fri Mar 13 18:14:52 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sat, 14 Mar 2015 00:14:52 +0000
Subject: [maker-devel] Question about pre-masked genome.
In-Reply-To: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>
References: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>
Message-ID: <51DE3D97-10DB-4FF9-A793-EEA051C705BD@genetics.utah.edu>

Hi Marc, several groups have been doing what you described with fungal genomes for a while and it seems to be working for them.  With the soft-masked genome, MAKER will still allow EST alignments to extend into the masked regions; if it were a hard-masked genome, this wouldn't be possible.
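
For the P.S. about confirming that lowercase masking was maintained, a minimal sketch follows. It assumes a plain FASTA input and simply reports the fraction of lowercase bases per sequence; the function name `softmask_fraction` and the demo records are illustrative and not part of MAKER.

```python
from typing import Dict, Iterable, List


def softmask_fraction(fasta_lines: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of lowercase (soft-masked) bases for each FASTA record."""
    counts: Dict[str, List[int]] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence ID.
            name = line[1:].split()[0]
            counts[name] = [0, 0]  # [lowercase bases, total bases]
        elif name is not None:
            counts[name][0] += sum(1 for base in line if base.islower())
            counts[name][1] += len(line)
    return {
        seq: (low / total if total else 0.0)
        for seq, (low, total) in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical records; replace with open("your_scaffolds.fasta").
    demo = [">scaffold_1 demo", "ACGTacgtNNNN", "acgtACGT", ">scaffold_2", "ACGTACGT"]
    for seq, frac in softmask_fraction(demo).items():
        print(f"{seq}\t{frac:.2f}")
```

Running the same check on the input assembly and on the scaffolds in the output directory would show whether the two differ in masked fraction.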

Let us know how it works out though!

Thanks,
Daniel


On Mar 13, 2015, at 3:48 PM, Marc Tollis <mtollis at asu.edu<mailto:mtollis at asu.edu>> wrote:

Hello,
I have a genome that has been well-annotated for species-specific repeats (using RepeatModeler, trf, and RepeatMasker), and was wondering if I could simply use the soft-masked assembly for maker annotation, in order to skip the sometimes cumbersome repeatmasking step. Is this generally looked upon as doable, or should I just stick with using the unmasked assembly and the repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could not confirm that the masked assembly (i.e., lowercase letters) was maintained in the scaffolds as viewed in the maker output directory.
Thanks,
Marc

--
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu<mailto:marc.tollis at asu.edu>

website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com<http://anolistollis.wordpress.com/>

From mtollis at asu.edu  Sun Mar 15 08:19:37 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Sun, 15 Mar 2015 07:19:37 -0700
Subject: [maker-devel] control file for SNAP training
Message-ID: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>

This is a question about process, and to make sure I am doing things right
(when time is of the essence, some mistakes can set you back weeks).

I have run maker on my de novo vertebrate genome, using only the predictive
proteome from a congener (well-studied and available on Ensembl), and
generated the HMM for the first round of SNAP training. As per the 2014
tutorial, I edited the control file for this step as follows: I added the
path to the .hmm file, and set protein2genome to 0.

When I run maker, I notice that in addition to snap, it is still running
blastx and exonerate however. I noticed that this is because I did not
remove (or "comment out") the path to the protein.fa in the control file
(the output looks markedly different when I do comment out the protein file
- and I can't even tell if it's running snap in this instance).

Is it simply using exonerate to place the ab initio predictions on the
scaffolds (meaning that having protein2genome=1 is to tell maker to make
evidence annotations) ? Did I do this correctly, or should I also remove
the protein.fa out of the control file for SNAP training?
-- 
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu

*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com

From steinj at cshl.edu  Mon Mar 16 07:29:36 2015
From: steinj at cshl.edu (Stein, Joshua)
Date: Mon, 16 Mar 2015 13:29:36 +0000
Subject: [maker-devel] TACC lonestar and N50 value
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
Message-ID: <0A0DBB88-EC95-43B7-AA75-B7858B67A8F4@cshl.edu>

Hi Arne,

I have experience with iPlant resources and with MAKER-P.  I would encourage you to try the Atmosphere image (7888b8e1-c006-4794-82d9-4c940ddbf4c6).  You can request a large instance (up to 16 CPUs and 128 GB memory) and run in MPI mode to distribute the work.  Please see this tutorial, which includes information on running in MPI mode:  https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial.

You can also access the TACC Lonestar installation using the iPlant Discovery Environment.  There is an app called "MAKER-P-Lonestar-Small-Genomes 2.3".  Although it is advertised as appropriate for "small" genomes, I think there is a good chance that it will work for 450 Mb.  This is a new resource and the iPlant team would value any feedback and benchmarks on how the system is working.  Depending on how this goes, there are plans to roll out additional apps intended for larger genomes.  Here is a tutorial: https://pods.iplantcollaborative.org/wiki/display/sciplant/Tutorial+for+running+MAKER-P+on+TACC-Lonestar+from+iPlant+Discovery+Environment

Regarding contig sizes, though not ideal, you can include contigs smaller than 10 kbp in your run.  Plant genes tend to be more compact than vertebrate genes, so you ought to be able to recover annotations on the smaller contigs, though keep an eye out for truncated genes.
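
To see how much sequence actually sits in the small contigs before deciding, a sketch like the following could be used. It assumes you already have a list of contig lengths; the 10 kbp cutoff comes from the discussion above, while the function names and the demo lengths are made up for illustration.

```python
from typing import List, Tuple


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def split_by_length(lengths: List[int], cutoff: int = 10_000) -> Tuple[int, int]:
    """Return (bp in contigs >= cutoff, bp in contigs < cutoff)."""
    long_bp = sum(length for length in lengths if length >= cutoff)
    return long_bp, sum(lengths) - long_bp


if __name__ == "__main__":
    demo_lengths = [50_000, 30_000, 20_000, 8_000, 5_000, 2_000]  # hypothetical
    long_bp, short_bp = split_by_length(demo_lengths)
    print("N50:", n50(demo_lengths))
    print("bp in contigs >= 10 kbp:", long_bp)
    print("bp in contigs <  10 kbp:", short_bp)
```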

Best,
Josh


On Mar 13, 2015, at 6:06 PM, Van Hoeck Arne <avhoeck at SCKCEN.BE<mailto:avhoeck at SCKCEN.BE>> wrote:

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, of which only 3/4 (333 Mbp) is in contigs longer than 10 kbp. CEGMA said that 87% of the genes were found, whereas 94% were partially identified.  You said last time that contigs smaller than 10 kbp are not ideal for annotating and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)

Besides, I could run MAKER successfully via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh, since permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne



Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu<mailto:steinj at cshl.edu>
http://ware.cshl.org/


From mtollis at asu.edu  Tue Mar 17 15:26:44 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Tue, 17 Mar 2015 14:26:44 -0700
Subject: [maker-devel] control file for SNAP training
In-Reply-To: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
References: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
Message-ID: <CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>

I answered my own question:
No need to re-align proteins again - takes too long.
So, I used the gff file from the gff_merge on the log file from the first
run (the one with just protein2genome). Then, after generating the .hmm
file, I put it in my control file, along with protein2genome=0, removed the
protein.fasta, set maker_gff and protein_pass=1. The output now shows that
only snap is running, and no blastx and exonerate - a relief because it is
much faster!
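
The control-file changes described above can be sketched as a small script. This is only an illustration under assumptions: the option names `protein`, `snaphmm`, `maker_gff`, `protein_pass` and `protein2genome` are taken to be the keys in maker_opts.ctl, the file names are hypothetical, and `set_ctl_options` is a helper written for this example, not a MAKER tool.

```python
from typing import Dict


def set_ctl_options(ctl_text: str, updates: Dict[str, str]) -> str:
    """Rewrite key=value lines of a control file, keeping trailing #comments."""
    out = []
    for line in ctl_text.splitlines():
        key, sep, rest = line.partition("=")
        key = key.strip()
        if sep and not line.startswith("#") and key in updates:
            # Preserve the explanatory comment that follows the value.
            _, hash_mark, comment = rest.partition("#")
            new_line = f"{key}={updates[key]}"
            if hash_mark:
                new_line += f" #{comment}"
            out.append(new_line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Shortened, hypothetical excerpt of a first-round control file.
    round1 = (
        "maker_gff= #MAKER derived GFF3 file\n"
        "protein_pass=0 #reuse protein alignments from maker_gff\n"
        "protein=protein.fasta #protein evidence\n"
        "snaphmm= #SNAP HMM file\n"
        "protein2genome=1 #infer predictions from protein homology\n"
    )
    round2 = set_ctl_options(round1, {
        "maker_gff": "round1.all.gff",  # merged GFF3 from the first run (hypothetical name)
        "protein_pass": "1",            # pass through existing protein alignments
        "protein": "",                  # no re-alignment with blastx/exonerate
        "snaphmm": "round1.hmm",        # newly trained SNAP model (hypothetical name)
        "protein2genome": "0",
    })
    print(round2, end="")
```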

On Sun, Mar 15, 2015 at 7:19 AM, Marc Tollis <mtollis at asu.edu> wrote:

> This is a question about process, and to make sure I am doing things right
> (when time is of the essence, some mistakes can set you back weeks).
>
> I have run maker on my de novo vertebrate genome, using only the
> predictive proteome from a congener (well-studied and available on
> Ensembl), and generated the HMM for the first round of SNAP training. As
> per the 2014 tutorial, I edited the control file for this step as follows:
> I added the path to the .hmm file, and set protein2genome to 0.
>
> When I run maker, I notice that in addition to snap, it is still running
> blastx and exonerate however. I noticed that this is because I did not
> remove (or "comment out") the path to the protein.fa in the control file
> (the output looks markedly different when I do comment out the protein file
> - and I can't even tell if it's running snap in this instance).
>
> Is it simply using exonerate to place the ab initio predictions on the
> scaffolds (meaning that having protein2genome=1 is to tell maker to make
> evidence annotations) ? Did I do this correctly, or should I also remove
> the protein.fa out of the control file for SNAP training?
> --
> *Marc Tollis, Ph.D.*
> *Post-Doctoral Research Associate*
> *Arizona State University*
> *LSE 313*
> *(480) 965-7456 <%28480%29%20965-7456>*
> marc.tollis at asu.edu
>
> *website: *https://sites.google.com/site/tollisresearch/
> *blog: *anolistollis.wordpress.com
>


-- 
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu

*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com

From carsonhh at gmail.com  Tue Mar 17 20:47:50 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 17 Mar 2015 20:47:50 -0600
Subject: [maker-devel] control file for SNAP training
In-Reply-To: <CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>
References: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
	<CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>
Message-ID: <AADABBD3-04F1-49BF-B261-4B316EF60D2B@gmail.com>

You can also add an additional protein file if you want more evidence than what is already in the GFF3. It makes re-annotation and adding evidence to an already annotated genome really easy.

--Carson


> On Mar 17, 2015, at 3:26 PM, Marc Tollis <mtollis at asu.edu> wrote:
> 
> I answered my own question:
> No need to re-align proteins again - takes too long.
> So, I used the gff file from the gff_merge on the log file from the first run (the one with just protein2genome). Then, after generating the .hmm file, I put it in my control file, along with protein2genome=0, removed the protein.fasta, set maker_gff and protein_pass=1. The output now shows that only snap is running, and no blastx and exonerate - a relief because it is much faster!
> 
> On Sun, Mar 15, 2015 at 7:19 AM, Marc Tollis <mtollis at asu.edu <mailto:mtollis at asu.edu>> wrote:
> This is a question about process, and to make sure I am doing things right (when time is of the essence, some mistakes can set you back weeks).
> 
> I have run maker on my de novo vertebrate genome, using only the predictive proteome from a congener (well-studied and available on Ensembl), and generated the HMM for the first round of SNAP training. As per the 2014 tutorial, I edited the control file for this step as follows: I added the path to the .hmm file, and set protein2genome to 0. 
> 
> When I run maker, I notice that in addition to snap, it is still running blastx and exonerate however. I noticed that this is because I did not remove (or "comment out") the path to the protein.fa in the control file (the output looks markedly different when I do comment out the protein file - and I can't even tell if it's running snap in this instance). 
> 
> Is it simply using exonerate to place the ab initio predictions on the scaffolds (meaning that having protein2genome=1 is to tell maker to make evidence annotations) ? Did I do this correctly, or should I also remove the protein.fa out of the control file for SNAP training? 
> -- 
> Marc Tollis, Ph.D.
> Post-Doctoral Research Associate
> Arizona State University
> LSE 313
> (480) 965-7456 <tel:%28480%29%20965-7456>
> marc.tollis at asu.edu <mailto:marc.tollis at asu.edu>
> 
> website: https://sites.google.com/site/tollisresearch/ <https://sites.google.com/site/tollisresearch/>
> blog: anolistollis.wordpress.com <http://anolistollis.wordpress.com/>
> 
> 
> -- 
> Marc Tollis, Ph.D.
> Post-Doctoral Research Associate
> Arizona State University
> LSE 313
> (480) 965-7456
> marc.tollis at asu.edu <mailto:marc.tollis at asu.edu>
> 
> website: https://sites.google.com/site/tollisresearch/ <https://sites.google.com/site/tollisresearch/>
> blog: anolistollis.wordpress.com <http://anolistollis.wordpress.com/>

From Brian.Mack at ARS.USDA.GOV  Fri Mar 20 07:17:09 2015
From: Brian.Mack at ARS.USDA.GOV (Mack, Brian)
Date: Fri, 20 Mar 2015 13:17:09 +0000
Subject: [maker-devel] est2genome wrong strand
Message-ID: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>

Hi, I've noticed what seems to be a flipping of the strand of some of my transcripts from est2genome. I assembled my directional RNA-seq reads using Trinity. I've copied an example below. The blastn within MAKER shows the transcript aligning on the positive strand, as does blastn against my genome in SequenceServer. But the est2genome shows the strand to be negative. I've noticed this for quite a few transcripts while examining it in WebApollo. Any ideas what might be causing this?


Thanks,

Brian


Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496
>contig_69 <http://10.114.143.20:4567/get_sequence/?id=contig_69&db=/home/brian/blastdb/af70_20130423_id-modified.assembly.txt>
Length=108040

 Score =  1043 bits (1156),  Expect = 0.0
 Identities = 589/592 (99%), Gaps = 3/592 (1%)
 Strand=Plus/Plus

Query  24      TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  83
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105546  TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  105605

Query  84      CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  142
               |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105606  CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  105665


69           blastn    expressed_sequence_match         105546  106137  559        +             .               ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3
69           blastn    match_part         105546  106137  559        +             .               ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359
69           est2genome       expressed_sequence_match         105546  106137  2909      -              .               ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3
69           est2genome       match_part         105546  106137  2909      -              .               ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359

From carsonhh at gmail.com  Fri Mar 20 08:54:28 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Mar 2015 08:54:28 -0600
Subject: [maker-devel] est2genome wrong strand
In-Reply-To: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>
References: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>
Message-ID: <C32539B3-CF24-4C99-9897-605FE8C8CCB8@gmail.com>

Hi Brian,

Multi-exon ESTs are stranded by their splice sites, which can only work on one strand. Single-exon ESTs, on the other hand, are stranded by their open reading frame. The standard chemistry used in most sequencing does not allow for strand-specific sequence. There are technologies that do, but unless you used one of those, re-stranding single-exon ESTs based on ORF is the best way to figure out where single-exon alignments should go (not a completely reliable method, but it works most of the time).  I believe Trinity tries to determine strand the same way (but it is unreliable there too). For example, since your alignment is not an exact match to the genomic sequence (single bp deletion in the alignment), the best open reading frame in the transcript is not the same as the best open reading frame in the genomic sequence (so one of them likely contains an error).  MAKER, since it is annotating the genome, logically re-strands it to the genome, and Trinity, being unaware of the genome, strands it to the transcript.  Because single-exon alignments are very unreliable, they are ignored in MAKER by default.  They will not be used as hints for gene predictors and can only be used to support a gene if there is also protein evidence or a single-exon ab initio prediction at the same location to support it (even then this will only happen if you set single_exon=1 in the control files).
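
To get a list of transcripts affected by this re-stranding, one could compare the strand column between sources in the GFF3. The sketch below assumes tab-separated GFF3 with `expressed_sequence_match` features and a `Name` attribute, as in the example lines of this thread; `strand_conflicts` is a hypothetical helper, and it keeps only the last hit per source for each transcript, which is a simplification.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def strand_conflicts(gff_lines: Iterable[str]) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Map (contig, transcript name) -> {source: strand} where sources disagree."""
    strands: Dict[Tuple[str, str], Dict[str, str]] = defaultdict(dict)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "expressed_sequence_match":
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        name = attrs.get("Name")
        if name:
            strands[(cols[0], name)][cols[1]] = cols[6]
    return {k: v for k, v in strands.items() if len(set(v.values())) > 1}


if __name__ == "__main__":
    demo = [
        "\t".join(["69", "blastn", "expressed_sequence_match", "105546", "106137",
                   "559", "+", ".", "ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3"]),
        "\t".join(["69", "est2genome", "expressed_sequence_match", "105546", "106137",
                   "2909", "-", ".", "ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3"]),
    ]
    for (contig, name), by_source in strand_conflicts(demo).items():
        print(contig, name, by_source)
```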

--Carson


On Mar 20, 2015, at 7:17 AM, Mack, Brian <Brian.Mack at ARS.USDA.GOV <mailto:Brian.Mack at ARS.USDA.GOV>> wrote:

> Hi, I've noticed what seems to be a flipping of the strand of some of my transcripts from est2genome. I assembled my directional RNA-seq reads using Trinity. I've copied an example below. The blastn within MAKER shows the transcript aligning on the positive strand, as does blastn against my genome in SequenceServer. But the est2genome shows the strand to be negative. I've noticed this for quite a few transcripts while examining it in WebApollo. Any ideas what might be causing this?
>  
> Thanks,
> Brian
>  
> Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496
> >contig_69  <http://10.114.143.20:4567/get_sequence/?id=contig_69&db=/home/brian/blastdb/af70_20130423_id-modified.assembly.txt> <> 
> Length=108040
>  
>  Score =  1043 bits (1156),  Expect = 0.0
>  Identities = 589/592 (99%), Gaps = 3/592 (1%)
>  Strand=Plus/Plus
>  
> Query  24      TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  83
>                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct  105546  TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  105605
>  
> Query  84      CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  142
>                |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct  105606  CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  105665
>  
>  
>  
> 69           blastn    expressed_sequence_match         105546  106137  559        +             .               ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3
> 69           blastn    match_part         105546  106137  559        +             .               ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359
> 69           est2genome       expressed_sequence_match         105546  106137  2909      -              .               ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3
> 69           est2genome       match_part         105546  106137  2909      -              .               ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359

From xvazquezc at gmail.com  Sat Mar 21 21:27:27 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Sun, 22 Mar 2015 14:27:27 +1100
Subject: [maker-devel] annotation stats: repeats
Message-ID: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>

Hi all,

I was wondering how can I get data about the repeat content of the genome
from maker if possible, as well as each type of repeats: RE, transposons,
simple repeats, low complexity repeats

Thank you in advance,

Xabier

-- 
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From dence at genetics.utah.edu  Sat Mar 21 23:56:06 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sun, 22 Mar 2015 05:56:06 +0000
Subject: [maker-devel] annotation stats: repeats
In-Reply-To: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>
References: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>
Message-ID: <4FFB960C-CF1A-440F-B767-36EC369D2B58@genetics.utah.edu>

Hi Xabier, MAKER offers several options for repeat masking. There's a file of repetitive element proteins that is run with RepeatRunner, and you can run RepeatMasker with a custom library and/or one of the default RepeatMasker libraries.

The results from these programs are in the GFF3 files that MAKER generates. Depending on the library that you give RepeatMasker, the repeats might be classified by transposable element family and as simple vs. complex repeats. These are probably a good place to start, but you'll have to extract that data from the GFF3 file and compile it.
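
As a rough sketch of that extraction step: the code below sums the bases covered by repeat features per class. It rests on assumptions that should be checked against your own GFF3: that repeat hits appear as `match` features with source `repeatmasker` or `repeatrunner`, and that the `Name` attribute has the form `species:...|genus:...` with URL-style escapes. Overlapping features are counted more than once, so the totals are an upper bound.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import unquote

REPEAT_SOURCES = {"repeatmasker", "repeatrunner"}  # assumed source labels


def repeat_class(name: str) -> str:
    """Pull the 'genus:' part out of a Name attribute, else return the whole name."""
    name = unquote(name)
    for part in name.split("|"):
        if part.startswith("genus:"):
            return part[len("genus:"):]
    return name or "unknown"


def repeat_bp_by_class(gff_lines: Iterable[str]) -> Counter:
    """Sum feature lengths (end - start + 1) per repeat class."""
    totals: Counter = Counter()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] not in REPEAT_SOURCES or cols[2] != "match":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        totals[repeat_class(attrs.get("Name", ""))] += int(cols[4]) - int(cols[3]) + 1
    return totals


if __name__ == "__main__":
    demo = [  # hypothetical features
        "scaffold1\trepeatmasker\tmatch\t101\t200\t500\t+\t.\tID=r1;Name=species:%28CA%29n|genus:Simple_repeat",
        "scaffold1\trepeatmasker\tmatch\t301\t350\t300\t-\t.\tID=r2;Name=species:Gypsy-1|genus:LTR%2FGypsy",
    ]
    for cls, bp in repeat_bp_by_class(demo).most_common():
        print(f"{cls}\t{bp}")
```

Dividing each total by the assembly size would give an approximate percentage per class.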

Let us know whether that helps.

Thanks,
Daniel


On Mar 21, 2015, at 9:27 PM, Xabier Vázquez Campos <xvazquezc at gmail.com<mailto:xvazquezc at gmail.com>> wrote:

Hi all,

I was wondering how can I get data about the repeat content of the genome from maker if possible, as well as each type of repeats: RE, transposons, simple repeats, low complexity repeats

Thank you in advance,

Xabier

--
Xabier Vázquez Campos
PhD Candidate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From panos.ioannidis at gmail.com  Tue Mar 24 02:29:14 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 09:29:14 +0100
Subject: [maker-devel] Augustus retraining
Message-ID: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>

Hello All,

I'm trying to retrain Augustus using EST data from the same species and
realized that quite a few of the gene models I get based on EST data are
incomplete (i.e. no start and/or stop codon).

Now, when I get to the "etraining" step in Augustus retraining (right after
the time-consuming "optimize_augustus.pl" step), I get a warning for each
gene that doesn't contain a start or stop codon.

.....
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1
transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon
does not begin with start codon but with acg
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1
transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon
doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
....

Does anyone know whether training is compromised by such incomplete gene
models? Do you usually exclude them from the training set?

Oh, and by the way, the best guide to retraining Augustus is here
<http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>.
The official
<http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html> web
page isn't bad, but doesn't explain in detail certain things.

Thanks,
Panos

From xvazquezc at gmail.com  Tue Mar 24 06:06:25 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 24 Mar 2015 23:06:25 +1100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
Message-ID: <CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>

Hi Panos,

Have you tried using webAugustus for the (re)training? I found it very
convenient for generating the models for Augustus.

Cheers,

2015-03-24 19:29 GMT+11:00 Panos Ioannidis <panos.ioannidis at gmail.com>:

> Hello All,
>
> I'm trying to retrain Augustus using EST data from the same species and
> realized that quite a few of the gene models I get based on EST data are
> incomplete (i.e. no start and/or stop codon).
>
> Now, when I get to the "etraining" step in Augustus retraining (right
> after the time-consuming "optimize_augustus.pl" step), I get a warning
> for each gene that doesn't contain a start or stop codon.
>
> .....
> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1
> transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon
> does not begin with start codon but with acg
> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1
> transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon
> doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
> ....
>
> Does anyone know whether training is compromised by such incomplete gene
> models? Do you usually exclude them from the training set?
>
> Oh, and by the way, the best guide to retraining Augustus is here
> <http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>.
> The official
> <http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html> web
> page isn't bad, but doesn't explain in detail certain things.
>
> Thanks,
> Panos
>
>


-- 
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From panos.ioannidis at gmail.com  Tue Mar 24 06:24:45 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 13:24:45 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
Message-ID: <CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>

Hi Xabier,

Thanks for your quick reply!

No, I haven't used WebAugustus, but I just checked it out and it looks like
my training set is too big (~300 Mbp), so I can't even upload it!

Anyway, I prefer to train it locally because I have better control over
each step. Also, I have done the entire training procedure with fewer
genes, but didn't get a good gene-level sensitivity (~5%). So now I'm
trying to replicate it using more of my scaffolds, but it appears that I
get a lot more incomplete models from exonerate (run through MAKER).

P


On Tue, Mar 24, 2015 at 1:06 PM, Xabier Vázquez Campos <xvazquezc at gmail.com>
wrote:

> Hi Panos,
>
> Have you tried using webAugustus for the (re)training? I found it very
> convenient for generating the models for Augustus.
>
> Cheers,
>
> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis <panos.ioannidis at gmail.com>:
>
>> Hello All,
>>
>> I'm trying to retrain Augustus using EST data from the same species and
>> realized that quite a few of the gene models I get based on EST data are
>> incomplete (i.e. no start and/or stop codon).
>>
>> Now, when I get to the "etraining" step in Augustus retraining (right
>> after the time-consuming "optimize_augustus.pl" step), I get a warning
>> for each gene that doesn't contain a start or stop codon.
>>
>> .....
>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1
>> transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon
>> does not begin with start codon but with acg
>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1
>> transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon
>> doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
>> ....
>>
>> Does anyone know whether training is compromised by such incomplete gene
>> models? Do you usually exclude them from the training set?
>>
>> Oh, and by the way, the best guide to retraining Augustus is here
>> <http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>.
>> The official
>> <http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html> web
>> page isn't bad, but doesn't explain in detail certain things.
>>
>> Thanks,
>> Panos
>>
>>
>
>
> --
> Xabier V?zquez Campos
> *PhD Candidate*
> Water Research Centre
> School of Civil and Environmental Engineering
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>

From carsonhh at gmail.com  Tue Mar 24 08:14:51 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:14:51 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
Message-ID: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

Hi Panos,

ESTs and mRNA-seq assemblies will by their nature be partial. After a first round of training, you can run MAKER with protein and EST evidence plus the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the resulting models will be better than what Augustus would produce running on its own. Then take these gene models and use them to retrain Augustus. This is the standard bootstrap retraining procedure, and it can be repeated as needed.
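
As a rough illustration of the loop described above, the following Python sketch only lists the steps of each round; it does not run any tool. The species names (`myspecies_round0` and so on) are hypothetical, and `augustus_species` is assumed to be the MAKER control-file option that selects the Augustus parameter set. Exact command lines are left out on purpose, since they depend on the local installation.

```python
def bootstrap_plan(species, rounds=2):
    """List the steps of an iterative (bootstrap) Augustus retraining run."""
    steps = []
    # Hypothetical name for the parameters produced by the first, EST-only training.
    current = f"{species}_round0"
    for n in range(1, rounds + 1):
        nxt = f"{species}_round{n}"
        steps += [
            f"[round {n}] run MAKER with EST + protein evidence, augustus_species={current}",
            f"[round {n}] export the MAKER gene models as GFF3",
            f"[round {n}] convert GFF3 -> ZFF (MAKER's SNAP converter) -> GenBank (zff2augustus_gbk.pl)",
            f"[round {n}] retrain Augustus on the GenBank file as species {nxt}",
        ]
        # The freshly trained parameters feed the next MAKER run.
        current = nxt
    return steps


plan = bootstrap_plan("myspecies", rounds=2)
for step in plan:
    print(step)
```

Each round consumes the parameters produced by the previous one, which is the only point of the sketch: the models used for retraining come from MAKER's evidence-guided run, not from Augustus alone.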

More info on bootstrap training here (info is for SNAP but procedure is similar to Augustus) -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
Here is an excellent explanation of Augustus training -> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
and here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training so just take that and convert it for Augustus) -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Finally, you can also manually edit the GFF3 file in Apollo (the legacy standalone version is easier to use), and then convert that file for bootstrap training.
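
One way to check whether each bootstrap round actually reduces the number of partial models is to tally the etraining warnings. The sketch below is a minimal example in Python; the regular expression is modelled only on the two warning lines quoted in this thread, so other etraining messages are not covered, and the function and variable names are made up for illustration.

```python
import re
from collections import Counter

# Pattern based on the two warnings quoted in this thread (an assumption about
# the general log format, not a full parser for etraining output).
WARNING = re.compile(
    r"gene (?P<gene>\S+)\s+transcr\. \d+ in sequence (?P<seq>\S+): "
    r"(?P<kind>Initial exon does not begin with start codon|"
    r"Terminal exon doesn't end in stop codon)"
)


def tally_incomplete(log_text):
    """Return (Counter of warning kinds, set of affected gene IDs)."""
    # Warnings may be wrapped across lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    kinds = Counter()
    genes = set()
    for match in WARNING.finditer(flat):
        kind = "no_start" if match.group("kind").startswith("Initial") else "no_stop"
        kinds[kind] += 1
        genes.add(match.group("gene"))
    return kinds, genes


sample = """
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1
transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon
does not begin with start codon but with acg
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1
transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon
doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
"""
kinds, genes = tally_incomplete(sample)
print(dict(kinds), len(genes))
```

Running the same tally on the log of every round gives a simple count to compare, and the set of gene IDs can be used to drop partial models from later rounds if that turns out to help.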

--Carson




From panos.ioannidis at gmail.com  Tue Mar 24 08:31:04 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 15:31:04 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID: <CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>

Hi Carson,

So you think it's okay to include incomplete gene models when training
Augustus?

I'll certainly try the bootstrap method you're suggesting. Even though I
did it for SNAP, for some weird reason I forgot it for Augustus :p Do you
think, however, that I can get a big improvement in gene-level sensitivity?
Currently, I have only 6%...

Thanks,
Panos



From carsonhh at gmail.com  Tue Mar 24 08:39:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:39:20 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
Message-ID: <C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>

On your first round it is fine. It gives the predictor enough to work with, and on the second round you use improved models. When you say 6% sensitivity, is that Augustus running on its own? If it's inside of MAKER, that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also, is that the gene-level, exon-level, or nucleotide-level sensitivity? If you are looking at the gene-level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (and those models may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide-level sensitivity/specificity metrics. The gene- and exon-level metrics are basically meaningless (unless it's Drosophila, which is the only species annotated correctly enough to use them).
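
To make the difference between the three levels concrete, here is a small Python sketch under simplifying assumptions: one transcript per gene, a single sequence and strand, 1-based inclusive exon coordinates, and "specificity" taken in the gene-prediction sense of TP / (TP + FP). The coordinates are invented for illustration and this is not how Augustus computes its own report.

```python
def bases(genes):
    """Set of all positions covered by any exon (1-based, inclusive)."""
    return {p for gene in genes for (start, end) in gene for p in range(start, end + 1)}


def sn_sp(pred_items, ref_items):
    """Sensitivity = TP / |reference|, specificity = TP / |prediction|."""
    tp = len(pred_items & ref_items)
    return tp / len(ref_items), tp / len(pred_items)


def evaluate(pred, ref):
    """Compare predicted and reference gene models at three levels."""
    exons = lambda genes: {exon for gene in genes for exon in gene}
    whole = lambda genes: {tuple(gene) for gene in genes}
    return {
        "nucleotide": sn_sp(bases(pred), bases(ref)),
        "exon": sn_sp(exons(pred), exons(ref)),   # exon counts only if both ends match
        "gene": sn_sp(whole(pred), whole(ref)),   # gene counts only if every exon matches
    }


# Hypothetical example: the prediction misses the last 10 bases of the final exon.
ref = [[(100, 200), (300, 400), (500, 600)]]
pred = [[(100, 200), (300, 400), (500, 590)]]
result = evaluate(pred, ref)
for level, (sn, sp) in result.items():
    print(f"{level}: sensitivity={sn:.3f} specificity={sp:.3f}")
```

A single exon boundary that is off by a few bases barely changes the nucleotide-level numbers, lowers the exon-level ones, and drops the gene-level score for that gene to zero.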

--Carson




From panos.ioannidis at gmail.com  Tue Mar 24 09:05:54 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 16:05:54 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
	<C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
Message-ID: <CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>

Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level
is 88%. I only mentioned gene-level because that's the only metric
mentioned on the Augustus web site.

I got these numbers outside of Maker. Actually, I only used Maker to
generate the gff files needed to start the training (ran it using only EST
evidence and only on a subset of my assembly, using this
<http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>
as a guide).

Now, I've started running the second round of training, as you suggested.
However, since I don't have data from closely related species, I'm only
using UniRef50 as protein evidence.

P


From carsonhh at gmail.com  Tue Mar 24 09:38:08 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 09:38:08 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
	<C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
	<CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>
Message-ID: <4EA5A1F6-2950-4D65-A59C-0F3848C86C02@gmail.com>

I'd pick a couple of species that are as closely related as you can find. Proteins will still align over large evolutionary distances, but you need a breadth of protein data that UniRef and even databases like Swiss-Prot won't have (those databases are usually a little too conservative).

The gene level sensitivity/specificity values make certain assumptions about the models you are comparing with.  Unfortunately the only species these assumptions hold true for is Drosophila and maybe C. elegans to a point.  This is the reason you will rarely see species other than those two used for the gene and exon sensitivity/specificity metrics.
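
A back-of-the-envelope calculation shows why gene-level numbers fall so quickly. The Python sketch below assumes, purely for illustration, that every exon of a gene is recovered independently with the same probability and that a gene only counts when all of its exons are exact; real predictors do not behave this simply. The 62% figure is the exon-level sensitivity reported earlier in this thread.

```python
def expected_gene_level(exon_sensitivity, n_exons):
    """Chance that all exons of a gene are exact, assuming independent exons."""
    return exon_sensitivity ** n_exons


# Exon-level sensitivity of 62%, as reported in this thread.
for n_exons in (1, 3, 6, 10):
    print(n_exons, round(expected_gene_level(0.62, n_exons), 3))
```

Under these assumptions a six-exon gene is fully correct only about 6% of the time, so a single-digit gene-level sensitivity can sit alongside much higher exon- and nucleotide-level values.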

Thanks,
Carson


> On Mar 24, 2015, at 9:05 AM, Panos Ioannidis <panos.ioannidis at gmail.com> wrote:
> 
> Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level is 88%. I only mentioned gene-level, because that's the only metric mentioned in the Augustus web site.
> 
> I got these numbers outside of Maker. Actually, I only used Maker to generate the gff files needed to start the training (ran it using only EST evidence and only on a subset of my assembly, using this <http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html> as a guide).
> 
> Now, I've started running the second round of training, as you suggested. Since, however, I don't have data from closely related species, I'm only using Uniref50 as protein evidence.
> 
> P
> 
> On Tue, Mar 24, 2015 at 3:39 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> On your first round it is fine.  It gives the predictor enough to work with, then on the second round you use improved models. When you say 6% sensitivity is that Augustus running on it?s own?  If it?s inside of MAKER that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also is that the gene level, exon level, or nucleotide level sensitivity.  If you are looking at the gene level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide level sensitivity/specificity metrics.  The gene and exon level metrics are basically meaningless (unless it?s Drosophila which is the only species annotated correctly enough to use them).
> 
> ?Carson
> 
> 
>> On Mar 24, 2015, at 8:31 AM, Panos Ioannidis <panos.ioannidis at gmail.com <mailto:panos.ioannidis at gmail.com>> wrote:
>> 
>> Hi Carson,
>> 
>> So you think it's okay to include incomplete gene models when training Augustus?
>> 
>> I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%...
>> 
>> Thanks,
>> Panos
>> 
>> 
>> On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>> Hi Panos,
>> 
>> EST?s and mRNA-seq assemblies will bey their nature be partial.  After a first round of training you can run MAKER together with protein and EST evidence and the newly trained Augustus species file.  Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what it would get from just running Augustus on it?s own.  Then take these gene models and use them to retrain Augustus.  This is the standard bootstrap retraining procedure, and can be repeated as needed.
>> 
>> More info on bootstrap training here (info is for SNAP but procedure is similar to Augustus) ?>  http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors <http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors>
>> Here is an excellent explanation of Augustus training ?> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html <http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html>
>> and here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training so just take that and convert it for Augustus)?> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl <https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl>
>> 
>> Finally you can also manually edit the GFF3 file in Apollo (easier to use the legacy stand alone version), and then convert that file for bootstrap training.
>> 
>> ?Carson
>> 
>> 
>>> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis <panos.ioannidis at gmail.com <mailto:panos.ioannidis at gmail.com>> wrote:
>>> 
>>> Hi Xabier,
>>> 
>>> Thanks for your quick reply!
>>> 
>>> No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!
>>> 
>>> Anyway, I prefer to train it locally because I have better control over each step. Also, I have done the entire training procedure with fewer genes, but didn't get a good gene-level sensitivity (~5%). So now I'm trying to replicate it using more of my scaffolds, but it appears that I get a lot more incomplete models from exonerate (run through Maker).
>>> 
>>> P
>>> 
>>> 
>>> 
>>> On Tue, Mar 24, 2015 at 1:06 PM, Xabier Vázquez Campos <xvazquezc at gmail.com> wrote:
>>> Hi Panos,
>>> 
>>> Have you tried using webAugustus for the (re)training? I found it very convenient for generating the models for Augustus.
>>> 
>>> Cheers,
>>> 
>>> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis <panos.ioannidis at gmail.com>:
>>> Hello All,
>>> 
>>> I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon).
>>> 
>>> Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl" step), I get a warning for each gene that doesn't contain a start or stop codon.
>>> 
>>> .....
>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg
>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
>>> ....
>>> 
>>> Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set?
>>> 
>>> Oh, and by the way, the best guide to retraining Augustus is here <http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>. The official <http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html> web page isn't bad, but doesn't explain in detail certain things.
>>> 
>>> Thanks,
>>> Panos
>>> 
>>> 
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Xabier Vázquez Campos
>>> PhD Candidate
>>> Water Research Centre
>>> School of Civil and Environmental Engineering
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>>> 
>> 
>> 
> 
> 

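The warnings quoted above concern gene models whose initial exon does not begin with a start codon or whose terminal exon does not end in a stop codon. As a minimal sketch only (not part of Augustus or MAKER), the following Python separates complete from incomplete coding sequences so the two groups can be inspected or handled separately before training. The function names, the toy gene names, and the assumption that each CDS is already available as a spliced, forward-strand string are all hypothetical; the `stop_included` flag simply mirrors the question of whether the stop codon is part of the CDS, as hinted at by the `stopCodonExcludedFromCDS` warning. Whether incomplete models should actually be excluded is not settled in this thread.

```python
# Standard stop codons; assumes the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_cds(cds, stop_included=True):
    """Return the set of problems found in a spliced CDS (empty set = complete)."""
    seq = cds.upper()
    problems = set()
    if len(seq) % 3 != 0:
        problems.add("length_not_multiple_of_3")
    if not seq.startswith("ATG"):
        problems.add("no_start_codon")
    # Only check for a stop codon if the CDS is expected to contain it.
    if stop_included and seq[-3:] not in STOP_CODONS:
        problems.add("no_stop_codon")
    return problems


def split_training_set(models):
    """Split {name: cds} into (complete, incomplete) dictionaries."""
    complete, incomplete = {}, {}
    for name, cds in models.items():
        target = incomplete if classify_cds(cds) else complete
        target[name] = cds
    return complete, incomplete


if __name__ == "__main__":
    # Toy sequences with hypothetical names, for illustration only.
    toy = {
        "gene-a": "ACGGCCTAA",  # starts with acg, not a start codon
        "gene-b": "ATGGCCGCC",  # no stop codon at the end
        "gene-c": "ATGGCCTGA",  # complete
    }
    complete, incomplete = split_training_set(toy)
    print("complete:", sorted(complete))
    print("incomplete:", sorted(incomplete))
```

Keeping the incomplete set as a separate dictionary, rather than discarding it, makes it easy to count how much of the training set is affected before deciding what to do with it.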

From alicebdennis at gmail.com  Thu Mar 26 04:34:26 2015
From: alicebdennis at gmail.com (Alice Dennis)
Date: Thu, 26 Mar 2015 11:34:26 +0100
Subject: [maker-devel] iterative Maker2
In-Reply-To: <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch>
	<1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID: <CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>

Hello again,

I posted a while ago about a genome I'm running through the Maker2
pipeline. I was concerned because my results were still changing with
3 and 4 iterations.

Following the very useful advice of Carson (below), I've made a few
modifications (adding a RepeatModeler run, using a big protein
database), but my gene predictions are still changing between the 3rd
and 4th iterations. Perhaps this is ok, but these increasing gene
lengths make me worry that I haven't built stable models.

Here is the short version of what I've done.
1. Run RepeatModeler, but this only produced 47 sequences in the
resulting .fasta... so that seemed a bit small.

2. Run Maker2 using:
- RepeatModeler output + "model_org=all" and "softmask=1" in the
Repeat Masking section.
- protein evidence from 2 distantly related species AND all of Uniprot
- ests from a different strain of my species (a parasitoid wasp)
- the .hmm from Nasonia, one of the 2 distantly related species whose
proteome I also provided as protein evidence
- my assembled genome of 1,509 scaffolds.

3. After this, I did three subsequent rounds of Maker2 (cleverly named
Rounds 2, 3 and 4). Each one used the same input, except the Nasonia
.hmm was replaced by a SNAP generated .hmm from the previous round.
Also, the est2genome and protein2genome was changed from 1 to 0 in all
runs after the first.

Here are some results:
Round1: 14,647 genes, average length 2,491
Round2: 12,158 genes, average length 3,760
Round3: 13,515 genes, average length 3,090
Round4: 12,169 genes, average length 3,918

This is a bit confusing because the number of genes predicted goes up
and down, as does their lengths. I've doubly checked the dates of my
files, and they are all labeled such that I don't think anything could
be swapped.

So my questions are:
Is this an indication that my models are unstable and I shouldn't
trust these predictions?
Is the decreasing number of genes, while also getting longer perhaps a
good thing?
How do I know when to stop if genes keep getting longer?


Thanks very much,
Alice

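For comparing rounds like the ones listed above, a minimal sketch of how gene count and average gene length could be tallied from a GFF3 file. It assumes that final annotations appear as features of type "gene" in column 3 and that "length" means the genomic span (end - start + 1); both are assumptions, since the message does not say how the averages were computed. The function name and the example file name are hypothetical.

```python
def gene_stats(gff3_lines):
    """Return (gene_count, mean_gene_span) from an iterable of GFF3 lines."""
    spans = []
    for line in gff3_lines:
        # An embedded FASTA section, if present, ends the feature table.
        if line.startswith("##FASTA"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # GFF3 has nine tab-separated columns; column 3 is the feature type.
        if len(cols) < 9 or cols[2] != "gene":
            continue
        start, end = int(cols[3]), int(cols[4])
        spans.append(end - start + 1)  # GFF3 coordinates are 1-based, inclusive
    count = len(spans)
    mean = sum(spans) / count if count else 0.0
    return count, mean


if __name__ == "__main__":
    import sys

    # Usage (file name is a placeholder): python gene_stats.py round2.all.gff
    with open(sys.argv[1]) as handle:
        n_genes, mean_span = gene_stats(handle)
    print(f"{n_genes} genes, average length {mean_span:.0f}")
```

Running the same script on each round's output keeps the comparison consistent, which helps rule out differences that come only from how the numbers were counted.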

On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt <carsonhh at gmail.com> wrote:
> The gene models are actually produced by SNAP, Augustus, or whatever gene
> predictor you are using, so if you change the HMM every round, then the
> models will change too.  But I have one concern.  You are using a very
> sparse protein evidence dataset.  The protein dataset is very important to
> MAKER's performance, and for iterative training of the ab initio
> predictors.  Normally after the second iteration, additional training should
> not be beneficial, but if you are getting wildly different results on 3rd
> and 4th round, then you probably aren't getting sufficient good models to
> train with.
>
> For a protein dataset you should be using the entire proteome from a
> minimum of two related species and perhaps all of UniProt/Swiss-Prot to get
> a broad protein database.  Don't use the proteins extracted by CEGMA and
> HaMSTr.  CEGMA can be used to guide the first HMM creation (cegma2zff script
> that comes with MAKER), but don't give the proteins to MAKER as evidence,
> also the HaMSTr results will be redundant with the ESTs.  You need proteins
> from related species to look for homology not found in the EST dataset.
>
> Also repeat masking is important for any genome and has a huge effect on ab
> initio predictor performance.  Make sure you run something like
> RepeatModeler to look for species specific repeats that will not already be
> in RepBase.  Then add those results to the rmlib= option in the maker
> control files.
>
> Thanks,
> Carson
>
>
>
>
> On Dec 12, 2014, at 7:10 AM, Dennis, Alice <Alice.Dennis at eawag.ch> wrote:
>
> Hi all,
>
> I am a relatively new user to Maker2, and I'm looking for advice on running
> many iterations of the same dataset in Maker2.
>
> I have a relatively small genome (~124 MB) from a wasp that is assembled
> into ~1,500 scaffolds. I have run several iterations of Maker2 by
> re-generating .hmms in SNAP and feeding them into the next round, and my
> gene predictions keep increasing (in number and in size).  The only thing
> that changes at each round is the .hmm.
> The evidence that I give is:
> -          de novo assembled ESTs from a different strain of the same
> species (70,000 contigs; I am currently working on improving this assembly
> with the hope that this will be helpful here)
> -          610 proteins extracted from the genome scaffolds using CEGMA and
> HaMSTr
>
> For my 1st iteration, I used the Nasonia .hmm from SNAP, and the
> est2genome/protein2genome option.
>
> For the 2nd, 3rd and 4th rounds I have used .hmms generated from the
> previous round, all without the est2genome/protein2genome option. All other
> files are the same as in the original run.
>
> As I understand it, after the second round, nothing should change in Maker2.
> But the differences are obvious between runs. Some entirely new exons are
> annotated. For example,  just counting "exon" in the .gff file gives me
> 73,000 after the third iteration and 96,000 after the fourth! Actually the
> biggest leap in this number is between the third and fourth round. I can
> also see that many features are longer when I look at the files in Geneious.
>
> Is this sort of change possible after the second round of Maker2? Is there
> something I have done wrong in my runs, or am I understanding this output
> incorrectly?
>
> Thank you,
> Alice
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>


-- 


Alice Dennis
alicebdennis at gmail.com

Postdoctoral Researcher
Institute for Integrative Biology, ETH Zürich & EAWAG
Überlandstrasse 133
P.O. Box 611
8600 Dübendorf, Switzerland

https://adennis5.wordpress.com/


From michael.s.campbell1 at gmail.com  Thu Mar 26 09:50:41 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 26 Mar 2015 09:50:41 -0600
Subject: [maker-devel] iterative Maker2
In-Reply-To: <CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch>
	<1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
	<CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>
Message-ID: <CAAi6vWXnyyFkTVD9tc-QGxSBCBenTy5QyTM6ReVqDveXQA0FTg@mail.gmail.com>

Hi Alice,

In my experience, fewer and longer genes are generally a good thing (and very
normal), resulting from the merging of split models and the extension of
incomplete models. I find it helpful to load the annotations and evidence
into a browser to get a visual idea of what is happening.

Mike



-- 
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From rens.holmer at wur.nl  Mon Mar 30 00:12:20 2015
From: rens.holmer at wur.nl (Holmer, Rens)
Date: Mon, 30 Mar 2015 06:12:20 +0000
Subject: [maker-devel] Incorporating cufflinks in maker
Message-ID: <42E0168F-C672-4B7F-97D4-98442B825BF9@wur.nl>

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. Currently I see two options:

Provide the cufflinks output as EST-gff
Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,

Rens Holmer


From goutham.atla at gmail.com  Fri Mar 27 23:37:08 2015
From: goutham.atla at gmail.com (Goutham atla)
Date: Sat, 28 Mar 2015 11:07:08 +0530
Subject: [maker-devel] Annotating Cufflinks GTF with Maker
Message-ID: <CALU8LA4CwLD8qm5f==xKSjZoCw+9Ajd=RCD62LkHTdBYbuajig@mail.gmail.com>

Dear All,

I have a draft genome for my organism of interest and around 150G of
100bp paired-end RNA-Seq data from different conditions. This organism has
Ensembl annotations, but very few.

My goal is to analyze differential splicing between two
conditions. For this I need good annotations in GTF format at the isoform
level. I am interested in using the Splicing Analysis Kit
<http://cbcb.umd.edu/software/spanki/>

For now, I have aligned one sample to the genome using tophat2 and then used
cufflinks to generate a de novo GTF file. In either case I have not used
the available GTF with very few annotations.

The GTF file generated by cufflinks should be annotated to know the
function of each transcript. So I am interested in adding annotations to
the GTF file generated by cufflinks. What is the best way of doing it?

Or is there a better way of getting a GTF file, like that of Ensembl,
from my data?

I have looked at Trinotate, but it's more about functional annotation and
expression studies.


Regards,

-- 
Goutham Atla

From avhoeck at SCKCEN.BE  Mon Mar 30 10:11:16 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Mon, 30 Mar 2015 16:11:16 +0000
Subject: [maker-devel] comments on Incorporating cufflinks in maker
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF11826E1A@MAILSRV3.sck.be>

Dear Rens and Carson,
I would like to comment on Rens' question about adding transcript data to an annotation. Since I have no access to Google Groups, I am trying to contact you both by mail with, hopefully, the correct addresses.

I have tested MAKER-P with multiple parameters and optimized two gene prediction tools (SNAP and Augustus). Both give similar results, with MAKER finding around 17,000 gene annotations. I also have RNA-seq samples, and as a test case I processed Cufflinks output with TransDecoder to select gene annotations. With this approach I end up with 25,000 gene annotations. More or less all the genes that MAKER selected were also selected by the TransDecoder approach. So is it wise to include the information on the 8,000 missing genes, based on optimized gene models? There will probably be some false positives among these 8,000 missing genes, but with good criteria and models, maybe MAKER could find more gene annotations.

Best regards
Arne

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. Currently I see two options:

Provide the cufflinks output as EST-gff
Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,

Rens Holmer



From kai.kamm at ecolevol.de  Thu Mar  5 09:47:02 2015
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Thu, 05 Mar 2015 17:47:02 +0100
Subject: [maker-devel] Better resolve conflicting gene models
Message-ID: <54F88886.9010004@ecolevol.de>

Hello, thanks for your previous advice.

(Btw, how can one reply to an existing thread such that the reply will 
be added to the same thread?)


I am trying to find the best parameters with Maker for the annotation of 
my genome. I have run Maker with several combinations of parameters and 
predictors on my three biggest scaffolds and looked at the results in 
Jbrowse. Overall most predictions seem fine, but there are some genes 
with conflicts and I have no idea why.

I have:

- 100Mb assembled genome
- Trinity RNAseq assembly
- cufflinks data (in my case it doesn't seem to be as messy as suggested, rather
a good complement to the trinity data)
- protein evidence (related and unrelated species)
- repeat library from repeat modeler


Gene predictors used:

- Augustus trained with transcripts from related species: seems to 
perform fine

- SNAP: no convergence with Augustus even after second training. Dropped 
it because it predicted lots of additional low quality transcripts and 
sometimes disrupted final Maker transcripts.

- Genemark: converged with Augustus after training (introns received 
from TopHat2 output). Tends to predict some additional transcripts 
(compared to Augustus). Few (but some) of these are covered by evidence 
and thus become final Maker transcripts.


So the combination of Augustus and Genemark seems optimal. In general 
both perform well in Maker and tend to predict the same transcripts.

However, I still observe some problems in the behavior of Maker which I 
don't understand:

Example 1: One of the predictors predicts a small additional exon at the
start which is also covered by protein or EST data. But sometimes Maker
chooses the other predictor's model for the final transcript. Mostly
these are minor differences, but I don't understand this behavior.

Example 2: there are some extreme cases like an Augustus prediction with 
17 exons which are all covered by Trinity and cufflinks isoforms. 
Genemark instead predicts two separate small genes with 2 and 4 exons 
respectively. The resulting final transcript has 7 exons and the 
additional evidence from the trinity and cufflinks data is treated as UTR.


So I thought Augustus seemed a little more accurate and ran Maker only
with Augustus to resolve such conflicts, even though I would lose the
few additional transcripts from Genemark.

This is what happened:

- The gene in Example 2 now has all the 17 exons. This is good!

- Sadly another gene with several exons, which was formerly predicted by 
both Augustus and Genemark and is also covered by cufflinks and trinity 
transcripts, now consists only of two small exons in the final 
transcript. Even though Augustus still predicts the same exons and the 
same evidence is present - only the Genemark prediction is absent which 
was almost identical to Augustus. This I completely don't understand.

I don't worry about the minor differences. The extreme cases are like 
two genes in a hundred and I don't understand the behavior. I was 
thinking that in case of conflicting models Maker will choose the one 
that best fits the evidence. Obviously with most conflicts this is what 
happens, because the majority of the final models look OK. But not the 
above mentioned cases and I don't understand why?

Is there any parameter I missed to better resolve such conflicts?

Best


From bmoore at genetics.utah.edu  Thu Mar  5 17:20:52 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Fri, 6 Mar 2015 00:20:52 +0000
Subject: [maker-devel] Maker Software Question
In-Reply-To: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
References: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
Message-ID: <E682D1C1-B792-498E-88C9-D9349E9548C8@genetics.utah.edu>

Hi Chris,

I don't know if others have responded to you already, but I just found this e-mail unanswered in the depths of my e-mail inbox, so apologies for no reply.

I'm going to forward your e-mail along to the full maker mailing list so that you'll get the benefit of response from the full group of MAKER developers.

MAKER does include a couple of scripts that add functional annotation to MAKER GFF3 output.  This process is described in the recent paper:

Campbell MS, Holt C, Moore B, Yandell M. Genome Annotation and Curation Using
MAKER and MAKER-P. Curr Protoc Bioinformatics.

http://www.ncbi.nlm.nih.gov/pubmed/25501943

Mike do you have a PDF of the final print version of that you could send directly to Christopher?

B

On Jan 16, 2015, at 8:38 AM, Seabury, Christopher <CSeabury at cvm.tamu.edu> wrote:

Dear Colleagues,

I would like to quickly ask about a specific routine/possible function in MAKER.
Previously, we have essentially made home-made versions of MAKER by way of
multi-step programming.   At present we are exploring MAKER but are wondering
if MAKER has the ability to populate the GFF with gene/protein ID information.
As an initial experiment, we gave MAKER cDNA RefSeqs, plus corresponding protein RefSeqs,
and a reference, but do not see the gene/protein ID in the GFF.  Is there a subroutine
for this, or an option we have missed?


Thanks and Kind Regards,


Christopher M. Seabury PhD
Associate Professor
Department of Veterinary Pathobiology
College of Veterinary Medicine
Texas A&M University
College Station, TX 77843-4467
cseabury at cvm.tamu.edu
Mobile: 979-492-6400


From bmoore at genetics.utah.edu  Mon Mar  9 12:12:10 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Mon, 9 Mar 2015 18:12:10 +0000
Subject: [maker-devel] Does the maker google forum works? -[Doubt]
	maker2zff line 109
In-Reply-To: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
References: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
Message-ID: <13826169-DFBB-4E32-AFB1-A01CB3EEE0A4@genetics.utah.edu>

Hi Javier,

The Google forum is only a mirror of the actual mailing list that allows it to be archived by Google (better Google search results for MAKER users), so posting is not allowed there.  Please join the official MAKER mailing list at:

http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Thanks,

B

On Mar 9, 2015, at 12:07 PM, FRANCISCO JAVIER SANCHEZ GARCIA <javiersg at um.es> wrote:


Good morning. I am having trouble posting messages in the MAKER help forum on Google Groups. I don't know whether the problem is on my side or whether it is a permissions issue, but I can't see the red button for new threads. Anyway, I will try to describe my problem with maker2zff, which does not work. My version is v2.31.8.

../maker/marker_v2.31.8/maker/bin/maker2zff   ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, <GFF> line 1922870.

I read something about problematic characters in the ID, but I don't know whether that applies to my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html


Regards.
Thanks in advance.


Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo 30100
Murcia (España-Spain)
      +34 868 888 031 (laboratory-work phone)

http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014


From javiersg at um.es  Mon Mar  9 16:27:00 2015
From: javiersg at um.es (FRANCISCO JAVIER SANCHEZ GARCIA)
Date: Mon, 09 Mar 2015 23:27:00 +0100
Subject: [maker-devel] [Doubt] maker2zff line 109
Message-ID: <20150309232700.Horde.39Aj_iELfhGei1Bv5JG1fw6@webmail.um.es>

Good evening everyone. I will try to describe my problem with maker2zff, which
does not work. My version is v2.31.8. The GFF line number reported in the
error message is the last line of the GFF file.

../maker/marker_v2.31.8/maker/bin/maker2zff ../sequences.all.gff
everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff
line 109, <GFF> line 1922870.

I read something about problematic characters in the ID, but I don't
know whether that applies to my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html

Regards.
Thanks in advance.

Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo 30100
Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
       +34 868 888 031 (laboratory-work phone)

http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014

From carsonhh at gmail.com  Thu Mar 12 13:50:44 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 13:50:44 -0600
Subject: [maker-devel] Better resolve conflicting gene models
In-Reply-To: <54F88886.9010004@ecolevol.de>
References: <54F88886.9010004@ecolevol.de>
Message-ID: <7B8D93E6-DEE7-43D7-BFBF-38E474311A6C@gmail.com>

Sorry for the slow reply.


> how can one reply to an existing thread such that the reply will be added to the same thread?

Threads show up under the Google archive, but that is really just an artifact of how things are archived. The mailing list itself doesn't have threads, just a subject line like any other e-mail. But because we are archiving to a message board, any messages with the same subject get turned into parts of the same thread.


> Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictors model for the final transcript. Mostly these are minor differences but I don't understand this behavior?

The gene chosen by MAKER is the one that best matches the evidence.  The match is measured by a numeric value called AED (lower means a better match).  If a model predicts an exon or a base pair that is not supported by an evidence alignment, then it will be penalized.  If a model fails to predict a base pair that is supported by evidence, then it will also be penalized.  The differences in your results arise because you changed something between runs: either you added or removed a predictor (which gives MAKER a different set of models to choose from) or the evidence changed (which gives a different AED score).  Whichever model has the lowest AED score will always be kept, so some change you made resulted either in a different set of available models to choose from or in a different AED score.

Also be aware that models cannot overlap on the same strand, so if you made a change that results in a model being removed from consideration, you may have removed overlap that was precluding another model with a slightly higher AED from being chosen.
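
To make the scoring concrete, here is a minimal Python sketch of AED at the base-pair level, assuming the commonly published definition AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the model and SP is the fraction of model bases covered by evidence. The helper names and coordinates are made up for illustration; MAKER's own calculation works on its evidence alignments and is not reproduced here.

```python
def bases(exons):
    """Expand inclusive (start, end) exon coordinates into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def aed(model, evidence):
    """Base-level annotation edit distance (assumed: 1 - (SN + SP) / 2)."""
    if not model or not evidence:
        return 1.0  # nothing to compare: treat as completely unsupported
    overlap = len(model & evidence)
    sensitivity = overlap / len(evidence)  # evidence bases the model recovers
    specificity = overlap / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical locus: evidence supports two 100 bp exons.
evidence = bases([(100, 199), (300, 399)])
exact_model = bases([(100, 199), (300, 399)])
missing_exon = bases([(300, 399)])                        # skips a supported exon
extra_exon = bases([(100, 199), (300, 399), (500, 599)])  # adds an unsupported exon

for name, model in [("exact", exact_model),
                    ("missing exon", missing_exon),
                    ("extra exon", extra_exon)]:
    print(f"{name}: AED = {aed(model, evidence):.3f}")
```

In this toy case the model that skips a supported exon (0.25) scores worse than the one that adds an unsupported exon (about 0.17), and both lose to the exact match (0.0).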

> 
> Example 2: there are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR.
> 
> - Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript. 
> Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent which was almost identical to Augustus. This I completely don't understand.

The model chosen will always be the one with the lowest AED.  The changes you describe are all based off of a model with lower AED supplanting another, which may have opened up a region that was previously overlapped by a longer gene with a poor AED score.

I would also recommend not including cufflinks output if you have trinity data.  Cufflinks results often falsely bridge regions and will cause gene mergers by falsely extending evidence clusters. This makes it appear as if genes should extend into regions where they shouldn't.  Given those hints, predictors will then try and predict at least something in those regions, and if they do the models will get a good AED score since they are supported by evidence even if it is false evidence.

--Carson


From carsonhh at gmail.com  Thu Mar 12 14:03:11 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 14:03:11 -0600
Subject: [maker-devel] maker-devel post from avhoeck@sckcen.be requires
	approval
In-Reply-To: <mailman.5.1426178330.13228.maker-devel_yandell-lab.org@box290.bluehost.com>
References: <mailman.5.1426178330.13228.maker-devel_yandell-lab.org@box290.bluehost.com>
Message-ID: <73D91049-1CB8-4C59-A9E5-58374FCB0789@gmail.com>

Hi Arne,

The genes found by CEGMA are very short, so while those genes may be identifiable, at least partially, on shorter contigs, other genes will be far longer.  So the 87% value you get from CEGMA is probably about what you will be able to find (even partially) across the entire genome. Remember that gene size, once you include the introns, can actually be quite large, and gene predictors need flanking signals from the sequence several hundred bp upstream and downstream to make a prediction, so a gene that is only partially contained in a contig is un-annotatable for ab initio predictors. Unless the organism has a high gene density or very few introns, you usually won't be able to find a gene on contigs smaller than ~10kb.
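
To see how much of an assembly sits in contigs above that ~10kb threshold, a small Python sketch such as the following can help. It is only an illustration: the function names and the contig lengths are invented, and N50 is computed with the usual definition (the length at which the longest contigs together reach half of the total assembly size).

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def fraction_in_long_contigs(lengths, min_len=10_000):
    """Fraction of assembled bases found in contigs of at least min_len bp."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= min_len) / total


# Hypothetical contig lengths in bp.
contig_lengths = [50_000, 30_000, 20_000, 8_000, 5_000, 2_000]
print("N50:", n50(contig_lengths))
print("fraction in contigs >= 10 kb:",
      round(fraction_in_long_contigs(contig_lengths), 3))
```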

--Carson


> On Mar 12, 2015, at 10:38 AM
> 
> From: Van Hoeck Arne <avhoeck at SCKCEN.BE>
> To: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
> Subject: TACC lonestar and N50 value
> Date: March 12, 2015 at 10:38:42 AM MDT
> 
> 
> Dear MAKER developer,
> 
> We have a plant genome of about 450 Mbp with an N50 of 20 kbp, of which only about three quarters (333 Mbp) is in contigs longer than 10 kbp. CEGMA reported that 87% of the genes were found, whereas 94% were partially identified.  You said last time that contigs smaller than 10 kbp are not ideal for annotation and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)
> 
> Besides, I was able to run MAKER successfully via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh since permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?
> 
> Best regards
> Arne
> 


From avhoeck at SCKCEN.BE  Thu Mar 12 10:38:42 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 12 Mar 2015 16:38:42 +0000
Subject: [maker-devel] TACC lonestar and N50 value
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 of 20 kbp, of which only about three quarters (333 Mbp) is in contigs longer than 10 kbp. CEGMA reported that 87% of the genes were found, whereas 94% were partially identified.  You said last time that contigs smaller than 10 kbp are not ideal for annotation and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)

Besides, I was able to run MAKER successfully via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh since permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne



From mtollis at asu.edu  Fri Mar 13 15:48:46 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Fri, 13 Mar 2015 14:48:46 -0700
Subject: [maker-devel] Question about pre-masked genome.
Message-ID: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>

Hello,
I have a genome that has been well-annotated for species-specific repeats
(using RepeatModeler, trf, and RepeatMasker), and was wondering if I could
simply use the soft-masked assembly for maker annotation, in order to skip
the sometimes cumbersome repeatmasking step. Is this generally looked upon
as doable, or should I just stick with using the unmasked assembly and the
repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could
not confirm that the masked assembly (i.e., lowercase letters) was
maintained in the scaffolds as viewed in the maker output directory.
Thanks,
Marc

-- 
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu

*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com

From dence at genetics.utah.edu  Fri Mar 13 18:14:52 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sat, 14 Mar 2015 00:14:52 +0000
Subject: [maker-devel] Question about pre-masked genome.
In-Reply-To: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>
References: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>
Message-ID: <51DE3D97-10DB-4FF9-A793-EEA051C705BD@genetics.utah.edu>

Hi Marc, several groups have been doing what you described with fungal genomes for a while and it seems to be working for them.  With the softmasked genome, maker will still allow EST alignments to extend into the masked regions; if it were a hard masked genome, this wouldn't be possible.
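
One way to confirm that soft-masking (lowercase bases) is still present in a FASTA file is to count the lowercase fraction directly. The following Python sketch assumes nothing about MAKER itself; the function name and the example sequence are invented for illustration, and in practice you would read the text from your own scaffold file.

```python
def softmasked_fraction(fasta_text):
    """Fraction of sequence characters that are lowercase (soft-masked)."""
    masked = 0
    total = 0
    for line in fasta_text.splitlines():
        seq = line.strip()
        if not seq or seq.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Hypothetical scaffold: 12 of 24 bases are lowercase.
example = ">scaffold_1\nACGTacgtacgtACGT\nNNNNacgt\n"
print(f"soft-masked fraction: {softmasked_fraction(example):.2f}")
```

A fraction of zero on a file that was supposed to be soft-masked would suggest that the case information was lost somewhere along the way.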

Let us know how it works out though!

Thanks,
Daniel


On Mar 13, 2015, at 3:48 PM, Marc Tollis <mtollis at asu.edu> wrote:

Hello,
I have a genome that has been well-annotated for species-specific repeats (using RepeatModeler, trf, and RepeatMasker), and was wondering if I could simply use the soft-masked assembly for maker annotation, in order to skip the sometimes cumbersome repeatmasking step. Is this generally looked upon as doable, or should I just stick with using the unmasked assembly and the repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could not confirm that the masked assembly (i.e., lowercase letters) was maintained in the scaffolds as viewed in the maker output directory.
Thanks,
Marc

--
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu

website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com


From mtollis at asu.edu  Sun Mar 15 08:19:37 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Sun, 15 Mar 2015 07:19:37 -0700
Subject: [maker-devel] control file for SNAP training
Message-ID: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>

This is a question about process, and to make sure I am doing things right
(when time is of the essence, some mistakes can set you back weeks).

I have run maker on my de novo vertebrate genome, using only the predictive
proteome from a congener (well-studied and available on Ensembl), and
generated the HMM for the first round of SNAP training. As per the 2014
tutorial, I edited the control file for this step as follows: I added the
path to the .hmm file, and set protein2genome to 0.

When I run maker, I notice that in addition to snap, it is still running
blastx and exonerate however. I noticed that this is because I did not
remove (or "comment out") the path to the protein.fa in the control file
(the output looks markedly different when I do comment out the protein file
- and I can't even tell if it's running snap in this instance).

Is it simply using exonerate to place the ab initio predictions on the
scaffolds (meaning that having protein2genome=1 is to tell maker to make
evidence annotations) ? Did I do this correctly, or should I also remove
the protein.fa out of the control file for SNAP training?
-- 
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu

*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com

From steinj at cshl.edu  Mon Mar 16 07:29:36 2015
From: steinj at cshl.edu (Stein, Joshua)
Date: Mon, 16 Mar 2015 13:29:36 +0000
Subject: [maker-devel] TACC lonestar and N50 value
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
Message-ID: <0A0DBB88-EC95-43B7-AA75-B7858B67A8F4@cshl.edu>

Hi Arne,

I have experience with iPlant resources and with MAKER-P.  I would encourage you to try the Atmosphere image (7888b8e1-c006-4794-82d9-4c940ddbf4c6).  You can request a large instance (up to 16 CPUs and 128 GB memory) and run in MPI mode to distribute the work.  Please see this tutorial, which includes information on running in MPI mode:  https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial

You can also access the TACC Lonestar installation using the iPlant Discovery Environment.  There is an app called "MAKER-P-Lonestar-Small-Genomes 2.3".  Although it is advertised as appropriate for "small" genomes, I think there is a good chance that it will work for 450 Mb.  This is a new resource and the iPlant team would value any feedback and benchmarks on how the system is working.  Depending on how this goes, there are plans to roll out additional apps intended for larger genomes.  Here is a tutorial: https://pods.iplantcollaborative.org/wiki/display/sciplant/Tutorial+for+running+MAKER-P+on+TACC-Lonestar+from+iPlant+Discovery+Environment

Regarding contig sizes, though not ideal, you can include contigs smaller than 10kbp in your run.  Plant genes tend to be more compact than vertebrate genes so you ought to be able to recover annotations on the smaller contigs, though keep an eye out for truncated genes.

Best,
Josh


On Mar 13, 2015, at 6:06 PM, Van Hoeck Arne <avhoeck at SCKCEN.BE> wrote:

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 of 20 kbp, of which only about three quarters (333 Mbp) is in contigs longer than 10 kbp. CEGMA reported that 87% of the genes were found, whereas 94% were partially identified.  You said last time that contigs smaller than 10 kbp are not ideal for annotation and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)

Besides, I was able to run MAKER successfully via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh since permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne



Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From mtollis at asu.edu  Tue Mar 17 15:26:44 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Tue, 17 Mar 2015 14:26:44 -0700
Subject: [maker-devel] control file for SNAP training
In-Reply-To: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
References: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
Message-ID: <CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>

I answered my own question:
There is no need to re-align the proteins - it takes too long.
So, I used the GFF file produced by gff3_merge from the log file of the
first run (the one with just protein2genome). Then, after generating the
.hmm file, I put it in my control file, set protein2genome=0, removed the
protein.fasta entry, pointed maker_gff at the merged GFF file, and set
protein_pass=1. The output now shows that only snap is running, with no
blastx or exonerate - a relief because it is much faster!
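
The control-file changes described above can be sketched in Python as a small helper that rewrites key=value lines while keeping their trailing comments. This is only an illustration: the helper name, the file names, and the abbreviated comment text in the sample are invented, while the option names (maker_gff, protein_pass, protein, snaphmm, protein2genome) are the ones mentioned in this thread.

```python
def set_ctl_options(ctl_text, updates):
    """Return ctl_text with the given key=value options replaced or appended."""
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        key, sep, rest = line.partition("=")
        if sep and not line.startswith("#") and key in remaining:
            # Keep the trailing "#comment" of the original line, if any.
            _, hash_mark, comment = rest.partition("#")
            new_line = f"{key}={remaining.pop(key)}"
            if hash_mark:
                new_line += f" #{comment}"
            out.append(new_line)
        else:
            out.append(line)
    # Options that were not present in the file are appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Abbreviated, hypothetical excerpt of a first-round control file.
first_round = (
    "maker_gff= #MAKER derived GFF3 file\n"
    "protein_pass=0 #use protein alignments in maker_gff\n"
    "protein=protein.fasta #protein sequence file\n"
    "snaphmm= #SNAP HMM file\n"
    "protein2genome=1 #infer predictions from protein homology\n"
)

second_round = set_ctl_options(first_round, {
    "maker_gff": "first_run.all.gff",  # hypothetical merged GFF3 from round one
    "protein_pass": "1",               # reuse protein alignments from the GFF3
    "protein": "",                     # no FASTA, so proteins are not re-aligned
    "snaphmm": "snap_round1.hmm",      # hypothetical HMM from SNAP training
    "protein2genome": "0",
})
print(second_round)
```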

On Sun, Mar 15, 2015 at 7:19 AM, Marc Tollis <mtollis at asu.edu> wrote:

> This is a question about process, and to make sure I am doing things right
> (when time is of the essence, some mistakes can set you back weeks).
>
> I have run maker on my de novo vertebrate genome, using only the
> predictive proteome from a congener (well-studied and available on
> Ensembl), and generated the HMM for the first round of SNAP training. As
> per the 2014 tutorial, I edited the control file for this step as follows:
> I added the path to the .hmm file, and set protein2genome to 0.
>
> When I run maker, I notice that in addition to snap, it is still running
> blastx and exonerate however. I noticed that this is because I did not
> remove (or "comment out") the path to the protein.fa in the control file
> (the output looks markedly different when I do comment out the protein file
> - and I can't even tell if it's running snap in this instance).
>
> Is it simply using exonerate to place the ab initio predictions on the
> scaffolds (meaning that having protein2genome=1 is to tell maker to make
> evidence annotations) ? Did I do this correctly, or should I also remove
> the protein.fa out of the control file for SNAP training?
>
> --
> *Marc Tollis, Ph.D.*
> *Post-Doctoral Research Associate*
> *Arizona State University*
> *LSE 313*
> *(480) 965-7456*
> marc.tollis at asu.edu
>
> *website: *https://sites.google.com/site/tollisresearch/
> *blog: *anolistollis.wordpress.com
>


-- 
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu

*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com

From carsonhh at gmail.com  Tue Mar 17 20:47:50 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 17 Mar 2015 20:47:50 -0600
Subject: [maker-devel] control file for SNAP training
In-Reply-To: <CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>
References: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
	<CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>
Message-ID: <AADABBD3-04F1-49BF-B261-4B316EF60D2B@gmail.com>

You can also add an additional protein file if you want more evidence than what is already in the GFF3. This makes re-annotation and adding evidence to an already annotated genome really easy.

--Carson


> On Mar 17, 2015, at 3:26 PM, Marc Tollis <mtollis at asu.edu> wrote:
> 
> I answered my own question:
> No need to re-align proteins again - takes too long.
> So, I used the gff file from the gff_merge on the log file from the first run (the one with just protein2genome). Then, after generating the .hmm file, I put it in my control file, along with protein2genome=0, removed the protein.fasta, set maker_gff and protein_pass=1. The output now shows that only snap is running, and no blastx and exonerate - a relief because it is much faster!
> 
> On Sun, Mar 15, 2015 at 7:19 AM, Marc Tollis <mtollis at asu.edu> wrote:
> This is a question about process, and to make sure I am doing things right (when time is of the essence, some mistakes can set you back weeks).
> 
> I have run maker on my de novo vertebrate genome, using only the predictive proteome from a congener (well-studied and available on Ensembl), and generated the HMM for the first round of SNAP training. As per the 2014 tutorial, I edited the control file for this step as follows: I added the path to the .hmm file, and set protein2genome to 0. 
> 
> When I run maker, I notice that in addition to snap, it is still running blastx and exonerate however. I noticed that this is because I did not remove (or "comment out") the path to the protein.fa in the control file (the output looks markedly different when I do comment out the protein file - and I can't even tell if it's running snap in this instance). 
> 
> Is it simply using exonerate to place the ab initio predictions on the scaffolds (meaning that having protein2genome=1 is to tell maker to make evidence annotations) ? Did I do this correctly, or should I also remove the protein.fa out of the control file for SNAP training? 
>
> -- 
> Marc Tollis, Ph.D.
> Post-Doctoral Research Associate
> Arizona State University
> LSE 313
> (480) 965-7456
> marc.tollis at asu.edu
> 
> website: https://sites.google.com/site/tollisresearch/
> blog: anolistollis.wordpress.com
> 
> 
> -- 
> Marc Tollis, Ph.D.
> Post-Doctoral Research Associate
> Arizona State University
> LSE 313
> (480) 965-7456
> marc.tollis at asu.edu
> 
> website: https://sites.google.com/site/tollisresearch/
> blog: anolistollis.wordpress.com


From Brian.Mack at ARS.USDA.GOV  Fri Mar 20 07:17:09 2015
From: Brian.Mack at ARS.USDA.GOV (Mack, Brian)
Date: Fri, 20 Mar 2015 13:17:09 +0000
Subject: [maker-devel] est2genome wrong strand
Message-ID: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>

Hi, I've noticed what seems to be a flipping of the strand of some of my transcripts by est2genome. I assembled my directional RNA-seq reads using Trinity. I've copied an example below. The blastn within MAKER shows the transcript aligning on the positive strand, as does blastn against my genome in SequenceServer. But est2genome shows the strand as negative. I've noticed this for quite a few transcripts while examining them in WebApollo. Any ideas what might be causing this?


Thanks,

Brian


Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496
>contig_69 <http://10.114.143.20:4567/get_sequence/?id=contig_69&db=/home/brian/blastdb/af70_20130423_id-modified.assembly.txt>
Length=108040

 Score =  1043 bits (1156),  Expect = 0.0
 Identities = 589/592 (99%), Gaps = 3/592 (1%)
 Strand=Plus/Plus

Query  24      TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  83
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105546  TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  105605

Query  84      CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  142
               |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105606  CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  105665


69           blastn    expressed_sequence_match         105546  106137  559        +             .               ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3
69           blastn    match_part         105546  106137  559        +             .               ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359
69           est2genome       expressed_sequence_match         105546  106137  2909      -              .               ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3
69           est2genome       match_part         105546  106137  2909      -              .               ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359

From carsonhh at gmail.com  Fri Mar 20 08:54:28 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Mar 2015 08:54:28 -0600
Subject: [maker-devel] est2genome wrong strand
In-Reply-To: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>
References: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>
Message-ID: <C32539B3-CF24-4C99-9897-605FE8C8CCB8@gmail.com>

Hi Brian,

Multi-exon ESTs are stranded by their splice sites, which can only work on one strand. Single-exon ESTs, on the other hand, are stranded by their open reading frame. The standard chemistry used in most sequencing does not produce strand-specific sequence. There are technologies that do, but unless you used one of those, re-stranding single-exon ESTs based on ORF is the best way to figure out where single-exon alignments should go (not a completely reliable method, but it works most of the time).  I believe Trinity tries to determine strand the same way (but it is unreliable there too). For example, since your alignment is not an exact match to the genomic sequence (a single bp deletion in the alignment), the best open reading frame in the transcript is not the same as the best open reading frame in the genomic sequence (so one of them likely contains an error).  MAKER, since it is annotating the genome, logically re-strands the alignment to the genome, whereas Trinity, being unaware of the genome, strands it to the transcript.  Because single-exon alignments are very unreliable, they are ignored in MAKER by default.  They will not be used as hints for gene predictors and can only be used to support a gene if there is also protein evidence or a single-exon ab initio prediction at the same location to support it (and even then only if you set single_exon=1 in the control files).
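
The idea of stranding a single-exon alignment by its open reading frame can be sketched in Python as follows. This is a simplified illustration, not MAKER's or Trinity's actual code: it compares the longest stop-free stretch over the three frames of a sequence with that of its reverse complement, and the function names and the toy sequence are invented.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_open_frame(seq):
    """Length (bp) of the longest stop-free codon run in the 3 forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0  # a stop codon ends the current open stretch
            else:
                run += 3
                best = max(best, run)
    return best


def strand_by_orf(seq):
    """Pick '+' or '-' according to which strand has the longer open frame."""
    forward = longest_open_frame(seq)
    reverse = longest_open_frame(reverse_complement(seq))
    return "+" if forward >= reverse else "-"


# Toy sequence: frame 0 is open end to end on the forward strand, while the
# reverse complement begins with a run of TAA stop codons.
toy = "ATG" + "TTA" * 7
print(strand_by_orf(toy), strand_by_orf(reverse_complement(toy)))
```

Because a single inserted or deleted base shifts the reading frame, the transcript and the genomic sequence it aligns to can have different best frames, which is how the two tools can end up disagreeing about strand.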

--Carson


On Mar 20, 2015, at 7:17 AM, Mack, Brian <Brian.Mack at ARS.USDA.GOV> wrote:

> Hi, I've noticed what seems to be a flipping of the strand of some of my transcripts by est2genome. I assembled my directional RNA-seq reads using Trinity. I've copied an example below. The blastn within MAKER shows the transcript aligning on the positive strand, as does blastn against my genome in SequenceServer. But est2genome shows the strand as negative. I've noticed this for quite a few transcripts while examining them in WebApollo. Any ideas what might be causing this?
>  
> Thanks,
> Brian
>  
> Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496
> >contig_69  <http://10.114.143.20:4567/get_sequence/?id=contig_69&db=/home/brian/blastdb/af70_20130423_id-modified.assembly.txt>
> Length=108040
>  
>  Score =  1043 bits (1156),  Expect = 0.0
>  Identities = 589/592 (99%), Gaps = 3/592 (1%)
>  Strand=Plus/Plus
>  
> Query  24      TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  83
>                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct  105546  TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  105605
>  
> Query  84      CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  142
>                |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct  105606  CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  105665
>  
>  
>  
> 69           blastn    expressed_sequence_match         105546  106137  559        +             .               ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3
> 69           blastn    match_part         105546  106137  559        +             .               ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359
> 69           est2genome       expressed_sequence_match         105546  106137  2909      -              .               ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3
> 69           est2genome       match_part         105546  106137  2909      -              .               ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359

From xvazquezc at gmail.com  Sat Mar 21 21:27:27 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Sun, 22 Mar 2015 14:27:27 +1100
Subject: [maker-devel] annotation stats: repeats
Message-ID: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>

Hi all,

I was wondering how I can get data about the repeat content of the genome
from MAKER, if possible, broken down by type of repeat: RE, transposons,
simple repeats, low-complexity repeats.

Thank you in advance,

Xabier

-- 
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From dence at genetics.utah.edu  Sat Mar 21 23:56:06 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sun, 22 Mar 2015 05:56:06 +0000
Subject: [maker-devel] annotation stats: repeats
In-Reply-To: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>
References: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>
Message-ID: <4FFB960C-CF1A-440F-B767-36EC369D2B58@genetics.utah.edu>

Hi Xabier, MAKER offers several options for repeat masking. There's a file of repetitive-element proteins that is run with RepeatRunner, and you can also run RepeatMasker with a custom library and/or one of the default RepeatMasker libraries.

The results from these programs are in the GFF3 files that MAKER generates. Depending on the library that you give RepeatMasker, the repeats might be classified by transposable element family and as simple vs. complex repeats. These are probably a good place to start, but you'll have to extract that data from the GFF3 file and compile it.
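
As a rough starting point for that extraction, here is a Python sketch that tallies masked bases per repeat class from GFF3 lines. It rests on assumptions you should check against your own files: that repeat features carry `repeatmasker` or `repeatrunner` in the source column, that the top-level feature type is `match`, and that the class follows `genus:` inside the `Name` attribute. The function names are made up for this example.

```python
from collections import defaultdict
from urllib.parse import unquote

# Assumed source-column labels for repeat features; verify in your own GFF3.
REPEAT_SOURCES = {"repeatmasker", "repeatrunner"}


def repeat_class(attributes):
    """Extract a class label from Name= (assumed form 'species:X|genus:Y')."""
    for field in attributes.split(";"):
        if field.startswith("Name="):
            name = unquote(field[len("Name="):])
            for part in name.split("|"):
                if part.startswith("genus:"):
                    return part[len("genus:"):]
            return name  # no genus: part, fall back to the whole name
    return "Unspecified"


def merged_length(intervals):
    """Bases covered by 1-based inclusive intervals, counting overlaps once."""
    total = 0
    covered_to = 0
    for start, end in sorted(intervals):
        if end <= covered_to:
            continue
        total += end - max(start, covered_to + 1) + 1
        covered_to = end
    return total


def summarize_repeats(gff_lines):
    """Return {repeat class: masked bp}, using only top-level 'match' features."""
    by_class = defaultdict(lambda: defaultdict(list))
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, source, ftype = cols[0], cols[1], cols[2]
        if source not in REPEAT_SOURCES or ftype != "match":
            continue  # skip match_part children to avoid double counting
        by_class[repeat_class(cols[8])][seqid].append((int(cols[3]), int(cols[4])))
    return {
        cls: sum(merged_length(ivs) for ivs in per_seq.values())
        for cls, per_seq in by_class.items()
    }
```

Calling `summarize_repeats(open("genome.all.gff"))` (file name hypothetical) and dividing each value by the assembly size would give a percentage per class.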

Let us know whether that helps.

Thanks,
Daniel



From panos.ioannidis at gmail.com  Tue Mar 24 02:29:14 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 09:29:14 +0100
Subject: [maker-devel] Augustus retraining
Message-ID: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>

Hello All,

I'm trying to retrain Augustus using EST data from the same species and
realized that quite a few of the gene models I get based on EST data are
incomplete (i.e. no start and/or stop codon).

Now, when I get to the "etraining" step in Augustus retraining (right after
the time-consuming "optimize_augustus.pl" step), I get a warning for each
gene that doesn't contain a start or stop codon.

.....
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1
transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon
does not begin with start codon but with acg
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1
transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon
doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
....

Does anyone know whether training is compromised by such incomplete gene
models? Do you usually exclude them from the training set?
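
For anyone who wants to screen a training set before etraining, here is a small Python sketch of a completeness check on a CDS sequence. It assumes the CDS string includes the stop codon; if your annotation excludes the stop from the CDS (the situation the stopCodonExcludedFromCDS hint refers to), you would have to check the three bases after the CDS instead. The function names are made up for this example.

```python
# Illustrative check: does a CDS start with ATG, end with a stop codon,
# and have a length that is a multiple of three?
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_status(cds):
    """Return start/stop/frame flags for a CDS given as a DNA string."""
    cds = cds.upper()
    return {
        "has_start": cds.startswith("ATG"),
        "has_stop": len(cds) >= 3 and cds[-3:] in STOP_CODONS,
        "in_frame": len(cds) % 3 == 0,
    }


def is_complete(cds):
    """True only if the CDS has a start codon, a stop codon, and is in frame."""
    return all(cds_status(cds).values())
```

For example, a CDS beginning with `acg`, as in the first warning above, would be reported with `has_start` set to False.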

Oh, and by the way, the best guide to retraining Augustus is here
<http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>.
The official
<http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html> web
page isn't bad, but doesn't explain certain things in detail.

Thanks,
Panos

From xvazquezc at gmail.com  Tue Mar 24 06:06:25 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 24 Mar 2015 23:06:25 +1100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
Message-ID: <CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>

Hi Panos,

Have you tried using webAugustus for the (re)training? I found it very
convenient for generating the models for Augustus.

Cheers,


-- 
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From panos.ioannidis at gmail.com  Tue Mar 24 06:24:45 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 13:24:45 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
Message-ID: <CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>

Hi Xabier,

Thanks for your quick reply!

No, I haven't used WebAugustus, but I just checked it out and it looks like
my training set is too big (~300 Mbp), so I can't even upload it!

Anyway, I prefer to train it locally because I have better control over
each step. Also, I have done the entire training procedure with fewer genes,
but didn't get a good gene-level sensitivity (~5%). So now I'm trying to
replicate it using more of my scaffolds, but it appears that I get a lot more
incomplete models from exonerate (run through MAKER).

P



From carsonhh at gmail.com  Tue Mar 24 08:14:51 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:14:51 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
Message-ID: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

Hi Panos,

ESTs and mRNA-seq assemblies will by their nature be partial. After a first round of training, you can run MAKER with protein and EST evidence together with the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the models it produces will be better than what you would get from running Augustus on its own. Then take these gene models and use them to retrain Augustus. This is the standard bootstrap retraining procedure, and it can be repeated as needed.
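
The two rounds can be sketched as control-file overrides. This Python snippet is only an illustration: `est2genome` and `augustus_species` are maker_opts.ctl options, but the species name `my_species_r1` and the helper functions are hypothetical, and evidence paths (genome, est, protein) are left out.

```python
# Illustrative sketch of per-round maker_opts.ctl overrides for bootstrap training.
def maker_round_settings(round_number, species=None):
    """Return a dict of maker_opts.ctl overrides for one training round."""
    if round_number == 1:
        # No trained predictor yet: infer gene models directly from EST alignments.
        return {"est2genome": "1", "augustus_species": ""}
    if species is None:
        raise ValueError("later rounds need the species trained in the previous round")
    # Later rounds: let the trained Augustus predict, with MAKER supplying hints.
    return {"est2genome": "0", "augustus_species": species}


def render_ctl(settings):
    """Format the overrides as key=value lines, as used in maker_opts.ctl."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# 'my_species_r1' is a hypothetical name for the species trained after round 1.
print(render_ctl(maker_round_settings(2, species="my_species_r1")))
```

The models from each later round are then converted and fed back into training, as described above.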

More info on bootstrap training here (info is for SNAP but procedure is similar to Augustus) -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
Here is an excellent explanation of Augustus training -> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
and here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training so just take that and convert it for Augustus) -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Finally you can also manually edit the GFF3 file in Apollo (easier to use the legacy stand alone version), and then convert that file for bootstrap training.

--Carson



From panos.ioannidis at gmail.com  Tue Mar 24 08:31:04 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 15:31:04 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID: <CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>

Hi Carson,

So you think it's okay to include incomplete gene models when training
Augustus?

I'll certainly try the bootstrap method you're suggesting. Even though I
did it for SNAP, for some weird reason I forgot it for Augustus :p Do you
think, however, that I can get a big improvement in gene-level sensitivity?
Currently, I have only 6%...

Thanks,
Panos



From carsonhh at gmail.com  Tue Mar 24 08:39:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:39:20 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
Message-ID: <C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>

On your first round it is fine. It gives the predictor enough to work with; then on the second round you use improved models.

When you say 6% sensitivity, is that Augustus running on its own? If it's inside of MAKER, that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also, is that the gene-level, exon-level, or nucleotide-level sensitivity? If you are looking at the gene-level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide-level sensitivity/specificity metrics. The gene- and exon-level metrics are basically meaningless (unless it's Drosophila, which is the only species annotated correctly enough to use them).
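
To show why the two numbers can differ so much, here is a small Python sketch of nucleotide-level and gene-level metrics. It is a simplification under stated assumptions: one transcript per gene, coding exons given as 1-based inclusive (start, end) pairs on a single sequence, and "specificity" in the gene-prediction sense of TP / (TP + FP). The function names are made up for this example and this is not the Augustus evaluation code.

```python
# Illustrative comparison of nucleotide-level and gene-level accuracy.
def covered_positions(exons):
    """Set of coding positions covered by (start, end) exons, 1-based inclusive."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_metrics(reference, predicted):
    """Return (sensitivity, specificity) over coding nucleotides.

    reference / predicted: dict mapping gene id -> list of (start, end) exons.
    """
    ref = set().union(*(covered_positions(e) for e in reference.values()))
    pred = set().union(*(covered_positions(e) for e in predicted.values()))
    true_pos = len(ref & pred)
    sensitivity = true_pos / len(ref) if ref else 0.0
    specificity = true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity


def gene_sensitivity(reference, predicted):
    """Fraction of reference genes whose exon structure is reproduced exactly."""
    if not reference:
        return 0.0
    predicted_structures = {tuple(sorted(exons)) for exons in predicted.values()}
    hits = sum(tuple(sorted(exons)) in predicted_structures
               for exons in reference.values())
    return hits / len(reference)
```

With a reference of two genes where one predicted exon ends 10 bp early, nucleotide-level sensitivity stays near 97% while gene-level sensitivity drops to 50%, because a single boundary error makes the whole gene count as a miss.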

--Carson



From panos.ioannidis at gmail.com  Tue Mar 24 09:05:54 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 16:05:54 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
	<C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
Message-ID: <CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>

Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level
is 88%. I only mentioned gene-level, because that's the only metric
mentioned in the Augustus web site.

I got these numbers outside of Maker. Actually, I only used Maker to
generate the gff files needed to start the training (ran it using only EST
evidence and only on a subset of my assembly, using this
<http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>
as a guide).

Now, I've started running the second round of training, as you suggested.
Since, however, I don't have data from closely related species, I'm only
using Uniref50 as protein evidence.

P

On Tue, Mar 24, 2015 at 3:39 PM, Carson Holt <carsonhh at gmail.com> wrote:

> On your first round it is fine.  It gives the predictor enough to work
> with, then on the second round you use improved models. When you say 6%
> sensitivity is that Augustus running on its own?  If it's inside of MAKER
> that means you are not providing sufficient protein evidence (you need the
> full proteome of at least two related species). Also is that the gene
> level, exon level, or nucleotide level sensitivity.  If you are looking at
> the gene level sensitivity measure, you only get a match when you perfectly
> match all transcripts in a gene (models that may not be correct in the
> first place). This value will rarely go above 10% for any predictor. You
> need to use the nucleotide level sensitivity/specificity metrics.  The gene
> and exon level metrics are basically meaningless (unless it's Drosophila
> which is the only species annotated correctly enough to use them).
>
> ?Carson
>
>
> On Mar 24, 2015, at 8:31 AM, Panos Ioannidis <panos.ioannidis at gmail.com>
> wrote:
>
> Hi Carson,
>
> So you think it's okay to include incomplete gene models when training
> Augustus?
>
> I'll certainly try the bootstrap method you're suggesting. Even though I
> did it for SNAP, for some weird reason I forgot it for Augustus :p Do you
> think, however, that I can get a big improvement in gene-level sensitivity?
> Currently, I have only 6%...
>
> Thanks,
> Panos
>
>
> On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt <carsonhh at gmail.com> wrote:
>
>> Hi Panos,
>>
>> ESTs and mRNA-seq assemblies will by their nature be partial.  After a
>> first round of training you can run MAKER together with protein and EST
>> evidence and the newly trained Augustus species file.  Because MAKER gives
>> hints to Augustus as it runs, the models it produces will be improved over
>> what it would get from just running Augustus on its own.  Then take these
>> gene models and use them to retrain Augustus.  This is the standard
>> bootstrap retraining procedure, and can be repeated as needed.
>>
>> More info on bootstrap training here (info is for SNAP but procedure is
>> similar to Augustus) ->
>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>> Here is an excellent explanation of Augustus training ->
>> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
>> and here are tools to convert SNAP training files to Augustus training
>> files (MAKER comes with a tool that converts GFF3 for SNAP training so just
>> take that and convert it for Augustus) ->
>> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
>>
>> Finally you can also manually edit the GFF3 file in Apollo (easier to use
>> the legacy stand alone version), and then convert that file for bootstrap
>> training.
>>
>> -Carson
>>
>>
>> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis <panos.ioannidis at gmail.com>
>> wrote:
>>
>> Hi Xabier,
>>
>> Thanks for your quick reply!
>>
>> No, I haven't used WebAugustus, but I just checked it out and it looks
>> like my training set is too big (~300 Mbp), so I can't even upload it!
>>
>> Anyway, I prefer to train it locally because I have better control over
>> each step. Also, I have done the entire training procedure with fewer genes,
>> but didn't get a good gene-level sensitivity (~5%). So now I'm trying to
>> replicate it using more of my scaffolds, but as it appears I get a lot more
>> incomplete models from exonerate (run through Maker).
>>
>> P
>>
>>
>>
>> On Tue, Mar 24, 2015 at 1:06 PM, Xabier Vázquez Campos <
>> xvazquezc at gmail.com> wrote:
>>
>>> Hi Panos,
>>>
>>> Have you tried using webAugustus for the (re)training? I found it very
>>> convenient for generating the models for Augustus.
>>>
>>> Cheers,
>>>
>>> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis <panos.ioannidis at gmail.com>:
>>>
>>>> Hello All,
>>>>
>>>> I'm trying to retrain Augustus using EST data from the same species and
>>>> realized that quite a few of the gene models I get based on EST data are
>>>> incomplete (i.e. no start and/or stop codon).
>>>>
>>>> Now, when I get to the "etraining" step in Augustus retraining (right
>>>> after the time-consuming "optimize_augustus.pl" step), I get a warning
>>>> for each gene that doesn't contain a start or stop codon.
>>>>
>>>> .....
>>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1
>>>> transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon
>>>> does not begin with start codon but with acg
>>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1
>>>> transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon
>>>> doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
>>>> ....
>>>>
>>>> Does anyone know whether training is compromised by such incomplete
>>>> gene models? Do you usually exclude them from the training set?
>>>>
>>>> Oh, and by the way, the best guide to retraining Augustus is here
>>>> <http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>.
>>>> The official
>>>> <http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html>
>>>> web page isn't bad, but doesn't explain certain things in detail.
>>>>
>>>> Thanks,
>>>> Panos
>>>>
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>>>
>>>
>>>
>>> --
>>> Xabier Vázquez Campos
>>> *PhD Candidate*
>>> Water Research Centre
>>> School of Civil and Environmental Engineering
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>>
>>
>
>

From carsonhh at gmail.com  Tue Mar 24 09:38:08 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 09:38:08 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
	<C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
	<CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>
Message-ID: <4EA5A1F6-2950-4D65-A59C-0F3848C86C02@gmail.com>

I'd pick a couple of species that are as closely related as you can find.  Proteins will still align over large evolutionary distances, but you need a breadth of protein data that UniRef and even databases like Swiss-Prot won't have (those databases are usually a little too conservative).

The gene level sensitivity/specificity values make certain assumptions about the models you are comparing with.  Unfortunately the only species these assumptions hold true for is Drosophila and maybe C. elegans to a point.  This is the reason you will rarely see species other than those two used for the gene and exon sensitivity/specificity metrics.
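
To make the difference between the levels concrete, here is a minimal Python sketch of the nucleotide-level calculation next to a gene-level exact-match check. It is only an illustration under stated assumptions: the coordinates and function names are invented, the intervals are 1-based inclusive coding segments on one strand of one sequence, and "specificity" follows the usual gene-prediction convention TP / (TP + FP). It is not how Augustus or MAKER produce their own reports.

```python
def covered_bases(intervals):
    """Return the set of positions covered by 1-based inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def nucleotide_sn_sp(reference_cds, predicted_cds):
    """Nucleotide-level sensitivity and specificity of a prediction against a reference."""
    ref = covered_bases(reference_cds)
    pred = covered_bases(predicted_cds)
    true_positive = len(ref & pred)
    sensitivity = true_positive / len(ref) if ref else 0.0
    # In gene prediction "specificity" usually means TP / (TP + FP).
    specificity = true_positive / len(pred) if pred else 0.0
    return sensitivity, specificity


# Hypothetical single gene: two reference exons versus a prediction that
# truncates the second exon and adds one spurious exon.
reference = [(100, 199), (300, 399)]
predicted = [(100, 199), (300, 349), (500, 549)]

sn, sp = nucleotide_sn_sp(reference, predicted)

# Gene-level scoring only counts a gene when every exon matches exactly,
# so this mostly-correct prediction scores zero at the gene level.
gene_level_match = sorted(reference) == sorted(predicted)

print(f"nucleotide sensitivity={sn:.2f} specificity={sp:.2f}")
print(f"gene-level exact match: {gene_level_match}")
```

In this toy case three quarters of the coding bases are recovered, yet the gene counts as a miss at the gene level, which is the effect described above.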

Thanks,
Carson




From alicebdennis at gmail.com  Thu Mar 26 04:34:26 2015
From: alicebdennis at gmail.com (Alice Dennis)
Date: Thu, 26 Mar 2015 11:34:26 +0100
Subject: [maker-devel] iterative Maker2
In-Reply-To: <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch>
	<1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID: <CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>

Hello again,

I posted a while ago about a genome I'm running through the Maker2
pipeline. I was concerned because my results were still changing with
3 and 4 iterations.

Following the very useful advice of Carson (below), I've made a few
modifications (adding a RepeatModeler run, using a big protein
database), but my gene predictions are still changing between the 3rd
and 4th iterations. Perhaps this is ok, but these increasing gene
lengths make me worry that I haven't built stable models.

Here is the short version of what I've done.
1. Run RepeatModeler, but this only produced 47 sequences in the
resulting .fasta... so that seemed a bit small.

2. Run Maker2 using:
- RepeatModeler output + "model_org=all" and "softmask=1" in the
Repeat Masking section.
- protein evidence from 2 distantly related species AND all of Uniprot
- ests from a different strain of my species (a parasitoid wasp)
- the .hmm from Nasonia, one of the 2 distantly related species whose
proteome I also provided as protein evidence
- my assembled genome of 1,509 scaffolds.

3. After this, I did three subsequent rounds of Maker2 (cleverly named
Rounds 2, 3 and 4). Each one used the same input, except the Nasonia
.hmm was replaced by a SNAP generated .hmm from the previous round.
Also, est2genome and protein2genome were changed from 1 to 0 in all
runs after the first.

Here are some results:
Round1: 14,647 genes, average length 2,491
Round2: 12,158 genes, average length 3,760
Round3: 13,515 genes, average length 3,090
Round4: 12,169 genes, average length 3,918

This is a bit confusing because the number of genes predicted goes up
and down, as do their lengths. I've double-checked the dates of my
files, and they are all labeled such that I don't think anything could
be swapped.
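
For reference, the size of the swing between rounds can be put in numbers with a small Python sketch like the one below. It only uses the four gene counts and average lengths listed above; the function name and the percent-change summary are my own choice of illustration, not something Maker2 reports.

```python
# (round name, number of genes, average gene length) as listed above
rounds = [
    ("Round1", 14647, 2491),
    ("Round2", 12158, 3760),
    ("Round3", 13515, 3090),
    ("Round4", 12169, 3918),
]


def relative_changes(rows):
    """Percent change in gene count and mean length relative to the previous round."""
    changes = []
    for (_, prev_n, prev_len), (name, n, length) in zip(rows, rows[1:]):
        pct_genes = (n - prev_n) / prev_n * 100
        pct_length = (length - prev_len) / prev_len * 100
        changes.append((name, pct_genes, pct_length))
    return changes


changes = relative_changes(rounds)
for name, pct_genes, pct_length in changes:
    print(f"{name}: genes {pct_genes:+.1f}%, average length {pct_length:+.1f}%")
```

If the models were converging, both percentages would be expected to shrink toward zero from one round to the next.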

So my questions are:
Is this an indication that my models are unstable and I shouldn't
trust these predictions?
Is a decreasing number of genes, with the genes getting longer, perhaps a
good thing?
How do I know when to stop if genes keep getting longer?


Thanks very much,
Alice


On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt <carsonhh at gmail.com> wrote:
> The gene models are actually produced by SNAP, Augustus, or whatever gene
> predictor you are using, so if you change the HMM every round, then the
> models will change too.  But I have one concern.  You are using a very
> sparse protein evidence dataset.  The protein dataset is very important to
> MAKER's performance, and for iterative training of the ab initio
> predictors.  Normally after the second iteration, additional training should
> not be beneficial, but if you are getting wildly different results on 3rd
> and 4th round, then you probably aren't getting sufficient good models to
> train with.
>
> For a protein dataset you should be using the entire proteome from a
> minimum of two related species and perhaps all of UniProt/Swiss-Prot to get
> a broad protein database.  Don't use the proteins extracted by CEGMA and
> HaMSTr.  CEGMA can be used to guide the first HMM creation (cegma2zff script
> that comes with MAKER), but don't give the proteins to MAKER as evidence,
> also the HaMSTr results will be redundant with the ESTs.  You need proteins
> from related species to look for homology not found in the EST dataset.
>
> Also repeat masking is important for any genome and has a huge effect on ab
> initio predictor performance.  Make sure you run something like
> RepeatModeler to look for species specific repeats that will not already be
> in RepBase.  Then add those results to the rmlib= option in the maker
> control files.
>
> Thanks,
> Carson
>
>
>
>
> On Dec 12, 2014, at 7:10 AM, Dennis, Alice <Alice.Dennis at eawag.ch> wrote:
>
> Hi all,
>
> I am a relatively new user of Maker2, and I'm looking for advice on running
> many iterations of the same dataset in Maker2.
>
> I have a relatively small genome (~124 Mb) from a wasp that is assembled
> into ~1,500 scaffolds. I have run several iterations of Maker2 by
> re-generating .hmms in SNAP and feeding them into the next round, and my
> gene predictions keep increasing (in number and in size).  The only thing
> that changes at each round is the .hmm.
> The evidence that I give is:
> -          de novo assembled ESTs from a different strain of the same
> species (70,000 contigs; I am currently working on improving this assembly
> with the hope that this will be helpful here)
> -          610 proteins extracted from the genome scaffolds using CEGMA and
> HaMSTr
>
> For my 1st iteration, I used the Nasonia .hmm from SNAP, and the
> est2genome/protein2genome option.
>
> For the 2nd, 3rd and 4th rounds I have used .hmms generated from the
> previous round, all without the est2genome/protein2genome option. All other
> files are the same as in the original run.
>
> As I understand it, after the second round, nothing should change in Maker2.
> But the differences are obvious between runs. Some entirely new exons are
> annotated. For example, just counting "exon" in the .gff file gives me
> 73,000 after the third iteration and 96,000 after the fourth! Actually the
> biggest leap in this number is between the third and fourth round. I can
> also see that many features are longer when I look at the files in Geneious.
>
> Is this sort of change possible after the second round of Maker2? Is there
> something I have done wrong in my runs, or am I understanding this output
> incorrectly?
>
> Thank you,
> Alice
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>


-- 


Alice Dennis
alicebdennis at gmail.com

Postdoctoral Researcher
Institute for Integrative Biology, ETH Zürich & EAWAG
Überlandstrasse 133
P.O. Box 611
8600 Dübendorf, Switzerland

https://adennis5.wordpress.com/


From michael.s.campbell1 at gmail.com  Thu Mar 26 09:50:41 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 26 Mar 2015 09:50:41 -0600
Subject: [maker-devel] iterative Maker2
In-Reply-To: <CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch>
	<1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
	<CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>
Message-ID: <CAAi6vWXnyyFkTVD9tc-QGxSBCBenTy5QyTM6ReVqDveXQA0FTg@mail.gmail.com>

Hi Alice,

In my experience the fewer longer genes is generally a good thing (and very
normal) resulting from the merging of split models and extension of
incomplete models. I find it helpful to load the annotations and evidence
into a browser to get a visual idea of what is happening.
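
Alongside the browser view, a quick numeric summary per round can help. Below is a minimal Python sketch that counts gene features and their mean length in a GFF3 file. It assumes the final annotations carry "maker" in the source column and "gene" in the type column, and that any embedded sequence follows a ##FASTA line; the function name and the example records are hypothetical.

```python
def gene_length_summary(gff3_lines, source="maker"):
    """Count 'gene' features from one source and return (count, mean length)."""
    lengths = []
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        if cols[1] == source and cols[2] == "gene":
            # GFF3 coordinates are 1-based and inclusive
            lengths.append(int(cols[4]) - int(cols[3]) + 1)
    count = len(lengths)
    mean = sum(lengths) / count if count else 0.0
    return count, mean


# Hypothetical records standing in for one round's GFF3 output
example_gff3 = [
    "##gff-version 3",
    "scaffold1\tmaker\tgene\t1001\t3000\t.\t+\t.\tID=gene1",
    "scaffold1\tmaker\tmRNA\t1001\t3000\t.\t+\t.\tID=gene1-mRNA-1;Parent=gene1",
    "scaffold1\tsnap_masked\tmatch\t900\t3100\t.\t+\t.\tID=match1",
    "scaffold2\tmaker\tgene\t501\t1500\t.\t-\t.\tID=gene2",
    "##FASTA",
    ">scaffold1",
    "ACGT",
]

count, mean = gene_length_summary(example_gff3)
print(f"{count} genes, mean length {mean:.0f}")
```

For a real file the same function can be called on an open file handle, for example `gene_length_summary(open(path))`, once per round.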

Mike



-- 
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From rens.holmer at wur.nl  Mon Mar 30 00:12:20 2015
From: rens.holmer at wur.nl (Holmer, Rens)
Date: Mon, 30 Mar 2015 06:12:20 +0000
Subject: [maker-devel] Incorporating cufflinks in maker
Message-ID: <42E0168F-C672-4B7F-97D4-98442B825BF9@wur.nl>

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. Currently I see two options:

1. Provide the cufflinks output as EST-gff
2. Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,

Rens Holmer


From goutham.atla at gmail.com  Fri Mar 27 23:37:08 2015
From: goutham.atla at gmail.com (Goutham atla)
Date: Sat, 28 Mar 2015 11:07:08 +0530
Subject: [maker-devel] Annotating Cufflinks GTF with Maker
Message-ID: <CALU8LA4CwLD8qm5f==xKSjZoCw+9Ajd=RCD62LkHTdBYbuajig@mail.gmail.com>

Dear All,

I have a draft genome for my organism of interest and around 150G of
100bp paired-end RNA-Seq data from different conditions. This organism has
Ensembl annotations, but very few.

My goal is to analyze differential splicing between two
conditions. For this I need good annotations in GTF format at the isoform
level. I am interested in using the Splicing Analysis Kit
<http://cbcb.umd.edu/software/spanki/>

For now, I have aligned one sample to the genome using tophat2 and then used
cufflinks to generate a de novo GTF file. In neither step did I use
the available GTF with its very few annotations.

The GTF file generated by cufflinks should be annotated to know the
function of each transcript. So I am interested in adding annotations to
the GTF file generated by cufflinks. What is the best way of doing it?

Or is there a better way of getting a GTF file, like that of Ensembl,
from my data?

I have looked at Trinotate, but it's more about functional annotation and
expression studies.


Regards,

-- 
Goutham Atla

From avhoeck at SCKCEN.BE  Mon Mar 30 10:11:16 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Mon, 30 Mar 2015 16:11:16 +0000
Subject: [maker-devel] comments on Incorporating cufflinks in maker
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF11826E1A@MAILSRV3.sck.be>

Dear Rens and Carson,
I would like to comment on Rens' question about adding transcript data to the annotation. Since I have no access to Google Groups, I am trying to contact you both by mail, hopefully at the correct addresses.

I have tested Maker-P with multiple parameter sets and optimized two gene prediction tools (SNAP and Augustus). Both give similar results, with Maker finding around 17000 gene annotations. I also have RNA-seq samples, and as a test case I processed the Cufflinks output with TransDecoder to select gene annotations. With this approach I end up with 25000 gene annotations. More or less all the genes that Maker selected were also selected by the TransDecoder approach. So is it wise to include the information on the 8000 missing genes, based on optimized gene models? There will probably be some false positives among these 8000 missing genes, but with good criteria and models it may be possible for Maker to find more gene annotations.
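
To show what I mean by the "missing" genes, here is a minimal Python sketch of the comparison: it lists loci from one set that overlap no gene in the other set on the same scaffold. The coordinates, scaffold names and function name are hypothetical, strand is ignored for simplicity, and the simple scan is only meant for illustration, not for genome-scale files.

```python
from collections import defaultdict


def unmatched_loci(maker_genes, other_genes):
    """Return loci in other_genes that overlap no MAKER gene on the same scaffold.

    Each gene is a (scaffold, start, end) tuple with 1-based inclusive coordinates.
    """
    by_scaffold = defaultdict(list)
    for scaffold, start, end in maker_genes:
        by_scaffold[scaffold].append((start, end))

    missing = []
    for scaffold, start, end in other_genes:
        # Two inclusive intervals overlap when each starts before the other ends.
        overlaps = any(
            m_start <= end and start <= m_end
            for m_start, m_end in by_scaffold.get(scaffold, [])
        )
        if not overlaps:
            missing.append((scaffold, start, end))
    return missing


# Hypothetical loci: one shared gene, two found only by the transcript-based set
maker_set = [("scaffold1", 100, 500), ("scaffold2", 1000, 2000)]
transdecoder_set = [
    ("scaffold1", 450, 900),
    ("scaffold1", 2000, 2500),
    ("scaffold3", 10, 90),
]

missing = unmatched_loci(maker_set, transdecoder_set)
print(missing)
```

The loci returned this way are the candidates that would need checking against the evidence before being added.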

Best regards
Arne

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. Currently I see two options:

1. Provide the cufflinks output as EST-gff
2. Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,

Rens Holmer



From kai.kamm at ecolevol.de  Thu Mar  5 09:47:02 2015
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Thu, 05 Mar 2015 17:47:02 +0100
Subject: [maker-devel] Better resolve conflicting gene models
Message-ID: <54F88886.9010004@ecolevol.de>

Hello, thanks for your previous advice.

(Btw, how can one reply to an existing thread such that the reply will 
be added to the same thread?)


I am trying to find the best parameters with Maker for the annotation of 
my genome. I have run Maker with several combinations of parameters and 
predictors on my three biggest scaffolds and looked at the results in 
Jbrowse. Overall most predictions seem fine, but there are some genes 
with conflicts and I have no idea why.

I have:

- 100Mb assembled genome
- Trinity RNAseq assembly
- cufflinks data (in my case it doesn't seem to be as messy as suggested, 
but rather a good complement to the trinity data)
- protein evidence (related and unrelated species)
- repeat library from repeat modeler


Gene predictors used:

- Augustus trained with transcripts from related species: seems to 
perform fine

- SNAP: no convergence with Augustus even after second training. Dropped 
it because it predicted lots of additional low quality transcripts and 
sometimes disrupted final Maker transcripts.

- Genemark: converged with Augustus after training (introns received 
from TopHat2 output). Tends to predict some additional transcripts 
(compared to Augustus). Few (but some) of these are covered by evidence 
and thus become final Maker transcripts.


So the combination of Augustus and Genemark seems optimal. In general 
both perform well in Maker and tend to predict the same transcripts.

However, I still observe some problems in the behavior of Maker which I 
don't understand:

Example 1: One of the predictors predicts a small additional exon at the 
start which is also covered by protein or EST data. But sometimes Maker 
chooses the other predictor's model for the final transcript. Mostly 
these are minor differences but I don't understand this behavior?

Example 2: there are some extreme cases like an Augustus prediction with 
17 exons which are all covered by Trinity and cufflinks isoforms. 
Genemark instead predicts two separate small genes with 2 and 4 exons 
respectively. The resulting final transcript has 7 exons and the 
additional evidence from the trinity and cufflinks data is treated as UTR.


So I thought Augustus seems a little more accurate and ran Maker only 
with Augustus to resolve such conflicts, even though I would lose the 
few additional transcripts from Genemark.

This is what happened:

- The gene in Example 2 now has all the 17 exons. This is good!

- Sadly another gene with several exons, which was formerly predicted by 
both Augustus and Genemark and is also covered by cufflinks and trinity 
transcripts, now consists only of two small exons in the final 
transcript. Even though Augustus still predicts the same exons and the 
same evidence is present - only the Genemark prediction is absent which 
was almost identical to Augustus. This I completely don't understand.

I don't worry about the minor differences. The extreme cases are like 
two genes in a hundred and I don't understand the behavior. I was 
thinking that in case of conflicting models Maker will choose the one 
that best fits the evidence. Obviously with most conflicts this is what 
happens, because the majority of the final models look OK. But not the 
above mentioned cases and I don't understand why?

Is there any parameter I missed to better resolve such conflicts?

Best


From bmoore at genetics.utah.edu  Thu Mar  5 17:20:52 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Fri, 6 Mar 2015 00:20:52 +0000
Subject: [maker-devel] Maker Software Question
In-Reply-To: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
References: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
Message-ID: <E682D1C1-B792-498E-88C9-D9349E9548C8@genetics.utah.edu>

Hi Chris,

I don't know if others have responded to you already, but I just found this e-mail unanswered in the depths of my inbox, so apologies for the lack of a reply.

I'm going to forward your e-mail to the full MAKER mailing list so that you'll get the benefit of responses from the full group of MAKER developers.

MAKER does include a couple of scripts that add functional annotation to MAKER GFF3 output.  This process is described in the recent paper:

Campbell MS, Holt C, Moore B, Yandell M. Genome Annotation and Curation Using
MAKER and MAKER-P. Curr Protoc Bioinformatics.

http://www.ncbi.nlm.nih.gov/pubmed/25501943

Mike, do you have a PDF of the final print version that you could send directly to Christopher?

B

On Jan 16, 2015, at 8:38 AM, Seabury, Christopher <CSeabury at cvm.tamu.edu> wrote:

Dear Colleagues,

I would like to quickly ask about a specific routine/possible function in MAKER.
Previously, we have essentially made home-made versions of MAKER by way of
multi-step programming. At present we are exploring MAKER but are wondering
if MAKER has the ability to populate the GFF with gene/protein ID information.
As an initial experiment, we gave MAKER cDNA RefSeqs, plus corresponding protein RefSeqs,
and a reference, but do not see the gene/protein ID in the GFF. Is there a subroutine
for this, or an option we have missed?


Thanks and Kind Regards,


Christopher M. Seabury PhD
Associate Professor
Department of Veterinary Pathobiology
College of Veterinary Medicine
Texas A&M University
College Station, TX 77843-4467
cseabury at cvm.tamu.edu
Mobile: 979-492-6400


From bmoore at genetics.utah.edu  Mon Mar  9 12:12:10 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Mon, 9 Mar 2015 18:12:10 +0000
Subject: [maker-devel] Does the maker google forum works? -[Doubt]
	maker2zff line 109
In-Reply-To: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
References: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
Message-ID: <13826169-DFBB-4E32-AFB1-A01CB3EEE0A4@genetics.utah.edu>

Hi Javier,

The Google forum is only a mirror of the actual mailing list that allows it to be archived by Google (better Google search results for MAKER users), so posting is not allowed there.  Please join the official MAKER mailing list at:

http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Thanks,

B

On Mar 9, 2015, at 12:07 PM, FRANCISCO JAVIER SANCHEZ GARCIA <javiersg at um.es> wrote:


Good morning. I am having a problem writing messages in the MAKER help forum on Google Groups. I don't know if it is my problem or, on the contrary, a problem with the permissions, but I can't see the red button for new threads. Anyway, I will try to describe my problem with maker2zff, which does not work. My version is v2.31.8.

../maker/marker_v2.31.8/maker/bin/maker2zff   ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, <GFF> line 1922870.

I read something about problematic characters in the ID, but I don't know if that applies to my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html


Regards.
Thanks in advance.


Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo 30100
Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
      +34 868 888 031 (laboratory-work phone)

http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014


From javiersg at um.es  Mon Mar  9 16:27:00 2015
From: javiersg at um.es (FRANCISCO JAVIER SANCHEZ GARCIA)
Date: Mon, 09 Mar 2015 23:27:00 +0100
Subject: [maker-devel] [Doubt] maker2zff line 109
Message-ID: <20150309232700.Horde.39Aj_iELfhGei1Bv5JG1fw6@webmail.um.es>

Good night everyone. I will try to describe my problem with maker2zff, which
does not work. My version is v2.31.8. The last line of the GFF file is
the line at which the error message says that it cannot find the file or
directory.

../maker/marker_v2.31.8/maker/bin/maker2zff ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, <GFF> line 1922870.

I read something about problematic characters in the ID, but I don't
know if that applies to my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html

Regards.
Thanks in advance.

Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo 30100
Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
       +34 868 888 031 (laboratory-work phone)

http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014

From carsonhh at gmail.com  Thu Mar 12 13:50:44 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 13:50:44 -0600
Subject: [maker-devel] Better resolve conflicting gene models
In-Reply-To: <54F88886.9010004@ecolevol.de>
References: <54F88886.9010004@ecolevol.de>
Message-ID: <7B8D93E6-DEE7-43D7-BFBF-38E474311A6C@gmail.com>

Sorry for the slow reply.


> how can one reply to an existing thread such that the reply will be added to the same thread?

Threads show up under the Google archive, but that is really just an artifact of how things are archived. The mailing list itself doesn't have threads, just a subject line like any other e-mail. But because we are archiving to a message board, any messages with the same subject get turned into parts of the same thread.


> Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictors model for the final transcript. Mostly these are minor differences but I don't understand this behavior?

The gene chosen by MAKER is the one that best matches the evidence.  This is a numeric value called AED (lower means better match).  If a model predicts an exon or a base pair that is not supported by an evidence alignment, then it will be penalized.  If a model fails to predict a base pair that is supported by evidence, then it will also be penalized.  The differences in your results are because you changed something between runs: either you added or removed a predictor (which gives MAKER a different set of models to choose from) or the evidence changed (which gives a different AED score).  Whichever model has the lowest AED score will always be kept, so some change you made resulted either in a different set of available models to choose from or in a different AED score.

Also be aware that models cannot overlap on the same strand, so if you made a change that results in a model being removed from consideration, you may have removed overlap that was precluding another model with a slightly higher AED from being chosen.
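
[Editorial sketch] The penalty described above can be illustrated at the base-pair level. This is a minimal sketch, assuming the commonly described form AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases the model covers and SP is the fraction of model bases supported by evidence. The function names and toy coordinates are hypothetical, and MAKER's own calculation may differ in details such as how evidence types are combined.

```python
def covered_positions(intervals):
    """Return the set of base positions covered by 1-based inclusive intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def aed(model_exons, evidence_alignments):
    """Base-pair-level annotation edit distance (0 = full agreement, 1 = none)."""
    model = covered_positions(model_exons)
    evidence = covered_positions(evidence_alignments)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # evidence bases recovered by the model
    specificity = shared / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Toy coordinates (assumed): the evidence supports a small first exon.
evidence = [(100, 150), (300, 500)]
model_a = [(100, 150), (300, 500)]              # predicts the small first exon
model_b = [(300, 500)]                          # misses it, so it is penalized
model_c = [(100, 150), (300, 500), (600, 650)]  # adds an unsupported exon

for name, model in [("a", model_a), ("b", model_b), ("c", model_c)]:
    print(name, round(aed(model, evidence), 4))
```

In this toy case the model matching the evidence exactly scores lowest, while both the missing exon and the unsupported exon raise the score.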

> 
> Example 2: there are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR.
> 
> - Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript. 
> Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent which was almost identical to Augustus. This I completely don't understand.

The model chosen will always be the one with the lowest AED.  The changes you describe are all based off of a model with lower AED supplanting another, which may have opened up a region that was previously overlapped by a longer gene with a poor AED score.
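
[Editorial sketch] The interaction between "lowest AED wins" and "no overlap on the same strand" can be illustrated with a simple greedy selection. This is only a sketch under stated assumptions: the model names, coordinates and AED values are invented, and the greedy procedure is a stand-in for illustration, not MAKER's actual implementation.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    strand: str
    start: int
    end: int
    aed: float


def overlaps(a: Model, b: Model) -> bool:
    """True if two models share at least one base on the same strand."""
    return a.strand == b.strand and a.start <= b.end and b.start <= a.end


def choose_models(candidates):
    """Keep models in order of increasing AED, skipping any that overlap a kept one."""
    kept = []
    for model in sorted(candidates, key=lambda m: m.aed):
        if not any(overlaps(model, other) for other in kept):
            kept.append(model)
    return sorted(kept, key=lambda m: m.start)


# Hypothetical candidates for one locus (all values assumed).
augustus_long = Model("augustus_17_exons", "+", 1000, 9000, 0.10)
genemark_a = Model("genemark_2_exons", "+", 1000, 2000, 0.30)
genemark_b = Model("genemark_4_exons", "+", 6000, 9000, 0.05)

with_genemark = choose_models([augustus_long, genemark_a, genemark_b])
augustus_only = choose_models([augustus_long])
print([m.name for m in with_genemark])
print([m.name for m in augustus_only])
```

Under these assumed scores, a short model with a slightly better AED blocks the long model, which in turn frees the rest of the region for another short model. Dropping a predictor changes the candidate set and therefore the outcome.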

I would also recommend not including cufflinks output if you have trinity data.  Cufflinks results often falsely bridge regions and will cause gene mergers by falsely extending evidence clusters. This makes it appear as if genes should extend into regions where they shouldn't.  Given those hints, predictors will then try and predict at least something in those regions, and if they do the models will get a good AED score since they are supported by evidence even if it is false evidence.

--Carson


From carsonhh at gmail.com  Thu Mar 12 14:03:11 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 14:03:11 -0600
Subject: [maker-devel] maker-devel post from avhoeck@sckcen.be requires
	approval
In-Reply-To: <mailman.5.1426178330.13228.maker-devel_yandell-lab.org@box290.bluehost.com>
References: <mailman.5.1426178330.13228.maker-devel_yandell-lab.org@box290.bluehost.com>
Message-ID: <73D91049-1CB8-4C59-A9E5-58374FCB0789@gmail.com>

Hi Arne,

The genes found by CEGMA are very short genes, so where those genes might be identifiable at least partially on shorter contigs other genes will be far longer.  So the 87% value you get from CEGMA is probably what you are going to be able to find (even partially) for the entire genome. Remember that gene size when you include the introns can actually be quite large, and gene predictors need flanking signals from the sequence several hundred bp upstream and downstream to make a prediction, so having a gene partially contained by a contig makes that gene un-annotatable for ab initio predictors. Unless the organism has a high gene density or very few introns, you usually won't be able to find a gene on contigs smaller than ~10kb.
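
[Editorial sketch] To see how much of an assembly sits in contigs above the ~10 kb guideline, one can compute N50 and the fraction of bases in long contigs from a list of contig lengths. This is a minimal sketch: the function names and toy lengths are hypothetical, and the 10 kb default is only the rule of thumb mentioned above.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def fraction_in_long_contigs(lengths, min_length=10_000):
    """Fraction of assembled bases found in contigs of at least min_length bp."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= min_length) / total


# Toy contig lengths in bp (assumed values, not from any real assembly).
toy_lengths = [50_000, 30_000, 20_000, 8_000, 6_000, 4_000, 2_000]
print(n50(toy_lengths), fraction_in_long_contigs(toy_lengths))
```

Running this on the real contig lengths shows how much sequence would be excluded from ab initio prediction if short contigs were dropped.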

--Carson


> On Mar 12, 2015, at 10:38 AM
> 
> From: Van Hoeck Arne <avhoeck at SCKCEN.BE>
> To: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
> Subject: TACC lonestar and N50 value
> Date: March 12, 2015 at 10:38:42 AM MDT
> 
> 
> Dear MAKER developer,
> 
> We have a plant genome of about 450 Mbp with an N50 value of 20 kbp whereas only 3/4 (333 Mbp) are contigs longer than 10 kbp. CEGMA said that 87% of the genes were found, whereas 94 % were partial identified.  You said last time that contigs smaller than 10kbp are not ideal for annotating and preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (is concatenating all the small contigs together with 500 N's between each contig an option?)
> 
> Besides, i could run succesfully Maker via iplant's atmoshpere. However, for my large genomes i registred myself at the TACC lonestar cluster but Dave C. replied that i won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my loginID, i haven't any acces to the cluser via ssh since my permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?
> 
> Best regards
> Arne

From avhoeck at SCKCEN.BE  Thu Mar 12 10:38:42 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 12 Mar 2015 16:38:42 +0000
Subject: [maker-devel] TACC lonestar and N50 value
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, whereas only 3/4 (333 Mbp) is in contigs longer than 10 kbp. CEGMA reported that 87% of the genes were found, whereas 94% were partially identified.  You said last time that contigs smaller than 10 kbp are not ideal for annotating and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)

Besides, I could successfully run MAKER via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I don't have any access to the cluster via ssh since permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne



From mtollis at asu.edu  Fri Mar 13 15:48:46 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Fri, 13 Mar 2015 14:48:46 -0700
Subject: [maker-devel] Question about pre-masked genome.
Message-ID: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>

Hello,
I have a genome that has been well-annotated for species-specific repeats
(using RepeatModeler, trf, and RepeatMasker), and was wondering if I could
simply use the soft-masked assembly for maker annotation, in order to skip
the sometimes cumbersome repeatmasking step. Is this generally looked upon
as doable, or should I just stick with using the unmasked assembly and the
repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could
not confirm that the masked assembly (i.e., lowercase letters) was
maintained in the scaffolds as viewed in the maker output directory.
Thanks,
Marc

-- 
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu

*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com

From dence at genetics.utah.edu  Fri Mar 13 18:14:52 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sat, 14 Mar 2015 00:14:52 +0000
Subject: [maker-devel] Question about pre-masked genome.
In-Reply-To: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>
References: <CAM2H017obfWB6atrBTG+mrEMS-a=5teSU_iormAmUNBQV4HMyw@mail.gmail.com>
Message-ID: <51DE3D97-10DB-4FF9-A793-EEA051C705BD@genetics.utah.edu>

Hi Marc, Several groups have been doing what you described with fungal genomes for a while and it seems to be working for them.  With the softmasked genome, maker will still allow EST alignments to extend into the masked regions; if it were a hard masked genome, this wouldn't be possible.

Let us know how it works out though!
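
[Editorial sketch] One way to check whether soft-masking (lowercase bases) survives into a FASTA file is to measure the lowercase fraction per sequence and compare the input assembly with the scaffold FASTA in the output directory. This is a minimal sketch: the function name and the inline example records are hypothetical, and it says nothing about how MAKER itself treats case internally.

```python
import io


def softmasked_fraction(fasta_handle):
    """Return {sequence_id: fraction of lowercase bases} for a FASTA stream."""
    counts = {}
    current = None
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence ID.
            current = (line[1:].split() or [""])[0]
            counts[current] = [0, 0]  # [lowercase bases, total bases]
        elif current is not None:
            counts[current][0] += sum(1 for base in line if base.islower())
            counts[current][1] += len(line)
    return {
        name: (low / total if total else 0.0)
        for name, (low, total) in counts.items()
    }


# Inline toy records stand in for a real file opened with open(path).
example = io.StringIO(">scaffold_1 demo\nACGTacgtNN\nacgtACGT\n>scaffold_2\nACGTACGT\n")
result = softmasked_fraction(example)
print(result)
```

If the fractions for the same scaffold differ between the input and output files, the masking was not carried through unchanged.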

Thanks,
Daniel


On Mar 13, 2015, at 3:48 PM, Marc Tollis <mtollis at asu.edu> wrote:

Hello,
I have a genome that has been well-annotated for species-specific repeats (using RepeatModeler, trf, and RepeatMasker), and was wondering if I could simply use the soft-masked assembly for maker annotation, in order to skip the sometimes cumbersome repeatmasking step. Is this generally looked upon as doable, or should I just stick with using the unmasked assembly and the repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could not confirm that the masked assembly (i.e., lowercase letters) was maintained in the scaffolds as viewed in the maker output directory.
Thanks,
Marc

--
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu

website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com

From mtollis at asu.edu  Sun Mar 15 08:19:37 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Sun, 15 Mar 2015 07:19:37 -0700
Subject: [maker-devel] control file for SNAP training
Message-ID: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>

This is a question about process, and to make sure I am doing things right
(when time is of the essence, some mistakes can set you back weeks).

I have run maker on my de novo vertebrate genome, using only the predictive
proteome from a congener (well-studied and available on Ensembl), and
generated the HMM for the first round of SNAP training. As per the 2014
tutorial, I edited the control file for this step as follows: I added the
path to the .hmm file, and set protein2genome to 0.

When I run maker, I notice that in addition to snap, it is still running
blastx and exonerate however. I noticed that this is because I did not
remove (or "comment out") the path to the protein.fa in the control file
(the output looks markedly different when I do comment out the protein file
- and I can't even tell if it's running snap in this instance).

Is it simply using exonerate to place the ab initio predictions on the
scaffolds (meaning that having protein2genome=1 is to tell maker to make
evidence annotations) ? Did I do this correctly, or should I also remove
the protein.fa out of the control file for SNAP training?
-- 
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu

*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com

From steinj at cshl.edu  Mon Mar 16 07:29:36 2015
From: steinj at cshl.edu (Stein, Joshua)
Date: Mon, 16 Mar 2015 13:29:36 +0000
Subject: [maker-devel] TACC lonestar and N50 value
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
Message-ID: <0A0DBB88-EC95-43B7-AA75-B7858B67A8F4@cshl.edu>

Hi Arne,

I have experience with iPlant resources and with MAKER-P.  I would encourage you to try the Atmosphere image (7888b8e1-c006-4794-82d9-4c940ddbf4c6).  You can request a large instance (up to 16 CPUs and 128 GB memory) and run in MPI mode to distribute the work.  Please see this tutorial, which includes information on running in MPI mode:  https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial.

You can also access the TACC Lonestar installation using the iPlant Discovery Environment.  There is an app called "MAKER-P-Lonestar-Small-Genomes 2.3".  Although it is advertised as appropriate for "small" genomes, I think there is a good chance that it will work for 450 Mb.  This is a new resource and the iPlant team would value any feedback and benchmarks on how the system is working.  Depending how this goes there are plans to roll-out additional apps intended for larger genomes.  Here is a tutorial: https://pods.iplantcollaborative.org/wiki/display/sciplant/Tutorial+for+running+MAKER-P+on+TACC-Lonestar+from+iPlant+Discovery+Environment

Regarding contig sizes, though not ideal, you can include contigs smaller than 10kbp in your run.  Plant genes tend to be more compact than vertebrate genes so you ought to be able to recover annotations on the smaller contigs, though keep an eye out for truncated genes.

Best,
Josh


On Mar 13, 2015, at 6:06 PM, Van Hoeck Arne <avhoeck at SCKCEN.BE> wrote:

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 value of 20 kbp whereas only 3/4 (333 Mbp) are contigs longer than 10 kbp. CEGMA said that 87% of the genes were found, whereas 94 % were partial identified.  You said last time that contigs smaller than 10kbp are not ideal for annotating and preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (is concatenating all the small contigs together with 500 N's between each contig an option?)

Besides, i could run succesfully Maker via iplant's atmoshpere. However, for my large genomes i registred myself at the TACC lonestar cluster but Dave C. replied that i won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my loginID, i haven't any acces to the cluser via ssh since my permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne



Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From mtollis at asu.edu  Tue Mar 17 15:26:44 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Tue, 17 Mar 2015 14:26:44 -0700
Subject: [maker-devel] control file for SNAP training
In-Reply-To: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
References: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
Message-ID: <CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>

I answered my own question:
No need to re-align proteins again - takes too long.
So, I used the gff file from the gff_merge on the log file from the first
run (the one with just protein2genome). Then, after generating the .hmm
file, I put it in my control file, along with protein2genome=0, removed the
protein.fasta, set maker_gff and protein_pass=1. The output now shows that
only snap is running, and no blastx and exonerate - a relief because it is
much faster!
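
[Editorial sketch] The control-file changes described above can be applied with a small script. This is a minimal sketch: `set_ctl_options`, the abridged template and the two file names are hypothetical, while the option names (maker_gff, protein_pass, protein, snaphmm, protein2genome) are the ones named in this message. A real maker_opts.ctl contains many more options, which this function leaves untouched.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Rewrite `key=value` lines for the given keys, keeping trailing #comments."""
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            out.append(f"{key}={remaining.pop(key)} {comment}".rstrip())
        else:
            out.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Abridged stand-in for maker_opts.ctl (comments shortened, most options omitted).
template = """\
maker_gff= #MAKER derived GFF3 file
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
protein= #protein sequence file in fasta format
snaphmm= #SNAP HMM file
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
"""

second_round = set_ctl_options(template, {
    "maker_gff": "first_round.all.gff",  # hypothetical merged GFF3 from round one
    "protein_pass": "1",                 # reuse protein alignments from that GFF3
    "protein": "",                       # no FASTA, so no new blastx/exonerate
    "snaphmm": "first_round.hmm",        # hypothetical SNAP HMM path
    "protein2genome": "0",
})
print(second_round)
```

Raising an error for unknown keys is a deliberate choice here, so that a misspelled option name does not silently leave the control file unchanged.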

On Sun, Mar 15, 2015 at 7:19 AM, Marc Tollis <mtollis at asu.edu> wrote:

> This is a question about process, and to make sure I am doing things right
> (when time is of the essence, some mistakes can set you back weeks).
>
> I have run maker on my de novo vertebrate genome, using only the
> predictive proteome from a congener (well-studied and available on
> Ensembl), and generated the HMM for the first round of SNAP training. As
> per the 2014 tutorial, I edited the control file for this step as follows:
> I added the path to the .hmm file, and set protein2genome to 0.
>
> When I run maker, I notice that in addition to snap, it is still running
> blastx and exonerate however. I noticed that this is because I did not
> remove (or "comment out") the path to the protein.fa in the control file
> (the output looks markedly different when I do comment out the protein file
> - and I can't even tell if it's running snap in this instance).
>
> Is it simply using exonerate to place the ab initio predictions on the
> scaffolds (meaning that having protein2genome=1 is to tell maker to make
> evidence annotations) ? Did I do this correctly, or should I also remove
> the protein.fa out of the control file for SNAP training?
> --
> *Marc Tollis, Ph.D.*
> *Post-Doctoral Research Associate*
> *Arizona State University*
> *LSE 313*
> *(480) 965-7456*
> marc.tollis at asu.edu
>
> *website: *https://sites.google.com/site/tollisresearch/
> *blog: *anolistollis.wordpress.com
>


-- 
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu

*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com

From carsonhh at gmail.com  Tue Mar 17 20:47:50 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 17 Mar 2015 20:47:50 -0600
Subject: [maker-devel] control file for SNAP training
In-Reply-To: <CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>
References: <CAM2H017g7qu0qMmA3MruSPtVC66Pr06KcTrRSpT1VVwntF=hEQ@mail.gmail.com>
	<CAM2H0153aztqxuxjeGY8-axjM+Cjw1TYs-zkGVf5o1Z=XEo8HA@mail.gmail.com>
Message-ID: <AADABBD3-04F1-49BF-B261-4B316EF60D2B@gmail.com>

You can also add an additional protein file if you want more evidence than is already in the GFF3. This makes re-annotation and adding evidence to an already annotated genome really easy.

--Carson



From Brian.Mack at ARS.USDA.GOV  Fri Mar 20 07:17:09 2015
From: Brian.Mack at ARS.USDA.GOV (Mack, Brian)
Date: Fri, 20 Mar 2015 13:17:09 +0000
Subject: [maker-devel] est2genome wrong strand
Message-ID: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>

Hi, I've noticed what seems to be a flipping of the strand of some of my transcripts from est2genome. I assembled my directional RNA-seq reads using Trinity. I've copied an example below. The blastn within MAKER shows the transcript aligning on the positive strand, as does blastn against my genome in SequenceServer. But est2genome shows the strand to be negative. I've noticed this for quite a few transcripts while examining them in WebApollo. Any ideas what might be causing this?


Thanks,

Brian


Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496
>contig_69
Length=108040

 Score =  1043 bits (1156),  Expect = 0.0
 Identities = 589/592 (99%), Gaps = 3/592 (1%)
 Strand=Plus/Plus

Query  24      TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  83
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105546  TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  105605

Query  84      CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  142
               |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105606  CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  105665


69           blastn    expressed_sequence_match         105546  106137  559        +             .               ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3
69           blastn    match_part         105546  106137  559        +             .               ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359
69           est2genome       expressed_sequence_match         105546  106137  2909      -              .               ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3
69           est2genome       match_part         105546  106137  2909      -              .               ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359

From carsonhh at gmail.com  Fri Mar 20 08:54:28 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Mar 2015 08:54:28 -0600
Subject: [maker-devel] est2genome wrong strand
In-Reply-To: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>
References: <A107420D48B33B4B8347F0DEFE1C947FB5F587@001FSN2MPN3-122.001f.mgd2.msft.net>
Message-ID: <C32539B3-CF24-4C99-9897-605FE8C8CCB8@gmail.com>

Hi Brian,

Multi-exon ESTs are stranded by their splice sites, which can only work on one strand. Single-exon ESTs, on the other hand, are stranded by their open reading frame. The standard chemistry used in most sequencing does not allow for strand-specific sequence. There are technologies that do, but unless you used one of those, re-stranding single-exon ESTs based on ORF is the best way to figure out where single-exon alignments should go (not a completely reliable method, but it works most of the time).

I believe Trinity tries to determine strand the same way (but it is unreliable there too). For example, since your alignment is not an exact match to the genomic sequence (single bp deletion in the alignment), the best open reading frame in the transcript is not the same as the best open reading frame in the genomic sequence (so one of them likely contains an error). MAKER, since it is annotating the genome, logically re-strands the alignment to the genome, whereas Trinity, being unaware of the genome, strands it to the transcript.

Because single-exon alignments are very unreliable, they are ignored in MAKER by default. They will not be used as hints for gene predictors and can only be used to support a gene if there is also protein evidence or a single-exon ab initio prediction at the same location to support it (even then this will only happen if you set single_exon=1 in the control files).
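
To make the idea of ORF-based stranding concrete, here is a small Python
sketch. It is an illustration only and is not MAKER's or Trinity's actual
code: it assumes that "best ORF" can be approximated by the longest
ATG-to-stop stretch (or ATG-to-end for partial sequences) over the three
frames of each strand, and the function names are made up for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Length in bp of the longest ORF in the three forward frames.

    An ORF here starts at ATG and ends at the first in-frame stop codon,
    or at the end of the sequence if no stop is reached.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                best = max(best, i - start)
                start = None
        if start is not None:
            # ORF runs off the end of the sequence (partial transcript)
            best = max(best, len(seq) - start)
    return best


def strand_by_orf(seq):
    """Return '+', '-', or '.' depending on which strand has the longer ORF."""
    fwd = longest_orf(seq)
    rev = longest_orf(revcomp(seq))
    if fwd == rev:
        return "."
    return "+" if fwd > rev else "-"


# Toy single-exon transcript: ATG, ten GCT codons, then a TAA stop.
toy = "ATG" + "GCT" * 10 + "TAA"
print(strand_by_orf(toy), strand_by_orf(revcomp(toy)))
```

A single inserted or deleted base shifts the frame and can change which
strand wins, which is why the transcript and the genomic sequence can
disagree as described above.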

--Carson



From xvazquezc at gmail.com  Sat Mar 21 21:27:27 2015
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez_Campos?=)
Date: Sun, 22 Mar 2015 14:27:27 +1100
Subject: [maker-devel] annotation stats: repeats
Message-ID: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>

Hi all,

I was wondering how I can get data about the repeat content of the genome
from MAKER, if possible, broken down by repeat type: RE, transposons,
simple repeats, and low-complexity repeats.

Thank you in advance,

Xabier

-- 
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From dence at genetics.utah.edu  Sat Mar 21 23:56:06 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sun, 22 Mar 2015 05:56:06 +0000
Subject: [maker-devel] annotation stats: repeats
In-Reply-To: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>
References: <CAL0hg4F-vSqj_ut=+6n9=e_PpabufTXs93jNNDhTCaGSfoTvxQ@mail.gmail.com>
Message-ID: <4FFB960C-CF1A-440F-B767-36EC369D2B58@genetics.utah.edu>

Hi Xabier, MAKER offers several options for repeat masking. There's a file of repetitive-element proteins that is run with RepeatRunner, and you can also run RepeatMasker with a custom library and/or one of the default RepeatMasker libraries.

The results from these programs are in the GFF3 files that MAKER generates. Depending on the library that you give RepeatMasker, the repeats might be classified by transposable element family and as simple vs. complex repeats. These are probably a good place to start, but you'll have to extract that data from the GFF3 file and compile it.
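
As a starting point for that extraction, here is a rough Python sketch that
totals masked bases per repeat class from GFF3 lines. It rests on several
assumptions that should be checked against your own file: that repeat
features carry `repeatmasker` or `repeatrunner` in the source column, that
the top-level feature type is `match`, and that the class appears in the
`Name` attribute as `species:...|genus:...`. The function names are made up
for this example.

```python
from collections import defaultdict
from urllib.parse import unquote

# Assumed source-column labels for repeat evidence; adjust to your file.
REPEAT_SOURCES = {"repeatmasker", "repeatrunner"}


def repeat_class(attributes, default):
    """Pull the 'genus:' part out of the Name attribute, else use default."""
    for field in attributes.split(";"):
        if field.startswith("Name="):
            for part in unquote(field[5:]).split("|"):
                if part.startswith("genus:"):
                    return part[len("genus:"):]
    return default


def merged_length(intervals):
    """Total bases covered by 1-based inclusive intervals, overlaps counted once."""
    total, cur_end = 0, 0
    for start, end in sorted(intervals):
        if start > cur_end:
            total += end - start + 1
            cur_end = end
        elif end > cur_end:
            total += end - cur_end
            cur_end = end
    return total


def tally_repeats(gff_lines):
    """Return {repeat class: masked bp} summed over all sequences."""
    spans = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] not in REPEAT_SOURCES or cols[2] != "match":
            continue
        cls = repeat_class(cols[8], default=cols[1])
        spans[(cls, cols[0])].append((int(cols[3]), int(cols[4])))
    totals = defaultdict(int)
    for (cls, _seqid), intervals in spans.items():
        totals[cls] += merged_length(intervals)
    return dict(totals)


# Made-up GFF3 lines for illustration only.
example = [
    "##gff-version 3",
    "ctg1\trepeatmasker\tmatch\t1\t10\t.\t+\t.\tID=r1;Name=species:%28TTAGGG%29n|genus:Simple_repeat",
    "ctg1\trepeatmasker\tmatch\t5\t15\t.\t+\t.\tID=r2;Name=species:%28TTAGGG%29n|genus:Simple_repeat",
    "ctg1\trepeatrunner\tmatch\t100\t199\t.\t-\t.\tID=r3;Name=te_protein_1",
    "ctg1\tmaker\tgene\t300\t900\t.\t+\t.\tID=g1",
]
print(tally_repeats(example))
```

Dividing each total by the assembly length gives the fraction of the genome
in each class; pass an open file handle in place of the example list to run
it on a real GFF3.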

Let us know whether that helps.

Thanks,
Daniel



From panos.ioannidis at gmail.com  Tue Mar 24 02:29:14 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 09:29:14 +0100
Subject: [maker-devel] Augustus retraining
Message-ID: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>

Hello All,

I'm trying to retrain Augustus using EST data from the same species and
realized that quite a few of the gene models I get based on EST data are
incomplete (i.e. no start and/or stop codon).

Now, when I get to the "etraining" step in Augustus retraining (right after
the time-consuming "optimize_augustus.pl" step), I get a warning for each
gene that doesn't contain a start or stop codon.

.....
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1
transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon
does not begin with start codon but with acg
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1
transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon
doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
....

Does anyone know whether training is compromised by such incomplete gene
models? Do you usually exclude them from the training set?

Oh, and by the way, the best guide to retraining Augustus is here
<http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>.
The official
<http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html> web
page isn't bad, but it doesn't explain certain things in detail.

Thanks,
Panos

From xvazquezc at gmail.com  Tue Mar 24 06:06:25 2015
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez_Campos?=)
Date: Tue, 24 Mar 2015 23:06:25 +1100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
Message-ID: <CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>

Hi Panos,

Have you tried using webAugustus for the (re)training? I found it very
convenient for generating the models for Augustus.

Cheers,



-- 
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From panos.ioannidis at gmail.com  Tue Mar 24 06:24:45 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 13:24:45 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
Message-ID: <CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>

Hi Xabier,

Thanks for your quick reply!

No, I haven't used WebAugustus, but I just checked it out and it looks like
my training set is too big (~300 Mbp), so I can't even upload it!

Anyway, I prefer to train it locally because I have better control over
each step. Also, I have done the entire training procedure with fewer genes,
but didn't get a good gene-level sensitivity (~5%). So now I'm trying to
repeat it using more of my scaffolds, but it appears that I get a lot more
incomplete models from exonerate (run through MAKER).

P



From carsonhh at gmail.com  Tue Mar 24 08:14:51 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:14:51 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
Message-ID: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

Hi Panos,

ESTs and mRNA-seq assemblies will by their nature be partial.  After a first round of training you can run MAKER with protein and EST evidence together with the newly trained Augustus species file.  Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what you would get from just running Augustus on its own.  Then take these gene models and use them to retrain Augustus.  This is the standard bootstrap retraining procedure, and it can be repeated as needed.

More info on bootstrap training here (the info is for SNAP, but the procedure is similar for Augustus) --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
Here is an excellent explanation of Augustus training --> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
And here is a tool to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training, so just take that output and convert it for Augustus) --> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Finally, you can also manually edit the GFF3 file in Apollo (easier with the legacy standalone version), and then convert that file for bootstrap training.

--Carson



From panos.ioannidis at gmail.com  Tue Mar 24 08:31:04 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 15:31:04 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID: <CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>

Hi Carson,

So you think it's okay to include incomplete gene models when training
Augustus?

I'll certainly try the bootstrap method you're suggesting. Even though I
did it for SNAP, for some weird reason I forgot it for Augustus :p Do you
think, however, that I can get a big improvement in gene-level sensitivity?
Currently, I have only 6%...

Thanks,
Panos


On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt <carsonhh at gmail.com> wrote:

> Hi Panos,
>
> EST?s and mRNA-seq assemblies will bey their nature be partial.  After a
> first round of training you can run MAKER together with protein and EST
> evidence and the newly trained Augustus species file.  Because MAKER gives
> hints to Augustus as it runs, the models it produces will be improved over
> what it would get from just running Augustus on it?s own.  Then take these
> gene models and use them to retrain Augustus.  This is the standard
> bootstrap retraining procedure, and can be repeated as needed.
>
> More info on bootstrap training here (info is for SNAP but procedure is
> similar to Augustus) ?>  http://weatherby.genetics.
> utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_
> Online_Training_2014#Training_ab_initio_Gene_Predictors
> Here is an excellent explanation of Augustus training ?>
> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
> and here are tools to convert SNAP training files to Augustus training
> files (MAKER comes with a tool that converts GFF3 for SNAP training so just
> take that and convert it for Augustus)?> https://github.com/
> hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
>
> Finally you can also manually edit the GFF3 file in Apollo (easier to use
> the legacy stand alone version), and then convert that file for bootstrap
> training.
>
> ?Carson
>
>
> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis <panos.ioannidis at gmail.com>
> wrote:
>
> Hi Xabier,
>
> Thanks for your quick reply!
>
> No, I haven't used WebAugustus, but I just checked it out and it looks
> like my training set is too big (~300 Mbp), so I can't even upload it!
>
> Anyway, I prefer to train it locally because I have better control over
> each step. Also, I have done the entire training procedure with less genes,
> but didn't get a good gene-level sensitivity (~5%). So now I'm trying to
> replicate it using more of my scaffolds, but as it appears I get a lot more
> incomplete models from exonerate (run through Maker).
>
> P
>
>
>
> On Tue, Mar 24, 2015 at 1:06 PM, Xabier V?zquez Campos <
> xvazquezc at gmail.com> wrote:
>
>> Hi Panos,
>>
>> Have you tried using webAugustus for the (re)training? I found it very
>> convenient for generating the models for Augustus.
>>
>> Cheers,
>>
>> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis <panos.ioannidis at gmail.com>:
>>
>>> Hello All,
>>>
>>> I'm trying to retrain Augustus using EST data from the same species and
>>> realized that quite a few of the gene models I get based on EST data are
>>> incomplete (i.e. no start and/or stop codon).
>>>
>>> Now, when I get to the "etraining" step in Augustus retraining (right
>>> after the time-consuming "optimize_augustus.pl" step), I get a warning
>>> for each gene that doesn't contain a start or stop codon.
>>>
>>> .....
>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1
>>> transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon
>>> does not begin with start codon but with acg
>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1
>>> transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon
>>> doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
>>> ....
>>>
>>> Does anyone know whether training is compromised by such incomplete gene
>>> models? Do you usually exclude them from the training set?
>>>
>>> Oh, and by the way, the best guide to retraining Augustus is here
>>> <http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>.
>>> The official
>>> <http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html> web
>>> page isn't bad, but doesn't explain in detail certain things.
>>>
>>> Thanks,
>>> Panos
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>>
>>
>>
>> --
>> Xabier Vázquez Campos
>> *PhD Candidate*
>> Water Research Centre
>> School of Civil and Environmental Engineering
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>
>

From carsonhh at gmail.com  Tue Mar 24 08:39:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:39:20 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
Message-ID: <C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>

On your first round it is fine.  It gives the predictor enough to work with, then on the second round you use improved models. When you say 6% sensitivity, is that Augustus running on its own?  If it's inside of MAKER, that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species).

Also, is that the gene level, exon level, or nucleotide level sensitivity?  If you are looking at the gene level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide level sensitivity/specificity metrics.  The gene and exon level metrics are basically meaningless (unless it's Drosophila, which is the only species annotated correctly enough to use them).
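
As a rough illustration of why the three levels can differ so much, here is a minimal Python sketch. It assumes one transcript per gene, inclusive (start, end) exon coordinates, and the gene-finding convention where "specificity" means TP / predicted; the toy coordinates and function names are made up for the example and are not Augustus's own evaluation code.

```python
def to_bases(exons):
    """Expand inclusive (start, end) exon intervals into a set of base positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def sn_sp(predicted, reference):
    """Return (sensitivity, specificity) for two sets of items.

    Sensitivity = TP / |reference|. Specificity follows the gene-finding
    convention TP / |predicted| (what other fields call precision).
    """
    tp = len(predicted & reference)
    sn = tp / len(reference) if reference else 0.0
    sp = tp / len(predicted) if predicted else 0.0
    return sn, sp


# Hypothetical toy data: one reference gene and one prediction for it.
reference_gene = [(100, 200), (300, 400), (500, 600)]
predicted_gene = [(100, 200), (300, 400), (500, 590)]  # last exon ends 10 bp early

# Nucleotide level: compare individual bases.
nuc = sn_sp(to_bases(predicted_gene), to_bases(reference_gene))
# Exon level: an exon only counts if both boundaries match exactly.
exon = sn_sp(set(predicted_gene), set(reference_gene))
# Gene level: the whole exon chain must match exactly.
gene = sn_sp({tuple(predicted_gene)}, {tuple(reference_gene)})

print("nucleotide", nuc)
print("exon", exon)
print("gene", gene)
```

In this toy case a single boundary that is 10 bp off leaves the nucleotide-level numbers near 1.0, drops the exon level to 2/3, and takes the gene level to 0, which is the same pattern as a prediction set with high nucleotide-level and very low gene-level sensitivity.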

--Carson


> On Mar 24, 2015, at 8:31 AM, Panos Ioannidis <panos.ioannidis at gmail.com> wrote:
> 
> Hi Carson,
> 
> So you think it's okay to include incomplete gene models when training Augustus?
> 
> I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%...
> 
> Thanks,
> Panos
> 
> 
> On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt <carsonhh at gmail.com> wrote:
> Hi Panos,
> 
> ESTs and mRNA-seq assemblies will by their nature be partial.  After a first round of training you can run MAKER together with protein and EST evidence and the newly trained Augustus species file.  Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what it would get from just running Augustus on its own.  Then take these gene models and use them to retrain Augustus.  This is the standard bootstrap retraining procedure, and can be repeated as needed.
> 
> More info on bootstrap training here (info is for SNAP but procedure is similar to Augustus) -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
> Here is an excellent explanation of Augustus training -> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
> and here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training so just take that and convert it for Augustus) -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
> 
> Finally, you can also manually edit the GFF3 file in Apollo (easier to use the legacy stand-alone version), and then convert that file for bootstrap training.
> 
> --Carson
> 
> 


From panos.ioannidis at gmail.com  Tue Mar 24 09:05:54 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 16:05:54 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
	<C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
Message-ID: <CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>

Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level
is 88%. I only mentioned gene-level because that's the only metric
mentioned on the Augustus web site.

I got these numbers outside of Maker. Actually, I only used Maker to
generate the gff files needed to start the training (ran it using only EST
evidence and only on a subset of my assembly, using this
<http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>
as a guide).

Now, I've started running the second round of training, as you suggested.
Since, however, I don't have data from closely related species, I'm only
using Uniref50 as protein evidence.

P


From carsonhh at gmail.com  Tue Mar 24 09:38:08 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 09:38:08 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To: <CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>
References: <CAOVgjtQzA+5-2EnsOAmx8YRKGAV8J6Wry+Cz2CCwz1nt1seKCQ@mail.gmail.com>
	<CAL0hg4F0nb8ANT7UMzS2B9qTv5hW8_mMtEhjwMgGhDChs52MfQ@mail.gmail.com>
	<CAOVgjtTZObB+0LmynOf7OZeeQp3kZBxPV3Ee8nne2uNTJM7JeQ@mail.gmail.com>
	<09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
	<CAOVgjtREvLD_tgXgMYpCN8Kvu0=0DspM3wKSgGfT+8Cz1t=25g@mail.gmail.com>
	<C8523FF0-4805-48E9-9335-43FC1AFA8445@gmail.com>
	<CAOVgjtRbskdnFW1R2YtRU64P=JECw2uQGNDhop+KuGn4G0Y6AA@mail.gmail.com>
Message-ID: <4EA5A1F6-2950-4D65-A59C-0F3848C86C02@gmail.com>

I'd pick a couple of species that are as closely related as you can find.  Proteins will still align over large evolutionary distances, but you need a breadth of protein data that UniRef and even databases like Swiss-Prot won't have (those databases are usually a little too conservative).

The gene level sensitivity/specificity values make certain assumptions about the models you are comparing with.  Unfortunately the only species these assumptions hold true for is Drosophila and maybe C. elegans to a point.  This is the reason you will rarely see species other than those two used for the gene and exon sensitivity/specificity metrics.

Thanks,
Carson




From alicebdennis at gmail.com  Thu Mar 26 04:34:26 2015
From: alicebdennis at gmail.com (Alice Dennis)
Date: Thu, 26 Mar 2015 11:34:26 +0100
Subject: [maker-devel] iterative Maker2
In-Reply-To: <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch>
	<1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID: <CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>

Hello again,

I posted a while ago about a genome I'm running through the Maker2
pipeline. I was concerned because my results were still changing with
3 and 4 iterations.

Following the very useful advice of Carson (below), I've made a few
modifications (adding a RepeatModeler run, using a big protein
database), but my gene predictions are still changing between the 3rd
and 4th iterations. Perhaps this is ok, but these increasing gene
lengths make me worry that I haven't built stable models.

Here is the short version of what I've done.
1. Run RepeatModeler, but this only produced 47 sequences in the
resulting .fasta... so that seemed a bit small.

2. Run Maker2 using:
- RepeatModeler output + "model_org=all" and "softmask=1" in the
Repeat Masking section.
- protein evidence from 2 distantly related species AND all of Uniprot
- ESTs from a different strain of my species (a parasitoid wasp)
- the .hmm from Nasonia, one of the 2 distantly related species whose
proteome I also provided as protein evidence
- my assembled genome of 1,509 scaffolds.

3. After this, I did three subsequent rounds of Maker2 (cleverly named
Rounds 2, 3 and 4). Each one used the same input, except the Nasonia
.hmm was replaced by a SNAP generated .hmm from the previous round.
Also, est2genome and protein2genome were changed from 1 to 0 in all
runs after the first.
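
The per-round changes in step 3 could be scripted along these lines. This is only a sketch: it assumes the control file uses `key=value #comment` lines, that the SNAP HMM option is `snaphmm`, and the HMM file names are hypothetical placeholders.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with the value of each key in `updates` replaced.

    Lines are expected to look like 'key=value #comment'; the trailing
    comment is kept. A KeyError is raised if a requested key is absent.
    """
    lines = ctl_text.splitlines()
    seen = set()
    for i, line in enumerate(lines):
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in updates:
            key, comment = m.group(1), m.group(3) or ""
            lines[i] = f"{key}={updates[key]} {comment}".rstrip()
            seen.add(key)
    missing = set(updates) - seen
    if missing:
        raise KeyError(f"options not found in control file: {sorted(missing)}")
    return "\n".join(lines) + "\n"


# Hypothetical excerpt of the round 1 control file.
round1 = (
    "snaphmm=nasonia.hmm #SNAP HMM file\n"
    "est2genome=1 #infer gene predictions directly from ESTs\n"
    "protein2genome=1 #infer predictions from protein homology\n"
)

# Round 2: swap in the HMM trained on round 1 output, switch off the
# evidence-only gene models.
round2 = set_ctl_options(
    round1,
    {"snaphmm": "round1_snap.hmm", "est2genome": 0, "protein2genome": 0},
)
print(round2)
```

Keeping the change to these three options makes it easier to confirm that nothing else differs between rounds.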

Here are some results:
Round1: 14,647 genes, average length 2,491
Round2: 12,158 genes, average length 3,760
Round3: 13,515 genes, average length 3,090
Round4: 12,169 genes, average length 3,918

This is a bit confusing because the number of predicted genes goes up
and down, as do their lengths. I've double-checked the dates of my
files, and they are all labeled such that I don't think anything could
be swapped.
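
One way to recompute such per-round numbers directly from each round's merged GFF3 is sketched below. It assumes the final annotations are the rows with source `maker` and type `gene`, and that gene length means end - start + 1; the toy rows and names are invented for the example.

```python
def gene_stats(gff3_lines, source="maker"):
    """Return (gene_count, mean_gene_length_bp) for 'gene' rows of one source."""
    lengths = []
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # stop at an embedded FASTA section, if present
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 5 and cols[1] == source and cols[2] == "gene":
            lengths.append(int(cols[4]) - int(cols[3]) + 1)
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    return len(lengths), mean


def row(source, ftype, start, end):
    """Build one tab-separated GFF3 row for the toy example."""
    return "\t".join(
        ["scaffold1", source, ftype, str(start), str(end), ".", "+", ".", "ID=x"]
    )


# Hypothetical toy input: two MAKER genes plus rows that should be ignored.
toy = [
    "##gff-version 3",
    row("maker", "gene", 1, 2000),
    row("maker", "mRNA", 1, 2000),
    row("est2genome", "expressed_sequence_match", 1, 900),
    row("maker", "gene", 5000, 8999),
    "##FASTA",
    ">scaffold1",
]

count, mean_len = gene_stats(toy)
print(count, mean_len)
```

Running the same function over the GFF3 from every round gives numbers that are computed the same way each time, which helps rule out a counting artifact.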

So my questions are:
Is this an indication that my models are unstable and I shouldn't
trust these predictions?
Is the decreasing number of genes, combined with their increasing length,
perhaps a good thing?
How do I know when to stop if genes keep getting longer?


Thanks very much,
Alice


On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt <carsonhh at gmail.com> wrote:
> The gene models are actually produced by SNAP, Augustus, or whatever gene
> predictor you are using, so if you change the HMM every round, then the
> models will change too.  But I have one concern.  You are using a very
> sparse protein evidence dataset.  The protein dataset is very important to
> MAKER's performance, and for iterative training of the ab initio
> predictors.  Normally after the second iteration, additional training should
> not be beneficial, but if you are getting wildly different results on 3rd
> and 4th round, then you probably aren't getting enough good models to
> train with.
>
> For a protein dataset you should be using the entire proteome from a
> minimum of two related species and perhaps all of UniProt/Swiss-Prot to get
> a broad protein database.  Don't use the proteins extracted by CEGMA and
> HaMSTr.  CEGMA can be used to guide the first HMM creation (cegma2zff script
> that comes with MAKER), but don't give the proteins to MAKER as evidence;
> also the HaMSTr results will be redundant with the ESTs.  You need proteins
> from related species to look for homology not found in the EST dataset.
>
> Also repeat masking is important for any genome and has a huge effect on ab
> initio predictor performance.  Make sure you run something like
> RepeatModeler to look for species specific repeats that will not already be
> in RepBase.  Then add those results to the rmlib= option in the maker
> control files.
>
> Thanks,
> Carson
>
>
>
>
> On Dec 12, 2014, at 7:10 AM, Dennis, Alice <Alice.Dennis at eawag.ch> wrote:
>
> Hi all,
>
> I am a relatively new user to Maker2, and I'm looking for advice on running
> many iterations of the same dataset in Maker2.
>
> I have a relatively small genome (~124 MB) from a wasp that is assembled
> into ~1,500 scaffolds. I have run several iterations of Maker2 by
> re-generating .hmms in SNAP and feeding them into the next round, and my
> gene predictions keep increasing (in number and in size).  The only thing
> that changes at each round is the .hmm.
> The evidence that I give is:
> - de novo assembled ESTs from a different strain of the same
> species (70,000 contigs; I am currently working on improving this assembly
> with the hope that this will be helpful here)
> - 610 proteins extracted from the genome scaffolds using CEGMA and
> HaMSTr
>
> For my 1st iteration, I used the Nasonia .hmm from SNAP, and the
> est2genome/protein2genome option.
>
> For the 2nd, 3rd and 4th rounds I have used .hmms generated from the
> previous round, all without the est2genome/protein2genome option. All other
> files are the same as in the original run.
>
> As I understand it, after the second round, nothing should change in Maker2.
> But the differences are obvious between runs. Some entirely new exons are
> annotated. For example, just counting "exon" in the .gff file gives me
> 73,000 after the third iteration and 96,000 after the fourth! Actually the
> biggest leap in this number is between the third and fourth round. I can
> also see that many features are longer when I look at the files in Geneious.
>
> Is this sort of change possible after the second round of Maker2? Is there
> something I have done wrong in my runs, or am I understanding this output
> incorrectly?
>
> Thank you,
> Alice
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>


-- 


Alice Dennis
alicebdennis at gmail.com

Postdoctoral Researcher
Institute for Integrative Biology, ETH Zürich & EAWAG
Überlandstrasse 133
P.O. Box 611
8600 Dübendorf, Switzerland

https://adennis5.wordpress.com/


From michael.s.campbell1 at gmail.com  Thu Mar 26 09:50:41 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 26 Mar 2015 09:50:41 -0600
Subject: [maker-devel] iterative Maker2
In-Reply-To: <CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch>
	<1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
	<CAJstU0-O1i3GxNd3EV-Y9PXRpPjqPtT1B3T5_Ttg93_kupivOg@mail.gmail.com>
Message-ID: <CAAi6vWXnyyFkTVD9tc-QGxSBCBenTy5QyTM6ReVqDveXQA0FTg@mail.gmail.com>

Hi Alice,

In my experience, fewer, longer genes are generally a good thing (and very
normal); they result from the merging of split models and the extension of
incomplete models. I find it helpful to load the annotations and evidence
into a browser to get a visual idea of what is happening.
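
As a rough complement to the browser view, the per-round numbers can be
tabulated directly from the GFF3 files. The sketch below is only an
illustration: it assumes the final MAKER models carry "maker" in the source
column (column 2), and the function name summarize_gff3 and the file name in
the usage note are hypothetical. Counting by feature type in column 3, rather
than counting every line that contains the word exon, avoids picking up
evidence features whose attributes happen to mention exons.

```python
from collections import Counter
from statistics import mean


def summarize_gff3(lines, source="maker"):
    """Count genes and exons and compute mean gene length for one source."""
    counts = Counter()
    gene_lengths = []
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != source:
            continue  # malformed line, or evidence from another source
        feature_type = cols[2]
        start, end = int(cols[3]), int(cols[4])
        counts[feature_type] += 1
        if feature_type == "gene":
            # GFF3 coordinates are 1-based and inclusive
            gene_lengths.append(end - start + 1)
    return {
        "genes": counts["gene"],
        "exons": counts["exon"],
        "mean_gene_length": mean(gene_lengths) if gene_lengths else 0.0,
    }
```

For example, with a hypothetical file name:
`with open("round3.all.gff") as fh: print(summarize_gff3(fh))`.
Running this on each round gives gene count, exon count and mean gene length
side by side, which makes it easier to see whether fewer genes coincide with
longer ones.
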

Mike

On Thu, Mar 26, 2015 at 4:34 AM, Alice Dennis <alicebdennis at gmail.com>
wrote:

> Hello again,
>
> I posted a while ago about a genome I'm running through the Maker2
> pipeline. I was concerned because my results were still changing with
> 3 and 4 iterations.
>
> Following the very useful advice of Carson (below), I've made a few
> modifications (adding a RepeatModeler run, using a big protein
> database), but my gene predictions are still changing between the 3rd
> and 4th iterations. Perhaps this is ok, but these increasing gene
> lengths make me worry that I haven't built stable models.
>
> Here is the short version of what I've done.
> 1. Run RepeatModeler, but this only produced 47 sequences in the
> resulting .fasta... so that seemed a bit small.
>
> 2. Run Maker2 using:
> - RepeatModeler output + "model_org=all" and "softmask=1" in the
> Repeat Masking section.
> - protein evidence from 2 distantly related species AND all of Uniprot
> - ests from a different strain of my species (a parasitoid wasp)
> - the .hmm from Nasonia, one of the 2 distantly related species whose
> proteome I also provided as protein evidence
> - my assembled genome of 1,509 scaffolds.
>
> 3. After this, I did three subsequent rounds of Maker2 (cleverly named
> Rounds 2, 3 and 4). Each one used the same input, except the Nasonia
> .hmm was replaced by a SNAP generated .hmm from the previous round.
> Also, est2genome and protein2genome were changed from 1 to 0 in all
> runs after the first.
>
> Here are some results:
> Round1: 14,647 genes, average length 2,491
> Round2: 12,158 genes, average length 3,760
> Round3: 13,515 genes, average length 3,090
> Round4: 12,169 genes, average length 3,918
>
> This is a bit confusing because the number of genes predicted goes up
> and down, as do their lengths. I've double-checked the dates of my
> files, and they are all labeled such that I don't think anything could
> be swapped.
>
> So my questions are:
> Is this an indication that my models are unstable and I shouldn't
> trust these predictions?
> Is a decreasing number of genes, with the genes also getting longer,
> perhaps a good thing?
> How do I know when to stop if genes keep getting longer?
>
>
> Thanks very much,
> Alice
>
>
> On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt <carsonhh at gmail.com> wrote:
> > The gene models are actually produced by SNAP, Augustus, or whatever gene
> > predictor you are using, so if you change the HMM every round, then the
> > models will change too.  But I have one concern.  You are using a very
> > sparse protein evidence dataset.  The protein dataset is very important
> > to MAKER's performance, and for iterative training of the ab initio
> > predictors.  Normally after the second iteration, additional training
> > should not be beneficial, but if you are getting wildly different results
> > on the 3rd and 4th rounds, then you probably aren't getting sufficient
> > good models to train with.
> >
> > For a protein dataset you should be using the entire proteome from a
> > minimum of two related species and perhaps all of UniProt/Swiss-Prot to
> > get a broad protein database.  Don't use the proteins extracted by CEGMA
> > and HaMSTr.  CEGMA can be used to guide the first HMM creation (the
> > cegma2zff script that comes with MAKER), but don't give the proteins to
> > MAKER as evidence; the HaMSTr results will also be redundant with the
> > ESTs.  You need proteins from related species to look for homology not
> > found in the EST dataset.
> >
> > Also repeat masking is important for any genome and has a huge effect on
> > ab initio predictor performance.  Make sure you run something like
> > RepeatModeler to look for species-specific repeats that will not already
> > be in RepBase.  Then add those results to the rmlib= option in the maker
> > control files.
> >
> > Thanks,
> > Carson
> >
> >
> >
> >
> > On Dec 12, 2014, at 7:10 AM, Dennis, Alice <Alice.Dennis at eawag.ch>
> > wrote:
> >
> > Hi all,
> >
> > I am a relatively new user of Maker2, and I'm looking for advice on
> > running many iterations of the same dataset in Maker2.
> >
> > I have a relatively small genome (~124 MB) from a wasp that is assembled
> > into ~1,500 scaffolds. I have run several iterations of Maker2 by
> > re-generating .hmms in SNAP and feeding them into the next round, and my
> > gene predictions keep increasing (in number and in size).  The only thing
> > that changes at each round is the .hmm.
> > This is the evidence that I give:
> > -          de novo assembled ESTs from a different strain of the same
> > species (70,000 contigs; I am currently working on improving this
> > assembly with the hope that this will be helpful here)
> > -          610 proteins extracted from the genome scaffolds using CEGMA
> > and HaMSTr
> >
> > For my 1st iteration, I used the Nasonia .hmm from SNAP, and the
> > est2genome/protein2genome option.
> >
> > For the 2nd, 3rd and 4th rounds I have used .hmms generated from the
> > previous round, all without the est2genome/protein2genome option. All
> > other files are the same as in the original run.
> >
> > As I understand it, after the second round, nothing should change in
> > Maker2. But the differences are obvious between runs. Some entirely new
> > exons are annotated. For example, just counting "exon" in the .gff file
> > gives me 73,000 after the third iteration and 96,000 after the fourth!
> > Actually the biggest leap in this number is between the third and fourth
> > round. I can also see that many features are longer when I look at the
> > files in Geneious.
> >
> > Is this sort of change possible after the second round of Maker2? Is
> > there something I have done wrong in my runs, or am I understanding this
> > output incorrectly?
> >
> > Thank you,
> > Alice
> >
>
>
>
> --
>
>
> Alice Dennis
> alicebdennis at gmail.com
>
> Postdoctoral Researcher
> Institute for Integrative Biology, ETH Zürich & EAWAG
> Überlandstrasse 133
> P.O. Box 611
> 8600 Dübendorf, Switzerland
>
> https://adennis5.wordpress.com/
>


-- 
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From rens.holmer at wur.nl  Mon Mar 30 00:12:20 2015
From: rens.holmer at wur.nl (Holmer, Rens)
Date: Mon, 30 Mar 2015 06:12:20 +0000
Subject: [maker-devel] Incorporating cufflinks in maker
Message-ID: <42E0168F-C672-4B7F-97D4-98442B825BF9@wur.nl>

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. Currently I see two options:

1. Provide the cufflinks output as EST-gff
2. Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc.) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,

Rens Holmer


From goutham.atla at gmail.com  Fri Mar 27 23:37:08 2015
From: goutham.atla at gmail.com (Goutham atla)
Date: Sat, 28 Mar 2015 11:07:08 +0530
Subject: [maker-devel] Annotating Cufflinks GTF with Maker
Message-ID: <CALU8LA4CwLD8qm5f==xKSjZoCw+9Ajd=RCD62LkHTdBYbuajig@mail.gmail.com>

Dear All,

I have a draft genome for my organism of interest and around 150G of
100bp paired-end RNA-Seq data from different conditions. This organism has
Ensembl annotations, but very few.

My goal is to look at differential splicing between two
conditions. For this I need good annotations in GTF format at the isoform
level. I am interested in using the Splicing Analysis Kit
<http://cbcb.umd.edu/software/spanki/>

For now, I have aligned one sample to the genome using tophat2 and then used
cufflinks to generate a de novo GTF file. In neither step did I use
the available GTF, which has very few annotations.

The GTF file generated by cufflinks should be annotated so that the
function of each transcript is known. So I am interested in adding
annotations to the GTF file generated by cufflinks. What is the best way of
doing it?

Or is there a better way of getting a GTF file, like that of Ensembl,
from my data?

I have looked at Trinotate, but it's more about functional annotation and
expression studies.


Regards,

-- 
Goutham Atla

From avhoeck at SCKCEN.BE  Mon Mar 30 10:11:16 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Mon, 30 Mar 2015 16:11:16 +0000
Subject: [maker-devel] comments on Incorporating cufflinks in maker
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF11826E1A@MAILSRV3.sck.be>

Dear Rens and Carson,
I would like to comment on Rens' question about adding transcript data to the annotation. Since I have no access to Google Groups, I am trying to contact you both by mail, hopefully with the correct addresses.

I have tested Maker-P with multiple parameter settings and optimized two gene prediction tools (SNAP and Augustus). Both give similar results, with Maker finding around 17,000 gene annotations. I also have RNA-seq samples, and as a test case I processed Cufflinks output with TransDecoder to select gene annotations. With this approach I end up with 25,000 gene annotations. More or less all the genes that Maker selected were also selected by the TransDecoder approach. So is it wise to include the information on the 8,000 missing genes, based on optimized gene models? There will probably be some false positives among these 8,000 missing genes, but with good criteria and models, it might be possible for Maker to find more gene annotations.
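
One way to make the comparison above concrete is to list the TransDecoder-derived genes that overlap no Maker gene on the same scaffold. The following is a minimal sketch under stated assumptions, not the procedure used for the numbers above: genes are given as (seqid, start, end) tuples with 1-based inclusive coordinates, strand is ignored, any overlap of one base or more counts as a match, and the function name `unmatched` is hypothetical.

```python
from collections import defaultdict


def unmatched(candidates, reference):
    """Return candidate genes that overlap no reference gene on the same sequence."""
    ref_by_seq = defaultdict(list)
    for seqid, start, end in reference:
        ref_by_seq[seqid].append((start, end))

    missing = []
    for seqid, start, end in candidates:
        # Two inclusive intervals overlap when each starts before the other ends.
        overlaps = any(
            ref_start <= end and start <= ref_end
            for ref_start, ref_end in ref_by_seq.get(seqid, ())
        )
        if not overlaps:
            missing.append((seqid, start, end))
    return missing
```

The pairwise scan is quadratic per scaffold, which should be acceptable for tens of thousands of genes spread over many scaffolds; a sorted sweep or an interval tree would be the usual replacement for larger sets. The resulting list is only a starting point for inspection, since a gene without a Maker counterpart may still be a false positive.
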

Best regards
Arne

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. Currently I see two options:

1. Provide the cufflinks output as EST-gff
2. Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc.) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,

Rens Holmer

