[maker-devel] evidence for MAKER vs evidence to train gene finders

Daniel Ence dence at genetics.utah.edu
Mon Sep 19 22:45:02 MDT 2016


Just chiming in with my own perspective on the question. The gold-standard genes can be used both as input for training the gene predictors and as evidence for the genome annotation. Presumably you’ll have much more evidence for the annotation than just the gold-standard genes, so it won’t be circular. As Carson said, the gene predictors use the structure of the aligned input (introns, exons, starts, stops) rather than the sequence itself. In a true bootstrap, where you have no gold-standard genes, the other source of training input would be alignments generated by a program such as BUSCO or CEGMA, which identifies conserved orthologs in the genome.
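
To make the "structure, not sequence" point concrete, here is a minimal Python sketch of the kind of summary a trainer can derive once gene models have coordinates on the genome. This is not MAKER or SNAP code: the function names and the toy coordinates are hypothetical, and exons are assumed to be non-overlapping, 1-based, inclusive intervals on one strand.

```python
from statistics import mean


def introns_from_exons(exons):
    """Return the intron intervals implied by a list of (start, end) exons.

    Assumes 1-based inclusive coordinates and non-overlapping exons.
    """
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def structure_summary(gene_models):
    """Summarize exon/intron lengths over a list of gene models.

    Each gene model is a list of (start, end) exon intervals. No sequence
    is consulted: everything here comes from the coordinates alone.
    """
    exon_lengths = [
        end - start + 1 for exons in gene_models for start, end in exons
    ]
    intron_lengths = [
        end - start + 1
        for exons in gene_models
        for start, end in introns_from_exons(exons)
    ]
    return {
        "n_genes": len(gene_models),
        "mean_exon_length": mean(exon_lengths),
        "mean_intron_length": mean(intron_lengths) if intron_lengths else 0.0,
    }


# Hypothetical toy data: one three-exon gene and one single-exon gene.
toy_models = [
    [(101, 200), (301, 450), (601, 700)],
    [(1000, 1299)],
]
summary = structure_summary(toy_models)
```

A raw protein or transcript sequence gives you none of these numbers; they only exist after the sequence has been aligned to the genome and resolved into exons and introns.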

~Daniel




Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

> On Sep 19, 2016, at 10:34 PM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> The training does not involve the sequence so much as the structure (i.e. intron, exon, start, stop, etc.). You could use the deposited evidence as input to the iterative process described, but not directly, because you have the sequence but not the structure.
> 
> What MAKER does with the est2genome/protein2genome options is align the evidence to the reference, polish the alignments for correct splicing (because BLAST alignments are not splice-aware), then identify correct open reading frames with start and stop codons. The result is an intron/exon structure. The HMM for the predictor then builds probability models for moving between intron and exon states (which includes info such as the leading sequence before start codons, average intron lengths, etc.), none of which is directly available from the protein or transcript data alone. But once the evidence has been polished against the reference, the structure can be discovered.
> 
> After initial training (i.e. the bootstrap run), MAKER provides hints in the form of probability bonuses when evidence alignments suggest UTR, CDS, intron, or exon. When the predictors then run, they perform better than they would without the hints. As a result, the second round of predictions is better than the first and can be used as training data to improve the HMM.
> 
> —Carson
> 
> 
> 
>> On Sep 19, 2016, at 10:21 PM, Steven Sullivan <sullis02 at nyu.edu> wrote:
>> 
>> I'm confused about the use(s) of gene sequence evidence in the MAKER de novo annotation pipeline.
>> 
>> As I understand it, MAKER combines 1) its own BLAST alignments of user-supplied RNA ('EST evidence') and protein ('protein homology evidence') sequences to the genome assembly, with 2) models suggested by trained ab initio gene finders that run in parallel. 
>> 
>> The gene finders require a prior training step, and the training sub-protocol in Campbell et al 2014 (Curr. Prot. Bioinf.) assumes that no 'gold standard' gene annotation exists for a newly-sequenced genome. Therefore it describes an iterative/bootstrap process whereby initial MAKER output becomes the gene finder training input for e.g. SNAP, whose output is then used in the next MAKER round.
>> 
>> But in my case, even before the genome was sequenced, a few hundred individual high-quality DNA/protein gene sequences for my species had already been deposited in public databases (GenBank, Swiss-Prot) by various labs over the years, to accompany various publications.
>> 
>> Should these be used to train gene finders prior to a MAKER run, and *also* as user-supplied 'protein homology evidence' to MAKER itself? 
>> 
>> Or am I misunderstanding the workflow?
>> 
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


