[maker-devel] question about Maker2

Carson Holt carson.holt at genetics.utah.edu
Mon Oct 17 12:25:54 MDT 2016


> what is the difference between files
> 
> 1) ContigXXX.maker.non_overlapping_ab_initio.proteins.fasta

Non-redundant, non-overlapping models (i.e. the subset of SNAP/Augustus models that do not overlap a final MAKER-selected model).
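
As a rough illustration of what "does not overlap" means here, the filter can be sketched in Python as below. This is not MAKER's code: the Model layout, the function names and the coordinates are made up for the example, and MAKER's real selection also removes redundant models and may apply criteria that are not shown.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    """A gene model reduced to its span (hypothetical layout, 1-based inclusive)."""
    name: str
    contig: str
    start: int
    end: int


def overlaps(a: Model, b: Model) -> bool:
    """True when both models are on the same contig and share at least one base."""
    return a.contig == b.contig and a.start <= b.end and b.start <= a.end


def non_overlapping(ab_initio: List[Model], final: List[Model]) -> List[Model]:
    """Keep the ab initio models that overlap none of the final models."""
    return [m for m in ab_initio if not any(overlaps(m, f) for f in final)]


# Made-up names and coordinates, for illustration only.
final_models = [Model("final_1", "Contig2", 1000, 5000)]
raw_models = [
    Model("model_A", "Contig2", 1200, 4800),  # inside final_1, so it is dropped
    Model("model_B", "Contig2", 7000, 9000),  # clear of final_1, so it is kept
]

kept = non_overlapping(raw_models, final_models)
print([m.name for m in kept])
```

The pairwise comparison is quadratic, which is fine for a single contig; an interval index would be the usual choice for anything larger.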


> and
> 
> 2 )ContigXXX.maker.augustus_masked.proteins.fasta

Contains all raw Augustus models called without hints (i.e. the equivalent of just running Augustus on its own).


> None of these should have EST info (as the sequences headers are
> 
> 1) augustus_masked-1-processed-gene-

This was a raw Augustus model that may or may not have had UTR added using EST info (i.e. the model came straight from Augustus, so no hints were used to produce it, but MAKER did try to add UTR).



> and
> 
> 2) augustus_masked-1-abinit-gene-

Model straight from Augustus. No hints, and no MAKER attempt to add UTR. These are raw, unmodified models and will never be in the final selected set.


> so no "maker-XXX")

maker-XXX means it was a hint derived model and not a raw Augustus model.
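
If it helps, the three header patterns above can be told apart with a small Python sketch like the one below. It assumes the ID layout quoted further down in this thread (top-level source, contig, internal source, then "-gene-" and a chunk/iterator suffix). The function name and the returned labels are my own, not anything MAKER emits.

```python
import re
from typing import Optional

# Assumed layout: <top_level_source>-<contig>-<internal_source>-gene-<suffix>
# The contig name may itself contain hyphens, so it is matched greedily and the
# "-<internal_source>-gene-" part is effectively anchored from the right.
NAME_RE = re.compile(
    r"^(?P<top>[^-]+)-(?P<contig>.+)-(?P<internal>[^-]+)-gene-(?P<suffix>.+)$"
)


def describe(model_id: str) -> Optional[str]:
    """Return a short label for a model ID, or None if it does not fit the layout."""
    match = NAME_RE.match(model_id)
    if match is None:
        return None
    if match.group("top") == "maker":
        return "hint-derived model"
    if match.group("internal") == "abinit":
        return "raw ab initio model, never in the final set"
    if match.group("internal") == "processed":
        return "raw ab initio model, UTR possibly added by MAKER"
    return "unrecognised internal source"


for example_id in (
    "maker-scaffold11899-augustus-gene-0.6",
    "augustus_masked-scaffold11899-abinit-gene-0.6",
    "augustus_masked-1-processed-gene-0.1",  # suffix made up for the example
):
    print(example_id, "->", describe(example_id))
```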


> Should file 2 just be ignored and 1) be kept aside the maker file, where EST/protein evidence is incorporated?

Ignore all the abinit files. They are for reference purposes only. The non-overlapping file can be used to see which models were rejected and do not overlap a current model (i.e. you may be able to find a handful of false negatives that can be rescued with domain analysis using something like InterProScan).
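
A rough sketch of how one might pull the protein sequences out of the non-overlapping file before handing them to a domain search is below. The function names, the minimum-length cutoff and the stripping of a trailing "*" are my assumptions for the example, not MAKER or InterProScan requirements I am asserting. The toy input stands in for the real proteins FASTA.

```python
import io
from typing import Dict, Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence); the identifier is the first word of the header."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def rescue_candidates(handle: TextIO, min_length: int = 50) -> Dict[str, str]:
    """Collect rejected models long enough to be worth a domain search."""
    candidates: Dict[str, str] = {}
    for name, seq in read_fasta(handle):
        seq = seq.rstrip("*")  # assumption: drop a trailing stop symbol if present
        if len(seq) >= min_length:  # arbitrary cutoff, tune to taste
            candidates[name] = seq
    return candidates


# Toy input; in practice you would open the non_overlapping_ab_initio proteins file.
toy = io.StringIO(">model_1 some description\nMKT\nAYIAK*\n>model_2\nMA*\n")
picked = rescue_candidates(toy, min_length=5)
print(picked)
```

Any model that then shows a convincing domain hit is a candidate false negative worth a closer look.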

—Carson






> Thanks,
> 
> G
> 
> On 5/18/16 11:31 AM, Carson Holt wrote:
>> Hi Giancarlo,
>> 
>> There was no image attached. If you can, just send me the contig GFF3, and I can look at it in Apollo (which lets me manipulate the reading frame and display splice sites). Then I can tell you more. Basically the gene models are the result of an HMM for gene patterns plus hints to alter probability around evidence-suggested sites. If there is any issue with the reading frame (this can be a single-bp assembly error), then no amount of hints can force a broken CDS to be coding, and the predictor will do the best it can to still produce a workable model (i.e. truncate exons, skip exons, etc.). Also, if your mRNA-seq is not aligned correctly around a canonical splice site (i.e. overhang beyond the splice acceptor), then that hint may be ignored.
>> 
>> —Carson
>> 
>> 
>>> On May 17, 2016, at 4:50 AM, Russo Giancarlo <giancarlo.russo at fgcz.ethz.ch> wrote:
>>> 
>>> Hi Carson, thanks again for all your answers.
>>> A (hopefully) final question: in the image attached you can see an IGV sashimi plot of RNA-seq data, with the annotated gene derived from Maker; what could be the reason that in the gene model the two bits on the sides (UTRs?), which show high coverage from the RNA-seq data and plenty of splice junctions with the neighbouring exons, are completely missing?
>>> 
>>> In this run I have used a closely related species from the augustus database for gene prediction, RNA-seq based de novo assembled transcripts as EST and protein sequences from the same closely related species. I have masked using a customized library built following the guidelines in the tutorial.
>>> 
>>> Thanks,
>>> Giancarlo
>>> 
>>> Giancarlo Russo, Ph.D.
>>> Functional Genomics Center Zurich
>>> ETH Zurich / University of Zurich
>>> Winterthurerstrasse 190 / Y32 H66
>>> CH-8057 Zurich
>>> 
>>> Phone: +41 44 635 3964
>>> Fax: +41 44 635 3922
>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>> http://www.fgcz.ch
>>> ________________________________________
>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>> Sent: 09 May 2016 18:02
>>> To: Russo  Giancarlo
>>> Subject: Re: question about Maker2
>>> 
>>> For training gene predictors with protein and EST —> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>>> 
>>> If reusing MAKER results, I don’t recommend GFF3 passthrough. The GFF3 option is there to get non-MAKER-sourced results into MAKER. You will actually lose some functionality by passing in MAKER-sourced results as GFF3 (MAKER can’t do things with GFF3 that it can do with self-generated data).
>>> 
>>> It is best to just rerun MAKER in the same directory; it will reuse previous reports it finds in the datastore.
>>> 
>>> —Carson
>>> 
>>> 
>>> 
>>>> On May 3, 2016, at 2:08 AM, Russo Giancarlo <giancarlo.russo at fgcz.ethz.ch> wrote:
>>>> 
>>>> OK, thanks a lot, now it is clear.
>>>> 
>>>> About the passthrough procedure, would you have any particular advice on what would be the best strategy to run it?
>>>> I have tried an existing organism in Augustus but the results were not too good.
>>>> 
>>>> I have both EST and protein evidence, so I thought I could use EST to infer ab-initio and produce a first annotation and then run a second-pass using the first gff maker file as ab-initio.
>>>> 
>>>> Any advice would be appreciated.
>>>> 
>>>> Best and thanks again.
>>>> Giancarlo
>>>> 
>>>> Giancarlo Russo, Ph.D.
>>>> Functional Genomics Center Zurich
>>>> ETH Zurich / University of Zurich
>>>> Winterthurerstrasse 190 / Y32 H66
>>>> CH-8057 Zurich
>>>> 
>>>> Phone: +41 44 635 3964
>>>> Fax: +41 44 635 3922
>>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>>> http://www.fgcz.ch
>>>> ________________________________________
>>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>>> Sent: 02 May 2016 18:16
>>>> To: Russo  Giancarlo
>>>> Subject: Re: question about Maker2
>>>> 
>>>> As part of the job, MAKER runs SNAP and Augustus on their own before aligning evidence and generating hints for the later run. The sequences in Contig2.maker.augustus.transcripts.fasta are just the results of that uninformed Augustus run. They are not the final gene models; they are just the raw, uninformed Augustus models. They are there for reference purposes only. They are what you would have gotten by just running Augustus directly on the assembly without any additional input (i.e. what Augustus would have produced on its own outside of MAKER).
>>>> 
>>>> —Carson
>>>> 
>>>> 
>>>> 
>>>>> On May 2, 2016, at 2:27 AM, giancarlo.russo <giancarlo.russo at fgcz.ethz.ch> wrote:
>>>>> 
>>>>> Hi Carson,
>>>>> sorry to bother you again, I still don't understand the difference between
>>>>> 
>>>>> 1) Contig2.maker.augustus.transcripts.fasta
>>>>> and
>>>>> 2) Contig2.maker.transcripts.fasta
>>>>> 
>>>>> If 1) contains the transcripts "Produced by maker sending hints to
>>>>> augustus to modify scoring against the HMM",
>>>>> and these hints are derived from EST/protein evidence, what extra
>>>>> information is used/extra steps are performed to produce 2) ?
>>>>> 
>>>>> Also, how is a passthrough using a first-pass, maker-produced gff
>>>>> annotation file best done?
>>>>> Should this gff file be used for ab-initio gene models that are then
>>>>> corrected with EST and protein evidence?
>>>>> Does it make sense to use augustus when a first pass gff file is
>>>>> available? Do these two options (ab-initio based on first pass gff and
>>>>> augustus switched on) exclude each other?
>>>>> 
>>>>> Thanks again for your time and help.
>>>>> 
>>>>> Best,
>>>>> G
>>>>> On 29/03/16 17:42, Carson Holt wrote:
>>>>>> Yes. The ESTs generate hints as to both intron location and exon location. The protein alignments generate CDS location hints. Each algorithm has a different way of accepting hints, with Augustus being the most advanced. It allows separate bonuses for partial vs exact matches, and you can optionally link hints so they have to be matched as a group. It also offers many other hint types, like splice donor and acceptor hints. However, we really only use the intron, exon, and CDS hints. We also use the partial match bonus.
>>>>>> 
>>>>>> —Carson
>>>>>> 
>>>>>> 
>>>>>>> On Mar 29, 2016, at 7:50 AM, Russo Giancarlo <giancarlo.russo at fgcz.ethz.ch> wrote:
>>>>>>> 
>>>>>>> Hi Carson, thanks a lot for your answer.
>>>>>>> 
>>>>>>> So let's see if I get it correctly.
>>>>>>> In the final datastore I have the fasta files named
>>>>>>> 
>>>>>>> 1)Contig2.maker.augustus.transcripts.fasta
>>>>>>> 2)Contig2.maker.non_overlapping_ab_initio.transcripts.fasta
>>>>>>> 3)Contig2.maker.transcripts.fasta
>>>>>>> 
>>>>>>> 1) contains the transcripts "Produced by maker sending hints to augustus to modify scoring against the HMM"
>>>>>>> 2) contains the transcripts predicted only by the ab initio algorithm (e.g. augustus)
>>>>>>> 3) contains the transcripts with a full gene model based on ab initio + EST and/or PROTEIN
>>>>>>> 
>>>>>>> However, what "hints" are sent by maker to augustus? If these are EST/PROTEIN hints, then what is the difference between 1) and 3) ?
>>>>>>> 
>>>>>>> Thanks again for your help and sorry for bothering.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Giancarlo
>>>>>>> 
>>>>>>> Giancarlo Russo, Ph.D.
>>>>>>> Functional Genomics Center Zurich
>>>>>>> ETH Zurich / University of Zurich
>>>>>>> Winterthurerstrasse 190 / Y32 H66
>>>>>>> CH-8057 Zurich
>>>>>>> 
>>>>>>> Phone: +41 44 635 3964
>>>>>>> Fax: +41 44 635 3922
>>>>>>> e-mail: giancarlo.russo at fgcz.ethz.ch
>>>>>>> http://www.fgcz.ch
>>>>>>> ________________________________________
>>>>>>> From: Carson Holt [carson.holt at genetics.utah.edu]
>>>>>>> Sent: 24 March 2016 21:56
>>>>>>> To: maker-devel
>>>>>>> Cc: Russo  Giancarlo; Mark Yandell
>>>>>>> Subject: Re: question about Maker2
>>>>>>> 
>>>>>>> Hi Giancarlo,
>>>>>>> 
>>>>>>> Anything listed as something like maker-*-augustus was a result of MAKER sending hints to augustus, and anything like augustus-*-abinit was the result of augustus run directly from the HMM without hints.
>>>>>>> 
>>>>>>> Here is more detail on the format —>
>>>>>>> <top_level_source> - <contig> - <internal_source> -gene- <chunk> - <iterator>
>>>>>>> 
>>>>>>> Top level possibilities:
>>>>>>> maker                      #maker generated model
>>>>>>> snap_masked          #snap run on masked sequence
>>>>>>> augustus_masked   #augustus run on masked sequence
>>>>>>> etc.
>>>>>>> 
>>>>>>> Internal source:
>>>>>>> abinit        #ab initio model direct from HMM
>>>>>>> snap         #hints provided to SNAP (alters scoring)
>>>>>>> augustus  #hints provided to augustus (alters scoring)
>>>>>>> 
>>>>>>> Then chunk and iterator are just there to generate a unique ID.
>>>>>>> 
>>>>>>> 
>>>>>>> Example:
>>>>>>> augustus_masked-scaffold11899-abinit-gene-0.6    #Produced by Augustus on masked sequence using raw HMM (no MAKER intervention).
>>>>>>> maker-scaffold11899-augustus-gene-0.6                 #Produced by maker sending hints to augustus to modify scoring against the HMM
>>>>>>> 
>>>>>>> —Carson
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 3/24/16, 9:23 AM, "giancarlo.russo" <giancarlo.russo at fgcz.ethz.ch>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Dear Mike,
>>>>>>>>> 
>>>>>>>>> first of all thanks for taking care and sharing Maker, as part of the
>>>>>>>>> community I appreciate it.
>>>>>>>>> 
>>>>>>>>> I have a question about the nomenclature of the annotation in the output
>>>>>>>>> file:
>>>>>>>>> what is the difference between genes named
>>>>>>>>> 
>>>>>>>>> maker-Contig-XXX
>>>>>>>>> and those named
>>>>>>>>> augustus-Contig-XXX-processed genes
>>>>>>>>> ?
>>>>>>>>> 
>>>>>>>>> Please find attached the maker_opts file I have used for my annotation.
>>>>>>>>> I was under the impression that the ab-initio related prefixes would be
>>>>>>>>> present only in the genes which are not marked as "maker" in column 3 of
>>>>>>>>> the gff file (i.e., those
>>>>>>>>> with both ab-initio and EST evidence)
>>>>>>>>> 
>>>>>>>>> Is there something I am missing?
>>>>>>>>> 
>>>>>>>>> Thanks a lot in advance,
>>>>>>>>> Giancarlo
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Giancarlo Russo, Ph.D.
>>>>>>>>> Functional Genomics Center Zurich
>>>>>>>>> Y32 H66
>>>>>>>>> Winterthurerstr. 190
>>>>>>>>> 8057 Zurich
>>>>>>>>> SWITZERLAND
>>>>>>>>> Phone: +41 44 635 39 64
>>>>>>>>> Fax: +41 44 635 39 22
>>>>>>>>> E-Mail: giancarlo.russo at fgcz.ethz.ch
>>>>>>>>> 
>>>>>>>> <maker_opts.ctl>
>>>>> --
>>>>> Giancarlo Russo, Ph.D.
>>>>> Functional Genomics Center Zurich
>>>>> Y32 H66
>>>>> Winterthurerstr. 190
>>>>> 8057 Zurich
>>>>> SWITZERLAND
>>>>> Phone: +41 44 635 39 64
>>>>> Fax: +41 44 635 39 22
>>>>> E-Mail: giancarlo.russo at fgcz.ethz.ch
>>>>> 
> 
> -- 
> Giancarlo Russo, Ph.D.
> Functional Genomics Center Zurich
> Winterthurerstrasse 190
> 8057 Zurich (CH)
> Phone: +41 044 635 3964
> Fax: +41 044 635 3922
> 




