From natassa_g_2000 at yahoo.com  Thu Apr  2 07:42:48 2020
From: natassa_g_2000 at yahoo.com (natassa)
Date: Thu, 2 Apr 2020 13:42:48 +0000 (UTC)
Subject: [maker-devel] Optimal strategy and options for iterative maker2 runs
In-Reply-To: <1b0e5c3cae1b410397e61262a2384039@umu.se>
References: <1b0e5c3cae1b410397e61262a2384039@umu.se>
Message-ID: <1875152020.930988.1585834968860@mail.yahoo.com>

Hello maker community, 
I am annotating a fungal genome with maker2. I have transcript evidence for it, plus transcripts and proteins from closely related species, a genemark .mod file from self-training that I ran outside of maker, and an augustus model from a closely related species. I plan to run maker iteratively, updating the snap (and maybe augustus) models each time. Having read several iterative-maker pipelines online, I am a bit confused about the optimal strategy and about some details of the options used in consecutive runs. Some questions:

1) How will MAKER behave if I supply my different lines of evidence (EST + protein) along with trained abinitio models in the same run? Here is what seems to me conflicting info from posts I read (not on this list): "if est2genome and protein2genome are set to 1 + abinitio tools are also on, the abinitio tools will not use the EST-protein evidence to improve their gene models." but: "In case you activated SNAP and Augustus and you have fed MAKER with lines of evidence (Transcripts and proteins), it will predict gene models using Augustus-Evidence-driven and SNAP-Evidence-driven. In loci where both are present, it will chose the best one according to the lines of evidence (EST / protein when they are present)." Which one is correct?
2) I see in a few tutorials that genemark is trained at a 3rd/4th run and separately from the other abinitio programs. I don't understand why, since genemark is self-trained on the genome, so it does not really interact with training from evidence or maker gff files?
3) Can I pass >1 abinitio models from one run to the next using the pred_gff option? For example, augustus+genemark hmms, separated by ","? In a 2017 post, Carson writes "I would avoid passing in Augustus results as GFF3, it removes the ability of MAKER to dynamically provide Augustus with hints as it runs". What is the correct way then?

Any input from experienced maker users is welcome!
Thank you in advance, 
Anastasia Gioti

Anastasia Gioti
Researcher
IMBBC-HCMR Crete, Greece
https://scholar.google.com/citations?user=eMsnakoAAAAJ&hl=en

From shore at yorku.ca  Fri Apr  3 11:02:36 2020
From: shore at yorku.ca (shore at yorku.ca)
Date: Fri, 03 Apr 2020 13:02:36 -0400
Subject: [maker-devel] final annotation issues
Message-ID: <1585933356.5e876c2c023db@oldmymail.yorku.ca>

Dear Maker team,

 I believe we are at the final stage of annotation of a plant genome, having
previously trained snap following 3 rounds.

 In our attempts at final annotation we have now added new transcriptome data
and generated a repeat library for our species (so we now mask with that, as
well as a database of plant repeats and TE proteins).

 In this final annotation run, we've set keep_preds=1 and then plan to
screen the final gff file, retaining sequences with AED <= 0.5 (or
thereabouts) and ones that possess a Pfam domain.

 I've compared some of the proteins obtained in our previous round of Maker with
the latest. Indeed, the masking appears to have removed some that were TEs. A
number of proteins differ somewhat, likely as a result of different intron/exon
boundary calls, and some are quite different in length.
In particular, some are roughly twice the length in the previous annotation, and
appeared to be of the correct size previously, based on online BLAST searches.

It is this latter finding that concerns me: why has it occurred?

I did set single_exon=1 and wonder if that is causing this effect?

Thanks and sorry for the long-winded email.

Joel


-- 
Dr. Joel S. Shore
Prof. Biology
York University


From carsonhh at gmail.com  Fri Apr  3 14:51:47 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 3 Apr 2020 14:51:47 -0600
Subject: [maker-devel] guidance for first and subsequent annotation
 parameters
In-Reply-To: <CAEdk9jmg1rbuRPbCC4vNmNkTyN61H7XGQpUi4h6UD1czoFtH1g@mail.gmail.com>
References: <CAEdk9jmg1rbuRPbCC4vNmNkTyN61H7XGQpUi4h6UD1czoFtH1g@mail.gmail.com>
Message-ID: <81E8860F-A91B-4089-B179-4ED7EBAC36D3@gmail.com>

You may need to select a subset of gene models to drive training. I find that I get the best results when I use protein2genome models built only from UniProt/Swiss-Prot alignments to generate a training set, with always_complete=1 set. UniProt/Swiss-Prot is manually curated, so it is very high quality. Then I select models with the highest end-to-end completion (low AED). Also, if you add est_forward=1, the score column in the GFF3 will be the % match to the original model. It's an easy way to select only models with a very high percent match. Remove models without start codons and stop codons. You can relax these parameters if you don't have many models, but in general you want 100-300 models to train with. Only one round of training is needed with this type of training set. The EST method requires 2 rounds and I don't like it as much.

In some cases, model selection for training will be a mostly manual task. You can use editors like Apollo to identify models that match evidence well and delete odd models. Then train on that result.
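
The selection step described above can be sketched in Python. This is only an illustration, not part of MAKER: it assumes the MAKER GFF3 carries an `_AED` attribute on mRNA features, and the function names, the 0.1 cutoff, and the use of the standard stop codons are hypothetical choices made for the example.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string such as 'ID=x;_AED=0.05' into a dict."""
    fields = (f.split("=", 1) for f in column9.strip().split(";") if "=" in f)
    return {key: value for key, value in fields}


def select_training_ids(gff_lines, max_aed=0.1):
    """Return IDs of mRNA features with _AED <= max_aed, lowest AED first.

    max_aed=0.1 is an arbitrary illustrative cutoff, not a MAKER default.
    """
    candidates = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs:
            continue
        aed = float(attrs["_AED"])
        if aed <= max_aed:
            candidates.append((aed, attrs.get("ID", "")))
    candidates.sort()
    return [model_id for _, model_id in candidates]


def is_complete_cds(cds):
    """True if a CDS starts with ATG and ends with a standard stop codon."""
    cds = cds.upper()
    return (
        len(cds) % 3 == 0
        and cds.startswith("ATG")
        and cds[-3:] in {"TAA", "TAG", "TGA"}
    )
```

If fewer than roughly 100 models survive, the cutoff can be relaxed, as noted above.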


What you are seeing is likely the result of over-training. It usually happens if you use more than 2 rounds of training, but can happen with just two rounds.

--Carson

 
> On Mar 20, 2020, at 5:30 AM, Devon O'Rourke <devon.orourke at gmail.com> wrote:
> 
> With so many posts on the forum it's been challenging to determine what the best practices are for performing multiple rounds of annotation with Maker.
> My first round used est, altest, and protein fasta files with a custom GFF repeat masked file. The resulting vertebrate genome produced 21,970 gene models with a mean length of about 9016 bp; the BUSCO score was C:66.0%[S:64.2%,D:1.8%],F:4.2%,M:29.8%,n:9226 (mammalia_odb10 set). Things seemed to be on the right track, so I set up the next Maker round using both SNAP and Augustus-trained information in the round2 maker_opts.ctl file. At the end of that second round, I noticed a marked decrease in BUSCO score (C:53.3%[S:51.0%,D:2.3%],F:11.6%,M:35.1%,n:9226), yet an increase in the number of gene models (28,646) and mean length (16266 bp). 
> 
> This got me to wondering if I was setting up the _opts.ctl file incorrectly? I'm concerned with a few things (and maybe missing even more I should be concerned about!?):
> I specified the evidence to come from EST/Protein instead of using the section available under "#-----Re-annotation Using MAKER Derived GFF3". Maybe that was a fundamental mistake? What is the expected change in behavior if I moved my round1 Maker output into that category instead of using the EST/Protein Homology evidence sections as I did below?
> I wasn't sure what to do with the RepeatMasking GFF files in Round2. The RepeatMasker GFF I included in Round1 consisted of just complex repeats (setting model_org=simple and softmask=1 to effectively only hard mask those complex areas for the initial alignments). But what should be used in Round2 - the output GFF of Round1, or the input GFF from Round1?
> Here's what I did for the Round2 maker_opts.ctl file:
> 
> #-----Genome (these are always required)
> genome=/scratch/dro49/myluwork/annotation/input_files/mylu_hic_rails_noMasks.fa
> organism_type=eukaryotic
> #-----EST Evidence (for best results provide a file for at least one)
> est_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.est2genome.gff
> altest_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.cdna2genome.gff
> #-----Protein Homology Evidence (for best results provide a file for at least one)
> protein_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.protein2genome.gff
> #-----Repeat Masking (leave values blank to skip repeat masking)
> rm_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.repeats.gff
> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
> #-----Gene Prediction
> snaphmm=/scratch/dro49/myluwork/annotation/maker_rd2/snap_rd1/lu_rnd1.zff.length50_aed0.25.hmm #SNAP HMM file
> augustus_species=mylu #Augustus gene prediction species model
> run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
> est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
> protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
> trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
> unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
> allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)
> 
> 
> Thank you for your insights and support,
> 
> Devon
> 
> -- 
> Devon O'Rourke
> Postdoctoral researcher, Northern Arizona University
> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
> twitter: @thesciencedork


From carsonhh at gmail.com  Fri Apr  3 16:03:12 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 3 Apr 2020 16:03:12 -0600
Subject: [maker-devel] Problem with Maker using GeneMark
In-Reply-To: <5e838561.1c69fb81.27f9c.24c6SMTPIN_ADDED_MISSING@mx.google.com>
References: <5e838561.1c69fb81.27f9c.24c6SMTPIN_ADDED_MISSING@mx.google.com>
Message-ID: <E2FF8D2F-8B8F-456B-BC0E-1A1C099D05D3@gmail.com>

Could you try the attached version, and let me know if it resolves the issue (copy over the old one)? The probuild command I used is just one I stole from another GeneMark script, so I just borrowed the updated command from the SplitFasta subroutine in gmes_petap.pl.

--Carson


> On Mar 31, 2020, at 11:53 AM, Gagné, Patrick (NRCAN/RNCAN) <patrick.gagne at canada.ca> wrote:
> 
> Hi
>  
> I've come across a bug while using Maker. I'm trying to annotate a 560Mb genome and I'm using Snap, GeneMark and Augustus in Maker.
> When Maker executes the GeneMark command, it just fails (GeneMark Failed) without any error messages, so I decided to debug it myself. I launched every command manually and found that gmhmm_wrap is causing the issue. The problem is in fact in the probuild command; it doesn't do anything (from what I understand, this command is supposed to split the fasta where there are NNN to prevent a GeneMark crash). My genome has very long stretches of N (up to 14Kb).
>  
> After checking the probuild help, I found that the command used in gmhmm_wrap is not valid (half the options are not in probuild anymore, probably because of GeneMark updates).
>  
> I have tried different probuild versions (those I could download from the GeneMark site; they don't provide older versions except those that come with their program's versions):
> 2.16
> 2.34
> 2.44 (latest, which comes with GeneMark ES)
>  
> I've also tried to edit the gmhmm_wrap script and modify the probuild command, but even when the fasta files are split, I get another bug: ERROR: Logic error in getting offset. I tried replacing the command for the offset extraction, which also worked, but now I get a bug when Maker tries to get the ab-initio output:
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Calling translate without a seq argument!
>  
> Could you please tell me how to fix this, or tell me what probuild I should use (I will ask the GeneMark support for it)
>  
> Thanks in advance
>  
> P.S 
> Sorry for my English, it's not my first language and I'm still learning
>  
> Patrick Gagné
> Spécialiste en bio-informatique / Bioinformatics specialist
> Service canadien des forêts / Canadian Forest Service
> Ressources naturelles Canada / Natural Resources Canada
> Gouvernement du Canada / Government of Canada
> Centre de foresterie des Laurentides/Laurentian Forestry Centre
> 1055, rue du P.E.P.S.
> C.P. 10380, succ. Sainte-Foy/P.O. Box 10380, Stn. Sainte-Foy
> Québec (Qc) G1V 4C7
> Laboratoire de pathologie forestière (Local 2.21)
> patrick.gagne at canada.ca / tel : (418) 648-4443
>  
>  
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org <http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmhmm_wrap
Type: application/octet-stream
Size: 9027 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200403/7766c8a5/attachment.obj>

From carsonhh at gmail.com  Sat Apr  4 14:09:05 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 4 Apr 2020 14:09:05 -0600
Subject: [maker-devel] repeatmasker output gff
In-Reply-To: <CA+un12sUfaxK_ycgves9cohrdFjOyg7Nt9WUzGSK6Yz-RbdgMQ@mail.gmail.com>
References: <CA+un12sUfaxK_ycgves9cohrdFjOyg7Nt9WUzGSK6Yz-RbdgMQ@mail.gmail.com>
Message-ID: <08952604-1F7A-4FC5-9F59-DB79665A324D@gmail.com>

It needs to be a two-level feature. match/match_part is one example; others will work as long as the result has two levels once it is assembled.
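
As a rough sketch of what such a conversion could look like, the Python function below wraps each single-level line (for example a RepeatMasker "similarity" line) in a match parent with one match_part child. The function name, the `repeat_N` ID scheme, and the choice to drop the original ninth column are assumptions made for this illustration, not something MAKER prescribes.

```python
def to_two_level(gff_lines, source="repeatmasker"):
    """Convert single-level GFF lines into match/match_part pairs (GFF3)."""
    out = []
    count = 0
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a feature line
        count += 1
        seqid, _, _, start, end, score, strand, phase = cols[:8]
        parent_id = f"repeat_{count}"  # hypothetical ID scheme
        shared = [seqid, source]
        coords = [start, end, score, strand, phase]
        # Parent feature spanning the whole repeat hit
        out.append("\t".join(shared + ["match"] + coords + [f"ID={parent_id}"]))
        # Single child feature with the same coordinates
        child_attrs = f"ID={parent_id}-part1;Parent={parent_id}"
        out.append("\t".join(shared + ["match_part"] + coords + [child_attrs]))
    return out
```

The output lines can be written under a `##gff-version 3` header and passed through the rm_gff option.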

MAKER saves its state as it runs, so you can restart it at any time without losing progress.

--Carson


> On Mar 25, 2020, at 2:38 PM, Homa Papoli <hpapoli at gmail.com> wrote:
> 
> Hello,
> 
> I have 2 questions regarding using maker:
> 
> I have used repeatmasker for my genome separately and I have a gff file. However, my gff file, in the third column, has the word "similarity". In a workshop I had taken on genome annotation, it was said that the gff for maker should have "match" and "match_part" for the third column. I was wondering whether I could use the original gff output of repeatmasker or should I make any changes to it?
> 
> Another question is about running maker. Since maker takes several days to run, if the job gets interrupted due to limit in days of running the job, I was wondering whether it is possible to re-start maker from where it got interrupted?
> 
> Thank you,
> Homa


From carsonhh at gmail.com  Sat Apr  4 14:15:21 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 4 Apr 2020 14:15:21 -0600
Subject: [maker-devel] Maker annotation  AED scores are around 0.5
In-Reply-To: <1b0e5c3cae1b410397e61262a2384039@umu.se>
References: <1b0e5c3cae1b410397e61262a2384039@umu.se>
Message-ID: <E2AAAE35-7B24-46EE-B77F-9E4BD584CC45@gmail.com>

Probably this -->

https://groups.google.com/forum/#!searchin/maker-devel/Curious$20pattern$20in$20AED$20distributions%7Csort:date/maker-devel/QS3VnxhvEks/q3lPmywjBQAJ


Likely caused by an overabundance of single-exon models and under-masking of repeats in the genome.
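
One way to check the single-exon hypothesis is to count exons per transcript in the output GFF3. The Python sketch below is illustrative only; it assumes mRNA features carry an `ID` attribute and exon features a `Parent` attribute pointing at the mRNA, and the function name is made up for this example.

```python
from collections import Counter


def single_exon_fraction(gff_lines):
    """Return the fraction of mRNA features that have exactly one exon."""
    exons_per_mrna = Counter()
    mrna_ids = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if cols[2] == "mRNA" and "ID" in attrs:
            mrna_ids.add(attrs["ID"])
        elif cols[2] == "exon":
            # An exon may be shared by several transcripts (comma-separated)
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    exons_per_mrna[parent] += 1
    if not mrna_ids:
        return 0.0
    single = sum(1 for m in mrna_ids if exons_per_mrna[m] == 1)
    return single / len(mrna_ids)
```

A very high fraction, together with an AED peak near 0.5, would be consistent with the explanation above, although it does not prove it.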

--Carson


> On Mar 30, 2020, at 3:37 AM, Wei Zhao <zhao.wei at umu.se> wrote:
> 
> Dear maker team,
>  
> I am writing to ask for your help.
>  
> I am using maker to annotate a big genome (~9 Gbp). I have 3 lines of evidence: 1) the transcriptome of this species; 2) protein sequences from related species; 3) an Augustus model trained from pasa.
>  
> When I use all 3 lines of evidence to annotate the genome (basic pipeline), the distribution of AED scores is weird (a single peak around 0.5).
>  
> I have also tried to update the gene models I got from pasa using maker, and the distribution of AED scores is the same.
>  
> But when I use only EST or protein as evidence (est2genome or protein2genome), the AED scores are normal (close to 0).
>  
> As I understand it, the 3 lines of evidence seem to conflict with each other, which results in AED scores that are higher (~0.5) than expected. Could you give me some suggestions on how to fix this problem?
>  
> Best regards,
>  
> Wei
>  
>  
> <E6F3EF742C40408F8390EE9A1FF29894.png>
>  

From danis_theo at hotmail.com  Thu Apr  2 12:24:05 2020
From: danis_theo at hotmail.com (Thodoris Danis)
Date: Thu, 2 Apr 2020 18:24:05 +0000
Subject: [maker-devel] Question about re-annotation
Message-ID: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>

Hello maker community,


I am annotating a fish genome with Maker 2.31.10, for which I have transcript evidence and proteins from closely related species. I am running maker in several passes and pass the gff output to the next pass every time here (
"#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #re-annotate genome based on this gff3 file",
). Is this wrong? And what is the difference from running maker from scratch without re-supplying the new gff every time?

Secondly, in the second round, after training the predictors, do I have to keep est2genome=1, protein2genome=1, and the EST and protein evidence or not?
How do we switch all these parameters?

Any input from experienced maker users is welcome
Thank you for your help


Thodoris Danis

From carsonhh at gmail.com  Sun Apr  5 16:19:26 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 5 Apr 2020 16:19:26 -0600
Subject: [maker-devel] Question about re-annotation
In-Reply-To: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>
References: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>
Message-ID: <9B863735-2D4D-4FD2-A8B1-3C542F3D767A@gmail.com>

If you are running several times, just rerun in the same directory after altering settings. MAKER will reuse old raw data reports as appropriate. The maker_gff option is really just for reannotating from an old maker run where you no longer have the raw files available.

--Carson



From carsonhh at gmail.com  Tue Apr  7 11:42:08 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 7 Apr 2020 11:42:08 -0600
Subject: [maker-devel] Maker 2.31.10: maker_functional_gff and
 maker_functional_fasta not parsing correctly,
 Can't use string ("") as a HASH ref while "strict refs" in use
In-Reply-To: <4A449297-A6D1-4A75-9547-FB5F70CE1A0A@ulaval.ca>
References: <4A449297-A6D1-4A75-9547-FB5F70CE1A0A@ulaval.ca>
Message-ID: <D775DED1-F3A5-478B-B7BE-8F318CFEADA3@gmail.com>

Thanks, I'll update the related scripts. In my tests the old regular expression still works, but it ends up adding the OX= tag as part of the GFF3 entry rather than throwing a hash ref error. So you may still have another issue if you are getting a hash ref error.
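
The difference between the two patterns quoted below can be reproduced outside of MAKER. The Python sketch here ports both Perl expressions unchanged and applies them to one Swiss-Prot style header; the header is only an illustrative example of the format, and the `parse` helper and its field names are made up for this demonstration.

```python
import re

# Old pattern from the script and the proposed replacement, ported as-is
OLD = re.compile(r">(\S+)\s+(.*?)\s+OS=(.*?)\s+(GN=(.*?)\s+)?PE=")
NEW = re.compile(
    r">sp\|(\S+)\|\S+\s+(.*?)\s+OS=(.*?)\s+OX=\S+\s+(GN=(.*?)\s+)?PE="
)

# Illustrative header in the newer UniProt format, which includes OX=
HEADER = (
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha "
    "OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
)


def parse(pattern, header):
    """Return the captured fields as a dict, or None if there is no match."""
    m = pattern.search(header)
    if m is None:
        return None
    return {
        "id": m.group(1),
        "description": m.group(2),
        "organism": m.group(3),
        "gene": m.group(5),
    }


old_fields = parse(OLD, HEADER)
new_fields = parse(NEW, HEADER)
print(old_fields["organism"])  # organism field still contains the OX= tag
print(new_fields["organism"])  # organism field without the OX= tag
```

Note that the proposed pattern only matches headers that start with `>sp|`, so headers with any other prefix would not be parsed by it.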

--Carson


> On Mar 14, 2020, at 11:24 AM, Christopher Keeling <christopher.keeling.1 at ulaval.ca> wrote:
> 
> Hello,
> 
> In sub parse_blast{, during parsing of uniprot fasta file:
> 
> if (/>(\S+)\s+(.*?)\s+OS=(.*?)\s+(GN=(.*?)\s+)?PE=/) {
> 
> should be changed to:
> 
> if (/>sp\|(\S+)\|\S+\s+(.*?)\s+OS=(.*?)\s+OX=\S+\s+(GN=(.*?)\s+)?PE=/) {
> 
> to avoid "Can't use string ("") as a HASH ref while "strict refs" in use at ..." errors.
> 
> For UniProt release 2020_01: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
> 
> Cheers,
> Chris
> 
> 
> --
> Christopher I. Keeling
> Chercheur scientifique en génomique forestière / Research Scientist in Forest Genomics
> 
> Ressources naturelles Canada / Natural Resources Canada
> Service canadien des forêts / Canadian Forest Service
> Centre de foresterie des Laurentides / Laurentian Forestry Centre
> 1055, rue du PEPS Québec, QC G1V 4C7 Canada
> https://cfs.nrcan.gc.ca/employees/read/ckeeling
> 
> Professeur associé
> Département de biochimie, de microbiologie et de bio-informatique
> Université Laval
> https://www.researchgate.net/profile/Christopher_Keeling
> https://scholar.google.ca/citations?user=KtGr86UAAAAJ
>  
> 

From andrei.kiselev at lrsv.ups-tlse.fr  Fri Apr 10 10:33:57 2020
From: andrei.kiselev at lrsv.ups-tlse.fr (andrei.kiselev at lrsv.ups-tlse.fr)
Date: Fri, 10 Apr 2020 16:33:57 +0000
Subject: [maker-devel] New assembly annotation
Message-ID: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>

Hello.
I have recently obtained a new PacBio genome assembly of the oomycete Aphanomyces.
I used MAKER in the manner described here: https://groups.google.com/forum/#!searchin/maker-devel/new$20assembly%7Csort:date/maker-devel/Xo5YbWgNwFw/KstkmXYYAgAJ

After the first run, I got a slightly higher number of transcripts than were in the gff file of the previous version of the genome. Then I ran a second MAKER round with the new gff file in the pred_gff option, plus augustus trained for my species. As a result, I got only half of the transcripts from the initial gff.

Is there something that I could have overlooked when running MAKER? Attached is the control file of the last run.

Thank you in advance.
Andrei
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4984 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200410/54d75f1c/attachment.obj>

From liorglic at mail.tau.ac.il  Mon Apr 13 08:12:42 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Mon, 13 Apr 2020 17:12:42 +0300
Subject: [maker-devel] Annotating a fragmented assembly
Message-ID: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>

Hello there,

I am working on creating plant pan genomes. This means that I produce many
assemblies for samples of the same species from NGS data available from SRA
and then annotate them with MAKER, based on a collection of relevant
evidence (transcripts and proteins).
As you might imagine, data quality is variable, so I sometimes create
assemblies from >x20 sequencing depth, resulting in fragmented assemblies
(say N50 in the range of 5-10kb).
Annotation results of such genomes usually contain many partial genes,
broken across contigs, so in many cases I get two proteins, representing
the 3' and 5' parts of a broken gene. In other cases, only one part of the
gene is detected.
I've also found that applying reference-based scaffolding (I use RaGOO) to
generate pseudomolecules improves results by bringing together contigs
containing gene parts and allowing MAKER to create full annotation.
However, this also results in new erroneous predictions, spanning two
contigs that are not actually adjacent in the genome but were brought
together by the scaffolding process.
I suspect this has to do with the number of 'N' characters introduced as
padding between ordered contigs, so one thing I wanted to ask about is how
MAKER reacts to N's in the middle of a gene. Does it affect gene prediction?
I would also appreciate any advice on how to annotate fragmented genomes
and comments about the strategy I described above. Please note that I am
not expecting a reference-level annotation, but am simply trying to reduce
noise levels towards downstream comparative analyses.

Thanks a lot and best regards,
Lior

From xpeng at ucsb.edu  Tue Apr 14 11:40:15 2020
From: xpeng at ucsb.edu (xpeng at ucsb.edu)
Date: Tue, 14 Apr 2020 10:40:15 -0700
Subject: [maker-devel] Can install but Cannot Run MAKER
Message-ID: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>

Dear Yandell Lab,

 
I am writing to get a bit of help on getting MAKER to work.

 
I downloaded the v3.01.03 maker and followed the instructions on your wiki
page to install, both on my local computer as sudo and on PSC Bridges (with
MPI). 

 
The installation seemed to have completed successfully.

 
However, when I ran "maker -h" I received error messages (attached) that I
don't know what to do about. Could you please advise a solution?

 
Thank you!

 
Nick (Xuefeng Peng)

 
Postdoctoral Scholar

University of California

Santa Barbara, CA

-------------- next part --------------
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Error_Message_Ubuntu_19.10.txt
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200414/38690941/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Error_Message_PSC_Bridges.txt
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200414/38690941/attachment-0003.txt>

From carsonhh at gmail.com  Tue Apr 14 12:11:16 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 14 Apr 2020 12:11:16 -0600
Subject: [maker-devel] Can install but Cannot Run MAKER
In-Reply-To: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>
References: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>
Message-ID: <94E18556-771A-485C-B534-80B52BC7D586@gmail.com>

Please re-download and install again. I found the issue from your error in the new install package.

--Carson



From liorglic at mail.tau.ac.il  Tue Apr 21 07:08:40 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Tue, 21 Apr 2020 16:08:40 +0300
Subject: [maker-devel] Missing genes in lift-over with est2genome
Message-ID: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>

Hello,
I am using MAKER to annotate a plant genome assembly. A high-quality
reference genome and annotation exists for another variety of the same
species, so my first step is lifting over reference genes to my genome. I
do this by setting est2genome = 1 and providing MAKER with the reference
cDNA (transcriptome). No other evidence is provided and no prediction is
performed. Repeat masking is done using the reference repeats library.
When checking the results, I found that many reference genes are missing from
the lift-over result. However, if I blast the sequences of these genes
myself, I get good matches. I even see these matches when I look at the
blast results buried in the MAKER data_store.
For example, a transcript of length 1077 got a match of length 855 - 100%
identity and no gaps. Bitscore was 1709 and E-value 0. This looks like a
pretty good match, but it is not found in the final MAKER results
(gff/fasta).
Why is this happening? Are there some cutoffs that are not satisfied? If
so, what are they and how can they be configured?

Thanks,
Lior

From carsonhh at gmail.com  Thu Apr 23 11:38:54 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:38:54 -0600
Subject: [maker-devel] Annotating a fragmented assembly
In-Reply-To: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>
References: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>
Message-ID: <C9C6F924-D27C-498A-81B8-B051C25CDB27@gmail.com>

N's are handled by the gene predictors themselves. I know Augustus can span N's within introns. I'm not sure how many N's will cause it to split the gene. It may be a function of the expected intron length in the HMM. Organisms with large introns could then handle more N's. Genemark will split genes on even short runs of N's. I'm not sure about SNAP. For BLAST alignments, gap extensions decrease the score, so how long the gap can be depends on the score of the initial seeding alignment. The larger the initial score, the longer the gap can be before the score drops below the termination threshold.

--Carson
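
Since the behaviour described above depends on how long the runs of N are, it can help to measure them before annotating a scaffolded assembly. The following is a minimal Python sketch, not something from the thread: the function names (read_fasta, n_runs), the min_len parameter and the toy sequences are all hypothetical, and it only reports N runs; it says nothing about where any particular predictor will split a gene.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_runs(sequence: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return (0-based start, length) for each run of N/n of at least min_len."""
    runs = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_len:
            runs.append((match.start(), length))
    return runs


# Toy input (hypothetical): one scaffold with a 10 bp and a 3 bp gap, one without gaps.
example = [
    ">scaffold_1",
    "ACGT" + "N" * 10 + "ACGT",
    "ACG" + "NNN" + "ACGT",
    ">scaffold_2",
    "ACGTACGT",
]
report = {name: n_runs(seq, min_len=5) for name, seq in read_fasta(example)}
print(report)
```

In practice the lines would come from an open file handle on the assembly FASTA, and min_len would be set to the padding length the scaffolder inserts between contigs.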


> On Apr 13, 2020, at 8:12 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> 
> Hello there,
> 
> I am working on creating plant pan genomes. This means that I produce many assemblies for samples of the same species from NGS data available from SRA and then annotate them with MAKER, based on a collection of relevant evidence (transcripts and proteins).
> As you might imagine, data quality is variable, so I sometimes create assemblies from >x20 sequencing depth, resulting in fragmented assemblies (say N50 in the range of 5-10kb).
> Annotation results of such genomes usually contain many partial genes, broken across contigs, so in many cases I get two proteins, representing the 3' and 5' parts of a broken gene. In other cases, only one part of the gene is detected.
> I've also found that applying reference-based scaffolding (I use RaGOO) to generate pseudomolecules improves results by bringing together contigs containing gene parts and allowing MAKER to create full annotation. However, this also results in new erroneous predictions, spanning two contigs that are not actually adjacent in the genome but were brought together by the scaffolding process.
> I suspect this has to do with the number of 'N' characters introduced as padding between ordered contigs, so one thing I wanted to ask about is how MAKER reacts to N's in the middle of a gene. Does it affect gene prediction?
> I would also appreciate any advice on how to annotate fragmented genomes and comments about the strategy I described above. Please note that I am not expecting a reference-level annotation, but am simply trying to reduce noise levels towards downstream comparative analyses.
> 
> Thanks a lot and best regards,
> Lior
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Thu Apr 23 11:43:30 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:43:30 -0600
Subject: [maker-devel] Missing genes in lift-over with est2genome
In-Reply-To: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
References: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
Message-ID: <373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>

There are percent cutoffs for the est2genome algorithm that you can set in the maker_bopts.ctl file. Additionally, MAKER will report the alignment but not produce a gene model if it can't translate through the est2genome alignment (i.e. stop codons in the assembly). I believe the cutoff is 50%. If you add est_forward=1 to the maker_opts.ctl file, names will be copied from the alignment source and the score column in the GFF3 will be the percent match to the original transcript.

--Carson
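
To make the cutoff question concrete for the example in this thread (a transcript of length 1077 with a match of length 855), here is a minimal Python sketch of the coverage arithmetic. The two thresholds are illustrative placeholders rather than values confirmed by the thread: 0.5 echoes the "I believe the cutoff is 50%" remark above, and 0.8 is simply a stricter hypothetical setting. The real values should be read from your own maker_bopts.ctl; the function names are also hypothetical.

```python
def alignment_coverage(aligned_length: int, transcript_length: int) -> float:
    """Fraction of the transcript that is covered by the alignment."""
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    return aligned_length / transcript_length


def passes_cutoff(aligned_length: int, transcript_length: int,
                  min_coverage: float) -> bool:
    """True if the alignment covers at least min_coverage of the transcript."""
    return alignment_coverage(aligned_length, transcript_length) >= min_coverage


# Numbers taken from the example in the thread.
coverage = alignment_coverage(855, 1077)
print(f"coverage = {coverage:.3f}")

# Hypothetical thresholds, for illustration only.
for cutoff in (0.5, 0.8):
    print(cutoff, passes_cutoff(855, 1077, cutoff))
```

The same alignment is kept under one threshold and dropped under the other, which is why it is worth checking the configured percent-coverage and percent-identity values before concluding that a good-looking BLAST hit was lost for some other reason.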


> On Apr 21, 2020, at 7:08 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> 
> Hello,
> I am using MAKER to annotate a plant genome assembly. A high-quality reference genome and annotation exists for another variety of the same species, so my first step is lifting over reference genes to my genome. I do this by setting est2genome = 1 and providing MAKER with the reference cDNA (transcriptome). No other evidence is provided and no prediction is performed. Repeat masking is done using the reference repeats library.
> When checking the results, I found out lots of reference genes missing from the lift-over result. However, if I blast the sequences of these genes myself, I get good matches. I even see these matches when I look at the blast results buried in the MAKER data_store.
> For example, a transcript of length 1077 got a match of length 855 - 100% identity and no gaps. Bitscore was 1709 and E-value 0. This looks like a pretty good match, but it is not found in the final MAKER results (gff/fasta).
> Why is this happening? Are there some cutoffs that are not satisfied? If so, what are they and how can they be configured?
> 
> Thanks,
> Lior


From carsonhh at gmail.com  Thu Apr 23 11:53:27 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:53:27 -0600
Subject: [maker-devel] New assembly annotation
In-Reply-To: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>
References: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>
Message-ID: <DFFD73D7-8379-467B-9992-FDDBAE230802@gmail.com>

Fewer transcripts can mean fewer split and spurious genes. It can also mean bad merges because of overtraining. Use BUSCO to evaluate the completeness of gene models rather than transcript count. Also review models visually using something like Apollo. You will be able to see whether models now span distinct evidence clusters or whether they were previously split within evidence clusters. That will help you judge whether the models now follow the evidence alignments better.

--Carson


> On Apr 10, 2020, at 10:33 AM, andrei.kiselev at lrsv.ups-tlse.fr wrote:
> 
> Hello.
> I have recently obtained a new PacBio genome assembly of the oomycete Aphanomyces.
> I used MAKER in the manner as described here https://groups.google.com/forum/#!searchin/maker-devel/new$20assembly%7Csort:date/maker-devel/Xo5YbWgNwFw/KstkmXYYAgAJ <https://groups.google.com/forum/#!searchin/maker-devel/new%2420assembly%7Csort:date/maker-devel/Xo5YbWgNwFw/KstkmXYYAgAJ>
> 
> After the first run, the number of transcripts was slightly higher than in the GFF file of the previous genome version. I then ran a second MAKER round with the new GFF file in the pred_gff option, plus an Augustus model trained for my species. As a result, I got only half of the transcripts in the initial GFF.
> 
> Is there something I could have overlooked when running MAKER? Attached is the control file of the last run.
> 
> Thank you in advance.
> Andrei
> <maker_opts.ctl>


From carsonhh at gmail.com  Thu Apr 23 11:57:23 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:57:23 -0600
Subject: [maker-devel] final annotation issues
In-Reply-To: <1585933356.5e876c2c023db@oldmymail.yorku.ca>
References: <1585933356.5e876c2c023db@oldmymail.yorku.ca>
Message-ID: <D56728B7-B822-4EF4-AF75-7EE76C6D6908@gmail.com>

I would not recommend single_exon=1 unless this is an organism where you expect a lot of single-exon genes (typically fungi or oomycetes). It's best to review models visually in something like Apollo to see how evidence alignments compare to gene predictions. There is always the chance that you have some overmasking that could trim some regions you don't want to lose.

--Carson


> On Apr 3, 2020, at 11:02 AM, shore at yorku.ca wrote:
> 
> Dear Maker team,
> 
> I believe we are at the final stage of annotation of a plant genome, having
> previously trained snap following 3 rounds.
> 
> In our attempts at final annotation we have now added new transcriptome data,
> and generated a repeat library for our species (so we now mask with that, as
> well as database of plant repeats , and TE proteins).
> 
> In this final annotation run, we've set keep_pred=1 and then plan to
> screen the final gff file, retaining sequences with AED <= 0.5 (or
> thereabouts) and ones that possess a Pfam domain.
> 
> I've compared some of the proteins obtained in our previous round of Maker with
> the latest. Indeed the masking appears to have removed some that were TEs. A
> number of proteins differ somewhat, likely the result of different intron/exon
> boundary calls, and some are quite different in length.
> In particular some are roughly twice the length in previous annotation, and
> appear to be of the correct size previously , based upon online blasts.
> 
> It is this latter finding that I'm concerned about, and why it has
> occurred.
> 
> I did set single_exon=1 and wonder if that is causing this effect?
> 
> Thanks and sorry for the long-winded email.
> 
> Joel
> 
> 
> 
> -- 
> Dr. Joel S. Shore
> Prof. Biology
> York University
> 


From guerrer at uni-duesseldorf.de  Fri Apr 24 08:27:24 2020
From: guerrer at uni-duesseldorf.de (Ricardo Nuno Ferreira Martins Guerreiro)
Date: Fri, 24 Apr 2020 16:27:24 +0200
Subject: [maker-devel] Maker 0 genes after SNAP or with proteins.gff
Message-ID: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>

Dear Makers list,


I am struggling with Maker after many successful attempts. I don't 
understand why but my final .gff does not contain any genes, 0.

I am running first an Evidence based modelling, with proteins only. Here 
I get around 40 thousand genes if I give the proteins as a fasta to 
align (if I provide a protein.gff from a previous maker try, I get 0 
genes, same problem).

Afterwards I'm creating a SNAP hmm and running maker again, turning 
protein2genome=0 and snaphmm=snap.hmm as you say, but now I have 0 
genes. This happens whether I keep providing proteins as a fasta or as 
.gff of a previous run.

I have done this many times and it always worked. The only difference 
now is that I am using no ESTs whatsoever, only proteins. It's also 
strange that it works on the first round of maker but doesn't work on 
the SNAP rounds.


Hope you can help,
Ricardo
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: maker_opts.ctl
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200424/636e3c77/attachment.ksh>

From taosheng.x at gmail.com  Sun Apr 26 00:58:47 2020
From: taosheng.x at gmail.com (Xu, taosheng)
Date: Sun, 26 Apr 2020 14:58:47 +0800
Subject: [maker-devel] Problems with openMPI in multiple computing nodes
Message-ID: <CALJhmFr9Q741vwAZHHH9-pV-PAjfCPRKi-2B0kLx8r0HVHWYOA@mail.gmail.com>

Hello,
I am using a computer cluster with 20 nodes (40 CPUs per node) for
gene annotation. When I submit my MAKER task to one node with 40 CPUs using
OpenMPI, everything works well.
However, when I submit the same MAKER task to multiple nodes (120 CPUs),
I get the errors shown below.
I would appreciate any advice. Thank you.

Best regards,
Taosheng


STATUS: Processing and indexing input FASTA files...
cannot remove directory for home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: No such file or directory at /maker/bin/../lib/FastaDB.pm line 145.
cannot remove directory for /home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: Directory not empty at /maker/bin/../lib/FastaDB.pm line 145.
cannot remove directory for /home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: Directory not empty at /maker/bin/../lib/FastaDB.pm line 145.

From xvazquezc at gmail.com  Sun Apr 26 20:15:53 2020
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 27 Apr 2020 12:15:53 +1000
Subject: [maker-devel] Maker 0 genes after SNAP or with proteins.gff
In-Reply-To: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>
References: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>
Message-ID: <CAL0hg4GUdbQMxN1j5KBQ6JymQSzT_tSbE19fwvEAg6+3_GmXMw@mail.gmail.com>

Hi Ricardo,
it is likely that you are not providing enough evidence to train SNAP (or
even none at all). When you run maker2zff, the defaults may not give any
output if you don't have any ESTs at all. Check maker2zff -h for the
evidence filtering options used to create the model. In the worst case,
you'll need to run maker2zff -n, which doesn't filter the evidence at all.
I also suggest searching the mailing list for this, as it has come up many
times.
Cheers,
Xabi

On Sat, 25 Apr 2020 at 02:46, Ricardo Nuno Ferreira Martins Guerreiro <
guerrer at uni-duesseldorf.de> wrote:

> Dear Makers list,
>
>
> I am struggling with Maker after many successful attempts. I don't
> understand why but my final .gff does not contain any genes, 0.
>
> I am running first an Evidence based modelling, with proteins only. Here
> I get around 40 thousand genes if I give the proteins as a fasta to
> align (if I provide a protein.gff from a previous maker try, I get 0
> genes, same problem).
>
> Afterwards I'm creating a SNAP hmm and running maker again, turning
> protein2genome=0 and snaphmm=snap.hmm as you say, but now I have 0
> genes. This happens whether I keep providing proteins as a fasta or as
> .gff of a previous run.
>
> I have done this many times and it always worked. The only difference
> now is that I am using no ESTs whatsoever, only proteins. It's also
> strange that it works on the first round of maker but doesn't work on
> the SNAP rounds.
>
>
> Hope you can help,
> Ricardo


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From liorglic at mail.tau.ac.il  Thu Apr 30 06:58:17 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Thu, 30 Apr 2020 15:58:17 +0300
Subject: [maker-devel] Missing genes in lift-over with est2genome
In-Reply-To: <373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>
References: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
	<373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>
Message-ID: <CAOzMDPyLSPa33x31R_2d+bKhDN2d6+aFK+mQn5C7xJd9Tq56yg@mail.gmail.com>

Thanks Carson - your answer was very helpful.
Another question related to the lift-over process, if I may.
I want to take the resulting gff and pass it on to another MAKER run, where
I provide further, lower confidence evidence (ESTs and proteins). I'm not
sure which option to use though. According to this helpful post
<https://computationalbiologysite.wordpress.com/2013/07/11/maker-gff-cite-online/>,
I tried using pred_gff and model_gff, but both created cases of fusion
genes when genes are very adjacent to one another (see attached picture),
even with the correct_est_fusion parameter enabled. It looks like the only
way to take lifted-over genes "as-is" would be to use other_gff, but I
figure that this was not really intended for genes. Would you recommend
this usage? Am I missing something?
Thank you!

On Thu, 23 Apr 2020 at 20:43, Carson Holt <carsonhh at gmail.com> wrote:

> There are percent cutoffs for the est2genome algorithm you can set in the
> maker_bopts.ctl file. Additionally, maker will give the alignment but not
> produce a gene model if it can't translate through the est2genome alignment
> (i.e. stop codons in the assembly). I believe the cutoff is 50%. If you add
> est_forward=1 to the maker_opts.ctl file, names will be copied from the
> alignment source and the score in the GFF3 column will be the percent match
> to the original transcript.
>
> --Carson
>
>
>
> > On Apr 21, 2020, at 7:08 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> >
> > Hello,
> > I am using MAKER to annotate a plant genome assembly. A high-quality
> reference genome and annotation exists for another variety of the same
> species, so my first step is lifting over reference genes to my genome. I
> do this by setting est2genome = 1 and providing MAKER with the reference
> cDNA (transcriptome). No other evidence is provided and no prediction is
> performed. Repeat masking is done using the reference repeats library.
> > When checking the results, I found out lots of reference genes missing
> from the lift-over result. However, if I blast the sequences of these genes
> myself, I get good matches. I even see these matches when I look at the
> blast results buried in the MAKER data_store.
> > For example, a transcript of length 1077 got a match of length 855 -
> 100% identity and no gaps. Bitscore was 1709 and E-value 0. This looks like
> a pretty good match, but it is not found in the final MAKER results
> (gff/fasta).
> > Why is this happening? Are there some cutoffs that are not satisfied? If
> so, what are they and how can they be configured?
> >
> > Thanks,
> > Lior
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fusion.png
Type: image/png
Size: 33185 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200430/a53d513e/attachment-0001.png>

From natassa_g_2000 at yahoo.com  Thu Apr  2 07:42:48 2020
From: natassa_g_2000 at yahoo.com (natassa)
Date: Thu, 2 Apr 2020 13:42:48 +0000 (UTC)
Subject: [maker-devel] Optimal strategy and options for iterative maker2 runs
In-Reply-To: <1b0e5c3cae1b410397e61262a2384039@umu.se>
References: <1b0e5c3cae1b410397e61262a2384039@umu.se>
Message-ID: <1875152020.930988.1585834968860@mail.yahoo.com>

Hello maker community, 
I am annotating with maker2 a fungal genome for which I have transcript evidence, plus transcripts and proteins from closely related species, a genemark .mod file from self-training I have run outside of maker, and an augustus model from a closely related species. I plan to run it iteratively, updating snap (and maybe augustus) models each time. Reading several iterative-maker pipelines online, I am a bit confused about the optimal strategy and about some details of the options used in consecutive runs. Some questions:

1) How will MAKER behave in the case where I supply my different lines of evidence (EST+protein) along with trained abinitio models in the same run? Here is what seems to me conflicting info from posts I read (not in this list): "if est2genome and protein2genome are set to 1 + abinitio tools are also on, the abinitio tools will not use the EST-protein evidence to improve their gene models." but: "In case you activated SNAP and Augustus and you have fed MAKER with lines of evidence (Transcripts and proteins), it will predict gene models using Augustus-Evidence-driven and SNAP-Evidence-driven. In loci where both are present, it will chose the best one according to the lines of evidence (EST / protein when they are present)." Which one is correct?
2) I see in a few tutorials that genemark is trained at a 3rd/4th run and separately from other abinitio programs. I don't understand why, since genemark is self-trained on the genome, so it does not really interact with training from evidence or maker gff files?
3) Can I pass >1 abinitio models from one run to the next using the pred_gff option? For example, augustus+genemark hmms, separated by ","? In a 2017 post, Carson writes "I would avoid passing in Augustus results as GFF3, it removes the ability of MAKER to dynamically provide Augustus with hints as it runs". What is the correct way then?

Any input from experienced maker users is welcome!
Thank you in advance, 
Anastasia Gioti

Anastasia Gioti
Researcher
IMBBC-HCMR Crete, Greece
https://scholar.google.com/citations?user=eMsnakoAAAAJ&hl=en

From shore at yorku.ca  Fri Apr  3 11:02:36 2020
From: shore at yorku.ca (shore at yorku.ca)
Date: Fri, 03 Apr 2020 13:02:36 -0400
Subject: [maker-devel] final annotation issues
Message-ID: <1585933356.5e876c2c023db@oldmymail.yorku.ca>

Dear Maker team,

 I believe we are at the final stage of annotation of a plant genome, having
previously trained snap following 3 rounds.

 In our attempts at final annotation we have now added new transcriptome data,
and generated a repeat library for our species (so we now mask with that, as
well as database of plant repeats , and TE proteins).

 In this final annotation run, we've set keep_pred=1 and then plan to
screen the final gff file, retaining sequences with AED <= 0.5 (or
thereabouts) and ones that possess a Pfam domain.

 I've compared some of the proteins obtained in our previous round of Maker with
the latest. Indeed the masking appears to have removed some that were TEs. A
number of proteins differ somewhat, likely the result of different intron/exon
boundary calls, and some are quite different in length.
In particular some are roughly twice the length in previous annotation, and
appear to be of the correct size previously , based upon online blasts.

It is this latter finding that I'm concerned about, and why it has
occurred.

I did set single_exon=1 and wonder if that is causing this effect?

Thanks and sorry for the long-winded email.

Joel


-- 
Dr. Joel S. Shore
Prof. Biology
York University


From carsonhh at gmail.com  Fri Apr  3 14:51:47 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 3 Apr 2020 14:51:47 -0600
Subject: [maker-devel] guidance for first and subsequent annotation
 parameters
In-Reply-To: <CAEdk9jmg1rbuRPbCC4vNmNkTyN61H7XGQpUi4h6UD1czoFtH1g@mail.gmail.com>
References: <CAEdk9jmg1rbuRPbCC4vNmNkTyN61H7XGQpUi4h6UD1czoFtH1g@mail.gmail.com>
Message-ID: <81E8860F-A91B-4089-B179-4ED7EBAC36D3@gmail.com>

You may need to select a subset of gene models to drive training. I find that I get the best results when I generate a training set using only protein2genome models from UniProt/Swiss-Prot alignments, with always_complete=1 set. UniProt/Swiss-Prot is manually curated, so it is very high quality. Then I select models with the highest end-to-end completion (low AED). Also, if you add est_forward=1, the score column in the GFF3 will be the % match to the original model. It's an easy way to select only models with a very high percent match. Remove models without start codons and stop codons. You can relax these parameters if you don't have many models, but in general you want 100-300 models to train with. Only one round of training is needed with this type of training set. The EST method requires 2 rounds and I don't like it as much.

In some cases, model selection for training will be a mostly manual task. You can use editors like Apollo to identify models that match evidence well, and delete odd models. Then train on that result.


What you are seeing is likely the result of over-training. It usually happens if you use more than 2 rounds of training, but it can happen with just two rounds.

--Carson
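
The "select models with low AED" step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the MAKER GFF3 carries an _AED=<value> attribute on mRNA lines, the 0.1 threshold is a hypothetical starting point rather than a recommended value, and the function names and example records are made up. It does not check for start and stop codons, which would need the sequence as well.

```python
from typing import Dict, Iterable, List


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def select_low_aed_mrnas(gff_lines: Iterable[str], max_aed: float = 0.1) -> List[str]:
    """Return IDs of mRNA features whose _AED attribute is <= max_aed."""
    selected = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "mRNA":
            continue
        attributes = parse_attributes(columns[8])
        aed = attributes.get("_AED")
        if aed is not None and "ID" in attributes and float(aed) <= max_aed:
            selected.append(attributes["ID"])
    return selected


# Hypothetical records for illustration.
example_gff = [
    "##gff-version 3",
    "\t".join(["contig1", "maker", "gene", "100", "900", ".", "+", ".", "ID=gene1"]),
    "\t".join(["contig1", "maker", "mRNA", "100", "900", ".", "+", ".",
               "ID=gene1-mRNA-1;Parent=gene1;_AED=0.05;_eAED=0.05"]),
    "\t".join(["contig1", "maker", "mRNA", "2000", "3500", ".", "-", ".",
               "ID=gene2-mRNA-1;Parent=gene2;_AED=0.40;_eAED=0.42"]),
]
training_ids = select_low_aed_mrnas(example_gff, max_aed=0.1)
print(training_ids)
```

If the list comes back with far fewer than the 100-300 models mentioned above, max_aed can be relaxed, in line with the advice to loosen the criteria when few models are available.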

 
> On Mar 20, 2020, at 5:30 AM, Devon O'Rourke <devon.orourke at gmail.com> wrote:
> 
> With so many posts on the forum it's been challenging to determine what the best practices are for performing multiple rounds of annotation with Maker.
> My first round used est, altest, and protein fasta files with a custom GFF repeat masked file. The resulting vertebrate genome produced 21,970 gene models with a mean length of about 9016 bp; the BUSCO score was C:66.0%[S:64.2%,D:1.8%],F:4.2%,M:29.8%,n:9226 (mammalia_odb10 set). Things seemed to be on the right track, so I set up the next Maker round using both SNAP and Augustus-trained information in the round2 maker_opts.ctl file. At the end of that second round, I noticed a marked decrease in BUSCO score (C:53.3%[S:51.0%,D:2.3%],F:11.6%,M:35.1%,n:9226), yet an increase in the number of gene models (28,646) and mean length (16266 bp). 
> 
> This got me to wondering if I was setting up the _opts.ctl file incorrectly? I'm concerned with a few things (and maybe missing even more I should be concerned about!?):
> 1) I specified the evidence to come from EST/Protein instead of using the section available under "#-----Re-annotation Using MAKER Derived GFF3". Maybe that was a fundamental mistake? What is the expected change in behavior if I moved my round1 Maker output into that category instead of using the EST/Protein Homology evidence sections as I did below?
> 2) I wasn't sure what to do with the RepeatMasking GFF files in Round2. The RepeatMasker GFF I included in Round1 consisted of just complex repeats (setting model_org=simple and softmask=1 to effectively only hard mask those complex areas for the initial alignments). But what should be used in Round2 - the output GFF of Round1, or the input GFF from Round1?
> Here's what I did for the Round2 maker_opts.ctl file:
> 
> #-----Genome (these are always required)
> genome=/scratch/dro49/myluwork/annotation/input_files/mylu_hic_rails_noMasks.fa
> organism_type=eukaryotic
> #-----EST Evidence (for best results provide a file for at least one)
> est_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.est2genome.gff
> altest_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.cdna2genome.gff
> #-----Protein Homology Evidence (for best results provide a file for at least one)
> protein_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.protein2genome.gff
> #-----Repeat Masking (leave values blank to skip repeat masking)
> rm_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.repeats.gff
> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
> #-----Gene Prediction
> snaphmm=/scratch/dro49/myluwork/annotation/maker_rd2/snap_rd1/lu_rnd1.zff.length50_aed0.25.hmm #SNAP HMM file
> augustus_species=mylu #Augustus gene prediction species model
> run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
> est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
> protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
> trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
> unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
> allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)
> 
> 
> Thank you for your insights and support,
> 
> Devon
> 
> -- 
> Devon O'Rourke
> Postdoctoral researcher, Northern Arizona University
> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/ <https://fozlab.weebly.com/>
> twitter: @thesciencedork


From carsonhh at gmail.com  Fri Apr  3 16:03:12 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 3 Apr 2020 16:03:12 -0600
Subject: [maker-devel] Problem with Maker using GeneMark
In-Reply-To: <5e838561.1c69fb81.27f9c.24c6SMTPIN_ADDED_MISSING@mx.google.com>
References: <5e838561.1c69fb81.27f9c.24c6SMTPIN_ADDED_MISSING@mx.google.com>
Message-ID: <E2FF8D2F-8B8F-456B-BC0E-1A1C099D05D3@gmail.com>

Could you try the attached version, and let me know if it resolves the issue (copy over the old one)? The probuild command I used is just one I stole from another GeneMark script, so I just borrowed the updated command from the SplitFasta subroutine in gmes_petap.pl.

?Carson


> On Mar 31, 2020, at 11:53 AM, Gagné, Patrick (NRCAN/RNCAN) <patrick.gagne at canada.ca> wrote:
> 
> Hi
>  
> I've come across a bug while using Maker. I'm trying to annotate a 560 Mb genome and I'm using SNAP, GeneMark and Augustus in Maker.
> When Maker executes the GeneMark command, it just fails (GeneMark Failed) without any error message, so I decided to debug it myself. I launched every command manually and found that gmhmm_wrap is causing the issue. The problem is in fact in the probuild command; it doesn't do anything (from what I understand, this command is supposed to split the fasta where there is NNN to prevent GeneMark from crashing). My genome has very long stretches of N (up to 14 kb).
>  
> After checking the probuild help, I found that the command used in gmhmm_wrap is not valid (half the options are no longer in probuild, probably because of GeneMark updates).
>  
> I have tried different probuild versions (those I could download from the GeneMark site; they don't provide older versions except those that come with their program's versions):
> 2.16
> 2.34
> 2.44 (latest, comes with GeneMark ES)
>  
> I've also tried to edit the gmhmm_wrap script and modify the probuild command, but even when the fasta files are split, I get another bug: ERROR: Logic error in getting offset. I tried to replace the command for the offset extraction, which also worked, but now I get a bug when Maker tries to get the ab-initio output:
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Calling translate without a seq argument!
>  
> Could you please tell me how to fix this, or tell me which probuild I should use (I will ask GeneMark support for it)?
>  
> Thanks in advance
>  
> P.S.
> Sorry for my English; it's not my first language and I'm still learning.
>  
> Patrick Gagné
> Spécialiste en bio-informatique / Bioinformatics specialist
> Service canadien des forêts / Canadian Forest Service
> Ressources naturelles Canada / Natural Resources Canada
> Gouvernement du Canada / Government of Canada
> Centre de foresterie des Laurentides/Laurentian Forestry Centre
> 1055, rue du P.E.P.S.
> C.P. 10380, succ. Sainte-Foy/P.O. Box 10380, Stn. Sainte-Foy
> Québec (Qc) G1V 4C7
> Laboratoire de pathologie forestière (Local 2.21)
> patrick.gagne at canada.ca <mailto:patrick.gagne at canada.ca> / tel : (418) 648-4443
>  
>  
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmhmm_wrap
Type: application/octet-stream
Size: 9027 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200403/7766c8a5/attachment-0002.obj>

From carsonhh at gmail.com  Sat Apr  4 14:09:05 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 4 Apr 2020 14:09:05 -0600
Subject: [maker-devel] repeatmasker output gff
In-Reply-To: <CA+un12sUfaxK_ycgves9cohrdFjOyg7Nt9WUzGSK6Yz-RbdgMQ@mail.gmail.com>
References: <CA+un12sUfaxK_ycgves9cohrdFjOyg7Nt9WUzGSK6Yz-RbdgMQ@mail.gmail.com>
Message-ID: <08952604-1F7A-4FC5-9F59-DB79665A324D@gmail.com>

It needs to be a two-level feature. match/match_part is one example; other types will work as long as the assembled feature has two levels.

MAKER saves its state as it runs, so you can restart it at any time without losing progress.
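
As a rough illustration of the two-level layout, here is a minimal Python sketch that rewrites RepeatMasker "similarity" rows as a match parent with one match_part child. The function name, the rm_hit_N ID scheme, and the assumption that the attribute column looks like Target "Motif:NAME" START END are all hypothetical choices for this example; it is not MAKER's own converter, and the exact attributes MAKER expects should be checked against its documentation.

```python
import re


def to_match_pair(line, index):
    """Turn one RepeatMasker 'similarity' row into a match/match_part pair.

    Returns a list of two tab-delimited GFF3 lines, or an empty list if the
    row is not a 9-column 'similarity' feature.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "similarity":
        return []
    seqid, source, _old_type, start, end, score, strand, phase, attrs = cols

    # Assumed attribute layout: Target "Motif:AluY" 1 100
    m = re.match(r'Target "Motif:([^"]+)" (\d+) (\d+)', attrs)
    name = m.group(1) if m else "repeat"

    parent_id = f"rm_hit_{index}"  # hypothetical ID scheme
    parent_attrs = f"ID={parent_id};Name={name}"
    child_attrs = f"ID={parent_id}_part1;Parent={parent_id}"
    if m:
        child_attrs += f";Target={name} {m.group(2)} {m.group(3)}"

    # Level 1: the whole hit. Level 2: its single aligned block.
    parent = "\t".join(
        [seqid, source, "match", start, end, score, strand, phase, parent_attrs]
    )
    child = "\t".join(
        [seqid, source, "match_part", start, end, score, strand, phase, child_attrs]
    )
    return [parent, child]


if __name__ == "__main__":
    example = 'chr1\tRepeatMasker\tsimilarity\t10\t50\t12.5\t+\t.\tTarget "Motif:AluY" 1 41'
    for row in to_match_pair(example, 1):
        print(row)
```

Each input row becomes two rows that share coordinates, linked through ID and Parent, which is what "two levels when assembled" amounts to.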

--Carson


> On Mar 25, 2020, at 2:38 PM, Homa Papoli <hpapoli at gmail.com> wrote:
> 
> Hello,
> 
> I have 2 questions regarding using MAKER:
> 
> I have used repeatmasker for my genome separately and I have a gff file. However, my gff file, in the third column, has the word "similarity". In a workshop I had taken on genome annotation, it was said that the gff for maker should have "match" and "match_part" for the third column. I was wondering whether I could use the original gff output of repeatmasker or should I make any changes to it?
> 
> Another question is about running maker. Since maker takes several days to run, if the job gets interrupted due to limit in days of running the job, I was wondering whether it is possible to re-start maker from where it got interrupted?
> 
> Thank you,
> Homa
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Sat Apr  4 14:15:21 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 4 Apr 2020 14:15:21 -0600
Subject: [maker-devel] Maker annotation  AED scores are around 0.5
In-Reply-To: <1b0e5c3cae1b410397e61262a2384039@umu.se>
References: <1b0e5c3cae1b410397e61262a2384039@umu.se>
Message-ID: <E2AAAE35-7B24-46EE-B77F-9E4BD584CC45@gmail.com>

Probably this -->

https://groups.google.com/forum/#!searchin/maker-devel/Curious$20pattern$20in$20AED$20distributions%7Csort:date/maker-devel/QS3VnxhvEks/q3lPmywjBQAJ <https://groups.google.com/forum/#!searchin/maker-devel/Curious$20pattern$20in$20AED$20distributions|sort:date/maker-devel/QS3VnxhvEks/q3lPmywjBQAJ>


Likely caused by an over abundance of single-exon models and under masking of repeats in the genome.
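
To look at the AED distribution directly, a small Python sketch along these lines could bin the AED values of the mRNA features in a MAKER GFF3. It assumes the mRNA rows carry an _AED=... attribute in column 9; the function name and the bin width are arbitrary choices for the example.

```python
from collections import Counter


def aed_histogram(gff_lines, n_bins=10):
    """Count mRNA features per AED bin (bin 0 = [0, 0.1), ..., last bin includes 1.0)."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        # Column 9 is a ';'-separated list of key=value pairs.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "_AED" not in attrs:
            continue
        aed = float(attrs["_AED"])
        # Small epsilon guards against float rounding at bin edges.
        idx = min(int(aed * n_bins + 1e-9), n_bins - 1)
        counts[idx] += 1
    return counts


if __name__ == "__main__":
    demo = [
        "##gff-version 3",
        "chr1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=m1;Parent=g1;_AED=0.50;_eAED=0.50",
        "chr1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=m2;Parent=g2;_AED=0.05;_eAED=0.05",
    ]
    hist = aed_histogram(demo)
    for b in sorted(hist):
        print(f"{b / 10:.1f}-{(b + 1) / 10:.1f}\t{hist[b]}")
```

Running the same count separately for single-exon and multi-exon models would be one way to test whether the peak near 0.5 comes mostly from single-exon genes.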

--Carson


> On Mar 30, 2020, at 3:37 AM, Wei Zhao <zhao.wei at umu.se> wrote:
> 
> Dear maker team,
>  
> I am writing to ask for your help.
>  
> I am using MAKER to annotate a big genome (~9 Gbp). I have 3 lines of evidence: 1) the transcriptome of this species; 2) protein sequences from related species; 3) an Augustus model trained from PASA.
>  
> When I use all 3 lines of evidence to annotate the genome (basic pipeline), the distribution of AED scores is weird (a single peak around 0.5).
>  
> I have also tried to update the gene models I got from PASA using MAKER; the distribution of AED scores is the same.
>  
> But when I use only ESTs or proteins as evidence (est2genome or protein2genome), the AED scores are normal (close to 0).
>  
> As I understand it, the 3 lines of evidence seem to conflict with each other, which makes the AED scores higher (~0.5) than expected. Could you give me some suggestions on how to fix this problem?
>  
> Best regards,
>  
> Wei
>  
>  
> <E6F3EF742C40408F8390EE9A1FF29894.png>
>  
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org <http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org>

From danis_theo at hotmail.com  Thu Apr  2 12:24:05 2020
From: danis_theo at hotmail.com (Thodoris Danis)
Date: Thu, 2 Apr 2020 18:24:05 +0000
Subject: [maker-devel] Question about re-annotation
Message-ID: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>

Hello maker community,


I am annotating a fish genome with MAKER 2.31.10, for which I have transcript evidence and proteins from closely related species. I am running MAKER in several passes, and each time I pass the GFF output to the next pass here:

"#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #re-annotate genome based on this gff3 file"

Is this wrong? And how does this differ from running MAKER from scratch without re-supplying the new GFF every time?

Secondly, in the second round, after training the predictors, do I have to keep est2genome=1, protein2genome=1, and the EST and protein evidence, or not?
How do we switch all these parameters?

Any input from experienced maker users is welcome
Thank you for your help


Thodoris Danis

From carsonhh at gmail.com  Sun Apr  5 16:19:26 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 5 Apr 2020 16:19:26 -0600
Subject: [maker-devel] Question about re-annotation
In-Reply-To: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>
References: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>
Message-ID: <9B863735-2D4D-4FD2-A8B1-3C542F3D767A@gmail.com>

If you are running several times, just rerun in the same directory after altering settings. MAKER will reuse old raw data reports as appropriate. The maker_gff option is really just for reannotating from an old maker run where you no longer have the raw files available.

--Carson


> On Apr 2, 2020, at 12:24 PM, Thodoris Danis <danis_theo at hotmail.com> wrote:
> 
> Hello maker community,
> 
> 
> I am annotating a fish genome with MAKER 2.31.10, for which I have transcript evidence and proteins from closely related species. I am running MAKER in several passes, and each time I pass the GFF output to the next pass here:
>> 
>> "#-----Re-annotation Using MAKER Derived GFF3
>> maker_gff= #re-annotate genome based on this gff3 file"
> Is this wrong? And how does this differ from running MAKER from scratch without re-supplying the new GFF every time?
> 
> Secondly, in the second round, after training the predictors, do I have to keep est2genome=1, protein2genome=1, and the EST and protein evidence, or not?
> How do we switch all these parameters?
> 
> Any input from experienced maker users is welcome
> Thank you for your help
> 
> 
> Thodoris Danis
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org <http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org>

From carsonhh at gmail.com  Tue Apr  7 11:42:08 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 7 Apr 2020 11:42:08 -0600
Subject: [maker-devel] Maker 2.31.10: maker_functional_gff and
 maker_functional_fasta not parsing correctly,
 Can't use string ("") as a HASH ref while "strict refs" in use
In-Reply-To: <4A449297-A6D1-4A75-9547-FB5F70CE1A0A@ulaval.ca>
References: <4A449297-A6D1-4A75-9547-FB5F70CE1A0A@ulaval.ca>
Message-ID: <D775DED1-F3A5-478B-B7BE-8F318CFEADA3@gmail.com>

Thanks, I'll update the related scripts. In my tests the old regular expression still works, but it ends up adding the OX= tag as part of the GFF3 entry rather than throwing a hash ref error. So you may still have another issue if you are getting a hash ref error.
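
The difference between the two patterns can be reproduced outside of MAKER with a short Python sketch. The two regular expressions are copied from the quoted message below; the FASTA header is a made-up example in the newer UniProt style (with an OX= field), and the parse helper and its field names are hypothetical.

```python
import re

# Old pattern from the script and the proposed replacement, as quoted in this thread.
OLD = re.compile(r">(\S+)\s+(.*?)\s+OS=(.*?)\s+(GN=(.*?)\s+)?PE=")
NEW = re.compile(r">sp\|(\S+)\|\S+\s+(.*?)\s+OS=(.*?)\s+OX=\S+\s+(GN=(.*?)\s+)?PE=")

# Invented header for illustration only (not a real UniProt entry).
HEADER = ">sp|Q00000|FAKE_HUMAN Hypothetical example protein OS=Homo sapiens OX=9606 GN=FAKE1 PE=1 SV=1"


def parse(pattern, header):
    """Return the captured fields as a dict, or None if the header does not match."""
    m = pattern.search(header)
    if m is None:
        return None
    return {
        "id": m.group(1),
        "description": m.group(2),
        "organism": m.group(3),
        "gene": m.group(5),
    }


if __name__ == "__main__":
    print("old:", parse(OLD, HEADER))
    print("new:", parse(NEW, HEADER))
```

On this header the old pattern still matches but captures "Homo sapiens OX=9606" as the organism and the full "sp|Q00000|FAKE_HUMAN" token as the ID, while the new pattern yields "Homo sapiens" and "Q00000". Note that the new pattern only matches headers that start with ">sp|".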

--Carson


> On Mar 14, 2020, at 11:24 AM, Christopher Keeling <christopher.keeling.1 at ulaval.ca> wrote:
> 
> Hello,
> 
> In sub parse_blast{, during parsing of uniprot fasta file:
> 
> if (/>(\S+)\s+(.*?)\s+OS=(.*?)\s+(GN=(.*?)\s+)?PE=/) {
> 
> should be changed to:
> 
> if (/>sp\|(\S+)\|\S+\s+(.*?)\s+OS=(.*?)\s+OX=\S+\s+(GN=(.*?)\s+)?PE=/) {
> 
> to avoid "Can't use string ("") as a HASH ref while "strict refs" in use at ..." errors.
> 
> For UniProt release 2020_01: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz <ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz>
> 
> Cheers,
> Chris
> 
> 
> --
> Christopher I. Keeling
> Chercheur scientifique en génomique forestière/ Research Scientist in Forest Genomics
> 
> Ressources naturelles Canada / Natural Resources Canada
> Service canadien des for?ts / Canadian Forest Service
> Centre de foresterie des Laurentides / Laurentian Forestry Centre
> 1055, rue du PEPS Québec, QC G1V 4C7 Canada
> https://cfs.nrcan.gc.ca/employees/read/ckeeling <https://cfs.nrcan.gc.ca/employees/read/ckeeling>
> 
> Professeur associé
> Département de biochimie, de microbiologie et de bio-informatique
> Université Laval
> https://www.researchgate.net/profile/Christopher_Keeling <https://www.researchgate.net/profile/Christopher_Keeling>
> https://scholar.google.ca/citations?user=KtGr86UAAAAJ <https://scholar.google.ca/citations?user=KtGr86UAAAAJ>
>  
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org


From andrei.kiselev at lrsv.ups-tlse.fr  Fri Apr 10 10:33:57 2020
From: andrei.kiselev at lrsv.ups-tlse.fr (andrei.kiselev at lrsv.ups-tlse.fr)
Date: Fri, 10 Apr 2020 16:33:57 +0000
Subject: [maker-devel] New assembly annotation
Message-ID: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>

Hello.
I have recently obtained a new PacBio genome assembly of the oomycete Aphanomyces.
I used MAKER in the manner described here: https://groups.google.com/forum/#!searchin/maker-devel/new$20assembly%7Csort:date/maker-devel/Xo5YbWgNwFw/KstkmXYYAgAJ

After the first run, the number of transcripts was slightly higher than in the GFF file of the previous genome version. I then ran MAKER a second time with the new GFF file in the pred_gff option plus Augustus trained for my species. As a result, I got only half of the transcripts from the initial GFF.

Is there something I could have overlooked when running MAKER? Attached is the control file of the last run.

Thank you in advance.
Andrei
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4984 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200410/54d75f1c/attachment-0002.obj>

From liorglic at mail.tau.ac.il  Mon Apr 13 08:12:42 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Mon, 13 Apr 2020 17:12:42 +0300
Subject: [maker-devel] Annotating a fragmented assembly
Message-ID: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>

Hello there,

I am working on creating plant pan genomes. This means that I produce many
assemblies for samples of the same species from NGS data available from SRA
and then annotate them with MAKER, based on a collection of relevant
evidence (transcripts and proteins).
As you might imagine, data quality is variable, so I sometimes create
assemblies from >x20 sequencing depth, resulting in fragmented assemblies
(say N50 in the range of 5-10kb).
Annotation results of such genomes usually contain many partial genes,
broken across contigs, so in many cases I get two proteins, representing
the 3' and 5' parts of a broken gene. In other cases, only one part of the
gene is detected.
I've also found that applying reference-based scaffolding (I use RaGOO) to
generate pseudomolecules improves results by bringing together contigs
containing gene parts and allowing MAKER to create full annotation.
However, this also results in new erroneous predictions, spanning two
contigs that are not actually adjacent in the genome but were brought
together by the scaffolding process.
I suspect this has to do with the number of 'N' characters introduced as
padding between ordered contigs, so one thing I wanted to ask about is how
MAKER reacts to N's in the middle of a gene. Does it affect gene prediction?
I would also appreciate any advice on how to annotate fragmented genomes
and comments about the strategy I described above. Please note that I am
not expecting a reference-level annotation, but am simply trying to reduce
noise levels towards downstream comparative analyses.

Thanks a lot and best regards,
Lior

From xpeng at ucsb.edu  Tue Apr 14 11:40:15 2020
From: xpeng at ucsb.edu (xpeng at ucsb.edu)
Date: Tue, 14 Apr 2020 10:40:15 -0700
Subject: [maker-devel] Can install but Cannot Run MAKER
Message-ID: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>

Dear Yandell Lab,

 
I am writing to get a bit of help getting MAKER to work.

I downloaded MAKER v3.01.03 and followed the instructions on your wiki
page to install it, both on my local computer (as sudo) and on PSC Bridges
(with MPI).

The installation seemed to complete successfully.

However, when I ran "maker -h" I received error messages (attached) that I
don't know how to resolve. Could you please advise a solution?

Thank you!

Nick (Xuefeng Peng)

Postdoctoral Scholar
University of California
Santa Barbara, CA

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Error_Message_Ubuntu_19.10.txt
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200414/38690941/attachment-0004.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Error_Message_PSC_Bridges.txt
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200414/38690941/attachment-0005.txt>

From carsonhh at gmail.com  Tue Apr 14 12:11:16 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 14 Apr 2020 12:11:16 -0600
Subject: [maker-devel] Can install but Cannot Run MAKER
In-Reply-To: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>
References: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>
Message-ID: <94E18556-771A-485C-B534-80B52BC7D586@gmail.com>

Please re-download and install again. I found the issue from your error in the new install package.

--Carson


> On Apr 14, 2020, at 11:40 AM, <xpeng at ucsb.edu> <xpeng at ucsb.edu> wrote:
> 
> Dear Yandell Lab,
>  
> I am writing to get a bit of help on making MAKER to work.
>  
> I downloaded the v3.01.03 maker and followed the instructions on your wiki page to install, both on my local computer as sudo and on PSC Bridges (with MPI). 
>  
> The installation seemed to have completed successfully.
>  
> However, when I ran "maker -h" I received error messages (attached) that I don't know what to do about. Could you please advise a solution?
>  
> Thank you!
>  
> Nick (Xuefeng Peng)
>  
> Postdoctoral Scholar
> University of California
> Santa Barbara, CA
> <Error_Message_Ubuntu_19.10.txt><Error_Message_PSC_Bridges.txt>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org <http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org>

From liorglic at mail.tau.ac.il  Tue Apr 21 07:08:40 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Tue, 21 Apr 2020 16:08:40 +0300
Subject: [maker-devel] Missing genes in lift-over with est2genome
Message-ID: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>

Hello,
I am using MAKER to annotate a plant genome assembly. A high-quality
reference genome and annotation exists for another variety of the same
species, so my first step is lifting over reference genes to my genome. I
do this by setting est2genome = 1 and providing MAKER with the reference
cDNA (transcriptome). No other evidence is provided and no prediction is
performed. Repeat masking is done using the reference repeats library.
When checking the results, I found out lots of reference genes missing from
the lift-over result. However, if I blast the sequences of these genes
myself, I get good matches. I even see these matches when I look at the
blast results buried in the MAKER data_store.
For example, a transcript of length 1077 got a match of length 855 - 100%
identity and no gaps. Bitscore was 1709 and E-value 0. This looks like a
pretty good match, but it is not found in the final MAKER results
(gff/fasta).
Why is this happening? Are there some cutoffs that are not satisfied? If
so, what are they and how can they be configured?

Thanks,
Lior

From carsonhh at gmail.com  Thu Apr 23 11:38:54 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:38:54 -0600
Subject: [maker-devel] Annotating a fragmented assembly
In-Reply-To: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>
References: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>
Message-ID: <C9C6F924-D27C-498A-81B8-B051C25CDB27@gmail.com>

N's are handled by the gene predictors themselves. I know Augustus can span N's within introns. I'm not sure how many N's will cause it to split the gene; it may be a function of the expected intron length in the HMM, so organisms with large introns could handle more N's. GeneMark will split genes on even short runs of N's. I'm not sure about SNAP. For BLAST alignments, gap extensions decrease the score, so how long the gap can be depends on the score of the initial seeding alignment. The larger the initial score, the longer the gap can be before the score drops below the termination threshold.
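
Since the behaviour depends on how long the N runs are, it may help to measure the padding that scaffolding actually introduced. The following is a minimal Python sketch that lists runs of N in a sequence string; the function name and the min_len parameter are hypothetical, and reading the FASTA file itself is left out.

```python
import re


def n_runs(sequence, min_len=1):
    """Return (start, end, length) for each run of N/n, using 1-based inclusive coordinates."""
    runs = []
    for m in re.finditer(r"[Nn]+", sequence):
        length = m.end() - m.start()
        if length >= min_len:
            # m.start() is 0-based, m.end() is exclusive, so end needs no +1.
            runs.append((m.start() + 1, m.end(), length))
    return runs


if __name__ == "__main__":
    toy_scaffold = "ACGTNNNNACGTnnAC"
    for start, end, length in n_runs(toy_scaffold):
        print(f"N run at {start}-{end} ({length} bp)")
```

Comparing these coordinates with the positions of predicted introns would show which gene models span a scaffolding gap.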

--Carson


> On Apr 13, 2020, at 8:12 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> 
> Hello there,
> 
> I am working on creating plant pan genomes. This means that I produce many assemblies for samples of the same species from NGS data available from SRA and then annotate them with MAKER, based on a collection of relevant evidence (transcripts and proteins).
> As you might imagine, data quality is variable, so I sometimes create assemblies from >x20 sequencing depth, resulting in fragmented assemblies (say N50 in the range of 5-10kb).
> Annotation results of such genomes usually contain many partial genes, broken across contigs, so in many cases I get two proteins, representing the 3' and 5' parts of a broken gene. In other cases, only one part of the gene is detected.
> I've also found that applying reference-based scaffolding (I use RaGOO) to generate pseudomolecules improves results by bringing together contigs containing gene parts and allowing MAKER to create full annotation. However, this also results in new erroneous predictions, spanning two contigs that are not actually adjacent in the genome but were brought together by the scaffolding process.
> I suspect this has to do with the number of 'N' characters introduced as padding between ordered contigs, so one thing I wanted to ask about is how MAKER reacts to N's in the middle of a gene. Does it affect gene prediction?
> I would also appreciate any advice on how to annotate fragmented genomes and comments about the strategy I described above. Please note that I am not expecting a reference-level annotation, but am simply trying to reduce noise levels towards downstream comparative analyses.
> 
> Thanks a lot and best regards,
> Lior
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Thu Apr 23 11:43:30 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:43:30 -0600
Subject: [maker-devel] Missing genes in lift-over with est2genome
In-Reply-To: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
References: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
Message-ID: <373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>

There are percent cutoffs for the est2genome algorithm that you can set in the maker_bopts.ctl file. Additionally, MAKER will give the alignment but not produce a gene model if it can't translate through the est2genome alignment (i.e. stop codons in the assembly). I believe the cutoff is 50%. If you add est_forward=1 to the maker_opts.ctl file, names will be copied from the alignment source and the score in the GFF3 column will be the percent match to the original transcript.
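
For the example in the question (a 1077 bp transcript with an 855 bp match), the coverage arithmetic can be sketched in Python as below. The thresholds used here (0.8 and 0.5) are illustrative assumptions, not values confirmed in this thread; the actual settings are whatever appears in your own maker_bopts.ctl, and the function names are hypothetical.

```python
def alignment_coverage(aligned_len, query_len):
    """Fraction of the query transcript covered by the alignment."""
    return aligned_len / query_len


def passes_cutoff(aligned_len, query_len, min_cov):
    """True if the alignment covers at least min_cov of the query."""
    return alignment_coverage(aligned_len, query_len) >= min_cov


if __name__ == "__main__":
    cov = alignment_coverage(855, 1077)
    print(f"coverage = {cov:.3f}")
    # Assumed example thresholds; check maker_bopts.ctl for the real ones.
    for threshold in (0.8, 0.5):
        print(threshold, passes_cutoff(855, 1077, threshold))
```

The match covers about 79.4% of the transcript, so it would fail a coverage cutoff of 0.8 while passing one of 0.5, even though identity is 100%.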

--Carson


> On Apr 21, 2020, at 7:08 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> 
> Hello,
> I am using MAKER to annotate a plant genome assembly. A high-quality reference genome and annotation exists for another variety of the same species, so my first step is lifting over reference genes to my genome. I do this by setting est2genome = 1 and providing MAKER with the reference cDNA (transcriptome). No other evidence is provided and no prediction is performed. Repeat masking is done using the reference repeats library.
> When checking the results, I found out lots of reference genes missing from the lift-over result. However, if I blast the sequences of these genes myself, I get good matches. I even see these matches when I look at the blast results buried in the MAKER data_store.
> For example, a transcript of length 1077 got a match of length 855 - 100% identity and no gaps. Bitscore was 1709 and E-value 0. This looks like a pretty good match, but it is not found in the final MAKER results (gff/fasta).
> Why is this happening? Are there some cutoffs that are not satisfied? If so, what are they and how can they be configured?
> 
> Thanks,
> Lior
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Thu Apr 23 11:53:27 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:53:27 -0600
Subject: [maker-devel] New assembly annotation
In-Reply-To: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>
References: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>
Message-ID: <DFFD73D7-8379-467B-9992-FDDBAE230802@gmail.com>

Fewer transcripts can mean fewer split and spurious genes. It can also be bad merges because of overtraining.  Use BUSCO to evaluate the completeness of gene models rather than transcript count.  Also review models visually using something like Apollo.  You will be able to see if models are spanning distinct evidence clusters or if they were previously split within evidence clusters.  That will help you better identify if the models now better follow the evidence alignments.

--Carson


> On Apr 10, 2020, at 10:33 AM, andrei.kiselev at lrsv.ups-tlse.fr wrote:
> 
> Hello.
> I have recently obtained a new PacBio genome assembly of the oomycete Aphanomyces.
> I used MAKER in the manner described here: https://groups.google.com/forum/#!searchin/maker-devel/new$20assembly%7Csort:date/maker-devel/Xo5YbWgNwFw/KstkmXYYAgAJ
> 
> After the first run, the number of transcripts was slightly higher than in the GFF file of the previous genome version. I then ran MAKER a second time with the new GFF file in the pred_gff option plus Augustus trained for my species. As a result, I got only half of the transcripts from the initial GFF.
> 
> Is there something I could have overlooked when running MAKER? Attached is the control file of the last run.
> 
> Thank you in advance.
> Andrei
> <maker_opts.ctl>_______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Thu Apr 23 11:57:23 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:57:23 -0600
Subject: [maker-devel] final annotation issues
In-Reply-To: <1585933356.5e876c2c023db@oldmymail.yorku.ca>
References: <1585933356.5e876c2c023db@oldmymail.yorku.ca>
Message-ID: <D56728B7-B822-4EF4-AF75-7EE76C6D6908@gmail.com>

I would not recommend single_exon=1 unless this is an organism where you expect a lot of single-exon genes (typically fungi or oomycetes). It's best to review models visually in something like Apollo to see how evidence alignments compare to gene predictions. There is always the chance that you have some overmasking that could trim some regions you don't want to lose.

--Carson


> On Apr 3, 2020, at 11:02 AM, shore at yorku.ca wrote:
> 
> Dear Maker team,
> 
> I believe we are the final stage of annotation of a plant genome, having
> previously trained snap following 3 rounds.
> 
> In our attempts at final annotation we have now added new transcriptome data,
> and generated a repeat library for our species (so we now mask with that, as
> well as database of plant repeats , and TE proteins).
> 
> In this final annotation run, we've set keep_pred=1 and then plan to
> screen the final gff file retaining sequences with AED<= 0.5 (or there
> abouts) and ones that possess a pfam domain .
> 
> I've compared some of the proteins obtained in our previous round of Maker with
> the latest. Indeed the masking appears to have removed some that were TEs. A
> number of proteins differ somewhat, likely the result of different intron/exon
> boundary calls, and some are quite different in length.
> In particular some are roughly twice the length in previous annotation, and
> appear to be of the correct size previously , based upon online blasts.
> 
> It is this latter finding that I'm concerned about.
> Why it has occurred.
> 
> I did set single-exon=1 and wonder if that is causing this effect?
> 
> Thanks and sorry for the long-winded email.
> 
> Joel
> 
> 
> 
> -- 
> Dr. Joel S. Shore
> Prof. Biology
> York University
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org


From guerrer at uni-duesseldorf.de  Fri Apr 24 08:27:24 2020
From: guerrer at uni-duesseldorf.de (Ricardo Nuno Ferreira Martins Guerreiro)
Date: Fri, 24 Apr 2020 16:27:24 +0200
Subject: [maker-devel] Maker 0 genes after SNAP or with proteins.gff
Message-ID: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>

Dear Makers list,


I am struggling with Maker after many successful attempts. I don't 
understand why but my final .gff does not contain any genes, 0.

I am running first an Evidence based modelling, with proteins only. Here 
I get around 40 thousand genes if I give the proteins as a fasta to 
align (if I provide a protein.gff from a previous maker try, I get 0 
genes, same problem).

Afterwards I create a SNAP HMM and run MAKER again, setting
protein2genome=0 and snaphmm=snap.hmm as you say, but now I get 0
genes. This happens whether I keep providing proteins as a FASTA or as
the .gff of a previous run.

I have done this many times and it always worked. The only difference 
now is that I am using no ESTs whatsoever, only proteins. It's also 
strange that it works on the first round of maker but doesn't work on 
the SNAP rounds.


Hope you can help,
Ricardo
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: maker_opts.ctl
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200424/636e3c77/attachment-0002.ksh>

From taosheng.x at gmail.com  Sun Apr 26 00:58:47 2020
From: taosheng.x at gmail.com (Xu, taosheng)
Date: Sun, 26 Apr 2020 14:58:47 +0800
Subject: [maker-devel] Problems with openMPI in multiple computing nodes
Message-ID: <CALJhmFr9Q741vwAZHHH9-pV-PAjfCPRKi-2B0kLx8r0HVHWYOA@mail.gmail.com>

Hello,
I am using a computer cluster with 20 nodes (40 CPUs per node) for
gene annotation. When I submit my MAKER task to one node with 40 CPUs using
OpenMPI, everything works well.
But I encounter a problem when submitting the same MAKER task to the
cluster across multiple nodes (120 CPUs). The errors are shown below.
I would appreciate any advice. Thank you.

Best regards,
Taosheng


STATUS: Processing and indexing input FASTA files...
cannot remove directory for home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: No such file or directory at /maker/bin/../lib/FastaDB.pm line 145.
cannot remove directory for /home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: Directory not empty at /maker/bin/../lib/FastaDB.pm line 145.
cannot remove directory for /home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: Directory not empty at /maker/bin/../lib/FastaDB.pm line 145.

From xvazquezc at gmail.com  Sun Apr 26 20:15:53 2020
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Mon, 27 Apr 2020 12:15:53 +1000
Subject: [maker-devel] Maker 0 genes after SNAP or with proteins.gff
In-Reply-To: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>
References: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>
Message-ID: <CAL0hg4GUdbQMxN1j5KBQ6JymQSzT_tSbE19fwvEAg6+3_GmXMw@mail.gmail.com>

Hi Ricardo,
it is likely that you are not providing enough evidence to train SNAP (or
even none at all). When you run maker2zff, the defaults may not give any
output if you don't have any ESTs at all. Check maker2zff -h for the
evidence filtering options used to create the model. In the worst case,
you'll need to run maker2zff -n, which doesn't filter the evidence at all.
I also suggest searching the mailing list about this, as it has come up
many times.
Cheers,
Xabi

On Sat, 25 Apr 2020 at 02:46, Ricardo Nuno Ferreira Martins Guerreiro <
guerrer at uni-duesseldorf.de> wrote:

> Dear Makers list,
>
>
> I am struggling with Maker after many successful attempts. I don't
> understand why but my final .gff does not contain any genes, 0.
>
> I am running first an Evidence based modelling, with proteins only. Here
> I get around 40 thousand genes if I give the proteins as a fasta to
> align (if I provide a protein.gff from a previous maker try, I get 0
> genes, same problem).
>
> Afterwards I'm creating a SNAP hmm and running maker again, setting
> protein2genome=0 and snaphmm=snap.hmm as you say, but now I have 0
> genes. This happens whether I keep providing proteins as a fasta or as
> a .gff from a previous run.
>
> I have done this many times and it always worked. The only difference
> now is that I am using no ESTs whatsoever, only proteins. It's also
> strange that it works on the first round of maker but doesn't work on
> the SNAP rounds.
>
>
> Hope you can help,
> Ricardo
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From liorglic at mail.tau.ac.il  Thu Apr 30 06:58:17 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Thu, 30 Apr 2020 15:58:17 +0300
Subject: [maker-devel] Missing genes in lift-over with est2genome
In-Reply-To: <373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>
References: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
	<373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>
Message-ID: <CAOzMDPyLSPa33x31R_2d+bKhDN2d6+aFK+mQn5C7xJd9Tq56yg@mail.gmail.com>

Thanks Carson - your answer was very helpful.
Another question related to the lift-over process, if I may.
I want to take the resulting gff and pass it on to another MAKER run, where
I provide further, lower confidence evidence (ESTs and proteins). I'm not
sure which option to use though. According to this helpful post
<https://computationalbiologysite.wordpress.com/2013/07/11/maker-gff-cite-online/>,
I tried using pred_gff and model_gff, but both created cases of fusion
genes when genes are very adjacent to one another (see attached picture),
even with the correct_est_fusion parameter enabled. It looks like the only
way to take lifted-over genes "as-is" would be to use other_gff, but I
figure that this was not really intended for genes. Would you recommend
this usage? Am I missing something?
Thank you!

On Thu, 23 Apr 2020 at 20:43, Carson Holt <carsonhh at gmail.com> wrote:

> There are percent cutoffs for the est2genome algorithm you can set in the
> maker_bopts.ctl file. Additionally, maker will give the alignment but not
> produce a gene model if it can't translate through the est2genome alignment
> (i.e. stop codons in the assembly). I believe the cutoff is 50%. If you add
> est_forward=1 to the maker_opts.ctl file names will be copied from the
> alignment source and the score in the GFF3 column will be the percent match
> to the original transcript.
>
> --Carson
>
>
>
> > On Apr 21, 2020, at 7:08 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> >
> > Hello,
> > I am using MAKER to annotate a plant genome assembly. A high-quality
> reference genome and annotation exists for another variety of the same
> species, so my first step is lifting over reference genes to my genome. I
> do this by setting est2genome = 1 and providing MAKER with the reference
> cDNA (transcriptome). No other evidence is provided and no prediction is
> performed. Repeat masking is done using the reference repeats library.
> > When checking the results, I found out lots of reference genes missing
> from the lift-over result. However, if I blast the sequences of these genes
> myself, I get good matches. I even see these matches when I look at the
> blast results buried in the MAKER data_store.
> > For example, a transcript of length 1077 got a match of length 855 -
> 100% identity and no gaps. Bitscore was 1709 and E-value 0. This looks like
> a pretty good match, but it is not found in the final MAKER results
> (gff/fasta).
> > Why is this happening? Are there some cutoffs that are not satisfied? If
> so, what are they and how can they be configured?
> >
> > Thanks,
> > Lior
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fusion.png
Type: image/png
Size: 33185 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200430/a53d513e/attachment-0002.png>

From natassa_g_2000 at yahoo.com  Thu Apr  2 07:42:48 2020
From: natassa_g_2000 at yahoo.com (natassa)
Date: Thu, 2 Apr 2020 13:42:48 +0000 (UTC)
Subject: [maker-devel] Optimal strategy and options for iterative maker2 runs
In-Reply-To: <1b0e5c3cae1b410397e61262a2384039@umu.se>
References: <1b0e5c3cae1b410397e61262a2384039@umu.se>
Message-ID: <1875152020.930988.1585834968860@mail.yahoo.com>

Hello maker community, 
I am annotating a fungal genome with maker2, for which I have transcript evidence, plus transcripts and proteins from closely related species, a genemark .mod file from self-training that I ran outside of maker, and an augustus model from a closely related species. I plan to run it iteratively, updating the snap (and maybe augustus) models each time. Having read several iterative-maker pipelines online, I am a bit confused about the optimal strategy and about some details of the options used in consecutive runs. Some questions:

1) How will MAKER behave if I supply my different lines of evidence (EST+protein) along with trained abinitio models in the same run? Here is what seems to me conflicting info from posts I read (not on this list): "if est2genome and protein2genome are set to 1 + abinitio tools are also on, the abinitio tools will not use the EST-protein evidence to improve their gene models." but: "In case you activated SNAP and Augustus and you have fed MAKER with lines of evidence (Transcripts and proteins), it will predict gene models using Augustus-Evidence-driven and SNAP-Evidence-driven. In loci where both are present, it will chose the best one according to the lines of evidence (EST / protein when they are present)." Which one is correct?
2) I see in a few tutorials that genemark is trained in a 3rd/4th run and separately from the other abinitio programs. I don't understand why, since genemark is self-trained on the genome, so it does not really interact with training from evidence or maker gff files?
3) Can I pass >1 abinitio models from one run to the next using the pred_gff option? For example, augustus+genemark hmms, separated by ","? In a 2017 post, Carson writes "I would avoid passing in Augustus results as GFF3, it removes the ability of MAKER to dynamically provide Augustus with hints as it runs". What is the correct way then?

Any input from experienced maker users is welcome!
Thank you in advance, 
Anastasia Gioti

Anastasia Gioti
Researcher
IMBBC-HCMR Crete, Greece
https://scholar.google.com/citations?user=eMsnakoAAAAJ&hl=en

From shore at yorku.ca  Fri Apr  3 11:02:36 2020
From: shore at yorku.ca (shore at yorku.ca)
Date: Fri, 03 Apr 2020 13:02:36 -0400
Subject: [maker-devel] final annotation issues
Message-ID: <1585933356.5e876c2c023db@oldmymail.yorku.ca>

Dear Maker team,

 I believe we are at the final stage of annotation of a plant genome, having
previously trained snap over 3 rounds.

 In our attempts at final annotation we have now added new transcriptome data,
and generated a repeat library for our species (so we now mask with that, as
well as a database of plant repeats, and TE proteins).

 In this final annotation run, we've set keep_preds=1 and then plan to
screen the final gff file, retaining sequences with AED <= 0.5 (or
thereabouts) and ones that possess a pfam domain.

 I've compared some of the proteins obtained in our previous round of Maker with
the latest. Indeed, the masking appears to have removed some that were TEs. A
number of proteins differ somewhat, likely the result of different intron/exon
boundary calls, and some are quite different in length.
In particular, some are roughly twice the length in the previous annotation, and
appear to have been the correct size previously, based on online blasts.

It is this latter finding that I'm concerned about, and why it has occurred.

I did set single_exon=1 and wonder if that is causing this effect?

Thanks and sorry for the long-winded email.

Joel


-- 
Dr. Joel S. Shore
Prof. Biology
York University


From carsonhh at gmail.com  Fri Apr  3 14:51:47 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 3 Apr 2020 14:51:47 -0600
Subject: [maker-devel] guidance for first and subsequent annotation
 parameters
In-Reply-To: <CAEdk9jmg1rbuRPbCC4vNmNkTyN61H7XGQpUi4h6UD1czoFtH1g@mail.gmail.com>
References: <CAEdk9jmg1rbuRPbCC4vNmNkTyN61H7XGQpUi4h6UD1czoFtH1g@mail.gmail.com>
Message-ID: <81E8860F-A91B-4089-B179-4ED7EBAC36D3@gmail.com>

You may need to select a subset of gene models to drive training. I find that I get the best results when I use protein2genome models only from uniprot/swiss-prot alignments to generate a training set, with always_complete=1 set. Uniprot/swiss-prot is manually curated, so it is very high quality. Then I select models with the highest end-to-end completion (low AED). Also, if you add est_forward=1, the score column in the GFF3 will be the % match to the original model. It's an easy way to select only models with a very high percent match. Remove models without start codons and stop codons. You can relax these parameters if you don't have many models, but in general you want 100-300 models to train with. Only one round of training is needed with this type of training set. The EST method requires 2 rounds and I don't like it as much.

In some cases, model selection for training will be a mostly manual task. You can use editors like Apollo to identify models that match the evidence well and delete odd models. Then train on that result.
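
The low-AED selection described above could be sketched in Python roughly as follows. This is only an illustration, not part of MAKER: it assumes the mRNA lines carry an _AED attribute in column 9 as MAKER GFF3 output does, and the function names, the 0.1 cutoff and the sample lines are assumptions chosen for the example (the limit of 300 echoes the 100-300 models mentioned above).

```python
def parse_attrs(col9):
    """Turn a GFF3 attribute column (key=value;key=value) into a dict."""
    pairs = (item.split("=", 1) for item in col9.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}

def select_by_aed(gff_lines, max_aed=0.1, limit=300):
    """Return IDs of mRNA features with _AED <= max_aed, lowest AED first."""
    scored = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attrs(cols[8])
        if "_AED" not in attrs:
            continue
        aed = float(attrs["_AED"])
        if aed <= max_aed:
            scored.append((aed, attrs.get("ID", "")))
    scored.sort()
    return [mrna_id for _, mrna_id in scored[:limit]]

# Illustrative input only (hypothetical IDs and scores).
sample = [
    "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=m1;Parent=g1;_AED=0.05;_eAED=0.05",
    "ctg1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=m2;Parent=g2;_AED=0.30;_eAED=0.31",
    "ctg2\tmaker\tmRNA\t1\t500\t.\t+\t.\tID=m3;Parent=g3;_AED=0.00;_eAED=0.00",
]
selected = select_by_aed(sample)
print(selected)
```

The sketch does not check for start and stop codons; that filter would still have to be applied separately.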


What you are seeing is likely the result of over-training. It usually happens if you use more than 2 rounds of training, but can happen with just two rounds.

--Carson

 
> On Mar 20, 2020, at 5:30 AM, Devon O'Rourke <devon.orourke at gmail.com> wrote:
> 
> With so many posts on the forum it's been challenging to determine what the best practices are for performing multiple rounds of annotation with Maker.
> My first round used est, altest, and protein fasta files with a custom GFF repeat masked file. The resulting vertebrate genome produced 21,970 gene models with a mean length of about 9016 bp; the BUSCO score was C:66.0%[S:64.2%,D:1.8%],F:4.2%,M:29.8%,n:9226 (mammalia_odb10 set). Things seemed to be on the right track, so I set up the next Maker round using both SNAP and Augustus-trained information in the round2 maker_opts.ctl file. At the end of that second round, I noticed a marked decrease in BUSCO score (C:53.3%[S:51.0%,D:2.3%],F:11.6%,M:35.1%,n:9226), yet an increase in the number of gene models (28,646) and mean length (16266 bp). 
> 
> This got me to wondering if I was setting up the _opts.ctl file incorrectly? I'm concerned with a few things (and maybe missing even more I should be concerned about!?):
> I specified the evidence to come from EST/Protein instead of using the section available under "#-----Re-annotation Using MAKER Derived GFF3". Maybe that was a fundamental mistake? What is the expected change in behavior if I moved my round1 Maker output into that category instead of using the EST/Protein Homology evidence sections as I did below?
> I wasn't sure what to do with the RepeatMasking GFF files in Round2. The RepeatMasker GFF I included in Round1 consisted of just complex repeats (setting model_org=simple and softmask=1 to effectively only hard mask those complex areas for the initial alignments). But what should be used in Round2 - the output GFF of Round1, or the input GFF from Round1?
> Here's what I did for the Round2 maker_opts.ctl file:
> 
> #-----Genome (these are always required)
> genome=/scratch/dro49/myluwork/annotation/input_files/mylu_hic_rails_noMasks.fa
> organism_type=eukaryotic
> #-----EST Evidence (for best results provide a file for at least one)
> est_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.est2genome.gff
> altest_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.cdna2genome.gff
> #-----Protein Homology Evidence (for best results provide a file for at least one)
> protein_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.protein2genome.gff
> #-----Repeat Masking (leave values blank to skip repeat masking)
> rm_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.repeats.gff
> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
> #-----Gene Prediction
> snaphmm=/scratch/dro49/myluwork/annotation/maker_rd2/snap_rd1/lu_rnd1.zff.length50_aed0.25.hmm #SNAP HMM file
> augustus_species=mylu #Augustus gene prediction species model
> run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
> est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
> protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
> trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
> unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
> allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)
> 
> 
> Thank you for your insights and support,
> 
> Devon
> 
> -- 
> Devon O'Rourke
> Postdoctoral researcher, Northern Arizona University
> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/ <https://fozlab.weebly.com/>
> twitter: @thesciencedork


From carsonhh at gmail.com  Fri Apr  3 16:03:12 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 3 Apr 2020 16:03:12 -0600
Subject: [maker-devel] Problem with Maker using GeneMark
In-Reply-To: <5e838561.1c69fb81.27f9c.24c6SMTPIN_ADDED_MISSING@mx.google.com>
References: <5e838561.1c69fb81.27f9c.24c6SMTPIN_ADDED_MISSING@mx.google.com>
Message-ID: <E2FF8D2F-8B8F-456B-BC0E-1A1C099D05D3@gmail.com>

Could you try the attached version, and let me know if it resolves the issue (copy over the old one)? The probuild command I used is just one I stole from another GeneMark script, so I just borrowed the updated command from the SplitFasta subroutine in gmes_petap.pl.

--Carson


> On Mar 31, 2020, at 11:53 AM, Gagné, Patrick (NRCAN/RNCAN) <patrick.gagne at canada.ca> wrote:
> 
> Hi
>  
> I've come across a bug while using Maker. I'm trying to annotate a 560Mb genome and I'm using Snap, GeneMark and Augustus in Maker.
> When Maker executes the GeneMark command, it just fails (GeneMark Failed) without any error messages, so I decided to debug it myself. I launched every command manually and found out that gmhmm_wrap is causing the issue. The problem is in fact in the probuild command; it doesn't do anything (from what I understand, this command is supposed to split the fasta where there are NNN to prevent a GeneMark crash). My genome has very long stretches of N (up to 14Kb).
>  
> After checking the probuild help, I found that the command used in gmhmm_wrap is not valid (half the options are not in probuild anymore, probably because of GeneMark updates).
>  
> I have tried different probuild versions (those I could download from the GeneMark site; they don't give older versions except those that come with their program's versions):
> 2.16
> 2.34
> 2.44 (latest, which comes with GeneMark ES)
>  
> I've also tried to edit the gmhmm_wrap script and modify the probuild command, but even when the fasta files are split, I get another bug: ERROR: Logic error in getting offset. I've tried to replace the command for the offset extraction, which also worked, but now I get a bug when Maker tries to get the ab-initio output:
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Calling translate without a seq argument!
>  
> Could you please tell me how to fix this, or tell me what probuild I should use (I will ask the GeneMark support for it)
>  
> Thanks in advance
>  
> P.S 
> Sorry for my English, it's not my first language and I'm still learning
>  
> Patrick Gagné
> Spécialiste en bio-informatique / Bioinformatics specialist
> Service canadien des forêts / Canadian Forest Service
> Ressources naturelles Canada / Natural Resources Canada
> Gouvernement du Canada / Government of Canada
> Centre de foresterie des Laurentides/Laurentian Forestry Centre
> 1055, rue du P.E.P.S.
> C.P. 10380, succ. Sainte-Foy/P.O. Box 10380, Stn. Sainte-Foy
> Québec (Qc) G1V 4C7
> Laboratoire de pathologie forestière (Local 2.21)
> patrick.gagne at canada.ca <mailto:patrick.gagne at canada.ca> / tel : (418) 648-4443
>  
>  
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmhmm_wrap
Type: application/octet-stream
Size: 9027 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200403/7766c8a5/attachment-0003.obj>

From carsonhh at gmail.com  Sat Apr  4 14:09:05 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 4 Apr 2020 14:09:05 -0600
Subject: [maker-devel] repeatmasker output gff
In-Reply-To: <CA+un12sUfaxK_ycgves9cohrdFjOyg7Nt9WUzGSK6Yz-RbdgMQ@mail.gmail.com>
References: <CA+un12sUfaxK_ycgves9cohrdFjOyg7Nt9WUzGSK6Yz-RbdgMQ@mail.gmail.com>
Message-ID: <08952604-1F7A-4FC5-9F59-DB79665A324D@gmail.com>

It needs to be a two level feature. match/match_part is one example, others will work as long as when it is assembled it is two levels.
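
As an illustration of the two-level layout, here is a hedged Python sketch that rewrites single-level "similarity" lines as match/match_part pairs. It assumes the input is tab-separated with nine columns and a Target "Motif:NAME" attribute, as in typical RepeatMasker GFF output. The ID scheme (rm1, rm2, ...), the source label and the function name are made up for the example, and the sketch does not escape special characters in repeat names.

```python
import re

def to_two_level(gff_lines, source="repeatmasker"):
    """Rewrite 'similarity' lines as parent match + child match_part lines."""
    out = []
    n = 0
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "similarity":
            continue
        n += 1
        seqid, _, _, start, end, score, strand, phase, attrs = cols
        motif = re.search(r'Motif:([^"\s]+)', attrs)
        name = motif.group(1) if motif else "repeat"
        parent_id = f"rm{n}"  # hypothetical ID scheme
        out.append("\t".join([seqid, source, "match", start, end, score,
                              strand, phase, f"ID={parent_id};Name={name}"]))
        out.append("\t".join([seqid, source, "match_part", start, end, score,
                              strand, phase,
                              f"ID={parent_id}:1;Parent={parent_id}"]))
    return out

# Illustrative input only (hypothetical coordinates and repeat name).
sample = ['ctg1\tRepeatMasker\tsimilarity\t100\t200\t12.5\t+\t.\tTarget "Motif:AluY" 1 101']
converted = to_two_level(sample)
print("\n".join(converted))
```

Each input line becomes one parent and one child covering the same interval, which is the simplest way to get a two-level feature out of a one-level one.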

MAKER saves its state as it runs, so you can restart it at any time without losing progress.

--Carson


> On Mar 25, 2020, at 2:38 PM, Homa Papoli <hpapoli at gmail.com> wrote:
> 
> Hello,
> 
> I have 2 questions regarding the use of maker:
> 
> I have used repeatmasker for my genome separately and I have a gff file. However, my gff file, in the third column, has the word "similarity". In a workshop I had taken on genome annotation, it was said that the gff for maker should have "match" and "match_part" for the third column. I was wondering whether I could use the original gff output of repeatmasker or should I make any changes to it?
> 
> Another question is about running maker. Since maker takes several days to run, if the job gets interrupted due to limit in days of running the job, I was wondering whether it is possible to re-start maker from where it got interrupted?
> 
> Thank you,
> Homa


From carsonhh at gmail.com  Sat Apr  4 14:15:21 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 4 Apr 2020 14:15:21 -0600
Subject: [maker-devel] Maker annotation  AED scores are around 0.5
In-Reply-To: <1b0e5c3cae1b410397e61262a2384039@umu.se>
References: <1b0e5c3cae1b410397e61262a2384039@umu.se>
Message-ID: <E2AAAE35-7B24-46EE-B77F-9E4BD584CC45@gmail.com>

Probably this -->

https://groups.google.com/forum/#!searchin/maker-devel/Curious$20pattern$20in$20AED$20distributions%7Csort:date/maker-devel/QS3VnxhvEks/q3lPmywjBQAJ <https://groups.google.com/forum/#!searchin/maker-devel/Curious$20pattern$20in$20AED$20distributions|sort:date/maker-devel/QS3VnxhvEks/q3lPmywjBQAJ>


Likely caused by an over abundance of single-exon models and under masking of repeats in the genome.
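
To look at the shape of the AED distribution directly, the scores can be binned from the GFF3. Below is a minimal sketch in Python, assuming the mRNA lines carry an _AED attribute in column 9; the function name, the bin count and the sample lines are illustrative assumptions.

```python
import re

# Match _AED but not _eAED; the attribute must start the column or follow ';'.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")

def aed_histogram(gff_lines, bins=10):
    """Count mRNA features per AED bin; the last bin also holds AED == 1."""
    counts = [0] * bins
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        found = AED_RE.search(cols[8])
        if not found:
            continue
        aed = float(found.group(1))
        counts[min(int(aed * bins), bins - 1)] += 1
    return counts

# Illustrative input only (hypothetical IDs and scores).
sample = [
    "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=m1;_AED=0.05;_eAED=0.90",
    "ctg1\tmaker\tmRNA\t2000\t2900\t.\t+\t.\tID=m2;_AED=0.52;_eAED=0.52",
    "ctg1\tmaker\tmRNA\t4000\t4900\t.\t+\t.\tID=m3;_AED=0.48;_eAED=0.48",
    "ctg1\tmaker\tmRNA\t6000\t6900\t.\t+\t.\tID=m4;_AED=1.00;_eAED=1.00",
]
hist = aed_histogram(sample)
for i, n in enumerate(hist):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}\t{n}")
```

Running the same binning on single-exon and multi-exon models separately would be one way to check whether single-exon models account for the peak.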

--Carson


> On Mar 30, 2020, at 3:37 AM, Wei Zhao <zhao.wei at umu.se> wrote:
> 
> Dear maker team,
>  
> I am writing to ask for your help.
>  
> I am using maker to annotate a big genome (~9 Gbp). I have 3 lines of evidence: 1) the transcriptome of this species; 2) protein sequences from related species; 3) an Augustus model trained from pasa.
>  
> When I use all 3 lines of evidence to annotate the genome (basic pipeline), the distribution of AED scores is weird (a single peak around 0.5).
>  
> I have also tried to update the gene models I got from pasa using maker; the distribution of AED scores is the same.
>  
> But when I use only ESTs or proteins as evidence (est2genome or protein2genome), the AED scores are normal (close to 0).
>  
> To my understanding, it seems the 3 lines of evidence conflict with each other, resulting in AED scores that are higher (~0.5) than expected. Could you give me some suggestions on how to fix this problem?
>  
> Best regards,
>  
> Wei
>  
>  
> <E6F3EF742C40408F8390EE9A1FF29894.png>
>  

From danis_theo at hotmail.com  Thu Apr  2 12:24:05 2020
From: danis_theo at hotmail.com (Thodoris Danis)
Date: Thu, 2 Apr 2020 18:24:05 +0000
Subject: [maker-devel] Question about re-annotation
Message-ID: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>

Hello maker community,


I am annotating a fish genome with Maker 2.31.10, for which I have transcript evidence and proteins from closely related species. I am running maker in several passes and passing the gff output to the next pass every time here (
"#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #re-annotate genome based on this gff3 file",
). Is this wrong? And what is the difference from running maker from scratch without re-supplying the new gff every time?

Secondly, in the second round, after training the predictors, do I have to keep est2genome=1, protein2genome=1, and the EST and protein evidence or not?
How should all these parameters be switched?

Any input from experienced maker users is welcome
Thank you for your help


Thodoris Danis

From carsonhh at gmail.com  Sun Apr  5 16:19:26 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 5 Apr 2020 16:19:26 -0600
Subject: [maker-devel] Question about re-annotation
In-Reply-To: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>
References: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>
Message-ID: <9B863735-2D4D-4FD2-A8B1-3C542F3D767A@gmail.com>

If you are running several times, just rerun in the same directory after altering settings. MAKER will reuse old raw data reports as appropriate. The maker_gff option is really just for reannotating from an old maker run where you no longer have the raw files available.

--Carson


> On Apr 2, 2020, at 12:24 PM, Thodoris Danis <danis_theo at hotmail.com> wrote:
> 
> Hello maker community,
> 
> 
> I am annotating a fish genome with Maker 2.31.10, for which I have transcript evidence and proteins from closely related species. I am running maker in several passes and pass the gff output to the next pass every time here (
>> 
>> "#-----Re-annotation Using MAKER Derived GFF3
>> maker_gff= #re-annotate genome based on this gff3 file",
> ), is this wrong? and where is the difference running the maker from scratch without re-supplying every time the new gff?
> 
> Secondly,  in the second round after predictors training, have I to keep est2genome=1 , protein2genome=1, ests and proteins evidence or not?
> How do we switch all these parameters?
> 
> Any input from experienced maker users is welcome
> Thank you for your help
> 
> 
> Thodoris Danis

From carsonhh at gmail.com  Tue Apr  7 11:42:08 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 7 Apr 2020 11:42:08 -0600
Subject: [maker-devel] Maker 2.31.10: maker_functional_gff and
 maker_functional_fasta not parsing correctly,
 Can't use string ("") as a HASH ref while "strict refs" in use
In-Reply-To: <4A449297-A6D1-4A75-9547-FB5F70CE1A0A@ulaval.ca>
References: <4A449297-A6D1-4A75-9547-FB5F70CE1A0A@ulaval.ca>
Message-ID: <D775DED1-F3A5-478B-B7BE-8F318CFEADA3@gmail.com>

Thanks, I'll update the related scripts. In my tests the old regular expression still works, but it ends up adding the OX= tag as part of the GFF3 entry rather than throwing a hash ref error. So you may still have another issue if you are getting a hash ref error.
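
The difference between the two patterns can be seen on a single header line. The sketch below ports both regular expressions from the quoted message to Python's re module (the patterns themselves are unchanged); the header is just an example in the current UniProt FASTA format, and the variable names are illustrative.

```python
import re

# Patterns copied from the quoted message, written as Python raw strings.
OLD = re.compile(r">(\S+)\s+(.*?)\s+OS=(.*?)\s+(GN=(.*?)\s+)?PE=")
NEW = re.compile(
    r">sp\|(\S+)\|\S+\s+(.*?)\s+OS=(.*?)\s+OX=\S+\s+(GN=(.*?)\s+)?PE="
)

# Example header in the UniProt format that includes an OX= field.
header = (">sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R "
          "OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1")

old = OLD.match(header)
new = NEW.match(header)

# Old pattern: the organism capture swallows the OX= field.
print(old.group(1), "|", old.group(3))
# New pattern: accession and organism are captured cleanly.
print(new.group(1), "|", new.group(3))
```

Note that the new pattern only matches headers that start with >sp|, so any header with a different prefix would fall through the if and would need its own handling.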

--Carson


> On Mar 14, 2020, at 11:24 AM, Christopher Keeling <christopher.keeling.1 at ulaval.ca> wrote:
> 
> Hello,
> 
> In sub parse_blast{, during parsing of uniprot fasta file:
> 
> if (/>(\S+)\s+(.*?)\s+OS=(.*?)\s+(GN=(.*?)\s+)?PE=/) {
> 
> should be changed to:
> 
> if (/>sp\|(\S+)\|\S+\s+(.*?)\s+OS=(.*?)\s+OX=\S+\s+(GN=(.*?)\s+)?PE=/) {
> 
> to avoid "Can't use string ("") as a HASH ref while "strict refs" in use at..." errors.
> 
> For UniProt release 2020_01: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz <ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz>
> 
> Cheers,
> Chris
> 
> 
> --
> Christopher I. Keeling
> Chercheur scientifique en génomique forestière / Research Scientist in Forest Genomics
> 
> Ressources naturelles Canada / Natural Resources Canada
> Service canadien des forêts / Canadian Forest Service
> Centre de foresterie des Laurentides / Laurentian Forestry Centre
> 1055, rue du PEPS Québec, QC G1V 4C7 Canada
> https://cfs.nrcan.gc.ca/employees/read/ckeeling <https://cfs.nrcan.gc.ca/employees/read/ckeeling>
> 
> Professeur associé
> Département de biochimie, de microbiologie et de bio-informatique
> Université Laval
> https://www.researchgate.net/profile/Christopher_Keeling <https://www.researchgate.net/profile/Christopher_Keeling>
> https://scholar.google.ca/citations?user=KtGr86UAAAAJ <https://scholar.google.ca/citations?user=KtGr86UAAAAJ>
>  
> 

From andrei.kiselev at lrsv.ups-tlse.fr  Fri Apr 10 10:33:57 2020
From: andrei.kiselev at lrsv.ups-tlse.fr (andrei.kiselev at lrsv.ups-tlse.fr)
Date: Fri, 10 Apr 2020 16:33:57 +0000
Subject: [maker-devel] New assembly annotation
Message-ID: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>

Hello.
I have recently obtained a new PacBio genome assembly of the oomycete Aphanomyces.
I used MAKER in the manner described here: https://groups.google.com/forum/#!searchin/maker-devel/new$20assembly%7Csort:date/maker-devel/Xo5YbWgNwFw/KstkmXYYAgAJ

After the first run, the number of transcripts was slightly higher than in the gff file of the previous genome version. Then I ran MAKER a second time with the new gff file in the pred_gff option, plus augustus trained for my species. As a result, I got only half of the transcripts from the initial gff.

Is there something that I could have overlooked when running MAKER? The control file of the last run is attached.

Thank you in advance.
Andrei
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4984 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200410/54d75f1c/attachment-0003.obj>

From liorglic at mail.tau.ac.il  Mon Apr 13 08:12:42 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Mon, 13 Apr 2020 17:12:42 +0300
Subject: [maker-devel] Annotating a fragmented assembly
Message-ID: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>

Hello there,

I am working on creating plant pan genomes. This means that I produce many
assemblies for samples of the same species from NGS data available from SRA
and then annotate them with MAKER, based on a collection of relevant
evidence (transcripts and proteins).
As you might imagine, data quality is variable, so I sometimes create
assemblies from >x20 sequencing depth, resulting in fragmented assemblies
(say N50 in the range of 5-10kb).
Annotation results of such genomes usually contain many partial genes,
broken across contigs, so in many cases I get two proteins, representing
the 3' and 5' parts of a broken gene. In other cases, only one part of the
gene is detected.
I've also found that applying reference-based scaffolding (I use RaGOO) to
generate pseudomolecules improves results by bringing together contigs
containing gene parts and allowing MAKER to create full annotations.
However, this also results in new erroneous predictions, spanning two
contigs that are not actually adjacent in the genome but were brought
together by the scaffolding process.
I suspect this has to do with the number of 'N' characters introduced as
padding between ordered contigs, so one thing I wanted to ask about is how
MAKER reacts to N's in the middle of a gene. Does it affect gene prediction?
I would also appreciate any advice on how to annotate fragmented genomes
and comments about the strategy I described above. Please note that I am
not expecting a reference-level annotation, but am simply trying to reduce
noise levels towards downstream comparative analyses.

Thanks a lot and best regards,
Lior

From xpeng at ucsb.edu  Tue Apr 14 11:40:15 2020
From: xpeng at ucsb.edu (xpeng at ucsb.edu)
Date: Tue, 14 Apr 2020 10:40:15 -0700
Subject: [maker-devel] Can install but Cannot Run MAKER
Message-ID: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>

Dear Yandell Lab,

I am writing to get a bit of help getting MAKER to work.

I downloaded MAKER v3.01.03 and followed the instructions on your wiki
page to install it, both on my local computer (as sudo) and on PSC Bridges
(with MPI).

The installation seemed to complete successfully.

However, when I ran "maker -h" I received error messages (attached) that I
don't know how to resolve. Could you please advise a solution?

Thank you!

Nick (Xuefeng Peng)

Postdoctoral Scholar
University of California
Santa Barbara, CA

-------------- next part --------------
$ ./maker -h
Possible precedence issue with control flow operator at /usr/share/perl5/Bio/DB/IndexedBase.pm line 845.
syntax error at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 105, near ")

	if"
syntax error at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 106, near "]{strand"
Global symbol "$strand" requires explicit package name (did you forget to declare "my $strand"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 106.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$strand" requires explicit package name (did you forget to declare "my $strand"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 109.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$strand" requires explicit package name (did you forget to declare "my $strand"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 113.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Execution of /home/csp/software/maker/bin/../lib/Widget/trnascan.pm aborted due to compilation errors.
Compilation failed in require at /home/csp/software/maker/bin/../lib/GI.pm line 40.
BEGIN failed--compilation aborted at /home/csp/software/maker/bin/../lib/GI.pm line 40.
Compilation failed in require at ./maker line 46.
BEGIN failed--compilation aborted at ./maker line 46.
-------------- next part --------------
$ maker -h
syntax error at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 105, near ")

	if"
syntax error at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 106, near "]{strand"
Global symbol "$strand" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 106.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$strand" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 109.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$strand" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 113.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
syntax error at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 122, near "}"
/pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm has too many errors.
Compilation failed in require at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/GI.pm line 40.
BEGIN failed--compilation aborted at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/GI.pm line 40.
Compilation failed in require at /pylon5/bi5618p/hmahzpxf/software/maker/bin/maker line 46.
BEGIN failed--compilation aborted at /pylon5/bi5618p/hmahzpxf/software/maker/bin/maker line 46.

From carsonhh at gmail.com  Tue Apr 14 12:11:16 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 14 Apr 2020 12:11:16 -0600
Subject: [maker-devel] Can install but Cannot Run MAKER
In-Reply-To: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>
References: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>
Message-ID: <94E18556-771A-485C-B534-80B52BC7D586@gmail.com>

Please download MAKER again and reinstall. Your error report helped me find the issue in the new install package.

--Carson


> On Apr 14, 2020, at 11:40 AM, <xpeng at ucsb.edu> wrote:
> 
> Dear Yandell Lab,
>  
> I am writing to get a bit of help on making MAKER to work.
>  
> I downloaded the v3.01.03 maker and followed the instructions on your wiki page to install, both on my local computer as sudo and on PSC Bridges (with MPI). 
>  
> The installation seemed to have completed successfully.
>  
> However, when I ran "maker -h" I received error messages (attached) that I don't know what to do about. Could you please advise a solution?
>  
> Thank you!
>  
> Nick (Xuefeng Peng)
>  
> Postdoctoral Scholar
> University of California
> Santa Barbara, CA
> <Error_Message_Ubuntu_19.10.txt><Error_Message_PSC_Bridges.txt>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From liorglic at mail.tau.ac.il  Tue Apr 21 07:08:40 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Tue, 21 Apr 2020 16:08:40 +0300
Subject: [maker-devel] Missing genes in lift-over with est2genome
Message-ID: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>

Hello,
I am using MAKER to annotate a plant genome assembly. A high-quality
reference genome and annotation exists for another variety of the same
species, so my first step is lifting over reference genes to my genome. I
do this by setting est2genome = 1 and providing MAKER with the reference
cDNA (transcriptome). No other evidence is provided and no prediction is
performed. Repeat masking is done using the reference repeats library.
When checking the results, I found that many reference genes are missing from
the lift-over result. However, if I BLAST the sequences of these genes
myself, I get good matches. I even see these matches when I look at the
BLAST results buried in the MAKER data_store.
For example, a transcript of length 1077 got a match of length 855 - 100%
identity and no gaps. Bitscore was 1709 and E-value 0. This looks like a
pretty good match, but it is not found in the final MAKER results
(gff/fasta).
Why is this happening? Are there some cutoffs that are not satisfied? If
so, what are they and how can they be configured?

Thanks,
Lior

From carsonhh at gmail.com  Thu Apr 23 11:38:54 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:38:54 -0600
Subject: [maker-devel] Annotating a fragmented assembly
In-Reply-To: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>
References: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>
Message-ID: <C9C6F924-D27C-498A-81B8-B051C25CDB27@gmail.com>

N's are handled by the gene predictors themselves. I know Augustus can span N's within introns. I'm not sure how many N's will cause it to split the gene. It may be a function of the expected intron length in the HMM, in which case models for organisms with large introns could tolerate more N's. GeneMark will split genes on even short runs of N's. I'm not sure about SNAP. For BLAST alignments, gap extensions decrease the score, so how long the gap can be depends on the score of the initial seeding alignment. The larger the initial score, the longer the gap can be before the score drops below the termination threshold.
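
Because the predictors differ in how they treat gaps, it can help to know where the N runs in a scaffolded assembly are, and how long they are, before interpreting gene models that span them. The following is a minimal Python sketch, not part of MAKER. The function names read_fasta and n_runs and the demo sequence are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence_id: sequence}."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence ID.
            current = (line[1:].split() or [""])[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def n_runs(sequence: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return (0-based start, length) for every run of N/n of at least min_len."""
    return [
        (m.start(), m.end() - m.start())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


if __name__ == "__main__":
    # Hypothetical two-line scaffold with one long and one short gap.
    demo = [">scaffold_1 demo", "ACGTNNNNNNNNNNACGT", "ACGTnnACGT"]
    for name, seq in read_fasta(demo).items():
        for start, length in n_runs(seq):
            print(f"{name}\t{start}\t{length}")
```

Comparing the reported gap positions with the coordinates of suspicious gene models is one way to see whether a model crosses a scaffolding gap; what gap length each predictor tolerates still has to be determined empirically.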

--Carson


> On Apr 13, 2020, at 8:12 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> 
> Hello there,
> 
> I am working on creating plant pan genomes. This means that I produce many assemblies for samples of the same species from NGS data available from SRA and then annotate them with MAKER, based on a collection of relevant evidence (transcripts and proteins).
> As you might imagine, data quality is variable, so I sometimes create assemblies from >20x sequencing depth, resulting in fragmented assemblies (say N50 in the range of 5-10 kb).
> Annotation results of such genomes usually contain many partial genes, broken across contigs, so in many cases I get two proteins, representing the 3' and 5' parts of a broken gene. In other cases, only one part of the gene is detected.
> I've also found that applying reference-based scaffolding (I use RaGOO) to generate pseudomolecules improves results by bringing together contigs containing gene parts and allowing MAKER to create full annotation. However, this also results in new erroneous predictions, spanning two contigs that are not actually adjacent in the genome but were brought together by the scaffolding process.
> I suspect this has to do with the number of 'N' characters introduced as padding between ordered contigs, so one thing I wanted to ask about is how MAKER reacts to N's in the middle of a gene. Does it affect gene prediction?
> I would also appreciate any advice on how to annotate fragmented genomes and comments about the strategy I described above. Please note that I am not expecting a reference-level annotation, but am simply trying to reduce noise levels towards downstream comparative analyses.
> 
> Thanks a lot and best regards,
> Lior
> 


From carsonhh at gmail.com  Thu Apr 23 11:43:30 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:43:30 -0600
Subject: [maker-devel] Missing genes in lift-over with est2genome
In-Reply-To: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
References: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
Message-ID: <373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>

There are percent cutoffs for the est2genome algorithm that you can set in the maker_bopts.ctl file. Additionally, MAKER will report the alignment but not produce a gene model if it can't translate through the est2genome alignment (i.e. stop codons in the assembly). I believe the cutoff is 50%. If you add est_forward=1 to the maker_opts.ctl file, names will be copied from the alignment source and the score column in the GFF3 will be the percent match to the original transcript.
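
To see how a percent cutoff can reject an otherwise perfect hit, the numbers from the original question (855 aligned bases of a 1077-base transcript, 100% identity) can be worked through by hand. The Python sketch below is only an illustration and not MAKER's actual filtering code. The function names are hypothetical, and the 0.8 coverage and 0.85 identity thresholds are assumptions for the example; the values in effect are whatever your own maker_bopts.ctl specifies.

```python
def alignment_coverage(aligned_len: int, query_len: int) -> float:
    """Fraction of the query transcript covered by the alignment."""
    if query_len <= 0:
        raise ValueError("query_len must be positive")
    return aligned_len / query_len


def passes_cutoffs(
    aligned_len: int,
    query_len: int,
    identity: float,
    min_coverage: float,
    min_identity: float,
) -> bool:
    """True if the hit meets both the coverage and the identity threshold."""
    coverage = alignment_coverage(aligned_len, query_len)
    return coverage >= min_coverage and identity >= min_identity


if __name__ == "__main__":
    # Values from the question: 855 of 1077 bases aligned at 100% identity.
    cov = alignment_coverage(855, 1077)
    print(f"coverage = {cov:.3f}")
    # Thresholds below are assumed for illustration only.
    print(passes_cutoffs(855, 1077, identity=1.0,
                         min_coverage=0.8, min_identity=0.85))
```

Under these assumed thresholds the hit covers a little under 80% of the transcript, so it would fall just short of a 0.8 coverage cutoff despite its perfect identity. Lowering the coverage threshold, or checking whether the remaining bases lie on another contig, would be the next thing to examine.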

--Carson


> On Apr 21, 2020, at 7:08 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> 
> Hello,
> I am using MAKER to annotate a plant genome assembly. A high-quality reference genome and annotation exists for another variety of the same species, so my first step is lifting over reference genes to my genome. I do this by setting est2genome = 1 and providing MAKER with the reference cDNA (transcriptome). No other evidence is provided and no prediction is performed. Repeat masking is done using the reference repeats library.
> When checking the results, I found out lots of reference genes missing from the lift-over result. However, if I blast the sequences of these genes myself, I get good matches. I even see these matches when I look at the blast results buried in the MAKER data_store.
> For example, a transcript of length 1077 got a match of length 855 - 100% identity and no gaps. Bitscore was 1709 and E-value 0. This looks like a pretty good match, but it is not found in the final MAKER results (gff/fasta).
> Why is this happening? Are there some cutoffs that are not satisfied? If so, what are they and how can they be configured?
> 
> Thanks,
> Lior


From carsonhh at gmail.com  Thu Apr 23 11:53:27 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:53:27 -0600
Subject: [maker-devel] New assembly annotation
In-Reply-To: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>
References: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>
Message-ID: <DFFD73D7-8379-467B-9992-FDDBAE230802@gmail.com>

Fewer transcripts can mean fewer split and spurious genes. It can also mean bad merges caused by overtraining. Use BUSCO to evaluate the completeness of the gene models rather than relying on transcript count. Also review models visually using something like Apollo. You will be able to see whether models span distinct evidence clusters or whether they were previously split within evidence clusters. That will help you judge whether the models now follow the evidence alignments better.

--Carson


> On Apr 10, 2020, at 10:33 AM, andrei.kiselev at lrsv.ups-tlse.fr wrote:
> 
> Hello.
> I have recently obtained a new PacBio genome assembly of the oomycete Aphanomyces.
> I used MAKER in the manner described here: https://groups.google.com/forum/#!searchin/maker-devel/new$20assembly%7Csort:date/maker-devel/Xo5YbWgNwFw/KstkmXYYAgAJ
> 
> After the first run I got a number of transcripts slightly higher than in the gff file of the previous version of the genome. Then I ran MAKER a second time with the new gff file in the pred_gff option, plus Augustus trained for my species. As a result I got only half of the transcripts from the initial gff.
> 
> Is there something I could have overlooked when running MAKER? Attached is the control file of the last run.
> 
> Thank you in advance.
> Andrei
> <maker_opts.ctl>


From carsonhh at gmail.com  Thu Apr 23 11:57:23 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:57:23 -0600
Subject: [maker-devel] final annotation issues
In-Reply-To: <1585933356.5e876c2c023db@oldmymail.yorku.ca>
References: <1585933356.5e876c2c023db@oldmymail.yorku.ca>
Message-ID: <D56728B7-B822-4EF4-AF75-7EE76C6D6908@gmail.com>

I would not recommend single_exon=1 unless this is an organism where you expect a lot of single-exon genes (typically fungi or oomycetes). It's best to review models visually in something like Apollo to see how evidence alignments compare to gene predictions. There is always the chance that you have some overmasking that could trim some regions you don't want to lose.

--Carson


> On Apr 3, 2020, at 11:02 AM, shore at yorku.ca wrote:
> 
> Dear Maker team,
> 
> I believe we are at the final stage of annotation of a plant genome, having
> previously trained SNAP over 3 rounds.
> 
> In our attempts at final annotation we have now added new transcriptome data
> and generated a repeat library for our species (so we now mask with that, as
> well as a database of plant repeats and TE proteins).
> 
> In this final annotation run, we've set keep_preds=1 and then plan to
> screen the final gff file, retaining sequences with AED <= 0.5 (or
> thereabouts) and ones that possess a Pfam domain.
> 
> I've compared some of the proteins obtained in our previous round of MAKER with
> the latest. Indeed, the masking appears to have removed some that were TEs. A
> number of proteins differ somewhat, likely the result of different intron/exon
> boundary calls, and some are quite different in length.
> In particular, some were roughly twice as long in the previous annotation, and
> appeared to be of the correct size there, based upon online BLAST searches.
> 
> It is this latter finding, and why it has occurred, that I'm concerned about.
> 
> I did set single_exon=1 and wonder if that is causing this effect?
> 
> Thanks and sorry for the long-winded email.
> 
> Joel
> 
> 
> 
> -- 
> Dr. Joel S. Shore
> Prof. Biology
> York University
> 


From guerrer at uni-duesseldorf.de  Fri Apr 24 08:27:24 2020
From: guerrer at uni-duesseldorf.de (Ricardo Nuno Ferreira Martins Guerreiro)
Date: Fri, 24 Apr 2020 16:27:24 +0200
Subject: [maker-devel] Maker 0 genes after SNAP or with proteins.gff
Message-ID: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>

Dear Makers list,


I am struggling with MAKER after many successful attempts. I don't
understand why, but my final .gff does not contain any genes (0).

I first run evidence-based modelling with proteins only. Here
I get around 40 thousand genes if I give the proteins as a fasta to
align (if I provide a protein.gff from a previous MAKER attempt, I get 0
genes, the same problem).

Afterwards I create a SNAP HMM and run MAKER again, setting
protein2genome=0 and snaphmm=snap.hmm as you say, but now I get 0
genes. This happens whether I keep providing proteins as a fasta or as a
.gff from a previous run.

I have done this many times and it always worked. The only difference
now is that I am using no ESTs whatsoever, only proteins. It's also
strange that it works on the first round of MAKER but doesn't work on
the SNAP rounds.


Hope you can help,
Ricardo
-------------- next part --------------
#-----Genome (these are always required)
genome=/gpfs/project/projects/qggp/C34_PS/experiments/annotation/maker/b_tournefortii/b_tournefortii.fasta #genome sequence (fasta file or fasta embedded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closely related species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/gpfs/project/projects/qggp/C34_PS/data/proteins/all_prots95.fasta  #protein sequence file in fasta format (i.e. from multiple organisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib=/gpfs/project/projects/qggp/C34_PS/experiments/annotation/maker/b_tournefortii/allRepeats.lib.noProtFinal #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm=snap2/snap2.hmm
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
snoscan_meth= #-O-methylation site file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap=0 #allowed gene overlap fraction (value from 0 to 1, blank for default)

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
min_intron=20 #minimum intron length (used for alignment polishing)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon' is enabled
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files
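
When a run yields no genes, a quick first check is to list which options in the control file can actually produce gene models. The Python sketch below is not part of MAKER. The helper names parse_ctl and active_gene_sources are hypothetical, and the option names it inspects are taken from the control file above. It assumes the key=value #comment layout shown there and that values contain no '#' character.

```python
from typing import Dict, Iterable, List


def parse_ctl(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'key=value #comment' lines into a dict, skipping headers and blanks."""
    options: Dict[str, str] = {}
    for raw in lines:
        # Drop everything after the first '#' (assumes values contain no '#').
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def active_gene_sources(options: Dict[str, str]) -> List[str]:
    """List the options that are set and could yield gene models or predictions."""
    file_opts = ("snaphmm", "gmhmm", "augustus_species",
                 "fgenesh_par_file", "pred_gff", "model_gff")
    flag_opts = ("est2genome", "protein2genome")
    active = [key for key in file_opts if options.get(key)]
    active += [key for key in flag_opts if options.get(key) == "1"]
    return active


if __name__ == "__main__":
    # A few lines copied from the control file above.
    demo = [
        "#-----Gene Prediction",
        "snaphmm=snap2/snap2.hmm",
        "gmhmm= #GeneMark HMM file",
        "est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no",
        "protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no",
    ]
    print(active_gene_sources(parse_ctl(demo)))
```

For the control file above, the only active source would be the SNAP HMM, so under this reading the quality of that one HMM file decides whether any genes are produced.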

From taosheng.x at gmail.com  Sun Apr 26 00:58:47 2020
From: taosheng.x at gmail.com (Xu, taosheng)
Date: Sun, 26 Apr 2020 14:58:47 +0800
Subject: [maker-devel] Problems with openMPI in multiple computing nodes
Message-ID: <CALJhmFr9Q741vwAZHHH9-pV-PAjfCPRKi-2B0kLx8r0HVHWYOA@mail.gmail.com>

Hello,
I am using a computer cluster with 20 nodes (40 CPUs per node) for
gene annotation. When I submit my MAKER job to a single node with 40 CPUs
using OpenMPI, everything works.
But when I submit the same job to multiple nodes (120 CPUs), I get the
errors shown below.
I would appreciate any advice. Thank you.

Best regards,
Taosheng


STATUS: Processing and indexing input FASTA files...
cannot remove directory for home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: No such file or directory at /maker/bin/../lib/FastaDB.pm line 145.
cannot remove directory for /home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: Directory not empty at /maker/bin/../lib/FastaDB.pm line 145.
cannot remove directory for /home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: Directory not empty at /maker/bin/../lib/FastaDB.pm line 145.

From xvazquezc at gmail.com  Sun Apr 26 20:15:53 2020
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Mon, 27 Apr 2020 12:15:53 +1000
Subject: [maker-devel] Maker 0 genes after SNAP or with proteins.gff
In-Reply-To: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>
References: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>
Message-ID: <CAL0hg4GUdbQMxN1j5KBQ6JymQSzT_tSbE19fwvEAg6+3_GmXMw@mail.gmail.com>

Hi Ricardo,
it is likely that you are not providing enough evidence to train SNAP (or
even none at all). When you run maker2zff, the defaults may not give any
output if you don't have any ESTs at all. Check maker2zff -h for the
evidence filtering options used to create the model. In the worst case,
you'll need to run maker2zff -n, which doesn't filter the evidence at all.
I also suggest searching the mailing list for this, as it has come up many
times.
Cheers,
Xabi

On Sat, 25 Apr 2020 at 02:46, Ricardo Nuno Ferreira Martins Guerreiro <
guerrer at uni-duesseldorf.de> wrote:

> Dear Makers list,
>
>
> I am struggling with Maker after many successful attempts. I don't
> understand why but my final .gff does not contain any genes, 0.
>
> I am running first an Evidence based modelling, with proteins only. Here
> I get around 40 thousand genes if I give the proteins as a fasta to
> align (if I provide a protein.gff from a previous maker try, I get 0
> genes, same problem).
>
> Afterwards I'm creating a SNAP hmm and running maker again, turning
> protein2genome=0 and snaphmm=snap.hmm as you say, but now I have 0
> genes. This happens either I keep providing proteins as a fasta or as
> .gff of a previous run.
>
> I have done this many times and it always worked. The only difference
> now is that I am using no ESTs whatsoever, only proteins. It's also
> strange that it works on the first round of maker but doesn't work on
> the SNAP rounds.
>
>
> Hope you can help,
> Ricardo
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From liorglic at mail.tau.ac.il  Thu Apr 30 06:58:17 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Thu, 30 Apr 2020 15:58:17 +0300
Subject: [maker-devel] Missing genes in lift-over with est2genome
In-Reply-To: <373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>
References: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
	<373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>
Message-ID: <CAOzMDPyLSPa33x31R_2d+bKhDN2d6+aFK+mQn5C7xJd9Tq56yg@mail.gmail.com>

Thanks Carson - your answer was very helpful.
Another question related to the lift-over process, if I may.
I want to take the resulting gff and pass it on to another MAKER run, where
I provide further, lower confidence evidence (ESTs and proteins). I'm not
sure which option to use though. According to this helpful post
<https://computationalbiologysite.wordpress.com/2013/07/11/maker-gff-cite-online/>,
I tried using pred_gff and model_gff, but both created cases of fusion
genes when genes are very adjacent to one another (see attached picture),
even with the correct_est_fusion parameter enabled. It looks like the only
way to take lifted-over genes "as-is" would be to use other_gff, but I
figure that this was not really intended for genes. Would you recommend
this usage? Am I missing something?
Thank you!

On Thu, 23 Apr 2020 at 20:43, Carson Holt <carsonhh at gmail.com> wrote:

> There are percent cutoffs for the est2genome algorithm you can set in the
> maker_bopts.ctl file. Additionally, maker will give the alignment but not
> produce a gene model if it can't translate through the est2genome alignment
> (i.e. stop codons in the assembly). I believe the cutoff is 50%. If you add
> est_forward=1 to the maker_opts.ctl file, names will be copied from the
> alignment source and the score in the GFF3 column will be the percent match
> to the original transcript.
>
> --Carson
>
>
>
> > On Apr 21, 2020, at 7:08 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> >
> > Hello,
> > I am using MAKER to annotate a plant genome assembly. A high-quality
> reference genome and annotation exists for another variety of the same
> species, so my first step is lifting over reference genes to my genome. I
> do this by setting est2genome = 1 and providing MAKER with the reference
> cDNA (transcriptome). No other evidence is provided and no prediction is
> performed. Repeat masking is done using the reference repeats library.
> > When checking the results, I found out lots of reference genes missing
> from the lift-over result. However, if I blast the sequences of these genes
> myself, I get good matches. I even see these matches when I look at the
> blast results buried in the MAKER data_store.
> > For example, a transcript of length 1077 got a match of length 855 -
> 100% identity and no gaps. Bitscore was 1709 and E-value 0. This looks like
> a pretty good match, but it is not found in the final MAKER results
> (gff/fasta).
> > Why is this happening? Are there some cutoffs that are not satisfied? If
> so, what are they and how can they be configured?
> >
> > Thanks,
> > Lior
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fusion.png
Type: image/png
Size: 33185 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200430/a53d513e/attachment-0003.png>

From natassa_g_2000 at yahoo.com  Thu Apr  2 07:42:48 2020
From: natassa_g_2000 at yahoo.com (natassa)
Date: Thu, 2 Apr 2020 13:42:48 +0000 (UTC)
Subject: [maker-devel] Optimal strategy and options for iterative maker2 runs
In-Reply-To: <1b0e5c3cae1b410397e61262a2384039@umu.se>
References: <1b0e5c3cae1b410397e61262a2384039@umu.se>
Message-ID: <1875152020.930988.1585834968860@mail.yahoo.com>

Hello maker community, 
I am annotating with maker2 a fungal genome for which I have transcript evidence, plus transcripts and proteins from closely related species, a genemark .mod file from self-training I have run outside of maker, and an augustus model from a closely related species. I plan to run it iteratively, updating snap (and maybe augustus) models each time. Reading several iterative-maker pipelines online, I am a bit confused on the optimal strategy, and some details on the options used in consecutive runs. Some questions:

1) How will MAKER behave in the case where I would supply my different lines of evidence (EST+protein) along with trained abinitio models in the same run? Here is -what seems to me conflicting- info from posts I read (not in this list): "if est2genome and protein2genome are set to 1 + abinitio tools are also on, the abinitio tools will not use the EST-protein evidence to improve their gene models." but: "In case you activated SNAP and Augustus and you have fed MAKER with lines of evidence (Transcripts and proteins), it will predict gene models using Augustus-Evidence-driven and SNAP-Evidence-driven. In loci where both are present, it will chose the best one according to the lines of evidence (EST / protein when they are present)." Which one is correct?
2) I see in a few tutorials that genemark is trained at a 3rd/4th run and separately from other abinitio programs. I don't understand why, since genemark is self-trained on the genome, so it does not really interact with training from evidence or maker gff files?
3) Can I pass >1 abinitio models from one run to the next using the pred_gff option? For example augustus+genemark hmms, separated by ","? In a 2017 post, Carson writes "I would avoid passing in Augustus results as GFF3, it removes the ability of MAKER to dynamically provide Augustus with hints as it runs". What is the correct way then?

Any input from experienced maker users is welcome!
Thank you in advance, 
Anastasia Gioti

Anastasia Gioti
Researcher
IMBBC-HCMR Crete, Greece
https://scholar.google.com/citations?user=eMsnakoAAAAJ&hl=en

From shore at yorku.ca  Fri Apr  3 11:02:36 2020
From: shore at yorku.ca (shore at yorku.ca)
Date: Fri, 03 Apr 2020 13:02:36 -0400
Subject: [maker-devel] final annotation issues
Message-ID: <1585933356.5e876c2c023db@oldmymail.yorku.ca>

Dear Maker team,

 I believe we are at the final stage of annotation of a plant genome, having
previously trained SNAP over 3 rounds.

 In our attempts at final annotation we have now added new transcriptome data
and generated a repeat library for our species (so we now mask with that, as
well as a database of plant repeats and TE proteins).

 In this final annotation run, we've set keep_preds=1 and then plan to
screen the final gff file, retaining sequences with AED <= 0.5 (or
thereabouts) and ones that possess a Pfam domain.

 I've compared some of the proteins obtained in our previous round of Maker with
the latest. Indeed, the masking appears to have removed some that were TEs. A
number of proteins differ somewhat, likely the result of different intron/exon
boundary calls, and some are quite different in length.
In particular, some were roughly twice as long in the previous annotation, and
the previous versions appear to be the correct size, based on online BLAST
searches.

It is this latter finding that I'm concerned about: why has it occurred?

I did set single_exon=1 and wonder if that is causing this effect?

Thanks and sorry for the long-winded email.

Joel


-- 
Dr. Joel S. Shore
Prof. Biology
York University


From carsonhh at gmail.com  Fri Apr  3 14:51:47 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 3 Apr 2020 14:51:47 -0600
Subject: [maker-devel] guidance for first and subsequent annotation
 parameters
In-Reply-To: <CAEdk9jmg1rbuRPbCC4vNmNkTyN61H7XGQpUi4h6UD1czoFtH1g@mail.gmail.com>
References: <CAEdk9jmg1rbuRPbCC4vNmNkTyN61H7XGQpUi4h6UD1czoFtH1g@mail.gmail.com>
Message-ID: <81E8860F-A91B-4089-B179-4ED7EBAC36D3@gmail.com>

You may need to select a subset of gene models to drive training. I find that I get the best results when I use protein2genome models only from UniProt/Swiss-Prot alignments to generate a training set, with always_complete=1. UniProt/Swiss-Prot is manually curated, so it is very high quality. Then I select models with the highest end-to-end completion (low AED). Also, if you add est_forward=1, the score column in the GFF3 will be the % match to the original model. It's an easy way to select only models with a very high percent match. Remove models without start codons and stop codons. You can relax these parameters if you don't have many models, but in general you want 100-300 models to train with. Only one round of training is needed with this type of training set. The EST method requires 2 rounds and I don't like it as much.

In some cases, model selection for training will be a mostly manual task. You can use editors like Apollo to identify models that match evidence well, and delete odd models. Then train on that result.
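
As a rough illustration of the "low AED, 100-300 models" selection described above, here is a minimal Python sketch. It assumes the MAKER GFF3 carries an `_AED` attribute on each mRNA line; the function names, the 0.1 AED cutoff and the 300-model cap are illustrative choices, not MAKER settings, and the start/stop codon check is not covered here.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def select_training_models(gff_lines, max_aed=0.1, max_models=300):
    """Return IDs of the lowest-AED mRNA models, best first.

    max_aed and max_models are illustrative defaults (assumptions).
    """
    candidates = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Skip anything that is not a nine-column mRNA feature line
        # (this also skips a trailing FASTA section).
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs or "ID" not in attrs:
            continue
        aed = float(attrs["_AED"])
        if aed <= max_aed:
            candidates.append((aed, attrs["ID"]))
    candidates.sort()  # lowest AED first
    return [model_id for _aed, model_id in candidates[:max_models]]
```

The returned IDs could then be used to subset the GFF3 before building the training set; relaxing `max_aed` is the obvious knob if too few models survive.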


What you are seeing is likely the result of over-training. It usually happens if you use more than 2 rounds of training, but it can happen with just two rounds.

--Carson

 
> On Mar 20, 2020, at 5:30 AM, Devon O'Rourke <devon.orourke at gmail.com> wrote:
> 
> With so many posts on the forum it's been challenging to determine what the best practices are for performing multiple rounds of annotation with Maker.
> My first round used est, altest, and protein fasta files with a custom GFF repeat masked file. The resulting vertebrate genome produced 21,970 gene models with a mean length of about 9016 bp; the BUSCO score was C:66.0%[S:64.2%,D:1.8%],F:4.2%,M:29.8%,n:9226 (mammalia_odb10 set). Things seemed to be on the right track, so I set up the next Maker round using both SNAP and Augustus-trained information in the round2 maker_opts.ctl file. At the end of that second round, I noticed a marked decrease in BUSCO score (C:53.3%[S:51.0%,D:2.3%],F:11.6%,M:35.1%,n:9226), yet an increase in the number of gene models (28,646) and mean length (16266 bp). 
> 
> This got me to wondering if I was setting up the _opts.ctl file incorrectly? I'm concerned with a few things (and maybe missing even more I should be concerned about!?):
> I specified the evidence to come from EST/Protein instead of using the section available under "#-----Re-annotation Using MAKER Derived GFF3". Maybe that was a fundamental mistake? What is the expected change in behavior if I moved my round1 Maker output into that category instead of using the EST/Protein Homology evidence sections as I did below?
> I wasn't sure what to do with the RepeatMasking GFF files in Round2. The RepeatMasker GFF I included in Round1 consisted of just complex repeats (setting model_org=simple and softmask=1 to effectively only hard mask those complex areas for the initial alignments). But what should be used in Round2 - the output GFF of Round1, or the input GFF from Round1?
> Here's what I did for the Round2 maker_opts.ctl file:
> 
> #-----Genome (these are always required)
> genome=/scratch/dro49/myluwork/annotation/input_files/mylu_hic_rails_noMasks.fa
> organism_type=eukaryotic
> #-----EST Evidence (for best results provide a file for at least one)
> est_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.est2genome.gff
> altest_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.cdna2genome.gff
> #-----Protein Homology Evidence (for best results provide a file for at least one)
> protein_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.protein2genome.gff
> #-----Repeat Masking (leave values blank to skip repeat masking)
> rm_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.repeats.gff
> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
> #-----Gene Prediction
> snaphmm=/scratch/dro49/myluwork/annotation/maker_rd2/snap_rd1/lu_rnd1.zff.length50_aed0.25.hmm #SNAP HMM file
> augustus_species=mylu #Augustus gene prediction species model
> run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
> est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
> protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
> trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
> unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
> allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)
> 
> 
> Thank you for your insights and support,
> 
> Devon
> 
> -- 
> Devon O'Rourke
> Postdoctoral researcher, Northern Arizona University
> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/ <https://fozlab.weebly.com/>
> twitter: @thesciencedork


From carsonhh at gmail.com  Fri Apr  3 16:03:12 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 3 Apr 2020 16:03:12 -0600
Subject: [maker-devel] Problem with Maker using GeneMark
In-Reply-To: <5e838561.1c69fb81.27f9c.24c6SMTPIN_ADDED_MISSING@mx.google.com>
References: <5e838561.1c69fb81.27f9c.24c6SMTPIN_ADDED_MISSING@mx.google.com>
Message-ID: <E2FF8D2F-8B8F-456B-BC0E-1A1C099D05D3@gmail.com>

Could you try the attached version, and let me know if it resolves the issue (copy over the old one)? The probuild command I used is just one I stole from another GeneMark script, so I just borrowed the updated command from the SplitFasta subroutine in gmes_petap.pl.

--Carson


> On Mar 31, 2020, at 11:53 AM, Gagné, Patrick (NRCAN/RNCAN) <patrick.gagne at canada.ca> wrote:
> 
> Hi
> 
> I've come across a bug while using Maker. I'm trying to annotate a 560 Mb genome and I'm using SNAP, GeneMark and Augustus in Maker.
> When Maker executes the GeneMark command, it just fails (GeneMark Failed) without any error messages, so I decided to debug it myself. I launched every command manually and found out that gmhmm_wrap is causing the issue. The problem is in fact in the probuild command; it doesn't do anything (from what I understand, this command is supposed to split the fasta where there is NNN to prevent a GeneMark crash). My genome has very long stretches of N (up to 14 kb).
> 
> After checking the probuild help, I found that the command used in gmhmm_wrap is not valid (half the options are not in probuild anymore, probably because of GeneMark updates).
> 
> I have tried different probuild versions (those I could download from the GeneMark site; they don't give older versions except those that come with their program's versions):
> 2.16
> 2.34
> 2.44 (latest, which comes with GeneMark-ES)
> 
> I've also tried to edit the gmhmm_wrap script and modify the probuild command, but even when the fasta files are split, I get another bug: ERROR: Logic error in getting offset. I've tried to replace the command for the offset extraction, which also worked, but now I get a bug when Maker tries to get the ab-initio output:
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Calling translate without a seq argument!
> 
> Could you please tell me how to fix this, or tell me what probuild I should use (I will ask the GeneMark support for it)?
> 
> Thanks in advance
> 
> P.S.
> Sorry for my English, it's not my first language and I'm still learning.
> 
> Patrick Gagné
> Spécialiste en bio-informatique / Bioinformatics specialist
> Service canadien des forêts / Canadian Forest Service
> Ressources naturelles Canada / Natural Resources Canada
> Gouvernement du Canada / Government of Canada
> Centre de foresterie des Laurentides/Laurentian Forestry Centre
> 1055, rue du P.E.P.S.
> C.P. 10380, succ. Sainte-Foy/P.O. Box 10380, Stn. Sainte-Foy
> Québec (Qc) G1V 4C7
> Laboratoire de pathologie forestière (Local 2.21)
> patrick.gagne at canada.ca <mailto:patrick.gagne at canada.ca> / tel : (418) 648-4443
>  
>  
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmhmm_wrap
Type: application/octet-stream
Size: 9027 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200403/7766c8a5/attachment-0004.obj>

From carsonhh at gmail.com  Sat Apr  4 14:09:05 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 4 Apr 2020 14:09:05 -0600
Subject: [maker-devel] repeatmasker output gff
In-Reply-To: <CA+un12sUfaxK_ycgves9cohrdFjOyg7Nt9WUzGSK6Yz-RbdgMQ@mail.gmail.com>
References: <CA+un12sUfaxK_ycgves9cohrdFjOyg7Nt9WUzGSK6Yz-RbdgMQ@mail.gmail.com>
Message-ID: <08952604-1F7A-4FC5-9F59-DB79665A324D@gmail.com>

It needs to be a two level feature. match/match_part is one example, others will work as long as when it is assembled it is two levels.
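
A minimal Python sketch of what "two levels" could look like when converting RepeatMasker's one-line `similarity` records. It assumes each hit is a single tab-separated line whose last column has the form `Target "Motif:NAME" start end`; the `repeatN` ID scheme and the function name are made up for illustration, and repeat names containing GFF3-reserved characters (`;`, `=`, `,`) would need escaping that is not done here.

```python
import re

# Attribute layout assumed for RepeatMasker's GFF output (see lead-in).
TARGET_RE = re.compile(r'Target "Motif:([^"]+)" (\d+) (\d+)')


def to_two_level(gff_lines):
    """Rewrite single 'similarity' lines as match + match_part pairs."""
    out = ["##gff-version 3"]
    count = 0
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "similarity":
            continue
        count += 1
        match_id = f"repeat{count}"  # hypothetical ID scheme
        extra = ""
        hit = TARGET_RE.search(cols[8])
        if hit:
            name, t_start, t_end = hit.groups()
            extra = f";Name={name};Target={name} {t_start} {t_end}"
        # Parent feature: one 'match' line per repeat hit.
        out.append("\t".join(cols[:2] + ["match"] + cols[3:8]
                             + [f"ID={match_id}{extra}"]))
        # Child feature: a single 'match_part' covering the same span.
        out.append("\t".join(cols[:2] + ["match_part"] + cols[3:8]
                             + [f"ID={match_id}:part1;Parent={match_id}"]))
    return out
```

Each input hit becomes a parent line plus one child line pointing back at it through `Parent=`, which is the two-level shape described above.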

MAKER saves its state as it runs, so you can restart it at any time without losing progress.

--Carson


> On Mar 25, 2020, at 2:38 PM, Homa Papoli <hpapoli at gmail.com> wrote:
> 
> Hello,
> 
> I have 2 questions regarding user maker:
> 
> I have used repeatmasker for my genome separately and I have a gff file. However, my gff file, in the third column, has the word "similarity". In a workshop I had taken on genome annotation, it was said that the gff for maker should have "match" and "match_part" for the third column. I was wondering whether I could use the original gff output of repeatmasker or should I make any changes to it?
> 
> Another question is about running maker. Since maker takes several days to run, if the job gets interrupted due to limit in days of running the job, I was wondering whether it is possible to re-start maker from where it got interrupted?
> 
> Thank you,
> Homa


From carsonhh at gmail.com  Sat Apr  4 14:15:21 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 4 Apr 2020 14:15:21 -0600
Subject: [maker-devel] Maker annotation  AED scores are around 0.5
In-Reply-To: <1b0e5c3cae1b410397e61262a2384039@umu.se>
References: <1b0e5c3cae1b410397e61262a2384039@umu.se>
Message-ID: <E2AAAE35-7B24-46EE-B77F-9E4BD584CC45@gmail.com>

Probably this:

https://groups.google.com/forum/#!searchin/maker-devel/Curious$20pattern$20in$20AED$20distributions%7Csort:date/maker-devel/QS3VnxhvEks/q3lPmywjBQAJ <https://groups.google.com/forum/#!searchin/maker-devel/Curious$20pattern$20in$20AED$20distributions|sort:date/maker-devel/QS3VnxhvEks/q3lPmywjBQAJ>


Likely caused by an over abundance of single-exon models and under masking of repeats in the genome.

--Carson


> On Mar 30, 2020, at 3:37 AM, Wei Zhao <zhao.wei at umu.se> wrote:
> 
> Dear maker team,
>  
> I am writing to ask for your help.
>  
> I am using MAKER to annotate a big genome (~9 Gbp). I have 3 lines of evidence: 1) the transcriptome of this species; 2) protein sequences from related species; 3) an Augustus model trained from PASA.
> 
> When I use all 3 lines of evidence to annotate the genome (basic pipeline), the distribution of AED scores is weird (a single peak around 0.5).
> 
> I have also tried to update the gene models I got from PASA using MAKER; the distribution of AED scores is the same.
> 
> But when I use only ESTs or proteins as evidence (est2genome or protein2genome), the AED scores are normal (close to 0).
> 
> As I understand it, the 3 lines of evidence seem to conflict with each other, which makes the AED scores higher (~0.5) than expected. Could you give me some suggestions on how to fix this problem?
>  
> Best regards,
>  
> Wei
>  
>  
> <E6F3EF742C40408F8390EE9A1FF29894.png>
>  

From danis_theo at hotmail.com  Thu Apr  2 12:24:05 2020
From: danis_theo at hotmail.com (Thodoris Danis)
Date: Thu, 2 Apr 2020 18:24:05 +0000
Subject: [maker-devel] Question about re-annotation
Message-ID: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>

Hello maker community,


I am annotating a fish genome with Maker 2.31.10, for which I have transcript evidence and proteins from closely related species. I am running maker in several passes and pass the gff output to the next pass every time here (
"#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #re-annotate genome based on this gff3 file",
). Is this wrong? And what is the difference from running maker from scratch without re-supplying the new gff every time?

Secondly, in the second round, after training the predictors, do I have to keep est2genome=1, protein2genome=1, and the EST and protein evidence, or not?
How do we switch all these parameters?

Any input from experienced maker users is welcome
Thank you for your help


Thodoris Danis

From carsonhh at gmail.com  Sun Apr  5 16:19:26 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 5 Apr 2020 16:19:26 -0600
Subject: [maker-devel] Question about re-annotation
In-Reply-To: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>
References: <VI1PR06MB4286C416FD7E075263F3390DF7C60@VI1PR06MB4286.eurprd06.prod.outlook.com>
Message-ID: <9B863735-2D4D-4FD2-A8B1-3C542F3D767A@gmail.com>

If you are running several times, just rerun in the same directory after altering settings. MAKER will reuse old raw data reports as appropriate. The maker_gff option is really just for reannotating from an old maker run where you no longer have the raw files available.

--Carson


> On Apr 2, 2020, at 12:24 PM, Thodoris Danis <danis_theo at hotmail.com> wrote:
> 
> Hello maker community,
> 
> 
> I am annotating a fish genome with Maker 2.31.10, for which I have transcript evidence and proteins from closely related species. I am running maker in several passes and pass the gff output to the next pass every time here (
>> 
>> "#-----Re-annotation Using MAKER Derived GFF3
>> maker_gff= #re-annotate genome based on this gff3 file",
> ). Is this wrong? And what is the difference from running maker from scratch without re-supplying the new gff every time?
> 
> Secondly, in the second round, after training the predictors, do I have to keep est2genome=1, protein2genome=1, and the EST and protein evidence, or not?
> How do we switch all these parameters?
> 
> Any input from experienced maker users is welcome
> Thank you for your help
> 
> 
> Thodoris Danis

From carsonhh at gmail.com  Tue Apr  7 11:42:08 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 7 Apr 2020 11:42:08 -0600
Subject: [maker-devel] Maker 2.31.10: maker_functional_gff and
 maker_functional_fasta not parsing correctly,
 Can't use string ("") as a HASH ref while "strict refs" in use
In-Reply-To: <4A449297-A6D1-4A75-9547-FB5F70CE1A0A@ulaval.ca>
References: <4A449297-A6D1-4A75-9547-FB5F70CE1A0A@ulaval.ca>
Message-ID: <D775DED1-F3A5-478B-B7BE-8F318CFEADA3@gmail.com>

Thanks, I'll update the related scripts. In my tests the old regular expression still works, but ends up adding the OX= tag as part of the GFF3 entry and not throwing a hash ref error. So you still may have another issue if you are getting a hash ref error.
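
To make the difference concrete, here is a small Python sketch (a translation of the two Perl patterns quoted below, not the MAKER script itself) run against one header in the current Swiss-Prot layout. The header line, the `parse_header` helper and the field names are only illustrative.

```python
import re

# Python translations of the old and proposed Perl patterns.
OLD_PATTERN = re.compile(r">(\S+)\s+(.*?)\s+OS=(.*?)\s+(GN=(.*?)\s+)?PE=")
NEW_PATTERN = re.compile(
    r">sp\|(\S+)\|\S+\s+(.*?)\s+OS=(.*?)\s+OX=\S+\s+(GN=(.*?)\s+)?PE="
)

# Example header with the OX= field (illustrative input only).
HEADER = (
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha "
    "OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
)


def parse_header(pattern, header):
    """Return the captured fields as a dict, or None if nothing matches."""
    m = pattern.search(header)
    if m is None:
        return None
    return {
        "id": m.group(1),
        "description": m.group(2),
        "organism": m.group(3),
        "gene": m.group(5),  # None when the header has no GN= field
    }


old_fields = parse_header(OLD_PATTERN, HEADER)
new_fields = parse_header(NEW_PATTERN, HEADER)
```

With this input the old pattern still matches, but its organism field swallows the `OX=` tag, while the proposed pattern keeps the organism clean and captures the bare accession. Note that the proposed pattern only matches headers starting with `>sp|`.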

--Carson


> On Mar 14, 2020, at 11:24 AM, Christopher Keeling <christopher.keeling.1 at ulaval.ca> wrote:
> 
> Hello,
> 
> In sub parse_blast{, during parsing of uniprot fasta file:
> 
> if (/>(\S+)\s+(.*?)\s+OS=(.*?)\s+(GN=(.*?)\s+)?PE=/) {
> 
> should be changed to:
> 
> if (/>sp\|(\S+)\|\S+\s+(.*?)\s+OS=(.*?)\s+OX=\S+\s+(GN=(.*?)\s+)?PE=/) {
> 
> to avoid "Can't use string ("") as a HASH ref while "strict refs" in use at ..." errors.
> 
> For UniProt release 2020_01: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz <ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz>
> 
> Cheers,
> Chris
> 
> 
> --
> Christopher I. Keeling
> Chercheur scientifique en génomique forestière / Research Scientist in Forest Genomics
> 
> Ressources naturelles Canada / Natural Resources Canada
> Service canadien des forêts / Canadian Forest Service
> Centre de foresterie des Laurentides / Laurentian Forestry Centre
> 1055, rue du PEPS Québec, QC G1V 4C7 Canada
> https://cfs.nrcan.gc.ca/employees/read/ckeeling <https://cfs.nrcan.gc.ca/employees/read/ckeeling>
> 
> Professeur associé
> Département de biochimie, de microbiologie et de bio-informatique
> Université Laval
> https://www.researchgate.net/profile/Christopher_Keeling <https://www.researchgate.net/profile/Christopher_Keeling>
> https://scholar.google.ca/citations?user=KtGr86UAAAAJ <https://scholar.google.ca/citations?user=KtGr86UAAAAJ>
>  
> 

From andrei.kiselev at lrsv.ups-tlse.fr  Fri Apr 10 10:33:57 2020
From: andrei.kiselev at lrsv.ups-tlse.fr (andrei.kiselev at lrsv.ups-tlse.fr)
Date: Fri, 10 Apr 2020 16:33:57 +0000
Subject: [maker-devel] New assembly annotation
Message-ID: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>

Hello.
I have recently obtained a new PacBio genome assembly of the oomycete Aphanomyces.
I used MAKER as described here: https://groups.google.com/forum/#!searchin/maker-devel/new$20assembly%7Csort:date/maker-devel/Xo5YbWgNwFw/KstkmXYYAgAJ

After the first run, the number of transcripts was slightly higher than in the gff file of the previous genome version. Then I ran MAKER a second time with the new gff file in the pred_gff option plus an Augustus model trained for my species. As a result, I got only half of the transcripts from the initial gff.

Is there something I could have overlooked when running MAKER? Attached is the control file of the last run.

Thank you in advance.
Andrei
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4984 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200410/54d75f1c/attachment-0004.obj>

From liorglic at mail.tau.ac.il  Mon Apr 13 08:12:42 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Mon, 13 Apr 2020 17:12:42 +0300
Subject: [maker-devel] Annotating a fragmented assembly
Message-ID: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>

Hello there,

I am working on creating plant pan genomes. This means that I produce many
assemblies for samples of the same species from NGS data available from SRA
and then annotate them with MAKER, based on a collection of relevant
evidence (transcripts and proteins).
As you might imagine, data quality is variable, so I sometimes create
assemblies from >x20 sequencing depth, resulting in fragmented assemblies
(say N50 in the range of 5-10kb).
Annotation results of such genomes usually contain many partial genes,
broken across contigs, so in many cases I get two proteins, representing
the 3' and 5' parts of a broken gene. In other cases, only one part of the
gene is detected.
I've also found that applying reference-based scaffolding (I use RaGOO) to
generate pseudomolecules improves results by bringing together contigs
containing gene parts and allowing MAKER to create full annotations.
However, this also results in new erroneous predictions, spanning two
contigs that are not actually adjacent in the genome but were brought
together by the scaffolding process.
I suspect this has to do with the number of 'N' characters introduced as
padding between ordered contigs, so one thing I wanted to ask about is how
MAKER reacts to N's in the middle of a gene. Does it affect gene prediction?
I would also appreciate any advice on how to annotate fragmented genomes
and comments about the strategy I described above. Please note that I am
not expecting a reference-level annotation, but am simply trying to reduce
noise levels towards downstream comparative analyses.

Thanks a lot and best regards,
Lior

From xpeng at ucsb.edu  Tue Apr 14 11:40:15 2020
From: xpeng at ucsb.edu (xpeng at ucsb.edu)
Date: Tue, 14 Apr 2020 10:40:15 -0700
Subject: [maker-devel] Can install but Cannot Run MAKER
Message-ID: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>

Dear Yandell Lab,

I am writing to ask for some help getting MAKER to work.

I downloaded MAKER v3.01.03 and followed the installation instructions on
your wiki page, both on my local computer (using sudo) and on PSC Bridges
(with MPI).

The installation seemed to complete successfully.

However, when I ran "maker -h" I received error messages (attached) that I
don't know how to resolve. Could you please advise a solution?

Thank you!

Nick (Xuefeng Peng)

Postdoctoral Scholar
University of California
Santa Barbara, CA

-------------- next part --------------
$ ./maker -h
Possible precedence issue with control flow operator at /usr/share/perl5/Bio/DB/IndexedBase.pm line 845.
syntax error at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 105, near ")

	if"
syntax error at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 106, near "]{strand"
Global symbol "$strand" requires explicit package name (did you forget to declare "my $strand"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 106.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$strand" requires explicit package name (did you forget to declare "my $strand"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 109.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$strand" requires explicit package name (did you forget to declare "my $strand"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 113.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "%g" requires explicit package name (did you forget to declare "my %g"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "$id" requires explicit package name (did you forget to declare "my $id"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "$i" requires explicit package name (did you forget to declare "my $i"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name (did you forget to declare "my @stuff"?) at /home/csp/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Execution of /home/csp/software/maker/bin/../lib/Widget/trnascan.pm aborted due to compilation errors.
Compilation failed in require at /home/csp/software/maker/bin/../lib/GI.pm line 40.
BEGIN failed--compilation aborted at /home/csp/software/maker/bin/../lib/GI.pm line 40.
Compilation failed in require at ./maker line 46.
BEGIN failed--compilation aborted at ./maker line 46.
-------------- next part --------------
$ maker -h
syntax error at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 105, near ")

	if"
syntax error at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 106, near "]{strand"
Global symbol "$strand" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 106.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 107.
Global symbol "$strand" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 109.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 110.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 111.
Global symbol "$strand" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 113.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 114.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 115.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 120.
Global symbol "%g" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "$id" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "$i" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
Global symbol "@stuff" requires explicit package name at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 121.
syntax error at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm line 122, near "}"
/pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/Widget/trnascan.pm has too many errors.
Compilation failed in require at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/GI.pm line 40.
BEGIN failed--compilation aborted at /pylon5/bi5618p/hmahzpxf/software/maker/bin/../lib/GI.pm line 40.
Compilation failed in require at /pylon5/bi5618p/hmahzpxf/software/maker/bin/maker line 46.
BEGIN failed--compilation aborted at /pylon5/bi5618p/hmahzpxf/software/maker/bin/maker line 46.

From carsonhh at gmail.com  Tue Apr 14 12:11:16 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 14 Apr 2020 12:11:16 -0600
Subject: [maker-devel] Can install but Cannot Run MAKER
In-Reply-To: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>
References: <010c01d61283$c22308e0$46691aa0$@ucsb.edu>
Message-ID: <94E18556-771A-485C-B534-80B52BC7D586@gmail.com>

Please re-download and install again. Using your error output, I found the issue in the new install package.

--Carson


> On Apr 14, 2020, at 11:40 AM, <xpeng at ucsb.edu> <xpeng at ucsb.edu> wrote:
> 
> Dear Yandell Lab,
>  
> I am writing to get a bit of help on making MAKER to work.
>  
> I downloaded the v3.01.03 maker and followed the instructions on your wiki page to install, both on my local computer as sudo and on PSC Bridges (with MPI). 
>  
> The installation seemed to have completed successfully.
>  
> However, when I ran "maker -h" I received error messages (attached) that I don't know what to do about. Could you please advise a solution?
>  
> Thank you!
>  
> Nick (Xuefeng Peng)
>  
> Postdoctoral Scholar
> University of California
> Santa Barbara, CA
> <Error_Message_Ubuntu_19.10.txt><Error_Message_PSC_Bridges.txt>

From liorglic at mail.tau.ac.il  Tue Apr 21 07:08:40 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Tue, 21 Apr 2020 16:08:40 +0300
Subject: [maker-devel] Missing genes in lift-over with est2genome
Message-ID: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>

Hello,
I am using MAKER to annotate a plant genome assembly. A high-quality
reference genome and annotation exists for another variety of the same
species, so my first step is lifting over reference genes to my genome. I
do this by setting est2genome = 1 and providing MAKER with the reference
cDNA (transcriptome). No other evidence is provided and no prediction is
performed. Repeat masking is done using the reference repeats library.
When checking the results, I found out lots of reference genes missing from
the lift-over result. However, if I blast the sequences of these genes
myself, I get good matches. I even see these matches when I look at the
blast results buried in the MAKER data_store.
For example, a transcript of length 1077 got a match of length 855 - 100%
identity and no gaps. Bitscore was 1709 and E-value 0. This looks like a
pretty good match, but it is not found in the final MAKER results
(gff/fasta).
Why is this happening? Are there some cutoffs that are not satisfied? If
so, what are they and how can they be configured?

Thanks,
Lior

From carsonhh at gmail.com  Thu Apr 23 11:38:54 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:38:54 -0600
Subject: [maker-devel] Annotating a fragmented assembly
In-Reply-To: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>
References: <CAOzMDPxTPirGeS-WcvQ_5BWijgBOh3O=h3SMnAzn1mSAic5aiA@mail.gmail.com>
Message-ID: <C9C6F924-D27C-498A-81B8-B051C25CDB27@gmail.com>

N's are handled by the gene predictors themselves. I know Augustus can span N's within introns. I'm not sure how many N's will cause it to split the gene. It may be a function of the expected intron length in the HMM, in which case organisms with large introns could tolerate more N's. GeneMark will split genes on even short runs of N's. I'm not sure about SNAP. For BLAST alignments, gap extensions decrease the score, so how long the gap can be depends on the score of the initial seeding alignment. The larger the initial score, the longer the gap can be before the score drops below the termination threshold.
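
A minimal sketch for illustration (not from the original reply): since the predictors respond differently to gap length, it can help to list where the runs of N sit in a scaffolded assembly and how long they are. The function names `read_fasta` and `n_runs`, the `min_len` parameter and the toy scaffold are all hypothetical, and the script only reports gaps; it does not model how any predictor treats them.

```python
import re

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def n_runs(seq, min_len=1):
    """Return (start, end, length) for each run of N; 0-based, end-exclusive."""
    runs = []
    for m in re.finditer(r"[Nn]+", seq):
        length = m.end() - m.start()
        if length >= min_len:
            runs.append((m.start(), m.end(), length))
    return runs

# Toy input (hypothetical): one scaffold with a 10 bp gap and a 2 bp gap.
example = [">scaffold_1", "ACGTACGT" + "N" * 10 + "ACGT", "ACGTNNACGT"]
for name, seq in read_fasta(example):
    # Only report gaps of at least 5 bp; the cutoff is arbitrary.
    for start, end, length in n_runs(seq, min_len=5):
        print(name, start, end, length)
```

Comparing these coordinates with the positions of suspicious gene models is one way to see whether erroneous predictions actually cross scaffolding gaps.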

--Carson


> On Apr 13, 2020, at 8:12 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> 
> Hello there,
> 
> I am working on creating plant pan genomes. This means that I produce many assemblies for samples of the same species from NGS data available from SRA and then annotate them with MAKER, based on a collection of relevant evidence (transcripts and proteins).
> As you might imagine, data quality is variable, so I sometimes create assembles from >x20 sequencing depth, resulting in fragmented assemblies (say N50 in the range of 5-10kb).
> Annotation results of such genomes usually contain many partial genes, broken across contigs, so in many cases I get two proteins, representing the 3' and 5' parts of a broken gene. In other cases, only one part of the gene is detected.
> I've also found that applying reference-based scaffolding (I use RaGOO) to generate pseudomolecules improves results by bringing together contigs containing gene parts and allowing MAKER to create full annotation. However, this also results in new erroneous predictions, spanning two contigs that are not actually adjacent in the genome but were brought together by the scaffolding process.
> I suspect this has to do with the number of 'N' characters introduced as padding between ordered contigs, so one thing I wanted to ask about is how MAKER reacts to N's in the middle of a gene. Does it affect gene prediction?
> I would also appreciate any advice on how to annotate fragmented genomes and comments about the strategy I described above. Please note that I am not expecting a reference-level annotation, but am simply trying to reduce noise levels towards downstream comparative analyses.
> 
> Thanks a lot and best regards,
> Lior
> 


From carsonhh at gmail.com  Thu Apr 23 11:43:30 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:43:30 -0600
Subject: [maker-devel] Missing genes in lift-over with est2genome
In-Reply-To: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
References: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
Message-ID: <373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>

There are percent cutoffs for the est2genome algorithm that you can set in the maker_bopts.ctl file. Additionally, MAKER will report the alignment but not produce a gene model if it can't translate through the est2genome alignment (i.e. stop codons in the assembly). I believe the cutoff is 50%. If you add est_forward=1 to the maker_opts.ctl file, names will be copied from the alignment source, and the score column in the GFF3 will be the percent match to the original transcript.
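
A minimal sketch for illustration (not from the original reply) of the two checks described above, applied to the numbers in the question: an 855 bp match for a 1077 bp transcript. The 0.80 coverage threshold is an assumption made for the example, not a value confirmed in this thread, so check the actual settings in your own maker_bopts.ctl. The stop-codon check assumes the standard genetic code and reading frame 0, and all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def query_coverage(aligned_len, query_len):
    """Fraction of the query transcript covered by the alignment."""
    return aligned_len / query_len

def has_internal_stop(cds):
    """True if any codon before the last one is a stop codon (frame 0)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])

# Assumed threshold for illustration only; read the real one from maker_bopts.ctl.
ASSUMED_MIN_COVERAGE = 0.80

cov = query_coverage(855, 1077)
print(f"coverage = {cov:.3f}, passes assumed cutoff = {cov >= ASSUMED_MIN_COVERAGE}")

# Toy sequences (hypothetical): the first has an in-frame TAG before the end.
print(has_internal_stop("ATGAAATAGCCCTAA"))
print(has_internal_stop("ATGAAACCCTAA"))
```

Under that assumed threshold, a match covering about 79% of the transcript would fall just short even at 100% identity, which is why coverage is worth checking separately from bitscore and E-value.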

--Carson


> On Apr 21, 2020, at 7:08 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> 
> Hello,
> I am using MAKER to annotate a plant genome assembly. A high-quality reference genome and annotation exists for another variety of the same species, so my first step is lifting over reference genes to my genome. I do this by setting est2genome = 1 and providing MAKER with the reference cDNA (transcriptome). No other evidence is provided and no prediction is performed. Repeat masking is done using the reference repeats library.
> When checking the results, I found out lots of reference genes missing from the lift-over result. However, if I blast the sequences of these genes myself, I get good matches. I even see these matches when I look at the blast results buried in the MAKER data_store.
> For example, a transcript of length 1077 got a match of length 855 - 100% identity and no gaps. Bitscore was 1709 and E-value 0. This looks like a pretty good match, but it is not found in the final MAKER results (gff/fasta).
> Why is this happening? Are there some cutoffs that are not satisfied? If so, what are they and how can they be configured?
> 
> Thanks,
> Lior


From carsonhh at gmail.com  Thu Apr 23 11:53:27 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:53:27 -0600
Subject: [maker-devel] New assembly annotation
In-Reply-To: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>
References: <206ab427337feed79898df112d76cb2e@lrsv.ups-tlse.fr>
Message-ID: <DFFD73D7-8379-467B-9992-FDDBAE230802@gmail.com>

Fewer transcripts can mean fewer split and spurious genes. It can also mean bad merges caused by overtraining.  Use BUSCO to evaluate the completeness of gene models rather than transcript count.  Also review models visually using something like Apollo.  You will be able to see if models are spanning distinct evidence clusters or if they were previously split within evidence clusters.  That will help you judge whether the models now follow the evidence alignments more closely.
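
A minimal sketch for illustration (not from the original reply): alongside BUSCO and visual review, the two runs can be compared by the AED values MAKER writes on mRNA features in its GFF3 output (the `_AED` attribute). The function names, the 0.5 cutoff and the toy records below are hypothetical, and the script is a rough summary, not a substitute for the checks recommended above.

```python
def aed_values(gff_lines):
    """Collect _AED values from mRNA features in GFF3-formatted lines."""
    values = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if "_AED" in attrs:
            values.append(float(attrs["_AED"]))
    return values

def fraction_at_or_below(values, cutoff=0.5):
    """Fraction of values <= cutoff (0.0 for an empty list)."""
    return sum(v <= cutoff for v in values) / len(values) if values else 0.0

# Toy records (hypothetical) in place of a real MAKER GFF3 file.
example = [
    "##gff-version 3",
    "\t".join(["chr1", "maker", "gene", "1", "900", ".", "+", ".", "ID=g1"]),
    "\t".join(["chr1", "maker", "mRNA", "1", "900", ".", "+", ".",
               "ID=g1-mRNA-1;Parent=g1;_AED=0.10;_eAED=0.10"]),
    "\t".join(["chr1", "maker", "mRNA", "2000", "2900", ".", "-", ".",
               "ID=g2-mRNA-1;Parent=g2;_AED=0.70;_eAED=0.70"]),
]
values = aed_values(example)
print(len(values), fraction_at_or_below(values, cutoff=0.5))
```

Running the same summary on the GFF3 from each run shows whether the drop in transcript count came with a shift in how well the remaining models agree with the evidence.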

--Carson


> On Apr 10, 2020, at 10:33 AM, andrei.kiselev at lrsv.ups-tlse.fr wrote:
> 
> Hello.
> I'have recently got a new genome assembly using PacBio of oomycete Aphanomyces. 
> I used MAKER in the manner as described here https://groups.google.com/forum/#!searchin/maker-devel/new$20assembly%7Csort:date/maker-devel/Xo5YbWgNwFw/KstkmXYYAgAJ <https://groups.google.com/forum/#!searchin/maker-devel/new%2420assembly%7Csort:date/maker-devel/Xo5YbWgNwFw/KstkmXYYAgAJ>
> 
> After first run I got the number of transcripts slightly higher than were in gff file of previous version of genome. Then I run the second MAKER with new gff file in option pred_gff + augustus trained for my species. As a result I got only half of the transcripts from initial gff.
> 
> Is there something that I could overlook running MAKER? Attached is control file of the last run.
> 
> Thank you in advance.
> Andrei
> <maker_opts.ctl>


From carsonhh at gmail.com  Thu Apr 23 11:57:23 2020
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Apr 2020 11:57:23 -0600
Subject: [maker-devel] final annotation issues
In-Reply-To: <1585933356.5e876c2c023db@oldmymail.yorku.ca>
References: <1585933356.5e876c2c023db@oldmymail.yorku.ca>
Message-ID: <D56728B7-B822-4EF4-AF75-7EE76C6D6908@gmail.com>

I would not recommend single_exon=1 unless this is an organism where you expect a lot of single-exon genes (typically fungi or oomycetes).  It's best to review models visually in something like Apollo to see how evidence alignments compare to gene predictions. There is always the chance that you have some overmasking that could trim some regions you don't want to lose.

--Carson


> On Apr 3, 2020, at 11:02 AM, shore at yorku.ca wrote:
> 
> Dear Maker team,
> 
> I believe we are the final stage of annotation of a plant genome, having
> previously trained snap following 3 rounds.
> 
> In our attempts at final annotation we have now added new transcriptome data,
> and generated a repeat library for our species (so we now mask with that, as
> well as database of plant repeats , and TE proteins).
> 
> In this final annotation run, we've set keep_pred=1 and then plan to
> screen the final gff file retaining sequences with AED<= 0.5 (or there
> abouts) and ones that possess a pfam domain .
> 
> I've compared some of the proteins obtained in our previous round of Maker with
> the latest. Indeed the masking appears to have removed some that were TEs. A
> number of proteins differ somewhat, likely the result of different intron/exon
> boundary calls, and some are quite different in length.
> In particular some are roughly twice the length in previous annotation, and
> appear to be of the correct size previously , based upon online blasts.
> 
> It is this latter finding that I'm concerned about.
> Why it has occurred.
> 
> I did set single-exon=1 and wonder if that is causing this effect?
> 
> Thanks and sorry for the long-winded email.
> 
> Joel
> 
> 
> 
> -- 
> Dr. Joel S. Shore
> Prof. Biology
> York University
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org


From guerrer at uni-duesseldorf.de  Fri Apr 24 08:27:24 2020
From: guerrer at uni-duesseldorf.de (Ricardo Nuno Ferreira Martins Guerreiro)
Date: Fri, 24 Apr 2020 16:27:24 +0200
Subject: [maker-devel] Maker 0 genes after SNAP or with proteins.gff
Message-ID: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>

Dear Makers list,


I am struggling with MAKER after many successful attempts. I don't 
understand why, but my final .gff does not contain any genes at all.

I first run evidence-based modelling with proteins only. Here 
I get around 40 thousand genes if I give the proteins as a FASTA to 
align (if I provide a protein.gff from a previous MAKER attempt, I get 0 
genes, the same problem).

Afterwards I create a SNAP HMM and run MAKER again, setting 
protein2genome=0 and snaphmm=snap.hmm as you recommend, but now I get 0 
genes. This happens whether I keep providing the proteins as a FASTA or as 
a .gff from a previous run.

I have done this many times and it always worked. The only difference 
now is that I am using no ESTs whatsoever, only proteins. It is also 
strange that it works in the first round of MAKER but not in 
the SNAP rounds.


Hope you can help,
Ricardo
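
A minimal sketch for illustration (not part of the original message): when comparing control files between a run that produced genes and one that did not, it can help to parse each file into a dictionary and look only at the options that are actually set. The function name `parse_ctl` and the handful of sample lines (copied from the attached control file) are for illustration; the script does not diagnose the problem, it only makes the settings easier to compare.

```python
def parse_ctl(lines):
    """Parse MAKER-style 'key=value #comment' lines into a dict of strings."""
    opts = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and section headers
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts

# A few lines taken from the attached control file.
example = [
    "#-----Gene Prediction",
    "snaphmm=snap2/snap2.hmm",
    "gmhmm= #GeneMark HMM file",
    "est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no",
    "protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no",
    "keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)",
]
opts = parse_ctl(example)
for key, value in opts.items():
    if value not in ("", "0"):
        print(f"{key} = {value}")  # options that are switched on or filled in
```

Running it on the control files from both rounds and comparing the two dictionaries shows exactly which settings changed between the working and failing runs.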
-------------- next part --------------
#-----Genome (these are always required)
genome=/gpfs/project/projects/qggp/C34_PS/experiments/annotation/maker/b_tournefortii/b_tournefortii.fasta #genome sequence (fasta file or fasta embedded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closely related species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/gpfs/project/projects/qggp/C34_PS/data/proteins/all_prots95.fasta  #protein sequence file in fasta format (i.e. from multiple organisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib=/gpfs/project/projects/qggp/C34_PS/experiments/annotation/maker/b_tournefortii/allRepeats.lib.noProtFinal #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm=snap2/snap2.hmm
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
snoscan_meth= #-O-methylation site file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap=0 #allowed gene overlap fraction (value from 0 to 1, blank for default)

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
min_intron=20 #minimum intron length (used for alignment polishing)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files
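
When a control file like the one above yields no genes, it can help to list which gene-model sources are actually switched on. The following is a minimal sketch, not part of MAKER: the helper names `parse_maker_ctl` and `active_gene_sources` are made up for illustration, and the parser assumes the simple `key=value #comment` layout shown above.

```python
def parse_maker_ctl(text):
    """Parse MAKER control-file text into a {key: value} dict."""
    opts = {}
    for line in text.splitlines():
        # Drop the trailing '#comment'; section headers start with '#'.
        setting = line.split("#", 1)[0].strip()
        if "=" not in setting:
            continue
        key, value = setting.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def active_gene_sources(opts):
    """List the gene-model sources that are switched on in `opts`."""
    # Options that take a file or species name: active when non-empty.
    file_keys = ["snaphmm", "gmhmm", "augustus_species",
                 "fgenesh_par_file", "pred_gff", "model_gff"]
    # Options that are 0/1 switches: active when set to 1.
    flag_keys = ["est2genome", "protein2genome"]
    active = [k for k in file_keys if opts.get(k)]
    active += [k for k in flag_keys if opts.get(k) == "1"]
    return active


# A few lines copied from the control file above.
example = """\
#-----Gene Prediction
snaphmm=snap2/snap2.hmm
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
"""
opts = parse_maker_ctl(example)
print(active_gene_sources(opts))
```

For the excerpt above only `snaphmm` is reported, so in that run every gene model depends on the SNAP HMM being usable.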

From taosheng.x at gmail.com  Sun Apr 26 00:58:47 2020
From: taosheng.x at gmail.com (Xu, taosheng)
Date: Sun, 26 Apr 2020 14:58:47 +0800
Subject: [maker-devel] Problems with openMPI in multiple computing nodes
Message-ID: <CALJhmFr9Q741vwAZHHH9-pV-PAjfCPRKi-2B0kLx8r0HVHWYOA@mail.gmail.com>

Hello,
I am using a computer cluster with 20 nodes (40 CPUs per node) for
gene annotation. When I submit my MAKER job to a single node with 40 CPUs
using OpenMPI, everything works fine.
However, when I submit the same job to multiple nodes (120 CPUs), I get
the errors shown below.
I would appreciate any advice. Thank you.

Best regards,
Taosheng


STATUS: Processing and indexing input FASTA files...
cannot remove directory for home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: No such file or directory at /maker/bin/../lib/FastaDB.pm line 145.
cannot remove directory for /home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: Directory not empty at /maker/bin/../lib/FastaDB.pm line 145.
cannot remove directory for /home/20200425/genome.maker.output/mpi_blastdb/te_proteins%2Efasta.mpi.10//.dbtmp0: Directory not empty at /maker/bin/../lib/FastaDB.pm line 145.

From xvazquezc at gmail.com  Sun Apr 26 20:15:53 2020
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 27 Apr 2020 12:15:53 +1000
Subject: [maker-devel] Maker 0 genes after SNAP or with proteins.gff
In-Reply-To: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>
References: <84c5fc195df0fcc5e03484e65076fa9c@uni-duesseldorf.de>
Message-ID: <CAL0hg4GUdbQMxN1j5KBQ6JymQSzT_tSbE19fwvEAg6+3_GmXMw@mail.gmail.com>

Hi Ricardo,
it is likely that you are not providing enough evidence to train SNAP (or
even none at all). When you run maker2zff, the defaults may not give any
output if you don't have any ESTs at all. Check maker2zff -h for the
evidence filtering options used to create the model. In the worst case,
you'll need to run maker2zff -n, which doesn't filter on the evidence at
all. I also suggest searching the mailing list for this, as it has come up
many times.
Cheers,
Xabi
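
A quick way to see where the genes disappear is to count them at each step: in the MAKER GFF3 from the first round, and in the ZFF file that maker2zff writes for SNAP training. The following is only a rough sketch under stated assumptions: the function names are made up, MAKER gene lines are assumed to carry `maker` in the source column, and the ZFF layout is assumed to be `>sequence` header lines followed by feature lines whose fourth column is the model name.

```python
def count_maker_genes(gff3_text):
    """Count 'gene' features whose source column is 'maker'."""
    count = 0
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) >= 3 and cols[1] == "maker" and cols[2] == "gene":
            count += 1
    return count


def count_zff_models(zff_text):
    """Count distinct model names in ZFF text (4th column of feature lines)."""
    models = set()
    for line in zff_text.splitlines():
        if not line.strip() or line.startswith(">"):
            continue  # '>' lines name the sequence
        cols = line.split()
        if len(cols) >= 4:
            models.add(cols[3])
    return len(models)


# Tiny made-up inputs, for illustration only.
gff3_example = (
    "##gff-version 3\n"
    "contig1\tmaker\tgene\t201\t500\t.\t+\t.\tID=gene1\n"
    "contig1\tmaker\tmRNA\t201\t500\t.\t+\t.\tID=mrna1;Parent=gene1\n"
    "contig1\tprotein2genome\tprotein_match\t201\t500\t.\t+\t.\tID=hit1\n"
)
zff_example = (
    ">contig1\n"
    "Einit\t201\t300\tMODEL1\n"
    "Eterm\t401\t500\tMODEL1\n"
    "Esngl\t900\t1200\tMODEL2\n"
)
print(count_maker_genes(gff3_example), count_zff_models(zff_example))
```

If the first count is large but the second is zero, the training set is empty and the resulting HMM cannot be expected to predict anything.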

On Sat, 25 Apr 2020 at 02:46, Ricardo Nuno Ferreira Martins Guerreiro <
guerrer at uni-duesseldorf.de> wrote:

> Dear Makers list,
>
>
> I am struggling with Maker after many successful attempts. I don't
> understand why but my final .gff does not contain any genes, 0.
>
> I first run evidence-based modelling with proteins only. Here
> I get around 40 thousand genes if I give the proteins as a fasta to
> align (if I provide a protein.gff from a previous maker try, I get 0
> genes, same problem).
>
> Afterwards I create a SNAP hmm and run maker again, setting
> protein2genome=0 and snaphmm=snap.hmm as you say, but now I have 0
> genes. This happens whether I keep providing proteins as a fasta or as
> a .gff from a previous run.
>
> I have done this many times and it always worked. The only difference
> now is that I am using no ESTs whatsoever, only proteins. It's also
> strange that it works on the first round of maker but doesn't work on
> the SNAP rounds.
>
>
> Hope you can help,
> Ricardo
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From liorglic at mail.tau.ac.il  Thu Apr 30 06:58:17 2020
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Thu, 30 Apr 2020 15:58:17 +0300
Subject: [maker-devel] Missing genes in lift-over with est2genome
In-Reply-To: <373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>
References: <CAOzMDPzsV3F58ED0MOgMh0iJeXF+q4x=7riKjiRY2j6xx_7_bw@mail.gmail.com>
	<373413EA-9D4C-44CF-AA51-632C0F54B7AC@gmail.com>
Message-ID: <CAOzMDPyLSPa33x31R_2d+bKhDN2d6+aFK+mQn5C7xJd9Tq56yg@mail.gmail.com>

Thanks Carson - your answer was very helpful.
Another question related to the lift-over process, if I may.
I want to take the resulting gff and pass it on to another MAKER run, where
I provide further, lower confidence evidence (ESTs and proteins). I'm not
sure which option to use though. According to this helpful post
<https://computationalbiologysite.wordpress.com/2013/07/11/maker-gff-cite-online/>,
I tried using pred_gff and model_gff, but both produced fusion genes
where genes lie very close to one another (see attached picture),
even with the correct_est_fusion parameter enabled. It looks like the only
way to take lifted-over genes "as-is" would be to use other_gff, but I
gather that this option was not really intended for genes. Would you
recommend this usage? Am I missing something?
Thank you!

On Thu, 23 Apr 2020 at 20:43, Carson Holt <carsonhh at gmail.com> wrote:

> There are percent cutoffs for the est2genome algorithm you can set in the
> maker_bopts.ctl file. Additionally, maker will give the alignment but not
> produce a gene model if it can't translate through the est2genome alignment
> (i.e. stop codons in the assembly). I believe the cutoff is 50%. If you add
> est_forward=1 to the maker_opts.ctl file names will be copied from the
> alignment source and the score in the GFF3 column will be the percent match
> to the original transcript.
>
> --Carson
>
>
>
> > On Apr 21, 2020, at 7:08 AM, Lior Glick <liorglic at mail.tau.ac.il> wrote:
> >
> > Hello,
> > I am using MAKER to annotate a plant genome assembly. A high-quality
> > reference genome and annotation exists for another variety of the same
> > species, so my first step is lifting over reference genes to my genome. I
> > do this by setting est2genome = 1 and providing MAKER with the reference
> > cDNA (transcriptome). No other evidence is provided and no prediction is
> > performed. Repeat masking is done using the reference repeats library.
> > When checking the results, I found that many reference genes are missing
> > from the lift-over result. However, if I blast the sequences of these
> > genes myself, I get good matches. I even see these matches when I look at
> > the blast results buried in the MAKER data_store.
> > For example, a transcript of length 1077 got a match of length 855 with
> > 100% identity and no gaps. Bitscore was 1709 and E-value 0. This looks
> > like a pretty good match, but it is not found in the final MAKER results
> > (gff/fasta).
> > Why is this happening? Are there some cutoffs that are not satisfied? If
> > so, what are they and how can they be configured?
> >
> > Thanks,
> > Lior
>
>
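
To make the cutoff discussion above concrete, here is the arithmetic for the example hit from the quoted message (a 1077 bp transcript with an 855 bp match at 100% identity). This is a sketch only: the function names are made up, and the thresholds passed in are placeholders rather than values taken from this thread, so substitute the settings from your own maker_bopts.ctl.

```python
def alignment_coverage(match_len, query_len):
    """Fraction of the query transcript covered by the alignment."""
    return match_len / query_len


def passes_cutoffs(match_len, query_len, identity, min_cov, min_id):
    """True if both the coverage and identity fractions meet the cutoffs."""
    return (alignment_coverage(match_len, query_len) >= min_cov
            and identity >= min_id)


cov = alignment_coverage(855, 1077)
print(round(cov, 3))

# Placeholder thresholds (assumptions, not confirmed MAKER settings).
print(passes_cutoffs(855, 1077, identity=1.0, min_cov=0.8, min_id=0.85))
print(passes_cutoffs(855, 1077, identity=1.0, min_cov=0.5, min_id=0.85))
```

The hit covers a little under 80% of the transcript, so whether it survives depends entirely on where the coverage cutoff sits: it fails a 0.8 cutoff and passes a 0.5 one, even at perfect identity.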
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fusion.png
Type: image/png
Size: 33185 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20200430/a53d513e/attachment-0004.png>