From vsoza at uw.edu  Fri Jun  1 14:36:10 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Fri, 1 Jun 2018 12:36:10 -0700
Subject: [maker-devel] how to input a masked assembly for annotation into
 Maker
Message-ID: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>

Hi Maker community

I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.

Now, I want to use only the scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosomal sequence for each chromosome, stitching together all the ordered scaffolds along that chromosome with each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosomal sequences. I am trying to see if I can redo the Maker annotations on these pseudochromosomal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for the ordered scaffolds of each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but I am getting fewer genes predicted for these scaffolds than I got from my entire assembly (Annotation A) above.

For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting fewer genes predicted for the same set of scaffolds: 20,830 genes (18,465 from the default build + 2,365 genes that were rescued with Pfam domains).

I am wondering if I have a setting wrong for my maker_opts.ctl files for the default and standard build runs on the masked sequences, see attached below, particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in JBrowse and compared Annotation A to Annotation B. They looked similar, except that some transcripts from my altest= file were present in Annotation A but did not show up in Annotation B; in other words, Maker predicted some genes in Annotation A that it did not predict in Annotation B. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?

Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosomal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. Thanks a bunch.


Annotation A default build steps:

$ maker -base Rwill10 -fix_nucleotides
$ maker -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11983   11983  312159
#should be 11985

$ maker -dsindex -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10_master_datastore_index.log

$ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
21960

$ fasta_merge -d Rwill10_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.default.log
Type: application/octet-stream
Size: 4650 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment.obj>
-------------- next part --------------


Annotation A standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta

#genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
#IDs in .tsv file are called "processed-gene" from .fasta file, 
#but in .gff file, I think these are called "abinit-gene"
#best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
$ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

#extract list of IDs only to grep for
cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
  1599   1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
  
#used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff

$ maker -base Rwill10standard2 -fix_nucleotides
$ maker -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11975   11975  311953
#should be 11985

$ maker -dsindex -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10standard2.all.gff
23559

$ fasta_merge -d Rwill10standard2_master_datastore_index.log
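
The create_pred_gff.sh script used in the steps above was not posted. As a minimal sketch of what such a step could look like, assuming the ID list holds one ID per line and the goal is simply to keep every GFF line that mentions one of those IDs (the function name make_pred_gff is hypothetical and this is not necessarily what the original script did):

```shell
# Hypothetical helper; the actual create_pred_gff.sh was not posted.
# $1: file with one ID per line (e.g. the *.PfamA.uniqIDs list)
# $2: merged GFF3 file (e.g. Rwill10.all.gff)
make_pred_gff() {
  # -F: treat IDs as fixed strings, not regular expressions
  # -w: whole-word match, so "gene-0.1" does not also select "gene-0.10"
  # -f: read the search patterns from the ID file
  grep -F -w -f "$1" "$2"
}

# Example call (file names taken from the steps above):
# make_pred_gff Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs \
#   Rwill10.all.gff > Rwill10.PfamA.abinit.gff
```

The whole-word flag matters here because MAKER-style IDs end in numeric suffixes, and a plain substring match on one ID can silently pick up a different gene whose ID merely starts with the same characters.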

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.standard.log
Type: application/octet-stream
Size: 4529 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0001.obj>
-------------- next part --------------


Annotation B default build steps:

$ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta

#Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
$ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks 

#use script to extract ordered scaffolds for each chromosome
$ ./extract_scaffolds_synteny.sh

#use script to create pseudochromosomal sequence for each chromosome
$ ./create_pseudo_chromosome_allLGs.sh

#concatenate these into one fasta file
cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta

$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ maker -base Rwill10.pseudochromos -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
18465

$ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.default.log
Type: application/octet-stream
Size: 4604 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0002.obj>
-------------- next part --------------


Annotation B standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta

$ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

$ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
  2365   2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff

$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
20830

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.standard.log
Type: application/octet-stream
Size: 4558 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0003.obj>
-------------- next part --------------


-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From carsonhh at gmail.com  Fri Jun  1 17:01:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 Jun 2018 16:01:13 -0600
Subject: [maker-devel] Building MAKER with specific perl version
In-Reply-To: <CAOeNZNuUnnu_gGtO-yHHrM6wzeEzqihH83gLMCrty20mftt0Wg@mail.gmail.com>
References: <CAOeNZNuUnnu_gGtO-yHHrM6wzeEzqihH83gLMCrty20mftt0Wg@mail.gmail.com>
Message-ID: <78BA2892-5E3A-4DE5-AA44-18A9BCEF8071@gmail.com>

You probably need to run './Build realclean' to clean up your previous build and settings that you likely ran with the system perl. You may even need to clean out the .../maker/perl directory. You can also try updating your Module::Build installation.
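
As a rough sketch of how one might verify the result after such a clean rebuild, the helper below lists scripts whose shebang line does not point at the intended perl. The function name wrong_shebangs, the MYPERL variable and the example paths are assumptions for illustration only; the Build commands in the trailing comments are the ones already mentioned in this thread.

```shell
# List every file in a directory whose shebang is not the expected interpreter.
# $1: directory to check (e.g. the MAKER bin directory)
# $2: expected interpreter path (placeholder example: /opt/myperl/bin/perl)
wrong_shebangs() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    first=$(head -n 1 "$f")
    case "$first" in
      "#!$2"*|"#! $2"*) ;;                       # already uses the wanted perl
      "#!"*) printf '%s\t%s\n' "$f" "$first" ;;  # some other interpreter
    esac
  done
}

# Possible sequence, with MYPERL standing in for your own perl binary:
#   ./Build realclean
#   "$MYPERL" Build.PL
#   ./Build install
#   wrong_shebangs path/to/maker/bin "$MYPERL"
```

An empty result from the last command would suggest that the installed scripts picked up the intended interpreter.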

--Carson


> On May 31, 2018, at 8:30 AM, Ksenia Lavrichenko <ksenia.lavrichenko at gmail.com> wrote:
> 
> Hi, 
> 
> I have been banging my head for a while now, trying to install MAKER with my specific perl. 
> 
> I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/hScqdJW0FsU/3KT_UF7k9XMJ
> 
> However, this does not work for me. I make sure bin/* and Build are deleted before I run $myperl Build.PL.
> 
> I see my perl in the shebang of Build; however, after ./Build install, all scripts in bin have "#! /usr/bin/perl", which produces a version error when I try to run maker -h.
> 
> Any tips on what I need to adjust in Build.PL?
> 
> Many thanks,
> Ksenia
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Mon Jun 11 11:46:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 10:46:13 -0600
Subject: [maker-devel] how to input a masked assembly for annotation
 into Maker
In-Reply-To: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
References: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
Message-ID: <9987ECCD-E7B6-45CD-9CBB-0B13B7608E4C@gmail.com>

Predictors will call partial models near the edge of contigs, but if you merge them with 100 N gaps, they will be more likely to jump the gap (attempt to merge models on each side while putting the gap in the intron, even if that is not correct), or they will just call nothing around the gap (i.e. they will behave differently around gaps than at the edge of contigs).
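
One way to check whether the missing models cluster around the joins is to work out where each scaffold lands inside its pseudochromosome, so that genes from either annotation can be compared against the join positions. The sketch below assumes a tab-separated file of scaffold ID and length, listed in chromosome order, and the 100 N spacer described in the original message; the function name scaffold_offsets and the input file layout are hypothetical.

```shell
# Print "scaffold<TAB>start<TAB>end" (1-based, inclusive) for each scaffold
# as placed in a pseudochromosome built with fixed-length N spacers.
# $1: TSV of "scaffold_id<TAB>length" in chromosome order (assumed layout)
# $2: spacer length in bases (defaults to 100, as in the original message)
scaffold_offsets() {
  awk -F'\t' -v gap="${2:-100}" '
    {
      start = pos + 1          # first base of this scaffold
      end   = pos + $2         # last base of this scaffold
      printf "%s\t%d\t%d\n", $1, start, end
      pos = end + gap          # skip over the N spacer that follows
    }' "$1"
}
```

Each spacer then occupies the bases from end+1 to end+gap of the preceding scaffold, so any Annotation B model whose coordinates overlap such an interval spans a join.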

--Carson




From flopezo84 at gmail.com  Sat Jun  9 15:06:48 2018
From: flopezo84 at gmail.com (=?UTF-8?Q?Federico_L=C3=B3pez?=)
Date: Sat, 9 Jun 2018 16:06:48 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
Message-ID: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>

Hello,

I'm using MAKER's "quality_filter.pl" with the default option (AED<1).
However, I have noticed cases in which models have low AED scores and high
eAED scores (1.00), so presumably the good AED scores are the result of
spurious evidence alignments. Is there a way to filter models based on eAED
scores too?
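
Independently of what quality_filter.pl itself supports, the _eAED value is written into the attribute column of each mRNA line in MAKER's GFF3 output, so it can be read directly. A minimal sketch, assuming ID= is the first attribute on the mRNA line; the function name filter_by_eaed and the 0.5 default cutoff are arbitrary choices for illustration:

```shell
# Print "mRNA_ID<TAB>eAED" for every mRNA whose eAED is at or below a cutoff.
# $1: MAKER GFF3 file
# $2: maximum eAED to keep (defaults to 0.5, an arbitrary example value)
filter_by_eaed() {
  awk -F'\t' -v max="${2:-0.5}" '
    $3 == "mRNA" && match($9, /_eAED=[0-9.]+/) {
      # "_eAED=" is 6 characters long; the number follows it
      eaed = substr($9, RSTART + 6, RLENGTH - 6) + 0
      if (eaed <= max) {
        id = $9
        sub(/^ID=/, "", id)    # assumes ID= is the first attribute
        sub(/;.*/, "", id)     # drop everything after the ID value
        print id "\t" eaed
      }
    }' "$1"
}
```

The resulting ID list could then be used to subset the GFF3 and FASTA files.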

Thank you.

From kissaj at miamioh.edu  Mon Jun 11 12:56:46 2018
From: kissaj at miamioh.edu (Andor J Kiss)
Date: Mon, 11 Jun 2018 13:56:46 -0400
Subject: [maker-devel] largest genome annotated?
Message-ID: <1528739806.4677.97.camel@miamioh.edu>

What's the largest genome that's been annotated with Maker2?

Thanks,

-- 
________________________________________________________________________________________________________________________
Andor J Kiss, PhD
Director - Center for Bioinformatics & Functional Genomics
086 Pearson Hall - Miami University
700 East High Street, Oxford
Ohio 45056
USA

eMAIL: KissAJ at MiamiOH.edu
Telephone: +1 (513) 529-4280
Fax: +1 (513) 529-2431
Ring ID: andorjkiss

URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss

From carsonhh at gmail.com  Mon Jun 11 13:05:07 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:05:07 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <1528739806.4677.97.camel@miamioh.edu>
References: <1528739806.4677.97.camel@miamioh.edu>
Message-ID: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>

The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER.

--Carson



From carsonhh at gmail.com  Mon Jun 11 13:13:28 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:13:28 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
References: <1528739806.4677.97.camel@miamioh.edu>
	<34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
Message-ID: <45218C8A-70A4-4146-9962-EA0E8F506265@gmail.com>

Correction. The recently published axolotl just barely edged out the Sugar Pine in size a couple of months ago. They did not really annotate it though, instead they independently assembled the transcriptome from mRNA-seq and aligned those where they could.

--Carson



From jennifer.anderson at ebc.uu.se  Tue Jun 12 10:59:31 2018
From: jennifer.anderson at ebc.uu.se (Jennifer Anderson)
Date: Tue, 12 Jun 2018 17:59:31 +0200
Subject: [maker-devel] Merge warning = 1
Message-ID: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>

Hello,

I am working on the ab initio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my gff file following my first full attempt (hmm files from snap trained 2x, and GenemarkES).

I have not found any information on what "merge_warning=1" means at the end of the ID line, as in the example below. Out of 6834 ID lines that include AED scores in this gff file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction.


000030F|arrow maker gene 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43
000030F|arrow maker mRNA 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1
000030F|arrow maker exon 9838 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker exon 9255 9762 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9838 9992 . - 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9255 9762 . - 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1

Best,

Jenni


E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/om-uu/dataskydd-personuppgifter/

From carsonhh at gmail.com  Tue Jun 12 11:03:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 10:03:37 -0600
Subject: [maker-devel] Merge warning = 1
In-Reply-To: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
References: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
Message-ID: <D2F6D9CE-78B7-46B8-A9EC-2AC13E903655@gmail.com>

It's an internal debugging-related value used in the beta. It has no meaning for the general user and will eventually disappear.

--Carson


> On Jun 12, 2018, at 9:59 AM, Jennifer Anderson <jennifer.anderson at ebc.uu.se> wrote:
> 
> Hello,
> 
> I am working on the ab initiio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my gff file following my first full attempt (hmm files from snap trained 2x, and GenemarkES).
> 
> I have not found information on what it means ?merge_warning=1? at the end of the ID line, as in the example below.  Out of 6834 ID lines that include AED scores in this gff file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction.
> 
> 
> 000030F|arrow	maker	gene	9255	9992	.	-	.	ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43
> 000030F|arrow	maker	mRNA	9255	9992	.	-	.	ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1
> 000030F|arrow	maker	exon	9838	9992	.	-	.	ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 000030F|arrow	maker	exon	9255	9762	.	-	.	ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 000030F|arrow	maker	CDS	9838	9992	.	-	0	ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 000030F|arrow	maker	CDS	9255	9762	.	-	1	ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 
> Best,
> 
> Jenni
> 

From steinj at cshl.edu  Tue Jun 12 13:08:19 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 18:08:19 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
Message-ID: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>

Dear Carson and maker-devel group,

In our recent MAKER run, some of the transcript and protein id's in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF.  The problem is that these names are not unique, so for example the transcript ID "mRNA_4" occurs 45 times, thus making it difficult to determine their corresponding gene ids. My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.

How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein fasta id's)?
Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique.

Thanks,
Josh


Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From carsonhh at gmail.com  Tue Jun 12 15:19:19 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 14:19:19 -0600
Subject: [maker-devel] Transcript & protein fasta sequence id/name
 collisions
In-Reply-To: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
Message-ID: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>

The ID= tag in GFF3 is not meant to be an identifier (despite being called ID). Instead it's a tag used to determine inheritance when reassembling a feature. It will be treated as such by all GMOD tools, while Name will be treated as the user-facing identifier. To further complicate things, 'Name' is not required to be unique (some groups like to name multi-copy genes the same and then have multiple locations in the GFF3; you will see this all the time in NCBI-derived GFF3, for example). I would not be surprised if fgenesh does not produce "best-practice" GFF3. You may need to slightly alter it before using it.

On the MAKER end, at the very least it would be nice for MAKER to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff then MAKER should build its own unique names for things, but for model_gff it will keep the name you give it.

--Carson
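
As a way to check an input GFF3 before handing it to MAKER, the sketch below reports attribute values that repeat across gene or mRNA features. It is only an illustration: the function name is hypothetical, and it looks at gene/mRNA lines only because multi-line features such as CDS may legitimately share an ID in GFF3.

```python
from collections import Counter


def duplicated_values(gff_lines, key="Name", types=("gene", "mRNA")):
    """Return {value: count} for attribute `key` values seen more than once."""
    counts = Counter()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in types:
            continue
        for field in cols[8].split(";"):
            if field.startswith(key + "="):
                counts[field.split("=", 1)[1]] += 1
    return {value: n for value, n in counts.items() if n > 1}
```

Running it once with `key="Name"` and once with `key="ID"` shows which of the two fields needs fixing in the prediction file.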



From steinj at cshl.edu  Tue Jun 12 15:31:13 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 20:31:13 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name
 collisions
In-Reply-To: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
	<91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
Message-ID: <BE0C9812-CCE7-431D-89DB-6CAA60AD937F@cshl.edu>

Hi Carson,
Thanks for identifying the problem.  I see that the input fgenesh GFF3 did not use unique identifiers, so we will attack the problem there.

Best,
Josh

Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From carsonhh at gmail.com  Wed Jun 13 12:46:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 11:46:12 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
Message-ID: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>

The eAED score also takes protein reading frame into account, and it can infer support for exons when both flanking introns are validated (i.e. it can be lower than AED in some cases). In your case, an eAED of 1 with an AED less than 1 means that your evidence support is from an overlapping protein, but it is never in the same reading frame as the gene model. So the positive evidence support may be suspect, or it may be real and the model is poor because of the assembly, gaps, etc. To use eAED instead in the quality_filter.pl script, you would have to manually edit the script and replace '_AED' with '_eAED'. Using eAED instead will greatly drop sensitivity on lower-quality assemblies (places where the predictors make the best model they can, and not the correct model, because the assembly won't allow for the correct model even though there is evidence of a gene locus). So make sure to always view suspect regions in a browser first.

--Carson
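
Before editing quality_filter.pl, it may help to list the discordant models so they can be inspected in a browser. The following Python sketch is not part of MAKER; it assumes mRNA lines carry `_AED` and `_eAED` attributes as in MAKER's GFF3 output, and the function name is made up for illustration.

```python
def frame_discordant(gff_lines, eaed_cutoff=1.0):
    """List (ID, AED, eAED) for mRNAs with AED < 1 but eAED >= eaed_cutoff."""
    hits = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "_AED" not in attrs or "_eAED" not in attrs:
            continue
        aed, eaed = float(attrs["_AED"]), float(attrs["_eAED"])
        if aed < 1.0 and eaed >= eaed_cutoff:
            hits.append((attrs.get("ID", "?"), aed, eaed))
    return hits
```

Lowering `eaed_cutoff` widens the list to models whose frame-aware support is merely weak rather than absent.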


> On Jun 9, 2018, at 2:06 PM, Federico López <flopezo84 at gmail.com> wrote:
> 
> Hello,
> 
> I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?
> 
> Thank you.

From ss2489 at cornell.edu  Wed Jun 13 14:34:27 2018
From: ss2489 at cornell.edu (Surya Saha)
Date: Wed, 13 Jun 2018 15:34:27 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
	<3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
Message-ID: <CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>

Hi Carson,

We have been using AED as a primary metric for evaluating predictions in
our group but it sounds like we should be using both eAED and AED. Is there
a detailed explanation of how exactly eAED and AED are computed besides
Table 2 in the Cantarel 2008 paper? Thanks

-Surya


-- 

Surya Saha
Sol Genomics Network
Boyce Thompson Institute, Ithaca, NY, USA
https://citrusgreening.org/
http://www.linkedin.com/in/suryasaha
https://twitter.com/SahaSurya

From carsonhh at gmail.com  Wed Jun 13 14:57:46 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 13:57:46 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
	<3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
	<CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>
Message-ID: <C4B3ED69-3D9E-421E-8447-90E63695FE68@gmail.com>

AED is documented in the 2011 MAKER2 paper, but eAED (extended AED) is not currently documented in a publication and is not used by any of the scripts that come with MAKER (it's just there for reference right now). Basically, AED is calculated from evidence overlap, but eAED will not count protein overlap unless it occurs in the same codon reading frame as the model (so evidence may count for a stretch, then stop counting for a few codons, then count again if there is an insertion in the alignment). Also, eAED will infer support for exons if both flanking introns are validated by evidence and the region in between is all ORF (this allows joint intron support to infer support for an internal exon). 99% of the time AED and eAED are the same, but eAED can be useful in identifying edge cases. Much of the time, if AED and eAED are very different, it's because there is a single base pair insertion or deletion in the assembly. The predictors still find the locus as best they can, but protein evidence and alignments will be out of sync with the reading frame on one of the exons. BLAST can't really handle single-bp INDELs in its alignments, but Exonerate can do mid-alignment reading frame shifts to capture the assembly INDEL (and eAED is an attempt to use the extra Exonerate info in the score).

--Carson
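
To make the overlap-based part concrete, here is a toy Python calculation of plain AED, taken as 1 minus the mean of base-level sensitivity and specificity between a model and its evidence. It is a sketch on simple intervals, not MAKER's code, and it deliberately leaves out the reading-frame and intron-inference rules of eAED, which are only described informally above.

```python
def bases(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def aed(model_exons, evidence_intervals):
    """Toy AED: 0.0 means perfect agreement, 1.0 means no shared bases."""
    model, evidence = bases(model_exons), bases(evidence_intervals)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # fraction of evidence bases in the model
    specificity = shared / len(model)     # fraction of model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0
```

A model that half-overlaps evidence of the same length scores 0.5 under this definition; eAED would differ only where the overlapping protein evidence is out of frame or where intron support covers an otherwise unsupported exon.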



From gdolby at asu.edu  Fri Jun 15 11:29:16 2018
From: gdolby at asu.edu (Greer Dolby)
Date: Fri, 15 Jun 2018 09:29:16 -0700
Subject: [maker-devel] Best annotation set failure (auto_annotator.pm line
 1774)
Message-ID: <7A8554F9-3456-4D53-B5A8-80FA794F42EE@asu.edu>

Hello,

I'm de novo annotating a second-generation genome and I have a handful of scaffolds that keep failing while choosing the best annotation set (errors below). To my knowledge there is nothing wrong with the scaffolds themselves, and the others have run fine. Looking at the VOID folder for the failed contigs for different runs, they appear to fail at different chunks but always with these two errors. I'm running 30 cores with MPICH2 on a private server. Any suggestions? Thanks!

Best,
Greer

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 1
 ...processing 8 of 12
total clusters:44 now processing 0
 ...processing 0 of 3
 ...processing 1 of 3
 ...processing 2 of 3
total clusters:44 now processing 0
 ...processing 0 of 4
 ...processing 1 of 4
 ...processing 9 of 12
 ...processing 2 of 4
 ...processing 3 of 4
total clusters:44 now processing 0
 ...processing 10 of 12
 ...processing 0 of 67
 ...processing 1 of 67
ERROR: Chunk failed at level:6, tier_type:0
 ...processing 2 of 67
FAILED CONTIG:ScCC6lQ_38465;HRSCAF=64658

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 2
 ...processing 9 of 298
 ...processing 8 of 81
 ...processing 11 of 202
 ...processing 13 of 20
 ...processing 10 of 298
 ...processing 9 of 81
 ...processing 10 of 81
 ...processing 18 of 123
 ...processing 14 of 20
 ...processing 17 of 54
 ...processing 18 of 54
 ...processing 37 of 164
 ...processing 20 of 254
Died at /home/jcornel3/tools/maker/bin/../lib/maker/auto_annotator.pm line 1774.
--> rank=17, hostname=omega
ERROR: Failed while choosing best annotation set
ERROR: Chunk failed at level:4, tier_type:4
FAILED CONTIG:ScCC6lQ_16796;HRSCAF=38896
_________________________________
Greer Dolby, PhD
Postdoctoral Research Scholar
SoLS, Arizona State U.
office: LSE 313, 480.965.7456
website <http://www.greerdolby.org/> | twitter <https://twitter.com/gadolby>
Kusumi Lab <http://kusumi.lab.asu.edu/>

From kapeelc at gmail.com  Fri Jun 22 14:41:58 2018
From: kapeelc at gmail.com (Kapeel Chougule)
Date: Fri, 22 Jun 2018 15:41:58 -0400
Subject: [maker-devel] map_forward=1 not mapping reference ID's to output
 correctly
Message-ID: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>

Hi,

I am trying to update a community annotation
<https://de.cyverse.org/dl/d/39D60E88-078D-4CF5-9F3A-D712B714CDD8/community.annotation.gff3>
in the light of new evidence data, but my MAKER runs are not keeping all the
genes from the community annotation.

Community annotation feature count:
      2 1 bicolor
 239969 CDS
 266301 exon
  51066 five_prime_UTR
  34129 gene
  47121 mRNA
  53708 three_prime_UTR

MAKER gene count:
awk '$3=="gene"{print}' maker_output.all.gff | grep "Sobic*" | wc -l
21105

In the attached maker_opts.ctl file, I set keep_preds=1 and map_forward=1,
which should keep all the community gene models even if they don't have
evidence support. This was explained here:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Updating_annotations_in_light_of_new_data
So I am not sure why not all the community gene models are mapped in the
MAKER output.
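
One way to narrow this down is to list exactly which community genes are absent from the MAKER output, instead of comparing totals. The Python sketch below is illustrative only: it assumes both files are nine-column GFF3 and that carried-forward genes keep their original Name (falling back to ID), and the function names are hypothetical.

```python
def gene_labels(gff_lines):
    """Collect the Name (or, failing that, the ID) of every gene feature."""
    labels = set()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        label = attrs.get("Name") or attrs.get("ID")
        if label:
            labels.add(label)
    return labels


def missing_genes(community_lines, maker_lines):
    """Community gene labels that do not appear in the MAKER GFF3."""
    return sorted(gene_labels(community_lines) - gene_labels(maker_lines))
```

Inspecting a few of the missing loci in a browser may show whether they were merged, replaced by a new model, or dropped.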

Thanks

Kapeel
--


Kapeel Chougule
Computational Scientist Developer II
One Bungtown Road Cold Spring Harbor, NY 11724
http://www.warelab.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4990 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180622/15c366fc/attachment.obj>

From monica.poelchau at ars.usda.gov  Fri Jun 22 15:04:28 2018
From: monica.poelchau at ars.usda.gov (Poelchau, Monica)
Date: Fri, 22 Jun 2018 20:04:28 +0000
Subject: [maker-devel] [CAUTION: Suspicious Link] map_forward=1 not
 mapping reference ID's to output correctly
In-Reply-To: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>
References: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>
Message-ID: <D5A4E18F-CFDC-489E-BA1B-FB88FA66C338@ars.usda.gov>

Hi Kapeel,

If you just want your community annotations to replace models in an existing gene set, we have a tool for this:

https://github.com/NAL-i5K/GFF3toolkit

You'd need to run gff3_QC on your annotation files first to make sure your annotations are okay, then use gff3_merge to merge your community annotations with your existing gene set (in gff3 format). If you end up trying this out: we're actively developing the GFF3toolkit, so feel free to post an issue if you notice any problems.

Hth,

Monica

This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.

From andremmachado25 at gmail.com  Tue Jun 26 10:36:24 2018
From: andremmachado25 at gmail.com (André Machado)
Date: Tue, 26 Jun 2018 16:36:24 +0100
Subject: [maker-devel] Maker Error : Thread 1 terminated abnormally..
Message-ID: <CAAXcPKC3mnkqP9OU7L9bBLtts4KujCoBrUNieuUfgo+wd-E4Yw@mail.gmail.com>

Hi,


First of all, thanks for your efforts on the Maker pipeline. It is a
tremendous help for people who work with genomes.

For the last 4 days I have been struggling with an error, still without a
solution.

I found this old thread:
https://groups.google.com/forum/#!msg/maker-devel/X2-76BH9gvg/rU4kLJ3B6tsJ

It seems to be quite similar, but it doesn't point to a specific solution.

I ran Maker with the test data and everything ran fine. Maker finished the
entire process without errors.

Now I'm trying to apply it to my own data on an MPI cluster, but this error
occurs frequently.

Thread 1 terminated abnormally:
../dna.maker.output/mpi_blastdb/dna%2Efa.mpi.1/dna%2Efa.mpi.1.0

--> rank=8, hostname=compute-0-1.local, at ../Analysis/Geno/maker/bin/maker
line 1451 thread 1.

--> rank=8, hostname=compute-0-1.local

deleted:0 hits

deleted:0 hits

preparing ab-inits

deleted:0 hits

deleted:0 hits

FATAL: Thread terminated, causing all processes to fail

--> rank=8, hostname=compute-0-1.local

deleted:0 hits


Basically I'm trying to run Maker with dna.fa, rna.fa, prot.fa and
my_custom_lib_of_repeats.fa, to produce raw gene models which will be used
to train SNAP.


I have already tried several command lines and all gave me the same error.
The only change between tests was the location of the error: sometimes it
happened on compute-0-1.local, other times on compute-0-4.local or on
another node.

mpiexec -n 63 --hostfile Host maker 1>1.log 2>2.err

mpiexec --hostfile Host maker 1>1.log 2>2.err

mpiexec -mca btl ^openib -n 63 --hostfile Host maker 1>1.log 2>2.err

nohup mpiexec -mca btl ^openib -n 63 --hostfile Host maker -a 1>1.log 2>2.err


The log file and the option files are attached below.


Many thanks in advance,


André
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2.log
Type: text/x-log
Size: 38654 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment.log>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_exe.ctl
Type: application/octet-stream
Size: 1223 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4547 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1412 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0002.obj>

From vsoza at uw.edu  Fri Jun  1 13:36:10 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Fri, 1 Jun 2018 12:36:10 -0700
Subject: [maker-devel] how to input a masked assembly for annotation into
 Maker
Message-ID: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>

Hi Maker community

I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.

Now, I want to use only scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosomal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosomal sequences. I am trying to see if I can re-do the annotations by Maker on these pseudochromosomal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting fewer genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.

For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting fewer genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains).

I am wondering if I have a setting wrong in my maker_opts.ctl files for the default and standard build runs on the masked sequences (see attached below), particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in JBrowse and compared Annotation A to Annotation B. They looked similar, except that some transcripts from my altest= file were present in Annotation A but did not show up in Annotation B; therefore, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?

Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosomal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. Thanks a bunch.
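
One possible alternative to re-annotating, sketched here in Python under stated assumptions, is to shift the Annotation A coordinates onto the stitched sequences. The sketch assumes every scaffold was placed in forward orientation with exactly 100 Ns between neighbours; scaffolds that were reverse-complemented would need their coordinates and strands flipped as well. All names and lengths are hypothetical.

```python
GAP = 100  # spacer of Ns between consecutive scaffolds, as described above


def scaffold_offsets(ordered_scaffolds, gap=GAP):
    """Map scaffold name -> number of bases preceding it in the stitched sequence.

    ordered_scaffolds is a list of (name, length) pairs in chromosome order.
    """
    offsets, position = {}, 0
    for name, length in ordered_scaffolds:
        offsets[name] = position
        position += length + gap
    return offsets


def lift_gff_line(line, chrom, offsets):
    """Rewrite one GFF3 line into pseudochromosome coordinates, or return None."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[0] not in offsets:
        return None  # comment line, or a scaffold not placed on this chromosome
    shift = offsets[cols[0]]
    cols[0] = chrom
    cols[3] = str(int(cols[3]) + shift)
    cols[4] = str(int(cols[4]) + shift)
    return "\t".join(cols)
```

Because IDs and Parent links are left untouched, the lifted GFF3 keeps the Annotation A gene models as they were; only the sequence name and coordinates change.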


Annotation A default build steps:

$ maker -base Rwill10 -fix_nucleotides
$ maker -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11983   11983  312159
#should be 11985

$ maker -dsindex -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10_master_datastore_index.log

$ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
21960

$ fasta_merge -d Rwill10_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.default.log
Type: application/octet-stream
Size: 4650 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0004.obj>
-------------- next part --------------


Annotation A standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta

#genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
#IDs in .tsv file are called "processed-gene" from .fasta file, 
#but in .gff file, I think these are called "abinit-gene"
#best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
$ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

#extract list of IDs only to grep for
$ cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
  1599   1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
  
#used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff

$ maker -base Rwill10standard2 -fix_nucleotides
$ maker -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11975   11975  311953
#should be 11985

$ maker -dsindex -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10standard2.all.gff
23559

$ fasta_merge -d Rwill10standard2_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.standard.log
Type: application/octet-stream
Size: 4529 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0005.obj>
-------------- next part --------------


Annotation B default build steps:

$ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta

#Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
$ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks 

#use script to extract ordered scaffolds for each chromosome
$ ./extract_scaffolds_synteny.sh

#use script to create pseudochromosomal sequence for each chromosome
$ ./create_pseudo_chromosome_allLGs.sh

#concatenate these into one fasta file
$ cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta

$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ maker -base Rwill10.pseudochromos -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
18465

$ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.default.log
Type: application/octet-stream
Size: 4604 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0006.obj>
-------------- next part --------------


Annotation B standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta

$ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

$ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
  2365   2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff

$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
20830

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.standard.log
Type: application/octet-stream
Size: 4558 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0007.obj>
-------------- next part --------------


-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From carsonhh at gmail.com  Fri Jun  1 16:01:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 Jun 2018 16:01:13 -0600
Subject: [maker-devel] Building MAKER with specific perl version
In-Reply-To: <CAOeNZNuUnnu_gGtO-yHHrM6wzeEzqihH83gLMCrty20mftt0Wg@mail.gmail.com>
References: <CAOeNZNuUnnu_gGtO-yHHrM6wzeEzqihH83gLMCrty20mftt0Wg@mail.gmail.com>
Message-ID: <78BA2892-5E3A-4DE5-AA44-18A9BCEF8071@gmail.com>

You probably need to run './Build realclean' to clean up your previous build and settings that you likely ran with the system perl. You may even need to clean out the .../maker/perl directory. You can also try updating your Module::Build installation.

-Carson


> On May 31, 2018, at 8:30 AM, Ksenia Lavrichenko <ksenia.lavrichenko at gmail.com> wrote:
> 
> Hi, 
> 
> I have been banging my head for a while now, trying to install MAKER with my specific perl. 
> 
> I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/hScqdJW0FsU/3KT_UF7k9XMJ <https://groups.google.com/forum/#!msg/maker-devel/hScqdJW0FsU/3KT_UF7k9XMJ>
> 
> However, this does not work for me. I make sure bin/* and Build are deleted before I run $myperl Build.PL.
> 
> I see my perl in shebang of Build however after ./Build install all scripts in bin have "#! /usr/bin/perl" which produces a version error when I try to run maker -h. 
> 
> Any tips of what do I need to adjust in Build.PL?
> 
> Many thanks,
> Ksenia
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Mon Jun 11 10:46:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 10:46:13 -0600
Subject: [maker-devel] how to input a masked assembly for annotation
 into Maker
In-Reply-To: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
References: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
Message-ID: <9987ECCD-E7B6-45CD-9CBB-0B13B7608E4C@gmail.com>

Predictors will call partial models near the edge of contigs, but if you merge them with 100 N gaps, they will be more likely to jump the gap (attempt to merge models on each side while putting the gap in the intron, even if that is not correct), or they will just call nothing around the gap (i.e. they will behave differently around gaps than at the edge of contigs).

-Carson
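
On the original question of getting annotations for pseudochromosomes built after the whole-assembly run: one option that avoids predicting across the N gaps at all is to shift the existing Annotation A coordinates onto the stitched sequences. This is not something proposed in the thread, only a minimal sketch under stated assumptions: every scaffold is placed in forward orientation, the scaffold names and lengths are known, and the gap is the 100 Ns described above. The function names, scaffold names and the example feature are hypothetical.

```python
GAP = 100  # number of Ns between scaffolds, as described in the question


def scaffold_offsets(ordered_scaffolds, gap=GAP):
    """Map scaffold name -> offset to add to its 1-based GFF coordinates.

    ordered_scaffolds: list of (name, length) tuples in chromosome order.
    Assumption: every scaffold is in forward orientation.
    """
    offsets = {}
    position = 0
    for name, length in ordered_scaffolds:
        offsets[name] = position
        position += length + gap
    return offsets


def lift_gff_line(line, offsets, chrom_name):
    """Shift one GFF3 feature line onto pseudochromosome coordinates.

    Returns None for comments, blank lines and features on scaffolds
    that are not part of this pseudochromosome.
    """
    if line.startswith("#") or not line.strip():
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[0] not in offsets:
        return None
    shift = offsets[fields[0]]
    fields[0] = chrom_name
    fields[3] = str(int(fields[3]) + shift)
    fields[4] = str(int(fields[4]) + shift)
    return "\t".join(fields)


# Toy example: two scaffolds of 1000 bp and 500 bp on one chromosome.
offsets = scaffold_offsets([("scaffold_7", 1000), ("scaffold_2", 500)])
example = "scaffold_2\tmaker\tgene\t10\t200\t.\t+\t.\tID=g1"
print(lift_gff_line(example, offsets, "chr1"))
```

A scaffold placed in reverse orientation would additionally need its coordinates mirrored and its strand flipped, which this sketch does not attempt.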




From flopezo84 at gmail.com  Sat Jun  9 14:06:48 2018
From: flopezo84 at gmail.com (=?UTF-8?Q?Federico_L=C3=B3pez?=)
Date: Sat, 9 Jun 2018 16:06:48 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
Message-ID: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>

Hello,

I'm using MAKER's "quality_filter.pl" with the default option (AED<1).
However, I have noticed cases in which models have low AED scores and high
eAED scores (1.00), so presumably the good AED scores are the result of
spurious evidence alignments. Is there a way to filter models based on eAED
scores too?

Thank you.
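
For reference, the kind of filter being asked about can be sketched outside of MAKER by reading the _AED and _eAED attributes that MAKER writes on mRNA lines of its GFF3 output. This is a hedged illustration, not a feature of quality_filter.pl; the function names, thresholds and example records are hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def passing_mrna_ids(gff_lines, max_aed=1.0, max_eaed=1.0):
    """Return IDs of mRNA features whose _AED and _eAED are both
    strictly below the given thresholds."""
    keep = []
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue
        attrs = parse_attributes(fields[8])
        if "_AED" not in attrs or "_eAED" not in attrs:
            continue  # no quality scores on this line
        if float(attrs["_AED"]) < max_aed and float(attrs["_eAED"]) < max_eaed:
            keep.append(attrs.get("ID"))
    return keep


# Two made-up mRNA records: the second has a low AED but an eAED of 1.00.
records = [
    "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=m1;Parent=g1;_AED=0.06;_eAED=0.06",
    "ctg1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=m2;Parent=g2;_AED=0.10;_eAED=1.00",
]
print(passing_mrna_ids(records))
```

The returned IDs could then be used to select the corresponding gene, exon and CDS lines through their Parent attributes.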

From kissaj at miamioh.edu  Mon Jun 11 11:56:46 2018
From: kissaj at miamioh.edu (Andor J Kiss)
Date: Mon, 11 Jun 2018 13:56:46 -0400
Subject: [maker-devel] largest genome annotated?
Message-ID: <1528739806.4677.97.camel@miamioh.edu>

What's the largest genome that's been annotated with Maker2?

Thanks,

-- 
________________________________________________________________________________________________________________________
Andor J Kiss, PhD
Director - Center for Bioinformatics & Functional Genomics
086 Pearson Hall - Miami University
700 East High Street, Oxford
Ohio 45056
USA

eMAIL: KissAJ at MiamiOH.edu
Telephone: +1 (513) 529-4280
Fax: +1 (513) 529-2431
Ring ID: andorjkiss

URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss

From carsonhh at gmail.com  Mon Jun 11 12:05:07 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:05:07 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <1528739806.4677.97.camel@miamioh.edu>
References: <1528739806.4677.97.camel@miamioh.edu>
Message-ID: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>

The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER.

-Carson



From carsonhh at gmail.com  Mon Jun 11 12:13:28 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:13:28 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
References: <1528739806.4677.97.camel@miamioh.edu>
	<34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
Message-ID: <45218C8A-70A4-4146-9962-EA0E8F506265@gmail.com>

Correction. The recently published axolotl genome just barely edged out the Sugar Pine in size a couple of months ago. They did not really annotate it, though; instead they independently assembled the transcriptome from mRNA-seq and aligned those transcripts where they could.

-Carson



From jennifer.anderson at ebc.uu.se  Tue Jun 12 09:59:31 2018
From: jennifer.anderson at ebc.uu.se (Jennifer Anderson)
Date: Tue, 12 Jun 2018 17:59:31 +0200
Subject: [maker-devel] Merge warning = 1
Message-ID: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>

Hello,

I am working on the ab initio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my GFF file following my first full attempt (HMM files from SNAP trained 2x, and GeneMark-ES).

I have not found information on what "merge_warning=1" means at the end of the ID line, as in the example below. Out of 6834 ID lines that include AED scores in this GFF file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction.


000030F|arrow maker gene 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43
000030F|arrow maker mRNA 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1
000030F|arrow maker exon 9838 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker exon 9255 9762 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9838 9992 . - 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9255 9762 . - 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1

Best,

Jenni



From carsonhh at gmail.com  Tue Jun 12 10:03:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 10:03:37 -0600
Subject: [maker-devel] Merge warning = 1
In-Reply-To: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
References: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
Message-ID: <D2F6D9CE-78B7-46B8-A9EC-2AC13E903655@gmail.com>

It's an internal debugging-related value used in the beta. It has no meaning for the general user and will eventually disappear.

-Carson



From steinj at cshl.edu  Tue Jun 12 12:08:19 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 18:08:19 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
Message-ID: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>

Dear Carson and maker-devel group,

In our recent MAKER run, some of the transcript and protein IDs in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF.  The problem is that these names are not unique, so for example the transcript ID "mRNA_4" occurs 45 times, thus making it difficult to determine their corresponding gene IDs. My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.

How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein fasta IDs)?
Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique.

Thanks,
Josh


Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From carsonhh at gmail.com  Tue Jun 12 14:19:19 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 14:19:19 -0600
Subject: [maker-devel] Transcript & protein fasta sequence id/name
 collisions
In-Reply-To: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
Message-ID: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>

The ID= tag in GFF3 is not meant to be an identifier (despite being called ID). Instead it's a tag used to determine inheritance when reassembling a feature. It will be treated as such by all GMOD tools, while Name will be treated as the user-facing identifier. To further complicate things, 'Name' is not required to be unique (some groups like to give multi-copy genes the same name and then list multiple locations in the GFF3; you will see this all the time in NCBI-derived GFF3, for example). I would not be surprised if fgenesh does not produce "best-practice" GFF3. You may need to alter it slightly before using it.

On the MAKER end, at the very least it would be nice for MAKER to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff then MAKER should build its own unique names for things, but for model_gff it will keep the names you give it.
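
One way to "slightly alter" an input GFF3 so that names stop colliding is sketched below in Python. This is not part of MAKER or fgenesh; the function name `uniquify_names` and the ".2", ".3" suffix scheme are assumptions made for illustration. Only the Name= attribute is touched, so the ID=/Parent= inheritance described above is left alone.

```python
import re
from collections import defaultdict

# Match "Name=" only at the start of the attribute column or right after ";".
NAME_RE = re.compile(r"(?<![^;])Name=([^;]*)")

def uniquify_names(gff_lines):
    """Return GFF3 lines in which repeated Name= values get a numeric suffix.

    Hypothetical scheme: within each feature type, the first "mRNA_4" is kept
    as-is, the second becomes "mRNA_4.2", the third "mRNA_4.3", and so on.
    """
    seen = defaultdict(int)
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass comments, directives and malformed rows through unchanged.
        if line.startswith("#") or len(cols) != 9:
            out.append(line)
            continue
        feature_type = cols[2]

        def rename(match, feature_type=feature_type):
            name = match.group(1)
            seen[(feature_type, name)] += 1
            count = seen[(feature_type, name)]
            return "Name=" + (name if count == 1 else f"{name}.{count}")

        cols[8] = NAME_RE.sub(rename, cols[8], count=1)
        out.append("\t".join(cols))
    return out
```

The suffix could still collide with a name that already ends in ".2", so it is worth checking the result for duplicates before passing it to MAKER.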

--Carson




From steinj at cshl.edu  Tue Jun 12 14:31:13 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 20:31:13 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name
 collisions
In-Reply-To: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
	<91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
Message-ID: <BE0C9812-CCE7-431D-89DB-6CAA60AD937F@cshl.edu>

Hi Carson,
Thanks for identifying the problem.  I see that the input fgenesh GFF3 did not use unique identifiers, so we will attack the problem there.

Best,
Josh


Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From carsonhh at gmail.com  Wed Jun 13 11:46:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 11:46:12 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
Message-ID: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>

The eAED score also takes protein reading frame into account, and it can infer support for exons when both flanking introns are validated (i.e. it can be lower than AED in some cases). In your case, an eAED of 1 with an AED less than 1 means that your evidence support comes from an overlapping protein alignment that is never in the same reading frame as the gene model. So the positive evidence support may be suspect, or it may be real and the model is poor because of the assembly, gaps, etc. To use eAED in the quality_filter.pl script, you would have to manually edit the script and replace '_AED' with '_eAED'. Using eAED instead will greatly reduce sensitivity on lower-quality assemblies (places where there is evidence of a gene locus, but the predictors make the best model they can rather than the correct model because the assembly won't allow for the correct one). So make sure to always view suspect regions in a browser first.
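
As an alternative to editing quality_filter.pl, the same kind of cutoff can be applied directly to the GFF3. The Python sketch below is not the logic of quality_filter.pl; it only assumes that mRNA rows carry `_AED=` and `_eAED=` entries in the attribute column, and the function names are hypothetical. It lists the mRNA IDs that pass a cutoff and leaves handling of parent genes and child features to the reader.

```python
import re

def attribute(attrs, key):
    """Return the value stored under key in a GFF3 attribute string, or None."""
    match = re.search(r"(?:^|;)" + re.escape(key) + r"=([^;]+)", attrs)
    return match.group(1) if match else None

def mrnas_passing(gff_lines, key="_eAED", cutoff=1.0):
    """Return IDs of mRNA features whose score under key is below cutoff."""
    passing = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        score = attribute(cols[8], key)
        feature_id = attribute(cols[8], "ID")
        if score is not None and feature_id is not None and float(score) < cutoff:
            passing.append(feature_id)
    return passing
```

Comparing `mrnas_passing(lines, "_AED")` with `mrnas_passing(lines, "_eAED")` gives the set of models whose support disappears once reading frame is considered; those are the loci to look at in a browser.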

--Carson


> On Jun 9, 2018, at 2:06 PM, Federico López <flopezo84 at gmail.com> wrote:
> 
> Hello,
> 
> I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?
> 
> Thank you.

From ss2489 at cornell.edu  Wed Jun 13 13:34:27 2018
From: ss2489 at cornell.edu (Surya Saha)
Date: Wed, 13 Jun 2018 15:34:27 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
	<3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
Message-ID: <CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>

Hi Carson,

We have been using AED as a primary metric for evaluating predictions in
our group but it sounds like we should be using both eAED and AED. Is there
a detailed explanation of how exactly eAED and AED are computed besides
Table 2 in the Cantarel 2008 paper? Thanks

-Surya



-- 

Surya Saha
Sol Genomics Network
Boyce Thompson Institute, Ithaca, NY, USA
https://citrusgreening.org/
http://www.linkedin.com/in/suryasaha
https://twitter.com/SahaSurya

From carsonhh at gmail.com  Wed Jun 13 13:57:46 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 13:57:46 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
	<3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
	<CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>
Message-ID: <C4B3ED69-3D9E-421E-8447-90E63695FE68@gmail.com>

AED is documented in the 2011 MAKER2 paper, but eAED (extended AED) is not currently documented in a publication and is not used by any of the scripts that come with MAKER (it's just there for reference right now). Basically, AED is calculated from evidence overlap, but eAED will not count protein overlap unless it occurs in the same codon reading frame as the model (so if there is an insertion in the alignment, evidence may count for a stretch, stop counting for a few codons, then count again). Also, eAED will infer support for an exon if both flanking introns are validated by evidence and the region in between is all ORF (this allows joint intron support to infer support for an internal exon). 99% of the time AED and eAED are the same, but eAED can be useful in identifying edge cases. Much of the time, if AED and eAED are very different, it's because there is a single-base-pair insertion or deletion in the assembly. The predictors still find the locus as best they can, but protein evidence and alignments will be out of sync with the reading frame on one of the exons. BLAST can't really handle single-bp indels in its alignments, but Exonerate can do mid-alignment reading frame shifts to capture the assembly indel (and eAED is an attempt to use the extra Exonerate info in the score).
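
To make "calculated from evidence overlap" concrete, here is a small Python sketch of a base-level AED. It assumes the usual definition AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases covered by the model and SP is the fraction of model bases covered by evidence. It is a simplification for illustration, not MAKER's implementation, and it does not attempt the frame-aware or intron-inferred parts of eAED. Intervals are 1-based and inclusive, as in GFF3.

```python
def covered_bases(intervals):
    """Expand (start, end) intervals, 1-based inclusive, into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions

def aed(model_exons, evidence_spans):
    """Base-level annotation edit distance between a model and its evidence."""
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_spans)
    if not model or not evidence:
        return 1.0  # no evidence (or an empty model) means no support at all
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # evidence bases explained by the model
    specificity = shared / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0
```

A model that matches its evidence exactly scores 0, and a model with no overlapping evidence scores 1. An eAED-style variant would drop protein evidence bases from the evidence set wherever the alignment is out of frame with the model, which is why eAED can reach 1 while AED stays low.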

--Carson



From gdolby at asu.edu  Fri Jun 15 10:29:16 2018
From: gdolby at asu.edu (Greer Dolby)
Date: Fri, 15 Jun 2018 09:29:16 -0700
Subject: [maker-devel] Best annotation set failure (auto_annotator.pm line
 1774)
Message-ID: <7A8554F9-3456-4D53-B5A8-80FA794F42EE@asu.edu>

Hello,

I'm de novo annotating a second-generation genome and I have a handful of scaffolds that keep failing while choosing the best annotation set (errors below). To my knowledge there is nothing wrong with the scaffolds themselves, and the others have run fine. Looking at the VOID folder for the failed contigs across different runs, they appear to fail at different chunks but always with these two errors. I'm running 30 cores with MPICH2 on a private server. Any suggestions? Thanks!

Best,
Greer

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 1
 ...processing 8 of 12
total clusters:44 now processing 0
 ...processing 0 of 3
 ...processing 1 of 3
 ...processing 2 of 3
total clusters:44 now processing 0
 ...processing 0 of 4
 ...processing 1 of 4
 ...processing 9 of 12
 ...processing 2 of 4
 ...processing 3 of 4
total clusters:44 now processing 0
 ...processing 10 of 12
 ...processing 0 of 67
 ...processing 1 of 67
ERROR: Chunk failed at level:6, tier_type:0
 ...processing 2 of 67
FAILED CONTIG:ScCC6lQ_38465;HRSCAF=64658

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 2
 ...processing 9 of 298
 ...processing 8 of 81
 ...processing 11 of 202
 ...processing 13 of 20
 ...processing 10 of 298
 ...processing 9 of 81
 ...processing 10 of 81
 ...processing 18 of 123
 ...processing 14 of 20
 ...processing 17 of 54
 ...processing 18 of 54
 ...processing 37 of 164
 ...processing 20 of 254
Died at /home/jcornel3/tools/maker/bin/../lib/maker/auto_annotator.pm line 1774.
--> rank=17, hostname=omega
ERROR: Failed while choosing best annotation set
ERROR: Chunk failed at level:4, tier_type:4
FAILED CONTIG:ScCC6lQ_16796;HRSCAF=38896
_________________________________
Greer Dolby, PhD
Postdoctoral Research Scholar
SoLS, Arizona State U.
office: LSE 313, 480.965.7456
website <http://www.greerdolby.org/> | twitter <https://twitter.com/gadolby>
Kusumi Lab <http://kusumi.lab.asu.edu/>

From kapeelc at gmail.com  Fri Jun 22 13:41:58 2018
From: kapeelc at gmail.com (Kapeel Chougule)
Date: Fri, 22 Jun 2018 15:41:58 -0400
Subject: [maker-devel] map_forward=1 not mapping reference ID's to output
 correctly
Message-ID: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>

Hi,

I am trying to update a community annotation
<https://de.cyverse.org/dl/d/39D60E88-078D-4CF5-9F3A-D712B714CDD8/community.annotation.gff3>
in light of new evidence data, but my MAKER runs are not keeping all the
genes from the community annotation.

Community annotation feature count:
      2
      1 bicolor
 239969 CDS
 266301 exon
  51066 five_prime_UTR
  34129 gene
  47121 mRNA
  53708 three_prime_UTR

MAKER gene count:
awk '$3=="gene"{print}' maker_output.all.gff | grep "Sobic*" | wc -l
21105

In the attached maker_opts.ctl file, I set keep_preds=1 and
map_forward=1, which should keep all the community gene models even if they
don't have evidence support. This is explained here:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Updating_annotations_in_light_of_new_data
So I am not sure why not all of the community gene models are mapped in
the MAKER output.
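
To see exactly which community genes are missing rather than only comparing totals, a short Python sketch along these lines could help. It assumes the community gene name is carried in the `Name=` attribute of gene rows in both files; that attribute choice and the function names are assumptions for illustration, so `key` may need to be changed to match the actual MAKER output.

```python
import re
from collections import Counter

def feature_counts(gff_lines):
    """Count rows per feature type (column 3), skipping comments and directives."""
    counts = Counter()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if not line.startswith("#") and len(cols) == 9:
            counts[cols[2]] += 1
    return counts

def gene_names(gff_lines, key="Name"):
    """Collect the value of one attribute from every gene row."""
    pattern = re.compile(r"(?:^|;)" + re.escape(key) + r"=([^;]+)")
    names = set()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            match = pattern.search(cols[8])
            if match:
                names.add(match.group(1))
    return names

def missing_genes(reference_lines, maker_lines, key="Name"):
    """Names present in the reference annotation but absent from the MAKER output."""
    return sorted(gene_names(reference_lines, key) - gene_names(maker_lines, key))
```

Looking at a handful of the missing loci in a browser should show whether they were merged, replaced by a new model, or dropped.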

Thanks

Kapeel
--


Kapeel Chougule
Computational Scientist Developer II
One Bungtown Road Cold Spring Harbor, NY 11724
http://www.warelab.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4990 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180622/15c366fc/attachment-0001.obj>

From monica.poelchau at ars.usda.gov  Fri Jun 22 14:04:28 2018
From: monica.poelchau at ars.usda.gov (Poelchau, Monica)
Date: Fri, 22 Jun 2018 20:04:28 +0000
Subject: [maker-devel] [CAUTION: Suspicious Link] map_forward=1 not
 mapping reference ID's to output correctly
In-Reply-To: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>
References: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>
Message-ID: <D5A4E18F-CFDC-489E-BA1B-FB88FA66C338@ars.usda.gov>

Hi Kapeel,

If you just want your community annotations to replace models in an existing gene set, we have a tool for this:

https://github.com/NAL-i5K/GFF3toolkit

You'd need to run gff3_QC on your annotation files first to make sure your annotations are okay, then use gff3_merge to merge your community annotations with your existing gene set (in GFF3 format). If you end up trying this out, note that we're actively developing the GFF3toolkit, so feel free to post an issue if you notice any problems.

Hth,

Monica


From andremmachado25 at gmail.com  Tue Jun 26 09:36:24 2018
From: andremmachado25 at gmail.com (André Machado)
Date: Tue, 26 Jun 2018 16:36:24 +0100
Subject: [maker-devel] Maker Error : Thread 1 terminated abnormally..
Message-ID: <CAAXcPKC3mnkqP9OU7L9bBLtts4KujCoBrUNieuUfgo+wd-E4Yw@mail.gmail.com>

Hi,


First of all, thanks for your efforts on the MAKER pipeline. It's a tremendous
help for people who work with genomes.

For the last 4 days I have been racking my brain over an error, but I am still
without a solution.

I found this old thread:
https://groups.google.com/forum/#!msg/maker-devel/X2-76BH9gvg/rU4kLJ3B6tsJ

It seems to be quite similar, but it doesn't point to a specific solution.

I have run MAKER with the test data and everything ran OK. MAKER finished the
entire process without errors.

Recently, I have been trying to run my own data on an MPI cluster, but this
error frequently occurs.

Thread 1 terminated abnormally:
../dna.maker.output/mpi_blastdb/dna%2Efa.mpi.1/dna%2Efa.mpi.1.0

--> rank=8, hostname=compute-0-1.local, at ../Analysis/Geno/maker/bin/maker
line 1451 thread 1.

--> rank=8, hostname=compute-0-1.local

deleted:0 hits

deleted:0 hits

preparing ab-inits

deleted:0 hits

deleted:0 hits

FATAL: Thread terminated, causing all processes to fail

--> rank=8, hostname=compute-0-1.local

deleted:0 hits


Basically, I'm trying to run MAKER with dna.fa, rna.fa, prot.fa and
my_custom_lib_of_repeats.fa to produce raw gene models, which will be used
to train SNAP.


I have already tried several command lines and all gave me the same error. The
only difference between tests was the location of the error: sometimes it
happened on compute-0-1.local, other times on compute-0-4.local or on another
node.

mpiexec -n 63 --hostfile Host maker 1>1.log 2>2.err

mpiexec --hostfile Host maker 1>1.log 2>2.err

mpiexec -mca btl ^openib -n 63 --hostfile Host maker 1>1.log 2>2.err

nohup mpiexec -mca btl ^openib -n 63 --hostfile Host maker -a 1>1.log 2>2.err


The log file as well as the option files are attached below.


Many thanks in advance,


André
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2.log
Type: text/x-log
Size: 38654 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0001.log>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_exe.ctl
Type: application/octet-stream
Size: 1223 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4547 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1412 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0005.obj>

From vsoza at uw.edu  Fri Jun  1 13:36:10 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Fri, 1 Jun 2018 12:36:10 -0700
Subject: [maker-devel] how to input a masked assembly for annotation into
 Maker
Message-ID: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>

Hi Maker community

I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.

Now, I want to use only the scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosomal sequence for each chromosome, stitching together all the ordered scaffolds along a chromosome with each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosomal sequences. I am trying to see if I can redo the annotations with Maker on these pseudochromosomal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for the ordered scaffolds of each chromosome and used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but I am getting fewer genes predicted for these scaffolds than I got from my entire assembly (Annotation A) above.

For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting fewer genes predicted for the same set of scaffolds: 20,830 genes (18,465 from the default build + 2,365 genes that were rescued with Pfam domains).

I am wondering if I have a setting wrong in my maker_opts.ctl files for the default and standard build runs on the masked sequences (see attached below), particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in JBrowse and compared Annotation A to Annotation B. They looked similar, except that some transcripts from my altest= file were present in Annotation A but did not show up in Annotation B; therefore, Maker did not predict some of these genes in Annotation B, although it did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?

Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosomal scaffolds, which were created after the Maker annotation of the entire assembly, I would welcome it. Thanks a bunch.
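
One option that avoids re-annotating is to shift the existing Annotation A coordinates onto the stitched sequences. The Python sketch below illustrates the arithmetic under stated assumptions: scaffolds are joined in order with a fixed 100 N gap, each scaffold's length and orientation ("+" or "-") are known, and the function names are hypothetical. It is not a MAKER utility, and it does not rewrite IDs or sort the output.

```python
GAP = 100  # number of Ns placed between consecutive scaffolds

def scaffold_offsets(ordered, gap=GAP):
    """Map scaffold name -> (offset, length, orientation).

    ordered lists (name, length, orientation) in chromosome order; offset is
    the number of pseudochromosome bases that precede the scaffold.
    """
    offsets = {}
    position = 0
    for index, (name, length, orientation) in enumerate(ordered):
        if index > 0:
            position += gap
        offsets[name] = (position, length, orientation)
        position += length
    return offsets

def lift(start, end, strand, offset, length, orientation):
    """Convert 1-based inclusive scaffold coordinates to pseudochromosome ones."""
    if orientation == "+":
        return offset + start, offset + end, strand
    # A reverse-complemented scaffold mirrors coordinates and flips the strand.
    flipped = {"+": "-", "-": "+"}.get(strand, strand)
    return offset + (length - end + 1), offset + (length - start + 1), flipped

def lift_gff_line(line, chromosome, offsets):
    """Rewrite one 9-column GFF3 row onto the pseudochromosome."""
    cols = line.rstrip("\n").split("\t")
    offset, length, orientation = offsets[cols[0]]
    start, end, strand = lift(int(cols[3]), int(cols[4]), cols[6],
                              offset, length, orientation)
    cols[0], cols[3], cols[4], cols[6] = chromosome, str(start), str(end), strand
    return "\t".join(cols)
```

With lifted coordinates, coding sequence can be pulled from the pseudochromosomes using the original gene models, which also keeps the Annotation A gene count for the ordered scaffolds.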


Annotation A default build steps:

$ maker -base Rwill10 -fix_nucleotides
$ maker -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11983   11983  312159
#should be 11985

$ maker -dsindex -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211
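
The FINISHED count above says how many sequences completed but not which ones are missing. A small Python sketch for listing them is given below. It assumes the index log is tab-separated with the sequence name in the first column and the status in the last column, which is consistent with the grep/cut pipeline used here; the function names are hypothetical.

```python
def fasta_names(fasta_lines):
    """Return sequence names (first word of each header) from FASTA lines."""
    return [line[1:].split()[0] for line in fasta_lines
            if line.startswith(">") and len(line.strip()) > 1]

def unfinished(index_lines, expected_names):
    """Return expected sequence names that never reached FINISHED in the log."""
    finished = set()
    for line in index_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2 and cols[-1] == "FINISHED":
            finished.add(cols[0])
    return sorted(set(expected_names) - finished)
```

Passing the lines of the master datastore index log and the names from the genome FASTA returns the scaffolds that still need to be rerun.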

$ gff3_merge -d Rwill10_master_datastore_index.log

$ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
21960

$ fasta_merge -d Rwill10_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.default.log
Type: application/octet-stream
Size: 4650 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0008.obj>
-------------- next part --------------


Annotation A standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta

#genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
#IDs in .tsv file are called "processed-gene" from .fasta file, 
#but in .gff file, I think these are called "abinit-gene"
#best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
$ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

#extract list of IDs only to grep for
$ cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
  1599   1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
  
#used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff
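
The ID renaming and selection steps above can be sketched in Python as follows. This is not the create_pred_gff.sh script, whose contents are not shown here. It assumes that the prediction rows in the .gff carry the "-abinit-" identifier in their `Name=` attribute and that child rows point to them through `Parent=`; both are assumptions to verify against the real file, and the function names are hypothetical.

```python
import re

def abinit_ids_from_tsv(tsv_lines):
    """Take column 1 of an InterProScan TSV and rename -processed- to -abinit-."""
    ids = set()
    for line in tsv_lines:
        if line.strip():
            ids.add(line.split("\t")[0].replace("-processed-", "-abinit-"))
    return ids

def attribute(attrs, key):
    """Return the value stored under key in a GFF3 attribute string, or None."""
    match = re.search(r"(?:^|;)" + re.escape(key) + r"=([^;]+)", attrs)
    return match.group(1) if match else None

def select_predictions(gff_lines, wanted):
    """Keep rows whose Name is in wanted, plus rows whose Parent is such a row."""
    rows = [line.rstrip("\n").split("\t") for line in gff_lines]
    rows = [row for row in rows if len(row) == 9]
    kept_ids = {attribute(row[8], "ID") for row in rows
                if attribute(row[8], "Name") in wanted}
    kept_ids.discard(None)
    return ["\t".join(row) for row in rows
            if attribute(row[8], "ID") in kept_ids
            or attribute(row[8], "Parent") in kept_ids]
```

Because children are selected through Parent=, each kept prediction keeps its parts, which a plain grep on the renamed ID may miss.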

$ maker -base Rwill10standard2 -fix_nucleotides
$ maker -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11975   11975  311953
#should be 11985

$ maker -dsindex -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10standard2.all.gff
23559

$ fasta_merge -d Rwill10standard2_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.standard.log
Type: application/octet-stream
Size: 4529 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0009.obj>
-------------- next part --------------


Annotation B default build steps:

$ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta

#Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
$ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks 
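
For readers without the remove_seq_breaks.pl script, the unwrapping step it is described as doing can be sketched in Python as below. This is an illustrative equivalent written from the description in the comment above, not the script itself, and the function name is hypothetical.

```python
def unwrap_fasta(lines):
    """Join wrapped sequence lines so each header is followed by one line."""
    out = []
    sequence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: flush the sequence collected so far.
            if sequence:
                out.append("".join(sequence))
                sequence = []
            out.append(line)
        elif line:
            sequence.append(line)
    if sequence:
        out.append("".join(sequence))
    return out
```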

#use script to extract ordered scaffolds for each chromosome
$ ./extract_scaffolds_synteny.sh

#use script to create pseudochromosomal sequence for each chromosome
$ ./create_pseudo_chromosome_allLGs.sh

#concatenate these into one fasta file
$ cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta

$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ maker -base Rwill10.pseudochromos -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
18465

$ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.default.log
Type: application/octet-stream
Size: 4604 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0010.obj>
-------------- next part --------------


Annotation B standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta

$ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

$ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
  2365   2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff

$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
20830

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.standard.log
Type: application/octet-stream
Size: 4558 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0011.obj>
-------------- next part --------------


-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From carsonhh at gmail.com  Fri Jun  1 16:01:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 Jun 2018 16:01:13 -0600
Subject: [maker-devel] Building MAKER with specific perl version
In-Reply-To: <CAOeNZNuUnnu_gGtO-yHHrM6wzeEzqihH83gLMCrty20mftt0Wg@mail.gmail.com>
References: <CAOeNZNuUnnu_gGtO-yHHrM6wzeEzqihH83gLMCrty20mftt0Wg@mail.gmail.com>
Message-ID: <78BA2892-5E3A-4DE5-AA44-18A9BCEF8071@gmail.com>

You probably need to run './Build realclean' to clean up your previous build and settings that you likely ran with the system perl. You may even need to clean out the .../maker/perl directory. You can also try updating your Module::Build installation.
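
As a rough sketch of that sequence, assuming the commands are run from the directory that contains Build.PL and that /opt/perl/bin/perl stands in for your own perl. The path, the bin location and the wrong_shebang helper are all hypothetical and not part of MAKER:

```shell
# Rebuild steps named in this thread, shown as comments because they only
# make sense inside a MAKER source tree (the perl path is a placeholder):
#   ./Build realclean
#   /opt/perl/bin/perl Build.PL
#   ./Build install

# Hypothetical helper: list perl scripts in a directory whose shebang line
# does not mention the expected perl path.
# Usage: wrong_shebang DIR /path/to/perl
wrong_shebang() {
    dir=$1
    want=$2
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        first=$(head -n 1 "$f")
        case $first in
            "#!"*"$want"*) ;;                   # already points at the expected perl
            "#!"*perl*) printf '%s\n' "$f" ;;   # perl script using a different interpreter
        esac
    done
}

# Example (directory name is an assumption): wrong_shebang bin /opt/perl/bin/perl
```

If the helper prints nothing after './Build install', the installed scripts are using the perl you asked for.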

--Carson


> On May 31, 2018, at 8:30 AM, Ksenia Lavrichenko <ksenia.lavrichenko at gmail.com> wrote:
> 
> Hi, 
> 
> I have been banging my head for a while now, trying to install MAKER with my specific perl. 
> 
> I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/hScqdJW0FsU/3KT_UF7k9XMJ
> 
> However, this does not work for me. I make sure bin/* and Build are deleted before I run $myperl Build.PL.
> 
> I see my perl in shebang of Build however after ./Build install all scripts in bin have "#! /usr/bin/perl" which produces a version error when I try to run maker -h. 
> 
> Any tips on what I need to adjust in Build.PL?
> 
> Many thanks,
> Ksenia
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com  Mon Jun 11 10:46:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 10:46:13 -0600
Subject: [maker-devel] how to input a masked assembly for annotation
 into Maker
In-Reply-To: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
References: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
Message-ID: <9987ECCD-E7B6-45CD-9CBB-0B13B7608E4C@gmail.com>

Predictors will call partial models near the edge of contigs, but if you merge them with 100 N gaps, they will be more likely to jump the gap (attempt to merge models on each side while putting the gap in the intron, even if that is not correct), or they will just call nothing around the gap (i.e. they will behave differently around gaps than at the edge of contigs).
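
To make the layout concrete, here is a small sketch of where each scaffold and each gap would fall in a pseudochromosome built this way. It assumes a tab-separated list of scaffold IDs and lengths in chromosome order, all in forward orientation; the function name and input format are illustrative and not part of MAKER:

```shell
# Hypothetical helper: read "scaffold_id<TAB>length" lines in chromosome order
# and print "scaffold_id<TAB>start<TAB>stop" in 1-based pseudochromosome
# coordinates, with a run of N's (default 100) between consecutive scaffolds.
# Usage: scaffold_spans [GAP_LENGTH] < scaffold_lengths.tsv
scaffold_spans() {
    awk -F'\t' -v gap="${1:-100}" '
        BEGIN { OFS = "\t"; pos = 0 }
        {
            first = pos + 1        # first base of this scaffold
            last  = pos + $2       # last base of this scaffold
            print $1, first, last
            pos = last + gap       # skip over the N gap before the next scaffold
        }
    '
}
```

For two scaffolds of 1000 and 500 bases, the first spans 1-1000, the gap occupies 1001-1100, and the second spans 1101-1600. Models that start or end close to those gap positions are the ones most likely to differ between the two annotations.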

--Carson


> On Jun 1, 2018, at 1:36 PM, Valerie Soza <vsoza at uw.edu> wrote:
> 
> Hi Maker community
> 
> I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.
> 
> Now, I want to only use scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosomal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosomal sequences. I am trying to see if I can re-do the annotations by Maker on these pseudochromosomal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting fewer genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.
> 
> For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting fewer genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains).
> 
> I am wondering if I have a setting wrong for my maker_opts.ctl files for the default and standard build runs on the masked sequences, see attached below, particularly in the repeat masking or re-annotation part of the file.
> I also looked at the default build annotations in JBrowse and compared Annotation A to Annotation B, and they looked similar except that there were transcripts from my altest= file that were not showing up in Annotation B, but were present in Annotation A; therefore, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?
> 
> Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosomal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. Thanks a bunch.
> 
> 
> Annotation A default build steps:
> 
> $ maker -base Rwill10 -fix_nucleotides
> $ maker -base Rwill10 -fix_nucleotides
> 
> $ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11983   11983  312159
> #should be 11985
> 
> $ maker -dsindex -base Rwill10 -fix_nucleotides
> 
> $ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11985   11985  312211
> 
> $ gff3_merge -d Rwill10_master_datastore_index.log
> 
> $ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
> 21960
> 
> $ fasta_merge -d Rwill10_master_datastore_index.log
> 
> <maker_opts.log.AnnotationA.default.log>
> 
> 
> Annotation A standard build steps:
> 
> $ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta
> 
> #genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
> #IDs in .tsv file are called "processed-gene" from .fasta file, 
> #but in .gff file, I think these are called "abinit-gene"
> #best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
> $ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
> 
> #extract list of IDs only to grep for
> $ cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
> 
> $ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  
> 
> $ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
>  1599   1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
> 
> #used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff
> 
> $ maker -base Rwill10standard2 -fix_nucleotides
> $ maker -base Rwill10standard2 -fix_nucleotides
> 
> $ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11975   11975  311953
> #should be 11985
> 
> $ maker -dsindex -base Rwill10standard2 -fix_nucleotides
> 
> $ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11985   11985  312211
> 
> $ gff3_merge -d Rwill10standard2_master_datastore_index.log
> 
> $ grep -cP '\tgene\t' Rwill10standard2.all.gff
> 23559
> 
> $ fasta_merge -d Rwill10standard2_master_datastore_index.log
> 
> <maker_opts.log.AnnotationA.standard.log>
> 
> 
> Annotation B default build steps:
> 
> $ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta
> 
> #Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
> $ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks 
> 
> #use script to extract ordered scaffolds for each chromosome
> $ ./extract_scaffolds_synteny.sh
> 
> #use script to create pseudochromosomal sequence for each chromosome
> $ ./create_pseudo_chromosome_allLGs.sh
> 
> #concatenate these into one fasta file
> $ cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta
> 
> $ maker -base Rwill10.pseudochromos -fix_nucleotides
> $ maker -base Rwill10.pseudochromos -fix_nucleotides
> 
> $ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>     13      13     312
> 
> $ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log
> 
> $ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
> 18465
> 
> $ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log
> 
> <maker_opts.log.AnnotationB.default.log>
> 
> 
> Annotation B standard build steps:
> 
> $ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta
> 
> $ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
> 
> $ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
> 
> $ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  
> 
> $ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
>  2365   2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
> 
> #used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff
> 
> $ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
> $ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
> 
> $ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>     13      13     312
> 
> $ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log
> 
> $ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
> 20830
> 
> <maker_opts.log.AnnotationB.standard.log>
> 
> 
> -Valerie
> 
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/


From flopezo84 at gmail.com  Sat Jun  9 14:06:48 2018
From: flopezo84 at gmail.com (=?UTF-8?Q?Federico_L=C3=B3pez?=)
Date: Sat, 9 Jun 2018 16:06:48 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
Message-ID: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>

Hello,

I'm using MAKER's "quality_filter.pl" with the default option (AED<1).
However, I have noticed cases in which models have low AED scores and high
eAED scores (1.00), so presumably the good AED scores are the result of
spurious evidence alignments. Is there a way to filter models based on eAED
scores too?

Thank you.

From kissaj at miamioh.edu  Mon Jun 11 11:56:46 2018
From: kissaj at miamioh.edu (Andor J Kiss)
Date: Mon, 11 Jun 2018 13:56:46 -0400
Subject: [maker-devel] largest genome annotated?
Message-ID: <1528739806.4677.97.camel@miamioh.edu>

What's the largest genome that's been annotated with Maker2?

Thanks,

-- 
________________________________________________________________________________________________________________________
Andor J Kiss, PhD
Director - Center for Bioinformatics & Functional Genomics
086 Pearson Hall - Miami University
700 East High Street, Oxford
Ohio 45056
USA

eMAIL: KissAJ at MiamiOH.edu
Telephone: +1 (513) 529-4280
Fax: +1 (513) 529-2431
Ring ID: andorjkiss

URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss

From carsonhh at gmail.com  Mon Jun 11 12:05:07 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:05:07 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <1528739806.4677.97.camel@miamioh.edu>
References: <1528739806.4677.97.camel@miamioh.edu>
Message-ID: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>

The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER.

--Carson


> On Jun 11, 2018, at 11:56 AM, Andor J Kiss <kissaj at miamioh.edu> wrote:
> 
> What's the largest genome that's been annotated with Maker2?
> 
> Thanks,
> -- 
> ________________________________________________________________________________________________________________________
> Andor J Kiss, PhD
> Director - Center for Bioinformatics & Functional Genomics
> 086 Pearson Hall - Miami University
> 700 East High Street, Oxford
> Ohio 45056
> USA
> 
> eMAIL: KissAJ at MiamiOH.edu
> Telephone: +1 (513) 529-4280
> Fax: +1 (513) 529-2431
> Ring ID: andorjkiss <https://ring.cx/>
> 
> URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
> URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
> URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss

From carsonhh at gmail.com  Mon Jun 11 12:13:28 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:13:28 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
References: <1528739806.4677.97.camel@miamioh.edu>
	<34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
Message-ID: <45218C8A-70A4-4146-9962-EA0E8F506265@gmail.com>

Correction. The recently published axolotl just barely edged out the Sugar Pine in size a couple of months ago. They did not really annotate it though; instead, they independently assembled the transcriptome from mRNA-seq and aligned those where they could.

--Carson


> On Jun 11, 2018, at 12:05 PM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER.
> 
> --Carson
> 
> 
> 
>> On Jun 11, 2018, at 11:56 AM, Andor J Kiss <kissaj at miamioh.edu> wrote:
>> 
>> What's the largest genome that's been annotated with Maker2?
>> 
>> Thanks,
>> -- 
>> ________________________________________________________________________________________________________________________
>> Andor J Kiss, PhD
>> Director - Center for Bioinformatics & Functional Genomics
>> 086 Pearson Hall - Miami University
>> 700 East High Street, Oxford
>> Ohio 45056
>> USA
>> 
>> eMAIL: KissAJ at MiamiOH.edu
>> Telephone: +1 (513) 529-4280
>> Fax: +1 (513) 529-2431
>> Ring ID: andorjkiss <https://ring.cx/>
>> 
>> URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
>> URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
>> URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss

From jennifer.anderson at ebc.uu.se  Tue Jun 12 09:59:31 2018
From: jennifer.anderson at ebc.uu.se (Jennifer Anderson)
Date: Tue, 12 Jun 2018 17:59:31 +0200
Subject: [maker-devel] Merge warning = 1
Message-ID: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>

Hello,

I am working on the ab initio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my gff file following my first full attempt (HMM files from SNAP trained 2x, and GeneMark-ES).

I have not found information on what "merge_warning=1" means at the end of the ID line, as in the example below.  Out of 6834 ID lines that include AED scores in this gff file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction.


000030F|arrow maker gene 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43
000030F|arrow maker mRNA 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1
000030F|arrow maker exon 9838 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker exon 9255 9762 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9838 9992 . - 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9255 9762 . - 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1

Best,

Jenni


E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/om-uu/dataskydd-personuppgifter/

From carsonhh at gmail.com  Tue Jun 12 10:03:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 10:03:37 -0600
Subject: [maker-devel] Merge warning = 1
In-Reply-To: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
References: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
Message-ID: <D2F6D9CE-78B7-46B8-A9EC-2AC13E903655@gmail.com>

It's an internal debugging-related value used in the beta. It has no meaning for the general user and will eventually disappear.

--Carson


> On Jun 12, 2018, at 9:59 AM, Jennifer Anderson <jennifer.anderson at ebc.uu.se> wrote:
> 
> Hello,
> 
> I am working on the ab initio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my gff file following my first full attempt (HMM files from SNAP trained 2x, and GeneMark-ES).
> 
> I have not found information on what "merge_warning=1" means at the end of the ID line, as in the example below.  Out of 6834 ID lines that include AED scores in this gff file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction.
> 
> 
> 000030F|arrow maker gene 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43
> 000030F|arrow maker mRNA 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1
> 000030F|arrow maker exon 9838 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 000030F|arrow maker exon 9255 9762 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 000030F|arrow maker CDS 9838 9992 . - 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 000030F|arrow maker CDS 9255 9762 . - 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 
> Best,
> 
> Jenni
> 
> E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/om-uu/dataskydd-personuppgifter/

From steinj at cshl.edu  Tue Jun 12 12:08:19 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 18:08:19 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
Message-ID: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>

Dear Carson and maker-devel group,

In our recent MAKER run, some of the transcript and protein IDs in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF.  The problem is that these names are not unique, so for example the transcript ID "mRNA_4" occurs 45 times, thus making it difficult to determine their corresponding gene IDs. My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.

How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein fasta IDs)?
Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique.

Thanks,
Josh


Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From carsonhh at gmail.com  Tue Jun 12 14:19:19 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 14:19:19 -0600
Subject: [maker-devel] Transcript & protein fasta sequence id/name
 collisions
In-Reply-To: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
Message-ID: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>

The ID= tag in GFF3 is not meant to be an identifier (despite being called ID). Instead it's a tag used to determine inheritance when reassembling a feature. It will be treated as such by all GMOD tools while Name will be treated as the user facing identifier. To further complicate things, 'Name' is not required to be unique (some groups like to name multi-copy genes the same and then have multiple locations in the GFF3; you will see this all the time in NCBI derived GFF3 for example). I would not be surprised if fgenesh does not produce "best-practice" GFF3. You may need to slightly alter it before using it.

On the MAKER end, at the very least it would be nice for MAKER to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff then MAKER should build its own unique names for things, but for model_gff it will keep the name you give it.
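
A quick way to check an input file before passing it to MAKER is to count repeated Name= values. The sketch below assumes tab-delimited GFF3 with transcripts typed as mRNA; the function name and the file name fgenesh.gff3 are made up for illustration:

```shell
# Hypothetical helper: list Name= values that occur on more than one mRNA line.
# Reads GFF3 on stdin and prints "count<TAB>name" for each repeated name.
# Usage: dup_mrna_names < fgenesh.gff3
dup_mrna_names() {
    awk -F'\t' '
        $3 == "mRNA" {
            n = split($9, attrs, ";")          # column 9 holds tag=value pairs
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^Name=/) {
                    count[substr(attrs[i], 6)]++
                }
            }
        }
        END {
            for (name in count) {
                if (count[name] > 1) {
                    printf "%d\t%s\n", count[name], name
                }
            }
        }
    ' | sort -k2,2
}
```

An empty result means every mRNA Name is unique; any line of output points at names that would collide in the fasta headers.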

--Carson


> On Jun 12, 2018, at 12:08 PM, Stein, Joshua <steinj at cshl.edu> wrote:
> 
> Dear Carson and maker-devel group,
> 
> In our recent MAKER run, some of the transcript and protein IDs in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF.  The problem is that these names are not unique, so for example the transcript ID "mRNA_4" occurs 45 times, thus making it difficult to determine their corresponding gene IDs. My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.
> 
> How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein fasta IDs)?
> Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique.
> 
> Thanks,
> Josh
> 
> 
> Joshua Stein, PhD
> Manager, Sci. Informatics III
> Cold Spring Harbor Laboratory
> steinj at cshl.edu
> http://ware.cshl.org/
> 
> 
> 


From steinj at cshl.edu  Tue Jun 12 14:31:13 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 20:31:13 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name
 collisions
In-Reply-To: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
	<91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
Message-ID: <BE0C9812-CCE7-431D-89DB-6CAA60AD937F@cshl.edu>

Hi Carson,
Thanks for identifying the problem.  I see that the input fgenesh GFF3 did not use unique identifiers, so we will attack the problem there.

Best,
Josh

> On Jun 12, 2018, at 4:19 PM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> The ID= tag in GFF3 is not meant to be an identifier (despite being called ID). Instead it's a tag used to determine inheritance when reassembling a feature. It will be treated as such by all GMOD tools while Name will be treated as the user facing identifier. To further complicate things, 'Name' is not required to be unique (some groups like to name multi-copy genes the same and then have multiple locations in the GFF3; you will see this all the time in NCBI derived GFF3 for example). I would not be surprised if fgenesh does not produce "best-practice" GFF3. You may need to slightly alter it before using it.
> 
> On the MAKER end, at the very least it would be nice for MAKER to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff then MAKER should build its own unique names for things, but for model_gff it will keep the name you give it.
> 
> --Carson
> 
> 
>> On Jun 12, 2018, at 12:08 PM, Stein, Joshua <steinj at cshl.edu> wrote:
>> 
>> Dear Carson and maker-devel group,
>> 
>> In our recent MAKER run, some of the transcript and protein IDs in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF.  The problem is that these names are not unique, so for example the transcript ID "mRNA_4" occurs 45 times, thus making it difficult to determine their corresponding gene ids. My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.
>> 
>> How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein fasta IDs)?
>> Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique.
>> 
>> Thanks,
>> Josh
>> 
>> 
>> Joshua Stein, PhD
>> Manager, Sci. Informatics III
>> Cold Spring Harbor Laboratory
>> steinj at cshl.edu
>> http://ware.cshl.org/
>> 
>> 
>> 
> 

Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From carsonhh at gmail.com  Wed Jun 13 11:46:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 11:46:12 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
Message-ID: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>

The eAED score also takes protein reading frame into account, and it can infer support for exons when both introns are validated (i.e. it can be lower than AED in some cases). In your case, an eAED of 1 with an AED less than 1 means that your evidence support comes from an overlapping protein that is never in the same reading frame as the gene model. So the positive evidence support may be suspect, or it may be real and the model is poor because of the assembly, gaps, etc. To use eAED instead in the quality_filter.pl script, you would have to manually edit the script and replace '_AED' with '_eAED'. Using eAED instead will greatly reduce sensitivity on lower quality assemblies (places where the predictors make the best model they can rather than the correct model, because the assembly won't allow for the correct model even though there is evidence of a gene locus). So make sure to always view suspect regions in a browser first.
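
For anyone who would rather not edit quality_filter.pl, here is a rough sketch of the same idea outside the script. It assumes the mRNA lines in the MAKER GFF3 carry _AED and _eAED attributes; the function names and cutoffs are illustrative only, and it returns transcript IDs rather than a filtered GFF3 so parent/child features are not broken.

```python
def mrna_scores(gff_lines):
    """Yield (mRNA ID, AED, eAED) for mRNA features that carry both scores."""
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "_AED" in attrs and "_eAED" in attrs:
            yield attrs.get("ID"), float(attrs["_AED"]), float(attrs["_eAED"])


def passing_ids(gff_lines, max_aed=1.0, max_eaed=1.0):
    """IDs of mRNAs with AED < max_aed and eAED < max_eaed."""
    return [mrna_id for mrna_id, aed, eaed in mrna_scores(gff_lines)
            if aed < max_aed and eaed < max_eaed]
```

As noted above, a strict eAED cutoff will drop real loci on weaker assemblies, so the list is best treated as a starting point for inspection in a browser.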

--Carson


> On Jun 9, 2018, at 2:06 PM, Federico López <flopezo84 at gmail.com> wrote:
> 
> Hello,
> 
> I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?
> 
> Thank you.
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180613/797116da/attachment-0002.html>

From ss2489 at cornell.edu  Wed Jun 13 13:34:27 2018
From: ss2489 at cornell.edu (Surya Saha)
Date: Wed, 13 Jun 2018 15:34:27 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
	<3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
Message-ID: <CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>

Hi Carson,

We have been using AED as a primary metric for evaluating predictions in
our group but it sounds like we should be using both eAED and AED. Is there
a detailed explanation of how exactly eAED and AED are computed besides
Table 2 in the Cantarel 2008 paper? Thanks

-Surya

On Wed, Jun 13, 2018 at 2:03 PM Carson Holt <carsonhh at gmail.com> wrote:

> The eAED score also takes protein reading frame into account, and it can
> infer support for exons when both introns are validated (i.e. it can be
> lower than AED in some cases). In your case, an eAED of 1 with an AED less
> than 1 means that your evidence support comes from an overlapping protein
> that is never in the same reading frame as the gene model. So the positive
> evidence support may be suspect, or it may be real and the model is poor
> because of the assembly, gaps, etc. To use eAED instead in the
> quality_filter.pl script, you would have to manually edit the script and
> replace '_AED' with '_eAED'. Using eAED instead will greatly reduce
> sensitivity on lower quality assemblies (places where the predictors make
> the best model they can rather than the correct model, because the assembly
> won't allow for the correct model even though there is evidence of a gene
> locus). So make sure to always view suspect regions in a browser first.
>
> --Carson
>
>
>
> On Jun 9, 2018, at 2:06 PM, Federico López <flopezo84 at gmail.com> wrote:
>
> Hello,
>
> I'm using MAKER's "quality_filter.pl" with the default option (AED<1).
> However, I have noticed cases in which models have low AED scores and high
> eAED scores (1.00), so presumably the good AED scores are the result of
> spurious evidence alignments. Is there a way to filter models based on eAED
> scores too?
>
> Thank you.
>


-- 

Surya Saha
Sol Genomics Network
Boyce Thompson Institute, Ithaca, NY, USA
https://citrusgreening.org/ <http://www.linkedin.com/in/suryasaha>
http://www.linkedin.com/in/suryasaha
https://twitter.com/SahaSurya

From carsonhh at gmail.com  Wed Jun 13 13:57:46 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 13:57:46 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
	<3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
	<CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>
Message-ID: <C4B3ED69-3D9E-421E-8447-90E63695FE68@gmail.com>

AED is documented in the 2011 MAKER2 paper, but eAED (extended AED) is not currently documented in a publication and is not used by any of the scripts that come with MAKER (it's just there for reference right now). Basically AED is calculated with evidence overlap, but eAED will not count protein overlap unless it occurs in the same codon reading frame as the model (so evidence may count for a stretch, then stop counting for a few codons, then count again if there is an insertion in the alignment). Also eAED will infer support for exons if both introns are validated by evidence and the region in between is all ORF (this allows joint intron support to infer support for an internal exon). 99% of the time AED and eAED are the same, but eAED can be useful in identifying edge cases. Much of the time if AED and eAED are very different, it's because there is a single base pair insertion or deletion in the assembly. The predictors still find the locus the best they can, but protein evidence and alignments will be out of sync with the reading frame on one of the exons. BLAST can't really handle single bp INDELs in its alignments, but Exonerate can do mid-alignment reading frame shifts to capture the assembly INDEL (and eAED is an attempt to use the extra Exonerate info in the score).
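
To pull out those edge cases for manual review, something like the following could work. It is only a sketch: it assumes _AED and _eAED attributes on the mRNA lines of the MAKER GFF3, and the 0.5 gap threshold is an arbitrary illustrative value, not something MAKER defines.

```python
def discordant_models(gff_lines, min_gap=0.5):
    """List mRNAs whose AED and eAED differ by at least min_gap.

    Returns (seqid, start, end, ID, AED, eAED) tuples, largest gap first.
    """
    hits = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        try:
            aed, eaed = float(attrs["_AED"]), float(attrs["_eAED"])
        except (KeyError, ValueError):
            continue  # feature has no usable scores
        if abs(eaed - aed) >= min_gap:
            hits.append((cols[0], int(cols[3]), int(cols[4]),
                         attrs.get("ID"), aed, eaed))
    hits.sort(key=lambda h: abs(h[5] - h[4]), reverse=True)
    return hits
```

The coordinates in each tuple can be pasted into a genome browser to check whether a frame-shifting INDEL in the assembly explains the disagreement.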

--Carson


> On Jun 13, 2018, at 1:34 PM, Surya Saha <ss2489 at cornell.edu> wrote:
> 
> Hi Carson,
> 
> We have been using AED as a primary metric for evaluating predictions in our group but it sounds like we should be using both eAED and AED. Is there a detailed explanation of how exactly eAED and AED are computed besides Table 2 in the Cantarel 2008 paper? Thanks
> 
> -Surya
> 
> On Wed, Jun 13, 2018 at 2:03 PM Carson Holt <carsonhh at gmail.com> wrote:
> The eAED score also takes protein reading frame into account, and it can infer support for exons when both introns are validated (i.e. it can be lower than AED in some cases). In your case, an eAED of 1 with an AED less than 1 means that your evidence support comes from an overlapping protein that is never in the same reading frame as the gene model. So the positive evidence support may be suspect, or it may be real and the model is poor because of the assembly, gaps, etc. To use eAED instead in the quality_filter.pl script, you would have to manually edit the script and replace '_AED' with '_eAED'. Using eAED instead will greatly reduce sensitivity on lower quality assemblies (places where the predictors make the best model they can rather than the correct model, because the assembly won't allow for the correct model even though there is evidence of a gene locus). So make sure to always view suspect regions in a browser first.
> 
> --Carson
> 
> 
> 
>> On Jun 9, 2018, at 2:06 PM, Federico López <flopezo84 at gmail.com> wrote:
>> 
>> Hello,
>> 
>> I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?
>> 
>> Thank you.
> 
> 
> -- 
> 
> Surya Saha
> Sol Genomics Network
> Boyce Thompson Institute, Ithaca, NY, USA
> https://citrusgreening.org/
> http://www.linkedin.com/in/suryasaha
> https://twitter.com/SahaSurya


From gdolby at asu.edu  Fri Jun 15 10:29:16 2018
From: gdolby at asu.edu (Greer Dolby)
Date: Fri, 15 Jun 2018 09:29:16 -0700
Subject: [maker-devel] Best annotation set failure (auto_annotator.pm line
 1774)
Message-ID: <7A8554F9-3456-4D53-B5A8-80FA794F42EE@asu.edu>

Hello,

I'm de novo annotating a second-generation genome and I have a handful of scaffolds that keep failing while choosing the best annotation set (errors below). To my knowledge there isn't anything wrong with the scaffolds themselves, and the others have run fine. Looking at the VOID folder for the failed contigs across different runs, they appear to fail at different chunks but always with these two errors. I'm running 30 cores with MPICH2 on a private server. Any suggestions? Thanks!

Best,
Greer

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 1
 ...processing 8 of 12
total clusters:44 now processing 0
 ...processing 0 of 3
 ...processing 1 of 3
 ...processing 2 of 3
total clusters:44 now processing 0
 ...processing 0 of 4
 ...processing 1 of 4
 ...processing 9 of 12
 ...processing 2 of 4
 ...processing 3 of 4
total clusters:44 now processing 0
 ...processing 10 of 12
 ...processing 0 of 67
 ...processing 1 of 67
ERROR: Chunk failed at level:6, tier_type:0
 ...processing 2 of 67
FAILED CONTIG:ScCC6lQ_38465;HRSCAF=64658

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 2
 ...processing 9 of 298
 ...processing 8 of 81
 ...processing 11 of 202
 ...processing 13 of 20
 ...processing 10 of 298
 ...processing 9 of 81
 ...processing 10 of 81
 ...processing 18 of 123
 ...processing 14 of 20
 ...processing 17 of 54
 ...processing 18 of 54
 ...processing 37 of 164
 ...processing 20 of 254
Died at /home/jcornel3/tools/maker/bin/../lib/maker/auto_annotator.pm line 1774.
--> rank=17, hostname=omega
ERROR: Failed while choosing best annotation set
ERROR: Chunk failed at level:4, tier_type:4
FAILED CONTIG:ScCC6lQ_16796;HRSCAF=38896
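
A small sketch for collecting the failed contigs from logs like the two examples above, so they can be rerun or inspected together. The regular expressions are written against the lines shown here only. Because MPI output from different ranks is interleaved, pairing a FAILED CONTIG line with the most recent "Chunk failed" line is a guess rather than a guarantee.

```python
import re

FAILED_RE = re.compile(r"^FAILED CONTIG:(\S+)")
CHUNK_RE = re.compile(r"^ERROR: Chunk failed at level:(\d+), tier_type:(\d+)")


def failed_contigs(log_lines):
    """Return [(contig, (level, tier_type) or None), ...] in log order."""
    results = []
    last_chunk = None
    for line in log_lines:
        line = line.strip()
        chunk = CHUNK_RE.match(line)
        if chunk:
            last_chunk = (int(chunk.group(1)), int(chunk.group(2)))
            continue
        failed = FAILED_RE.match(line)
        if failed:
            results.append((failed.group(1), last_chunk))
            last_chunk = None
    return results
```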
_________________________________
Greer Dolby, PhD
Postdoctoral Research Scholar
SoLS, Arizona State U.
office: LSE 313, 480.965.7456
website <http://www.greerdolby.org/> | twitter <https://twitter.com/gadolby>
Kusumi Lab <http://kusumi.lab.asu.edu/>

From kapeelc at gmail.com  Fri Jun 22 13:41:58 2018
From: kapeelc at gmail.com (Kapeel Chougule)
Date: Fri, 22 Jun 2018 15:41:58 -0400
Subject: [maker-devel] map_forward=1 not mapping reference ID's to output
 correctly
Message-ID: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>

Hi,

I am trying to update community annotation
<https://de.cyverse.org/dl/d/39D60E88-078D-4CF5-9F3A-D712B714CDD8/community.annotation.gff3>
in the light of new evidence data but my MAKER runs are not keeping all the
genes from the community annotation.

Community annotation feature count:
2 1 bicolor
239969 CDS
266301 exon
51066 five_prime_UTR
34129 gene
47121 mRNA
53708 three_prime_UTR

MAKER gene count:
awk '$3=="gene"{print}' maker_output.all.gff | grep "Sobic*" | wc -l
21105

In the maker_opts.ctl file attached, I set keep_preds=1 and map_forward=1,
which should keep all the community gene models even if they don't have
evidence support. This was explained here:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Updating_annotations_in_light_of_new_data
So I am not sure why not all of the community gene models are mapped in
the MAKER output.
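
To see exactly which community genes went missing, rather than only comparing totals, a sketch along these lines may help. It assumes that identifiers carried forward by MAKER show up in the ID or Name attribute of gene features in the output GFF3; the function names are hypothetical.

```python
def gene_attrs(gff_lines):
    """Yield the attribute dict of every gene feature in a GFF3 stream."""
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            yield dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)


def missing_genes(reference_lines, maker_lines):
    """Reference gene IDs that appear as neither ID nor Name in the MAKER GFF3."""
    seen = set()
    for attrs in gene_attrs(maker_lines):
        seen.update(v for k, v in attrs.items() if k in ("ID", "Name"))
    return sorted(a["ID"] for a in gene_attrs(reference_lines)
                  if "ID" in a and a["ID"] not in seen
                  and a.get("Name") not in seen)
```

Checking a few of the reported genes against the evidence tracks should show whether they were dropped, merged, or simply renamed.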

Thanks

Kapeel
--


Kapeel Chougule
Computational Scientist Developer II
One Bungtown Road Cold Spring Harbor, NY 11724
http://www.warelab.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4991 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180622/15c366fc/attachment-0002.obj>

From monica.poelchau at ars.usda.gov  Fri Jun 22 14:04:28 2018
From: monica.poelchau at ars.usda.gov (Poelchau, Monica)
Date: Fri, 22 Jun 2018 20:04:28 +0000
Subject: [maker-devel] [CAUTION: Suspicious Link] map_forward=1 not
 mapping reference ID's to output correctly
In-Reply-To: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>
References: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>
Message-ID: <D5A4E18F-CFDC-489E-BA1B-FB88FA66C338@ars.usda.gov>

Hi Kapeel,

If you just want your community annotations to replace models in an existing gene set, we have a tool for this:

https://github.com/NAL-i5K/GFF3toolkit

You'd need to run gff3_QC on your annotation files first to make sure your annotations are okay, then use gff3_merge to merge your community annotations with your existing gene set (in gff3 format). If you end up trying this out - we're actively developing the GFF3toolkit, so feel free to post an issue if you notice any problems.

Hth,

Monica

From: maker-devel <maker-devel-bounces at yandell-lab.org> on behalf of Kapeel Chougule <kapeelc at gmail.com>
Date: Friday, June 22, 2018 at 13:53
To: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
Subject: [CAUTION: Suspicious Link][maker-devel] map_forward=1 not mapping reference ID's to output correctly


Hi,

I am trying to update community annotation<https://de.cyverse.org/dl/d/39D60E88-078D-4CF5-9F3A-D712B714CDD8/community.annotation.gff3> in the light of new evidence data but my MAKER runs are not keeping all the genes from the community annotation.


Community annotation feature count:
2 1 bicolor
239969 CDS
266301 exon
51066 five_prime_UTR
34129 gene
47121 mRNA
53708 three_prime_UTR

MAKER gene count:
awk '$3=="gene"{print}' maker_output.all.gff | grep "Sobic*" | wc -l
21105

In the maker_opts.ctl file attached, I set keep_preds=1 and map_forward=1, which should keep all the community gene models even if they don't have evidence support. This was explained here:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Updating_annotations_in_light_of_new_data
So I am not sure why not all of the community gene models are mapped in the MAKER output.

Thanks

Kapeel
--

Kapeel Chougule
Computational Scientist Developer II
One Bungtown Road Cold Spring Harbor, NY 11724
http://www.warelab.org/



From andremmachado25 at gmail.com  Tue Jun 26 09:36:24 2018
From: andremmachado25 at gmail.com (André Machado)
Date: Tue, 26 Jun 2018 16:36:24 +0100
Subject: [maker-devel] Maker Error : Thread 1 terminated abnormally..
Message-ID: <CAAXcPKC3mnkqP9OU7L9bBLtts4KujCoBrUNieuUfgo+wd-E4Yw@mail.gmail.com>

Hi,


First of all, thanks for your efforts on the MAKER pipeline. It is a
tremendous help for people who work with genomes.

For the last 4 days I have been stuck on an error and still have no
solution.

I found this old thread:
https://groups.google.com/forum/#!msg/maker-devel/X2-76BH9gvg/rU4kLJ3B6tsJ

It seems quite similar, but it does not point to a specific solution.

I ran MAKER on the test data and everything ran fine; MAKER finished the
entire process without errors.

Recently I have been trying to run my own data on an MPI cluster, but this
error occurs frequently:

Thread 1 terminated abnormally:
../dna.maker.output/mpi_blastdb/dna%2Efa.mpi.1/dna%2Efa.mpi.1.0

--> rank=8, hostname=compute-0-1.local, at ../Analysis/Geno/maker/bin/maker
line 1451 thread 1.

--> rank=8, hostname=compute-0-1.local

deleted:0 hits

deleted:0 hits

preparing ab-inits

deleted:0 hits

deleted:0 hits

FATAL: Thread terminated, causing all processes to fail

--> rank=8, hostname=compute-0-1.local

deleted:0 hits


Basically I am trying to run MAKER with dna.fa, rna.fa, prot.fa and
my_custom_lib_of_repeats.fa to produce raw gene models, which will be used
to train SNAP.


I have already tried several command lines and all gave me the same error.
The only difference between tests was where the error occurred: sometimes
on compute-0-1.local, other times on compute-0-4.local or on another
node.

mpiexec -n 63 --hostfile Host maker 1>1.log 2>2.err

mpiexec --hostfile Host maker 1>1.log 2>2.err

mpiexec -mca btl ^openib -n 63 --hostfile Host maker 1>1.log 2>2.err

nohup mpiexec -mca btl ^openib -n 63 --hostfile Host maker -a 1>1.log
2>2.err
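
Since the failing node changes from run to run, it may help to tally which ranks and hosts show up in the error output across several logs. A minimal sketch, assuming the "--> rank=N, hostname=H" line format seen in the error above; the function name is made up for illustration.

```python
import collections
import re

HOST_RE = re.compile(r"-->\s*rank=(\d+),\s*hostname=([^\s,]+)")


def ranks_by_host(log_lines):
    """Map each hostname to the sorted list of MPI ranks reported on it."""
    hosts = collections.defaultdict(set)
    for line in log_lines:
        match = HOST_RE.search(line)
        if match:
            hosts[match.group(2)].add(int(match.group(1)))
    return {host: sorted(ranks) for host, ranks in hosts.items()}
```

If the same host keeps appearing, that points at the node; if the hosts are spread evenly, the input or a shared resource is a more likely suspect.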


The log file and the control files are attached below.


Many thanks in advance,


André
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2.log
Type: text/x-log
Size: 38655 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0002.log>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_exe.ctl
Type: application/octet-stream
Size: 1224 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0006.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4548 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0007.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1413 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0008.obj>

From vsoza at uw.edu  Fri Jun  1 13:36:10 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Fri, 1 Jun 2018 12:36:10 -0700
Subject: [maker-devel] how to input a masked assembly for annotation into
 Maker
Message-ID: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>

Hi Maker community

I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.

Now, I want to use only the scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosomal sequence for each chromosome that stitches together all the ordered scaffolds along that chromosome, with each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosomal sequences. I am trying to see if I can redo the annotations with Maker on these pseudochromosomal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for the ordered scaffolds of each chromosome and used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but I am getting fewer genes predicted for these scaffolds than I got from my entire assembly (Annotation A) above.
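
For reference, the stitching step described above can be sketched as follows. This is an illustration, not the create_pseudo_chromosome_allLGs.sh script used in the steps below; it assumes every scaffold is placed in forward orientation, and the function name is hypothetical.

```python
SPACER = "N" * 100  # gap placed between consecutive scaffolds


def stitch(scaffolds, spacer=SPACER):
    """Join (name, sequence) pairs in order.

    Returns the stitched sequence and a dict mapping each scaffold name
    to its 0-based start offset within that sequence.
    """
    parts, offsets, position = [], {}, 0
    for index, (name, sequence) in enumerate(scaffolds):
        if index:  # spacer goes between scaffolds, not before the first
            parts.append(spacer)
            position += len(spacer)
        offsets[name] = position
        parts.append(sequence)
        position += len(sequence)
    return "".join(parts), offsets
```

Keeping the offsets is the useful part: they record where each original scaffold landed in the pseudochromosome.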

For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting fewer genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains).

I am wondering if I have a setting wrong in my maker_opts.ctl files for the default and standard build runs on the masked sequences (see attached below), particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in JBrowse and compared Annotation A to Annotation B. They looked similar, except that some transcripts from my altest= file were present in Annotation A but did not show up in Annotation B; as a result, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?

Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosomal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. Thanks a bunch.
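
One possible alternative to re-annotating, offered only as a sketch and not as advice from this thread: shift the Annotation A coordinates onto the pseudochromosomes. This assumes each scaffold was inserted in forward orientation at a known 0-based offset (reverse-complemented scaffolds would also need strand and coordinate flipping, which is not handled here). The placement table and function name are hypothetical.

```python
def lift_gff_line(line, placement):
    """Move one GFF3 line from scaffold to pseudochromosome coordinates.

    placement maps scaffold name -> (pseudochromosome name, 0-based offset).
    Comment lines pass through unchanged; features on unplaced scaffolds
    (or malformed lines) return None.
    """
    if line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[0] not in placement:
        return None
    chrom, offset = placement[cols[0]]
    cols[0] = chrom
    cols[3] = str(int(cols[3]) + offset)  # GFF3 start, 1-based inclusive
    cols[4] = str(int(cols[4]) + offset)  # GFF3 end, 1-based inclusive
    return "\t".join(cols) + "\n"
```

Any ##sequence-region pragmas in the original file would be stale after lifting and would need to be regenerated separately.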


Annotation A default build steps:

$ maker -base Rwill10 -fix_nucleotides
$ maker -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11983   11983  312159
#should be 11985

$ maker -dsindex -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10_master_datastore_index.log

$ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
21960

$ fasta_merge -d Rwill10_master_datastore_index.log
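
The grep/sort/uniq/wc check above only gives a count. A sketch that names the contigs that have not finished, assuming the datastore index log is tab-separated with the contig name in column 1 and the status in column 3 (that layout is an assumption here, as are the function names):

```python
def contig_status(index_lines):
    """Map contig name -> set of status strings seen in the index log."""
    statuses = {}
    for line in index_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            statuses.setdefault(cols[0], set()).add(cols[2])
    return statuses


def unfinished(index_lines):
    """Contigs that never reached FINISHED status."""
    return sorted(name for name, seen in contig_status(index_lines).items()
                  if "FINISHED" not in seen)
```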

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.default.log
Type: application/octet-stream
Size: 4650 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0012.obj>
-------------- next part --------------


Annotation A standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta

#genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
#IDs in .tsv file are called "processed-gene" from .fasta file, 
#but in .gff file, I think these are called "abinit-gene"
#best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
$ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

#extract list of IDs only to grep for
$ cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
  1599   1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
  
#used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff
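
The sed/cut/sort/uniq steps above can also be written as one function. This sketch assumes only that the protein ID is the first tab-separated column of the InterProScan TSV output; it does not reproduce create_pred_gff.sh, and the function name is hypothetical.

```python
def pfam_supported_ids(tsv_lines):
    """Unique protein IDs from an InterProScan TSV, renamed to match the GFF.

    Replaces '-processed-' with '-abinit-' in column 1, mirroring the sed
    step, and returns the IDs sorted and de-duplicated.
    """
    ids = set()
    for line in tsv_lines:
        if not line.strip():
            continue
        protein_id = line.rstrip("\n").split("\t", 1)[0]
        ids.add(protein_id.replace("-processed-", "-abinit-"))
    return sorted(ids)
```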

$ maker -base Rwill10standard2 -fix_nucleotides
$ maker -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11975   11975  311953
#should be 11985

$ maker -dsindex -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10standard2.all.gff
23559

$ fasta_merge -d Rwill10standard2_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.standard.log
Type: application/octet-stream
Size: 4529 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0013.obj>
-------------- next part --------------


Annotation B default build steps:

$ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta

#Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
$ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks 
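
For readers without remove_seq_breaks.pl, the unwrapping it is described as doing (one sequence line per FASTA header) can be sketched like this. It is an illustrative stand-in, not the Perl script itself.

```python
def unwrap_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined into one string."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Writing each pair back out as the header followed by the sequence on a single line gives the same layout the later extraction scripts expect.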

#use script to extract ordered scaffolds for each chromosome
$ ./extract_scaffolds_synteny.sh

#use script to create pseudochromosomal sequence for each chromosome
$ ./create_pseudo_chromosome_allLGs.sh

#concatenate these into one fasta file
$ cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta

$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ maker -base Rwill10.pseudochromos -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
18465

$ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.default.log
Type: application/octet-stream
Size: 4604 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0014.obj>
-------------- next part --------------


Annotation B standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta

$ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

$ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
  2365   2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff

$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
20830

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.standard.log
Type: application/octet-stream
Size: 4558 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0015.obj>
-------------- next part --------------


-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From carsonhh at gmail.com  Fri Jun  1 16:01:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 Jun 2018 16:01:13 -0600
Subject: [maker-devel] Building MAKER with specific perl version
In-Reply-To: <CAOeNZNuUnnu_gGtO-yHHrM6wzeEzqihH83gLMCrty20mftt0Wg@mail.gmail.com>
References: <CAOeNZNuUnnu_gGtO-yHHrM6wzeEzqihH83gLMCrty20mftt0Wg@mail.gmail.com>
Message-ID: <78BA2892-5E3A-4DE5-AA44-18A9BCEF8071@gmail.com>

You probably need to run './Build realclean' to clean up your previous build and settings, which you likely ran with the system perl. You may even need to clean out the .../maker/perl directory. You can also try updating your Module::Build installation.
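
After rebuilding, it can help to confirm which interpreter each installed script actually points at. The following is a minimal sketch, not part of MAKER: the function names are hypothetical, and it only assumes that the installed scripts live in one directory and start with a '#!' line.

```python
from pathlib import Path


def shebang_lines(bin_dir):
    """Return {file name: first line} for files in bin_dir that start with '#!'."""
    result = {}
    for path in sorted(Path(bin_dir).iterdir()):
        if not path.is_file():
            continue
        # Read only the first line; decode leniently in case of binary files.
        with open(path, "rb") as handle:
            first = handle.readline().decode("utf-8", errors="replace").strip()
        if first.startswith("#!"):
            result[path.name] = first
    return result


def wrong_interpreter(bin_dir, expected_perl):
    """List scripts whose shebang line does not mention the expected perl path."""
    return [name for name, line in shebang_lines(bin_dir).items()
            if expected_perl not in line]
```

Calling wrong_interpreter() on the MAKER bin directory with the path of the intended perl should return an empty list once the rebuild has picked up the right interpreter.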

--Carson


> On May 31, 2018, at 8:30 AM, Ksenia Lavrichenko <ksenia.lavrichenko at gmail.com> wrote:
> 
> Hi, 
> 
> I have been banging my head for a while now, trying to install MAKER with my specific perl. 
> 
> I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/hScqdJW0FsU/3KT_UF7k9XMJ
> 
> However, this does not work for me. I make sure bin/* and Build are deleted before I run $myperl Build.PL.
> 
> I see my perl in the shebang of Build; however, after ./Build install, all scripts in bin have "#! /usr/bin/perl", which produces a version error when I try to run maker -h.
> 
> Any tips on what I need to adjust in Build.PL?
> 
> Many thanks,
> Ksenia
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Mon Jun 11 10:46:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 10:46:13 -0600
Subject: [maker-devel] how to input a masked assembly for annotation
 into Maker
In-Reply-To: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
References: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
Message-ID: <9987ECCD-E7B6-45CD-9CBB-0B13B7608E4C@gmail.com>

Predictors will call partial models near the edge of contigs, but if you merge them with 100 N gaps, they will be more likely to jump the gap (attempt to merge models on each side while putting the gap in the intron - even if that is not correct), or they will just call nothing around the gap (i.e. they will behave differently around gaps than at the edge of contigs).
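
One way to sidestep the gap effect is to keep the annotation made on the individual scaffolds and shift its coordinates onto the pseudochromosomes instead of re-predicting. The sketch below is an illustration of that idea only and is not a MAKER feature: the function names are hypothetical, and it assumes every scaffold was placed in forward orientation with exactly 100 Ns between neighbours. Reverse-complemented scaffolds would also need their strand and coordinates flipped, which is not handled here.

```python
GAP = 100  # number of Ns placed between consecutive scaffolds (assumption)


def scaffold_offsets(ordered_scaffolds, gap=GAP):
    """Map scaffold name -> offset of its first base in the pseudochromosome.

    ordered_scaffolds is a list of (name, length) tuples in chromosome order.
    """
    offsets = {}
    position = 0
    for name, length in ordered_scaffolds:
        offsets[name] = position
        position += length + gap
    return offsets


def lift_gff_line(line, offsets, chrom_name):
    """Shift one GFF3 feature line from scaffold to pseudochromosome coordinates."""
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    if stripped.startswith("#") or len(fields) != 9:
        return stripped  # leave comments and malformed lines untouched
    shift = offsets[fields[0]]
    fields[0] = chrom_name
    fields[3] = str(int(fields[3]) + shift)  # start (1-based)
    fields[4] = str(int(fields[4]) + shift)  # end (1-based, inclusive)
    return "\t".join(fields)
```

With scaffolds of 1000 and 500 bases, the second scaffold starts at pseudochromosome position 1101, so a feature at 10..200 on it moves to 1110..1300.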

--Carson




From flopezo84 at gmail.com  Sat Jun  9 14:06:48 2018
From: flopezo84 at gmail.com (=?UTF-8?Q?Federico_L=C3=B3pez?=)
Date: Sat, 9 Jun 2018 16:06:48 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
Message-ID: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>

Hello,

I'm using MAKER's "quality_filter.pl" with the default option (AED<1).
However, I have noticed cases in which models have low AED scores and high
eAED scores (1.00), so presumably the good AED scores are the result of
spurious evidence alignments. Is there a way to filter models based on eAED
scores too?

Thank you.

From kissaj at miamioh.edu  Mon Jun 11 11:56:46 2018
From: kissaj at miamioh.edu (Andor J Kiss)
Date: Mon, 11 Jun 2018 13:56:46 -0400
Subject: [maker-devel] largest genome annotated?
Message-ID: <1528739806.4677.97.camel@miamioh.edu>

What's the largest genome that's been annotated with Maker2?

Thanks,

-- 
________________________________________________________________________________________________________________________
Andor J Kiss, PhD
Director - Center for Bioinformatics & Functional Genomics
086 Pearson Hall - Miami University
700 East High Street, Oxford
Ohio 45056
USA

eMAIL: KissAJ at MiamiOH.edu
Telephone: +1 (513) 529-4280
Fax: +1 (513) 529-2431
Ring ID: andorjkiss

URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss

From carsonhh at gmail.com  Mon Jun 11 12:05:07 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:05:07 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <1528739806.4677.97.camel@miamioh.edu>
References: <1528739806.4677.97.camel@miamioh.edu>
Message-ID: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>

The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER.

--Carson



From carsonhh at gmail.com  Mon Jun 11 12:13:28 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:13:28 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
References: <1528739806.4677.97.camel@miamioh.edu>
	<34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
Message-ID: <45218C8A-70A4-4146-9962-EA0E8F506265@gmail.com>

Correction. The recently published axolotl genome just barely edged out the Sugar Pine in size a couple of months ago. They did not really annotate it, though; instead they independently assembled the transcriptome from mRNA-seq and aligned those transcripts where they could.

--Carson



From jennifer.anderson at ebc.uu.se  Tue Jun 12 09:59:31 2018
From: jennifer.anderson at ebc.uu.se (Jennifer Anderson)
Date: Tue, 12 Jun 2018 17:59:31 +0200
Subject: [maker-devel] Merge warning = 1
Message-ID: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>

Hello,

I am working on the ab initio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my GFF file following my first full attempt (HMM files from SNAP trained 2x, and GeneMark-ES).

I have not found information on what "merge_warning=1" means at the end of the ID line, as in the example below. Out of 6834 ID lines that include AED scores in this GFF file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction.


000030F|arrow maker gene 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43
000030F|arrow maker mRNA 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1
000030F|arrow maker exon 9838 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker exon 9255 9762 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9838 9992 . - 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9255 9762 . - 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1

Best,

Jenni


E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/om-uu/dataskydd-personuppgifter/

From carsonhh at gmail.com  Tue Jun 12 10:03:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 10:03:37 -0600
Subject: [maker-devel] Merge warning = 1
In-Reply-To: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
References: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
Message-ID: <D2F6D9CE-78B7-46B8-A9EC-2AC13E903655@gmail.com>

It's an internal debugging-related value used in the beta. It has no meaning for the general user and will eventually disappear.

--Carson



From steinj at cshl.edu  Tue Jun 12 12:08:19 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 18:08:19 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
Message-ID: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>

Dear Carson and maker-devel group,

In our recent MAKER run, some of the transcript and protein IDs in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF. The problem is that these names are not unique, so for example the transcript ID "mRNA_4" occurs 45 times, thus making it difficult to determine their corresponding gene IDs. My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.

How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein fasta IDs)?
Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique.

Thanks,
Josh


Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From carsonhh at gmail.com  Tue Jun 12 14:19:19 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 14:19:19 -0600
Subject: [maker-devel] Transcript & protein fasta sequence id/name
 collisions
In-Reply-To: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
Message-ID: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>

The ID= tag in GFF3 is not meant to be an identifier (despite being called ID). Instead it's a tag used to determine inheritance when reassembling a feature. It will be treated as such by all GMOD tools, while Name will be treated as the user-facing identifier. To further complicate things, 'Name' is not required to be unique (some groups like to name multi-copy genes the same and then have multiple locations in the GFF3 - you will see this all the time in NCBI-derived GFF3, for example). I would not be surprised if fgenesh does not produce "best-practice" GFF3. You may need to slightly alter it before using it.

On the MAKER end, at the very least it would be nice for MAKER to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff then MAKER should build its own unique names for things, but for model_gff it will keep the name you give it.
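
Until MAKER complains about this itself, a quick pre-check of the input GFF3 can flag repeated Name values before the file is passed on. This is a minimal sketch under stated assumptions: the function names are hypothetical, the file is assumed to be tab-delimited with nine columns, and attribute values are compared as written (no URL-decoding).

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def duplicate_names(gff_lines, feature_type="mRNA"):
    """Return {Name: count} for Name values used by more than one feature."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != feature_type:
            continue
        name = parse_attributes(fields[8]).get("Name")
        if name is not None:
            counts[name] += 1
    return {name: n for name, n in counts.items() if n > 1}
```

Passing an open file handle as gff_lines works, since the function only iterates over lines; an empty result means every Name of that feature type is unique.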

--Carson




From steinj at cshl.edu  Tue Jun 12 14:31:13 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 20:31:13 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name
 collisions
In-Reply-To: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
	<91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
Message-ID: <BE0C9812-CCE7-431D-89DB-6CAA60AD937F@cshl.edu>

Hi Carson,
Thanks for identifying the problem.  I see that the input fgenesh GFF3 did not use unique identifiers, so we will attack the problem there.

Best,
Josh


Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/


From carsonhh at gmail.com  Wed Jun 13 11:46:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 11:46:12 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
Message-ID: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>

The eAED score also takes protein reading frame into account, and it can infer support for exons when both introns are validated (i.e. it can be lower than AED in some cases). In your case, eAED of 1 with AED less than 1 means that your evidence support is from an overlapping protein, but it is never in the same reading frame as the gene model. So the positive evidence support may be suspect, or it may be real and the model is poor because of the assembly, gaps, etc. To use eAED instead in the quality_filter.pl script, you would have to manually edit the script and replace '_AED' with '_eAED'. Using eAED instead will greatly drop sensitivity on lower-quality assemblies (places where the predictors make the best model they can and not the correct model, because the assembly won't allow for the correct model but there is evidence that there is a gene locus). So make sure to always view suspect regions in a browser first.
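
For listing candidates to inspect without editing quality_filter.pl, the _AED and _eAED attributes on the mRNA lines can be read directly from the GFF3. The following is a rough sketch only, not how quality_filter.pl works internally: the function names and default cutoffs are hypothetical, and it assumes tab-delimited MAKER GFF3 with both attributes present on mRNA features.

```python
def attribute(column9, key):
    """Return the value of one key in a GFF3 attribute column, or None."""
    prefix = key + "="
    for pair in column9.strip().split(";"):
        if pair.startswith(prefix):
            return pair[len(prefix):]
    return None


def passing_mrna_ids(gff_lines, max_aed=1.0, max_eaed=1.0):
    """Return IDs of mRNA features with _AED < max_aed and _eAED < max_eaed."""
    keep = []
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) != 9 or fields[2] != "mRNA":
            continue
        aed = attribute(fields[8], "_AED")
        eaed = attribute(fields[8], "_eAED")
        if aed is None or eaed is None:
            continue  # skip features without both scores
        if float(aed) < max_aed and float(eaed) < max_eaed:
            keep.append(attribute(fields[8], "ID"))
    return keep
```

Models that pass the AED cutoff but fail the eAED cutoff are the ones worth viewing in a browser before discarding, since the drop may reflect assembly problems rather than a spurious locus.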

--Carson


> On Jun 9, 2018, at 2:06 PM, Federico L?pez <flopezo84 at gmail.com> wrote:
> 
> Hello,
> 
> I'm using MAKER's "quality_filter.pl <http://quality_filter.pl/>" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?
> 
> Thank you.
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From ss2489 at cornell.edu  Wed Jun 13 13:34:27 2018
From: ss2489 at cornell.edu (Surya Saha)
Date: Wed, 13 Jun 2018 15:34:27 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
	<3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
Message-ID: <CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>

Hi Carson,

We have been using AED as a primary metric for evaluating predictions in
our group but it sounds like we should be using both eAED and AED. Is there
a detailed explanation of how exactly eAED and AED are computed besides
Table 2 in the Cantarel 2008 paper? Thanks

-Surya



-- 

Surya Saha
Sol Genomics Network
Boyce Thompson Institute, Ithaca, NY, USA
https://citrusgreening.org/
http://www.linkedin.com/in/suryasaha
https://twitter.com/SahaSurya

From carsonhh at gmail.com  Wed Jun 13 13:57:46 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 13:57:46 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>
References: <CAEW5o_MZiEbLtpk8+WuZ92EQSoSfAbx8Zk=_odXis4SQdM-xdQ@mail.gmail.com>
	<3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
	<CAEiaqDm3r9e-D-p+jrHE_v6Gs2=Pq2QRpBfsh=SF_yRO9odWBQ@mail.gmail.com>
Message-ID: <C4B3ED69-3D9E-421E-8447-90E63695FE68@gmail.com>

AED is documented in the 2011 MAKER2 paper, but eAED (extended AED) is not currently documented in a publication and is not used by any of the scripts that come with MAKER (it's just there for reference right now). Basically, AED is calculated from evidence overlap, but eAED will not count protein overlap unless it occurs in the same codon reading frame as the model (so evidence may count for a stretch, then stop counting for a few codons, then count again if there is an insertion in the alignment). Also, eAED will infer support for an exon if both introns are validated by evidence and the region in between is all ORF (this allows joint intron support to infer support for an internal exon). 99% of the time AED and eAED are the same, but eAED can be useful for identifying edge cases. Much of the time, if AED and eAED are very different, it's because there is a single base pair insertion or deletion in the assembly. The predictors still find the locus as best they can, but protein evidence and alignments will be out of sync with the reading frame on one of the exons. BLAST can't really handle single-bp INDELs in its alignments, but Exonerate can do mid-alignment reading frame shifts to capture the assembly INDEL (and eAED is an attempt to use the extra Exonerate info in the score).
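To make the difference concrete, here is a small worked sketch. The `aed` function follows the published definition, AED = 1 - (SN + SP) / 2 computed over overlapping bases. The `toy_eaed` function is NOT how MAKER computes eAED; it is a deliberately simplified illustration that assumes each protein evidence block comes with a hypothetical `in_frame` flag and simply drops blocks that are out of frame. It also leaves out the intron-based exon inference entirely.

```python
def bases(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def aed(model_exons, evidence_blocks):
    """AED = 1 - (sensitivity + specificity) / 2 at the base level."""
    model = bases(model_exons)
    evidence = bases(evidence_blocks)
    if not model or not evidence:
        return 1.0  # no model or no evidence: no support at all
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # fraction of evidence covered by the model
    specificity = shared / len(model)     # fraction of the model covered by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


def toy_eaed(model_exons, protein_blocks):
    """Toy illustration only: ignore protein blocks flagged as out of frame.

    protein_blocks holds (start, end, in_frame) tuples; the in_frame flag is a
    hypothetical input, not something MAKER exposes in this form.
    """
    in_frame = [(start, end) for start, end, ok in protein_blocks if ok]
    return aed(model_exons, in_frame)


if __name__ == "__main__":
    model = [(1, 100)]
    protein = [(51, 150, False)]  # overlaps the model, but never in frame
    print(aed(model, [(s, e) for s, e, _ in protein]))  # plain overlap
    print(toy_eaed(model, protein))                     # frame-aware toy score
```

With the out-of-frame protein above, the plain overlap score is 0.5 while the toy frame-aware score is 1.0, which is the "AED less than 1 but eAED equal to 1" pattern discussed in this thread.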

--Carson




From gdolby at asu.edu  Fri Jun 15 10:29:16 2018
From: gdolby at asu.edu (Greer Dolby)
Date: Fri, 15 Jun 2018 09:29:16 -0700
Subject: [maker-devel] Best annotation set failure (auto_annotator.pm line
 1774)
Message-ID: <7A8554F9-3456-4D53-B5A8-80FA794F42EE@asu.edu>

Hello,

I'm de novo annotating a second-generation genome and I have a handful of scaffolds that keep failing while choosing the best annotation set (errors below). To my knowledge there is nothing wrong with the scaffolds themselves, and the others have run fine. Looking at the VOID folder for the failed contigs from different runs, they appear to fail at different chunks but always with these two errors. I'm running 30 cores with MPICH2 on a private server. Any suggestions? Thanks!

Best,
Greer

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 1
 ...processing 8 of 12
total clusters:44 now processing 0
 ...processing 0 of 3
 ...processing 1 of 3
 ...processing 2 of 3
total clusters:44 now processing 0
 ...processing 0 of 4
 ...processing 1 of 4
 ...processing 9 of 12
 ...processing 2 of 4
 ...processing 3 of 4
total clusters:44 now processing 0
 ...processing 10 of 12
 ...processing 0 of 67
 ...processing 1 of 67
ERROR: Chunk failed at level:6, tier_type:0
 ...processing 2 of 67
FAILED CONTIG:ScCC6lQ_38465;HRSCAF=64658

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 2
 ...processing 9 of 298
 ...processing 8 of 81
 ...processing 11 of 202
 ...processing 13 of 20
 ...processing 10 of 298
 ...processing 9 of 81
 ...processing 10 of 81
 ...processing 18 of 123
 ...processing 14 of 20
 ...processing 17 of 54
 ...processing 18 of 54
 ...processing 37 of 164
 ...processing 20 of 254
Died at /home/jcornel3/tools/maker/bin/../lib/maker/auto_annotator.pm line 1774.
--> rank=17, hostname=omega
ERROR: Failed while choosing best annotation set
ERROR: Chunk failed at level:4, tier_type:4
FAILED CONTIG:ScCC6lQ_16796;HRSCAF=38896
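
For comparing failures across runs, a log like the ones above can be reduced to a list of failed contigs with a short script. This is a rough sketch, not a MAKER tool: the function name `failed_contigs` is made up, and pairing each FAILED CONTIG line with the most recent "Chunk failed" line is only a heuristic, since output from different MPI ranks is interleaved.

```python
import re

FAILED_RE = re.compile(r"^FAILED CONTIG:(\S+)")
CHUNK_RE = re.compile(r"^ERROR: Chunk failed at level:(\d+), tier_type:(\d+)")


def failed_contigs(log_lines):
    """Return (contig, level, tier_type) for each FAILED CONTIG line.

    level and tier_type come from the closest preceding 'Chunk failed' line,
    or are None when no such line was seen (heuristic pairing only).
    """
    results = []
    last_chunk = None
    for raw in log_lines:
        line = raw.strip()
        chunk = CHUNK_RE.match(line)
        if chunk:
            last_chunk = (int(chunk.group(1)), int(chunk.group(2)))
            continue
        failed = FAILED_RE.match(line)
        if failed:
            level, tier = last_chunk if last_chunk else (None, None)
            results.append((failed.group(1), level, tier))
            last_chunk = None
    return results


if __name__ == "__main__":
    log = [
        " ...processing 1 of 67",
        "ERROR: Chunk failed at level:6, tier_type:0",
        " ...processing 2 of 67",
        "FAILED CONTIG:ScCC6lQ_38465;HRSCAF=64658",
    ]
    for contig, level, tier in failed_contigs(log):
        print(contig, level, tier)
```

Running this over the logs of several runs and comparing the (level, tier_type) pairs per contig would show whether the same scaffolds really fail at different stages each time.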
_________________________________
Greer Dolby, PhD
Postdoctoral Research Scholar
SoLS, Arizona State U.
office: LSE 313, 480.965.7456
website <http://www.greerdolby.org/> | twitter <https://twitter.com/gadolby>
Kusumi Lab <http://kusumi.lab.asu.edu/>

From kapeelc at gmail.com  Fri Jun 22 13:41:58 2018
From: kapeelc at gmail.com (Kapeel Chougule)
Date: Fri, 22 Jun 2018 15:41:58 -0400
Subject: [maker-devel] map_forward=1 not mapping reference ID's to output
 correctly
Message-ID: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>

Hi,

I am trying to update community annotation
<https://de.cyverse.org/dl/d/39D60E88-078D-4CF5-9F3A-D712B714CDD8/community.annotation.gff3>
in the light of new evidence data but my MAKER runs are not keeping all the
genes from the community annotation.

Community annotation feature count:
2 1 bicolor
239969 CDS
266301 exon
51066 five_prime_UTR
34129 gene
47121 mRNA
53708 three_prime_UTR

MAKER gene count->
awk '$3=="gene"{print}' maker_output.all.gff | grep "Sobic*" | wc -l
21105

In the attached maker_opts.ctl file, I set keep_preds=1 and map_forward=1,
which should keep all the community gene models even if they don't have
evidence support. This was explained here:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Updating_annotations_in_light_of_new_data
So I'm not sure why not all of the community gene models are mapped in
the MAKER output.
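
To see exactly which community genes are missing, rather than only comparing totals, the two GFF3 files can be compared by gene name. The following is a minimal sketch under some assumptions: that names carried forward by map_forward show up in the `Name` (or, failing that, `ID`) attribute of gene features, and that both files are plain 9-column GFF3. The function names are illustrative and not part of MAKER.

```python
from collections import Counter


def _columns(gff_lines):
    """Yield the 9 tab-separated columns of each feature line."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:  # skips FASTA or other non-feature lines
            yield cols


def feature_counts(gff_lines):
    """Count features by type (column 3), like `cut -f3 | sort | uniq -c`."""
    return Counter(cols[2] for cols in _columns(gff_lines))


def gene_names(gff_lines):
    """Collect the Name (or ID) attribute of every gene feature."""
    names = set()
    for cols in _columns(gff_lines):
        if cols[2] != "gene":
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        name = attrs.get("Name") or attrs.get("ID")
        if name:
            names.add(name)
    return names


def missing_genes(reference_lines, maker_lines):
    """Return reference gene names that do not appear in the MAKER output."""
    return sorted(gene_names(reference_lines) - gene_names(maker_lines))


if __name__ == "__main__":
    with open("community.annotation.gff3") as ref, open("maker_output.all.gff") as out:
        for name in missing_genes(ref, out):
            print(name)
```

The resulting list of missing names can then be checked against the reference GFF3 to see whether the dropped genes have something in common (for example overlapping loci or unusual feature structure).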

Thanks

Kapeel
--


Kapeel Chougule
Computational Scientist Developer II
One Bungtown Road Cold Spring Harbor, NY 11724
http://www.warelab.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4991 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180622/15c366fc/attachment-0003.obj>

From monica.poelchau at ars.usda.gov  Fri Jun 22 14:04:28 2018
From: monica.poelchau at ars.usda.gov (Poelchau, Monica)
Date: Fri, 22 Jun 2018 20:04:28 +0000
Subject: [maker-devel] [CAUTION: Suspicious Link] map_forward=1 not
 mapping reference ID's to output correctly
In-Reply-To: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>
References: <CA+DOteeTFd06_k5ONYLvn7FpUuv-JDNqp1PCFa9QF0TxDa9iEg@mail.gmail.com>
Message-ID: <D5A4E18F-CFDC-489E-BA1B-FB88FA66C338@ars.usda.gov>

Hi Kapeel,

If you just want your community annotations to replace models in an existing gene set, we have a tool for this:

https://github.com/NAL-i5K/GFF3toolkit

You'd need to run gff3_QC on your annotation files first to make sure your annotations are okay, then use gff3_merge to merge your community annotations with your existing gene set (in gff3 format). If you end up trying this out: we're actively developing the GFF3toolkit, so feel free to post an issue if you notice any problems.

Hth,

Monica



From andremmachado25 at gmail.com  Tue Jun 26 09:36:24 2018
From: andremmachado25 at gmail.com (André Machado)
Date: Tue, 26 Jun 2018 16:36:24 +0100
Subject: [maker-devel] Maker Error : Thread 1 terminated abnormally..
Message-ID: <CAAXcPKC3mnkqP9OU7L9bBLtts4KujCoBrUNieuUfgo+wd-E4Yw@mail.gmail.com>

Hi,


First of all, thanks for your efforts on the Maker pipeline. It's a
tremendous help for people who work with genomes.

For the last 4 days I have been racking my brain over an error, but I am
still without a solution.

I found this old thread:
https://groups.google.com/forum/#!msg/maker-devel/X2-76BH9gvg/rU4kLJ3B6tsJ

It seems to be quite similar, but it doesn't point to a specific solution.

I have run Maker with the test data and everything ran OK. Maker finished
the entire process without errors.

Recently, I have been trying to run my own data on an MPI cluster, but this
error frequently occurs:

Thread 1 terminated abnormally:
../dna.maker.output/mpi_blastdb/dna%2Efa.mpi.1/dna%2Efa.mpi.1.0
--> rank=8, hostname=compute-0-1.local, at ../Analysis/Geno/maker/bin/maker line 1451 thread 1.
--> rank=8, hostname=compute-0-1.local
deleted:0 hits
deleted:0 hits
preparing ab-inits
deleted:0 hits
deleted:0 hits
FATAL: Thread terminated, causing all processes to fail
--> rank=8, hostname=compute-0-1.local
deleted:0 hits


Basically, I'm trying to run Maker with dna.fa, rna.fa, prot.fa and
my_custom_lib_of_repeats.fa, to produce raw gene models which will be used
to train SNAP.


I have already tried several command lines and all gave me the same error.
The only difference between tests was the node where the error occurred:
sometimes compute-0-1.local, other times compute-0-4.local or another
one.

mpiexec -n 63 --hostfile Host maker 1>1.log 2>2.err

mpiexec --hostfile Host maker 1>1.log 2>2.err

mpiexec -mca btl ^openib -n 63 --hostfile Host maker 1>1.log 2>2.err

nohup mpiexec -mca btl ^openib -n 63 --hostfile Host maker -a 1>1.log 2>2.err
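
Since the failing node changes from run to run, it may help to tally which hosts show up in the "--> rank=..., hostname=..." lines of each run's output. The following is only a rough sketch: the function name `failures_by_host` is made up, and it assumes the lines look like the excerpt above. It counts every such line, so a single failure that prints several of them is counted several times.

```python
import re
from collections import Counter

# Matches lines such as: --> rank=8, hostname=compute-0-1.local
RANK_RE = re.compile(r"-->\s*rank=(\d+),\s*hostname=([^\s,]+)")


def failures_by_host(log_lines):
    """Count how many 'rank/hostname' error lines mention each host."""
    hosts = Counter()
    for line in log_lines:
        match = RANK_RE.search(line)
        if match:
            hosts[match.group(2)] += 1
    return hosts


if __name__ == "__main__":
    # 2.err is the stderr file written by the mpiexec commands above.
    with open("2.err") as handle:
        for host, count in failures_by_host(handle).most_common():
            print(host, count)
```

If the counts are spread evenly across nodes over several runs, that would point away from a single bad node and toward something shared between them.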


The log file and the option files are attached below.


Many thanks in advance,


André
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2.log
Type: text/x-log
Size: 38655 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0003.log>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_exe.ctl
Type: application/octet-stream
Size: 1224 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0009.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4548 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0010.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1413 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180626/7b74d074/attachment-0011.obj>