[maker-devel] AED scores from MAKER pipeline - deterministic or not?
Mark Yandell
myandell at genetics.utah.edu
Tue Sep 8 10:13:32 MDT 2015
Awesome detective work, everybody!
Mark Yandell
Professor of Human Genetics
H.A. & Edna Benning Presidential Endowed Chair
Co-director USTAR Center for Genetic Discovery
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:801-587-7707
________________________________________
From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of Carson Holt [carsonhh at gmail.com]
Sent: Tuesday, September 08, 2015 10:12 AM
To: Cheng, Chia-Yi
Cc: maker-devel
Subject: Re: [maker-devel] AED scores from MAKER pipeline - deterministic or not?
Hi Chia-Yi,
I’m glad to see you found a way around the issue you were seeing. Another solution may be to split up your input genome into several separate jobs, and run each one separately.
Just out of curiosity could you send me the results of these two commands?
df -h /tmp
df -h <directory_where_you_are_running_maker>
A GFFDB.pm lock failure generally means either that your working directory is network mounted and MAKER can’t detect it, or that /tmp is tmpfs. Either one can cause SQLite failures.
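As a complement to the two df commands, here is a minimal Python sketch that reports the file-system type behind a path by reading /proc/mounts-style text. It assumes a Linux host; the function names and the list of "risky" file-system types are illustrative choices, not anything MAKER itself provides.

```python
import os
import posixpath

# File-system types treated as risky for SQLite in this sketch (an assumption,
# based on the network-mount and tmpfs cases mentioned above).
RISKY_FS_TYPES = {"nfs", "nfs4", "cifs", "lustre", "gpfs", "tmpfs"}


def parse_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def fs_type_for(path, mounts_text):
    """Return the fs type of the longest mount point that contains path."""
    path = posixpath.normpath(path)
    best = None
    for mount_point, fs_type in parse_mounts(mounts_text):
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same point shadow an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best[1] if best else None


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as handle:
        mounts = handle.read()
    for candidate in ("/tmp", os.getcwd()):
        fs_type = fs_type_for(os.path.realpath(candidate), mounts)
        flag = "risky for SQLite" if fs_type in RISKY_FS_TYPES else "ok"
        print(candidate, fs_type, flag)
```

Run from the directory where MAKER is launched, this prints one line for /tmp and one for the working directory. Mount points containing spaces are escaped in /proc/mounts and are not handled here.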
Thanks,
Carson
On Sep 8, 2015, at 9:46 AM, Cheng, Chia-Yi <ccheng at jcvi.org> wrote:
Hi Carson,
Thank you for the suggestions. For my previous runs, I set TMP to a non-NFS location and used 4 or 8 CPUs for MPI. The MPI log file contains a consistent error,
DBD::SQLite::db selectcol_arrayref failed: database is locked at maker-2.31.8/bin/../lib/GFFDB.pm line 525.
which may be associated with the IO error you pointed out. This is likely caused by the MPI setup at our institute, so my teammate Vivek suggested running without MPI. That took about a day, compared to ~6 hours with MPI, but it produced no errors and the AED scores from the two runs were identical. The command for the successful runs was:
maker -R -quiet -TMP /tmp -fix_nucleotides
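For readers unfamiliar with the "database is locked" message, the following Python sketch reproduces the same SQLite error class with two connections to one database file. It only illustrates what the error means. It does not reproduce how MAKER, MPI, or a network file system triggers it, and the function name is hypothetical.

```python
import os
import sqlite3
import tempfile


def provoke_lock_error():
    """Return the error text SQLite raises when a reader hits a locked file."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "demo.db")

        # isolation_level=None puts the connection in autocommit mode, so the
        # explicit BEGIN EXCLUSIVE below is the only open transaction.
        writer = sqlite3.connect(path, isolation_level=None)
        writer.execute("CREATE TABLE t (x INTEGER)")
        writer.execute("BEGIN EXCLUSIVE")
        writer.execute("INSERT INTO t VALUES (1)")

        # timeout=0 makes the second connection give up immediately instead
        # of waiting for the lock to clear.
        reader = sqlite3.connect(path, timeout=0)
        try:
            reader.execute("SELECT x FROM t").fetchall()
            return None
        except sqlite3.OperationalError as exc:
            return str(exc)
        finally:
            reader.close()
            writer.execute("ROLLBACK")
            writer.close()


if __name__ == "__main__":
    print(provoke_lock_error())
```

On a local disk the lock clears as soon as the writer finishes. On a file system where locking is unreliable, the same error can persist or appear spuriously, which is consistent with the advice earlier in the thread.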
It looks like this approach has resolved the issue. Please feel free to post this update to the Google group. Again, thank you for your help.
Best,
Chia-Yi
From: Carson Holt <carsonhh at gmail.com>
Date: Friday, September 4, 2015 at 2:43 PM
To: Cheng Chia-Yi <ccheng at jcvi.org>
Subject: Re: [maker-devel] AED scores from MAKER pipeline - deterministic or not?
Hi Chia-Yi,
I think I found the issue based on the differences between the GFF3 files. MAKER uses a number of intermediate files to store data as it progresses (stored in regional chunks). It looks like you had an IO error in one of the runs and one of these files was likely empty (note the attached image with the circled region where all EST/mRNA data just drops out; this only happens in one of the files). It didn’t kill the job (NFS errors rarely do; it’s one of their optimizations: they always return success and assume the operation will complete eventually). You can run again with MAKER's -a option to rebuild the data output.
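One way to look for this kind of evidence drop-out without a genome browser is to count features per source and per window in both GFF3 files and list the windows where one run has features and the other has none. This is a sketch under stated assumptions: the 100 kb window size and the function names are arbitrary choices, and it is not the method used to produce the attached image.

```python
from collections import Counter


def evidence_counts(gff3_lines, window=100_000):
    """Count features per (seqid, source, window index) in GFF3 lines."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # the sequence section follows; no more features
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        seqid, source, start = cols[0], cols[1], int(cols[3])
        counts[(seqid, source, (start - 1) // window)] += 1
    return counts


def dropouts(counts_a, counts_b):
    """Return keys that have features in exactly one of the two runs."""
    keys = set(counts_a) | set(counts_b)
    return sorted(
        key for key in keys
        if (counts_a.get(key, 0) == 0) != (counts_b.get(key, 0) == 0)
    )
```

Passing the open file handles of the two Chr1 GFF3 outputs to evidence_counts and then calling dropouts on the results would list, for example, every window where est2genome features exist in only one run.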
Make sure your TMP= environment variable is not pointing to an NFS mounted location (that would exacerbate issues). You also may need to scale back the number of CPUs you are running using MPI in order to reduce the IO burden.
Thanks,
Carson
<Screen Shot 2015-09-04 at 11.17.09 AM.png>
On Sep 4, 2015, at 9:06 AM, Cheng, Chia-Yi <ccheng at jcvi.org> wrote:
Hi Carson,
Thank you for clearing that up. The two MAKER-generated GFF files can now be downloaded from iPlant:
http://de.iplantcollaborative.org/dl/d/0C9CBD8F-9B6E-40F1-A2FA-4F7AC7AAE4B5/Chr1.gff.20150831
http://de.iplantcollaborative.org/dl/d/4C73FD9D-BE7E-4937-84D5-1D7F32196B67/Chr1.gff.repeat_20150831
The control files for these two runs and a list of 818 models with different AED scores are attached to this email.
Please let me know if you need any other information. Thank you so much for your help.
Best,
Chia-Yi
From: Carson Holt <carsonhh at gmail.com>
Date: Thursday, September 3, 2015 at 6:40 PM
To: Cheng Chia-Yi <ccheng at jcvi.org>
Subject: Re: [maker-devel] AED scores from MAKER pipeline - deterministic or not?
Hi Chia-Yi,
What I really need are the MAKER-produced GFF3 outputs from both runs (the individual contig files with the FASTA at the end). Just Chr1 is sufficient.
Thanks,
Carson
On Aug 31, 2015, at 10:20 AM, Cheng, Chia-Yi <ccheng at jcvi.org> wrote:
Hi Carson,
Please find the 1142 gene models with different AED from both runs. Due to the size, please download the annotated GFF3 and fasta files from iPlant,
http://de.iplantcollaborative.org/dl/d/2C1901E6-7F52-4264-9CB7-AB72CEF6BD67/TAIR10.protein_coding_loci_27415.gff
http://de.iplantcollaborative.org/dl/d/44A6AD38-E408-4DB7-AC32-6689D3D1AC7A/TAIR10.protein_coding_loci_27415.fasta
The single_exon= was set to zero in both sets. The two runs have used identical control files which were also attached. I thought single_exon= only mattered for generating annotation and didn’t realize it would also affect AED calculation.
Thank you.
Chia-Yi
From: Carson Holt <carsonhh at gmail.com>
Date: Monday, August 31, 2015 at 11:08 AM
To: Cheng Chia-Yi <ccheng at jcvi.org>
Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] AED scores from MAKER pipeline - deterministic or not?
I would have to see the actual GFF3 files (full file including the FASTA at the end). Give me both GFF3 files and the coordinates of the gene in question. My first guess is that you had the single_exon= filter set to different values on each run. The gene in question is an unspliced single exon gene (based on the QI), your primary piece of evidence appears to be a single exon EST, and the only value that changes in the QI is the exon overlap. Single exon evidence will be ignored by default for the AED calculation unless you have single_exon set to 1.
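To make the effect concrete, here is a minimal Python sketch of the published nucleotide-level AED definition, AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases overlapped by the annotation and SP is the fraction of annotation bases overlapped by the evidence. The use_single_exon flag is a hypothetical stand-in modeled on the description above; it is not MAKER's own code, and MAKER's implementation may differ in detail.

```python
def covered_bases(intervals):
    """Return the set of 1-based positions covered by inclusive intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(annotation_exons, evidence_alignments, use_single_exon=False):
    """Compute AED for one transcript.

    annotation_exons: list of (start, end) exon intervals.
    evidence_alignments: list of alignments, each a list of (start, end)
    intervals. Single-interval alignments are skipped unless
    use_single_exon is True (hypothetical flag for illustration).
    """
    annotation = covered_bases(annotation_exons)
    evidence = set()
    for alignment in evidence_alignments:
        if len(alignment) == 1 and not use_single_exon:
            continue
        evidence |= covered_bases(alignment)

    overlap = len(annotation & evidence)
    if overlap == 0:
        return 1.0  # no supporting evidence at all
    sensitivity = overlap / len(evidence)
    specificity = overlap / len(annotation)
    return 1.0 - (sensitivity + specificity) / 2.0


# A single-exon gene supported only by a single-exon alignment:
gene = [(100, 443)]
est = [[(100, 443)]]
print(aed(gene, est, use_single_exon=True))   # 0.0
print(aed(gene, est, use_single_exon=False))  # 1.0
```

Under these assumptions, the same gene flips between AED 0.00 and 1.00 depending only on whether its single-exon evidence is counted, which is the pattern that prompted the first guess above.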
Thanks,
Carson
On Aug 31, 2015, at 8:47 AM, Cheng, Chia-Yi <ccheng at jcvi.org> wrote:
Hello MAKER team,
We at JCVI have been using MAKER (2.31.8) to calculate the AED of Arabidopsis gene models. We provided the annotation set as ‘model_gff’ with evidence files in ‘protein_gff’ and ‘est_gff’. All the other settings were default. One issue I’ve noticed is that the AED scores do not seem to be deterministic. When I compare the AED scores from two runs using identical control files, ~1,000 (out of 35,385) gene models have different AED scores. The difference between the two sets of AED scores ranges from 0.01 to 1.00.
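A comparison like this can be scripted. The Python sketch below collects the _AED attribute of each mRNA feature from two GFF3 outputs and reports the IDs whose scores differ. It assumes the scores sit on mRNA lines as ID=...;_AED=... attributes; the function names and the tolerance are illustrative.

```python
def aed_by_id(gff3_lines, feature_type="mRNA"):
    """Map feature ID to its _AED value for features of the given type."""
    scores = {}
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # sequence section, no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        if "ID" in attrs and "_AED" in attrs:
            scores[attrs["ID"]] = float(attrs["_AED"])
    return scores


def differing_aed(run1, run2, tolerance=0.005):
    """Return {ID: (aed_run1, aed_run2)} for IDs whose scores differ."""
    shared = run1.keys() & run2.keys()
    return {
        key: (run1[key], run2[key])
        for key in sorted(shared)
        if abs(run1[key] - run2[key]) > tolerance
    }
```

Since AED is written with two decimals, a tolerance of 0.005 treats identical printed values as equal and flags any difference of 0.01 or more.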
I looked into several gene models with larger differences, e.g. AED = 0.00 in run 1 and AED = 1.00 in run 2, and noticed a disagreement in the QI:
Run 1: _AED=0.00;_eAED=-0.00;_QI=0|-1|0|1|-1|0|1|0|344
Run 2: _AED=1.00;_eAED=1.00;_QI=0|-1|0|0|-1|0|1|0|344
The discrepancy in the 4th column seems to suggest the evidence file was not used properly in run 2. I’m not sure what may have caused this, as both runs used the same input. A snapshot of the evidence files is pasted at the end of the email in case it is needed.
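The QI comparison can also be done mechanically. The Python sketch below splits a _QI string into its nine fields and reports which ones differ between two runs. The field names paraphrase the QI description in the MAKER documentation as I understand it and are labels chosen for this sketch, not identifiers used by MAKER.

```python
# Labels for the nine QI columns (paraphrased; an assumption of this sketch).
QI_FIELDS = (
    "five_prime_utr_length",
    "splice_sites_confirmed_by_est",
    "exons_overlapping_est",
    "exons_overlapping_est_or_protein",
    "splice_sites_confirmed_by_ab_initio",
    "exons_overlapping_ab_initio",
    "exon_count",
    "three_prime_utr_length",
    "protein_length",
)


def parse_qi(qi_string):
    """Split a _QI value such as '0|-1|0|1|-1|0|1|0|344' into named fields."""
    values = qi_string.split("|")
    if len(values) != len(QI_FIELDS):
        raise ValueError("expected 9 QI fields, got %d" % len(values))
    return dict(zip(QI_FIELDS, (float(value) for value in values)))


def qi_differences(qi_a, qi_b):
    """Return {field: (value_a, value_b)} for fields that differ."""
    a, b = parse_qi(qi_a), parse_qi(qi_b)
    return {name: (a[name], b[name]) for name in QI_FIELDS if a[name] != b[name]}


print(qi_differences("0|-1|0|1|-1|0|1|0|344", "0|-1|0|0|-1|0|1|0|344"))
```

For the two QI strings quoted above, only the 4th field differs (1 in run 1, 0 in run 2), which matches the observation that evidence overlap was lost in run 2.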
Please let me know if more info is needed. Any help is appreciated. Thank you.
Chia-Yi
RNA-seq evidence file:
Chr1	assembler-aerial2_pasa	cDNA_match	3624	5927	.	+	.	ID=aerial2_align_161343;Target=asmbl_1 1082 1234 +%2Casmbl_1 692 1081 +%2Casmbl_1 572 691 +%2Casmbl_1 1 290 +%2Casmbl_1 291 571 +%2Casmbl_1 1235 1723 +
Chr1	assembler-aerial2_pasa	match_part	3624	3913	.	+	.	ID=aerial2_align_161343-1;Parent=aerial2_align_161343
Chr1	assembler-aerial2_pasa	match_part	3996	4276	.	+	.	ID=aerial2_align_161343-2;Parent=aerial2_align_161343
EST evidence file:
Chr1	est2genome	expressed_sequence_match	5470	5899	2150	-	.	ID=Chr1:hit:213:3.2.0.0;Name=gi|19829901|gb|AV795918|RAFL08-19-M04
Chr1	est2genome	match_part	5470	5899	2150	-	.	ID=Chr1:hsp:500:3.2.0.0;Parent=Chr1:hit:213:3.2.0.0;Target=gi|19829901|gb|AV795918|RAFL08-19-M04 2 431 +;Gap=M430
Protein evidence file:
Chr1	protein2genome	protein_match	3760	5284	727	+	.	ID=Chr1:hit:202:3.10.0.0;Name=UniRef90_M4EWW1
Chr1	protein2genome	match_part	3760	3913	727	+	.	ID=Chr1:hsp:488:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 1 50;Gap=M31 D1 M19 F1
Chr1	protein2genome	match_part	3996	4276	727	+	.	ID=Chr1:hsp:489:3.10.0.0;Parent=Chr1:hit:202:3.10.0.0;Target=UniRef90_M4EWW1 51 144;Gap=R1 M23 D1 M28 D1 M36 I2 M5
<maker_bopts.ctl><maker_exe.ctl><maker_opts.ctl><1142_models.diff_AED.gff>
<818.diff_AED.20150831><maker_bopts.ctl><maker_exe.ctl><maker_opts.ctl>
<Screen Shot 2015-09-04 at 11.17.09 AM.png>