[maker-devel] maker 2.22-beta: identical names and sequences repeated in maker.all.proteins fasta - attached FASTA, gff3
Carson Holt
carsonhh at gmail.com
Mon Mar 26 11:22:15 MDT 2012
Thanks for the file. I now see that the issue is caused by repeated model
entries in the maker_gff file (see the following example lines from the
input GFF3).
scaffold09875 maker mRNA 255 1019 . + . ID=maker-scaffold09875-est_gff_Cufflinks-gene-0.0-mRNA-1;Parent=maker-scaffold09875-est_gff_Cufflinks-gene-0.0;
scaffold09875 model_gff:maker match 255 1019 . + . ID=scaffold09875:hit:419759:0_0;Name=maker-scaffold09875-est_gff_Cufflinks-gene-0.0-mRNA-1;
During recent MAKER updates, it was requested that MAKER keep the model_gff
features used from a previous run in the output as a reference annotation,
so you can still see them even when they are not chosen in the second round.
However, this has an unexpected effect on repeated reruns that use MAKER
output as the new input: both entries get interpreted as model_gff, so both
end up in the results of the rerun (and are duplicated again each round).
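To check whether a given set of merged FASTA files is affected, the repeated entries can be listed directly from the headers. This is only a minimal sketch: the function name duplicated_fasta_ids is made up for illustration (it is not a MAKER script), and it assumes the ID is the first whitespace-delimited token on each ">" line.

```shell
# Hypothetical helper (not part of MAKER): print every FASTA ID that
# appears on more than one header line. The ID is taken to be the first
# whitespace-delimited token after the leading ">".
duplicated_fasta_ids() {
    grep '^>' "$1" | awk '{ print substr($1, 2) }' | sort | uniq -d
}

# Example usage (placeholder file name, substitute your own):
# duplicated_fasta_ids your.all.maker.proteins.fasta
```

An empty result means no header ID occurs twice in that file.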
I've fixed this in the developer's release (it will be a couple of days
until it hits the download page as a beta release), but in the meantime just
remove the model_gff:maker entries from the input GFF3 and it will work
as expected.
To do that, use this command:
grep -v "model_gff:maker" \
  Msex05162011.genome.all.maker-2.22-15Feb2012.gff3 \
  > filtered_Msex05162011.genome.all.maker-2.22-15Feb2012.gff3
Then put filtered_Msex05162011.genome.all.maker-2.22-15Feb2012.gff3 as
your maker_gff file.
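Note that grep -v drops any line containing the string anywhere on the line. A slightly stricter variant matches only on the source column (column 2). The following is a minimal sketch, assuming a tab-delimited GFF3 and that awk is available; the function name filter_model_gff is made up for illustration and is not part of MAKER.

```shell
# Hypothetical helper (not part of MAKER): drop features whose source
# column (column 2) is exactly "model_gff:maker". Everything else is
# passed through unchanged, including "##" pragma lines and any lines
# that have no tab-separated second column.
filter_model_gff() {
    awk -F'\t' '$2 != "model_gff:maker"' "$1"
}

# Example usage (file names taken from this thread):
# filter_model_gff Msex05162011.genome.all.maker-2.22-15Feb2012.gff3 \
#     > filtered_Msex05162011.genome.all.maker-2.22-15Feb2012.gff3
```

For this file both approaches should remove the same entries; the column match just avoids touching a line that happens to mention the string in its attributes.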
I also recommend that you delete any files with a .db extension from the
MAKER output directory (there will be only one there). That will make extra
sure that the GFF3 file index gets rebuilt from the new file.
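If you want to see what would be removed before deleting anything, something like the sketch below lists the candidates first. It assumes the .db file sits at the top level of the output directory and that your find supports -maxdepth (GNU and BSD find both do); the function name list_db_files and the path in the usage lines are placeholders.

```shell
# Hypothetical helper: list *.db files directly inside a directory
# (no recursion into subdirectories) so they can be reviewed before
# deletion.
list_db_files() {
    find "$1" -maxdepth 1 -type f -name '*.db'
}

# Example usage (replace the placeholder with your own output directory):
# list_db_files /path/to/your.maker.output
# list_db_files /path/to/your.maker.output | while IFS= read -r f; do rm -- "$f"; done
```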
Note: I also noticed that Msex05162011.genome.cegma.gff is not in GFF3
format (it is in ZFF format). It would work for training SNAP but will
not work with MAKER.
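A quick sanity check can catch this kind of format mix-up before a run. The sketch below looks only at the first non-comment, non-blank line and tests whether it has nine tab-separated columns with integer start and end coordinates. It is not a GFF3 validator, and the function name looks_like_gff3 is made up for illustration. A ZFF file typically begins with a ">" definition line, so it fails this test.

```shell
# Hypothetical helper: exit status 0 if the first data line of the file
# has 9 tab-separated columns with integer start/end (columns 4 and 5),
# and exit status 1 otherwise. This is a rough sniff test only.
looks_like_gff3() {
    awk -F'\t' '
        /^#/ || /^[ \t]*$/ { next }   # skip comments, pragmas, blank lines
        {
            ok = (NF == 9 && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/)
            found = 1
            exit                      # only the first data line is checked
        }
        END { if (found && ok) exit 0; exit 1 }
    ' "$1"
}

# Example usage (file name taken from this thread):
# looks_like_gff3 Msex05162011.genome.cegma.gff || echo "does not look like GFF3"
```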
Thanks,
Carson
On 12-03-26 1:02 PM, "Sanjay Chellapilla" wrote:
>
>
>----- Original Message -----
>> Since you are using it as input, I'll need to see this file:
>>
>>
>> /home/sanjay/manduca_sexta/maker/maker-runs-2.22/Msex05162011.genome.all.maker-2.22-15Feb2012.gff3
>>
>> Thanks,
>> Carson
>>
>>
>>
>> On 12-03-26 12:42 PM, "Sanjay Chellapilla"
>> wrote:
>>
>> >Attached
>> >"Msex-maker-2.22-identical-repeated-proteins-input-files.tar.bz2"
>> >containing 3 files
>> >
>> >maker-2.22_opts.ctl.28Feb2012
>> >est_gff = baylor_cufflinks_transcripts_gtf_no_G14G15.gff3
>> >model_gff = Msex05162011.genome.cegma.gff
>> >
>> >The maker_gff (Msex05162011.genome.all.maker-2.22-15Feb2012.gff3) is
>> >633MB so I didn't include it in this message. Please let me know if
>> >you'd want to see it - I'll send it separately.
>> >
>> >Thank you,
>> >Sanjay.
>> >
>> >----- Original Message -----
>> >> Could you send me the files you are passing to the est_gff, model_gff,
>> >> or maker_gff options?
>> >>
>> >> Also could you send me your MAKER control files?
>> >>
>> >> Thanks,
>> >> Carson
>> >>
>> >>
>> >>
>> >>
>> >> On 12-03-14 1:26 PM, "Sanjay Chellapilla"
>> >> wrote:
>> >>
>> >> >Hi Carson,
>> >> >
>> >> >Sorry I forgot to attach files showing the issue. Attached zip
>> >> >containing one such maker proteins fasta file and corresponding
>> >> >maker gff3 for scaffold00126 having identical repeated sequence
>> >> >">maker-scaffold00126-est_gff_Cufflinks-gene-2.0-mRNA-1".
>> >> >
>> >> >Thank you.
>> >> >
>> >> >----- Forwarded Message -----
>> >> >> Hi Carson,
>> >> >>
>> >> >> I ran maker-2.22-beta a total of 3 times with the same evidence,
>> >> >> to annotate the Manduca.sexta genome, each time using maker-gff3
>> >> >> for the re-annotation run and gene-predictors SNAP, Augustus trained
>> >> >> on maker-gff3 from the previous run. At the end of the third run,
>> >> >> I used fasta_merge script to obtain the various fasta files.
>> >> >> I notice that some sequences are repeated in the maker.all
>> >> >> transcripts/proteins fasta files. I found 34 repeated out of
>> >> >> 16128 transcripts/proteins, so there are actually only 16094 unique
>> >> >> sequences. Could this be related to the "repeated genes" issue
>> >> >> from the strange cufflinks cuffmerge gff3 that's used as input
>> >> >> to maker - we discussed this back in January when we first came
>> >> >> across and I had sent you a portion of the gff3 where we saw this,
>> >> >> and then you recommended trying maker-2.22-beta where this was fixed?
>> >> >> Naturally this also causes repeated short-IDs created using the
>> >> >> maker_map_ids, map_fasta_ids, map_gff_ids scripts.
>> >> >>
>> >> >> Thanks,
>> >> >> Sanjay.