[maker-devel] Transcript & protein fasta sequence id/name collisions

Stein, Joshua steinj at cshl.edu
Tue Jun 12 12:08:19 MDT 2018


Dear Carson and maker-devel group,

In our recent MAKER run, some of the transcript and protein IDs in the FASTA files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF. The problem is that these names are not unique: for example, the transcript ID 'mRNA_4' occurs 45 times, which makes it difficult to determine the corresponding gene IDs. My colleague Kapeel, who is copied, believes this happens when FGENESH results are passed to MAKER as a GFF using the "pred_gff=" parameter.

How do we prevent this from happening in the future (e.g. by telling MAKER to use the "ID=" field for the transcript/protein FASTA IDs)?
Do you have any tips on how to triage the current situation (i.e. figure out which FASTA record corresponds to which gene)? I suppose it is possible to match them up based on the quality metrics in the definition line, assuming these are unique.
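
To make the triage question concrete, here is a minimal sketch of how the colliding names could be listed from the GFF. It assumes only standard GFF3 conventions (nine tab-separated columns, with ID, Parent and Name in the attribute column of the mRNA features). The function names are hypothetical, the example records are invented, and none of this is part of MAKER itself.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def names_to_ids(gff_lines, feature_type="mRNA"):
    """Map each Name= value to the (ID, Parent) pairs of the features carrying it."""
    mapping = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no more feature lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = parse_attributes(cols[8])
        if "Name" in attrs:
            mapping[attrs["Name"]].append((attrs.get("ID"), attrs.get("Parent")))
    return dict(mapping)


def ambiguous_names(mapping):
    """Keep only the names shared by more than one feature."""
    return {name: ids for name, ids in mapping.items() if len(ids) > 1}


# Invented records for illustration only; real IDs and coordinates will differ.
example = [
    "##gff-version 3\n",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=tx-A;Parent=gene-A;Name=mRNA_4\n",
    "chr2\tmaker\tmRNA\t50\t700\t.\t-\t.\tID=tx-B;Parent=gene-B;Name=mRNA_4\n",
    "chr2\tmaker\tmRNA\t1000\t1500\t.\t+\t.\tID=tx-C;Parent=gene-C;Name=mRNA_7\n",
]
mapping = names_to_ids(example)
print(ambiguous_names(mapping))
```

On a real file, the lines could come from open("your.gff") instead of the example list. This only shows which names are shared and which IDs and parent genes carry them; it does not by itself resolve which FASTA record belongs to which gene.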

Thanks,
Josh


Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/




