[maker-devel] est_forward and conflicting names

Thu May 8 16:41:41 MDT 2014

Interesting. Thanks for the clarification. I’m working on a plant
mitochondrion, and so as far as I know, there’s no alternative splicing. My
protein FASTA file is composed of the protein sequences of ~100 species
downloaded from GenBank. It looks like this:

>cox1|lcl|KJ461445.1_cdsid_AHY20320.1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=AHY20320.1] [location=complement(59212..60795)]
…
>cox1|lcl|EU534409.1_cdsid_ACA62629.1 [gene=cox1] [protein=cox1] [protein_id=ACA62629.1] [location=245282..246856]
…
>cox1|lcl|NC_023103.1_cdsid_YP_008964124.1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=YP_008964124.1] [location=join(317824..318438,319511..320368)]
…

I’m not sure that I actually want the fancy behaviour that you describe,
though it probably wouldn’t hurt anything. Will this FASTA format trigger
the fancy behaviour?
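
To make the question concrete, here is a minimal Python sketch of how the headers above break down, assuming the FASTA ID is the first whitespace-delimited token and the gene name is whatever sits in the bracketed [gene=...] field. The helper names parse_header and to_gene_tag are hypothetical, and whether MAKER itself recognises the bracketed GenBank form is not something this sketch establishes. The second helper just rewrites a header into the plain ">ID gene=NAME" shape used in Carson's example below.

```python
import re


def parse_header(header):
    """Split a FASTA header line into (sequence ID, gene name or None).

    Assumption: the ID is the first whitespace-delimited token, and the
    gene name is stored GenBank-style as "[gene=NAME]".
    """
    text = header.lstrip(">").strip()
    seq_id = text.split(None, 1)[0]
    match = re.search(r"\[gene=([^\]]+)\]", text)
    return seq_id, (match.group(1) if match else None)


def to_gene_tag(header):
    """Rewrite a "[gene=NAME]" header into the ">ID gene=NAME" form.

    Headers without a bracketed gene field are reduced to ">ID".
    """
    seq_id, gene = parse_header(header)
    return f">{seq_id} gene={gene}" if gene else f">{seq_id}"


if __name__ == "__main__":
    example = (
        ">cox1|lcl|EU534409.1_cdsid_ACA62629.1 [gene=cox1] [protein=cox1] "
        "[protein_id=ACA62629.1] [location=245282..246856]"
    )
    print(to_gene_tag(example))
```

Running something like this over the whole protein file would drop the other bracketed fields, so it is only worth doing if the plain gene= form turns out to be required.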

Cheers,
Shaun

http://sjackman.ca

On 8 May 2014 15:33, Carson Holt <carsonhh at gmail.com> wrote:

> When moving transcripts onto a new assembly, you may have multiple
> transcripts of the same gene. Because the transcript name should be the
> FASTA ID, MAKER has no way to know that they go together when moving the
> models forward. You can add gene= to the header to tell MAKER that these
> transcripts belong to the same gene. They will then be grouped, and you
> recover all splice forms as a group.
>
> Example:
>
> >SMEDT_00004   gene=dpp
> AAAAAAA
>
> >SMEDT_00005 gene=dpp
> AAAAAAA
>
> --Carson
>
>
>
> From: Shaun Jackman <sjackman at gmail.com>
> Reply-To: Shaun Jackman <sjackman at gmail.com>
> Date: Thursday, May 8, 2014 at 4:26 PM
> To: Carson Holt <carsonhh at gmail.com>
> Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
> Subject: Re: [maker-devel] est_forward and conflicting names
>
> Hi, Carson. Could you give an example of how to add gene_id= to the
> header of the FASTA file? I’m not clear on what you mean by this. In the
> FASTA header, what portion is the transcript name, and what portion is the
> gene name?
>
> Cheers,
> Shaun
>
> http://sjackman.ca
>
>
> On 2 May 2014 11:55, Carson Holt <carsonhh at gmail.com> wrote:
>
>> Whichever has the best AED score, I believe, but you can add gene_id= to
>> the header of each FASTA entry to ensure MAKER doesn't try to cluster
>> unrelated transcripts into a single gene.  Then the transcript name and
>> gene name will be guaranteed to match up.
>>
>> --Carson
>>
>>
>> From: Shaun Jackman <sjackman at gmail.com>
>> Date: Wednesday, April 30, 2014 at 5:25 PM
>> To: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>> Subject: [maker-devel] est_forward and conflicting names
>>
>> Hi, Carson.
>>
>> I’ve downloaded a number of genes from GenBank using Entrez Direct, which
>> I’m using with est and protein to annotate a plant mitochondrion. Most
>> of these reference sequences have sensible and consistent gene names, and
>> so I’m using est_forward to retain the gene names. This workflow is
>> working well for me. Some of the genes pulled in from GenBank have less
>> useful names like orf1234 or other numeric IDs. When multiple evidence
>> sequences map to the same location, how does est_forward choose which
>> name to use? If it’s chosen arbitrarily, could it be possible to choose the
>> most common name instead?
>>
>> Thanks,
>> Shaun
>>
>>