[maker-devel] collecting protein sequences as evidences

Quanwei Zhang qwzhang0601 at gmail.com
Thu Feb 2 14:05:53 MST 2017


Thank you for your suggestions. I will need to run some tests.

Best
Quanwei

2017-02-02 14:24 GMT-05:00 Michael Campbell <michael.s.campbell1 at gmail.com>:

> It is nice to have an outgroup of some kind. In your case, human would
> serve that function. The issue that you might have with distantly related
> proteins is odd BLAST alignments that may lead to merged genes. You
> generally don’t get many false positives because the alignment parameters
> require a pretty good match, which is unlikely to happen by chance. You
> could limit Swiss-Prot to mammals if you wanted to.
>
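One way to restrict a Swiss-Prot FASTA file to chosen species, along the lines suggested above, is to filter on the taxon ID in each header. This is a minimal Python sketch, not part of MAKER. It assumes the current UniProt FASTA header layout with an `OX=<taxon id>` field; the function name, toy records, and accessions are made up for illustration. Restricting to all mammals would need a full list of mammalian taxon IDs, which is not shown here.

```python
import re
from typing import Iterable, Iterator, Set

# Matches the NCBI taxon ID in a UniProt-style header, e.g. "OX=9606".
OX_PATTERN = re.compile(r"\bOX=(\d+)")


def filter_fasta_by_taxon(lines: Iterable[str], keep_taxa: Set[str]) -> Iterator[str]:
    """Yield the lines of every FASTA record whose header has an OX= ID in keep_taxa."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            match = OX_PATTERN.search(line)
            # Records without an OX= field are dropped.
            keep = match is not None and match.group(1) in keep_taxa
        if keep:
            yield line


# Toy records with made-up accessions and sequences (illustration only).
TOY_FASTA = [
    ">sp|X00001|TOY1_HUMAN Toy protein OS=Homo sapiens OX=9606 PE=1 SV=1",
    "MKTAYIAKQR",
    ">sp|X00002|TOY2_YEAST Toy protein OS=Saccharomyces cerevisiae OX=559292 PE=1 SV=1",
    "MSTNPKPQRK",
    ">sp|X00003|TOY3_MOUSE Toy protein OS=Mus musculus OX=10090 PE=1 SV=1",
    "MADEEKLPPG",
]

# NCBI taxon IDs for human, mouse, and rat.
KEEP = {"9606", "10090", "10116"}

kept = list(filter_fasta_by_taxon(TOY_FASTA, KEEP))
print("\n".join(kept))
```

Running the same filter with and without the extra species gives two protein sets that can be compared as evidence on the same scaffolds.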
> Sometimes I’ll try different combinations of evidence on a few large
> scaffolds and look at the results in a browser to get a feel for what is
> going to work best.
>
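To try evidence combinations on a few large scaffolds, the scaffolds first have to be pulled out of the assembly. The following is a minimal Python sketch under simple assumptions: the assembly is a plain FASTA file small enough to hold in memory, and the sequence ID is the first word of each header. The function names and the toy genome are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence_id: sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def largest_scaffolds(records: Dict[str, str], n: int) -> List[Tuple[str, str]]:
    """Return the n longest (id, sequence) pairs, longest first."""
    return sorted(records.items(), key=lambda item: len(item[1]), reverse=True)[:n]


# Toy assembly (illustration only).
TOY_GENOME = [
    ">scaffold_1 toy",
    "ACGTACGTAC",
    "GT",
    ">scaffold_2",
    "ACG",
    ">scaffold_3",
    "ACGTACGTACGTACGTAC",
]

subset = largest_scaffolds(read_fasta(TOY_GENOME), 2)
for seq_id, seq in subset:
    print(f">{seq_id}\n{seq}")
```

The printed subset can be saved as a small FASTA file and annotated once per evidence combination, so the results can be compared side by side in a browser.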
> Thanks,
> Mike
>
> On Feb 2, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
>
> Thank you for your suggestions. So it does not matter if there are
> redundancies among protein sequences from different sources?
> I am trying to annotate a *rodent* genome, and plan to collect protein
> sequences of human, mouse, and rat from both UniProt and NCBI (I also
> have RNA-seq data). I chose these species because they are close to the
> species that I am working on and they are well annotated. But I have seen
> it said that if we choose protein sequences from one lineage, genes that
> are missing in that lineage will not be detected. And in the following
> paper, the authors say they used the entire Swiss-Prot database as the
> input. What do you think about this? Should I include protein sequences
> from more species (like all Eukaryota)? I think it could help us identify
> more genes, but on the other hand won't it also give us more false
> positives?
>
> This paper used the entire SwissProt database as the input.
> Insights into the evolution of longevity from the bowhead whale genome.
> 2015.  *Cell Rep* *10*(1): 112-122.
>
> Thanks
>
> Best
> Quanwei
>
> 2017-01-31 15:57 GMT-05:00 Michael Campbell <michael.s.campbell1 at gmail.com>:
>
>> Hi Quanwei,
>>
>> (1) When I use UniProt, I use Swiss-Prot and not TrEMBL.
>> (2) I don’t merge files together. I just pass them all to MAKER as a
>> comma-separated list.
>>
>> Thanks,
>> Mike
>>
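The comma-separated list described above can be sketched as follows. This is a small Python helper, not part of MAKER, that builds the line for the control file. It assumes the protein evidence option is named `protein=` as in the `maker_opts.ctl` file generated by `maker -CTL`; the helper name and the file names are hypothetical.

```python
from typing import Sequence


def protein_option(fasta_paths: Sequence[str]) -> str:
    """Build a 'protein=' control-file line from several FASTA paths."""
    if not fasta_paths:
        raise ValueError("at least one protein FASTA file is required")
    for path in fasta_paths:
        # A comma inside a path would be read as a list separator.
        if "," in path:
            raise ValueError(f"path contains a comma: {path!r}")
    return "protein=" + ",".join(fasta_paths)


# Hypothetical file names, for illustration only.
line = protein_option([
    "uniprot_sprot_human.fasta",
    "uniprot_sprot_mouse.fasta",
    "ncbi_rat_proteins.fasta",
])
print(line)
```

Keeping the sources in separate files, rather than merging them, also makes it easy to drop one source later by editing the list.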
>> > On Jan 31, 2017, at 12:36 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> >
>> > I wonder what the best way is to collect protein sequences for gene
>> > annotation of a de novo genome assembly.
>> > (1) My first choice is to get protein sequences of human and mouse from
>> > UniProt. At this step, I am not clear whether I should download the
>> > reviewed ones (i.e., Swiss-Prot) or the automatically annotated ones
>> > (i.e., TrEMBL).
>> > (2) On the other hand, I also have protein sequences from NCBI. Should
>> > I simply merge those FASTA files? Does it matter if there are
>> > redundancies? Also, protein sequences from different sources may not be
>> > of the same quality. Do I need to do anything before I integrate
>> > protein sequences from different sources?
>> >
>> > Many thanks
>> >
>> > Best
>> > Quanwei