[maker-devel] collecting protein sequences as evidences
Michael Campbell
michael.s.campbell1 at gmail.com
Thu Feb 2 12:24:04 MST 2017
It is nice to have an outgroup of some kind. In your case human would serve that function. The issue you might have with distantly related proteins is odd BLAST alignments that may lead to merged genes. You generally don’t get many false positives because the alignment parameters require a pretty good match, which is unlikely to happen by chance. You could limit Swiss-Prot to mammals if you wanted to.
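One way to restrict a Swiss-Prot FASTA download to chosen species is to filter on the `OX=` taxon ID that UniProt puts in each FASTA header. The following is only a rough sketch: the function names and the example records are made up for illustration, and the ID set shown covers just human (9606), mouse (10090) and rat (10116). The header carries only the organism's own taxon ID, so a mammal-wide filter would need the full list of mammalian IDs from some other source.

```python
import re
from typing import Iterable, Iterator, Set, Tuple

# UniProt FASTA headers carry the NCBI taxon ID as "OX=<digits>".
OX_PATTERN = re.compile(r"\bOX=(\d+)\b")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None and line.strip():
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def filter_by_taxon(lines: Iterable[str], keep_taxa: Set[int]) -> Iterator[Tuple[str, str]]:
    """Keep only records whose OX= taxon ID is in keep_taxa (hypothetical helper)."""
    for header, seq in read_fasta(lines):
        match = OX_PATTERN.search(header)
        if match and int(match.group(1)) in keep_taxa:
            yield header, seq


# Made-up records in UniProt header layout, for illustration only.
example = """>sp|X00001|FAKE1_HUMAN Made-up protein OS=Homo sapiens OX=9606 GN=FAKE1 PE=1 SV=1
MKT
AAL
>sp|X00002|FAKE2_YEAST Made-up protein OS=Saccharomyces cerevisiae OX=559292 GN=FAKE2 PE=1 SV=1
MSS
>sp|X00003|FAKE3_MOUSE Made-up protein OS=Mus musculus OX=10090 GN=FAKE3 PE=1 SV=1
MGG
""".splitlines()

kept = list(filter_by_taxon(example, {9606, 10090, 10116}))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

On a real download the same function could be fed an open file handle instead of the in-memory example.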
Sometimes I’ll try different combinations of evidence on a few large scaffolds and look at the results in a browser to get a feel for what is going to work best.
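For anyone who wants to script that kind of comparison, here is a small sketch that enumerates combinations of protein files and writes each one as a comma-separated `protein=` value, the way the files are passed to MAKER earlier in this thread. The file names and the helper name are hypothetical, and each printed line would still have to be placed in its own copy of maker_opts.ctl by hand or by another script.

```python
from itertools import combinations
from typing import Dict, Iterator, Tuple


def protein_option_lines(sources: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    """Yield (label, 'protein=...') for every non-empty combination of sources."""
    names = sorted(sources)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Files are not merged; they are listed separated by commas.
            value = ",".join(sources[name] for name in combo)
            yield "+".join(combo), f"protein={value}"


# Hypothetical file names, for illustration only.
sources = {
    "human": "human_sprot.fasta",
    "mouse": "mouse_sprot.fasta",
    "rat": "rat_sprot.fasta",
}

results = list(protein_option_lines(sources))
for label, line in results:
    print(f"{label}: {line}")
```

With three sources this gives seven combinations; in practice only a few of them are probably worth running on the test scaffolds.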
Thanks,
Mike
> On Feb 2, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
>
> Thanks for your suggestions. So it does not matter if there are redundant protein sequences from different sources?
> I am trying to annotate a rodent genome, and plan to collect protein sequences of human, mouse and rat from both UniProt and NCBI (I also have RNA-seq data). I chose these species because they are close to the species that I am working on and they are well annotated. But I saw someone say that if we choose protein sequences from one lineage, genes that are missing in that lineage will not be detected. And in the following paper, the authors say they used the entire SwissProt database as the input. What do you think about this? Should I include protein sequences from more species (like all Eukaryota)? I think it could help us identify more genes, but on the other hand won't it also give us more false positives?
>
> This paper used the entire SwissProt database as the input.
> Insights into the evolution of longevity from the bowhead whale genome. 2015. Cell Rep 10(1): 112-122.
>
> Thanks
>
> Best
> Quanwei
>
> 2017-01-31 15:57 GMT-05:00 Michael Campbell <michael.s.campbell1 at gmail.com>:
> Hi Quanwei,
>
> (1) When I use UniProt I use Swiss-Prot and not TrEMBL.
> (2) I don’t merge files together. I just pass them all to MAKER as a comma-separated list.
>
> Thanks,
> Mike
>
> > On Jan 31, 2017, at 12:36 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> >
> > I wonder what's the best way to collect protein sequences for gene annotation of a de novo genome assembly.
> > (1) My first choice is to get protein sequences of human and mouse from UniProt. Here I am not sure whether I should download the reviewed entries (i.e., Swiss-Prot) or the automatically annotated ones (i.e., TrEMBL).
> > (2) On the other hand, I also get protein sequences from NCBI. Should I simply merge those FASTA files? Does it matter if there are redundancies? Also, protein sequences from different sources may not be of the same quality. Do I need to do something before I integrate protein sequences from different sources?
> >
> > Many thanks
> >
> > Best
> > Quanwei