<div dir="ltr"><br><div><div><div><div>Thank you for your suggestions. So it does not matter if there are redundant protein sequences from different sources?<br></div>I am trying to annotate a <strong>rodent </strong>genome,
and I plan to collect protein sequences for human, mouse, and rat from both
UniProt and NCBI (I also have RNA-seq data). I chose these
species because they are closely related to the species I am working on and
they are well annotated. But I have seen it said that if we choose
protein sequences from only one lineage, genes that are missing in that
lineage will not be detected. In the following paper, the authors
say they used
<span style="font-size:12pt;font-family:helvetica">the entire SwissProt
database </span>as the input. What do you think about this? Should I include protein sequences from more species (for example, all of Eukaryota)? I think it could help identify more genes, but on the other hand, won't it also produce more false positives?
<br><br>The paper is:<br>
<span style="font-size:12pt;font-family:cambria">Insights into the evolution of longevity from the
bowhead whale genome. 2015. <u>Cell Rep</u> <b>10</b>(1): 112-122.<br><br></span></div><span style="font-size:12pt;font-family:cambria">Thanks<br><br></span></div><span style="font-size:12pt;font-family:cambria">Best<br></span></div><span style="font-size:12pt;font-family:cambria">Quanwei</span>
</div><div class="gmail_extra"><br><div class="gmail_quote">2017-01-31 15:57 GMT-05:00 Michael Campbell <span dir="ltr"><<a href="mailto:michael.s.campbell1@gmail.com" target="_blank">michael.s.campbell1@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Quanwei,<br>
<br>
(1) When I use UniProt, I use Swiss-Prot and not TrEMBL.<br>
(2) I don’t merge the files. I just pass them all to MAKER as a comma-separated list.<br>
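<br>
For illustration only, a sketch of how such a list might look on the <code>protein=</code> line of <code>maker_opts.ctl</code>. The file names are hypothetical placeholders, not files from this thread:<br>
<pre>#-----Protein Homology Evidence (file names below are hypothetical)
protein=uniprot_sprot_human.fasta,uniprot_sprot_mouse.fasta,ncbi_rat_proteins.fasta</pre>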
<br>
Thanks,<br>
Mike<br>
<div><div class="h5"><br>
> On Jan 31, 2017, at 12:36 PM, Quanwei Zhang <<a href="mailto:qwzhang0601@gmail.com">qwzhang0601@gmail.com</a>> wrote:<br>
><br>
> I wonder what's the best way to collect protein sequences for gene annotation of a de novo genome assembly.<br>
> (1) My first choice is to get protein sequences of human and mouse from UniProt. At this step, I am not sure whether I should download the reviewed entries (i.e., Swiss-Prot) or the automatically annotated ones (i.e., TrEMBL).<br>
> (2) On the other hand, I also got protein sequences from NCBI. Should I simply merge those FASTA files? Does it matter if there are redundancies? Also, protein sequences from different sources may not be of the same quality. Do I need to do anything before I combine protein sequences from different sources?<br>
><br>
> Many thanks<br>
><br>
> Best<br>
> Quanwei<br>
</div></div>
<br>
</blockquote></div><br></div>