<div dir="ltr"><div><div><div><div><div><div><div><div><div><div>Hi,<br></div>I'm working with a fungal genome that is at least 40% repeats, so I'm trying to follow the advanced repeat construction protocol.<br></div>So far, so good, but I'm unsure how to build the protein database as explained at the end of the page.<br><br></div><div>In summary:<br></div>1. Get the SwissProt and RefSeq fungal proteins.<br></div>2. Run tblastn with the proteins from step 1 against the NCBI EST database and keep the matches.<br></div>3. Run blastp with the output of step 2 against the transposase protein database and remove the matches.<br></div>But from here on I'm a bit lost...<br><br>"Finally, the rice protein sequences were compared with verified 

transposons (such as Pack-MULEs) in the rice genome. If the protein 

sequence matched a transposon perfectly and was the only perfect match 

in the genome, the relevant protein sequence was excluded. Although 

elements such as Pack-MULEs contain true gene sequences, the annotation 

(the protein sequence in the database) often extends to non-gene 

sequences such as terminal inverted repeat or sub-terminal repeat, which

 are not true plant proteins and would cause great complications. As a 

result, it is essential to exclude them."<br><br></div>Are the proteins kept at the end of step 3 the 'protein database'?<br></div>Could you provide a bit more detail on how to tackle this?<br><br></div>Thank you in advance,<br></div>Xabi<br><div><div><div><div><br><div><div><div><div><div><div><div><div><div>-- <br><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>Xabier Vázquez-Campos, <i>PhD</i><br><i>Research Associate</i><br>NSW Systems Biology Initiative<br>School of Biotechnology and Biomolecular Sciences<br>

The University of New South Wales<br>Sydney NSW 2052 AUSTRALIA<br></div></div></div></div></div></div></div></div></div></div></div>

</div></div></div></div></div></div></div></div></div></div></div></div></div></div>