[maker-devel] maker output- transcripts.fasta and proteins.fasta files missing
Carson Holt
carsonhh at gmail.com
Thu Mar 13 12:53:05 MDT 2014
For future reference, I suggest using the …/maker/bin/fasta_merge tool to
merge based on the datastore.index rather than other command-line
methods. It handles the multiple FASTA types that are produced in the
results, and validates against the datastore.index file.
Example:
fasta_merge -d opgenResult+scaffoldsLengthsLess200_master_datastore_index.log
The same is also true when merging gff3 files.
gff3_merge -d opgenResult+scaffoldsLengthsLess200_master_datastore_index.log
Thanks,
Carson
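
Before merging, it can help to confirm that every contig in the index actually finished. The following is only a minimal sketch: it assumes the index log is tab-separated with the contig name in column 1, the output path in column 2, and the status in column 3, and that the statuses of interest are STARTED and FINISHED. The function name unfinished_contigs is hypothetical and is not part of MAKER.

```shell
# List contigs that have a STARTED entry but no FINISHED entry.
# ASSUMPTION: tab-separated columns are name, path, status.
# Usage: unfinished_contigs INDEX_LOG
unfinished_contigs() {
    awk -F '\t' '
        $3 == "STARTED"  { started[$1] = 1 }
        $3 == "FINISHED" { finished[$1] = 1 }
        END {
            # Print every contig that started but never finished.
            for (name in started)
                if (!(name in finished)) print name
        }
    ' "$1" | sort
}
```

Called as `unfinished_contigs opgenResult+scaffoldsLengthsLess200_master_datastore_index.log`, it prints nothing when every started contig also has a finished entry. Any other status values in the log are ignored by this sketch.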
From: dhivya arasappan <darasappan at gmail.com>
Date: Thursday, March 13, 2014 at 12:48 PM
To: Carson Holt <carsonhh at gmail.com>
Subject: Re: maker output- transcripts.fasta and proteins.fasta files
missing
ah I forgot that some were called superscaffolds. That is a difference
between the old and new assembly. This was definitely the issue. Thanks and
sorry for the mix up.
Dhivya
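
The superscaffold naming difference is why a glob tied to a directory prefix undercounts. As a rough sketch of a prefix-independent count, the helper below walks the whole datastore with find and totals the FASTA header lines. The function name count_fasta_records and the default file pattern are assumptions made for illustration; adjust the pattern to match the actual per-contig file names.

```shell
# Count FASTA records in every per-contig file matching a name pattern,
# without assuming what the contig directories are called.
# Usage: count_fasta_records DATASTORE_DIR [NAME_PATTERN]
count_fasta_records() {
    datastore=$1
    pattern=${2:-*transcripts.fasta}   # ASSUMPTION: default file name pattern
    # find descends to any depth, so "scaffold*" and "superscaffold*"
    # directories are both visited. The files are concatenated into one
    # stream so that grep prints a single total of header lines.
    find "$datastore" -type f -name "$pattern" -exec cat {} + | grep -c '^>'
}
```

For example, `count_fasta_records opgenResult+scaffoldsLengthsLess200_datastore` totals the transcripts, and passing '*proteins.fasta' as the second argument totals the proteins instead.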
On Mar 13, 2014, at 12:51 PM, Carson Holt wrote:
> Note that your command does not capture everything because not all scaffolds
> start with the name "scaffold".
>
> This works, though:
> ls -lh opgenResult+scaffoldsLengthsLess200_datastore/*/*/*/*trans*fasta|wc -l
>
> Thanks,
> Carson
>
>
> From: dhivya arasappan <darasappan at gmail.com>
> Date: Thursday, March 13, 2014 at 11:34 AM
> To: Carson Holt <carsonhh at gmail.com>
> Subject: Re: maker output- transcripts.fasta and proteins.fasta files missing
>
> Hi Carson,
>
> Am I looking in the wrong place for my fasta files? I looked here:
>
> ls -lh opgenResult+scaffoldsLengthsLess200_datastore/*/*/sca*/*trans*fasta|wc -l
>
> I see only 97 such files- so 97 contigs with transcripts.fasta files?
>
> When I count the number of sequences in all these files, I get 514 sequences.
>
> grep -c '^>' opgenResult+scaffoldsLengthsLess200_datastore/*/*/sca*/*trans*fasta|cut -d ':' -f 2|awk '{total+=$0}END{print total}'
>
> Could you tell me how and where you are getting the 21,183 transcripts?
>
> thanks
> dhivya
>
> On Mar 13, 2014, at 12:21 PM, Carson Holt wrote:
>
>> This is what I see in your uploaded data. There are 21,183 transcripts from
>> 201 contigs. Then there are 707 contigs with no gene models.
>>
>> —Carson
>>
>>
>> From: Carson Holt <carsonhh at gmail.com>
>> Date: Thursday, March 13, 2014 at 11:11 AM
>> To: dhivya arasappan <darasappan at gmail.com>
>> Subject: Re: maker output- transcripts.fasta and proteins.fasta files
>> missing
>>
>> "as you saw from the output I uploaded before, the output certainly was much
>> less than 20,000 transcripts"
>>
>> Actually there were 21,183 in the output you uploaded. I saw no loss of
>> entries.
>>
>> —Carson
>>
>> From: dhivya arasappan <darasappan at gmail.com>
>> Date: Thursday, March 13, 2014 at 11:09 AM
>> To: Carson Holt <carsonhh at gmail.com>
>> Subject: Re: maker output- transcripts.fasta and proteins.fasta files
>> missing
>>
>> Hi Carson,
>>
>> The datastore.index file looks fine- it has a started and finished status for
>> my 980 scaffolds. I reran with increased time twice. Second time around, I
>> actually deleted the entire output directory to make sure it runs all over
>> again. It still seemed to complete within a day. As you saw from the output
>> I uploaded before, the output certainly was much less than 20,000
>> transcripts. Given that I was seeing great results for an older version of my
>> assembly, I'm puzzled as to why my results are worse this time around. Any
>> suggestions of what to check or what I can do to see improved results would
>> be really helpful.
>>
>> I do know that I went from ~4% gaps to ~6% gaps in my new assembly; other
>> than that, it's better in every way. Could this cause such a dramatic
>> difference in results?
>>
>> Thanks
>> dhivya
>>
>> On Mar 13, 2014, at 11:55 AM, Carson Holt wrote:
>>
>>> The second time, it should have just started where it left off, so it would
>>> run faster (because the processing from the previous job counted towards the
>>> second one). The archived output you sent me had 21,183 proteins and
>>> transcripts. If you are using the fasta_merge to collect them, just make
>>> sure the datastore.index file is not truncated or corrupt otherwise it won’t
>>> collect all the fastas from every contig. You can rebuild the
>>> datastore.index using the -dsindex flag with MAKER, if you want to check
>>> that. Also you can have maker just regenerate results without rerunning
>>> BLAST etc., by using the -a flag if you want to just recalculate all results
>>> quickly (rebuilds all FASTA and GFF3 without redoing most analysis).
>>>
>>> —Carson
>>>
>>>
>>> From: dhivya arasappan <darasappan at gmail.com>
>>> Date: Thursday, March 13, 2014 at 10:47 AM
>>> To: Carson Holt <carsonhh at gmail.com>
>>> Cc: Daniel Ence <dence at genetics.utah.edu>, "maker-devel at yandell-lab.org"
>>> <maker-devel at yandell-lab.org>
>>> Subject: Re: maker output- transcripts.fasta and proteins.fasta files
>>> missing
>>>
>>> Thanks Carson for the response. I understand that est2genome=1 does not use
>>> any ab initio gene predictions, but simply identifies ests based on
>>> alignment. I'm a little confused because I ran maker on my assembly before,
>>> using the same parameters ( including est2genome=1). I got a very good
>>> result with > 20,000 transcripts and proteins.
>>>
>>> Then I was able to get an improved assembly, where many scaffolds were
>>> combined into superscaffolds. So I reran maker on this assembly. Same
>>> parameters, same transcriptome and proteins files. Now, I see such
>>> drastically different results: Only 500+ genes and transcripts. My
>>> scaffolds are now bigger than before, so I'm not sure how this is happening.
>>> These were the results I sent you.
>>>
>>> Another odd thing I noticed (and I am hesitant to report this because
>>> perhaps it is due to some sort of error on my part): I ran maker on the
>>> improved assembly the first time and maker did not complete in the 48 hours
>>> I allocated. But I had 19,000+ transcripts in the unfinished output. When
>>> I reran maker, just changing the time allocated, it completed much faster,
>>> but is giving far fewer transcripts and proteins as output. Could
>>> something like this happen? If not, then I'm guessing I must have changed
>>> something although I'm pretty sure that I did not change anything other than
>>> the time allocated. I've attached the transcripts and proteins files from the
>>> first time I ran maker on my improved assembly.
>>>
>>> Thanks again for your help
>>> Dhivya
>>>
>>>
>>>
>>> On Mar 13, 2014, at 11:14 AM, Carson Holt wrote:
>>>
>>>> Note protein/transcript FASTAs are only created when there are gene models
>>>> to output to those files (so their absence means there were no gene models
>>>> for that contig). Most sequences without protein/transcript FASTAs in your
>>>> sample are very short and thus don’t contain anything. The rest either
>>>> have no est2genome results or the est2genome alignments do not have
>>>> sufficient open reading frame to be turned into a gene model (false merging
>>>> of regions by trinity can cause this, so make sure you use the jaccard
>>>> index option when assembling reads with trinity to avoid this).
>>>>
>>>> You are using only the est2genome=1 option. This will result in a limited
>>>> set of genes that can be used for training SNAP/Augustus (so not getting
>>>> results on all contigs is expected). You really won’t get much as far as
>>>> results until you have one of the ab initio predictors turned on.
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>>
>>>> From: dhivya arasappan <darasappan at gmail.com>
>>>> Date: Tuesday, March 11, 2014 at 8:52 AM
>>>> To: Carson Holt <carsonhh at gmail.com>
>>>> Cc: Daniel Ence <dence at genetics.utah.edu>
>>>> Subject: Re: maker output- transcripts.fasta and proteins.fasta files
>>>> missing
>>>>
>>>> Alright done. My username is daras
>>>>
>>>> Thanks
>>>> Dhivya
>>>>
>>>> On Mar 10, 2014, at 5:10 PM, Carson Holt wrote:
>>>>
>>>>> Input and compressed file of output.
>>>>>
>>>>> Thanks,
>>>>> Carson
>>>>>
>>>>> From: dhivya arasappan <darasappan at gmail.com>
>>>>> Date: Monday, March 10, 2014 at 2:09 PM
>>>>> To: Carson Holt <carsonhh at gmail.com>
>>>>> Cc: Daniel Ence <dence at genetics.utah.edu>
>>>>> Subject: Re: maker output- transcripts.fasta and proteins.fasta files
>>>>> missing
>>>>>
>>>>> Hi Carson,
>>>>>
>>>>> Do you mean the whole maker output?
>>>>>
>>>>> Thanks
>>>>> dhivya
>>>>>
>>>>> On Mar 10, 2014, at 4:55 PM, Carson Holt wrote:
>>>>>
>>>>>> Could you upload everything here —>
>>>>>> http://weatherby.genetics.utah.edu/cgi-bin/mwas/bug.cgi
>>>>>>
>>>>>> Then send us the link generated or your user ID.
>>>>>>
>>>>>> Thanks,
>>>>>> Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> From: dhivya arasappan <darasappan at gmail.com>
>>>>>> Date: Monday, March 10, 2014 at 1:50 PM
>>>>>> To: Carson Holt <carsonhh at gmail.com>, Daniel Ence
>>>>>> <dence at genetics.utah.edu>
>>>>>> Subject: Fwd: maker output- transcripts.fasta and proteins.fasta files
>>>>>> missing
>>>>>>
>>>>>> Hi Carson and Daniel,
>>>>>>
>>>>>> I'm sending this across to you separately since maker list is blocking my
>>>>>> email due to attachment size.
>>>>>>
>>>>>> As always, thanks for any guidance you can provide.
>>>>>> Dhivya
>>>>>>
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>>> From: dhivya arasappan <darasappan at gmail.com>
>>>>>>> Date: March 10, 2014 3:14:03 PM CDT
>>>>>>> To: maker-devel at yandell-lab.org
>>>>>>> Subject: maker output- transcripts.fasta and proteins.fasta files
>>>>>>> missing
>>>>>>>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I've been running maker with different assembly files, reference files
>>>>>>> etc and I check the output by:
>>>>>>>
>>>>>>> 1. concatenating the gff files
>>>>>>> 2. concatenating the *transcripts.fasta files
>>>>>>> 3. concatenating the *proteins.fasta files
>>>>>>>
>>>>>>> I'm noticing that when I ran maker twice with same parameters, the
>>>>>>> second time around, many of the output subdirectories do not have a
>>>>>>> *transcripts.fasta or *proteins.fasta file in it.
>>>>>>> There are 251 subdirectories and only 97 of them have all 3 output
>>>>>>> files. Maker log looks ok to me, but I've attached it here as well.
>>>>>>>
>>>>>>> What could be the reason for this?
>>>>>>>
>>>>>>> Thanks
>>>>>>> dhivya
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>