[maker-devel] loading scaffold features into chado
Scott Cain
scott at scottcain.net
Wed Mar 21 14:59:19 MDT 2012
Hi Claudia,
I wanted to bring this back to the mailing lists, so I cc'ed them here.
First, with the fasta loading issue: what command are you using to
load the fasta sequences? It works for me whether the fasta is at the
bottom of the GFF file or if it is a separate fasta file (as long as I
supply the --fastafile flag to the loader).
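
If the fasta sits at the bottom of the GFF3 and you would rather pass it separately with --fastafile, the two parts can be pulled apart first. A minimal Python sketch, assuming the sequence section starts at a line reading ##FASTA (the function name and the sample records are made up for illustration):

```python
def split_gff3_fasta(lines):
    """Split GFF3 lines into (annotation_lines, fasta_lines) at the ##FASTA directive."""
    annotation, fasta = [], []
    in_fasta = False
    for line in lines:
        # strip() tolerates stray whitespace around the directive
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True
            continue  # the directive itself belongs to neither output
        (fasta if in_fasta else annotation).append(line)
    return annotation, fasta


# Made-up sample: one feature line followed by an embedded sequence.
sample = [
    "##gff-version 3\n",
    "scaffold0001\tmaker\tgene\t1\t9\t.\t+\t.\tID=gene1\n",
    "##FASTA\n",
    ">scaffold0001\n",
    "ACGTACGTA\n",
]
annotation, fasta = split_gff3_fasta(sample)
```

Writing `annotation` and `fasta` to two files gives a GFF file without sequence and a plain fasta file whose first line starts with '>'.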
About the searching problem: when I turn on full text searching (which
means both running gmod_chado_fts_prep.pl and adding "-fulltext 1" to
the db_args in the gbrowse config file), I can search for "cnot1" and
find both a gene and an mRNA (of course, they are really the same
feature, but GBrowse doesn't know that). Also, searching for "maker"
works, but in a real database this will not be an effective query,
since the number of results returned is limited, and presumably there
will be lots of features resulting from a query like that. Please
remind me, is that what you wanted to do?
Scott
On Wed, Mar 21, 2012 at 3:16 PM, claudia <dinatal at uwindsor.ca> wrote:
> Hi Scott,
>
> Wanted to give you a quick heads up that the bulk loader seems to be
> loading my fasta files now after deleting the ' ##FASTA' header (the first
> line of the file now looks like this: >scaffold0001)...
> Never had this problem before; it seems the bulk loader wanted to see a '>'
> symbol at the start of the first line...
>
> -- when I say seems, I will let you know if it finishes, as it currently
> states "Loading sequences (if any)" ... and I never made it this far
> before :)
>
> Claudia
>
>
>
> On 21/03/2012 12:53 PM, Scott Cain wrote:
>>
>> Hi Claudia,
>>
>> I imagine one scaffold and gene models would be good--the problem is
>> finding genes, right?
>>
>> Also, with loading fasta: were the corresponding features from the GFF
>> file already loaded? If so, that should have worked, and if it didn't
>> it implies a bug. If not, that's why.
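
One way to check this before loading is to compare the fasta record names with the feature IDs in the GFF. A rough Python sketch, assuming a sequence counts as matched when the first word of its '>' header equals an ID= attribute in the GFF; that matching rule, the function names, and the sample records are illustrative assumptions, not the loader's actual logic:

```python
def gff_feature_ids(gff_lines):
    """Collect the ID= attribute values from the feature lines of a GFF3 file."""
    ids = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence starts here, no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        for attr in cols[8].split(";"):
            key, _, value = attr.partition("=")
            if key == "ID":
                ids.add(value)
    return ids


def fasta_ids(fasta_lines):
    """Return the first word of every '>' header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.append(words[0])
    return ids


def unmatched_fasta_ids(gff_lines, fasta_lines):
    """List fasta record names that have no feature with the same ID in the GFF."""
    known = gff_feature_ids(gff_lines)
    return [name for name in fasta_ids(fasta_lines) if name not in known]


# Made-up sample: only scaffold0001 has a feature in the GFF.
gff = ["scaffold0001\t.\tcontig\t1\t9\t.\t.\t.\tID=scaffold0001;Name=scaffold0001\n"]
fasta = [">scaffold0001\n", "ACGTACGTA\n", ">scaffold0002\n", "TTTTGGGG\n"]
unmatched = unmatched_fasta_ids(gff, fasta)
```

Any name left in `unmatched` is a sequence with no corresponding feature in the GFF, which is the situation described above.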
>>
>> Scott
>>
>>
>> On Wed, Mar 21, 2012 at 12:37 PM, claudia<dinatal at uwindsor.ca> wrote:
>>>
>>> Hi Scott,
>>> So would one scaffold with Maker gene models suffice? Do you want the
>>> analysis as well?
>>>
>>> --along those same lines, I did try to load the original sequence (fasta)
>>> file first that I ran the Pipeline on, and chado seems to refuse the files,
>>> saying they don't contain the appropriate feature '>' in the header, which
>>> in fact they do, i.e. > scaffold00001 ... So I'm not sure what is wrong with
>>> the fasta that chado doesn't want to load; even if it is embedded in the
>>> GFF3, the bulk loader or maker2chado return errors stating 'feature not
>>> found'...
>>>
>>>
>>> Claudia
>>>
>>>
>>>
>>> On 21/03/2012 12:20 PM, Scott Cain wrote:
>>>>
>>>> Hi Claudia,
>>>>
>>>> I was hoping to get actual files that I could do testing on, not
>>>> pictures of files :-)
>>>>
>>>> Scott
>>>>
>>>>
>>>> On Tue, Mar 20, 2012 at 4:15 PM, Dinatale C<dinatal at uwindsor.ca>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Attached: I have samples of the contig file (I extracted the contig
>>>>> features first to load prior to the gene models); the fasta of the
>>>>> sequences is in the footer of the gff3 file.
>>>>>
>>>>> --so basically, based on experience with contig annotations, I should
>>>>> be able to type 'maker' into the gbrowse search bar and receive all
>>>>> the maker gene annotations, but I don't. I must specify the exact ID,
>>>>> i.e. "maker-scaffold11323-augustus-gene...." or 'scaffold11323'
>>>>>
>>>>> --so I wonder if it has to do with the fasta files being named as
>>>>> 'scaffolds', perhaps causing a problem with chado recognizing that
>>>>> they are linked to the gene annotations because scaffold is not a SOFA
>>>>> type term, if in fact the sequences must be submitted to the database
>>>>> first?
>>>>>
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Claudia
>>>>>
>>>>> On Tue, 20 Mar 2012 15:50:55 -0400 Scott Cain wrote:
>>>>>>
>>>>>> Hi Claudia,
>>>>>>
>>>>>> Can you post a sample of the gff that shows what you are looking for
>>>>>> and not finding?
>>>>>>
>>>>>> Scott
>>>>>>
>>>>>>
>>>>>> Sent from my iPad
>>>>>>
>>>>>> On Mar 20, 2012, at 2:03 PM, claudia wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have enabled full text searching and I still have this problem,
>>>>>>> another reason for concern... So I wondered if in fact I changed all
>>>>>>> the IDs in the GFF3 file to supercontigs, then perhaps Chado would
>>>>>>> better link all the terms, annotations, and fasta files....
>>>>>>> Although, I realize that the seq_id (column 1) shouldn't need to be
>>>>>>> specific since the 'type' term would take care of designating the
>>>>>>> feature type, no?
>>>>>>>
>>>>>>> Claudia
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 20/03/2012 1:25 PM, Scott Cain wrote:
>>>>>>>>
>>>>>>>> Hi Claudia,
>>>>>>>>
>>>>>>>> I agree with everything that Carson wrote, except about name
>>>>>>>> searching--it's a little trickier in Chado. What you probably want
>>>>>>>> to do is implement full text searching. See:
>>>>>>>>
>>>>>>>> http://gmod.org/wiki/Chado_Full_Text_Search
>>>>>>>>
>>>>>>>> for more information on setting it up and maintaining it.
>>>>>>>>
>>>>>>>> Scott
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 20, 2012 at 1:13 PM, Carson Holt wrote:
>>>>>>>>>>
>>>>>>>>>> I have 2 concerns. The first regards representing scaffold
>>>>>>>>>> features in chado and gbrowse. I noticed that the Sequence
>>>>>>>>>> Ontology uses the term supercontig, so if my assembly generated
>>>>>>>>>> scaffolds named "scaffold", should I change the names to
>>>>>>>>>> supercontigs so that chado recognizes the terms?
>>>>>>>>>
>>>>>>>>> Yes. You must use valid SO terms. It is a requirement of GFF3, and
>>>>>>>>> Chado will enforce this requirement on loading a GFF3 file (note
>>>>>>>>> Chado will even go as far as to check the validity of the
>>>>>>>>> Ontology_term= attribute in GFF3 if you use it). You can decide to
>>>>>>>>> use contig or supercontig as your sequence feature. It doesn't
>>>>>>>>> really matter unless you are placing both into the database as
>>>>>>>>> separate features (i.e. you have a supercontig as the parent
>>>>>>>>> feature and then you enter contigs individually as children of the
>>>>>>>>> supercontig).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Corresponding to my first question, Maker does not know that the
>>>>>>>>>> contigs are actually scaffolds/supercontigs when annotating, so
>>>>>>>>>> Maker will still call the "type" feature, or column 3 in the GFF3,
>>>>>>>>>> a 'contig'. How can Maker be made to change this naming convention
>>>>>>>>>> before annotation, or after?
>>>>>>>>>
>>>>>>>>> Not really important unless you plan on making contigs children of
>>>>>>>>> the supercontig. But you can always do a search and replace. -->
>>>>>>>>> cat file.gff | perl -ane 's/\tcontig\t/\tsupercontig\t/s; print $_' > new_file.gff
>>>>>>>>>
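
The same replacement can be restricted to column 3 so that no other column is touched. A minimal Python sketch, with a hypothetical function name and made-up sample lines:

```python
def retype_features(lines, old="contig", new="supercontig"):
    """Yield GFF3 lines with the type column (column 3) changed from old to new."""
    for line in lines:
        cols = line.split("\t")
        # Only 9-column feature lines are edited; comments, directives and
        # embedded fasta sequence pass through unchanged.
        if not line.startswith("#") and len(cols) == 9 and cols[2] == old:
            cols[2] = new
            line = "\t".join(cols)
        yield line


# Made-up sample: one contig feature and one gene feature.
sample = [
    "##gff-version 3\n",
    "scaffold0001\t.\tcontig\t1\t5000\t.\t.\t.\tID=scaffold0001\n",
    "scaffold0001\tmaker\tgene\t10\t900\t.\t+\t.\tID=gene1\n",
]
retyped = list(retype_features(sample))
```

For a real file, the lines read from file.gff would be passed in and the results written to new_file.gff.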
>>>>>>>>>
>>>>>>>>>> Consequently, I am having problems pulling up gene features in
>>>>>>>>>> GBrowse when doing a generic gene search, and I must provide the
>>>>>>>>>> maker generated unique-gene_id in the gbrowse search bar or the
>>>>>>>>>> known sequence id, i.e. 'scaffold001', which is not useful for
>>>>>>>>>> someone who does not have this information.
>>>>>>>>>> ---- I do not have this problem when my seq_id and 'type' feature
>>>>>>>>>> id match in the true case of 'contigs'. I can do a generic gene
>>>>>>>>>> search in gbrowse with the term 'maker' and gbrowse will provide me
>>>>>>>>>> all the associated maker generated gene calls.
>>>>>>>>>
>>>>>>>>> See "Adjusting GBrowse Name Searches" in the GBrowse tutorial -->
>>>>>>>>> http://gmod.org/gbrowse2/tutorial/tutorial.html#naming
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Carson
>>>>>>>>>
>>>>>>>>>> Thank you for any guidance resolving these concerns,
>>>>>>>>>> Claudia
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Claudia DiNatale
>>>>>>>>>> Master's Candidate
>>>>>>>>>> The Crosby Lab
>>>>>>>>>> University of Windsor
>>>>>>>>>> 519-253-3000 ext: 4755
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> maker-devel mailing list
>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research