CGL Installation:

General Instructions

The CGL library is packaged like a normal CPAN module (insofar as there is such a beast...) so installing it should be straightforward. There are a variety of sources within the Perl community that describe the process, including one at CPAN and the perlmodinstall man/pod page, which should be available on your system (you should be able to read it with "perldoc perlmodinstall").

CGL depends on a few other Perl packages though, and they in turn have dependencies on other Perl packages and libraries. Getting everything in the right place and satisfying all of these dependencies can require a bit of attention and the process depends on the operating system you're using.

In particular, CGL itself depends on BioPerl and the Gnome project's XML parser. Depending on how you intend to use it, you'll also probably need the Chaos-xml toolset for creating the XML files that CGL uses as input and the Datastore module to help manage large collections of those files.

We'll discuss the process in detail and give some operating system specific hints below.

Prerequisites

We can't really tell you the best way to acquire and install these prerequisites (but we have some helpful hints down below). In some cases you'll need to make a choice between downloading a precompiled package that's specific to your operating system and building the package from source. In other cases (e.g. perl modules in CPAN), you'll have to decide whether to fetch them and build them by hand or to use perl's CPAN module to automate it. Some of the perl modules (e.g. XML::LibXML) can't be built by any method until their underlying libraries have been installed, and thinking about that brings you right back to this basic question. The FreeBSD ports system automates all of this beautifully, while fink and darwinports make a pretty good stab at it for Mac OS X. Other operating systems offer similar systems, but you might even choose to build and install all of the dependencies yourself.

There's no right answer. The trick that's most expedient now may make upgrading later more difficult, but sometimes the simplest path is the best. If you have any doubts about what will work well in your environment, you should check with your local gurus.

We've tried hard to describe the things that you need to do below, and give some operating system specific examples for simple situations. Hopefully it'll get you heading in the right direction.

Perl

We have been developing CGL on a variety of modern Unix-like systems using the Perl 5.8 line of releases. We've tested this release with the following set of operating systems and perl releases:

FreeBSD-stable on i386, with perl, v5.8.5 built for i386-freebsd-64int.
Redhat's Fedora Core 3 Linux on i386 with perl, v5.8.5 built for i386-linux-thread-multi.
Ubuntu 5.04 Linux on i386 with perl, v5.8.4 built for i386-linux-thread-multi.
Mac OS X Panther with perl, v5.8.1-RC3 built for darwin-thread-multi-2level.

XML handling

CGL uses the Gnome project's XML toolkit to manipulate XML data, via the standard CPAN module XML::LibXML.

You'll need to install the underlying C libraries (from source, or using your platform's favorite package manager), then grab and install the XML::LibXML package and its dependencies from CPAN.
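
A quick way to confirm that both layers are in place is sketched below. It assumes that libxml2's xml2-config helper is on your PATH (it normally arrives with the libxml2 development files) and it uses two helper functions, have_perl_module and check_xml_stack, whose names are ours and not part of CGL or XML::LibXML.

```shell
# have_perl_module NAME: succeed if perl can load the named module.
# (Hypothetical helper for illustration; not part of CGL.)
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# check_xml_stack: report on the C library and on the Perl binding.
check_xml_stack() {
    if command -v xml2-config >/dev/null 2>&1; then
        echo "libxml2 version: $(xml2-config --version)"
    else
        echo "xml2-config not found; the libxml2 development files may be missing"
    fi
    if have_perl_module XML::LibXML; then
        echo "XML::LibXML: installed"
    else
        echo "XML::LibXML: missing; install it from CPAN"
    fi
}
```

Run check_xml_stack after installing the C libraries and again after installing the Perl module. The same have_perl_module test works for the other prerequisites, e.g. "have_perl_module Datastore::MD5".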

BioPerl

CGL uses and/or extends a variety of modules provided by BioPerl. While all of the functionality that CGL needs is included in BioPerl's "stable" 1.4 release, the Chaos-xml tools for transforming various sources of genomic data into the formats that CGL expects require a newer version of the library.

We recommend that you use the BioPerl 1.5 "developers" release and give some pointers to obtaining and installing it below. We have also tested the process with bioperl-live from their CVS repository, but since it's something of a moving target, we won't go into particular details about installing it. If you're into bleeding edge stuff and notice any CGL problems when you're playing with bioperl-live, please drop us a note.

The BioPerl library provides an impressive array of functionality and in turn has a set of dependencies which may seem overwhelming. Since only a small portion of this functionality is used by the CGL or Chaos-XML libraries, we've concentrated on describing a minimalist BioPerl installation with the functionality that we require. Installing a fully capable version is left as an exercise; there's lots of great information on the BioPerl web site and a very strong user community to lend a hand.

Chaos-XML

The Chaos-XML Library contains software and specifications for the Chaos XML format for representing biological sequences and sequence features. It grew out of the Berkeley Drosophila Genome Project's need for an XML annotation data exchange format capable of expressing the rich annotation data that they were representing in their Chado database.

More information about the design and implementation of the library is available at its web site.

Datastore

The Datastore modules provide a graceful way to work with a large number of subdirectories (e.g. one subdirectory per gene in a genome). Many filesystems have performance problems with large numbers of subdirectories in a directory and even when the underlying filesystems handle things gracefully, access via network filesystems can be an issue.

The Datastore modules create a hierarchy of subdirectory layers, starting from a "base", and map each end-user identifier to the corresponding subdirectory. There are provisions for a variety of ID to subdirectory mappings. We use a random mapping based on MD5 hashes (Datastore::MD5) exclusively, but the package includes an example that uses digits from a Drosophila "CG" identifier. For example, a datastore with a depth of 2 and a root at /tmp would map "CG1040" to "/tmp/0A/70/CG1040".
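
To make the idea of a hash-based mapping concrete, here is a minimal sketch of one way to turn an ID into a nested path. It is an illustration only: the function name ds_md5_path is ours, and the choice of which digest characters to use (the leading ones, upper-cased, two per level) is an assumption, so the paths it prints will not necessarily match the ones Datastore::MD5 produces.

```shell
# ds_md5_path ROOT DEPTH ID: print a nested directory path for ID.
# Illustrative only: Datastore::MD5's real mapping may differ in which
# digest characters it uses and in their case.
ds_md5_path() (
    root=$1 depth=$2 id=$3
    # md5sum is from GNU coreutils; Mac OS X and the BSDs ship "md5" instead.
    digest=$(printf '%s' "$id" | md5sum | cut -c1-32 | tr 'a-f' 'A-F')
    path=$root
    i=0
    while [ "$i" -lt "$depth" ]; do
        # Take the next two hex characters for this directory level.
        start=$((i * 2 + 1))
        pair=$(printf '%s' "$digest" | cut -c"$start"-"$((start + 1))")
        path=$path/$pair
        i=$((i + 1))
    done
    printf '%s\n' "$path/$id"
)
```

Calling "ds_md5_path /tmp 2 CG1040" prints a path of the form /tmp/XX/YY/CG1040. Because the two-character directory names come from a hash, IDs spread evenly over at most 256 subdirectories per level no matter how similar the IDs themselves are.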

The package includes library routines for cd'ing into a Datastore directory given its ID, iterating over all of the IDs in a Datastore, iterating over some of the IDs in a Datastore, etc. There are also a pair of scripts, ds_dir and ds_do, that provide shell-level access to Datastores.

You can download a CPAN style package containing the datastore modules below.

Installation

You'll need to go through the following steps to install the CGL library:

Make sure that you understand how to build and install CPAN style perl modules on your platform. We recommend that you read through the 'perlmodinstall' document; you should be able to read it by running "perldoc perlmodinstall", and the current version is also available on the web. If you're an Apple OS X user, you should follow the "Unix" guidelines; the "mac" guidelines refer to OS 9 and earlier.
Think about what your setting for PREFIX should be. See the previous item if you don't know if/when/how you should set a PREFIX. You'll probably need to if you don't have permission to install stuff into the system directories, and you may want to if you're setting up your own private set of perl libraries.
Take care of the prerequisites:
- XML::LibXML
- BioPerl
- Chaos-xml [optional]
- Datastore [optional]
Unpack the CGL distribution and build it
1. Use tar [and possibly gzip] to unpack the distribution, e.g. tar -xvzf CGL-1.00.gz.
2. Change your current working directory to the top level of the distribution, e.g. cd CGL-1.0.
3. And finally, build the Makefile from the Makefile.PL, setting PREFIX if you've decided you should, e.g. perl Makefile.PL.
Test the distribution
1. Set the CGL_SO_SOURCE and CGL_GO_SOURCE environment variables to point to your copies of the sequence ontology and gene ontology source files. The CGL distribution includes copies of the versions we used for our testing in the sample_data directory.
  If you use a C shell derivative (e.g. csh, tcsh), you'll want to use something like this:
  setenv CGL_SO_SOURCE /usr/home/moose/src/CGL-1.0/sample_data/so.obo
  setenv CGL_GO_SOURCE /usr/home/moose/src/CGL-1.0/sample_data/gene_ontology.obo
  If you use a Bourne Shell derivative (e.g. sh, bash, ash), you'll need something like:
  export CGL_SO_SOURCE=/usr/home/moose/src/CGL-1.0/sample_data/so.obo
  export CGL_GO_SOURCE=/usr/home/moose/src/CGL-1.0/sample_data/gene_ontology.obo
2. Run make test, which should run to completion, without any failures.
Install it, e.g. make install (you'll only need sudo if you're installing system-wide).
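
The build and test steps above can be strung together as in the following sketch for Bourne-style shells. The function names cgl_check_env and cgl_build_and_test and the DRY_RUN switch are ours, invented for illustration. The sketch stops before the install step so that you can decide whether sudo is needed.

```shell
# cgl_check_env: succeed only if both ontology variables name readable files.
cgl_check_env() (
    for var in CGL_SO_SOURCE CGL_GO_SOURCE; do
        eval "file=\${$var:-}"
        if [ -z "$file" ] || [ ! -r "$file" ]; then
            echo "$var is unset or does not point to a readable file" >&2
            exit 1
        fi
    done
)

# cgl_build_and_test [PREFIX]: run from the top of the unpacked distribution.
# Set DRY_RUN=1 to print the commands instead of running them.
cgl_build_and_test() (
    run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }
    cgl_check_env || exit 1
    if [ -n "${1:-}" ]; then
        run perl Makefile.PL PREFIX="$1"
    else
        run perl Makefile.PL
    fi &&
    run make &&
    run make test
)
```

For example, "cgl_build_and_test $HOME/perl" builds and tests against a private PREFIX, while a bare "cgl_build_and_test" uses perl's default install locations. Checking the environment first avoids a confusing run of test failures when one of the ontology files can't be found.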

Operating system-specific hints and documentation

Sample scripts

CGL installs a set of sample scripts that should help you become familiar with how to use the libraries. These scripts should have been installed when you installed the CGL libraries, in whatever directory perl normally uses for such things (/usr/local/bin, or PREFIX/bin if you set a PREFIX when you created the Makefile).

The samples include:

cgl_validate

This script validates a chaos-XML data file, and is a good way to validate that all of the XML parsing tools are correctly installed. You can test it by running:

> cgl_validate [path to the CGL distribution]/sample_data/dmel.sample.chaos.xml

and if all is well you should see the message:

DOCUMENT IS VALID
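
If you want to use this check from your own scripts, a small wrapper along the following lines may help. The name chaos_is_valid is ours. The wrapper relies only on the success message quoted above; it makes no assumption about cgl_validate's exit status or about what it prints for an invalid document.

```shell
# chaos_is_valid FILE: succeed if cgl_validate prints its success message.
# (Hypothetical wrapper; only the message text comes from cgl_validate.)
chaos_is_valid() {
    cgl_validate "$1" 2>&1 | grep -q 'DOCUMENT IS VALID'
}
```

It can then be used in a condition, e.g. "if chaos_is_valid dmel.sample.chaos.xml; then echo ok; fi".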

cgl_tutorial

This script provides examples of manipulating annotations using CGL. Try:

> cgl_tutorial [path to the CGL distribution]/sample_data/dmel.sample.chaos.xml

and have a look at the internals of the script to learn how to gain access to just about any feature of an annotation, annotated or imaginable.

cgl_phat_tutorial

This script demonstrates CGL BLAST and PHAT HSP functionality. Try:

> cgl_phat_tutorial -t blastp [path to the CGL distribution]/sample_data/blastp.sample.report

Have a look at the internals of this script. Phat hits and Phat HSPs grant easy access to all kinds of new information about the contents of BLAST reports.

Finally, have a look at the CGL TUTORIAL PDF on the CGL web page. This document explains how to combine the Phat Hit and Phat HSP aspects of CGL with its ability to manipulate annotations. Together, these two aspects of CGL make it possible to align genes and sequences to one another in completely new ways.

Converting GenBank to Chaos.xml

This section walks through downloading a genome and its annotations from GenBank and converting those annotations to a datastore full of chaos.xml documents.

Download and install the Datastore modules

If you haven't already downloaded and installed the CGL Datastore modules, you should do so now. Grab the tarball from the Downloads section and install them (they use the traditional CPAN procedure):

> gunzip Datastore-0.xx.tar.gz
> tar xvf Datastore-0.xx.tar
> cd Datastore-0.xx
> perl Makefile.PL
> make
> make test
> sudo make install # you'll only need sudo to install system-wide

The Bio-Chaos modules

Next you'll need to download and install the Bio-Chaos modules. Read the Chaos XML section of this page and follow the link to the Chaos XML site. After grabbing the library source tarball, install it in the traditional CPAN way. The Chaos libraries can make use of a perl module named XML::Parser::PerlSAX and will complain if you don't have it installed. We don't need it for our purposes though, so you can just keep going.

> gunzip Bio-Chaos-0.xx.tar.gz
> tar xvf Bio-Chaos-0.xx.tar
> cd Bio-Chaos-0.xx
> perl Makefile.PL
> make
> make test
> sudo make install # you'll only need sudo to install system-wide.

Now have a look in the Bio-Chaos-0.xx/bin directory. Note the script called cx-genbank2chaos.pl. If you run it with its "-h" option you can see the various options it supports:

> cx-genbank2chaos.pl -h

Often, a GenBank file contains more than one gene. Using the -islands option will create one chaos.xml file for every gene in the GenBank flat file. This is the option you will want to use.

Generally, GenBank flat files have a *.gbk suffix. Try the command on some of the sample data contained in the Bio-Chaos sample-data directory:

> cx-genbank2chaos.pl -islands sample-data/AE003734.gbk

Note that this command creates a directory named AE003644.3. Within this directory are chaos.xml documents describing each of the genes contained within the file AE003734.gbk. You may notice that the files have funny names, for example:

gene:EMBLGenBankSwissProt:AE003644:128108:128179.xml

The file's name contains a lot of semantic information. The one shown above says that the document describes a gene on a sequence database entry whose id is AE003644 in EMBL, GenBank and SwissProt; the 5-prime-most feature of the gene begins at position 128108; the 3-prime-most feature of the gene ends at position 128179. A little later on in this document, we will tell you how to find a particular gene among these documents using that gene's common name. For now though, let's concentrate on how to download a genome from GenBank.
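
Since the fields in the file name are separated by colons, they are easy to pull apart in the shell. The sketch below shows one way to do it; the function name parse_island_name and the field labels in its comments are ours, not part of Bio-Chaos.

```shell
# parse_island_name FILE: split an -islands file name into its fields and
# print them as "type database accession start end".
# (Field labels are our own names for the parts described above.)
parse_island_name() (
    name=$(basename "$1" .xml)
    set -f      # no globbing while we split on colons
    IFS=:
    set -- $name
    echo "$1 $2 $3 $4 $5"
)
```

For the file shown above, "parse_island_name gene:EMBLGenBankSwissProt:AE003644:128108:128179.xml" prints "gene EMBLGenBankSwissProt AE003644 128108 128179".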

Downloading an annotated genome from GenBank

Now download a genome from GenBank. Keep in mind that GenBank sometimes moves things around. Basically you are looking for the "Genomes" division. To download the latest D. melanogaster assembly and its annotations, use this command:

> wget -r -np -nv ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster

If you aren't familiar with the wget command, or don't have this software installed on your machine, you might want to find out more about it; failing that you can just use FTP to grab the various files in that directory one-at-a-time.

When this finishes, you will find a directory named ftp.ncbi.nih.gov beneath your current working directory. Have a look inside:

> cd ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster

The directory's contents should look something like this:

> ls
CHR_2           CHR_4           CHR_X           RELEASE_4_1
CHR_3           CHR_Un          README

If you read the README and dig around a bit you will find that there is much more than just .gbk files in these directories, all of it useful; nevertheless it's the *.gbk files that we are currently interested in.

Converting the genbank files to chaos.xml

Genomes are big things, so you might want to test things out before proceeding to attempt to make an entire datastore. As it turns out CHR_4 of Drosophila is quite small. This is therefore a good chromosome to test things on before you try the others, as it only contains about 80 genes.

Try this command to generate chaos.xml documents for every annotation on chromosome 4 (don't forget to check the errors file when it finishes):

> cx-genbank2chaos.pl -islands -ds_root chaos_datastore CHR_4/*.gbk >& errors

To dump the entire genome use this command:

> cx-genbank2chaos.pl -islands -ds_root chaos_datastore CHR_*/*.gbk >& errors

But before you do that, read on and learn more about datastores.
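
The ">&" redirection used in the conversion commands is C shell syntax (bash accepts it too). If you use a plain Bourne shell, the equivalent is "> errors 2>&1". The sketch below wraps a per-chromosome run in that style and reports how much ended up in the log. The CX variable, the convert_chromosome function and the per-chromosome log name are ours, introduced for illustration.

```shell
# Command to run; override CX if the script is not on your PATH.
CX=${CX:-cx-genbank2chaos.pl}

# convert_chromosome DIR [DS_ROOT]: convert DIR/*.gbk into a datastore,
# sending stdout and stderr to DIR.errors, then report how much was logged.
convert_chromosome() (
    dir=$1
    ds_root=${2:-chaos_datastore}
    log=$dir.errors
    "$CX" -islands -ds_root "$ds_root" "$dir"/*.gbk > "$log" 2>&1
    status=$?
    echo "$dir: exit status $status, $(wc -l < "$log" | tr -d ' ') lines in $log"
    exit "$status"
)
```

Start with "convert_chromosome CHR_4" and read through CHR_4.errors before moving on to the larger chromosomes. Keeping one log per chromosome makes it easier to see where a problem came from.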

Creating a datastore index using ds_do and cgl_validate

If you read the documentation concerning the Datastore library you will find that the code comes with some helpful scripts to help you navigate a datastore. One of these is called "ds_do". This script can be used in a variety of ways to learn about the contents of a datastore. For example, typing:

> ds_do -all ls

within the root directory of a datastore will list the contents of every directory beneath it.

ds_do can be used together with the scripts provided in the CGL script directory to find out even more about the contents of a datastore. Try using cgl_validate together with ds_do to create an index of a datastore.

This command will accomplish two desirable tasks at once. First, it will create a tab-delimited file with the following columns:

DOC_(IN)VALID

This is an especially useful piece of data, as the resulting file provides a simple means to locate the chaos.xml document corresponding to a particular gene of interest using that gene's common name. Documents can be invalid because [...] the annotated transcript contains a premature stop codon, or some other snafu. Sometimes this is due to an error on the part of those who made the annotation; sometimes it's a result of GenBank having tracked an older annotation forward to a new assembly. Sometimes GenBank will place a comment to this effect in the GenBank document; sometimes not. (Try searching a chaos.xml file for the element "comment"). It all depends on whether or not the various authorities creating and maintaining an annotation are even aware that the annotation is semantically confused. There may also be undetected bugs in the GenBank to Chaos conversion process. As a rule of thumb, at time of writing, about 3% of human annotations distributed by GenBank produce invalid Chaos.xml documents, and we are currently working hard to understand the underlying causes of this phenomenon.
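
As a rough stand-in for a validity index, the sketch below walks a datastore with find (not ds_do) and records whether each document validates. The function name index_datastore and its two-column output (status, then path) are ours and are not the column set produced by the CGL tools. It assumes only that cgl_validate prints "DOCUMENT IS VALID" for a valid document; anything else is recorded as invalid.

```shell
# index_datastore ROOT: print one tab-separated line per chaos.xml document:
#   DOC_VALID or DOC_INVALID, then the document's path.
# (Hypothetical helper; uses find rather than ds_do to visit the files.)
index_datastore() {
    find "$1" -type f -name '*.xml' | sort | while IFS= read -r doc; do
        # </dev/null keeps cgl_validate from consuming the list of file names.
        if cgl_validate "$doc" </dev/null 2>&1 | grep -q 'DOCUMENT IS VALID'; then
            status=DOC_VALID
        else
            status=DOC_INVALID
        fi
        printf '%s\t%s\n' "$status" "$doc"
    done
}
```

For example, "index_datastore chaos_datastore > index.tab" followed by "grep -c DOC_INVALID index.tab" gives a quick count of the documents that need a closer look.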

Finally, the CGL cgl_tutorial provides a quick jumpstart to using CGL & Chaos.xml documents. Try it out and have a look at the code within for hints as to how to access different parts of an annotation.

Tutorial

Here's a tutorial (in pdf format) about using CGL.

Pod Documentation

Fleshing out the documentation for the various modules is still under way. Most of the modules have both class and method documentation, and we're in the process of finishing off the rest.

Yandell Lab

Department of Human Genetics - University of Utah