BioParser Project
The BioParser Project is a collection of Perl modules
and scripts that provide parsing and object interfaces for common
bioinformatics text databases.
BioParser provides tools that allow you to easily search and
manipulate text data files downloaded from NCBI's LocusLink,
UniGene, GenBank and OMIM resources and from TIGR's Resourcerer
and Georgetown's PDB. Other databases targeted for incorporation
into BioParser include PubMed, dbEST, UniProt and dbSNP.
Advantages of BioParser text databases include:
- no need for internet access once the text database has been
downloaded
- for smaller databases, the whole database is read into memory which
gives very quick access times
- no need for database software such as MySQL or Oracle
- uses only Perl modules and scripts, so it is easy to install and
works well across platforms
- does not require high CPU speeds so runs well on laptops and
older servers
Limitations of BioParser text databases:
- works best with large amounts of memory so databases can be fully
cached
- does not scale to very large databases such as GenBank and
PubMed (works well for subsets of records from these databases)
- the initial process where text records are converted into data
objects can be slow
How BioParser works:
- user downloads and installs BioParser
- user downloads target text data file from data source
- user initiates parsing of the text file
The number and sophistication of databases providing biological
information is expanding rapidly and many of the databases have
excellent websites that provide researchers with tools to search
and view the data. The problem with web interfaces is that they
seldom scale. In other words, websites are great if you want to look
at one record or tens of records. If, however, you want to look at
hundreds or thousands of records, it is impractical to click through
all of the pages on a website. In such a situation,
you need direct access to the data.
There are two ways you can get direct access to the data. The first
is to use the online search and query tools provided by the website
to retrieve one or more records. The limitation of this approach
is that the flexibility of the online tools depends on the programmer
resources that have been devoted to developing them.
Unless the website providers have a lot of resources, many useful
queries will not be possible.
The second way to get direct access to the data is to download a
copy. Many biological databases make their contents available for download
as text files. Unfortunately, there is no standard format for
these files and they are often difficult to understand and use.
BioParser was designed to make these text files easy to use by
removing the need for researchers to understand the
file contents. BioParser provides modules that convert the
text files into data objects that are saved to disk for later use.
The process of parsing a text record into a data object can be
time-consuming, but once it has been done, loading
a pre-made data record from disk is almost instantaneous.
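The parse-once, load-many idea described above can be sketched with Perl's core Storable module. This is a minimal illustration and not BioParser's own code: the "KEY: value" record layout, the ">>" record separator, the parse_records function and the file name records.storable are all assumptions made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Hypothetical flat-file layout: one "KEY: value" pair per line, with
# records separated by a line containing only ">>". This is NOT the
# real LocusLink format, just a stand-in for illustration.
my $text = <<'END';
LOCUSID: 10016
SYMBOL: PDCD6
>>
LOCUSID: 10017
SYMBOL: BCL2L10
END

# Slow step (done once): turn each text record into a hash.
sub parse_records {
    my ($raw) = @_;
    my @records;
    for my $chunk ( split /^>>\n/m, $raw ) {
        my %rec;
        for my $line ( split /\n/, $chunk ) {
            # Skip any line that is not a "KEY: value" pair.
            my ( $key, $value ) = $line =~ /^(\w+):\s*(.*)$/ or next;
            $rec{ lc $key } = $value;
        }
        push @records, \%rec if %rec;
    }
    return \@records;
}

# Hypothetical cache location; a temporary directory keeps the demo tidy.
my $dir   = tempdir( CLEANUP => 1 );
my $cache = "$dir/records.storable";

store( parse_records($text), $cache );    # parse once, save to disk
my $records = retrieve($cache);           # later runs: load without parsing

print "$_->{locusid}\t$_->{symbol}\n" for @$records;
```

In a real workflow the store step would run once after each download, and every later script would call only retrieve.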
It should be noted that a number of other resources, including the excellent
BioPerl Project, already provide
object-oriented interfaces to some of the databases targeted by
BioParser. BioPerl is an ambitious project and defines an object model
for biological information that can be applied to a wide range of
biological data sources. While this generic data model is a great
strength of BioPerl, it can also be a weakness, as inexperienced
programmers often have great difficulty in coming to grips with
the BioPerl object model and are intimidated by the hundreds of
BioPerl modules.
BioParser on the other hand is much more limited in scope
and is tightly focussed on one task - providing easy access
to the information in text data files provided for download by
major biological databases.
This tight focus means that BioParser modules are very specific
to individual data sources. The disadvantage of this approach is
that every data access module has to be created from scratch. However,
it has a critical advantage - if someone is familiar with the online
version of a data source, then the methods in the corresponding
BioParser data module are likely to be fairly intuitive.
We hope that this will enable the BioParser system to
be used by researchers who are relatively inexperienced
as programmers.
Here's a simple Perl script that shows how the BioParser
system might be used to extract a list of data fields
from every human locus in LocusLink:
    #!/usr/bin/perl -w
    use strict;
    use Bio::Parser::LocusLink::SerialDatabase;
    my $infile = 'locuslink_serialdatabase.storable';
    my $sdb = Bio::Parser::LocusLink::SerialDatabase->new( -storable => $infile );
    while (my $locus = $sdb->next_record) {
        next unless $locus->organism eq 'Homo sapiens';
        print $locus->locusid, "\t",
              $locus->status, "\t",
              $locus->official_symbol || $locus->preferred_symbol, "\t",
              $locus->official_gene_name || $locus->preferred_gene_name, "\n";
    }
The script does the following:
- creates an object ($sdb) that loads a complete pre-processed copy of
NCBI's LocusLink database from a disk file
(locuslink_serialdatabase.storable)
- loops through LocusLink by calling the next_record method on the $sdb
object ($sdb->next_record), which returns BioParser LocusLink objects
($locus)
- uses the $locus object's organism method to skip loci unless they
relate to a human gene ('Homo sapiens')
- for human loci, prints the contents of four data fields: the locusid,
the status of the record, and the official or preferred gene symbol
and gene name
A fragment of the output from the script is shown here:
10016 REVIEWED PDCD6 programmed cell death 6
10017 REVIEWED BCL2L10 BCL2-like 10 (apoptosis facilitator)
10018 REVIEWED BCL2L11 BCL2-like 11 (apoptosis facilitator)
10019 PROVISIONAL LNK lymphocyte adaptor protein
1002 REVIEWED CDH4 cadherin 4, type 1, R-cadherin (retinal)
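The while loop in the script depends on next_record returning a false value once the records run out. The sketch below shows that iterator pattern in isolation so it can be run without BioParser installed. Demo::Locus and Demo::SerialDatabase are hypothetical stand-ins written for this example, not the real BioParser classes, and the mouse record is made-up test data; how the real next_record signals the end of the database is an assumption here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the BioParser classes; the real modules
# are not needed to run this sketch.
{
    package Demo::Locus;
    sub new      { my ( $class, %fields ) = @_; return bless {%fields}, $class }
    sub locusid  { return $_[0]{locusid} }
    sub organism { return $_[0]{organism} }
}
{
    package Demo::SerialDatabase;

    sub new {
        my ( $class, @records ) = @_;
        return bless { records => [@records], position => 0 }, $class;
    }

    # Return the next record, or undef once the list is exhausted,
    # which is what ends the caller's while loop.
    sub next_record {
        my ($self) = @_;
        return if $self->{position} >= @{ $self->{records} };
        return $self->{records}[ $self->{position}++ ];
    }
}

# Two locus ids from the sample output plus one made-up non-human record.
my $sdb = Demo::SerialDatabase->new(
    Demo::Locus->new( locusid => 10016, organism => 'Homo sapiens' ),
    Demo::Locus->new( locusid => 99999, organism => 'Mus musculus' ),
    Demo::Locus->new( locusid => 10017, organism => 'Homo sapiens' ),
);

# Same loop shape as the LocusLink script: filter, then collect a field.
my @human_ids;
while ( my $locus = $sdb->next_record ) {
    next unless $locus->organism eq 'Homo sapiens';
    push @human_ids, $locus->locusid;
}
print join( ',', @human_ids ), "\n";
```

Because each record is handed back one at a time, the calling script never needs to know how the records are stored.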
The majority of the parser modules are based on Damian Conway's
fantastic Parse::RecDescent Perl module.
All BioParser scripts and modules contain Perl POD (Plain Old
Documentation) that follows a standard style outlined in the
Bio::TGen::Util::PodMaker module from the
TGen-POD2HTML Project.
By sticking to this standard, we can automatically generate
HTML documentation for all BioParser modules and
scripts directly from the code itself. Assuming you
are using a Unix-based operating system for development, once
BioParser is installed the POD should also be viewable using the
standard perldoc and man commands.
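For readers who have not seen POD before, here is a minimal sketch of documentation embedded in a Perl script. The script name format_locus_demo and the function format_locus are hypothetical, and the section headings are the conventional ones from the perlpod documentation; the layout required by the PodMaker standard is not described here and may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

=head1 NAME

format_locus_demo - hypothetical script showing embedded POD

=head1 SYNOPSIS

  perl format_locus_demo.pl

=head1 DESCRIPTION

Prints one tab-delimited line for a locus. The headings used here are
the conventional perlpod ones; the PodMaker standard may differ.

=head1 FUNCTIONS

=head2 format_locus( $id, $symbol )

Returns the two fields joined by a tab, with a trailing newline.

=cut

# Everything between the first =head1 and =cut is ignored by the Perl
# interpreter but picked up by POD tools such as perldoc.
sub format_locus {
    my ( $id, $symbol ) = @_;
    return join( "\t", $id, $symbol ) . "\n";
}

print format_locus( 10016, 'PDCD6' );
```

Saved as format_locus_demo.pl, the documentation can be read with "perldoc ./format_locus_demo.pl" while the script itself still runs normally.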
BioParser is copyright 2005 by The Translational Genomics Research
Institute. All rights reserved. This License is limited to, and you
may use the Software solely for, your own internal and non-commercial
use for academic and research purposes. Without limiting the foregoing,
you may not use the Software as part of, or in any way in connection
with the production, marketing, sale or support of any commercial
product or service or for any governmental purposes. For commercial or
governmental use, please contact licensing@tgen.org. By installing this
Software you are agreeing to the terms of the LICENSE file distributed
with this software.
In any work or product derived from the use of this Software, proper
attribution of the authors as the source of the software or data must be
made. The following URL should be cited:
http://bioinformatics.tgen.org/software/bioparser/