BioParser Project

^ About the BioParser Project

The BioParser Project is a collection of Perl modules and scripts that provide parsers and object interfaces for common bioinformatics text databases.

BioParser provides tools that allow you to easily search and manipulate text data files downloaded from NCBI's LocusLink, UniGene, GenBank and OMIM resources and from TIGR's Resourcerer and Georgetown's PDB. Other databases are targeted for incorporation into BioParser including PubMed, dbEST, UniProt and dbSNP.

Advantages of BioParser text databases include:

  • no need for internet access once the text database has been downloaded
  • for smaller databases, the whole database is read into memory which gives very quick access times
  • no need for database software such as MySQL or Oracle
  • uses only perl modules and scripts so is easy to install and works well cross-platform
  • does not require high CPU speeds so runs well on laptops and older servers

Limitations of BioParser text databases:

  • works best with large amounts of memory so databases can be fully cached
  • does not scale to very large databases such as GenBank and PubMed (works well for subsets of records from these databases)
  • the initial process where text records are converted into data objects can be slow

How BioParser works:

  1. user downloads and installs BioParser
  2. user downloads target text data file from data source
  3. user initiates parsing of the text file

^ Background

The number and sophistication of databases providing biological information are expanding rapidly, and many of the databases have excellent websites that provide researchers with tools to search and view the data. The problem with web interfaces is that they seldom scale. In other words, websites are great if you want to look at one record or tens of records. If, however, you want to look at hundreds or thousands of records, it is impractical to click through all of the pages on a website. In such a situation, you need direct access to the data.

There are two ways you can get direct access to the data. The first is to use the online search and query tools provided by the website to retrieve one or more records. The limitation of this approach is that the flexibility of the online tools depends on the programmer resources that have been devoted to developing them. Unless the website providers have a lot of resources, many useful queries will not be possible.

The second way to get direct access to the data is to download a copy. Many biological databases make their contents available for download as text files. Unfortunately there is no standard format for these files and they are often difficult to understand and use. BioParser was designed to make these text files easy to use by removing the need for researchers to understand the file contents. BioParser provides modules that convert the text files into data objects that are saved to disk for later use. Parsing a text record into a data object can be time consuming, but once it has been done, loading the pre-made data record from disk is almost instantaneous.
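The parse-once, load-many-times idea can be sketched with Perl's core Storable module. This is only an illustration under stated assumptions: the "KEY: value" record format, the "//" separator, the parse_records function and the GENE_A/GENE_B values are all invented for the sketch and are not BioParser's own formats or API.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Hypothetical flat-file format, invented for this sketch: one "KEY: value"
# pair per line, with records separated by a line containing only "//".
sub parse_records {
    my ($text) = @_;
    my @records;
    for my $chunk ( split m{^//\s*$}m, $text ) {
        my %record;
        for my $line ( split /\n/, $chunk ) {
            next unless $line =~ /^(\w+):\s*(.*?)\s*$/;
            $record{$1} = $2;
        }
        push @records, \%record if %record;    # skip empty chunks
    }
    return \@records;
}

# Made-up sample data, not taken from any real database.
my $text = <<'END';
ID: 1
SYMBOL: GENE_A
//
ID: 2
SYMBOL: GENE_B
//
END

# Slow step, done once: parse the text and serialize the result to disk.
my $parsed = parse_records($text);
my ( undef, $cache_file ) = tempfile( SUFFIX => '.storable', UNLINK => 1 );
store( $parsed, $cache_file );

# Fast step, done on every later run: reload the pre-built structure.
my $records = retrieve($cache_file);
print "$_->{ID}\t$_->{SYMBOL}\n" for @$records;
```

In real use the cache file would be kept between runs rather than written to a temporary file, so that only the retrieve step is repeated.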

It should be noted that a number of other resources, including the excellent BioPerl Project, already provide object-oriented interfaces to some of the databases targeted by BioParser. BioPerl is an ambitious project and defines an object model for biological information that can be applied to a wide range of biological data sources. While this generic data model is a great strength of BioPerl, it can also be a weakness, as inexperienced programmers often have great difficulty in coming to grips with the BioPerl object model and are intimidated by the hundreds of BioPerl modules.

BioParser on the other hand is much more limited in scope and is tightly focussed on one task - providing easy access to the information in text data files provided for download by major biological databases.

This tight focus means that BioParser modules are very specific to individual data sources. The disadvantage of this approach is that every data access module has to be created from scratch. However, it has a critical advantage: if someone is familiar with the online version of a data source, then the methods in the corresponding BioParser data module are likely to be somewhat intuitive. We hope that this will enable the BioParser system to be used by researchers who are relatively inexperienced as programmers.


^ Usage Example

Here's a simple Perl script that shows how the BioParser system might be used to extract a list of data fields from every human locus in LocusLink:

#!/usr/bin/perl -w

use strict;
use Bio::Parser::LocusLink::SerialDatabase;

my $infile = 'locuslink_serialdatabase.storable';
my $sdb = Bio::Parser::LocusLink::SerialDatabase->new( -storable => $infile );

while (my $locus = $sdb->next_record) {
    next unless $locus->organism eq 'Homo sapiens';
    print $locus->locusid, "\t",
          $locus->status, "\t",
          $locus->official_symbol || $locus->preferred_symbol, "\t",
          $locus->official_gene_name || $locus->preferred_gene_name, "\n";
}

The script does the following:

  1. creates an object ($sdb) that loads a complete pre-processed copy of NCBI's LocusLink database from a disk file (locuslink_serialdatabase.storable)
  2. loops through LocusLink by calling the next_record method on the $sdb object ($sdb->next_record), which returns one BioParser LocusLink object ($locus) per call
  3. skips any locus that does not relate to a human gene, using the $locus object's organism method to test for 'Homo sapiens'
  4. prints four data fields for each human locus - the locusid, the status of the record, and the official or preferred gene symbol and gene name

A fragment of the output from the script is shown here:

10016   REVIEWED        PDCD6   programmed cell death 6
10017   REVIEWED        BCL2L10 BCL2-like 10 (apoptosis facilitator)
10018   REVIEWED        BCL2L11 BCL2-like 11 (apoptosis facilitator)
10019   PROVISIONAL     LNK     lymphocyte adaptor protein
1002    REVIEWED        CDH4    cadherin 4, type 1, R-cadherin (retinal)
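The same loop pattern can be used to summarise records rather than print them. The sketch below tallies human loci by record status. So that it runs without BioParser installed, it uses two hypothetical stand-in classes (Mock::Locus and Mock::SerialDatabase). Only the method names next_record, organism, status and locusid come from the script above; the assumption that next_record returns undef when the records run out is ours.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the BioParser classes, for illustration only.
package Mock::Locus;
sub new      { my ( $class, %args ) = @_; return bless {%args}, $class }
sub locusid  { $_[0]{locusid} }
sub organism { $_[0]{organism} }
sub status   { $_[0]{status} }

package Mock::SerialDatabase;
sub new { my ( $class, @records ) = @_; return bless { queue => [@records] }, $class }

# Assumption: next_record returns undef once the records are exhausted.
sub next_record { shift @{ $_[0]{queue} } }

package main;

# The three human entries reuse values from the output fragment above;
# the mouse entry (locusid 99999) is invented to exercise the filter.
my $sdb = Mock::SerialDatabase->new(
    Mock::Locus->new( locusid => 10016, organism => 'Homo sapiens', status => 'REVIEWED' ),
    Mock::Locus->new( locusid => 10019, organism => 'Homo sapiens', status => 'PROVISIONAL' ),
    Mock::Locus->new( locusid => 99999, organism => 'Mus musculus', status => 'REVIEWED' ),
    Mock::Locus->new( locusid => 1002,  organism => 'Homo sapiens', status => 'REVIEWED' ),
);

# Tally human loci by record status.
my %count_by_status;
while ( my $locus = $sdb->next_record ) {
    next unless $locus->organism eq 'Homo sapiens';
    $count_by_status{ $locus->status }++;
}

print "$_\t$count_by_status{$_}\n" for sort keys %count_by_status;
```

With a real BioParser installation, the mock classes and the hand-built record list would presumably be replaced by the SerialDatabase constructor call shown in the script above.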

The majority of the parser modules are based on Damian Conway's fantastic Parse::RecDescent Perl module.


^ Documentation

All BioParser scripts and modules contain Perl POD (Plain Old Documentation) that follows a standard style outlined in the Bio::TGen::Util::PodMaker module from the TGen-POD2HTML Project. By sticking to this standard, we can automatically generate HTML documentation for all BioParser modules and scripts directly from the code itself. Assuming you are using a Unix-based operating system for development, once BioParser is installed the POD should also be viewable using the standard perldoc and man commands.
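For readers unfamiliar with POD, here is a minimal sketch of documentation embedded alongside code. It uses only the conventional NAME, SYNOPSIS and METHODS sections. The specific style required by Bio::TGen::Util::PodMaker is not described here, so this should not be read as that standard, and the module name My::Example is hypothetical.

```perl
package My::Example;    # hypothetical module name, for illustration only
use strict;
use warnings;

=head1 NAME

My::Example - one-line description of the module

=head1 SYNOPSIS

    use My::Example;
    my $obj = My::Example->new( name => 'demo' );
    print $obj->name, "\n";

=head1 METHODS

=head2 new

Constructor. Takes a list of key/value pairs.

=head2 name

Returns the value passed as C<name> to the constructor.

=cut

# The perl interpreter skips everything between =head1 and =cut.
sub new  { my ( $class, %args ) = @_; return bless {%args}, $class }
sub name { $_[0]{name} }

package main;

my $obj = My::Example->new( name => 'demo' );
print $obj->name, "\n";
```

If this is saved to a file, running perldoc on that file should display the three documentation sections without executing any of the code.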


^ Copyright and Licensing

BioParser is copyright 2005 by The Translational Genomics Research Institute. All rights reserved. This License is limited to, and you may use the Software solely for, your own internal and non-commercial use for academic and research purposes. Without limiting the foregoing, you may not use the Software as part of, or in any way in connection with the production, marketing, sale or support of any commercial product or service or for any governmental purposes. For commercial or governmental use, please contact licensing@tgen.org. By installing this Software you are agreeing to the terms of the LICENSE file distributed with this software.

In any work or product derived from the use of this Software, proper attribution of the authors as the source of the software or data must be made. The following URL should be cited:

http://bioinformatics.tgen.org/software/bioparser/