Bio::Parser    v1.36

^ NAME

Bio::Parser - Text Parsers for Bioinformatics Databases

^ SYNOPSIS

  use Bio::Parser;

^ DESCRIPTION

The BioParser Project is a collection of Perl modules that provide parsing and object interfaces for common bioinformatics text databases. The majority of the parser modules are based on Damian Conway's fantastic Parse::RecDescent Perl module.

All BioParser scripts and modules contain POD that follows a standard form outlined in the Bio::TGen::Util::PodMaker module from the TGen-POD2HTML project. By sticking to this standard, we can automatically generate detailed HTML documentation for all code directly from the code. The POD should also be viewable using the standard perldoc and man methods. (You are on a Unix-based operating system, aren't you?)

^ PUBLIC METHODS

These methods are inherited by every module in the BioParser system. They implement things that are common to (almost) all BioParser objects.

new()

The ultimate ancestor constructor for all classes in the BioParser system. It handles blessing the instance into the correct class as well as setting version values for the overall BioParser project and the invoking class.

bioparser_version()

Returns the version string for the installed BioParser system as determined by the $VERSION global for the Bio::Parser class. At the time of creation of any Bio::Parser object, the ancestor Bio::Parser::new() constructor copies its own class global $VERSION string into the object hash under the key 'BioParserVERSION'.

This accessor method is get-only. It has no set functionality because this instance attribute should never be reset.

creator_module_version()

Returns the version string for the module that created the current object. This is not necessarily the same as the class global $VERSION string of the currently loaded class. At the time of creation of any BioParser object, the ancestor Bio::Parser::new() constructor copies the class global $VERSION string for the creating module into the object hash under the key 'ClassVERSION'.

This means that every BioParser object, including those serialized to disk with to_storable(), carries the version of the original creating module. This lets us do some regression logic if newer modules are used with objects created by older modules.

This accessor method is get-only. It has no set functionality because this instance attribute should never be reset.

creation_time()

Returns a time stamp from when the root Bio::Parser class created this object instance. Since every BioParser object actually gets 'blessed' in the root Bio::Parser class, every BioParser object should have this stamp. It's a string scalar from a call to localtime().

This accessor method is get-only. It has no set functionality because this instance attribute should never be reset.
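The three get-only accessors above all read values that the root constructor stores when the object is created. Here is a minimal sketch of that bookkeeping. The package names My::Base and My::Base::Record and the 'CreationTime' hash key are assumptions made for illustration; only the 'BioParserVERSION' and 'ClassVERSION' keys come from the descriptions above, and the real Bio::Parser::new() may well differ.

```perl
use strict;
use warnings;

{
    package My::Base;                  # hypothetical stand-in for Bio::Parser
    our $VERSION = '1.36';

    sub new {
        my ($class) = @_;
        no strict 'refs';              # lets us read the invoking class's $VERSION by name
        my $self = {
            BioParserVERSION => $VERSION,
            ClassVERSION     => ${"${class}::VERSION"},
            CreationTime     => scalar localtime(),    # key name is an assumption
        };
        return bless $self, $class;
    }

    # Get-only accessors: any arguments are simply ignored.
    sub bioparser_version      { return $_[0]{BioParserVERSION} }
    sub creator_module_version { return $_[0]{ClassVERSION} }
    sub creation_time          { return $_[0]{CreationTime} }
}

{
    package My::Base::Record;          # hypothetical subclass
    our @ISA     = ('My::Base');
    our $VERSION = '0.07';
}

my $record = My::Base::Record->new;
print ref($record), ': module v', $record->creator_module_version,
      ', system v', $record->bioparser_version, "\n";
```

Because the values live in the object hash rather than in the class, they travel with the object when it is serialized.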

verbose()

Sets and gets the current verbosity level. The verbosity level for any BioParser object can be specified by supplying the -verbose parameter to the appropriate new() method.

to_storable()

  $self->to_storable( $self->id . '.storable' );

This method is only really meaningful for ::Record and ::SerialDatabase objects.

Uses the Perl Storable module to serialize the current object to disk. It takes a single argument, the filename used to save the record. By convention, we suggest that you base your filenames on the ID of the object and use a '.storable' extension. This is not required but it has worked well for us as a standard practice. You could also add a database-specific prefix such as 'll_' for LocusLink.
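As a rough illustration of the naming convention, the sketch below calls Storable's store() and retrieve() directly instead of going through to_storable(). The class name My::Record, the 'ID' hash key and the 'omim_' prefix are assumptions, and a temporary directory is used so the example cleans up after itself.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Hypothetical record: a blessed hash with an 'ID' key (the key name is an assumption).
my $record = bless { ID => 165, TI => 'Example title' }, 'My::Record';

# Build the filename from the ID, with an optional database-specific prefix.
my $dir      = tempdir( CLEANUP => 1 );
my $filename = "$dir/omim_" . $record->{ID} . '.storable';

store( $record, $filename ) or die "Could not serialize to $filename\n";

# Round trip: retrieve() hands back an object blessed into the original class.
my $copy = retrieve($filename);
print ref($copy), ' ', $copy->{ID}, "\n";
```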

from_storable()

This subroutine takes a filename and tries to retrieve a serialized BioParser object from that file. The object must belong to the same class as the invoker, otherwise undef is returned to indicate failure. The following code fragments show how you might traverse a directory of serialized Record objects. You can call from_storable() as an instance or class method, so the two following code fragments are equivalent:

  my $filename = '165.storable';  # 165 = ID of first record
  my $getter = Bio::Parser::OMIM::Record->new();
  while (my $record = $getter->from_storable( $filename )) {
      # Do some processing on the current record
      $filename = $record->next_id() . '.storable';
  }

  my $filename = '165.storable';  # 165 = ID of first record
  while (my $record = Bio::Parser::OMIM::Record->from_storable( $filename )) {
      # Do some processing on the current record
      $filename = $record->next_id() . '.storable';
  }

The use of $getter in the first example might seem silly but it is required since from_storable() has logic to make sure that the class of the object retrieved from the serialized file matches the invoker of the method. In other words, only a ::Record object can retrieve ::Record objects. This saves a lot of potential problems.
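The class check described above could be sketched along the following lines. This is not the BioParser implementation: the class names are hypothetical, and the exact-match test on ref() is an assumption (the real method might accept subclasses).

```perl
use strict;
use warnings;
use Storable qw(store);
use File::Temp qw(tempdir);

{
    package My::Record;                # hypothetical class names throughout
    sub new { my $class = shift; return bless {@_}, $class }

    # Works as a class or instance method: ref() is empty for a plain class name.
    sub from_storable {
        my ( $invocant, $filename ) = @_;
        my $class = ref($invocant) || $invocant;
        return unless defined $filename && -e $filename;
        my $object = eval { Storable::retrieve($filename) };
        return unless defined $object && ref($object) eq $class;
        return $object;
    }
}

{
    package My::Other;                 # a different class sharing the same method
    our @ISA = ('My::Record');
}

my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/165.storable";
store( My::Record->new( ID => 165 ), $file );

my $via_class    = My::Record->from_storable($file);
my $via_instance = My::Record->new->from_storable($file);
my $wrong_class  = My::Other->from_storable($file);    # undef: class mismatch

print defined $wrong_class ? "retrieved\n" : "refused\n";
```

Returning an empty list (undef in scalar context) on any failure is what lets the while loops above terminate cleanly when the next file does not exist.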

storable_filename()

This routine returns the filename if the current object was retrieved from storable. It is only ever set by the from_storable() method so subclasses can use it in their new() method to work out whether or not they need to do any special processing that they would normally do on a brand new object. This sort of logic can be skipped for an object pulled from storable because it should have been done when the serialized object was originally created.

Some classes such as Bio::Parser::LocusLink::Record do no special processing on new records so the new() method in that class never even has to check whether the object returned from the call to SUPER::new() was retrieved from a storable object or created fresh.

Other classes such as Bio::Parser::Resourcerer::Record do a lot of special processing on new records so the new() method in that class has to check whether the object from SUPER::new() was retrieved from a storable object and skip the processing if it was.

checksum()

   my $crc1 = $object1->checksum();
   my $crc2 = $object2->checksum();
   if ($crc1 == $crc2) {
       print "Objects are equivalent\n";
   }
   else {
       print "Checksums do not match: $crc1, $crc2\n";
   }

This routine uses the Storable.pm nfreeze() method to create an image of the current object as a single scalar in memory. A CRC-32 checksum is calculated across the scalar and returned.

This function is useful when you want to check whether two or more Bio::Parser objects are identical. This is not currently used but it will be in the future when the bpr_patch.pl script will be able to update a SerialDatabase file by comparing the objects in it against a "patch" file of changed objects.

Assuming you haven't reset the value of $Storable::canonical to false, all hashes in the nfreeze'd representation should have keys arranged in a canonical order (alphabetically in this case). This is very important as it is this property that allows multiple objects to be compared. Storable's default behaviour is to store hash values in the order they occur, which is non-deterministic. In other words, logically identical objects might be arranged in memory in slightly different ways. Even these small differences would be enough to cause the objects to return different CRC32 numbers even though they contain logically identical information.

Just to reinforce the point, BioParser sets the $Storable::canonical variable to a true value and you must not change it if you expect to use the BioParser CRC routines.
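To make the canonical-ordering point concrete, here is a small sketch that freezes two hashes holding the same data inserted in different orders and compares their checksums. The helper names crc32_of() and checksum_of() are our own for this example, and the bitwise CRC-32 is a plain textbook implementation standing in for the BioParser routines, not a copy of them.

```perl
use strict;
use warnings;
use Storable qw(nfreeze);

# Bitwise CRC-32 (reflected polynomial 0xEDB88320) over a byte string.
sub crc32_of {
    my ($data) = @_;
    my $crc = 0xFFFFFFFF;
    for my $byte ( unpack 'C*', $data ) {
        $crc ^= $byte;
        for ( 1 .. 8 ) {
            $crc = ( $crc & 1 ) ? ( ( $crc >> 1 ) ^ 0xEDB88320 ) : ( $crc >> 1 );
        }
    }
    return $crc ^ 0xFFFFFFFF;
}

# Freeze with sorted hash keys, then checksum the resulting scalar.
sub checksum_of {
    my ($object) = @_;
    local $Storable::canonical = 1;
    return crc32_of( nfreeze($object) );
}

# Same content, keys inserted in a different order.
my %first;  $first{$_}  = "value_$_" for qw(NO TI TX CS);
my %second; $second{$_} = "value_$_" for qw(CS TX TI NO);

my $crc1 = checksum_of( \%first );
my $crc2 = checksum_of( \%second );
printf "%08x %08x %s\n", $crc1, $crc2, ( $crc1 == $crc2 ? 'match' : 'differ' );
```

Using local to set $Storable::canonical keeps the setting confined to the one call, which is a design choice of this sketch; BioParser itself sets the variable globally as described above.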

The CRC32 code used to compare the nfreeze'd objects is adapted from the Non-XS routines in version 0.09 of Oliver Maul's Digest::CRC module which is available from CPAN. Digest::CRC has a great many extra features that we don't need so we just pulled the chunks of code we needed. Oliver specifically disclaimed all copyright in his module so we assume he won't mind that we borrowed from it. Of course this means that we also disclaim all copyright to the code that was borrowed, i.e. the majority of the contents of the routines _crc_reflect(), _crc_init() and _crc32().

N.B. We've noticed that the CRCs calculated can change from machine to machine. This could be due to 32-bit vs 64-bit perl versions but we haven't tested it extensively yet. Caveat Emptor.

object_info()

All Bio::Parser subclass modules should implement this method, so the copy in Bio::Parser is a stub that should never get called. If it does get called, it writes out a warning identifying the subclass that appears not to implement the method and then returns basic Bio::Parser object information by calling the private _object_info() primitive method.

^ PRIVATE METHODS

None of these methods needs to be exposed to a user of the BioParser system but you will want to use some of them if you are a developer implementing new BioParser modules.

_deep_copy()

This method makes a deep copy of a complex data structure. It takes a single reference as input. You need to use this to return complex internal data structures because if you just return the ref then the user is actually operating on the real copy of the data and could possibly break/modify it.
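A recursive copy in the spirit of the one described here might look like the sketch below. It is written as a plain function named deep_copy() for brevity, handles only nested hashes and arrays, and is not the BioParser code itself.

```perl
use strict;
use warnings;

# Recursively copy nested hashes and arrays; any other value is returned as-is.
# Blessed references and code refs are not treated specially in this sketch.
sub deep_copy {
    my ($this) = @_;
    if ( ref $this eq 'HASH' ) {
        return +{ map { $_ => deep_copy( $this->{$_} ) } keys %$this };
    }
    elsif ( ref $this eq 'ARRAY' ) {
        return [ map { deep_copy($_) } @$this ];
    }
    return $this;
}

my $original = { ids => [ 165, 166 ], title => { short => 'OMIM' } };
my $copy     = deep_copy($original);

push @{ $copy->{ids} }, 167;           # changes the copy only
print scalar @{ $original->{ids} }, ' vs ', scalar @{ $copy->{ids} }, "\n";
```

An accessor that returns deep_copy( $self->{some_field} ) instead of the raw reference protects the internal data from accidental modification by the caller.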

This method is borrowed directly from Randal Schwartz's article at http://www.stonehenge.com/merlyn/UnixReview/col30.html with minor mods to make it OO.

_object_info()

This routine is a primitive that subclass object_info() methods can call to do some of the basic reporting that should be common.

_crc32()

This class method can be used to calculate a CRC-32 checksum for a single scalar which must be supplied by the caller. It is called by the checksum() method which is how users are expected to calculate CRCs.

_crc_init()

_crc_reflect()

These 2 routines should never be called by anyone. They are written as functions, not OO style methods, since they'll be called to set the class global $CRC32 when the Bio::Parser module is first require'd. The CRC calculating stuff only needs to be initialized once for the entire BioParser system and once initialized, can be shared by any number of objects.
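The initialize-once pattern can be sketched as follows. We assume here that the shared class global holds a 256-entry lookup table, which is how table-driven CRC-32 code usually works; the function names crc_table_init() and crc32_with_table() are invented for this example and the real _crc_init() and _crc_reflect() routines may be organized differently.

```perl
use strict;
use warnings;

# Build the 256-entry lookup table for the reflected CRC-32 polynomial.
sub crc_table_init {
    my @table;
    for my $n ( 0 .. 255 ) {
        my $c = $n;
        $c = ( $c & 1 ) ? ( ( $c >> 1 ) ^ 0xEDB88320 ) : ( $c >> 1 ) for 1 .. 8;
        push @table, $c;
    }
    return \@table;
}

# A plain function call at load time: one table, shared by every caller.
our $CRC32 = crc_table_init();

# One table lookup per input byte instead of eight shift-and-xor steps.
sub crc32_with_table {
    my ($data) = @_;
    my $crc = 0xFFFFFFFF;
    $crc = $CRC32->[ ( $crc ^ $_ ) & 0xFF ] ^ ( $crc >> 8 ) for unpack 'C*', $data;
    return $crc ^ 0xFFFFFFFF;
}

printf "%08x\n", crc32_with_table('123456789');
```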

^ PARSING A NEW DATABASE

You may find that you have a favorite database that does not have a BioParser parser. In this case, you have 2 options: (a) contact the TGen BioParser team at the email address shown in the AUTHORS section; (b) roll your own parser. A number of scripts that are used internally by the BioParser development team are included in this distribution which should help you if you choose option (b). In this section we'll describe these scripts and some techniques that the team has found helpful.

Note, we claim no special expertise in Parse::RecDescent and there are probably better ways to do what we've done. All we can claim is that we know just enough to make Parse::RecDescent do what we want. If you are a Parse::RecDescent expert and would be willing to share your knowledge, we'd love to hear your suggestions.

Let us look at an example that describes our basic development process. Here is the step-by-step process that we followed in creating parser modules for the OMIM database.

* Create a directory for the new parser

This directory should be placed under the Bio::Parser hierarchy. In the case of OMIM it would be Bio::Parser::OMIM.

* Copy modules from an existing parser

At a minimum, you will need the FileParser.pm, RDParser.pm, and Record.pm modules. If the database is relatively small (under 100MB in text format), you may be able to go the extra mile and create a serialized version of the whole database, in which case you'll also want to make a copy of the SerialDatabase.pm module. Now let us create the three modules under Bio::Parser::OMIM.

FileParser Module. First you need to copy the FileParser.pm module from an existing parser. This module is used to iterate through a text file one record at a time (in our case, the OMIM text file). It is primarily used in scripts that have no need to look at a record more than once, so simply looping through all records is acceptable. When creating a new OMIM::FileParser object, the only thing to be passed in is the name of the text file to be parsed. After that it's simply a matter of starting a loop and calling next_record() until the end of the file is reached, as shown in the example below.

  use Bio::Parser::OMIM::FileParser;
  my $om_datafile = '/usr/local/data/omim.txt';
  my $parser = Bio::Parser::OMIM::FileParser->new(
                   -file => $om_datafile );
  
  $parser->object_mode(1);
  while (my $omo = $parser->next_record){
     # Process $omo which is a Bio::Parser::OMIM::Record object.
     print $omo->no, "\t",
           $omo->title, "\n";
  }
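If you are wondering what the record-at-a-time iteration involves, here is a stripped-down sketch. It assumes each record is introduced by a line reading *RECORD*, returns raw record text instead of Record objects, and uses a hypothetical My::FileParser class; none of this is taken from the real FileParser.pm.

```perl
use strict;
use warnings;

{
    package My::FileParser;            # hypothetical, not Bio::Parser::OMIM::FileParser

    sub new {
        my ( $class, %args ) = @_;
        open my $fh, '<', $args{-file} or die "Cannot open input: $!\n";
        return bless { fh => $fh }, $class;
    }

    # Return the text of the next record, or nothing at end of file.
    sub next_record {
        my $self = shift;
        my $fh   = $self->{fh};
        local $/ = "*RECORD*\n";       # read up to the next record marker
        while ( defined( my $chunk = <$fh> ) ) {
            chomp $chunk;              # strips the trailing marker, if present
            return $chunk if length $chunk;    # skip the empty chunk before the first marker
        }
        return;
    }
}

# An in-memory "file" keeps the example self-contained; a filename works the same way.
my $text = join '', map { "$_\n" }
    '*RECORD*', '*FIELD* NO', '165', '*FIELD* TI', 'First title',
    '*RECORD*', '*FIELD* NO', '166', '*FIELD* TI', 'Second title';

my $parser = My::FileParser->new( -file => \$text );
my @numbers;
while ( my $record = $parser->next_record ) {
    my @lines = split /\n/, $record;
    push @numbers, $lines[1];          # the line after "*FIELD* NO"
}
print "@numbers\n";
```

Only one record is held in memory at a time, which is why this style suits scripts that make a single pass over a large file.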

RDParser Module. This module takes a text scalar that contains a single text record and returns an object containing all of the data. It can be used standalone but is really intended for use by the matching FileParser module which iterates through a text file one record at a time.

This module implements a Parse::RecDescent grammar for parsing text records. You can copy an existing RDParser module and modify the grammar so that it matches the records in your text file. You will have to make a list of all the available fields for a given text record and then write grammar to match those fields. For example, an OMIM record has eleven *FIELD* values such as NO, TI and MN, and we have to write grammar to match each one of those fields. One way to parse these fields is via a grammar rule such as:

  NO:   /^\*FIELD\*/ 'NO'  EOL
        integer EOL  
        { $omim{'NO'} = $item{integer}; }

This rule says that a "NO" field should be two lines - the first line contains the string "*FIELD* NO" and an end-of-line character. The second line contains an integer followed by an end-of-line character. If this pattern is found then the "NO" rule is satisfied and the integer is saved into the %omim hash under the 'NO' key.
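For readers who have not met Parse::RecDescent before, the same two-line pattern can be written as a plain Perl regular expression. To be clear, this is a swapped-in technique for illustration only and not how the RDParser modules work; the sample record text is made up.

```perl
use strict;
use warnings;

my $record = "*FIELD* NO\n165\n*FIELD* TI\nExample title\n";
my %omim;

# Line 1: the literal "*FIELD* NO"; line 2: an integer on a line of its own.
if ( $record =~ /^\*FIELD\*[ \t]+NO[ \t]*\n(\d+)[ \t]*\n/m ) {
    $omim{'NO'} = $1;
}

print "NO = $omim{'NO'}\n";
```

A grammar earns its keep once fields nest, repeat or span many lines, which is where a pile of independent regexes becomes hard to maintain.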

Record Module. This module is used by the matching FileParser module as a data container to store all of the data for a single data record. This module basically just implements accessor methods. Accessor methods come in a number of flavours defined by their outputs: some return a single scalar value, others return an array of values or an arrayref depending on context. In the case of OMIM we will need eleven accessor methods, one per field, each returning whatever suits that field. An example accessor method for the field NO, which returns a scalar, is shown below:

  sub no {
      my $self = shift;
      return $self->{'NO'} = shift if @_;
      return $self->{'NO'};
  }
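The other flavour of accessor, one that returns a list or an arrayref depending on context, can be sketched with wantarray. The class name My::Record and the choice of TI as a scalar field and RF as a multi-valued field are assumptions made for this example.

```perl
use strict;
use warnings;

{
    package My::Record;                # hypothetical record class

    sub new {
        my $class = shift;
        my $self  = { TI => undef, RF => [] };
        return bless $self, $class;
    }

    # Scalar field: get, or set when an argument is supplied.
    sub ti {
        my $self = shift;
        $self->{'TI'} = shift if @_;
        return $self->{'TI'};
    }

    # Multi-valued field: a list in list context, a fresh arrayref in scalar context.
    sub rf {
        my $self = shift;
        $self->{'RF'} = [@_] if @_;
        return wantarray ? @{ $self->{'RF'} } : [ @{ $self->{'RF'} } ];
    }
}

my $rec = My::Record->new;
$rec->ti('Example title');
$rec->rf( 'first reference', 'second reference' );

my @refs = $rec->rf;                   # list context
my $refs = $rec->rf;                   # scalar context
print $rec->ti, ': ', scalar(@refs), " references, first is '$refs->[0]'\n";
```

Returning a fresh arrayref rather than the internal one means callers cannot alter the record by accident, at the cost of a shallow copy on each call.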

SerialDatabase Module. A SerialDatabase is a single perl object that contains all of your data records as objects plus some methods to access them by ID number or to traverse through them in the order they occurred in the original data file.

There are several methods that can be used in this module. For example, the new() method creates a new instance of the SerialDatabase and the next_record() method traverses through the records present in your SerialDatabase. As mentioned earlier, you could copy an existing SerialDatabase module and make the necessary changes. Using a SerialDatabase module presumes you can get the whole thing into memory at once, so it only makes sense for small databases or databases where you only use a subset of records from a larger database. A SerialDatabase containing OMIM records could be used:

  use Bio::Parser::OMIM::SerialDatabase;
  my $osdb_datafile = '/usr/local/data/omim_serialdatabase.storable';
  my $osdb = Bio::Parser::OMIM::SerialDatabase->new(
                  '-storable' => $osdb_datafile );
  while (my $omo = $osdb->next_record) {
      print $omo->no, "\n";
  }
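Conceptually, a SerialDatabase needs little more than a hash for lookup by ID and an array that remembers file order. The sketch below shows one way to arrange that. The class My::SerialDatabase, its add_record() and record_by_id() methods and the plain-hash records are all hypothetical and are not the BioParser API.

```perl
use strict;
use warnings;

{
    package My::SerialDatabase;        # hypothetical container class

    sub new {
        my $class = shift;
        return bless { records => {}, order => [], cursor => 0 }, $class;
    }

    # Store a record under its ID and remember the order of first insertion.
    sub add_record {
        my ( $self, $id, $record ) = @_;
        push @{ $self->{order} }, $id unless exists $self->{records}{$id};
        $self->{records}{$id} = $record;
        return $self;
    }

    sub record_by_id { return $_[0]{records}{ $_[1] } }

    # Walk the records in original order; the cursor resets after the last one.
    sub next_record {
        my $self = shift;
        if ( $self->{cursor} >= @{ $self->{order} } ) {
            $self->{cursor} = 0;
            return;
        }
        return $self->{records}{ $self->{order}[ $self->{cursor}++ ] };
    }
}

my @records = ( { NO => 165 }, { NO => 103 }, { NO => 201 } );
my $db      = My::SerialDatabase->new;
$db->add_record( $_->{NO}, $_ ) for @records;

my @seen;
while ( my $rec = $db->next_record ) {
    push @seen, $rec->{NO};
}
print "@seen\n";
print $db->record_by_id(103)->{NO}, "\n";
```

Because the whole structure is one Perl object, it can be written to and read from disk in a single Storable call, which is also why it has to fit in memory.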

Assuming BioParser contains a SerialDatabase module for a given database, the two scripts bpr_serialize.pl and bpr_create_serial_database.pl can be used together to create a SerialDatabase file.

If you are considering developing a parser for a new database, we'd suggest OMIM as a good candidate for copying as it's relatively simple but has all of the required functionality. One day we'll get around to creating a set of skeleton modules that can be used as templates rather than copying from OMIM.

* Download the text version of the target database

^ SEE ALSO

^ AUTHORS

  • John Pearson, bioinfresearch@tgen.org
  • Deepthi Chidambaram
  • Vidyadhari Edupugani
  • Srilaskmi Ganta
  • Vijaylakshmi Shanmugam

^ VERSION

$Id: Parser.pm,v 1.36 2007/08/16 06:50:13 jpearson Exp $

^ COPYRIGHT

BioParser is copyright 2005 by The Translational Genomics Research Institute. All rights reserved. This License is limited to, and you may use the Software solely for, your own internal and non-commercial use for academic and research purposes. Without limiting the foregoing, you may not use the Software as part of, or in any way in connection with the production, marketing, sale or support of any commercial product or service or for any governmental purposes. For commercial or governmental use, please contact licensing@tgen.org. By installing this Software you are agreeing to the terms of the LICENSE file distributed with this software.

In any work or product derived from the use of this Software, proper attribution of the authors as the source of the software or data must be made. The following URL should be cited:

http://bioinformatics.tgen.org/software/bioparser/