Bio::Parser v1.36
Bio::Parser - Text Parsers for Bioinformatics Databases
use Bio::Parser;
The BioParser Project is a collection of perl modules that provide
parsing and object-interfaces for common Bioinformatics text databases.
The majority of the parser modules are based on Damian Conway's
fantastic Parse::RecDescent perl module.
All BioParser scripts and modules contain POD that follows a
standard form outlined in the Bio::TGen::Util::PodMaker module from
the TGen-POD2HTML project. By
sticking to this standard, we can automatically generate detailed
HTML documentation for all code directly from the code. The POD should
also be viewable using the standard perldoc and man methods. (You
are on a unix-based operating system, aren't you?)
These methods are inherited by every module in the BioParser system.
They implement things that are common to (almost) all BioParser
objects.
new()
The ultimate ancestor constructor for all classes in the BioParser
system. It handles blessing the instance into the correct class
as well as setting version values for the overall BioParser project
and the invoking class.
bioparser_version()
Returns the version string for the installed BioParser system as
determined by the $VERSION global for the Bio::Parser class.
At the time of creation of any Bio::Parser object, the
ancestor Bio::Parser::new() constructor copies its own
class global $VERSION string into the object hash under the key
'BioParserVERSION'.
This accessor method is get-only. It has no set functionality because
this instance attribute should never be reset.
creator_module_version()
Returns the version string for the module that created the current
object. This is not the same as the class global $VERSION string
for this class. At the time of creation of any BioParser object, the
ancestor Bio::Parser::new() constructor copies the class global $VERSION
string for the creating module into the object hash under the key
'ClassVERSION'.
This means that every BioParser object, including those serialized to
disk with to_storable(), carries the version of the original creating
module. This lets us do some
regression logic if newer modules are used with objects created by
older modules.
This accessor method is get-only. It has no set functionality because
this instance attribute should never be reset.
creation_time()
Holds a time stamp from when the root Bio::Parser class created this
object instance. Since every BioParser object actually gets 'blessed'
in the root Bio::Parser class, then every BioParser object should have
this stamp. It's a string scalar from a call to localtime().
This accessor method is get-only. It has no set functionality because
this instance attribute should never be reset.
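The three get-only attributes above can be pictured with a minimal sketch.
The packages My::Root and My::Root::Record are hypothetical stand-ins for
Bio::Parser and one of its subclasses, and the 'CreationTime' hash key is an
assumption (only 'BioParserVERSION' and 'ClassVERSION' are named above). This
is not the actual Bio::Parser code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the root Bio::Parser class.
{
    package My::Root;
    our $VERSION = '1.36';

    sub new {
        my ($class, %args) = @_;
        my $self = {
            BioParserVERSION => $VERSION,           # version of the root class
            ClassVERSION     => $class->VERSION,    # $VERSION of the invoking class
            CreationTime     => scalar localtime(), # assumed key name
        };
        return bless $self, $class;
    }

    # Get-only accessors: any arguments are ignored.
    sub bioparser_version      { return $_[0]{BioParserVERSION} }
    sub creator_module_version { return $_[0]{ClassVERSION} }
    sub creation_time          { return $_[0]{CreationTime} }
}

# Hypothetical subclass with its own version number.
{
    package My::Root::Record;
    our @ISA     = ('My::Root');
    our $VERSION = '0.07';
}

my $rec = My::Root::Record->new();
print 'system version:  ', $rec->bioparser_version,      "\n";
print 'creator version: ', $rec->creator_module_version, "\n";
print 'created:         ', $rec->creation_time,          "\n";
```

Because both versions are copied into the object hash at construction time,
they travel with the object when it is serialized.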
verbose()
Sets and gets the current verbosity level. The verbosity level for any
BioParser object can be specified by supplying the -verbose parameter
to the appropriate new() method.
to_storable()
$self->to_storable( $self->id . '.storable' );
This method is only really meaningful for ::Record and ::SerialDatabase
objects.
Uses the perl Storable module to serialize the current object to
disk. Takes a single argument which is the filename to be used to
save the record. By convention, we're suggesting that you base your
filenames
on the ID of the object and use a '.storable' extension. This is not
required but it has worked well for us as a standard practice.
You could add a database-specific prefix like 'll_' for LocusLink.
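As a rough illustration of what happens underneath, here is a minimal sketch
that uses Storable directly with the suggested filename convention. The
record contents are invented, and a temporary directory is used so the
example cleans up after itself. It does not call to_storable() itself.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);
use File::Spec;

my $dir    = tempdir( CLEANUP => 1 );
my $record = { ID => 165, TI => 'example title' };    # invented record data

# Filename convention from the text: optional prefix, ID, '.storable'.
my $file = File::Spec->catfile( $dir, 'll_' . $record->{ID} . '.storable' );

store( $record, $file ) or die "Could not write $file\n";

# Read it back to confirm the round trip.
my $copy = retrieve($file);
print "Restored record $copy->{ID} from $file\n";
```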
from_storable()
This subroutine takes a filename and tries to retrieve a serialized
BioParser object from the filename. The object must also
belong to the same class as the invoking class or an undef
is returned indicating failure. The following code fragments show how
you might traverse a directory of serialized Record objects.
You can call from_storable() as an instance or class method so the
two following code fragments are equivalent:
my $filename = '165.storable'; # 165 = ID of first record
my $getter = Bio::Parser::OMIM::Record->new();
while (my $record = $getter->from_storable( $filename )) {
# Do some processing on the current record
$filename = $record->next_id() . '.storable';
}
my $filename = '165.storable'; # 165 = ID of first record
while (my $record = Bio::Parser::OMIM::Record->from_storable( $filename )) {
# Do some processing on the current record
$filename = $record->next_id() . '.storable';
}
The use of $getter in the first example might seem silly but it is
required since from_storable() has logic to make sure that the class
of the object retrieved from the serialized file matches the invoker
of the method. In other words, only a ::Record object can retrieve
::Record objects. This saves a lot of potential problems.
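The class check can be sketched as follows. My::Record is a hypothetical
class, and the exact test used by the real from_storable() is not shown in
this document, so the comparison of ref() against the invoking class is an
assumption.

```perl
use strict;
use warnings;
use Storable qw(store);
use File::Temp qw(tempdir);
use File::Spec;

{
    package My::Record;    # hypothetical class
    sub new { my ($class, %args) = @_; return bless {%args}, $class }

    # Works as a class method or as an instance method.
    sub from_storable {
        my ($invoker, $filename) = @_;
        my $class = ref($invoker) || $invoker;
        return unless defined $filename && -r $filename;
        my $obj = eval { Storable::retrieve($filename) };
        # Refuse objects that do not belong to the invoking class.
        return unless defined $obj && ref($obj) eq $class;
        return $obj;
    }
}

my $dir  = tempdir( CLEANUP => 1 );
my $file = File::Spec->catfile( $dir, '165.storable' );
store( My::Record->new( ID => 165 ), $file );

my $via_class    = My::Record->from_storable($file);
my $via_instance = My::Record->new->from_storable($file);

# Pretend a different class is the invoker: the retrieval is refused.
my $wrong_class = My::Record::from_storable( 'My::Other', $file );

print 'class call:    ', ( defined $via_class    ? 'ok' : 'undef' ), "\n";
print 'instance call: ', ( defined $via_instance ? 'ok' : 'undef' ), "\n";
print 'wrong class:   ', ( defined $wrong_class  ? 'ok' : 'undef' ), "\n";
```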
storable_filename()
This routine returns the filename if the current object
was retrieved from storable. It is only ever set by the
from_storable() method so subclasses can use it in their new()
method to work out whether or not they need to do any special
processing that they would normally do on a brand new object. This sort
of logic can be skipped for an object pulled from storable because
it should have been done when the serialized object was originally
created.
Some classes such as Bio::Parser::LocusLink::Record do no special
processing on new records so the new() method in that class never
even has to check whether the object returned from the call to
SUPER::new() was retrieved from a storable
object or created fresh.
Other classes such as Bio::Parser::Resourcerer::Record do a lot
of special processing on new records so the new() method in that
class has to check whether the object from SUPER::new()
was retrieved from a storable object and skip the processing if it was.
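A minimal sketch of that check is shown below. My::Base and My::Record are
hypothetical classes, and both the '-storable' constructor parameter and the
'StorableFilename' hash key are assumptions made for the illustration.

```perl
use strict;
use warnings;
use Storable qw(store);
use File::Temp qw(tempdir);
use File::Spec;

{
    package My::Base;    # hypothetical stand-in for the root class
    sub new {
        my ($class, %args) = @_;
        if ( defined $args{-storable} ) {
            my $self = Storable::retrieve( $args{-storable} );
            $self->{StorableFilename} = $args{-storable};    # assumed key name
            return $self;
        }
        return bless {%args}, $class;
    }
    sub storable_filename { return $_[0]{StorableFilename} }
}

{
    package My::Record;    # hypothetical subclass with extra set-up work
    our @ISA = ('My::Base');
    sub new {
        my ($class, %args) = @_;
        my $self = $class->SUPER::new(%args);
        # Only freshly created objects need the special processing.
        unless ( defined $self->storable_filename ) {
            $self->{SetupRuns}++;    # placeholder for the real processing
        }
        return $self;
    }
}

my $dir  = tempdir( CLEANUP => 1 );
my $file = File::Spec->catfile( $dir, '165.storable' );

my $fresh = My::Record->new( -id => 165 );
store( $fresh, $file );
my $thawed = My::Record->new( -storable => $file );

print "fresh set-up runs:  $fresh->{SetupRuns}\n";
print "thawed set-up runs: $thawed->{SetupRuns}\n";
```

The thawed object keeps the count it was stored with, which shows that the
set-up step was not repeated.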
checksum()
my $crc1 = $object1->checksum();
my $crc2 = $object2->checksum();
if ($crc1 == $crc2) {
print "Objects are equivalent\n";
}
else {
print "Checksums do not match: $crc1, $crc2\n";
}
This routine uses the Storable.pm nfreeze() method to create an image
of the current object as a single scalar in memory. A CRC-32 checksum
is calculated across the scalar and returned.
This function is useful when you want to check whether two or more
Bio::Parser objects are identical. This is not currently used but
it will be in the future when the bpr_patch.pl script will be
able to update a SerialDatabase file by comparing the objects in it
against a "patch" file of changed objects.
Assuming you haven't reset the value of $Storable::canonical to
false, all hashes in the nfreeze'd representation should have keys
arranged in a canonical order (alphabetically in this case). This
is very important as it is this property that allows multiple
objects to be compared. Storable's default behaviour
is to store hash values in the order they occur, which is
non-deterministic. In other words, logically identical objects
might be arranged in memory in slightly different ways. Even
these small differences would be enough to cause the objects
to return different CRC32 numbers even though they contain
logically identical information.
Just to reinforce the point, BioParser sets the
$Storable::canonical variable to a true value and you
must not change it if you expect to use the BioParser CRC
routines.
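The effect of the canonical setting can be seen in a minimal sketch. The
crc32_of() routine is a plain bitwise CRC-32 written for this example. It is
a stand-in for, not a copy of, the Digest::CRC-derived routines, and the
names crc32_of() and checksum_of() are hypothetical.

```perl
use strict;
use warnings;
use Storable qw(nfreeze);

# Sort hash keys so that equal data freezes to equal bytes.
$Storable::canonical = 1;

# Bitwise CRC-32 using the reflected polynomial 0xEDB88320.
sub crc32_of {
    my ($bytes) = @_;
    my $crc = 0xFFFFFFFF;
    for my $byte ( unpack 'C*', $bytes ) {
        $crc ^= $byte;
        for ( 1 .. 8 ) {
            $crc = ( $crc & 1 ) ? ( ( $crc >> 1 ) ^ 0xEDB88320 ) : ( $crc >> 1 );
        }
    }
    return $crc ^ 0xFFFFFFFF;
}

# Freeze the structure to a single scalar, then checksum that scalar.
sub checksum_of { return crc32_of( nfreeze( $_[0] ) ) }

# Two hashes with the same content, filled in a different key order.
my ( %first, %second );
$first{$_}  = "value_$_" for qw(NO TI MN);
$second{$_} = "value_$_" for qw(MN NO TI);

printf "first:  %08x\nsecond: %08x\n",
    checksum_of( \%first ), checksum_of( \%second );
```

With $Storable::canonical left true the two checksums agree. Setting it to
false removes that guarantee.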
The CRC32 code used to compare the nfreeze'd objects is adapted from
the Non-XS routines in version 0.09 of Oliver Maul's Digest::CRC module
which is available from CPAN. Digest::CRC has a great many extra
features that we don't need so we just pulled the chunks of code we
needed. Oliver specifically disclaimed all copyright in his module
so we assume he won't mind that we borrowed from it. Of course this means
that we also disclaim all copyright to the code that was borrowed, i.e.
the majority of the contents of the routines _crc_reflect(),
_crc_init() and _crc32().
N.B. We've noticed that the CRCs calculated can change from machine to
machine. This could be due to 32-bit vs 64-bit perl versions but we
haven't tested it extensively yet. Caveat Emptor.
object_info()
All Bio::Parser subclass modules should implement this method so the
copy in Bio::Parser is a stub that should never get called. If it
does get called it writes out a warning identifying the subclass
that appears to not be implementing the method and
then returns basic Bio::Parser object information by calling the
private _object_info() primitive method.
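The stub-plus-primitive arrangement might look something like the sketch
below. All three packages are hypothetical, and the text returned by
_object_info() is invented for the example.

```perl
use strict;
use warnings;

{
    package My::Base;    # hypothetical stand-in for Bio::Parser
    sub new { return bless {}, shift }

    # Primitive that subclasses can reuse for the common part of the report.
    sub _object_info {
        my $self = shift;
        return 'class: ' . ref($self);
    }

    # Stub: complain about the subclass, then fall back to the primitive.
    sub object_info {
        my $self = shift;
        warn ref($self) . " does not implement object_info()\n";
        return $self->_object_info;
    }
}

{
    package My::Lazy;    # forgets to implement object_info()
    our @ISA = ('My::Base');
}

{
    package My::Good;    # implements object_info() on top of the primitive
    our @ISA = ('My::Base');
    sub object_info {
        my $self = shift;
        return $self->_object_info . '; records: 0';
    }
}

# Collect warnings so that we can see which class triggered the stub.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $lazy_info = My::Lazy->new->object_info;
my $good_info = My::Good->new->object_info;

print "$lazy_info\n$good_info\n";
print "warnings: @warnings";
```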
None of these methods needs to be exposed to a user of the BioParser
system but you will want to use some of them if you are a developer
implementing new BioParser modules.
_deep_copy()
This method makes a deep copy of a complex data structure. It takes a
single reference as input. You need to use this to return complex
internal data structures because if you just return the ref then the
user is actually operating on the real copy of the data and could
possibly break/modify it.
This method is borrowed directly from Randal Schwartz's
article at http://www.stonehenge.com/merlyn/UnixReview/col30.html
with minor mods to make it OO.
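A minimal sketch of a recursive deep copy in the same spirit is shown below.
It is written as a plain function rather than a method, handles only array
and hash references, and is not the code from Bio::Parser. The %internal
data is invented.

```perl
use strict;
use warnings;

sub deep_copy {
    my ($this) = @_;
    my $type = ref $this;
    return $this if !$type;    # plain scalars are returned as they are
    return [ map { deep_copy($_) } @$this ] if $type eq 'ARRAY';
    return +{ map { $_ => deep_copy( $this->{$_} ) } keys %$this }
        if $type eq 'HASH';
    die "deep_copy: unsupported reference type '$type'\n";
}

# Invented internal data structure.
my %internal = ( synonyms => [ 'A', 'B' ], xrefs => { omim => 165 } );

# Hand out a copy, then modify the copy only.
my $copy = deep_copy( \%internal );
push @{ $copy->{synonyms} }, 'C';
$copy->{xrefs}{omim} = 0;

print 'original synonyms: ', scalar @{ $internal{synonyms} }, "\n";
print 'copied synonyms:   ', scalar @{ $copy->{synonyms} },   "\n";
```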
_object_info()
This routine is a primitive that subclass object_info() methods can
call to do some of the basic reporting that should be common.
_crc32()
This class method can be used to calculate a CRC-32 checksum for a
single scalar which must be supplied by the caller. It is called by
the checksum() method which is how users are expected to calculate
CRCs.
_crc_init()
_crc_reflect()
These 2 routines should never be called by anyone. They are written
as functions, not OO style methods,
since they'll be called to set the class global $CRC32 when the
Bio::Parser module is first require'd. The CRC calculating stuff
only needs to be initialized once for the entire BioParser system
and once initialized, can be shared by any number of objects.
You may find that you have a favorite database that does not have a
BioParser parser. In this case, you have 2 options: (a) contact the
TGen BioParser team at the email address shown in the AUTHORS section;
(b) roll your own parser. A number of scripts that are used internally
by the BioParser development team are included in this distribution
which should help you if you choose option (b). In this section we'll
describe these scripts and some techniques that the team has found helpful.
Note, we claim no special expertise in Parse::RecDescent and there are
probably better ways to do what we've done. All we can claim
is that we know just enough to make Parse::RecDescent do what we want.
If you are a Parse::RecDescent expert and would be willing to share
your knowledge, we'd love to hear your suggestions.
Let us look at an example that describes our basic development process.
Here is the step-by-step process that we followed in creating parser modules
for the OMIM database.
* Create a directory for the new parser
This directory should be placed under the Bio::Parser hierarchy.
In the case of OMIM it would be Bio::Parser::OMIM.
* Copy modules from an existing parser
At a minimum, you will need the FileParser.pm, RDParser.pm, and Record.pm
modules. If the database is relatively small (under 100MB in text
format), you may be able to go the extra mile and create a serialized
version of the whole database in which case you'll also want to make a
copy of the SerialDatabase.pm module. Now let us create the three modules
under Bio::Parser::OMIM
FileParser Module.
First you need to copy the FileParser.pm module from an existing parser.
This module is used to iterate through a text file one record at a time
(in our case OMIM text file). This module is primarily used in scripts
that have no need to look at a record more than once so simply looping
through all records is acceptable.
When creating a new OMIM::FileParser object the only thing to be
passed in is the name of the text file to be parsed and after that
it's simply a matter of starting a loop and calling next_record() until
the end of the file is reached as shown in the example below.
use Bio::Parser::OMIM::FileParser;
my $om_datafile = '/usr/local/data/omim.txt';
my $parser = Bio::Parser::OMIM::FileParser->new(
-file => $om_datafile );
$parser->object_mode(1);
while (my $omo = $parser->next_record) {
# Process $omo which is a Bio::Parser::OMIM::Record object.
print $omo->no, "\t",
$omo->title, "\n";
}
RDParser Module.
This module takes a text scalar that contains a single text record and
returns an object containing all of the data. It can be used standalone but
is really intended for use by the matching FileParser module which
iterates through a text file one record at a time.
This module implements a Parse::RecDescent grammar for parsing text records.
You can copy an existing RDParser module and modify the grammar so that
it matches the records in your text file. You will have to make a list of
all the available fields for a given text record and then write grammar to
match those fields.
For example, an OMIM record has eleven *FIELD* values such as NO, TI and MN.
We will have to write grammar to match each one of those fields. One way
to parse these fields is via a grammar rule such as:
NO: /^\*FIELD\*/ 'NO' EOL
integer EOL
{ $omim{'NO'} = $item{integer}; }
This rule says that a "NO" field should be two lines - the first line
contains the string "*FIELD* NO" and an end-of-line character. The
second line contains an integer followed by an end-of-line character.
If this pattern is found then the "NO" rule is satisfied and the integer
is saved into the %omim hash under the 'NO' key.
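To make the pattern concrete without needing Parse::RecDescent installed,
here is a minimal sketch that matches the same two-line shape with an
ordinary regular expression. This is a swapped-in technique for illustration,
not the grammar itself, and the record text is invented.

```perl
use strict;
use warnings;

# Invented record fragment, not taken from a real OMIM file.
my $text = "*FIELD* NO\n123456\n*FIELD* TI\nEXAMPLE TITLE\n";

my %omim;

# Line one: the literal '*FIELD*' followed by 'NO'. Line two: an integer.
if ( $text =~ /^\*FIELD\*[ \t]+NO[ \t]*\n(\d+)[ \t]*\n/m ) {
    $omim{NO} = $1;
}

print "NO = $omim{NO}\n";
```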
Record Module.
This module is used by the matching FileParser module as a data container
to store all of the data for a single data record. This module basically
just implements accessor methods. Accessor methods come in a number
of flavours defined by their outputs - some return a single scalar value,
others return an array of values or an arrayref depending on context.
In the case of OMIM we will need eleven accessor methods, one for each
field. An example accessor method for the field NO
would return a scalar as shown below:
sub no {
my $self = shift;
return $self->{'NO'} = shift if @_;
return $self->{'NO'};
}
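For the other flavour of accessor, one that returns a list or an arrayref
depending on context, a minimal sketch using wantarray might look like this.
The class My::Record, the 'aliases' method and the 'AliasList' key are
hypothetical names chosen for the example.

```perl
use strict;
use warnings;

{
    package My::Record;    # hypothetical record class
    sub new {
        my ($class, %args) = @_;
        my $self = {
            NO        => $args{NO},
            AliasList => [ @{ $args{aliases} || [] } ],
        };
        return bless $self, $class;
    }

    # List context: a list of values. Scalar context: an arrayref.
    # Either way the caller receives a copy, not the internal array.
    sub aliases {
        my $self = shift;
        my @copy = @{ $self->{AliasList} };
        return wantarray ? @copy : \@copy;
    }
}

my $rec = My::Record->new( NO => 165, aliases => [ 'alpha', 'beta' ] );

my @as_list = $rec->aliases;    # list context
my $as_ref  = $rec->aliases;    # scalar context

print 'list: ', join( ', ', @as_list ), "\n";
print 'ref:  ', join( ', ', @$as_ref ), "\n";
```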
SerialDatabase Module.
A SerialDatabase is a single perl object that contains all of your
data records as objects plus some methods to access them by ID number
or to traverse through them in the order they occurred in the original
data file.
There are several methods that can be used in this module. For example
the new() method creates a new instance of the SerialDatabase and
the next_record() method traverses through the records
present in your SerialDatabase. As mentioned earlier you could copy an
existing SerialDatabase module and make necessary changes. Using a
SerialDatabase module presumes you can get the whole thing into memory
at once so it only makes sense for small databases or databases where
you only use a subset of records from a larger database. A
SerialDatabase containing OMIM records could be used:
use Bio::Parser::OMIM::SerialDatabase;
my $osdb_datafile = '/usr/local/data/omim_serialdatabase.storable';
my $osdb = Bio::Parser::OMIM::SerialDatabase->new(
'-storable' => $osdb_datafile );
while (my $omo = $osdb->next_record) {
print $omo->no, "\n";
}
Assuming BioParser contains a SerialDatabase module for a given
database, the two scripts bpr_serialize.pl and
bpr_create_serial_database.pl can be used together to create a
SerialDatabase file.
If you are considering developing a parser for a new database, we'd
suggest OMIM as a good candidate for copying as it's relatively simple
but has all of the required functionality. One day we'll get around to
creating a set of skeleton modules that can be used as templates rather
than copying from OMIM.
* Download the text version of the target database
- John Pearson, bioinfresearch@tgen.org
- Deepthi Chidambaram
- Vidyadhari Edupugani
- Srilaskmi Ganta
- Vijaylakshmi Shanmugam
$Id: Parser.pm,v 1.36 2007/08/16 06:50:13 jpearson Exp $
BioParser is copyright 2005 by The Translational Genomics Research
Institute. All rights reserved. This License is limited to, and you
may use the Software solely for, your own internal and non-commercial
use for academic and research purposes. Without limiting the foregoing,
you may not use the Software as part of, or in any way in connection
with the production, marketing, sale or support of any commercial
product or service or for any governmental purposes. For commercial or
governmental use, please contact licensing@tgen.org. By installing this
Software you are agreeing to the terms of the LICENSE file distributed
with this software.
In any work or product derived from the use of this Software, proper
attribution of the authors as the source of the software or data must be
made. The following URL should be cited:
http://bioinformatics.tgen.org/software/bioparser/