package Bio::Parser::Homologene::FileParser;

###########################################################################
#
#  Module:   Bio::Parser::Homologene::FileParser.pm
#  Author:   Carol Barner
#  Created:  2005-06-15
#
#  This parser is designed to iterate through NCBI's Homologene 
#  text-format data file, homologene.data, found at:
#
#    ftp://ftp.ncbi.nih.gov/pub/HomoloGene/
#
#  It returns individual records as either Bio::Parser::Homologene::Record 
#  objects or as text scalars.
#
#  $Id: FileParser.pm,v 1.7 2007/08/16 06:41:18 jpearson Exp $
#
###########################################################################

use strict;
use Bio::Parser::FileParser;
use vars qw($VERSION @ISA);

( $VERSION ) = '$Revision: 1.7 $ ' =~ /\$Revision:\s+([^\s]+)/;

@ISA = qw( Bio::Parser::FileParser );  # Inherit from parent class


###########################################################################
#                             PUBLIC METHODS                              #
###########################################################################


sub new {
    my $invocant = shift;
    my %params = @_;
    my $self = $invocant->SUPER::new( %params );
    $self->this_id( undef );
    $self->next_id( undef );
    $self->{'first_line'} = undef;
    return $self;
}


sub next_record {
    my $self = shift;

    my @lineary = ();          # fields of the line being examined
    my $newHID = 0;            
    my $oldHID = 0;
    my $record = '';
    my $fh = ${ $self->{'filehandle'} };

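    # A line saved by the previous call starts this record.  Consume it
    # here so that a single-line final group is not lost at end of file
    # and the saved line is never used twice.
    if ( $self->{'first_line'} ) {
        $record = $self->{'first_line'}."\n";
        @lineary = split("\t", $self->{'first_line'});  # split line to get the group ID
        $oldHID = $lineary[0];
        $newHID = $self->next_id;
        $self->{'first_line'} = undef;
    }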

    # loop through additional lines in group
    while ( my $line = $fh->getline ){
        $line =~ s/\s+$//;
        @lineary = split("\t", $line); 
        $newHID = $lineary[0];  # peek ahead at the HID for this line 
        $oldHID = $newHID unless $oldHID; # if old is zero, first record in file
        $self->_incr_line_count;

        # look to see if we have found a new group
        if ( $oldHID ne $newHID ) {
            $self->{'first_line'} = $line;
            $self->next_id( $newHID );
            last if $record;
            next;
        }

        # add additional line to current group record
        $record .= $line."\n";
    }

    return undef unless $record;  # failure or past last record
    
    $self->record_text( $record );
    $self->_incr_record_count;

    # Check whether a Record object or text scalar should be returned
    return $self->_record;
}


sub next_id {
    my $self = shift;
    return $self->{'next_id'} = shift if @_;
    return $self->{'next_id'};
}

1;

__END__

=head1 NAME

Bio::Parser::Homologene::FileParser - Perl extension for parsing Homologene files

=head1 SYNOPSIS

  use Bio::Parser::Homologene::FileParser;

  my $hgf = '/usr/local/data/homologene.data';
  my $parser = Bio::Parser::Homologene::FileParser->new( -file => $hgf );
  $parser->object_mode(1);  # return objects, not text

  while (my $hrec = $parser->next_record) {
      print "HID     ", $hrec->HID, "\n";
      print join( "\t", $_->{taxID},
                        $_->{geneID},
                        $_->{symbol},
                        $_->{prot_gi},
                        $_->{accessn} ), "\n"
            foreach @{ $hrec->members };
  }


=head1 DESCRIPTION

This module iterates through a Homologene text file one record at a time.
Its primary purpose is use in scripts that have no need to look at a record
more than once. Programs that need to keep some or all of the Homologene
records in memory can use this module in the reading-in phase, but it is
up to the user to decide how to store the records returned.

When creating a new Homologene FileParser object, the only thing to pass
in is the name of the text file to be parsed (the C<-file> parameter).
After that, it is simply a matter of calling C<next_record()> in a loop
until the end of the file is reached, as shown in the example above.

As of 2007-08-15, the definitive text homologene.data file is available
via ftp from NCBI at: ftp://ftp.ncbi.nih.gov/pub/HomoloGene/
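
For programs that do need to keep records in memory, one possible approach
is to index the returned objects by group ID. This is only a sketch:
C<%by_hid> is a name chosen for illustration, the path is the same
placeholder used in the SYNOPSIS, and the C<HID> accessor is assumed to
behave as shown there.

  use strict;
  use Bio::Parser::Homologene::FileParser;

  # Placeholder path, as in the SYNOPSIS
  my $datafile = '/usr/local/data/homologene.data';
  my $parser   = Bio::Parser::Homologene::FileParser->new( -file => $datafile );
  $parser->object_mode(1);  # return objects, not text

  # Hypothetical in-memory index: group ID => record object
  my %by_hid;
  while ( my $hrec = $parser->next_record ) {
      $by_hid{ $hrec->HID } = $hrec;
  }

  print scalar( keys %by_hid ), " groups held in memory\n";

Holding every record at once trades memory for repeated access, so a
script that only needs a subset may prefer to test each record inside the
loop and store just the ones it wants.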

=head1 PUBLIC METHODS

=over 2

=item B<new()>

  my $datafile = '/usr/local/data/homologene.data';
  my $fp = Bio::Parser::Homologene::FileParser->new( -file => $datafile );

Creates a new instance of the FileParser.
Must be passed the name of a file that contains one or more
Homologene records in the same format as the homologene.data text file
distributed on the NCBI FTP site.

=item B<next_record()>

Gathers all consecutive lines that share the same Homologene group ID
(HID, the first tab-delimited field of each line) into a single record.
Depending on C<object_mode()>, the record is passed back either as a text
scalar containing all of the lines in the group or as a
L<Bio::Parser::Homologene::Record> object.

Returns undef when no records remain.


=item B<next_id()>

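With no argument, returns the HID of the group that follows the record
most recently returned by C<next_record()>. The value is recorded when
C<next_record()> reads the first line of that following group. With an
argument, sets the stored value.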


=back
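
The grouping that C<next_record()> performs can be illustrated with a
standalone sketch that does not use this module at all. It is not the
module's implementation, only a minimal version of the same idea: read
lines in order and start a new record whenever the first tab-delimited
field changes. The sample rows and the names C<@groups> and
C<$current_hid> are invented for illustration.

  use strict;
  use warnings;

  # Made-up sample rows; only the first column (the group ID) matters here.
  my $data = join( "\n",
      "1\t9606\t1001\tGENEA",
      "1\t10090\t1002\tGenea",
      "2\t9606\t2001\tGENEB",
      "3\t9606\t3001\tGENEC",
      "3\t10090\t3002\tGenec",
  ) . "\n";

  # Read the string through an in-memory filehandle
  open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

  my @groups;           # one text scalar per group
  my $current_hid;      # group ID of the record being collected
  my $record = '';

  while ( my $line = <$fh> ) {
      $line =~ s/\s+$//;                  # strip trailing whitespace
      my ($hid) = split /\t/, $line;      # first field is the group ID
      if ( defined $current_hid && $hid ne $current_hid ) {
          push @groups, $record;          # boundary: emit finished group
          $record = '';
      }
      $current_hid = $hid;
      $record .= "$line\n";
  }
  push @groups, $record if length $record;   # emit the final group

  printf "%d groups\n", scalar @groups;      # prints "3 groups"

Note that the final group is emitted after the loop ends. A single-line
group at the end of the file is the case most easily dropped by this kind
of look-ahead parser.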

=head2 Inherited Methods

The following methods are inherited from the L<Bio::Parser::FileParser>
parent class.  You should look at the documentation for the parent class
to see what these methods do.

=over 2

=item B<object_mode()>

=item B<parser()>

=item B<record_count()>

=item B<line_count()>

=item B<record_text()>

=item B<record_object()>

=back


=head1 SEE ALSO

=over 2

=item L<Bio::Parser>

=item L<Bio::Parser::FileParser>

=item L<Bio::Parser::Homologene::Record>

=item L<Bio::Parser::Homologene::RDParser>

=item L<Bio::Parser::Homologene::SerialDatabase>

=back



=head1 AUTHORS

=over 2

=item Carol Barner

=item John Pearson, L<mailto:bioinfresearch@tgen.org>

=back


=head1 VERSION

$Id: FileParser.pm,v 1.7 2007/08/16 06:41:18 jpearson Exp $


=head1 COPYRIGHT

BioParser is copyright 2005 by The Translational Genomics Research
Institute.  All rights reserved.  This License is limited to, and you
may use the Software solely for, your own internal and non-commercial
use for academic and research purposes. Without limiting the foregoing,
you may not use the Software as part of, or in any way in connection
with the production, marketing, sale or support of any commercial 
product or service or for any governmental purposes.  For commercial or
governmental use, please contact licensing@tgen.org.  By installing this 
Software you are agreeing to the terms of the LICENSE file distributed 
with this software.

In any work or product derived from the use of this Software, proper 
attribution of the authors as the source of the software or data must be 
made.  The following URL should be cited:

L<http://bioinformatics.tgen.org/software/bioparser/>
  
=cut