Bio::TGen::EUtils    v1.22

^ NAME

Bio::TGen::EUtils - Access NCBI's E-utilities CGI interface

^ SYNOPSIS

  use Bio::TGen::EUtils;
  my $eu = Bio::TGen::EUtils->new( 'tool'  => 'my_cool_tool.pl',
                                   'email' => 'me@myplace.org' );
  my $response = $eu->esearch(
                   'db'         => 'gene',
                   'term'       => 'gfap[gene] AND human[orgn]',
                   'usehistory' => 'y' );
  print $response->id_report();

^ ABSTRACT

TGen-EUtils is a collection of Perl modules and scripts that provide an object-oriented interface to NCBI's Entrez Programming Utilities (E-utilities), a set of web-based CGI programs that provide a remote programming interface to the Entrez system.

Entrez is a framework implemented on top of the NCBI source databases, and individual source databases (such as Nucleotide and Gene) retain their own design and implementation. Consequently, some aspects of Entrez are common to all databases and other aspects are specific to individual databases. Each Entrez source database has a different organizing principle and contains different types of information, so each database is indexed and searched using a series of database-specific terms.

TGen-EUtils is intended to simplify access to NCBI's E-utilities and is structured so that users interact with a single instance of the Bio::TGen::EUtils module which acts as a factory for making all subsequent calls against the NCBI E-utilities interface. The user is isolated from any requirement to create URLs or manage the interaction with the NCBI CGI interface.

Development of the TGen-EUtils package was inspired by the NCBI Power Scripting course taught at NCBI by David Wheeler and Eric Sayers:

http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/course.html

^ DESCRIPTION

In practical terms NCBI's E-utilities system is implemented as 8 user-accessible CGI programs and the programs are driven by supplying them with carefully constructed URLs that contain specific parameters and values. TGen-EUtils provides a Bio::TGen::EUtils Request method corresponding to each of the NCBI CGI programs. Each method takes as input a hash of input parameters that correspond to the URL parameters used to drive the equivalent NCBI E-utilities program. When one of the Bio::TGen::EUtils methods is called, it makes a call against the NCBI E-utilities server and the server's response is placed into a Bio::TGen::EUtils::Response object that is returned to the user.

Response objects come in a number of subclasses - one to match each of the Request methods in Bio::TGen::EUtils. Response objects all inherit from Bio::TGen::EUtils::Response but have specific routines for handling their own particular forms of output, including XML and text. In effect, there is a Response module to match each XML DTD in NCBI's Entrez query system:

http://www.ncbi.nlm.nih.gov/entrez/query/DTD/

Usage Example 1

The easiest way to demonstrate how TGen-EUtils works is to show an example. The following URL (split into 2 lines for display purposes) is the one that would be used with the NCBI E-utilities to search for the ID of the human CDKN2A gene:

 http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene \
 &term=CDKN2A[gene]+AND+human[orgn]

To actually use this URL, a significant amount of code would be required: the Perl LWP library would have to be used to submit the URL to the NCBI server and get a reply, and the XML reply would have to be parsed to extract the id(s), if any, returned by the query. In contrast, the complete code for the equivalent TGen-EUtils program is:

 use Bio::TGen::EUtils;
 my $eu = Bio::TGen::EUtils->new( tool  => 'test_program.pl',
                                  email => 'myname@myplace.org');
 my $esearch = $eu->esearch( db   => 'gene',
                             term => 'CDKN2A[gene] AND human[orgn]' );
 my $ra_ids = $esearch->ids();

The user creates a Bio::TGen::EUtils object, calls the esearch() method with a few parameters and uses the ids() method to extract the IDs from the response. The user is completely insulated from the creation of the URL and any of the details of the communication with the NCBI E-utilities server.
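To give a feel for the parsing work that ids() hides, here is a minimal sketch that pulls the IDs out of an ESearch-style XML reply with a regular expression. The sample reply is trimmed by hand for illustration, and extract_ids() is a hypothetical helper, not part of TGen-EUtils; the module itself may well parse the XML differently.

```perl
use strict;
use warnings;

# Illustrative ESearch reply, trimmed by hand; a real reply carries more elements.
my $sample_xml = <<'XML';
<eSearchResult>
  <Count>1</Count>
  <RetMax>1</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>1029</Id>
  </IdList>
</eSearchResult>
XML

# Hypothetical helper (not a TGen-EUtils method): return a reference
# to the list of numeric <Id> values found in the reply.
sub extract_ids {
    my ($xml) = @_;
    my @ids = $xml =~ m{<Id>\s*(\d+)\s*</Id>}g;
    return \@ids;
}

my $ra_ids = extract_ids($sample_xml);
print join(',', @{$ra_ids}), "\n";
```

A regular expression is enough for a flat list of IDs like this one; anything more structured would be better served by a real XML parser.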

If we wanted to expand this example to retrieve the full record for the gene ID(s) identified by our search then we would need to craft a second URL to interact with a different E-utilities program - efetch.fcgi:

 http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene \
 &id=1029&retmode=xml

Note that to use the URL we need to extract the gene ID (1029) returned by the esearch URL and supply it as a parameter to the efetch URL. The revised TGen-EUtils program is shown below - the only additions are the extra usehistory => 'y' parameter to esearch() and the call to efetch(), which extracts all of its required inputs from the output of esearch():

 use Bio::TGen::EUtils;
 my $eu = Bio::TGen::EUtils->new( tool  => 'test_program.pl',
                                  email => 'myname@myplace.org');
 my $esearch = $eu->esearch( db         => 'gene',
                             term       => 'CDKN2A[gene] AND human[orgn]',
                             usehistory => 'y' );
 my $efetch = $eu->efetch( from_hist => $esearch );
 print $efetch->raw();

In addition to simplified code, TGen-EUtils provides extensive error checking of all required and optional input parameters; use of smart defaulting to minimize the number of parameters that users have to understand; error checking of all communication with the NCBI E-utilities server; and a verbose mode where diagnostics are printed to track the progress of, and help debug, TGen-EUtils applications.

E-utilities and TGen-EUtils equivalences

Before attempting to describe the various TGen-EUtils methods that can be used to interact with the Entrez system, we should detail the underlying NCBI E-utilities CGI programs that those methods will use. The following table lists the current NCBI E-utilities services and the primary function each one provides:

 einfo.fcgi        provide information and stats about an Entrez db
 egquery.fcgi      like ESearch but searches all 31 Entrez dbs at once
 esearch.fcgi      search an Entrez db for matching records
 elink.fcgi        find records in other dbs linked to a given record
 esummary.fcgi     retrieve summaries of db record(s)
 efetch.fcgi       retrieve full database record(s)
 epost.fcgi        upload a list of database records IDs to NCBI
 espell.fcgi       retrieve spelling options for given terms

The correspondence between NCBI E-utilities programs, Bio::TGen::EUtils methods and Bio::TGen::EUtils::Response objects is summarized here:

 NCBI E-utilities  TGen-EUtils  Bio::TGen::EUtils::Response
 CGI program       method       subclass
 --------------------------------------------------------------------
 einfo.fcgi        einfo()      Bio::TGen::EUtils::Response::EInfo
 egquery.fcgi      egquery()    Bio::TGen::EUtils::Response::EGQuery
 esearch.fcgi      esearch()    Bio::TGen::EUtils::Response::ESearch
 elink.fcgi        elink()      Bio::TGen::EUtils::Response::ELink
 esummary.fcgi     esummary()   Bio::TGen::EUtils::Response::ESummary
 efetch.fcgi       efetch()     Bio::TGen::EUtils::Response::EFetch
 epost.fcgi        epost()      Bio::TGen::EUtils::Response::EPost
 espell.fcgi       espell()     Bio::TGen::EUtils::Response::ESpell

As can be seen, there is a clear one-to-one correspondence between the NCBI CGI programs and TGen-EUtils methods. This is important because the same parameters that would be placed in a URL to drive the CGI programs are passed in as an input hash to the equivalent TGen-EUtils method. A summary table and a more detailed discussion of each of these parameters can be found below in the Request Method Parameters section.
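The mapping from a parameter hash to a request URL can be sketched in a few lines of Perl. This is an illustration only: build_url() is a hypothetical helper, not the module's code, and it handles just the space-to-'+' substitution rather than full URL escaping.

```perl
use strict;
use warnings;

# Server address quoted in this documentation as the default url_base.
my $URL_BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Hypothetical helper: turn a CGI program name and a parameter hash
# into a request URL. Keys are sorted so the output is reproducible.
sub build_url {
    my ($program, %params) = @_;
    my @pairs;
    foreach my $name (sort keys %params) {
        my $value = $params{$name};
        $value =~ s/ /+/g;    # spaces become '+' in the query string
        push @pairs, "$name=$value";
    }
    return $URL_BASE . $program . '?' . join('&', @pairs);
}

print build_url('esearch.fcgi',
                db   => 'gene',
                term => 'CDKN2A[gene] AND human[orgn]'), "\n";
```

For the parameters shown, the sketch reproduces the esearch URL from Usage Example 1.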

Usage Example 2

Much of the power of the E-utilities system comes from stringing together multiple queries in a pipeline. This example shows retrieval of an XML document showing the full details of up to 500 SNPs that NCBI has linked to the human TP53 gene. The script uses a pipeline of 3 E-utilities: esearch -> elink -> efetch.

 use Bio::TGen::EUtils;
 my $eu = Bio::TGen::EUtils->new(
                    tool       => 'gene_2_snp.pl',
                    email      => 'myname@myplace.org');
 my $esearch = $eu->esearch(
                    db         => 'gene',
                    term       => 'TP53[gene] AND human[orgn]',
                    usehistory => 'y' );
 my $elink = $eu->elink(
                    dbfrom     => 'gene',
                    db         => 'snp',
                    usehistory => 'y',
                    from_hist  => $esearch );
 my $efetch = $eu->efetch( from_hist => $elink );
 $efetch->write_raw( file => 'TP53_snps.xml' );

The user creates a single instance of Bio::TGen::EUtils using the new() method and then makes all subsequent NCBI queries through this factory object using the various request methods - esearch(), elink(), and efetch() in this example. There is a Bio::TGen::EUtils request method to match each of the 8 NCBI E-utilities programs.

Behind the scenes, each time one of the "e" methods is called, the Bio::TGen::EUtils object communicates with the NCBI Entrez system and the response from NCBI is used to generate an object that is a subclass of Bio::TGen::EUtils::Response. Response objects come in a number of flavours - one to match each of the 8 request methods in Bio::TGen::EUtils. They all inherit from Bio::TGen::EUtils::Response but have specific routines for handling their own particular forms of output including XML, text etc.

Usage Example 3

The following table shows all of the Entrez databases (as at 28th January 2006), the number of records in each database, and the name by which the database appears in the menu on the NCBI Entrez website.

  Database                 Records   Database menu name
  -----------------------------------------------------
  books                     137758   Books
  cancerchromosomes          50380   CancerChromosomes
  cdd                        11530   Conserved Domains
  domains                   150266   3D Domains
  gds                         5971   GEO DataSets
  gene                     1686173   Gene
  genome                      5015   Genome
  genomeprj                   1986   Genome Project
  gensat                     46801   GENSAT
  geo                     15264560   GEO Profiles
  homologene                 84326   HomoloGene
  journals                   20781   Journals
  mesh                      181651   MeSH
  ncbisearch                  4069   NCBI Web Site
  nlmcatalog               1234393   NLM Catalog
  nucleotide              67518060   Nucleotide
  omia                        2484   OMIA
  omim                       17224   OMIM
  pcassay                      186   PubChem BioAssay
  pccompound               5311600   PubChem Compound
  pcsubstance              8028104   PubChem Substance
  pmc                       538358   PMC
  popset                     46316   PopSet
  probe                    3395818   Probe
  protein                  8618109   Protein
  pubmed                  16056582   PubMed
  snp                     26430220   SNP
  structure                  34421   Structure
  taxonomy                  293619   Taxonomy
  unigene                  1858703   UniGene
  unists                    476711   UniSTS

The table above was generated using the db_info.pl script distributed as part of the TGen-EUtils package. The processing portion of that script is reproduced here:

   use Bio::TGen::EUtils;
   my $eu = Bio::TGen::EUtils->new( 
                    tool  => 'db_info.pl',
                    email => 'myname@myplace.org');
   my $response = $eu->egquery( 'term' => 'all[filter]' );
   my %results = $response->process_egquery;
   foreach my $db (sort keys %results) {
       printf "%-17s   %12d   %-s\n",
              $db,
              $results{$db}->{'count'},
              $results{$db}->{'menudb'};
   }

The user creates a single instance of Bio::TGen::EUtils using the new() method and then makes all subsequent NCBI queries through this factory object using the various request methods - egquery() in this example. The script is equivalent to supplying the string all[filter] as the term parameter to the NCBI E-utilities egquery.fcgi program. To verify this, try copying the following URL into a web browser:

  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=all[filter]

The output from this request is XML so the display you see in your browser is unstructured but viewing the page source will show the underlying XML.

^ PUBLIC METHODS

There are 3 categories of methods in EUtils. The first is Constructor Methods of which there is only 1 - new(). The second category is Accessor Methods that can be used to change the values of attributes that control how the EUtils module works. The third category is the Request Methods that make calls against the NCBI CGI E-utilities.

The Bio::TGen::EUtils::Response objects returned by calls to the Request Methods have an additional category of Report methods that process the XML or text results returned by the CGI programs.

Constructor Methods

new()

  my $eutil = Bio::TGen::EUtils->new(
                 tool     => 'my_cool_tool.pl',
                 email    => 'me@myplace.org' );
  
  my $eutil = Bio::TGen::EUtils->new(
                 tool     => 'my_cool_tool.pl',
                 email    => 'me@myplace.org',
                 url_base => 'http://ncbi.nih.gov/entrez/',
                 retmax   => 50,
                 delay    => 4 );

There are a number of options that can be supplied to the new() method. It is compulsory to supply values for tool and email - NCBI does not enforce this rule but the TGen-EUtils modules do. See the tool and email sections below for more details. The full list of new() options is:

tool

email

tool and email are both compulsory and the new() method will die unless values are supplied for both. tool should contain some string that identifies the application that is making the E-utilities request and email should be the email address of the user or programmer who created the application. In the event of a problem with an E-utilities request, NCBI staff may use the contents of these 2 fields to try to contact the user. If these fields are not specified then NCBI staff may be left with no option but to block IP addresses of machines that are generating badly behaved E-utilities requests. NCBI blocks by IP address so there is no point in not being truthful in the values you give for tool and email - you can't hide.

url_base

This parameter is optional and should not be set in most instances as the subclass Bio::TGen::EUtils::Request defaults it to the address of the current NCBI E-utilities server: 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'. If NCBI ever changes the server address, users could use this option to point to the new server until a revised version of TGen-EUtils was released.

retmax

The subclass Bio::TGen::EUtils::Request defaults this value to 500 unless it is explicitly set here. This value is used by esummary() and efetch() to limit the number of records retrieved when a query returns a large number of matches. This helps spread the load on the NCBI servers by requiring users to break large retrieval tasks into a number of smaller retrievals. In practice, 500 will almost always be a reasonable value, but users should be aware that esummary() and efetch() queries will be truncated at 500 results by default. esearch() also uses retmax, although its behaviour is to always report how many IDs it would return and then only return the first 20 unless retmax is set to a higher value (maximum 10,000) or batch mode is used. For a discussion of how to get more than 500 records using batch mode, see the batch parameter in the Request Method Parameters section below.
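The idea of breaking a large retrieval into smaller ones can be sketched with the retstart and retmax parameters. The helper below, batch_windows(), is hypothetical and only computes the (retstart, retmax) pairs needed to cover a result set; it is not a description of how the batch parameter is implemented inside TGen-EUtils.

```perl
use strict;
use warnings;

# Hypothetical helper: given the total number of matches and a retmax,
# return the list of retstart/retmax windows that cover every record.
sub batch_windows {
    my ($count, $retmax) = @_;
    die "retmax must be a positive integer\n" unless $retmax > 0;
    my @windows;
    for (my $retstart = 0; $retstart < $count; $retstart += $retmax) {
        my $remaining = $count - $retstart;
        my $size      = $remaining < $retmax ? $remaining : $retmax;
        push @windows, { retstart => $retstart, retmax => $size };
    }
    return @windows;
}

# 1234 matches at 500 per request need three requests.
foreach my $w (batch_windows(1234, 500)) {
    printf "retstart=%d retmax=%d\n", $w->{retstart}, $w->{retmax};
}
```

Each window would correspond to one request against the server, with the usual delay between requests.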

delay

The NCBI usage guidelines state that there should be at least 3 seconds between E-utilities requests, and the default set by TGen-EUtils is 3 seconds. A user can override the default by passing in a parameter to new() or by calling the delay() method at any time; however, setting a value less than 3 will result in a warning message being printed. The delay is calculated from the time the last request was made, not from when it returned results, so if the results took more than 3 seconds to be returned, no delay is required before the next request is made. Users do not need to worry about enforcing the delay, as TGen-EUtils contains internal logic to do so. The user's only concern is what value to set for delay. You can set the delay to less than 3 seconds, but that's just asking to have your IP address blocked by NCBI, so think carefully.
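The delay rule described above can be expressed as a small calculation. The following sketch uses a hypothetical function, seconds_to_wait(), and is only an illustration of the rule; the module's internal logic may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: how long to pause before the next request.
# $last_request is the epoch time at which the previous request was
# made (undef if there has been none), $now is the current epoch time.
sub seconds_to_wait {
    my ($last_request, $now, $delay) = @_;
    return 0 unless defined $last_request;    # first request: no wait
    my $elapsed = $now - $last_request;
    return $elapsed >= $delay ? 0 : $delay - $elapsed;
}

print seconds_to_wait(undef, 1141193138, 3), "\n";   # no previous request
print seconds_to_wait(100,   101,        3), "\n";   # 1 second elapsed
print seconds_to_wait(100,   105,        3), "\n";   # 5 seconds elapsed
```

A caller would pass the result to sleep() immediately before issuing the request, for example sleep seconds_to_wait($last, time, 3).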

NCBI also has restrictions on the times of the day/week when it is acceptable to make large numbers of queries against the E-utilities system. TGen-EUtils users should read the full text of the NCBI usage guidelines at:

http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html

verbose

The verbose parameter can be any positive integer and the higher the number, the more detailed the diagnostic messages. The default level is 0. At level 1, all Request Methods will print out a string identifying the Request type, a timestamp, all of the parameters that will be used to construct the URL (including those values that have been defaulted), plus the actual URL that will be passed to the NCBI E-utilities server. At level 2, the Request module will print a message showing the values used to compute and enforce the specified between-query delay. The example below shows the output from an efetch() request with verbose set to 2.

 EFETCH:  [Tue Feb 28 23:05:38 2006]
   Parameters:
     db          nucleotide
     email       jpearson@Translational_Genomics_Research_Institute
     id          62988321
     retmax      20
     retmode     text
     rettype     fasta
     seq_start   1
     seq_stop    138
     tool        TGen-EUtils-efetch_examples
   Enforce delay: (now:1141193138) (last:) (delay:0)
   Request->get(http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
   ?seq_stop=138&retmode=text&db=nucleotide&retmax=20&email=jpearson@Tra
   nslational_Genomics_Research_Institute&rettype=fasta&tool=TGen-EUtils
   -efetch_examples&seq_start=1&id=62988321)

Accessor Methods

In general, an accessor method exists corresponding to each of the options that can be specified in the new() method with the exception of the tool and email parameters which should in most cases only be set once during a TGen-EUtils session. If a user does wish to change the email and tool values, tool and email parameters can be passed to individual Request Methods.

request()

Returns the Bio::TGen::EUtils::Request object that interacts with the NCBI E-utilities server. A user should never need to directly touch this object and this method is only provided to assist in debugging failed queries.

url_base()

Get/set the address of the current NCBI E-utilities server.

delay()

Get/set the delay in seconds between consecutive TGen-EUtils calls against the NCBI E-utilities server.

retmax()

Get/set the maximum number of records returned in a single efetch(), esearch() or esummary() Request.

Request Methods

The 8 Request Methods directly map to the 8 NCBI E-utilities programs. Each method returns an object that belongs to one of the subclasses of Bio::TGen::EUtils::Response. It is very important that the user consult the documentation for the specific subclass as it is likely to be more extensive than the summary information provided here. The subclass documentation also details any subclass-specific methods for accessing the information in the Response.

einfo()

  $response = $eu->einfo( db => 'snp' );

Required: db, (email, tool)

Returns: Bio::TGen::EUtils::Response::EInfo object

NCBI docs: http://www.ncbi.nlm.nih.gov/entrez/query/static/einfo_help.html

Returns information about the specified Entrez database including the number of records, which fields have been indexed and so are available as filters, plus what links to other databases exist.

egquery()

  $response = $eu->egquery( term => 'CDKN2A' );
  $response = $eu->egquery( term => 'all[filter]' );

Required: term, (email, tool)

Returns: Bio::TGen::EUtils::Response::EGQuery object

NCBI docs: http://www.ncbi.nlm.nih.gov/entrez/query/static/egquery_help.html

Performs EGQuery which is effectively a global esearch() that searches all Entrez databases at once. egquery() will search all of the Entrez databases with the given search string and will return a list of databases and how many matching records were found within the database. It can also be used with the 'all[filter]' search string to return the current record totals for all of the Entrez databases as shown in the example above.

esearch()

  $response = $eu->esearch( db         => 'gene',
                            term       => 'CDKN2A[gene] AND human[orgn]',
                            usehistory => 'y' );
  $response = $eu->esearch( db      => 'pubmed',
                            term    => 'asthma[mh] OR hay fever[mh]',
                            reldate => 50 );
  $response = $eu->esearch( db       => 'snp',
                            term     => '',
                            datetype => 'pdat',
                            mindate  => '1999/01/01',
                            maxdate  => '1999/12/31' );

Required: db, term, (email, tool)

Optional: usehistory, retstart, retmax, WebEnv

Returns: Bio::TGen::EUtils::Response::ESearch object

NCBI docs: http://www.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html

Searches a single specified Entrez database for a term or id. This method call is almost always the first call in any pipeline of TGen-EUtils calls. The only exception is where a known ID list is uploaded to the NCBI History Server using epost().

esummary()

  $response = $eu->esummary( db => 'gene',
                             id => '1234' );
  $response = $eu->esummary( db => 'gene',
                             id => '1234,242,11552823' );
  $response = $eu->esummary( retmax    => 1000,
                             from_hist => $esearch );
  
  $response = $eu->esummary( batch     => 1,
                             from_hist => $epost );

Required: db, id | WebEnv+query_key, (email, tool)

Optional: retstart, retmax

Returns: Bio::TGen::EUtils::Response::ESummary object

NCBI docs: http://www.ncbi.nlm.nih.gov/entrez/query/static/esummary_help.html

Retrieves document summaries (DocSums) for each of the ids listed or from an id list on the History Server. A DocSum is much smaller than the whole record that would be retrieved by efetch(). In many cases, the small number of attributes returned in the DocSum is sufficient and the whole record does not need to be retrieved. DocSum requests also tend to be faster because the central Entrez engine provides searching based on the database indexes and retrieval of DocSums for every record from every source database, while retrieval of full data records (via efetch()) is delegated to the source databases.

efetch()

  $response = $eu->efetch( db => 'nucleotide',
                           id => '62988321' );  
  $response = $eu->efetch( from_hist => $esearch );
  $response = $eu->efetch( from_hist => $esearch,
                           batch     => 1 );
  $response = $eu->efetch( db      => 'snp',
                           retmode => 'text',
                           rettype => 'chr',
                           id      => '3180061' );
  $response = $eu->efetch( db        => 'nucleotide',
                           retmode   => 'text',
                           rettype   => 'fasta',
                           seq_start => 1,
                           seq_stop  => 138,
                           id        => '62988321' );

Required: db, id | WebEnv+query_key, (email, tool)

Optional: retstart, retmax, retmode, rettype, seq_start, seq_stop, strand, complexity

Returns: Bio::TGen::EUtils::Response::EFetch object

NCBI docs: http://www.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html

Fetches the full data record for the specified id(s). If you want to retrieve a large number of records then it is suggested that you use epost() to push the ID list onto the NCBI History Server and then do a batch-mode efetch() by specifying the batch parameter. EFetch is the only one of the NCBI E-utilities that does not always return XML Responses. Because it is used to return records from a variety of databases, each database may come with data-specific output formats (e.g. FASTA for nucleotide) that must be supported by EFetch. The retmode and rettype parameters are used in combination to select from these alternate record output formats. A list of the valid retmode/rettype combinations can be found in the NCBI EFetch documentation linked above.

EFetch is not implemented by the central Entrez engine and must be implemented by each Entrez source database, so efetch() is currently only supported in the following databases: PubMed, PubMed Central, Journals, Nucleotide, Protein, Genome, Gene, SNP, PopSet, and Taxonomy.

elink()

  $response = $eu->elink( dbfrom => 'snp',
                          db     => 'gene',
                          id     => [242, 1234, 11552823] );
  
  $response = $eu->elink( db        => 'gene',
                          from_hist => $esearch );

Required: db, dbfrom, id | WebEnv+query_key, (email, tool)

Optional: term

Returns: Bio::TGen::EUtils::Response::ELink object

NCBI docs: http://www.ncbi.nlm.nih.gov/entrez/query/static/elink_help.html

The id parameter can be a string of comma-separated ids or it can be a reference to an array of ids. In practice, these 2 are quite different. The string approach produces a URL of the form ...&id=1,2,3,4&... whereas the array approach produces a URL of the form ...&id=1&id=2&id=3&id=4&.... These 2 URL forms result in different XML reports. For the first form, a single composite report of linked IDs is returned but it is not possible to work out which of the original IDs each linked ID relates to. Using the second URL form, a separate small report is produced for each of the original IDs so it is possible to work out the relationships between original and linked IDs.
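The difference between the two forms of the id parameter is easy to see in code. This is a minimal sketch with a hypothetical helper, id_query_fragment(), that builds only the id portion of the query string; it is not taken from the TGen-EUtils source.

```perl
use strict;
use warnings;

# Hypothetical helper: a plain string is passed through as a single
# id= value, while an array reference yields one id= pair per element.
sub id_query_fragment {
    my ($id) = @_;
    my @values = ref($id) eq 'ARRAY' ? @{$id} : ($id);
    return join '&', map { "id=$_" } @values;
}

print id_query_fragment('1,2,3,4'), "\n";       # id=1,2,3,4
print id_query_fragment([1, 2, 3, 4]), "\n";    # id=1&id=2&id=3&id=4
```

The first form asks for one composite report; the second asks for a separate report per input ID.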

Using the WebEnv/query_key approach always produces the composite report form, so it is sometimes desirable to retrieve the IDs and submit them directly with elink() rather than to use the History Server.

The results from an ELink call can be placed onto the History Server by specifying the usehistory parameter, and that ELink Response object can be passed to another ELink call using the from_hist parameter. In this way multiple ELink calls can be chained together; however, as noted above, this method will not show the user the individual links from the input IDs to the output IDs.

epost()

  $response = $eu->epost( 'db'         => 'snp',
                          'id'         => [242, 1234, 11552823] );
  $response = $eu->epost( 'db'         => 'snp',
                          'id'         => \@ids );
  $response = $eu->epost( 'db'         => 'snp',
                          'id'         => '242 ,1234, 11552823' );
  $response = $eu->epost( 'db'         => 'snp',
                          'file'       => 'snp_ids.txt' );
  $response = $eu->epost( 'db'         => 'snp',
                          'file'       => 'snp_ids.txt',
                          'separator'  => ',',
                          'headers'    => 1,
                          'column'     => 2 );

Required: db, id | file, (email, tool)

Optional: headers, separator, column

Returns: Bio::TGen::EUtils::Response::EPost object

NCBI docs: http://www.ncbi.nlm.nih.gov/entrez/query/static/epost_help.html

This method uses the CGI POST method to place a list of IDs onto the NCBI History Server. As shown in the examples above, the ID list can be specified as a string, an array or a file. If the file method is chosen then the default parsing behaviour is to read every line, split on tabs, take the first (0'th) element of the resulting array and add it to the list of IDs. Optionally, the headers option can be used to skip 1 or more lines at the start of the file, the separator option can be used to specify a Perl-style regular expression to split the file lines on, and the column option can be used to select a column other than the first. The column option uses Perl's 0-based array indexing scheme so '0' is the first item on a line, '1' is the second, etc.
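The file-parsing behaviour described above can be sketched as follows. read_ids() is a hypothetical stand-in written for this illustration; it mirrors the documented headers, separator and column options but is not the module's own code, and the sample data is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: read IDs from an open filehandle, skipping
# $opt{headers} leading lines, splitting on the $opt{separator} regex
# and taking the 0-based $opt{column} from each line.
sub read_ids {
    my ($fh, %opt) = @_;
    my $headers   = defined $opt{headers}   ? $opt{headers}   : 0;
    my $separator = defined $opt{separator} ? $opt{separator} : "\t";
    my $column    = defined $opt{column}    ? $opt{column}    : 0;
    my @ids;
    my $line_no = 0;
    while (my $line = <$fh>) {
        $line_no++;
        next if $line_no <= $headers;     # skip header lines
        chomp $line;
        next unless length $line;         # ignore blank lines
        my @fields = split /$separator/, $line;
        push @ids, $fields[$column] if defined $fields[$column];
    }
    return @ids;
}

# Invented sample data held in memory instead of a file on disk.
my $text = "name,chr,id\nfoo,1,242\nbar,2,1234\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @ids = read_ids($fh, separator => ',', headers => 1, column => 2);
close $fh;
print join(',', @ids), "\n";
```

With column set to 2, the third field of each line is used, which matches the 0-based indexing described above.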

Request Method Parameters

These are the parameters that are passed to the NCBI E-utilities CGI programs as part of the request URL. In TGen-EUtils, these parameters are supplied as elements of the parameter hash for one of the Request Methods.

The following table shows which input parameters are required or optional for each of the TGen-EUtils Request methods. At first glance, this table may seem confusing; however, there are 3 things to keep in mind: (a) most queries only require use of a small subset of the possible parameters; (b) NCBI has good documentation for each of the 8 E-utilities; (c) the TGen-EUtils package attempts to insulate the user as much as possible from the intricacies of the E-utilities system.

       Methods     e   e   e   e   e   e   e   e
                   i   g   s   s   f   l   p   s
                   n   q   e   u   e   i   o   p
                   f   u   a   m   t   n   s   e
                   o   e   r   m   c   k   t   l
                       r   c   a   h           l
                       y   h   r
  Parameters                   y
                 .-------------------------------.
  db             | R       R   R   R   R   R   R |
  dbfrom         |                     R         |
  id             |             r   r   r   r     |
  term           |     R   R           O       R |
  field          |         O                     |
  retstart       |         O   O   O             |
  retmax         |         O   O   O             |
  retmode        |             O   O             |
  rettype        |         O       O             |
  usehistory     |         O           O   O     |
  WebEnv         |             r   r   r   r     |
  query_key      |             r   r   r   r     |
  cmd            |                     O         |
  seq_start      |                 O             |
  seq_stop       |                 O             |
  strand         |                 O             |
  complexity     |                 O             |
  reldate        |         O           O         |
  mindate        |         O           O         |
  maxdate        |         O           O         |
  datetype       |         O           O         |
  sort           |         O                     |
  holding        |                     O         |
  version        |                     O         |
  tool           | d   d   d   d   d   d   d   d |
  email          | d   d   d   d   d   d   d   d |
                 |-------------------------------|
  from_hist    * |         O   O   O   O         |
  file         * |                         O     |
  headers      * |                         O     |
  column       * |                         O     |
  separator    * |                         O     |
  batch        * |         O   O   O             |
                 '-------------------------------'
  Key:  R = required
        r = either id, or WebEnv and query_key must be specified
        d = required but has default so can be left unset
        O = optional
        * = parameter is specific to TGen-EUtils, not NCBI

The following section details each of the parameters from the table above in more detail.

db

Name of the target Entrez database. Must be one of the strings from the first column of the Entrez Database table shown in Usage Example 3.

dbfrom

Name of the source Entrez database. Must be one of the strings from the first column of the Entrez Database table shown in Usage Example 3. This parameter is only used by elink(), where known records from one database (specified with dbfrom) are linked to records from another database (specified with db).

id

In most cases, the ID value(s) specified by the id parameter must be in the Primary ID format for the database specified by the db parameter. For example, if db=pubmed then id should contain PubMed IDs and if db=nucleotide then id should contain GI numbers. efetch() is the exception to this rule and appears to be able to recognize some IDs that are not Primary IDs, e.g. accession, accession.version. The NCBI Entrez Programming Utilities webpage (http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html) has a list of Primary IDs and the efetchseq help page (http://www.ncbi.nih.gov/entrez/query/static/efetchseq_help.html) has a list of some of the valid ID types for efetch.

term

An Entrez query string. Examples are 'all[filter]' and '"Homo sapiens"[orgn]'. This string becomes part of the URL passed to NCBI's E-utilities server, so TGen-EUtils automatically replaces spaces with '+' symbols. The use of the '[xxx]' format to limit the way a search is performed is discussed in more detail in the Field section of Bio::TGen::EUtils::Response::EInfo.
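
The space-to-'+' substitution can be sketched as follows. plus_encode_term() is a hypothetical name used for illustration; how TGen-EUtils performs the substitution internally is not shown in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: replaces every space in an Entrez query string
# with '+' so the string can be embedded in a URL.
sub plus_encode_term {
    my ($term) = @_;
    ( my $encoded = $term ) =~ tr/ /+/;
    return $encoded;
}

my $encoded = plus_encode_term('gfap[gene] AND human[orgn]');
print "$encoded\n";    # prints gfap[gene]+AND+human[orgn]
```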

field

This parameter can be used to modify term. For example, setting field to 'mh' and term to 'asthma OR hay fever' is equivalent to setting term to 'asthma[mh] OR hay fever[mh]'. 'mh' stands for MeSH Heading, so setting field to 'mh' has the same effect as specifying [mh] on both items in term: 'asthma' and 'hay fever'.
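
The equivalence can be illustrated with a short sketch. apply_field() is a hypothetical helper, not part of TGen-EUtils or NCBI, and it only handles simple terms joined by AND, OR, or NOT; it does not attempt to handle parentheses or items that already carry a field tag.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites a simple term so that every operand
# carries the given field tag, mimicking the effect of the field parameter.
sub apply_field {
    my ( $term, $field ) = @_;

    # Split on the Boolean operators; the capture group keeps them in the list.
    my @parts = split /\s+(AND|OR|NOT)\s+/, $term;

    my @tagged;
    for my $part (@parts) {
        if ( $part =~ /^(?:AND|OR|NOT)$/ ) {
            push @tagged, $part;                # operator, leave as-is
        }
        else {
            push @tagged, $part . "[$field]";   # operand, append the field tag
        }
    }
    return join ' ', @tagged;
}

print apply_field( 'asthma OR hay fever', 'mh' ), "\n";
# prints asthma[mh] OR hay fever[mh]
```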

usehistory

Whether the Entrez History Server should be used (y/n).

retstart

First item in results list to display (default=0).

retmax

Number of items to display in results list (default=20).

retmode

Output data format.

rettype

Output data record type.

WebEnv

Web Environment for accessing existing data sets. This is effectively a cookie and for a pipeline of TGen-EUtils queries, the value returned by NCBI as part of the previous response should be supplied as part of all subsequent queries. TGen-EUtils handles this for the user as long as the previous Request specified usehistory='y' and the Response object from that Request is specified using the from_hist parameter. For an example of the correct use of usehistory/from_hist in a command pipeline, see Usage Example 2 in the DESCRIPTION section above.

query_key

Used in conjunction with WebEnv to access lists from the History Server.

cmd

This parameter is unlikely to be directly used by a TGen-EUtils user. Internally TGen-EUtils uses it for elink(), where the string "&cmd=neighbor_history" in the URL allows the use of usehistory pipelines which are not officially supported by elink().

seq_start

Retrieve sequence starting at this base position.

seq_stop

Retrieve sequence until this base position.

strand

Which DNA strand to retrieve (1=plus, 2=minus).

complexity

Determines what data object to retrieve.

reldate

Limits a search to within reldate days of today: a value of 1 means within today, a value of 7 means within the past week, a value of 365 means within the past year, and so on.

mindate

maxdate

mindate and maxdate should be used together to denote a date range. Each one contains a date of the form YYYY, YYYY/MM or YYYY/MM/DD where YYYY is a four-digit year, MM is a two-digit month, and DD is a two-digit day.
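
The three accepted date forms can be checked with a single pattern. This is a sketch only: is_entrez_date() is a hypothetical helper, it checks the shape of the string rather than calendar validity, and this document does not say whether TGen-EUtils validates dates before sending them to NCBI.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string has the form YYYY, YYYY/MM
# or YYYY/MM/DD. Only the shape is checked, not whether the date exists.
sub is_entrez_date {
    my ($date) = @_;
    return $date =~ m!\A\d{4}(?:/\d{2}(?:/\d{2})?)?\z! ? 1 : 0;
}

for my $date ( '2006', '2006/03', '2006/03/14', '2006/3' ) {
    printf "%-12s %s\n", $date, is_entrez_date($date) ? 'ok' : 'rejected';
}
```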

datetype

This parameter modifies the reldate, mindate and maxdate parameters by specifying what sort of date they are. For example, the pubmed database is indexed on pdat, the date of publication; edat, the date the publication was first available through Entrez; mhda, the date the publication was indexed with MeSH terms; cdat, the date of completion; and mdat, the date of last modification. The names and types of date indexes vary between Entrez databases, and the only way to find out which values are possible is to call the einfo() method of Bio::TGen::EUtils and examine the output of the fields() method of the Response. More details about Fields and how to use them can be found in the Bio::TGen::EUtils::Response::EInfo documentation.

tool

Name of the script or module that is making calls to the NCBI E-utilities server. This parameter has a default value that must be set when initializing the Bio::TGen::EUtils system. If you set this value for a particular Request Method then it overrides the default.

email

E-mail address of the user (or developer) who is using TGen-EUtils to make calls to the NCBI E-utilities server. This parameter has a default value that must be set when initializing the Bio::TGen::EUtils system. If you set this value for a particular Request Method then it overrides the default.

batch

batch can be used as a parameter to esearch(), esummary(), and efetch(). This is one of the most important and most problematic features of the TGen-EUtils system, and a more detailed discussion of "batch mode" can be found in the documentation for Bio::TGen::EUtils::Response::EFetch.

The three retrieval E-utilities (esearch(), esummary(), and particularly efetch()) can be an enormous drain on the NCBI servers if users retrieve huge numbers of records in a single request. NCBI therefore asks that you never retrieve more than 500 records in a single request, and provides the retstart and retmax parameters as a mechanism for retrieving subranges of records from a large request. TGen-EUtils has a "batch mode" that transparently uses retstart and retmax to split large retrieval requests into multiple 500-record requests and concatenates the responses into a single Response object. For this reason, requests that include the batch parameter can take a long time if the underlying request is large. The maximum number of records returned by a single request is set with the retmax parameter, which TGen-EUtils defaults to 500 for efetch() and esummary().
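
The way a large request can be divided into retstart/retmax windows is sketched below. batch_windows() is a hypothetical helper written for illustration; the actual batching code inside TGen-EUtils is not shown in this document and may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the list of retstart/retmax pairs needed
# to retrieve $total records in requests of at most $batch_size records.
sub batch_windows {
    my ( $total, $batch_size ) = @_;
    $batch_size ||= 500;    # 500-record batches, as described above

    my @windows;
    for ( my $start = 0; $start < $total; $start += $batch_size ) {
        my $remaining = $total - $start;
        my $retmax = $remaining < $batch_size ? $remaining : $batch_size;
        push @windows, { retstart => $start, retmax => $retmax };
    }
    return @windows;
}

# 1234 records need three requests: 500, 500, and 234 records.
for my $w ( batch_windows(1234) ) {
    print "retstart=$w->{retstart} retmax=$w->{retmax}\n";
}
```

Each window would correspond to one request against the NCBI server, which is why a batch request over a large result set can take a long time.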

file

headers

column

separator

file, headers, column, and separator are all parameters that are part of the TGen-EUtils system and are not passed to the NCBI server. They are all used in the epost() method to retrieve a list of IDs from a text file and upload it to the NCBI History Server.
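
The kind of ID extraction these parameters drive can be sketched as follows. The exact semantics of headers, column, and separator are not spelled out here, so this sketch assumes that headers is the number of leading lines to skip, column is a 1-based column number, and separator is a literal delimiter defaulting to a tab. ids_from_fh() is a hypothetical helper, not the epost() implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: reads IDs from one column of a delimited text file.
# Assumed meanings: headers = header lines to skip, column = 1-based
# column number, separator = literal field delimiter (default: tab).
sub ids_from_fh {
    my ( $fh, %opt ) = @_;
    my $headers   = $opt{headers} || 0;
    my $column    = $opt{column}  || 1;
    my $separator = defined $opt{separator} ? $opt{separator} : "\t";

    my @ids;
    my $line_no = 0;
    while ( my $line = <$fh> ) {
        $line_no++;
        next if $line_no <= $headers;    # skip header lines
        chomp $line;
        next unless length $line;        # skip blank lines
        my @fields = split /\Q$separator\E/, $line, -1;
        my $id = $fields[ $column - 1 ];
        push @ids, $id if defined $id && length $id;
    }
    return @ids;
}

# An in-memory file stands in for a real file on disk.
my $text = "symbol,gene_id\nGFAP,2670\nTP53,7157\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my @ids = ids_from_fh( $fh, headers => 1, column => 2, separator => ',' );
close $fh;
print join( ',', @ids ), "\n";    # prints 2670,7157
```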

^ SEE ALSO

^ AUTHORS

John Pearson bioinfresearch@tgen.org

^ VERSION

$Id: EUtils.pm,v 1.22 2006/03/14 07:12:16 jpearson Exp $

^ COPYRIGHT

TGen-EUtils is copyright 2006 by The Translational Genomics Research Institute. All rights reserved. This License is limited to, and you may use the Software solely for, your own internal and non-commercial use for academic and research purposes. Without limiting the foregoing, you may not use the Software as part of, or in any way in connection with the production, marketing, sale or support of any commercial product or service. For commercial use, please contact licensing@tgen.org. By installing this Software you are agreeing to the terms of the LICENSE file distributed with this software.

In any work or product derived from the use of this Software, proper attribution of the authors as the source of the software or data must be made. The following URL should be cited:

http://bioinformatics.tgen.org/software/tgen-eutils/