Bio::TGen::EUtils v1.22
Bio::TGen::EUtils - Access NCBI's E-utilities CGI interface
use Bio::TGen::EUtils;
my $eu = Bio::TGen::EUtils->new( 'tool' => 'my_cool_tool.pl',
'email' => 'me@myplace.org' );
my $response = $eu->esearch(
'db' => 'gene',
'term' => 'gfap[gene] AND human[orgn]',
'usehistory' => 'y' );
print $response->id_report();
TGen-EUtils is a collection of perl modules and scripts that provide an
object-oriented interface to NCBI's Entrez Programming Utilities
(E-utilities), a collection of web-based CGI programs that provide
a remote programming interface to the Entrez system.
Entrez is a framework implemented on top of the
NCBI source databases and individual source databases (such as
Nucleotide and Gene) retain their own design and implementation.
Consequently, some aspects of Entrez are common to all databases and
other aspects are specific to individual databases. Each Entrez source
database has a different organizing principle and contains different
types of information so each database is indexed and searched using
a series of database-specific terms.
TGen-EUtils is intended to simplify access to NCBI's E-utilities
and is structured so that users interact with a single instance of the
Bio::TGen::EUtils module which acts as a factory for making all
subsequent calls against the NCBI E-utilities interface. The user is
isolated from any requirement to create URLs or manage the interaction with
the NCBI CGI interface.
Development of the TGen-EUtils package was inspired by the NCBI Power
Scripting course taught at NCBI by David Wheeler and Eric Sayers:
http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/course.html
In practical terms NCBI's E-utilities system is implemented as 8
user-accessible CGI programs and the programs are driven by supplying
them with carefully constructed URLs that contain specific parameters
and values. TGen-EUtils provides a Bio::TGen::EUtils Request method
corresponding to each of the NCBI CGI programs. Each method takes as
input a hash of input parameters that correspond to the URL parameters
used to drive the equivalent NCBI E-utilities program.
When one of the Bio::TGen::EUtils methods is called, it makes a call
against the NCBI E-utilities server and the server's response is placed
into a Bio::TGen::EUtils::Response object that is returned to the user.
Response objects come in a number of subclasses - one to match each of the
Request methods in Bio::TGen::EUtils. Response objects all inherit from
Bio::TGen::EUtils::Response
but have specific routines for handling their own particular
forms of output including XML, text etc. In effect, there is a Response
module to match each XML DTD in NCBI's entrez query system:
http://www.ncbi.nlm.nih.gov/entrez/query/DTD/
The easiest way to demonstrate how TGen-EUtils works is to show an
example. The following URL (split into 2 lines for display purposes)
shows the URL that would be used with the NCBI E-utilities to search for
the ID of the Human CDKN2A gene:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene \
&term=CDKN2A[gene]+AND+human[orgn]
To actually use this URL, a significant amount of code would be required
- the perl LWP system would have to be
used to submit the URL to the NCBI server and get a reply and the XML
reply would have to be parsed to extract the id(s), if any, returned
by the query. In contrast, the complete code for the equivalent
TGen-EUtils program is:
use Bio::TGen::EUtils;
my $eu = Bio::TGen::EUtils->new( tool => 'test_program.pl',
email => 'myname@myplace.org');
my $esearch = $eu->esearch( db => 'gene',
term => 'CDKN2A[gene] AND human[orgn]' );
my $ra_ids = $esearch->ids();
The user creates a Bio::TGen::EUtils object, calls the
esearch() method with a few parameters and uses the ids()
method to extract the IDs from the response. The user is completely
insulated from the creation of the URL and any of the details of
the communication with the NCBI E-utilities server.
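To make concrete what the user is being insulated from, the following is a minimal sketch of how a parameter hash could be turned into an esearch URL like the one shown above. It is not the code used inside TGen-EUtils: the helper name build_eutils_url() and its escaping rules are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of TGen-EUtils: build an E-utilities URL
# from a CGI program name and a hash of parameters.
sub build_eutils_url {
    my ( $program, %params ) = @_;
    my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

    my @pairs;
    foreach my $key ( sort keys %params ) {
        my $value = $params{$key};
        # Percent-encode unusual characters but leave alone the ones that
        # appear literally in the example URL ([, ], comma) and spaces.
        $value =~ s/([^A-Za-z0-9_.~\[\],\- ])/sprintf('%%%02X', ord($1))/ge;
        $value =~ tr/ /+/;    # spaces become '+' as in the URL shown above
        push @pairs, "$key=$value";
    }
    return $base . $program . '.fcgi?' . join( '&', @pairs );
}

print build_eutils_url(
    'esearch',
    db   => 'gene',
    term => 'CDKN2A[gene] AND human[orgn]'
), "\n";
```

Even with the URL built, the request would still have to be submitted and the XML reply parsed, which is the work that esearch() and ids() perform on the user's behalf.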
If we wanted to expand this example to retrieve the full record for the
gene ID(s) identified by our search then we would need to craft a second
URL to interact with a different E-utilities program - efetch.fcgi:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene \
&id=1029&retmode=xml
Note that to use the URL we need to extract the gene ID (1029) returned
by the esearch URL and supply it as a parameter to the efetch URL. The
revised TGen-EUtils program is shown below - the only additions are the
extra usehistory => 'y' parameter to esearch() and the call to
efetch() which extracts all of its required inputs from the output
of esearch():
use Bio::TGen::EUtils;
my $eu = Bio::TGen::EUtils->new( tool => 'test_program.pl',
email => 'myname@myplace.org');
my $esearch = $eu->esearch( db => 'gene',
term => 'CDKN2A[gene] AND human[orgn]',
usehistory => 'y' );
my $efetch = $eu->efetch( from_hist => $esearch );
print $efetch->raw();
In addition to simplified code, TGen-EUtils provides extensive error
checking of all required and optional input parameters; use of
smart defaulting to minimize the number of parameters that
users have to understand; error checking of all communication with the
NCBI E-utilities server; and a verbose mode where diagnostics are
printed to track the progress of, and help debug, TGen-EUtils applications.
Before attempting to describe the various TGen-EUtils methods that can
be used to interact with the Entrez system, we should detail the
underlying NCBI E-utilities CGI programs that those methods will use.
The following table lists the current NCBI E-utilities
services and the primary function each one provides:
einfo.fcgi provide information and stats about an Entrez db
egquery.fcgi like ESearch but searches all 31 Entrez dbs at once
esearch.fcgi search an Entrez db for matching records
elink.fcgi find records in other dbs linked to a given record
esummary.fcgi retrieve summaries of db record(s)
efetch.fcgi retrieve full database record(s)
epost.fcgi upload a list of database records IDs to NCBI
espell.fcgi retrieve spelling options for given terms
The correspondence between NCBI E-utilities programs,
Bio::TGen::EUtils methods and
Bio::TGen::EUtils::Response objects is summarized here:
NCBI E-utilities TGen-EUtils Bio::TGen::EUtils::Response
CGI program method subclass
--------------------------------------------------------------------
einfo.fcgi einfo() Bio::TGen::EUtils::Response::EInfo
egquery.fcgi egquery() Bio::TGen::EUtils::Response::EGQuery
esearch.fcgi esearch() Bio::TGen::EUtils::Response::ESearch
elink.fcgi elink() Bio::TGen::EUtils::Response::ELink
esummary.fcgi esummary() Bio::TGen::EUtils::Response::ESummary
efetch.fcgi efetch() Bio::TGen::EUtils::Response::EFetch
epost.fcgi epost() Bio::TGen::EUtils::Response::EPost
espell.fcgi espell() Bio::TGen::EUtils::Response::ESpell
As can be seen, there is a clear one-to-one correspondence between
the NCBI CGI programs and TGen-EUtils methods. This is important
because the same parameters that would be placed in a URL to drive the
CGI programs are passed in as an input hash to the equivalent
TGen-EUtils method. A summary table and a more
detailed discussion of each of these parameters can be found
below in the Request Method Parameters section.
Much of the power of the E-utilities system comes from stringing together
multiple queries in a pipeline. This example shows retrieval of
an XML document showing the full details of up to 500 SNPs that NCBI has
linked to the human TP53 gene. The script uses a pipeline of 3
E-utilities: esearch -> elink -> efetch.
use Bio::TGen::EUtils;
my $eu = Bio::TGen::EUtils->new(
tool => 'gene_2_snp.pl',
email => 'myname@myplace.org');
my $esearch = $eu->esearch(
db => 'gene',
term => 'TP53[gene] AND human[orgn]',
usehistory => 'y' );
my $elink = $eu->elink(
dbfrom => 'gene',
db => 'snp',
usehistory => 'y',
from_hist => $esearch );
my $efetch = $eu->efetch( from_hist => $elink );
$efetch->write_raw( file => 'TP53_snps.xml' );
The user creates a single instance of Bio::TGen::EUtils using the
new() method and then makes all subsequent NCBI queries through
this factory object using the various request methods - esearch(),
elink(), and efetch() in this example. There is a
Bio::TGen::EUtils request method to match each of the 8 NCBI
E-utilities programs.
Behind the scenes, each time one of the "e" methods is called,
the Bio::TGen::EUtils object communicates with the
NCBI Entrez system and the response from NCBI is used to generate an
object that is a subclass of Bio::TGen::EUtils::Response. Response
objects come in a number of flavours - one to match each of the
8 request methods in Bio::TGen::EUtils. They all inherit from
Bio::TGen::EUtils::Response
but have specific routines for handling their own particular
forms of output including XML, text etc.
The following table shows all of the Entrez databases (as at 28th
January 2006), the number of records in each database, and the name
by which the database appears in the menu on the NCBI Entrez website.
Database Records Database menu name
-----------------------------------------------------
books 137758 Books
cancerchromosomes 50380 CancerChromosomes
cdd 11530 Conserved Domains
domains 150266 3D Domains
gds 5971 GEO DataSets
gene 1686173 Gene
genome 5015 Genome
genomeprj 1986 Genome Project
gensat 46801 GENSAT
geo 15264560 GEO Profiles
homologene 84326 HomoloGene
journals 20781 Journals
mesh 181651 MeSH
ncbisearch 4069 NCBI Web Site
nlmcatalog 1234393 NLM Catalog
nucleotide 67518060 Nucleotide
omia 2484 OMIA
omim 17224 OMIM
pcassay 186 PubChem BioAssay
pccompound 5311600 PubChem Compound
pcsubstance 8028104 PubChem Substance
pmc 538358 PMC
popset 46316 PopSet
probe 3395818 Probe
protein 8618109 Protein
pubmed 16056582 PubMed
snp 26430220 SNP
structure 34421 Structure
taxonomy 293619 Taxonomy
unigene 1858703 UniGene
unists 476711 UniSTS
The table above was generated using the db_info.pl script
distributed as part of the TGen-EUtils package.
The processing portion of that script is reproduced here:
use Bio::TGen::EUtils;
my $eu = Bio::TGen::EUtils->new(
tool => 'db_info.pl',
email => 'myname@myplace.org');
my $response = $eu->egquery( 'term' => 'all[filter]' );
my %results = $response->process_egquery;
foreach my $db (sort keys %results) {
printf "%-17s %12d %-s\n",
$db,
$results{$db}->{'count'},
$results{$db}->{'menudb'};
}
The user creates a single instance of Bio::TGen::EUtils using
the new() method and then makes all subsequent NCBI queries through
this factory object using the various request methods - egquery()
in this example.
The script is equivalent to supplying the string all[filter] as
the term parameter to the NCBI E-utilities egquery.fcgi program. To
verify this, try copying the following URL into a web browser:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=all[filter]
The output from this request is XML so the display you see in your
browser is unstructured but viewing the page source will show
the underlying XML.
There are 3 categories of methods in EUtils. The first is
Constructor Methods
of which there is only 1 - new(). The second category is
Accessor Methods that can be used to change the values of attributes
that control how the EUtils module works. The third category is the
Request Methods that make calls against the NCBI CGI E-utilities.
The Bio::TGen::EUtils::Response objects returned by calls to the
Request Methods
have an additional category of Report Methods that process the
XML or text results returned by the CGI programs.
new()
my $eutil = Bio::TGen::EUtils->new(
tool => 'my_cool_tool.pl',
email => 'me@myplace.org' );
my $eutil = Bio::TGen::EUtils->new(
tool => 'my_cool_tool.pl',
email => 'me@myplace.org',
url_base => 'http://ncbi.nih.gov/entrez/',
retmax => 50,
delay => 4 );
There are a number of options that can be supplied to the new()
method. It is compulsory to supply values for tool and email - NCBI does
not enforce this rule but the TGen-EUtils modules do. See the tool
and email sections below for more details. The full list of new()
options is:
tool
email
tool and email are both compulsory and the new() method will
die unless values are supplied for both.
tool should contain some string that identifies the application
that is making the E-utilities request and email should be the email
address of the user or programmer who created the application.
In the event of a problem with an E-utilities request, NCBI staff may
use the contents of these 2 fields to try to contact the user.
If these fields are not specified
then NCBI staff may be left with no option but to block IP addresses of
machines that are generating badly behaved E-utilities requests.
NCBI blocks by IP address so there is no point in not being truthful in
the values you give for tool and email - you can't hide.
url_base
This parameter is optional and should not be set in most instances as
the subclass Bio::TGen::EUtils::Request
defaults it to the address of the current NCBI E-utilities server:
'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'. If NCBI ever changes
the server address, users could use this option to point to the new
server until a revised version of TGen-EUtils was released.
retmax
The subclass Bio::TGen::EUtils::Request defaults this value to 500
unless it is explicitly set here. This value is used by esummary() and
efetch() to limit the number of records retrieved when a query returns
a large number of matches. This helps spread the load on the NCBI servers
by requiring
users to break large retrieval tasks into a number of smaller
retrievals. In practice, 500 will almost always be a reasonable value but
users should be aware that esummary() and efetch()
queries will be truncated at 500 results by default.
esearch() also uses retmax although its behaviour is to always report
how many IDs it would return and then only return the first 20 unless
retmax is set to a higher value (maximum 10,000) or batch mode is
used. For a discussion of how to get more than 500 records using batch
mode, see the batch parameter in the
Request Method Parameters section below.
delay
The NCBI usage guidelines state that there should be at least 3 seconds
between E-utilities requests. The default set by TGen-EUtils is 3 seconds.
A user can override the default by passing in a parameter to new() or by
calling the delay() method at any time; however, setting a value less than
3 will result in a warning message being printed. The delay is calculated
from the time the last request was made, not from when it returned
results so if the results took more than 3 seconds to be returned, no
delay is required before the next request is made. Users do not
need to worry about the delay as TGen-EUtils contains internal logic to
enforce the delay. The user's only concern is what value to set for
delay. You can set the delay to less than 3 seconds but that's
just asking to have your IP address blocked by NCBI so think carefully.
NCBI also has restrictions on the times of the day/week when
it is acceptable to make large numbers of queries against the
E-utilities system. TGen-EUtils users should read the full text of
the NCBI usage guidelines at:
http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html
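The delay rule described above (measure from when the last request was made, not from when it returned) can be sketched as follows. This is only an illustration of the arithmetic, not the internal TGen-EUtils logic, and the helper name seconds_to_wait() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of TGen-EUtils: how many seconds to sleep
# before the next request, given the time (epoch seconds) at which the last
# request was *made*, or undef if no request has been made yet.
sub seconds_to_wait {
    my ( $last_request, $now, $delay ) = @_;
    return 0 unless defined $last_request;    # first request: no wait
    my $elapsed = $now - $last_request;
    return $elapsed >= $delay ? 0 : $delay - $elapsed;
}

my $delay = 3;
warn "delay of $delay is below the 3 seconds requested by NCBI\n"
    if $delay < 3;

# Last request made at t=100 and it is now t=101, so 2 seconds remain.
printf "would sleep %d second(s)\n", seconds_to_wait( 100, 101, $delay );

# The previous request took 5 seconds to return, so no wait is needed.
printf "would sleep %d second(s)\n", seconds_to_wait( 100, 105, $delay );
```

A real client would call sleep() with the computed value immediately before sending the next request and would then record the new request time.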
verbose
The verbose parameter can be any positive integer and the higher the
number, the more detailed the diagnostic messages. The default level is
0. At level 1, all Request Methods will print out a string identifying
the Request type, a timestamp, all of the parameters that will be used
to construct the URL (including those values that have been defaulted),
plus the actual URL that will be passed to the NCBI E-utilities server.
At level 2, the Request module will print a message showing the values
used to compute and enforce the specified between-query delay. The
example below shows the output from an efetch() request with verbose
set to 2.
EFETCH: [Tue Feb 28 23:05:38 2006]
Parameters:
db nucleotide
email jpearson@Translational_Genomics_Research_Institute
id 62988321
retmax 20
retmode text
rettype fasta
seq_start 1
seq_stop 138
tool TGen-EUtils-efetch_examples
Enforce delay: (now:1141193138) (last:) (delay:0)
Request->get(http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
?seq_stop=138&retmode=text&db=nucleotide&retmax=20&email=jpearson@Tra
nslational_Genomics_Research_Institute&rettype=fasta&tool=TGen-EUtils
-efetch_examples&seq_start=1&id=62988321)
In general, an accessor method exists corresponding to each of the
options that can be specified in the new() method with the exception
of the tool and email parameters which should in most cases only be set
once during a TGen-EUtils session. If a user does wish to change the
email and tool values, tool and email parameters can be passed to
individual Request Methods.
request()
Returns the Bio::TGen::EUtils::Request object that interacts with the
NCBI E-utilities server. A user should never need to directly touch
this object and this method is only provided to assist in debugging
failed queries.
url_base()
Get/set the address of the current NCBI E-utilities server.
delay()
Get/set the delay in seconds between consecutive TGen-EUtils calls
against the NCBI E-utilities server.
retmax()
Get/set the maximum number of records returned in a single efetch(),
esearch() or esummary() Request.
The 8 Request Methods directly map to the 8 NCBI E-utilities programs.
Each method returns an object that belongs to one of the subclasses of
Bio::TGen::EUtils::Response. It is very important that the user
consult the documentation for the specific subclass as it is likely to
be more extensive than the summary information provided here. The
subclass documentation also details any subclass-specific methods for
accessing the information in the Response.
einfo()
$response = $eu->einfo( db => 'snp' );
Required: db, (email, tool)
Returns: Bio::TGen::EUtils::Response::EInfo object
NCBI docs:
http://www.ncbi.nlm.nih.gov/entrez/query/static/einfo_help.html
Returns information about the specified Entrez database including the
number of records, which fields have been indexed and so are available
as filters, plus what links to other databases exist.
egquery()
$response = $eu->egquery( term => 'CDKN2A' );
$response = $eu->egquery( term => 'all[filter]' );
Required: term, (email, tool)
Returns: Bio::TGen::EUtils::Response::EGQuery object
NCBI docs:
http://www.ncbi.nlm.nih.gov/entrez/query/static/egquery_help.html
Performs EGQuery which is effectively a global esearch() that
searches all Entrez databases at once.
egquery() will search all of the Entrez databases with the
given search string and will return a list of databases and how many
matching records were found within the database. It can also be used
with the 'all[filter]' search string to return the current record
totals for all of the Entrez databases as shown in the example above.
esearch()
$response = $eu->esearch( db => 'gene',
term => 'CDKN2A[gene] AND human[orgn]',
usehistory => 'y' );
$response = $eu->esearch( db => 'pubmed',
term => 'asthma[mh] OR hay fever[mh]',
reldate => 50 );
$response = $eu->esearch( db => 'snp',
term => '',
datetype => 'pdat',
mindate => '1999/01/01',
maxdate => '1999/12/31' );
Required: db, term, (email, tool)
Optional: usehistory, retstart, retmax, WebEnv
Returns: Bio::TGen::EUtils::Response::ESearch object
NCBI docs:
http://www.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html
Searches a single specified Entrez database for a term or id.
This method call is almost always the first call in any pipeline of
TGen-EUtils calls. The only exception is where a known ID list is
uploaded to the NCBI History Server using epost().
esummary()
$response = $eu->esummary( db => 'gene',
id => '1234' );
$response = $eu->esummary( db => 'gene',
id => '1234,242,11552823' );
$response = $eu->esummary( retmax => 1000,
from_hist => $esearch );
$response = $eu->esummary( batch => 1,
from_hist => $epost );
Required: db, id | WebEnv+query_key, (email, tool)
Optional: retstart, retmax
Returns: Bio::TGen::EUtils::Response::ESummary object
NCBI docs:
http://www.ncbi.nlm.nih.gov/entrez/query/static/esummary_help.html
Retrieves document summaries (DocSums) for each of the ids listed or from
an id list on the History Server. A DocSum is much smaller than the
whole record that would be retrieved by efetch(). In many cases,
the small number of attributes returned in the DocSum are sufficient
and the whole record does not need to be retrieved. DocSum requests
also tend to be faster since the central Entrez engine provides searching
based on the database indexes and retrieval of summary documents
(DocSums) for every record from every source database while retrieval of
full data records (via efetch()) is delegated to the source databases.
efetch()
$response = $eu->efetch( db => 'nucleotide',
id => '62988321' );
$response = $eu->efetch( from_hist => $esearch );
$response = $eu->efetch( from_hist => $esearch,
batch => 1 );
$response = $eu->efetch( db => 'snp',
retmode => 'text',
rettype => 'chr',
id => '3180061' );
$response = $eu->efetch( db => 'nucleotide',
retmode => 'text',
rettype => 'fasta',
seq_start => 1,
seq_stop => 138,
id => '62988321' );
Required: db, id | WebEnv+query_key, (email, tool)
Optional: retstart, retmax, retmode, rettype, seq_start, seq_stop,
strand, complexity
Returns: Bio::TGen::EUtils::Response::EFetch object
NCBI docs:
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html
Fetches full data record for the specified id(s).
If you want to retrieve a large number of records then it is suggested
that you use epost() to push the ID list onto the NCBI History Server
and then do a batch-mode efetch() by specifying the batch parameter.
EFetch is the only one of the NCBI E-utilities that does not always
return XML Responses. Because it is used to return records from a
variety of databases, each database may come with data-specific output
formats (eg FASTA for nucleotide) that must be supported by EFetch. The
retmode and rettype parameters are used in combination to select
from these alternate record output formats. A list of the valid
retmode/rettype combinations can be found in the NCBI EFetch
documentation linked above.
EFetch is not implemented by the central Entrez engine and must be
implemented by each Entrez source database so efetch() is currently
only supported in the following databases: PubMed,
PubMed Central, Journals, Nucleotide, Protein, Genome, Gene, SNP,
PopSet, and Taxonomy.
elink()
$response = $eu->elink( dbfrom => 'snp',
db => 'gene',
id => [242, 1234, 11552823] );
$response = $eu->elink( db => 'gene',
from_hist => $esearch );
Required: db, dbfrom, id | WebEnv+query_key, (email, tool)
Optional: term
Returns: Bio::TGen::EUtils::Response::ELink object
NCBI docs:
http://www.ncbi.nlm.nih.gov/entrez/query/static/elink_help.html
The id parameter can be a string of comma-separated ids or it can be
a reference to an array of ids. In practice, these 2 are quite
different. The string approach produces a URL of the form
...&id=1,2,3,4&... whereas the array approach produces a URL of the form
...&id=1&id=2&id=3&id=4&.... These 2 URL forms result in different
XML reports. For the first form, a single composite report of linked IDs is
returned but it is not possible to work out which of the original IDs
each linked ID relates to. Using the second URL form, a separate small
report is produced for each of the original IDs so it is possible to
work out the relationships between original and linked IDs.
Using the WebEnv/query_key approach always produces the composite
report form so it is sometimes desirable to retrieve the IDs and submit
them directly with elink rather than to use the History Server.
The results from an ELink can be placed onto the History Server by
specifying the usehistory parameter and that ELink Response object can
be passed to another ELink call using the from_hist parameter. In
this way multiple ELink calls can be chained together; however, as noted
above, this method will not show the user the individual links from the
input IDs to the output IDs.
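The difference between the two id forms discussed above can be seen in a small sketch that renders the id portion of the URL for a string and for an array reference. The helper name id_query_fragment() is hypothetical and this is not the code used by elink() itself.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of TGen-EUtils: render the id parameter as
# it would appear in the request URL.
sub id_query_fragment {
    my ($id) = @_;
    return ref($id) eq 'ARRAY'
        ? join( '&', map { "id=$_" } @{$id} )    # one id=... pair per ID
        : "id=$id";                              # single comma-separated list
}

# String form: one composite report of linked IDs.
print id_query_fragment('1,2,3,4'), "\n";

# Array form: a separate small report for each original ID.
print id_query_fragment( [ 1, 2, 3, 4 ] ), "\n";
```

The array form is the one to choose when the relationship between each original ID and its linked IDs matters.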
epost()
$response = $eu->epost( 'db' => 'snp',
'id' => [242, 1234, 11552823] );
$response = $eu->epost( 'db' => 'snp',
'id' => \@ids );
$response = $eu->epost( 'db' => 'snp',
'id' => '242 ,1234, 11552823' );
$response = $eu->epost( 'db' => 'snp',
'file' => 'snp_ids.txt' );
$response = $eu->epost( 'db' => 'snp',
'file' => 'snp_ids.txt',
'separator' => ',',
'headers' => 1,
'column' => 2 );
Required: db, id | file, (email, tool)
Optional: headers, separator, column
Returns: Bio::TGen::EUtils::Response::EPost object
NCBI docs:
http://www.ncbi.nlm.nih.gov/entrez/query/static/epost_help.html
This method uses the CGI POST method to place a list of IDs onto the
NCBI History Server. As shown in the examples above, the ID list can
be specified as a string, an array or a file. If the file method is
chosen then the default parsing behaviour is to read every line, split
on tabs and take the first (0'th) element of the resulting array and add
it to the list of IDs. Optionally the headers option can be used to
skip 1 or more lines at the start of the file, the separator option
can be used to specify a perl-style regular expression to split the file
lines on, and the column option can be used to select a column other
than the first. The column option uses perl's 0-based array indexing
scheme so '0' is the first item on a line, '1' is the second, etc.
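The file-parsing behaviour described above (skip header lines, split each line on a separator, take one column) can be sketched in plain perl. The helper read_ids() and the sample data are made up for illustration; the option names simply mirror the epost() parameters and this is not the module's own parsing code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of TGen-EUtils: pull one column of IDs out
# of a delimited text file read from an open filehandle.
sub read_ids {
    my ( $fh, %opt ) = @_;
    my $separator = defined $opt{separator} ? $opt{separator} : "\t";
    my $headers   = $opt{headers} || 0;    # number of leading lines to skip
    my $column    = $opt{column}  || 0;    # 0-based column index

    my @ids;
    my $line_no = 0;
    while ( my $line = <$fh> ) {
        $line_no++;
        next if $line_no <= $headers;
        chomp $line;
        next unless length $line;          # ignore blank lines
        my @fields = split /$separator/, $line;
        push @ids, $fields[$column] if defined $fields[$column];
    }
    return @ids;
}

# An in-memory "file" with one header line and the IDs in the third column.
my $text = "chr,pos,snp_id\n9,21967751,242\n9,21968199,1234\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my @ids = read_ids( $fh, separator => ',', headers => 1, column => 2 );
close $fh;
print join( ',', @ids ), "\n";
```

With no options, the same helper splits on tabs and takes the first column, matching the default behaviour described above.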
These are the parameters that are passed to the NCBI E-utilities
CGI programs as part of the request URL. In TGen-EUtils, these
parameters are supplied as elements of the parameter hash for one
of the Request Methods.
The following table shows which input parameters are required or
optional for each of the TGen-EUtils Request methods. At first
glance, this table may seem confusing; however, there are 3 things to
keep in mind: (a) most queries only require use of a small subset of the
possible parameters; (b) NCBI has good documentation for each of the 8
E-utilities; (c) the TGen-EUtils package attempts to insulate the user
as much as possible from the intricacies of the E-utilities system.
Methods (columns, left to right):
einfo egquery esearch esummary efetch elink epost espell
Parameters
.-------------------------------.
db | R R R R R R R |
dbfrom | R |
id | r r r r |
term | R R O R |
field | O |
retstart | O O O |
retmax | O O O |
retmode | O O |
rettype | O O |
usehistory | O O O |
WebEnv | r r r r |
query_key | r r r r |
cmd | O |
seq_start | O |
seq_stop | O |
strand | O |
complexity | O |
reldate | O O |
mindate | O O |
maxdate | O O |
datetype | O O |
sort | O |
holding | O |
version | O |
tool | d d d d d d d d |
email | d d d d d d d d |
|-------------------------------|
from_hist * | O O O O |
file * | O |
headers * | O |
column * | O |
separator * | O |
batch * | O O O |
'-------------------------------'
Key: R = required
r = either id, or WebEnv and query_key must be specified
d = required but has default so can be left unset
O = optional
* = parameter is specific to TGen-EUtils, not NCBI
The following section details each of the parameters from the table
above in more detail.
db
Name of the target Entrez database. Must be one of the strings from
the first column of the Entrez Database table shown in Usage Example 3.
dbfrom
Name of the source Entrez database. Must be one of the strings from
the first column of the Entrez Database table shown in Usage Example 3.
This parameter is only used by elink() where known records from one
database (specified with dbfrom) are linked to records from another
database (specified with db).
id
In most cases, the ID value(s) specified by the id parameter
must be in the Primary ID format for the database specified by the
db parameter. For example, if db=pubmed then id should contain
PubMed IDs and if db=nucleotide then id should contain GI numbers.
efetch() is the exception to this rule and appears to be able to recognize
some IDs that are not Primary IDs, e.g. accession, accession.version.
The NCBI Entrez Programming Utilities webpage
(http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html)
has a list of Primary IDs and the efetchseq help page
(http://www.ncbi.nih.gov/entrez/query/static/efetchseq_help.html)
has a list of some of the valid ID types for efetch.
term
An Entrez query string. Examples are 'all[filter]' and
'"Homo sapiens"[orgn]'. This string will become part of the URL
passed to NCBI's E-utilities server so TGen-EUtils will automatically
replace spaces with '+'
symbols. The use of the '[xxx]' format to limit the way a search is
performed is discussed in more detail in the Field section of
Bio::TGen::EUtils::Response::EInfo.
field
This parameter can be used to modify term. For example,
setting field to 'mh' and term to 'asthma OR hay fever'
is equivalent to just setting term to 'asthma[mh] OR hay fever[mh]'.
'mh' stands for MeSH Heading and setting field to 'mh' is equivalent
to specifying [mh] on both the items in term: 'asthma' and 'hay
fever'.
usehistory
Should the Entrez History Server be used (y/n).
retstart
First item in results list to display (default=0).
retmax
Number of items to display in results list (default=20).
retmode
Output data format.
rettype
Output data record type.
WebEnv
Web Environment for accessing existing data sets. This is effectively a
cookie and for a pipeline of TGen-EUtils queries, the value returned by
NCBI as part of the previous response should be supplied as part of all
subsequent queries. TGen-EUtils handles this for the user as long as
the previous Request specified usehistory='y' and the Response
object from that Request is specified using the from_hist parameter.
For an example of the correct use of usehistory/from_hist in a
command pipeline, see Usage Example 2 in the DESCRIPTION section
above.
query_key
Used in conjunction with WebEnv to access lists from the History Server.
cmd
This parameter is unlikely to be directly used by a TGen-EUtils user.
Internally TGen-EUtils uses it for elink(), where the
string "&cmd=neighbor_history" in the URL allows the use of
usehistory pipelines which are not officially supported by elink().
seq_start
Retrieve sequence starting at this base position.
seq_stop
Retrieve sequence until this base position.
strand
Which DNA strand to retrieve (1=plus, 2=minus).
complexity
Determines what data object to retrieve.
reldate
Limits a search to being within reldate days of today so a value of 1
means within today, a value of 7 means within the past week, a value of
365 means within the past year etc.
mindate
maxdate
mindate and maxdate should be used together to denote a date
range. Each one contains a date of the form YYYY, YYYY/MM
or YYYY/MM/DD where YYYY is a four-digit year, MM is a two-digit
month, and DD is a two-digit day.
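A quick way to catch malformed dates before they are sent to NCBI is to check them against the three accepted shapes. The sketch below is an assumption about how such a check could look, not something TGen-EUtils is documented to do, and the helper name is_entrez_date() is hypothetical. It checks only the shape of the string, not whether the month or day is a real calendar value.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of TGen-EUtils: true if the string has the
# form YYYY, YYYY/MM or YYYY/MM/DD.
sub is_entrez_date {
    my ($date) = @_;
    return $date =~ m!\A\d{4}(?:/\d{2}(?:/\d{2})?)?\z! ? 1 : 0;
}

foreach my $date ( '1999', '1999/01', '1999/12/31', '1999-12-31', '99/1/1' ) {
    printf "%-12s %s\n", $date, is_entrez_date($date) ? 'ok' : 'rejected';
}
```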
datetype
This parameter modifies the reldate, mindate and maxdate
parameters by specifying what sort of date they are. For example,
the pubmed database is indexed on pdat, the date of
publication; edat, the date the publication was first available
through Entrez; mhda, the date the publication was indexed with
MeSH terms; cdat, the date of completion; and mdat, the date of last
modification. The names and types of date indexes available will vary
for every Entrez database and the only way to know what values are
possible is to use the einfo() method from Bio::TGen::EUtils
and look at the output of the fields() method for the Response.
More details about Fields and how to use them can be found
in the Bio::TGen::EUtils::Response::EInfo documentation.
tool
Name of script or module that is making calls to NCBI E-utilities
server. This parameter has a default value that must be set when
initializing the Bio::TGen::EUtils system. If you set this value for
a particular Request Method then it overrides the default.
email
E-mail of the user (or developer) who is using TGen-EUtils to make
calls to NCBI E-utilities server.
This parameter has a default value that must be set when
initializing the Bio::TGen::EUtils system. If you set this value for
a particular Request Method then it overrides the default.
batch
batch can be used as a parameter to esearch(), esummary(),
and efetch(). This is one of the most important and most problematic
features of the TGen-EUtils system and a more detailed discussion of
"batch mode" can be found in the documentation for
Bio::TGen::EUtils::Response::EFetch. The three retrieval
E-utilities (esearch(), esummary(), and particularly efetch()) can
be an enormous drain on the NCBI
servers if users retrieve huge numbers of records in a single request
so NCBI asks that you never retrieve more than 500 records in a single
request and provides the retstart and retmax parameters as a
mechanism for retrieving subranges of records from a large request.
TGen-EUtils has a "batch mode" that transparently uses retstart and
retmax to split large retrieval requests into multiple 500
record requests and concatenates the responses into a single Response
object. For this reason, requests that include the batch parameter
can take a long time if the underlying request is large.
The maximum number of records returned by a single request is defined
by setting the retmax parameter and TGen-EUtils defaults this value
to 500 for efetch() and esummary().
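The way a large retrieval can be split into retstart/retmax windows is sketched below. This only illustrates the paging arithmetic behind the idea of batch mode; it is not the TGen-EUtils implementation, and the helper name batch_windows() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of TGen-EUtils: list the retstart/retmax
# pairs needed to page through $total records, $retmax records at a time.
sub batch_windows {
    my ( $total, $retmax ) = @_;
    die "retmax must be a positive integer\n" unless $retmax > 0;
    my @windows;
    for ( my $retstart = 0 ; $retstart < $total ; $retstart += $retmax ) {
        push @windows, { retstart => $retstart, retmax => $retmax };
    }
    return @windows;
}

# 1234 matching records at 500 per request needs three requests.
foreach my $w ( batch_windows( 1234, 500 ) ) {
    print "retstart=$w->{retstart}&retmax=$w->{retmax}\n";
}
```

Each window corresponds to one request against the NCBI server, so with the delay between requests a large batch retrieval takes correspondingly longer.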
file
headers
column
separator
file, headers, column, and separator are all parameters
that are part of the TGen-EUtils system and are not passed to the NCBI
server. They are all used in the epost() method to retrieve a list
of IDs from a text file and upload it to the NCBI History Server.
John Pearson bioinfresearch@tgen.org
$Id: EUtils.pm,v 1.22 2006/03/14 07:12:16 jpearson Exp $
TGen-EUtils is copyright 2006 by The Translational Genomics Research
Institute. All rights reserved. This License is limited to, and you
may use the Software solely for, your own internal and non-commercial
use for academic and research purposes. Without limiting the foregoing,
you may not use the Software as part of, or in any way in connection
with the production, marketing, sale or support of any commercial
product or service. For commercial use, please contact
licensing@tgen.org. By installing this Software you are agreeing to
the terms of the LICENSE file distributed with this software.
In any work or product derived from the use of this Software, proper
attribution of the authors as the source of the software or data must
be made. The following URL should be cited:
http://bioinformatics.tgen.org/software/tgen-eutils/