This library provides an interface to harvest OAI-PMH metadata from any OAI 2.0 compliant endpoint.
Features:
- PSR-0 through PSR-2 compliant
- Composer-compatible
- Unit-tested
- Prefers Guzzle for HTTP transport layer, but can fall back to cURL
- Easy-to-use iterator that hides all the HTTP junk necessary to get paginated records
Install via Composer by including the following in your composer.json file:
{
    "require": {
        "caseyamcl/phpoaipmh": "~2.0",
        "guzzlehttp/guzzle": "~5.0"
    }
}
Or, drop the src folder into your application and use a PSR-0 autoloader to include the files.
Note: Guzzle v5.0 or newer is strongly recommended, but if you choose not to use Guzzle, the library will fall back to using the PHP cURL extension. If neither is installed, the library will throw an exception. Alternatively, you can use a different HTTP client library by passing your own implementation of the Phpoaipmh\HttpAdapter\HttpAdapterInterface to the Phpoaipmh\Client constructor.
There are several backwards-incompatible API improvements in version 2.0. See UPGRADE.md for information about how to upgrade your code to use the new version.
Set up a new endpoint client:
$client = new \Phpoaipmh\Client('http://some.service.com/oai');
$myEndpoint = new \Phpoaipmh\Endpoint($client);
Get basic information:
// Result will be a SimpleXMLElement object
$result = $myEndpoint->identify();
var_dump($result);
// Results will be iterator of SimpleXMLElement objects
$results = $myEndpoint->listMetadataFormats();
foreach($results as $item) {
    var_dump($item);
}
Get a list of records:
// Recs will be an iterator of SimpleXMLElement objects
$recs = $myEndpoint->listRecords('someMetaDataFormat');
// The iterator will continue retrieving items across multiple HTTP requests.
// You can keep running this loop through the *entire* collection you
// are harvesting. All OAI-PMH and HTTP pagination logic is hidden neatly
// behind the iterator API.
foreach($recs as $rec) {
    var_dump($rec);
}
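To make the "multiple HTTP requests" behaviour concrete, here is a conceptual sketch of resumption-token pagination written as a plain PHP generator. It is not the library's implementation: harvestAll() and $fetchPage are hypothetical names, and the hard-coded pages stand in for real HTTP responses.

```php
<?php
// Conceptual sketch only: NOT the library's actual code.
// $fetchPage is a hypothetical callable standing in for one HTTP request.
// It takes a resumption token (or null for the first page) and returns
// ['records' => [...], 'token' => string|null].
function harvestAll($fetchPage)
{
    $token = null;
    do {
        $page = $fetchPage($token);
        foreach ($page['records'] as $record) {
            yield $record;
        }
        // An absent or empty token marks the last page.
        $token = $page['token'];
    } while ($token !== null && $token !== '');
}

// Fake three-page endpoint used in place of real HTTP calls.
$pages = [
    ''   => ['records' => ['rec1', 'rec2'], 'token' => 'p2'],
    'p2' => ['records' => ['rec3', 'rec4'], 'token' => 'p3'],
    'p3' => ['records' => ['rec5'],         'token' => null],
];
$requests = 0;
$fetchPage = function ($token) use ($pages, &$requests) {
    $requests++;
    return $pages[(string) $token];
};

// The caller sees one flat sequence, even though three "requests" were made.
$all = [];
foreach (harvestAll($fetchPage) as $rec) {
    $all[] = $rec;
}
```

The caller's foreach loop never sees the tokens, which is the same experience the iterator returned by listRecords() is meant to give.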
Optionally, specify a date/time granularity level to use for date-based queries:
use Phpoaipmh\Client,
Phpoaipmh\Endpoint,
Phpoaipmh\Granularity;
$client = new Client('http://some.service.com/oai');
$myEndpoint = new Endpoint($client, Granularity::DATE_AND_TIME);
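For context, the OAI-PMH 2.0 specification defines two datestamp granularities: day level (YYYY-MM-DD) and seconds level (YYYY-MM-DDThh:mm:ssZ), both in UTC. The sketch below shows what each looks like for the same moment in time. formatOaiDate() is a hypothetical helper written for this illustration; it is not part of the library, and how the library formats dates internally is not shown here.

```php
<?php
// Illustration of the two OAI-PMH date granularities.
// formatOaiDate() is a hypothetical helper, not a library function.
function formatOaiDate(DateTime $date, $withTime)
{
    $utc = clone $date;
    $utc->setTimezone(new DateTimeZone('UTC')); // OAI-PMH datestamps are expressed in UTC
    return $utc->format($withTime ? 'Y-m-d\TH:i:s\Z' : 'Y-m-d');
}

$when = new DateTime('2015-03-01 14:30:00', new DateTimeZone('America/New_York'));

$dayLevel     = formatOaiDate($when, false); // day granularity
$secondsLevel = formatOaiDate($when, true);  // seconds granularity

echo $dayLevel, "\n";
echo $secondsLevel, "\n";
```

An endpoint that only supports day-level granularity may reject seconds-level datestamps, so it is worth checking the granularity reported by identify() before choosing.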
Depending on the verb you use, the library will return either a SimpleXMLElement or an iterator containing SimpleXMLElement objects.
- For identify and getRecord, a SimpleXMLElement object is returned
- For listMetadataFormats, listSets, listIdentifiers, and listRecords, a Phpoaipmh\ResponseIterator is returned
The Phpoaipmh\ResponseIterator object encapsulates the logic to iterate through paginated sets of records.
This library will throw different exceptions under different circumstances:
- HTTP request errors will generate a Phpoaipmh\Exception\HttpException
- Response body parsing issues (e.g. invalid XML) will generate a Phpoaipmh\Exception\MalformedResponseException
- OAI-PMH protocol errors (e.g. invalid verb or missing params) will generate a Phpoaipmh\Exception\OaipmhException
All exceptions extend the Phpoaipmh\Exception\BaseoaipmhException class.
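One way to handle these is to catch the specific exception types first and the shared base class last. The sketch below wraps a harvest in a hypothetical runHarvest() helper (the name and the returned messages are illustrative only); the exception class names are the ones listed above.

```php
<?php
// Sketch: catch the most specific exception types first, then the shared base class.
// runHarvest() is a hypothetical wrapper, not part of the library.
function runHarvest($harvest)
{
    try {
        $harvest();
        return 'ok';
    } catch (\Phpoaipmh\Exception\HttpException $e) {
        return 'HTTP error: ' . $e->getMessage();
    } catch (\Phpoaipmh\Exception\MalformedResponseException $e) {
        return 'Unparseable response: ' . $e->getMessage();
    } catch (\Phpoaipmh\Exception\OaipmhException $e) {
        return 'OAI-PMH protocol error: ' . $e->getMessage();
    } catch (\Phpoaipmh\Exception\BaseoaipmhException $e) {
        return 'Other library error: ' . $e->getMessage();
    }
}

// Usage (assumes $myEndpoint was created as shown earlier):
// echo runHarvest(function () use ($myEndpoint) {
//     foreach ($myEndpoint->listRecords('oai_dc') as $rec) { /* process $rec */ }
// });

// Stub call so the sketch runs without a live endpoint.
$status = runHarvest(function () { /* nothing to harvest in this stub */ });
```

Because records are fetched lazily, the foreach loop itself belongs inside the try block; an HTTP failure can surface on any page, not just the first call.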
You can customize the default request parameters (for example, request timeout) for both cURL and Guzzle clients by building the adapter objects manually.
To customize cURL parameters, pass them in as an array of key/value items to CurlAdapter::setCurlOpts():
use Phpoaipmh\Client,
    Phpoaipmh\Endpoint,
    Phpoaipmh\HttpAdapter\CurlAdapter;
$adapter = new CurlAdapter();
$adapter->setCurlOpts([CURLOPT_TIMEOUT => 120]);
$client = new Client('http://some.service.com/oai', $adapter);
$myEndpoint = new Endpoint($client);
If you're using Guzzle, you can set the parameters in a similar way:
use Phpoaipmh\Client,
    Phpoaipmh\Endpoint,
    Phpoaipmh\HttpAdapter\GuzzleAdapter;
$adapter = new GuzzleAdapter();
$adapter->getGuzzleClient()->setDefaultOption('timeout', 120);
$client = new Client('http://some.service.com/oai', $adapter);
$myEndpoint = new Endpoint($client);
Many OAI-PMH XML documents make use of XML Namespaces. For non-XML experts, it can be confusing to implement these in PHP. SitePoint has a brief but excellent overview of how to use Namespaces in SimpleXML.
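As a quick illustration, the sketch below reads Dublin Core fields out of a namespaced record in two ways: by switching namespaces with children(), and by registering a prefix for XPath. The XML is made-up sample data rather than output from a real endpoint; the two namespace URIs are the standard oai_dc and Dublin Core ones.

```php
<?php
// A trimmed-down OAI-PMH record with Dublin Core metadata (sample data only).
$xml = <<<XML
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header><identifier>oai:example.org:1</identifier></header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Example Title</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>
XML;

$rec = new SimpleXMLElement($xml);

// Option 1: walk the tree, switching namespaces with children().
$dc = $rec->metadata
    ->children('http://www.openarchives.org/OAI/2.0/oai_dc/')->dc
    ->children('http://purl.org/dc/elements/1.1/');
$title = (string) $dc->title;

// Option 2: register a prefix and query with XPath.
$rec->registerXPathNamespace('dc', 'http://purl.org/dc/elements/1.1/');
$creators = $rec->xpath('//dc:creator');
$creator = (string) $creators[0];

echo $title, "\n";
echo $creator, "\n";
```

Passing the namespace URI rather than the prefix to children() keeps the code working even if an endpoint chooses a different prefix for the same namespace.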
The Phpoaipmh\RecordIterator iterator contains some helper methods:
- getNumRequests() - Returns the number of HTTP requests made thus far
- getNumRetrieved() - Returns the number of individual records retrieved
- getTotalRecordsInCollection() - Returns the total number of records in the collection
  - Note - This number should be treated as an estimate at best. The number of records can change while the records are being retrieved, so it is not guaranteed to be accurate. Also, many OAI-PMH endpoints do not provide this information, in which case this method will return null.
- reset() - Resets the iterator, which will restart the record retrieval from scratch.
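These counters are handy for progress reporting in a long harvest. The sketch below is a hypothetical formatProgress() helper (not part of the library) that takes the three values as plain arguments and copes with a missing or unreliable total.

```php
<?php
// formatProgress() is a hypothetical helper. Its arguments correspond to the values
// returned by getNumRetrieved(), getTotalRecordsInCollection(), and getNumRequests().
function formatProgress($retrieved, $total, $requests)
{
    if ($total === null || $total <= 0) {
        // The endpoint did not report a collection size.
        return sprintf('%d records (%d requests)', $retrieved, $requests);
    }
    // The total is only an estimate, so cap the percentage at 100.
    $pct = min(100, (int) floor($retrieved / $total * 100));
    return sprintf('%d of ~%d records, %d%% (%d requests)', $retrieved, $total, $pct, $requests);
}

$withoutTotal = formatProgress(250, null, 3);
$withTotal    = formatProgress(250, 1000, 3);

echo $withoutTotal, "\n";
echo $withTotal, "\n";
```

Inside a harvest loop this could be called as formatProgress($recs->getNumRetrieved(), $recs->getTotalRecordsInCollection(), $recs->getNumRequests()), assuming $recs is the iterator returned by listRecords().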
Some OAI-PMH endpoints employ rate-limiting so that you can only make a certain number of requests in a given time period. These endpoints will return a 503 HTTP status code with a Retry-After header if your code generates too many HTTP requests too quickly.
If you have installed Guzzle, then you can use the Retry-Subscriber to automatically adhere to the OAI-PMH endpoint rate-limiting rules.
First, make sure you include the retry-subscriber as a dependency in your composer.json:
"require": {
    /* ... */
    "guzzlehttp/retry-subscriber": "~2.0"
}
Then, when loading the Phpoaipmh libraries, instantiate the Guzzle adapter manually, and add the subscriber as indicated in the code below:
// Create a Retry Guzzle Subscriber
$retrySubscriber = new \GuzzleHttp\Subscriber\Retry\RetrySubscriber([
    'delay' => function($numRetries, \GuzzleHttp\Event\AbstractTransferEvent $event) {
        $waitSecs = $event->getResponse()->getHeader('Retry-After') ?: '5';
        return ($waitSecs * 1000) + 1000; // wait one second longer than the server said to
    },
    'filter' => \GuzzleHttp\Subscriber\Retry\RetrySubscriber::createStatusFilter(),
]);
// Manually create a Guzzle HTTP adapter
$guzzleAdapter = new \Phpoaipmh\HttpAdapter\GuzzleAdapter();
$guzzleAdapter->getGuzzleClient()->getEmitter()->attach($retrySubscriber);
$client = new \Phpoaipmh\Client('http://some.service.com/oai', $guzzleAdapter);
This will create a client that adheres to the rate-limiting rules enforced by the OAI-PMH record provider.
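The delay callback above assumes Retry-After carries a number of seconds. HTTP also allows the header to carry a date, so a more defensive delay calculation might look like the sketch below. retryDelayMs() is a hypothetical helper, not part of Phpoaipmh or Guzzle, and its $now parameter exists only to make the function testable.

```php
<?php
// retryDelayMs() is a hypothetical helper, not part of Phpoaipmh or Guzzle.
// $retryAfter is the raw header value (or null/'' if absent).
// $now is a Unix timestamp; it defaults to the current time.
function retryDelayMs($retryAfter, $now = null, $defaultSecs = 5)
{
    $now = ($now === null) ? time() : $now;
    $retryAfter = trim((string) $retryAfter);

    if ($retryAfter === '') {
        $secs = $defaultSecs;                  // header missing
    } elseif (preg_match('/^\d+$/', $retryAfter)) {
        $secs = (int) $retryAfter;             // delta-seconds form
    } else {
        $ts = strtotime($retryAfter);          // HTTP-date form
        $secs = ($ts === false) ? $defaultSecs : max(0, $ts - $now);
    }

    return ($secs * 1000) + 1000; // one second longer than the server asked for
}

$fromSeconds = retryDelayMs('30');
$fromMissing = retryDelayMs(null);
$fromDate    = retryDelayMs('Wed, 21 Oct 2015 07:28:00 GMT', gmmktime(7, 27, 0, 10, 21, 2015));
```

Under that assumption, the 'delay' callback could simply return retryDelayMs($event->getResponse()->getHeader('Retry-After')).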
If you wish to send arbitrary HTTP query parameters with your requests, you can send them via the \Phpoaipmh\Client class:
$client = new \Phpoaipmh\Client('http://some.service.com/oai');
$client->request('Identify', ['some' => 'extra-param']);
Alternatively, if you wish to send arbitrary parameters while taking advantage of the convenience of the \Phpoaipmh\Endpoint class, you can use the Guzzle event system:
// Create a function or class to add parameters to a request
$addParamsListener = function(\GuzzleHttp\Event\BeforeEvent $event) {
    $req = $event->getRequest();
    $req->getQuery()->add('api_key', 'xyz123');
    // You could do other things to the request here, too, like adding a header.
    $req->addHeader('Some-Header', 'some-header-value');
};
// Manually create a Guzzle HTTP adapter
$guzzleAdapter = new \Phpoaipmh\HttpAdapter\GuzzleAdapter();
$guzzleAdapter->getGuzzleClient()->getEmitter()->on('before', $addParamsListener);
$client = new \Phpoaipmh\Client('http://some.service.com/oai', $guzzleAdapter);
Harvesting data from an OAI-PMH endpoint can be a time-consuming task, especially when there are lots of records. Typically, this kind of task is done via a CLI script or background process that can run for a long time. It is not normally a good idea to make it part of a web request.
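A harvest script can enforce this with a small guard at the top. The sketch below is one possible approach; isCli() is a hypothetical helper name, and the harvesting code itself is left as a placeholder comment.

```php
<?php
// Sketch of a guard for the top of a harvest script.
// isCli() is a hypothetical helper name.
function isCli()
{
    return PHP_SAPI === 'cli';
}

if (!isCli()) {
    // Refuse to harvest inside a web request.
    exit("Run this script from the command line.\n");
}

// The CLI normally has no execution time limit; this makes the intent explicit.
set_time_limit(0);

// ... create the client and endpoint, then loop over listRecords() here ...
```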
MIT License; see LICENSE file for details