Framework-agnostic API scraper to load items from any paginated JSON API into a Laravel lazy collection via async HTTP requests.
use Illuminate\Support\LazyCollection;
LazyCollection::fromJsonPages($source)
    ->totalPages('pagination.total_pages')
    ->async(requests: 3)
    ->throttle(requests: 100, perMinutes: 1)
    ->collect('data.*');
[!TIP]
Need to read large JSON with no pagination in a memory-efficient way? Consider using 🐼 Lazy JSON or 🧩 JSON Parser instead.
Via Composer:
composer require cerbero/lazy-json-pages
Depending on our coding style, we can instantiate Lazy JSON Pages in 4 different ways:
use Cerbero\LazyJsonPages\LazyJsonPages;
use Illuminate\Support\LazyCollection;
use function Cerbero\LazyJsonPages\lazyJsonPages;
// lazy collection macro
LazyCollection::fromJsonPages($source);
// classic instantiation
new LazyJsonPages($source);
// static method
LazyJsonPages::from($source);
// namespaced helper
lazyJsonPages($source);
The variable `$source` in our examples represents any source that points to a paginated JSON API. Once we define the source, we can then chain methods to define how the API is paginated:
$lazyCollection = LazyJsonPages::from($source)
    ->totalItems('pagination.total_items')
    ->offset()
    ->collect('results.*');
When calling `collect()`, we indicate that the pagination structure is defined and that we are ready to collect the paginated items within a Laravel lazy collection, where we can loop through the items one by one and apply filters and transformations in a memory-efficient way.
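A source is any means that can point to a paginated JSON API. A number of sources are supported by default:
- endpoint URIs, e.g. `https://example.com/api/v1/users` or any instance of `Psr\Http\Message\UriInterface`
- PSR-7 requests, i.e. any instance of `Psr\Http\Message\RequestInterface`
- Laravel client requests, i.e. any instance of `Illuminate\Http\Client\Request`
- Laravel client responses, i.e. any instance of `Illuminate\Http\Client\Response`
- Laravel requests, i.e. any instance of `Illuminate\Http\Request`
- Symfony requests, i.e. any instance of `Symfony\Component\HttpFoundation\Request`
- user-defined sources, i.e. any instance of `Cerbero\LazyJsonPages\Sources\Source`
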
Here are some examples of sources:
// a simple URI string
$source = 'https://example.com/api/v1/users';
// any PSR-7 compatible request is supported, including Guzzle requests
$source = new GuzzleHttp\Psr7\Request('GET', 'https://example.com/api/v1/users');
// while being framework-agnostic, Lazy JSON Pages integrates well with Laravel
$source = Http::withToken($bearer)->get('https://example.com/api/v1/users');
If none of the above sources satisfies our use case, we can implement our own source.
To implement a custom source, we need to extend `Source` and implement 2 methods:
use Cerbero\LazyJsonPages\Sources\Source;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;
class CustomSource extends Source
{
    public function request(): RequestInterface
    {
        // return a PSR-7 request
    }

    public function response(): ResponseInterface
    {
        // return a PSR-7 response
    }
}
The parent class `Source` gives us access to 2 properties:
- `$source`: the custom source for our use case
- `$client`: the Guzzle HTTP client

The methods to implement turn our custom source into a PSR-7 request and a PSR-7 response. Please refer to the already existing sources to see some implementations.
Once the custom source is implemented, we can instruct Lazy JSON Pages to use it:
LazyJsonPages::from(new CustomSource($source));
If you find yourself implementing the same custom source in different projects, feel free to send a PR and we will consider supporting your custom source by default. Thank you in advance for any contribution!
After defining the source, we need to let Lazy JSON Pages know what the paginated API looks like.
If the API uses a query parameter different from `page` to specify the current page (for example `?current_page=1`), we can chain the method `pageName()`:
LazyJsonPages::from($source)->pageName('current_page');
Otherwise, if the number of the current page is present in the URI path (for example `https://example.com/users/page/1`), we can chain the method `pageInPath()`:
LazyJsonPages::from($source)->pageInPath();
By default, the last integer in the URI path is considered the page number. However, we can customize the regular expression used to capture the page number if need be:
LazyJsonPages::from($source)->pageInPath('~/page/(\d+)$~');
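To see what such a pattern captures, the snippet below applies the same regular expression with plain `preg_match()`. It only illustrates the regex and is not the library's internal logic; the helper name `pageFromPath()` is hypothetical.

```php
<?php

// Hypothetical helper (not part of Lazy JSON Pages): extract the page
// number captured by a regular expression from the path of a URI.
function pageFromPath(string $uri, string $pattern): ?int
{
    $path = (string) parse_url($uri, PHP_URL_PATH);

    // the first capture group is expected to hold the page number
    return preg_match($pattern, $path, $matches) === 1 ? (int) $matches[1] : null;
}

var_dump(pageFromPath('https://example.com/users/page/1', '~/page/(\d+)$~')); // int(1)
var_dump(pageFromPath('https://example.com/users', '~/page/(\d+)$~'));        // NULL
```

A pattern without a capture group, or one that does not match the path, yields no page number in this sketch.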
Some API paginations may start with a page different from `1`. If that's the case, we can define the first page by chaining the method `firstPage()`:
LazyJsonPages::from($source)->firstPage(0);
Now that we have customized the basic structure of the API, we can describe how items are paginated depending on whether the pagination is length-aware or cursor-based.
The term "length-aware" indicates any pagination containing at least one of the following pieces of length information:
- the total number of pages
- the total number of items
- the number of the last page

Lazy JSON Pages only needs one of these details to work properly:
LazyJsonPages::from($source)->totalPages('pagination.total_pages');
LazyJsonPages::from($source)->totalItems('pagination.total_items');
LazyJsonPages::from($source)->lastPage('pagination.last_page');
If the length information is nested in the JSON body, we can use dot-notation to indicate the level of nesting. For example, `pagination.total_pages` means that the total number of pages sits in the object `pagination`, under the key `total_pages`.
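As a rough illustration of how a dot-notation path maps onto a decoded JSON body, here is a minimal sketch in plain PHP. The helper `dotGet()` is hypothetical and is not how Lazy JSON Pages resolves keys internally; the JSON body is made up for the example.

```php
<?php

// Hypothetical helper (not part of Lazy JSON Pages): walk a decoded JSON
// array one key at a time, following a dot-notation path.
function dotGet(array $data, string $path): mixed
{
    $value = $data;

    foreach (explode('.', $path) as $key) {
        if (!is_array($value) || !array_key_exists($key, $value)) {
            return null; // the path does not exist in the body
        }

        $value = $value[$key];
    }

    return $value;
}

// made-up response body for illustration
$body = json_decode('{"pagination": {"total_pages": 10}, "data": []}', true);

var_dump(dotGet($body, 'pagination.total_pages')); // int(10)
```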
Otherwise, if the length information is displayed in the headers, we can use the same methods to gather it by simply defining the name of the header:
LazyJsonPages::from($source)->totalPages('X-Total-Pages');
LazyJsonPages::from($source)->totalItems('X-Total-Items');
LazyJsonPages::from($source)->lastPage('X-Last-Page');
APIs can expose their length information in the form of numbers (`total_pages: 10`) or URIs (`last_page: "https://example.com?page=10"`); Lazy JSON Pages supports both.
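Conceptually, both forms boil down to the same page number. The sketch below shows one way to normalize them in plain PHP, assuming the page lives in a `page` query parameter; `toPage()` is a hypothetical helper, not the library's own code.

```php
<?php

// Hypothetical helper (not part of Lazy JSON Pages): turn a length value
// that is either a number or a URI into a page number.
function toPage(int|string $value, string $pageName = 'page'): ?int
{
    if (is_numeric($value)) {
        return (int) $value;
    }

    // otherwise treat the value as a URI and read the page from its query string
    parse_str((string) parse_url($value, PHP_URL_QUERY), $query);

    $page = $query[$pageName] ?? null;

    return is_numeric($page) ? (int) $page : null;
}

var_dump(toPage(10));                            // int(10)
var_dump(toPage('https://example.com?page=10')); // int(10)
```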
If the pagination works with an offset, we can configure it with the `offset()` method. The value of the offset will be calculated based on the number of items present on the first page:
// indicate that the offset is defined by the `offset` query parameter, e.g. ?offset=50
LazyJsonPages::from($source)
    ->totalItems('pagination.total_items')
    ->offset();
// indicate that the offset is defined by the `skip` query parameter, e.g. ?skip=50
LazyJsonPages::from($source)
    ->totalItems('pagination.total_items')
    ->offset('skip');
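To make the offset calculation more concrete, the following sketch lists the offsets that would remain to be requested after the first page. It assumes that offsets start at 0 and that every page holds as many items as the first one; `remainingOffsets()` is a hypothetical helper and the library's actual logic may differ.

```php
<?php

// Hypothetical helper (not part of Lazy JSON Pages): list the offsets
// still to request once the first page has been fetched.
function remainingOffsets(int $totalItems, int $itemsOnFirstPage): array
{
    if ($itemsOnFirstPage < 1) {
        return []; // an empty first page gives us no page size to step by
    }

    $offsets = [];

    // the first page covered offset 0, so the next one starts at the page size
    for ($offset = $itemsOnFirstPage; $offset < $totalItems; $offset += $itemsOnFirstPage) {
        $offsets[] = $offset;
    }

    return $offsets;
}

print_r(remainingOffsets(totalItems: 120, itemsOnFirstPage: 50)); // offsets 50 and 100
```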
Not all paginations are length-aware; some are built so that each page has a cursor pointing to the next page.
We can tackle this kind of pagination by indicating the key or the header holding the cursor:
LazyJsonPages::from($source)->cursor('pagination.cursor');
LazyJsonPages::from($source)->cursor('X-Cursor');
The cursor may be a number, a string or a URI: Lazy JSON Pages supports them all.
Some paginated API responses include a header called `Link`. An example is GitHub: if we inspect the response headers, we can see the `Link` header looking like this:
<https://api.github.com/repositories/1296269/issues?state=open&page=2>; rel="next",
<https://api.github.com/repositories/1296269/issues?state=open&page=43>; rel="last"
To lazy-load items from a Link header pagination, we can chain the method `linkHeader()`:
LazyJsonPages::from($source)->linkHeader();
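For readers unfamiliar with the format, here is a minimal sketch of how a `Link` header like the one above can be split into its relations. It only handles the simple `<uri>; rel="name"` form shown in the GitHub example, and `parseLinkHeader()` is a hypothetical helper rather than the parser used by Lazy JSON Pages.

```php
<?php

// Hypothetical helper (not part of Lazy JSON Pages): map each rel value
// of a Link header to its URI.
function parseLinkHeader(string $header): array
{
    preg_match_all('~<([^>]+)>\s*;\s*rel="([^"]+)"~', $header, $matches, PREG_SET_ORDER);

    $links = [];

    // each match holds the full match, the URI and the rel value
    foreach ($matches as [, $uri, $rel]) {
        $links[$rel] = $uri;
    }

    return $links;
}

$header = '<https://api.github.com/repositories/1296269/issues?state=open&page=2>; rel="next", '
    . '<https://api.github.com/repositories/1296269/issues?state=open&page=43>; rel="last"';

echo parseLinkHeader($header)['next'], PHP_EOL; // the URI ending with page=2
```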
Lazy JSON Pages provides several methods to extract items from the most popular pagination mechanisms. However, if we need a custom solution, we can implement our own pagination.
To implement a custom pagination, we need to extend `Pagination` and implement 1 method:
use Cerbero\LazyJsonPages\Paginations\Pagination;
use Traversable;
class CustomPagination extends Pagination
{
    public function getIterator(): Traversable
    {
        // return a Traversable yielding the paginated items
    }
}
The parent class `Pagination` gives us access to 3 properties:
- `$source`: the source pointing to the paginated JSON API
- `$client`: the Guzzle HTTP client
- `$config`: the configuration that we generated by chaining methods like `totalPages()`

The method `getIterator()` defines the logic to extract paginated items in a memory-efficient way. Please refer to the already existing paginations to see some implementations.
Once the custom pagination is implemented, we can instruct Lazy JSON Pages to use it:
LazyJsonPages::from($source)->pagination(CustomPagination::class);
If you find yourself implementing the same custom pagination in different projects, feel free to send a PR and we will consider supporting your custom pagination by default. Thank you in advance for any contribution!
Paginated APIs differ from each other, so Lazy JSON Pages lets us tweak our HTTP requests specifically for our use case.
By default, HTTP requests are sent synchronously. If we want to send more than one request without waiting for the response, we can call the `async()` method and set the number of concurrent requests:
LazyJsonPages::from($source)->async(requests: 5);
[!NOTE]
Please note that asynchronous requests improve speed at the expense of memory, as more responses are going to be loaded at once.
Several APIs set rate limits to reduce the number of allowed requests for a period of time. We can instruct Lazy JSON Pages to respect such limits by throttling our requests:
// we send a maximum of 3 requests per second, 60 per minute and 3,000 per hour
LazyJsonPages::from($source)
    ->throttle(requests: 3, perSeconds: 1)
    ->throttle(requests: 60, perMinutes: 1)
    ->throttle(requests: 3000, perHours: 1);
Internally, Lazy JSON Pages uses Guzzle as its HTTP client. We can customize the client behavior by adding as many middleware as we need:
LazyJsonPages::from($source)
    ->middleware('log_requests', $logRequests)
    ->middleware('cache_responses', $cacheResponses);
If we need a middleware to be added every time we invoke Lazy JSON Pages, we can add a global middleware:
LazyJsonPages::globalMiddleware('fire_events', $fireEvents);
Sometimes writing Guzzle middleware might be cumbersome. Alternatively, Lazy JSON Pages provides convenient methods to fire callbacks when sending a request or receiving a response:
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;
LazyJsonPages::from($source)
    ->onRequest(fn(RequestInterface $request) => ...)
    ->onResponse(fn(ResponseInterface $response, RequestInterface $request) => ...);
We can also tweak the number of allowed seconds before an API connection times out or the allowed duration of the entire HTTP request (by default they are both set to 5 seconds):
LazyJsonPages::from($source)
    ->connectionTimeout(7)
    ->requestTimeout(10);
If the 3rd party API is faulty or error-prone, we can indicate how many times we want to retry failing HTTP requests and the backoff strategy to compute the milliseconds to wait before retrying (by default failing requests are repeated 3 times after an exponential backoff of 100, 400 and 900 milliseconds):
// repeat failing requests 5 times after a backoff of 1, 2, 3, 4 and 5 seconds
LazyJsonPages::from($source)
    ->attempts(5)
    ->backoff(fn(int $attempt) => $attempt * 1000);
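To compare the two strategies, the sketch below computes the delays for each attempt. The formula used for the default delays (the attempt number squared, times 100) is an assumption inferred from the quoted values of 100, 400 and 900 milliseconds, not taken from the library's source; the variable names are hypothetical.

```php
<?php

// Assumed formula reproducing the default delays quoted above:
// attempt squared, times 100 milliseconds.
$defaultBackoff = fn(int $attempt): int => $attempt ** 2 * 100;

// The custom strategy from the example: one more second per attempt.
$linearBackoff = fn(int $attempt): int => $attempt * 1000;

print_r(array_map($defaultBackoff, [1, 2, 3]));      // 100, 400 and 900 ms
print_r(array_map($linearBackoff, [1, 2, 3, 4, 5])); // 1000 up to 5000 ms
```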
If something goes wrong during the scraping process, we can intercept the error and run custom logic to handle it:
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;
LazyJsonPages::from($source)
    ->onError(fn(Throwable $e, RequestInterface $request, ?ResponseInterface $response) => ...);
Any exception thrown by this package extends the `LazyJsonPagesException` class. This makes it easy to handle all exceptions in a single catch block:
use Cerbero\LazyJsonPages\Exceptions\LazyJsonPagesException;
try {
    LazyJsonPages::from($source)->linkHeader()->collect()->each(...);
} catch (LazyJsonPagesException $e) {
    // handle any exception thrown by Lazy JSON Pages
}
For reference, here is a comprehensive table of all the exceptions thrown by this package:

| `Cerbero\LazyJsonPages\Exceptions\` | thrown when |
|---|---|
| `InvalidKeyException` | a JSON key does not contain a valid value |
| `InvalidPageInPathException` | a page cannot be found in the URI path |
| `InvalidPaginationException` | a pagination implementation is not valid |
| `OutOfAttemptsException` | an HTTP request failed too many times |
| `RequestNotSentException` | a JSON source didn't send any HTTP request |
| `UnsupportedPaginationException` | a pagination is not supported |
| `UnsupportedSourceException` | a JSON source is not supported |

If used in a Laravel project, Lazy JSON Pages automatically fires the following Laravel HTTP client events:
- `Illuminate\Http\Client\Events\RequestSending`
- `Illuminate\Http\Client\Events\ResponseReceived`
- `Illuminate\Http\Client\Events\ConnectionFailed`

This is especially handy for debugging tools like Laravel Telescope or Spatie Ray, or for triggering the related event listeners.
Please see CHANGELOG for more information on what has changed recently.
composer test
Please see CONTRIBUTING and CODE_OF_CONDUCT for details.
If you discover any security related issues, please email [email protected] instead of using the issue tracker.
The MIT License (MIT). Please see License File for more information.