📜 Lazy JSON Pages

use Illuminate\Support\LazyCollection;

LazyCollection::fromJsonPages($source)
    ->totalPages('pagination.total_pages')
    ->async(requests: 3)
    ->throttle(requests: 100, perMinutes: 1)
    ->collect('data.*');

Framework-agnostic API scraper to load items from any paginated JSON API into a Laravel lazy collection via async HTTP requests.

[!TIP]
Need to read large JSON with no pagination in a memory-efficient way?

Consider using 🐼 Lazy JSON or 🧩 JSON Parser instead.

📦 Install

Via Composer:

composer require cerbero/lazy-json-pages

🔮 Usage

👣 Basics

Depending on our coding style, we can instantiate Lazy JSON Pages in 4 different ways:

use Cerbero\LazyJsonPages\LazyJsonPages;
use Illuminate\Support\LazyCollection;

use function Cerbero\LazyJsonPages\lazyJsonPages;

// lazy collection macro
LazyCollection::fromJsonPages($source);

// classic instantiation
new LazyJsonPages($source);

// static method
LazyJsonPages::from($source);

// namespaced helper
lazyJsonPages($source);

The variable $source in our examples represents any source that points to a paginated JSON API. Once we define the source, we can then chain methods to define how the API is paginated:

$lazyCollection = LazyJsonPages::from($source)
    ->totalItems('pagination.total_items')
    ->offset()
    ->collect('results.*');

When calling collect(), we indicate that the pagination structure is defined and that we are ready to collect the paginated items within a Laravel lazy collection, where we can loop through the items one by one and apply filters and transformations in a memory-efficient way.
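To see why collecting items lazily keeps memory usage low, here is a minimal sketch in plain PHP. It is not the package's implementation: `fetchPage()` and its in-memory pages are hypothetical stand-ins for real HTTP calls, and the generator simply yields items one page at a time.

```php
<?php

// Hypothetical stand-in for an HTTP call: returns the decoded JSON body of one page.
function fetchPage(int $page): array
{
    $pages = [
        1 => ['results' => [['id' => 1], ['id' => 2]], 'pagination' => ['total_pages' => 2]],
        2 => ['results' => [['id' => 3]], 'pagination' => ['total_pages' => 2]],
    ];

    return $pages[$page];
}

// Yield items page by page, so only one page is held in memory at a time.
function lazyItems(): Generator
{
    $page = 1;

    do {
        $body = fetchPage($page);

        foreach ($body['results'] as $item) {
            yield $item;
        }
    } while ($page++ < $body['pagination']['total_pages']);
}

// Items are produced on demand while we loop.
foreach (lazyItems() as $item) {
    echo $item['id'], PHP_EOL;
}
```

A lazy collection wraps the same idea: nothing is fetched until we start iterating, and pages already consumed can be freed.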

💧 Sources

A source is any means that can point to a paginated JSON API. Several sources are supported by default:

  • endpoint URIs, e.g. https://example.com/api/v1/users or any instance of Psr\Http\Message\UriInterface
  • PSR-7 requests, i.e. any instance of Psr\Http\Message\RequestInterface
  • Laravel HTTP client requests, i.e. any instance of Illuminate\Http\Client\Request
  • Laravel HTTP client responses, i.e. any instance of Illuminate\Http\Client\Response
  • Laravel HTTP requests, i.e. any instance of Illuminate\Http\Request
  • Symfony requests, i.e. any instance of Symfony\Component\HttpFoundation\Request
  • user-defined sources, i.e. any instance of Cerbero\LazyJsonPages\Sources\Source

Here are some examples of sources:

// a simple URI string
$source = 'https://example.com/api/v1/users';

// any PSR-7 compatible request is supported, including Guzzle requests
$source = new GuzzleHttp\Psr7\Request('GET', 'https://example.com/api/v1/users');

// while being framework-agnostic, Lazy JSON Pages integrates well with Laravel
$source = Http::withToken($bearer)->get('https://example.com/api/v1/users');

If none of the above sources satisfies our use case, we can implement our own source.

To implement a custom source, we need to extend Source and implement 2 methods:

use Cerbero\LazyJsonPages\Sources\Source;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

class CustomSource extends Source
{
    public function request(): RequestInterface
    {
        // return a PSR-7 request
    }

    public function response(): ResponseInterface
    {
        // return a PSR-7 response
    }
}

The parent class Source gives us access to 2 properties:

  • $source: the custom source for our use case
  • $client: the Guzzle HTTP client

The methods to implement turn our custom source into a PSR-7 request and a PSR-7 response. Please refer to the already existing sources to see some implementations.

Once the custom source is implemented, we can instruct Lazy JSON Pages to use it:

LazyJsonPages::from(new CustomSource($source));

If you find yourself implementing the same custom source in different projects, feel free to send a PR and we will consider supporting your custom source by default. Thank you in advance for any contribution!

🏛️ Pagination structure

After defining the source, we need to let Lazy JSON Pages know what the paginated API looks like.

If the API uses a query parameter different from page to specify the current page - for example ?current_page=1 - we can chain the method pageName():

LazyJsonPages::from($source)->pageName('current_page');

Otherwise, if the number of the current page is present in the URI path - for example https://example.com/users/page/1 - we can chain the method pageInPath():

LazyJsonPages::from($source)->pageInPath();

By default, the last integer in the URI path is considered the page number. However, we can customize the regular expression used to capture the page number, if need be:

LazyJsonPages::from($source)->pageInPath('~/page/(\d+)$~');
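As an illustration of what such a pattern captures, the sketch below uses plain PHP to find the page number in a URI and swap it for another page. The helper `replacePageInPath()` is hypothetical and not part of the package; it assumes the pattern has exactly one capturing group.

```php
<?php

// Return the URI for another page by replacing the page number captured by the pattern.
// Assumes the pattern has exactly one capturing group holding the page number.
function replacePageInPath(string $uri, string $pattern, int $page): string
{
    if (!preg_match($pattern, $uri, $matches, PREG_OFFSET_CAPTURE)) {
        throw new InvalidArgumentException("No page number found in [{$uri}]");
    }

    // with PREG_OFFSET_CAPTURE each match is a pair: [matched text, byte offset]
    [$value, $offset] = $matches[1];

    return substr_replace($uri, (string) $page, $offset, strlen($value));
}

// prints https://example.com/users/page/2
echo replacePageInPath('https://example.com/users/page/1', '~/page/(\d+)$~', 2), PHP_EOL;
```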

Some API paginations may start with a page different from 1. If that’s the case, we can define the first page by chaining the method firstPage():

LazyJsonPages::from($source)->firstPage(0);

Now that we have customized the basic structure of the API, we can describe how items are paginated depending on whether the pagination is length-aware or cursor-based.

📏 Length-aware paginations

The term “length-aware” indicates any pagination containing at least one of the following pieces of length information:

  • the total number of pages
  • the total number of items
  • the number of the last page

Lazy JSON Pages only needs one of these details to work properly:

LazyJsonPages::from($source)->totalPages('pagination.total_pages');

LazyJsonPages::from($source)->totalItems('pagination.total_items');

LazyJsonPages::from($source)->lastPage('pagination.last_page');

If the length information is nested in the JSON body, we can use dot-notation to indicate the level of nesting. For example, pagination.total_pages means that the total number of pages sits in the object pagination, under the key total_pages.
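For readers unfamiliar with dot-notation, the following plain-PHP sketch shows how a path like `pagination.total_pages` can be resolved against a decoded JSON body. The helper `dotGet()` is hypothetical and only illustrates the idea; it is not the package's own lookup.

```php
<?php

// Resolve a dot-notation path such as "pagination.total_pages" in a decoded JSON array.
// Returns null when any segment of the path is missing.
function dotGet(array $data, string $path): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return null;
        }

        $current = $current[$segment];
    }

    return $current;
}

$body = json_decode('{"pagination": {"total_pages": 10}, "data": []}', true);

// prints 10
echo dotGet($body, 'pagination.total_pages'), PHP_EOL;
```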

Otherwise, if the length information is displayed in the headers, we can use the same methods to gather it by simply defining the name of the header:

LazyJsonPages::from($source)->totalPages('X-Total-Pages');

LazyJsonPages::from($source)->totalItems('X-Total-Items');

LazyJsonPages::from($source)->lastPage('X-Last-Page');

APIs can expose their length information in the form of numbers (total_pages: 10) or URIs (last_page: "https://example.com?page=10"). Lazy JSON Pages supports both.
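A rough sketch of how both forms can be reduced to a page number in plain PHP follows. The helper `toPageNumber()` is hypothetical, and it assumes the page travels in a query parameter named `page` unless told otherwise.

```php
<?php

// Turn a length value into a page number, whether it is numeric or a URI
// carrying the page in a query parameter (assumed here to be named "page").
function toPageNumber(int|string $value, string $pageName = 'page'): int
{
    if (is_numeric($value)) {
        return (int) $value;
    }

    // extract the query string and decode it into an associative array
    parse_str((string) parse_url($value, PHP_URL_QUERY), $query);

    if (!isset($query[$pageName]) || !is_numeric($query[$pageName])) {
        throw new InvalidArgumentException("No page number found in [{$value}]");
    }

    return (int) $query[$pageName];
}

echo toPageNumber(10), PHP_EOL;
echo toPageNumber('https://example.com?page=10'), PHP_EOL;
```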

If the pagination works with an offset, we can configure it with the offset() method. The value of the offset will be calculated based on the number of items present on the first page:

// indicate that the offset is defined by the `offset` query parameter, e.g. ?offset=50
LazyJsonPages::from($source)
    ->totalItems('pagination.total_items')
    ->offset();

// indicate that the offset is defined by the `skip` query parameter, e.g. ?skip=50
LazyJsonPages::from($source)
    ->totalItems('pagination.total_items')
    ->offset('skip');
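The arithmetic behind an offset pagination can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `remainingOffsets()` is a hypothetical helper that takes the total number of items and the number of items found on the first page, and lists the offsets still to be requested.

```php
<?php

// List the offsets of the pages following the first one, given the total number
// of items and the number of items found on the first page.
function remainingOffsets(int $totalItems, int $perPage): array
{
    $offsets = [];

    // guard against an empty first page, which would never advance the offset
    if ($perPage < 1) {
        return $offsets;
    }

    for ($offset = $perPage; $offset < $totalItems; $offset += $perPage) {
        $offsets[] = $offset;
    }

    return $offsets;
}

// 230 items, 50 per page: the next requests use offsets 50, 100, 150 and 200
print_r(remainingOffsets(totalItems: 230, perPage: 50));
```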

↪️ Cursor-aware paginations

Not all paginations are length-aware. Some may be built in a way where each page has a cursor pointing to the next page.

We can tackle this kind of pagination by indicating the key or the header holding the cursor:

LazyJsonPages::from($source)->cursor('pagination.cursor');

LazyJsonPages::from($source)->cursor('X-Cursor');

The cursor may be a number, a string or a URI: Lazy JSON Pages supports them all.
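Conceptually, a cursor pagination is a loop that keeps following the cursor until a page stops providing one. The plain-PHP sketch below illustrates this; `fetchByCursor()` and its in-memory pages are hypothetical stand-ins for HTTP calls, and the key `pagination.cursor` mirrors the example above.

```php
<?php

// Hypothetical stand-in for an HTTP call: returns the page identified by the cursor.
function fetchByCursor(?string $cursor): array
{
    $pages = [
        '' => ['results' => ['a', 'b'], 'pagination' => ['cursor' => 'abc']],
        'abc' => ['results' => ['c'], 'pagination' => ['cursor' => null]],
    ];

    return $pages[$cursor ?? ''];
}

// Keep following the cursor until a page no longer provides one.
function itemsByCursor(): Generator
{
    $cursor = null;

    do {
        $body = fetchByCursor($cursor);

        foreach ($body['results'] as $item) {
            yield $item;
        }

        $cursor = $body['pagination']['cursor'];
    } while ($cursor !== null);
}

foreach (itemsByCursor() as $item) {
    echo $item, PHP_EOL;
}
```

Unlike length-aware paginations, the next request cannot be built before the current response arrives, so pages are necessarily fetched one after another in this sketch.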

🔗 Link header paginations

Some paginated API responses include a header called Link. An example is GitHub: if we inspect the response headers, we can see the Link header looking like this:

<https://api.github.com/repositories/1296269/issues?state=open&page=2>; rel="next",
<https://api.github.com/repositories/1296269/issues?state=open&page=43>; rel="last"

To lazy-load items from a Link header pagination, we can chain the method linkHeader():

LazyJsonPages::from($source)->linkHeader();
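To make the structure of the Link header concrete, here is a simplified plain-PHP parser that maps each relation to its URI. It is a sketch rather than the package's implementation: `parseLinkHeader()` is a hypothetical helper and it only handles entries where `rel` directly follows the URI, as in the GitHub example above.

```php
<?php

// Parse a Link header into a map of relation => URI, e.g. ['next' => '...', 'last' => '...'].
// Simplified: expects each entry in the form <uri>; rel="name".
function parseLinkHeader(string $header): array
{
    preg_match_all('~<([^>]+)>\s*;\s*rel="([^"]+)"~', $header, $matches, PREG_SET_ORDER);

    $links = [];

    foreach ($matches as [, $uri, $rel]) {
        $links[$rel] = $uri;
    }

    return $links;
}

$header = '<https://api.github.com/repositories/1296269/issues?state=open&page=2>; rel="next", '
    . '<https://api.github.com/repositories/1296269/issues?state=open&page=43>; rel="last"';

$links = parseLinkHeader($header);

// prints https://api.github.com/repositories/1296269/issues?state=open&page=2
echo $links['next'], PHP_EOL;
```

The `last` relation gives the number of pages up front, while `next` can be followed page by page when `last` is absent.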

👽 Custom paginations

Lazy JSON Pages provides several methods to extract items from the most popular pagination mechanisms. However, if we need a custom solution, we can implement our own pagination.

To implement a custom pagination, we need to extend Pagination and implement 1 method:

use Cerbero\LazyJsonPages\Paginations\Pagination;
use Traversable;

class CustomPagination extends Pagination
{
    public function getIterator(): Traversable
    {
        // return a Traversable yielding the paginated items
    }
}

The parent class Pagination gives us access to 3 properties:

  • $source: the source pointing to the paginated JSON API
  • $client: the Guzzle HTTP client
  • $config: the configuration that we generated by chaining methods like totalPages()

The method getIterator() defines the logic to extract paginated items in a memory-efficient way. Please refer to the already existing paginations to see some implementations.

Once the custom pagination is implemented, we can instruct Lazy JSON Pages to use it:

LazyJsonPages::from($source)->pagination(CustomPagination::class);

If you find yourself implementing the same custom pagination in different projects, feel free to send a PR and we will consider supporting your custom pagination by default. Thank you in advance for any contribution!

🚀 Requests optimization

Paginated APIs differ from each other, so Lazy JSON Pages lets us tweak our HTTP requests specifically for our use case.

By default HTTP requests are sent synchronously. If we want to send more than one request without waiting for the response, we can call the async() method and set the number of concurrent requests:

LazyJsonPages::from($source)->async(requests: 5);

[!NOTE]
Please note that asynchronous requests improve speed at the expense of memory, as more responses are going to be loaded at once.

Several APIs set rate limits to reduce the number of allowed requests for a period of time. We can instruct Lazy JSON Pages to respect such limits by throttling our requests:

// we send a maximum of 3 requests per second, 60 per minute and 3,000 per hour
LazyJsonPages::from($source)
    ->throttle(requests: 3, perSeconds: 1)
    ->throttle(requests: 60, perMinutes: 1)
    ->throttle(requests: 3000, perHours: 1);
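One common way to enforce a limit like "3 requests per second" is a sliding window. The sketch below shows the idea in plain PHP with timestamps in milliseconds. It is only an illustration: the class `SlidingWindow` is hypothetical, and nothing here is claimed to match how the package throttles internally.

```php
<?php

// Sliding-window limiter: allows at most $limit hits within any window of $windowMs milliseconds.
final class SlidingWindow
{
    /** @var int[] timestamps (ms) of the hits still inside the window */
    private array $hits = [];

    public function __construct(private int $limit, private int $windowMs) {}

    // Return how many milliseconds to wait before a hit at time $now is allowed.
    public function waitFor(int $now): int
    {
        // forget hits that have left the window
        $this->hits = array_values(array_filter(
            $this->hits,
            fn (int $hit) => $hit > $now - $this->windowMs,
        ));

        if (count($this->hits) < $this->limit) {
            return 0;
        }

        // the oldest hit must leave the window before a new one is allowed
        return $this->hits[0] + $this->windowMs - $now;
    }

    // Record a hit that happened at time $now.
    public function hit(int $now): void
    {
        $this->hits[] = $now;
    }
}

$limiter = new SlidingWindow(limit: 3, windowMs: 1000);

// three requests in quick succession are allowed immediately
foreach ([0, 100, 200] as $time) {
    $limiter->hit($time);
}

// a fourth request at 300 ms has to wait until the first hit leaves the window
echo $limiter->waitFor(300), PHP_EOL;
```

When several limits apply at once, as in the example above, one limiter per limit can be kept and the longest wait respected.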

Internally, Lazy JSON Pages uses Guzzle as its HTTP client. We can customize the client behavior by adding as many middleware as we need:

LazyJsonPages::from($source)
    ->middleware('log_requests', $logRequests)
    ->middleware('cache_responses', $cacheResponses);

If we need a middleware to be added every time we invoke Lazy JSON Pages, we can add a global middleware:

LazyJsonPages::globalMiddleware('fire_events', $fireEvents);

Sometimes writing Guzzle middleware might be cumbersome. Alternatively Lazy JSON Pages provides convenient methods to fire callbacks when sending a request or receiving a response:

use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

LazyJsonPages::from($source)
    ->onRequest(fn(RequestInterface $request) => ...)
    ->onResponse(fn(ResponseInterface $response, RequestInterface $request) => ...);

We can also tweak the number of allowed seconds before an API connection times out or the allowed duration of the entire HTTP request (by default they are both set to 5 seconds):

LazyJsonPages::from($source)
    ->connectionTimeout(7)
    ->requestTimeout(10);

If the third-party API is faulty or error-prone, we can indicate how many times we want to retry failing HTTP requests and the backoff strategy to compute the milliseconds to wait before retrying (by default failing requests are repeated 3 times after an increasing backoff of 100, 400 and 900 milliseconds):

// repeat failing requests 5 times after a backoff of 1, 2, 3, 4 and 5 seconds
LazyJsonPages::from($source)
    ->attempts(5)
    ->backoff(fn(int $attempt) => $attempt * 1000);
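A backoff strategy is just a function from the attempt number to a wait in milliseconds, so the resulting schedule is easy to preview in plain PHP. In the sketch below, `backoffSchedule()` is a hypothetical helper, and the first strategy is an assumption: it is one formula that reproduces the default values of 100, 400 and 900 milliseconds quoted above.

```php
<?php

// Compute the waits, in milliseconds, before each retry for a given backoff strategy.
function backoffSchedule(int $attempts, callable $backoff): array
{
    return array_map($backoff, range(1, $attempts));
}

// assumed formula matching the quoted defaults: attempt number squared, times 100
$default = backoffSchedule(3, fn (int $attempt) => $attempt ** 2 * 100);

// the custom strategy from the example above: 1, 2, 3, 4 and 5 seconds
$custom = backoffSchedule(5, fn (int $attempt) => $attempt * 1000);

print_r($default);
print_r($custom);
```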

💢 Errors handling

If something goes wrong during the scraping process, we can intercept the error and execute custom logic to handle it:

use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

LazyJsonPages::from($source)
    ->onError(fn(Throwable $e, RequestInterface $request, ?ResponseInterface $response) => ...);

Any exception thrown by this package extends the LazyJsonPagesException class. This makes it easy to handle all exceptions in a single catch block:

use Cerbero\LazyJsonPages\Exceptions\LazyJsonPagesException;

try {
    LazyJsonPages::from($source)->linkHeader()->collect()->each(...);
} catch (LazyJsonPagesException $e) {
    // handle any exception thrown by Lazy JSON Pages
}

For reference, here is a comprehensive table of all the exceptions thrown by this package:

Cerbero\LazyJsonPages\Exceptions\    thrown when
InvalidKeyException                  a JSON key does not contain a valid value
InvalidPageInPathException           a page cannot be found in the URI path
InvalidPaginationException           a pagination implementation is not valid
OutOfAttemptsException               an HTTP request failed too many times
RequestNotSentException              a JSON source didn’t send any HTTP request
UnsupportedPaginationException       a pagination is not supported
UnsupportedSourceException           a JSON source is not supported

🤝 Laravel integration

If used in a Laravel project, Lazy JSON Pages automatically fires events when:

  • an HTTP request is about to be sent, by firing Illuminate\Http\Client\Events\RequestSending
  • an HTTP response is received, by firing Illuminate\Http\Client\Events\ResponseReceived
  • a connection failed, by firing Illuminate\Http\Client\Events\ConnectionFailed

This is especially handy for debugging tools like Laravel Telescope or Spatie Ray or for triggering the related event listeners.

📆 Change log

Please see CHANGELOG for more information on what has changed recently.

🧪 Testing

composer test

💞 Contributing

Please see CONTRIBUTING and CODE_OF_CONDUCT for details.

🧯 Security

If you discover any security related issues, please email [email protected] instead of using the issue tracker.

🏅 Credits

⚖️ License

The MIT License (MIT). Please see License File for more information.