grover

A Ruby gem to transform HTML into PDFs, PNGs or JPEGs using Google Puppeteer/Chromium

Studiosity

1018

117

Ruby

Grover

A Ruby gem to transform HTML into PDFs, PNGs or JPEGs using Google Puppeteer
and Chromium.

Grover

Installation

Add this line to your application’s Gemfile:

gem 'grover'

Google Puppeteer

npm install puppeteer

Usage

# Grover.new accepts a URL or inline HTML and optional parameters for Puppeteer
grover = Grover.new('https://google.com', format: 'A4')

# Get an inline PDF
pdf = grover.to_pdf

# Get a screenshot
png = grover.to_png
jpeg = grover.to_jpeg

# Get the HTML content (including DOCTYPE)
html = grover.to_html

# Options can be provided through meta tags
Grover.new('<html><head><meta name="grover-page_ranges" content="1-3"')
Grover.new('<html><head><meta name="grover-margin-top" content="10px"')

N.B.

options are underscore case, and sub-options separated with a dash
all options can be overwritten, including emulate_media and display_url

From a view template

It’s easy to render a normal Rails view template as a PDF, using Rails’ render_to_string:

html = MyController.new.render_to_string({
  template: 'controller/view',
  layout: 'my_layout',
  locals: { :@instance_var => ... }
})
pdf = Grover.new(html, **grover_options).to_pdf

Relative paths

If calling Grover directly (not through middleware) you will need to either specify a display_url or modify your
HTML by converting any relative paths to absolute paths before passing to Grover.

This can be achieved using the HTML pre-processor helper (pay attention to the slash at the end of the url):

absolute_html = Grover::HTMLPreprocessor.process relative_html, 'http://my.server/', 'http'

This is important because Chromium will try and resolve any relative paths via the display url host. If not provided,
the display URL defaults to http://example.com.

Why would you pre-process the HTML rather than just use the `display_url`

There are many scenarios where specifying a different host of relative paths would be preferred. For example, your
server might be behind a NAT gateway and the display URL in front of it. The display URL might be shown in the
header/footer, and as such shouldn’t expose details of your private network.

If you run into trouble, take a look at the debugging section below which would allow you to inspect the
page content and devtools.

Configuration

Grover can be configured to adjust the layout of the resulting PDF/image.

For available PDF options, see https://github.com/puppeteer/puppeteer/blob/main/docs/api/puppeteer.pdfoptions.md

Also available are the emulate_media, cache, viewport, timeout, requestTimeout, convertTimeout
and launch_args options.

# config/initializers/grover.rb
Grover.configure do |config|
  config.options = {
    format: 'A4',
    margin: {
      top: '5px',
      bottom: '10cm'
    },
    user_agent: 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
    viewport: {
      width: 640,
      height: 480
    },
    prefer_css_page_size: true,
    emulate_media: 'screen',
    bypass_csp: true,
    media_features: [{ name: 'prefers-color-scheme', value: 'dark' }],
    timezone: 'Australia/Sydney',
    vision_deficiency: 'deuteranopia',
    extra_http_headers: { 'Accept-Language': 'en-US' },
    geolocation: { latitude: 59.95, longitude: 30.31667 },
    focus: '#some-element',
    hover: '#another-element',
    cache: false,
    timeout: 0, # Timeout in ms. A value of `0` means 'no timeout'
    request_timeout: 1000, # Timeout when fetching the content (overloads the `timeout` option)
    convert_timeout: 2000, # Timeout when converting the content (overloads the `timeout` option, only applies to PDF conversion)
    launch_args: ['--font-render-hinting=medium'],
    wait_until: 'domcontentloaded'
  }
end

For available PNG/JPEG options, see https://github.com/puppeteer/puppeteer/blob/main/docs/api/puppeteer.page.screenshot.md#remarks

Note that by default the full_page option is set to false and you will get a 800x600 image. You can either specify
the image size using the clip options, or capture the entire page with full_page set to true.

For viewport options, see https://github.com/puppeteer/puppeteer/blob/main/docs/api/puppeteer.page.setviewport.md#remarks

For launch_args options, see http://peter.sh/experiments/chromium-command-line-switches/
Launch parameter args can also be provided using a meta tag:

For timezone IDs see ICUs metaZones.txt.
Passing nil disables timezone emulation.

The vision_deficiency option can be passed one of achromatopsia, deuteranopia, protanopia, tritanopia,
blurredVision or none.

The focus option takes a CSS selector and will focus on the first matching element after rendering is complete
(including waiting for the specified wait_for_selector).

The hover option takes a CSS selector and will hover on the first matching element after rendering is complete
(including waiting for the specified wait_for_selector).

<meta name="grover-launch_args" content="['--disable-speech-api']" />

For wait_until option, default for URLs is networkidle2 and for HTML content networkidle0.
For available options see https://github.com/puppeteer/puppeteer/blob/main/docs/api/puppeteer.page.goto.md#remarks

The wait_for_selector option can also be used to wait until an element appears on the page. Additional waiting parameters can be set with the wait_for_selector_options options hash. For available options, see: https://github.com/puppeteer/puppeteer/blob/main/docs/api/puppeteer.page.waitforselector.md#remarks.

The wait_for_function option can be used to wait until a specific function returns a truthy value. Additional parameters can be set with the wait_for_function_options options hash. For available options, see: https://github.com/puppeteer/puppeteer/blob/main/docs/api/puppeteer.page.waitforfunction.md#remarks

The wait_for_timeout option can also be used to wait the specified number of milliseconds have elapsed.

The raise_on_request_failure option, when enabled, will raise a Grover::JavaScript::RequestFailedError
if the initial content request or any subsequent asset request returns a bad response or times out.

The raise_on_js_error option, when enabled, will raise a Grover::JavaScript::PageRenderError if any uncaught
Javascript errors occur when trying to render the page.

The Chrome/Chromium executable path can be overridden with the executable_path option.

Supplementary JavaScript can be executed on the page (after render and before conversion to PDF/image)
by passing it to the execute_script option.

Grover.new(<some url>, execute_script: 'document.getElementsByTagName("footer")[0].innerText = "Hey"').to_pdf

You can also evaluate JavaScript on the page before any of its scripts is run, by passing it a string to the evaluate_on_new_document option. See https://github.com/puppeteer/puppeteer/blob/main/docs/api/puppeteer.page.evaluateonnewdocument.md

Grover.new(<some url>, evaluate_on_new_document: 'window.someConfig = "some value"').to_pdf

Basic authentication

For requesting a page with basic authentication, username and password options can be provided. Note that this
only really makes sense if you’re calling Grover directly (and not via middleware).

Grover.new('<some URI with basic authentication', username: 'the username', password: 'super secret').to_pdf

Remote Chromium

By default, Grover launches a local Chromium instance. You can connect to a remote/external
Chromium with the browser_ws_endpoint options.

For example, to connect to a chrome instance started with docker using docker run -p 3000:3000 ghcr.io/browserless/chrome:latest:

grover = Grover.new("https://mysite.com/path/to/thing", browser_ws_endpoint: "ws://localhost:3000/chrome")
File.open("grover.png", "wb") { |f| f << grover.to_png }

You can also pass launch flags like this: ws://localhost:3000/chrome?--disable-speech-api

If you are only using remote chromium, you can install the puppeteer-core node package instead of puppeteer to avoid downloading chrome.
Grover will use puppeteer or fallback to puppeteer-core if it is available.

npm install puppeteer-core

Adding cookies

To set request cookies when requesting a URL, pass an array of hashes as such
N.B. Only the name and value properties are required.
See page.setCookie documentation for more details (old documentation with more detailed description is available here).

myCookies = [
  { name: 'sign_username', value: '[email protected]', domain: 'mydomain' },
  { name: '_session_id', value: '9c014df0b699d8dc08d1c472f8cc594c', domain: 'mydomain' }
]
Grover.new('<some URI with cookies', cookies: myCookies).to_pdf

If you need to forward the cookies from the original request, you could extract them as such:

def header_cookies
  request.headers['Cookie'].split('; ').map do |cookie|
    key, value = cookie.split '='
    { name: key, value: value, domain: request.headers['Host'] }
  end
end

And give that array to Grover:

Grover.new('<some URI with cookies', cookies: header_cookies).to_pdf

Adding style tags

To add style tags, pass an array of style tag options as such
See page.addStyleTag documentation for more details (old documentation with more detailed description is available here).

style_tag_options = [
  { url: 'http://example.com/style.css' },
  { path: 'style.css' },
  { content: '.body{background: red}' }
]
Grover.new('<html><body><h1>Heading</h1></body></html>', style_tag_options: style_tag_options).to_pdf

Adding script tags

To add script tags, pass an array of script tag options as such
See documentation for more details page.addScriptTag (old documentation is available here).

script_tag_options = [
  { url: 'http://example.com/script.js' },
  { path: 'script.js' },
  { content: 'document.querySelector("h1").style.display = "none"' }
]
Grover.new('<html><body><h1>Heading</h1></body></html>', script_tag_options: script_tag_options).to_pdf

Page URL for middleware requests (or passing through raw HTML)

If you want to have the header or footer display the page URL, Grover requires that this is passed through via the
display_url option. This is because the page URL is not available in the raw HTML!

For Rack middleware conversions, the original request URL (without the .pdf extension) will be passed through and
assigned to display_url for you. You can of course override this by using a meta tag in the downstream HTML response.

For raw HTML conversions, if the display_url is not provided http://example.com will be used as the default.

Header and footer templates

Should be valid HTML markup with following classes used to inject printing values into them:

date formatted print date
title document title
url document location
pageNumber current page number
totalPages total pages in the document

Setting custom PDF filename with header

In respective controller’s action use:

respond_to do |format|
  format.html do
    response.headers['content-disposition'] = %(attachment; filename="lorem_ipsum.pdf")

    render layout: 'pdf'
  end
end

Setting custom environment variable for node

The node_env_vars configuration option enables you to set custom environment variables for the spawned node process. For example you might need to disable jemalloc in some environments (https://github.com/Studiosity/grover/issues/80).

# config/initializers/grover.rb
Grover.configure do |config|
  config.node_env_vars = { "LD_PRELOAD" => "" }
end

Yarn PnP strategy

If you are using the Yarn PnP strategy, you can override the run JS runtime for grover:

Grover.configure do |config|
  config.js_runtime_bin = ['yarn', 'node']
end

Middleware

Grover comes with a middleware that allows users to get a PDF, PNG or JPEG view of
any page on your site by appending .pdf, .png or .jpeg/.jpg to the URL.

Middleware Setup

Non-Rails Rack apps

# in config.ru
require 'grover'
use Grover::Middleware

Rails apps

# in application.rb
require 'grover'
config.middleware.use Grover::Middleware

N.B. by default PNG and JPEG are not modified in the middleware to prevent breaking standard behaviours.
To enable them, there are configuration options for each image type as well as an option to disable the PDF middleware
(on by default).

If either of the image handling middleware options are enabled, the ignore_path and/or
ignore_request should also be configured, otherwise assets are likely to be handled
which would likely result in 404 responses.

# config/initializers/grover.rb
Grover.configure do |config|
  config.use_png_middleware = true
  config.use_jpeg_middleware = true
  config.use_pdf_middleware = false
end

root_url

The root_url option can be specified either when configuring the middleware or as a global option. This is needed
when running the Grover middleware behind a URL rewriting proxy or within a containerised system.

As a middleware option:

# in application.rb
require 'grover'
config.middleware.use Grover::Middleware, root_url: 'https://my.external.domain'

or as a global option:

# config/initializers/grover.rb
Grover.configure do |config|
  config.root_url = 'https://my.external.domain'
end

ignore_path

The ignore_path configuration option can be used to tell Grover’s middleware whether it should handle/modify
the response. There are three ways to set up the ignore_path:

a String which matches the start of the request path.
a Regexp which could match any part of the request path.
a Proc which accepts the request path as a parameter.

# config/initializers/grover.rb
Grover.configure do |config|
  # assigning a String
  config.ignore_path = '/assets/'
  # matches `www.example.com/assets/foo.png` and not `www.example.com/bar/assets/foo.png`

  # assigning a Regexp
  config.ignore_path = /my\/path/
  # matches `www.example.com/foo/my/path/bar.png`

  # assigning a Proc
  config.ignore_path = ->(path) do
    /\A\/foo\/.+\/[0-9]+\.png\z/.match path
  end
  # matches `www.example.com/foo/bar/123.png`
end

ignore_request

The ignore_request configuration option can be used to tell Grover’s middleware whether it should handle/modify
the response. It should be set with a Proc which accepts the request (Rack::Request) as a parameter.

# config/initializers/grover.rb
Grover.configure do |config|
  # assigning a Proc
  config.ignore_request = ->(req) do
    req.host == 'www.example.com'
  end
  # matches `www.example.com/foo/bar/123.png`

  config.ignore_request = ->(req) do
    req.has_header?('X-BLOCK')
  end
  # matches `HTTP Header X-BLOCK`
end

allow_file_uris

The allow_file_uris option can be used to render an html document from the file system.
This should be used with EXTREME CAUTION. If used improperly it could potentially be manipulated to reveal
sensitive files on the system. Do not enable if rendering content from outside entities
(user uploads, external URLs, etc).

It defaults to false preventing local system files from being read.

# config/initializers/grover.rb
Grover.configure do |config|
  config.allow_file_uris = true
end

And used as such:

# Grover.new accepts a file URI and optional parameters for Puppeteer
grover = Grover.new('file:///some/local/file.html', format: 'A4')

# Get an inline PDF of the local file
pdf = grover.to_pdf

Cover pages

Since the header/footer for Puppeteer is configured globally, displaying of front/back cover
pages (with potentially different headers/footers etc) is not possible.

To get around this, Grover’s middleware allows you to specify relative paths for the cover page contents.
For direct execution, you can make multiple calls and combine the resulting PDFs together.

Using middleware

You can specify relative paths to the cover page contents using the front_cover_path and back_cover_path
options either via the global configuration, or via meta tags. These paths (with query parameters) are then
requested from the downstream app.

Note, to use this functionality you need to add the combine_pdf gem to your app.

The cover pages are converted to PDF in isolation, and then combined together with the original PDF response,
before being returned back up through the Rack stack.

N.B To simplify things, the same request method and body are used for the cover page requests.

# config/initializers/grover.rb
Grover.configure do |config|
  config.options = {
    front_cover_path: '/some/global/cover/page?foo=bar'
  }
end

Or via the meta tags in the original response:

<html>
  <head>
    <meta name="grover-back_cover_path" content="/back/cover/page?bar=baz" />
  </head>
  ...
</html>

Direct execution

To add a cover page using direct execution, you can make multiple calls and combine the results using the
combine_pdf gem.

require 'combine_pdf'

  # ...

  def invoke(file_path)
    pdf = CombinePDF.parse(Grover.new(pdf_report_url).to_pdf)
    pdf >> CombinePDF.parse(Grover.new(pdf_front_cover_url).to_pdf)
    pdf << CombinePDF.parse(Grover.new(pdf_back_cover_url).to_pdf)
    pdf.save file_path
  end

Running on Heroku

To run Grover (Puppeteer) on Heroku follow these steps:

Add the node buildpack. Puppeteer requires a node environment to run.

heroku buildpacks:add heroku/nodejs --index=1 [--remote yourappname]

Add the puppeteer buildpack.
Make sure the puppeteer buildpack runs after the node buildpack and before the main ruby buildpack.
```
heroku buildpacks:add jontewks/puppeteer --index=2 [--remote yourappname]
```
Next, tell Grover to run Puppeteer in the “no-sandbox” mode by setting an ENV variable
GROVER_NO_SANDBOX=true on your app dyno. Make sure that you trust all
the HTML/JS you provide to Grover.
```
heroku config:set GROVER_NO_SANDBOX=true [--remote yourappname]
```

Finally, if using puppeteer 19+ (the default) add the following to a .puppeteerrc.cjs file in the root of your project:

const {join} = require('path');

/**
* @type {import("puppeteer").Configuration}
* */
module.exports = {
  cacheDirectory: join(__dirname, '.cache', 'puppeteer'),
};

“Failed to launch the browser process!” and “No usable sandbox!”

Errors around no usable sandbox are likely caused by changes to how linux secures itself with apparmor. This restricts
which applications are allowed to create sandboxes on the system, including and as required by puppeteer.

This resource talks about some of the options available, but note that it is talking in the context of DEVELOPER Chrome
builds: https://chromium.googlesource.com/chromium/src/+/main/docs/security/apparmor-userns-restrictions.md

However, that does point to what solutions might look like.

Option 1. Use the version of Chrome that your system already allows

Most systems will likely have an apparmor profile for the system installed Chrome, so the easiest option may be to just
tell Grover to use that. eg:

  executable_path: '/opt/google/chrome/chrome' # this path may be different depending on your system

The problem with this solution is that puppeteer is somewhat designed to be run with a specific version of Chrome. If
you use the version installed on your system that may not match the puppeteer version, and as such may have
incompatibilities.

Option 2. Tell your system to allow the Puppeteer installed Chrome

Alternatively, you could create a new apparmor profile for the puppeteer managed version of Chrome. Something like:

export PUPPETEER_CHROME_PATH=$(node -e "console.log(require('puppeteer').executablePath())")
cat | sudo tee /etc/apparmor.d/chrome <<EOF
abi <abi/4.0>,
include <tunables/global>

profile chrome $PUPPETEER_CHROME_PATH flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/chrome>
}
EOF
sudo service apparmor reload  # reload AppArmor profiles to include the new one

However, it is important when using this method to also explicitly set the executable_path within your Grover code to
reference this same Puppeteer Chrome path. By default, if you don’t explicitly specify the executable path, Puppeteer
will ACTUALLY use chrome-headless-shell and not chrome
(both would likely be installed in ~/.cache/puppeteer/ by default when installing the puppeteer NPM package).
See https://pptr.dev/guides/headless-modes

Option 3. Tell your system to allow the Puppeteer installed headless shell version of Chrome

If you prefer not to overload the executable path, you could instead create an apparmor profile for
chrome-headless-shell as such:

export CHROME_HEADLESS_SHELL_PATH=$(find ~/.cache/puppeteer/chrome-headless-shell/ -executable -type f -name chrome-headless-shell -print -quit)
cat | sudo tee /etc/apparmor.d/chrome-headless-shell <<EOF
abi <abi/4.0>,
include <tunables/global>

profile chrome-headless-shell $CHROME_HEADLESS_SHELL_PATH flags=(unconfined) {
  userns,

  # Site-specific additions and overrides. See local/README for details.
  include if exists <local/chrome>
}
EOF
sudo service apparmor reload  # reload AppArmor profiles to include the new one

Debugging

If you’re having trouble with converting the HTML content, you can enable some debugging options to help. These can be
enabled as global options via Grover.configure, by passing through to the Grover initializer, or using meta tag
options.

debug: {
  headless: false,  # Default true. When set to false, the Chromium browser will be displayed
  devtools: true    # Default false. When set to true, the browser devtools will be displayed.
}

N.B.

The headless option disabled is not compatible with exporting of the PDF.
If showing the devtools, the browser will halt resulting in a navigation timeout

Troubleshooting

If you’re generating files from a web server, starting with puppeteer version 22, chromium automatically upgrades all HTTP requests to HTTPS requests. If your server is only expecting HTTP requests then adding launch_args: ['--disable-features=HttpsUpgrades'] will prevent the automatic protocol conversion from occurring.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/Studiosity/grover.

Note that spec tests are appreciated to minimise regressions. Before submitting a PR, please ensure that:

$ rspec

and

$ rubocop

both succeed

To run tests tagged with remote_browser, you need to start a browser in a container:
docker run -p 3000:3000 browserless/chrome:latest
and run:
rspec --tag remote_browser

Special mention

Thanks are given to the great work done in the PDFKit project.
The middleware and HTML preprocessing components were used heavily in the implementation of Grover.

Thanks are also given to the excellent Schmooze project.
The Ruby to NodeJS interface in Grover is heavily based off that work. Grover previously used that gem,
however migrated away due to differing requirements over persistence/cleanup of the NodeJS worker process.

License

The gem is available as open source under the terms of the MIT License.

grover

Grover

Installation

Google Puppeteer

Usage

From a view template

Relative paths

Why would you pre-process the HTML rather than just use the display_url

Configuration

Basic authentication

Remote Chromium

Adding cookies

Adding style tags

Adding script tags

Page URL for middleware requests (or passing through raw HTML)

Header and footer templates

Setting custom PDF filename with header

Setting custom environment variable for node

Yarn PnP strategy

Middleware

Middleware Setup

root_url

ignore_path

ignore_request

allow_file_uris

Cover pages

Using middleware

Direct execution

Running on Heroku

“Failed to launch the browser process!” and “No usable sandbox!”

Option 1. Use the version of Chrome that your system already allows

Option 2. Tell your system to allow the Puppeteer installed Chrome

Option 3. Tell your system to allow the Puppeteer installed headless shell version of Chrome

Debugging

Troubleshooting

Contributing

Special mention

License

Why would you pre-process the HTML rather than just use the `display_url`