node pdfutils

tool for analyzing and converting PDF

PDF Utils for node

This library contains tools for analysing and converting PDF files. You can
get metadata, extract text, and render pages to SVG or PNG, all in our beloved
asynchronous programming style.

It is planned to support extracting links from the document and creating
ImageMaps (you remember them, don’t you?) on the fly. pdfutils should also
support password-locked files. Both are still on the to-do list.

The library is currently in beta: error handling is incomplete and there is no
test suite.

Installation

To install pdfutils you have to install libpoppler-glib first.

On Debian, run:

apt-get install libpoppler-glib-dev libpoppler-glib8 libcairo2-dev libcairo2

On CentOS, run:

yum install poppler poppler-glib-devel

On macOS with MacPorts:

port install poppler

or, if you prefer Homebrew:

brew install poppler --with-glib
export PKG_CONFIG_PATH=/usr/X11/lib/pkgconfig

Then install pdfutils:

npm install pdfutils

Usage

See this very basic example:

var pdfutils = require('pdfutils').pdfutils;

pdfutils("document.pdf", function(err, doc) {
	doc[0].asPNG({maxWidth: 100, maxHeight: 100}).toFile("firstpage.png");
});

Three lines of code to generate thumbnails of PDFs. Awesome!

Here is a bit more documentation:

pdfutils(source, callback)

This function is a factory for Documents (see Class PDFDocument below).

arguments:

  • source: a Buffer or a String. A string is treated as the path of the
    file to read; a buffer is treated as an in-memory PDF. Do not modify the
    buffer while pdfutils is using it!
  • callback(err, doc): a callback with the following arguments:
    • err: an error string if the PDF could not be loaded, otherwise null
    • doc: an instance of Document if the PDF was loaded successfully,
      otherwise undefined
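
As a minimal sketch of the callback contract above, the helper below checks err before touching doc. The name countPages and its optional factory parameter are illustrative and not part of pdfutils; the parameter only exists so the sketch can be tried with a stand-in when the native module is not installed.

```javascript
// Hypothetical helper (not part of pdfutils): reports the number of pages.
// `factory` defaults to the real pdfutils function; a stand-in with the same
// (source, callback) shape can be passed instead.
function countPages(source, callback, factory) {
	factory = factory || require('pdfutils').pdfutils;
	factory(source, function(err, doc) {
		if (err) {
			// err is documented as a string, so wrap it in an Error.
			return callback(new Error(String(err)));
		}
		callback(null, doc.length);
	});
}

// Usage with an in-memory PDF (the file name is only an example):
// countPages(require('fs').readFileSync('document.pdf'), function(err, n) {
// 	if (err) { return console.error(err.message); }
// 	console.log(n + ' pages');
// });
```

Wrapping the error string in an Error is a design choice of this sketch, made so that callers get a stack trace; passing the string through unchanged would work as well.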

Class PDFDocument

Instances of this class are created by pdfutils(source, callback), described above.

members:

  • 0, 1, 2, 3, 4, … , n: instances of the Pages contained in the
    Document. See the description of Page below
  • length: number of Pages in a document
  • author: the author of the document or null if not known
  • creationDate: the creation date as an integer timestamp counted from 1970-01-01
  • creator: creator of the document or null if unknown
  • format: exact format of this PDF file or null if unknown
  • keywords: keywords of the document as string or null if unknown
  • linearized: true if document is linearized,
    otherwise false
  • metadata: Metadata as string
  • modDate: the last modification date of the PDF as an integer timestamp counted from 1970-01-01
  • pageLayout: the layout of the pages. Can be one of the following strings or null if unknown:
    • singlePage
    • oneColumn
    • twoColumnLeft
    • twoColumnRight
    • twoPageLeft
    • twoPageRight
  • pageMode: the suggested viewing mode of a page. Can be one of the following strings or null if unknown:
    • none
    • useOutlines
    • useThumbs
    • fullscreen
    • useOc
    • useAttachments
  • permissions: the permissions of this document, as an object with the following members:
    • print: whether the user is allowed to print
    • modify: whether the user is allowed to modify the document
    • copy: whether the user is allowed to take copies of this document
    • notes: whether the user is allowed to make notes
    • fillForm: whether the user is allowed to fill out forms
  • producer: producer of a document or null if unknown
  • subject: subject of this document or null if unknown
  • title: title of the document or null if unknown
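
The members listed above can be read like ordinary properties. The sketch below gathers a few of them into a plain object. describeDocument is a hypothetical helper, not part of pdfutils, and the conversion of creationDate assumes the integer counts seconds, which the list above does not state.

```javascript
// Hypothetical helper (not part of pdfutils): collects a few documented
// members of a PDFDocument into a plain object.
function describeDocument(doc) {
	return {
		// title and author are null when unknown, so substitute placeholders.
		title: doc.title || '(untitled)',
		author: doc.author || '(unknown)',
		pages: doc.length,
		// ASSUMPTION: creationDate counts seconds since 1970-01-01.
		// If it turns out to be milliseconds, drop the "* 1000".
		created: doc.creationDate ? new Date(doc.creationDate * 1000) : null,
		// Guard against a missing permissions object on stand-in documents.
		mayPrint: Boolean(doc.permissions && doc.permissions.print),
		mayCopy: Boolean(doc.permissions && doc.permissions.copy)
	};
}

// Usage inside a pdfutils callback:
// pdfutils("document.pdf", function(err, doc) {
// 	if (!err) { console.log(describeDocument(doc)); }
// });
```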

Class PDFPage

This class represents a page of a document

members:

  • width: width of the page
  • height: height of the page
  • index: number of this page.
  • label: label of this page or null if no label was defined.
  • links: array containing links of a page
  • asSVG(opts): returns an instance of PageJob described below. opts is an
    optional Object with the following optional fields:
    • maxWidth: maximum width of the resulting SVG in px.
    • minWidth: minimum width of the resulting SVG in px.
    • maxHeight: maximum height of the resulting SVG in px.
    • minHeight: minimum height of the resulting SVG in px.
    • width: the width of the resulting SVG in px. Overrides minWidth and
      maxWidth.
    • height: the height of the resulting SVG in px. Overrides minHeight and
      maxHeight.
  • asPDF(opts): returns an instance of PageJob described below. opts is an
    optional Object with the following optional fields:
    • maxWidth: maximum width of the resulting PDF in pt.
    • minWidth: minimum width of the resulting PDF in pt.
    • maxHeight: maximum height of the resulting PDF in pt.
    • minHeight: minimum height of the resulting PDF in pt.
    • width: the width of the resulting PDF in pt. Overrides minWidth and
      maxWidth.
    • height: the height of the resulting PDF in pt. Overrides minHeight and
      maxHeight.
  • asPNG(opts): returns an instance of PageJob described below. opts is an
    optional Object with the following optional fields:
    • maxWidth: maximum width of the resulting PNG in px.
    • minWidth: minimum width of the resulting PNG in px.
    • maxHeight: maximum height of the resulting PNG in px.
    • minHeight: minimum height of the resulting PNG in px.
    • width: the width of the resulting PNG in px. Overrides minWidth and
      maxWidth.
    • height: the height of the resulting PNG in px. Overrides minHeight and
      maxHeight.
  • asText(opts): returns an instance of PageJob described below. opts is an
    optional argument with an Object, which is currently ignored.
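
Because a Document exposes its Pages under numeric keys together with length, it can be walked like an array. The sketch below asks for one bounded PNG per page. renderThumbnails and the page-N.png naming scheme are illustrative choices, not part of pdfutils.

```javascript
var path = require('path');

// Hypothetical helper (not part of pdfutils): requests one PNG per page and
// returns the list of file names it asked pdfutils to write.
function renderThumbnails(doc, dir, size) {
	var files = [];
	for (var i = 0; i < doc.length; i++) {
		var file = path.join(dir, 'page-' + (i + 1) + '.png');
		// maxWidth/maxHeight bound the output size; see asPNG(opts) above.
		doc[i].asPNG({maxWidth: size, maxHeight: size}).toFile(file);
		files.push(file);
	}
	return files;
}

// Usage inside a pdfutils callback (the directory is only an example):
// pdfutils("document.pdf", function(err, doc) {
// 	if (!err) { renderThumbnails(doc, "thumbs", 100); }
// });
```

The returned names are the paths that were requested. Since a PageJob is a stream, the files may not be completely written yet when the function returns.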

Class PDFPageJob

This class inherits from Stream. It handles
converting a Page (described above) to SVG, PDF, PNG or text

members:

  • links: array containing links of a page, translated to fit the output page.

events:

  • data: emitted when a new chunk of the converted file is available
  • end: emitted when the file is successfully converted
  • error: emitted when the file cannot be converted. Not implemented yet.

members:

  • toFile(path, [options]): writes the converted page to the file at path in the desired format.
  • see Stream for further members.
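
When the result is wanted in memory rather than in a file, the data and end events described above can be collected by hand. The sketch below does so. collectJob is a hypothetical helper, not part of pdfutils, and it assumes chunks arrive as Buffers or strings.

```javascript
// Hypothetical helper (not part of pdfutils): gathers every "data" chunk of a
// PageJob and hands the concatenated Buffer to callback once "end" fires.
function collectJob(job, callback) {
	var chunks = [];
	var done = false;
	function finish(err, result) {
		// Make sure the callback runs only once.
		if (done) { return; }
		done = true;
		callback(err, result);
	}
	job.on('data', function(chunk) {
		// ASSUMPTION: chunks are Buffers or strings; normalise to Buffer.
		chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
	});
	job.on('end', function() { finish(null, Buffer.concat(chunks)); });
	// "error" is documented as not implemented yet; the listener is harmless.
	job.on('error', function(err) { finish(err); });
}

// Usage: print the text of the first page.
// collectJob(doc[0].asText(), function(err, buf) {
// 	if (!err) { console.log(buf.toString()); }
// });
```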

License

This module is licensed under the GPL.