Cinch2

A project to add site crawling, file normalization, natural language processing and increased scalability to the current Cinch project. Cinch is a project to develop a bulk download service to a central repository that will maintain original file timestamps, virus check, extract file level metadata, create file checksums and periodically validate checksums for continued file integrity. Users merely need to upload a list of URLs to download and when the process completes they can download the requested files and file metadata to their local environment. Funding for the first version of CINCH: Capture, Ingest, & Checksum tool was made possible through an IMLS Sparks! Ignition grant.

2
3
Ruby

Cinch2 will port the goodness of Cinch from PHP to Ruby.
While we feel PHP is a fine language it doesn’t lend itself well to the goals we have for Cinch2.

Cinch2 looks to build on the foundations of Cinch. Cinch2 looks to add web crawling capabilities so that instead of
generating their own download lists user can merely enter a url to crawl and generate a report of Cinch supported file
types they might download from a site, which they can then selectively edit for download. User no longer need to
generate their own list of files to download, though they still have that option if they so desire.

Cinch2 will also attempt to incorporate natural language processing tools to create richer metadata than can be
extracted by merely looking at file level metadata.

Users will have the option to normalize PDF files to the PDF/A format, and normalize images to the JPEG2000 format.

It will also add distributed and parallel task processing to increase scalability.

Cinch itself is a project to develop a bulk download service to a central repository
that will maintain original file timestamps, extract file level metadata,
create file checksums and periodically validate checksums for continued file integrity.

Users merely need to upload a list of URLs to download and
when the process completes they can download the requested files and file metadata
to their local environment.

Currently supported file types:

  • PDF
  • Microsoft Word
  • Microsoft Excel
  • Microsoft PowerPoint
  • Jpeg
  • PNG
  • GIF
  • Text (e.g. files with .txt, .csv extensions, etc.)

Full end user instructions

Learn more at: http://digitalpreservation.ncdcr.gov/cinch/.

Funding for the CINCH: Capture, Ingest, & Checksum tool was made possible through
an IMLS Sparks! Ignition grant.

License: CINCH2 is released under the Unlicense (http://unlicense.org/)
Individual gems maintain their own licenses, generally MIT. One exception is Sidekiq which
is licensed under the LGPLv3. Commercial use of Sidekiq requires the purchase of a license.


Requirements

  • Currently Cinch2 will only run on *nix systems
  • Ruby 1.9.3+
  • Rails 3.2.* (We use 3.2.8)
  • MySQL or SQLite3
  • ClamAV
  • Redis
  • ImageMagick 6.3.5+
  • Ghostscript
  • libcurl 7.5+
  • Apache Tika

Recommended, but not required:

  • Ruby Version Manager (RVM)