Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
Yomu is a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit.
Supported formats include .doc, .docx, .pages, .odt, .rtf and .pdf.
For the complete list of supported formats, please visit the Apache Tika
Supported Document Formats page.
Text, metadata and MIME type information can be extracted by calling Yomu.read
directly:
require 'yomu'
data = File.read 'sample.pages'
text = Yomu.read :text, data
metadata = Yomu.read :metadata, data
mimetype = Yomu.read :mimetype, data
Create a new instance of Yomu and pass a filename.
yomu = Yomu.new 'sample.pages'
text = yomu.text
Yomu.new also accepts a URL. This is useful for reading remote files, like documents hosted on Amazon S3.
yomu = Yomu.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
text = yomu.text
Yomu can also read from a stream or any object that responds to read, including file uploads from Ruby on Rails or Sinatra.
post '/:name/:filename' do
  yomu = Yomu.new params[:data][:tempfile]
  yomu.text
end
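The three kinds of input described above (a filename, a URL, or an object that responds to read) can be told apart with ordinary duck typing. The following is a minimal sketch using only the standard library. It is not Yomu's source: `classify_input` and `http_url?` are hypothetical helpers that only illustrate how such inputs might be distinguished, with StringIO standing in for an uploaded tempfile.

```ruby
require 'stringio'
require 'uri'

# Hypothetical helper (not part of Yomu): true for http:// and https:// strings.
def http_url?(string)
  URI.parse(string).is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
rescue URI::InvalidURIError
  false
end

# Hypothetical helper (not part of Yomu): report how an input could be treated.
def classify_input(input)
  if input.is_a?(String)
    # An existing local path takes priority over URL parsing.
    return :file if File.exist?(input)
    return :url if http_url?(input)

    raise ArgumentError, "missing file or invalid URL: #{input}"
  elsif input.respond_to?(:read)
    # Anything with a #read method qualifies: File, Tempfile, StringIO, ...
    :stream
  else
    raise TypeError, "can't read from #{input.class}"
  end
end

puts classify_input(StringIO.new('hello'))
```

Checking `respond_to?(:read)` rather than a specific class is what lets upload tempfiles, open files and in-memory buffers all be handled the same way.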
Metadata is returned as a hash.
yomu = Yomu.new 'sample.pages'
yomu.metadata['Content-Type'] #=> "application/vnd.apple.pages"
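Because the metadata is a plain hash, the usual Hash methods apply. The sketch below uses a hand-written hash as a stand-in for `yomu.metadata`, so it runs without Yomu installed. `pick_metadata` is a hypothetical helper, and the 'Author' key is an assumption for illustration; which keys are present depends on the document.

```ruby
# Hypothetical helper (not part of Yomu): select fields from a metadata hash,
# substituting a placeholder for any field the document does not carry.
def pick_metadata(metadata, *keys, default: 'unknown')
  keys.to_h { |key| [key, metadata.fetch(key, default)] }
end

# Sample data standing in for the hash returned by yomu.metadata.
sample = { 'Content-Type' => 'application/vnd.apple.pages' }

summary = pick_metadata(sample, 'Content-Type', 'Author')
summary.each { |key, value| puts "#{key}: #{value}" }
```

Using `fetch` with a default avoids nil values leaking into later string handling when a field is absent.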
MIME type is returned as a MIME::Type object.
yomu = Yomu.new 'sample.docx'
yomu.mimetype.content_type #=> "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
yomu.mimetype.extensions #=> ['docx']
Yomu packages the Apache Tika application jar and requires a working JRE.
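Since a JRE is required, it can be helpful to check for one before calling Yomu. The following is a minimal sketch under the assumption that the JRE is reachable through a `java` executable on PATH. `java_on_path?` is a hypothetical helper, not something Yomu provides, and it only checks that the executable is present, not that it runs.

```ruby
# Hypothetical helper (not part of Yomu): true when an executable named
# `java` (or `java.exe`) exists in one of the PATH directories.
def java_on_path?(path = ENV.fetch('PATH', ''))
  dirs = path.split(File::PATH_SEPARATOR).reject(&:empty?)
  dirs.any? do |dir|
    %w[java java.exe].any? do |name|
      candidate = File.join(dir, name)
      File.file?(candidate) && File.executable?(candidate)
    end
  end
end

warn 'No java executable found on PATH' unless java_on_path?
```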
Add this line to your application’s Gemfile:
gem 'yomu'
And then execute:
$ bundle
Or install it yourself as:
$ gem install yomu
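To contribute:
1. Create your feature branch (git checkout -b my-new-feature)
2. Run the tests (rake test)
3. Commit your changes (git commit -am 'Added some feature')
4. Push to the branch (git push origin my-new-feature)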