Extraction and re-use(ability) of chemical information from common scientific documents containing ChemDraw files
The ChemScanner library attempts to extract and interpret reaction and molecule information from ChemDraw-related file formats: CDX, CDXML, CDX embedded in DOC and DOCX, and Perkin Elmer ELN.
The gem uses the rdkit_chem gem and therefore requires the dependencies of rdkit_chem:
python-dev
sqlite3-dev
libboost-all-dev
Add this line to your application’s Gemfile:
gem 'chem_scanner'
And then execute:
$ bundle
Or install it yourself as:
$ gem install chem_scanner
You can try ChemScanner at https://eln.chemotion.net/ or https://eln.chemotion.net/chemscanner. The UI is more user-friendly and offers some additional features.
To scan and extract a single CDX file:
require 'chem_scanner'
cdx = ChemScanner::Cdx.new
cdx.read('/path/to/cdx/file')
# Get array of scanned Canonical SMILES
cdx.molecules.map(&:get_cano_smiles)
# Get array of scanned Reactions in SMILES
cdx.reactions.map(&:reaction_smiles)
There are 5 classes corresponding to the 5 supported file formats: CDX, CDXML, DOC, DOCX, and PerkinELN.
# Molecules - array of scanned molecules
cdx.molecules
# Get array of scanned Canonical SMILES
cdx.molecules.map(&:get_cano_smiles)
# Get one molecule
molecule = cdx.molecules.first
# Number of scanned molecules
cdx.molecules.count
# Canonical SMILES
molecule.get_cano_smiles
# Molfile
molecule.get_mdl
# RDKit RWMol (https://www.rdkit.org/docs/cppapi/classRDKit_1_1RWMol.html)
molecule.rw_mol
# Molecule label (bold text near molecule)
molecule.label
# Molecule text (molecule description)
molecule.text
# Molecule details (additional information from Perkin Elmer ELN)
molecule.details
We use a Ruby binding of RDKit as a dependency of ChemScanner.
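As a sketch of how the molecule accessors above can be combined, the snippet below turns a list of molecules into plain Hashes. So that it runs without the gem installed, it uses a hypothetical stand-in `MoleculeStub` that mimics only `label`, `text`, and `get_cano_smiles`; the helper name `summarize_molecules` and the sample data are also made up for illustration. With real data you would pass `cdx.molecules` instead.

```ruby
# Hypothetical stand-in for a scanned molecule; real objects come from
# cdx.molecules. Only the accessors used below are mimicked.
MoleculeStub = Struct.new(:label, :text, :smiles) do
  def get_cano_smiles
    smiles
  end
end

# Build one plain Hash per molecule, skipping entries without a SMILES.
def summarize_molecules(molecules)
  molecules.each_with_object([]) do |mol, rows|
    smiles = mol.get_cano_smiles
    next if smiles.nil? || smiles.empty?

    rows << { label: mol.label, text: mol.text, smiles: smiles }
  end
end

# Made-up sample data: one labelled molecule and one empty entry.
molecules = [
  MoleculeStub.new('1a', 'benzaldehyde', 'O=Cc1ccccc1'),
  MoleculeStub.new(nil, nil, '')
]

summarize_molecules(molecules).each do |row|
  puts "#{row[:label] || '(no label)'}: #{row[:smiles]}"
end
```

Plain Hashes like these are easy to serialize, for example to JSON or CSV, without carrying the RDKit objects along.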
A reaction consists of 3 groups of molecules: reactants, reagents, and products. Each group is an array of molecules, in which each element is an object of the Molecule class. In addition, some abbreviations that belong to the reaction are represented as SMILES. These can be accessed via reagent_smiles:
reaction = cdx.reactions.first
# Access extracted structure group
reactants = reaction.reactants
reagents = reaction.reagents
products = reaction.products
reagent_smiles = reaction.reagent_smiles
Each molecule in these groups can be handled in the same way as described for the Molecule class.
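To illustrate how the three groups and reagent_smiles relate to each other, here is a minimal sketch that assembles a reaction SMILES of the general form `reactants>agents>products` by hand. It is not the library's implementation: `reaction_smiles` already exists and may order or canonicalize its output differently. The stand-ins `SmilesStub` and `ReactionStub`, the helper names, and the sample reaction are all hypothetical, so the snippet runs without the gem.

```ruby
# Hypothetical stand-ins; real objects come from cdx.reactions.
SmilesStub = Struct.new(:smiles) do
  def get_cano_smiles
    smiles
  end
end
ReactionStub = Struct.new(:reactants, :reagents, :products, :reagent_smiles)

# Join one group of molecules (plus optional extra SMILES strings)
# into a dot-separated fragment, dropping blank entries.
def group_smiles(molecules, extra = [])
  all = molecules.map(&:get_cano_smiles) + extra
  all.reject { |s| s.nil? || s.empty? }.join('.')
end

# Compose "reactants>agents>products"; abbreviations from reagent_smiles
# are appended to the agents field.
def compose_reaction_smiles(reaction)
  [
    group_smiles(reaction.reactants),
    group_smiles(reaction.reagents, reaction.reagent_smiles),
    group_smiles(reaction.products)
  ].join('>')
end

# Made-up example: esterification with sulfuric acid given as an abbreviation.
reaction = ReactionStub.new(
  [SmilesStub.new('CCO'), SmilesStub.new('CC(=O)O')],
  [],
  [SmilesStub.new('CCOC(C)=O')],
  ['OS(=O)(=O)O']
)

puts compose_reaction_smiles(reaction)
```

An empty group simply yields an empty field, so a reaction without reagents would come out as `reactants>>products`.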
A reaction itself has description, yield, time, temperature, and details properties. All of these are extracted from the ChemDraw scheme, except the details field, which holds additional information from PerkinELN.
Some multi-step reactions can also be recognized. If a reaction is a multi-step reaction, its steps can be accessed via:
# Get first scanned reaction
reaction = cdx.reactions.first
# Access first step
step = reaction.steps.first
step.number # Should be 1
step.description
step.time
step.temperature
# List reagents SMILES
step.reagents
Each step has the following properties: number, description, time, temperature, and reagents.
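The step properties can be combined into a short, readable summary. The sketch below assumes that time and temperature are strings (or nil when absent) and that reagents is an array of SMILES strings; the README does not state these types, so treat them as assumptions. `StepStub`, `describe_step`, and the sample steps are hypothetical and only stand in for `reaction.steps`.

```ruby
# Hypothetical stand-in for a reaction step; real objects come from
# reaction.steps. Field types here are assumptions.
StepStub = Struct.new(:number, :description, :time, :temperature, :reagents)

# Format one step as "Step N: reagents (temperature, time)".
# The conditions part is omitted when neither value is present.
def describe_step(step)
  conditions = [step.temperature, step.time].compact.reject(&:empty?)
  line = "Step #{step.number}: #{step.reagents.join(', ')}"
  line += " (#{conditions.join(', ')})" unless conditions.empty?
  line
end

# Made-up two-step sequence.
steps = [
  StepStub.new(1, 'deprotonation', '1 h', '-78 °C', ['[Li]CCCC']),
  StepStub.new(2, 'quench', nil, nil, ['O'])
]

steps.sort_by(&:number).each { |step| puts describe_step(step) }
```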
The usage and API of the CDX, CDXML, and PerkinELN classes are described above. Their outputs are simply molecules and reactions.
The DOC and DOCX classes are a little different, since a DOC or DOCX file can contain more than one embedded ChemDraw scheme, each of which is one CDX scheme. ChemScanner attempts to extract all of them and puts them into one Hash, called cdx_map.
require 'chem_scanner'
doc = ChemScanner::Doc.new
doc.read('/path/to/doc/file')
doc.cdx_map.each do |key, cdx|
  puts cdx.reactions.map(&:reaction_smiles)
end
# Access all molecules in all CDXs
doc.molecules.map(&:get_cano_smiles)
# Access all reactions in all CDXs
doc.reactions.map(&:reaction_smiles)
DOCX differs slightly: ChemScanner can extract each CDX together with its preview image from the document.
require 'chem_scanner'
docx = ChemScanner::Docx.new
docx.read('/path/to/docx/file')
docx.cdx_map.each do |key, cdx_info|
  # Get the CDX scheme
  cdx = cdx_info[:cdx]
  puts cdx.reactions.map(&:reaction_smiles)
  # Preview image, used for the ChemScanner UI
  img_ext = cdx_info[:img_ext] # e.g. '.png' or '.emf'
  img_b64 = cdx_info[:img_b64] # Base64-encoded image
end
# Access all molecules in all CDXs
docx.molecules.map(&:get_cano_smiles)
# Access all reactions in all CDXs
docx.reactions.map(&:reaction_smiles)
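The preview images can be written to disk by decoding the Base64 payload. The following is a minimal sketch, assuming each cdx_map value has the :img_ext and :img_b64 entries shown above and that :img_ext includes the leading dot. The helper `save_previews`, the sample key, and the fake payload are hypothetical; the type of the cdx_map keys is not documented here, so the key is converted to a filesystem-safe name first.

```ruby
require 'tmpdir'

# Decode every preview image in a cdx_map-like Hash and write it to out_dir.
# Returns the list of written file paths.
def save_previews(cdx_map, out_dir)
  cdx_map.each_with_object([]) do |(key, info), paths|
    next if info[:img_b64].nil? || info[:img_b64].empty?

    # unpack1('m') decodes Base64 without requiring the base64 library.
    bytes = info[:img_b64].unpack1('m')
    # Keys may not be filesystem-safe, so replace anything unusual.
    name = key.to_s.gsub(/[^\w.-]/, '_')
    path = File.join(out_dir, "#{name}#{info[:img_ext]}")
    File.binwrite(path, bytes)
    paths << path
  end
end

# Stand-in data shaped like docx.cdx_map; key and payload are made up.
fake_map = {
  'scheme 1' => {
    cdx: nil,
    img_ext: '.png',
    img_b64: ['fake image bytes'].pack('m0')
  }
}

Dir.mktmpdir do |dir|
  written = save_previews(fake_map, dir)
  puts written.map { |path| File.basename(path) }
end
```

With a real document you would pass `docx.cdx_map` and a directory of your choice instead of the temporary one.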
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Bug reports and pull requests are welcome on GitHub at https://github.com/ComPlat/chem_scanner. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
The gem is available as open source under the terms of the GNU AGPLv3 License.