Science Parse parses scientific papers (in PDF form) and returns them in structured form. As of today, it supports these fields:
In JSON format, the output looks like this (or like this, if you want sections). The easiest way to get started is to use the output from this server.
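As a rough illustration of using a Science Parse server from Java, the sketch below builds a POST request carrying a PDF and prints the response. The endpoint `http://localhost:8080/v1` and the `application/pdf` content type are assumptions made for this example, not details stated in this README, and `ScienceParseClientSketch` is a hypothetical name. Check the server's own document for the actual URL and request format.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

final class ScienceParseClientSketch {
    // Assumed endpoint of a locally running server; replace with the real one.
    static final URI ASSUMED_ENDPOINT = URI.create("http://localhost:8080/v1");

    // Builds a POST request carrying the raw PDF bytes. Nothing is sent here.
    static HttpRequest buildRequest(URI endpoint, byte[] pdfBytes) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "application/pdf") // assumed content type
                .POST(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: ScienceParseClientSketch <path to PDF>");
            return;
        }
        byte[] pdfBytes = Files.readAllBytes(Path.of(args[0]));
        HttpRequest request = buildRequest(ASSUMED_ENDPOINT, pdfBytes);

        // Send the request and print the status code and the response body as text.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Keeping the request construction in its own method makes it possible to inspect the request without a running server.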
There is a new version of science-parse out that works in a completely different way. It has fewer
features, but higher quality in the output. Check out the details at https://github.com/allenai/spv2.
There are three different ways to get started with SP. Each has its own document:
The current version is 3.0.0. If you want to include it in your own project, use this:
For SBT:
libraryDependencies += "org.allenai" %% "science-parse" % "3.0.0"
For Maven:
<dependency>
  <groupId>org.allenai</groupId>
  <artifactId>science-parse_2.12</artifactId>
  <version>3.0.0</version>
</dependency>
The first time you run it, SP will download some rather large model files. Don’t be alarmed! The model files are cached, and startup is much faster the second time.
For licensing reasons, SP does not include libraries for some image formats. Without these
libraries, SP cannot process PDFs that contain images in these formats. If you have no
licensing restrictions in your project, we recommend you add these additional dependencies to your
project as well:
"com.github.jai-imageio" % "jai-imageio-core" % "1.2.1",
"com.github.jai-imageio" % "jai-imageio-jpeg2000" % "1.3.0", // For handling jpeg2000 images
"com.levigo.jbig2" % "levigo-jbig2-imageio" % "1.6.5", // For handling jbig2 images
This project is a hybrid between Java and Scala. The interaction between the languages is fairly seamless, and SP can be used as a library in any JVM-based language.
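To check from JVM code whether the optional image-format libraries mentioned above are on the classpath, one option is to ask the JDK's `ImageIO` registry. The sketch below uses only standard `javax.imageio` calls. The format names `jpeg2000` and `jbig2` are assumptions about what those plugins register, and `ImageReaderCheck` is a hypothetical name.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

final class ImageReaderCheck {
    // Returns true if some ImageIO plugin on the classpath can read the named format.
    static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Re-scan the classpath so newly added plugin jars are registered.
        ImageIO.scanForPlugins();

        // Assumed format names for the optional JPEG 2000 and JBIG2 plugins.
        for (String format : new String[] {"jpeg2000", "jbig2"}) {
            System.out.println(format + " reader available: " + hasReader(format));
        }
    }
}
```

If either line reports `false`, PDFs with images in that format are the ones likely to fail.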
Our build system is sbt. To build science-parse, you have to have sbt installed and working. You can
find details about that at https://www.scala-sbt.org.
Once you have sbt set up, just start sbt in the main project folder to launch sbt’s shell. There are many things you can do in the shell, but here are the most important ones:
+test runs all the tests in all the projects across Scala versions.
cli/assembly builds a runnable superjar (i.e., a jar with all dependencies bundled) for the command-line interface. You can run it with java -Xmx10g -jar <location of superjar>.
server/assembly builds a runnable superjar for the webserver.
server/run starts the server directly from the sbt shell.
This project uses Lombok, which requires you to enable annotation processing inside of an IDE.
For IntelliJ, install the Lombok plugin and enable annotation processing.
Lombok has a lot of useful annotations that give you some of the nice things in Scala:
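val declares a final local variable whose type is inferred from the right-hand side, so you get type inference via some tricks.
@Data generates getters, setters for non-final fields, equals, hashCode, toString, and a constructor for the required fields.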
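To make the two annotations concrete, here is a plain-Java approximation of what they stand for, written without Lombok so that it compiles on its own. The `Paper` class and its fields are hypothetical and exist only for illustration. Lombok's real generated code differs in some details (for example, it also adds a `canEqual` method).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// With Lombok, one would write roughly:
//   @Data class Paper { private final String title; private int year; }
// A hand-written approximation of what that generates:
final class Paper {
    private final String title; // final field: getter only, set through the constructor
    private int year;           // non-final field: getter and setter

    public Paper(String title) { this.title = title; }

    public String getTitle() { return title; }
    public int getYear() { return year; }
    public void setYear(int year) { this.year = year; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Paper)) return false;
        Paper other = (Paper) o;
        return year == other.year && Objects.equals(title, other.title);
    }

    @Override public int hashCode() { return Objects.hash(title, year); }

    @Override public String toString() {
        return "Paper(title=" + title + ", year=" + year + ")";
    }
}

final class LombokEquivalents {
    static List<String> titles() {
        // Lombok:     val titles = new ArrayList<String>();
        // Plain Java: final, with the right-hand-side type spelled out.
        final ArrayList<String> titles = new ArrayList<String>();
        titles.add("Example");
        return titles;
    }
}
```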
Special thanks go to @kermitt2, whose work on kermitt2/grobid inspired Science Parse and helped us get started with some labeled data.
This project releases to BinTray. To make a release:
git tag -a vX.Y.Z -m "Release X.Y.Z" (replacing X.Y.Z with the correct version)
git push origin vX.Y.Z
sbt +publish (the “+” is required to cross-compile)
Update the version in build.sbt on master (and push!) to X.Y.Z+1 (e.g., 2.5.1 after releasing 2.5.0).
If you make a mistake, you can roll back the release with sbt bintrayUnpublish and retag the version to a different commit as necessary.
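The version bump after a release is plain patch-level arithmetic on an X.Y.Z string. A minimal sketch, assuming versions always have exactly three numeric parts (`VersionBump` and `nextPatch` are hypothetical names, not part of the build):

```java
final class VersionBump {
    // Returns the next patch version for a plain "X.Y.Z" string.
    static String nextPatch(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected X.Y.Z, got: " + version);
        }
        int patch = Integer.parseInt(parts[2]) + 1;
        return parts[0] + "." + parts[1] + "." + patch;
    }

    public static void main(String[] args) {
        System.out.println(nextPatch("2.5.0")); // prints 2.5.1
    }
}
```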