Natural language processing pipeline for book-length documents (archival Java version; for current Python version, see: https://github.com/booknlp/booknlp)
BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including part-of-speech tagging, dependency parsing, named entity recognition, character name clustering, quotation speaker identification and pronominal coreference resolution.
This pipeline is described in the following paper; please cite if you write a research paper using this software:
David Bamman, Ted Underwood and Noah Smith, “A Bayesian Mixed Effects Model of Literary Character,” ACL 2014.
Download external jars (which are sadly too big for GitHub’s 100MB file size limit)
From the command line, run the following:
./runjava novels/BookNLP -doc data/originalTexts/dickens.oliver.pg730.txt -printHTML -p data/output/dickens -tok data/tokens/dickens.oliver.tokens -f
(On a 2.6 GHz MBP, this takes about 3.5 minutes)
This runs the BookNLP pipeline on “Oliver Twist” in the data/originalTexts directory and writes the processed document to data/tokens/dickens.oliver.tokens, along with diagnostic info to data/output/dickens. To run on your own texts, change the following flags:
-doc : the original text file to process
-tok : the token file to write the processed document to (if this file already exists, the pipeline reads it instead of re-processing the text, unless -f is given)
-p : the directory to write all diagnostic files to. Creates the directory if it does not already exist.
-id : a unique book ID for this book (output files include this in the filename)
-printHTML : also print the text as an HTML file with character aliases, coref and speaker ID annotated
-f : force the (slower) syntactic processing of the original text file, even if the token file specified by -tok already exists
The main output here is data/tokens/dickens.oliver.tokens, which contains the original book, one token per line, with part of speech, syntax, NER, coreference and other annotations in tab-separated columns.
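As a rough sketch of consuming this file, each line can be split on tabs. The sample row and the column positions below are hypothetical (not the pipeline’s actual schema); check the header row of a generated .tokens file for the real layout.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch: read one line of the tab-separated token file.
public class TokenLine {
    public static List<String> parse(String line) {
        // limit of -1 keeps trailing empty fields instead of dropping them
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        // hypothetical row: ids, word, POS, NER (real files have more columns)
        String line = "0\t0\t1\tOliver\tNNP\tPERSON";
        List<String> fields = parse(line);
        System.out.println(fields.size() + " columns, token = " + fields.get(3));
    }
}
```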
The data/output/dickens folder will now contain the diagnostic files, including (with -printHTML) an HTML version of the book with character aliases, coreference and speaker IDs annotated.
With Apache Ant installed, running ant compiles everything.
Coreference only needs to be trained when there’s new training data (or new feature ideas: current features are based on syntactic tree distance, linear distance, POS identity, gender matching, quotation scope and salience).
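To make a couple of those feature families concrete, here is an illustrative sketch of extracting linear-distance and POS-identity features for a mention pair. This is not the pipeline’s actual feature code; the Mention fields, feature names and distance bucket are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of binary features over an (anaphor, candidate antecedent) pair.
public class CorefFeatures {
    static class Mention {
        final int tokenId;   // position in the document (hypothetical field)
        final String pos;    // part-of-speech tag (hypothetical field)
        Mention(int tokenId, String pos) { this.tokenId = tokenId; this.pos = pos; }
    }

    static Map<String, Double> extract(Mention anaphor, Mention antecedent) {
        Map<String, Double> feats = new LinkedHashMap<>();
        // linear distance, bucketed into a single indicator for illustration
        int dist = Math.abs(anaphor.tokenId - antecedent.tokenId);
        feats.put("linearDistance<=10", dist <= 10 ? 1.0 : 0.0);
        // POS identity: do the two mentions share a part-of-speech tag?
        feats.put("samePOS", anaphor.pos.equals(antecedent.pos) ? 1.0 : 0.0);
        return feats;
    }

    public static void main(String[] args) {
        Mention he = new Mention(120, "PRP");      // pronoun anaphor
        Mention oliver = new Mention(115, "NNP");  // proper-noun antecedent
        System.out.println(extract(he, oliver));
    }
}
```

A real model would conjoin many such indicators (tree distance, gender match, quotation scope, salience) and score each candidate with the trained weights.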
Coreference annotated data is located in the coref/ directory.
annotatedData.txt contains coreference annotations in a tab-separated format, keyed to token IDs in the token files described below.
bookIDs are mapped to their respective token files in docPaths.txt. All of these token files are located in finalTokenData/. These token files are read-only: since the annotations are keyed to specific token IDs in those files, the files must stay unchanged.
Given the coref/ folder above, train new coreference weights with:
./runjava novels.training/TrainCoref -training coref/annotatedData.txt -o coref/weights.txt
-training specifies the input training file
-o specifies the output file to write the trained weights to
Two parameters control the amount of regularization in the model. Higher regularization dampens the impact of any single feature, and L1 regularization removes features from the model entirely; both help prevent overfitting to the training data.
-l1 specifies the L1 regularization parameter (higher = more weights end up driven to 0). Default = 2
-l2 specifies the L2 regularization parameter (higher = weights shrink faster). Default = .1
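The different behavior of the two penalties can be sketched with their standard per-weight update rules (this is illustrative, not the trainer’s actual optimization code): with step size eta, L1 subtracts a constant and clips at zero (soft thresholding), so small weights are driven exactly to 0, while L2 shrinks every weight proportionally, so weights get small but rarely reach exactly 0.

```java
// Sketch of how L1 vs. L2 penalties act on a single weight per update step.
public class Regularization {
    // L1 proximal step: soft thresholding; small weights become exactly 0
    static double l1Step(double w, double lambda, double eta) {
        double shrunk = Math.abs(w) - eta * lambda;
        return shrunk <= 0 ? 0.0 : Math.signum(w) * shrunk;
    }

    // L2 step: multiplicative shrinkage toward 0, but never an exact 0
    static double l2Step(double w, double lambda, double eta) {
        return w * (1.0 - eta * lambda);
    }

    public static void main(String[] args) {
        // default l1 = 2: a small weight is zeroed out (feature removed)
        System.out.println(l1Step(0.05, 2.0, 0.1));  // prints 0.0
        // default l2 = 0.1: the same weight merely shrinks
        System.out.println(l2Step(0.05, 0.1, 0.1));  // prints 0.0495
    }
}
```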
To use the newly trained weights in the pipeline above, copy them to files/coref.weights or specify them on the novels.BookNLP command line with the -w flag.