Basic Utilities for PyTorch Natural Language Processing (NLP)
With the PyTorch toolchain maturing, it’s time to archive repos like this one. You’ll be able to find more developed options for every part of this toolkit:
Happy developing! ✨
Feel free to contact me if anyone wants to unarchive this repo and continue developing it. You can reach me at “petrochukm [at] gmail.com”.
PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. torchnlp extends PyTorch to provide you with basic text data processing functions.
Logo by Chloe Yeo, Corporate Sponsorship by WellSaid Labs
Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install pytorch-nlp using pip:
pip install pytorch-nlp
Or install the latest code directly from the repository:
pip install git+https://github.com/PetrochukM/PyTorch-NLP.git
The complete documentation for PyTorch-NLP is available
via our ReadTheDocs website.
Within an NLP data pipeline, you’ll want to implement these basic steps:
Load the IMDB dataset, for example:
from torchnlp.datasets import imdb_dataset
# Load the imdb training dataset
train = imdb_dataset(train=True)
train[0] # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}
Load a custom dataset, for example:
from pathlib import Path
from torchnlp.download import download_file_maybe_extract
directory_path = Path('data/')
train_file_path = Path('trees/train.txt')
download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])
open(directory_path / train_file_path)
Don’t worry, we’ll handle caching for you!
Tokenize and encode your text as a tensor.
For example, a WhitespaceEncoder breaks text into tokens whenever it encounters a whitespace character.
from torchnlp.encoders.text import WhitespaceEncoder
loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
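To make the idea concrete, here is a minimal pure-Python sketch of whitespace tokenization and index encoding. It is not torchnlp's implementation: the helper names `build_vocab` and `encode` and the reserved tokens `<pad>` and `<unk>` are assumptions made for illustration, and it returns plain lists rather than tensors.

```python
# Hypothetical sketch of whitespace encoding; not torchnlp's actual code.
def build_vocab(texts, reserved=("<pad>", "<unk>")):
    """Assign an integer index to every whitespace-separated token."""
    vocab = list(reserved)
    index = {token: i for i, token in enumerate(vocab)}
    for text in texts:
        for token in text.split():
            if token not in index:
                index[token] = len(vocab)
                vocab.append(token)
    return vocab, index


def encode(text, index, unknown="<unk>"):
    """Turn text into a list of indices; unseen tokens map to `unknown`."""
    return [index.get(token, index[unknown]) for token in text.split()]


loaded_data = ["now this ain't funny", "so don't you dare laugh"]
vocab, index = build_vocab(loaded_data)
encoded = [encode(text, index) for text in loaded_data]
print(encoded)  # [[2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

Reserving the first indices for padding and unknown tokens is a common convention, since it lets later steps pad batches and handle unseen words without rebuilding the vocabulary.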
With your loaded and encoded data in hand, you’ll want to batch your dataset.
import torch
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors
encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]
train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])
batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
PyTorch-NLP builds on top of PyTorch’s existing torch.utils.data.sampler, torch.stack and default_collate to support sequential inputs of varying lengths!
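The batching step above boils down to two ideas: group sequences of similar length, then pad each group to a common length. The pure-Python sketch below illustrates both under simplifying assumptions. The helpers `bucket_batches` and `pad_batch` are hypothetical names, the sketch sorts the whole dataset at once, and the library's sampler may behave differently (for instance in how it buckets and shuffles).

```python
# Hypothetical sketch of length bucketing and padding; not torchnlp's code.
def bucket_batches(lengths, batch_size):
    """Group example indices so each batch holds sequences of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def pad_batch(sequences, padding_index=0):
    """Right-pad every sequence to the longest one; also return true lengths."""
    longest = max(len(sequence) for sequence in sequences)
    padded = [list(s) + [padding_index] * (longest - len(s)) for s in sequences]
    return padded, [len(s) for s in sequences]


encoded_data = [[5, 6, 7, 8], [5, 6], [5, 6, 7, 8, 9], [5, 6, 7]]
batches = bucket_batches([len(s) for s in encoded_data], batch_size=2)
padded_batches = [pad_batch([encoded_data[i] for i in batch]) for batch in batches]
print(batches)            # [[1, 3], [0, 2]]
print(padded_batches[0])  # ([[5, 6, 0], [5, 6, 7]], [2, 3])
```

Batching by length keeps padding to a minimum: here each batch needs only one padded position.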
With your batch in hand, you can use PyTorch to develop and train your model using gradient descent.
For instance, check out this example code for training on the Stanford Natural Language Inference (SNLI) Corpus.
PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗
Now that you’ve set up your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that’s random with fork_rng and you’ll be good to go, like so:
import random
import numpy
import torch
from torchnlp.random import fork_rng
with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
This will always print:
Random: 224899943
Numpy: 843828735
Torch: 843828736
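If you are curious how this kind of determinism can be achieved, here is a minimal sketch for Python's built-in random module only. It assumes the general pattern is to save the generator state, seed it, run the block, and restore the saved state afterwards. The name `fork_python_rng` is hypothetical, and this is not how torchnlp's fork_rng is necessarily implemented.

```python
import random
from contextlib import contextmanager


# Hypothetical sketch covering only Python's `random` module.
@contextmanager
def fork_python_rng(seed):
    """Run a block with a fixed seed, then restore the previous RNG state."""
    state = random.getstate()  # remember where the global generator was
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)  # code outside the block is unaffected


with fork_python_rng(123):
    first = random.randint(1, 2**31)
with fork_python_rng(123):
    second = random.randint(1, 2**31)

print(first == second)  # True
```

Restoring the state in a `finally` clause means the surrounding program keeps its own random sequence even if the wrapped block raises.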
Now that you’ve computed your vocabulary, you may want to make use of
pre-trained word vectors to set your embeddings, like so:
import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe
encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])
vocab_set = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
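The loop above copies one pre-trained vector per vocabulary entry into a weight matrix. The sketch below shows the same lookup with plain Python lists and a toy vector table. The two-dimensional vectors, the helper name `build_embedding_weights`, and the choice of zeros for tokens without a pre-trained vector are all assumptions made for illustration.

```python
# Hypothetical toy vectors standing in for a real pre-trained table.
pretrained = {"now": [0.1, 0.2], "funny": [0.3, 0.4]}
vocab = ["<pad>", "<unk>", "now", "this", "funny"]
dim = 2


def build_embedding_weights(vocab, pretrained, dim):
    """Return one row per token; tokens without a vector get zeros (assumed)."""
    return [list(pretrained.get(token, [0.0] * dim)) for token in vocab]


weights = build_embedding_weights(vocab, pretrained, dim)
print(weights[2])  # [0.1, 0.2]
print(weights[3])  # [0.0, 0.0]
```

Row `i` of the matrix lines up with index `i` produced by the encoder, which is what lets an embedding layer look vectors up by token index.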
For example, from the neural network package, apply the state-of-the-art LockedDropout:
import torch
from torchnlp.nn import LockedDropout
input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)
# Apply a LockedDropout to `input_`
dropout(input_) # RETURNS: torch.FloatTensor (6x3x10)
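As commonly described, locked (or variational) dropout samples a single dropout mask and reuses it at every time step, instead of drawing a fresh mask per step. The sketch below shows that idea for one sequence using plain Python lists. The function name `locked_dropout` is hypothetical, it handles a single example rather than a batch, and it is not the library's module.

```python
import random


# Hypothetical sketch of locked dropout for a single sequence.
def locked_dropout(sequence, p=0.5, rng=random):
    """Apply one shared dropout mask to every time step of `sequence`.

    `sequence` is a list of time steps, each a list of feature values.
    """
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    if not sequence:
        return []
    keep = 1.0 - p
    # One mask entry per feature, scaled so the expected value is unchanged.
    mask = [1.0 / keep if rng.random() < keep else 0.0 for _ in sequence[0]]
    return [[value * m for value, m in zip(step, mask)] for step in sequence]


sequence = [[1.0] * 10 for _ in range(6)]  # 6 time steps, 10 features
output = locked_dropout(sequence, p=0.5, rng=random.Random(0))
print(all(step == output[0] for step in output))  # True
```

Because the mask is shared, a feature that is dropped at the first time step stays dropped for the whole sequence.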
Compute common NLP metrics such as the BLEU score.
from torchnlp.metrics import get_moses_multi_bleu
hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]
# Compute BLEU score with the official BLEU perl script
get_moses_multi_bleu(hypotheses, references, lowercase=True) # RETURNS: 47.9
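For intuition about what the score measures, here is a minimal sketch of corpus-level BLEU with a single reference per hypothesis: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The function name `simple_bleu` is hypothetical, tokenization is a plain whitespace split, and the official perl script supports more (such as multiple references), so treat this as an illustration rather than a replacement.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sketch: one reference per hypothesis, whitespace tokens.
def simple_bleu(hypotheses, references, max_n=4, lowercase=False):
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        if lowercase:
            hyp, ref = hyp.lower(), ref.lower()
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
            # Clip each n-gram count by how often it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]
score = simple_bleu(hypotheses, references, lowercase=True)
print(round(score, 1))  # 47.9
```

On this example the n-gram precisions are 8/8, 5/7, 3/6 and 2/5, and the brevity penalty is exp(1 - 10/8) because the hypothesis is shorter than the reference.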
Looking at the longer examples in examples/ may also help.
Need more help? We are happy to answer your questions via Gitter Chat.
We’ve released PyTorch-NLP because we found a lack of basic toolkits for NLP in PyTorch. We hope
that other organizations can benefit from the project. We are thankful for any contributions from
the community.
Read our contributing guide
to learn about our development process, how to propose bugfixes and improvements, and how to build
and test your changes to PyTorch-NLP.
torchtext and PyTorch-NLP differ in architecture and feature set; otherwise, they are similar.
torchtext and PyTorch-NLP both provide pre-trained word vectors, datasets, iterators and text encoders.
PyTorch-NLP also provides neural network modules and metrics. From an architecture standpoint,
torchtext is object-oriented with external coupling while PyTorch-NLP is object-oriented with
low coupling.
AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.
If you find PyTorch-NLP useful for an academic publication, then please use the following BibTeX to
cite it:
@misc{pytorch-nlp,
author = {Petrochuk, Michael},
title = {PyTorch-NLP: Rapid Prototyping with PyTorch Natural Language Processing (NLP) Tools},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/PetrochukM/PyTorch-NLP}},
}