This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.
Note
Quick legend on available resource types:
⭐ - open source project, usually a GitHub repository with its number of stars
📙 - resource you can read, usually a blog post or a paper
🗂️ - a collection of additional resources
🔱 - non-open source tool, framework or paid service
🎥️ - a resource you can watch
🎙️ - a resource you can listen to
Table of Contents
📇 Main Section |
🗃️ Sub-sections Sample |
NLP Resources |
Paper Summaries, Conference Summaries, NLP Datasets |
NLP Podcasts |
NLP-only Podcasts, Podcasts with many NLP Episodes |
NLP Newsletters |
- |
NLP Meetups |
- |
NLP YouTube Channels |
- |
NLP Benchmarks |
General NLU, Question Answering, Multilingual |
Research Resources |
Resources on Transformer Models, Distillation and Pruning, Automated Summarization |
Industry Resources |
Best Practices for NLP Systems, MLOps for NLP |
Speech Recognition |
General Resources, Text to Speech, Speech to Text, Datasets |
Topic Modeling |
Blogs, Frameworks, Repositories and Projects |
Keyword Extraction |
Text Rank, Rake, Other Approaches |
Responsible NLP |
NLP and ML Interpretability, Ethics, Bias, and Equality in NLP, Adversarial Attacks for NLP |
NLP Frameworks |
General Purpose, Data Augmentation, Machine Translation, Adversarial Attacks, Dialog Systems & Speech, Entity and String Matching, Non-English Frameworks, Text Annotation |
Learning NLP |
Courses, Books, Tutorials |
NLP Communities |
- |
Other NLP Topics |
Tokenization, Data Augmentation, Named Entity Recognition, Error Correction, AutoML/AutoNLP, Text Generation |
Note
Section keywords: paper summaries, compendium, awesome list
Compendiums and awesome lists on the topic of NLP:
NLP Conferences, Paper Summaries and Paper Compendiums:
Papers and Paper Summaries
Conference Summaries
NLP Progress and NLP Tasks:
NLP Datasets:
Word and Sentence embeddings:
Notebooks, Scripts and Repositories
Non-English resources and Compendiums
- ⭐ NLP Resources for Bahasa Indonesian [GitHub, 480 stars]
- ⭐ Indic NLP Catalog [GitHub, 552 stars]
- ⭐ Pre-trained language models for Vietnamese [GitHub, 653 stars]
- ⭐ Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 814 stars]
- ⭐ Indic NLP Library [GitHub, 550 stars]
- ⭐ AI4Bharat-IndicNLP Portal
- ⭐ ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 387 stars]
- ⭐ zemberek-nlp - NLP tools for Turkish [GitHub, 1146 stars]
- ⭐ TDD AI - An open-source platform for all Turkish datasets, language models, and NLP tools.
- ⭐ KLUE - Korean Language Understanding Evaluation [GitHub, 560 stars]
- ⭐ Persian NLP Benchmark - benchmark for evaluation and comparison of various NLP tasks in Persian language [GitHub, 73 stars]
- ⭐ nlp-greek - Greek language sources [GitHub, 5 stars]
- ⭐ Awesome NLP Resources for Hungarian [GitHub, 221 stars]
Pre-trained NLP models
NLP History
General
2020 Year in Review
🔙 Back to the Table of Contents
NLP-only podcasts
Many NLP episodes
Some NLP episodes
🔙 Back to the Table of Contents
General NLU
- ⭐ GLUE - General Language Understanding Evaluation (GLUE) benchmark
- ⭐ SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
- ⭐ decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
- ⭐ dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [GitHub, 280 stars]
- ⭐ DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking
- ⭐ Big-Bench - collaborative benchmark for measuring and extrapolating the capabilities of language models [GitHub, 2835 stars]
Summarization
- ⭐ WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset
- ⭐ WikiLingua - A Multilingual Abstractive Summarization Dataset
Question Answering
- ⭐ SQuAD - Stanford Question Answering Dataset (SQuAD)
- ⭐ XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
- ⭐ GrailQA - Strongly Generalizable Question Answering (GrailQA)
- ⭐ CSQA - Complex Sequential Question Answering
Multilingual and Non-English Benchmarks
- 📙 XTREME - Massively Multilingual Multi-task Benchmark
- ⭐ GLUECoS - A benchmark for code-switched NLP
- ⭐ IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
- ⭐ LinCE - Linguistic Code-Switching Evaluation Benchmark
- ⭐ Russian SuperGlue - Russian SuperGlue Benchmark
Bio, Law, and other scientific domains
- ⭐ BLURB - Biomedical Language Understanding and Reasoning Benchmark
- ⭐ BLUE - Biomedical Language Understanding Evaluation benchmark
- ⭐ LexGLUE - A Benchmark Dataset for Legal Language Understanding in English
Transformer Efficiency
Speech Processing
- ⭐ SUPERB - Speech processing Universal PERformance Benchmark
Other
🔙 Back to the Table of Contents
General
Embeddings
Repositories
Blogs
Cross-lingual Word and Sentence Embeddings
- ⭐ vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 644 stars]
- ⭐ sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]
Byte Pair Encoding
- ⭐ bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1179 stars]
- ⭐ subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 2185 stars]
- ⭐ python-bpe - Byte Pair Encoding for Python [GitHub, 223 stars]
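The libraries above build on byte pair encoding, which repeatedly merges the most frequent adjacent symbol pair in a corpus. The following is a minimal sketch of the merge-learning step only, assuming a toy word-frequency dictionary. The function names (`get_pair_counts`, `merge_pair`, `learn_bpe`), the `toy_corpus` data, and the `</w>` end-of-word marker are illustrative choices and not the API of bpemb, subword-nmt, or python-bpe.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dict."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties resolve to the first pair seen
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Hypothetical toy corpus: word -> frequency.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_corpus, num_merges=3)
print(merges)
```

On this toy corpus the three learned merges build up the shared suffix `est</w>`, so `newest` and `widest` end up sharing one subword unit. Production tokenizers add details this sketch leaves out, such as applying the learned merges to unseen text.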
Transformer-based Architectures
General
- 📙 The Transformer Family by Lilian Weng [Blog, 2020]
- 📙 Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
- 📙 Attention? Attention! by Lilian Weng [Blog, 2018]
- 📙 the transformer … “explained”? [Blog, 2019]
- 🎥️ Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
- 📙 Attention Is Off By One [July, 2023]
- 🎥️ Understanding and Applying Self-Attention for NLP [Talk, 2018]
- 📙 The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
- 📙 Pre-Trained Models: Past, Present and Future [Paper, June 2021]
- 📙 A Survey of Transformers [Paper, June 2021]
Transformer
- 📙 The Annotated Transformer by Harvard NLP [Blog, 2018]
- 📙 The Illustrated Transformer by Jay Alammar [Blog, 2018]
- 📙 Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
- 📙 Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
- 📙 Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
- 📙 Reformer: The Efficient Transformer [Blog, 2020]
- 📙 Longformer — The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
- 📙 TRANSFORMERS FROM SCRATCH [Blog, 2019]
- 📙 Transformers in Natural Language Processing — A Brief Survey by George Ho [Blog, May 2020]
- ⭐ Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 596 stars]
- 📙 Transformers from Scratch [Blog, Oct 2021]
BERT
- 📙 A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
- 📙 The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
- 📙 Understanding searches better than ever before [Blog, 2019]
- 📙 Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
- ⭐ SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 286 stars]
- ⭐ BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 574 stars]
- ⭐ Optimal Subarchitecture Extraction for BERT [GitHub, 470 stars]
- ⭐ CharacterBERT: Reconciling ELMo and BERT [GitHub, 195 stars]
- 📙 When BERT Plays The Lottery, All Tickets Are Winning [Blog, Dec 2020]
- ⭐ BERT-related Papers - a list of BERT-related papers [GitHub, 2032 stars]
Other Transformer Variants
T5
BigBird
Reformer / Linformer / Longformer / Performers
- 🎥️ Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
- 🎥️ Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
- 🎥️ Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
- 🎥️ Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
- ⭐ performer-pytorch - An implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1084 stars]
Switch Transformer
GPT-family
General
GPT-3
Learning Resources
Applications
- ⭐ Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 4589 stars]
- 🗂️ GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
- 🗂️ GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
- 🔱 OpenAI API - API Demo to use OpenAI GPT for commercial applications
Open-source Efforts
Other
Distillation, Pruning and Quantization
Reading Material
Tools
- ⭐ Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 79 stars]
- ⭐ XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 153 stars]
Automated Summarization
- 📙 PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
- ⭐ CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 146 stars]
- ⭐ XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 252 stars]
- ⭐ SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 265 stars]
- ⭐ PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 151 stars]
- ⭐ summarus - Models for automatic abstractive summarization [GitHub, 170 stars]
Knowledge Graphs and NLP
Note
Section keywords: best practices, MLOps
🔙 Back to the Table of Contents
Best Practices for building NLP Projects
MLOps for NLP
MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.
In general, MLOps for NLP includes having the following processes in place:
- Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
- Experiment Tracking - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
- Model Registry - make sure any neural models you train are versioned and tracked, so that it is easy to roll back to any of them
- Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
- Model Deployment and Serving - automate model deployment, ideally with zero-downtime strategies such as blue/green or canary deploys
- Data and Model Observability - track data drift, model accuracy drift etc.
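As a rough illustration of the behavioral testing item above, the following sketch runs a simple invariance check: swapping a person's name in a sentence should not change the predicted sentiment. `toy_sentiment` is a hypothetical keyword-based stand-in for a real model, and `invariance_failures` is an illustrative helper, not part of CheckList or any other tool listed in this document.

```python
import re

# Hypothetical stand-in for a real sentiment model: a tiny keyword lexicon.
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def toy_sentiment(text):
    """Return 'pos', 'neg', or 'neutral' from simple keyword counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"


def invariance_failures(predict, template, fillers):
    """Fill `template` with each filler and return the fillers whose
    prediction differs from the prediction for the first filler."""
    baseline = predict(template.format(name=fillers[0]))
    return [
        name
        for name in fillers[1:]
        if predict(template.format(name=name)) != baseline
    ]


# Illustrative test data (assumed, not taken from any benchmark).
names = ["Anna", "Mohammed", "Wei", "Lupita"]
failures = invariance_failures(
    toy_sentiment, "{name} said the service was great.", names
)
print(failures)  # an empty list means the prediction ignored the name
```

A check like this can run next to regular unit tests in CI: any name that ends up in `failures` points to a prediction that depends on the name rather than on the sentiment words.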
Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:
- Feature Store - centralized storage of all features developed for ML models that can be easily reused by any other ML project
- Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.
MLOps Compilations & Awesome Lists
Reading Material
- 📙 Machine Learning Operations (MLOps): Overview, Definition, and Architecture [Paper, May 2022]
- 📙 Requirements and Reference Architecture for MLOps: Insights from Industry [Paper, Oct 2022]
- 📙 MLOps: What It Is, Why it Matters, and How To Implement It by Neptune AI [Blog, July 2021]
- 📙 Best MLOps Tools You Need to Know as a Data Scientist by Neptune AI [Blog, July 2021]
- 📙 State of MLOps 2021 by Valohai [Blog, August 2021]
- 📙 The MLOps Stack by Valohai [Blog, October 2020]
- 📙 Data Version Control for Machine Learning Applications by Megagon AI [Blog, July 2021]
- 📙 The Rapid Evolution of the Canonical Stack for Machine Learning [Blog, July 2021]
- 📙 MLOps: Comprehensive Beginner’s Guide [Blog, March 2021]
- 📙 What I’ve learned about MLOps from speaking with 100+ ML practitioners [Blog, May 2021]
- 📙 DataRobot Challenger Models - MLOps Champion/Challenger Models
- 📙 State of MLOps Blog by Dr. Ori Cohen
- 📙 MLOps Ecosystem Overview [Blog, 2021]
Learning Material
MLOps Communities
Data Versioning
- ⭐ DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- 🔱 Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
Experiment Tracking
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- 🔱 Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- 🔱 SigOpt - automate training & tuning, visualize & compare runs [Paid Service]
- ⭐ Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
- ⭐ Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
Model Registry
- ⭐ DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- ⭐ ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1696 stars]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- 🔱 Valohai - End-to-end ML pipelines [Paid Service]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- 🔱 polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
Automated Testing and Behavioral Testing
- ⭐ CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]
- ⭐ TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
- ⭐ WildNLP - Corrupt an input text to test NLP models’ robustness [GitHub, 76 stars]
- ⭐ Great Expectations - Write tests for your data [GitHub, 9874 stars]
- ⭐ Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 3582 stars]
Model Deployment and Serving
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- 🔱 Amazon SageMaker [Paid Service]
- 🔱 Valohai - End-to-end ML pipelines [Paid Service]
- 🔱 NLP Cloud - Production-ready NLP API [Paid Service]
- 🔱 Saturn Cloud [Paid Service]
- 🔱 SELDON - machine learning deployment for enterprise [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- 🔱 polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- ⭐ TorchServe - flexible and easy to use tool for serving PyTorch models [GitHub, 4174 stars]
- ⭐ Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
- ⭐ KFServing - Serverless Inferencing on Kubernetes [GitHub, 3504 stars]
- ⭐ TFX - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Free and Open Source]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- 🔱 Cortex - containers as a service on AWS [Paid Service]
- 🔱 Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
- ⭐ End2End Serverless Transformers On AWS Lambda [GitHub, 121 stars]
- ⭐ NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
- ⭐ Dagster - data orchestrator for machine learning [Free and Open Source]
- 🔱 Verta - AI and machine learning deployment and operations [Paid Service]
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
- ⭐ flyte - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 5525 stars]
- ⭐ MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
- 🔱 DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
Model Debugging
- ⭐ imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]
- ⭐ Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 474 stars]
Model Accuracy Prediction
- ⭐ WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 1453 stars]
Data and Model Observability
General
- ⭐ Arize AI - embedding drift monitoring for NLP models
- ⭐ Arize-Phoenix - ML observability for LLMs, vision, language, and tabular models
- ⭐ whylogs - open source standard for data and ML logging [GitHub, 2636 stars]
- ⭐ Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 3843 stars]
- ⭐ MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
- 🔱 DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
- 🔱 Cortex - containers as a service on AWS [Paid Service]
Model Centric
- 🔱 Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
- 🔱 Dataiku - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
- ⭐ Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
- 🔱 Fiddler - ML Model Performance Management Tool [Paid Service]
- 🔱 Hydrosphere - open-source platform for managing ML models [Paid Service]
- 🔱 Verta - AI and machine learning deployment and operations [Paid Service]
- 🔱 Domino Model Ops - Deploy and Manage Models to Drive Business Impact [Paid Service]
Data Centric
- 🔱 Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
- 🔱 acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
- 🔱 Bigeye - monitoring and alerting to your datasets in minutes [Paid Service]
- 🔱 datakin - end-to-end, real-time data lineage solution [Paid Service]
- 🔱 Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
- 🔱 SODA - data monitoring, testing and validation [Paid Service]
Feature Stores
- 🔱 Tecton - enterprise feature store for machine learning [Paid Service]
- ⭐ FEAST - open source feature store for machine learning Website [GitHub, 5525 stars]
- 🔱 Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]
Metadata Management
- ⭐ ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 617 stars]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
MLOps Frameworks
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
- ⭐ kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 9883 stars]
- ⭐ Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 4353 stars]
- ⭐ ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 3972 stars]
- 🔱 Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
- ⭐ Diffgram - Complete training data platform for machine learning delivered as a single application [GitHub, 1834 stars]
- 🔱 Continual.ai - build, deploy, and operationalize ML models easier and faster with a declarative interface on cloud data warehouses like Snowflake, BigQuery, RedShift, and Databricks. [Paid Service]
Transformer-based Architectures
🔙 Back to the Table of Contents
General
Multi-GPU Transformers
Training Transformers Effectively
Embeddings as a Service
NLP Recipes Industrial Applications:
NLP Applications in Bio, Finance, Legal and other industries
- ⭐ Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 636 stars]
- ⭐ Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1688 stars]
- ⭐ FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 197 stars]
- ⭐ LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 692 stars]
- ⭐ NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
- ⭐ Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 613 stars]
- ⭐ BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 338 stars]
Note
Section keywords: speech recognition
🔙 Back to the Table of Contents
General Speech Recognition
- ⭐ wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
- ⭐ DeepSpeech - Mozilla's open-source implementation of Baidu’s DeepSpeech architecture [GitHub, 25166 stars]
- 📙 Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
- ⭐ kaldi - Kaldi is a toolkit for speech recognition [GitHub, 14177 stars]
- ⭐ awesome-kaldi - resources for using Kaldi [GitHub, 532 stars]
- ⭐ ESPnet - End-to-End Speech Processing Toolkit [GitHub, 8355 stars]
- 📙 HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]
Text to Speech / Speech Generation
- ⭐ FastSpeech - The Implementation of FastSpeech based on pytorch [GitHub, 857 stars]
- ⭐ TTS - a deep learning toolkit for Text-to-Speech [GitHub, 34356 stars]
- 🔱 NotebookLM - Google Gemini powered personal assistant / podcast generator
Speech to Text
- ⭐ whisper - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 68884 stars]
- ⭐ vibe - GUI tool to work with whisper, multilingual and cuda support included [GitHub, 931 stars]
Datasets
- ⭐ VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 507 stars]
Note
Section keywords: topic modeling
🔙 Back to the Table of Contents
Blogs
Frameworks for Topic Modeling
- ⭐ gensim - framework for topic modeling [GitHub, 15597 stars]
- ⭐ Spark NLP [GitHub, 3826 stars]
Repositories
Note
Section keywords: keyword extraction
🔙 Back to the Table of Contents
Text Rank
- ⭐ PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 2132 stars]
- ⭐ textrank - TextRank implementation for Python 3 [GitHub, 1248 stars]
RAKE - Rapid Automatic Keyword Extraction
- ⭐ rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 1061 stars]
- ⭐ yake - Single-document unsupervised keyword extraction [GitHub, 1632 stars]
- ⭐ RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 375 stars]
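The RAKE implementations listed here share one core idea: split the text into candidate phrases at stopwords and punctuation, then score each phrase by summing degree(word) / frequency(word) over its words. The following is a simplified sketch of that scoring scheme, assuming a tiny illustrative stopword list. `candidate_phrases` and `rake_scores` are hypothetical names and not the API of rake-nltk or any other library listed here.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real tools ship much larger ones).
STOPWORDS = {
    "a", "an", "and", "are", "as", "for", "in", "is",
    "of", "on", "over", "the", "to", "with",
}


def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Rank candidate phrases by the sum of degree(w) / frequency(w) over their words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences inside the phrase, including the word itself.
            degree[word] += len(phrase)
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in phrases
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text = "Compatibility of systems of linear constraints over the set of natural numbers"
ranked = rake_scores(text)
for phrase, score in ranked:
    print(f"{score:.1f}  {phrase}")
```

Multi-word phrases such as "linear constraints" score higher than single words because every word in a longer phrase gets a larger degree, which is the reason RAKE tends to favor longer keyphrases.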
Other Approaches
- ⭐ flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5583 stars]
- ⭐ BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 254 stars]
- ⭐ keyBERT - Minimal keyword extraction with BERT [GitHub, 3471 stars]
- ⭐ KeyphraseVectorizers - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 251 stars]
Further Reading
Note
Section keywords: ethics, responsible NLP
🔙 Back to the Table of Contents
NLP and ML Interpretability
NLP-centric
General
- ⭐ Language Interpretability Tool (LIT) [GitHub, 3474 stars]
- ⭐ WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 468 stars]
- ⭐ Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 413 stars]
- ⭐ InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 6238 stars]
- ⭐ thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 143 stars]
- ⭐ Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 342 stars]
- ⭐ imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]
Ethics, Bias, and Equality in NLP
Adversarial Attacks for NLP
Hate Speech Analysis
- ⭐ HateXplain - BERT for detecting abusive language [GitHub, 187 stars]
Note
Section keywords: frameworks
🔙 Back to the Table of Contents
General Purpose
- ⭐ spaCy by Explosion AI [GitHub, 29784 stars]
- ⭐ flair by Zalando [GitHub, 13855 stars]
- ⭐ AllenNLP by AI2 [GitHub, 11740 stars]
- ⭐ stanza (formerly Stanford NLP) [GitHub, 7253 stars]
- ⭐ spaCy stanza [GitHub, 723 stars]
- ⭐ nltk [GitHub, 13489 stars]
- ⭐ gensim - framework for topic modeling [GitHub, 15597 stars]
- ⭐ pororo - Platform of neural models for natural language processing [GitHub, 1279 stars]
- ⭐ NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2936 stars]
- ⭐ FARM [GitHub, 1734 stars]
- ⭐ gobbli by RTI International [GitHub, 275 stars]
- ⭐ headliner - training and deployment of seq2seq models [GitHub, 229 stars]
- ⭐ SyferText - A privacy preserving NLP framework [GitHub, 197 stars]
- ⭐ DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1263 stars]
- ⭐ TextHero - Text preprocessing, representation and visualization [GitHub, 2882 stars]
- ⭐ textblob - TextBlob: Simplified Text Processing [GitHub, 9109 stars]
- ⭐ AdaptNLP - A high level framework and library for NLP [GitHub, 407 stars]
- ⭐ textacy - NLP, before and after spaCy [GitHub, 2209 stars]
- ⭐ texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2388 stars]
- ⭐ jiant - jiant is an NLP toolkit [GitHub, 1639 stars]
Data Augmentation
- ⭐ WildNLP - Text manipulation library to test NLP models [GitHub, 76 stars]
- ⭐ snorkel - Framework to generate training data [GitHub, 5791 stars]
- ⭐ NLPAug - Data augmentation for NLP [GitHub, 4419 stars]
- ⭐ SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
- ⭐ faker - Python package that generates fake data for you [GitHub, 17648 stars]
- ⭐ textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 639 stars]
- ⭐ Parrot - Practical and feature-rich paraphrasing framework [GitHub, 871 stars]
- ⭐ AugLy - data augmentations library for audio, image, text, and video [GitHub, 4950 stars]
- ⭐ TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 396 stars]
Adversarial NLP Attacks & Behavioral Testing
- ⭐ TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
- ⭐ CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 6172 stars]
- ⭐ CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]
Transformer-oriented
- ⭐ transformers by HuggingFace [GitHub, 132974 stars]
- ⭐ Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 2543 stars]
- ⭐ haystack - Transformers at scale for question answering & neural search. [GitHub, 16997 stars]
Dialogue Systems and Speech
- ⭐ DeepPavlov by MIPT [GitHub, 6676 stars]
- ⭐ ParlAI by FAIR [GitHub, 10477 stars]
- ⭐ rasa - Framework for Conversational Agents [GitHub, 18726 stars]
- ⭐ wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
- ⭐ ChatterBot - conversational dialog engine for creating chatbots [GitHub, 14039 stars]
- ⭐ SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 8674 stars]
- ⭐ dialoguefactory - Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]
Word/Sentence-embeddings oriented
- ⭐ MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 3181 stars]
- ⭐ vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 644 stars]
- ⭐ sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]
Social Media Oriented
- ⭐ Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 661 stars]
Phonetics
- ⭐ DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 352 stars]
Morphology
- ⭐ LemmInflect - python module for English lemmatization and inflection [GitHub, 259 stars]
- ⭐ Inflect - generate plurals, ordinals, indefinite articles [GitHub, 964 stars]
- ⭐ simplemma - simple multilingual lemmatizer for Python [GitHub, 964 stars]
Multi-lingual tools
- ⭐ polyglot - Multi-lingual NLP Framework [GitHub, 2309 stars]
- ⭐ trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 730 stars]
Distributed NLP / Multi-GPU NLP
Machine Translation
- ⭐ COMET - A Neural Framework for MT Evaluation [GitHub, 493 stars]
- ⭐ marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 1236 stars]
- ⭐ argos-translate - Open source neural machine translation in Python [GitHub, 3771 stars]
- ⭐ Opus-MT - Open neural machine translation models and web services [GitHub, 605 stars]
- ⭐ dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 440 stars]
- ⭐ CTranslate2 - CTranslate2 end-to-end machine translation [GitHub, 3300 stars]
Entity and String Matching
- ⭐ PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 736 stars]
- ⭐ pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]
- ⭐ fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 9220 stars]
- ⭐ jellyfish - approximate and phonetic matching of strings [GitHub, 2049 stars]
- ⭐ textdistance - Compute distance between sequences [GitHub, 3367 stars]
- ⭐ DeepMatcher - Python package for performing entity and text matching using deep learning [GitHub, 555 stars]
- ⭐ RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 339 stars]
- ⭐ Machamp - Machamp: A Generalized Entity Matching Benchmark [GitHub, 17 stars]
Discourse Analysis
- ⭐ ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 543 stars]
PII scrubbing
- ⭐ scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 394 stars]
Hashtag Segmentation
- ⭐ hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 68 stars]
Books Analysis / Literary Analysis / Semantic Search
- ⭐ booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 785 stars]
- ⭐ bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 76 stars]
- ⭐ SemanticFinder - frontend-only live semantic search with transformers.js [GitHub, 224 stars]
Non-English oriented
Japanese
- ⭐ fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 391 stars]
- ⭐ SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 390 stars]
- ⭐ Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 226 stars]
- ⭐ jProcessing - Japanese Natural Language Processing Libraries [GitHub, 148 stars]
- ⭐ Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 745 stars]
- ⭐ kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 953 stars]
- ⭐ nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 382 stars]
- ⭐ KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 201 stars]
- ⭐ Jigg - Pipeline framework for easy natural language processing [GitHub, 74 stars]
- ⭐ Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 376 stars]
- ⭐ RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 473 stars]
- ⭐ toiro - a comparison tool of Japanese tokenizers [GitHub, 118 stars]
Thai
- ⭐ AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai [GitHub, 79 stars]
- ⭐ ThaiLMCut - Word Tokenizer for Thai Language [GitHub, 15 stars]
Chinese
- ⭐ Spacy-pkuseg - The pkuseg toolkit for multi-domain Chinese word segmentation [GitHub, 53 stars]
Ukrainian
- ⭐ recruitment-dataset - Recruitment Dataset Preprocessing and Recommender System (Ukrainian, English)
Other
- ⭐ textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 103 stars]
- ⭐ Kashgari - Transfer Learning with focus on Chinese [GitHub, 2389 stars]
- ⭐ Underthesea - Vietnamese NLP Toolkit [GitHub, 1383 stars]
- ⭐ PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 84 stars]
Text Data Labelling & Classification
- ⭐ Small-Text - Active Learning for Text Classification in Python [GitHub, 549 stars]
- ⭐ Doccano - open source annotation tool for machine learning practitioners [GitHub, 9460 stars]
- ⭐ Adala - Autonomous DAta (Labeling) Agent framework [GitHub, 927 stars]
- ⭐ EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
- 🔱 Prodigy - annotation tool powered by active learning [Paid Service]
Note
Section keywords: learn NLP
🔙 Back to the Table of Contents
General
Courses
Books
Tutorials
🔙 Back to the Table of Contents
Tokenization
- ⭐ tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 8940 stars]
- ⭐ SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 10141 stars]
- ⭐ SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 135 stars]
Data Augmentation and Weak Supervision
Libraries and Frameworks
- ⭐ WildNLP - Text manipulation library to test NLP models [GitHub, 76 stars]
- ⭐ NLPAug - Data augmentation for NLP [GitHub, 4419 stars]
- ⭐ SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
- ⭐ TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
- ⭐ skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 917 stars]
- ⭐ NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 773 stars]
- ⭐ EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
- ⭐ snorkel - Framework to generate training data [GitHub, 5791 stars]
- ⭐ dialoguefactory - Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]
Reading Material and Tutorials
Named Entity Recognition (NER)
Relation Extraction
- ⭐ tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 355 stars]
- ⭐ tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 69 stars]
- ⭐ tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 64 stars]
- ⭐ Re-TACRED - Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 51 stars]
Coreference Resolution
Sentiment Analysis
Domain Adaptation
Low Resource NLP
Spell Correction / Error Correction
- ⭐ Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1502 stars]
- ⭐ NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 665 stars]
- ⭐ SymSpellPy - Python port of SymSpell [GitHub, 796 stars]
- 📙 Speller100 by Microsoft [Blog, Feb 2021]
- ⭐ JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 608 stars]
- ⭐ pycorrector - spell correction for Chinese [GitHub, 5517 stars]
- ⭐ contractions - Fixes contractions such as "you're" to "you are" [GitHub, 308 stars]
- 📙 Fine Tuning T5 for Grammar Correction by Sachin Abeywardana [Blog, Nov 2022]
Style Transfer for NLP
- ⭐ Styleformer - Neural Language Style Transfer framework [GitHub, 475 stars]
- ⭐ StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 60 stars]
Automata Theory for NLP
- ⭐ pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]
Obscene words detection
- ⭐ LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 2899 stars]
Reddit Analysis
- ⭐ Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 489 stars]
Skill Detection
- ⭐ SkillNER - rule based NLP module to extract job skills from text [GitHub, 153 stars]
Reinforcement Learning for NLP
- ⭐ nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 192 stars]
AutoML / AutoNLP
- ⭐ AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 3836 stars]
- ⭐ TPOT - Python Automated Machine Learning tool [GitHub, 9691 stars]
- ⭐ Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 2359 stars]
- ⭐ HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 710 stars]
- 🔱 AutoML Natural Language - Google’s paid AutoML NLP service
- ⭐ Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
- ⭐ FLAML - fast and lightweight AutoML library [GitHub, 3871 stars]
- ⭐ Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 306 stars]
OCR - Optical Character Recognition
Document AI
Text Generation
Title / Headlines Generation
- ⭐ TitleStylist - Learning to Generate Headlines with Controlled Styles [GitHub, 76 stars]
NLP research reproducibility
License CC0
Attributions
Resources
- All linked resources belong to original authors
Icons
Fonts
The Pandect Series also includes