Thai natural language processing in Python
PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK but focused on the Thai language.
For documentation in Thai, see README_TH.MD.
pip install pythainlp
You can now contact the PyThaiNLP team with any questions.
Version | Description | Status |
---|---|---|
5.0.4 | Stable | Change Log |
dev | Release Candidate for 5.1 | Change Log |
PyThaiNLP provides standard linguistic analysis for Thai language and standard Thai locale utility functions.
Some of these functions are also available via the command-line interface (run `thainlp` in your shell).
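Partial list of features:
- Character and word sets: consonants (`pythainlp.thai_consonants`), vowels (`pythainlp.thai_vowels`), digits (`pythainlp.thai_digits`), and stop words (`pythainlp.corpus.thai_stopwords`), comparable to constants like `string.ascii_letters`, `string.digits`, and `string.punctuation`
- Tokenization at the sentence (`sent_tokenize`), word (`word_tokenize`), and subword (`subword_tokenize`) levels
- Part-of-speech tagging (`pos_tag`)
- Spelling suggestion and correction (`spell` and `correct`)
- Phonetic encoding and transliteration (`soundex` and `transliterate`)
- Collation, i.e. sorting in Thai dictionary order (`collate`)
- Converting numbers to Thai words (`num_to_thaiword` and `bahttext`)
- Thai date and time formatting (`thai_strftime`)
- Correcting text typed with the wrong keyboard layout (`eng_to_thai`, `thai_to_eng`)

pip install --upgrade pythainlp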
This will install the latest stable release of PyThaiNLP.
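After installing, a quick way to check that the package works is to tokenize a short Thai sentence. The following is a minimal sketch, not an official example: it assumes `word_tokenize` can be imported from the top-level `pythainlp` package with its default engine, and the helper name `quick_tokenize` is made up for illustration.

```python
from typing import List, Optional


def quick_tokenize(text: str) -> Optional[List[str]]:
    """Return word tokens, or None if PyThaiNLP is not importable."""
    try:
        # word_tokenize is the word-level tokenizer named in the feature list.
        from pythainlp import word_tokenize
    except ImportError:
        return None
    # Uses the default tokenization engine; no extras are assumed here.
    return word_tokenize(text)


tokens = quick_tokenize("ฉันรักภาษาไทย")
if tokens is None:
    print("PyThaiNLP is not installed; run: pip install --upgrade pythainlp")
else:
    print(tokens)
```

The import is placed inside the function so that the script fails softly with an installation hint instead of raising at import time.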
Install different releases:
- Stable release: `pip install --upgrade pythainlp`
- Pre-release: `pip install --upgrade --pre pythainlp`
- Development version: `pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip`
Some features, like Thai WordNet, may require extra packages. To install those requirements, specify a set of `[name]` immediately after `pythainlp`:
pip install "pythainlp[extra1,extra2,...]"
Possible extras:
- `full` (install everything)
- `compact` (install a stable and small subset of dependencies)
- `attacut` (to support attacut, a fast and accurate tokenizer)
- `benchmarks` (for word tokenization benchmarking)
- `icu` (for ICU, International Components for Unicode, support in transliteration and tokenization)
- `ipa` (for IPA, International Phonetic Alphabet, support in transliteration)
- `ml` (to support ULMFiT models for classification)
- `thai2fit` (for Thai word vector)
- `thai2rom` (for machine-learnt romanization)
- `wordnet` (for Thai WordNet API)

For dependency details, look at the `extras` variable in `setup.py`.
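Because extras are optional, code that depends on them may want to check what is actually importable before choosing a feature. The sketch below uses only the standard library. The mapping from extra names to module names is an assumption made for illustration (the real dependencies are listed in `setup.py`), and the helper names are hypothetical.

```python
from importlib.util import find_spec

# Hypothetical mapping from an extra to one importable module it is
# assumed to provide. Verify against setup.py before relying on it.
EXTRA_MODULES = {
    "attacut": "attacut",
    "icu": "icu",
    "wordnet": "nltk",
}


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


def available_extras(mapping=EXTRA_MODULES):
    """Return the sorted names of extras whose module is importable."""
    return sorted(extra for extra, module in mapping.items() if is_importable(module))


print(available_extras())
```

`find_spec` only locates the module, so the check stays cheap and does not trigger heavy imports.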
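- PyThaiNLP stores downloaded data under `~/pythainlp-data` by default.
- The data directory can be changed with the environment variable `PYTHAINLP_DATA_DIR`.
- See the data catalog (`db.json`) at https://github.com/PyThaiNLP/pythainlp-corpus

Some PyThaiNLP features can be used from the command line with the `thainlp` command.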
For example, to display a catalog of datasets:
thainlp data catalog
To show usage help:
thainlp help
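Datasets listed by the catalog are cached in the data directory, which defaults to `~/pythainlp-data` and can be overridden with `PYTHAINLP_DATA_DIR`. The lookup order can be sketched with the standard library as follows. This illustrates the documented behavior and is not the library's own implementation; `resolve_data_dir` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the data directory: the env override if set, else the default."""
    env = os.environ if env is None else env
    custom = env.get("PYTHAINLP_DATA_DIR")
    if custom:
        # An explicit override wins; expand a leading "~" if present.
        return Path(custom).expanduser()
    # Documented default location in the user's home directory.
    return Path.home() / "pythainlp-data"


print(resolve_data_dir())
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.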
We test core functionalities on all officially supported Python versions.
Some functionality requiring extra dependencies may be tested less frequently
due to potential version conflicts or incompatibilities between packages.
Test cases are categorized into three groups: core, compact, and extra.
You can find these tests in the tests/ directory.
For more detailed information on testing, please refer to the tests README:
tests/README.md
Material | License |
---|---|
PyThaiNLP source code and notebooks | Apache Software License 2.0 |
Corpora, datasets, and documentation created by PyThaiNLP | Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0) |
Language models created by PyThaiNLP | Creative Commons Attribution 4.0 International Public License (CC-by) |
Other corpora and models that may be included in PyThaiNLP | See Corpus License |
To see who uses PyThaiNLP, read INTHEWILD.md.
If you use PyThaiNLP in your project or publication, please cite the library as follows:
Phatthiyaphaibun, Wannaphong, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, and Pattarawat Chormai. “Pythainlp: Thai Natural Language Processing in Python”. Zenodo, 2 June 2024. http://doi.org/10.5281/zenodo.3519354.
or by BibTeX entry:
@software{pythainlp,
title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
author = "Phatthiyaphaibun, Wannaphong and
Chaovavanich, Korakot and
Polpanumas, Charin and
Suriyawongkul, Arthit and
Lowphansirikul, Lalita and
Chormai, Pattarawat",
doi = {10.5281/zenodo.3519354},
license = {Apache-2.0},
month = jun,
url = {https://github.com/PyThaiNLP/pythainlp/},
version = {v5.0.4},
year = {2024},
}
Our NLP-OSS 2023 paper:
Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai Natural Language Processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore, Singapore. Empirical Methods in Natural Language Processing.
and its BibTeX entry:
@inproceedings{phatthiyaphaibun-etal-2023-pythainlp,
title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
author = "Phatthiyaphaibun, Wannaphong and
Chaovavanich, Korakot and
Polpanumas, Charin and
Suriyawongkul, Arthit and
Lowphansirikul, Lalita and
Chormai, Pattarawat and
Limkonchotiwat, Peerat and
Suntorntip, Thanathip and
Udomcharoenchaikit, Can",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Empirical Methods in Natural Language Processing",
url = "https://aclanthology.org/2023.nlposs-1.4",
pages = "25--36",
abstract = "We present PyThaiNLP, a free and open-source natural language processing (NLP) library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.",
}
Sponsor | Description |
---|---|
VISTEC-depa Thailand Artificial Intelligence Research Institute | Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have been supported by VISTEC-depa Thailand Artificial Intelligence Research Institute. |
MacStadium | MacStadium supports us with a free Mac Mini M1 for running CI builds. |