Internationalized Domain Names for Python (IDNA 2008 and UTS #46)

Internationalized Domain Names in Applications (IDNA)
=====================================================

Support for the Internationalized Domain Names in
Applications (IDNA) protocol as specified in `RFC 5891
<https://tools.ietf.org/html/rfc5891>`_. This is the latest version of
the protocol and is sometimes referred to as “IDNA 2008”.

This library also provides support for Unicode Technical
Standard 46, `Unicode IDNA Compatibility Processing
<https://unicode.org/reports/tr46/>`_.

This acts as a suitable replacement for the “encodings.idna”
module that comes with the Python standard library, which
supports only the older, superseded IDNA specification (`RFC 3490
<https://tools.ietf.org/html/rfc3490>`_).

Basic functions are simply executed:

.. code-block:: pycon

    >>> import idna
    >>> idna.encode('ドメイン.テスト')
    b'xn--eckwd4c7c.xn--zckzah'
    >>> print(idna.decode('xn--eckwd4c7c.xn--zckzah'))
    ドメイン.テスト

Installation
------------

This package is available for installation from PyPI:

.. code-block:: bash

    $ python3 -m pip install idna

Usage
-----

For typical usage, the encode and decode functions will take a
domain name argument and perform a conversion to A-labels or U-labels
respectively.

.. code-block:: pycon

    >>> import idna
    >>> idna.encode('ドメイン.テスト')
    b'xn--eckwd4c7c.xn--zckzah'
    >>> print(idna.decode('xn--eckwd4c7c.xn--zckzah'))
    ドメイン.テスト

You may also use the standard codec encoding and decoding methods after
importing the idna.codec module:

.. code-block:: pycon

    >>> import idna.codec
    >>> print('домен.испытание'.encode('idna2008'))
    b'xn--d1acufc.xn--80akhbyknj4f'
    >>> print(b'xn--d1acufc.xn--80akhbyknj4f'.decode('idna2008'))
    домен.испытание

Conversions can be applied on a per-label basis using the ulabel or
alabel functions if necessary:

.. code-block:: pycon

    >>> idna.alabel('测试')
    b'xn--0zwm56d'
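
To make the structure of an A-label concrete, the sketch below rebuilds the
result above using only the punycode codec from the Python standard library.
It is an illustration, not a substitute for this library: the helpers
naive_alabel and naive_ulabel are hypothetical names, and they skip every
IDNA 2008 validity check, so they should not be used to produce real labels.

```python
# Illustration only: an A-label is the ASCII prefix "xn--" followed by the
# Punycode (RFC 3492) encoding of the Unicode label.
ACE_PREFIX = b"xn--"


def naive_alabel(label: str) -> bytes:
    """Hypothetical helper: Punycode-encode one label with no IDNA validation."""
    if all(ord(ch) < 128 for ch in label):
        # Pure-ASCII labels are passed through unchanged.
        return label.encode("ascii")
    return ACE_PREFIX + label.encode("punycode")


def naive_ulabel(alabel: bytes) -> str:
    """Hypothetical inverse of naive_alabel, again without validation."""
    if alabel.startswith(ACE_PREFIX):
        return alabel[len(ACE_PREFIX):].decode("punycode")
    return alabel.decode("ascii")


print(naive_alabel("测试"))           # b'xn--0zwm56d'
print(naive_ulabel(b"xn--0zwm56d"))   # 测试
```

The output matches idna.alabel here only because the input happens to be a
valid label; for invalid input the library raises an error where this sketch
would silently produce an encoding.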

Compatibility Mapping (UTS #46)
+++++++++++++++++++++++++++++++

As described in `RFC 5895 <https://tools.ietf.org/html/rfc5895>`_, the
IDNA specification does not normalize the different ways in which a user
may enter a domain name. This functionality, known as a “mapping”, is
considered by the specification to be a local user-interface issue
distinct from IDNA conversion functionality.

This library provides one such mapping that was developed by the
Unicode Consortium. Known as `Unicode IDNA Compatibility Processing
<https://unicode.org/reports/tr46/>`_, it provides both a regular
mapping for typical applications and a transitional mapping to
help migrate from older IDNA 2003 applications. Strings are
preprocessed according to Section 4.4 “Preprocessing for IDNA2008”
prior to the IDNA operations.

For example, “Königsgäßchen” is not a permissible label as LATIN
CAPITAL LETTER K is not allowed (nor are capital letters in general).
UTS 46 will convert this into lower case prior to applying the IDNA
conversion.

.. code-block:: pycon

    >>> import idna
    >>> idna.encode('Königsgäßchen')
    ...
    idna.core.InvalidCodepoint: Codepoint U+004B at position 1 of 'Königsgäßchen' not allowed
    >>> idna.encode('Königsgäßchen', uts46=True)
    b'xn--knigsgchen-b4a3dun'
    >>> print(idna.decode('xn--knigsgchen-b4a3dun'))
    königsgäßchen

Transitional processing provides conversions to help transition from
the older 2003 standard to the current standard. For example, in the
original IDNA specification, the LATIN SMALL LETTER SHARP S (ß) was
converted into two LATIN SMALL LETTER S (ss), whereas in the current
IDNA specification this conversion is not performed.

.. code-block:: pycon

    >>> idna.encode('Königsgäßchen', uts46=True, transitional=True)
    b'xn--knigsgsschen-lcb0w'

Implementers should use transitional processing with caution, only in
rare cases where conversion from legacy labels to current labels must be
performed (i.e. IDNA implementations that pre-date 2008). For typical
applications that just need to convert labels, transitional processing
is unlikely to be beneficial and could produce unexpected incompatible
results.
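
For comparison, the legacy behaviour can be observed with the built-in idna
codec from the standard library, which implements the older RFC 3490 rules.
The short sketch below uses only the standard library; it is meant to show
why results from the two generations of the protocol can disagree, not to
recommend the legacy codec.

```python
# The standard library's "idna" codec follows IDNA 2003 (RFC 3490): it
# lowercases the label and folds the sharp s (ß) into "ss" before encoding.
legacy = "Königsgäßchen".encode("idna")

print(legacy)                 # b'xn--knigsgsschen-lcb0w'
print(legacy.decode("idna"))  # königsgässchen (the ß does not survive)
```

Because the ß is folded before encoding, the round trip yields a different
name from the one that was entered, which is the kind of incompatibility the
paragraph above warns about.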

encodings.idna Compatibility
++++++++++++++++++++++++++++++++

Function calls from the Python built-in encodings.idna module are
mapped to their IDNA 2008 equivalents using the idna.compat module.
Simply substitute the import clause in your code to refer to the new
module name.
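
As a rough sketch of that substitution, assuming the existing code calls the
ToASCII and ToUnicode helpers from encodings.idna, only the import line
needs to change:

```python
# Before: from encodings.idna import ToASCII, ToUnicode
from idna.compat import ToASCII, ToUnicode

# The call sites stay the same, but now follow IDNA 2008 rules.
print(ToASCII("测试"))             # b'xn--0zwm56d'
print(ToUnicode(b"xn--0zwm56d"))   # 测试
```

Names that were accepted under the older rules but are invalid under
IDNA 2008 will now raise an error, so callers may need to add error handling
when switching over.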

Exceptions
----------

All errors encountered during conversion should raise an exception
derived from the idna.IDNAError base class.

More specific exceptions may be raised: idna.IDNABidiError
when the error reflects an illegal combination of left-to-right and
right-to-left characters in a label; idna.InvalidCodepoint when
a specific codepoint is an illegal character in an IDN label (i.e.
INVALID); and idna.InvalidCodepointContext when the codepoint is
illegal based on its positional context (i.e. it is CONTEXTO or CONTEXTJ
but the contextual requirements are not satisfied).
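
One way to use this hierarchy is to handle the specific exceptions first and
fall back to the base class for everything else. The following is a minimal
sketch; the helper name describe_failure and the wording of its messages are
illustrative and not part of the library.

```python
import idna


def describe_failure(domain):
    """Hypothetical helper: return None if the name encodes, else a short reason."""
    try:
        idna.encode(domain)
    except idna.InvalidCodepoint as exc:
        return f"invalid codepoint: {exc}"
    except idna.InvalidCodepointContext as exc:
        return f"invalid context: {exc}"
    except idna.IDNABidiError as exc:
        return f"bidi violation: {exc}"
    except idna.IDNAError as exc:
        # Catch-all for any other conversion error raised by the library.
        return f"other IDNA error: {exc}"
    return None


print(describe_failure("ドメイン.テスト"))  # None
print(describe_failure("Königsgäßchen"))   # starts with "invalid codepoint:"
```

Applications that do not need to distinguish between the causes can catch
idna.IDNAError alone.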

Building and Diagnostics
------------------------

The IDNA and UTS 46 functionality relies upon pre-calculated lookup
tables for performance. These tables are derived by evaluating codepoints
against the eligibility criteria in the respective standards, and are
computed using the command-line script tools/idna-data.

This tool will fetch relevant codepoint data from the Unicode repository
and perform the required calculations to identify eligibility. There are
three main modes:

  • idna-data make-libdata. Generates idnadata.py and
    uts46data.py, the pre-calculated lookup tables used for IDNA and
    UTS 46 conversions. Implementers who wish to track this library against
    a different Unicode version may use this tool to manually generate a
    different version of the idnadata.py and uts46data.py files.

  • idna-data make-table. Generates a table of the IDNA disposition
    (e.g. PVALID, CONTEXTJ, CONTEXTO) in the format found in Appendix
    B.1 of RFC 5892 and the pre-computed tables published by `IANA
    <https://www.iana.org/>`_.

  • idna-data U+0061. Prints debugging output on the various
    properties associated with an individual Unicode codepoint (in this
    case, U+0061), that are used to assess the IDNA and UTS 46 status of a
    codepoint. This is helpful in debugging or analysis.

The tool accepts a number of arguments, which are described by
idna-data -h. Most notably, the --version argument specifies the
version of Unicode to use when computing the table data. For
example, idna-data --version 9.0.0 make-libdata will generate
library data against Unicode 9.0.0.

Additional Notes
----------------

  • Packages. The latest tagged release version is published in the
    `Python Package Index <https://pypi.org/project/idna/>`_.

  • Version support. This library supports Python 3.6 and higher.
    As this library serves as a low-level toolkit for a variety of
    applications, many of which strive for broad compatibility with older
    Python versions, there is no rush to remove older interpreter support.
    Removing support for older versions should be well justified in that the
    maintenance burden has become too high.

  • Python 2. Python 2 is supported by version 2.x of this library.
    Use “idna<3” in your requirements file if you need this library for
    a Python 2 application. Be advised that these versions are no longer
    actively developed.

  • Testing. The library has a test suite based on each rule of the
    IDNA specification, as well as tests that are provided as part of the
    Unicode Technical Standard 46, `Unicode IDNA Compatibility Processing
    <https://unicode.org/reports/tr46/>`_.

  • Emoji. It is an occasional request to support emoji domains in
    this library. Encoding of symbols like emoji is expressly prohibited by
    the technical standard IDNA 2008 and emoji domains are broadly phased
    out across the domain industry due to associated security risks. For
    now, applications that need to support these non-compliant labels
    may wish to consider trying the encode/decode operation in this library
    first, and then falling back to using encodings.idna. See the `GitHub
    project <https://github.com/kjd/idna/issues/18>`_ for more discussion.
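
The fallback suggested in the emoji note could be sketched as follows. The
helper name encode_with_legacy_fallback is hypothetical, and whether
accepting non-compliant labels is appropriate at all depends on the
application.

```python
import idna


def encode_with_legacy_fallback(domain: str) -> bytes:
    """Hypothetical helper: try IDNA 2008 first, then the stdlib IDNA 2003 codec."""
    try:
        return idna.encode(domain)
    except idna.IDNAError:
        # Not valid under IDNA 2008 (for example, a symbol or emoji label),
        # so fall back to the legacy codec from the standard library.
        return domain.encode("idna")


print(encode_with_legacy_fallback("ドメイン.テスト"))  # b'xn--eckwd4c7c.xn--zckzah'
print(encode_with_legacy_fallback("☃.example"))
```

Note that the fallback path applies the older IDNA 2003 rules to the whole
name, so a caller may want to log when it is taken.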