inflectible

Template engine for natural languages that allows using grammatically appropriate word forms


Inflectible

Inflectible is a flexible template engine with support for inflection.
It can use correct word forms where other template engines can’t.

Maven dependency

<dependency>
    <groupId>org.tendiwa</groupId>
    <artifactId>inflectible</artifactId>
    <version>0.2.0</version>
</dependency>

What problem does it solve?

Many natural languages rely heavily on non-trivial rules of inflection. To
construct texts in those languages with variable parts of sentences, we
can’t always just concatenate strings: generally we have to know the grammatical
structure of the sentences we’re constructing, and we have to know how words in
a particular form are spelled. For example, in Russian, a typical noun can have
up to a dozen forms that are written differently in different sentences, and
there is no simple “cram-it-in-printf” rule for how those forms are derived
from the dictionary form of a word.

In English this is usually not a problem. But even in English, just
concatenating strings is sometimes not enough to produce a grammatically
correct sentence.

Consider this example: we need to display a message saying that
some cutting tool cuts paper well. With something like the printf function,
we could use a template like this:

%s cuts paper well

We could pass "Knife" or "Razor", but if we pass "Scissors", then it
produces a grammatically incorrect sentence: “Scissors cuts paper well”. This
is just the most basic example of how properly constructed sentences require the
template engine to be aware of inflection rules.
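
The problem can be reproduced with a few lines of plain Java. This is only an illustrative sketch: `NaiveTemplate` and `describe` are hypothetical names and are not part of Inflectible. The format string has no way of knowing that a plural subject needs “cut” rather than “cuts”:

```java
/** Illustrative only: a naive printf-style template with no knowledge of grammar. */
final class NaiveTemplate {
    /** Fills the fixed template from the example above with a tool name. */
    static String describe(String tool) {
        return String.format("%s cuts paper well", tool);
    }

    public static void main(String[] args) {
        System.out.println(describe("Knife"));    // Knife cuts paper well
        System.out.println(describe("Scissors")); // Scissors cuts paper well (should be "cut")
    }
}
```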

How does it work?

Inflectible introduces two kinds of markup: vocabularies and templates.

In vocabularies, you put words of a language in all their various forms, and
assign each form a grammatical meaning:

WOLF (Noun) {
    wolf
    wolves <Plur>
}
CHILD (Noun) {
    child
    children <Plur>
}
SCISSORS (Noun) <Plur> {
    scissors
}
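
One way to picture what such a vocabulary entry holds is a map from sets of grammatical meanings to spellings. The Java sketch below is a toy model of that idea only. `ToyLexeme`, `with`, and `form` are made-up names, and the exact-match lookup with a fallback to the dictionary form is a simplifying assumption, not a description of how Inflectible resolves forms internally:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of a lexeme: spellings keyed by a set of grammatical meanings. */
final class ToyLexeme {
    private final String dictionaryForm;
    private final Map<Set<String>, String> forms = new HashMap<>();

    ToyLexeme(String dictionaryForm) {
        this.dictionaryForm = dictionaryForm;
    }

    /** Registers a spelling for the given meanings, e.g. with("wolves", "Plur"). */
    ToyLexeme with(String spelling, String... grammemes) {
        forms.put(key(grammemes), spelling);
        return this;
    }

    /** Returns the spelling registered for exactly these meanings, else the dictionary form. */
    String form(String... grammemes) {
        return forms.getOrDefault(key(grammemes), dictionaryForm);
    }

    /** Order-insensitive lookup key built from the listed meanings. */
    private static Set<String> key(String... grammemes) {
        return new HashSet<>(Arrays.asList(grammemes));
    }

    public static void main(String[] args) {
        ToyLexeme wolf = new ToyLexeme("wolf").with("wolves", "Plur");
        System.out.println(wolf.form());       // wolf
        System.out.println(wolf.form("Plur")); // wolves
    }
}
```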

In templatuaries, you put templates. Templates declare arguments and describe
how those arguments are used to fill out the template:

actions.bite(subject, object) {
   [Subject] (and [subject]<Plur> are well known for their painful bites!) is biting a [object].
}

In your application, you have classes that represent the same concepts that the
words of a language represent. Those classes implement the Concept
interface, which requires them to return the identifier of their lexeme:

class Wolf implements Concept {
    @Override
    public String identifier() {
        return "WOLF";
    }
}
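
A single class does not have to map to a single lexeme. As a hedged sketch, a class such as `Human` could receive its lexeme identifier through the constructor. The body of `Human` here is an assumption made for illustration, and the `Concept` interface is redeclared as a stand-in only so that the snippet compiles on its own:

```java
// Stand-in for the library's Concept interface, redeclared here only so
// that this sketch is self-contained.
interface Concept {
    String identifier();
}

/** Hypothetical class: one Java type that can stand for several lexemes (GIRL, BOY, ...). */
final class Human implements Concept {
    private final String lexemeId;

    Human(String lexemeId) {
        this.lexemeId = lexemeId;
    }

    @Override
    public String identifier() {
        // The identifier must match an entry name in the vocabulary, e.g. "GIRL".
        return lexemeId;
    }
}
```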

With those classes, you construct a NativeSpeaker that knows how to speak a
particular language using proper inflection rules, and ask it to fill out a
particular template with particular concepts:

Wolf wolf = new Wolf();
Human girl = new Human("GIRL");
System.out.println(
    nativeSpeaker.fillOut("actions.bite", wolf, girl)
);
// -> Output: Wolf (and wolves are well known for their painful bites!) is biting a girl.

This may not seem very useful for English, but it makes a lot of sense in, for
example, Russian, where a lexeme for НОЖ (KNIFE) would look like this:

НОЖ (Сущ) <Муж Неодуш> {
    нож
    ножа   <Ед Р>
    ножу   <Ед Д>
    нож    <Ед В>
    ножом  <Ед Т>
    ноже   <Ед П>
    ножи   <Мн И>
    ножей  <Мн Р>
    ножам  <Мн Д>
    ножи   <Мн В>
    ножами <Мн Т>
    ножах  <Мн П>
}

There are 12 different forms the word НОЖ can assume under different
grammatical meanings, so choosing the correct one can become crucial.

Of course, it would be a pain to type all these words manually in vocabulary
markup. But the good news is that a machine can often guess with very high
accuracy what a particular word form would be, if we know the persistent
grammatical meaning of a word and its dictionary form. Inflectible can
generate those word forms for you. All you need to write is:

НОЖ (Сущ) <Муж Неодуш> {
    нож
    ...
}

That’s the actual markup, and if the template engine sees it, it can
automatically produce a lexeme equivalent to the previous tediously written
example. It even supports suppletion!

ЧЕЛОВЕК (Сущ) <Муж Одуш> {
    человек
    люди   <Мн>
    людьми <Мн Т>
    ...
}

What features is it going to provide?

The goals for version 1.0.0 are:

  • Full automated word form generation support for every part of speech in
    Russian and English;
  • Flexible design that allows automating inflection in any inflected
    language;
  • Agreement with numbers (двух коней, два коня, две лошади, пять коней, один
    конь, миллион и двадцать один конь: that is the Russian for “two horses”, “two
    male horses”, “two female horses”, “five horses”, “one horse”, “a million and
    twenty-one horses”. Just look at all the different endings);
  • Phonetic “agreement” (indefinite article “a”/“an” in English and “de/d’” in
    French depend not on grammatical features of another word, but on its phonetic
    features);
  • Complete basic vocabularies for English and Russian: built-in vocabularies
    with the most common words, such as articles, pronouns and numbers. It wouldn’t
    make sense to ask every user of the template engine to compose or copy their own
    vocabulary for the most basic words of a language;
  • Multipart templates
    for the cases when you want to split the result of filling a template into
    logical parts;
  • IntelliJ IDEA plugin for markup editing;
  • Maven plugin for generating explicit lexemes from partially defined ones at
    build time.
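
To give a feel for what agreement with numbers involves, here is a minimal Java sketch of the usual rule for a Russian noun after a cardinal number in the nominative case. It is not Inflectible code: `RussianCount`, `Category`, `category`, and `agree` are hypothetical names, gender (два/две) and other cases are ignored, and the noun forms are transliterated (kon', konya, koney for конь, коня, коней) to keep the source ASCII-only:

```java
/** Sketch of nominative-case number agreement for Russian nouns. */
final class RussianCount {
    /** ONE: nominative singular; FEW: genitive singular; MANY: genitive plural. */
    enum Category { ONE, FEW, MANY }

    /** Picks the form category from the last two digits of the number. */
    static Category category(long n) {
        long lastTwo = Math.abs(n % 100);
        long last = lastTwo % 10;
        if (lastTwo >= 11 && lastTwo <= 14) {
            return Category.MANY; // 11..14 always take the genitive plural
        }
        if (last == 1) {
            return Category.ONE;
        }
        if (last >= 2 && last <= 4) {
            return Category.FEW;
        }
        return Category.MANY;
    }

    /** Joins the number with whichever of the three supplied forms agrees with it. */
    static String agree(long n, String one, String few, String many) {
        switch (category(n)) {
            case ONE:
                return n + " " + one;
            case FEW:
                return n + " " + few;
            default:
                return n + " " + many;
        }
    }

    public static void main(String[] args) {
        for (long n : new long[] {1, 2, 5, 11, 21, 1_000_021}) {
            System.out.println(agree(n, "kon'", "konya", "koney"));
        }
        // 1 kon', 2 konya, 5 koney, 11 koney, 21 kon', 1000021 kon'
    }
}
```

Even this simplified rule needs three forms of the noun, which is why the engine has to know the full lexeme rather than a single string.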