Language learning apps can be a fun, convenient way to learn a new language. Like many uses of the internet, though, there is an opportunity for dark patterns to sneak in: user data hoarding, targeted presentation, and majority-cultural filter bubbles can have negative social impacts, and proprietary or closed-source software and always-connected centralised backends can create unstable infrastructure and restrict user freedom.
We present here OmniLingo: an open project to build language learning protocols, software, and infrastructure that avoids these problems, prioritizing minority language communities and user data sovereignty. Let’s start with a few example stories about OmniLingo’s users.
# Language Activist
A language activist in a minority community wants to add their language to their favourite language learning app. The activist gathers audio samples and transcriptions and designs a curriculum. They contact the company producing the app, which controls what material can be used. The company responds that the language has too few speakers to be worth their time, giving one of these reasons:
- Their lawyers don’t have time to evaluate whether they can use this material legally.
- Their developers don’t have time to add the language to their app.
- Their servers don’t have space or bandwidth to host the material for use in the app.
- They can’t evaluate the quality of the material, and would rather not allow anyone to use it.
With OmniLingo, the activist can host the material they’ve collected on the IPFS decentralised storage network, and users can add it to their app using the material’s content ID (CID) hash.
# Language Learner
A privacy-conscious language learner wants an app that doesn’t track their abilities or habits; apps without registration requirements or partnerships with advertising surveillance networks are hard to find, and many of those that did exist have disappeared along with their data. With OmniLingo, as long as there are users the data remains on the network; the open protocol allows developing clients which store no information on the network, or which let users choose who gets access to their data.
# Language Instructor
A language instructor is tired of having to adapt their course to the plastic-wrapped subscription curriculum their school purchased: the curriculum dictates the topics covered, their order, the vocabulary, and the variety of voices (gender, dialect, accent) used. With OmniLingo, the instructor can create, share, and mix their own material.
# Who we are
We’re a pair of computational linguistics researchers and language activists. Fran Tyers is an assistant professor of computational linguistics at Indiana University and Nick Howell is a language technology consultant. We work at all layers of the stack to bring the latest advancements in language technology to minority and under-resourced language communities in ways that support their right to self-determination.
# What is OmniLingo
OmniLingo is a protocol and sample implementation for language-learning applications with a focus on decentralised and self-determined storage, user rights and privacy, and minority and under-represented languages.
[Diagram: the OmniLingo ecosystem, with IPFS at the centre connecting language curators, language instructors, language activists, language community supporters, and distributed system developers to marginalised language learners, privacy-conscious language learners, language learning researchers, and experimental language task designers.]
# Architecture
[Diagram: language community authors and speakers contribute material, which OmniLingo node operators (anyone!) collect and publish to IPFS with the toolkit; any conformant client then fetches it for language learners.]
OmniLingo language data is stored on IPFS in a hierarchy of JSON and MP3 files. The _root index_ of a language data store is a JSON dictionary mapping ISO-639 language codes to a language index and language metadata.
{
⋮
"or": {
"cids": [
"QmXYXMPyCREZCLao7k462rKNAFznxszeshdXqauh63Tet4"
],
"meta": "QmV2SJHieJP1P6RDsSRchTwpWjAMho7ci7HXbkvN4ULuoR"
},
"pa-IN": {
"cids": [
"QmSKHQJdznnaNSNxDw9MXZxNzVNiJC9Fyi5pV1eXD9dWmU"
],
"meta": "QmNT6VGydPEJvcqmMPpp1g3qCqWtBn1WHAmzswL3NqAHMU"
},
"pl": {
"cids": [
"QmSP7E9MFGphEodZuxf73yinGMaKvg8nx3V98bs5HQ4gfT"
],
"meta": "QmQjPam8Nxwu2GckmSfp73iq1BauNm7mNXmKu8odGyRzZN"
},
⋮
}
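As a sketch of how a client might consume this, the snippet below fetches the root index through an IPFS HTTP gateway and lists the available languages with their display names. GATEWAY and ROOT_CID are placeholders of our own, not values from this post; a local node’s gateway (typically http://127.0.0.1:8080) or any public gateway will do.

import json
from urllib.request import urlopen

GATEWAY = "https://ipfs.io"        # placeholder: any IPFS HTTP gateway
ROOT_CID = "Qm...rootIndexCid"     # placeholder: CID of the root index to browse

def fetch_json(cid):
    """Fetch a JSON document from IPFS through the gateway."""
    with urlopen(f"{GATEWAY}/ipfs/{cid}") as response:
        return json.load(response)

root_index = fetch_json(ROOT_CID)
for code, entry in root_index.items():
    meta = fetch_json(entry["meta"])
    print(code, meta.get("display", code), entry["cids"])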
Language metadata consists of a “display name” for the language and a set of character rewrite rules to make typing easier; an example from Turkish:
{
"alternatives": {
"İ": [
"I"
]
},
"display": "Türkçe"
}
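The protocol doesn’t dictate how clients apply these rules; one plausible use (a sketch of our own) is to treat each listed alternative as equivalent to its canonical character when checking a learner’s typed answer:

# Sketch: accept "I" where the transcript has "İ", per the Turkish metadata above.
ALTERNATIVES = {"İ": ["I"]}

def normalise(text, alternatives=ALTERNATIVES):
    """Map alternative characters back to their canonical form."""
    for canonical, substitutes in alternatives.items():
        for substitute in substitutes:
            text = text.replace(substitute, canonical)
    return text

def matches(answer, target, alternatives=ALTERNATIVES):
    return normalise(answer, alternatives) == normalise(target, alternatives)

print(matches("Istanbul", "İstanbul"))  # True with the rule above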
Language indices are JSON lists of audio samples with difficulty metadata, used to generate exercises appropriate to the learner’s level.
[
⋮
{
"chars_sec": 15.211640211640212,
"clip_cid": "QmVDBB3qAujxuesvmhKZajxgYZnms3dUb24bBx6rr5hJvg",
"length": 6.048,
"meta_cid": "QmZNucpVvYxGiWhyqte9izfoaW79uQHuE7w4fWagWv1LLh",
"sentence_cid": "QmR1uoBwNJRHhcPhD2utiNji1J3e4Q2t1nmUie9cJxccYt"
},
{
"chars_sec": 15.972222222222223,
"clip_cid": "QmcjJwcPF5oUX3WPtkAqFU9iCr9NvQAwVQabgBQSuVFMaG",
"length": 5.76,
"meta_cid": "QmZNucpVvYxGiWhyqte9izfoaW79uQHuE7w4fWagWv1LLh",
"sentence_cid": "QmR1uoBwNJRHhcPhD2utiNji1J3e4Q2t1nmUie9cJxccYt"
},
⋮
]
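As a sketch of how a client might use the chars_sec field (the selection policy here is our own, not part of the protocol), the snippet below picks the fastest clip that is still at or below a learner’s target speech rate from a parsed language index like the one above:

def next_clip(index, max_chars_sec):
    """Return the hardest clip whose speech rate is still within the learner's level."""
    suitable = [entry for entry in index if entry["chars_sec"] <= max_chars_sec]
    if not suitable:
        return None
    return max(suitable, key=lambda entry: entry["chars_sec"])

# index = fetch_json(language_index_cid)   # e.g. via a gateway, as in the earlier sketch
# clip = next_clip(index, max_chars_sec=16.0)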
The sentence_cid field refers to a JSON dictionary of the original transcript, license, and language:
{
"content": "Tavaliselt ongi nii, et mesinik jääb oma surnud mesilastega ja mitte mingit lahendust ei tule.",
"copyright": "CC0-1.0",
"language": "et"
}
The clip_cid field is the CID of the MP3 file; meta_cid is a link to more detailed sentence metadata, including a tokenised transcript and punctuation tags for each token:
{
"sentence_cid": "QmR1uoBwNJRHhcPhD2utiNji1J3e4Q2t1nmUie9cJxccYt",
"tags": [ "X", "X", "X", "PUNCT", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "PUNCT" ],
"tokens": [ "Tavaliselt", "ongi", "nii", ", ", "et", "mesinik", "jääb", "oma", "surnud", "mesilastega", "ja", "mitte", "mingit", "lahendust", "ei", "tule", "." ]
}
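This is enough to build a fill-in-the-blank exercise: blank one non-punctuation token and ask the learner to supply it. A minimal sketch (which token to blank is our own choice, not something the format prescribes):

import random

def make_cloze(meta):
    """Blank one non-punctuation token from a sentence metadata record."""
    candidates = [i for i, tag in enumerate(meta["tags"]) if tag != "PUNCT"]
    blank = random.choice(candidates)
    tokens = list(meta["tokens"])
    answer = tokens[blank]
    tokens[blank] = "____"
    return " ".join(tokens), answer

# prompt, answer = make_cloze(fetch_json(entry["meta_cid"]))  # using fetch_json from earlier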
In summary, the OmniLingo language store looks like this:
root-index
├── lang1
│ ├── lang1-index
│ │ ├── sent1
│ │ │ ├── audioclip
│ │ │ ├── metadata
│ │ │ └── transcript
│ │ └── …
│ └── lang1-metadata
├── lang2
└── …
We encourage publishing root indexes to IPNS, so that clients can receive updates.
# Implementation
We have implemented:
- a Python toolkit for publishing language data
- a Python command-line client demo
- an HTML+JS web demo
# Language data publishing toolkit
There are three main steps to adding your data to OmniLingo: importing the data into IPFS, indexing it, and publishing it.
# Import
Import data into your local IPFS node and generate an index:
$ importer.py dataset_dir index_path
e.g.
$ importer.py ./cv-corpus-7.0-2021-07-21/tr/ tr.json
where the dataset_dir is in Common Voice format.
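Importing boils down to adding each clip and transcript to the local IPFS node and recording the resulting CIDs. A rough sketch of that core step, calling the ipfs command-line tool directly rather than showing the toolkit’s own code (the path in the usage comment is illustrative):

import subprocess

def add_to_ipfs(path):
    """Add a file to the local IPFS node and return its CID (requires a running ipfs daemon)."""
    result = subprocess.run(
        ["ipfs", "add", "--quiet", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# cid = add_to_ipfs("path/to/clip.mp3")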
# Index
Index the data, extracting a balanced subset of clips by a complexity metric:
$ indexer.py locale index_path
e.g.
$ indexer.py tr tr.json
This will return a CID that looks like QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho
# Publish
Publish the data to the global OmniLingo index on IPFS:
$ publisher.py locale cid
e.g.
$ publisher.py tr QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho
This will return a CID which looks something like QmWAmrGNGkL8N6LfsfAKueYGYLqJ2gqn9EZR2a11fxRos6, which you can then publish to an IPNS name using the local node ID:
$ ipfs name publish cid
e.g.
$ ipfs name publish QmWAmrGNGkL8N6LfsfAKueYGYLqJ2gqn9EZR2a11fxRos6
# Command-line client demo
The Python command-line client demonstrates how you can build your own local OmniLingo client; users are presented with fill-in-the-blank exercises, but there isn’t any difficulty analysis.
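To give a feel for what such a client involves, here is a heavily simplified sketch of an exercise loop (not the demo’s actual code; GATEWAY and INDEX_CID are placeholders): it fetches a language index, walks the clips in order of speech rate, and checks the learner’s answer against the blanked token.

import json, random
from urllib.request import urlopen

GATEWAY = "https://ipfs.io"           # placeholder: any IPFS gateway
INDEX_CID = "Qm...languageIndexCid"   # placeholder: CID of a language index

def fetch_json(cid):
    with urlopen(f"{GATEWAY}/ipfs/{cid}") as response:
        return json.load(response)

# Easiest clips first, as measured by characters per second.
for entry in sorted(fetch_json(INDEX_CID), key=lambda e: e["chars_sec"]):
    meta = fetch_json(entry["meta_cid"])
    blank = random.choice([i for i, tag in enumerate(meta["tags"]) if tag != "PUNCT"])
    tokens = list(meta["tokens"])
    answer, tokens[blank] = tokens[blank], "____"
    print(" ".join(tokens))
    if input("> ").strip() == answer.strip():
        print("Correct!")
    else:
        print("The missing word was:", answer)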
# HTML+JS web demo
Try it out! We wrote an HTML+JS web demo using a browserified copy of js-IPFS. It stores a list of root indexes in local storage and merges their trees.
Users are presented with fill-in-the-blank exercises in order of increasing difficulty, as measured by characters-per-second in the audio. We look forward to developing other measures of difficulty!
One remaining point of centralisation is IPNS resolution: IPNS names are resolved through the gateway at ipfs.io.
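Resolving an IPNS name through a public gateway is just an HTTP request for the /ipns/ path; a minimal sketch (the name below is a placeholder of our own):

import json
from urllib.request import urlopen

IPNS_NAME = "k51q...exampleName"   # placeholder: an IPNS name pointing at a root index

# Gateways resolve /ipns/<name> to whatever the name currently points at.
with urlopen(f"https://ipfs.io/ipns/{IPNS_NAME}") as response:
    root_index = json.load(response)

print(sorted(root_index.keys()))   # language codes currently published in the root index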
# What’s next
OmniLingo’s design encourages experimentation; we hope to see expansion along several axes:
- New client experiences:
- bringing native OmniLingo experiences to more platforms
- new algorithms for choosing exercises
- choosing and customising exercises
- New data:
- bring more languages to OmniLingo
- bring more connections to OmniLingo (e.g., Wikidata for vocabulary learning)
- Both:
- bring pronunciation assistance to OmniLingo
- bring distributed identities to OmniLingo for multi-device or multi-client usage
We have specific plans for pronunciation assistance and distributed identities in the next phase of our OmniLingo development work. Want to help out? Share your ideas and collaborate with us in #OmniLingo:matrix.org.