CITIA Baselayer for Multilingual Speech Technology

An Affordable and Achievable Vision - CALL FOR ACTION

Imagine a future, in 2024, where human-human, human-machine, and human-environment communication is not hampered by differences in language capability, accessibility, or knowledge of the technology, and where security and privacy are built in. Such a future could be realised through conversational interaction technologies that support interaction, collaboration, creativity, and information access within a vast, dynamic, and heterogeneous information space.

During 2014 and 2015, CITIA constructed a technology roadmap that will enable this vision to be realised. The roadmapping process was carried out at the European level, connecting the strong R&D base with commercial and industrial activity and with policy makers at the EU and national levels. Over 200 experts were consulted, and the complete exercise is viewable here.

Based on our findings, the strengths, weaknesses, opportunities and threats for conversational interaction technologies in Europe can be assessed as follows:

One of the main points, reiterated many times across the entire stakeholder community, was the need for multilingual data and baseline systems. Although there is now a strong software layer for building speech recognition, machine translation, and speech synthesis systems – arising from EU projects such as Simple4All and EU-Bridge – both commercial and research users have to put in significant effort, including basic data collection, when extending a system to new languages.

This is a serious issue if we are to build a multilingual digital single market in Europe. In order to crack the language barrier, European speech technology systems need to support the 24 official languages and the 5 “semi-official” languages, as well as the languages required for the global market. And this is not to mention the many regional and minority languages in use in Europe. If we take the 25 languages ranked 26th to 50th by usage in Europe (Finnish to Montenegrin), we find there are over 50 million speakers of these languages residing in Europe.

To address this, we propose the construction of an open multilingual infrastructure for speech technology. The aim of such a multilingual baselayer is to enable and accelerate both research and innovation, and to provide one of the key blocks that will be essential to the multilingual digital single market. In a nutshell we propose:

  1. The creation of core, open access data, initially across 40–50 languages.
  2. The development of baseline open systems for speech recognition and speech synthesis across these languages.
  3. The dissemination of practical recipes for collecting and transcribing multilingual data for training and evaluating speech technology systems.
  4. The deployment of cloud-based reference systems to enable people to prototype and test new multilingual spoken applications.

This is an ambitious vision, but we believe it can be achieved if we coordinate at the European level and take advantage of the strong and diverse community committed to multilingual speech technologies. Achieving the vision will require:

  1. Standard data collection processes. These need to include protocols for data collection across a variety of genres (read speech, conversational speech, lectures, broadcast speech, …) combined with standard transcription procedures. The initial layer will require 100 hours of transcribed speech per language for constructing a baseline ASR system, plus about 10 hours of speech to construct a male and a female synthetic voice per language.
  2. Open source baseline systems building on existing open source projects (e.g. Kaldi).
  3. Cloud-based backend with standard APIs to enable the rapid deployment of prototype multilingual systems that can be used to develop new applications.

The multilingual baselayer should be open source and open data, using permissive licenses such as CC-BY and BSD-type open source licenses. Open source and open data are essential to stimulate and accelerate research in the field, and to seed a sustainable research and innovation community.

Establishing the multilingual baselayer will require a core project team responsible for project management, the definition of the data collection process, open source recipes, the building of baseline systems, and the deployment of reference cloud-based systems. Each individual language will require data collection, transcription, and quality control, following the guidelines and processes defined by the core team, and will probably require 2–3 person-years of effort.
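
To give a feel for the overall scale, the per-language figures quoted above can be multiplied out across the proposed 40–50 languages. The short Python sketch below is only a back-of-the-envelope illustration: the function and constant names are hypothetical, it assumes that the roughly 10 hours of synthesis data is a total per language covering both voices, and it leaves out the effort of the core project team, for which no figure is given.

```python
# Rough sizing of the multilingual baselayer from the figures quoted in the text.
# All names here are illustrative; this is not an official CITIA estimate.

ASR_HOURS_PER_LANGUAGE = 100   # transcribed speech for a baseline ASR system
TTS_HOURS_PER_LANGUAGE = 10    # assumption: ~10 hours in total covers both voices
EFFORT_PERSON_YEARS = (2, 3)   # per-language effort range (low, high)
LANGUAGE_COUNTS = (40, 50)     # proposed initial coverage range


def estimate(languages: int) -> dict:
    """Return total speech hours and the person-year range for a language count."""
    hours_per_language = ASR_HOURS_PER_LANGUAGE + TTS_HOURS_PER_LANGUAGE
    return {
        "languages": languages,
        "speech_hours": languages * hours_per_language,
        "person_years_low": languages * EFFORT_PERSON_YEARS[0],
        "person_years_high": languages * EFFORT_PERSON_YEARS[1],
    }


if __name__ == "__main__":
    for count in LANGUAGE_COUNTS:
        print(estimate(count))
```

Under these assumptions, the per-language work alone comes to roughly 4,400–5,500 hours of speech and 80–150 person-years across 40–50 languages.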

We believe the project will rapidly become self-sustaining owing to the use of open data and open source software, which will seed a research and innovation community. The project will grow through a virtuous cycle of data-systems-apps-users-data-….

JOIN US and help establish the CITIA Multilingual Baselayer!