Knowing how words are pronounced is a vital part of most speech recognition and speech synthesis systems. The pronunciation component forms the core of such systems, so their overall performance depends on the coverage and quality of the pronunciation model.
Automatic speech recognition and text-to-speech systems normally use handcrafted word-pronunciation dictionaries. The dictionary maps each word to one or more phonetic transcriptions and usually has a large but finite vocabulary.
Such a static list can never cover all possible words in a language, so it is usually accompanied by a grapheme-to-phoneme (G2P) engine that automatically generates pronunciations for out-of-dictionary words.
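The dictionary-plus-fallback arrangement described above can be sketched in a few lines of Python. This is only an illustration: the names `LEXICON`, `pronounce`, and `placeholder_g2p` are hypothetical, and the placeholder function stands in for a trained G2P model rather than implementing one.

```python
from typing import Callable, Dict, List

# Toy lexicon (illustrative): each word maps to one or more transcriptions.
LEXICON: Dict[str, List[str]] = {
    "computer": ["kəmˈpjuːtər"],
    "dove": ["ˈdʌv", "ˈdoʊv"],
}


def placeholder_g2p(word: str) -> str:
    """Hypothetical stand-in for a trained G2P model.

    It does not predict anything; it only marks the word as
    needing an automatically generated pronunciation.
    """
    return f"<g2p:{word}>"


def pronounce(
    word: str,
    lexicon: Dict[str, List[str]],
    g2p: Callable[[str], str],
) -> List[str]:
    """Return dictionary pronunciations, falling back to G2P if the word is missing."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    # Out-of-dictionary word: ask the G2P engine for a single guess.
    return [g2p(key)]


print(pronounce("computer", LEXICON, placeholder_g2p))
print(pronounce("Onehunga", LEXICON, placeholder_g2p))
```

Checking the dictionary first means that handcrafted entries always take priority over the model's guesses, which is why the coverage of the dictionary matters so much.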
A G2P converts an input word (a sequence of characters or graphemes) to a corresponding pronunciation (a string of phones). For example, given the word "computer", a G2P should output /kəmˈpjuːtər/.
There are different types of G2P algorithms. Unlike the less common rule-based G2Ps, data-driven G2P methods automatically learn from a set of word-pronunciation pairs (the ground truth). The underlying conversion rules are captured implicitly, which also makes the implementation language-independent. Various data-driven models use tree classifiers, hidden Markov models, and neural networks. Recurrent neural networks (RNNs) with long short-term memory (LSTM) cells show good accuracy and need no handcrafted rules: they simply learn from the training data.
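G2P conversion can be viewed as a (neural) machine translation problem in which spelling (orthography) is translated into pronunciation (phonology). The performance and quality of G2Ps are usually judged by their phoneme error rate (PER), which is analogous to the word error rate (WER) metric used in speech recognition and machine translation: the number of phoneme substitutions, insertions, and deletions separating the predicted pronunciation from the reference, divided by the number of phonemes in the reference.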
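As a rough illustration of how PER is commonly computed, the sketch below uses the Levenshtein edit distance between reference and predicted phone sequences, normalized by the total number of reference phones. The function names, the way "computer" is split into phones, and the faulty prediction are all assumptions made up for this example, not output from a real G2P.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference phones into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edit distance divided by the total number of reference phones."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_phones = sum(len(r) for r in refs)
    if total_phones == 0:
        raise ValueError("reference set contains no phones")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_phones


# Illustrative segmentation of /kəmˈpjuːtər/ (stress mark omitted).
reference = ["k", "ə", "m", "p", "j", "uː", "t", "ər"]
# Made-up faulty prediction: one substituted vowel and one missing /j/.
prediction = ["k", "ɒ", "m", "p", "uː", "t", "ər"]

print(phoneme_error_rate([reference], [prediction]))  # 0.25 (2 errors / 8 phones)
```

Because insertions count as errors, PER computed this way can exceed 1.0 when a prediction is much longer than its reference.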
G2P algorithms generalize from their training data and typically mispronounce non-standard words or foreign names. For example, they might pronounce the Māori name "Onehunga" as /wʌnˈhʌŋə/, which is far from the correct local pronunciation /ˌɒnɪˈhʌŋə/.
The existence of homographs also complicates things. Unlike Spanish or German, where the pronunciation of a word can largely be inferred from its spelling, English is full of words that share a spelling but are pronounced differently depending on meaning. Examples include "dove", which can be pronounced as /ˈdʌv/ or /ˈdoʊv/ depending on what you are talking about. A more complex example is the name "Houston", which is pronounced /ˈhjuːstən/ when it refers to the city in Texas and /ˈhaʊstən/ in the name of Houston Street in New York. Cases like these highlight the importance of the Cofactor Ora pronunciation knowledge base.
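One simple way to represent homographs in a pronunciation dictionary is to key each pronunciation by the sense it belongs to. The sketch below is hypothetical: the `HOMOGRAPHS` table, the sense labels, and the `pronounce_homograph` function are invented for illustration and do not describe how Cofactor Ora stores its data.

```python
from typing import Dict, Optional

# Hypothetical sense-keyed entries built from the examples above.
HOMOGRAPHS: Dict[str, Dict[str, str]] = {
    "dove": {
        "bird": "ˈdʌv",
        "past tense of dive": "ˈdoʊv",
    },
    "houston": {
        "city in Texas": "ˈhjuːstən",
        "street in New York": "ˈhaʊstən",
    },
}


def pronounce_homograph(word: str, sense: str) -> Optional[str]:
    """Return the pronunciation for a given sense, or None if it is not listed."""
    return HOMOGRAPHS.get(word.lower(), {}).get(sense)


print(pronounce_homograph("Houston", "city in Texas"))
print(pronounce_homograph("Houston", "street in New York"))
```

A spelling-only G2P has no access to the sense, so it can return at most one of these pronunciations for each word. The disambiguating information has to come from a curated resource or from the surrounding context.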
Since most G2P conversion algorithms require clean training data, G2P models are rarely available for under-resourced languages such as Māori. Building a manually annotated pronunciation dictionary is the most straightforward way to support the efficient development of G2P converters for Māori. Cofactor Ora collects Māori pronunciations in a systematic and structured way, enabling the development of Māori speech technologies.