Towards an ImageNet Moment for Speech-to-Text


https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/

Speech-to-text (STT), also known as automatic speech recognition (ASR), has a long history and has made amazing progress over the past decade. Currently, it is often believed that only large corporations like Google, Facebook, or Baidu (or local state-backed monopolies for the Russian language) can provide deployable “in-the-wild” solutions. This is due to several reasons:

  1. The high compute requirements typically reported in papers erect artificially high entry barriers;
  2. Speech requires significant data due to diverse vocabularies, speakers, and compression artifacts;
  3. A mentality where practical solutions are abandoned in favor of impractical, yet state-of-the-art (SOTA), solutions.

In this piece we describe our effort to alleviate these concerns, both globally and for the Russian language, by:

  1. Introducing the diverse 20,000-hour Open STT dataset, published under a CC BY-NC license;
  2. Demonstrating that it is possible to achieve competitive results using only TWO consumer-grade and widely available GPUs;
  3. Offering a plethora of design patterns that democratize entry to the speech domain for a wide range of researchers and practitioners.

Introduction

Following the success and the democratization (the so-called “ImageNet moment”, i.e. the reduction of hardware requirements, time-to-market and minimal dataset sizes to produce deployable products) of computer vision, it is logical to hope that other branches of Machine Learning (ML) will follow suit. The only questions are, when will it happen and what are the necessary conditions for it to happen?

In our opinion, the ImageNet moment in a given ML sub-field arrives when:

  • The architectures and model building blocks required to solve 95% of standard “useful” tasks are widely available as standard and tested open-source framework modules;
  • Most popular models are available with pre-trained weights;
  • Knowledge transfer from standard tasks using pre-trained models to different everyday tasks is solved;
  • The compute required to train models for everyday tasks is minimal (e.g. 1–10 GPU days in STT) compared to the compute requirements previously reported in papers (100–1000 GPU days in STT);
  • The compute for pre-training large models is available to small independent companies and research groups;

If the above conditions are satisfied, one can develop new useful applications at a reasonable cost. Democratization also occurs: one no longer has to rely on giant companies such as Google as the only source of truth in the industry.


This piece describes our pursuit of an ImageNet moment for STT, which has so far not arrived, particularly in the context of the Russian language. Our main goal is to build and deploy useful models as fast as possible on a limited compute budget, and to share how we did it so that others can build on our findings and we can collectively reach an ImageNet moment for STT.


Related Work and Inspiration

For our experiments we have chosen the following stack of technologies:

  • Feed-forward neural networks for acoustic modelling (mostly grouped 1D convolutions with squeeze-and-excitation and transformer blocks);
  • Connectionist temporal classification (CTC) loss;
  • Composite tokens consisting of graphemes (i.e. alphabet letters) as modelling units (as opposed to phonemes);
  • Beam search with a pre-trained language model (LM) as the decoder.

There are many ways to approach STT; discussing their drawbacks and advantages is out of scope here. Everything in this article refers to an end-to-end approach using mostly graphemes (i.e. alphabet letters) and neural networks.

In a nutshell, to train an end-to-end grapheme model you just need a lot of small audio files with corresponding transcriptions, i.e. file.wav and transcription.txt. You can also use CTC loss, which removes the requirement for time-aligned annotation (otherwise you have to either provide an alignment table yourself or learn alignment within your network). A common alternative to CTC loss is the standard categorical cross-entropy loss with attention, but it trains slowly on its own and is usually used together with CTC loss anyway.
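To make this concrete, here is a minimal, illustrative sketch of CTC-based training in PyTorch. The tiny model, toy alphabet, and synthetic batch below are placeholders for exposition only, not our actual pipeline; the point is that the loss only needs per-utterance transcripts, never frame-level alignments.

```python
import torch
import torch.nn as nn

# Toy grapheme alphabet: index 0 is the CTC blank; everything here is illustrative.
ALPHABET = ["<blank>", " ", "a", "b", "c", "d"]
BLANK = 0

class TinyAcousticModel(nn.Module):
    """A deliberately small convolutional acoustic model: spectrogram in, per-frame grapheme logits out."""
    def __init__(self, n_mels: int = 64, n_tokens: int = len(ALPHABET)):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=11, stride=2, padding=5),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=11, padding=5, groups=4),  # grouped 1D convolution
            nn.ReLU(),
        )
        self.head = nn.Conv1d(128, n_tokens, kernel_size=1)

    def forward(self, spec):                # spec: (batch, n_mels, frames)
        return self.head(self.conv(spec))   # logits: (batch, n_tokens, frames // 2)

def greedy_decode(logits):
    """Standard CTC greedy decoding: collapse repeated tokens, then drop blanks."""
    out = []
    for seq in logits.argmax(dim=1):        # (batch, frames) -> token ids per frame
        prev, chars = None, []
        for idx in seq.tolist():
            if idx != prev and idx != BLANK:
                chars.append(ALPHABET[idx])
            prev = idx
        out.append("".join(chars))
    return out

model = TinyAcousticModel()
ctc = nn.CTCLoss(blank=BLANK, zero_infinity=True)

# A fake batch standing in for (file.wav, transcription.txt) pairs after feature extraction.
spec = torch.randn(8, 64, 400)                        # 8 utterances, 400 spectrogram frames each
targets = torch.randint(1, len(ALPHABET), (8, 20))    # grapheme indices (no blanks)
target_lengths = torch.full((8,), 20, dtype=torch.long)

logits = model(spec)
log_probs = logits.log_softmax(dim=1).permute(2, 0, 1)  # CTCLoss expects (frames, batch, tokens)
input_lengths = torch.full((8,), log_probs.size(0), dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                        # CTC marginalizes over all possible alignments
print(loss.item(), greedy_decode(logits)[0])
```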

This “stack” was chosen for a number of reasons:

  • Scalability. You can scale your compute by adding GPUs;
  • Future proofing. Should a new neural network block become mainstream, it can be integrated and tested within days. Migrating to another framework is also easy;
  • Simplicity. Using Python and PyTorch, you can focus on experimentation rather than on working around legacy constraints;
  • Flexibility. With proper code in Python you can test new features (e.g. speaker diarization) in days;
  • By not using attention in the decoder, phonemes, or recurrent neural networks, we achieve faster convergence and our models need less maintenance.


Open Speech To Text (Russian)

All publicly available supervised English datasets that we know of are smaller than 1,000 hours and have very limited variability. DeepSpeech 2, a seminal STT paper, suggests that you need at least 10,000 hours of annotated data to build a proper STT system. 1,000 hours is a good start, but given the generalization gap (discussed below) you need around 10,000 hours of data across different domains.

Typical academic datasets have the following drawbacks:

  • Too ideal. Recorded in a studio, or too clean compared to real-world applications;
  • Too narrow a domain. Difficulty in STT follows this simple formula: noise level * vocabulary size * number of speakers;
  • Mostly English only. Though projects like Common Voice alleviate this constraint to some extent, you cannot reliably find a lot of data in languages other than German and English. Also, Common Voice is probably more suitable for speaker identification than for speech-to-text, because its texts are not very diverse;
  • Different compression. Wav files have little to no compression artifacts and therefore do not represent real-world sound bites, which are compressed in different ways.

Because of these drawbacks, about 6 months ago we decided to collect and share an unprecedented spoken corpus for the Russian language, targeting 10,000 hours at first. To our knowledge a dataset of this scale is unprecedented even for the English language. We have seen an attempt at similar work, but despite government funding, those datasets are not publicly available.

Recently we released a 1.0-beta version of the dataset. It includes the following domains:

Domain        | Annotation      | Utterances | Hours  | GB
Radio         | Alignment       | 8.3M       | 11,996 | 1,367
Public Speech | Alignment       | 1.7M       | 2,709  | 301
YouTube       | Subtitles       | 2.6M       | 2,117  | 346
Audiobooks    | Alignment / ASR | 1.3M       | 1,632  | 180
Calls         | ASR             | 695K       | 819    | 91
Other         | TTS, narration  | 1.9M       | 835    | 95

Our data-collection process was the following (sketched in code after the list):

  • Collect some data then clean it using heuristics;
  • Train some models and use those models to further clean the data;
  • Collect more data and use alignment to align transcripts with audio;
  • Train better models and use those models to further clean the data;
  • Collect more data and manually annotate some data;
  • Repeat all the steps.
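In code, this bootstrapping loop looks roughly like the sketch below. Every helper function (collect, heuristic_clean, train_acoustic_model, align, cer, and so on) is a hypothetical placeholder standing in for a real pipeline stage, and the CER threshold is likewise arbitrary.

```python
# A rough sketch of the iterative collect / clean / retrain loop described above.
# All helpers (collect, heuristic_clean, train_acoustic_model, align, cer,
# sample_subset, manually_annotate) are hypothetical placeholders, not real pipeline code.

def bootstrap_dataset(raw_sources, n_iterations=3, max_cer=0.5):
    dataset = heuristic_clean(collect(raw_sources))          # collect data, clean it with heuristics
    model = None
    for _ in range(n_iterations):
        model = train_acoustic_model(dataset)                 # train (or retrain) a model
        # Use the current model to score utterances and drop likely-bad transcripts,
        # e.g. by thresholding CER between the model output and the reference text.
        dataset = [s for s in dataset
                   if cer(model.transcribe(s.audio), s.text) < max_cer]
        new_data = align(collect(raw_sources), model)         # align transcripts with audio
        new_data += manually_annotate(sample_subset(new_data))  # manual annotation for a subset
        dataset += new_data                                    # then repeat all the steps
    return dataset, model
```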

You can find our corpus here and you can support our dataset here.

Though this is already substantial, we are not yet done. Our short term plan is:

  • Do some housekeeping, clean the data more, and clean-up some legacy code;
  • Migrate to .ogg in order to minimize data storage space while maintaining quality (a conversion sketch follows this list);
  • Add several new domains (courtroom dialogues, medical lectures and seminars, poetry).
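As an aside, a migration like this can be scripted with ffmpeg; the snippet below is only a sketch, and the codec (Opus), bitrate, and mono 16 kHz settings are assumptions rather than the parameters we finally chose.

```python
import subprocess
from pathlib import Path

def wav_to_ogg(wav_path: Path, ogg_path: Path, bitrate: str = "32k") -> None:
    """Re-encode a wav file as mono, 16 kHz Ogg/Opus via ffmpeg (settings are illustrative)."""
    ogg_path.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(wav_path),
         "-ac", "1", "-ar", "16000",          # mono, 16 kHz is typical for STT corpora
         "-c:a", "libopus", "-b:a", bitrate,  # lossy compression shrinks storage dramatically
         str(ogg_path)],
        check=True,
    )

for wav in Path("data/wav").rglob("*.wav"):
    wav_to_ogg(wav, Path("data/ogg") / wav.with_suffix(".ogg").name)
```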

Making a Great Speech To Text Model

A great STT model needs the following characteristics:

  • Quick inference;
  • Parameter efficiency;
  • Ease of maintenance and improvement;
  • Modest training compute: a 2 x 1080 Ti machine or less should suffice.

We take these as our goals, and describe how we fulfilled them below.


Overall Progress Made

Initially we started with a fork of Deep Speech 2 in PyTorch. The original Deep Speech 2 model is based on deep LSTM or GRU recurrent networks, which are slow. Over time we were able to make the following optimizations to the original pipeline without hurting model performance:

  • Reduce the model size by around 5x;
  • Speed up its convergence 5–10x;
  • The small (25M–35M parameter) final model can be trained on 2x1080 Ti GPUs instead of 4;
  • The large model still requires 4x1080 Ti, but has a slightly lower final CER (1–1.5 percentage points lower) than the small model.

Only convolutional models are included in these comparisons, as we found them to be much faster than their recurrent counterparts. The process of getting to these results was as follows:

  1. Used an existing implementation of Deep Speech 2;
  2. Ran a few experiments on LibriSpeech, where we noticed that RNN models are typically very slow compared to their convolutional counterparts;
  3. Added a plain Wav2Letter-inspired model, which turned out to be underparameterized for Russian, so we increased the model size;
  4. Noticed that the model was okay but very slow to train, so we tried to optimize the training time (a sketch of the convolutional building blocks involved follows this list).
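For illustration, here is a minimal sketch of the kind of convolutional building block mentioned earlier: a grouped 1D convolution followed by squeeze-and-excitation, wrapped in a residual connection. The channel count, kernel size, and reduction ratio are arbitrary placeholders, not our production configuration.

```python
import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Channel-wise gating: global average pool over time, bottleneck MLP, sigmoid scale."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, channels, time)
        scale = self.fc(x.mean(dim=-1))        # (batch, channels)
        return x * scale.unsqueeze(-1)

class GroupedConvSEBlock(nn.Module):
    """Grouped 1D convolution + batch norm + ReLU + squeeze-and-excitation, with a residual connection."""
    def __init__(self, channels: int = 256, kernel_size: int = 11, groups: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=groups),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            SqueezeExcite1d(channels),
        )

    def forward(self, x):
        return x + self.body(x)                # residual connection keeps deep stacks trainable

x = torch.randn(4, 256, 200)                   # (batch, channels, frames)
print(GroupedConvSEBlock()(x).shape)           # torch.Size([4, 256, 200])
```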

So, we then explored the following ideas to improve things:

Idea 1 — Model Stride

Idea 2 — Compact Regularized Networks

Idea 3 — Using Byte-Pair Encoding

Idea 4 — Better Encoder

Idea 5 — Balance Capacity — Never Use 4 GPUs Again

Idea 6 — Stabilize the Training in Different Domains, Balance Generalization

Idea 7 — Make A Very Fast Decoder

Model Benchmarks and Generalization Gap

In real life it is expected that if the model is trained on one domain, there will be a significant generalization gap on another. But is there a generalization gap in the first place? If there is, then what are the main differences between domains? Can you train one model to work fine on many reasonable domains with decent signal-to-noise ratio?

There is a generalization gap, and you can even deduce which ASR systems were trained on which domains. Also, with the ideas above, you can train a model that will perform decently even on unseen domains.

According to our observations, these are the main differences that cause the generalization gap between domains:

  • Overall noise level;
  • Vocabulary and pronunciation;
  • The codecs or hardware used to compress audio;

Dataset / out-of-dataset?  | Our CER / WER | Best cloud CER / WER | Comment
Narration, yes             | 3% / 11%      | 3% / 6%              | TTS narration dataset
AudioBooks, no             | 9% / 30%      | 7% / 25%             | Very clean
YouTube, no                | 15% / 37%     | 16% / 32%            | Very diverse
Radio, no                  | 8% / 19%      | 14% / 26%            | Very diverse
Public speech, no          | 6% / 16%      | 12% / 23%            | Very diverse
Calls, taxi (clean), yes   | 7% / 20%      | 7% / 15%             | Clean annotation
Calls, e-commerce, yes     | 19% / 34%     | 17% / 31%            | Noisy, rare words
Calls, pranks, yes         | 22% / 43%     | 23% / 39%            | Very noisy

Legend:

  • All of the speed benchmarks were done on an EX51-SSD-GPU server (4-core CPU + GTX 1080);
  • This benchmark includes both an acoustic model and a language model. The acoustic model is run on the GPU, the results are accumulated, and then language-model post-processing is run on multiple CPUs;
  • The speed is measured by dividing the total audio duration in the dataset by the total time spent applying the acoustic model and language-model post-processing (see the sketch below). For these tests this metric ranged from 125 to 250, which is very fast, but bear in mind that this is not representative of real production usage speeds.
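For clarity, the throughput metric quoted above can be computed roughly as follows; `model.transcribe` and `sample.duration_s` are illustrative names, not a real API.

```python
import time

def benchmark_speed(model, dataset):
    """Total seconds of audio processed per second of wall-clock time.
    `model.transcribe` and `sample.duration_s` are illustrative names, not a real API."""
    total_audio_s, total_wall_s = 0.0, 0.0
    for sample in dataset:
        start = time.perf_counter()
        _ = model.transcribe(sample.audio)   # acoustic model + LM post-processing
        total_wall_s += time.perf_counter() - start
        total_audio_s += sample.duration_s
    return total_audio_s / total_wall_s      # ranged from 125 to 250 in the tests above
```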


Model Benchmark Analysis

More often than not, when some systems are compared to others (if you speak Russian, this is a typical example), the following quirks occur:

  • A new 2019 model is compared to a competitor’s 2018 model;
  • Results are cherry-picked to fit some narrative;
  • Practical, compute, or maintenance concerns are omitted;

While we are not 100% immune to these deficiencies, we nevertheless argue that you should at least attempt to:

  • Do out-of-dataset (ood) validation;
  • Do validation on clean and noisy data;
  • Try to create a general model suitable for real world usage. Obviously there is a reason why some companies provide several models for several domains, but creating a general model is more difficult than creating a model for a narrow domain;
  • Compare the model to vastly different approaches — at least on a black box level;

Comparing the benchmarks in the previous section, some conclusions can be drawn (please note that these tests were made at the end of 2019 / early 2020):

  • By using Open STT we managed to train a general model that performs on par with the best generalist models on the market and does not lag behind too much on out-of-dataset tests;
  • No other model besides ours performs reasonably well on all of the validation sets without significantly sacrificing performance on some of the domains (i.e. WER > 50%);
  • For the majority of datasets and the majority of systems, the performance of the several best systems clearly converges;
  • Surprisingly, even though Google shows the best results on real calls, it severely lags behind on some of the other domains. We could not determine whether this is caused by some arbitrarily high internal confidence threshold, but from the data it seems that Google’s STT simply omits output when it is not certain, which is less than desirable for some applications and the default behavior for others;
  • Kaldi was probably trained on audio books or a similar domain;
  • Even though Tinkoff’s models blow everybody out of the water on easy domains (such as narration or audio books), their performance on other domains is even worse than that of our models without LMs. They most likely trained their model on audio books and YouTube (or possibly narration or short commands);
  • Surprisingly, despite the popular misconception, Yandex is not the best on the market for all domains;
  • Though it performs well on the simple vocabulary of easier domains, Kaldi falls behind on noisier domains.

Production Usage

We have shown that with close to zero manual annotation and a limited hardware budget (2–4 x 1080 Ti) you can train a robust and scalable acoustic model. But an obvious question is — how much more data, compute and effort is required to make this model deployable? How well does the model have to perform?

As for performance, an obvious criterion for us would be to beat Google on all of our validation datasets. The problem with that is that Google has stellar performance on some domains (e.g. calls) and average performance on others. Therefore, for this section we decided to take the results of the best model from SpeechPro, a local Russian state-backed monopoly that is usually cited as “the best” solution on the market.

Dataset             | WER, our model | WER, our model (more capacity, more data) | WER, SpeechPro Calls
Narration           | 11%            | 10%                                       | 15%
AudioBooks          | 30%            | 28%                                       | 30%
YouTube             | 37%            | 32%                                       | 35%
Radio               | 19%            | 19%                                       | 26%
Public speech       | 16%            | 16%                                       | 23%
Calls, taxi (clean) | 20%            | 14%                                       | 25%
Calls, e-commerce   | 34%            | 33%                                       | 31%

Looking at the above table, we found that our model outperforms SpeechPro Calls (they have many models, but only this one performs well on all of the domains) for all categories except for e-commerce calls and audiobooks. Additionally, we found that adding more capacity to the model and training on more data greatly improved our results for most domains, and that SpeechPro only outperforms this model on e-commerce calls.

To achieve these results we had to source an additional 100 hours of calls with manual annotation, around 300 hours of calls with various sources of automatic annotation and 10,000 hours of data similar to OpenSTT, but not yet public. We also had to increase the model decoder capacity by bumping its total size to around 65M parameters. If you speak Russian, you can test our model here.

As for the other questions, it is a bit more complicated. Obviously you need a language model and a post-processing pipeline, which is not discussed here. To deploy on CPU only, some further minification or quantization is desirable but not strictly necessary. Being able to process 2–3 seconds of audio per CPU-core second on a slow processor (on faster processor cores we get around 4–5) is good enough, but state-of-the-art solutions claim figures around 8–10 seconds (on a faster processor).
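As an illustration of the kind of minification referred to above, post-training dynamic quantization in PyTorch looks roughly like the sketch below. The placeholder model is not our acoustic model, and whether this helps in practice depends on which module types the quantization backend supports.

```python
import torch

# A placeholder float32 model standing in for a trained acoustic model.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 34),
).eval()

# Post-training dynamic quantization: weights of the listed module types are stored in int8,
# activations are quantized on the fly, which shrinks the model and speeds up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    print(quantized(torch.randn(1, 256)).shape)  # outputs are approximately unchanged
```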

Further Work

Here is a list of ideas that we tested (some of which even worked), but which we ultimately decided did not justify their complexity:

  • Getting rid of gradient clipping. Gradient clipping takes from 25% to 40% of batch time (the call in question is sketched after this list). We tried various hacks to get rid of it, but could not do so without a severe drop in convergence speed;
  • Adam, Novograd and other new and promising optimizers. In our experience, they worked only on simpler, non-speech-related domains or toy datasets;
  • Sequence-to-sequence decoder, double supervision. These ideas work. Attention-based decoders with categorical cross-entropy loss instead of CTC are notoriously slow starters (you add speech decoding to the already burdensome task of alignment). Hybrid networks did not perform well enough to justify their complexity. This probably just means that hybrid networks require a lot of parameter fine-tuning;
  • Phoneme-based and phoneme-augmented methods. Though these helped us regularize a few over-parametrized models (100–150M params), they proved not very useful for smaller models. Surprisingly, an extensive tokenization study by Google arrived at a similar result;
  • Networks that increase in width gradually. A common design pattern in computer vision, but so far such networks converged worse than their constant-width counterparts;
  • Usage of IdleBlocks. At first glance this did not work, but maybe more time was needed to make it work;
  • Any sort of tunable filters instead of STFT. We tried various implementations of tunable STFT filters and SincNet filters, but in most cases we could not even stabilize the training of models with such filters;
  • Training a pyramid-shaped model with different strides. We failed to achieve any improvement here;
  • Model distillation and quantization to speed up inference. At the time we tried it, native quantization in PyTorch was still in beta and did not yet support our modules;
  • Complementary objectives like speaker diarization or noise cancelling. Noise cancelling works, but it proved to be of more aesthetic than practical use.
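For reference, the clipping call mentioned in the first item runs once per batch, between the backward pass and the optimizer step, which is why it lands on the hot path. A minimal sketch follows; model, optimizer, loss_fn, and batch are placeholders, and the max norm is arbitrary.

```python
import torch

def train_step(model, optimizer, loss_fn, batch, max_norm=100.0):
    optimizer.zero_grad()
    loss = loss_fn(model(batch.inputs), batch.targets)
    loss.backward()
    # Clipping walks over every parameter gradient each batch, which is how it
    # ends up accounting for a sizeable share of batch time.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```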


Author Bio
Alexander Veysov is a Data Scientist at Silero, a small company building NLP / Speech / CV enabled products, and the author of Open STT, probably the largest public Russian spoken corpus (we are planning to add more languages). Silero has recently shipped its own Russian STT engine. Previously he worked at a then Moscow-based VC firm and at Ponominalu.ru, a ticketing startup acquired by MTS (a major Russian telco). He received his BA and MA in Economics from MGIMO (Moscow State Institute of International Relations). You can follow his channel on Telegram (@snakers41).

Acknowledgments
Thanks to Andrey Kurenkov and Jacob Anderson for their contributions to this piece.

Citation
For attribution in academic contexts or books, please cite this work as

Alexander Veysov, “Towards an ImageNet Moment for Speech-to-Text”, The Gradient, 2020.

BibTeX citation:

@article{veysov2020towardimagenetstt,
  author = {Veysov, Alexander},
  title = {Towards an ImageNet Moment for Speech-to-Text},
  journal = {The Gradient},
  year = {2020},
  howpublished = {\url{https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/}},
}
