A comprehensive evaluation of self-supervised speech models - SUPERB
Limitations of AI today and the rising trend of self-supervised learning

With the rapid advancement of deep learning, AI is becoming more and more capable. However, although machines can outperform humans at Go, machines and humans are not on the same level when it comes to learning languages. Take speech recognition as an example: to train a speech recognition system, developers have to collect large amounts of speech and label it with the corresponding text. Learning from labeled training data in this way is called supervised learning. Its dependence on massive amounts of data hinders the advancement of AI today, especially in speech recognition. Today's commercial speech recognition systems often require more than 100,000 hours of text-annotated speech for training. Hence, only common languages used by many people, such as Chinese and English, have enough annotated data to achieve high recognition accuracy. For rarer languages or dialects used by fewer people, such as Taiwanese and Hakka, it is challenging to develop high-quality speech recognition systems.

As a result, machine learning systems today can only understand common, dominant languages, which further reinforces the dominance of those languages. As AI becomes widely used across industries, speakers of less-dominant languages may find it harder to keep up with technological trends, making it more difficult to preserve cultural diversity around the world. We hope that machines can work for every human being regardless of race, nationality, or the language they speak. However, there are more than 7,000 languages globally, and it is almost impossible to collect a huge amount of annotated data for each one.

Machines need annotations to learn, but human babies learn language with almost no annotations. Can machines do the same? To answer this question, a wave of self-supervised learning has emerged in the field of voice AI to reduce the dependency on large labeled datasets. The goal of self-supervised learning is to train machines to understand human speech using only real-life conversations and video clips from the Internet, without any human annotation.

A comprehensive evaluation of self-supervised speech models - SUPERB

To allow machines to learn human language from observation alone, as human babies do, the speech lab at National Taiwan University has partnered with speech research groups at Meta, CMU, MIT, and JHU to develop a new evaluation framework for self-supervised speech processing, the Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is built around learning to understand human voice signals through self-supervised learning on massive amounts of unlabeled speech audio. Then, when developers want a machine to learn a specific speech processing task (called a downstream task), such as speech recognition, only a small amount of annotated data for that task is required. With self-supervised learning, machines can quickly learn tasks that originally required large amounts of annotated data.
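
The two-stage idea described above (general speech representations first, then a small labeled set for each downstream task) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `upstream_features` is a hypothetical stand-in that computes two hand-crafted statistics instead of calling a real pre-trained model such as wav2vec 2.0 or HuBERT, the "utterances" are synthetic sine tones, the downstream task (low tone versus high tone) is invented, and keeping the upstream fixed while training only a small head is one possible setup rather than something this article specifies.

```python
import math


def upstream_features(waveform):
    """Hypothetical stand-in for a pre-trained upstream model (kept fixed here).

    A real system would call a self-supervised network; this toy version
    returns two hand-crafted statistics so the sketch runs with the stdlib only.
    """
    n = len(waveform)
    pairs = list(zip(waveform, waveform[1:]))
    zero_cross_rate = sum(1 for a, b in pairs if a * b < 0) / n
    mean_abs_delta = sum(abs(b - a) for a, b in pairs) / n
    return [zero_cross_rate, mean_abs_delta]


def tone(cycles, phase, n=400):
    """Toy 'utterance': a sine wave with `cycles` periods over n samples."""
    return [math.sin(2 * math.pi * cycles * t / n + phase) for t in range(n)]


def train_head(features, labels, lr=1.0, epochs=2000):
    """Downstream head: logistic regression, full-batch gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    m = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of label 1
            err = p - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def predict(head, waveform):
    """Run the fixed upstream, then the trained head; returns label 0 or 1."""
    w, b = head
    x = upstream_features(waveform)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Only eight labeled examples for the downstream task: 0 = low tone, 1 = high tone.
train_waves = [tone(c, 0.3) for c in (3, 4, 5, 6, 30, 35, 40, 45)]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
head = train_head([upstream_features(w) for w in train_waves], train_labels)

# Held-out tones with a different phase than anything seen in training.
print(predict(head, tone(5, 1.1)), predict(head, tone(40, 1.1)))
```

Only `train_head` sees labels, and it fits just three numbers (two weights and a bias). That is the sense in which a good upstream representation lets a downstream task get by with very little annotated data.
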
One of the main contributions of SUPERB is that it does not focus on a single downstream task; instead, it allows a comprehensive assessment of a machine's ability to understand human voices. Speech signals carry several kinds of information, including speaker characteristics (who is speaking), prosodic characteristics (how it is said), and content (what is said). A natural question for researchers in this field is: can we effectively combine a large amount of unlabeled data with a small amount of labeled data to develop a model that applies generically to a wide range of downstream tasks? To address this question, SUPERB currently covers ten voice-related tasks to comprehensively evaluate how well a self-supervised speech model understands all aspects of human voice signals. The ten tasks are: phoneme recognition, speech recognition, keyword spotting, query by example (QbyE), intent classification, slot filling, emotion recognition, speaker recognition, speaker verification, and speaker diarization.

SUPERB has conducted modeling experiments on an unprecedented scale for self-supervised speech processing, evaluating how well popular models in the field understand human voices. These include the wav2vec series [2] proposed by Facebook AI, HuBERT [3], the Mockingjay series [4] proposed by National Taiwan University, APC [5] and NPC [6] proposed by MIT, and PASE [7] proposed by MILA, among others. The corresponding paper has been accepted by INTERSPEECH [1], a leading international speech research conference. We expect SUPERB to become a standard evaluation benchmark for self-supervised learning in speech. We encourage speech researchers all around the world to participate in the challenge and push the frontier of self-supervised learning together. Moreover, NTU Speech Lab PhD student Leo Yang led a group of graduate students to integrate the self-supervised learning code into an open-source toolkit, S3PRL. The code base is publicly available on GitHub [8] and has been used and recommended by over 1,500 researchers and developers globally.

A shared research and innovation platform to advocate AI democratization

By creating a shared platform for self-supervised learning innovation, SUPERB aspires to make AI more useful for less-dominant languages and groups, protecting linguistic and cultural diversity. It also aims to reduce large corporations' monopoly on large-scale user data and computational resources, providing a fairer environment for smaller companies and researchers while protecting user privacy.

SUPERB researchers hope that, with these efforts, AI can understand everyone's voice, and that any person or group can benefit equally from the advancement of AI technologies. We encourage more researchers and policy makers to join us in advocating the democratization of AI from both technical and policy standpoints, to make it accessible to all.

For more details, please refer to the official SUPERB website: http://superbbenchmark.org/.

[1] Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee, SUPERB: Speech processing Universal PERformance Benchmark, INTERSPEECH, 2021
[2] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, NeurIPS, 2020
[3] Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin Bolte, Ruslan Salakhutdinov, Abdelrahman Mohamed, HuBERT: How Much Can a Bad Teacher Benefit ASR Pre-Training?, ICASSP, 2021
[4] Andy T. Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, Hung-yi Lee, Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders, ICASSP, 2020
[5] Yu-An Chung, Wei-Ning Hsu, Hao Tang, James Glass, An Unsupervised Autoregressive Model for Speech Representation Learning, INTERSPEECH, 2019
[6] Alexander H. Liu, Yu-An Chung, James Glass, Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies, INTERSPEECH, 2021
[7] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, Yoshua Bengio, Multi-Task Self-Supervised Learning for Robust Speech Recognition, ICASSP, 2020
[8] https://github.com/s3prl/s3prl
Keyword Self-supervised Learning
Research Project Self-supervised Trustworthy Learning for Next-generation Intelligent Services
Research Team Led by PI: Prof. Winston H. Hsu, National Taiwan University, Co-PI: Associate Prof. Hung-Yi Lee, National Taiwan University