A comprehensive evaluation of self-supervised speech models - SUPERB
Summary

Limitations of AI today and the rising trend of self-supervised learning

With the rapid advancement of deep learning, AI capabilities keep growing. However, although machines can outperform humans at Go, machines and humans are not on the same level when it comes to learning languages. Take speech recognition as an example: to train a speech recognition system, developers must collect a massive amount of speech and label it with the corresponding text. Learning to infer from labeled training data is called supervised learning. The requirement for massive labeled data in supervised learning hinders the advancement of AI today, especially in speech recognition. Today's commercial speech recognition systems often require more than 100,000 hours of text-annotated speech for training. Hence, only widely spoken languages, such as Chinese and English, have enough annotated data to achieve high recognition accuracy. For rarer languages or dialects spoken by fewer people, such as Taiwanese and Hakka, it is challenging to build high-quality speech recognition systems. As a result, machine learning systems today understand only common, dominant languages, which further entrenches the supremacy of those languages. As AI becomes widely used across industries, marginalized groups may find it harder to catch up with technological trends, making it more difficult to preserve cultural diversity around the world.

We hope that machines can work for every human being regardless of race, nationality, or the language they speak. However, there are more than 7,000 languages worldwide, and it is practically impossible to collect a huge amount of annotated data for each of them. Machines need annotations to learn, yet human babies acquire language with almost no annotation at all. Can machines do the same?
To answer this question, a wave of self-supervised learning has emerged in the field of speech AI to reduce the dependency on large labeled datasets. The goal of self-supervised learning is to train machines to understand human speech using only everyday conversations and video clips from the Internet, without any human annotation.

A comprehensive evaluation of self-supervised speech models - SUPERB

To allow machines to learn human language purely from observation, as human babies do, the speech lab at National Taiwan University has partnered with speech research groups at Meta, CMU, MIT, and JHU to develop a new evaluation framework for self-supervised speech processing: the Speech processing Universal PERformance Benchmark (SUPERB). SUPERB targets models that learn to understand human voice signals through self-supervised learning on massive amounts of unlabeled speech audio. When developers then want a machine to learn a specific speech processing task (called a downstream task), such as speech recognition, only a small amount of annotated data related to that task is required. With self-supervised learning, tasks that originally required large amounts of annotated data can be learned quickly.

One of SUPERB's main contributions is that it does not focus on a single downstream task; instead, it enables a comprehensive assessment of a machine's ability to understand human voices. Speech signals carry multiple kinds of information, including speaker characteristics (who is speaking), prosodic characteristics (how it is said), and content (what is said). A natural question for researchers in this field is: can we effectively leverage a large amount of unlabeled data, together with a little labeled data, to develop a model that is generically applicable to a wide range of downstream tasks?
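To make the idea concrete, here is a minimal, purely illustrative sketch (not from the SUPERB paper, and with a trivial stand-in "model") of a masked-prediction pretext task, the kind of objective self-supervised speech models train on: parts of the signal are hidden and the model is scored on reconstructing them from context, so the data itself supplies the supervision and no human labels are needed.

```python
import random

def masked_prediction_loss(frames, mask_rate=0.3, seed=0):
    """Toy self-supervised pretext task: hide some frames and score a
    context-based predictor on reconstructing them. No labels are needed;
    the signal itself provides the training target."""
    rng = random.Random(seed)
    masked = [i for i in range(1, len(frames) - 1) if rng.random() < mask_rate]
    loss = 0.0
    for i in masked:
        # A stand-in "model": predict the hidden frame from its neighbours.
        prediction = (frames[i - 1] + frames[i + 1]) / 2
        loss += (prediction - frames[i]) ** 2
    return loss / max(len(masked), 1)

# A smooth, structured signal is easy to predict from context; noise is not.
smooth = [t * 0.1 for t in range(50)]
noise_rng = random.Random(1)
noisy = [noise_rng.uniform(-1, 1) for _ in range(50)]
print(masked_prediction_loss(smooth))  # essentially 0
print(masked_prediction_loss(noisy))   # much larger
```

A real self-supervised model would replace the neighbour-averaging line with a trainable network, but the supervision signal, reconstructing the hidden parts of the input, is the same.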
To address this question, SUPERB currently covers ten speech-related tasks to comprehensively evaluate how well self-supervised speech models understand all aspects of human voice signals. The ten tasks are: phoneme recognition, speech recognition, keyword spotting, query by example (QbyE), intent classification, slot filling, emotion recognition, speaker identification, speaker verification, and speaker diarization. SUPERB has conducted unprecedented large-scale experiments in self-supervised speech processing, evaluating how well the field's popular models understand human voices, including the wav2vec series [2] and HuBERT [3] proposed by Facebook AI, the Mockingjay series proposed by National Taiwan University [4], APC [5] and NPC [6] proposed by MIT, PASE [7] proposed by MILA, and more. The corresponding paper has been accepted by INTERSPEECH [1], a top international speech research conference. SUPERB aims to become a standard evaluation benchmark for self-supervised learning in speech, and we encourage speech researchers around the world to participate in the challenge and push the frontier of self-supervised learning together. Moreover, NTU Speech Lab PhD student Leo Yang led a group of graduate students to integrate the self-supervised learning code into an open-source toolkit, S3PRL. The codebase is publicly shared on GitHub [8] and has been used and recommended by over 1,500 researchers and developers worldwide.
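The evaluation protocol behind this, a frozen self-supervised upstream model plus a lightweight, task-specific downstream head trained on a little labeled data, can be caricatured in a few lines. Everything below is a toy stand-in (the hand-crafted `frozen_upstream` features and the nearest-centroid head are our inventions, not SUPERB components); real experiments would load an actual pretrained upstream through the S3PRL toolkit.

```python
def frozen_upstream(audio):
    """Stand-in for a pretrained self-supervised model: maps raw samples to
    a fixed feature vector. In SUPERB-style evaluation this model is
    frozen, i.e. never updated while solving downstream tasks."""
    mean = sum(audio) / len(audio)
    energy = sum(x * x for x in audio) / len(audio)
    return (mean, energy)

class NearestCentroidHead:
    """A lightweight downstream 'head' trained on only a few labeled
    examples, standing in for the small task-specific layers."""
    def fit(self, features, labels):
        sums, counts = {}, {}
        for f, y in zip(features, labels):
            s = sums.setdefault(y, [0.0] * len(f))
            for j, v in enumerate(f):
                s[j] += v
            counts[y] = counts.get(y, 0) + 1
        self.centroids = {y: [v / counts[y] for v in s]
                          for y, s in sums.items()}
        return self

    def predict(self, f):
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(f, c))
        return min(self.centroids, key=lambda y: dist(self.centroids[y]))

# A tiny "labeled" downstream set: one loud clip, one quiet clip.
loud = [5.0, -5.0, 4.0, -4.0]
quiet = [0.1, -0.1, 0.2, -0.2]
head = NearestCentroidHead().fit(
    [frozen_upstream(loud), frozen_upstream(quiet)], ["loud", "quiet"])
print(head.predict(frozen_upstream([4.0, -4.0, 4.0, -4.0])))  # loud
```

The point of the sketch is the division of labor: the upstream never changes, so the same representation can serve speech recognition, speaker tasks, and the other downstream tasks simply by swapping in different small heads.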
A shared research and innovation platform to advocate AI democracy

By creating a shared self-supervised learning innovation platform, SUPERB aspires to increase the utility of AI for under-resourced languages and groups, protecting linguistic and cultural diversity, and to alleviate the monopoly that large corporations hold over large-scale user data and computational resources, providing a fairer environment for smaller companies and researchers while protecting user privacy. The SUPERB researchers hope that, with these efforts, AI can understand everyone's voice, and that any person or group can benefit equally from advances in AI technology. We encourage more researchers and policy makers to join us in advocating the democratization of AI from both technical and policy standpoints, to make it accessible to all. For more details, please refer to the SUPERB official website: http://superbbenchmark.org/.

References:

[1] Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee, "SUPERB: Speech processing Universal PERformance Benchmark," INTERSPEECH, 2021.
[2] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," NeurIPS, 2020.
[3] Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin Bolte, Ruslan Salakhutdinov, Abdelrahman Mohamed, "HuBERT: How Much Can a Bad Teacher Benefit ASR Pre-Training?," ICASSP, 2021.
[4] Andy T. Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, Hung-yi Lee, "Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders," ICASSP, 2020.
[5] Yu-An Chung, Wei-Ning Hsu, Hao Tang, James Glass, "An Unsupervised Autoregressive Model for Speech Representation Learning," INTERSPEECH, 2019.
[6] Alexander H. Liu, Yu-An Chung, James Glass, "Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies," INTERSPEECH, 2021.
[7] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, Yoshua Bengio, "Multi-Task Self-Supervised Learning for Robust Speech Recognition," ICASSP, 2020.
[8] https://github.com/s3prl/s3prl
Technical Film: (embedded video)
Keyword: Self-supervised Learning
Research Project: Self-supervised Trustworthy Learning for Next-generation Intelligent Services
Research Team: Led by PI: Prof. Winston H. Hsu, National Taiwan University; Co-PI: Associate Prof. Hung-Yi Lee, National Taiwan University
More like this
Providing the latest information on AI research centers and applied industries
-
Embedding multimodal machine intelligence in the digital life of AI technology
This project collaborates with an international team to collect a very large-scale Chinese emotional speech corpus. On the technical side, it also examines the fairness of speech emotion recognition, to address social issues that may arise around the usability of emotion recognition. Among other findings, the team discovered that the database annotations are biased with respect to gender, which leads to biases in the trained models. To solve this problem, the team has made preliminary progress on fairness techniques, and the results will be submitted for publication in the near future.
-
Deep Reinforcement Learning in Autonomous Miniature Car Racing
This project develops a high-performance end-to-end reinforcement learning training platform for autonomous miniature car racing. With this platform, our team won the championship of Amazon DeepRacer, a worldwide autonomous racing competition. In addition, by combining various reinforcement learning algorithms and frameworks, our self-developed autonomous racing platform can operate at much higher speeds, surpassing the performance of Amazon DeepRacer.
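The description above does not specify the team's algorithms; as a generic, hypothetical illustration of the loop a reinforcement-learning training platform runs, here is tabular Q-learning on a toy one-dimensional track (all names and parameters here are ours, unrelated to the team's platform):

```python
import random

def train_q_learning(track_len=6, episodes=300, alpha=0.5, gamma=0.9,
                     eps=0.2, seed=0):
    """Tabular Q-learning on a toy 1-D track: the car starts at cell 0,
    actions are 0 (step left) and 1 (step right), and reaching the last
    cell (the finish line) yields reward 1."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(track_len)]

    def greedy(s):
        # Break ties randomly so early training explores both directions.
        if q[s][0] == q[s][1]:
            return rng.randrange(2)
        return 0 if q[s][0] > q[s][1] else 1

    for _ in range(episodes):
        s = 0
        while s < track_len - 1:
            a = rng.randrange(2) if rng.random() < eps else greedy(s)
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == track_len - 1 else 0.0
            # Q-learning update: bootstrap from the best next-state value.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train_q_learning()
policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(5)]
print(policy)  # after training, the greedy policy drives right: [1, 1, 1, 1, 1]
```

Real racing platforms replace the table with a neural network and the 1-D track with a simulator, but the interaction loop of act, observe reward, and update the value estimate is the same.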
-
Advanced Technologies for Designing Trustable AI Services
This integrated research project follows Taiwan's 2030 Science & Technology Vision and takes the LOHAS community and inclusive technology as its major research directions. We aim to develop trustable AI technologies and introduce them into future smart services, realizing the development of human-centric smart technology and strengthening the governance and application of emerging technologies. The integrated project consists of 7 sub-projects led by PIs from National Taiwan University, National Tsing-Hua University, and Academia Sinica, composed of top AI technology teams. The sub-projects are divided into 3 clusters: machine learning (sub-projects 1 and 2), computer vision (sub-projects 3 and 4), and human-centric computing (sub-projects 5, 6, and 7). We will address the issues of bias, fairness, transparency, explainability, traceability, and so on, from the aspects of data collection, technology, and application deployment. Each sub-project will implement specific smart services to demonstrate the benefits and practical applications of the developed technologies. The NTU Joint Research Center for AI Technology and All Vista Healthcare, an AI Innovation Research Center supported by MOST, is responsible for the management, planning, and execution of the integrated research project. We will propose a plan that can be generalized and applied to the intelligent service industry.
-
Computer Vision Research Center, National Yang-Ming Chiao-Tung university
Development of AI Platform for Smart Drone - Intelligent Flight: Due to its high mobility and its ability to fly, the drone has inspired more and more innovative applications and services in recent years. The goal of this project is to solve the problem of flying an unmanned aerial vehicle (UAV; a drone, in our case) blind, when it is out of human sight or beyond the range of wireless communication. Three major research and development directions are considered, applying three artificial intelligence (AI) technologies: smart sensing, smart control, and smart simulation. Smart sensing - a flight system is developed that can avoid obstacles, complete a flight mission, and land safely. Smart control - an intelligent flight control system and a lightweight somatosensory vest are developed. Smart simulation - a cost-effective training system and a 3D model simplification method are designed.
-
CKIP Lab
Textual Advertisement Generator: Given even limited specifics of any product, the AI Advertisement Producer can automatically generate large numbers of high-quality descriptions and advertisements for the product in just one second. And it produces more than a single copy: with deep learning and natural language processing technologies trained on millions of existing samples, our AI model can produce advertisements in various styles at the same time for users to choose from. It will be a great helper, or a virtual brainstorming partner, for any brand or advertiser creating advertisements.
-
Stepped Respiratory Care Platform based on Zero-Contact Physiological Monitoring System
Combining millimeter-wave radar detection of chest-movement breathing patterns and heart rate, continuous blood-oxygen detection, active disease records from a chatbot, and mobile-phone analysis of 30-second sit-to-stand activity frequency patterns, a personalized respiratory-capacity baseline is established through AI modeling. This can be applied to zero-contact respiratory physiological monitoring, useful for infectious disease wards, epidemic-prevention hotels, and centralized quarantine centers.
-
Deep Learning Based Anomaly Detection
For video anomaly detection, we apply pretrained models to obtain the foreground and the optical flow as ground truth; our model then estimates this information from only a single frame of input. For human behaviors, we take human poses as input and use a GCN-based model to predict future poses. In both works, the anomaly score is given by the estimation error. For defect detection, our model takes patches of the image as input and learns to extract features; the anomaly score of each patch is given by the distance between the patch and the training patches.
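The patch-distance scoring in the defect detection step can be sketched in a few lines. This toy uses raw pixel patches and nearest-neighbour distance in place of the project's learned features (both simplifications are ours):

```python
def extract_patches(image, k=2):
    """Split a 2-D image (a list of rows) into non-overlapping k x k
    patches, each flattened into a tuple of pixel values."""
    patches = []
    for r in range(0, len(image) - k + 1, k):
        for c in range(0, len(image[0]) - k + 1, k):
            patches.append(tuple(image[r + i][c + j]
                                 for i in range(k) for j in range(k)))
    return patches

def anomaly_score(patch, train_patches):
    """Score a patch by its squared distance to the nearest patch seen in
    normal training data: patches unlike anything normal score high."""
    return min(sum((a - b) ** 2 for a, b in zip(patch, t))
               for t in train_patches)

# "Normal" training images are flat; the test image has one bright defect.
normal = [[0.0] * 4 for _ in range(4)]
train_patches = extract_patches(normal)
defective = [[0.0] * 4 for _ in range(4)]
defective[1][1] = 9.0
scores = [anomaly_score(p, train_patches) for p in extract_patches(defective)]
print(scores)  # only the patch containing the defect scores high: [81.0, 0.0, 0.0, 0.0]
```

In the project's actual pipeline the distance would be computed between learned feature embeddings of patches rather than raw pixels, but the scoring principle, distance to the normal training set, is the same.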
-
Visually Impaired Navigation Dialogue System with Multiple AI Models
The dialogue system is the main subsystem of the visually impaired navigation system; it provides destinations for the navigation system through multi-turn dialogue. We use a knowledge graph as the basis for reasoning. For close-range navigation, deep learning is used to develop an RGB-camera depth-estimation algorithm, an indoor semantic segmentation algorithm, and their integration for indoor obstacle avoidance. The whole system uses the CellS software design framework to integrate distributed AIoT systems.
-
A deep learning based outdoor walking assistive system for the visually impaired
We provide a wearable device that helps the visually impaired walk outdoors. Using a deep learning network, the system can recognize safe areas such as sidewalks and crosswalks and guide the user to walk on them. In addition, it can recognize common types of obstacles and guide the user to avoid them in advance. Finally, it converts a Google Maps route into easy-to-understand voice prompts that guide the user in the right direction.