

MrGreen arXiv Daily Academic Digest 2022-05-05



【1】 Code-Switching Text Augmentation for Multilingual Speech Processing

Authors: Amir Hussein, Shammur Absar Chowdhury, Ahmed Abdelali, Najim Dehak, Ahmed Ali
Affiliations: KANARI AI, California, USA; Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA; Qatar Computing Research Institute, Qatar
Abstract: The pervasiveness of intra-utterance code-switching (CS) in spoken content has forced ASR systems to handle mixed input. Yet designing a CS-ASR system poses many challenges, mainly due to data scarcity, grammatical structure complexity, and mismatch, along with an unbalanced language usage distribution. Recent ASR studies have shown the predominance of E2E-ASR using multilingual data to handle CS phenomena with little CS data. However, the dependency on CS data remains. In this work, we propose a methodology to augment monolingual data by artificially generating spoken CS text to improve different speech modules. We base our approach on Equivalence Constraint theory while exploiting aligned translation pairs to generate grammatically valid CS content. Our empirical results show a relative gain of 29-34% in perplexity and around 2% in WER for two ecological and noisy CS test sets. Finally, the human evaluation suggests that 83.8% of the generated data is acceptable to humans.
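
As a rough illustration of the idea in this abstract, the sketch below swaps words between a sentence and its translation using word alignments. It is not the authors' implementation: the function names, the `switch_prob` parameter, the English-French toy sentence, and the rule that a word may be switched only if its alignment link crosses no other link (a crude stand-in for the Equivalence Constraint) are all assumptions made for this example.

```python
import random


def link_crosses(i, j, alignment):
    """Return True if the link (i, j) crosses any other link in the alignment."""
    return any((i - k) * (j - l) < 0 for k, l in alignment)


def generate_code_switched(src_tokens, tgt_tokens, alignment,
                           switch_prob=0.5, seed=0):
    """Replace some source tokens with their aligned target tokens.

    alignment: list of (src_index, tgt_index) pairs, assumed one-to-one.
    A source token is eligible for switching only if its link crosses no
    other link, so the word order on both sides agrees at that point
    (a simplified, hypothetical reading of the Equivalence Constraint).
    """
    rng = random.Random(seed)
    links = dict(alignment)
    output = []
    for i, token in enumerate(src_tokens):
        j = links.get(i)
        eligible = j is not None and not link_crosses(i, j, alignment)
        if eligible and rng.random() < switch_prob:
            output.append(tgt_tokens[j])
        else:
            output.append(token)
    return output


# Toy example: "red"/"rouge" and "car"/"voiture" are reordered between the
# two languages, so only "the"/"la" is an eligible switch point.
src = ["the", "red", "car"]
tgt = ["la", "voiture", "rouge"]
alignment = [(0, 0), (1, 2), (2, 1)]
print(generate_code_switched(src, tgt, alignment, switch_prob=1.0))
# prints ['la', 'red', 'car']
```

A real system would obtain the alignments from a word aligner and would switch whole constituents rather than single words; this sketch only shows how alignment crossings can rule out switch points.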

【2】 Audio representations for deep learning in sound synthesis: A review

Authors: Anastasia Natsiou, Sean O'Leary
Affiliations: Technological University of Dublin, Dublin, Ireland
Abstract: The rise of deep learning algorithms has led many researchers to move away from classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation. The choice of architecture is tightly coupled to the audio representation. A sound's original waveform can be too dense and rich for deep learning models to handle efficiently, and this complexity increases training time and computational cost. Also, it does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio has been transformed into a compressed and more meaningful form using upsampling, feature extraction, or even by adopting a higher-level illustration of the waveform. Furthermore, depending on the form chosen, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always depending on the audio representation.
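
To make the point about perceptually motivated representations concrete, the sketch below uses the mel scale as an example. The abstract does not name a specific representation, so the choice of the mel scale, the HTK-style conversion formula, and the function names are assumptions made for illustration only.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(n_bands, f_min, f_max):
    """Centre frequencies (Hz) of n_bands bands equally spaced in mels."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (k + 1)) for k in range(n_bands)]


# Bands are packed densely at low frequencies and sparsely at high ones,
# roughly mirroring how pitch resolution is perceived.
for centre in mel_band_centers(8, 0.0, 8000.0):
    print(round(centre, 1))
```

Summarising a spectrum with a small number of such bands is one way a dense waveform can be turned into a more compact representation, at the cost of discarding detail.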

【3】 A sinusoidal signal reconstruction method for the inversion of the mel-spectrogram

Authors: Anastasia Natsiou, Sean O'Leary
Affiliations: Technological University of Dublin, Dublin, Ireland
Abstract: The synthesis of sound via deep learning methods has recently received much attention. Some problems for deep learning approaches to sound synthesis relate to the amount of data needed to specify an audio signal and the necessity of preserving both the long- and short-time coherence of the synthesised signal. Visual time-frequency representations such as the log-mel-spectrogram have gained in popularity. The log-mel-spectrogram is a perceptually informed representation of audio that greatly compresses the amount of information required to describe the sound. However, because of this compression, the representation is not directly invertible. Both signal processing and machine learning techniques have previously been applied to the inversion of the log-mel-spectrogram, but both caused audible distortions in the synthesised sounds due to issues of temporal and spectral coherence. In this paper, we outline the application of a sinusoidal model to the inversion of the log-mel-spectrogram for pitched musical instrument sounds, outperforming state-of-the-art deep learning methods. The approach could later be used as a general decoding step from spectral to time intervals in neural applications.
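
For readers unfamiliar with sinusoidal models, the sketch below builds a pitched sound as a sum of harmonic sinusoids. It only illustrates the general signal model; how the paper estimates the pitch and partial amplitudes from a log-mel-spectrogram is not described in the abstract, so the function name, the parameters, and the fixed-amplitude, zero-phase partials are assumptions of this example.

```python
import math


def synthesize_harmonics(f0, amplitudes, duration=0.01, sample_rate=16000):
    """Additive synthesis of a harmonic tone.

    Partial k (k = 1, 2, ...) has frequency k * f0 and amplitude
    amplitudes[k - 1]. Partials at or above the Nyquist frequency are
    skipped to avoid aliasing.
    """
    n_samples = int(round(duration * sample_rate))
    nyquist = sample_rate / 2
    signal = []
    for i in range(n_samples):
        t = i / sample_rate
        sample = sum(
            a * math.sin(2 * math.pi * k * f0 * t)
            for k, a in enumerate(amplitudes, start=1)
            if k * f0 < nyquist
        )
        signal.append(sample)
    return signal


# A 440 Hz tone with three partials of decreasing amplitude.
tone = synthesize_harmonics(440.0, [1.0, 0.5, 0.25])
print(len(tone))  # prints 160
```

Because every partial is generated from an explicit frequency and phase, the output is coherent over time by construction, which is the property the abstract identifies as hard to preserve with other inversion methods.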

【4】 Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset

Authors: Tiezheng Yu, Rita Frieske, Peng Xu, Samuel Cahyawijaya, Cheuk Tung Shadow Yiu, Holy Lovenia, Wenliang Dai, Elham J. Barezi, Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung
Affiliations: The Hong Kong University of Science and Technology
Abstract: Automatic speech recognition (ASR) on low-resource languages improves the access of linguistic minorities to the technological advantages provided by Artificial Intelligence (AI). In this paper, we address the problem of data scarcity for the Hong Kong Cantonese language by creating a new Cantonese dataset. Our dataset, the Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It combines the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and perform experiments on the two biggest datasets (MDCC and Common Voice zh-HK). We analyze the existing datasets according to their speech type, data source, total size, and availability. The results of experiments conducted with the Fairseq S2T Transformer, a state-of-the-art ASR model, show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.



