
Audio Processing Academic Digest [1.10]

MrGreen (格林先生) arXiv Daily Academic Digest, 2022-05-05

Update! The H5 page now supports collapsible abstracts for a better experience. Visit arxivdaily.com, covering CS | Physics | Math | Economics | Statistics | Finance | Biology | Electrical Engineering, with search, favorites, and more!


eess.AS Audio Processing: 4 papers in total


【1】 Code-Switching Text Augmentation for Multilingual Speech Processing
Link: https://arxiv.org/abs/2201.02550

Authors: Amir Hussein, Shammur Absar Chowdhury, Ahmed Abdelali, Najim Dehak, Ahmed Ali
Affiliations: KANARI AI, California, USA; Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA; Qatar Computing Research Institute, Qatar
Abstract: The pervasiveness of intra-utterance code-switching (CS) in spoken content has forced ASR systems to handle mixed input. Yet, designing a CS-ASR system poses many challenges, mainly due to data scarcity, complex grammatical structure, and mismatched, unbalanced language-usage distributions. Recent ASR studies have shown the predominance of E2E-ASR trained on multilingual data for handling CS phenomena with little CS data; however, the dependency on CS data remains. In this work, we propose a methodology for augmenting monolingual data by artificially generating spoken CS text to improve different speech modules. We base our approach on Equivalence Constraint theory, exploiting aligned translation pairs to generate grammatically valid CS content. Our empirical results show a relative gain of 29-34% in perplexity and around 2% in WER on two ecological and noisy CS test sets. Finally, human evaluation suggests that 83.8% of the generated data is acceptable to humans.
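To make the augmentation idea concrete, here is a minimal Python sketch of generating code-switched text from an aligned translation pair. This is not the authors' implementation: the Equivalence Constraint is approximated by only switching source spans whose word alignment maps to a contiguous, self-contained target span, and the toy sentence pair and alignment are invented for illustration.

```python
# Illustrative sketch of code-switched (CS) text generation from aligned
# translation pairs. NOT the authors' implementation: the Equivalence
# Constraint is simplified to "only switch spans whose alignment is
# contiguous and does not leak outside the span".

def contiguous_target_span(alignment, src_start, src_end):
    """Return the target span aligned to src[src_start:src_end],
    or None if the aligned target indices are not a clean block."""
    tgt_idx = sorted(j for i, j in alignment if src_start <= i < src_end)
    if not tgt_idx:
        return None
    lo, hi = tgt_idx[0], tgt_idx[-1]
    # Reject if the target block also aligns to source words outside the span.
    for i, j in alignment:
        if lo <= j <= hi and not (src_start <= i < src_end):
            return None
    return lo, hi + 1

def generate_cs(src_tokens, tgt_tokens, alignment, src_start, src_end):
    """Replace src[src_start:src_end] with its aligned target-language span."""
    span = contiguous_target_span(alignment, src_start, src_end)
    if span is None:
        return None  # switching here would violate the (simplified) constraint
    lo, hi = span
    return src_tokens[:src_start] + tgt_tokens[lo:hi] + src_tokens[src_end:]

# Toy English-Arabic pair; word alignment given as (src_index, tgt_index).
src = ["i", "like", "machine", "learning"]
tgt = ["أنا", "أحب", "تعلم", "الآلة"]
align = [(0, 0), (1, 1), (2, 3), (3, 2)]
print(generate_cs(src, tgt, align, 2, 4))  # ['i', 'like', 'تعلم', 'الآلة']
```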

【2】 Audio representations for deep learning in sound synthesis: A review
Link: https://arxiv.org/abs/2201.02490

Authors: Anastasia Natsiou, Sean O'Leary
Affiliations: Technological University of Dublin, Dublin, Ireland
Abstract: The rise of deep learning algorithms has led many researchers to move away from classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation, and the choice of architecture is tightly coupled to the audio representation. A sound's raw waveform can be too dense and rich for deep learning models to handle efficiently, and its complexity increases training time and computational cost; moreover, it does not represent sound in the manner in which it is perceived. Therefore, in many cases, raw audio is transformed into a compressed and more meaningful form by upsampling, feature extraction, or even a higher-level depiction of the waveform. Furthermore, depending on the form chosen, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always in relation to the chosen audio representation.
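As a concrete reference for the representation this review centres on, the following sketch computes a log-mel-spectrogram from a synthetic tone using librosa. The frame and filterbank parameters are illustrative defaults, not values from the paper.

```python
# Minimal sketch of the log-mel-spectrogram representation, using librosa.
import librosa
import numpy as np

# One second of a 440 Hz tone stands in for real audio.
sr = 22050
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Mel-scaled power spectrogram: |STFT|^2 projected onto a mel filterbank,
# compressing the linear frequency axis perceptually.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=80)

# Logarithmic compression of the dynamic range (dB relative to the peak).
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```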

【3】 A sinusoidal signal reconstruction method for the inversion of the mel-spectrogram
Link: https://arxiv.org/abs/2201.02483

Authors: Anastasia Natsiou, Sean O'Leary
Affiliations: Technological University of Dublin, Dublin, Ireland
Abstract: The synthesis of sound via deep learning methods has recently received much attention. Some problems for deep learning approaches to sound synthesis relate to the amount of data needed to specify an audio signal and the necessity of preserving both the long- and short-time coherence of the synthesised signal. Visual time-frequency representations such as the log-mel-spectrogram have gained in popularity. The log-mel-spectrogram is a perceptually informed representation of audio that greatly compresses the amount of information required to describe the sound; because of this compression, however, the representation is not directly invertible. Both signal processing and machine learning techniques have previously been applied to inverting the log-mel-spectrogram, but both cause audible distortions in the synthesised sounds due to issues of temporal and spectral coherence. In this paper, we outline the application of a sinusoidal model to the inversion of the log-mel-spectrogram for pitched musical instrument sounds, outperforming state-of-the-art deep learning methods. The approach could later be used as a general decoding step from spectral to time-domain representations in neural applications.
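The following sketch shows the resynthesis half of a sinusoidal model: additive synthesis from frame-wise partial frequency and amplitude tracks. Estimating those tracks from a log-mel-spectrogram is the paper's actual contribution and is not reproduced here; the tracks below are invented toy values.

```python
# Additive synthesis from frame-wise sinusoidal tracks. The track
# estimation from the (log-mel) spectrogram, the hard part addressed by
# the paper, is assumed to have been done already.
import numpy as np

def additive_synthesis(freqs, amps, hop_length, sr):
    """freqs, amps: arrays of shape (n_partials, n_frames), Hz / linear gain."""
    n_partials, n_frames = freqs.shape
    n_samples = (n_frames - 1) * hop_length
    frame_pos = np.arange(n_frames) * hop_length
    sample_pos = np.arange(n_samples)
    y = np.zeros(n_samples)
    for p in range(n_partials):
        # Interpolate the frame-rate tracks up to sample rate.
        f = np.interp(sample_pos, frame_pos, freqs[p])
        a = np.interp(sample_pos, frame_pos, amps[p])
        # Phase accumulation keeps each sinusoid continuous as f varies.
        phase = 2 * np.pi * np.cumsum(f) / sr
        y += a * np.sin(phase)
    return y

# Toy example: a 440 Hz fundamental plus its octave, 50 frames.
sr, hop = 22050, 256
freqs = np.vstack([np.full(50, 440.0), np.full(50, 880.0)])
amps = np.vstack([np.full(50, 0.6), np.full(50, 0.3)])
y = additive_synthesis(freqs, amps, hop, sr)
```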

【4】 Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset
Link: https://arxiv.org/abs/2201.02419

Authors: Tiezheng Yu, Rita Frieske, Peng Xu, Samuel Cahyawijaya, Cheuk Tung Shadow Yiu, Holy Lovenia, Wenliang Dai, Elham J. Barezi, Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung
Affiliations: The Hong Kong University of Science and Technology
Abstract: Automatic speech recognition (ASR) on low-resource languages improves linguistic minorities' access to the technological advantages provided by Artificial Intelligence (AI). In this paper, we address the problem of data scarcity for Hong Kong Cantonese by creating a new Cantonese dataset. Our dataset, the Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and perform experiments on the two biggest ones, MDCC and Common Voice zh-HK, analyzing the existing datasets according to their speech type, data source, total size, and availability. The results of experiments conducted with the Fairseq S2T Transformer, a state-of-the-art ASR model, show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
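The abstract does not specify the evaluation code, but Cantonese ASR is conventionally scored by character error rate (CER), since written Cantonese has no word boundaries. A generic Levenshtein-based sketch, not taken from the paper:

```python
# Character error rate (CER) via edit distance, the usual metric for
# Cantonese ASR. Generic sketch, not the paper's evaluation code.

def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

print(cer("我哋去食飯", "我地去食飯"))  # 0.2: one substituted character
```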

