Speech/Audio Processing Academic Digest [12.14]
cs.SD (Speech): 3 papers
eess.AS (Audio Processing): 3 papers
1. cs.SD (Speech):
【1】 Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis
Link: https://arxiv.org/abs/2212.06397
Affiliation: Kwai, Beijing, P.R. China
Note: Published to ISCSLP 2022
Abstract: Cross-speaker style transfer in speech synthesis aims at transferring a style
from a source speaker to synthesised speech of a target speaker's timbre. Most
previous approaches rely on data with style labels, but manually-annotated
labels are expensive and not always reliable. In response to this problem, we
propose Style-Label-Free, a cross-speaker style transfer method, which can
realize the style transfer from source speaker to target speaker without style
labels. Firstly, a reference encoder structure based on quantized variational
autoencoder (Q-VAE) and style bottleneck is designed to extract discrete style
representations. Secondly, a speaker-wise batch normalization layer is proposed
to reduce the source speaker leakage. In order to improve the style extraction
ability of the reference encoder, a style invariant and contrastive data
augmentation method is proposed. Experimental results show that the method
outperforms the baseline. We provide a website with audio samples.
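The abstract names two building blocks: discrete style codes from a quantized VAE, and a speaker-wise normalization layer. The sketch below shows the general idea of each in NumPy and is not the authors' implementation. The function names, the nearest-neighbour codebook lookup, and the per-speaker mean/variance statistics are all assumptions made for the example.

```python
import numpy as np

def speaker_wise_normalize(features, speaker_ids, eps=1e-5):
    """Normalize each row with the mean/variance of its own speaker's rows.

    features: (batch, dim) array; speaker_ids: (batch,) integer array.
    Hypothetical stand-in for a speaker-wise batch normalization layer.
    """
    features = np.asarray(features, dtype=np.float64)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mean = features[mask].mean(axis=0)
        var = features[mask].var(axis=0)
        out[mask] = (features[mask] - mean) / np.sqrt(var + eps)
    return out

def quantize(vectors, codebook):
    """Replace each vector with its nearest codebook entry (Euclidean distance)."""
    vectors = np.asarray(vectors, dtype=np.float64)
    codebook = np.asarray(codebook, dtype=np.float64)
    # Pairwise squared distances, shape (batch, num_codes).
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices
```

Normalizing per speaker removes speaker-dependent offsets and scales from the style features, which is one plausible reading of how such a layer could reduce source speaker leakage.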
【2】 Towards trustworthy phoneme boundary detection with autoregressive model and improved evaluation metric
Link: https://arxiv.org/abs/2212.06387
Affiliation: Supertone, Inc., Seoul National University
Note: 5 pages, submitted to ICASSP 2023
Abstract: Phoneme boundary detection has been studied due to its central role in
various speech applications. In this work, we point out that this task needs to
be addressed not only algorithmically, but also through the evaluation metric. To
this end, we first propose a state-of-the-art phoneme boundary detector that
operates in an autoregressive manner, dubbed SuperSeg. Experiments on the TIMIT
and Buckeye corpora demonstrate that SuperSeg identifies phoneme boundaries
by a significant margin over existing models. Furthermore, we note that
the popular evaluation metric, R-value, has a limitation, and propose
new evaluation metrics that prevent each boundary from contributing to
evaluation multiple times. The proposed metrics reveal the weaknesses of
non-autoregressive baselines and establish a reliable criterion suited
to evaluating phoneme boundary detection.
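The abstract does not define the proposed metrics, so the sketch below is not the paper's metric. It illustrates the standard R-value formula combined with a greedy one-to-one matching, so that no boundary is counted in more than one hit. The function names and the 20 ms tolerance are assumptions made for the example.

```python
import math

def match_boundaries(reference, predicted, tolerance=0.02):
    """Greedy one-to-one matching: each boundary takes part in at most one hit."""
    reference = sorted(reference)
    predicted = sorted(predicted)
    hits, i, j = 0, 0, 0
    while i < len(reference) and j < len(predicted):
        diff = predicted[j] - reference[i]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # predicted boundary lies before the current reference window
        else:
            i += 1  # reference boundary has no predicted boundary close enough
    return hits

def r_value(reference, predicted, tolerance=0.02):
    """R-value computed from one-to-one hits (times in seconds)."""
    if not reference:
        raise ValueError("reference must contain at least one boundary")
    hits = match_boundaries(reference, predicted, tolerance)
    hit_rate = hits / len(reference)
    # Over-segmentation: relative excess of predicted boundaries.
    over_seg = len(predicted) / len(reference) - 1.0
    r1 = math.sqrt((1.0 - hit_rate) ** 2 + over_seg ** 2)
    r2 = (-over_seg + hit_rate - 1.0) / math.sqrt(2.0)
    return 1.0 - (abs(r1) + abs(r2)) / 2.0
```

With this matching, two predictions that fall near the same reference boundary yield a single hit, and the extra prediction is penalized through the over-segmentation term.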
【3】 Jointly Learning Visual and Auditory Speech Representations from Raw Data
Link: https://arxiv.org/abs/2212.06246
Affiliation: Imperial College London, Meta AI
Note: 22 pages
Abstract: We present RAVEn, a self-supervised multi-modal approach to jointly learn
visual and auditory speech representations. Our pre-training objective involves
encoding masked inputs, and then predicting contextualised targets generated by
slowly-evolving momentum encoders. Driven by the inherent differences between
video and audio, our design is asymmetric w.r.t. the two modalities' pretext
tasks: Whereas the auditory stream predicts both the visual and auditory
targets, the visual one predicts only the auditory targets. We observe strong
results in low- and high-resource labelled data settings when fine-tuning the
visual and auditory encoders resulting from a single pre-training stage, in
which the encoders are jointly trained. Notably, RAVEn surpasses all
self-supervised methods on visual speech recognition (VSR) on LRS3, and
combining RAVEn with self-training using only 30 hours of labelled data even
outperforms a recent semi-supervised method trained on 90,000 hours of
non-public data. At the same time, we achieve state-of-the-art results in the
LRS3 low-resource setting for auditory speech recognition (as well as for VSR).
Our findings point to the viability of learning powerful speech representations
entirely from raw video and audio, i.e., without relying on handcrafted
features. Code and models will be made public.
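"Slowly-evolving momentum encoders" are commonly implemented as an exponential moving average (EMA) of the trained encoder's weights. The sketch below shows that generic update rule and is not taken from the RAVEn code. The function name, the dictionary-of-arrays parameter layout, and the momentum value are assumptions.

```python
import numpy as np

def momentum_update(target_params, online_params, momentum=0.999):
    """Return EMA-updated target parameters: m * target + (1 - m) * online.

    Both arguments are dicts mapping parameter names to NumPy arrays.
    """
    return {
        name: momentum * target_params[name]
        + (1.0 - momentum) * online_params[name]
        for name in target_params
    }

# Usage: the target encoder drifts slowly toward the online encoder.
target = {"w": np.zeros(2)}
online = {"w": np.ones(2)}
target = momentum_update(target, online, momentum=0.9)
```

A momentum close to 1 makes the target encoder change slowly, which keeps the prediction targets stable between training steps.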
2. eess.AS (Audio Processing):
【1】 Towards deep generation of guided wave representations for composite materials
Link: https://arxiv.org/abs/2212.06365
Affiliation: Institute for Infocomm Research (Senthilnath)
Abstract: Laminated composite materials are widely used in most fields of engineering.
Wave propagation analysis plays an essential role in understanding the
short-duration transient response of composite structures. The forward
physics-based models are utilized to map from elastic properties space to wave
propagation behavior in a laminated composite material. Due to the
high-frequency, multi-modal, and dispersive nature of the guided waves, the
physics-based simulations are computationally demanding. This makes property
prediction, generation, and material design problems more challenging. In this
work, a forward physics-based simulator such as the stiffness matrix method is
utilized to collect group velocities of guided waves for a set of composite
materials. A variational autoencoder (VAE)-based deep generative model is
proposed for the generation of new and realistic polar group velocity
representations. It is observed that the deep generator is able to reconstruct
unseen representations with very low mean square reconstruction error. Global
Monte Carlo and directional equally-spaced samplers are used to sample the
continuous, complete and organized low-dimensional latent space of the VAE. The
sampled point is fed into the trained decoder to generate new polar
representations. The network has shown exceptional generation capabilities. It
is also seen that the latent space forms a conceptual space where different
directions and regions show inherent patterns related to the generated
representations and their corresponding material properties.
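The two latent-space sampling strategies named in the abstract can be illustrated as follows. This is a sketch under assumptions, not the authors' code: it assumes a standard normal prior for the global Monte Carlo sampler and a straight line of equally spaced points for the directional one. The function names and parameters are hypothetical, and the trained decoder is omitted.

```python
import numpy as np

def global_monte_carlo_samples(num_samples, latent_dim, seed=0):
    """Draw latent points at random from an assumed standard normal prior."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_samples, latent_dim))

def directional_samples(start, direction, num_points, step):
    """Equally spaced latent points along one direction, beginning at start."""
    start = np.asarray(start, dtype=float)
    direction = np.asarray(direction, dtype=float)
    unit = direction / np.linalg.norm(direction)
    offsets = step * np.arange(num_points)
    return start + offsets[:, None] * unit

# Usage: each row would be passed to a trained decoder to generate a
# new polar group velocity representation.
random_points = global_monte_carlo_samples(num_samples=5, latent_dim=2)
line_points = directional_samples([0.0, 0.0], [3.0, 4.0], num_points=3, step=1.0)
```

Walking along a fixed direction makes it easier to see how the generated representation changes with one latent factor, while random draws give coverage of the whole space.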
【2】 Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis
Link: https://arxiv.org/abs/2212.06397
Affiliation: Kwai, Beijing, P.R. China
Note: Published to ISCSLP 2022
【3】 Towards trustworthy phoneme boundary detection with autoregressive model and improved evaluation metric
Link: https://arxiv.org/abs/2212.06387
Affiliation: Supertone, Inc., Seoul National University
Note: 5 pages, submitted to ICASSP 2023