The 18th National Conference on Man-Machine Speech Communication | AISpeech and Shanghai Jiao Tong University Publish 11 Papers
From December 8 to 10, 2023, the 18th National Conference on Man-Machine Speech Communication (NCMMSC) was held at the CCF Headquarters & Academic Exchange Center in Suzhou, Jiangsu. The conference was co-hosted by the Chinese Information Processing Society of China and the China Computer Federation, and organized by AISpeech and Shanghai Jiao Tong University. Renowned scholars from home and abroad were invited to deliver keynote and tutorial talks, and the conference also featured a young scholars forum, a student forum, and an industry forum.
At the conference, AISpeech and Shanghai Jiao Tong University jointly presented 11 papers. Shuai Fan, R&D Director of AISpeech, gave a keynote titled "Conversational AI and Its Industrial Applications," showcasing how AISpeech's DFM-2 large model empowers industrial applications. Kai Yu, co-founder and chief scientist of AISpeech and professor at Shanghai Jiao Tong University, served as conference chair and delivered the closing remarks.
The AISpeech DFM-2 Large Model Comprehensively Empowers Industry
In July this year, AISpeech released the DFM-2 large model. Building on DFM-2, the AISpeech DUI platform was upgraded to DUI 2.0, completing a full-chain upgrade of its conversational AI technology and advancing deep industrial applications. Recently, AISpeech also launched a large-model application platform, beginning to comprehensively empower vertical domains and help industries upgrade intelligently.
On site, Shuai Fan, R&D Director of AISpeech, shared how the DUI 2.0 full-chain large-model solution is applied in smart home, smart vehicle, and meeting/office scenarios. Large-model capabilities such as semantic generalization with cross-domain multi-intent understanding, document question answering, persona-based chit-chat, and encyclopedic knowledge enable customers to build AI products that are intuitive, intelligent, warm, and knowledgeable.
In the smart home scenario, the solution supports numerous intelligent scene configurations from the entryway to the living room, kitchen, bedroom, and the whole house. In whole-house scenarios, voice control lets smart lighting, audio-visual, curtain, and security systems work together through the AISpeech smart central-control screen, giving users a high-quality smart home experience. In the smart vehicle scenario, the Tianqin 5.0 system has been upgraded to Tianqin 6.0, supporting multimodal, multi-intent, multi-zone, full-scenario multi-turn continuous dialogue. In the smart meeting scenario, 麦耳会记 has been upgraded to version 3.0, supporting one-click transcript generation, AI summaries, and AI to-do lists.
In production, daily life, and social governance scenarios that mainly serve digital government and enterprise customers, the Smart Sanitation Innovation Application Center of the National New-Generation Artificial Intelligence Open Innovation Platform for Language Computing, co-established with AISpeech's participation, was recently officially unveiled in Chongqing. Drawing on the DFM-2 large model together with voiceprint authentication, emotion recognition, image analysis, and behavior recognition technologies, AISpeech will detect and warn against dangerous driving behaviors such as fatigue and distraction, helping drivers work safely and accelerating the digital and intelligent transformation of Chongqing's sanitation system.
Eleven Papers Published at the Conference
At this conference, AISpeech and Shanghai Jiao Tong University jointly published 11 papers, covering long-form speech recognition, speech synthesis, speech editing, end-to-end streaming customizable keyword spotting, and related topics. According to the conference organizers, outstanding English papers from the conference will be recommended for publication in the Journal of Shanghai Jiao Tong University (EI-indexed).
1. Improving VAD Performance in Long-form Noisy Condition Through ASR Integration
——Bingqing Zhu; Shaofei Xue; Qing Zhuo
While end-to-end CTC models have shown great success in automatic speech recognition (ASR), performance degrades severely when the target is long-form speech. Compared with short speech, the recognition performance of long-form speech is affected by more factors, e.g., Voice Activity Detection (VAD). VAD failures result in insertion or deletion errors in the inference of the subsequent ASR model. To make the VAD prediction more suitable for the long-form ASR task, it is necessary to incorporate the VAD training process into the building of the whole ASR system. In this paper, we present a novel joint training (JT) framework to realize this idea. Firstly, we apply the VAD output to mask the input features of the ASR model and build a VAD-ASR joint training model through a multi-task learning criterion. With the assistance of joint training, the performance of both the VAD and the final ASR improves conspicuously, as the objective of the joint training framework is better suited to the long-form speech task. Secondly, we feed an ASR embedding vector, extracted from a small ASR model, into the VAD model as an auxiliary input; we find it helpful for the classification of speech and non-speech in the VAD task. Experimental results in a long-form speech scenario show that the proposed methods outperform the baseline system both in the VAD evaluation metric and in the final character error rate (CER) of the ASR.
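The masking idea at the heart of this abstract can be sketched as follows; this is a minimal illustration assuming a frame-level VAD whose speech probabilities gate the ASR input features, with module names, sizes, and the loss weight being illustrative rather than the paper's actual configuration.

```python
# Hedged sketch of VAD-masked ASR input plus multi-task training.
import torch
import torch.nn as nn

class JointVadAsr(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=5000):
        super().__init__()
        self.vad = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))           # frame-level speech logit
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab_size)            # CTC output layer

    def forward(self, feats):
        speech_prob = torch.sigmoid(self.vad(feats))             # (B, T, 1)
        masked = feats * speech_prob                             # soft-mask non-speech frames
        enc, _ = self.encoder(masked)
        return self.ctc_head(enc).log_softmax(-1), speech_prob

# Multi-task loss: CTC for ASR plus binary cross-entropy for VAD.
def joint_loss(log_probs, speech_prob, targets, in_lens, tgt_lens, vad_labels, alpha=0.5):
    ctc = nn.functional.ctc_loss(log_probs.transpose(0, 1), targets, in_lens, tgt_lens)
    bce = nn.functional.binary_cross_entropy(speech_prob.squeeze(-1), vad_labels)
    return ctc + alpha * bce
```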
2. BER: Balanced Error Rate For Speaker Diarization
——Tao Liu; Shuai Fan; Kai Yu
Assessing diarization performance in real-world scenarios involving multiple speakers and spontaneous speech poses substantial challenges. Existing metrics like Diarization Error Rate (DER) can overlook less vocal speakers and brief, semantically rich utterances. Recent studies present an utterance-level error rate, calculating the Intersection over Union (IoU) of reference and hypothesized utterances. Despite these advancements, utterances can still become fragmented into segments due to algorithmic disparities in segmentation or labeling uncertainties, often leading to unanticipated outcomes. In response to these challenges, this paper introduces the Segment Error Rate (SER), which constructs interconnected sub-graphs between the reference and hypothesized segments. Only segments within the same sub-graph are treated as correlated and subsequently factored into the IoU calculations.
Moreover, we favor a dynamic threshold over a rough, predetermined value to account for the variability inherent in utterance lengths. Experimental results corroborate the efficacy of SER for arbitrary segmentation. Furthermore, we introduce the Balanced Error Rate (BER), a metric that synthesizes SER with duration and speaker errors to afford a holistic appraisal of diarization performance. A series of rigorous experiments confirms the effectiveness of the proposed methodology.
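To make the segment-level IoU concrete, here is a minimal sketch assuming segments are (start, end) pairs in seconds; the paper's sub-graph construction and dynamic threshold are simplified here to plain overlap grouping.

```python
# Hedged illustration of IoU between a reference utterance and the
# hypothesis segments that overlap it (segments assumed non-overlapping).
def overlap(a, b):
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def segment_iou(ref, hyp_segments):
    """IoU of one reference utterance vs. all hypothesis segments touching it."""
    matched = [h for h in hyp_segments if overlap(ref, h) > 0]
    inter = sum(overlap(ref, h) for h in matched)
    union = (ref[1] - ref[0]) + sum(h[1] - h[0] for h in matched) - inter
    return inter / union if union > 0 else 0.0

# Example: one reference utterance fragmented into two hypothesis segments.
print(segment_iou((0.0, 2.0), [(0.0, 0.9), (1.1, 2.0)]))  # ~0.9
```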
3. CLAUDIO: Clustering, Augmentation, and Discriminator for Foley Sound Synthesis
——Zeyu Xie; Xuenan Xu; Baihan Li; Mengyue Wu; Kai Yu
Foley sound synthesis plays a pivotal role in the realm of multimedia and entertainment, enriching auditory experiences across various applications. Mainstream audio generation frameworks comprise two core components: audio representation and token prediction. To enhance the accuracy, fidelity, and diversity of generated audio, we present the CLAUDIO system, which combines several training strategies: (1) a CLustering module for better training; (2) a mixup module for data AUgmentation; and (3) a DIscriminatOr model to selectively filter audio. Our findings demonstrate the efficacy of these approaches in improving the quality of synthesized audio.
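The mixup-style augmentation mentioned in the abstract can be sketched in a few lines; this is generic mixup applied to audio arrays, not necessarily the exact scheme used in CLAUDIO.

```python
# Hedged sketch: mix two clips (or their spectrograms) with a Beta-sampled weight.
import numpy as np

def mixup(x1, x2, alpha=0.4):
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam

clip_a = np.random.randn(16000)   # stand-in for 1 s of 16 kHz audio
clip_b = np.random.randn(16000)
mixed, lam = mixup(clip_a, clip_b)
```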
4. Knowledge-driven Text Generation for Zero-shot ASR Domain Adaptation
——Xizhuo Zhang; Baochen Yang; Sen Liu; Yiwei Guo; Zheng Liang; Kai Yu
Domain adaptation with little data is a challenging problem in automatic speech recognition (ASR). Recent works have proposed augmenting the training data with in-domain audio generated by text-to-speech (TTS). Text content has been shown to be the crucial factor for effective adaptation, while speaker and style have minimal effect. However, an abundant, high-quality in-domain text corpus is still required and is not always available in real-world scenarios. In this paper, we propose a fully zero-shot domain adaptation method based on knowledge-driven text generation. We design a domain knowledge description framework and use it to guide a large language model (LLM) to generate an in-domain text corpus, which is then synthesized by a high-fidelity TTS system. The synthetic audio-text pairs are used to augment the training data for domain adaptation. To progressively refine text generation, we propose a novel iterative self-regeneration approach, in which the recognized hypotheses of the target audio are fed back to the LLM to elaborate in-domain knowledge and direct finer text generation. Experiments on a TED domain adaptation task show that, without any in-domain text or audio data, the proposed method obtains significant performance improvement through adaptation and approaches the performance of using a real in-domain text corpus.
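The iterative self-regeneration loop described above can be outlined as below; llm_generate, tts_synthesize, adapt_asr, and asr_decode are hypothetical stand-ins for whatever LLM, TTS, and ASR components are actually used, so this is a structural sketch only.

```python
# Hedged outline of zero-shot adaptation with iterative self-regeneration.
def zero_shot_adaptation(domain_description, target_audio,
                         llm_generate, tts_synthesize, adapt_asr, asr_decode,
                         rounds=3):
    hypotheses = []
    asr_model = None
    for _ in range(rounds):
        # 1) Prompt the LLM with the domain description (plus previous hypotheses).
        prompt = domain_description + "\n".join(hypotheses)
        texts = llm_generate(prompt)
        # 2) Synthesize audio-text pairs and adapt the ASR model on them.
        pairs = [(tts_synthesize(t), t) for t in texts]
        asr_model = adapt_asr(pairs)
        # 3) Decode the target-domain audio and feed hypotheses back to the LLM.
        hypotheses = [asr_decode(asr_model, a) for a in target_audio]
    return asr_model
```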
5. End-to-end Streaming Customizable Keyword Spotting Based on Text-adaptive Neural Search
——Baochen Yang; Jiaqi Guo; Yu Xi; Haoyu Li; Kai Yu
Streaming keyword spotting (KWS) is an important technique for voice assistant wake-up. While KWS with a preset fixed keyword has been well studied, test-time customizable keyword spotting in streaming mode remains a great challenge due to the lack of pre-collected keyword-specific training data and the requirement of streaming detection output. In this paper, we propose a novel end-to-end text-adaptive neural search architecture with a multi-label trigger mechanism that allows any pre-trained ASR acoustic model to be used effectively for fast streaming customizable keyword spotting. Evaluation results on various datasets show that our approach significantly outperforms both the traditional post-processing baseline and the neural search baseline, while achieving a 44x search speedup compared to the traditional post-processing method.
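For intuition only, the following is a much-simplified matcher in the spirit of the post-processing baseline the paper compares against: it scores a user-defined keyword from the frame-level token posteriors of a pre-trained acoustic model. It is not the paper's text-adaptive neural search, and the posterior format is an assumption.

```python
# Hedged, simplified streaming keyword scoring from frame-level token posteriors.
import numpy as np

def keyword_score(posteriors, keyword_ids):
    """posteriors: (T, V) frame-by-token probabilities; keyword_ids: token id sequence."""
    best = np.full(len(keyword_ids) + 1, -np.inf)
    best[0] = 0.0                                   # log-prob of having matched 0 tokens
    for frame in posteriors:                        # frames arrive in streaming order
        for k in range(len(keyword_ids), 0, -1):    # extend partial matches monotonically
            step = best[k - 1] + np.log(frame[keyword_ids[k - 1]] + 1e-10)
            best[k] = max(best[k], step)
    return best[-1] / len(keyword_ids)              # length-normalized log score

# A detection is triggered when this score exceeds a chosen threshold.
```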
6. Contextual Spectrum and Prosody Integration for Text-Based Speech Editing
——Zheng Liang; Chenpeng Du; Kai Yu; Xie Chen
Advancements in text-to-speech (TTS) models have considerably elevated audio synthesis quality and naturalness, finding applications in speech data augmentation and editing. Yet achieving high naturalness in text-based speech editing while maintaining audio similarity to the original remains challenging. This paper introduces CSP-Edit, a novel text-based speech editing approach built on a neural TTS framework, allowing users to efficiently delete, insert, and replace content within sentences. Using a mask prediction mechanism, it masks the altered region of the original mel-spectrogram, and its BERT-style bidirectional transformers predict the masked region from the text and the unaltered speech context. The method also underpins a new voice cloning technique. Evaluations on the LibriTTS and HiFiTTS datasets demonstrate its superiority over several benchmarks in speech naturalness, quality, spectral distortion, and restorative conditions.
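The mask-prediction setup can be pictured with a small sketch, assuming the edited words have already been aligned to a frame range; frame indices, tensor sizes, and the predictor itself are placeholders.

```python
# Hedged sketch: mask the mel frames of the edited region for a bidirectional
# predictor to reconstruct from the new text and the surrounding frames.
import torch

def mask_edit_region(mel, start_frame, end_frame, mask_value=0.0):
    """mel: (T, n_mels) spectrogram; returns the masked mel and a boolean mask."""
    masked = mel.clone()
    masked[start_frame:end_frame] = mask_value
    region = torch.zeros(mel.shape[0], dtype=torch.bool)
    region[start_frame:end_frame] = True
    return masked, region

mel = torch.randn(400, 80)                    # ~4 s of mel frames (illustrative)
masked_mel, edit_mask = mask_edit_region(mel, 120, 180)
# A BERT-style bidirectional model would then predict mel[edit_mask] conditioned
# on the edited text and on masked_mel outside the edit region.
```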
7. A Framework Combining Separate and Joint Training for Neural Vocoder-Based Monaural Speech Enhancement
——Qiaoyi Pan; Wenbin Jiang; Kai Yu
Conventional single-channel speech enhancement methods have predominantly focused on enhancing the amplitude spectrum while preserving the original phase spectrum, which may introduce speech distortion; meanwhile, the intricate nature of complex spectra and waveform characteristics presents formidable challenges for training. In this paper, we introduce a novel framework with the mel-spectrogram serving as an intermediate feature for speech enhancement. It integrates a denoising network and a deep generative vocoder, allowing the speech to be reconstructed without using the phase. The denoising network, a recurrent convolutional autoencoder, is trained on the mel-spectrogram representations of clean and noisy speech to produce an enhanced spectral output. This enhanced spectrum then serves as the input to a high-fidelity, fast vocoder, which synthesizes the improved speech waveform. After the two modules are pre-trained separately, they are stacked for joint training. Experimental results show the superiority of this approach in terms of speech quality, surpassing conventional models. Notably, our method demonstrates strong adaptability across both the Chinese dataset CSMSC and the English dataset VoiceBank+DEMAND, underscoring its considerable promise for real-world applications.
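A minimal sketch of the two-stage pipeline follows, assuming a 16 kHz waveform and an 80-bin mel representation; the denoiser and vocoder below are untrained placeholders, not the paper's actual networks.

```python
# Hedged sketch: noisy waveform -> mel -> denoiser -> mel-conditioned vocoder,
# so the noisy phase is never reused.
import torch
import torchaudio

mel_extractor = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

denoiser = torch.nn.Sequential(            # stand-in for the recurrent conv autoencoder
    torch.nn.Conv1d(80, 256, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv1d(256, 80, 3, padding=1),
)

def enhance(noisy_wav, vocoder):
    mel = mel_extractor(noisy_wav)          # (1, 80, T): magnitude-only representation
    enhanced_mel = denoiser(mel)            # cleaned mel (separate-training stage)
    return vocoder(enhanced_mel)            # any mel-conditioned neural vocoder

noisy = torch.randn(1, 16000)               # 1 s of synthetic "noisy" audio for illustration
enhanced_mel = denoiser(mel_extractor(noisy))
```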
8. Iteration Noisy-target Approach: Speech Enhancement without Clean Speech
——Yifan Zhang; Wenbin Jiang; Kai Yu
Traditional deep neural network (DNN) based speech enhancement usually requires clean speech as the training target. However, limited access to ideal clean speech is an obstacle to practical use. Meanwhile, existing self-supervised or unsupervised methods suffer from both unsatisfactory performance and impractical data requirements (e.g., various kinds of noise added to the same clean speech). Hence, there is a significant need to either relax the restrictions on training data or improve performance. In this paper, we propose a training strategy that requires only noisy speech and noise waveforms. It primarily consists of two phases: 1) since adding noise to noisy speech itself constructs an input-target pair for DNN training, the first round of training uses noisier speech (noise added to noisy speech) as input and the noisy speech as target; 2) in subsequent rounds, the model trained in the previous round is used to refine the noisy speech, and new noisier-noisy pairs are constructed for the next round of training. Moreover, to accelerate the process, we apply the iteration at the epoch level. To evaluate its effectiveness, we utilize a dataset including 10 types of real-world noise and compare against two classic supervised and unsupervised methods.
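The pair construction and iteration can be sketched in a few lines; the model, dataset handling, and noise scaling are simplified placeholders under the assumption that waveforms are tensors of equal length.

```python
# Hedged sketch of noisier-to-noisy pairs and iterative target refinement.
import torch

def make_pair(noisy, noise, scale=1.0):
    """Input = noisy speech + extra noise; target = the noisy speech itself."""
    return noisy + scale * noise, noisy

def iterate_targets(model, noisy_batch):
    """Later rounds: refine targets with the previous model before re-pairing."""
    with torch.no_grad():
        return model(noisy_batch)

# Round 1: train the model on (noisier, noisy) pairs built by make_pair.
# Round k: targets = iterate_targets(model_{k-1}, noisy); rebuild pairs; retrain.
```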
9. Learning from Parent Classes: A Multimodal Weighted Fusion Transformer with Parent Class Prediction
——Yuxuan Wang; Mengyue Wu; Kai Yu
Human beings employ an adaptable, multidimensional perceptual system to perceive different object classes and events in the world. The transformation from single-modality learning to multimodality learning is key to a more comprehensive machine perception system. Most previous work adopts a universal fusion strategy for all classes and mostly concentrates on early or late fusion, ignoring the information carried in the hierarchical class structure.
In contrast, we propose a parent-wise multimodal fusion method aimed at automatically learning an adaptable fusion strategy based on different parent classes. Specifically, we first propose a weighting method based on parent class prediction (MUP-weighting), which uses the relationship between modality importance and class to find the multimodal fusion weights. Further, a multimodal parent-class-wise weighted transformer (MAST) is proposed to incorporate parent information into hierarchical processing and obtain the final class-wise fusion weights. It is also worth noting that MAST uses audio, video, and optical flow as input modalities, which is rare in other multimodal settings.
Experiments with different parent class settings indicate that the fusion strategy of different modalities is related to the parent class. Through ablation analysis, we show that, by incorporating a class-wise weighting mechanism, our proposed method improves not only the class-wise but also the overall multimodal fusion performance.
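One way to picture parent-class-wise weighting is the sketch below, in which each predicted parent class owns its own softmax-normalized weights over the audio, video, and optical-flow embeddings; dimensions and the way the weights are learned are illustrative assumptions, not the MAST architecture itself.

```python
# Hedged sketch of parent-class-wise weighted fusion over modality embeddings.
import torch
import torch.nn as nn

class ParentWiseFusion(nn.Module):
    def __init__(self, dim=256, n_parents=5, n_modalities=3, n_classes=50):
        super().__init__()
        # one weight vector over modalities per parent class
        self.weights = nn.Parameter(torch.zeros(n_parents, n_modalities))
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, modal_feats, parent_id):
        # modal_feats: (B, n_modalities, dim); parent_id: (B,) predicted parent class
        w = torch.softmax(self.weights[parent_id], dim=-1)   # (B, n_modalities)
        fused = (w.unsqueeze(-1) * modal_feats).sum(dim=1)   # weighted sum over modalities
        return self.classifier(fused)
```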
10. Automatic Parkinson’s Speech Severity Prediction via Read Speech
——Pingyue Zhang; Mengyue Wu; Kai Yu
Automatic Parkinson's Disease (PD) detection has attracted much attention over the last few years. Speech is a major biomarker in PD diagnosis and a convenient, non-intrusive behavioral signal, and is hence of great significance for automatic PD detection. Unlike other speech-based disease diagnoses, PD largely influences articulatory features, which are closely related to both audio and language.
This paper presents the first large-scale Mandarin PD speech dataset collected in clinical settings, with audio recordings from 562 PD patients, each given a severity score ranging from 0 to 4 by professional clinicians. Inspired by the categorization strategy in MDS-UPDRS, we design a series of language-dependent, PD-specific features, including matching-accuracy-related features, pause-related features, average speed, and the number of repetitions. These features are highly related to PD speech disorder and improve the performance of automatic severity score prediction.
Results indicate that the proposed PD feature set largely outperforms previous audio and hand-crafted features, with an absolute improvement of 30%. The feature set combines audio and language features with high interpretability and can be transferred to any other spoken language given automatic speech recognition results. The proposed PD speech dataset and detection method can be of practical use in less restricted settings for PD patients of any severity.
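To illustrate the kind of language-dependent features described above, here is a minimal sketch that derives pause and speaking-rate statistics from ASR word timestamps; the exact feature definitions in the paper may differ, and the threshold and output keys are assumptions.

```python
# Hedged sketch: pause- and speed-related features from ASR word timestamps.
def pause_and_speed_features(words, min_pause=0.3):
    """words: list of (word, start_sec, end_sec) from an ASR system."""
    pauses = [b[1] - a[2] for a, b in zip(words, words[1:]) if b[1] - a[2] > min_pause]
    total_time = words[-1][2] - words[0][1]
    return {
        "num_pauses": len(pauses),
        "mean_pause": sum(pauses) / len(pauses) if pauses else 0.0,
        "speech_rate": len(words) / total_time,   # words per second
    }

print(pause_and_speed_features([("我", 0.0, 0.3), ("们", 0.8, 1.1), ("走", 1.2, 1.5)]))
```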
11. CAM-GUI: A Conversational Assistant on Mobile GUI
——Zichen Zhu; Liangtai Sun; Jingkai Yang; Yifan Peng; Weilin Zou; Ziyuan Li; Wutao Li; Lu Chen; Yingzi Ma; Danyang Zhang; Shuai Fan; Kai Yu
Smartphone assistants are becoming more and more popular in our daily lives. These assistants mostly rely on API-based Task-Oriented Dialogue (TOD) systems, which limits their generality, and developing the APIs costs considerable labor and time. In this paper, we develop a Conversational Assistant on Mobile GUI (CAM-GUI), which can directly perform GUI operations on real devices without the need for TOD-related backend APIs. To evaluate the performance of our assistant, we collect a dataset containing dialogues and GUI operation traces. Through experimental demonstrations and user studies, we show that CAM-GUI achieves promising results.
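For a rough idea of what a GUI operation trace might look like, the sketch below maps an assistant turn to low-level GUI actions on screen elements; the schema is hypothetical and is not the dataset's actual format.

```python
# Hedged sketch of a hypothetical GUI action schema for a dialogue turn.
from dataclasses import dataclass

@dataclass
class GuiAction:
    action: str          # e.g. "click", "type", "scroll"
    target: str          # identifier or description of the on-screen element
    text: str = ""       # text to enter, for "type" actions

trace = [
    GuiAction("click", "search_box"),
    GuiAction("type", "search_box", "coffee near me"),
    GuiAction("click", "search_button"),
]
```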
Looking ahead, applications of the AISpeech DFM-2 large model will continue to be deployed across domains. Building on the capabilities of DFM-2, AISpeech will keep empowering customers in smart home, smart vehicle, and consumer electronics scenarios, as well as digital government and enterprise sectors such as finance, rail transit, and government affairs, helping them achieve intelligent upgrades of their AI products.
For partnership inquiries, please email:
marketing@aispeech.com