Audio warning! My bedside girlfriend reads papers to me every night to lull me to sleep
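Project overview
How can PaddlePaddle "read" papers by itself, that is, perform text-to-speech? Broken down simply, it comes to implementing the text-to-speech (TTS) task for the following three scenarios:
a paper introduction on an HTML page, a paper abstract in a PDF, and English sentences in an image (recognized with OCR).
Final TTS results
Reading of the HTML article paragraphs:
Detailed project walkthrough
The steps below are published on AI Studio and can be tried online; readers can also run them on their own machines:
Step 1: Download and install the libraries
Install the Parakeet model library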
%cd Parakeet
!pip install -e .
%cd ..
import nltk
nltk.download("punkt")
nltk.download("cmudict")
Prepare the Parakeet pretrained models
WaveFlow pretrained model (128 residual channels); FastSpeech text-to-speech pretrained model
!unzip waveflow_res128_ljspeech_ckpt_1.0.zip -d Parakeet/examples/fastspeech/
!wget https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech_ljspeech_ckpt_1.0.zip
!unzip fastspeech_ljspeech_ckpt_1.0.zip -d Parakeet/examples/fastspeech/fastspeech_ljspeech_ckpt_1.0/
Install PaddleOCR
%cd PaddleOCR/
!pip install -r requirments.txt
Prepare the recognition pretrained model that supports spaces
%cd inference
!wget https://paddleocr.bj.bcebos.com/ch_models/ch_rec_r34_vd_crnn_enhance_infer.tar && tar xf ch_rec_r34_vd_crnn_enhance_infer.tar
!wget https://paddleocr.bj.bcebos.com/ch_models/ch_det_r50_vd_db_infer.tar && tar xf ch_det_r50_vd_db_infer.tar
%cd ../..
Install Beautiful Soup and other utility libraries
!pip install xlwt
!pip install xlrd
!pip install lxml
!pip install w3lib
!pip install pdfminer3k
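Step 2: Parse the article content
Parse the HTML article:
References: Beautiful Soup 4.4.0 documentation; Parsing HTML with Python Beautiful Soup to get data; Using find and find_all in BeautifulSoup; Removing specified HTML tags and comments with BeautifulSoup; AI Studio project: crawling contestant information for "Youth With You 2"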
import requests
from bs4 import BeautifulSoup


def print_crawl_data(url, save_path):
    """
    Fetch the HTML page at url, print the matching text and append it to save_path.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers)
        # Passing the page source to the BeautifulSoup constructor returns a document object.
        # The lxml parser installed in step 1 is named explicitly.
        soup = BeautifulSoup(response.text, 'lxml')
        # Find every <span> tag whose style attribute is exactly this string
        texts = soup.find_all('span', {'style': 'color: rgb(0, 0, 0); font-family: Arial, sans-serif;'})
        # Write as utf-8, the encoding used when the file is read back later
        with open(save_path, 'a', encoding='utf-8') as f:
            for text in texts:
                # Keep only the text content of the tag
                result = text.text
                print(result)
                f.write(result + "\n")
    except Exception as e:
        print(e)


print_crawl_data('http://research.baidu.com/Blog/index-view?id=139', 'article.txt')
Audio synthesis has a variety of applications, including text-to-speech (TTS), music generation, virtual assistant, and digital content creation. In recent years, deep neural network has obtained noticeable successes for synthesizing raw audio in high-fidelity speech and music generation. One of the most successful examples are autoregressive models (e.g., WaveNet). However, they sequentially generate high temporal resolution of raw waveform (e.g., 24 kHz) at synthesis, which are prohibitively slow for real-time applications.
Many researchers from various organizations have spent considerable effort to develop parallel generative models for raw audio. Parallel WaveNet and ClariNet could generate high-fidelity audio in parallel, but they require distillation from a pretrained autoregressive model and a set of auxiliary losses for training, which complicates the training pipeline and increases the cost of development. GAN-based model can be trained from scratch, but it provides inferior audio fidelity than WaveNet. WaveGlow can be trained directly with maximum likelihood, but the model has huge number of parameters (e.g., 88M parameters) to reach the comparable fidelity of audio as WaveNet.
Today, we’re excited to announce WaveFlow (paper, audio samples), the latest milestone of audio synthesis research at Baidu. It features: 1) high-fidelity & ultra-fast audio synthesis, 2) simple likelihood-based training, and 3) small memory footprint, which could not be achieved simultaneously in previous work. Our small-footprint model (5.91M parameters) can synthesize high-fidelity speech (MOS: 4.32) more than 40x faster than real-time on a Nvidia V100 GPU. WaveFlow also provides a unified view of likelihood-models for raw audio, which includes both WaveNet and WaveGlow as special cases and allow us to explicitly trade inference parallelism for model capacity.
Our paper will be presented at ICML 2020.
For more details of WaveFlow, please check out our paper: https://arxiv.org/abs/1912.01219
Audio samples are in: https://waveflow-demo.github.io/
The implementation can be accessed in Parakeet, which is a text-to-speech toolkit building on PaddlePaddle: https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow
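To show what the style-based span filter above is doing, here is a minimal sketch that performs the same kind of extraction with Python's standard html.parser module instead of Beautiful Soup. The names SpanTextExtractor and extract_span_text are made up for this illustration, and the behaviour is only an approximation of find_all (for example, nested matching spans are merged into one result).

```python
from html.parser import HTMLParser


class SpanTextExtractor(HTMLParser):
    """Collect the text of every <span> whose style attribute matches exactly."""

    def __init__(self, style):
        super().__init__()
        self.style = style
        self.depth = 0      # greater than 0 while inside a matching <span>
        self.chunks = []    # text pieces of the span currently being read
        self.results = []   # one string per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1  # a nested span inside a matching one
        elif dict(attrs).get("style") == self.style:
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_span_text(html, style):
    """Return the text of all spans in html that carry the given style."""
    parser = SpanTextExtractor(style)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical page fragment, only for demonstration
sample = '<p><span style="color: red;">Hello <b>world</b></span><span>skip</span></p>'
print(extract_span_text(sample, "color: red;"))
```

Matching on an exact style string is brittle: if the page changes its inline styles, the filter returns nothing, so it is worth checking that the result list is not empty before moving on.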
with open('article.txt', 'r', encoding='utf-8') as fr, open('article2.txt', 'w', encoding='utf-8') as fd:
    for text in fr:
        # Keep only lines that contain non-whitespace characters
        if text.split():
            fd.write(text)
print('Blank lines removed...')
Blank lines removed...
with open('article2.txt', 'r', encoding='utf-8') as fr, open('article3.txt', 'w', encoding='utf-8') as fd:
    for text in fr:
        # Start a new line after every period so that each line holds roughly one sentence
        text = text.replace('.', '.\n')
        fd.write(text)
print('Line-break processing done...')
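Replacing every period with a period plus a newline also breaks lines at decimal points (5.91M), inside abbreviations (e.g.) and inside URLs, and each fragment is then synthesized as a separate clip. As a hedged alternative, the sketch below splits only where sentence-ending punctuation is followed by whitespace and an uppercase letter. The function name split_sentences is illustrative, and the rule is a heuristic: it still splits after titles such as "Dr." and misses sentences that start with a digit or a lowercase letter.

```python
import re

# Boundary: whitespace that follows ., ! or ? and precedes an uppercase letter
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def split_sentences(paragraph):
    """Return the sentences of one paragraph as a list of strings."""
    paragraph = paragraph.strip()
    if not paragraph:
        return []
    return _SENTENCE_BOUNDARY.split(paragraph)


text = ("Our model has 5.91M parameters. "
        "See https://arxiv.org/abs/1912.01219 for details. It is fast.")
for sentence in split_sentences(text):
    print(sentence)
```

Writing one returned sentence per line produces a file in the same shape as article3.txt, so the rest of the pipeline does not need to change.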
Parse the PDF article
References: Parsing PDF with pdfminer in Python; Removing blank lines from a text file in Python
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams


def parse(DataIO, save_path):
    # Create a PDF parser from the file object
    parser = PDFParser(DataIO)
    # Create a PDF document
    doc = PDFDocument()
    # Connect the parser and the document to each other
    parser.set_document(doc)
    doc.set_parser(parser)
    # Supply the initial password; it defaults to an empty string
    doc.initialize()
    # Stop if the document does not allow text extraction
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create a PDF resource manager to manage shared resources
    rsrcmagr = PDFResourceManager()
    # Create the layout analysis parameters
    laparams = LAParams()
    # Combine the resource manager and the layout parameters into a device
    device = PDFPageAggregator(rsrcmagr, laparams=laparams)
    # Create a PDF interpreter
    interpreter = PDFPageInterpreter(rsrcmagr, device)
    with open(save_path, 'a', encoding='utf-8') as f:
        # doc.get_pages() yields the pages; process one page at a time
        for page in doc.get_pages():
            interpreter.process_page(page)
            # Receive the LTPage object for this page
            layout = device.get_result()
            # layout holds the objects parsed from the page, typically
            # LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal and so on;
            # the text is read from the text boxes with get_text()
            for x in layout:
                try:
                    if isinstance(x, LTTextBoxHorizontal):
                        result = x.get_text()
                        print(result)
                        f.write(result + "\n")
                except Exception:
                    print("Failed")


# Parse the local PDF and save the text to a local TXT file
with open('waveflow.pdf', 'rb') as pdf_html:
    parse(pdf_html, 'pdf2text_output.txt')
with open('pdf2text_output.txt', 'r', encoding='utf-8') as fr, open('abstract.txt', 'w', encoding='utf-8') as fd:
    # For this particular PDF, the abstract sits at line indices 60 to 85 of the extracted text
    for text in fr.readlines()[60:86]:
        if text.split():
            fd.write(text)
            print(text)
print('Abstract printed')
Abstract
In this work, we propose WaveFlow, a small-
footprint generative flow for raw audio, which
is directly trained with maximum likelihood. It
handles the long-range structure of 1-D wave-
form with a dilated 2-D convolutional architec-
ture, while modeling the local variations using
expressive autoregressive functions. WaveFlow
provides a unified view of likelihood-based mod-
els for 1-D data, including WaveNet and Wave-
Glow as special cases. It generates high-fidelity
speech as WaveNet, while synthesizing several
orders of magnitude faster as it only requires a
few sequential steps to generate very long wave-
forms with hundreds of thousands of time-steps.
Furthermore, it can significantly reduce the likeli-
hood gap that has existed between autoregressive
models and flow-based models for efficient syn-
thesis. Finally, our small-footprint WaveFlow has
only 5.91M parameters, which is 15× smaller
than WaveGlow. It can generate 22.05 kHz high-
fidelity audio 42.6× faster than real-time (at a rate
of 939.3 kHz) on a V100 GPU without engineered
inference kernels.
Abstract printed
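The extracted abstract keeps the PDF's line wrapping, so words such as "architec-" / "ture" are cut in half, and the hard-coded slice [60:86] only fits this one file. The sketch below shows one possible way to address both points: extract_section picks the lines between two heading lines, and unwrap_lines joins wrapped lines back into running text. Both function names and the end marker "1. Introduction" are assumptions made for illustration. The hyphen rule is a heuristic too: with keep_hyphen=False a genuine compound such as "small-footprint" that happens to wrap at the hyphen becomes "smallfootprint".

```python
def extract_section(lines, start_marker, end_marker):
    """Return the lines strictly between the start and end marker lines."""
    section, inside = [], False
    for line in lines:
        stripped = line.strip()
        if not inside:
            inside = stripped == start_marker
        elif stripped == end_marker:
            break
        else:
            section.append(line)
    return section


def unwrap_lines(lines, keep_hyphen=False):
    """Join wrapped lines into one string, merging words split at a line end."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if parts and parts[-1].endswith("-"):
            prev = parts.pop()
            if not keep_hyphen:
                prev = prev[:-1]  # drop the line-break hyphen
            parts.append(prev + line)
        else:
            parts.append(line)
    return " ".join(parts)


# Hypothetical excerpt in the same shape as pdf2text_output.txt
lines = [
    "WaveFlow\n",
    "Abstract\n",
    "handles the long-range structure of 1-D wave-\n",
    "form with a dilated 2-D convolutional architec-\n",
    "ture.\n",
    "\n",
    "1. Introduction\n",
]
print(unwrap_lines(extract_section(lines, "Abstract", "1. Introduction")))
```

The unwrapped string can then be split into sentences and written one per line, which gives the TTS step complete sentences instead of half-line fragments.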
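OCR: recognize English sentences in images
Make a small change to the following part of the main() function in PaddleOCR/tools/infer/predict_system.py so that it outputs only the recognized text, which is easier to read: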
dt_num = len(dt_boxes)
for dno in range(dt_num):
    text, score = rec_res[dno]
    if score >= drop_score:
        # Print only the text and append it to a txt file
        # text_str = "%s, %.3f" % (text, score)
        with open('../ocr_text.txt', 'a') as f:
            text_str = "%s" % (text)
            f.write(text_str + "\n")
        print(text_str)
%cd /home/aistudio/PaddleOCR
/home/aistudio/PaddleOCR
# Find some images of famous English quotes
!wget https://quotefancy.com/media/wallpaper/3840x2160/50594-Francis-Bacon-Quote-Knowledge-is-power.jpg --no-check-certificate
!wget https://www.quotemaster.org/images/24/2423b4151b7283c4570e2967fbf022cf.jpg
!wget https://www.promptaconsultinggroup.com/wp-content/uploads/2018/10/Focus-on-Results.jpg
!wget https://quotefancy.com/media/wallpaper/1600x900/50583-Francis-Bacon-Quote-Knowledge-is-power.jpg --no-check-certificate
!wget https://quotefancy.com/media/wallpaper/3840x2160/2347129-William-Shakespeare-Quote-To-be-or-not-to-be-that-is-the-question.jpg --no-check-certificate
--2020-08-02 19:40:58-- https://www.promptaconsultinggroup.com/wp-content/uploads/2018/10/Focus-on-Results.jpg
Resolving www.promptaconsultinggroup.com (www.promptaconsultinggroup.com)... 67.43.226.3
Connecting to www.promptaconsultinggroup.com (www.promptaconsultinggroup.com)|67.43.226.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 883254 (863K) [image/jpeg]
Saving to: ‘Focus-on-Results.jpg’
Focus-on-Results.jp 100%[===================>] 862.55K 11.6KB/s in 72s
2020-08-02 19:42:14 (12.0 KB/s) - ‘Focus-on-Results.jpg’ saved [883254/883254]
!python tools/infer/predict_system.py \
--image_dir="50594-Francis-Bacon-Quote-Knowledge-is-power.jpg" \
--det_model_dir="./inference/ch_det_r50_vd_db/" \
--rec_model_dir="./inference/ch_rec_r34_vd_crnn_enhance/" \
--use_space_char=True
dt_boxes num : 6, elapse : 0.02082991600036621
rec_res num : 6, elapse : 0.019023895263671875
Predict time of 50594-Francis-Bacon-Quote-Knowledge-is-power.jpg: 0.097s
Knowledge
is
power
Francis
Bacon
quotefancy
The visualized image saved in ./inference_results/50594-Francis-Bacon-Quote-Knowledge-is-power.jpg
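The recognizer returns one text fragment per detected box, so the quote arrives as separate lines ("Knowledge", "is", "power"), and a line-by-line TTS pass would produce one clip per word. A minimal sketch of joining the fragments into a single sentence before writing the file is given below. It assumes rec_res is a list of (text, score) pairs as in the snippet above. The function name join_ocr_lines, the default threshold of 0.5 and the scores in the example are made up for illustration.

```python
def join_ocr_lines(rec_res, drop_score=0.5):
    """Join recognized fragments whose confidence passes the threshold."""
    kept = [text.strip() for text, score in rec_res
            if score >= drop_score and text.strip()]
    return " ".join(kept)


# Hypothetical recognition results in detection order
rec_res = [("Knowledge", 0.99), ("is", 0.98), ("power", 0.97), ("quotefancy", 0.30)]
print(join_ocr_lines(rec_res))
```

This relies on the boxes already being in reading order; for multi-column images the order would need to be checked first.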
Step 3: Text to speech
local_rank = dg.parallel.Env().local_rank
place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())
fluid.enable_dygraph(place)
with open(args.config) as f:
    cfg = yaml.load(f, Loader=yaml.Loader)
if not os.path.exists(args.output):
    os.mkdir(args.output)
writer = SummaryWriter(os.path.join(args.output, 'log'))
model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
# Load parameters.
global_step = io.load_parameters(
    model=model, checkpoint_path=args.checkpoint)
model.eval()
# Read the txt file line by line and synthesize speech for each line
for i, line in enumerate(open(args.text_input)):
    text_input = line
    text = np.asarray(text_to_sequence(text_input))
    text = np.expand_dims(text, axis=0)
    # Position indices start at 1, one per input symbol
    pos_text = np.arange(1, text.shape[1] + 1)
    pos_text = np.expand_dims(pos_text, axis=0)
    text = dg.to_variable(text).astype(np.int64)
    pos_text = dg.to_variable(pos_text).astype(np.int64)
    _, mel_output_postnet = model(text, pos_text, alpha=args.alpha)
    if args.vocoder == 'griffin-lim':
        # Synthesize with Griffin-Lim
        wav = synthesis_with_griffinlim(mel_output_postnet, cfg['audio'])
    elif args.vocoder == 'waveflow':
        wav = synthesis_with_waveflow(mel_output_postnet, args,
                                      args.checkpoint_vocoder, place)
    else:
        print(
            'vocoder error, we only support griffinlim and waveflow, but received %s.'
            % args.vocoder)
    writer.add_audio(text_input + '(' + args.vocoder + ')', wav, 0,
                     cfg['audio']['sr'])
    if not os.path.exists(os.path.join(args.output, 'samples')):
        os.mkdir(os.path.join(args.output, 'samples'))
    # One wav file per input line, numbered by line index
    write(
        os.path.join(
            os.path.join(args.output, 'samples'), args.vocoder + str(i) + '.wav'),
        cfg['audio']['sr'], wav)
print("Synthesis completed !!!")
writer.close()
%env CUDA_VISIBLE_DEVICES=0
env: CUDA_VISIBLE_DEVICES=0
%cd /home/aistudio/Parakeet/examples/fastspeech
/home/aistudio/Parakeet/examples/fastspeech
Read the HTML article with WaveFlow as the vocoder
!python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
--config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
--output='./synthesis' \
--vocoder='waveflow' \
--config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
--checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
--text_input='/home/aistudio/article3.txt'
{'alpha': 1.0,
'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
'checkpoint_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/step-2000000',
'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
'config_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml',
'output': './synthesis',
'text_input': '/home/aistudio/article3.txt',
'use_gpu': 1,
'vocoder': 'waveflow'}
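Verify the TTS output
The generated TTS audio is saved in the Parakeet/examples/fastspeech/synthesis/samples folder; pick a few clips to check the result.
import IPython.display
IPython.display.Audio('synthesis/samples/waveflow3.wav')
Merge the generated audio files with ffmpeg
Because the audio files were generated by scanning the text line by line, hearing a complete paragraph requires concatenating them in order. ffmpeg's concat demuxer reads a list file with one entry per line, in this form: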
file 'path/to/file2'
file 'path/to/file3'
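# Generate the list file: one entry per line of the input text, in order.
# Mode 'w' keeps a re-run from appending duplicate entries.
with open('/home/aistudio/article3.txt') as fr, open('waveflow_article3.txt', 'w') as f:
    for i, line in enumerate(fr):
        f.write('file synthesis/samples/waveflow' + str(i) + '.wav\n')
# Concatenate the audio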
!ffmpeg -f concat -i waveflow_article3.txt -c copy 'waveflow_article3.wav'
ffmpeg version 2.8.15-0ubuntu0.16.04.1 Copyright (c) 2000-2018 the FFmpeg developers
built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.10) 20160609
configuration: --prefix=/usr --extra-version=0ubuntu0.16.04.1 --build-suffix=-ffmpeg --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --cc=cc --cxx=g++ --enable-gpl --enable-shared --disable-stripping --disable-decoder=libopenjpeg --disable-decoder=libschroedinger --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librtmp --enable-libschroedinger --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxvid --enable-libzvbi --enable-openal --enable-opengl --enable-x11grab --enable-libdc1394 --enable-libiec61883 --enable-libzmq --enable-frei0r --enable-libx264 --enable-libopencv
libavutil 54. 31.100 / 54. 31.100
libavcodec 56. 60.100 / 56. 60.100
libavformat 56. 40.101 / 56. 40.101
libavdevice 56. 4.100 / 56. 4.100
libavfilter 5. 40.101 / 5. 40.101
libavresample 2. 1. 0 / 2. 1. 0
libswscale 3. 1.101 / 3. 1.101
libswresample 1. 2.101 / 1. 2.101
libpostproc 53. 3.100 / 53. 3.100
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, concat, from 'waveflow_article3.txt':
Duration: N/A, start: 0.000000, bitrate: 705 kb/s
Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 22050 Hz, 1 channels, flt, 705 kb/s
Output #0, wav, to 'waveflow_article3.wav':
Metadata:
ISFT : Lavf56.40.101
Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 22050 Hz, mono, 705 kb/s
Stream mapping:
Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
size= 16235kB time=00:03:08.49 bitrate= 705.6kbits/s
video:0kB audio:16235kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000686%
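The list file handed to ffmpeg's concat demuxer can also be produced by a small helper, sketched below under a few assumptions: the function names write_concat_list and numbered_wavs are illustrative, the clips are assumed to be named with a prefix plus the line index as in the synthesis loop, and a single quote inside a path is escaped as '\'' in the way the ffmpeg documentation shows for quoted entries.

```python
import os


def numbered_wavs(sample_dir, prefix, count):
    """Paths of the clips prefix0.wav ... prefix(count-1).wav, in order."""
    return [os.path.join(sample_dir, "%s%d.wav" % (prefix, i)) for i in range(count)]


def write_concat_list(list_path, wav_paths):
    """Write one "file '<path>'" entry per clip, overwriting any old list."""
    with open(list_path, "w", encoding="utf-8") as f:
        for path in wav_paths:
            # Close the quote, emit an escaped quote, and reopen the quote
            escaped = path.replace("'", "'\\''")
            f.write("file '%s'\n" % escaped)


if __name__ == "__main__":
    clips = numbered_wavs("synthesis/samples", "waveflow", 3)
    write_concat_list("waveflow_demo_list.txt", clips)
    with open("waveflow_demo_list.txt", encoding="utf-8") as f:
        print(f.read())
```

Opening the list in write mode means the helper can be called again after a new synthesis run without leaving stale entries behind. Using -c copy, as in the command above, only works when all clips share the same codec and sample rate, which holds here because they come from the same vocoder.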
Read the HTML article with the Griffin-Lim algorithm as the vocoder
!python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
--config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
--output='./synthesis' \
--text_input='/home/aistudio/article3.txt'
{'alpha': 1.0,
'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
'checkpoint_vocoder': None,
'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
'config_vocoder': None,
'output': './synthesis',
'text_input': '/home/aistudio/article3.txt',
'use_gpu': 1,
'vocoder': 'griffin-lim'}
Verify the TTS output
IPython.display.Audio('synthesis/samples/griffin-lim3.wav')
Merge the generated audio files with ffmpeg
# Generate the list file: one entry per line of the input text, in order
with open('/home/aistudio/article3.txt') as fr, open('griffin-lim_article3.txt', 'w') as f:
    for i, line in enumerate(fr):
        f.write('file synthesis/samples/griffin-lim' + str(i) + '.wav\n')
# Concatenate the audio
!ffmpeg -f concat -i griffin-lim_article3.txt -c copy 'griffin-lim_article3.wav'
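Paper abstract and OCR text: TTS results
The TTS process for abstract.txt and ocr_text.txt is exactly the same as for article3.txt above. The only difference is that the audio file synthesized from the OCR text is small enough to be played directly in the Notebook.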
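Since the three runs differ only in the --text_input value (and optionally the vocoder), the command line can be assembled by one helper. The sketch below reuses the flag values shown in this article. The function name build_synthesis_command is illustrative, and the script name synthesis.py is an assumption, because the article does not show the first line of the command.

```python
def build_synthesis_command(text_input, vocoder="waveflow", script="synthesis.py"):
    """Return the argument list for one synthesis run (script name is assumed)."""
    cmd = [
        "python", script,
        "--use_gpu=1",
        "--alpha=1.0",
        "--checkpoint=./fastspeech_ljspeech_ckpt_1.0/step-162000",
        "--config=./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml",
        "--output=./synthesis",
        "--vocoder=" + vocoder,
    ]
    if vocoder == "waveflow":
        # The WaveFlow vocoder needs its own config and checkpoint
        cmd += [
            "--config_vocoder=./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml",
            "--checkpoint_vocoder=./waveflow_res128_ljspeech_ckpt_1.0/step-2000000",
        ]
    cmd.append("--text_input=" + text_input)
    return cmd


for path in ("/home/aistudio/abstract.txt", "/home/aistudio/ocr_text.txt"):
    print(" ".join(build_synthesis_command(path)))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the examples/fastspeech directory. Note that each run writes clips named after the vocoder and line index, so a later run overwrites the earlier clips unless they are concatenated or moved first.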
!python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
--config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
--output='./synthesis' \
--vocoder='waveflow' \
--config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
--checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
--text_input='/home/aistudio/abstract.txt'
# Generate the list file: one entry per line of the input text, in order
with open('/home/aistudio/abstract.txt') as fr, open('waveflow_abstract.txt', 'w') as f:
    for i, line in enumerate(fr):
        f.write('file synthesis/samples/waveflow' + str(i) + '.wav\n')
# Concatenate the audio
!ffmpeg -f concat -i waveflow_abstract.txt -c copy 'waveflow_abstract.wav'
!python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
--config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
--output='./synthesis' \
--vocoder='waveflow' \
--config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
--checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
--text_input='/home/aistudio/ocr_text.txt'
{'alpha': 1.0,
'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
'checkpoint_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/step-2000000',
'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
'config_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml',
'output': './synthesis',
'text_input': '/home/aistudio/ocr_text.txt',
'use_gpu': 1,
'vocoder': 'waveflow'}
[checkpoint] Rank 0: loaded model from ./fastspeech_ljspeech_ckpt_1.0/step-162000.pdparams
[checkpoint] Rank 0: loaded model from ./waveflow_res128_ljspeech_ckpt_1.0/step-2000000.pdparams
Synthesis completed !!!
!mv synthesis/samples/waveflow0.wav ./ocr.wav
import IPython
IPython.display.Audio('ocr.wav')
Summary:
How can the TTS quality be improved further?
More resources
To learn more about PaddlePaddle, see the following resources.
Gitee:
https://gitee.com/paddlepaddle/PaddleOCR