Research Update | A Digest of Highlights from TACL Vol. 10 (2022), a Leading Journal in Natural Language Processing
Transactions of the Association for Computational Linguistics
Vol 10 (2022)
Word Acquisition in Neural Language Models
Tyler A. Chang, Benjamin K. Bergen

Abstract: We investigate how neural language models acquire individual words during training, extracting learning curves and ages of acquisition for over 600 words on the MacArthur-Bates Communicative Development Inventory (Fenson et al., 2007). Drawing on studies of word acquisition in children, we evaluate multiple predictors for words' ages of acquisition in LSTMs, BERT, and GPT-2. We find that the effects of concreteness, word length, and lexical class are pointedly different in children and language models, reinforcing the importance of interaction and sensorimotor experience in child language acquisition. Language models rely far more on word frequency than children, but like children, they exhibit slower learning of words in longer utterances. Interestingly, models follow consistent patterns during training for both unidirectional and bidirectional models, and for both LSTM and Transformer architectures. Models predict based on unigram token frequencies early in training, before transitioning loosely to bigram probabilities, eventually converging on more nuanced predictions. These results shed light on the role of distributional learning mechanisms in children, while also providing insights for more human-like language acquisition in language models.
[Mind map]
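To make the learning-curve analysis concrete, here is a minimal Python sketch of one way to extract an "age of acquisition" from a word's surprisal across training checkpoints. The sigmoid fit and the midpoint convention are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch: estimate a word's "age of acquisition" from its
# learning curve (mean surprisal at successive training checkpoints).
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(step, lower, upper, midpoint, slope):
    """Decreasing sigmoid from `upper` (chance level) down to `lower` (converged)."""
    return lower + (upper - lower) / (1.0 + np.exp(slope * (step - midpoint)))

def age_of_acquisition(steps, surprisals):
    """Fit a sigmoid to the curve and return the step at which surprisal
    crosses halfway between chance-level and converged performance."""
    p0 = [min(surprisals), max(surprisals), np.median(steps), 1e-3]
    params, _ = curve_fit(sigmoid, steps, surprisals, p0=p0, maxfev=10_000)
    return params[2]  # midpoint of the fitted sigmoid

# Example with synthetic checkpoints (illustrative values only):
steps = np.linspace(0, 100_000, 50)
surprisals = sigmoid(steps, 4.0, 14.0, 30_000, 2e-4) + np.random.normal(0, 0.1, 50)
print(f"Estimated age of acquisition: step {age_of_acquisition(steps, surprisals):.0f}")
```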
Decomposing and Recomposing Event Structure
William Andrew Horsley Gantt, Lelia Glass, Aaron Steven White

Abstract: We present an event structure classification empirically derived from inferential properties annotated on sentence- and document-level Universal Decompositional Semantics (UDS) graphs. We induce this classification jointly with semantic role, entity, and event-event relation classifications using a document-level generative model structured by these graphs. To support this induction, we augment existing annotations found in the UDS1.0 dataset, which covers the entirety of the English Web Treebank, with an array of inferential properties capturing fine-grained aspects of the temporal and aspectual structure of events. The resulting dataset (available at decomp.io) is the largest annotation of event structure and (partial) event coreference to date.
[Mind map]
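For readers who want to inspect the underlying annotations, UDS data is distributed through the decomp toolkit at decomp.io. The snippet below is a hedged sketch of browsing a graph's predicate annotations; the property key "genericity" is an illustrative example, and the exact attribute names should be checked against the package documentation.

```python
# Hedged sketch: loading and browsing UDS graphs with the `decomp`
# package from decomp.io. Property keys such as "genericity" are
# illustrative; consult the toolkit docs for the full inventory.
from decomp import UDSCorpus

uds = UDSCorpus(split="dev")           # UDS graphs over the English Web Treebank
graph = next(iter(uds.values()))       # one sentence-level graph
print(graph.sentence)
for node_id, attrs in graph.predicate_nodes.items():
    print(node_id, attrs.get("genericity", {}))
```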
FeTaQA: Free-form Table Question Answering
Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Lin, Neha Verma, Rui Zhang, Wojciech Kryściński, Hailey Schoelkopf, Riley Kong, Xiangru Tang, Mutethia Mutuma, Benjamin Rosand, Isabel Trindade, Renusree Bandaru, Jacob Cunningham, Caiming Xiong, Dragomir Radev

Abstract: Existing table question answering datasets contain abundant factual questions that primarily evaluate a QA system's comprehension of query and tabular data. However, restricted by their short-form answers, these datasets fail to include question-answer interactions that represent more advanced and naturally occurring information needs: questions that ask for reasoning and integration of information pieces retrieved from a structured knowledge source. To complement the existing datasets and to reveal the challenging nature of the table-based question answering task, we introduce FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} pairs. FeTaQA is collected from noteworthy descriptions of Wikipedia tables which contain information people tend to seek; generation of these descriptions requires advanced processing that humans perform on a daily basis: understand the question and table, retrieve, integrate, infer, and conduct text planning and surface realization to generate an answer. We provide two benchmark methods for the proposed task: a pipeline method based on semantic parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods.
[Mind map]
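The end-to-end baseline can be pictured as follows: linearize the table together with the question into one string, and let a pretrained seq2seq model generate the free-form answer. This is a minimal sketch assuming T5 and a home-made linearization convention, not the paper's exact setup.

```python
# Hedged sketch of an end-to-end table QA baseline: flatten the table
# and question into text and decode a free-form answer with T5
# (an illustrative model choice, not necessarily the paper's).
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def linearize(table, question):
    # One simple linearization convention (an assumption, not FeTaQA's spec).
    header = " | ".join(table["header"])
    rows = " ; ".join(" | ".join(row) for row in table["rows"])
    return f"question: {question} table: {header} ; {rows}"

table = {"header": ["Year", "Title"], "rows": [["2020", "Paper A"], ["2021", "Paper B"]]}
inputs = tokenizer(linearize(table, "What was published in 2021?"), return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```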
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Auguste Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Fred Ọ̀nọ̀mẹ̀ Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure Femi Pancrace Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi

Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
[Mind map]
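In the spirit of the automatic analyses that supplement the human audit, the sketch below applies two crude usability heuristics (minimum length, share of alphabetic characters) to a sampled corpus. The thresholds and heuristics are illustrative assumptions, not the authors' pipeline, and a real audit would add language identification.

```python
# Hedged sketch of a simple automatic corpus audit: estimate the share
# of sentences that look like natural text rather than markup, digits,
# or fragments. Thresholds are illustrative assumptions.
def audit_sample(sentences, min_ok_ratio=0.5):
    """Return the usable-sentence ratio and whether the sample passes."""
    def looks_usable(s):
        s = s.strip()
        if len(s) < 5:                      # reject fragments
            return False
        letters = sum(ch.isalpha() for ch in s)
        return letters / len(s) > 0.5       # mostly letters, not markup/digits

    ok = sum(looks_usable(s) for s in sentences)
    ratio = ok / max(len(sentences), 1)
    return ratio, ratio >= min_ok_ratio

sample = ["Buy cheap pills!!!", "Das ist ein Satz.", "???", "12345 67890"]
ratio, passed = audit_sample(sample)
print(f"usable ratio = {ratio:.2f}, passes 50% threshold: {passed}")
```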
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting

Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences -- without explicit tokenization or vocabulary -- and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBERT model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.
[Mind map]
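The core architectural idea, embedding raw codepoints and downsampling the sequence before the deep Transformer stack, can be sketched in a few lines of PyTorch. This toy module only mirrors the shape of the computation; the released Canine model differs in many details (multiple n-gram hash embeddings, upsampling back to character resolution, and so on).

```python
# Hedged sketch of the core CANINE idea (not the released implementation):
# embed hashed Unicode codepoints, downsample with a strided convolution,
# and run the expensive Transformer layers on the 4x shorter sequence.
import torch
import torch.nn as nn

class TinyCanine(nn.Module):
    def __init__(self, d_model=256, n_layers=4, downsample=4, n_hash_buckets=16384):
        super().__init__()
        self.n_hash_buckets = n_hash_buckets
        self.embed = nn.Embedding(n_hash_buckets, d_model)   # hashed codepoints
        self.down = nn.Conv1d(d_model, d_model, kernel_size=downsample, stride=downsample)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text):
        # Map each character to a bucket id; no tokenizer, no vocabulary file.
        codepoints = torch.tensor([[ord(c) % self.n_hash_buckets for c in text]])
        x = self.embed(codepoints)                         # (1, chars, d_model)
        x = self.down(x.transpose(1, 2)).transpose(1, 2)   # (1, chars/4, d_model)
        return self.encoder(x)

model = TinyCanine()
print(model("tokenization-free encoding").shape)  # torch.Size([1, 6, 256])
```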
Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition
Mor Geva, Tomer Wolfson, Jonathan Berant

Abstract: Recent efforts to create challenge benchmarks that test the abilities of natural language understanding models have largely depended on human annotations. In this work, we introduce the "Break, Perturb, Build" (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs. BPB represents a question by decomposing it into the reasoning steps that are required to answer it, symbolically perturbs the decomposition, and then generates new question-answer pairs. We demonstrate the effectiveness of BPB by creating evaluation sets for three reading comprehension (RC) benchmarks, generating thousands of high-quality examples without human intervention. We evaluate a range of RC models on our evaluation sets, which reveals large performance gaps on generated examples compared to the original data. Moreover, symbolic perturbations enable fine-grained analysis of the strengths and limitations of models. Last, augmenting the training data with examples generated by BPB helps close the performance gaps, without any drop on the original data distribution.
[Mind map]
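A toy illustration of the Break, Perturb, Build control flow: represent a question as explicit reasoning steps, apply a symbolic perturbation (here, flipping an aggregation from max to min), and recompute the answer. The real framework perturbs QDMR decompositions; this sketch only mirrors the idea, and all names in it are invented for the example.

```python
# Hedged toy example of BPB-style perturbation: flip one symbolic step
# in a decomposed question and derive the new gold answer automatically.
PERTURB = {"max": "min", "min": "max"}

def answer(steps, table):
    """Execute a two-step decomposition: filter by year, then aggregate."""
    rows = [r for r in table if r["year"] >= steps["filter_year"]]
    agg = max if steps["agg"] == "max" else min
    return agg(rows, key=lambda r: r["score"])["name"]

table = [{"name": "A", "year": 2019, "score": 3},
         {"name": "B", "year": 2021, "score": 9}]
steps = {"filter_year": 2018, "agg": "max"}          # "Who scored highest since 2018?"
perturbed = {**steps, "agg": PERTURB[steps["agg"]]}  # ...scored *lowest*...
print(answer(steps, table), "->", answer(perturbed, table))  # B -> A
```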
Out-of-Domain Discourse Dependency Parsing via Bootstrapping: An Empirical Analysis on Its Effectiveness and Limitation
Noriki Nishida, Yuji Matsumoto

Abstract: Discourse parsing has been studied for decades. However, it still remains challenging to utilize discourse parsing for real-world applications because the parsing accuracy degrades significantly on out-of-domain text. In this paper, we report and discuss the effectiveness and limitations of bootstrapping methods for adapting modern BERT-based discourse dependency parsers to out-of-domain text without relying on additional human supervision. Specifically, we investigate self-training, co-training, tri-training, and asymmetric tri-training of graph-based and transition-based discourse dependency parsing models, as well as confidence measures and sample selection criteria in two adaptation scenarios: monologue adaptation between scientific disciplines and dialogue genre adaptation. We also release COVID-19 Discourse Dependency Treebank (COVID19-DTB), a new manually annotated resource for discourse dependency parsing of biomedical paper abstracts. The experimental results show that bootstrapping is significantly and consistently effective for unsupervised domain adaptation of discourse dependency parsing, but the low coverage of accurately predicted pseudo labels is a bottleneck for further improvement. We show that active learning can mitigate this limitation.
[Mind map]
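The self-training variant of bootstrapping reduces to a short loop, sketched below with a hypothetical Parser interface (fit and parse_with_confidence are placeholder names, not a real library's API): train on labeled source-domain data, pseudo-label target-domain documents, keep only the confident parses, and retrain.

```python
# Hedged skeleton of self-training for domain adaptation. `parser` is any
# object exposing the hypothetical fit() and parse_with_confidence() methods.
def self_train(parser, labeled, unlabeled, rounds=3, threshold=0.9):
    parser.fit(labeled)                        # supervised warm start
    for _ in range(rounds):
        pseudo = []
        for doc in unlabeled:
            tree, confidence = parser.parse_with_confidence(doc)
            if confidence >= threshold:        # sample selection criterion
                pseudo.append((doc, tree))
        parser.fit(labeled + pseudo)           # retrain on the union
    return parser
```

The confidence threshold is exactly where the paper's bottleneck shows up: raising it keeps pseudo-labels accurate but covers few target-domain structures, which is the coverage limitation the authors report.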
More information: https://transacl.org/index.php/tacl
-END-
Reprinted from: Transactions of the Association for Computational Linguistics
Reprint editor: Amelia
Disclaimer: This account reprints this article to share industry news and insights; copyright belongs to the original authors. If this infringes your legitimate rights, please contact young@lingotek.cn and we will address it promptly. Thank you for your support!