Research Update | A Digest of Highlights from TACL Vol. 10 (2022), a Leading Journal in Natural Language Processing
Transactions of the Association for Computational Linguistics
Vol 10 (2022)
Word Acquisition in Neural Language Models
Tyler A. Chang, Benjamin K. Bergen

Abstract: We investigate how neural language models acquire individual words during training, extracting learning curves and ages of acquisition for over 600 words on the MacArthur-Bates Communicative Development Inventory (Fenson et al., 2007). Drawing on studies of word acquisition in children, we evaluate multiple predictors for words' ages of acquisition in LSTMs, BERT, and GPT-2. We find that the effects of concreteness, word length, and lexical class are pointedly different in children and language models, reinforcing the importance of interaction and sensorimotor experience in child language acquisition. Language models rely far more on word frequency than children, but like children, they exhibit slower learning of words in longer utterances. Interestingly, models follow consistent patterns during training for both unidirectional and bidirectional models, and for both LSTM and Transformer architectures. Models predict based on unigram token frequencies early in training, before transitioning loosely to bigram probabilities, eventually converging on more nuanced predictions. These results shed light on the role of distributional learning mechanisms in children, while also providing insights for more human-like language acquisition in language models.
[Mind map]
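To make the learning-curve analysis concrete, here is a minimal Python sketch of one way to extract an "age of acquisition" from a word's surprisal across training checkpoints. The sigmoid fit and the midpoint convention are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch: estimate a word's "age of acquisition" from its
# learning curve (mean surprisal at successive training checkpoints).
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(step, lower, upper, midpoint, slope):
    """Decreasing sigmoid from `upper` (chance level) down to `lower` (converged)."""
    return lower + (upper - lower) / (1.0 + np.exp(slope * (step - midpoint)))

def age_of_acquisition(steps, surprisals):
    """Fit a sigmoid to the curve and return the step at which surprisal
    crosses halfway between chance-level and converged performance."""
    p0 = [min(surprisals), max(surprisals), np.median(steps), 1e-3]
    params, _ = curve_fit(sigmoid, steps, surprisals, p0=p0, maxfev=10_000)
    return params[2]  # midpoint of the fitted sigmoid

# Example with synthetic checkpoints (illustrative values only):
steps = np.linspace(0, 100_000, 50)
surprisals = sigmoid(steps, 4.0, 14.0, 30_000, 2e-4) + np.random.normal(0, 0.1, 50)
print(f"Estimated age of acquisition: step {age_of_acquisition(steps, surprisals):.0f}")
```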
Decomposing and Recomposing Event Structure
William Andrew Horsley Gantt, Lelia Glass, Aaron Steven White

Abstract: We present an event structure classification empirically derived from inferential properties annotated on sentence- and document-level Universal Decompositional Semantics (UDS) graphs. We induce this classification jointly with semantic role, entity, and event-event relation classifications using a document-level generative model structured by these graphs. To support this induction, we augment existing annotations found in the UDS1.0 dataset, which covers the entirety of the English Web Treebank, with an array of inferential properties capturing fine-grained aspects of the temporal and aspectual structure of events. The resulting dataset (available at decomp.io) is the largest annotation of event structure and (partial) event coreference to date.
[Mind map]
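For readers who want to inspect the underlying annotations, UDS data is distributed through the decomp toolkit at decomp.io. The snippet below is a hedged sketch of browsing a graph's predicate annotations; the property key "genericity" is an illustrative example, and the exact attribute names should be checked against the package documentation.

```python
# Hedged sketch: loading and browsing UDS graphs with the `decomp`
# package from decomp.io. Property keys such as "genericity" are
# illustrative; consult the toolkit docs for the full inventory.
from decomp import UDSCorpus

uds = UDSCorpus(split="dev")           # UDS graphs over the English Web Treebank
graph = next(iter(uds.values()))       # one sentence-level graph
print(graph.sentence)
for node_id, attrs in graph.predicate_nodes.items():
    print(node_id, attrs.get("genericity", {}))
```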
FeTaQA: Free-form Table Question Answering
Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Lin, Neha Verma, Rui Zhang, Wojciech Kryściński, Hailey Schoelkopf, Riley Kong, Xiangru Tang, Mutethia Mutuma, Benjamin Rosand, Isabel Trindade, Renusree Bandaru, Jacob Cunningham, Caiming Xiong, Dragomir Radev

Abstract: Existing table question answering datasets contain abundant factual questions that primarily evaluate a QA system's comprehension of query and tabular data. However, restricted by their short-form answers, these datasets fail to include question-answer interactions that represent more advanced and naturally occurring information needs: questions that ask for reasoning and integration of information pieces retrieved from a structured knowledge source. To complement the existing datasets and to reveal the challenging nature of the table-based question answering task, we introduce FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} pairs. FeTaQA is collected from noteworthy descriptions of Wikipedia tables which contain information people tend to seek; generation of these descriptions requires advanced processing that humans perform on a daily basis: understand the question and table, retrieve, integrate, infer, and conduct text planning and surface realization to generate an answer. We provide two benchmark methods for the proposed task: a pipeline method based on semantic parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods.
[Mind map]
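The end-to-end baseline can be pictured as follows: linearize the table together with the question into one string, and let a pretrained seq2seq model generate the free-form answer. This is a minimal sketch assuming T5 and a home-made linearization convention, not the paper's exact setup.

```python
# Hedged sketch of an end-to-end table QA baseline: flatten the table
# and question into text and decode a free-form answer with T5
# (an illustrative model choice, not necessarily the paper's).
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def linearize(table, question):
    # One simple linearization convention (an assumption, not FeTaQA's spec).
    header = " | ".join(table["header"])
    rows = " ; ".join(" | ".join(row) for row in table["rows"])
    return f"question: {question} table: {header} ; {rows}"

table = {"header": ["Year", "Title"], "rows": [["2020", "Paper A"], ["2021", "Paper B"]]}
inputs = tokenizer(linearize(table, "What was published in 2021?"), return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```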
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Auguste Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Fred Ọ̀nọ̀mẹ̀ Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure Femi Pancrace Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi

Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
[Mind map]
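In the spirit of the automatic analyses that supplement the human audit, the sketch below applies two crude usability heuristics (minimum length, share of alphabetic characters) to a sampled corpus. The thresholds and heuristics are illustrative assumptions, not the authors' pipeline, and a real audit would add language identification.

```python
# Hedged sketch of a simple automatic corpus audit: estimate the share
# of sentences that look like natural text rather than markup, digits,
# or fragments. Thresholds are illustrative assumptions.
def audit_sample(sentences, min_ok_ratio=0.5):
    """Return the usable-sentence ratio and whether the sample passes."""
    def looks_usable(s):
        s = s.strip()
        if len(s) < 5:                      # reject fragments
            return False
        letters = sum(ch.isalpha() for ch in s)
        return letters / len(s) > 0.5       # mostly letters, not markup/digits

    ok = sum(looks_usable(s) for s in sentences)
    ratio = ok / max(len(sentences), 1)
    return ratio, ratio >= min_ok_ratio

sample = ["Buy cheap pills!!!", "Das ist ein Satz.", "???", "12345 67890"]
ratio, passed = audit_sample(sample)
print(f"usable ratio = {ratio:.2f}, passes 50% threshold: {passed}")
```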
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting

Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences -- without explicit tokenization or vocabulary -- and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBERT model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.
[Mind map]
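The core architectural idea, embedding raw codepoints and downsampling the sequence before the deep Transformer stack, can be sketched in a few lines of PyTorch. This toy module only mirrors the shape of the computation; the released Canine model differs in many details (multiple n-gram hash embeddings, upsampling back to character resolution, and so on).

```python
# Hedged sketch of the core CANINE idea (not the released implementation):
# embed hashed Unicode codepoints, downsample with a strided convolution,
# and run the expensive Transformer layers on the 4x shorter sequence.
import torch
import torch.nn as nn

class TinyCanine(nn.Module):
    def __init__(self, d_model=256, n_layers=4, downsample=4, n_hash_buckets=16384):
        super().__init__()
        self.n_hash_buckets = n_hash_buckets
        self.embed = nn.Embedding(n_hash_buckets, d_model)   # hashed codepoints
        self.down = nn.Conv1d(d_model, d_model, kernel_size=downsample, stride=downsample)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text):
        # Map each character to a bucket id; no tokenizer, no vocabulary file.
        codepoints = torch.tensor([[ord(c) % self.n_hash_buckets for c in text]])
        x = self.embed(codepoints)                         # (1, chars, d_model)
        x = self.down(x.transpose(1, 2)).transpose(1, 2)   # (1, chars/4, d_model)
        return self.encoder(x)

model = TinyCanine()
print(model("tokenization-free encoding").shape)  # torch.Size([1, 6, 256])
```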
Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition
Mor Geva, Tomer Wolfson, Jonathan Berant

Abstract: Recent efforts to create challenge benchmarks that test the abilities of natural language understanding models have largely depended on human annotations. In this work, we introduce the "Break, Perturb, Build" (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs. BPB represents a question by decomposing it into the reasoning steps that are required to answer it, symbolically perturbs the decomposition, and then generates new question-answer pairs. We demonstrate the effectiveness of BPB by creating evaluation sets for three reading comprehension (RC) benchmarks, generating thousands of high-quality examples without human intervention. We evaluate a range of RC models on our evaluation sets, which reveals large performance gaps on generated examples compared to the original data. Moreover, symbolic perturbations enable fine-grained analysis of the strengths and limitations of models. Last, augmenting the training data with examples generated by BPB helps close the performance gaps, without any drop on the original data distribution.
[Mind map]
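A toy illustration of the Break, Perturb, Build control flow: represent a question as explicit reasoning steps, apply a symbolic perturbation (here, flipping an aggregation from max to min), and recompute the answer. The real framework perturbs QDMR decompositions; this sketch only mirrors the idea, and all names in it are invented for the example.

```python
# Hedged toy example of BPB-style perturbation: flip one symbolic step
# in a decomposed question and derive the new gold answer automatically.
PERTURB = {"max": "min", "min": "max"}

def answer(steps, table):
    """Execute a two-step decomposition: filter by year, then aggregate."""
    rows = [r for r in table if r["year"] >= steps["filter_year"]]
    agg = max if steps["agg"] == "max" else min
    return agg(rows, key=lambda r: r["score"])["name"]

table = [{"name": "A", "year": 2019, "score": 3},
         {"name": "B", "year": 2021, "score": 9}]
steps = {"filter_year": 2018, "agg": "max"}          # "Who scored highest since 2018?"
perturbed = {**steps, "agg": PERTURB[steps["agg"]]}  # ...scored *lowest*...
print(answer(steps, table), "->", answer(perturbed, table))  # B -> A
```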
Out-of-Domain Discourse Dependency Parsing via Bootstrapping: An Empirical Analysis on Its Effectiveness and Limitation
Noriki Nishida, Yuji Matsumoto

Abstract: Discourse parsing has been studied for decades. However, it still remains challenging to utilize discourse parsing for real-world applications because the parsing accuracy degrades significantly on out-of-domain text. In this paper, we report and discuss the effectiveness and limitations of bootstrapping methods for adapting modern BERT-based discourse dependency parsers to out-of-domain text without relying on additional human supervision. Specifically, we investigate self-training, co-training, tri-training, and asymmetric tri-training of graph-based and transition-based discourse dependency parsing models, as well as confidence measures and sample selection criteria in two adaptation scenarios: monologue adaptation between scientific disciplines and dialogue genre adaptation. We also release COVID-19 Discourse Dependency Treebank (COVID19-DTB), a new manually annotated resource for discourse dependency parsing of biomedical paper abstracts. The experimental results show that bootstrapping is significantly and consistently effective for unsupervised domain adaptation of discourse dependency parsing, but the low coverage of accurately predicted pseudo labels is a bottleneck for further improvement. We show that active learning can mitigate this limitation.
[Mind map]
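The self-training variant of bootstrapping reduces to a short loop, sketched below with a hypothetical Parser interface (fit and parse_with_confidence are placeholder names, not a real library's API): train on labeled source-domain data, pseudo-label target-domain documents, keep only the confident parses, and retrain.

```python
# Hedged skeleton of self-training for domain adaptation. `parser` is any
# object exposing the hypothetical fit() and parse_with_confidence() methods.
def self_train(parser, labeled, unlabeled, rounds=3, threshold=0.9):
    parser.fit(labeled)                        # supervised warm start
    for _ in range(rounds):
        pseudo = []
        for doc in unlabeled:
            tree, confidence = parser.parse_with_confidence(doc)
            if confidence >= threshold:        # sample selection criterion
                pseudo.append((doc, tree))
        parser.fit(labeled + pseudo)           # retrain on the union
    return parser
```

The confidence threshold is exactly where the paper's bottleneck shows up: raising it keeps pseudo-labels accurate but covers few target-domain structures, which is the coverage limitation the authors report.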
More information: https://transacl.org/index.php/tacl
-END-
Reprinted from: Transactions of the Association for Computational Linguistics
Reprint editor: Amelia
Disclaimer: This account reprints this article to share industry news and insights; copyright belongs to the original authors. If this infringes your legitimate rights, please contact young@lingotek.cn and we will address it promptly. Thank you for your support!