ACL 2023 Deadline Survival Kit: Ready-to-Use Experiments
© Author | Peiyu Liu
Affiliation | Gaoling School of Artificial Intelligence, Renmin University of China
Research interests | natural language processing, model compression
This article compiles common dataset descriptions and baseline methods from ACL 2022 papers, covering 20 papers across 14 subfields. It is also published on the AI Box Zhihu column (search "AI Box专栏" on Zhihu); feel free to leave comments under the column article for discussion!
This compilation has three main features:
Ready to use, as a writing aid. A paper almost always includes a dataset description section, and for widely used datasets there is a relatively "standard" way to describe them. This article collects such descriptions from top-conference papers, which helps us learn and accumulate well-established phrasing.
A comprehensive collection of datasets and baselines. The article covers 20 papers across 14 subfields. While we cannot guarantee that nothing is missed, it covers the vast majority of mainstream datasets. It gives readers an overview of current NLP tasks, and for an area of interest it lets them quickly locate the baselines, evaluation datasets, and metrics, making it easier to reproduce and improve on related papers.
An aid for planning experiments. There are far too many public datasets to run experiments on all of them when demonstrating a method's effectiveness. The material compiled here helps readers choose experimental tasks and datasets in a targeted way, neither drifting off target nor leaving gaps.
We previously shared the ready-to-use Abstract and Related Work edition, whose goal was to help readers quickly form an accurate picture of a field from existing papers; if interested, see:
"ACL 2022 Deadline Survival Kit: Ready-to-Use Abstract and Related Work". Many articles compile datasets, but they mainly serve as dataset "resource pools", i.e., collections of official links and introductions. This article sticks to the original goal of the "deadline survival kit" series: supporting research and writing. Good luck with your ACL 2023 submissions!
Pre-trained Language Models
[1]On the Sensitivity and Stability of Model Interpretations in NLP
Keywords: interpretability
Tasks and datasets:
text classification: SST-2, Yelp, AGNews
SST-2 and Yelp are sentiment classification tasks in which the model predicts whether a review is negative (0) or positive (1); AGNews here requires discriminating between world (0) and business (1) news articles (a minimal dataset-loading sketch is given after this entry).
Baselines:
VaGrad, GradInp (gradient-based)
IngGrad, DeepLIFT (reference-based)
Occlusion, LIME (perturbation-based)
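For readers who want to start from the data, here is a minimal, hedged sketch of loading these three classification datasets with the Hugging Face `datasets` library; the hub names (`glue`/`sst2`, `yelp_polarity`, `ag_news`) are the commonly hosted public versions and are assumptions here, not necessarily the exact copies or splits used in the paper:

```python
# Minimal sketch: loading SST-2, Yelp, and AGNews with Hugging Face `datasets`.
# Hub names are the commonly used public versions, not necessarily the paper's exact splits.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")       # binary sentiment: 0 = negative, 1 = positive
yelp = load_dataset("yelp_polarity")      # binary sentiment over Yelp reviews
agnews = load_dataset("ag_news")          # 4-way topics; the paper uses a world-vs-business subset

print(sst2["train"][0])                          # {'sentence': ..., 'label': ..., 'idx': ...}
print(agnews["train"].features["label"].names)   # ['World', 'Sports', 'Business', 'Sci/Tech']
```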
[2]Composable Sparse Fine-Tuning for Cross-Lingual Transfer
Keywords: cross-lingual fine-tuning
Tasks and datasets:
part-of-speech tagging (POS), dependency parsing (DP): Universal Dependencies 2.7
named entity recognition (NER): MasakhaNER
natural language inference (NLI): AmericasNLI
Baselines:
MAD-X (adapter-based framework)
BITFIT
[3]Compression of Generative Pre-trained Language Models via Quantization
Keywords: model compression
Tasks and datasets:
Language Modeling: WikiText2, Penn Treebank (PTB), WikiText103
The task of language modeling is to predict a probability distribution over a sequence of words; it is standardly evaluated with perplexity (the formula is recalled after this entry).
Next Utterance Prediction: Persona-Chat
The task of next utterance prediction predicts the next utterance given the dialogue context. It tests the language understanding ability of generative models.
Abstractive Summarization: XSum
Abstractive summarization aims at generating a terse summary that captures the main ideas of the source article.
Baselines:
PACT, LSQ, LAQ
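As background (standard practice rather than a detail quoted from the paper), language models on WikiText2, PTB, and WikiText103 are compared by perplexity, the exponentiated average negative log-likelihood of the token sequence:

$$
p(w_1,\dots,w_N)=\prod_{i=1}^{N} p(w_i \mid w_{<i}), \qquad
\mathrm{PPL}=\exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\right)
$$

Lower perplexity is better; for compression work the interesting quantity is how little the perplexity degrades relative to the full-precision model.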
[4]AdapLeR: Speeding up Inference by Adaptive Length Reduction
Keywords: inference acceleration
Tasks and datasets:
sentiment: SST-2, IMDB
paraphrase: MRPC
topic classification: AG’s News
knowledge extraction: DBpedia
NLI: MNLI
question answering: QNLI
hate speech: HateXplain
Baselines:
BERT-base (the backbone)
DistilBERT (static compression method)
PoWER-BERT, TR-BERT (length reduction methods)
[5]ABC: Attention with Bounded-memory Control
Keywords: inference acceleration
Tasks and datasets:
Language Modeling: WikiText-103
Machine Translation: WMT14 EN-DE (sentence-level translation), IWSLT14 ES-EN (document-level translation)
Masked Language Model Fine-tuning: BookCorpus, English Wikipedia, OpenWebText, and RealNews (pre-training); GLUE (fine-tuning)
Baselines:
Linformer
[6]PERFECT: Prompt-free and Efficient Few-shot Learning with Language Models
Keywords: prompting
Tasks and datasets:
sentiment analysis datasets: SST-2, SST-5, MR, CR
subjectivity classification: SUBJ
question classification: TREC
natural language inference: CB, RTE
question answering: QNLI
word sense disambiguation: WiC
paraphrase detection: MRPC, QQP
Baselines:
PET
Standard fine-tuning
[7]A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models
Keywords: prompting, multimodality
Tasks and datasets:
visual question answering: VQAv2, OKVQA, GQA
image captioning: NoCaps, Flickr30k
categorical learning: miniImageNet
Baselines:
Frozen, PICa, SimVLM, Unified VLP (zero/few-shot vision-language learners)
UNITER_large, Oscar, SimVLM, VinVL, Unified VLP (fully fine-tuned models)
VL-T5_no-vqa (pre-trained without visual question answering dataset)
Frozen and AFHN (miniImageNet)
Representation Learning
[8]A Contrastive Framework for Learning Sentence Representations from Pairwise and Triple-wise Perspective in Angular Space
Keywords: contrastive learning
Tasks and datasets:
unsupervised semantic textual similarity: STS tasks 2012-2016, STS Benchmark, SICK-Relatedness (a minimal Spearman-correlation evaluation sketch follows this entry)
SentEval transfer tasks: MR, CR, SUBJ, MPQA, SST-2, TREC, MRPC
Baselines:
GloVe embeddings, Skip-thought, average BERT embeddings from the last layer, BERT-Flow, BERT-Whitening (representative methods)
ISBERT, CT-BERT, ConSERT, SimCSE (contrastive learning methods)
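STS results are conventionally reported as the Spearman correlation between the cosine similarity of sentence embeddings and the human similarity scores. Below is a minimal sketch of that evaluation loop; `toy_encode` is a hypothetical stand-in for whichever sentence encoder is being evaluated, and the sentence pairs and gold scores are made-up placeholders:

```python
# Minimal STS evaluation sketch: Spearman correlation between predicted cosine
# similarities and gold similarity scores. `toy_encode` is a placeholder encoder.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sts_spearman(encode, sentence_pairs, gold_scores):
    preds = [cosine(encode(a), encode(b)) for a, b in sentence_pairs]
    rho, _ = spearmanr(preds, gold_scores)
    return rho

def toy_encode(sentence):
    # Bag-of-letters encoder, only here to make the sketch self-contained and runnable.
    vec = np.zeros(26)
    for ch in sentence.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

pairs = [("A man is playing a guitar.", "A man plays the guitar."),
         ("A dog runs in the park.", "Stock markets fell sharply today.")]
gold = [4.8, 0.2]  # made-up similarity ratings on the usual 0-5 scale
print(sts_spearman(toy_encode, pairs, gold))
```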
Machine Translation
[9]Universal Conditional Masked Language Pre-training for Neural Machine Translation
Keywords: pre-training
Tasks and datasets:
autoregressive neural machine translation: En-Kk, De-En, En-Tr, En-Ro, En-Et, En-Fi, En-Lv, En-De, En-Cs, En-De, En-Fr
non-autoregressive neural machine translation: WMT14 En-De, WMT16 En-Ro and IWSLT14 En-De
Baselines:
mBART, mRASP, MASS, XLM, mBERT
Information Retrieval
[10]Compact Token Representations with Contextual Quantization for Efficient Document Re-ranking
Keywords: inference acceleration
Tasks and datasets:
passage and document ranking: MS MARCO
Baselines:
First-stage retrieval models: the fast BM25 method, uniCOIL, ColBERT (a minimal BM25 retrieve-then-rerank sketch follows this entry)
Re-ranking models and quantizers compared: BECR, PreTTR, BERT-base, TILDEv2
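To make the two-stage setup concrete, here is a minimal first-stage BM25 retrieval sketch using the `rank_bm25` package; the corpus, query, and cutoff are toy placeholders, and the neural re-ranking stage (where this paper's contribution lies) is only indicated by a comment:

```python
# Minimal first-stage retrieval sketch with BM25 (rank_bm25 package).
# The top candidates would then be re-scored by a neural re-ranker (e.g. a BERT cross-encoder).
from rank_bm25 import BM25Okapi

corpus = [
    "MS MARCO is a large-scale passage ranking collection.",
    "BM25 is a classical lexical retrieval function.",
    "Contextual quantization compresses token representations for efficient re-ranking.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "efficient passage re-ranking"
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]
candidates = [corpus[i] for i in top_k]  # these candidates go to the re-ranking stage
print(candidates)
```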
Dialogue
[11]A Model-Agnostic Data Manipulation Method for Persona-based Dialogue Generation
Tasks and datasets:
PersonaChat
Baselines:
Back Translation (BT)
CVAE
Entropy Filter
Reasoning
[12]Generated Knowledge Prompting for Commonsense Reasoning
Tasks and datasets:
NumerSense
NumerSense (Lin et al., 2020) consists of numerical statements about common objects and concepts where for each sentence we need to recover a masked number word.
CommonsenseQA (CSQA)
CommonsenseQA (CSQA) (Talmor et al., 2019) is a 5-way multiple-choice QA dataset about common world scenarios.
CommonsenseQA 2.0 (CSQA2)
CommonsenseQA 2.0 (CSQA2) (Talmor et al., 2021) is a binary classification dataset where we need to judge whether commonsense statements are true or false.
QASC
QASC (Khot et al., 2020) is an 8-way multiple-choice QA dataset about grade school science.
Baselines (knowledge generation baselines):
No knowledge, Random sentences, Context sentences, Template-based, Retrieval-based
Sentiment Analysis
[13]Adversarial Soft Prompt Tuning for Cross-Domain Sentiment Analysis
Tasks and datasets:
Amazon reviews dataset
Baselines:
BERT-DAAT
SENTIX_Fix
Standard fine-tuning
Fine-tuning + AT (adds adversarial training on top of standard fine-tuning of vanilla PLMs)
Prompt-tuning (Hard) (uses a manually defined template "It is [MASK]" for prompt-tuning; a minimal fill-mask sketch follows this list)
Prompt-tuning (Hard) + AT (adds adversarial training on top of Prompt-tuning (Hard))
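To make the hard template concrete, here is a minimal zero-shot fill-mask sketch with Hugging Face transformers; the checkpoint (`bert-base-uncased`) and the verbalizer words ("great"/"terrible") are illustrative assumptions and not necessarily the paper's configuration:

```python
# Minimal sketch of hard-prompt sentiment classification with the template "It is [MASK]."
# Checkpoint and verbalizer words are illustrative; the paper's actual setup may differ.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

review = "The battery lasts forever and the screen is gorgeous."
prompt = f"{review} It is [MASK]."

# Score only the two label words at the masked position and pick the higher one.
outputs = fill_mask(prompt, targets=["great", "terrible"])
label_scores = {o["token_str"]: o["score"] for o in outputs}
prediction = "positive" if label_scores["great"] > label_scores["terrible"] else "negative"
print(label_scores, prediction)
```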
Simile Interpretation
[14]Can Pre-trained Language Models Interpret Similes as Smart as Human?
Tasks and datasets:
The Simile Property Probing Task: General Corpus, Teacher-designed Quizzes
Baselines:
EMB
Meta4meaning
ConScore
MIUWE
Multimodality
[15]Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis
Tasks and datasets:
TWITTER-2015, TWITTER-2017
Baselines:
RAN, UMT, OSCGA, RpBERT (multimodal aspect term extraction (MATE))
TomBERT, CapTrBERT (multimodal aspect sentiment classification (MASC))
SPAN, D-GCN, BART (joint aspect sentiment analysis (JASA))
UMT+TomBERT, OSCGA+TomBERT, UMT-collapsed, OSCGA-collapsed, RpBERT-collapsed, JML (joint multimodal aspect-sentiment analysis (JMASA))
Text Generation
[16]A Multi-Document Coverage Reward for RELAXed Multi-Document Summarization
Keywords: multi-document summarization
Datasets:
Multi-News
Wikipedia Current Events Portal (WCEP)
Baselines:
HiMAP, Hierarchical Transformer, GraphSum, GraphSum + RoBERTa, BART-Long (Multi-News)
TSR, BERTReg, Submodular+ABS, BART-WCEP-DynE-5 (WCEP)
Reading Comprehension
[17]AdaLoGN: Adaptive Logic Graph Network for Reasoning-Based Machine Reading Comprehension
Datasets:
ReClor
LogiQA
Baselines:
BERT, RoBERTa, XLNet (pre-trained language model based methods)
DAGN, Focal Reasoner, LReasoner
Code Understanding
[18]A Neural Network Architecture for Program Understanding Inspired by Human Behaviors
Tasks and datasets:
code summarization: TL-CodeSum, Java subset of CodeSearchNet
code clone detection: BigCloneBench 2014 (BCB), BCB-F (new dataset)
Baselines:
CodeNN, NCS, Rencos, CodeBERT, PLBART (code summarization)
CodeBERT, PLBART, ASTNN, FA-AST (code clone detection)
Information Extraction
[19]FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction
Tasks and datasets:
CORD
We evaluate on CORD (Park et al., 2019), the Consolidated Receipt Dataset for post-OCR parsing. Annotations are provided for 30 fine-grained semantic entity classes such as store name, menu price, table number, and discount.
FUNSD
FUNSD (Jaume et al., 2019) is a public dataset for form understanding in noisy scanned documents. It is a subset of the Truth Tobacco Industry Documents (TTID) collection. The dataset consists of 199 annotated forms with 9,707 entities and 31,485 word-level annotations for 4 entity types: header, question, answer, and other (an entity-level F1 scoring sketch is given at the end of this entry).
Payment
We use the large-scale payment data (Majumder et al., 2020) that consists of around 10K documents and 7 semantic entity labels from human annotators. The corpus comes from different vendors with different layout templates.
Baselines:
SPADE
UniLMv2
LayoutLMv1
DocFormer
LayoutLMv2
TILT
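Form-understanding benchmarks such as FUNSD and CORD are usually scored with entity-level F1 over BIO-tagged words. Here is a minimal scoring sketch with the `seqeval` package; the tag sequences below are toy placeholders rather than real annotations from these corpora:

```python
# Minimal sketch of entity-level F1 with seqeval over BIO-tagged word sequences,
# the usual scoring scheme for form-understanding datasets such as FUNSD.
from seqeval.metrics import classification_report, f1_score

gold = [["B-question", "I-question", "O", "B-answer", "I-answer", "O"]]
pred = [["B-question", "I-question", "O", "B-answer", "O",        "O"]]

print(f1_score(gold, pred))              # entity-level micro F1
print(classification_report(gold, pred))
```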
Table Processing
[20]FORTAP: Using Formulas for Numerical-Reasoning-Aware Table Pretraining
Tasks and datasets:
Formula Prediction: Enron
Table Question Answering: HiTab
Cell Type Classification: DeEx
Baselines:
SpreadsheetCoder, TaPEx, TUTA (Formula Prediction)
TaPas, BERT, TaPEx, TUTA (Table Question Answering)
CNN^BERT, Bi-LSTM, TaBERT, TaPas, TUTA (Cell Type Classification)
1. https://aclanthology.org/2022.acl-long.188
2. https://aclanthology.org/2022.acl-long.125
3. https://aclanthology.org/2022.acl-long.331
4. https://aclanthology.org/2022.acl-long.1
5. https://aclanthology.org/2022.acl-long.515
6. https://aclanthology.org/2022.acl-long.254
7. https://aclanthology.org/2022.acl-long.197
8. https://aclanthology.org/2022.acl-long.336
9. https://aclanthology.org/2022.acl-long.442
10. https://aclanthology.org/2022.acl-long.51
11. https://aclanthology.org/2022.acl-long.550
12. https://aclanthology.org/2022.acl-long.225
13. https://aclanthology.org/2022.acl-long.174
14. https://aclanthology.org/2022.acl-long.543
15. https://aclanthology.org/2022.acl-long.152
16. https://aclanthology.org/2022.acl-long.351
17. https://aclanthology.org/2022.acl-long.494
18. https://aclanthology.org/2022.acl-long.353
19. https://aclanthology.org/2022.acl-long.260
20. https://aclanthology.org/2022.acl-long.82