The Mystery of ChatGPT's Datasets
Overview
Common Datasets
GPT-1 Dataset
GPT-2 Dataset
GPT-3 Dataset
The Pile v1 (GPT-J and GPT-NeoX-20B) Dataset
Megatron-11B and RoBERTa Dataset
MT-NLG Dataset
Gopher Dataset
Conclusion
Further Reading and Footnotes
For brevity and readability, this article uses footnotes rather than in-text/parenthetical citations. The main references are listed below; see http://lifearchitect.ai/papers/ for the major foundational papers in the large language model field. The papers are listed in the order they appear in this article.

Datasheets for Datasets: Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for Datasets. https://arxiv.org/abs/1803.09010

GPT-1 paper: Radford, A., & Narasimhan, K. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

GPT-2 paper: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

GPT-3 paper: Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., & Dhariwal, P. et al. (2020). Language Models are Few-Shot Learners. OpenAI. https://arxiv.org/abs/2005.14165

The Pile v1 paper: Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., & Foster, C. et al. (2021). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. EleutherAI. https://arxiv.org/abs/2101.00027

GPT-J announcement: Komatsuzaki, A., & Wang, B. (2021). GPT-J-6B: 6B JAX-Based Transformer. https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/

GPT-NeoX-20B paper: Black, S., Biderman, S., Hallahan, E. et al. (2022). GPT-NeoX-20B: An Open-Source Autoregressive Language Model. EleutherAI. http://eaidata.bmk.sh/data/GPT_NeoX_20B.pdf

RoBERTa paper: Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., & Chen, D. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. Meta AI. https://arxiv.org/abs/1907.11692

MT-NLG paper: Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., & Casper, J. et al. (2021). Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. Microsoft/NVIDIA. https://arxiv.org/abs/2201.11990

Gopher paper: Rae, J., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., & Song, F. et al. (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. DeepMind. https://arxiv.org/abs/2112.11446
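Appendix A: Top 50 Resources: Wikipedia + CC + WebText (i.e. GPT-3)
Based on the content of this article, in particular the token count of each resource in each dataset, we can rank the resources or domains for models that use the combination of Wikipedia + Common Crawl + WebText as part of their overall training dataset. For clarity, this includes the following models: OpenAI GPT-3, EleutherAI GPT-J, EleutherAI GPT-NeoX-20B, Meta AI Megatron-11B and RoBERTa, and Microsoft/NVIDIA MT-NLG, among others. Note that the ranking shown is based on the unweighted total tokens available in each dataset; the subjective weighting of each dataset is calculated by the researchers before model pre-training. There is some duplication (for example, The New York Times appears both in WebText, with 111 million tokens, and in filtered Common Crawl, with 100 million tokens).

Footnotes
1. GPT-NeoX-20B paper: pp11, section 6 http://eaidata.bmk.sh/data/GPT_NeoX_20B.pdf
2. Datasheet for Datasets paper: https://arxiv.org/abs/1803.09010
3. OpenAI blog: https://openai.com/blog/gpt-3-apps/
4. On the Opportunities and Risks of Foundation Models: https://arxiv.org/abs/2108.07258
5. Size of Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
6. C4 dataset: https://www.tensorflow.org/datasets/catalog/c4
7. Common Crawl website: https://commoncrawl.org/
8. C4 paper: https://arxiv.org/abs/2104.08758 pp2, Figure 1 right
9. Wikipedia categories: https://en.wikipedia.org/wiki/User:Smallbones/1000_random_results: "What topics does Wikipedia cover? Has the coverage changed over time? These and similar questions are examined using 1001 random articles sampled in December 2015... These proportions are fairly stable over time... Biographies (27.8%), Geography (17.7%), Culture and arts (15.8%), History (9.9%), Biology, health, and medicine (7.8%), Sports (6.5%), Business (4.8%), Other society (4.4%), Science and math (3.5%), Education (1.8%)."
10. GPT-1 paper: pp4 "We use the BooksCorpus dataset for training the language model."
11. https://huggingface.co/datasets/bookcorpus: "Size of the generated dataset: 4629.00 MB"
12. BookCorpus Retrospective Datasheet paper: pp9 https://arxiv.org/abs/2105.05241
13. GPT-2 paper: pp3 "We scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny... WebText contains the text subset of these 45 million links... which does not include links created after Dec 2017. After de-duplication and some heuristic-based cleaning, it contains slightly over 8 million documents for a total of 40GB of text. We removed all Wikipedia documents from WebText..."
14. GPT-2 model card: https://github.com/openai/gpt-2/blob/master/model_card.md: "We have published a list of the top 1,000 domains present in WebText and their frequency. The top 15 domains in WebText are: Google, Archive, Blogspot, GitHub, The New York Times, Wordpress, Washington Post, Wikia, BBC, The Guardian, eBay, Pastebin, CNN, Yahoo, and the Huffington Post."
15. GPT-3 paper: "WebText2: 19 billion tokens. [Alan: WebText2 is slightly extended from WebText, so we can subtract 20% to get 15 billion tokens]"
16. GPT-2 paper: pp3. GPT-3: pp9, Table 2.2 "CC: 410 billion tokens. WebText2: 19 billion tokens. Books1: 12 billion tokens. Books2: 55 billion tokens. Wiki: 3 billion tokens"
17. GPT-3 paper: pp8
18. BookCorpus repo: soskek/bookcorpus#27: "books3.tar.gz seems to be similar to the mysterious "books2" dataset that OpenAI references in their papers. Unfortunately, OpenAI will not give details, so we know very little about any differences. People suspect it is "all of libgen", but this is purely conjecture. Nonetheless, books3 is "all of bibliotik"..."
19. BookCorpus paper: https://arxiv.org/abs/1506.06724: "# of words: 984,846,357 [Alan: BookCorpus has 1.3 billion tokens. We want 12-55 billion tokens]"
20. Gutenberg paper: https://arxiv.org/abs/1812.08092: "We present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and 3×10^9 word-tokens [Alan: equivalent to about 12 billion BPE tokens, see below]"
21. Gutenberg repo: https://zenodo.org/record/2422561 "Uncompressed size: 3GB (counts) + 18GB (tokens) [21GB total]"
22. The Pile v1 paper: "Books3 (Bibliotik tracker): 100.96GB" [Alan: multiplied by tokens per byte of 0.2477 = 25 billion tokens]
23. The Pile v1 paper: pp3, Table 1 for datasets. pp28, Table 7 for Tokens per byte.
24. RoBERTa paper: https://arxiv.org/abs/1907.11692 "BOOKCORPUS plus English WIKIPEDIA. This is the original data used to train BERT. (16GB)."
25. BERT paper: https://arxiv.org/abs/1810.04805 "BERT is trained on the BooksCorpus (800 million words) and Wikipedia (2,500 million words)."
26. Stories paper: https://arxiv.org/abs/1806.02847 pp5-6
27. RealNews paper: https://arxiv.org/abs/1905.12616v3 "After deduplication, RealNews is 120GB without compression."
28. Gopher paper: https://arxiv.org/abs/2112.11446 pp 7: list of sizes and tokens.
29. Gopher paper: https://arxiv.org/abs/2112.11446 pp 44, Figure A3b.
30. Gopher paper: pp41n14 "Note that we apply document deduplication to all MassiveText subsets with the exception of Wikipedia and GitHub"
31. GPT-2 paper, pp3.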
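
The bracketed estimates in footnotes 15 and 22 are simple arithmetic, and a minimal sketch can reproduce them. The figures (100.96GB, 0.2477 tokens per byte, 19 billion tokens, 20%) come from those footnotes. The function names are hypothetical, and treating 1 GB as 10^9 bytes is an assumption made here for illustration.

```python
# Rough token estimates mirroring the bracketed notes in footnotes 15 and 22.
# Function names are illustrative only; they do not come from any cited paper.

def tokens_from_size(size_gb: float, tokens_per_byte: float) -> float:
    """Estimate a token count from an uncompressed size.

    Assumption: 1 GB is treated as 1e9 bytes.
    """
    return size_gb * 1e9 * tokens_per_byte


def reduce_by_fraction(tokens: float, fraction: float) -> float:
    """Scale a token count down by a given fraction (0.20 means minus 20%)."""
    return tokens * (1.0 - fraction)


# Footnote 22: Books3 at 100.96GB with 0.2477 tokens per byte.
books3_tokens = tokens_from_size(100.96, 0.2477)

# Footnote 15: WebText2 at 19 billion tokens, minus 20% to approximate WebText.
webtext_tokens = reduce_by_fraction(19e9, 0.20)

print(f"Books3:  about {books3_tokens / 1e9:.1f} billion tokens")
print(f"WebText: about {webtext_tokens / 1e9:.1f} billion tokens")
```

Both results round to the figures given in the footnotes (about 25 billion and about 15 billion tokens). A different byte convention for "GB" would shift the Books3 estimate by several percent.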
About the Author