LLMs之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻譯與解讀
導讀:該論文提出了一個開源的大規(guī)模語言模型LLaMA,使用2048個A100-80GB GPU進行訓練(65B模型在1.4T標記上約需21天)。該模型有以下幾個核心技術(shù)點:
>> 模型架構(gòu)=Transformer+集合多個算法的優(yōu)秀技術(shù)(RMSNorm+SwiGLU+RoPE+AdamW+xformers庫+帶預熱的余弦學習率調(diào)度):LLaMA模型采用類似GPT的Transformer(解碼器)架構(gòu),但是使用了多項技術(shù)優(yōu)化,特別是對每個子層的輸入做預歸一化以提升訓練穩(wěn)定性。(1)、集合多個算法的優(yōu)秀技術(shù):預歸一化函數(shù)RMSNorm、激活函數(shù)SwiGLU、旋轉(zhuǎn)位置嵌入RoPE、AdamW優(yōu)化器、高效的因果多頭注意力(xformers庫加速)。
(2)、學習率調(diào)度:LLaMA使用帶預熱的余弦學習率調(diào)度,即先用2000步線性預熱學習率,再按余弦曲線衰減到峰值學習率的10%,并配合0.1的權(quán)重衰減和1.0的梯度裁剪,幫助模型穩(wěn)定收斂。
>> 訓練數(shù)據(jù)約4.7TB文本+BPE分詞(1.4萬億個tokens)—更多tokens+較小模型=可較好性能:LLaMA只使用公開的數(shù)據(jù)集,包括英語CommonCrawl、C4、GitHub、Wikipedia、Gutenberg+Books3、ArXiv、Stack Exchange,磁盤總量約4.7TB。使用SentencePiece實現(xiàn)的字節(jié)對編碼(BPE)算法對數(shù)據(jù)進行分詞,共約1.4萬億個tokens。Chinchilla 論文中推薦在 200B(0.2T) 的 tokens 上訓練 10B 規(guī)模的模型,而 LLaMA 用 1T tokens 訓練 7B/13B 模型、用 1.4T tokens 訓練 33B/65B 模型,增大 tokens 規(guī)模,模型的性能仍在持續(xù)上升。
>> LLaMA包含7B/13B/33B/65B參數(shù)的基礎(chǔ)語言模型集合—LLaMA-13B 僅以約 1/10 的參數(shù)量在多數(shù)基準上優(yōu)于 GPT-3(175B):這是一個包含從7B~65B參數(shù)的基礎(chǔ)語言模型集合。我們使用數(shù)萬億個標記對這些模型進行訓練,并展示了可以僅使用公開可用的數(shù)據(jù)集訓練出最先進的模型,而無需使用專有和不可訪問的數(shù)據(jù)集。在多項語言建模和下游任務的 benchmark 上,LLaMA-13B 在大多數(shù)基準測試中優(yōu)于 GPT-3(175B),且推理成本明顯更低;LLaMA-65B 則與最好的模型 Chinchilla-70B 和 PaLM-540B 競爭力相當。
目錄
相關(guān)論文
LLMs之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻譯與解讀
LLMs之Alpaca:《Alpaca: A Strong, Replicable Instruction-Following Model》翻譯與解讀
LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca-4月17日版》翻譯與解讀
LLMs:《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻譯與解讀
實戰(zhàn)案例
Windows系統(tǒng)
LLMs:在單機CPU+Windows系統(tǒng)上對LLaMA模型(基于facebookresearch的GitHub)進行模型部署且實現(xiàn)模型推理全流程步驟【部署conda環(huán)境+安裝依賴庫+下載模型權(quán)重(國內(nèi)外各種鏈接)→模型推理】的圖文教程(非常詳細)
LLMs:基于單機CPU+Windows系統(tǒng)實現(xiàn)中文LLaMA算法(基于Chinese-LLaMA-Alpaca)進行模型部署(llama.cpp)+模型推理全流程步驟【安裝環(huán)境+創(chuàng)建環(huán)境并安裝依賴+原版LLaMA轉(zhuǎn)HF格式+合并llama_hf和chinese-alpaca-lora-7b→下載llama.cpp進行模型的量化(CMake編譯+生成量化版本模型)→部署f16/q4_0+測試效果】的圖文教程(非常詳細)
LLMs:基于單個4GB GPU上(Windows系統(tǒng))運行LLM上——pyllama模型(基于fjuncongmoo的GitHub)進行模型部署且實現(xiàn)模型推理全流程步驟的圖文教程(非常詳細)
Linux系統(tǒng)
LLMs:基于Chinese-LLaMA-Alpaca開源代碼在Ng單機單卡利用LLaMA(Meta)和Alpaca(斯坦福)實現(xiàn)定義數(shù)據(jù)集(生成指令數(shù)據(jù))→數(shù)據(jù)預處理(token分詞/合并權(quán)重)→增量預訓練(LoRA的參數(shù)/LLaMA的參數(shù))→指令微調(diào)LoRA權(quán)重(繼續(xù)訓練/全新訓練)→模型推理(CLI、GUI【webui/LLaMACha/LangChain】)
LLMs之LLaMA-7B-QLoRA:基于Alpaca-Lora代碼在CentOS和多卡(A800+并行技術(shù))實現(xiàn)全流程完整復現(xiàn)LLaMA-7B—安裝依賴、轉(zhuǎn)換為HF模型文件、模型微調(diào)(QLoRA+單卡/多卡)、模型推理(對比終端命令/llama.cpp/Docker封裝)圖文教程之詳細攻略
《LLaMA: Open and Efficient Foundation Language Models》翻譯與解讀
介紹LLaMA:一款基礎(chǔ)性的、擁有650億參數(shù)的大型語言模型
Abstract
1、Introduction
LLMs的能力和發(fā)展趨勢、模型規(guī)模與性能的關(guān)系
縮放法則與推理預算的忽視
LLaMA模型的提出
內(nèi)容概述:模型改進、性能比較、模型的偏見和毒性問題
2、Approach
訓練方法概述:
2.1、Pre-training Data
Table 1: Pre-training data. Data mixtures used for pre-training, for each subset we list the sampling proportion, number of epochs performed on the subset when training on 1.4T tokens, and disk size. The pre-training runs on 1T tokens have the same sampling proportion.預訓練數(shù)據(jù)。用于預訓練的數(shù)據(jù)混合,對于每個子集,我們列出了采樣比例、在1.4T標記上訓練時在該子集上執(zhí)行的epoch(輪次)數(shù)以及磁盤大小。在1T標記上進行的預訓練運行具有相同的采樣比例。
預訓練數(shù)據(jù):
分詞器:采用字節(jié)對編碼(BPE)算法并借助SentencePiece工具庫實現(xiàn)
整體訓練數(shù)據(jù)集及其處理方式:分詞后包含大約1.4萬億個標記(1.4T),僅對維基百科和圖書領(lǐng)域的數(shù)據(jù)進行了大約2個epoch的迭代訓練
2.2、Architecture
Table 2: Model sizes, architectures, and optimization hyper-parameters.模型大小、架構(gòu)和優(yōu)化超參數(shù)。
模型架構(gòu):
2.3 Optimizer
Figure 1: Training loss over train tokens for the 7B, 13B, 33B, and 65B models. LLaMA-33B and LLaMA-65B were trained on 1.4T tokens. The smaller models were trained on 1.0T tokens. All models are trained with a batch size of 4M tokens.——7B、13B、33B和65B模型在訓練標記上的訓練損失。LLaMA-33B和LLaMA-65B在1.4T標記上進行訓練。較小的模型在1.0T標記上進行訓練。所有模型的批次大小均為4M標記。
優(yōu)化器:
2.4 Efficient implementation高效實現(xiàn)
Table 3: Zero-shot performance on Common Sense Reasoning tasks.常識推理任務上的零樣本性能。
高效實現(xiàn):
性能和訓練速度:采用了2048個A100-80GB GPU,訓練1.4T標記的數(shù)據(jù)集,耗時21天
3 Main results主要結(jié)果
任務類型和基準測試:
與其他模型的比較:
總結(jié):常識推理、封閉書籍問答、閱讀理解、數(shù)學推理、代碼生成等任務的性能評估:
Table 4: NaturalQuestions. Exact match performance.
3.1 Common Sense Reasoning常識推理
3.2 Closed-book Question Answering閉書式問答
Table 5: TriviaQA. Zero-shot and few-shot exact match performance on the filtered dev set.
3.3 Reading Comprehension閱讀理解
Table 6: Reading Comprehension. Zero-shot accuracy.
3.4 Mathematical reasoning數(shù)學推理
Table 7: Model performance on quantitative reasoning datasets. For majority voting, we use the same setup as Minerva, with k = 256 samples for MATH and k = 100 for GSM8k (Minerva 540B uses k = 64 for MATH and k = 40 for GSM8k). LLaMA-65B outperforms Minerva 62B on GSM8k, although it has not been fine-tuned on mathematical data.
3.5 Code generation代碼生成
Table 8: Model performance for code generation. We report the pass@ score on HumanEval and MBPP. HumanEval generations are done in zero-shot and MBPP with 3-shot prompts similar to Austin et al. (2021). The values marked with ∗ are read from figures in Chowdhery et al. (2022).代碼生成的模型性能。我們報告了在HumanEval和MBPP上的pass@分數(shù)。HumanEval的生成在零樣本設(shè)置下進行,MBPP則使用與Austin et al. (2021)類似的3-shot提示。標有∗的值是從Chowdhery et al. (2022)的圖表中讀取的。
3.6 Massive Multitask Language Understanding大規(guī)模多任務語言理解
3.7 Evolution of performance during training訓練期間性能的演變
表現(xiàn)的追蹤情況
Table 9: Massive Multitask Language Understanding (MMLU). Five-shot accuracy.大規(guī)模多任務語言理解(MMLU)。5-shot準確率。
4 Instruction Finetuning指令微調(diào)
指導數(shù)據(jù)微調(diào):
微調(diào)實驗和結(jié)果:
Table 10: Instruction finetuning – MMLU (5-shot). Comparison of models of moderate size with and with-out instruction finetuning on MMLU.
Figure 2: Evolution of performance on question answering and common sense reasoning during training.在訓練過程中的問答和常識推理性能演變。
5 Bias, Toxicity and Misinformation偏見、有害內(nèi)容和虛假信息
大型語言模型可能面臨的偏見、有毒性和虛假信息生成的問題,通過多個基準測試展示了LLaMA-65B在這些方面的表現(xiàn)。
5.1 RealToxicityPrompts
Table 11: RealToxicityPrompts. We run a greedy decoder on the 100k prompts from this benchmark. The “respectful” versions are prompts starting with “Complete the following sentence in a polite, respectful, and unbiased manner:”, and “Basic” is without it. Scores were obtained using the PerspectiveAPI, with higher score indicating more toxic generations.RealToxicityPrompts。我們在這個基準測試的100,000個提示上運行貪婪解碼器。"尊重"版本是以"以禮貌、尊重和公正的方式完成以下句子:"開頭的提示,而"基本"則沒有。得分是使用PerspectiveAPI獲得的,得分越高表示生成的內(nèi)容越有毒。
5.2 CrowS-Pairs
Table 12: CrowS-Pairs. We compare the level of bi-ases contained in LLaMA-65B with OPT-175B and GPT3-175B. Higher score indicates higher bias.
5.3 WinoGender
5.4 TruthfulQA
6 Carbon footprint碳足跡
模型訓練對環(huán)境的能源和碳足跡的影響
7 Related work相關(guān)工作
Table 15: Carbon footprint of training different models in the same data center. We follow Wu et al. (2022) to compute carbon emission of training OPT, BLOOM and our models in the same data center. For the power consumption of a A100-80GB, we take the thermal design power for NVLink systems, that is 400W. We take a PUE of 1.1 and a carbon intensity factor set at the national US average of 0.385 kg CO2e per KWh.在同一數(shù)據(jù)中心訓練不同模型的碳足跡。我們遵循Wu等人(2022)的方法,在同一數(shù)據(jù)中心計算OPT、BLOOM和我們模型的碳排放量。對于A100-80GB的功耗,我們采用NVLink系統(tǒng)的熱設(shè)計功耗,即400W。我們采用PUE值為1.1,碳強度因子設(shè)定為美國國家平均值,即每千瓦時0.385千克CO2e。
語言模型定義、語言模型歷史、規(guī)模擴展、規(guī)模對性能的影響
8 Conclusion結(jié)論
概括了論文的主要貢獻和觀察結(jié)果
Acknowledgements致謝
相關(guān)論文
LLMs之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻譯與解讀
AIGC之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻譯與解讀_ai自然語言處理_一個處女座的程序猿的博客-CSDN博客
LLMs之Alpaca:《Alpaca: A Strong, Replicable Instruction-Following Model》翻譯與解讀
https://yunyaniu.blog.csdn.net/article/details/129775107
LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca-4月17日版》翻譯與解讀
https://yunyaniu.blog.csdn.net/article/details/130998087
LLMs:《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻譯與解讀
https://yunyaniu.blog.csdn.net/article/details/131318974
實戰(zhàn)案例
Windows系統(tǒng)
LLMs:在單機CPU+Windows系統(tǒng)上對LLaMA模型(基于facebookresearch的GitHub)進行模型部署且實現(xiàn)模型推理全流程步驟【部署conda環(huán)境+安裝依賴庫+下載模型權(quán)重(國內(nèi)外各種鏈接)→模型推理】的圖文教程(非常詳細)
https://yunyaniu.blog.csdn.net/article/details/130979622
LLMs:基于單機CPU+Windows系統(tǒng)實現(xiàn)中文LLaMA算法(基于Chinese-LLaMA-Alpaca)進行模型部署(llama.cpp)+模型推理全流程步驟【安裝環(huán)境+創(chuàng)建環(huán)境并安裝依賴+原版LLaMA轉(zhuǎn)HF格式+合并llama_hf和chinese-alpaca-lora-7b→下載llama.cpp進行模型的量化(CMake編譯+生成量化版本模型)→部署f16/q4_0+測試效果】的圖文教程(非常詳細)
https://yunyaniu.blog.csdn.net/article/details/131016046
LLMs:基于單個4GB GPU上(Windows系統(tǒng))運行LLM上——pyllama模型(基于fjuncongmoo的GitHub)進行模型部署且實現(xiàn)模型推理全流程步驟的圖文教程(非常詳細)
https://yunyaniu.blog.csdn.net/article/details/131016598
Linux系統(tǒng)
LLMs:基于Chinese-LLaMA-Alpaca開源代碼在Ng單機單卡利用LLaMA(Meta)和Alpaca(斯坦福)實現(xiàn)定義數(shù)據(jù)集(生成指令數(shù)據(jù))→數(shù)據(jù)預處理(token分詞/合并權(quán)重)→增量預訓練(LoRA的參數(shù)/LLaMA的參數(shù))→指令微調(diào)LoRA權(quán)重(繼續(xù)訓練/全新訓練)→模型推理(CLI、GUI【webui/LLaMACha/LangChain】)
https://yunyaniu.blog.csdn.net/article/details/131319010
LLMs之LLaMA-7B-QLoRA:基于Alpaca-Lora代碼在CentOS和多卡(A800+并行技術(shù))實現(xiàn)全流程完整復現(xiàn)LLaMA-7B—安裝依賴、轉(zhuǎn)換為HF模型文件、模型微調(diào)(QLoRA+單卡/多卡)、模型推理(對比終端命令/llama.cpp/Docker封裝)圖文教程之詳細攻略
https://yunyaniu.blog.csdn.net/article/details/131526139
《LLaMA: Open and Efficient Foundation Language Models》翻譯與解讀
地址 |
論文:https://arxiv.org/abs/2302.13971
GitHub(加載LLaMA模型并進行推理):GitHub - facebookresearch/llama at llama_v1
GitHub(基于Python部署):https://github.com/facebookresearch/llama
GitHub(基于Python和C/C++部署):
部署文章 |
作者 |
Hugo Touvron*, Thibaut Lavril*, Gautier Izacard*, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave*, Guillaume Lample*(Meta AI) |
時間 |
2023年2月25日 |
介紹LLaMA:一款基礎(chǔ)性的、擁有650億參數(shù)的大型語言模型
時間:2023年2月24日
原文地址:https://ai.meta.com/blog/large-language-model-llama-meta-ai/
作為Meta對開放科學的承諾的一部分,今天我們公開發(fā)布LLaMA(Large Language Model Meta AI),這是一款先進的基礎(chǔ)性大型語言模型,旨在幫助研究人員推動人工智能的這一子領(lǐng)域的工作。像LLaMA這樣的更小、更高性能的模型使研究社區(qū)中那些無法訪問大量基礎(chǔ)設(shè)施的人能夠研究這些模型,進一步使這一重要而快速變化的領(lǐng)域的訪問更加民主化。
在大型語言模型領(lǐng)域,訓練像LLaMA這樣的較小基礎(chǔ)模型是可取的,因為它所需的計算能力和資源要少得多,便于測試新方法、驗證他人的工作以及探索新的用例?;A(chǔ)模型在大量未標記數(shù)據(jù)上進行訓練,這使它們成為針對各種任務進行微調(diào)的理想選擇。我們提供了LLaMA的多個規(guī)模(7B、13B、33B和65B參數(shù)),還分享了一份LLaMA模型卡片,詳細說明了我們構(gòu)建模型的方法,符合我們對負責任人工智能實踐的理念。
在過去的一年中,擁有數(shù)十億參數(shù)的大型語言模型,即自然語言處理(NLP)系統(tǒng),展示了生成創(chuàng)造性文本、解決數(shù)學定理、預測蛋白質(zhì)結(jié)構(gòu)、回答閱讀理解問題等方面的新能力。它們是AI能夠在全球數(shù)十億人中規(guī)模提供實質(zhì)性潛在利益的最清晰案例之一。
盡管大型語言模型近期取得了許多進展,但由于訓練和運行這樣大型模型所需的資源,對它們的全面研究訪問仍然有限。受限的訪問限制了研究人員理解這些大型語言模型是如何工作的,阻礙了改善其穩(wěn)健性并緩解已知問題(如偏見、毒性和生成誤導信息的潛在風險)的努力。
在更多標記上訓練的較小模型更容易針對特定的潛在產(chǎn)品用例進行再訓練和微調(diào)。我們在1.4萬億標記上訓練了LLaMA 65B和LLaMA 33B。我們最小的模型LLaMA 7B則在1萬億標記上訓練。
像其他大型語言模型一樣,LLaMA的工作方式是將一系列單詞作為輸入,并遞歸地預測下一個單詞來生成文本。為了訓練我們的模型,我們選擇了使用人數(shù)最多的20種語言的文本,重點關(guān)注使用拉丁字母和西里爾字母的語言。
在解決大型語言模型中的偏見、有毒評論和幻覺風險方面,仍需要進行更多的研究。像其他模型一樣,LLaMA也面臨這些挑戰(zhàn)。作為基礎(chǔ)模型,LLaMA設(shè)計為靈活多用途,可以應用于許多不同的用例,而不是為特定任務設(shè)計的微調(diào)模型。通過分享LLaMA的代碼,其他研究人員可以更容易地測試在大型語言模型中限制或消除這些問題的新方法。我們還在論文中提供了一系列評估,評估模型的偏見和毒性,以展示模型的局限性,并支持在這一關(guān)鍵領(lǐng)域進行進一步研究。
為了保持完整性并防止濫用,我們將以專注于研究用途的非商業(yè)許可發(fā)布我們的模型。對該模型的訪問將根據(jù)個案授予學術(shù)研究人員、與政府、公民社會和學術(shù)界組織有關(guān)的人員以及全球工業(yè)研究實驗室。有興趣申請訪問的人可以在我們的研究論文中找到申請鏈接。
我們認為整個人工智能社區(qū) — 學術(shù)研究人員、公民社會、決策者和行業(yè) — 必須共同努力制定關(guān)于負責任AI的明確指南,特別是關(guān)于負責任的大型語言模型。我們期待看到社區(qū)能夠通過使用LLaMA學到什么 — 最終建立什么。
Abstract
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community. |
我們介紹了LLaMA,這是一組參數(shù)范圍從7B到65B的基礎(chǔ)語言模型。我們使用數(shù)萬億個標記來訓練我們的模型,并展示了可以僅使用公開可用的數(shù)據(jù)集進行訓練,而不需要專有和不可訪問的數(shù)據(jù)集來訓練最先進的模型。特別是,LLaMA-13B在大多數(shù)基準測試中表現(xiàn)優(yōu)于GPT-3(175B),LLaMA-65B與最好的模型Chinchilla-70B和PaLM-540B具有競爭力。我們將所有模型發(fā)布給研究社區(qū)。 |
1、Introduction
LLMs的能力和發(fā)展趨勢、模型規(guī)模與性能的關(guān)系
- LLMs,如GPT-3,通過在龐大的文本語料庫上進行訓練,展示了在從文本指令或Few-shot示例中執(zhí)行新任務的能力。
- Few-shot特性是通過將模型擴大到足夠大的規(guī)模后首次出現(xiàn)的,這導致了進一步擴大模型規(guī)模的研究方向。
- 之前的研究假設(shè)更多的參數(shù)將導致更好的性能,因此致力于進一步擴大模型規(guī)模。
- 在給定的計算預算下,最佳性能并非由最大的模型實現(xiàn),而是由在更多數(shù)據(jù)上訓練的較小模型實現(xiàn)。
Large Language Models (LLMs) trained on massive corpora of texts have shown their ability to perform new tasks from textual instructions or from a few examples (Brown et al., 2020). These few-shot properties first appeared when scaling models to a sufficient size (Kaplan et al., 2020), resulting in a line of work that focuses on further scaling these models (Chowdhery et al., 2022; Rae et al., 2021). These efforts are based on the assumption that more parameters will lead to better performance. However, recent work from Hoffmann et al. (2022) shows that, for a given compute budget, the best performances are not achieved by the largest models, but by smaller models trained on more data. |
大型語言模型(LLMs)在大規(guī)模文本語料庫上訓練后展現(xiàn)了它們根據(jù)文本指令或少量示例執(zhí)行新任務的能力(Brown等,2020年)。這種少樣本特性首次出現(xiàn)在將模型擴展到足夠大的規(guī)模時(Kaplan等,2020年),隨后有了一系列進一步擴展這些模型的工作(Chowdhery等,2022年;Rae等,2021年)。這些努力是基于一個假設(shè),即更多的參數(shù)將導致更好的性能。然而,Hoffmann等人(2022年)的最新研究表明,在給定的計算預算下,最佳性能不是由最大的模型實現(xiàn)的,而是由更小的模型在更多數(shù)據(jù)上進行訓練的模型實現(xiàn)的。 |
縮放法則與推理預算的忽視
- Hoffmann等人的縮放法則旨在確定如何在特定訓練計算預算下最佳縮放數(shù)據(jù)集和模型大小。
- 但這一目標忽略了推理預算,而在大規(guī)模提供語言模型時,推理預算變得至關(guān)重要。
The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens. |
Hoffmann等人(2022年)的擴展定律的目標是確定如何在特定的訓練計算預算下最佳地擴展數(shù)據(jù)集和模型大小。然而,這個目標忽視了推理預算,在大規(guī)模使用語言模型時變得至關(guān)重要。在這種情況下,給定目標性能水平,首選的模型不是訓練最快的模型,而是推理最快的模型,盡管訓練一個大型模型以達到一定的性能水平可能更便宜,但訓練時間更長的較小模型在推理階段最終更經(jīng)濟。例如,盡管Hoffmann等人(2022年)建議在200B個標記上訓練一個10B模型,但我們發(fā)現(xiàn)7B模型的性能在訓練1T個標記后仍在改善。 |
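按上面給出的數(shù)字可以做一個簡單換算(僅作示意):Hoffmann等人(2022年)的建議大致相當于每個參數(shù)約20個標記的配比,而LLaMA-7B的訓練量遠超這一配比,性能仍在提升:

```latex
\frac{200\,\text{B tokens}}{10\,\text{B params}} = 20\ \tfrac{\text{tokens}}{\text{param}},
\qquad
\frac{1000\,\text{B tokens}}{7\,\text{B params}} \approx 143\ \tfrac{\text{tokens}}{\text{param}}
```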
LLaMA模型的提出
- 為了在各種推理預算下實現(xiàn)最佳性能,作者提出了一系列名為LLaMA的語言模型,其參數(shù)范圍從7B到65B。
- LLaMA模型在性能上與最佳的LLMs相媲美,例如LLaMA-13B在大多數(shù)基準測試上優(yōu)于GPT-3,盡管規(guī)模小了10倍。
- 數(shù)據(jù)來源公開性:與Chinchilla、PaLM或GPT-3不同,LLaMA僅使用公開可用的數(shù)據(jù),使其與開源兼容。
- 模型開源性:大多數(shù)現(xiàn)有模型依賴于不公開或未記錄的數(shù)據(jù),而LLaMA的公開數(shù)據(jù)使用更具開源性。
The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used. The resulting models, called LLaMA, ranges from 7B to 65B parameters with competitive performance compared to the best existing LLMs. For instance, LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU. At the higher-end of the scale, our 65B-parameter model is also competitive with the best large language models such as Chinchilla or PaLM-540B. Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g. “Books – 2TB” or “Social media conversations”). There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla. |
本文的重點是訓練一系列在各種推理預算下實現(xiàn)最佳性能的語言模型,通過使用比通常使用的更多標記進行訓練。結(jié)果得到的模型稱為LLaMA,參數(shù)范圍從7B到65B,與現(xiàn)有最好的LLM相比具有競爭力的性能。例如,LLaMA-13B在大多數(shù)基準測試中優(yōu)于GPT-3,盡管體積只有其1/10。我們相信,這個模型將有助于使LLMs的訪問和研究民主化,因為它可以在單個GPU上運行。在規(guī)模較大的端,我們的65B參數(shù)模型與最好的大型語言模型(如Chinchilla或PaLM-540B)也具有競爭力。 與Chinchilla、PaLM或GPT-3不同,我們只使用公開可用的數(shù)據(jù),使我們的工作與開源兼容,而大多數(shù)現(xiàn)有模型依賴于不公開可用或未經(jīng)記錄的數(shù)據(jù)(例如,“Books – 2TB”或“Social media conversations”)。也存在一些例外,例如OPT(Zhang等,2022年),GPT-NeoX(Black等,2022年),BLOOM(Scao等,2022年)和GLM(Zeng等,2022年),但沒有一個能與PaLM-62B或Chinchilla相競爭。 |
內(nèi)容概述:模型改進、性能比較、模型的偏見和毒性問題
- 文章介紹了對Transformer架構(gòu)(Vaswani等人,2017)的修改以及訓練方法。
In the rest of this paper, we present an overview of the modifications we made to the transformer architecture (Vaswani et al., 2017), as well as our training method. We then report the performance of our models and compare with other LLMs on a set of standard benchmarks. Finally, we expose some of the biases and toxicity encoded in our models, using some of the most recent benchmarks from the responsible AI community. |
在本文的其余部分,我們將概述我們對Transformer架構(gòu)(Vaswani等,2017年)所做的修改以及我們的訓練方法。然后,我們將報告我們模型的性能,并與其他LLM在一系列標準基準測試中進行比較。最后,我們使用最近的一些負責任的AI社區(qū)的基準測試揭示了我們的模型中編碼的一些偏見和有害信息。 |
2、Approach
訓練方法概述:
- 作者的訓練方法與之前的工作相似,受到Chinchilla縮放法則的啟發(fā)。
- 使用大型transformers在大量文本數(shù)據(jù)上進行訓練,采用標準優(yōu)化器。
Our training approach is similar to the methods described in previous work (Brown et al., 2020; Chowdhery et al., 2022), and is inspired by the Chinchilla scaling laws (Hoffmann et al., 2022). We train large transformers on a large quantity of textual data using a standard optimizer. |
我們的訓練方法類似于先前的工作(Brown等,2020年;Chowdhery等,2022年),并受到了Chinchilla擴展定律的啟發(fā)(Hoffmann等,2022年)。我們使用標準優(yōu)化器在大量文本數(shù)據(jù)上訓練大型Transformer模型。 |
2.1、Pre-training Data
Table 1: Pre-training data. Data mixtures used for pre-training, for each subset we list the sampling proportion, number of epochs performed on the subset when training on 1.4T tokens, and disk size. The pre-training runs on 1T tokens have the same sampling proportion.預訓練數(shù)據(jù)。用于預訓練的數(shù)據(jù)混合,對于每個子集,我們列出了采樣比例、在1.4T標記上訓練時在該子集上執(zhí)行的epoch(輪次)數(shù)以及磁盤大小。在1T標記上進行的預訓練運行具有相同的采樣比例。
預訓練數(shù)據(jù):
- 數(shù)據(jù)集由多個來源組成,包括CommonCrawl、C4、GitHub、Wikipedia、Gutenberg、Books3、ArXiv、Stack Exchange等,涵蓋多個領(lǐng)域。
- 數(shù)據(jù)經(jīng)過預處理,包括去重、語言識別、質(zhì)量過濾等步驟,確保只使用公開可用、與開源兼容的數(shù)據(jù)。
- 整個訓練數(shù)據(jù)集包含大約1.4T個標記。
Our training dataset is a mixture of several sources, reported in Table 1, that cover a diverse set of do- mains. For the most part, we reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available, and compatible with open sourcing. This leads to the following mixture of data and the per- centage they represent in the training set: |
我們的訓練數(shù)據(jù)集是多個來源的混合物,詳見表格1,涵蓋了各種領(lǐng)域。在很大程度上,我們重新使用了用于訓練其他LLM的數(shù)據(jù)源,但限制是只使用公開可用的數(shù)據(jù),并且與開源兼容。這導致了以下混合數(shù)據(jù)及其在訓練集中所代表的百分比: |
English CommonCrawl [67%]. We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model. In addition, we trained a linear model to classify pages used as references in Wikipedia v.s. randomly sampled pages, and discarded pages not classified as references. C4 [15%]. During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The preprocessing of C4 also contains deduplication and language identification steps: the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage. |
英語CommonCrawl [67%]。我們對五個CommonCrawl數(shù)據(jù)轉(zhuǎn)儲進行預處理,時間跨度從2017年到2020年,使用CCNet流程(Wenzek等,2020年)。該過程在行級別進行數(shù)據(jù)去重,使用fastText線性分類器進行語言識別以去除非英語頁面,并使用n-gram語言模型過濾低質(zhì)量內(nèi)容。此外,我們訓練了一個線性模型,用于對維基百科中用作參考的頁面與隨機抽樣頁面進行分類,并丟棄未被分類為參考文獻的頁面。 C4 [15%]。在探索性實驗中,我們觀察到使用多樣的預處理CommonCrawl數(shù)據(jù)集可以提高性能。因此,我們在我們的數(shù)據(jù)中包括了公開可用的C4數(shù)據(jù)集(Raffel等,2020年)。C4的預處理也包括去重和語言識別步驟:與CCNet的主要區(qū)別在于質(zhì)量過濾,主要依靠標點符號的存在或網(wǎng)頁中的單詞和句子數(shù)量。 |
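下面用一個示意性的Python草圖幫助理解上文提到的“行級去重+fastText語言識別”這類過濾步驟。這只是按正文思路寫的簡化示意,并非CCNet流水線的真實代碼;其中 lid.176.bin 為fastText官方發(fā)布的語言識別模型,置信度閾值0.5是假設(shè)值:

```python
# 簡化示意:行級精確去重 + fastText 語言識別,僅保留英語行(非 CCNet 官方?qū)崿F(xiàn))
import hashlib
import fasttext

lang_model = fasttext.load_model("lid.176.bin")  # fastText 語言識別模型(需自行下載)

def keep_english_lines(lines, lang_threshold=0.5):
    seen = set()
    kept = []
    for line in lines:
        h = hashlib.sha1(line.strip().encode("utf-8")).hexdigest()
        if h in seen:                      # 行級精確去重
            continue
        seen.add(h)
        labels, probs = lang_model.predict(line.replace("\n", " "))
        if labels[0] == "__label__en" and probs[0] >= lang_threshold:
            kept.append(line)              # 僅保留被判定為英語且置信度達標的行
    return kept
```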
Github [4.5%]. We use the public GitHub dataset available on Google BigQuery. We only kept projects that are distributed under the Apache, BSD and MIT licenses. Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions. Finally, we deduplicate the resulting dataset at the file level, with exact matches. Wikipedia [4.5%]. We add Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments and other formatting boilerplate. Gutenberg and Books3 [4.5%]. We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models. We perform deduplication at the book level, removing books with more than 90% content overlap. |
GitHub [4.5%]。我們使用Google BigQuery上公開可用的GitHub數(shù)據(jù)集。我們只保留按Apache、BSD和MIT許可證分發(fā)的項目。此外,我們使用基于行長度或包含字母數(shù)字字符比例的啟發(fā)式方法過濾低質(zhì)量文件,并使用正則表達式刪除諸如標題之類的樣板文件。最后,我們使用完全匹配在文件級別進行數(shù)據(jù)去重。 Wikipedia [4.5%]。我們添加了2022年6月至8月期間的維基百科轉(zhuǎn)儲,涵蓋20種使用拉丁字母或西里爾字母的語言:bg、ca、cs、da、de、en、es、fr、hr、hu、it、nl、pl、pt、ro、ru、sl、sr、sv、uk。我們對數(shù)據(jù)進行處理,刪除超鏈接、注釋和其他格式樣板。 Gutenberg和Books3 [4.5%]。我們的訓練數(shù)據(jù)集中包括兩個圖書語料庫:Guten- berg計劃中包含的公共領(lǐng)域圖書,以及ThePile(Gao等,2020年)的Books3部分,這是一個用于訓練大型語言模型的公開可用數(shù)據(jù)集。我們對書籍進行了去重處理,刪除了內(nèi)容重疊超過90%的書籍。 |
ArXiv [2.5%]. We process arXiv LaTeX files to add scientific data to our dataset. Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers. Stack Exchange [2%]. We include a dump of Stack Exchange, a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry. We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score (from highest to lowest). |
ArXiv [2.5%]。我們處理arXiv的LaTeX文件,以向我們的數(shù)據(jù)集添加科學數(shù)據(jù)。根據(jù)Lewkowycz等(2022年)的方法,我們刪除了第一節(jié)之前的所有內(nèi)容以及參考文獻部分。我們還從.tex文件中刪除了注釋,并對用戶編寫的內(nèi)聯(lián)擴展定義和宏進行了展開,以增加論文之間的一致性。 Stack Exchange [2%]。我們包括Stack Exchange的轉(zhuǎn)儲,這是一個高質(zhì)量問題和回答的網(wǎng)站,涵蓋了從計算機科學到化學的各種領(lǐng)域。我們保留了最大的28個網(wǎng)站的數(shù)據(jù),從文本中刪除了HTML標記,并按得分(從高到低)對答案進行排序。 |
分詞器:采用字節(jié)對編碼(BPE)算法并借助SentencePiece工具庫實現(xiàn)
- 分詞算法: 使用了字節(jié)對編碼(BPE)算法,該算法由Sennrich等人于2015年提出,并借助了SentencePiece工具庫的實現(xiàn)(Kudo和Richardson,2018年)。
- 數(shù)字處理: 在分詞過程中,將所有數(shù)字拆分為獨立的數(shù)字,并對未知的UTF-8字符進行字節(jié)級的分解。
- 技術(shù)細節(jié): 采用BPE算法有助于處理多樣性的語言數(shù)據(jù),而將數(shù)字拆分和字節(jié)級分解則有助于更好地捕捉語言中的細微差異和結(jié)構(gòu)。
Tokenizer. We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters. |
分詞器。我們使用字節(jié)對編碼(BPE)算法(Sennrich等,2015年)對數(shù)據(jù)進行分詞,使用SentencePiece(Kudo和Richardson,2018年)的實現(xiàn)。值得注意的是,我們將所有數(shù)字拆分為單獨的數(shù)字,并回退到字節(jié)級來分解未知的UTF-8字符。 |
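作為理解輔助,下面給出一個示意性草圖:用SentencePiece訓練BPE分詞器,并開啟“數(shù)字逐位拆分”與“未知UTF-8字符回退到字節(jié)”兩個選項,對應上文的描述。其中語料路徑corpus.txt、詞表大小32000等均為示例性假設(shè),并非論文公布的訓練配置:

```python
# 簡化示意:SentencePiece BPE 分詞器,啟用數(shù)字逐位拆分與字節(jié)回退
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",            # 訓練語料(假設(shè)的路徑)
    model_prefix="llama_like_bpe",
    vocab_size=32000,              # 示例詞表大小
    model_type="bpe",
    split_digits=True,             # 將所有數(shù)字拆分為單個數(shù)字
    byte_fallback=True,            # 未知 UTF-8 字符分解為字節(jié)
)

sp = spm.SentencePieceProcessor(model_file="llama_like_bpe.model")
print(sp.encode("In 2023 we trained on 1.4T tokens.", out_type=str))
```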
整體訓練數(shù)據(jù)集及其處理方式:分詞后包含大約1.4萬億個標記(1.4T),僅對維基百科和圖書領(lǐng)域的數(shù)據(jù)進行了大約2個epoch的迭代訓練
- 訓練數(shù)據(jù)規(guī)模: 整個訓練數(shù)據(jù)集在分詞后包含大約1.4萬億個標記(tokens)。
- 標記使用頻率: 在大多數(shù)訓練數(shù)據(jù)中,每個標記在訓練期間僅使用一次。這意味著每個文本標記都被模型考慮了一次。
- 特殊情況處理: 例外情況是對維基百科和圖書領(lǐng)域的數(shù)據(jù),針對這兩個領(lǐng)域,進行了大約2個epoch的迭代訓練。這意味著模型對于這兩個領(lǐng)域的數(shù)據(jù)進行了更深入的學習,以更好地理解其中的語義和結(jié)構(gòu)。
Overall, our entire training dataset contains roughly 1.4T tokens after tokenization. For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs. |
總體而言,我們整個訓練數(shù)據(jù)集在分詞后包含大約1.4萬億個標記。對于我們的大多數(shù)訓練數(shù)據(jù),每個標記在訓練過程中僅使用一次,除了維基百科和圖書領(lǐng)域,我們對其進行了大約兩個時期的訓練。 |
2.2、Architecture
Table 2: Model sizes, architectures, and optimization hyper-parameters.模型大小、架構(gòu)和優(yōu)化超參數(shù)。
模型架構(gòu):
- 基于transformer架構(gòu),但引入了一些改進:
- 采用預正則化(Pre-normalization)來提高訓練穩(wěn)定性。
- 使用SwiGLU激活函數(shù)替代ReLU,以提高性能。
- 移除絕對位置嵌入,使用Rotary Positional Embeddings。
Following recent work on large language models, our network is based on the transformer architecture (Vaswani et al., 2017). We leverage various improvements that were subsequently proposed, and used in different models such as PaLM. Here are the main differences with the original architecture, and where we found the inspiration for this change (in brackets): |
在最近關(guān)于大型語言模型的研究中,我們的網(wǎng)絡(luò)基于Transformer架構(gòu)(Vaswani等,2017年)。我們利用了后來提出的各種改進方法,這些方法在不同模型中被使用,如PaLM。以下是與原始架構(gòu)的主要不同之處以及我們對此變化的啟示(方括號內(nèi)): |
Pre-normalization [GPT3]. To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output. We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019). SwiGLU activation function [PaLM]. We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020) to improve the performance. We use a dimension of 2/3·4d instead of 4d as in PaLM. Rotary Embeddings [GPTNeo]. We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the network. |
預歸一化 [GPT3]:為了改善訓練穩(wěn)定性,我們對每個Transformer子層的輸入進行歸一化,而不是對輸出進行歸一化。我們使用由Zhang和Sennrich(2019年)引入的RMSNorm歸一化函數(shù)。 SwiGLU激活函數(shù) [PaLM]:我們將ReLU非線性激活函數(shù)替換為SwiGLU激活函數(shù),該函數(shù)由Shazeer(2020年)引入以提高性能。與PaLM不同的是,我們使用 2/3·4d 的維度,而不是PaLM中的 4d。 旋轉(zhuǎn)位置嵌入 [GPTNeo]:我們移除了絕對位置嵌入,并在網(wǎng)絡(luò)的每一層添加了旋轉(zhuǎn)位置嵌入(RoPE),這是由Su等(2021年)引入的。 |
The details of the hyper-parameters for our dif-ferent models are given in Table 2. |
有關(guān)我們不同模型的超參數(shù)詳細信息,請參見表2。 |
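為幫助理解上述架構(gòu)改動,下面給出一個基于PyTorch的最小化示意(并非LLaMA官方?qū)崿F(xiàn)):RMSNorm預歸一化與SwiGLU前饋層,隱藏維按正文的 2/3·4d 取值;eps等數(shù)值為常見默認值(假設(shè)),RoPE在此省略:

```python
# 最小化示意:RMSNorm 預歸一化 + SwiGLU 前饋層(非官方?qū)崿F(xiàn))
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm:只用均方根做縮放,不減均值(Zhang & Sennrich, 2019)。"""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    """SwiGLU 前饋層:silu(x·W1) ⊙ (x·W3) 再投影回 dim;隱藏維取 2/3·4d。"""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)      # 2/3 · 4d
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# 預歸一化的用法:先歸一化輸入,再過子層,最后殘差相加
x = torch.randn(2, 16, 512)                # (batch, seq, dim),數(shù)值僅作演示
norm, ffn = RMSNorm(512), SwiGLUFeedForward(512)
out = x + ffn(norm(x))
```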
2.3 Optimizer
Figure 1: Training loss over train tokens for the 7B, 13B, 33B, and 65B models. LLaMA-33B and LLaMA-65B were trained on 1.4T tokens. The smaller models were trained on 1.0T tokens. All models are trained with a batch size of 4M tokens.——7B、13B、33B和65B模型在訓練標記上的訓練損失。LLaMA-33B和LLaMA-65B在1.4T標記上進行訓練。較小的模型在1.0T標記上進行訓練。所有模型的批次大小均為4M標記。
優(yōu)化器:
- 使用AdamW優(yōu)化器,設(shè)定一系列超參數(shù),包括學習率、權(quán)重衰減、梯度剪裁等。
- 采用余弦學習率調(diào)度,具有漸變的學習率和批次大小隨模型大小變化而變化的特性。
Our models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyper-parameters: β1 = 0.9, β2 = 0.95. We use a cosine learning rate schedule, such that the final learning rate is equal to 10% of the maximal learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0. We use 2,000 warmup steps, and vary the learning rate and batch size with the size of the model (see Table 2 for details). |
我們使用AdamW優(yōu)化器(Loshchilov和Hutter,2017年)進行模型訓練,使用以下超參數(shù):β1 = 0.9,β2 = 0.95。我們使用余弦學習率調(diào)度,使最終學習率等于最大學習率的10%。我們使用權(quán)重衰減0.1和梯度裁剪1.0。我們使用2,000個預熱步驟,并根據(jù)模型的大小調(diào)整學習率和批次大?。ㄔ斠姳?)。 |
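按上述超參數(shù),可以用PyTorch搭出一個示意性的優(yōu)化器與學習率調(diào)度草圖;其中model、峰值學習率3e-4、總步數(shù)等均為示例性假設(shè),并非論文的實際訓練代碼:

```python
# 簡化示意:AdamW(β1=0.9, β2=0.95, weight_decay=0.1)+ 2000 步預熱的余弦學習率調(diào)度
import math
import torch

model = torch.nn.Linear(512, 512)          # 占位模型(假設(shè))
max_lr, total_steps, warmup_steps = 3e-4, 100_000, 2_000

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    """線性預熱 2000 步,之后余弦衰減,最終學習率為峰值的 10%。"""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(10):                     # 訓練循環(huán)骨架(僅示意)
    loss = model(torch.randn(4, 512)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # 梯度裁剪 1.0
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```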
2.4 Efficient implementation高效實現(xiàn)
Table 3: Zero-shot performance on Common Sense Reasoning tasks.常識推理任務上的零樣本性能。
高效實現(xiàn):
- 通過實現(xiàn)高效的因果多頭注意力機制來降低內(nèi)存使用和運行時。
- 使用checkpoint檢查點技術(shù)減少在反向傳播期間需要重新計算的激活數(shù)量,以提高訓練效率。
- 實現(xiàn)模型和序列的并行計算,通過減少模型內(nèi)存使用,利用多GPU并行計算,最大程度上減少GPU間的通信。
性能和訓練速度:采用了2048個A100-80GB GPU,訓練1.4T標記的數(shù)據(jù)集,耗時21天
- 在訓練65B參數(shù)模型時,每秒每GPU處理約380個標記,使用2048個A100 GPU,每個GPU有80GB RAM。
- 完成包含1.4T標記的數(shù)據(jù)集的訓練大約需要21天。
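根據(jù)上面兩條數(shù)據(jù)可以粗略核算訓練時長(忽略故障重啟等開銷):

```latex
380\ \tfrac{\text{tokens}}{\text{s}\cdot\text{GPU}} \times 2048\ \text{GPU} \times 86400\ \tfrac{\text{s}}{\text{day}}
\approx 6.7\times 10^{10}\ \tfrac{\text{tokens}}{\text{day}},
\qquad
\frac{1.4\times 10^{12}\ \text{tokens}}{6.7\times 10^{10}\ \text{tokens/day}} \approx 21\ \text{days}
```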
We make several optimizations to improve the training speed of our models. First, we use an efficient implementation of the causal multi-head attention to reduce memory usage and runtime. This implementation, available in the xformers library, is inspired by Rabe and Staats (2021) and uses the backward from Dao et al. (2022). This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task. |
我們進行了幾項優(yōu)化來提高模型的訓練速度。首先,我們使用了一種高效的因果多頭注意力實現(xiàn),以減少內(nèi)存使用和運行時間。這個實現(xiàn)在xformers庫中可用,受到了Rabe和Staats(2021年)的啟發(fā),并使用了Dao等人(2022年)的反向傳播。通過不存儲注意力權(quán)重和不計算由于語言建模任務的因果性質(zhì)而被屏蔽的鍵/查詢得分,實現(xiàn)了這一點。 |
To further improve training efficiency, we reduced the amount of activations that are recomputed during the backward pass with checkpointing. More precisely, we save the activations that are expensive to compute, such as the outputs of linear layers. This is achieved by manually implementing the backward function for the transformer layers, instead of relying on the PyTorch autograd. To fully benefit from this optimization, we need to reduce the memory usage of the model by using model and sequence parallelism, as described by Korthikanti et al. (2022). Moreover, we also overlap the computation of activations and the communication between GPUs over the network (due to all_reduce operations) as much as possible. When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. |
為了進一步提高訓練效率,我們通過檢查點技術(shù)減少了在反向傳播過程中重新計算的激活數(shù)量。具體而言,我們保存了昂貴計算的激活,如線性層的輸出。這是通過手動實現(xiàn)Transformer層的反向傳播函數(shù)來實現(xiàn)的,而不是依賴于PyTorch的autograd。為了充分利用這種優(yōu)化,我們需要使用模型和序列并行來減少模型的內(nèi)存使用,正如Korthikanti等人(2022年)所描述的。此外,我們還盡可能地重疊計算激活和在網(wǎng)絡(luò)上進行的GPU之間的通信(由于all_reduce操作)。 當訓練一個擁有650億參數(shù)的模型時,我們的代碼在擁有80GB內(nèi)存的2048個A100 GPU上每秒處理約380個標記。這意味著在包含1.4萬億標記的數(shù)據(jù)集上進行訓練大約需要21天。 |
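下面是一個概念性的Python草圖,演示正文提到的兩類工程優(yōu)化的用法:xformers的高效因果多頭注意力,以及用激活檢查點減少反向傳播時保存的激活。張量形狀、模塊劃分均為假設(shè),需要安裝xformers并有GPU才能運行,并非LLaMA的訓練代碼:

```python
# 概念演示:xformers 因果注意力 + 激活檢查點(假設(shè)環(huán)境中已安裝 xformers 且有 GPU)
import torch
from torch.utils.checkpoint import checkpoint
import xformers.ops as xops

B, S, H, D = 2, 1024, 8, 64                # batch、序列長、頭數(shù)、每頭維度(假設(shè)值)
q = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16, requires_grad=True)
v = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16, requires_grad=True)

# 因果(下三角)掩碼的 memory-efficient attention:
# 不顯式存儲注意力權(quán)重,也不計算被因果掩碼屏蔽掉的鍵/查詢得分
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

# 激活檢查點:前向時不保存該子層的中間激活,反向傳播時再重算
expensive_block = torch.nn.Sequential(
    torch.nn.Linear(H * D, 4 * H * D),
    torch.nn.GELU(),
    torch.nn.Linear(4 * H * D, H * D),
).cuda().half()

x = out.reshape(B, S, H * D)
y = checkpoint(expensive_block, x, use_reentrant=False)
```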
3 Main results主要結(jié)果
任務類型和基準測試:
- 作者考慮了零樣本和少樣本任務,并在總共20個基準測試上報告了結(jié)果。
- 零樣本任務中,模型通過提供開放性生成的答案或?qū)μ嶙h答案進行排名來回答任務。
- 少樣本任務中,模型通過提供任務的少量示例(1到64個)和一個測試示例來生成答案或?qū)Σ煌x項進行排名。
Following previous work (Brown et al., 2020), we consider zero-shot and few-shot tasks, and report results on a total of 20 benchmarks: |
我們遵循以前的工作(Brown等,2020年),考慮了零樣本和少樣本任務,并在總共20個基準測試中報告結(jié)果: |
與其他模型的比較:
- 與其他基準模型進行比較,包括GPT-3、Gopher、Chinchilla、PaLM、OPT、GPT-J、GPT-Neo等。
- 在不同任務和基準測試中對LLaMA進行了性能評估。
We compare LLaMA with other foundation mod-els, namely the non-publicly available language models GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022) and PaLM (Chowdhery et al., 2022), as well as the open-sourced OPT models (Zhang et al., 2022), GPT-J (Wang and Komatsuzaki, 2021), and GPT-Neo (Black et al., 2022). In Section 4, we also briefly compare LLaMA with instruction-tuned models such as OPT-IML (Iyer et al., 2022) and Flan-PaLM (Chung et al., 2022). We evaluate LLaMA on free-form generation tasks and multiple choice tasks. In the multiple choice tasks, the objective is to select the most appropriate completion among a set of given op-tions, based on a provided context. We select the completion with the highest likelihood given the provided context. We follow Gao et al. (2021) and use the likelihood normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which we fol-low Brown et al. (2020), and select a completion based on the likelihood normalized by the likeli-hood of the completion given “Answer:” as context: P (completion|context)/P (completion|“Answer:”). |
我們將LLaMA與其他基礎(chǔ)模型進行比較,包括非公開可用的語言模型GPT-3、Gopher、Chinchilla和PaLM,以及開源的OPT模型、GPT-J和GPT-Neo。在第4節(jié)中,我們還簡要比較了LLaMA與OPT-IML和Flan-PaLM等針對指令進行調(diào)整的模型。 我們在自由形式生成任務和多項選擇任務上評估LLaMA。在多項選擇任務中,目標是根據(jù)提供的上下文從給定選項中選擇最合適的完成。我們選擇在給定上下文中具有最高可能性的完成。我們遵循Gao等人(2021年)的方法,使用由完成的字符數(shù)進行歸一化的可能性作為評估指標,但對于某些數(shù)據(jù)集(OpenBookQA、BoolQ),我們按照Brown等人(2020年)的方法,根據(jù)在給定上下文"Answer:"的情況下完成的可能性與完成的可能性的比值進行選擇:P(完成|上下文)/P(完成|"Answer:")。 |
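為幫助理解上述多項選擇打分方式,下面給出一個示意性草圖:用因果語言模型計算每個候選補全在給定上下文下的對數(shù)似然,再按字符數(shù)歸一化后取最大者。模型名gpt2僅作演示,token邊界的處理也做了簡化;對OpenBookQA/BoolQ使用的 P(補全|上下文)/P(補全|"Answer:") 變體,只需把歸一化分母換成以"Answer:"為上下文的對數(shù)似然:

```python
# 簡化示意:按字符數(shù)歸一化的對數(shù)似然來給候選補全打分(模型與數(shù)據(jù)均為演示用)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(context: str, completion: str) -> float:
    """log P(completion | context):只累加補全部分 token 的對數(shù)概率(邊界處理做了簡化)。"""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    total = 0.0
    for pos in range(ctx_len, full_ids.shape[1]):
        total += logprobs[pos - 1, full_ids[0, pos]].item()
    return total

def choose(context: str, choices: list) -> str:
    scores = [completion_logprob(context, c) / len(c) for c in choices]  # 按字符數(shù)歸一化
    return max(zip(scores, choices))[1]

print(choose("The capital of France is", [" Paris.", " London.", " Berlin."]))
```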
總結(jié):常識推理、封閉書籍問答、閱讀理解、數(shù)學推理、代碼生成等任務的性能評估:
- 在常識推理任務中,LLaMA-65B在多個基準測試上超越Chinchilla-70B和PaLM-540B,LLaMA-13B也在大多數(shù)基準測試中勝過GPT-3。
- 在封閉書籍問答任務中,LLaMA在Natural Questions和TriviaQA基準測試中在零樣本和少樣本設(shè)置下均取得了領(lǐng)先的性能。
- 在閱讀理解任務中,LLaMA-65B與PaLM-540B競爭激烈,LLaMA-13B在某些基準測試中超過GPT-3。
- 在數(shù)學推理任務中,LLaMA在MATH和GSM8k基準測試中與PaLM和Minerva相比表現(xiàn)出色。
- 在代碼生成任務中,LLaMA在HumanEval和MBPP基準測試中的性能超過了LaMDA和PaLM。
- 在大規(guī)模多任務語言理解基準測試中,LLaMA-65B在一些領(lǐng)域上略遜于Chinchilla-70B和PaLM-540B。
Table 4: NaturalQuestions. Exact match performance.
3.1 Common Sense Reasoning常識推理
We consider eight standard common sense rea-soning benchmarks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019),HellaSwag (Zellers et al., 2019), WinoGrande (Sak-aguchi et al., 2021), ARC easy and challenge (Clark et al., 2018) and OpenBookQA (Mihaylov et al., 2018). These datasets include Cloze and Winograd style tasks, as well as multiple choice question an-swering. We evaluate in the zero-shot setting as done in the language modeling community. In Table 3, we compare with existing models of various sizes and report numbers from the cor-responding papers. First, LLaMA-65B outper-forms Chinchilla-70B on all reported benchmarks but BoolQ. Similarly, this model surpasses PaLM- 540B everywhere but on BoolQ and WinoGrande. LLaMA-13B model also outperforms GPT-3 on most benchmarks despite being 10× smaller. |
我們考慮了八個常識推理基準測試:BoolQ、PIQA、SIQA、HellaSwag、WinoGrande、ARC easy和challenge以及OpenBookQA。這些數(shù)據(jù)集包括Cloze和Winograd風格的任務,以及多項選擇題。我們按照語言建模社區(qū)的做法,在零樣本設(shè)置下進行評估。 在表3中,我們與各種規(guī)模的現(xiàn)有模型進行比較,并報告了相應論文中的數(shù)據(jù)。首先,LLaMA-65B在除了BoolQ以外的所有基準測試中都優(yōu)于Chinchilla-70B。同樣,除了BoolQ和WinoGrande以外,該模型也超過了PaLM-540B。LLaMA-13B模型盡管參數(shù)規(guī)模僅為GPT-3的約1/10,但在大多數(shù)基準測試中表現(xiàn)優(yōu)于GPT-3。 |
3.2 Closed-book Question Answering閉書式問答
We compare LLaMA to existing large language models on two closed-book question answering benchmarks: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). For both benchmarks, we report exact match perfor-mance in a closed book setting, i.e., where the mod-els do not have access to documents that contain evidence to answer the question. In Table 4, we report performance on NaturalQuestions, and in Ta-ble 5, we report on TriviaQA. On both benchmarks, LLaMA-65B achieve state-of-the-arts performance in the zero-shot and few-shot settings. More im-portantly, the LLaMA-13B is also competitive on these benchmarks with GPT-3 and Chinchilla, de-spite being 5-10× smaller. This model runs on a single V100 GPU during inference. |
我們將LLaMA與現(xiàn)有的大型語言模型在兩個閉書式問答基準測試上進行比較:Natural Questions和TriviaQA。對于這兩個基準測試,我們報告了在閉書設(shè)置下的精確匹配性能,即模型無法訪問包含回答問題所需證據(jù)的文檔。在表4中,我們報告了在Natural Questions上的性能,而在表5中,我們報告了在TriviaQA上的性能。在這兩個基準測試中,LLaMA-65B在零樣本和少樣本設(shè)置下達到了最先進的性能。更重要的是,盡管規(guī)模小5~10倍,LLaMA-13B在這些基準測試上與GPT-3和Chinchilla相比也具有競爭力,并且該模型在推理時可以在單個V100 GPU上運行。 |
Table 5: TriviaQA. Zero-shot and few-shot exact match performance on the filtered dev set.
3.3 Reading Comprehension閱讀理解
We evaluate our models on the RACE reading comprehension benchmark (Lai et al., 2017). This dataset was collected from English reading comprehension exams designed for middle and high school Chinese students. We follow the evaluation setup from Brown et al. (2020) and report results in Table 6. On these benchmarks, LLaMA-65B is competitive with PaLM-540B, and, LLaMA-13B outperforms GPT-3 by a few percents. |
我們在RACE閱讀理解基準測試上評估了我們的模型。這個數(shù)據(jù)集是從為中國初中和高中學生設(shè)計的英語閱讀理解考試中收集的。我們按照Brown等人(2020年)的評估設(shè)置進行評估,并在表6中報告結(jié)果。在這些基準測試中,LLaMA-65B與PaLM-540B具有競爭力,而LLaMA-13B比GPT-3高出幾個百分點。 |
Table 6: Reading Comprehension. Zero-shot accuracy.
3.4 Mathematical reasoning數(shù)學推理
We evaluate our models on two mathematical rea-soning benchmarks: MATH (Hendrycks et al., 2021) and GSM8k (Cobbe et al., 2021). MATH is a dataset of 12K middle school and high school mathematics problems written in LaTeX. GSM8k is a set of middle school mathematical problems. In Table 7, we compare with PaLM and Min-erva (Lewkowycz et al., 2022). Minerva is a series of PaLM models finetuned on 38.5B tokens ex-tracted from ArXiv and Math Web Pages, while neither PaLM or LLaMA are finetuned on mathe-matical data. The numbers for PaLM and Minerva are taken from Lewkowycz et al. (2022), and we compare with and without maj1@k. maj1@k de-notes evaluations where we generate k samples for each problem and perform a majority voting (Wang et al., 2022). On GSM8k, we observe that LLaMA- 65B outperforms Minerva-62B, although it has not been fine-tuned on mathematical data. |
我們在兩個數(shù)學推理基準測試上評估了我們的模型:MATH(Hendrycks等人,2021年)和GSM8k(Cobbe等人,2021年)。MATH是一個包含12,000個初中和高中數(shù)學問題的數(shù)據(jù)集,使用LaTeX編寫。GSM8k是一組初中數(shù)學問題。在表7中,我們與PaLM和Minerva(Lewkowycz等人,2022年)進行了比較。Minerva是一系列在從ArXiv和數(shù)學網(wǎng)頁中提取的385億(38.5B)個標記上微調(diào)的PaLM模型,而PaLM和LLaMA都沒有在數(shù)學數(shù)據(jù)上進行微調(diào)。PaLM和Minerva的數(shù)據(jù)來自Lewkowycz等人(2022年),我們比較了有無maj1@k的結(jié)果。maj1@k表示我們?yōu)槊總€問題生成k個樣本,并進行多數(shù)投票(Wang等人,2022年)的評估。在GSM8k上,我們觀察到LLaMA-65B優(yōu)于Minerva-62B,盡管它沒有在數(shù)學數(shù)據(jù)上進行微調(diào)。 |
Table 7: Model performance on quantitative reasoning datasets. For majority voting, we use the same setup as Minerva, with k = 256 samples for MATH and k = 100 for GSM8k (Minerva 540B uses k = 64 for MATH and k = 40 for GSM8k). LLaMA-65B outperforms Minerva 62B on GSM8k, although it has not been fine-tuned on mathematical data.
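maj1@k(多數(shù)投票)的基本流程可以用如下示意性草圖表示:對同一道題采樣k個解答,抽取每個解答的最終答案,取出現(xiàn)次數(shù)最多者。其中generate_solution與extract_final_answer均為假設(shè)的輔助函數(shù),僅示意流程:

```python
# 簡化示意:maj1@k 多數(shù)投票(generate_solution 為假設(shè)的采樣函數(shù))
from collections import Counter

def extract_final_answer(solution: str) -> str:
    """示例性抽取:取解答最后一行作為最終答案(真實評測會用更嚴格的解析)。"""
    return solution.strip().splitlines()[-1].strip()

def majority_vote(problem: str, k: int, generate_solution) -> str:
    answers = [extract_final_answer(generate_solution(problem)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]   # 票數(shù)最多的答案
```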
3.5 Code generation代碼生成
We evaluate the ability of our models to write code from a natural language description on two benchmarks: HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). For both tasks, the model receives a description of the program in a few sentences, as well as a few input-output examples. In HumanEval, it also receives a function signature, and the prompt is formatted as natural code with the textual description and tests in a docstring. The model needs to generate a Python program that fits the description and satisfies the test cases. In Table 8, we compare the pass@1 scores of our models with existing language models that have not been finetuned on code, namely PaLM and LaMDA (Thoppilan et al., 2022). PaLM and LLaMA were trained on datasets that contain a similar number of code tokens. |
我們在兩個代碼生成基準測試上評估我們的模型對于從自然語言描述中生成代碼的能力:HumanEval(Chen等人,2021年)和MBPP(Austin等人,2021年)。對于這兩個任務,模型接收到一段程序的描述,包括幾個輸入-輸出示例。在HumanEval中,它還會接收到一個函數(shù)簽名,而提示文本的格式是自然代碼,其中包含了文本描述和測試用例。模型需要生成一個符合描述并滿足測試用例的Python程序。在表8中,我們將我們的模型的pass@1得分與未在代碼上進行微調(diào)的現(xiàn)有語言模型進行了比較,包括PaLM和LaMDA(Thoppilan等人,2022年)。PaLM和LLaMA都是在包含相似數(shù)量的代碼標記的數(shù)據(jù)集上進行訓練的。 |
As shown in Table 8, for a similar number of parameters, LLaMA outperforms other general models such as LaMDA and PaLM, which are not trained or finetuned specifically for code. LLaMA with 13B parameters and more outperforms LaMDA 137B on both HumanEval and MBPP. LLaMA 65B also outperforms PaLM 62B, even when it is trained longer. The pass@1 results reported in this table were obtained by sampling with temperature 0.1. The pass@100 and pass@80 metrics were obtained with temperature 0.8. We use the same method as Chen et al. (2021) to obtain unbiased estimates of the pass@k. |
如表8所示,對于相似數(shù)量的參數(shù),LLaMA優(yōu)于其他通用模型,如LaMDA和PaLM,這些模型沒有專門針對代碼進行訓練或微調(diào)。LLaMA擁有13B參數(shù)及以上,在HumanEval和MBPP上的表現(xiàn)優(yōu)于LaMDA 137B。即使在訓練時間更長的情況下,LLaMA 65B也優(yōu)于PaLM 62B。表中報告的pass@1結(jié)果是在溫度為0.1的情況下采樣得到的。pass@100和pass@80指標是在溫度為0.8的情況下獲得的。我們使用與Chen等人(2021年)相同的方法來獲得pass@k的無偏估計。 |
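正文提到按Chen et al. (2021)的方法計算pass@k的無偏估計,其常見寫法如下(數(shù)值上等價于 1 − C(n−c, k)/C(n, k),其中n為每題采樣的程序數(shù),c為通過全部測試的程序數(shù));示例中的n、c取值僅為演示:

```python
# pass@k 無偏估計(Chen et al., 2021 中給出的估計量)
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k) / C(n, k) 的數(shù)值穩(wěn)定寫法。"""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# 例如:某題采樣 n=200 個程序,其中 c=30 個通過全部測試
print(pass_at_k(200, 30, 1), pass_at_k(200, 30, 100))
```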
It is possible to improve the performance on code by finetuning on code-specific tokens. For instance, PaLM-Coder (Chowdhery et al., 2022) increases the pass@1 score of PaLM on HumanEval from 26.2% for PaLM to 36%. Other models trained specifically for code also perform better than gen-eral models on these tasks (Chen et al., 2021; Ni-jkamp et al., 2022; Fried et al., 2022). Finetuning on code tokens is beyond the scope of this paper. |
通過在代碼相關(guān)標記上進行微調(diào)可以進一步提高代碼生成的性能。例如,PaLM-Coder(Chowdhery等人,2022年)將PaLM在HumanEval上的pass@1得分從26.2%提高到36%。專門針對代碼進行訓練的其他模型在這些任務上也表現(xiàn)優(yōu)于通用模型(Chen等人,2021年;Nijkamp等人,2022年;Fried等人,2022年)。不過,在代碼標記上進行微調(diào)超出了本文的討論范圍。 |
Table 8: Model performance for code generation. We report the pass@ score on HumanEval and MBPP. HumanEval generations are done in zero-shot and MBPP with 3-shot prompts similar to Austin et al. (2021). The values marked with ∗ are read from figures in Chowdhery et al. (2022).代碼生成的模型性能。我們報告了在HumanEval和MBPP上的pass@分數(shù)。HumanEval的生成在零樣本設(shè)置下進行,MBPP則使用與Austin et al. (2021)類似的3-shot提示。標有∗的值是從Chowdhery et al. (2022)的圖表中讀取的。
3.6 Massive Multitask Language Understanding大規(guī)模多任務語言理解
The massive multitask language understanding benchmark, or MMLU, introduced by Hendrycks et al. (2020) consists of multiple choice questions covering various domains of knowledge, includ-ing humanities, STEM and social sciences. We evaluate our models in the 5-shot setting, using the examples provided by the benchmark, and report results in Table 9. On this benchmark, we observe that the LLaMA-65B is behind both Chinchilla- 70B and PaLM-540B by a few percent in average, and across most domains. A potential explanation is that we have used a limited amount of books and academic papers in our pre-training data, i.e., ArXiv, Gutenberg and Books3, that sums up to only 177GB, while these models were trained on up to 2TB of books. This large quantity of books used by Gopher, Chinchilla and PaLM may also explain why Gopher outperforms GPT-3 on this benchmark, while it is comparable on other benchmarks. |
大規(guī)模多任務語言理解基準測試(MMLU),由Hendrycks等人(2020年)引入,包含涵蓋人文、STEM和社會科學等各個領(lǐng)域的多項選擇題。我們在5-shot設(shè)置下使用該基準測試提供的示例評估我們的模型,并在表9中報告結(jié)果。在這個基準測試中,LLaMA-65B在平均值上略遜于Chinchilla-70B和PaLM-540B,并且在大多數(shù)領(lǐng)域中也是如此。一個可能的解釋是,我們在預訓練數(shù)據(jù)中使用了有限數(shù)量的圖書和學術(shù)論文,即ArXiv、Gutenberg和Books3,總共只有177GB,而這些模型是在高達2TB的圖書上進行訓練的。這些由Gopher、Chinchilla和PaLM使用的大量圖書可能也解釋了為什么Gopher在這個基準測試中優(yōu)于GPT-3,而在其他基準測試中相當。 |
3.7 Evolution of performance during training訓練期間性能的演變
表現(xiàn)的追蹤情況
- 性能追蹤: 在訓練期間,對模型在一些問題回答和常識推理基準測試中的表現(xiàn)進行了追蹤。
- 圖表展示: 結(jié)果以圖表形式呈現(xiàn)在圖2中,展示了模型在不同基準測試中的性能。
- 性能趨勢: 大多數(shù)基準測試中,性能呈現(xiàn)穩(wěn)步提升,并與模型的訓練困惑度(見圖1)呈正相關(guān)。這表明隨著訓練的進行,模型對任務的理解和表現(xiàn)逐漸改善。
- 特殊情況: 有兩個基準測試(SIQA和WinoGrande)出現(xiàn)了特殊情況。在SIQA中,性能出現(xiàn)較大的變化,可能表明該基準測試不太可靠。而在WinoGrande中,性能與訓練困惑度的相關(guān)性不如其他測試明顯,LLaMA-33B和LLaMA-65B在訓練過程中表現(xiàn)相似。
During training, we tracked the performance of our models on a few question answering and common sense benchmarks, and report them in Figure 2. On most benchmarks, the performance improves steadily, and correlates with the training perplexity of the model (see Figure 1). The exceptions are SIQA and WinoGrande. Most notably, on SIQA, we observe a lot of variance in performance,that may indicate that this benchmark is not reliable. On WinoGrande, the performance does not correlate as well with training perplexity: the LLaMA-33B and LLaMA-65B have similar performance during the training. |
在訓練過程中,我們跟蹤了我們的模型在一些問答和常識基準測試上的性能,并在圖2中進行了報告。在大多數(shù)基準測試中,性能穩(wěn)步提升,并與模型的訓練困惑度相關(guān)(參見圖1)。SIQA和WinoGrande是例外。特別是在SIQA上,我們觀察到性能有很大的變化,這可能表明該基準測試不太可靠。在WinoGrande上,性能與訓練困惑度的相關(guān)性不太明顯:LLaMA-33B和LLaMA-65B在訓練期間的性能相似。 |
Table 9: Massive Multitask Language Understanding (MMLU). Five-shot accuracy.大規(guī)模多任務語言理解(MMLU)。5-shot準確率。
4 Instruction Finetuning指令微調(diào)
指導數(shù)據(jù)微調(diào):
- 通過在指導數(shù)據(jù)上進行微調(diào),LLaMA-65B在MMLU任務上迅速取得了性能提升。
- 即使LLaMA-65B的非微調(diào)版本已經(jīng)能夠遵循基本指導,微調(diào)仍然能夠顯著提高MMLU的性能,并增強模型遵循指導的能力。
微調(diào)實驗和結(jié)果:
- 在本文中,作者進行了一項微調(diào)實驗,命名為LLaMA-I,遵循了與Chung等人(2022年)相同的協(xié)議。
- 在表格10中,報告了LLaMA-I在MMLU上的結(jié)果,并與已有的中等規(guī)模的指導微調(diào)模型(OPT-IML和Flan-PaLM系列)進行了比較。
- 盡管這里使用的指導微調(diào)方法相對簡單,但在MMLU上達到了68.9%的性能。LLaMA-I(65B)在MMLU上的性能超過了已有的中等規(guī)模的指導微調(diào)模型,但仍遠未達到最先進水平。GPT code-davinci-002在MMLU上的最先進水平為77.4%(數(shù)據(jù)來自Iyer等人(2022年))。
性能詳細信息:
- 作者在附錄的表格16中提供了LLaMA-I在57個任務上的詳細性能信息。
In this section, we show that briefly finetuning on instructions data rapidly leads to improvements on MMLU. Although the non-finetuned version of LLaMA-65B is already able to follow basic in-structions, we observe that a very small amount of finetuning improves the performance on MMLU, and further improves the ability of the model to follow instructions. Since this is not the focus of this paper, we only conducted a single experiment following the same protocol as Chung et al. (2022) to train an instruct model, LLaMA-I. |
在本節(jié)中,我們展示了在指令數(shù)據(jù)上進行簡短微調(diào)會迅速改善在MMLU上的性能。盡管LLaMA-65B的非微調(diào)版本已經(jīng)能夠遵循基本的指令,但我們觀察到微調(diào)很少的量可以提高在MMLU上的性能,并進一步提高模型遵循指令的能力。由于這不是本文的重點,我們只進行了一個實驗,按照Chung等人(2022年)的協(xié)議訓練了一個指令模型LLaMA-I。 |
In Table 10, we report the results of our instruct model LLaMA-I on MMLU and compare with ex-isting instruction finetuned models of moderate sizes, namely, OPT-IML (Iyer et al., 2022) and the Flan-PaLM series (Chung et al., 2022). All the re-ported numbers are from the corresponding papers. Despite the simplicity of the instruction finetuning approach used here, we reach 68.9% on MMLU. LLaMA-I (65B) outperforms on MMLU existing instruction finetuned models of moderate sizes, but are still far from the state-of-the-art, that is 77.4 for GPT code-davinci-002 on MMLU (numbers taken from Iyer et al. (2022)). The details of the performance on MMLU on the 57 tasks can be found in Table 16 of the appendix. |
在表10中,我們報告了我們的指令模型LLaMA-I在MMLU上的結(jié)果,并與具有中等規(guī)模的現(xiàn)有指令微調(diào)模型OPT-IML(Iyer等人,2022年)和Flan-PaLM系列(Chung等人,2022年)進行了比較。所有報告的數(shù)據(jù)都來自相應的論文。盡管這里使用的指令微調(diào)方法相對簡單,我們在MMLU上達到了68.9%的準確率。LLaMA-I(65B)在MMLU上的表現(xiàn)優(yōu)于具有中等規(guī)模的現(xiàn)有指令微調(diào)模型,但仍遠遠落后于當前的最先進水平,即GPT code-davinci-002在MMLU上的準確率為77.4%(數(shù)據(jù)取自Iyer等人(2022年))。有關(guān)在57個任務上的MMLU性能細節(jié),請參見附錄的表16。 |
Table 10: Instruction finetuning – MMLU (5-shot). Comparison of models of moderate size with and with-out instruction finetuning on MMLU.
Figure 2: Evolution of performance on question answering and common sense reasoning during training.在訓練過程中的問答和常識推理性能演變。
5 Bias, Toxicity and Misinformation偏見、有害內(nèi)容和虛假信息
大型語言模型可能面臨的偏見、有毒性和虛假信息生成的問題,通過多個基準測試展示了LLaMA-65B在這些方面的表現(xiàn)。
- 偏見、有毒性和虛假信息: 大型語言模型存在重現(xiàn)和放大訓練數(shù)據(jù)中的偏見,生成有毒或冒犯性內(nèi)容的問題。
- 評估模型有毒性: 通過在不同基準測試上評估LLaMA-65B模型的毒性內(nèi)容生成和刻板印象檢測能力。使用RealToxicityPrompts基準測試,評估了模型在基于真實有毒提示的情況下的表現(xiàn),發(fā)現(xiàn)毒性隨模型規(guī)模增加而增加。
- CrowS-Pairs評估模型偏見: 使用CrowS-Pairs基準測試,測量了模型在9個類別中的偏見,包括性別、宗教、種族/膚色、性取向、年齡、國籍、殘疾、外貌和社會經(jīng)濟地位。LLaMA相對于GPT-3和OPT-175B在平均偏見上稍微有優(yōu)勢。
- WinoGender基準測試: 通過WinoGender基準測試進一步研究模型在性別方面的偏見。發(fā)現(xiàn)模型在“their/them/someone”代詞上的共參考消解性能明顯優(yōu)于“her/her/she”和“his/him/he”代詞,這可能表明存在性別偏見。
- TruthfulQA測量真實性: 使用TruthfulQA基準測試,評估模型辨別聲明真實性的能力。結(jié)果顯示LLaMA-65B在真實和信息量充足兩個類別上得分較高,但正確回答率仍然較低,表明該模型可能產(chǎn)生不準確的答案。
Large language models have been showed to re-produce and amplify biases that are existing in the training data (Sheng et al., 2019; Kurita et al., 2019), and to generate toxic or offensive con-tent (Gehman et al., 2020). As our training dataset contains a large proportion of data from the Web, we believe that it is crucial to determine the po-tential for our models to generate such content. To understand the potential harm of LLaMA-65B, we evaluate on different benchmarks that measure toxic content production and stereotypes detection. While we have selected some of the standard bench-marks that are used by the language model com-munity to indicate some of the issues with these models, these evaluations are not sufficient to fully understand the risks associated with these models. |
已經(jīng)有研究表明,大型語言模型能夠復制和放大訓練數(shù)據(jù)中存在的偏見(Sheng等人,2019年;Kurita等人,2019年),并生成有害或冒犯性的內(nèi)容(Gehman等人,2020年)。由于我們的訓練數(shù)據(jù)集包含大量來自互聯(lián)網(wǎng)的數(shù)據(jù),我們認為確定我們的模型生成此類內(nèi)容的潛力是至關(guān)重要的。為了了解LLaMA-65B的潛在危害,我們在衡量有害內(nèi)容生成和刻板印象檢測的不同基準測試上進行評估。雖然我們選擇了一些標準的基準測試,這些測試被語言模型社區(qū)用來指示這些模型存在的一些問題,但這些評估并不足以完全了解與這些模型相關(guān)的風險。 |
5.1 RealToxicityPrompts
Language models can generate toxic language, e.g., insults, hate speech or threats. There is a very large range of toxic content that a model can generate, making a thorough evaluation challenging. Several recent work (Zhang et al., 2022; Hoffmann et al., 2022) have considered the RealToxicityPrompts benchmark (Gehman et al., 2020) as an indicator of how toxic is their model. RealToxicityPrompts consists of about 100k prompts that the model must complete; then a toxicity score is automatically evaluated by making a request to PerspectiveAPI 3. We do not have control over the pipeline used by the third-party PerspectiveAPI, making comparison with previous models difficult. |
語言模型可以生成有害語言,例如侮辱、仇恨言論或威脅。模型可以生成的有害內(nèi)容范圍非常廣泛,這使得全面評估變得具有挑戰(zhàn)性。最近的一些研究(Zhang等人,2022年;Hoffmann等人,2022年)已將RealToxicityPrompts基準測試(Gehman等人,2020年)視為評估其模型有害性的指標。RealToxicityPrompts包含約10萬個模型必須完成的提示,然后通過向PerspectiveAPI發(fā)出請求自動評估其有害性分數(shù)。我們無法控制第三方PerspectiveAPI使用的流程,這使得與先前模型的比較變得困難。 |
For each of the 100k prompts, we greedily generate with our models, and measure their toxicity score. The score per prompt ranges from 0 (non-toxic) to 1 (toxic). In Table 11, we report our averaged score on basic and respectful prompt categories of RealToxicityPrompts. These scores are “comparable” with what we observe in the literature (e.g., 0.087 for Chinchilla) but the methodologies differ between these work and ours (in terms of sampling strategy, number of prompts and time of API). We observe that toxicity increases with the size of the model, especially for Respectful prompts. This was also observed in previous work (Zhang et al., 2022), with the notable exception of Hoffmann et al. (2022) where they do not see a difference between Chinchilla and Gopher, despite different sizes. This could be explained by the fact that the larger model, Gopher, has worse performance than Chinchilla, suggesting that the relation between toxicity and model size may only apply within a model family. |
對于這10萬個提示中的每個提示,我們使用我們的模型進行貪婪生成,并測量其有害性分數(shù)。每個提示的分數(shù)范圍從0(非有害)到1(有害)。在表11中,我們報告了我們在RealToxicityPrompts的基本和尊重提示類別上的平均分數(shù)。這些分數(shù)與我們在文獻中觀察到的結(jié)果“可比”(例如,Chinchilla的分數(shù)為0.087),但這些工作與我們的工作方法不同(在采樣策略、提示數(shù)量和API時間方面)。我們觀察到有害性隨著模型的大小增加而增加,特別是對于尊重提示。這也是以前的研究觀察到的現(xiàn)象(Zhang等人,2022年),但Hoffmann等人(2022年)的研究是一個值得注意的例外,他們沒有觀察到Chinchilla和Gopher之間的差異,盡管它們的大小不同。這可能可以解釋為較大的模型Gopher的性能比Chinchilla差,表明有害性和模型大小之間的關(guān)系可能僅適用于模型系列內(nèi)部。 |
Table 11: RealToxicityPrompts. We run a greedy decoder on the 100k prompts from this benchmark. The “respectful” versions are prompts starting with “Complete the following sentence in a polite, respectful, and unbiased manner:”, and “Basic” is without it. Scores were obtained using the PerspectiveAPI, with higher score indicating more toxic generations.RealToxicityPrompts。我們在這個基準測試的100,000個提示上運行貪婪解碼器。"尊重"版本是以"以禮貌、尊重和公正的方式完成以下句子:"開頭的提示,而"基本"則沒有。得分是使用PerspectiveAPI獲得的,得分越高表示生成的內(nèi)容越有毒。
5.2 CrowS-Pairs
We evaluate the biases in our model on the CrowS-Pairs (Nangia et al., 2020). This dataset allows to measure biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, dis-ability, physical appearance and socioeconomic sta-tus. Each example is composed of a stereotype and an anti-stereotype, we measure the model prefer-ence for the stereotypical sentence using the per-plexity of both sentences in a zero-shot setting. Higher scores thus indicate higher bias. We com-pare with GPT-3 and OPT-175B in Table 12. LLaMA compares slightly favorably to both models on average. Our model is particularly bi-ased in the religion category (+10% compared to OPT-175B), followed by age and gender. We ex-pect these biases to come from CommonCrawl de-spite multiple filtering steps. |
我們在CrowS-Pairs(Nangia等人,2020年)上評估了我們模型的偏見。該數(shù)據(jù)集可用于衡量9個類別的偏見:性別、宗教、種族/膚色、性取向、年齡、國籍、殘疾、外貌和社會經(jīng)濟地位。每個示例由一個刻板印象和一個反刻板印象組成,我們通過在零樣本設(shè)置中比較兩個句子的困惑度來衡量模型對刻板印象句子的偏好。較高的分數(shù)表示較高的偏見。我們在表12中與GPT-3和OPT-175B進行了比較。 總體而言,LLaMA在平均水平上略微優(yōu)于這兩個模型。我們的模型在宗教類別上表現(xiàn)出較大的偏見(相比于OPT-175B,增加了10%),其次是年齡和性別。我們認為這些偏見可能來自于CommonCrawl,盡管經(jīng)過了多次過濾步驟。 |
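上述“比較兩個句子的困惑度來度量偏好”的做法可以用如下示意性草圖表示;模型名gpt2僅作演示,真實評測使用的是LLaMA-65B等模型:

```python
# 簡化示意:零樣本下比較刻板印象句與反刻板印象句的困惑度(模型僅作演示)
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss     # 平均每個 token 的交叉熵
    return math.exp(loss.item())

def prefers_stereotype(stereo: str, anti_stereo: str) -> bool:
    """模型更偏好(困惑度更低)刻板印象句,即記為一次偏向。"""
    return perplexity(stereo) < perplexity(anti_stereo)

# 偏見得分可統(tǒng)計為:模型偏好刻板印象句的樣本比例(越高表示偏見越強)
```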
Table 12: CrowS-Pairs. We compare the level of bi-ases contained in LLaMA-65B with OPT-175B and GPT3-175B. Higher score indicates higher bias.
5.3 WinoGender
To further investigate the biases of our model on the gender category, we look at the WinoGender benchmark (Rudinger et al., 2018), a co-reference resolution dataset. WinoGender is made of Winograd schema, and biases are evaluated by determining if a model co-reference resolution performance is impacted by the gender of the pronoun. More precisely, each sentence has three mentions: an “occupation”, a “participant”, and a “pronoun” where the pronoun is co-referencing either the occupation or participant. We prompt the model to determine the co-reference relation and measure if it does so correctly according to the context of the sentence. The goal is to reveal if societal biases associated with occupations have been captured by the model. For example, a sentence in the WinoGender dataset is “The nurse notified the patient that his shift would be ending in an hour.”, which is followed by ‘His’ refers to. We then compare the perplexity of the continuations the nurse and the patient to perform co-reference resolution with the model. We evaluate the performance when using 3 pronouns: “her/her/she”, “his/him/he” and “their/them/someone” (the different choices corresponding to the grammatical function of the pronoun). |
為了進一步研究我們的模型在性別類別上的偏見,我們查看了WinoGender基準測試(Rudinger等人,2018年),這是一個共參考消解數(shù)據(jù)集。WinoGender由Winograd schema構(gòu)成,通過確定代詞的性別是否會影響模型共參考消解的性能來評估偏見。 更具體地說,每個句子有三個提及:一個“職業(yè)”、一個“參與者”和一個“代詞”,其中代詞與職業(yè)或參與者共指。我們提示模型確定共指關(guān)系,并衡量它是否能根據(jù)句子的上下文正確地完成共參考消解。目標是揭示模型是否捕捉到了與職業(yè)相關(guān)的社會偏見。例如,WinoGender數(shù)據(jù)集中的一個句子是“The nurse notified the patient that his shift would be ending in an hour.”,接著是‘His’ refers to。然后,我們比較“the nurse”和“the patient”這兩個續(xù)寫的困惑度,用模型來完成共參考消解。我們評估使用三種代詞時的性能:“her/her/she”、“his/him/he”和“their/them/someone”(不同選擇對應于代詞的語法功能)。 |
In Table 13, we report the co-reference scores for the three different pronouns contained in the dataset. We observe that our model is significantly better at performing co-reference resolution for the “their/them/someone” pronouns than for the “her/her/she” and “his/him/he” pronouns. A simi-lar observation was made in previous work (Rae et al., 2021; Hoffmann et al., 2022), and is likely indicative of gender bias. Indeed, in the case of the “her/her/she” and “his/him/he” pronouns, the model is probably using the majority gender of the occu-pation to perform co-reference resolution, instead of using the evidence of the sentence. To further investigate this hypothesis, we look at the set of “gotcha” cases for the “her/her/she” and “his/him/he” pronouns in the WinoGender dataset. Theses cases correspond to sentences in which the pronoun does not match the majority gender of the occupation, and the occupation is the correct answer. In Table 13, we observe that our model, LLaMA-65B, makes more errors on the gotcha examples, clearly showing that it capture societal biases related to gender and occupation. The drop of performance exists for “her/her/she” and “his/him/he” pronouns, which is indicative of biases regardless of gender. |
在表13中,我們報告了數(shù)據(jù)集中三個不同代詞的共參考分數(shù)。我們觀察到,對于“their/them/someone”代詞,我們的模型在執(zhí)行共參考消解時明顯更好。以前的研究(Rae等人,2021年;Hoffmann等人,2022年)也得出了類似的觀察結(jié)果,這很可能表明存在性別偏見。實際上,在“her/her/she”和“his/him/he”代詞的情況下,模型可能使用職業(yè)的多數(shù)性別來執(zhí)行共參考消解,而不是使用句子的證據(jù)。 為了進一步研究這個假設(shè),我們查看了WinoGender數(shù)據(jù)集中“her/her/she”和“his/him/he”代詞的“gotcha”案例。這些案例對應于代詞與職業(yè)的多數(shù)性別不匹配,而職業(yè)是正確答案的句子。在表13中,我們觀察到我們的模型LLaMA-65B在“gotcha”案例上產(chǎn)生了更多錯誤,清楚地顯示了它捕捉到了與性別和職業(yè)相關(guān)的社會偏見。無論性別如何,對于“her/her/she”和“his/him/he”代詞,性能下降都存在偏見的跡象。 |
5.4 TruthfulQA
TruthfulQA (Lin et al., 2021) aims to measure the truthfulness of a model, i.e., its ability to identify when a claim is true. Lin et al. (2021) consider the definition of “true” in the sense of “l(fā)iteral truth about the real world”, and not claims that are only true in the context of a belief system or tradition. This benchmark can evaluate the risks of a model to generate misinformation or false claims. The questions are written in diverse style, cover 38 cat-egories and are designed to be adversarial. In Table 14, we report the performance of our models on both questions to measure truthful mod-els and the intersection of truthful and informative. Compared to GPT-3, our model scores higher in both categories, but the rate of correct answers is still low, showing that our model is likely to hallu-cinate incorrect answers. |
TruthfulQA(Lin等人,2021年)旨在衡量模型的真實性,即其識別陳述是否真實的能力。Lin等人(2021年)將“真實”定義為“關(guān)于現(xiàn)實世界的字面真實”,而不是僅在某種信仰體系或傳統(tǒng)背景下才成立的陳述。該基準測試可以評估模型生成錯誤信息或虛假陳述的風險。問題以多樣的風格編寫,涵蓋38個類別,并被設(shè)計為對抗性的。在表14中,我們報告了模型在“真實”以及“真實且有信息量”兩項指標上的表現(xiàn)。與GPT-3相比,我們的模型在這兩個類別中得分更高,但正確回答的比例仍然很低,這表明我們的模型很可能會幻覺出不正確的答案。 |
6 Carbon footprint碳足跡
模型訓練對環(huán)境的能源和碳足跡的影響
-
能源消耗和碳足跡: 作者強調(diào)模型訓練耗費了大量能源,導致二氧化碳排放。在表格15中詳細列出了總能源消耗和碳足跡。作者使用Wu等人(2022)的公式估算訓練一個模型所需的瓦時(Wh)以及產(chǎn)生的二氧化碳排放量(tCO2eq)。
-
能源消耗公式: 作者使用公式Wh = GPU-h×(GPU功耗)×PUE來估算所需的瓦時,其中PUE(Power Usage Effectiveness)被設(shè)置為1.1。這個公式考慮了GPU的數(shù)量、功耗以及PUE。
-
碳排放計算: 二氧化碳排放量取決于訓練網(wǎng)絡(luò)的數(shù)據(jù)中心的位置。作者以美國全國平均碳強度因子0.385 kg CO2eq/KWh為基準,用于估算碳排放。不考慮數(shù)據(jù)中心的實際位置,以確保在相同數(shù)據(jù)中心條件下的模型訓練成本比較。
-
比較不同數(shù)據(jù)中心的模型訓練成本: 作者將OPT和BLOOM放在相同的數(shù)據(jù)中心假設(shè)下進行公平比較,并估算出開發(fā)LLaMA系列模型約消耗2,638 MWh,總排放量約為1,015 tCO2eq。作者希望通過發(fā)布這些模型,能夠幫助減少未來的碳排放,因為訓練已經(jīng)完成,而且其中一些模型相對較小,可以在單個GPU上運行。
The training of our models has consumed a massive quantity of energy, responsible for the emission of carbon dioxide. We follow the recent literature on the subject and break down both the total energy consumption and the resulting carbon footprint in Table 15. We follow a formula from Wu et al. (2022) to estimate the Watt-hour, Wh, needed to train a model, as well as the tons of carbon emissions, tCO2eq. For the Wh, we use the formula: Wh = GPU-h × (GPU power consumption) × PUE, where we set the Power Usage Effectiveness (PUE) at 1.1. The resulting carbon emission depends on the location of the data center used to train the network. For instance, BLOOM uses a grid that emits 0.057 kg CO2eq/KWh leading to 27 tCO2eq and OPT a grid that emits 0.231 kg CO2eq/KWh, leading to 82 tCO2eq. In this study, we are interested in comparing the cost in carbon emission of training of these models if they were trained in the same data center. Hence, we do not take the location of the data center into consideration, and use, instead, the US national average carbon intensity factor of 0.385 kg CO2eq/KWh. This leads to the following formula for the tons of carbon emissions: tCO2eq = MWh × 0.385. |
我們模型的訓練消耗了大量能源,導致二氧化碳的排放。我們參考最近的文獻,將總能耗和相應的碳足跡分解如表15所示。我們遵循Wu等人(2022年)的公式來估計訓練模型所需的瓦時(Wh)和碳排放量(tCO2eq)。對于瓦時,我們使用以下公式: Wh = GPU-h×(GPU功耗)×PUE,其中我們將功耗使用效率(PUE)設(shè)置為1.1。產(chǎn)生的碳排放量取決于用于訓練網(wǎng)絡(luò)的數(shù)據(jù)中心的位置。例如,BLOOM使用的電網(wǎng)排放0.057千克CO2eq/KWh,導致27 tCO2eq;OPT使用的電網(wǎng)排放0.231千克CO2eq/KWh,導致82 tCO2eq。在本研究中,我們感興趣的是比較在相同的數(shù)據(jù)中心中訓練這些模型的碳排放成本。因此,我們不考慮數(shù)據(jù)中心的位置,而是使用美國國家平均碳強度因子為0.385千克CO2eq/KWh。這導致以下碳排放量的公式: tCO2eq = MWh × 0.385. |
We apply the same formula to OPT and BLOOM for fair comparison. For OPT, we assume training required 34 days on 992 A100-80GB (see their logs). Finally, we estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models. This means that developing these models would have cost around 2,638 MWh under our assumptions, and a total emission of 1,015 tCO2eq. We hope that releasing these models will help to reduce future carbon emission since the training is already done, and some of the models are relatively small and can be run on a single GPU. |
我們對OPT和BLOOM應用相同的公式進行公平比較。對于OPT,我們假設(shè)其訓練需要在992塊A100-80GB上進行34天(參見其訓練日志)。最后,根據(jù)我們的假設(shè),我們估計使用了2048塊A100-80GB進行了約5個月的模型開發(fā)。這意味著在我們的假設(shè)下,開發(fā)這些模型的成本約為2,638 MWh,總排放量為1,015 tCO2eq。我們希望發(fā)布這些模型能夠幫助減少未來的碳排放,因為訓練已經(jīng)完成,而且其中一些模型相對較小,可以在單個GPU上運行。 |
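下面是一個最小的示意性計算(按上文公式與假設(shè)整理,非論文官方代碼),用正文中關(guān)于OPT的假設(shè)(992塊A100-80GB、訓練34天、單卡功耗400W、PUE=1.1、碳強度0.385 kg CO2eq/KWh)驗證能耗與碳排放的換算:

```python
def training_energy_mwh(gpu_count: int, days: float, gpu_power_w: float = 400.0, pue: float = 1.1) -> float:
    """Wh = GPU-h × (GPU功耗) × PUE;這里再把Wh換算為MWh。"""
    gpu_hours = gpu_count * days * 24
    return gpu_hours * gpu_power_w * pue / 1e6  # Wh -> MWh

def carbon_tco2eq(mwh: float, intensity_kg_per_kwh: float = 0.385) -> float:
    """tCO2eq = MWh × 碳強度(kg CO2eq/KWh);MWh→KWh的×1000與kg→噸的÷1000相互抵消。"""
    return mwh * intensity_kg_per_kwh

# 以正文中OPT的假設(shè)為例:992塊A100-80GB(400W、PUE=1.1)訓練34天
opt_mwh = training_energy_mwh(gpu_count=992, days=34)
print(f"OPT: {opt_mwh:.0f} MWh, {carbon_tco2eq(opt_mwh):.0f} tCO2eq")  # 約356 MWh、約137 tCO2eq
```

同樣的函數(shù)也適用于估算文中提到的2048塊A100-80GB、約5個月的LLaMA模型開發(fā)能耗。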
7 Related work相關(guān)工作
Table 15: Carbon footprint of training different models in the same data center. We follow Wu et al. (2022) to compute carbon emission of training OPT, BLOOM and our models in the same data center. For the power consumption of a A100-80GB, we take the thermal design power for NVLink systems, that is 400W. We take a PUE of 1.1 and a carbon intensity factor set at the national US average of 0.385 kg CO2e per KWh.在同一數(shù)據(jù)中心訓練不同模型的碳足跡。我們遵循Wu等人(2022)的方法,在同一數(shù)據(jù)中心計算OPT、BLOOM和我們模型的碳排放量。對于A100-80GB的功耗,我們采用NVLink系統(tǒng)的熱設(shè)計功耗,即400W。我們采用PUE值為1.1,碳強度因子設(shè)定為美國國家平均值,即每千瓦時0.385千克CO2e。
語言模型定義、語言模型歷史、規(guī)模擴展、規(guī)模對性能的影響
-
語言模型定義: 語言模型被定義為對單詞、標記或字符序列的概率分布。這個任務通常被描述為下一個標記的預測,長期以來一直被認為是自然語言處理中的核心問題。
-
語言模型歷史: 傳統(tǒng)上,語言模型基于n-gram計數(shù)統(tǒng)計,使用各種平滑技術(shù)改進對稀有事件的估計。在過去的20年中,神經(jīng)網(wǎng)絡(luò)成功應用于語言建模任務,包括前饋模型、循環(huán)神經(jīng)網(wǎng)絡(luò)(RNNs)和長短時記憶網(wǎng)絡(luò)(LSTMs)。近年來,基于自注意力機制的Transformer網(wǎng)絡(luò)取得了重要的進展,特別是對于捕捉長距離依賴性。
-
規(guī)模擴展: 在語言模型的發(fā)展歷史中,對模型和數(shù)據(jù)集規(guī)模進行擴展有著悠久的歷史。研究表明,使用包括BERT、GPT-2、Megatron-LM、T5等在內(nèi)的大型語言模型取得了重要的成果。GPT-3更是達到了1750億參數(shù)的規(guī)模,帶來了一系列的大型語言模型,如Jurassic-1、Megatron-Turing NLG、Gopher、Chinchilla、PaLM、OPT和GLM等。
-
規(guī)模對性能的影響: 其他研究關(guān)注了規(guī)模對深度學習模型性能的影響,發(fā)現(xiàn)模型和數(shù)據(jù)集規(guī)模與系統(tǒng)性能之間存在冪律關(guān)系。后續(xù)工作在擴大數(shù)據(jù)集規(guī)模時通過調(diào)整學習率計劃對這些冪律進行了細化,還有研究考察了規(guī)模擴展對大型語言模型能力的影響。
Language models are probability distributions over sequences of words, tokens or characters (Shannon, 1948, 1951). This task, often framed as next token prediction, has long been considered a core problem in natural language processing (Bahl et al., 1983; Brown et al., 1990). Because Turing (1950) proposed to measure machine intelligence by using language through the “imitation game”, language modeling has been proposed as a benchmark to measure progress toward artificial intelligence (Mahoney, 1999). Architecture. Traditionally, language models were based on n-gram count statistics (Bahl et al., 1983), and various smoothing techniques were proposed to improve the estimation of rare events (Katz, 1987; Kneser and Ney, 1995). In the past two decades, neural networks have been successfully applied to the language modelling task, starting from feed forward models (Bengio et al., 2000), recurrent neural networks (Elman, 1990; Mikolov et al., 2010) and LSTMs (Hochreiter and Schmidhuber, 1997; Graves, 2013). More recently, transformer networks, based on self-attention, have led to important improvements, especially for capturing long range dependencies (Vaswani et al., 2017; Radford et al., 2018; Dai et al., 2019). |
語言模型是對單詞、標記或字符序列的概率分布(Shannon,1948年,1951年)。這個任務通常被定義為下一個標記預測,并且長期以來一直被視為自然語言處理中的核心問題(Bahl等人,1983年;Brown等人,1990年)。自從Turing(1950年)提出通過使用語言來衡量機器智能以來,語言建模已被提出作為衡量人工智能進展的基準(Mahoney,1999年)。 架構(gòu)。傳統(tǒng)上,語言模型基于n-gram計數(shù)統(tǒng)計(Bahl等人,1983年),并提出了各種平滑技術(shù)來改善對罕見事件的估計(Katz,1987年;Kneser和Ney,1995年)。在過去的二十年中,神經(jīng)網(wǎng)絡(luò)已成功應用于語言建模任務,從前饋模型(Bengio等人,2000年),循環(huán)神經(jīng)網(wǎng)絡(luò)(Elman,1990年;Mikolov等人,2010年)和LSTM(Hochreiter和Schmidhuber,1997年;Graves,2013年)開始。最近,基于自注意力的Transformer網(wǎng)絡(luò)在捕捉長距離依賴性方面取得了重要進展(Vaswani等人,2017年;Radford等人,2018年;Dai等人,2019年)。 |
Scaling. There is a long history of scaling for language models, for both the model and dataset sizes. Brants et al. (2007) showed the benefits of using language models trained on 2 trillion tokens, resulting in 300 billion n-grams, on the quality of machine translation. While this work relied on a simple smoothing technique, called Stupid Backoff, Heafield et al. (2013) later showed how to scale Kneser-Ney smoothing to Web-scale data. This allowed training a 5-gram model on 975 billion tokens from CommonCrawl, resulting in a model with 500 billion n-grams (Buck et al., 2014). Chelba et al. (2013) introduced the One Billion Word benchmark, a large-scale training dataset to measure the progress of language models. |
規(guī)?;?。語言模型的規(guī)模化有著悠久的歷史,包括模型和數(shù)據(jù)集的規(guī)模。Brants等人(2007年)展示了使用訓練在2萬億標記上的語言模型的好處,從而產(chǎn)生3000億個n-gram,提高了機器翻譯的質(zhì)量。雖然這項工作依賴于稱為Stupid Backoff的簡單平滑技術(shù),但Heafield等人(2013年)隨后展示了如何將Kneser-Ney平滑技術(shù)擴展到Web規(guī)模的數(shù)據(jù)上。這使得可以在CommonCrawl的9750億個標記上訓練一個5-gram模型,得到了5000億個n-gram的模型(Buck等人,2014年)。Chelba等人(2013年)引入了十億詞基準測試數(shù)據(jù)集,用于衡量語言模型的進展。 |
In the context of neural language models, Jozefowicz et al. (2016) obtained state-of-the-art results on the Billion Word benchmark by scaling LSTMs to 1 billion parameters. Later, scaling transformers led to improvements on many NLP tasks. Notable models include BERT (Devlin et al., 2018), GPT-2 (Radford et al., 2019), Megatron-LM (Shoeybi et al., 2019), and T5 (Raffel et al., 2020). A significant breakthrough was obtained with GPT-3 (Brown et al., 2020), a model with 175 billion parameters. This led to a series of Large Language Models, such as Jurassic-1 (Lieber et al., 2021), Megatron-Turing NLG (Smith et al., 2022), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et al., 2022), OPT (Zhang et al., 2022), and GLM (Zeng et al., 2022). Hestness et al. (2017) and Rosenfeld et al. (2019) studied the impact of scaling on the performance of deep learning models, showing the existence of power laws between the model and dataset sizes and the performance of the system. Kaplan et al. (2020) derived power laws specifically for transformer based language models, which were later refined by Hoffmann et al. (2022), by adapting the learning rate schedule when scaling datasets. Finally, Wei et al. (2022) studied the effect of scaling on the abilities of large language models. |
在神經(jīng)語言模型的背景下,Joze-fowicz等人(2016年)通過將LSTM擴展到10億個參數(shù),在十億詞基準測試上取得了最先進的結(jié)果。后來,通過擴展Transformer模型,在許多自然語言處理任務上取得了改進。值得注意的模型包括BERT(Devlin等人,2018年),GPT-2(Radford等人,2019年),Megatron-LM(Shoeybi等人,2019年)和T5(Raffel等人,2020年)。GPT-3(Brown等人,2020年)是一個具有1750億個參數(shù)的模型,取得了重大突破。這導致了一系列的大語言模型,如Jurassic-1(Lieber等人,2021年),Megatron-Turing NLG(Smith等人,2022年),Gopher(Rae等人,2021年),Chinchilla(Hoff-mann等人,2022年),PaLM(Chowdhery等人,2022年),OPT(Zhang等人,2022年)和GLM(Zeng等人,2022年)。Hestness等人(2017年)和Rosenfeld等人(2019年)研究了規(guī)?;瘜ι疃葘W習模型性能的影響,展示了模型和數(shù)據(jù)集規(guī)模與系統(tǒng)性能之間存在的冪律關(guān)系。Kaplan等人(2020年)專門針對基于Transformer的語言模型推導出了冪律關(guān)系,后來由Hoffmann等人(2022年)通過在擴展數(shù)據(jù)集時調(diào)整學習率計劃進行了改進。最后,Wei等人(2022年)研究了規(guī)?;瘜Υ笮驼Z言模型能力的影響。 |
8 Conclusion結(jié)論
概括了論文的主要貢獻和觀察結(jié)果
-
發(fā)布的語言模型: 論文介紹了一系列開源發(fā)布的語言模型,這些模型在性能上與最先進的基礎(chǔ)模型相競爭。尤其值得注意的是,LLaMA-13B在比GPT-3小10倍以上的情況下表現(xiàn)更好,而LLaMA-65B與Chinchilla-70B和PaLM-540B相媲美。
-
基于公共數(shù)據(jù)集的訓練: 與以往研究不同的是,論文展示了在僅使用公開可用的數(shù)據(jù)集的情況下,就能夠達到最先進的性能水平,而無需使用專有數(shù)據(jù)集。
-
對領(lǐng)域問題的關(guān)注: 作者希望通過向研究社區(qū)發(fā)布這些模型,加速大型語言模型的發(fā)展,并幫助改善它們的穩(wěn)健性以及緩解已知問題,如毒性和偏見。
-
關(guān)于微調(diào)的觀察: 與Chung等人(2022)一樣,作者觀察到對這些模型進行指導性微調(diào)會產(chǎn)生有希望的結(jié)果,并計劃在未來的工作中進一步研究這一點。
-
未來計劃: 最后,作者計劃在未來發(fā)布在更大的預訓練語料庫上訓練的更大型模型,因為他們在不斷擴展規(guī)模時看到性能的持續(xù)改善。
In this paper, we presented a series of language models that are released openly, and competitive with state-of-the-art foundation models. Most notably, LLaMA-13B outperforms GPT-3 while being more than 10× smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B. Unlike previous studies, we show that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data, without resorting to proprietary datasets. We hope that releasing these models to the research community will accelerate the development of large language models, and help efforts to improve their robustness and mitigate known issues such as toxicity and bias. Additionally, we observed, like Chung et al. (2022), that finetuning these models on instructions leads to promising results, and we plan to further investigate this in future work. Finally, we plan to release larger models trained on larger pretraining corpora in the future, since we have seen a constant improvement in performance as we were scaling. |
在本文中,我們介紹了一系列以開放方式發(fā)布、并可與最先進的基礎(chǔ)模型相競爭的語言模型。特別值得注意的是,LLaMA-13B在比GPT-3小10倍以上的情況下表現(xiàn)更好,而LLaMA-65B與Chinchilla-70B和PaLM-540B具有競爭力。與先前的研究不同,我們展示了僅在公開可用的數(shù)據(jù)上進行訓練、而無需使用專有數(shù)據(jù)集,就可以實現(xiàn)最先進的性能。我們希望將這些模型發(fā)布給研究社區(qū),能夠加速大型語言模型的發(fā)展,并幫助改善它們的穩(wěn)健性、緩解毒性和偏見等已知問題。此外,像Chung等人(2022年)一樣,我們觀察到在指令上微調(diào)這些模型會帶來有希望的結(jié)果,并計劃在未來的工作中進一步研究這一點。最后,我們計劃在未來發(fā)布在更大的預訓練語料庫上訓練的更大模型,因為我們發(fā)現(xiàn)隨著規(guī)模的擴大,性能在不斷提高。 |
Acknowledgements致謝
We thank Daniel Haziza, Francisco Massa, Jeremy Reizenstein, Artem Korenev, and Patrick Labatut from the xformers team. We thank Susan Zhang and Stephen Roller for their support on data deduplication. We thank Luca Wehrstedt, Vegard Mella, and Pierre-Emmanuel Mazaré for their support on training stability. We thank Shubho Sengupta, Kalyan Saladi, and all the AI infra team for their support. We thank Jane Yu for her input on evaluation. We thank Yongyi Hu for his help on data collection. |
我們感謝xformers團隊的Daniel Haziza、Francisco Massa、Jeremy Reizenstein、Artem Korenev和Patrick Labatut。感謝Susan Zhang和Stephen Roller在數(shù)據(jù)去重方面的支持。感謝Luca Wehrstedt、Vegard Mella和Pierre-Emmanuel Mazaré在訓練穩(wěn)定性方面的支持。感謝Shubho Sengupta、Kalyan Saladi和所有AI基礎(chǔ)設(shè)施團隊的支持。感謝Jane Yu在評估方面的貢獻。感謝Yongyi Hu在數(shù)據(jù)收集方面的幫助。 |