
七月論文審稿GPT第2版:從Meta Nougat、GPT4審稿到Mistral、LongLora Llama


前言

如此前這篇文章《學(xué)術(shù)論文GPT的源碼解讀與微調(diào):從ChatPaper到七月論文審稿GPT第1版》中的第三部分所述,對(duì)于論文的摘要/總結(jié)、對(duì)話、翻譯、語(yǔ)法檢查而言,市面上的學(xué)術(shù)論文GPT的效果雖暫未有多好,可至少還過(guò)得去,而如果涉及到論文的修訂/審稿,則市面上已有的學(xué)術(shù)論文GPT的效果則大打折扣

原因在哪呢?本質(zhì)原因在于無(wú)論什么功能,它們基本都是基于API實(shí)現(xiàn)的,而關(guān)鍵是API畢竟不是萬(wàn)能的,API做翻譯/總結(jié)/對(duì)話還行,但如果要對(duì)論文提出審稿意見(jiàn),則API就捉襟見(jiàn)肘了,故為實(shí)現(xiàn)更好的review效果,需要使用特定的對(duì)齊數(shù)據(jù)集進(jìn)行微調(diào)來(lái)獲得具備優(yōu)秀review能力的模型

繼而,我們?cè)诘谝话嬷校隽艘韵氯?

  1. 爬取了3萬(wàn)多篇paper、十幾萬(wàn)的review數(shù)據(jù),并對(duì)3萬(wàn)多篇PDF形式的paper做解析(review數(shù)據(jù)爬下來(lái)之后就是文本數(shù)據(jù),不用做解析)
    當(dāng)然,paper中有被接收的、也有被拒絕的
  2. 為提高數(shù)據(jù)質(zhì)量,針對(duì)paper和review做了一系列數(shù)據(jù)處理
    當(dāng)然,主要是針對(duì)review數(shù)據(jù)做處理
  3. 基于RWKV進(jìn)行微調(diào),然因其遺忘機(jī)制比較嚴(yán)重,故最終效果不達(dá)預(yù)期

所以,進(jìn)入Q4后,我司論文審稿GPT的項(xiàng)目團(tuán)隊(duì)開(kāi)始做第二版(我司自從23年Q3在教育團(tuán)隊(duì)之外,我再帶隊(duì)成立LLM項(xiàng)目團(tuán)隊(duì)之后,一直在不斷迭代三大LLM項(xiàng)目,后來(lái)每個(gè)項(xiàng)目各自一個(gè)項(xiàng)目組,除了阿荀帶頭的論文審稿GPT之外,還有:霍哥帶頭的AIGC模特生成系統(tǒng)、朝陽(yáng)帶頭的企業(yè)知識(shí)庫(kù)問(wèn)答),并著重做以下三大方面的優(yōu)化

  • 數(shù)據(jù)的解析與處理的優(yōu)化,meta的一個(gè)ocr即「nougat」能提取出LaTeX,當(dāng)然,我們也在同步對(duì)比另一個(gè)解析器sciencebeam的效果
  • 借鑒GPT4做審稿人那篇論文,讓ChatGPT API幫忙把爬到的review語(yǔ)料梳理出以下4個(gè)方面的內(nèi)容
    1 重要性和新穎性
    2 論文被接受的原因
    3 論文被拒絕的原因
    4 改進(jìn)建議
  • 模型本身的優(yōu)化,比如Mistral或llama longlora

第一部分 第二版對(duì)論文PDF數(shù)據(jù)的解析

1.1 兩大PDF解析器:nougat VS ScienceBeam

1.1.1 Meta nougat

nougat是Meta針對(duì)學(xué)術(shù)PDF文檔的開(kāi)源解析工具(其主頁(yè)、其代碼倉(cāng)庫(kù)),以O(shè)CR方法為主線,較之過(guò)往解析方案最突出的特點(diǎn)是可準(zhǔn)確識(shí)別出公式、表格並將其轉(zhuǎn)換為適配Markdown格式的文本。缺陷是轉(zhuǎn)換速度較慢、且解析內(nèi)容可能存在一定的亂序

和另一個(gè)解析器sciencebeam做下對(duì)比,可知

  • nougat比較好的地方在于可以把圖片公式拆解成LaTeX源碼,另外就是識(shí)別出來(lái)的內(nèi)容可以通過(guò)“#”符號(hào)來(lái)拆解文本段
    缺陷就是效率很低、非常慢,拿共約80頁(yè)的3篇pdf來(lái)解析的話,大概需要2分鐘,且占用20G顯存,到時(shí)候如果要應(yīng)用化,要讓用戶傳pdf解析的話,部署可能也會(huì)有點(diǎn)難度
  • sciencebeam的話就是快不少,同樣量級(jí)的3篇大約1分鐘內(nèi)都可以完成,和第1版用的SciPDF差不多,只需要CPU就可以驅(qū)動(dòng)起來(lái)了

當(dāng)然,還要考慮的是解析器格式化的粒度,比如正文拆成了什么樣子的部分,后續(xù)我們需不需要對(duì)正文的特定部分專門取出來(lái)做處理,如果格式化粒度不好的話,可能會(huì)比較難取出來(lái)
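
針對(duì)上文提到的"可通過(guò)'#'符號(hào)拆解文本段"以及格式化粒度的問(wèn)題,下面給出一個(gè)極簡(jiǎn)的示意(並非nougat官方或我司的實(shí)際代碼,函數(shù)名與文件路徑均為假設(shè)),用於把nougat輸出的.mmd文本按標(biāo)題行拆成若干小節(jié),便於判斷粒度是否夠用:

# 極簡(jiǎn)示意:按"#"標(biāo)題行把nougat輸出的.mmd文本拆成若干小節(jié)
def split_mmd_sections(mmd_text: str) -> dict:
    sections = {}
    current_title = "_preamble"        # 第一個(gè)"#"出現(xiàn)之前的內(nèi)容
    buffer = []
    for line in mmd_text.splitlines():
        if line.lstrip().startswith("#"):              # Markdown式標(biāo)題行
            sections[current_title] = "\n".join(buffer).strip()
            current_title = line.lstrip("#").strip()
            buffer = []
        else:
            buffer.append(line)
    sections[current_title] = "\n".join(buffer).strip()
    return sections

# 用法示意(路徑為假設(shè))
# with open("paper.mmd", encoding="utf-8") as f:
#     secs = split_mmd_sections(f.read())
# print(list(secs.keys()))   # 查看拆出的小節(jié)標(biāo)題,據(jù)此判斷格式化粒度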

  1. 環(huán)境配置
    # 新建虛擬環(huán)境
    conda create -n nougat-ocr python=3.10
    # 激活虛擬環(huán)境
    conda activate nougat-ocr
    # 使用pip安裝必要庫(kù)(鏡像源安裝可能會(huì)出現(xiàn)版本沖突問(wèn)題,建議開(kāi)啟代理使用python官方源進(jìn)行安裝)
    pip install nougat-ocr -i https://pypi.org/simple
  2. 使用方法
    # 初次使用時(shí)會(huì)自動(dòng)獲取最新的權(quán)重文件
    # 針對(duì)單個(gè)pdf文件
    nougat {pdf文件路徑} -o {解析輸出目錄}
    # 針對(duì)多個(gè)pdf所在文件夾
    nougat {pdf目錄路徑} -o {解析輸出目錄}
  3. 測(cè)試示例
    標(biāo)題及開(kāi)頭(識(shí)別效果截圖略)
    公式識(shí)別與轉(zhuǎn)換(識(shí)別效果截圖略)
    腳注識(shí)別(識(shí)別效果截圖略)

1.1.2 ScienceBeam

ScienceBeam是經(jīng)典PDF文檔解析器GROBID的變體項(xiàng)目,是論文《Can large language models provide useful feedback on research papers? A large-scale empirical analysis》所采用的文本提取方法,同其他較早期的解析方法一樣,對(duì)公式無(wú)法做出LaTeX層面的解析,且該解析器僅支持在X86架構(gòu)的Linux系統(tǒng)中使用

// 待更

1.2?對(duì)2.6萬(wàn)篇paper的解析

最終,需要解析的是帶有review的2.6萬(wàn)篇paper(第一版 全部paper 3萬(wàn)篇,其中帶review的2.5萬(wàn)篇;第二版 全部paper 3.2萬(wàn)篇,其中帶review的2.6萬(wàn)篇)

1.2.1 nougat的解析過(guò)程

  1. 我司審稿項(xiàng)目組的其中一位“雪狼”用的1張顯存為24G的P40解析完其中一半,另外一半由另一位“不染”用的1張顯存為48G的A40解析完
  2. 因nougat解析起來(lái)太耗資源,加之當(dāng)時(shí)我們的卡有限,所以這個(gè)PDF的解析,我們便用了一兩周..

1.2.2 ScienceBeam的解析結(jié)果

ScienceBeam解析的結(jié)果為字典,其中涉及的鍵有

  1. title: Paper的標(biāo)題,有部分會(huì)因?yàn)榻馕霾怀龆艨眨梢允褂孟鄳?yīng)的OpenReview數(shù)據(jù)的標(biāo)題來(lái)代替
  2. abstract: Paper的摘要,可能有部分會(huì)因?yàn)榻馕霾怀龆艨眨梢允褂孟鄳?yīng)的OpenReview數(shù)據(jù)的摘要來(lái)代替
  3. introduction: Paper的介紹,通常會(huì)包含在main_content中
  4. figure_and_table_captions: Paper中圖表下方的文字描述
  5. section_titles: Paper各個(gè)小節(jié)的標(biāo)題
  6. main_content: Paper的正文(含introduction)

實(shí)際取用的部分是其中的title、abstract、figure_and_table_captions以及main_content

且會(huì)加入[TITLE]、[ABSTRACT]、[CAPTIONS]、[CONTENT]特殊符號(hào)加以區(qū)分Paper的各個(gè)部分,考慮到[CONTENT]可能會(huì)提及[CAPTIONS]中的內(nèi)容,因此將[CAPTIONS]置于[CONTENT]之前

[TITLE]
標(biāo)題

[ABSTRACT]
摘要

[CAPTIONS]
各圖表描述

[CONTENT]
其余正文
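
為便於理解上述拼接方式,下面給出一小段示意代碼(並非我司實(shí)際代碼,字典的鍵即上文ScienceBeam解析結(jié)果中提到的那幾個(gè)鍵):

# 極簡(jiǎn)示意:把解析結(jié)果按[TITLE]/[ABSTRACT]/[CAPTIONS]/[CONTENT]的順序拼成一段文本
def build_paper_text(parsed: dict) -> str:
    return (
        "[TITLE]\n" + parsed.get("title", "") + "\n\n"
        + "[ABSTRACT]\n" + parsed.get("abstract", "") + "\n\n"
        + "[CAPTIONS]\n" + parsed.get("figure_and_table_captions", "") + "\n\n"
        + "[CONTENT]\n" + parsed.get("main_content", "")
    )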

// 待更

第二部分 第二版對(duì)paper和review數(shù)據(jù)的處理

2.1 第一版對(duì)review數(shù)據(jù)的處理

在第一版中,我們對(duì)review數(shù)據(jù)做了如下處理

(圖略:第一版review數(shù)據(jù)處理的流程)

總之

  • 第一版中,面向paper 更多是做的PDF解析(解析器解析出來(lái)的正文直接就沒(méi)包含reference)
    第二版中,對(duì)于paper的數(shù)據(jù)處理沿用第一版的處理方法:解析完了之后 不再做什么處理
  • 第一版中,面向review 則做的如上圖所示的數(shù)據(jù)處理(注意,review無(wú)解析一說(shuō),畢竟如前言中所說(shuō),review數(shù)據(jù)爬下來(lái)之后就是文本數(shù)據(jù),不用做解析)
    那第二版 針對(duì)review數(shù)據(jù)的處理呢?詳見(jiàn)下文

2.2 第二版對(duì)review數(shù)據(jù)的處理

以“b_forum”字段為與Paper數(shù)據(jù)所關(guān)聯(lián)的外鍵,“b_forum”為對(duì)應(yīng)Paper的唯一標(biāo)識(shí)符(id)

  • 某篇paper所對(duì)應(yīng)的Review數(shù)據(jù)如果只是單行即為單個(gè)Review
  • 但很多時(shí)候,單篇Paper可能對(duì)應(yīng)有多個(gè)Review,故存在多行數(shù)據(jù)下b_forum相同的情況

(圖略:多條review對(duì)應(yīng)同一b_forum的原始數(shù)據(jù)示例)

針對(duì)原始數(shù)據(jù),我們做以下4點(diǎn)處理

  1. 過(guò)濾需求外的Review
    主要是去掉作者自己的回復(fù),以及對(duì)paper的評(píng)論
  2. 將Review字符串化
  3. 過(guò)濾內(nèi)容過(guò)少的Review
  4. 將Review的內(nèi)容規(guī)范出4個(gè)要點(diǎn)且進(jìn)行“多聚一”,下文詳述

本部分?jǐn)?shù)據(jù)處理的代碼,暫在七月在線的「大模型項(xiàng)目開(kāi)發(fā)線上營(yíng)」中見(jiàn)
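
在此之前,可先通過(guò)一個(gè)極簡(jiǎn)的思路示意來(lái)理解上述前3點(diǎn)處理(並非我司實(shí)際代碼,其中的文件路徑、列名、長(zhǎng)度閾值均為假設(shè)):

# 思路示意:按b_forum聚合、過(guò)濾需求外及過(guò)短的review
import pandas as pd

df = pd.read_json("reviews_raw.jsonl", lines=True)              # 路徑為假設(shè)
df = df[df["writer"] != "author"]                               # 1. 去掉作者自己的回復(fù)/評(píng)論(列名為假設(shè))
df["review_text"] = df["review_text"].astype(str)               # 2. 將review字符串化
df = df[df["review_text"].str.len() >= 200]                     # 3. 過(guò)濾內(nèi)容過(guò)少的review(閾值僅為示意)
grouped = df.groupby("b_forum")["review_text"].apply(list)      # 同一篇paper的多條review聚到一起,供後續(xù)"多聚一"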

// 待更

第三部分 對(duì)review數(shù)據(jù)的進(jìn)一步處理:規(guī)范Review的格式且多聚一

3.1 斯坦福:讓GPT4首次當(dāng)論文的審稿人

近日,來(lái)自斯坦福大學(xué)等機(jī)構(gòu)的研究者把數(shù)千篇來(lái)自Nature、ICLR等的頂會(huì)文章丟給了GPT-4,讓它生成評(píng)審意見(jiàn)、修改建議,然后和人類審稿人給出的意見(jiàn)相比較

  • 在GPT4給出的意見(jiàn)中,超50%和至少一名人類審稿人一致,并且超過(guò)82.4%的作者表示,GPT-4給出的意見(jiàn)相當(dāng)有幫助
  • 這個(gè)工作總結(jié)在這篇論文中《Can large language models provide useful feedback on research papers? A large-scale empirical analysis》,這是其對(duì)應(yīng)的代碼倉(cāng)庫(kù)

所以,怎樣讓LLM給你審稿呢?具體來(lái)說(shuō),如下圖所示

(圖略:斯坦福該工作的整體流程示意)

  1. 爬取PDF語(yǔ)料
  2. 接著,解析PDF論文的標(biāo)題、摘要、圖形、表格標(biāo)題、主要文本
  3. 然後告訴GPT-4,你需要遵循業(yè)內(nèi)頂尖的期刊會(huì)議的審稿反饋形式,包括四個(gè)部分(調(diào)用API的示意代碼見(jiàn)下文)
    成果是否重要、是否新穎(significance and novelty)
    論文被接受的理由(potential reasons for acceptance)
    論文被拒的理由(potential reasons for rejection)
    改進(jìn)建議(suggestions for improvement)
    Your task now is to draft a high-quality review outline for a top-tier Machine Learning (ML) conference for a submission titled "{PaperTitle}":
    
    ```
    {PaperContent}
    ```
    
    ======
    Your task:
    Compose a high-quality peer review of a paper submitted to a Nature family journal.
    
    Start by "Review outline:".
    And then:
    "1. Significance and novelty"
    "2. Potential reasons for acceptance"
    "3. Potential reasons for rejection", List multiple key reasons. For each key reason, use **>=2 sub bullet points** to further clarify and support your arguments in painstaking details. Be as specific and detailed as possible.
    "4. Suggestions for improvement", List multiple key suggestions. Be as specific and detailed as possible.
    
    Be thoughtful and constructive. Write Outlines only.
  4. 最終,GPT-4針對(duì)上圖中的這篇論文一針見(jiàn)血地指出:雖然論文提及了模態(tài)差距現(xiàn)象,但并沒(méi)有提出縮小差距的方法,也沒(méi)有證明這樣做的好處
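
作為補(bǔ)充,下面給出把上述第3步的prompt發(fā)給GPT-4的調(diào)用示意(並非斯坦福原倉(cāng)庫(kù)的代碼,其中review_prompt_template、title、paper_text等變量均為假設(shè)):

# 調(diào)用示意:把審稿prompt與解析後的論文文本一起發(fā)給GPT-4
from openai import OpenAI

client = OpenAI()  # 需預(yù)先設(shè)置OPENAI_API_KEY環(huán)境變量

prompt = review_prompt_template.format(PaperTitle=title, PaperContent=paper_text)  # 模板與變量均為假設(shè)
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(resp.choices[0].message.content)   # 即按四個(gè)要點(diǎn)組織的審稿意見(jiàn)大綱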

3.2 為了讓模型對(duì)review的學(xué)習(xí)更有跡可循:歸納出來(lái)4個(gè)要點(diǎn)且多聚一

3.2.1 設(shè)計(jì)更好的提示模板以讓大模型幫梳理出來(lái)review語(yǔ)料的4個(gè)內(nèi)容點(diǎn)

上一節(jié)介紹的斯坦福這個(gè)讓GPT4當(dāng)審稿人的工作,對(duì)我司做論文審稿GPT還挺有啟發(fā)的

  1. 正向看,說(shuō)明我司這個(gè)方向是對(duì)的,至少GPT4的有效意見(jiàn)超過(guò)50%
  2. 反向看,說(shuō)明即便強(qiáng)如GPT4,其API的效果還是有限:近一半意見(jiàn)沒(méi)被采納,證明我司做審稿微調(diào)的必要性、價(jià)值性所在
  3. 審稿語(yǔ)料的組織 也還挺關(guān)鍵的,好讓模型學(xué)習(xí)起來(lái)有條條框框 有條理,分個(gè) 1 2 3 4 不混亂,瞬間get到review描述背後的邏輯、含義
    比如要是我們爬取到的審稿語(yǔ)料 也能組織成如下這4塊,我覺(jué)得 就很強(qiáng)了,模型學(xué)習(xí)起來(lái) 會(huì)很快
    1) 成果是否重要、是否新穎
    2) 論文被接受的理由
    3) 論文被拒的理由
    4) 改進(jìn)建議

對(duì)于上面的“第三大點(diǎn) 審稿語(yǔ)料的組織”,我們(特別是阿荀,其次我)創(chuàng)造性的想出來(lái)一個(gè)思路,即通過(guò)提示模板讓大模型來(lái)幫忙梳理咱們爬的審稿語(yǔ)料,好把審稿語(yǔ)料 梳理歸納出來(lái)上面所說(shuō)的4個(gè)方面的常見(jiàn)review意見(jiàn)

那怎么設(shè)計(jì)這個(gè)提示模板呢?借鑒上節(jié)中斯坦福的工作,提示模板可以在斯坦福那個(gè)模板基礎(chǔ)上,進(jìn)一步優(yōu)化如下

// 暫在「大模型項(xiàng)目開(kāi)發(fā)線上營(yíng)」中見(jiàn),至于在本文中的更新,待更

3.2.2 如何讓歸納出來(lái)的review結(jié)果更全面:多聚一

我們知道一篇paper存在多個(gè)review,而對(duì)review數(shù)據(jù)的學(xué)習(xí)有三種模式

  1. 一種是多選一
    但多選一有個(gè)問(wèn)題,即是:如果那幾個(gè)review都不是很全面呢,然后多選一的話會(huì)不會(huì)對(duì)review信息的豐富程度有損
  2. 一種是多聚一
    對(duì)多個(gè)review做一下總結(jié)歸納(阿荀、我先后想到),相當(dāng)于綜合一下,此時(shí)還是可以用GPT 3.5 16K或開(kāi)源模型幫做下review數(shù)據(jù)的多聚一

    (圖略:多聚一思路的示意)

  3. 一種是多輪交互
    這種工作量比較大,非首選

如此,最終清洗之后的24000篇paper的review,用多聚一的思路搞的話,便可以直接一次調(diào)用支持16K的GPT 3.5(畢竟16K的長(zhǎng)度足夠,可以把所有的review數(shù)據(jù)一次性給到GPT3.5 16K),或開(kāi)源模型讓它直接從所有review數(shù)據(jù)里提煉出4個(gè)要點(diǎn),大概是24000多次

3.2.3 通過(guò)最終的prompt來(lái)處理review數(shù)據(jù):ChatGPT VS 開(kāi)源模型

綜上,即是考慮多聚一策略來(lái)處理Review數(shù)據(jù),主要是對(duì)Prompting提出了更高的要求:

  1. 要求大模型聚合所有Review的觀點(diǎn)來(lái)進(jìn)行摘要
  2. 為保證規(guī)整Review的統(tǒng)一性,需提供具體的類別(如新穎性、接受原因、拒絕原因、改進(jìn)建議等)對(duì)觀點(diǎn)進(jìn)行明確“分類”
  3. 強(qiáng)調(diào)誠(chéng)實(shí)性來(lái)緩解幻覺(jué),在prompt中提供“示弱”選項(xiàng)(如回復(fù)“不知道”或允許結(jié)果為空等)
  4. 為使得后續(xù)工作更容易從大模型的輸出中獲取到所關(guān)注的信息,需對(duì)其輸出格式進(jìn)行要求
    上一節(jié)斯坦福研究者對(duì)模型review效果評(píng)估的工作看似很完美,不過(guò)其中有個(gè)小問(wèn)題,即盡管LLM可以根據(jù)指令遵循來(lái)基于Prompt的要求返回JSON格式的內(nèi)容,但并非每次都能生成得到利于解析的JSON格式內(nèi)容

相當(dāng)于咱們得基于上述要求來(lái)設(shè)計(jì)Prompt (最終設(shè)計(jì)好的prompt暫在七月在線的「大模型項(xiàng)目開(kāi)發(fā)線上營(yíng)」里講,至于本文本部分內(nèi)容的更新則明年Q1更新)

當(dāng)我們最終的prompt設(shè)計(jì)好了之后,接下來(lái),便可以讓大模型通過(guò)該prompt處理review數(shù)據(jù)了,那我們選用哪種大模型呢,是ChatGPT還是開(kāi)源模型,為此,我們對(duì)比了以下三種大模型

  1. zephyr-7b-alpha
  2. Mistral-7B-Instruct-v0.1
  3. OpenAI剛對(duì)外開(kāi)放的gpt-3.5-turbo-1106,即上一節(jié)圖中的GPT3.5 Turbo 16K

經(jīng)過(guò)對(duì)比發(fā)現(xiàn)

  • 用OpenAI的gpt-3.5-turbo-1106效果相對(duì)更好些、能力更強(qiáng),加之經(jīng)實(shí)際研判,費(fèi)用也還好 不算高

    (圖略:三個(gè)模型處理效果的對(duì)比示意)

  • 更讓OpenAI脫穎而出的是,gpt-4-1106-preview和gpt-3.5-turbo-1106版本中提供了JSON mode,在接口中傳入response_format={"type": "json_object"}啟用該模式、并在prompt中下達(dá)“以JSON格式返回”的指示后,將會(huì)返回完全符合JSON格式的內(nèi)容
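
下面是一個(gè)調(diào)用示意(並非我司實(shí)際代碼,aggregate_prompt、reviews_of_one_paper等變量均為假設(shè)),展示如何用gpt-3.5-turbo-1106的JSON mode對(duì)單篇paper的多條review做"多聚一":

# 調(diào)用示意:?jiǎn)⒂肑SON mode,讓模型聚合同一篇paper的所有review
from openai import OpenAI

client = OpenAI()
reviews_text = "\n\n".join(reviews_of_one_paper)     # 同一篇paper的所有review,變量名為假設(shè)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},         # 啟用JSON mode
    messages=[{"role": "user",
               "content": aggregate_prompt + "\n\n" + reviews_text}],  # aggregate_prompt為假設(shè)的多聚一提示詞,其中需明確"以JSON格式返回"
    temperature=0,
)
merged_review = resp.choices[0].message.content      # 返回符合JSON格式的聚合結(jié)果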

// 待更,具體怎么個(gè)對(duì)比法,以及怎么個(gè)效果更好,暫在線上營(yíng)里見(jiàn),至于本文后續(xù)更新

不過(guò)我們?cè)趯?shí)際使用的過(guò)程中,發(fā)現(xiàn)OpenAI對(duì)API的訪問(wèn)有各種限制且限制的比較嚴(yán)格(即對(duì)用戶有多層限制:https://platform.openai.com/docs/guides/rate-limits/usage-tiers?context=tier-one,比如分鐘級(jí)請(qǐng)求限制、每日請(qǐng)求限制、分鐘級(jí)token限制、每日token限制?),訪問(wèn)經(jīng)常會(huì)假死不給返回、也沒(méi)報(bào)錯(cuò),所以很多時(shí)間耗費(fèi)在被提示“訪問(wèn)超限”,然后等待又重復(fù)訪問(wèn)、再被提示超限這樣的過(guò)程,使得我們一開(kāi)始使用OpenAI的官方接口23年11.24到11.30大概7天才出了2600多條,并且后續(xù)限制訪問(wèn)的出現(xiàn)頻率愈加高,頭疼..

  1. 后面實(shí)在沒(méi)辦法,我們找了一個(gè)國(guó)內(nèi)的二手商,最終調(diào)二手商的接口,而二手商調(diào)OpenAI的接口,于此,用戶訪問(wèn)頻率限制、代理等問(wèn)題就讓二手商那邊解決了(我們也琢磨了下為何二手商可以解決這類訪問(wèn)限制的問(wèn)題,根據(jù)以往的經(jīng)驗(yàn),我們判斷,應(yīng)該是二手商那邊的OpenAI賬戶很多、代理路線很多,做了統(tǒng)一調(diào)度管理,然后在用戶調(diào)用的時(shí)候選取當(dāng)前低頻的官方賬戶來(lái)訪問(wèn)官方接口,時(shí)不時(shí)還自動(dòng)切換下代理,要知道一個(gè)代理被用來(lái)高頻訪問(wèn)OpenAI的時(shí)候,其實(shí)有可能是會(huì)被放進(jìn)黑名單的,所以持續(xù)維護(hù)一個(gè)代理池來(lái)做自動(dòng)切換也很重要?)
    當(dāng)然,二手商的接口晚上(或者別的高峰期)有時(shí)候還是會(huì)返回訪問(wèn)受限的提示,那時(shí)候應(yīng)該用的人比較多,導(dǎo)致即使“最低頻訪問(wèn)”的官方接口,訪問(wèn)頻率也不算低了,所以也會(huì)被訪問(wèn)受限
  2. 最終,使用二手商的中轉(zhuǎn)接口,12.04到12.08大概5天出了9000多條

3.2.4 對(duì)review數(shù)據(jù)的最后梳理:得到JSON文本的變體版且剔除長(zhǎng)尾數(shù)據(jù)

原本的經(jīng)過(guò)“多聚一”review側(cè)的數(shù)據(jù)由JSON mode返回所得,均為JSON格式(字典),大體形式如下

{
  "Significance and novelty": {
    大體描述: 具體描述,
    大體描述: 具體描述,
    ...
  },
  "Potential reasons for acceptance": {
    大體描述: 具體描述,
    大體描述: 具體描述,
    ...
  },
  "Potential reasons for rejection": {
    大體描述: 具體描述,
    大體描述: 具體描述,
    ...
  },
  "Suggestions for improvement": {
    大體描述: 具體描述,
    大體描述: 具體描述,
    ...
  }
}

但考慮到后續(xù)要微調(diào)的開(kāi)源模型對(duì)JSON格式的關(guān)注程度可能不足,學(xué)習(xí)JSON文本可能存在一定的困難,故最終將上述JSON格式的內(nèi)容轉(zhuǎn)為如下的格式(可以理解為JSON文本的變體版)

[Significance and novelty]
<大體描述> 具體描述
<大體描述> 具體描述
...

[Potential reasons for acceptance]
<大體描述> 具體描述
<大體描述> 具體描述
...

[Potential reasons for rejection]
<大體描述> 具體描述
<大體描述> 具體描述
...

[Suggestions for improvement]
<大體描述> 具體描述
<大體描述> 具體描述
...

即如下圖所示

(圖略:轉(zhuǎn)換後的變體格式示例)
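
這一步的轉(zhuǎn)換邏輯本身很簡(jiǎn)單,下面是一個(gè)極簡(jiǎn)示意(非我司實(shí)際代碼):

# 極簡(jiǎn)示意:把JSON mode返回的字典轉(zhuǎn)成上述"JSON文本的變體版"
import json

def json_to_variant(merged_review_json: str) -> str:
    data = json.loads(merged_review_json)
    lines = []
    for section, items in data.items():        # section如"Significance and novelty"
        lines.append(f"[{section}]")
        for brief, detail in items.items():    # brief為大體描述,detail為具體描述
            lines.append(f"<{brief}> {detail}")
        lines.append("")                        # 各部分之間空一行
    return "\n".join(lines).strip()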

且依據(jù)內(nèi)部文件《[正式方案]長(zhǎng)尾數(shù)據(jù)清洗及后續(xù)安排》,文本長(zhǎng)度過(guò)少的Review可能僅包含有一些無(wú)關(guān)緊要的信息,因此還可以考慮將長(zhǎng)度過(guò)少的Review進(jìn)行剔除(當(dāng)然,paper側(cè)也得剔除相關(guān)的長(zhǎng)尾數(shù)據(jù))

經(jīng)過(guò)一系列操作之后,數(shù)據(jù)量從22319對(duì)paper-review降到了15566

接著通過(guò)設(shè)計(jì)相關(guān)指令,且結(jié)合處理后的Paper及Review(一篇paper對(duì)應(yīng)一篇review),最終得到一份類Alpaca格式的數(shù)據(jù)集(instruction-input-output三元組數(shù)據(jù)),如下所示

[
    {
        "instruction": "You are a professional machine learning conference reviewer who reviews a given paper and considers 4 criteria: ** importance and novelty **, ** potential reasons for acceptance **, ** potential reasons for rejection **, and ** suggestions for improvement **. \nThe given paper is as follows: \n\n\n",
        "input": "[TITLE]\nImage Quality Assessment Techniques Improve Training and Evaluation of Energy-Based Generative Adversarial Networks\n\n[ABSTRACT]\nWe propose a new, multi-component energy function for energy-based Generative Adversarial Networks (GANs) based on methods from the image quality assessment literature. Our approach expands on the Boundary Equilibrium Generative Adversarial Network (BEGAN) by outlining some of the short-comings of the original energy and loss functions. We address these short-comings by incorporating an l1 score, the Gradient Magnitude Similarity score, and a chrominance score into the new energy function. We then provide a set of systematic experiments that explore its hyper-parameters. We show that each of the energy function's components is able to represent a slightly different set of features, which require their own evaluation criteria to assess whether they have been adequately learned. We show that models using the new energy function are able to produce better image representations than the BEGAN model in predicted ways.\n\n[CAPTIONS]\nFigure 1: From left to right, the images are the original image, a contrast stretched image, an image with impulsive noise contamination, and a Gaussian smoothed image. Although these images differ greatly in quality, they all have the same MSE from the original image (about 400), suggesting that MSE is a limited technique for measuring image quality.\nFigure 2: Comparison of the gradient (edges in the image) for models 11 (BEGAN) and 12 (scaled BEGAN+GMSM), where O is the original image, A is the autoencoded image, OG is the gradient of the original image, AG is the gradient of the autoencoded image, and S is the gradient magnitude similarity score for the discriminator (D) and generator (G). White equals greater similarity (better performance) and black equals lower similarity for the final column.\nFigure 3: Comparison of the chrominance for models 9 (BEGAN+GMSM+Chrom), 11 (BEGAN) and 12 (scaled BEGAN+GMSM), where O is the original image, OC is the original image in the corresponding color space, A is the autoencoded image in the color space, and S is the chrominance similarity score. I and Q indicate the (blue-red) and (green-purple) color dimensions, respectively. All images were normalized relative to their maximum value to increase luminance. Note that pink and purple approximate a similarity of 1, and green and blue approximate a similarity of 0 for I and Q dimensions, respectively. The increased gradient 'speckling' of model 12Q suggests an inverse relationship between the GMSM and chrominance distance functions.\nTable 1: Models and their corresponding model distance function parameters. The l 1 , GMSM, and Chrom parameters are their respective β d values from Equation 8.\nTable 2: Lists the models, their discriminator mean error scores, and their standard deviations for the l 1 , GMSM, and chrominance distance functions over all training epochs. Bold values show the best scores for similar models. Double lines separate sets of similar models. Values that are both bold and italic indicate the best scores overall, excluding models that suffered from modal collapse. These results suggest that model training should be customized to emphasize the relevant components.\n\n[CONTENT]\nSection Title: INTRODUCTION\n  INTRODUCTION\n\nSection Title: IMPROVING LEARNED REPRESENTATIONS FOR GENERATIVE MODELING\n  IMPROVING LEARNED REPRESENTATIONS FOR GENERATIVE MODELING   Radford et al. 
(2015)  demonstrated that Generative Adversarial Networks (GANs) are a good unsu- pervised technique for learning representations of images for the generative modeling of 2D images. Since then, a number of improvements have been made. First,  Zhao et al. (2016)  modified the error signal of the deep neural network from the original, single parameter criterion to a multi-parameter criterion using auto-encoder reconstruction loss.  Berthelot et al. (2017)  then further modified the loss function from a hinge loss to the Wasserstein distance between loss distributions. For each modification, the proposed changes improved the resulting output to 

# 中間一大段略去..

The first row is model 1, which corresponds with the original BEGAN model. Rows 2-12 represent our experiments. Each cell represents the output of a random sample.  \n",
        "output": "[Significance and novelty]\n<Incremental nature of the paper> Reviewers highlighted that the paper's proposed approach was somewhat incremental, and its significance and novelty were questioned. The modifications and additions to the existing BEGAN framework were not perceived as significantly contributing to the field.\n<Limited novelty of energy function components> The paper's novelty was deemed to be restricted, particularly in the choice and justification of the energy function components. Reviewers pointed out that the selected components were borrowed from existing image quality assessment techniques, and their application was not seen as a significant contribution to GAN theories or algorithm designs.\n\n[Potential reasons for acceptance]\n<No related terms> -\n\n[Potential reasons for rejection]\n<Weak experiments and results analysis> Reviewers noted that the experimental section was weak, with reliance on visual evaluations and a limited range of datasets (primarily CelebA). This led to concerns about the thoroughness and robustness of the paper's findings.\n<Unclear significance and contribution> The paper was criticized for its lack of clarity in motivating its contributions and presenting results. Reviewers found it difficult to estimate the significance of the proposed model and understand how its results differed from baseline models.\n\n[Suggestions for improvement]\n<Broaden experimental testing> To strengthen the paper, reviewers suggested broadening the experimental testing to include different datasets involving natural images, beyond the single CelebA dataset. This would provide a more comprehensive evaluation of the proposed techniques.\n<Clarify and justify design choices> Improvements in the paper's clarity and justification were recommended, specifically in clarifying the design choices made for the energy function components. Providing clear justifications for the modifications and additions to the BEGAN framework would enhance the paper's credibility and significance.\n\n"
    },
    {
        "instruction": "You are a professional machine learning conference reviewer who reviews a given paper and considers 4 criteria: ** importance and novelty **, ** potential reasons for acceptance **, ** potential reasons for rejection **, and ** suggestions for improvement **. \nThe given paper is as follows: \n\n\n",
        "input": "[TITLE]\nLearning Independent Features with Adversarial Nets for Non-linear ICA\n\n[ABSTRACT]\nReliable measures of statistical dependence could potentially be useful tools for learning independent features and performing tasks like source separation using Independent Component Analysis (ICA).  Unfortunately, many of such measures, like the mutual information, are hard to estimate and optimize directly.  We propose to learn independent features with adversarial objectives (Goodfellow et al. 2014, Arjovsky et al. 2017) which optimize such measures implicitly.  These objectives compare samples from the joint distribution and the product of the marginals without the need to compute any probability densities. We also propose two methods for obtaining samples from the product of the marginals using either a simple resampling trick or a separate parametric distribution.  Our experiments show that this strategy can easily be applied to different types of model architectures and solve both linear and non-linear ICA problems.\n\n\n[CAPTIONS]\nFigure 1: Schematic representation of the entire system for learning non-linear ICA. Specific func- tional shapes can be enforced by choosing a suitable decoder architecture.\nFigure 2: The decoder architecture used for the PNL experiments. It can only learn transformations in which a linear transformation is followed by the application of non-linear scalar functions to each of the dimensions.\nFigure 3: Source signals used in the experiments.\nFigure 4: Convergence plots for the linear synthetic source task.\nTable 1: Maximum correlation results on all tasks for the synthetic data. A 'g' in the suffix of the model name indicates that a separate generator network was used instead of the resampling method. Two scores separated by a '/' indicate that the first score was obtained using the model selection described in Section 6.1 while the second score is simply the best score a posteriori. Parentheses refer contain the standard deviation of the scores multiplied with 10 ?4 .\nTable 2: Maximum correlation results on all tasks for the audio data. A 'g' in the suffix of the model name indicates that a separate generator network was used instead of the resampling method. Two scores separated by a '/' indicate that the first score was obtained using the model selection described in Section 6.1 while the second score is simply the best score a posteriori. Parentheses refer contain the standard deviation of the scores multiplied with 10 ?4 .\n\n[CONTENT]\nSection Title: INTRODUCTION\n  INTRODUCTION When unsupervised models are developed to learn interesting features of data, this often implies that each feature represents some interesting unique property of the data that cannot easily be derived from the other features. A sensible way to learn such features is by ensuring that they are statistically independent. 

# 中間一大段略去..

Future work should be devoted to a more thorough theoretical analysis of of the proposed methods for minimizing and measuring dependence and how to evaluate them.\n  See the appendix for more details about the synthetic signals.\n",
        "output": "[Significance and novelty]\n<Adversarial objectives for computing independent features> The use of adversarial objectives for computing independent features sheds new light on the problem of independent component analysis. It presents an interesting and potentially innovative approach to addressing the issue.\n<Comparison with existing methods for independent component analysis (ICA)> The paper's comparison with existing methods for ICA, such as linear and post-nonlinear ICA, highlights the attempt to address a wide range of ICA problems using the proposed GAN variant. This comprehensive approach could contribute to the literature on ICA methods.\n\n[Potential reasons for acceptance]\n<Conceptually thought-provoking> The paper presents a conceptually thought-provoking approach to independent component analysis using adversarial training, which could contribute to the advancement of ICA methods.\n<Coverage of linear and non-linear ICA problems> The coverage of both linear and non-linear ICA problems demonstrates the broad applicability of the proposed GAN-based approach, potentially adding value to the field of independent component analysis.\n\n[Potential reasons for rejection]\n<Lack of clarity and focus in presentation> Reviewers have expressed concerns about the lack of clarity, focus, and thorough analysis in the presentation of the proposed GAN variant for ICA, leading to a marginal rating below the acceptance threshold.\n<Inadequate comparison with prior work> Reviewers have noted that the comparison with existing methods, such as linear and post-nonlinear ICA, is inadequate, and the paper lacks comprehensive analysis and evaluation, resulting in a rating marginally below the acceptance threshold.\n\n[Suggestions for improvement]\n<Streamlining focus and discussion> The authors should focus their discussion on addressing specific ICA problems, streamlining the presentation, and providing a more focused and in-depth analysis of the proposed GAN variant for ICA. Emphasizing the novelty and significance of the approach could strengthen the paper.\n<Comprehensive comparative analysis> Enhancing the comparison with prior work, especially in the context of linear and non-linear ICA, and providing a more thorough evaluation of the proposed method would address concerns raised by the reviewers and potentially improve the paper's acceptance prospects.\n\n"
    },
    # 總計(jì)15566條..

至于完整數(shù)據(jù)集,我司的大模型項(xiàng)目開(kāi)發(fā)線上營(yíng)里見(jiàn)

再考慮到單條數(shù)據(jù)算作“instruction+input+output”的拼接,使用Mistral的tokenizer對(duì)各條數(shù)據(jù)進(jìn)行分詞,并統(tǒng)計(jì)數(shù)據(jù)的token數(shù)

(圖略:各條數(shù)據(jù)token數(shù)的分布)

由上圖大致可了解到單條token數(shù)大致在6000至12000的區(qū)間,較多數(shù)據(jù)的長(zhǎng)度分布在8500左右,因此后續(xù)在為訓(xùn)練模型設(shè)定序列裁切(cut off)長(zhǎng)度時(shí)選擇11264或12288比較合適
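
統(tǒng)計(jì)token數(shù)的做法可用如下示意代碼表示(非我司實(shí)際代碼,dataset變量為假設(shè)):

# 統(tǒng)計(jì)示意:用Mistral的tokenizer統(tǒng)計(jì)每條"instruction+input+output"的token數(shù)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

def count_tokens(sample: dict) -> int:
    text = sample["instruction"] + sample["input"] + sample["output"]
    return len(tokenizer(text)["input_ids"])

# lengths = [count_tokens(s) for s in dataset]   # dataset即上文的類Alpaca格式數(shù)據(jù)
# 對(duì)lengths畫直方圖即可得到類似上圖的長(zhǎng)度分布,據(jù)此確定cut off長(zhǎng)度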

3.3 (選讀)相關(guān)工作之AcademicGPT:增量訓(xùn)練LLaMA2-70B,包含論文審稿功能

3.3.1?AcademicGPT: Empowering Academic Research

11月下旬,我司第二項(xiàng)目組的阿荀發(fā)現(xiàn)

  1. 有一個(gè)團(tuán)隊(duì)于23年11.21日,在arXiv上提交了一篇論文《AcademicGPT: Empowering Academic Research》,論文中提出了AcademicGPT,其通過(guò)學(xué)術(shù)數(shù)據(jù)在LLaMA2-70B的基礎(chǔ)上經(jīng)過(guò)繼續(xù)訓(xùn)練得到的
  2. 然后該團(tuán)隊(duì)AcademicGPT的基礎(chǔ)之上延伸開(kāi)發(fā)了4個(gè)方面的應(yīng)用:學(xué)術(shù)問(wèn)答、論文輔助閱讀、論文評(píng)審、標(biāo)題和摘要的輔助生成等功能,由于其中的論文問(wèn)答、論文摘要等功能已經(jīng)很常見(jiàn)了(比如此文提到的chatpaper、中科院一團(tuán)隊(duì)的gpt_academic都通過(guò)GPT3.5的API做了還可以的實(shí)現(xiàn)),但論文評(píng)審此前一些開(kāi)源工具通過(guò)GPT3.5做的效果并不好,所以既然AcademicGPT做了論文審稿這一功能,而且還用了70B的模型,那必須得關(guān)注一波,于是便仔細(xì)研究了下他們的論文
    (當(dāng)然,我相信,他們很快也會(huì)關(guān)注到我司論文審稿GPT這個(gè)工作,然後改進(jìn)他們的訓(xùn)練策略,畢竟同行之間互相借鑒,並不為怪)

他們與我們有兩點(diǎn)顯著不同的是,一者,他們對(duì)LLaMA做了增量預(yù)訓(xùn)練(AcademicGPT is a continual pretraining on LLaMA2),二者,我司目前的論文審稿GPT暫只針對(duì)英文論文的評(píng)審(畢竟七月的客戶要發(fā)論文的話,以英文EI期刊、SCI論文為主,其次才中文期刊),而他們還考慮到了中文,考慮到LLaMA2-70B有限的中文能力與學(xué)術(shù)領(lǐng)域知識(shí),所以他們收集中文數(shù)據(jù)和學(xué)術(shù)英文數(shù)據(jù)來(lái)對(duì)相關(guān)方面進(jìn)行提高

  • 中文數(shù)據(jù):取自CommonCrawl、Baike、Books等(此外還從互聯(lián)網(wǎng)爬取了200K學(xué)術(shù)文本)
    由于CC這類數(shù)據(jù)通常會(huì)包含很多廣告、色情等有害信息,所以需要對(duì)其進(jìn)行數(shù)據(jù)清洗,最終他們借助LLM且使用下圖所示的Prompt來(lái)對(duì)取自互聯(lián)網(wǎng)的數(shù)據(jù)進(jìn)行清洗,比如對(duì)文檔進(jìn)行各種標(biāo)注
    (根據(jù)論文原文,我們判斷,他們應(yīng)該是先讓模型基于人類給的prompt 對(duì)一些文本做標(biāo)注,之后ChatGPT對(duì)同樣那些文本做標(biāo)注,最后對(duì)比這兩者之間的差異,建損失函數(shù) 然后微調(diào)模型本身,差不多后,模型對(duì)剩下的文本做標(biāo)注)

    (圖略:AcademicGPT用於數(shù)據(jù)清洗/標(biāo)注的Prompt示意)

  • 英文學(xué)術(shù)數(shù)據(jù):爬取來(lái)自200所頂尖大學(xué)的100多萬(wàn)篇Paper、Arxiv的226萬(wàn)篇Paper(截止到23年5月),其中較長(zhǎng)的Paper使用Nougat進(jìn)行解析(和我司七月一樣)、較短的Paper使用研究團(tuán)隊(duì)自己的解析器,此外還有4800多萬(wàn)篇來(lái)自u(píng)npaywall2的免費(fèi)學(xué)術(shù)文章,以及Falcon開(kāi)源的數(shù)據(jù)集中的學(xué)術(shù)相關(guān)數(shù)據(jù)

基于上述所得的120B的數(shù)據(jù),他們使用192個(gè)40G顯存的A100 GPU進(jìn)行繼續(xù)二次預(yù)訓(xùn)練(他們的所有工作我沒(méi)有任何羨慕,但唯獨(dú)他們有192塊A100,讓我個(gè)人著實(shí)羨慕了一把,好期待有哪個(gè)大豪可以解決下我司七月的GPU緊缺問(wèn)題,^_^),最終通過(guò)37天的訓(xùn)練,使得LLaMA2-70B進(jìn)一步獲得理解中文與學(xué)術(shù)內(nèi)容的能力,以下是關(guān)于訓(xùn)練的更多細(xì)節(jié)

  1. 且為了加快訓(xùn)練過(guò)程,用了FlashAttention2 (Dao, 2023),它不僅加快了注意力模塊的速度,而且節(jié)省了大量?jī)?nèi)存,且通過(guò)Apex RMSNorm實(shí)現(xiàn)了融合cuda內(nèi)核(Apex RMSNorm that implements a fused cuda kernel)
  2. 由於AcademicGPT是LLaMA2-70B的二次訓(xùn)練模型,因此它使用了一些與LLaMA2相同的技術(shù),包括
    RMSNorm (Zhang and Sennrich, 2019)而不是LayerNorm,
    SwiGLU (Shazeer, 2020)而不是GeLU
    對(duì)于位置嵌入,它使用RoPE (Su et al., 2021)而不是Alibi(Press et al., 2021)
    對(duì)于tokenizer,它使用BPE (Sennrich等人,2015)
    且使用DeepSpeed (Rasley et al., 2020)和Zero (Rajbhandari et al., 2020),且他們的訓(xùn)練基於gpt-neox (Black et al., 2022)框架,其中集成了許多新引入的技術(shù)。使用具有40GB顯存的192個(gè)A100 GPU完成120B數(shù)據(jù)的訓(xùn)練需要大約37天

3.3.2 論文評(píng)審:借鑒ReviewAdvisor抽取出review的7個(gè)要點(diǎn)(類似我司借鑒斯坦福工作把review歸納出4個(gè)要點(diǎn))

他們和我司一樣,都是從同一帶有論文review的網(wǎng)站上收集了29119篇Paper和約79000條Review,然后經(jīng)過(guò)下述處理

  • Paper側(cè)處理:剔除了7115篇無(wú)內(nèi)容或無(wú)Review的Paper、剔除了解析失敗的Paper
  • Review側(cè)處理:
    • 剔除了具有過(guò)多換行符的Review
    • 剔除了過(guò)短(少于100 tokens),或過(guò)長(zhǎng)(多于2000 tokens)的Review
    • 剔除了與Decision Review決策不一致、且confidence低的Review
    • 抽取review要點(diǎn)
      和我司「3.2.1 設(shè)計(jì)更好的提示模板以讓大模型幫梳理出來(lái)review語(yǔ)料的4個(gè)內(nèi)容點(diǎn)」類似,他們則參考的是《Can We Automate Scientific Reviewing?》中定義的8個(gè)方面要點(diǎn)(去掉Summary後餘下7個(gè)):
      1 動(dòng)機(jī)/影響 Motivation/Impact
      2 原創(chuàng)性 Originality
      3 合理性/正確性 Soundness/Correctness
      4 實(shí)質(zhì)性 Substance
      5 可復(fù)現(xiàn)性 Replicability
      6 有意義的對(duì)比 Meaningful Comparison
      7 清晰程度 Clarity
      且使用該論文的源碼對(duì)Review進(jìn)行進(jìn)一步標(biāo)注,然後抽取出相應(yīng)的要點(diǎn)
      具體而言,他們?cè)跉w納單篇review的7或8個(gè)要點(diǎn)時(shí),給定一篇review,讓BERT逐個(gè)逐個(gè)的標(biāo)注出每個(gè)token/詞它可能所屬的要點(diǎn)類別,BERT對(duì)整篇review標(biāo)注完以後,把有要點(diǎn)標(biāo)注結(jié)果的內(nèi)容給抽出來(lái),比如下圖,模型逐個(gè)對(duì)每個(gè)token標(biāo)注出了summary(紫色)、clarity(黃色)、substance(橙色)等等,然後把帶顏色的要點(diǎn)部分抽出來(lái)作為該篇review的歸納,其中+表示積極的情緒,-表示負(fù)面情緒

      (圖略:BERT對(duì)review逐token標(biāo)注要點(diǎn)的示例)

最終,經(jīng)過(guò)上面一系列梳理之后得到的paper數(shù)據(jù) + 歸納好的review數(shù)據(jù)去微調(diào)70B模型

為方便大家理解,我補(bǔ)充一下關(guān)于這篇《Can We Automate Scientific Reviewing?》的解釋說(shuō)明


事實(shí)上,該篇論文的視角在于將“Review”視作對(duì)Paper的摘要與對(duì)應(yīng)內(nèi)容的評(píng)估,以此保證事實(shí)正確性。因此該篇論文考慮將Paper Review問(wèn)題建模為摘要生成任務(wù),采用當(dāng)時(shí)(2021)較為先進(jìn)的BART模型進(jìn)行訓(xùn)練,得到ReviewAdvisor模型

通過(guò)設(shè)計(jì)好的評(píng)估系統(tǒng),得出如下觀察:

  • 模型容易生成非事實(shí)性陳述
  • 模型尚未學(xué)習(xí)到高級(jí)理解,如沒(méi)法實(shí)質(zhì)地分辨Paper的高質(zhì)量與低質(zhì)量
  • 模型傾向于模仿訓(xùn)練數(shù)據(jù)的語(yǔ)言風(fēng)格(傾向低級(jí)模式),如容易生成訓(xùn)練樣本中的高頻句子
  • 可以較好地概括論文核心思想

最終結(jié)論是:“模型評(píng)審還尚未能替代人工評(píng)審,但可以輔助人工進(jìn)行評(píng)審”


這項(xiàng)工作有兩個(gè)值得關(guān)注的地方:

  • 增強(qiáng)Review數(shù)據(jù)(通過(guò)BERT對(duì)review數(shù)據(jù)抽取式歸納出8個(gè)要點(diǎn)、然后人工做校正)
    對(duì)于相對(duì)雜亂的Review內(nèi)容來(lái)說(shuō),研究團(tuán)隊(duì)只想保留有用的“結(jié)構(gòu)化”內(nèi)容,因此他們將從定義“結(jié)構(gòu)化方面”開(kāi)始,從Review中取出相應(yīng)的結(jié)構(gòu)化內(nèi)容,由此實(shí)現(xiàn)Review側(cè)的數(shù)據(jù)增強(qiáng)

    (圖略:Review結(jié)構(gòu)化要點(diǎn)的示意)

    1 定義結(jié)構(gòu)化方面
    研究團(tuán)隊(duì)討論出了他們所認(rèn)為的一篇“好的Review”所應(yīng)該具備的各個(gè)方面,包括如下8個(gè)要點(diǎn):
    Summary(SUM):總結(jié)摘要
    Motivation/Impact(MOT):動(dòng)機(jī)/影響
    Originality(ORI):原創(chuàng)性
    Soundness/Correctness(SOU):合理性/正確性
    Substance(SUB):實(shí)質(zhì)性
    Replicability(REP):可復(fù)現(xiàn)性
    Meaningful Comparison(CMP):有意義的對(duì)比
    Clarity(CLA):清晰程度

    2 人工標(biāo)注
    研究團(tuán)隊(duì)邀請(qǐng)6名具有機(jī)器學(xué)習(xí)背景的學(xué)生對(duì)原本的Review進(jìn)行注釋,注釋手法傾向於"抽取式摘要",即標(biāo)注原文本中哪些片段屬於何種類別「which are Summary (SUM), Motivation/Impact (MOT), Originality (ORI), Soundness/Correctness (SOU), Substance (SUB), Replicability (REP), Meaningful Comparison (CMP) and Clarity (CLA)」
    類似于“... ... The results are new[Positive Originality] and important to this field[Positive Motivation] ... ...”

    3 訓(xùn)練標(biāo)注器
    考慮到人工標(biāo)注全部數(shù)據(jù)並不現(xiàn)實(shí),使用第2步標(biāo)注過(guò)的Review數(shù)據(jù)訓(xùn)練一個(gè)BERT抽取模型作為標(biāo)注器,用於自動(dòng)標(biāo)注原Review中的方面項(xiàng)。即輸入Review文本,BERT對(duì)文本進(jìn)行逐token分類預(yù)測(cè),預(yù)測(cè)出Review哪些部分屬於哪些方面(逐token分類的極簡(jiǎn)示意見(jiàn)下文)

    4 后處理
    使用標(biāo)注器BERT對(duì)余下數(shù)據(jù)進(jìn)行標(biāo)注后,其結(jié)果并不完全可信(畢竟BERT的能力沒(méi)有像GPT3.5那么強(qiáng),即結(jié)果沒(méi)那么可信),需要制定規(guī)則或使用人工對(duì)標(biāo)注器的預(yù)測(cè)結(jié)果進(jìn)行校正

    5 人工檢查
    邀請(qǐng)具有機(jī)器學(xué)習(xí)背景的人員檢查標(biāo)注結(jié)果
  • 生成Review(通過(guò)paper和BERT抽取且人工校正過(guò)的review語(yǔ)料,微調(diào)BART)
    根據(jù)給定Paper生成Review,模型選型為彼時(shí)最大長(zhǎng)度為1024的BART模型,考慮到Paper的長(zhǎng)度較長(zhǎng),因此整個(gè)生成Review的方案被設(shè)計(jì)成了兩階段的形式,即首先從Paper中擇取出突出片段(輸入上下文長(zhǎng)度壓縮),然后基于這些突出片段來(lái)生成review摘要

    選取突出片段
    使用諸如“demonstrate”“state-of-the-art”等關(guān)鍵詞及對(duì)句子的諸多規(guī)則判斷來(lái)確定突出片段

    訓(xùn)練方面感知摘要(Aspect-aware Summarizaiton)模型
    基於基礎(chǔ)Seq2Seq模型實(shí)現(xiàn)的是由輸入序列(Paper)預(yù)測(cè)輸出序列(Review)的過(guò)程,研究團(tuán)隊(duì)在這個(gè)基礎(chǔ)上引入了"方面感知"來(lái)輔助模型進(jìn)行預(yù)測(cè),強(qiáng)調(diào)模型對(duì)"方面要點(diǎn)"的輸出,即引入兩個(gè)多層感知機(jī)來(lái)分別進(jìn)行生成任務(wù):模型不僅要逐token生成Review內(nèi)容,還要逐token預(yù)測(cè)其對(duì)應(yīng)的"方面要點(diǎn)"

    (圖略:方面感知摘要模型的結(jié)構(gòu)示意)

    因此模型需要同時(shí)學(xué)習(xí)兩個(gè)損失函數(shù)(原文此處為兩個(gè)損失的公式,圖略)
    這也意味著模型在一次推理中將輸出2條序列,其一為預(yù)測(cè)的Review內(nèi)容(對(duì)應(yīng)其生成損失),其二為預(yù)測(cè)的方面要點(diǎn)(對(duì)應(yīng)其要點(diǎn)預(yù)測(cè)損失),兩個(gè)損失共同優(yōu)化
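
上文"訓(xùn)練標(biāo)注器"一步讓BERT做逐token分類的過(guò)程,可用如下極簡(jiǎn)示意理解(並非ReviewAdvisor的原始代碼,此處未區(qū)分正負(fù)情緒、模型也需先用人工標(biāo)注的review數(shù)據(jù)微調(diào)後才可用):

# 原理示意:用BERT對(duì)review文本做逐token的"要點(diǎn)"分類
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "SUM", "MOT", "ORI", "SOU", "SUB", "REP", "CMP", "CLA"]   # O表示不屬於任何要點(diǎn)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)   # 實(shí)際使用前需先在人工標(biāo)注數(shù)據(jù)上微調(diào)

text = "The results are new and important to this field."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # 形狀為[1, seq_len, num_labels]
pred = logits.argmax(-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tagged = [(tok, labels[int(p)]) for tok, p in zip(tokens, pred.tolist())]
# 把連續(xù)被標(biāo)為同一要點(diǎn)的片段抽出來(lái),即得到該篇review的要點(diǎn)歸納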

3.3.3 70B的AcademicGPT在論文審稿上效果不佳的原因

根據(jù)原論文中展示的對(duì)一些論文做審稿的案例來(lái)看,其效果并不佳

下圖是論文中的兩個(gè)審稿案例

  1. 下圖是論文中的審稿案例1,可以看出來(lái),它指出對(duì)應(yīng)論文的缺點(diǎn):"寫作需要打磨。存在太多的拼寫和語(yǔ)法錯(cuò)誤。實(shí)驗(yàn)設(shè)置不夠令人信服。首先,沒(méi)有提供基線。其次,作者僅在單一數(shù)據(jù)集上進(jìn)行了實(shí)驗(yàn)。第三,作者沒(méi)有報(bào)告結(jié)果的方差。"
    這種審稿意見(jiàn)對(duì)於論文作者本身而言,參考價(jià)值可能不大,畢竟當(dāng)你指出有太多的拼寫和語(yǔ)法錯(cuò)誤,最好是具體指出來(lái)所謂的拼寫和語(yǔ)法錯(cuò)誤是在論文中哪一段

    (圖略:審稿案例1的截圖)

  2. 下圖是論文中的審稿案例2

    (圖略:審稿案例2的截圖)

    但第5個(gè)Weaknesses的點(diǎn)「5. The writing of the paper could be improved. For example, the authors should explain what x_{t,i} means in Eq. (1)」,是說(shuō)論文應(yīng)該解釋下公式(1)中 x_{t,i} 的含義,但原論文的公式(1)並不涉及 x_{t,i}

    (圖略:原論文公式(1)的截圖)

而效果不佳的原因有多個(gè)方面,下面更多是對(duì)比他們與我司做法的不一致之處

  1. Focus程度不一樣
    與我司現(xiàn)在整個(gè)項(xiàng)目組全力以赴迭代論文審稿GPT不同,對(duì)于AcademicGPT而言,論文審稿只是他們4大應(yīng)用中的其中一塊,當(dāng)然,他們選用的基座模型的參數(shù)規(guī)模更大、卡也比我們多
  2. 做摘要抽取的模型不一樣
    他們通過(guò)BERT對(duì)review數(shù)據(jù)抽取出7個(gè)要點(diǎn),而早期模型BERT抽取的結(jié)果不一定準(zhǔn)確(即便加了一定的人工校正),畢竟和我司所用的GPT3.5還是沒(méi)法比的
  3. 做摘要抽取時(shí)的策略不一樣
    即他們通過(guò)BERT對(duì)review做抽取式摘要時(shí),直接抽取review原話(通過(guò)"抽取式摘要"拿出來(lái),相當(dāng)於抽取、挪動(dòng)、組合原有的review詞),可review是由各式各樣的人寫的,原話風(fēng)格高度不統(tǒng)一,模型可能會(huì)收斂困難

    總之,他們與我司論文審稿GPT的差異就在于
    他們是抽取式提取要點(diǎn)、我司是生成式歸納要點(diǎn)
    • 抽取式抽出來(lái)的是原話,但是原話措辭風(fēng)格迥異,而且抽取式模型能力有限需要做很多人工核驗(yàn)等後處理
    • 但對(duì)於生成式而言,尤其是LLM的生成式可以根據(jù)要求生成相對(duì)統(tǒng)一的措辭風(fēng)格
  4. review本身信息的全面程度不一樣
    他們把各個(gè)review抽取出7個(gè)要點(diǎn)后,沒(méi)繼續(xù)做多聚一的操作
    我司把各個(gè)review歸納出4個(gè)要點(diǎn)后,為讓單篇paper所對(duì)應(yīng)的review信息更加全面,做了多聚一的操作

    所以雖然AcademicGPT最終基于LLaMA2-70B去微調(diào),模型參數(shù)規(guī)模比我司選用的大
    但因?yàn)閞eview數(shù)據(jù)的質(zhì)量有限,最終效果自然不會(huì)太好

當(dāng)然,在沒(méi)有實(shí)際開(kāi)源出來(lái)讓用戶使用之前,也不好下太多論斷,具體等他們先對(duì)外開(kāi)放吧(且他們看到本文后,我相信很快也會(huì)改進(jìn))


第四部分 模型的選型:從Mistral、Mistral-YaRN到LongLora LLaMA

23年12月中旬,本項(xiàng)目總算要走到模型選型階段了,在此前的工作:數(shù)據(jù)的處理和數(shù)據(jù)的質(zhì)量提高上,下足了功夫,用了各種策略 也用了最新的GPT3.5 16K幫歸納review信息,整個(gè)全程是典型的大模型項(xiàng)目開(kāi)發(fā)流程

而論文審稿GPT第二版在做模型選型的時(shí)候,我司一開(kāi)始考慮了三個(gè)候選模型:

  1. Mistral
  2. Yarn-Mistral-7b-64k
  3. LLaMA-LongLora

以下逐一介紹這三個(gè)模型,以及對(duì)應(yīng)的訓(xùn)練細(xì)節(jié)、最終效果

4.1?Mistral 7B:通過(guò)分組查詢注意力 + 滑動(dòng)窗口注意力超越13B模型

今年5月,DeepMind和Meta的三位前員工在巴黎共同創(chuàng)立了Mistral AI(其CEO Arthur Mensch此前在DeepMind巴黎工作,CTO Timothée Lacroix和首席科學(xué)家Guillaume Lample則在Meta共同參與過(guò)LLaMA一代的研發(fā),很像當(dāng)年OpenAI的部分員工出走成立Anthropic啊),今年10月,他們發(fā)布了第一個(gè)基座大模型,即Mistral 7B

(圖略:Mistral AI相關(guān)配圖)

據(jù)其對(duì)應(yīng)的論文《Mistral 7B》稱(另,這是其GitHub地址),以下是「模型參數(shù)圖」

(圖略:Mistral 7B模型參數(shù)圖,含dim 4096、n_heads 32、head_dim 128、n_kv_heads 8、window_size 4096、context_len 8192)

  1. Mistral 7B在所有評(píng)估基準(zhǔn)中均勝過(guò)了目前最好的13B參數(shù)模型(Llama 2),并在推理、數(shù)學(xué)和代碼生成方面超越了Llama 34B
    Mistral 7B outperforms the previous best 13B model (Llama 2, [Llama 2: Open foundation and fine-tuned chat models]) across all tested benchmarks, and surpasses the best 34B model (LLaMa 34B, [Llama: Open and efficient foundation language models]) in mathematics and code generation.
  2. 該模型采用了分組查詢注意力(GQA),GQA顯著加快了推理速度,還減少了解碼期間的內(nèi)存需求,允許更高的批處理大小,從而提高吞吐量
    GQA significantly accelerates the inference speed, and also reduces the memory requirement during decoding, allowing for higher batch sizes hence higher throughput

    所以你看上面的「模型參數(shù)圖」,維度(dim):4096,總計(jì)32個(gè)頭(n_heads),每個(gè)頭的維度(head_dim):128,這一眼可以看出來(lái),而n_kv_heads是啥呢?
    乍一看好像不太好理解 是不?其實(shí),正是因?yàn)镸istral用了GQA,n_heads指的是Q的頭數(shù),n_kv_heads指的是K、V的頭數(shù)

    (圖略:MHA/GQA/MQA三種注意力的對(duì)比示意)

    不過(guò)要注意的是,與上圖中間所示部分不太一樣的地方在于:
    • 上圖中間所示部分中,Q的頭數(shù)是K V頭數(shù)的2倍

    • 但在Mistral的GQA中,Q的頭數(shù)是K V頭數(shù)的4倍

    關(guān)于GQA的更多介紹,請(qǐng)參見(jiàn)《一文通透各種注意力:從多頭注意力MHA到分組查詢注意力GQA、多查詢注意力MQA》
  3. 同時(shí)結(jié)合滑動(dòng)窗口注意力(sliding window attention,簡(jiǎn)稱SWA)以有效處理任意長(zhǎng)度的序列,
    SWA is designed to handle longer sequences more effectively at a reduced computational cost

    包括你再看上上張圖所示的「模型參數(shù)圖」,可知context_len 8192是說(shuō)它訓(xùn)練的時(shí)候,傳進(jìn)來(lái)的數(shù)據(jù)最大只能到8192個(gè)tokens,也就是訓(xùn)練時(shí)的上下文長(zhǎng)度上限,
    windows_size 4096是sliding windows attention的滑窗大小,1次attention計(jì)算的上下文范圍只4096個(gè)tokens

    言外之意是,每個(gè)token只最多計(jì)算4096的范圍
    第5000個(gè)token只計(jì)算[905: 5000]這個(gè)范圍的attention
    第5001個(gè)token只計(jì)算[906: 5001]這個(gè)范圍的attention
    以此類推..

此外,作者提供了一個(gè)針對(duì)遵循指令進(jìn)行了微調(diào)的模型,名為Mistral 7B-Instruct,它在人工和自動(dòng)化基準(zhǔn)測(cè)試中均超過(guò)了LLaMA 2 13B-chat模型

4.1.1 滑動(dòng)窗口注意力:擴(kuò)展上下文長(zhǎng)度

vanilla attention的操作次數(shù)在序列長(zhǎng)度上是二次型的,記憶量隨著token數(shù)量線性增加。在推理時(shí),由於緩存可用性的降低,這導(dǎo)致了更高的延遲和更小的吞吐量(The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability)

為了緩解這個(gè)問(wèn)題,Mistral 7B使用滑動(dòng)窗口注意力(sliding window attention)

(圖略:滑動(dòng)窗口注意力示意,W = 3)

  1. 每個(gè)token最多可以關(guān)注來(lái)自上一層的W個(gè)token(上圖中,W = 3)。請(qǐng)注意,滑動(dòng)窗口之外的token仍然影響下一個(gè)單詞預(yù)測(cè)
    each token can attend to at most W tokens from the previous layer (here, W = 3). Note that tokens outside the sliding window still influence next word prediction.

    舉個(gè)例子,在面對(duì)這個(gè)序列時(shí):The cat sat on the
    如果是標(biāo)準(zhǔn)注意力,在計(jì)算最后一個(gè)token “the”時(shí),得計(jì)算the本身所對(duì)應(yīng)的query與整個(gè)上文每個(gè)token對(duì)應(yīng)的key的內(nèi)積,當(dāng)序列長(zhǎng)度一長(zhǎng)時(shí),該計(jì)算量還是比較大的
    但如果是滑動(dòng)窗口注意力,則在計(jì)算最后一個(gè)token “the”時(shí),只需計(jì)算the本身所對(duì)應(yīng)的query與上文中3個(gè)token對(duì)應(yīng)的key的內(nèi)積(這里說(shuō)的上文中的3個(gè)token 包括the自己在內(nèi))
  2. 在每個(gè)注意力層,信息可以向前移動(dòng)W個(gè)token。因此,在k層注意力之後,信息最多可以向前移動(dòng) k × W 個(gè)token
    At each attention layer, information can move forward by W tokens. Hence, after k attention layers, information can move forward by up to k × W tokens.
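
滑動(dòng)窗口注意力本質(zhì)上就是在因果掩碼的基礎(chǔ)上再加一個(gè)窗口限制,下面是一個(gè)原理性的示意(非Mistral官方實(shí)現(xiàn)):

# 原理示意:構(gòu)造"因果 + 滑動(dòng)窗口"的注意力掩碼
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)     # query位置
    j = torch.arange(seq_len).unsqueeze(0)     # key位置
    causal = j <= i                            # 只看自己及之前的token
    in_window = (i - j) < window               # 且距離不超過(guò)窗口大小W
    return causal & in_window                  # True表示允許attend

# 例如seq_len=6、window=3時(shí),第5個(gè)token(下標(biāo)4)只允許attend到下標(biāo)2、3、4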

4.1.2?滾動(dòng)緩沖區(qū)緩存(Rolling Buffer Cache)

固定的注意力長(zhǎng)度意味著可以使用滾動(dòng)緩存來(lái)限制緩存的大小(A fixed attention span means that we can limit our cache size using a rolling buffer cache)

(圖略:滾動(dòng)緩沖區(qū)緩存的示意)

  1. 緩存的大小是固定的W,時(shí)間步長(zhǎng)i的鍵和值存儲(chǔ)在緩存的位置 i mod W 中。因此,當(dāng)位置i大於W時(shí),緩存中過(guò)去的值就會(huì)被覆蓋,緩存的大小就會(huì)停止增加
    The cache has a fixed size of W, and the keys and values for the timestep i are stored in position i mod W of the cache. As a result, when the position i is larger than W, past values in the cache are overwritten, and the size of the cache stops increasing

    以“The cat sat on the mat”為例..
    當(dāng) i = 0 時(shí),指The,0 mod 3 = 0
    當(dāng) i = 1 時(shí),指cat,1 mod 3 = 1
    當(dāng) i = 2 時(shí),指sat,2 mod 3 = 2

    當(dāng) i = 3 時(shí),指on,3 mod 3 = 0
    當(dāng) i = 4 時(shí),指the,4 mod 3 = 1
    當(dāng) i = 5 時(shí),指mat,5 mod 3 = 2
  2. 在32k token的序列長(zhǎng)度上,這減少了8倍的緩存內(nèi)存使用,而不影響模型質(zhì)量
    On a sequence length of 32k tokens, this reduces the cache memory usage by 8x, without impacting the model quality.

如果把緩沖區(qū)比作一座倉(cāng)庫(kù),每存進(jìn)一個(gè)新東西,都會(huì)占據(jù)相應(yīng)的位置,而倉(cāng)庫(kù)的總?cè)萘渴枪潭ǖ模?dāng)倉(cāng)庫(kù)被裝滿時(shí),就會(huì)把最早放入的東西移除,讓新的物品繼續(xù)進(jìn)倉(cāng),相當(dāng)于入倉(cāng)時(shí)間更接近當(dāng)前時(shí)間的物品則會(huì)留在倉(cāng)庫(kù)中,如此,即能在節(jié)約資源的同時(shí)保留一定長(zhǎng)度的序列
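
滾動(dòng)緩沖區(qū)的"i mod W"寫入規(guī)則,可用幾行代碼直觀感受一下(僅為原理示意,非Mistral官方實(shí)現(xiàn)):

# 原理示意:位置i對(duì)應(yīng)的KV寫入cache的第 i mod W 個(gè)槽位,i >= W時(shí)覆蓋最舊的槽位
W = 3                                  # 窗口/緩存大小
cache = [None] * W

tokens = ["The", "cat", "sat", "on", "the", "mat"]
for i, tok in enumerate(tokens):
    cache[i % W] = tok
    print(i, cache)

# 最後一行輸出為 ['on', 'the', 'mat'],即緩存中只保留最近W個(gè)token對(duì)應(yīng)的KV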

4.1.3 預(yù)填充與分塊:減少重復(fù)運(yùn)算

在生成序列時(shí),需要一個(gè)一個(gè)地預(yù)測(cè)token,因?yàn)槊總€(gè)token都以前面的token為條件。然而,prompt是提前知道的,可以用prompt預(yù)填充(k, v)緩存,即

  1. 如果prompt非常大,可以把它分成更小的塊,用每個(gè)塊預(yù)填充緩存。為此,可以選擇窗口大小作為分塊大小。因此,對(duì)于每個(gè)塊,需要計(jì)算緩存和塊上的注意力
  2. 下圖展示了注意力掩碼在緩存和分塊上的工作原理

    (圖略:預(yù)填充與分塊時(shí)注意力掩碼的示意)

    在預(yù)填充緩存時(shí),長(zhǎng)序列被分塊,以限制內(nèi)存使用
    我們把一個(gè)序列分成三個(gè)塊來(lái)處理,“The cat sat on”,“the mat and saw”,“the dog go to”。上圖中顯示了第三塊(“the dog go to”)發(fā)生的情況:它使用因果掩碼(最右塊)來(lái)關(guān)注自己,使用滑動(dòng)窗口(中心塊)來(lái)關(guān)注緩存,并且不關(guān)注過(guò)去的token,因?yàn)樗鼈冊(cè)诨瑒?dòng)窗口之外(左塊)

4.1.4?Mistral 7B – Instruct

與Mistral 7B同期發(fā)布的Mistral 7B – Instruct(We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct),在MT-Bench的表現(xiàn)可以略微超過(guò)13B – Chat模型

(圖略:Mistral 7B – Instruct在MT-Bench上的對(duì)比結(jié)果)

// 待更

4.2 Mistral 7B結(jié)合YaRN

4.2.1 什么是YaRN

因項(xiàng)目中要用到Y(jié)aRN,所以我又專門寫了一篇文章介紹什么是YaRN,詳見(jiàn)《大模型上下文擴(kuò)展之YaRN解析:從直接外推ALiBi、位置插值、NTK-aware插值、YaRN》

比如該文中有講到:“3.1 YaRN怎么來(lái)的:基于“NTK-by-parts”插值修改注意力”

除了前述的插值技術(shù),他們還觀察到,在對(duì)logits進(jìn)行softmax操作之前引入溫度t可以統(tǒng)一地影響困惑度,無(wú)論數(shù)據(jù)樣本和擴(kuò)展上下文窗口上的token位置如何,更準(zhǔn)確地說(shuō),將注意力權(quán)重的計(jì)算修改為(原文此處為公式,即YaRN論文中的):

softmax( q_m^T k_n / (t · √|D|) )

通過(guò)將RoPE重新參數(shù)化為一組2D矩陣對(duì),給實(shí)現(xiàn)注意力縮放帶來(lái)了明顯的好處(The reparametrization of RoPE as a set of 2D matrices has a clear benefit on the implementation of this attention scaling)

  1. 可以利用"長(zhǎng)度縮放"技巧,簡(jiǎn)單地將複雜的RoPE嵌入按相同比例進(jìn)行縮放,使得qm和kn都以常數(shù)因子√(1/t)進(jìn)行縮放
    這樣一來(lái),在不修改代碼的情況下,YaRN能夠有效地改變注意力機(jī)制
    we can instead use a "length scaling" trick which scales both qm and kn by a constant factor √(1/t) by simply scaling the complex RoPE embeddings by the same amount.
    With this, YaRN can effectively alter the attention mechanism without modifying its code.
  2. 此外,在推理和訓(xùn)練期間,它沒(méi)有額外開(kāi)銷,因?yàn)镽oPE嵌入是提前生成并在所有向前傳遞中被重復(fù)使用的。結(jié)合“NTK-by-parts”插值方法,就得到了YaRN方法
    Furthermore, it has zero overhead during both inference and training, as RoPE embeddings are generated in advance and are reused for all forward passes. Combining it with the "NTK-by-parts" interpolation, we have the YaRN method

對(duì)于LLaMA和LLaMA 2模型,他們推薦以下值:

√(1/t) = 0.1 ln(s) + 1(其中s為上下文擴(kuò)展的比例因子)

上式是在未進(jìn)行微調(diào)的LLaMA 7b、13b、33b和65b模型上,使用"NTK-by-parts"方法對(duì)各種尺度擴(kuò)展因子s進(jìn)行最小困惑度擬合得到的(The equation above is found by fitting √(1/t) at the lowest perplexity against the scale extension by various factors s using the "NTK-by-parts" method)

Yarn-Mistral-7b-64k相當(dāng)于自己實(shí)現(xiàn)了modeling,即把mistral的sliding windows attention改了,相當(dāng)于把sliding windows的范圍從滑窗大小直接調(diào)到了65536即64K(即直接滑65536那么個(gè)范圍的滑窗,其實(shí)就是全局)

4.3 LongLoRA LLaMA與LongQLoRA?LLaMA

通過(guò)此文《通透理解FlashAttention與FlashAttention2:讓大模型上下文長(zhǎng)度突破32K的技術(shù)之一》的開(kāi)頭可知,LLaMA2的上下文長(zhǎng)度只有4K,但通過(guò)longlora技術(shù)的加持,可以讓其上下文長(zhǎng)度擴(kuò)展到32K(LLaMA2 7B可以擴(kuò)展到100K、LLaMA2 70B可以擴(kuò)展到32K)

模型 對(duì)應(yīng)的上下文長(zhǎng)度
LLaMA 2048
LLaMA2 4096
LLaMA2-long(其23年9.27發(fā)的論文) 32K
基于LongLoRA技術(shù)的LongAlpaca-7B/13B/70B 32K以上

而LongQLoRA則相當(dāng)于LongLoRA + QLoRA

至于什么是LongLoRA、LongQLoRA,請(qǐng)參見(jiàn)此文:《大模型上下文長(zhǎng)度的超強(qiáng)擴(kuò)展:從LongLoRA到LongQLoRA》

(圖略:LongLoRA/LongQLoRA相關(guān)示意)

4.4 模型怎么選,此三PK:Yarn-Mistral-7b-64k、Mistral-instruct、LLaMA-LongLoRA/LLaMA-LongQLoRA

接上文《3.2.4 對(duì)review數(shù)據(jù)的最后梳理:得到JSON文本的變體版且剔除長(zhǎng)尾數(shù)據(jù)》,我們終于要開(kāi)始選擇合適的模型來(lái)微調(diào)了,然后在具體微調(diào)的時(shí)候,又注意到了微調(diào)庫(kù)llama factory

所以我們有以下三種微調(diào)模式

  1. Yarn-Mistral-7b-64k(qlora+s2 + llama factory),準(zhǔn)備的1張顯存為48G的A40
    (后面直接通過(guò)llama factory微調(diào)Mistral-instruct時(shí),則一開(kāi)始準(zhǔn)備的4-8張顯存為24G的P40,后來(lái) 還是換成了A卡)
  2. LLaMA-LongLoRA (直接改的longlora源碼,比如把Embedding和layernorm的lora權(quán)重給去掉了),準(zhǔn)備2張顯存為48G的A40
  3. LLaMA-LongQLoRA (直接改的longqlora的源碼,longqlora的源碼剖析見(jiàn)此文),準(zhǔn)備1張顯存為48G的A40

接下來(lái)便一一通過(guò)上面這些模型來(lái)進(jìn)行具體的微調(diào)

4.4.1?Yarn-Mistral-7b-64k

一開(kāi)始阿荀通過(guò)「yarn-mistral + qlora + s2attn + llama factory」跑起來(lái)了幾百條數(shù)據(jù)后發(fā)現(xiàn),初始loss達(dá)到了6(當(dāng)然,雖然loss初期很高,但有在下降,說(shuō)明模型還是有學(xué)到,只不過(guò)初期loss很高說(shuō)明我們給的數(shù)據(jù)和模型所學(xué)過(guò)的數(shù)據(jù)差異比較大)

(圖略:訓(xùn)練初期的loss曲線)

這里面有三件比較有意思的事

  1. 細(xì)究才發(fā)現(xiàn)Yarn-Mistral-7b-64k不是個(gè)chat模型(yarn-mistral使用的基座是非sft模型),說(shuō)白了Yarn-Mistral-7b系列均是基于非chat模型訓(xùn)練所得

    (圖略:Yarn-Mistral模型說(shuō)明的截圖)

    故一方面我們?cè)谖⒄{(diào)Yarn-Mistral-7b-64k的同時(shí),也開(kāi)始關(guān)注跟隨Mistral 7B一同發(fā)布的Mistral 7B – Instruct
  2. 但後來(lái)我們還是不信邪,還是把Yarn-Mistral-7b-64k硬跑了下來(lái),在1.10日下午loss已經(jīng)降到1.8啦,可喜可賀,^_^

    (圖略:loss降至1.8的訓(xùn)練曲線)

  3. 然而yarn-mistral這邊練了2個(gè)epoch之後,loss雖然相對(duì)穩(wěn)定下降,但仍舊表現(xiàn)出復(fù)讀嚴(yán)重的問(wèn)題,調(diào)整過(guò)解碼策略參數(shù)也仍是一樣
    原因猜測(cè)是上下文長(zhǎng)度太長(zhǎng),注意力分配出去太勻了,以至于內(nèi)容再加多點(diǎn)、注意力也是大差不差的感覺(jué)

實(shí)話講,上面第三個(gè)問(wèn)題 還挺麻煩的,因?yàn)槟P偷妮敵鰶](méi)有實(shí)質(zhì)性內(nèi)容,就是在復(fù)讀用戶的一部分輸入

加之在此之前,沒(méi)有人公開(kāi)用yarn后的模型做過(guò)sft,沒(méi)有實(shí)證可以參考

  • 不清楚其是否不適合做微調(diào)(是否破壞了yarn的外推性)、又為何不適合,這點(diǎn)沒(méi)有大量、多角度的實(shí)驗(yàn)的話是無(wú)從確定的
  • 又或者是需要更多數(shù)據(jù)、更大規(guī)模模型、更多訓(xùn)練量等等

那最后怎么辦呢,具體大模型項(xiàng)目開(kāi)發(fā)線上營(yíng)中見(jiàn)

4.4.2 直接通過(guò)llama factory微調(diào)Mistral-instruct

如果我們要微調(diào)Mistral 7B – Instruct的話,我們當(dāng)時(shí)的第一反應(yīng)是怎么擴(kuò)展Mistral 7B – Instruct的長(zhǎng)度呢(Mistral 7B – Instruct的上下文長(zhǎng)度只有8K)

  1. 既然可以給Mistral-7b加YaRN,那類似的,給Mistral 7B – Instruct加YaRN行不行?
    然問(wèn)題是不好實(shí)現(xiàn):YaRN-Mistral 7B – Instruct,因?yàn)閅arn是全量訓(xùn)的方案,而大滑窗范圍+全量很吃資源
  2. 受LongLora LLaMA的啟發(fā),既然沒(méi)法給Mistral 7B – Instruct加YaRN,那可以給其加longlora么?
    然問(wèn)題是mistral又沒(méi)法享有l(wèi)onglora,因?yàn)閙istral的sliding windows attention和longlora的shift short attention無(wú)法同時(shí)兼容,但要對(duì)原chat模型的上下文長(zhǎng)度進(jìn)行有效擴(kuò)展又會(huì)需要shift short attention
  3. 至此,是不是只能意味著,chat版本的mistral-instruct不加其他技巧盲練長(zhǎng)文本(用雪狼那邊的4張24G的p40),還是有別的辦法?這個(gè),我司的「大模型項(xiàng)目開(kāi)發(fā)線上營(yíng)」里見(jiàn)

4.4.3?LLaMA 2 7B chat-LongLoRA:成功

項(xiàng)目組一同事不染在最開(kāi)始用微調(diào)庫(kù)llama factory實(shí)際微調(diào)LLaMA-LongLoRA時(shí)

  1. 發(fā)現(xiàn)即便用上了longlora,顯存占用還很大,後來(lái)阿荀發(fā)現(xiàn)原來(lái)是需要用llama-factory的stable版本
  2. 但最終還是沒(méi)用llama factory了,直接改的LongLoRA源碼(比如embedding和Norm層不添加LoRA權(quán)重,原因在于一開(kāi)始添加了但訓(xùn)練效果不穩(wěn)定,猜測(cè)是如果embedding放開(kāi)得要更大的數(shù)據(jù)才訓(xùn)得夠)

最終于24年1.31日,不染通過(guò)longlora跑了4輪之后,也終于跑出來(lái)了,得到的review意見(jiàn)如下

<Potential reasons for acceptance>
<Technical solidity> The paper presents a technically solid approach to neural architecture search, offering a novel perspective on the problem and providing a theoretically sound optimization method.
<Empirical evidence> The experimental results demonstrate competitive or better performance compared to existing methods, along with improved efficiency and scalability.
<Clear presentation> The paper is well written and easy to understand, making it accessible to a wide audience.

[Potential reasons for rejection]
<No related terms> The paper lacks related terms, which may impact its relevance and positioning in the field.
<Insufficient comparison with prior art> The paper does not sufficiently compare its approach with prior art, particularly in relation to existing methods for neural architecture search.
<Unclear motivation> The motivation behind the proposed method is not clearly explained, leading to uncertainty about its significance and novelty.

[Suggestions for improvement]
<Include related terms> The paper should include relevant related terms to enhance its positioning and relevance in the field.
<Comprehensive comparison with prior art> A thorough comparison with existing methods for neural architecture search, particularly those addressing similar issues, would strengthen the paper's contribution.
<Clarify motivation> Providing a clearer explanation of the motivation behind the proposed method would help establish its significance and novelty.

</s>

至于具體我們?nèi)绾斡?xùn)練的,以及最終該模型的效果如何,請(qǐng)見(jiàn)下文的后續(xù)更新

4.4.4 基于LongQLoRA + 一萬(wàn)多條paper-review數(shù)據(jù)集微調(diào)LLaMA 2 7B chat:成功

LLaMA 2 7B chat本身的上下文長(zhǎng)度只有4096,好在我們給它加上LongQLoRA之后,其上下文長(zhǎng)度確實(shí)實(shí)現(xiàn)了從4096到12288(至于為何是12288,原因見(jiàn)上文的3.2.4節(jié)最后)

24年1.17日(是在我創(chuàng)業(yè)即將9周年的前兩天),在歷經(jīng)80h的模型訓(xùn)練之后,我們終于通過(guò)15565條paper-review數(shù)據(jù)集把LLaMA 2 7B chat LongQLoRA微調(diào)好了(相比3.2.4節(jié)最后說(shuō)的15566去掉了一條異常數(shù)據(jù),至于怎么個(gè)異常法,線上營(yíng)中說(shuō)?),是我司第二項(xiàng)目組「包括我、阿荀(主力)、朝陽(yáng)、雪狼、不染」花費(fèi)整整半年、且歷經(jīng)論文審稿第一版、第二版的里程碑式工作(后續(xù)再迭代優(yōu)化下之后,今年會(huì)把這個(gè)工作發(fā)表成SCI論文)

具體而言

  • 用下面這段名為"llama2_instruct+input"的prompt (至於更全面的prompt線上營(yíng)中見(jiàn))
    (簡(jiǎn)單解釋一下格式的問(wèn)題
    由于如上文3.2.4節(jié)最后說(shuō)的,我們微調(diào)的數(shù)據(jù)格式均為JSON格式的變體,即
    [Significance and novelty]
    <大體描述> 具體描述
    <大體描述> 具體描述
    ...

    [Potential reasons for acceptance]
    <大體描述> 具體描述
    <大體描述> 具體描述
    ...


    相當(dāng)於微調(diào)後的LLaMA2已經(jīng)對(duì)上述這種JSON變體格式輕車熟路,所以在設(shè)計(jì)prompt時(shí),不用再特地強(qiáng)調(diào)格式的輸出)
    You are a professional machine learning conference reviewer who reviews a given paper and considers 4 criteria: ** importance and novelty **, ** potential reasons for acceptance **, ** potential reasons for rejection **, and ** suggestions for improvement **.
    The given paper is as follows.:
    
    [TITLE]
    YaRN: Efficient Context Window Extension of Large Language Models
    
    [ABSTRACT]
    Rotary Position Embeddings (RoPE) have been shown to effectively encode posi- tional information in transformer-based language models. However..
    
    # 還有一大段CONTENT,略..
    提示微調(diào)後得到的模型LLaMA 2 7B chat-LongQLoRA,針對(duì)上文提過(guò)的YaRN這篇論文(其不在我們的paper-review訓(xùn)練集之內(nèi))輸出審稿意見(jiàn),指出了其實(shí)驗(yàn)設(shè)計(jì)比較少、對(duì)方法的講述不夠容易讓人理解等問(wèn)題,如下所示
    (圖略:LLaMA 2 7B chat-LongQLoRA輸出的審稿意見(jiàn)截圖)
  • 然后對(duì)比了GPT3.5和GPT4針對(duì)YARN這篇論文的審稿意見(jiàn)
    首先用的prompt如下(純正的JSON格式)
    You are a professional machine learning conference reviewer who reviews a given paper and considers 4 criteria: ** importance and novelty **, ** potential reasons for acceptance **, ** potential reasons for rejection **, and ** suggestions for improvement **.
    You just need to use the following JSON format for output, but don't output opinions that don't exist in the original reviews. if you're not sure, return an empty dict:
    {
    'Significance and novelty': List multiple items by using Dict, The key is a brief description of the item, and the value is a detailed description of the item.
    'Potential reasons for acceptance': List multiple items by using Dict, The key is a brief description of the item, and the value is a detailed description of the item.
    "Potential reasons for rejection": List multiple items by using Dict, The key is a brief description of the item, and the value is a detailed description of the item.
    'Suggestions for improvement': List multiple items by using Dict, The key is a brief description of the item, and the value is a detailed description of the item.
    }
    
    The given paper is as follows.:
    
    [TITLE]
    YaRN: Efficient Context Window Extension of Large Language Models
    
    [ABSTRACT]
    Rotary Position Embeddings (RoPE) have been shown to effectively encode posi- tional information in transformer-based language models. However, ...
    
    # 還有一大段CONTENT,略..
    其次,輸出結(jié)果分別如下圖左部(提到實(shí)驗(yàn)不夠廣泛的問(wèn)題,然后指出YaRN的實(shí)現(xiàn)示例不夠清晰)、下圖右部所示(指出了實(shí)驗(yàn)不夠廣泛、范圍比較有限)
    (圖略:左為GPT3.5的輸出,右為GPT4的輸出)

接下來(lái)是三個(gè)重點(diǎn)工作

  1. 繼續(xù)微調(diào)另外兩個(gè)開(kāi)源模型
  2. 微調(diào)gpt 3.5 16k
  3. 實(shí)現(xiàn)評(píng)估pipeline,全面對(duì)比我們微調(diào)的各個(gè)開(kāi)源模型(包括微調(diào)前后)、GPT3.5/4(包括微調(diào)前后)的效果,爭(zhēng)取早日趕超GPT4

第五部分 模型的訓(xùn)練與微調(diào):如何微調(diào)LLaMA 2、Yarn-Mistral

5.1 如何基于一萬(wàn)多paper-review數(shù)據(jù)集微調(diào)LLaMA 2

5.1.1 微調(diào)時(shí)對(duì)LongQLoRA代碼的修改

通過(guò)上文這節(jié)《4.4.4 基于LongQLoRA + 一萬(wàn)多條paper-review數(shù)據(jù)集微調(diào)LLaMA 2 7B chat:成功》的內(nèi)容,我們已經(jīng)知道終于微調(diào)成功了,但到底如何基于一萬(wàn)多paper-review數(shù)據(jù)集微調(diào)LLaMA 2呢?

首先,如之前所說(shuō),我們的微調(diào)代碼是改自LongQLoRA的源碼(沒(méi)有用llama factory),具體而言

  1. LongQLoRA源碼沒(méi)有實(shí)現(xiàn)LLaMA2 sft訓(xùn)練的數(shù)據(jù)讀取類(包括數(shù)據(jù)讀取、數(shù)據(jù)組織、tokenizing),要自己實(shí)現(xiàn)
  2. 原本的LongQLoRA源碼訓(xùn)練參數(shù)指定使用fp16數(shù)據(jù)類型,但是訓(xùn)練可能會(huì)很不穩(wěn)定,loss震蕩嚴(yán)重(可能高達(dá)上百)甚至不收斂,要去設(shè)置使用bf16數(shù)據(jù)類型進(jìn)行訓(xùn)練,loss就能穩(wěn)定從4點(diǎn)多開(kāi)始收斂
  3. 要使用支持bf16數(shù)據(jù)類型(即不建議使用V開(kāi)頭的卡)的卡來(lái)訓(xùn)練,使用A40最合適,支持bf16數(shù)據(jù)類型、顯存48G剛好

當(dāng)然,更多細(xì)節(jié)在線上營(yíng)中透露,如下圖所示

七月論文審稿GPT第2版:從Meta Nougat、GPT4審稿到Mistral、LongLora Llama,大模型/AIGC/ChatGPT系列:原理、論文、代碼、實(shí)戰(zhàn),Meta Nougat,GPT4審稿,mistral,論文審稿GPT,LongLora

But files 1, 3, and 5 do not seem to appear in the figure above. The reason is:

  • Under the component folder
    dataset.py needs to be modified: implement Llama2SFTDataset by following VicunaSFTDataset. This counts as file 1, which is why number 1 is skipped

    When imitating it, note what 阿荀 said: "I first implemented it following VicunaSFTDataset and found that when the length limit is exceeded, it simply truncates from the tail, which in extreme cases can cut away the entire output.
    So I changed the truncation to take a portion from both input and output, ensuring that every sample keeps an output part no matter what"
    In other words, the input is truncated a bit and the output is truncated a bit (a small sketch of this idea is given after the code below)


    Some readers may wonder: since LongQLoRA already extends LLaMA 2 to 12K+ context, why truncate at all?
    In practice, without safe truncation the runs would fail with a different error each time, at least when running the LongQLoRA code on the data we crawled (it may be an issue with the LongQLoRA implementation or with the transformers version, since some frameworks work fine without safe truncation)
  • Under the train_args folder
    A custom llama2-7b-chat-sft-bf16.yaml is needed; this counts as file 3, see Section 5.1.4 below for details
  • train.py counts as file 5; the model loading needs to be changed to bf16
        # Load the model
        logger.info(f'Loading model from: {args.model_name_or_path}')
        model = AutoModelForCausalLM.from_pretrained(
            args.model_name_or_path,
            config=config,
            device_map=device_map,
            load_in_4bit=True,
            # torch_dtype=torch.float16,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            quantization_config=BitsAndBytesConfig(
                load_in_4bit=True,
                # bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_compute_dtype=torch.bfloat16,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4",
                llm_int8_threshold=6.0,
                llm_int8_has_fp16_weight=False,
            ),
        )
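Below is a minimal sketch of the "truncate input and output separately" idea described above. The class name, the JSON-lines data layout, and the half-and-half budget split are illustrative assumptions for demonstration, not the project's actual implementation.

# Minimal sketch of truncating input and output separately so that the output
# is never cut away entirely. All names and the budget split are assumptions.
import json
from torch.utils.data import Dataset

class Llama2SFTDataset(Dataset):
    def __init__(self, data_path, tokenizer, max_seq_length):
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length
        with open(data_path, encoding="utf-8") as f:
            self.samples = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        input_ids = self.tokenizer.encode(sample["input"], add_special_tokens=False)
        output_ids = self.tokenizer.encode(sample["output"], add_special_tokens=False)

        budget = self.max_seq_length - 2              # reserve room for BOS/EOS
        if len(input_ids) + len(output_ids) > budget:
            # keep the output no matter what: give it up to half of the budget,
            # then let the input fill whatever room is left
            out_keep = min(len(output_ids), budget // 2)
            in_keep = budget - out_keep
            input_ids = input_ids[:in_keep]
            output_ids = output_ids[:out_keep]

        ids = [self.tokenizer.bos_token_id] + input_ids + output_ids + [self.tokenizer.eos_token_id]
        # only the output (and EOS) contribute to the loss
        labels = [-100] * (1 + len(input_ids)) + output_ids + [self.tokenizer.eos_token_id]
        return {"input_ids": ids, "labels": labels, "attention_mask": [1] * len(ids)}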

5.1.2 Resource requirements and environment setup

The required resources are as follows

  1. A Linux system
  2. CUDA 11.7 support
  3. A single A40 (i.e., an Ampere-architecture GPU with 48 GB+ of memory)
  4. Access to the HuggingFace / official Python package sources (confirm this is available before proceeding)
  5. At least 80 GB of free disk space

Next, set up the environment as follows

cd /path/to/LongQLoRA

# Create the virtual environment
conda create -n longqlora python=3.9 pip

# Activate the virtual environment
conda activate longqlora

# Set up the virtual environment
## Install pytorch separately
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117 -i https://pypi.org/simple
## Install flash attention separately
pip install flash_attn -i https://pypi.org/simple
## Install the requirements
pip install -r requirements.txt -i https://pypi.org/simple

5.1.3 Preparation: downloading the dataset and the model files

  1. Create the output directory
  2. Place the dataset
  3. Download the model files
    Install git-lfs
    # Install git-lfs
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
    sudo apt-get install git-lfs
    
    # Activate git-lfs
    git lfs install
    Fetch the Llama-2-7b-chat-hf model files
    # Enter the directory used to store the model files
    cd /path/to/models_dir
    
    # Fetch Llama-2-7b-chat-hf
    git lfs clone https://huggingface.co/NousResearch/Llama-2-7b-chat-hf

5.1.4 Defining the training arguments

  • Modify the yaml file
    located at "/path/to/LongQLoRA/train_args/llama2-7b-chat-sft-bf16.yaml" (an illustrative sketch of such a file is given after this list)

    The main parameters are explained below

    | Parameter | Meaning |
    | --- | --- |
    | output_dir | Output directory for training (logs, weight files, etc.), i.e., the created output directory plus a custom run name |
    | model_name_or_path | Directory of the model files used for training, i.e., the path of the downloaded model |
    | train_file | Path of the training data, i.e., where the dataset was placed |
    | deepspeed | Path of the deepspeed config, i.e., "train_args/deepspeed/deepspeed_config_s2_bf16.json" under the LongQLoRA directory |
    | sft | Whether to train in SFT mode |
    | use_flash_attn | Whether to use flash attention |
    | num_train_epochs | Number of training epochs |
    | per_device_train_batch_size | Batch size per device |
    | gradient_accumulation_steps | Number of gradient accumulation steps |
    | max_seq_length | Truncation length for the data |
    | model_max_length | Maximum length supported by the model, i.e., the target length this training extends to |
    | learning_rate | Learning rate |
    | logging_steps | Logging frequency: print once every logging_steps steps |
    | save_steps | Checkpoint frequency: save once every save_steps steps |
    | save_total_limit | Maximum number of checkpoints to keep; earlier checkpoints are deleted automatically once the limit is exceeded |
    | lr_scheduler_type | Learning-rate scheduler |
    | warmup_steps | Number of warmup steps |
    | lora_rank | LoRA rank |
    | lora_alpha | LoRA scaling factor |
    | lora_dropout | LoRA dropout probability |
    | gradient_checkpointing | Whether to enable gradient checkpointing |
    | optim | Optimizer to use |
    | bf16 | Whether to train in bf16 |
    | report_to | Logging backend for training reports |
    | dataloader_num_workers | Number of worker threads for data loading; 0 disables multithreading |
    | save_strategy | Saving strategy: "steps" saves by step count, "epochs" saves by epoch |
    | weight_decay | Weight decay |
    | max_grad_norm | Gradient clipping threshold |
    | remove_unused_columns | Whether to drop unused columns from the dataset |

  • Modify the bash file
    located at "/path/to/LongQLoRA/run_train_sft_bf16.sh", whose content is as follows
    export CUDA_LAUNCH_BLOCKING=1
    deepspeed train.py --train_args_file /path/to/LongQLoRA/train_args/llama2-7b-chat-sft-bf16.yaml
    where --train_args_file is the path of the yaml file used for training
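As a rough orientation, the yaml might look like the sketch below. Every value here is an illustrative assumption (the actual configuration is not published in this article); only the field names follow the parameter table above.

# Illustrative sketch of llama2-7b-chat-sft-bf16.yaml; all values are
# assumptions for demonstration, not the project's actual configuration.
output_dir: /path/to/output/llama2-7b-chat-paper-review
model_name_or_path: /path/to/models_dir/Llama-2-7b-chat-hf
train_file: /path/to/data/paper_review_sft.jsonl
deepspeed: ./train_args/deepspeed/deepspeed_config_s2_bf16.json
sft: true
use_flash_attn: true
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
max_seq_length: 12288
model_max_length: 12288
learning_rate: 2.0e-4
logging_steps: 10
save_steps: 100
save_total_limit: 2
lr_scheduler_type: constant_with_warmup
warmup_steps: 100
lora_rank: 64
lora_alpha: 16
lora_dropout: 0.05
gradient_checkpointing: true
optim: paged_adamw_32bit
bf16: true
report_to: tensorboard
dataloader_num_workers: 0
save_strategy: steps
weight_decay: 0
max_grad_norm: 0.3
remove_unused_columns: false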

5.1.5 Running the training

# Enter the LongQLoRA source directory
cd /path/to/LongQLoRA

# Launch the bash script to start training
bash run_train_sft_bf16.sh

// More in the online camp; this article will be updated further


Part 6 Model Evaluation: How to Evaluate the Review GPT

6.1 How the Stanford researchers evaluated GPT-4's review comments

6.1.1 Defining the hit rate on overlap

In the Stanford paper that had GPT-4 act as a reviewer (see Section 3.1 above for the paper), the authors evaluated the pairwise overlap of GPT-4 vs. Human and Human vs. Human in terms of hit rate, where the hit rate is defined as the proportion of comments in set A that are matched by comments in set B, computed as follows
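The original figure with the formula is not reproduced here; reconstructed from the verbal definition above, it reads:

$$\text{hit rate}(A, B) \;=\; \frac{|A \cap B|}{|A|}$$

where |A| is the number of comments in set A and |A ∩ B| is the number of comments in A that are matched by some comment in B.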

6.1.2 基于「重疊度上命中率指標(biāo)」衡量LLM評(píng)估效果的流程

如下圖所示

七月論文審稿GPT第2版:從Meta Nougat、GPT4審稿到Mistral、LongLora Llama,大模型/AIGC/ChatGPT系列:原理、論文、代碼、實(shí)戰(zhàn),Meta Nougat,GPT4審稿,mistral,論文審稿GPT,LongLora

  • 對(duì)比
    該兩階段管道將LLM生成的Review與人類審稿員的Review進(jìn)行對(duì)比
  • 提取
    利用GPT4的信息提取功能,從LLM生成的Review和人類審稿員的Review中提取關(guān)鍵要點(diǎn)
  • 匹配
    使用GPT4進(jìn)行語(yǔ)義相似性分析,將來(lái)自LLM和人類反饋的Review進(jìn)行匹配。針對(duì)每個(gè)匹配到的Review,都會(huì)給出一個(gè)相似性評(píng)級(jí)和理由。通過(guò)設(shè)定≥7 的相似性閾值來(lái)過(guò)濾掉較弱匹配的評(píng)論(這一閾值是基于匹配階段經(jīng)過(guò)人工驗(yàn)證選擇得出)

總之

  1. 針對(duì)LLM提出的Review與人類的Review,均分別使用一定的prompt (具體prompt見(jiàn)線上營(yíng))交由GPT-4進(jìn)行摘要處理。對(duì)LLM下達(dá)任務(wù),要求其關(guān)注Review中潛在的拒絕原因,并以特定的JSON格式來(lái)提供Review所指出的關(guān)鍵問(wèn)題所在,研究團(tuán)隊(duì)解釋側(cè)重關(guān)鍵問(wèn)題的目的在于“Review中的批評(píng)直接有助于指導(dǎo)作者改進(jìn)論文”
  2. 將需要評(píng)估的LLM Review與人類Review由上一步得到的內(nèi)容共同輸入至GPT-4中,利用特定的prompt (具體prompt見(jiàn)線上營(yíng))來(lái)指示GPT-4輸出新的JSON內(nèi)容,讓GPT-4指出兩個(gè)傳入的內(nèi)容中的匹配項(xiàng),并且對(duì)匹配程度進(jìn)行評(píng)估(5-10分)
    作者研究發(fā)現(xiàn)5分、6分的相似項(xiàng)置信程度不佳,因此設(shè)定7分以上視為“匹配”,再基于計(jì)算重疊程度,其中為L(zhǎng)LM提出的批評(píng)項(xiàng)數(shù),為L(zhǎng)LM與人類提出的匹配批評(píng)項(xiàng)數(shù)
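A minimal sketch of the similarity-threshold filter described above, assuming GPT-4's matching output can be parsed into a dict with one entry per matched pair; the field names "rationale" and "similarity" are illustrative assumptions about that output, not the exact schema used.

# Minimal sketch of the >=7 similarity filter. The structure of the parsed
# matching output is an assumption for illustration only.
import json

def strong_matches(gpt4_matching_json: str, threshold: int = 7) -> dict:
    """Keep only matched pairs whose similarity rating is >= threshold."""
    matches = json.loads(gpt4_matching_json)
    return {
        pair: info
        for pair, info in matches.items()
        if info.get("similarity", 0) >= threshold
    }

example = ('{"A1-B2": {"rationale": "both criticize limited experiments", "similarity": 8},'
           ' "A3-B1": {"rationale": "loosely related", "similarity": 5}}')
print(strong_matches(example))   # only the similarity-8 pair survives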

6.2 Evaluating LLaMA 2 7B chat-LongQLoRA: stronger than GPT-3.5 and GPT-4

Note: in this comparison,

  • gpt4 refers to gpt-4-turbo-preview
  • gpt3.5 refers to gpt-3.5-turbo-1106

Both are versions that support JSON-format output and long inputs

For the validation data, we use 57 papers outside the training set, each paired with a human review produced by the "many-to-one" aggregation. Since LLM outputs can be unstable, each of the 57 items is duplicated 5 times, giving 285 test items, so each LLM later gets the chance to generate 5 outputs per input

The figure below shows the paper data of the test set

[Screenshot: test-set paper data]

The figure below shows the review data of the test set (human / golden)

[Screenshot: test-set review data]

6.2.1 Having our review model and GPT-3.5 each generate reviews for the test-set papers

Running LLaMA2-paperreview (hereafter simply llama2, to avoid confusion with the Paper and Review data themselves) on the test set produces results like the following

[Screenshot: llama2 review output]

Running gpt-3.5-turbo on the test set produces results like the following (the prompt template used is covered in the online camp)

[Screenshot: gpt-3.5-turbo review output]

6.2.2 Processing the reviews: format conversion and numbering the review points

First, the gpt-3.5-turbo output, which is originally JSON, is converted into the format shown below (the same format as the human reviews and the llama2 reviews)

[Screenshot: converted gpt-3.5-turbo review]

Next, the points in each review are numbered (a small sketch of this step follows this list)

  1. Number the "<xxx> yyy" items in the human review and count how many items the review contains; the numbering result looks like the figure below, with 4 items in this example

    [Screenshot: numbered human review, 4 items]

  2. Number the "<xxx> yyy" items in the llama2 review and count the items; the result looks like the figure below, with 8 items in this example

    [Screenshot: numbered llama2 review, 8 items]

  3. Number the "<xxx> yyy" items in the gpt-3.5-turbo review and count the items; the result looks like the figure below, with 4 items in this example

    [Screenshot: numbered gpt-3.5-turbo review, 4 items]
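A minimal sketch of this numbering step, assuming each review is a text whose points follow the "<aspect> content" pattern shown above; the regex and the input format are assumptions, not the project's exact code.

# Minimal sketch of numbering "<aspect> content" review items and counting them.
import re

def number_review_items(review_text: str):
    """Prefix each '<aspect> content' item with an index and return the count."""
    items = re.findall(r"<[^>]+>[^<]*", review_text)
    numbered = [f"{i + 1}. {item.strip()}" for i, item in enumerate(items)]
    return "\n".join(numbered), len(items)

review = ("<Experiments> The experiments are limited in scope. "
          "<Clarity> The method description is hard to follow.")
text, count = number_review_items(review)
print(text)    # 1. <Experiments> ...  2. <Clarity> ...
print(count)   # 2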

6.2.3 Matching the points from the human, llama2, and GPT-3.5 reviews

Starting from the "node matching prompt" proposed in the aforementioned paper 《Can large language models provide useful feedback on research papers? A large-scale empirical analysis》 and modifying it slightly, we obtain a specific prompt that instructs gpt4 to match review items (the full prompt is covered in the online camp)

The prompt template covers 4 main points:
- Instruct gpt4 to analyze the given Review A and Review B and match the points the two reviews share.
- Give an output example, which should be a nested JSON roughly of the form
  {
    "matched item 1": {"rationale": "why they match", "similarity": match score},
    "matched item 2": {"rationale": "why they match", "similarity": match score},
    ...
  }
- State the scoring rule: the match score ranges from 5 to 10, and a larger number means a stronger match.
- State that an empty JSON should be returned if there is no matching item

First, gpt4 is used with this prompt template to match the points of the llama2 review (Review A) against the human review (Review B); the output looks like the figure below

[Screenshot: gpt4 matching output for llama2 vs. human review]

Then, gpt4 is used with the same prompt template to match the points of the gpt-3.5-turbo review (Review A) against the human review (Review B); the output looks like the figure below

[Screenshot: gpt4 matching output for gpt-3.5-turbo vs. human review]

6.2.4 Our review model vs. GPT: computing hit rate and hit count to decide the win rate

  • Computing the hit rate: a win rate of nearly 60% against GPT-3.5, and an even higher one against GPT-4
    Matched items with a similarity of 7 or above are defined as "strong matches", i.e., counted as "hits", and the hit rate is computed by the following formula

    hit rate(A, B) = |A ∩ B| / |A|

    where A is Review A, usually the review generated by the LLM, and B is Review B, usually the human review; the hit rate is thus the ratio of "strong matches between the LLM review and the human review" to "the number of items in the LLM review"

    For a given paper p with N LLM-generated reviews, the mean hit rate is the average of the hit rates of those N reviews. In this evaluation N is 5, i.e., 5 reviews are generated for each paper

    mean hit rate(p) = (1/N) · Σ_{i=1..N} hit rate(A_i, B)

    So, following the evaluation protocol of the Stanford team's GPT-4-as-reviewer paper (a ratio), the results are as follows, and they show that our review GPT surpasses GPT-3.5: comparing llama2 and gpt-3.5-turbo head to head on the mean hit rate per paper (the higher mean hit rate "wins"), llama2 has the higher mean hit rate on 33 of the 57 papers, and gpt-3.5-turbo on the remaining 24

    [Screenshot: per-paper mean hit rate comparison, llama2 vs. gpt-3.5-turbo]

    As for GPT-4: frankly, we beat GPT-4 by an even larger margin, which may seem counter-intuitive, since most people would expect GPT-4 to be stronger than gpt-3.5-turbo; but in this scenario and under this metric, GPT-4 indeed performs worse than gpt-3.5-turbo
    On this point, a friend on Weibo expressed a similar view:

    [Screenshot: Weibo comment]

  • Computing the hit count: a win rate of over 80% against GPT-3.5 and over 60% against GPT-4
    Looking only at absolute numbers, i.e., the mean hit count, our win rate is 84.21% against gpt3.5 and 63.16% against gpt4

    Considering that LLM output length is capped (so a model cannot generate arbitrarily many points), and that for the actual application "generating more points" is not necessarily bad and may even give users more inspiration, we also use the "hit count" directly as a metric: simply count the strong matches between the LLM review and the human review, without dividing by the number of generated points (a small scoring sketch follows this list)

    hits(A, B) = |A ∩ B|

    where A is Review A, usually the LLM's review, and B is Review B, usually the human review; the hit count is the number of strong matches between the LLM review and the human review

    For a given paper p with N LLM-generated reviews, the mean hit count is the average of the hit counts of those N reviews; as before, N is 5, i.e., 5 reviews are generated for each paper

    mean hits(p) = (1/N) · Σ_{i=1..N} hits(A_i, B)

    First, comparing llama2 and gpt-3.5-turbo head to head on the mean hit count per paper (the higher mean hit count "wins"): llama2 has the higher mean hit count on 48 of the 57 papers, and gpt-3.5-turbo on the remaining 9

    [Screenshot: per-paper mean hit count comparison, llama2 vs. gpt-3.5-turbo]

    Then, what about a PK against GPT-4? Comparing llama2 and gpt-4 head to head on the mean hit count per paper: llama2 has the higher mean hit count on 36 of the 57 papers, and gpt-4 on the remaining 27

    [Screenshot: per-paper mean hit count comparison, llama2 vs. gpt-4]
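For reference, a minimal scoring sketch of the per-paper mean hit rate, mean hit count, and head-to-head win counting described above. The data layout (per paper, a list of (n_items, n_strong_matches) pairs for the N generations) is an assumption for illustration, not our evaluation code.

# Minimal sketch: per-paper mean hit rate / mean hit count and head-to-head wins.
from statistics import mean

def mean_hit_rate(reviews):
    """reviews: list of (n_items, n_strong_matches) for one paper's N generations."""
    return mean(strong / items for items, strong in reviews if items > 0)

def mean_hits(reviews):
    return mean(strong for _, strong in reviews)

def head_to_head(papers_a, papers_b, score_fn):
    """Count how many papers model A / model B wins under score_fn (ties go to neither)."""
    wins_a = sum(score_fn(a) > score_fn(b) for a, b in zip(papers_a, papers_b))
    wins_b = sum(score_fn(b) > score_fn(a) for a, b in zip(papers_a, papers_b))
    return wins_a, wins_b

# toy example: one paper, 5 generations per model
llama2 = [[(8, 3), (7, 2), (8, 4), (9, 3), (8, 2)]]
gpt35  = [[(4, 1), (4, 2), (5, 1), (4, 1), (4, 2)]]
print(head_to_head(llama2, gpt35, mean_hits))   # (1, 0)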

By this point, the article has revealed many engineering details of our paper-review GPT project, details that are rarely found online since this is a commercial project; of course, more is covered in the "LLM project development online camp"


References and recommended reading

  1. Full translation of the GPT-4-as-reviewer paper: 【斯坦福大學(xué)最新研究】使用大語(yǔ)言模型生成審稿意見(jiàn)
  2. GPT-4竟成Nature審稿人?斯坦福清華校友近5000篇論文實(shí)測(cè),超50%結(jié)果和人類評(píng)審一致
  3. Several Chinese-language write-ups on Mistral-7B:
    從開(kāi)源LLM中學(xué)模型架構(gòu)優(yōu)化-Mistral 7B
    開(kāi)源社區(qū)新寵Mistral,最好的7B模型
  4. Mistral 7B-來(lái)自號(hào)稱“歐洲OpenAI”Mistral AI團(tuán)隊(duì)發(fā)布的最強(qiáng)7B模型
  5. [論文嘗鮮]LongLoRA - 高效微調(diào)長(zhǎng)上下文的LLMs (if you find that article inconsistent with some statements in this blog, this blog takes precedence)
  6. 2行代碼,「三體」一次讀完!港中文賈佳亞團(tuán)隊(duì)聯(lián)手MIT發(fā)布超長(zhǎng)文本擴(kuò)展技術(shù),打破LLM遺忘魔咒

Creation, revision, and improvement log

  1. Stage 1: review data processing
    Nov 2: started writing this article
  2. Nov 3: focused on Part 2 and the GPT-4 reviewing approach
  3. Nov 4: focused on Mistral 7B in Part 3
  4. Nov 5: continued improving the Mistral 7B part
  5. Nov 11: updated the section "2.2.2 How to make the organized review results more comprehensive: many-to-one"
    Improved Section 1.1.1 on Meta Nougat
    A side note: doing technical research to ship a real project feels especially satisfying ^_^
  6. Nov 15: added Section 2.2: secondary processing of the review data
  7. Nov 18: polished some descriptions in Section 2.2
  8. Nov 22: added details in Part 2 about the data processing of the first version of the review GPT, e.g., that the paper-data processing only removed references
    Added content for "3.2.3 Processing the review data with the final prompt: ChatGPT vs. open-source models"
  9. Nov 23: added Section 1.2: parsing the 26,000 papers
  10. Nov 25: since model evaluation is still needed after data parsing, data processing, and model training,
    added a new part on model evaluation: how to evaluate the review GPT
  11. Dec 8: ahead of a corporate training session in Wuhan and the online camp sessions on this project, and as the project keeps progressing,
    added how to work around the various API access limits when summarizing the review data via the OpenAI API
    Added a new section: "3.3 Related work: AcademicGPT, continued pre-training of LLaMA2-70B with a paper-review feature"
  12. Dec 9: heavily revised "3.3.2 Paper review: extracting 7 review aspects following ReviewAdvisor (similar to how we distilled reviews into 4 aspects following the Stanford work)"
  13. Dec 17: improved the description of the related work AcademicGPT, especially its extractive summarization strategy for reviews
  14. Dec 18: added the Mistral 7B parameter figure and explanations related to GQA, window_size, and other parameters
  15. Dec 20: added "1.2.2 ScienceBeam parsing results" and the organization format of the review data
  16. Dec 21: improved the description of paper and review data processing in the second version, namely
    version 1: papers were parsed but not further processed; reviews were not parsed but were processed
    version 2: papers were parsed but not further processed; reviews were again not parsed but received more processing
  17. Stage 2: model selection, training, and tuning
    Dec 24: added a supplementary introduction to YaRN, i.e.,
    3.1 Where YaRN comes from: modifying attention with "NTK-by-parts" interpolation
  18. Dec 26: started updating the section "4.3 LongLora LLaMA"
  19. Dec 28: added a new section: 4.4 Choosing the model, a three-way comparison: Mistral-instruct, LLaMA-LongLoRA, LLaMA-LongQLoRA
  20. Jan 4, 2024: to explain LongLora and LongQLora more clearly, split the LongLora content of this article out into a separate new article
    Also added a new section: 3.2.4 Final organization of the review data: a variant of JSON text
  21. Jan 6: in Section 4.4.1, clarified that the setup is, strictly speaking, yarn-mistral + qlora + s2attn; the longqlora implementation was not used directly, the team modified the code itself, but it still runs on the llama factory framework,
    so it is finally described as: yarn-mistral + qlora + s2attn + llama factory
  22. Jan 7: corrected a statement about GQA in Mistral: in Mistral's GQA, the number of Q heads is 4 times (not 2 times) the number of K/V heads
  23. Jan 9: revised the content of "4.4 Choosing the model, a three-way comparison: Yarn-Mistral-7b-64k, Mistral-instruct, LLaMA-LongLoRA/LLaMA-LongQLoRA"
  24. Stage 3: a breakthrough in model training at last
    Jan 17: updated the section
    4.4.4 Fine-tuning LLaMA 2 7B chat with LongQLoRA on the 10,000+ paper-review dataset: success

    Also changed the article title from
    七月論文審稿GPT第2版:從Meta Nougat、GPT4審稿到微調(diào)Mistral、LongLora Llama
    to
    七月論文審稿GPT第2版:如何用一萬(wàn)多條paper-review數(shù)據(jù)集微調(diào)LLaMA2以趕超GPT4
  25. Jan 18: added the prompt for having the fine-tuned model review a new paper, with emphasis on its format
    Added the representation of the 15,566 paper-review instruction samples in the second version (instruction-input-output triples)
  26. Jan 20: added the latest evaluation results, i.e., the new section
    6.2.1 Evaluating LLaMA 2 7B chat-LongQLoRA: stronger than GPT-3.5
  27. Jan 21: added the following new part
    Part 5 Model training and fine-tuning: how to fine-tune LLaMA 2 and Yarn-Mistral
  28. Jan 22: updated the win rate against GPT-4
    6.2.1 Evaluating LLaMA 2 7B chat-LongQLoRA: stronger than GPT-3.5 and GPT-4

    Also updated the article title again to
    七月論文審稿GPT第2版:用一萬(wàn)多條paper-review數(shù)據(jù)集微調(diào)LLaMA2最終反超GPT4
  29. Jan 24: since academic research and metric evaluation are extremely serious matters, added a section that fully discloses our evaluation method (following the evaluation protocol of the Stanford GPT-4-as-reviewer paper)
    6.2 Evaluating LLaMA 2 7B chat-LongQLoRA: stronger than GPT-3.5 and GPT-4
  30. Jan 25: in Section 5.1, added the specific modifications we made to the LongQLoRA source code for fine-tuning
  31. Jan 31: 不染 from the second project group finally also got "LLaMA 2 7B chat-LongLoRA" running,
    so updated the section: 4.4.3 LLaMA 2 7B chat-LongLoRA: success
