I. Paper Information
1 Title
TRACE: A Comprehensive Benchmark for Continual Learning In Large Language Models
2 Venue
arXiv 2023
3 Authors
Fudan University
4 Keywords
Benchmark, Continual Learning, LLMs
II. Article Structure
III. Introduction
1 Motivation
- Aligned LLMs are highly capable, but their ability to learn continually has received little attention;
- Existing CL benchmarks are too easy for state-of-the-art LLMs, and the models have likely already been exposed to them during instruction tuning (i.e., potential data contamination rather than a safety concern).
2 Background
Intro-P1:
- The paradigm of LLMs (general capability) + fine-tuning (specialized capability) + alignment (safety) now dominates NLP. Yet demand for model capabilities keeps growing, especially for domain-specific knowledge, multilingual proficiency, complex task-solving, and tool usage.
- Retraining LLMs from scratch is prohibitively expensive, so incrementally training existing models via continual learning becomes essential. This raises a key question: To what degree do Aligned LLMs exhibit catastrophic forgetting when subjected to incremental training?
Intro-P2:
Existing CL benchmarks are ill-suited to evaluating SOTA LLMs, for the following reasons:
- They rely on common, simple NLU datasets that are too easy for LLMs; moreover, many of these have already been fed to LLMs as training data, so reusing them for evaluation is inappropriate.
- They only measure performance on the sequential tasks themselves, and lack evaluation of generalization to new tasks, adherence to human instructions, and safety preservation.
Intro-P3:
The paper proposes TRACE, a CL benchmark tailored to aligned LLMs:
- 8 distinct datasets spanning challenging tasks
- domain-specific tasks
- multilingual capabilities
- code generation
- mathematical reasoning
- equal distribution
- 3 metrics
- general ability delta
- instruction following delta
- safety delta
Intro-P4:
Five LLMs are evaluated on TRACE:
- Nearly all LLMs show a clear decline in general ability;
- Their multilingual ability actually improves;
- Full-parameter fine-tuning fits the target tasks better than LoRA, but suffers a clearer drop in general ability;
- Instruction-following ability also declines;
Intro-P5:
- Applying certain reasoning methods effectively preserves the model's abilities;
- The paper proposes Reasoning-augmented Continual Learning (RCL), which not only boosts performance on target tasks but also significantly upholds the inherent strengths of LLMs.
3 Related Work
3.1 CL
The classic three-way taxonomy of CL methods; see earlier posts for details.
3.2 CL Benchmarks in NLP
Standard CL benchmarks;
Task sequences of 15 classification datasets;
3.3 CoT
CoT;
Zero-shot CoT;
Fine-tuned CoT;
IV. Proposed Method
1 Benchmark Structure
TRACE consists of two main components:
- A selection of eight datasets constituting a tailored set of tasks for continual learning, covering challenges in domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning.
- A post-training evaluation of LLM capabilities. In addition to traditional continual learning metrics, we introduce General Ability Delta, Instruction Following Delta, and Safety Delta to evaluate shifts in LLM’s inherent abilities.
2 Dataset Construction
3 Evaluation Metrics
- General Ability Delta: $\Delta R_t^G=\frac{1}{M}\sum_{i=1}^{M}(R_{t,i}^G-R_{0,i}^G)$, where $R_{t,i}^G$ denotes the performance on evaluation task $i$ of the model after training through the $t$-th sequential task, and $R_{0,i}^G$ denotes the original model's performance on task $i$.
- Instruction Following Delta: $\Delta R_t^I=\frac{1}{N}\sum_{i=1}^{N}(R_{t,i}^I-R_{0,i}^I)$
- Safety Delta: $\Delta R_t^S=\frac{1}{L}\sum_{i=1}^{L}(R_{t,i}^S-R_{0,i}^S)$
The three deltas are computed in the same way; they differ only in the evaluation datasets used.
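A minimal sketch of how such a delta might be computed, assuming per-task scores are kept in plain dictionaries; the function name, data layout, and the numbers below are illustrative, not taken from the paper:

```python
from typing import Dict


def ability_delta(scores_after: Dict[str, float],
                  scores_before: Dict[str, float]) -> float:
    """Average score change over a fixed evaluation suite.

    scores_after[k]  ~ R_{t,k}: score on eval task k after training
                       through the t-th sequential task.
    scores_before[k] ~ R_{0,k}: score of the original aligned model
                       on the same eval task k.
    """
    tasks = scores_before.keys()
    return sum(scores_after[k] - scores_before[k] for k in tasks) / len(tasks)


# The three TRACE deltas share this formula; only the evaluation suite
# changes (general-ability, instruction-following, or safety benchmarks).
general_ability_delta = ability_delta(
    scores_after={"MMLU": 41.2, "GSM8K": 9.5},    # hypothetical post-CL scores
    scores_before={"MMLU": 45.3, "GSM8K": 16.0},  # hypothetical base scores
)
print(f"General Ability Delta: {general_ability_delta:+.2f}")  # -5.30
```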
4 Experimental Setup
4.1 Baselines
- Sequential Full-Parameter Fine-Tuning (SeqFT): all model parameters are trained on the tasks in sequence.
- LoRA-based Sequential Fine-Tuning (LoraSeqFT): only the low-rank LoRA matrices are fine-tuned, leaving the LLM backbone frozen. This method is chosen based on prior findings of reduced forgetting with "Efficient Tuning".
- Replay-based Sequential Fine-Tuning (Replay): replay, a common continual learning strategy, is employed for its simplicity and effectiveness. Alignment data from LIMA is added to the replay memory, and 10% of historical data is replayed.
- In-Context Learning (ICL): task demonstrations are supplied as part of the prompt, acting as a form of prompt engineering. A 6-shot setting is used.
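As a concrete illustration of the Replay baseline above, here is a minimal sketch of how the replay memory could be assembled; the function name and data format are assumptions, not the authors' code:

```python
import random


def build_replay_trainset(new_task_data, history, alignment_data,
                          replay_ratio=0.10, seed=0):
    """Mix the current task's data with a replay memory.

    history        - examples from all previously trained tasks
    alignment_data - alignment examples (LIMA-style, per the baseline above)
    replay_ratio   - fraction of historical examples replayed (10% here)
    """
    rng = random.Random(seed)
    n_replay = int(len(history) * replay_ratio)
    replay_memory = rng.sample(history, n_replay) + list(alignment_data)
    mixed = list(new_task_data) + replay_memory
    rng.shuffle(mixed)
    return mixed


# Hypothetical usage with toy example counts:
trainset = build_replay_trainset(
    new_task_data=[{"task": "ScienceQA", "id": i} for i in range(100)],
    history=[{"task": "NumGLUE", "id": i} for i in range(200)],
    alignment_data=[{"task": "LIMA", "id": i} for i in range(20)],
)
print(len(trainset))  # 100 new + 20 replayed + 20 alignment = 140
```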
To evaluate the resilience of safety alignment in models with diverse training backgrounds and strategies, five aligned models from three organizations are selected:
- Meta:
  - LLaMA-2-7B-Chat
  - LLaMA-2-13B-Chat
- Baichuan:
  - Baichuan 2-7B-Chat
- Large Model Systems Organization:
  - Vicuna-13B-V1.5
  - Vicuna-7B-V1.5
Experimental Results
Main results table
Performance on Sequential Tasks
- In-Context Learning (ICL) Performance: ICL methods generally perform lower than SeqFT and Replay methods. This suggests that the TRACE benchmark is indeed challenging, and LLMs can’t readily identify solutions just through simple demonstrations.
- Replay Performance: Among all the baselines, Replay achieved the highest OP score. With its BWT score being positive, Replay effectively retains its performance on sequential tasks without significant forgetting, making it a straightforward and efficient strategy in a continual learning context.
- Full Parameter Training vs. LoRA: Full-parameter training demonstrates better task-specific adaptability than LoRA, with a smaller BWT score. For instance, LLaMA-2-7B-Chat's SeqFT OP (BWT) is 48.7 (8.3%), while LoRASeqFT stands at 12.7 (45.7%). This suggests that when the focus is primarily on sequential tasks, full-parameter fine-tuning should be prioritized over parameter-efficient methods like LoRA.
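The OP and BWT scores quoted above are not defined in these notes; under the standard continual-learning definitions (with $R_{t,i}$ the score on sequential task $i$ after training through task $t$, and $T$ the number of tasks), they would be $\mathrm{OP}=\frac{1}{T}\sum_{i=1}^{T}R_{T,i}$ and $\mathrm{BWT}=\frac{1}{T-1}\sum_{i=1}^{T-1}(R_{T,i}-R_{i,i})$, i.e., average final performance across tasks and average backward transfer (forgetting when negative).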
Impact on General Abilities
From the Model Perspective:
- Nearly all models display a negative General Ability Delta, indicating a general decline in overall capabilities after continual learning.
- Larger models, in comparison to their smaller counterparts, show more pronounced forgetting in factual knowledge and reasoning tasks.
From the Task Perspective:
- Despite the presence of CoT prompts, there is a noticeable decline in math and reasoning abilities across all models, suggesting that these abilities are highly sensitive to new task learning.
- Excluding the llama2-7b model, most models exhibit a significant drop in performance on MMLU, suggesting a gradual loss of factual knowledge through continual learning.
- TydiQA task sees a general boost post-training, possibly due to the inclusion of Chinese and German datasets in our sequential tasks. Even more intriguing is the observed enhancement (and some declines) in other languages on TydiQA, suggesting potential cross-linguistic transfer characteristics.
- Performance shifts on PIQA for most models are subtle, indicating the relative robustness of commonsense knowledge during continual learning.
From the Methodological Perspective:
- The Replay method proves beneficial in preserving reasoning and factuality skills. Especially for larger models, the mitigation of forgetting through Replay is more pronounced. For instance, for LLaMA-2-7B-Chat, Replay offers a 6.5 EM score boost compared to methods without Replay, while for LLaMA-2-13B-Chat, the increase is 17.1 EM score.
Figure 2: instruction following and safety
- Figure 2(a) illustrates the win rate % for instruction following between sequentially trained LLMs and their original versions. Here, the win rate can be read as a proxy for the Instruction Following Delta. All three training methods exhibit a marked decline in instruction-following capability compared to their initial versions, with the decline being most pronounced for the LoRA method; approaches like LoRA should therefore be applied with caution for continual learning in LLMs. In short: after LoRA fine-tuning, the model is quite likely to stop following instructions.
- Figure 2(b) shows the win rate % for response safety between the continually trained LLMs and their starting versions, so the win rate can be read as a proxy for the Safety Delta. Compared to the original models, most answers were rated as "Tie", suggesting that the safety of the model's answers is largely unaffected by continual learning on general tasks. In short: in most cases, safety is not much affected by continual training.
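For reference, these win rates come from pairwise comparisons between each continually trained model and its original version. A minimal, purely hypothetical sketch of aggregating such judge verdicts (the pairwise judge itself, e.g. a GPT-4-style evaluator, is abstracted away):

```python
from collections import Counter


def win_rates(verdicts):
    """verdicts: one 'win' / 'tie' / 'lose' label per prompt, judged from
    the continually trained model's perspective against the original model."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {label: round(100.0 * counts[label] / total, 1)
            for label in ("win", "tie", "lose")}


# Hypothetical verdicts on a small prompt set:
print(win_rates(["lose", "tie", "lose", "win", "tie", "lose"]))
# {'win': 16.7, 'tie': 33.3, 'lose': 50.0}
```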
Factors Influencing Forgetting in LLMs
Data volume and number of training epochs
- Performance improves as data volume grows, indicating at least 5000 samples from the TRACE-selected datasets are needed for full fitting.
- Performance improves with up to 5 training epochs, confirming our baseline epoch setting balances target task optimization and retaining existing capabilities.
How exactly does the reasoning capability of LLMs transform during the continual learning process?
- There is a surge in the model's reasoning prowess after training on the ScienceQA task, while it declines after the other tasks.
- even though the two tasks from NumGLUE are mathematically inclined, their answers don’t provide a clear reasoning path. ScienceQA does offer such a pathway in its answers. This observation suggests the potential advantage of incorporating reasoning paths during training to preserve and perhaps even enhance the model’s reasoning capability.
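Following this observation (and foreshadowing the RCL idea below), here is a purely illustrative sketch of reasoning-augmented supervision, where the training target contains a rationale rather than the bare answer; the prompt format and the sample item are assumptions, not the paper's exact recipe:

```python
def to_training_text(example, include_rationale=True):
    """Format a QA example for supervised fine-tuning.

    example: dict with 'question', 'answer', and optionally 'rationale'
             (a step-by-step reasoning path, as ScienceQA answers provide).
    """
    prompt = f"Question: {example['question']}\nAnswer:"
    if include_rationale and example.get("rationale"):
        target = (f" Let's think step by step. {example['rationale']}"
                  f" So the answer is {example['answer']}.")
    else:
        target = f" {example['answer']}"
    return prompt + target


# Hypothetical sample item:
sample = {
    "question": "Which material is an electrical conductor: rubber or copper?",
    "rationale": "Copper is a metal, and metals conduct electricity; rubber is an insulator.",
    "answer": "copper",
}
print(to_training_text(sample))
```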
Reasoning-Augmented Continual Learning
Motivation:
Instead of treating LLMs as traditional models and inundating them with large volumes of data to fit a task’s distribution, might we leverage their inherent abilities for rapid task transfer?
Method
Results
Discussion
Can traditional continual learning methods be effectively applied to LLMs?
- High Training Cost: LLMs require significant data for both pre-training and alignment, leading to a high training cost. Using simple replay to maintain past capabilities can be very expensive. Therefore, selecting key data from past training to keep LLMs’ diverse predictive abilities is essential.
- Large Number of Parameters: The huge parameter size of LLMs demands advanced hardware for training. Many regularization techniques need to store gradients from past tasks, which is a big challenge for both CPU and GPU memory.
- One-for-All Deployment of LLMs: LLMs are designed for a wide range of tasks, meaning tailoring parameters for specific tasks might limit their ability to generalize to new tasks. Additionally, methods that adjust the network dynamically can complicate deployment, as it becomes tricky to handle multiple task queries at once.
How should LLMs approach continual learning?
- Direct end-to-end training of LLMs might cause them to focus excessively on specific patterns of the target task, potentially hindering their performance in more general scenarios.
- LLMs are already trained on diverse datasets and possess the ability to handle multiple tasks, even with limited examples. Building upon the Superficial Alignment Hypothesis proposed by LIMA, the focus should be on aligning LLMs’ existing capabilities with new tasks rather than starting from scratch.
- Therefore, strategies like the RCL approach, which leverage LLMs’ inherent abilities for quick transfer to novel tasks, can be effective in mitigating catastrophic forgetting.