I. Paper Information
1 Title
TRACE: A Comprehensive Benchmark for Continual Learning In Large Language Models
2 Venue
arXiv 2023
3 Authors
Fudan University
4 Keywords
Benchmark, Continual Learning, LLMs
II. Article Structure
III. Introduction
1 Motivation
- Aligned LLMs are highly capable, but their ability to learn continually has received little attention;
- Existing CL benchmarks are too easy for state-of-the-art LLMs, and the models have likely already been exposed to them during instruction tuning (i.e., potential data contamination rather than a safety concern).
2 Background
Intro-P1:
- The paradigm of LLMs (general capability) + fine-tuning (specialized capability) + alignment (safety) now dominates NLP. Yet demand for model capabilities keeps growing, especially for domain-specific knowledge, multilingual proficiency, complex task-solving, and tool usage.
- Retraining LLMs from scratch is prohibitively expensive, so incrementally training existing models via continual learning becomes essential. This raises a key question: To what degree do Aligned LLMs exhibit catastrophic forgetting when subjected to incremental training?
Intro-P2:
Existing CL benchmarks are ill-suited to evaluating SOTA LLMs, for the following reasons:
- They rely on common, simple NLU datasets that are too easy for LLMs; moreover, many of these have already been fed to LLMs as training data, so reusing them for evaluation is inappropriate.
- They only measure performance on the sequential tasks themselves, and lack evaluation of generalization to new tasks, adherence to human instructions, and safety preservation.
Intro-P3:
The paper proposes TRACE, a CL benchmark tailored to aligned LLMs:
- 8 distinct datasets spanning challenging tasks
- domain-specific tasks
- multilingual capabilities
- code generation
- mathematical reasoning
- equal distribution
- 3 metrics
- general ability delta
- instruction following delta
- safety delta
Intro-P4:
Five LLMs are evaluated on TRACE:
- Nearly all LLMs show a clear decline in general ability;
- Their multilingual ability actually improves;
- Full-parameter fine-tuning fits the target tasks better than LoRA, but suffers a clearer drop in general ability;
- Instruction-following ability also declines;
Intro-P5:
- Applying certain reasoning methods effectively preserves the model's abilities;
- The paper proposes Reasoning-augmented Continual Learning (RCL), which not only boosts performance on target tasks but also significantly upholds the inherent strengths of LLMs.
3 Related Work
3.1 CL
The classic three-way taxonomy of CL methods; see earlier posts for details.
3.2 CL Benchmarks in NLP
Standard CL benchmarks;
Task sequences of 15 classification datasets;
3.3 CoT
CoT;
Zero-shot CoT;
Fine-tuned CoT;
IV. Proposed Method
1 Benchmark Structure
TRACE consists of two main components:
- A selection of eight datasets constituting a tailored set of tasks for continual learning, covering challenges in domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning.
- A post-training evaluation of LLM capabilities. In addition to traditional continual learning metrics, we introduce General Ability Delta, Instruction Following Delta, and Safety Delta to evaluate shifts in LLM’s inherent abilities.
2 Dataset Construction
3 Evaluation Metrics
- General Ability Delta: $\Delta R_t^G=\frac{1}{M}\sum_{i=1}^{M}(R_{t,i}^G-R_{0,i}^G)$, where $R_{t,i}^G$ denotes the performance on evaluation task $i$ of the model after training through the $t$-th sequential task, and $R_{0,i}^G$ denotes the original model's performance on task $i$.
- Instruction Following Delta: $\Delta R_t^I=\frac{1}{N}\sum_{i=1}^{N}(R_{t,i}^I-R_{0,i}^I)$
- Safety Delta: $\Delta R_t^S=\frac{1}{L}\sum_{i=1}^{L}(R_{t,i}^S-R_{0,i}^S)$
The three deltas are computed in the same way; they differ only in the evaluation datasets used.
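A minimal sketch of how such a delta might be computed, assuming per-task scores are kept in plain dictionaries; the function name, data layout, and the numbers below are illustrative, not taken from the paper:

```python
from typing import Dict


def ability_delta(scores_after: Dict[str, float],
                  scores_before: Dict[str, float]) -> float:
    """Average score change over a fixed evaluation suite.

    scores_after[k]  ~ R_{t,k}: score on eval task k after training
                       through the t-th sequential task.
    scores_before[k] ~ R_{0,k}: score of the original aligned model
                       on the same eval task k.
    """
    tasks = scores_before.keys()
    return sum(scores_after[k] - scores_before[k] for k in tasks) / len(tasks)


# The three TRACE deltas share this formula; only the evaluation suite
# changes (general-ability, instruction-following, or safety benchmarks).
general_ability_delta = ability_delta(
    scores_after={"MMLU": 41.2, "GSM8K": 9.5},    # hypothetical post-CL scores
    scores_before={"MMLU": 45.3, "GSM8K": 16.0},  # hypothetical base scores
)
print(f"General Ability Delta: {general_ability_delta:+.2f}")  # -5.30
```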
4 Experimental Setup
4.1 Baselines
- Sequential Full-Parameter Fine-Tuning (SeqFT): all model parameters are trained on the tasks in sequence.
- LoRA-based Sequential Fine-Tuning (LoraSeqFT): only the low-rank LoRA matrices are fine-tuned, leaving the LLM backbone frozen. This method is chosen based on prior findings of reduced forgetting with "Efficient Tuning".
- Replay-based Sequential Fine-Tuning (Replay): replay, a common continual learning strategy, is employed for its simplicity and effectiveness. Alignment data from LIMA is added to the replay memory, and 10% of historical data is replayed.
- In-Context Learning (ICL): task demonstrations are supplied as part of the prompt, acting as a form of prompt engineering. A 6-shot setting is used.
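As a concrete illustration of the Replay baseline above, here is a minimal sketch of how the replay memory could be assembled; the function name and data format are assumptions, not the authors' code:

```python
import random


def build_replay_trainset(new_task_data, history, alignment_data,
                          replay_ratio=0.10, seed=0):
    """Mix the current task's data with a replay memory.

    history        - examples from all previously trained tasks
    alignment_data - alignment examples (LIMA-style, per the baseline above)
    replay_ratio   - fraction of historical examples replayed (10% here)
    """
    rng = random.Random(seed)
    n_replay = int(len(history) * replay_ratio)
    replay_memory = rng.sample(history, n_replay) + list(alignment_data)
    mixed = list(new_task_data) + replay_memory
    rng.shuffle(mixed)
    return mixed


# Hypothetical usage with toy example counts:
trainset = build_replay_trainset(
    new_task_data=[{"task": "ScienceQA", "id": i} for i in range(100)],
    history=[{"task": "NumGLUE", "id": i} for i in range(200)],
    alignment_data=[{"task": "LIMA", "id": i} for i in range(20)],
)
print(len(trainset))  # 100 new + 20 replayed + 20 alignment = 140
```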
To evaluate the resilience of safety alignment in models with diverse training backgrounds and strategies, five aligned models from three organizations are selected:
- Meta:
  - LLaMA-2-7B-Chat
  - LLaMA-2-13B-Chat
- Baichuan:
  - Baichuan 2-7B-Chat
- Large Model Systems Organization:
  - Vicuna-13B-V1.5
  - Vicuna-7B-V1.5
Experimental Results
Main results table
Performance on Sequential Tasks
- In-Context Learning (ICL) Performance: ICL methods generally perform lower than SeqFT and Replay methods. This suggests that the TRACE benchmark is indeed challenging, and LLMs can’t readily identify solutions just through simple demonstrations.
- Replay Performance: Among all the baselines, Replay achieved the highest OP score. With its BWT score being positive, Replay effectively retains its performance on sequential tasks without significant forgetting, making it a straightforward and efficient strategy in a continual learning context.
- Full Parameter Training vs. LoRA: Full-parameter training demonstrates better task-specific adaptability than LoRA, with a smaller BWT score. For instance, LLaMA-2-7B-Chat's SeqFT OP (BWT) is 48.7 (8.3%), while LoRASeqFT stands at 12.7 (45.7%). This suggests that when the focus is primarily on sequential tasks, full-parameter fine-tuning should be prioritized over parameter-efficient methods like LoRA.
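The OP and BWT scores quoted above are not defined in these notes; under the standard continual-learning definitions (with $R_{t,i}$ the score on sequential task $i$ after training through task $t$, and $T$ the number of tasks), they would be $\mathrm{OP}=\frac{1}{T}\sum_{i=1}^{T}R_{T,i}$ and $\mathrm{BWT}=\frac{1}{T-1}\sum_{i=1}^{T-1}(R_{T,i}-R_{i,i})$, i.e., average final performance across tasks and average backward transfer (forgetting when negative).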
Impact on General Abilities
From the Model Perspective:
- Nearly all models display a negative General Ability Delta, indicating a general decline in overall capabilities after continual learning.
- Larger models, in comparison to their smaller counterparts, show more pronounced forgetting in factual knowledge and reasoning tasks.
From the Task Perspective:
- Despite the presence of CoT prompts, there is a noticeable decline in math and reasoning abilities across all models, suggesting that these abilities are highly sensitive to new task learning.
- Excluding the llama2-7b model, most models exhibit a significant drop in performance on MMLU, suggesting a gradual loss of factual knowledge through continual learning.
- TydiQA task sees a general boost post-training, possibly due to the inclusion of Chinese and German datasets in our sequential tasks. Even more intriguing is the observed enhancement (and some declines) in other languages on TydiQA, suggesting potential cross-linguistic transfer characteristics.
- Performance shifts on PIQA for most models are subtle, indicating the relative robustness of commonsense knowledge during continual learning.
From the Methodological Perspective:
- The Replay method proves beneficial in preserving reasoning and factuality skills. Especially for larger models, the mitigation of forgetting through Replay is more pronounced. For instance, for LLaMA-2-7B-Chat, Replay offers a 6.5 EM score boost compared to methods without Replay, while for LLaMA-2-13B-Chat, the increase is 17.1 EM score.
Figure 2: instruction following and safety
- Figure 2(a) illustrates the win rate % for instruction following between sequentially trained LLMs and their original versions. Here, the win rate can be read as a proxy for the Instruction Following Delta. All three training methods exhibit a marked decline in instruction-following capability compared to their initial versions, with the decline being most pronounced for the LoRA method; approaches like LoRA should therefore be applied with caution for continual learning in LLMs. In short: after LoRA fine-tuning, the model is quite likely to stop following instructions.
- Figure 2(b) shows the win rate % for response safety between the continually trained LLMs and their starting versions, so the win rate can be read as a proxy for the Safety Delta. Compared to the original models, most answers were rated as "Tie", suggesting that the safety of the model's answers is largely unaffected by continual learning on general tasks. In short: in most cases, safety is not much affected by continual training.
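For reference, these win rates come from pairwise comparisons between each continually trained model and its original version. A minimal, purely hypothetical sketch of aggregating such judge verdicts (the pairwise judge itself, e.g. a GPT-4-style evaluator, is abstracted away):

```python
from collections import Counter


def win_rates(verdicts):
    """verdicts: one 'win' / 'tie' / 'lose' label per prompt, judged from
    the continually trained model's perspective against the original model."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {label: round(100.0 * counts[label] / total, 1)
            for label in ("win", "tie", "lose")}


# Hypothetical verdicts on a small prompt set:
print(win_rates(["lose", "tie", "lose", "win", "tie", "lose"]))
# {'win': 16.7, 'tie': 33.3, 'lose': 50.0}
```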
Factors Influencing Forgetting in LLMs
Data volume and number of training epochs
- Performance improves as data volume grows, indicating at least 5000 samples from the TRACE-selected datasets are needed for full fitting.
- Performance improves with up to 5 training epochs, confirming our baseline epoch setting balances target task optimization and retaining existing capabilities.
How exactly does the reasoning capability of LLMs transform during the continual learning process?
- There is a surge in the model's reasoning prowess after training on the ScienceQA task, while it declines after the other tasks.
- even though the two tasks from NumGLUE are mathematically inclined, their answers don’t provide a clear reasoning path. ScienceQA does offer such a pathway in its answers. This observation suggests the potential advantage of incorporating reasoning paths during training to preserve and perhaps even enhance the model’s reasoning capability.
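Following this observation (and foreshadowing the RCL idea below), here is a purely illustrative sketch of reasoning-augmented supervision, where the training target contains a rationale rather than the bare answer; the prompt format and the sample item are assumptions, not the paper's exact recipe:

```python
def to_training_text(example, include_rationale=True):
    """Format a QA example for supervised fine-tuning.

    example: dict with 'question', 'answer', and optionally 'rationale'
             (a step-by-step reasoning path, as ScienceQA answers provide).
    """
    prompt = f"Question: {example['question']}\nAnswer:"
    if include_rationale and example.get("rationale"):
        target = (f" Let's think step by step. {example['rationale']}"
                  f" So the answer is {example['answer']}.")
    else:
        target = f" {example['answer']}"
    return prompt + target


# Hypothetical sample item:
sample = {
    "question": "Which material is an electrical conductor: rubber or copper?",
    "rationale": "Copper is a metal, and metals conduct electricity; rubber is an insulator.",
    "answer": "copper",
}
print(to_training_text(sample))
```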
Reasoning-Augmented Continual Learning
Motivation:
Instead of treating LLMs as traditional models and inundating them with large volumes of data to fit a task’s distribution, might we leverage their inherent abilities for rapid task transfer?
Method
Results
Discussion
Can traditional continual learning methods be effectively applied to LLMs?
- High Training Cost: LLMs require significant data for both pre-training and alignment, leading to a high training cost. Using simple replay to maintain past capabilities can be very expensive. Therefore, selecting key data from past training to keep LLMs’ diverse predictive abilities is essential.
- Large Number of Parameters: The huge parameter size of LLMs demands advanced hardware for training. Many regularization techniques need to store gradients from past tasks, which is a big challenge for both CPU and GPU memory.
- One-for-All Deployment of LLMs: LLMs are designed for a wide range of tasks, meaning tailoring parameters for specific tasks might limit their ability to generalize to new tasks. Additionally, methods that adjust the network dynamically can complicate deployment, as it becomes tricky to handle multiple task queries at once.
How should LLMs approach continual learning?
- Direct end-to-end training of LLMs might cause them to focus excessively on specific patterns of the target task, potentially hindering their performance in more general scenarios.
- LLMs are already trained on diverse datasets and possess the ability to handle multiple tasks, even with limited examples. Building upon the Superficial Alignment Hypothesis proposed by LIMA, the focus should be on aligning LLMs’ existing capabilities with new tasks rather than starting from scratch.
- Therefore, strategies like the RCL approach, which leverage LLMs’ inherent abilities for quick transfer to novel tasks, can be effective in mitigating catastrophic forgetting.