[AI] How to Get Streaming Output from Language Models (LLMs): A HuggingFace Transformers Implementation

This article introduces how to get streaming output from language models (LLMs) with HuggingFace Transformers. I hope it is useful; if anything is wrong or incomplete, feedback is welcome.

HuggingFace Transformers is a very convenient library that integrates many SOTA models, including LLaMA, GPT, ChatGLM, MOSS, and others. Most mainstream solutions today are built on top of this framework. Previously, to get streaming output you had to modify the model's underlying inference logic yourself.

ChatGLM, for example, implements its own streaming output like this:

# chatglm-6b/modeling_chatglm.py -- excerpt: these are methods of the ChatGLM model class
    @torch.no_grad()
    def stream_chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, max_length: int = 2048,
                    do_sample=True, top_p=0.7, temperature=0.95, logits_processor=None, **kwargs):
        if history is None:
            history = []
        if logits_processor is None:
            logits_processor = LogitsProcessorList()
        logits_processor.append(InvalidScoreLogitsProcessor())
        gen_kwargs = {"max_length": max_length, "do_sample": do_sample, "top_p": top_p,
                      "temperature": temperature, "logits_processor": logits_processor, **kwargs}
        if not history:
            prompt = query
        else:
            prompt = ""
            for i, (old_query, response) in enumerate(history):
                prompt += "[Round {}]\n問(wèn):{}\n答:{}\n".format(i, old_query, response)
            prompt += "[Round {}]\n問(wèn):{}\n答:".format(len(history), query)
        inputs = tokenizer([prompt], return_tensors="pt")
        inputs = inputs.to(self.device)
        for outputs in self.stream_generate(**inputs, **gen_kwargs):
            outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):]
            response = tokenizer.decode(outputs)
            response = self.process_response(response)
            new_history = history + [(query, response)]
            yield response, new_history

    @torch.no_grad()
    def stream_generate(
            self,
            input_ids,
            generation_config: Optional[GenerationConfig] = None,
            logits_processor: Optional[LogitsProcessorList] = None,
            stopping_criteria: Optional[StoppingCriteriaList] = None,
            prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
            **kwargs,
    ):
        batch_size, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1]

        if generation_config is None:
            generation_config = self.generation_config
        generation_config = copy.deepcopy(generation_config)
        model_kwargs = generation_config.update(**kwargs)
        bos_token_id, eos_token_id = generation_config.bos_token_id, generation_config.eos_token_id

        if isinstance(eos_token_id, int):
            eos_token_id = [eos_token_id]

        has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None
        if has_default_max_length and generation_config.max_new_tokens is None:
            warnings.warn(
                f"Using `max_length`'s default ({generation_config.max_length}) to control the generation length. "
                "This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we"
                " recommend using `max_new_tokens` to control the maximum length of the generation.",
                UserWarning,
            )
        elif generation_config.max_new_tokens is not None:
            generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length
            if not has_default_max_length:
                logger.warn(
                    f"Both `max_new_tokens` (={generation_config.max_new_tokens}) and `max_length`(="
                    f"{generation_config.max_length}) seem to have been set. `max_new_tokens` will take precedence. "
                    "Please refer to the documentation for more information. "
                    "(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)",
                    UserWarning,
                )

        if input_ids_seq_length >= generation_config.max_length:
            input_ids_string = "decoder_input_ids" if self.config.is_encoder_decoder else "input_ids"
            logger.warning(
                f"Input length of {input_ids_string} is {input_ids_seq_length}, but `max_length` is set to"
                f" {generation_config.max_length}. This can lead to unexpected behavior. You should consider"
                " increasing `max_new_tokens`."
            )

        # 2. Set generation parameters if not already defined
        logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
        stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()

        logits_processor = self._get_logits_processor(
            generation_config=generation_config,
            input_ids_seq_length=input_ids_seq_length,
            encoder_input_ids=input_ids,
            prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
            logits_processor=logits_processor,
        )

        stopping_criteria = self._get_stopping_criteria(
            generation_config=generation_config, stopping_criteria=stopping_criteria
        )
        logits_warper = self._get_logits_warper(generation_config)

        unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
        scores = None
        while True:
            model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
            # forward pass to get next token
            outputs = self(
                **model_inputs,
                return_dict=True,
                output_attentions=False,
                output_hidden_states=False,
            )

            next_token_logits = outputs.logits[:, -1, :]

            # pre-process distribution
            next_token_scores = logits_processor(input_ids, next_token_logits)
            next_token_scores = logits_warper(input_ids, next_token_scores)

            # sample
            probs = nn.functional.softmax(next_token_scores, dim=-1)
            if generation_config.do_sample:
                next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
            else:
                next_tokens = torch.argmax(probs, dim=-1)

            # update generated ids, model inputs, and length for next step
            input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
            model_kwargs = self._update_model_kwargs_for_generation(
                outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
            )
            unfinished_sequences = unfinished_sequences.mul((sum(next_tokens != i for i in eos_token_id)).long())

            # stop when each sentence is finished, or if we exceed the maximum length
            if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
                break
            yield input_ids
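
For context, here is a minimal sketch of how such a hand-rolled generator is typically consumed (assuming the model is loaded with trust_remote_code=True so that stream_chat is available; the query string and printing logic are just illustrative):

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

history = []
# stream_chat yields (partial_response, updated_history) after every decoding step,
# so each iteration gives the response text generated so far.
for response, history in model.stream_chat(tokenizer, "Hello", history=history):
    print(response)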

HuggingFace Transformers Implementation

Hugging Face has also recognized this need and, as of v4.30.1, provides two streaming interfaces:

  • TextStreamer: streams the generated text to stdout
  • TextIteratorStreamer: lets you consume the generated text in your own loop

Details below.

TextStreamer

Reference: Text generation strategies, https://huggingface.co/docs/transformers/main/generation_strategies

The generate() method supports streaming through its streamer input. The streamer input is compatible with any instance of a class that implements the following methods: put() and end(). Internally, put() is used to push new tokens and end() is used to flag the end of text generation.

The API for the streamer classes is still under development and may change in the future.
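
To make the contract above concrete, here is a minimal sketch of a custom streamer (the class name TokenCountStreamer is made up for illustration; it only needs to implement put() and end()):

class TokenCountStreamer:
    """Bare-bones streamer: generate() first calls put() with the prompt ids,
    then with the newly generated token ids, and end() when generation is done."""

    def __init__(self):
        self.count = 0

    def put(self, value):
        # value is a tensor of token ids: the full prompt on the first call,
        # then the newly sampled token(s) on each subsequent call
        self.count += value.shape[-1]
        print(f"received {value.shape[-1]} token id(s), {self.count} so far")

    def end(self):
        print("generation finished")

An instance can then be passed as model.generate(**inputs, streamer=TokenCountStreamer(), max_new_tokens=20).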

In practice, you can craft your own streaming class for all sorts of purposes! We also have basic streaming classes ready for you to use. For example, you can use the TextStreamer class to stream the output of generate() into your screen, one word at a time:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
streamer = TextStreamer(tok)

# Despite returning the usual output, the streamer will also print the generated text to stdout.
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
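
In current versions, TextStreamer also accepts skip_prompt plus decode keyword arguments such as skip_special_tokens, in case you do not want the prompt or special tokens echoed to the console (a small variant of the example above):

streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)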

TextIteratorStreamer

Reference: Utilities for Generation, https://huggingface.co/docs/transformers/main/en/internal/generation_utils#transformers.TextStreamer

Streamer that stores print-ready text in a queue, to be used by a downstream application as an iterator. This is useful for applications that benefit from accessing the generated text in a non-blocking way (e.g. in an interactive Gradio demo).

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
streamer = TextIteratorStreamer(tok)

# Run the generation in a separate thread, so that we can fetch the generated text in a non-blocking way.
generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=20)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
generated_text = ""
for new_text in streamer:
    generated_text += new_text
print(generated_text)
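
The thread-plus-streamer pattern above is easy to wrap into a plain Python generator, which is handy for CLI loops or web backends. A sketch is shown below; the helper name stream_text is made up, and skip_prompt/skip_special_tokens are optional TextIteratorStreamer arguments used here so that only the newly generated text is yielded:

def stream_text(model, tok, prompt, **gen_kwargs):
    # Yields freshly generated text fragments as they become available.
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    inputs = tok([prompt], return_tensors="pt").to(model.device)
    thread = Thread(target=model.generate, kwargs=dict(inputs, streamer=streamer, **gen_kwargs))
    thread.start()
    for new_text in streamer:
        yield new_text
    thread.join()

for piece in stream_text(model, tok, "An increasing sequence: one,", max_new_tokens=20):
    print(piece, end="", flush=True)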

ChatGLM Streaming Reply Demo

Below is a simple CLI demo that uses ChatGLM-6B together with TextIteratorStreamer (a TextStreamer variant is shown, commented out, at the end).

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, AutoModel
from transformers import TextIteratorStreamer
from threading import Thread

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# Build the conversation text shown in the console
def build_prompt(history):
    prompt = "Welcome to the ChatGLM-6B model. Type a message to chat, 'clear' to reset the history, 'stop' to quit"
    for query, response in history:
        prompt += f"\n\nUser: {query}"
        prompt += f"\n\nChatGLM-6B: {response}"
    return prompt

# Maintain the multi-turn history
def build_history(history, query, response, index):
    history[index] = [query, response]
    return history

if __name__ == "__main__":
    # TextIteratorStreamer-based implementation
    # Note: by default the streamer also yields the decoded prompt; pass skip_prompt=True to omit it
    streamer = TextIteratorStreamer(tokenizer)
    history = []
    turn_count = 0
    while True:
        query = input("\nUser: ")
        if query.strip() == "stop":
            break
        if query.strip() == "clear":
            history = []
            turn_count = 0
            os.system("clear")
            print("Welcome to the ChatGLM-6B model. Type a message to chat, 'clear' to reset the history, 'stop' to quit")
            continue
        
        history.append([query, ""])
        
        inputs = tokenizer([query], return_tensors="pt").to('cuda')
        generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=512)
        thread = Thread(target=model.generate, kwargs=generation_kwargs)
        thread.start()
        generated_text = ""
        count = 0
        # Stream the output
        for new_text in streamer:
            generated_text += new_text
            history = build_history(history, query, generated_text, turn_count)
            count += 1
            if count % 8 == 0:
                os.system("clear")
                print(build_prompt(history), flush=True)
        os.system("clear")
        print(build_prompt(history), flush=True)
        turn_count += 1
    
    # TextStreamer-based implementation (prints directly to stdout)
    # streamer = TextStreamer(tokenizer)
    # _ = model.generate(**inputs, streamer=streamer, max_new_tokens=512)

That concludes this article on how to get streaming output from language models (LLMs) with HuggingFace Transformers.
