国产无码综合区,色欲AV无码国产永久播放,无码天堂亚洲国产AV,国产日韩欧美女同一区二区

Fastgpt配合chatglm+m3e或ollama+m3e搭建個(gè)人知識(shí)庫

1年前作者：BiuBiu__A分類：Toy博客閱讀(27)違法舉報(bào)

這篇具有很好參考價(jià)值的文章主要介紹了Fastgpt配合chatglm+m3e或ollama+m3e搭建個(gè)人知識(shí)庫。希望對(duì)大家有所幫助。如果存在錯(cuò)誤或未考慮完全的地方，請(qǐng)大家不吝賜教，您也可以點(diǎn)擊"舉報(bào)違法"按鈕提交疑問。

概述：

人工智能大語言模型是近年來人工智能領(lǐng)域的一項(xiàng)重要技術(shù)，它的出現(xiàn)標(biāo)志著自然語言處理領(lǐng)域的重大突破。這些模型利用深度學(xué)習(xí)和大規(guī)模數(shù)據(jù)訓(xùn)練，能夠理解和生成人類語言，為各種應(yīng)用場景提供了強(qiáng)大的文本處理能力。AI大語言模型的技術(shù)原理主要基于深度學(xué)習(xí)和自然語言處理技術(shù)，通過自監(jiān)督學(xué)習(xí)和大規(guī)模文本數(shù)據(jù)的預(yù)訓(xùn)練來學(xué)習(xí)語言的表示。訓(xùn)練完成后，可以通過微調(diào)等方法，將模型應(yīng)用于特定的任務(wù)和應(yīng)用場景。
未來，AI大語言模型有望在更多領(lǐng)域發(fā)揮作用，包括自然語言理解、文本生成、對(duì)話系統(tǒng)、語言翻譯等。它們可以用于自動(dòng)摘要、文檔生成、智能客服、智能問答等多種應(yīng)用場景，為用戶提供了更加智能和個(gè)性化的服務(wù)。
本文為學(xué)習(xí)大語言模型及FastGPT部署的學(xué)習(xí)筆記。通過直接部署**ChatGML3大語言模型**或**OLLAMA模型管理工具**配合FastGPT私有化搭建知識(shí)庫。其中**one-api**、**fastgpt**是兩種方法都需要部署的，其他的更建議使用ollama直接進(jìn)行部署，切換模型方便快捷，易于管理。

硬件要求

以下配置僅作參考
**chatglm3-6b+m3e：**3060 12 ↑
**qwen:4b+m3e：**3060 12 ↑
**qwen:2b+m3e：**1660 6g↑
總結(jié)：模型量級(jí)越大所需顯卡性能越高，過小的量級(jí)的大模型在低端cpu亦可運(yùn)行，只是推理的精準(zhǔn)度很差，更不能配合m3e向量模型進(jìn)行推理，速度會(huì)非常慢。

本文檔涉及到的資源

conda

https://www.anaconda.com/
Conda 是一個(gè)運(yùn)行在 Windows、MacOS 和 Linux 上的開源包管理系統(tǒng)和環(huán)境管理系統(tǒng)。Conda 可以：

快速安裝、運(yùn)行和更新包及其依賴項(xiàng)

輕松地在本地計(jì)算機(jī)上創(chuàng)建、保存、加載和切換環(huán)境

它是為 Python 程序創(chuàng)建的，但它可以為任何語言打包和分發(fā)軟件。

簡單總結(jié)一下就是 Conda 很好、很強(qiáng)大，使用 Conda 會(huì)讓你很省心。（人生苦短，我選 “Conda”!）

one-api

https://github.com/songquanpeng/one-api
All in one 的 OpenAI 接口整合各種 API 訪問方式一鍵部署，開箱即用

chatglm3

https://github.com/THUDM/ChatGLM3
ChatGLM3 是智譜 AI 和清華大學(xué) KEG 實(shí)驗(yàn)室聯(lián)合發(fā)布的對(duì)話預(yù)訓(xùn)練模型。

m3e

https://modelscope.cn/models/Jerry0/m3e-base/summary
M3E 是 Moka Massive Mixed Embedding 的縮寫

Moka，此模型由 MokaAI 訓(xùn)練，開源和評(píng)測，訓(xùn)練腳本使用 uniem ，評(píng)測 BenchMark 使用 MTEB-zh

Massive，此模型通過千萬級(jí) (2200w+) 的中文句對(duì)數(shù)據(jù)集進(jìn)行訓(xùn)練

Mixed，此模型支持中英雙語的同質(zhì)文本相似度計(jì)算，異質(zhì)文本檢索等功能，未來還會(huì)支持代碼檢索

Embedding，此模型是文本嵌入模型，可以將自然語言轉(zhuǎn)換成稠密的向量

ollama

https://ollama.com/
大語言模型管理工具

fastgpt

https://github.com/labring/FastGPT
FastGPT 是一個(gè)基于 LLM 大語言模型的知識(shí)庫問答系統(tǒng)，提供開箱即用的數(shù)據(jù)處理、模型調(diào)用等能力。同時(shí)可以通過 Flow 可視化進(jìn)行工作流編排，從而實(shí)現(xiàn)復(fù)雜的問答場景！

魔搭社區(qū)

https://modelscope.cn/home
ModelScope旨在打造下一代開源的模型即服務(wù)共享平臺(tái)，為泛AI開發(fā)者提供靈活、易用、低成本的一站式模型服務(wù)產(chǎn)品，讓模型應(yīng)用更簡單！

CONDA的使用

官網(wǎng)：https://www.anaconda.com/

安裝好后更新工具

conda update -n base -c defaults conda

更新各種庫

conda update --all

創(chuàng)建虛擬環(huán)境

conda create --name windows_chatglm3-6b python=3.11 -y

激活并進(jìn)入虛擬環(huán)境

conda activate windows_chatglm3-6b

安裝pytorch

cmd:nvidia-smi查看最高支持的 CUDA Version，我的是12.2

安裝pytorch

PyTorch是一個(gè)開源的Python機(jī)器學(xué)習(xí)庫，基于Torch，用于自然語言處理等應(yīng)用程序。PyTorch既可以看作加入了GPU支持的numpy，同時(shí)也可以看成一個(gè)擁有自動(dòng)求導(dǎo)功能的強(qiáng)大的深度神經(jīng)網(wǎng)絡(luò) 。

在 https://pytorch.org/get-started/locally/ 查詢自己電腦需要執(zhí)行的命令
在conda虛擬環(huán)境內(nèi)執(zhí)行以下命令安裝pytorch

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

chatglm3環(huán)境搭建(非ollama模式)

模型及DEMO下載

一共需要下載兩個(gè)模型chatglm3 及m3e

chatglm3下載

模型地址：https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary
下載方法：git
下載時(shí)間較久，耐心等待

git lfs install
git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git

m3e-base下載

git clone https://www.modelscope.cn/Jerry0/m3e-base.git

官方ChatGLM3 DEMO下載

地址：https://github.com/THUDM/ChatGLM3
下載方法：git

git clone https://github.com/THUDM/ChatGLM3

配置及運(yùn)行

進(jìn)入剛剛clone的 ChatGLM3/openai_api_demo文件夾
打開api_server.py的python文件
代碼拉倒最下方

覆蓋if name == “main”:方法內(nèi)的代碼如下：

其中一些地方需要修改，`tokenizer`及`model`的地址對(duì)應(yīng)的是[chatglm3](#eB5m3)的下載地址，`embedding_model`的地址對(duì)應(yīng)的是[m3e](#FHYRP)的下載地址，`port`可根據(jù)個(gè)人需要自行配置

tokenizer = AutoTokenizer.from_pretrained("E:\Work\HaoQue\FastGPT\models\chatglm3-6B-32k-int4", trust_remote_code=True)

model = AutoModel.from_pretrained("E:\Work\HaoQue\FastGPT\models\chatglm3-6B-32k-int4", trust_remote_code=True, device_map="auto").eval()
# load Embedding

embedding_model = SentenceTransformer("E:\Work\HaoQue\FastGPT\models\m3e-base", trust_remote_code=True, device="cuda")

uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)

回到ChatGLM3根目錄進(jìn)入剛剛創(chuàng)建的windows_chatglm3-6bconda 虛擬環(huán)境
cmd運(yùn)行pip install -r requirements.txt安裝依賴
耐心等待安裝完成
完成后運(yùn)行通過python運(yùn)行 python openai_api_demo/api_server.py
查看運(yùn)行結(jié)果，以下為運(yùn)行成功。

Fastgpt配合chatglm+m3e或ollama+m3e搭建個(gè)人知識(shí)庫,后端,AI,chatgpt,chatgpt,fastgpt,ai

ollama環(huán)境搭建

ollama程序下載及模型安裝安裝

下載：https://ollama.com/download
安裝直接下一步
安裝完成后進(jìn)入cmd 輸入 ollama -v驗(yàn)證是否成功

Fastgpt配合chatglm+m3e或ollama+m3e搭建個(gè)人知識(shí)庫,后端,AI,chatgpt,chatgpt,fastgpt,ai
通過ollama進(jìn)行模型下載

模型列表：https://ollama.com/library
這里以qwen:1.8b為例
cmd運(yùn)行 ollama run qwen:1.8b
耐心等待下載即可
下載完成模型會(huì)自動(dòng)啟動(dòng)，無需其他操作

m3e環(huán)境搭建（ollama模式）

使用docker進(jìn)行部署，docker安裝在此不做介紹。

docker run -d --name m3e -p 6008:6008 --gpus all -e sk-key=123321  registry.cn-hangzhou.aliyuncs.com/fastgpt_docker/m3e-large-api

one-api環(huán)境部署及配置

使用docker進(jìn)行部署，docker安裝在此不做介紹。

docker run --name one-api -d --restart always -p 3000:3000 -e TZ=Asia/Shanghai -v /home/ubuntu/data/one-api:/data justsong/one-api

然后訪問[http://localhost:3000/](http://localhost:3001/)端口為docker run 時(shí)候-p的端口，

Fastgpt配合chatglm+m3e或ollama+m3e搭建個(gè)人知識(shí)庫,后端,AI,chatgpt,chatgpt,fastgpt,ai

登陸初始賬號(hào)用戶名為 root，密碼為 123456。
登陸后來到渠道頁面添加渠道，此步驟添加的是大語言模型
如果你是通過ollama運(yùn)行的大模型，則需要再次添加新渠道，本次添加m3e渠道
新建令牌
新建令牌后，復(fù)制令牌sdk備用

fastgpt環(huán)境搭建

參考文檔：https://doc.fastai.site/docs/development/docker/

非 Linux 環(huán)境或無法訪問外網(wǎng)環(huán)境，可手動(dòng)創(chuàng)建一個(gè)目錄，并下載下面2個(gè)鏈接的文件: docker-compose.yml,config.json

注意: docker-compose.yml 配置文件中 Mongo 為 5.x，部分服務(wù)器不支持，需手動(dòng)更改其鏡像版本為 4.4.24（需要自己在docker hub下載，阿里云鏡像沒做備份）

config.json配置文件修改
1. 打開下載的config.json
2. 復(fù)制并替換**llmModels**數(shù)組中的第一組數(shù)據(jù)，修改model和name屬性為你部署的模型屬性，其他可以不做修改

    {
      "model": "gemma:2b",
      "name": "gemma:2b",
      "maxContext": 16000,
      "avatar": "/imgs/model/openai.svg",
      "maxResponse": 4000,
      "quoteMaxToken": 13000,
      "maxTemperature": 1.2,
      "charsPointsPrice": 0,
      "censor": false,
      "vision": false,
      "datasetProcess": true,
      "usedInClassify": true,
      "usedInExtractFields": true,
      "usedInToolCall": true,
      "usedInQueryExtension": true,
      "toolChoice": true,
      "functionCall": true,
      "customCQPrompt": "",
      "customExtractPrompt": "",
      "defaultSystemChatPrompt": "",
      "defaultConfig": {}
    },

如果你是ollama部署的大模型
1. 打開下載的config.json
2. 在**vectorModels**數(shù)組中添加以下數(shù)據(jù)

    {
      "model": "m3e",
      "name": "M3E",
      "inputPrice": 0,
      "outputPrice": 0,
      "defaultToken": 700,
      "maxToken": 1800,
      "weight": 100
    }

打開docker-compose.yml
注釋掉mysql 及 oneapi相關(guān)配置
**啟動(dòng)容器 **

在 docker-compose.yml 同級(jí)目錄下執(zhí)行。請(qǐng)確保docker-compose版本最好在2.17以上，否則可能無法執(zhí)行自動(dòng)化命令。

# 啟動(dòng)容器
docker-compose up -d
# 等待10s，OneAPI第一次總是要重啟幾次才能連上Mysql
sleep 10
# 重啟一次oneapi(由于OneAPI的默認(rèn)Key有點(diǎn)問題，不重啟的話會(huì)提示找不到渠道，臨時(shí)手動(dòng)重啟一次解決，等待作者修復(fù))
docker restart oneapi

訪問 FastGPT

目前可以通過 ip:3000 直接訪問(注意防火墻)。登錄用戶名為 root，密碼為docker-compose.yml環(huán)境變量里設(shè)置的 DEFAULT_ROOT_PSW。
如果需要域名訪問，請(qǐng)自行安裝并配置 Nginx。
首次運(yùn)行，會(huì)自動(dòng)初始化 root 用戶，密碼為 1234

新建知識(shí)庫

Fastgpt配合chatglm+m3e或ollama+m3e搭建個(gè)人知識(shí)庫,后端,AI,chatgpt,chatgpt,fastgpt,ai

上傳知識(shí)庫文件

Fastgpt配合chatglm+m3e或ollama+m3e搭建個(gè)人知識(shí)庫,后端,AI,chatgpt,chatgpt,fastgpt,ai

新建AI應(yīng)用

Fastgpt配合chatglm+m3e或ollama+m3e搭建個(gè)人知識(shí)庫,后端,AI,chatgpt,chatgpt,fastgpt,ai

開始使用

Fastgpt配合chatglm+m3e或ollama+m3e搭建個(gè)人知識(shí)庫,后端,AI,chatgpt,chatgpt,fastgpt,ai

運(yùn)行報(bào)錯(cuò)

報(bào)huggingface-hub的錯(cuò)

pip install huggingface-hub==0.20.3

顯存不足嘗試設(shè)置環(huán)境變量

set PYTORCH_CUDA_ALLOC_COFF=expandable_segments:True
再次運(yùn)行python api_server.py（經(jīng)測試無用）
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU 0 has a total capacity of 6.00 GiB of which 0 bytes is free. Of the allocated memory 12.31 GiB is allocated by PyTorch, and 1.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables))文章來源地址http://www.zghlxwxcb.cn/news/detail-853473.html

api_server.py整體

"""
This script implements an API for the ChatGLM3-6B model,
formatted similarly to OpenAI's API (https://platform.openai.com/docs/api-reference/chat).
It's designed to be run as a web server using FastAPI and uvicorn,
making the ChatGLM3-6B model accessible through OpenAI Client.

Key Components and Features:
- Model and Tokenizer Setup: Configures the model and tokenizer paths and loads them.
- FastAPI Configuration: Sets up a FastAPI application with CORS middleware for handling cross-origin requests.
- API Endpoints:
  - "/v1/models": Lists the available models, specifically ChatGLM3-6B.
  - "/v1/chat/completions": Processes chat completion requests with options for streaming and regular responses.
  - "/v1/embeddings": Processes Embedding request of a list of text inputs.
- Token Limit Caution: In the OpenAI API, 'max_tokens' is equivalent to HuggingFace's 'max_new_tokens', not 'max_length'.
For instance, setting 'max_tokens' to 8192 for a 6b model would result in an error due to the model's inability to output
that many tokens after accounting for the history and prompt tokens.
- Stream Handling and Custom Functions: Manages streaming responses and custom function calls within chat responses.
- Pydantic Models: Defines structured models for requests and responses, enhancing API documentation and type safety.
- Main Execution: Initializes the model and tokenizer, and starts the FastAPI app on the designated host and port.

Note:
    This script doesn't include the setup for special tokens or multi-GPU support by default.
    Users need to configure their special tokens and can enable multi-GPU support as per the provided instructions.
    Embedding Models only support in One GPU.

    Running this script requires 14-15GB of GPU memory. 2 GB for the embedding model and 12-13 GB for the FP16 ChatGLM3 LLM.


"""

import os
import time
import tiktoken
import torch
import uvicorn

from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware

from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from loguru import logger
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, AutoModel
from utils import process_response, generate_chatglm3, generate_stream_chatglm3
from sentence_transformers import SentenceTransformer

from sse_starlette.sse import EventSourceResponse

# Set up limit request time
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000000

# set LLM path
MODEL_PATH = os.environ.get('MODEL_PATH', 'D:\WangMing\FastGPT\models\chatglm3-6b-copy')
TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", 'D:\WangMing\FastGPT\models\chatglm3-6b-copy')

# set Embedding Model path
EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', 'D:\WangMing\FastGPT\models\m3e-base')


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


class ModelCard(BaseModel):
    id: str
    object: str = "model"
    created: int = Field(default_factory=lambda: int(time.time()))
    owned_by: str = "owner"
    root: Optional[str] = None
    parent: Optional[str] = None
    permission: Optional[list] = None


class ModelList(BaseModel):
    object: str = "list"
    data: List[ModelCard] = []


class FunctionCallResponse(BaseModel):
    name: Optional[str] = None
    arguments: Optional[str] = None


class ChatMessage(BaseModel):
    role: Literal["user", "assistant", "system", "function"]
    content: str = None
    name: Optional[str] = None
    function_call: Optional[FunctionCallResponse] = None


class DeltaMessage(BaseModel):
    role: Optional[Literal["user", "assistant", "system"]] = None
    content: Optional[str] = None
    function_call: Optional[FunctionCallResponse] = None


## for Embedding
class EmbeddingRequest(BaseModel):
    input: List[str]
    model: str


class CompletionUsage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


class EmbeddingResponse(BaseModel):
    data: list
    model: str
    object: str
    usage: CompletionUsage


# for ChatCompletionRequest

class UsageInfo(BaseModel):
    prompt_tokens: int = 0
    total_tokens: int = 0
    completion_tokens: Optional[int] = 0


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.8
    top_p: Optional[float] = 0.8
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False
    tools: Optional[Union[dict, List[dict]]] = None
    repetition_penalty: Optional[float] = 1.1


class ChatCompletionResponseChoice(BaseModel):
    index: int
    message: ChatMessage
    finish_reason: Literal["stop", "length", "function_call"]


class ChatCompletionResponseStreamChoice(BaseModel):
    delta: DeltaMessage
    finish_reason: Optional[Literal["stop", "length", "function_call"]]
    index: int


class ChatCompletionResponse(BaseModel):
    model: str
    id: str
    object: Literal["chat.completion", "chat.completion.chunk"]
    choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
    created: Optional[int] = Field(default_factory=lambda: int(time.time()))
    usage: Optional[UsageInfo] = None


@app.get("/health")
async def health() -> Response:
    """Health check."""
    return Response(status_code=200)


@app.post("/v1/embeddings", response_model=EmbeddingResponse)
async def get_embeddings(request: EmbeddingRequest):
    embeddings = [embedding_model.encode(text) for text in request.input]
    embeddings = [embedding.tolist() for embedding in embeddings]

    def num_tokens_from_string(string: str) -> int:
        """
        Returns the number of tokens in a text string.
        use cl100k_base tokenizer
        """
        encoding = tiktoken.get_encoding('cl100k_base')
        num_tokens = len(encoding.encode(string))
        return num_tokens

    response = {
        "data": [
            {
                "object": "embedding",
                "embedding": embedding,
                "index": index
            }
            for index, embedding in enumerate(embeddings)
        ],
        "model": request.model,
        "object": "list",
        "usage": CompletionUsage(
            prompt_tokens=sum(len(text.split()) for text in request.input),
            completion_tokens=0,
            total_tokens=sum(num_tokens_from_string(text) for text in request.input),
        )
    }
    return response


@app.get("/v1/models", response_model=ModelList)
async def list_models():
    model_card = ModelCard(
        id="chatglm3-6b"
    )
    return ModelList(
        data=[model_card]
    )


@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
    global model, tokenizer

    if len(request.messages) < 1 or request.messages[-1].role == "assistant":
        raise HTTPException(status_code=400, detail="Invalid request")

    gen_params = dict(
        messages=request.messages,
        temperature=request.temperature,
        top_p=request.top_p,
        max_tokens=request.max_tokens or 1024,
        echo=False,
        stream=request.stream,
        repetition_penalty=request.repetition_penalty,
        tools=request.tools,
    )
    logger.debug(f"==== request ====\n{gen_params}")

    if request.stream:

        # Use the stream mode to read the first few characters, if it is not a function call, direct stram output
        predict_stream_generator = predict_stream(request.model, gen_params)
        output = next(predict_stream_generator)
        if not contains_custom_function(output):
            return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")

        # Obtain the result directly at one time and determine whether tools needs to be called.
        logger.debug(f"First result output：\n{output}")

        function_call = None
        if output and request.tools:
            try:
                function_call = process_response(output, use_tool=True)
            except:
                logger.warning("Failed to parse tool call")

        # CallFunction
        if isinstance(function_call, dict):
            function_call = FunctionCallResponse(**function_call)

            """
            In this demo, we did not register any tools.
            You can use the tools that have been implemented in our `tools_using_demo` and implement your own streaming tool implementation here.
            Similar to the following method:
                function_args = json.loads(function_call.arguments)
                tool_response = dispatch_tool(tool_name: str, tool_params: dict)
            """
            tool_response = ""

            if not gen_params.get("messages"):
                gen_params["messages"] = []

            gen_params["messages"].append(ChatMessage(
                role="assistant",
                content=output,
            ))
            gen_params["messages"].append(ChatMessage(
                role="function",
                name=function_call.name,
                content=tool_response,
            ))

            # Streaming output of results after function calls
            generate = predict(request.model, gen_params)
            return EventSourceResponse(generate, media_type="text/event-stream")

        else:
            # Handled to avoid exceptions in the above parsing function process.
            generate = parse_output_text(request.model, output)
            return EventSourceResponse(generate, media_type="text/event-stream")

    # Here is the handling of stream = False
    response = generate_chatglm3(model, tokenizer, gen_params)

    # Remove the first newline character
    if response["text"].startswith("\n"):
        response["text"] = response["text"][1:]
    response["text"] = response["text"].strip()

    usage = UsageInfo()
    function_call, finish_reason = None, "stop"
    if request.tools:
        try:
            function_call = process_response(response["text"], use_tool=True)
        except:
            logger.warning("Failed to parse tool call, maybe the response is not a tool call or have been answered.")

    if isinstance(function_call, dict):
        finish_reason = "function_call"
        function_call = FunctionCallResponse(**function_call)

    message = ChatMessage(
        role="assistant",
        content=response["text"],
        function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
    )

    logger.debug(f"==== message ====\n{message}")

    choice_data = ChatCompletionResponseChoice(
        index=0,
        message=message,
        finish_reason=finish_reason,
    )
    task_usage = UsageInfo.model_validate(response["usage"])
    for usage_key, usage_value in task_usage.model_dump().items():
        setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)

    return ChatCompletionResponse(
        model=request.model,
        id="",  # for open_source model, id is empty
        choices=[choice_data],
        object="chat.completion",
        usage=usage
    )


async def predict(model_id: str, params: dict):
    global model, tokenizer

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(role="assistant"),
        finish_reason=None
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    previous_text = ""
    for new_response in generate_stream_chatglm3(model, tokenizer, params):
        decoded_unicode = new_response["text"]
        delta_text = decoded_unicode[len(previous_text):]
        previous_text = decoded_unicode

        finish_reason = new_response["finish_reason"]
        if len(delta_text) == 0 and finish_reason != "function_call":
            continue

        function_call = None
        if finish_reason == "function_call":
            try:
                function_call = process_response(decoded_unicode, use_tool=True)
            except:
                logger.warning(
                    "Failed to parse tool call, maybe the response is not a tool call or have been answered.")

        if isinstance(function_call, dict):
            function_call = FunctionCallResponse(**function_call)

        delta = DeltaMessage(
            content=delta_text,
            role="assistant",
            function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
        )

        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=delta,
            finish_reason=finish_reason
        )
        chunk = ChatCompletionResponse(
            model=model_id,
            id="",
            choices=[choice_data],
            object="chat.completion.chunk"
        )
        yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(),
        finish_reason="stop"
    )
    chunk = ChatCompletionResponse(
        model=model_id,
        id="",
        choices=[choice_data],
        object="chat.completion.chunk"
    )
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


def predict_stream(model_id, gen_params):
    """
    The function call is compatible with stream mode output.

    The first seven characters are determined.
    If not a function call, the stream output is directly generated.
    Otherwise, the complete character content of the function call is returned.

    :param model_id:
    :param gen_params:
    :return:
    """
    output = ""
    is_function_call = False
    has_send_first_chunk = False
    for new_response in generate_stream_chatglm3(model, tokenizer, gen_params):
        decoded_unicode = new_response["text"]
        delta_text = decoded_unicode[len(output):]
        output = decoded_unicode

        # When it is not a function call and the character length is> 7,
        # try to judge whether it is a function call according to the special function prefix
        if not is_function_call and len(output) > 7:

            # Determine whether a function is called
            is_function_call = contains_custom_function(output)
            if is_function_call:
                continue

            # Non-function call, direct stream output
            finish_reason = new_response["finish_reason"]

            # Send an empty string first to avoid truncation by subsequent next() operations.
            if not has_send_first_chunk:
                message = DeltaMessage(
                    content="",
                    role="assistant",
                    function_call=None,
                )
                choice_data = ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=message,
                    finish_reason=finish_reason
                )
                chunk = ChatCompletionResponse(
                    model=model_id,
                    id="",
                    choices=[choice_data],
                    created=int(time.time()),
                    object="chat.completion.chunk"
                )
                yield "{}".format(chunk.model_dump_json(exclude_unset=True))

            send_msg = delta_text if has_send_first_chunk else output
            has_send_first_chunk = True
            message = DeltaMessage(
                content=send_msg,
                role="assistant",
                function_call=None,
            )
            choice_data = ChatCompletionResponseStreamChoice(
                index=0,
                delta=message,
                finish_reason=finish_reason
            )
            chunk = ChatCompletionResponse(
                model=model_id,
                id="",
                choices=[choice_data],
                created=int(time.time()),
                object="chat.completion.chunk"
            )
            yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    if is_function_call:
        yield output
    else:
        yield '[DONE]'


async def parse_output_text(model_id: str, value: str):
    """
    Directly output the text content of value

    :param model_id:
    :param value:
    :return:
    """
    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(role="assistant", content=value),
        finish_reason=None
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(),
        finish_reason="stop"
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


def contains_custom_function(value: str) -> bool:
    """
    Determine whether 'function_call' according to a special function prefix.

    For example, the functions defined in "tools_using_demo/tool_register.py" are all "get_xxx" and start with "get_"

    [Note] This is not a rigorous judgment method, only for reference.

    :param value:
    :return:
    """
    return value and 'get_' in value


if __name__ == "__main__":
    # Load LLM
    tokenizer = AutoTokenizer.from_pretrained("D:\WangMing\FastGPT\models\chatglm3-6b-copy", trust_remote_code=True)
    model = AutoModel.from_pretrained("D:\WangMing\FastGPT\models\chatglm3-6b-copy", trust_remote_code=True, device_map="auto").quantize(4).eval()

    # load Embedding
    embedding_model = SentenceTransformer("D:\WangMing\FastGPT\models\chatglm3-6b-copy", trust_remote_code=True, device="cuda")
    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)

到了這里，關(guān)于Fastgpt配合chatglm+m3e或ollama+m3e搭建個(gè)人知識(shí)庫的文章就介紹完了。如果您還想了解更多內(nèi)容，請(qǐng)?jiān)谟疑辖撬阉鱐OY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章，希望大家以后多多支持TOY模板網(wǎng)！

本文來自互聯(lián)網(wǎng)用戶投稿，該文觀點(diǎn)僅代表作者本人，不代表本站立場。本站僅提供信息存儲(chǔ)空間服務(wù)，不擁有所有權(quán)，不承擔(dān)相關(guān)法律責(zé)任。如若轉(zhuǎn)載，請(qǐng)注明出處：如若內(nèi)容造成侵權(quán)/違法違規(guī)/事實(shí)不符，請(qǐng)點(diǎn)擊違法舉報(bào)進(jìn)行投訴反饋，一經(jīng)查實(shí)，立即刪除！

分享到：

領(lǐng)支付寶紅包贊助服務(wù)器費(fèi)用

2分鐘搭建FastGPT訓(xùn)練企業(yè)知識(shí)庫AI助理（Docker部署）
我們使用寶塔面板來進(jìn)行搭建，更方便快捷靈活，爭取操作時(shí)間只需兩分鐘在【軟件商店中】安裝【docker管理器】【docker模塊】即可通過【Docker】【添加容器】【容器編排】創(chuàng)建里新增docker-compose.yaml 以下是模板內(nèi)容僅需把? CHAT_API_KEY ?修改成 openai key 即可。如果需要使用
2024年02月10日
瀏覽(26)
AI知識(shí)庫進(jìn)階！三種數(shù)據(jù)處理方法！提高正確率！本地大模型+fastgpt知識(shí)庫手把手搭建！22/45
hi~ 在上一篇，我們成功搭建了本地知識(shí)庫+大模型的完全體！在知識(shí)星球收到很多朋友的打卡，有各種報(bào)錯(cuò)差點(diǎn)崩潰的，也有看到部署成功，開心得跳起來的！除了自用，還有星球朋友學(xué)會(huì)搭建，成功接到商單（聽說單子還不?。?！不管怎樣，酸甜苦辣，總算把它部署了下
2024年03月11日
瀏覽(20)
全民AI時(shí)代：手把手教你用Ollama & AnythingLLM搭建AI知識(shí)庫，無需編程，跟著做就行！
在本地電腦上跑大語言模型（LLM），已經(jīng)不是什么高科技操作了。隨著技術(shù)的迭代，現(xiàn)在利用Ollam和AnythingLLM就可以輕松構(gòu)建自己的本地知識(shí)庫，人人皆可上手，有手就行。過往要達(dá)成這一目標(biāo)，可是需要有編程經(jīng)驗(yàn)的。首先得了解一下背后的原理。大概就是三步走：一是
2024年04月24日
瀏覽(137)
使用chatglm搭建本地知識(shí)庫AI_聞達(dá)
最近大火的chatgpt，老板說讓我看看能不能用自己的數(shù)據(jù)，回答專業(yè)一些，所以做了一些調(diào)研，最近用這個(gè)倒是成功推理了自己的數(shù)據(jù)，模型也開源了，之后有機(jī)會(huì)也訓(xùn)練一下自己的數(shù)據(jù)。 1.1雙擊打開anconda prompt創(chuàng)建虛擬環(huán)境 1.2下載pytorch（這里要根據(jù)自己的電腦版本下載）都
2024年02月10日
瀏覽(23)
【搭建個(gè)人知識(shí)庫-3】
基于InternLM和LangChain搭建專屬個(gè)人的大模型知識(shí)庫；大模型開發(fā)范式 LangChain簡介構(gòu)建大模型具有簡單的廣度回答，但是在垂直領(lǐng)域的知識(shí)受限；如何讓LLM及時(shí)獲得最新的知識(shí) 如何打造垂直領(lǐng)域大模型如何打造個(gè)人專屬的LLM應(yīng)用兩種常用開發(fā)范式：RAG VS Finetune 即：檢索增
2024年02月01日
瀏覽(65)
開源大模型ChatGLM2-6B 2. 跟著LangChain參考文檔搭建LLM+知識(shí)庫問答系統(tǒng)
租用了1臺(tái)GPU服務(wù)器，系統(tǒng) ubuntu20，Tesla V100-16GB （GPU服務(wù)器已經(jīng)關(guān)機(jī)結(jié)束租賃了） SSH地址：* 端口：17520 SSH賬戶：root 密碼：Jaere7pa 內(nèi)網(wǎng)： 3389 ，外網(wǎng)：17518 VNC地址：* 端口：17519 VNC用戶名：root 密碼：Jaere7pa 硬件需求，ChatGLM-6B和ChatGLM2-6B相當(dāng)。量化等級(jí)?? ?最低 GPU 顯存 F
2024年02月03日
瀏覽(32)
使用 FastGPT 構(gòu)建高質(zhì)量 AI 知識(shí)庫
作者：余金隆。FastGPT 項(xiàng)目作者，Sealos 項(xiàng)目前端負(fù)責(zé)人，前 Shopee 前端開發(fā)工程師 FastGPT 項(xiàng)目地址： https://github.com/labring/FastGPT/ 自從去年 12 月 ChatGPT 發(fā)布以來，帶動(dòng)了一輪新的交互應(yīng)用革命。尤其在 GPT-3.5 接口全面開放后，大量的 LLM 應(yīng)用如雨后春筍般涌現(xiàn)。然而，由于 GP
2024年02月14日
瀏覽(24)
Linux服務(wù)器快速安裝FastGPT知識(shí)庫問答系統(tǒng)
最近開始體驗(yàn)FastGPT知識(shí)庫問答系統(tǒng)，參考官方文檔，在自己的阿里云服務(wù)器使用Docker Compose快速完成了部署。環(huán)境說明：阿里云ECS，2核8G，X86架構(gòu)，CentOS 7.9操作系統(tǒng)。 1.登錄服務(wù)器，執(zhí)行相關(guān)命令完成安裝。 1.登錄服務(wù)器，在/mnt目錄(可以自己選擇)下創(chuàng)建fastgpt目錄，并下載
2024年02月04日
瀏覽(21)
使用OpenAI Assistants三分鐘搭建個(gè)人知識(shí)庫AI助手網(wǎng)站
隨著OpenAI將Assistants助手API對(duì)外發(fā)布，我們搭建個(gè)人知識(shí)庫變的如此簡單。開發(fā)者將自己的應(yīng)用通過Assistants API與OpenAI對(duì)接，就可以讓每一位客戶擁有不一般體驗(yàn)的個(gè)人知識(shí)庫。由于Assistants相關(guān)API有30+，本文只列舉完成一個(gè)最小功能閉環(huán)涉及的接口。關(guān)于Assistants的介紹，這里
2024年02月05日
瀏覽(25)
docsify快速部署搭建個(gè)人知識(shí)庫（支持本地、服務(wù)器、虛擬機(jī)運(yùn)行）
?? 服務(wù)器與網(wǎng)站部署知識(shí)體系目錄我們先在本地運(yùn)行體會(huì)與獲取 docsify 結(jié)構(gòu)，后面再部署到服務(wù)器上運(yùn)行。部署一個(gè)個(gè)人知識(shí)庫只需要按照本文的指令直接 cv 即可。但請(qǐng)注意打開服務(wù)器防火墻的 80 端口。 Docsify即時(shí)生成您的文檔網(wǎng)站。與 GitBook 不同，它不會(huì)生成靜態(tài) htm
2024年02月04日
瀏覽(31)