Building a search engine for your own local database with jieba and Whoosh

Author: 東方佑 | Category: Toy博客 | Posted 2 years ago

This article shows how to combine jieba and Whoosh to build a search engine over your own local database. Corrections and suggestions are welcome.

Example

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from jieba.analyse import ChineseAnalyzer
from whoosh.qparser import QueryParser

import os



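# ChineseAnalyzer tokenizes Chinese text with jieba so that Whoosh can index it.
analyzer = ChineseAnalyzer()
schema = Schema(title=TEXT(stored=True, analyzer=analyzer), content=TEXT(stored=True, analyzer=analyzer), id=ID(stored=True))
# create_in starts a fresh index in the "index" directory.
if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)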


documents = [
	{
		"title": "下文",
		"content": "首先安裝jieba和whoosh庫，",
		"id": "1"
	},
	{
		"title": "中文自然語言處理",
		"content": "中文自然語言處理涉及分詞、詞性標注、命名實體識別等...",
		"id": "2"
	}
]

writer = ix.writer() 
for doc in documents:
    writer.add_document(title=doc["title"], content=doc["content"], id=doc["id"])
writer.commit()

# The with block closes the searcher once the results have been printed.
with ix.searcher() as searcher:
    query_parser = QueryParser("content", schema=ix.schema)
    search_input = "jieba和whoosh"
    query = query_parser.parse(search_input)
    results = searcher.search(query, limit=None)

    print(f"Found {len(results)} matching documents:")
    for result in results:
        print(f"{result['id']} - {result['title']}")

In practice

from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, TEXT, ID
from jieba.analyse import ChineseAnalyzer
from whoosh.qparser import QueryParser
import os

import jieba
import pandas as pd

from glob import glob
from multiprocessing import Process, freeze_support

from tqdm import tqdm


class GenVocTensorForDataSet:
    def __init__(self):
        pass

    @staticmethod
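    def gen_data_tensor(data_v, out_dir, process_count):
        """
        Segment each text file with jieba, regroup the tokens into segments,
        and pickle the result for this worker.

        :param data_v: list of text file paths handled by this worker
        :param out_dir: directory that receives the pickle file
        :param process_count: start offset of this batch, used in the file name
        :return: None
        """
        total_l = []
        one_p_count = 0
        for one_v in tqdm(data_v):
            one_p_count += 1

            with open(one_v, "r", encoding="utf-8") as f:
                total_str = f.read()
                # Drop all whitespace so the text becomes one continuous string.
                total_str = "".join(total_str.split())
            one_data = list(jieba.cut(total_str))
            documents = []
            text = ""
            for one in one_data:
                # Grow the segment until it no longer reappears in the rest of the text.
                text += one
                if text not in total_str[len("".join(documents)) + len(text):]:
                    documents.append(text)
                    text = ""
            total_l.append(documents)
        # One pickle per worker, named by batch offset and number of files processed.
        pd.to_pickle({"voc": total_l},
                     out_dir + "/{}{}.pandas_pickle_data_set".format(process_count, one_p_count))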

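    def gen_voc_data_to_tensor_set(self, paths_list_dir, out_dir, works_num=8):
        """
        Split the texts into unique segments, using several worker processes.
        :param paths_list_dir: folder containing the .txt files (with a trailing slash)
        :param out_dir: directory that receives the pickle files
        :param works_num: number of worker processes
        :return: None
        """
        paths_list_pr = glob(pathname=paths_list_dir + "*")

        p_list = []
        # Hand one slice of the file list to each worker process.
        # max(1, ...) keeps the step non-zero when there are fewer files than workers.
        step = max(1, len(paths_list_pr) // works_num)
        for i in range(0, len(paths_list_pr), step):
            j = i + step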

            p = Process(target=self.gen_data_tensor, args=(
                paths_list_pr[i:j], out_dir, i))
            p.start()
            p_list.append(p)

        for p in p_list:
            p.join()

    @staticmethod
    def init_data_set(paths_list_dir):
        paths_list_pr = glob(pathname=paths_list_dir + "*")
        analyzer = ChineseAnalyzer()
        schema = Schema(title=TEXT(stored=True, analyzer=analyzer), content=TEXT(stored=True, analyzer=analyzer),
                        id=ID(stored=True))
        if not os.path.exists("index"):
            os.mkdir("index")
        ix = create_in("index", schema, indexname='article_index')

        total_count_id = 0
        # The writer's context manager commits on success and cancels on error.
        with ix.writer() as writer:
            for one_p in paths_list_pr:
                documents = pd.read_pickle(one_p)
                for doc in tqdm(documents["voc"]):
                    # Pair each segment (content) with the segment that follows it (title).
                    for doc_i, doc_j in zip(doc[1:], doc[:-1]):
                        writer.add_document(title=doc_i, content=doc_j, id=str(total_count_id))
                        total_count_id += 1

    @staticmethod
    def add_data_set(paths_list_dir):
        paths_list_pr = glob(pathname=paths_list_dir + "*")
        # Open the index created by init_data_set (same directory and index name).
        ix = open_dir("index", indexname='article_index')
        # Continue numbering after the documents already indexed to avoid duplicate ids.
        total_count_id = ix.doc_count_all()
        with ix.writer() as writer:
            for one_p in paths_list_pr:
                documents = pd.read_pickle(one_p)
                for doc in tqdm(documents["voc"]):
                    for doc_i, doc_j in zip(doc[1:], doc[:-1]):
                        writer.add_document(title=doc_i, content=doc_j, id=str(total_count_id))
                        total_count_id += 1


    @staticmethod
    def search_by_jieba_world(search_text):
        ix = open_dir("index", indexname='article_index')
        with ix.searcher() as searcher:
            query_parser = QueryParser("content", schema=ix.schema)
            search_input = search_text
            query = query_parser.parse(search_input)
            results = searcher.search(query, limit=None)

            print(f"Found {len(results)} matching documents:")
            for result in results:
                print(f"{result['id']} - {result['title']}")
            # Copy the stored fields out before the searcher closes.
            hits = [result.fields() for result in results]
        return hits


if __name__ == '__main__':
    freeze_support()
    txt_p = "E:/just_and_sum/data_sets/"
    gvt_fds = GenVocTensorForDataSet()
    # Build the segmented corpus
    # gvt_fds.gen_voc_data_to_tensor_set(txt_p, "E:/just_and_sum/data_set_d",works_num=8)
    # Initialise the index
    # data_base = gvt_fds.init_data_set("E:/just_and_sum/data_set_d/")
    # Search
    search_res = gvt_fds.search_by_jieba_world("頭孢克洛頭孢泊肟酯是同")
    print(search_res)
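
The segmentation and pairing steps in gen_data_tensor and init_data_set can be tried without jieba, Whoosh or any files. The sketch below is a minimal illustration, assuming a hand-written token list in place of real jieba.cut output. The helper names split_unique_segments and build_pairs are hypothetical and do not appear in the original code. It tracks a running offset instead of re-joining the finished segments on every step, which should yield the same segments as long as the tokens concatenate back to the original text.

```python
def split_unique_segments(tokens):
    """Group tokens into segments that do not reappear later in the text.

    Hypothetical helper: mirrors the inner loop of gen_data_tensor, but keeps
    a running character offset instead of re-joining finished segments.
    """
    full_text = "".join(tokens)
    segments = []
    consumed = 0   # characters already assigned to finished segments
    current = ""   # segment being built
    for token in tokens:
        current += token
        rest = full_text[consumed + len(current):]
        if current not in rest:
            # The segment is unique from here on, so close it.
            segments.append(current)
            consumed += len(current)
            current = ""
    return segments


def build_pairs(segments):
    """Pair each segment (content) with the segment that follows it (title).

    Hypothetical helper: same zip(doc[1:], doc[:-1]) pairing as init_data_set.
    """
    return [
        {"title": nxt, "content": prev}
        for nxt, prev in zip(segments[1:], segments[:-1])
    ]


# Assumed token list; real jieba output for this text may differ.
tokens = ["頭孢", "克洛", "和", "頭孢", "泊肟酯"]
segments = split_unique_segments(tokens)
pairs = build_pairs(segments)
print(segments)
print(pairs)
```

The first token 頭孢 occurs again later in the text, so it is merged with the next token before the segment is closed. Because each indexed document stores the preceding segment as content and the following segment as title, a query against the content field returns the text that comes next.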

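The way gen_voc_data_to_tensor_set slices the file list across worker processes can also be checked on its own. This is a rough sketch under stated assumptions: the function names split_like_article and split_evenly are hypothetical, and plain integers stand in for file paths. The first function follows the article's floor-division step, guarded against a zero step. The second is an alternative that uses ceiling division so that no more than the requested number of batches is produced.

```python
def split_like_article(items, workers):
    """Slice items with a floor-division step, as in gen_voc_data_to_tensor_set.

    max(1, ...) avoids range() raising ValueError on a zero step when there
    are fewer items than workers. May return more than `workers` batches.
    """
    step = max(1, len(items) // workers)
    return [items[i:i + step] for i in range(0, len(items), step)]


def split_evenly(items, workers):
    """Alternative: ceiling-division step, giving at most `workers` batches."""
    step = max(1, -(-len(items) // workers))  # ceil(len(items) / workers)
    return [items[i:i + step] for i in range(0, len(items), step)]


files = list(range(10))  # stand-ins for ten file paths
print(len(split_like_article(files, 4)), "batches with floor division")
print(len(split_evenly(files, 4)), "batches with ceiling division")
```

With ten files and four workers the floor-division step is 2, so five processes are started rather than four. Whether that matters depends on how many CPU cores are available.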