[C++項(xiàng)目] Boost文檔站內(nèi)搜索引擎(4): 搜索的相關(guān)接口的實(shí)現(xiàn)、線程安全的單例index接口、cppjieba分詞庫(kù)的使用、綜合調(diào)試...

這篇具有很好參考價(jià)值的文章主要介紹了[C++項(xiàng)目] Boost文檔站內(nèi)搜索引擎(4): 搜索的相關(guān)接口的實(shí)現(xiàn)、線程安全的單例index接口、cppjieba分詞庫(kù)的使用、綜合調(diào)試...。希望對(duì)大家有所幫助。如果存在錯(cuò)誤或未考慮完全的地方，請(qǐng)大家不吝賜教，您也可以點(diǎn)擊"舉報(bào)違法"按鈕提交疑問。

[C++項(xiàng)目] Boost文檔站內(nèi)搜索引擎(4): 搜索的相關(guān)接口的實(shí)現(xiàn)、線程安全的單例index接口、cppjieba分詞庫(kù)的使用、綜合調(diào)試...,C++項(xiàng)目實(shí)戰(zhàn)-Boost庫(kù)文檔站內(nèi)搜索,c++,搜索引擎,Boost,項(xiàng)目,linux

有關(guān)Boost文檔搜索引擎的項(xiàng)目的前三篇文章, 已經(jīng)分別介紹分析了:

項(xiàng)目背景: ??[C++項(xiàng)目] Boost文檔站內(nèi)搜索引擎(1): 項(xiàng)目背景介紹、相關(guān)技術(shù)棧、相關(guān)概念介紹…
文檔解析、處理模塊parser的實(shí)現(xiàn): ??[C++項(xiàng)目] Boost文檔站內(nèi)搜索引擎(2): 文檔文本解析模塊parser的實(shí)現(xiàn)、如何對(duì)文檔文件去標(biāo)簽、如何獲取文檔標(biāo)題…
文檔 正排索引與倒排索引 建立的接口的實(shí)現(xiàn): ??[C++項(xiàng)目] Boost文檔站內(nèi)搜索引擎(3): 建立文檔及其關(guān)鍵字的正排倒排索引、jieba庫(kù)的安裝與使用…
建議先閱讀上面三篇文章

已經(jīng)實(shí)現(xiàn)了對(duì)文檔建立索引的相關(guān)接口. 有了接口, 就可以調(diào)用并建立文檔索引.

建立了索引, 其實(shí)就可以根據(jù)索引查找文檔了. 所以, 本篇文章的內(nèi)容即為:

查找、搜索相關(guān)接口的實(shí)現(xiàn)
建立索引接口的相關(guān)優(yōu)化
本地搜索測(cè)試

做完上面的內(nèi)容, 就后面就是加入網(wǎng)絡(luò)和頁(yè)面的制作了~

搜索

搜索是通過(guò)輸入的內(nèi)容進(jìn)行搜索的. 并且一定是 先在倒排索引中找到文檔id, 再根據(jù)文檔id去正排索引中找到文檔 的內(nèi)容.

而倒排索引中存儲(chǔ)的內(nèi)容是對(duì)文檔內(nèi)容進(jìn)行分詞, 然后根據(jù)分詞建立的.

那么要實(shí)現(xiàn)搜索, 也需要 對(duì)搜索的內(nèi)容進(jìn)行分詞, 然后再根據(jù)搜索內(nèi)容的分詞在倒排索引中查找關(guān)鍵詞對(duì)應(yīng)的倒排拉鏈

搜索接口的基本結(jié)構(gòu)

了解了搜索的流程, 那么搜索的相關(guān)接口的基本結(jié)構(gòu)實(shí)際也就顯現(xiàn)出來(lái)了:

namespace ns_searcher {
	class searcher {
	private:
		ns_index::index* _index; // 建立索引的類

	public:
        // 初始化接口
        // 在搜索之前需要先建立索引. 這個(gè)接口就是建立索引用的
		void initSearcher(const std::string& input) {}

		// 搜索接口
		// 搜索需要實(shí)現(xiàn)什么功能?
        // 搜索需要接收字符串, 然后針對(duì)字符串進(jìn)行分詞 再根據(jù)分詞在索引中進(jìn)行查找
		// 首先參數(shù)部分需要怎么實(shí)現(xiàn)?
		// 參數(shù)部分, 需要接收需要搜索的句子或關(guān)鍵字, 還需要一個(gè)輸出型參數(shù) 用于輸出查找結(jié)果
		//  查找結(jié)果我們使用jsoncpp進(jìn)行序列化和反序列化
		void search(const std::string& query, std::string* jsonString) {}

基本的結(jié)構(gòu)就這么簡(jiǎn)單. 只需要對(duì)外提供兩個(gè)接口:

initSearcher() 初始化接口
search() 搜索接口

initSearcher()接口實(shí)現(xiàn)

initSearcher() 是用來(lái)做搜索前的工作的, 實(shí)際就是建立索引的接口

但是, 在建立索引之前我們清楚所有的搜索都是在唯一一個(gè)倒排索引和唯一一個(gè)正排索引中進(jìn)行的. 也就是說(shuō) 最終一個(gè)程序中只需要建立一次索引. 所以我們可以將索引的相關(guān)函數(shù)實(shí)現(xiàn)為單例.

`index`接口類單例實(shí)現(xiàn)

index類的單例實(shí)現(xiàn)非常的簡(jiǎn)單:

namespace ns_index {

	// 用于正排索引中 存儲(chǔ)文檔內(nèi)容
	typedef struct docInfo {
		std::string _title;	  // 文檔標(biāo)題
		std::string _content; // 文檔去標(biāo)簽之后的內(nèi)容
		std::string _url;	  // 文檔對(duì)應(yīng)官網(wǎng)url
		std::size_t _docId;	  // 文檔id
	} docInfo_t;

	// 用于倒排索引中 記錄關(guān)鍵字對(duì)應(yīng)的文檔id和權(quán)重
	typedef struct invertedElem {
		std::size_t _docId;	   // 文檔id
		std::string _keyword;  // 關(guān)鍵字
		std::uint64_t _weight; // 搜索此關(guān)鍵字, 此文檔id 所占權(quán)重

		invertedElem() // 權(quán)重初始化為0
			: _weight(0) {}
	} invertedElem_t;

	// 關(guān)鍵字的詞頻
	typedef struct keywordCnt {
		std::size_t _titleCnt;	 // 關(guān)鍵字在標(biāo)題中出現(xiàn)的次數(shù)
		std::size_t _contentCnt; // 關(guān)鍵字在內(nèi)容中出現(xiàn)的次數(shù)

		keywordCnt()
			: _titleCnt(0)
			, _contentCnt(0) {}
	} keywordCnt_t;

	// 倒排拉鏈
	typedef std::vector<invertedElem_t> invertedList_t;

	class index {
	private:
		// 正排索引使用vector, 下標(biāo)天然是 文檔id
		std::vector<docInfo_t> forwardIndex;
		// 倒排索引 使用 哈希表, 因?yàn)榈古潘饕?一定是 一個(gè)keyword 對(duì)應(yīng)一組 invertedElem拉鏈
		std::unordered_map<std::string, invertedList_t> invertedIndex;

		// 單例模式設(shè)計(jì)
		index() {}

		index(const index&) = delete;
		index& operator=(const index&) = delete;

		static index* _instance; // 單例
		static std::mutex _mtx;

	public:
		// 獲取單例
		static index* getInstance() {
			if (nullptr == _instance) {
				_mtx.lock();
				if (nullptr == _instance) {
					_instance = new index;
				}
				_mtx.unlock();
			}

			return _instance;
		}
		
        // 通過(guò)關(guān)鍵字 檢索倒排索引, 獲取對(duì)應(yīng)的 倒排拉鏈
		invertedList_t* getInvertedList(const std::string& keyword) {}

		// 通過(guò)倒排拉鏈中 每個(gè)倒排元素中存儲(chǔ)的 文檔id, 檢索正排索引, 獲取對(duì)應(yīng)文檔內(nèi)容
		docInfo_t* getForwardIndex(std::size_t docId) {}

		// 根據(jù)parser模塊處理過(guò)的 所有文檔的信息
		// 提取文檔信息, 建立 正排索引和倒排索引
		// input 為 ./data/output/raw
		bool buildIndex(const std::string& input) {}

	private:
		// 對(duì)一個(gè)文檔建立正排索引
		docInfo_t* buildForwardIndex(const std::string& file) {}
        // 對(duì)一個(gè)文檔建立倒排索引
		bool buildInvertedIndex(const docInfo_t& doc) {}
	};
	// 單例相關(guān)
	index* index::_instance = nullptr;
	std::mutex index::_mtx;
}

需要做的工作也就只有:

添加兩個(gè)成員變量, 并在類外定義:

static index* _instance;

static std::mutex _mtx;
構(gòu)造函數(shù)設(shè)置私有, 拷貝構(gòu)造函數(shù)和賦值重載函數(shù)刪除:

index() {}

index(const index&) = delete;

index& operator=(const index&) = delete;

添加線程安全的獲取單例的公開接口:

static index* getInstance() {
    if (nullptr == _instance) {
        _mtx.lock();
        if (nullptr == _instance) {
            _instance = new index;
        }
        _mtx.unlock();
    }

    return _instance;
}

這樣就將index類設(shè)計(jì)為了單例模式

接口實(shí)現(xiàn)

initSearcher()接口的實(shí)現(xiàn)也是非常的簡(jiǎn)單, 只需要建立索引就可以了:

void initSearcher(const std::string& input) {
    // 搜索前的初始化操作
    // search類成員 ns_index::index* _index 獲取單例
    _index = ns_index::index::getInstance();
    std::cout << "獲取單例成功 ..." << std::endl;
    
    // 建立索引
    _index->buildIndex(input);
    std::cout << "構(gòu)建正排索引、倒排索引成功 ..." << std::endl;
}

search()接口實(shí)現(xiàn) **

searcher類中, 初始化接口initSearcher()實(shí)現(xiàn)的簡(jiǎn)單.

但是search()就沒有那么簡(jiǎn)單了, 需要注意非常多的細(xì)節(jié)

搜索接口需要實(shí)現(xiàn)的功能是:

接收字符串, 然后針對(duì)字符串進(jìn)行分詞
再根據(jù)分詞在倒排索引中查找對(duì)應(yīng)的倒排拉鏈
通過(guò)倒排拉鏈獲取相關(guān)文檔的id
再根據(jù)文檔id, 查找正排索引查找對(duì)應(yīng)的文檔內(nèi)容信息
最終查找到的文檔內(nèi)容信息是需要輸出的, 所以我們接口使用了輸出型參數(shù)

但這只是功能實(shí)現(xiàn)的整體邏輯. 還有許多的細(xì)節(jié)需要考慮:

倒排索引中的關(guān)鍵詞都是小寫的, 而搜索輸入的內(nèi)容很可能存在大小寫, 如何實(shí)現(xiàn)忽略大小寫的搜索呢?
查找到倒排拉鏈之后, 是可以通過(guò)遍歷拉鏈獲取到文檔id等相關(guān)信息的

不過(guò), 頁(yè)面的顯示是需要按照相關(guān)度排序的, 我們也在倒排索引中使用詞頻簡(jiǎn)單地體現(xiàn)出了關(guān)鍵字與對(duì)應(yīng)文檔的相關(guān)性

那么如何對(duì)獲取到的文檔進(jìn)行排序呢?
在查找的時(shí)候, 一定會(huì)有不同的詞查找到同一個(gè)文檔的問題. 那么如果不做處理, 就會(huì)出現(xiàn)同一個(gè)文檔在頁(yè)面中不同的位置被顯示出來(lái)的問題, 該怎么解決呢?
獲取到文檔內(nèi)容信息之后, 是需要將設(shè)置文檔需要展示的相關(guān)信息的: title description url

如果文檔內(nèi)容過(guò)長(zhǎng), 一定不能將文檔全部?jī)?nèi)容展示在搜索頁(yè)面中, 那么如何獲取文章相關(guān)的摘要呢?
還有一些其他細(xì)節(jié), 結(jié)合代碼具體分析…

那么, 根據(jù)需求 search()接口的實(shí)現(xiàn)代碼就是這樣的:

typedef struct invertedElemOut {
    std::size_t _docId;
    std::uint64_t _weight;
    std::vector<std::string> _keywords;
} invertedElemOut_t;

// 搜索接口
// 首先參數(shù)部分需要怎么實(shí)現(xiàn)?
// 參數(shù)部分, 需要接收需要搜索的句子或關(guān)鍵字, 還需要一個(gè)輸出型參數(shù) 用于輸出查找結(jié)果
//  查找結(jié)果我們使用jsoncpp進(jìn)行序列化和反序列化
// search() 具體需要實(shí)現(xiàn)的功能:
//  1. 對(duì)接收的句子或關(guān)鍵詞進(jìn)行分詞
//  2. 根據(jù)分詞, 在倒排索引中查找到所有分詞的倒排拉鏈 并匯總其中的 invertedElem, 然后根據(jù)相關(guān)性進(jìn)行排序
//  4. 然后再遍歷所有的 invertedElem, 根據(jù) invertedElem中存儲(chǔ)的 文檔id, 在正排索引中獲取到文檔內(nèi)容
//  5. 然后將獲取到的文檔內(nèi)容使用jsoncpp 進(jìn)行序列化, 存儲(chǔ)到輸出型參數(shù)中
// 直到遍歷完invertedElem
void search(const std::string& query, std::string* jsonString) {
    // 1. 對(duì)需要搜索的句子或關(guān)鍵詞進(jìn)行分詞
    std::vector<std::string> keywords;
    ns_util::jiebaUtil::cutString(query, &keywords);

    // 統(tǒng)計(jì)文檔用, 因?yàn)榭赡艽嬖诓煌姆衷~ 在倒排索引中指向同一個(gè)文檔的情況
    // 如果不去重, 會(huì)重復(fù)展示
    std::unordered_map<std::size_t, invertedElemOut_t> invertedElemOutMap;
    // 2. 根據(jù)分詞獲取倒排索引中的倒排拉鏈, 并匯總?cè)ブ?invertedElem
    for (std::string word : keywords) {
        boost::to_lower(word);

        ns_index::invertedList_t* tmpInvertedList = _index->getInvertedList(word);
        if (nullptr == tmpInvertedList) {
            // 沒有這個(gè)關(guān)鍵詞
            continue;
        }

        for (auto& elem : *tmpInvertedList) {
            // 遍歷倒排拉鏈, 根據(jù)文檔id 對(duì)invertedElem 去重
            auto& item = invertedElemOutMap[elem._docId]; // 在map中獲取 或 創(chuàng)建對(duì)應(yīng)文檔id的 invertedElem
            item._docId = elem._docId;
            item._weight += elem._weight;
            // 權(quán)重需要+= 是因?yàn)槎鄠€(gè)關(guān)鍵詞指向了同一個(gè)文檔 那么就說(shuō)明此文檔的與搜索內(nèi)容的相關(guān)性更高
      		// 就可以將多個(gè)關(guān)鍵字關(guān)于此文檔的權(quán)重相加, 表示搜索相關(guān)性高
            // 最好也將 此文檔相關(guān)的關(guān)鍵詞 也存儲(chǔ)起來(lái), 因?yàn)樵诳蛻舳怂阉鹘Y(jié)果中, 可能需要對(duì)網(wǎng)頁(yè)中有的關(guān)鍵字進(jìn)行高亮
            // 但是 invertedElem 的第三個(gè)成員是 單獨(dú)的一個(gè)string對(duì)象, 不太合適
            // 所以, 可以定義一個(gè)與invertedElem 相似的, 但是第三個(gè)成員是一個(gè) vector 的類, 比如 invertedElemOut
            item._keywords.push_back(elem._keyword);
            // 此時(shí)就將當(dāng)前invertedElem 去重到了 invertedElemMap 中
        }
    }
    
    // vector 存儲(chǔ) 文檔id相關(guān)信息, 方便排序
    std::vector<invertedElemOut_t> allInvertedElemOut;
    // 出循環(huán)之后, 就將搜索到的 文檔的 id、權(quán)重和相關(guān)關(guān)鍵詞 存儲(chǔ)到了 invertedElemMap
    // 然后將文檔的相關(guān)信息 invertedElemOut 都存儲(chǔ)到 vector 中
    for (const auto& elemOut : invertedElemOutMap) {
        // map中的second: elemOut, 在執(zhí)行此操作之后, 就沒用了
        // 所以使用移動(dòng)語(yǔ)義, 防止發(fā)生拷貝
        allInvertedElemOut.push_back(std::move(elemOut.second));
    }

    // 執(zhí)行到這里, 可以搜索到的文檔id 權(quán)重 和 相關(guān)關(guān)鍵詞的信息, 已經(jīng)都在allInvertedElemOut 中了.
    // 但是, 還不能直接 根據(jù)文檔id 在正排索引中檢索
    // 因?yàn)? 此時(shí)如果直接進(jìn)行文檔內(nèi)容的索引, 在找到文檔內(nèi)容之后, 就要直接進(jìn)行序列化并輸出了. 而客戶端顯示的時(shí)候, 反序列化出來(lái)的文檔順序, 就是顯示的文檔順序
    // 但是現(xiàn)在找到的文檔還是亂序的. 還需要將allInvertedElemOut中的相關(guān)文檔, 通過(guò)_weight 進(jìn)行倒序排列
    // 這樣, 序列化就是按照倒序排列的, 反序列化也會(huì)如此, 顯示同樣如此
    std::sort(allInvertedElemOut.begin(), allInvertedElemOut.end(),
              [](const invertedElemOut_t& elem1, const invertedElemOut_t& elem2) {
                  return elem1._weight > elem2._weight;
              });

    // 排序之后, allInvertedElemOut中 文檔的排序就是降序了
    // 然后 通過(guò)遍歷此數(shù)組, 獲取文檔id, 根據(jù)id獲取文檔在正排索引中的內(nèi)容
    // 然后再將 所有內(nèi)容序列化
    Json::Value root;
    for (auto& elemOut : allInvertedElemOut) {
        // 通過(guò)Json::Value 對(duì)象, 存儲(chǔ)文檔內(nèi)容
        Json::Value elem;
        // 通過(guò)elemOut._docId 獲取正排索引中 文檔的內(nèi)容信息
        ns_index::docInfo_t* doc = _index->getForwardIndex(elemOut._docId);
        // elem賦值
        elem["url"] = doc->_url;
        elem["title"] = doc->_title;
        // 關(guān)于文檔的內(nèi)容, 搜索結(jié)果中是不展示文檔的全部?jī)?nèi)容的, 應(yīng)該只顯示包含關(guān)鍵詞的摘要, 點(diǎn)進(jìn)文檔才顯示相關(guān)內(nèi)容
        // 而docInfo中存儲(chǔ)的是文檔去除標(biāo)簽之后的所有內(nèi)容, 所以不能直接將 doc._content 存儲(chǔ)到elem對(duì)應(yīng)key:value中
        elem["desc"] = getDesc(doc->_content, elemOut._keywords[0]); // 只根據(jù)第一個(gè)關(guān)鍵詞來(lái)獲取摘要
        // for Debug
        // 這里有一個(gè)bug, jsoncpp 0.10.5.2 是不支持long或long long 相關(guān)類型的, 所以需要轉(zhuǎn)換成 double
        // 這里轉(zhuǎn)換成 double不會(huì)有什么影響, 因?yàn)檫@兩個(gè)參數(shù)只是本地調(diào)試顯示用的.
        elem["docId"] = (double)doc->_docId;
        elem["weight"] = (double)elemOut._weight;

        root.append(elem);
    }

    // 序列化完成之后將相關(guān)內(nèi)容寫入字符串
    // for Debug 用 styledWriter
    Json::StyledWriter writer;
    *jsonString = writer.write(root);
}

執(zhí)行搜索, 首先要做的就是 對(duì)傳入的字符串進(jìn)行分詞

然后根據(jù)每個(gè)分詞, 在倒排索引中查找對(duì)應(yīng)的倒排拉鏈, 再通過(guò)遍歷倒排拉鏈就可以獲取到當(dāng)前關(guān)鍵字對(duì)應(yīng)出現(xiàn)的文檔相關(guān)信息.

不過(guò), 分詞之后-遍歷時(shí)-正式查找之前要做的首要任務(wù)就是, 將分詞轉(zhuǎn)換為小寫. 因?yàn)? 倒排索引中的所有關(guān)鍵詞都是小寫的狀態(tài)

并且, 查找到倒排拉鏈在獲取并統(tǒng)計(jì)文檔信息時(shí), 還會(huì)出現(xiàn)不同關(guān)鍵字指向同一文檔的情況, 這種情況是需要處理的 不能多次記錄同一個(gè)文檔.

還有就是, 如果一次搜索中 多個(gè)關(guān)鍵詞指向了同一個(gè)文檔那么就說(shuō)明此文檔的與搜索內(nèi)容的相關(guān)性更高, 此時(shí)是需要將文檔的顯示權(quán)重增加的.

根據(jù)這些需求, 實(shí)現(xiàn)了第一部分的代碼:

第一部分的代碼實(shí)現(xiàn)了:

對(duì)搜索內(nèi)容分詞
遍歷分詞查找倒排拉鏈
根據(jù)倒排拉鏈去重獲取文檔信息

這部分代碼, 有三個(gè)要點(diǎn):

需要定義一個(gè)unordered_map來(lái)實(shí)現(xiàn)對(duì)搜索到的文檔記錄并去重
如果單純地對(duì)多個(gè)關(guān)鍵詞搜到的同一個(gè)文檔去重, 而不記錄相關(guān)的關(guān)鍵字, 那么就無(wú)法得知此文檔是根據(jù)那些關(guān)鍵字搜索到的. 那么再去重的同時(shí), 還需要記錄對(duì)應(yīng)的關(guān)鍵詞

也就是說(shuō), unordered_map 存儲(chǔ)的元素類型不能是簡(jiǎn)單的ns_index::invertedElem, 因?yàn)?code>invertedElem沒有辦法很好的記錄多個(gè)關(guān)鍵詞

所以, 定義了一個(gè)結(jié)構(gòu)體:
```
typedef struct invertedElemOut {
    std::size_t _docId;
    std::uint64_t _weight;
    std::vector<std::string> _keywords;
} invertedElemOut_t;
```
成員依舊包括文檔id和權(quán)重, 但是第三個(gè)成員變量與invertedElem不同, invertedElemOut的第三個(gè)成員變量是vector<string>, 適合存儲(chǔ)多個(gè)關(guān)鍵字.
第三個(gè)要點(diǎn)就是: unordered_map中存儲(chǔ)的對(duì)應(yīng)此關(guān)鍵字的元素的權(quán)重, 需要+=當(dāng)前關(guān)鍵字的權(quán)重.

因?yàn)?多個(gè)關(guān)鍵詞指向了同一個(gè)文檔那么就說(shuō)明此文檔的與搜索內(nèi)容的相關(guān)性更高, 所以就可以將多個(gè)關(guān)鍵字關(guān)于此文檔的權(quán)重相加, 表示搜索相關(guān)性高

第一部分執(zhí)行完之后, 根據(jù)搜索內(nèi)容查找到的所有的文檔的相關(guān)信息, 都存儲(chǔ)在了invertedElemOutMap中.

接下來(lái)要做的, 并不是遍歷unordered_map獲取文檔id, 去正排索引中查找文檔的內(nèi)容. 而是需要先根據(jù)文檔的顯示權(quán)重進(jìn)行排序. 排完序之后, 再進(jìn)行文檔內(nèi)容的獲取.

因?yàn)? 獲取每到一個(gè)文檔內(nèi)容就需要將文檔內(nèi)容輸出了, 輸出之后就要做處理響應(yīng)回客戶端進(jìn)行顯示了. 這也意味著在正排索引中的查找順序實(shí)際就是搜索結(jié)果的顯示順序, 所以在查找之前, 需要先排序:

這里的實(shí)現(xiàn), 先使用vector存儲(chǔ)invertedElemOut元素. 為了方便排序

然后通過(guò)std::sort()+lambda進(jìn)行降序排序

這里需要注意一個(gè)細(xì)節(jié):

在向vector插入元素時(shí), 對(duì)invertedElemOutMap中存儲(chǔ)的元素執(zhí)行std::move()

也就是使用移動(dòng)語(yǔ)義, 防止發(fā)生拷貝構(gòu)造.

可以使用移動(dòng)語(yǔ)義的原因就是, 構(gòu)建完vector之后, invertedElemOutMap就沒用了, 不需要存儲(chǔ)元素.

執(zhí)行完這一部分代碼. 此次搜索到的所有的文檔id相關(guān)信息就按照顯示權(quán)重的降序被存儲(chǔ)到了 std::vector<invertedElemOut_t> allInvertedElemOut中.

接下來(lái), 就是根據(jù)文檔id相關(guān)信息在正排索引中查找文檔內(nèi)容信息了

這部分代碼, 實(shí)際就是搜索的最后一部分代碼了.

最后一部分的代碼其實(shí)相對(duì)簡(jiǎn)單, 只需要在正派索引中找到文檔的內(nèi)容信息, 然后序列化并存儲(chǔ)起來(lái)就可以了.

等獲取到全部的文檔內(nèi)容信息, 再將結(jié)果通過(guò)輸出型參數(shù)傳遞出去就可以了

對(duì)內(nèi)容做序列化處理, 需要用到jsoncpp.

在CentOS平臺(tái)下, 直接執(zhí)行sudo yum install -y jsoncpp-devel就可以安裝了

關(guān)于jsoncpp最基本的使用的相關(guān)介紹, 可以看一下這篇文章:

[Linux] 初識(shí)應(yīng)用層協(xié)議: 序列化與反序列化、編碼與解碼、jsoncpp簡(jiǎn)單食用…

這段代碼中, 唯一要注意的就是:

使用Json::Value root存儲(chǔ)Json::Value elem的方式, 在root存儲(chǔ)不同文檔的序列化內(nèi)容.

在之前的使用中, 只需要通過(guò)Json::Value變量序列化一個(gè)結(jié)構(gòu)體之后, 就可以將Json::Value的結(jié)果寫入string了.

而, 這里為什么要套兩層Json::Value呢?

因?yàn)? 這里傳輸?shù)牟恢皇且粋€(gè)結(jié)構(gòu)體變量的內(nèi)容, 而是 有很多個(gè)結(jié)構(gòu)體. 很多個(gè)同類型結(jié)構(gòu)體的內(nèi)容都需要序列化并存儲(chǔ)起來(lái), 很自然而然就可以想到要使用兩層結(jié)構(gòu). 并且還需要保證序列化, 所以就是用Json::Value嵌套的方式對(duì)不同的文檔內(nèi)容序列化并存儲(chǔ).

而 Json::Value也很好的支持了存儲(chǔ)Json::Value的接口, 就是Json::Value::append().

源碼中關(guān)于append()的聲明, 參數(shù)就是Json::Value&:

Value& Value::append(const Value& value) { return append(Value(value)); }

Value& Value::append(Value&& value) {
JSON_ASSERT_MESSAGE(type() == nullValue || type() == arrayValue,
                   "in Json::Value::append: requires arrayValue");
if (type() == nullValue) {
 *this = Value(arrayValue);
}
return this->value_.map_->emplace(size(), std::move(value)).first->second;
}

還有就是, elem中并不序列化存儲(chǔ)文檔的完整內(nèi)容, 而是存儲(chǔ)文檔的部分內(nèi)容.

所以就需要實(shí)現(xiàn)一個(gè)getDesc()接口

`getDesc()`摘要獲取接口實(shí)現(xiàn)

我們摘要獲取的思路非常簡(jiǎn)單, 就是在正文內(nèi)容中找到第一個(gè)關(guān)鍵詞的所在位置. 然后截取此位置的前50字節(jié)到此位置的后100字節(jié) 的內(nèi)容.

std::string getDesc(const std::string& content, const std::string& keyword) {
    // 如何獲取摘要呢?
    // 我們嘗試獲取正文中 第一個(gè)keyword 的前50個(gè)字節(jié)和后100個(gè)字節(jié)的內(nèi)容 作為摘要
    const std::size_t prevStep = 50;
    const std::size_t nextStep = 100;
    
   	// 獲取正文中 第一個(gè) keyword 的位置
    std::size_t pos = content.find(keyword);
    if (pos == std::string::npos)
        return "keyword does not exist!";

    std::size_t begin = 0;
    std::size_t end = content.size() - 1;

    // 獲取前50字節(jié) 和 后100字節(jié)的迭代器位置
    if (pos > begin + prevStep)
        begin += (pos - prevStep);
    if (pos + nextStep < end)
        end = pos + nextStep;

    if (begin >= end)
        return "nothing!";

    // 獲取摘要
    std::string desc;
	if (content.begin() + begin > content.begin())
		desc = "...";
	desc += content.substr(begin, end - begin);
	if (content.begin() + end < content.end())
		desc += "...";

    return desc;
}

演示及調(diào)試

上面已經(jīng)將所有搜索的相關(guān)接口都實(shí)現(xiàn)了.

下面我們通過(guò)一個(gè)簡(jiǎn)單的代碼調(diào)試一下:

#include <iostream>
#include "util.hpp"
#include "index.hpp"
#include "searcher.hpp"

const std::string& rawPath = "./data/output/raw";

int main() {
	ns_searcher::searcher searcher;
	searcher.initSearcher(rawPath);

	std::string query;
	std::string json_string;

	char buffer[1024];
	while (true) {
		std::cout << "Please Enter You Search Query# ";
		fgets(buffer, sizeof(buffer) - 1, stdin);
		buffer[strlen(buffer) - 1] = 0;
		query = buffer;
		searcher.search(query, &json_string);
		std::cout << json_string << std::endl;
	}

	return 0;
}

這段代碼可以把搜索到的內(nèi)容直接打印出來(lái).

我們演示一下:

首先是建立索引的過(guò)程:

然后就是搜索

從大體的結(jié)果上來(lái)看, 是沒什么問題的. 不僅可以搜索到, 而且是按照weight排序的

但是, 為什么desc會(huì)是keyword does not exist!?

搜到了文檔, 應(yīng)該就表示文檔中有這個(gè)關(guān)鍵詞. 但為什么會(huì)出現(xiàn)keyword does not exist!?

其實(shí)原因很簡(jiǎn)單: 我們通過(guò)關(guān)鍵詞在倒排索引中搜索, 都是通過(guò)全小寫來(lái)搜索的. 所以可以搜到文檔. 但是getDesc()獲取摘要的接口, 可并沒有實(shí)現(xiàn)通過(guò)小寫來(lái)查詢關(guān)鍵字. 這時(shí)候, 就有可能找不到全小寫的關(guān)鍵字, 也就無(wú)法獲取摘要.

所以, getDesc()接口在正文內(nèi)容中查找關(guān)鍵字的行為, 不能簡(jiǎn)單的使用string::find().

`getDesc()`接口優(yōu)化

不能使用string::find(), 并且 string也并沒有提供忽略大小寫搜索的接口

而且, 關(guān)鍵詞可以改為小寫, 但是也不能將正文內(nèi)容全部轉(zhuǎn)換成小寫呀.

那么, 在正文中如何忽略大小寫的查找關(guān)鍵詞呢?

std::search()接口. 可以通過(guò)仿函數(shù)來(lái)設(shè)置字符之間的查找方式:

std::string getDesc(const std::string& content, const std::string& keyword) {
    // 如何獲取摘要呢?
    // 我們嘗試獲取正文中 第一個(gè)keyword 的前50個(gè)字節(jié)和后100個(gè)字節(jié)的內(nèi)容 作為摘要
    const std::size_t prevStep = 50;
    const std::size_t nextStep = 100;
    // 獲取正文中 第一個(gè) keyword 的位置

    // std::size_t pos = content.find(keyword);
    // if (pos == std::string::npos)
	//     return "keyword does not exist!";
    // 直接這樣處理, 會(huì)出現(xiàn)一個(gè)問題:
    // keyword是有大小寫的. 倒排索引中查找 我們實(shí)現(xiàn)的是忽略大小寫, 所以可以找到文檔
    // 而 string::find() 是區(qū)分大小寫的查找, 可能無(wú)法在內(nèi)容中找到對(duì)應(yīng)的關(guān)鍵詞
    // string容器也沒有提供不區(qū)分大小寫的查找方法
    // 此時(shí), 可以用std::search()
    // std::search(it1, it2, it3, it4, pred);
    // 可以在[it1, it2)中 查找第一個(gè)[it3, it4)(詞語(yǔ))的出現(xiàn)位置.
    // 并且, 如果使用第5個(gè)參數(shù), 就可以傳入 帶有兩個(gè)參數(shù)的仿函數(shù), 這兩個(gè)參數(shù)就是需要比較的字符
    // 可以在仿函數(shù)內(nèi)設(shè)置這兩個(gè)字符的比較方式
    // 最終會(huì)返回找到的找到的單次第一個(gè)字符位置的迭代器, 否則返回it2
    auto iter = std::search(content.begin(), content.end(), keyword.begin(), keyword.end(),
                            [](int x, int y) {
                                return std::tolower(x) == std::tolower(y);
                            });
    if (iter == content.end())
        return "keyword does not exist!";
    std::size_t pos = std::distance(content.begin(), iter);

    std::size_t begin = 0;
    std::size_t end = content.size() - 1;

    // 獲取前50字節(jié) 和 后100字節(jié)的迭代器位置
    if (pos > begin + prevStep)
        begin += (pos - prevStep);
    if (pos + nextStep < end)
        end = pos + nextStep;

    if (begin >= end)
        return "nothing!";

    // 獲取摘要
    std::string desc;
    if (pos <= begin + prevStep)
        desc = "...";
    desc += content.substr(begin, end - begin);
    if (pos + nextStep < end)
        desc += "...";

    return desc;
}

使用std::search(it1, it2, it3, it4, pred);

可以在[it1, it2)中查找第一個(gè)[it3, it4)(詞語(yǔ))的出現(xiàn)位置.

并且, 如果使用第5個(gè)參數(shù), 就可以傳入 帶有兩個(gè)參數(shù)的仿函數(shù), 這兩個(gè)參數(shù)就是需要比較的字符 可以在仿函數(shù)內(nèi)設(shè)置這 兩個(gè)字符的比較方式

最終會(huì)返回找到的找到的單次第一個(gè)字符位置的迭代器, 否則返回it2

在仿函數(shù)內(nèi), 將參數(shù)字符都以小寫的形式比較, 就可以實(shí)現(xiàn)忽略大小寫比較:

這次, 就可以在文檔中找到關(guān)鍵詞了.

代碼實(shí)現(xiàn)到這里, 本地搜索的接口其實(shí)已經(jīng)相對(duì)完善了.

但是還并沒有結(jié)束

停用詞的處理 *

在項(xiàng)目中, 我們使用jieba庫(kù)針對(duì)搜索內(nèi)容和文檔內(nèi)容來(lái)分詞, 分別用來(lái)搜索和建立索引.

但是, 分詞時(shí)很可能會(huì)分出一些非常常見的詞, 比如中文的: 了 在 的 它 他 她 你… 還有英文的: a an the you it that this … 還有一些標(biāo)點(diǎn)符號(hào). 這部分詞被稱為 停用詞 或 停止詞 或 暫停詞

這些詞, 實(shí)際對(duì) 這種文檔的搜索是沒有什么用的. 而我們?cè)诜衷~的時(shí)候并沒有去除這些字, 這會(huì)導(dǎo)致什么結(jié)果呢?

搜索the a an都能搜出文檔, 但是我們輸入的并不是具有目的的有效內(nèi)容. 空格都能搜出文檔.

而, 我們的目的是防止用戶通過(guò)停用詞查找到了一些無(wú)關(guān)的文檔.

所以, 我們可以將這些 停用詞 在分詞之后, 去除掉.

怎么去除呢? jieba分詞庫(kù), 已經(jīng)提供了統(tǒng)計(jì)了常見的停用詞的文件:

內(nèi)容是這樣一行一行的:

我們只需要將文件的內(nèi)容按行以string的類型讀取到內(nèi)存中, 然后在分詞之后遍歷分詞進(jìn)行查找去除, 就可以實(shí)現(xiàn)去除分詞中的停用詞.

jieba提供的停用詞有些不適合被過(guò)濾掉, 有興趣可以自己整理一下

比如 about, 畢竟Boost庫(kù)文檔中的第一個(gè)文檔名就是about. 如果被當(dāng)作停用詞去掉了, 是不是有點(diǎn)不合適?

博主把 about any move 刪除掉. 因?yàn)?code>data/input目錄下存在以這三個(gè)單詞為名的文檔:

然后可以在util.hpp中的jiebaUtil類中添加一個(gè)去除停用詞的版本.

由于需要將停用詞從文件加載到內(nèi)存中, 而且只需要加載一次, 所以可以考慮將jiebaUtil設(shè)置為單例:

const char* const DICT_PATH = "./cppjiebaDict/jieba.dict.utf8";
const char* const HMM_PATH = "./cppjiebaDict/hmm_model.utf8";
const char* const USER_DICT_PATH = "./cppjiebaDict/user.dict.utf8";
const char* const IDF_PATH = "./cppjiebaDict/idf.utf8";
const char* const STOP_WORD_PATH = "./cppjiebaDict/stop_words.utf8";

class jiebaUtil {
    private:
    cppjieba::Jieba _jieba;
    std::unordered_map<std::string, bool> _stopKeywordMap;

    jiebaUtil()
        : _jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH) {}

    jiebaUtil(const jiebaUtil&) = delete;
    jiebaUtil& operator=(const jiebaUtil&) = delete;

    static jiebaUtil* _instance;

    private:
    // 主要是為了支持 消除停止詞的分詞
    // 也就是需要將停止詞, 寫入到 map中
    bool initJiebaUtil() {
        // 首先按行讀取文件 const char* const STOP_WORD_PATH = "./cppjiebaDict/stop_words.utf8"
        std::ifstream stopFile(STOP_WORD_PATH, std::ios::in);
        if (!stopFile.is_open()) {
            return false;
        }

        std::string line;
        while (std::getline(stopFile, line)) {
            _stopKeywordMap.insert({line, true});
        }

        stopFile.close();

        return true;
    }
    
    void noStopHelper(const std::string& src, std::vector<std::string>* out) {
        _jieba.CutForSearch(src, *out);
        // 遍歷out 查詢是否為停止詞 是則刪除
        // 需要注意迭代器失效的問題
        for (auto iter = out->begin(); iter != out->end();) {
            std::string word = *iter;
            boost::to_lower(word);
            // 這里要注意, 函數(shù)的第一個(gè)參數(shù) src 傳入的一般是文檔原文 或 搜索內(nèi)容的原文
            // 原文內(nèi)容都是區(qū)分大小寫的, 也就是說(shuō)這里的iter指向的分詞都是有大小寫之分的
            // 而jieba庫(kù)提供的停用詞都是小寫的, 也就是說(shuō)_stopKeywordMap內(nèi)存儲(chǔ)的內(nèi)容都是小寫的
            // 如果拿著有大小寫之分的分詞, 在停用詞表中查找, 是查找不到的. 
            // 所以在查找之前, 要先將iter指向的分詞 小寫化, 然后再在停用詞表中找
            auto stopIt = _stopKeywordMap.find(word);
            if (stopIt != _stopKeywordMap.end())
                // 注意接收erase的返回值 防止出現(xiàn)迭代器失效問題
                iter = out->erase(iter);
            else
                iter++;
        }
    }

    public:
    static jiebaUtil* getInstance() {
        static std::mutex mtx;
        if (nullptr == _instance) {
            mtx.lock();
            if (nullptr == _instance) {
                _instance = new jiebaUtil;
                _instance->initJiebaUtil(); // 初始化單例
            }
            mtx.unlock();
        }

        return _instance;
    }

    // 分詞: 不消除停止詞的版本
    void cutString(const std::string& src, std::vector<std::string>* out) {
        _jieba.CutForSearch(src, *out);
    }
    // 分詞: 消除停止詞的版本
    void cutStringNoStop(const std::string& src, std::vector<std::string>* out) {
        noStopHelper(src, out);
    }
};
jiebaUtil* jiebaUtil::_instance;

具體的實(shí)現(xiàn)思路是:

添加一個(gè)unordered_map<string, bool>成員對(duì)象, 用來(lái)記錄停用詞
定義一個(gè)initJiebaUtil()接口, 用于初始化jiebaUtil類. 實(shí)際做的是將停用詞加載到unordered_map中的工作
然后定義一個(gè)私有的noStopHelper()接口, 用于以消除暫停詞的方式分詞
然后提供一個(gè)公有的cutStringNoStop()接口, 封裝noStopHelper().
然后再實(shí)現(xiàn)線程安全的單例模式就好了

特別需要注意的一點(diǎn)是: 實(shí)現(xiàn)對(duì)分詞進(jìn)行去除停用詞的操作時(shí), 在對(duì)src分詞之后需要遍歷分詞并在停用詞表中查找是否為停用詞. 查找此分詞在停用詞表中查找是否存在時(shí), 必須要先將分詞小寫化. 因?yàn)橥Ｓ迷~表中的詞都是小寫的, 如果拿著有大小寫之分的詞, 去查全小寫的表, 會(huì)出現(xiàn)應(yīng)該找到但是卻沒有找到的情況.

并且, 將jiebaUtil設(shè)置為單例模式. 也就意味著之前調(diào)用分詞的接口需要修改一下. 不過(guò)先不急.

先來(lái)分析幾個(gè)問題:

分詞操作要在哪里做?

答: 搜索的時(shí)候, 對(duì)輸入的內(nèi)容分詞以及建立倒排索引的時(shí)候, 對(duì)文檔的內(nèi)容分詞
去除停用詞的分詞操作, 是否會(huì)消耗更長(zhǎng)的時(shí)間、更多的資源?

答: 肯定會(huì)的. 因?yàn)槿コＳ迷~的步驟, 說(shuō)到底就是遍歷分出來(lái)的詞并在停用詞的unordered_map中查找是否有當(dāng)前詞. 至少是一個(gè)O(N)的過(guò)程
搜索時(shí) 和建立索引時(shí), 是否都需要用到去除停用詞的分詞操作?

答案是, 不需要 都使用去除停用詞的分詞操作

這兩方, 只要有一方去除了停用詞. 那么在搜索時(shí), 就不會(huì)根據(jù)停用詞去搜索文檔. 那么也就分了兩種情況:
1. 搜索時(shí) 去除了停用詞, 建立索引時(shí) 沒有去除停用詞
  
  那么, 就只會(huì)使用有效詞搜索, 索引中是否存在停用詞的相關(guān)索引也就沒有關(guān)系
2. 搜索時(shí) 沒有去除停用詞, 建立索引時(shí) 去除了停用詞
  
  那么, 索引中就不會(huì)存在停用詞的相關(guān)索引, 就算使用停用詞去搜索, 也不會(huì)根據(jù)停用詞搜索到文檔.
這兩種情況, 有很大的區(qū)別. 我們知道, 去除停用詞是需要消耗資源的. 分詞越多, 用的時(shí)間就越久, 那么對(duì)于建立索引時(shí)的去除停用詞操作來(lái)說(shuō), 那將會(huì)是一個(gè)非常耗時(shí)的工程.

每一篇文檔內(nèi)容都可能分出上千甚至上萬(wàn)的詞. 如果對(duì)每篇文檔的分詞在進(jìn)行去除停用詞的操作. 那將會(huì)非常的耗時(shí).

那么:
1. 對(duì)于第一種情況. 搜索時(shí) 輸入的內(nèi)容絕大情況下是比文檔內(nèi)容少的. 雖然也會(huì)有一定的消耗, 但是沒有建立索引時(shí)消耗的大
  
  如果只在搜索時(shí), 對(duì)搜索分詞進(jìn)行去除停用詞. 而建立索引時(shí)不去除停用詞
  
  那么, 如果從全局的角度來(lái)看, 服務(wù)器就沒有非常巨大的消耗
2. 而對(duì)于第二種情況.
  
  如果在建立索引時(shí), 對(duì)每篇文章的內(nèi)容分詞去除停用詞. 就是一個(gè)非常耗時(shí)的工程.
  
  從全局的角度來(lái)看, 服務(wù)器會(huì)存在一段非常巨大的消耗
所以, 我們應(yīng)該選第2種情況嗎?

并不是的.

從用戶的效率來(lái)講, 最好選用第一種情況, 為什么?

因?yàn)槲覀兊乃阉饕?是給用戶提供服務(wù)的, 搜索的速度用戶可以感知到. 如果在搜索時(shí) 進(jìn)行去除停用詞的操作. 某些情況下, 可能會(huì)在一定程度上影響搜索的效率

而索引的建立, 是實(shí)現(xiàn)在服務(wù)器正式啟動(dòng)之前的. 這一部分的開銷再大, 用戶也是感知不到的.

所以, 我們這里選擇第1種實(shí)現(xiàn).

當(dāng)然, 情況的選擇不絕對(duì). 因?yàn)榫W(wǎng)絡(luò)上數(shù)據(jù)的傳輸情況非常的復(fù)雜. 可能傳輸?shù)臄?shù)據(jù)量也會(huì)很大程度上影響效率

就像一般的搜索引擎都會(huì)限制輸入長(zhǎng)度.

所以, ns_index::index 和 ns_searcher::searcher兩個(gè)類中, 關(guān)于分詞的實(shí)現(xiàn) 就需要變化一下:

`ns_index::index::buildInvertedIndex()`

// 關(guān)于分詞 使用 cppjieba 中文分詞庫(kù)
bool buildInvertedIndex(const docInfo_t& doc) {
    // 用來(lái)映射關(guān)鍵字 和 關(guān)鍵字的詞頻
    std::unordered_map<std::string, keywordCnt_t> keywordsMap;
    ns_util::jiebaUtil* jiebaIns = ns_util::jiebaUtil::getInstance();
    
    // 標(biāo)題分詞
    std::vector<std::string> titleKeywords;
    jiebaIns->cutStringNoStop(doc._title, &titleKeywords); // 去除停用詞分詞
    // ns_util::jiebaUtil::cutString(doc._title, &titleKeywords);
    // 標(biāo)題詞頻統(tǒng)計(jì) 與 轉(zhuǎn)換 記錄
    for (auto keyword : titleKeywords) {
        boost::to_lower(keyword);		  // 關(guān)鍵字轉(zhuǎn)小寫
        keywordsMap[keyword]._titleCnt++; // 記錄關(guān)鍵字 并統(tǒng)計(jì)標(biāo)題中詞頻
        // unordered_map 的 [], 是用來(lái)通過(guò)keyword值 訪問value的. 如果keyword值已經(jīng)存在, 則返回對(duì)應(yīng)的value, 如果keyword值不存在, 則會(huì)插入keyword并創(chuàng)建對(duì)應(yīng)的value
    }

    // 內(nèi)容分詞
    std::vector<std::string> contentKeywords;
    jiebaIns->cutStringNoStop(doc._content, &contentKeywords); // 去除停用詞分詞
    // ns_util::jiebaUtil::cutString(doc._content, &contentKeywords);
    // 內(nèi)容詞頻統(tǒng)計(jì) 與 轉(zhuǎn)換 記錄
    for (auto keyword : contentKeywords) {
        boost::to_lower(keyword);			// 關(guān)鍵字轉(zhuǎn)小寫
        keywordsMap[keyword]._contentCnt++; // 記錄關(guān)鍵字 并統(tǒng)計(jì)內(nèi)容中詞頻
    }

    // 這兩個(gè)const 變量是用來(lái)計(jì)算 關(guān)鍵字在文檔中的權(quán)重的.
    // 并且, 關(guān)鍵字出現(xiàn)在標(biāo)題中  文檔與關(guān)鍵字的相關(guān)性大概率是要高的, 所以 可以把titleWeight 設(shè)置的大一些
    const int titleWeight = 20;
    const int contentWeight = 1;
    // 分詞并統(tǒng)計(jì)詞頻之后, keywordsMap 中已經(jīng)存儲(chǔ)的當(dāng)前文檔的所有關(guān)鍵字, 以及對(duì)應(yīng)的在標(biāo)題 和 內(nèi)容中 出現(xiàn)的頻率
    // 就可以遍歷 keywordsMap 獲取關(guān)鍵字信息, 構(gòu)建 invertedElem 并添加到 invertedIndex中 關(guān)鍵詞的倒排拉鏈 invertedList中了
    for (auto& keywordInfo : keywordsMap) {
        invertedElem_t item;
        item._docId = doc._docId;		   // 本文檔id
        item._keyword = keywordInfo.first; // 關(guān)鍵字
        item._weight = keywordInfo.second._titleCnt * titleWeight + keywordInfo.second._contentCnt * contentWeight;

        // 上面構(gòu)建好了 invertedElem, 下面就要將 invertedElem 添加到對(duì)應(yīng)關(guān)鍵字的 倒排拉鏈中, 構(gòu)建倒排索引
        invertedList_t& list = invertedIndex[keywordInfo.first]; // 獲取關(guān)鍵字對(duì)應(yīng)的倒排拉鏈
        list.push_back(std::move(item));
    }

    return true;
}

`ns_searcher::searcher::search()`

void search(const std::string& query, std::string* jsonString) {
    // 1. 對(duì)需要搜索的句子或關(guān)鍵詞進(jìn)行分詞
    std::vector<std::string> keywords;
    ns_util::jiebaUtil* jiebaIns = ns_util::jiebaUtil::getInstance();

    jiebaIns->cutString(query, &keywords); // 不去除停用詞分詞
    // ns_util::jiebaUtil::cutString(query, &keywords);
    
    // 統(tǒng)計(jì)文檔用, 因?yàn)榭赡艽嬖诓煌姆衷~ 在倒排索引中指向同一個(gè)文檔的情況
    // 如果不去重, 會(huì)重復(fù)展示
    std::unordered_map<std::size_t, invertedElemOut_t> invertedElemOutMap;
    // 2. 根據(jù)分詞獲取倒排索引中的倒排拉鏈, 并匯總?cè)ブ?invertedElem
    for (std::string word : keywords) {
        boost::to_lower(word);

        ns_index::invertedList_t* tmpInvertedList = _index->getInvertedList(word);
        if (nullptr == tmpInvertedList) {
            // 沒有這個(gè)關(guān)鍵詞
            continue;
        }

        for (auto& elem : *tmpInvertedList) {
            // 遍歷倒排拉鏈, 根據(jù)文檔id 對(duì)invertedElem 去重
            auto& item = invertedElemOutMap[elem._docId]; // 在map中獲取 或 創(chuàng)建對(duì)應(yīng)文檔id的 invertedElem
            item._docId = elem._docId;
            item._weight += elem._weight;
            item._keywords.push_back(elem._keyword);
            // 此時(shí)就將當(dāng)前invertedElem 去重到了 invertedElemMap 中
        }
    }

    // vector 存儲(chǔ) 文檔相關(guān)信息, 方便排序
    std::vector<invertedElemOut_t> allInvertedElemOut;
    // 出循環(huán)之后, 就將搜索到的 文檔的 id、權(quán)重和相關(guān)關(guān)鍵詞 存儲(chǔ)到了 invertedElemMap
    // 然后將文檔的相關(guān)信息 invertedElemOut 都存儲(chǔ)到 vector 中
    for (const auto& elemOut : invertedElemOutMap) {
        // map中的second: elemOut, 在執(zhí)行此操作之后, 就沒用了
        // 所以使用移動(dòng)語(yǔ)義, 防止發(fā)生拷貝
        allInvertedElemOut.push_back(std::move(elemOut.second));
    }

    std::sort(allInvertedElemOut.begin(), allInvertedElemOut.end(),
              [](const invertedElemOut_t& elem1, const invertedElemOut_t& elem2) {
                  return elem1._weight > elem2._weight;
              });

    // 然后 通過(guò)遍歷此數(shù)組, 獲取文檔id, 根據(jù)id獲取文檔在正排索引中的內(nèi)容
    // 然后再將 所有內(nèi)容序列化
    Json::Value root;
    for (auto& elemOut : allInvertedElemOut) {
        // 通過(guò)Json::Value 對(duì)象, 存儲(chǔ)文檔內(nèi)容
        Json::Value elem;
        // 通過(guò)elemOut._docId 獲取正排索引中 文檔的內(nèi)容信息
        ns_index::docInfo_t* doc = _index->getForwardIndex(elemOut._docId);
        // elem賦值
        elem["url"] = doc->_url;
        elem["title"] = doc->_title;
        // 關(guān)于文檔的內(nèi)容, 搜索結(jié)果中是不展示文檔的全部?jī)?nèi)容的, 應(yīng)該只顯示包含關(guān)鍵詞的摘要, 點(diǎn)進(jìn)文檔才顯示相關(guān)內(nèi)容
        // 而docInfo中存儲(chǔ)的是文檔去除標(biāo)簽之后的所有內(nèi)容, 所以不能直接將 doc._content 存儲(chǔ)到elem對(duì)應(yīng)key:value中
        elem["desc"] = getDesc(doc->_content, elemOut._keywords[0]); // 只根據(jù)第一個(gè)關(guān)鍵詞來(lái)獲取摘要
        // for Debug
        // 這里有一個(gè)bug, jsoncpp 0.10.5.2 是不支持long或long long 相關(guān)類型的, 所以需要轉(zhuǎn)換成 double
        // 這里轉(zhuǎn)換成 double不會(huì)有什么影響, 因?yàn)檫@兩個(gè)參數(shù)只是本地調(diào)試顯示用的.
        elem["docId"] = (double)doc->_docId;
        elem["weight"] = (double)elemOut._weight;

        root.append(elem);
    }

    // 序列化完成之后將相關(guān)內(nèi)容寫入字符串
    // for Debug 用 styledWriter
    Json::StyledWriter writer;
    *jsonString = writer.write(root);
}

結(jié)果演示

我們選擇的這種方式, 會(huì)將建立索引的時(shí)長(zhǎng)拉的很長(zhǎng), 最起碼比之前要長(zhǎng)的多:

然后就可以進(jìn)行搜索了:

項(xiàng)目當(dāng)前目錄結(jié)構(gòu)

Boost文檔搜索引擎庫(kù)這個(gè)項(xiàng)目, 當(dāng)前已經(jīng)實(shí)現(xiàn)了:

parser文檔內(nèi)容處理模塊
index索引建立相關(guān)接口
searcher搜索實(shí)現(xiàn)相關(guān)接口

當(dāng)前項(xiàng)目目錄結(jié)構(gòu)為:

? pwd
/home/July/gitCode/gitHub/Boost-Doc-Searcher
? tree -L 3
.
├── cppjieba
│   ├── DictTrie.hpp
│   ├── ...(jieba庫(kù)相關(guān)頭文件)
│   └── Unicode.hpp
├── cppjiebaDict
│   ├── hmm_model.utf8
│   ├── ...(jieba庫(kù)提供的分詞庫(kù))
│   └── user.dict.utf8
├── data
│   ├── input
│   │   ├── about.html
│   │   ├── ...(Boost庫(kù)文檔文件)
│   │   └── yap.html
│   └── output
│       └── raw
├── index.hpp
├── LICENSE
├── makefile
├── parser.cc
├── README.md
├── searcher.hpp
├── serverDebug.cc
└── util.hpp

63 directories, 279 files

索引接口以及搜索接口相關(guān)代碼整合

當(dāng)前, util.hpp index.hpp 和 searcher.hpp 的代碼:

util.hpp:

// util.hpp 一般定義一些通用的宏定義、工具函數(shù)等

#pragma once

#include <boost/algorithm/string/case_conv.hpp>
#include <iostream>
#include <unordered_map>
#include <vector>
#include <string>
#include <fstream>
#include <mutex>
#include <boost/algorithm/string.hpp>
#include "cppjieba/Jieba.hpp"

namespace ns_util {
    class fileUtil {
    public:
        // readFile 用于讀取指定文本文件的內(nèi)容, 到string輸出型參數(shù)中
        static bool readFile(const std::string& filePath, std::string* out) {
            // 要讀取文件內(nèi)容, 就要先打開文件
            // 1. 以讀取模式打開文件
            std::ifstream in(filePath, std::ios::in);
            if (!in.is_open()) {
                // 打卡文件失敗
                std::cerr << "Failed to open " << filePath << "!" << std::endl;
                return false;
            }

            // 走到這里打開文件成功
            // 2. 讀取文件內(nèi), 并存儲(chǔ)到out中
            std::string line;
            while (std::getline(in, line)) {
                *out += line;
            }

            in.close();

            return true;
        }
    };

    class stringUtil {
    public:
        static bool split(const std::string& file, std::vector<std::string>* fileResult, const std::string& sep) {
            // 使用 boost庫(kù)中的split接口, 可以將 string 以指定的分割符分割, 并存儲(chǔ)到vector<string>輸出型參數(shù)中
            boost::split(*fileResult, file, boost::is_any_of(sep), boost::algorithm::token_compress_on);
            // boost::algorithm::token_compress_on 表示壓縮連續(xù)的分割符

            if (fileResult->empty()) {
                return false;
            }

            return true;
        }
    };

    const char* const DICT_PATH = "./cppjiebaDict/jieba.dict.utf8";
    const char* const HMM_PATH = "./cppjiebaDict/hmm_model.utf8";
    const char* const USER_DICT_PATH = "./cppjiebaDict/user.dict.utf8";
    const char* const IDF_PATH = "./cppjiebaDict/idf.utf8";
    const char* const STOP_WORD_PATH = "./cppjiebaDict/stop_words.utf8";

    class jiebaUtil {
    private:
        cppjieba::Jieba _jieba;
        std::unordered_map<std::string, bool> _stopKeywordMap;

        jiebaUtil()
            : _jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH) {}

        jiebaUtil(const jiebaUtil&) = delete;
        jiebaUtil& operator=(const jiebaUtil&) = delete;

        static jiebaUtil* _instance;

    private:
        void noStopHelper(const std::string& src, std::vector<std::string>* out) {
            _jieba.CutForSearch(src, *out);
            // 遍歷out 查詢是否為停止詞 是則刪除
            // 需要注意迭代器失效的問題
            for (auto iter = out->begin(); iter != out->end();) {
                std::string word = *iter;
                boost::to_lower(word);
                auto stopIt = _stopKeywordMap.find(word);
                // auto stopIt = _stopKeywordMap.find(*iter);
                if (stopIt != _stopKeywordMap.end()) {
                    // 注意接收erase的返回值 防止出現(xiàn)迭代器失效問題
                    iter = out->erase(iter);
                }
                else {
                    iter++;
                }
            }
        }

        // 主要是為了支持 消除停止詞的分詞
        // 也就是需要將停止詞, 寫入到 map中
        bool initJiebaUtil() {
            // 首先按行讀取文件 const char* const STOP_WORD_PATH = "./cppjiebaDict/stop_words.utf8"
            std::ifstream stopFile(STOP_WORD_PATH, std::ios::in);
            if (!stopFile.is_open()) {
                return false;
            }

            std::string line;
            while (std::getline(stopFile, line)) {
                _stopKeywordMap.insert({line, true});
            }

            stopFile.close();

            return true;
        }

    public:
        static jiebaUtil* getInstance() {
            static std::mutex mtx;
            if (nullptr == _instance) {
                mtx.lock();
                if (nullptr == _instance) {
                    _instance = new jiebaUtil;
                    _instance->initJiebaUtil();
                }
                mtx.unlock();
            }

            return _instance;
        }

        // 分詞: 不消除停止詞的版本
        void cutString(const std::string& src, std::vector<std::string>* out) {
            _jieba.CutForSearch(src, *out);
        }
        // 分詞: 消除停止詞的版本
        void cutStringNoStop(const std::string& src, std::vector<std::string>* out) {
            noStopHelper(src, out);
        }
    };
    jiebaUtil* jiebaUtil::_instance;
    // cppjieba::Jieba jiebaUtil::jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH);
}

index.hpp:

// 本代碼是 建立索引相關(guān)的接口
// 索引 是用來(lái)快速搜索的
// parser模塊, 已經(jīng)將所有文檔內(nèi)容處理好, 并存儲(chǔ)到了 data/output/raw 中
// 索引的建立, 就是通過(guò)獲取 已經(jīng)處理好的文檔內(nèi)容 來(lái)建立的
// 項(xiàng)目中, 需要分別建立正排索引和倒排索引
// 正排索引, 是從文檔id 找到文件內(nèi)容的索引
// 倒排索引, 是從關(guān)鍵詞 找到關(guān)鍵詞所在文檔id 的索引

// 首先第一個(gè)問題:
// 正排索引中 文件內(nèi)容該如何表示?
// 其實(shí)在parser模塊中, 已經(jīng)有過(guò)相關(guān)的處理了, 即用結(jié)構(gòu)體(docInfo) 成員為: title、content、url
// 不過(guò), 在建立索引時(shí), 文檔在索引中 應(yīng)該存在一個(gè)文檔id.

// 正排索引結(jié)構(gòu)
// 正排索引 可以通過(guò)文檔id找到文件內(nèi)容. 那么 正排索引可以用 vector 建立, vector 存儲(chǔ)docInfo結(jié)構(gòu)體 那么數(shù)組下標(biāo)就天然是 文檔id

// 倒排索引結(jié)構(gòu)
// 倒排索引 需要通過(guò)關(guān)鍵字 找到包含關(guān)鍵字的文檔id, 文檔id 對(duì)應(yīng)正排索引中的下標(biāo), 所以需要先建立正排索引, 再建立倒排索引
// 由于可能多個(gè)文檔包含相同的關(guān)鍵字, 倒排索引更適合 keyword:value 結(jié)構(gòu)存儲(chǔ). 所以 可以使用 unordered_map
// 并且, 同樣因?yàn)殛P(guān)鍵字可能找到多個(gè)文檔, value的類型就 可以是存儲(chǔ)著文檔id的vector, 稱為倒排拉鏈

// 倒排索引中, 通過(guò)關(guān)鍵字找到的 倒排拉鏈中 不應(yīng)該僅僅是文檔id的數(shù)據(jù).
// 因?yàn)榈古潘饕牟檎医Y(jié)果是關(guān)乎到查找結(jié)果的顯示順序的. 所以 還需要知道對(duì)應(yīng)文檔id 在本次搜索的權(quán)重.
// 所以, 最好將文檔id和權(quán)重結(jié)合起來(lái), 構(gòu)成一個(gè)結(jié)構(gòu)體(invertedElem)存儲(chǔ).
// 不過(guò), 不需要 先將所有文檔的正排索引建立完成之后 再建立倒排索引. 可以先給 某文檔建立正排索引之后, 直接對(duì)此文檔建立倒排索引

#pragma once

#include <iostream>
#include <fstream>
#include <utility>
#include <vector>
#include <string>
#include <unordered_map>
#include <mutex>
#include "util.hpp"

namespace ns_index {

    // 用于正排索引中 存儲(chǔ)文檔內(nèi)容
    typedef struct docInfo {
        std::string _title;   // 文檔標(biāo)題
        std::string _content; // 文檔去標(biāo)簽之后的內(nèi)容
        std::string _url;     // 文檔對(duì)應(yīng)官網(wǎng)url
        std::size_t _docId;   // 文檔id
    } docInfo_t;

    // 用于倒排索引中 記錄關(guān)鍵字對(duì)應(yīng)的文檔id和權(quán)重
    typedef struct invertedElem {
        std::size_t _docId;    // 文檔id
        std::string _keyword;  // 關(guān)鍵字
        std::uint64_t _weight; // 搜索此關(guān)鍵字, 此文檔id 所占權(quán)重

        invertedElem() // 權(quán)重初始化為0
            : _weight(0) {}
    } invertedElem_t;

    // 關(guān)鍵字的詞頻
    typedef struct keywordCnt {
        std::size_t _titleCnt;   // 關(guān)鍵字在標(biāo)題中出現(xiàn)的次數(shù)
        std::size_t _contentCnt; // 關(guān)鍵字在內(nèi)容中出現(xiàn)的次數(shù)

        keywordCnt()
            : _titleCnt(0)
            , _contentCnt(0) {}
    } keywordCnt_t;

    // 倒排拉鏈
    typedef std::vector<invertedElem_t> invertedList_t;

    class index {
    private:
        // 正排索引使用vector, 下標(biāo)天然是 文檔id
        std::vector<docInfo_t> forwardIndex;
        // 倒排索引 使用 哈希表, 因?yàn)榈古潘饕?一定是 一個(gè)keyword 對(duì)應(yīng)一組 invertedElem拉鏈
        std::unordered_map<std::string, invertedList_t> invertedIndex;

        // 單例模式設(shè)計(jì)
        index() {}

        index(const index&) = delete;
        index& operator=(const index&) = delete;

        static index* _instance; // 單例
        static std::mutex _mtx;

    public:
        // 獲取單例
        static index* getInstance() {
            if (nullptr == _instance) {
                _mtx.lock();
                if (nullptr == _instance) {
                    _instance = new index;
                }
                _mtx.unlock();
            }

            return _instance;
        }

        // 通過(guò)關(guān)鍵字 檢索倒排索引, 獲取對(duì)應(yīng)的 倒排拉鏈
        invertedList_t* getInvertedList(const std::string& keyword) {
            // 先找 關(guān)鍵字 所在迭代器
            auto iter = invertedIndex.find(keyword);
            if (iter == invertedIndex.end()) {
                std::cerr << keyword << " have no invertedList!" << std::endl;
                return nullptr;
            }

            // 找到之后
            return &(iter->second);
        }

        // 通過(guò)倒排拉鏈中 每個(gè)倒排元素中存儲(chǔ)的 文檔id, 檢索正排索引, 獲取對(duì)應(yīng)文檔內(nèi)容
        docInfo_t* getForwardIndex(std::size_t docId) {
            if (docId >= forwardIndex.size()) {
                std::cerr << "docId out range, error!" << std::endl;
                return nullptr;
            }

            return &forwardIndex[docId];
        }

        // 根據(jù)parser模塊處理過(guò)的 所有文檔的信息
        // 提取文檔信息, 建立 正排索引和倒排索引
        // input 為 ./data/output/raw
        bool buildIndex(const std::string& input) {
            // 先以讀取方式打開文件
            std::ifstream in(input, std::ios::in);
            if (!in.is_open()) {
                std::cerr << "Failed to open " << input << std::endl;
                return false;
            }

            std::size_t count = 0;

            std::string line;
            while (std::getline(in, line)) {
                // 按照parser模塊的處理, getline 一次讀取到的數(shù)據(jù), 就是一個(gè)文檔的: title\3content\3url\n
                docInfo_t* doc = buildForwardIndex(line); // 將一個(gè)文檔的數(shù)據(jù) 建立到索引中
                if (nullptr == doc) {
                    std::cerr << "Failed to buildForwardIndex for " << line << std::endl;
                    continue;
                }

                // 文檔建立正排索引成功, 接著就通過(guò) doc 建立倒排索引
                if (!buildInvertedIndex(*doc)) {
                    std::cerr << "Failed to buildInvertedIndex for " << line << std::endl;
                    continue;
                }

                count++;
                if (count % 50 == 0)
                    std::cout << "當(dāng)前已經(jīng)建立的索引文檔: " << count << std::endl;
            }

            return true;
        }

    private:
        // 對(duì)一個(gè)文檔建立正排索引
        docInfo_t* buildForwardIndex(const std::string& file) {
            // 一個(gè)文檔的 正排索引的建立, 是將 title\3content\3url (file) 中title content url 提取出來(lái)
            // 構(gòu)成一個(gè) docInfo_t doc
            // 然后將 doc 存儲(chǔ)到正排索引vector中
            std::vector<std::string> fileResult;
            const std::string sep("\3");
            // stringUtil::split() 字符串通用工具接口, 分割字符串
            ns_util::stringUtil::split(file, &fileResult, sep);

            docInfo_t doc;
            doc._title = fileResult[0];
            doc._content = fileResult[1];
            doc._url = fileResult[2];

            // 因?yàn)閐oc是需要存儲(chǔ)到 forwardIndex中的, 存儲(chǔ)之前 forwardIndex的size 就是存儲(chǔ)之后 doc所在的位置
            doc._docId = forwardIndex.size();

            forwardIndex.push_back(std::move(doc));

            return &forwardIndex.back();
        }

        // 對(duì)一個(gè)文檔建立倒排索引
        // 倒排索引是用來(lái)通過(guò)關(guān)鍵詞定位文檔的.
        // 倒排索引的結(jié)構(gòu)是 std::unordered_map<std::string, invertedList_t> invertedIndex;
        // keyword值就是關(guān)鍵字, value值則是關(guān)鍵字所映射到的文檔的倒排拉鏈
        // 對(duì)一個(gè)文檔建立倒排索引的原理是:
        //  1. 首先對(duì)文檔的標(biāo)題 和 內(nèi)容進(jìn)行分詞, 并記錄分詞
        //  2. 分別統(tǒng)計(jì)整理標(biāo)題分析的詞頻 和 內(nèi)容分詞的詞頻
        //     統(tǒng)計(jì)詞頻是為了可以大概表示關(guān)鍵字在文檔中的 相關(guān)性.
        //     在本項(xiàng)目中, 可以簡(jiǎn)單的認(rèn)為關(guān)鍵詞在文檔中出現(xiàn)的頻率, 代表了此文檔內(nèi)容與關(guān)鍵詞的相關(guān)性. 當(dāng)然這是非常膚淺的聯(lián)系, 一般來(lái)說(shuō)相關(guān)性的判斷都是非常復(fù)雜的. 因?yàn)樯婕暗皆~義 語(yǔ)義等相關(guān)分析.
        //     每個(gè)關(guān)鍵字 在標(biāo)題中出現(xiàn)的頻率 和 在內(nèi)容中出現(xiàn)的頻率, 可以記錄在一個(gè)結(jié)構(gòu)體中. 此結(jié)構(gòu)體就表示關(guān)鍵字的詞頻
        //  3. 使用 unordered_map<std::string, wordCnt_t> 記錄關(guān)鍵字與其詞頻
        //  4. 通過(guò)遍歷記錄關(guān)鍵字與詞頻的 unordered_map, 構(gòu)建 invertedElem: _docId, _keyword, _weight
        //  5. 構(gòu)建了關(guān)鍵字的invertedElem 之后, 再將關(guān)鍵詞的invertedElem 添加到在 invertedIndex中 關(guān)鍵詞的倒排拉鏈 invertedList中
        // 注意, 搜索引擎一般不區(qū)分大小寫, 所以可以將分詞出來(lái)的所有的關(guān)鍵字, 在倒排索引中均以小寫的形式映射. 在搜索時(shí) 同樣將搜索請(qǐng)求分詞出的關(guān)鍵字小 寫化, 在進(jìn)行檢索. 就可以實(shí)現(xiàn)搜索不區(qū)分大小寫.

        // 關(guān)于分詞 使用 cppjieba 中文分詞庫(kù)
        bool buildInvertedIndex(const docInfo_t& doc) {
            // 用來(lái)映射關(guān)鍵字 和 關(guān)鍵字的詞頻
            std::unordered_map<std::string, keywordCnt_t> keywordsMap;
            ns_util::jiebaUtil* jiebaIns = ns_util::jiebaUtil::getInstance();

            // 標(biāo)題分詞
            std::vector<std::string> titleKeywords;
            jiebaIns->cutStringNoStop(doc._title, &titleKeywords);
            // jiebaIns->cutString(doc._title, &titleKeywords);
            // 標(biāo)題詞頻統(tǒng)計(jì) 與 轉(zhuǎn)換 記錄
            for (auto keyword : titleKeywords) {
                boost::to_lower(keyword);         // 關(guān)鍵字轉(zhuǎn)小寫
                keywordsMap[keyword]._titleCnt++; // 記錄關(guān)鍵字 并統(tǒng)計(jì)標(biāo)題中詞頻
                                                  // unordered_map 的 [], 是用來(lái)通過(guò)keyword值 訪問value的. 如果keyword值已經(jīng)存在, 則返回對(duì)應(yīng)的value, 如果keyword值不存在, 則會(huì)插入keyword并創(chuàng)建對(duì)應(yīng)的value
            }

            // 內(nèi)容分詞
            std::vector<std::string> contentKeywords;
            jiebaIns->cutStringNoStop(doc._content, &contentKeywords);
            // jiebaIns->cutString(doc._content, &contentKeywords);
            // 內(nèi)容詞頻統(tǒng)計(jì) 與 轉(zhuǎn)換 記錄
            for (auto keyword : contentKeywords) {
                boost::to_lower(keyword);           // 關(guān)鍵字轉(zhuǎn)小寫
                keywordsMap[keyword]._contentCnt++; // 記錄關(guān)鍵字 并統(tǒng)計(jì)內(nèi)容中詞頻
            }

            // 這兩個(gè)const 變量是用來(lái)計(jì)算 關(guān)鍵字在文檔中的權(quán)重的.
            // 并且, 關(guān)鍵字出現(xiàn)在標(biāo)題中  文檔與關(guān)鍵字的相關(guān)性大概率是要高的, 所以 可以把titleWeight 設(shè)置的大一些
            const int titleWeight = 20;
            const int contentWeight = 1;
            // 分詞并統(tǒng)計(jì)詞頻之后, keywordsMap 中已經(jīng)存儲(chǔ)的當(dāng)前文檔的所有關(guān)鍵字, 以及對(duì)應(yīng)的在標(biāo)題 和 內(nèi)容中 出現(xiàn)的頻率
            // 就可以遍歷 keywordsMap 獲取關(guān)鍵字信息, 構(gòu)建 invertedElem 并添加到 invertedIndex中 關(guān)鍵詞的倒排拉鏈 invertedList中了
            for (auto& keywordInfo : keywordsMap) {
                invertedElem_t item;
                item._docId = doc._docId;          // 本文檔id
                item._keyword = keywordInfo.first; // 關(guān)鍵字
                item._weight = keywordInfo.second._titleCnt * titleWeight + keywordInfo.second._contentCnt * contentWeight;

                // 上面構(gòu)建好了 invertedElem, 下面就要將 invertedElem 添加到對(duì)應(yīng)關(guān)鍵字的 倒排拉鏈中, 構(gòu)建倒排索引
                invertedList_t& list = invertedIndex[keywordInfo.first]; // 獲取關(guān)鍵字對(duì)應(yīng)的倒排拉鏈
                list.push_back(std::move(item));
            }

            return true;
        }
    };
    // 單例相關(guān)
    index* index::_instance = nullptr;
    std::mutex index::_mtx;
}

searcher.hpp:

// 本文件實(shí)現(xiàn) 搜索相關(guān)接口

// 本項(xiàng)目中的搜索, 是根據(jù)輸入的關(guān)鍵詞:
//  1. 先對(duì)關(guān)鍵詞進(jìn)行分詞
//  2. 然后通過(guò)分詞, 在倒排索引中進(jìn)行檢索, 檢索到相關(guān)的倒排拉鏈
//  3. 然后再通過(guò)倒排拉鏈中 倒排元素的對(duì)應(yīng)文檔id, 在正排索引中獲取文件內(nèi)容

// 不過(guò)在正式開始搜索之前, 要先構(gòu)建索引
// 而索引的構(gòu)建, 在整個(gè)程序中只需要構(gòu)建一次, 所以可以將索引設(shè)計(jì)為單例模式
#pragma once

#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <unordered_map>
#include <algorithm>
#include <boost/algorithm/string.hpp>
#include <jsoncpp/json/json.h>
#include "util.hpp"
#include "index.hpp"

namespace ns_searcher {
    typedef struct invertedElemOut {
        std::size_t _docId;
        std::uint64_t _weight;
        std::vector<std::string> _keywords;
    } invertedElemOut_t;

    class searcher {
    private:
        ns_index::index* _index; // 建立索引的類

        ns_util::jiebaUtil* _jiebaIns;

    public:
        void initSearcher(const std::string& input) {
            // 搜索前的初始化操作
            // 獲取單例
            _index = ns_index::index::getInstance();
            _jiebaIns = ns_util::jiebaUtil::getInstance();

            std::cout << "獲取單例成功 ..." << std::endl;
            // 建立索引
            _index->buildIndex(input);
            std::cout << "構(gòu)建正排索引、倒排索引成功 ..." << std::endl;
        }

        // 搜索接口
        // 搜索需要實(shí)現(xiàn)什么功能?
        // 首先參數(shù)部分需要怎么實(shí)現(xiàn)?
        // 參數(shù)部分, 需要接收需要搜索的句子或關(guān)鍵字, 還需要一個(gè)輸出型參數(shù) 用于輸出查找結(jié)果
        //  查找結(jié)果我們使用jsoncpp進(jìn)行序列化和反序列化
        // search() 具體需要實(shí)現(xiàn)的功能:
        //  1. 對(duì)接收的句子或關(guān)鍵詞進(jìn)行分詞
        //  2. 根據(jù)分詞, 在倒排索引中查找到所有分詞的倒排拉鏈 匯總 的 invertedElem, 并根據(jù)相關(guān)性進(jìn)行排序
        //  4. 然后再遍歷所有的 invertedElem, 根據(jù) invertedElem中存儲(chǔ)的 文檔id, 在正排索引中獲取到文檔內(nèi)容
        //  5. 然后將獲取到的文檔內(nèi)容使用jsoncpp 進(jìn)行序列化, 存儲(chǔ)到輸出型參數(shù)中
        // 直到遍歷完invertedElem
        void search(const std::string& query, std::string* jsonString) {
            // 1. 對(duì)需要搜索的句子或關(guān)鍵詞進(jìn)行分詞
            std::vector<std::string> keywords;

            _jiebaIns->cutString(query, &keywords);
            // _jiebaIns->cutStringNoStop(query, &keywords);
            // ns_util::jiebaUtil::cutString(query, &keywords);

            // std::vector<invertedElemOut_t> allInvertedElemOut;
            // std::vector<ns_index::invertedElem_t> allInvertedElem;

            // 統(tǒng)計(jì)文檔用, 因?yàn)榭赡艽嬖诓煌姆衷~ 在倒排索引中指向同一個(gè)文檔的情況
            // 如果不去重, 會(huì)重復(fù)展示
            // std::unordered_map<std::size_t, ns_index::invertedElem_t> invertedElemMap;
            std::unordered_map<std::size_t, invertedElemOut_t> invertedElemOutMap;
            // 2. 根據(jù)分詞獲取倒排索引中的倒排拉鏈, 并匯總?cè)ブ?invertedElem
            for (std::string word : keywords) {
                boost::to_lower(word);

                ns_index::invertedList_t* tmpInvertedList = _index->getInvertedList(word);
                if (nullptr == tmpInvertedList) {
                    // 沒有這個(gè)關(guān)鍵詞
                    continue;
                }

                for (auto& elem : *tmpInvertedList) {
                    // 遍歷倒排拉鏈, 根據(jù)文檔id 對(duì)invertedElem 去重
                    auto& item = invertedElemOutMap[elem._docId]; // 在map中獲取 或 創(chuàng)建對(duì)應(yīng)文檔id的 invertedElem
                    item._docId = elem._docId;
                    item._weight += elem._weight;
                    // 權(quán)重需要+= 是因?yàn)槎鄠€(gè)關(guān)鍵詞指向了同一個(gè)文檔 那么就說(shuō)明此文檔的與搜索內(nèi)容的相關(guān)性更高
                    // 所以, 就可以將多個(gè)關(guān)鍵字關(guān)于此文檔的權(quán)重相加, 表示搜索相關(guān)性高
                    // 最好還將 此文檔相關(guān)的關(guān)鍵詞 也存儲(chǔ)起來(lái), 因?yàn)樵诳蛻舳怂阉鹘Y(jié)果中, 需要對(duì)網(wǎng)頁(yè)中有的關(guān)鍵字進(jìn)行高亮
                    // 但是 invertedElem 的第三個(gè)成員是 單獨(dú)的一個(gè)string對(duì)象, 不太合適
                    // 所以, 可以定義一個(gè)與invertedElem 相似的, 但是第三個(gè)成員是一個(gè) vector 的類, 比如 invertedElemOut
                    item._keywords.push_back(elem._keyword);
                    // 此時(shí)就將當(dāng)前invertedElem 去重到了 invertedElemMap 中
                }
            }

            // vector 存儲(chǔ) 文檔相關(guān)信息, 方便排序
            std::vector<invertedElemOut_t> allInvertedElemOut;
            // 出循環(huán)之后, 就將搜索到的 文檔的 id、權(quán)重和相關(guān)關(guān)鍵詞 存儲(chǔ)到了 invertedElemMap
            // 然后將文檔的相關(guān)信息 invertedElemOut 都存儲(chǔ)到 vector 中
            for (const auto& elemOut : invertedElemOutMap) {
                // map中的second: elemOut, 在執(zhí)行此操作之后, 就沒用了
                // 所以使用移動(dòng)語(yǔ)義, 防止發(fā)生拷貝
                allInvertedElemOut.push_back(std::move(elemOut.second));
            }

            // 執(zhí)行到這里, 可以搜索到的文檔id 權(quán)重 和 相關(guān)關(guān)鍵詞的信息, 已經(jīng)都在allInvertedElemOut 中了.
            // 但是, 還不能直接 根據(jù)文檔id 在正排索引中檢索
            // 因?yàn)? 此時(shí)如果直接進(jìn)行文檔內(nèi)容的索引, 在找到文檔內(nèi)容之后, 就要直接進(jìn)行序列化并輸出了. 而客戶端顯示的時(shí)候, 反序列化出來(lái)的文檔順序, 就是顯示的文檔順序
            // 但是現(xiàn)在找到的文檔還是亂序的. 還需要將allInvertedElemOut中的相關(guān)文檔, 通過(guò)_weight 進(jìn)行倒序排列
            // 這樣, 序列化就是按照倒序排列的, 反序列化也會(huì)如此, 顯示同樣如此
            std::sort(allInvertedElemOut.begin(), allInvertedElemOut.end(),
                      [](const invertedElemOut_t& elem1, const invertedElemOut_t& elem2) {
                          return elem1._weight > elem2._weight;
                      });

            // 排序之后, allInvertedElemOut 中文檔的排序就是倒序了
            // 然后 通過(guò)遍歷此數(shù)組, 獲取文檔id, 根據(jù)id獲取文檔在正排索引中的內(nèi)容
            // 然后再將 所有內(nèi)容序列化
            Json::Value root;
            for (auto& elemOut : allInvertedElemOut) {
                // 通過(guò)Json::Value 對(duì)象, 存儲(chǔ)文檔內(nèi)容
                Json::Value elem;
                // 通過(guò)elemOut._docId 獲取正排索引中 文檔的內(nèi)容信息
                ns_index::docInfo_t* doc = _index->getForwardIndex(elemOut._docId);
                // elem賦值
                elem["url"] = doc->_url;
                elem["title"] = doc->_title;
                // 關(guān)于文檔的內(nèi)容, 搜索結(jié)果中是不展示文檔的全部?jī)?nèi)容的, 應(yīng)該只顯示包含關(guān)鍵詞的摘要, 點(diǎn)進(jìn)文檔才顯示相關(guān)內(nèi)容
                // 而docInfo中存儲(chǔ)的是文檔去除標(biāo)簽之后的所有內(nèi)容, 所以不能直接將 doc._content 存儲(chǔ)到elem對(duì)應(yīng)key:value中
                elem["desc"] = getDesc(doc->_content, elemOut._keywords[0]); // 只根據(jù)第一個(gè)關(guān)鍵詞來(lái)獲取摘要
                // for Debug
                // 這里有一個(gè)bug, jsoncpp 0.10.5.2 是不支持long或long long 相關(guān)類型的, 所以需要轉(zhuǎn)換成 double
                // 這里轉(zhuǎn)換成 double不會(huì)有什么影響, 因?yàn)檫@兩個(gè)參數(shù)只是本地調(diào)試顯示用的.
                elem["docId"] = (double)doc->_docId;
                elem["weight"] = (double)elemOut._weight;

                root.append(elem);
            }

            // 序列化完成之后將相關(guān)內(nèi)容寫入字符串
            // for Debug 用 styledWriter
            Json::StyledWriter writer;
            *jsonString = writer.write(root);
        }

        std::string getDesc(const std::string& content, const std::string& keyword) {
            // 如何獲取摘要呢?
            // 我們嘗試獲取正文中 第一個(gè)keyword 的前50個(gè)字節(jié)和后100個(gè)字節(jié)的內(nèi)容 作為摘要
            const std::size_t prevStep = 50;
            const std::size_t nextStep = 100;
            // 獲取正文中 第一個(gè) keyword 的位置

            // std::size_t pos = content.find(keyword);
            // if (pos == std::string::npos)
            //  return "keyword does not exist!";
            // 直接這樣處理, 會(huì)出現(xiàn)一個(gè)問題:
            // keyword是有大小寫的. 倒排索引中查找 我們實(shí)現(xiàn)的是忽略大小寫, 所以可以找到文檔
            // 而 string::find() 是區(qū)分大小寫的查找, 可能無(wú)法在內(nèi)容中找到對(duì)應(yīng)的關(guān)鍵詞
            // string容器也沒有提供不區(qū)分大小寫的查找方法
            // 此時(shí), 可以用std::search()
            // std::search(it1, it2, it3, it4, pred);
            // 可以在[it1, it2)中 查找第一個(gè)[it3, it4)(詞語(yǔ))的出現(xiàn)位置.
            // 并且, 如果使用第5個(gè)參數(shù), 就可以傳入 帶有兩個(gè)參數(shù)的仿函數(shù), 這兩個(gè)參數(shù)就是需要比較的字符
            // 可以在仿函數(shù)內(nèi)設(shè)置這兩個(gè)字符的比較方式
            // 最終會(huì)返回找到的找到的單次第一個(gè)字符位置的迭代器, 否則返回it2

            auto iter = std::search(content.begin(), content.end(), keyword.begin(), keyword.end(),
                                    [](int x, int y) {
                                        return std::tolower(x) == std::tolower(y);
                                    });
            if (iter == content.end())
                return "keyword does not exist!";
            std::size_t pos = std::distance(content.begin(), iter);

            std::size_t begin = 0;
            std::size_t end = content.size() - 1;

            // 獲取前50字節(jié) 和 后100字節(jié)的迭代器位置
            if (pos > begin + prevStep)
                begin += (pos - prevStep);
            if (pos + nextStep < end)
                end = pos + nextStep;

            if (begin >= end)
                return "nothing!";

            // 獲取摘要
            std::string desc;
            if (pos <= begin + prevStep)
                desc = "...";
            desc += content.substr(begin, end - begin);
            if (pos + nextStep < end)
                desc += "...";

            return desc;
        }
    };
}

本篇文章到此結(jié)束

感謝閱讀~文章來(lái)源地址http://www.zghlxwxcb.cn/news/detail-633164.html

到了這里，關(guān)于[C++項(xiàng)目] Boost文檔站內(nèi)搜索引擎(4): 搜索的相關(guān)接口的實(shí)現(xiàn)、線程安全的單例index接口、cppjieba分詞庫(kù)的使用、綜合調(diào)試...的文章就介紹完了。如果您還想了解更多內(nèi)容，請(qǐng)?jiān)谟疑辖撬阉鱐OY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章，希望大家以后多多支持TOY模板網(wǎng)！