国产无码综合区,色欲AV无码国产永久播放,无码天堂亚洲国产AV,国产日韩欧美女同一区二区

<cite id="6qsw4"><bdo id="6qsw4"></bdo></cite>

<button id="6qsw4"></button>

<dfn id="6qsw4"></dfn>

<li id="6qsw4"><tfoot id="6qsw4"></tfoot></li>

boost庫搜索引擎

1年前作者：加法器+分類：Toy博客閱讀(29)違法舉報

這篇具有很好參考價值的文章主要介紹了boost庫搜索引擎。希望對大家有所幫助。如果存在錯誤或未考慮完全的地方，請大家不吝賜教，您也可以點擊"舉報違法"按鈕提交疑問。

Gitee倉庫：boost庫搜索引擎

0. 前言

市面上有很多搜索引擎例如Google、百度、360等，這些都是特別大的項目。

對于個人學習我們可以寫一個站內搜索，這個搜索的內容更加垂直，數據量更小，例如C++的文檔The C++ Resources Network

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

Google搜索顯示內容：
boost庫搜索引擎,原創(chuàng),項目,搜索引擎

1. 搜索引擎原理

客戶端使用瀏覽器搜索向服務器發(fā)起請求（GET方法上傳）。

服務器上面有一個搜索的服務軟件，在搜索之前，它要在全網抓取網頁（爬蟲）。

將網頁抓下來之后：

去標簽和數據清理
建立索引

之后這個搜索的軟件檢索索引得到相關的html，拼接多個網頁的title、description、url，構建出一個新的網頁返回給客戶端

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

2. 技術棧和項目環(huán)境

技術棧：

后端：

C/C++、C++11、STL、boost準標準庫、Jsoncpp、cppjieba、cpp-http
前端(選學)：

html5、css、JavaScript、jQuery、Ajax

項目環(huán)境：

Centos 7云服務器、VS Code、 gcc(g++)、Makefile

3. 正排索引和倒排索引

比如說我們有2個文檔：

文檔1：我好喜歡睡覺啊，因為睡覺很舒服呀
文檔2：你喜歡睡覺嗎？喜歡和誰一起睡覺啊

3.1 正排索引

正排索引中每個文檔都有唯一標識符，然后就是從文檔ID找到文檔內容

文檔ID	文檔內容
1	我好喜歡睡覺啊，因為睡覺很舒服噢
2	你喜歡睡覺嗎？喜歡和誰一起睡覺啊

3.2 倒排索引

倒排索引則是文檔分詞，通過整理不重復的各個關鍵字來找到文檔ID

有些停止詞在分詞的時候一般不考慮，搜索引擎stopwords大全：https://github.com/goto456/stopwords

關鍵字	文檔ID，weight(權重)
睡覺	文檔1、文檔2
舒服	文檔1
喜歡	文檔1、文檔2
一起	文檔2

總的來說，正排索引適用于需要快速訪問整個文檔內容的場景，而倒排索引則適用于需要快速檢索包含特定詞項的文檔的場景。兩者常常結合使用，在信息檢索系統(tǒng)中發(fā)揮各自的優(yōu)勢。

3.3 模擬查找

查找：喜歡 -> 倒排索引查找 -> 提取文檔1、文檔2 -> 正排索引 -> 找到文檔內容 -> 文檔內容摘要(title、desc、url) -> 構建響應結果

4. 獲取數據源

進入boost官網：Boost C++ Libraries，然后下載文檔：

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

下載完畢之后傳入服務器

rz -E

? 傳入之后解壓：

tar xzf boost_1_84_0.tar.gz

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

這些html文件就是我們的數據源了，我們只需要這里面的內容，所以將這里面的內容拷貝到要存放數據源的文件夾當中。

5. 數據清洗

html中用的<>就是標簽，一般都是成對出現，所以我們要將這些內容給去掉，留下里面正文的內容

ls -Rl | grep -E '*.html' | wc -l

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

這里查看到差不多有8500個文檔，我們的目標就是要將這八千多個文檔的標簽全部去掉，然后保存到一個文件中。

5.1 保存路徑

bool EnumFiles(const std::string &src_path, std::vector<std::string> *files_list)
{
    namespace fs = boost::filesystem;
    fs::path root_path(src_path);

    //判斷路徑是否存在
    if(!fs::exists(root_path))
    {
        std::cerr << src_path << " not exists..." << std::endl;
        return false;
    }
    //空迭代器,判斷遞歸是否結束
    fs::recursive_directory_iterator end;
    for(fs::recursive_directory_iterator iter(root_path); iter!=end; iter++)
    {
        //判斷是否為普通文件，html都是普通文件
        if(!fs::is_regular_file(*iter))
        {
            continue;
        }
        //判斷后綴
        if(iter->path().extension() != ".html")
        {
            
            continue;
        }

        //保存獲取到的文件路徑
        //std::cout << iter->path().string() << std::endl;
        //保存路徑
        files_list->push_back(iter->path().string());
    }
    return true;
}

我們這里先將路徑保存下來，方便后續(xù)的讀取

這里需要用到boost庫的文件操作，需要提前安裝boost庫

boost庫安裝：
sudo yum install -y boost-devel

5.2 解析文件

首先將文件的內容讀取上來，然后根據讀取的內容對標題、內容進行提取，最后構造出對應的url

bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results)
{
    for(const std::string &file : files_list)
    {
        //讀取文件內容
        std::string result;
        if(!ns_util::FileUtil::ReadFile(file, &result))
        {
            continue;
        }

        DocInfo_t doc;
        //解析文件,提取標題
        if(!ParseTitle(result, &doc.title))
        {
            continue;
        }
        //解析文件,提取內容
        if(!ParseContent(result, &doc.content))
        {
            continue;
        }
        //解析路徑,構建url
        if(!ParseUrl(file, &doc.url))
        {
            continue;
        }

        //解析完畢
        //results->push_back(doc);    //直接拷貝, 效率較低
        results->push_back(std::move(doc));

        //Debug
        //ShowDoc(doc);
    }
    return true;
}

提取標題

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

標題的標簽是<title>內容</title>，需要拿到中間的內容，string中的find方法，找到的是起始位置，也就是<，所以截取的起始位置，還需要加上"<title>"的長度

static bool ParseTitle(const std::string &file, std::string *title)
{
    size_t begin = file.find("<title>");
    if(begin == std::string::npos)
        return false;
    size_t end = file.find("</title>");
    if(end == std::string::npos)
        return false;
    
    begin += std::string("<title>").size();
    if(begin > end)
        return false;

    *title = file.substr(begin, end - begin);
    return true;
}

提取內容

提取內容并不是拿取html網頁的內容，而是要去掉標簽，也就是去除<>里面的內容去掉

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

這里采用簡易的狀態(tài)機來進行判斷：

static bool ParseContent(const std::string &file, std::string *content)
{
    //簡易狀態(tài)機
    enum status
    {
        LABEL,
        CONTENT
    };

    status s = LABEL;   //起始肯定是遇到標簽
    for(char c : file)
    {
        switch (s)
        {
        case LABEL:
            if(c == '>')    s = CONTENT;
            break;
        case CONTENT:
            if(c == '<')    s = LABEL;
            else
            {
                if(c == '\n')   c = ' ';
                content->push_back(c);
            }
            break;
        default:
            break;
        }
    }
    return true;
}

起始狀態(tài)肯定是遇到標簽<，所以將狀態(tài)機的其狀態(tài)設為標簽狀態(tài)

當遇到>就表明狀態(tài)結束，此時我們將狀態(tài)設置為CONTENT，但是也可能下一個又是標簽，所以還行進行判斷，當遇到<，就表明新的狀態(tài)又開始了。

這里沒有保留原文件中的\n，因為之后\n用于文本的分隔符

構造url

boost庫的官網文檔，和我們獲取的數據源是有對應的路徑關系的

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

而我們將html文件放到data/input當中，所以相當官網的路徑+我們的文檔路徑訪問到指定的文檔

url_head：https://www.boost.org/doc/libs/1_84_0/doc/html（切記，帶上前綴https://，踩bug之談）

我們的路徑為data/input/array.html，但是我們并不需要~~/data/input~~

url_tail：array.html

url = url_head + url + tail

這樣拼接起來就是指定文檔的網址了

5.4 保存內容

文檔之間如何區(qū)分：

文檔和文檔之間可以用\3區(qū)分，例如：

AAAAAAA\3BBBBBB\3CCCCCC

為什么是\3？

文檔當中能夠顯示的字符，一般都是屬于打印字符；而像\3這樣的叫做控制字符，是不顯示的，就不會污染我們清洗之后的文檔

但是getline函數，是以行讀取的內容，所以我們可以采用\3來作為title、content、url的分隔符，而\n作為文檔的分隔符，即：title\3content\3url\3\n\...，這樣會方便之后的文件讀取

bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output)
{
    std::ofstream out(output, std::ios::out | std::ios::binary);
    if(!out.is_open())
    {
        std::cerr << "open " << output << " error..." << std::endl;
        return false;
    }

    //寫入
    for(auto &item : results)
    {
        std::string out_str;
        out_str += item.title;
        out_str += SEP;
        out_str += item.content;
        out_str += SEP;
        out_str += item.url;
        out_str += '\n';

        out.write(out_str.c_str(), out_str.size());
    }

    out.close();
    return true;
}

運行之后得到的內容和從官網獲取的文檔內容數量一樣：

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

6. 建立索引

建立正排索引是要將文檔的內容讀取之后放到數組當中，而建立倒排是需要在正排之后，根據正排之后的文檔再建立。

struct DocInfo
{
    std::string title;
    std::string content;
    std::string url;
    uint64_t doc_id;    //文檔id
};
struct InvertedElem
{
    uint64_t doc_id;
    std::string word;
    int weight;
};
typedef std::vector<InvertedElem> InvertedList;   //倒排拉鏈
class Index
{

public:
    Index()
    {}
    ~Index()
    {}
public:
    //根據文檔id找文檔內容
    DocInfo* GetForwardIndex(const uint64_t &id)
    {
        //...
        return nullptr;
    }

    //根據關鍵字找倒排拉鏈
    InvertedList* GetInvertedList(const std::string &word)
    {
        //...
        return nullptr;
    }

    //根據parse之后的數據,建立正排倒排索引
    bool BuildIndex(const std::string &input)
    {
        //...
        return true;
    }

private:
    std::vector<DocInfo> forward_index;  //數組的下標就是天然的文檔id
    std::unordered_map<std::string, InvertedList> inverted_index;    //一個關鍵字對應一組(個)倒排元素
};

6.1 建立正排索引

DocInfo *BuildForwardIndex(const std::string &line)
{
    //1. 切割字符串
    std::vector<std::string> results;
    const std::string sep = "\3";   //內容分隔符
    ns_util::StringUtil::SplitString(line, &results, sep);
    if(results.size() != 3)
        return nullptr;
    //2.填充內容
    DocInfo doc;
    doc.title = results[0];
    doc.content = results[1];
    doc.url = results[2];
    doc.doc_id = forward_index.size();  //當時的大小就是 文檔id
    //3. 插入正排
    forward_index.push_back(std::move(doc));

    return &forward_index.back();
}

切分字符串采用boost庫的split函數：
    class StringUtil
    {
    public:
        static void SplitString(const std::string &target, std::vector<std::string> *ret, const std::string &sep)
        {
            boost::split(*ret, target, boost::is_any_of(sep), boost::token_compress_on);
            
        }
    };
*ret：目標容器的指針，這個容器用于存儲分割后的子串。

target：需要分割的源字符串。

boost::is_any_of(sep)：指定的分隔符，可以是單個字符、字符串或者字符范圍。

boost::token_compress_on：表示開啟壓縮模式。開啟后，連續(xù)的分隔符將被視為一個分隔符，并且不會產生空字符串。

6.2 建立倒排索引

原理：

一個文檔里面的title和content是包含很多詞的，需要將這些詞進行分詞，根據這些內容形成倒排拉鏈

這里就需要將詞和文檔標題/內容中出現次數進行關聯，采用unordered_map

jieba庫：

獲取cppjieba：GitHub - yanyiwu/cppjieba: "結巴"中文分詞的C++版本，然后克隆到服務器

git clone https://github.com/yanyiwu/cppjieba.git

demo代碼：

#include "cppjieba/Jieba.hpp"
const char* const DICT_PATH = "../dict/jieba.dict.utf8";
const char* const HMM_PATH = "../dict/hmm_model.utf8";
const char* const USER_DICT_PATH = "../dict/user.dict.utf8";
const char* const IDF_PATH = "../dict/idf.utf8";
const char* const STOP_WORD_PATH = "../dict/stop_words.utf8";

這些就是詞庫，就是根據這些詞庫進行分詞

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

我們稍后要用，肯定不是在這個庫文件里面創(chuàng)建文件使用，所以我們需要建立軟連接讓程序能找到詞庫，然后它包含的頭文件還要包含Jieba.hpp，也是需要建立軟鏈接的

不了解的查看此篇文章：Linux軟硬鏈接

這里還有一個小坑：

我們編譯的會報錯，說缺少limonp/Logging.hpp，我們需要手動拷貝limonp這個目錄到cppjieba/include當中

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

        bool BuildInvertedIndex(const DocInfo &doc)
        {
            struct word_cnt
            {
                int title_cnt;
                int content_cnt;

                word_cnt()
                :title_cnt(0),content_cnt(0)
                {}
            };
            std::unordered_map<std::string, word_cnt> word_map; //暫存詞頻映射表
            std::vector<std::string> title_words;
            std::vector<std::string> content_words;
            ns_util::JiebaUtil::CutString(doc.title, &title_words, STOP_WORD_FLAG); //標題分詞
            ns_util::JiebaUtil::CutString(doc.content, &content_words, STOP_WORD_FLAG); //內容分詞
            //統(tǒng)計標題詞頻
            for(std::string s : title_words)
            {
                word_map[s].title_cnt++;
            }
            //統(tǒng)計內容詞頻
            for(std::string s : content_words)
            {
                word_map[s].content_cnt++;
            }
#define X 10
#define Y 1
            //構建倒排索引
            for(auto &word_pair : word_map)
            {
                InvertedElem item;
                item.doc_id = doc.doc_id;
                item.word = word_pair.first;
                item.weight = word_pair.second.title_cnt * X + word_pair.second.content_cnt * Y;    //標題出現權重更高
                

                //inverted_index[word_pair.first].push_back(std::move(item));
                InvertedList &invertedlist = inverted_index[word_pair.first];
                invertedlist.push_back(std::move(item));
            }
            return true;
        }

6.3 構建索引

bool BuildIndex(const std::string &input)
{
    std::ifstream in(input, std::ios::in | std::ios::binary);
    if(!in.is_open())
    {
        ns_log::log(Fatal, "%s open error", input.c_str());
        //std::cerr << input << "open error..." << std::endl;
        return false;
    }

    std::string line;
    int cnt = 0;
    int len = lable.size();

    while(std::getline(in, line))
    {
        //建立正排
        DocInfo *doc = BuildForwardIndex(line);
        if(doc == nullptr)
        {
            ns_log::log(Warning, "build %s error", line.c_str());
            //std::cerr << "build " << line << " error" << std::endl;
            continue;
        }
        //建立倒排
        BuildInvertedIndex(*doc);
        cnt++;
        ns_log::log(Info, "build document %d", cnt);
        // std::cout << "build doc: " << cnt  << "..." << std::endl;
    }
    ns_log::log(Info, "total document %d ...", cnt);

    return true;
}

建立索引的本質是將去標簽化的數據加載到內存當中，這個體積是很大的，而且索引只有一份，所以構建成單例模式

class Index
{
   //...
   private:
       Index()
       {}
       Index(const Index&) = delete;//拷貝構造去掉
       Index& operator=(const Index&) = delete;    //賦值語句去掉
       static Index *instance; //單例
       static std::mutex mtx;


   public:    
       ~Index()
       {}
   public:
       static Index* GetInstance()
       {
           if (nullptr == instance)
           {
               mtx.lock();
               if (nullptr == instance)
               {
                   instance = new Index();
               }
               mtx.unlock();
           }

           return instance;
       }
   //...
};
   Index* Index::instance = nullptr;
   std::mutex Index::mtx;  //構造函數

7. 搜索

搜索服務思維導圖：

boost庫搜索引擎,原創(chuàng),項目,搜索引擎

#pragma once
#include<algorithm>
#include<unordered_map>
#include<jsoncpp/json/json.h>
#include"index.hpp"
#include"util.hpp"
#include"Log.hpp"

namespace ns_searcher
{
    struct InvertedElemPrint{
        uint64_t doc_id;
        int weight;
        std::vector<std::string> words;

        InvertedElemPrint()
        :doc_id(0),weight(0)
        {}
    };
    class Searcher
    {
    public:
        Searcher()
        {}
        ~Searcher()
        {}
    private:
        std::string GetAbstract(const std::string &content, const std::string &word)
        {
            // 往前找50字節(jié)，往后找100字節(jié)   直接硬編碼
            size_t prev_pos = 50;
            size_t next_pos = 100;

            //size_t pos = content.find(word);
            // if (pos == std::string::npos)
            //     return "Not Found";

            auto it = std::search(content.begin(), content.end(), word.begin(), word.end(), [](int x, int y){
                return std::tolower(x) == std::tolower(y);
            });

            if(it == content.end()) return "Not Found1";

            size_t pos = std::distance(content.begin(), it);

            size_t start = 0;                // 默認起始位置為begin
            size_t end = content.size() - 1; // 結束位置為end

            if (pos > start + prev_pos)
                start = pos - prev_pos; // 用加法,防止無符號溢出
            if (pos + next_pos < end)
                end = pos + next_pos;

            if (start >= end)
                return "Not Found2";
            return content.substr(start, end - start)+ "...";
        }

    public:
        void InitSearcher(const std::string &input)
        {
            //獲取index對象
            index = ns_index::Index::GetInstance();
            //std::cout << "get instance" << std::endl;
            ns_log::log(Info, "Get instance success");
            //建立索引
            index->BuildIndex(input);
            //std::cout << "build index success" << std::endl;
            ns_log::log(Info, "build index success");
        }


        //keyword:搜索關鍵字    json_string:返回給用戶瀏覽器的搜索結果(序列化)
        void Search(const std::string &keyword, std::string *json_string)
        {
            //對關鍵字進行分詞
            std::vector<std::string> words;
            ns_util::JiebaUtil::CutString(keyword, &words, 1);
            //觸發(fā),根據關鍵字分詞進行查找
            //ns_index::InvertedList inverted_list_all;
            std::vector<InvertedElemPrint> inverted_list_all;
            std::unordered_map<uint16_t, InvertedElemPrint> tokens_map;
            for(std::string s : words)
            {
                boost::to_lower(s); //忽略大小寫(轉小寫)
                //先倒排,再正排
                
                //倒排
                ns_index::InvertedList *inverted_list = index->GetInvertedList(s);


                if(inverted_list == nullptr)
                {
                    continue;
                }
                //inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
                for(const auto &elem : *inverted_list)
                {
                    auto &item = tokens_map[elem.doc_id];    //去重
                    item.doc_id = elem.doc_id;  //去重之后賦值
                    item.weight += elem.weight;  //累加權重值
                    item.words.push_back(elem.word);
                }
                for(const auto &e : tokens_map)
                {
                    inverted_list_all.push_back(std::move(e.second));
                }

                //排序
                // std::sort(inverted_list_all.begin(), inverted_list_all.end(), 
                // [](const ns_index::InvertedElem &e1, const ns_index::InvertedElem &e2){
                //     return e1.weight > e2.weight; 
                // });

                std::sort(inverted_list_all.begin(),inverted_list_all.end(), 
                [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){
                    return e1.weight > e2.weight;
                });

                //正排
                Json::Value root;
                for(auto &e : inverted_list_all)
                {
                    ns_index::DocInfo *doc = index->GetForwardIndex(e.doc_id);
                    if(doc == nullptr)
                    {
                        continue;
                    }
                    Json::Value item;
                    item["title"] = doc->title;
                    //item["abstract"] = doc->content;
                    item["content"] = GetAbstract(doc->content, e.words[0]);
                    item["url"] = doc->url;

                    //debug
                    //item["id"] = (int)e.doc_id;   //json自動轉換int -> string
                    //item["weight"] = e.weight;
                    root.append(item);
                    //
                }
                //序列化
                Json::StyledWriter writer;
                *json_string = writer.write(root);
            }
        }
    private:
        ns_index::Index *index;
    };
}

8. 服務端

使用第三方網絡庫cpp-httplib，這個庫需要使用較新的編譯器：

安裝scl源：sudo yum install centos-release-scl scl-utils-build
更新gcc/g++：sudo yum install -y devtoolset-7gcc devtoolset-7-gcc-c++

安裝好之后的路徑：

這里用的時候要啟動，使用指令：scl enable devtoolset-7 bash

但是這個只是每次會話有效，如果想每次啟動都有效，可以將上面的指令加入.bash_profile配置文件

#include"cpp-httplib/httplib.h"
#include"search.hpp"
#include"Log.hpp"

const std::string root_path = "./wwwroot";
const std::string input = "data/raw_html/raw.txt";

int main()
{
    ns_searcher::Searcher search;
    search.InitSearcher(input);
    httplib::Server svr;
    svr.set_base_dir(root_path.c_str());
    svr.Get("/s", [&search](const httplib::Request &req, httplib::Response &rsp)
            {
        //rsp.set_content("hello httplib", "text/plain; charset=utf-8");
       if(!req.has_param("word"))
       {
        rsp.set_content("please enter keyword!", "text/plain; charset=utf-8");
        return;
       }
       std::string word = req.get_param_value("word");
       //std::cout << "user is searching: " << word << std::endl;
       ns_log::log(Info, "User search content is \"%s\"", word.c_str());
       std::string json_str;
       search.Search(word, &json_str);
       rsp.set_content(json_str, "application/json");

       });
    ns_log::log(Info, "server start success");
    svr.listen("0.0.0.0", 8080);
    return 0;
}

9. 服務部署

nohup ./http_server > log/log.txt 2>&1 &

將服務后臺運行部署，如果要要查看，可以直接看log/log.txt，也可以tail -f log/log.txt

默認設置了去掉暫停詞，第一次啟動可能會較慢，如果不想去掉暫停詞，可以將STOP_WORD_FLAG設置為0文章來源地址http://www.zghlxwxcb.cn/news/detail-845531.html

到了這里，關于boost庫搜索引擎的文章就介紹完了。如果您還想了解更多內容，請在右上角搜索TOY模板網以前的文章或繼續(xù)瀏覽下面的相關文章，希望大家以后多多支持TOY模板網！

本文來自互聯網用戶投稿，該文觀點僅代表作者本人，不代表本站立場。本站僅提供信息存儲空間服務，不擁有所有權，不承擔相關法律責任。如若轉載，請注明出處：如若內容造成侵權/違法違規(guī)/事實不符，請點擊違法舉報進行投訴反饋，一經查實，立即刪除！

分享到：

領支付寶紅包贊助服務器費用

基于boost準標準庫的搜索引擎項目
這是一個基于Web的搜索服務架構客戶端-服務器模型：采用了經典的客戶端-服務器模型，用戶通過客戶端與服務器交互，有助于集中管理和分散計算。簡單的用戶界面：客戶端似乎很簡潔，用戶通過簡單的HTTP請求與服務端交互，易于用戶操作。搜索引擎功能：服務器端的
2024年04月27日
瀏覽(15)
【Boost搜索引擎項目】Day1 項目介紹+去標簽和數據清洗框架搭建
??歡迎來到C++項目專欄 ?????♀?作者介紹：前PLA隊員目前是一名普通本科大三的軟件工程專業(yè)學生 ??IP坐標：湖北武漢 ?? 目前技術棧：C/C++、Linux系統(tǒng)編程、計算機網絡、數據結構、Mysql、Python ?? 博客介紹：通過分享學習過程，加深知識點的掌握，也希望通過平臺能
2024年03月23日
瀏覽(26)
[C++項目] Boost文檔站內搜索引擎(1): 項目背景介紹、相關技術棧、相關概念介紹...
Boost 庫是 C++ 中一個非常重要的開源庫. 它實現了許多 C++ 標準庫中沒有涉及的特性和功能, 一度成為了 C++ 標準庫的拓展庫. C++ 新標準的內容, 很大一部分脫胎于 Boost 庫中. Boost 庫的高質量代碼以及提供了更多實用方便的 C++ 組件, 使得 Boost 庫在 C++ 開發(fā)中會被高頻使用為方便
2024年02月14日
瀏覽(20)
[C++項目] Boost文檔站內搜索引擎(5): cpphttplib實現網絡服務、html頁面實現、服務器部署...
在前四篇文章中, 我們實現了從文檔文件的清理到搜索的所有內容: 項目背景: ??[C++項目] Boost文檔站內搜索引擎(1): 項目背景介紹、相關技術棧、相關概念介紹… 文檔解析、處理模塊 parser 的實現: ??[C++項目] Boost文檔站內搜索引擎(2): 文檔文本解析模塊parser的實現、如何對
2024年02月13日
瀏覽(21)
[C++項目] Boost文檔站內搜索引擎(4): 搜索的相關接口的實現、線程安全的單例index接口、cppjieba分詞庫的使用、綜合調試...
有關 Boost 文檔搜索引擎的項目的前三篇文章, 已經分別介紹分析了: 項目背景: ??[C++項目] Boost文檔站內搜索引擎(1): 項目背景介紹、相關技術棧、相關概念介紹… 文檔解析、處理模塊 parser 的實現: ??[C++項目] Boost文檔站內搜索引擎(2): 文檔文本解析模塊parser的實現、如何對
2024年02月14日
瀏覽(28)
[C++項目] Boost文檔站內搜索引擎(2): 文檔文本解析模塊parser的實現、如何對文檔文件去標簽、如何獲取文檔標題...
在上一篇文章中, 已經從 Boost 官網獲取了 Boost 庫的源碼. 相關文章: ??[C++項目] Boost文檔站內搜索引擎(1): 項目背景介紹、相關技術棧、相關概念介紹… 接下來就要編寫代碼了. 不過還需要做一些準備工作. 創(chuàng)建項目目錄所有的項目文件肯定要在一個目錄下, 找一個位置執(zhí)行下
2024年02月14日
瀏覽(25)
[C++項目] Boost文檔站內搜索引擎(3): 建立文檔及其關鍵字的正排倒排索引、jieba庫的安裝與使用...
之前的兩篇文章: 第一篇文章介紹了本項目的背景, 獲取了 Boost 庫文檔 ??[C++項目] Boost文檔站內搜索引擎(1): 項目背景介紹、相關技術棧、相關概念介紹… 第二篇文章分析實現了 parser 模塊. 此模塊的作用是對所有文檔 html 文件, 進行清理并匯總 ??[C++項目] Boost文檔站內搜
2024年02月07日
瀏覽(91)
boost庫搜索引擎
Gitee倉庫：boost庫搜索引擎市面上有很多搜索引擎例如Google、百度、360等，這些都是特別大的項目。對于個人學習我們可以寫一個站內搜索，這個搜索的內容更加垂直，數據量更小，例如C++的文檔The C++ Resources Network Google搜索顯示內容：客戶端使用瀏覽器搜索向服務器發(fā)起
2024年04月09日
瀏覽(29)
boost 搜索引擎
done 公司：百度、搜狗、360搜索、頭條新聞客戶端 - 我們自己實現是不可能的！站內搜索：搜索的數據更垂直，數據量其實更小 boost的官網是沒有站內搜索的，需要我們自己做一個首先在用戶進行搜索之前，在公司的服務器server上，內存上有一個searcher服務，而我們想進行搜
2024年02月11日
瀏覽(18)
Boost搜索引擎
先說一下什么是搜索引擎,很簡單,就是我們平常使用的百度,我們把自己想要所有的內容輸入進去,百度給我們返回相關的內容.百度一般給我們返回哪些內容呢?這里很簡單,我們先來看一下. 這里我們簡單的說一下我們的搜索引擎的基本原理. 我們給服務器發(fā)起請求,例如搜索關鍵
2024年01月19日
瀏覽(24)

<table id="cgewy"></table>

<menu id="cgewy"><rt id="cgewy"></rt></menu>