
[Natural Language Processing] Extracting Keywords with the TextRank Algorithm

Posted 2 years ago · Author: G皮T · Category: Toy博客


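TextRank is a PageRank-based algorithm commonly used for keyword extraction and text summarization. In this article, I walk through a keyword extraction example to show how TextRank works, and then present a Python implementation.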
[Figure: keyword extraction with TextRank, NER, and related methods]

1. PageRank Overview

Much has already been written about PageRank, so I only introduce it briefly here. This will help us understand TextRank later, because TextRank is built on PageRank.

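PageRank (PR) is an algorithm for computing the weight of web pages. We can treat all web pages as one large directed graph in which each node is a web page. If page A contains a link to page B, that link is represented as a directed edge from A to B.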
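One way to hold such a graph in code is an edge list plus two lookup tables derived from it. The sketch below is only an illustration: the pages `A`, `B`, `C` and their links are hypothetical, and the names `edges`, `out_links` and `in_links` are my own.

```python
from collections import defaultdict

# Hypothetical links, written as (source page, target page)
edges = [("A", "B"), ("A", "C"), ("C", "B")]

out_links = defaultdict(set)  # page -> pages it links to
in_links = defaultdict(set)   # page -> pages that link to it
for src, dst in edges:
    out_links[src].add(dst)
    in_links[dst].add(src)

print(sorted(in_links["B"]))   # pages with a directed edge into B
print(sorted(out_links["A"]))  # pages A points to
```

The inbound table is what the weight update needs for each page, and the size of each outbound set is the number of links that a page's weight is shared across.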

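Once the whole graph is built, we can assign a weight to each web page with the following formula.

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

Here $S(V_i)$ is the weight of page $i$, $d$ is the damping factor, $In(V_i)$ is the set of pages that link to page $i$, and $|Out(V_j)|$ is the number of outbound links of page $j$.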

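Here is an example that makes the notation above easier to follow. We have a graph that shows how web pages link to each other. Each node is a web page, and each arrow is an edge. We want the weight of page e.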

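We can rewrite the summation in the formula above in a simpler form. Page e has two inbound pages: a, which has a single outbound link, and b, which has two. The sum therefore becomes

$$\sum_{V_j \in In(V_e)} \frac{1}{|Out(V_j)|} S(V_j) = S(V_a) + \frac{1}{2} S(V_b)$$

so the weight $S(V_e)$ of page e is

$$S(V_e) = (1 - d) + d \left( S(V_a) + \frac{1}{2} S(V_b) \right)$$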

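We can see that the weight of page e depends on the weights of its inbound pages. We need to repeat this update many times to obtain the final weights. At initialization, every page has a weight of 1.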
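The update can be sketched directly from the formula, before any matrix is involved. This is a minimal illustration, assuming the example graph has the links a → e, b → e and b → f; the function name `pagerank_step` and the dictionary layout are my own choices.

```python
def pagerank_step(weights, in_links, out_degree, d=0.85):
    """Apply one synchronous update to the weight of every page."""
    new_weights = {}
    for page in weights:
        # Sum the weight each inbound page passes along one outbound link
        total = sum(weights[src] / out_degree[src]
                    for src in in_links.get(page, ()))
        new_weights[page] = (1 - d) + d * total
    return new_weights

# Assumed example graph: a -> e, b -> e, b -> f
in_links = {"e": ["a", "b"], "f": ["b"]}
out_degree = {"a": 1, "b": 2}

weights = {page: 1.0 for page in "abef"}  # every page starts at 1
weights = pagerank_step(weights, in_links, out_degree)
print(weights)
```

Pages a and b have no inbound links, so after a single step they keep only the constant term `1 - d`.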

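2. PageRank Implementation

We can use a matrix to represent the inbound and outbound links among a, b, e and f in the graph.

$$\begin{array}{c|cccc} & a & b & e & f \\ \hline a & 0 & 0 & 0 & 0 \\ b & 0 & 0 & 0 & 0 \\ e & 1 & 1 & 0 & 0 \\ f & 0 & 1 & 0 & 0 \end{array}$$

Each row lists the inbound links a node receives from the other nodes. For example, row e shows that nodes a and b each have an outbound link to node e. This representation simplifies the weight-update calculation.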

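Following the $\frac{1}{|Out(V_j)|}$ term in the formula, we normalize each column.

$$\begin{array}{c|cccc} & a & b & e & f \\ \hline a & 0 & 0 & 0 & 0 \\ b & 0 & 0 & 0 & 0 \\ e & 1 & 0.5 & 0 & 0 \\ f & 0 & 0.5 & 0 & 0 \end{array}$$

We then multiply this matrix by the vector of node weights.

This is just one iteration, without the damping factor d.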
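The normalization and the single undamped step can be sketched in NumPy as follows. The raw 0/1 matrix is my reading of the example graph (rows are link targets, columns are link sources), and the names `links`, `col_sums` and `one_step` are illustrative.

```python
import numpy as np

# Raw link matrix: rows are targets, columns are sources (order a, b, e, f)
links = np.array([[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [1, 1, 0, 0],
                  [0, 1, 0, 0]], dtype=float)

col_sums = links.sum(axis=0)  # number of outbound links per node
# Divide each column by its sum; columns without outbound links stay all zero
g = np.divide(links, col_sums, out=np.zeros_like(links), where=col_sums != 0)

pr = np.ones(4)    # every node starts with weight 1
one_step = g @ pr  # one iteration without the damping factor
print(g)
print(one_step)
```

Passing `out=np.zeros_like(links)` matters here: columns e and f sum to zero, and without an explicit output array the skipped entries would be left uninitialized.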

We can use Python to run multiple iterations.

import numpy as np

# Column-normalized link matrix; rows and columns are ordered a, b, e, f
g = np.array([[0, 0, 0, 0],
              [0, 0, 0, 0],
              [1, 0.5, 0, 0],
              [0, 0.5, 0, 0]])
pr = np.array([1, 1, 1, 1])  # initial weight of a, b, e, f is 1
d = 0.85  # damping factor

for i in range(10):
    pr = (1 - d) + d * np.dot(g, pr)
    print(i)
    print(pr)

0
[0.15 0.15 1.425 0.575]
1
[0.15 0.15 0.34125 0.21375]
2
[0.15 0.15 0.34125 0.21375]
3
[0.15 0.15 0.34125 0.21375]
4
[0.15 0.15 0.34125 0.21375]
5
[0.15 0.15 0.34125 0.21375]
6
[0.15 0.15 0.34125 0.21375]
7
[0.15 0.15 0.34125 0.21375]
8
[0.15 0.15 0.34125 0.21375]
9
[0.15 0.15 0.34125 0.21375]

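So the weight (PageRank value) of e is 0.34125.

If we turn the directed edges into undirected edges, the matrix changes accordingly.

$$\begin{array}{c|cccc} & a & b & e & f \\ \hline a & 0 & 0 & 1 & 0 \\ b & 0 & 0 & 1 & 1 \\ e & 1 & 1 & 0 & 0 \\ f & 0 & 1 & 0 & 0 \end{array}$$

After normalizing each column:

$$\begin{array}{c|cccc} & a & b & e & f \\ \hline a & 0 & 0 & 0.5 & 0 \\ b & 0 & 0 & 0.5 & 1 \\ e & 1 & 0.5 & 0 & 0 \\ f & 0 & 0.5 & 0 & 0 \end{array}$$

The code changes accordingly.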

import numpy as np

# Column-normalized matrix for the undirected graph; order is a, b, e, f
g = np.array([[0, 0, 0.5, 0],
              [0, 0, 0.5, 1],
              [1, 0.5, 0, 0],
              [0, 0.5, 0, 0]])
pr = np.array([1, 1, 1, 1])  # initial weight of a, b, e, f is 1
d = 0.85  # damping factor

for i in range(10):
    pr = (1 - d) + d * np.dot(g, pr)
    print(i)
    print(pr)

0
[0.575 1.425 1.425 0.575]
1
[0.755625 1.244375 1.244375 0.755625]
2
[0.67885937 1.32114062 1.32114062 0.67885937]
3
[0.71148477 1.28851523 1.28851523 0.71148477]
4
[0.69761897 1.30238103 1.30238103 0.69761897]
5
[0.70351194 1.29648806 1.29648806 0.70351194]
6
[0.70100743 1.29899257 1.29899257 0.70100743]
7
[0.70207184 1.29792816 1.29792816 0.70207184]
8
[0.70161947 1.29838053 1.29838053 0.70161947]
9
[0.70181173 1.29818827 1.29818827 0.70181173]

So the weight (PageRank value) of e after 10 iterations is 1.29818827.
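Ten iterations is a fixed budget, and the printed values are still changing in the fourth decimal place. A common alternative is to iterate until the weights stop changing. The sketch below assumes a tolerance-based stopping rule; the function name `pagerank` and the parameters `tol` and `max_iter` are my own, not part of the original code.

```python
import numpy as np

def pagerank(g, d=0.85, tol=1e-6, max_iter=100):
    """Iterate the PageRank update until no weight moves by more than tol."""
    pr = np.ones(g.shape[0])
    for _ in range(max_iter):
        new_pr = (1 - d) + d * (g @ pr)
        if np.abs(new_pr - pr).max() < tol:
            return new_pr
        pr = new_pr
    return pr

# Column-normalized matrix of the undirected example (order a, b, e, f)
g = np.array([[0, 0, 0.5, 0],
              [0, 0, 0.5, 1],
              [1, 0.5, 0, 0],
              [0, 0.5, 0, 0]])

pr = pagerank(g)
print(pr)
```

For this graph the weights settle near 0.7018 for a and f and 1.2982 for b and e, close to what the ten-step loop reaches.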

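3. How TextRank Works

What is the difference between TextRank and PageRank?

In short, PageRank ranks web pages, while TextRank ranks text. A web page in PageRank corresponds to a unit of text in TextRank, so the basic idea is the same.

We split a document into sentences and keep only the words that carry specific POS tags. We use spaCy for POS tagging.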

import spacy
nlp = spacy.load('en_core_web_sm')

content = '''
The Wandering Earth, described as China’s first big-budget science fiction thriller, quietly made it onto screens at AMC theaters in North America this weekend, and it shows a new side of Chinese filmmaking — one focused toward futuristic spectacles rather than China’s traditionally grand, massive historical epics. At the same time, The Wandering Earth feels like a throwback to a few familiar eras of American filmmaking. While the film’s cast, setting, and tone are all Chinese, longtime science fiction fans are going to see a lot on the screen that reminds them of other movies, for better or worse.
'''

doc = nlp(content)
for sent in doc.sents:
    print(sent.text)

The paragraph is split into three sentences.

The Wandering Earth, described as China’s first big-budget science fiction thriller, quietly made it onto screens at AMC theaters in North America this weekend, and it shows a new side of Chinese filmmaking — one focused toward futuristic spectacles rather than China’s traditionally grand, massive historical epics.

At the same time, The Wandering Earth feels like a throwback to a few familiar eras of American filmmaking.

While the film’s cast, setting, and tone are all Chinese, longtime science fiction fans are going to see a lot on the screen that reminds them of other movies, for better or worse.

Most words in a sentence are not useful for determining importance, so we only consider words tagged NOUN, PROPN or VERB. This filtering is optional; you can also use all words.

candidate_pos = ['NOUN', 'PROPN', 'VERB']
sentences = []

for sent in doc.sents:
    # Keep only non-stop-word tokens whose POS tag is a candidate
    selected_words = []
    for token in sent:
        if token.pos_ in candidate_pos and not token.is_stop:
            selected_words.append(token)
    sentences.append(selected_words)

print(sentences)

[[Wandering, Earth, described, China, budget, science, fiction, thriller, screens, AMC, theaters, North, America, weekend, shows, filmmaking, focused, spectacles, China, epics], 
[time, Wandering, Earth, feels, throwback, eras, filmmaking], 
[film, cast, setting, tone, science, fiction, fans, going, lot, screen, reminds, movies]]

Each word is a node in PageRank. We set the window size to k.

$[w_1, w_2, \dots, w_k], [w_2, w_3, \dots, w_{k+1}], [w_3, w_4, \dots, w_{k+2}]$ are windows. Any two words in the same window are considered to share an undirected edge.

Take [time, Wandering, Earth, feels, throwback, eras, filmmaking] as an example with window size $k = 4$. This gives 4 windows: [time, Wandering, Earth, feels], [Wandering, Earth, feels, throwback], [Earth, feels, throwback, eras] and [feels, throwback, eras, filmmaking].

For the window [time, Wandering, Earth, feels], every pair of words has an undirected edge, so we get (time, Wandering), (time, Earth), (time, feels), (Wandering, Earth), (Wandering, feels) and (Earth, feels).
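The windowing step can be sketched as a small helper that slides a window of size k over a word list and collects every pair inside it. The function name `window_pairs` is hypothetical; it is an illustration of the idea, not the implementation used later in this article.

```python
from itertools import combinations

def window_pairs(words, k):
    """Collect every word pair that co-occurs in a sliding window of size k."""
    pairs = set()
    # A list shorter than k still yields one window holding all its words
    for start in range(max(len(words) - k + 1, 1)):
        window = words[start:start + k]
        for w1, w2 in combinations(window, 2):
            pairs.add((w1, w2))
    return pairs

words = ["time", "Wandering", "Earth", "feels", "throwback", "eras", "filmmaking"]
pairs = window_pairs(words, k=4)
print(len(pairs))
print(("time", "feels") in pairs, ("time", "throwback") in pairs)
```

With k = 4, two words are linked exactly when they are at most three positions apart, so `time` is linked to `feels` but not to `throwback`.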

From this graph we can compute the weight of each node (word). The words with the highest weights can be used as keywords.
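To make this concrete, here is a small sketch that ranks the seven words of the example sentence without spaCy. It assumes the words are unique (so list position doubles as node index), uses window size 4 and damping factor 0.85, and runs a fixed 50 iterations; these choices are mine for illustration.

```python
import numpy as np

words = ["time", "Wandering", "Earth", "feels", "throwback", "eras", "filmmaking"]
k = 4
n = len(words)

# Undirected edge between any two words at most k - 1 positions apart
g = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, min(i + k, n)):
        g[i, j] = g[j, i] = 1.0

g /= g.sum(axis=0)  # column-normalize; every word has a neighbor here

pr = np.ones(n)
for _ in range(50):
    pr = 0.15 + 0.85 * (g @ pr)

ranking = sorted(zip(words, pr), key=lambda t: t[1], reverse=True)
for word, weight in ranking:
    print(f"{word} - {weight:.4f}")
```

In a sentence this short the ranking mostly reflects position: `feels` sits in the middle, shares a window with every other word, and therefore comes out on top. Meaningful keywords need a longer text in which words repeat across sentences.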

4. Extracting Keywords with TextRank

Here I implement a complete example in Python, using spaCy to get the POS tag of each word.

from collections import OrderedDict
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_sm')

class TextRank4Keyword():
    """Extract keywords from text"""
    
    def __init__(self):
        self.d = 0.85 # damping coefficient, usually is .85
        self.min_diff = 1e-5 # convergence threshold
        self.steps = 10 # iteration steps
        self.node_weight = None # save keywords and its weight

    
    def set_stopwords(self, stopwords):  
        """Set stop words"""
        for word in STOP_WORDS.union(set(stopwords)):
            lexeme = nlp.vocab[word]
            lexeme.is_stop = True
    
    def sentence_segment(self, doc, candidate_pos, lower):
        """Keep only the words whose POS tag is in candidate_pos"""
        sentences = []
        for sent in doc.sents:
            selected_words = []
            for token in sent:
                # Keep only non-stop words with a candidate POS tag
                if token.pos_ in candidate_pos and not token.is_stop:
                    if lower:
                        selected_words.append(token.text.lower())
                    else:
                        selected_words.append(token.text)
            sentences.append(selected_words)
        return sentences
        
    def get_vocab(self, sentences):
        """Get all tokens"""
        vocab = OrderedDict()
        i = 0
        for sentence in sentences:
            for word in sentence:
                if word not in vocab:
                    vocab[word] = i
                    i += 1
        return vocab
    
    def get_token_pairs(self, window_size, sentences):
        """Build token_pairs from windows in sentences"""
        token_pairs = list()
        for sentence in sentences:
            for i, word in enumerate(sentence):
                for j in range(i+1, i+window_size):
                    if j >= len(sentence):
                        break
                    pair = (word, sentence[j])
                    if pair not in token_pairs:
                        token_pairs.append(pair)
        return token_pairs
        
    def symmetrize(self, a):
        return a + a.T - np.diag(a.diagonal())
    
    def get_matrix(self, vocab, token_pairs):
        """Get normalized matrix"""
        # Build matrix
        vocab_size = len(vocab)
        g = np.zeros((vocab_size, vocab_size), dtype='float')
        for word1, word2 in token_pairs:
            i, j = vocab[word1], vocab[word2]
            g[i][j] = 1
            
        # Get symmetric matrix
        g = self.symmetrize(g)
        
        # Normalize matrix by column
        norm = np.sum(g, axis=0)
        # Columns whose sum is 0 are skipped and keep the zeros supplied via `out`
        g_norm = np.divide(g, norm, out=np.zeros_like(g), where=norm!=0)
        
        return g_norm

    
    def get_keywords(self, number=10):
        """Print the top `number` keywords"""
        node_weight = sorted(self.node_weight.items(), key=lambda t: t[1], reverse=True)
        for key, value in node_weight[:number]:
            print(key + ' - ' + str(value))
        
        
    def analyze(self, text, 
                candidate_pos=('NOUN', 'PROPN'), 
                window_size=4, lower=False, stopwords=()):
        """Main function to analyze text"""
        
        # Set stop words
        self.set_stopwords(stopwords)
        
        # Parse text with spaCy
        doc = nlp(text)
        
        # Filter sentences
        sentences = self.sentence_segment(doc, candidate_pos, lower) # list of list of words
        
        # Build vocabulary
        vocab = self.get_vocab(sentences)
        
        # Get token_pairs from windows
        token_pairs = self.get_token_pairs(window_size, sentences)
        
        # Get normalized matrix
        g = self.get_matrix(vocab, token_pairs)
        
        # Initialize weights (PageRank values)
        pr = np.ones(len(vocab))
        
        # Iterate until the total weight stops changing or steps run out
        previous_pr = 0
        for epoch in range(self.steps):
            pr = (1-self.d) + self.d * np.dot(g, pr)
            if abs(previous_pr - sum(pr))  < self.min_diff:
                break
            else:
                previous_pr = sum(pr)

        # Get weight for each node
        node_weight = dict()
        for word, index in vocab.items():
            node_weight[word] = pr[index]
        
        self.node_weight = node_weight

This TextRank4Keyword class implements the functionality described above. Here is the output for one paragraph.

text = '''
The Wandering Earth, described as China’s first big-budget science fiction thriller, quietly made it onto screens at AMC theaters in North America this weekend, and it shows a new side of Chinese filmmaking — one focused toward futuristic spectacles rather than China’s traditionally grand, massive historical epics. At the same time, The Wandering Earth feels like a throwback to a few familiar eras of American filmmaking. While the film’s cast, setting, and tone are all Chinese, longtime science fiction fans are going to see a lot on the screen that reminds them of other movies, for better or worse.
'''

tr4w = TextRank4Keyword()
tr4w.analyze(text, candidate_pos = ['NOUN', 'PROPN'], window_size=4, lower=False)
tr4w.get_keywords(10)

science - 1.717603106506989
fiction - 1.6952610926181002
filmmaking - 1.4388798751402918
China - 1.4259793786986021
Earth - 1.3088154732297723
tone - 1.1145002295684114
Chinese - 1.0996896235078055
Wandering - 1.0071059904601571
weekend - 1.002449354657688
America - 0.9976329264870932
