Movie Top250 Data Analysis and Visualization with a Python Crawler, Flask, ECharts, and WordCloud


1. Project Overview

2. Module Implementation

2.1 Python crawler implementation

2.1.1 Fetching pages and getting the data

2.1.2 Parsing the content

2.1.3 Saving the data

2.2 Data visualization

2.2.1 The Flask framework

2.2.2 Home page and movie page (table)

2.2.3 Rating distribution chart with ECharts

2.2.4 jieba word segmentation and a WordCloud word cloud

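1. Project Overview

This project uses Python to crawl the Movie Top250 pages, parses them with BeautifulSoup and regular expressions, and stores the results in an Excel file and a SQLite database. The visualization side is a Flask application: ECharts draws the rating distribution chart, jieba segments the text, and WordCloud generates a word cloud of the films.

2. Module Implementation

2.1 Python crawler implementation

Technical overview:

1. Fetching pages and getting the data: the urllib library (urllib.request in Python 3) retrieves the data at a given URL.

2. Parsing the content: BeautifulSoup locates the relevant tags, and regular expressions pull out the specific values.

3. Saving the data: xlwt writes the extracted data to an Excel sheet, and sqlite3 writes it to a database.

2.1.1 Fetching pages and getting the data

The urllib library (urllib.request in Python 3) retrieves the data at a given URL.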

import urllib.request
import urllib.error

# Fetch the content of the page at the given URL
def askURL(url):
    head = {    # Mimic a browser's request headers when contacting the Douban server
        "User-Agent": "xxxx"
    }   # User-Agent tells the site's server what kind of machine and browser is making the request (in effect, what kind of content we can accept)

    req = urllib.request.Request(url=url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):       # hasattr(e, "code") checks whether e has a code attribute
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

    return html

# Crawl the pages
def getData(baseurl):
    datalist = []
    for i in range(0, 10):      # Call the page-fetching function 10 times, once per page of 25 films
        url = baseurl + str(i*25)
        html = askURL(url)      # Keep the page source that was fetched

        # 2. Parse the entries one by one (see section 2.1.2)

    return datalist
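
The paging logic in getData can be checked without touching the network. The sketch below only builds the URLs the loop would request. The base URL is an assumption (the article never states it), and page_urls is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed base URL; the article does not give it, so treat it as a placeholder.
BASE_URL = "https://movie.douban.com/top250?start="

def page_urls(baseurl, pages=10, page_size=25):
    """Return the page URLs that the getData loop would request."""
    return [baseurl + str(i * page_size) for i in range(pages)]

urls = page_urls(BASE_URL)

# Read the start offset back out of each URL to confirm the step of 25.
offsets = [int(parse_qs(urlsplit(u).query)["start"][0]) for u in urls]

print(urls[0])
print(urls[-1])
print(offsets)
```

Ten pages with offsets 0 through 225 cover all 250 entries.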

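2.1.2 Parsing the content

BeautifulSoup locates the relevant tags, and regular expressions pull out the specific values.

import re
from bs4 import BeautifulSoup

# Compile the regular expression objects that describe each pattern
findLink = re.compile(r'<a href="(.*?)">')      # Capture only what is inside the parentheses; the ? makes the match non-greedy, so it stops at the first ">
findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)   # re.S lets . match every character, including newlines
findTitle = re.compile(r'<span class="title">(.*)</span>')
findRating = re.compile(r'<span class="rating_num".*>(.*)</span>')
findJudgeNum = re.compile(r'<span>(\d*)人评价</span>')   # "人评价" is the literal page text that follows the number of ratings
findInq = re.compile(r'<span class="inq">(.*)</span>')
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)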
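
The difference between the greedy (.*) and non-greedy (.*?) captures, and the effect of re.S, is easiest to see on a small input. The fragments below are made up for illustration and are not taken from the real page.

```python
import re

# Hypothetical one-line fragment with two title spans.
sample = '<span class="title">A</span><span class="title"> / B</span>'

# Greedy: (.*) runs to the LAST </span> on the line, swallowing both spans.
greedy = re.findall(r'<span class="title">(.*)</span>', sample)

# Non-greedy: (.*?) stops at the FIRST </span>, giving one match per span.
lazy = re.findall(r'<span class="title">(.*?)</span>', sample)

# Hypothetical multi-line fragment, like the <p class=""> block.
multi = '<p class="">\n  Director: X<br/>\n  1994 / US\n</p>'

# Without re.S, . does not match newlines, so nothing is found.
without_s = re.findall(r'<p class="">(.*?)</p>', multi)

# With re.S, . also matches newlines and the whole block is captured.
with_s = re.findall(r'<p class="">(.*?)</p>', multi, re.S)

print(greedy)
print(lazy)
print(without_s)
print(len(with_s))
```

The greedy title and rating patterns above presumably work on the real page because each span sits on its own line there, and . stops at the line break when re.S is not set.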

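# 2. Parse the entries one by one (this block belongs inside the for loop of getData, after html = askURL(url))
soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all('div', class_="item"):    # Collect every matching tag into a list; class_ takes a trailing underscore because class is a Python keyword
    #print(item)        # Test: show the full markup of one film entry
    data = []       # Holds all the information for one film
    item = str(item)      # Convert to a string so the regular expressions can search it

    # Link to the film's detail page
    link = re.findall(findLink, item)[0]     # re.findall returns every substring that matches the pattern
    data.append(link)

    # Poster image
    imgSrc = re.findall(findImgSrc, item)[0]
    data.append(imgSrc)

    # Titles
    titles = re.findall(findTitle, item)     # A film may have only a Chinese title, or also one or more foreign titles
    if len(titles) >= 2:     # Even with several foreign titles, keep only the first
        ctitle = titles[0]      # Chinese title
        data.append(ctitle)
        otitle = titles[1].replace("/", "").strip()     # Foreign title, with the / and surrounding whitespace removed
        data.append(otitle)
    else:
        data.append(titles[0])
        data.append("")         # Leave the foreign title empty so the columns stay aligned

    # Rating
    rating = re.findall(findRating, item)[0]
    data.append(rating)

    # Number of ratings
    judgeNum = re.findall(findJudgeNum, item)[0]
    data.append(judgeNum)

    # One-line summary
    inq = re.findall(findInq, item)     # Some films have no summary, so indexing with [0] directly would raise an error
    if len(inq) != 0:
        inq = inq[0].replace("。", "")     # Drop the full-width period
        data.append(inq)
    else:
        data.append("")

    # Other details (director, cast, year, country, genre)
    bd = re.findall(findBd, item)[0]
    bd = re.sub(r'<br(\s+)?/>(\s+)?', " ", bd)        # Replace <br/> and any whitespace after it; \s matches spaces, tabs and newlines
    bd = re.sub('/', " ", bd)             # Replace each / with a space
    bd = bd.strip()                 # Trim leading and trailing whitespace
    data.append(bd)

    datalist.append(data)           # Add the finished record for this film to datalist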
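
Two branches in the parsing loop guard against missing fields: films with only one title, and films with no summary. The sketch below isolates both using only the standard library. The HTML fragments and the helper names parse_titles and parse_inq are made up for illustration.

```python
import re

find_title = re.compile(r'<span class="title">(.*)</span>')
find_inq = re.compile(r'<span class="inq">(.*)</span>')

def parse_titles(item_html):
    """Return (first_title, foreign_title); foreign_title is "" if absent."""
    titles = find_title.findall(item_html)
    first = titles[0]
    foreign = titles[1].replace("/", "").strip() if len(titles) >= 2 else ""
    return first, foreign

def parse_inq(item_html):
    """Return the one-line summary, or "" when the entry has none."""
    found = find_inq.findall(item_html)
    return found[0].replace("。", "") if found else ""

# Hypothetical entries: one with two titles and a summary, one with neither extra.
both = ('<span class="title">Film A</span>\n'
        '<span class="title"> / Film A Original</span>\n'
        '<span class="inq">A short tagline。</span>')
single = '<span class="title">Film B</span>'

print(parse_titles(both))
print(parse_titles(single))
print(repr(parse_inq(both)))
print(repr(parse_inq(single)))
```

Returning an empty string in both cases keeps every record at the same length, so later column-based storage stays aligned.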

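2.1.3 Saving the data

1. Saving to an Excel sheet

The xlwt library writes the extracted datalist into a spreadsheet.

import xlwt

# Save the data (Excel storage)
def saveData(datalist, savepath):
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)  # encoding: character encoding, so Chinese text can be written; style_compression: whether to compress styles, rarely needed
    sheet = book.add_sheet('Movie Top250', cell_overwrite_ok=True)    # cell_overwrite_ok: whether a cell may be overwritten; the default is False
    col = ("Detail link", "Image", "Chinese title", "Foreign title", "Rating", "Number of ratings", "Summary", "Other details")  # Header row
    for i in range(0, len(col)):
        sheet.write(0, i, col[i])     # Write the column names
    for i in range(0, len(datalist)):   # Iterate over what was actually collected rather than a fixed 250
        data = datalist[i]          # One film's record
        for j in range(0, len(col)):
            sheet.write(i+1, j, data[j])     # Row 0 is the header, hence i+1
    book.save(savepath)             # Save the workbook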


2. Saving to SQLite

The sqlite3 module is used in three steps: connect to the database, create the table, and insert the data.

import sqlite3                          # SQLite database operations

# Save the data (database storage)
def saveData2(datalist, savedb):
    conn = sqlite3.connect(savedb)
    cur = conn.cursor()
    # Create the table (this raises an error if movie250 already exists, so start from a fresh database file)
    sql1 = '''
            create table movie250
                (id integer PRIMARY KEY autoincrement,
                link text,
                imgSrc text,
                ctitle text,
                otitle text,
                rating real,
                judgeNum int,
                inq text,
                bd text);
        '''
    cur.execute(sql1)

    # Insert the data. The ? placeholders let sqlite3 do the quoting,
    # so text that contains quote characters cannot break the statement.
    sql2 = '''
            insert into movie250(id,link,imgSrc,ctitle,otitle,rating,judgeNum,inq,bd)
            values(?,?,?,?,?,?,?,?,?)
        '''
    for i, data in enumerate(datalist):
        cur.execute(sql2, [i+1] + list(data))
    conn.commit()                       # Commit once after all rows are inserted

    cur.close()
    conn.close()
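
The connect, create, insert sequence can be tried on an in-memory database. This is a minimal sketch under stated assumptions: the single record is made up, save_rows is a hypothetical helper, and the values are bound through ? placeholders so that a title containing a double quote is stored intact.

```python
import sqlite3

COLUMNS = ("link", "imgSrc", "ctitle", "otitle", "rating", "judgeNum", "inq", "bd")

def save_rows(conn, rows):
    """Create the movie250 table and insert rows using ? placeholders."""
    conn.execute(
        """
        create table movie250 (
            id integer primary key autoincrement,
            link text, imgSrc text, ctitle text, otitle text,
            rating real, judgeNum int, inq text, bd text
        )
        """
    )
    placeholders = ",".join("?" * len(COLUMNS))
    sql = f"insert into movie250 ({','.join(COLUMNS)}) values ({placeholders})"
    conn.executemany(sql, rows)
    conn.commit()

conn = sqlite3.connect(":memory:")   # throwaway database, nothing written to disk

# Made-up record; note the double quotes inside the title.
rows = [
    ["https://example.com/1", "https://example.com/1.jpg",
     'Film "A"', "", "9.7", "1000", "", "1994 US"],
]
save_rows(conn, rows)

stored = conn.execute(
    "select id, ctitle, rating, judgeNum from movie250"
).fetchall()
print(stored)
conn.close()
```

The crawler produces rating and judgeNum as strings. SQLite's column type affinity converts them to a real and an integer on insert, because the columns are declared real and int.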

2.2 Data visualization

2.2.1 The Flask framework

This project uses Flask as its web framework. Flask is built on Werkzeug and Jinja2: Werkzeug routes incoming requests to the right function, and Jinja2 renders the pages.

Creating a new Flask project generates two folders automatically:

1. static: holds CSS and JS files and other assets the pages need

2. templates: holds the HTML files that are returned to the user

Running the project serves a web page; run() listens for requests to it.

Two parts of the generated file are ours to customize. Werkzeug decides which function runs for a given path (marked with a red box in the original screenshot, which is not reproduced here), and Jinja2 produces the content that is returned (marked with a yellow box).
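
The division of labour between routing and rendering can be sketched in a few lines. This is a minimal illustration, not the project's actual file: the route, the hello function and the inline template are made up, and render_template_string stands in for a file under templates.

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Inline stand-in for an HTML file in the templates folder (hypothetical content).
PAGE = "<h1>Hello, {{ name }}!</h1>"

@app.route("/")        # Routing: the path "/" is mapped to hello()
def hello():
    # Rendering: Jinja2 fills the {{ name }} placeholder
    return render_template_string(PAGE, name="Top250")

# app.run() would start the development server and listen for requests.
# Here Flask's built-in test client sends one request instead.
client = app.test_client()
response = client.get("/")
print(response.status_code)
print(response.get_data(as_text=True))
```

In a real project the view would call render_template with a file name, and app.run() would sit under an if __name__ == "__main__": guard.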


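2.2.2 Home page and movie page (table)

Code for the home page and the movie list page:

import sqlite3
from flask import render_template

# app is the Flask instance created with Flask(__name__)

@app.route('/index')    # Home page
def home():
    return index()      # index() is assumed to be the view function for the site root, defined elsewhere in the app (not shown in this article)

@app.route('/movie')    # List page
def movie():
    datalist = []
    conn = sqlite3.connect("movie250.db")
    cur = conn.cursor()
    sql = "select * from movie250"
    data = cur.execute(sql)
    for item in data:       # Each item is one row: (id, link, imgSrc, ctitle, otitle, rating, judgeNum, inq, bd)
        datalist.append(item)
    cur.close()
    conn.close()
    return render_template("movie.html", movies=datalist)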

The table part of the movie page's HTML:

<table class="table table-striped">
    <tr>
        <td>Rank</td>
        <td>Chinese title</td>
        <td>Foreign title</td>
        <td>Rating</td>
        <td>Number of ratings</td>
        <td>One-line summary</td>
        <td>Other details</td>
    </tr>

    {% for movie in movies %}
    <tr>
        <td>{{ movie[0] }}</td>
        <td>
            <a href="{{ movie[1] }}" target="_blank">
            {{ movie[3] }}
            </a>
        </td>
        <td>{{ movie[4] }}</td>
        <td>{{ movie[5] }}</td>
        <td>{{ movie[6] }}</td>
        <td>{{ movie[7] }}</td>
        <td>{{ movie[8] }}</td>
    </tr>
    {% endfor %}

</table>

Result: screenshots of the home page and the movie page (images not preserved).

2.2.3 Rating distribution chart with ECharts

ECharts is a JavaScript charting library that provides intuitive, interactive and customizable charts. This project uses it to show the rating distribution of the Top250 films.

Code for the score page:

@app.route('/score')
def score():
    score = []      # Distinct ratings
    num = []        # Number of films with each rating
    conn = sqlite3.connect("movie250.db")
    cur = conn.cursor()
    sql = "select rating,count(rating) from movie250 group by rating"
    data = cur.execute(sql)
    for item in data:
        score.append(item[0])
        num.append(item[1])
    cur.close()
    conn.close()
    return render_template("score.html", score=score, num=num)
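
The group by query is what turns 250 rows into the two parallel lists the chart needs. The sketch below runs the same kind of query on an in-memory table with made-up ratings. The order by clause is an addition here, to make the bar order explicit rather than rely on the database's default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table movie250 (id integer primary key, rating real)")

# Made-up ratings: 9.6 appears three times, the others once.
conn.executemany(
    "insert into movie250 (rating) values (?)",
    [(9.7,), (9.6,), (9.6,), (9.5,), (9.6,)],
)

rows = conn.execute(
    "select rating, count(rating) from movie250 group by rating order by rating"
).fetchall()
conn.close()

score = [r[0] for r in rows]   # x axis: distinct ratings
num = [r[1] for r in rows]     # y axis: number of films per rating

print(score)
print(num)
```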

The ECharts part of the HTML file:

<!-- Prepare a DOM element with a defined width and height for ECharts -->
<div id="main" style="width: 100%;height:350px;"></div>

<script type="text/javascript">
  // Initialize the echarts instance on the prepared DOM element
  var myChart = echarts.init(document.getElementById('main'));

  // Chart options and data
  var option = {
    tooltip: {
        trigger: 'axis',
        axisPointer: {
          type: 'shadow'
        }
      },
    color:['#3398DB'],
    grid: {
        left: 100,
        right: 50,
        top: 10
      },
    xAxis: {
        type: 'category',
        data: {{ score }}
        // sample data from the ECharts demo: ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    },
    yAxis: {
        type: 'value'
    },
    series: [{
        data: {{ num }},
        // sample data from the ECharts demo: [120, 200, 150, 80, 70, 110, 130]
        type: 'bar',
        barWidth:'50'
    }]
  };

  // Display the chart using the options and data specified above
  myChart.setOption(option);
</script>
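
The template drops the Python lists straight into the JavaScript source. For lists of numbers this works because the text Python produces for the list is the same text JSON would use. The sketch below checks that on made-up values and shows json.dumps as the explicit way to do the conversion.

```python
import json

# Made-up values standing in for the lists passed to the template.
score = [8.3, 8.4, 9.7]
num = [1, 4, 2]

# Explicit serialization to a JavaScript-compatible literal.
score_js = json.dumps(score)
num_js = json.dumps(num)

# For plain numbers, Python's own list text is identical to the JSON text.
same_text = (str(score) == score_js) and (str(num) == num_js)

print(score_js)
print(num_js)
print(same_text)
```

If the lists ever contained strings, the two forms would differ and the template's autoescaping would also interfere. Flask's tojson template filter is the usual way to handle that case.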

Result: bar chart of the rating distribution (image not preserved).

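2.2.4 jieba word segmentation and a WordCloud word cloud

This step needs the jieba package (which splits a sentence into words), the matplotlib plotting package and the wordcloud package. The code also imports Pillow (PIL) and numpy.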

import sqlite3              # Database
import jieba                # Word segmentation
from matplotlib import pyplot as plt       # Plotting
from wordcloud import WordCloud            # Word cloud
from PIL import Image       # Image handling
import numpy as np          # Array operations

# Collect the text for the word cloud
conn = sqlite3.connect('movie250.db')
cur = conn.cursor()
sql = 'select inq from movie250'
data = cur.execute(sql)
text = ""
for item in data:
    text = text + item[0]
#print(text)
cur.close()
conn.close()

# Segment the text into words
cut = jieba.cut(text)
words = " ".join(cut)       # Named words rather than str, so the built-in str is not shadowed
print(len(words))

# Build the mask image
img = Image.open(r'.\static\assets\img\tree.jpg')   # Open the mask image
img_array = np.array(img)       # Convert the image to an array
wc = WordCloud(                 # Configure the WordCloud object
    background_color='white',
    mask=img_array,
    font_path="SourceHanSansCN-Bold.otf",    # A font that covers Chinese; on Windows, fonts live in C:\Windows\Fonts
    min_word_length=2,          # Minimum number of characters a word must have
    stopwords={"就是","一个","不是","这样","一部","我们","没有","电影","不会","不能","每个"}      # Words to exclude
)
wc.generate_from_text(words)    # Generate the word cloud from the segmented text

# Draw the image
fig = plt.figure(1)     # Create the figure
plt.imshow(wc)          # Display the word cloud
plt.axis('off')         # Hide the axes

#plt.show()              # Show the generated word cloud

# Write the word cloud image to a file
plt.savefig(r'.\static\assets\img\word.jpg', dpi=1000)
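
A word cloud sizes each word by how often it occurs, after filtering. The sketch below shows that filtering and counting idea with the standard library only. It is a simplified illustration, not WordCloud's actual implementation, and the sample sentence and the count_words helper are made up.

```python
from collections import Counter

def count_words(spaced_text, stopwords, min_word_length=2):
    """Count words in space-separated text, skipping short words and stop words."""
    kept = [
        w for w in spaced_text.split()
        if len(w) >= min_word_length and w not in stopwords
    ]
    return Counter(kept)

# Made-up text, already separated by spaces as after " ".join(jieba.cut(...)).
sample = "hope is a good thing hope never dies a film about hope"
freq = count_words(sample, stopwords={"film", "about"})

print(freq.most_common(1))
```

The word with the highest count would be drawn largest. One-character words and the stop words never reach the counter, which is the role min_word_length and stopwords play in the configuration above.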

Cropping:

Cropping the saved image gives a better visual result (the before and after images are not preserved here).

# Crop the image
from PIL import Image

base_img = Image.open(r'.\static\assets\img\word.jpg')
w, h = base_img.size                 # Width and height of the image
box = (int(0.1*w), int(0.1*h), int(0.9*w), int(0.9*h))     # (left, upper, right, lower): the top-left corner and the bottom-right corner of the region to keep
base_img.crop(box).save(r'.\static\assets\img\word2.jpg')
#base_img.crop(box).show()
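
The crop box is given as two corner points, not as a position plus a size. The arithmetic can be checked on its own, without opening an image. center_crop_box is a hypothetical helper and the 1000 by 500 size is made up.

```python
def center_crop_box(width, height, margin=0.1):
    """Return (left, upper, right, lower) that trims `margin` off every side."""
    left = round(width * margin)
    upper = round(height * margin)
    right = round(width * (1 - margin))
    lower = round(height * (1 - margin))
    return (left, upper, right, lower)

box = center_crop_box(1000, 500)
cropped_size = (box[2] - box[0], box[3] - box[1])

print(box)
print(cropped_size)
```

Trimming 10% from each side leaves 80% of the width and height, so a 1000 by 500 image becomes 800 by 400.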
