0 Introduction
Today I would like to introduce a big data graduation project:
A big-data-based movie data analysis and visualization system
Project demo (video):
Graduation project: big data movie review sentiment analysis
Get the project:
https://gitee.com/assistant-a/project-sharing
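1 Background
Studying movie data from Chinese users helps reveal the patterns behind the development of China's film market, explain how it got here, and anticipate where it is heading. Public movie datasets about Chinese users are currently scarce online: no independent organization comparable to MovieLens or Kaggle collects such data over the long term, so researchers must either collect data themselves or download public datasets from abroad, which lack local characteristics.
This project crawls movie information from Douban and builds a database from it. Visual analysis on top of this database extracts the information behind the data: it examines movies along several dimensions (release date, audience distribution, genre share, and market conditions by country) and mines patterns for individual movies through comment word clouds and text sentiment.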
2 Results
Sentiment score of comments over time:
List of hot comments:
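The sentiment-over-time view can be approximated by averaging a per-comment score for each day. The sketch below is only an illustration under stated assumptions: the function names, the timestamp format, and the toy scorer are hypothetical and not taken from the project. With SnowNLP installed, a scorer such as `lambda text: SnowNLP(text).sentiments` could be passed in instead.

```python
from collections import defaultdict
from datetime import datetime
from typing import Callable, Dict, Iterable, Tuple


def daily_sentiment(
    comments: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Average the sentiment score of comments per calendar day.

    comments: (timestamp, text) pairs, timestamp as 'YYYY-MM-DD HH:MM:SS'
              (an assumed format).
    score:    any function mapping text to a score in [0, 1].
    """
    buckets = defaultdict(list)
    for timestamp, text in comments:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date().isoformat()
        buckets[day].append(score(text))
    # Sort by day so the result can be plotted directly as a time series.
    return {day: sum(v) / len(v) for day, v in sorted(buckets.items())}


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer: 1.0 if the text contains 'good', else 0.0."""
    return 1.0 if "good" in text else 0.0


sample = [
    ("2020-01-02 09:00:00", "good film"),
    ("2020-01-01 10:00:00", "good story"),
    ("2020-01-01 12:30:00", "boring"),
]
print(daily_sentiment(sample, toy_score))
```

Keeping the scorer as a parameter makes the aggregation easy to check without a trained sentiment model.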
3 Crawler and implementation
Overview
A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules. The crawler visits a site; if the page is reachable, it downloads the page content, extracts the links with its parsing module, and queues those links as later fetch targets. The whole process runs automatically, without user intervention. If a page is not reachable, the crawler moves on to the next URL according to its preset policy. Throughout, the crawler handles data requests asynchronously and returns the fetched page data. Before starting the crawler, the user can add proxies and spoof request headers to improve the chances of retrieving page data.
The crawler flowchart is as follows:
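The flow described above (fetch a page, parse out its links, queue them, skip unreachable URLs) can be sketched as follows. This is a simplified, synchronous illustration rather than the project's crawler: `LinkParser`, `crawl`, `FAKE_SITE`, and the `example.test` URLs are hypothetical names, and an in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(
    seed: str,
    fetch: Callable[[str], Optional[str]],
    max_pages: int = 10,
) -> Dict[str, str]:
    """Breadth-first crawl; fetch(url) returns HTML, or None if unreachable."""
    frontier = deque([seed])
    seen = {seed}
    pages: Dict[str, str] = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: move on to the next URL
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# A tiny in-memory "site" stands in for real HTTP requests.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/missing">gone</a>',
    "http://example.test/a": '<a href="/">home</a>',
}
result = crawl("http://example.test/", FAKE_SITE.get)
print(sorted(result))
```

Passing `fetch` as a parameter keeps the loop independent of the HTTP client, so a function that adds proxies or custom request headers can be swapped in without changing the traversal.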
Partial code
import re
import requests
import json
import time
from openpyxl import load_workbook, Workbook
from requests import RequestException
def get_detail_page(html, retries=3):
    """Fetch a URL and return the page text, or None on failure."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
    }
    cookies = {}
    try:
        response = requests.get(url=html, headers=headers, cookies=cookies, timeout=10)
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to fetch the page')
        if retries <= 0:
            return None
        time.sleep(3)
        # Retry a bounded number of times instead of recursing forever
        return get_detail_page(html, retries - 1)
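def parse_index_page(html):
    """Return the detail-page URLs listed on one index (search) page."""
    html = get_detail_page(html)
    if html is None:
        return []
    # The response is a JSON object of the form {"subjects": [...]}
    data = json.loads(html).get('subjects', [])
    return [item['url'] for item in data]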
def parse_detail_page(data):
    """Extract the fields of one movie from its detail page."""
    html = get_detail_page(data)
    info = []
    # Movie title
    name_pattern = re.compile(r'<span property="v:itemreviewed">(.*?)</span>')
    name = re.findall(name_pattern, html)
    info.append(name[0])
    # Rating
    score_pattern = re.compile(r'rating_num" property="v:average">(.*?)</strong>')
    score = re.findall(score_pattern, html)
    info.append(score[0])
    # Director (first one listed)
    director_pattern = re.compile(r'rel="v:directedBy">(.*?)</a>')
    director = re.findall(director_pattern, html)
    info.append(str(director[0]))
    # Actor (first one listed)
    actor_pattern = re.compile(r'rel="v:starring">(.*?)</a>')
    actor = re.findall(actor_pattern, html)
    info.append(str(actor[0]))
    # Year
    year_pattern = re.compile(r'<span class="year">\((.*?)\)</span>')
    year = re.findall(year_pattern, html)
    info.append(year[0])
    # Genre (named genre to avoid shadowing the built-in type)
    genre_pattern = re.compile(r'property="v:genre">(.*?)</span>')
    genre = re.findall(genre_pattern, html)
    info.append(genre[0].split(' /')[0])
    # Runtime (named runtime to avoid shadowing the time module;
    # falls back to '1' when the page has no runtime)
    try:
        runtime_pattern = re.compile(r'property="v:runtime" content="(.*?)"')
        runtime = re.findall(runtime_pattern, html)
        info.append(runtime[0])
    except IndexError:
        info.append('1')
    # Language
    language_pattern = re.compile(r'pl">语言:</span>(.*?)<br/>')
    language = re.findall(language_pattern, html)
    info.append(language[0].split(' /')[0])
    # Number of ratings
    comment_pattern = re.compile(r'property="v:votes">(.*?)</span>')
    comment = re.findall(comment_pattern, html)
    info.append(comment[0])
    # Country/region of production
    area_pattern = re.compile(r' class="pl">制片国家/地区:</span>(.*?)<br/>')
    area = re.findall(area_pattern, html)
    info.append(area[0].split(' /')[0])
    return info
html = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E5%86%B7%E9%97%A8%E4%BD%B3%E7%89%87&sort=rank&page_limit=20&page_start='
wc = Workbook()
ws = wc.active
ws.title = "New"
# Header row
ws.append(['name', 'score', 'director', 'actor', 'year', 'type', 'time', 'language', 'comment', 'area'])
wc.save('豆瓣电影.xlsx')
for i in range(20, 50):
    print(i)
    # page_start advances in steps of 20, matching page_limit=20
    html1 = html + str(i * 20)
    u = parse_index_page(html1)
    print(u)
    for t in u:
        time.sleep(0.5)  # throttle requests
        b = parse_detail_page(t)
        print(b)
        ws.append(b)
        # Save after every row so a crash does not lose collected data
        wc.save('豆瓣电影.xlsx')
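Each field in the code above is read as the first element of a `re.findall` result, which raises `IndexError` when a page lacks that field. A small helper that returns a default instead is one possible way to make the extraction more tolerant. The helper name `first_match` and the sample HTML are hypothetical, shown only to illustrate the idea.

```python
import re
from typing import Optional


def first_match(pattern: str, html: str, default: Optional[str] = None) -> Optional[str]:
    """Return the first capture group of the first match, or default if none."""
    match = re.search(pattern, html)
    return match.group(1).strip() if match else default


# Hypothetical snippet shaped like the fields the crawler looks for.
page = (
    '<span property="v:itemreviewed">Example Movie</span>'
    '<span class="year">(2019)</span>'
)
name = first_match(r'<span property="v:itemreviewed">(.*?)</span>', page)
year = first_match(r'<span class="year">\((.*?)\)</span>', page)
# The runtime field is absent from the snippet, so the default is used.
runtime = first_match(r'property="v:runtime" content="(.*?)"', page, default="1")
print(name, year, runtime)
```

With such a helper, the `try`/`except` around the runtime field would no longer be needed, and the same fallback could be applied to any other optional field.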
4 Flask framework
Overview
Flask is a lightweight web application framework built on Werkzeug and Jinja2. Compared with other frameworks of its kind, Flask is more flexible and lightweight, is easy to pick up, and fits well with MVC-style development. Flask is also highly customizable: developers can add features as needed, gaining rich functionality and extensions while the core stays simple. Its large set of extensions lets users tailor a site to their needs and build full-featured websites.
Flask project structure diagram
Partial related code
import sqlite3

from flask import Flask, render_template, jsonify
import requests
from bs4 import BeautifulSoup
from snownlp import SnowNLP
import jieba
import numpy as np

app = Flask(__name__)
app.config.from_object('config')
# Chinese stopwords
with open('./stopwords.txt', encoding='utf8') as f:
    STOPWORDS = {line.strip() for line in f}
headers = {
'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
'accept-language': "en-US,en;q=0.9,zh-CN;q=0.8,zh-TW;q=0.7,zh;q=0.6",
'cookie': 'll="108296"; bid=ieDyF9S_Pvo; __utma=30149280.1219785301.1576592769.1576592769.1576592769.1; __utmc=30149280; __utmz=30149280.1576592769.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _vwo_uuid_v2=DF618B52A6E9245858190AA370A98D7E4|0b4d39fcf413bf2c3e364ddad81e6a76; ct=y; dbcl2="40219042:K/CjqllYI3Y"; ck=FsDX; push_noty_num=0; push_doumail_num=0; douban-fav-remind=1; ap_v=0,6.0',
'host': "search.douban.com",
'referer': "https://movie.douban.com/",
'sec-fetch-mode': "navigate",
'sec-fetch-site': "same-site",
'sec-fetch-user': "?1",
'upgrade-insecure-requests': "1",
'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36 Edg/79.0.309.56"
}
login_name = None
# --------------------- html render ---------------------
@app.route('/')
def index():
    return render_template('index.html')

@app.route('/search')
def search():
    return render_template('search.html')

@app.route('/search/<movie_name>')
def search2(movie_name):
    return render_template('search.html')

@app.route('/hot_movie')
def hot_movie():
    return render_template('hot_movie.html')

@app.route('/movie_category')
def movie_category():
    return render_template('movie_category.html')

# ------------------ ajax restful api -------------------
@app.route('/check_login')
def check_login():
    """Check whether a user is logged in."""
    return jsonify({'username': login_name, 'login': login_name is not None})
@app.route('/register/<name>/<pasw>')
def register(name, pasw):
    conn = sqlite3.connect('user_info.db')
    cursor = conn.cursor()
    # Create the user table on first use
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS user(
            name CHAR(256),
            pasw CHAR(256)
        );
    """)
    conn.commit()
    # Parameterized insert; note that the password is stored as plain text
    sql = "INSERT INTO user (name, pasw) VALUES (?,?);"
    cursor.execute(sql, (name, pasw))
    conn.commit()
    conn.close()
    return jsonify({'info': 'User registered successfully!', 'status': 'ok'})
@app.route('/login/<name>/<pasw>')
def login(name, pasw):
    global login_name
    conn = sqlite3.connect('user_info.db')
    cursor = conn.cursor()
    # Create the user table on first use
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS user(
            name CHAR(256),
            pasw CHAR(256)
        );
    """)
    conn.commit()
    # Parameterized query: never build SQL from user input with str.format
    sql = "SELECT * FROM user WHERE name=? AND pasw=?"
    cursor.execute(sql, (name, pasw))
    results = cursor.fetchall()
    conn.close()
    if len(results) > 0:
        # Record the login only after the credentials have been verified
        login_name = name
        return jsonify({'info': name + ' logged in successfully!', 'status': 'ok'})
    else:
        return jsonify({'info': 'User does not exist!', 'status': 'error'})
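The routes above store and compare passwords as plain text. As a hedged sketch of one common alternative, the example below stores a salted PBKDF2 hash instead, using only the standard library. The table layout, the function names, and the iteration count are assumptions made for illustration and are not part of the project.

```python
import hashlib
import hmac
import os
import sqlite3


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a key from the password with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def register_user(conn: sqlite3.Connection, name: str, password: str) -> None:
    """Store the user with a random salt and the derived hash."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user (name TEXT PRIMARY KEY, salt BLOB, pw_hash BLOB)"
    )
    salt = os.urandom(16)
    conn.execute(
        "INSERT INTO user (name, salt, pw_hash) VALUES (?, ?, ?)",
        (name, salt, hash_password(password, salt)),
    )
    conn.commit()


def check_user(conn: sqlite3.Connection, name: str, password: str) -> bool:
    """Return True only if the user exists and the password matches."""
    row = conn.execute(
        "SELECT salt, pw_hash FROM user WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return False
    salt, stored = row
    # Constant-time comparison of the stored and recomputed hashes.
    return hmac.compare_digest(stored, hash_password(password, salt))


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
register_user(conn, "alice", "s3cret")
print(check_user(conn, "alice", "s3cret"))
print(check_user(conn, "alice", "wrong"))
print(check_user(conn, "bob", "s3cret"))
```

In a real deployment the credentials would also be sent in a POST body rather than in the URL path, since URLs tend to end up in server logs and browser history.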
5 Ajax
Ajax is a browser technology that is independent of the web server software.
Ajax uses JavaScript to send requests to the server and handle the responses without blocking the user; its core object is XMLHttpRequest. Through this object, JavaScript can exchange data with the web server without reloading the page, so part of the page can be refreshed without a full page refresh.
The front end converts the required parameters into a JSON string and sends a request to the server via GET or POST, passing the parameters to the back end. The back end receives the data, uses it as query conditions, and returns the result set to the front end as a JSON string. The front end then checks the returned data and updates the page accordingly.
$.ajax({
    url: 'http://127.0.0.1:5000/updatePass',
    type: "POST",
    data: JSON.stringify(data.field),
    contentType: "application/json; charset=utf-8",
    dataType: "json",
    success: function(res) {
        if (res.code == 200) {
            layer.msg(res.msg, {icon: 1});
        } else {
            layer.msg(res.msg, {icon: 2});
        }
    }
})
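On the server side, the handler for such a request has to parse the JSON body and answer with a JSON object carrying the `code` and `msg` fields that the success callback reads. The sketch below shows only that contract in plain Python. The function name `handle_update_pass` and the `new_pass` field are assumptions, since the original does not show the `/updatePass` handler or the form fields.

```python
import json


def handle_update_pass(body: str) -> str:
    """Parse a JSON request body and return a JSON response string."""
    try:
        fields = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"code": 400, "msg": "invalid JSON"})
    # 'new_pass' is a hypothetical field name used for illustration.
    if not isinstance(fields, dict) or not fields.get("new_pass"):
        return json.dumps({"code": 400, "msg": "new_pass is required"})
    # A real handler would verify the user and update the stored password here.
    return json.dumps({"code": 200, "msg": "password updated"})


print(handle_update_pass('{"new_pass": "abc123"}'))
print(handle_update_pass("{}"))
print(handle_update_pass("not json"))
```

In a Flask route the same logic would typically read the body with `request.get_json()` and return the dictionary through `jsonify`.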
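6 ECharts
ECharts (Enterprise Charts) is an open-source data visualization tool from Baidu, built on the lightweight Canvas library ZRender. It is compatible with almost all commonly used browsers, so it can be used widely on both PC and mobile clients. ECharts helps developers integrate user data and build customized visual charts. It supports line (area) charts, bar (column) charts, scatter (bubble) charts, candlestick (K-line) charts, pie (donut) charts, and more, and it runs in a web project, such as a Java Web project, once its JS library is included.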
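An ECharts chart is described by an option object, so a Python back end can build that object as a dictionary and return it as JSON for the page to render. The sketch below assembles a basic line-chart option. The helper name `line_chart_option` and the sample values are hypothetical, and only a few basic option keys are used.

```python
import json
from typing import Dict, List


def line_chart_option(title: str, points: Dict[str, float]) -> dict:
    """Build a minimal ECharts option for a line chart from {label: value}."""
    labels: List[str] = list(points)
    return {
        "title": {"text": title},
        "tooltip": {"trigger": "axis"},
        "xAxis": {"type": "category", "data": labels},
        "yAxis": {"type": "value"},
        "series": [{"type": "line", "data": [points[k] for k in labels]}],
    }


# Made-up sample values, for illustration only.
option = line_chart_option(
    "Sentiment over time",
    {"2020-01-01": 0.5, "2020-01-02": 1.0},
)
print(json.dumps(option, ensure_ascii=False))
```

On the page, the object received from an Ajax call would be passed to `setOption` on a chart created with `echarts.init`.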
7 Finally
Project sharing:
https://gitee.com/assistant-a/project-sharing