Scraping Douban Top250 Book Data
Project implementation steps
1. Project structure
2. Fetch the page data
3. Extract the key information from the page
4. Save the data
1. Project structure
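The original post does not spell the layout out here, so the sketch below is inferred from the import statements later in the article. The module and file names, including the spellings `geturlcocument` and `get_single_docuemnt`, are copied verbatim from those imports; the name of the final save script is an assumption.

```
project/
└── geturlcocument/
    ├── get_document.py          # step 2: fetch and parse a page (defines get_page)
    ├── get_single_docuemnt.py   # step 3: extract book fields (defines get_single)
    └── save_data.py             # step 4: save to CSV (file name assumed)
```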
2. Fetch the page data
The target URL is https://book.douban.com/top250
import requests
from bs4 import BeautifulSoup


def get_page(url):
    """Fetch the page, parse it with BeautifulSoup, and return the soup."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 '
                      'Mobile Safari/537.36 Edg/114.0.1823.43'
    }
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(resp.text, 'html.parser')
    return soup
3. Extract the key information from the page
Take the parsed soup returned above and pull out each book's cover image, title, author, price, rating, and one-line summary.
from geturlcocument.get_document import get_page
import re

# Accumulators for the extracted fields
pictures = []
names = []
authors = []
prices = []
scores = []
sums = []


def get_single():
    # The ten list pages: start=0, 25, ..., 225
    urls = [f"https://book.douban.com/top250?start={num}" for num in range(0, 250, 25)]
    for url in urls:
        # Fetch and parse the page (get_page is the function imported above)
        text = get_page(url)
        # Each book sits in a <tr class="item"> row
        all_tr = text.find_all(name="tr", attrs={"class": "item"})
        for tr in all_tr:
            # Fields: picture, name, author, price, score, summary
            # Cover image
            picture = tr.find(name="img").get('src')
            # Title (collapse internal whitespace)
            div = tr.find(name='div', attrs={'class': 'pl2'})
            name = re.sub(r'\s+', '', div.find('a').text)
            # The info line has the form "author / publisher / date / price"
            info = tr.find(name='p', attrs={'class': 'pl'}).text
            author = info.split('/')[0].strip()
            # Price is the last field; strip the trailing "元"
            price = re.sub(r'元', '', info.split('/')[-1]).strip()
            # Rating
            score = tr.find(name='span', attrs={'class': 'rating_nums'}).text
            # Some books have no one-line quote; find() then returns None
            # and .text raises AttributeError
            try:
                summary = tr.find(name='span', attrs={'class': 'inq'}).text
            except AttributeError:
                summary = ''
            pictures.append(picture)
            names.append(name)
            authors.append(author)
            prices.append(price)
            scores.append(score)
            sums.append(summary)
    data = {
        "picture": pictures,
        "name": names,
        "author": authors,
        "price": prices,
        "score": scores,
        "sum": sums
    }
    return data
The extracted values are collected into a dictionary and returned. The re library cleans up the strings, and a try/except guards against books that have no one-line quote.
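The exception handling deserves a closer look. When a row has no `span.inq` tag, `find()` returns `None`, so the `.text` access raises `AttributeError`; the code catches that and falls back to an empty string. A minimal, self-contained sketch of the pattern (sample HTML invented for illustration):

```python
from bs4 import BeautifulSoup

# A row with no <span class="inq"> quote (invented sample)
soup = BeautifulSoup('<tr class="item"><td>no quote here</td></tr>', "html.parser")
tr = soup.find("tr")

# find() returns None for a missing tag, so .text raises AttributeError;
# catching it lets the scraper continue with an empty summary
try:
    summary = tr.find("span", attrs={"class": "inq"}).text
except AttributeError:
    summary = ''

print(repr(summary))  # ''
```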
4. Save the data
Take the returned dictionary and load it into a pandas DataFrame.
from geturlcocument.get_single_docuemnt import get_single
import pandas as pd

# Get the dictionary of scraped data (get_single is the function imported above)
data = get_single()
# Store it in a pandas DataFrame and write it out as CSV
df = pd.DataFrame(data)
df.to_csv('./books.csv', index=False, encoding='utf-8')
print('data saved')
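To sanity-check the output file, the round trip can be verified by reading the CSV back. The sketch below builds a one-row dict of lists with the same keys `get_single()` returns (the values are invented), saves it the same way, and reloads it:

```python
import pandas as pd

# A tiny dict-of-lists shaped like get_single()'s return value (invented data)
data = {
    "picture": ["https://img.example.com/a.jpg"],
    "name": ["紅樓夢(mèng)"],
    "author": ["曹雪芹 著, 高鶚 續(xù)"],
    "price": ["59.70"],
    "score": ["9.6"],
    "sum": ["誰(shuí)解其中味？"],
}
pd.DataFrame(data).to_csv("./books_demo.csv", index=False, encoding="utf-8")

# Read it back: one row, six columns
df = pd.read_csv("./books_demo.csv")
print(df.shape)  # (1, 6)
```

Passing `index=False` keeps pandas from writing the row index as an extra unnamed column, which would otherwise reappear on every `read_csv`.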
The project is complete!
到了這里,關(guān)于爬取豆瓣Top250圖書(shū)數(shù)據(jù)的文章就介紹完了。如果您還想了解更多內(nèi)容,請(qǐng)?jiān)谟疑辖撬阉鱐OY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章,希望大家以后多多支持TOY模板網(wǎng)!