Today we continue our hands-on scraping series. Besides the usual web-page scraping, we will introduce a brand-new download feature: our main task is to crawl novel content and save it locally, so that it can be read offline later.
To keep things from getting confusing as the features pile up, I have drawn a feature architecture diagram for you, shown below:
Now let's dig into today's protagonist: the novel site.
Parsing the Novel Site
Fetching the Book List
From the site's recommendation lists we only need to parse one recommended section, rather than reproduce the whole page, which lets us extract the information we actually need more efficiently.
Here is a sample snippet to help you follow along:
# Imports this snippet needs
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
req = Request("https://www.readnovel.com/", headers=headers)
# Send the request and fetch the HTML
html = urlopen(req)
# The response body is bytes; decode it into a string
html_text = html.read().decode('utf-8')
soup = bf(html_text, 'html.parser')
for li in soup.select('#new-book-list li'):
    a_tag = li.select_one('a[data-eid="qd_F24"]')
    p_tag = li.select_one('p')
    book = {
        'href': a_tag['href'],
        'title': a_tag.get('title'),
        'content': p_tag.get_text()
    }
    print(book)
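To see what the selector logic above produces without hitting the network, here is a self-contained sketch run against a miniature HTML fragment. The fragment itself is my assumption, modeled on the page structure the selectors imply, not the site's real markup:

```python
from bs4 import BeautifulSoup as bf

# A minimal stand-in for the recommendation list markup (assumed structure)
sample = '''
<ul id="new-book-list">
  <li><a data-eid="qd_F24" href="/book/1" title="Book One"></a><p>Intro one</p></li>
  <li><a data-eid="qd_F24" href="/book/2" title="Book Two"></a><p>Intro two</p></li>
</ul>
'''

soup = bf(sample, 'html.parser')
books = []
for li in soup.select('#new-book-list li'):
    a_tag = li.select_one('a[data-eid="qd_F24"]')
    books.append({'href': a_tag['href'],
                  'title': a_tag.get('title'),
                  'content': li.select_one('p').get_text()})
print(books)
```

Running this prints a list of two book dictionaries, mirroring what the real page would yield for each `li` in the list.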
Book Synopsis
Normally we look at the book list first and then get a rough idea of what a book is about, so we can parse the relevant content directly. Here is a sample snippet:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
free_trial_link = []
# `link` is a book href collected in the previous step, e.g. book['href']
req = Request(f"https://www.readnovel.com{link}#Catalog", headers=headers)
# Send the request and fetch the HTML
html = urlopen(req)
# The response body is bytes; decode it into a string
html_text = html.read().decode('utf-8')
soup = bf(html_text, 'html.parser')
og_title = soup.find('meta', property='og:title')['content']
og_description = soup.find('meta', property='og:description')['content']
og_novel_author = soup.find('meta', property='og:novel:author')['content']
og_novel_update_time = soup.find('meta', property='og:novel:update_time')['content']
og_novel_status = soup.find('meta', property='og:novel:status')['content']
og_novel_latest_chapter_name = soup.find('meta', property='og:novel:latest_chapter_name')['content']
# Collect the chapter links from the "free trial" catalog
div_tag = soup.find('div', id='j-catalogWrap')
list_items = div_tag.find_all('li', attrs={'data-rid': True})
for li in list_items:
    link_text = li.find('a').text
    if '第' in link_text:  # keep only numbered chapters ('第' prefixes chapter numbers on this site)
        link_url = li.find('a')['href']
        link_obj = {'link_text': link_text,
                    'link_url': link_url}
        free_trial_link.append(link_obj)
print(f"Title: {og_title}")
print(f"Synopsis: {og_description}")
print(f"Author: {og_novel_author}")
print(f"Last updated: {og_novel_update_time}")
print(f"Status: {og_novel_status}")
print(f"Latest chapter: {og_novel_latest_chapter_name}")
Notice that while parsing we obtained not only the book's synopsis but also its chapter catalog along the way. Saving that catalog makes later trial reading convenient: once a book catches our interest we will probably read it next, and if we really like it we may add it to our list. To avoid parsing the page again at reading time, we save the catalog here.
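As one way to persist the parsed catalog, here is a minimal sketch that caches `free_trial_link` as JSON. The file name `catalog.json` is my assumption; the original code only keeps the list in memory:

```python
import json

def save_catalog(free_trial_link, path='catalog.json'):
    # Persist the parsed chapter list so it can be reused without re-parsing the page
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(free_trial_link, f, ensure_ascii=False, indent=2)

def load_catalog(path='catalog.json'):
    # Load a previously saved chapter list
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)
```

With this in place, a later reading session can call `load_catalog()` instead of fetching and parsing the catalog page again.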
Free Trial Reading
In this step the main task is to parse the chapter name and chapter content and print them, preparing to wrap them into methods for downloading or reading later. This keeps the data well organized and makes the code more reusable and maintainable. Here is a sample snippet showing how:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
# `link` is a chapter URL from the saved catalog, e.g. free_trial_link[0]['link_url']
req = Request(f"https://www.readnovel.com{link}", headers=headers)
# Send the request and fetch the HTML
html = urlopen(req)
# The response body is bytes; decode it into a string
html_text = html.read().decode('utf-8')
soup = bf(html_text, 'html.parser')
name = soup.find('h1', class_='j_chapterName')
chapter = {
    'name': name.get_text()
}
print(name.get_text())
ywskythunderfont = soup.find('div', class_='ywskythunderfont')
if ywskythunderfont:
    p_tags = ywskythunderfont.find_all('p')
    chapter['text'] = p_tags[0].get_text()  # the chapter body sits in the first <p>
print(chapter)
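The snippet above keeps only the first `<p>` of the chapter container. If a chapter spreads its body across several `<p>` tags, joining them all is a small variation. Here is an offline sketch of that variation, run against an assumed HTML fragment with the same class names:

```python
from bs4 import BeautifulSoup as bf

# A minimal stand-in for a chapter page (assumed structure)
sample = ('<h1 class="j_chapterName">Chapter 1</h1>'
          '<div class="ywskythunderfont"><p>First paragraph.</p>'
          '<p>Second paragraph.</p></div>')

soup = bf(sample, 'html.parser')
chapter = {'name': soup.find('h1', class_='j_chapterName').get_text()}
div = soup.find('div', class_='ywskythunderfont')
if div:
    # Join every paragraph instead of keeping only the first one
    chapter['text'] = '\n'.join(p.get_text() for p in div.find_all('p'))
print(chapter)
```

Whether the single-paragraph or joined version is right depends on how the site actually lays out chapter text, so treat this as an option to test against real pages.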
Downloading the Novel
Once parsing is done we already have the chapter content in hand, so all that is left is the download step. In case you have forgotten how file writing works, here is a quick refresher:
file_name = 'a.txt'
with open(file_name, 'w', encoding='utf-8') as file:
    file.write('download test')
print(f'File {file_name} saved!')
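One practical wrinkle once the file name comes from a chapter title: characters such as `?`, `:`, or `/` are not legal in file names on Windows. Here is a hedged sketch of a sanitizer; the exact character set below is my assumption, not part of the original code:

```python
import re

def safe_file_name(name, ext='.txt'):
    # Replace characters that are invalid in Windows file names with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip() + ext
```

A helper like this could be applied to `chapter['name']` before opening the file, so a title containing punctuation does not make the write fail.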
Wrapping It Up
As usual, here is the full source code. Even if you don't feel like typing it out yourself, you can copy, paste, and run it, then work through the details on your own; that is a good way to understand how the code actually runs.
# Import urlopen and Request from urllib
from urllib.request import urlopen, Request
# Import BeautifulSoup
from bs4 import BeautifulSoup as bf
from random import choice
from colorama import init
from termcolor import colored
from readchar import readkey

FGS = ['green', 'yellow', 'blue', 'cyan', 'magenta', 'red']
book_list = []
free_trial_link = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}

def get_hot_book():
    print(colored('Searching the book list...', choice(FGS)))
    book_list.clear()
    req = Request("https://www.readnovel.com/", headers=headers)
    # Send the request and fetch the HTML
    html = urlopen(req)
    # The response body is bytes; decode it into a string
    html_text = html.read().decode('utf-8')
    soup = bf(html_text, 'html.parser')
    for li in soup.select('#new-book-list li'):
        a_tag = li.select_one('a[data-eid="qd_F24"]')
        p_tag = li.select_one('p')
        book = {
            'href': a_tag['href'],
            'title': a_tag.get('title'),
            'content': p_tag.get_text()
        }
        book_list.append(book)

def get_book_detail(link):
    free_trial_link.clear()
    req = Request(f"https://www.readnovel.com{link}#Catalog", headers=headers)
    # Send the request and fetch the HTML
    html = urlopen(req)
    # The response body is bytes; decode it into a string
    html_text = html.read().decode('utf-8')
    soup = bf(html_text, 'html.parser')
    og_title = soup.find('meta', property='og:title')['content']
    og_description = soup.find('meta', property='og:description')['content']
    og_novel_author = soup.find('meta', property='og:novel:author')['content']
    og_novel_update_time = soup.find('meta', property='og:novel:update_time')['content']
    og_novel_status = soup.find('meta', property='og:novel:status')['content']
    og_novel_latest_chapter_name = soup.find('meta', property='og:novel:latest_chapter_name')['content']
    # Collect the chapter links from the "free trial" catalog
    div_tag = soup.find('div', id='j-catalogWrap')
    list_items = div_tag.find_all('li', attrs={'data-rid': True})
    for li in list_items:
        link_text = li.find('a').text
        if '第' in link_text:  # keep only numbered chapters ('第' prefixes chapter numbers on this site)
            link_url = li.find('a')['href']
            link_obj = {'link_text': link_text,
                        'link_url': link_url}
            free_trial_link.append(link_obj)
    print(colored(f"Title: {og_title}", choice(FGS)))
    print(colored(f"Synopsis: {og_description}", choice(FGS)))
    print(colored(f"Author: {og_novel_author}", choice(FGS)))
    print(colored(f"Last updated: {og_novel_update_time}", choice(FGS)))
    print(colored(f"Status: {og_novel_status}", choice(FGS)))
    print(colored(f"Latest chapter: {og_novel_latest_chapter_name}", choice(FGS)))

def free_trial(link):
    req = Request(f"https://www.readnovel.com{link}", headers=headers)
    # Send the request and fetch the HTML
    html = urlopen(req)
    # The response body is bytes; decode it into a string
    html_text = html.read().decode('utf-8')
    soup = bf(html_text, 'html.parser')
    name = soup.find('h1', class_='j_chapterName')
    chapter = {
        'name': name.get_text(),
        'text': ''  # default so callers never hit a missing key
    }
    print(colored(name.get_text(), choice(FGS)))
    ywskythunderfont = soup.find('div', class_='ywskythunderfont')
    if ywskythunderfont:
        p_tags = ywskythunderfont.find_all('p')
        chapter['text'] = p_tags[0].get_text()
    return chapter

def download_chapter(chapter):
    file_name = chapter['name'] + '.txt'
    with open(file_name, 'w', encoding='utf-8') as file:
        # the site separates paragraphs with double ideographic spaces
        file.write(chapter['text'].replace('\u3000\u3000', '\n'))
    print(colored(f'File {file_name} saved!', choice(FGS)))

def print_book():
    # Print the titles three per row, each prefixed with its index
    for i in range(0, len(book_list), 3):
        names = [f'{i + j}:{book_list[i + j]["title"]}' for j in range(3) if i + j < len(book_list)]
        print(colored('\t\t'.join(names), choice(FGS)))

def read_book(page):
    if not free_trial_link:
        print(colored('No book selected; nothing to read!', choice(FGS)))
        return
    print(colored(free_trial(free_trial_link[page]['link_url'])['text'], choice(FGS)))

get_hot_book()
init()  ## enable colored output on the command line
print(colored('Search finished!', choice(FGS)))
print(colored('m: back to home page', choice(FGS)))
print(colored('d: free trial reading', choice(FGS)))
print(colored('x: download all', choice(FGS)))
print(colored('n: next chapter', choice(FGS)))
print(colored('b: previous chapter', choice(FGS)))
print(colored('q: quit', choice(FGS)))
my_key = ['q', 'm', 'd', 'x', 'n', 'b']
current = 0
while True:
    while True:
        move = readkey()
        if move in my_key:
            break
    if move == 'q':  ## 'q' quits the reader
        break
    if move == 'd':
        read_book(current)
    if move == 'x':  ## demo only: download the first chapter instead of looping over all of them
        if free_trial_link:
            download_chapter(free_trial(free_trial_link[0]['link_url']))
    if move == 'b':
        current = current - 1
        if current < 0:
            current = 0
        read_book(current)
    if move == 'n':
        current = current + 1
        if current >= len(free_trial_link):  # clamp to the last chapter
            current = len(free_trial_link) - 1
        read_book(current)
    if move == 'm':
        print_book()
        current = 0
        num = int(input('Enter a book number: =====>'))
        if 0 <= num < len(book_list):
            get_book_detail(book_list[num]['href'])
Summary
In today's hands-on scraping session, besides the usual page scraping we added a download feature: crawling a novel and saving it locally for offline reading. To keep things clear, I drew a feature architecture diagram for you. We first parsed the novel site, covering the book list, book synopses, and free-trial chapters, then wrote the code for each feature: fetching book information from the list, fetching book details, parsing free-trial chapters, and downloading the novel. Finally we wrapped these features into methods so they are easy to call and operate. Through this exercise we took a deeper look at how scrapers are applied in practice, laying a foundation for later projects.