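I. Install Python
Download Python 3.9 or later from the official website.
II. Install Playwright
Playwright is a browser automation and testing tool released by Microsoft in early 2020. Compared with Selenium, currently the most widely used tool, it drives Chromium, Firefox, WebKit, and other mainstream browsers through a single API.
(1) Install the Playwright library
pip install playwright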
(2) Install the browser binaries for Chromium, Firefox, and WebKit (Playwright's bundled browsers)
python -m playwright install
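III. Analyze the site's HTML structure
魔笔小说网 (mobinovels.com) is a light-novel download site that offers novels in mobi, epub, and other formats. Its drawback is that downloads redirect to the CTFile network disk (城通网盘), where non-members are throttled and limited to one download task at a time.
In Chrome, press F12 to open the developer tools.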
(i) Novel index
HTML content
Each novel's detail-page URL can be read from the href attribute of its link. Opening that URL then yields the chapter download links.
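As a rough offline illustration of reading the href attribute, the sketch below parses a hand-written HTML fragment with Python's standard html.parser. The class name "post-title entry-title" is the one the article's script targets; the h2 tag, the sample markup, and the novel-a / novel-b URLs are assumptions invented for this example, not the site's real markup.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (title, href) pairs from links inside title elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._title_tag = None  # tag name of the title element we are inside
        self._href = None       # href of the <a> currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._title_tag is None:
            if attrs.get("class") == "post-title entry-title":
                self._title_tag = tag
        elif tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == self._title_tag:
            self._title_tag = None


def extract_title_links(html):
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical fragment; the real page may use different tags.
SAMPLE = """
<h2 class="post-title entry-title"><a href="https://mobinovels.com/novel-a/">Novel A</a></h2>
<h2 class="post-title entry-title"><a href="https://mobinovels.com/novel-b/">Novel B</a></h2>
<p><a href="https://example.com/ignored">not a title</a></p>
"""

links = extract_title_links(SAMPLE)
```

Because the real index is loaded dynamically, a static parser like this only works on HTML that has already been rendered, which is why the article drives a browser instead.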
(ii) Chapter download list
HTML content
Visit each novel's URL in turn and save its download links to a separate txt file for the later download step.
(iii) Code
import os
import re
import time

from playwright.sync_api import Playwright, sync_playwright


def cancel_request(route, request):
    # Abort the request (used to skip images when the route below is enabled)
    route.abort()


def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    # Uncomment to skip loading images
    # page.route(re.compile(r"(\.png)|(\.jpg)"), cancel_request)
    page.goto("https://mobinovels.com/")
    # The home page loads its list dynamically, so wait 30 s here and scroll
    # to the bottom by hand until every entry has loaded
    for i in range(30):
        time.sleep(1)
        print(i)
    # Locate the list entries
    novel_list = page.locator('[class="post-title entry-title"]')
    # Count the novels
    total = novel_list.count()
    # Make sure the output directory exists before writing to it
    os.makedirs('./novelurl', exist_ok=True)
    # Visit each novel's detail page
    for i in range(total):
        novel = novel_list.nth(i).locator("a")
        title = novel.inner_text()
        title_url = novel.get_attribute("href")
        page1 = context.new_page()
        page1.goto(title_url, wait_until='domcontentloaded')
        print(i+1, total, title)
        try:
            content_list = page1.locator("table>tbody>tr")
            # Save the links to a per-novel txt file for the download step
            with open('./novelurl/'+title+'.txt', 'a') as f:
                for j in range(content_list.count()):
                    cells = content_list.nth(j).locator("td")
                    # The link is read from the fourth cell, so skip shorter rows
                    if cells.count() > 3:
                        content_href = cells.nth(3).locator("a").get_attribute("href")
                        f.write(title+str(j+1)+'分割'+content_href + '\n')
        except Exception as e:
            # Report the failure and move on to the next novel
            print(title, "failed:", e)
        page1.close()
    # Keep the browser open; stop the program by hand when it has finished
    time.sleep(50000)
    page.close()

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
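(iv) Results
IV. Start downloading
The download links are saved to txt files first, rather than downloaded immediately, so that progress survives a crash: if the program dies because of a network or other error, the next run can pick up where it left off without downloading anything twice.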
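The resume idea can be sketched without a browser. The record layout (name, then the separator 分割, then the URL) follows the article's scripts; the function names and the `download` callback are hypothetical stand-ins for the Playwright steps.

```python
import os

SEP = "分割"  # separator the article's scripts place between name and URL


def write_tasks(path, title, urls):
    """Append one 'name<SEP>url' record per download link."""
    with open(path, "a", encoding="utf-8") as f:
        for n, url in enumerate(urls, start=1):
            f.write(f"{title}{n}{SEP}{url}\n")


def read_tasks(path):
    """Return the (name, url) pairs stored in one task file."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.strip().split(SEP, 1)) for line in f if line.strip()]


def process_pending(task_dir, download):
    """Call download(name, url) for every record.

    A task file is deleted only after all of its records succeed, so a
    failed novel is retried on the next run and finished ones are skipped.
    """
    done = []
    for filename in sorted(os.listdir(task_dir)):
        path = os.path.join(task_dir, filename)
        try:
            for name, url in read_tasks(path):
                download(name, url)
        except Exception as exc:
            print(path, "failed:", exc)
            continue
        os.remove(path)
        done.append(filename)
    return done
```

One trade-off of deleting per file rather than per record: if the third link of a novel fails, the first two are downloaded again on the next run.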
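(i) Get cookies
CTFile requires a login to download larger files. When a light-novel file is large, the page redirects to the login page and the program hangs, so either use cookies to preserve the login state or add a delay and log in by hand in the page.
In Chrome, cookies can be exported with a cookie editor extension and then used directly.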
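Exported cookies may need light reshaping before they are passed to Playwright's `context.add_cookies`. The sketch below assumes the extension exports a JSON array using Chrome's cookie field names (`expirationDate`, lowercase `sameSite` values such as `no_restriction`); that export format, the function name, and the sample cookie are assumptions, so check them against the actual export.

```python
import json

# Map Chrome-style sameSite values to the ones Playwright accepts.
SAME_SITE = {"strict": "Strict", "lax": "Lax", "no_restriction": "None", "none": "None"}


def to_playwright_cookies(exported_json):
    """Convert an exported cookie list into dicts for context.add_cookies."""
    cookies = []
    for item in json.loads(exported_json):
        cookie = {
            "name": item["name"],
            "value": item["value"],
            "domain": item["domain"],
            "path": item.get("path", "/"),
            "httpOnly": bool(item.get("httpOnly", False)),
            "secure": bool(item.get("secure", False)),
        }
        # Session cookies carry no expiry; keep the field only when present.
        if "expirationDate" in item:
            cookie["expires"] = item["expirationDate"]
        same_site = SAME_SITE.get(str(item.get("sameSite", "")).lower())
        if same_site:
            cookie["sameSite"] = same_site
        cookies.append(cookie)
    return cookies


# Hypothetical export with a made-up cookie name and value.
EXPORTED = """[
  {"name": "session", "value": "abc123", "domain": ".ctfile.com", "path": "/",
   "httpOnly": true, "secure": true, "sameSite": "no_restriction",
   "expirationDate": 1893456000, "hostOnly": false}
]"""

cookies = to_playwright_cookies(EXPORTED)
```

The result can then be handed to `context.add_cookies(cookies)` before the first page is opened.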
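(ii) Analyze the download URLs
There are three kinds of download URL, each handled by its own condition:
(1) The access password is 6195 for every file. When the domain is https://url74.ctfile.com/ and the URL ends with ?p=6195, the page fills in the password automatically. The script therefore checks whether the URL contains ?p=6195 and, if it does not, appends it before visiting;
(2) URLs that already carry the suffix need no change;
(3) URLs on https://t00y.com/ need no password.
if "t00y.com" in new_url:
    # No password needed
    page.goto(new_url)
elif "?p=6195" not in new_url:
    # Append the password so the page fills it in automatically
    page.goto(new_url+"?p=6195")
    page.get_by_placeholder("文件访问密码").click()
    page.get_by_role("button", name="解密文件").click()
else:
    # The URL already carries the password
    page.goto(new_url)
    page.get_by_placeholder("文件访问密码").click()
    page.get_by_role("button", name="解密文件").click()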
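The same three-way decision can be isolated in a small pure function, which makes it easy to check without opening a browser. This is only a sketch: `prepare_url` is a hypothetical helper name, and the /f/example paths in the usage are made up.

```python
def prepare_url(url, password="6195"):
    """Return (url_to_open, needs_decrypt_click) for a download link."""
    if "t00y.com" in url:
        # This domain needs no password, so there is no decrypt step.
        return url, False
    suffix = f"?p={password}"
    if suffix not in url:
        # Append the password so the page pre-fills it.
        url += suffix
    return url, True


# Hypothetical links covering the three cases.
examples = [
    prepare_url("https://url74.ctfile.com/f/example"),
    prepare_url("https://url74.ctfile.com/f/example?p=6195"),
    prepare_url("https://t00y.com/f/example"),
]
```

With this helper the two password branches collapse into one: go to the returned URL, then click the decrypt button only when the second value is true.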
(iii) Start downloading
Downloading a resource with Playwright goes through the page.expect_download function.
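A minimal sketch of the pattern follows: the click that triggers the download happens inside the `expect_download` block, and the finished download is read from `.value` afterwards. The helper name `save_download` and its parameters are hypothetical; `page` is expected to be a Playwright sync `Page` that is already on the download page.

```python
from pathlib import Path


def save_download(page, button_name, target_dir, timeout_ms=100_000):
    """Click a download button and save the file under target_dir."""
    # Start listening before the click so the download event is not missed.
    with page.expect_download(timeout=timeout_ms) as download_info:
        page.get_by_role("button", name=button_name).first.click()
    download = download_info.value
    # Keep the file name suggested by the server.
    target = Path(target_dir) / download.suggested_filename
    download.save_as(target)
    return target
```

A call might look like `save_download(page, "立即下载", "./novel/some-title")`, where the directory name is a placeholder.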
The complete download code is as follows:
import os
import time

from playwright.sync_api import Playwright, sync_playwright


def run(playwright: Playwright) -> None:
    # Uses the locally installed Chrome browser
    browser = playwright.chromium.launch(channel="chrome", headless=False)
    context = browser.new_context()
    path = r'D:\PycharmProjects\wxauto\novelurl'
    dir_list = os.listdir(path)
    # To reuse a login session, fill in the exported cookies
    # cookies = []
    # context.add_cookies(cookies)
    page = context.new_page()
    for i in range(len(dir_list)):
        novel_url = os.path.join(path, dir_list[i])
        try:
            print(novel_url)
            with open(novel_url) as f:
                for j in f.readlines():
                    new_name, new_url = j.strip().split("分割")
                    if "t00y.com" in new_url:
                        page.goto(new_url)
                    elif "?p=6195" not in new_url:
                        page.goto(new_url+"?p=6195")
                        page.get_by_placeholder("文件访问密码").click()
                        page.get_by_role("button", name="解密文件").click()
                    else:
                        page.goto(new_url)
                        page.get_by_placeholder("文件访问密码").click()
                        page.get_by_role("button", name="解密文件").click()

                    # Wait for the download triggered by the button click
                    with page.expect_download(timeout=100000) as download_info:
                        page.get_by_role("button", name="立即下载").first.click()
                    print(new_name, "download started")
                    download_file = download_info.value
                    download_file.save_as("./novel/"+dir_list[i][:-4]+"/"+download_file.suggested_filename)
                    time.sleep(3)
            # Delete the task file only after the file is closed and every link has downloaded
            os.remove(novel_url)
            print(i+1, dir_list[i], "download finished")
        except Exception as e:
            # Leave the task file in place so the next run retries it
            print(novel_url, "failed:", e)
            time.sleep(60)
    page.close()

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
(iv) Results
Source: http://www.zghlxwxcb.cn/news/detail-711675.html