Integrating Selenium into the Scrapy Crawler Framework to Parse Dynamic Web Pages

Posted 2 years ago. Author: __彎弓__. Category: Toy Blog.

This article explains how to integrate Selenium into the Scrapy crawler framework to parse dynamic web pages. Corrections and suggestions are welcome.

1. Limitations of using the Scrapy framework on its own

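Most websites today rely on JavaScript-driven dynamic pages, especially since Vue and React became widespread, and locating elements on such pages with Scrapy alone is very difficult. Selenium, the most popular browser automation tool, can drive a real browser to operate a page, parse its elements, and perform actions, so it handles dynamic pages well. However, crawling a large site with Selenium alone is slow and very resource-intensive. Can Selenium be integrated into the Scrapy framework so that the strengths of both are combined?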

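The key to integrating Selenium with Scrapy is to place it in a downloader middleware (DownloaderMiddleware). As the Scrapy architecture diagram below shows, the middleware methods around the Downloader can modify the request and response objects before handing them back to Scrapy.
[Figure: Scrapy architecture diagram]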

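You can write your own downloader middleware class to integrate Selenium, but supporting all of Selenium's features that way takes considerable work. We therefore recommend using the third-party library scrapy-selenium for the integration.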
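To give a rough idea of what a hand-written middleware involves, here is a minimal sketch. It is not the scrapy-selenium implementation: the class name SimpleSeleniumMiddleware and the use_selenium meta key are hypothetical names chosen for illustration, and the driver is assumed to be a Selenium WebDriver created elsewhere.

```python
class SimpleSeleniumMiddleware:
    """Hypothetical downloader middleware that renders opted-in requests in a browser."""

    def __init__(self, driver):
        # `driver` is assumed to be a Selenium WebDriver instance,
        # for example one created with selenium.webdriver.Chrome().
        self.driver = driver

    def process_request(self, request, spider):
        # Requests that do not opt in return None, which tells Scrapy
        # to continue with its normal downloader.
        if not request.meta.get("use_selenium"):
            return None

        # Imported here so the class can be defined without Scrapy installed.
        from scrapy.http import HtmlResponse

        # Let the browser load the page and run its JavaScript.
        self.driver.get(request.url)

        # Returning a Response short-circuits the download: Scrapy passes
        # the rendered HTML to the spider callback instead of fetching the URL.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source.encode("utf-8"),
            encoding="utf-8",
            request=request,
        )
```

In a real project, Scrapy builds middlewares through a from_crawler class method, which is where the driver would be created and later closed when the spider finishes. Waits, screenshots, and script execution would all need extra code on top of this, which is the work scrapy-selenium saves.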

2. Setting up a scrapy-selenium development environment

2.1 Installing the scrapy-selenium library

pip install scrapy-selenium
Python 3.6 or later is required.

2.2 Installing the browser driver

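A browser supported by Selenium, such as Chrome, Firefox, or Edge, must be installed on the machine.
Then install the WebDriver that matches that browser and its version.
After downloading chromedriver.exe, place it in the project root directory or add its location to the system PATH environment variable.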

2.3 Integrating Selenium into the Scrapy project

The project structure is as follows:


├── scrapy.cfg
├── chromedriver.exe ## <-- Here
└── myproject
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

Go to the project folder and update settings.py:

## settings.py

# Look up the Chrome driver executable
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # run Chrome without a visible window

# Route requests through the scrapy-selenium downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
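The settings above use shutil.which, which looks the executable up on the system PATH, so a driver that was only copied into the project root may not be found on every platform. The sketch below is one possible fallback. The helper name find_chromedriver is hypothetical, and it assumes that the crawl is started from the project root directory.

```python
from pathlib import Path
from shutil import which


def find_chromedriver(project_root="."):
    """Return the path to chromedriver as a string, or None if it is not found."""
    # First try the system PATH, which is where shutil.which searches.
    found = which("chromedriver")
    if found:
        return found

    # Fall back to a copy placed directly in the project root directory.
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = Path(project_root) / name
        if candidate.is_file():
            return str(candidate.resolve())

    return None
```

With this helper, settings.py could assign SELENIUM_DRIVER_EXECUTABLE_PATH = find_chromedriver() and fail early with a clear message if the result is None.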

3. Using Selenium in a spider to parse pages

In the spider, use the SeleniumRequest class in place of Scrapy's built-in Request class.

## spider.py
import scrapy
from quotes_js_scraper.items import QuoteItem
from scrapy_selenium import SeleniumRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Create a fresh item for each quote so yielded items do not share state
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

Scrapy hands these requests to Selenium automatically, and the callback parses the page elements returned in the response. Here Selenium drives a headless Chrome browser.

4. Using Selenium features to scrape data

Selenium features are available as well, such as:
- waiting for page elements
- simulating clicks and other actions
- taking screenshots

(1) Waits

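When an element cannot be located on a dynamic page, the cause is usually component load order or asynchronous AJAX updates. Selenium's wait functionality, available here through the wait_until argument, handles locating elements on such pages.
Give a request a wait time of 10 seconds (the timeout that applies when a wait_until condition is supplied):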

def start_requests(self):
    url = 'https://quotes.toscrape.com/js/'
    yield SeleniumRequest(url=url, callback=self.parse, wait_time=10)

Use the wait_until argument to wait for a condition:

## spider.py
import scrapy
from quotes_js_scraper.items import QuoteItem
 
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SeleniumRequest(
            url=url,
            callback=self.parse,
            wait_time=10,
            # Wait until an element with class "quote" is clickable
            wait_until=EC.element_to_be_clickable((By.CLASS_NAME, 'quote')),
        )

    def parse(self, response):
        for quote in response.selector.css('div.quote'):
            # Create a fresh item for each quote so yielded items do not share state
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
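Conceptually, a conditional wait checks a condition repeatedly until it holds or a timeout expires. The sketch below illustrates that idea in plain Python. The function poll_until is a hypothetical stand-in for illustration only, not Selenium's own wait implementation.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Pause briefly before checking again to avoid a busy loop.
        time.sleep(interval)
```

The same trade-off applies to the wait_time argument: a longer timeout tolerates slow pages, but every page that never satisfies the condition costs the full timeout.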

(2) Clicking a button

For example, Selenium can be configured to run a click event on an a tag (a link):

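## spider.py
import scrapy
from scrapy_selenium import SeleniumRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SeleniumRequest(
            url=url,
            callback=self.parse,
            # JavaScript executed in the loaded page: click the "next" link
            script="document.querySelector('.pager .next>a').click()",
        )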
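The one-line script above throws a JavaScript error when no element matches the selector, for example on a page without a next link. A small helper can build a guarded version of the script for any CSS selector. The name click_script is hypothetical and this is only a sketch.

```python
import json


def click_script(css_selector):
    """Build JavaScript that clicks the first element matching the selector, if any."""
    # json.dumps quotes and escapes the selector so it is a valid JS string literal.
    selector = json.dumps(css_selector)
    return f"var el = document.querySelector({selector}); if (el) {{ el.click(); }}"
```

The result can be passed as the script argument, for example script=click_script('.pager .next>a').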

(3) Page screenshot

## spider.py
import scrapy
from quotes_js_scraper.items import QuoteItem
from scrapy_selenium import SeleniumRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SeleniumRequest(
            url=url,
            callback=self.parse,
            screenshot=True,
        )

    def parse(self, response):
        # The screenshot bytes are stored in response.meta['screenshot']
        with open('image.png', 'wb') as image_file:
            image_file.write(response.meta['screenshot'])
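Writing every screenshot to image.png means each page overwrites the previous one. One way around this is to derive the file name from the page URL, as sketched below. The helper screenshot_filename is a hypothetical name used for illustration.

```python
import re
from urllib.parse import urlparse


def screenshot_filename(url, suffix=".png"):
    """Derive a filesystem-safe file name from a page URL."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    # Replace every run of characters that is unsafe in file names with "_".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return (safe or "page") + suffix
```

In the parse callback, open(screenshot_filename(response.url), 'wb') would then give each crawled page its own file.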
