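Introduction
Selenium is a browser automation tool that is most often used for testing web applications. It can also serve as a crawler, extracting page data by simulating what a user does in the browser. Its main features for crawling are:
- Browser automation: Selenium lets you control the browser from code: opening pages, clicking buttons, filling in forms, and so on. This reproduces the actions of a real user.
- Multiple browsers: Selenium supports the mainstream browsers, including Chrome, Firefox, and Edge, so you can automate whichever one fits your needs.
- Page data extraction: Selenium can load a page and extract the data on it, which is especially useful for pages that load content dynamically or require user interaction.
- Waiting for elements: Because pages may load asynchronously, Selenium provides wait mechanisms that pause execution until a specific element has finished loading.
- Selectors: Selenium can locate elements on a page with several kinds of locators, including CSS selectors and XPath.
- Dynamic page crawling: For pages whose content is generated by JavaScript, Selenium is a strong choice because it executes the JavaScript and returns the rendered result.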
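The waiting mechanism mentioned above boils down to polling: check a condition, sleep briefly, and give up after a deadline. The following is a minimal plain-Python sketch of that idea, not Selenium's actual implementation. The names `wait_until` and `element_is_present` are hypothetical and exist only for this illustration; in real code, Selenium's `WebDriverWait(driver, timeout).until(...)` plays this role.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose element only shows up on the third check.
attempts = {"count": 0}


def element_is_present():
    attempts["count"] += 1
    return "<div id='page_div'>" if attempts["count"] >= 3 else None


found = wait_until(element_is_present, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

The condition is a plain callable, so the same loop works for "element exists", "text changed", or any other check you can express as a function.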
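Although Selenium is convenient for crawling, it has some drawbacks. First, it is slow, because it drives a real browser that loads and renders every page the way a user's browser would. Second, a site may detect the automated browser and take measures to block crawlers, so use Selenium with care and comply with the site's terms of use and policies.
This article plugs Selenium into Scrapy through a downloader middleware, so you should already be familiar with the Scrapy framework. For details see the prerequisite article: https://blog.csdn.net/shizuguilai/article/details/135554205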
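To make the middleware idea concrete before the real code, here is a plain-Python model of how a downloader middleware can short-circuit a request: if the middleware returns a response, the normal download is skipped, and if it returns `None`, the request continues to the default downloader. This is only a sketch of the control flow, not Scrapy's code. The `Request` class, `browser_middleware`, `default_downloader`, and `fetch` are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Minimal stand-in for scrapy.Request (hypothetical, for illustration)."""
    url: str
    meta: dict = field(default_factory=dict)


def browser_middleware(request):
    """Render the page in a 'browser' only when the request asks for it."""
    if request.meta.get("usedSelenium"):
        return f"rendered:{request.url}"
    return None  # None means: let the next stage handle the request


def default_downloader(request):
    """Plain HTTP download, used when no middleware produced a response."""
    return f"plain:{request.url}"


def fetch(request, middlewares):
    """Return the first non-None middleware result, else the default download."""
    for middleware in middlewares:
        response = middleware(request)
        if response is not None:
            return response
    return default_downloader(request)


print(fetch(Request("http://example.com/a", {"usedSelenium": True}), [browser_middleware]))
print(fetch(Request("http://example.com/b"), [browser_middleware]))
```

The flag in `meta` is what lets one project mix browser-rendered requests with ordinary ones: only requests that carry the flag pay the cost of a real browser.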
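Environment setup
1. Installing conda first is recommended
Reference: https://blog.csdn.net/Q_fairy/article/details/129158178
2. Create a virtual environment and install the required packages
# Create a virtual environment named scrapy (with its own Python and pip)
conda create -n scrapy python
# Activate the virtual environment
conda activate scrapy
# Install the required packages
pip install scrapy
pip install selenium
3. Download ChromeDriver and the Chrome browser version that matches it
Reference: https://zhuanlan.zhihu.com/p/665018772
Remember to add the driver's location to your PATH environment variable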
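A quick way to confirm that step 3 worked is to check whether the driver can be found on PATH. The sketch below uses only the standard library. The function name `check_environment` is hypothetical, and the executable name `chromedriver` is an assumption (on Windows, `shutil.which` also resolves `chromedriver.exe`). Recent Selenium releases (4.6 and later) ship Selenium Manager, which can locate or download a driver automatically, so a missing entry here is not necessarily fatal.

```python
import shutil
import sys


def check_environment(executables=("chromedriver",)):
    """Report the Python version and where each executable is found on PATH."""
    report = {"python": sys.version.split()[0]}
    for name in executables:
        # shutil.which returns the full path, or None when the name is not on PATH
        report[name] = shutil.which(name)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value or 'NOT FOUND on PATH'}")
```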
Code
Directory structure: the spiders directory is where I keep the Scrapy spider scripts.
settings.py configuration
# Scrapy settings for sw project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = "sw"
SPIDER_MODULES = ["sw.spiders"]
NEWSPIDER_MODULE = "sw.spiders"
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
COOKIES_ENABLED = True
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "sw (+http://www.yourdomain.com)"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# In settings.py
# ----------- Selenium parameters -------------
SELENIUM_TIMEOUT = 25 # Selenium browser timeout, in seconds
LOAD_IMAGE = True # whether to load images
WINDOW_HEIGHT = 900 # browser window size
WINDOW_WIDTH = 900
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# "sw.middlewares.SwSpiderMiddleware": 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# "sw.middlewares.SwDownloaderMiddleware": 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# "sw.pipelines.SwPipeline": 300,
#}
# ITEM_PIPELINES = {
# "sw.pipelines.SwPipeline": 300,
# }
# DB_SETTINGS = {
# 'host': '127.0.0.1',
# 'port': 3306,
# 'user': 'root',
# 'password': '123456',
# 'db': 'scrapy_news_2024_01_08',
# 'charset': 'utf8mb4',
# }
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
# REDIRECT_ENABLED = False
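The custom Selenium settings above only take effect if the code that launches the browser reads them. As a rough sketch of one way to do that, the helper below turns the settings into Chrome command-line flags. The function names `chrome_arguments` and `build_browser` are hypothetical, and the `--blink-settings=imagesEnabled=false` flag is an assumption about how to skip image loading in Chrome. Selenium is imported lazily inside `build_browser` so the flag-building part can be tried without a browser installed.

```python
def chrome_arguments(settings):
    """Translate the custom Selenium settings into Chrome command-line flags."""
    args = []
    width = settings.get("WINDOW_WIDTH")
    height = settings.get("WINDOW_HEIGHT")
    if width and height:
        args.append(f"--window-size={width},{height}")
    if not settings.get("LOAD_IMAGE", True):
        # Assumption: this Blink flag disables image loading in Chrome
        args.append("--blink-settings=imagesEnabled=false")
    return args


def build_browser(settings):
    """Create a Chrome driver configured from the settings (needs Chrome installed)."""
    from selenium import webdriver  # imported lazily, see the note above

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(settings):
        options.add_argument(arg)
    browser = webdriver.Chrome(options=options)
    browser.set_page_load_timeout(settings.get("SELENIUM_TIMEOUT", 25))
    return browser


example = {"SELENIUM_TIMEOUT": 25, "LOAD_IMAGE": False,
           "WINDOW_HEIGHT": 900, "WINDOW_WIDTH": 900}
print(chrome_arguments(example))
```

Both a plain dict and the object returned by Scrapy's `get_project_settings()` offer a `.get(name, default)` method, so either can be passed in.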
Scrapy spider script (for reference)
"""
Created on 2024/01/06 14:00 by Fxy
"""
import scrapy
from sw.items import SwItem
from datetime import datetime
# Scrapy settings and signal helpers
from scrapy.utils.project import get_project_settings
# The import below is deprecated, so it is not used
# from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
# Use the standalone PyDispatcher package instead
from pydispatch import dispatcher
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
class NhcSpider(scrapy.Spider):
    '''
    Scrapy attributes
    '''
    # Spider name
    name = "1000_nhc"
    # Domains the spider is allowed to crawl
    allowed_domains = ["xxxx.cn"]
    # Start URL of the crawl
    start_urls = ["xxxx.shtml"]
    # An SwItem instance
    item = SwItem()
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'DOWNLOAD_DELAY': 0,
        'COOKIES_ENABLED': False,  # enabled by default
        'DOWNLOADER_MIDDLEWARES': {
            # The SeleniumMiddleware downloader middleware
            'sw.middlewares.SeleniumMiddleware': 543,  # this number sets the middleware's order
            # Disable Scrapy's default user-agent middleware
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        }
    }
    '''
    Custom attributes
    '''
    # Organization name
    org = "xxxx数据"
    # Organization name in English
    org_e = "None"
    # Date formats
    site_date_format = '发布时间:\n \t%Y-%m-%d\n '  # date format on the page ("发布时间" is the page's "Published" label)
    date_format = '%d.%m.%Y %H:%M:%S'  # target date format
    # Language of the site
    language_type = "zh2zh"  # Chinese-to-Chinese language code, used when calling the translation API
    # Request meta that marks a request to be fetched with Selenium
    meta = {'usedSelenium': name, 'dont_redirect': True}
    # Initialize Chrome inside the spider so the browser becomes an attribute of the spider
    def __init__(self, timeout=40, isLoadImage=True, windowHeight=None, windowWidth=None):
        # Read the parameters from settings.py
        self.mySetting = get_project_settings()
        self.timeout = self.mySetting['SELENIUM_TIMEOUT']
        self.isLoadImage = self.mySetting['LOAD_IMAGE']
        self.windowHeight = self.mySetting['WINDOW_HEIGHT']
        self.windowWidth = self.mySetting['WINDOW_WIDTH']
        # Build the Chrome options
        options = webdriver.ChromeOptions()
        options.add_experimental_option('useAutomationExtension', False)  # hide Selenium traits
        options.add_experimental_option('excludeSwitches', ['enable-automation'])  # hide Selenium traits
        options.add_argument('--ignore-certificate-errors')  # ignore certificate errors
        options.add_argument('--ignore-certificate-errors-spki-list')
        options.add_argument('--ignore-ssl-errors')  # ignore SSL errors
        # chrome_options = webdriver.ChromeOptions()
        # chrome_options.binary_location = "E:\\学校的一些资料\\文档\\研二上\\chrome-win64\\chrome.exe"  # replace with the path to your specific Chrome version
        # 1. Create a Chrome or Firefox browser object; this opens a browser window on the machine
        # browser = webdriver.Chrome(executable_path="E:\\chromedriver\\chromedriver", chrome_options=chrome_options)  # Selenium 3 style: driver path first, then the options
        self.browser = webdriver.Chrome(options=options)
        self.browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {  # hide Selenium traits
            "source": """
                Object.defineProperty(navigator, 'webdriver', {
                    get: () => undefined
                })
            """
        })
        if self.windowHeight and self.windowWidth:
            self.browser.set_window_size(self.windowWidth, self.windowHeight)
        self.browser.set_page_load_timeout(self.timeout)  # page load timeout
        self.wait = WebDriverWait(self.browser, 30)  # timeout when waiting for a specific element
        super().__init__()
        # Connect the signal: when spider_closed fires, call mySpiderCloseHandle to close Chrome
        dispatcher.connect(receiver=self.mySpiderCloseHandle,
                           signal=signals.spider_closed
                           )

    # Signal handler: close the Chrome browser
    def mySpiderCloseHandle(self, spider):
        print("mySpiderCloseHandle: enter ")
        self.browser.quit()
    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0],
                             meta=self.meta,
                             callback=self.parse,
                             # errback=self.error
                             )

    # Main entry point of the spider: collect every archived article link from the response
    def parse(self, response):
        # locale.setlocale(locale.LC_TIME, 'en_US')  # set the locale to English  //*[@id="538034"]/div
        achieve_links = response.xpath('//ul[@class="zxxx_list"]/li/a/@href').extract()
        print("achieve_links", achieve_links)
        for achieve_link in achieve_links:
            full_achieve_link = "http://xxxx.cn" + achieve_link
            print("full_achieve_link", full_achieve_link)
            # Follow each archive link
            yield scrapy.Request(full_achieve_link, callback=self.parse_item, dont_filter=True, meta=self.meta)
        # Pagination: find the "next page" link ("下一页" is the link text on the page)
        xpath_expression = '//*[@id="page_div"]/div[@class="pagination_index"]/span/a[text()="下一页"]/@href'
        next_page = response.xpath(xpath_expression).extract_first()
        print("next_page = ", next_page)
        # Request the next page if there is one
        if next_page is not None:
            full_next_page = "http://xxxx/" + next_page
            print("full_next_page", full_next_page)
            # The pagination meta differs from the article meta: it also asks the middleware to wait for the pager element
            meta_page = {'usedSelenium': self.name, "whether_wait_id": True}
            yield scrapy.Request(full_next_page, callback=self.parse, dont_filter=True, meta=meta_page)
    # Extract the content of each article and store it in an item
    def parse_item(self, response):
        source_url = response.url
        title_o = response.xpath('//div[@class="tit"]/text()').extract_first().strip()
        # title_t = my_tools.get_trans(title_o, "de2zh")
        publish_time = response.xpath('//div[@class="source"]/span[1]/text()').extract_first()
        # Parse the page's date text directly into a datetime object
        publish_time = datetime.strptime(publish_time, self.site_date_format)
        content_o = [content.strip() for content in response.xpath('//div[@id="xw_box"]//text()').extract()]
        # content_o = ' '.join(content_o)  # content_o is a list of strings, so join it into one string
        # content_t = my_tools.get_trans(content_o, "de2zh")
        print("source_url:", source_url)
        print("title_o:", title_o)
        # print("title_t:", title_t)
        print("publish_time:", publish_time)
        print("content_o:", content_o)
        # print("content_t:", content_t)
        print("-" * 50)
        page_data = {
            'source_url': source_url,
            'title_o': title_o,
            # 'title_t' : title_t,
            'publish_time': publish_time,
            'content_o': content_o,
            # 'content_t': content_t,
            'org': self.org,
            'org_e': self.org_e,
        }
        # Build a fresh item for each page instead of mutating the shared class-level self.item
        item = SwItem()
        item['url'] = page_data['source_url']
        item['title'] = page_data['title_o']
        # item['title_t'] = page_data['title_t']
        item['time'] = page_data['publish_time']
        item['content'] = page_data['content_o']
        # item['content_t'] = page_data['content_t']
        # Current time, truncated to whole seconds by a format/parse round trip
        formatted_time = datetime.now().strftime(self.date_format)
        item['scrapy_time'] = datetime.strptime(formatted_time, self.date_format)
        item['org'] = page_data['org']
        item['trans_org'] = page_data['org_e']
        yield item
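Two steps in the spider above are easy to get wrong: building absolute URLs from `href` values and parsing the date text scraped from the page. The sketch below shows both with the standard library only. The helper names `absolute_url` and `parse_publish_time`, the `example.com` URLs, and the sample date text are all hypothetical. Inside a Scrapy callback, `response.urljoin(href)` resolves a link against the response URL in the same way.

```python
from datetime import datetime
from urllib.parse import urljoin

# Label plus date as it might appear on the page; any whitespace in a
# strptime format matches a run of one or more whitespace characters in the input.
SITE_DATE_FORMAT = "发布时间: %Y-%m-%d"


def absolute_url(page_url, href):
    """Resolve a possibly relative href against the URL of the page it came from."""
    return urljoin(page_url, href)


def parse_publish_time(raw_text, fmt=SITE_DATE_FORMAT):
    """Parse the scraped date text, ignoring leading and trailing whitespace."""
    return datetime.strptime(raw_text.strip(), fmt)


print(absolute_url("http://example.com/list/index.shtml", "/news/a.shtml"))
print(absolute_url("http://example.com/list/index.shtml", "index_2.shtml"))
print(parse_publish_time("发布时间:\n \t2024-01-06\n "))
```

Resolving links against the page URL avoids hand-written prefixes, where a small typo such as a missing slash after `http:` silently produces invalid requests.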
Middleware: middlewares.py
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class SwSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
class SwDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
# -*- coding: utf-8 -*-
# Selenium-based downloader middleware
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from scrapy.http import HtmlResponse
from logging import getLogger
import time


class SeleniumMiddleware:
    # Maximum number of times a timed-out page is rescheduled
    max_retries = 3

    def retry_request(self, request, spider):
        # The original post calls retry_request() without showing it;
        # this is one possible implementation with a retry limit.
        retries = request.meta.get('selenium_retries', 0)
        if retries >= self.max_retries:
            return HtmlResponse(url=request.url, status=500, request=request)
        retry = request.replace(dont_filter=True)
        retry.meta['selenium_retries'] = retries + 1
        return retry  # returning a Request makes Scrapy reschedule it

    # Scrapy passes the spider into the middleware, so the Chrome objects created in the spider's __init__ are available here
    def process_request(self, request, spider):
        '''
        Fetch the page with Chrome
        :param request: the Request object
        :param spider: the Spider object
        :return: an HtmlResponse
        '''
        print(f"chrome is getting page = {request.url}")
        # A flag in meta decides whether this request is fetched with Selenium
        usedSelenium = request.meta.get('usedSelenium', None)  # None when the key is missing
        if usedSelenium == "1000_nhc":
            try:
                spider.browser.get(request.url)
                time.sleep(4)
                if request.meta.get('whether_wait_id', False):  # False when the key is missing
                    print("Waiting for the pagination element to appear...")
                    # Use WebDriverWait to wait until the page has loaded
                    wait = WebDriverWait(spider.browser, 20)  # wait at most 20 seconds
                    # Example: wait for one element on the page; adjust the locator to your site
                    wait.until(EC.presence_of_element_located((By.ID, "page_div")))  # continue only once the pager is present
            except TimeoutException:  # the element never appeared, so request the page again
                print("Timeout waiting for element. Retrying the request.")
                return self.retry_request(request, spider)
            except Exception as e:
                print(f"chrome getting page error, Exception = {e}")
                return HtmlResponse(url=request.url, status=500, request=request)
            else:
                time.sleep(4)
                # The page was fetched: build a successful response (HtmlResponse is a subclass of Response)
                return HtmlResponse(url=request.url,
                                    body=spider.browser.page_source,
                                    request=request,
                                    # ideally this should match the page's actual encoding
                                    encoding='utf-8',
                                    status=200)
        # try:
        #     spider.browser.get(request.url)
        #     # Check that the search box has appeared
        #     input = spider.wait.until(
        #         EC.presence_of_element_located((By.XPATH, "//div[@class='nav-search-field ']/input"))
        #     )
        #     time.sleep(2)
        #     input.clear()
        #     input.send_keys("iphone 7s")
        #     # Press Enter to run the search
        #     input.send_keys(Keys.RETURN)
        #     # Check that the search results have appeared
        #     searchRes = spider.wait.until(
        #         EC.presence_of_element_located((By.XPATH, "//div[@id='resultsCol']"))
        #     )
        # except Exception as e:
        #     print(f"chrome getting page error, Exception = {e}")
        #     return HtmlResponse(url=request.url, status=500, request=request)
        # else:
        #     time.sleep(3)
        #     # The page was fetched: build a successful response (HtmlResponse is a subclass of Response)
        #     return HtmlResponse(url=request.url,
        #                         body=spider.browser.page_source,
        #                         request=request,
        #                         # ideally this should match the page's actual encoding
        #                         encoding='utf-8',
        #                         status=200)
Appendix: Selenium tutorials
Reference 1, how Selenium waits for a specific element to appear: https://selenium-python-zh.readthedocs.io/en/latest/waits.html
Reference 2, Selenium usage in detail: https://pythondjango.cn/python/tools/7-python_selenium/#%E5%85%83%E7%B4%A0%E5%AE%9A%E4%BD%8D%E6%96%B9%E6%B3%95
Reference 3, someone else's hands-on example: https://blog.csdn.net/zwq912318834/article/details/79773870