
Advanced Web Scraping: Simulating a Browser with Selenium

2 years ago, author: 氏族歸來, category: Toy博客


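Introduction

Selenium is a browser-automation tool that is most often used for testing web applications. It can also serve as a scraper: it drives a real browser the way a user would and extracts data from the rendered pages. Its main capabilities are:

Browser automation: Selenium lets you control a browser programmatically (opening pages, clicking buttons, filling in forms, and so on), which reproduces what a user does in the browser.
Multiple browsers: Selenium supports the mainstream browsers, including Chrome, Firefox, and Edge, so you can pick the one that fits your needs.
Data extraction: Selenium can load a page and read the data on it. This is especially useful for pages that load content dynamically or require user interaction.
Waiting for elements: Pages often load asynchronously, so Selenium provides wait mechanisms that block until a specific element has loaded before the script continues.
Selectors: Selenium locates elements on a page with several strategies, including CSS selectors and XPath.
Dynamic pages: For pages whose content is generated by JavaScript, Selenium is a strong choice because the browser executes the JavaScript and Selenium reads the rendered result.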
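
The waiting mechanism is easier to reason about once you see that an explicit wait is essentially a polling loop. The sketch below imitates that idea with the standard library only; `wait_until` and `WaitTimeout` are hypothetical names for illustration, not Selenium API. In Selenium the same role is played by `WebDriverWait(driver, timeout).until(condition)`, which raises `TimeoutException` when the condition is never met.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Simulate an element that only "appears" on the third poll.
attempts = {"count": 0}


def element_present():
    attempts["count"] += 1
    return "<div id='page_div'>" if attempts["count"] >= 3 else None


found = wait_until(element_present, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```

The point of the design is that the script proceeds as soon as the element exists instead of sleeping for a fixed, worst-case amount of time.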

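Selenium is convenient for scraping, but it has costs. First, it is slow, because it drives a full browser and loads each page the way a real user would. Second, a site may detect the automated browser and take anti-scraping measures, so use Selenium with care and follow the site's terms of use and policies.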
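
One concrete way to check a site's crawling policy is to read its robots.txt before requesting a URL. The sketch below uses the standard library's `urllib.robotparser`; the robots.txt content, the `example.com` URLs, and the `sw` user-agent name are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
can_fetch_news = parser.can_fetch("sw", "http://example.com/news/index.shtml")
can_fetch_private = parser.can_fetch("sw", "http://example.com/private/data.html")
print(can_fetch_news, can_fetch_private)
```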

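Before using Selenium this way you need some familiarity with the Scrapy framework: in this article Selenium does its scraping through a Scrapy downloader middleware. For the background, see: https://blog.csdn.net/shizuguilai/article/details/135554205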

Environment setup

1. Install conda first (recommended)

Reference: https://blog.csdn.net/Q_fairy/article/details/129158178

2. Create a virtual environment and install the packages

# Create an environment named scrapy (with its own Python and pip)
conda create -n scrapy python
# Activate the environment
conda activate scrapy
# Install the packages
pip install scrapy
pip install selenium
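
After installing, it can be worth confirming that both packages are visible from inside the activated environment. A minimal sketch using the standard library's `importlib.metadata`; `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Prints a version string per package, or None for one that is not installed.
print(installed_versions(["scrapy", "selenium"]))
```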

3. Download ChromeDriver and the Chrome version that matches it

Reference: https://zhuanlan.zhihu.com/p/665018772
Remember to add the driver to your PATH environment variable.
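
A quick way to confirm the environment variable is set correctly is to ask Python where it finds the driver. This sketch assumes the driver executable is named `chromedriver`; `find_executable` is a hypothetical helper around the standard `shutil.which`.

```python
import shutil


def find_executable(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


driver_path = find_executable("chromedriver")
print(driver_path or "chromedriver not found on PATH")
```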

Code

Directory layout: the spiders directory is where I keep the Scrapy spider scripts.

settings.py configuration

# Scrapy settings for sw project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "sw"

SPIDER_MODULES = ["sw.spiders"]
NEWSPIDER_MODULE = "sw.spiders"

DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
COOKIES_ENABLED = True

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "sw (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# ----------- Selenium settings (in settings.py) -------------
SELENIUM_TIMEOUT = 25  # Selenium page-load timeout, in seconds
LOAD_IMAGE = True      # whether to load images
WINDOW_HEIGHT = 900    # browser window size
WINDOW_WIDTH = 900

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "sw.middlewares.SwSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "sw.middlewares.SwDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "sw.pipelines.SwPipeline": 300,
#}
# ITEM_PIPELINES = {
#    "sw.pipelines.SwPipeline": 300,
# }
# DB_SETTINGS = {
#     'host': '127.0.0.1',
#     'port': 3306,
#     'user': 'root',
#     'password': '123456',
#     'db': 'scrapy_news_2024_01_08',
#     'charset': 'utf8mb4',
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
# REDIRECT_ENABLED = False
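
To make the effect of `DOWNLOAD_DELAY = 3` together with `RANDOMIZE_DOWNLOAD_DELAY = True` concrete: Scrapy's documentation describes the actual wait as a random value between 0.5 and 1.5 times `DOWNLOAD_DELAY`. The sketch below only imitates that documented behaviour with the standard library; `randomized_delay` is a hypothetical helper, not Scrapy's implementation.

```python
import random


def randomized_delay(download_delay, rng=random):
    """Pick a wait between 0.5x and 1.5x of the configured delay."""
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)


rng = random.Random(0)  # seeded only so the illustration is repeatable
samples = [randomized_delay(3, rng) for _ in range(5)]
print([round(s, 2) for s in samples])
```

With a delay of 3 every sample falls between 1.5 and 4.5 seconds. Note that the spider in this article sets `DOWNLOAD_DELAY` to 0 in its `custom_settings`, which takes precedence over the project-wide value for that spider.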

Reference Scrapy spider

"""
Created on 2024/01/06 14:00 by Fxy
"""
import scrapy
from sw.items import SwItem
import time
from datetime import datetime
import locale
# Scrapy signal-related imports
from scrapy.utils.project import get_project_settings
# The import below is deprecated, so it is not used
# from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
# The approach Scrapy currently uses
from pydispatch import dispatcher
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


class NhcSpider(scrapy.Spider):
    ''' Scrapy attributes '''
    # Spider name
    name = "1000_nhc"
    # Domains the spider is allowed to crawl
    allowed_domains = ["xxxx.cn"]
    # Start URL of the crawl
    start_urls = ["xxxx.shtml"]
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'DOWNLOAD_DELAY': 0,
        'COOKIES_ENABLED': False,  # enabled by default
        'DOWNLOADER_MIDDLEWARES': {
            # The Selenium middleware
            'sw.middlewares.SeleniumMiddleware': 543,  # the number is the middleware's order
            # Disable Scrapy's default user-agent middleware
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        }
    }

    ''' Custom attributes '''
    # Organisation name
    org = "xxxx数据"
    # Organisation name in English
    org_e = "None"
    # Date formats
    site_date_format = '发布时间：\n \t%Y-%m-%d\n '  # date format on the page; must match the page text exactly
    date_format = '%d.%m.%Y %H:%M:%S'  # target date format
    # Language pair of the site
    language_type = "zh2zh"  # Chinese to Chinese; used when calling the translation API
    # Marker that tells the middleware to use the simulated browser
    meta = {'usedSelenium': name, 'dont_redirect': True}

    # Chrome is initialised inside the spider and becomes one of its attributes
    def __init__(self, timeout=40, isLoadImage=True, windowHeight=None, windowWidth=None):
        # Read the parameters from settings.py
        self.mySetting = get_project_settings()
        self.timeout = self.mySetting['SELENIUM_TIMEOUT']
        self.isLoadImage = self.mySetting['LOAD_IMAGE']
        self.windowHeight = self.mySetting['WINDOW_HEIGHT']
        self.windowWidth = self.mySetting['WINDOW_WIDTH']
        # Initialise the Chrome object
        options = webdriver.ChromeOptions()
        options.add_experimental_option('useAutomationExtension', False)  # hide Selenium traits
        options.add_experimental_option('excludeSwitches', ['enable-automation'])  # hide Selenium traits
        options.add_argument('--ignore-certificate-errors')  # ignore certificate errors
        options.add_argument('--ignore-certificate-errors-spki-list')
        options.add_argument('--ignore-ssl-errors')  # ignore SSL errors
        # chrome_options = webdriver.ChromeOptions()
        # chrome_options.binary_location = "E:\\学校的一些资料\\文档\\研二上\\chrome-win64\\chrome.exe"  # replace with the path of your specific Chrome build
        # 1. Create the Chrome or Firefox browser object; this opens a browser window on the machine
        # browser = webdriver.Chrome(executable_path="E:\\chromedriver\\chromedriver", chrome_options=chrome_options)  # Selenium 3 style: driver path first, then the options holding the browser path
        self.browser = webdriver.Chrome(options=options)
        self.browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {  # hide Selenium traits
            "source": """
                Object.defineProperty(navigator, 'webdriver', {
                    get: () => undefined
                })
            """
        })
        if self.windowHeight and self.windowWidth:
            self.browser.set_window_size(self.windowWidth, self.windowHeight)
        self.browser.set_page_load_timeout(self.timeout)  # page-load timeout
        self.wait = WebDriverWait(self.browser, 30)  # timeout when waiting for a specific element
        super(NhcSpider, self).__init__()
        # When the spider_closed signal arrives, call mySpiderCloseHandle to close Chrome
        dispatcher.connect(receiver=self.mySpiderCloseHandle,
                           signal=signals.spider_closed)

    # Signal handler: close the Chrome browser
    def mySpiderCloseHandle(self, spider):
        print("mySpiderCloseHandle: enter ")
        self.browser.quit()

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0],
                             meta=self.meta,
                             callback=self.parse,
                             # errback=self.error
                             )

    # Main entry point: collect all archived article links from the response
    def parse(self, response):
        # locale.setlocale(locale.LC_TIME, 'en_US')  # set the locale to English
        # //*[@id="538034"]/div
        achieve_links = response.xpath('//ul[@class="zxxx_list"]/li/a/@href').extract()
        print("achieve_links", achieve_links)
        for achieve_link in achieve_links:
            full_achieve_link = "http://xxxx.cn" + achieve_link
            print("full_achieve_link", full_achieve_link)
            # Follow each archive link
            yield scrapy.Request(full_achieve_link, callback=self.parse_item, dont_filter=True, meta=self.meta)

        # Pagination: the link text must match the page exactly
        xpath_expression = '//*[@id="page_div"]/div[@class="pagination_index"]/span/a[text()="下一页"]/@href'
        next_page = response.xpath(xpath_expression).extract_first()
        print("next_page = ", next_page)
        if next_page is not None:
            full_next_page = "http://xxxx/" + next_page
            print("full_next_page", full_next_page)
            # The pagination meta must differ from the article-request meta
            meta_page = {'usedSelenium': self.name, "whether_wait_id": True}
            yield scrapy.Request(full_next_page, callback=self.parse, dont_filter=True, meta=meta_page)

    # Extract the content of each article and store it in an item
    def parse_item(self, response):
        source_url = response.url
        title_o = response.xpath('//div[@class="tit"]/text()').extract_first().strip()
        # title_t = my_tools.get_trans(title_o, "de2zh")
        publish_time = response.xpath('//div[@class="source"]/span[1]/text()').extract_first()
        date_object = datetime.strptime(publish_time, self.site_date_format)  # parse using the page's date format
        date_object = date_object.strftime(self.date_format)  # convert to a string in the target format
        publish_time = datetime.strptime(date_object, self.date_format)  # parse that string back into a datetime
        # content_o is a list of strings; join it if a single string is needed
        content_o = [content.strip() for content in response.xpath('//div[@id="xw_box"]//text()').extract()]
        # content_o = ' '.join(content_o)
        # content_t = my_tools.get_trans(content_o, "de2zh")

        print("source_url:", source_url)
        print("title_o:", title_o)
        # print("title_t:", title_t)
        print("publish_time:", publish_time)  # 15.01.2008
        print("content_o:", content_o)
        # print("content_t:", content_t)
        print("-" * 50)

        page_data = {
            'source_url': source_url,
            'title_o': title_o,
            # 'title_t': title_t,
            'publish_time': publish_time,
            'content_o': content_o,
            # 'content_t': content_t,
            'org': self.org,
            'org_e': self.org_e,
        }
        item = SwItem()  # a fresh item per article, so yielded items do not share state
        item['url'] = page_data['source_url']
        item['title'] = page_data['title_o']
        # item['title_t'] = page_data['title_t']
        item['time'] = page_data['publish_time']
        item['content'] = page_data['content_o']
        # item['content_t'] = page_data['content_t']
        # Current time
        current_time = datetime.now()
        # Format it as a string
        formatted_time = current_time.strftime(self.date_format)
        # Convert the string back into a datetime object
        datetime_object = datetime.strptime(formatted_time, self.date_format)
        item['scrapy_time'] = datetime_object
        item['org'] = page_data['org']
        item['trans_org'] = page_data['org_e']
        yield item
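
Two steps in the spider are sensitive to small differences in the page: parsing the publication date, which depends on the exact whitespace around it, and building absolute links by string concatenation. The sketch below shows a more tolerant alternative using only the standard library. `parse_publish_time` and `absolute_link` are hypothetical helpers, and the sample text and `example.com` URLs are assumptions for illustration, not taken from the real site.

```python
import re
from datetime import datetime
from urllib.parse import urljoin

# Matches a YYYY-MM-DD date anywhere in the text, whatever surrounds it.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_publish_time(raw_text):
    """Extract the first YYYY-MM-DD date from raw page text, or return None."""
    match = DATE_PATTERN.search(raw_text or "")
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return datetime(year, month, day)


def absolute_link(page_url, href):
    """Resolve an href (absolute path or relative) against the current page URL."""
    return urljoin(page_url, href)


raw = "发布时间：\n \t2024-01-06\n "
published = parse_publish_time(raw)
link = absolute_link("http://example.com/list/index.shtml", "/xw/202401/a.shtml")
print(published, link)
```

Inside a Scrapy callback, `response.urljoin(href)` does the same resolution against the response URL.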

The middleware: middlewares.py

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class SwSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


class SwDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


# -*- coding: utf-8 -*-
# Using Selenium
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from scrapy.http import HtmlResponse
from logging import getLogger
import time


class SeleniumMiddleware():
    # The middleware receives the spider object, which gives access to the
    # Chrome instance created in the spider's __init__
    def process_request(self, request, spider):
        '''
        Fetch the page with Chrome
        :param request: the Request object
        :param spider: the Spider object
        :return: an HtmlResponse
        '''
        print(f"chrome is getting page = {request.url}")
        # The marker in meta decides whether this request is fetched with Selenium
        usedSelenium = request.meta.get('usedSelenium', None)  # None if the key is absent
        if usedSelenium == "1000_nhc":
            try:
                spider.browser.get(request.url)
                time.sleep(4)
                if request.meta.get('whether_wait_id', False):  # False if the key is absent
                    print("Waiting for the pagination element to appear...")
                    # Use WebDriverWait to wait until the page has finished loading
                    wait = WebDriverWait(spider.browser, 20)  # wait at most 20 seconds
                    # Example: wait for one element on the page; adjust to the actual page
                    wait.until(EC.presence_of_element_located((By.ID, "page_div")))  # continue only once pagination has loaded
            except TimeoutException:
                # The element never appeared: schedule the request again.
                # Returning a Request from process_request makes Scrapy reschedule it.
                print("Timeout waiting for element. Retrying the request.")
                return request.replace(dont_filter=True)
            except Exception as e:
                print(f"chrome getting page error, Exception = {e}")
                return HtmlResponse(url=request.url, status=500, request=request)
            else:
                time.sleep(4)
                # The page was fetched: build a successful response
                # (HtmlResponse is a subclass of Response)
                return HtmlResponse(url=request.url,
                                    body=spider.browser.page_source,
                                    request=request,
                                    # ideally set according to the page's actual encoding
                                    encoding='utf-8',
                                    status=200)

            # try:
            #     spider.browser.get(request.url)
            #     # Has the search box appeared?
            #     input = spider.wait.until(
            #         EC.presence_of_element_located((By.XPATH, "//div[@class='nav-search-field ']/input"))
            #     )
            #     time.sleep(2)
            #     input.clear()
            #     input.send_keys("iphone 7s")
            #     # Press Enter to search
            #     input.send_keys(Keys.RETURN)
            #     # Have the search results appeared?
            #     searchRes = spider.wait.until(
            #         EC.presence_of_element_located((By.XPATH, "//div[@id='resultsCol']"))
            #     )
            # except Exception as e:
            #     print(f"chrome getting page error, Exception = {e}")
            #     return HtmlResponse(url=request.url, status=500, request=request)
            # else:
            #     time.sleep(3)
            #     # The page was fetched: build a successful response
            #     return HtmlResponse(url=request.url,
            #                         body=spider.browser.page_source,
            #                         request=request,
            #                         # ideally set according to the page's actual encoding
            #                         encoding='utf-8',
            #                         status=200)
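
Retrying a timed-out request without a limit can loop forever if the awaited element never appears. One possible guard is a retry counter carried in the request's meta. The sketch below shows only the counting logic on a plain dict; `next_retry_meta` and the `selenium_retries` key are hypothetical names, not part of Scrapy or of the middleware above.

```python
def next_retry_meta(meta, max_retries=3, key="selenium_retries"):
    """Return a copy of meta with the retry counter incremented, or None to give up."""
    retries = meta.get(key, 0)
    if retries >= max_retries:
        return None
    updated = dict(meta)  # copy, so the original request's meta is not mutated
    updated[key] = retries + 1
    return updated


# Walk through the retries a single request would go through.
meta = {"usedSelenium": "1000_nhc", "whether_wait_id": True}
history = []
while meta is not None:
    history.append(meta.get("selenium_retries", 0))
    meta = next_retry_meta(meta, max_retries=3)
print(history)
```

In a downloader middleware, a non-None result could be passed to `request.replace(meta=..., dont_filter=True)` and returned from `process_request`; a None result would be the point to return an error response instead.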

Appendix: Selenium tutorials

Reference 1, how Selenium waits for a specific element: https://selenium-python-zh.readthedocs.io/en/latest/waits.html
Reference 2, Selenium usage in detail: https://pythondjango.cn/python/tools/7-python_selenium/#%E5%85%83%E7%B4%A0%E5%AE%9A%E4%BD%8D%E6%96%B9%E6%B3%95
Reference 3, someone else's worked example: https://blog.csdn.net/zwq912318834/article/details/79773870