
[Python Crawler] Designing Your Own Crawler, Part 4: Wrapping the Browser Simulator Selenium

Posted 2 years ago · Author: loyd3 · Category: Toy Blog · Views: 90


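Some automation tools can retrieve the source of a page as the browser currently renders it, and a crawler can use that to scrape the page.
The common choices are Selenium, Playwright, and pyppeteer. Because their usage overlaps in many ways, it makes sense to wrap them behind a single API.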

Start with the base class:

class BrowserSimulateBase:
    def __init__(self):
        pass

    def start_browser(self, is_headless=False, is_cdp=False, is_dev=False, proxy=None, is_socks5=False, *args, **kwargs):
        """
        Start the browser.

        Args:
            is_headless (bool, optional): Run in headless mode. Defaults to False.
            is_cdp (bool, optional): Use the Chrome DevTools Protocol. Defaults to False.
            is_dev (bool, optional): Enable debug mode. Defaults to False.
            proxy (str, optional): Proxy address. Defaults to None.
            is_socks5 (bool, optional): Treat the proxy as a SOCKS5 proxy. Defaults to False.
            *args, **kwargs: Additional backend-specific arguments.

        Raises:
            NotImplementedError: Subclasses must implement this method.
        """
        raise NotImplementedError

    # Open a page
    def start_page(self, url):
        raise NotImplementedError

    # Explicit wait for an element
    def wait_until_element(self, selector_location, timeout=None, selector_type=None):
        raise NotImplementedError

    # Wait for a fixed time
    def wait_time(self, timeout):
        raise NotImplementedError

    # Wait for a fixed time (the name that SeleniumSimulate implements)
    def wait_for_time(self, timeout):
        raise NotImplementedError

    # Find multiple elements
    def find_elements(self, selector_location, selector_type=None):
        raise NotImplementedError

    # Find a single element
    def find_element(self, selector_location, selector_type=None):
        raise NotImplementedError

    # Type text into an input element
    def send_keys(self, selector_location, input_content, selector_type=None):
        raise NotImplementedError

    # Execute a JavaScript command
    def execute_script(self, script_command):
        raise NotImplementedError

    # Browser back
    def go_back(self):
        raise NotImplementedError

    # Browser forward
    def go_forward(self):
        raise NotImplementedError

    # Get cookies
    def get_cookies(self):
        raise NotImplementedError

    # Add a cookie
    def add_cookie(self, cookie):
        raise NotImplementedError

    # Delete all cookies
    def del_cookies(self):
        raise NotImplementedError

    # Switch tabs
    def switch_tab(self, tab_index):
        raise NotImplementedError

    # Reload the page
    def reload_page(self):
        raise NotImplementedError

    # Take a screenshot
    def screen_page(self, file_path=None):
        raise NotImplementedError

    # Close the browser
    def close_browser(self):
        raise NotImplementedError

    # Get the page content
    def get_content(self):
        raise NotImplementedError

    # Click an element
    def click(self, selector_location, selector_type=None):
        raise NotImplementedError

    # Drag and drop
    def drag_and_drop(self, source_element, target_element):
        raise NotImplementedError

    # Switch to an iframe
    def to_iframe(self, frame):
        raise NotImplementedError
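
Because every backend implements the same method names, calling code can be written once against the base class. The sketch below illustrates this with a hypothetical in-memory backend, `FakeSimulate`, and a hypothetical helper, `fetch_html`. Neither comes from the original project, and the base class is trimmed to three methods so that the block runs on its own.

```python
class BrowserSimulateBase:
    """Trimmed copy of the interface: only the methods this sketch uses."""

    def start_page(self, url):
        raise NotImplementedError

    def get_content(self):
        raise NotImplementedError

    def close_browser(self):
        raise NotImplementedError


class FakeSimulate(BrowserSimulateBase):
    """Hypothetical backend that serves canned HTML instead of driving a browser."""

    def __init__(self, pages):
        self.pages = pages    # mapping of url -> html
        self.current = None   # url of the page that is "open"
        self.closed = False

    def start_page(self, url):
        self.current = url

    def get_content(self):
        return self.pages.get(self.current, "")

    def close_browser(self):
        self.closed = True


def fetch_html(simulator, url):
    """Works with any BrowserSimulateBase subclass, real or fake."""
    try:
        simulator.start_page(url)
        return simulator.get_content()
    finally:
        # Always release the backend, even if loading the page fails
        simulator.close_browser()


fake = FakeSimulate({"https://example.com": "<html><body>hello</body></html>"})
html = fetch_html(fake, "https://example.com")
print(html)
```

Swapping `FakeSimulate` for a Selenium, Playwright, or pyppeteer backend should leave `fetch_html` unchanged, which is the point of the shared interface.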

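Selenium is an automated testing tool. It can drive a browser through specific actions and can also return the source of the page as currently rendered, so what you see is what you can crawl. This approach is very effective for pages rendered dynamically by JavaScript: when Selenium drives the browser to load a page, you get the JavaScript-rendered result directly.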
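
A minimal sketch of that idea, assuming Chrome and a matching driver are available locally. `fetch_rendered_html` and `demo` are hypothetical names introduced here. The function relies only on the `get` method and the `page_source` attribute of a Selenium WebDriver, so any object with those two members can stand in for a real browser.

```python
def fetch_rendered_html(driver, url):
    """Load url in the given WebDriver and return the DOM as currently rendered."""
    driver.get(url)
    # page_source reflects the page after the browser has run its JavaScript so far
    return driver.page_source


def demo():
    """Run against a real Chrome; needs the selenium package and a local Chrome."""
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        html = fetch_rendered_html(driver, "https://example.com")
        print(len(html))
    finally:
        # quit() ends the browser session even if loading failed
        driver.quit()
```

Content that a page loads late may still be missing when `page_source` is read, which is why the wrapper class also offers an explicit wait.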

Below is the wrapper class:

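import os
import time

from selenium import webdriver
from selenium.common.exceptions import NoAlertPresentException, NoSuchElementException
from selenium.webdriver import ActionChains, ChromeOptions
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

# Not shown in this article: WINDOW_WIDTH, WINDOW_HEIGHT and the selector-type
# constants (ID, XPATH, LINK_TEXT, PARTIAL_LINK_TEXT, NAME, TAG_NAME, CLASS_NAME,
# CSS_SELECTOR) are presumably defined elsewhere in the author's project.


class SeleniumSimulate(BrowserSimulateBase):
    def __init__(self):
        self.browser = None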

    # Start the browser
    # is_headless: run in headless mode
    # is_cdp: use CDP (Chrome DevTools Protocol)
    def start_browser(self, is_headless=False, is_cdp=False, is_dev=False, proxy=None, is_socks5=False, *args,
                      **kwargs) -> webdriver.Chrome:
        """
        Start the Chrome browser.

        Args:
            is_headless (bool, optional): Run in headless mode. Defaults to False.
            is_cdp (bool, optional): Use the Chrome DevTools Protocol. Defaults to False.
            is_dev (bool, optional): Enable debug mode. Defaults to False (unused by this implementation).
            proxy (str, optional): Proxy address. Defaults to None.
            is_socks5 (bool, optional): Treat the proxy as a SOCKS5 proxy. Defaults to False.
            *args, **kwargs: Additional arguments.

        Returns:
            webdriver.Chrome: The started Chrome browser object.
        """
        option = ChromeOptions()
        # Independent checks, so headless mode, the CDP tweaks and a proxy can be combined
        if is_headless:
            option.add_argument('--headless')
        if is_cdp:
            option.add_experimental_option('excludeSwitches', ['enable-automation'])
            option.add_experimental_option('useAutomationExtension', False)
        if proxy:
            scheme = 'socks5' if is_socks5 else 'http'
            option.add_argument(f'--proxy-server={scheme}://{proxy}')

        # Selenium 4 takes the driver path through a Service object
        # (selenium.webdriver.chrome.service.Service)
        self.browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=option)
        if is_cdp:
            # Hide navigator.webdriver from each new document before its scripts run
            self.browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
                'source': 'Object.defineProperty(navigator, "webdriver", {get:()=>undefined})'
            })

        self.browser.set_window_size(WINDOW_WIDTH, WINDOW_HEIGHT)
        return self.browser

    # Open a page
    def start_page(self, url):
        """
        Open the given URL in the browser.

        Parameters:
        url (str): The address to open.

        Returns nothing.
        """
        self.browser.get(url)

    # Explicit wait
    # timeout: the maximum time to wait
    def wait_until_element(self, selector_location, timeout=None, selector_type=None):
        """
        Wait until the given element is present on the page.

        Parameters:
        selector_location (str): Selector of the element to wait for.
        timeout (int, optional): Maximum wait in seconds. A default is used when omitted.
        selector_type (str, optional): Selector type (for example 'css' or 'xpath').

        Returns nothing.
        """
        # WebDriverWait needs a number; 10 seconds is an assumed default when none is given
        wait = WebDriverWait(self.browser, timeout if timeout is not None else 10)
        if selector_type:
            selector_type = self.get_selector_type(selector_type)
        else:
            # No type given: infer it from the selector text itself
            selector_type = self.get_selector_type(identify_selector_type(selector_location))
            selector_location = extract_value_from_selector(selector_location)
        wait.until(EC.presence_of_element_located((selector_type, selector_location)))

    # Map a selector type name to a Selenium locator
    def get_selector_type(self, selector_type):
        """
        Map a custom selector type to the corresponding Selenium locator strategy.

        Parameters:
        selector_type (str): Custom selector type (for example 'css' or 'xpath').

        Returns:
        by_type (selenium.webdriver.common.by.By): The Selenium locator strategy.

        Raises:
        ValueError: If the selector type is not recognised.
        """
        selector_type = selector_type.lower()
        if selector_type == ID:
            by_type = By.ID
        elif selector_type == XPATH:
            by_type = By.XPATH
        elif selector_type == LINK_TEXT:
            by_type = By.LINK_TEXT
        elif selector_type == PARTIAL_LINK_TEXT:
            by_type = By.PARTIAL_LINK_TEXT
        elif selector_type == NAME:
            by_type = By.NAME
        elif selector_type == TAG_NAME:
            by_type = By.TAG_NAME
        elif selector_type == CLASS_NAME:
            by_type = By.CLASS_NAME
        elif selector_type == CSS_SELECTOR:
            by_type = By.CSS_SELECTOR
        else:
            # Without this branch an unknown type would leave by_type unbound
            raise ValueError(f"Unsupported selector type: {selector_type}")
        return by_type

    # Wait for a fixed time
    def wait_for_time(self, timeout):
        """
        Block for the given number of seconds (time.sleep is synchronous).

        Parameters:
        timeout (int): Time to wait, in seconds.

        Returns nothing.
        """
        time.sleep(timeout)

    # Find multiple elements
    def find_elements(self, selector_location, selector_type=None):
        """
        Find multiple elements.

        If selector_type is given it is used directly; otherwise it is inferred
        from selector_location.

        Parameters:
        selector_location (str): Selector of the elements to find.
        selector_type (str, optional): Selector type (for example 'css' or 'xpath').

        Returns:
        elements (list): The matching elements.
        """
        if selector_type:
            selector_type = self.get_selector_type(selector_type)
        else:
            selector_type = self.get_selector_type(identify_selector_type(selector_location))
            selector_location = extract_value_from_selector(selector_location)
        return self.browser.find_elements(selector_type, selector_location)

    # Find a single element
    def find_element(self, selector_location, selector_type=None):
        """
        Find a single element.

        Parameters:
        selector_location (str): Selector of the element to find.
        selector_type (str, optional): Selector type (for example 'css' or 'xpath').

        Returns:
        element (WebElement): The matching element, or None if nothing matches.
        """
        try:
            if selector_type:
                by_type = self.get_selector_type(selector_type)
            else:
                by_type = self.get_selector_type(identify_selector_type(selector_location))
                selector_location = extract_value_from_selector(selector_location)

            element = self.browser.find_element(by_type, selector_location)
            return element
        except NoSuchElementException:
            # Element not found
            print(f"No matching element found: {selector_location}")
            return None  # Alternatively, raise a custom exception or return another default

    # Type text into an input element
    def send_keys(self, selector_location, input_content, selector_type=None):
        """
        Type text into the element at the given selector.

        Parameters:
        selector_location (str): Selector of the element to type into.
        input_content (str): The text to type.
        selector_type (str, optional): Selector type (for example 'css' or 'xpath').

        Returns nothing.
        """
        input_element = self.find_element(selector_location, selector_type)  # Locate the input element
        if input_element:
            input_element.send_keys(input_content)  # Type the text
        else:
            print(f"Element not found: {selector_location}")

    # Execute a JavaScript command
    def execute_script(self, script_command):
        """
        Execute a JavaScript snippet on the current page.

        Parameters:
        script_command (str): The JavaScript to execute.

        Returns:
        The value returned by the script, if any.
        """
        return self.browser.execute_script(script_command)

    # Browser back
    def go_back(self):
        """
        Go back to the previous page in the browser history.

        Returns nothing.
        """
        self.browser.back()

    # Browser forward
    def go_forward(self):
        """
        Go forward to the next page in the browser history.

        Returns nothing.
        """
        self.browser.forward()

    # Get cookies
    def get_cookies(self):
        """
        Get all cookies for the current page.

        Returns:
        cookies (list): All cookies, one dict per cookie.
        """
        return self.browser.get_cookies()

    # Add a cookie
    def add_cookie(self, cookie):
        """
        Add a cookie to the current page.

        Parameters:
        cookie (dict): The cookie to add; it should contain 'name' and 'value' keys.

        Returns nothing.
        """
        self.browser.add_cookie(cookie)

    # Delete all cookies
    def del_cookies(self):
        """
        Delete all cookies for the current page.

        Returns nothing.
        """
        self.browser.delete_all_cookies()

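    # Switch tabs
    def switch_tab(self, tab_index):
        """
        Switch to the given tab in the browser window.

        Parameters:
        tab_index (int): Index of the tab to switch to.

        Returns nothing.
        """
        self.browser.switch_to.window(self.browser.window_handles[tab_index])

    # Reload the page
    def reload_page(self):
        """
        Reload the current page.

        Returns nothing.
        """
        # The WebDriver method is refresh(); there is no reload()
        self.browser.refresh()

    # Take a screenshot
    def screen_page(self, file_path=None):
        """
        Take a screenshot of the current page and save it to the given path.

        Parameters:
        file_path (str, optional): Where to save the screenshot. Defaults to 'screenshot.png' in the current directory.

        Returns nothing.
        """
        # Default to 'screenshot.png' in the current directory
        if not file_path:
            file_path = 'screenshot.png'
        # Get the file extension without the leading dot
        file_extension = os.path.splitext(file_path)[1][1:]
        # Force a .png extension, since the screenshot is saved as PNG
        if file_extension != 'png':
            file_path = os.path.splitext(file_path)[0] + '.png'

        # Take the screenshot and save it
        self.browser.save_screenshot(file_path)

    # Close the browser
    def close_browser(self):
        """
        Close the browser.

        Returns nothing.
        """
        # quit() ends the whole session; close() would only close the current window
        self.browser.quit()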

    # Click an element
    def click(self, selector_location, selector_type=None):
        """
        Click the given element on the page.

        Parameters:
        selector_location (str): Selector of the element to click.
        selector_type (str, optional): Selector type (for example 'css' or 'xpath').

        Returns nothing.
        """
        element = self.find_element(selector_location, selector_type)  # Locate the element to click
        if element:
            element.click()  # Click it
        else:
            print(f"Element not found: {selector_location}")

    # Drag and drop
    def drag_and_drop(self, source_element, target_element):
        """
        Perform a drag-and-drop action on the page.

        Parameters:
        source_element (WebElement): The element to drag.
        target_element (WebElement): The element to drop onto.

        Returns nothing.
        """
        actions = ActionChains(self.browser)  # Create the action chain
        actions.drag_and_drop(source_element, target_element)  # Queue the drag and drop
        actions.perform()  # Run every queued action
        try:
            # Accept an alert if the drag triggered one
            self.browser.switch_to.alert.accept()
        except NoAlertPresentException:
            # From selenium.common.exceptions: no alert appeared, nothing to dismiss
            pass

    # Switch to an iframe
    def to_iframe(self, frame):
        """
        Switch to the given iframe.

        Parameters:
        frame (str or WebElement): The iframe element, or the iframe's name or ID.

        Returns nothing.
        """
        self.browser.switch_to.frame(frame)

    # Get the page content
    def get_content(self):
        """
        Get the content of the current page.

        Returns:
        content (str): The HTML of the current page.
        """
        return self.browser.page_source


selenium_simulate = SeleniumSimulate()
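
`wait_until_element` delegates to Selenium's `WebDriverWait`, which keeps re-checking a condition until it holds or the timeout expires. The pure-Python sketch below shows that polling idea in isolation. `wait_until` and its parameters are hypothetical names, and this is not Selenium's actual implementation.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Usage: a condition that becomes true on its third check
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(ready, timeout=2.0, poll_interval=0.01)
print(len(attempts))
```

Compared with a fixed `time.sleep`, as used by `wait_for_time`, polling returns as soon as the condition holds and fails loudly when it never does.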

The helper functions used above are as follows:

import re


# Infer the selector type from the selector text.
# The returned strings must match the selector-type constants used by get_selector_type.
def identify_selector_type(selector):
    if re.match(r'^#[\w-]+$', selector):
        return 'id'
    elif re.match(r'^[.\w-]+$', selector):
        return 'css'
    elif re.match(r'^(//.*|\(//.*|\*\[contains\(.*\)\]|\*\[@id=\'.*\'\])', selector):
        return 'xpath'
    elif re.match(r'^<[\w-]+>$', selector):
        return 'tag'
    elif re.match(r'^<a.*>.*</a>$', selector):
        return 'link'
    elif re.match(r'.*<a.*>.*</a>.*', selector):
        return 'partial link'
    elif re.match(r'^\[name=[\'\"].*[\'\"]\]$', selector):
        return 'name'
    elif re.match(r'^\[class=[\'\"].*[\'\"]\]$', selector):
        return 'class'
    else:
        return 'unknown'


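# Extract the value to pass to Selenium from the selector text
def extract_value_from_selector(selector):
    # "#main" -> "main" (used with By.ID)
    match = re.match(r'^#([\w-]+)$', selector)
    if match:
        return match.group(1)

    # identify_selector_type reports these as 'css', so keep the selector intact:
    # stripping the leading dot would turn ".item" into the tag selector "item"
    if re.match(r'^[.\w-]+$', selector):
        return selector

    # XPath expressions are passed through as matched
    match = re.match(r'^(//.*|\(//.*|\*\[contains\((.*)\)\]|\*\[@id=\'(.*)\'\])', selector)
    if match:
        return match.group(1)

    # "<div>" -> "div"
    match = re.match(r'^<([\w-]+)>$', selector)
    if match:
        return match.group(1)

    # "<a ...>text</a>" -> "text"
    match = re.match(r'^<a.*>(.*)</a>$', selector)
    if match:
        return match.group(1)

    # Link markup embedded in other text -> the link text
    match = re.match(r'.*<a.*>(.*)</a>.*', selector)
    if match:
        return match.group(1)

    # "[name='q']" -> "q"
    match = re.match(r'^\[name=[\'\"](.*)[\'\"]\]$', selector)
    if match:
        return match.group(1)

    # "[class='item']" -> "item"
    match = re.match(r'^\[class=[\'\"](.*)[\'\"]\]$', selector)
    if match:
        return match.group(1)

    return None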
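
The two helpers repeat the same regular expressions, so they have to be kept in sync by hand. One alternative, sketched here under assumptions, is a single table that yields both the locator strategy and the value. `resolve_selector` and `RULES` are hypothetical names, and only four of the patterns are covered. The strategy strings are the values of Selenium's `By.ID`, `By.XPATH`, `By.NAME` and `By.CSS_SELECTOR`, written as literals so that the block runs without Selenium installed.

```python
import re

# (pattern, locator strategy, capture group holding the value; 0 means the whole match)
RULES = [
    (re.compile(r'^#([\w-]+)$'), 'id', 1),
    (re.compile(r'^\(?//.*$'), 'xpath', 0),
    (re.compile(r'^\[name=[\'"](.*)[\'"]\]$'), 'name', 1),
    (re.compile(r'^[.\w-]+$'), 'css selector', 0),
]


def resolve_selector(selector):
    """Return (strategy, value) for the first matching rule, or raise ValueError."""
    for pattern, strategy, group in RULES:
        match = pattern.match(selector)
        if match:
            return strategy, match.group(group)
    raise ValueError(f"unrecognised selector: {selector!r}")


for s in ['#main', '//div[@id="x"]', "[name='q']", '.item']:
    print(s, '->', resolve_selector(s))
```

With one table, adding a new selector form means adding one row, and an unrecognised selector fails immediately instead of passing `None` on to Selenium.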
