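Note: everything in this record was done on Windows.
Background: my doctoral supervisor asked me to survey the experts in a particular field in China, specifically which associate professors and professors have led National Natural Science Foundation of China (NSFC) projects.
So I went to NSFC (https://kd.nsfc.cn/) to download the full history of funded project records. Doing that by hand was far too much work, so I semi-automated it.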
Preparation
1. The Python Selenium library
2. The Edge or Chrome browser
1. Start the browser with remote debugging enabled
- Whichever browser you use, it has to be started on its own from a terminal in remote debugging mode.
- To do that, append these flags: --remote-debugging-port=9222 --user-data-dir="D:\selenium\AutomationProfile"
Run the command from the browser's installation directory; otherwise type the full path to the executable.
(1) Edge
.\msedge.exe --remote-debugging-port=9222 --user-data-dir="D:\selenium\AutomationProfile"
(2) Chrome
.\chrome.exe --remote-debugging-port=9222 --user-data-dir="D:\selenium\AutomationProfile"
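2. Run the Python code
(1) Start the browser first, then run the code
- The steps above must come first: only after the browser has been started with its remote debugging port open can the code below control it.
- The key line is add_experimental_option("debuggerAddress", "127.0.0.1:9222").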
from selenium import webdriver
from selenium.webdriver.edge.options import Options


class Test:
    def edge(self):
        # Path to the Edge executable; only needed if the script launches the browser itself
        edge_driver_path = r'C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe'
        chrome_options = Options()
        # chrome_options.binary_location = edge_driver_path  # path to the browser binary
        # Attach to the running browser; 9222 in "127.0.0.1:9222" is the port it was started with
        chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
        # chrome_options.add_experimental_option('detach', True)  # keep the browser open after the script ends
        driver = webdriver.Edge(options=chrome_options, keep_alive=True)
        driver.implicitly_wait(10)  # wait up to 10 s when looking up page elements
        self.driver = driver

    def chrome_drive(self, drive='chrome'):
        # Chrome installation directory; only needed if the script launches the browser itself
        edge_driver_path = r'D:\Program Files\Google\Chrome\Application'
        if drive == 'chrome':
            chrome_options = webdriver.ChromeOptions()
            # chrome_options.binary_location = edge_driver_path  # path to the browser binary
            # chrome_options.add_experimental_option('detach', True)  # keep the browser open after the script ends
            chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
            driver = webdriver.Chrome(options=chrome_options, keep_alive=False)
            driver.implicitly_wait(10)  # wait up to 10 s when looking up page elements
            self.driver = driver
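(2) Starting the browser from code
- In this case the commented-out line .binary_location = edge_driver_path becomes the key.
- This approach also requires downloading the matching driver executable (.exe).
- The first time I tried Selenium, on a laptop, I downloaded the driver. Later, running the same code on a desktop PC, I found that no driver download was needed at all.
- Starting the browser in debugging mode from a terminal beforehand is enough. (The driver download was a detour and a pitfall.)
- If the browser's debugging mode is started from code instead, you have to configure the paths and make sure the browser keeps running after the program exits, which is a hassle.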
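For readers who still want to start the browser from Python, here is a minimal sketch using only the standard library. The helper names `build_debug_command` and `launch_browser` are my own and purely illustrative; the flags are the same two used in the terminal commands earlier. The script does not wait on the child process, so the browser should normally stay open after the script finishes, although I have not checked this on every setup.

```python
import subprocess


def build_debug_command(browser_exe, port=9222,
                        profile_dir=r"D:\selenium\AutomationProfile"):
    """Return the argument list that starts a browser in remote debugging mode."""
    return [
        str(browser_exe),
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",
    ]


def launch_browser(browser_exe, port=9222,
                   profile_dir=r"D:\selenium\AutomationProfile"):
    """Start the browser as a separate process and return the Popen handle.

    Hypothetical helper: it only spawns the process and never waits on it.
    """
    return subprocess.Popen(build_debug_command(browser_exe, port, profile_dir))
```

Passing the arguments as a list avoids the quoting problems that curly quotes or spaces in the profile path cause on the command line.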
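(3) Bug notes
1) Python can read the titles of all browser tabs, but locating page elements fails
- Some pages resist scraping. In particular, when the F12 developer tools are open on the page, elements cannot be found.
- Closing the developer tools fixes it.
2) The browser is running, but the Python program cannot connect to it for automated control
- Close the existing browser and reopen it with the flags: --remote-debugging-port=9222 --user-data-dir="xxx folder"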
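When the connection fails, it can help to check whether the debugging port is answering at all before involving Selenium. The sketch below is an assumption-laden helper of my own (`debugger_info` is not part of Selenium): it asks the browser's DevTools HTTP endpoint `/json/version` on the port from the commands above and returns `None` when nothing answers.

```python
import http.client
import json
import urllib.request


def debugger_info(host="127.0.0.1", port=9222, timeout=2.0):
    """Return the parsed /json/version reply, or None if the port is not answering."""
    url = f"http://{host}:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a reply that is not JSON
        return None


if __name__ == "__main__":
    info = debugger_info()
    print("debugging port reachable" if info else "debugging port not reachable")
```

If this reports that the port is not reachable, the browser was most likely started without the debugging flags, which matches the fix described above.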
3. Scraping results
3. Full code
The code below mainly implements:
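- Cycling through and selecting browser tabs.
- Scraping author information from Qingta (青塔) search results for "National Natural Science Foundation projects" and saving it to a spreadsheet.
- Scraping author information from NSFC "National Natural Science Foundation projects" and saving it to a spreadsheet.
- Scraping author information for international experts in a given field and saving it to a spreadsheet.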
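Tab selection in the full code works by matching a keyword against tab titles. As a small illustration, the function below does that matching on a plain dict of the shape returned by `get_all_opened_windows` (title mapped to window handle). The name `find_handle_by_title` is hypothetical and the function does not touch the browser; it only shows the lookup.

```python
def find_handle_by_title(titles_to_handles, key):
    """Return the handle of the first tab whose title contains `key`, else None."""
    for title, handle in titles_to_handles.items():
        if key in title:
            return handle
    return None


if __name__ == "__main__":
    # Made-up titles and handles, for illustration only
    tabs = {"Bing": "handle-1", "NSFC portal": "handle-2"}
    print(find_handle_by_title(tabs, "NSFC"))
```

The returned handle could then be passed to `driver.switch_to.window(...)`, which avoids visiting every tab again.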
3.1 Full code including the Excel part
For the full code including the Excel part, see the resource file.
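Since the Excel helper is not reproduced in this post, here is a rough stand-in for trying the scraper without it. It is a sketch under assumptions: the class name `CsvSink`, the generic column names, and the CSV output are all mine, and only the eight-argument `input_data(a, ..., h)` call mirrors how the scraping code feeds records.

```python
import csv


class CsvSink:
    """Hypothetical stand-in for the author's Make_Excel helper (not shown here)."""

    # Generic column names; the real spreadsheet layout is an unknown
    COLUMNS = ["a", "b", "c", "d", "e", "f", "g", "h"]

    def __init__(self, path):
        self.path = path
        self.rows = []

    def input_data(self, a, b, c, d, e, f, g, h):
        """Buffer one record; None becomes an empty cell."""
        values = (a, b, c, d, e, f, g, h)
        self.rows.append(["" if v is None else str(v) for v in values])

    def save(self):
        """Write the header and all buffered rows to a CSV file."""
        # utf-8-sig writes a BOM so Excel on Windows picks up the encoding
        with open(self.path, "w", newline="", encoding="utf-8-sig") as fh:
            writer = csv.writer(fh)
            writer.writerow(self.COLUMNS)
            writer.writerows(self.rows)
```

Buffering rows and writing them in one go keeps the file consistent if a page fails halfway; the trade-off is that nothing is saved until `save()` is called.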
3.2 Full code for the scraping part
import os
import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.edge.options import Options

# Start the browser from a terminal first, for example:
# .\chrome.exe --remote-debugging-port=9222 --user-data-dir="D:\selenium\AutomationProfile"


class Web_Browser:
    def __init__(self, drive='chrome'):
        self.driver = None
        self.opened_windows_dict = None
        # self.edge()
        self.chrome_drive()

    def edge(self):
        """Attach to an Edge instance already running in remote debugging mode."""
        # edge_driver_path = r'D:\Program Files\Google\Chrome\Application\chromedriver.exe'
        edge_driver_path = r'C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe'
        chrome_options = Options()
        # chrome_options.binary_location = edge_driver_path
        # Browser configuration
        # Optionally add a User-Agent to the options:
        # chrome_options.add_argument("--user-agent=windows 10 Edge")
        # 9222 in "127.0.0.1:9222" is the port the browser is running on
        chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
        # chrome_options.add_experimental_option('detach', True)  # keep the browser open after the script ends
        driver = webdriver.Edge(options=chrome_options, keep_alive=True)
        # driver = webdriver.Chrome(options=chrome_options)
        print('===================')
        # driver.get('https://www.baidu.com')
        driver.implicitly_wait(10)
        self.driver = driver

    def chrome_drive(self, drive='chrome'):
        """Attach to a Chrome instance already running in remote debugging mode."""
        edge_driver_path = r'D:\Program Files\Google\Chrome\Application\chromedriver.exe'
        if drive == 'chrome':
            chrome_options = webdriver.ChromeOptions()
            # chrome_options.binary_location = edge_driver_path
            # chrome_options.add_experimental_option('detach', True)  # keep the browser open after the script ends
            chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
            driver = webdriver.Chrome(options=chrome_options, keep_alive=False)
            self.driver = driver
            driver.implicitly_wait(10)
        self.opened_windows_dict = None

    def get_all_opened_windows(self):
        """Return {tab title: window handle} for every open tab."""
        driver = self.driver
        cw = driver.current_window_handle
        res = {}
        # Visit each open tab to read its title
        tabs = driver.window_handles
        for t in tabs:
            driver.switch_to.window(t)
            res[str(driver.title)] = str(t)
        self.opened_windows_dict = res
        driver.switch_to.window(cw)  # return to the tab we started on
        print('Open tabs:')
        for k in res:
            print(f"\t{k}: {res[k]}")
        return res

    def switch_window(self, key):
        """Switch to the first tab whose title contains `key`."""
        driver = self.driver
        cw = driver.current_window_handle
        # Walk through the open tabs and stop on the first matching title
        tabs = driver.window_handles
        for t in tabs:
            driver.switch_to.window(t)
            if key in str(driver.title):
                return  # stay on the matching tab
        # No tab matched: go back to the tab we started on
        driver.switch_to.window(cw)

    def open_new_window(self, driver=None, url=None, delay_t=0.6):
        """Open a new tab; return True if a tab was actually added."""
        driver = self.driver if not driver else driver
        old_handle = driver.window_handles  # handles before opening the new tab
        # driver.find_element("body").send_keys(Keys.CONTROL + 't')  # raises an error when there is no such element
        # driver.execute_script("window.open('','_blank');")  # may be blocked
        driver.switch_to.new_window('tab')
        time.sleep(delay_t)
        if len(driver.window_handles) > len(old_handle):
            if url:
                driver.get(url)  # load the requested page in the new tab
            return True
        # Fallback: open the tab from JavaScript
        driver.execute_script(f"window.open('{url if url else ''}');")
        time.sleep(delay_t)
        if len(driver.window_handles) > len(old_handle):
            return True
        return False

    def func1(self, xlsx):
        """Academic search site (AMiner): scrape up to 50 result pages."""
        for p in range(50):
            # self.switch_window('故障诊断')
            driver = self.driver
            web = driver.find_element(by=By.XPATH, value='//*[@id="search_body"]/div[2]/div[3]/div[1]/div[2]/div[1]/div[3]/div[2]/div/div[2]/div[2]/div/div')
            web1 = web.find_elements(by=By.CLASS_NAME, value='inner-content')
            print('web1 len=', len(web1))
            num = 0
            for i, w in enumerate(web1):
                try:
                    # '//*[@id="search_body"]/div[2]/div[3]/div[1]/div[2]/div[1]/div[3]/div[2]/div/div[2]/div[2]/div/div'
                    # Note: an XPath starting with '//' searches the whole document even when
                    # called on an element, so the row is picked by the index i.
                    a = w.find_element(by=By.XPATH, value=f'//div[{1+i}]/div/div[2]/div[1]/div[1]/div/a/strong/span/span').text
                    try:
                        b = w.find_element(by=By.XPATH, value=f'//div[{1 + i}]/div/div[2]/div[3]/p[2]').text
                        # Keep only the part of the affiliation that names a university
                        school = str(b).split(',')
                        for s in school:
                            if 'university' in s.lower():
                                b = s.strip()
                    except Exception:
                        b = None
                    c = w.find_element(by=By.XPATH, value=f'//div[{1 + i}]/div/div[2]/div[3]/p[1]').text
                    d = None
                    e = None
                    f = None
                    try:
                        h_index = w.find_element(by=By.XPATH, value=f'//div[{1 + i}]/div/div[2]/div[2]/div/span[1]/span[3]').text
                        paper = w.find_element(by=By.XPATH, value=f'//div[{1 + i}]/div/div[2]/div[2]/div/span[2]/span[3]').text
                        cite = w.find_element(by=By.XPATH, value=f'//div[{1 + i}]/div/div[2]/div[2]/div/span[3]/span[3]').text
                        f = f"H-index: {h_index}, papers: {paper}, cites: {cite}"
                    except Exception:
                        pass  # metrics are optional
                    g = None
                    h = w.find_element(by=By.XPATH, value=f'//div[{1 + i}]/div/div[2]/div[1]/div[1]/div/a')
                    h = h.get_attribute('href')
                    # Prefix the site only when the link is relative
                    if h and not h.startswith('http'):
                        h = 'https://www.aminer.cn/' + h
                    print(a, b, c, g)
                    xlsx.input_data(a, b, c, d, e, f, g, h)
                    num += 1
                except Exception:
                    pass  # skip rows that do not match the expected layout
            print('Records:', num)
            # aa = driver.find_elements(by=By.XPATH, value='//*[@id="search_body"]/div[2]/div[3]/div[1]/div[2]/div[1]/div[3]/div[2]/div/div[2]/div[3]/ul/li')
            # aa = aa[-1]
            aa = driver.find_element(by=By.CLASS_NAME, value='ant-pagination-next')
            # v = '#search_body > div.ant-tabs.ant-tabs-top.a-aminer-core-search-index-searchPageTab.ant-tabs-line.ant-tabs-no-animation > div.ant-tabs-content.ant-tabs-content-no-animated.ant-tabs-top-content > div.ant-tabs-tabpane.ant-tabs-tabpane-active > div.a-aminer-core-search-index-componentContent > div.a-aminer-core-search-c-search-component-temp-searchComponent > div.view > div:nth-child(2) > div > div:nth-child(2) > div.paginationWrap > ul > li.ant-pagination-next'
            # aa = driver.find_element(by=By.CSS_SELECTOR, value=v)
            # Create an ActionChains object to perform mouse actions
            action_chains = ActionChains(driver)
            # Move the mouse to the "next page" element and click it
            action_chains.move_to_element(aa).click().perform()
            print(f'Page {p+1} --> page {p+2}')
            try:
                xlsx.make_frame()
                xlsx.save_excel()
            except Exception:
                pass
            time.sleep(5)  # give the next page time to load

    def func2(self, xlsx=None):
        """Qingta (青塔): scrape NSFC project search results."""
        for p in range(50):
            self.switch_window('青塔')
            driver = self.driver
            web = driver.find_element(by=By.XPATH,
                                      value='//*[@id="app"]/div[2]/div[1]/div/div[2]/div[2]/div/div[2]')
            web1 = web.find_elements(by=By.CLASS_NAME, value='list-item')
            print('web1 len=', len(web1))
            num = 0
            for i, w in enumerate(web1):
                # try:
                # //*[@id="app"]/div[2]/div[1]/div/div[2]/div[2]/div/div[2]
                # '//*[@id="app"]/div[2]/div[1]/div/div[2]/div[2]/div/div[2]/div/div[2]/div[2]/div[2]/div[1]/div[2]'
                # //*[@id="app"]/div[2]/div[1]/div/div[2]/div[2]/div/div[2]/div/div[1]/div[2]/div[2]/div[1]/div[1]
                b = w.find_element(by=By.XPATH, value='//div[2]/div[1]/div[1]/div[2]')
                print(b)
                b = b.text
                print('b=', b)
                a = w.find_element(by=By.XPATH, value='//div[2]/div[2]/div[1]/div[2]').text
                print('a=', a)
                c = None
                d = None
                e = w.find_element(by=By.XPATH, value='//div[1]/div[1]').text
                print('e=', e)
                year = w.find_element(by=By.XPATH, value='//div[2]/div[2]/div[2]/div[2]').text
                money = w.find_element(by=By.XPATH, value='//div[2]/div[1]/div[2]/div[2]').text
                print('year=', year, 'money=', money)
                e = f"{e}, approved: {year}, funding: {money}"
                jijin = w.find_element(by=By.XPATH, value='//div[2]/div[3]/div[1]/div[2]').text
                domain = w.find_element(by=By.XPATH, value='//div[2]/div[3]/div[2]/div[2]').text
                print('jijin=', jijin, 'domain=', domain)
                f = f"{jijin}, field: {domain}"
                g = None
                h = None
                print(i, '-----------', i)
                print(a, b, c, d, e, f)
                xlsx.input_data(a, b, c, d, e, f, g, h)
                num += 1
                break  # debugging: stop after the first item
                # except: pass
            print('Records:', num)
            break  # debugging: stop after the first page, so the paging code below is not reached
            aa = driver.find_element(by=By.XPATH, value='//*[@id="app"]/div[2]/div[1]/div/div[2]/div[2]/div/div[3]/button[2]')
            # Create an ActionChains object to perform mouse actions
            action_chains = ActionChains(driver)
            # Move the mouse to the "next page" button and click it
            action_chains.move_to_element(aa).click().perform()
            print(f'Page {p + 1} --> page {p + 2}')
            try:
                xlsx.make_frame()
                xlsx.save_excel()
            except Exception:
                pass
            time.sleep(5)

    def func3(self, xlsx=None):
        """NSFC portal: scrape the project list on the current page."""
        for p in range(50):
            self.switch_window('大数据知识管理服务门户')
            driver = self.driver
            container = driver.find_element(by=By.CLASS_NAME, value='container_list_right')
            print('container==', container)
            # web = driver.find_element(by=By.XPATH,
            #                           value='//*[@id="app"]/div[1]/div[3]/div/div[3]/div[1]/div')
            web = container.find_element(by=By.XPATH, value='//div[1]/div')
            # web1 = web.find_elements(by=By.CLASS_NAME, value='list-item')
            # print('web1 len=', len(web1))
            num = 0
            for i in range(6):  # six rows are read per page
                w = web
                try:
                    # //*[@id="app"]/div[1]/div[3]/div/div[3]/div[1]/div
                    # //*[@id="app"]/div[1]/div[3]/div/div[3]
                    # //*[@id="app"]/div[1]/div[3]/div/div[3]/div[1]/div/div[2]/div[2]/div[1]
                    b = w.find_element(by=By.XPATH, value=f'//div[{i+1}]/div[3]/div[4]/a')
                    b = b.text
                    # print('b=', b)
                    a = w.find_element(by=By.XPATH, value=f'//div[{i+1}]/div[2]/div[4]/a').text
                    # print('a=', a)
                    c = None
                    d = None
                    e = w.find_element(by=By.XPATH, value=f'//div[{i+1}]/div[1]/div[1]/p/a').text
                    # print('e=', e)
                    year = w.find_element(by=By.XPATH, value=f'//div[{i+1}]/div[3]/div[3]').text
                    money = w.find_element(by=By.XPATH, value=f'//div[{i+1}]/div[3]/div[1]').text
                    # print('year=', year, 'money=', money)
                    e = f"{e}, {year}, {money}"
                    jijin = w.find_element(by=By.XPATH, value=f'//div[{i+1}]/div[2]/div[3]').text
                    domain = w.find_element(by=By.XPATH, value=f'//div[{i+1}]/div[2]/div[1]').text
                    # print('jijin=', jijin, domain)
                    f = f"{jijin}, {domain}"
                    g = None
                    h = None
                    print(i+1, '-----------', i+1)
                    print(a, b, c, d, e, f)
                    xlsx.input_data(a, b, c, d, e, f, g, h)
                    num += 1
                    # break
                except Exception:
                    pass  # skip rows that do not match the expected layout
            print('Records:', num)
            # break
            # Automatic paging is disabled here; the page is turned by hand.
            # aa = driver.find_element(by=By.CLASS_NAME, value='btn-next')
            # # Create an ActionChains object to perform mouse actions
            # action_chains = ActionChains(driver)
            # # Move the mouse to the "next page" button and click it
            # action_chains.move_to_element(aa).click().perform()
            print(f'Page {p + 1} --> page {p + 2}')
            try:
                xlsx.make_frame()
                xlsx.save_excel()
            except Exception:
                pass
            break  # only the current page is processed per call
            # time.sleep(5)

    def func4(self, xlsx=None, key='Google2'):
        """Search each person from the spreadsheet on Google or Bing."""
        # Any key other than 'Google' (including the default) selects Bing
        if key == 'Google':
            self.switch_window('Google')
        else:
            self.switch_window('必应')
        driver = self.driver
        data = xlsx.read_excel()
        # print(data['姓名'])
        for i, name in enumerate(data['姓名']):  # '姓名' = name column
            school = data['学校'][i]  # '学校' = school column
            text = f'{school}{name}是不是教授'  # query: "is {name} of {school} a professor"
            print(f'search [{i+1}]: {name} -> ', text)
            # Locate the search box
            if key == 'Google':
                web = driver.find_element(by=By.XPATH, value='//*[@id="APjFqb"]')
            else:
                web = driver.find_element(by=By.XPATH, value='//*[@id="sb_form_q"]')
            web.clear()
            web.send_keys(text)
            # Locate and click the search button
            if key == 'Google':
                web = driver.find_element(by=By.XPATH, value='//*[@id="tsf"]/div[1]/div[1]/div[2]/button')
            else:
                web = driver.find_element(by=By.XPATH, value='//*[@id="sb_form_go"]')
            # try:
            web.click()
            # except: pass
            time.sleep(5)  # time to read the results by hand before the next query


if __name__ == '__main__':
    # temp is the author's own module containing the Excel helper
    from temp import Make_Excel, input_data_list, input_data
    xlsx = Make_Excel()
    web = Web_Browser()
    web.get_all_opened_windows()
    # web.switch_window('故障诊断')
    web.func1(xlsx)  # academic search site (AMiner)
    # web.func2(xlsx)  # Qingta
    # web.func3(xlsx)  # NSFC portal
    # web.func4(xlsx)  # Google / Bing search
    # xlsx.make_frame()
    # xlsx.save_excel()
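The affiliation handling inside `func1` (split on commas, keep the part that mentions a university) is easy to try on its own. Below is a small variant as a pure function; the name `pick_university` is hypothetical, and unlike the loop in `func1` it returns the first match and gives `None` when no part mentions a university.

```python
def pick_university(affiliation):
    """Return the first comma-separated part that mentions 'university', else None."""
    if not affiliation:
        return None
    for part in affiliation.split(","):
        if "university" in part.lower():
            return part.strip()
    return None


if __name__ == "__main__":
    # Made-up affiliation string, for illustration only
    print(pick_university("Dept. of Mechanical Engineering, Example University, City"))
```

Keeping this logic outside the Selenium loop makes it possible to test it with plain strings, without a browser attached.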