JS-rendered pages: the content shown on the page is generated by JavaScript, so you can see it in the browser, but it never appears in the raw HTML source.
1. Note: add standard anti-scraping countermeasures to your code
If you skip them and the site has anti-scraping defenses (for example, blocking frequent access), you will eventually find that you can't fetch any data at all.
1.1 Spoof the request headers: take this one step further and randomize them; the key field is User-Agent.
Where to find a User-Agent: browser DevTools (F12) → Network tab → any request → Request Headers.
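A minimal sketch of the idea (the two User-Agent strings are taken from the longer agent_list used in the full scripts below):

import random

# Pool of User-Agent strings copied from real browsers
agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
]

headers = {'User-Agent': random.choice(agent_list)}  # pick a random UA for each request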
1.2 Forge the request cookie: this can of course be randomized as well.
Where to find it in the page: DevTools (F12) → Network tab → the request's Cookie header.
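A minimal sketch, assuming you have captured a few valid cookie strings from your own browser sessions (the values below are placeholders):

import random

# Placeholder cookie strings; replace with cookies captured from your own sessions
cookie_list = [
    'HMF_CI=xxxxxxxxxx',
    'HMF_CI=yyyyyyyyyy',
]

headers = {'Cookie': random.choice(cookie_list)}  # rotate the cookie per request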
1.3 Use proxy IPs (I didn't do this here; it isn't necessary for this site, and I haven't studied it in depth)
Proxy IPs are used to defeat anti-scraping. (Free proxies are unreliable; paid ones are better. Some charge per request, some per time period; choose based on your own needs.)
What does this mean? Every request looks like it comes from a different region: the first time my IP address is in Hebei, the second time Guangdong, the third time the US, and so on. Like this:
import json
import random
import requests

def get_ip_pool():
    """Fetch one proxy ip:port from a proxy-pool API."""
    url_api = 'API address of your proxy-IP provider'  # placeholder: fill in your provider's URL
    try:
        r = requests.get(url_api)
        print('proxy API status code:', r.status_code)
        print('proxy API response:', r.text)
        res_json = json.loads(r.text)
        ip_pool = random.choice(res_json['RESULT'])  # response layout depends on your provider
        ret = str(ip_pool['ip']) + ':' + str(ip_pool['port'])
        print('got proxy ip -> ', ret)
        return ret
    except Exception as e:
        print('get_ip_pool except:', str(e))

proxies = get_ip_pool()  # fetch a proxy ip
requests.get(url=url, headers=headers, proxies={'https': proxies})  # send the request through the proxy
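With a paid proxy pool you would typically call get_ip_pool() before each request (or after a failed one) so that every request leaves through a different exit IP; note that the response layout assumed above ('RESULT', 'ip', 'port') depends entirely on your provider's API.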
1.4 Wait a random interval between requests
Avoid integer waits like sleep(1) or sleep(3); they look like a machine at a glance.
Again: make the scraper behave more like a human!
time.sleep(random.uniform(0.5, 3))  # wait a random 0.5-3 seconds
You don't need to apply all four techniques; it depends on whether the target site has anti-scraping defenses. In most cases points 1 and 2 are enough.
2. Example: scraping Shuangseqiu (double-color ball) lottery data
Official site: 陽光開獎 (www.cwl.gov.cn)
After some digging: the data is fetched from an API and these page elements are then generated by JS.
Inspecting the elements shows the data is there (the elements are generated by JS).
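As a quick sanity check, you can request the endpoint directly and confirm it returns JSON. A minimal sketch (the URL is the same endpoint used in the full scripts below; the site may additionally require the Referer and Cookie headers shown there):

import requests

url = ('http://www.cwl.gov.cn/cwl_admin/front/cwlkj/search/kjxx/findDrawNotice'
       '?name=ssq&issueCount=&issueStart=&issueEnd=&dayStart=&dayEnd='
       '&pageSize=30&week=&systemType=PC&pageNo=1')
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
print(r.json()['result'][0])  # a draw record here confirms the data comes from this API, not the HTML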
2.1 Export all fields to a spreadsheet, without filtering anything
# Scrape Shuangseqiu historical data
# coding: utf-8
import requests
import json
import random
import time
import pandas as pd

data_list = []
num_pages = 1  # how many pages of data to scrape
# Collect every record, then save them to Excel via a DataFrame
for page in range(1, num_pages + 1):
    url = 'http://www.cwl.gov.cn/cwl_admin/front/cwlkj/search/kjxx/findDrawNotice?name=ssq&issueCount=&issueStart=&issueEnd=&dayStart=&dayEnd=&pageSize=30&week=&systemType=PC&pageNo='
    url2 = url + str(page)
    # Request headers. The key item is User-Agent: keep an agent_list and pick one at random for each request, like this:
    agent_list = [
        "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10"
    ]
    headers = {
        'User-Agent': random.choice(agent_list),  # pick a random agent for this request
        # 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299',
        'Referer': 'http://www.cwl.gov.cn/ygkj/wqkjgg/ssq/',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cookie': 'HMF_CI=xxxxxxxxxxxxxxxxxxxxxxxxx'  # fill in your own Cookie
    }
    # To route through a proxy IP, uncomment and fill in:
    # proxies = {
    #     'http': 'http://10.10.1.10:3128',
    #     'https': 'http://10.10.1.10:1080',
    # }
    # wbdata = requests.get(url2, headers=headers, proxies=proxies).text
    wbdata = requests.get(url2, headers=headers).text
    time.sleep(random.uniform(0.5, 3))  # wait a random 0.5-3 seconds between requests
    data = json.loads(wbdata)  # parse the JSON response into Python objects
    news = data['result']
    data_list.extend(news)  # collect this page's draw records

df = pd.DataFrame(data_list)  # one row per draw, keeping every field
df.to_excel("雙色球歷史數(shù)據(jù).xlsx")
print('done')
The result is exported to Excel with every field the API returns.
2.2 Keep only the fields you need: here, for example, just the draw number, red balls and blue ball ("code", "red", "blue")
# Scrape Shuangseqiu historical data
# coding: utf-8
import requests
import json
import random
import time
import pandas as pd

data_list = []
pageNo = 1     # how many pages of data to scrape
pageSize = 30  # records per page
# Keep only the draw number, red balls and blue ball ("code", "red", "blue")
columns = ["code", "red", "blue"]
df = pd.DataFrame(columns=columns)
for page in range(1, pageNo + 1):
    url = 'http://www.cwl.gov.cn/cwl_admin/front/cwlkj/search/kjxx/findDrawNotice?name=ssq&issueCount=&issueStart=&issueEnd=&dayStart=&dayEnd=&pageSize=' + \
        str(pageSize) + '&week=&systemType=PC&pageNo='
    url2 = url + str(page)
    # Request headers. The key item is User-Agent: keep an agent_list and pick one at random for each request, like this:
    agent_list = [
        "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10"
    ]
    headers = {
        'User-Agent': random.choice(agent_list),  # pick a random agent for this request
        # 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299',
        'Referer': 'http://www.cwl.gov.cn/ygkj/wqkjgg/ssq/',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cookie': 'HMF_CI=xxxxxxxxxx'  # fill in your own Cookie
    }
    wbdata = requests.get(url2, headers=headers).text
    time.sleep(random.uniform(0.5, 3))  # wait a random 0.5-3 seconds between requests
    data = json.loads(wbdata)
    news = data['result']
    # Keep only the wanted fields; the comprehension already yields plain dicts,
    # so no json.dumps/json.loads round trip is needed
    new_response = [{key: x[key] for key in columns} for x in news]
    for n, arr in enumerate(new_response):
        index = n + (page - 1) * pageSize  # running row index across pages
        df.loc[index] = arr  # insert one row at a time
df.to_excel("雙色球歷史數(shù)據(jù)2.xlsx")
print('done')
All of the above fetches a single page of data. What if the source site paginates its data and you want several pages?
2.3 Scraping multiple pages of data
# Scrape Shuangseqiu historical data
# coding: utf-8
import requests
import json
import random
import time
import pandas as pd

data_list = []
pageNo = 2     # how many pages of data to scrape
pageSize = 30  # records per page
# Keep only the draw number, red balls and blue ball ("code", "red", "blue")
columns = ["code", "red", "blue"]
df = pd.DataFrame(columns=columns)
for page in range(1, pageNo + 1):
    url = 'http://www.cwl.gov.cn/cwl_admin/front/cwlkj/search/kjxx/findDrawNotice?name=ssq&issueCount=&issueStart=&issueEnd=&dayStart=&dayEnd=&pageSize=' + \
        str(pageSize) + '&week=&systemType=PC&pageNo='
    url2 = url + str(page)
    # Request headers. The key item is User-Agent: keep an agent_list and pick one at random for each request, like this:
    agent_list = [
        "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10"
    ]
    headers = {
        'User-Agent': random.choice(agent_list),  # pick a random agent for this request
        # 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299',
        'Referer': 'http://www.cwl.gov.cn/ygkj/wqkjgg/ssq/',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cookie': 'xxxxxxxxx'  # fill in your own Cookie
    }
    wbdata = requests.get(url2, headers=headers).text
    time.sleep(random.uniform(0.5, 3))  # wait a random 0.5-3 seconds between requests
    data = json.loads(wbdata)
    news = data['result']
    # Keep only the wanted fields
    new_response = [{key: x[key] for key in columns} for x in news]
    # Append this page's records to data_list. extend() appends every element of the
    # new list to the end of the existing one (growing the original list); append()
    # would add the whole list as a single element, which is not what we want.
    data_list.extend(new_response)
    print('--- 1. fetched page ' + str(page) + ' ---')
    # print(data_list)

for n, arr in enumerate(data_list):
    df.loc[n] = arr  # insert one row at a time
df.to_excel("雙色球歷史數(shù)據(jù).xlsx")
print(df.head())
print('Export complete: 雙色球歷史數(shù)據(jù).xlsx')
A closing note: if you don't know the programming language at all, writing code with ChatGPT is still fairly hard; for someone who can already program, ChatGPT works very well.