Since many websites have added login verification, we first need a snippet that uses cookies to skip the login captcha.
import pandas as pd
import requests
from lxml import etree
# Get cookies, agent, and headers via Chrome DevTools (F12)
cookies ={'ssxmod_itna2':'eqfx0DgQGQ0QG=DC8DXxxxxx',
'ssxmod_itna':'euitGKD5iIgGxxxxx'}
agent ='Mozilla/5.0 (Windows NT 10.0; Win64; x64)xxxxxxx'
headers = {
'User-Agent' : agent,
'Host':'www.xxx.com',
'Referer':'https://www.xxx.com/'
}
# Create a session
session = requests.session()
session.headers = headers
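Note that session.headers = headers replaces requests' default headers entirely. If you would rather keep defaults such as Accept-Encoding, updating the headers in place is a common alternative (a minimal sketch, not what the original code does):
# update() merges our headers into requests' defaults instead of replacing them
session.headers.update(headers)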
How to get the cookies: in Chrome, open DevTools with F12, find the request under the Network tab, and copy each cookie's Name and Value into the cookies dict above.
How to get the agent: click any network resource and scroll to the bottom of the Headers panel on the right to find the User-Agent.
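Copying cookies one by one gets tedious when there are many. As an optional shortcut (a minimal sketch; cookies_from_header is a hypothetical helper, not part of the original code), you can copy the whole raw Cookie request header from DevTools and split it into a dict:
def cookies_from_header(raw):
    # Turn 'k1=v1; k2=v2' (as copied from DevTools) into a dict
    return dict(pair.split('=', 1) for pair in raw.split('; ') if '=' in pair)

# Example with placeholder values:
# cookies = cookies_from_header('ssxmod_itna2=eqfx0D...; ssxmod_itna=euitGKD...')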
Test whether access succeeds.
# ↓ Test access here; a successful request returns status code 200
requests.utils.add_dict_to_cookiejar(session.cookies, cookies)
url = 'https://www.xxx.com/search-prov/36/3604/p1'
response=session.get(url)
print(response)
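print(response) only shows something like <Response [200]>. To fail fast on a stale cookie (which often yields a 403 or a redirect to the login page), you can check the status explicitly; raise_for_status() is a standard requests call that is a no-op on 200:
response.raise_for_status()  # raises requests.HTTPError on any 4xx/5xx status
print(response.status_code)  # expect 200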
If access succeeds, move on to the next step.
Usually, flipping through a couple of pages and watching how the URL changes is enough to work out the URL pattern.
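On this site only the trailing pN segment appears to change between pages (www.xxx.com is the placeholder host from above); a quick sketch that prints the first few page URLs to confirm the pattern:
for page in range(1, 4):
    # Only the trailing /pN/ segment varies from page to page
    print(f'https://www.xxx.com/search-prov/36/3604/p{page}/')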
#Initialize the DataFrame
df = pd.DataFrame(columns = ['企業(yè)名稱'])
#Observe how the URL changes after flipping pages; fetch 10 pages of data
for k in range(10):
    url = 'https://www.xxx.com/search-prov/36/3604/p' + str(k+1) + '/'
    # The session already carries the headers and cookies attached above
    page_text = session.get(url).text # GET
    #print(page_text)
    tree = etree.HTML(page_text) #parse the HTML
    #XPath that matches the company names
    name = tree.xpath("//div[@class='company-title font-18 font-f6']/a/text()")
    dic = {'企業(yè)名稱':name}
    df1 = pd.DataFrame(dic)
    df = pd.concat([df,df1], axis=0)
    #print(df)
print('All pages scraped successfully')
print(df)
Finally, export the result to a CSV file; the utf-8-sig encoding prevents garbled characters (e.g., when opening the file in Excel).
#Write the df data to a CSV file
df.to_csv('xx企業(yè)名錄.csv',index=None,encoding = 'utf-8-sig')
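As a quick sanity check (optional, not part of the original post), read the file back and confirm the row count:
check = pd.read_csv('xx企業(yè)名錄.csv', encoding = 'utf-8-sig')
print(len(check), 'rows written')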
That wraps up this Python scraping practice project for fetching a regional enterprise directory.