Python Web Scraping Basics (1): A Detailed Guide to the urllib Library

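Series index

Python Web Scraping Basics (1): A Detailed Guide to the urllib Library
Python Web Scraping Basics (2): Parsing Scraped Data with xpath and jsonpath
Python Web Scraping Basics (3): Loading Dynamic Pages with Selenium
Python Web Scraping Basics (4): Using the More Convenient requests Library
Python Web Scraping Basics (5): Using the scrapy Framework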

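I. Using the urllib Library

1. Basic introduction

urllib is part of the Python standard library, so it does not need to be installed separately.

It is used to work with URLs and to fetch and process the content of web pages.

The urllib package contains the following modules:
urllib.request - opens and reads URLs.
urllib.error - contains the exceptions raised by urllib.request.
urllib.parse - parses URLs.
urllib.robotparser - parses robots.txt files.

For scraping, the modules used most are urllib.request and urllib.parse.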
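As a quick offline illustration of the parsing modules listed above, here is a minimal sketch that splits a URL with urllib.parse.urlsplit and checks a rule set with urllib.robotparser. The robots.txt rules and the example.com URLs are made up for the example; a real crawler would load the target site's actual robots.txt.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Split a URL into its components (no network access needed)
parts = urlsplit('https://www.baidu.com/s?wd=python#top')
print(parts.scheme)    # https
print(parts.netloc)    # www.baidu.com
print(parts.path)      # /s
print(parts.query)     # wd=python
print(parts.fragment)  # top

# Hypothetical robots.txt rules, parsed from memory instead of fetched
rules = [
    'User-agent: *',
    'Disallow: /private/',
]
robots = RobotFileParser()
robots.parse(rules)

allowed = robots.can_fetch('*', 'https://example.com/index.html')
blocked = robots.can_fetch('*', 'https://example.com/private/data.html')
print(allowed, blocked)
```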

# Fetch the source of the Baidu home page with urllib.request
import urllib.request

# (1) Define the URL, that is, the address to visit
url = 'http://www.baidu.com'

# (2) Send the request as a browser would; urlopen returns the response
response = urllib.request.urlopen(url)

# (3) Read the page source from the response
# read() returns the body as bytes
# The bytes must be converted to a string
# bytes --> str is decoding: decode('encoding name')
content = response.read().decode('utf-8')

# (4) Print the data; this is the HTML of the requested address
print(content)

2. The response type and its key methods

One type: http.client.HTTPResponse
Six methods: read, readline, readlines, getcode, geturl, getheaders

import urllib.request

url = 'http://www.baidu.com'

# Send the request as a browser would
response = urllib.request.urlopen(url)

# One type and six methods
# response is an HTTPResponse: <class 'http.client.HTTPResponse'>
# print(type(response))

# read() with no argument reads the whole body as bytes
# content = response.read()
# print(content)

# read(n) returns at most n bytes
# content = response.read(5)
# print(content)

# Read one line
# content = response.readline()
# print(content)

# Read all lines into a list
# content = response.readlines()
# print(content)

# Return the status code; 200 means the request succeeded
# print(response.getcode())

# Return the URL that was fetched
# print(response.geturl())

# Return the response headers as a list of (name, value) pairs
# print(response.getheaders())
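The difference between read() and read(n) matters for large responses. The sketch below reads a body in fixed-size chunks. It relies only on the read(n) method, so an in-memory io.BytesIO stands in for the HTTP response here; the helper name read_in_chunks and the chunk size are illustrative choices, not part of urllib.

```python
import io

def read_in_chunks(response, chunk_size=5):
    """Yield successive chunks of at most chunk_size bytes until the body ends."""
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # read() returns b'' once the body is exhausted
            break
        yield chunk

# io.BytesIO stands in for the object returned by urllib.request.urlopen
fake_response = io.BytesIO(b'<html>hello</html>')
chunks = list(read_in_chunks(fake_response))
print(chunks)

body = b''.join(chunks)
print(body.decode('utf-8'))
```

The same loop should work on a real response object, since it also exposes read(n).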

3. Downloading files

import urllib.request

# Download a web page
# url_page = 'http://www.baidu.com'

# url is the address to download from; filename is the name of the local file
# The arguments can be passed by keyword or by position
# urllib.request.urlretrieve(url_page,'baidu.html')

# Download an image
url_img = 'https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png'
# The source is a PNG, so the local file keeps the .png extension
urllib.request.urlretrieve(url= url_img,filename='baidu.png')

# Download a video
url_video = 'https://highlight-video.cdn.bcebos.com/video/6s/b1bc17fc-46dd-11ee-911e-7cd30a602444.mp4?v_from_s=bdapp-landingpage-api-nanjing'
urllib.request.urlretrieve(url_video,'video.mp4')
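When downloading many files it can help to derive the local file name from the URL instead of typing it by hand. The following is a minimal sketch under that assumption; filename_from_url, its default value and the example.com URL are hypothetical and not part of urllib.

```python
import posixpath
from urllib.parse import urlsplit, unquote

def filename_from_url(url, default='download.bin'):
    """Return the last path segment of url, ignoring the query string."""
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    return name or default

img_name = filename_from_url('https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png')
video_name = filename_from_url('https://example.com/video/clip.mp4?v_from_s=demo')
page_name = filename_from_url('http://www.baidu.com')

print(img_name)    # PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png
print(video_name)  # clip.mp4
print(page_name)   # download.bin
```

The returned name can then be passed as the filename argument of urllib.request.urlretrieve.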

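4. GET request examples

(1) Setting request headers (Baidu)

Checking the User-Agent is a very common anti-scraping measure: the server inspects each request for the header values a browser would send and uses them to decide whether the visit was made by a user through a browser.

About the UA: the User Agent (UA for short) is a special string header that lets the server identify the client's operating system and version, CPU type, browser and version, browser kernel, rendering engine, browser language, browser plugins, and so on.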

import urllib.request

url = 'https://www.baidu.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}


# Customize the request object and set the headers by hand
# url and headers must be passed by keyword: the second positional parameter of Request is data, not headers
request = urllib.request.Request(url=url,headers=headers)

response = urllib.request.urlopen(request)

content = response.read().decode('utf8')

print(content)
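To see why the keyword arguments matter, a Request object can be inspected without sending anything. The sketch below builds one request correctly and one with positional arguments; the User-Agent value is a shortened placeholder, not a real browser string.

```python
import urllib.request

url = 'https://www.baidu.com'
headers = {'User-Agent': 'Mozilla/5.0 (illustrative UA string)'}

# Keyword arguments: the headers end up where they belong
good = urllib.request.Request(url=url, headers=headers)
print(good.get_method())              # GET
print(good.get_header('User-agent'))  # header names are stored capitalized
print(good.data)                      # None

# Positional arguments: the dict lands in the data parameter instead
bad = urllib.request.Request(url, headers)
print(bad.get_method())               # POST, because data is no longer None
print(bad.header_items())             # [] : no headers were set
```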

(2) Encoding a GET parameter with quote (Baidu)

urllib.parse.quote percent-encodes Chinese text (as UTF-8 bytes) so that it can be appended to the URL as a GET parameter.

import urllib.request
import urllib.parse


url = 'https://www.baidu.com/s?wd='

# Customizing the request object is the first way to get past anti-scraping checks
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Cookie' : ''
}

# Percent-encode the three characters 周杰伦
# This relies on urllib.parse
name = urllib.parse.quote('周杰伦')

url = url + name
print(url) # https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6

# Customize the request object
request = urllib.request.Request(url=url,headers=headers)

# Send the request as a browser would
response = urllib.request.urlopen(request)

# Read the response body
content = response.read().decode('utf-8')

# Print the data
print(content)
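The encoding step can be checked on its own, without a request. This small sketch round-trips the same search term through quote and unquote and rebuilds the percent-encoding by hand to show that it is simply the UTF-8 bytes written in hexadecimal. The 'a/b c' string is an extra example added here.

```python
from urllib.parse import quote, unquote

encoded = quote('周杰伦')
print(encoded)           # %E5%91%A8%E6%9D%B0%E4%BC%A6
print(unquote(encoded))  # 周杰伦

# The same result produced by hand, to show what quote() does
manual = ''.join('%{:02X}'.format(b) for b in '周杰伦'.encode('utf-8'))
print(manual == encoded)

# By default quote() leaves '/' alone; safe='' encodes it as well
print(quote('a/b c'))           # a/b%20c
print(quote('a/b c', safe=''))  # a%2Fb%20c
```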

(3) Encoding several GET parameters with urlencode (Baidu)

urllib.parse.urlencode turns a dictionary into an encoded GET query string automatically.

import urllib.request
import urllib.parse

base_url = 'https://www.baidu.com/s?'

# Put the parameters in a dictionary
data = {
    'wd':'周杰伦',
    'sex':'男',
    'location':'中国台湾省'
}

# urlencode joins the dictionary into a query string, inserts the & separators and percent-encodes the values
new_data = urllib.parse.urlencode(data)
print(new_data) # wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7&location=%E4%B8%AD%E5%9B%BD%E5%8F%B0%E6%B9%BE%E7%9C%81

# Full URL of the resource
url = base_url + new_data

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Cookie' : ''
}

# Customize the request object
request = urllib.request.Request(url=url,headers=headers)

# Send the request as a browser would
response = urllib.request.urlopen(request)

# Read the page source
content = response.read().decode('utf-8')

# Print the data
#print(content)
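As a sanity check on the query string, urllib.parse.parse_qs can decode what urlencode produced. The sketch below uses a two-key dictionary and also shows one detail that is easy to miss: urlencode writes a space as '+', whereas quote writes '%20'. The 'q' parameter is a made-up example.

```python
from urllib.parse import urlencode, parse_qs, quote

params = {'wd': '周杰伦', 'sex': '男'}
query = urlencode(params)
print(query)  # wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7

# parse_qs reverses the encoding; every value comes back as a list
decoded = parse_qs(query)
print(decoded)  # {'wd': ['周杰伦'], 'sex': ['男']}

# urlencode encodes a space as '+', while quote uses '%20'
print(urlencode({'q': 'a b'}))  # q=a+b
print(quote('a b'))             # a%20b
```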

(4) Saving a GET response locally (Douban Movies)

# GET request
# Fetch the first page of Douban movie data and save it

import urllib.request

url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# (1) Customize the request object
request = urllib.request.Request(url=url,headers=headers)

# (2) Get the response data
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')

# (3) Save the data locally
# By default open() uses the platform's locale encoding (gbk on Chinese-locale Windows),
# so pass encoding='utf-8' to open() when saving Chinese text
# fp = open('douban.json','w',encoding='utf-8')
# fp.write(content)

with open('douban1.json','w',encoding='utf-8') as fp:
    fp.write(content)

(5) Saving GET responses locally, part 2 (Douban Movies)

# https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&
# start=0&limit=20

# https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&
# start=20&limit=20

# https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&
# start=40&limit=20

# https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&
# start=60&limit=20

# page    1  2   3   4
# start   0  20  40  60

# start = (page - 1) * 20


# Download the first 10 pages of Douban movie data
# (1) Customize the request object
# (2) Get the response data
# (3) Save the data

import urllib.parse
import urllib.request

def create_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'

    data = {
        'start':(page - 1) * 20,
        'limit':20
    }

    data = urllib.parse.urlencode(data)

    url = base_url + data

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
    }

    request = urllib.request.Request(url=url,headers=headers)
    return request


def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content


def down_load(page,content):
    with open('douban_' + str(page) + '.json','w',encoding='utf-8')as fp:
        fp.write(content)

# Program entry point
if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))

    for page in range(start_page,end_page+1):
        # Each page gets its own request object
        request = create_request(page)
        # Get the response data
        content = get_content(request)
        # Save it
        down_load(page,content)
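The page-to-offset rule, start = (page - 1) * 20, can be verified without touching the network. The sketch below isolates the URL construction; the function name page_url, the PAGE_SIZE constant and the check for page numbers below 1 are additions made for this illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
PAGE_SIZE = 20

def page_url(page):
    """Build the URL for a 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    query = urlencode({'start': (page - 1) * PAGE_SIZE, 'limit': PAGE_SIZE})
    return BASE_URL + query

# Pages 1 to 4 map to start offsets 0, 20, 40 and 60
for page in range(1, 5):
    print(page, page_url(page))
```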

5. POST request examples

(1) Sending data with a POST request (Baidu Translate)

POST parameters must be URL-encoded: data = urllib.parse.urlencode(data)
The encoded string must then be converted to bytes with encode: data = urllib.parse.urlencode(data).encode('utf-8')
The parameters are passed when the request object is built: request = urllib.request.Request(url=url,data=data,headers=headers)
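The three rules above can be observed on a Request object before anything is sent. This is a minimal sketch using the same URL and form field as the example that follows; nothing here performs a network call.

```python
import urllib.parse
import urllib.request

url = 'https://fanyi.baidu.com/sug'
form = {'kw': 'spider'}

encoded = urllib.parse.urlencode(form)  # str: 'kw=spider'
body = encoded.encode('utf-8')          # bytes: b'kw=spider'
print(type(encoded).__name__, type(body).__name__)

# Building the request sends nothing; it only describes what would be sent
request = urllib.request.Request(url=url, data=body)
print(request.get_method())  # POST, because data is present
print(request.data)          # b'kw=spider'
```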

# POST request, Baidu Translate

import urllib.request
import urllib.parse


url = 'https://fanyi.baidu.com/sug'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# Request payload, sent as form data
data = {
    'kw':'spider'
}

# POST parameters must be encoded
data = urllib.parse.urlencode(data).encode('utf-8')

# POST parameters are not appended to the URL; they are passed to the request object as data
# and they must be encoded as bytes
request = urllib.request.Request(url=url,data=data,headers=headers)

# Send the request as a browser would
response = urllib.request.urlopen(request)

# Read the response data
content = response.read().decode('utf-8')

# Parse: JSON string --> Python object
import json

obj = json.loads(content)
print(obj)

# Baidu Translate, detailed translation
import urllib.request
import urllib.parse

url = 'https://fanyi.baidu.com/v2transapi?from=en&to=zh'

headers = {
    # 'Accept': '*/*',
    # 'Accept-Encoding': 'gzip, deflate, br',
    # 'Accept-Language': 'zh-CN,zh;q=0.9',
    # 'Connection': 'keep-alive',
    # 'Content-Length': '135',
    # 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'BIDUPSID=xxxxxxxxxxxxxxxxxxxxxx; PSTM=xxxxxxxxx;xxxxxxxxxxxxxxxxxxxxxxxxxx',
    # 'Host': 'fanyi.baidu.com',
    # 'Origin': 'https://fanyi.baidu.com',
    # 'Referer': 'https://fanyi.baidu.com/?aldtype=16047',
    # 'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    # 'sec-ch-ua-mobile': '?0',
    # 'Sec-Fetch-Dest': 'empty',
    # 'Sec-Fetch-Mode': 'cors',
    # 'Sec-Fetch-Site': 'same-origin',
    # 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    # 'X-Requested-With': 'XMLHttpRequest',
}

data = {
    'from': 'en',
    'to': 'zh',
    'query': 'love',
    'transtype': 'realtime',
    'simple_means_flag': '3',
    'sign': '198772.518981',
    'token': '5483bfa652979b41f9c90d91f3de875d',
    'domain': 'common',
}
# POST parameters must be URL-encoded and then converted to bytes with encode
data = urllib.parse.urlencode(data).encode('utf-8')

# Customize the request object
request = urllib.request.Request(url = url,data = data,headers = headers)

# Send the request as a browser would
response = urllib.request.urlopen(request)

# Read the response data
content = response.read().decode('utf-8')

import json

obj = json.loads(content)
print(obj)

(2) Saving POST responses locally (KFC)

# Page 1
# http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
# post
# cname: 北京
# pid:
# pageIndex: 1
# pageSize: 10


# Page 2
# http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
# post
# cname: 北京
# pid:
# pageIndex: 2
# pageSize: 10

import urllib.request
import urllib.parse

def create_request(page):
    base_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'

    data = {
        'cname': '北京',
        'pid':'',
        'pageIndex': page,
        'pageSize': '10'
    }

    data = urllib.parse.urlencode(data).encode('utf-8')

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
    }

    request = urllib.request.Request(url=base_url,headers=headers,data=data)

    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content


def down_load(page,content):
    with open('kfc_' + str(page) + '.json','w',encoding='utf-8')as fp:
        fp.write(content)



if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))

    for page in range(start_page,end_page+1):
        # Customize the request object
        request = create_request(page)
        # Get the response body
        content = get_content(request)
        # Save it
        down_load(page,content)

6. Exception handling

(1) About URLError and HTTPError

1. HTTPError is a subclass of URLError.
2. Both live in urllib.error: urllib.error.HTTPError and urllib.error.URLError.
3. HTTP errors: an HTTP error is the error response a server sends when it cannot fulfil a request; its status code tells the visitor what went wrong with the page.
4. A request sent with urllib may fail. To make the code more robust, catch the failure with try-except. There are two kinds of exception: URLError and HTTPError.
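Because HTTPError is a subclass of URLError, the more specific class has to be checked first, otherwise the URLError branch would catch both. The sketch below demonstrates this with exceptions built by hand instead of raised by a real request; describe_error and the example.com URL are hypothetical names used only for illustration.

```python
import io
import urllib.error

def describe_error(exc):
    """Return a short description; HTTPError must be tested before URLError."""
    if isinstance(exc, urllib.error.HTTPError):
        return 'HTTP error {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return 'URL error: {}'.format(exc.reason)
    return 'other error: {}'.format(exc)

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True

# Exceptions built by hand for illustration; normally urlopen raises them
not_found = urllib.error.HTTPError(
    'http://example.com/missing', 404, 'Not Found', None, io.BytesIO())
unreachable = urllib.error.URLError('name resolution failed')

print(describe_error(not_found))    # HTTP error 404: Not Found
print(describe_error(unreachable))  # URL error: name resolution failed
```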

(2) URLError and HTTPError example

import urllib.request
import urllib.error

url = 'https://blog.csdn.net/sulixu/article/details/1198189491'

# url = 'http://www.doudan.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

try:
    request = urllib.request.Request(url = url, headers = headers)

    response = urllib.request.urlopen(request)

    content = response.read().decode('utf-8')

    print(content)
# Status codes such as 404 raise HTTPError
except urllib.error.HTTPError:
    print('The system is being upgraded...')

# Other problems, such as a bad URL, raise URLError
except urllib.error.URLError:
    print('Like I said, the system is being upgraded...')

7. Hands-on: scraping the Weibo friends feed

(1) Finding the GET endpoint for the friends feed

With the browser's developer tools (F12) open, clicking the friends feed triggers a request to an endpoint whose response, shown in the Preview tab, is exactly the content of the friends feed.

(2) Writing the scraping code

import urllib.request

url = 'https://weibo.com/ajax/feed/groupstimeline?list_id=100095458128744&refresh=4&fast_refresh=1&count=25'

# Send the request as a browser would
response = urllib.request.urlopen(url)

# Read the response data
content = response.read().decode('utf-8')

# Save the data locally
with open('weibo.json','w',encoding='utf-8') as fp:
    fp.write(content)

Running the code above produces an error in the console:

content = response.read().decode('utf-8')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 339: invalid continuation byte

Change the code slightly to print what read() returns:

# Read the response data
content = response.read()
print(content)

The response turns out to be an HTML page whose charset is gb2312.
Then change the code to save that HTML:

# Read the response data
content = response.read().decode('gb2312')

# Save the data locally
with open('weibo.html','w',encoding='gb2312') as fp:
    fp.write(content)
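Rather than hard-coding the charset after a failed decode, it can often be read from the Content-Type response header. The sketch below assumes the server declares the charset there; decode_body is a hypothetical helper, and a hand-built email.message.Message stands in for response.headers, which offers the same get_content_charset method.

```python
from email.message import Message

def decode_body(raw, headers, default='utf-8'):
    """Decode raw bytes with the charset named in the Content-Type header, if any."""
    charset = headers.get_content_charset() or default
    # errors='replace' keeps going on stray bytes instead of raising
    return raw.decode(charset, errors='replace')

# A hand-built header object stands in for response.headers
headers = Message()
headers['Content-Type'] = 'text/html; charset=gb2312'

raw = '登录'.encode('gb2312')
print(headers.get_content_charset())  # gb2312
print(decode_body(raw, headers))      # 登录
```

With a real response the call would look like decode_body(response.read(), response.headers). Pages that declare the charset only in an HTML meta tag would still need separate handling.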

The saved HTML page is not the data we wanted.

This is because the friends feed can only be read after logging in, and the request is missing that key information.

Let's add request headers and set the Cookie:

import urllib.request

url = 'https://weibo.com/ajax/feed/groupstimeline?list_id=100095458128744&refresh=4&fast_refresh=1&count=25'

headers = {
    # The cookie carries your login state; with a logged-in cookie the request can reach pages that require login
    'cookie': 'paste the cookie copied from your browser',
}
# Customize the request object
request = urllib.request.Request(url=url,headers=headers)

# Send the request as a browser would
response = urllib.request.urlopen(request)

# Read the response data
content = response.read().decode('utf-8')

# Save the data locally
with open('weibo.json','w',encoding='utf-8') as fp:
    fp.write(content)

This time the data is fetched successfully!

8. Handlers

(1) Basic usage

# Goal: fetch the Baidu page source through a handler

import urllib.request

url = 'http://www.baidu.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

request = urllib.request.Request(url = url,headers = headers)

# handler   build_opener  open

# (1) Create the handler object
handler = urllib.request.HTTPHandler()

# (2) Build the opener object
opener = urllib.request.build_opener(handler)

# (3) Call the open method
response = opener.open(request)

content = response.read().decode('utf-8')

print(content)

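(2) Using a proxy

Common uses of a proxy:
1. Getting around restrictions on your own IP, for example to reach sites abroad.
2. Accessing resources internal to an organization.
Further detail: for a university FTP server (provided the proxy's address is within the range allowed to access it), a free proxy inside the education network's address range can be used for FTP uploads and downloads and for the lookup and sharing services that are open to the education network.
3. Speeding up access.
Further detail: a proxy server usually keeps a large disk cache. Information passing through from outside is also stored in the cache, so when another user requests the same information it is served straight from the cache, which speeds up access.
4. Hiding your real IP.
Further detail: users can also hide their IP this way to protect themselves from attack.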

import urllib.request

url = 'http://www.baidu.com/s?wd=ip'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# Customize the request object
request = urllib.request.Request(url = url,headers= headers)

# Send the request as a browser would
# response = urllib.request.urlopen(request)

proxies = {
    'http':'192.168.56.10:5556'
}
# handler  build_opener  open
# Enable the proxy
handler = urllib.request.ProxyHandler(proxies = proxies)

opener = urllib.request.build_opener(handler)

response = opener.open(request)

# Read the response body
content = response.read().decode('utf-8')

# Save it
with open('daili.html','w',encoding='utf-8')as fp:
    fp.write(content)

(3) Using a proxy pool

import urllib.request
# Several proxies
proxies_pool = [
    {'http':'192.168.56.10:5556'},
    {'http':'192.168.56.10:5557'},
]

import random
# Pick one at random
proxies = random.choice(proxies_pool)

url = 'http://www.baidu.com/s?wd=ip'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

request = urllib.request.Request(url = url,headers=headers)

handler = urllib.request.ProxyHandler(proxies=proxies)

opener = urllib.request.build_opener(handler)

response = opener.open(request)

content = response.read().decode('utf-8')

with open('daili.html','w',encoding='utf-8')as fp:
    fp.write(content)
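A random choice gives no fallback if the chosen proxy is down. As one possible variation, not the approach used above, the sketch below walks through the pool in order and moves on to the next proxy when a request fails. The names fetch_with_pool, default_opener and make_opener, the attempt count and the timeout are all assumptions made for this illustration, and the proxy addresses are the placeholders from the example.

```python
import itertools
import urllib.request

proxies_pool = [
    {'http': '192.168.56.10:5556'},
    {'http': '192.168.56.10:5557'},
]

def default_opener(proxies):
    """Build an opener that routes requests through the given proxy mapping."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))

def fetch_with_pool(request, pool, attempts=3, timeout=5, make_opener=default_opener):
    """Try the proxies in pool in turn and return the first body that arrives."""
    if attempts < 1 or not pool:
        raise ValueError('need at least one attempt and one proxy')
    last_error = None
    for proxies in itertools.islice(itertools.cycle(pool), attempts):
        try:
            with make_opener(proxies).open(request, timeout=timeout) as response:
                return response.read()
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            last_error = exc    # this proxy failed; move on to the next one
    raise last_error
```

A call such as fetch_with_pool(request, proxies_pool) returns the raw bytes, which can then be decoded and saved as in the example above. The make_opener parameter exists mainly so the function can be tried out with a stand-in opener.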
