Table of Contents
The Scrapy framework
pipeline-item-shell
Simulating login with Scrapy
Downloading images with Scrapy
Downloader middleware
The Scrapy framework
What it is:
Architecture diagram:
How a request flows through Scrapy (a minimal code sketch follows this list):
1. Scrapy takes the URLs in start_urls and builds Request objects from them.
2. The spider sends each Request to the engine, passing through the spider middleware on the way; the engine then forwards it to the scheduler (a queue that stores pending requests).
3. The scheduler hands requests back to the engine.
4. The engine sends each request to the downloader, passing through the downloader middleware.
5. The downloader fetches the page from the internet and gets a Response back.
6. The downloader returns the Response to the engine, again through the downloader middleware.
7. The engine passes the Response to the spider, through the spider middleware.
8. The spider extracts data from the Response (URLs, fields, ...). If it needs more pages, it yields another Request back to the engine and the cycle repeats; otherwise it yields the extracted data to the engine, through the spider middleware.
9. The engine sends the data to the item pipeline.
10. The pipeline stores the data.
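Here is a minimal, hypothetical spider sketch of how that cycle looks from the spider's side (the spider name, URL, and XPath are placeholders, not taken from this article):

import scrapy

class FlowDemoSpider(scrapy.Spider):
    # illustrative spider for the request/item cycle described above
    name = "flow_demo"
    start_urls = ["https://example.com/list"]  # placeholder URL

    def parse(self, response):
        # step 8: yield data -> engine -> item pipeline (steps 9-10)
        yield {"url": response.url}
        # step 8 again: yield a new Request -> engine -> scheduler, and the cycle repeats
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)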
Let's start by creating a project from the command line (cmd):
c: / d: / e:  --->  switch to another drive
cd folder-name  ----->  change into a folder
scrapy startproject project-name  -------->  create a project
scrapy genspider spider-name domain  ------->  create a spider file
scrapy crawl spider-name  ------------>  run the spider
We can also create a start.py file to run the spider from code (it must be created at the top level of the project directory).
Where to create the file:
Running the spider from code:
from scrapy import cmdline
# cmdline.execute("scrapy crawl baidu".split())
# cmdline.execute("scrapy crawl novel".split())
cmdline.execute("scrapy crawl shiping".split())
Import it with from scrapy import cmdline
cmdline.execute(['scrapy', 'crawl', 'spider_name']) runs the spider file (note there must be no stray spaces inside the strings).
Now let's walk through the files in the project.
The spider file (spider_name.py)
You can see that Scrapy generates a few class attributes whose values you may change, but parse() must keep its name and its parameters: Scrapy calls it by exactly that signature.
settings.py
Find the ITEM_PIPELINES setting and uncomment it; the lower the number, the earlier that pipeline runs. If it is not enabled, no data ever reaches the item parameter of process_item() in the MyScrapyPipeline class in pipelines.py.
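For reference, the enabled setting looks roughly like this (assuming the project package is called My_scrapy, as the default User-Agent My_scrapy (+http://www.yourdomain.com) further below suggests):

ITEM_PIPELINES = {
    # lower number = higher priority, so this pipeline runs earlier
    "My_scrapy.pipelines.MyScrapyPipeline": 300,
}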
Here is a demonstration:
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/review/best/']

    def parse(self, response):
        print(response.text)
Result:
When we click the first URL in the output, it takes us to the page below.
That happens because the spider obeys the site's robots.txt rules. The fix: find the following line in settings.py:
Change True to False, then run again.
Result:
One error is gone now.
There is still an error, though; let's fix that next:
The fix for the 403 is to add a UA (the User-Agent request header).
Find this part of settings.py:
Replace My_scrapy (+http://www.yourdomain.com) with a real User-Agent string:
Result:
The page can now be accessed normally.
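Put together, the two settings.py changes look roughly like this (the User-Agent string below is only an example; any real browser UA works):

# settings.py
ROBOTSTXT_OBEY = False  # stop obeying robots.txt
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0 Safari/537.36")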
middlewares.py (used to add request headers)
Some readers will find the approach above tedious: if you need to rotate the User-Agent frequently, editing settings.py every time is impractical. Think about it: if we could add the header while the request is being sent, none of that manual editing would be needed. How do we do that?
The middleware is exactly the place for it. In a Scrapy project the middleware lives in the middlewares.py file.
When we open that file we see:
Both the spider middleware and the downloader middleware are defined in middlewares.py:
MyScrapyDownloaderMiddleware is the downloader middleware.
MyScrapySpiderMiddleware is the spider middleware.
Below I'll walk through MyScrapyDownloaderMiddleware.
The two methods you will use most are process_request() and process_response(); let's start with process_request().
Code screenshot:
When we run it, nothing is printed. Why? Because the middleware has not been enabled yet: open settings.py and uncomment the DOWNLOADER_MIDDLEWARES setting.
Code screenshot:
Run it again and it works:
Now let's try process_response().
Code screenshot:
Result:
You can see that process_request runs before process_response.
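Since the screenshots are not reproduced here, a rough sketch of what those two methods looked like inside the generated MyScrapyDownloaderMiddleware (only the print statements are additions) might be:

class MyScrapyDownloaderMiddleware:
    def process_request(self, request, spider):
        # called for every request on its way to the downloader
        print("process_request:", request.url)
        return None

    def process_response(self, request, response, spider):
        # called for every response on its way back to the spider
        print("process_response:", response.status, response.url)
        return response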
Some readers may now wonder: can we create our own Request or Response object here?
Let's try.
Code screenshot:
Result:
A careful reader will notice the result does not match what they expected.
Here is the relevant part of the downloader middleware:
This is where the problem lies.
Let me explain what these return values mean.

process_request(request, spider)
# - return None: continue processing this request
When it returns None, the request keeps moving down the chain: for example, if our custom process_request() returns None, the next downloader middleware's process_request() still runs.
# - or return a Request object
When it returns a Request object, the request goes no further: it is sent back to the engine and from there to the scheduler (it goes back the way it came), and no later process_request() runs.
# - or return a Response object
When it returns a Response object, the downloader is never called: the response goes back to the engine and straight to the spider (skipping the rest of the chain).
# - or raise IgnoreRequest: the process_exception() methods of the installed downloader middleware will be called
If this method raises an exception, process_exception() is called.

process_response(request, response, spider)
# - return a Response object: the response keeps moving up the chain toward the spider.
# - return a Request object
Returning a Request object stops the middleware chain and puts the request back into the scheduler to be scheduled for download again.
# - or raise IgnoreRequest
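To make the Request/Response return values concrete, here is a small hedged sketch (the middleware name and the URL check are invented for illustration):

from scrapy.http import HtmlResponse

class ShortCircuitMiddleware:
    def process_request(self, request, spider):
        if "blocked" in request.url:
            # returning a Response here skips the downloader entirely;
            # the engine hands this response straight to the spider
            return HtmlResponse(url=request.url, body=b"<html></html>", encoding="utf-8")
        # returning None lets the request continue to the next middleware and the downloader
        return None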
Some readers will ask: can I create my own middleware just for adding request headers? (It has to live in middlewares.py.)
from scrapy import signals
import random

class UsertMiddleware:
    User_Agent = [
        "Mozilla/5.0 (compatible; MSIE 9.0; AOL 9.7; AOLBuild 4343.19; Windows NT 6.1; WOW64; Trident/5.0; FunWebProducts)",
        "Mozilla/4.0 (compatible; MSIE 8.0; AOL 9.7; AOLBuild 4343.27; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)",
    ]

    def process_request(self, request, spider):
        # add a random User-Agent header
        print(dir(request))
        request.headers["User-Agent"] = random.choice(self.User_Agent)
        # add a proxy IP if needed (the meta key Scrapy reads is "proxy")
        # request.meta["proxy"] = "proxy ip"
        return None

class UafgfMiddleware:
    def process_response(self, request, response, spider):
        # check whether the header was actually added
        print(request.headers["User-Agent"])
        return response
Result:
It runs fine.
pipelines.py
process_item(self, item, spider)
item: receives the data yielded by the spider, for example a dict.
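A bare-bones sketch of the generated MyScrapyPipeline with just that method (the print is only for illustration):

class MyScrapyPipeline:
    def process_item(self, item, spider):
        # item is whatever the spider yielded (a dict here); spider is the spider instance
        print(item)
        return item  # return it so any later pipeline also receives it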
Now let's scrape Douban.
Exercise: download the images from Douban movie reviews.
Spider file (.py):
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com', 'doubanio.com']
    start_urls = ['https://movie.douban.com/review/best/']
    a = 1

    def parse(self, response):
        divs = response.xpath('//div[@id="content"]//div[@class="review-list chart "]//div[@class="main review-item"]')
        for div in divs:
            # print(div.extract)
            title = div.xpath('./a/img/@title')
            src = div.xpath('./a/img/@src')
            # print(title.extract_first())
            print(src.extract_first())
            yield {
                "title": title.extract_first(),
                "src": src.extract_first(),
                "type": "csv"
            }
            # send another request to download the image itself
            yield scrapy.Request(
                url=src.extract_first(),
                callback=self.parse_url,
                cb_kwargs={"imgg": title.extract_first()}
            )
        # Option 1: read the next-page link from the page
        # next1 = response.xpath('//div[@class="paginator"]//a[1]/@href').extract_first()
        # Option 2: build the next-page URL ourselves
        next1 = "/review/best?start={}".format(20 * self.a)
        self.a += 1
        url11 = 'https://movie.douban.com' + next1
        yield scrapy.Request(url=url11, callback=self.parse)
        print(url11)

    def parse_url(self, response, imgg):
        # print(response.body)
        yield {
            "title": imgg,
            "ts": response.body,
            "type": "img"
        }
pipelines.py:
import csv

class MyScrapyPipeline:
    def open_spider(self, spider):  # called once, when the spider starts
        header = ["title", "src"]
        self.f = open("move.csv", "a", encoding="utf-8", newline="")  # newline="" avoids blank lines in the CSV
        self.wri_t = csv.DictWriter(self.f, header)
        self.wri_t.writeheader()

    def process_item(self, item, spider):  # called once for every item the spider yields
        if item.get("type") == "csv":
            item.pop("type")
            self.wri_t.writerow(item)
        if item.get("type") == "img":
            item.pop("type")
            with open("./圖片/{}.png".format(item.get("title")), "wb") as f:
                f.write(item.get("ts"))
            print("{}.png downloaded".format(item.get("title")))
        return item

    def close_spider(self, spider):
        self.f.close()
settings.py:
This setting (shown in the original screenshot) limits the console output to just what you want to see.
_____________________________________
All of the settings shown above are enabled.
Remember: if a request sent from the spider fails, its callback never runs, so nothing is passed on to the functions in pipelines.py for that item.
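If you want to be told when that happens, Scrapy lets you attach an errback to the request. A hedged sketch (the spider name and URLs are placeholders):

import scrapy

class ErrbackDemoSpider(scrapy.Spider):
    name = "errback_demo"
    start_urls = ["https://example.com/"]  # placeholder

    def parse(self, response):
        yield scrapy.Request(
            "https://example.com/maybe-missing",  # placeholder URL
            callback=self.parse_page,
            errback=self.on_error,  # called when the request fails
        )

    def parse_page(self, response):
        yield {"url": response.url}

    def on_error(self, failure):
        # failure.request is the request that failed; no item reaches the pipeline for it
        self.logger.warning("request failed: %s", failure.request.url)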
How to pause and resume a crawl
Some readers will ask: is there a way to pause a crawl and pick it up again later? There is.
Here is how:
scrapy crawl spider-name -s JOBDIR=some/path (the path can be anything you like)
Press Ctrl+C to pause the crawl.
When you try to resume, you may find that nothing gets downloaded any more.
Why? Because the way we built our requests differs from what the framework does by default.
The signature of scrapy.Request looks like this:
dont_filter ("don't filter?") controls deduplication: when it is False (the default), duplicate URLs are filtered out and each URL is visited only once; when it is True, the request is never filtered.
You might ask why the requests generated from start_urls (the ones handled by parse()) can still be sent; the answer is below:
Now it is clear: if you don't want a request to be filtered out, this is what you have to change.
And if you do want those start requests to be filtered, override the method (start_requests()); a sketch follows.
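A hedged sketch of both pieces, resuming with JOBDIR and controlling deduplication with dont_filter (the spider name, path, and URL are placeholders):

# run with:  scrapy crawl resumable -s JOBDIR=crawls/job-1   (the path is arbitrary)
import scrapy

class ResumableSpider(scrapy.Spider):
    name = "resumable"
    start_urls = ["https://example.com/"]  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            # the default start_requests uses dont_filter=True, so start URLs are never
            # deduplicated; pass dont_filter=False if you want them filtered on resume
            yield scrapy.Request(url, callback=self.parse, dont_filter=False)

    def parse(self, response):
        yield {"url": response.url}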
Simulating login with Scrapy
There are two approaches:
● 1. Carry the cookies directly in the request (semi-automatic: grab the cookie with selenium, or copy it by hand from the browser after logging in).
We'll use https://www.1905.com/vod/list/c_178/o3u1p1.html as an example.
Approach 1: log in by hand, copy the cookie, and request the page.
Spider file example 1 (adding the cookie in the spider file):
import scrapy

# The long cookie string below was copied from the browser after logging in manually.
# It is defined at module level so the spider class further down can use it.
cook="GUID=f0f80f5e-fb00-443f-a6be-38c6ce3d4c61; __bid_n=1883d51d69d6577cf44207; BAIDU_SSP_lcr=https://www.baidu.com/link?url=v-ynoaTMtiyBil1uTWfIiCbXMGVZKqm4MOt5_xZD0q7&wd=&eqid=da8d6ae20003f26f00000006647c3209; Hm_lvt_9793f42b498361373512340937deb2a0=1684655954,1684929837,1685860878; dfxafjs=js/dfxaf3-ef0075bd.js; FPTOKEN=zLc3s/mq2pguVT/CfivS7tOMcBA63ZrOyecsnTPMLcC/fBEIx0PuIlU5HgkDa8ETJkZYoDJOSFkTHaz1w8sSFlmsRLKFG8s+GO+kqSXuTBgG98q9LQ+EJfeSHMvwMcXHd+EzQzhAxj1L9EnJuEV2pN0w7jUCYmfORSbIqRtu5kruBMV58TagSkmIywEluK5JC6FnxCXUO0ErYyN/7awzxZqyqrFaOaVWZZbYUrhCFq0N8OQ1NMPDvUNvXNDjDOLM6AU9f+eHsXFeAaE9QunHk6DLbxOb8xHIDot4Pau4MNllrBv8cHFtm2U3PHX4f6HFkEpfZXB0yVrzbX1+oGoscbt+195MLZu478g3IFYqkrB8b42ILL4iPHtj6M/MUbPcxoD25cMZiDI1R0TSYNmRIA==|U8iJ37fGc7sL3FohNPBpgau0+kHrBi2OlH2bHfhFOPQ=|10|87db5f81d4152bd8bebb5007a0f3dbc3; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F03%252F43%252F75%252F100257543.jpg-88x88%253Fv%253D1685860834000%26id%3D100257543%26nickname%3D%25E8%2580%2581%25E5%25A4%25A7%25E5%2592%258C%25E5%258F%258D%25E5%25AF%25B9%25E6%25B3%2595%25E7%259A%2584%25E5%258F%258D%26e%3D1701413546%26s%3Db67793dfa5cea859; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22100257543%22%2C%22%24device_id%22%3A%221883d51d52d1790-08af8c489ac963-26031a51-1638720-1883d51d52eea0%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%22%2C%22%24latest_referrer_host%22%3A%22www.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%7D%2C%22first_id%22%3A%22f0f80f5e-fb00-443f-a6be-38c6ce3d4c61%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1685861547"
class A17kSpider(scrapy.Spider):
    name = '17k'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/']

    # override start_requests so the cookies are attached to the very first request
    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse,
            # split("=", 1) keeps "=" characters inside cookie values intact
            cookies={lis.split("=", 1)[0].strip(): lis.split("=", 1)[1] for lis in cook.split(";")}
        )

    def parse(self, response):
        # print(response.text)
        yield scrapy.Request(url="https://user.17k.com/www/", callback=self.parse_url)

    def parse_url(self, response):
        print(response.text)
Result:
Spider file example 2 (adding the cookie in the downloader middleware):
# The same long cookie string, again defined at module level so the middleware below can use it:
cook = "GUID=f0f80f5e-fb00-443f-a6be-38c6ce3d4c61; __bid_n=1883d51d69d6577cf44207; BAIDU_SSP_lcr=https://www.baidu.com/link?url=v-ynoaTMtiyBil1uTWfIiCbXMGVZKqm4MOt5_xZD0q7&wd=&eqid=da8d6ae20003f26f00000006647c3209; Hm_lvt_9793f42b498361373512340937deb2a0=1684655954,1684929837,1685860878; dfxafjs=js/dfxaf3-ef0075bd.js; FPTOKEN=zLc3s/mq2pguVT/CfivS7tOMcBA63ZrOyecsnTPMLcC/fBEIx0PuIlU5HgkDa8ETJkZYoDJOSFkTHaz1w8sSFlmsRLKFG8s+GO+kqSXuTBgG98q9LQ+EJfeSHMvwMcXHd+EzQzhAxj1L9EnJuEV2pN0w7jUCYmfORSbIqRtu5kruBMV58TagSkmIywEluK5JC6FnxCXUO0ErYyN/7awzxZqyqrFaOaVWZZbYUrhCFq0N8OQ1NMPDvUNvXNDjDOLM6AU9f+eHsXFeAaE9QunHk6DLbxOb8xHIDot4Pau4MNllrBv8cHFtm2U3PHX4f6HFkEpfZXB0yVrzbX1+oGoscbt+195MLZu478g3IFYqkrB8b42ILL4iPHtj6M/MUbPcxoD25cMZiDI1R0TSYNmRIA==|U8iJ37fGc7sL3FohNPBpgau0+kHrBi2OlH2bHfhFOPQ=|10|87db5f81d4152bd8bebb5007a0f3dbc3; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F03%252F43%252F75%252F100257543.jpg-88x88%253Fv%253D1685860834000%26id%3D100257543%26nickname%3D%25E8%2580%2581%25E5%25A4%25A7%25E5%2592%258C%25E5%258F%258D%25E5%25AF%25B9%25E6%25B3%2595%25E7%259A%2584%25E5%258F%258D%26e%3D1701413546%26s%3Db67793dfa5cea859; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22100257543%22%2C%22%24device_id%22%3A%221883d51d52d1790-08af8c489ac963-26031a51-1638720-1883d51d52eea0%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%22%2C%22%24latest_referrer_host%22%3A%22www.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%7D%2C%22first_id%22%3A%22f0f80f5e-fb00-443f-a6be-38c6ce3d4c61%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1685861547"
class MyaddcookieMiddleware:
    def process_request(self, request, spider):
        # split("=", 1) keeps "=" characters inside cookie values intact
        cookies = {lis.split("=", 1)[0].strip(): lis.split("=", 1)[1] for lis in cook.split(";")}
        # attach the cookies to every outgoing request
        request.cookies = cookies
        return None
Spider file example 3 (getting the cookie with selenium in the downloader middleware):
from selenium import webdriver
import time

def sele():
    # create a browser
    driver = webdriver.Chrome()
    # open the login page
    driver.get("https://user.17k.com/www/bookshelf/")
    print("You have 15 seconds to log in")
    time.sleep(15)
    print(driver.get_cookies())
    # turn selenium's cookie list into the {name: value} dict that Scrapy expects
    cookies = {i.get("name"): i.get("value") for i in driver.get_cookies()}
    driver.quit()
    return cookies

class MyaddcookieMiddleware:
    def process_request(self, request, spider):
        # attach the cookies collected by selenium to the request
        request.cookies = sele()
        return None
Approach 2: find the login API endpoint, send it a POST request, and let Scrapy keep the resulting cookie.
Code 1:
import scrapy

class A17kSpider(scrapy.Spider):
    name = '17k'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/']

    # # the earlier cookie-based start_requests, kept commented out for reference:
# def start_requests(self):
# cook="GUID=f0f80f5e-fb00-443f-a6be-38c6ce3d4c61; __bid_n=1883d51d69d6577cf44207; BAIDU_SSP_lcr=https://www.baidu.com/link?url=v-ynoaTMtiyBil1uTWfIiCbXMGVZKqm4MOt5_xZD0q7&wd=&eqid=da8d6ae20003f26f00000006647c3209; Hm_lvt_9793f42b498361373512340937deb2a0=1684655954,1684929837,1685860878; dfxafjs=js/dfxaf3-ef0075bd.js; FPTOKEN=zLc3s/mq2pguVT/CfivS7tOMcBA63ZrOyecsnTPMLcC/fBEIx0PuIlU5HgkDa8ETJkZYoDJOSFkTHaz1w8sSFlmsRLKFG8s+GO+kqSXuTBgG98q9LQ+EJfeSHMvwMcXHd+EzQzhAxj1L9EnJuEV2pN0w7jUCYmfORSbIqRtu5kruBMV58TagSkmIywEluK5JC6FnxCXUO0ErYyN/7awzxZqyqrFaOaVWZZbYUrhCFq0N8OQ1NMPDvUNvXNDjDOLM6AU9f+eHsXFeAaE9QunHk6DLbxOb8xHIDot4Pau4MNllrBv8cHFtm2U3PHX4f6HFkEpfZXB0yVrzbX1+oGoscbt+195MLZu478g3IFYqkrB8b42ILL4iPHtj6M/MUbPcxoD25cMZiDI1R0TSYNmRIA==|U8iJ37fGc7sL3FohNPBpgau0+kHrBi2OlH2bHfhFOPQ=|10|87db5f81d4152bd8bebb5007a0f3dbc3; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F03%252F43%252F75%252F100257543.jpg-88x88%253Fv%253D1685860834000%26id%3D100257543%26nickname%3D%25E8%2580%2581%25E5%25A4%25A7%25E5%2592%258C%25E5%258F%258D%25E5%25AF%25B9%25E6%25B3%2595%25E7%259A%2584%25E5%258F%258D%26e%3D1701413546%26s%3Db67793dfa5cea859; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22100257543%22%2C%22%24device_id%22%3A%221883d51d52d1790-08af8c489ac963-26031a51-1638720-1883d51d52eea0%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%22%2C%22%24latest_referrer_host%22%3A%22www.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%7D%2C%22first_id%22%3A%22f0f80f5e-fb00-443f-a6be-38c6ce3d4c61%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1685861547"
# yield scrapy.Request(
# url=self.start_urls[0],
# callback=self.parse,
# cookies={lis.split("=")[0]:lis.split("=")[1] for lis in cook.split(";")}
# )
#
# def parse(self, response):
# # print(response.text)
# # yield scrapy.Request(url="https://user.17k.com/www/bookshelf/",callback=self.parse_url)
# pass
# def parse_url(self,response):
#
# # print(response.text)
# pass
    # send a POST request to the login endpoint
    def parse(self, response):
        data = {
            "loginName": "15278307585",
            "password": "wasd1234"
        }
        yield scrapy.FormRequest(
            url="https://passport.17k.com/ck/user/login",
            callback=self.prase_url,
            formdata=data
        )
        # works when the page itself contains a form:
        # yield scrapy.FormRequest.from_response(response, formdata=data, callback=self.prase_url)

    def prase_url(self, response):
        print(response.text)
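When the login page itself contains the form, scrapy.FormRequest.from_response can pick up the hidden form fields for you. A hedged sketch (the URL and field names are assumptions, not taken from 17k.com):

import scrapy

class FormLoginSpider(scrapy.Spider):
    name = "form_login"
    start_urls = ["https://example.com/login"]  # placeholder login page

    def parse(self, response):
        # from_response reads the <form> on the page and merges our fields into it
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"loginName": "user", "password": "pass"},  # assumed field names
            callback=self.after_login,
        )

    def after_login(self, response):
        print(response.status, response.url)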
Besides these approaches, the downloader middleware can also return a Response object directly:
from scrapy import signals
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
import time
from scrapy.http.response.html import HtmlResponse

class MyaaacookieMiddleware:
    def process_request(self, request, spider):
        # create a browser
        driver = webdriver.Chrome()
        # open the page
        driver.get("https://juejin.cn/")
        driver.implicitly_wait(3)
        # scroll to the bottom with a JS snippet a few times so more content loads
        for i in range(3):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(3)
        html = driver.page_source
        # returning a Response skips the downloader: the engine hands it straight to the spider
        return HtmlResponse(url=driver.current_url, body=html, request=request, encoding="utf-8")
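One thing to watch with this pattern: the middleware above launches a new Chrome instance for every request and never closes it. A hedged variant that shares a single driver for the whole crawl (the class name is made up; the signal wiring uses standard Scrapy APIs):

from scrapy import signals
from selenium import webdriver
from scrapy.http.response.html import HtmlResponse

class SharedBrowserMiddleware:
    def __init__(self):
        # one browser for the whole crawl
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        # close the browser when the spider finishes
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def process_request(self, request, spider):
        self.driver.get(request.url)
        return HtmlResponse(url=self.driver.current_url, body=self.driver.page_source,
                            request=request, encoding="utf-8")

    def spider_closed(self, spider):
        self.driver.quit()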
That's all for this topic.
Summary
The Scrapy framework exists to save us from rewriting masses of code every time we scrape a lot of data; it lets us solve the problem with only a small amount of code of our own.