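【Author homepage】: 吳秋霖
【About the author】: Quality content creator in the Python field, Alibaba Cloud blog expert, and Huawei Cloud expert. Long committed to research and development in Python and web crawling.
【Author's recommendations】: Readers interested in JS reverse engineering can follow the series "Crawler JS Reverse Engineering in Practice"; readers interested in distributed crawler platforms can follow "Building and Developing a Distributed Crawler Platform".
More articles on CAPTCHA bypassing, app reverse engineering, and Python topics will follow.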
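1. Preface
Scrapy is a classic crawling framework that developers love. Thanks to its clean and efficient design, it is widely used to build robust crawler projects. In this article I use a forum site as an example to walk through Scrapy end to end.
People used to start from the low-level basics; nowadays most start directly with a framework. As a result, many know how to use it without knowing why it works, and overlook the underlying principles.
Target site (Base64-encoded; feel free to practice on it):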
aHR0cHM6Ly9mb3J1bS5heGlzaGlzdG9yeS5jb20v
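This is an overseas BBS forum, a case I once wrote and picked at random. A few years ago I worked on public-opinion monitoring projects and wrote a great many crawlers, covering domestic and overseas social media, forums, and news sites.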
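The target address above is given as a Base64 string. A minimal sketch for decoding it with the standard library follows; the names ENCODED_TARGET and decode_target are illustrative and not part of the original project.

```python
import base64

# Base64-encoded target address, copied from the article
ENCODED_TARGET = "aHR0cHM6Ly9mb3J1bS5heGlzaGlzdG9yeS5jb20v"


def decode_target(encoded: str) -> str:
    """Decode a Base64 string into readable text."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    # Prints the forum's base URL, the same host used by login_url later on
    print(decode_target(ENCODED_TARGET))
```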
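2. Packet capture analysis
First, open the site. It requires login, so we handle login first: submit a simple login attempt and capture the request to analyze it:
The captured request shows the parameters submitted at login. Next we need to build and implement the login in the Spider of our Scrapy project.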
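3. Submitting the login request with Scrapy
The parameters are all plaintext, which keeps things simple. The only special one, sid, is not generated by any encryption either; it can be obtained directly from the HTML.
Often an API parameter looks like ciphertext but is not produced by an encryption algorithm at all; it may well be available in the HTML or in another endpoint's response.
The sid is obtained as follows: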
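As a rough sketch of the two places a sid can come from, the snippet below reads it either from raw Set-Cookie header values or from a hidden form field. The cookie name is taken from the spider code in this article; the sample headers, the sample HTML, and the helper names sid_from_cookies and sid_from_html are assumptions made for illustration.

```python
import re
from typing import Iterable, Optional

# Cookie name as it appears in the article's spider code
SID_COOKIE = "phpbb3_lzhqa_sid"


def sid_from_cookies(set_cookie_headers: Iterable[bytes]) -> Optional[str]:
    """Return the session id from raw Set-Cookie header values, or None."""
    pattern = re.compile(re.escape(SID_COOKIE) + r"=([^;]+)")
    for raw in set_cookie_headers:
        # Header values arrive as bytes, so decode before matching
        match = pattern.search(raw.decode("utf-8", "ignore"))
        if match:
            return match.group(1)
    return None


def sid_from_html(html: str) -> Optional[str]:
    """Return the value of a hidden <input name="sid"> field, or None.

    Assumes the name attribute appears before the value attribute.
    """
    match = re.search(r'<input[^>]*name="sid"[^>]*value="([^"]+)"', html)
    return match.group(1) if match else None


# Hypothetical sample data, for illustration only
headers = [b"phpbb3_lzhqa_u=1; path=/", b"phpbb3_lzhqa_sid=abc123; path=/; HttpOnly"]
html = '<form id="login"><input type="hidden" name="sid" value="abc123" /></form>'

if __name__ == "__main__":
    print(sid_from_cookies(headers), sid_from_html(html))
```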
Now we write the login part of the Scrapy spider. The code is as follows:
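import re

from scrapy import FormRequest, Request


# The two functions below are methods of the Spider class
def parse(self, response):
    # Header values are bytes, so join all Set-Cookie headers and decode before matching
    text = b"; ".join(response.headers.getlist('Set-Cookie')).decode('utf-8', 'ignore')
    pa = re.compile(r"phpbb3_lzhqa_sid=(.*?);")
    sid = pa.findall(text)[0]
    # Copy meta instead of mutating the response's own dict
    meta = dict(response.meta, sid=sid)
    login_url = 'https://forum.axishistory.com/ucp.php?mode=login'
    yield Request(login_url, meta=meta, callback=self.parse_login)

def parse_login(self, response):
    sid = response.meta['sid']
    username = 'your username'
    password = 'your password'
    formdata = {
        "username": username,
        "password": password,
        "sid": sid,
        "redirect": "index.php",
        "login": "Login",
    }
    # formid selects the <form id="login"> element on the login page
    yield FormRequest.from_response(response, formid='login', formdata=formdata, callback=self.parse_after_login)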
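First, the parse function extracts the sid value from the response to the start_urls request, then hands it to parse_login, which performs the simulated login.
A note on the formid parameter: in an HTML document, a form is usually defined with the <form> tag and may carry an id attribute, which is the form's ID. Here is a sample HTML snippet: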
<form id="login" method="post" action="/login">
    <!-- form fields -->
    <input type="text" name="username">
    <input type="password" name="password">
    <!-- other form fields -->
    <input type="submit" value="Login">
</form>
In the example above, the <form> tag has an id attribute whose value is "login". The formid parameter therefore tells Scrapy which form to use when building the login submission.
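To make the idea of selecting a form by its id concrete, here is a standard-library sketch that picks the form with a given id and collects its input fields, then overlays caller-supplied values. It only illustrates the concept and is not how Scrapy implements from_response; FormFieldCollector, SAMPLE, and the sample credentials are hypothetical.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the named <input> fields of the form whose id matches form_id."""

    def __init__(self, form_id: str):
        super().__init__()
        self.form_id = form_id
        self.inside_target = False
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.inside_target = True
            self.action = attrs.get("action")
        elif tag == "input" and self.inside_target and attrs.get("name"):
            # Inputs without a name (such as the submit button here) are skipped
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.inside_target = False


# Hypothetical page with two forms; only the one with id="login" is wanted
SAMPLE = """
<form id="search" action="/search"><input type="text" name="q"></form>
<form id="login" method="post" action="/login">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="sid" value="abc123">
  <input type="submit" value="Login">
</form>
"""

collector = FormFieldCollector("login")
collector.feed(SAMPLE)
# Values supplied by the caller override the ones found in the page
payload = {**collector.fields, "username": "alice", "password": "secret"}
```

The hidden sid field is carried over from the page automatically, while username and password come from the caller, which mirrors how pre-filled form values and formdata are combined.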
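4. Parsing list and detail pages
With login handled, the Scrapy spider can go on to request the list and detail pages and parse their data. This part is mostly a matter of writing XPath rules and is not technically demanding. The list-page link extraction rule was written with an XPath testing tool:
The Scrapy code for the list page is as follows: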
def parse_page_list(self, response):
    pagination = response.meta.get("pagination", 1)
    details = response.xpath("//div[@class='inner']/ul/li")
    for detail in details:
        detail_link = detail.xpath("dl//div[@class='list-inner']/a[@class='topictitle']/@href").extract_first()
        if not detail_link:
            continue  # skip rows that carry no topic link
        # Copy meta so every detail request carries its own values
        meta = dict(response.meta)
        meta["replies"] = detail.xpath("dl/dd[@class='posts']/text()").extract_first()
        meta["views"] = detail.xpath("dl/dd[@class='views']/text()").extract_first()
        meta["detail_title"] = detail.xpath("dl//div[@class='list-inner']/a[@class='topictitle']/text()").extract_first()
        yield Request(response.urljoin(detail_link), callback=self.parse_detail, meta=meta)
    next_page = response.xpath("//div[@class='pagination']/ul/li/a[@rel='next']/@href").extract_first()
    # Follow the next list page until the configured page cap is reached
    if next_page and pagination < self.pagination_num:
        meta = dict(response.meta)
        meta['pagination'] = pagination + 1
        yield Request(response.urljoin(next_page), callback=self.parse_page_list, meta=meta)
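self.pagination_num is a configurable cap on how many list pages to crawl; set it as you see fit.
The list page gives us the links to all posts. Inside the loop we yield a Request for each detail link, and callback=self.parse_detail hands each response to the detail parser to extract the data.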
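The paging rule described above can be sketched as a small pure function: resolve the relative "next" link against the current page URL, and stop once the page cap is reached. The function name next_list_request and the sample URLs are assumptions for illustration; only the pagination and pagination_num names come from the spider.

```python
from typing import Optional, Tuple
from urllib.parse import urljoin


def next_list_request(page_url: str, next_href: Optional[str],
                      pagination: int, pagination_num: int) -> Optional[Tuple[str, int]]:
    """Return (absolute_url, next_page_number), or None when crawling should stop."""
    if not next_href or pagination >= pagination_num:
        return None
    # urljoin resolves relative hrefs such as "./viewforum.php?..." against the page URL
    return urljoin(page_url, next_href), pagination + 1


# Hypothetical list-page URL and "next" link
base = "https://forum.axishistory.com/viewforum.php?f=1"
href = "./viewforum.php?f=1&start=50"

if __name__ == "__main__":
    print(next_list_request(base, href, pagination=1, pagination_num=3))
```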
First, define the Item data structures in the project's items.py file, mainly for accounts, posts, and comments, as shown below:
from scrapy import Field, Item


class AccountItem(Item):
    account_url = Field()  # account URL
    account_id = Field()  # account ID
    account_name = Field()  # account name
    nick_name = Field()  # nickname
    website_name = Field()  # forum name
    account_type = Field()  # account type, always "forum"
    level = Field()  # account level
    account_description = Field()  # account description
    account_followed_num = Field()  # number of accounts followed
    account_followed_list = Field()  # IDs of accounts followed
    account_focus_num = Field()  # number of followers
    account_focus_list = Field()  # follower IDs
    regist_time = Field()  # registration time
    forum_credits = Field()  # forum credits / experience points
    location = Field()  # region
    post_num = Field()  # number of posts
    reply_num = Field()  # number of replies
    msg_type = Field()
    area = Field()


class PostItem(Item):
    type = Field()  # "post"
    post_id = Field()  # post ID
    title = Field()  # post title
    content = Field()  # post content
    website_name = Field()  # forum name
    category = Field()  # board the post belongs to
    url = Field()  # post URL
    language = Field()  # language, zh_cn|en|es
    release_time = Field()  # publication time
    collect_time = Field()  # collection time (set in parse_detail)
    account_id = Field()  # author ID
    account_name = Field()  # author account name
    page_view_num = Field()  # view count
    comment_num = Field()  # reply count
    like_num = Field()  # like count
    quote_from = Field()  # ID of the reposted post
    location_info = Field()  # geolocation of the post
    images_url = Field()  # image links in the post
    image_file = Field()  # storage paths of the post images
    msg_type = Field()
    area = Field()


class CommentItem(Item):
    type = Field()  # "comment"
    website_name = Field()  # forum name
    post_id = Field()
    comment_id = Field()
    content = Field()  # reply content
    release_time = Field()  # reply time
    account_id = Field()  # replier ID
    account_name = Field()  # replier name
    comment_level = Field()  # reply nesting level
    parent_id = Field()  # ID of the post or comment being replied to
    like_num = Field()  # reply like count
    comment_floor = Field()  # reply floor number
    images_url = Field()  # image links in the comment
    image_file = Field()  # storage paths of the comment images
    msg_type = Field()
    area = Field()
Next, write the code that parses the post content. The parse function is implemented as follows:
def parse_detail(self, response):
    # Needs at module level: import re, time, dateparser
    dont_parse_post = response.meta.get("dont_parse_post")
    category = " < ".join(response.xpath("//ul[@id='nav-breadcrumbs']/li//span[@itemprop='title']/text()").extract()[1:])
    if dont_parse_post is None:
        # First page of the thread: the first block is the post itself
        msg_ele = response.xpath("//div[@id='page-body']//div[@class='inner']")[0]
        post_id = msg_ele.xpath("div//h3/a/@href").extract_first(default='').strip().replace("#p", "")
        post_item = PostItem()
        post_item["url"] = response.url
        post_item['area'] = self.name
        post_item['msg_type'] = "貼文"  # "post"
        post_item['type'] = "post"
        post_item["post_id"] = post_id
        post_item["language"] = 'en'
        # Use the domain as the forum name, matching CommentItem and AccountItem
        post_item["website_name"] = self.allowed_domains[0]
        post_item["category"] = category
        post_item["title"] = response.meta.get("detail_title")
        post_item["account_name"] = msg_ele.xpath("div//strong/a[@class='username']/text()").extract_first(default='').strip()
        post_item["content"] = "".join(msg_ele.xpath("div//div[@class='content']/text()").extract()).strip()
        post_time = "".join(msg_ele.xpath("div//p[@class='author']/text()").extract()).strip()
        # dateparser.parse returns None when it cannot parse the text
        release = dateparser.parse(post_time)
        post_item["release_time"] = release.strftime('%Y-%m-%d %H:%M:%S') if release else ""
        post_item["collect_time"] = time.strftime('%Y-%m-%d %H:%M:%S')
        user_link = msg_ele.xpath("div//strong/a[@class='username']/@href").extract_first(default='').strip()
        account_id = "".join(re.compile(r"&u=(\d+)").findall(user_link))
        post_item["account_id"] = account_id
        post_item["comment_num"] = response.meta.get("replies")
        post_item["page_view_num"] = response.meta.get("views")
        images_urls = msg_ele.xpath("div//div[@class='content']//img/@src").extract()
        post_item["images_url"] = [response.urljoin(url) for url in images_urls]
        # image_path is a helper on the spider that is not shown in this article
        post_item["image_file"] = self.image_path(post_item["images_url"])
        yield post_item
        response.meta["post_id"] = post_id
        response.meta['account_id'] = post_item["account_id"]
        response.meta["account_name"] = post_item["account_name"]
        if user_link:
            yield Request(response.urljoin(user_link), meta=response.meta, callback=self.parse_account_info)
    for comment_item in self.parse_comments(response):
        yield comment_item
    comment_next_page = response.xpath("//div[@class='pagination']/ul/li/a[@rel='next']/@href").extract_first()
    if comment_next_page:
        # Later pages contain only comments, so skip post parsing there
        response.meta["dont_parse_post"] = 1
        next_page_link = response.urljoin(comment_next_page)
        yield Request(next_page_link, callback=self.parse_detail, meta=response.meta)
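The ID extraction used in parse_detail can also be written a little more defensively. The sketch below is one possible variant, not the article's code: it reads the u= query parameter with urllib.parse and pulls the numeric part out of a "#p..." anchor. The helper names and the sample links are assumptions for illustration.

```python
import re
from urllib.parse import parse_qs, urlparse


def account_id_from_link(user_link: str) -> str:
    """Return the numeric u= value from a profile link, or '' if absent."""
    values = parse_qs(urlparse(user_link).query).get("u", [])
    return values[0] if values and values[0].isdigit() else ""


def post_id_from_anchor(href: str) -> str:
    """Return the numeric id from an anchor ending in '#p<digits>', or ''."""
    match = re.search(r"#p(\d+)$", href.strip())
    return match.group(1) if match else ""


if __name__ == "__main__":
    # Hypothetical links in the style of a phpBB forum
    print(account_id_from_link("./memberlist.php?mode=viewprofile&u=4321"))
    print(post_id_from_anchor("#p12345"))
```

Parsing the query string avoids depending on the parameter order, and anchoring the post ID pattern at the end of the href keeps it working when the link also contains a path and query.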
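Comments sit right below the post content. In the code above, comment_next_page is the link to the next page of comments; we simply keep sending requests and parse the comments on each page: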
def parse_comments(self, response):
    comments = response.xpath("//div[@id='page-body']//div[@class='inner']")
    if response.meta.get("dont_parse_post") is None:
        # On the first page the first block is the post itself, so skip it
        comments = comments[1:]
    for comment in comments:
        comment_item = CommentItem()
        comment_item['type'] = "comment"
        comment_item['area'] = self.name
        comment_item['msg_type'] = "評論"  # "comment"
        comment_item['post_id'] = response.meta.get("post_id")
        comment_item["parent_id"] = response.meta.get("post_id")
        comment_item["website_name"] = self.allowed_domains[0]
        user_link = comment.xpath("div//strong/a[@class='username']/@href").extract_first(default='').strip()
        account_id = "".join(re.compile(r"&u=(\d+)").findall(user_link))
        comment_item['comment_id'] = comment.xpath("div//h3/a/@href").extract_first(default='').strip().replace("#p", "")
        comment_item['account_id'] = account_id
        comment_item['account_name'] = comment.xpath("div//strong/a[@class='username']/text()").extract_first(default='').strip()
        comment_time = "".join(comment.xpath("div//p[@class='author']/text()").extract()).strip()
        if not comment_time:
            continue  # blocks without an author line are not comments
        comment_level_text = comment.xpath("div//div[@id='post_content%s']//a[contains(@href,'./viewtopic.php?p')]/text()" % comment_item['comment_id']).extract_first(default='')
        comment_item['comment_level'] = "".join(re.compile(r"\d+").findall(comment_level_text))
        # dateparser.parse returns None when it cannot parse the text
        release = dateparser.parse(comment_time)
        comment_item['release_time'] = release.strftime('%Y-%m-%d %H:%M:%S') if release else ""
        comment_item['content'] = "".join(comment.xpath("div//div[@class='content']/text()").extract()).strip()
        yield comment_item
        if user_link:
            # Copy meta so each account request carries its own account fields
            meta = dict(response.meta, account_id=account_id, account_name=comment_item["account_name"])
            yield Request(response.urljoin(user_link), meta=meta, callback=self.parse_account_info)
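The role of the dont_parse_post flag can be shown in isolation. In this sketch, plain strings stand in for the page's content blocks; split_post_and_comments is a hypothetical helper, not a function from the spider.

```python
from typing import List, Optional, Tuple


def split_post_and_comments(blocks: List[str],
                            dont_parse_post: Optional[int]) -> Tuple[Optional[str], List[str]]:
    """Separate the original post from the comments on one page of a thread.

    On the first page (flag is None) the first block is the post; on later
    pages (flag is set) every block is a comment.
    """
    if dont_parse_post is None and blocks:
        return blocks[0], blocks[1:]
    return None, list(blocks)


if __name__ == "__main__":
    print(split_post_and_comments(["post", "c1", "c2"], None))
    print(split_post_and_comments(["c3", "c4"], 1))
```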
Comment collection also gathers information about each commenting user by calling the parse_account_info function, implemented as follows:
def parse_account_info(self, response):
    about_item = AccountItem()
    about_item["account_id"] = response.meta["account_id"]
    about_item["account_url"] = response.url
    about_item["account_name"] = response.meta["account_name"]
    about_item["nick_name"] = ""
    about_item["website_name"] = self.allowed_domains[0]
    about_item["account_type"] = "forum"
    about_item["level"] = ""
    account_description = "".join(response.xpath("//div[@class='inner']/div[@class='postbody']//text()").extract())
    about_item["account_description"] = account_description
    about_item["account_followed_num"] = ""
    about_item["account_followed_list"] = ""
    about_item["account_focus_num"] = ""
    about_item["account_focus_list"] = ""
    regist_time = "".join(response.xpath("//dl/dt[text()='Joined:']/following-sibling::dd[1]/text()").extract())
    # Guard against a missing or unparseable date (dateparser.parse returns None)
    regist = dateparser.parse(regist_time) if regist_time.strip() else None
    about_item["regist_time"] = regist.strftime('%Y-%m-%d %H:%M:%S') if regist else ""
    about_item["forum_credits"] = ""
    location = "".join(response.xpath("//dl/dt[text()='Location:']/following-sibling::dd[1]/text()").extract())
    about_item["location"] = location
    post_num_text = response.xpath("//dl/dt[text()='Total posts:']/following-sibling::dd[1]/text()[1]").extract_first(default='')
    # Trim whitespace first so a trailing "|" separator can be removed
    post_num = post_num_text.replace(",", '').strip().strip("|").strip()
    about_item["post_num"] = post_num
    about_item["reply_num"] = ""
    about_item["msg_type"] = 'account'
    about_item["area"] = self.name
    yield about_item
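Two small cleaning steps in the profile parser are easy to get subtly wrong, so here is a hedged sketch of them as standalone helpers. The sample input format ("1,234 | ") is an assumption about how the post count text looks on the profile page; clean_post_count and format_time are illustrative names.

```python
from datetime import datetime
from typing import Optional


def clean_post_count(text: str) -> str:
    """Turn text such as '1,234 | ' into '1234'.

    Whitespace is trimmed before the '|' separator, otherwise a trailing
    space would keep strip('|') from removing it.
    """
    return text.replace(",", "").strip().strip("|").strip()


def format_time(parsed: Optional[datetime]) -> str:
    """Format a parsed datetime, or return '' when parsing failed (None)."""
    return parsed.strftime("%Y-%m-%d %H:%M:%S") if parsed else ""


if __name__ == "__main__":
    print(clean_post_count("1,234 | "))
    print(format_time(datetime(2020, 1, 2, 3, 4, 5)))
```

format_time is meant to wrap the result of dateparser.parse, which returns None for text it cannot interpret.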
Finally, from posts to comments to account information, each layer calls the next until we have complete structured data, which is yielded as items for storage in the database.
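The article does not show how the yielded items are stored. As a stand-in for a database writer, the following is a minimal sketch of an item pipeline that appends each item to a JSON Lines file. The class name JsonLinesPipeline and the file name items.jsonl are hypothetical; open_spider, close_spider, and process_item are the standard pipeline hook names.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: append every item to a JSON Lines file."""

    def __init__(self, path: str = "items.jsonl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        if self.file:
            self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for Scrapy Items and for plain dicts
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline like this would be enabled through the ITEM_PIPELINES setting; a real project would replace the file write with an insert into its database of choice.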
5. Middleware configuration
Because the example is an overseas forum, requests need to go through a proxy, which we configure with a downloader middleware:
import logging
import os
import sys


class ProxiesMiddleware:
    logfile = logging.getLogger(__name__)

    def process_request(self, request, spider):
        self.logfile.debug("entry ProxyMiddleware")
        try:
            # Use the proxy address configured on the spider (spider.proxy), if any
            proxy_addr = getattr(spider, "proxy", None)
            if proxy_addr:
                if request.url.startswith("http://"):
                    request.meta['proxy'] = "http://" + proxy_addr  # HTTP proxy
                elif request.url.startswith("https://"):
                    request.meta['proxy'] = "https://" + proxy_addr  # HTTPS proxy
        except Exception as e:
            # Log where the failure happened instead of aborting the request
            exc_type, exc_obj, exc_tb = sys.exc_info()
            fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
            self.logfile.warning("Proxies error: %s, %s, %s, %s" %
                                 (exc_type, e, fname, exc_tb.tb_lineno))
Enable the middleware in the settings file:
DOWNLOADER_MIDDLEWARES = {
    'forum.middlewares.ProxiesMiddleware': 100,
}
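The proxy selection rule inside the middleware can be isolated as a pure function, which makes it easy to check without running a crawl. This is a sketch that mirrors the rule shown above; build_proxy_url and the sample proxy address 127.0.0.1:8080 are assumptions for illustration.

```python
from typing import Optional


def build_proxy_url(request_url: str, proxy_addr: Optional[str]) -> Optional[str]:
    """Mirror the middleware's rule: prefix the proxy address with the request's scheme."""
    if not proxy_addr:
        return None  # no proxy configured on the spider
    for scheme in ("http://", "https://"):
        if request_url.startswith(scheme):
            return scheme + proxy_addr
    return None  # other schemes are left without a proxy


if __name__ == "__main__":
    # Hypothetical local proxy address
    print(build_proxy_url("https://forum.axishistory.com/", "127.0.0.1:8080"))
```

In the spider, the address would be supplied as a class attribute named proxy, since that is what the middleware reads. Whether the proxy itself should be addressed with http:// or https:// depends on the proxy server, so the prefix may need adjusting for a given setup.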
That's all for this article. Writing takes effort, so please leave a like before you go. Your support keeps me writing, and I hope to bring you more quality articles.