Table of Contents
1. Analyzing the target
2. Code implementation
Goal 1: send nested requests to the captured URLs
Goal 2: locate the phone number and email
Goal 3: extract the phone number and email
3. Complete code
4. Network-security circle
(Note: requests must carry the cookie from a successful login.)
1. Analyzing the target
Click into each company's page and scrape its phone number and email.
(We are just here to submit résumés; do not use this for anything illegal.)
Record the URL of each company.
(The URL must appear on the listing page; otherwise there would be no way to navigate to the detail page.)
We can see that this URL is exactly the one the page navigates to.
In fact, we already scraped this URL for each company in an earlier step.
Approach:
Send a request to each scraped URL, process the response, and extract the data we need.
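The approach above can be tried offline on a small stand-in snippet before touching the live site. The HTML below is a made-up miniature of a detail page; the class names (`index_detail__JSmQM`, `index_detail-tel__fgpsE`, `index_detail-email__B_1Tq`) are the ones used throughout this article, but they are auto-generated and may change at any time:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML a company detail page returns; in the real
# crawler this string comes from get_page(u) with the login cookie.
sample_html = """
<div class="index_detail__JSmQM">
  <div class="index_first__3b_pm">
    <span class="link-hover-click">
      <span class="index_detail-tel__fgpsE">027-12345678</span>
    </span>
  </div>
  <div class="index_second__rX915">
    <span class="index_detail-email__B_1Tq">hr@example.com</span>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')  # 'lxml' also works if installed
detail = soup.find('div', attrs={'class': 'index_detail__JSmQM'})
phone = detail.find('span', attrs={'class': 'index_detail-tel__fgpsE'}).text
email = detail.find('span', attrs={'class': 'index_detail-email__B_1Tq'}).text
print(phone, email)  # → 027-12345678 hr@example.com
```

Once this parsing logic works on the snippet, the only change for the real site is fetching the page over the network.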
2. Code implementation
Goal 1: send nested requests to the captured URLs
for u in [link]:  # link is a single detail-page URL; wrapping it in a list keeps the loop form
    html2 = get_page(u)
    soup2 = BeautifulSoup(html2, 'lxml')
    email_phone_div = soup2.find('div', attrs={'class': 'index_detail__JSmQM'})
Goal 2: locate the phone number and email
(1) Find their parent element (the div that contains both of them):
for u in [link]:
    html2 = get_page(u)
    soup2 = BeautifulSoup(html2, 'lxml')
    email_phone_div = soup2.find('div', attrs={'class': 'index_detail__JSmQM'})
Goal 3: extract the phone number and email
(1) First add a None check.
Check the shared parent of phone and email; continue only when it is not None:
if email_phone_div is not None:
    phone_div = email_phone_div.find('div', attrs={'class': 'index_first__3b_pm'})
    email_div = email_phone_div.find('div', attrs={'class': 'index_second__rX915'})
    # phone/email extraction code goes here
else:
    phone = ''
    email = ''
# write one row per loop iteration
csv_w.writerow((title.strip(), link, type_texts, money, email, phone))
(2) Extract the phone number
Again, check the parent tag for None first and continue only when it is not:
if phone_div is not None:
    phone_element = phone_div.find('span', attrs={'class': 'link-hover-click'})
    if phone_element is not None:
        phone = phone_element.find('span', attrs={'class': 'index_detail-tel__fgpsE'}).text
    else:
        phone = ''
else:
    phone = ''
(3) Extract the email
As with the phone, check the parent for None first, then extract:
if email_div is not None:
    email_element = email_div.find('span', attrs={'class': 'index_detail-email__B_1Tq'})
    if email_element is not None:
        email = email_element.text
    else:
        email = ''
else:
    email = ''
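The phone and email extractions above repeat the same pattern: descend through nested tags, bailing out with '' as soon as any level is missing. A small helper (a sketch, not part of the original code) can collapse the nested checks; it works with any object whose `.find` returns either a child tag or None, which is exactly how BeautifulSoup tags behave:

```python
def safe_find_text(parent, *steps, default=''):
    """Walk a chain of (name, attrs) find() steps, returning `default`
    as soon as any step yields None instead of raising AttributeError."""
    node = parent
    for name, attrs in steps:
        if node is None:
            return default
        node = node.find(name, attrs=attrs)
    return node.text if node is not None else default

# With the helper, the whole phone extraction becomes one call:
# phone = safe_find_text(email_phone_div,
#     ('div', {'class': 'index_first__3b_pm'}),
#     ('span', {'class': 'link-hover-click'}),
#     ('span', {'class': 'index_detail-tel__fgpsE'}))
```

The article keeps the explicit nested ifs for clarity, but the helper shows how the pattern generalizes when more fields are added.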
(4) Complete code for handling the nested request
for u in [link]:
    html2 = get_page(u)
    soup2 = BeautifulSoup(html2, 'lxml')
    email_phone_div = soup2.find('div', attrs={'class': 'index_detail__JSmQM'})
    if email_phone_div is not None:
        phone_div = email_phone_div.find('div', attrs={'class': 'index_first__3b_pm'})
        email_div = email_phone_div.find('div', attrs={'class': 'index_second__rX915'})
        if phone_div is not None:
            phone_element = phone_div.find('span', attrs={'class': 'link-hover-click'})
            if phone_element is not None:
                phone = phone_element.find('span', attrs={'class': 'index_detail-tel__fgpsE'}).text
            else:
                phone = ''
        else:
            phone = ''
        if email_div is not None:
            email_element = email_div.find('span', attrs={'class': 'index_detail-email__B_1Tq'})
            if email_element is not None:
                email = email_element.text
            else:
                email = ''
        else:
            email = ''
    else:
        phone = ''
        email = ''
    csv_w.writerow((title.strip(), link, type_texts, money, email, phone))
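The extracted values come straight from page text, so placeholder strings can end up in the CSV. A light sanity check before writing the row can filter obvious junk. This is a sketch, not part of the original code, and the phone patterns are assumptions about common mainland-China formats (area code plus landline, or an 11-digit mobile number):

```python
import re

# Assumed formats: 0 + 2-3 digit area code + 7-8 digit landline (optional
# hyphen), or an 11-digit mobile number starting with 1[3-9].
PHONE_RE = re.compile(r'^(?:0\d{2,3}-?\d{7,8}|1[3-9]\d{9})$')
EMAIL_RE = re.compile(r'^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$')

def clean(value, pattern):
    """Return the stripped value if it matches the pattern, else ''."""
    value = value.strip()
    return value if pattern.match(value) else ''

print(clean('027-12345678', PHONE_RE))  # → 027-12345678
print(clean('暫無(wú)', PHONE_RE))           # placeholder text is rejected, prints ''
```

Applying clean() to phone and email just before csv_w.writerow keeps the output file free of "no data" placeholders.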
3. Complete code
Run result:
The scraped data written to the CSV.
(Fill in your own cookie.)
import time
import requests
import csv
from bs4 import BeautifulSoup


def get_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36',
            'Cookie': '!!!!!!!!'  # paste the cookie from your logged-in session here
        }
        response = requests.get(url, headers=headers, timeout=10)
        return response.text
    except requests.RequestException:
        return ""


def get_TYC_info(page):
    TYC_url = f"https://www.tianyancha.com/search?key=&sessionNo=1688538554.71584711&base=hub&cacheCode=00420100V2020&city=wuhan&pageNum={page}"
    html = get_page(TYC_url)
    soup = BeautifulSoup(html, 'lxml')
    GS_list = soup.find('div', attrs={'class': 'index_list-wrap___axcs'})
    if GS_list is None:  # page failed to load or the layout changed
        return
    GS_items = GS_list.find_all('div', attrs={'class': 'index_search-box__7YVh6'})
    for item in GS_items:
        title = item.find('div', attrs={'class': 'index_name__qEdWi'}).a.span.text
        link = item.a['href']
        company_type_div = item.find('div', attrs={'class': 'index_tag-list__wePh_'})
        if company_type_div is not None:
            company_type = company_type_div.find_all('div', attrs={'class': 'index_tag-common__edIee'})
            type_texts = [element.text for element in company_type]
        else:
            type_texts = ''
        money = item.find('div', attrs={'class': 'index_info-col__UVcZb index_narrow__QeZfV'}).span.text
        for u in [link]:  # nested request to the company's detail page
            html2 = get_page(u)
            soup2 = BeautifulSoup(html2, 'lxml')
            email_phone_div = soup2.find('div', attrs={'class': 'index_detail__JSmQM'})
            if email_phone_div is not None:
                phone_div = email_phone_div.find('div', attrs={'class': 'index_first__3b_pm'})
                email_div = email_phone_div.find('div', attrs={'class': 'index_second__rX915'})
                if phone_div is not None:
                    phone_element = phone_div.find('span', attrs={'class': 'link-hover-click'})
                    if phone_element is not None:
                        phone = phone_element.find('span', attrs={'class': 'index_detail-tel__fgpsE'}).text
                    else:
                        phone = ''
                else:
                    phone = ''
                if email_div is not None:
                    email_element = email_div.find('span', attrs={'class': 'index_detail-email__B_1Tq'})
                    if email_element is not None:
                        email = email_element.text
                    else:
                        email = ''
                else:
                    email = ''
            else:
                phone = ''
                email = ''
            csv_w.writerow((title.strip(), link, type_texts, money, email, phone))


if __name__ == '__main__':
    with open('5.csv', 'a', encoding='utf-8', newline='') as f:
        csv_w = csv.writer(f)
        csv_w.writerow(('公司名', 'URL', '類型', '資金', '電子郵件', '電話號(hào)碼'))
        for page in range(1, 5):
            get_TYC_info(page)
            print(f'Page {page} scraped')
            time.sleep(2)  # be polite between page requests
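One weakness of get_page() above is that it swallows every failure and returns an empty string, so a momentary network hiccup silently drops an entire page of companies. A small retry wrapper is one way to harden it; this is a sketch, not part of the original code, and fetch_fn stands for any fetch function such as get_page:

```python
import time

def fetch_with_retry(fetch_fn, url, attempts=3, delay=1.0):
    """Call fetch_fn(url) up to `attempts` times, sleeping `delay`
    seconds between tries; return '' only after every try failed."""
    for i in range(attempts):
        result = fetch_fn(url)
        if result:  # a non-empty body counts as success
            return result
        if i < attempts - 1:
            time.sleep(delay)
    return ''

# Usage in get_TYC_info would be:
#     html = fetch_with_retry(get_page, TYC_url)
```

The wrapper changes nothing when the first request succeeds, so it can be dropped in without touching the parsing code.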
4. Network-security circle
README.md · 書(shū)半生/網(wǎng)絡(luò)安全知識(shí)體系-實(shí)戰(zhàn)中心 (gitee.com): https://gitee.com/shubansheng/Treasure_knowledge/blob/master/README.md
GitHub - BLACKxZONE/Treasure_knowledge: https://github.com/BLACKxZONE/Treasure_knowledge