For study and reference only. Article source: http://www.zghlxwxcb.cn/news/detail-829606.html
1. Extract the link text and link targets from an HTML page and write them to a TXT file
import requests
from lxml import html

base_url = "https://abcdef"  # placeholder: replace with your own URL
response = requests.get(base_url)
response.encoding = 'utf-8'  # affects response.text only; the parser below decodes response.content
tree = html.fromstring(response.content, parser=html.HTMLParser(encoding='utf-8'))

# Fixed part of the XPath; only the index of the final li changes.
# Copy the XPath from your own page in the browser and adjust it here.
fixed_xpath = "/html/body/div[4]/div[2]/ul/li[{div_index}]/a"
# File name kept exactly as in the original post; the second script reads the same file
filename = "現(xiàn)TXT文本內(nèi)容.txt"

with open(filename, "w", encoding="utf-8") as f:
    for div_index in range(1, 101):  # assume there are at most 100 people
        # Build the full XPath for this list item
        xpath = fixed_xpath.format(div_index=div_index)
        # Locate the link element(s) for this person
        person_elements = tree.xpath(xpath)
        for person_element in person_elements:
            # Read the link target exactly as it appears in the href attribute
            url_path = person_element.get("href") or ""
            # Extract the link text and strip surrounding whitespace
            name = person_element.xpath('string()').strip()
            # Labels: "網址路徑" = URL path, "姓名" = name (the second script matches "姓名:")
            output_str = f"網址路徑:{url_path}\n姓名:{name}\n\n"
            print(output_str)
            f.write(output_str)

print(f"Output saved to file {filename}")
Result: contents of the TXT file
網址路徑:http://abc.html
姓名:abc
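The second script only follows entries whose URL line contains a full http(s) address, so links that the page writes as relative paths would be skipped silently. One way to normalise them is the standard library's `urllib.parse.urljoin`. The following is a minimal sketch, not part of the original script: the helper name `resolve_href` and the example.com URLs are illustrative assumptions.

```python
from urllib.parse import urljoin


def resolve_href(base_url, href):
    """Return an absolute URL for href, or "" when href is missing (hypothetical helper)."""
    if not href:
        return ""
    # urljoin leaves absolute URLs untouched and resolves relative
    # ones against base_url.
    return urljoin(base_url, href.strip())


if __name__ == "__main__":
    base = "https://example.com/staff/index.html"  # hypothetical listing page
    for href in ["/people/abc.html", "abc.html", "http://other.example/x.html", None]:
        print(repr(href), "->", repr(resolve_href(base, href)))
```

Writing the resolved URL to the TXT file, rather than the raw `href`, would let the second step open every link regardless of how the page wrote it.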
2. Using the existing TXT file, open each link, find the required content, append it after the name, and write the result to a new text file
import re
import requests
from lxml import html

# Read the file written by the first script (file name kept exactly as in the original post)
with open("現(xiàn)TXT文本內(nèi)容.txt", "r", encoding="utf-8", errors="ignore") as file:
    lines = file.read().splitlines()

# Absolute XPath of the email text node; copy it from your own target page
email_xpath = '/html/body/div[4]/div/div/div/div/div[2]/div[1]/div[2]/div[4]/div[1]/text()'
filename = "現(xiàn)TXT文本內(nèi)容郵箱.txt"

with open(filename, "w", encoding="utf-8") as f:
    # Walk the lines pairwise; stop one short so lines[i + 1] always exists
    for i in range(len(lines) - 1):
        url_line = lines[i]       # candidate URL line
        name_line = lines[i + 1]  # the following line should hold the name
        # Extract the URL and the name from the two lines
        url_match = re.search(r"https?://\S+", url_line)
        name_match = re.search(r"姓名:(.+)", name_line)  # "姓名" = name label
        # Continue only if both the URL and the name were found
        if url_match and name_match:
            url = url_match.group()
            name = name_match.group(1)
            # Send a GET request to fetch the page
            response = requests.get(url)
            # Parse the page into an element tree that supports XPath
            tree = html.fromstring(response.content)
            # Extract the email text with the XPath expression
            email = tree.xpath(email_xpath)
            email = email[0].strip() if email else "email address not found"
            # Write the name and email to the output file
            output_str = f"{name}:{email}\n"
            print(output_str)
            f.write(output_str)

# Report where the output was saved
print(f"Output saved to file {filename}")
Output: contents of the TXT file
abc:abc@aa.com
...
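The pairing step in the second script (a URL line followed by a name line) can be tested without any network access if it is pulled into its own function. The following is a rough sketch under stated assumptions: the function name `parse_records` is hypothetical, and the input is assumed to follow the first script's format of a URL line, a `姓名:` line, and a blank line per record. The label on the URL line is ignored, since only the http(s) address is matched.

```python
import re

URL_RE = re.compile(r"https?://\S+")
NAME_RE = re.compile(r"姓名:(.+)")  # "姓名" is the name label written by the first script


def parse_records(text):
    """Return (url, name) pairs found in text (hypothetical helper)."""
    lines = text.splitlines()
    records = []
    # zip pairs each line with its successor, so the last line is never
    # indexed past the end of the list.
    for url_line, name_line in zip(lines, lines[1:]):
        url_match = URL_RE.search(url_line)
        name_match = NAME_RE.search(name_line)
        if url_match and name_match:
            records.append((url_match.group(), name_match.group(1).strip()))
    return records


if __name__ == "__main__":
    sample = "網址路徑:http://abc.html\n姓名:abc\n\n"
    print(parse_records(sample))
```

Keeping the parsing separate from the HTTP requests means a malformed TXT file can be caught before any page is fetched.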