
Using the bs4 parsing library

2 years ago | Author: 不再熬夜 | Category: Toy blog

This article introduces how to use the bs4 parsing library. Corrections and notes on anything that was overlooked are welcome.

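I. Using bs4

Install: pip3 install beautifulsoup4

1. Traversing the document tree with bs4
bs4: a module for parsing HTML and XML documents and extracting the data you want from them.
HTML and XML are both tag-based markup (HTML is not strictly a kind of XML). The data returned by requests may be JSON, HTML, or a file; when it is HTML, parse it with bs4.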
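As a minimal sketch of the point above, the helper below picks a parser from a Content-Type string. The function name parse_payload and the sample payloads are hypothetical and only illustrate the idea; html.parser is used so the sketch does not depend on lxml.

```python
import json

from bs4 import BeautifulSoup


def parse_payload(content_type, text):
    """Hypothetical helper: choose how to parse a response body."""
    if 'json' in content_type:
        return json.loads(text)  # JSON becomes a dict or list
    if 'html' in content_type:
        return BeautifulSoup(text, 'html.parser')  # HTML becomes a soup object
    return text  # anything else is returned unchanged


data = parse_payload('application/json', '{"name": "bs4"}')
page = parse_payload('text/html; charset=utf-8', '<html><body><p>hi</p></body></html>')
print(data['name'])
print(page.p.text)
```

With requests, the Content-Type would normally come from the response headers and the text from the response body.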

Usage:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id='id_p' xx='xx'>I am handsome<b>The Dormouse's story <span>xxx</span></b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# soup = BeautifulSoup(html_doc, 'html.parser')

# Faster than the parser above, but requires the lxml module: pip3 install lxml
soup = BeautifulSoup(html_doc, 'lxml')
res = soup.prettify()  # pretty-print the document
print(res)

# ---------- Traversing the document tree ----------
# 1. Usage: navigate with "."
body = soup.body  # soup.<tag name> returns the first tag object with that name
print(type(body))
print(body.p)
# A bs4.element.Tag object can itself be navigated further with "."

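# 2. Get the tag name
p = soup.p
print(p.name)

# 3. Get the tag attributes
p = soup.p
print(p.attrs)  # all attributes of the p tag as a dict
print(p.attrs['class'])  # class is a list, because a tag can have several classes
print(p['id'])  # second way to read an attribute

# Get the href attribute of the first a tag
# (.get returns None instead of raising KeyError when the attribute is missing)
a = soup.html.body.a.get('href')
print(a)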

# 4. Get the tag content
# text  string  strings
# Get the text content of the first p tag
p = soup.p
print(p.text)  # text of all descendants of p, concatenated
print(p.string)  # the text if p contains only text; None if p has several children
print(list(p.strings))  # generator over the text of all descendants

# 5. Nested selection
p = soup.html.body.p
print(p)

# ---------- For reference only ----------
# 6. Children and descendants
print(soup.p.contents)  # direct children of p (no grandchildren), as a list
print(list(soup.p.children))  # iterator over the same direct children

print(list(soup.p.descendants))  # all descendants

# 7. Parent and ancestors
print(soup.a.parent)  # parent of the a tag
print(list(soup.a.parents))  # all ancestors of the a tag: parent, grandparent, ...

# 8. Siblings
print(soup.a.next_sibling)  # next sibling; immediately adjacent, not necessarily a tag
print(soup.a.previous_sibling)  # previous sibling

print(list(soup.a.next_siblings))  # all following siblings => generator object
print(list(soup.a.previous_siblings))  # all preceding siblings => generator object

Note: lxml is faster than html.parser, but requires installing the lxml module (pip3 install lxml).
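The difference between text, string and strings is easy to mix up, so here is a small sketch on a made-up two-paragraph snippet (the markup and the variable names plain and mixed are illustrative assumptions, not part of the example above):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="a">plain</p><p id="b">mixed <b>bold</b></p>', 'html.parser')
plain, mixed = soup.find_all('p')

print(plain.string)         # the only child is text, so .string returns it
print(mixed.string)         # text plus a child tag, so .string is None
print(mixed.text)           # all descendant text joined together
print(list(mixed.strings))  # the same text, piece by piece
```

In practice text is the safest choice when a tag may contain child tags, while string is useful for checking that a tag holds nothing but text.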

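2. Searching the document tree with bs4
Searching the document tree is somewhat slower than traversing it.

Five kinds of filters:
string, regular expression, list, True, function

Two methods:
find: returns the first match    find_all: returns all matches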
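A short sketch of what the two methods return, including the case where nothing matches. The list markup is a made-up sample:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>one</li><li>two</li></ul>', 'html.parser')

first = soup.find('li')               # a single Tag: the first match
every = soup.find_all('li')           # a list-like ResultSet with every match
missing = soup.find('table')          # no match: find returns None
none_found = soup.find_all('table')   # no match: find_all returns an empty list

print(first.text, len(every), missing, len(none_found))
```

Because find can return None, it is worth checking the result before reading attributes from it.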

Usage:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id='id_p' xx='xx'>I am handsome<b>The Dormouse's story <span>xxx</span></b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# String: keyword arguments of the form attribute='string'
res = soup.find_all(name='body')
res = soup.find(name='body')
res = soup.find(class_='story')  # class is a Python keyword, so write class_
res = soup.find(id='link2')
res = soup.find(href='http://example.com/lacie')
# Multiple arguments are combined as AND conditions
res = soup.find(attrs={'class': 'story'})
print(res)

# Regular expression
import re
print(soup.find_all(name=re.compile('^b')))  # tags whose name starts with b: body and b
# Find all tags whose href is an http link
res = soup.find_all(href=re.compile('^http://'))
res = soup.find_all(href=re.compile('.*?lie$'))
res = soup.find_all(id=re.compile('^id'))
print(res)

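# List: a tag matches if it matches any item
res = soup.find_all(name=['p', 'a'])
res = soup.find_all(class_=['title', 'story'])
print(len(res))

# Boolean: True matches any tag that has the given name or attribute
res = soup.find_all(name=True)
res = soup.find_all(href=True)
res = soup.find_all(src=True)  # e.g. collect the images on the current page
print(res)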

# Function (for reference): called with each tag, returns True to keep it
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')


print(soup.find_all(has_class_but_no_id))
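To make the "multiple arguments mean AND" rule concrete, and to contrast it with the list filter (which behaves like OR), here is a sketch on a shortened, made-up version of the sample markup:

```python
from bs4 import BeautifulSoup

html = (
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '<p class="story">text</p>'
)
soup = BeautifulSoup(html, 'html.parser')

# Keyword arguments are ANDed: tag name, class and id must all match.
and_match = soup.find_all(name='a', class_='sister', id='link2')

# A list is ORed: a tag matches if its name is any item of the list.
or_match = soup.find_all(name=['a', 'p'])

print([t.text for t in and_match])
print(len(or_match))
```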

Example:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=9&start=12&mrd=0.013926765110156447')
soup = BeautifulSoup(resp.text, 'lxml')
# Find every li with class "categoryem" and print the link inside it
li_list = soup.find_all(name='li', class_='categoryem')
for li in li_list:
    href = li.div.a['href']
    print(href)

# Alternative: search for the a tags directly
a_list = soup.find_all(href=True, name='a', class_='vervideo-lilink')
print(a_list)
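The example above depends on the live site. The sketch below runs the same extraction offline on a hand-written fragment that only imitates the class names used above; the fragment, the href values and the base URL handling are assumptions, not the site's real markup. It also skips list items without a link instead of failing on them.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical fragment, shaped like the selectors used in the example.
sample = '''
<ul>
  <li class="categoryem"><div><a href="video_1" class="vervideo-lilink">A</a></div></li>
  <li class="categoryem"><div><a href="video_2" class="vervideo-lilink">B</a></div></li>
  <li class="categoryem"><div>no link here</div></li>
</ul>
'''
base = 'https://www.pearvideo.com/'
soup = BeautifulSoup(sample, 'html.parser')

links = []
for li in soup.find_all('li', class_='categoryem'):
    a = li.find('a', href=True)  # None when the li contains no link
    if a is not None:
        links.append(urljoin(base, a['href']))  # make the link absolute

print(links)
```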

3. Other uses of bs4
Traversal and search can be mixed.
recursive: whether to search recursively
limit: how many results to return

Usage:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id='id_p' xx='xx'>I am handsome<b>The Dormouse's story <span>xxx</span></b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

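# Traversal and search can be mixed
res = soup.html.body.find('p')
res = soup.find('p')
print(res)

# recursive: whether to search recursively (False looks at direct children only)
res = soup.html.body.find_all(name='p', recursive=False)
print(res)

# limit: maximum number of results to return
res = soup.find_all('p', limit=2)
print(len(res))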
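In the sample document every p is a direct child of body, so recursive=False makes no visible difference there. A sketch on a made-up nested fragment shows the effect more clearly:

```python
from bs4 import BeautifulSoup

html = '<div id="root"><p>direct</p><section><p>nested</p></section></div>'
soup = BeautifulSoup(html, 'html.parser')
root = soup.find(id='root')

all_p = root.find_all('p')                      # searches every level
direct_p = root.find_all('p', recursive=False)  # direct children only
first_p = root.find_all('p', limit=1)           # stops after one result

print(len(all_p), [p.text for p in direct_p], len(first_p))
```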

Supplement:
1 Method chaining (independent of the language)

class Person:
    def change_name(self, name):
        self.name = name
        return self

    def change_age(self, age):
        self.age = age
        return self

    def __str__(self):
        try:
            return 'My name is: %s, my age is: %s' % (self.name, self.age)
        except AttributeError:
            # name or age has not been set yet
            return super().__str__()


p = Person()
p.change_name('egon').change_age(14)
print(p)
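The same chaining idea applies to bs4: find returns a Tag, and a Tag has its own find, so calls can be strung together. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p class="title"><b>bold</b></p></body>', 'html.parser')

# Each find returns a Tag, which can be searched again.
text = soup.find('body').find('p', class_='title').find('b').text
print(text)
```

The chain breaks with an AttributeError as soon as one find returns None, so it suits documents whose structure is known in advance.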

2 bs4 also supports modifying the document tree. This is of little use for crawlers, but useful when writing real backend code.
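A small sketch of what modifying the tree can look like, assuming the goal is to change an attribute, drop an unwanted child and append text. The fragment is made up for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="old">hello <span>ad</span></p>', 'html.parser')
p = soup.p

p['class'] = 'new'    # overwrite an attribute
p.span.decompose()    # remove the span and its content from the tree
p.append('world')     # append a text node

print(p)
```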

3 Configuration file formats of mainstream software
xxx.conf (redis, nginx)
xxx.ini (mysql)
xxx.xml (uwsgi; most common for Java configuration files)
xxx.yaml

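4 CSS selectors
Parsing libraries usually have their own lookup methods (in bs4, find and find_all) and often also support CSS or XPath selection. bs4 supports CSS selectors through select; it does not support XPath.
A few CSS selector forms worth remembering:

#id          select by id
.classname   select by class name
p            select by tag name
tag>tag      direct child
tag tag      any descendant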

res=soup.select('#id_p')
res=soup.select('p>a')
print(res)
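The selector forms listed above can be tried side by side. The sketch below uses a made-up fragment so that the child and descendant selectors give different results:

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p class="story"><a id="link1">x</a></p><a id="link2">y</a></div>'
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.select('#box')       # by id
by_class = soup.select('.story')  # by class name
children = soup.select('div>a')   # a tags that are direct children of div
descendants = soup.select('div a')  # a tags anywhere below div
first = soup.select_one('a')      # first match only, or None

print([a['id'] for a in children])
print([a['id'] for a in descendants])
```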

5 XPath: a language for locating nodes in XML documents
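As a rough illustration of XPath-style lookup, the sketch below uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath (full XPath needs a library such as lxml). The document is a made-up fragment:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<root><div id="maincontent"><table>'
    '<tr><td>a</td><td>b</td></tr>'
    '</table></div></root>'
)

# Any div with id="maincontent", then table/tr, then the second td (1-based).
cell = doc.find(".//div[@id='maincontent']/table/tr/td[2]")
print(cell.text)
```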

6 What if you can write neither CSS nor XPath?
The ultimate trick: press F12 in the browser, select the page element, right-click it, and copy its XPath or CSS selector.
Example:

# css
# #maincontent > div:nth-child(3) > table > tbody > tr:nth-child(44) > td:nth-child(3)
# xpath
# //*[@id="maincontent"]/div[2]/table/tbody/tr[44]/td[3]
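One caveat when pasting a copied selector into bs4: browsers add a tbody element to tables in their DOM even when the raw HTML has none, so a copied selector may match nothing in the downloaded source. A sketch on a made-up table, using html.parser (which does not add tbody):

```python
from bs4 import BeautifulSoup

html = '<div id="maincontent"><table><tr><td>1</td><td>2</td><td>3</td></tr></table></div>'
soup = BeautifulSoup(html, 'html.parser')

# Selector as a browser might copy it, including the tbody it inserted.
copied = soup.select_one('#maincontent > table > tbody > tr > td:nth-child(3)')

# Same selector with tbody removed, matching the raw HTML.
adjusted = soup.select_one('#maincontent > table > tr > td:nth-child(3)')

print(copied)
print(adjusted.text)
```

If a copied selector returns nothing, comparing it against the raw response text rather than the browser's element panel is a reasonable first check.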
