"Python for Absolute Beginners", Lecture 060: The Self-Cultivation of a Web Crawler 8: Regular Expressions 4

Author: XILALIKE | Category: Toy Blog | Posted 2 years ago

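With the groundwork from the previous lessons, we can finally get down to some real work. Before the hands-on part, though, we still need to cover the practical methods and extension syntax of regular expressions. Then comes the practice, so bear with me a little longer.

Let's start by looking through the documentation:

First up is the method we have talked about most, search(). It exists at module level, where you call re.search() directly, and a compiled regular expression pattern object has a search() method as well. So let me ask: is there any difference between the two?

If your only answer is that the module-level search() takes one more argument (the regular expression) than the pattern object's search(), then you clearly have not read the documentation.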

re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

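This is the module-level search(). Note its flags parameter: these are the compilation flags we covered in the last lesson. Because the module-level function has no separate compile step in which to set them, you pass the flags right here.

pattern is the regular expression pattern.

string is the string to search.

Now let's see which parameters search() takes on a compiled pattern object: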

regex.search(string[, pos[, endpos]])

Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

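The leading pattern argument is no longer needed, since the pattern object already carries it.

The first argument, string, is the string to search.

The two optional parameters that follow are not available on the module-level search(): they give the start position (pos) and end position (endpos) of the search.

So you can restrict the searched range with a call such as rx.search(string, 0, 50), which is equivalent to rx.search(string[:50], 0).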
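As a small illustration of pos and endpos, here is a minimal sketch. The sample string and both patterns are made up for this example and do not come from the lecture; the point is only to show the start and end limits, and the documented caveat that pos is not the same as slicing when the pattern uses '^'.

```python
import re

# Hypothetical sample data, chosen only to make the indices easy to follow.
text = "abc 123 def 456"
digits = re.compile(r"\d+")

# pos=7 starts the scan at index 7, so the first number is skipped.
from_pos = digits.search(text, 7).group()

# endpos=6 treats the string as if it were only 6 characters long ("abc 12").
with_endpos = digits.search(text, 0, 6).group()
sliced = digits.search(text[:6], 0).group()

# pos is not the same as slicing: '^' still refers to the real start of the
# string, so an anchored pattern finds nothing with pos=4 but matches text[4:].
anchored = re.compile(r"^\d+")
anchored_with_pos = anchored.search(text, 4)
anchored_on_slice = anchored.search(text[4:]).group()

print(from_pos, with_endpos, sliced, anchored_with_pos, anchored_on_slice)
```

The two endpos variants agree, while the two '^' variants do not, which is the difference the documentation warns about.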

One more point that is easy to overlook: search() does not directly return the string you are after. Instead, it returns a match object. Here is an example:

>>> import re
>>> result = re.search(r" (\w+) (\w+)", "I love Python.com")
>>> result
<_sre.SRE_Match object; span=(1, 13), match=' love Python'>

As we can see, result is a match object, not a string. A match object has methods, and you use those methods to get the matched string you want.

For example, the group() method:

>>> result.group()
' love Python'

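That prints the matched text. The pattern is a space, then \w+ (one or more word characters, here love), then another space, then \w+ again (here Python).

Speaking of group(), it is worth mentioning that if the regular expression contains subgroups, each subgroup captures the text it matches. Passing an index to group() extracts the string captured by the corresponding subgroup (indices start at 1). For example: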

>>> result.group(1)
'love'
>>> result.group(2)
'Python'

Besides group(), a match object also has start(), end() and span(), which return the start position, end position and range of the match respectively.

match.start([group])

match.end([group])

Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

m.string[m.start(g):m.end(g)]

Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.

An example that will remove remove_this from email addresses:

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'

match.span([group])

For a match m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.

>>> result.start()
1
>>> result.end()
13
>>> result.span()
(1, 13)
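The same three methods also accept a group number. The sketch below reuses the lecture's pattern and string; the second pattern, (a)?b, is a made-up addition that shows what an optional group reports when it takes no part in the match.

```python
import re

result = re.search(r" (\w+) (\w+)", "I love Python.com")

# Each capturing group has its own position; group 0 is the whole match.
whole_span = result.span(0)
first_span = result.span(1)
second_span = result.span(2)

# Slicing the searched string with a group's indices gives back its text.
s = result.string
recovered = s[result.start(2):result.end(2)]

# Hypothetical extra pattern: the optional group (a)? does not participate
# when matching "b", so its span is reported as (-1, -1).
m = re.search(r"(a)?b", "b")
unused_span = m.span(1)

print(whole_span, first_span, second_span, recovered, unused_span)
```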

Next, the findall() method:

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

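You might think findall() is trivial: doesn't it just find every match and return them all as a list?

That is right, but only when the regular expression has no subgroups. If the regular expression contains subgroups, findall() gets clever.

Let's take an example and scrape images from Tieba.

Suppose we want to download all the images on a Tieba thread page.

We first scout the page source to see how the image tags are written: they carry class="BDE_Image" and a src attribute whose value ends in .jpg.

Let's write the code. First, the following fragment scrapes the image addresses: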

import re

# html is assumed to already hold the decoded source of the page
p = r'<img class="BDE_Image" src="[^"]+\.jpg"'
imglist = re.findall(p, html)
for each in imglist:
    print(each)

The printed result is:

============== RESTART: C:\Users\XiangyangDai\Desktop\tieba.py ==============
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=65ac7c3d9e0a304e5222a0f2e1c9a7c3/4056053b5bb5c9ea8d7d0bdadc39b6003bf3b34e.jpg"
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=d887aa03394e251fe2f7e4f09787c9c2/77f65db5c9ea15ceaf60e830bf003af33b87b24e.jpg"
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=0db90d472c1f95caa6f592bef9167fc5/2f78cfea15ce36d34f8a8b0933f33a87e850b14e.jpg"
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=abfd18169ccad1c8d0bbfc2f4f3f67c4/bd2713ce36d3d5392db307fa3387e950342ab04e.jpg"

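Obviously these are not the addresses we need; we only want the part inside the src attribute. The next problem is how to extract the address from each match. Many of you may already be reaching for the keyboard at this point, but hold on, there is a better way.

Just wrap the image address in parentheses, that is, change

p = r'<img class="BDE_Image" src="[^"]+\.jpg"' to p = r'<img class="BDE_Image" src="([^"]+\.jpg)"'

Now look at the output after running it again: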

============== RESTART: C:\Users\XiangyangDai\Desktop\tieba.py ==============
https://imgsa.baidu.com/forum/w%3D580/sign=65ac7c3d9e0a304e5222a0f2e1c9a7c3/4056053b5bb5c9ea8d7d0bdadc39b6003bf3b34e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=d887aa03394e251fe2f7e4f09787c9c2/77f65db5c9ea15ceaf60e830bf003af33b87b24e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=0db90d472c1f95caa6f592bef9167fc5/2f78cfea15ce36d34f8a8b0933f33a87e850b14e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=abfd18169ccad1c8d0bbfc2f4f3f67c4/bd2713ce36d3d5392db307fa3387e950342ab04e.jpg

Excited? Surprised? Hold on; let me finish typing the code first and then explain.

import urllib.request
import re

def open_url(url):
    # Send a browser-like User-Agent with the request.
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')
    # Open the Request object (not the bare URL) so the header is actually sent.
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

def get_img(url):
    html = open_url(url).decode('utf-8')
    # The parentheses make findall() return only the image address.
    p = r'<img class="BDE_Image" src="([^"]+\.jpg)"'
    imglist = re.findall(p, html)
    # for each in imglist:
    #     print(each)
    for each in imglist:
        # Use the last path component of the address as the local file name.
        filename = each.split('/')[-1]
        urllib.request.urlretrieve(each, filename, None)

if __name__ == '__main__':
    url = "https://tieba.baidu.com/p/4863860271"
    get_img(url)

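The result is that lots of pictures show up on the desktop (provided the program is run from the desktop; the images are downloaded into the folder the program runs in).

Now to clear up the puzzle: why does adding a pair of parentheses make things so convenient?

It is because when the regular expression given to findall() contains a subgroup, findall() returns the content of that subgroup by itself. And if there are several subgroups, it combines the captured contents of each match into a tuple and returns a list of those tuples.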
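The three behaviours can be compared side by side without touching the network. This is a minimal sketch: the html string and the example.com addresses are invented stand-ins for a downloaded page, and the third pattern is an assumption added only to show the multi-group case.

```python
import re

# Hypothetical snippet standing in for a downloaded page.
html = ('<img class="BDE_Image" src="https://example.com/a.jpg">'
        '<img class="BDE_Image" src="https://example.com/b.jpg">')

# No group: each item is the whole matched text.
no_group = re.findall(r'src="[^"]+\.jpg"', html)

# One group: each item is just what the group captured.
one_group = re.findall(r'src="([^"]+\.jpg)"', html)

# Two groups: each item is a tuple (directory part, file name).
two_groups = re.findall(r'src="([^"]+/)([^"/]+\.jpg)"', html)

print(no_group)
print(one_group)
print(two_groups)
```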

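Let's take another example,

because when findall() is not used properly, many students end up puzzled and confused.

Take the regular expression for matching IP addresses from earlier. We use findall() to try to pull IP addresses automatically from https://www.xicidaili.com/wt/:

The initial code is: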

import urllib.request
import re

def open_url(url):
    # Send a browser-like User-Agent with the request.
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

def get_ip(url):
    html = open_url(url).decode('utf-8')
    # [01] rather than [0,1], which would also accept a comma; the 25x and 2xx
    # alternatives come first so a final octet such as 255 is not cut to 25.
    p = r'((25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(25[0-5]|2[0-4]\d|[01]?\d?\d)'
    iplist = re.findall(p, html)
    for each in iplist:
        print(each)

if __name__ == "__main__":
    url = "https://www.xicidaili.com/wt/"
    get_ip(url)

The output is:

============== RESTART: C:\Users\XiangyangDai\Desktop\getIP.py ==============
('180.', '180', '122')
('248.', '248', '79')
('129.', '129', '198')
('217.', '217', '7')
('40.', '40', '35')
('128.', '128', '21')
('118.', '118', '106')
('101.', '101', '46')
('3.', '3', '4')

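The result is baffling. Why does this happen? It is clearly not what we wanted. The reason is that our regular expression uses three subgroups, so findall() helpfully sorts each match by group and hands it back as a tuple. A group that repeats keeps only its last repetition, which is why the first two fields of each tuple show the third octet.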
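To see where a tuple such as ('180.', '180', '122') comes from, here is a minimal offline sketch. It keeps the three-group shape of the lecture's pattern but simplifies each octet to \d{1,3}, and the sample sentence is made up; both are assumptions for illustration only.

```python
import re

# Same group layout as the lecture's pattern, with a simplified octet.
p = r'((\d{1,3})\.){3}(\d{1,3})'
sample = "proxy at 221.214.180.122 port 80"  # hypothetical input

m = re.search(p, sample)
whole = m.group(0)    # the full address is matched...
groups = m.groups()   # ...but each group keeps only its last repetition

found = re.findall(p, sample)

print(whole)
print(groups)
print(found)
```

The whole match is the complete address, yet findall() reports only the groups, so the first two octets never appear in its result.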

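Is there a way around it?

To solve this, we can make the subgroups not capture anything.

We consult the reference list "Python 3 Regular Expression Special Symbols and Usage (Detailed List)" and look for the extension syntax.

The extension syntax that stops a subgroup from capturing is the non-capturing group, written (?:...).

So the initial code is modified as follows: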

import urllib.request
import re

def open_url(url):
    # Send a browser-like User-Agent with the request.
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

def get_ip(url):
    html = open_url(url).decode('utf-8')
    # Every group is non-capturing (?:...), so findall() returns whole matches.
    # [01] rather than [0,1], which would also accept a comma; the 25x and 2xx
    # alternatives come first so a final octet such as 255 is not cut to 25.
    p = r'(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)'
    iplist = re.findall(p, html)
    for each in iplist:
        print(each)

if __name__ == "__main__":
    url = "https://www.xicidaili.com/wt/"
    get_ip(url)

Running it now yields the IP addresses we wanted:

============== RESTART: C:\Users\XiangyangDai\Desktop\getIP.py ==============
183.47.40.35
61.135.217.7
221.214.180.122
101.76.248.79
182.88.129.198
175.165.128.21
42.48.118.106
60.216.101.46
219.245.3.4
117.85.221.45

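Now back to the documentation.

There are some other practical methods as well, for example:

finditer() returns the results as an iterator, which is convenient for fetching matches by iteration.

sub() performs substitution.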
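Neither method is demonstrated in the lecture, so here is a brief sketch of both. The sample text and the pattern are invented for this example.

```python
import re

text = "I love Python.com and Example.com"  # hypothetical sample text
p = re.compile(r"(\w+)\.com")

# finditer() yields match objects one at a time, so the position of each
# match is available along with its text.
found = [(m.group(1), m.span()) for m in p.finditer(text)]

# sub() replaces every match; \1 in the replacement refers to group 1.
replaced = p.sub(r"\1.org", text)

print(found)
print(replaced)
```

Unlike findall(), finditer() hands back full match objects, so group(), start(), end() and span() all remain usable for every hit.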

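The reference list "Python 3 Regular Expression Special Symbols and Usage (Detailed List)" also contains some more special syntax, for example:

(?=...): positive lookahead assertion.

(?!...): negative lookahead assertion.

(?<=...): positive lookbehind assertion.

(?<!...): negative lookbehind assertion.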
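The four assertions check what surrounds a match without consuming it. The lecture leaves them as self-study, so the following is only a rough sketch on a made-up string, with one illustrative pattern per assertion.

```python
import re

text = "price: $100, cost: 200, tax: $30"  # hypothetical sample text

# (?<=...) positive lookbehind: digits directly preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# (?<!...) negative lookbehind: digits not preceded by "$" or another digit.
not_after_dollar = re.findall(r"(?<![\$\d])\d+", text)

# (?=...) positive lookahead: words directly followed by a colon.
labels = re.findall(r"\w+(?=:)", text)

# (?!...) negative lookahead: whole words not followed by a colon.
non_labels = re.findall(r"\b\w+\b(?!:)", text)

print(after_dollar, not_after_dollar, labels, non_labels)
```

In every case the "$" or ":" that was checked is left out of the returned text, because an assertion only tests a position.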

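These are all very useful, but there is a lot of material here, and if we spent all our time on regular expressions the sideshow would take over the main act. Our main topic is web crawlers, after all.

So do study these on your own: read more, learn more, practice more.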
