01 Regular expressions
Python Data Analysis Technology Stack, Chapter 3: 01 Regular expressions
A regular expression is a pattern containing both characters (like letters and digits) and metacharacters (like the * and $ symbols). Regular expressions can be used whenever we want to search, replace, or extract data with an identifiable pattern, for example, dates, postal codes, HTML tags, phone numbers, and so on. They can also be used to validate fields like passwords and email addresses, by ensuring that the input from the user is in the correct format.
Steps for solving problems with regular expressions
Support for regular expressions is provided by the re module in Python, which can be imported using the following statement:
import re
The re module is part of the Python standard library and ships with every Python installation (including Anaconda), so it does not need to be installed with pip.
Once the module is imported, follow these two steps.
Define and compile the regular expression: After the re module is imported, we define the regular expression and compile it. The search pattern is written as a string, usually with the prefix “r”. The “r” prefix, which stands for a raw string, tells the Python interpreter to treat backslashes literally rather than as the start of an escape sequence. The prefix is optional, but it is recommended whenever the pattern contains backslashes. The compile function takes the search pattern (“and” in this case) as its argument and compiles it into a pattern object, as follows.
search_pattern=re.compile(r'and')
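The effect of the “r” prefix is easiest to see by comparing string lengths. The following is a minimal sketch; the variable names other than search_pattern are chosen here for illustration and are not part of the original text.

```python
import re

# A normal string turns \n into one newline character; a raw string keeps
# the backslash and the letter n as two separate characters.
normal_length = len('\n')
raw_length = len(r'\n')

# For a pattern without backslashes, the prefix makes no difference.
same_pattern = (r'and' == 'and')

# re.compile returns a reusable pattern object that remembers its source text.
search_pattern = re.compile(r'and')
pattern_text = search_pattern.pattern

print(normal_length, raw_length)  # 1 2
print(same_pattern)               # True
print(pattern_text)               # and
```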
Locate the search pattern (regular expression) in your string: In the second step, we try to locate this pattern in the string to be searched using the search method. This method is called on the variable (search_pattern) we defined in the previous step.
search_pattern.search('Today and tomorrow')
A match object is returned since the search pattern (“and”) is found in the string (“Today and tomorrow”).
Shortcut (combining the two steps)
The preceding two steps can be combined into a single step, as shown in the following statement:
re.search('and','Today and tomorrow')
With this one line of code, we define, compile, and locate the search pattern in a single step.
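To see that both forms give the same result, and what the returned match object contains, the following sketch may help. The names text, match_compiled, match_direct, and no_match are illustrative only.

```python
import re

text = 'Today and tomorrow'

# Two-step form: compile first, then search.
search_pattern = re.compile(r'and')
match_compiled = search_pattern.search(text)

# One-step form: the module-level function compiles and searches.
match_direct = re.search(r'and', text)

# A match object reports what was matched and where.
print(match_compiled.group())  # and
print(match_compiled.span())   # (6, 9)
print(match_direct.span())     # (6, 9)

# When nothing matches, search returns None instead of a match object.
no_match = re.search(r'yesterday', text)
print(no_match)                # None
```

Compiling separately is mainly useful when the same pattern is applied to many strings; for a one-off search, the shortcut is usually enough.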
Python functions for regular expressions
We use regular expressions for matching, splitting, and replacing text, and there is a separate function for each of these tasks. Table 3-1 provides a list of all these functions, along with examples of their usage.
re.findall( ): Searches for all non-overlapping matches of the regular expression and returns them as a list.
re.findall('3','98371234')
re.search( ): Searches for a single match and returns a match object corresponding to the first match found in the string, or None if there is no match.
re.search('3','98371234')
re.match( ): Similar to re.search, except that it returns a match object only if the pattern matches at the beginning of the string.
re.match('3','98371234')
re.split( ): Splits the string at each location where the search pattern is found.
re.split('3','98371234')
re.sub( ): Replaces each occurrence of the search pattern with another string.
re.sub('3','three','98371234')
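The five calls above can be run side by side to compare what each one returns. This is a small sketch using the same pattern and string; the variable names are made up for the illustration.

```python
import re

digits = '98371234'

all_threes = re.findall('3', digits)     # every non-overlapping match
first_three = re.search('3', digits)     # first match anywhere in the string
at_start = re.match('3', digits)         # match only at the start of the string
pieces = re.split('3', digits)           # cut the string at each match
replaced = re.sub('3', 'three', digits)  # replace each match

print(all_threes)          # ['3', '3']
print(first_three.span())  # (2, 3)
print(at_start)            # None
print(pieces)              # ['98', '712', '4']
print(replaced)            # 98three712three4
```

Note that re.match returns None here because the string starts with “9”, not “3”.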
Metacharacters
Metacharacters are characters that have a special meaning in regular expressions. They are explained in the following sections, along with examples to demonstrate their usage.
Dot (.) metacharacter
This metacharacter matches any single character except a newline: a digit, a letter, a symbol, or even a literal dot.
In the following example, we try to match three-character strings that start with the two letters “ba” (the string to be searched is the second argument in the following code).
re.findall("ba.","bar bat bad ba. ban")
Note that one of the results shown in the output, “ba.”, is an instance where the . (dot) metacharacter has matched itself.
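The following sketch shows the output of the dot example and contrasts it with an escaped dot, which matches only a literal dot. The variable names are illustrative.

```python
import re

words = "bar bat bad ba. ban"

# An unescaped dot stands for any single character except a newline.
any_third_char = re.findall(r"ba.", words)

# An escaped dot matches only a literal dot.
literal_dot = re.findall(r"ba\.", words)

# By default the dot does not match a newline character.
across_newline = re.findall(r"ba.", "ba\n")

print(any_third_char)  # ['bar', 'bat', 'bad', 'ba.', 'ban']
print(literal_dot)     # ['ba.']
print(across_newline)  # []
```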
Square brackets ([]) as metacharacters
To match any one character among a set of characters, we use square brackets ([ ]). Within these square brackets, we define a set of characters, where one of these characters must match the characters in our text.
Let us understand this with an example. In the following example, we try to match all strings that consist of one of the characters ‘c’, ‘r’, ‘b’, ‘m’, ‘d’, ‘h’, or ‘w’ followed by the string “ash”.
regex=re.compile(r'[crbmdhw]ash')
regex.findall('cash rash bash mash dash hash wash crash ash')
Note that the string “ash” is not matched because it is not preceded by any character from the set. The word “crash” is not matched as a whole either, because only one character from the set may precede “ash”; however, findall still finds “rash” inside “crash”, so “rash” appears twice in the output.
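If only whole words should match, one option is to add the word-boundary marker \b on both sides of the pattern. \b is not covered in the text above, so treat this as a supplementary sketch rather than part of the original example.

```python
import re

text = 'cash rash bash mash dash hash wash crash ash'

# Without boundaries, "rash" is also found inside "crash".
loose = re.findall(r'[crbmdhw]ash', text)

# With \b on both sides, only whole four-letter words match.
whole_words = re.findall(r'\b[crbmdhw]ash\b', text)

print(loose)
# ['cash', 'rash', 'bash', 'mash', 'dash', 'hash', 'wash', 'rash']
print(whole_words)
# ['cash', 'rash', 'bash', 'mash', 'dash', 'hash', 'wash']
```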
Question mark (?) metacharacter
This metacharacter is used when you need to match at most one occurrence of a character. This means that the character we are looking for could be absent from the search string or occur just once. Consider the following example, where we try to match strings starting with the characters “Austr”, ending with the characters “ia”, and having zero or one occurrence of each of the following characters in between, in this order: “a”, “l”, “a”, “s”.
regex=re.compile(r'Austr[a]?[l]?[a]?[s]?ia')
regex.findall('Austria Australia Australasia Asia')
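The sketch below runs the example and also shows an equivalent pattern without the square brackets, since ? applies to a single preceding character just as well as to a one-character set. The names countries, bracketed, and plain are illustrative.

```python
import re

countries = 'Austria Australia Australasia Asia'

bracketed = re.compile(r'Austr[a]?[l]?[a]?[s]?ia')
plain = re.compile(r'Austra?l?a?s?ia')  # same meaning, without the brackets

bracketed_matches = bracketed.findall(countries)
plain_matches = plain.findall(countries)

print(bracketed_matches)  # ['Austria', 'Australia', 'Australasia']
print(plain_matches)      # ['Austria', 'Australia', 'Australasia']
```

“Asia” is not matched because it does not start with “Austr”.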
Asterisk (*) metacharacter
This metacharacter can match zero or more occurrences of a given search pattern. In other words, the search pattern may not occur at all in the string, or it can occur any number of times.
Let us understand this with an example, where we try to match all strings starting with “abc” and followed by zero or more occurrences of the digit “1”.
re.findall("abc[1]*","abc1 abc111 abc1 abc abc111111111111 abc01")
Note that here we have combined the compilation and the search of the regular expression into a single call.
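Running the asterisk example shows how “zero occurrences” plays out in practice: “abc” on its own matches, and “abc01” contributes only “abc” because the “0” stops the run of 1s. A short sketch with illustrative variable names:

```python
import re

text = "abc1 abc111 abc1 abc abc111111111111 abc01"

with_brackets = re.findall(r"abc[1]*", text)
without_brackets = re.findall(r"abc1*", text)  # * applies to the preceding "1"

print(with_brackets)
# ['abc1', 'abc111', 'abc1', 'abc', 'abc111111111111', 'abc']
print(with_brackets == without_brackets)  # True
```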
Backslash (\) metacharacter
The backslash symbol is used to indicate a character class, which is a predefined set of characters. In Table 3-2, the commonly used character classes are explained.
Table 3-2. Commonly used character classes
\d Matches a digit (0–9)
\D Matches any character that is not a digit
\w Matches a word character: a lowercase letter (a–z), an uppercase letter (A–Z), a digit (0–9), or an underscore (_)
\W Matches any character that is not a word character
\s Matches any whitespace character (space, tab, newline, and so on)
\S Matches any non-whitespace character
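The character classes in Table 3-2 can be tried out on a short sample string. The string and variable names below are invented for this sketch.

```python
import re

text = 'Room 42_b, floor 3!'

digits = re.findall(r'\d', text)             # each digit separately
non_digits = ''.join(re.findall(r'\D', text))
words = re.findall(r'\w+', text)             # runs of word characters
non_word = re.findall(r'\W', text)           # spaces and punctuation
spaces = re.findall(r'\s', text)
chunks = re.findall(r'\S+', text)            # runs of non-whitespace

print(digits)      # ['4', '2', '3']
print(non_digits)  # Room _b, floor !
print(words)       # ['Room', '42_b', 'floor', '3']
print(non_word)    # [' ', ',', ' ', ' ', '!']
print(spaces)      # [' ', ' ', ' ']
print(chunks)      # ['Room', '42_b,', 'floor', '3!']
```

Note that “42_b” stays in one piece under \w+ because the underscore counts as a word character.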
Another usage of the backslash symbol: Escaping metacharacters
As we have seen, in regular expressions, metacharacters like . and * have special meanings. If we want to use these characters in the literal sense, we need to “escape” them by prefixing them with a \ (backslash). For example, to search for the text W.H.O, we need to escape each . (dot) character to prevent it from being treated as a metacharacter.
regex=re.compile(r'W\.H\.O')
regex.search('W.H.O norms')
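The difference escaping makes shows up when the text contains something other than dots between the letters. In this sketch, the string 'WxHyO norms' and the variable names are made up for illustration; re.escape is a standard-library helper that adds the backslashes automatically.

```python
import re

escaped = re.compile(r'W\.H\.O')
unescaped = re.compile(r'W.H.O')

literal_hit = escaped.search('W.H.O norms')
literal_miss = escaped.search('WxHyO norms')
wildcard_hit = unescaped.search('WxHyO norms')

print(literal_hit.group())   # W.H.O
print(literal_miss)          # None
print(wildcard_hit.group())  # WxHyO

# re.escape builds the escaped form of a literal string.
auto_escaped = re.escape('W.H.O')
print(auto_escaped)          # W\.H\.O
```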
Plus (+) metacharacter
This metacharacter matches one or more occurrences of a search pattern. The following is an example where we try to match all strings that consist of at least one lowercase letter followed by “123”.
re.findall("[a-z]+123","a123 b123 123 ab123 xyz123")
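Comparing + with * on the same input makes the “at least one” requirement visible: the bare “123” is matched only by the * version. A short sketch with illustrative variable names:

```python
import re

text = "a123 b123 123 ab123 xyz123"

one_or_more = re.findall(r"[a-z]+123", text)
zero_or_more = re.findall(r"[a-z]*123", text)

print(one_or_more)   # ['a123', 'b123', 'ab123', 'xyz123']
print(zero_or_more)  # ['a123', 'b123', '123', 'ab123', 'xyz123']
```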
Curly braces {} as metacharacters
By placing a number, or a range of numbers, within curly braces, we can specify how many times the preceding search pattern must repeat.
In the following example, we find all the phone numbers in the format “xxx-xxx-xxxx” (three digits, followed by another set of three digits, and a final set of four digits, with the sets separated by a “-” sign).
regex=re.compile(r'[\d]{3}-[\d]{3}-[\d]{4}')
regex.findall('987-999-8888 99122222 911-911-9111')
Only the first and third numbers in the search string (987-999-8888, 911-911-9111) match the pattern. The \d metacharacter represents a digit.
If we do not know the exact number of repetitions but know the minimum and the maximum, we can give the lower and upper limits within the curly braces, separated by a comma. In the following example, we search for all runs of word characters that are at least six and at most ten characters long.
regex=re.compile(r'[\w]{6,10}')
regex.findall('abcd abcd1234,abc$$$$$,abcd12 abcdef')
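The two curly-brace examples can be run as follows. The sketch writes \d and \w without the surrounding square brackets, which has the same meaning; the variable names and the last test string are assumptions made for illustration.

```python
import re

phone = re.compile(r'\d{3}-\d{3}-\d{4}')
phone_numbers = phone.findall('987-999-8888 99122222 911-911-9111')
print(phone_numbers)
# ['987-999-8888', '911-911-9111']

word_run = re.compile(r'\w{6,10}')
runs = word_run.findall('abcd abcd1234,abc$$$$$,abcd12 abcdef')
print(runs)
# ['abcd1234', 'abcd12', 'abcdef']

# A run longer than the upper limit is not rejected: its first ten
# characters match, and the two left over are too few to match again.
long_run = word_run.findall('abcdefghijkl')
print(long_run)
# ['abcdefghij']
```

If runs longer than ten characters should be excluded entirely, the pattern needs boundaries around it, for example \b on both sides.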
Dollar ($) metacharacter
This metacharacter anchors the match to the end of the search string: the pattern before it must occur at the very end.
In the following example, we use this metacharacter to check if the search string ends with a digit.
re.search(r'[\d]$','aa*5')
Caret (^) metacharacter
The caret (^) metacharacter looks for a match at the beginning of the string.
In the following example, we check if the search string begins with a whitespace character.
re.search(r'^[\s]',' a bird')
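The two anchors can be checked together, each with one string that matches and one that does not. The non-matching strings and the variable names are invented for this sketch.

```python
import re

ends_with_digit = re.search(r'\d$', 'aa*5')
no_final_digit = re.search(r'\d$', 'aa*5b')
starts_with_space = re.search(r'^\s', ' a bird')
no_leading_space = re.search(r'^\s', 'a bird')

print(ends_with_digit.group())   # 5
print(no_final_digit)            # None
print(starts_with_space.span())  # (0, 1)
print(no_leading_space)          # None
```

Because search returns either a match object or None, the result can be used directly in an if statement to validate input.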