Python Trilogy (Part 2): How to Capture Variables with Regular Expressions

2018-10-02

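Using regular expressions

Python's regular expression library is re, and there are two ways to use it:

Direct matching: match
Precompiled matching: compile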

import re

string = '''<string name="TP40008091" title="CH100-SWW">The Move Away from Threads</string>'''

# Direct matching
match = re.match("<string *name=\"(?P<key>.*?)\".*>(?P<value>.*)</string>", string)
# Precompiled matching
match = re.compile("<string *name=\"(?P<key>.*?)\".*>(?P<value>.*)</string>").match(string)

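Note: the module-level re.match and the match method of the object returned by re.compile take different arguments. re.match takes the pattern first and then the string, while the compiled pattern's match takes only the string (plus optional pos and endpos).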
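
To make the difference in arguments concrete, here is a minimal sketch. The sample text and the variable names are made up for illustration; the signatures are those of the standard re module.

```python
import re

text = 'xx<string name="K1">hello</string>'
PATTERN = r'<string *name="(?P<key>.*?)".*>(?P<value>.*)</string>'

# Module-level function: re.match(pattern, string, flags=0)
direct = re.match(PATTERN, text, re.I)

# Compiled pattern: flags go to re.compile, and
# Pattern.match(string[, pos[, endpos]]) takes the string first
compiled = re.compile(PATTERN, re.I)
offset = compiled.match(text, 2)  # start matching at index 2

direct_found = direct is not None  # False: text does not start with "<string"
offset_key = offset.group("key")   # "K1"
```

Because match only matches at the starting position, the direct call finds nothing here, while the compiled pattern can skip the leading "xx" through pos.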

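Capturing variables

Capturing variables relies on named groups, which the re documentation describes as:

(?P<name>...) The substring matched by the group is accessible by name.

The returned match is a Match object (re.Match), a class that ships with the standard library.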

match = re.compile("<string *name=\"(?P<key>.*?)\".*>(?P<value>.*)</string>").match(string)
key = match.group("key")
value = match.group("value")

print(" Key : %s \n Value : %s" % (key, value))
#  Key : TP40008091
#  Value : The Move Away from Threads
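
One thing the snippet above glosses over: match returns None when the pattern does not match, so calling group on it would raise an AttributeError. A small sketch of a safer wrapper follows; parse_line is a hypothetical helper name, not something from the re module, and groupdict is the standard Match method that returns all named groups at once.

```python
import re

PATTERN = re.compile(r'<string *name="(?P<key>.*?)".*>(?P<value>.*)</string>')

def parse_line(line):
    """Return the named groups as a dict, or None when the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()

parsed = parse_line('<string name="TP40008091" title="CH100-SWW">The Move Away from Threads</string>')
missing = parse_line('<integer name="x">1</integer>')
```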

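Matching multiple lines

Everything above matches a single string. To match across multiple lines (for example, the contents of a file), use the flags parameter of the re functions whose signature includes flags=0.

Flag	Meaning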
A	ASCII
I	IGNORECASE
L	LOCALE
M	MULTILINE
S	DOTALL
X	VERBOSE
U	UNICODE

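The flag used here is re.M, and multiple flags can be combined with the bitwise OR operator (|). Note that re.M only changes ^ and $ so that they match at the start and end of each line; the pattern below has no anchors, so it is finditer that walks through every line.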


content = '''
<string name="TP40000001" title="CH101-SWW">The Move</string>
<string  name="TP40000002" title="CH102-SWW">The Move Away</string>
<string   name="TP40000003" title="CH103-SWW">The Move Away from Threads</string>
'''

matches = re.compile("<string *name=\"(?P<key>.*?)\".*>(?P<value>.*)</string>", re.M | re.U).finditer(content)
for match in matches:
    key = match.group("key")
    value = match.group("value")
    print(" Key : %s \n Value : %s" % (key, value))
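
To see what re.M actually changes, here is a small sketch with an anchored pattern. The two-line content and the variable names are assumptions made for the illustration; the flags are combined with | as described above.

```python
import re

content = '''<string name="A">one</string>
<STRING name="B">two</STRING>
'''

# Without re.M, ^ only matches at the very start of the whole string
without_m = re.findall(r'^<string name="(.*?)"', content, re.I)

# With re.M, ^ also matches right after every newline; | combines the flags
with_m = re.findall(r'^<string name="(.*?)"', content, re.I | re.M)
```

Here without_m contains only "A", while with_m contains both "A" and "B".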

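Greedy and non-greedy matching

Greedy and non-greedy matching differ mainly in their backtracking logic: a greedy quantifier matches as many characters as possible, and a non-greedy one matches as few as possible.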

string = '''<string name="TP40008091" title="CH100-SWW">The Move Away from Threads</string>'''

# Greedy
# Key : TP40008091" title="CH100-SWW
# Value : The Move Away from Threads
greed = re.match("<string *name=\"(?P<key>.*)\".*>(?P<value>.*)</string>", string)

# Non-greedy
# Key : TP40008091
# Value : The Move Away from Threads
simple = re.match("<string *name=\"(?P<key>.*?)\".*>(?P<value>.*)</string>", string)

The difference is the trailing question mark in the subexpressions (?P<key>.*) and (?P<key>.*?).
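
The same effect is easier to see on a shorter string. The sketch below uses a made-up sample text; the third pattern, a negated character class, is not from the original examples and is shown only as a common alternative to the non-greedy quantifier.

```python
import re

text = '"first" and "second"'

# Greedy: .* runs to the last quote and backtracks only as far as needed
greedy = re.search(r'"(.*)"', text).group(1)

# Non-greedy: .*? stops at the first quote that lets the match succeed
lazy = re.search(r'"(.*?)"', text).group(1)

# Alternative: forbid quotes inside the group instead of relying on laziness
negated = re.search(r'"([^"]*)"', text).group(1)
```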

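What does the r before a pattern mean?

Many answers online like to put a lowercase r in front of the pattern. In Python this stands for Raw; it is part of the string literal rules and has nothing to do with regular expression syntax.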

# Match with a regular string
match = re.match("<string *name=\"(?P<key>.*?)\".*>(?P<value>.*)</string>", string)

# Match with a raw string
match = re.match(r"<string *name=\"(?P<key>.*?)\".*>(?P<value>.*)</string>", string)

The difference is shown by this example from a StackOverflow answer:

print('\n') # newline character
print(r'\n') # the two literal characters \n
print('\b') # backspace character (not a space)
print(r'\b') # the two literal characters \b
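
This matters for regular expressions because \b also has a meaning to the regex engine (a word boundary), but only if the backslash survives Python's own string processing. A minimal sketch, with a made-up sample text:

```python
import re

text = "foo bar"

# r'\bbar\b' reaches the regex engine as \b, the word-boundary assertion
raw_hit = re.search(r'\bbar\b', text)

# '\bbar\b' is turned into backspace characters by Python before re sees it
plain_hit = re.search('\bbar\b', text)

# One character (backspace) versus two characters (backslash and b)
lengths = (len('\b'), len(r'\b'))
```

raw_hit finds "bar", while plain_hit is None because the text contains no backspace characters.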