Python Study Notes (2)

Python exceptions and regular expressions
http://docs.python.org/library/re.html
http://docs.python.org/howto/regex.html#regex-howto

Example 6.1. Opening a non-existent file
>>> fsock = open("/notthere", "r")
Traceback (innermost last):
File "<interactive input>", line 1, in ?
IOError: [Errno 2] No such file or directory: '/notthere'
>>> try:
...     fsock = open("/notthere")
... except IOError:
...     print "The file does not exist, exiting gracefully"
...
The file does not exist, exiting gracefully
>>> print "This line will always print"
This line will always print
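
For comparison, here is a minimal sketch of the same idea in Python 3 (an assumption; the transcript above is Python 2). In Python 3, print is a function and IOError is an alias of OSError. The helper name read_text_or_default is hypothetical and only serves the illustration.

```python
import os
import tempfile


def read_text_or_default(path, default=""):
    """Return the file's text, or `default` if the file cannot be opened."""
    try:
        fsock = open(path, "r")
    except OSError as exc:
        # FileNotFoundError is a subclass of OSError, so it is caught here.
        print("Could not open %s (%s), exiting gracefully" % (path, exc.strerror))
        return default
    else:
        # Only runs when open() succeeded.
        with fsock:
            return fsock.read()


# A path inside a fresh temporary directory is guaranteed not to exist.
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "notthere")
    content = read_text_or_default(missing, default="<no file>")

print("This line will always print")
```

The else clause keeps the "happy path" out of the try block, so only the open() call itself is guarded.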

# Bind the name getpass to the appropriate function
try:
      import termios, TERMIOS
except ImportError:
      try:
          import msvcrt
      except ImportError:
          try:
              from EasyDialogs import AskPassword
          except ImportError:
              getpass = default_getpass
          else:
              getpass = AskPassword
      else:
          getpass = win_getpass
else:
      getpass = unix_getpass

Example 6.10. Iterating through a dictionary
>>> import os
>>> for k, v in os.environ.items():
...     print "%s=%s" % (k, v)
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim

[...snip...]
>>> print "\n".join(["%s=%s" % (k, v)
... for k, v in os.environ.items()])
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM

Example 6.13. Using sys.modules
>>> import sys
>>> import fileinfo
>>> print '\n'.join(sys.modules.keys())
win32api
os.path
os
fileinfo
exceptions

>>> fileinfo
<module 'fileinfo' from 'fileinfo.pyc'>
>>> sys.modules["fileinfo"]
<module 'fileinfo' from 'fileinfo.pyc'>

The next example shows how to combine the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a class is defined.

Example 6.14. The __module__ class attribute
>>> from fileinfo import MP3FileInfo
>>> MP3FileInfo.__module__
'fileinfo'
>>> sys.modules[MP3FileInfo.__module__]
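<module 'fileinfo' from 'fileinfo.pyc'>

Every Python class has a built-in class attribute __module__, which holds the name of the module in which the class is defined.
Combining this with the sys.modules dictionary, you can get a reference to the module in which a class is defined.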
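
The fileinfo module is not available outside the book's examples, so the following sketch uses the standard library's json.JSONDecoder instead (a substitution made here, written for Python 3). The helper name defining_module is hypothetical.

```python
import json
import sys


def defining_module(cls):
    """Look up the module object in which `cls` was defined."""
    # cls.__module__ is a string; sys.modules maps module names to module objects.
    return sys.modules[cls.__module__]


module_name = json.JSONDecoder.__module__
module = defining_module(json.JSONDecoder)

print(module_name)
print(module.JSONDecoder is json.JSONDecoder)
```

Note that __module__ names the module where the class was actually written (json.decoder), not the package it was imported from (json).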

Example 6.16. Constructing pathnames
>>> import os
>>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.join("c:\\music\\ap", "mahadeva.mp3")
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.expanduser("~")
'c:\\Documents and Settings\\mpilgrim\\My Documents'
>>> os.path.join(os.path.expanduser("~"), "Python")
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'

Example 7.2. Matching whole words
>>> import re
>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
'100 BRD.'
>>> re.sub('\\bROAD$', 'RD.', s)
'100 BROAD'
>>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD'
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b', 'RD.', s)
'100 BROAD RD. APT. 3'

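What I really wanted was to match 'ROAD' when it is at the end of the string and is a whole word on its own, not part of some larger word. To express this in a regular expression, you use \b, which means "a word boundary must occur right here". In Python, this is complicated by the fact that the '\' character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it is a bug in syntax or a bug in your regular expression.
To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the letter r. This tells Python that nothing in this string should be escaped; '\t' is a tab character, but r'\t' is really the backslash character '\' followed by the letter 't'. I recommend always using raw strings when dealing with regular expressions; otherwise, things get too confusing too quickly (and regular expressions get confusing quickly enough all by themselves).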
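
The difference between ordinary and raw strings can be checked directly. The sketch below is written for Python 3 (an assumption; the transcripts in these notes are Python 2) and reuses the address from Example 7.2.

```python
import re

address = "100 BROAD ROAD APT. 3"

# In an ordinary string, '\t' is one character (a tab) and
# '\b' is one character (a backspace), not a word boundary.
tab_length = len("\t")        # 1
raw_tab_length = len(r"\t")   # 2: a backslash followed by 't'
backspace = "\b"

# Doubling the backslashes and using a raw string give the same pattern text.
same_pattern = "\\bROAD\\b" == r"\bROAD\b"

# With a raw string, \b reaches the regex engine as a word boundary,
# so only the standalone word ROAD is replaced.
result = re.sub(r"\bROAD\b", "RD.", address)
print(result)
```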

Example 7.4. Checking for hundreds
>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
>>> re.search(pattern, 'MCM')
<SRE_Match object at 01070390>
>>> re.search(pattern, 'MD')
<SRE_Match object at 01073A50>
>>> re.search(pattern, 'MMMCCC')
<SRE_Match object at 010748A8>
>>> re.search(pattern, 'MCMC')
>>> re.search(pattern, '')
<SRE_Match object at 01071D98>

Example 7.5. The old way: every character optional
>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EE090>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EEB48>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMMM')
>>>

Example 7.6. The new way: from n to m
>>> pattern = '^M{0,3}$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EEDA8>
>>> re.search(pattern, 'MMMM')
>>>

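The expression for the ones place follows the same pattern. I'll skip the details and show the end result.

>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
So what does this regular expression look like using the alternate {n,m} syntax? The next example shows the new syntax.

Example 7.8. Validating Roman numerals with {n,m}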
>>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMDCLXVI')
<_sre.SRE_Match object at 0x008EEB48>
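
As a rough sketch, the {n,m} pattern can be wrapped in a small validator. This is written for Python 3, and the function name is_roman_numeral is hypothetical. Because every part of the pattern is optional, the empty string also matches (as Example 7.4 shows), so the sketch rejects it explicitly.

```python
import re

# Same pattern as Example 7.8, compiled once for reuse.
ROMAN_PATTERN = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)


def is_roman_numeral(text):
    """Return True if `text` is a non-empty, well-formed Roman numeral."""
    # bool(text) filters out the empty string, which the pattern alone accepts.
    return bool(text) and ROMAN_PATTERN.search(text) is not None


for candidate in ["MDLV", "MMDCLXVI", "MMMM", "MCMC", "IIII", ""]:
    print(repr(candidate), is_roman_numeral(candidate))
```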

Example 7.9. Regular expressions with inline comments
>>> pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """
>>> re.search(pattern, 'M', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
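>>> re.search(pattern, 'M')
>>>

The most important thing to remember when using verbose regular expressions is that you must pass an extra argument, re.VERBOSE. It is a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern contains quite a bit of whitespace (all of which is ignored) and several comments (all of which are ignored). Once you ignore the whitespace and the comments, it is exactly the same regular expression as in the previous section, but far more readable.

The last call, re.search(pattern, 'M') without the flag, does not match. Why? Because it lacks the re.VERBOSE flag, so re.search treats the pattern as a compact regular expression. Python cannot auto-detect whether a regular expression is verbose or compact. It assumes every regular expression is compact unless you explicitly state that it is verbose.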
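
The effect of the flag can be seen side by side. The sketch below (Python 3, with shortened comments inside the pattern) runs the same verbose pattern with and without re.VERBOSE and compares it against the compact form.

```python
import re

verbose_pattern = r"""
    ^                   # beginning of string
    M{0,3}              # thousands
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of string
"""
compact_pattern = r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"

# With the flag, whitespace and comments are stripped before matching.
with_flag = re.search(verbose_pattern, "MCMLXXXIX", re.VERBOSE)

# Without the flag, the newlines, spaces, and '#' text are matched literally,
# so the pattern cannot match a plain Roman numeral.
without_flag = re.search(verbose_pattern, "MCMLXXXIX")

compact = re.search(compact_pattern, "MCMLXXXIX")

print(with_flag.groups())
print(without_flag)
```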

Example 7.16. Parsing phone numbers (final version)
>>> phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
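
Calling .groups() directly on the result fails when nothing matches, because search then returns None. A minimal Python 3 sketch that guards against this follows; the function name parse_phone is hypothetical.

```python
import re

# Same pattern as Example 7.16.
PHONE_PATTERN = re.compile(r"""
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    """, re.VERBOSE)


def parse_phone(text):
    """Return (area, trunk, rest, extension), or None if nothing matches."""
    match = PHONE_PATTERN.search(text)
    if match is None:
        return None
    return match.groups()


print(parse_phone("work 1-(800) 555.1212 #1234"))
print(parse_phone("800-555-1212"))
print(parse_phone("555-1212"))  # no area code, so no match
```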

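You should now be familiar with the following techniques:

^ matches the beginning of a string.
$ matches the end of a string.
\b matches a word boundary.
\d matches any numeric digit.
\D matches any non-numeric character.
x? matches an optional x character (in other words, it matches x zero or one times).
x* matches x zero or more times.
x+ matches x one or more times.
x{n,m} matches x at least n times, but not more than m times.
(a|b|c) matches either a or b or c.
(x) in general is a remembered group. You can get the value of what matched by calling the groups() method of the object returned by re.search.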
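
Each item in the list above can be spot-checked with a tiny pattern. The sample strings below are made up for illustration, and the code is written for Python 3.

```python
import re

# (pattern, text, expected_to_match)
EXAMPLES = [
    (r"^ab", "abc", True),               # ^ anchors at the beginning
    (r"^bc", "abc", False),
    (r"bc$", "abc", True),               # $ anchors at the end
    (r"\bcat\b", "a cat sat", True),     # \b requires a word boundary
    (r"\bcat\b", "concatenate", False),
    (r"^\d\D$", "7x", True),             # one digit, then one non-digit
    (r"^colou?r$", "color", True),       # x? is zero or one
    (r"^ab*$", "a", True),               # x* is zero or more
    (r"^ab+$", "a", False),              # x+ is one or more
    (r"^a{2,3}$", "aaaa", False),        # x{n,m} is between n and m
    (r"^(a|b|c)$", "b", True),           # alternation
]

results = [
    (re.search(pattern, text) is not None) == expected
    for pattern, text, expected in EXAMPLES
]
print(all(results))

# (x) remembers what it matched; groups() returns the remembered values.
groups = re.search(r"(\d+)-(\d+)", "pages 10-20").groups()
print(groups)
```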

http://www.woodpecker.org.cn/diveintopython/regular_expressions/phone_numbers.html

Regular expression pattern syntax

Element       Meaning
.             Matches any character except \n (if DOTALL, also matches \n)
^             Matches start of string (if MULTILINE, also matches after \n)
$             Matches end of string (if MULTILINE, also matches before \n)
*             Matches zero or more cases of the previous regular expression; greedy (match as many as possible)
+             Matches one or more cases of the previous regular expression; greedy (match as many as possible)
?             Matches zero or one case of the previous regular expression; greedy (match one if possible)
*?, +?, ??    Non-greedy versions of *, +, and ? (match as few as possible)
{m,n}         Matches m to n cases of the previous regular expression (greedy)
{m,n}?        Matches m to n cases of the previous regular expression (non-greedy)
[...]         Matches any one of a set of characters contained within the brackets
|             Matches expression either preceding it or following it
(...)         Matches the regular expression within the parentheses and also indicates a group
(?iLmsux)     Alternate way to set optional flags; no effect on match
(?:...)       Like (...), but does not indicate a group
(?P<id>...)   Like (...), but the group also gets the name id
(?P=id)       Matches whatever was previously matched by group named id
(?#...)       Content of parentheses is just a comment; no effect on match
(?=...)       Lookahead assertion; matches if regular expression ... matches what comes next, but does not consume any part of the string
(?!...)       Negative lookahead assertion; matches if regular expression ... does not match what comes next, and does not consume any part of the string
(?<=...)      Lookbehind assertion; matches if there is a match for regular expression ... ending at the current position (... must match a fixed length)
(?<!...)      Negative lookbehind assertion; matches if there is no match for regular expression ... ending at the current position (... must match a fixed length)
\number       Matches whatever was previously matched by group numbered number (groups are automatically numbered from 1 up to 99)
\A            Matches an empty string, but only at the start of the whole string
\b            Matches an empty string, but only at the start or end of a word (a maximal sequence of alphanumeric characters; see also \w)
\B            Matches an empty string, but not at the start or end of a word
\d            Matches one digit, like the set [0-9]
\D            Matches one non-digit, like the set [^0-9]
\s            Matches a whitespace character, like the set [ \t\n\r\f\v]
\S            Matches a non-white character, like the set [^ \t\n\r\f\v]
\w            Matches one alphanumeric character; unless LOCALE or UNICODE is set, \w is like [a-zA-Z0-9_]
\W            Matches one non-alphanumeric character, the reverse of \w
\Z            Matches an empty string, but only at the end of the whole string
\\            Matches one backslash character


Posted on 2009-08-22 23:48 by Frank_Fang. Category: Python Study
