Python: exceptions and regular expressions
http://docs.python.org/library/re.html
http://docs.python.org/howto/regex.html#regex-howto
Example 6.1. Opening a non-existent file
>>> fsock = open("/notthere", "r")      
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
IOError: [Errno 2] No such file or directory: '/notthere'
>>> try:
...     fsock = open("/notthere")
... except IOError:
...     print "The file does not exist, exiting gracefully"
...
The file does not exist, exiting gracefully
>>> print "This line will always print"
This line will always print
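A minimal sketch of the same idea in Python 3 syntax, where print is a function and a missing file raises FileNotFoundError (a subclass of OSError, of which IOError is now an alias). The helper name read_or_none is hypothetical and not part of the original example.

```python
def read_or_none(path):
    """Return the file's text, or None if the file does not exist."""
    try:
        fsock = open(path)
    except FileNotFoundError:
        # Only the "no such file" case is handled; other errors propagate.
        print("The file does not exist, exiting gracefully")
        return None
    else:
        # Runs only when open() succeeded.
        with fsock:
            return fsock.read()


result = read_or_none("/notthere")
print("This line will always print")
```

The else clause keeps the code that depends on a successful open() out of the try block, so an unrelated error raised while reading is not mistaken for a missing file.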
# Bind the name getpass to the appropriate function
try:
    import termios, TERMIOS
except ImportError:
    try:
        import msvcrt
    except ImportError:
        try:
            from EasyDialogs import AskPassword
        except ImportError:
            getpass = default_getpass
        else:
            getpass = AskPassword
    else:
        getpass = win_getpass
else:
    getpass = unix_getpass
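The nested try...except...else chain can also be written as a loop over candidate module names. This is only a sketch in Python 3, using importlib.import_module from the standard library; the helper name first_available is an assumption made for illustration and is not how the getpass module itself is written.

```python
import importlib


def first_available(*module_names):
    """Return the first module in module_names that can be imported, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not available on this platform, try the next one
    return None


# termios exists on Unix-like systems, msvcrt on Windows.
console = first_available("termios", "msvcrt")
print(console.__name__ if console else "no platform-specific module found")
```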
 
Example 6.10. Iterating through a dictionary
>>> import os
>>> for k, v in os.environ.items():       
...     print "%s=%s" % (k, v)
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim
[...snip...]
>>> print "\n".join(["%s=%s" % (k, v)
...     for k, v in os.environ.items()]) 
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
 
Example 6.13. Using sys.modules
>>> import sys
>>> import fileinfo
>>> print '\n'.join(sys.modules.keys())
win32api
os.path
os
fileinfo
exceptions
>>> fileinfo
<module 'fileinfo' from 'fileinfo.pyc'>
>>> sys.modules["fileinfo"] 
<module 'fileinfo' from 'fileinfo.pyc'>
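The next example shows how to combine the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a known class is defined.
Example 6.14. The __module__ class attribute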
>>> from fileinfo import MP3FileInfo
>>> MP3FileInfo.__module__              
'fileinfo'
>>> sys.modules[MP3FileInfo.__module__] 
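<module 'fileinfo' from 'fileinfo.pyc'>
Every Python class has a built-in class attribute __module__, which holds the name of the module in which the class is defined.
Combining it with the sys.modules dictionary gives you a reference to the module in which a class is defined.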
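The fileinfo module belongs to the book's sample code, so here is a sketch of the same lookup in Python 3 using a standard library class instead. The choice of fractions.Fraction is arbitrary and only for illustration.

```python
import sys
from fractions import Fraction

module_name = Fraction.__module__    # name of the defining module, as a string
module = sys.modules[module_name]    # the module object itself

print(module_name)
print(module.Fraction is Fraction)
```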
 
Example 6.16. Constructing pathnames
>>> import os
>>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")  
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.join("c:\\music\\ap", "mahadeva.mp3")   
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.expanduser("~")                         
'c:\\Documents and Settings\\mpilgrim\\My Documents'
>>> os.path.join(os.path.expanduser("~"), "Python") 
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'
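The results of os.path.join depend on the platform the code runs on. As a sketch for trying the Windows behavior anywhere, the standard library modules ntpath and posixpath expose the Windows and POSIX rules directly (os.path is one of the two, depending on the platform).

```python
import ntpath      # Windows path rules, importable on any platform
import posixpath   # POSIX path rules, importable on any platform

# A trailing separator on the first part makes no difference.
a = ntpath.join("c:\\music\\ap\\", "mahadeva.mp3")
b = ntpath.join("c:\\music\\ap", "mahadeva.mp3")
print(a == b)

print(posixpath.join("/music/ap", "mahadeva.mp3"))
```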
 
Example 7.2. Matching whole words
>>> import re
>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
'100 BRD.'
>>> re.sub('\\bROAD$', 'RD.', s)  
'100 BROAD'
>>> re.sub(r'\bROAD$', 'RD.', s)  
'100 BROAD'
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s)  
'100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b', 'RD.', s) 
'100 BROAD RD. APT. 3'
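What I really want is to match 'ROAD' only when it is at the end of the string and is a whole word of its own, not part of some longer word. To express this in a regular expression, you use \b, which means "a word boundary must occur right here". In Python this is complicated by the fact that the '\' character in a string must itself be escaped. This problem is sometimes called the backslash plague, and it is one reason regular expressions are somewhat easier in Perl than in Python. On the other hand, Perl mixes regular expressions with other syntax, so when you hit a bug it can be hard to tell whether it is a syntax error or an error in the regular expression.
To avoid the backslash plague, you can use a so-called raw string by prefixing the string with the letter r. This tells Python that nothing in the string is to be escaped: '\t' is a tab character, but r'\t' really is the backslash character '\' followed by the letter 't'. I recommend always using raw strings when dealing with regular expressions; otherwise things get confusing very quickly (and regular expressions get confusing quickly enough all by themselves).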
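The difference between the escaped and raw spellings can be checked directly. A small sketch in Python 3, with input strings chosen here for illustration:

```python
import re

# '\t' is one character (a tab); r'\t' is two: a backslash and the letter t.
print(len('\t'), len(r'\t'))

# These two spellings produce the same pattern string.
escaped = '\\bROAD$'
raw = r'\bROAD$'
print(escaped == raw)

# Without the r prefix (and without doubling the backslash), '\b' is a
# backspace character, so the pattern looks for a literal backspace
# instead of a word boundary and nothing is replaced.
unchanged = re.sub('\bROAD$', 'RD.', '100 BROAD ROAD')
replaced = re.sub(raw, 'RD.', '100 BROAD ROAD')
print(unchanged)
print(replaced)
```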
 
Example 7.4. Checking for hundreds
>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$' 
>>> re.search(pattern, 'MCM')            
<SRE_Match object at 01070390>
>>> re.search(pattern, 'MD')             
<SRE_Match object at 01073A50>
>>> re.search(pattern, 'MMMCCC')         
<SRE_Match object at 010748A8>
>>> re.search(pattern, 'MCMC')           
>>> re.search(pattern, '')               
<SRE_Match object at 01071D98>
 
Example 7.5. The old way: every character optional
>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M')    
<_sre.SRE_Match object at 0x008EE090>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MM')   
<_sre.SRE_Match object at 0x008EEB48>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MMM')  
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMMM') 
>>> 
Example 7.6. The new way: from n to m
>>> pattern = '^M{0,3}$'       
>>> re.search(pattern, 'M')    
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MM')   
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMM')  
<_sre.SRE_Match object at 0x008EEDA8>
>>> re.search(pattern, 'MMMM') 
>>> 
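A quick way to convince yourself that the two spellings accept the same strings is to run both over a range of inputs. This is a sketch in Python 3; the variable names old and new are chosen here for illustration.

```python
import re

old = re.compile(r'^M?M?M?$')    # every character optional
new = re.compile(r'^M{0,3}$')    # between 0 and 3 repetitions

for count in range(6):
    s = 'M' * count
    print(repr(s), bool(old.search(s)), bool(new.search(s)))
```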
The expressions for the tens and ones places follow the same pattern, so I will skip the details and show the end result.
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
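What does this regular expression look like in the alternative {n,m} syntax? This example shows the new syntax.
Example 7.8. Validating Roman numerals with the {n,m} syntax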
>>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')             
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMDCLXVI')         
<_sre.SRE_Match object at 0x008EEB48>
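Example 7.4 shows that this kind of pattern also matches the empty string, so a validator built on it needs one extra check. A minimal sketch in Python 3; the function name is_roman and the test inputs are assumptions made for illustration.

```python
import re

ROMAN = re.compile(r'^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')


def is_roman(s):
    """True if s is non-empty and matches the Roman numeral pattern."""
    # The pattern alone also matches '', so reject the empty string explicitly.
    return bool(s) and ROMAN.search(s) is not None


for candidate in ('MDLV', 'MMDCLXVI', 'MCMC', 'IIII', ''):
    print(repr(candidate), is_roman(candidate))
```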
Example 7.9. Regular expressions with inline comments
>>> pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """
>>> re.search(pattern, 'M', re.VERBOSE)                
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)        
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE)  
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'M')                            
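The most important thing to remember when using verbose regular expressions is that you must pass an extra argument: re.VERBOSE, a constant defined in the re module that marks the pattern as a verbose regular expression. As you can see, this pattern contains a lot of whitespace (all of which is ignored) and several comments (all of which are ignored as well). Once the whitespace and comments are ignored, it is exactly the same regular expression as in the previous section, but far more readable.
>>> re.search(pattern, 'M')
This one does not match. Why? Because without the re.VERBOSE flag, re.search treats the pattern as a compact regular expression, in which the whitespace and the hash marks are significant. Python cannot detect automatically whether a regular expression is verbose or compact. It assumes every regular expression is compact unless you explicitly say that it is verbose.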
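The effect of the flag can be seen side by side. A sketch in Python 3 with a shortened pattern; the (?x) inline flag used at the end is the re module's in-pattern spelling of re.VERBOSE.

```python
import re

verbose_pattern = r"""
    ^           # beginning of string
    M{0,3}      # thousands - 0 to 3 M's
    $           # end of string
"""

with_flag = re.search(verbose_pattern, 'MM', re.VERBOSE)
without_flag = re.search(verbose_pattern, 'MM')
print(with_flag is not None)
print(without_flag is not None)

# The inline flag (?x) at the very start of the pattern has the same effect
# as passing re.VERBOSE.
inline = re.search("(?x)" + verbose_pattern, 'MM')
print(inline is not None)
```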
 
Example 7.16. Parsing phone numbers (final version)
>>> phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()        
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
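Calling .groups() directly on the result of search() raises AttributeError when nothing matches, because search() then returns None. A sketch of a small wrapper in Python 3 that handles this case; the name parse_phone is hypothetical and not part of the original example.

```python
import re

PHONE = re.compile(r'''
    (\d{3})     # area code
    \D*         # optional separator
    (\d{3})     # trunk
    \D*         # optional separator
    (\d{4})     # rest of number
    \D*         # optional separator
    (\d*)       # optional extension
    $           # end of string
    ''', re.VERBOSE)


def parse_phone(text):
    """Return (area, trunk, rest, extension), or None if text does not match."""
    match = PHONE.search(text)
    return match.groups() if match else None


print(parse_phone('work 1-(800) 555.1212 #1234'))
print(parse_phone('800-555-1212'))
print(parse_phone('no digits here'))
```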
 
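You should now be familiar with the following techniques:
^ matches the beginning of a string.
$ matches the end of a string.
\b matches a word boundary.
\d matches any digit.
\D matches any non-digit character.
x? matches an optional x character (in other words, it matches x zero or one times).
x* matches x zero or more times.
x+ matches x one or more times.
x{n,m} matches x at least n times, but not more than m times.
(a|b|c) matches either a, or b, or c.
(x) in general is a remembered group. You can get the value it matched with the groups() method of the object returned by re.search.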
http://www.woodpecker.org.cn/diveintopython/regular_expressions/phone_numbers.html
Regular expression pattern syntax

| . | Matches any character except \n (if DOTALL, also matches \n) |
| ^ | Matches start of string (if MULTILINE, also matches after \n) |
| $ | Matches end of string (if MULTILINE, also matches before \n) |
| * | Matches zero or more cases of the previous regular expression; greedy (match as many as possible) |
| + | Matches one or more cases of the previous regular expression; greedy (match as many as possible) |
| ? | Matches zero or one case of the previous regular expression; greedy (match one if possible) |
| *?, +?, ?? | Non-greedy versions of *, +, and ? (match as few as possible) |
| {m,n} | Matches m to n cases of the previous regular expression (greedy) |
| {m,n}? | Matches m to n cases of the previous regular expression (non-greedy) |
| [...] | Matches any one of a set of characters contained within the brackets |
| | | Matches expression either preceding it or following it |
| (...) | Matches the regular expression within the parentheses and also indicates a group |
| (?iLmsux) | Alternate way to set optional flags; no effect on match |
| (?:...) | Like (...), but does not indicate a group |
| (?P<id>...) | Like (...), but the group also gets the name id |
| (?P=id) | Matches whatever was previously matched by group named id |
| (?#...) | Content of parentheses is just a comment; no effect on match |
| (?=...) | Lookahead assertion; matches if regular expression ... matches what comes next, but does not consume any part of the string |
| (?!...) | Negative lookahead assertion; matches if regular expression ... does not match what comes next, and does not consume any part of the string |
| (?<=...) | Lookbehind assertion; matches if there is a match for regular expression ... ending at the current position (... must match a fixed length) |
| (?<!...) | Negative lookbehind assertion; matches if there is no match for regular expression ... ending at the current position (... must match a fixed length) |
| \number | Matches whatever was previously matched by group numbered number (groups are automatically numbered from 1 up to 99) |
| \A | Matches an empty string, but only at the start of the whole string |
| \b | Matches an empty string, but only at the start or end of a word (a maximal sequence of alphanumeric characters; see also \w) |
| \B | Matches an empty string, but not at the start or end of a word |
| \d | Matches one digit, like the set [0-9] |
| \D | Matches one non-digit, like the set [^0-9] |
| \s | Matches a whitespace character, like the set [ \t\n\r\f\v] |
| \S | Matches a non-white character, like the set [^ \t\n\r\f\v] |
| \w | Matches one alphanumeric character; unless LOCALE or UNICODE is set, \w is like [a-zA-Z0-9_] |
| \W | Matches one non-alphanumeric character, the reverse of \w |
| \Z | Matches an empty string, but only at the end of the whole string |
| \\ | Matches one backslash character |
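A few of the less familiar entries in the table, tried out in a short Python 3 sketch. The input strings and variable names are made up for illustration.

```python
import re

# Greedy vs. non-greedy: .* takes as much as possible, .*? as little as possible.
html = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.*>', html)
lazy = re.findall(r'<.*?>', html)
print(greedy)
print(lazy)

# Named group plus a backreference to it: find a doubled word.
doubled = re.search(r'\b(?P<word>\w+)\s+(?P=word)\b', 'it is is fine')
print(doubled.group('word'))

# Lookahead: match digits only when ' USD' follows, without consuming it.
amounts = re.findall(r'\d+(?= USD)', '10 USD, 20 EUR, 30 USD')
print(amounts)
```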
        
    
posted on 2009-08-22 23:48 by Frank_Fang. Category: Learning Python