Python in Practice: Strings and Text Processing

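Preface

This post collects my reading notes on the Python Cookbook.
Topics covered:

Splitting strings on multiple delimiters
Matching text at the start or end of a string, and matching strings with shell wildcards
Matching, searching and replacing text (including case-insensitively), and shortest-match patterns
Normalizing Unicode text and using Unicode in regular expressions
Joining and concatenating strings, interpolating variables in strings, and stripping unwanted characters
Reformatting text to a fixed column width, and handling HTML and XML in strings
Text operations on byte strings

If I have misunderstood anything, corrections are welcome.

"At dusk you sit under the eaves and watch the sky slowly darken, lonely and desolate at heart, feeling that your life has been taken from you. I was a young man then, but I was afraid of living on like this, of growing old like this. To me, that was more frightening than death." (Wang Xiaobo)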

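Strings and Text Processing

Splitting Strings on Any of Multiple Delimiters

"You need to split a string into fields, but the delimiters (and the whitespace around them) are not consistent throughout the string."

The split() method of string objects is meant for very simple cases only; it does not allow multiple delimiters or account for a variable amount of whitespace around them. When you need more flexibility, use re.split():

>>> line = 'asdf fjdk; afed, fjek,asdf, foo'
>>> import re
>>> re.split(r'[;,\s]\s*', line)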
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
>>>

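re.split() lets you specify multiple patterns for the separator. Here the separator is a comma, a semicolon, or whitespace, followed by any amount of extra whitespace. Wherever the pattern is found, the text on either side of the match becomes an element of the result. The result is a list of fields.

When using re.split(), pay attention to whether the pattern contains a capture group in parentheses. If it does, the matched delimiter text is also included in the result list:

>>> re.split(r'(;|,|\s)\s*', line)
['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
>>>

Getting the delimiters can be useful in some contexts. For example, you may want to keep them so that you can rebuild an output string later:

>>> fields = re.split(r'(;|,|\s)\s*', line)
>>> values = fields[::2]
>>> delimiters = fields[1::2] + ['']
>>> values
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
>>> delimiters
[' ', ';', ',', ',', ',', '']
>>> ''.join(v+d for v,d in zip(values, delimiters))
'asdf fjdk;afed,fjek,asdf,foo'
>>>

If you do not want the delimiters in the result but still need parentheses to group parts of the pattern, make sure the group is non-capturing, written as (?:...).

>>> re.split(r'(?:,|;|\s)\s*', line)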
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
>>>
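As a small sketch of how this could be packaged for reuse, the helper below builds a non-capturing pattern from a list of delimiters. The name split_fields and its parameters are hypothetical (they are not from the book), and the sketch assumes that every delimiter is a literal string and that empty fields should be dropped.

```python
import re

def split_fields(line, delimiters=(';', ',')):
    """Split line on any of the literal delimiters or on whitespace."""
    # re.escape() keeps characters such as '|' or '.' from being read as regex syntax.
    alternatives = '|'.join(re.escape(d) for d in delimiters)
    # Non-capturing group: the delimiters themselves are not returned.
    pattern = r'\s*(?:{}|\s)\s*'.format(alternatives)
    # Drop empty fields produced by leading, trailing, or doubled delimiters.
    return [field for field in re.split(pattern, line.strip()) if field]

print(split_fields('asdf fjdk; afed, fjek,asdf, foo'))
```

Escaping the delimiters matters as soon as one of them is a regex metacharacter, for example split_fields('a | b|c', delimiters=('|',)).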

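Matching Text at the Start or End of a String

"You need to check the start or end of a string for specific text patterns, such as filename extensions, URL schemes, and so on."

A simple way to check the beginning or end of a string is str.startswith() or str.endswith(). For example: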

>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False
>>> url = 'http://www.python.org'
>>> url.startswith('http:')
True
>>>

To check against several choices at once, put them in a tuple and pass it to startswith() or endswith():

>>> import os
>>> filenames = os.listdir('.')
>>> filenames
['.bash_logout', '.bash_profile', '.cshrc', '.tcshrc', 'anaconda-ks.cfg', 'scp_script.py', 'uagtodata', '.bash_history', 'one-client-install.sh', 'calico.yaml', 'docker', '.mysql_history', 'UagAAA', 'Uag.tar', 'liruilong.snap1', '.python_history', '.cache', 'translateDemo', 'soft', 'jenkins.docker.sh', 'o3J6.txt', 'bak_shell', 'liruilong', 'index.html', 'load_balancing', 'redis-2.10.3.tar.gz', 'redis-2.10.3', '.kube', 'kc1', 'pod-demo.yaml', 'web-liruilong.yaml', 'shell.sh', '.config', 'nohup.out', '.viminfo', '.pki', 'kubectl.1', 'temp', 'go', '.vim', '111.txt', 'uagtodata.tar', 'set.sh', '.Xauthority', 'calico_3_14.tar', '.ssh', '.bashrc', 'db', '.docker', 'UagAAA.tar', 'Uag.war', 'Uag', 'txt.sh', '.lesshst', 'gitlab.docker.sh', 'kubectl', 'rsync', 'percona-toolkit-3.0.13-1.el7.x86_64.rpm', 'redisclear.py', 'Fetch']
>>> [name for name in filenames if name.endswith(('.yaml', '.sh')) ]
['one-client-install.sh', 'calico.yaml', 'jenkins.docker.sh', 'pod-demo.yaml', 'web-liruilong.yaml', 'shell.sh', 'set.sh', 'txt.sh', 'gitlab.docker.sh']
>>> any(name.endswith('.py') for name in filenames)
True
>>>

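The argument must be a tuple (or a single string). If your choices happen to be in a list or set, convert them with tuple() before passing them in; otherwise a TypeError is raised.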
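A minimal sketch of that conversion follows; the suffixes and names lists are made-up sample data.

```python
suffixes = ['.yaml', '.sh']          # hypothetical choices held in a list
names = ['calico.yaml', 'shell.sh', 'index.html']

try:
    names[0].endswith(suffixes)      # passing the list directly is rejected
    rejected = False
except TypeError:
    rejected = True

# Converting to a tuple first makes the call valid.
matches = [name for name in names if name.endswith(tuple(suffixes))]
print(rejected, matches)
```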

Similar checks can be done with slices, but the code is far less elegant:

>>> filename = 'spam.txt'
>>> filename[-4:] == '.txt'
True
>>> url = 'http://www.python.org'
>>> url[:5] == 'http:' or url[:6] == 'https:' or url[:4] == 'ftp:'
True
>>>

Regular expressions are another option:

>>> import re
>>> url = 'http://www.python.org'
>>> re.match('http:|https:|ftp:', url)
<_sre.SRE_Match object; span=(0, 5), match='http:'>
>>>

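Matching Strings with Shell Wildcard Patterns

"You want to match text using the wildcard patterns commonly used in Unix shells (for example *.py, Dat[0-9]*.csv, and so on)."

The fnmatch module provides the fnmatch() and fnmatchcase() functions for this: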

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
>>> names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
>>> [name for name in names if fnmatch(name, 'Dat*.csv')]
['Dat1.csv', 'Dat2.csv']
>>>

fnmatch() uses the case-sensitivity rules of the underlying operating system (which differ between systems) when matching patterns:

# Windows 10
>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.TXT')
True
>>>
# Linux
>>> fnmatch('foo.txt', '*.TXT')
False
>>>

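If this distinction matters to you, use fnmatchcase() instead. It matches exactly according to the case of the pattern you supply. The session below was run on Windows: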

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.TXT')
True
>>> fnmatchcase('foo.txt', '*.TXT')
False
>>>

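The matching ability of fnmatch() sits somewhere between simple string methods and the full power of regular expressions. It can also be used on strings that are not filenames: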

>>> from fnmatch import fnmatchcase
>>> addresses = [
... '5412 N CLARK ST',
... '1060 W ADDISON ST',
... '1039 W GRANVILLE AVE',
... '2122 N CLARK ST',
... '4802 N BROADWAY',
... ]
>>> [addr for addr in addresses if fnmatchcase(addr, '* ST')]
['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']
>>> [addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]
['5412 N CLARK ST']
>>>
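When the same wildcard pattern is applied many times, one possible approach is to convert it to a regular expression once with fnmatch.translate() and reuse the compiled object. This is only a sketch; the sample names are made up, and the resulting match is always case-sensitive because it is an ordinary regular expression.

```python
import fnmatch
import re

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# translate() returns the regular expression equivalent to the shell pattern.
regex = re.compile(fnmatch.translate('Dat[0-9]*.csv'))

csv_names = [name for name in names if regex.match(name)]
print(csv_names)
```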

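Matching and Searching for Text Patterns

"You want to match or search text for a specific pattern."

If the text you want to match is a simple literal, basic string methods such as str.find(), str.endswith(), str.startswith() and similar are usually enough: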

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> text.startswith('yeah')
True
>>> text.endswith('no')
False
>>> text.find('no')
10
>>>

For more complicated matching, use regular expressions and the re module:

>>> text1 = '11/27/2012'
>>> import re
>>> if re.match(r'\d+/\d+/\d+', text1):
...     print('yes')
... else:
...     print('no')
...
yes

If you are going to perform many matches with the same pattern, precompile the pattern string into a pattern object first:

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> if datepat.match(text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>>

match() always tries to match at the start of the string. To find every occurrence of the pattern anywhere in the string, use findall() instead:

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
['11/27/2012', '3/13/2013']
>>>

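When defining regular expressions, it is common to introduce capture groups by enclosing parts of the pattern in parentheses, so that the contents of each group can be extracted individually:

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')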
>>> m = datepat.match('11/27/2012')
>>> m
<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>
>>> m.group(0)
'11/27/2012'
>>> m.group(1)
'11'
>>> m.group(3)
'2012'
>>> m.groups()
('11', '27', '2012')

findall() searches the text and returns all matches as a list. To get the matches one at a time instead, use finditer():

>>> [m.groups()  for m in datepat.finditer(text)]
[('11', '27', '2012'), ('3', '13', '2013')]
>>>
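One detail worth illustrating: because match() only anchors at the start of the string, it can accept input with trailing junk. A short sketch of the difference, using fullmatch() (available since Python 3.4) for an exact match; the sample strings are made up:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# match() succeeds as long as the string *starts* with a date.
prefix_ok = datepat.match('11/27/2012abcdef') is not None

# fullmatch() requires the whole string to match the pattern.
exact_ok = datepat.fullmatch('11/27/2012abcdef') is not None

month, day, year = datepat.fullmatch('11/27/2012').groups()
print(prefix_ok, exact_ok, (month, day, year))
```

Ending the pattern with $ and calling match() is the older way to get a similar effect.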

Searching and Replacing Text

"You want to search for and replace a text pattern in a string."

For simple literal patterns, use str.replace():

>>> 'yeah, but no, but yeah, but no, but yeah'.replace('yeah', 'yep')
'yep, but no, but yep, but no, but yep'
>>>

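For more complicated patterns, use the sub() function in the re module. The first argument to sub() is the pattern to match and the second is the replacement pattern. Backslashed digits such as \3 refer to capture group numbers in the pattern.

>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2','Today is 11/27/2012. PyCon starts 3/13/2013.')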
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>>

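If you are going to perform repeated substitutions with the same pattern, consider compiling it first for better performance:

>>> import re
>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> datepat.sub(r'\3-\1-\2', text)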
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>>

For more complicated substitutions, you can pass a callback function instead of a replacement string:

>>> from calendar import month_abbr
>>> def change_date(m):
...     mon_name = month_abbr[int(m.group(1))]
...     return '{} {} {}'.format(m.group(2), mon_name, m.group(3))
...
>>> datepat.sub(change_date, text)
'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'
>>>

To find out how many substitutions were made, use re.subn():

>>> newtext, n = datepat.subn(r'\3-\1-\2', text)
>>> newtext
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>> n
2
>>>
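Numbered references such as \3 become hard to read as patterns grow. As an alternative sketch (not something the notes above use), the re module also supports named groups, written (?P<name>...) in the pattern and \g<name> in the replacement:

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Each group gets a name instead of a position.
named = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

# subn() returns the new text together with the number of substitutions.
result, count = named.subn(r'\g<year>-\g<month>-\g<day>', text)
print(result, count)
```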

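Case-Insensitive Searching and Replacing

"You need to search for and possibly replace text in a case-insensitive manner."

This is similar to the -i option of grep on Linux. In Python, pass the re.IGNORECASE flag to the re operations you use.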

>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']

It also works for substitution:

>>> re.sub('python', 'snake', text, flags=re.IGNORECASE)
'UPPER snake, lower snake, Mixed snake'
>>>

The replacement text does not automatically follow the case of the matched text. To fix that, you may need a helper function:

def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

print(re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE))
# Output: UPPER SNAKE, lower snake, Mixed Snake

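Writing a Regular Expression for the Shortest Match

"You are matching a text pattern with a regular expression, but it finds the longest possible match. You want it to find the shortest possible match instead."

This typically comes up when matching text enclosed by a pair of delimiters. The pattern r'"(.*)"' is meant to match text enclosed in double quotes: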

>>> str_pat = re.compile(r'"(.*)"')
>>> text1 = 'Computer says "no."'
>>> str_pat.findall(text1)
['no.']
>>> text2 = 'Computer says "no." Phone says "yes."'
>>> str_pat.findall(text2)
['no." Phone says "yes.']
>>> str_pat = re.compile(r'"(.*?)"')
>>> str_pat.findall(text2)
['no.', 'yes.']
>>>

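The * operator in a regular expression is greedy, so matching finds the longest possible match. Adding a ? modifier after the * makes the match non-greedy.

The dot (.) matches any character except a newline. If you put a dot between a starting and an ending delimiter (such as quotes), matching looks for the longest possible match of the pattern. Adding a ? after an operator such as * or + forces the matching algorithm to look for the shortest possible match instead.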
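To make the difference concrete, here is a small comparison sketch. The third pattern, a negated character class, is an alternative I am adding for illustration; it is not part of the notes above, but it gives the same result for this input without relying on non-greedy matching.

```python
import re

text2 = 'Computer says "no." Phone says "yes."'

greedy = re.findall(r'"(.*)"', text2)      # longest possible match
lazy = re.findall(r'"(.*?)"', text2)       # shortest possible match
negated = re.findall(r'"([^"]*)"', text2)  # anything except a quote

print(greedy)
print(lazy)
print(negated)
```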

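Writing a Regular Expression for Multiline Patterns

"You are matching a block of text with a regular expression, and the match needs to span multiple lines."

This typically happens when you use the dot (.) to match any character and forget that it does not match newlines. For example, when matching C-style comments:

>>> comment = re.compile(r'/\*(.*?)\*/')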
>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a
... multiline comment */
... '''
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]
>>>

You can change the pattern to add support for newlines:

>>> comment = re.compile(r'/\*((?:.|\n)*?)\*/')
>>> comment.findall(text2)
[' this is a\nmultiline comment ']
>>>

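In this pattern, (?:.|\n) specifies a non-capturing group (that is, a group used only for matching, which is not captured separately or numbered).

re.compile() accepts a flag, re.DOTALL, which is very useful here. It makes the dot (.) in a regular expression match any character, including newlines.

>>> comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
>>> comment.findall(text2)
[' this is a\nmultiline comment ']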
>>>
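As a brief usage sketch, the same DOTALL pattern can also remove comments instead of extracting them. The source string below is made-up sample data, and the pattern is deliberately naive: it does not account for comment markers inside string literals.

```python
import re

comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
source = 'int x; /* first\n comment */ int y; /* second */'

bodies = comment.findall(source)    # text inside each comment
stripped = comment.sub('', source)  # source with the comments removed

print(bodies)
print(repr(stripped))
```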

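Normalizing Unicode Text to a Standard Representation

"You are working with Unicode strings and need to make sure that all of them have the same underlying representation."

Hmm, I am just noting this one down for now; it feels of limited use to me.

In Unicode, some characters can be represented by more than one valid sequence of code points:

>>> s1 = 'Spicy Jalape\u00f1o'
>>> s2 = 'Spicy Jalapen\u0303o'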
>>> s1
'Spicy Jalapeño'
>>> s2
'Spicy Jalapeño'
>>> s1 == s2
False
>>> len(s1)
14
>>> len(s2)
15
>>>

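Having multiple representations of the same character is a problem for programs that compare strings. To fix this, first normalize the text with the unicodedata module: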

>>> import unicodedata
>>> t1 = unicodedata.normalize('NFC', s1)
>>> t2 = unicodedata.normalize('NFC', s2)
>>> t1 == t2
True
>>> print(ascii(t1))
'Spicy Jalape\xf1o'
>>> t3 = unicodedata.normalize('NFD', s1)
>>> t4 = unicodedata.normalize('NFD', s2)
>>> t3 == t4
True
>>> print(ascii(t3))
'Spicy Jalapen\u0303o'
>>>

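The first argument to normalize() specifies how the string should be normalized. NFC means characters should be fully composed (using a single code point where possible), while NFD means characters should be fully decomposed into a base character plus combining characters.

Python also supports the normalization forms NFKC and NFKD, which add extra compatibility handling for certain kinds of characters:

>>> s = '\ufb01' # A single character
>>> s
'ﬁ'
>>> unicodedata.normalize('NFD', s)
'ﬁ'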
# Notice how the combined letters are broken apart here
>>> unicodedata.normalize('NFKD', s)
'fi'
>>> unicodedata.normalize('NFKC', s)
'fi'
>>>
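One way this might be used in practice is a comparison helper that normalizes both sides first. The sketch below is only illustrative; same_text is a hypothetical name, and the choice of normalization form is left to the caller.

```python
import unicodedata

def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

s1 = 'Spicy Jalape\u00f1o'    # precomposed n-with-tilde
s2 = 'Spicy Jalapen\u0303o'   # 'n' followed by a combining tilde

print(s1 == s2)                                 # raw comparison
print(same_text(s1, s2))                        # canonical forms agree
print(same_text('\ufb01', 'fi'))                # ligature differs under NFC
print(same_text('\ufb01', 'fi', form='NFKC'))   # compatibility form folds it
```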

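Working with Unicode Characters in Regular Expressions

"You are using regular expressions to process text, but are concerned about the handling of Unicode characters."

By default, the re module already has basic support for certain Unicode character classes. For example, \d already matches any Unicode digit character:

>>> import re
>>> num = re.compile(r'\d+')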
>>> # ASCII digits
>>> num.match('123')
<_sre.SRE_Match object at 0x1007d9ed0>
>>> # Arabic digits
>>> num.match('\u0661\u0662\u0663')
<_sre.SRE_Match object at 0x101234030>
>>>
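To see what this default means in practice, the sketch below contrasts it with the re.ASCII flag, which restricts \d (and \w, \s) to ASCII characters. The flag is part of the standard re module; the variable names are just for illustration.

```python
import re

arabic = '\u0661\u0662\u0663'   # ARABIC-INDIC DIGITS ONE, TWO, THREE

unicode_digits = re.compile(r'\d+')           # default: any Unicode digit
ascii_digits = re.compile(r'\d+', re.ASCII)   # only 0-9

matches_default = unicode_digits.match(arabic) is not None
matches_ascii = ascii_digits.match(arabic) is not None

# int() also understands Unicode decimal digits.
value = int(arabic)
print(matches_default, matches_ascii, value)
```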

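Hmm, I do not fully understand this one yet; noting it down for now.

Stripping Unwanted Characters from Strings

"You want to strip unwanted characters, such as whitespace, from the beginning, end, or middle of a text string."

strip() removes characters from the beginning or end of a string. lstrip() and rstrip() strip from the left and from the right, respectively.

>>> s = ' hello world \n'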
>>> s.strip()
'hello world'
>>> s.lstrip()
'hello world \n'
>>> s.rstrip()
' hello world'
>>>

By default these methods strip whitespace, but you can specify other characters:

>>> t = '-----hello====='
>>> t.lstrip('-')
'hello====='
>>> t.strip('-=')
'hello'

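Stripping does not touch text in the middle of a string. To deal with inner spaces, use replace() or a regular expression substitution:

>>> s = ' hello     world \n'.strip()
>>> s.replace(' ', '')
'helloworld'
>>> import re
>>> re.sub(r'\s+', ' ', s)
'hello world'
>>>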

To combine strip operations with some other kind of iteration, use a generator expression:

with open(filename) as f:
    lines = (line.strip() for line in f)
    for line in lines:
        print(line)

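Sanitizing and Cleaning Up Text

"Some bored script kiddie has entered the text "pýtĥöñ" into a form on your web page, and you would like to clean it up."

Text cleaning touches on a whole range of problems involving text parsing and data handling. For simple character-level cleanup, str.translate() with a remapping table works:

>>> s = 'pýtĥöñ\fis\tawesome\r\n'
>>> s
'pýtĥöñ\x0cis\tawesome\r\n'
>>> remap = {
... ord('\t') : ' ',
... ord('\f') : ' ',
... ord('\r') : None # Deleted
... }
>>> a = s.translate(remap)
>>> a
'pýtĥöñ is awesome\n'
>>>

As you can see, the whitespace characters \t and \f have been remapped to a single space, and the carriage return \r has been deleted entirely. The same idea extends to much larger tables, for example to remove all combining characters: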

>>> import unicodedata
>>> import sys
>>> cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode)
... if unicodedata.combining(chr(c)))
...
>>> b = unicodedata.normalize('NFD', a)
>>> b
'pýtĥöñ is awesome\n'
>>> b.translate(cmb_chrs)
'python is awesome\n'
>>>

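dict.fromkeys() builds a dictionary that maps every Unicode combining character to None.

unicodedata.normalize() then converts the original input to its decomposed form, and translate() deletes all the accents. The same technique can be used to remove other kinds of characters (such as control characters).

Another text-cleaning technique involves I/O decoding and encoding functions. The idea is to do some preliminary cleanup of the text first and then run it through encode() or decode() to strip or alter it:

>>> a
'pýtĥöñ is awesome\n'
>>> b = unicodedata.normalize('NFD', a)
>>> b.encode('ascii', 'ignore').decode('ascii')
'python is awesome\n'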
>>>
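For occasional use, building a table over the whole Unicode range may be more than is needed. The sketch below is a variation on the same idea rather than the method shown above: it filters combining characters one at a time after NFD decomposition. The function name strip_accents is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after decomposing the text with NFD."""
    decomposed = unicodedata.normalize('NFD', text)
    # combining() returns 0 for characters that are not combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(repr(strip_accents('pýtĥöñ is awesome\n')))
```

The translate() table is likely the better choice when many strings are processed, since the table is built once and reused.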

Aligning Text Strings

"You need to format text with some sort of alignment applied."

For basic alignment of strings, use the ljust(), rjust() and center() string methods.

>>> text = 'Hello World'
>>> text.ljust(20)
'Hello World         '
>>> text.rjust(20)
'         Hello World'
>>> text.center(20)
'    Hello World     '
>>>

All of these methods accept an optional fill character:

>>> text = 'Hello World'
>>> text.rjust(20,'=')
'=========Hello World'
>>> num='5'
>>> num.rjust(8,'0')
'00000005'
>>>

Methods such as rjust() exist only on strings; an int does not support them:

>>> num = 5
>>> num.rjust(8,'0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute 'rjust'

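The format() function can also be used to align things easily. Use the <, > or ^ character followed by the desired width, optionally preceded by a fill character: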

>>> num=5
>>> format(num, '>20')
'                   5'
>>> format(num, '0>20')
'00000000000000000005'
>>>
>>> format(num, '0<20')
'50000000000000000000'
>>> format(num, '0^20')
'00000000050000000000'
>>>

These format codes can also be used in the format() method when formatting multiple values:

>>> '{:>10s} {:>10s}'.format('Hello', 'World')
'     Hello      World'
>>>

One benefit of format() is that it is not specific to strings. It works with any value:

>>> x = 1.2345
>>> format(x, '>10')
'    1.2345'
>>> format(x, '^10.2f')
'   1.23   '
>>>
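Putting these format codes together, a small table can be laid out with fixed column widths. This is only a sketch; the rows and the column widths are made-up sample values.

```python
rows = [('ACME', 50, 91.1), ('IBM', 100, 45.23)]   # hypothetical data

# Left-align the name, right-align the numbers.
lines = ['{:<8}{:>6}{:>10}'.format('Name', 'Shares', 'Price')]
for name, shares, price in rows:
    lines.append('{:<8}{:>6d}{:>10.2f}'.format(name, shares, price))

print('\n'.join(lines))
```

Because every field has an explicit width, each output line has the same length (24 characters here) and the columns line up.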

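Combining and Concatenating Strings

"You want to combine several small strings into a larger string."

If the strings you want to combine are in a sequence or iterable, the fastest way to combine them is the join() method: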

>>> parts = ['Is', 'Chicago', 'Not', 'Chicago?']
>>> ' '.join(parts)
'Is Chicago Not Chicago?'
>>> ','.join(parts)
'Is,Chicago,Not,Chicago?'
>>>
>>> ''.join(parts)
'IsChicagoNotChicago?'
>>>

To join just a few strings, the + operator usually works well enough:

>>> a = 'Is Chicago'
>>> b = 'Not Chicago?'
>>> a + ' ' + b
'Is Chicago Not Chicago?'
>>>

If you want to combine string literals in source code, you can simply place them next to each other with no + operator:

>>> a = 'li' 'rui' 'long'
>>> a
'liruilong'
>>>

Hmm, this does not work for string variables (I was a bit naive there); it only applies to literals:

>>> a = 'li' 'rui' 'long'
>>> a
'liruilong'
>>> a a
  File "<stdin>", line 1
    a a
      ^
SyntaxError: invalid syntax
>>> a = a a
  File "<stdin>", line 1
    a = a a
          ^
SyntaxError: invalid syntax
>>>

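"Using the + operator to join a large number of strings is very inefficient because of the memory copies and garbage collection it causes."

In particular, you never want to write string-joining code like this: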

s = ''
for p in parts:
    s += p

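This runs slower than join(), because each += operation creates a new string object. You are better off collecting all the parts first and joining them at the end. A generator expression helps when the data needs converting to strings along the way: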

>>> data = ['ACME', 50, 91.1]
>>> ','.join(str(d) for  d in data)
'ACME,50,91.1'
>>>

Also watch out for unnecessary string concatenation:

print(a + ':' + b + ':' + c) # Ugly
print(':'.join([a, b, c])) # Still ugly
print(a, b, c, sep=':') # Better

When mixing I/O operations and string concatenation, you sometimes need to study your program carefully:

# Version 1 (string concatenation)
f.write(chunk1 + chunk2)
# Version 2 (separate I/O operations)
f.write(chunk1)
f.write(chunk2)

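If the two strings are small, the first version may perform better because of the inherent cost of an I/O system call. On the other hand, if the two strings are large, the second version may be more efficient because it avoids creating a large temporary result and copying large blocks of memory.

If you are writing code that builds output from lots of small strings, consider writing it as a generator function that uses yield to emit fragments. An interesting aspect of this approach is that it makes no assumption about how the fragments are to be assembled: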

def sample():
    yield 'Is'
    yield 'Chicago'
    yield 'Not'
    yield 'Chicago?'

text = ''.join(sample())    
print(text)
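Because the generator makes no assumption about assembly, the consumer is free to choose a strategy. As one possible sketch, the hypothetical helper combine() below groups fragments into larger chunks before they are handed on (for example to a write() call), which sits between "join everything" and "write every fragment separately". The maxsize threshold is an arbitrary illustrative value.

```python
def sample():
    yield 'Is'
    yield 'Chicago'
    yield 'Not'
    yield 'Chicago?'

def combine(source, maxsize):
    """Yield joined chunks, flushing once the buffered size exceeds maxsize."""
    parts = []
    size = 0
    for part in source:
        parts.append(part)
        size += len(part)
        if size > maxsize:
            yield ''.join(parts)
            parts = []
            size = 0
    if parts:                      # flush whatever is left over
        yield ''.join(parts)

chunks = list(combine(sample(), 8))
print(chunks)
```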

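Interpolating Variables in Strings

"You want to create a string in which embedded variable names are substituted with a string representation of each variable's value."

Python has no direct support for simply substituting variable values in strings (the way the shell does). However, the format() method of strings comes close. It is sometimes used to assemble SQL text, although parameterized queries are the safer choice for real queries.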

>>> s = '{name} has {n} messages.'
>>> s.format(name='Guido', n=37)
'Guido has 37 messages.'
>>>

If the values to be substituted are found in variables in scope, you can use the combination of format_map() and vars():

>>> s = '{name} has {n} messages.'
>>> name = 'Guido'
>>> n = 37
>>> s.format_map(vars())
'Guido has 37 messages.'
>>>

An interesting feature of vars() is that it also works with object instances, which is more powerful than I expected:

>>> class Info:
...     def __init__(self, name, n):
...         self.name = name
...         self.n = n
...
>>> a = Info('Guido',37)
>>> s.format_map(vars(a))
'Guido has 37 messages.'
>>>

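One downside of format() and format_map() is that they do not deal gracefully with missing values. One way to avoid the resulting error is to define a dict subclass with a __missing__() method. Since Python 2.5, if a subclass of dict defines __missing__(), then d[key] calls it for a missing key to obtain a default value.

class safesub(dict):
    """Prevent errors when a key is missing."""
    def __missing__(self, key):
        return '{' + key + '}'

Now you can wrap the inputs in this class before passing them to format_map():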

>>> del n # Make sure n is undefined
>>> s.format_map(safesub(vars()))
'Guido has {n} messages.'
>>>

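To avoid repeating these steps, the substitution can be hidden behind a small utility function that inspects the caller's stack frame:

import sys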
def sub(text):
    return text.format_map(safesub(sys._getframe(1).f_locals))

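sys._getframe() returns a frame object from the call stack. If the optional integer depth is given, it returns the frame that many calls below the top of the stack; if that is deeper than the call stack, ValueError is raised. The default depth is 0, which returns the frame at the top of the call stack. sys._getframe(1) returns the caller's stack frame, whose f_locals attribute gives access to the caller's local variables.

f_locals is a dictionary that is a copy of the local variables in the calling function. Although you can modify the contents of f_locals, the changes have no lasting effect on later variable access. So although accessing another stack frame may look evil, nothing you do with it will overwrite or change the values of the caller's local variables.

With that in place, it can be used like this: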

>>> name = 'Guido'
>>> n = 37
>>> print(sub('Hello {name}'))
Hello Guido
>>> print(sub('You have {n} messages.'))
You have 37 messages.
>>> print(sub('Your favorite color is {color}'))
Your favorite color is {color}
>>>

Besides format() and format_map(), Python offers other ways to do string substitution, such as %-formatting and string.Template:

>>> name = 'Guido'
>>> n = 37
>>> '%(name)s has %(n)s messages.' % vars()
'Guido has 37 messages.'
>>>

>>> import string
>>> name = 'Guido'
>>> n = 37
>>> s = string.Template('$name has $n messages.')
>>> s.substitute(vars())
'Guido has 37 messages.'
>>>
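string.Template also has its own answer to the missing-value problem discussed earlier: safe_substitute() leaves unknown placeholders in place instead of raising. A short sketch (the values dictionary is made-up sample data):

```python
import string

template = string.Template('$name has $n messages.')
values = {'name': 'Guido'}          # 'n' is deliberately missing

try:
    template.substitute(values)     # strict: a missing key raises KeyError
    strict_failed = False
except KeyError:
    strict_failed = True

# Lenient: unknown placeholders are left untouched.
partial = template.safe_substitute(values)
print(strict_failed, partial)
```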

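Reformatting Text to a Fixed Number of Columns

"You have long strings that you want to reformat so that they fill a specified number of columns."

Use the textwrap module to reformat text for output:

>>> s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
... the eyes, not around the eyes, don't look around the eyes, \
... look into my eyes, you're under."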
>>> import textwrap
>>> print(textwrap.fill(s, 70))
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.

>>> print(textwrap.fill(s, 40))
Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.

>>> print(textwrap.fill(s, 40, initial_indent=' '))
 Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.
>>> print(textwrap.fill(s, 40, subsequent_indent=' '))
Look into my eyes, look into my eyes,
 the eyes, the eyes, the eyes, not
 around the eyes, don't look around the
 eyes, look into my eyes, you're under.
>>>

The textwrap module is handy for printing text, especially when you want the output to fit the terminal. You can get the terminal size with os.get_terminal_size(). For example:

>>> print(textwrap.fill(s, os.get_terminal_size().columns, initial_indent=' '))
 Look into my eyes, look into my eyes, the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes, look into my eyes, you're under.
>>>
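Two related calls may be worth knowing here; the sketch below is an illustration rather than part of the notes above. textwrap.wrap() returns the wrapped lines as a list instead of one string, and shutil.get_terminal_size() accepts a fallback size, which helps when output is not attached to a terminal (a case where os.get_terminal_size() raises OSError).

```python
import shutil
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# wrap() gives a list of lines, each at most `width` characters long.
lines = textwrap.wrap(s, width=40)

# Falls back to 80 columns when the real size cannot be determined.
width = shutil.get_terminal_size(fallback=(80, 24)).columns

for line in lines:
    print(line)
```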

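Handling HTML and XML Entities in Text

"You want to replace HTML or XML entities such as &entity; or &#code; with their corresponding text. Alternatively, you need to produce text while escaping certain characters (such as <, > or &)."

If you want to replace characters such as < or > in a text string, the html.escape() function does it easily: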

>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(s)
Elements are written as "<tag>text</tag>".
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
>>> # Disable escaping of quotes
>>> print(html.escape(s, quote=False))
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".
>>>

If you are producing ASCII text and want to embed character code entities for the non-ASCII characters, pass errors='xmlcharrefreplace' to the relevant I/O-related functions. For example:

>>> s = 'Spicy Jalapeño'
>>> s.encode('ascii', errors='xmlcharrefreplace')
b'Spicy Jalape&#241;o'
>>>

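Replacing the entities in text with the characters they stand for takes a different approach. If you are actually processing HTML or XML, try using a proper HTML or XML parser first.

For HTML, the HTMLParser.unescape() method used in the book has been removed; on my Python 3.9 it fails as shown here (the module-level html.unescape() function is its replacement):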

>>> from html.parser import HTMLParser
>>> p = HTMLParser()
>>> p.unescape(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'HTMLParser' object has no attribute 'unescape'

For XML:

>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'
>>>
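Since HTMLParser.unescape() is gone, here is a short sketch of the html.unescape() function from the standard html module doing the same job. The input string is made-up sample data containing a named entity and a numeric character reference.

```python
import html

s = 'Spicy &quot;Jalape&#241;o&quot;.'

# Converts named and numeric character references back to characters.
text = html.unescape(s)
print(text)

# escape() followed by unescape() gives back the original string.
original = 'Elements are written as "<tag>text</tag>".'
round_trip = html.unescape(html.escape(original))
```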

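Performing Text Operations on Byte Strings

Note that my results for this section differed considerably between my Linux and Windows environments, and the book's behavior matched what I saw on Windows. The likely cause is the Python version rather than the operating system: output without the b prefix is what Python 2 prints (there bytes is simply str), while Python 3 behaves as the book describes on every platform. The sessions below show Python 3 output unless noted otherwise.

"You want to perform common text operations (such as stripping, searching and replacing) on byte strings."

>>> data = b'Hello World'
>>> data[0:5]
b'Hello'
>>> data.startswith(b'Hello')
True
>>> data.split()
[b'Hello', b'World']
>>> data.replace(b'Hello', b'Hello Cruel')
b'Hello Cruel World'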

These operations also work with byte arrays:

>>> data = bytearray(b'Hello World')
>>> data[0:5]
bytearray(b'Hello')
>>> data.startswith(b'Hello')
True
>>> data.split()
[bytearray(b'Hello'), bytearray(b'World')]
>>> data.replace(b'Hello', b'Hello Cruel')
bytearray(b'Hello Cruel World')

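Regular expressions can be applied to byte strings. On my Linux machine both a str pattern and a bytes pattern worked, while on Windows only the bytes pattern did, which differs a little from the book. This is the difference one would expect between Python 2 and Python 3: in Python 3 the pattern itself must also be given as bytes.

On Linux (output consistent with Python 2):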

>>> data = b'FOO:BAR,SPAM'
>>> import re
>>> re.split('[:,]',data)
['FOO', 'BAR', 'SPAM']
>>> re.split(b'[:,]',data)
['FOO', 'BAR', 'SPAM']
>>>

On Windows (Python 3.10):

>>> data = b'FOO:BAR,SPAM'
>>> import re
>>> re.split('[:,]',data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\python\Python310\lib\re.py", line 231, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: cannot use a string pattern on a bytes-like object
>>> re.split(b'[:,]',data)
[b'FOO', b'BAR', b'SPAM']
>>>

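Byte strings do not have a nice string representation and do not print cleanly unless they are first decoded into a text string. I did not see this on my Linux machine, which is again consistent with Python 2 being used there.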

>>> s = b'Hello World'
>>> s
b'Hello World'
>>> s.decode('ascii')
'Hello World'

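To format a byte string, format a regular text string first and then encode it into bytes (here too my results differed between the two environments):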

>>> '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')
b'ACME              100     490.10'
>>>
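The sketch below gathers the Python 3 behaviors that tend to surprise people here: indexing bytes yields integers, a str pattern cannot be used on bytes, and (since Python 3.5) bytes support %-style formatting directly. Variable names are just for illustration.

```python
import re

data = b'Hello World'
first_byte = data[0]                    # an int, not a one-byte string
first_char = data.decode('ascii')[0]    # decode first to get a character

# The pattern must be bytes when the subject is bytes.
fields = re.split(b'[:,]', b'FOO:BAR,SPAM')

try:
    re.split('[:,]', b'FOO:BAR,SPAM')   # str pattern on bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False

# %-formatting on bytes, available since Python 3.5.
formatted = b'%-10s %10d %10.2f' % (b'ACME', 100, 490.1)
print(first_byte, first_char, fields, mixed_ok, formatted)
```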

Originally published on the WeChat official account 山河已无恙: Python实战之字符串和文本处理

Compiled by 极客之音. Article link: https://www.bmabk.com/index.php/post/60016.html