Tokenizing and labeling text

import re

# Each callback receives the scanner and the matched text,
# and returns a (token, label) pair.
def alpha(scanner, token):
    return token, 'a'

def numeric(scanner, token):
    return token, 'rn'

def punctuation(scanner, token):
    return token, 'p'

def superscript(scanner, token):
    return token, 'sn'

scanner = re.Scanner([
    (u"[a-zA-Z]+", alpha),
    (u"[.,:;!?]", punctuation),
    (u"[0-9]+", numeric),
    (u"[\xb9\u2070\xb3\xb2\u2075\u2074\u2077\u2076\u2079\u2078]", superscript),
    (r"[\s\n]+", None),  # whitespace, newline
])

tokens, _ = scanner.scan("This is a little test? With 7,9 and 6.")
print(tokens)

1 Answer

User

#1 · Posted 2024-04-26 13:23:24

re.Scanner tries the patterns in the order they are given. You can therefore add a very general pattern at the end to catch "unknown" characters:

(r".", unknown)

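With a final callback that labels its token 'uk', input containing characters that no other pattern matches yields: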

[('This', 'a'), ('is', 'a'), ('a', 'a'), ('little', 'a'), 
('test', 'a'), ('?', 'p'), ('With', 'a'), ('7', 'rn'), (',', 'p'), 
('9', 'rn'), ('and', 'a'), ('6', 'rn'), ('.', 'p'), ('\xa0', 'uk'), 
('-', 'uk'), ('\xaf', 'uk')]
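
The catch-all approach described above can be sketched as a complete script. This is a minimal illustration written for Python 3, not the original answer's code: the `unknown` callback body, the shortened pattern list, and the sample input string are assumptions chosen for the example.

```python
import re

def alpha(scanner, token):
    return token, 'a'

def numeric(scanner, token):
    return token, 'rn'

def punctuation(scanner, token):
    return token, 'p'

def unknown(scanner, token):
    # Assumed fallback callback; the 'uk' label mirrors the output shown above.
    return token, 'uk'

scanner = re.Scanner([
    (r"[a-zA-Z]+", alpha),
    (r"[.,:;!?]", punctuation),
    (r"[0-9]+", numeric),
    (r"\s+", None),    # skip whitespace
    (r".", unknown),   # catch-all: must be last, since earlier patterns win
])

# Sample input (an assumption): '-' is not covered by any specific pattern.
tokens, remainder = scanner.scan("Test 7, ok-no")
print(tokens)
```

Because the catch-all comes last, it only sees characters that every earlier pattern rejected. Here the hyphen is the only token labeled 'uk', and `remainder` is empty because every character was consumed.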

Some of your patterns are unicode and one is a str. In Python 2, both the pattern and the string being matched can be either unicode or str.

However, in Python 3:

Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match an Unicode string with a byte pattern or vice-versa

So it is best not to mix them, even in Python 2.
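
The Python 3 restriction quoted above can be checked with a short sketch. The sample text and variable names are assumptions for illustration; the only behavior relied on is that `re.match` raises `TypeError` when a bytes pattern is applied to a str.

```python
import re

text = "abc 123"  # a str (Unicode) input, chosen for illustration

# str pattern on str input: works as expected.
str_match = re.match(r"[a-z]+", text)

# bytes pattern on str input: Python 3 refuses to mix the two types.
try:
    re.match(rb"[a-z]+", text)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(str_match.group(), mixed_ok)
```

Keeping every pattern and every input of the same type avoids this error and makes the code easier to port between Python versions.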

I think your code is quite simple (except for the superscript regex. Yikes.) I don't know of any library that would make it simpler.
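
One possible way to make the superscript pattern easier to read, offered only as a sketch in Python 3 and not as part of the original answer, is to list the code points in numeric order and use a range for the contiguous ones. The name `SUPERSCRIPT` is an assumption for this example.

```python
import re

# Superscript digits in Unicode:
#   U+2070 (0), U+00B9 (1), U+00B2 (2), U+00B3 (3), U+2074..U+2079 (4 to 9)
SUPERSCRIPT = re.compile("[\u2070\u00b9\u00b2\u00b3\u2074-\u2079]")

superscripts = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"

# Every superscript digit should match; ordinary digits should not.
all_match = all(SUPERSCRIPT.fullmatch(ch) for ch in superscripts)
plain_match = SUPERSCRIPT.search("0123456789")
print(all_match, plain_match)
```

This character class covers the same ten characters as the one in the question, so it could be dropped into the scanner's pattern list unchanged.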
