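How do I tokenize a sample string with a regular expression in Python?

0 votes

3 answers

663 views

Asked 2025-04-16 10:52

I'm new to regular expressions. Besides a pattern that matches the string below, please also point me to some references and sites with examples.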

Data string

1.  First1 Last1 - 20 (Long Description) 
2.  First2 Last2 - 40 (Another Description)

I want to extract the tuples {First1,Last1,20} and {First2,Last2,40} from the string above.

regular expressions, string processing, data extraction, pattern matching, tokenization

3 Answers

Building on Harman's partial solution, I came up with this:

(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)

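Code and output (the pattern is written as a raw string so the backslashes reach the regex engine unchanged):

>>> import re
>>> string = "1.  First1 Last1 - 20 (Long Description)\n2.  First2 Last2 - 40 (Another Description)"
>>> regex = re.compile(r"(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)")
>>> regex.findall(string)
[('First1', 'Last1', '20'), ('First2', 'Last2', '40')]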
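
Since the pattern already names its groups, here is a minimal sketch of reading them by name with finditer instead of relying on tuple positions. The names text and records are only illustrative, and converting the number to int (after stripping commas, which the pattern allows) is my assumption about what is wanted, not part of the original answer.

```python
import re

# Sample data from the question (two lines joined with a newline).
text = (
    "1.  First1 Last1 - 20 (Long Description)\n"
    "2.  First2 Last2 - 40 (Another Description)"
)

# Same pattern as above, written as a raw string.
regex = re.compile(r"(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)")

records = []
for m in regex.finditer(text):
    # The pattern accepts digits with commas (e.g. "1,000"), so strip them
    # before converting. This conversion is an illustrative assumption.
    number = int(m.group("number").replace(",", ""))
    records.append((m.group("first"), m.group("last"), number))

print(records)  # [('First1', 'Last1', 20), ('First2', 'Last2', 40)]
```

Accessing groups by name keeps the code readable if the pattern later gains or reorders groups.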

Answered 2025-04-16 by Python大师


You don't actually need a regular expression here:

>>> foo = "1.  First1 Last1 - 20 (Long Description)"
>>> foo.split(" ")
['1.', '', 'First1', 'Last1', '-', '20', '(Long', 'Description)']

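You can now pick out the elements you want by index (they will always be in the same positions, provided every line follows this format).

In Python 2.7 and later, you can use itertools.compress to select those elements:

from itertools import compress
tuple(compress(foo.split(" "), [0, 0, 1, 1, 0, 1]))  # ('First1', 'Last1', '20')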
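
To apply the same idea to every line, one possible sketch is below. It calls split() with no argument, which splits on runs of whitespace, so the double space after "1." no longer produces an empty string and the indices shift by one. The helper name parse_line and the variables data and result are hypothetical, and the sketch assumes every line has exactly the layout shown in the question.

```python
def parse_line(line):
    """Return (first, last, number) from a line like '1.  First1 Last1 - 20 (...)'."""
    # split() with no argument collapses runs of whitespace, so there is
    # no empty string between '1.' and 'First1'.
    parts = line.split()
    # Assumed layout: ['1.', 'First1', 'Last1', '-', '20', '(Long', ...]
    return parts[1], parts[2], int(parts[4])


data = """1.  First1 Last1 - 20 (Long Description)
2.  First2 Last2 - 40 (Another Description)"""

result = [parse_line(line) for line in data.splitlines()]
print(result)  # [('First1', 'Last1', 20), ('First2', 'Last2', 40)]
```

A line with a different layout would raise IndexError or ValueError here, so this only suits input known to be uniform.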

Answered 2025-04-16 by Python大师


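This one looks decent: http://docs.python.org/howto/regex.html#regex-howto. Skim through it and try some of the examples. Regular expressions are a bit complicated (essentially a small programming language of their own) and take some time to learn, but knowing them is very useful. Experiment a lot and take your time.

(Yes, I could just give you the answer, but teaching you to fish matters more.)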

...

As requested, here is a solution that does not use split(): loop over the lines and match each one against a pattern:

import re

p = re.compile(r'\d+\.\s+(\w+)\s+(\w+)\s+-\s+(\d+)')
m = p.match(the_line)  # None if the line does not fit the pattern
# m.group(1) will be the first word
# m.group(2) the second word
# m.group(3) will be the first number after the last word
# (m.group(0) is the entire matched text, not the first capture)

The regexp is: <some digits><a dot>
<some whitespace><word characters, captured as group 1>
<some whitespace><word characters, captured as group 2>
<some whitespace><a '-'><some whitespace><digits, captured as group 3>

This approach is fairly strict, but that is what lets you detect lines that do not conform to the expected format.
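
The line-by-line loop described above could be sketched as follows. The names LINE_RE, parse, parsed, and rejected are hypothetical, the third sample line is made up to show a rejection, and converting the number to int is an assumption on my part.

```python
import re

# Same strict pattern: digits, a dot, two words, a dash, then digits.
LINE_RE = re.compile(r"\d+\.\s+(\w+)\s+(\w+)\s+-\s+(\d+)")


def parse(text):
    """Return (parsed tuples, lines that did not match the pattern)."""
    parsed, rejected = [], []
    for line in text.splitlines():
        m = LINE_RE.match(line)  # match() anchors at the start of the line
        if m is None:
            rejected.append(line)
        else:
            first, last, number = m.groups()
            parsed.append((first, last, int(number)))
    return parsed, rejected


sample = (
    "1.  First1 Last1 - 20 (Long Description)\n"
    "2.  First2 Last2 - 40 (Another Description)\n"
    "this line does not follow the format"
)
parsed, rejected = parse(sample)
print(parsed)    # [('First1', 'Last1', 20), ('First2', 'Last2', 40)]
print(rejected)  # ['this line does not follow the format']
```

Collecting the rejected lines, rather than silently skipping them, is what makes the strictness of the pattern useful.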

Answered 2025-04-16 by Python大师

