Problem using re.findall (duplicate results)

3 votes

1 answer

4208 views

Asked 2025-04-16 09:15

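I'm trying to fetch the page source of 4chan and extract the links to threads.

I'm having trouble with the regular expression (it doesn't work as intended). The source code is: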

import urllib2, re

req = urllib2.Request('http://boards.4chan.org/wg/')
resp = urllib2.urlopen(req)
html = resp.read()

print re.findall("res/[0-9]+", html)
#print re.findall("^res/[0-9]+$", html)

The problem is that:

print re.findall("res/[0-9]+", html)

returns duplicate results.

I can't use:

print re.findall("^res/[0-9]+$", html)

I've read the Python documentation but couldn't find a solution.

Tags: regex, source code analysis, web scraping, 4chan, duplicate results

1 Answer

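That's because the page source contains the same link more than once, and re.findall returns every occurrence it matches.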
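To make this concrete, here is a minimal sketch in Python 3 syntax (the question's code is Python 2). The `sample_html` string is a made-up stand-in for the downloaded page, not the real 4chan markup. It also shows why the anchored pattern from the question finds nothing: `^` and `$` require "res/<digits>" to be the entire string, or an entire line when `re.MULTILINE` is set.

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = (
    '<a href="res/3833795">Reply</a>\n'
    '<a href="res/3833795#p3833801">3833801</a>\n'
    '<a href="res/3837945">Reply</a>\n'
)

# Every occurrence is reported, so a thread linked twice appears twice.
matches = re.findall(r"res/[0-9]+", sample_html)
print(matches)  # ['res/3833795', 'res/3833795', 'res/3837945']

# ^ and $ anchor to the start and end of the string (or of each line with
# re.MULTILINE), so this only matches input that is nothing but "res/<digits>".
anchored = re.findall(r"^res/[0-9]+$", sample_html, re.MULTILINE)
print(anchored)  # []
```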

You can easily make them unique by putting them in a set.

>>> print set(re.findall("res/[0-9]+", html))
set(['res/3833795', 'res/3837945', 'res/3835377', 'res/3837941', 'res/3837942',
'res/3837950', 'res/3100203', 'res/3836997', 'res/3837643', 'res/3835174'])
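Note that a set does not keep the links in page order. If order matters, one option is to deduplicate through dictionary keys instead. This is a sketch that assumes Python 3.7 or later, where dicts preserve insertion order; `unique_in_order` and `sample_html` are illustrative names, not part of the original code.

```python
import re

def unique_in_order(items):
    """Return items with duplicates dropped, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))

# Hypothetical input standing in for the page source.
sample_html = "res/3833795 res/3837945 res/3833795 res/3835377 res/3837945"

links = unique_in_order(re.findall(r"res/[0-9]+", sample_html))
print(links)  # ['res/3833795', 'res/3837945', 'res/3835377']
```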

However, if you plan to do anything more complex than this, I'd recommend using a library that can parse HTML, such as BeautifulSoup or lxml.
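As a rough illustration of the parser-based approach, the sketch below reads `href` attributes from `<a>` tags rather than scanning raw text. To stay dependency-free it uses the standard library's `html.parser` in place of BeautifulSoup or lxml, which offer a more convenient API. It is written for Python 3; `ThreadLinkCollector` and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser

class ThreadLinkCollector(HTMLParser):
    """Collect unique href values that look like thread links."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or value is None:
                continue
            # Keep only the "res/<digits>" part, dropping any #fragment.
            match = re.match(r"res/[0-9]+", value)
            if match and match.group(0) not in self._seen:
                self._seen.add(match.group(0))
                self.links.append(match.group(0))

# Hypothetical markup; the real page structure may differ.
sample_html = (
    '<a href="res/3833795">Reply</a>'
    '<a href="res/3833795#p3833801">3833801</a>'
    '<a href="/wg/">Board</a>'
    '<a href="res/3837945">Reply</a>'
)

collector = ThreadLinkCollector()
collector.feed(sample_html)
print(collector.links)  # ['res/3833795', 'res/3837945']
```

Because only link targets are inspected, text elsewhere on the page that happens to contain "res/123" is ignored.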

Answered 2025-04-16 by Python大师
