Regular expression matching

import urllib2
from BeautifulSoup import BeautifulSoup
import re

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

pattern = r'<a href="http://forums.epicgames.com/archive/index.php?t-([0-9]+).html">(.?+)</a> <i>((.?+) replies)'
#pattern = r'href="http://forums.epicgames.com/archive/index.php?t-622233.html">Gears of War 2: Horde Gameplay</a> <i>(20 replies)'

for match in re.finditer(pattern, page, re.S):
    print match(0)

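3 Answers

User

Reply #1 · Edited 2024-06-06 20:00:50

It means there is an error in your regular expression.

(.?+)</a> <i>((.?+)

What is "?+" supposed to mean? Both "?" and "+" are metacharacters, and they make no sense right next to each other. Perhaps you forgot to escape the "?" or something similar.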
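The effect of an unescaped "?" can be seen in a small sketch. It is written for Python 3 (the question uses Python 2), and the url_fragment string is an illustrative value taken from the thread link in the question, not output from the real page.

```python
import re

# Illustrative fragment of the thread URL from the question.
url_fragment = "index.php?t-622233.html"

# Unescaped: "p?" means "an optional p", so the literal "?" in the
# text is never consumed and the search finds nothing.
unescaped = re.search(r"index.php?t-([0-9]+)", url_fragment)

# Escaped: "\?" matches a literal question mark and "\." a literal dot.
escaped = re.search(r"index\.php\?t-([0-9]+)", url_fragment)

print(unescaped)         # None
print(escaped.group(1))  # 622233
```

When the literal text comes from a variable rather than being typed by hand, re.escape() can add the backslashes for you.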

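User

Reply #2 · Edited 2024-06-06 20:00:50

You need to escape the literal "?" as well as the literal "(" and ")" that you want to match.

Also, instead of "?+", I think you are looking for the non-greedy match provided by "+?".

More documentation here.

For your case, try the following: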

pattern = r'<a href="http://forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a> <i>\((.+?) replies\)'
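As a rough check of the corrected pattern, here is a minimal Python 3 sketch run against a sample line instead of the live page. The sample markup is an assumption reconstructed from the commented-out pattern in the question (the closing </i> is guessed), the thread id 622234 is made up for the comparison, and the dots in the URL are escaped as an extra tightening that the answer itself does not include.

```python
import re

# Assumed sample markup, reconstructed from the question's commented-out pattern.
sample = ('<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
          'Gears of War 2: Horde Gameplay</a> <i>(20 replies)</i>')

# Literal "?", "(" and ")" are escaped; "+?" keeps each group non-greedy.
pattern = re.compile(
    r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-([0-9]+)\.html">'
    r'(.+?)</a> <i>\((.+?) replies\)',
    re.S,
)

threads = [m.groups() for m in pattern.finditer(sample)]
print(threads)  # [('622233', 'Gears of War 2: Horde Gameplay', '20')]

# Greedy vs. non-greedy on two links in a row (second thread id is hypothetical).
two_links = sample + "\n" + sample.replace("622233", "622234")
greedy = re.findall(r'">(.+)</a>', two_links, re.S)
lazy = re.findall(r'">(.+?)</a>', two_links, re.S)
print(len(greedy), len(lazy))  # 1 2
```

The greedy ".+" runs on to the last "</a>" and swallows both links into one match, while ".+?" stops at the first "</a>" and yields one title per link.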

User

Reply #3 · Edited 2024-06-06 20:00:50

import urllib2
import re
from BeautifulSoup import BeautifulSoup

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

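# Get all the links, rendered back to HTML strings
links = [str(match) for match in soup('a')]

# Escape the literal "?" and "." so they are not treated as metacharacters
s = r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-\d+\.html">(.+?)</a>'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        # group(1) is the link text, i.e. the thread title
        print m.group(1)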
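The same idea (let a parser find the links, then apply a small regex to each one) can be sketched in Python 3 with only the standard library. This swaps BeautifulSoup for html.parser, so it is not the code from this answer; LinkCollector, thread_titles, and the sample markup (including the "Forum index" link) are hypothetical names and data introduced for illustration.

```python
import re
from html.parser import HTMLParser

# Matches only thread URLs; "?" and "." are escaped so they are literal.
THREAD_HREF = re.compile(
    r'http://forums\.epicgames\.com/archive/index\.php\?t-(\d+)\.html$'
)


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def thread_titles(html):
    """Return the link text of every anchor whose href is a thread URL."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [text for href, text in collector.links if THREAD_HREF.match(href)]


# Hypothetical sample markup; the real archive page may differ.
sample = (
    '<a href="http://forums.epicgames.com/archive/index.php?f-356.html">Forum index</a> '
    '<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
    'Gears of War 2: Horde Gameplay</a> <i>(20 replies)</i>'
)
titles = thread_titles(sample)
print(titles)  # ['Gears of War 2: Horde Gameplay']
```

Matching against the href attribute alone keeps the regex short and avoids depending on how the parser re-serializes each tag.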
