Regex question: need to match part of a URL

User

Reply 1 · edited 2024-04-26 14:32:44

import re
s = 'href="http://example.com/page/subpage/unik-id-12345">'
# The non-greedy group captures everything between href=" and the closing ">
res = re.search(r'href="(.+?)">', s).group(1)
print(res)
# Output: http://example.com/page/subpage/unik-id-12345

By the way, it is better to use a dedicated library, such as lxml, for parsing HTML.

User

Reply 2 · edited 2024-04-26 14:32:44

import re
# Greedy (.*) is fine for a single tag; prefer (.*?) if the string holds several
regex = re.compile(r'<href="(.*)">')
url = '<href="https://stackoverflow.com/">'
m = regex.search(url)

Then you can get the groups:

>>> m.group(0)
'<href="https://stackoverflow.com/">'
>>> m.group(1)
'https://stackoverflow.com/'

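PS: If you want to do web scraping, it is easier to use a library designed for that, such as BeautifulSoup. Tutorials on how to use it are easy to find online.

User

Reply 3 · edited 2024-04-26 14:32:44

Do you know regex101.com? It is a great tool for tuning regular expressions.

If I understand your question correctly, you are matching href="http://example.com/page/subpage/unik-id-12345"> and you only want to get http://example.com/page/subpage/unik-id-12345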

One way is to match only http(s)://, followed by anything that is not a double quote: http(s?):\/\/[^"]*
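A minimal sketch of that pattern in Python, assuming the URL is always wrapped in double quotes. The sample string and variable names here are made up for illustration. I wrote the optional "s" as https? instead of (s?) because re.findall returns only the captured group when a pattern contains one, and the slashes do not need escaping in Python.

```python
import re

# Hypothetical sample input, modelled on the URL from the question.
html = '<a href="http://example.com/page/subpage/unik-id-12345">link</a>'

# "https?" makes the "s" optional without a capturing group, so findall
# returns the whole match; [^"]* stops at the closing double quote.
url_pattern = re.compile(r'https?://[^"]*')

urls = url_pattern.findall(html)
print(urls)
# Output: ['http://example.com/page/subpage/unik-id-12345']
```

Note that this picks up any quoted URL in the text, not only those inside href.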

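If you have several links and only need the ones inside href attributes, you can match the whole href="..." with a regex and then use one more operation to extract the URL (for example match.split("\"")[1]).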
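Roughly what I mean by the two-step approach, as a sketch only. The input string is invented, and it assumes every href value is double-quoted.

```python
import re

# Hypothetical input with two links and one non-href URL.
html = (
    '<a href="http://example.com/page/one">one</a> '
    '<img src="http://example.com/logo.png"> '
    '<a href="https://example.com/page/two">two</a>'
)

# Step 1: match each whole href="..." attribute (src="..." is skipped).
href_matches = re.findall(r'href="[^"]*"', html)

# Step 2: split each match on the double quote; index 1 is the URL.
hrefs = [match.split('"')[1] for match in href_matches]
print(hrefs)
# Output: ['http://example.com/page/one', 'https://example.com/page/two']
```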

Or you can use an HTML parser such as BeautifulSoup.
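BeautifulSoup is a third-party package, so here is a sketch of the same idea using only html.parser from the standard library instead. This is a swapped-in technique, not BeautifulSoup itself. The class name HrefCollector and the sample markup are my own, and it only looks at <a> tags.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


collector = HrefCollector()
collector.feed('<a href="http://example.com/page/subpage/unik-id-12345">link</a>')
print(collector.hrefs)
# Output: ['http://example.com/page/subpage/unik-id-12345']
```

A parser copes with single quotes, extra attributes and odd whitespace, which the regex versions above do not.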
