How to match the link in an image tag with a regular expression

Asked 2025-04-16 11:19

I am writing a regular-expression matching function in Python. My code is as follows:

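import re

def src_match(line, img):
    # Capture the value of src when it immediately follows "<img ".
    imgmatch = re.search(r'<img src="(?P<img>.*?)"', line)

    # Report the match only when the captured value equals the expected one.
    if imgmatch and imgmatch.group('img') == img:
        print('the match was:', imgmatch.group('img'))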

However, the code above does not seem to work for me at all. On the other hand, this code does work:

def href_match(line, url):
    # Same approach, capturing the href of an <a> tag instead.
    hrefmatch = re.search(r'<a href="(?P<url>.*?)"', line)

    if hrefmatch and hrefmatch.group('url') == url:
        print('the match was:', hrefmatch.group('url'))
    else:
        return None

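Can someone explain why this happens? Or should both of them work? For example, is there something special about the identifier in href_match()? You can assume that in both functions I pass in the line containing the string I am looking for, along with the string itself.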

Addendum: I should mention that I will definitely never encounter a tag like this:

<img width="200px" src="somefile.jpg">

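The reason is that I generate the HTML with a specific program that never produces such tags. This example should be treated as purely hypothetical, because I always get tags like this: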

<img src="somefile.jpg">

Addendum:

Here is an example of what I pass to the function; it did not match the input argument:

<p class="p1"><img src="myfile.anotherword.png" alt="beat-divisions.tiff"></p>

1 Answer

Rule #37: do not parse HTML with regular expressions.

Use the right tool for the job; in this case I recommend BeautifulSoup.
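As a minimal sketch of the parser-based approach, the snippet below collects the src of every <img> tag regardless of attribute order. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without installing anything; the names ImgSrcCollector and img_sources are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so attribute order is irrelevant.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)

    # The default handle_startendtag calls handle_starttag,
    # so self-closing tags such as <img ... /> are covered too.


def img_sources(html):
    """Return the src values of all <img> tags found in html."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


print(img_sources('<p class="p1"><img src="myfile.anotherword.png" alt="beat-divisions.tiff"></p>'))
print(img_sources('<img width="200px" src="Y U NO C ME!!" />'))
```

With BeautifulSoup installed, roughly the same result should be obtainable with [img.get("src") for img in BeautifulSoup(html, "html.parser").find_all("img")].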

Edit:

Copying and pasting this function directly and testing it,

>>> src_match('this is <img src="my example" />','my example')
the match was: my example

it appears to work; however, it will fail on some (perfectly valid) HTML, such as

<img width="200px" src="Y U NO C ME!!" />
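To make that failure concrete, here is a small sketch comparing the pattern from the question with a more tolerant one. TOLERANT is a hypothetical pattern written for illustration, and it assumes double-quoted attribute values with no ">" inside any attribute before src; it is still a regex and remains fragile on real-world HTML.

```python
import re

# The pattern from the question: src must directly follow "<img ".
STRICT = re.compile(r'<img src="(?P<img>.*?)"')

# Hypothetical, more tolerant pattern: allows other attributes before src.
# Assumption: attribute values are double-quoted and contain no ">".
TOLERANT = re.compile(r'<img\b[^>]*?\ssrc="(?P<img>[^"]*)"')

line = '<img width="200px" src="Y U NO C ME!!" />'

strict_match = STRICT.search(line)
tolerant_match = TOLERANT.search(line)

print(strict_match)                  # None: "width" sits between "<img " and src
print(tolerant_match.group('img'))   # Y U NO C ME!!
```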

Edit 4:

>>> src_match('<p class="p1"><img src="myfile.png" alt="beat-divisions.tiff"></p>','myfile.png')
the match was: myfile.png
>>> src_match('<p class="p1"><img src="myfile.anotherword.png" alt="beat-divisions.tiff"</p>\n','myfile.anotherword.png')
the match was: myfile.anotherword.png

It still works; are you sure the URL value you are trying to match is correct?
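One way to check this is to print both values with repr(), which exposes stray whitespace or newlines that are invisible in ordinary output. The helper below is a hypothetical diagnostic sketch (explain_src_match is not part of the original code), and it assumes the same pattern as in the question.

```python
import re

IMG_SRC = re.compile(r'<img src="(?P<img>.*?)"')


def explain_src_match(line, img):
    """Describe why the comparison in src_match would pass or fail."""
    m = IMG_SRC.search(line)
    if m is None:
        return 'no <img src="..."> found in line'
    found = m.group('img')
    if found == img:
        return 'match: %r' % found
    # %r uses repr(), which exposes trailing newlines, spaces, and similar differences.
    return 'mismatch: found %r, expected %r' % (found, img)


line = '<p class="p1"><img src="myfile.anotherword.png" alt="beat-divisions.tiff"></p>'
print(explain_src_match(line, 'myfile.anotherword.png'))
print(explain_src_match(line, 'myfile.anotherword.png\n'))
```

If the second kind of message appears in your own run, the expected value passed to the function differs from the captured one, for example because it was read from a file and still carries its line ending.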

Answered 2025-04-16 by Python大师
