Python regex: matching a regular expression within an already matched element

<table id="test_table">
    <td>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
    </td>
</table>
<table id="test_table2">
    <td>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
    </td>
</table>

3 answers

User

Answer 1 · edited 2024-06-17 10:40:37

Also take a look at PyQuery. If you already know jQuery, its selector syntax will feel familiar:

>>> from pyquery import PyQuery as pq
>>> html = '''<table id="test_table">
...     <td>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...     </td>
... </table>
... <table id="test_table2">
...     <td>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...     </td>
... </table>'''
>>> d = pq(html)
>>> for a in d('#test_table').find('a'):
...     print(a.attrib.items())
...
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]

User

Answer 2 · edited 2024-06-17 10:40:37

Don't use regex to parse HTML; use lxml instead.

An example using IPython (test is your file):

In [55]: import lxml.html

In [56]: x = lxml.html.fromstring(open("test").read())

In [57]: for i in x.iterlinks():
   ....:     print(i)  # print ALL links
   ....:     
(<Element a at 0x1bb7110>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1bb7110>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)

In [58]: path = x.xpath("./table[@id='test_table']")[0]

In [59]: for i in path.iterlinks():
   ....:     print(i)
   ....:     
(<Element a at 0x1bb7110>, 'href', '#', 0)
(<Element a at 0x1bb7050>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1bb7050>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1bb7050>, 'href', '#', 0)

Using XPath makes this kind of thing much simpler, with fewer headaches and less coffee ;)
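To show the same "find the table by id, then walk its links" idea without installing anything, here is a minimal sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset. This is a swapped-in technique, not lxml: it only works here because the sample markup happens to be well-formed XML, and real-world HTML usually needs lxml or another HTML parser. The <root> wrapper, the SAMPLE string (shortened to two links per table), and the links_in_table helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, shortened to two links per table.
# Wrapped in a <root> element (an assumption) so it parses as one tree.
SAMPLE = """<root>
<table id="test_table"><td><a href="#">#</a><a href="#">#</a></td></table>
<table id="test_table2"><td><a href="#">#33</a><a href="#">#33</a></td></table>
</root>"""


def links_in_table(markup, table_id):
    """Return (href, text) pairs for every <a> inside the table with this id."""
    root = ET.fromstring(markup)
    # ElementTree understands simple predicates such as [@attr='value'].
    table = root.find(".//table[@id='%s']" % table_id)
    if table is None:
        return []
    # iter("a") walks all descendant <a> elements of the matched table only.
    return [(a.get("href"), a.text) for a in table.iter("a")]


print(links_in_table(SAMPLE, "test_table"))   # [('#', '#'), ('#', '#')]
print(links_in_table(SAMPLE, "test_table2"))  # [('#', '#33'), ('#', '#33')]
```

The point is the two-step scoping: select the container first, then search only inside it, which is what the question was trying to do with nested regexes.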

User

Answer 3 · edited 2024-06-17 10:40:37

Use the right tool for HTML. Switch to an HTML parser such as BeautifulSoup:

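from bs4 import BeautifulSoup

# html is the markup string from the question; name the parser explicitly
soup = BeautifulSoup(html, 'html.parser')

# Find the target table first, then search only inside it
table = soup.find('table', id='test_table')
for anchor in table.find_all('a'):
    print(anchor['href'], anchor.string)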

Don't use regular expressions for this. Matching HTML with them gets too complicated, too fast. Just don't do it.
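If installing BeautifulSoup is not an option, the same "parser, not regex" approach can be sketched with the standard library's html.parser. This is only an illustration under stated assumptions: the TableLinkCollector class and its attributes are hypothetical names, the sample string is shortened to one link per table, and the sketch assumes <table> and <a> tags are properly closed.

```python
from html.parser import HTMLParser


class TableLinkCollector(HTMLParser):
    """Collect (href, text) for <a> tags inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0        # table nesting depth, counted from the target table
        self.current = None   # [href, text parts] of the <a> being read
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            # Start counting at the target table; count nested tables too.
            if self.depth or attrs.get("id") == self.table_id:
                self.depth += 1
        elif tag == "a" and self.depth:
            self.current = [attrs.get("href"), []]

    def handle_data(self, data):
        if self.current is not None:
            self.current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag == "a" and self.current is not None:
            href, parts = self.current
            self.links.append((href, "".join(parts)))
            self.current = None


html = ('<table id="test_table"><td><a href="#">#</a></td></table>'
        '<table id="test_table2"><td><a href="#">#33</a></td></table>')

collector = TableLinkCollector("test_table2")
collector.feed(html)
print(collector.links)  # [('#', '#33')]
```

Even this small version needs state for nesting and text accumulation, which is exactly the bookkeeping a regex cannot do reliably; a full parser library handles it for you.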
