mechanize - How do I select a link based on a neighboring label?

0 votes

1 answer

717 views

Asked 2025-04-17 20:52

I want to parse a list of search results and follow only the links that meet certain criteria.

Suppose the results have this structure:

<ul>
  <li>
    <div>
      <!-- in here there is a list of information such as: 
           height: xx  price: xx , and a link <a> to the page -->
    </div>
  </li>
  <li>
    <!-- next item -->
  .
  .
  .

I want to test each item in the list against a set of criteria (say, height > x and price < x) and follow its link if the item matches.

I need to find a tag that is a child of another tag (that is, find the first child tag of a given tag).

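I think the solution could look like one of the following, but I don't know which library or method to use:

1 - Use some library to parse the list into objects, so that I can do this: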

for item in list:
  if item['price'] < x:
    br.follow_link(item.link)

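2 - Search the HTML response for the first "price" text, parse the value, and check whether it meets the criteria. If it does, follow the link that sits next to that value in the HTML string (in my case the link comes before the information, so I need to pick the link that precedes the matching text).

I can think of some very crude, low-level ways to do this, but I'm wondering whether there is a library or a mechanize method I could use instead. Thanks.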


1 Answer

You can use the BeautifulSoup library, which parses HTML into a tree of tags that you can search. Below is a rough outline of the code for your case.

Suppose your HTML content is:

<ul>
  <li>
    <div>
        height: 10  price: 20
        <a href="google.com">
    </div>
  </li>
  <li>
    <div>
        height: 30  price: 40
        <a href="facebook.com">
    </div>
  </li>
  <li>
    <div>
        height: 50  price: 60
        <a href="stackoverflow.com">
    </div>
  </li>
</ul>

Then the code to parse it would be:

from bs4 import BeautifulSoup

# Read the input file. I am assuming the above HTML is saved as test.html
with open('test.html', 'r') as htmlfile:
    html = htmlfile.read()

# Name the parser explicitly so the result does not depend on what is installed
bs = BeautifulSoup(html, 'html.parser')
links_to_follow = []

ul = bs.find('ul')
for li in ul.find_all('li'):
    # The div text looks like "height: 10  price: 20", so after split()
    # the two values sit at indexes 1 and 3
    fields = li.find('div').get_text().split()
    height = int(fields[1])
    price = int(fields[3])
    if height > 10 and price > 20:  # I am assuming this to be the criteria
        links_to_follow.append(li.find('a').get('href'))

# Print one link per line
for link in links_to_follow:
    print(link)

This prints:

facebook.com
stackoverflow.com
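
The positional indexes above break as soon as a field is added or reordered. As a minimal sketch of the `item['price']` style access the question asks for, the text of each item could be parsed by label instead. The helper names `parse_fields` and `matches` are hypothetical, and the sketch assumes every value is a plain integer written as `label: value`:

```python
import re

# Hypothetical helpers, not part of the original answer.
# Matches pairs such as "height: 10" or "price:20" (integer values only).
FIELD_RE = re.compile(r"(\w+)\s*:\s*(\d+)")


def parse_fields(text):
    """Return a dict mapping each label found in text to its integer value."""
    return {label.lower(): int(value) for label, value in FIELD_RE.findall(text)}


def matches(fields, min_height, max_price):
    """Apply the question's example criteria: height > x and price < x."""
    # An item with a missing field is rejected instead of raising KeyError.
    if "height" not in fields or "price" not in fields:
        return False
    return fields["height"] > min_height and fields["price"] < max_price


if __name__ == "__main__":
    samples = [
        "height: 10  price: 20",
        "height: 30  price: 40",
        "height: 50  price: 60",
    ]
    for text in samples:
        fields = parse_fields(text)
        print(fields, matches(fields, min_height=10, max_price=50))
```

Inside the loop, `parse_fields(li.find('div').get_text())` would take the place of the two `split()` lookups. Each collected `href` could then be handed to mechanize, for example with `br.follow_link(url=href)`, assuming `br` is the mechanize `Browser` that loaded the results page.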

Answered 2025-04-17 by Python大师
