Storing and reusing BeautifulSoup searches and node traversals

1 vote

1 answer

542 views

Asked 2025-04-17 17:58

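I'm building a tool that extracts similar data (such as titles and dates) from many websites with different formats, and BeautifulSoup has been a great help. However, I haven't found a good way to store the BeautifulSoup calls I use, so that I don't have to write a new function for every site. Here is an example: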

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h4", "title").text    # extract title
date = soup.find("li", "when").em.text   # extract date

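The nodes to parse are different for every site. With hundreds of sites, writing a unique function for each one seems silly. Is there a way to store the soup.find('x')... calls in a table alongside the URL, and then apply the right BeautifulSoup calls from a single function? I hope that makes sense.

Thanks!

data-storage data-extraction web-scraping beautifulsoup web-parsing data-formatting function-reuse node-traversal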

1 Answer

Hmm, assuming I've understood your post, would something like this work?

linkInstructions = {
  "url1": {
    "title": lambda n: n.find('h4', 'title').text,
    "date": lambda n: n.find('li', 'when').em.text
  },
  "url2": {
    "title": lambda n: n.find('h3', 'title').text,
    "date": lambda n: n.find('li', 'when').strong.text
  }
  # and so forth
} 

def parseNode(node, url):
  # 'node' is the result of BeautifulSoup(html)
  # 'url' is the key of the site in linkInstructions

  result = {}

  # Python 3: dict.items() replaces Python 2's iteritems()
  for key, func in linkInstructions[url].items():
    result[key] = func(node)

  # returns a dict with the structure {'title': <title>, 'date': <date>}
  return result
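
To show how a table like this might be used end to end, here is a minimal, self-contained sketch. It is only an illustration under a few assumptions that are not in the original answer: the table is keyed by hostname (looked up with urlparse) rather than by the full URL, the hostnames and sample HTML are made up, and a missing node is reported as None instead of raising.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical table keyed by hostname; hostnames and tag/class names are made up.
link_instructions = {
    "site-a.example": {
        "title": lambda n: n.find("h4", "title").text,
        "date": lambda n: n.find("li", "when").em.text,
    },
    "site-b.example": {
        "title": lambda n: n.find("h3", "title").text,
        "date": lambda n: n.find("li", "when").strong.text,
    },
}


def parse_page(html, url):
    """Apply the extractors registered for the URL's hostname."""
    soup = BeautifulSoup(html, "html.parser")
    extractors = link_instructions[urlparse(url).hostname]
    result = {}
    for key, func in extractors.items():
        try:
            result[key] = func(soup).strip()
        except AttributeError:
            # find() returned None, or a child tag was missing, for this field
            result[key] = None
    return result


sample_html = """
<h4 class="title"> Spring Fair </h4>
<ul><li class="when">On <em>2012-04-01</em></li></ul>
"""
parsed = parse_page(sample_html, "http://site-a.example/events/1")
print(parsed)
```

Keying by hostname means every page of a site shares one entry. Catching AttributeError keeps one changed layout from aborting the whole run, though it would also hide a typo inside a lambda, so logging the failure may be worth adding.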

Edit: Oops, I had used enumerate incorrectly.
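
Since the question mentions keeping the calls in a table next to the URL, one possible variation (a different technique from the lambdas above, not what the answer itself does) is to store CSS selector strings and run them with select_one. Plain strings can live in a JSON file or a database column, which lambdas cannot. The site names, selectors and sample HTML below are hypothetical.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical rules stored as plain strings, so they could sit in a JSON file
# or a database column next to each site's URL.
RULES_JSON = """
{
  "site-a.example": {"title": "h4.title", "date": "li.when em"},
  "site-b.example": {"title": "h3.title", "date": "li.when strong"}
}
"""
rules = json.loads(RULES_JSON)


def extract(html, site):
    """Run each stored CSS selector for `site` and collect the matched text."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for field, selector in rules[site].items():
        node = soup.select_one(selector)  # None when nothing matches
        result[field] = node.get_text(strip=True) if node is not None else None
    return result


sample_html = (
    '<h3 class="title">Autumn Meetup</h3>'
    '<ul><li class="when">On <strong>2012-10-05</strong></li></ul>'
)
record = extract(sample_html, "site-b.example")
print(record)
```

The trade-off is that selectors only cover what CSS can express; anything needing extra logic (say, reformatting a date) would still need a function per field.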

Answered 2025-04-17 by Python大师
