BeautifulSoup: getting a value from a table

Asked 2025-04-15 16:30

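I'm trying to scrape data from this site: http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104, and I want to extract the "Owner Name".

My current approach works, but it looks awful and is surely not the best way to do it, so I'm looking for a better solution.

Here is my current code: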

soup = BeautifulSoup(url_opener.open(url))            
x = soup('table', text = re.compile("Owner Name"))
print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

3 Answers

This is only a small improvement, and I still haven't figured out how to get rid of the three parent lookups.

x[0].parent.parent.parent.findAll('td')[1].string
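One possible way to drop the chained .parent calls is find_parent, sketched below. This assumes BeautifulSoup 4 (the bs4 package) on Python 3 rather than the BeautifulSoup 3 API used in this thread, and SAMPLE_HTML is a made-up stand-in, since the real page markup is not shown here.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the county page (assumption).
SAMPLE_HTML = """
<table>
  <tr><td>Owner Name(s)</td></tr>
  <tr><td>DOE JOHN</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the text node holding the label, then jump straight to its
# enclosing table instead of chaining .parent three times.
label = soup.find(string=re.compile("Owner Name"))
table = label.find_parent("table")

# Same indexing as the one-liner above: the second cell of the table.
owner = table.find_all("td")[1].get_text(strip=True)
print("And the owner is", owner)
```

This still relies on the owner being in the second cell, so it is only as robust as that assumption.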

Answered 2025-04-15 by Python大师

This is Aaron DeVore's answer from the BeautifulSoup discussion group; it worked well for me.

soup = BeautifulSoup(...)
label = soup.find(text="Owner Name(s)")

You need Tag.string to get the actual name string.

name = label.findNext('td').string

If you have many of these to handle, a list comprehension keeps it short.

names = [unicode(label.findNext('td').string)
         for label in soup.findAll(text="Owner Name(s)")]
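For anyone on a newer setup, here is a rough equivalent of the snippets above. It assumes BeautifulSoup 4 on Python 3, where findNext and findAll are spelled find_next and find_all and the text filter is passed as string=. The two-table SAMPLE_HTML is invented for illustration and may not match the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two records (assumption, not the real page).
SAMPLE_HTML = """
<table>
  <tr><td>Owner Name(s)</td></tr>
  <tr><td>DOE JOHN</td></tr>
</table>
<table>
  <tr><td>Owner Name(s)</td></tr>
  <tr><td>ROE JANE</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Single lookup: find the label text, then the next <td> in document order.
label = soup.find(string="Owner Name(s)")
name = label.find_next("td").get_text(strip=True)

# Many lookups: the same step applied to every matching label.
names = [lab.find_next("td").get_text(strip=True)
         for lab in soup.find_all(string="Owner Name(s)")]

print(name)
print(names)
```

Note that string="Owner Name(s)" is an exact match, so a label with surrounding whitespace would need a regex instead, and find_next returns None when no later td exists.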

Answered 2025-04-15 by Python大师

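(Edit: it looks like the HTML the original poster gave was inaccurate. There is actually no tbody tag, even though he specifically mentioned one in the HTML. So I suggest using table instead of tbody.)

Since you may want several table rows (for example, look at the sibling of the link you gave, changing the final digit 4 to 5), I suggest a loop like the following: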

# locate the table containing a cell with the given text
owner = re.compile('Owner Name')
cell = soup.find(text=owner).parent
while cell.name != 'table':
    cell = cell.parent
# print all non-empty strings in the table (except for the given text)
for x in cell.findAll(text=lambda x: x.strip() and not owner.match(x)):
    print x

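This approach is fairly robust to small changes in the page structure: after finding the cell of interest, it walks up through the parents until it reaches the table tag, then iterates over every navigable string in that table that is not empty (or whitespace only), excluding the owner heading.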
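The same walk-up-then-iterate idea can be tried without the live page. The sketch below assumes BeautifulSoup 4 on Python 3 (find_all with string= in place of findAll with text=), and the three-row SAMPLE_HTML is invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with two owner rows (assumption, not the real page).
SAMPLE_HTML = """
<table>
  <tr><td>Owner Name(s)</td></tr>
  <tr><td>DOE JOHN</td></tr>
  <tr><td>DOE MARY</td></tr>
</table>
"""

owner = re.compile("Owner Name")
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Start at the element holding the label and climb until we hit the table.
node = soup.find(string=owner).parent
while node is not None and node.name != "table":
    node = node.parent


def wanted(s):
    """Keep strings that are not blank and are not the label itself."""
    return bool(s.strip()) and not owner.match(s)


# Collect every remaining string inside that table.
values = [s.strip() for s in node.find_all(string=wanted)]
print(values)
```

The extra "node is not None" check only stops the climb at the document root; if the label is missing entirely, soup.find returns None and the lookup would still need its own guard.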

Answered 2025-04-15 by Python大师
