Beautiful Soup: extracting parent/sibling tr rows by class

0 votes
2 answers
1228 views
Asked 2025-04-20 01:00

I'm trying to learn the bs4 library, but I'm having trouble extracting some information from the following HTML:

<table border="1" cellspacing="0" class="browser">
<thead>..</thead>
<tbody class="body">
<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
</tbody>
</table>

What I want is each date row together with the right rows that follow it, up to the next date row, like this:

<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>

and

<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>

This is what I tried:

xx = soup.find_all('tbody',{'class':'body'})

To get the corresponding right rows, I did this:

yy = []
for i in xx:
    yy.append( i.find_all('tr',{'class':'right'}) )

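...but this gives me every right row in one flat list. What I want is to find the "parent" date row for each element of yy. In short, I want each right row to be associated with the date row it belongs to.

Apologies if the question sounds a bit confusing!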

2 Answers

1

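You need to iterate over the child elements of the tbody tag. This should do it:

# Assumes `soup` is the BeautifulSoup object built from the HTML above.
# Keep only the tags, dropping the whitespace strings between rows
tags = [child for child in soup.tbody.contents if child.name]
collected_tags = []
latest_date = None
for tag in tags:
    classes = tag.get('class', [])
    if 'date' in classes:
        # Start a new group keyed by this date row
        date_map = {tag: []}
        collected_tags.append(date_map)
        latest_date = tag
        continue
    if collected_tags and 'right' in classes:
        # Attach the row to the most recent date row
        collected_tags[-1][latest_date].append(tag)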


collected_tags is now a list of dicts, each mapping a date tag to the list of right tags that follow it.
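As a self-contained illustration of the same grouping idea, here is a minimal sketch that can be run on its own. It stores each group as a `(date_row, right_rows)` pair rather than a one-key dict, which is a variation and not the code above. The sample cell text (`day 1`, `a`, ...) and the helper name `group_rows` are made up for the example.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; the cell text is invented for illustration.
html = """
<table border="1" cellspacing="0" class="browser">
<tbody class="body">
<tr class="date"><td>day 1</td></tr>
<tr class="right"><td>a</td></tr>
<tr class="right"><td>b</td></tr>
<tr class="date"><td>day 2</td></tr>
<tr class="right"><td>c</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def group_rows(tbody):
    """Return a list of (date_row, [right_rows]) pairs in document order."""
    groups = []
    # recursive=False limits the search to direct <tr> children of the tbody
    for row in tbody.find_all("tr", recursive=False):
        classes = row.get("class", [])
        if "date" in classes:
            # A date row opens a new group
            groups.append((row, []))
        elif "right" in classes and groups:
            # A right row joins the most recently opened group
            groups[-1][1].append(row)
    return groups


groups = group_rows(soup.find("tbody", class_="body"))
for date_row, right_rows in groups:
    print(date_row.get_text(strip=True),
          [r.get_text(strip=True) for r in right_rows])
```

Pairs keep the groups in document order and make it easy to unpack each date row with its rows in a single `for` loop.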

0

You can loop over next_siblings of each date row until you reach a row that does not have the class right, such as the next date row:

for date_row in soup.select('table tbody.body tr.date'):
    for elem in date_row.next_siblings:
        if not elem.name:
            # NavigableString (text) element between rows
            continue
        if 'right' not in elem.get('class', []):
            # all done, found a row that doesn't have class="right"
            break

You can collect those elements into a list, or process them directly in the loop.

Demo:
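For the "collect into a list" option, a minimal sketch could look like the following. The helper name `rows_after` and the sample cell text are assumptions made for the example; the sibling walk itself follows the loop shown above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample markup modeled on the question; the cell text is invented for illustration.
html = """
<table border="1" cellspacing="0" class="browser">
<tbody class="body">
<tr class="date"><td>day 1</td></tr>
<tr class="right"><td>a</td></tr>
<tr class="right"><td>b</td></tr>
<tr class="date"><td>day 2</td></tr>
<tr class="right"><td>c</td></tr>
</tbody>
</table>
"""


def rows_after(date_row):
    """Collect the consecutive class="right" rows that follow date_row."""
    rows = []
    for elem in date_row.next_siblings:
        if not isinstance(elem, Tag):
            # Text node (whitespace) between rows
            continue
        if "right" not in elem.get("class", []):
            # Reached a row without class="right": this group is complete
            break
        rows.append(elem)
    return rows


soup = BeautifulSoup(html, "html.parser")
grouped = [
    (date_row, rows_after(date_row))
    for date_row in soup.select("table tbody.body tr.date")
]
for date_row, right_rows in grouped:
    print(date_row.get_text(strip=True), len(right_rows))
```

Each entry of `grouped` pairs a date row with its own list, so the rows can be processed later instead of inside the sibling loop.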

>>> for date_row in soup.select('table tbody.body tr.date'):
...     print('Found a date row', date_row)
...     for elem in date_row.next_siblings:
...         if not elem.name:
...             # NavigableString (text) element between rows
...             continue
...         if 'right' not in elem.get('class', []):
...             # all done, found a row that doesn't have class="right"
...             break
...         print('Right row grouped with the date', elem)
...     print()
... 
Found a date row <tr class="date">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>

Found a date row <tr class="date">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>
