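Beautiful Soup: extracting tr rows grouped by their parent/sibling class
I'm trying to learn the bs4 library, but I'm having some trouble extracting information from the following HTML: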
<table border="1" cellspacing="0" class="browser">
<thead>..</thead>
<tbody class="body">
<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
</tbody>
</table>
What I want is the content between two date rows (the rows with class right), like this:
<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
and:
<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
I tried this:
xx = soup.find_all('tbody',{'class':'body'})
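To get the corresponding right rows, I did this:
yy = []
for i in xx:
    yy.append(i.find_all('tr', {'class': 'right'}))
...but this gives me all of the right rows at once. What I want is to find the parent date row for each element of yy. Put simply, I want each right row to be associated with the date row it belongs to.
Apologies in advance if the question sounds a bit confusing!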
2 Answers
1
You need to iterate over the children of the tbody tag. This should do it:
# Keep only the tags, skipping the newline strings between rows
tags = filter(lambda x: x != '\n', soup.tbody.contents)
collected_tags = []
latest_date = None
for tag in tags:
    if tag['class'] == ['date']:
        # Start a new group keyed by this date row
        date_map = {tag: []}
        collected_tags.append(date_map)
        latest_date = tag
        continue
    if collected_tags and tag['class'] == ['right']:
        # Attach the row to the most recent date row
        collected_tags[-1][latest_date].append(tag)
collected_tags is now a list of dictionaries, each mapping a date tag to the list of right tags that follow it.
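For a version that can be run end to end, here is a minimal sketch of the same single-pass idea. The sample HTML and its cell contents are invented for illustration, and group_by_date is a hypothetical helper name. It stores (date_row, rows) pairs instead of dictionaries keyed by tags, and uses find_all(..., recursive=False) so that only the direct tr children of tbody are visited.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: the cell contents are made up for this sketch.
HTML = """
<table class="browser">
<tbody class="body">
<tr class="date"><td>2020-01-01</td></tr>
<tr class="right"><td>a</td></tr>
<tr class="right"><td>b</td></tr>
<tr class="date"><td>2020-01-02</td></tr>
<tr class="right"><td>c</td></tr>
</tbody>
</table>
"""


def group_by_date(tbody):
    """Return a list of (date_row, [right_rows]) pairs in document order."""
    groups = []
    # recursive=False limits the search to direct children of tbody,
    # which also skips the whitespace strings between rows.
    for row in tbody.find_all("tr", recursive=False):
        classes = row.get("class", [])
        if "date" in classes:
            groups.append((row, []))
        elif "right" in classes and groups:
            # Attach the row to the most recently seen date row.
            groups[-1][1].append(row)
    return groups


soup = BeautifulSoup(HTML, "html.parser")
groups = group_by_date(soup.find("tbody", class_="body"))
summary = [
    (date.get_text(strip=True), [r.get_text(strip=True) for r in rows])
    for date, rows in groups
]
print(summary)
```

Using row.get("class", []) rather than row["class"] means a tr without a class attribute is skipped instead of raising a KeyError.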
0
You can loop over next_siblings and stop as soon as you reach an element that is not a right row, such as the next date row:
for date_row in soup.select('table tbody.body tr.date'):
    for elem in date_row.next_siblings:
        if not elem.name:
            # NavigableString (text) element between rows
            continue
        if 'right' not in elem.get('class', []):
            # all done, found a row that doesn't have class="right"
            break
You can collect these elements into a list, or process them directly inside the loop.
Example:
>>> for date_row in soup.select('table tbody.body tr.date'):
...     print('Found a date row', date_row)
...     for elem in date_row.next_siblings:
...         if not elem.name:
...             # NavigableString (text) element between rows
...             continue
...         if 'right' not in elem.get('class', []):
...             # all done, found a row that doesn't have class="right"
...             break
...         print('Right row grouped with the date', elem)
...     print()
...
Found a date row <tr class="date">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>

Found a date row <tr class="date">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>
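Collecting the rows into a list, as mentioned above, might look like the following sketch. The sample HTML is invented for illustration and rights_after is a hypothetical helper name. It tests isinstance(elem, Tag) to skip the text nodes between rows, which is a stand-in for the elem.name check used in the loop above.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample data: the cell contents are made up for this sketch.
HTML = """
<table class="browser">
<tbody class="body">
<tr class="date"><td>2020-01-01</td></tr>
<tr class="right"><td>a</td></tr>
<tr class="right"><td>b</td></tr>
<tr class="date"><td>2020-01-02</td></tr>
<tr class="right"><td>c</td></tr>
</tbody>
</table>
"""


def rights_after(date_row):
    """Collect the class="right" rows that directly follow date_row."""
    rows = []
    for elem in date_row.next_siblings:
        if not isinstance(elem, Tag):
            # Text node (whitespace) between rows
            continue
        if "right" not in elem.get("class", []):
            # Reached the next date row (or any other row): stop here
            break
        rows.append(elem)
    return rows


soup = BeautifulSoup(HTML, "html.parser")
grouped = {
    date_row.get_text(strip=True): [
        row.get_text(strip=True) for row in rights_after(date_row)
    ]
    for date_row in soup.select("table tbody.body tr.date")
}
print(grouped)
```

Keying the result by the date text assumes each date appears only once in the table; if dates can repeat, a list of (date_row, rows) pairs would be the safer container.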