Traversing an HTML tree in Python

Asked 2025-04-17 23:38
<td id="aisd_calendar-2014-04-28-0" class="single-day future" colspan="1" rowspan="1" date="**2014-04-28**" >
  <div class="inner">
    <div class="item">
  <div class="view-item view-item-aisd_calendar">
  <div class="calendar monthview">
        <div class="calendar.4168.field_date.8.0 contents">
                      <a href="/event/2013/regular-board-meeting">**Regular Board Meeting**</a>                      <span class="date-display-single">7:00 pm</span>          </div>  
        <div class="cutoff">&nbsp;</div>
      </div> 
  </div>   
</div>  </div>
</td>

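I have the HTML above. I want to extract the value of the "date" attribute (2014-04-28) and the "a href" link (Regular Board Meeting) from it. How can I do this in Python? Can Beautiful Soup be used?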

1 Answer

Here is how to do it with BeautifulSoup:

from bs4 import BeautifulSoup


data = """
<html>
    <body>
        <td id="aisd_calendar-2014-04-28-0" class="single-day future" colspan="1" rowspan="1" date="**2014-04-28**" >
          <div class="inner">
            <div class="item">
          <div class="view-item view-item-aisd_calendar">
          <div class="calendar monthview">
                <div class="calendar.4168.field_date.8.0 contents">
                              <a href="/event/2013/regular-board-meeting">**Regular Board Meeting**</a>                      <span class="date-display-single">7:00 pm</span>          </div>
                <div class="cutoff">&nbsp;</div>
              </div>
          </div>
        </div>  </div>
        </td>
    </body>
</html>
"""
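# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, 'html.parser')

# The date lives in the <td>'s "date" attribute; strip the literal asterisks around it.
td = soup.body.td  # or soup.find('td', id='aisd_calendar-2014-04-28-0')
print(td['date'].strip('*'))

# class is a multi-valued attribute, so 'contents' matches the div that wraps the link.
link = soup.find('div', {'class': 'contents'}).a
print(link['href'])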

Output:

2014-04-28
/event/2013/regular-board-meeting
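
The snippet above prints the link's href, while the question also asks for the link text (Regular Board Meeting). A minimal sketch of one way to collect the date, text and href together for every calendar cell follows. It assumes the markup follows the pattern in the question; the function name extract_events, the table wrapper and the second, empty cell are made up for illustration.

```python
from bs4 import BeautifulSoup

# First cell is trimmed from the question (asterisks kept as posted);
# the second, empty cell is hypothetical and only shows the "no event" case.
html = """
<table><tr>
<td class="single-day future" date="**2014-04-28**">
  <div class="calendar.4168.field_date.8.0 contents">
    <a href="/event/2013/regular-board-meeting">**Regular Board Meeting**</a>
    <span class="date-display-single">7:00 pm</span>
  </div>
</td>
<td class="single-day future" date="**2014-04-29**"></td>
</tr></table>
"""


def extract_events(markup):
    """Return one dict per <td> that has a date attribute and contains a link."""
    soup = BeautifulSoup(markup, "html.parser")
    events = []
    # attrs={"date": True} matches any <td> carrying a date attribute, whatever its value.
    for td in soup.find_all("td", attrs={"date": True}):
        link = td.find("a")
        if link is None:
            continue  # cell without an event link
        events.append({
            "date": td["date"].strip("*"),
            # get_text() returns the text inside the tag, not the href.
            "title": link.get_text(strip=True).strip("*"),
            "href": link["href"],
        })
    return events


events = extract_events(html)
print(events)
```

For a single cell, link.get_text(strip=True).strip('*') on the link found earlier is enough; the loop only matters when the page holds a whole month of cells.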

Also, if you need to convert the date into a Python datetime object, you can use the strptime() method:

from datetime import datetime

...

datetime.strptime(td['date'].strip('*'), '%Y-%m-%d')
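
As a self-contained sketch of that conversion, the attribute value is hard-coded below as raw_date (a hypothetical stand-in for td['date']), so it runs without the parsing step:

```python
from datetime import datetime

# Hypothetical stand-in for td['date'] from the example above.
raw_date = "**2014-04-28**"

# Strip the literal asterisks first; strptime raises ValueError if the
# string does not match the format exactly.
parsed = datetime.strptime(raw_date.strip("*"), "%Y-%m-%d")

# strptime returns a datetime with the time set to midnight;
# .date() keeps only the calendar date.
event_date = parsed.date()
print(parsed)
print(event_date.isoformat())
```

If the attribute might be malformed on some cells, wrapping the strptime() call in try/except ValueError is one way to skip those cells.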

Hope that helps.
