Extracting data from a web page with BS4 in Python

def extractData():
    lDateInfoMatchCase = False
    # lDateInfoMatchCase = []
    global gDict
    for row in table_for_players.findAll("tr"):
        for lDateRowIndex in row.findAll("th", {"colspan" : "4"}):
            ldateList.append(lDateRowIndex.text)
    print ldateList
    for index in ldateList:
        #print index
        lPreviewLinkList = []
        for row in table_for_players.findAll("tr"):
            for lDateRowIndex in row.findAll("th", {"colspan" : "4"}):
                if lDateRowIndex.text == index:
                    lDateInfoMatchCase = True
                else:
                    lDateInfoMatchCase = False
            if lDateInfoMatchCase == True:
                for lInfoRowIndex in row.findAll("td", {"class": "info"}):
                    for link in lInfoRowIndex.findAll("a", {"class" : "preview"}):
                        lPreviewLinkList.append("http://www.afl.com.au/" + link.get('href'))
        print lPreviewLinkList
        gDict[index] = lPreviewLinkList

1 answer

Answer 1 · posted 2024-05-16 21:09:07

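I would use CSS selectors instead. Select the first table, then all rows in its tbody for ease of processing; the rows are "grouped" by the tr th header rows. From each header row you can walk the following sibling rows that do not contain a th header and scan them for preview links: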

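# soup is assumed to be the BeautifulSoup object already parsed from the fixture page
# previews maps each date header to the list of preview URLs listed under it
previews = {}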

table = soup.select('table.fixture')[0]
for group_header in table.select('tbody tr th'):
    date = group_header.string
    for next_sibling in group_header.parent.find_next_siblings('tr'):
        if next_sibling.th:
            # found a next group, end scan
            break
        for preview in next_sibling.select('a.preview'):
            previews.setdefault(date, []).append(
                "http://www.afl.com.au" + preview.get('href'))

This produces a dictionary of lists; for the current version of the page it yields:

{u'Monday, June 09': ['http://www.afl.com.au/match-centre/2014/12/melb-v-coll'],
 u'Sunday, June 08': ['http://www.afl.com.au/match-centre/2014/12/gcfc-v-syd',
                      'http://www.afl.com.au/match-centre/2014/12/fre-v-adel',
                      'http://www.afl.com.au/match-centre/2014/12/nmfc-v-rich']}
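To try the grouping logic without fetching the live page, here is a minimal self-contained sketch in Python 3. The `SAMPLE_HTML` markup is a hypothetical, trimmed-down imitation of the fixture table (the real page will differ), and `collect_previews` and `BASE_URL` are names made up for this illustration. It also swaps the string concatenation for `urljoin`, which avoids a doubled or missing slash when building absolute URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.afl.com.au"  # assumed site root, taken from the URLs above

# Hypothetical sample markup; the explicit <tbody> matters because
# html.parser does not insert one automatically.
SAMPLE_HTML = """
<table class="fixture">
  <tbody>
    <tr><th colspan="4">Sunday, June 08</th></tr>
    <tr><td class="info"><a class="preview" href="/match-centre/2014/12/gcfc-v-syd">Preview</a></td></tr>
    <tr><td class="info"><a class="preview" href="/match-centre/2014/12/fre-v-adel">Preview</a></td></tr>
    <tr><th colspan="4">Monday, June 09</th></tr>
    <tr><td class="info"><a class="preview" href="/match-centre/2014/12/melb-v-coll">Preview</a></td></tr>
  </tbody>
</table>
"""


def collect_previews(html, base_url=BASE_URL):
    """Return {date header text: [absolute preview URLs]} for the first fixture table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.fixture")
    previews = {}
    for header in table.select("tbody tr th"):
        date = header.get_text(strip=True)
        # Walk the rows that follow this header row, up to the next header row.
        for row in header.parent.find_next_siblings("tr"):
            if row.th:
                break  # reached the next date group
            for link in row.select("a.preview"):
                url = urljoin(base_url, link.get("href"))
                previews.setdefault(date, []).append(url)
    return previews


previews = collect_previews(SAMPLE_HTML)
for date, links in previews.items():
    print(date, links)
```

A date header with no preview links under it simply does not appear in the resulting dictionary, because `setdefault` is only called when a link is found.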
