Python: appending to a DataFrame, strange for-loop behavior

Posted 2024-04-19 12:16:28

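I'm scraping some NFL statistics from the web; honestly, the activity itself doesn't matter. I've spent a lot of time debugging because I can't believe what the code is doing: either I'm going crazy, or there is some kind of bug in a package, or in Python itself. Here is the code I'm using: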

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import string
import numpy as np

#get player list
players = pd.DataFrame({"name":[],"url":[],"positions":[],"startYear":[],"endYear":[]})
letters = list(string.ascii_uppercase)
for letter in letters:
    print(letter)
    players_html = requests.get("https://www.pro-football-reference.com/players/"+letter+"/")
    soup = bs(players_html.content,"html.parser")
    for player in soup.find("div",{"id":"div_players"}).find_all("p"):
        temp_row = {}
        temp_row["url"] = "https://www.pro-football-reference.com"+player.find("a")["href"]
        temp_row["name"] = player.text.split("(")[0].strip()
        years = player.text.split(")")[1].strip()
        temp_row["startYear"] = int(years.split("-")[0])
        temp_row["endYear"] = int(years.split("-")[1])
        temp_row["positions"] = player.text.split("(")[1].split(")")[0]
        players = players.append(temp_row,ignore_index=True)
players = players[players.endYear > 2000]
players.reset_index(inplace=True,drop=True)

game_df = pd.DataFrame()
def apply_test(row):
    #print(row)
    url = row['url']
    #print(list(range(int(row['startYear']),int(row['endYear'])+1)))
    for yr in range(int(row['startYear']),int(row['endYear'])+1):
        print(yr)
        content = requests.get(url.split(".htm")[0]+"/gamelog/"+str(yr)).content
        soup = bs(content,'html.parser').find("div",{"id":"all_stats"})
        #overheader
        over_headers = []
        for over in soup.find("thead").find("tr").find_all("th"):
            if("colspan" in over.attrs.keys()):
                for i in range(0,int(over['colspan'])):
                    over_headers = over_headers + [over.text]
            else:
                over_headers = over_headers + [over.text]
        #headers
        headers = []
        for header in soup.find("thead").find_all("tr")[1].find_all("th"):
            headers = headers + [header.text]
        all_headers = [a+"___"+b for a,b in zip(over_headers,headers)]
        #remove first column, it's meaningless
        all_headers = all_headers[1:len(all_headers)]
        for row in soup.find("tbody").find_all("tr"):
            temp_row = {}
            for i,col in enumerate(row.find_all("td")):
                temp_row[all_headers[i]] = col.text
            game_df = game_df.append(temp_row,ignore_index=True)
players.apply(apply_test,axis=1)


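I can go into what I'm trying to do, but there seems to be a higher-level problem here. In the for loop, startYear and endYear are 2013 and 2014, so the loop should set yr to 2013 and then to 2014. But if you look at the output of print(yr), it prints 2013 twice. If you simply comment out the line game_df = game_df.append(temp_row,ignore_index=True), the printed values of yr are correct. An error occurs shortly after the first two lines of output, but that is expected and I'm happy to debug that one myself. What baffles me right now is that appending to a global DataFrame makes the for loop behave differently. Can anyone help?

Thanks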


1 answer
User
Answer 1 · Posted 2024-04-19 12:16:28

I don't fully understand what the overall goal is, but I noticed two things:

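  1. Inside apply_test, game_df is treated as a local variable because the function assigns to it. You either need to declare it with global game_df before the line game_df = game_df.append(temp_row,ignore_index=True), or, better, pass it in as an argument in the def signature, although you would then need to modify the call players.apply(apply_test,axis=1) accordingly.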

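  2. You need to handle the case where find returns None, for example in soup.find("thead").find_all("tr")[1].find_all("th") for the page https://www.pro-football-reference.com/players/A/AaitIs00/gamelog/2014. You could wrap it in a try/except block and provide a suitable default.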
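
To make both points concrete, here is a minimal sketch rather than a drop-in fix. It assumes a small inline HTML table (SAMPLE_HTML) in place of the real gamelog page, and the names broken_append, header_texts and collect_rows are made up for illustration. It shows why assigning to a module-level name inside a function fails, how to pass the accumulator in as an argument instead, and how to return a default when find gives back None.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a gamelog page; the real site's markup may differ.
SAMPLE_HTML = """
<table>
<thead>
<tr><th></th><th colspan="2">Passing</th></tr>
<tr><th>Rk</th><th>Cmp</th><th>Att</th></tr>
</thead>
<tbody>
<tr><th>1</th><td>20</td><td>30</td></tr>
</tbody>
</table>
"""

game_rows = []  # module-level accumulator, analogous to game_df in the question


def broken_append():
    # Assigning to game_rows makes it a local name for the whole function,
    # so reading it on the right-hand side raises UnboundLocalError.
    game_rows = game_rows + [{"yr": 2013}]


def header_texts(html):
    """Return the <th> texts of the second header row, or [] if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    thead = soup.find("thead")  # find() returns None when nothing matches
    if thead is None:
        return []
    header_rows = thead.find_all("tr")
    if len(header_rows) < 2:
        return []
    return [th.text for th in header_rows[1].find_all("th")]


def collect_rows(html, rows):
    """Append one dict per <tbody> row to the list passed in as an argument."""
    headers = header_texts(html)[1:]  # drop the first column, as in the question
    tbody = BeautifulSoup(html, "html.parser").find("tbody")
    if tbody is None:
        return rows  # nothing to add for pages without a stats table
    for tr in tbody.find_all("tr"):
        cells = [td.text for td in tr.find_all("td")]
        rows.append(dict(zip(headers, cells)))  # mutates rows, no rebinding
    return rows
```

Calling collect_rows(SAMPLE_HTML, []) returns one dict per table row, and a page without a table returns the list unchanged instead of raising AttributeError. The collected list can be turned into a DataFrame once at the end with pd.DataFrame(rows); this also avoids DataFrame.append, which was removed in pandas 2.0.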
