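I'm doing some web scraping of NFL statistics; honestly, the activity itself doesn't matter. I've spent a lot of time debugging because I can't believe what it's doing: either I'm going crazy, or there's some kind of bug in a package, or in Python itself. Here is the code I'm using: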
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import string
import numpy as np

#get player list
players = pd.DataFrame({"name":[],"url":[],"positions":[],"startYear":[],"endYear":[]})
letters = list(string.ascii_uppercase)
for letter in letters:
    print(letter)
    players_html = requests.get("https://www.pro-football-reference.com/players/"+letter+"/")
    soup = bs(players_html.content,"html.parser")
    for player in soup.find("div",{"id":"div_players"}).find_all("p"):
        temp_row = {}
        temp_row["url"] = "https://www.pro-football-reference.com"+player.find("a")["href"]
        temp_row["name"] = player.text.split("(")[0].strip()
        years = player.text.split(")")[1].strip()
        temp_row["startYear"] = int(years.split("-")[0])
        temp_row["endYear"] = int(years.split("-")[1])
        temp_row["positions"] = player.text.split("(")[1].split(")")[0]
        players = players.append(temp_row,ignore_index=True)
players = players[players.endYear > 2000]
players.reset_index(inplace=True,drop=True)
game_df = pd.DataFrame()

def apply_test(row):
    #print(row)
    url = row['url']
    #print(list(range(int(row['startYear']),int(row['endYear'])+1)))
    for yr in range(int(row['startYear']),int(row['endYear'])+1):
        print(yr)
        content = requests.get(url.split(".htm")[0]+"/gamelog/"+str(yr)).content
        soup = bs(content,'html.parser').find("div",{"id":"all_stats"})
        #overheader
        over_headers = []
        for over in soup.find("thead").find("tr").find_all("th"):
            if("colspan" in over.attrs.keys()):
                for i in range(0,int(over['colspan'])):
                    over_headers = over_headers + [over.text]
            else:
                over_headers = over_headers + [over.text]
        #headers
        headers = []
        for header in soup.find("thead").find_all("tr")[1].find_all("th"):
            headers = headers + [header.text]
        all_headers = [a+"___"+b for a,b in zip(over_headers,headers)]
        #remove first column, it's meaningless
        all_headers = all_headers[1:len(all_headers)]
        for row in soup.find("tbody").find_all("tr"):
            temp_row = {}
            for i,col in enumerate(row.find_all("td")):
                temp_row[all_headers[i]] = col.text
            game_df = game_df.append(temp_row,ignore_index=True)

players.apply(apply_test,axis=1)
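Now, I could get back to what I'm actually trying to do, but there seems to be a higher-level problem here. The startYear and endYear in the for loop are 2013 and 2014, so the loop should set yr to 2013 and then to 2014. But if you look at what print(yr) prints, you'll see that it prints 2013 twice. Yet if you simply comment out the line game_df = game_df.append(temp_row,ignore_index=True), the printed values of yr are correct. An error occurs shortly after the first two lines of output, but that is expected and I'm happy to debug that one myself. What blows my mind right now is that appending to a global DataFrame makes the for loop behave differently. Can anyone help?

Thanks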
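I don't quite understand what the overall goal is, but I noticed two things:

1. You either need to declare the local game_df as global, with global game_df placed before the line game_df = game_df.append(temp_row,ignore_index=True), or, better still, pass it as an argument in the def signature, although you will then need to modify the call players.apply(apply_test,axis=1) accordingly.

2. You need to handle the case where find returns None, for example in soup.find("thead").find_all("tr")[1].find_all("th") for the page https://www.pro-football-reference.com/players/A/AaitIs00/gamelog/2014. Perhaps wrap it in a try/except block and supply a suitable default value.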
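To make both points concrete, here is a minimal sketch rather than a drop-in fix. It assumes nothing about the real site: the list rows stands in for game_df, the function names (append_broken, append_global, append_to, second_header_row) and the inline markup are made up for illustration, and the standard library's xml.etree.ElementTree stands in for BeautifulSoup so the snippet runs without third-party packages or network access. Its find() also returns None when nothing matches, so the same guard should carry over.

```python
import xml.etree.ElementTree as ET

# --- Point 1: assigning to a module-level name inside a function ---
rows = []  # hypothetical stand-in for the module-level game_df


def append_broken(value):
    # Because `rows` is assigned in this function, Python treats it as a
    # local name throughout the body, so reading it here fails.
    rows = rows + [value]


def append_global(value):
    global rows  # rebind the module-level name instead of creating a local
    rows = rows + [value]


def append_to(acc, value):
    # Alternative: take the accumulator as an argument and return the result.
    return acc + [value]


try:
    append_broken(2013)
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

append_global(2013)
passed = append_to(append_to([], 2013), 2014)
print(error_name, rows, passed)


# --- Point 2: find() returning None when the element is missing ---
def second_header_row(stats_div):
    """Return the text of the second header row, or [] if it is absent."""
    if stats_div is None:
        return []
    thead = stats_div.find(".//thead")
    if thead is None:  # explicit None check: the element was not found
        return []
    header_rows = thead.findall("tr")
    if len(header_rows) < 2:
        return []
    return [th.text or "" for th in header_rows[1].findall("th")]


# Made-up markup: one page with a two-row header, one page without a table.
with_table = ET.fromstring(
    "<div><table><thead>"
    "<tr><th>Passing</th></tr>"
    "<tr><th>Rk</th><th>Yds</th></tr>"
    "</thead></table></div>"
)
without_table = ET.fromstring("<div><p>No games this season</p></div>")

print(second_header_row(with_table))
print(second_header_row(without_table))
print(second_header_row(None))
```

In the original code the same shape would mean returning the collected rows from apply_test (or passing the accumulator in) instead of rebinding game_df, and skipping a season when the stats div or its thead is missing. Note also that DataFrame.append was removed in pandas 2.0, so on current versions collecting dicts in a list and building the DataFrame once at the end is probably the simpler route.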