Scraping table data from multiple pages across multiple URLs (Python and BeautifulSoup)

page = 1
year = 2000
while page < 20 and year < 2020:
    base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year,page)
    response = requests.get(base_URL, headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        sal_table = soup.find_all('table', class_ = 'tablehead')
        if len(sal_table) < 2:
            sal_table = sal_table[0]
            with open ('NBA_Salary_2000_2019.txt', 'a') as r:
                for row in sal_table.find_all('tr'):
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(30))
                    r.write('\n')
                page+=1
        else:
            print("too many tables")
    else:
        year +=1
        page = 1

4 Answers

User

Answer 1 · edited 2024-04-26 15:03:05

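I would consider using pandas, because 1) its .read_html() function (which can use BeautifulSoup under the hood) makes it easier to parse <table> tags, and 2) it can easily write straight to a file.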
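As a minimal sketch of both points, the snippet below parses a small inline HTML table with pd.read_html and serializes it with to_csv. SAMPLE_HTML and parse_salary_table are hypothetical names used only for illustration, and the markup assumes (as the code further down also does) that the header arrives as an ordinary first row of cells. It also assumes an HTML parser that pandas supports, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for one salary page; the header is a plain <td> row.
SAMPLE_HTML = """
<table class="tablehead">
  <tr><td>RK</td><td>NAME</td><td>TEAM</td><td>SALARY</td></tr>
  <tr><td>1</td><td>Shaquille O'Neal, C</td><td>Los Angeles Lakers</td><td>$17,142,000</td></tr>
  <tr><td>2</td><td>Kevin Garnett, PF</td><td>Minnesota Timberwolves</td><td>$16,806,000</td></tr>
</table>
"""


def parse_salary_table(html):
    """Return the first <table> in html with its first row promoted to the header."""
    df = pd.read_html(StringIO(html))[0]
    df.columns = list(df.iloc[0, :])
    # Drop the row that only repeats the column names.
    return df[df['RK'] != 'RK'].reset_index(drop=True)


table = parse_salary_table(SAMPLE_HTML)
csv_text = table.to_csv(index=False)  # pass a path instead to write a file
print(csv_text)
```

Wrapping the markup in StringIO keeps the call working on recent pandas versions, which deprecate passing literal HTML strings to read_html.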

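That said, iterating over 20 pages for every season is wasteful (for example, the first season has only 4 pages of data; the rest are blank). So I would add a check so that as soon as the scraper reaches a blank table, it moves on to the next season.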

from io import StringIO

import pandas as pd
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

frames = []  # one DataFrame per scraped page
year = 2000

while year < 2020:
    page = 1
    while True:
        base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year, page)
        # Pass headers by keyword: the second positional argument of requests.get is params.
        response = requests.get(base_URL, headers=headers)
        if response.status_code != 200:
            break  # give up on this season instead of retrying the same page forever

        # Parse the HTML that was already downloaded instead of fetching the URL again.
        temp_df = pd.read_html(StringIO(response.text))[0]
        temp_df.columns = list(temp_df.iloc[0, :])
        temp_df = temp_df[temp_df['RK'] != 'RK'].copy()

        if len(temp_df) == 0:
            break  # blank table: no more pages for this season

        print('Acquiring Season: %s\tPage: %s' % (year, page))
        temp_df['Season'] = '%s-%s' % (year - 1, year)
        frames.append(temp_df)
        page += 1
    year += 1

# DataFrame.append was removed in pandas 2.0, so combine the pages with pd.concat.
results = pd.concat(frames, ignore_index=True, sort=False) if frames else pd.DataFrame()
results.to_csv('c:/test/NBA_Salary_2000_2019.csv', index=False)

Output:

print (results.head(25).to_string())
    RK                     NAME                    TEAM       SALARY     Season
0    1      Shaquille O'Neal, C      Los Angeles Lakers  $17,142,000  1999-2000
1    2        Kevin Garnett, PF  Minnesota Timberwolves  $16,806,000  1999-2000
2    3       Alonzo Mourning, C              Miami Heat  $15,004,000  1999-2000
3    4         Juwan Howard, PF      Washington Wizards  $15,000,000  1999-2000
4    5       Scottie Pippen, SF  Portland Trail Blazers  $14,795,000  1999-2000
5    6          Karl Malone, PF               Utah Jazz  $14,000,000  1999-2000
6    7         Larry Johnson, F         New York Knicks  $11,910,000  1999-2000
7    8          Gary Payton, PG     Seattle SuperSonics  $11,020,000  1999-2000
8    9      Rasheed Wallace, PF  Portland Trail Blazers  $10,800,000  1999-2000
9   10            Shawn Kemp, C     Cleveland Cavaliers  $10,780,000  1999-2000
10  11     Damon Stoudamire, PG  Portland Trail Blazers  $10,125,000  1999-2000
11  12      Antonio McDyess, PF          Denver Nuggets   $9,900,000  1999-2000
12  13       Antoine Walker, PF          Boston Celtics   $9,000,000  1999-2000
13  14  Shareef Abdur-Rahim, PF     Vancouver Grizzlies   $9,000,000  1999-2000
14  15        Allen Iverson, SG      Philadelphia 76ers   $9,000,000  1999-2000
15  16            Vin Baker, PF     Seattle SuperSonics   $9,000,000  1999-2000
16  17            Ray Allen, SG         Milwaukee Bucks   $9,000,000  1999-2000
17  18    Anfernee Hardaway, SF            Phoenix Suns   $9,000,000  1999-2000
18  19          Kobe Bryant, SF      Los Angeles Lakers   $9,000,000  1999-2000
19  20      Stephon Marbury, PG         New Jersey Nets   $9,000,000  1999-2000
20  21           Vlade Divac, C        Sacramento Kings   $8,837,000  1999-2000
21  22         Bryant Reeves, C     Vancouver Grizzlies   $8,666,000  1999-2000
22  23        Tom Gugliotta, PF            Phoenix Suns   $8,558,000  1999-2000
23  24        Nick Van Exel, PG          Denver Nuggets   $8,354,000  1999-2000
24  25        Elden Campbell, C       Charlotte Hornets   $7,975,000  1999-2000
...
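
The "move to the next season at the first blank page" logic from this answer can also be isolated from the network code, which makes it easy to check. The following is only a sketch under stated assumptions: scrape_seasons, fake_fetch, FAKE_PAGES and the placeholder player names are hypothetical and not part of the original answer, and fetch_rows stands in for whatever function downloads and parses one page.

```python
def scrape_seasons(fetch_rows, first_year, last_year, max_pages=20):
    """Collect rows season by season, moving on at the first blank page.

    fetch_rows(year, page) is a hypothetical callable returning a list of rows
    (each row a list of cell strings); an empty list means a blank page.
    """
    collected = []
    for year in range(first_year, last_year + 1):
        for page in range(1, max_pages + 1):
            rows = fetch_rows(year, page)
            if not rows:
                break  # blank page: this season is exhausted
            season = '%s-%s' % (year - 1, year)
            collected.extend(row + [season] for row in rows)
    return collected


# Fake data source standing in for the website: 2000 has 2 pages, 2001 has 1.
FAKE_PAGES = {
    (2000, 1): [['1', 'Player A']],
    (2000, 2): [['2', 'Player B']],
    (2001, 1): [['1', 'Player C']],
}
requested = []  # records every (year, page) that was asked for


def fake_fetch(year, page):
    requested.append((year, page))
    return FAKE_PAGES.get((year, page), [])


rows = scrape_seasons(fake_fetch, 2000, 2001)
print(len(requested), 'requests for', len(rows), 'rows')
```

With this structure each season costs one request more than it has pages, rather than a fixed 20 requests, and max_pages remains as a safety cap in case a page never comes back blank.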
