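How do I scrape and store multiple tables from a wiki page?
I'm trying to extract three specific tables from the Survivor wiki pages: the contestants table, the season summary table, and the voting history table. I can get the contestants table without trouble, but I get an error saying that the season summary and voting history tables can't be found. My end goal is to merge this data into a single DataFrame for later cleaning and processing.
Here is my code, which successfully retrieves the contestants table: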
import pandas as pd
list_of_seasons = ['41', '42', '43', '44', '45', '46']
season_start = 41
contestants = {}
season_summary = {}
voting_history = {}
for i in list_of_seasons :
    contestants[i] = pd.read_html('https://en.wikipedia.org/wiki/Survivor_' + str(season_start), match='contestants')
    season_summary[i] = pd.read_html('https://en.wikipedia.org/wiki/Survivor_' + str(season_start), match='season summary')
    voting_history[i] = pd.read_html('https://en.wikipedia.org/wiki/Survivor_' + str(season_start), match='voting history')
    season_start = season_start + 1
print(contestants['45'])
print(season_summary['45'])
print(voting_history['45'])
But this is the error I get:
Traceback (most recent call last):
File "c:\Users\bsjes\Documents\Code\Personal Projects\Survivor Data Grabber\SurvivorWikiRipper_0.2.py", line 13, in <module>
season_summary[i] = pd.read_html('https://en.wikipedia.org/wiki/Survivor_' + str(season_start), match='season summary')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\bsjes\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\io\html.py", line 1246, in read_html
return _parse(
^^^^^^^
File "C:\Users\bsjes\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\io\html.py", line 1009, in _parse
raise retained
File "C:\Users\bsjes\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\io\html.py", line 989, in _parse
tables = p.parse_tables()
^^^^^^^^^^^^^^^^
File "C:\Users\bsjes\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\io\html.py", line 249, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\bsjes\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\io\html.py", line 622, in _parse_tables
raise ValueError(f"No tables found matching pattern {repr(match.pattern)}")
ValueError: No tables found matching pattern 'season summary'
What should I do? Do I need to learn a different package?
1 Answer
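Looking at the wiki pages, these tables sit at the same index on every season's page (for example, the "contestants" table is the second <table>, the season summary is the third, and so on). The match argument only keeps tables whose own text matches the pattern, so it most likely fails here because "season summary" and "voting history" appear in section headings rather than inside the tables themselves.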
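One way to check which position holds which table, rather than trusting a fixed index, is to search the column headers of each parsed table. The following is only a sketch under stated assumptions: `flatten_columns` and `find_tables` are hypothetical helper names, and `demo_tables` is stand-in data shaped like the output of `pd.read_html`, not the real pages.

```python
import pandas as pd


def flatten_columns(df: pd.DataFrame) -> list[str]:
    """Return column labels as plain strings, joining MultiIndex levels with ' / '."""
    labels = []
    for col in df.columns:
        parts = col if isinstance(col, tuple) else (col,)
        # Drop consecutive repeats such as ("Contestant", "Contestant").
        unique = [str(p) for i, p in enumerate(parts) if i == 0 or p != parts[i - 1]]
        labels.append(" / ".join(unique))
    return labels


def find_tables(tables, required):
    """Return positions of tables whose headers contain every required word."""
    hits = []
    for pos, df in enumerate(tables):
        header_text = " ".join(flatten_columns(df)).lower()
        if all(word.lower() in header_text for word in required):
            hits.append(pos)
    return hits


# Stand-in data (assumption); in practice `tables` would come from pd.read_html(url).
demo_tables = [
    pd.DataFrame({"Genre": ["Reality"]}),
    pd.DataFrame(
        [["Hannah Rose", 33, "Lulu"]],
        columns=pd.MultiIndex.from_tuples(
            [("Contestant", "Contestant"), ("Age", "Age"), ("Tribe", "Original")]
        ),
    ),
    pd.DataFrame(
        [[1, "Reba", "Hannah"]],
        columns=pd.MultiIndex.from_tuples(
            [("Episode", "No."), ("Challenge winner(s)", "Reward"), ("Eliminated", "Player")]
        ),
    ),
]

contestant_positions = find_tables(demo_tables, ["contestant", "age"])
summary_positions = find_tables(demo_tables, ["episode", "eliminated"])
print(contestant_positions, summary_positions)
```

If a season's page ever gains or loses a table, a header lookup like this would still land on the right one, whereas a hard-coded index would silently return the wrong table.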
You can try:
import pandas as pd

contestants = {}
season_summary = {}
voting_history = {}

for season_start in range(41, 47):  # seasons 41 through 46
    u = f"https://en.wikipedia.org/wiki/Survivor_{season_start}"
    tables = pd.read_html(u)  # every <table> on the page, in document order
    contestants[season_start] = tables[1]
    season_summary[season_start] = tables[2]
    voting_history[season_start] = tables[4]

# Keys are integers here, not strings as in the question.
print(contestants[45])
print(season_summary[45])
print(voting_history[45])
Output:
Contestant Age From Tribe Finish
Contestant Age From Original Switched None Merged Placement Day
0 Hannah Rose 33 Baltimore, Maryland Lulu NaN NaN NaN 1st voted out Day 3
1 Brandon Donlon 26 Sicklerville, New Jersey Lulu NaN NaN NaN 2nd voted out Day 5
2 Sabiyah Broderick 28 Jacksonville, North Carolina Lulu NaN NaN NaN 3rd voted out Day 7
3 Sean Edwards 35 Provo, Utah Lulu Reba NaN NaN 4th voted out Day 9
4 Brandon "Brando" Meyer 23 Seattle, Washington Belo Belo NaN NaN 5th voted out Day 11
5 Janani "J. Maya" Krishnan-Jha 24 Los Angeles, California Reba Reba None[a] NaN 6th voted out Day 13
6 Nicholas "Sifu" Alsup 30 O'Fallon, Illinois Reba Reba None[a] Dakuwaqa 7th voted out Day 14
7 Kaleb Gebrewold 29 Vancouver, British Columbia Lulu Lulu None[a] Dakuwaqa 8th voted out 1st jury member Day 14
8 Kellie Nalbandian 29 New York City, New York Belo Lulu None[a] Dakuwaqa 9th voted out 2nd jury member Day 16
9 Kendra McQuarrie 30 Steamboat Springs, Colorado Belo Belo None[a] Dakuwaqa 10th voted out 3rd jury member Day 17
10 Bruce Perreault Survivor 44 47 Warwick, Rhode Island Belo Lulu None[a] Dakuwaqa 11th voted out 4th jury member Day 19
11 Emily Flippen 28 Laurel, Maryland Lulu Belo None[a] Dakuwaqa 12th voted out 5th jury member Day 21
12 Drew Basile 23 Philadelphia, Pennsylvania Reba Belo None[a] Dakuwaqa 13th voted out 6th jury member Day 23
13 Julie Alley 49 Brentwood, Tennessee Reba Reba None[a] Dakuwaqa 14th voted out 7th jury member Day 24
14 Katurah Topps 35 Brooklyn, New York Belo Lulu None[a] Dakuwaqa Eliminated 8th jury member Day 25
15 Jake O'Kane 26 Boston, Massachusetts Belo Lulu None[a] Dakuwaqa 2nd runner-up Day 26
16 Austin Li Coon 26 Chicago, Illinois Reba Belo None[a] Dakuwaqa Runner-up Day 26
17 Dee Valladares 26 Miami, Florida Reba Reba None[a] Dakuwaqa Sole Survivor Day 26
Episode Challenge winner(s) Eliminated
No. Title Air date Reward Immunity Tribe Player
0 1 "We Can Do Hard Things" September 27, 2023 Reba Belo Lulu Hannah
1 1 "We Can Do Hard Things" September 27, 2023 Reba Reba Lulu Hannah
2 2 "Brought a Bazooka to a Tea Party" October 4, 2023 Reba Reba Lulu Brandon
3 2 "Brought a Bazooka to a Tea Party" October 4, 2023 Belo Belo Lulu Brandon
4 3 "No Man Left Behind" October 11, 2023 Lulu Reba Lulu Sabiyah
5 3 "No Man Left Behind" October 11, 2023 Reba Belo Lulu Sabiyah
6 4 "Music to My Ears" October 18, 2023 NaN Lulu Reba Sean
7 4 "Music to My Ears" October 18, 2023 NaN Belo Reba Sean
8 5 "I Don't Want to Be the Worm" October 25, 2023 Reba Reba Belo Brando
9 5 "I Don't Want to Be the Worm" October 25, 2023 Lulu Lulu Belo Brando
10 6 "I'm Not Batman, I'm the Canadian" November 1, 2023 Austin, Bruce, Drew, Julie, Kendra, Sifu [Katurah] (Blue Team)[a] Austin, Bruce, Drew, Julie, Kendra, Sifu [Katurah] (Blue Team)[a] NaN J. Maya
11 7 "The Thorn in My Thumb" November 8, 2023 Dee [Austin, Jake, Julie, Kaleb, Katurah] (Red Team)[b] Kellie (Blue Team) Dakuwaqa Sifu
12 7 "The Thorn in My Thumb" November 8, 2023 Dee [Austin, Jake, Julie, Kaleb, Katurah] (Red Team)[b] Dee (Red Team) Dakuwaqa Kaleb
13 8 "Following a Dead Horse to Water" November 15, 2023 Survivor Auction Bruce Dakuwaqa Kellie
14 9 "Sword of Damocles" November 22, 2023 Bruce, Julie, Kendra (Yellow Team) Bruce Dakuwaqa Kendra
15 10 "How Am I the Mobster?" November 29, 2023 Emily [Dee, Julie, Katurah] Austin Dakuwaqa Bruce
16 11 "This Game Rips Your Heart Out" December 6, 2023 Drew [Austin, Jake][c] Drew Dakuwaqa Emily
17 12 "The Ex-Girlfriend at the Wedding" December 13, 2023 Austin [Dee, Katurah] Dee Dakuwaqa Drew
18 13 "Living the Survivor Dream" December 20, 2023 Austin [Jake][d] Austin Dakuwaqa Julie
19 13 "Living the Survivor Dream" December 20, 2023 NaN Dee [Austin] Dakuwaqa Katurah
Unnamed: 0_level_0 Original tribes Switched tribes No tribes Merged tribe Unnamed: 17_level_0
Episode 1 2 3 4 5 6 6.1 7 7.1 8 9 10 11 12 13 13.1 Unnamed: 17_level_1
0 Day 3 5 7 9 11 13 13 14[a] 14[a] 16 17 19 21 23 24 25 NaN
1 Tribe Lulu Lulu Lulu Reba Belo NaN NaN Dakuwaqa Dakuwaqa Dakuwaqa Dakuwaqa Dakuwaqa Dakuwaqa Dakuwaqa Dakuwaqa Dakuwaqa NaN
2 Eliminated Hannah Brandon Sabiyah Sean Brando NaN J. Maya Sifu Kaleb Kellie Kendra Bruce Emily Drew Julie Katurah NaN
3 Votes 5–0[b] 3–0 2–1 3–1–1 3–2 0–0[c] 10–1 5–1 4–2 5–3 6–1 4–3–1 1–0[d] 4–2 2–1–1–0[e] None[f] NaN
4 Voter Vote Vote Vote Vote Vote Vote Vote Vote Vote Vote Vote Vote Vote Vote Vote Challenge NaN
5 Dee NaN NaN NaN Sifu NaN Kaleb J. Maya NaN Kaleb Kellie Kendra Jake Julie Drew Katurah Immune[f] NaN
6 Austin NaN NaN NaN NaN Brando[g] None[h] None[h] NaN Kaleb Kellie Kendra[i] Jake Julie Julie Julie Saved[f] NaN
7 Jake NaN NaN NaN NaN NaN Kaleb J. Maya NaN Julie None[j] Kendra Bruce Julie Drew Dee Won[f] NaN
8 Katurah NaN NaN NaN NaN NaN Kaleb J. Maya NaN Kaleb Jake None[i] Bruce Julie Drew Julie Lost[f] NaN
9 Julie NaN NaN NaN Sean NaN Kaleb J. Maya NaN Kaleb Kellie Kendra Bruce Emily Drew Jake NaN NaN
10 Drew NaN NaN NaN NaN Brando Kaleb J. Maya Sifu NaN Kellie Kendra Jake Julie Julie NaN NaN NaN
11 Emily Hannah Brandon Sabiyah NaN Brando Kaleb J. Maya Sifu NaN Kellie None[i] Bruce Julie NaN NaN NaN NaN
12 Bruce NaN NaN NaN NaN NaN Kaleb J. Maya Sifu NaN None[k] Kendra Julie NaN NaN NaN NaN NaN
13 Kendra NaN NaN NaN NaN Drew Kaleb J. Maya Sifu NaN Jake Jake NaN NaN NaN NaN NaN NaN
14 Kellie NaN NaN NaN NaN NaN Kaleb J. Maya Sifu NaN Jake NaN NaN NaN NaN NaN NaN NaN
15 Kaleb Hannah Brandon Sabiyah NaN NaN None[j] None[j] NaN Julie NaN NaN NaN NaN NaN NaN NaN NaN
16 Sifu NaN NaN NaN Sean NaN Kaleb J. Maya Bruce NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 J. Maya NaN NaN NaN Sean NaN Kaleb Emily NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 Brando NaN NaN NaN NaN Drew NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 Sean Hannah Brandon Kaleb Dee NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20 Sabiyah Hannah None[l] None[h] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21 Brandon Hannah None[l] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 Hannah None[b] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
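For the stated goal of ending up with a single DataFrame, one possible approach is to flatten the two-level headers and stack the per-season tables with `pd.concat`, tagging each row with its season. This is a minimal sketch, not a definitive implementation: `combine_seasons` is a hypothetical helper name, and the season 44 row in `demo` is placeholder data used only to show the stacking.

```python
import pandas as pd


def combine_seasons(per_season: dict) -> pd.DataFrame:
    """Stack one table per season into a single frame with a 'season' column."""
    frames = []
    for season, df in per_season.items():
        df = df.copy()
        if isinstance(df.columns, pd.MultiIndex):
            # Join the header levels, collapsing repeats like ("Age", "Age").
            df.columns = [
                " ".join(dict.fromkeys(str(level) for level in col))
                for col in df.columns
            ]
        df.insert(0, "season", season)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# Small stand-in for the `contestants` dict built above.
cols = pd.MultiIndex.from_tuples(
    [("Contestant", "Contestant"), ("Age", "Age"), ("Tribe", "Original")]
)
demo = {
    44: pd.DataFrame([["Example Player A", 30, "Tribe X"]], columns=cols),  # placeholder row
    45: pd.DataFrame(
        [["Hannah Rose", 33, "Lulu"], ["Brandon Donlon", 26, "Lulu"]], columns=cols
    ),
}

all_contestants = combine_seasons(demo)
print(all_contestants)
```

With the real data this would be called as `combine_seasons(contestants)`. `pd.concat` aligns on column names and fills anything missing with NaN, so tables whose columns differ between seasons (the voting history table, whose columns are episode numbers, is a likely case) may need reshaping before they are stacked.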