Parsing web scraper data into Excel
I'm new to this, so apologies if this sounds silly. :D I've done some research and can now pull data from the page I want to scrape. However, I just can't get the data organized the way I want it.
First, the URL is (it changes per event, but this is a sample event): https://results.advancedeventsystems.com/event/PTAwMDAwMjkwMjQ90/divisions/131313/standings
So far, the code I've written pulls out the table that holds my data (minus the headers, but I'm not worried about those right now):
I'm hoping you all can offer some suggestions.
Phoenix
import chromedriver_autoinstaller
from selenium import webdriver
from bs4 import BeautifulSoup

# Install a chromedriver matching the local Chrome version
chromedriver_autoinstaller.install()

driver = webdriver.Chrome()
driver.get('https://results.advancedeventsystems.com/event/PTAwMDAwMjkwMjQ90/divisions/131313/standings')

# Parse the rendered page source with BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Grab every <tbody> carrying the class "k-table-tbody"
teams = soup.find_all('tbody', 'k-table-tbody')
print(teams)
That code gets me the whole thing. But now I'd like the data displayed similarly to the way the HTML renders it (i.e., like it's shown here)... and I haven't managed that.
Here is an example of the output I'm after:
2 Answers
The data is loaded from another URL via JavaScript. Here is an example of how to load it into a pandas DataFrame:
import pandas as pd
import requests

# The standings grid is populated from this OData endpoint:
api_url = "https://results.advancedeventsystems.com/odata/PTAwMDAwMjkwMjQ90/standings(dId=131313,cId=null,tIds=[])"
params = {"$orderby": "OverallRank,FinishRank,TeamName,TeamCode"}

data = requests.get(api_url, params=params).json()
# print(data)

df = pd.DataFrame(data["value"])

# Flatten the nested "Club", "Division" and "BidIdentification" dicts
# into prefixed top-level columns:
df = pd.concat([df, df.pop("Club").apply(pd.Series).add_prefix("Club_")], axis=1)
df = pd.concat(
    [df, df.pop("Division").apply(pd.Series).add_prefix("Division_")], axis=1
)
df = pd.concat(
    [df, df.pop("BidIdentification").apply(pd.Series).add_prefix("BidIdentification_")],
    axis=1,
)
print(df)
The output is:
TeamId TeamName TeamCode TeamText MatchesWon MatchesLost MatchPercent SetsWon SetsLost SetPercent PointRatio FinishRank OverallRank FinishRankText SearchableTeamName Club_ClubId Club_Name Division_DivisionId Division_Name Division_TeamCount Division_CodeAlias Division_ColorHex BidIdentification_BidStatus BidIdentification_DivisionAlias BidIdentification_DivisionId
0 171661 SA LADY GRIZZLIES 12-1 g12salgr1ls SA LADY GRIZZLIES 12-1 (LS) 6 0 1.000000 12 0 1.000000 2.097902 1 1 1st sa lady grizzlies 12-1 27673 SAN ANTONIO LADY GRIZZLIES 131313 12 Girls 16 12 Girls #5FBFFF None None 0
1 165364 CTX Juniors 12 Mizuno g12ctxjr1ls CTX Juniors 12 Mizuno (LS) 5 1 0.833333 10 5 0.666667 1.183521 2 2 2nd ctx juniors 12 mizuno 28511 CTX Juniors 131313 12 Girls 16 12 Girls #5FBFFF None None 0
2 425 AJV 12 adidas g12ajvba1ls AJV 12 adidas (LS) 4 1 0.800000 9 2 0.818182 1.690789 3 3 3rd ajv 12 adidas 207 Austin Junior Volleyball 131313 12 Girls 16 12 Girls #5FBFFF None None 0
3 17524 IMPACT - 121 g12impac1ls IMPACT - 121 (LS) 4 1 0.800000 8 3 0.727273 1.191489 3 3 3rd impact - 121 344 Impact Volleyball Club 131313 12 Girls 16 12 Girls #5FBFFF None None 0
4 28820 AP 11 adidas g11aperf1ls AP 11 adidas (LS) 3 2 0.600000 7 4 0.636364 1.295337 5 5 5th ap 11 adidas 469 Austin Performance Volleyball 131313 12 Girls 16 12 Girls #5FBFFF None None 0
5 26234 WACO VBC 12 UA Red g12wacov2ls WACO VBC 12 UA Red (LS) 3 2 0.600000 6 6 0.500000 0.913934 5 5 5th waco vbc 12 ua red 248 Waco Volleyball Club 131313 12 Girls 16 12 Girls #5FBFFF None None 0
6 167561 Premier 12 Crimson g12premr2nt Premier 12 Crimson (NT) 2 3 0.400000 6 6 0.500000 1.110092 7 7 7th premier 12 crimson 96 Dallas Premier 131313 12 Girls 16 12 Girls #5FBFFF None None 0
7 806 Roots 121 Green g12roots1ls Roots 121 Green (LS) 2 3 0.400000 4 6 0.400000 0.871287 7 7 7th roots 121 green 99 Roots Volleyball 131313 12 Girls 16 12 Girls #5FBFFF None None 0
8 152120 AJV 12FutureWilcoAlliance g12ajvba10ls AJV 12FutureWilcoAlliance (LS) 3 2 0.600000 6 4 0.600000 0.952153 9 9 9th ajv 12futurewilcoalliance 207 Austin Junior Volleyball 131313 12 Girls 16 12 Girls #5FBFFF None None 0
9 167217 Angelo United 12 g12angun1ls Angelo United 12 (LS) 2 3 0.400000 4 6 0.400000 0.963134 10 10 10th angelo united 12 28568 Angelo United 131313 12 Girls 16 12 Girls #5FBFFF None None 0
10 26005 Roots 12 Maple g12roots3ls Roots 12 Maple (LS) 2 3 0.400000 4 7 0.363636 0.776860 11 11 11th roots 12 maple 99 Roots Volleyball 131313 12 Girls 16 12 Girls #5FBFFF None None 0
11 426 AJV 12 Navy g12ajvba5ls AJV 12 Navy (LS) 1 4 0.200000 4 9 0.307692 0.771739 12 12 12th ajv 12 navy 207 Austin Junior Volleyball 131313 12 Girls 16 12 Girls #5FBFFF None None 0
12 126932 Austin Velocity 12s Green g12avvbc4ls Austin Velocity 12s Green (LS) 2 3 0.400000 5 6 0.454545 0.848101 13 13 13th austin velocity 12s green 6974 Austin Velocity Volleyball Club 131313 12 Girls 16 12 Girls #5FBFFF None None 0
13 428 AJV 12 Red g12ajvba7ls AJV 12 Red (LS) 1 4 0.200000 2 8 0.200000 0.752252 14 14 14th ajv 12 red 207 Austin Junior Volleyball 131313 12 Girls 16 12 Girls #5FBFFF None None 0
14 17215 AJV 12 Cedar Park g12ajvba4ls AJV 12 Cedar Park (LS) 1 4 0.200000 3 8 0.272727 0.872340 15 15 15th ajv 12 cedar park 207 Austin Junior Volleyball 131313 12 Girls 16 12 Girls #5FBFFF None None 0
15 124578 AJV 12 Toro g12ajvba6ls AJV 12 Toro (LS) 0 5 0.000000 0 10 0.000000 0.500000 15 15 15th ajv 12 toro 207 Austin Junior Volleyball 131313 12 Girls 16 12 Girls #5FBFFF None None 0
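Since the end goal here is Excel, it is worth noting that pandas can also flatten all the nested dicts in one call with pd.json_normalize, instead of the pop/apply pattern above. A minimal sketch, using a trimmed-down stand-in for data["value"] (the real response carries many more fields per team):

```python
import pandas as pd

# Trimmed-down stand-in for data["value"] from the code above;
# field names are taken from the printed output.
value = [
    {
        "TeamName": "SA LADY GRIZZLIES 12-1",
        "OverallRank": 1,
        "Club": {"ClubId": 27673, "Name": "SAN ANTONIO LADY GRIZZLIES"},
        "Division": {"DivisionId": 131313, "Name": "12 Girls"},
    },
]

# json_normalize flattens every nested dict in one call; sep="_" makes
# the column names match the Club_/Division_ prefixes used above.
df = pd.json_normalize(value, sep="_")
print(df.columns.tolist())

# From here, writing to Excel is a single call (requires openpyxl):
# df.to_excel("standings.xlsx", index=False)
```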
First, soup.find_all('tbody', 'k-table-tbody')
only finds the body of the table. You can right-click the page and choose Inspect to look at the page source. At a quick glance, <div role="grid" class="k-grid-aria-root" id="k-8e3ece95-0943-4c84-bba8-4e6a808da4bf" aria-label="Data table" aria-rowcount="18" aria-colcount="10">
is the top-level element of that table.
Second, try print(teams[0].prettify())
to make the output easier to read (find_all returns a list, so call prettify() on an element rather than on the list itself).
If you want to extract the data into some kind of data structure, you'll need to loop over these elements.
Here is a good tutorial to get started: https://realpython.com/beautiful-soup-web-scraper-python/
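The loop described above might look roughly like this; the HTML below is a simplified stand-in for the grid's markup, not copied from the live page:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the rendered grid; the real page's markup is
# more complex, but the row/cell looping pattern is the same.
html = """
<tbody class="k-table-tbody">
  <tr><td>1</td><td>SA LADY GRIZZLIES 12-1</td><td>6</td><td>0</td></tr>
  <tr><td>2</td><td>CTX Juniors 12 Mizuno</td><td>5</td><td>1</td></tr>
</tbody>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("tbody", "k-table-tbody")

# One list per <tr> row, one stripped string per <td> cell
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in tbody.find_all("tr")
]
print(rows)
```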