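I am scraping the SoFIFA website into a workable CSV, with one row per player. My main problem is the positions part of the page: when I try to iterate over it, only the first position is exported. Ideally I would like all of a player's positions in the same column, separated by commas.
Here is the source HTML for one table row (screenshot: Sofifa website):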
<tr>
<td class="col-avatar"><figure class="avatar">
<img alt="" data-src="https://cdn.sofifa.com/players/240/950/21_60.png" data-srcset="https://cdn.sofifa.com/players/240/950/21_120.png 2x, https://cdn.sofifa.com/players/240/950/21_180.png 3x" src="https://cdn.sofifa.com/players/240/950/21_60.png" data-root="https://cdn.sofifa.com/players/" data-type="player" id="240950" class="player-check loaded" srcset="https://cdn.sofifa.com/players/240/950/21_120.png 2x, https://cdn.sofifa.com/players/240/950/21_180.png 3x" data-was-processed="true"></figure></td>
<td class="col-name">
<a class="tooltip" href="/player/240950/pedro-antonio-pereira-goncalves/210058/" data-tooltip="Pedro António Pereira Gonçalves"><div class="bp3-text-overflow-ellipsis"><img title="Portugal" alt="" src="https://cdn.sofifa.com/flags/pt.png" data-src="https://cdn.sofifa.com/flags/pt.png" data-srcset="https://cdn.sofifa.com/flags/pt@2x.png 2x, https://cdn.sofifa.com/flags/pt@3x.png 3x" class="flag loaded" srcset="https://cdn.sofifa.com/flags/pt@2x.png 2x, https://cdn.sofifa.com/flags/pt@3x.png 3x" data-was-processed="true"> Pedro Gonçalves</div></a><a rel="nofollow" href="/players?pn=23"><span class="pos pos23">RW</span></a> <a rel="nofollow" href="/players?pn=14"><span class="pos pos14">CM</span></a></td><td class="col col-ae" data-col="ae">22</td><td class="col col-oa" data-col="oa"><span class="bp3-tag p p-79">79</span></td><td class="col col-pt" data-col="pt"><span class="bp3-tag p p-87">87</span></td><td class="col-name">
<div class="bp3-text-overflow-ellipsis"><figure class="avatar avatar-sm transparent">
<img alt="" class="team loaded" data-src="https://cdn.sofifa.com/teams/237/30.png" data-srcset="https://cdn.sofifa.com/teams/237/60.png 2x, https://cdn.sofifa.com/teams/237/90.png 3x" src="https://cdn.sofifa.com/teams/237/30.png" data-root="https://cdn.sofifa.com/teams/" data-type="team" srcset="https://cdn.sofifa.com/teams/237/60.png 2x, https://cdn.sofifa.com/teams/237/90.png 3x" data-was-processed="true">
</figure>
<a href="/team/237/sporting-cp/">Sporting CP</a><div class="sub">
2020 ~ 2025</div>
</div>
</td><td class="col col-vl" data-col="vl">€39.5M</td><td class="col col-wg" data-col="wg">€16K</td><td class="col col-tt" data-col="tt"><span class="bp3-tag p">2021</span></td><td class="col-comment">
5.2K</td>
</tr>
Here is my web scraping code:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

# Get basic players information for all players
base_url = "https://sofifa.com/players?offset="
columns = ['ID', 'Name', 'Age', 'Positions','Nationality', 'Overall', 'Potential', 'Club', 'Value', 'Wage',]
data = pd.DataFrame(columns = columns)
for offset in range(0, 335):
    url = base_url + str(offset * 60)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    table_body = soup.find('tbody')
    for row in table_body.findAll('tr'):
        td = row.findAll('td')
        pid = td[0].find('img').get('id')
        nationality = td[1].find('img').get('title')
        name = td[1].find("a").get("data-tooltip")
        rel = td[1].findAll('a',{'rel': 'nofollow'})
        pos= rel[0].findAll('span')
        for span in pos :
            positions= (span.text.split)
        age = td[2].text
        overall = td[3].text.strip()
        potential = td[4].text.strip( )
        club = td[5].find('a').text
        value = td[6].text.strip()
        wage = td[7].text.strip()
        player_data = pd.DataFrame([[pid, name, age, positions, nationality, overall, potential, club, value, wage]])
        player_data.columns = columns
        data = data.append(player_data, ignore_index=True)
    print("done for "+str(offset),end="\r")
data.drop_duplicates()
data.head()
data.to_csv('player data.csv', encoding='utf-8-sig')
It produces this output (screenshot: Excel output).
To get the positions as a comma-separated string, you can try:
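A minimal sketch of one way to do this, not necessarily the exact code from the answer. In the question's loop, `rel[0]` looks only at the first `rel="nofollow"` link, and each link holds a single `<span class="pos ...">`, so only the first position is ever kept. Selecting every `span.pos` in the name cell and joining the texts avoids that. The helper name `extract_positions` and the trimmed `SAMPLE_ROW` are illustrative assumptions; the markup is cut down from the row shown in the question.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the row from the question (only the cells used here).
SAMPLE_ROW = """
<table><tbody><tr>
<td class="col-avatar"><figure class="avatar"><img alt="" id="240950"></figure></td>
<td class="col-name">
<a class="tooltip" href="/player/240950/pedro-antonio-pereira-goncalves/210058/" data-tooltip="Pedro António Pereira Gonçalves"><div><img title="Portugal" alt=""> Pedro Gonçalves</div></a>
<a rel="nofollow" href="/players?pn=23"><span class="pos pos23">RW</span></a>
<a rel="nofollow" href="/players?pn=14"><span class="pos pos14">CM</span></a>
</td>
</tr></tbody></table>
"""


def extract_positions(name_cell):
    """Return every position in the name cell as one comma-separated string."""
    # select() matches all <span> tags carrying the "pos" class, across all links.
    spans = name_cell.select("span.pos")
    return ", ".join(span.get_text(strip=True) for span in spans)


soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = soup.find("tbody").find("tr")
cells = row.find_all("td")

positions = extract_positions(cells[1])
print(positions)
```

In the question's loop, the `rel`/`pos`/`for span in pos` lines could then be replaced by a single `positions = extract_positions(td[1])`.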
This prints the resulting table and saves it as data.csv (screenshot from LibreOffice).
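For the saving step, here is a hedged sketch that collects the parsed rows in a plain list and builds the DataFrame once at the end. `DataFrame.append`, used in the question, was removed in pandas 2.0, and `drop_duplicates` returns a new frame, so its result has to be assigned. The reduced column list and the duplicated sample row are assumptions made only to keep the example short.

```python
import pandas as pd

COLUMNS = ["ID", "Name", "Positions"]  # reduced column set, for illustration only

# In the real scraper, each parsed <tr> would append one list here.
rows = []
sample = ["240950", "Pedro António Pereira Gonçalves", "RW, CM"]
rows.append(sample)
rows.append(list(sample))  # simulate a duplicate row

# Build the DataFrame once instead of appending inside the loop.
data = pd.DataFrame(rows, columns=COLUMNS)

# drop_duplicates returns a new DataFrame; assign it to keep the result.
data = data.drop_duplicates(ignore_index=True)

# utf-8-sig writes a BOM so Excel detects the encoding of accented names.
data.to_csv("data.csv", index=False, encoding="utf-8-sig")
```

Because the positions string contains a comma, pandas quotes that field when writing, so it still opens as a single column in Excel or LibreOffice.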