Why is my web scrape missing the expected table when using Python?

Posted 2024-06-16 09:16:52


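I'm trying to use this code to get information from Ballotpedia (https://ballotpedia.org/Governor_(state_executive_office)), specifically the names of the officeholders. The code I have here only produces the following output: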

,Governor_(state_executive_office),Lieutenant_Governor_(state_executive_office),Secretary_of_State_(state_executive_office),Attorney_General_(state_executive_office)

I'm trying to get the names as well. This is my code so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd

list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

temp_dict = {}

for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')

    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter")]

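The last line is where I think the problem is. I have tried removing parts of, and adding to, the "table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter" selector, but I always get the same result. I copied it straight from the website, so I'm not sure what I'm missing. If that isn't the problem, is something wrong with the rest of the code on that line? Thanks, everyone!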


3 answers

First find the table with the CSS selector below, then use pandas.read_html() to load it into a DataFrame. This gives you all of the data in a single DataFrame.

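from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

frames = []
for url in listurl:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Take the last table on the page whose id is "officeholder-table"
    table = soup.select("table#officeholder-table")[-1]
    # Wrap the markup in StringIO; newer pandas versions deprecate passing literal HTML strings
    frames.append(pd.read_html(StringIO(str(table)))[0])

# DataFrame.append was removed in pandas 2.0, so concatenate the collected frames instead
df1 = pd.concat(frames, ignore_index=True)

print(df1)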
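
Since the question was after the names for each office, one possible extension is to keep track of which page each row came from when combining. This is only a minimal sketch: the two small frames below are placeholder data standing in for the scraped tables (the lieutenant governor row is invented purely for illustration), and it assumes each scraped table has a "Name" column.

```python
import pandas as pd

# Placeholder frames standing in for the tables scraped from each page.
# The keys mirror page.split('/')[-1] from the question; the second frame is made up.
frames = {
    "Governor_(state_executive_office)": pd.DataFrame(
        {"Office": ["Governor of Georgia", "Governor of Tennessee"],
         "Name": ["Brian Kemp", "Bill Lee"]}
    ),
    "Lieutenant_Governor_(state_executive_office)": pd.DataFrame(
        {"Office": ["Lieutenant Governor of Examplestate"],
         "Name": ["Placeholder Person"]}
    ),
}

# Passing a dict to concat uses its keys as an outer index level,
# so every row remembers which page it came from.
combined = pd.concat(frames, names=["page", "row"])

# Collect the names per page into a plain dict of lists.
names_by_page = combined.groupby(level="page")["Name"].apply(list).to_dict()
print(names_by_page)
```

The resulting dict has the same shape as the temp_dict the question was trying to build, but holds only the names.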

If you want a separate DataFrame for each page instead, try this:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

for url in listurl:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Take the last table on the page whose id is "officeholder-table"
    table = soup.select("table#officeholder-table")[-1]
    # Wrap the markup in StringIO; newer pandas versions deprecate passing literal HTML strings
    df = pd.read_html(StringIO(str(table)))[0]
    print(df)

There is a simpler way. Taking one of your URLs at random, try the following:

import pandas as pd
tables = pd.read_html("https://ballotpedia.org/Governor_(state_executive_office)")
tables[4]

Output:

   Office                 Name         Party       Date assumed office
0  Governor of Georgia    Brian Kemp   Republican  January 14, 2019
1  Governor of Tennessee  Bill Lee     Republican  January 15, 2019
2  Governor of Missouri   Mike Parson  Republican  June 1, 2018

and so on.
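
A hard-coded position such as tables[4] may shift if the page layout changes. As a hedged alternative, read_html accepts an attrs argument that restricts the result to tables with matching attributes. The sketch below runs on a small inline HTML sample rather than the live page, and assumes the target table carries the id "officeholder-table" that the other answers select on.

```python
from io import StringIO

import pandas as pd

# Inline sample standing in for the downloaded page (an assumption, not the real markup).
html = """
<table id="other"><tr><th>X</th></tr><tr><td>1</td></tr></table>
<table id="officeholder-table">
  <thead><tr><th>Office</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>Governor of Georgia</td><td>Brian Kemp</td></tr>
    <tr><td>Governor of Tennessee</td><td>Bill Lee</td></tr>
  </tbody>
</table>
"""

# attrs keeps only tables whose HTML attributes match, so no positional index is needed.
tables = pd.read_html(StringIO(html), attrs={"id": "officeholder-table"})
officeholders = tables[0]

# Pull out just the names column as a plain list.
names = officeholders["Name"].tolist()
print(names)
```

With the real page, the same call would take the URL in place of the StringIO object.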

You can try accessing the table through its id selector:

import requests
from bs4 import BeautifulSoup

# Named "pages" rather than "list" so the built-in list type is not shadowed
pages = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

temp_dict = {}

for page in pages:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Select the table by its id and store its text under the page name
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select('#officeholder-table')]
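
As for why the selector copied from the browser returns nothing: in a CSS selector a space means "descendant", so everything after the first space is read as a nested tag name, not as more classes. It is also likely that the tablesorter classes are added by JavaScript in the browser and are not in the HTML that requests receives, though that is an assumption about the page. The sketch below shows the difference on a small inline sample; the class list on the sample table is assumed, not copied from the site.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the HTML as served, before any JavaScript runs (assumed markup).
html = """
<table class="bptable gray sortable" id="officeholder-table">
  <tr><th>Office</th><th>Name</th></tr>
  <tr><td>Governor of Georgia</td><td>Brian Kemp</td></tr>
  <tr><td>Governor of Tennessee</td><td>Bill Lee</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The copied selector: the words after each space are treated as descendant
# tag names, and the "tablesorter" class is absent, so nothing matches.
copied = soup.select(
    "table.bptable.gray.sortable.tablesorter tablesorter-default jquery-tablesorter"
)

# Selecting by id does not depend on any of those classes.
by_id = soup.select("#officeholder-table")

# Take the second cell of every data row to get only the names.
names = [
    cells[1].get_text(strip=True)
    for row in by_id[0].find_all("tr")
    if (cells := row.find_all("td"))
]
print(copied, names)
```

Extracting the cells this way gives the names directly, rather than the whole table's text as one string.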
