Why is my web scrape missing the expected table when using Python?

import requests
from bs4 import BeautifulSoup
import pandas as pd

list = [
    'https://ballotpedia.org/Governor_(state_executive_office)',
    'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)',
    'https://ballotpedia.org/Secretary_of_State_(state_executive_office)',
    'https://ballotpedia.org/Attorney_General_(state_executive_office)',
]

temp_dict = {}

for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery- tablesorter")]

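3 Answers

User

#1 · edited 2024-06-16 09:16:52

First locate the table with the CSS selector below, then use pandas read_html() to load it into a DataFrame. This gives you all the data in a single DataFrame: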

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

frames = []
for url in listurl:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Take the last element that matches the table id
    table = soup.select("table#officeholder-table")[-1]
    # Wrap the markup in StringIO; recent pandas versions deprecate passing literal HTML strings
    df = pd.read_html(StringIO(str(table)))[0]
    frames.append(df)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df1 = pd.concat(frames, ignore_index=True)

print(df1)

If you want a separate DataFrame for each page, try this:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

for url in listurl:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Take the last element that matches the table id
    table = soup.select("table#officeholder-table")[-1]
    # Wrap the markup in StringIO; recent pandas versions deprecate passing literal HTML strings
    df = pd.read_html(StringIO(str(table)))[0]
    print(df)
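
When the per-page tables are combined into one DataFrame, it can help to record which page each row came from. The following is a minimal sketch, not part of the original answer: the `frames_by_page` dictionary is a hypothetical stand-in for the DataFrames scraped in the loop, and the second entry contains placeholder values rather than real data.

```python
import pandas as pd

# Hypothetical stand-ins for the per-page DataFrames scraped in the loop above.
frames_by_page = {
    "Governor_(state_executive_office)": pd.DataFrame(
        {
            "Office": ["Governor of Georgia", "Governor of Tennessee"],
            "Name": ["Brian Kemp", "Bill Lee"],
        }
    ),
    # Placeholder row, not real data.
    "Lieutenant_Governor_(state_executive_office)": pd.DataFrame(
        {"Office": ["Placeholder office"], "Name": ["Placeholder name"]}
    ),
}

# Passing a dict to pd.concat uses its keys as an outer index level.
# Move that level into an ordinary "page" column, then renumber the rows.
combined = (
    pd.concat(frames_by_page, names=["page", "row"])
    .reset_index(level="page")
    .reset_index(drop=True)
)

print(combined)
```

In the scraping loop, the dictionary key could be `url.split('/')[-1]`, the same page label the question uses for `temp_dict`.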

User

#2 · edited 2024-06-16 09:16:52

There is a simpler way. Pick any one of the URLs and try the following:

import pandas as pd
tables = pd.read_html("https://ballotpedia.org/Governor_(state_executive_office)")
tables[4]

Output:

   Office                 Name         Party       Date assumed office
0  Governor of Georgia    Brian Kemp   Republican  January 14, 2019
1  Governor of Tennessee  Bill Lee     Republican  January 15, 2019
2  Governor of Missouri   Mike Parson  Republican  June 1, 2018

and so on.
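
Selecting the table by position (`tables[4]`) may break if the page layout changes. As a hedged alternative, `pd.read_html` accepts an `attrs` dictionary that filters tables by HTML attribute. The sketch below runs against a small inline HTML sample (an assumption made so the example needs no network access) and reuses the `officeholder-table` id from the other answers; `read_officeholders` is a hypothetical helper name.

```python
from io import StringIO

import pandas as pd

# Inline sample standing in for a downloaded page (assumed structure).
SAMPLE_HTML = """
<table id="infobox">
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>a</td><td>b</td></tr>
</table>
<table id="officeholder-table">
  <thead>
    <tr><th>Office</th><th>Name</th><th>Party</th><th>Date assumed office</th></tr>
  </thead>
  <tbody>
    <tr><td>Governor of Georgia</td><td>Brian Kemp</td><td>Republican</td><td>January 14, 2019</td></tr>
    <tr><td>Governor of Tennessee</td><td>Bill Lee</td><td>Republican</td><td>January 15, 2019</td></tr>
  </tbody>
</table>
"""


def read_officeholders(html: str) -> pd.DataFrame:
    """Return the table whose id is 'officeholder-table' as a DataFrame."""
    # attrs restricts parsing to tables with the given HTML attributes.
    tables = pd.read_html(StringIO(html), attrs={"id": "officeholder-table"})
    # Take the last match, mirroring the [-1] used in the first answer.
    return tables[-1]


df = read_officeholders(SAMPLE_HTML)
print(df)
```

With a live page, the HTML text fetched by `requests` could be passed to the helper in place of `SAMPLE_HTML`. If no table matches, `pd.read_html` raises a `ValueError`.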

User

#3 · edited 2024-06-16 09:16:52

You can try accessing the table by its id selector:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Named `urls` rather than `list` to avoid shadowing the built-in type
urls = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

temp_dict = {}

for page in urls:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Key each result by the last path segment of the URL
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select('#officeholder-table')]
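
One likely reason the selector in the question matches nothing is that a space in a CSS selector is a descendant combinator, so `tablesorter-default` and the names after it are read as nested element names rather than as extra classes on the table. In addition, classes such as `tablesorter` are typically added in the browser by the jQuery tablesorter plugin, so they are probably absent from the HTML that `requests` downloads; that is an assumption about this site, not something verified here. The sketch below shows the difference on a small inline sample whose class names are assumed for illustration.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the served HTML (class names are assumed).
SAMPLE_HTML = """
<table id="officeholder-table" class="bptable gray sortable">
  <tr><th>Office</th><th>Name</th></tr>
  <tr><td>Governor of Georgia</td><td>Brian Kemp</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Each space starts a descendant match, so this looks for a
# <jquery-tablesorter> element nested inside a <tablesorter-default>
# element inside the table. No such elements exist.
broken = soup.select(
    "table.bptable.gray.sortable.tablesorter tablesorter-default jquery-tablesorter"
)

# Classes chained with dots and no spaces all apply to the same element.
by_class = soup.select("table.bptable.gray.sortable")

# An id selector avoids depending on class names altogether.
by_id = soup.select("#officeholder-table")

print(len(broken), len(by_class), len(by_id))
```

Selecting by id, as this answer does, sidesteps both issues as long as the id is present in the served markup.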
