How do I get the information in a table using BeautifulSoup?

Posted 2024-04-16 18:21:43


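I am trying to get the information in the table on the following site: http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_LaboratoryInformation_S.aspx?Rep=0&RP=Y

When I inspect the page, I can find the data in td elements with the classes oddrowcolor and evenrowcolor. However, when I try to fetch that information, nothing is printed. How can I use BeautifulSoup for Python to get the information in the table?

Here is my code: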

import requests
from bs4 import BeautifulSoup
url = "http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_LaboratoryInformation_S.aspx?Rep=0&RP=Y"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

for tr in soup.find_all('tr', {'class': 'oddrowcolor'}):
    print(tr)

I tried using oddrowcolor, but nothing is output.


1 Answer
User
#1 · Posted 2024-04-16 18:21:43

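You can get the table using its id, but classes such as oddrowcolor are added dynamically, so they are not in the page source that requests receives: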

import requests
from bs4 import BeautifulSoup
url = "http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_LaboratoryInformation_S.aspx?Rep=0&RP=Y"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")
table = soup.select_one("#tableReportTable")

for tr in table.find_all("tr"):
    print(tr)
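
To see why the class-based search comes back empty while the id-based one works, it can help to compare the two lookups on the same markup. The sketch below uses a small hand-written HTML string as a stand-in for the server response (an assumption, so that it runs offline); the row contents are made up for illustration, and with the real page you would parse r.content instead.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML returned by the server:
# the table carries an id, but the rows have no oddrowcolor or
# evenrowcolor class yet (those are assumed to be added by JavaScript).
SAMPLE_HTML = """
<table id="tableReportTable">
  <tr><th>S.No.</th><th>State</th></tr>
  <tr><td>1</td><td><a href="#">GOA</a></td></tr>
  <tr><td>2</td><td><a href="#">ASSAM</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Lookup by class: finds nothing, because the class is not in the source.
rows_by_class = soup.find_all("tr", {"class": "oddrowcolor"})

# Lookup by table id: finds the table and every row inside it.
table = soup.select_one("#tableReportTable")
rows_by_id = table.find_all("tr")

print("rows matched by class:", len(rows_by_class))
print("rows matched via table id:", len(rows_by_id))
```

If the class count is zero on the real response as well, the classes are almost certainly applied in the browser after the page loads, and selecting by id (or by tag structure) is the more reliable route.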

To extract the table data, you can do the following:

soup = BeautifulSoup(r.content, "html.parser")

# gets the table using the table id
table = soup.select_one("#tableReportTable")
# column names
print(", ".join([th.text.strip() for th in table.select_one("tr").find_all("th")]))

# "tr + tr" matches every tr that directly follows another tr, i.e. all rows after the first
for tr in table.select("tr + tr"):
    # tr.select("td a") -> get all the anchor tags inside the row tds
    # then get the text from each anchor.
    print(",".join([a.text for a in tr.select("td a")]))

This gives you:

S.No., State, State Labs (without mobile labs), District Labs (without mobile labs), Block Labs/Total Blocks (without mobile labs), SubDivision Labs (without mobile labs), Mobile Labs (State/ District/ Block/ Sub-division Level), Total Labs   (State/ District/ Block/ Sub-division Level)

ANDAMAN and NICOBAR,1,0,NA / 9,0,2,3
ANDHRA PRADESH,1,32,NA / 662,73,0,106
ARUNACHAL PRADESH,1,17,NA / 100,31,0,49
ASSAM,1,29,NA / 242,53,20,103
BIHAR,1,41,NA / 536,0,0,42
CHANDIGARH,0,0,NA / 1,0,0,0
CHATTISGARH,1,27,NA / 146,20,5,53
DADRA & NAGAR HAVELI,0,0,NA / 10,0,0,0
DAMAN & DIU,0,0,NA / 1,0,0,0
DELHI,0,0,NA / 0,0,0,0
GOA,1,0,1 / 11,9,0,11
GUJARAT,1,34,50 / 246,0,6,91
HARYANA,0,21,NA / 126,21,0,42
HIMACHAL PRADESH,1,14,NA / 77,28,0,43
JAMMU AND KASHMIR,0,22,2 / 148,74,0,98
JHARKHAND,1,24,NA / 259,3,5,33
KARNATAKA,1,44,39 / 176,106,46,236
KERALA,1,14,NA / 148,33,0,48
LAKSHADWEEP,0,9,NA / 9,0,0,9
MADHYA PRADESH,1,51,3 / 313,106,0,161
MAHARASHTRA,1,44,2 / 351,139,0,186
MANIPUR,1,9,NA / 38,2,0,12
MEGHALAYA,1,7,NA / 42,22,0,30
MIZORAM,1,8,NA / 26,18,0,27
NAGALAND,0,11,NA / 74,1,2,14
ODISHA,1,32,NA / 314,42,0,75
PUDUCHERRY,0,2,NA / 3,0,0,2
PUNJAB,3,22,8 / 145,0,1,34
RAJASTHAN,1,33,163 / 295,0,0,197
SIKKIM,0,2,NA / 9,0,0,2
TAMIL NADU,1,34,NA / 385,49,0,84
TELANGANA,1,19,NA / 438,56,0,76
TRIPURA,1,8,7 / 58,6,0,22
UTTAR PRADESH,1,76,3 / 820,2,0,82
UTTARAKHAND,0,28,1 / 95,14,0,43
WEST BENGAL,1,18,NA / 341,201,0,220

This seems to match what I see in the browser. The totals are in th tags inside the last tr, so add the following outside the loop:

# after the loop, tr still refers to the last row, which holds the totals in th tags
print(",".join([th.text.strip() for th in tr.select("th")]))

This will give you:

Total,27,732,279,1109,87,2234
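
The steps above (column names, data rows, totals row) can be pulled together into one function. This is only a sketch under stated assumptions: extract_table is a hypothetical helper name, and SAMPLE_HTML is a made-up miniature of the table layout described in this answer (header cells in th, data values inside anchors in td, totals in th in the last row), used so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the report table; the values are invented.
SAMPLE_HTML = """
<table id="tableReportTable">
  <tr><th>S.No.</th><th>State</th><th>State Labs</th></tr>
  <tr><td>1</td><td><a href="#">GOA</a></td><td><a href="#">1</a></td></tr>
  <tr><td>2</td><td><a href="#">PUNJAB</a></td><td><a href="#">3</a></td></tr>
  <tr><th>Total</th><th>4</th></tr>
</table>
"""


def extract_table(html, table_id="tableReportTable"):
    """Return (header, data_rows, totals) as lists of stripped strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("#" + table_id)
    if table is None:
        raise ValueError("no table with id %r in the document" % table_id)

    rows = table.find_all("tr")
    # First row: column names in th cells.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]

    data_rows, totals = [], []
    for tr in rows[1:]:
        # Data rows keep their values inside anchors in the td cells.
        cells = [a.get_text(strip=True) for a in tr.select("td a")]
        if cells:
            data_rows.append(cells)
            continue
        # A later row made of th cells is treated as the totals row.
        ths = tr.find_all("th")
        if ths:
            totals = [th.get_text(strip=True) for th in ths]
    return header, data_rows, totals


header, data_rows, totals = extract_table(SAMPLE_HTML)
print(", ".join(header))
for row in data_rows:
    print(",".join(row))
print(",".join(totals))
```

As in the output above, the serial number is not wrapped in an anchor, so it is absent from each data row. For the real page, pass r.content in place of SAMPLE_HTML.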
