How do I extract data from a string that has no whitespace delimiters?

import urllib2
from bs4 import BeautifulSoup
import time

def scraper():
    #Arkansas State Plant Board Weather Web data
    url1 = 'http://170.94.200.136/weather/Inversion.aspx'

    #opens url and parses HTML into Unicode
    page1 = urllib2.urlopen(url1)
    soup1 = BeautifulSoup(page1, 'lxml')

    #print(soup.get_text()) gives a single Unicode string of relevant data in strings from the url
    #Without print(), returns everything in without proper spacing
    sp1 = soup1.get_text()

    #datasp1 is the chunk with the website data in it so the search for Arkansas doesn't return the header
    #everything else finds locations for Unicode strings for first four stations
    start1 = sp1.find('Today')
    end1 = sp1.find('new Sys.')
    datasp1 = sp1[start1:end1-10]

    startArkansas = datasp1.find('Arkansas')
    startAshley = datasp1.find('Ashley')
    dataArkansas = datasp1[startArkansas:startAshley-2]

    startBradley = datasp1.find('Bradley')
    dataAshley = datasp1[startAshley:startBradley-2]

    startChicot = datasp1.find('Chicot')
    dataBradley = datasp1[startBradley:startChicot-2]

    startCleveland = datasp1.find('Cleveland')
    dataChicot = datasp1[startChicot:startCleveland-2]

    print(dataArkansas)
    print(dataAshley)
    print(dataBradley)
    print(dataChicot)

2 Answers

User

Answer 1 · edited 2024-04-23 17:50:34

Instead of slicing the flattened text, parse the HTML page with BeautifulSoup and read the data straight out of the table:

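# Python 3 import; on Python 2 use "from urllib2 import urlopen" instead
from urllib.request import urlopen
from bs4 import BeautifulSoup

url1 = 'http://170.94.200.136/weather/Inversion.aspx'

# open the URL and parse the HTML (same 'lxml' parser as in the question)
page1 = urlopen(url1)
soup1 = BeautifulSoup(page1, 'lxml')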

# get the table
table = soup1.find(id='MainContent_GridView1')

# find the headers
headers = [h.get_text() for h in table.find_all('th')]

# retrieve data
data = {}
tr_elems = table.find_all('tr')
for tr in tr_elems:
    tr_content = [td.get_text() for td in tr.find_all('td')]
    if tr_content:
        data[tr_content[0]] = dict(zip(headers[1:], tr_content[1:]))

print(data)

The example prints something like:

{
  "Greene West": {
    "Low Temp  (\u00b0F)": "67.7",
    "Time Of High": "10:19 AM",
    "Wind Speed (MPH)": "0.6",
    "High Temp  (\u00b0F)": "83.2",
    "Wind Dir (\u00b0)": "20",
    "Time Of Low": "6:04 AM",
    "Current Time": "10:19 AM",
    "Current Temp  (\u00b0F)": "83.2"
  },
  "Cleveland": {
    "Low Temp  (\u00b0F)": "70.8",
    "Time Of High": "10:14 AM",
    "Wind Speed (MPH)": "1.9",
    [.....]

}
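
To try the header/row pairing without network access or third-party packages, here is a minimal sketch that uses the standard library's html.parser in place of BeautifulSoup. The inline markup, the TableParser class, the table_to_dict helper, and the "Station" header are all assumptions made for illustration; the real page's table may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; only the cell values are taken from the output above.
SAMPLE_HTML = """
<table id="MainContent_GridView1">
  <tr><th>Station</th><th>Low Temp</th><th>Time Of High</th></tr>
  <tr><td>Greene West</td><td>67.7</td><td>10:19 AM</td></tr>
  <tr><td>Cleveland</td><td>70.8</td><td>10:14 AM</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list per <tr>, holding (tag, text) pairs
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None and self.rows:
            self.rows[-1].append((tag, "".join(self._cell).strip()))
            self._cell = None


def table_to_dict(html):
    """Map the first cell of each data row to a dict of the remaining columns."""
    parser = TableParser()
    parser.feed(html)
    headers = [text for row in parser.rows for tag, text in row if tag == "th"]
    data = {}
    for row in parser.rows:
        cells = [text for tag, text in row if tag == "td"]
        if cells:
            data[cells[0]] = dict(zip(headers[1:], cells[1:]))
    return data


data = table_to_dict(SAMPLE_HTML)
print(data["Greene West"]["Low Temp"])
```

As in the BeautifulSoup version, every value comes back as a string, so numeric fields still need an explicit float() conversion before doing arithmetic on them.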

User

Answer 2 · edited 2024-04-23 17:50:34

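Just improve the way you extract the tabular data. I would use pandas.read_html() to read the table into a DataFrame, which I'm pretty sure you'll find convenient to work with: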

import pandas as pd

df = pd.read_html("http://170.94.200.136/weather/Inversion.aspx", attrs={"id": "MainContent_GridView1"})[0]
print(df)
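
read_html also accepts a file-like object, which makes it possible to experiment without hitting the site. The sketch below feeds it a small inline table and then indexes the result by station. The markup and the "Station" column name are assumptions for illustration (the real first column may be named differently), and read_html needs an HTML parser such as lxml to be installed.

```python
import io

import pandas as pd

# Hypothetical sample markup; only the cell values are taken from the first answer's output.
SAMPLE_HTML = """
<table id="MainContent_GridView1">
  <tr><th>Station</th><th>Low Temp</th><th>Time Of High</th></tr>
  <tr><td>Greene West</td><td>67.7</td><td>10:19 AM</td></tr>
  <tr><td>Cleveland</td><td>70.8</td><td>10:14 AM</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per matching table.
df = pd.read_html(io.StringIO(SAMPLE_HTML), attrs={"id": "MainContent_GridView1"})[0]

# Index by the (assumed) station column so rows can be looked up by name.
by_station = df.set_index("Station")
print(by_station.loc["Greene West", "Low Temp"])
```

Unlike the dictionary approach, pandas infers column types, so a numeric column such as the low temperature can be used in calculations without manual conversion.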
