Scraping data from a website to get a table, but I get an empty table

Posted 2024-05-13 11:10:30


I'm trying to scrape the table from this website, but I get an empty csv file containing only the header.

I tried the code from this post. I don't know what's happening with my code or why it returns an empty table.

My code:

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.peaklist.org/WWlists/WorldTop50.html"
r = requests.get(url)
data= r.text

soup=BeautifulSoup(data,"html.parser")
scripts = soup.find_all("script")
file_name = open("table.csv","w",newline="")
writer = csv.writer(file_name)

list_to_write = []

list_to_write.append(["Summit Name", "Country", "Lat.", "Long.", "Elevation mtrs.", "Prom. mtrs.", "Saddle mtrs.", "Saddle Location", "Elevation ft.", "Prom. ft.", "Notes", "Aerial Photo" ])

for script in scripts:
   text = script.text
   start = 0
   end = 0
   if(len(text) > 10000):
       while(start > -1):
           start = text.find('"Summit Name":"',start)
           if(start == -1):
               break
           start += len('"Summit Name":"')
           end = text.find('"',start)
           summit_name = text[start:end]


           start = text.find('"Country":"',start)
           start += len('"Country":"')
           end = text.find('"',start)
           country = text[start:end]

           start = text.find('"Lat.":"',start)
           start += len('"Lat.":"')
           end = text.find('"',start)
           lat = text[start:end]

           start = text.find('"Long.":"',start)
           start += len('"Long.":"')
           end = text.find('"',start)
           long = text[start:end]

           start = text.find('"Elevation mtrs.":"',start)
           start += len('"Elevation mtrs.":"')
           end = text.find('"',start)
           elevation = text[start:end]

           start = text.find('"Prom. mtrs.":"',start)
           start += len('"Prom. mtrs.":"')
           end = text.find('"',start)
           prom = text[start:end]

           start = text.find('"Saddle mtrs.":"',start)
           start += len('"Saddle mtrs.":"')
           end = text.find('"',start)
           saddle = text[start:end]

           start = text.find('"Saddle Location":"',start)
           start += len('"Saddle Location":"')
           end = text.find('"',start)
           saddle_loc = text[start:end]

           start = text.find('"Elevation ft.":"',start)
           start += len('"Elevation ft.":"')
           end = text.find('"',start)
           elevation_ft = text[start:end]

           start = text.find('"Prom. ft.":"',start)
           start += len('"Prom. ft.":"')
           end = text.find('"',start)
           prom_ft = text[start:end]

           start = text.find('"Notes":"',start)
           start += len('"Notes":"')
           end = text.find('"',start)
           notes = text[start:end]

           start = text.find('"Aerial Photo":"',start)
           start += len('"Aerial Photo":"')
           end = text.find('"',start)
           aerial = text[start:end]

           list_to_write.append([summit_name,country,lat,long,elevation,prom,saddle,saddle_loc,elevation_ft,prom_ft,notes,aerial])
writer.writerows(list_to_write)
file_name.close()

I get no error message from this code, just an empty table, so I suspect this approach doesn't recognize the table data on the site.
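To see why the loop appends nothing: on this page the summit data sits in ordinary `<table>` markup, not in a `<script>` JSON blob, so `text.find('"Summit Name":"')` returns -1 on the first pass and the while loop exits immediately. A minimal sketch with made-up stand-in markup (the sample HTML below is invented, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: the data lives in a <table>,
# and no <script> block contains the JSON-style marker.
html = """
<html><body>
<script>var x = 1;</script>
<table><tr><td>Everest</td><td>Nepal/China</td></tr></table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# The marker the question's loop searches for never appears in a script,
# so find() returns -1 right away and no row is ever appended.
scripts = soup.find_all("script")
found = any('"Summit Name":"' in s.text for s in scripts)
print(found)  # False

# The same data is reachable through ordinary table parsing:
cells = [td.text for td in soup.find_all("td")]
print(cells)  # ['Everest', 'Nepal/China']
```

This is the quick diagnostic to run before writing a string-searching scraper: confirm where in the document the data actually lives.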

Thanks


2 Answers

Your problem lies in how you iterate over the data. In the future, when writing a program, try to reason about it logically, for example by using the correct tag name in find_all and iterating over the scraped data in the most fault-tolerant way.

This script isn't perfect, but I think it can guide you toward a better understanding of how to scrape. Check the comments to see what the code does.

import requests
from bs4 import BeautifulSoup

url = "http://www.peaklist.org/WWlists/WorldTop50.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data,"html.parser")

to_csv = [["Summit Name", "Country", "Lat.", "Long.", "Elevation mtrs.", "Prom. mtrs.", "Saddle mtrs.", "Saddle Location", "Elevation ft.", "Prom. ft.", "Notes", "Aerial Photo" ]]

table = soup.find_all('table')[1] # Choose the second table on the page
rows = table.find_all('tr') # Get all table rows from our table element

del rows[0] # Remove the first row which is the table heading (we already have it)

for row in rows:
    tmp = [] # We're gonna add our column data in this list
    columns = row.find_all('td') # Find all columns in the row
    for column in columns:
        tmp.append(column.text.strip()) # strip() is used to remove extra space from the text
    
    to_csv.append(tmp) # Append this list to the main list

with open('output.csv', 'w') as csvfile:
    for row in to_csv: # For each list in the main list
        line = ','.join(row) # Join the column data in the row
        csvfile.write(line + '\n') # Write the line, ending it with a newline
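One caveat with joining fields by hand: a field that itself contains a comma will shift every later column. The stdlib csv module quotes such fields automatically; a small sketch with invented rows:

```python
import csv
import io

rows = [
    ["Summit Name", "Country"],
    ["Everest", "Nepal, China"],  # comma inside a field
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(rows)  # the field containing a comma gets quoted
print(buf.getvalue())
```

Using csv.writer on the open file object would give the same protection in the answer's script.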

Thanks everyone for the tips. The following code works well:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://www.peaklist.org/WWlists/WorldTop50.html"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.find_all("tr")
print("Number of rows on site: ",len(gdp))
body_rows = gdp[2:]

all_rows = []

for row_num in range(len(body_rows)): 
    row = [] 
    for row_item in body_rows[row_num].find_all("td"): 
        aa = re.sub("(\xa0)|(\n)|,","",row_item.text)
        row.append(aa)
        
    all_rows.append(row)

df = pd.DataFrame(data=all_rows)

I inspected the html code beforehand, which is why the rows fed to the DataFrame start at index 2 of gdp.
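As a side note, pandas can also parse HTML tables directly with read_html (it needs lxml or html5lib installed). A sketch on a small invented sample, since the live page's markup may change; against the real page you would pass the fetched HTML the same way:

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for the page's table markup.
html = """
<table>
  <tr><th>Summit Name</th><th>Country</th></tr>
  <tr><td>Everest</td><td>Nepal/China</td></tr>
  <tr><td>K2</td><td>Pakistan/China</td></tr>
</table>
"""
tables = pd.read_html(io.StringIO(html))  # returns a list of DataFrames
df = tables[0]  # header row is inferred from the <th> cells
print(df)
```

This replaces the manual tr/td loop and the regex cleanup with a single call, which is often the most fault-tolerant route for plain HTML tables.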
