Retrieving data from URLs listed in a CSV file in Python

import requests
import re
from bs4 import BeautifulSoup
import csv

#Read csv
csvfile = open("gymsfinal.csv")
csvfilelist = csvfile.read()

#Get data from each url
def get_page_data():
    for page_data in csvfilelist:
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup

pages = get_page_data()
'''print pages'''

#The work performed on scraped data
print soup.find("span",{"class":"wlt_shortcode_TITLE"}).text
print soup.find("span",{"class":"wlt_shortcode_map_location"}).text
print soup.find("span",{"class":"wlt_shortcode_phoneNum"}).text
print soup.find("span",{"class":"wlt_shortcode_EMAIL"}).text

th = soup.find('b',text="Category")
td = th.findNext()
for link in td.findAll('a',href=True):
    match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
    if match:
        print link.text

gyms = [name,address,phoneNum,email]
gym_data_list.append(gyms)

#Saving specific listing data to csv
with open ("xgyms.csv", "wb") as file:
    writer = csv.writer(file)
    for row in gym_data_list:
        writer.writerow(row)

1 answer

User

#1 · Posted 2024-06-11 17:44:07

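There are a few problems here. First, you never close the first file object, which is a big no-no. You should read the csv with the with syntax, which you already use at the bottom of your snippet.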

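You get the error requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h? because when you read in the csv, you read it as one big string, newlines included. So when you iterate over it with for page_data in csvfilelist:, the loop walks through every character of the string (strings are iterable in Python). A single character such as 'h' is obviously not a valid url, so requests raises an exception. When you read in your file, it should look like this: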

with open('gymsfinal.csv') as f:
    reader = csv.reader(f)
    csvfilelist = [ row[0] for row in reader ]
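The difference between the two ways of reading can be seen without touching the network. The following is a minimal sketch, assuming the csv holds one url per line in its first column; the example.com urls and the in-memory io.StringIO buffer are stand-ins for the real gymsfinal.csv.

```python
import csv
import io

# Hypothetical file contents: one url per line (stand-in for gymsfinal.csv).
raw = "http://example.com/a\nhttp://example.com/b\n"

# Iterating over the raw string yields single characters,
# which is where the invalid url 'h' came from.
chars = [c for c in raw][:4]

# csv.reader yields one list per row; the first column holds the url.
# Empty rows are skipped so that row[0] cannot raise IndexError.
urls = [row[0] for row in csv.reader(io.StringIO(raw)) if row]

print(chars)
print(urls)
```

The `if row` guard is an addition to the snippet above; it only matters if the file contains blank lines.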

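You should also change how get_page_data() returns its results. At the moment the return sits inside the loop, so the function exits on the first iteration and you only ever get back the first soup. To make it return a generator over all the soups, just change return to yield.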
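The effect of swapping return for yield can be sketched as follows. This is an illustration rather than the original code: fake_fetch is a hypothetical stand-in for the requests.get plus BeautifulSoup step, and the urls are made up, so the example runs offline.

```python
def first_only(urls, fetch):
    """Mirrors the original function: return exits on the first iteration."""
    for url in urls:
        return fetch(url)

def all_pages(urls, fetch):
    """Generator version: yield hands back one result per url."""
    for url in urls:
        yield fetch(url)

def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url) followed by BeautifulSoup parsing.
    return "<page for %s>" % url

urls = ["http://example.com/a", "http://example.com/b"]

single = first_only(urls, fake_fetch)      # only the first page
pages = list(all_pages(urls, fake_fetch))  # every page

print(single)
print(pages)
```

A generator is lazy and can be consumed only once, so each page is fetched when the loop reaches it; wrap the call in list() if the results need to be reused.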

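Your print statements will also be a problem. They need to go either inside a for loop that looks like for soup in pages:, or inside get_page_data(). Where those prints currently sit, the variable soup is not defined.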
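One way to arrange this is to do the per-page work inside the loop and collect one row per page. The sketch below is written under loud assumptions: plain dicts stand in for the BeautifulSoup objects, the gym names and phone numbers are invented, and an io.StringIO buffer replaces xgyms.csv so that nothing is written to disk.

```python
import csv
import io

def rows_from_pages(pages, extract):
    """Yield one csv row per page; soup is only defined inside this loop."""
    for soup in pages:
        yield extract(soup)

def extract(page):
    # With a real soup this would be calls such as
    # soup.find("span", {"class": "wlt_shortcode_TITLE"}).text
    return [page["title"], page["phone"]]

# Hypothetical parsed pages (dicts instead of BeautifulSoup objects).
pages = iter([
    {"title": "Gym A", "phone": "555-0100"},
    {"title": "Gym B", "phone": "555-0101"},
])

# On Python 3, a real file would be opened with open("xgyms.csv", "w", newline="").
out = io.StringIO()
writer = csv.writer(out)
for row in rows_from_pages(pages, extract):
    writer.writerow(row)

csv_text = out.getvalue()
print(csv_text)
```

Keeping the extraction in its own function also makes it easier to handle pages where soup.find returns None because a field is missing.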
