How to convert a .php HTML page to CSV in Python

Posted 2024-04-28 12:31:03

2 answers

Using requests and lxml:

import requests
from lxml.html import fromstring
# In lxml 5.2 and later, lxml.html.clean is provided by the separate lxml_html_clean package
from lxml.html.clean import Cleaner


# download response
response = requests.get('http://water.weather.gov/ahps2/crests.php?wfo=lch&gage=bsll1&crest_type=historic')
html = response.text

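You now have the raw HTML text, and the tags need to be stripped out. Here we use lxml, a Python library for processing HTML/XML. The fromstring function parses a string into an element.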

doc = fromstring(html)  # parse the HTML string into an element tree

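Decide which tags to remove. The Cleaner class removes problematic markup from an HTML document, so we create a Cleaner object and pass it the class attributes to blacklist, along with the tags to remove. See the lxml Cleaner class documentation for the default value of each attribute. Note that remove_tags strips only the tags themselves, not their content.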
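To illustrate the difference between stripping a tag and dropping its content, here is a minimal sketch using only the standard library's html.parser. It is a stand-in for the idea, not lxml's Cleaner; the class name `TextExtractor`, the helper `strip_tags`, and the sample markup are all made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep text and drop tags; skip the content of tags listed in `kill`."""

    def __init__(self, kill=("script", "style")):
        super().__init__()
        self.kill = set(kill)
        self.depth = 0   # greater than 0 while inside a killed element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.kill:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.kill and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside killed elements is kept even though its tags are dropped.
        if not self.depth:
            self.parts.append(data)


def strip_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


# Hypothetical markup, not taken from the real page.
sample = "<div><h1>Crests</h1><script>var x = 1;</script><p>12.5 ft</p></div>"
print(strip_tags(sample))
```

The heading and paragraph text survive while the script body is discarded, which roughly mirrors stripping tags (content kept) versus removing an element entirely.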

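# NOTE: the definition of `args` is missing from this copy of the answer.
# The values below are an illustrative assumption, not the answerer's settings.
tags = ['h1', 'h2', 'h3', 'div', 'span']  # tags to strip (their text is kept)
args = {'scripts': True, 'style': True, 'remove_tags': tags}
cleaner = Cleaner(**args)
path = '/html/body'
body = doc.xpath(path)[0]  # only interested in the body of the response
clean_response = cleaner.clean_html(body).text_content()  # clean!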

# split into lines.
table = clean_response.splitlines()

# parse the lines whichever way you wish
# your code here
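One possible way to fill in the parsing step, sketched with the standard csv module: drop blank lines, split each remaining line on whitespace, and serialize the rows as CSV. The helper names `lines_to_rows` and `rows_to_csv` and the sample lines are hypothetical; the real page's layout may need a different split rule.

```python
import csv
import io


def lines_to_rows(lines):
    """Drop blank lines and split each remaining line on whitespace."""
    rows = []
    for line in lines:
        fields = line.split()
        if fields:
            rows.append(fields)
    return rows


def rows_to_csv(rows):
    """Serialize rows to a CSV string (use open(..., newline='') for a file)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical stand-in for `table`; the real lines come from splitlines().
sample_table = ["", "  22.40 ft  05/19/1953 ", "   ", "21.10 ft 11/02/1985"]
print(rows_to_csv(lines_to_rows(sample_table)))
```

Splitting on whitespace is the simplest choice and breaks down if a field itself contains spaces, in which case a regular expression is probably a better fit.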

The process of extracting data from a website is called web scraping.

This code may help you:

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; the original answer used Python 2's urllib2

url = 'http://water.weather.gov/ahps2/crests.php?wfo=lch&gage=bsll1&crest_type=historic'
# read the html page using urlopen()
r = urlopen(url).read()

#create soup to navigate through tags
soup = BeautifulSoup(r, 'lxml')

# find the div element whose class is water_information
results = soup.find('div', {'class':'water_information'})

#get only text from the results soup
water_data = results.text

#write this info to an output file
with open('outputfile.txt', 'w') as f:
    f.write(water_data)

Here is a sample of the contents of my outputfile.txt:

[sample output not preserved in the source]

Now you can easily process the water_data string with a regex or split() to create your own CSV file.
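As a rough sketch of the regex route, the snippet below pulls records out of a text blob with named groups and writes them with csv.DictWriter. The line format "(1) 22.40 ft on 05/19/1953", the pattern `CREST_RE`, the column names, and the helper `water_data_to_csv` are all assumptions for illustration; adjust the pattern to whatever your outputfile.txt actually contains.

```python
import csv
import io
import re

# Assumed record format: "(rank) stage ft on MM/DD/YYYY" (hypothetical).
CREST_RE = re.compile(
    r"\((?P<rank>\d+)\)\s+(?P<stage_ft>\d+(?:\.\d+)?)\s*ft\s+on\s+"
    r"(?P<date>\d{2}/\d{2}/\d{4})"
)


def water_data_to_csv(text):
    """Return CSV text with one row per record matched in `text`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["rank", "stage_ft", "date"], lineterminator="\n"
    )
    writer.writeheader()
    for match in CREST_RE.finditer(text):
        writer.writerow(match.groupdict())
    return buffer.getvalue()


# Hypothetical stand-in for `water_data`.
sample = "Historic Crests\n(1) 22.40 ft on 05/19/1953\n(2) 21.10 ft on 11/02/1985\n"
print(water_data_to_csv(sample))
```

Lines that do not match the pattern, such as the heading here, are simply skipped, so it is worth checking the row count against the source before trusting the result.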

You didn't expect me to write it for you, did you? :P
