使用BeautifulSoup解析表格并写入文本文件
我需要从一个文本文件(output.txt)里的表格中获取数据,格式是这样的:
数据1;数据2;数据3;数据4;.....
比如说:公寓的总面积;33米;电梯;是;楼层;底层;.....;产权形式;个人
所有数据都要在一行里,数据之间用;分隔(之后会导出为csv文件)。
我还是个初学者.. 请帮帮我,谢谢。
from BeautifulSoup import BeautifulSoup
import urllib2
import codecs
response = urllib2.urlopen('http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever')
html = response.read()
soup = BeautifulSoup(html)
tabulka = soup.find("table", {"class" : "detail-char"})
for row in tabulka.findAll('tr'):
col = row.findAll('td')
prvy = col[0].string.strip()
druhy = col[1].string.strip()
record = ([prvy], [druhy])
fl = codecs.open('output.txt', 'wb', 'utf8')
for rec in record:
line = ''
for val in rec:
line += val + u';'
fl.write(line + u'\r\n')
fl.close()
2 个回答
0
这里有一个简单直接的方法,专门针对你的任务。
store=[] #to store your results
url="""http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever"""
page=urllib2.urlopen(url)
data=page.read()
for table in data.split("</table>"):
if "<table" in table and 'class="detail-char' in table:
for item in table.split("</td>"):
if "<td" in item:
store.append(item.split(">")[-1].strip())
print ','.join(store)
输出结果
$ ./python.py
Celková podlahová plocha bytu,33 m2,Výťah,Áno,Nadzemné podlažie,Prízemné podlažie,Stav,Čiastočná rekonštrukcia,Konštrukcia bytu,tehlová,Forma vlastníctva,osobné
15
你在读取每条记录的时候并没有把它们保存下来。试试这个方法,它会把记录存储在 records
里:
from BeautifulSoup import BeautifulSoup
import urllib2
import codecs
response = urllib2.urlopen('http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever')
html = response.read()
soup = BeautifulSoup(html)
tabulka = soup.find("table", {"class" : "detail-char"})
records = [] # store all of the records in this list
for row in tabulka.findAll('tr'):
col = row.findAll('td')
prvy = col[0].string.strip()
druhy = col[1].string.strip()
record = '%s;%s' % (prvy, druhy) # store the record with a ';' between prvy and druhy
records.append(record)
fl = codecs.open('output.txt', 'wb', 'utf8')
line = ';'.join(records)
fl.write(line + u'\r\n')
fl.close()
这个代码可以再简化一下,但我觉得这就是你想要的效果。