Writing to a csv file after BeautifulSoup
I am using BeautifulSoup to extract some text, and then I want to save it to a csv file. My code is as follows:
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    saveFile = open("some.csv", "a")
    saveFile.write(str(tdTags_string) + ",")
    saveFile.close()

saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()
Most of the time this does what I want, but there is one problem: if an entry contains a comma (","), the comma is treated as a separator and the entry is split across two cells (which is not what I want). So I searched around and found suggestions to use the csv module, and changed my code to:
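As a quick illustration of why the csv module fixes this (a Python 3 sketch using an in-memory buffer instead of a real file): the writer quotes any field that contains the delimiter, so a comma inside a value no longer splits the cell:

```python
import csv
import io

# csv.writer quotes a field that contains the delimiter, so the comma
# inside "Smith, John" stays within a single cell.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Smith, John", "California"])
print(buf.getvalue())  # "Smith, John",California
```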
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    print tdTags_string
    with open("some.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow(tdTags_string)

saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()
This made things worse: now every single letter/digit of every word or number occupies its own cell in the csv file. For example, if the entry is "Cat", then "C" is in one cell, "a" in the next, "t" in a third, and so on.
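This behaviour is easy to reproduce in isolation (a Python 3 sketch with an in-memory buffer): writerow iterates over whatever it is given, so a bare string is split into characters, while a one-element list yields a single cell:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow("Cat")    # a string is iterated character by character
writer.writerow(["Cat"])  # wrapping it in a list keeps it in one cell
print(buf.getvalue())
```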
Edited version:
import urllib2
import re
import csv
from bs4 import BeautifulSoup
SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()
# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()
# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    with open("SomeSite.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow([tdTags_string])
Second version:
placeHolder = []
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    placeHolder.append(tdTags_string)

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
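For illustration (Python 3, in-memory buffer): passing the whole list to writerow produces a single row whose cells are the list items, which is why this version puts all the values side by side instead of one per line:

```python
import csv
import io

# One writerow(list) call -> one row with one cell per list item.
buf = io.StringIO()
csv.writer(buf).writerow(["stuff1", "stuff2", "stuff3"])
print(buf.getvalue())  # stuff1,stuff2,stuff3
```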
Updated output:
u'stuff1'
u'stuff2'
u'stuff3'
Sample output:
u'record1' u'31 Mar 1901' u'California'
u'record1' u'31 Mar 1901' u'California'
record1 31-Mar-01 California
Another edited version (still one problem - it skips a row, see below):
import urllib2
import re
import csv
from bs4 import BeautifulSoup
SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()
# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()
# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")
placeHolder = []
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    #print repr(tdTags_string)
    placeHolder.append(tdTags_string.rstrip('\n'))

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
2 Answers
For the recent problem of a skipped row, I found a fix. Instead of using
with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
use this:
with open("SomeSite.csv", "ab") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
Source: https://docs.python.org/3/library/functions.html#open. The "a" mode is append mode, while "ab" opens the file for appending in binary mode, which gets rid of the extra skipped row.
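Note that the binary "ab" trick applies to Python 2's csv module, which this question uses (urllib2, print statement). If the code is ever ported to Python 3, the equivalent fix per the csv documentation is to stay in text mode and pass newline="", so the csv module controls line endings itself. A minimal Python 3 sketch (writing to a temporary directory for illustration):

```python
import csv
import os
import tempfile

# Python 3 version of the fix: text mode plus newline="" prevents the
# extra blank row between records on Windows.
path = os.path.join(tempfile.mkdtemp(), "SomeSite.csv")
with open(path, "a", newline="") as f:
    csv.writer(f).writerow(["record1", "31 Mar 1901", "California"])

with open(path, newline="") as f:
    print(f.read())  # record1,31 Mar 1901,California
```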
with open("some.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow([tdTags_string])  # put in a list
writeFile.writerow iterates over whatever you pass it, so if you pass the string "foo", it splits it into three separate values: f, o, o. Putting the string inside a list avoids this, because the writer then iterates over the list rather than over the string.
You should also open the file just once, before the loop starts, rather than reopening it on every iteration:
with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.get_text(strip=True)
        writeFile.writerow([tdTags_string])
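The same open-once pattern, sketched with dummy strings standing in for the scraped <td> text and an in-memory buffer so it runs without the site: the writer is created a single time and reused across the loop, producing one row per value:

```python
import csv
import io

rows = ["stuff1", "stuff2", "stuff3"]  # stand-ins for tdTags_string values
buf = io.StringIO()
writer = csv.writer(buf)  # created once, reused for every row
for value in rows:
    writer.writerow([value])
print(buf.getvalue())
```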