Formatting text output with Scrapy in Python
I'm trying to use Scrapy to scrape web pages and save each page to a readable .txt file. The code I'm using is:
def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)
    hxs = HtmlXPathSelector(response)
    title = hxs.select('/html/head/title/text()').extract()
    content = hxs.select('//*[@id="content"]').extract()
    texts = "%s\n\n%s" % (title, content)
    soup = BeautifulSoup(''.join(texts))
    strip = ''.join(BeautifulSoup(pretty).findAll(text=True))
    filename = ("/Users/username/path/output/Hansard-" + '%s'".txt") % (title)
    filly = open(filename, "w")
    filly.write(strip)
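I'm using BeautifulSoup alongside Scrapy because the main body of the page contains a lot of HTML I don't want (mostly links), so I use BeautifulSoup to strip the HTML and keep only the text I care about.
The resulting output looks like this: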
[u"School, Chandler's Ford (Hansard, 30 November 1961)"]
[u'
\n \n
HC Deb 30 November 1961 vol 650 cc608-9
\n
608
\n
\n
\n
\n
\xa7
\n
28.
\n
Dr. King
\n
\n asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler\'s Ford; and why he refused permission to acquire this site in 1954.\n
\n
\n
\n \n
\n
\n
\n
\xa7
\n
Sir D. Eccles
\n
\n I understand that the authority has paid \xa375,000 for this site.\n \n
But I'd like the output to look like this:
School, Chandler's Ford (Hansard, 30 November 1961)
HC Deb 30 November 1961 vol 650 cc608-9
608
28.
Dr. King asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler's Ford; and why he refused permission to acquire this site in 1954.
Sir D. Eccles I understand that the authority has paid £375,000 for this site.
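So I'm mainly asking how to get rid of the \n newline markers, tighten up the whitespace, and convert any special characters to their normal form.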
1 Answer
My answer is in the comments in the code:
import re
import codecs
# ...
# ...
# extract() returns a list, so take the first element
title = hxs.select('/html/head/title/text()').extract()[0]
content = hxs.select('//*[@id="content"]')
# instead of using BeautifulSoup for this task, you can use the following
content = content.select('string()').extract()[0]
# collapse each run of spaces and newlines into one character; you may need to adjust this expression
# (ur'' literals are Python 2 only; in Python 3 use r'' instead)
cleaned_content = re.sub(ur'(\s)\s+', ur'\1', content, flags=re.MULTILINE | re.UNICODE)
texts = "%s\n\n%s" % (title, cleaned_content)
# looks like there is a typo in the filename creation
# filename ....
# and my preferred way to write a file with an encoding
with codecs.open(filename, 'w', encoding='utf-8') as output:
    output.write(texts)
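If you want to try the cleanup step outside of a spider, here is a minimal Python 3 sketch of the same idea. It uses the same regular expression as above, keeping the first whitespace character of each run. The helper names `clean_text` and `save_page`, the `out_dir` parameter, and the `Hansard-<title>.txt` naming pattern (taken from the question's intent) are assumptions for illustration, not part of Scrapy or of the original code.

```python
import os
import re


def clean_text(raw):
    """Collapse each run of two or more whitespace characters.

    Only the first character of the run is kept, so a run that starts
    with a newline becomes a single newline and a run that starts with
    a space becomes a single space.
    """
    collapsed = re.sub(r'(\s)\s+', r'\1', raw)
    # Drop leading and trailing whitespace left over from the markup.
    return collapsed.strip()


def save_page(title, body, out_dir):
    """Write the title and the cleaned body to a UTF-8 text file.

    Hypothetical helper: the file name pattern is an assumption, and a
    title containing a path separator would need sanitizing first.
    """
    filename = os.path.join(out_dir, "Hansard-%s.txt" % title)
    text = "%s\n\n%s" % (title, clean_text(body))
    # An explicit encoding writes characters such as the pound sign
    # as real characters instead of failing or being escaped.
    with open(filename, "w", encoding="utf-8") as output:
        output.write(text)
    return filename
```

Because only the first character of each run is kept, a speaker's name and the speech that follows it may still end up on separate lines. If you want them joined, the expression is the part to adjust.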