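Scrapy: extract from multiple elements and POST as a JSON array
I'm scraping a weather website and need to extract the comments from a table cell, then send them as a JSON array to a remote API.
Here is the structure of the page: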
<td>
<p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
<p>Temperature is very warm (> 60 degrees C / 140 degrees F).</p>
<p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
</td>
This is the code I'm using:
comments = []
cmnts = sel.xpath('td//p/text()').extract()
for cmnt in cmnts:
    comments.append(cmnt)
item['comments'] = comments
r = requests.post(api_url, data = json.dumps(dict(item)))
This sort of works, but the output contains a lot of "\r\n" strings, and everything after the "<" sign is dropped. Here is the output of the code above:
[
"Temperature is cold (\r\n \r\n ",
"Temperature is very warm (> 60 degrees C / 140 degrees F).",
"Temperature is cold (\r\n \r\n "
]
Is there a way to get a "clean" (that is, free of line breaks) and "encoded" result array?
1 Answer
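As @alecxe mentioned in the comments above, lxml's default parser does not handle some HTML content well, such as the unescaped "<" in this page's text. The fix is to use a more lenient parser, such as BeautifulSoup or html5lib.
In fact, lxml can use different parsers while still offering the same XPath methods.
Using the BeautifulSoup parser: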
In [1]: from lxml.html import soupparser, html5parser
In [2]: html = """<td>
<p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
<p>Temperature is very warm (> 60 degrees C / 140 degrees F).</p>
<p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
</td>
"""
In [3]: doc = soupparser.fromstring(html)
In [4]: for p in doc.xpath('//p'):
   ...:     print(p.xpath('normalize-space()'))
   ...:
Temperature is cold (< 4 degrees C / 40 degrees F).
Temperature is very warm (> 60 degrees C / 140 degrees F).
Temperature is cold (< 4 degrees C / 40 degrees F).
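For reference, XPath's normalize-space() strips leading and trailing whitespace and collapses each internal run of whitespace into a single space, which is why the "\r\n" sequences disappear. If the strings have already been extracted, a rough pure-Python equivalent might look like the sketch below. The helper name normalize_space is hypothetical, and str.split() treats a few more characters as whitespace than XPath does, so this is an approximation and not an exact reimplementation.

```python
def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    # str.split() with no arguments splits on any run of whitespace
    # (including \r and \n) and discards leading/trailing empty pieces.
    return " ".join(text.split())


# Illustrative input with stray line breaks and padding.
raw = "  Temperature is cold\r\n   (< 4 degrees C / 40 degrees F).\r\n "
cleaned = normalize_space(raw)
print(cleaned)
```

Note that this only cleans up whitespace; it cannot bring back text that the parser already dropped, so the lenient parser is still needed.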
Using the html5lib parser (you need to add the XHTML namespace in your XPath calls):
In [5]: doc = html5parser.fromstring(html)
In [6]: for p in doc.xpath('//xhtml:p', namespaces={"xhtml": "http://www.w3.org/1999/xhtml"}):
   ...:     print(p.xpath('normalize-space()'))
   ...:
Temperature is cold (< 4 degrees C / 40 degrees F).
Temperature is very warm (> 60 degrees C / 140 degrees F).
Temperature is cold (< 4 degrees C / 40 degrees F).
With that, your Scrapy callback code can become:
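# At the top of the spider module:
import json
import requests
from lxml.html import soupparser

# Inside the callback:
doc = soupparser.fromstring(response.body)
# doc is the document root, so search with '//td//p'; narrow the path if the page has other tables
comments = []
for cmnt in doc.xpath('//td//p'):
    comments.append(cmnt.xpath('normalize-space(.)'))
item['comments'] = comments
r = requests.post(api_url, data=json.dumps(dict(item)))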
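If the receiving API expects a JSON body, it may also help to declare the content type explicitly. The sketch below uses only the standard library to build (not send) such a request, so the payload can be inspected first. The helper name build_json_post and the URL are hypothetical placeholders, and whether the API actually requires the header is an assumption.

```python
import json
import urllib.request


def build_json_post(url, item):
    """Build (but do not send) a POST request carrying `item` as JSON."""
    body = json.dumps(item).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL and data, for illustration only.
item = {"comments": [
    "Temperature is cold (< 4 degrees C / 40 degrees F).",
    "Temperature is very warm (> 60 degrees C / 140 degrees F).",
]}
req = build_json_post("http://example.com/api", item)
print(req.get_method())
print(req.data.decode("utf-8"))
```

With requests, passing json=dict(item) to requests.post instead of data=json.dumps(dict(item)) should serialize the payload and set the same header in one step.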