Scrapy: Extract from multiple elements and POST as a JSON array

1 vote
1 answer
538 views
Asked 2025-04-18 16:54

I'm scraping a weather website and need to extract the comments from a table cell, then send them to a remote API as a JSON array.

Here is the structure of the page:

<td>
    <p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
    <p>Temperature is very warm (> 60 degrees C / 140 degrees F).</p>
    <p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
</td>

This is the code I'm using:

comments = []
cmnts = sel.xpath('td//p/text()').extract()

for cmnt in cmnts:
    comments.append(cmnt)

item['comments'] = comments

r = requests.post(api_url, data = json.dumps(dict(item)))

This partly works, but the output contains a lot of "\r\n" strings, and everything after a "<" sign is dropped. Here is the output of the code above:

[
   "Temperature is cold (\r\n \r\n ",
   "Temperature is very warm (> 60 degrees C / 140 degrees F).",
   "Temperature is cold (\r\n \r\n "
]

Is there a way to get a "clean" (that is, free of line breaks) and "encoded" array of results?

1 Answer

3

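As @alecxe mentioned in the comments above, lxml's default parser does not handle some HTML content well; here, the unescaped "<" inside the text is what trips it up. The fix is to use a more lenient parser, such as BeautifulSoup or html5lib.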

In fact, lxml can be backed by a different parser while still offering the same XPath methods.

With the BeautifulSoup parser:

In [1]: from lxml.html import soupparser, html5parser

In [2]: html = """<td>
    <p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
    <p>Temperature is very warm (> 60 degrees C / 140 degrees F).</p>
    <p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
</td>
"""

In [3]: doc = soupparser.fromstring(html)

In [4]: for p in doc.xpath('//p'):
   ...:     print(p.xpath('normalize-space()'))
   ...:
Temperature is cold (< 4 degrees C / 40 degrees F).
Temperature is very warm (> 60 degrees C / 140 degrees F).
Temperature is cold (< 4 degrees C / 40 degrees F).

With the html5lib parser (you need to add the XHTML namespace to your XPath calls):

In [5]: doc = html5parser.fromstring(html)

In [6]: for p in doc.xpath('//xhtml:p', namespaces={"xhtml": "http://www.w3.org/1999/xhtml"}):
   ...:     print(p.xpath('normalize-space()'))
   ...:
Temperature is cold (< 4 degrees C / 40 degrees F).
Temperature is very warm (> 60 degrees C / 140 degrees F).
Temperature is cold (< 4 degrees C / 40 degrees F).
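For reference, XPath's normalize-space() strips leading and trailing whitespace and collapses each internal run of whitespace into a single space, which is what gets rid of the "\r\n" sequences. A rough pure-Python equivalent might look like the sketch below. The helper name `normalize_space` is made up for illustration, and `str.split()` treats a few more characters as whitespace than XPath does, so this is an approximation rather than an exact match.

```python
def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops empty strings, so joining with one space collapses the runs.
    return " ".join(text.split())


# Sample input with the kind of stray line breaks seen in the question
raw = "  Temperature is cold\r\n   (< 4 degrees C / 40 degrees F).\r\n "
print(normalize_space(raw))
```

This can be handy when the text has already been extracted as plain strings and re-running an XPath expression is not convenient.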

With that, your Scrapy callback code could become:

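import json
import requests
from lxml.html import soupparser

# Re-parse the raw response with the more lenient BeautifulSoup parser
doc = soupparser.fromstring(response.body)

# normalize-space() trims each paragraph and collapses the "\r\n" runs;
# '//td//p' searches the whole document, not just children of the root
comments = [p.xpath('normalize-space(.)') for p in doc.xpath('//td//p')]

item['comments'] = comments

r = requests.post(api_url, data=json.dumps(dict(item)))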
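
As a sketch of the final step, the cleaned list can be serialized and inspected before anything is sent. `build_payload` is a hypothetical helper, and the `Content-Type` header is an assumption about what the remote API expects, since the question does not say. Note that `json.dumps` leaves "<" and ">" as they are, because neither character needs escaping in JSON.

```python
import json


def build_payload(comments):
    """Return (body, headers) for posting the comments as a JSON array."""
    # Trim and collapse whitespace in each comment, then drop empty ones.
    cleaned = [" ".join(c.split()) for c in comments]
    cleaned = [c for c in cleaned if c]
    body = json.dumps({"comments": cleaned})
    headers = {"Content-Type": "application/json"}  # assumed API requirement
    return body, headers


body, headers = build_payload([
    "Temperature is cold (< 4 degrees C / 40 degrees F).\r\n ",
    " \r\n ",
    "Temperature is very warm (> 60 degrees C / 140 degrees F).",
])
print(body)
```

The resulting pair could then be passed along as requests.post(api_url, data=body, headers=headers).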
