lxml has trouble parsing a Stack Exchange RSS feed

In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds'
In [32]: cooking_content = urllib2.urlopen(url_cooking)
In [33]: cooking_parsed = lxml.etree.parse(cooking_content)
In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')
In [35]: cooking_texts
Out[35]: []

3 answers

User

#1 · edited 2024-04-19 17:51:58

Compare these two versions:

import lxml.html, lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

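As you already discovered, the lxml.etree version returns no nodes, while the lxml.html version works fine. The etree version fails because it expects the XPath to name the namespace; the html version works because it ignores namespaces. The documentation at http://lxml.de/lxmlhtml.html says: "The HTML parser notably ignores namespaces and some other XMLisms."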
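The difference between the two parsers can be reproduced without touching the network. The following is a minimal sketch, assuming a tiny hand-written Atom document; SAMPLE_FEED and its contents are hypothetical stand-ins, not the real feed.

```python
import lxml.etree
import lxml.html

# Hypothetical stand-in for the feed, kept tiny so the sketch runs offline.
SAMPLE_FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <summary>How long should I boil an egg?</summary>
  </entry>
</feed>"""

# XML parser: every element lives in the Atom namespace, so a path
# written with plain (unprefixed) names does not match them.
xml_root = lxml.etree.fromstring(SAMPLE_FEED)
xml_hits = xml_root.xpath('//feed/entry/summary')

# HTML parser: the xmlns declaration is ignored, so the same path
# matches the elements by their plain names.
html_root = lxml.html.document_fromstring(SAMPLE_FEED)
html_hits = html_root.xpath('//feed/entry/summary')

print('etree: %d node(s), html: %d node(s)' % (len(xml_hits), len(html_hits)))
```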

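Note: when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it is a feed element in the namespace http://www.w3.org/2005/Atom. A corrected etree version therefore has to pass that namespace to xpath() and prefix each element name in the path.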
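A namespace-aware version of the etree query might look like the sketch below. It is not the answerer's original code: the prefix name atom is an arbitrary choice, and SAMPLE_FEED is a hypothetical inline document used in place of the live URL so the example runs offline.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded feed.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry>
    <title>First question</title>
    <summary type="html">First summary</summary>
  </entry>
  <entry>
    <title>Second question</title>
    <summary type="html">Second summary</summary>
  </entry>
</feed>"""

# The prefix is arbitrary; only the namespace URI has to match the feed.
ATOM_NS = {'atom': 'http://www.w3.org/2005/Atom'}

root = etree.fromstring(SAMPLE_FEED)

# Every step of the path carries the prefix mapped to the Atom namespace.
summary_nodes = root.xpath('//atom:feed/atom:entry/atom:summary',
                           namespaces=ATOM_NS)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
```

With the real feed, the same xpath() call could be applied to the tree returned by lxml.etree.parse(url_cooking).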

User

#2 · edited 2024-04-19 17:51:58

The problem is namespaces.

Run this:

 cooking_parsed.getroot().tag

You will see that the element's tag includes the namespace:

{http://www.w3.org/2005/Atom}feed

The same is true if you navigate to one of the feed entries.
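Inspecting the tags can be sketched as follows. This is only an illustration on a hypothetical inline document (SAMPLE_FEED); etree.QName is used here as one convenient way to split a tag into its namespace and local name.

```python
from lxml import etree

# Hypothetical stand-in for the parsed feed.
SAMPLE_FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <summary>Some summary</summary>
  </entry>
</feed>"""

root = etree.fromstring(SAMPLE_FEED)

# The tag is written in Clark notation: {namespace-uri}localname
print(root.tag)

# QName splits the tag into its two parts.
qname = etree.QName(root)
print(qname.namespace)
print(qname.localname)

# Child elements inherit the default namespace declared on the root.
first_entry = root[0]
print(first_entry.tag)

# nsmap shows the prefix-to-URI mapping; the default namespace has key None.
print(root.nsmap)
```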

This means the correct XPath in lxml is:

print(cooking_parsed.xpath(
    "//a:feed/a:entry",
    namespaces={'a': 'http://www.w3.org/2005/Atom'}))
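To get from the entry elements to the summary text the question was after, the same prefix mapping can be reused for each lookup. The sketch below is one possible approach, assuming a hypothetical inline feed (SAMPLE_FEED) instead of the live one.

```python
from lxml import etree

# Hypothetical stand-in for the parsed feed.
SAMPLE_FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Boiling eggs</title>
    <summary>How long should I boil an egg?</summary>
  </entry>
  <entry>
    <title>Resting dough</title>
    <summary>Why does dough need to rest?</summary>
  </entry>
</feed>"""

NS = {'a': 'http://www.w3.org/2005/Atom'}

root = etree.fromstring(SAMPLE_FEED)
entries = root.xpath('//a:feed/a:entry', namespaces=NS)

# findtext() accepts the same prefix mapping as xpath().
cooking_texts = [entry.findtext('a:summary', namespaces=NS)
                 for entry in entries]

for text in cooking_texts:
    print(text)
```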

User

#3 · edited 2024-04-19 17:51:58

Try using BeautifulStoneSoup from the BeautifulSoup package. It might do the trick.
