Fetching arXiv XML data with Scrapy

from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector

class arXivSpider(BaseSpider):
    name = "arxiv"
    allowed_domains = ["arxiv.org"]
    start_urls = ["http://export.arxiv.org/rss/hep-th/recent"]

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        papers = xxs.select('//item')
        print papers

2 Answers

User

Answer 1 · edited 2024-04-25 00:44:49

It may have to do w/ namespaces...

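Yes. The items in this feed live in the default namespace http://purl.org/rss/1.0/, so an unprefixed '//item' matches nothing.

XmlXPathSelector can handle namespaces if you register them (see the examples in the documentation). In your case: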

$ scrapy shell http://export.arxiv.org/rss/hep-th/recent
In [1]: xxs.register_namespace('g', 'http://purl.org/rss/1.0/')

In [2]: xxs.namespaces
Out[2]: {'g': 'http://purl.org/rss/1.0/'}

In [3]: xxs.select('//item')
Out[3]: []

In [4]: xxs.select('//g:item')
Out[4]:
[<XmlXPathSelector xpath='//g:item' data=u'<item xmlns="http://purl.org/rss/1.0/" x'>,
 <XmlXPathSelector xpath='//g:item' data=u'<item xmlns="http://purl.org/rss/1.0/" x'>,
...
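
The same behaviour can be reproduced offline. The following is only a minimal sketch and not Scrapy code: it swaps in Python's standard-library xml.etree.ElementTree for XmlXPathSelector, and SAMPLE_FEED is a made-up stand-in that shares nothing with the real arXiv feed except its default namespace. The prefix 'g' is arbitrary, as in the shell session above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature RSS 1.0 document (not real arXiv data).
# Note the default namespace declared on the root element.
SAMPLE_FEED = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/feed">
    <title>Sample feed</title>
  </channel>
  <item rdf:about="http://example.org/abs/1">
    <title>First paper</title>
  </item>
  <item rdf:about="http://example.org/abs/2">
    <title>Second paper</title>
  </item>
</rdf:RDF>
"""

root = ET.fromstring(SAMPLE_FEED.strip())

# Map an arbitrary prefix to the feed's default namespace URI.
NS = {"g": "http://purl.org/rss/1.0/"}

# Without a prefix the query looks for <item> in *no* namespace: no matches.
unprefixed = root.findall(".//item")

# With the prefix the query matches the namespaced <item> elements.
prefixed = root.findall(".//g:item", NS)

# Child elements are in the same namespace, so they need the prefix too.
titles = [item.findtext("g:title", namespaces=NS) for item in prefixed]

print(unprefixed)
print(titles)
```

Every step of the path needs the prefix, not only the first one, because child elements such as title inherit the default namespace.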

User

Answer 2 · edited 2024-04-25 00:44:49

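I think you should experiment in the Scrapy shell. Start it with the feed URL, remove the namespaces, and then query with plain XPath:

$ scrapy shell "http://export.arxiv.org/rss/hep-th/recent"
In [1]: sel.remove_namespaces()
In [2]: a = sel.xpath('//title/text()')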
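
To illustrate what removing namespaces buys you, here is a rough sketch under stated assumptions. It does not call Scrapy; it only imitates the idea with the standard-library xml.etree.ElementTree. The helper strip_namespaces and the SAMPLE_FEED string are hypothetical and exist only for this illustration, and the helper handles element names only, which may differ from what Scrapy's remove_namespaces() does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature RSS 1.0 document (not real arXiv data).
SAMPLE_FEED = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.org/abs/1">
    <title>First paper</title>
  </item>
  <item rdf:about="http://example.org/abs/2">
    <title>Second paper</title>
  </item>
</rdf:RDF>
"""


def strip_namespaces(root):
    """Hypothetical helper: drop the '{uri}' part from every element tag.

    ElementTree stores a namespaced tag as '{uri}local'; keeping only
    'local' lets unprefixed paths match. Attribute names are left untouched.
    """
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


root = strip_namespaces(ET.fromstring(SAMPLE_FEED.strip()))

# After stripping, a plain path works without registering any prefix.
titles = [title.text for title in root.findall(".//item/title")]

print(titles)
```

The trade-off is that elements from different namespaces become indistinguishable once the namespaces are gone, so registering a prefix is the safer choice when a feed mixes vocabularies.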

