Robotparser doesn't seem to parse correctly

6 votes

5 answers

4024 views

Asked 2025-04-17 18:42

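I'm writing a crawler, and as part of it I'm implementing a robots.txt parser using robotparser from the standard library.

robotparser doesn't seem to parse properly. I'm debugging my crawler against Google's robots.txt.

(The following examples are from IPython.)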

In [1]: import robotparser

In [2]: x = robotparser.RobotFileParser()

In [3]: x.set_url("http://www.google.com/robots.txt")

In [4]: x.read()

In [5]: x.can_fetch("My_Crawler", "/catalogs") # This should return False, since it's on Disallow
Out[5]: False

In [6]: x.can_fetch("My_Crawler", "/catalogs/p?") # This should return True, since it's Allowed
Out[6]: False

In [7]: x.can_fetch("My_Crawler", "http://www.google.com/catalogs/p?")
Out[7]: False

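Oddly, sometimes it seems to "work" and sometimes it seems to fail. I also tried it against the robots.txt files from Facebook and Stackoverflow. Is this a bug in the robotparser module, or am I doing something wrong here? If so, what?

I was wondering whether this bug is related.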


5 Answers

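This is an interesting question. I had a look at the source (I only have the Python 2.4 source available, but I'd bet it hasn't changed), and the code normalizes the URL being tested by executing: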

urllib.quote(urlparse.urlparse(urllib.unquote(url))[2])

That is the source of your problem:

>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo"))[2]) 
'/foo'
>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo?"))[2]) 
'/foo'

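So either this is a bug in Python's library, or Google is including a "?" character in a robots.txt rule, which is a bit unusual.

(In case that isn't clear, here it is again in a different way. The code above is what the robotparser library uses as part of checking the URL. When the URL ends in a "?", that character is dropped. So when you check /catalogs/p?, the test actually executed is for /catalogs/p. Hence your surprising result.)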
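For readers on Python 3, where these helpers live under urllib.parse, the same normalization can be reproduced roughly as follows. This is only a sketch of the expression quoted above, not the current library source, and `normalize` is a hypothetical helper name introduced for illustration.

```python
from urllib.parse import quote, unquote, urlparse


def normalize(url):
    """Mimic the normalization quoted above: keep only the path component.

    Index 2 of the urlparse() result is the path, so the query string
    (including a bare trailing "?") is discarded.
    """
    return quote(urlparse(unquote(url))[2])


# A URL that differs only by a trailing "?" normalizes to the same path.
for url in ("/catalogs/p", "/catalogs/p?", "http://www.google.com/catalogs/p?"):
    print(repr(url), "->", repr(normalize(url)))
```

All three inputs come out as '/catalogs/p', which is why a rule that only differs by the trailing "?" can never be told apart at this stage.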

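I'd suggest filing a bug with the Python folks (you can post a link to this page as part of the explanation) [edit: thanks]. And then use the other library you found...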

Answered 2025-04-17 by Python大师


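This isn't a bug, but rather a difference in interpretation. According to the draft robots.txt specification (which was never approved, nor is it likely to be):

To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

(Section 3.2.2, The Allow and Disallow Lines)

Using that interpretation, "/catalogs/p?" should be rejected, because there's a "Disallow: /catalogs" directive earlier in the record.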
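The first-match reading can be sketched in a few lines of Python. This is a minimal illustration, not robotparser's code: it assumes plain prefix matching with no wildcard or percent-encoding handling, and the `rules` list is a hypothetical record ordered the way this answer describes (the Disallow line before the Allow lines).

```python
def first_match_allowed(rules, path):
    """Return the verdict of the first rule whose prefix matches path."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True  # no rule matched: the URL is allowed by default


# Hypothetical record; the ordering is an assumption for illustration.
rules = [
    ("disallow", "/catalogs"),
    ("allow", "/catalogs/about"),
    ("allow", "/catalogs/p?"),
]

for path in ("/catalogs", "/catalogs/p?", "/search"):
    print(path, first_match_allowed(rules, path))
```

Because "Disallow: /catalogs" comes first and is a prefix of "/catalogs/p?", the later Allow line is never reached.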

At some point, though, Google started interpreting robots.txt differently. Their method appears to be:

Check for Allow. If it matches, crawl the page.
Check for Disallow. If it matches, don't crawl.
Otherwise, crawl.
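Read literally, those three steps could look like the sketch below. It is not Google's actual implementation: it assumes plain prefix matching, and both the function name `allow_first_allowed` and the `rules` list are hypothetical, chosen to mirror the paths discussed in the question.

```python
def allow_first_allowed(rules, path):
    """Check every Allow rule first, then every Disallow rule."""
    if any(kind == "allow" and path.startswith(prefix) for kind, prefix in rules):
        return True   # an Allow line matched: crawl
    if any(kind == "disallow" and path.startswith(prefix) for kind, prefix in rules):
        return False  # a Disallow line matched: don't crawl
    return True       # nothing matched: crawl


# Hypothetical record; rule order no longer affects the result here.
rules = [
    ("disallow", "/catalogs"),
    ("allow", "/catalogs/about"),
    ("allow", "/catalogs/p?"),
]

for path in ("/catalogs", "/catalogs/p?", "/search"):
    print(path, allow_first_allowed(rules, path))
```

Under this reading "/catalogs/p?" is crawlable even though "Disallow: /catalogs" appears earlier, which is the opposite of what a first-match parser decides for the same record.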

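The problem is that there's no formal agreement on how to interpret robots.txt. I've seen crawlers that use the Google method and others that use the draft standard from 1996. When I was operating a crawler, I got complaints from webmasters when I used the Google interpretation, because I crawled pages they thought shouldn't be crawled, and I got complaints from others when I used the other interpretation, because content they thought should be indexed wasn't.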

Answered 2025-04-17 by Python大师


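After a few Google searches I found nothing about a robotparser issue. I ended up finding something else instead: a module called reppy. I ran some tests with it and it seems very powerful. You can install it with pip: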

pip install reppy

Here are some examples of using reppy (in IPython), again with Google's robots.txt.

In [1]: import reppy

In [2]: x = reppy.fetch("http://google.com/robots.txt")

In [3]: x.atts
Out[3]: 
{'agents': {'*': <reppy.agent at 0x1fd9610>},
 'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
  'http://www.google.com/hostednews/sitemap_index.xml',
  'http://www.google.com/sitemaps_webmasters.xml',
  'http://www.google.com/ventures/sitemap_ventures.xml',
  'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
  'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
  'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
  'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}

In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it's allowed.
Out[4]: True

In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it's not allowed.
Out[5]: False

In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it's allowed.
Out[7]: True

In [8]: x.refresh() # Refresh robots.txt, perhaps a magic change?

In [9]: x.ttl
Out[9]: 3721.3556718826294

In [10]: # It also has a x.disallowed function. The contrary of x.allowed

Answered 2025-04-17 by Python大师

