Using Python's robotparser

0 votes

2 answers

524 views

Asked 2025-04-17 03:42

I don't quite understand how to use the parse function in the robotparser module. Here is what I tried:

In [28]: rp.set_url("http://anilattech.wordpress.com/robots.txt")

In [29]: rp.parse("""# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
# MT refugees
User-agent: *
Disallow: /cgi-bin/
User-agent: *
Disallow:""")

In [48]: rp.can_fetch("*","http://anilattech.wordpress.com/signup/")
Out[48]: True

It looks like rp.entries is empty ([]). I don't understand what went wrong. I tried simpler examples and ran into the same problem.

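2 Answers

I just found the answer.

1. The problem is that this robots.txt file from wordpress.com contains multiple User-agent: * declarations, which the robotparser module does not support. Simply removing the redundant User-agent: * lines solved the problem.

2. As Andrew pointed out, the argument passed to parse should be a list.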
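The workaround from the first point can be sketched roughly as below. This is only an illustration, not part of the module: drop_repeated_wildcard_agents is a hypothetical helper, example.com is a placeholder host, and the import assumes Python 3, where the module lives at urllib.robotparser (on Python 2 it was the top-level robotparser module). The helper also assumes that every repeated User-agent: * group directly follows another wildcard group, as in the file from the question.

```python
from urllib.robotparser import RobotFileParser


def drop_repeated_wildcard_agents(lines):
    """Keep only the first 'User-agent: *' line (hypothetical helper).

    Assumes every later wildcard group directly follows another wildcard
    group, so its rules can simply join the first one.
    """
    seen_wildcard = False
    kept = []
    for line in lines:
        field, _, value = line.partition(":")
        is_wildcard = (
            field.strip().lower() == "user-agent" and value.strip() == "*"
        )
        if is_wildcard and seen_wildcard:
            continue  # skip the repeat; its rules attach to the first group
        seen_wildcard = seen_wildcard or is_wildcard
        kept.append(line)
    return kept


robots_txt = """\
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /signup/
"""

rp = RobotFileParser()
# parse() takes a list of lines, so split the text before filtering it.
rp.parse(drop_repeated_wildcard_agents(robots_txt.splitlines()))

print(rp.can_fetch("*", "http://example.com/next/"))
print(rp.can_fetch("*", "http://example.com/signup/"))
```

With the repeated declaration dropped, both Disallow rules end up in a single wildcard group, so both paths should be reported as not fetchable.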

Answered 2025-04-17 by Python大师

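There are two problems here. First, rp.parse expects a list of strings, so you should add .split("\n") to that line.

Second, the rules for the * user agent are stored in rp.default_entry, not in rp.entries. If you inspect it, you will see that it contains an Entry object.

I'm not sure whose fault this is, but Python's parser only honors the first User-agent: * section, so in your example only /next/ is disallowed. The other Disallow rules are ignored. I haven't read the relevant specification, so I can't say for certain whether this is a malformed robots.txt file or a bug in the Python code. I suspect the former, though.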
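Both points can be checked with a small side-by-side comparison. This is a minimal sketch, assuming Python 3 (where the module is urllib.robotparser) and using a shortened robots.txt and the placeholder host example.com rather than the file from the question.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /next/
"""
url = "http://example.com/next/page"

# Raw string: parse() iterates over single characters, so it never sees
# a complete "field: value" line and records no rules at all.
rp_string = RobotFileParser()
rp_string.parse(robots_txt)

# List of lines: the rule is recognized.
rp_lines = RobotFileParser()
rp_lines.parse(robots_txt.split("\n"))

# entries stays empty in both cases; the "*" rules live in default_entry.
print(rp_string.entries, rp_string.default_entry)
print(rp_lines.entries)
print(rp_lines.default_entry)

print(rp_string.can_fetch("*", url))  # nothing parsed, so nothing is blocked
print(rp_lines.can_fetch("*", url))   # the /next/ rule now applies
```

An empty rp.entries is therefore not a sign of failure on its own when the file only has wildcard groups; rp.default_entry is the attribute to look at.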
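The first-group-only behavior described above can be reproduced with a shortened file. The sketch below assumes Python 3's urllib.robotparser and the placeholder host example.com; the result reflects the standard-library implementation as I understand it and may differ in other versions or other parsers.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /signup/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check one path from each wildcard group.
results = {}
for path in ("/next/", "/signup/"):
    results[path] = rp.can_fetch("*", "http://example.com" + path)
    print(path, "allowed" if results[path] else "blocked")
```

Only the rule from the first User-agent: * group is applied here, so /next/ is blocked while /signup/ is still reported as allowed.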

Answered 2025-04-17 by Python大师
