Reading and printing the contents of robots.txt in Python

数据工程师

Asked 2025-04-18 13:56

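I want to check whether a website has a robots.txt file, read the entire contents of that file, and print them. Putting the contents into a dictionary might be even better.

I tried using the robotparser module, but I could not work out how to do this with it.

I would like to use only modules that ship with Python 2.7.

I followed @Stefano Sanfilippo's suggestion: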

from urllib.request import urlopen

The result was:

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    from urllib.request import urlopen
ImportError: No module named request

So I tried this instead:

import urllib2
from urllib2 import Request
from urllib2 import urlopen
with urlopen("https://www.google.com/robots.txt") as stream:
    print(stream.read().decode("utf-8"))

But I got:

Traceback (most recent call last):
  File "", line 1, in <module>
    with urlopen("https://www.google.com/robots.txt") as stream:
AttributeError: addinfourl instance has no attribute '__exit__'

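Judging from bugs.python.org, using the object returned by urlopen in a with statement is not supported in 2.7. The same code does run fine in Python 3. Is there any way to work around this?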

1 Answer

Yes, robots.txt is just a file, so download it and take a look!

Python 3:

from urllib.request import urlopen

with urlopen("https://www.google.com/robots.txt") as stream:
    print(stream.read().decode("utf-8"))

Python 2:

from urllib import urlopen
from contextlib import closing

with closing(urlopen("https://www.google.com/robots.txt")) as stream:
    print stream.read()

Note that the path is always /robots.txt, directly under the site root.
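
Because the file always sits at the site root, its URL can be derived from any page URL on the same site. The following is a minimal sketch rather than part of the original answer: `robots_url` and `fetch_robots` are hypothetical helper names, the import fallback is an assumption made so the same code runs on both Python 2.7 and Python 3, and treating a 404 response as "no robots.txt" is one reasonable convention for the existence check, not the only one.

```python
from contextlib import closing

try:
    # Python 3
    from urllib.parse import urlsplit, urlunsplit
    from urllib.request import urlopen
    from urllib.error import HTTPError
except ImportError:
    # Python 2.7
    from urlparse import urlsplit, urlunsplit
    from urllib2 import urlopen, HTTPError


def robots_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url):
    """Return the robots.txt text, or None if the server answers 404."""
    try:
        # closing() supplies the context manager that the Python 2.7
        # response object lacks; it is harmless on Python 3.
        with closing(urlopen(robots_url(page_url))) as stream:
            return stream.read().decode("utf-8", "replace")
    except HTTPError as err:
        if err.code == 404:
            return None  # the site has no robots.txt
        raise
```

With these helpers, `fetch_robots("https://www.google.com/search?q=python")` requests `https://www.google.com/robots.txt`. Note that the sketch uses `urllib2.urlopen` on Python 2.7, because the older `urllib.urlopen` does not raise an exception on a 404 response.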

If you need to put the contents into a dictionary, .split(":") and .strip() are your friends.
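
The snippet that originally followed is not available, so here is one possible sketch of that idea. `parse_robots` is a hypothetical helper, and the layout of the resulting dictionary (rules grouped by user agent, with Sitemap lines and any lines outside a group stored under the key `None`) is an assumption, not something the original answer specifies. It splits on the first colon only, since values such as sitemap URLs contain colons themselves.

```python
def parse_robots(text):
    """Group robots.txt rules by user agent.

    Returns a dict such as {"*": {"disallow": ["/search"]}}.
    """
    rules = {}
    agents = [None]     # None collects lines outside any User-agent group
    group_open = False  # True while consecutive User-agent lines are read
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue  # blank or malformed line
        field, value = line.split(":", 1)  # split once: values may hold ":"
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not group_open:
                agents = []  # a new group of user agents starts here
                group_open = True
            agents.append(value)
            rules.setdefault(value, {})
            continue
        group_open = False
        # Sitemap lines are not tied to a particular user agent.
        targets = [None] if field == "sitemap" else agents
        for agent in targets:
            rules.setdefault(agent, {}).setdefault(field, []).append(value)
    return rules
```

Pass it the text returned by either download snippet above, then look up for example `parse_robots(text)["*"]["disallow"]`. If the goal is only to ask whether a given URL may be crawled, the standard library's robotparser module (urllib.robotparser in Python 3) already provides `can_fetch()`.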

Answered 2025-04-18 by Python大师