A simple Python question about urlopen

1 vote

2 answers

1082 views

Asked 2025-04-16 12:08

I'm trying to write a program that removes all tags from an HTML document, so I wrote the following.

import urllib
loc_left = 0
while loc_left != -1 :
    html_code = urllib.urlopen("http://www.python.org/").read()

    loc_left = html_code.find('<')
    loc_right = html_code.find('>')

    str_in_braket = html_code[loc_left, loc_right + 1]

    html_code.replace(str_in_braket, "")

But it produces the following error message.

lee@Lee-Computer:~/pyt$ python html_braket.py
Traceback (most recent call last):
  File "html_braket.py", line 1, in <module>
    import urllib
  File "/usr/lib/python2.6/urllib.py", line 25, in <module>
    import string
  File "/home/lee/pyt/string.py", line 4, in <module>
    html_code = urllib.urlopen("http://www.python.org/").read()
AttributeError: 'module' object has no attribute 'urlopen'

Interestingly, if I type the same code into the Python interpreter, the error does not occur.

2 Answers

The first step is to download the document so that you have it in a string:

import urllib
html_code = urllib.urlopen("http://www.python.org/").read() # <-- Note: this does not give me any sort of error
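The snippet above is Python 2. For readers on Python 3, where urlopen lives in urllib.request and read() returns bytes, a rough equivalent might look like the sketch below. The helper name fetch_html and the UTF-8 fallback are assumptions made for illustration, not part of the original answer.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Download a URL and return its body as text (hypothetical helper)."""
    with urlopen(url) as response:
        # Use the charset declared in the Content-Type header if there is one;
        # otherwise fall back to the assumed default encoding.
        charset = response.headers.get_content_charset() or default_encoding
        # read() returns bytes on Python 3, so decode before string handling.
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("http://www.python.org/") would then play the role of html_code in the steps that follow.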

From there you have two good options. Both are robust because they actually parse the HTML document rather than simply searching for '<' and '>' characters:

Option 1: Use Beautiful Soup

from BeautifulSoup import BeautifulSoup

# html_code is the string downloaded in the first step
''.join(BeautifulSoup(html_code).findAll(text=True))

Option 2: Use Python's built-in HTMLParser class

from HTMLParser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    return s.get_data()

Here is an example using option 2:

In [22]: strip_tags('<html>hi</html>')
Out[22]: 'hi'

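If you already have BeautifulSoup installed, option 1 is the simplest. Pasting the TagStripper class and the strip_tags function into your code is also straightforward.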
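Option 2 as written targets Python 2. On Python 3 the parser lives in html.parser, so a port might look roughly like the sketch below. The convert_charrefs=True argument and the explicit close() call are my additions, assumed here so that entities such as &amp; are decoded and trailing text is flushed.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text content of a document and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;
        # before passing text to handle_data.
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.fed.append(data)

    def get_data(self):
        return "".join(self.fed)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_data()
```

Note that this keeps the text inside script and style elements too, since the parser reports it as ordinary data.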

Good luck!

Answered 2025-04-16 by Python大师


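You named one of your scripts string.py. When urllib runs its own import string, Python picks up your file instead of the standard library's string module (the traceback shows /home/lee/pyt/string.py being imported from urllib.py). Your string.py then calls urllib.urlopen while the urllib module is still only partially defined, so that attribute does not exist yet. To avoid this, give your script a different name.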
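One way to catch this kind of name clash ahead of time is to scan a directory for scripts whose names match a standard library module. The sketch below is written for Python 3; the function name find_shadowing_scripts is hypothetical, and it assumes the standard library is installed at the location reported by sysconfig.

```python
import os
import sysconfig
from importlib.machinery import PathFinder


def find_shadowing_scripts(directory):
    """Return the .py files in directory that share a name with a stdlib module."""
    # Assumption: the pure-Python standard library lives in this directory.
    stdlib_dir = sysconfig.get_paths()["stdlib"]
    shadowing = []
    for filename in sorted(os.listdir(directory)):
        name, ext = os.path.splitext(filename)
        if ext != ".py":
            continue
        # Look the name up in the stdlib directory only, so a local file
        # with the same name cannot influence the result.
        if PathFinder.find_spec(name, [stdlib_dir]) is not None:
            shadowing.append(filename)
    return shadowing
```

Run against a directory like the asker's ~/pyt, this would be expected to flag string.py but not html_braket.py. Built-in modules that have no file in the stdlib directory (sys, for example) are not detected by this approach.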

Answered 2025-04-16 by Python大师
