Reading data with urllib2.urlopen()

Question

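I am trying to fetch the content of a URL in Python using the urllib2 module.

Suppose my URL is "http://chortle.ccsu.edu/AssemblyTutorial/Chapter-01/ass01_12.html".

When I fetch its content with these few simple lines of code, I get the complete HTML.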

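import urllib2

url = "http://chortle.ccsu.edu/AssemblyTutorial/Chapter-01/ass01_12.html"
response = urllib2.urlopen(url)  # open the URL
content = response.read()        # read the whole response body
print(content)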

However, when I put this code into a function, the HTML I end up with is missing everything inside the body tag.

def getContentURL(url):
    ''' returns the html content of the given url '''
    response = urllib2.urlopen(url)
    content = response.read()
    return content

content = getContentURL(url)
soup = BeautifulSoup(content)  # added in edit
print(content)

This is all I got.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"     "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
 <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
 <meta content="Bradley Kjell kjell at ieee dot org " name="author"/>
 <meta content="2007" name="copyright"/>
 <meta content="index,follow" name="robots"/>
 <title>
  CHAPTER 1 — Introduction
 </title>
 <link href="../AssemblyStyle.css" rel="stylesheet" type="text/css"/>
</head>
<body>
</body>
</html>

Why does this happen? I can't make sense of this strange behavior.
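One way to narrow this down is to check whether the string returned by response.read() already lacks the body content, or whether the content only disappears later. The following is a minimal sketch that uses only the standard library. TagCounter and count_body_tags are hypothetical helper names, and the two sample strings are made-up stand-ins for real page content.

```python
# Sketch: count the tags found inside <body> in a raw HTML string.
# If the count is non-zero, the fetched string itself is complete.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used alongside urllib2


class TagCounter(HTMLParser):
    """Counts start tags that appear between <body> and </body>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_body = False
        self.body_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.body_tags += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def count_body_tags(html_text):
    """Return the number of start tags inside the body of html_text."""
    counter = TagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.body_tags


# Made-up samples: one page with body content, one with an empty body.
sample_full = "<html><body><h1>Title</h1><p><a href='x.html'>next</a></p></body></html>"
sample_empty = "<html><head><title>t</title></head><body>\n</body></html>"
print(count_body_tags(sample_full))   # 3
print(count_body_tags(sample_empty))  # 0
```

Calling count_body_tags(content) on the value returned by getContentURL(url) would show whether the raw response is complete. On Python 3 the bytes would need to be decoded to text first.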

=============================== Edit ===============================================

So I wrote a test.py containing the same code as before, and it ran fine.

import os
from bs4 import BeautifulSoup
import urllib2
import urllib

def getContentURL(url):
    ''' returns the content of the given url in text format '''
    response = urllib2.urlopen(url)
    content = response.read()
    return content

url = "http://chortle.ccsu.edu/AssemblyTutorial/Chapter-01/ass01_1.html"

content = getContentURL(url)
soup = BeautifulSoup(content)
print(content) #prints everything
print(soup) #prints without the body's inner html

for link in soup.find_all('a'):
    #print(link)
    print(link.get('href'))

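But in my original code, the same code does not work; the original code has some other things at the top. Here is the link: https://github.com/kumar116/WebsiteCopier/blob/master/web_save.py. I am posting a link because the code is too long to paste directly.

The only change you will see is that I used print(soup.prettify()) or print(soup) when printing.

This causes me to lose everything inside the body tag.

I need the soup object so that I can parse the HTML.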
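Since print(content) shows everything while print(soup) does not, the content seems to be lost during parsing and not during the fetch. One possible explanation, which is an assumption and not something confirmed here, is the parser: BeautifulSoup(content) with no second argument uses whichever parser library is installed, and different parsers can build different trees from the same markup. The sketch below compares parsers on one string; body_tags_by_parser is a hypothetical helper and the sample markup is made up.

```python
from __future__ import print_function

from bs4 import BeautifulSoup, FeatureNotFound


def body_tags_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of tags it finds inside <body>.

    A value of None means that parser is not installed.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable.
            results[name] = None
            continue
        body = soup.body
        results[name] = len(body.find_all(True)) if body is not None else 0
    return results


# Made-up sample; in practice, pass the string returned by getContentURL(url).
sample = ("<html><head><title>t</title></head>"
          "<body><h1>Title</h1><p><a href='x.html'>next</a></p></body></html>")
for name, count in sorted(body_tags_by_parser(sample).items()):
    print(name, count)
```

If one parser reports zero tags for the real page while another does not, naming a parser explicitly, for example BeautifulSoup(content, "html.parser"), would make the script behave the same way in every environment.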

