Passing web data into Beautiful Soup - empty list - Q&A

3 answers

User

Answer 1 · edited 2024-04-27 13:13:20

As shown, it's clear that urlopen() returns an HTTP response which is captured by the variable content…

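What you call content isn't the content; it's a file-like object that you can read the content from. BeautifulSoup is perfectly happy to take such a thing, but printing it out isn't very helpful for debugging. So let's actually read the content out of it to make this easier to debug: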

>>> response = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')
>>> response
<urllib3.response.HTTPResponse object at 0x00000000032EC390>
>>> content = response.read()
>>> content
b''

This should make it pretty clear that BeautifulSoup is not the problem here. But continuing on:

… but after it's passed into Beautiful Soup, the web data doesn't get converted into a Beautiful Soup object (variable soup).

Yes, it does. The fact that soup.title gave you None instead of raising an AttributeError is pretty good evidence, but you can test it directly:

>>> type(soup)
bs4.BeautifulSoup

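That's definitely a BeautifulSoup object.

When you pass BeautifulSoup an empty string, what you get back depends on which parser it's using under the covers; if it's relying on the Python 3.x stdlib, what you'll get is an html node with an empty head, an empty body, and nothing else. So when you look for a title node, there isn't one, and you get None.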
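The empty-input case is easy to reproduce without any network access. The following is a minimal sketch, assuming bs4 is installed and explicitly selecting the stdlib-backed "html.parser"; the helper name `parse_title` is hypothetical and only serves the illustration.

```python
from bs4 import BeautifulSoup


def parse_title(markup):
    """Parse markup with the stdlib-backed parser and return the <title> tag, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # Looking up a missing tag by attribute returns None rather than raising.
    return soup.title


# Empty input still yields a valid soup object, just one without a title.
print(parse_title(""))  # None

# With real markup the same lookup finds the tag.
print(parse_title("<html><head><title>Hello</title></head></html>").text)  # Hello
```

Whatever skeleton a given parser builds around empty input, there is no title node to find, which is why the lookup comes back as None.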

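So, how do you fix this?

As the documentation says, you're using "the lowest level call for making a request, so you'll need to specify all the raw details." What are those raw details? Honestly, if you don't already know, you shouldn't be using this method. Teaching you how to deal with the under-the-hood details of urllib3 before you even know the basics would not be doing you a service.

In fact, you don't need urllib3 here at all. Just use the modules that come with Python: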

>>> from bs4 import BeautifulSoup
>>> # on Python 2.x, instead do: from urllib2 import urlopen 
>>> from urllib.request import urlopen
>>> r = urlopen('http://www.crummy.com/software/BeautifulSoup/')
>>> soup = BeautifulSoup(r)
>>> soup.title.text
'Beautiful Soup: We called him Tortoise because he taught us.'

User

Answer 2 · edited 2024-04-27 13:13:20

If you just want to scrape the page, requests will get you the content you need:

from bs4 import BeautifulSoup

import requests
r = requests.get('http://www.crummy.com/software/BeautifulSoup/')
soup = BeautifulSoup(r.content)

In [59]: soup.title
Out[59]: <title>Beautiful Soup: We called him Tortoise because he taught us.</title>

In [60]: soup.title.name
Out[60]: 'title'

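User

Answer 3 · edited 2024-04-27 13:13:20

urllib3 returns a response object whose .data attribute holds the preloaded body payload.

Following the quickstart usage example here, I would do something like this: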

import urllib3
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/')

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.data)  # Note the use of the .data property
...

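The rest should work as intended.

A bit about what went wrong in your original code:

You passed the entire response object rather than the body payload. Normally this would be fine, because the response object is a file-like object, except that in this case urllib3 has already consumed the whole response and parsed it for you, so there is nothing left to .read(). It's like passing a file pointer that has already been read. .data, on the other hand, accesses the data that has already been read.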
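The "already-read file pointer" analogy can be reproduced with the standard library alone. The sketch below uses io.BytesIO as a stand-in for the response object; that is an assumption made purely for illustration, not urllib3's actual response class, and the helper `read_twice` is hypothetical.

```python
import io


def read_twice(stream):
    """Read a file-like object to the end twice and return both results."""
    first = stream.read()
    # The cursor now sits at the end of the stream, so nothing is left.
    second = stream.read()
    return first, second


html = b"<html><head><title>Example</title></head></html>"
first, second = read_twice(io.BytesIO(html))

print(first == html)  # True
print(second)         # b''
```

A parser handed the stream after the first read sees only the empty second result, which matches the b'' shown in the first answer.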

If you want to use the urllib3 response object as a file-like object, you need to disable content preloading, like this:

response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/', preload_content=False)
soup = BeautifulSoup(response)  # We can pass the original `response` object now.

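Now it should work as you expected.

I know this isn't very obvious behavior, and as the author of urllib3, I apologize. :) We plan to make preload_content=False the default someday. Perhaps someday soon (I opened an issue here).

A quick note on .urlopen vs .request:

.urlopen assumes that you will take care of encoding any parameters passed to the request. In this case it's fine to use .urlopen because you're not passing any parameters to the request, but in general .request does all the extra work for you, so it's more convenient.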
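To give a sense of what "encoding the parameters yourself" involves, here is a minimal sketch using only the standard library. It is not urllib3's internal implementation; the helper `build_url`, the example.com URL, and the parameter names are all hypothetical.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    query = urlencode(params)
    return f"{base_url}?{query}" if query else base_url


# Spaces and other special characters must be escaped before the URL is sent.
url = build_url("http://example.com/search", {"q": "beautiful soup", "page": 2})
print(url)  # http://example.com/search?q=beautiful+soup&page=2
```

With a low-level call you would have to produce a string like this yourself before issuing the request, whereas a higher-level call can accept the parameters as a dictionary.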

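If anyone would like to improve our documentation, that would be greatly appreciated. :) Please send a PR to https://github.com/shazow/urllib3 and add yourself as a contributor!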
