BeautifulSoup: a problem with find_all() and unicode?

Posted 2024-06-12 21:42:02


I built a web scraper with BeautifulSoup to collect every ad on a Craigslist page. This is what I have so far:

import requests
from bs4 import BeautifulSoup, SoupStrainer
import bs4

page = "http://miami.craigslist.org/search/roo?query=brickell"
search_html = requests.get(page).text

roomSoup = BeautifulSoup(search_html, "html.parser")

ad_list = roomSoup.find_all("a", {"class":"hdrlnk"})
#print ad_list
ad_ls = [item["href"] for item in ad_list]
#print ad_ls
ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]
#print ad_urls 
url_str = [str(unicode) for unicode in ad_urls]

# What's in url_str?
for url in url_str:
    print url

When I run this, I get:

miami.craigslist.org/mdc/roo/4870912192.html
miami.craigslist.org/mdc/roo/4858122981.html
miami.craigslist.org/mdc/roo/4870665175.html
miami.craigslist.org/mdc/roo/4857247075.html
miami.craigslist.org/mdc/roo/4870540048.html
...

This is exactly what I want: a list containing the URL of every ad on the page.

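My next step is to extract some content from each of those pages, which means building another BeautifulSoup object for each one. But I get stopped right away: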

for url in url_str:
    ad_html = requests.get(str(url)).text

And here we finally get to my question: what exactly is this error? The only part I understand is the last two lines:

Traceback (most recent call last):
  File "webscraping.py", line 24, in <module>
    ad_html = requests.get(str(url)).text
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 447, in request
    prep = self.prepare_request(req)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 378, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 303, in prepare
    self.prepare_url(url, params)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 360, in prepare_url
    "Perhaps you meant http://{0}?".format(url))
requests.exceptions.MissingSchema: Invalid URL u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied. Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?

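It looks like the problem is that all of my links are prefixed with u', so requests.get() doesn't work. That is why you see me practically forcing every URL into a plain string with str(). No matter what I do, I get this error. Am I missing something? Have I completely misunderstood my problem?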

Thanks a lot in advance!


1 answer
User
Reply #1 · posted 2024-06-12 21:42:02

It looks like you have misread the problem.

The message:

 u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied.
 Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?

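means that the http:// part (the scheme, which requests calls the "schema") is missing from the front of the URL. The u'...' prefix is only how Python 2 displays a unicode string; it is not the cause of the error.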
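To see what requests is complaining about, you can parse both forms of the URL with the standard library. This is only a minimal sketch, assuming Python 3 (in Python 2 the same function lives in the `urlparse` module); the variable names `bare_url` and `full_url` are illustrative and not from the question.

```python
from urllib.parse import urlparse

# The URL exactly as the question's code builds it: host + path, no scheme.
bare_url = "miami.craigslist.org/mdc/roo/4870912192.html"
# The same URL with the scheme that the error message suggests.
full_url = "http://" + bare_url

bare = urlparse(bare_url)
full = urlparse(full_url)

# Without "http://", the parser finds no scheme and no host;
# the whole string is treated as a path.
print(repr(bare.scheme), repr(bare.netloc), repr(bare.path))

# With "http://", the scheme and host are recognised.
print(repr(full.scheme), repr(full.netloc), repr(full.path))
```

An empty scheme is the condition that the MissingSchema message refers to; converting the string with str() does not change it.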

So replacing

ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]

with

ad_urls = ["http://miami.craigslist.org" + ad for ad in ad_ls]

should do the job.
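As an alternative to concatenating strings by hand, the absolute URLs could also be built with `urljoin` from the standard library. The sketch below assumes Python 3 and uses two of the paths from the question's output as sample `href` values. `BASE_URL` and `absolutize` are hypothetical names introduced for illustration. No network request is made.

```python
from urllib.parse import urljoin, urlparse

# Assumed base URL of the listing site, including the scheme.
BASE_URL = "http://miami.craigslist.org"


def absolutize(hrefs, base=BASE_URL):
    """Turn relative hrefs such as '/mdc/roo/123.html' into absolute URLs."""
    return [urljoin(base, href) for href in hrefs]


# Sample hrefs, shaped like the values extracted into ad_ls in the question.
ad_ls = ["/mdc/roo/4870912192.html", "/mdc/roo/4858122981.html"]
ad_urls = absolutize(ad_ls)

for url in ad_urls:
    # Every URL now carries a scheme, so requests.get(url) would accept it.
    print(url, urlparse(url).scheme)
```

`urljoin` also leaves an `href` that is already absolute unchanged, which is one reason to prefer it over plain concatenation if the page mixes relative and absolute links.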
