靓汤黑屏长时间没有任何输出

import urllib import mechanize from bs4 import BeautifulSoup import datetime def searchAP(searchterm): newlinks = [] browser = mechanize.Browser() browser.set_handle_robots(False) browser.addheaders = [('User-agent', 'Firefox')] text = "" start = 0 while "There were no matches for your search" not in text: url = "http://www.marketing-interactive.com/"+"?s="+searchterm text = urllib.urlopen(url).read() soup = BeautifulSoup(text, "lxml") results = soup.findAll('a') for r in results: if "rel=bookmark" in r['href'] : newlinks.append("http://www.marketing-interactive.com"+ str(r["href"])) start +=10 return newlinks print searchAP("digital marketing")

2条回答

网友

1楼 · 编辑于 2024-05-16 03:44:23

你犯了四个错误：

您正在定义start，但从未使用过它。（你也不能，就我在http://www.marketing-interactive.com/?s=something上看到的。不存在基于url的分页。）因此您将无休止地循环第一组结果。
"There were no matches for your search"不是该站点返回的无结果字符串。所以不管怎样它都会一直持续下去。
您正在附加链接，包括http://www.marketing-interactive.com到http://www.marketing-interactive.com。所以你会得到http://www.marketing-interactive.comhttp://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/
关于rel=bookmark选择：arifs solution是正确的方法。但如果你真的想这样做，你需要这样做：
```
for r in results:
    if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
        newlinks.append(r["href"])
```
这首先检查rel是否存在，然后检查其第一个子级是否为"bookmark"，因为r['href']不包含{}。这不是Beautiulsoup的结构。

要获取此特定站点，您可以执行以下两个操作：

您可以使用Selenium或其他支持Javascript的东西，然后按"Load more"按钮。但这很麻烦。
你可以使用这个漏洞：http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/loop_handler.php?pageNumber=1&postType=search&searchValue=digital+marketing 这是该列表的源。它具有分页功能，因此您可以轻松地循环所有结果。

网友

2楼 · 编辑于 2024-05-16 03:44:23

下面的脚本根据给定的搜索关键字从网页中提取所有链接。但它不会超出第一页。尽管下面的代码可以通过操作URL中的页码（如other answer中的Rutger de Knijf所述）轻松地修改为从多个页面获得所有结果。在

from pprint import pprint
import requests
from BeautifulSoup import BeautifulSoup

def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content)

    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]

用法：

^{pr2}$

输出：

[u'http://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/',
 u'http://www.marketing-interactive.com/singapore-polytechnic-on-the-hunt-for-digital-marketing-agency/',
 u'http://www.marketing-interactive.com/how-to-get-your-bosses-on-board-your-digital-marketing-plan/',
 u'http://www.marketing-interactive.com/digital-marketing-institute-launches-brand-refresh/',
 u'http://www.marketing-interactive.com/entropia-highlights-the-7-original-sins-of-digital-marketing/',
 u'http://www.marketing-interactive.com/features/futurist-right-mindset-digital-marketing/',
 u'http://www.marketing-interactive.com/lenovo-brings-board-new-digital-marketing-head/',
 u'http://www.marketing-interactive.com/video/discussing-digital-marketing-indonesia-video/',
 u'http://www.marketing-interactive.com/ubs-melvin-kwek-joins-credit-suisse-as-apac-digital-marketing-lead/',
 u'http://www.marketing-interactive.com/linkedins-top-10-digital-marketing-predictions-2017/']

希望这是你想要的作为你项目的第一步。在

相关问题更多 >

编程相关推荐

热门问题

热门文章