Given a list of websites, search for and return information with Python

import requests
from googlesearch import search
from bs4 import BeautifulSoup as BS


def get_url(company_name):
    url_list = []
    for url in search(company_name, stop=10):
        url_list.append(url)
    return url_list


test1 = get_url('Marketo')
print(test1[7])
r = requests.get(test1[7])
html = r.text
soup = BS(html, 'lxml')
stuff = soup.find_all('a')
print(stuff)

2 answers

User

Answer 1 · edited 2024-05-16 15:03:47

You can find this information on a site such as Crunchbase.

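The steps are:

1. Build a URL that points to the information about the target company. Suppose you have found a URL with the information you need, such as:
    url = 'https://www.example.com/infoaboutmycompany.html'
2. Use selenium to fetch the HTML, because the site does not let you scrape the page directly. Like this:
    from selenium import webdriver
    from bs4 import BeautifulSoup
    driver = webdriver.Firefox()
    driver.get(url)
    html = driver.page_source
3. Use BeautifulSoup to get the text from the div that holds the information. It has a specific class that is easy to find in the HTML:
    bsobj = BeautifulSoup(html, 'lxml')
    res = bsobj.find('div', {'class': 'alpha beta gamma'})
    res.text.strip()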

You can get it in fewer than 10 lines of code.
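
As a minimal sketch of the last step that runs without a browser, the snippet below parses an inline HTML string instead of a live page. The sample markup and the helper name `extract_div_text` are assumptions for illustration, and `alpha beta gamma` is the placeholder class list from the steps above. It uses a CSS selector rather than `find`, so the match should not depend on the order of the classes in the attribute.

```python
from bs4 import BeautifulSoup


def extract_div_text(html, classes):
    """Return the stripped text of the first <div> carrying all given classes, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Build a selector such as "div.alpha.beta.gamma"; class order in the HTML does not matter.
    selector = "div" + "".join("." + c for c in classes)
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


# Hypothetical markup standing in for driver.page_source.
sample_html = """
<html><body>
  <div class="gamma alpha beta"> Acquired by Example Corp </div>
  <div class="alpha">unrelated</div>
</body></html>
"""

result = extract_div_text(sample_html, ["alpha", "beta", "gamma"])
print(result)
```

Returning `None` when nothing matches avoids the `AttributeError` that `res.text` would raise if the div were missing.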

Of course, this means changing your list from a list of URLs to a list of company names, which the site will hopefully recognize. It works for Marketo.
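
Turning a list of company names into candidate URLs could look like the sketch below. The URL template is a placeholder on `example.com`, and the slug rule (lowercase, runs of other characters collapsed into hyphens) and the helper names are assumptions. A real site may build its paths differently.

```python
import re

# Placeholder pattern; substitute the real path layout of the site you query.
URL_TEMPLATE = "https://www.example.com/organization/{slug}"


def company_slug(name):
    """Lowercase the name and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.strip().lower()).strip("-")


def company_urls(names):
    """Map each company name to a candidate profile URL."""
    return [URL_TEMPLATE.format(slug=company_slug(n)) for n in names]


urls = company_urls(["Marketo", "Example Widgets Inc."])
for url in urls:
    print(url)
```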

User

Answer 2 · edited 2024-05-16 15:03:47

I want to return whether some company was acquired and by whom

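You can get this information by scraping the Crunchbase website. The drawback is that your search is limited to their site. To broaden it, you could include some other sites as well.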

import requests
from bs4 import BeautifulSoup
import re
while True:
    print()
    organization_name=input('Enter organization_name: ').strip().lower()
    crunchbase_url='https://www.crunchbase.com/organization/'+organization_name
    headers={
        'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }
    r=requests.get(crunchbase_url,headers=headers)
    if r.status_code == 404:
        print('This organization is not available\n')
    else:
        soup=BeautifulSoup(r.text,'html.parser')
        # `string=` is the current name of the older `text=` argument in BeautifulSoup
        overview_h2=soup.find('h2',string=re.compile('Overview'))
        try:
            possible_acquired_by_span=overview_h2.find_next('span',class_='bigValueItemLabelOrData')
            if possible_acquired_by_span.text.strip() == 'Acquired by':
                acquired_by=possible_acquired_by_span.find_next('span',class_='bigValueItemLabelOrData').text.strip()
            else:
                acquired_by=False
        except Exception as e:
            acquired_by=False
            # uncomment the line below if you want to see the error
            # print(e)
        if acquired_by:
            print('Acquired By: '+acquired_by+'\n')
        else:
            print('No acquisition information available\n')

    again=input('Do You Want To Continue? ').strip().lower()
    if again not in ['y','yes']:
        break


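Notes

Read the Crunchbase Terms and ask for their consent before deploying this in any commercial project.
Also check the Crunchbase API; I think that would be the legitimate way to do what you are asking for.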
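
To broaden the lookup beyond a single site, as suggested at the start of this answer, one option is to wrap each site in its own function and try them in order. This is only a sketch: `first_acquirer` and the two stub sources are hypothetical, and the stubs return canned values so the example runs offline. Real sources would fetch and parse a page as in the code above.

```python
def first_acquirer(organization_name, sources):
    """Try each (label, lookup) pair in order; return (label, acquirer) for the first hit."""
    for label, lookup in sources:
        try:
            acquirer = lookup(organization_name)
        except Exception:
            # A failing source (network error, changed markup) should not stop the others.
            continue
        if acquirer:
            return label, acquirer
    return None, None


# Stub sources with canned data, standing in for real scrapers.
def source_a(name):
    raise RuntimeError("site unavailable")


def source_b(name):
    return {"acme": "Example Corp"}.get(name)


sources = [("source_a", source_a), ("source_b", source_b)]
print(first_acquirer("acme", sources))
print(first_acquirer("unknown", sources))
```

Keeping each site behind the same small function signature means another source can be added without touching the loop.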
