How can I search within a website using the requests module?

Published 2024-04-23 11:14:49


I want to search for different company names on a website. Website link: https://www.firmenwissen.de/index.html

On this site I want to use the search box to look up companies. Here is the code I tried:

from bs4 import BeautifulSoup as BS
import requests
import re

companylist = ['ABEX Dachdecker Handwerks-GmbH']

url = 'https://www.firmenwissen.de/index.html'

payloads = {
    'searchform': 'UTF-8',
    'phrase': 'ABEX Dachdecker Handwerks-GmbH',
    'mainSearchField__button': 'submit'
}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

html = requests.post(url, data=payloads, headers=headers)
soup = BS(html.content, 'html.parser')
link_list= []

links = soup.find_all('a')

for li in links:
    link_list.append(li.get('href'))
print(link_list)

This code is supposed to take me to the next page with the company's information, but unfortunately it only returns the homepage. What should I do?


1 Answer

Posted 2024-04-23 11:14:49

Change the initial URL you are posting the search to. Grab only the relevant hrefs and add them to a set, to ensure there are no duplicates (or tighten the selector so it returns at most one match per search); gather those into a final set so you only loop over as many links as you actually need. I have used a Session on the assumption that you will repeat this for many companies.
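To show the selector-plus-set idea in isolation, here is a minimal sketch on a static snippet (no network; the markup below is invented for illustration, not taken from the real site):

```python
from bs4 import BeautifulSoup

# Invented markup: a result page with two links to the same
# company profile plus an unrelated link.
html = """
<a href="/firmeneintrag/abex-1.html">ABEX</a>
<a href="/firmeneintrag/abex-1.html">ABEX (logo)</a>
<a href="/impressum.html">Impressum</a>
"""

baseUrl = 'https://www.firmenwissen.de'
soup = BeautifulSoup(html, 'html.parser')

# The attribute-substring selector keeps only profile links,
# and the set comprehension drops the duplicate.
links = {baseUrl + a['href'] for a in soup.select("[href*='firmeneintrag/']")}
print(links)  # {'https://www.firmenwissen.de/firmeneintrag/abex-1.html'}
```

Three anchors match the selector here, but the set collapses the two identical profile hrefs into one, so you only visit each company page once.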

Then use selenium to iterate over that set, navigating to each company URL and extracting whatever information you need.

Here is an outline:

from bs4 import BeautifulSoup as BS
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

d = webdriver.Chrome()
companyList = ['ABEX Dachdecker Handwerks-GmbH', 'SUCHMEISTEREI GmbH']

url = 'https://www.firmenwissen.de/ergebnis.html'
baseUrl = 'https://www.firmenwissen.de'
headers = {'User-Agent': 'Mozilla/5.0'}

finalLinks = set()

## search section; gather result links into a set

with requests.Session() as s:
    for company in companyList:
        payloads = {
            'searchform': 'UTF-8',
            'phrase': company,
            'mainSearchField__button': 'submit'
        }

        html = s.post(url, data=payloads, headers=headers)
        soup = BS(html.content, 'lxml')

        # keep only links to company profile pages; the set drops duplicates
        companyLinks = {baseUrl + item['href'] for item in soup.select("[href*='firmeneintrag/']")}
        # print(soup.select_one('.fp-result').text)
        finalLinks = finalLinks.union(companyLinks)

for item in finalLinks:
    d.get(item)
    info = d.find_element(By.CSS_SELECTOR, '.yp_abstract_narrow')
    address = d.find_element(By.CSS_SELECTOR, '.yp_address')
    print(info.text, address.text)

d.quit()

And if you only want the first matching link per search:

from bs4 import BeautifulSoup as BS
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

d = webdriver.Chrome()
companyList = ['ABEX Dachdecker Handwerks-GmbH', 'SUCHMEISTEREI GmbH', 'aktive Stuttgarter']

url = 'https://www.firmenwissen.de/ergebnis.html'
baseUrl = 'https://www.firmenwissen.de'
headers = {'User-Agent': 'Mozilla/5.0'}

finalLinks = []

## search section; add the first result link per company to a list

with requests.Session() as s:
    for company in companyList:
        payloads = {
            'searchform': 'UTF-8',
            'phrase': company,
            'mainSearchField__button': 'submit'
        }

        html = s.post(url, data=payloads, headers=headers)
        soup = BS(html.content, 'lxml')

        # select_one returns only the first match (or None if there is none)
        companyLink = baseUrl + soup.select_one("[href*='firmeneintrag/']")['href']
        finalLinks.append(companyLink)

for item in set(finalLinks):
    d.get(item)
    info = d.find_element(By.CSS_SELECTOR, '.yp_abstract_narrow')
    address = d.find_element(By.CSS_SELECTOR, '.yp_address')
    print(info.text, address.text)
d.quit()
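The only real difference between the two versions is `select` versus `select_one`: the former returns every matching element, the latter just the first. A minimal sketch on a static snippet (markup invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented markup: two profile links that both match the selector.
html = """
<a href="/firmeneintrag/abex-1.html">ABEX</a>
<a href="/firmeneintrag/suchmeisterei-2.html">SUCHMEISTEREI</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# select returns a list of all matches
all_links = [a['href'] for a in soup.select("[href*='firmeneintrag/']")]
# select_one returns only the first match
first_link = soup.select_one("[href*='firmeneintrag/']")['href']

print(all_links)   # ['/firmeneintrag/abex-1.html', '/firmeneintrag/suchmeisterei-2.html']
print(first_link)  # /firmeneintrag/abex-1.html
```

One caveat: `select_one` returns None when nothing matches, so the `['href']` subscript in the second script will raise a TypeError for a company with no results; guard for that if your list may contain misses.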
