How do I use BeautifulSoup in Python to get content from div classes based on their data-automation attribute?

Posted 2024-05-15 12:25:44

I am trying to use BeautifulSoup to scrape a dynamic page. I reach the page from https://www.nemlig.com/ with the help of Selenium (thanks to @cruisepandey for the code suggestion), like so:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 
from bs4 import BeautifulSoup


driver = webdriver.Chrome(executable_path = r'C:\Users\user\lib\chromedriver_77.0.3865.40.exe')
wait = WebDriverWait(driver,10)

driver.maximize_window()
driver.get("https://www.nemlig.com/")

wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".timeslot-prompt.initial-animation-done")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[type='tel'][class^='pro']"))).send_keys('2300')  
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn.prompt__button"))).click()

After this I am shown the following page, which is the one I want to scrape.

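More precisely, at this point I want to scrape the rows on the right-hand side of the page. If you look closely at the HTML behind them, you will see that the div class time-block__row has 3 different data-automation attributes, one for each of the 3 main parts of the day.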

<div class="time-block__row" data-automation="beforDinnerRowTmSlt">
                            <div class="time-block__row-header">Formiddag</div>

                            <div class="no-timeslots ng-hide" ng-show="$ctrl.timeslotDays[$ctrl.selectedDateIndex].morningHours == 0">
                                Ingen levering..
                            </div>

                            <!----><!----><div class="time-block__item duration-1 disabled" ng-repeat="item in $ctrl.selectedHours track by $index" ng-if="item.StartHour >= 0 &amp;&amp; item.StartHour < 12" ng-click="$ctrl.setActiveTimeslot(item, $index)" ng-class="['duration-1', {'cheapest': item.IsCheapHour, 'event': item.IsEventSlot, 'selected': $ctrl.selectedTimeId == item.Id || $ctrl.selectedTimeIndex == $index, 'disabled': item.isUnavailable()}]" data-automation="notActiveSltTmSlt">

                                <div class="time-block__inner-container">
                <div class="time-block__time">8-9</div>
                <div class="time-block__attributes">
                  <!----></div>
                                    <div class="time-block__cost">29&nbsp;kr.</div>

So the morning row has data-automation="beforDinnerRowTmSlt", the afternoon row has data-automation="afternoonRowTmSlt", and the evening row has data-automation="eveningRowTmSlt".

page_source = driver.page_source  # page_source is a plain string, not a condition that wait.until can call
soup = BeautifulSoup(page_source, 'html.parser')

time_of_the_day = soup.find('div', class_='time-block__row').text

The problem:

With the code above, time_of_the_day only contains the information from the morning row.

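How can I use the data-automation attribute to scrape these rows properly? How do I access the other 2 div classes and their child divs? My plan is to create a dataframe containing the following: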

Time_of_the_day          Hours          Price        Day
Formiddag                8-9            29kr.        Tor. 10/10
....                     ....           ....         ....
Eftermiddag              12-13          29kr.        Tor. 10/10
....                     ....           ....         ....

The Day column would contain the output of: day = soup.find('div', class_='content').text

I know this is a rather long post, but I hope I have made the task easy to understand and that you will be able to help with advice, hints, or code!


2 Answers

Here is the code to get all of those values.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import pandas as pd

driver = webdriver.Chrome(executable_path = r'C:\Users\user\lib\chromedriver_77.0.3865.40.exe')
wait = WebDriverWait(driver,10)
driver.maximize_window()
driver.get("https://www.nemlig.com/")

wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".timeslot-prompt.initial-animation-done")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[type='tel'][class^='pro']"))).send_keys('2300')
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn.prompt__button"))).click()
time.sleep(3)
soup=BeautifulSoup(driver.page_source,'html.parser')
time_of_day=[]
price=[]
Hours=[]
day=[]
for morn in soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__time'):
    time_of_day.append(soup.select_one('[data-automation="beforDinnerRowTmSlt"] > .time-block__row-header').text)
    Hours.append(morn.text)
    price.append(morn.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)

df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day})
print(df)

time_of_day=[]
price=[]
Hours=[]
day=[]

for after in soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__time'):
    time_of_day.append(soup.select_one('[data-automation="afternoonRowTmSlt"] > .time-block__row-header').text)
    Hours.append(after.text)
    price.append(after.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)

df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day})
print(df)

time_of_day=[]
price=[]
Hours=[]
day=[]

for evenin in soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__time'):
    time_of_day.append(soup.select_one('[data-automation="eveningRowTmSlt"] > .time-block__row-header').text)
    Hours.append(evenin.text)
    price.append(evenin.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)

df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day})
print(df)

Output:

         Day  Hours   price time_of_day
0  fre. 11/10    8-9  29 kr.   Formiddag
1  fre. 11/10   9-10  29 kr.   Formiddag
2  fre. 11/10  10-11  39 kr.   Formiddag
3  fre. 11/10  11-12  39 kr.   Formiddag
          Day  Hours   price  time_of_day
0  fre. 11/10  12-13  29 kr.  Eftermiddag
1  fre. 11/10  13-14  29 kr.  Eftermiddag
2  fre. 11/10  14-15  29 kr.  Eftermiddag
3  fre. 11/10  15-16  29 kr.  Eftermiddag
4  fre. 11/10  16-17  29 kr.  Eftermiddag
5  fre. 11/10  17-18  19 kr.  Eftermiddag
          Day  Hours   price time_of_day
0  fre. 11/10  18-19  29 kr.       Aften
1  fre. 11/10  19-20  19 kr.       Aften
2  fre. 11/10  20-21  29 kr.       Aften
3  fre. 11/10  21-22  19 kr.       Aften
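
The three loops above differ only in the data-automation value, so they can be folded into one loop over the three attribute names. The following is a minimal sketch of that idea, not the answer's own code: SAMPLE_HTML is a hypothetical, trimmed-down stand-in for driver.page_source, and collect_rows and its day argument are names introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup modelled on the snippet in the question.
# In the real script this string would be driver.page_source.
SAMPLE_HTML = """
<div class="time-block__row" data-automation="beforDinnerRowTmSlt">
  <div class="time-block__row-header">Formiddag</div>
  <div class="time-block__item">
    <div class="time-block__time">8-9</div>
    <div class="time-block__cost">29 kr.</div>
  </div>
</div>
<div class="time-block__row" data-automation="afternoonRowTmSlt">
  <div class="time-block__row-header">Eftermiddag</div>
  <div class="time-block__item">
    <div class="time-block__time">12-13</div>
    <div class="time-block__cost">29 kr.</div>
  </div>
  <div class="time-block__item">
    <div class="time-block__time">13-14</div>
    <div class="time-block__cost">19 kr.</div>
  </div>
</div>
<div class="time-block__row" data-automation="eveningRowTmSlt">
  <div class="time-block__row-header">Aften</div>
  <div class="time-block__item">
    <div class="time-block__time">18-19</div>
    <div class="time-block__cost">29 kr.</div>
  </div>
</div>
"""

# The three data-automation values named in the question, in page order.
ROW_ATTRIBUTES = ("beforDinnerRowTmSlt", "afternoonRowTmSlt", "eveningRowTmSlt")


def collect_rows(soup, day):
    """Return one dict per time slot, visiting the three rows in order."""
    rows = []
    for attr in ROW_ATTRIBUTES:
        # CSS attribute selector: matches the element carrying this value.
        block = soup.select_one(f'[data-automation="{attr}"]')
        if block is None:  # row not present on the page: skip it
            continue
        header = block.select_one(".time-block__row-header").get_text(strip=True)
        for slot in block.select(".time-block__time"):
            # Same lookup as the answer: the next cost element after the time.
            cost = slot.find_next(class_="time-block__cost")
            rows.append({
                "time_of_day": header,
                "Hours": slot.get_text(strip=True),
                "price": cost.get_text(strip=True),
                "Day": day,
            })
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = collect_rows(soup, day="fre. 11/10")  # day is hard-coded here for the sketch
for row in rows:
    print(row)
```

Because rows is a list of dicts, it can be passed straight to pd.DataFrame(rows) to get a single dataframe covering all three parts of the day.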

Edit

soup=BeautifulSoup(driver.page_source,'html.parser')
time_of_day=[]
price=[]
Hours=[]
day=[]
disabled=[]

for morn,d in zip(soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__item')):

    time_of_day.append(soup.select_one('[data-automation="beforDinnerRowTmSlt"] > .time-block__row-header').text)
    Hours.append(morn.text)
    price.append(morn.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
    if 'disabled' in d['class']:
        disabled.append('1')
    else:
        disabled.append('0')

for after,d in zip(soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__item')):
    time_of_day.append(soup.select_one('[data-automation="afternoonRowTmSlt"] > .time-block__row-header').text)
    Hours.append(after.text)
    price.append(after.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
    if 'disabled' in d['class']:
        disabled.append('1')
    else:
        disabled.append('0')

for evenin,d in zip(soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__item')):
    time_of_day.append(soup.select_one('[data-automation="eveningRowTmSlt"] > .time-block__row-header').text)
    Hours.append(evenin.text)
    price.append(evenin.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
    if 'disabled' in d['class']:
        disabled.append('1')
    else:
        disabled.append('0')

df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day,"Disabled" : disabled})
print(df)

Output:

           Day Disabled  Hours   price  time_of_day
0   fre. 11/10        1    8-9  29 kr.    Formiddag
1   fre. 11/10        1   9-10  29 kr.    Formiddag
2   fre. 11/10        0  10-11  39 kr.    Formiddag
3   fre. 11/10        0  11-12  39 kr.    Formiddag
4   fre. 11/10        0  12-13  29 kr.  Eftermiddag
5   fre. 11/10        0  13-14  29 kr.  Eftermiddag
6   fre. 11/10        0  14-15  19 kr.  Eftermiddag
7   fre. 11/10        0  15-16  29 kr.  Eftermiddag
8   fre. 11/10        0  16-17  29 kr.  Eftermiddag
9   fre. 11/10        0  17-18  29 kr.  Eftermiddag
10  fre. 11/10        0  18-19  29 kr.        Aften
11  fre. 11/10        0  19-20  19 kr.        Aften
12  fre. 11/10        0  20-21  29 kr.        Aften
13  fre. 11/10        0  21-22  19 kr.        Aften
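
An alternative to zipping two separate selections is to walk each .time-block__item once and read its time, cost, and disabled state from that same element, so the three values cannot drift out of step. This is only a sketch under assumptions: SAMPLE_HTML is hypothetical markup modelled on the question's snippet, parse_slots is a name introduced here, and the whitespace clean-up of the price is an addition that the answer does not perform.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's snippet. The &nbsp; entity
# becomes a non-breaking space (\xa0) once the document is parsed.
SAMPLE_HTML = """
<div class="time-block__row" data-automation="beforDinnerRowTmSlt">
  <div class="time-block__row-header">Formiddag</div>
  <div class="time-block__item duration-1 disabled">
    <div class="time-block__time">8-9</div>
    <div class="time-block__cost">29&nbsp;kr.</div>
  </div>
  <div class="time-block__item duration-1">
    <div class="time-block__time">10-11</div>
    <div class="time-block__cost">39&nbsp;kr.</div>
  </div>
</div>
"""


def parse_slots(soup):
    """Build one record per time slot, scoped to its own item element."""
    slots = []
    for row in soup.select(".time-block__row"):
        header = row.select_one(".time-block__row-header").get_text(strip=True)
        for item in row.select(".time-block__item"):
            hours = item.select_one(".time-block__time").get_text(strip=True)
            raw_price = item.select_one(".time-block__cost").get_text()
            slots.append({
                "time_of_day": header,
                "Hours": hours,
                # split() also splits on \xa0, so this normalises the spacing.
                "price": " ".join(raw_price.split()),
                # class is a multi-valued attribute, returned as a list.
                "Disabled": "disabled" in item.get("class", []),
            })
    return slots


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
slots = parse_slots(soup)
for slot in slots:
    print(slot)
```

Here Disabled is stored as a boolean rather than the '1' and '0' strings used above; that is a design choice for the sketch, and int(...) would turn it into 1 and 0 if a numeric column is preferred.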

You can use soup.find_all:

from bs4 import BeautifulSoup as soup
import re
... #rest of your current selenium code

d = soup(driver.page_source, 'html.parser')
r, _day = [[i.div.text, [['disabled' in k['class'], k.find_all('div', {'class':re.compile('time-block__time|time-block__cost')})] for k in i.find_all('div', {'class':'time-block__item'})]] for i in d.find_all('div', {'class':'time-block__row'})], d.find('div', {'class':'content'}).get_text(strip=True)
new_r = [[a, [[int(j), *[i.text for i in b]] for j, b in k]] for a, k in r]
new_data = [[a, *i, _day] for a, b in new_r for i in b]

To convert the result to a dataframe:

import pandas as pd
df = pd.DataFrame([dict(zip(['Time_of_the_day', 'Disabled', 'Hours', 'Price', 'Day'], i)) for i in new_data])

Output:

      Day  Disabled  Hours   Price Time_of_the_day
0   fre.11/10         1    8-9  29 kr.       Formiddag
1   fre.11/10         1   9-10  29 kr.       Formiddag
2   fre.11/10         1  10-11  39 kr.       Formiddag
3   fre.11/10         0  11-12  39 kr.       Formiddag
4   fre.11/10         0  12-13  29 kr.     Eftermiddag
....
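
The reason the question's code returned only the morning row is that find stops at the first match, whereas find_all returns every match. The short sketch below illustrates the difference and shows how a single row can be picked by its data-automation value through the attrs argument. SAMPLE_HTML is hypothetical markup reduced to the row headers, and the variable names are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: just the three rows and their headers.
SAMPLE_HTML = """
<div class="time-block__row" data-automation="beforDinnerRowTmSlt">
  <div class="time-block__row-header">Formiddag</div>
</div>
<div class="time-block__row" data-automation="afternoonRowTmSlt">
  <div class="time-block__row-header">Eftermiddag</div>
</div>
<div class="time-block__row" data-automation="eveningRowTmSlt">
  <div class="time-block__row-header">Aften</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns only the first matching element (the morning row).
first_only = soup.find("div", class_="time-block__row")

# find_all returns every matching element, in document order.
all_rows = soup.find_all("div", class_="time-block__row")

# Map each row's data-automation value to its header text.
headers = {
    row["data-automation"]: row.find("div", class_="time-block__row-header").get_text(strip=True)
    for row in all_rows
}

# data-automation contains a hyphen, so it cannot be a keyword argument;
# pass it through attrs instead.
evening = soup.find("div", attrs={"data-automation": "eveningRowTmSlt"})

print(first_only["data-automation"])
print(len(all_rows))
print(headers)
print(evening.get_text(strip=True))
```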
