Selenium Python: getting all text from a `<div>`

Posted 2024-05-29 08:12:24


I want to get a list of all words in the form dutch word = english word from a few pages.

By inspecting the HTML, this means I need to get the text of every li in every ul inside the child div of #mw-content-text.

Here is my code:

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window
driver = webdriver.Chrome(options=options)  # chrome_options= is deprecated

listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_tag_name("li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

Here is the output:

['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

I don't understand why some li texts are not retrieved even though the XPath is the same (I double-checked a few of them via Copy XPath in the developer console).


3 Answers

Your script seems fine, but I would add an explicit or implicit wait. Try waiting for all elements on the page to become visible:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window

driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))
    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_tag_name("li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

Alternatively, you can add driver.implicitly_wait(15) right after creating the driver.

Output:

['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', 'meisje = girl', 'kind = child/kid', 'hij = he', 'ze = she (unstressed)', 'is = is', 'of = or', 'appel = apple', 'melk = milk', 'drinkt = drinks (2nd and 3rd person singular)', 'drink = drink (1st person singular)', 'eet = eat(s) (singular)', 'de = the', 'sap = juice', 'water = water', 'brood = bread', 'het = it, the', 'je = you (singular informal, unstressed)', 'bent = are (2nd person singular)', 'Zijn (to be)', 'Hebben (to have)', 'Mogen (to be allowed to)', 'Willen (to want)', 'Kunnen (to be able to)', 'Zullen ("will")', 'boterham = sandwich', 'rijst = rice', 'we = we (unstressed)', 'jullie = you (plural informal)', 'eten = eat (plural)', 'drinken = drink (plural)', 'vrouwen = women', 'mannen = men', 'meisjes = girls', 'krant = newspaper', 'lezen = read (plural)', 'jongens = boys', 'menu = menu', 'dat = that', 'zijn = are (plural)', 'ze = they (unstressed)', 'heb = have (1st person singular)', 'heeft = has (3rd person singular)', 'hebt = have (2nd person singular)', 'hebben = have (plural)', 'boek = book', 'lees = read (1st person singular)', 'leest = read(s) (2nd and 3rd person singular)', 'kinderen = children', 'spreken = speak (plural)', 'spreek = speak (1st person singular)', 'spreekt = speak(s) (2nd and 3rd person singular)', 'hallo = hello', 'bedankt = thanks', 'doei = bye', 'dag = goodbye', 'tot ziens = see you later', 'hoi = hi', 'goedemorgen = good morning', 'goededag = good day', 'goedenavond = good evening', 'goedenacht = good night', 'welterusten = good night', 'ja = yes', 'dank je wel = thank you very much', 'alsjeblieft = please', 'sorry = sorry', 'het spijt me = I am sorry', 'oké = okay', 'pardon = excuse me', 'hoe gaat het = how are you', 'goed = good, fine, well', 'dank je = thank you', '(een) beetje = (a) bit of', 'Engels = English', 'Nederlands = Dutch', 'Geen: negating indefinite nouns (you can think of it as "no" things or "none of" a thing if that 
helps). Geen replaces the indefinite pronoun in question.', 'Niet: negating a verb, adjective or definite nouns. Niet comes at the end of a sentence or directly after the verb zijn.', 'nee = no', 'niet = not', 'geen = not']

Update: I found a more reliable approach using CSS selectors. Please give it a try:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window

driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
driver.implicitly_wait(15)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    wait = WebDriverWait(driver, 15)
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe']")))
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.mw-parser-output>ul')))
    elem = driver.find_elements_by_css_selector('.mw-parser-output>ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_css_selector("li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

Update 2: while trying to understand the cause, I found that the ads take up most of the loading time. So I added wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe']"))) to wait for all the ads to load.

I also changed the second wait to .mw-parser-output>ul by dropping the trailing li; I don't think it is necessary. You can also try removing the second wait and see whether it still helps.
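Instead of waiting for the ads, another option is to poll until the number of matched elements stops growing. This is only a sketch (the helper name `wait_for_stable_count` is my own, not part of Selenium); it takes any zero-argument callable that returns the current element count:

```python
import time

def wait_for_stable_count(get_count, timeout=15.0, poll=0.5, stable_polls=3):
    """Poll get_count() until it returns the same nonzero value
    stable_polls times in a row, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    last, streak = None, 0
    while time.monotonic() < deadline:
        n = get_count()
        if n == last and n > 0:
            streak += 1
            if streak >= stable_polls:
                return n
        else:
            last, streak = n, 0
        time.sleep(poll)
    raise TimeoutError("element count never stabilised")

# Hypothetical Selenium usage inside the page loop:
# wait_for_stable_count(
#     lambda: len(driver.find_elements_by_css_selector('.mw-parser-output>ul li')))
```

This is more robust than waiting on ads, since it keys directly on the list items you actually want.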

Try waiting for the page to fully load before parsing it. One way is to use the time.sleep method:

from time import sleep
...

for url in listURL:
    driver.get(url)
    sleep(5)
    ...

Edit: using BeautifulSoup

import requests
from bs4 import BeautifulSoup


listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    
    for tag in soup.select("[id*=Lesson]:not([id*=Lessons])"):
        print(tag.text)
        print()
        print(tag.find_next("ul").text)
        print("-" * 80)
    print()

Output (truncated):

Link: https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1
Lesson 1

man = man
vrouw = woman
jongen = boy
ik = I
ben = am
een = a/an
en = and
                                        
Lesson 2

meisje = girl
kind = child/kid
hij = he
ze = she (unstressed)
is = is
of = or
                                        
Lesson 3

appel = apple

... And on

If you want the output as a list, do the following:

for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    print([tag.text for tag in soup.select(".mw-parser-output > ul li")])
    print("-" * 80)

After this line:

WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))

you need to add a short sleep; I think time.sleep(1) should be enough, and only after that should you run:

elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')

Your problem is caused by a misunderstanding of the visibility_of_all_elements_located function.
It does not actually wait until every element matching its locator has become visible; it has no way of knowing how many elements to expect.
So as soon as it detects at least one visible element matching your locator, it returns the list of elements found so far and the program moves on.
See the official documentation for more details on these methods.
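If you do need to wait for a minimum number of elements, WebDriverWait accepts any callable as a condition, so you can write your own. A sketch under that assumption (the class name `minimum_elements_located` is mine, not a Selenium built-in):

```python
class minimum_elements_located:
    """Custom expected condition: wait until at least `minimum`
    elements match `locator`. Hypothetical usage:
    WebDriverWait(driver, 15).until(
        minimum_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul'), 4))
    """
    def __init__(self, locator, minimum):
        self.locator = locator
        self.minimum = minimum

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        # Returning a falsy value makes WebDriverWait keep polling.
        return elements if len(elements) >= self.minimum else False
```

The only thing the condition touches is driver.find_elements, so WebDriverWait polls it until the threshold is reached or the timeout expires.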
