Accessing Selenium web elements with Python


Asked 2025-04-17 06:17

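I'm sure this question has been answered somewhere, since it is very basic. However, I can't find the answer anywhere, and I feel like an idiot, but I still have to ask, so here goes: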

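I'm writing Python code that should produce a list of the addresses of all pages under a domain. It uses Selenium 2, but I run into trouble when I try to access the list of all links that Selenium produces.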

This is my code so far:

from selenium import webdriver
import time

HovedDomene = 'http://www.example.com'
Listlinker = []
Domenesider = []
Domenesider.append(HovedDomene)

driver = webdriver.Firefox()

for side in Domenesider:

    driver.get(side)
    time.sleep(10)
    Listlinker = driver.find_elements_by_xpath("//a")

    for link in Listlinker:

        if link in Domenesider:
            pass
        elif str(HovedDomene) in str(link):
            Domenesider.append(side)

print(Domenesider)
driver.close()

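The Listlinker variable does not contain the links found on the page. Instead it contains what I guess are Selenium-specific objects called WebElements. However, I can't find any WebElement attribute that gives me the links. In fact, I can't find any Python example of accessing WebElement attributes (at least not one I can reproduce).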

I would really appreciate any help you can give me.

Sincerely,
Newbie


2 Answers

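I have been looking into your tip not to use time.sleep(10) to wait for the page to load. From the various posts I have read, it seems that waiting for the page to load is redundant with Selenium 2. See, for example, this link. The reason is that Selenium 2 implicitly waits for the page to load. I just thought I would mention it, since you took the time to answer my question.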

Sometimes Selenium behaves in confusing ways. And sometimes Selenium throws errors that we don't care about.

By byCondition;
T result; // T is IWebElement
const int SELENIUMATTEMPTS = 5;
int timeout = 60 * 1000;
Stopwatch watch = Stopwatch.StartNew(); // System.Diagnostics; started so the timeout check below can expire

public T MatchElement<T>() where T : IWebElement
{
    try
    {
        try {
            this.result = this.find(WebDriver.Instance, this.byCondition);
        }
        catch (NoSuchElementException) { }

        while (this.watch.ElapsedMilliseconds < this.timeout && !this.ReturnCondMatched)
        {

            Thread.Sleep(100);
            try {
                this.result = this.find(WebDriver.Instance, this.byCondition);
            }
            catch (NoSuchElementException) { }
        }
    }
    catch (Exception ex)
    {
        if (this.IsKnownError(ex))
        {
            if (this.seleniumAttempts < SELENIUMATTEMPTS)
            {
                this.seleniumAttempts++;
                return MatchElement<T>();
            }
        }
        else { log.Error(ex); }
    }
    return this.result;
}

public bool IsKnownError(Exception ex)
{
    // If Selenium finds nothing it throws an exception. This is bad practice, to my mind.
    bool res = (ex.GetType() == typeof(NoSuchElementException));

    // InvalidSelectorException with NS_ERROR_ILLEGAL_VALUE:
    // the issue appears when Selenium interacts with other plugins.
    // This is probably related to synchronization.
    res = res || (ex.GetType() == typeof(InvalidSelectorException) && ex.Message
        .Contains("Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE)" +
                "[nsIDOMXPathEvaluator.createNSResolver]"));

    // OpenQA.Selenium.StaleElementReferenceException: Element not found in the cache
    res = res || (ex.GetType() == typeof(StaleElementReferenceException) &&
        ex.Message.Contains("Element not found in the cache"));

    return res;
}

Sorry for the C#, but I'm still a beginner in Python. The code is simplified, of course.
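For readers who want the same idea in Python, here is a minimal sketch of the retry loop above, written without any Selenium dependency. The name `wait_until` and its parameters are hypothetical and not part of Selenium; the sketch only illustrates polling until a lookup succeeds while ignoring a chosen set of "known" errors.

```python
import time


def wait_until(find, timeout=60.0, poll=0.1, ignored=(LookupError,)):
    """Call find() repeatedly until it returns a truthy value or the timeout expires.

    Exceptions listed in `ignored` are treated as "not there yet" and retried;
    any other exception propagates to the caller.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = find()
            if result:
                return result
        except ignored:
            pass  # known, uninteresting error: try again
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

With Selenium, one would presumably pass something like `lambda: driver.find_elements_by_xpath("//a")` as `find` and `NoSuchElementException` and `StaleElementReferenceException` (from `selenium.common.exceptions`) as `ignored`. Selenium's own `WebDriverWait` in `selenium.webdriver.support.ui` covers the same ground and is probably the better choice in real code.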

Answered 2025-04-17 by Python大师

I know a little about Selenium's Python library. You can use the get_attribute(attributename) method to get the links. So it could look roughly like this:

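linkstr = ""
for link in Listlinker:
    # get_attribute returns the attribute value as a string (or None if absent)
    linkstr = link.get_attribute("href")

    if not linkstr or linkstr in Domenesider:
        pass
    elif str(HovedDomene) in linkstr:
        Domenesider.append(linkstr)  # append the link itself, not the current page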
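The filtering step can also be pulled out into a small function that works on plain strings, which makes it testable without a browser. This is only a sketch: `same_domain_links` is a hypothetical helper name, and it assumes the href values are absolute URLs (which is usually what `get_attribute("href")` gives back). It compares host names instead of doing a substring check and drops `#fragment` parts so the same page is not queued twice.

```python
from urllib.parse import urldefrag, urlparse


def same_domain_links(hrefs, base_url, seen):
    """Return the hrefs on base_url's host that are not already in `seen`."""
    base_host = urlparse(base_url).netloc
    new_links = []
    for href in hrefs:
        if not href:  # anchors without an href yield None
            continue
        url, _fragment = urldefrag(href)  # page#a and page#b are the same page
        if (urlparse(url).netloc == base_host
                and url not in seen
                and url not in new_links):
            new_links.append(url)
    return new_links
```

Inside the loop from the question, it could be used as `Domenesider.extend(same_domain_links((l.get_attribute("href") for l in Listlinker), HovedDomene, Domenesider))`.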

Answered 2025-04-17 by Python大师