Extracting text from a span tag in Python

Posted 2024-03-28 12:53:54

I am building a Selenium bot, and I need to extract information from the page after the bot runs a search, but I am having trouble.

The HTML is shown in the image below:

Image-here

I want to extract the text from the tags marked class='espacamentoLinhas':

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
from selenium.webdriver.support.ui import Select

URL = '\x0\x0\x0\x0'

search = '\x0\x0'
print("Running...")

class ScrapingTJ:
    def __init__(self):

        self.browser = webdriver.Firefox()
        self.browser.get(URL)

        sleep(1)
        select = Select(self.browser.find_element_by_id('cbPesquisa'))
        select.select_by_value('NMPARTE')
        sleep(1)
        self.browser.find_element_by_xpath('//*[@id="campo_NMPARTE"]').send_keys(search)
        self.browser.find_element_by_id('pbEnviar').click()
        sleep(2)

        dados = self.browser.find_element_by_id('listagemDeProcessos')
        HTML = dados.get_attribute("innerHTML")

        scraping = BeautifulSoup(HTML, "html.parser")
        # links
        links = scraping.find_all('a')
        for scrap in links:
            print(scrap.get_text())


        textos = scraping.find(class_ = 'espacamentoLinhas')
        subtextos = scraping.find_all('span')
        for ext in subtextos:
            print(ext.get_text())



if __name__ == '__main__':
    ScrapingTJ()

Output:

Exectdo:
Recebido em:




Exectda:
   Recebido em:

I should be getting '30/04/2007 - Vara das Execuções Fiscais Estaduais', which is underlined in the image.


1 answer
Anonymous user
Reply #1 · posted 2024-03-28 12:53:54

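Based on the HTML in your image, the text you want appears to sit inside the div element, not the span element, so you need to read it from the div rather than the span. I would replace this block: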

textos = scraping.find(class_ = 'espacamentoLinhas')
subtextos = scraping.find_all('span')
for ext in subtextos:
    print(ext.get_text())

with this:

elements = self.browser.find_elements_by_xpath("//div[@class='espacamentoLinhas']")
for element in elements:
    print(element.text)

The span contains only the label "Recebido em:", not the text you are looking for, 30/04/2007 - Vara das Execuções Fiscais Estaduais. That text actually lives in the div that the XPath above refers to.
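To make the span/div difference concrete, here is a minimal sketch run against a made-up fragment. The markup in SAMPLE_HTML is an assumption about what one result row might look like (the screenshot is not reproduced here); only the class name espacamentoLinhas and the two text values come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumed shape for one result row, not the real page.
SAMPLE_HTML = """
<div class="espacamentoLinhas">
  <span>Recebido em:</span>
  30/04/2007 - Vara das Execuções Fiscais Estaduais
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Looping over the spans yields only the label.
span_texts = [span.get_text(strip=True) for span in soup.find_all("span")]

# Reading the enclosing div yields the label plus the value that follows it.
div_texts = [
    div.get_text(" ", strip=True)
    for div in soup.find_all("div", class_="espacamentoLinhas")
]

print(span_texts)
print(div_texts)
```

If the real page nests things differently, the same comparison can be run on the innerHTML string the bot already collects to see which element actually holds the value.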

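If you would rather not use self.browser.find_elements_by_xpath, you can instead drop the scraping.find_all('span') part of your code and read the text from the elements matched by class: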

textos = scraping.find_all(class_='espacamentoLinhas')
for ext in textos:
    print(ext.get_text())
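Reading the whole div will probably return the "Recebido em:" label together with the value. If only the value is wanted, one option is to keep just the text nodes that are direct children of the div. The sketch below assumes the same made-up markup as before; own_text is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def own_text(tag):
    """Return the text sitting directly inside `tag`, skipping child tags."""
    pieces = [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString)
    ]
    # Drop whitespace-only nodes and join whatever remains.
    return " ".join(piece for piece in pieces if piece)


# Hypothetical markup: an assumed shape for one result row, not the real page.
SAMPLE_HTML = (
    '<div class="espacamentoLinhas">'
    "<span>Recebido em:</span> 30/04/2007 - Vara das Execuções Fiscais Estaduais"
    "</div>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
values = [own_text(div) for div in soup.find_all(class_="espacamentoLinhas")]
print(values)
```

This only works if the value really is a bare text node next to the span; if the page wraps it in another tag, the helper would need to look one level deeper.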
