How to extract the text of p tags that contain anchor text and match a condition

Posted 2024-04-25 22:07:21


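My code does not produce very readable results from the scraped data. I have tried a few approaches that are within my understanding, but I cannot get it to work properly.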

import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
addrlist = ['https://poocoin.app/rugcheck/0x8076c74c5e3f5852037f31ff0093eeb8c8add8d3/dev-activity',
            'https://poocoin.app/rugcheck/0xd7ac542add4994a9d72369ab8d4788a38df6a217/dev-activity',
            'https://poocoin.app/rugcheck/0xf017e2773e4ee0590c81d79ccbcf1b2de1d22877/dev-activity']

for url in addrlist: 
    driver.get(url)

    time.sleep(8)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    pdata = soup.find_all('div',attrs={"class":"mt-2"})
    for x in pdata:
        print (x.find('p'))
    print ()
driver.quit()

Current output (partial):

<p><a href="/tokens/0x8076c74c5e3f5852037f31ff0093eeb8c8add8d3">Go to chart</a></p>
<p>This is a log of activity related to the token from all wallets that have had ownership of the contract.</p>
<p>Wallet activity for <a href="https://bscscan.com/address/0xCd198Be08A33cbe2172f3BE45cdB431E060076BC" rel="noreferrer" target="_blank">0xCd198Be08A33cbe2172f3BE45cdB431E060076BC</a></p>
<p>Wallet activity for <a href="https://bscscan.com/address/0x79c4af7c43f500b9ccba9396d079cc03dfcafda1" rel="noreferrer" target="_blank">0x79c4af7c43f500b9ccba9396d079cc03dfcafda1</a><br/><span class="text-muted text-small">(Ownership transferred to <a href="https://bscscan.com/address/undefined" rel="noreferrer" target="_blank"></a> on 9/3/2021, 1:55:09 AM)</span></p>
<p>Wallet activity for <a href="https://bscscan.com/address/0xc95063d946242f26074a76c8a2e94c9d735dfc78" rel="noreferrer" target="_blank">0xc95063d946242f26074a76c8a2e94c9d735dfc78</a><br/><span class="text-muted text-small">(Ownership transferred to <a href="https://bscscan.com/address/0x79c4af7c43f500b9ccba9396d079cc03dfcafda1" rel="noreferrer" target="_blank">0x79c4af7c43f500b9ccba9396d079cc03dfcafda1</a> on 4/1/2021, 8:46:31 AM)</span></p>

Desired output (scrape an entry only when its anchor text is not empty):

0x8076c74c5e3f5852037f31ff0093eeb8c8add8d3
  Wallet activity for 0xc95063d946242f26074a76c8a2e94c9d735dfc78
  (Ownership transferred to 0x79c4af7c43f500b9ccba9396d079cc03dfcafda1 on 01/04/2021, 8:46:31 am)

0xd7ac542add4994a9d72369ab8d4788a38df6a217
  Wallet activity for 0x9ecedaafc0d45ad80b2515e24c61d6a7c5b917bd
  (Ownership transferred to 0x0000000000000000000000000000000000000000 on 06/09/2021, 7:37:22 pm)

0xf017e2773e4ee0590c81d79ccbcf1b2de1d22877
  Wallet activity for 0x61b1e31107953f8af76d19ba503ed1798b760c13
  (Ownership transferred to 0x0000000000000000000000000000000000000000 on 23/04/2021, 5:27:11 am)

1 answer
User
#1 · posted 2024-04-25 22:07:21

You can do it like this.

The data you need is in the last <div class="mt-2"> on each page. Select that last <div>, find the <p> inside it, and print its text.

Here is the code that prints the required data:

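import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Passing the driver path positionally works on older Selenium releases;
# newer releases may need webdriver.Chrome() or a Service object instead.
driver = webdriver.Chrome('chromedriver.exe')
addrlist = ['https://poocoin.app/rugcheck/0x8076c74c5e3f5852037f31ff0093eeb8c8add8d3/dev-activity',
            'https://poocoin.app/rugcheck/0xd7ac542add4994a9d72369ab8d4788a38df6a217/dev-activity',
            'https://poocoin.app/rugcheck/0xf017e2773e4ee0590c81d79ccbcf1b2de1d22877/dev-activity']

for url in addrlist:
    driver.get(url)

    # Give the JavaScript-rendered page time to load
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # The wanted entry sits in the last <div class="mt-2"> on the page
    pdata = soup.find_all('div', attrs={"class": "mt-2"})[-1]
    # .text joins all nested strings, so the <br/> does not become a line break
    print(pdata.find('p').text.strip())

driver.quit()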
Output:

Wallet activity for 0xc95063d946242f26074a76c8a2e94c9d735dfc78(Ownership transferred to 0x79c4af7c43f500b9ccba9396d079cc03dfcafda1 on 4/1/2021, 6:16:31 AM)

Wallet activity for 0x9ecedaafc0d45ad80b2515e24c61d6a7c5b917bd(Ownership transferred to 0x0000000000000000000000000000000000000000 on 9/6/2021, 5:07:22 PM)

Wallet activity for 0x61b1e31107953f8af76d19ba503ed1798b760c13(Ownership transferred to 0x0000000000000000000000000000000000000000 on 4/23/2021, 2:57:11 AM)
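
The output above keeps the wallet line and the ownership note on one line and leaves the date as the page shows it. If you want something closer to the layout in the question, one possible approach is to filter on the anchor text itself instead of relying on the entry being last. The sketch below is only an illustration under several assumptions: it runs on a static HTML sample adapted from the question's output rather than the live page, it treats "matching the condition" as "the <p> has an ownership <span> and none of its <a> tags is empty", and it assumes the page prints dates month-first. The names SAMPLE_HTML, DATE_RE, reformat_date, ownership_entries and token_from_url are hypothetical helpers, not part of the original answer.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Sample markup adapted from the question's output; the live page may differ.
SAMPLE_HTML = """
<div class="mt-2"><p><a href="/tokens/0x8076c74c5e3f5852037f31ff0093eeb8c8add8d3">Go to chart</a></p></div>
<div class="mt-2"><p>Wallet activity for <a href="https://bscscan.com/address/0x79c4af7c43f500b9ccba9396d079cc03dfcafda1">0x79c4af7c43f500b9ccba9396d079cc03dfcafda1</a><br/><span class="text-muted text-small">(Ownership transferred to <a href="https://bscscan.com/address/undefined"></a> on 9/3/2021, 1:55:09 AM)</span></p></div>
<div class="mt-2"><p>Wallet activity for <a href="https://bscscan.com/address/0xc95063d946242f26074a76c8a2e94c9d735dfc78">0xc95063d946242f26074a76c8a2e94c9d735dfc78</a><br/><span class="text-muted text-small">(Ownership transferred to <a href="https://bscscan.com/address/0x79c4af7c43f500b9ccba9396d079cc03dfcafda1">0x79c4af7c43f500b9ccba9396d079cc03dfcafda1</a> on 4/1/2021, 8:46:31 AM)</span></p></div>
"""

# Matches timestamps such as "4/1/2021, 8:46:31 AM" (assumed month-first).
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}, \d{1,2}:\d{2}:\d{2} [AP]M")


def reformat_date(match):
    """Rewrite 'M/D/YYYY, H:MM:SS AM' as 'DD/MM/YYYY, H:MM:SS am'."""
    stamp = datetime.strptime(match.group(0), "%m/%d/%Y, %I:%M:%S %p")
    hour = stamp.strftime("%I").lstrip("0")  # drop the zero padding on the hour
    return f"{stamp:%d/%m/%Y}, {hour}:{stamp:%M:%S} {stamp:%p}".lower()


def ownership_entries(html):
    """Return (activity, note) pairs for <p> tags whose anchors all have text."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for p in soup.select("div.mt-2 p"):
        span = p.find("span")
        if span is None:
            continue  # no ownership note, e.g. the "Go to chart" paragraph
        if not all(a.get_text(strip=True) for a in p.find_all("a")):
            continue  # at least one anchor has empty text
        note = DATE_RE.sub(reformat_date, span.get_text().strip())
        span.extract()  # remove the note so <p> holds only the wallet line
        entries.append((p.get_text().strip(), note))
    return entries


def token_from_url(url):
    """Take the token address from a .../rugcheck/<token>/dev-activity URL."""
    return url.rstrip("/").split("/")[-2]


url = "https://poocoin.app/rugcheck/0x8076c74c5e3f5852037f31ff0093eeb8c8add8d3/dev-activity"
print(token_from_url(url))
for activity, note in ownership_entries(SAMPLE_HTML):
    print(f"  {activity}")
    print(f"  {note}")
```

With the sample markup, this prints the token address followed by the indented wallet line and ownership note for 0xc95063d946242f26074a76c8a2e94c9d735dfc78 only, because the other ownership entry contains an empty anchor. To use it with Selenium, you would pass driver.page_source to ownership_entries inside the loop instead of SAMPLE_HTML. Filtering on content rather than position should also cope with pages where the wanted entry is not the last <div class="mt-2">, though that has not been checked against the live site.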
