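I'm trying to scrape some data. The problem I'm facing is that the page refreshes every few seconds. I want to limit the data collection to the latest block only, then re-run the scan and hopefully catch the next block. Any ideas would be helpful.
Goal #1 - scrape blocks continuously
Goal #2 - eliminate duplicates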
from bs4 import BeautifulSoup
from time import sleep
import re, requests

# Strip everything except digits, commas and dots from the value cell
trim = re.compile(r'[^\d,.]+')
url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0
while True:
    scans += 1
    # Pass headers by keyword: the second positional argument of requests.get is params
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    souptxsInternal = BeautifulSoup(reqtxsInternal.content, 'html.parser')
    blocktxsInternal = souptxsInternal.find_all('table')[0].find_all('tr')
    for row in blocktxsInternal[1:]:  # skip the table header row
        cells = row.find_all('td')
        txnhashdetails = cells[1].text.strip()
        block = cells[3].text
        value = cells[9].text
        amount = trim.sub('', value).replace(",", "")
        transval = float(amount)
        if transval >= 1:
            print("Doing something with the data -> " + str(block) + " " + str(transval))
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
Current output (will differ each time the script is run):
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322 #-- grab only until here and reload the scan
Doing something with the data -> 10186992 9.0
Doing something with the data -> 10186991 2.98
Doing something with the data -> 10186991 1.0
-> Whole Page Scanned: 1
Doing something with the data -> 10186994 1.026868093169767
Doing something with the data -> 10186994 4.0
Doing something with the data -> 10186994 4.55582682
Doing something with the data -> 10186994 8.184713205161088
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322
Doing something with the data -> 10186992 9.0
-> Whole Page Scanned: 2
Desired output:
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322
-> Whole Page Scanned: 1
Doing something with the data -> 10186994 1.026868093169767
Doing something with the data -> 10186994 4.0
Doing something with the data -> 10186994 4.55582682
Doing something with the data -> 10186994 8.184713205161088
-> Whole Page Scanned: 2
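Continuity only works if the block number consistently increases (or consistently decreases).

Since the data changes on every refresh, I suggest collecting the data you need first, then de-duplicating it and performing the required action.

I used Pandas here. It uses BeautifulSoup under the hood, but since the data is a table I let pandas parse it, which makes the table easy to manipulate. It looks like you only need the rows for the latest (max) "Block", and then return any values greater than or equal to 1. Does that do what you need?

Your other option is to check whether the current 'block' is greater than the previous one, and add that logic so that rows are printed only when it is.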
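As a rough illustration of the "check whether the current block is greater than the previous one" idea, here is a minimal sketch in plain Python rather than Pandas. It assumes each scan has already been parsed into `(block, value)` tuples with `block` converted to `int`. The function name `select_new_rows` and its parameters are made up for this example, and the sample rows are copied from the output shown in the question. One caveat: a block that is still filling up at scan time may gain more rows later, and this sketch would not pick those up.

```python
def select_new_rows(rows, last_block=None, min_value=1.0):
    """Return (kept_rows, last_block_seen) for one scan.

    rows       -- iterable of (block, value) tuples parsed from the page
    last_block -- highest block handled by earlier scans, or None on the first scan
    min_value  -- rows with a smaller value are ignored

    Hypothetical helper for illustration; not part of any library.
    """
    rows = list(rows)
    if not rows:
        return [], last_block
    newest = max(block for block, _ in rows)
    # First scan: keep only the newest block.
    # Later scans: keep every block newer than the last one handled.
    floor = newest - 1 if last_block is None else last_block
    kept = [(b, v) for b, v in rows if b > floor and v >= min_value]
    return kept, max(newest, floor)


# Sample rows taken from the two scans in the question's output.
scan1 = [
    (10186993, 1.233071907624764),
    (10186993, 4.689434542638692),
    (10186993, 27.97137792744322),
    (10186992, 9.0),
    (10186991, 2.98),
    (10186991, 1.0),
]
scan2 = [
    (10186994, 1.026868093169767),
    (10186994, 4.0),
    (10186994, 4.55582682),
    (10186994, 8.184713205161088),
    (10186993, 1.233071907624764),
    (10186993, 4.689434542638692),
    (10186993, 27.97137792744322),
    (10186992, 9.0),
]

last = None
for scans, page in enumerate([scan1, scan2], start=1):
    kept, last = select_new_rows(page, last)
    for block, value in kept:
        print("Doing something with the data -> " + str(block) + " " + str(value))
    print(" -> Whole Page Scanned: ", scans)
```

In the scraping loop, `last` would be kept outside the `while True:` loop and updated after every scan, so a page that has not advanced to a new block yields no rows.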