Python web scraping: accessing the HTML after a few seconds?

Posted 2024-04-19 19:06:29


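I am using Python to access this site and scrape its HTML: http://forum.toribash.com/tori_spy.php

As you can see if you visit the page, its content changes after a few seconds. It is a page that shows the latest posts on a forum, and I am making a Discord bot that should be able to display the latest post.

Right now it displays the first post in the list, as it is before any animation/change happens.

I would like to know whether there is a way to skip the animation, or to make the program wait a few seconds after accessing the page before it grabs all the HTML.

Current code: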

    if message.content.startswith("-post"):
        await client.send_message(message.channel, ":arrows_counterclockwise: **Accessing forums...**")
        await client.send_typing(message.channel)
        time.sleep(5)
        #access site
        session_requests = requests.session()
        url = "http://forum.toribash.com/tori_spy.php"
        result = session_requests.get(url,headers = dict(referer = url))
        #access html
        tree = html.fromstring(result.content)

        list_stuff=[]
        for atag in tree.xpath("//strong/a"): #search for <strong><a>
            list_stuff.append(atag.text_content())
        await client.send_message(message.channel, ":white_check_mark: Last post was in the thread **"+list_stuff[0]+"**")

Thanks a lot!


Tags: client, send, http, url, message, session, html, channel
1 answer
User
Answer #1 · posted 2024-04-19 19:06:29

The page uses AJAX/XHR to load new posts. It requests a URL like this:

forum.toribash.com/vaispy.php?do=xml&last=9297850&r=0....

last is the ID of the last message. You can find it in the HTML as highestid = 9297850; inside one of the <script> tags. r does not seem to matter; at least the code works for me without r.
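As a minimal sketch of that lookup, a regular expression can pull the number out of the script text without relying on line positions. The sample script below is hypothetical (the real page's script may be laid out differently), and extract_highestid is an illustrative helper name, not something the page or any library provides.

```python
import re

# Hypothetical sample of the inline script; the real page's layout may differ.
SAMPLE_SCRIPT = """
<script type="text/javascript">
var something = 1;
highestid = 9297850;
</script>
"""

# Matches "highestid = <digits>" with optional whitespace around "=".
HIGHESTID_RE = re.compile(r"highestid\s*=\s*(\d+)")


def extract_highestid(page_text):
    """Return the first highestid value in page_text as a string, or None."""
    match = HIGHESTID_RE.search(page_text)
    return match.group(1) if match else None


print(extract_highestid(SAMPLE_SCRIPT))
```

With a real response, the same helper could be called on result.text; returning None makes it easy to notice when the page no longer contains the assignment.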

Once you have highestid, you can use it to get the XML with the latest messages.
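One way to assemble that request URL is sketched below, using only the do and last parameters shown above. build_feed_url is a hypothetical helper name, and whether the server accepts anything beyond these two parameters is not known from this page.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL shown above; r is left out on purpose.
FEED_URL = "http://forum.toribash.com/vaispy.php"


def build_feed_url(last_id):
    """Build the XML feed URL for messages after last_id."""
    query = urlencode({"do": "xml", "last": last_id})
    return FEED_URL + "?" + query


print(build_feed_url("9297850"))
```

When using requests, passing the same dictionary as params= to get() should produce an equivalent query string.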

In the XML, each message carries its ID as <postid>, so you can use that value as last in the next request.
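A rough sketch of that bookkeeping follows, using only the standard library. The element names event, id, postid and preview are the ones the feed is described as using; the <events> root element and the sample values are assumptions made up for illustration, and parse_events and next_last are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response; the real feed's root element may differ.
SAMPLE_XML = """<events>
  <event><id>9297883</id><postid>9297883</postid><preview>first</preview></event>
  <event><id>9297881</id><postid>9297881</postid><preview>second</preview></event>
</events>"""


def parse_events(xml_text):
    """Return a list of dicts with the id, postid and preview of each event."""
    root = ET.fromstring(xml_text)
    events = []
    for event in root.iter("event"):
        events.append({
            "id": event.findtext("id"),
            "postid": event.findtext("postid"),
            "preview": event.findtext("preview"),
        })
    return events


def next_last(events, current_last):
    """Pick the value to send as last next time: the highest postid seen so far."""
    ids = [int(e["postid"]) for e in events if e["postid"]]
    return str(max(ids + [int(current_last)]))


events = parse_events(SAMPLE_XML)
print(next_last(events, "9297873"))
```

Keeping the current value when no events come back means a repeated request simply asks for the same range again instead of failing.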

import requests
from lxml import html

url = "http://forum.toribash.com/tori_spy.php"

s = requests.session()

# load the main page; the newest message id is embedded in a <script> tag
result = s.get(url)
tree = html.fromstring(result.content)

for script in tree.xpath("//script"):
    if script.text and 'highestid' in script.text:
        # the value is cut out by position (4th line of the script),
        # so this depends on the exact layout of the script
        highestid = script.text.split('\n')[3]
        highestid = highestid[13:-1]
        print('highestid:', highestid)

        # request the XML with the latest messages
        result = s.get('http://forum.toribash.com/vaispy.php?do=xml&last='+highestid, headers=dict(referer=url))
        #print(result.text)
        data = html.fromstring(result.content)

        # print the id, postid and preview text of every event
        for item in data.xpath('.//event'):
            print(' - event  -')
            print('id:', item.xpath('.//id')[0].text)
            print('postid:', item.xpath('.//postid')[0].text)
            print(item.xpath('.//preview')[0].text)

Current result (yours may differ):

highestid: 9297873
 - event  -
id: 9297883
postid: 9297883
me vende esse full valkyrie por 18k
 - event  -
id: 9297881
postid: 9297881
Congratz Goat! Welcome to the team! :)
 - event  -
id: 9297879
postid: 9297879
Try to reset your email password, then attempt to do what I suggested.
 - event  -
id: 9297877
postid: 9297877
Hello Nope. Most of these bugs are known to currently cause issues and they are being worked on. People pinging and rejoining are bots that are being dealt with (it's just an extensive process to...
 - event  -
id: 9297874
postid: 9297874
Bon courage :)
