Getting URLs from a script-generated page with a web crawler

Asked on 2025-04-18 02:40

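I am trying to scrape the links to all articles in the search results of a web page, but I am not getting any results.

The page in question is: http://www.seek.com.au/jobs/in-australia/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&advertiserID=&advertiserGroup=&keywords=police+check&page=1&isAreaUnspecified=false&location=&area=&nation=3000&sortMode=Advertiser&searchFrom=quick&searchType=

My approach: I want to collect the article IDs and append each one to a known link (http://www.seek.com.au/job/ + ID). However, the response to my request (made with the Python library from http://docs.python-requests.org/en/latest/) contains no IDs at all; in fact, it contains no articles.

It looks like, in this case, I need to somehow execute the script that generates the IDs in order to get the complete data. How can I do that?

Or is there some other way to get all the results of this search query?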


1 Answer

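As mentioned, start by downloading Selenium. Python bindings are available for it.

Selenium is an automation framework for web testing. Put simply, using Selenium is like remote-controlling a web browser. This is necessary because the search results are filled in by AJAX, and only a real browser, with its own JavaScript engine and DOM, will run the scripts that do so. A plain HTTP request only returns the initial HTML.

Use this test script (assuming you have Firefox installed; Selenium also supports other browsers if needed):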
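
One way to see why a plain HTTP request finds no articles: everything after the `#` in the search URL is a fragment. HTTP clients do not send the fragment to the server, so the search parameters only take effect once the page's JavaScript reads them. The following is a minimal sketch using only the standard library. The URL is the one from the question, shortened for brevity, and the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# Search URL from the question, with most parameters dropped for brevity.
url = ("http://www.seek.com.au/jobs/in-australia/"
       "#dateRange=999&keywords=police+check&page=1&nation=3000")

parts = urlsplit(url)

# What an HTTP client actually requests: the URL without its fragment.
requested = parts._replace(fragment="").geturl()

# The search parameters live in the fragment, which only client-side code sees.
search_params = parse_qs(parts.fragment)

print(requested)
print(search_params["keywords"])
```

The printed request URL carries none of the search parameters, which is consistent with the empty result described in the question.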

# Import third-party libraries
from selenium import webdriver


class requester_firefox(object):
    def __init__(self):
        # Start a Firefox instance controlled by Selenium
        self.selenium_browser = webdriver.Firefox()
        self.selenium_browser.set_page_load_timeout(30)

    def __del__(self):
        # Close the browser when the requester is garbage-collected
        self.selenium_browser.quit()
        self.selenium_browser = None

    def __call__(self, url):
        # Return the rendered page source, or an empty string on any failure
        try:
            self.selenium_browser.get(url)
            the_page = self.selenium_browser.page_source
        except Exception:
            the_page = ""
        return the_page


test = requester_firefox()
page = test("http://www.seek.com.au/jobs/in-australia/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&advertiserID=&advertiserGroup=&keywords=police+check&page=1&isAreaUnspecified=false&location=&area=&nation=3000&sortMode=Advertiser&searchFrom=quick&searchType=")
# Drop non-ASCII characters so the output prints on consoles that cannot display them
print(page.encode("ascii", "ignore").decode("ascii"))

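It loads SEEK in a real browser, waits for the page to load, and returns the source of the rendered page rather than the raw HTML. The encode call is necessary (at least for me) because SEEK returns a Unicode string that the Windows console does not seem to display correctly.

Answered on 2025-04-18 by Python大师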
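
Once the rendered source is available, the IDs from the question can be pulled out and appended to the known base link. The sketch below is one possible way to do that, not a tested SEEK scraper. It assumes that job links in the rendered page look like `href="/job/<digits>"`, which should be checked against the real markup. The helper names `extract_job_urls` and `wait_for_job_urls` are hypothetical. Because AJAX content can arrive after the initial page load, the second helper polls the source until links show up. Selenium's own `WebDriverWait` is the built-in tool for this kind of waiting.

```python
import re
import time

# Base link given in the question.
JOB_BASE_URL = "http://www.seek.com.au/job/"

# Assumption: job links in the rendered page look like href="/job/<digits>".
JOB_LINK_RE = re.compile(r'href="[^"]*/job/(\d+)')


def extract_job_urls(page_source):
    """Return full job URLs for every distinct job ID found, in page order."""
    seen = []
    for job_id in JOB_LINK_RE.findall(page_source):
        if job_id not in seen:
            seen.append(job_id)
    return [JOB_BASE_URL + job_id for job_id in seen]


def wait_for_job_urls(get_source, timeout=30.0, interval=0.5):
    """Poll get_source() until job links appear or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        urls = extract_job_urls(get_source())
        if urls or time.monotonic() >= deadline:
            return urls
        time.sleep(interval)
```

With the class above, after `selenium_browser.get(url)` has returned, a call such as `wait_for_job_urls(lambda: test.selenium_browser.page_source)` would return the list of job URLs, or an empty list if none appear within the timeout.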
