Python 从 monster.com 提取搜索结果

0 投票
1 回答
1891 浏览
提问于 2025-04-17 12:12

我看到过关于谷歌提取数据的结果,但这对我来说不管用。我想直接进入代码,修改一些参数,然后运行它,这样就能搜索并抓取职位名称、地点和日期。这是我目前的进展。如果能得到一些帮助,那就太好了,提前谢谢你。

我希望这个脚本能在monster.com上执行一个搜索,使用给定的参数(软件工程师,加州),并抓取结果。

#! /usr/bin/python
import re
import requests
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

parameters = ["Software","Engineer","CA"]
base_url = "http://careers.boozallen.com/search?q="
search_string = "+".join(parameters)

final_url = base_url + search_string

a = requests.get(final_url)
raw_string = a.text.strip()


soup = BeautifulSoup( raw_string )

job_urls = soup.findAll(name = 'a', attrs = { 'class': 'jobTitle fnt11_js' })

for job_url in job_urls:

    print job_url.text
    print

raw_input("Press enter to close: ")

我知道下面的代码可以作为标准的抓取。

handle = urlopen("http://jobsearch.monster.com/search/Engineer_5?q=Software&where=AZ&rad=20&sort=rv.di.dt")
responce = handle.read()
soup = BeautifulSoup( responce )

job_urls = soup.findAll(name = 'a', attrs = { 'class': 'jobTitle fnt11_js' })
for job_url in job_urls:
    print job_url.text
    print

1 个回答

1

如果你在浏览器里输入 http://careers.boozallen.com/search?q=software+engineer+CA,然后查看网页的HTML代码,你会看到类似这样的内容:

<tr class="dbOutputRow2">
    <td style="width: 400px;" class="colTitle" headers="hdrTitle"><span class="jobTitle"><a href="http://careers.boozallen.com/job/San-Diego-Network-Engineer%2C-Senior-Job-CA-92101/1645793/">Network Engineer, Senior Job</a></span></td>
    <td style="width: auto;" class="colLocation" headers="hdrLocation"><span class="jobLocation">San Diego, CA, US</span></td>
    <td style="width: 155px;" class="colDate" headers="hdrDate" nowrap="nowrap"><span class="jobDate">Jan 5, 2012</span></td>

你想要的信息都在 <span> 标签里,这些标签的 class 属性分别是 jobTitlejobLocationjobDate

下面是如何使用 lxml 来提取这些信息的示例:

import urllib2
import lxml.html as LH

url = 'http://careers.boozallen.com/search?q=software+engineer+CA'
doc = LH.parse(urllib2.urlopen(url))

def text_content(iterable):
    for elt in iterable:
        yield elt.text_content()

data = text_content(doc.xpath('''//span[@class = "jobTitle"
                                        or @class = "jobLocation"
                                        or @class = "jobDate"]'''))

for title, location, date in zip(*[data]*3):
    print(title,location,date)

这样做会得到

('Title', 'Location', 'Date')
('Network Engineer, Senior Job', 'San Diego, CA, US', 'Jan 5, 2012')
('Network Integration Engineer, Mid Job', 'San Diego, CA, US', 'Jan 12, 2012')
('Systems Engineer, Senior Job', 'San Diego, CA, US', 'Jan 31, 2012')
('Enterprise Architect, Senior Job', 'Washington, DC, US', 'Jan 23, 2012')
...

撰写回答