Getting specific data from td tags

2 answers

User

Answer 1 · edited 2024-05-17 18:28:43

What you are asking for is not a simple problem, but this script can get you started:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/jobs'

plain_html_text = requests.get(url)

soup = BeautifulSoup(plain_html_text.text, "html.parser")

rows = []
for title in soup.select('.title:not(:has(.morelink)) .storylink'):
    t = title.get_text(strip=True)

    company = re.findall(r'^(.*?)(?:is hiring|is looking|seeking|hiring)', t, flags=re.I)
    if company:
        company = company[0].strip()
    else:
        company = '-'

    position = re.findall(r'(?:is hiring|is looking|seeking|hiring)(.*?)(?=\bin\b|$)', t, flags=re.I)
    if position:
        position = position[0].strip()
    else:
        position = '-'

    location = re.findall(r'(?:\bin\b)(.*)', t, flags=re.I)
    if location:
        location = location[0].strip()
    else:
        location = '-'

    rows.append([company, position, location])

print('{: ^50}{: ^80}{: ^20}'.format('Company', 'Position', 'Location'))
for row in rows:
    c, p, l = row
    print('{: <50}{: <80}{: <20}'.format(c, p, l))

Prints:

                     Company                                                          Position                                          Location      
Scale AI                                          engineers to accelerate the development of AI                                   -                   
Mino Games (YC W11)                               Game Developers                                                                 Montreal            
BuildZoom (YC W13)                                – Help us un-break construction                                                 -                   
Bitmovin (YC S15)                                 a Video Solutions Architect/Software Engineer                                   Brazil              
Streak – CRM for Gmail (YC S11)                                                                                                   Vancouver           
ZeroCater (YC W11)                                a Director of Engineer                                                          SF                  
UpCodes (YC S17)                                  engineers to automate compliance for architects                                 -                   
Tech Nonprofit Upsolve (YC W19)                   a Software Engineer                                                             -                   
Gitlab (YC W15)                                   an Engineering Manager, Ecosystem                                               -                   
Saleswhale (YC S16)                               Our First U.S. Strategic Account Executive                                      -                   
Jerry (YC S17)                                    for a Director of Ops and Growth                                                -                   
Sourceress (YC S17)                               Product and ML Engineers (Remote OK, No Prior ML OK)                            -                   
GiveCampus (YC S15)                               a Product Designer who cares about education                                    -                   
Iris Automation                                   an Account Executive for B2B Flying Vehicle Software                            -                   
LogDNA (YC W15)                                   Software Engineers – DevOps Monitoring at Scale                                 -                   
Flexport                                          software engineers to work on our trucking apps                                 Chicago             
Mux                                               an ML engineer to help train our machines to deliver better video               -                   
The Muse (YC W12)                                 a Product Director for Growth                                                   -                   
OneSignal                                         an SRE to scale our bare-metal infrastructure                                   -                   
Atomwise (YC W15)                                 a Senior Systems/Cloud Engineer                                                 -                   
Demodesk (YC W19)                                 Software Engineers                                                              Munich              
Gusto                                             for Android and iOS developers to build our native mobile app                   -                   
Fond (YC W12)                                     an Engineering Manager                                                          Portland            
ReadMe (YC W15)                                   – Help us make APIs easy to use                                                 -                   
Keeper (YC W19)                                   a lead engineer – help save gig workers money on taxes                          -                   
Asseta (YC S13)                                   a technical lead                                                                -                   
Tesorio (YC S15)                                  Engineering Managers, Senior Engineers                                          -                   
Standard Cognition (YC S17)                       – Work on vision systems                                                        Rust                
Curebase (YC S18)                                 first sales hire – distributed clinical research                                -                   
Mashgin (YC W15)                                  a Fullstack SWE Interested                                                      Computer Vision/AI
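
To try the parsing without hitting the site, the three regular expressions above can be wrapped in one function and run on hard-coded titles. This is only a sketch: `split_title` and `KEYWORDS` are names made up here, and the sample titles are reconstructed from the printed table, not fetched from the page.

```python
import re

# Same keyword alternation as in the script above.
KEYWORDS = r'(?:is hiring|is looking|seeking|hiring)'

def split_title(title):
    """Return (company, position, location); '-' marks a part that did not match."""
    company = re.search(r'^(.*?)' + KEYWORDS, title, flags=re.I)
    position = re.search(KEYWORDS + r'(.*?)(?=\bin\b|$)', title, flags=re.I)
    location = re.search(r'\bin\b(.*)', title, flags=re.I)
    return tuple(m.group(1).strip() if m else '-'
                 for m in (company, position, location))

# Assumed sample titles (reconstructed from the output above).
samples = [
    'Mino Games (YC W11) is hiring Game Developers in Montreal',
    'Asseta (YC S13) is hiring a technical lead',
    'Standard Cognition (YC S17) is hiring – Work on vision systems in Rust',
]
for s in samples:
    print(split_title(s))
```

Note that the location is simply whatever follows the first standalone "in", which is why "Rust" and "Computer Vision/AI" end up in the Location column of the table.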

User

Answer 2 · edited 2024-05-17 18:28:43

Here is a basic scraper that splits each title into company and position.

import requests
from bs4 import BeautifulSoup
import re

from pprint import pprint

def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url, headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'})
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def extract_jobs(soup: BeautifulSoup) -> list:
    titles = soup.select('.storylink')
    hiring_re = re.compile(r'\s+(is)?\s+(hiring|seeking|looking)\s+(for)?', flags=re.IGNORECASE)

    jobs = []
    for el in titles:
        title = el.text.strip()
        m = hiring_re.search(title)
        if not m:
            continue
        company = title[:m.start()].strip()
        offer = title[m.end():].strip().title()
        jobs.append({
            'company': company,
            'wants': offer,
        })
    return jobs


url = 'https://news.ycombinator.com/jobs'
soup = make_soup(url)
jobs = extract_jobs(soup)
pprint(jobs)

Output:

 {'company': 'Mino Games (YC W11)', 'wants': 'Game Developers In Montreal'},
 {'company': 'BuildZoom (YC W13)', 'wants': '– Help Us Un-Break Construction'},
 {'company': 'Streak – CRM for Gmail (YC S11)', 'wants': 'In Vancouver'},
 {'company': 'ZeroCater (YC W11)', 'wants': 'A Director Of Engineer In Sf'},
 {'company': 'UpCodes (YC S17)', 'wants': 'Engineers To Automate Compliance For Architects'},
 {'company': 'Tech Nonprofit Upsolve (YC W19)', 'wants': 'A Software Engineer'},
...
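
One thing visible in that output: `str.title()` lowercases everything after the first letter of each word, so "SF" comes out as "Sf". A possible variation, sketched below on hard-coded titles, skips `.title()` and splits off a trailing "in <place>" as a separate field. `parse_title` and `location_re` are hypothetical names, not part of the answer, and the sample titles are assumed from the output rather than fetched.

```python
import re

# Same pattern as in extract_jobs above.
hiring_re = re.compile(r'\s+(is)?\s+(hiring|seeking|looking)\s+(for)?', flags=re.IGNORECASE)
# Hypothetical extra pattern: a standalone "in" followed by the rest of the offer.
location_re = re.compile(r'\bin\s+(\S.*)$', flags=re.IGNORECASE)

def parse_title(title):
    """Return a dict with company, wants and location, or None if no keyword is found."""
    m = hiring_re.search(title)
    if not m:
        return None
    company = title[:m.start()].strip()
    offer = title[m.end():].strip()  # no .title(), so the original casing is kept
    loc = location_re.search(offer)
    return {
        'company': company,
        'wants': offer[:loc.start()].strip() if loc else offer,
        'location': loc.group(1).strip() if loc else None,
    }

print('SF'.title())  # shows the casing problem
print(parse_title('ZeroCater (YC W11) is hiring a Director of Engineer in SF'))
print(parse_title('Tech Nonprofit Upsolve (YC W19) is hiring a Software Engineer'))
```

Like the first answer, this treats anything after a standalone "in" as a location, so it will still misfire on titles such as "interested in Computer Vision/AI".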
