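Posted 2024-04-30 03:31:24
User
The link tag "a" contains the following text: "Mino Games (YC W11) Is Hiring Senior Engineers in Montreal, QC (workable.com)"
I want to store "Mino Games", "Senior Engineers", "Montreal" and "workable.com" in sqlite3.
Please advise how I should do this.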
Assuming you are scraping https://news.ycombinator.com/jobs, this should work:
    import re
    import sqlite3

    conn = sqlite3.connect('jobs.db')
    c = conn.cursor()
    # source is a hostname such as workable.com, so it is stored as text
    c.execute('''CREATE TABLE IF NOT EXISTS jobs
                 (company text, position text, location text, source text)''')

    company_pattern = re.compile(r'(.+)(hiring|looking|wants|is )', re.IGNORECASE)
    source_pattern = re.compile(r'\(([^)]+)\)$')
    location_pattern = re.compile(r'in (.*)|(remote)', re.IGNORECASE)
    position_pattern = re.compile(r'(?:hiring|looking|wants) (.*)', re.IGNORECASE)
    # \b keeps "a " from being stripped out of the end of words such as "Data "
    clean_up_pattern = re.compile(r'\(([^)]+)\)| is | for | in |\ba ', re.IGNORECASE)

    # Load up <a> nodes into elements here (each needs a .text attribute)
    elements = []

    for element in elements:
        text = element.text.strip()
        try:
            source = source_pattern.findall(text)[0].strip()
            text = text.replace('(' + source + ')', '')
            company = clean_up_pattern.sub('', company_pattern.findall(text)[0][0]).strip()
            match = location_pattern.search(text)
            # group 1 is the text after "in ", group 2 is a bare "remote"
            location = ((match.group(1) or match.group(2) or '') if match else '').strip()
            if location:
                text = text.replace(location, '')
            position = clean_up_pattern.sub('', position_pattern.findall(text)[0]).strip()
        except IndexError:
            continue  # the title does not fit the patterns; skip it
        # Pass the values as parameters instead of naming them inside the SQL string
        c.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)",
                  (company, position, location or 'Not stated', source))

    conn.commit()
    conn.close()
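The script leaves the step of loading the <a> nodes into elements open. Below is a minimal sketch of one way to do that with only the standard library's html.parser. The class name AnchorTextCollector, the sample markup, and the second posting (Acme, example.com) are made up for illustration, and the markup of the real page may well differ.

```python
from html.parser import HTMLParser
from types import SimpleNamespace


class AnchorTextCollector(HTMLParser):
    """Collect the text content of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.elements = []   # finished links, each with a .text attribute
        self._parts = None   # text pieces of the <a> currently open, else None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._parts is not None:
            text = ''.join(self._parts).strip()
            self.elements.append(SimpleNamespace(text=text))
            self._parts = None


# Assumed markup for illustration only; not copied from the real page.
sample_html = '''
<ul>
  <li><a href="https://example.com/1">Mino Games (YC W11) Is Hiring Senior Engineers in Montreal, QC (workable.com)</a></li>
  <li><a href="https://example.com/2">Acme Is Hiring a Data Analyst (example.com)</a></li>
</ul>
'''

collector = AnchorTextCollector()
collector.feed(sample_html)
elements = collector.elements

for element in elements:
    print(element.text)
```

Each collected item exposes a .text attribute, so the list can be used directly in the loop that reads element.text. For a live page, the downloaded HTML string would be passed to feed() in place of sample_html.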
The regular expressions above handle roughly 80% of the job postings there. If you need to capture more, adjust them.
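To see which titles fall outside the patterns before adjusting them, it can help to run the parsing on a few titles and list the ones that fail. The sketch below is an assumption-laden illustration: parse_title is a hypothetical helper, the patterns are adapted from the answer (with a word boundary added before "a "), the Acme titles are made up, and an in-memory database stands in for jobs.db.

```python
import re
import sqlite3

company_pattern = re.compile(r'(.+)(hiring|looking|wants|is )', re.IGNORECASE)
source_pattern = re.compile(r'\(([^)]+)\)$')
location_pattern = re.compile(r'in (.*)|(remote)', re.IGNORECASE)
position_pattern = re.compile(r'(?:hiring|looking|wants) (.*)', re.IGNORECASE)
clean_up_pattern = re.compile(r'\(([^)]+)\)| is | for | in |\ba ', re.IGNORECASE)


def parse_title(title):
    """Return (company, position, location, source), or None if a pattern fails."""
    title = title.strip()
    source_match = source_pattern.search(title)
    if source_match is None:
        return None
    source = source_match.group(1).strip()
    title = title[:source_match.start()]  # drop the trailing "(source)"

    company_match = company_pattern.search(title)
    if company_match is None:
        return None
    company = clean_up_pattern.sub('', company_match.group(1)).strip()

    location = ''
    location_match = location_pattern.search(title)
    if location_match:
        # group 1 is the text after "in ", group 2 is a bare "remote"
        location = (location_match.group(1) or location_match.group(2) or '').strip()
    if location:
        title = title.replace(location, '')
    else:
        location = 'Not stated'

    position_match = position_pattern.search(title)
    if position_match is None:
        return None
    position = clean_up_pattern.sub('', position_match.group(1)).strip()
    return company, position, location, source


# The first title comes from the question; the other two are invented test cases.
titles = [
    'Mino Games (YC W11) Is Hiring Senior Engineers in Montreal, QC (workable.com)',
    'Acme Is Hiring a Remote Data Analyst (example.com)',
    'Founding engineer at Acme',
]

conn = sqlite3.connect(':memory:')  # in-memory stand-in for jobs.db
conn.execute('CREATE TABLE jobs (company text, position text, location text, source text)')

unparsed = []
for title in titles:
    row = parse_title(title)
    if row is None:
        unparsed.append(title)
    else:
        conn.execute('INSERT INTO jobs VALUES (?, ?, ?, ?)', row)
conn.commit()

rows = conn.execute(
    'SELECT company, position, location, source FROM jobs ORDER BY rowid'
).fetchall()
conn.close()

for row in rows:
    print(row)
print('unparsed:', unparsed)
```

The titles collected in unparsed show which wordings the patterns miss, which gives a concrete starting point for extending the keyword alternatives. Note that the location keeps the region suffix ("Montreal, QC"), so trimming it to the city alone would need an extra step.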