How to exclude unwanted tags when using BeautifulSoup in Python

Posted 2024-05-14 04:13:06


I am practicing web scraping in Python with BeautifulSoup (the code below targets indeed.com).

When extracting the job location from [div class companyLocation], what I want is only the location string that comes directly after the opening div tag.

But in some cases there are extra 'a aria-label' or 'span' child elements that contain unwanted strings such as "+1 location".

I can't figure out how to get rid of these, so I am asking for your advice.

<div class="companyLocation">United States
<span><a aria-label="Same Python Developer job in 1 other location" class="more_loc" href="/addlLoc/redirect?tk=1fgg7b6pa306m001&amp;jk=d724dab9a2d2af2c&amp;dest=%2Fjobs%3Fq%3Dpython%26limit%3D50%26grpKey%3DkAO5nvwVmAPOkxWgAwHyBwN0Y2w%253D" rel="nofollow">
+1 location</a></span>

<span class="remote-bullet">•</span><span>Remote</span></div>, United States+1 location•Remote

Here is my Python code for reference. The problem occurs in the "if a.string is None:" case.

The div + span HTML shown above is what this line prints: print(f"{a}, {a.text}")

import requests
from bs4 import BeautifulSoup

url = "https://www.indeed.com/jobs?q=python&limit=50"

extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})

for soup_job in soup_jobs:
    for a in soup_job.select("div.companyLocation"):
        if a.string is not None:
            pass

        #problem(below)
        if a.string is None:
            print(f"{a}, {a.text}")
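
For context on why the second branch is reached at all: in Beautiful Soup, .string gives the text when a tag has exactly one text child, looks inside when the only child is another tag, and is None when a tag has several children. The sketch below illustrates this on made-up sample markup modelled on the snippet in the question (the href is shortened to "#" and the variable names are only illustrative).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the question; not fetched from any site.
html = """
<div class="companyLocation">Houston, TX</div>
<div class="companyLocation"><span>Remote</span></div>
<div class="companyLocation">United States
<span><a class="more_loc" href="#">+1 location</a></span>
<span class="remote-bullet">•</span><span>Remote</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.select("div.companyLocation")

# Exactly one text child: .string is that text.
single_text = divs[0].string
# Exactly one tag child, which has one text child: .string looks inside it.
nested_text = divs[1].string
# Text plus several spans: .string is ambiguous, so it is None.
mixed = divs[2].string

print(repr(single_text), repr(nested_text), repr(mixed))
```

So the None case is exactly the one where the div mixes its own text with extra span or a elements.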

2 answers

Does this work?

    #problem(below)
    if a.string is None:
        data=''
        for child in a.children:
            if not child.name and child != '':
                data+=child
        print(data)
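
The same idea (keep only the text nodes that are direct children of the div) can also be written with find_all. This is a minimal sketch on made-up sample markup based on the question, with the href shortened to "#"; it additionally strips whitespace, which the loop above does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the question; not fetched from any site.
html = """<div class="companyLocation">United States
<span><a class="more_loc" href="#">
+1 location</a></span>

<span class="remote-bullet">•</span><span>Remote</span></div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div.companyLocation")

# string=True matches text nodes; recursive=False restricts the search
# to direct children, so text inside the nested span/a tags is ignored.
direct_strings = div.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and join whatever is left.
location = " ".join(s.strip() for s in direct_strings if s.strip())
print(location)
```

Text inside the nested elements, such as "+1 location" or "Remote", never reaches the result because those nodes are not direct children of the div.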

You mixed up your if statements. Try the following:

import requests
from bs4 import BeautifulSoup

url = "https://www.indeed.com/jobs?q=python&limit=50"

extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})

for soup_job in soup_jobs:
    for a in soup_job.select("div.companyLocation"):
        if a.string is not None:
            print(f"{a}, {a.text}")

Output:

<div class="companyLocation">United States</div>, United States
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">Boulder, CO</div>, Boulder, CO
<div class="companyLocation">Houston, TX</div>, Houston, TX
<div class="companyLocation">Allen, TX</div>, Allen, TX
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">New York State</div>, New York State
<div class="companyLocation">Austin, TX</div>, Austin, TX
<div class="companyLocation">Research Triangle Park, NC</div>, Research Triangle Park, NC
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">Cary, NC</div>, Cary, NC
<div class="companyLocation">Raleigh, NC</div>, Raleigh, NC
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">Houston, TX</div>, Houston, TX
<div class="companyLocation">Bellevue, WA</div>, Bellevue, WA
<div class="companyLocation">Houston, TX</div>, Houston, TX

It works fine now.
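
Note that the filter above prints only the divs where a.string is not None, so a div that mixes its own text with extra span or a elements is skipped rather than cleaned. If those rows should be kept as well, one possible approach is a small helper that prefers the div's own text and falls back to the nested text when it has none. This is a sketch under assumptions: location_text is a hypothetical name, and the sample markup is made up from the snippets in this thread.

```python
from bs4 import BeautifulSoup


def location_text(div):
    """Return the div's own text, or all nested text if it has none (hypothetical helper)."""
    # Text nodes that are direct children of the div, minus whitespace-only ones.
    own = [s.strip() for s in div.find_all(string=True, recursive=False) if s.strip()]
    if own:
        return " ".join(own)
    # No text of its own (e.g. <div><span>Remote</span></div>): use the nested text.
    return div.get_text(" ", strip=True)


# Hypothetical sample markup modelled on this thread; not fetched from any site.
html = """
<div class="companyLocation">Houston, TX</div>
<div class="companyLocation"><span>Remote</span></div>
<div class="companyLocation">United States
<span><a class="more_loc" href="#">+1 location</a></span>
<span class="remote-bullet">•</span><span>Remote</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
locations = [location_text(d) for d in soup.select("div.companyLocation")]
print(locations)
```

Whether a mixed row should read "United States" alone or also mention "Remote" is a design choice; this sketch keeps only the div's own text in that case.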
