Extracting HTML data with BeautifulSoup does not work

Posted 2024-05-14 08:25:56


I want to retrieve the data from every row on this website: https://www.dibbs.bsm.dla.mil/Awards/AwdRecs.aspx?Category=awddt&TypeSrch=cq&Value=02-06-2018 Here is sample HTML for two of the rows:

  <tr class="BgWhite" style="border-color:Gray;border-width:1px;border-style:Solid;">

<td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" title="Link To Award/Basic Document" target="DIBBSDocuments"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" alt="PDF Document" width="16" height="16" hspace="2" border="0"></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" title="Link To Award/Basic Document" target="DIBBSDocuments">SP450017D0005</a></span>
</td>

<td align="center" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl43_lblCage"><a href="javascript:void(0);" onclick="return openNewWindow(&quot;https://www.dibbs.bsm.dla.mil/Refs/cage.aspx?Cage=0ZE15&quot;, &quot;CAGE&quot;, 475, 300)" title="Click to perform a CAGE Search">0ZE15</a></span>
</td>
<td align="right" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl43_lblTotalContactPrice">                   $2,341.94</span>
</td>

 </tr>

 <tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">


<td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrder" style="display:inline-block;width:175px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" alt="PDF Document" width="16" height="16" hspace="2" border="0"></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments">SP450018F2293</a> <br><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16" height="16" hspace="1" border="0" alt="-spacer-"><span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0005&amp;dlv=SP450018F2293&amp;cnt=108" title="Delivery Order Package View" target="DIBBS">Delivery Order Package View</a></span></span>
</td>
<td align="right" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrderCounter" style="display:inline-block;width:50px;">108</span>
</td>

<td align="right" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl44_lblTotalContactPrice">                   $2,341.94</span>
</td>

I want to extract the award IDs SP450017D0005 and SP450018F2293 from the HTML, so I tried this:

dibbssoup = BeautifulSoup(main_page.content, 'html5lib')

containers1 = dibbssoup.find_all("tr", {"class": "BgWhite"})
containers2 = dibbssoup.find_all("tr", {"class": "BgSilver"})

containers = containers1 + containers2

for container1 in containers:


    for page in range(row)[3:]:
        containerid = "ctl00_cph1_grdAwardSearch_ctl"+str(page)+"_lblAwardBasicNumber"

        awardid = container1.find("td", {"align": "left"}).find("span", {"id":containerid})

        print(page)
        print(containerid)
        print(awardid)
        print(" ")

The page counter increments as expected and containerid is correct, but awardid is printed as "None". What am I doing wrong, and how can I fix it?


1 answer
User
Answer 1 · Posted 2024-05-14 08:25:56

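I don't see any major flaw in your code at the moment. When working with nested HTML tags like these, it is often useful to split the chained find calls apart and print the result of each one. While debugging, you can then see clearly which find call fails. Once the problem is solved, you can recombine the calls and clean up the code.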
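As a minimal sketch of that debugging approach, the snippet below splits the chain into separate steps and reports which one fails. The `html` string is a pared-down stand-in for one row of the page, and `wrong_id` is a deliberately mismatched id chosen for illustration; neither comes from the real site.

```python
from bs4 import BeautifulSoup

# Pared-down stand-in for one row of the page (assumption: real rows have more cells).
html = """
<table>
<tr class="BgWhite">
  <td align="left"><span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber"><a href="#">SP450017D0005</a></span></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: is the row found at all?
row = soup.find("tr", {"class": "BgWhite"})
print("row found:", row is not None)

# Step 2: is the left-aligned cell found inside that row?
cell = row.find("td", {"align": "left"}) if row is not None else None
print("cell found:", cell is not None)

# Step 3: is the span found? The counter here (ctl03) is deliberately wrong,
# so this is the step that reports a failure.
wrong_id = "ctl00_cph1_grdAwardSearch_ctl03_lblAwardBasicNumber"
span = cell.find("span", {"id": wrong_id}) if cell is not None else None
print("span found:", span is not None)
```

Here the first two steps succeed and only the last one fails, which points at the id rather than at the row or cell lookup.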

To get rid of the page and containerid variables, you can pass a function as the argument to find, like this:

# Match any <span> whose id ends with "_lblAwardBasicNumber", whatever the ctlNN counter is
def basic_number_filter(tag):
    return tag.name == "span" and tag.attrs.get("id", "").endswith("_lblAwardBasicNumber")

# soup is the parsed page (dibbssoup in the question)
containers = soup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})

for container in containers:
    awardid = container.find("td", align="left").find(basic_number_filter)
    print(awardid)

You can find more information here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function

When I run this code on the sample HTML you provided, I get:

<span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0005</a></span>
None

The second awardid is None because

<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
    <td align="left" valign="top">
        <span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrder" style="display:inline-block;width:175px;">
            <a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" alt="PDF Document" width="16" height="16" hspace="2" border="0"></a>
            <a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments">SP450018F2293</a>
            <br><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16" height="16" hspace="1" border="0" alt="-spacer-">
            <span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0005&amp;dlv=SP450018F2293&amp;cnt=108" title="Delivery Order Package View" target="DIBBS">Delivery Order Package View</a></span>
        </span>
    </td>
    <td align="right" valign="top">
        <span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrderCounter" style="display:inline-block;width:50px;">108</span>
    </td>

    <td align="right" valign="top">
        <span id="ctl00_cph1_grdAwardSearch_ctl44_lblTotalContactPrice">                   $2,341.94</span>
    </td>
</tr>

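does not contain a span with an id like ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber. The span in its left-aligned cell has an id ending in _lblDeliveryOrder instead.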
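One possible way to pick up both award IDs is to accept either id suffix and read the text of the first non-empty link directly inside the matching span. This is a sketch under assumptions, not a tested scraper for the live site: the `html` string is a simplified stand-in for the two sample rows, and `ID_SUFFIXES`, `award_span_filter`, and `award_ids` are names introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the two sample rows (assumption: real rows have more cells).
html = """
<table>
<tr class="BgWhite">
  <td align="left"><span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber"><a href="#"><img alt="PDF Document"/></a><a href="#">SP450017D0005</a></span></td>
</tr>
<tr class="BgSilver">
  <td align="left"><span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrder"><a href="#"><img alt="PDF Document"/></a><a href="#">SP450018F2293</a> <span><a href="#">Delivery Order Package View</a></span></span></td>
</tr>
</table>
"""

# Both id suffixes seen in the sample HTML
ID_SUFFIXES = ("_lblAwardBasicNumber", "_lblDeliveryOrder")


def award_span_filter(tag):
    """Match a <span> whose id ends with one of the known suffixes."""
    return tag.name == "span" and tag.attrs.get("id", "").endswith(ID_SUFFIXES)


soup = BeautifulSoup(html, "html.parser")
award_ids = []

for row in soup.find_all("tr", {"class": ["BgWhite", "BgSilver"]}):
    span = row.find(award_span_filter)
    if span is None:
        continue  # row has neither kind of span
    # Look only at links that are direct children of the span, so the nested
    # "Delivery Order Package View" link is ignored; the icon link has no text.
    for link in span.find_all("a", recursive=False):
        text = link.get_text(strip=True)
        if text:
            award_ids.append(text)
            break

print(award_ids)  # prints ['SP450017D0005', 'SP450018F2293']
```

Passing `recursive=False` restricts the search to direct children, which is what keeps the link inside the nested span out of the result.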
