How do I scrape nested tables with BeautifulSoup?

Posted 2024-05-14 08:02:20


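Hi everyone, and thanks for your help. I am stuck scraping nested tables. I can scrape the main table, but when a table row contains other tables, I really don't know how to proceed. The HTML table looks like this: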

<tr class="table">
                 <td class="table" valign="top">
                    <p class="tbl-cod">0403</p>
                 </td>
                 <td class="table" valign="top">
                    <p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
                 </td>
                 <td class="table" valign="top">
                    <p class="tbl-txt">Manufacture in which:</p>
                    <table width="100%" cellspacing="0" cellpadding="0" border="0">
                       <colgroup><col width="4%">
                       <col width="96%">
                       </colgroup><tbody>
                          <tr>
                             <td valign="top">
                                <p class="normal">—</p>
                             </td>
                             <td valign="top">
                                <p class="normal">all the materials of Chapter&nbsp;4 used are wholly obtained,</p>
                             </td>
                          </tr>
                       </tbody>
                    </table>
                    <table width="100%" cellspacing="0" cellpadding="0" border="0">
                       <colgroup><col width="4%">
                       <col width="96%">
                       </colgroup><tbody>
                          <tr>
                             <td valign="top">
                                <p class="normal">—</p>
                             </td>
                             <td valign="top">
                                <p class="normal">all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating,</p>
                                <p class="normal">and</p>
                             </td>
                          </tr>
                       </tbody>
                    </table>
                    <table width="100%" cellspacing="0" cellpadding="0" border="0">
                       <colgroup><col width="4%">
                       <col width="96%">
                       </colgroup><tbody>
                          <tr>
                             <td valign="top">
                                <p class="normal">—</p>
                             </td>
                             <td valign="top">
                                <p class="normal">the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
                             </td>
                          </tr>
                       </tbody>
                    </table>
                 </td>
                 <td class="table" valign="top">
                    <p class="normal">&nbsp;</p>
                 </td>
              </tr>

I scraped the main table with the following code:

with open('algeriaroo.txt', 'w') as algroo:
    for row in RoOtbody.find_all('tr'):
        for cell in row.find_all('td'):
            algroo.write(cell.text.strip())
        algroo.write('\n')

So far, this is the output I get:

0403Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoaManufacture in which:






—


all the materials of Chapter 4 used are wholly obtained,










—


all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,
and










—


the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,
and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product
—all the materials of Chapter 4 used are wholly obtained,
—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,
and
—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product

I would like to get something like this instead:

0403Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoaManufacture in which: — all the materials of Chapter 4 used are wholly obtained, — all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating, and — the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product

Thanks in advance for your help.


2 answers

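Just a suggestion: you could put the logic that extracts data from a table into a function. For each td, check whether it contains a nested table tag and, if it does, call the same function again on that table. The only open question is the return value; one option is to build a dict, return it to the calling function, and process it there. This would work for any depth of nested tables.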
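The suggestion above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `extract_table`, the dict keys `text` and `tables`, and the shortened sample HTML are all made up for this example, and nested tables are assumed to be direct children of their `td`, as in the question's HTML.

```python
from bs4 import BeautifulSoup, Tag


def extract_table(table):
    """Return a list of rows; each row is a list of cell dicts.

    Each cell dict holds the cell's own text ("text") and a list of
    nested tables ("tables"), each produced by calling this function again.
    The function name and dict layout are hypothetical, chosen for this sketch.
    """
    rows = []
    # Rows may sit directly under <table> or inside a <tbody>.
    containers = [table] + table.find_all("tbody", recursive=False)
    for container in containers:
        # recursive=False keeps us on this table's own rows only.
        for tr in container.find_all("tr", recursive=False):
            row = []
            for td in tr.find_all(["td", "th"], recursive=False):
                parts, nested = [], []
                for child in td.children:
                    if isinstance(child, Tag):
                        if child.name == "table":
                            # Same function, one level deeper.
                            nested.append(extract_table(child))
                        else:
                            parts.append(child.get_text(" ", strip=True))
                    else:
                        text = str(child).strip()
                        if text:
                            parts.append(text)
                row.append({"text": " ".join(parts), "tables": nested})
            rows.append(row)
    return rows


# Shortened stand-in for the HTML in the question (an assumption for this sketch).
html_code = """
<table>
  <tr>
    <td><p>0403</p></td>
    <td>
      <p>Manufacture in which:</p>
      <table><tbody><tr><td><p>-</p></td><td><p>rule one</p></td></tr></tbody></table>
      <table><tbody><tr><td><p>-</p></td><td><p>rule two</p></td></tr></tbody></table>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html_code, "html.parser")
result = extract_table(soup.find("table"))
print(result[0][1]["text"])
print(len(result[0][1]["tables"]))
```

Keeping the nested tables as separate lists preserves the structure, so the caller can decide later whether to flatten them into one line or handle them differently.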

You are probably looking for the .get_text() method with the separator= parameter.

For example (html_code contains the HTML from your question):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_code, 'html.parser')
print(soup.select_one('tr.table').get_text(strip=True, separator=' '))

Prints:

0403 Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa Manufacture in which: — all the materials of Chapter 4 used are wholly obtained, — all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating, and — the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product
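To apply the same idea to a whole table with one output line per main row, a sketch along these lines might work. It assumes the main rows are direct children of the main table's tbody, so `find_all('tr', recursive=False)` skips the rows of nested tables, which is what caused the repeated text in the question's output. The helper name `rows_as_lines`, the variable `main_tbody` (standing in for `RoOtbody`), and the shortened sample HTML are assumptions for illustration. Replacing `\xa0` is optional; it turns the non-breaking spaces produced by `&nbsp;` into ordinary spaces.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (an assumption for this sketch).
html_code = """
<table><tbody>
<tr class="table">
  <td class="table"><p>0403</p></td>
  <td class="table"><p>Manufacture in which:</p>
    <table><tbody><tr><td><p>-</p></td><td><p>all materials of Chapter&nbsp;4</p></td></tr></tbody></table>
  </td>
</tr>
<tr class="table">
  <td class="table"><p>0404</p></td>
  <td class="table"><p>Another rule</p></td>
</tr>
</tbody></table>
"""


def rows_as_lines(tbody):
    """Return one text line per top-level row of the given tbody."""
    lines = []
    # recursive=False: only the main table's own rows, not nested-table rows.
    for row in tbody.find_all("tr", recursive=False):
        text = row.get_text(separator=" ", strip=True)
        # &nbsp; is parsed as U+00A0; normalise it to a plain space.
        lines.append(text.replace("\xa0", " "))
    return lines


soup = BeautifulSoup(html_code, "html.parser")
main_tbody = soup.find("tbody")  # plays the role of RoOtbody in the question
lines = rows_as_lines(main_tbody)
for line in lines:
    print(line)
```

Each entry in `lines` can then be written to a file followed by a newline, in the same way the question's loop writes to `algeriaroo.txt`.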
