Parsing with Beautiful Soup to get nodes at different levels

Asked 2025-04-18 05:38

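I want to get the Deli title (the station name) and then the two menu items listed under it: "Made to Order Deli Core" and "Turkey Chipotle Petite Wrap". I am using Beautiful Soup 4 for this, but I have not been able to get it to work. I have the same problem with the Entrée section.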

<html>
<head>
    <title></title>
</head>

<body>
    <table class="dayinner">
        <tr class="lun">
            <td class="mealname" colspan="3">LUNCH</td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Deli</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000010000047598_35356" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047598_35356');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Made to Order Deli Core</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000020000047933_06835" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047933_06835');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Turkey Chipotle Petite Wrap</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="height:3px;"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Entrée</td>

            <td class="menuitem">
                <div class="menuitem"><input class="chk" id=
                "S1L0000030000044794_08943" onclick="rptlist(this);"
                onmouseout="wschk(0);" onmouseover="wschk(1);" type="checkbox">
                <span class="ul" onclick="nf('0000044794_08943');" onmouseout=
                "pcls(this);" onmouseover="ws(this);">Steamed
                Corn</span><img alt="Vegan" class="icon" src=
                "images/g_062.gif"><img alt="Mindful Item" class="icon" src=
                "images/m_051.gif"></div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000040000033087_22244" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000033087_22244');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Cuban Mojo Roasted Pork Loin</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>
    </table>
</body>
</html>

Alternatively, it would be great if I could convert it into XML like this:

<counter name="Deli">
    <dish>
        <name>Made to Order Deli Core</name>
    </dish>
    <dish>
        <name>Turkey Chipotle Petite Wrap</name>
    </dish>
</counter>

Thank you very much in advance; I really appreciate you taking the time to help.

2 Answers

1

You can do it like this:

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup

# `html` is the page source shown in the question
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('td', class_='station').text.strip()

spans = soup.find_all('span', class_='ul')

# create the root of the XML file
root = ET.Element("counter")
root.set("name", title)

for item in spans:
    # retrieve the text inside the <td class="station"> of the same row
    text = item.find_parent('tr').find('td', class_='station').text.strip()
    # stop at the first row that belongs to the next station
    if text == u'Entrée':
        break

    dish = ET.SubElement(root, 'dish')
    name = ET.SubElement(dish, 'name')
    name.text = item.text.rstrip()

tree = ET.ElementTree(root)
tree.write("filename.xml")

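This is the content of the XML file you wanted (indented here for readability; tree.write itself does not add line breaks or indentation):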

<counter name="Deli">
    <dish>
        <name>Made to Order Deli Core</name>
    </dish> 
    <dish>
        <name>Turkey Chipotle Petite Wrap</name>
    </dish>
</counter>
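
If the file on disk should be indented like the listing above, one option is to pass the serialized tree through xml.dom.minidom from the standard library. The following is only a minimal sketch: the element names and dish titles are copied from the question, and the four-space indent is an assumption. On Python 3.9 and later, xml.etree.ElementTree.indent is an alternative.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Rebuild the small <counter> tree from the question by hand.
root = ET.Element("counter", {"name": "Deli"})
for title in ("Made to Order Deli Core", "Turkey Chipotle Petite Wrap"):
    dish = ET.SubElement(root, "dish")
    ET.SubElement(dish, "name").text = title

# ElementTree serializes without any added whitespace.
compact = ET.tostring(root, encoding="unicode")

# minidom re-parses the string and adds line breaks plus indentation.
pretty = minidom.parseString(compact).toprettyxml(indent="    ")
print(pretty)
```

Note that toprettyxml also prepends an XML declaration line to its output.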

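Be sure to put the line # -*- coding: utf-8 -*- at the top of your file. On Python 2 this avoids errors caused by accented characters such as the é in Entrée (Python 3 already reads source files as UTF-8 by default). For more details, see: SyntaxError: Non-ASCII character '\xa3' in file when function returns '£'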
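
The question also asks about the Entrée rows. A possible generalization, sketched below, walks the table row by row and starts a new counter whenever the station cell is non-empty, so no station name has to be hard-coded. The function name menu_to_xml, the wrapper element menu, and the shortened SAMPLE markup are all hypothetical and only serve the illustration; the class names (lun, station, ul) come from the HTML in the question.

```python
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (hypothetical sample).
SAMPLE = """
<table class="dayinner">
  <tr class="lun"><td class="mealname" colspan="3">LUNCH</td></tr>
  <tr class="lun"><td class="station">&nbsp;Deli</td>
    <td class="menuitem"><span class="ul">Made to Order Deli Core</span></td></tr>
  <tr class="lun"><td class="station">&nbsp;</td>
    <td class="menuitem"><span class="ul">Turkey Chipotle Petite Wrap</span></td></tr>
  <tr class="lun"><td class="station">&nbsp;Entrée</td>
    <td class="menuitem"><span class="ul">Steamed
      Corn</span></td></tr>
</table>
"""


def menu_to_xml(html):
    """Group every dish under a <counter> named after its station."""
    soup = BeautifulSoup(html, "html.parser")
    menu = ET.Element("menu")  # hypothetical wrapper, not in the question
    counter = None
    for row in soup.find_all("tr", class_="lun"):
        station = row.find("td", class_="station")
        span = row.find("span", class_="ul")
        if station is None or span is None:
            continue  # header and separator rows carry no dish
        label = station.get_text(strip=True)
        if label:
            # a non-empty station cell starts a new counter
            counter = ET.SubElement(menu, "counter", {"name": label})
        if counter is None:
            continue  # dish row seen before any named station
        dish = ET.SubElement(counter, "dish")
        # collapse line breaks and repeated spaces inside the span text
        ET.SubElement(dish, "name").text = " ".join(span.get_text().split())
    return menu


menu = menu_to_xml(SAMPLE)
ET.dump(menu)
```

Rows whose station cell holds only a non-breaking space are treated as belonging to the most recent named station, which matches how the table in the question is laid out.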

1

I have actually used Beautiful Soup together with ElementTree (for working with XML). With these two tools I can collect everything inside the <span> tags.

# -*- coding: UTF-8 -*-

from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

html='''<html>
<head>
    <title></title>
</head>

<body>
    <table class="dayinner">
        <tr class="lun">
            <td class="mealname" colspan="3">LUNCH</td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Deli</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000010000047598_35356" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047598_35356');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Made to Order Deli Core</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000020000047933_06835" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047933_06835');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Turkey Chipotle Petite Wrap</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="height:3px;"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Entrée</td>

            <td class="menuitem">
                <div class="menuitem"><input class="chk" id=
                "S1L0000030000044794_08943" onclick="rptlist(this);"
                onmouseout="wschk(0);" onmouseover="wschk(1);" type="checkbox">
                <span class="ul" onclick="nf('0000044794_08943');" onmouseout=
                "pcls(this);" onmouseover="ws(this);">Steamed
                Corn</span><img alt="Vegan" class="icon" src=
                "images/g_062.gif"><img alt="Mindful Item" class="icon" src=
                "images/m_051.gif"></div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000040000033087_22244" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000033087_22244');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Cuban Mojo Roasted Pork Loin</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>
    </table>
</body>
</html> '''

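soup = BeautifulSoup(html, 'html.parser')

counter = ET.Element('counter')
counter.set("name", "Deli")

# add one <dish><name>...</name></dish> per <span> in the page
# (this collects every span, including the ones in the Entrée rows)
for i in soup.find_all('span'):
    dish = ET.SubElement(counter, 'dish')
    name = ET.SubElement(dish, 'name')
    # collapse line breaks and runs of spaces inside the span text
    name.text = ' '.join(i.text.split())

# ET.dump writes the XML to stdout itself and returns None
ET.dump(counter)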
