Using Beautiful Soup to parse and get nodes at different levels
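I want to get the title of the deli station and then, under that deli title, get its two menu items: Made to Order Deli Core and Turkey Chipotle Petite Wrap. I am using Beautiful Soup 4 for this, but with no luck. I have the same problem with the Entrée station.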
<html>
<head>
<title></title>
</head>
<body>
<table class="dayinner">
<tr class="lun">
<td class="mealname" colspan="3">LUNCH</td>
</tr>
<tr class="lun">
<td class="station"> Deli</td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000010000047598_35356" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox"> <span class="ul" onclick=
"nf('0000047598_35356');" onmouseout="pcls(this);"
onmouseover="ws(this);">Made to Order Deli Core</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000047933_06835" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox"> <span class="ul" onclick=
"nf('0000047933_06835');" onmouseout="pcls(this);"
onmouseover="ws(this);">Turkey Chipotle Petite Wrap</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td colspan="3" style="height:3px;"></td>
</tr>
<tr class="lun">
<td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
</tr>
<tr class="lun">
<td class="station"> Entrée</td>
<td class="menuitem">
<div class="menuitem"><input class="chk" id=
"S1L0000030000044794_08943" onclick="rptlist(this);"
onmouseout="wschk(0);" onmouseover="wschk(1);" type="checkbox">
<span class="ul" onclick="nf('0000044794_08943');" onmouseout=
"pcls(this);" onmouseover="ws(this);">Steamed
Corn</span><img alt="Vegan" class="icon" src=
"images/g_062.gif"><img alt="Mindful Item" class="icon" src=
"images/m_051.gif"></div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000040000033087_22244" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox"> <span class="ul" onclick=
"nf('0000033087_22244');" onmouseout="pcls(this);"
onmouseover="ws(this);">Cuban Mojo Roasted Pork Loin</span>
</div>
</td>
<td class="price"></td>
</tr>
</table>
</body>
</html>
Alternatively, it would be great if I could convert it to XML like this:
<counter name="Deli">
<dish>
<name>Made to Order Deli Core</name>
</dish>
<dish>
<name>Turkey Chipotle Petite Wrap</name>
</dish>
</counter>
Thank you very much in advance for your help; I really appreciate you taking the time.
2 Answers
1
You can do it like this:
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup

# `html` is the markup from the question, held in a string
soup = BeautifulSoup(html, 'html.parser')
# the first station cell holds the counter title ("Deli")
title = soup.find('td', class_='station').text.strip()
spans = soup.find_all('span', class_='ul')

# create the root of the XML file
root = ET.Element("counter")
root.set("name", title)

for item in spans:
    # retrieve the text inside the <td class="station"> of this span's row
    text = item.find_parent('tr').find('td', class_='station').text.strip()
    # stop as soon as the next station begins
    if text == u'Entrée':
        break
    dish = ET.SubElement(root, 'dish')
    name = ET.SubElement(dish, 'name')
    name.text = item.text.rstrip()

tree = ET.ElementTree(root)
tree.write("filename.xml")
This gives you the XML file you wanted (indented here for readability; ElementTree writes it on a single line):
<counter name="Deli">
<dish>
<name>Made to Order Deli Core</name>
</dish>
<dish>
<name>Turkey Chipotle Petite Wrap</name>
</dish>
</counter>
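On Python 2, be sure to put the line # -*- coding: utf-8 -*- at the very top of your file; it avoids errors with accented characters such as the é in Entrée. (Python 3 already reads source files as UTF-8 by default.) For more, see this question: SyntaxError: Non-ASCII character '\xa3' in file when function returns '£'.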
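The snippet above stops at the first station. Since the question also asks about the Entrée, here is a minimal sketch of one way to collect every station in a single pass. It assumes, as in the posted table, that a blank station cell means "same station as the row above". The function name group_by_station and the trimmed SAMPLE_HTML are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table in the question (illustrative only).
SAMPLE_HTML = """
<table class="dayinner">
<tr class="lun"><td class="mealname" colspan="3">LUNCH</td></tr>
<tr class="lun"><td class="station"> Deli</td>
<td class="menuitem"><span class="ul">Made to Order Deli Core</span></td></tr>
<tr class="lun"><td class="station"> </td>
<td class="menuitem"><span class="ul">Turkey Chipotle Petite Wrap</span></td></tr>
<tr class="lun"><td colspan="3"></td></tr>
<tr class="lun"><td class="station"> Entrée</td>
<td class="menuitem"><span class="ul">Steamed
Corn</span></td></tr>
<tr class="lun"><td class="station"> </td>
<td class="menuitem"><span class="ul">Cuban Mojo Roasted Pork Loin</span></td></tr>
</table>
"""


def group_by_station(html):
    """Return {station name: [dish names]} in document order."""
    soup = BeautifulSoup(html, "html.parser")
    stations = {}
    current = None  # most recent non-empty station label
    for row in soup.find_all("tr", class_="lun"):
        station_cell = row.find("td", class_="station")
        dish = row.find("span", class_="ul")
        if station_cell is None or dish is None:
            continue  # meal header or spacer row
        label = station_cell.get_text(strip=True)
        if label:
            current = label  # a new station starts on this row
        if current is None:
            continue  # dish row before any station label; skip it
        # Collapse line breaks inside names such as "Steamed\nCorn".
        name = " ".join(dish.get_text().split())
        stations.setdefault(current, []).append(name)
    return stations


menu = group_by_station(SAMPLE_HTML)
for station, dishes in menu.items():
    print(station, dishes)
```

Walking the rows rather than the spans keeps each dish next to its own station cell, so there is no need to climb parents and siblings by index.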
1
I have actually used Beautiful Soup together with ElementTree (for the XML side). Between them, these two tools let me get the contents of every <span> tag.
# -*- coding: UTF-8 -*-
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup

html='''<html>
<head>
<title></title>
</head>
<body>
<table class="dayinner">
<tr class="lun">
<td class="mealname" colspan="3">LUNCH</td>
</tr>
<tr class="lun">
<td class="station"> Deli</td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000010000047598_35356" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox"> <span class="ul" onclick=
"nf('0000047598_35356');" onmouseout="pcls(this);"
onmouseover="ws(this);">Made to Order Deli Core</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000047933_06835" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox"> <span class="ul" onclick=
"nf('0000047933_06835');" onmouseout="pcls(this);"
onmouseover="ws(this);">Turkey Chipotle Petite Wrap</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td colspan="3" style="height:3px;"></td>
</tr>
<tr class="lun">
<td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
</tr>
<tr class="lun">
<td class="station"> Entrée</td>
<td class="menuitem">
<div class="menuitem"><input class="chk" id=
"S1L0000030000044794_08943" onclick="rptlist(this);"
onmouseout="wschk(0);" onmouseover="wschk(1);" type="checkbox">
<span class="ul" onclick="nf('0000044794_08943');" onmouseout=
"pcls(this);" onmouseover="ws(this);">Steamed
Corn</span><img alt="Vegan" class="icon" src=
"images/g_062.gif"><img alt="Mindful Item" class="icon" src=
"images/m_051.gif"></div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000040000033087_22244" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox"> <span class="ul" onclick=
"nf('0000033087_22244');" onmouseout="pcls(this);"
onmouseover="ws(this);">Cuban Mojo Roasted Pork Loin</span>
</div>
</td>
<td class="price"></td>
</tr>
</table>
</body>
</html> '''
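
soup = BeautifulSoup(html, 'html.parser')
counter = ET.Element('counter')
counter.set("name", "Deli")
# every <span> in the table is a dish, including the Entrée items
for i in soup.find_all('span'):
    dish = ET.SubElement(counter, 'dish')
    name = ET.SubElement(dish, 'name')
    # some names are wrapped across lines in the HTML
    name.text = i.text.replace('\n', ' ')
# ET.dump writes the XML to stdout itself
ET.dump(counter)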
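ET.dump puts everything on one line and under a single counter. If you would rather have one <counter> per station, indented like the XML in the question, the following is a rough sketch. It assumes you already hold the dishes in a dict keyed by station name. The helper names menu_to_xml and pretty, the example_menu data, and the wrapping <menu> root element are all assumptions added for illustration (XML needs a single root, and the question does not name one).

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def menu_to_xml(menu):
    """Build <menu><counter name="..."><dish><name>...</name></dish>...</menu>."""
    root = ET.Element("menu")  # hypothetical wrapper; not in the question
    for station, dishes in menu.items():
        counter = ET.SubElement(root, "counter", name=station)
        for dish_name in dishes:
            dish = ET.SubElement(counter, "dish")
            ET.SubElement(dish, "name").text = dish_name
    return root


def pretty(element):
    """Serialize an Element as an indented string via minidom."""
    raw = ET.tostring(element, encoding="unicode")
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Hypothetical input, shaped like the data in the question.
example_menu = {
    "Deli": ["Made to Order Deli Core", "Turkey Chipotle Petite Wrap"],
    "Entrée": ["Steamed Corn", "Cuban Mojo Roasted Pork Loin"],
}

xml_text = pretty(menu_to_xml(example_menu))
print(xml_text)
```

minidom is used here only for indentation because it ships with every Python 3 release; on Python 3.9 or later, ET.indent is another option.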