Parsing HTML with regular expressions in Python

<tr> <td class="tableHeader">Section</td> <td class="odd">001</td> </tr>
<tr> <td class="tableHeader">Credits</td> <td class="even" align="left"> 4.00</td> </tr>
<tr> <td class="tableHeader">Title</td> <td class="odd">Linear Algebra</td> </tr>
<tr> <td class="tableHeader">Campus</td> <td class="even" align="left">University City</td> </tr>
<tr> <td class="tableHeader">Instructor(s)</td> <td class="odd">Guang Yang</td> </tr>
<tr> <td class="tableHeader">Instruction Type</td> <td class="even">Lecture</td> </tr>
<tr> <td class="tableHeader">Max Enroll</td> <td class="odd">30</td> </tr>

3 answers

User

#1 · edited 2024-04-23 20:02:01

DO NOT PARSE HTML USING REGEXP.

Use the right tool for the job.

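Here is an analogy for why it is wrong: it is like trying to get a 5 year old to understand Hamlet when he does not yet have the vocabulary and grammar to follow it. Once he is able to process more abstract concepts, he will understand.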
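To make the problem concrete, here is a small sketch using two rows from the table in the question. The pattern `naive` is a hypothetical example of the kind of regex one might write first; it is not taken from the question.

```python
import re

# Two rows copied from the question's table.
html = (
    '<tr> <td class="tableHeader">Credits</td> '
    '<td class="even" align="left"> 4.00</td> </tr> '
    '<tr> <td class="tableHeader">Instruction Type</td> '
    '<td class="even">Lecture</td> </tr>'
)

# Hypothetical first attempt: assumes class is the only attribute on the tag.
naive = re.compile(r'<td class="even">(.*?)</td>')

found = naive.findall(html)
print(found)  # ['Lecture']
```

The Credits cell is silently skipped because its tag carries an extra `align` attribute. A real HTML parser does not care how many attributes a tag has or in which order they appear.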

Use lxml or BeautifulSoup to do this.

For example, to get lists of all the odd and even cells:

>>> from lxml import etree
>>> tree = etree.HTML(your_html_text)
>>> odds = tree.xpath('//td[@class="odd"]/text()')
>>> evens = tree.xpath('//td[@class="even"]/text()')
>>> odds
['001', 'Linear Algebra', 'Guang  Yang', '30']
>>> evens
['  4.00', 'University City', 'Lecture']

Edit:

I am just trying to extract the contents in such a way where I don't get the section number AND max enroll number. I just need help with getting only the Max Enroll number.

OK, now I get what you want. Here is a solution using lxml:

^{pr2}$

There, you get only the Max Enroll number.
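A minimal sketch of one way to do this with lxml, not necessarily the query this answer originally used: select the header cell by its text, then step to the next `td` sibling. The XPath expression and the shortened `your_html_text` sample (wrapped in a `<table>` element) are assumptions made for illustration.

```python
from lxml import etree

# Shortened sample based on the question's table (assumed wrapper <table>).
your_html_text = """<table>
<tr> <td class="tableHeader">Section</td> <td class="odd">001</td> </tr>
<tr> <td class="tableHeader">Max Enroll</td> <td class="odd">30</td> </tr>
</table>"""

tree = etree.HTML(your_html_text)

# Find the header cell whose text is "Max Enroll", then take the text
# of the first <td> that follows it in the same row.
max_enroll = tree.xpath(
    '//td[@class="tableHeader"][normalize-space(.)="Max Enroll"]'
    '/following-sibling::td[1]/text()'
)
print(max_enroll)  # ['30']
```

Matching on the header text rather than on the `odd`/`even` class keeps the Section number out of the result, since both cells share the class `odd`.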

It is even easier with BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup(your_html_text, 'html.parser')
>>> for t in bs.find_all('td', attrs={'class': 'tableHeader'}):
...   if t.text == "Max Enroll":
...     print(t.find_next('td').text)
...
30

User

#2 · edited 2024-04-23 20:02:01

An alternative to zmo's answer, using BeautifulSoup:

from bs4 import BeautifulSoup

data = """
<snipped html>
"""

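# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, "html.parser")

for tableHeaders in soup.find_all('td', class_="tableHeader"):
    if tableHeaders.get_text() == "Max Enroll":
        # The value sits in the next sibling cell of the same row.
        print(tableHeaders.find_next_siblings('td', class_="odd")[0].get_text())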

Output:

30

User

#3 · edited 2024-04-23 20:02:01

Use a tool built specifically for parsing HTML, such as BeautifulSoup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

For example, here is how to get what you want:

from bs4 import BeautifulSoup

data = """your html here"""

soup = BeautifulSoup(data, "html.parser")
# Locate the header cell by its text, then read the cell next to it.
print(soup.find('td', string="Max Enroll").find_next_sibling('td').text)

Prints:

30
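If more than one field is needed, the same sibling navigation can build a lookup table of every header and its value. This is a sketch under assumptions: the `fields` dictionary and the shortened `data` sample (wrapped in a `<table>` element) are illustrative and not part of the answers above.

```python
from bs4 import BeautifulSoup

# Shortened sample based on the question's table (assumed wrapper <table>).
data = """<table>
<tr> <td class="tableHeader">Section</td> <td class="odd">001</td> </tr>
<tr> <td class="tableHeader">Credits</td> <td class="even" align="left"> 4.00</td> </tr>
<tr> <td class="tableHeader">Max Enroll</td> <td class="odd">30</td> </tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")

# Map each header cell's text to the text of the cell beside it.
fields = {}
for header in soup.find_all("td", class_="tableHeader"):
    value = header.find_next_sibling("td")
    if value is not None:
        fields[header.get_text(strip=True)] = value.get_text(strip=True)

print(fields["Max Enroll"])  # 30
```

Passing `strip=True` removes the stray whitespace around values such as the Credits cell, so the keys and values can be compared directly.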
