Parsing HTML with regular expressions in Python

Posted 2024-04-23 20:02:01


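I am trying to go through a website's HTML and parse it to find the maximum enrollment for a class. I first tried checking each line of the HTML file for a substring, but that ended up picking the wrong lines, so I switched to a regular expression. My current pattern is \t\t\t\t\t\t\t<td class="odd">([0-9])|([0-9][0-9])|([0-9][0-9][0-9])<\/td>\r\n, but it matches the section number as well as the max enrollment. Is there another way to extract this value from the page? A snippet of the HTML follows: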

<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>

<tr>
    <td class="tableHeader">Credits</td>
    <td class="even" align="left">  4.00</td>
</tr>

<tr>
<td class="tableHeader">Title</td>
<td class="odd">Linear Algebra</td>
</tr>

<tr>
    <td class="tableHeader">Campus</td>
    <td class="even" align="left">University City</td>
</tr>

<tr>
    <td class="tableHeader">Instructor(s)</td>
    <td class="odd">Guang  Yang</td>
</tr>
<tr>
    <td class="tableHeader">Instruction Type</td>
    <td class="even">Lecture</td>
</tr>

<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>

3 Answers

DO NOT PARSE HTML USING REGEXP.

Use the right tool for the job.

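Here is an analogy for why it is wrong: it is like trying to get a 5 year old to understand Hamlet when he does not yet have the vocabulary and grammar for it. He will get there once he is able to process more abstract concepts.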
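To make the problem concrete, here is a small sketch of why a pattern keyed only on the cell's class cannot separate the two numbers. The pattern below is a simplified stand-in for the one in the question (an assumption, not the asker's exact regex), and the HTML is an abbreviated copy of the question's snippet.

```python
import re

# Abbreviated copy of two rows from the question's HTML.
html = """
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
"""

# Simplified stand-in pattern: one to three digits inside an "odd" cell.
pattern = re.compile(r'<td class="odd">(\d{1,3})</td>')

# The pattern knows nothing about which header precedes the cell,
# so the section number and the max enrollment both match.
matches = pattern.findall(html)
print(matches)
```

Nothing in the matched cell says which row it belongs to; that information lives in the neighbouring header cell, which is exactly what a tree-based parser lets you navigate.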

Use lxml or BeautifulSoup to do this.

For example, to get lists of all the "odd" and "even" cells:

>>> from lxml import etree
>>> tree = etree.HTML(your_html_text)
>>> odds = tree.xpath('//td[@class="odd"]/text()')
>>> evens = tree.xpath('//td[@class="even"]/text()')
>>> odds
['001', 'Linear Algebra', 'Guang  Yang', '30']
>>> evens
['  4.00', 'University City', 'Lecture']

Edit:

I am just trying to extract the contents in such a way where I don't get the section number AND max enroll number. I just need help with getting only the Max Enroll number.

OK, now I see what you want. Here is a solution using lxml:

^{pr2}$
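One way such an lxml lookup could be written, as a minimal sketch: select the header cell whose text is "Max Enroll", then take the first td that follows it in the same row. The XPath expression, the variable names, and the wrapping table element are assumptions for illustration, not necessarily what this answer originally used.

```python
from lxml import etree

# Abbreviated copy of the question's rows, wrapped in a <table> for well-formedness.
html = """
<table>
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
</table>
"""

tree = etree.HTML(html)

# Header cell labelled "Max Enroll" -> first following <td> sibling -> its text.
result = tree.xpath(
    '//td[@class="tableHeader"][normalize-space()="Max Enroll"]'
    '/following-sibling::td[1]/text()'
)

# xpath() returns a list; guard against the header being absent.
max_enroll = result[0].strip() if result else None
print(max_enroll)
```

Anchoring on the header text rather than on the class of the value cell is what keeps the section number out of the result.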

There, you get only the Max Enroll number.

It is even easier with BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup(your_html_text, 'html.parser')
>>> for t in bs.find_all('td', attrs={'class': 'tableHeader'}):
...     if t.text == "Max Enroll":
...         print(t.find_next('td').text)
...
30

An alternative to zmo's answer, using BeautifulSoup:

from bs4 import BeautifulSoup

data = """
<snipped html>
"""

soup = BeautifulSoup(data, "html.parser")

# Find the "Max Enroll" header cell, then read the "odd" cell next to it.
for tableHeaders in soup.find_all('td', class_="tableHeader"):
    if tableHeaders.get_text() == "Max Enroll":
        print(tableHeaders.find_next_siblings('td', class_="odd")[0].get_text())

Output:

30
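The same header-then-sibling lookup extends naturally to every row. The sketch below, assuming each row holds one header cell followed by one value cell as in the question's snippet, collects all fields into a dict; the name fields is purely illustrative.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the question's rows.
html = """
<table>
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each header label to the text of the <td> that follows it in the row.
fields = {}
for header in soup.find_all("td", class_="tableHeader"):
    value = header.find_next_sibling("td")
    if value is not None:
        fields[header.get_text(strip=True)] = value.get_text(strip=True)

print(fields["Max Enroll"])
```

With the dict in hand, the section number and the max enrollment are looked up by label, so they can no longer be confused.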

Use a tool built specifically for parsing HTML, such as BeautifulSoup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

For example, here is how to get what you want:

from bs4 import BeautifulSoup

data = """your html here"""

soup = BeautifulSoup(data, "html.parser")
print(soup.find('td', string="Max Enroll").find_next_sibling('td').text)

Prints:

30
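If adding a third-party dependency is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in technique rather than anything the answers here use, and the class name CellAfterHeader is hypothetical; it assumes, as in the question's snippet, that the value cell directly follows its header cell.

```python
from html.parser import HTMLParser


class CellAfterHeader(HTMLParser):
    """Captures the text of the <td> that follows a <td> with a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_td = False      # currently inside a <td>?
        self.buffer = []        # text pieces of the current cell
        self.take_next = False  # the label was just seen; grab the next cell
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_td:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self.in_td:
            return
        self.in_td = False
        text = "".join(self.buffer).strip()
        if self.take_next:
            self.value = text
            self.take_next = False
        elif self.value is None and text == self.label:
            self.take_next = True


html = """
<tr><td class="tableHeader">Section</td><td class="odd">001</td></tr>
<tr><td class="tableHeader">Max Enroll</td><td class="odd">30</td></tr>
"""

parser = CellAfterHeader("Max Enroll")
parser.feed(html)
print(parser.value)
```

This is noticeably more code than the BeautifulSoup one-liner, which is a fair illustration of what the library saves you.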
