Python: parsing specific information inside HTML tags

1 vote

2 answers

2508 views

Asked 2025-04-16 15:27

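I want to extract specific information from certain tags.

For example, on this page:

http://www.epicurious.com/articlesguides/bestof/toprecipes/bestchickenrecipes/recipes/food/views/My-Favorite-Simple-Roast-Chicken-231348

I want to extract some very specific information, such as the ingredients. If you view the page source, you will see that this information sits under the heading tag

<h2>Ingredients</h2>

while the actual ingredients are inside the tag

<ul class="ingredientsList">

I found a Python program online that conveniently extracts the hyperlinks from a web page, and I would like to modify it to extract the ingredients instead. I am not very familiar with Python, so how should I change my code to do this?

A detailed explanation or some examples would be much appreciated, since I do not know much about this.

Code: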

import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0
        self.starting_description = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1
                self.starting_description = 1

    def end_a(self):
        "Record the end of a hyperlink."

        self.inside_a_element = 0

    def handle_data(self, data):
        "Handle the textual 'data'."

        if self.inside_a_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

    def get_descriptions(self):
        "Return a list of descriptions."

        return self.descriptions

import urllib, sgmllib

# Get something to work with.
f = urllib.urlopen("http://www.epicurious.com/Roast-Chicken-231348")
s = f.read()

# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)

# Get the hyperlinks.
print myparser.get_hyperlinks()
print myparser.get_descriptions()


2 Answers

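Take a look at BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/. Your current approach works for simple cases, but as soon as the HTML or your requirements get slightly more complex, it will give you headaches.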
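To give an idea of what a parser-based extraction looks like, here is a minimal sketch. It does not use BeautifulSoup itself; it uses `html.parser` from the Python 3 standard library, which follows the same callback style as the `sgmllib` parser in the question. `SAMPLE_HTML`, `IngredientParser` and `extract_ingredients` are hypothetical names, and the sample markup is only an assumption modeled on the structure described in the question.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, modeled on the structure described in the question.
SAMPLE_HTML = """
<h2>Ingredients</h2>
<ul class="ingredientsList">
<li class="ingredient">One 2- to 3-pound farm-raised chicken</li>
<li class="ingredient">Kosher salt and freshly ground black pepper</li>
<LI class='ingredient'>Unsalted <b>butter</b></LI>
</ul>
<ul class="other"><li>not an ingredient</li></ul>
"""


class IngredientParser(HTMLParser):
    """Collect the text of each <li> inside <ul class="ingredientsList">."""

    def __init__(self):
        super().__init__()
        self.ingredients = []
        self.in_list = False  # True while inside the target <ul>
        self.in_item = False  # True while inside an <li> of that list

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "ul" and "ingredientsList" in classes:
            self.in_list = True
        elif tag == "li" and self.in_list:
            self.in_item = True
            self.ingredients.append("")

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_list = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current item.
        if self.in_item:
            self.ingredients[-1] += data


def extract_ingredients(html):
    """Return the ingredient texts with whitespace runs collapsed."""
    parser = IngredientParser()
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.ingredients]


print(extract_ingredients(SAMPLE_HTML))
```

Because the parser normalizes tag case and quote style, the `<LI class='ingredient'>` item is picked up without any special handling. The sketch ignores nested lists, which a real page might contain.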

Answered 2025-04-16 by Python大师


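I know someone will object that HTML text cannot be analysed with regular expressions.

Fine, I admit it, but I got the result within fifty minutes:

First, I used this code to get a convenient display of the page source: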

import urllib

url = ('http://www.epicurious.com/articlesguides/bestof/'
       'toprecipes/bestchickenrecipes/recipes/food/views/'
       'My-Favorite-Simple-Roast-Chicken-231348')


sock = urllib.urlopen(url)
ch = sock.read()
sock.close()


gen = (str(i)+' '+repr(line) for i,line in enumerate(ch.splitlines(1)))

print '\n'.join(gen)

After that, grabbing the content is much simpler:

import urllib
import re

url = ('http://www.epicurious.com/articlesguides/bestof/'
       'toprecipes/bestchickenrecipes/recipes/food/views/'
       'My-Favorite-Simple-Roast-Chicken-231348')

sock = urllib.urlopen(url)
ch = sock.read()
sock.close()

x = ch.find('ul class="ingredientsList">')

patingr = re.compile('<li class="ingredient">(.+?)</li>\n')

print patingr.findall(ch,x)
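The code above is Python 2 and needs the live page. For trying the same idea on Python 3 without network access, here is a minimal sketch. `SAMPLE_PAGE` is a hypothetical stand-in for the downloaded source, not the real page; the decoy item before the list is there only to show what the start position does.

```python
import re

# Hypothetical stand-in for the downloaded page source.
SAMPLE_PAGE = (
    '<ul class="nav">\n'
    '<li class="ingredient">decoy before the list</li>\n'
    '</ul>\n'
    '<h2>Ingredients</h2>\n'
    '<ul class="ingredientsList">\n'
    '<li class="ingredient">One 2- to 3-pound farm-raised chicken</li>\n'
    '<li class="ingredient">Unsalted butter</li>\n'
    '<li class="ingredient">Dijon mustard</li>\n'
    '</ul>\n'
)

# Locate the start of the ingredients list; str.find returns -1 if absent.
start = SAMPLE_PAGE.find('ul class="ingredientsList">')

pattern = re.compile(r'<li class="ingredient">(.+?)</li>\n')

# The second argument of a compiled pattern's findall() is the position
# at which the search starts, so matches before the list are skipped.
ingredients = pattern.findall(SAMPLE_PAGE, start) if start != -1 else []
print(ingredients)
```

The explicit `start != -1` check is a design choice of this sketch: without it, a page that lacks the list would be searched from the beginning and the decoy item would be returned.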

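EDIT

Achim,

About the '\n' in the pattern: that is my mistake, not a fault of the regex tool. I wrote the code too quickly.

You are right about uppercase letters: BeautifulSoup still finds the right string, while the regex fails. However, I have never seen element tags written in uppercase. Can you give me a link to such a page?

The same goes for ' versus ": I have never seen it, but you are right, it can happen.

However, when writing a regex, if uppercase letters or ' instead of " appear somewhere, the regex is simply written to match them: where is the problem?

Do you mean: what if the source changes? A site's source switching one day from lowercase to uppercase, or from " to ', is very unlikely; that is not realistic.

So, correcting my regex is easy: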

import urllib
import re

url = ('http://www.epicurious.com/articlesguides/bestof/'
       'toprecipes/bestchickenrecipes/recipes/food/views/'
       'My-Favorite-Simple-Roast-Chicken-231348')

sock = urllib.urlopen(url)
ch = sock.read()
sock.close()

#----------------------------------------------------------
patingr = re.compile('<li class="ingredient">(.+?)</li>\n')
print
print '\n'.join(repr(mat.group()) for mat in patingr.finditer(ch))


ch = ch.replace('<li class="ingredient">One 2- to 3-pound farm-raised chicken</li>',
                "<LI class='ingredient'>One 2- to 3-pound farm-raised \nchicken</li>")
print
print '\n'.join(repr(mat.group()) for mat in patingr.finditer(ch))


patingr = re.compile('<li class=["\']ingredient["\']>(.+?)</li>\n',re.DOTALL|re.IGNORECASE)
print
print '\n'.join(repr(mat.group()) for mat in patingr.finditer(ch))

Result

'<li class="ingredient">One 2- to 3-pound farm-raised chicken</li>\n'
'<li class="ingredient">Kosher salt and freshly ground black pepper</li>\n'
'<li class="ingredient">2 teaspoons minced thyme (optional)</li>\n'
'<li class="ingredient">Unsalted butter</li>\n'
'<li class="ingredient">Dijon mustard</li>\n'

'<li class="ingredient">Kosher salt and freshly ground black pepper</li>\n'
'<li class="ingredient">2 teaspoons minced thyme (optional)</li>\n'
'<li class="ingredient">Unsalted butter</li>\n'
'<li class="ingredient">Dijon mustard</li>\n'

"<LI class='ingredient'>One 2- to 3-pound farm-raised \nchicken</li>\n"
'<li class="ingredient">Kosher salt and freshly ground black pepper</li>\n'
'<li class="ingredient">2 teaspoons minced thyme (optional)</li>\n'
'<li class="ingredient">Unsalted butter</li>\n'
'<li class="ingredient">Dijon mustard</li>\n'

From now on, I will always add the re.IGNORECASE flag and use ["'] for the quotes inside tags.
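That habit can be wrapped in a small helper. This is only a sketch for Python 3: `li_pattern` is a hypothetical name, and unlike the pattern above it requires the closing quote to match the opening one and drops the trailing '\n'.

```python
import re


def li_pattern(css_class):
    """Build a case-insensitive pattern for <li class=...>text</li>.

    Hypothetical helper: accepts either quote style around the class
    value and lets the item text span several lines.
    """
    return re.compile(
        # (["\']) captures the opening quote, \1 requires the same one to close.
        r'<li\s+class=(["\'])' + re.escape(css_class) + r'\1\s*>(.+?)</li\s*>',
        re.DOTALL | re.IGNORECASE,
    )


sample = (
    "<LI class='ingredient'>One 2- to 3-pound farm-raised \nchicken</li>\n"
    '<li class="ingredient">Unsalted butter</li>'
)

# Group 1 is the quote character, group 2 is the item text.
items = [m.group(2) for m in li_pattern("ingredient").finditer(sample)]
print(items)
```

Note that re.IGNORECASE also makes the class name itself case-insensitive, which may or may not be wanted.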

Are there other "problems" that may occur? I would be interested to know.
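A few candidates can be checked directly. The sketch below (Python 3) runs the corrected pattern against some hypothetical markup variants; none of them is taken from the real page, they are assumptions about what a site could plausibly emit.

```python
import re

# The corrected pattern from above, written as a raw string.
PATTERN = re.compile(
    r'<li class=["\']ingredient["\']>(.+?)</li>\n', re.DOTALL | re.IGNORECASE
)

# Hypothetical variants of a single ingredient line.
cases = {
    "extra attribute": '<li id="i1" class="ingredient">Dijon mustard</li>\n',
    "second class name": '<li class="ingredient first">Dijon mustard</li>\n',
    "unquoted value": '<li class=ingredient>Dijon mustard</li>\n',
    "no newline after </li>": '<li class="ingredient">Dijon mustard</li></ul>',
}

# Variants the pattern fails to match.
missed = [name for name, html in cases.items() if not PATTERN.search(html)]
print(missed)

# Markup inside an HTML comment is matched like any other text.
commented = '<!-- <li class="ingredient">old item</li>\n -->'
false_hits = PATTERN.findall(commented)
print(false_hits)
```

Each miss can be patched in the pattern once it is known; the point of the sketch is only that every variant has to be anticipated by hand.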

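I am not saying that regexes must be used in every case and parsers never. I only think that regexes are very interesting when used under controlled and limited conditions, and that it would be a pity to ignore them.

By the way, you did not mention that regexes are much faster than BeautifulSoup. See the timing comparison between regexes and BeautifulSoup.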
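For anyone who wants to measure this themselves, here is a rough timing sketch for Python 3. It compares a regex with the standard library's `html.parser`, not with BeautifulSoup, so it says nothing about BeautifulSoup's actual speed. `PAGE`, `LiText`, `with_regex` and `with_parser` are hypothetical names, and the page is synthetic.

```python
import re
import timeit
from html.parser import HTMLParser

# Hypothetical synthetic page with 200 ingredient items.
PAGE = '<ul class="ingredientsList">\n' + ''.join(
    '<li class="ingredient">item %d</li>\n' % i for i in range(200)
) + '</ul>\n'

PATTERN = re.compile(r'<li class="ingredient">(.+?)</li>')


class LiText(HTMLParser):
    """Collect the text of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


def with_regex():
    return PATTERN.findall(PAGE)


def with_parser():
    parser = LiText()
    parser.feed(PAGE)
    parser.close()
    return parser.items


# Both approaches must agree before the timing means anything.
assert with_regex() == with_parser()

regex_seconds = timeit.timeit(with_regex, number=200)
parser_seconds = timeit.timeit(with_parser, number=200)
print("regex:  %.4f s" % regex_seconds)
print("parser: %.4f s" % parser_seconds)
```

The numbers depend on the machine, the Python version and the shape of the page, so they should be read as an order of magnitude at best.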

Answered 2025-04-16 by Python大师

