Python: using Beautiful Soup to extract specific content from HTML

Asked 2025-04-16 15:27 by 数据工程师

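I want to extract content from a website, for example this page: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx.

I'd like to pull the ingredients out into a text file. The ingredients are located in:

<div class="ingredients" style="margin-top: 10px;">

and within that section, each ingredient is wrapped in

<li class="plaincharacterwrap">

Someone was kind enough to provide code that uses a regex, but it gets confusing when you have to modify it from one site to another. So I'd like to use Beautiful Soup, since it has a lot of built-in functionality. I'm still a bit lost on how to actually do it, though.

Code: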

import sys
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"

try:
    # urlopen() is the call that can raise IOError (URLError subclasses it in Python 2)
    html = urllib2.urlopen(url)
except IOError:
    print 'IO error'
    sys.exit(1)

soup = BeautifulSoup(html)
# find() returns None when no matching div exists
ingrdiv = soup.find('div', attrs={'class': 'ingredients'})

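Is this roughly how you would start? I want to find the actual div class and then extract all the ingredients held in the li elements.

Any help would be greatly appreciated! Thanks!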


2 Answers

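Yes, each site needs its own specially written regex pattern.

But I think:

1- Code written with Beautiful Soup also has to be adapted to each site.

2- Regexes aren't that complicated to write, and with a little practice you can get quick at it.

I'm curious to see whether a Beautiful Soup solution gives the same result as the regex I wrote in a few minutes. I tried to learn Beautiful Soup in the past but couldn't make sense of that mess. I'll try again now that I'm a bit more familiar with Python. So far, though, regexes have been enough for me.

Here is the code I wrote for this new site: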

import urllib
import re

url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'

sock = urllib.urlopen(url)
ch = sock.read()
sock.close()

x = ch.find('Ingredients</h3>')

patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')

print '\n'.join(patingr.findall(ch,x))

Edit

I downloaded and installed BeautifulSoup and ran a comparison against the regex.

I don't think I made any mistake in my comparison code.

import urllib
import re
from time import clock
import BeautifulSoup

url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()


te = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
t1 = clock()-te

te = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
res2 = '\n'.join(ingreds)
t2 = clock()-te

print res1
print
print res2
print
print 'res1==res2 is ',res1==res2

print '\nRegex :',t1
print '\nBeautifulSoup :',t2
print '\nBeautifulSoup execution time / Regex execution time ==',t2/t1

Result

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste

res1==res2 is  True

Regex : 0.00210892725193

BeautifulSoup : 2.32453566026

BeautifulSoup execution time / Regex execution time == 1102.23605776

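Nothing more to say!

Edit 2

I realized that my code doesn't use a regex on its own; it uses a method that combines a regex with find().

This is the method I usually use with regexes, because in some cases it speeds up processing, since find() runs very fast.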
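As a small illustration of the find()-plus-regex approach described here, the sketch below runs on a made-up HTML fragment (the `SAMPLE` string and its "related recipe" item are assumptions, not the real allrecipes page) and is written for Python 3 rather than the Python 2 used in this answer. It shows that the optional start position passed to a compiled pattern's `findall()` skips everything before the offset that `str.find()` returned.

```python
import re

# Hypothetical fragment: the same <li> class appears before and after the heading.
SAMPLE = (
    '<ul><li class="plaincharacterwrap">\r\n    related recipe</li>\r\n</ul>\r\n'
    '<h3>Ingredients</h3>\r\n'
    '<ul>\r\n'
    '<li class="plaincharacterwrap">\r\n    1/4 cup olive oil</li>\r\n'
    '<li class="plaincharacterwrap">\r\n    1 cup chicken broth</li>\r\n'
    '</ul>\r\n'
)

# Same pattern as in the answer's code.
PATTERN = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')


def ingredients_after_heading(text):
    """Return the <li> texts found after the 'Ingredients</h3>' marker."""
    start = text.find('Ingredients</h3>')
    if start == -1:
        # Marker missing: there is nothing to anchor the search on.
        return []
    # The second argument is the position at which the search begins.
    return PATTERN.findall(text, start)


if __name__ == '__main__':
    print(PATTERN.findall(SAMPLE))            # searches the whole fragment
    print(ingredients_after_heading(SAMPLE))  # searches only after the heading
```

Whether the offset saves measurable time depends on how much text precedes the marker; on a short fragment like this one it mainly serves to exclude unwanted matches.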

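To know what we are comparing, we need the following code.

In codes 3 and 4, I followed Achim's suggestions from another thread: use re.IGNORECASE and re.DOTALL, and ["\'] instead of ".

The snippets are separate because they have to be run in different files to give reliable results: I don't know why, but when all the code runs in the same file, some of the timings differ considerably (for example 0.00075 versus 0.0022).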

import urllib
import re
import BeautifulSoup
from time import clock

url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()

# Simple regex , without x
te = clock()
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res0 = '\n'.join(patingr.findall(data))
t0 = clock()-te

print '\nSimple regex , without x :',t0

and

# Simple regex , with x
te = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
t1 = clock()-te

print '\nSimple regex , with x :',t1

and

# Regex with flags , without x and y
te = clock()
patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL|re.IGNORECASE)
res10 = '\n'.join(patingr.findall(data))
t10 = clock()-te

print '\nRegex with flags , without x and y :',t10

and

# Regex with flags , with x and y 
te = clock()
x = data.find('Ingredients</h3>')
y = data.find('h3>\r\n                    Footnotes</h3>\r\n')
patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL|re.IGNORECASE)
res11 = '\n'.join(patingr.findall(data,x,y))
t11 = clock()-te

print '\nRegex with flags , with x and y :',t11

and

# BeautifulSoup
te = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
res2 = '\n'.join(ingreds)
t2 = clock()-te

print '\nBeautifulSoup                      :',t2

Results

Simple regex , without x           : 0.00230488284125

Simple regex , with x              : 0.00229121279385

Regex with flags , without x and y : 0.00758719458758

Regex with flags , with x and y    : 0.00183724493364

BeautifulSoup                      : 2.58728860791

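Using x has no effect on the speed of the simple regex.

The regex with flags, without x and y, takes longer, and its result also differs from the others because it captures an extra chunk of text. So in practice, the regex with flags should be used together with x and y.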
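The following Python 3 sketch shows one way a more permissive regex can pick up extra text and how the x/y bounds remove it. The `SAMPLE` fragment, including its "Footnotes" item, is invented for illustration; the answer does not say what the extra text on the real page was, so this is only a plausible scenario.

```python
import re

# Hypothetical fragment: the footnote item uses upper case, single quotes
# and a line break inside its text.
SAMPLE = (
    '<h3>Ingredients</h3>\r\n'
    '<li class="plaincharacterwrap">\r\n    2 cloves garlic, minced</li>\r\n'
    '<li class="plaincharacterwrap">\r\n    1 tablespoon paprika</li>\r\n'
    '<h3>Footnotes</h3>\r\n'
    "<LI CLASS='plaincharacterwrap'>\r\n    Editor's note:\r\n    serves four</LI>\r\n"
)

SIMPLE = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
FLAGGED = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL | re.IGNORECASE)


def bounded(text):
    """Search only between the Ingredients and Footnotes headings."""
    x = text.find('Ingredients</h3>')
    y = text.find('Footnotes</h3>')
    if x == -1 or y == -1:
        raise ValueError('section markers not found')
    # findall(string, pos, endpos): matching is limited to text[pos:endpos].
    return FLAGGED.findall(text, x, y)


if __name__ == '__main__':
    print(SIMPLE.findall(SAMPLE))   # the strict pattern ignores the footnote item
    print(FLAGGED.findall(SAMPLE))  # the flagged pattern also captures it
    print(bounded(SAMPLE))          # the bounds exclude it again
```

In this fragment the flagged pattern matches the footnote because IGNORECASE accepts `<LI CLASS=...>`, the character class accepts single quotes, and DOTALL lets `.+?` run across the line break.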

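The more complex regex with flags, used with x and y, takes about 20% less time than the simple regex.

Overall, the results don't change much with or without x/y.

So my conclusion is the same:

a regex, with or without find(), is roughly 1000 times faster than BeautifulSoup,

and I estimate about 100 times faster than lxml (I don't have lxml installed).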

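As for what you said, Hugh, I would answer:

When a regex is wrong, it is neither faster nor slower. It simply doesn't work.

When a regex is wrong, the programmer fixes it, and that's all.

I don't understand why 95% of the people on stackoverflow.com want to persuade the other 5% that regexes must not be used to analyze HTML, XML, or anything else. I say "analyze", not "parse". As I understand it, a parser first analyzes the whole text and then shows the content of the elements we want. A regex, by contrast, goes straight to what we are searching for; it doesn't build a tree of the HTML/XML text the way a parser does, though I don't know much about that.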
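To make the contrast in the paragraph above concrete, here is a rough Python 3 sketch of the parser-style approach using only the standard library's `html.parser.HTMLParser`, swapped in for BeautifulSoup or lxml so that it runs without extra installs. The `IngredientParser` class, the `extract_ingredients` helper, and the `SAMPLE` fragment are illustrative assumptions. Note that this parser does not build a tree either: it walks through the whole document and reports tags as events, and the handlers track whether the current position is inside the target div.

```python
from html.parser import HTMLParser

# Hypothetical fragment mirroring the structure described in the question.
SAMPLE = '''
<div class="sidebar"><ul><li>not an ingredient</li></ul></div>
<div class="ingredients" style="margin-top: 10px;">
  <ul>
    <li class="plaincharacterwrap">
        1/4 cup olive oil</li>
    <li class="plaincharacterwrap">
        1 cup chicken broth</li>
  </ul>
</div>
'''


class IngredientParser(HTMLParser):
    """Collect the text of every <li> inside <div class="ingredients">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0     # > 0 while inside the target div (counts nested divs)
        self.in_li = False
        self.parts = []        # text chunks of the <li> being read
        self.ingredients = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            classes = (dict(attrs).get('class') or '').split()
            if self.div_depth or 'ingredients' in classes:
                self.div_depth += 1
        elif tag == 'li' and self.div_depth:
            self.in_li = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth:
            self.div_depth -= 1
        elif tag == 'li' and self.in_li:
            self.in_li = False
            self.ingredients.append(''.join(self.parts).strip())

    def handle_data(self, data):
        if self.in_li:
            self.parts.append(data)


def extract_ingredients(html_text):
    parser = IngredientParser()
    parser.feed(html_text)
    parser.close()
    return parser.ingredients


if __name__ == '__main__':
    print('\n'.join(extract_ingredients(SAMPLE)))
```

The sketch assumes every `<li>` has a closing tag; markup with unclosed items would need extra handling, which is part of what libraries such as BeautifulSoup take care of.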

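So I'm very happy with regexes. I have no problem writing long ones, and they let me get a program running quickly once I've analyzed the text. Beautiful Soup or lxml would work too, but that would be a hassle.

I have more to say, but no time to go deeper into this topic; in practice I just let other people do things their own way.

Answered 2025-04-16 by Python大师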


import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]

    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

The result is

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste

Follow-up reply to @eyquem:

from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html

start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"

# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"

# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s  - same =", (res2==res1)

# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s  - same =", (res3==res1)

which gives

Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s  - same = True
lxml parse took 0.0100940499505 s  - same = True

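The regex is much faster (unless it breaks); however, if you count loading the page and parsing it together, BeautifulSoup still accounts for only about 20% of the total time. If you care a great deal about speed, I recommend lxml.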
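The timings in this thread rely on `time.clock()`, which was removed in Python 3.8. As a hedged sketch of how a similar comparison might be set up in Python 3 with the standard `timeit` module: the page below is synthetic (`make_page` is a made-up helper), the regex is a simplified variant of the one above, and the parser side uses stdlib `html.parser` instead of BeautifulSoup or lxml, so the numbers it prints are not comparable to the ones quoted here.

```python
import re
import timeit
from html.parser import HTMLParser

PATTERN = re.compile(r'<li class="plaincharacterwrap">\s*(.+?)</li>')


def make_page(n_items=200):
    """Build a synthetic page with n_items <li> elements (hypothetical data)."""
    items = ''.join(
        '<li class="plaincharacterwrap">\r\n    item %d</li>\r\n' % i
        for i in range(n_items)
    )
    return '<html><body><div class="ingredients"><ul>%s</ul></div></body></html>' % items


class LiCollector(HTMLParser):
    """Collect the raw text of every <li> element."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.in_li = True
            self.items.append('')

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


def by_regex(page):
    return PATTERN.findall(page)


def by_parser(page):
    collector = LiCollector()
    collector.feed(page)
    collector.close()
    return [item.strip() for item in collector.items]


def best_time(func, page, number=20, repeat=5):
    """Best-of-`repeat` time, in seconds, for `number` calls of func(page)."""
    return min(timeit.repeat(lambda: func(page), number=number, repeat=repeat))


if __name__ == '__main__':
    page = make_page()
    assert by_regex(page) == by_parser(page)  # both approaches must agree first
    print('regex :', best_time(by_regex, page))
    print('parser:', best_time(by_parser, page))
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes, which may also help with the run-to-run variation mentioned in the other answer.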

Answered 2025-04-16 by Python大师
