Parsing HTML in Python with Beautiful Soup and filtering out stop words
I'm extracting specific information from a website into a file. The program I have now looks at a web page, finds the right HTML tag, and pulls out the correct content. Next, I'd like to filter these "results" further.
For example, on this site: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx
I'm extracting the ingredients found inside the <div class="ingredients"...> tag. The code does this job well, but I want to process the results further.
When I run the extractor, it strips out numbers, symbols, commas, and slashes (\ or /) but keeps all the words. Running it against the site, I get:
cup olive oil
cup chicken broth
cloves garlic minced
tablespoon paprika
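The digits and leading symbols disappear because str.strip removes any run of the listed characters from both ends of the string. A minimal, standalone sketch of that behavior (the sample line is my own illustration, not taken from the site):

```python
import string

line = '2 tablespoons olive oil'
# str.strip removes any of the given characters from BOTH ends of the
# string, but never touches characters in the middle
cleaned = line.strip(string.digits + string.punctuation + ' ')
print(cleaned)  # -> 'tablespoons olive oil'
```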
Now I'd like to process these results further and remove common filler words such as "cup", "cloves", "minced", "tablespoon", and so on. How can I do that? The code is written in Python, and I'm not very familiar with the language; so far I've only been using this program to fetch the information and then typing it out by hand, but I'd much rather automate the whole thing.
Detailed guidance would be greatly appreciated! My code is below; how should I go about this?
Code:
import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip('123456789.,/\ ') for s in ingreds.findAll('li')]
    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
1 Answer
import urllib2
import BeautifulSoup
import string

badwords = set([
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced'
])

def cleanIngred(s):
    # remove leading and trailing whitespace
    s = s.strip()
    # remove numbers and punctuation in the string
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if word not in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]
    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
The result is:
olive oil
chicken broth
garlic,
paprika
garlic powder
poultry seasoning
dried oregano
dried basil
thick cut boneless pork chops
salt and pepper to taste
I don't know why it keeps that comma, though; s.strip(string.punctuation) should have taken care of it.
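The comma survives because str.strip only removes characters from the two ends of the whole string; in "cloves garlic, minced" the comma sits in the middle, so after splitting, the token is "garlic," and the punctuation is never touched. One way to fix it is to strip punctuation from each word individually. A minimal Python 3 sketch (clean_ingred here is my own illustrative variant, not the answer's code):

```python
import string

badwords = {'cup', 'cups', 'clove', 'cloves', 'tsp', 'teaspoon', 'teaspoons',
            'tbsp', 'tablespoon', 'tablespoons', 'minced'}

def clean_ingred(s):
    # strip digits and punctuation from EACH word, so 'garlic,' becomes 'garlic'
    words = (word.strip(string.digits + string.punctuation)
             for word in s.strip().split())
    # drop words that became empty (e.g. a bare '3') and any unwanted words
    return ' '.join(w for w in words if w and w.lower() not in badwords)

print(clean_ingred('3 cloves garlic, minced'))  # -> 'garlic'
```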