Parsing HTML in Python with Beautiful Soup and filtering out stop words
I'm extracting specific information from a website into a file. The program I have now looks at a web page, finds the right HTML tag, and pulls out the correct content. Next, I'd like to filter these "results" further.
For example, on this site: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx
I'm extracting the ingredients found inside the <div class="ingredients"...> tag. The code does this job well, but I want to process the results further.
When I run the extractor, it strips out numbers, symbols, commas, and slashes (\ or /) but keeps all the words. Running it against the site, I get:
cup olive oil
cup chicken broth
cloves garlic minced
tablespoon paprika
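The digits and leading symbols disappear because str.strip removes any run of the listed characters from both ends of the string. A minimal, standalone sketch of that behavior (the sample line is my own illustration, not taken from the site):

```python
import string

line = '2 tablespoons olive oil'
# str.strip removes any of the given characters from BOTH ends of the
# string, but never touches characters in the middle
cleaned = line.strip(string.digits + string.punctuation + ' ')
print(cleaned)  # -> 'tablespoons olive oil'
```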
Now I'd like to process these results further and remove common filler words such as "cup", "cloves", "minced", "tablespoon", and so on. How can I do that? The code is written in Python, and I'm not very familiar with the language; so far I've only been using this program to fetch the information and then typing it out by hand, but I'd much rather automate the whole thing.
Detailed guidance would be greatly appreciated! My code is below; how should I go about this?
Code:
import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip('123456789.,/\ ') for s in ingreds.findAll('li')]
    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
1 Answer
import urllib2
import BeautifulSoup
import string

badwords = set([
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced'
])

def cleanIngred(s):
    # remove leading and trailing whitespace
    s = s.strip()
    # remove numbers and punctuation in the string
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if word not in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]
    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
The result is:
olive oil
chicken broth
garlic,
paprika
garlic powder
poultry seasoning
dried oregano
dried basil
thick cut boneless pork chops
salt and pepper to taste
I don't know why it keeps that comma, though; s.strip(string.punctuation) should have taken care of it.
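The comma survives because str.strip only removes characters from the two ends of the whole string; in "cloves garlic, minced" the comma sits in the middle, so after splitting, the token is "garlic," and the punctuation is never touched. One way to fix it is to strip punctuation from each word individually. A minimal Python 3 sketch (clean_ingred here is my own illustrative variant, not the answer's code):

```python
import string

badwords = {'cup', 'cups', 'clove', 'cloves', 'tsp', 'teaspoon', 'teaspoons',
            'tbsp', 'tablespoon', 'tablespoons', 'minced'}

def clean_ingred(s):
    # strip digits and punctuation from EACH word, so 'garlic,' becomes 'garlic'
    words = (word.strip(string.digits + string.punctuation)
             for word in s.strip().split())
    # drop words that became empty (e.g. a bare '3') and any unwanted words
    return ' '.join(w for w in words if w and w.lower() not in badwords)

print(clean_ingred('3 cloves garlic, minced'))  # -> 'garlic'
```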