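How to scrape multiple websites to find common words (BeautifulSoup, Requests, Python3)
I'm trying to work out how to scrape several different websites with Beautiful Soup and Requests without repeating my code over and over.
Here is my current code: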
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
Website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
soup = BeautifulSoup(Website1.content)
texts = soup.findAll(text=True)
a = Counter([x.lower() for y in texts for x in y.split()])
b = (a.most_common())
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
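Ideally, I want to scrape 5 different websites, find all the words on each of them, count how often each word occurs on each site, add up the frequencies for each word across the sites, and then combine the result into a single data frame that can be exported with Pandas.
I'd like the output to look like this: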
Word Frequency
the 200
man 300
is 400
tired 300
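Right now my code only handles one website and can't process several at once, and I want to avoid duplicating code.
I could repeat my code by hand, scrape each website one at a time, and then merge the resulting data frames, but that doesn't feel very Pythonic. Is there a faster way to do this, or any other suggestions? Thanks!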
2 Answers
2
Create a function:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
cnt = Counter()  # running total across all websites
def GetData(url):
    Website1 = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(Website1.content, "html.parser")
    texts = soup.find_all(string=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    # Pass the Counter itself so the counts are added per word;
    # a.most_common() is a list of tuples, and each tuple would be counted as its own key
    cnt.update(a)
websites = ['http://www.nerdwallet.com/the-best-credit-cards','http://www.other.com']
for url in websites:
    GetData(url)
makeaframe = pd.DataFrame(cnt.most_common())
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
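As a variation on the function approach, here is a minimal sketch that returns a Counter per page instead of updating a global one, which makes the counting step easy to try without touching the network. The names count_words and total_counts and the two tiny HTML strings are hypothetical, made up for illustration. Dropping script and style tags is an optional extra that the code above does not do.

```python
from collections import Counter

from bs4 import BeautifulSoup


def count_words(html):
    """Return a Counter of lowercased, whitespace-separated words in the page text."""
    soup = BeautifulSoup(html, "html.parser")
    # Optional: drop <script> and <style> so their contents are not counted as words.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return Counter(soup.get_text(separator=" ").lower().split())


def total_counts(pages):
    """Add up the word counts of several HTML documents."""
    total = Counter()
    for html in pages:
        # A Counter is a mapping, so update() adds the counts word by word.
        total.update(count_words(html))
    return total


# Two tiny hypothetical pages standing in for downloaded HTML.
pages = [
    "<html><body><p>The man is tired</p></body></html>",
    "<html><body><p>the man</p><script>var x = 1;</script></body></html>",
]
totals = total_counts(pages)
```

With real sites, each entry in pages would be the content of a requests.get(url) response, as in the function above, and totals.most_common() can be passed to pd.DataFrame in the same way.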
1
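Just loop and update one main Counter dict:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
main_c = Counter() # keep all results here
urls = ["http://www.nerdwallet.com/the-best-credit-cards","http://stackoverflow.com/questions/tagged/python"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content, "html.parser")
    texts = soup.find_all(string=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    main_c.update(a)  # pass the Counter itself so the counts are added per word
make_a_frame = pd.DataFrame(main_c.most_common())
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame)
The update method here differs from the regular dict.update: it adds the counts together instead of replacing the existing values. This holds when it is given a mapping such as another Counter. Given a.most_common(), which is a list of (word, count) tuples, it would count each tuple as an item of its own instead.
Also, on naming style: variable names should be lowercase with underscores, e.g. make_a_frame.
You can also build the frame from the items and sort it explicitly:
comm = [[k, v] for k, v in main_c.items()]
make_a_frame = pd.DataFrame(comm)
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame.sort_values("Frequency", ascending=False))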
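To make the difference between the two update methods concrete, here is a small sketch with made-up counts. The names site_a and site_b are hypothetical stand-ins for the per-site Counters. It also shows a third case that is easy to hit by accident: handing Counter.update a list of (word, count) pairs.

```python
from collections import Counter

# Made-up per-site counts, for illustration only.
site_a = Counter({"the": 2, "man": 1})
site_b = Counter({"the": 3, "tired": 1})

# dict.update overwrites the value stored under an existing key.
plain = dict(site_a)
plain.update(site_b)

# Counter.update with a mapping adds the counts instead.
added = Counter(site_a)
added.update(site_b)

# Counter.update with a list of (word, count) pairs treats each pair
# as a single item and counts it once, so the word totals do not change.
pairs = Counter(site_a)
pairs.update(site_b.most_common())
```

Here plain["the"] ends up as 3, added["the"] as 5, and pairs gains the keys ("the", 3) and ("tired", 1) while pairs["the"] stays at 2. That is why the loop should pass the Counter itself rather than the output of most_common().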