How to scrape multiple websites to find common words (BeautifulSoup, Requests, Python 3)

0 votes
2 answers
2328 views
Asked 2025-04-18 18:54

I'm trying to figure out how to scrape several different websites with Beautiful Soup and Requests without repeating my code over and over.

Here is my current code:

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

Website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(Website1.content, "html.parser")
texts = soup.find_all(string=True)
# Count every lowercased, whitespace-separated word on the page
a = Counter([x.lower() for y in texts for x in y.split()])
b = (a.most_common())
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)

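Ideally, I want to scrape 5 different websites, find all the words on each of them, count how often each word appears on each site, add up the per-site counts for each word, and finally combine everything into a single DataFrame that I can export with Pandas.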

I'd like the output to look something like this:

Word     Frequency
the       200
man       300
is        400
tired     300

Right now my code only handles one website at a time, and I want to avoid duplicating it.

I could repeat the code by hand for each website and then merge the resulting DataFrames, but that doesn't feel very Pythonic. Is there a faster way, or any suggestions? Thanks!

2 Answers

2

Create a function:

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

cnt = Counter()  # running total across all sites

def GetData(url):
    Website1 = requests.get(url)
    soup = BeautifulSoup(Website1.content, "html.parser")
    texts = soup.find_all(string=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    # Pass the Counter itself so its counts are added to the total.
    # a.most_common() is a list of tuples, and each tuple would be counted as a key.
    cnt.update(a)

websites = ['http://www.nerdwallet.com/the-best-credit-cards', 'http://www.other.com']
for url in websites:
    GetData(url)

makeaframe = pd.DataFrame(cnt.most_common())
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
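
As a variation on the function above, here is a minimal sketch that returns a Counter per page instead of updating a global one, so the totals are built in an ordinary loop. The names count_words and combine_counts and the sample strings are hypothetical and not from the answer; in practice each text would come from the scraped page rather than a literal.

```python
from collections import Counter


def count_words(text):
    """Return a Counter of lowercased, whitespace-separated words."""
    return Counter(word.lower() for word in text.split())


def combine_counts(page_texts):
    """Add up the word counts of several pages into one Counter."""
    total = Counter()
    for text in page_texts:
        # Updating with a Counter sums the counts key by key.
        total.update(count_words(text))
    return total


# Hypothetical page texts standing in for scraped content.
pages = ["The man is tired", "the man sleeps"]
total = combine_counts(pages)
print(total.most_common(2))
```

Keeping the counting separate from the fetching makes it possible to check the arithmetic without any network access.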
1

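Just loop over the URLs and update one main Counter:

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

main_c = Counter() # keep all results here
urls = ["http://www.nerdwallet.com/the-best-credit-cards","http://stackoverflow.com/questions/tagged/python"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content, "html.parser")
    texts = soup.find_all(string=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    # Pass the Counter itself, not a.most_common(), so the counts are summed
    main_c.update(a)
make_a_frame = pd.DataFrame(main_c.most_common())
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame)

The update method here is different from the regular dict.update: it adds the counts to the existing values instead of replacing them. This only applies when update receives a mapping such as another Counter; given a list of (word, count) tuples, it counts each tuple as a separate key.

Also, on naming style: variable names should be lowercase with underscores, for example make_a_frame.

You could also try:

comm = [[k, v] for k, v in main_c.items()]
make_a_frame = pd.DataFrame(comm)
make_a_frame.columns = ['Words', 'Frequency']
# DataFrame.sort was removed from pandas; sort_values is its replacement
print(make_a_frame.sort_values("Frequency", ascending=False))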
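
The difference between Counter.update and dict.update mentioned in this answer can be checked with a small standalone sketch. The two counters below are made-up sample data, not real scrape results, and the last part shows what happens when a list of (word, count) tuples is passed instead of a Counter.

```python
from collections import Counter

# Made-up per-site counts for illustration only.
site_a = Counter({"the": 2, "card": 1})
site_b = Counter({"the": 3, "rate": 1})

# Counter.update with a mapping adds the counts together.
total = Counter()
total.update(site_a)
total.update(site_b)

# dict.update replaces the value stored under an existing key.
plain = dict(site_a)
plain.update(site_b)

# Pitfall: a list of (word, count) tuples is treated as an iterable of
# elements, so every distinct tuple becomes its own key with a count of 1.
wrong = Counter()
wrong.update(site_a.most_common())
wrong.update(site_b.most_common())

print(total["the"])                       # 5
print(plain["the"])                       # 3
print(("the", 2) in wrong, wrong["the"])  # True 0
```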
