What is the best way to scrape data from a website with multiple links using Python?

1 vote
1 answer
36 views
Asked 2025-04-14 15:42

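The page in my example below lists all of the Virginia Tech alumni chapters. I want to follow the link to each chapter's page and write the information listed there to a CSV file. I tried using BeautifulSoup, but without success.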

Any help on this topic is greatly appreciated, thank you!

Example of the data I want to scrape

url=https://www.alumni.vt.edu/chapters/chapter_list.html


from bs4 import BeautifulSoup
import requests

website = 'https://www.alumni.vt.edu/chapters/chapter_list.html'

result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, 'lxml')

print(soup.prettify())

1 Answer

0

Here is an example of how to collect every link on the chapter list page and pull some information from each subpage:

import requests
from bs4 import BeautifulSoup

url = "https://www.alumni.vt.edu/chapters/chapter_list.html"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

links = []
for a in soup.select(".general-body li > a"):
    links.append(a["href"])

for u in links:
    print(f"Opening {u}")
    soup = BeautifulSoup(requests.get(u).content, "html.parser")

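    # Look for a <strong> label whose text contains "Contact"; some pages may not have one.
    contact = soup.select_one(".general-body strong:-soup-contains(Contact)")
    if contact:
        # The first next_element is the label's own text node; the second is
        # the node right after the label, which holds the contact value.
        c = contact.next_element.next_element
        c = c.text.strip()

        print(contact.text, c)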

Output:

Opening https://alumni.vt.edu/chapters/chapter_list/alleghany_highlands.html
Contact: Kathleen All
Opening https://alumni.vt.edu/chapters/chapter_list/augusta.html
Contact:  augustacountyhokies@gmail.com
Opening https://alumni.vt.edu/chapters/chapter_list/central_virginia.html
Contact:  Sammy Paris
Opening https://alumni.vt.edu/chapters/chapter_list/charlottesville.html
Contact: Martin Harar
Opening https://alumni.vt.edu/chapters/chapter_list/commonwealth.html
Contact:  Volunteers Needed

...
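
The question also asks for a CSV file, which the snippet above only prints toward. Below is a minimal sketch of one way to get there, assuming every chapter page uses the same `.general-body` markup (I have not checked each page). The helper names `chapter_links`, `contact_of`, `write_csv` and `scrape_to_csv`, the `fetch` parameter, and the `url`/`contact` column names are made up for illustration and are not part of any library. Only the "Contact" field is extracted; other fields would need their own selectors.

```python
import csv
from urllib.parse import urljoin

from bs4 import BeautifulSoup

LIST_URL = "https://www.alumni.vt.edu/chapters/chapter_list.html"


def chapter_links(list_html, base_url=LIST_URL):
    """Return an absolute URL for every chapter link on the list page."""
    soup = BeautifulSoup(list_html, "html.parser")
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    return [
        urljoin(base_url, a["href"])
        for a in soup.select(".general-body li > a[href]")
    ]


def contact_of(page_html):
    """Return the text right after the 'Contact' label, or '' if there is none."""
    soup = BeautifulSoup(page_html, "html.parser")
    label = soup.select_one('.general-body strong:-soup-contains("Contact")')
    if label is None:
        return ""
    # Same navigation as the snippet above: skip the label's own text node.
    node = label.next_element.next_element
    return node.get_text(strip=True) if node is not None else ""


def write_csv(rows, out):
    """Write (url, contact) pairs to an open text file, with a header row."""
    writer = csv.writer(out)
    writer.writerow(["url", "contact"])  # hypothetical column names
    writer.writerows(rows)


def scrape_to_csv(fetch, out, list_url=LIST_URL):
    """Collect one row per chapter page; fetch(url) must return HTML text."""
    rows = []
    for url in chapter_links(fetch(list_url), list_url):
        rows.append((url, contact_of(fetch(url))))
    write_csv(rows, out)
    return rows
```

To run it against the live site, `fetch` could be something like `lambda u: requests.get(u, timeout=30).text`, with `out` opened via `open("chapters.csv", "w", newline="", encoding="utf-8")` (the file name is just an example). Passing the fetcher in keeps the parsing testable on saved HTML without hitting the server.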
