What is the best way to scrape data from a website with multiple links using Python?
In the example below, the page lists all of the alumni association chapters for Virginia Tech. I want to visit each chapter's page and write every piece of listed information to a CSV file. I tried using BeautifulSoup but had no success.
Any help on this topic is greatly appreciated, thanks!
url=https://www.alumni.vt.edu/chapters/chapter_list.html
from bs4 import BeautifulSoup
import requests
website = 'https://www.alumni.vt.edu/chapters/chapter_list.html'
result = requests.get(website)
content = result.text
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())
1 Answer
Here's an example of how to open each link from the chapter list page and grab some information from the sub-pages:
import requests
from bs4 import BeautifulSoup

url = "https://www.alumni.vt.edu/chapters/chapter_list.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# collect the URL of every chapter link on the list page
links = []
for a in soup.select(".general-body li > a"):
    links.append(a["href"])

for u in links:
    print(f"Opening {u}")
    soup = BeautifulSoup(requests.get(u).content, "html.parser")

    # get some info here, e.g. the contact line:
    contact = soup.select_one(".general-body strong:-soup-contains(Contact)")
    if contact:
        # the contact name/e-mail is the text node right after the <strong> tag
        c = contact.next_element.next_element
        c = c.text.strip()
        print(contact.text, c)
Prints:
Opening https://alumni.vt.edu/chapters/chapter_list/alleghany_highlands.html
Contact: Kathleen All
Opening https://alumni.vt.edu/chapters/chapter_list/augusta.html
Contact: augustacountyhokies@gmail.com
Opening https://alumni.vt.edu/chapters/chapter_list/central_virginia.html
Contact: Sammy Paris
Opening https://alumni.vt.edu/chapters/chapter_list/charlottesville.html
Contact: Martin Harar
Opening https://alumni.vt.edu/chapters/chapter_list/commonwealth.html
Contact: Volunteers Needed
...
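Since the stated goal is a CSV file, the loop above can be extended to write one row per chapter with Python's built-in csv module. A minimal sketch under the same selectors as above; the output filename "chapters.csv" and the column names "chapter_url" and "contact" are assumptions, not anything from the site:

import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.alumni.vt.edu/chapters/chapter_list.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
links = [a["href"] for a in soup.select(".general-body li > a")]

# "chapters.csv" and the column names are made up for this sketch;
# newline="" prevents blank rows on Windows
with open("chapters.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["chapter_url", "contact"])
    for u in links:
        page = BeautifulSoup(requests.get(u).content, "html.parser")
        contact = page.select_one(".general-body strong:-soup-contains(Contact)")
        value = ""
        if contact:
            # same extraction as above: the text node right after <strong>
            value = contact.next_element.next_element.text.strip()
        writer.writerow([u, value])

You can add more columns the same way: select each field on the chapter page with its own CSS selector and append it to the row before writerow.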