Saving webpage content with BeautifulSoup

2 votes
2 answers
3854 views
Asked 2025-04-18 16:56

I am trying to scrape a web page with BeautifulSoup. Here is my code:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("http://en.wikipedia.org//wiki//Markov_chain.htm") as url:
    s = url.read()

soup = BeautifulSoup(s)

with open("scraped.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
    f.close()

The problem is that it saves the Wikipedia main page rather than the specific article I want. Why is this address wrong, and how should I change it?

2 Answers

0

@alecxe's answer produces:

**GuessedAtParserWarning**: 
No parser was explicitly specified, so I'm using the best 
available HTML parser for this system ("html.parser"). This usually isn't a problem, 
but if you run this code on another system, or in a different virtual environment, it 
may use a different parser and behave differently. The code that caused this warning
is on line 25 of the file crawl.py. 

To get rid of this warning, pass the additional argument 'features="html.parser"' to
the BeautifulSoup constructor.
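As the warning itself suggests, it is enough to name the parser explicitly when constructing the soup; no switch of HTTP library is needed for that. A minimal offline sketch (the HTML string is just a stand-in for a downloaded page):

```python
from bs4 import BeautifulSoup

# Stand-in for HTML fetched with urllib or requests
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Passing features="html.parser" avoids the GuessedAtParserWarning
soup = BeautifulSoup(html, features="html.parser")

print(soup.title.string)  # the <title> contents
print(soup.get_text())    # all visible text, tags stripped
```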

Here is a solution without the GuessedAtParserWarning, using the requests library:

# crawl.py

import requests
from bs4 import BeautifulSoup
from os import path

url = 'https://www.sap.com/belgique/index.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

file = path.join(path.dirname(__file__), 'downl.txt')

# Either print the title/text or save it to a file
print(soup.title)
# download the text
with open(file, 'w', encoding='utf-8') as f:
    f.write(soup.text)

5

The correct URL for this page is http://en.wikipedia.org/wiki/Markov_chain

>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/Markov_chain"
>>> soup = BeautifulSoup(urllib.request.urlopen(url))
>>> soup.title
<title>Markov chain - Wikipedia, the free encyclopedia</title>
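With the corrected URL, the asker's original urllib script works as written. Combining both fixes (correct URL, explicit parser), the save step looks like this, sketched offline with a stand-in byte string in place of the downloaded page:

```python
from bs4 import BeautifulSoup

# Stand-in for urllib.request.urlopen("http://en.wikipedia.org/wiki/Markov_chain").read()
html = b"<html><head><title>Markov chain - Wikipedia</title></head><body><p>A Markov chain is a stochastic model.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# get_text() strips the tags; the with-block closes the file automatically,
# so no explicit f.close() is needed
with open("scraped.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
```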
