Saving webpage content with BeautifulSoup

2 votes
2 answers
3854 views
Asked 2025-04-18 16:56

I am trying to scrape a web page with BeautifulSoup. Here is my code:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("http://en.wikipedia.org//wiki//Markov_chain.htm") as url:
    s = url.read()

soup = BeautifulSoup(s)

with open("scraped.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
    f.close()

The problem is that it saves the Wikipedia main page rather than the specific article I want. Why is this address wrong, and how should I change it?

2 Answers

0

@alecxe's answer produces:

**GuessedAtParserWarning**: 
No parser was explicitly specified, so I'm using the best 
available HTML parser for this system ("html.parser"). This usually isn't a problem, 
but if you run this code on another system, or in a different virtual environment, it 
may use a different parser and behave differently. The code that caused this warning
is on line 25 of the file crawl.py. 

To get rid of this warning, pass the additional argument 'features="html.parser"' to
the BeautifulSoup constructor.
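As the warning itself suggests, it is enough to name the parser explicitly when constructing the soup; no switch of HTTP library is needed for that. A minimal offline sketch (the HTML string is just a stand-in for a downloaded page):

```python
from bs4 import BeautifulSoup

# Stand-in for HTML fetched with urllib or requests
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Passing features="html.parser" avoids the GuessedAtParserWarning
soup = BeautifulSoup(html, features="html.parser")

print(soup.title.string)  # the <title> contents
print(soup.get_text())    # all visible text, tags stripped
```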

Here is a solution without the GuessedAtParserWarning, using the requests library:

# crawl.py

import requests
from bs4 import BeautifulSoup
from os import path

url = 'https://www.sap.com/belgique/index.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

file = path.join(path.dirname(__file__), 'downl.txt')

# Either print the title/text or save it to a file
print(soup.title)
# download the text
with open(file, 'w', encoding='utf-8') as f:
    f.write(soup.text)

5

The correct URL for this page is http://en.wikipedia.org/wiki/Markov_chain

>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/Markov_chain"
>>> soup = BeautifulSoup(urllib.request.urlopen(url))
>>> soup.title
<title>Markov chain - Wikipedia, the free encyclopedia</title>
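With the corrected URL, the asker's original urllib script works as written. Combining both fixes (correct URL, explicit parser), the save step looks like this, sketched offline with a stand-in byte string in place of the downloaded page:

```python
from bs4 import BeautifulSoup

# Stand-in for urllib.request.urlopen("http://en.wikipedia.org/wiki/Markov_chain").read()
html = b"<html><head><title>Markov chain - Wikipedia</title></head><body><p>A Markov chain is a stochastic model.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# get_text() strips the tags; the with-block closes the file automatically,
# so no explicit f.close() is needed
with open("scraped.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
```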
