How to save HTML to a text file using Python, Selenium, and BeautifulSoup

Published 2024-04-25 07:14:11


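I am trying to scrape a YouTube playlist using BeautifulSoup and Selenium. I would like to save the page's HTML to a text file so that, while I work with BeautifulSoup, I don't have to keep re-running the rest of the script that opens the browser and fetches the HTML.

This is a shortened version of my code, which raises the error: "UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0: character maps to <undefined>"
I know I can save it as a UTF-8 text file, but I'm not sure how to convert it back to ASCII so that I can parse it with BeautifulSoup.

My code: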

from pathlib import Path
from selenium import webdriver
from bs4 import BeautifulSoup
def test_html_save():
    playlist_url = 'https://www.youtube.com/watch?v=IdneKLhsWOQ&list=PLMEZyDHJojxNYSVgRCPt589DI5H7WT1ZK'
    browser = webdriver.Firefox()
    browser.get(playlist_url)
    html_content = browser.page_source  # Getting the html from the webpage
    browser.close()
    soup = BeautifulSoup(html_content, 'html.parser') # creates a beautiful soup object 'soup'.

    html_save_path = Path(__file__).parent / ".//html_save_test.txt"

    with open(html_save_path, 'wt') as html_file:
        for line in soup.prettify():
            html_file.write(line)

test_html_save()

My question is how to save the entire HTML of the web page to a .txt file.


1 Answer
User
#1 · Posted 2024-04-25 07:14:11

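You do not need to convert anything back to ASCII: BeautifulSoup parses Unicode text directly. Set the encoding parameter to utf-8 when writing the file: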

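with open(html_save_path, 'wt', encoding='utf-8') as html_file:
    # prettify() returns one string; iterating over it yields single
    # characters, not lines, so write it in a single call instead.
    html_file.write(soup.prettify())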
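To illustrate why the encoding matters, here is a minimal sketch that uses only the standard library. It assumes the failing default encoding was cp1252 (a common Windows default that uses the 'charmap' codec); the sample HTML string, the helper names can_encode, save_html and load_html, and the temporary file location are hypothetical stand-ins, not part of the original code.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for browser.page_source. It contains a zero-width
# space (U+200B), the character named in the error message.
html_content = "<html><body><p>\u200bHello</p></body></html>"


def can_encode(text, encoding):
    """Return True if every character in text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def save_html(text, path):
    """Write text to path as UTF-8, regardless of the platform default."""
    Path(path).write_text(text, encoding="utf-8")


def load_html(path):
    """Read the file back with the same encoding it was written with."""
    return Path(path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "html_save_test.txt"  # assumed location
    save_html(html_content, html_path)
    restored = load_html(html_path)

print(can_encode(html_content, "cp1252"))  # cp1252 has no mapping for U+200B
print(can_encode(html_content, "utf-8"))   # UTF-8 can represent it
print(restored == html_content)            # the text survives the round trip
```

The string returned by load_html can then be handed to BeautifulSoup(restored, 'html.parser') exactly as page_source was, with no ASCII conversion step.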

Your aim is to scrape the video title and channel name from the video. Here is the complete code to do that:

from pathlib import Path
from selenium import webdriver
from bs4 import BeautifulSoup
import time

def test_html_save():
    playlist_url = 'https://www.youtube.com/watch?v=IdneKLhsWOQ&list=PLMEZyDHJojxNYSVgRCPt589DI5H7WT1ZK'
    browser = webdriver.Chrome()
    browser.get(playlist_url)
    time.sleep(4)  # Fixed 4-second pause to give the page time to load
    html_content = browser.page_source  # Getting the html from the webpage
    browser.close()
    soup = BeautifulSoup(html_content, 'html.parser') # creates a beautiful soup object 'soup'.

    html_save_path = "D:\\bs4_html.txt"

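    with open(html_save_path, 'wt', encoding='utf-8') as html_file:
        # prettify() returns one string; iterating over it yields single
        # characters, not lines, so write it in a single call instead.
        html_file.write(soup.prettify())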

    title = soup.find('yt-formatted-string', class_ = 'style-scope ytd-video-primary-info-renderer').text
    channel_name = soup.find('a', class_ = 'yt-simple-endpoint style-scope yt-formatted-string').text
    print(f"Video Title: {title}")
    print(f"Channel Name: {channel_name}")

test_html_save()

Output:

Video Title: Taylor Swift - Wildest Dreams
Channel Name: Taylor Swift
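
Since the original goal was to avoid reopening the browser on every run, the saved file can also serve as a simple cache. The following is only a sketch under stated assumptions: get_html and fake_fetch are hypothetical names, and fake_fetch stands in for the Selenium code that returns browser.page_source so that the example runs without a browser.

```python
import tempfile
from pathlib import Path


def get_html(cache_path, fetch):
    """Return the cached HTML if the file exists; otherwise call fetch(),
    save its result as UTF-8, and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    html = fetch()
    cache_path.write_text(html, encoding="utf-8")
    return html


calls = []


def fake_fetch():
    """Hypothetical stand-in for the Selenium step (browser.page_source)."""
    calls.append(1)
    return "<html><title>\u200bdemo</title></html>"


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "bs4_html.txt"  # assumed cache location
    first = get_html(cache_file, fake_fetch)     # cache miss: calls fake_fetch
    second = get_html(cache_file, fake_fetch)    # cache hit: reads the file

print(first == second)
print(len(calls))  # number of times the fetch step actually ran
```

In a real script, fetch would be a function that opens the browser and returns page_source, and the returned string would be passed to BeautifulSoup as before. Delete the cache file whenever a fresh copy of the page is needed.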
