Special characters in Python BeautifulSoup

Posted 2024-04-18 19:13:52


How do I remove (or encode) the special characters from the page referenced below?

import urllib2
from bs4 import BeautifulSoup
import re

link = "https://www.sec.gov/Archives/edgar/data/4281/000119312513062916/R2.htm"

request_headers = {"Accept-Language": "en-US,en;q=0.5", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Referer": "http://google.com", "Connection": "keep-alive"}
request = urllib2.Request(link, headers=request_headers)
html = urllib2.urlopen(request).read()
soup = BeautifulSoup(html, "html.parser")
soup = soup.encode('utf-8', 'ignore')
print(soup)

Tags: import, encoding, application, request, html, link, page, xml
1 Answer

Forum user
#1 · Posted 2024-04-18 19:13:52

A Unicode object can only be printed if it can be converted to ASCII; if it cannot be encoded as ASCII, you get this error. You may need to encode the soup explicitly and then print the result:

import requests
from bs4 import BeautifulSoup

link = "https://www.sec.gov/Archives/edgar/data/4281/000119312513062916/R2.htm"

request_headers = {"Accept-Language": "en-US,en;q=0.5", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Referer": "http://google.com", "Connection": "keep-alive"}

# Fetch the page, parse it, and encode the soup explicitly before printing
response = requests.get(link, headers=request_headers)
soup = BeautifulSoup(response.text, "lxml")
print(soup.encode('utf-8'))
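
If the goal is to actually remove the special characters rather than just encode them, a minimal Python 3 sketch along the following lines should work. It reuses the answer's requests/BeautifulSoup approach; the get_text() call and the ASCII encode/decode round-trip with errors="ignore" are one assumed interpretation of "remove", not something stated in the original answer.

import requests
from bs4 import BeautifulSoup

link = "https://www.sec.gov/Archives/edgar/data/4281/000119312513062916/R2.htm"
request_headers = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.5"}

response = requests.get(link, headers=request_headers)
soup = BeautifulSoup(response.text, "lxml")

# Extract the visible text, then drop every character that cannot be
# represented in ASCII (encode with errors="ignore", decode back to str).
text = soup.get_text(separator=" ", strip=True)
ascii_only = text.encode("ascii", "ignore").decode("ascii")
print(ascii_only)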
