How can I remove (or encode) the special characters in the page referenced below?
import urllib2
from bs4 import BeautifulSoup
import re
link = "https://www.sec.gov/Archives/edgar/data/4281/000119312513062916/R2.htm"
request_headers = {"Accept-Language": "en-US,en;q=0.5", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Referer": "http://google.com", "Connection": "keep-alive"}
request = urllib2.Request(link, headers=request_headers)
html = urllib2.urlopen(request).read()
soup = BeautifulSoup(html, "html.parser")
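# Encode the parsed document to UTF-8 bytes explicitly before printing.
# bs4's encode() takes indent_level as its second positional argument, so only the encoding is passed.
encoded = soup.encode('utf-8')
print(encoded)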
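In Python 2, a Unicode object can only be printed if it can be converted to the output encoding, which is often ASCII. If the text contains characters that cannot be encoded as ASCII, you get a UnicodeEncodeError. You probably need to encode it explicitly and then print the result.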
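As a minimal sketch of the two options the question asks about (encoding versus removing), the snippet below works on a plain Unicode string rather than the real page. The sample text and the helper names `to_utf8` and `strip_non_ascii` are assumptions made for illustration; they do not come from the original post.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Hypothetical sample standing in for text extracted from the page.
text = "Net sales \u2014 2012\u00ae"


def to_utf8(value):
    """Encode explicitly to UTF-8 bytes so no implicit ASCII conversion is needed."""
    return value.encode("utf-8")


def strip_non_ascii(value):
    """Drop every character that has no ASCII representation."""
    return value.encode("ascii", "ignore").decode("ascii")


encoded = to_utf8(text)             # bytes; all characters preserved
ascii_only = strip_non_ascii(text)  # text; special characters removed

print(ascii_only)
```

With BeautifulSoup, the same helpers could presumably be applied to the string returned by `soup.get_text()`. Encoding to UTF-8 keeps the special characters, while the ASCII variant silently discards them, so which one fits depends on whether those characters matter downstream.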