Why is my string garbled?

from lxml.html import parse
from urllib2 import urlopen
import codecs

parsed = parse(urlopen('http://lakgsa.org/?page_id=18'))
doc = parsed.getroot()
links = doc.findall('.//div/a')
print(links[15:20])
lnk=links[3]
lnk.get('href')
print(lnk.get('href'))
print(lnk.text_content())
with codecs.open('hey.json', 'wb', encoding='utf-8') as file:
    file.write(lnk.text_content())

1 Answer

User

#1 · Posted 2024-04-25 16:59:53

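The problem is that you are double-encoding: the content from the remote source is already UTF-8, and when you write it out, it gets encoded a second time.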
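A minimal sketch of what double encoding looks like, assuming the UTF-8 bytes were misread as Latin-1 somewhere before the write (the sample text is hypothetical; the actual page content is not shown in the question):

# Hypothetical sample text, for illustration only.
original = u"한국어"

utf8_bytes = original.encode("utf-8")        # bytes as sent by the server
misread = utf8_bytes.decode("latin-1")       # each byte wrongly treated as one character
double_encoded = misread.encode("utf-8")     # encoded a second time on write

# Undoing both steps recovers the text, which is a quick way to confirm the diagnosis.
recovered = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

If a garbled file can be repaired with the same round trip as `recovered`, double encoding is the likely cause.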

The quickest fix is to remove encoding='utf-8' from the open() call for the output file.

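The correct fix is to convert the input stream to Unicode according to the character set the remote server declares. The simplest way to do that is to use the Python requests library and its response.text attribute.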

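from lxml.html import fromstring
import requests
import io

url = 'http://lakgsa.org/'
params = {'page_id': '18'}

# response.text is the body already decoded to Unicode by requests
response = requests.get(url, params=params)
# fromstring() parses an in-memory string and returns the root element;
# parse() expects a filename, URL, or file-like object instead
doc = fromstring(response.text)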

links = doc.findall('.//div/a')
print(links[15:20])
lnk = links[3]
print(lnk.get('href'))
print(lnk.text_content())

# io should be used instead of codecs
# you don't need the 'b' mode
with io.open('hey.json', 'w', encoding='utf-8') as file:
    file.write(lnk.text_content())
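
For cases where requests is not an option, the same idea (decode the bytes with the charset the server declares) can be sketched with the standard library alone. The helper name `decode_body`, the sample header, and the sample body are all hypothetical, and the UTF-8 fallback is an assumption rather than anything the server guarantees:

from email.message import Message

def decode_body(raw_bytes, content_type, default="utf-8"):
    """Decode raw response bytes using the charset named in a Content-Type header."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header names no charset
    charset = msg.get_content_charset() or default
    # An unknown charset name raises LookupError here
    return raw_bytes.decode(charset)

# Hypothetical header and body, for illustration only.
body = u"한국어".encode("euc-kr")
text = decode_body(body, "text/html; charset=EUC-KR")

Once the text is decoded this way, it is Unicode and can be written with io.open(..., encoding='utf-8') without being encoded twice.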

You may also want to consider BeautifulSoup, which has very good Unicode support.
