After web scraping with lxml in Python, I get strange characters instead of Turkish characters

import cssselect
import requests
from lxml import html

def parse_html(url, selector):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    titles = tree.cssselect(selector)
    for title in titles:
        print(title.text_content().strip())

1 Answer

User

#1 · Posted 2024-06-16 14:09:06

Answer

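import cssselect  # not used directly; lxml's .cssselect() needs this package installed
import requests
from lxml import html

def parse_html(url, selector):
    page = requests.get(url)

    # Decode the raw bytes as UTF-8 first, so lxml receives text rather than bytes.
    content = str(page.content, 'utf-8')

    tree = html.fromstring(content)
    titles = tree.cssselect(selector)

    for title in titles:
        print(title.text_content().strip())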
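The fix above assumes the page really is UTF-8; `str(page.content, 'utf-8')` raises `UnicodeDecodeError` otherwise. As a rough sketch only, a more forgiving decode step might look like the following. The helper name `decode_body` and the choice of `cp1254` (the Windows Turkish code page) as the fallback are assumptions for illustration, not part of the original answer.

```python
def decode_body(raw: bytes, fallback: str = "cp1254") -> str:
    """Decode response bytes as UTF-8, falling back to a legacy code page.

    `decode_body` and the `cp1254` default are illustrative assumptions.
    """
    try:
        # Strict UTF-8 first: arbitrary legacy-encoded bytes rarely form valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: use the fallback and mark undecodable bytes with U+FFFD.
        return raw.decode(fallback, errors="replace")


print(decode_body("ışık".encode("utf-8")))
print(decode_body("ışık".encode("cp1254")))
```

Inside `parse_html`, this would replace the `str(page.content, 'utf-8')` line with `content = decode_body(page.content)`.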

Why

The Unicode character 'ı' (U+0131, Latin small letter dotless i) is encoded in UTF-8 as two bytes: 0xC4 0xB1.

> echo -e '\u0131' | xxd -u
00000000: C4B1 0A                                  ...
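The same check can be done from Python without a shell. This is a small sketch; the helper name `utf8_hex` is made up for illustration.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of `text` as uppercase hex digits."""
    return text.encode("utf-8").hex().upper()


dotless_i = "\u0131"  # 'ı'
print(utf8_hex(dotless_i))             # hex of the encoded bytes
print(len(dotless_i.encode("utf-8")))  # number of bytes in the encoding
```

Unlike the `echo` command, this does not append a trailing newline, so there is no `0A` byte in the result.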

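page.content returns the binary response content, that is, the raw bytes of the page rather than decoded text.

When those bytes are read one at a time as Latin-1 instead of as UTF-8, 0xC4B1 becomes 0xC4 (U+00C4 'Ä') and 0xB1 (U+00B1 '±').

Likewise, U+00FC 'ü' (UTF-8 encoding: 0xC3BC) becomes 0xC3 (U+00C3 'Ã') and 0xBC (U+00BC '¼').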
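The byte-level effect can be reproduced with the standard library alone. The sketch below encodes each character as UTF-8 and then decodes the bytes as Latin-1, which maps every byte to the code point of the same value. The function names `misread_as_latin1` and `repair` are hypothetical.

```python
def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode byte by byte as Latin-1 (the mistake)."""
    return text.encode("utf-8").decode("latin-1")


def repair(mojibake: str) -> str:
    """Undo the mistake: recover the original bytes, then decode as UTF-8."""
    return mojibake.encode("latin-1").decode("utf-8")


for ch in ("\u0131", "\u00fc"):  # 'ı' and 'ü'
    broken = misread_as_latin1(ch)
    print(repr(ch), "->", repr(broken), "->", repr(repair(broken)))
```

`repair` only works when the garbled text came from exactly this UTF-8-read-as-Latin-1 mix-up; decoding correctly in the first place, as in the answer, is the more reliable route.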
