What is wrong with my XPath expression?

import os
import urllib
import lxml.html
down='http://v.163.com/special/opencourse/bianchengdaolun.html'
file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)
namelist=root.xpath('//td[@class="u-ctitle"]/a')
len(namelist)

1 answer

User

#1 · Posted 2024-06-16 13:54:51

Your XPath is correct. The problem lies elsewhere.

If you inspect the HTML, you will see the following meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=GBK" />

In this code:

file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)

file is actually a sequence of bytes, so the decoding from GBK-encoded bytes to a Unicode string happens inside the document_fromstring method.

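The problem is that the page is not actually encoded in GBK, so lxml decodes it incorrectly and data is lost. Decoding the bytes as GBK by hand shows the failure: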

>>> file.decode('gbk')
Traceback (most recent call last):
  File "down.py", line 9, in <module>
    file.decode('gbk')
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 7247-7248: illegal multibyte sequence
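
The same kind of failure can be reproduced offline, and a handful of candidate encodings probed, with a minimal sketch like the one below. It is only an illustration: the helper name first_working_encoding and the sample bytes are made up for this example and are not taken from the page in question. The sample contains a character outside the Basic Multilingual Plane, which GB18030 encodes as four bytes that the GBK codec rejects.

```python
def first_working_encoding(data, candidates):
    """Return the first codec name that decodes `data` without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot represent the bytes, try the next one
        return name
    return None


# Hypothetical sample: two common Chinese characters followed by U+20000,
# which has a four-byte GB18030 form that the GBK codec cannot decode.
sample = "\u7f16\u7a0b \U00020000".encode("gb18030")

print(first_working_encoding(sample, ["ascii", "gbk", "gb18030"]))  # gb18030
```

A successful decode only shows that the bytes are valid in that codec, not that the codec is the right one, so it is worth eyeballing the decoded text as well.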

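Through trial and error, we can determine that the actual encoding is GB18030, a superset of GBK. To make the script work, decode the bytes manually: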

root=lxml.html.document_fromstring(file.decode('GB18030'))
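
The code in the question uses the Python 2 form urllib.urlopen. On Python 3 the same fix might look roughly like the sketch below. The function names decode_html and load_titles are invented here for illustration, and the sketch assumes that lxml is installed and that the page is still served as GB18030.

```python
import urllib.request


def decode_html(raw, encoding="gb18030"):
    """Decode raw page bytes explicitly instead of trusting the meta tag."""
    return raw.decode(encoding)


def load_titles(url):
    """Fetch `url` and return the <a> elements matched by the XPath from the question."""
    # lxml is third-party; it is imported here so decode_html works without it.
    import lxml.html

    raw = urllib.request.urlopen(url).read()
    root = lxml.html.document_fromstring(decode_html(raw))
    return root.xpath('//td[@class="u-ctitle"]/a')
```

Calling len(load_titles(down)) with the URL from the question would then correspond to the len(namelist) in the original script. Because GB18030 is a superset of GBK, pages that really are GBK-encoded decode the same way with this default.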
