Inconsistent Chinese character encoding when web scraping with Python

import requests
from bs4 import BeautifulSoup

url = "http://www.jjwxc.net/onebook.php?novelid=1485737"
response = requests.get(url)
text = response.text

print text.encode('gb2312')
>> UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa1' in position 340: illegal multibyte sequence

print text.encode('utf-8')
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
<title>¡¶£¨Õý°æ£©±¼ÔÂ¡·Êñ¿Í_¡¾Ô´´Ð¡Ëµ|ÑÔÇéÐ¡Ëµ¡¿_½ú½ÎÄÑ§³Ç</title>
<meta name="Keywords" content="Êñ¿Í,£¨Õý°æ£©±¼ÔÂ,Êñ¿Í¡¶£¨Õý°æ£©±¼ÔÂ¡·,Ö÷½Ç£ºÁøÉÒ ©§ Åä½Ç£ºÔÂ£¬Â½Àë£¬ËÕÐÅ£¬°×ÒÂÚÄÇ£¬Âå¸è£¬×¿ÇïÏÒ£¬ÉÌÓñÈÝ£¬Ð»ÁîÆëµÈµÈ£¨³ö³¡ÅÅÃû£© ©§ ÆäËü£ºÏÉÏÀ£¬ÁøÉÒ£¬ÔÂÉñ£¬Éñ»°,ÇéÓÐ¶ÀÖÓ Å°ÁµÇéÉî ÁéÒìÉñ¹Ö âêÈ»ÈôÊ§ ×îÐÂ¸üÐÂ:2015-07-15 23:57:04 ×÷Æ·»ý·Ö£º193191456" />

1 answer

User

#1 · Posted on 2024-06-16 11:20:08

Try this; it should work.

The GBK codec provides conversion to and from the Chinese GB18030/GBK/GB2312 encoding.
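
As a rough illustration of why the question's output looks the way it does, the sketch below assumes that requests decoded the GBK bytes as ISO-8859-1 (its fallback for text responses whose Content-Type header names no charset). The sample title is a stand-in for the page bytes, not a real download, and the variable names are only illustrative.

```python
# Stand-in for the bytes the server sends: a title encoded as GBK.
title = "《奔月》"
raw = title.encode("gbk")

# Assumption: the bytes get decoded as ISO-8859-1 instead of GBK.
# Every byte becomes one Latin-1 character, which produces mojibake.
garbled = raw.decode("iso-8859-1")

# Re-encoding the mojibake as gb2312 fails, because characters such as
# u'\xa1' have no GB2312 representation (the error seen in the question).
try:
    garbled.encode("gb2312")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Undo the wrong decode, then decode the original bytes correctly.
repaired = garbled.encode("iso-8859-1").decode("gbk")
print(repaired == title)
```

Because ISO-8859-1 maps every byte to exactly one character, the wrong decode loses nothing and can be reversed this way.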

import requests

url = "http://www.jjwxc.net/onebook.php?novelid=1485737"
response = requests.get(url)
# response.text is already a unicode string, decoded with whatever encoding
# requests picked, so calling .decode('gbk') on it only garbles the text again.
# Decode the raw bytes instead; GBK is a superset of the declared gb2312.
text = response.content.decode('gbk')
print(text)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>
        <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
        <title>《（正版）奔月》蜀客_【原创小说|言情小说】_晋江文学城</title>
...
...
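
If the charset should not be hard-coded, one option is to read it from the page's own meta tag and decode the raw bytes accordingly. The helpers below (`sniff_charset` and `decode_html`, both hypothetical names) are a minimal sketch using only the standard library; the sample bytes stand in for `response.content`.

```python
import re


def sniff_charset(raw, default="utf-8"):
    """Return the charset declared in the first 2 KB of the page, or default."""
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default


def decode_html(raw):
    """Decode page bytes using the declared charset."""
    charset = sniff_charset(raw)
    # Pages labelled gb2312 often contain GBK characters, so widen it.
    if charset == "gb2312":
        charset = "gbk"
    # Replace undecodable bytes rather than failing on a stray byte.
    return raw.decode(charset, errors="replace")


# Stand-in for response.content (assumed sample, not the real page).
sample = (
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>'
    "<title>晋江文学城</title>"
).encode("gbk")
html = decode_html(sample)
```

With requests, calling `decode_html(response.content)`, or setting `response.encoding = 'gbk'` before reading `response.text`, should give readable text for this page.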
