Handling Unicode characters in HTTP user agents in Python

Question

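I'm new to Python, but I found a package I need and am testing it. The package is called pywurfl.

Following the examples, I wrote a simple script that reads user agent (UA) strings from one column of a text file. There are a great many UAs, and some may contain non-English characters. The file was produced by a Perl script with its output redirected using bash's ">", for example: perl somescript.pl > outfile.txt.

However, when I run the following code on this file, I get an error.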

#!/usr/bin/python

import fileinput
import sys

from wurfl import devices
from pywurfl.algorithms import LevenshteinDistance


for line in fileinput.input():
    line = line.rstrip("\r\n")    # equiv of chomp
    H = line.split('\t')

    if H[27]=='Mobile':

        user_agent = H[23].decode('utf8')           
        search_algorithm = LevenshteinDistance()
        device = devices.select_ua(user_agent, search=search_algorithm)

        sys.stdout.write( "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (user_agent, device.devid, device.devua, device.fall_back, device.actual_device_root, device.brand_name, device.marketing_name, device.model_name, device.device_os, device.device_os_version, device.mobile_browser, device.mobile_browser_version, device.model_extra_info, device.pointing_method, device.has_qwerty_keyboard, device.is_tablet, device.has_cellular_radio, device.max_data_rate, device.wifi, device.dual_orientation, device.physical_screen_height, device.physical_screen_width,device.resolution_height, device.resolution_width, device.full_flash_support, device.built_in_camera, device.built_in_recorder, device.receiver, device.sender, device.can_assign_phone_number, device.is_wireless_device, device.sms_enabled) + "\n")

    else:
        # do something else
        pass

Here H[23] is the column that contains the UA string. The error I get is:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte

When I replace 'utf8' with 'latin1', I get this error instead:

 sys.stdout.write(................) # with the .... as in the code
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128).

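Am I doing something wrong? I need to convert the UA strings to Unicode because the package requires it. I don't know much about Unicode, especially in Python. How should I handle this error? For example, how can I find the UA string that triggers it, so that I can ask a more specific question?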
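
One way to track down the offending rows, as a minimal sketch rather than anything pywurfl provides: read each line as raw bytes, try to decode the UA field as UTF-8, and collect the line numbers where that fails. The helper names `find_undecodable` and `decode_ua` and the sample rows are made up for illustration. The Latin-1 fallback is an assumption about the data, suggested only by the fact that byte 0xa9 is the copyright sign in Latin-1.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers for illustration; they are not part of pywurfl.

UA_COLUMN = 23  # column index taken from H[23] in the question


def find_undecodable(lines, column=UA_COLUMN, encoding="utf-8"):
    """Return (line_number, raw_field) for every field that fails to decode."""
    bad = []
    for number, raw in enumerate(lines, 1):
        fields = raw.rstrip(b"\r\n").split(b"\t")
        if len(fields) <= column:
            continue  # row too short to contain the UA column
        try:
            fields[column].decode(encoding)
        except UnicodeDecodeError:
            bad.append((number, fields[column]))
    return bad


def decode_ua(raw, fallback="latin-1"):
    """Decode as UTF-8 when possible, else fall back (assumed: Latin-1)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail,
        # but the result is only right if the data really is Latin-1.
        return raw.decode(fallback)


# Made-up two-column sample; a real run would iterate over open(path, "rb").
sample = [b"x\tMozilla/5.0\n", b"y\t\xa9 Browser\n"]
for number, field in find_undecodable(sample, column=1):
    print("line %d: %r" % (number, field))
```

On the output side, the second traceback suggests the decoded text is being converted back to bytes with the default ASCII codec when it is written. In Python 2, encoding the formatted line explicitly, for instance with `.encode('utf-8')`, before passing it to `sys.stdout.write` is the usual way to avoid that.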

