Reading Unicode characters from command-line arguments in Python 2.x on Windows

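Asked 2025-04-15 11:30

I want my Python script to be able to read Unicode command-line arguments on Windows. However, sys.argv appears to be a list of strings in some local encoding rather than Unicode. How can I read the command line in full Unicode?

Example code: argv.py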

import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)
print first_arg.encode("hex")
print open(first_arg)

On my PC, which is set up for a Japanese code page, I get:

C:\temp>argv.py "PC・ソフト申請書08.09.24.doc"
PC・ソフト申請書08.09.24.doc
<type 'str'>
50438145835c83748367905c90bf8f9130382e30392e32342e646f63
<open file 'PC・ソフト申請書08.09.24.doc', mode 'r' at 0x00917D90>

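I believe this is Shift-JIS encoded, and it "works" for this filename. But it breaks for filenames containing characters outside the Shift-JIS character set; the final open call fails: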

C:\temp>argv.py Jörgen.txt
Jorgen.txt
<type 'str'>
4a6f7267656e2e747874
Traceback (most recent call last):
  File "C:\temp\argv.py", line 7, in <module>
    print open(first_arg)
IOError: [Errno 2] No such file or directory: 'Jorgen.txt'
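
To illustrate why the second run fails, here is a minimal sketch. It assumes the Japanese ANSI code page is cp932 (Python's name for the Windows variant of Shift-JIS) and is written so that it also runs on Python 3; the variable names are only illustrative. It decodes the hex dump from the first run and then checks whether "ö" can be encoded in the same code page.

```python
import binascii

# Hex dump printed by the first run above.
hex_dump = "50438145835c83748367905c90bf8f9130382e30392e32342e646f63"
raw = binascii.unhexlify(hex_dump)

# Assumption: the system ANSI code page is cp932 (Windows Shift-JIS).
decoded = raw.decode("cp932")
round_trips = decoded.encode("cp932") == raw

# o-umlaut (U+00F6) has no cp932 representation, so strict encoding fails.
try:
    u"J\u00f6rgen.txt".encode("cp932")
    umlaut_encodable = True
except UnicodeEncodeError:
    umlaut_encodable = False

print(len(raw), len(decoded))
print(round_trips)
print(umlaut_encodable)
```

Since the character cannot be represented in the code page, it appears to be replaced by a lookalike ("o") before Python ever sees the argument, which would explain why the file is not found.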

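Note: I am talking about Python 2.x, not Python 3.0. I have found that Python 3.0 provides sys.argv as proper Unicode. But it is still a bit early to move to Python 3.0 because of the lack of third-party library support.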

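Update:

Several answers say I should decode sys.argv according to its encoding. The problem is that this encoding is not full Unicode, so some characters cannot be represented.

Here is the use case that is giving me grief: I have enabled drag-and-drop of files onto .py files in Windows Explorer. I have file names containing all sorts of characters, including some that are not in the system default code page. Whenever a character cannot be represented in the current code page, the file name my Python script receives through sys.argv is wrong.

Surely there is some Windows API to read the command line in full Unicode (and Python 3.0 does it). I assume the Python 2.x interpreter is not using it.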

windows unicode character encoding command-line sys.argv shift-jis python 2.x third-party libraries

4 Answers

Try this:

import sys
print repr(sys.argv[1].decode('UTF-8'))

You may need to replace UTF-8 with CP437 or CP1252. You should be able to infer the right encoding name from the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP.
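
As a rough sketch of the same idea without a manual registry lookup: on Windows, Python exposes the ANSI code page as the "mbcs" codec. The helper name decode_arg and the fallback to locale.getpreferredencoding() on other platforms are assumptions made for this illustration, not part of the answer above.

```python
import locale
import sys

def decode_arg(raw, encoding=None):
    """Decode one byte-string argument; text input is returned unchanged."""
    if not isinstance(raw, bytes):
        return raw
    if encoding is None:
        # Assumption: "mbcs" (the ANSI code page) on Windows,
        # the locale's preferred encoding everywhere else.
        if sys.platform == "win32":
            encoding = "mbcs"
        else:
            encoding = locale.getpreferredencoding()
    return raw.decode(encoding)

# Explicit encoding, so the result does not depend on the machine.
print(decode_arg(b"J\xf6rgen.txt", "cp1252") == u"J\u00f6rgen.txt")
```

This can only recover characters that survived the conversion to the code page; anything that was already replaced before the script started stays lost.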

Answered 2025-04-15 by Python大师

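Dealing with encodings is really confusing.

I believe that data entered on the command line is encoded in your system encoding rather than passed as Unicode. (Copy and paste should behave the same way.)

So decoding to Unicode with the system encoding should be correct: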

import codecs
import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)

first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
print first_arg_unicode
print type(first_arg_unicode)

f = codecs.open(first_arg_unicode, 'r', 'utf-8')
unicode_text = f.read()
print type(unicode_text)
print unicode_text.encode(sys.getfilesystemencoding())

Running the following command:

prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"

prints:

PC・ソフト申請書08.09.24.txt
<type 'str'>
<type 'unicode'>
PC・ソフト申請書08.09.24.txt
<type 'unicode'>
?日本語

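The file "PC・ソフト申請書08.09.24.txt" contains the text "日本語". (I saved the file as UTF-8 with Windows Notepad, and I am a little puzzled by the '?' at the start of the printed output. Perhaps it has to do with how Notepad saves UTF-8?)

You can use the string decode method or the built-in unicode() function to convert an encoded string to Unicode.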
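
The leading '?' is most likely the UTF-8 byte order mark (BOM) that Notepad writes at the start of files it saves as UTF-8. The sketch below simulates such a file in a temporary directory (the path and variable names are made up for the illustration) and compares the "utf-8" codec with "utf-8-sig", which strips the BOM. It uses io.open so that it runs on both Python 2 and 3.

```python
import io
import os
import tempfile

# The three-character text from the answer, written with escapes.
text = u"\u65e5\u672c\u8a9e"

# Simulate a Notepad-style UTF-8 file: BOM bytes followed by the text.
path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")
with io.open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf" + text.encode("utf-8"))

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
with io.open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# "utf-8-sig" drops the BOM if one is present.
with io.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print(with_bom.startswith(u"\ufeff"))
print(without_bom == text)
```

U+FEFF has no representation in the Japanese code page, so encoding it for the console would plausibly produce the '?' seen above.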

unicode_str = utf8_str.decode('utf8')
unicode_str = unicode(utf8_str, 'utf8')

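Also, when working with encoded files, you may want to use codecs.open() instead of the built-in open(). It lets you specify the file's encoding and then automatically decodes the contents to Unicode using that encoding.

So when you call content = codecs.open("myfile.txt", "r", "utf8").read(), content will be a unicode object.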

codecs.open: http://docs.python.org/library/codecs.html?#codecs.open

Please let me know if I have misunderstood anything.

If you haven't read it yet, I recommend Joel's article on Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html

Answered 2025-04-15 by Python大师

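Here is a solution that does exactly what I was looking for. It calls the Windows GetCommandLineW and CommandLineToArgvW functions:
Get sys.argv with Unicode characters under Windows (from ActiveState)

However, I have made some changes to make it simpler to use and to handle certain cases better. Here is what I use: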

win32_unicode_argv.py

"""
win32_unicode_argv.py

Importing this will replace sys.argv with a full Unicode form.
Windows only.

From this site, with adaptations:
      http://code.activestate.com/recipes/572200/

Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""


import sys

def win32_unicode_argv():
    """Uses shell32.CommandLineToArgvW to get sys.argv as a list of Unicode
    strings.

    Versions 2.x of Python don't support Unicode in sys.argv on
    Windows, with the underlying Windows API instead replacing multi-byte
    characters with '?'.
    """

    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR

    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

sys.argv = win32_unicode_argv()
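
The slicing step in the function is worth a second look: the full Windows command line also contains the interpreter and any interpreter options, so only the last len(sys.argv) entries are kept. A small platform-independent sketch of that step, with a hypothetical helper name and a made-up command line:

```python
def script_args(full_argv, n_sys_argv):
    """Keep only the last n_sys_argv entries of the full command line."""
    start = len(full_argv) - n_sys_argv
    return full_argv[start:]

# Hypothetical command line: python.exe -u argv.py <file>
full = [u"python.exe", u"-u", u"argv.py", u"J\u00f6rgen.txt"]

# sys.argv would have held 2 entries here: the script and one argument.
trimmed = script_args(full, 2)
print(trimmed == [u"argv.py", u"J\u00f6rgen.txt"])
```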

Now, the way I use it is simply this:

import sys
import win32_unicode_argv

From then on, sys.argv is a list of Unicode strings. Python's optparse module seems happy to parse it too, which is great.

Answered 2025-04-15 by Python大师
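
As a small illustration of the optparse point, the sketch below feeds a list of Unicode strings to an OptionParser. The -o/--output option and the sample arguments are invented for the example; on Windows the list would come from the replaced sys.argv.

```python
from optparse import OptionParser

def parse_args(argv):
    """Parse a list of Unicode strings shaped like sys.argv."""
    parser = OptionParser(usage="usage: %prog [options] FILE...")
    # Hypothetical option, only here to show that values stay Unicode.
    parser.add_option("-o", "--output", dest="output",
                      help="name of the output file")
    return parser.parse_args(argv[1:])

# Stand-in for the Unicode sys.argv produced by win32_unicode_argv.
sample_argv = [u"argv.py", u"-o", u"out.txt", u"J\u00f6rgen.txt"]
options, files = parse_args(sample_argv)

print(options.output == u"out.txt")
print(files == [u"J\u00f6rgen.txt"])
```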
