Unicode arguments in the Python C API

Asked 2025-04-17 05:20

I have a simple Python script:

import _tph
str = u'Привет, <b>мир!</b>' # A Unicode string containing Russian characters
_tph.strip_tags(str)

I also have a C library, compiled to _tph.so. It contains a function called strip_tags:

PyObject *strip_tags(PyObject *self, PyObject *args) {
    PyUnicodeObject *string;
    Py_ssize_t length;

    PyArg_ParseTuple(args, "u#", &string, &length);
    printf("%d, %d\n", string->length, length);

    // ...
}

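The printf call prints: 1080, 19. So str really is 19 characters long, but where does the 1080 come from?

When I print string, I get my str, then a null character, then a pile of garbage bytes.

The garbage memory looks like this: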

u'\u041f\u0440\u0438\u0432\u0435\u0442, <b>\u043c\u0438\u0440!</b>\x00\x00\u0299\Ub7024000\U08c55800\Ub7025904\x00\Ub777351c\U08c79e58\x00\U08c7a0b4\x00\Ub7025904\Ub7025954\Ub702594c\Ub702591c\Ub702592c\Ub7025934\x00\x00\x00

How can I get a normal string?

unicode character encoding string manipulation c++ api interoperability debugging memory management data representation

1 Answer

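The "string" variable is misleadingly named and typed. With the "u#" format, PyArg_ParseTuple stores a pointer to the object's Py_UNICODE character buffer, not a pointer to the Unicode object itself, so it should be declared as Py_UNICODE *. Reading string->length therefore interprets character data as if it were an object header, which is where the 1080 comes from. The 19 in length is the real character count, and the buffer is only valid up to that point: anything printed past it is whatever binary data (pointers, other objects, padding) happens to follow in memory.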
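The 1080 can be reproduced without compiling anything. The sketch below is only an illustration under two assumptions that are not stated in the question but are consistent with the dump: a 32-bit little-endian Python 2 build, where the object header is ob_refcnt, ob_type and then length at 4 bytes each, and a UCS-4 build, where each character in the "u#" buffer takes 4 bytes. Under those assumptions, the "length" field lands on the third character of the string.

```python
import struct

text = "Привет, <b>мир!</b>"

# Assumption: UCS-4 build, so the "u#" buffer holds one 32-bit unit per character.
raw = text.encode("utf-32-le")

# Assumption: 32-bit object header laid out as ob_refcnt, ob_type, length.
# Overlaying that header on the character buffer mimics `string->length`.
ob_refcnt, ob_type, length = struct.unpack_from("<iii", raw, 0)

print(length, repr(chr(length)))  # the bogus "length" and the character it really is
print(len(text))                  # the real character count
```

The value read as length is simply the code point of "и" (U+0438), the third character of "Привет".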

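The easiest way to look at the string is to parse the argument as an object (for example with the "U" format, which stores a PyObject *) and then call PyObject_Print(obj, stdout, 0). The C functions for working with Python Unicode objects are documented here: http://docs.python.org/c-api/unicode.html#unicode-objects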
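Those Unicode functions can be tried from Python itself through ctypes.pythonapi before writing any C. This is a minimal sketch, assuming CPython 3.3 or later (the question's code targets Python 2, where the available functions differ); PyUnicode_GetLength and PyUnicode_AsUTF8String are real C API functions, while the variable names are just for illustration.

```python
import ctypes

api = ctypes.pythonapi  # the C API of the running CPython interpreter

# Declare the C signatures so ctypes passes the object pointer correctly.
api.PyUnicode_GetLength.argtypes = [ctypes.py_object]
api.PyUnicode_GetLength.restype = ctypes.c_ssize_t

api.PyUnicode_AsUTF8String.argtypes = [ctypes.py_object]
api.PyUnicode_AsUTF8String.restype = ctypes.py_object  # returns a new bytes object

text = "Привет, <b>мир!</b>"

n_chars = api.PyUnicode_GetLength(text)   # length in code points
utf8 = api.PyUnicode_AsUTF8String(text)   # encoded bytes, safe to hand to C string code

print(n_chars, len(utf8))
```

In an extension module the same calls would be made on the PyObject * obtained from PyArg_ParseTuple, and the UTF-8 bytes are what a byte-oriented routine such as printf can work with.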

Answered 2025-04-17 by Python大师
