Python 列表的二进制布局

1 投票

1 回答

1473 浏览

提问于 2025-04-18 09:28

我正在写一个程序，需要了解Python和Cython中不同数据容器的内存效率。其中一个容器就是标准的Python list。

Python的列表让我有点困惑，因为我不知道它在二进制层面是怎么运作的。与Python不同，C语言的数组比较容易理解，因为所有元素都是同一种类型，并且空间是在使用之前就声明好的。这意味着当程序员想要访问数组中的某个元素时，程序可以通过数学计算知道要去哪个内存地址。但是问题是，Python的列表可以存储多种不同类型的数据，甚至可以在一个列表里嵌套另一个列表。这些数据结构的大小总是在变化，而列表依然能够容纳它们，并且能够处理这些变化。那么，是否存在额外的分隔内存来让列表如此灵活呢？

如果可以的话，我希望能看到一个示例列表在内存中的实际二进制布局，并标注每个字节代表的内容。这将帮助我更好地理解列表的内部工作原理，因为我正在研究二进制层面的问题。

数据结构内存管理内存效率嵌套列表类型系统 cython 动态数组二进制布局

1 个回答

列表对象的定义在 Include/listobject.h 文件中。这个结构其实很简单：

typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;

而 PyObject_VAR_HEAD 的定义是：

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;

所以，列表对象大致长这样：

[ssize_t ob_refcnt]
[type *ob_type]
[ssize_t ob_size]
[object **ob_item] -> [object *][object *][object *]...
[ssize_t allocated]

需要注意的是，len 函数获取的是 ob_size 的值。

ob_item 指向一个 PyObject * 指针数组。列表中的每个元素都是一个 Python 对象，而 Python 对象在 C 语言的 API 层面上总是通过引用传递（也就是指向实际 PyObject 的指针）。因此，列表只存储指向对象的指针，而不是对象本身。

当列表装满时，它会重新分配空间。allocated 用来跟踪列表最多可以容纳多少个元素（在重新分配之前）。重新分配的算法在 Objects/listobject.c 中：

/* Ensure ob_item has room for at least newsize elements, and set
 * ob_size to newsize.  If newsize > ob_size on entry, the content
 * of the new slots at exit is undefined heap trash; it's the caller's
 * responsibility to overwrite them with sane values.
 * The number of allocated elements may grow, shrink, or stay the same.
 * Failure is impossible if newsize <= self.allocated on entry, although
 * that partly relies on an assumption that the system realloc() never
 * fails when passed a number of bytes <= the number of bytes last
 * allocated (the C standard doesn't guarantee this, but it's hard to
 * imagine a realloc implementation where it wouldn't be true).
 * Note that self->ob_item may change, and even if newsize is less
 * than ob_size on entry.
 */
static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    PyObject **items;
    size_t new_allocated;
    Py_ssize_t allocated = self->allocated;

    /* Bypass realloc() when a previous overallocation is large enough
       to accommodate the newsize.  If the newsize falls lower than half
       the allocated size, then proceed with the realloc() to shrink the list.
    */
    if (allocated >= newsize && newsize >= (allocated >> 1)) {
        assert(self->ob_item != NULL || newsize == 0);
        Py_SIZE(self) = newsize;
        return 0;
    }

    /* This over-allocates proportional to the list size, making room
     * for additional growth.  The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().
     * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
     */
    new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);

    /* check for integer overflow */
    if (new_allocated > PY_SIZE_MAX - newsize) {
        PyErr_NoMemory();
        return -1;
    } else {
        new_allocated += newsize;
    }

    if (newsize == 0)
        new_allocated = 0;
    items = self->ob_item;
    if (new_allocated <= (PY_SIZE_MAX / sizeof(PyObject *)))
        PyMem_RESIZE(items, PyObject *, new_allocated);
    else
        items = NULL;
    if (items == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    self->ob_item = items;
    Py_SIZE(self) = newsize;
    self->allocated = new_allocated;
    return 0;
}

从注释中可以看到，列表的增长速度相对较慢，按照以下顺序进行：

0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...

回答于 2025-04-18 由 Python大师

分享举报

Python 列表的二进制布局

1 个回答

撰写回答