Parsing lines of numeric strings with Cython

1 answer

User

#1 · Posted 2024-04-23 21:35:50

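I came up with a Cython solution that meets my needs. It uses the Cython cell magic for Jupyter notebooks to handle compilation. I chose 2000000 rows for the array initialization because that is a reasonable upper bound for my data. The function returns only the rows of the numpy array that were actually filled. Passing the numpy array to a pandas DataFrame afterwards is fairly cheap.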
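A minimal sketch of the preallocate-then-trim pattern described above, written in plain NumPy rather than Cython. The function name `fill_preallocated`, the `capacity` default, and the use of `None` to mark a junk row are assumptions made for illustration; they are not part of the original solution.

```python
import numpy as np

def fill_preallocated(rows, capacity=1000, n_cols=13):
    """Copy parsed rows into a preallocated float32 array and trim the unused tail."""
    data = np.zeros((capacity, n_cols), dtype=np.float32)
    n = 0
    for row in rows:
        if row is None:
            # Junk line: skip it without consuming a slot in the array.
            continue
        if n >= capacity:
            raise ValueError("more rows than the preallocated capacity")
        data[n, :] = row
        n += 1
    # Slicing returns a view on the buffer, so trimming copies nothing.
    return data[:n]

filled = fill_preallocated([[1.0] * 13, None, [2.0] * 13], capacity=10)
print(filled.shape)
```

The trimmed array can then be passed to `pandas.DataFrame` if a data frame is needed.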

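I'm not sure how much more optimization is possible, because I am also throwing away some garbage lines, which I think rules out memory mapping. I could use pointers as in an answer to another question I had, but if I moved a pointer instead of iterating over lines, it would be tricky to find the data in my file and to detect bad lines (see further below for more on the larger problem of reading pages of data).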

%%cython
import numpy as np
cimport numpy as np
np.import_array()
from libc.stdlib cimport atof

def read_with_cython(filename):
    # Preallocate for the expected upper bound on the number of rows.
    cdef float[:, ::1] data = np.zeros((2000000, 13), np.float32)
    cdef int i = 0
    with open(filename, 'rb') as f:
        for line in f:
            # Only well-formed data lines are exactly 133 bytes long.
            if len(line) == 133:
                data[i, 0] = atof(line[0:5])
                data[i, 1] = atof(line[5:10])
                data[i, 2] = atof(line[12:21])
                data[i, 3] = atof(line[23:32])
                data[i, 4] = atof(line[34:43])
                data[i, 5] = atof(line[45:54])
                data[i, 6] = atof(line[56:65])
                data[i, 7] = atof(line[67:76])
                data[i, 8] = atof(line[78:87])
                data[i, 9] = atof(line[89:98])
                data[i, 10] = atof(line[100:109])
                data[i, 11] = atof(line[111:120])
                data[i, 12] = atof(line[122:131])
                # Advance only after a row was filled, so that junk
                # lines do not leave empty rows behind.
                i += 1

    # Return only the rows that were actually filled.
    return data.base[:i]
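To make the slicing in the function above easier to follow, here is a pure-Python sketch of the same fixed-width layout. The layout is inferred from the slice offsets (two 5-character fields followed by eleven 11-character fields, plus a two-byte line ending); the helper names `make_line` and `parse_line` and the sample values are hypothetical.

```python
# Column slices taken from the Cython function: two 5-character fields,
# then eleven fields whose 9 significant characters follow 2 pad characters.
FIELD_SLICES = [(0, 5), (5, 10)] + [(12 + 11 * k, 21 + 11 * k) for k in range(11)]

def make_line(values):
    """Build a synthetic 133-byte record (131 characters plus CRLF)."""
    head = "%5d%5d" % (values[0], values[1])
    body = "".join("%11.3f" % v for v in values[2:])
    return (head + body + "\r\n").encode("ascii")

def parse_line(line):
    """Return 13 floats for a well-formed record, or None for a junk line."""
    if len(line) != 133:
        return None
    return [float(line[a:b]) for a, b in FIELD_SLICES]

sample = make_line([1, 2] + [k + 0.5 for k in range(11)])
print(parse_line(sample))
print(parse_line(b"not a data line\r\n"))
```

The length check is what lets the reader discard garbage lines cheaply before any number parsing happens.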

With this, I can run the following:

^{pr2}$

and get this result:

^{3}$

For comparison and completeness, I also wrote a quick pure-Python version:

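import numpy as np

def read_python(text):
    # np.float was removed in NumPy 1.24; np.float64 is the same type.
    data = np.zeros((300000, 13), dtype=np.float64)
    for i, line in enumerate(text.splitlines()):
        data[i, 0] = float(line[:5])
        data[i, 1] = float(line[5:10])
        # The remaining 11 fields are each 11 characters wide.
        for j in range(11):
            a = 10 + j*11
            b = a + 11
            data[i, j+2] = float(line[a:b])

    return data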

It ran in 1.15 seconds:

1.15 s ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
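Timings in this format are what IPython's `%timeit` magic prints. Outside a notebook, a rough equivalent can be sketched with the standard-library `timeit` and `statistics` modules. The helper `time_callable` and the stand-in workload below are hypothetical and are not the code that produced the numbers above.

```python
import statistics
import timeit

def time_callable(func, repeat=7, number=1):
    """Return (mean, stdev) of seconds per call over `repeat` runs."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [t / number for t in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)

def sample_workload():
    # Stand-in for a parser call: format and re-parse 10000 fixed-width floats.
    return [float("%11.3f" % (k * 0.5)) for k in range(10000)]

mean_s, std_s = time_callable(sample_workload)
print(f"{mean_s * 1e3:.2f} ms ± {std_s * 1e3:.2f} ms per loop")
```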

Then I tried porting it to a very simple Cython example, which ran in 717 ms:

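%%cython
def read_python_cy(text):
    # str.replace returns a new string, so keep the result. Without line
    # endings every record is exactly 131 characters wide.
    text = text.replace('\r\n', '')
    i = 0
    n = len(text)

    # Timing-only benchmark: the parsed values are discarded.
    while i + 131 <= n:
        float(text[i:i+5])
        float(text[i+5:i+10])
        for j in range(11):
            a = i + 10 + j*11
            b = a + 11
            float(text[a:b])
        i += 131

    return 0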

717 ms ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Then I broke down and worked out the more optimized Cython version above.

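At that point I realized that Cython could solve both this problem and a slow regex problem more efficiently. I had been using a regex to find and capture roughly 5000 pages of data, which I then concatenated into the table I wanted to read. A function closer to my actual Cython code is shown below. It finds the data pages, captures a page-level detail (the time), and then reads the actual data lines until it detects a stop flag (a line beginning with 0 or 1). My regex alone took 1 s just to extract the data I wanted, so overall this saved me a lot of time.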

%%cython
import numpy as np
cimport numpy as np
np.import_array()
from libc.stdlib cimport atof
import cython
from cpython cimport bool

def read_pages_cython(filename):

    cdef int n_pages = 0
    cdef bool reading_page = False
    cdef float[:, ::1] data = np.zeros((2000000, 14), np.float32)
    cdef int i = 0
    cdef float time
    with open(filename, 'rb') as f:
        for line in f:
            if not reading_page:
                # A page starts at a header line containing SUMMARY.
                if b'SUMMARY' in line:
                    time = atof(line[73:80])
                    reading_page = True
            else:
                if len(line) == 133:
                    data[i, 0] = atof(line[0:5])
                    # data[i, 1] = atof(line[5:10])
                    data[i, 2] = atof(line[12:21])
                    data[i, 3] = atof(line[23:32])
                    data[i, 4] = atof(line[34:43])
                    data[i, 5] = atof(line[45:54])
                    data[i, 6] = atof(line[56:65])
                    data[i, 7] = atof(line[67:76])
                    data[i, 8] = atof(line[78:87])
                    data[i, 9] = atof(line[89:98])
                    data[i, 10] = atof(line[100:109])
                    data[i, 11] = atof(line[111:120])
                    data[i, 12] = atof(line[122:131])
                    data[i, 13] = time
                    # Advance only after a row was filled.
                    i += 1

                # A line starting with 0 or 1 is a stop flag, unless it is
                # itself the header of the next page.
                if len(line) > 6:
                    if line[:1] == b'1':
                        if b'SUMMARY' in line:
                            time = atof(line[73:80])
                            reading_page = True
                        else:
                            reading_page = False

                    elif line[:1] == b'0':
                        reading_page = False

    # Return only the rows that were actually filled.
    return data.base[:i]
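The page-reading logic above is a small state machine, which may be easier to see in a simplified pure-Python sketch. The synthetic header layout (a leading `1`, the word SUMMARY, and the time in bytes 73 to 79) is inferred from the slices in the function; `make_header` and `iter_page_rows` are hypothetical names. Unlike the original, this sketch treats any line containing SUMMARY as a page header.

```python
def make_header(time):
    """Build a synthetic page header with the time in bytes 73-79."""
    return ("1" + " SUMMARY".ljust(72) + "%7.2f" % time + "\r\n").encode("ascii")

def iter_page_rows(lines):
    """Yield (page_time, line) for each 133-byte line inside a page."""
    reading_page = False
    time = None
    for line in lines:
        if b"SUMMARY" in line:
            # Page header: capture the page-level time and start reading.
            time = float(line[73:80])
            reading_page = True
        elif reading_page and line[:1] in (b"0", b"1"):
            # Stop flag: ignore everything until the next header.
            reading_page = False
        elif reading_page and len(line) == 133:
            yield time, line

# Synthetic input: a blank 133-byte record stands in for a real data line.
record = b" " * 131 + b"\r\n"
lines = [record, make_header(1.5), record, b"short\r\n", record,
         b"0 end of page\r\n", record, make_header(2.25), record]
times = [t for t, _ in iter_page_rows(lines)]
print(times)
```

Carrying the page time along with each row is what removes the need for a separate regex pass over the file.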
