Processing files in memory with Python

1 vote

3 answers

2415 views

Asked 2025-04-18 04:50

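I am reading data files stored in Excel format that are downloaded from the web. My current workflow first saves the file to disk with the retrieveFile function defined below, which uses the urllib2 library, and then parses the Excel document with the traverseWorkbook function, which uses the xlrd library.

I would like to do the same thing without saving the file to disk: keep the file in memory and parse it there.

I am not sure where to start, but I believe it is possible.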

import urllib2

def retrieveFile(url, filename):
    # Download url to filename in 16 KiB chunks.
    # Returns True on success and None on any failure.
    try:
        req = urllib2.urlopen(url)
        CHUNK = 16 * 1024
        with open(filename, 'wb') as fp:
            while True:
                chunk = req.read(CHUNK)
                if not chunk:
                    break
                fp.write(chunk)
        return True
    except Exception:
        return None


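from xlrd import open_workbook

def traverseWorkbook(filename):
    values = []

    wb = open_workbook(filename)
    for s in wb.sheets():
        for row in range(s.nrows):
            # Rows 0-10 are skipped; only later rows are processed.
            if row > 10:
                # processRow is defined elsewhere (not shown here).
                rowData = processRow(s, row, type)
                if rowData:
                    values.append(rowData)
    return values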

data-processing urllib2 xlrd file-parsing data-download excel in-memory-files

3 Answers

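You can use pandas for this. Its advantage is that pandas is optimized for working with data in memory, since much of its computation runs in C rather than in pure Python. It also spares you some of the fiddly details of downloading the data.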

import pandas as pd

# url is the address of the Excel file; pandas fetches it for you.
xl = pd.ExcelFile(url, engine='xlrd')
sheets = xl.sheet_names

# Work with the first sheet, or iterate through sheets if there is more than one.
df = xl.parse(sheets[0])

# The file is now a DataFrame.
# You can manipulate the data in memory using the pandas API.
# ...
# ...

# After massaging the data, write it to an xls file:
out_file = '~/Documents/out_file.xls'
df.to_excel(out_file, index=False)

Answered 2025-04-18 by Python大师


You can use the StringIO library and write the downloaded data into a file-like StringIO object instead of a regular file.

import urllib2
import cStringIO as cs
from contextlib import closing

def retrieveFile(url):
    # Returns the downloaded file's contents as a string, or None on failure.
    # Nothing is written to disk, so no filename argument is needed.
    try:
        req = urllib2.urlopen(url)
        CHUNK = 16 * 1024
        full_str = None
        with closing(cs.StringIO()) as fp:
            while True:
                chunk = req.read(CHUNK)
                if not chunk:
                    break
                fp.write(chunk)
            full_str = fp.getvalue()  # The full contents of the downloaded file.
        return full_str
    except Exception:
        return None
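
For readers on Python 3, where cStringIO no longer exists, a roughly equivalent sketch can use io.BytesIO. The helper name read_into_memory and its parameters are made up for illustration, and the source is assumed to be any object with a read(size) method, such as the response returned by urllib.request.urlopen.

```python
import io


def read_into_memory(source, chunk_size=16 * 1024):
    """Copy a file-like source into an in-memory buffer, chunk by chunk.

    read_into_memory is a hypothetical helper, not part of any library.
    """
    buffer = io.BytesIO()
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    buffer.seek(0)  # Rewind so callers can read from the start.
    return buffer


# Stand-in for a network response, so the sketch runs offline.
fake_response = io.BytesIO(b"x" * 40000)
in_memory_file = read_into_memory(fake_response)
print(len(in_memory_file.getvalue()))
```

The returned buffer behaves like an open binary file, and getvalue() gives the raw bytes if a library expects those instead.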

Answered 2025-04-18 by Python大师


You can read the whole file into memory like this:

data = urllib2.urlopen(url).read()
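
On Python 3, urllib2 was merged into urllib.request, so the same one-shot read might look like the sketch below. The function name fetch_bytes is hypothetical, and the data: URL is only there so the example runs without network access; a real call would pass the http(s) address of the workbook.

```python
import urllib.request


def fetch_bytes(url):
    """Return the full body of url as bytes (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


# A data: URL stands in for the real download address (an assumption for
# this sketch); it carries the base64 encoding of b"hello".
data = fetch_bytes("data:text/plain;base64,aGVsbG8=")
print(data)
```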

Once the file is in memory, you can load it into xlrd with the file_contents parameter of open_workbook:

wb = xlrd.open_workbook(url, file_contents=data)

According to the documentation, you can pass the URL as the filename because it may be used in messages; otherwise it is ignored.

So your traverseWorkbook function can be rewritten as:

import urllib2
import xlrd

def traverseWorkbook(url):
    values = []
    data = urllib2.urlopen(url).read()
    # The URL is passed as the filename only so xlrd can use it in messages.
    wb = xlrd.open_workbook(url, file_contents=data)
    for s in wb.sheets():
        for row in range(s.nrows):
            if row > 10:
                rowData = processRow(s, row, type)
                if rowData:
                    values.append(rowData)
    return values
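
The row filter in this loop can also be expressed by starting the range at the first wanted row. The sketch below assumes that shape. The names collect_rows, FakeSheet, and the process_row callback are illustrative only (the original processRow is not shown), and FakeSheet merely mimics the nrows attribute and row_values method of an xlrd sheet so the example runs without a workbook.

```python
def collect_rows(sheet, process_row, first_row=11):
    """Apply process_row to every row from first_row on; keep truthy results."""
    values = []
    for row in range(first_row, sheet.nrows):
        row_data = process_row(sheet, row)
        if row_data:
            values.append(row_data)
    return values


class FakeSheet:
    """Stand-in exposing the two sheet members this loop relies on."""

    def __init__(self, rows):
        self._rows = rows
        self.nrows = len(rows)

    def row_values(self, rowx):
        return self._rows[rowx]


# Fifteen one-cell rows; only rows 11 through 14 pass the filter.
sheet = FakeSheet([[i] for i in range(15)])
kept = collect_rows(sheet, lambda s, r: s.row_values(r))
print(kept)
```

Starting the range at 11 skips the same rows as the `if row > 10` check while removing one level of nesting.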

Answered 2025-04-18 by Python大师
