从密码保护的Excel文件到pandas DataFrame
我可以用这个方法打开一个有密码保护的Excel文件:
import sys
import win32com.client
xlApp = win32com.client.Dispatch("Excel.Application")
print "Excel library version:", xlApp.Version
filename, password = sys.argv[1:3]
xlwb = xlApp.Workbooks.Open(filename, Password=password)
# xlwb = xlApp.Workbooks.Open(filename)
xlws = xlwb.Sheets(1) # counts from 1, not from 0
print xlws.Name
print xlws.Cells(1, 1) # that's A1
不过我不太确定怎么把里面的信息转到一个pandas的数据框里。我是需要一个一个单元格地读取,还是有更方便的方法可以做到这一点呢?
6 个回答
7
来自David Hamann的网站(所有荣誉归他)
https://davidhamann.de/2018/02/21/read-password-protected-excel-files-into-pandas-dataframe/
使用xlwings这个工具,打开文件时会先启动Excel应用程序,这样你就可以输入密码了。
import pandas as pd
import xlwings as xw
PATH = '/Users/me/Desktop/xlwings_sample.xlsx'
wb = xw.Book(PATH)
sheet = wb.sheets['sample']
df = sheet['A1:C4'].options(pd.DataFrame, index=False, header=True).value
df
17
简单的解决方案
import io
import pandas as pd
import msoffcrypto
passwd = 'xyz'
decrypted_workbook = io.BytesIO()
with open(path_to_your_file, 'rb') as file:
office_file = msoffcrypto.OfficeFile(file)
office_file.load_key(password=passwd)
office_file.decrypt(decrypted_workbook)
df = pd.read_excel(decrypted_workbook, sheet_name='abc')
pip install --user msoffcrypto-tool
将目录及子目录中每个Excel文件的所有工作表导出为单独的CSV文件
from glob import glob
PATH = "Active Cons data"
# Scaning all the excel files from directories and sub-directories
excel_files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.xlsx'))]
for i in excel_files:
print(str(i))
decrypted_workbook = io.BytesIO()
with open(i, 'rb') as file:
office_file = msoffcrypto.OfficeFile(file)
office_file.load_key(password=passwd)
office_file.decrypt(decrypted_workbook)
df = pd.read_excel(decrypted_workbook, sheet_name=None)
sheets_count = len(df.keys())
sheet_l = list(df.keys()) # list of sheet names
print(sheet_l)
for i in range(sheets_count):
sheet = sheet_l[i]
df = pd.read_excel(decrypted_workbook, sheet_name=sheet)
new_file = f"D:\\all_csv\\{sheet}.csv"
df.to_csv(new_file, index=False)
6
假设我们有一个起始单元格,位置是 (StartRow, StartCol),还有一个结束单元格,位置是 (EndRow, EndCol)。我发现下面的方法对我来说很有效:
# Get the content in the rectangular selection region
# content is a tuple of tuples
content = xlws.Range(xlws.Cells(StartRow, StartCol), xlws.Cells(EndRow, EndCol)).Value
# Transfer content to pandas dataframe
dataframe = pandas.DataFrame(list(content))
注意:在 win32com 中,Excel 的 B5 单元格被表示为第 5 行,第 2 列。此外,我们需要用 list(...) 来把元组的元组转换成元组的列表,因为没有办法直接用 pandas.DataFrame 来处理元组的元组。