XLRD/Python: 用for循环将Excel文件读取为字典
我想读取一个包含15个字段和大约2000行的Excel工作簿,并把每一行转换成Python中的字典。然后,我想把每个字典添加到一个列表里。我希望工作簿顶部的每个字段都成为字典中的一个键,而对应的单元格值则是字典中的值。我已经看过一些例子,这里和这里,但我想做一些不同的事情。第二个例子可以用,但我觉得如果先遍历顶部的字段来填充字典的键,然后再遍历每一行来获取值,会更有效率。我的Excel文件包含讨论论坛的数据,样子大概是这样的(当然还有更多列):
id thread_id forum_id post_time votes post_text
4 100 3 1377000566 1 'here is some text'
5 100 4 1289003444 0 'even more text here'
所以,我希望字段 id
、thread_id
等等,成为字典的键。我希望我的字典看起来像这样:
{id: 4,
thread_id: 100,
forum_id: 3,
post_time: 1377000566,
votes: 1,
post_text: 'here is some text'}
最开始,我有一些代码在遍历这个文件,但我的循环范围设置得不对,导致生成了太多字典。以下是我最初的代码:
import xlrd
from xlrd import open_workbook, cellname
book = open_workbook('forum.xlsx', 'r')
sheet = book.sheet_by_index(3)
dict_list = []
for row_index in range(sheet.nrows):
for col_index in range(sheet.ncols):
d = {}
# My intuition for the below for-loop is to take each cell in the top row of the
# Excel sheet and add it as a key to the dictionary, and then pass the value of
# current index in the above loops as the value to the dictionary. This isn't
# working.
for i in sheet.row(0):
d[str(i)] = sheet.cell(row_index, col_index).value
dict_list.append(d)
任何帮助都将非常感谢。提前谢谢你的阅读。
6 个回答
这个脚本可以帮助你把Excel里的数据转换成字典的列表。
import xlrd
workbook = xlrd.open_workbook('forum.xls')
workbook = xlrd.open_workbook('forum.xls', on_demand = True)
worksheet = workbook.sheet_by_index(0)
first_row = [] # The row where we stock the name of the column
for col in range(worksheet.ncols):
first_row.append( worksheet.cell_value(0,col) )
# tronsform the workbook to a list of dictionnary
data =[]
for row in range(1, worksheet.nrows):
elm = {}
for col in range(worksheet.ncols):
elm[first_row[col]]=worksheet.cell_value(row,col)
data.append(elm)
print data
首先,试着只解析第一行的所有列来设置你的键。然后再写一个函数来解析数据,最后按顺序调用它们。
all_fields_list = []
header_dict = {}
def parse_data_headers(sheet):
global header_dict
for c in range(sheet.ncols):
key = sheet.cell(1, c) #here 1 is the row number where your header is
header_dict[c] = key #store it somewhere, here I have chosen to store in a dict
def parse_data(sheet):
for r in range(2, sheet.nrows):
row_dict = {}
for c in range(sheet.ncols):
value = sheet.cell(r,c)
row_dict[c] = value
all_fields_list.append(row_dict)
试试这个吧。下面这个函数会返回一个生成器,里面包含每一行和每一列的字典。
from xlrd import open_workbook
for row in parse_xlsx():
print row # {id: 4, thread_id: 100, forum_id: 3, post_time: 1377000566, votes: 1, post_text: 'here is some text'}
def parse_xlsx():
workbook = open_workbook('excelsheet.xlsx')
sheets = workbook.sheet_names()
active_sheet = workbook.sheet_by_name(sheets[0])
num_rows = active_sheet.nrows
num_cols = active_sheet.ncols
header = [active_sheet.cell_value(0, cell).lower() for cell in range(num_cols)]
for row_idx in xrange(1, num_rows):
row_cell = [active_sheet.cell_value(row_idx, col_idx) for col_idx in range(num_cols)]
yield dict(zip(header, row_cell))
在编程中,有时候我们需要处理一些数据,这些数据可能会有重复的部分。为了让我们的程序更高效,我们可以使用一种叫做“集合”的东西。集合就像一个装东西的盒子,但这个盒子有个特点,就是里面的东西不能重复。
比如说,如果你有一个装着水果的盒子,里面有苹果、香蕉和苹果,那么这个盒子里的苹果就会被自动去掉,只剩下香蕉和一个苹果。这样,我们就能更清楚地知道盒子里到底有什么。
在代码中,我们可以使用集合来存储这些不重复的数据,这样在处理的时候就会更方便,也能节省一些内存空间。
总之,集合就是一个帮助我们管理数据的工具,让我们可以更轻松地处理那些不想要重复的内容。
from xlrd import open_workbook
dict_list = []
book = open_workbook('forum.xlsx')
sheet = book.sheet_by_index(3)
# read first row for keys
keys = sheet.row_values(0)
# read the rest rows for values
values = [sheet.row_values(i) for i in range(1, sheet.nrows)]
for value in values:
dict_list.append(dict(zip(keys, value)))
print dict_list
这个想法是,首先把表格的标题读入一个列表中。接着,从标题下面的第一行开始,逐行读取表格的内容,根据标题的键和相应的单元格值创建一个新的字典,然后把这个字典添加到字典的列表里:
from xlrd import open_workbook
book = open_workbook('forum.xlsx')
sheet = book.sheet_by_index(3)
# read header values into the list
keys = [sheet.cell(0, col_index).value for col_index in xrange(sheet.ncols)]
dict_list = []
for row_index in xrange(1, sheet.nrows):
d = {keys[col_index]: sheet.cell(row_index, col_index).value
for col_index in xrange(sheet.ncols)}
dict_list.append(d)
print dict_list
假设表格内容是:
A B C D
1 2 3 4
5 6 7 8
那么它会输出:
[{'A': 1.0, 'C': 3.0, 'B': 2.0, 'D': 4.0},
{'A': 5.0, 'C': 7.0, 'B': 6.0, 'D': 8.0}]
更新(扩展字典理解的部分):
d = {}
for col_index in xrange(sheet.ncols):
d[keys[col_index]] = sheet.cell(row_index, col_index).value