在Python中使用csv.DictReader进行数据类型转换的最快方法

8 投票

3 回答

8349 浏览

提问于 2025-04-16 11:13

我正在用Python处理一个CSV文件，使用时大约有10万行。每一行都有一组维度（以字符串形式表示）和一个数值（浮点数）。

因为csv.DictReader或csv.reader返回的值都是字符串，所以我现在需要遍历所有行，把那个唯一的数字值转换成浮点数。

for i in csvDict:
    i[col] = float(i[col])

有没有人能建议更好的方法来做到这一点？我尝试过用map、izip、itertools等各种组合，并且搜索了很多更高效的做法，但遗憾的是没有太大成功。

如果有帮助的话：我是在appengine上做这个。我认为我现在的做法可能导致我遇到这个错误：在处理完11个请求后，超过了267.789 MB的软进程大小限制 - 只有在CSV文件比较大的时候才会出现这个问题。

编辑：我的目标 我正在解析这个CSV，以便将其用作Google可视化API的数据源。最终的数据集将被加载到一个gviz DataTable中以供查询。在构建这个表时必须指定类型。如果有人知道一个好的Python gviz csv到datatable的转换器，我的问题也能解决！

编辑2：我的代码

我认为我的问题与我尝试修复Csv类型的方式有关。此外，data_table.LoadData()需要一个可迭代的对象。

class GvizFromCsv(object):
  """Convert CSV to Gviz ready objects."""

  def __init__(self, csvFile, dateTimeFormat=None):
    self.fileObj = StringIO.StringIO(csvFile)
    self.csvDict = list(csv.DictReader(self.fileObj))
    self.dateTimeFormat = dateTimeFormat
    self.headers = {}
    self.ParseHeaders()
    self.fixCsvTypes()

  def IsNumber(self, st):
    try:
        float(st)
        return True
    except ValueError:
        return False

  def IsDate(self, st):
    try:
      datetime.datetime.strptime(st, self.dateTimeFormat)
    except ValueError:
      return False

  def ParseHeaders(self):
    """Attempts to figure out header types for gviz, based on first row"""
    for k, v in self.csvDict[0].items():
      if self.IsNumber(v):
        self.headers[k] = 'number'
      elif self.dateTimeFormat and self.IsDate(v):
        self.headers[k] = 'date'
      else:
        self.headers[k] = 'string'

  def fixCsvTypes(self):
    """Only fixes numbers."""
    update_to_numbers = []
    for k,v in self.headers.items():
      if v == 'number':
        update_to_numbers.append(k)
    for i in self.csvDict:
      for col in update_to_numbers:
        i[col] = float(i[col])

  def CreateDataTable(self):
    """creates a gviz data table"""
    data_table = gviz_api.DataTable(self.headers)
    data_table.LoadData(self.csvDict)
    return data_table

csv appengine data processing data conversion iterable objects google visualization api float conversion data table

3 个回答

这里有两个不同的概念：
“数据源”和“数据表”。

“数据源”是指通过谷歌可视化API服务器提供的格式化数据，这是一种可视化的网络服务：

This page describes how you can implement a data source to feed data
to visualizations built on the Google Visualization API. 

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source.html

“数据源”这个名字还包含了“传输协议”的意思：

In response [to a request], the data source returns properly formatted data 
that the visualization can use to render the graphic on the page. 
This request-response protocol is known as the Google Visualization API wire protocol,

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html

要实现“数据源”，有两种可能的方法：

• Use one of the data source libraries listed in the Data Sources and Tools Gallery. 
All the data source libraries listed on that page implement the wire protocol.

• Write your own data source from scratch, 

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html

从以下内容来看：

• ... Data Sources and Tools Gallery : (....) You therefore need write only the
code needed to make your data available to the library in the form of a data table. 

• Write your own data source from scratch, as described in the
Writing your own Data Source

我明白，从零开始，我们需要自己实现传输协议和创建“数据表”，而使用数据源库时，我们只需要创建“数据表”。

关于创建“数据源”的内容有很多页面。

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html

http://code.google.com/intl/fr/apis/visualization/documentation/dev/gviz_api_lib.html

在我看来，地址为 http://groups.google.com/group/google-visualization-api/browse_thread/thread/9d1d941e0f0b32ed 的例子是关于创建“数据源”的，但那里的回答让我有些怀疑。不过我对此不是很清楚。

不过，这些页面和主题对你来说并不有趣，因为你其实想知道的是如何准备被称为“数据表”的数据，以便通过“数据源”提供，而不是构建“数据源”。

3.Prepare your data. You'll need to prepare the data to visualize; 
this means either specifying the data yourself in code, 
or querying a remote site for data.

http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#keycomponents

A visualization stores the data that it visualizes as two-dimensional data table with 
rows and columns.
Cells are referenced by (row, column) where row is a zero-based row number, and column
is either a zero-based column index or a unique ID that you can specify. 

http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#preparedata

所以，准备“数据表”是关键。

这里是：

There are two ways to create/populate your visualization's data table:

•Query a data provider. A data provider is another site that returns
a populated DataTable in response to a request from your code. 
Some data providers also accept SQL-like query strings to sort or 
filter the data. See Data Queries for more information and an example
of a query.

•Create and populate your own DataTable by hand. You can populate your
DataTable in code on your page. The simplest way to do this is to create
a DataTable object without any data and populate it by calling addRows()
on it. You can also pass a JavaScript literal representation of the data
table into the DataTable constructor, but this is more complex and is
covered on the reference page.

http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#preparedata

更多信息可以在这里找到：

2. Describe your table schema
The table schema is specified by the table_description parameter
passed into the constructor. You cannot change it later. 
The schema describes all the columns in the table: the data type of
each column, the ID, and an optional label.

Each column is described by a tuple: (ID [,data_type [,label [,custom_properties]]]). 



The table schema is a collection of column descriptor tuples. 
Every list member, dictionary key or dictionary value must be either 
another collection or a descriptor tuple. You can use any combination 
of dictionaries or lists, but every key, value, or member must
eventually evaluate to a descriptor tuple. Here are some examples.

•List of columns: [('a', 'number'), ('b', 'string')]
•Dictionary of lists: {('a', 'number'): [('b', 'number'), ('c', 'string')]}
•Dictionary of dictionaries: {('a', 'number'): {'b': 'number', 'c': 'string'}}
•And so on, with any level of nesting.


3. Populate your data
To add data to the table, build a structure of data elements in the
exact same structure as the table schema. So, for example, if your
schema is a list, the data must be a list: 

•schema: [("color", "string"), ("shape", "string")] 
•data: [["blue", "square"], ["red", "circle"]] 
If the schema is a dictionary, the data must be a dictionary:

•schema: {("rowname", "string"): [("color", "string"), ("shape", "string")] }
•data: {"row1": ["blue", "square"], "row2": ["red", "circle"]}

http://code.google.com/intl/fr/apis/visualization/documentation/dev/gviz_api_lib.html#populatedata

最后，我想说，对于你的问题，你需要定义一个“表模式”，并处理你的CSV文件，以获得与表模式完全相同结构的数据元素。

列中数据类型的定义是在“表模式”的定义中完成的。如果填充“数据表”时必须使用正确类型的数据（我想说不是字符串），我可以帮助你写提取CSV数据的代码，这很简单。

希望这些内容都是正确的，并能对你有所帮助。

回答于 2025-04-16 由 Python大师

分享举报

首先，如果你只是想展示这些数据，其实不需要做任何转换。gviz可以直接处理JSON（就是文本格式的）或者CSV（你已经有这个文件了，不需要解析！）。你只需把文件放在一个合适的网络服务器上，然后通过gviz发出的GET请求来访问它，基本上可以忽略请求中的参数。

但假设你需要处理这些数据。看起来你不仅是读取CSV文件，还想把它完全存储在内存中。这可能不太实际：随着你添加更多的处理，内存的限制会越来越快地到达。建议你一次处理一行数据（或者如果你使用窗口过滤等方法，可以处理合理数量的行），然后把处理好的数据存储到数据存储中，而不是放到任何列表里。当通过GET请求提供数据时，读取/处理一行数据，写入响应中，而不是放到任何列表或其他地方。

我觉得转换的方法没问题，只要你在后面的代码中合理使用i，而不是在过程中记住所有的i。

回答于 2025-04-16 由 Python大师

分享举报

我最开始是用正则表达式来处理CSV文件，但因为文件里的数据每一行都排得很整齐，所以我们可以直接用split()函数。

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    def transf(surname,x,y):
        return (surname,float(x),float(y))

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( transf(*line.split(',')[0:3]) for line in f )
    # to populate the data table by iterating in the CSV file

或者也可以不定义函数：

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    datdata_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])] for line in f )    
    # to populate the data table by iterating in the CSV file

有一段时间，我以为我必须一行一行地填充数据表，因为我在用正则表达式，这样需要先获取匹配的组，然后再处理数字的字符串。不过用split()的话，就可以用LoadData()一条指令搞定。

所以，你的代码可以简化一下。顺便说一下，我觉得没必要继续定义一个类。对我来说，定义一个函数就足够了：

def GvizFromCsv(filename):
  """ creates a gviz data table from a CSV file """

  data_table = gviz_api.DataTable([('col1','string','SURNAME'),
                                   ('col2','number','ONE'    ),
                                   ('col3','number','TWO'    ) ])

  #  --- with such a table schema , lines in the file must be like that: ---  
  #  blah, number, number, ...anything else...\n 
  #  SMITH,1.006,1.006, ...anything else...\n 
  #  JOHNSON,0.810,1.816, ...anything else...\n 
  #  WILLIAMS,0.699,2.515, ...anything else...\n

  with open(filename) as f:
    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])]
                         for line in f )
  return data_table

现在你需要检查一下，如何从另一个API读取CSV数据，并把它插入到这个代码里，以保持填充数据表的循环原则。

回答于 2025-04-16 由 Python大师

分享举报

在Python中使用csv.DictReader进行数据类型转换的最快方法

3 个回答

撰写回答