Processing array data in Python

Published 2024-03-29 11:18:34


I am using Python to process data from a CSV file. After reading the CSV into an array, my data looks like this:

data = [
    ["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
    ["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
    ["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]

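I know I could use a "for" loop, but my data has several million records and a "for" loop takes too long to run. Is there a way to perform the tasks listed below without using "for"?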

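  1. Convert the values in column 1 from string to integer (e.g. "10" -> 10)
  2. Prepend "http://" to the values in column 3 (e.g. "url10" -> "http://url10")
  3. Convert the values in column 4 to boolean (e.g. "False" -> False)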

Thank you very much!


2 answers

You can use map with a predefined function. On larger input, map tends to be slightly faster than a list comprehension:

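def clean_data(row):
    # Unpack the five columns of one record
    val, date, name, url, truthy = row
    # int for the id, "http://" prefix for the URL, a real bool for the flag
    return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']


data = [
    ["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
    ["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
    ["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]
# map applies clean_data to every row; list() materializes the result
print(list(map(clean_data, data)))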

Output:

[[10, '2018-03-22 14:38:18.329963', 'name 10', 'http://url10', True], [11, '2018-03-22 14:38:18.433497', 'name 11', 'http://url11', False], [12, '2018-03-22 14:38:18.532312', 'name 12', 'http://url12', False]]
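Since the question starts from a CSV file, the same function could also be applied while the rows are being read, instead of building the raw list first. The following is only a minimal sketch using the standard csv module; the in-memory `sample_csv` text is a hypothetical stand-in for the real file, which the question does not show.

```python
import csv
import io


def clean_data(row):
    val, date, name, url, truthy = row
    return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']


# Hypothetical CSV content standing in for the real file (an assumption).
sample_csv = io.StringIO(
    '10,2018-03-22 14:38:18.329963,name 10,url10,True\n'
    '11,2018-03-22 14:38:18.433497,name 11,url11,False\n'
)

# csv.reader yields one list of strings per line; map applies clean_data
# lazily to each of them, and list() collects the converted rows.
reader = csv.reader(sample_csv)
cleaned = list(map(clean_data, reader))
print(cleaned)
```

With a real file, `open(path, newline='')` would take the place of the `io.StringIO` object. Note that map still visits every row once; it only avoids writing the loop by hand.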

Pandas is an option if you don't mind spending some time loading the data into a DataFrame first.

Below is a solution using Pandas, followed by a simple comparison of its running time against the map solution.

import pandas as pd
from datetime import datetime

data = [
    ["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
    ["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
    ["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]*10000  # repeat 10000 times to simulate large data; try a larger multiplier as well

# Pandas
df = pd.DataFrame(data=data, columns=['seq', 'datetime', 'name', 'url', 'boolean'])
# Timing starts after the DataFrame is built, so the load time is not included
pandas_beg = datetime.now()
df['seq'] = df['seq'].astype(int)         # string -> integer
df['url'] = 'http://' + df['url']         # prepend the scheme to every URL
df['boolean'] = df['boolean'] == 'True'   # string -> bool
pandas_end = datetime.now()
print('pandas: ', (pandas_end - pandas_beg))

# map
def clean_data(row):
    val, date, name, url, truthy = row
    return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']

map_beg = datetime.now()
result = list(map(clean_data, data))
map_end = datetime.now()
print('map: ', (map_end - map_beg))

Output:

pandas:  0:00:00.016091
map:  0:00:00.036025
[Finished in 0.997s]
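A single datetime.now() difference can be noisy, so the figures above are best read as a rough indication. As a sketch of a more repeatable measurement, the standard timeit module can run each variant several times and report the best run. The helper names `with_map` and `with_comprehension` are illustrative and not part of the answers; only the standard-library variants are timed here, and no timings are claimed.

```python
import timeit


def clean_data(row):
    val, date, name, url, truthy = row
    return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']


data = [
    ["10", "2018-03-22 14:38:18.329963", "name 10", "url10", "True"],
    ["11", "2018-03-22 14:38:18.433497", "name 11", "url11", "False"],
    ["12", "2018-03-22 14:38:18.532312", "name 12", "url12", "False"],
] * 10000  # same simulated size as in the answer


def with_map():
    return list(map(clean_data, data))


def with_comprehension():
    return [clean_data(row) for row in data]


# Run each variant 5 times and keep the fastest run, which is the
# measurement least disturbed by other activity on the machine.
map_best = min(timeit.repeat(with_map, number=1, repeat=5))
comp_best = min(timeit.repeat(with_comprehension, number=1, repeat=5))
print('map:           {:.4f}s'.format(map_best))
print('comprehension: {:.4f}s'.format(comp_best))
```

The Pandas steps could be wrapped in a function and timed the same way; whether to include the DataFrame construction in that function depends on whether load time matters for the use case.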
