Efficiently parsing a time format in pandas/Python

+-----------+------+--------------+
| invoiceNo | time | invoiceValue |
+-----------+------+--------------+
| A         | 6    | 2            |
+-----------+------+--------------+
| B         | 12   | 3            |
+-----------+------+--------------+
| C         | 356  | 5            |
+-----------+------+--------------+
| D         | 2145 | 6            |
+-----------+------+--------------+

df = pd.DataFrame({'invoiceNo': ['A', 'B', 'C', 'D'],
                   'time': [6, 12, 356, 2145],
                   'invoiceValue': [2, 3, 5, 6]})

df['adj-time'] = df['time'].apply(lambda x: '{0:0>4}'.format(x))
df['adj-time'] = df['adj-time'].apply(lambda x: pd.to_datetime(x, format='%H%M'))
df['hour'] = df['adj-time'].apply(lambda x: x.hour)
df.drop('adj-time', axis=1, inplace=True)

+-----------+------+--------------+------+
| invoiceNo | time | invoiceValue | hour |
+-----------+------+--------------+------+
| A         | 6    | 2            | 0    |
+-----------+------+--------------+------+
| B         | 12   | 3            | 0    |
+-----------+------+--------------+------+
| C         | 356  | 5            | 3    |
+-----------+------+--------------+------+
| D         | 2145 | 6            | 21   |
+-----------+------+--------------+------+

3 answers

User

Reply 1 · edited 2024-05-15 22:11:27

If time is an integer:

hour = int(time/100)

If it is a string:

hour = int(int(time)/100)
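
The same arithmetic can be applied to a whole column at once instead of one value at a time. This is a minimal sketch, assuming the 'time' column holds HHMM values as integers or as digit strings that can be cast to integers. The 'minute' column is added here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'invoiceNo': ['A', 'B', 'C', 'D'],
                   'time': [6, 12, 356, 2145],
                   'invoiceValue': [2, 3, 5, 6]})

# Cast first so the same line works for integer and digit-string input.
hhmm = df['time'].astype(int)

# Integer division by 100 drops the two minute digits and leaves the hour.
df['hour'] = hhmm // 100
# The remainder holds the minutes (illustrative extra column).
df['minute'] = hhmm % 100

print(df)
```

Note that this does no validation: a value such as 2575 would silently yield hour 25, whereas parsing with format='%H%M' would raise an error.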

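User

Reply 2 · edited 2024-05-15 22:11:27

You can also use zfill.
Cast 'time' to a string, zero-pad it to four characters, convert it to a datetime, and extract the hour component.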

df['hour'] = pd.to_datetime(df.time.astype('str').str.zfill(4), format='%H%M').dt.hour

# display(df)
  invoiceNo  time  invoiceValue  hour
0         A     6             2     0
1         B    12             3     0
2         C   356             5     3
3         D  2145             6    21

Reading from a CSV

Set the dtype of the 'time' column when reading the data in with pd.read_csv, so that .astype('str') is not needed.

df = pd.read_csv('test.csv', dtype={'time': str})
df['hour'] = pd.to_datetime(df.time.str.zfill(4), format='%H%M').dt.hour

`timeit` test

# 2M rows of data
df = pd.DataFrame({'time':[6,12,356,2145]})
dft = pd.concat([df] * 500000).reset_index(drop=True)

%%timeit
pd.to_datetime(dft.time.astype('str').str.zfill(4), format='%H%M').dt.hour
[out]:
1.51 s ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.to_numeric(dft.time.astype(str).str.zfill(4).str[0:2])
[out]:
2.6 s ± 41.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
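
Outside a notebook, a comparable measurement can be made with the standard-library timeit module. This is only a sketch: the row count is reduced to 200,000 to keep the run short, the function names are made up for illustration, and the integer-division variant is an extra comparison that was not part of the timings above. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'time': [6, 12, 356, 2145]})
dft = pd.concat([df] * 50000).reset_index(drop=True)  # 200,000 rows


def via_datetime():
    # Pad to HHMM, parse as datetime, take the hour.
    return pd.to_datetime(dft['time'].astype(str).str.zfill(4), format='%H%M').dt.hour


def via_slicing():
    # Pad to HHMM, slice the first two characters, convert to numbers.
    return pd.to_numeric(dft['time'].astype(str).str.zfill(4).str[0:2])


def via_integer_division():
    # Pure integer arithmetic, no string conversion.
    return dft['time'] // 100


for fn in (via_datetime, via_slicing, via_integer_division):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f'{fn.__name__}: {seconds:.4f} s per call')
```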

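User

Reply 3 · edited 2024-05-15 22:11:27

Extract the hour with string operations: zfill to 4 characters (6 if seconds are also present), then slice the first 2 characters to get the hour (minutes are [2:4] and seconds are [4:6]). Use pd.to_numeric to get a numeric dtype.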

df['hour'] = pd.to_numeric(df['time'].astype(str).str.zfill(4).str[0:2])
df['minutes'] = pd.to_numeric(df['time'].astype(str).str.zfill(4).str[2:4])

  invoiceNo  time  invoiceValue  hour  minutes
0         A     6             2     0        6
1         B    12             3     0       12
2         C   356             5     3       56
3         D  2145             6    21       45
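
The six-character case mentioned above can be sketched in the same way. The HHMMSS sample values below are hypothetical and are assumed to be stored as integers, so their leading zeros have been lost.

```python
import pandas as pd

# Hypothetical HHMMSS values stored as integers.
s = pd.Series([6, 1205, 35600, 214530])

# Restore the leading zeros so every value is exactly six characters.
padded = s.astype(str).str.zfill(6)

parts = pd.DataFrame({
    'hour': pd.to_numeric(padded.str[0:2]),
    'minutes': pd.to_numeric(padded.str[2:4]),
    'seconds': pd.to_numeric(padded.str[4:6]),
})

print(parts)
```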

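If you want to convert 'time' to the timedelta64[ns] dtype, you can rely on the flexible parsing of pd.to_datetime. Because the year, month, and day are missing, the date defaults to 1900-01-01, which we then subtract.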

df['new_time'] = (pd.to_datetime(df['time'].astype(str).str.zfill(4), format='%H%M')
                  - pd.to_datetime('1900-01-01'))

  invoiceNo  time  invoiceValue  hour  minutes        new_time
0         A     6             2     0        6 0 days 00:06:00
1         B    12             3     0       12 0 days 00:12:00
2         C   356             5     3       56 0 days 03:56:00
3         D  2145             6    21       45 0 days 21:45:00
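
An alternative that avoids the 1900-01-01 subtraction is to compute the minutes since midnight arithmetically and pass them to pd.to_timedelta. This is a different technique from the one above, shown as a sketch that assumes 'time' holds valid HHMM integers. Unlike pd.to_datetime, it does not reject out-of-range values.

```python
import pandas as pd

df = pd.DataFrame({'time': [6, 12, 356, 2145]})

# Hours are the leading digits, minutes the last two digits.
minutes_since_midnight = (df['time'] // 100) * 60 + df['time'] % 100

# Interpret the integers as a number of minutes.
df['new_time'] = pd.to_timedelta(minutes_since_midnight, unit='min')

print(df)
```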
