Storing date/timestamp columns in Parquet with Dask

User

Reply 1 · edited 2024-05-29 10:09:12

Here is a link to the Drill documentation for the TO_TIMESTAMP() function (https://drill.apache.org/docs/data-type-conversion/#to_timestamp). I think @mdurant's approach is correct.

I would try:

SELECT TO_TIMESTAMP(<date_col>) FROM ...

or

SELECT TO_TIMESTAMP((<date_col> / 1000)) FROM ...

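User

Reply 2 · edited 2024-05-29 10:09:12

If memory serves, Drill uses the old, non-standard INT96 timestamps, which were never supported by Parquet. A Parquet timestamp is essentially a UNIX timestamp stored as an int64, at one of several precisions. Drill must have a function to convert this correctly to its internal format.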
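To make the "int64 at one of several precisions" point concrete, here is a minimal pandas sketch. It does not touch Parquet or Drill; it only shows that one instant maps to different integers depending on the unit, which is why a reader has to know the unit. The variable names are illustrative.

```python
import pandas as pd

ts = pd.Timestamp("2019-01-01 00:00:00")

# Timestamp.value is the integer number of nanoseconds since 1970-01-01.
ns = ts.value

# The same instant expressed as an integer count at four precisions.
as_int = {
    "s": ns // 10**9,
    "ms": ns // 10**6,
    "us": ns // 10**3,
    "ns": ns,
}

for unit, value in as_int.items():
    # Each integer only decodes to the right instant with its own unit.
    print(unit, value, pd.to_datetime(value, unit=unit))
```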

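I am no expert on Drill, but it seems you first need to divide your integer by the appropriate power of 10 (see this answer). This syntax may be wrong, but it might give you the idea:

SELECT TO_TIMESTAMP(CAST(mycol AS DOUBLE) / 1000) FROM ...;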
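The "divide by the appropriate power of 10" step can be checked on the Python side before writing any Drill query. The sketch below is an assumption-laden illustration: `DIVISORS` and `epoch_to_seconds` are hypothetical names, not part of pandas, Dask, or Drill, and the caller is assumed to already know the unit of the stored integer.

```python
import pandas as pd

# Hypothetical lookup: how many stored ticks make up one second.
DIVISORS = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def epoch_to_seconds(value: int, unit: str) -> float:
    """Scale an integer epoch value in the given unit down to seconds."""
    return value / DIVISORS[unit]


# A millisecond epoch value, as a column written with ms precision would hold.
millis = 1546300800000
seconds = epoch_to_seconds(millis, "ms")

# Decoding the scaled value as seconds should give back the original instant.
print(pd.to_datetime(seconds, unit="s"))
```

If the printed date is far from what you expect, the divisor (and therefore the assumed unit) is probably wrong.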

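User

Reply 3 · edited 2024-05-29 10:09:12

Not sure whether this is relevant to you, but it looks like you are only interested in the date value (ignoring hours, minutes, and so on). If so, you can use .dt.date to explicitly convert the timestamp information into plain dates: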

import pandas as pd
import dask.dataframe as dd

sample_dates = [
    '2019-01-01 00:01:00',
    '2019-01-02 05:04:02',
    '2019-01-02 15:04:02'
]

df = pd.DataFrame(zip(sample_dates, range(len(sample_dates))), columns=['datestring', 'value'])

ddf = dd.from_pandas(df, npartitions=2)

# convert to timestamp and calculate as unix time (relative to 1970)
ddf['unix_timestamp_seconds'] = (ddf['datestring'].astype('M8[s]') - pd.to_datetime('1970-01-01')).dt.total_seconds()

# convert to timestamp format and extract dates
ddf['datestring'] = ddf['datestring'].astype('M8[s]').dt.date

ddf.to_parquet('test.parquet', engine='pyarrow', write_index=False, coerce_timestamps='ms')

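For the time conversion you can use either .astype or dd.to_datetime; see the answers to this question. There is also a very similar question and answer, which suggests that making sure the timestamps are downcast to ms resolves the issue.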
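As a rough illustration of what "downcast to ms" means for the data itself, the pandas-only sketch below truncates timestamps to millisecond precision with `Series.dt.floor`. This is a stand-in chosen for the example, not the mechanism `coerce_timestamps='ms'` uses internally, and the sample values are made up.

```python
import pandas as pd

# Made-up sample: the first value carries sub-millisecond detail.
stamps = pd.Series(
    [
        pd.Timestamp("2019-01-01 00:01:00.123456789"),
        pd.Timestamp("2019-01-02 05:04:02"),
    ]
)

# Drop everything finer than a millisecond.
truncated = stamps.dt.floor("ms")

print(truncated.iloc[0])  # sub-millisecond digits are gone
print(truncated.iloc[1])  # already whole seconds, so unchanged
```

Truncating loses information, so it is only appropriate when millisecond resolution is enough for the downstream reader.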

So, working with the values you provided, you may find that the core problem is a mismatch in how the variable is scaled:

# both yield: Timestamp('2019-01-01 00:00:00')

pd.to_datetime(1546300800000000*1000, unit='ns')
pd.to_datetime(1546300800000000/1000000, unit='s')
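A quick way to see the scaling mismatch is to decode the same integer under two different unit assumptions. This sketch assumes the raw value `1546300800000000` is in microseconds, which is what the two conversions above imply; `raw`, `as_us`, and `as_ns` are illustrative names.

```python
import pandas as pd

# Assumed to be microseconds since 1970-01-01 (inferred from the scaling above).
raw = 1546300800000000

# Decoded with the matching unit, the value lands on the expected date.
as_us = pd.to_datetime(raw, unit="us")

# Decoded as nanoseconds, the same integer is read as 1000x too small
# and ends up a few weeks after the epoch instead.
as_ns = pd.to_datetime(raw, unit="ns")

print(as_us)
print(as_ns)
```

A date stuck in January 1970 is a common symptom of a reader assuming a finer unit than the one the file was written with.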
