How do I load a CSV as a streaming table source in PyFlink?

from pyflink.dataset import ExecutionEnvironment
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import (
    TableConfig,
    DataTypes,
    BatchTableEnvironment,
    StreamTableEnvironment,
)
from pyflink.table.descriptors import Schema, Csv, OldCsv, FileSystem
from pathlib import Path

exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)

root = Path(__file__).parent.resolve()
out_path = root / "output.csv"

try:
    out_path.unlink()
except:
    pass

from pyflink.table.window import Tumble

(
    t_env.connect(FileSystem().path(str(root / "input.csv")))
    .with_format(Csv())
    .with_schema(
        Schema().field("time", DataTypes.TIMESTAMP(3)).field("word", DataTypes.STRING())
    )
    .create_temporary_table("mySource")
)

(
    t_env.connect(FileSystem().path(str(out_path)))
    .with_format(Csv())
    .with_schema(
        Schema().field("word", DataTypes.STRING()).field("count", DataTypes.BIGINT())
    )
    .create_temporary_table("mySink")
)

(
    t_env.from_path("mySource")
    .group_by("word")
    .select("word, count(1) as count")
    .filter("count > 1")
    .insert_into("mySink")
)

t_env.execute("tutorial_job")

(
    t_env.from_path("mySource")
    .window(Tumble.over("10.minutes").on("time").alias("w"))
    .group_by("w, word")
    .select("w, word, count(1) as count")
    .filter("count > 1")
    .insert_into("mySink")
)

2 answers

User

Answer 1 · edited 2024-06-07 03:17:44

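Have you tried using a watermark strategy? As mentioned here, you need a watermark strategy in order to use event time. For PyFlink, I personally find it easier to declare it in DDL (like this).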
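As a rough illustration of the DDL approach, a sketch along these lines might work. It assumes a recent PyFlink release in which `TableEnvironment.create`, `EnvironmentSettings.in_streaming_mode()` and `execute_sql` are available. The helper names `build_source_ddl` and `register_source` are hypothetical, and the 60-second watermark delay is only an example value. The column names are taken from the question.

```python
def build_source_ddl(table_name: str, csv_path: str) -> str:
    """Return a CREATE TABLE statement for a CSV file with an event-time column.

    `time` is a reserved word in Flink SQL, so it is quoted with backticks.
    The 60-second watermark delay is an illustrative assumption.
    """
    return f"""
        CREATE TABLE {table_name} (
            `time` TIMESTAMP(3),
            word STRING,
            WATERMARK FOR `time` AS `time` - INTERVAL '60' SECOND
        ) WITH (
            'connector' = 'filesystem',
            'path' = '{csv_path}',
            'format' = 'csv'
        )
    """


def register_source(csv_path: str):
    """Create a streaming TableEnvironment and register the CSV source in it."""
    # Imported lazily so the DDL helper can be inspected without PyFlink installed.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    t_env.execute_sql(build_source_ddl("mySource", csv_path))
    return t_env
```

Calling `register_source(str(root / "input.csv"))` would then give a table environment in which `mySource` has `time` declared as an event-time attribute.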

User

Answer 2 · edited 2024-06-07 03:17:44

If you are using the descriptor API, you can mark a field as the event-time field through the schema:

# Rowtime is imported from pyflink.table.descriptors, alongside Schema
.with_schema(  # declare the schema of the table
    Schema()
    .field("rowtime", DataTypes.TIMESTAMP())
    .rowtime(
        Rowtime()
        .timestamps_from_field("time")
        .watermarks_periodic_bounded(60000))  # bounded delay in milliseconds
    .field("a", DataTypes.STRING())
    .field("b", DataTypes.STRING())
    .field("c", DataTypes.STRING())
)

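That said, I would still recommend DDL: it is easier to use, the existing descriptor API has some bugs, and the community is discussing a refactoring of the descriptor API.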
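Once the table is declared with an event-time attribute, the tumbling-window count from the question could plausibly be written in SQL as well. The sketch below assumes a table with a `time` event-time column and a `word` column is already registered, and that `t_env` is a streaming table environment that provides `execute_sql`. The helper names `build_tumble_query` and `print_window_counts` are hypothetical.

```python
def build_tumble_query(table_name: str = "mySource") -> str:
    """Return a 10-minute tumbling-window word count as a SQL string.

    `time` and `count` are reserved words in Flink SQL, hence the backticks.
    """
    return f"""
        SELECT word, COUNT(1) AS `count`
        FROM {table_name}
        GROUP BY TUMBLE(`time`, INTERVAL '10' MINUTE), word
        HAVING COUNT(1) > 1
    """


def print_window_counts(t_env, table_name: str = "mySource") -> None:
    """Run the windowed query on an existing table environment and print the rows."""
    # Assumes `table_name` was registered with a watermark on its `time` column.
    t_env.execute_sql(build_tumble_query(table_name)).print()
```

The `HAVING` clause plays the role of the `.filter("count > 1")` call in the question's Table API version.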
