How can I control the number of files processed per trigger in Spark Structured Streaming when using the "Trigger once" trigger?


I am trying to use the Trigger Once feature of Spark Structured Streaming to emulate a batch-like setup. However, I run into trouble with the initial batch: I have a lot of historical data, so I also set .option("cloudFiles.includeExistingFiles", "true") to process the existing files. As a result, my initial batch becomes very large, because I cannot control the number of files that go into it.

I also tried the cloudFiles.maxBytesPerTrigger option, but it is ignored when you use Trigger Once (see <a href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html" rel="nofollow noreferrer">https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html</a>). When I specify the maxFilesPerTrigger option, it is ignored as well. The stream simply takes all available files.

My code looks like this:

<pre><code>from pyspark.sql.functions import input_file_name, lit

df = (
    spark.readStream.format("cloudFiles")
    .schema(schemaAsStruct)
    .option("cloudFiles.format", sourceFormat)
    .option("delimiter", delimiter)
    .option("header", sourceFirstRowIsHeader)
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("badRecordsPath", badRecordsPath)
    .option("maxFilesPerTrigger", 1)  # the option that is being ignored
    .option("cloudFiles.resourceGroup", omitted)
    .option("cloudFiles.region", omitted)
    .option("cloudFiles.connectionString", omitted)
    .option("cloudFiles.subscriptionId", omitted)
    .option("cloudFiles.tenantId", omitted)
    .option("cloudFiles.clientId", omitted)
    .option("cloudFiles.clientSecret", omitted)
    .load(sourceBasePath)
)

# Traceability columns
df = (
    df.withColumn(sourceFilenameColumnName, input_file_name())
    .withColumn(processedTimestampColumnName, lit(processedTimestamp))
    .withColumn(batchIdColumnName, lit(batchId))
)


def process_batch(batchDF, id):
    # Cache the micro-batch because it is written twice below
    batchDF.persist()

    # Write the data itself
    (batchDF
        .write
        .format(destinationFormat)
        .mode("append")
        .save(destinationBasePath + processedTimestampColumnName + "=" + processedTimestamp)
    )

    # Write the per-file row counts for this batch
    (batchDF
        .groupBy(sourceFilenameColumnName, processedTimestampColumnName)
        .count()
        .write
        .format(destinationFormat)
        .mode("append")
        .save(batchSourceFilenamesTmpDir))

    batchDF.unpersist()


stream = (
    df.writeStream
    .foreachBatch(process_batch)
    .trigger(once=True)
    .option("checkpointLocation", checkpointPath)
    .start()
)
</code></pre>

As you can see, I am using the cloudFiles format, which is the format of Databricks Auto Loader (see <a href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html" rel="nofollow noreferrer">https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html</a>):

"Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory."

If I have asked my question in a confusing way, or if it lacks information, please say so.
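
One direction that may be worth trying, sketched here under assumptions rather than as a confirmed fix: PySpark 3.3.0 and later expose an "available now" trigger (`trigger(availableNow=True)`) which, unlike Trigger Once, works through the backlog in several micro-batches and then stops, and is documented to respect rate-limit options. The sketch assumes a runtime that supports this trigger for the cloudFiles source and that the Auto Loader-specific option names `cloudFiles.maxFilesPerTrigger` and `cloudFiles.maxBytesPerTrigger` apply. The helper names `rate_limit_options` and `start_available_now` are hypothetical; `reader` stands for the configured `spark.readStream.format("cloudFiles")...` chain from the question, before `.load(...)` is called.

```python
def rate_limit_options(max_files=None, max_bytes=None):
    """Build Auto Loader rate-limit options as a dict of strings.

    Hypothetical helper: only the option names come from Auto Loader.
    max_bytes may be an int or a size string such as "10g".
    """
    options = {}
    if max_files is not None:
        if int(max_files) <= 0:
            raise ValueError("max_files must be a positive integer")
        options["cloudFiles.maxFilesPerTrigger"] = str(max_files)
    if max_bytes is not None:
        options["cloudFiles.maxBytesPerTrigger"] = str(max_bytes)
    return options


def start_available_now(reader, source_path, process_batch, checkpoint_path,
                        max_files=None, max_bytes=None):
    """Start a stream that drains the backlog in bounded micro-batches.

    Assumes `reader` is a configured DataStreamReader and that the runtime
    supports trigger(availableNow=True) (PySpark 3.3.0 or later).
    """
    limits = rate_limit_options(max_files=max_files, max_bytes=max_bytes)
    df = reader.options(**limits).load(source_path)
    return (
        df.writeStream
        .foreachBatch(process_batch)      # called once per micro-batch
        .trigger(availableNow=True)       # replaces trigger(once=True)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

With the names from the question, a call could look like `start_available_now(reader, sourceBasePath, process_batch, checkpointPath, max_files=1000)`. Note that `process_batch` would then run once for every micro-batch instead of once for the whole backlog, so anything that assumes a single invocation per run would need to be checked.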
