Avoid recomputing the size of all Cloud Storage files in the Beam Python SDK



I'm working on a pipeline that reads roughly 5 million files from a Google Cloud Storage (GCS) directory. I have it configured to run on Google Cloud Dataflow.

The problem is that when I start the pipeline, it spends hours "computing the size" of all the files:

INFO:apache_beam.io.gcp.gcsio:Starting the size estimation of the input
INFO:apache_beam.io.gcp.gcsio:Finished computing size of: 10000 files
[...]
INFO:apache_beam.io.gcp.gcsio:Finished computing size of: 5480000 files
INFO:apache_beam.io.gcp.gcsio:Finished listing 5483720 files in 5549.38778591156 seconds.
INFO:apache_beam.io.gcp.gcsio:Starting the size estimation of the input
INFO:apache_beam.io.gcp.gcsio:Finished computing size of: 10000 files
[...]
INFO:apache_beam.io.gcp.gcsio:Finished computing size of: 5480000 files
INFO:apache_beam.io.gcp.gcsio:Finished listing 5483720 files in 7563.196493148804 seconds.
INFO:apache_beam.io.gcp.gcsio:Starting the size estimation of the input
INFO:apache_beam.io.gcp.gcsio:Finished computing size of: 10000 files
[...]

As you can see, it took an hour and a half (5,549 seconds) to compute the size of roughly 5.5 million files, and then it started all over again! It took another 2 hours to run the second pass, and then it started a third! As of this writing, the job still doesn't appear in the Dataflow console, which leads me to believe this is all happening on my local machine and not taking advantage of any distributed computing.

When I test the pipeline with a smaller input dataset (2 files), it repeats the size estimation 4 times:

INFO:apache_beam.io.gcp.gcsio:Starting the size estimation of the input
INFO:apache_beam.io.gcp.gcsio:Finished listing 2 files in 0.33771586418151855 seconds.
INFO:apache_beam.io.gcp.gcsio:Starting the size estimation of the input
INFO:apache_beam.io.gcp.gcsio:Finished listing 2 files in 0.1244659423828125 seconds.
INFO:apache_beam.io.gcp.gcsio:Starting the size estimation of the input
INFO:apache_beam.io.gcp.gcsio:Finished listing 2 files in 0.13422417640686035 seconds.
INFO:apache_beam.io.gcp.gcsio:Starting the size estimation of the input
INFO:apache_beam.io.gcp.gcsio:Finished listing 2 files in 0.14139890670776367 seconds.

At this rate, it would take roughly 8 hours just to perform the GCS size estimation of all 5.5 million files 4 times, before the Dataflow job even starts.

My pipeline is configured with the --runner=DataflowRunner option, so it should be running on Dataflow:

python bigquery_import.py --runner=DataflowRunner #other options...

The pipeline reads from GCS as follows:

import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

parser = argparse.ArgumentParser()
parser.add_argument(
    '--input',
    required=True,
    help='Input Cloud Storage directory to process.')
# argv comes from the surrounding main(argv) entry point.
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True

with beam.Pipeline(options=pipeline_options) as p:
    # e.g. --input=gs://project/dir/*.har.gz
    files = p | beam.io.ReadFromText(known_args.input)

See bigquery_import.py on GitHub for the complete code.

I don't understand why this tedious process happens outside of the Dataflow environment, or why it needs to be performed multiple times. Am I reading the files from GCS correctly, or is there a more efficient way?


1 Answer

Thanks for reporting this. Beam has two transforms for reading text: ReadFromText and ReadAllFromText. ReadFromText will run into this issue, but ReadAllFromText should not:

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py#L438

The downside of ReadAllFromText is that it does not perform dynamic work rebalancing, but that should not be a problem when reading a large number of files.
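
For reference, a minimal sketch of the swap (reusing the glob and pipeline options from the question): the file pattern is fed through beam.Create into ReadAllFromText, so the pattern expansion and reading should happen as part of pipeline execution rather than during local graph construction:

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline(options=pipeline_options) as p:
    files = (
        p
        # A one-element PCollection holding the file pattern itself.
        | beam.Create(['gs://project/dir/*.har.gz'])
        # Expand the pattern and read the matching files inside the job;
        # the default AUTO compression handles the .gz extension.
        | ReadAllFromText())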

I created https://issues.apache.org/jira/browse/BEAM-9620 to track this issue with ReadFromText (and file-based sources in general).
