Utils for streaming large files (S3, HDFS, gzip, bz2…)

Detailed description of the smart_open Python project


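What?

smart_open is a Python 2 & Python 3 library for efficient streaming of very large files from/to S3, HDFS, WebHDFS, HTTP, or local storage. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falling back to the native open wherever possible), plus lots of nifty extra features on top.

smart_open is well tested, well documented, and has a simple, Pythonic API: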

>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...     print(repr(line))
...     break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
...     print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
...         for line in fin:
...             fout.write(line)

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
...     print(repr(line))
...     break
'<!doctype html>\n'

Other examples of URLs that smart_open accepts:

s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key
s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
~/local/path/file
local/path/file
./local/path/file.gz
file:///home/user/file
file:///home/user/file.bz2
[ssh|scp|sftp]://username@host//path/file
[ssh|scp|sftp]://username@host/path/file

For detailed API info, see the online help:

help('smart_open')

or click here to view the help in your browser.

More examples:

>>> import boto3
>>>
>>> # stream content *into* S3 (write mode) using a custom session
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> lines = [b'first line\n', b'second line\n', b'third line\n']
>>> transport_params = {'session': boto3.Session(profile_name='smart_open')}
>>> with open(url, 'wb', transport_params=transport_params) as fout:
...     for line in lines:
...         bytes_written = fout.write(line)

# stream from HDFS
for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
    print(line)

# stream from WebHDFS
for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
    print(line)

# stream content *into* HDFS (write mode):
with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream content *into* WebHDFS (write mode):
with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from a completely custom s3 server, like s3proxy:
for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
    print(line)

# Stream to Digital Ocean Spaces bucket providing credentials from boto profile
transport_params = {
    'session': boto3.Session(profile_name='digitalocean'),
    'resource_kwargs': {
        'endpoint_url': 'https://ams3.digitaloceanspaces.com',
    }
}
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'here we stand')

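Why?

Working with large S3 files using Amazon's default Python libraries, boto and boto3, is a pain. The key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded in RAM, no streaming). There are nasty hidden gotchas when using the multipart upload functionality that large files require, and a lot of boilerplate.

smart_open shields you from that. It builds on boto3 but offers a cleaner, Pythonic API. The result is less code for you to write and fewer bugs to make.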

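Installation

pip install smart_open

Or, if you prefer to install from the source tar.gz:

python setup.py test  # run unit tests
python setup.py install

To run the unit tests (optional), you'll also need to install mock, moto and responses (pip install mock moto responses). The tests are also run automatically with Travis CI on every commit push and pull request.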

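Supported Compression Formats

smart_open allows reading and writing gzip and bzip2 files. They are transparently handled over HTTP, S3, and other protocols, too, based on the extension of the file being opened. You can easily add support for other file extensions and compression formats. For example, to open xz-compressed files: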

>>> import lzma, os
>>> from smart_open import open, register_compressor

>>> def _handle_xz(file_obj, mode):
...     return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

>>> register_compressor('.xz', _handle_xz)

>>> with open('smart_open/tests/test_data/crime-and-punishment.txt.xz') as fin:
...     text = fin.read()
>>> print(len(text))
1696

lzma is in the standard library in Python 3.3 and greater. For 2.7, use backports.lzma.
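
The extension-based dispatch described in this section can be illustrated with the standard library alone. The following is a minimal sketch, not smart_open's implementation: the registry `_HANDLERS` and the helpers `register_handler` and `open_by_extension` are hypothetical names invented for this example, and only local files in binary mode are handled.

```python
import bz2
import contextlib
import gzip
import lzma
import os

# Hypothetical registry: file extension -> callable(file_obj, mode) that
# wraps a binary file object in a (de)compressing file object.
_HANDLERS = {}


def register_handler(ext, handler):
    """Associate a file extension such as '.xz' with a wrapper callable."""
    _HANDLERS[ext] = handler


@contextlib.contextmanager
def open_by_extension(path, mode='rb'):
    """Open a local file in binary mode, (de)compressing based on its extension."""
    handler = _HANDLERS.get(os.path.splitext(path)[1])
    with open(path, mode) as raw:
        if handler is None:
            yield raw  # unknown extension: hand back the plain file
        else:
            # The wrapper is closed first, so compressed data is flushed
            # into `raw` before `raw` itself is closed.
            with handler(raw, mode) as wrapped:
                yield wrapped


register_handler('.gz', lambda f, mode: gzip.GzipFile(fileobj=f, mode=mode))
register_handler('.bz2', lambda f, mode: bz2.BZ2File(f, mode=mode))
register_handler('.xz', lambda f, mode: lzma.LZMAFile(filename=f, mode=mode, format=lzma.FORMAT_XZ))
```

With this in place, `open_by_extension('data.xz', 'wb')` writes xz-compressed bytes and `open_by_extension('data.xz')` reads them back, while a path with an unregistered extension is opened as a plain file.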

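Transport-specific Options

smart_open supports a wide range of transport options out of the box, including:

  • HTTP, HTTPS (read-only)
  • SSH, SCP and SFTP
  • WebHDFS

Each option involves setting up its own set of parameters. For example, for accessing S3, you often need to set up authentication, like API keys or a profile name. smart_open's open function accepts a keyword argument, transport_params, which accepts additional parameters for the transport layer. Here are some examples of using this parameter: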

>>> import boto3
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(session=boto3.Session()))
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))

For the full list of keyword arguments supported by each transport option, see the documentation:

help('smart_open.open')

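S3 Credentials

smart_open uses the boto3 library to talk to S3. boto3 has several mechanisms for determining the credentials to use. By default, smart_open will defer to boto3 and let the latter take care of the credentials. There are several ways to override this behavior.

The first is to pass a boto3.Session object as a transport parameter to the open function. You can customize the credentials when constructing the session. smart_open will then use the session when talking to S3.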

>>> session = boto3.Session(
...     aws_access_key_id=ACCESS_KEY,
...     aws_secret_access_key=SECRET_KEY,
...     aws_session_token=SESSION_TOKEN,
... )
>>> fin = open('s3://bucket/key', transport_params=dict(session=session))

The second option is to specify the credentials within the S3 URL itself:

>>> fin = open('s3://my_key:my_secret@my_bucket/my_key')

Important: The two methods above are mutually exclusive. If you pass an AWS session and the URL contains credentials, smart_open will ignore the latter.
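
To make the precedence rule concrete, here is a small standard-library sketch. It is not smart_open's own URL parser: `split_s3_url` and `pick_credentials` are hypothetical helpers written for this illustration, and they only cover the `s3://bucket/key` and `s3://key:secret@bucket/key` URL forms.

```python
from urllib.parse import urlsplit


def split_s3_url(url):
    """Split an S3 URL into bucket, key and optional inline credentials."""
    parts = urlsplit(url)
    if parts.scheme != 's3':
        raise ValueError('not an S3 URL: %r' % url)
    # Everything before the last '@' in the network location is credentials.
    credentials, _, bucket = parts.netloc.rpartition('@')
    access_id, _, access_secret = credentials.partition(':')
    return {
        'bucket': bucket,
        'key': parts.path.lstrip('/'),
        'access_id': access_id or None,
        'access_secret': access_secret or None,
    }


def pick_credentials(url, session=None):
    """Return which credential source wins: an explicit session beats the URL."""
    parsed = split_s3_url(url)
    if session is not None:
        return 'session', session
    if parsed['access_id'] and parsed['access_secret']:
        return 'url', (parsed['access_id'], parsed['access_secret'])
    return 'default', None  # fall back to the default credential lookup
```

Calling `pick_credentials('s3://my_key:my_secret@my_bucket/my_key', session=some_session)` reports the session as the winner, which mirrors the rule that URL credentials are ignored when a session is passed.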

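Iterating Over an S3 Bucket's Contents

Since going over all (or select) keys in an S3 bucket is a very common operation, there is also an extra function that does this efficiently, processing the bucket keys in parallel (using multiprocessing).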
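
The idea of processing bucket keys in parallel can be sketched without touching S3 at all. The example below is an illustration under stated assumptions, not smart_open's function: `iter_bucket`, `fetch` and `fake_bucket` are hypothetical names, a dict stands in for the bucket, and threads are used instead of multiprocessing to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor


def iter_bucket(fetch, keys, accept_key=lambda key: True, workers=4):
    """Yield (key, content) pairs for accepted keys, fetching contents in parallel."""
    wanted = [key for key in keys if accept_key(key)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as `wanted`.
        for key, content in zip(wanted, pool.map(fetch, wanted)):
            yield key, content


# In-memory stand-in for a bucket: key -> object content.
fake_bucket = {'2009.nc': b'a' * 10, '2010.nc': b'b' * 20, '2011.nc': b'c' * 30}

for key, content in iter_bucket(
    fake_bucket.get,
    sorted(fake_bucket),
    accept_key=lambda key: key.startswith('201'),
):
    print(key, len(content))
# prints:
# 2010.nc 20
# 2011.nc 30
```

Filtering keys before fetching keeps unwanted objects from being downloaded at all, which is the main saving when only a subset of a large bucket is needed.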


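Migrating to the New open Function

Since 1.8.1, there is a smart_open.open function that replaces smart_open.smart_open. The new function offers several advantages over the old one:

  • 100% compatible with the built-in open function (aka io.open): it accepts all the parameters that the built-in open accepts.
  • The default open mode is now "r", the same as for the built-in open. The default for the old smart_open.smart_open function used to be "rb".
  • Fully documented keyword parameters (try help("smart_open.open"))

The instructions below will help you migrate to the new function painlessly.

First, update your imports: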

>>> from smart_open import smart_open  # before
>>> from smart_open import open  # after

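In general, smart_open uses io.open directly where possible, so if your code already uses open for local file I/O, it will continue to work. If you want to continue using the built-in open function, e.g. for debugging, then you can import smart_open and use smart_open.open.

The default read mode is now "r" (read text). If your code was implicitly relying on the default mode being "rb" (read binary), then you'll need to update it and pass "rb" explicitly.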
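
What the change of default mode means in practice can be seen with the built-in open, whose defaults the new function follows. This is a small sketch using a temporary local file; the file name and contents are made up for the illustration.

```python
import os
import tempfile

# Create a small local file to read back in both modes.
path = os.path.join(tempfile.mkdtemp(), 'robots.txt')
with open(path, 'wb') as fout:
    fout.write(b'User-Agent: *\n')

with open(path) as fin:        # default mode "r": lines are str
    as_text = fin.readline()
with open(path, 'rb') as fin:  # explicit "rb": lines are bytes
    as_bytes = fin.readline()

print(repr(as_text))   # 'User-Agent: *\n'
print(repr(as_bytes))  # b'User-Agent: *\n'
```

Code that calls `.decode()` on what it reads, or compares it to bytes literals, is the kind that depended on the old "rb" default and needs the explicit mode.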

Before:

>>> import smart_open
>>> smart_open.smart_open('s3://commoncrawl/robots.txt').read(32)  # 'rb' used to be the default

After:

>>> import smart_open
>>> smart_open.open('s3://commoncrawl/robots.txt', 'rb').read(32)

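The ignore_extension keyword parameter is now called ignore_ext. It behaves identically otherwise.

The most significant change is in the handling of keyword parameters for the transport layer, e.g. HTTP, S3, etc. The old function accepted these directly: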

>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> session = boto3.Session(profile_name='smart_open')
>>> smart_open.smart_open(url, 'r', s3_session=session).read(32)

The new function accepts a transport_params keyword argument. It's a dict. Put your transport parameters in that dictionary.

>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> params = {'session': boto3.Session(profile_name='smart_open')}
>>> open(url, 'r', transport_params=params).read(32)

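Renamed parameters:

  • s3_upload -> multipart_upload_kwargs
  • s3_session -> session

Removed parameters:

  • profile_name

The profile_name parameter has been removed. Pass an entire boto3.Session object instead.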
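
The renames and removals listed above can be applied mechanically when porting old calls. The helper below is a hypothetical sketch, not part of smart_open: `to_transport_params`, `RENAMED` and `REMOVED` are names invented here, and `multipart_upload_kwargs` is the name assumed for the renamed multipart upload parameter.

```python
# Old keyword name -> key to use inside transport_params.
# Assumption: the multipart upload parameter is named multipart_upload_kwargs.
RENAMED = {
    's3_upload': 'multipart_upload_kwargs',
    's3_session': 'session',
}

# Old keywords that no longer have a direct equivalent.
REMOVED = {'profile_name'}


def to_transport_params(**old_kwargs):
    """Translate old-style transport keywords into a transport_params dict."""
    params = {}
    for name, value in old_kwargs.items():
        if name in REMOVED:
            raise TypeError(
                '%r has been removed; pass a boto3.Session as "session" instead' % name
            )
        # Renamed keywords get their new name; everything else passes through.
        params[RENAMED.get(name, name)] = value
    return params
```

For example, `to_transport_params(s3_session=session, buffer_size=1024)` returns a dict with the keys `session` and `buffer_size`, ready to be passed as `transport_params`.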

Before:

>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> smart_open.smart_open(url, 'r', profile_name='smart_open').read(32)

After:

>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> params = {'session': boto3.Session(profile_name='smart_open')}
>>> open(url, 'r', transport_params=params).read(32)

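For the full list of accepted parameter names, see help("smart_open.open"), or view the help online here.

If you pass an invalid parameter name, the smart_open.open function will warn you about it. Keep an eye on your logs for WARNING messages from smart_open.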
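
One way to make sure such warnings are not lost is to attach a handler with the standard logging module. This is a minimal sketch; it assumes that smart_open's modules log under logger names starting with "smart_open", which is an assumption of this example and not something stated above.

```python
import logging

# Assumption: smart_open's loggers are named "smart_open" or "smart_open.<module>".
handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(logging.Formatter('%(levelname)s %(name)s: %(message)s'))

smart_open_logger = logging.getLogger('smart_open')
smart_open_logger.addHandler(handler)
smart_open_logger.setLevel(logging.WARNING)
```

Because child loggers propagate to their parent, a warning emitted by any logger under the "smart_open" name would reach this handler.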

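Comments, bug reports

smart_open lives on GitHub. You can file issues or pull requests there. Suggestions, pull requests and improvements are welcome!


smart_open is open source software released under the MIT license. Copyright (c) 2015-now Radim Řehůřek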
