Python DB API 2.0 (PEP 249) compliant wrapper for the Amazon Athena JDBC driver
PyAthenaJDBC

PyAthenaJDBC is a Python DB API 2.0 (PEP 249) compliant wrapper for the Amazon Athena JDBC driver.
Requirements

- Python 2.7, 3.4, 3.5, 3.6
- Java >= 8
Installation
$ pip install PyAthenaJDBC
Extra packages:
Package | Install command | Version |
---|---|---|
Pandas | `pip install PyAthenaJDBC[Pandas]` | >=0.19.0 |
SQLAlchemy | `pip install PyAthenaJDBC[SQLAlchemy]` | >=1.0.0 |
Usage

Basic usage
```python
from pyathenajdbc import connect

conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
               region_name='us-west-2')
try:
    with conn.cursor() as cursor:
        cursor.execute("""
        SELECT * FROM one_row
        """)
        print(cursor.description)
        print(cursor.fetchall())
finally:
    conn.close()
```
Cursor iteration
```python
from pyathenajdbc import connect

conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
               region_name='us-west-2')
try:
    with conn.cursor() as cursor:
        cursor.execute("""
        SELECT * FROM many_rows LIMIT 10
        """)
        for row in cursor:
            print(row)
finally:
    conn.close()
```
Query with parameters

The only supported DB API paramstyle is pyformat. It supports only named placeholders in the old `%` operator style, and parameters are specified as a dictionary.
```python
from pyathenajdbc import connect

conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
               region_name='us-west-2')
try:
    with conn.cursor() as cursor:
        cursor.execute("""
        SELECT col_string FROM one_row_complex
        WHERE col_string = %(param)s
        """, {'param': 'a string'})
        print(cursor.fetchall())
finally:
    conn.close()
```
If the query contains a `%` character, it must be escaped with `%%`, as follows:
```sql
SELECT col_string FROM one_row_complex
WHERE col_string = %(param)s OR col_string LIKE 'a%%'
```
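The reason for the `%%` escape is that pyformat placeholders behave like Python's `%` string formatting, where a literal `%` must be doubled. A rough local sketch of the substitution (illustration only: the quoting of the bound value is simplified here, and no Athena connection is involved):

```python
# pyformat placeholder resolution works like Python's %-formatting on a dict,
# so a literal % in the query text must be written as %%.
query = ("SELECT col_string FROM one_row_complex "
         "WHERE col_string = %(param)s OR col_string LIKE 'a%%'")

# Simplified illustration only: a real driver also quotes/escapes the value.
rendered = query % {'param': "'a string'"}
print(rendered)
```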
JVM options

JVM options can be passed to the connect method or connection object as an array of strings.

For example, you can increase the JVM heap size as follows:
```python
from pyathenajdbc import connect

conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
               region_name='us-west-2',
               jvm_options=['-Xms1024m', '-Xmx4096m'])
try:
    with conn.cursor() as cursor:
        cursor.execute("""
        SELECT * FROM many_rows
        """)
        print(cursor.fetchall())
finally:
    conn.close()
```
SQLAlchemy

Install SQLAlchemy with `pip install SQLAlchemy>=1.0.0` or `pip install PyAthenaJDBC[SQLAlchemy]`. SQLAlchemy 1.0.0 or later is supported.
```python
import contextlib
from urllib.parse import quote_plus  # PY2: from urllib import quote_plus
from sqlalchemy.engine import create_engine
from sqlalchemy.sql.expression import select
from sqlalchemy.sql.functions import func
from sqlalchemy.sql.schema import Table, MetaData

conn_str = 'awsathena+jdbc://{access_key}:{secret_key}@athena.{region_name}.amazonaws.com:443/'\
           '{schema_name}?s3_staging_dir={s3_staging_dir}'
engine = create_engine(conn_str.format(
    access_key=quote_plus('YOUR_ACCESS_KEY'),
    secret_key=quote_plus('YOUR_SECRET_ACCESS_KEY'),
    region_name='us-west-2',
    schema_name='default',
    s3_staging_dir=quote_plus('s3://YOUR_S3_BUCKET/path/to/')))
try:
    with contextlib.closing(engine.connect()) as conn:
        many_rows = Table('many_rows', MetaData(bind=engine), autoload=True)
        print(select([func.count('*')], from_obj=many_rows).scalar())
finally:
    engine.dispose()
```
The connection string has the following format:
awsathena+jdbc://{access_key}:{secret_key}@athena.{region_name}.amazonaws.com:443/{schema_name}?s3_staging_dir={s3_staging_dir}&driver_path={driver_path}&...
Note: `s3_staging_dir` requires URL quoting. If `access_key`, `secret_key`, or other parameters contain special characters, they require URL quoting as well.
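The URL quoting mentioned above can be done with `urllib.parse.quote_plus`, as in the SQLAlchemy example. A small sketch (the secret-key value below is a made-up placeholder):

```python
from urllib.parse import quote_plus  # PY2: from urllib import quote_plus

# s3_staging_dir contains ':' and '/', which must be percent-encoded
# before being embedded in the connection string.
print(quote_plus('s3://YOUR_S3_BUCKET/path/to/'))
# prints: s3%3A%2F%2FYOUR_S3_BUCKET%2Fpath%2Fto%2F

# Secret access keys often contain '/', '+', or '=' (made-up value here).
print(quote_plus('AbC/dEf+gHi='))
# prints: AbC%2FdEf%2BgHi%3D
```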
A minimal example for a Pandas DataFrame:
```python
from pyathenajdbc import connect
import pandas as pd

conn = connect(access_key='YOUR_ACCESS_KEY_ID',
               secret_key='YOUR_SECRET_ACCESS_KEY',
               s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
               region_name='us-west-2',
               jvm_path='/path/to/jvm')  # optional, as used by JPype
df = pd.read_sql("SELECT * FROM many_rows LIMIT 10", conn)
```
As a Pandas DataFrame:
```python
import contextlib
from pyathenajdbc import connect
from pyathenajdbc.util import as_pandas

with contextlib.closing(
        connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
                region_name='us-west-2')) as conn:
    with conn.cursor() as cursor:
        cursor.execute("""
        SELECT * FROM many_rows
        """)
        df = as_pandas(cursor)
print(df.describe())
```
Examples

Redash query runner example
Credentials

AWS CLI credentials, properties file credentials, and the AWS credentials provider chain are supported.
Credentials file

~/.aws/credentials
```
[default]
aws_access_key_id=YOUR_ACCESS_KEY_ID
aws_secret_access_key=YOUR_SECRET_ACCESS_KEY
```
~/.aws/config
```
[default]
region=us-west-2
output=json
```
Environment variables
$ export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
$ export AWS_DEFAULT_REGION=us-west-2
Additional environment variables:
$ export AWS_ATHENA_S3_STAGING_DIR=s3://YOUR_S3_BUCKET/path/to/
Properties file credentials

Create a properties file in the following format.

/path/to/AWSCredentials.properties
```
accessKeyId:YOUR_ACCESS_KEY_ID
secretKey:YOUR_SECRET_ACCESS_KEY
```
Specify the properties file path with the `credential_file` argument of the connect method or connection object.
```python
from pyathenajdbc import connect

conn = connect(credential_file='/path/to/AWSCredentials.properties',
               s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
               region_name='us-west-2')
```
PyAthenaJDBC uses the properties file to authenticate with Amazon Athena.
AWS credentials provider chain

See AWS credentials provider chain.
AWS credentials provider chain that looks for credentials in this order:
- Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
- Java System Properties - aws.accessKeyId and aws.secretKey
- Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
- Credentials delivered through the Amazon EC2 container service if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable
- Instance profile credentials delivered through the Amazon EC2 metadata service
You can connect by specifying at least `s3_staging_dir` and `region_name` in the connect method or connection object. It is not necessary to specify `access_key` and `secret_key`.
```python
from pyathenajdbc import connect

conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
               region_name='us-west-2')
```
Terraform instance profile example
Testing

Depends on the following environment variables:
$ export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
$ export AWS_DEFAULT_REGION=us-west-2
$ export AWS_ATHENA_S3_STAGING_DIR=s3://YOUR_S3_BUCKET/path/to/