Workflow tool to launch Spark jobs on AWS EMR

sparksteps Python project description

SparkSteps lets you configure an EMR cluster and upload your Spark script and its dependencies via AWS S3. All you need to do is define an S3 bucket.
Installation
pip install sparksteps
CLI Options
Prompt parameters:
app: main Spark script for spark-submit (required)
app-args: arguments passed to main Spark script
app-list: space-delimited list of applications to be installed on the EMR cluster (default: Hadoop Spark)
aws-region: AWS region name
bid-price: specify bid price for task nodes
bootstrap-action: include a bootstrap script (S3 path)
cluster-id: job flow ID of existing cluster to submit to
debug: allow debugging of cluster
defaults: cluster configurations of the form "<classification1> key1=val1 key2=val2 ..."
dynamic-pricing-master: use spot pricing for the master nodes
dynamic-pricing-core: use spot pricing for the core nodes
dynamic-pricing-task: use spot pricing for the task nodes
ebs-volume-size-core: size of the EBS volume to attach to core nodes, in GiB
ebs-volume-type-core: type of the EBS volume to attach to core nodes (supported: standard, gp2, io1)
ebs-volumes-per-core: number of EBS volumes to attach per core node
ebs-optimized-core: whether to use EBS-optimized volumes for core nodes
ebs-volume-size-task: size of the EBS volume to attach to task nodes, in GiB
ebs-volume-type-task: type of the EBS volume to attach to task nodes
ebs-volumes-per-task: number of EBS volumes to attach per task node
ebs-optimized-task: whether to use EBS-optimized volumes for task nodes
ec2-key: name of the Amazon EC2 key pair
ec2-subnet-id: Amazon VPC subnet ID
help (-h): argparse help
jobflow-role: Amazon EC2 instance profile name to use (default: EMR_EC2_DefaultRole)
keep-alive: whether to keep the EMR cluster alive when there are no steps
log-level (-l): logging level (default: INFO)
instance-type-master: instance type of the master host (default: m4.large)
instance-type-core: instance type of the core nodes; must be set when num-core > 0
instance-type-task: instance type of the task nodes; must be set when num-task > 0
maximize-resource-allocation: sets the maximizeResourceAllocation property for the cluster to true when supplied
name: specify cluster name
num-core: number of core nodes
num-task: number of task nodes
release-label: EMR release label
s3-bucket: name of the S3 bucket to upload the Spark file to (required)
s3-path: path within s3-bucket to use when writing assets
s3-dist-cp: s3-dist-cp step to run after the Spark job is done
submit-args: arguments passed to spark-submit
tags: EMR cluster tags of the form "key1=value1 key2=value2"
uploads: files to upload to /home/hadoop/ on the master instance
wait: poll until all steps are complete (or error)
Example
AWS_S3_BUCKET=<insert-s3-bucket>
cd sparksteps/
sparksteps examples/episodes.py \
  --s3-bucket $AWS_S3_BUCKET \
  --aws-region us-east-1 \
  --release-label emr-4.7.0 \
  --uploads examples/lib examples/episodes.avro \
  --submit-args="--deploy-mode client --jars /home/hadoop/lib/spark-avro_2.10-2.0.2-custom.jar" \
  --app-args="--input /home/hadoop/episodes.avro" \
  --tags Application="Spark Steps" \
  --debug
The example above creates a 1-node EMR cluster with the default instance type m4.large, uploads the PySpark script episodes.py and its dependencies to the specified S3 bucket, and copies the files from S3 onto the cluster. Each action is defined as an EMR "step" that you can monitor in EMR. The final step runs the Spark application with submit arguments that include a custom Spark-Avro package and the application argument "--input".
Run Spark Job on an Existing Cluster

You can use the option --cluster-id to specify an existing cluster to upload and run the Spark job on. This is especially helpful for debugging.
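As a sketch, resubmitting the example script from above to an already running cluster might look like the following (the bucket name and job flow ID are placeholders you would replace with your own values):

```shell
# Placeholder values -- substitute your own bucket and EMR job flow ID.
AWS_S3_BUCKET=my-bucket
CLUSTER_ID=j-XXXXXXXXXXXXX

# With --cluster-id, sparksteps submits the script as a new step on the
# existing cluster instead of provisioning a fresh one.
sparksteps examples/episodes.py \
  --s3-bucket $AWS_S3_BUCKET \
  --aws-region us-east-1 \
  --cluster-id $CLUSTER_ID \
  --app-args="--input /home/hadoop/episodes.avro"
```

Because the cluster already exists, cluster-creation flags such as --release-label and --instance-type-master are not needed here.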
Dynamic Pricing (alpha)

Use the CLI option --dynamic-pricing-<instance-type> to allow sparksteps to dynamically determine the best bid price for EMR instances within a given instance group.

Currently the algorithm looks back at the spot price history over the past 12 hours and calculates min(50% * on_demand_price, max_spot_price) to determine the bid price. That said, if the current spot price is above the on-demand cost, then on-demand instances are used to be conservative.
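The pricing rule above can be sketched in a few lines of Python. This is an illustrative reimplementation of the formula as described, not sparksteps' actual code; the function name and its inputs are assumptions.

```python
def choose_bid(on_demand_price, spot_history, current_spot_price):
    """Sketch of the dynamic-pricing rule described above.

    spot_history: spot prices observed over the look-back window
    (per the docs, the past 12 hours).
    Returns (price, use_spot).
    """
    # Be conservative: if spot currently costs more than on-demand,
    # fall back to on-demand instances instead of bidding.
    if current_spot_price > on_demand_price:
        return on_demand_price, False

    # Otherwise bid the lesser of half the on-demand price and the
    # highest spot price seen in the window.
    bid = min(0.5 * on_demand_price, max(spot_history))
    return bid, True
```

For example, with an on-demand price of $0.10 and a 12-hour spot maximum of $0.04, the bid is min($0.05, $0.04) = $0.04.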
Testing
make test
Blog

Read more about SparkSteps in our blog post: https://www.jwplayer.com/blog/sparksteps/
License

Apache License 2.0