Workflow tool to launch Spark jobs on AWS EMR


SparkSteps allows you to configure your EMR cluster and upload your Spark script and its dependencies via AWS S3. All you need to do is define an S3 bucket.

Install

pip install sparksteps

CLI Options

Prompt parameters:
  app:                          main spark script to submit (required)
  app-args:                     arguments passed to main spark script
  app-list:                     Space delimited list of applications to be installed on the EMR cluster (Default: Hadoop Spark)
  aws-region:                   AWS region name
  bid-price:                    specify bid price for task nodes
  bootstrap-action:             include a bootstrap script (s3 path)
  cluster-id:                   job flow id of existing cluster to submit to
  debug:                        allow debugging of cluster
  defaults:                     cluster configurations of the form "<classification1> key1=val1 key2=val2 ..."
  dynamic-pricing-master:       use spot pricing for the master nodes.
  dynamic-pricing-core:         use spot pricing for the core nodes.
  dynamic-pricing-task:         use spot pricing for the task nodes.
  ebs-volume-size-core:         size of the EBS volume to attach to core nodes in GiB.
  ebs-volume-type-core:         type of the EBS volume to attach to core nodes (supported: [standard, gp2, io1]).
  ebs-volumes-per-core:         the number of EBS volumes to attach per core node.
  ebs-optimized-core:           whether to use EBS optimized volumes for core nodes.
  ebs-volume-size-task:         size of the EBS volume to attach to task nodes in GiB.
  ebs-volume-type-task:         type of the EBS volume to attach to task nodes.
  ebs-volumes-per-task:         the number of EBS volumes to attach per task node.
  ebs-optimized-task:           whether to use EBS optimized volumes for task nodes.
  ec2-key:                      name of the Amazon EC2 key pair
  ec2-subnet-id:                Amazon VPC subnet id
  help (-h):                    argparse help
  jobflow-role:                 Amazon EC2 instance profile name to use (Default: EMR_EC2_DefaultRole)
  keep-alive:                   whether to keep the EMR cluster alive when there are no steps
  log-level (-l):               logging level (default=INFO)
  instance-type-master:         instance type of master host (default='m4.large')
  instance-type-core:           instance type of the core nodes, must be set when num-core > 0
  instance-type-task:           instance type of the task nodes, must be set when num-task > 0
  maximize-resource-allocation: sets the maximizeResourceAllocation property for the cluster to true when supplied.
  name:                         specify cluster name
  num-core:                     number of core nodes
  num-task:                     number of task nodes
  release-label:                EMR release label
  s3-bucket:                    name of s3 bucket to upload spark file (required)
  s3-path:                      path within s3-bucket to use when writing assets
  s3-dist-cp:                   s3-dist-cp step after spark job is done
  submit-args:                  arguments passed to spark-submit
  tags:                         EMR cluster tags of the form "key1=value1 key2=value2"
  uploads:                      files to upload to /home/hadoop/ in master instance
  wait:                         poll until all steps are complete (or error)
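
The `key1=value1 key2=value2` form used by `tags` can be illustrated with a short sketch. The helper name `parse_tags` is hypothetical and is not taken from the sparksteps source; the `Key`/`Value` dictionaries follow the tag shape that the EMR API accepts. Quoted values with spaces, such as `Application="Spark Steps"`, are handled with `shlex`.

```python
import shlex


def parse_tags(raw):
    """Turn 'key1=value1 key2=value2' into a list of EMR-style tag dicts.

    Hypothetical helper for illustration only; sparksteps may parse
    its --tags option differently.
    """
    tags = []
    for token in shlex.split(raw):
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        tags.append({"Key": key, "Value": value})
    return tags


if __name__ == "__main__":
    print(parse_tags('Application="Spark Steps" Team=data'))
```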

Example

AWS_S3_BUCKET=<insert-s3-bucket>
cd sparksteps/
sparksteps examples/episodes.py \
  --s3-bucket $AWS_S3_BUCKET \
  --aws-region us-east-1 \
  --release-label emr-4.7.0 \
  --uploads examples/lib examples/episodes.avro \
  --submit-args="--deploy-mode client --jars /home/hadoop/lib/spark-avro_2.10-2.0.2-custom.jar" \
  --app-args="--input /home/hadoop/episodes.avro" \
  --tags Application="Spark Steps" \
  --debug

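The above example creates an EMR cluster of 1 node with the default instance type m4.large, uploads the PySpark script episodes.py and its dependencies to the specified S3 bucket, and copies the files from S3 to the cluster. Each operation is defined as an EMR "step" that you can monitor in EMR. The final step runs the Spark application with submit args that include a custom spark-avro package and the app arg "--input".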

Run Spark Job on Existing Cluster

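You can use the option --cluster-id to specify an existing cluster to upload to and run the Spark job on. This is especially helpful for debugging.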
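
As a rough sketch, a submission to an existing cluster could be assembled from Python as shown below. Only flags listed under the CLI options are used. The function name `existing_cluster_command`, the bucket name `my-bucket`, and the job flow id `j-XXXXXXXXXXXXX` are placeholders made up for illustration.

```python
import shlex


def existing_cluster_command(app, s3_bucket, cluster_id, region=None):
    """Build the sparksteps argument list for an already running cluster.

    Hypothetical helper; it only assembles arguments and does not
    contact AWS.
    """
    args = ["sparksteps", app,
            "--s3-bucket", s3_bucket,
            "--cluster-id", cluster_id]
    if region:
        args += ["--aws-region", region]
    return args


if __name__ == "__main__":
    cmd = existing_cluster_command(
        "examples/episodes.py",
        "my-bucket",          # placeholder bucket name
        "j-XXXXXXXXXXXXX",    # placeholder job flow id
        region="us-east-1",
    )
    # Print a shell-safe command line; shlex.join needs Python 3.8+.
    print(shlex.join(cmd))
```

The resulting list can also be handed to `subprocess.run(cmd, check=True)` when sparksteps is installed and AWS credentials are configured.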

Dynamic Pricing (alpha)

Use the CLI option --dynamic-pricing-<instance-type> to allow sparksteps to dynamically determine the best bid price for EMR instances within a certain instance group.

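Currently the algorithm looks back at spot price history over the last 12 hours and calculates min(50% * on_demand_price, max_spot_price) to determine the bid price. That said, if the current spot price exceeds the on-demand cost, then on-demand instances are used to be conservative.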
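
The formula can be made concrete with a minimal sketch. It assumes the spot prices for the 12-hour lookback window have already been fetched into a list; the function names `bid_price` and `use_on_demand` are hypothetical, and this is not the sparksteps implementation itself.

```python
def bid_price(on_demand_price, spot_prices):
    """Compute min(50% * on_demand_price, max_spot_price).

    spot_prices is assumed to hold the spot prices observed during
    the lookback window (the last 12 hours in the description above).
    """
    if not spot_prices:
        raise ValueError("spot_prices must not be empty")
    max_spot_price = max(spot_prices)
    return min(0.5 * on_demand_price, max_spot_price)


def use_on_demand(on_demand_price, current_spot_price):
    """Fall back to on-demand when spot currently costs more than on-demand."""
    return current_spot_price > on_demand_price


if __name__ == "__main__":
    history = [0.02, 0.03, 0.025]  # made-up prices in USD per hour
    print(bid_price(0.10, history))
    print(use_on_demand(0.10, history[-1]))
```

With an on-demand price of 0.10 and a highest observed spot price of 0.03, the bid is the spot maximum. Once the spot maximum rises above half the on-demand price, the bid is capped at that half.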

Testing

make test

Blog

Read more about SparkSteps in our blog post: https://www.jwplayer.com/blog/sparksteps/

License

Apache License 2.0
