Workflow tool to launch Spark jobs on AWS EMR

sparksteps Python project description

SparkSteps lets you configure an EMR cluster and upload your Spark script and its dependencies via AWS S3. All you need to do is define an S3 bucket.
Installation
pip install sparksteps
CLI Options
Prompt parameters:
app: main Spark script for spark-submit (required)
app-args: arguments passed to main Spark script
app-list: space-delimited list of applications to be installed on the EMR cluster (default: Hadoop Spark)
aws-region: AWS region name
bid-price: specify bid price for task nodes
bootstrap-action: include a bootstrap script (S3 path)
cluster-id: job flow ID of existing cluster to submit to
debug: allow debugging of cluster
defaults: cluster configurations of the form "<classification1> key1=val1 key2=val2 ..."
dynamic-pricing-master: use spot pricing for the master nodes
dynamic-pricing-core: use spot pricing for the core nodes
dynamic-pricing-task: use spot pricing for the task nodes
ebs-volume-size-core: size of the EBS volume to attach to core nodes, in GiB
ebs-volume-type-core: type of the EBS volume to attach to core nodes (supported: standard, gp2, io1)
ebs-volumes-per-core: number of EBS volumes to attach per core node
ebs-optimized-core: whether to use EBS-optimized volumes for core nodes
ebs-volume-size-task: size of the EBS volume to attach to task nodes, in GiB
ebs-volume-type-task: type of the EBS volume to attach to task nodes
ebs-volumes-per-task: number of EBS volumes to attach per task node
ebs-optimized-task: whether to use EBS-optimized volumes for task nodes
ec2-key: name of the Amazon EC2 key pair
ec2-subnet-id: Amazon VPC subnet ID
help (-h): argparse help
jobflow-role: Amazon EC2 instance profile name to use (default: EMR_EC2_DefaultRole)
keep-alive: whether to keep the EMR cluster alive when there are no steps
log-level (-l): logging level (default: INFO)
instance-type-master: instance type of the master host (default: m4.large)
instance-type-core: instance type of the core nodes; must be set when num-core > 0
instance-type-task: instance type of the task nodes; must be set when num-task > 0
maximize-resource-allocation: sets the maximizeResourceAllocation property for the cluster to true when supplied
name: specify cluster name
num-core: number of core nodes
num-task: number of task nodes
release-label: EMR release label
s3-bucket: name of the S3 bucket to upload the Spark file to (required)
s3-path: path within s3-bucket to use when writing assets
s3-dist-cp: s3-dist-cp step to run after the Spark job is done
submit-args: arguments passed to spark-submit
tags: EMR cluster tags of the form "key1=value1 key2=value2"
uploads: files to upload to /home/hadoop/ on the master instance
wait: poll until all steps are complete (or error)
Example
AWS_S3_BUCKET=<insert-s3-bucket>
cd sparksteps/
sparksteps examples/episodes.py \
  --s3-bucket $AWS_S3_BUCKET \
  --aws-region us-east-1 \
  --release-label emr-4.7.0 \
  --uploads examples/lib examples/episodes.avro \
  --submit-args="--deploy-mode client --jars /home/hadoop/lib/spark-avro_2.10-2.0.2-custom.jar" \
  --app-args="--input /home/hadoop/episodes.avro" \
  --tags Application="Spark Steps" \
  --debug
The example above creates a 1-node EMR cluster with the default instance type m4.large, uploads the PySpark script episodes.py and its dependencies to the specified S3 bucket, and copies the files from S3 onto the cluster. Each action is defined as an EMR "step" that you can monitor in EMR. The final step runs the Spark application with submit arguments that include a custom Spark-Avro package and the application argument "--input".
Run Spark Job on an Existing Cluster

You can use the option --cluster-id to specify an existing cluster to upload and run the Spark job on. This is especially helpful for debugging.
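As a sketch, resubmitting the example script from above to an already running cluster might look like the following (the bucket name and job flow ID are placeholders you would replace with your own values):

```shell
# Placeholder values -- substitute your own bucket and EMR job flow ID.
AWS_S3_BUCKET=my-bucket
CLUSTER_ID=j-XXXXXXXXXXXXX

# With --cluster-id, sparksteps submits the script as a new step on the
# existing cluster instead of provisioning a fresh one.
sparksteps examples/episodes.py \
  --s3-bucket $AWS_S3_BUCKET \
  --aws-region us-east-1 \
  --cluster-id $CLUSTER_ID \
  --app-args="--input /home/hadoop/episodes.avro"
```

Because the cluster already exists, cluster-creation flags such as --release-label and --instance-type-master are not needed here.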
Dynamic Pricing (alpha)

Use the CLI option --dynamic-pricing-<instance-type> to allow sparksteps to dynamically determine the best bid price for EMR instances within a given instance group.

Currently the algorithm looks back at the spot price history over the past 12 hours and calculates min(50% * on_demand_price, max_spot_price) to determine the bid price. That said, if the current spot price is above the on-demand cost, then on-demand instances are used to be conservative.
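The pricing rule above can be sketched in a few lines of Python. This is an illustrative reimplementation of the formula as described, not sparksteps' actual code; the function name and its inputs are assumptions.

```python
def choose_bid(on_demand_price, spot_history, current_spot_price):
    """Sketch of the dynamic-pricing rule described above.

    spot_history: spot prices observed over the look-back window
    (per the docs, the past 12 hours).
    Returns (price, use_spot).
    """
    # Be conservative: if spot currently costs more than on-demand,
    # fall back to on-demand instances instead of bidding.
    if current_spot_price > on_demand_price:
        return on_demand_price, False

    # Otherwise bid the lesser of half the on-demand price and the
    # highest spot price seen in the window.
    bid = min(0.5 * on_demand_price, max(spot_history))
    return bid, True
```

For example, with an on-demand price of $0.10 and a 12-hour spot maximum of $0.04, the bid is min($0.05, $0.04) = $0.04.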
Testing
make test
Blog

Read more about SparkSteps in our blog post: https://www.jwplayer.com/blog/sparksteps/
License

Apache License 2.0