Helper tools for the Slurm HPC workload manager at Fred Hutch and elsewhere
A collection of simple command-line tools and wrappers, mostly found on GitHub
Why slurm-toys?
The purpose of slurm-toys is to package useful Slurm helper tools written in Python 3 or shell and publish them as a single package on PyPI.
Currently integrated toys
slurm-limiter
HPC clusters are optimized to maximize utilization with batch jobs. Fairshare helps ensure that all users receive an appropriate share of resources over time. However, fairshare can only influence jobs that have not yet started. If the cluster is 100% occupied by "large" users, "small" users become unhappy because they may not be able to get even a single node. Currently the only solution to this problem seems to be setting hard account limits. Unfortunately, these limits are often set too high for when the cluster is busy and too low for when it is idle. slurm-limiter addresses this by dynamically adjusting the limits based on the overall partition/queue load.
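The adjustment behavior can be sketched roughly as follows. This is a hypothetical illustration pieced together from the option descriptions in the --help output further down (the default values and the function name adjust_limit are assumptions; the actual implementation may differ):

```python
# Hypothetical sketch of slurm-limiter's dynamic-limit logic, inferred from
# the documented options; not the tool's actual code.

def adjust_limit(limit, percent_use, pending, idle_nodes,
                 min_limit=100, max_limit=300, change_step=10,
                 min_pending=50, max_percent_use=90, min_idle_nodes=5):
    """Return a new account core limit for the partition."""
    if idle_nodes < min_idle_nodes:
        # Critically few idle nodes: throttle straight down to the minimum.
        return min_limit
    if percent_use > max_percent_use:
        # Partition too busy: step the limit down, but never below min_limit.
        return max(min_limit, limit - change_step)
    if pending < min_pending:
        # Too small a backlog to justify any change.
        return limit
    # There is headroom and a real backlog: step the limit up toward the max.
    return min(max_limit, limit + change_step)
```

Run periodically (e.g. from cron, as shown below), a loop like this ratchets the account limit up and down between --minaccountlimit and --maxaccountlimit in --changestep increments.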
If you want a responsive HPC cluster, this process should not take longer than 5 seconds:
~$ time srun hostname
srun: job 61004624 queued and waiting for resources
srun: job 61004624 has been allocated resources
gizmof171
real 0m1.668s
user 0m0.044s
sys 0m0.012s
Example of use in a cron job, running every 20 minutes:
*/20 * * * * root ( ml Python/3.6.4-foss-2016b-fh1; /app/bin/slurm-limiter -p campus \
--error-email=sysadmin\@institute.org --minaccountlimit=50 --maxaccountlimit=350 \
--slaaccountlimit=300 --changestep=50 --maxpercentuse=90 \
--minidlenodes=5 ) >>/var/tmp/slurm-limiter.log 2>&1
Example output to syslog:
~$ grep slurm-limiter: /var/log/syslog
Apr 15 09:40:03 gizmo-ctld slurm-limiter: INFO:slurm-limiter.85: Cores: running=689, pending=3299, total=1180, Usage=58 %, Limits: 350 / 370, Nodes: idle=101
Apr 15 10:00:03 gizmo-ctld slurm-limiter: INFO:slurm-limiter.85: Cores: running=689, pending=3274, total=1180, Usage=58 %, Limits: 350 / 370, Nodes: idle=101
Apr 15 10:20:03 gizmo-ctld slurm-limiter: INFO:slurm-limiter.85: Cores: running=680, pending=3241, total=1180, Usage=57 %, Limits: 350 / 370, Nodes: idle=102
Apr 15 10:40:03 gizmo-ctld slurm-limiter: INFO:slurm-limiter.85: Cores: running=680, pending=3219, total=1180, Usage=57 %, Limits: 350 / 370, Nodes: idle=102
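To track the limiter's behavior over time, log lines like the ones above can be parsed into their numeric fields. A minimal sketch (the field names are assumptions; based on the --userlimitoffset option, the second "Limits" value appears to be the user limit, i.e. the account limit plus the offset):

```python
import re

# Matches the slurm-limiter syslog lines shown above, e.g.
# "... Cores: running=689, pending=3299, total=1180, Usage=58 %,
#  Limits: 350 / 370, Nodes: idle=101"
LINE_RE = re.compile(
    r"Cores: running=(?P<running>\d+), pending=(?P<pending>\d+), "
    r"total=(?P<total>\d+), Usage=(?P<usage>\d+) %, "
    r"Limits: (?P<account_limit>\d+) / (?P<user_limit>\d+), "
    r"Nodes: idle=(?P<idle>\d+)"
)

def parse_limiter_line(line):
    """Return a dict of integer fields from one log line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {key: int(value) for key, value in m.groupdict().items()}
```

Feeding each matching line of /var/log/syslog through this function yields a time series of usage, limits, and idle-node counts that is easy to plot or alert on.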
Output of slurm-limiter --help
~$ slurm-limiter --help
usage: slurm-limiter [-h] [--debug] [--error-email ERROREMAIL]
[--cluster CLUSTER] [--partition PARTITION]
[--feature FEATURE] [--qos QOS]
[--maxaccountlimit MAXLIMIT] [--minaccountlimit MINLIMIT]
[--slaaccountlimit SLALIMIT]
[--userlimitoffset USERLIMITOFFSET]
[--changestep CHANGESTEP] [--minpending MINPENDING]
[--maxpercentuse MAXPERCENTUSE]
[--minidlenodes MINIDLENODES]
slurm-limiter checks the current util of a slurm cluster and adjusts the
account and user limits dynamically within certain range
optional arguments:
-h, --help show this help message and exit
--debug, -d verbose output for all commands
--error-email ERROREMAIL, -e ERROREMAIL
send errors to this email address.
--cluster CLUSTER, -M CLUSTER
name of the slurm cluster, (default: current cluster)
--partition PARTITION, -p PARTITION
partition of the slurm cluster (default: entire
cluster)
--feature FEATURE, -f FEATURE
filter for only this slurm feature
--qos QOS, -q QOS slurm QOS to use for changing account limits (default:
public)
--maxaccountlimit MAXLIMIT, -x MAXLIMIT
maximum account limit, never go above this (default:
300)
--minaccountlimit MINLIMIT, -n MINLIMIT
minimum account limit, never go below this (default:
100)
--slaaccountlimit SLALIMIT, -t SLALIMIT
min SLA limit that has been committed to customers,
notify via email if breached (default: 150)
--userlimitoffset USERLIMITOFFSET, -o USERLIMITOFFSET
offset of userlimit from account limit, set a negative
number for a userlimit lower than account limit
(default: 20)
--changestep CHANGESTEP, -s CHANGESTEP
increase or decrease the limit by this # of cores
(default: 10)
--minpending MINPENDING, -i MINPENDING
minimum number of jobs that have to be pending to take
action (default: 50)
--maxpercentuse MAXPERCENTUSE, -u MAXPERCENTUSE
maximum allowed % usage in this cluster or partition
Throttle QOS down by --changestep if exceeded.
(default: 90)
--minidlenodes MINIDLENODES, -w MINIDLENODES
critical minimum number of idle nodes. Throttle QOS
down to --minaccountlimit if exceeded. (default: 5)
Future toys
In the future we may integrate additional tools, mostly things found on GitHub:
https://github.com/search?l=Python&p=1&q=slurm+&type=Repositories
https://github.com/search?l=Shell&q=slurm+&type=Repositories