Launching a Ray cluster on AWS from YAML: AttributeError: 'Worker' object has no attribute 'worker_id'

Posted 2024-06-17 10:31:01


I don't know where this is coming from, or why the error happens:

The cluster launches fine from the YAML, but when I look at the logs I find this error.

Is it still working despite the error? And how can I check the printed output from inside the Docker image?

Ray doesn't seem to have any "working" examples to follow. I'm trying to launch the simplest possible version of an AWS Docker cluster as a proof of concept.

 ray exec /home/user/unit/aws_docker_simple/simple.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'

Fetched IP: xxxxxxxxx
Warning: Permanently added 'xxxxxxxxx' (ECDSA) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 854, in custom_excepthook
    worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'

Original exception was:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 390, in <module>
    redis_password=args.redis_password)
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 111, in __init__
    self.load_metrics)
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 76, in __init__
    self.reset(errors_fatal=True)
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 490, in reset
    raise e
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 452, in reset
    self.available_node_types = self.config["available_node_types"]
KeyError: 'available_node_types'

==> /tmp/ray/session_latest/logs/monitor.log <==

==> /tmp/ray/session_latest/logs/monitor.out <==
Shared connection to 18.130.107.42 closed.
Error: Command failed:

  ssh -tt -i /home/joe/.ssh/aws_ubuntu_test.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ff32489f9/8dbdda48fb/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@xxxxxxxx bash --login -c -i ''"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  my_simple_docker_container /bin/bash -c '"'"'"'"'"'"'"'"'bash --login -c -i '"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (tail -n 100 -f /tmp/ray/session_latest/logs/monitor*)'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"''"'"'"'"'"'"'"'"' )'"'"''

(base) xxxxx:~/RAY_AWS_DOCKER/3xxxxx/aws_docker_simple$  ray exec /home/xxxxxxxxx/unit/aws_docker_simple/simple.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: xxxxxx
Warning: Permanently added 'xxxxxxxx' (ECDSA) to the list of known hosts.



==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 854, in custom_excepthook
    worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'

Original exception was:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 390, in <module>
    redis_password=args.redis_password)
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 111, in __init__
    self.load_metrics)
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 76, in __init__
    self.reset(errors_fatal=True)
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 490, in reset
    raise e
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 452, in reset
    self.available_node_types = self.config["available_node_types"]
KeyError: 'available_node_types'

Dockerfile:

FROM continuumio/miniconda3:4.7.10
CMD ["mkdir", "hello_folder"]
CMD ["echo", "Hello StackOverflow!"]

The YAML:

cluster_name: simple

min_workers: 0

max_workers: 2

docker:
    image: "xxxxxx/simple "
    container_name: "my_simple_docker_container"
    pull_before_run: True

idle_timeout_minutes: 5

initialization_commands:

#    - curl https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh --output anaconda.sh
#    - bash anaconda.sh
#    - conda install python=3.8

    - sudo apt-get update
    - sudo apt-get upgrade
    - sudo apt-get install -y python-setuptools
    - sudo apt-get install -y build-essential curl unzip psmisc
    - pip install --upgrade pip
    - pip install discord

    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f


provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a

file_mounts_sync_continuously: False



auth:
    ssh_user: ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem
head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxxfd2c
    KeyName: aws_ubuntu_test

    BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 200

worker_nodes:
   InstanceType: c5.2xlarge
   ImageId: ami-xxxxxxxxfd2c
   KeyName: aws_ubuntu_test
   InstanceMarketOptions:
        MarketType: spot

file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

setup_commands:
    - conda install python=3.7
    - conda create --name ray
    - conda activate ray
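    # Note: the autoscaler runs each setup command in its own fresh login
    # shell, so this "conda activate ray" does not carry over to the
    # commands below.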
    - conda install --name ray pip
    - pip install --upgrade pip
    - pip install discord
    - pip install ray

head_setup_commands:
    - pip install boto3==1.4.8

worker_setup_commands:  []

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

2 Answers

A better solution is to make sure that the Ray version on the cluster (head and workers) is the same as the Ray version on your local machine.

You can check this by running:

ray --version

both locally and on the cluster, which you can attach to with:

ray attach config.yaml
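If the versions differ, you can pin the cluster to your local version from the YAML itself. A minimal sketch, assuming your local "ray --version" reports 1.0.0 (the 1.0.0 here is an illustrative placeholder, not from the question; substitute your own version):

setup_commands:
    - pip install --upgrade pip
    # Pin the cluster's Ray to the version on the machine running "ray up",
    # so the autoscaler and the generated ray_bootstrap_config.yaml agree
    # on the config schema.
    - pip install ray==1.0.0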

This is caused by a problem with the Ray version. For example, if you do pip install ray==1.0, it works.
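For context on the traceback: the KeyError means the monitor's autoscaler expected an available_node_types section, which newer Ray releases use in place of the head_node/worker_nodes fields above. The sketch below is a hedged illustration of that block for this cluster, assuming a Ray version new enough to use the multi-node-type schema (the exact field names depend on your release, so check the autoscaler reference for your version; the type names head_node_type and worker_node_type are arbitrary labels):

available_node_types:
    head_node_type:
        node_config:
            InstanceType: c5.2xlarge
            ImageId: ami-xxxxxxxxfd2c
        resources: {}
    worker_node_type:
        node_config:
            InstanceType: c5.2xlarge
            ImageId: ami-xxxxxxxxfd2c
            InstanceMarketOptions:
                MarketType: spot
        resources: {}
        min_workers: 0
        max_workers: 2
head_node_type: head_node_type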
