Nova计算和网络在重启管理服务后无法联系nova服务
我有两个节点的设置来运行OpenStack。
第一个节点上有管理服务,比如 nova-api
、nova-scheduler
和 glance
等等。
第二个节点上则有网络和计算服务。
当我检查 nova-manage service list
时,所有服务都显示正常。
但是,当我重启管理节点(节点1)时,计算节点就断开了连接。
当计算节点尝试连接管理节点时,计算日志中会显示错误信息。
2013-01-21 20:49:28 TRACE nova.manager Traceback (most recent call last):
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/manager.py", line 155, in periodic_tasks
2013-01-21 20:49:28 TRACE nova.manager task(self, context)
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 2244, in _heal_instance_info_cache
2013-01-21 20:49:28 TRACE nova.manager context, self.host)
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/db/api.py", line 594, in instance_get_all_by_host
2013-01-21 20:49:28 TRACE nova.manager return IMPL.instance_get_all_by_host(context, host)
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/db/sqlalchemy/api.py", line 103, in wrapper
2013-01-21 20:49:28 TRACE nova.manager return f(*args, **kwargs)
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/db/sqlalchemy/api.py", line 1582, in instance_get_all_by_host
2013-01-21 20:49:28 TRACE nova.manager return _instance_get_all_query(context).filter_by(host=host).all()
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.3-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 1922, in all
2013-01-21 20:49:28 TRACE nova.manager return list(self)
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.3-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2032, in __iter__
2013-01-21 20:49:28 TRACE nova.manager return self._execute_and_instances(context)
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.3-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2047, in _execute_and_instances
2013-01-21 20:49:28 TRACE nova.manager result = conn.execute(querycontext.statement, self._params)
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.3-py2.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1399, in execute
2013-01-21 20:49:28 TRACE nova.manager params)
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.3-py2.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1532, in _execute_clauseelement
2013-01-21 20:49:28 TRACE nova.manager compiled_sql, distilled_params
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.3-py2.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1640, in _execute_context
2013-01-21 20:49:28 TRACE nova.manager context)
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.3-py2.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1633, in _execute_context
2013-01-21 20:49:28 TRACE nova.manager context)
2013-01-21 20:49:28 TRACE nova.manager File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.3-py2.6-linux-x86_64.egg/sqlalchemy/engine/default.py", line 330, in do_execute
2013-01-21 20:49:28 TRACE nova.manager cursor.execute(statement, parameters)
2013-01-21 20:49:28 TRACE nova.manager OperationalError: (OperationalError) socket not open
重启计算和网络服务后,问题就解决了。但在我重启计算或网络之前,它一直会报错。
我在计算节点上检查控制器的套接字是否打开。
[root@compute ~]# ps -ef | grep compute
nova 30859 1 27 18:51 ? 00:00:03 /usr/bin/python /usr/bin/nova-compute --config-file /etc/nova/nova.conf --logfile /var/log/nova/compute.log
root 30996 30807 0 18:51 pts/0 00:00:00 grep compute
[root@compute ~]# netstat -p | grep 30859
tcp 0 0 compute:56988 controller:postgres ESTABLISHED 30859/python
tcp 0 0 compute:37869 controller:amqps ESTABLISHED 30859/python
tcp 0 0 compute:37871 controller:amqps ESTABLISHED 30859/python
unix 3 [ ] STREAM CONNECTED 3588759 30859/python
控制器上有两个套接字是打开的,分别是 postgres
和 amqps
。
当我在控制器上运行 reboot now
并检查控制器上有多少个套接字时,
[root@compute ~]# netstat -p | grep 30859
tcp 208 0 compute:56988 controller:postgres CLOSE_WAIT 30859/python
unix 3 [ ] STREAM CONNECTED 3590103 30859/python
unix 3 [ ] STREAM CONNECTED 3588759 30859/python
发现 postgres
的套接字是关闭的。
当控制器上的所有服务都启动后,我再次运行相同的命令来检查连接到控制器的套接字,结果是一样的。
为什么计算节点不为 postgres
创建新的套接字呢?
1 个回答
你遇到的这个socket错误,是因为nova-compute在尝试联系你在nova.conf文件中配置的数据库,正如Matt Joyce之前提到的。在日志的早些地方,你可以看到这个服务的所有配置值。找一下“Full set of FLAGS”这个字符串——它至少能给你一些关于配置内容的提示——它会把“sql_connection”的实际值隐藏起来(因为里面通常包含密码),但这可能有助于你理解发生了什么。
根据我对你问题的理解,nova-compute的日志文件显示这个错误,直到你重启服务。我的理解对吗?重启后它就能正常工作了?
如果是这样的话,是否有某个东西在基础包安装后又对nova进行了配置?比如运行chef、puppet之类的工具,在服务可能已经用错误配置启动后,又添加了一些配置细节?