Using Intelligent Platform Management Interface at the LHC Grid

Hugo Monteiro Cacote and Massimiliano Masi
CERN, Switzerland

With this poster we propose the use of the Intelligent Platform Management Interface (IPMI) for temperature monitoring of the Tier-0 of the LHC Computing Grid (LCG), located at CERN. IPMI is a standard from Intel, Sun and others. It is implemented as a daughter card with its own processor and non-volatile memory. IPMI functionality includes remotely powering a machine on and off (via an Ethernet interface) and remote and local access to sensor readings. Standard sensors cover temperature, fan speed and chassis intrusion. Monitoring the temperature of a large cluster (the Tier-0 comprises nearly 8000 machines) is very important, since temperature peaks can indicate poor air circulation, which can damage servers. Fan speed can indicate fan failures, which are difficult to detect in very large installations. Our cluster has a monitoring suite called LHC Era Monitoring (LEMON), which is already connected to the Grid schedulers. Our solution feeds the LEMON infrastructure by sending CPU temperatures to the LEMON repository. To keep the software on our machines up to date we use the Quattor suite (http://www.quattor.org). Our solution integrates with Quattor by configuring the Baseboard Management Controller (BMC) at installation time or when an update is scheduled. The full architecture is shown in the figure below.
IPMI at CERN architecture
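
As an illustration of the monitoring step, the following minimal sketch parses a pipe-separated sensor table of the kind printed by `ipmitool sensor` and flags high temperatures and slow fans. The sample readings, the thresholds and the function names are assumptions made for this example; the actual sensor names depend on the hardware, and the LEMON interface is not shown.

```python
# Illustrative sample in the column layout printed by `ipmitool sensor`
# (name | value | unit | status | thresholds...). The values are invented.
SAMPLE_OUTPUT = """\
CPU1 Temp     | 45.000 | degrees C | ok     | na | na      | na       | 85.000 | 90.000 | 95.000
CPU2 Temp     | 91.000 | degrees C | cr     | na | na      | na       | 85.000 | 90.000 | 95.000
FAN1          | 0.000  | RPM       | cr     | na | 500.000 | 1000.000 | na     | na     | na
Chassis Intru | 0x0    | discrete  | 0x0000 | na | na      | na       | na     | na     | na
"""


def parse_sensor_table(text):
    """Return the numeric sensor readings found in a sensor table."""
    readings = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # not a sensor row
        name, value, unit, status = fields[:4]
        try:
            number = float(value)
        except ValueError:
            continue  # discrete or unreadable sensors ("0x0", "na")
        readings.append(
            {"name": name, "value": number, "unit": unit, "status": status}
        )
    return readings


def find_alarms(readings, max_temp_c=80.0, min_fan_rpm=500.0):
    """Return the names of sensors outside the (assumed) safe range."""
    alarms = []
    for reading in readings:
        if reading["unit"] == "degrees C" and reading["value"] > max_temp_c:
            alarms.append(reading["name"])
        elif reading["unit"] == "RPM" and reading["value"] < min_fan_rpm:
            alarms.append(reading["name"])
    return alarms


if __name__ == "__main__":
    for sensor in find_alarms(parse_sensor_table(SAMPLE_OUTPUT)):
        print("alarm:", sensor)
```

In a deployment, the numeric readings returned by `parse_sensor_table` would be the values forwarded to the monitoring repository.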

Due to the layout of the Tier-0, machines can be located hundreds of meters underground, near the CERN experiments. In this setting, remote power on and off is an important feature: without it, our system administrators have to go underground in person to power a machine off and on when needed. Of course, remote power control exposes the machine to severe security risks. Standard IPMI cards come with a security infrastructure based on usernames and passwords. In our infrastructure, using different passwords for thousands of machines was too difficult to manage.

Quattor comes with the Secure Information Delivery System (SINDES), which holds an X.509 certificate for each machine in the cluster and is used to deliver sensitive information such as root passwords or Kerberos keytabs. Users in the LHC Grid and at CERN have a Kerberos account or a Grid proxy. We propose a solution based on GSSAPI to authenticate and authorize users to obtain administrator privileges on the IPMI Baseboard Management Controller (BMC). At installation time a random password is generated using the makepasswd program. This password is sent to the central SINDES repository, which stores it in a tree. We provide wrappers for IPMI functions such as power on/off, sensor readings and power status. These wrappers check the environment for a valid X.509 proxy certificate or a valid Kerberos ticket. Then, using GSSAPI, they obtain on the local machine a Kerberos token for accessing a web service on the SINDES server, which enforces a policy based on the access control lists for the target machine (i.e. the machine that must be powered off). If the username contained in the token matches a valid policy, the SINDES server retrieves the password for the BMC of the machine and forks ipmitool, a popular open-source IPMI implementation, to perform the requested operation. The SINDES server is not a bottleneck, since remote IPMI operations are infrequent and the machine certificates are refreshed only once a year.
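
The server-side decision described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the deployed code: the `PasswordStore` class is a hypothetical in-memory stand-in for the SINDES repository, Python's `secrets` module replaces makepasswd, the BMC user name `admin` and the `lan` interface are assumed, and the GSSAPI exchange is taken as already completed, so the function receives a verified principal name.

```python
import secrets

# Operations offered by the wrappers, mapped to ipmitool sub-commands.
ALLOWED_OPERATIONS = {
    "power-on": ["chassis", "power", "on"],
    "power-off": ["chassis", "power", "off"],
    "power-status": ["chassis", "power", "status"],
    "sensors": ["sensor"],
}

BMC_USER = "admin"  # assumption: the BMC account name is site specific


class PasswordStore:
    """Hypothetical in-memory stand-in for the central password repository."""

    def __init__(self):
        self._passwords = {}  # host -> BMC password
        self._acl = {}        # host -> set of authorized principals

    def register_machine(self, host, admins):
        """Generate and store a random BMC password at installation time."""
        password = secrets.token_urlsafe(16)  # stand-in for makepasswd
        self._passwords[host] = password
        self._acl[host] = set(admins)
        return password

    def authorize(self, principal, host):
        """Check the access control list of the target machine."""
        return principal in self._acl.get(host, set())

    def build_command(self, principal, host, operation):
        """Return (argv, env) for ipmitool, or raise if not permitted."""
        if operation not in ALLOWED_OPERATIONS:
            raise ValueError("unsupported operation: " + operation)
        if not self.authorize(principal, host):
            raise PermissionError(principal + " may not manage " + host)
        # -E makes ipmitool read the password from IPMI_PASSWORD, which
        # keeps it out of the process argument list.
        argv = ["ipmitool", "-I", "lan", "-H", host, "-U", BMC_USER, "-E"]
        argv += ALLOWED_OPERATIONS[operation]
        env = {"IPMI_PASSWORD": self._passwords[host]}
        return argv, env
```

The returned `argv` and `env` could then be handed to `subprocess.run`; the sketch stops short of that so it can be exercised without a BMC.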

Future work on this topic could include evaluating the impact of CPU temperatures and fan speeds on job routing (i.e. if a cluster is hot, should jobs be routed to another one?). The security of our infrastructure is limited to Kerberos (which is widely used at CERN, but may be less common outside the Tier-0) and Grid proxies. By adopting other standards such as SAML, we could move from our dedicated authentication and authorization structure to a standard Single Sign-On approach, so that our users could operate IPMI, as well as other services, remotely using their own authentication services. The figure below shows the temperature status of the machines in production.

Air Conditioning status
This figure shows how air circulates in the computer room. Statistics can be computed from these data in order to locate "hot spots" in the area.
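
One simple statistic of this kind is sketched below: average the CPU temperatures per rack and flag racks whose mean lies more than one standard deviation above the room average. The rack labels, the temperatures and the one-sigma threshold are assumptions chosen for illustration, not measurements from the Tier-0.

```python
from statistics import mean, pstdev

# Invented CPU temperatures (degrees C), grouped by rack label.
SAMPLE_TEMPERATURES = {
    "A": [40.0, 42.0],
    "B": [41.0, 43.0],
    "C": [60.0, 62.0],
    "D": [39.0, 41.0],
}


def hot_spots(temperatures_by_rack, sigmas=1.0):
    """Return racks whose mean temperature is unusually high for the room."""
    rack_means = {
        rack: mean(values) for rack, values in temperatures_by_rack.items()
    }
    room_mean = mean(rack_means.values())
    room_spread = pstdev(rack_means.values())
    threshold = room_mean + sigmas * room_spread
    return sorted(rack for rack, m in rack_means.items() if m > threshold)


if __name__ == "__main__":
    print(hot_spots(SAMPLE_TEMPERATURES))
```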