On routing tolerant to power supply failure of a cabinet of a torus-based supercomputer

Supercomputers are massively parallel systems that include hundreds of thousands of compute nodes. Power management is key to such machines: nodes are gathered into cabinets, each consuming 15 to 20 kW in the case of the IBM Blue Gene/L, for a total of more than 1 MW (1.2 MW for the IBM Blue Gene/L). The failure of the power supply unit of a single cabinet would thus trigger a sudden and massive increase in the number of faulty compute nodes. Conventional approaches to fault tolerance in such interconnection networks can cope with only a few faulty nodes or edges, and are thus not tolerant of such a cabinet power supply failure. In this paper, recognising that torus-based interconnects are very popular, we propose and evaluate a particular organisation of nodes into cabinets, together with a routing algorithm capable of dealing with such a failure scenario in an n-dimensional torus.