Scalable Installation

Section: C3 User Manual (1)

Index

SCALABLE INSTALLATION
MISC

I. SCALABLE INSTALLATION

First, read the INSTALL document and make sure you have followed the directions in steps A and B and have read step C.
D. Scalable configuration file

The syntax of the scalable configuration file is identical to that of the non-scalable configuration file, but the meanings of the positions within a cluster block have changed.

The basic concept is that the cluster is broken into smaller sub-clusters that execute in parallel. For example, a 64-node cluster could be broken into many different combinations: eight 8-way sub-clusters, four 16-way sub-clusters, or two 32-way sub-clusters. The closer the decomposition is to a square (the number of sub-clusters roughly equal to the number of nodes in each sub-cluster), the better the performance - thus we choose the eight 8-way execution model.

There may be other considerations when deciding the level of fanout in each sub-cluster. No sub-cluster should have nodes in its list of responsibilities that sit on different switches - inter-switch communication is much slower than intra-switch communication. There is also a practical maximum fanout. For example, on our hardware the scripts begin to slow down noticeably at around a 64-way fanout, so the largest cluster we would support is 64 sub-clusters of 64-way fanout (4096 nodes). For most people this only makes a difference with slow hardware and a large number of nodes. Lastly, for small clusters (8 nodes and below in our case) the non-scalable configuration may be faster due to its lower communication overhead, once again depending on your hardware.

The last major decision that must be made before continuing is whether to include the staging node (the "head node" of each sub-cluster - the command is staged on that node before being sent to its list of responsibilities) in its own list of responsibilities. That is, should the staging node be separate from the compute nodes or one of them? The staging node should be separate when you have dedicated a node to each sub-cluster; a system administrator will often find that such nodes need to be kept separate and should, at the least, have their own separate private copy. If the staging node is simply another node in the cluster, it should include itself, as this is what most users would expect. A dedicated staging node is listed first and then left out of the node range, as in the sketch below.
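
For example, a sub-cluster with a dedicated staging node that is not part of its own list of responsibilities might look like this (the hostname stage1 is illustrative):

cluster part1 {
        stage1             #dedicated staging node
        node[1-8]          #list of responsibilities; stage1 is not included
}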

Once those decisions are made there are two versions of a scalable cluster to choose from. A direct scalable cluster keeps the entire layout of the cluster in a single file on the head node. Due to the extra communication between the head node and the staging nodes this is slightly slower (though it would only be noticeable on quick commands such as cexec), but it has the advantage of being easy to administer. An indirect scalable cluster keeps only a pointer to each staging node, and that node stores its list of responsibilities locally. While this is somewhat faster, it can be difficult to keep all the files correctly in sync with the hardware: if a node goes offline it can be troublesome to keep track of, and if a staging node goes offline it can be difficult to set up another node in its place. We use the direct scalable cluster as it is more convenient.

NOTE: of the following two examples, the first is a direct cluster and the second is an indirect cluster. Notice that they use the same syntax as the non-scalable model, but with a different meaning. All the ranges, excludes, and dead tags still apply.

*************************************
64-node direct scalable

cluster part1 {
        node1              #staging node
        node[1-8]        #list of responsibilities
}

cluster part2 {
        node9
        node[9-16]
}

cluster part3 {
        node17
        node[17-24]
}

cluster part4 {
        node25
        node[25-32]
}

cluster part5 {
        node33
        node[33-40]
}

cluster part6 {
        node41
        node[41-48]
}

cluster part7 {
        node49
        node[49-56]
}

cluster part8 {
        node57
        node[57-64]
}

*************************************
64-node indirect scalable

cluster part1 {
        :node1 #staging node
}

cluster part2 {
        :node9
}

cluster part3 {
        :node17
}

cluster part4 {
        :node25
}

cluster part5 {
        :node33
}

cluster part6 {
        :node41
}

cluster part7 {
        :node49
}

cluster part8 {
        :node57
}

On node1, /etc/c3.conf contains:

cluster stage {
        node1
        node[1-8]
}
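
Each staging node in the indirect model keeps a similar local file holding its own list of responsibilities. For example, following the same pattern, on node9 /etc/c3.conf would contain:

cluster stage {
        node9
        node[9-16]
}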

II. MISC

Two commands do not benefit from the scalable execution model. cget, because all of its transfers must return to a single collection point, will not see much - if any - improvement.

Because SystemImager does not support staging of images, cpushimage does not directly benefit from the scalable model. You can, however, manipulate where an image is located and what it is an image of to get some benefit. First, take an image of a compute node onto one of the staging nodes. Then, on the head node, take an image of that staging node. Next, using cpushimage with the --head option, push that image out to each staging node. Finally, using cpushimage, push the image stored on each staging node to its compute nodes, making sure that the staging node does not include itself in its list of responsibilities. This sequence is outlined below.
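
A rough outline of that sequence, assuming SystemImager's getimage command for image capture (the image names compute_image and staging_image are illustrative, and option spellings may vary with your SystemImager version):

# 1. On a staging node (e.g. node1): capture an image of one compute node.
getimage --golden-client node2 --image compute_image

# 2. On the head node: capture an image of that staging node, which now
#    holds compute_image in its local image store.
getimage --golden-client node1 --image staging_image

# 3. From the head node, push the staging-node image to every staging node.
cpushimage --head staging_image

# 4. Push compute_image from each staging node to its compute nodes,
#    making sure no staging node includes itself in its list of
#    responsibilities.
cpushimage compute_image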


Last Modified: