Scalable Cluster Software
MPI MTBF
- We have about 1000 CPU cycles to spend on communication software
- We must not use any data structures that scale up with the number of nodes
- There are not a lot of cycles left to do any kind of error checking and processing
- One bit flip per year on a single link adds up to about three bit errors per day across 1000 links (1000 / 365 ≈ 2.7)
- Error checking and correction is a must
- And we cannot afford cycles for it
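
The link error arithmetic above can be checked with a short sketch. It assumes errors occur independently and at the same rate on every link; the function and parameter names are illustrative only and do not come from the slides.

```python
def errors_per_day(links: int,
                   errors_per_link_per_year: float = 1.0,
                   days_per_year: float = 365.0) -> float:
    """Expected bit errors per day across the whole machine.

    Assumes every link fails independently at the same rate
    (an assumption made for this sketch).
    """
    return links * errors_per_link_per_year / days_per_year


def hours_between_errors(links: int,
                         errors_per_link_per_year: float = 1.0) -> float:
    """Mean time between bit errors, in hours, for the whole machine."""
    return 24.0 / errors_per_day(links, errors_per_link_per_year)


if __name__ == "__main__":
    # Print the expected daily error count and the mean gap between
    # errors for a few machine sizes.
    for n in (1, 100, 1000, 10000):
        print(n, errors_per_day(n), hours_between_errors(n))
```

With 1000 links the daily rate rounds to three, matching the figure above. The rate grows linearly with the number of links, so a tenfold larger machine sees errors ten times as often.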