As mentioned in the introduction, there is a trend to build systems with a rather small number (up to 16) of RISC processors that are tightly integrated in a cluster, a Symmetric Multi-Processing (SMP) node. The processors in such a node are virtually always connected by a 1-stage crossbar, while the clusters themselves are connected by a less costly network. Such a system may look as depicted in Figure 8. Note that in Figure 8 all CPUs in a cluster are connected to a common part of the memory.
Figure 8: Block diagram of a system with a "hybrid" network: clusters of four CPUs are connected by a crossbar. The clusters are connected by a less expensive network, e.g., a Butterfly network.
(Figure 8 looks functionally identical to Figure 6; however, there is a difference that cannot be expressed in the figure: all memory is directly accessible by all processors without the need to transfer the data explicitly.)
Presently, no commercially available machine uses the S-COMA scheme. By contrast, there are several popular ccNUMA systems (like Bull's bullx R422 series, the HP Superdome, and the SGI Altix UV) commercially available. An important characteristic of NUMA machines is the NUMA factor. This factor expresses the difference in latency between accessing data in a local memory location and in a non-local one. Depending on the connection structure of the system, the NUMA factor can differ from part to part of a system: accessing data from a neighbouring node will be faster than from a distant node, for which possibly a number of stages of a crossbar must be traversed. So, when a NUMA factor is quoted, it mostly refers to the largest network cross-section, i.e., the maximal distance between processors.
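The impact of the NUMA factor on average access latency can be illustrated with a small back-of-the-envelope calculation. The latencies below are hypothetical, chosen only for illustration; actual values depend entirely on the machine at hand:

```python
# Hypothetical access latencies (in ns) for a two-level NUMA system.
local_latency = 100        # access to memory on the local node
neighbour_latency = 200    # access to memory on a neighbouring node
distant_latency = 400      # access across the largest network cross-section

# The quoted NUMA factor usually refers to the worst case,
# i.e., the largest network cross-section.
numa_factor = distant_latency / local_latency

# Average latency for a given fraction of distant accesses shows why
# data placement matters: even 20% distant accesses raise the
# average latency by 60% in this example.
remote_fraction = 0.2
avg_latency = ((1 - remote_fraction) * local_latency
               + remote_fraction * distant_latency)

print(numa_factor)   # 4.0
print(avg_latency)   # 160.0
```

The same weighted-average reasoning explains why operating systems and runtimes on ccNUMA machines try to allocate a process's memory on the node where it runs.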
Since the appearance of multi-core processors, the ccNUMA phenomenon also manifests itself within the processors themselves: the first- and second-level caches belong to a particular core, and therefore when another core needs data that does not reside in its own cache, it has to retrieve it via the complete memory hierarchy of the processor chip. This is typically orders of magnitude slower than fetching it from its local cache.
For all practical purposes we can classify these systems as SM-MIMD machines, also because special assisting hardware/software (such as a directory memory) has been incorporated to establish a single system image, although the memory is physically distributed.