Profiling Ray on distributed computers

Sébastien Boisvert

Ray uses the parallel runtime engine RayPlatform. When running a compute job with Ray, some profiling information is generated by default, without any penalty.

## Message passing

The first report is the network test. Before doing the biologically relevant work, Ray tests the network: each MPI rank sends a number of messages to measure the round-trip time, so the point-to-point latency is half that value. The result is written in RayOutput/NetworkText.txt

The network test can also dump detailed data if the option -write-network-test-raw-data is provided.

RayPlatform implements an array of communication graphs: complete, de Bruijn, Kautz, hypercube, polytope (via hypercube), random, and group. Message routing is activated with -route-messages, and the model is selected with -connection-type. The complete model is equivalent to not routing at all. The best model is the hypercube/regular polytope, because it can load-balance the routed messages. See Documentation/Routing.txt

This is useful on supercomputers whose network hardware does not support large numbers of communication peers.

## Signal scheduling

In Ray, all code paths must follow the format imposed by RayPlatform: all the code is put inside functions/methods whose names start with call_ followed by the signal name. The Ray code actually runs inside a kind of supervisor implemented in RayPlatform, which delegates the signals, such as a received message, or a slave mode or master mode that must be executed for one tick.

All scheduling information is written in RayOutput/Scheduling/*. These reports provide, for each MPI rank, the granularity in nanoseconds, the number of messages sent/received per second, the number of ticks in the supervisor, the total number of milliseconds spent in any given slave mode or master mode, and so on.
Statistics on messages sent or received are written in RayOutput/MessagePassingInterface.txt

The option -show-communication-events activates the reporting of all communication events (send and receive operations). The option -run-profiler runs Ray in a slower mode in which the supervisor collects a lot of profiling information. There is also a compile-time option to add collectors in the code: PROFILER_COLLECT=y (or -D CONFIG_PROFILER_COLLECT).

## Running time

The file RayOutput/ElapsedTime.txt contains a human-readable report of the time required by each step.

## Memory usage

Ray reports the memory usage of each MPI rank in the standard output.

## Getting the best performance

When compiling the code, turning on link-time optimization and targeting the native instruction set is suggested. See scripts/Build-Link-Time-Optimization.sh