Profiling Ray on distributed computers

Sébastien Boisvert

Ray uses the parallel runtime engine RayPlatform. When running a compute job with Ray, some profiling information is generated by default, without any penalty.

## Message passing

The first report is the network test. Before doing the biologically relevant work, Ray tests the network: each MPI rank sends a number of messages to measure the round-trip time, so the point-to-point latency is half that value. The result is written in RayOutput/NetworkText.txt

The network test can also dump detailed data if the option -write-network-test-raw-data is provided.

RayPlatform implements an array of communication graphs: complete, de Bruijn, Kautz, hypercube, polytope (via hypercube), random, and group. Message routing is activated with -route-messages, and the model is selected with -connection-type. The complete model is equivalent to not routing at all. The best model is the hypercube/regular polytope, because it can load-balance the routed messages. See Documentation/Routing.txt

This is useful on supercomputers whose network hardware does not support large numbers of communication peers.

## Signal scheduling

In Ray, all code paths must follow the format imposed by RayPlatform: all the code is put inside functions/methods whose names start with call_ followed by the signal name. The Ray code actually runs inside a kind of supervisor implemented in RayPlatform, which delegates the signals, such as a received message, or a slave mode or master mode that must be executed for one tick.

All scheduling information is written in RayOutput/Scheduling/*. These reports provide, for each MPI rank, the granularity in nanoseconds, the number of messages sent/received per second, the number of ticks in the supervisor, the total number of milliseconds spent in any given slave mode or master mode, and so on.
Statistics on messages sent or received are written in RayOutput/MessagePassingInterface.txt

The option -show-communication-events activates the reporting of all communication events (send and receive operations). The option -run-profiler runs Ray in a slower mode in which the supervisor collects a lot of profiling information. There is also a compile-time option to add collectors in the code: PROFILER_COLLECT=y (or -D CONFIG_PROFILER_COLLECT).

## Running time

The file RayOutput/ElapsedTime.txt contains a human-readable report of the time required by each step.

## Memory usage

Ray reports the memory usage of each MPI rank in the standard output.

## Getting the best performance

When compiling the code, turning on link-time optimization and targeting the native instruction set is suggested. See scripts/Build-Link-Time-Optimization.sh