In user-facing docs, and in log and debug output, we should use terminology
consistently. (Later, in master, we can fix the code and comments.) Definitions:
* core: hardware that actually executes instructions
* socket: a group of cores sharing (e.g.) L3 cache
* node: a group of sockets that can run in parallel without needing
  a network connection
* thread: vague - an instruction stream (but should not be used
if one of the foregoing is more appropriate)
* rank: an MPI rank (of either flavour), thus containing at least
one thread
* process: in mdrun, regarding parallelism, only used when
needing to distinguish real MPI from tMPI
(These, or similar, will end up in some user docs shortly.)
The goal is to express ourselves in the most relevant abstraction, not
all of them. For example, we should talk about the number of OpenMP
threads per MPI rank without observing that there are two kinds of
implementations of MPI, unless we need to draw a relevant distinction
between those two kinds of implementations.
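For instance (binary names and launchers are installation-dependent), a launch like
  mpirun -np 8 mdrun_mpi -ntomp 4
is best described as 8 MPI ranks with 4 OpenMP threads per rank; the same wording
covers a thread-MPI run started as mdrun -ntmpi 8 -ntomp 4, so the MPI flavour
only needs mentioning when it actually matters.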
Where debug output seems more like dumping a data structure than
describing logical state, I've left the names in strings referring to
the members of the data structure (e.g. cr->nodeid, which refers to
the MPI rank).
Changed variable name in gmx_check_thread_affinity_set() from ncpus to
nthreads_hw_avail to make clear that the associated change to the
debug string is correct.
Doubtless there are still uses of "process" that should refer to
"rank," but we use "process" to mean very many different things, so it
is hard to use sed-like tools effectively. Ideally, clang libTooling
would make it easy to find only uses that are present in strings, but
I don't have time to learn how to do that now.
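As a rough interim aid (nowhere near as precise as a libTooling pass), something like
  git grep -nE '"[^"]*process' -- '*.c' '*.cpp'
lists lines where "process" occurs after a double quote, i.e. mostly inside
single-line string literals, which at least narrows down what needs manual review.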
Change-Id: I4dc39dff8be81a30ce803d7833dc305d29d8d188
All replica exchange variants are options of the {\tt mdrun}
program. It will only work when MPI is installed, due to the inherent
parallelism in the algorithm. For efficiency each replica can run on a
-separate node. See the manual page of {\tt mdrun} on how to use these
+separate rank. See the manual page of {\tt mdrun} on how to use these
multinode features.
% \ifthenelse{\equal{\gmxlite}{1}}{}{
\section{Parallelization\index{parallelization}}
The CPU time required for a simulation can be reduced by running the simulation
-in parallel over more than one processor or processor core.
-Ideally one would want to have linear scaling: running on $N$ processors/cores
+in parallel over more than one core.
+Ideally, one would want to have linear scaling: running on $N$ cores
makes the simulation $N$ times faster. In practice this can only be
-achieved for a small number of processors. The scaling will depend
+achieved for a small number of cores. The scaling will depend
a lot on the algorithms used. Also, different algorithms can have different
restrictions on the interaction ranges between atoms.
\section{Domain decomposition\index{domain decomposition}}
Since most interactions in molecular simulations are local,
domain decomposition is a natural way to decompose the system.
-In domain decomposition, a spatial domain is assigned to each processor,
+In domain decomposition, a spatial domain is assigned to each rank,
which will then integrate the equations of motion for the particles
that currently reside in its local domain. With domain decomposition,
there are two choices that have to be made: the division of the unit cell
-into domains and the assignment of the forces to processors.
+into domains and the assignment of the forces to domains.
Most molecular simulation packages use the half-shell method for assigning
the forces. But there are two methods that always require less communication:
the eighth shell~\cite{Liem1991} and the midpoint~\cite{Shaw2006} method.
In the most general case of a triclinic unit cell,
the space is divided with a 1-, 2-, or 3-D grid in parallelepipeds
that we call domain decomposition cells.
-Each cell is assigned to a processor.
-The system is partitioned over the processors at the beginning
+Each cell is assigned to a particle-particle rank.
+The system is partitioned over the ranks at the beginning
of each MD step in which neighbor searching is performed.
Since the neighbor searching is based on charge groups, charge groups
are also the units for the domain decomposition.
then one in $y$ and then one in $z$. The forces are communicated by
reversing this procedure. See the {\gromacs} 4 paper~\cite{Hess2008b}
for details on determining which non-bonded and bonded forces
-should be calculated on which node.
+should be calculated on which rank.
\subsection{Dynamic load balancing\swapindexquiet{dynamic}{load balancing}}
-When different processors have a different computational load
-(load imbalance), all processors will have to wait for the one
+When different ranks have a different computational load
+(load imbalance), all ranks will have to wait for the one
that takes the most time. One would like to avoid such a situation.
Load imbalance can occur due to three reasons:
\begin{itemize}
\begin{figure}
\centerline{\includegraphics[width=7cm]{plots/dd-tric}}
\caption{
-The zones to communicate to the processor of zone 0,
+The zones to communicate to the rank of zone 0,
see the text for details. $r_c$ and $r_b$ are the non-bonded
and bonded cut-off radii respectively, $d$ is an example
of a distance between successive, staggered cell boundaries.
\subsection{Constraints in parallel\index{constraints}}
\label{subsec:plincs}
Since with domain decomposition parts of molecules can reside
-on different processors, bond constraints can cross cell boundaries.
+on different ranks, bond constraints can cross cell boundaries.
Therefore a parallel constraint algorithm is required.
{\gromacs} uses the \normindex{P-LINCS} algorithm~\cite{Hess2008a},
which is the parallel version of the \normindex{LINCS} algorithm~\cite{Hess97}
scaling with domain decomposition.
To reduce the effect of this problem, we have come up with
a Multiple-Program, Multiple-Data approach~\cite{Hess2008b}.
-Here, some processors are selected to do only the PME mesh calculation,
-while the other processors, called particle-particle (PP) nodes,
+Here, some ranks are selected to do only the PME mesh calculation,
+while the other ranks, called particle-particle (PP) ranks,
do all the rest of the work.
-For rectangular boxes the optimal PP to PME node ratio is usually 3:1,
+For rectangular boxes the optimal PP to PME rank ratio is usually 3:1,
for rhombic dodecahedra usually 2:1.
-When the number of PME nodes is reduced by a factor of 4, the number
+When the number of PME ranks is reduced by a factor of 4, the number
of communication calls is reduced by about a factor of 16.
-Or put differently, we can now scale to 4 times more nodes.
+Or put differently, we can now scale to 4 times more ranks.
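(A rough sketch of where that factor comes from, assuming the cost is dominated
by the all-to-all exchanges among the PME ranks for the 3D-FFT grid transposes:
with $M$ PME ranks each rank exchanges data with the other $M-1$, i.e. of order
$M^2$ messages per step, so going from $M$ to $M/4$ ranks gives roughly $M^2/16$.)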
In addition, for modern 4 or 8 core machines in a network,
the effective network bandwidth for PME is quadrupled,
since only a quarter of the cores will be using the network connection
\begin{figure}
\centerline{\includegraphics[width=12cm]{plots/mpmd-pme}}
\caption{
-Example of 8 nodes without (left) and with (right) MPMD.
+Example of 8 ranks without (left) and with (right) MPMD.
The PME communication (red arrows) is much higher on the left
than on the right. For MPMD additional PP - PME coordinate
and force communication (blue arrows) is required,
}
\end{figure}
-{\tt mdrun} will by default interleave the PP and PME nodes.
-If the processors are not number consecutively inside the machines,
+{\tt mdrun} will by default interleave the PP and PME ranks.
+If the ranks are not numbered consecutively inside the machines,
one might want to use {\tt mdrun -ddorder pp_pme}.
For machines with a real 3-D torus and proper communication software
-that assigns the processors accordingly one should use
+that assigns the ranks accordingly one should use
{\tt mdrun -ddorder cartesian}.
To optimize the performance one should usually set up the cut-offs
and the PME grid such that the PME load is 25 to 33\% of the total
calculation load. {\tt grompp} will print an estimate for this load
at the end and also {\tt mdrun} calculates the same estimate
-to determine the optimal number of PME nodes to use.
+to determine the optimal number of PME ranks to use.
For high parallelization it might be worthwhile to optimize
the PME load with the {\tt mdp} settings and/or the number
-of PME nodes with the {\tt -npme} option of {\tt mdrun}.
+of PME ranks with the {\tt -npme} option of {\tt mdrun}.
For changing the electrostatics settings it is useful to know that
the accuracy of the electrostatics remains nearly constant
when the Coulomb cut-off and the PME grid spacing are scaled
by the same factor.
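As an illustrative example, scaling {\tt rcoulomb} from 1.0 to 1.2~nm while
scaling {\tt fourierspacing} from 0.12 to 0.144~nm leaves the PME accuracy
roughly unchanged, but shifts work from the PME mesh (coarser grid) to the
real-space non-bonded kernels (longer cut-off); this is essentially the scan
that {\tt tune_pme} performs over different Coulomb radii.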
{\bf Note} that it is usually better to overestimate than to underestimate
-the number of PME nodes, since the number of PME nodes is smaller
-than the number of PP nodes, which leads to less total waiting time.
+the number of PME ranks, since the number of PME ranks is smaller
+than the number of PP ranks, which leads to less total waiting time.
The PME domain decomposition can be 1-D or 2-D along the $x$ and/or
$y$ axis. 2-D decomposition is also known as \normindex{pencil decomposition} because of
the shape of the domains at high parallelization.
1-D decomposition along the $y$ axis can only be used when
the PP decomposition has only 1 domain along $x$. 2-D PME decomposition
has to have the number of domains along $x$ equal to the number of
domains along $x$ of the PP decomposition. {\tt mdrun} automatically chooses 1-D or 2-D
-PME decomposition (when possible with the total given number of nodes),
+PME decomposition (when possible with the total given number of ranks),
based on the minimum amount of communication for the coordinate redistribution
in PME plus the communication for the grid overlap and transposes.
To avoid superfluous communication of coordinates and forces
-between the PP and PME nodes, the number of DD cells in the $x$
+between the PP and PME ranks, the number of DD cells in the $x$
direction should ideally be the same or a multiple of the number
-of PME nodes. By default, {\tt mdrun} takes care of this issue.
+of PME ranks. By default, {\tt mdrun} takes care of this issue.
\subsection{Domain decomposition flow chart}
In \figref{dd_flow} a flow chart is shown for domain decomposition
\caption{
Flow chart showing the algorithms and communication (arrows)
for a standard MD simulation with virtual sites, constraints
-and separate PME-mesh nodes.
+and separate PME-mesh ranks.
\label{fig:dd_flow}
}
\end{figure}
\item {\tt GMX_NBNXN_SIMD_4XN}: force the use of 4xN SIMD CPU non-bonded kernels,
mutually exclusive of {\tt GMX_NBNXN_SIMD_2XNN}.
\item {\tt GMX_NO_ALLVSALL}: disables optimized all-vs-all kernels.
-\item {\tt GMX_NO_CART_REORDER}: used in initializing domain decomposition communicators. Node reordering
+\item {\tt GMX_NO_CART_REORDER}: used in initializing domain decomposition communicators. Rank reordering
is default, but can be switched off with this environment variable.
\item {\tt GMX_NO_CUDA_STREAMSYNC}: the opposite of {\tt GMX_CUDA_STREAMSYNC}. Disables the use of the
standard cudaStreamSynchronize-based GPU waiting to improve performance when using CUDA driver API
\end{enumerate}
\section{Running {\gromacs} in parallel}
-By default {\gromacs} will be compiled with the built-in threaded MPI library.
-This library supports MPI communication between threads instead of between
-processes. To run {\gromacs} in parallel over multiple nodes in a cluster
-of a supercomputer, you need to configure and compile {\gromacs} with an external
+By default {\gromacs} will be compiled with the built-in thread-MPI library.
+This library handles communication between threads on a single
+node more efficiently than using an external MPI library.
+To run {\gromacs} in parallel over multiple nodes, e.g. on a cluster,
+you need to configure and compile {\gromacs} with an external
MPI library. All supercomputers are shipped with MPI libraries optimized for
-that particular platform, and if you are using a cluster of workstations
-there are several good free MPI implementations; OpenMPI is usually a good choice.
-Note that MPI and threaded-MPI support are mutually incompatible.
+that particular platform, and there are several good free MPI
+implementations; OpenMPI is usually a good choice.
+Note that MPI and thread-MPI support are mutually incompatible.
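(For reference, assuming the usual CMake-based build: the thread-MPI default
corresponds to a plain {\tt cmake} invocation, while an external-MPI build is
configured with something like {\tt cmake .. -DGMX_MPI=ON}.)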
In addition to MPI parallelization, {\gromacs} also supports
thread-parallelization through \normindex{OpenMP}. MPI and OpenMP parallelization
n 10290 3777 m 10350 4017 l 10410 3777 l col4 s
/Helvetica ff 317.50 scf sf
7200 1980 m
-gs 1 -1 sc (6 PP nodes) dup sw pop 2 div neg 0 rm col0 sh gr
+gs 1 -1 sc (6 PP ranks) dup sw pop 2 div neg 0 rm col0 sh gr
/Helvetica ff 317.50 scf sf
10350 1980 m
-gs 1 -1 sc (2 PME nodes) dup sw pop 2 div neg 0 rm col0 sh gr
+gs 1 -1 sc (2 PME ranks) dup sw pop 2 div neg 0 rm col0 sh gr
/Helvetica ff 317.50 scf sf
3150 2025 m
-gs 1 -1 sc (8 PP/PME nodes) dup sw pop 2 div neg 0 rm col0 sh gr
+gs 1 -1 sc (8 PP/PME ranks) dup sw pop 2 div neg 0 rm col0 sh gr
% here ends figure;
$F2psEnd
rs
0 0 2.00 120.00 240.00
0 0 2.00 120.00 240.00
10350 3150 10350 4050
-4 1 0 50 -1 16 20 0.0000 4 255 1695 7200 1980 6 PP nodes\001
-4 1 0 50 -1 16 20 0.0000 4 255 1965 10350 1980 2 PME nodes\001
-4 1 0 50 -1 16 20 0.0000 4 255 2505 3150 2025 8 PP/PME nodes\001
+4 1 0 50 -1 16 20 0.0000 4 255 1695 7200 1980 6 PP ranks\001
+4 1 0 50 -1 16 20 0.0000 4 255 1965 10350 1980 2 PME ranks\001
+4 1 0 50 -1 16 20 0.0000 4 255 2505 3150 2025 8 PP/PME ranks\001
Ions will be exchanged between compartments depending on their $z$-positions alone.
{\tt swap_frequency} determines how often a swap attempt will be made.
This step requires that the positions of the ions, solvent, and swap groups are
-communicated between the parallel processes, so if chosen too small it can decrease the simulation
+communicated between the parallel ranks, so if chosen too small it can decrease the simulation
performance.
{\small
\begin{verbatim}
char fn[STRLEN];
- sprintf(fn, "EDdump_node%d_edi%d", cr->nodeid, nr_edi);
+ sprintf(fn, "EDdump_rank%d_edi%d", cr->nodeid, nr_edi);
out = gmx_ffopen(fn, "w");
fprintf(out, "#NINI\n %d\n#FITMAS\n %d\n#ANALYSIS_MAS\n %d\n",
if (debug)
{
- fprintf(debug, "FFT5D: Using %dx%d processor grid, rank %d,%d\n",
+ fprintf(debug, "FFT5D: Using %dx%d rank grid, rank %d,%d\n",
P[0], P[1], prank[0], prank[1]);
}
{
if (prank == 0)
{
- printf("FFT5D: WARNING: Number of processors %d not evenly dividable by %d\n", size, P0);
+ printf("FFT5D: WARNING: Number of ranks %d not evenly divisible by %d\n", size, P0);
}
P0 = lfactor(size);
}
"HIDDENDffusion coefficient to use in the reversible geminate recombination kinetic model. If negative, then it will be fitted to the ACF along with ka and kd."},
#ifdef GMX_OPENMP
{ "-nthreads", FALSE, etINT, {&nThreads},
- "Number of threads used for the parallel loop over autocorrelations. nThreads <= 0 means maximum number of threads. Requires linking with OpenMP. The number of threads is limited by the number of processors (before OpenMP v.3 ) or environment variable OMP_THREAD_LIMIT (OpenMP v.3)"},
+ "Number of threads used for the parallel loop over autocorrelations. nThreads <= 0 means maximum number of threads. Requires linking with OpenMP. The number of threads is limited by the number of cores (before OpenMP v.3 ) or environment variable OMP_THREAD_LIMIT (OpenMP v.3)"},
#endif
};
const char *bugs[] = {
"(collective coordinates etc.), at least on the 'protein' side, ED sampling",
"is not very parallel-friendly from an implementation point of view. Because",
"parallel ED requires some extra communication, expect the performance to be",
- "lower as in a free MD simulation, especially on a large number of nodes and/or",
+ "lower as in a free MD simulation, especially on a large number of ranks and/or",
"when the ED group contains a lot of atoms. [PAR]",
"Please also note that if your ED group contains more than a single protein,",
"then the [TT].tpr[tt] file must contain the correct PBC representation of the ED group.",
xtot, xtot == 1 ? "" : "s");
if (PAR(cr))
{
- fprintf(stdout, " (%d sample%s per node)", x_per_core, x_per_core == 1 ? "" : "s");
+ fprintf(stdout, " (%d sample%s per rank)", x_per_core, x_per_core == 1 ? "" : "s");
}
fprintf(stdout, ".\n");
}
#ifdef DEBUG
if (PAR(cr))
{
- fprintf(stderr, "Node %3d: nx=[%3d...%3d] e_rec3=%e\n",
+ fprintf(stderr, "Rank %3d: nx=[%3d...%3d] e_rec3=%e\n",
cr->nodeid, startlocal, stoplocal, e_rec3);
}
#endif
/* Look for domain decomp grid and separate PME nodes: */
if (str_starts(line, matchstrdd))
{
- sscanf(line, "Domain decomposition grid %d x %d x %d, separate PME nodes %d",
+ sscanf(line, "Domain decomposition grid %d x %d x %d, separate PME ranks %d",
&(perfdata->nx), &(perfdata->ny), &(perfdata->nz), &npme);
if (perfdata->nPMEnodes == -1)
{
}
else if (perfdata->nPMEnodes != npme)
{
- gmx_fatal(FARGS, "PME nodes from command line and output file are not identical");
+ gmx_fatal(FARGS, "PME ranks from command line and output file are not identical");
}
iFound = eFoundDDStr;
}
fclose(fp);
return eParselogNoDDGrid;
}
- else if (str_starts(line, "The number of nodes you selected"))
+ else if (str_starts(line, "The number of ranks you selected"))
{
fclose(fp);
return eParselogLargePrimeFactor;
{
sep_line(fp);
fprintf(fp, "Summary of successful runs:\n");
- fprintf(fp, "Line tpr PME nodes Gcycles Av. Std.dev. ns/day PME/f");
+ fprintf(fp, "Line tpr PME ranks Gcycles Av. Std.dev. ns/day PME/f");
if (nnodes > 1)
{
fprintf(fp, " DD grid");
/* We have optimized the number of PME-only nodes */
if (winPME == -1)
{
- sprintf(strbuf, "%s", "the automatic number of PME nodes");
+ sprintf(strbuf, "%s", "the automatic number of PME ranks");
}
else
{
- sprintf(strbuf, "%d PME nodes", winPME);
+ sprintf(strbuf, "%d PME ranks", winPME);
}
}
fprintf(fp, "Best performance was achieved with %s", strbuf);
"No DD grid found for these settings.",
"TPX version conflict!",
"mdrun was not started in parallel!",
- "Number of PP nodes has a prime factor that is too large.",
+ "Number of PP ranks has a prime factor that is too large.",
"An error occured."
};
char str_PME_f_load[13];
*pmeentries = 1;
snew(nPMEnodes, 1);
nPMEnodes[0] = npme_fixed;
- fprintf(stderr, "Will use a fixed number of %d PME-only nodes.\n", nPMEnodes[0]);
+ fprintf(stderr, "Will use a fixed number of %d PME-only ranks.\n", nPMEnodes[0]);
}
if (0 == repeats)
for (k = 0; k < nr_tprs; k++)
{
fprintf(fp, "\nIndividual timings for input file %d (%s):\n", k, tpr_names[k]);
- fprintf(fp, "PME nodes Gcycles ns/day PME/f Remark\n");
+ fprintf(fp, "PME ranks Gcycles ns/day PME/f Remark\n");
/* Loop over various numbers of PME nodes: */
for (i = 0; i < *pmeentries; i++)
{
/* Check number of nodes */
if (nnodes < 1)
{
- gmx_fatal(FARGS, "Number of nodes/threads must be a positive integer.");
+ gmx_fatal(FARGS, "Number of ranks/threads must be a positive integer.");
}
/* Automatically choose -ntpr if not set */
/* No more than 50% of all nodes can be assigned as PME-only nodes. */
if (2*npme_fixed > nnodes)
{
- gmx_fatal(FARGS, "Cannot have more than %d PME-only nodes for a total of %d nodes (you chose %d).\n",
+ gmx_fatal(FARGS, "Cannot have more than %d PME-only ranks for a total of %d ranks (you chose %d).\n",
nnodes/2, nnodes, npme_fixed);
}
if ((npme_fixed > 0) && (5*npme_fixed < nnodes))
{
- fprintf(stderr, "WARNING: Only %g percent of the nodes are assigned as PME-only nodes.\n",
+ fprintf(stderr, "WARNING: Only %g percent of the ranks are assigned as PME-only ranks.\n",
100.0*((real)npme_fixed / (real)nnodes));
}
if (opt2parg_bSet("-min", npargs, pa) || opt2parg_bSet("-max", npargs, pa))
{
fprintf(stderr, "NOTE: The -min, -max, and -npme options have no effect when a\n"
- " fixed number of PME-only nodes is requested with -fix.\n");
+ " fixed number of PME-only ranks is requested with -fix.\n");
}
}
}
int gmx_tune_pme(int argc, char *argv[])
{
const char *desc[] = {
- "For a given number [TT]-np[tt] or [TT]-ntmpi[tt] of processors/threads, [THISMODULE] systematically",
- "times [gmx-mdrun] with various numbers of PME-only nodes and determines",
+ "For a given number [TT]-np[tt] or [TT]-ntmpi[tt] of ranks, [THISMODULE] systematically",
+ "times [gmx-mdrun] with various numbers of PME-only ranks and determines",
"which setting is fastest. It will also test whether performance can",
"be enhanced by shifting load from the reciprocal to the real space",
"part of the Ewald sum. ",
"via the MPIRUN variable, e.g.[PAR]",
"[TT]export MPIRUN=\"/usr/local/mpirun -machinefile hosts\"[tt][PAR]",
"Please call [THISMODULE] with the normal options you would pass to",
- "[gmx-mdrun] and add [TT]-np[tt] for the number of processors to perform the",
+ "[gmx-mdrun] and add [TT]-np[tt] for the number of ranks to perform the",
"tests on, or [TT]-ntmpi[tt] for the number of threads. You can also add [TT]-r[tt]",
"to repeat each test several times to get better statistics. [PAR]",
"[THISMODULE] can test various real space / reciprocal space workloads",
"In this last test, the Fourier spacing is multiplied with [TT]rmax[tt]/rcoulomb. ",
"The remaining [TT].tpr[tt] files will have equally-spaced Coulomb radii (and Fourier "
"spacings) between these extremes. [BB]Note[bb] that you can set [TT]-ntpr[tt] to 1",
- "if you just seek the optimal number of PME-only nodes; in that case",
+ "if you just seek the optimal number of PME-only ranks; in that case",
"your input [TT].tpr[tt] file will remain unchanged.[PAR]",
"For the benchmark runs, the default of 1000 time steps should suffice for most",
"MD systems. The dynamic load balancing needs about 100 time steps",
/* g_tune_pme options: */
/***********************/
{ "-np", FALSE, etINT, {&nnodes},
- "Number of nodes to run the tests on (must be > 2 for separate PME nodes)" },
+ "Number of ranks to run the tests on (must be > 2 for separate PME ranks)" },
{ "-npstring", FALSE, etENUM, {procstring},
- "Specify the number of processors to [TT]$MPIRUN[tt] using this string"},
+ "Specify the number of ranks to [TT]$MPIRUN[tt] using this string"},
{ "-ntmpi", FALSE, etINT, {&nthreads},
"Number of MPI-threads to run the tests on (turns MPI & mpirun off)"},
{ "-r", FALSE, etINT, {&repeats},
"Repeat each test this often" },
{ "-max", FALSE, etREAL, {&maxPMEfraction},
- "Max fraction of PME nodes to test with" },
+ "Max fraction of PME ranks to test with" },
{ "-min", FALSE, etREAL, {&minPMEfraction},
- "Min fraction of PME nodes to test with" },
+ "Min fraction of PME ranks to test with" },
{ "-npme", FALSE, etENUM, {npmevalues_opt},
"Within -min and -max, benchmark all possible values for [TT]-npme[tt], or just a reasonable subset. "
"Auto neglects -min and -max and chooses reasonable values around a guess for npme derived from the .tpr"},
{ "-fix", FALSE, etINT, {&npme_fixed},
- "If >= -1, do not vary the number of PME-only nodes, instead use this fixed value and only vary rcoulomb and the PME grid spacing."},
+ "If >= -1, do not vary the number of PME-only ranks, instead use this fixed value and only vary rcoulomb and the PME grid spacing."},
{ "-rmax", FALSE, etREAL, {&rmax},
"If >0, maximal rcoulomb for -ntpr>1 (rcoulomb upscaling results in fourier grid downscaling)" },
{ "-rmin", FALSE, etREAL, {&rmin},
{
fprintf(stdout, "- %d ", maxPMEnodes);
}
- fprintf(stdout, "PME-only nodes.\n Note that the automatic number of PME-only nodes and no separate PME nodes are always tested.\n");
+ fprintf(stdout, "PME-only ranks.\n Note that the automatic number of PME-only ranks and no separate PME ranks are always tested.\n");
}
}
else
fprintf(fp, "%s for Gromacs %s\n", ShortProgram(), GromacsVersion());
if (!bThreads)
{
- fprintf(fp, "Number of nodes : %d\n", nnodes);
+ fprintf(fp, "Number of ranks : %d\n", nnodes);
fprintf(fp, "The mpirun command is : %s\n", cmd_mpirun);
if (strcmp(procstring[0], "none") != 0)
{
- fprintf(fp, "Passing # of nodes via : %s\n", procstring[0]);
+ fprintf(fp, "Passing # of ranks via : %s\n", procstring[0]);
}
else
{
- fprintf(fp, "Not setting number of nodes in system call\n");
+ fprintf(fp, "Not setting number of ranks in system call\n");
}
}
else
}
if (bPrintSepPot)
{
- fprintf(fplog, "Step %s: bonded V and dVdl for this node\n",
+ fprintf(fplog, "Step %s: bonded V and dVdl for this rank\n",
gmx_step_str(step, buf));
}
*step = idum;
}
do_cpt_double_err(xd, "t", t, list);
- do_cpt_int_err(xd, "#PP-nodes", nnodes, list);
+ do_cpt_int_err(xd, "#PP-ranks", nnodes, list);
idum = 1;
do_cpt_int_err(xd, "dd_nc[x]", dd_nc ? &(dd_nc[0]) : &idum, list);
do_cpt_int_err(xd, "dd_nc[y]", dd_nc ? &(dd_nc[1]) : &idum, list);
do_cpt_int_err(xd, "dd_nc[z]", dd_nc ? &(dd_nc[2]) : &idum, list);
- do_cpt_int_err(xd, "#PME-only nodes", npme, list);
+ do_cpt_int_err(xd, "#PME-only ranks", npme, list);
do_cpt_int_err(xd, "state flags", flags_state, list);
if (*file_version >= 4)
{
check_int (fplog, "Double prec.", GMX_CPT_BUILD_DP, double_prec, &mm);
check_string(fplog, "Program name", Program(), fprog, &mm);
- check_int (fplog, "#nodes", cr->nnodes, npp_f+npme_f, &mm);
+ check_int (fplog, "#ranks", cr->nnodes, npp_f+npme_f, &mm);
if (cr->nnodes > 1)
{
- check_int (fplog, "#PME-nodes", cr->npmenodes, npme_f, &mm);
+ check_int (fplog, "#PME-ranks", cr->npmenodes, npme_f, &mm);
npp = cr->nnodes;
if (cr->npmenodes >= 0)
/* Return the number of hardware threads supported by the current CPU.
- * We assume that this is equal with the number of CPUs reported to be
- * online by the OS at the time of the call.
- */
+ * We assume that this is equal to the number of "processors"
+ * reported to be online by the OS at the time of the call. The
+ * definition of "processor" is according to an old POSIX standard.
+ *
+ * Note that the number of hardware threads is generally greater than
+ * the number of cores (e.g. x86 hyper-threading, Power). The mapping
+ * of software threads to hardware threads is managed elsewhere. */
static int get_nthreads_hw_avail(FILE gmx_unused *fplog, const t_commrec gmx_unused *cr)
{
int ret = 0;
#endif /* End of check for sysconf argument values */
#else
- /* Neither windows nor Unix. No fscking idea how many CPUs we have! */
+ /* Neither windows nor Unix. No fscking idea how many hardware threads we have! */
ret = -1;
#endif
if (debug)
{
- fprintf(debug, "Detected %d processors, will use this as the number "
- "of supported hardware threads.\n", ret);
+ fprintf(debug, "Detected %d hardware threads to use.\n", ret);
}
#ifdef GMX_OPENMP
if (ret != gmx_omp_get_num_procs())
{
md_print_warn(cr, fplog,
- "Number of CPUs detected (%d) does not match the number reported by OpenMP (%d).\n"
+ "Number of hardware threads detected (%d) does not match the number reported by OpenMP (%d).\n"
"Consider setting the launch configuration manually!",
ret, gmx_omp_get_num_procs());
}
if (nnodes > 1)
{
- fprintf(stderr, "Error on node %d, will try to stop all the nodes\n",
+ fprintf(stderr, "Error on rank %d, will try to stop all ranks\n",
noderank);
}
gmx_abort(noderank, nnodes, -1);
sprintf(sbuf, "thread-MPI threads");
#else
sprintf(sbuf, "MPI processes");
- sprintf(sbuf1, " per node");
- sprintf(sbuf2, "On node %d: o", cr->sim_nodeid);
+ sprintf(sbuf1, " per rank");
+ sprintf(sbuf2, "On rank %d: o", cr->sim_nodeid);
#endif
}
#endif
gmx_check_thread_affinity_set(FILE gmx_unused *fplog,
const t_commrec gmx_unused *cr,
gmx_hw_opt_t gmx_unused *hw_opt,
- int gmx_unused ncpus,
+ int gmx_unused nthreads_hw_avail,
gmx_bool gmx_unused bAfterOpenmpInit)
{
#ifdef HAVE_SCHED_GETAFFINITY
* detected CPUs is >= the CPUs in the current set.
* We need to check for CPU_COUNT as it was added only in glibc 2.6. */
#ifdef CPU_COUNT
- if (ncpus < CPU_COUNT(&mask_current))
+ if (nthreads_hw_avail < CPU_COUNT(&mask_current))
{
if (debug)
{
- fprintf(debug, "%d CPUs detected, but %d was returned by CPU_COUNT",
- ncpus, CPU_COUNT(&mask_current));
+ fprintf(debug, "%d hardware threads detected, but %d was returned by CPU_COUNT",
+ nthreads_hw_avail, CPU_COUNT(&mask_current));
}
return;
}
#endif /* CPU_COUNT */
bAllSet = TRUE;
- for (i = 0; (i < ncpus && i < CPU_SETSIZE); i++)
+ for (i = 0; (i < nthreads_hw_avail && i < CPU_SETSIZE); i++)
{
bAllSet = bAllSet && (CPU_ISSET(i, &mask_current) != 0);
}
}
if (bAppendNodeId)
{
- strcat(buf, "_node");
+ strcat(buf, "_rank");
sprintf(buf+strlen(buf), "%d", cr->nodeid);
}
strcat(buf, ".");
strcat(buf, (ftp == efTPX) ? "tpr" : (ftp == efEDR) ? "edr" : ftp2ext(ftp));
if (debug)
{
- fprintf(debug, "node %d par_fn '%s'\n", cr->nodeid, buf);
+ fprintf(debug, "rank %d par_fn '%s'\n", cr->nodeid, buf);
if (fn2ftp(buf) == efLOG)
{
fprintf(debug, "log\n");
fprintf(fp,
"Log file opened on %s"
- "Host: %s pid: %d nodeid: %d nnodes: %d\n",
+ "Host: %s pid: %d rank ID: %d number of ranks: %d\n",
timebuf, host, pid, cr->nodeid, cr->nnodes);
try
{
nnodes = cr->nnodes;
if (nnodes % nsim != 0)
{
- gmx_fatal(FARGS, "The number of nodes (%d) is not a multiple of the number of simulations (%d)", nnodes, nsim);
+ gmx_fatal(FARGS, "The number of ranks (%d) is not a multiple of the number of simulations (%d)", nnodes, nsim);
}
nnodpersim = nnodes/nsim;
if (debug)
{
- fprintf(debug, "We have %d simulations, %d nodes per simulation, local simulation is %d\n", nsim, nnodpersim, sim);
+ fprintf(debug, "We have %d simulations, %d ranks per simulation, local simulation is %d\n", nsim, nnodpersim, sim);
}
snew(ms, 1);
fprintf(debug, "This is simulation %d", cr->ms->sim);
if (PAR(cr))
{
- fprintf(debug, ", local number of nodes %d, local nodeid %d",
+ fprintf(debug, ", local number of ranks %d, local rank ID %d",
cr->nnodes, cr->sim_nodeid);
}
fprintf(debug, "\n\n");
MPI_Comm_rank(nc->comm_intra, &nc->rank_intra);
if (debug)
{
- fprintf(debug, "In gmx_setup_nodecomm: node rank %d rank_intra %d\n",
+ fprintf(debug, "In gmx_setup_nodecomm: node ID %d rank within node %d\n",
rank, nc->rank_intra);
}
/* The inter-node communicator, split on rank_intra.
nc->bUse = TRUE;
if (fplog)
{
- fprintf(fplog, "Using two step summing over %d groups of on average %.1f processes\n\n",
+ fprintf(fplog, "Using two step summing over %d groups of on average %.1f ranks\n\n",
ng, (real)n/(real)ng);
}
if (nc->rank_intra > 0)
{
sprintf(sbuf, "%s", cr->duty & DUTY_PP ? "PP" : "PME");
}
- fprintf(debug, "On %3s node %d: nrank_intranode=%d, rank_intranode=%d, "
+ fprintf(debug, "On %3s rank %d: nrank_intranode=%d, rank_intranode=%d, "
"nrank_pp_intranode=%d, rank_pp_intranode=%d\n",
sbuf, cr->sim_nodeid,
nrank_intranode, rank_intranode,
fprintf(log, "\nDetailed load balancing info in percentage of average\n");
- fprintf(log, " Type NODE:");
+ fprintf(log, " Type RANK:");
for (i = 0; (i < cr->nnodes); i++)
{
fprintf(log, "%3d ", i);
if (od->nr == 0)
{
/* This means that this is not the master node */
- gmx_fatal(FARGS, "Orientation restraints are only supported on the master node, use less processors");
+ gmx_fatal(FARGS, "Orientation restraints are only supported on the master rank, use fewer ranks");
}
bTAV = (od->edt != 0);
fprintf(fp, "commrec:\n");
indent += 2;
pr_indent(fp, indent);
- fprintf(fp, "nodeid = %d\n", cr->nodeid);
+ fprintf(fp, "rank = %d\n", cr->nodeid);
pr_indent(fp, indent);
- fprintf(fp, "nnodes = %d\n", cr->nnodes);
+ fprintf(fp, "number of ranks = %d\n", cr->nnodes);
pr_indent(fp, indent);
- fprintf(fp, "npmenodes = %d\n", cr->npmenodes);
+ fprintf(fp, "PME-only ranks = %d\n", cr->npmenodes);
/*
pr_indent(fp,indent);
fprintf(fp,"threadid = %d\n",cr->threadid);
#ifdef DEBUG_GMX
#define debug_gmx() do { FILE *fp = debug ? debug : stderr; \
- if (bDebugMode()) { fprintf(fp, "NODEID=%d, %s %d\n", gmx_mpi_initialized() ? gmx_node_rank() : -1, __FILE__, __LINE__); } fflush(fp); } while (0)
+ if (bDebugMode()) { fprintf(fp, "rank=%d, %s %d\n", gmx_mpi_initialized() ? gmx_node_rank() : -1, __FILE__, __LINE__); } fflush(fp); } while (0)
#else
#define debug_gmx()
#endif
if (debug)
{
- fprintf(debug, "Receive coordinates from PP nodes:");
+ fprintf(debug, "Receive coordinates from PP ranks:");
for (x = 0; x < *nmy_ddnodes; x++)
{
fprintf(debug, " %d", (*my_ddnodes)[x]);
if (!bLocalCG[dd->index_gl[i]])
{
fprintf(stderr,
- "DD node %d, %s: cg %d, global cg %d is not marked in bLocalCG (ncg_home %d)\n", dd->rank, where, i+1, dd->index_gl[i]+1, dd->ncg_home);
+ "DD rank %d, %s: cg %d, global cg %d is not marked in bLocalCG (ncg_home %d)\n", dd->rank, where, i+1, dd->index_gl[i]+1, dd->ncg_home);
nerr++;
}
}
}
if (ngl != dd->ncg_tot)
{
- fprintf(stderr, "DD node %d, %s: In bLocalCG %d cgs are marked as local, whereas there are %d\n", dd->rank, where, ngl, dd->ncg_tot);
+ fprintf(stderr, "DD rank %d, %s: In bLocalCG %d cgs are marked as local, whereas there are %d\n", dd->rank, where, ngl, dd->ncg_tot);
nerr++;
}
{
if (have[dd->gatindex[a]] > 0)
{
- fprintf(stderr, "DD node %d: global atom %d occurs twice: index %d and %d\n", dd->rank, dd->gatindex[a]+1, have[dd->gatindex[a]], a+1);
+ fprintf(stderr, "DD rank %d: global atom %d occurs twice: index %d and %d\n", dd->rank, dd->gatindex[a]+1, have[dd->gatindex[a]], a+1);
}
else
{
{
if (a >= dd->nat_tot)
{
- fprintf(stderr, "DD node %d: global atom %d marked as local atom %d, which is larger than nat_tot (%d)\n", dd->rank, i+1, a+1, dd->nat_tot);
+ fprintf(stderr, "DD rank %d: global atom %d marked as local atom %d, which is larger than nat_tot (%d)\n", dd->rank, i+1, a+1, dd->nat_tot);
nerr++;
}
else
have[a] = 1;
if (dd->gatindex[a] != i)
{
- fprintf(stderr, "DD node %d: global atom %d marked as local atom %d, which has global atom index %d\n", dd->rank, i+1, a+1, dd->gatindex[a]+1);
+ fprintf(stderr, "DD rank %d: global atom %d marked as local atom %d, which has global atom index %d\n", dd->rank, i+1, a+1, dd->gatindex[a]+1);
nerr++;
}
}
if (ngl != dd->nat_tot)
{
fprintf(stderr,
- "DD node %d, %s: %d global atom indices, %d local atoms\n",
+ "DD rank %d, %s: %d global atom indices, %d local atoms\n",
dd->rank, where, ngl, dd->nat_tot);
}
for (a = 0; a < dd->nat_tot; a++)
if (have[a] == 0)
{
fprintf(stderr,
- "DD node %d, %s: local atom %d, global %d has no global index\n",
+ "DD rank %d, %s: local atom %d, global %d has no global index\n",
dd->rank, where, a+1, dd->gatindex[a]+1);
}
}
if (nerr > 0)
{
- gmx_fatal(FARGS, "DD node %d, %s: %d atom/cg index inconsistencies",
+ gmx_fatal(FARGS, "DD rank %d, %s: %d atom/cg index inconsistencies",
dd->rank, where, nerr);
}
}
/* This error should never be triggered under normal
* circumstances, but you never know ...
*/
- gmx_fatal(FARGS, "Step %s: The domain decomposition grid has shifted too much in the %c-direction around cell %d %d %d. This should not have happened. Running with less nodes might avoid this issue.",
+ gmx_fatal(FARGS, "Step %s: The domain decomposition grid has shifted too much in the %c-direction around cell %d %d %d. This should not have happened. Running with fewer ranks might avoid this issue.",
gmx_step_str(step, buf),
dim2char(dim), dd->ci[XX], dd->ci[YY], dd->ci[ZZ]);
}
dim2char(d), ddbox->box_size[d], ddbox->skew_fac[d],
comm->cutoff,
dd->nc[d], dd->nc[d],
- dd->nnodes > dd->nc[d] ? "cells" : "processors");
+ dd->nnodes > dd->nc[d] ? "cells" : "ranks");
if (setmode == setcellsizeslbLOCAL)
{
if (npme > 0 && fabs(lossp) >= DD_PERF_LOSS_WARN)
{
sprintf(buf,
- "NOTE: %.1f %% performance was lost because the PME nodes\n"
- " had %s work to do than the PP nodes.\n"
- " You might want to %s the number of PME nodes\n"
+ "NOTE: %.1f %% performance was lost because the PME ranks\n"
+ " had %s work to do than the PP ranks.\n"
+ " You might want to %s the number of PME ranks\n"
" or %s the cut-off and the grid spacing.\n",
fabs(lossp*100),
(lossp < 0) ? "less" : "more",
if (fplog)
{
fprintf(fplog,
- "Domain decomposition nodeid %d, coordinates %d %d %d\n\n",
+ "Domain decomposition rank %d, coordinates %d %d %d\n\n",
dd->rank, dd->ci[XX], dd->ci[YY], dd->ci[ZZ]);
}
if (debug)
{
fprintf(debug,
- "Domain decomposition nodeid %d, coordinates %d %d %d\n\n",
+ "Domain decomposition rank %d, coordinates %d %d %d\n\n",
dd->rank, dd->ci[XX], dd->ci[YY], dd->ci[ZZ]);
}
}
}
else if (fplog)
{
- fprintf(fplog, "#pmenodes (%d) is not a multiple of nx*ny (%d*%d) or nx*nz (%d*%d)\n", cr->npmenodes, dd->nc[XX], dd->nc[YY], dd->nc[XX], dd->nc[ZZ]);
+ fprintf(fplog, "Number of PME-only ranks (%d) is not a multiple of nx*ny (%d*%d) or nx*nz (%d*%d)\n", cr->npmenodes, dd->nc[XX], dd->nc[YY], dd->nc[XX], dd->nc[ZZ]);
fprintf(fplog,
"Will not use a Cartesian communicator for PP <-> PME\n\n");
}
if (fplog)
{
- fprintf(fplog, "Cartesian nodeid %d, coordinates %d %d %d\n\n",
+ fprintf(fplog, "Cartesian rank %d, coordinates %d %d %d\n\n",
cr->sim_nodeid, dd->ci[XX], dd->ci[YY], dd->ci[ZZ]);
}
case ddnoPP_PME:
if (fplog)
{
- fprintf(fplog, "Order of the nodes: PP first, PME last\n");
+ fprintf(fplog, "Order of the ranks: PP first, PME last\n");
}
break;
case ddnoINTERLEAVE:
*/
if (fplog)
{
- fprintf(fplog, "Interleaving PP and PME nodes\n");
+ fprintf(fplog, "Interleaving PP and PME ranks\n");
}
comm->pmenodes = dd_pmenodes(cr);
break;
if (fplog)
{
- fprintf(fplog, "This is a %s only node\n\n",
+ fprintf(fplog, "This rank does only %s work.\n\n",
(cr->duty & DUTY_PP) ? "particle-particle" : "PME-mesh");
}
}
if (fplog)
{
fprintf(fplog,
- "\nInitializing Domain Decomposition on %d nodes\n", cr->nnodes);
+ "\nInitializing Domain Decomposition on %d ranks\n", cr->nnodes);
}
snew(dd, 1);
if (dd->nc[XX] == 0)
{
bC = (dd->bInterCGcons && rconstr > r_bonded_limit);
- sprintf(buf, "Change the number of nodes or mdrun option %s%s%s",
+ sprintf(buf, "Change the number of ranks or mdrun option %s%s%s",
!bC ? "-rdd" : "-rcon",
comm->eDLB != edlbNO ? " or -dds" : "",
bC ? " or your LINCS settings" : "");
gmx_fatal_collective(FARGS, cr, NULL,
- "There is no domain decomposition for %d nodes that is compatible with the given box and a minimum cell size of %g nm\n"
+ "There is no domain decomposition for %d ranks that is compatible with the given box and a minimum cell size of %g nm\n"
"%s\n"
"Look in the log file for details on the domain decomposition",
cr->nnodes-cr->npmenodes, limit, buf);
if (fplog)
{
fprintf(fplog,
- "Domain decomposition grid %d x %d x %d, separate PME nodes %d\n",
+ "Domain decomposition grid %d x %d x %d, separate PME ranks %d\n",
dd->nc[XX], dd->nc[YY], dd->nc[ZZ], cr->npmenodes);
}
if (cr->nnodes - dd->nnodes != cr->npmenodes)
{
gmx_fatal_collective(FARGS, cr, NULL,
- "The size of the domain decomposition grid (%d) does not match the number of nodes (%d). The total number of nodes is %d",
+ "The size of the domain decomposition grid (%d) does not match the number of ranks (%d). The total number of ranks is %d",
dd->nnodes, cr->nnodes - cr->npmenodes, cr->nnodes);
}
if (cr->npmenodes > dd->nnodes)
{
gmx_fatal_collective(FARGS, cr, NULL,
- "The number of separate PME nodes (%d) is larger than the number of PP nodes (%d), this is not supported.", cr->npmenodes, dd->nnodes);
+ "The number of separate PME ranks (%d) is larger than the number of PP ranks (%d), this is not supported.", cr->npmenodes, dd->nnodes);
}
if (cr->npmenodes > 0)
{
if (dd->pme_nodeid >= 0)
{
gmx_fatal_collective(FARGS, NULL, dd,
- "Can not have separate PME nodes without PME electrostatics");
+ "Can not have separate PME ranks without PME electrostatics");
}
}
nsend, 2, buf, 2);
if (debug)
{
- fprintf(debug, "Send to node %d, %d (%d) indices, "
- "receive from node %d, %d (%d) indices\n",
+ fprintf(debug, "Send to rank %d, %d (%d) indices, "
+ "receive from rank %d, %d (%d) indices\n",
dd->neighbor[d][1-dir], nsend[1], nsend[0],
dd->neighbor[d][dir], buf[1], buf[0]);
if (gmx_debug_at)
}
if (npme > nnodes/2)
{
- gmx_fatal(FARGS, "Could not find an appropriate number of separate PME nodes. i.e. >= %5f*#nodes (%d) and <= #nodes/2 (%d) and reasonable performance wise (grid_x=%d, grid_y=%d).\n"
- "Use the -npme option of mdrun or change the number of processors or the PME grid dimensions, see the manual for details.",
+ gmx_fatal(FARGS, "Could not find an appropriate number of separate PME ranks. i.e. >= %5f*#ranks (%d) and <= #ranks/2 (%d) and reasonable performance wise (grid_x=%d, grid_y=%d).\n"
+ "Use the -npme option of mdrun or change the number of ranks or the PME grid dimensions, see the manual for details.",
ratio, (int)(0.95*ratio*nnodes+0.5), nnodes/2, ir->nkx, ir->nky);
/* Keep the compiler happy */
npme = 0;
if (fplog)
{
fprintf(fplog,
- "Will use %d particle-particle and %d PME only nodes\n"
+ "Will use %d particle-particle and %d PME only ranks\n"
"This is a guess, check the performance at the end of the log file\n",
nnodes-npme, npme);
}
fprintf(stderr, "\n"
- "Will use %d particle-particle and %d PME only nodes\n"
+ "Will use %d particle-particle and %d PME only ranks\n"
"This is a guess, check the performance at the end of the log file\n",
nnodes-npme, npme);
}
if (cr->nnodes <= 2)
{
gmx_fatal(FARGS,
- "Can not have separate PME nodes with 2 or less nodes");
+ "Cannot have separate PME ranks with 2 or fewer ranks");
}
if (cr->npmenodes >= cr->nnodes)
{
gmx_fatal(FARGS,
- "Can not have %d separate PME nodes with just %d total nodes",
+ "Cannot have %d separate PME ranks with just %d total ranks",
cr->npmenodes, cr->nnodes);
}
/* Check if the largest divisor is more than nnodes^2/3 */
if (ldiv*ldiv*ldiv > nnodes_div*nnodes_div)
{
- gmx_fatal(FARGS, "The number of nodes you selected (%d) contains a large prime factor %d. In most cases this will lead to bad performance. Choose a number with smaller prime factors or set the decomposition (option -dd) manually.",
+ gmx_fatal(FARGS, "The number of ranks you selected (%d) contains a large prime factor %d. In most cases this will lead to bad performance. Choose a number with smaller prime factors or set the decomposition (option -dd) manually.",
nnodes_div, ldiv);
}
}
cr->npmenodes = 0;
if (fplog)
{
- fprintf(fplog, "Using %d separate PME nodes, as there are too few total\n nodes for efficient splitting\n", cr->npmenodes);
+ fprintf(fplog, "Using %d separate PME ranks, as there are too few total\n ranks for efficient splitting\n", cr->npmenodes);
}
}
else
cr->npmenodes = guess_npme(fplog, mtop, ir, box, cr->nnodes);
if (fplog)
{
- fprintf(fplog, "Using %d separate PME nodes, as guessed by mdrun\n", cr->npmenodes);
+ fprintf(fplog, "Using %d separate PME ranks, as guessed by mdrun\n", cr->npmenodes);
}
}
}
{
if (fplog)
{
- fprintf(fplog, "Using %d separate PME nodes, per user request\n", cr->npmenodes);
+ fprintf(fplog, "Using %d separate PME ranks, per user request\n", cr->npmenodes);
}
}
}
if (bSepDVDL)
{
- fprintf(fplog, "Step %s: non-bonded V and dVdl for node %d:\n",
+ fprintf(fplog, "Step %s: non-bonded V and dVdl for rank %d:\n",
gmx_step_str(step, buf), cr->nodeid);
}
fr->t_wait += t3-t2;
if (fr->timesteps == 11)
{
- fprintf(stderr, "* PP load balancing info: node %d, step %s, rel wait time=%3.0f%% , load string value: %7.2f\n",
+ fprintf(stderr, "* PP load balancing info: rank %d, step %s, rel wait time=%3.0f%% , load string value: %7.2f\n",
cr->nodeid, gmx_step_str(fr->timesteps, buf),
100*fr->t_wait/(fr->t_wait+fr->t_fnbf),
(fr->t_fnbf+fr->t_wait)/fr->t_fnbf);
{
/* At this point the init should never fail as we made sure that
* we have all the GPUs we need. If it still does, we'll bail. */
- gmx_fatal(FARGS, "On node %d failed to initialize GPU #%d: %s",
+ gmx_fatal(FARGS, "On rank %d failed to initialize GPU #%d: %s",
cr->nodeid,
get_gpu_device_id(&hwinfo->gpu_info, gpu_opt,
cr->rank_pp_intranode),
{
if (atc->count[atc->nodeid] + nsend != n)
{
- gmx_fatal(FARGS, "%d particles communicated to PME node %d are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension %c.\n"
+ gmx_fatal(FARGS, "%d particles communicated to PME rank %d are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension %c.\n"
"This usually means that your system is not well equilibrated.",
n - (atc->count[atc->nodeid] + nsend),
pme->nodeid, 'x'+atc->dimind);
/* Communicate the count */
if (debug)
{
- fprintf(debug, "dimind %d PME node %d send to node %d: %d\n",
+ fprintf(debug, "dimind %d PME rank %d send to rank %d: %d\n",
atc->dimind, atc->nodeid, commnode[i], scount);
}
pme_dd_sendrecv(atc, FALSE, i,
/* Copy data to contiguous send buffer */
if (debug)
{
- fprintf(debug, "PME send node %d %d -> %d grid start %d Communicating %d to %d\n",
+ fprintf(debug, "PME send rank %d %d -> %d grid start %d Communicating %d to %d\n",
pme->nodeid, overlap->nodeid, send_id,
pme->pmegrid_start_iy,
send_index0-pme->pmegrid_start_iy,
/* Get data from contiguous recv buffer */
if (debug)
{
- fprintf(debug, "PME recv node %d %d <- %d grid start %d Communicating %d to %d\n",
+ fprintf(debug, "PME recv rank %d %d <- %d grid start %d Communicating %d to %d\n",
pme->nodeid, overlap->nodeid, recv_id,
pme->pmegrid_start_iy,
recv_index0-pme->pmegrid_start_iy,
if (debug)
{
- fprintf(debug, "PME send node %d %d -> %d grid start %d Communicating %d to %d\n",
+ fprintf(debug, "PME send rank %d %d -> %d grid start %d Communicating %d to %d\n",
pme->nodeid, overlap->nodeid, send_id,
pme->pmegrid_start_ix,
send_index0-pme->pmegrid_start_ix,
send_index0-pme->pmegrid_start_ix+send_nindex);
- fprintf(debug, "PME recv node %d %d <- %d grid start %d Communicating %d to %d\n",
+ fprintf(debug, "PME recv rank %d %d <- %d grid start %d Communicating %d to %d\n",
pme->nodeid, overlap->nodeid, recv_id,
pme->pmegrid_start_ix,
recv_index0-pme->pmegrid_start_ix,
*bValidSettings = FALSE;
return;
}
- gmx_fatal(FARGS, "The number of PME grid lines per node along x is %g. But when using OpenMP threads, the number of grid lines per node along x should be >= pme_order (%d) or = pmeorder-1. To resolve this issue, use less nodes along x (and possibly more along y and/or z) by specifying -dd manually.",
+ gmx_fatal(FARGS, "The number of PME grid lines per rank along x is %g. But when using OpenMP threads, the number of grid lines per rank along x should be >= pme_order (%d) or = pmeorder-1. To resolve this issue, use fewer ranks along x (and possibly more along y and/or z) by specifying -dd manually.",
nkx/(double)nnodes_major, pme_order);
}
MPI_Comm_size(pme->mpi_comm, &pme->nnodes);
if (pme->nnodes != nnodes_major*nnodes_minor)
{
- gmx_incons("PME node count mismatch");
+ gmx_incons("PME rank count mismatch");
}
}
else
{
if (pme->nnodes % nnodes_major != 0)
{
- gmx_incons("For 2D PME decomposition, #PME nodes must be divisible by the number of nodes in the major dimension");
+ gmx_incons("For 2D PME decomposition, #PME ranks must be divisible by the number of ranks in the major dimension");
}
pme->ndecompdim = 2;
"\n"
"NOTE: The load imbalance in PME FFT and solve is %d%%.\n"
" For optimal PME load balancing\n"
- " PME grid_x (%d) and grid_y (%d) should be divisible by #PME_nodes_x (%d)\n"
- " and PME grid_y (%d) and grid_z (%d) should be divisible by #PME_nodes_y (%d)\n"
+ " PME grid_x (%d) and grid_y (%d) should be divisible by #PME_ranks_x (%d)\n"
+ " and PME grid_y (%d) and grid_z (%d) should be divisible by #PME_ranks_y (%d)\n"
"\n",
(int)((imbal-1)*100 + 0.5),
pme->nkx, pme->nky, pme->nnodes_major,
if (debug)
{
- fprintf(debug, "PME: nnodes = %d, nodeid = %d\n",
+ fprintf(debug, "PME: number of ranks = %d, rank = %d\n",
cr->nnodes, cr->nodeid);
fprintf(debug, "Grid = %p\n", (void*)grid);
if (grid == NULL)
if (debug)
{
- fprintf(debug, "Node= %6d, pme local particles=%6d\n",
+ fprintf(debug, "Rank= %6d, pme local particles=%6d\n",
cr->nodeid, atc->n);
}
if (debug)
{
- fprintf(debug, "PP node %d sending to PME node %d: %d%s%s\n",
+ fprintf(debug, "PP rank %d sending to PME rank %d: %d%s%s\n",
cr->sim_nodeid, dd->pme_nodeid, n,
flags & PP_PME_CHARGE ? " charges" : "",
flags & PP_PME_COORD ? " coordinates" : "");
if (debug)
{
- fprintf(debug, "PME only node receiving:%s%s%s%s%s\n",
+ fprintf(debug, "PME only rank receiving:%s%s%s%s%s\n",
(cnb.flags & PP_PME_CHARGE) ? " charges" : "",
(cnb.flags & PP_PME_COORD ) ? " coordinates" : "",
(cnb.flags & PP_PME_FINISH) ? " finish" : "",
nat += pme_pp->nat[sender];
if (debug)
{
- fprintf(debug, "Received from PP node %d: %d "
+ fprintf(debug, "Received from PP rank %d: %d "
"charges\n",
pme_pp->node[sender], pme_pp->nat[sender]);
}
{
if (!(pme_pp->flags_charge & (PP_PME_CHARGE | PP_PME_SQRTC6)))
{
- gmx_incons("PME-only node received coordinates before charges and/or C6-values"
+ gmx_incons("PME-only rank received coordinates before charges and/or C6-values"
);
}
if (*bFreeEnergy_q && !(pme_pp->flags_charge & PP_PME_CHARGEB))
{
- gmx_incons("PME-only node received free energy request, but "
+ gmx_incons("PME-only rank received free energy request, but "
"did not receive B-state charges");
}
if (*bFreeEnergy_lj && !(pme_pp->flags_charge & PP_PME_SQRTC6B))
{
- gmx_incons("PME-only node received free energy request, but "
+ gmx_incons("PME-only rank received free energy request, but "
"did not receive B-state C6-values");
}
nat += pme_pp->nat[sender];
if (debug)
{
- fprintf(debug, "Received from PP node %d: %d "
+ fprintf(debug, "Received from PP rank %d: %d "
"coordinates\n",
pme_pp->node[sender], pme_pp->nat[sender]);
}
if (debug)
{
fprintf(debug,
- "PP node %d receiving from PME node %d: virial and energy\n",
+ "PP rank %d receiving from PME rank %d: virial and energy\n",
cr->sim_nodeid, cr->dd->pme_nodeid);
}
#ifdef GMX_MPI
if (debug)
{
- fprintf(debug, "PME node sending to PP node %d: virial and energy\n",
+ fprintf(debug, "PME rank sending to PP rank %d: virial and energy\n",
pme_pp->node_peer);
}
#ifdef GMX_MPI
/* issue a fatal if the user wants to run with more than one node */
if (PAR(cr))
{
- gmx_fatal(FARGS, "QM/MM does not work in parallel, use a single node instead\n");
+ gmx_fatal(FARGS, "QM/MM does not work in parallel, use a single rank instead\n");
}
/* Make a local copy of the QMMMrec */
time_string[i] = '\0';
}
- fprintf(fplog, "%s on node %d %s\n", title, nodeid, time_string);
+ fprintf(fplog, "%s on rank %d %s\n", title, nodeid, time_string);
}
void print_start(FILE *fplog, t_commrec *cr,
}
else
{
- fprintf(stderr, "%s node %d: Inconsistency during ion compartmentalization. !inA: %d, !inB: %d, total ions %d\n",
+ fprintf(stderr, "%s rank %d: Inconsistency during ion compartmentalization. !inA: %d, !inB: %d, total ions %d\n",
SwS, cr->nodeid, not_in_comp[eCompA], not_in_comp[eCompB], iong->nat);
}
}
else
{
- fprintf(stderr, "%s node %d: %d atoms are in the ion group, but altogether %d have been assigned to the compartments.\n",
+ fprintf(stderr, "%s rank %d: %d atoms are in the ion group, but altogether %d have been assigned to the compartments.\n",
SwS, cr->nodeid, iong->nat, sum);
}
}
}
else
{
- fprintf(stderr, "%s node %d: Inconsistency during solvent compartmentalization. !inA: %d, !inB: %d, solvent atoms %d\n",
+ fprintf(stderr, "%s rank %d: Inconsistency during solvent compartmentalization. !inA: %d, !inB: %d, solvent atoms %d\n",
SwS, cr->nodeid, not_in_comp[eCompA], not_in_comp[eCompB], solg->nat);
}
}
}
else
{
- fprintf(stderr, "%s node %d: %d atoms in solvent group, but %d have been assigned to the compartments.\n",
+ fprintf(stderr, "%s rank %d: %d atoms in solvent group, but %d have been assigned to the compartments.\n",
SwS, cr->nodeid, solg->nat, sum);
}
}
fprintf(fplog, "\n\n");
fprintf(fplog, " Computing: Num Num Call Wall time Giga-Cycles\n");
- fprintf(fplog, " Nodes Threads Count (s) total sum %%\n");
+ fprintf(fplog, " Ranks Threads Count (s) total sum %%\n");
}
void wallcycle_print(FILE *fplog, int nnodes, int npme, double realtime,
if (npme > 0)
{
fprintf(fplog,
- "(*) Note that with separate PME nodes, the walltime column actually sums to\n"
+ "(*) Note that with separate PME ranks, the walltime column actually sums to\n"
" twice the total reported, but the cycle count total and %% are correct.\n"
"%s\n", hline);
}
registerModule(manager, &gmx_traj, "traj",
"Plot x, v, f, box, temperature and rotational energy from trajectories");
registerModule(manager, &gmx_tune_pme, "tune_pme",
- "Time mdrun as a function of PME nodes to optimize settings");
+ "Time mdrun as a function of PME ranks to optimize settings");
registerModule(manager, &gmx_vanhove, "vanhove",
"Compute Van Hove displacement and correlation functions");
registerModule(manager, &gmx_velacc, "velacc",
"the structure provided is properly energy-minimized.",
"The generated matrix can be diagonalized by [gmx-nmeig].[PAR]",
"The [TT]mdrun[tt] program reads the run input file ([TT]-s[tt])",
- "and distributes the topology over nodes if needed.",
+ "and distributes the topology over ranks if needed.",
"[TT]mdrun[tt] produces at least four output files.",
"A single log file ([TT]-g[tt]) is written, unless the option",
- "[TT]-seppot[tt] is used, in which case each node writes a log file.",
+ "[TT]-seppot[tt] is used, in which case each rank writes a log file.",
"The trajectory file ([TT]-o[tt]), contains coordinates, velocities and",
"optionally forces.",
"The structure file ([TT]-c[tt]) contains the coordinates and",
"compiled with the GROMACS built-in thread-MPI library. OpenMP threads",
"are supported when [TT]mdrun[tt] is compiled with OpenMP. Full OpenMP support",
"is only available with the Verlet cut-off scheme, with the (older)",
- "group scheme only PME-only processes can use OpenMP parallelization.",
+ "group scheme only PME-only ranks can use OpenMP parallelization.",
"In all cases [TT]mdrun[tt] will by default try to use all the available",
"hardware resources. With a normal MPI library only the options",
"[TT]-ntomp[tt] (with the Verlet cut-off scheme) and [TT]-ntomp_pme[tt],",
- "for PME-only processes, can be used to control the number of threads.",
+ "for PME-only ranks, can be used to control the number of threads.",
"With thread-MPI there are additional options [TT]-nt[tt], which sets",
"the total number of threads, and [TT]-ntmpi[tt], which sets the number",
"of thread-MPI threads.",
"The number of OpenMP threads used by [TT]mdrun[tt] can also be set with",
"the standard environment variable, [TT]OMP_NUM_THREADS[tt].",
"The [TT]GMX_PME_NUM_THREADS[tt] environment variable can be used to specify",
- "the number of threads used by the PME-only processes.[PAR]",
+ "the number of threads used by the PME-only ranks.[PAR]",
"Note that combined MPI+OpenMP parallelization is in many cases",
"slower than either on its own. However, at high parallelization, using the",
"combination is often beneficial as it reduces the number of domains and/or",
"the number of MPI ranks. (Less and larger domains can improve scaling,",
- "with separate PME processes fewer MPI ranks reduces communication cost.)",
+ "with separate PME ranks, using fewer MPI ranks reduces communication costs.)",
"OpenMP-only parallelization is typically faster than MPI-only parallelization",
"on a single CPU(-die). Since we currently don't have proper hardware",
"topology detection, [TT]mdrun[tt] compiled with thread-MPI will only",
"to specify [TT]cutoff-scheme = Verlet[tt] in the [TT].mdp[tt] file.",
"[PAR]",
"With GPUs (only supported with the Verlet cut-off scheme), the number",
- "of GPUs should match the number of MPI processes or MPI threads,",
- "excluding PME-only processes/threads. With thread-MPI, unless set on the command line, the number",
+ "of GPUs should match the number of particle-particle ranks, i.e.",
+ "excluding PME-only ranks. With thread-MPI, unless set on the command line, the number",
"of MPI threads will automatically be set to the number of GPUs detected.",
"To use a subset of the available GPUs, or to manually provide a mapping of",
"GPUs to PP ranks, you can use the [TT]-gpu_id[tt] option. The argument of [TT]-gpu_id[tt] is",
"fast GPUs, a (user-supplied) larger nstlist value can give much",
"better performance.",
"[PAR]",
- "When using PME with separate PME nodes or with a GPU, the two major",
+ "When using PME with separate PME ranks or with a GPU, the two major",
"compute tasks, the non-bonded force calculation and the PME calculation",
"run on different compute resources. If this load is not balanced,",
"some of the resources will be idle part of time. With the Verlet",
"to avoid overloading cores; with [TT]-pinoffset[tt] you can specify",
"the offset in logical cores for pinning.",
"[PAR]",
- "When [TT]mdrun[tt] is started using MPI with more than 1 process",
- "or with thread-MPI with more than 1 thread, MPI parallelization is used.",
- "Domain decomposition is always used with MPI parallelism.",
+ "When [TT]mdrun[tt] is started with more than 1 rank,",
+ "parallelization with domain decomposition is used.",
"[PAR]",
"With domain decomposition, the spatial decomposition can be set",
"with option [TT]-dd[tt]. By default [TT]mdrun[tt] selects a good decomposition.",
"At high parallelization the options in the next two sections",
"could be important for increasing the performace.",
"[PAR]",
- "When PME is used with domain decomposition, separate nodes can",
+ "When PME is used with domain decomposition, separate ranks can",
"be assigned to do only the PME mesh calculation;",
- "this is computationally more efficient starting at about 12 nodes",
+ "this is computationally more efficient starting at about 12 ranks,",
"or even fewer when OpenMP parallelization is used.",
- "The number of PME nodes is set with option [TT]-npme[tt],",
- "this can not be more than half of the nodes.",
+ "The number of PME ranks is set with option [TT]-npme[tt],",
+ "but this cannot be more than half of the ranks.",
"By default [TT]mdrun[tt] makes a guess for the number of PME",
- "nodes when the number of nodes is larger than 16. With GPUs,",
- "PME nodes are not selected automatically, since the optimal setup",
- "depends very much on the details of the hardware.",
- "In all cases you might gain performance by optimizing [TT]-npme[tt].",
- "Performance statistics on this issue",
+ "ranks when the number of ranks is larger than 16. With GPUs,",
+ "using separate PME ranks is not selected automatically,",
+ "since the optimal setup depends very much on the details",
+ "of the hardware. In all cases, you might gain performance",
+ "by optimizing [TT]-npme[tt]. Performance statistics on this issue",
"are written at the end of the log file.",
"For good load balancing at high parallelization, the PME grid x and y",
- "dimensions should be divisible by the number of PME nodes",
+ "dimensions should be divisible by the number of PME ranks",
"(the simulation will run correctly also when this is not the case).",
"[PAR]",
"This section lists all options that affect the domain decomposition.",
"With [TT]-multi[tt], the system number is appended to the run input ",
"and each output filename, for instance [TT]topol.tpr[tt] becomes",
"[TT]topol0.tpr[tt], [TT]topol1.tpr[tt] etc.",
- "The number of nodes per system is the total number of nodes",
+ "The number of ranks per system is the total number of ranks",
"divided by the number of systems.",
"One use of this option is for NMR refinement: when distance",
"or orientation restraints are present these can be ensemble averaged",
"and no old output files are modified and no new output files are opened.",
"The result with appending will be the same as from a single run.",
"The contents will be binary identical, unless you use a different number",
- "of nodes or dynamic load balancing or the FFT library uses optimizations",
+ "of ranks or dynamic load balancing or the FFT library uses optimizations",
"through timing.",
"[PAR]",
"With option [TT]-maxh[tt] a simulation is terminated and a checkpoint",
"pressed), it will stop after the next neighbor search step ",
"(with nstlist=0 at the next step).",
"In both cases all the usual output will be written to file.",
- "When running with MPI, a signal to one of the [TT]mdrun[tt] processes",
+ "When running with MPI, a signal to one of the [TT]mdrun[tt] ranks",
"is sufficient, this signal should not be sent to mpirun or",
"the [TT]mdrun[tt] process that is the parent of the others.",
"[PAR]",
{ "-dd", FALSE, etRVEC, {&realddxyz},
"Domain decomposition grid, 0 is optimize" },
{ "-ddorder", FALSE, etENUM, {ddno_opt},
- "DD node order" },
+ "DD rank order" },
{ "-npme", FALSE, etINT, {&npme},
- "Number of separate nodes to be used for PME, -1 is guess" },
+ "Number of separate ranks to be used for PME, -1 is guess" },
{ "-nt", FALSE, etINT, {&hw_opt.nthreads_tot},
"Total number of threads to start (0 is guess)" },
{ "-ntmpi", FALSE, etINT, {&hw_opt.nthreads_tmpi},
"Number of thread-MPI threads to start (0 is guess)" },
{ "-ntomp", FALSE, etINT, {&hw_opt.nthreads_omp},
- "Number of OpenMP threads per MPI process/thread to start (0 is guess)" },
+ "Number of OpenMP threads per MPI rank to start (0 is guess)" },
{ "-ntomp_pme", FALSE, etINT, {&hw_opt.nthreads_omp_pme},
- "Number of OpenMP threads per MPI process/thread to start (0 is -ntomp)" },
+ "Number of OpenMP threads per MPI rank to start (0 is -ntomp)" },
{ "-pin", FALSE, etENUM, {thread_aff_opt},
- "Fix threads (or processes) to specific cores" },
+ "Set thread affinities" },
{ "-pinoffset", FALSE, etINT, {&hw_opt.core_pinning_offset},
"The starting logical core number for pinning to cores; used to avoid pinning threads from different mdrun instances to the same core" },
{ "-pinstride", FALSE, etINT, {&hw_opt.core_pinning_stride},
{ "-nstlist", FALSE, etINT, {&nstlist},
"Set nstlist when using a Verlet buffer tolerance (0 is guess)" },
{ "-tunepme", FALSE, etBOOL, {&bTunePME},
- "Optimize PME load between PP/PME nodes or GPU/CPU" },
+ "Optimize PME load between PP/PME ranks or GPU/CPU" },
{ "-testverlet", FALSE, etBOOL, {&bTestVerlet},
"Test the Verlet non-bonded scheme" },
{ "-v", FALSE, etBOOL, {&bVerbose},
{ "-compact", FALSE, etBOOL, {&bCompact},
"Write a compact log file" },
{ "-seppot", FALSE, etBOOL, {&bSepPot},
- "Write separate V and dVdl terms for each interaction type and node to the log file(s)" },
+ "Write separate V and dVdl terms for each interaction type and rank to the log file(s)" },
{ "-pforce", FALSE, etREAL, {&pforce},
"Print all forces larger than this (kJ/mol nm)" },
{ "-reprod", FALSE, etBOOL, {&bReproducible},
{
md_print_warn(cr, fplog,
"NOTE: PME load balancing increased the non-bonded workload by more than 50%%.\n"
- " For better performance use (more) PME nodes (mdrun -npme),\n"
- " or in case you are beyond the scaling limit, use less nodes in total.\n");
+ " For better performance, use (more) PME ranks (mdrun -npme),\n"
+ " or if you are beyond the scaling limit, use fewer total ranks (or nodes).\n");
}
else
{
/* uninitialize GPU (by destroying the context) */
if (!free_gpu(gpu_err_str))
{
- gmx_warning("On node %d failed to free GPU #%d: %s",
+ gmx_warning("On rank %d failed to free GPU #%d: %s",
cr->nodeid, get_current_gpu_device_id(), gpu_err_str);
}
}
#ifdef GMX_THREAD_MPI
if (cr->npmenodes > 0 && hw_opt->nthreads_tmpi <= 0)
{
- gmx_fatal(FARGS, "You need to explicitly specify the number of MPI threads (-ntmpi) when using separate PME nodes");
+ gmx_fatal(FARGS, "You need to explicitly specify the number of MPI threads (-ntmpi) when using separate PME ranks");
}
#endif
if (hw_opt->nthreads_omp_pme != hw_opt->nthreads_omp &&
cr->npmenodes <= 0)
{
- gmx_fatal(FARGS, "You need to explicitly specify the number of PME nodes (-npme) when using different number of OpenMP threads for PP and PME nodes");
+ gmx_fatal(FARGS, "You need to explicitly specify the number of PME ranks (-npme) when using different number of OpenMP threads for PP and PME ranks");
}
}
#ifdef GMX_THREAD_MPI
"but the number of threads (option -nt) is 1"
#else
- "but %s was not started through mpirun/mpiexec or only one process was requested through mpirun/mpiexec"
+ "but %s was not started through mpirun/mpiexec or only one rank was requested through mpirun/mpiexec"
#endif
#endif
, ShortProgram()
if (cr->npmenodes > 0)
{
gmx_fatal_collective(FARGS, cr, NULL,
- "PME nodes are requested, but the system does not use PME electrostatics or LJ-PME");
+ "PME-only ranks are requested, but the system does not use PME for electrostatics or LJ");
}
cr->npmenodes = 0;