BioD PNPI Git Repos - alexxy/gromacs.git/log

portability aspects in install guide + minor tweaks

Added information on portability aspects related to CPU instruction
sets, related to #1428.

Additionally, made several minor updates and tweaks related to
compilers, platforms, cmake, etc.

Change-Id: I621262c939c119e5bdd5e7c91dda0ae3ffc60b7b

Reinstate shell code with DD

Further work on the complex/sw test case in the 5.0 regressiontests
branch reveals that the initial conditions may have been the reason
for the problems observed with DD and more than one node, rather than
the implementation.

Refs #1429

Change-Id: I26ff6d9f8c79605afa794cae4761b5643b712124

added gmx_is_{single,double}_precision

* allows easy detection of the precision for cmake, autotools
  without parsing the output of gmx_print_version_info (no
  cross-compile support) or the output of strings command (unix
  only)
* linking against libgmx with a -DGMX_DOUBLE setting that does not match
  the library's will lead to unpredictable segfaults
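
A minimal sketch of the idea, assuming the library exports an empty
function whose name depends on the precision it was built with, so that a
configure-time link check for the symbol reveals the precision without
running anything. The helper precision_name() is hypothetical and only
makes the sketch observable.

```c
/* Sketch only: exactly one of these two symbols is compiled in,
 * depending on whether GMX_DOUBLE was defined at build time. */
#ifdef GMX_DOUBLE
void gmx_is_double_precision(void)
{
    /* intentionally empty: the presence of the symbol is the information */
}
#else
void gmx_is_single_precision(void)
{
    /* intentionally empty: the presence of the symbol is the information */
}
#endif

/* Hypothetical helper (not part of GROMACS) reporting the same choice. */
const char *precision_name(void)
{
#ifdef GMX_DOUBLE
    gmx_is_double_precision();
    return "double";
#else
    gmx_is_single_precision();
    return "single";
#endif
}
```

A build system can then test whether linking against the library resolves
one symbol or the other, which also works when cross-compiling.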

Change-Id: I472f10ae374a1f42c94c55e156b53f8905bdf098

Fix aligned store to unaligned memory

Also fixes that unaligned store was used when not necessary.

Change-Id: I44bb222a07ec0af65198667787b8673b3c6cd2e7

avoid mdrun crash when rdtscp is not supported

When using rdtscp, mdrun now detects at runtime whether the CPU supports
this instruction and if this is not the case, it issues a fatal error
and instructs the user to recompile mdrun for the compute host. Note
that this will happen rarely, only when cross-compiling from a newer
host for a rather old one.
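
A sketch of the kind of runtime check described above, assuming the caller
has already executed CPUID leaf 0x80000001 and passes in the resulting EDX
value; RDTSCP support is reported in bit 27 of that register. The function
name is hypothetical and is not the one used in mdrun.

```c
#include <stdint.h>

/* CPUID leaf 0x80000001, register EDX, bit 27 flags RDTSCP support. */
#define CPUID_EXT_EDX_RDTSCP_BIT 27

/* Hypothetical helper: returns 1 if the given EDX value advertises RDTSCP
 * and 0 otherwise. Obtaining EDX via the cpuid instruction is left to the
 * caller, so this function itself is portable. */
int cpu_supports_rdtscp(uint32_t ext_edx)
{
    return (int)((ext_edx >> CPUID_EXT_EDX_RDTSCP_BIT) & 1u);
}
```

A binary compiled to use RDTSCP could call such a check at startup and
stop with a fatal error when it returns 0.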

Additionally, when the user manually picks AVX, we also turn on RDTSCP
as all AVX-capable CPUs support it.

Also made CMake advanced cache option for GMX_USE_RDTSCP. This replaces
the previously hidden GMX_DISTRIBUTABLE_BUILD option.

Fixes #1428

Change-Id: I8bc884ef9ea8ea4661626b60490182ae2b302648

Added safety check for fitting group in anaeig.

Previously, g_anaeig would not check the number of atoms in the selected
fit group against the number of atoms in the reference structure if this
was read from the eigenvector file (g_covar adds the reference structure
to the eigenvector file if fit and analysis group are identical). As a
result, anaeig would run out of bounds when selecting the atoms for
fitting, reading random values from memory.

This simple check should prevent this behaviour by terminating anaeig
with a fatal error similar to the one that is invoked if the group
selected for analysis has an incorrect number of atoms.

Change-Id: I63a1e1629144e539808d95d867e0ad0673480fdf

Issue fatal errors rather than use broken shell code

Refs #1429

Change-Id: I18a17f1e232a86a13f4e3b591bd992702af3017b

cmake eats slashes

Change-Id: I5ea157c4a5e9df2212643b49ba9b270ffd9a6978

Keep clang Address Sanitizer happy

Allocating 15 bytes with the 8-byte aligned memory at offset 8 of
15, would overflow the buffer, which would be fairly likely to
have no effect. But ASan notices this if you run it on AVX hardware,
unlike the Jenkins build which runs on SSE4.1. The good news is
that this fix is enough to make all the existing tests pass under
ASan on AVX.

Change-Id: I61ff11687709e096c70a162d3514227cb243561d

cmake: added FFTW_URL to allow easy offline build

Change-Id: I9904ce03e0ee1b377e4961c1f8481fc98c10cba4

Remove unused fplog

Keeps gcc 4.8 build happy

Change-Id: I392b02c0950ead04c414dffd6340b364b804b7aa

Unify logic for timing counts

Made all integrators use the same logic for starting timing
mechanisms.

Change-Id: Id8cb154f7b96d977efffcc9533d4a6dd9894afbd

clarified OpenMP-related things in mdrun help/man

Added note on OMP_NUM_THREADS/GMX_PME_NUM_THREADS env vars and
improved description on the use-cases when MPI+OpenMP improves
performance.

Change-Id: I904f00c8a4b6907a006b9d4367406d3fa3f3ce42

Fixed precision in thermal expansion coefficient calc.

Loss of accuracy was caused by different sampling
of volume and enthalpy and as a result alpha was
computed incorrectly. With the present "fix" the volume
and enthalpy are both downsampled to what is written
in the .edr file. The real fix would be to store the product
of H and V in the .edr file, but that falls outside the
4.6 branch policy.
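
For illustration, a sketch of the standard NPT fluctuation formula
alpha = (<H*V> - <H><V>) / (kB * T^2 * <V>), which shows why H and V must
be sampled at the same frames: the product term pairs them sample by
sample. The function name and argument layout are assumptions made for
this sketch, not the analysis tool's code.

```c
#include <stddef.h>

/* Sketch: thermal expansion coefficient from n paired samples (n > 0).
 * H[i] and V[i] must belong to the same frame. kB must be given in units
 * consistent with H, V and T (an assumption of this sketch). */
double thermal_expansion_coeff(const double *H, const double *V, size_t n,
                               double kB, double T)
{
    double sumH = 0.0, sumV = 0.0, sumHV = 0.0;
    double avH, avV, avHV;
    size_t i;

    for (i = 0; i < n; i++)
    {
        sumH  += H[i];
        sumV  += V[i];
        sumHV += H[i] * V[i];   /* needs both values from the same frame */
    }
    avH  = sumH / (double)n;
    avV  = sumV / (double)n;
    avHV = sumHV / (double)n;

    /* covariance of H and V divided by kB * T^2 * <V> */
    return (avHV - avH * avV) / (kB * T * T * avV);
}
```

If V were sampled more densely than H, the averages <H>, <V> and <H*V>
would no longer refer to the same set of frames, which is the accuracy
loss described above.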

Change-Id: I1be06d689002d7c9d6be92bf1e377912f0be1efd

Checkpointing fix for Native Client

Native Client doesn't allow file renames.  We can over-write output
files, however.  For checkpoints, live dangerously and skip backups.
The alternate would be to use an in-memory file system, but then we're
still screwed if the program gets killed partway through writing the
on-disk version.  Other alternates:  keep all checkpoints.

Change-Id: I952ee6436e69f015633a150f94fca65c7271c6bb

Patch for Native Client builds.

This patch contains the source changes necessary to compile Gromacs
for Native Client. Patch is based on original work by Ivan Krasin,
additional changes from Joseph Coffland.
Also included are a few compiler warning fixes and a minor FAHCORE
tweak.

Change-Id: I085c52ff1d8e45ec8ffb8c56f5877313d6225bb2

cmake: make GMX_BUILD_OWN_FFTW work without fortran compiler

Fixes #1412

Change-Id: I4739c112630ad7e264ce314d2da0b29932ea3041

Pass on default value of radstep in make_edi

This is a bugfix for make_edi when -radfix is chosen. Problem was:
if the user did not specify -radstep, then the default value of 0
was not written to the sam.edi output file. Now it is. Also
renamed "radfix" variable to "radstep" because that better reflects
what it is.

Change-Id: I0cc6ee84d42b18ee0ea6b045cdfb0c1d55d51b9f

Essential dynamics: move bNeedDoEdsam evaluation to separate function

The bool variable edi->bNeedDoEdsam is used to signal whether any essential
dynamics constraints have to be evaluated for the ED group. This variable
was evaluated at the beginning of an ED simulation in write_edo_legend().
The latter is however not called if continuing from checkpoint. To
get rid of having to remember whether edi->bNeedDoEdsam was already
initialized, there is now a function bNeedDoEdsam(edi) that is evaluated
every time it is called.
Also corrected a few typos.

Change-Id: Iab899a677a85ee8270354859c98cc9e5a9db34b7

Fixed essential dynamics (ED) continuation from .cpt for reference=average

ED runs where the reference and average structure indices are the same can
crash when continued from checkpoint. For these cases (where reference =
average atom indices) obviously the set of atomic positions for the reference
and average structures is always identical. Therefore, only one of the two
structures is stored (which is the average structure edi->sav). When reading
the old values of these structures from the checkpoint file, edi->sav.x_old
therefore needs to be copied both to xstart and xfit in init_edsam().

Change-Id: Ieb1f029f4a927999dfb4579ee7c3bebe15071dc8

Fixed compilation issue due to gcc4.8

by turning off warnings.

Change-Id: Ice3dd8dec8cb9dc590fb293c1face3ed603f7abb

Fix non-critical typo in #ifdef GMX_OMPEN_MP

Added #include to make it work.

Change-Id: Icea244c4fb63aee6ae67a29370d08177a66129a8

Version bump after 4.6.5 release

Change-Id: I1d1c1ee28d585b6cf4431f9f1ec1a334f68ae6e3

Version bumps before release

Change-Id: I5b5ea233c47ce95474dae3b0f71a4ae6ae704f6c

Fixed return value of gmx_mtop_bondeds_free_energy

The return value was always true, which was harmless, since it
could only cause a small performance hit of useless sorting.

Fixes #1387

Change-Id: I088a3747ddb3517fbb5e416b791bd542bd49fed2

Fix DD load balancing bug with GPU sharing

The recent DD load balancing fix which solved the issue of incorrect
imbalance measure with GPU sharing (ba8232e9) addressed GPUs with
incorrect indexing. This caused out of bounds indexing in the GPU ID
query function. The query function also had a bug in the error checking
which allowed the incorrect indexing.
Now also mdrun -nb cpu -gpu_id ... is allowed, which before would give
a fatal error.

This commit addresses both issues; fixes #1385

Change-Id: I2800f610b873da92afe78bbfd869258f378ba2d7

Fix incorrect variable name in documentation

Change-Id: I312e3886ebc692f2331ac2f9a612d530b5d4914c

corrected potential nbnxn SIMD memory issue

A fixed size array on the stack was declared with one element
too few. Probably this never caused trouble with 64-bit builds,
but it might have caused trouble with 32-bit builds.

Change-Id: I4dad0a7a9e80f5d27ac6ee7e4383082db654481a

Bump version after 4.6.4 release

Change-Id: Ied0463a471657e39cb6c4c41d6112f5778ef00d5

Fix minor things before release

Bumped various version numbers

Trivial fix to install guide

Removed out-of-date gmxfaq.html and links to it, replaced links with
links to up-to-date FAQ

share/html/online.html is generated by mkhtml, so stopped caching it
in the repo.

Change-Id: I52265e1174f6e42a2a9d056c3a1751c1cd5886ac

Clarified GPU selection output and mdrun help

The reporting of GPU selection has been confusing when devices are
shared as GPU IDs would show up multiple times in a list of "devices
selected to be used." The reporting has been modified to print the
number of devices selected followed by the GPU to PP rank mapping which
is in fact exactly the previously printed list of IDs.

Additionally, the mdrun help page now explicitly states that the GPU ID
string passed with -gpu_id specifies a per-node GPU to PP rank mapping
and that multiple ranks can share GPUs.

Change-Id: Id98c592c1dd38573df003247281e4edf50debba7

Added rotation to the tests run by ctest

Was forgotten when the rotation tests were added and thus
they weren't run by Jenkins

Change-Id: I27fc51b1314e6377d1e866a8ba4658700cc71cfa

fix nbnxn atom sorting with distant bondeds

Atoms communicated for bonded interactions can be beyond the non-local
search grid. Only a single cell extra was accounted for, which could
give inconsistency errors. Now any distance is handled correctly.
Fixes #1379

Change-Id: I7b12efeeab4074f2b356c0d0739105ce38371901

corrected dynamic load balancing when sharing GPUs

When sharing GPUs over MPI ranks, the time the GPU is busy might not
reflect the actual load. To make the dynamic load balancing between
domains work correctly, the GPU wait times are now redistributed over
the ranks/domains sharing a GPU.

Change-Id: Id9414e3ef7cc5a73a2b4560a0e10c2ee8ab1257f

enable GPU sharing among tMPI ranks

It turns out that the only issue preventing sharing GPUs among thread-MPI
threads was that when the thread arriving at free_gpu() first destroys
the context, it is highly likely that the other thread(s) sharing a GPU
with it are still freeing their resources, an operation which fails as
soon as the context is destroyed by the "fast" thread.

Simply placing a barrier between the GPU resource freeing and context
destruction solves the issue. However, there is still a very unlikely
concurrency hazard after CUDA texture reference updates (non-bonded
parameter table and coulomb force table initialization). To be on the
safe side, with tMPI a barrier is placed after these operations.

Change-Id: Iac7a39f841ca31a32ab979ee0012cfc18a811d76

GPU detection is done once per physical node

Only one MPI rank in each physical node now runs the GPU detection.
The resulting information is broadcast to the other ranks.
Note that we should also implement this for the CPU detection.
Fixes #1358

Change-Id: I16c6ccc40bd53d96b99d3f6a0abed69cc89136d8

removed (harmless) left-over in nbnxn SIMD kernels

This improves performance of PME + p-coupling by about 5%.
With Ewald and virial, the nbnxn SIMD energy kernels were used
(some left-over development code). The plain-C code did not do this.

Change-Id: I039044fcb393bf0bcaa06f38498b2a57d60cf080

reorganized GPU detection and selection

The GPU selection has been separated from the GPU detection
and now happens after the thread-MPI threads are started.
The GPU user/auto-selected options have been removed from
gmx_hw_info_t, such that it only contains hardware info
and can be passed around as const.
As both the CPU and GPU options structs are now tMPI rank local,
tMPI thread concurrency issues are avoided.
Fixes #1334 #1359

The GPU detection is now skipped with mdrun -nb cpu
CPU acceleration binary/hardware mismatch is now only printed once
to stderr (instead of #MPI-rank times to stdout).
Removed the master_inf_t struct.

Change-Id: If497f611b911808f6d01ca83f41ae288061dd361

Rename GMX_IS_* to GMX_TARGET_*

This addresses some confusion that developed between release-4-6 and
master with me trying to develop the kernels in master branch so I
could have unit testing support, and then cherry-pick them back. I
had intended to solve #1269 in a separate commit, but it didn't happen
that way.

As #1269 discusses, the code sometimes needs to know what architecture
is being targeted by the compiler. This information is held in the
GMX_TARGET_X86 and GMX_TARGET_BGQ CMake and preprocessor
variables. Note that this information is distinct from what CPU
acceleration is being used (which might be "None" on either platform).

gmx_cpuid.c needs GMX_TARGET_X86 defined to work correctly on x86, and
is called at configure time (at which time config.h is
unavailable). So, this in CMake is treated via a command-line
definition of GMX_TARGET_X86 when required.

Fixes #1269 (even though I98c5791ec silently did this already)

Change-Id: I94e0756856e7d49ff09a87b8283189976b48ea49

corrected volume with serial NPT replica exchange

Replica exchange with replicas run in serial would only update
x and v, not the other state data. This gave incorrect volumes
with NPT replica exchange.
Fixes #1362

Change-Id: Ib726fbb75e800c624ef61f31e76a5d4a4e408b9c

improved the nbnxn buffer size estimate with GPUs

The nbnxn Verlet buffer estimate now takes into account that
constrained atoms rotate, and don't move linearly, around the atom
they are constrained to. This significantly lowers the buffer size
estimate for long neighborlist life times (as used with GPUs).
The buffer for most CPU runs is not affected (significantly).
Because of the smaller buffer, mdrun now uses smaller list increase
limits for increasing nstlist when using GPUs. This improves
performance.

Also activated and tested the virtual site effective mass calculation
(vsites were ignored in the drift calculation).

Change-Id: I2cb349f483610eabcc97bfbc23d17f189dec19d6

Fix NBNxN SIMD reference kernels

nbfp_stride was added independently by both 25eb0e14 and 5deee8a0.

Removing static is not OK for gcc. Mark will resolve later whether
this was even needed for his upstream work.

Change-Id: I97ea4131163512354b5e339dd19549c3e49e9de2

fixed recent bug with CUDA texture objects

On GPUs with CUDA architecture 3.0, mdrun would exit with an error.
This bug was introduced very recently in 43b41cb8
Fixes #1361

Change-Id: I0c46867b987cbf3c0da3aa9384d985fef1e4aa73

fixed OpenMP threads being pinned to the same cores

Due to the thread id not being a thread-local variable in the OpenMP
loop setting the thread affinities, different OpenMP threads could be
pinned to the same physical cores.
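
A sketch of the scoping issue, assuming a simple offset-plus-stride core
layout. It calls no affinity API; the function and parameter names are
hypothetical. The point is that the per-thread index lives inside the
parallel construct, so each thread works on its own copy instead of racing
on a shared variable.

```c
/* Sketch: compute the core each thread would be pinned to.
 * core_of_thread must have room for nthreads entries. */
void assign_cores(int *core_of_thread, int nthreads, int offset, int stride)
{
    int t;

    /* The loop variable of an "omp parallel for" is private to each thread.
     * Without OpenMP the pragma is ignored and the loop runs serially. */
#pragma omp parallel for num_threads(nthreads) schedule(static)
    for (t = 0; t < nthreads; t++)
    {
        /* Declared inside the construct: one copy per thread. Declaring it
         * outside the parallel region would make it shared, and several
         * threads could then read the same value. */
        int thread_id = t;

        core_of_thread[thread_id] = offset + thread_id * stride;
    }
}
```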
Fixes #1360

Change-Id: I7bc39aef9a8854ec24006895da6005c1326033a3

BlueGene/Q Verlet cut-off scheme kernels

The kernels are implemented with small functions whose inlining
is guaranteed by the use of xlc and clang extensions. That's a hack
whose general solution I plan to implement in master branch.

Other BG/Q considerations:

Architecture detection now works on A2 core.

Install guide updated.

It is better to use intra-node communicators than not, and ranks
within nodes are correctly detected via querying the BlueGene/Q API,
since the hostname is not useful for the purpose.

It is better to not set GMX_DD_SENDRECV2.

It is better to use the analytical Ewald correction.

In principle, we should version the type of variables and fields named
d2, rl2, rbb2 in nbnxn_search*[ch] to be double on PowerPC and float
everywhere else (each regardless of GROMACS target precision). This
would mean that on PowerPC (where all flops take place in double
precision with free precision-extension upon load) we can be both
cache-efficient by storing bounding boxes in float, and flop-efficient
by not having to generate a round-to-single instruction to compare the
result of subc_bb_dist2_simd4 with the cut-off stored as a
float. Still, a flop per bounding-box distance comparison will not
break the bank.

Enough bgclang support exists for the build to succeed (no platform
file is required), even with OpenMP, but a number of compiler issues
have been reported on llvm-bgq-discuss mailing list.

Change-Id: I98c5791ec3766cdbdcb8a8eb7418d00585727cc0

Call atomics from TestAtomic.c

This exposes more compile-time errors than simply parsing
the definitions. This makes CMake's diagnostics more useful
with respect to atomic operations.

Fixes #1355

Change-Id: Ie1d6f14565700b98988cadc17cb7ac2b78d76ce3

Fix tMPI_Atomic_memory_barrier for MIC

MIC doesn't have sfence. It isn't required because the current generation
of MIC is in-order.

Change-Id: I6953bc3168a191a3038408e6ea35025a25509abe

Fix typo in g_membed documentation

Suggested by Iman Pouya

Change-Id: I5c77a29b64e61f9da5a663119e149d992141eb21

Fix SIMD C reference nbnxn kernels

Got broken by ace006a86 and 022581b388.
An additional fix for nbnxn 4x8 reference code, broken by c0cf8ce,
is in a separate patch.
Also changed the AVX256 double precision nbfp_stride from 4 to 2.

Refs #1173

Change-Id: If3b3291a7ff765acc19c29f834e856cc9798d47e

Restarting from checkpoint no longer reinitializes WL weights.

Fixes a problem where mdrun was reinitializing the initial Wang-Landau
delta for expanded ensemble simulations, because the flag turning it
off was stored in the expanded ensemble data structure (not saved in cpt)
instead of in the df_history structure (which is saved in the checkpoint).
In the process, did some moderate encapsulation of the df_history
structure and the expanded ensemble methods.

Fixes #1350

Change-Id: I13492a7a9773fcb417fcd0ee106d851d9838ce25

avoid division by zero in SIMD angles and dihedrals

The SIMD accelerated angle and dihedral code did not (correctly)
check for dividing by zero, which can happen with aligned bonds.
Fixes #1351

Change-Id: I326f90fca87ab5cca493204de4a58655465634ca

turning off expanded ensemble for all integrators but md-vv.

Broke at some point, and somewhat tricky to turn back on
correctly for other integrators at this point; target for
5.0 when it should be more straightforward.

fixes #1321

Change-Id: I599b308800411e0cea111ffd280487037d613755

fixed a half bin misalignment in gmx_vanhove -or

Change-Id: Ia800861912d50f5047742bcb1bb51e753920968f

Fixes a problem with pair type 2 interactions with free energy

Pair type 2 interactions, which should remain on regardless
of couple-intramol=yes, were being turned off. Until now, when free
energies were turned on, they were just ignored, because the (empty)
pair type 1 list was copied over them. This fix addresses
this problem by adding onto the list instead of copying over it.

Fixes #1315

Change-Id: I240479a8dc083f7a355917ed9f74f4337fa3448f

make use of CUDA stream priorities

CUDA 5.5 introduced stream priorities with 2 levels. We make use of this
feature by launching the non-local non-bonded kernel in a high priority
stream. As a consequence, the non-local kernel will preempt the local
one and finish first. This will improve performance in multi-node runs
by reducing the possibility of late arrival of non-local forces.

Change-Id: I4efc65546e4135f12006c0422e1fca42a788129f

use CUDA texture objects when supported

CUDA texture objects are more efficient than texture references, their
use reduces the kernel launch overhead by up to 20%. The kernel
performance is not affected.

Change-Id: Ifa7c148eb2eea8e33ed0b2f1d8ef092d59ba768e

introduced general 4-wide SIMD support

PME spread+gather and the nbnxn search bounding box checks use
4-wide SIMD (as opposed to arbitrary width SIMD). This SSE code
has now been replaced by macros from gmx_simd4_macros.h.
pme_sse_single.h has been renamed to pme_simd4.h
This change is mainly refactoring; it only adds PME spread+gather
AVX acceleration in double precision plus a few FMA instructions.

Change-Id: Ia5e02295bb281a2e23d57f4c165f555de6744064

fixed nbnxn 4x8 pair search without AVX

This bug was introduced recently.
Note that nbnxn 4x8 without AVX was only possible when manually
changing the code to use plain-C reference SIMD.

Change-Id: I5effe4076bc5ff270ebeb366f9c2b8a13c256025

Fix parallel build for GMX_BUILD_OWN_FFTW

* only works for cmake >=2.8.8
* cmake 2.8.7 has a bug in add_library, but
cmake 2.8.[0123] have other problems, cmake-2.8.[456]
still don't build in parallel
* fix from https://gerrit.gromacs.org/#/c/1675/12
* hardcode libdir to fix build on OpenSuse

Change-Id: I74315880f71fd4384084819ccc686072f7cad4f5

Fixes reaction field free energy bug

Version 4.6 introduced an error in the reaction-field correction
force term for perturbed interactions. A factor r_softcore^2 was
missing. The force calculation code is now slightly reorganized
and comments added to avoid such issues in the future.
Fixes #1318

Change-Id: I9105139f8975495c323008ce202cde517a69281a

Silence clang warnings

Pre-release clang 3.4 warns that the type of the lout[23] variables
is not the real * expected with GMX_MPI_REAL.

Change-Id: Id3ca4567f5eb642ead0cb4ce8d48dafbb92c303a

allow compilation to optimize for CUDA compute cap. 3.5

Enabling optimizations targeting compute capability 3.5 devices
(GK110) slightly improves performance of both PME and RF kernels.
This requires a hint for the compiler optimization indicating
the maximum number of threads/block and minimum number of
blocks/multiprocessor. This change allows nvcc >=5.0 to generate
code for CC 3.5 devices and switches to including PTX 3.5 code
(instead of 3.0) in the binary.

Change-Id: If7e14d31165bc05859250db7468bf6bd8c186264

Corrected info text. -center --> -boxcenter

Change-Id: I99901047fcde55f9714c81d3182a3778f290ebac

Fixed limitations in g_cluster

Old version produced wrong output for large trajectories with more
than 46340 frames. The reason was that the number of RMSD matrix
entries, which is the square of the number of frames, was stored as int,
which caused an INT_MAX overflow. By changing it to gmx_large_int_t,
g_cluster is now able to handle trajectories with up to 3e9 frames.
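
A small sketch of the arithmetic, using int64_t as a stand-in for
gmx_large_int_t (an assumption of this sketch) and a hypothetical function
name. With a 32-bit int, 46340^2 = 2147395600 still fits below
2147483647, while 46341^2 = 2147488281 does not.

```c
#include <stdint.h>

/* Sketch: number of RMSD matrix entries for nframes frames (nframes^2).
 * The operands are widened before multiplying; computing
 * nframes * nframes in int would overflow once nframes exceeds 46340
 * on platforms where int is 32 bits wide. */
int64_t rmsd_matrix_entries(int nframes)
{
    return (int64_t)nframes * (int64_t)nframes;
}
```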

Also freed leaking temporary buffers.

Change-Id: I8acfb0cedae9ddde207f39cb627ad2ea9fbbb9e6

logic fix for free energies with mdrun -rerun

It was taking the wrong logical path when it checks whether
delta_lambda = 0 when doing mdrun -rerun

fixes #1330

Change-Id: I3dadbb546b4376fae72c1b00c0684450bf77396f

Drop md5sum check for GMX_BUILD_OWN_FFTW

The old version gave a confusing error message about a wrong md5sum if
the download failed.

The new version no longer checks an md5sum at all, which avoids the
need to test a CMake version. It also gives an explicit warning and
instructions on how to proceed safely.

CMake bug reported at http://www.cmake.org/Bug/view.php?id=14330
Noted TODO to revisit if that bug gets fixed.

Noted TODO in master branch to show this warning only the first time a
suitable cached variable is set.

Change-Id: I403896505b178251087d71f95362c3754cd4a2de

Fix bug in (long) neighborlist SIMD padding when adding to previous list

Gromacs-4.6 introduced SIMD padding in the neighborlists, which works
fine for normal simulations. However, when the neighborlist gets long
and we end up adding a second batch of particles we need to remove the
previous padding, which was not done until now. This will typically only
occur when the list per node is large, e.g. when using long cutoffs
(>2nm) with only a single core. Normal simulations should not have been
affected by it (which is also why we did not find it until now).

Fixes #1341.

Change-Id: Ie64ab6c0313a8dc0d3545a5e7d610f24adae4438

optimized generic SIMD invsqrt

The function gmx_invsqrt_pr now uses one instruction less when
FMA is not supported in hardware.
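
For context, a scalar sketch of one Newton-Raphson refinement step for an
inverse square root, the kind of iteration a SIMD routine typically applies
to a hardware estimate. It illustrates the arithmetic only; the instruction
sequence used by gmx_invsqrt_pr is not shown in this log, and the function
name below is hypothetical.

```c
/* One Newton-Raphson step for y ~ 1/sqrt(x):
 *   y_new = y * (1.5 - 0.5 * x * y * y)
 * x must be positive and y should already be a rough estimate. */
double invsqrt_refine(double x, double y)
{
    return y * (1.5 - 0.5 * x * y * y);
}
```

Starting from the estimate 0.4 for x = 4, one step gives about 0.472,
closer to the exact value 0.5; the exact value itself is a fixed point of
the iteration.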
Fixes #1333

Change-Id: Idace7296b88a8ecc0331e22d5bb3088753c478de

fixed multiple distance restraints with OpenMP

Distance restraints with multiple pairs (the same label) are no
longer split over multiple OpenMP threads. Some (beneficial)
reorganization of the bonded thread division was required to do this,
most importantly: removed calc_one_bond_foreign.
Fixes #1316

Change-Id: I88d8eafede5cbc26c19026a9272639e652f7abd7

Described another way g_tune_pme might reasonably fail

Change-Id: Ibb75f40a17b81934ae768a57d5e4fb11d07cdc2d

Fix total time measurement with separate PME nodes

The runtime counter needs to be passed to the PME code so that
PME-only nodes can have their time included in the statistics.

Fixes #1325

Change-Id: I13effaa185b1290e41bdd642c607ff75ab8db929

Fixed reading history_t from checkpoint

The use of the wrong reading function prevented reading any checkpoint
file with distance restraints, (and probably any with orientation
restraints, too) because the stored lengths of the arrays could not be
read.

Additionally, a t_state is allocated on the stack here, while most of
the GROMACS code base assumes structures have been allocated on the
heap (and set to zero) by snew(), which makes the problem worse. Noted
that this is evil and must go away some time.

Fixes #1174

Change-Id: Ic8240f80c17272a1499421233689ed4b2c640ba3

Fixed g_tune_pme assumption that MPI is available

Refactored function with two distinct parts into two functions. This
makes it easy to call the part that checks that mdrun works only when
that check is necessary. Now g_tune_pme -np 512 -nobench works on
machines like BlueGene/Q where you might only be able to get the MPI
environment via the queuing system. g_tune_pme -nobench should work as
a stand-alone.

Fixes #1319

Change-Id: I7237800a1c67664c9253e5422a7b3f12f4ebd62f

made free energy PME kernel 40% faster

Also removed double assignments and unused variables.

Change-Id: Ia8202ee90f70da86474cc946707f016d8ad69286

Fixing manual for sc-coul description.

fixes #1331

Change-Id: I84a388bf8a9e289b37ff838a0d335d96d4393fb2

mdrun without OpenMP with thread-MPI now uses all cores

With the Verlet scheme, mdrun compiled without OpenMP would
often run on a single MPI-thread only.
Fixes #1317

Change-Id: I8fc43fe933ba23047f0ee9368ad9105cfc62eb4a

corrected grompp constraint/DOF warning with vsites

grompp now counts constraints after removing the ones with vsites.
Fixes #1322

Change-Id: I1f2b129fa5f4c5f56fea384f2749af833c92eabc

Set upper case of FFTW variable at the right place.

Fixes #1327

Change-Id: Ie9ef1fffceefef7d46f1e7a5a8ca3ceb22a81854

Split up Verlet SIMD kernels for faster compilation

Each kernel is now defined in a separate compilation unit, and the
files for the kernel functions, the kernel dispatcher function, and
their definitions are all generated by Python script. See its comments
for details. The generation is done by developers and the results
stored in the repo, to avoid CMake-time dependencies. The source for
the file generation is separated from the results of the generation.

On a 48-core AMD machine using GCC 4.7 for an SSE2 build (i.e. 4xN
Verlet kernels), the time for "make mdrun -j" went from 73 s to 34 s.

On a more recent 8-core Intel machine using GCC 4.6.4 for AVX_256
build (both 2xNN and 4xN), the time for "make mdrun -j" went from
81 s to 63 s.

On 16 cores of a POWER7 BlueGene/Q front-end node, time to compile
release-4-6 stays about the same (around 160s).

Also unwound some include file dependencies and removed repeated
definitions of some supporting structures and functions.

Change-Id: I0da1faf351defbe68d5ca43febcabddd93e21f0d

Fix detection and suggestion in gmx_cpuid for IBM QPX

Also made some variable names more descriptive.

Change-Id: I0eaac7ff6ce5cb0da82d9c54d10c555850a3dad1

introduced nbnxn data structure for bounding boxes

The nbnxn search code now uses a well organized bounding box data
structure instead of a complicated indexed plain float array
for the cluster bounding boxes.

Change-Id: Ia5adf2d33d495cff3178ca950e8c16aefcfef1fb

Fixes for Bluegene/Q

* Moved suffixing code to after where gmxManageBlueGene sets
  GMX_MPI. Since nothing depends on the suffixing until the calls to
  add_subdirectory(), this is safe. There is a complication from the
  way that GMX_MPI means "real MPI" for most of CMake and "real or
  thread MPI" in the source code; that is managed by GMX_THREAD_MPI=ON
  causing GMX_MPI=ON at a certain point. To cope with this, we add the
  "_mpi" suffix in response to GMX_LIB_MPI=on, which is how things
  should have been done back when GMX_MPI was co-opted, except that
  the suffixing was probably being done too early... Altogether, this
  change means that real-MPI binaries are always built with the _mpi
  suffix, like users have come to expect.
* Added compiler suppression for a useless information message that
  is otherwise issued for almost every source file
* Made the C and CXX platform files identical; I'd originally
  thought they could be combined on the command line, but
  that is not true

Change-Id: I26cb782a1b47cf48f47b80ec8c8d0a53db338872

Consolidated NBNxN SIMD kernel utility routines

Hardware-, precision-, and j-width-specific routines used only by the
NBNxN kernels are now all defined in
nbnxn_kernel_simd_utils*.h. Hardware-specific details are contained in
files specific to that hardware.

A major feature contained in this patch is a refactored treatment of
NBNxN particle-particle exclusions. This hides the
x86-implementation-specific details of using integer- or real-valued
SIMD registers and operations. Both inner and outer NBNXN loops are
now more independent of hardware.

* Moved SIMD types, constants and functions for exclusions to the
  NBNxN kernel module, because that is the only place where they are
  used.
* Introduced gmx_exclfilter type to hide the implementation detail of
  whether the masking is handled in integer- or real-valued SIMD
  registers
* Consolidated gmx_checkbitmask* likewise, and renamed to reflect that
  it returns a gmx_mm_pb
* Eliminated the need for gmx_castsi_pr, gmx_set1_epi32, gmx_load_si
  by recasting the code as the composite operation of
  gmx_load1?_exclfilter
* Through the above, eliminated the need for CHECKBITMASK preprocessor
  defines and checks
* Converted code macros to static inline functions
* Converted FILTER_STRIDE and NBFP_STRIDE to a static const instead of
  macro, since neither are ever used as an array dimension in C. This
  works towards using the compiler where possible and the preprocessor
  only where necessary.
* Introduced functions for exclusion mask loading so that there will
  be a link seam for testing with in master branch

TODO: Respond to two questions addressed to Berk embedded in new
comments.

Change-Id: I2a74638b982bdbf5a88442b93736df0a2f0c14b0

Fix minor wallcycle output issues

Document 19-character wcn name restriction better.

Enforce 19-character wcn name restriction better. The new behaviour
correctly truncates a wcn[] that is too long for the 19-printing
character field width of the wallcycle table.

Compress two names into one field for GMX_CYCLE_ALL correctly. The
previous version was buggy because the first call to snprintf()
resulted in 0 == buf[8], which suppressed output of the second name
completely. The previous version's second call to snprintf() was
probably wrong similarly.
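
The two formatting points above can be sketched with snprintf() precision
specifiers. This is an illustration under assumptions, not the code from
the commit: the function names are hypothetical, and the 9 + 1 + 9 split
for the two-name case is one possible way to fill 19 characters.

```c
#include <stdio.h>

/* Sketch: format one counter name into exactly 19 printing characters,
 * padding short names and truncating long ones. buf needs >= 20 bytes. */
int format_wcn_name(char *buf, size_t bufsize, const char *name)
{
    return snprintf(buf, bufsize, "%-19.19s", name);
}

/* Sketch: put two names into the same 19-character field with a single
 * call, so no terminating NUL from a first call can hide the second name. */
int format_wcn_name_pair(char *buf, size_t bufsize,
                         const char *first, const char *second)
{
    return snprintf(buf, bufsize, "%-9.9s %-9.9s", first, second);
}
```

With a single format string the terminating NUL is written only once, at
the end of the field, which avoids the 0 == buf[8] situation described
above.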

Change-Id: I430ca6eb2bccfe3775e58bae4b6fc3326bcde706

Fixed GMX_DD_DUMP_GRID

The pdb format string no longer matched the parameters it was given

Change-Id: I682154a0986fdbd73ae414264982a68fed164c87

fixed parallel normal modes with PME

When doing a normal mode calculation with PME and (thread-)MPI,
the PME forces were incorrect. Also fixed some EM/NM output layout.
Fixes #1308

Change-Id: Ia7862fa62e235336c546824afdcbe28f37b145c5

Reordered FFTW fatal error; avoids using unset variables

Now variables that are set in FFTW detection are only used when the
variable that reports FFTW has been found is set (i.e. there has not
been a fatal error). Not sure why we haven't seen problems with this
so far, but we do see them on Power7 with cmake 2.8.8 when FFTWF is
not available.

Change-Id: I1a3e84002dda7f29f90eb532ab14ce51dfca6e5c

Fixed reference SIMD code to work with MSVC

Also noted minor issue with rounding to even. Using rint would be
preferable, but that's even harder to work around on MSVC.

Change-Id: I495cf2d43cf0b33cc0896958f15703ba8632c7e4

Fixing problem caused by overflow in expanded ensemble.

Basic problem: the logic was failing in single precision because of round-off errors.
Solution: convert some of the intermediate arrays to double.

Fixes redmine #1314

Change-Id: Id7e3771d257bbeebeed2f340593b817015a0cd4c
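
The general effect behind the commit above can be sketched as follows. This
is not the expanded-ensemble code; it is a hypothetical pair of helpers
showing how an intermediate held in float can silently drop small
contributions that the same intermediate held in double retains.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulates in single precision: once the running sum is large enough,
 * adding a small term can round back to the old value. */
static float sum_in_float(const float *x, size_t n)
{
    assert(x != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
    {
        acc += x[i];
    }
    return acc;
}

/* Same input data, but the intermediate is kept in double precision. */
static double sum_in_double(const float *x, size_t n)
{
    assert(x != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
    {
        acc += x[i];
    }
    return acc;
}
```

For the input {16777216.0f, 1.0f, 1.0f}, the double accumulator yields
16777218, while a float accumulator on a platform that evaluates float
arithmetic in IEEE 754 single precision stays at 16777216, because
2^24 + 1 is not representable in that format.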

Refactored preparation of buffer for NBNxN table loads

The outer and inner kernels try to be hardware-agnostic, but some
architectures need a thread-local buffer for transferring computed
table indices to integer registers via memory. It is better to set up
this buffer using a helper function that is implemented in
hardware-specific files, because the compiler can help prevent some
bugs. In particular, you can no longer forget to make a function
definition on non-SSE2. This eliminates one of the special-case kernel

Change-Id: I1ebdd15afaf03ff4559365867970b88e9ce0ed5e

Properly finalize MPI on mdrun -version. Fixes #1313

Change-Id: I1ae5f342da96980df322770d134555e2dc9fe712

cmake: Improve some error messages

Lessons learned from the Q-bio summer school

Change-Id: I521a08ba3a83566582137740a0d097c0637d7d3e

Updated dlist.c to recognize more atom names.

Previously, g_chi basically only worked correctly for OPLS-AA and
Gromos96 force fields, based on C-terminal atom names. This commit
adds atom names for AMBER and CHARMM force fields so that g_chi
calculates dihedral properties for all residues.

Change-Id: I48517fb55bd46e7d49941f7902f4f6531e443e62

Find mkl.h on more icc versions

Refs #1110

Change-Id: I0b6bc2497fc2a504b6c29b0697ee9e354fe6cffd

Update management of linear algebra libraries

Management of detection and/or linking to BLAS and LAPACK libraries is
re-organized. The code has migrated to its own module. This will
help future extension and maintenance. This version communicates
things that are newsworthy and stays out of the way when nothing
is changing.

We no longer over-write the values specified by the user for
GMX_EXTERNAL_(BLAS|LAPACK). Previously, this was used to signal
whether detection succeeded, but that does not really get the job
done. Instead, the user is notified that detection failed (repeatedly,
if they deliberately set such an option on).

Correct usage and expected behaviour in all cases is documented both
in the code and the install guide.

The user interface is pretty much unchanged. We still don't offer full
configurability (e.g. MKL for FFTs must use MKL for linear algebra
unless GMX_*_USER is used, and the only way to get MKL for linear
algebra is to use it for FFTs). The size of any performance difference
is probably very small, and if the user really needs mdrun with
certain FFT and tools with certain linear algebra library, they can do
two configurations. Note that mdrun never calls any linear algebra
routines (tested empirically)!

Expanded the solution of #771 by testing that the user-supplied
libraries actually work. If not, we emit a warning and try to use
them anyway.

We also now check that MKL really does provide linear algebra
routines, and fall back to the default treatment if it does not.

Refs #771,#1186

Change-Id: Ife5c59694e29a3ce73fc55975e26f6c083317d9b

Fixes #1312 uninitialized error in g_enemat.

Change-Id: Ia0ac6d095dd560f08576aad1b435c92b6de52b3b

implemented plain-C SIMD macros for reference

This is mainly code reorganization.
Adds reference plain-C, slow, arbitrary width SIMD for testing.
Adds FMA for gmx_calc_rsq_pr.
Adds generic SIMD acceleration (also AVX or double) for pme solve.
Moved SIMD vector operations to gmx_simd_vec.h
The math functions invsqrt, inv, pmecorrF and pmecorrV have been
copied from the x86 specific single/double files to generic files
using the SIMD macros from gmx_simd_macros.h.
Moved all architecture specific nbnxn_kernel_simd_utils code to
separate files for each SIMD architecture and replaced all macros
by inline functions.
The SIMD reference nbnxn 2xnn kernels now support 16-wide SIMD.
Adds FMA in nbnxn kernels for calc_rsq and Coulomb forces.

Refs #1173

Change-Id: Ieda78cc3bcb499e8c17ef8ef539c49cbc2d6d74d
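
The idea of a plain-C, arbitrary-width reference SIMD layer, as described in
the commit above, can be sketched roughly as below. The type and function
names (ref_simd_real, ref_set1, ref_madd, ref_calc_rsq) and the width of 8
are hypothetical and do not reproduce the contents of gmx_simd_macros.h;
each "SIMD" operation is simply a loop over the lanes.

```c
#define REF_SIMD_WIDTH 8 /* hypothetical; any positive width works */

/* One "register": an array of lanes wrapped in a struct so it can be
 * passed and returned by value. */
typedef struct
{
    float r[REF_SIMD_WIDTH];
} ref_simd_real;

/* Broadcasts one scalar to all lanes. */
static inline ref_simd_real ref_set1(float v)
{
    ref_simd_real a;
    for (int i = 0; i < REF_SIMD_WIDTH; i++)
    {
        a.r[i] = v;
    }
    return a;
}

/* Lane-wise product. */
static inline ref_simd_real ref_mul(ref_simd_real a, ref_simd_real b)
{
    ref_simd_real c;
    for (int i = 0; i < REF_SIMD_WIDTH; i++)
    {
        c.r[i] = a.r[i] * b.r[i];
    }
    return c;
}

/* Lane-wise a*b + c, written as a separate multiply and add; a hardware
 * backend could map this onto a fused multiply-add instruction. */
static inline ref_simd_real ref_madd(ref_simd_real a, ref_simd_real b,
                                     ref_simd_real c)
{
    ref_simd_real d;
    for (int i = 0; i < REF_SIMD_WIDTH; i++)
    {
        d.r[i] = a.r[i] * b.r[i] + c.r[i];
    }
    return d;
}

/* Squared distance per lane: dx*dx + dy*dy + dz*dz. */
static inline ref_simd_real ref_calc_rsq(ref_simd_real dx, ref_simd_real dy,
                                         ref_simd_real dz)
{
    return ref_madd(dx, dx, ref_madd(dy, dy, ref_mul(dz, dz)));
}
```

Because the width is a compile-time constant, the same kernel source can be
exercised at widths that no available hardware provides, which is what makes
such a layer useful for testing.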

Removed buggy -seppot output

ns() no longer computes forces, so the comment needs correcting.

dvdlambda was changed to an array at some point, but the output code
did not change in sync. So, that output has been broken in at least
4.6, and is always zero anyway. So the output code can go away.

A patch in master branch removes the useless parameters in the
neighbour-search code. When merging this patch into master branch,
eliminate the useless dvdlambda variable.

Fixes #1294

Change-Id: Iba48f549ee07f0ee1877035a9e506780439c8002

Fixed pdb2gmx -vsite hydrogen -o conf.pdb

Generation of v-sites did not initialise the elem field of the t_atom
structs. Later when writing the output to PDB format, that elem is
used for the final column. Somehow, the memory in elem was
uninitialized at that point, despite the use of srenew. Oh for proper
constructors and copies!

I observed the problem with v-sites generated from a POPC lipid, but
presumably the underlying cause is common to all kinds of v-sites, so
I have filled the elem field for two other kinds of v-site
generation. I have no idea why not all of the aromatic v-site
generation schemes involve filling a new t_atom struct.

Change-Id: I69efee07d8fd51192808a85d5704d488a0d310be
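
The class of bug described in the commit above can be sketched with
standard realloc(), which leaves newly added elements uninitialised. Whether
srenew behaves the same way is an assumption here, and example_atom and
grow_atoms are hypothetical names, not GROMACS types or functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an atom record; only elem matters here. */
typedef struct
{
    float mass;
    char  elem[4];
} example_atom;

/* Grows *atoms from old_n to new_n entries and zero-fills the new tail,
 * so that fields such as elem never hold indeterminate bytes.
 * Returns 0 on success, or -1 if allocation fails, in which case the
 * old array is left untouched and still valid. */
static int grow_atoms(example_atom **atoms, size_t old_n, size_t new_n)
{
    assert(atoms != NULL && new_n >= old_n && new_n > 0);
    example_atom *p = realloc(*atoms, new_n * sizeof *p);
    if (p == NULL)
    {
        return -1;
    }
    memset(p + old_n, 0, (new_n - old_n) * sizeof *p);
    *atoms = p;
    return 0;
}
```

With the tail zeroed, a writer that later prints elem sees an empty string
for atoms whose element was never set, rather than arbitrary memory.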

Fixed possible dereference of null pointer

getpwuid() can return null. For example, Shun Sakuruba reported that
Cray XE6 compute nodes have passwd.h, but users there will not have
passwd entries.

Fixes #1301

Change-Id: I66e8064438fc02591629b0d381bc00afd06795c0
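
A minimal sketch of the null check described in the commit above, using the
POSIX getpwuid() and getuid() calls. The helper name user_name_or and the
idea of returning a caller-supplied fallback are assumptions for
illustration, not the actual fix.

```c
#include <sys/types.h>
#include <pwd.h>
#include <unistd.h>

/* Returns the login name of the current user, or fallback when the system
 * has no passwd entry for this uid. The returned name may point to static
 * storage that later getpw*() calls overwrite. */
static const char *user_name_or(const char *fallback)
{
    const struct passwd *pw = getpwuid(getuid());
    if (pw == NULL || pw->pw_name == NULL)
    {
        return fallback;
    }
    return pw->pw_name;
}
```

A caller would write user_name_or("unknown") and can then print the result
without risking a null dereference on hosts that lack passwd entries.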