docs/doxygen/analysisdata.md

Analysis output data handling {#page_analysisdata}
=============================

The \ref module_analysisdata module provides support for common data analysis
tasks within the \ref page_analysisframework.  The basic approach used in the
module is visualized below:

\dot
  digraph analysisdata_overview {
    rankdir = BT
    dataobject [label="data object\n(subclass of gmx::AbstractAnalysisData)"]
    datamodule1 [label="data module\n(implements gmx::AnalysisDataModuleInterface)"]
    datamodule2 [label="data module\nthat also provides data"]
    datamodule3 [label="data module"]
    datamodule1 -> dataobject
    datamodule2 -> dataobject
    datamodule3 -> datamodule2
  }
\enddot

Typically, an analysis tool provides its raw data output through one or more
gmx::AnalysisData objects (the root _data object_ in the diagram above).
Such an object provides only storage for the data.

To perform operations on the data, one or more _data modules_ can be attached
to the data object.  Examples of such operations are averaging, histogramming,
and plotting the data into a file.  Some data modules are provided by the \ref
module_analysisdata module.  To implement new ones, it is necessary to create a
class that implements gmx::AnalysisDataModuleInterface.

In many cases, such data modules also provide data that can be processed
further, acting as data objects themselves.  This makes it possible to attach
further data modules to form a processing chain.  In simple cases, such a chain
ends in a module that writes the data into a file, but it is also possible to
access the data in a data object (whether a plain data object or a data module)
programmatically to do further computation or post-processing outside the
framework.  For this to work, the data object typically needs to be told in
advance, so that it knows to store the data permanently even if attached
modules do not require it.

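The chain described above can be illustrated with a minimal, self-contained
sketch.  The classes below (`Module`, `DataObject`, `RawData`, `FrameAverage`)
are hypothetical stand-ins written for this page, not the GROMACS classes or
their API.  They only show how a module that also provides data allows further
modules to be attached to it, and why storage has to be requested before any
frames arrive.

```cpp
#include <cassert>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical interface for anything that consumes finished frames.
struct Module
{
    virtual ~Module() = default;
    virtual void frameFinished(int frameIndex, const std::vector<double>& values) = 0;
};

// Hypothetical data object: forwards each frame to the attached modules and
// keeps a copy only if storage was requested in advance.
class DataObject
{
public:
    void addModule(std::shared_ptr<Module> module) { modules_.push_back(std::move(module)); }
    void requestStorage() { storeFrames_ = true; }
    const std::vector<std::vector<double>>& storedFrames() const { return frames_; }

protected:
    void notifyFrame(int frameIndex, const std::vector<double>& values)
    {
        if (storeFrames_)
        {
            frames_.push_back(values);
        }
        for (const auto& module : modules_)
        {
            module->frameFinished(frameIndex, values);
        }
    }

private:
    std::vector<std::shared_ptr<Module>> modules_;
    std::vector<std::vector<double>>     frames_;
    bool                                 storeFrames_ = false;
};

// Root data object: the analysis tool pushes raw frames into it.
class RawData : public DataObject
{
public:
    void addFrame(const std::vector<double>& values) { notifyFrame(nextFrame_++, values); }

private:
    int nextFrame_ = 0;
};

// A data module that also provides data: one average value per input frame.
class FrameAverage : public Module, public DataObject
{
public:
    void frameFinished(int frameIndex, const std::vector<double>& values) override
    {
        assert(!values.empty());
        const double sum = std::accumulate(values.begin(), values.end(), 0.0);
        notifyFrame(frameIndex, { sum / values.size() });
    }
};
```
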
The modules can do their processing online, i.e., as the data is produced.
If all the attached modules support this, it is not necessary to store all the
raw data in memory.  The module design also supports processing frames in
parallel: in such cases, the data may become available out of order.  Writing
the per-frame data into a file, as well as some other types of
post-processing, requires the data to be reordered sequentially.  This
reordering is implemented once in the framework, so analysis tools do not need
to handle it themselves beyond using the provided API.


Structure of data
=================

At the highest level, data can be structured into separate
gmx::AbstractAnalysisData objects that operate independently.  Each such object
has an independent set of post-processing modules.

Within a gmx::AbstractAnalysisData object, data is structured along three
"dimensions":

 - _frames_: There are one or more frames in each data object.  For raw data
   produced by an analysis tool, these typically correspond to input trajectory
   frames.  For other data sets, the frames can be viewed as points on the X
   axis of a graph.
 - _data sets_: There are one or more data sets in each data object.  For most
   purposes, data sets work independently (i.e., the post-processing modules
   operate on each data set separately), but some modules reduce the data sets
   into single columns in the output.  The main purpose of using multiple data
   sets is to share the same post-processing chain for multiple sets of data
   (e.g., multiple RDFs computed by the same tool in one pass), in particular
   for cases where the number of data sets is not known at compile time.
   Note that each data set contains the same number of frames.
 - _columns_: There are one or more columns in each data set.  Different data
   sets can contain a different number of columns.  Each column in a frame can
   contain a single value (see below for supported values).

Programmatically, the data within each frame is organized into _point sets_.
Each point set consists of a contiguous range of columns from a single data
set.  There are two types of data:

 - _simple_: For each frame, there is exactly one point set for each data set,
   and that point set spans all columns in that data set.
 - _multipoint_: For each frame, there can be any number of point sets, and
   they may span arbitrary columns.  Point sets are allowed to overlap,
   i.e., multiple point sets may specify a value for the same column.

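The difference between the two layouts can be made concrete with a small
sketch.  The `PointSet` struct and the `isSimpleFrame()` helper below are
hypothetical and exist only for illustration; they are not part of the
module.  The helper checks whether the point sets of one frame satisfy the
_simple_ rule (exactly one point set per data set, spanning all of its
columns); any frame that fails the check would have to be treated as
_multipoint_ data.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical representation of one point set within a frame.
struct PointSet
{
    int                 dataSet;     // index of the data set the points belong to
    int                 firstColumn; // first column covered by this point set
    std::vector<double> values;      // values for contiguous columns from firstColumn
};

// Returns true if the frame has exactly one point set per data set and each
// point set spans all columns of its data set.  columnCounts[i] is the number
// of columns in data set i.
bool isSimpleFrame(const std::vector<PointSet>& frame, const std::vector<int>& columnCounts)
{
    std::vector<int> pointSetsSeen(columnCounts.size(), 0);
    for (const PointSet& pointSet : frame)
    {
        if (pointSet.dataSet < 0
            || static_cast<std::size_t>(pointSet.dataSet) >= columnCounts.size())
        {
            return false;
        }
        const bool spansAllColumns =
                pointSet.firstColumn == 0
                && static_cast<int>(pointSet.values.size()) == columnCounts[pointSet.dataSet];
        if (!spansAllColumns || ++pointSetsSeen[pointSet.dataSet] > 1)
        {
            return false;
        }
    }
    for (int count : pointSetsSeen)
    {
        if (count != 1)
        {
            return false;
        }
    }
    return true;
}
```
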
The main purpose of multipoint data is to support cases where it is not known
in advance how many values there will be for each frame, or where that number
is impractically large.  The need to do this is mainly a matter of
performance/implementation complexity tradeoff: with a more complex internal
implementation, it would be possible to support larger data sets without the
performance/memory impact they currently impose.  The current implementation
places the burden of deciding on the appropriate usage pattern on the user
code, allowing for a much simpler internal implementation.

An individual value (identified by frame, data set, and column) consists of a
single value of type `real`, an optional error value, and some flags.
The flags identify which parts of the value are actually available.  The
following states are possible:

 - _present_: The value is set.
 - _missing_: The value is marked as missing by the data source.  In this
   state, the value can still be accessed, and the returned `real` value has
   some meaning.  Different data modules handle these cases differently.
 - _unset_: The value is not set.  The value must not be accessed other than
   to query its state.  Data modules that ignore missing values
   (by skipping all values not _present_) can also handle unset values.
   Other data modules typically do not allow unset values.


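The three states can be modeled with two flag bits, as in the sketch below.
The `Value` class, its member names, and the `averageOfPresent()` helper are
hypothetical and are not the class used by the module; `double` stands in for
`real`, and the optional error value is omitted for brevity.  The helper shows
why a module that skips everything not _present_ can also tolerate _unset_
values: it never reads them.

```cpp
#include <cassert>
#include <vector>

// Hypothetical value with "set" and "present" flags.
class Value
{
public:
    Value() = default; // default-constructed values are unset

    static Value makePresent(double v) { return Value(v, c_set | c_present); }
    static Value makeMissing(double v) { return Value(v, c_set); }

    bool isSet() const { return (flags_ & c_set) != 0; }
    bool isPresent() const { return (flags_ & c_present) != 0; }

    // Reading an unset value is a programming error.
    double value() const
    {
        assert(isSet());
        return value_;
    }

private:
    enum : unsigned
    {
        c_set     = 1u,
        c_present = 2u
    };

    Value(double v, unsigned flags) : value_(v), flags_(flags) {}

    double   value_ = 0.0;
    unsigned flags_ = 0u;
};

struct Average
{
    double mean;  // 0.0 if no value was present
    int    count; // number of present values that were averaged
};

// Averages only the present values; missing and unset values are skipped.
Average averageOfPresent(const std::vector<Value>& column)
{
    double sum   = 0.0;
    int    count = 0;
    for (const Value& v : column)
    {
        if (v.isPresent())
        {
            sum += v.value();
            ++count;
        }
    }
    return Average{ count > 0 ? sum / count : 0.0, count };
}
```
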
Data provider classes
=====================

The base class for all data objects (including data modules that provide data)
is gmx::AbstractAnalysisData.  This class provides facilities for attaching
data modules to the data and for querying the data.  It does not provide any
methods to alter the data; all logic for managing the actual data is in derived
classes.

The main root (non-module) data object class for use in analysis tools is
gmx::AnalysisData.  This class provides methods to set properties of the data,
and to add frames to it.  The interface is frame-based: you construct one frame
at a time, and after it is finished, you move to the next frame.  The frames
are not constructed directly using gmx::AnalysisData, but instead separate
_data handles_ are used.  This is explained in more detail below under
\ref section_analysisdata_parallelization.

For simple needs and small amounts of data, gmx::AnalysisArrayData is also
provided.  This class allows for all the data to be prepared in memory as a
single big array, and allows random access to the data while setting the
values.  When all the values have been set to their final values, it notifies
the attached data modules by looping over the array.


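The array-based approach can be sketched as follows.  The `ArrayData` class
below is a hypothetical illustration, not gmx::AnalysisArrayData; its method
names and the use of plain callbacks as "modules" are assumptions made to keep
the example self-contained.  Values can be written in any order, and the
attached modules see the rows one at a time only after the array has been
declared final.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical in-memory array that notifies modules only when complete.
class ArrayData
{
public:
    // A "module" here is just a callback that receives one finished row.
    using RowCallback = std::function<void(int row, const std::vector<double>& values)>;

    ArrayData(int rows, int columns) :
        rows_(rows),
        columns_(columns),
        values_(static_cast<std::size_t>(rows) * columns, 0.0)
    {
    }

    void addModule(RowCallback module) { modules_.push_back(std::move(module)); }

    // Random access while the array is still being filled.
    double& value(int row, int column)
    {
        assert(!done_ && row >= 0 && row < rows_ && column >= 0 && column < columns_);
        return values_[static_cast<std::size_t>(row) * columns_ + column];
    }

    // Call once when all values are final: replays the array row by row.
    void valuesReady()
    {
        assert(!done_);
        done_ = true;
        for (int row = 0; row < rows_; ++row)
        {
            const auto begin = values_.begin() + static_cast<std::ptrdiff_t>(row) * columns_;
            const std::vector<double> rowValues(begin, begin + columns_);
            for (const RowCallback& module : modules_)
            {
                module(row, rowValues);
            }
        }
    }

private:
    int                      rows_;
    int                      columns_;
    std::vector<double>      values_;
    std::vector<RowCallback> modules_;
    bool                     done_ = false;
};
```
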
Parallelization {#section_analysisdata_parallelization}
===============

One major driver for the design of the analysis data module has been to provide
support for transparently processing multiple frames in parallel.  In such
cases, output data for multiple frames may be constructed simultaneously, and
must be ordered correctly for some data modules, such as those that write the
data into a file.  This ordering is taken care of by the framework, allowing
the analysis tool writer to concentrate on the actual analysis task.

From a user's point of view, the main player in this respect is the
gmx::AnalysisData object.  If there are two threads doing the processing in
parallel, it allows creating a separate gmx::AnalysisDataHandle for each
thread.  Each of these handles can be used independently to construct frames
into the output data, and the gmx::AnalysisData object internally takes care of
notifying the modules correctly.  If necessary, it stores finished frames into
a temporary buffer until all preceding frames have also been finished.

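The buffering described above can be sketched with a small reordering helper.
The `FrameSequencer` class below is hypothetical and is not how
gmx::AnalysisData is implemented; in particular, it is single-threaded and
omits the locking that a real parallel implementation would need.  It only
illustrates the idea of holding back finished frames until every preceding
frame has also been finished.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper that delivers frames to a consumer in frame order,
// even if they are finished out of order.
class FrameSequencer
{
public:
    using Callback = std::function<void(int index, const std::vector<double>& values)>;

    explicit FrameSequencer(Callback callback) : callback_(std::move(callback)) {}

    // Called when a frame is complete; frames may arrive in any order.
    void frameFinished(int index, std::vector<double> values)
    {
        pending_.emplace(index, std::move(values));
        // Deliver every buffered frame whose predecessors have all been delivered.
        auto it = pending_.begin();
        while (it != pending_.end() && it->first == nextIndex_)
        {
            callback_(it->first, it->second);
            it = pending_.erase(it);
            ++nextIndex_;
        }
    }

    // Number of finished frames still waiting for an earlier frame.
    std::size_t bufferedFrames() const { return pending_.size(); }

private:
    Callback                           callback_;
    std::map<int, std::vector<double>> pending_;
    int                                nextIndex_ = 0;
};
```
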
For increased efficiency, some data modules are also parallelization-aware:
they have the ability to process the data in any order, allowing
gmx::AnalysisData to notify them as soon as a frame becomes available.
If there are only parallel data modules attached, no frame reordering or
temporary buffers are needed.  If a non-parallel data module is attached to a
parallel data module, then that parallel data module takes the responsibility
of ordering its output frames.  Ideally, such data modules produce
significantly less data than what they take in, making it cheaper to do the
ordering only at this point.

Currently, no parallel runner has been implemented, but it is likely that
applicable tools written to use the framework will require minimal or no
changes to take advantage of frame-level parallelism once such a runner
materializes.


Provided data processing modules
================================

Data modules provided by the \ref module_analysisdata module are listed below
with a short description.  See the documentation of the individual classes for
more details.
Note that this list is manually maintained, so it may not always be up to date.
A comprehensive list can be found by looking at the inheritance graph of
gmx::AnalysisDataModuleInterface, but the list here is more user-friendly.

<dl>
<dt>gmx::AnalysisDataAverageModule</dt>
<dd>
Computes averages and standard deviations for columns in input data.
One output value for each input column.
</dd>
<dt>gmx::AnalysisDataFrameAverageModule</dt>
<dd>
Computes averages for each frame in input data.
One output value for each input data set for each frame.
</dd>
<dt>gmx::AnalysisDataBinAverageModule</dt>
<dd>
Computes averages within bins.  Input is pairs of values, where the first
value defines the bin, and the second value sets the value to accumulate into
the average within the bin.
One output histogram for each input data set.
</dd>
<dt>gmx::AnalysisDataSimpleHistogramModule</dt>
<dd>
Computes histograms.  All values within a data set are added into a histogram.
One output histogram for each input data set.
Provides the histogram for each input frame separately, and also the full
histogram over frames (through an internal submodule).
</dd>
<dt>gmx::AnalysisDataWeightedHistogramModule</dt>
<dd>
Computes histograms.  Input is pairs of values, where the first value defines
the bin, and the second value sets the value to add into that bin.
Output is the same as for gmx::AnalysisDataSimpleHistogramModule.
</dd>
<dt>gmx::AnalysisDataLifetimeModule</dt>
<dd>
Computes lifetime histograms.  For each input column, determines the time
intervals during which a value is continuously present/non-zero, and creates a
histogram from the lengths of these intervals.
One output histogram for each input data set.
</dd>
<dt>gmx::AnalysisDataPlotModule</dt>
<dt>gmx::AnalysisDataVectorPlotModule</dt>
<dd>
Writes data into a file.
</dd>
</dl>
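
As an illustration of the kind of computation these modules perform, the
sketch below builds a lifetime histogram for a single column.  The
`lifetimeHistogram()` function is hypothetical and is not taken from
gmx::AnalysisDataLifetimeModule; it assumes that interval lengths are counted
in frames and that each maximal run of non-zero values is counted exactly
once.  How the actual module bins, accumulates, or normalizes the intervals is
described in its class documentation.

```cpp
#include <cstddef>
#include <vector>

// Returns a histogram in which entry n is the number of maximal runs of
// non-zero values that last exactly n frames (entry 0 is always zero).
// `column` holds the values of one column, one entry per frame.
std::vector<int> lifetimeHistogram(const std::vector<double>& column)
{
    std::vector<int> histogram(column.size() + 1, 0);
    std::size_t      runLength = 0;
    for (double value : column)
    {
        if (value != 0.0)
        {
            ++runLength; // the current interval continues
        }
        else if (runLength > 0)
        {
            ++histogram[runLength]; // the interval ended in the previous frame
            runLength = 0;
        }
    }
    if (runLength > 0)
    {
        ++histogram[runLength]; // interval still open at the last frame
    }
    return histogram;
}
```
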