TBBT/trace_play/README.old

   1 A few notes about the redesign of the file selection algorithm for
   2 SFS 3.0, which may be incorporated into the inline comments in the
   3 next maintenance release.
   4
   5
   6 Q1.  How is the Poisson Distribution used in SFS?
   7
   8 A1. The probabilities from a Poisson-derived distribution are used
   9 to select files from the I/O working-set in a non-uniform fashion.
  10
  11 The load-generators are not Poisson Processes.  In particular, their
  12 inter-request times are not exponentially-distributed; they are programmed
  13 to have uniformly distributed think-times, which are only approximately
  14 achieved, due to limitations of the load-generating client and its
  15 operating system.
  16
  17 Q2.  Why is a Poisson Distribution used in SFS, rather than, say a
  18 Zipf distribution?
  19
  20 A2.  The reasoning behind the use of Poisson probabilities in SFS 2.0
  21 may be lost in the mists of time.  Because SFS 2.0 was withdrawn by SPEC
  22 before SFS 3.0 development had proceeded very far, there was considerable
  23 time-pressure to release SFS 3.0 quickly.  So we chose to patch the
  24 existing Poisson probability calculations instead of implementing a new
  25 distribution for file accesses.
  26
  27 There is experimental data showing that a Zipf Distribution is a good
  28 model for file access probabilities.  The question of what distribution
  29 to use in SFS 4.0 is open, so various distributions can be considered.
  30
  31 Q3.  Is it true that SFS 3.0 combines multiple Poisson cycles into a
  32 single cycle that is also a Poisson Distribution?
  33
  34 A3.  The combination of the distributions follows the curve of a Poisson
  35 Distribution only if you treat each of the twelve generations of
  36 access-groups as a single outcome.
  37
  38 Q4.  Why did the calculation of "Io_working_set.access_group_cnt" change
  39 between SFS 2.0 and SFS 3.0?
  40
  41 A4.  This calculation determines the number of access-groups in the I/O
  42 working-set.  The number of access-groups grows *roughly* in linear
  43 proportion to the number of files, so that no access-group contains more
  44 than 125 files.
  45
  46 In SFS 2.0, 4 groups are added for every 500 files, but in SFS 3.0
  47 12 groups are added for every 1500 files, so the number of files per
  48 access-group remains roughly the same.
  49
  50 The main effect of the change is how the number of groups is rounded.
  51 In SFS 2.0, the number of groups was rounded to a multiple of 4,
  52 with a minimum of 8 groups.  In SFS 3.0, the number of groups is
  53 rounded to a multiple of 12 (the value of "generations") so that the
  54 cycle of Poisson probabilities (for all 12 generations of access-groups)
  55 will be completely repeated group_cnt/12 times with no groups left over.
  56
  57 Q5.  Why was the minimum number of access-groups changed from 8 to 12?
  58 Why not, say, 24?
  59
  60 A5.  The choice was somewhat arbitrary.  Here is some of the rationale:
  61
  62 In order to get a complete set of Poisson probabilities, there must be
  63 at least as many access-groups as generations.  If we had generations=24
  64 instead of 12, then some of the group-probabilities would be zero due to
  65 defect #2.  (Recall that with 24 groups in SFS 2.0, 13% were inaccessible
  66 due to rounding of the scaled probabilities.)  In SFS 2.0, 12 was the
  67 largest number of groups which did not trigger defect #2.  Of course,
  68 since we now know how to fix that defect, that wasn't the complete reason.
  69
  70 Another consideration was that the number of access groups must be rounded
  71 to a multiple of the number of generations.  If we had used generations=24
  72 then the rounding would introduce a more radical change in the number
  73 of groups, especially for large numbers of procs.
  74
  75 On the other hand, if we had used generations=8 instead of 12, we would
  76 have gotten very little variation in access probabilities, which seemed
  77 against the intent of the benchmark design.
  78
  79 Q6.  Why is lambda being calculated differently in SFS 3.0?
  80
  81 A6.  Lambda is the parameter of the Poisson Distribution which determines
  82 the shape of its probability density function, its mean, and other
  83 characteristics.
  84
  85 SFS 2.0 set lambda to half the number of groups, which in turn was a
  86 function of the requested ops/sec/proc.  That was defect #4, which caused
  87 the shape of the access-distribution to change depending on how many
  88 processes a vendor chose to use -- it would get narrower as the number
  89 of processes was reduced.  SFS 3.0 sets lambda based on "generations",
  90 which is a constant.  Thus the new distribution has the same shape no
  91 matter how many load-generating processes are used.
  92
  93 The key to defect #4 was that we *don't* want lambda to vary
  94 with the number of access-groups.  We want it to be a constant so that
  95 for a given size fileset the number of files that fit into cache
  96 depends very little on the number of load-generation processes used.
  97
  98 Q7.  Why was lambda set to half the number of groups in SFS 2.0
  99 instead of something else?
 100
 101 A7.  The SFS 2.0 rationale is not documented, as far as we know.  The nice
 102 thing about group_count/2 is that it puts the characteristic "hump" of the
 103 probability density function (the mode) near group_count/2, in the middle
 104 of the array of groups.  Probably that was the reason.
 105
 106 In SFS 3.0, each cycle of probabilities has 12 groups, so lambda=6
 107 to preserve this feature of the distribution.
 108
 109 Q8.  How is the SFS 3.0 approach differ from always using 12 access-groups
 110 and allowing the number of files per access-group to be more than 125?
 111
 112 A8.  The end-result would have been roughly the same.  The reason this
 113 was not done was that SFS sometimes searches within an access-group, for
 114 instance to find the best fit for an operation.  The search is a linear
 115 search.  We were concerned that if the number of files in a group got
 116 too large, the client could get bogged down searching for the best fit.
 117 The search for the appropriate access-group is a binary search, which
 118 scales better.
 119
 120 Q9.  Why not just limit the benchmark to 25 requested ops/second/process,
 121 so that the number of groups would always be 12 or less?
 122
 123 A9.  Historically SPEC has given server vendors lots of leeway to
 124 have as few processes as they wish, modulo the Uniform Access Rule and
 125 the run-rule requirement for at least 8 processes per network.  In the
 126 interest of avoiding a long, drawn-out debate about adding a new run-rule
 127 to arbitrarily limit ops/process, we decided to simply fix the code so
 128 that it would work reasonably well for larger numbers of ops/process.
 129
 130 Q10.  Where did the new variable "cumulative_ratio" come from?
 131
 132 A10.  This variable is used in the loop that calculates Poisson distribution
 133 for I/O working-set accesses.  In 2.0, lambda^x and x! were calculated
 134 separately.  Both variables got very large and would overflow when 'x' got
 135 into the hundreds.  This was defect #3.  The Poisson distribution really
 136 only needs the ratio (lambda^x/x!), which is numerically more stable.
 137 So that is what the SFS 3.0 code calculates.
 138
 139 Since SFS 3.0 never uses values of lambda other than 6.0, the change
 140 has proven to be moot.