Class CSVSampler

java.lang.Object
io.nosqlbench.virtdata.library.basics.shared.distributions.CSVSampler
All Implemented Interfaces:
LongFunction<String>
Direct Known Subclasses:
Cities, CitiesByDensity, CitiesByPopulation, Counties, CountiesByDensity, CountiesByPopulation, CountryCodes, CountryNames, StateCodes, StateCodesByDensity, StateCodesByPopulation, StateNames, StateNamesByDensity, StateNamesByPopulation, TimeZones, TimeZonesByDensity, TimeZonesByPopulation, ZipCodes, ZipCodesByDensity, ZipCodesByPopulation

public class CSVSampler extends Object implements LongFunction<String>
This function is a toolkit version of the WeightedStringsFromCSV function. It is more capable and should be the preferred function for alias sampling over any CSV data. This sampler uses a named column in the CSV data as the value. This is also referred to as the labelColumn. The frequency of this label depends on the weight assigned to it in another named CSV column, known as the weightColumn.

Combining duplicate labels

When your CSV data is not organized around the specific identifier that you want to sample by, you can use a combining function to tabulate duplicate labels prior to sampling. Any of "sum", "avg", "count", "min", "max", or "name" may be given as the reducing function applied to values in the weight column; if none is specified, "sum" is used by default. All modes except "count" and "name" require a valid weight column to be specified. These functions reduce duplicate labels in the selected label column, and they only take effect when more than one row has the same value in that column; when all values in that column are distinct, the row-by-row order of appearance is preserved. The available reducing functions are listed below, followed by a short construction sketch.
  • sum, avg, min, max - takes the given stat for the weight of each distinct label
  • count - takes the number of occurrences of a given label as the weight
  • name - sets the weight of all distinct labels to 1.0d
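
As a hedged illustration of these reducing functions, suppose a hypothetical counties.csv holds one row per county with "state" and "population" columns; the file and column names are assumed here for demonstration only.

    // Duplicate "state" labels are combined by summing their "population" weights (the default "sum" mode).
    CSVSampler statesByPopulation = new CSVSampler("state", "population", "counties.csv");

    // The same data reduced by occurrence count: each state's weight becomes the number of rows
    // (counties) it appears in, so the weight column values themselves are not used.
    CSVSampler statesByCountyCount = new CSVSampler("state", "population", "count", "counties.csv");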

Map vs Hash mode

As with some of the other statistical functions, you can use map mode to pick through the sample values in order; this is distinct from the default hash mode. When map mode is used, the values appear monotonically as you scan through the unit interval of all long inputs: 0L represents 0.0 on the unit interval, and Long.MAX_VALUE represents 1.0. This mode is only recommended for advanced scenarios and should otherwise be avoided; you will know if you need it. For alias sampling, the values may not always occur in the order specified, due to how the alias table is constructed, but they will be clustered in the order they appear in that table.
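
The sketch below shows how map mode would be selected by passing the "map" token; the file and column names are again illustrative.

    // "map" selects map mode; any argument that is not a recognized mode token is treated as a CSV filename.
    CSVSampler mapped = new CSVSampler("state", "population", "map", "states.csv");

    // In map mode the long input is read as a position on the unit interval,
    // so scanning from 0L to Long.MAX_VALUE walks the values monotonically.
    String first = mapped.apply(0L);
    String last  = mapped.apply(Long.MAX_VALUE);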
  • Constructor Details

    • CSVSampler

      public CSVSampler(String labelColumn, String weightColumn, String... data)
      Build an efficient O(1) sampler for the given column values with respect to the weights, combining equal values by summing the weights.
      Parameters:
      labelColumn - The CSV column name containing the value
      weightColumn - The CSV column name containing a double weight
      data - Sampling modes or file names. Any of map, hash, sum, avg, count are taken as configuration modes, and all others are taken as CSV filenames.
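      As a sketch of how the varargs are interpreted, the call below mixes configuration tokens with filenames; the filenames and column names are hypothetical.

        // "avg" selects the reducing function and "hash" the (default) sampling mode; the remaining
        // arguments are read as CSV files, each expected to contain "label" and "weight" columns.
        CSVSampler sampler = new CSVSampler("label", "weight", "avg", "hash", "part1.csv", "part2.csv");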
  • Method Details