Class CSVSampler

java.lang.Object
io.nosqlbench.virtdata.library.basics.shared.distributions.CSVSampler
All Implemented Interfaces:
LongFunction<String>
Direct Known Subclasses:
Cities, CitiesByDensity, CitiesByPopulation, Counties, CountiesByDensity, CountiesByPopulation, CountryCodes, CountryNames, StateCodes, StateCodesByDensity, StateCodesByPopulation, StateNames, StateNamesByDensity, StateNamesByPopulation, TimeZones, TimeZonesByDensity, TimeZonesByPopulation, ZipCodes, ZipCodesByDensity, ZipCodesByPopulation

public class CSVSampler extends Object implements LongFunction<String>
This function is a toolkit version of the WeightedStringsFromCSV function. It is more capable and should be the preferred function for alias sampling over any CSV data. This sampler uses a named column in the CSV data as the value. This is also referred to as the labelColumn. The frequency of this label depends on the weight assigned to it in another named CSV column, known as the weightColumn.

Combining duplicate labels

When your CSV data is not organized around the specific identifier that you want to sample by, you can use a combining function to tabulate duplicate labels prior to sampling. Any of "sum", "avg", "count", "min", "max", or "name" may be given as the reducing function applied to values in the weight column; if none is specified, "sum" is used by default. All modes except "count" and "name" require a valid weight column to be specified. These functions reduce duplicate labels in the selected label column, and they only take effect when more than one row has the same value in that column; when all values in that column are distinct, the row-by-row order of appearance is preserved. The available reducing functions are listed below, followed by a short construction sketch.
  • sum, avg, min, max - takes the given stat for the weight of each distinct label
  • count - takes the number of occurrences of a given label as the weight
  • name - sets the weight of all distinct labels to 1.0d
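
As a hedged illustration of these reducing functions, suppose a hypothetical counties.csv holds one row per county with "state" and "population" columns; the file and column names are assumed here for demonstration only.

    // Duplicate "state" labels are combined by summing their "population" weights (the default "sum" mode).
    CSVSampler statesByPopulation = new CSVSampler("state", "population", "counties.csv");

    // The same data reduced by occurrence count: each state's weight becomes the number of rows
    // (counties) it appears in, so the weight column values themselves are not used.
    CSVSampler statesByCountyCount = new CSVSampler("state", "population", "count", "counties.csv");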

Map vs Hash mode

As with some of the other statistical functions, you can use map mode to pick through the sample values in order; this is distinct from the default hash mode. When map mode is used, the values appear monotonically as you scan through the unit interval of all long inputs: 0L represents 0.0 on the unit interval, and Long.MAX_VALUE represents 1.0. This mode is only recommended for advanced scenarios and should otherwise be avoided; you will know if you need it. For alias sampling, the values may not always occur in the order specified, due to how the alias table is constructed, but they will be clustered in the order they appear in that table.
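
The sketch below shows how map mode would be selected by passing the "map" token; the file and column names are again illustrative.

    // "map" selects map mode; any argument that is not a recognized mode token is treated as a CSV filename.
    CSVSampler mapped = new CSVSampler("state", "population", "map", "states.csv");

    // In map mode the long input is read as a position on the unit interval,
    // so scanning from 0L to Long.MAX_VALUE walks the values monotonically.
    String first = mapped.apply(0L);
    String last  = mapped.apply(Long.MAX_VALUE);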
  • Constructor Details

    • CSVSampler

      public CSVSampler(String labelColumn, String weightColumn, String... data)
      Build an efficient O(1) sampler for the given column values with respect to the weights, combining equal values by summing the weights.
      Parameters:
      labelColumn - The CSV column name containing the value
      weightColumn - The CSV column name containing a double weight
      data - Sampling modes or file names. Any of map, hash, sum, avg, count are taken as configuration modes, and all others are taken as CSV filenames.
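      As a sketch of how the varargs are interpreted, the call below mixes configuration tokens with filenames; the filenames and column names are hypothetical.

        // "avg" selects the reducing function and "hash" the (default) sampling mode; the remaining
        // arguments are read as CSV files, each expected to contain "label" and "weight" columns.
        CSVSampler sampler = new CSVSampler("label", "weight", "avg", "hash", "part1.csv", "part2.csv");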
  • Method Details