API-cachecow

This module offers the top-level API to users. It currently supports four types of operations:

  • trace loading
  • trace information retrieving
  • trace profiling
  • plotting

Author: Jason Yang <peter.waynechina@gmail.com> 2017/08

class PyMimircache.top.cachecow.Cachecow(**kwargs)

cachecow class providing top level API

open(file_path, trace_type='p', data_type='c', **kwargs)

By default, this function opens a plain text trace, i.e., a file in which each line contains a single request label.

By changing trace_type, this function can open other types of traces; the supported trace types are:

trace_type  file type   requires init_params
"p"         plain text  No
"c"         csv         Yes
"b"         binary      Yes
"v"         vscsi       No

Calling open with a given trace_type has the same effect as calling the corresponding function (csv, binary, vscsi) directly.

Parameters:
  • file_path – the path to the data
  • trace_type – type of trace, “p” for plainText, “c” for csv, “v” for vscsi, “b” for binary
  • data_type – the type of request label, can be either “c” for string or “l” for number (for example block IO LBA)
  • kwargs – parameters for opening the trace
Returns:

reader object
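As a minimal usage sketch (assuming PyMimircache is installed and that "trace.txt" is a hypothetical plain text trace with one label per line), opening a trace and querying it looks like:

```python
import os

# Guarded import so the sketch stays importable without the library installed.
try:
    from PyMimircache import Cachecow
except ImportError:
    Cachecow = None  # PyMimircache not installed; sketch only

if Cachecow is not None and os.path.exists("trace.txt"):
    c = Cachecow()
    # plain text trace ("p"), string labels ("c")
    reader = c.open("trace.txt", trace_type="p", data_type="c")
    print(c.num_of_req())  # total number of requests in the trace
    c.close()
```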

csv(file_path, init_params, data_type='c', block_unit_size=0, disk_sector_size=0, **kwargs)

Open a csv trace. init_params is a dictionary specifying the format of the csv file; the possible keys are listed in the table below. Column/field numbers begin at 1, so the first column (field) is 1, the second is 2, etc.

Parameters:
  • file_path – the path to the data
  • init_params – params related to csv file, see above or csvReader for details
  • data_type – the type of request label, can be either “c” for string or “l” for number (for example block IO LBA)
  • block_unit_size – the block size for a cache, currently storage system only
  • disk_sector_size – the disk sector size of input file, storage system only
Returns:

reader object

Keyword Argument  file type   Value Type  Default Value  Description
label             csv/binary  int         required       the column of the request label
fmt               binary      string      required       format string of the binary data, same as Python struct
header            csv         True/False  False          whether the csv data has a header
delimiter         csv         char        ","            the delimiter separating fields in the csv file
real_time         csv/binary  int         NA             the column of real time
op                csv/binary  int         NA             the column of the operation (read/write)
size              csv/binary  int         NA             the column of the block/request size
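A sketch of an init_params dictionary built from the table above. The column numbers and the file name "trace.csv" are made-up examples:

```python
# init_params for a hypothetical csv trace; columns are 1-indexed.
init_params = {
    "label": 1,        # required: column holding the request label
    "header": True,    # the csv file has a header row
    "delimiter": ",",  # field separator
    "real_time": 2,    # optional: column holding the real (wall-clock) time
    "size": 3,         # optional: column holding the request size
}

# With PyMimircache installed, this would be passed as:
#   c = Cachecow()
#   c.csv("trace.csv", init_params=init_params)
```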
binary(file_path, init_params, data_type='l', block_unit_size=0, disk_sector_size=0, **kwargs)

Open a binary trace file; for init_params, see the csv function.

Parameters:
  • file_path – the path to the data
  • init_params – params related to the spec of data, see above csv for details
  • data_type – the type of request label, can be either “c” for string or “l” for number (for example block IO LBA)
  • block_unit_size – the block size for a cache, currently storage system only
  • disk_sector_size – the disk sector size of input file, storage system only
Returns:

reader object

vscsi(file_path, block_unit_size=0, **kwargs)

open vscsi trace file

Parameters:
  • file_path – the path to the data
  • block_unit_size – the block size for a cache, currently storage system only
Returns:

reader object

reset()
reset cachecow to its initial state, including resetting the reader to the beginning of the trace
close()

close the reader opened in cachecow and release its resources

stat(time_period=[-1, 0])

obtain statistical information about the trace, including

  • number of requests
  • number of unique items
  • cold miss ratio
  • a list of the top 10 most popular objects, as (obj, number of requests) pairs
  • number of objects/blocks accessed only once
  • mean frequency
  • time span
Returns: a string of the information above
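The cold miss ratio reported here counts each object's first access as a (compulsory) miss. A minimal pure-Python illustration of the idea, independent of PyMimircache:

```python
def cold_miss_ratio(requests):
    """Fraction of requests that are first-time (compulsory) misses."""
    seen = set()
    cold_misses = 0
    for obj in requests:
        if obj not in seen:
            cold_misses += 1  # first access to this object: a cold miss
            seen.add(obj)
    return cold_misses / len(requests)

trace = ["a", "b", "a", "c", "b", "a"]
print(cold_miss_ratio(trace))  # 3 unique objects / 6 requests = 0.5
```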
get_frequency_access_list(time_period=[-1, 0])

obtain the access frequency of the objects in the trace over the given time period

Returns: a list of access frequencies of the objects in the trace
num_of_req()
Returns: the number of requests in the trace
num_of_uniq_req()
Returns: the number of unique requests in the trace
get_reuse_distance()
Returns: an array of reuse distances
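Reuse distance is the number of distinct objects accessed between two consecutive accesses to the same object. A naive pure-Python sketch of the definition (using -1 for a cold first access), not the library's O(NlogN) implementation:

```python
def reuse_distances(requests):
    """Naive O(N^2) reuse distance: the number of distinct objects between
    consecutive accesses to the same object; -1 marks a cold (first) access."""
    last_pos = {}
    distances = []
    for i, obj in enumerate(requests):
        if obj in last_pos:
            # distinct objects seen strictly between the two accesses
            distances.append(len(set(requests[last_pos[obj] + 1:i])))
        else:
            distances.append(-1)  # first access
        last_pos[obj] = i
    return distances

print(reuse_distances(["a", "b", "a", "b", "a"]))  # [-1, -1, 1, 1, 1]
```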
get_hit_count_dict(algorithm, cache_size=-1, cache_params=None, bin_size=-1, use_general_profiler=False, **kwargs)

Get the hit count of the given algorithm and return a dict mapping from cache size to hit count. Note that the hit count array is not a CDF: the hit count at size 2 does not include the hit count at size 1; you need to sum them up to get a CDF.

Parameters:
  • algorithm – cache replacement algorithms
  • cache_size – size of cache
  • cache_params – parameters passed to the cache; some cache replacement algorithms require parameters, for example LRU-K, SLRU
  • bin_size – if the algorithm is not LRU, the hit count will be calculated by simulating caches of size [0, bin_size, bin_size*2, ..., cache_size]; this is not required for LRU
  • use_general_profiler – if the algorithm is LRU and you don't want to use LRUProfiler, set this to True. Possible reasons for not using LRUProfiler: 1. it is too slow for a large trace, because the algorithm is O(NlogN) and single-threaded; 2. it has a bug (let me know if you find one).
  • kwargs – other parameters including num_of_threads
Returns:

a dict for the given algorithm, mapping from cache_size -> hit count
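As the note above says, the returned hit count dict is not cumulative. A pure-Python sketch (with a made-up hit count dict) of summing it into a CDF of hit ratios:

```python
# Hypothetical non-cumulative hit counts: cache size -> hits gained at that size
hit_count = {0: 0, 100: 40, 200: 25, 300: 10}
num_of_req = 100

cumulative = 0
hit_ratio_cdf = {}
for size in sorted(hit_count):
    cumulative += hit_count[size]          # sum up to get the CDF
    hit_ratio_cdf[size] = cumulative / num_of_req

print(hit_ratio_cdf)  # {0: 0.0, 100: 0.4, 200: 0.65, 300: 0.75}
```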

get_hit_ratio_dict(algorithm, cache_size=-1, cache_params=None, bin_size=-1, use_general_profiler=False, **kwargs)

get hit ratio of the given algorithm and return a dict of mapping from cache size -> hit ratio

Parameters:
  • algorithm – cache replacement algorithms
  • cache_size – size of cache
  • cache_params – parameters passed to the cache; some cache replacement algorithms require parameters, for example LRU-K, SLRU
  • bin_size – if the algorithm is not LRU, the hit ratio will be calculated by simulating caches of size [0, bin_size, bin_size*2, ..., cache_size]; this is not required for LRU
  • use_general_profiler – if the algorithm is LRU and you don't want to use LRUProfiler, set this to True. Possible reasons for not using LRUProfiler: 1. it is too slow for a large trace, because the algorithm is O(NlogN) and single-threaded; 2. it has a bug (let me know if you find one).
  • kwargs – other parameters including num_of_threads
Returns:

a dict for the given algorithm, mapping from cache_size -> hit ratio

profiler(algorithm, cache_params=None, cache_size=-1, bin_size=-1, use_general_profiler=False, **kwargs)

Get a profiler instance; most users should not need this directly.

Parameters:
  • algorithm – name of algorithm
  • cache_params – parameters of given cache replacement algorithm
  • cache_size – size of cache
  • bin_size – bin_size for generalProfiler
  • use_general_profiler

    this option is for LRU only; if True, return a cGeneralProfiler for LRU, otherwise return an LRUProfiler.

    Note: LRUProfiler does not require the cache_size/bin_size params; it does not sample, and thus produces a smooth curve, but it is O(logN) at each step. In contrast, cGeneralProfiler samples the curve, but uses O(1) at each step.

  • kwargs – num_of_threads
Returns:

a profiler instance

heatmap(time_mode, plot_type, time_interval=-1, num_of_pixels=-1, algorithm='LRU', cache_params=None, cache_size=-1, **kwargs)

Plot heatmaps. The following heatmap types are currently supported:

  • hit_ratio_start_time_end_time
  • hit_ratio_start_time_cache_size (python only)
  • avg_rd_start_time_end_time (python only)
  • cold_miss_count_start_time_end_time (python only)
  • rd_distribution
  • rd_distribution_CDF
  • future_rd_distribution
  • dist_distribution
  • reuse_time_distribution
Parameters:
  • time_mode – the type of time, can be “v” for virtual time, or “r” for real time
  • plot_type – the name of plot types, see above for plot types
  • time_interval – the time interval of one pixel
  • num_of_pixels – if you don't want to use time_interval, you can instead specify how many pixels you want in one dimension; note this feature is not well tested
  • algorithm – what algorithm to use for plotting heatmap, this is not required for distance related heatmap like rd_distribution
  • cache_params – parameters passed to cache, some of the cache replacement algorithms require parameters, for example LRU-K, SLRU
  • cache_size – The size of cache, this is required only for hit_ratio_start_time_end_time
  • kwargs – other parameters for computation and plotting such as num_of_threads, figname
diff_heatmap(time_mode, plot_type, algorithm1='LRU', time_interval=-1, num_of_pixels=-1, algorithm2='Optimal', cache_params1=None, cache_params2=None, cache_size=-1, **kwargs)

Plot the differential heatmap between two algorithms, computed as alg2 - alg1.

Parameters:
  • cache_size – size of cache
  • time_mode – time mode, "v" for virtual time, "r" for real time
  • plot_type – same as the name in heatmap function
  • algorithm1 – name of the first alg
  • time_interval – same as in heatmap
  • num_of_pixels – same as in heatmap
  • algorithm2 – name of the second algorithm
  • cache_params1 – parameters of the first algorithm
  • cache_params2 – parameters of the second algorithm
  • kwargs – include num_of_threads
twoDPlot(plot_type, **kwargs)

An aggregate function for all two-dimensional plots except the hit ratio curve.

plot type           required parameters        Description
cold_miss_count     time_mode, time_interval   cold miss count VS time
cold_miss_ratio     time_mode, time_interval   cold miss ratio VS time
request_rate        time_mode, time_interval   num of requests VS time
popularity          NA                         percentage of objects VS frequency
rd_popularity       NA                         num of requests VS reuse distance
rt_popularity       NA                         num of requests VS reuse time
scan_vis_2d         NA                         mapping from original objID to sequential number
interval_hit_ratio  cache_size                 hit ratio of each interval VS time
Parameters:
  • plot_type – type of the plot, see above
  • kwargs – parameters related to plots; see the twoDPlots module for detailed control over plots
plotHRCs(algorithm_list, cache_params=(), cache_size=-1, bin_size=-1, auto_resize=True, figname='HRC.png', **kwargs)

Plot hit ratio curves (HRCs) for the given algorithms.

Parameters:
  • algorithm_list – a list of algorithm(s)
  • cache_params – the corresponding cache params for the algorithms; use None for algorithms that don't require cache params. If none of the algorithms requires cache params, you don't need to set this
  • cache_size – maximal size of cache, use -1 for max possible size
  • bin_size – bin size for non-LRU profiling
  • auto_resize – when using the max possible size, or when the specified cache size is too large, you will get a large plateau at the end of the hit ratio curve; set auto_resize to True to cut off most of the plateau
  • figname – name of figure
  • kwargs

    options: block_unit_size, num_of_threads, auto_resize_threshold, xlimit, ylimit, cache_unit_size

    save_gradually – save a figure every time the computation for one algorithm finishes

    label – instead of using the algorithm list as labels, specify user-defined labels

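A hedged usage sketch of plotHRCs (assuming PyMimircache is installed and a vscsi trace exists at the hypothetical path "trace.vscsi"; the SLRU parameter name is also an assumption for illustration):

```python
import os

# Guarded import so the sketch stays importable without the library installed.
try:
    from PyMimircache import Cachecow
except ImportError:
    Cachecow = None  # PyMimircache not installed; sketch only

if Cachecow is not None and os.path.exists("trace.vscsi"):
    c = Cachecow()
    c.vscsi("trace.vscsi")
    # Compare LRU against SLRU and Optimal; only SLRU takes cache params here
    # (the {"N": 2} parameter dict is a hypothetical example).
    c.plotHRCs(["LRU", "SLRU", "Optimal"],
               cache_params=[None, {"N": 2}, None],
               cache_size=20000,
               bin_size=200,
               figname="HRC.png")
```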
characterize(characterize_type, cache_size=-1, **kwargs)

Use this function to obtain a series of plots characterizing your trace. The available types are:

  • short - short run time, fewer plots with less accuracy
  • medium
  • long
  • all - most of the available plots with high accuracy; notice it can take a LONG time on a big trace
Parameters:
  • characterize_type – see above, options: short, medium, long, all
  • cache_size – estimated cache size for the trace, if -1, PyMimircache will estimate the cache size
  • kwargs – print_stat
Returns:

trace stat string
