API-cachecow¶
This module offers the top-level API to the user. It currently supports four types of operations:
trace loading
trace information retrieving
trace profiling
plotting
Author: Jason Yang <peter.waynechina@gmail.com> 2017/08
- class PyMimircache.top.cachecow.Cachecow(**kwargs)¶
Cachecow class providing the top-level API
- open(file_path, trace_type='p', data_type='c', **kwargs)¶
By default, this function opens a plain text trace: a file in which each line contains a single request label.
By changing trace_type, it can open other types of traces. The supported trace types are:
trace_type | file type | require init_params
"p" | plain text | No
"c" | csv | Yes
"b" | binary | Yes
"v" | vscsi | No
The effect is the same as calling the corresponding function (csv, binary, vscsi) directly.
- Parameters:
file_path – the path to the data
trace_type – type of trace, “p” for plainText, “c” for csv, “v” for vscsi, “b” for binary
data_type – the type of request label, can be either “c” for string or “l” for number (for example block IO LBA)
kwargs – parameters for opening the trace
- Returns:
reader object
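For illustration, a plain text trace ("p") is simply a file with one request label per line. A minimal sketch that writes such a trace and reads it back (the labels and the commented Cachecow call are illustrative assumptions, not output of the library):

```python
import os
import tempfile

# A plain text trace: one request label per line.
labels = ["A", "B", "A", "C", "B", "A"]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(labels) + "\n")
    trace_path = f.name

# Opening it with cachecow would then look like (hypothetical usage, not run here):
#   c = Cachecow()
#   reader = c.open(trace_path, trace_type="p", data_type="c")

# Reading the file back manually shows the format.
with open(trace_path) as f:
    read_back = [line.strip() for line in f]
os.remove(trace_path)
print(read_back)  # ['A', 'B', 'A', 'C', 'B', 'A']
```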
- csv(file_path, init_params, data_type='c', block_unit_size=0, disk_sector_size=0, **kwargs)¶
Open a csv trace. init_params is a dictionary specifying the format of the csv file; the possible keys are listed in the table below. Column/field numbering begins at 1: the first column (field) is 1, the second is 2, etc.
- Parameters:
file_path – the path to the data
init_params – params related to csv file, see above or csvReader for details
data_type – the type of request label, can be either “c” for string or “l” for number (for example block IO LBA)
block_unit_size – the block size for a cache, currently storage system only
disk_sector_size – the disk sector size of input file, storage system only
- Returns:
reader object
Keyword Argument | file type | Value Type | Default Value | Description
label | csv/binary | int | this is required | the column of the label of the request
fmt | binary | string | this is required | fmt string of binary data, same as python struct
header | csv | True/False | False | whether csv data has header
delimiter | csv | char | "," | the delimiter separating fields in the csv file
real_time | csv/binary | int | NA | the column of real time
op | csv/binary | int | NA | the column of operation (read/write)
size | csv/binary | int | NA | the column of block/request size
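To illustrate the 1-indexed column convention, here is a hypothetical init_params for a small csv trace with a header row, and how the 1-indexed label column maps onto a parsed line (the column layout and data are made up for the example):

```python
import csv
import io

# Hypothetical init_params: the 3rd column holds the label,
# the 1st holds real time; column numbers start at 1.
init_params = {"label": 3, "real_time": 1, "header": True, "delimiter": ","}

data = "time,op,block\n100,read,42\n101,write,42\n102,read,7\n"
rows = list(csv.reader(io.StringIO(data), delimiter=init_params["delimiter"]))
if init_params["header"]:
    rows = rows[1:]  # skip the header row

# Convert the 1-indexed column numbers to Python's 0-indexed lists.
labels = [row[init_params["label"] - 1] for row in rows]
print(labels)  # ['42', '42', '7']
```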
- binary(file_path, init_params, data_type='l', block_unit_size=0, disk_sector_size=0, **kwargs)¶
Open a binary trace file; for init_params see the csv function.
- Parameters:
file_path – the path to the data
init_params – params related to the spec of data, see above csv for details
data_type – the type of request label, can be either “c” for string or “l” for number (for example block IO LBA)
block_unit_size – the block size for a cache, currently storage system only
disk_sector_size – the disk sector size of input file, storage system only
- Returns:
reader object
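The fmt key follows the syntax of Python's struct module: each request is a fixed-size binary record. A sketch of packing and unpacking such records (the record layout here, a timestamp plus an LBA, is an assumption for illustration):

```python
import struct

# Hypothetical record layout: 8-byte real time + 8-byte LBA, little-endian.
fmt = "<QQ"
record_size = struct.calcsize(fmt)  # 16 bytes per request

# Pack two requests into a blob, as a binary trace file would store them.
blob = struct.pack(fmt, 100, 42) + struct.pack(fmt, 101, 7)

# Unpack them back, one fixed-size record at a time.
requests = [struct.unpack_from(fmt, blob, off)
            for off in range(0, len(blob), record_size)]
print(requests)  # [(100, 42), (101, 7)]
```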
- vscsi(file_path, block_unit_size=0, **kwargs)¶
Open a vscsi trace file.
- Parameters:
file_path – the path to the data
block_unit_size – the block size for a cache, currently storage system only
- Returns:
reader object
- reset()¶
- reset cachecow to its initial state, including
resetting the reader to the beginning of the trace
- close()¶
close the reader opened in cachecow and clean up
- stat(time_period=[-1, 0])¶
obtain statistical information about the trace, including
number of requests
number of unique items
cold miss ratio
a list of the top 10 popular items in the form (obj, num of requests)
number of obj/blocks accessed only once
frequency mean
time span
- Returns:
a string of the information above
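The per-trace statistics above can be sketched in plain Python. The toy request sequence below is made up; it shows how the cold miss ratio (every unique item causes exactly one cold miss) and the popularity list follow from simple counting:

```python
from collections import Counter

requests = ["A", "B", "A", "C", "B", "A", "D"]

freq = Counter(requests)
num_req = len(requests)               # number of requests
num_uniq = len(freq)                  # number of unique items
cold_miss_ratio = num_uniq / num_req  # each unique item is one cold miss
top_popular = freq.most_common(10)    # list of (obj, num of requests)
accessed_once = sum(1 for c in freq.values() if c == 1)
freq_mean = num_req / num_uniq        # mean accesses per unique item

print(num_req, num_uniq, round(cold_miss_ratio, 3), accessed_once)  # 7 4 0.571 2
```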
- get_frequency_access_list(time_period=[-1, 0])¶
obtain statistical information about the trace, including
number of requests
number of unique items
cold miss ratio
a list of the top 10 popular items in the form (obj, num of requests)
number of obj/blocks accessed only once
frequency mean
time span
- Returns:
a string of the information above
- num_of_req()¶
- Returns:
the number of requests in the trace
- num_of_uniq_req()¶
- Returns:
the number of unique requests in the trace
- get_reuse_distance()¶
- Returns:
an array of reuse distance
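Reuse distance (LRU stack distance) is the number of distinct items referenced since the previous access to the same item, with -1 here marking a first access. A minimal quadratic sketch for small traces (the library's LRUProfiler computes this far more efficiently, in O(NlogN) overall):

```python
def reuse_distances(trace):
    """Return the reuse distance of each request; -1 marks a cold miss."""
    last_pos = {}
    dists = []
    for i, item in enumerate(trace):
        if item not in last_pos:
            dists.append(-1)  # first access: infinite reuse distance
        else:
            # count distinct items seen since the previous access to `item`
            dists.append(len(set(trace[last_pos[item] + 1 : i])))
        last_pos[item] = i
    return dists

print(reuse_distances(["A", "B", "A", "C", "B", "A"]))  # [-1, -1, 1, -1, 2, 2]
```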
- get_hit_count_dict(algorithm, cache_size=-1, cache_params=None, bin_size=-1, use_general_profiler=False, **kwargs)¶
get the hit count of the given algorithm and return a dict mapping from cache size -> hit count. Note that the hit counts are not cumulative (not a CDF): the hit count at size 2 does not include the hit count at size 1; sum them up to obtain a CDF.
- Parameters:
algorithm – cache replacement algorithms
cache_size – size of cache
cache_params – parameters passed to cache, some of the cache replacement algorithms require parameters, for example LRU-K, SLRU
bin_size – if algorithm is not LRU, then the hit ratio will be calculated by simulating cache at cache size [0, bin_size, bin_size*2 … cache_size], this is not required for LRU
use_general_profiler – if algorithm is LRU and you don’t want to use LRUProfiler, set this to True. Possible reasons for not using LRUProfiler: 1. LRUProfiler is too slow for your large trace because the algorithm is O(NlogN) and single-threaded; 2. LRUProfiler has a bug (let me know if you find one).
kwargs – other parameters including num_of_threads
- Returns:
a dict mapping from cache_size -> hit count for the given algorithm
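Since the returned hit counts are per-bin rather than cumulative, turning them into a CDF (and from there a hit ratio curve) is a running sum over increasing cache sizes. A sketch with made-up numbers standing in for the function's output:

```python
from itertools import accumulate

# Hypothetical output of get_hit_count_dict: cache size -> hits in that bin.
hit_count = {0: 0, 100: 50, 200: 30, 300: 10}
num_of_req = 200  # total requests in the trace (made up)

sizes = sorted(hit_count)
cum_hits = list(accumulate(hit_count[s] for s in sizes))  # running sum = CDF
hit_ratio_cdf = {s: h / num_of_req for s, h in zip(sizes, cum_hits)}
print(hit_ratio_cdf)  # {0: 0.0, 100: 0.25, 200: 0.4, 300: 0.45}
```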
- get_hit_ratio_dict(algorithm, cache_size=-1, cache_params=None, bin_size=-1, use_general_profiler=False, **kwargs)¶
get the hit ratio of the given algorithm and return a dict mapping from cache size -> hit ratio
- Parameters:
algorithm – cache replacement algorithms
cache_size – size of cache
cache_params – parameters passed to cache, some of the cache replacement algorithms require parameters, for example LRU-K, SLRU
bin_size – if algorithm is not LRU, then the hit ratio will be calculated by simulating cache at cache size [0, bin_size, bin_size*2 … cache_size], this is not required for LRU
use_general_profiler – if algorithm is LRU and you don’t want to use LRUProfiler, set this to True. Possible reasons for not using LRUProfiler: 1. LRUProfiler is too slow for your large trace because the algorithm is O(NlogN) and single-threaded; 2. LRUProfiler has a bug (let me know if you find one).
kwargs – other parameters including num_of_threads
- Returns:
a dict mapping from cache_size -> hit ratio for the given algorithm
- profiler(algorithm, cache_params=None, cache_size=-1, bin_size=-1, use_general_profiler=False, **kwargs)¶
get a profiler instance; most users should not need this
- Parameters:
algorithm – name of algorithm
cache_params – parameters of given cache replacement algorithm
cache_size – size of cache
bin_size – bin_size for generalProfiler
use_general_profiler –
this option is for LRU only: if True, return a cGeneralProfiler for LRU; otherwise return a LRUProfiler for LRU.
Note: LRUProfiler does not require the cache_size/bin_size params; it does not sample, so it produces a smooth curve, but it costs O(logN) per step. In contrast, cGeneralProfiler samples the curve but costs O(1) per step.
kwargs – num_of_threads
- Returns:
a profiler instance
- heatmap(time_mode, plot_type, time_interval=-1, num_of_pixels=-1, algorithm='LRU', cache_params=None, cache_size=-1, **kwargs)¶
plot heatmaps; the following heatmap types are currently supported:
hit_ratio_start_time_end_time
hit_ratio_start_time_cache_size (python only)
avg_rd_start_time_end_time (python only)
cold_miss_count_start_time_end_time (python only)
rd_distribution
rd_distribution_CDF
future_rd_distribution
dist_distribution
reuse_time_distribution
- Parameters:
time_mode – the type of time, can be “v” for virtual time, or “r” for real time
plot_type – the name of plot types, see above for plot types
time_interval – the time interval of one pixel
num_of_pixels – if you don’t want to use time_interval, you can instead specify how many pixels you want in one dimension; note this feature is not well tested
algorithm – what algorithm to use for plotting heatmap, this is not required for distance related heatmap like rd_distribution
cache_params – parameters passed to cache, some of the cache replacement algorithms require parameters, for example LRU-K, SLRU
cache_size – The size of cache, this is required only for hit_ratio_start_time_end_time
kwargs – other parameters for computation and plotting such as num_of_threads, figname
- diff_heatmap(time_mode, plot_type, algorithm1='LRU', time_interval=-1, num_of_pixels=-1, algorithm2='Optimal', cache_params1=None, cache_params2=None, cache_size=-1, **kwargs)¶
Plot the differential heatmap between two algorithms, computed as alg2 - alg1.
- Parameters:
cache_size – size of cache
time_mode – time mode, “v” for virtual time, “r” for real time
plot_type – same as the name in heatmap function
algorithm1 – name of the first alg
time_interval – same as in heatmap
num_of_pixels – same as in heatmap
algorithm2 – name of the second algorithm
cache_params1 – parameters of the first algorithm
cache_params2 – parameters of the second algorithm
kwargs – include num_of_threads
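The differential heatmap is the element-wise difference alg2 - alg1 of the two algorithms' hit ratio matrices. A sketch with two tiny made-up matrices (rows and columns would correspond to start/end time pixels):

```python
# Hypothetical hit ratio matrices produced by profiling two algorithms.
alg1 = [[0.50, 0.60], [0.55, 0.65]]  # e.g. LRU
alg2 = [[0.58, 0.70], [0.60, 0.66]]  # e.g. Optimal

# Element-wise alg2 - alg1; positive cells mean alg2 performs better there.
diff = [[round(b - a, 2) for a, b in zip(r1, r2)]
        for r1, r2 in zip(alg1, alg2)]
print(diff)  # [[0.08, 0.1], [0.05, 0.01]]
```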
- twoDPlot(plot_type, **kwargs)¶
an aggregate function for all two-dimensional plots except the hit ratio curve
plot type | required parameters | Description
cold_miss_count | time_mode, time_interval | cold miss count VS time
cold_miss_ratio | time_mode, time_interval | cold miss ratio VS time
request_rate | time_mode, time_interval | num of requests VS time
popularity | NA | percentage of obj VS frequency
rd_popularity | NA | num of req VS reuse distance
rt_popularity | NA | num of req VS reuse time
scan_vis_2d | NA | mapping from original objID to sequential number
interval_hit_ratio | cache_size | hit ratio of interval VS time
- Parameters:
plot_type – type of the plot, see above
kwargs – parameters related to plots, see twoDPlots module for detailed control over plots
- plotHRCs(algorithm_list, cache_params=(), cache_size=-1, bin_size=-1, auto_resize=True, figname='HRC.png', **kwargs)¶
plot hit ratio curves (HRCs) for the given algorithms
- Parameters:
algorithm_list – a list of algorithm(s)
cache_params – the corresponding cache params for the algorithms; use None for algorithms that don’t require cache params. If none of the algorithms requires cache params, this can be omitted
cache_size – maximal size of cache, use -1 for max possible size
bin_size – bin size for non-LRU profiling
auto_resize – when using the max possible size, or when the specified cache size is too large, the hit ratio curve ends in a long plateau; set auto_resize to True to cut off most of it
figname – name of figure
kwargs –
options: block_unit_size, num_of_threads, auto_resize_threshold, xlimit, ylimit, cache_unit_size
save_gradually - save a figure each time computation for one algorithm finishes
label - instead of using algorithm list as label, specify user-defined label
- characterize(characterize_type, cache_size=-1, **kwargs)¶
use this function to obtain a series of plots about your trace; the types are:
short - short run time, fewer plots with less accuracy
medium
long
all - most of the available plots with high accuracy; note it can take a LONG time on a big trace
- Parameters:
characterize_type – see above, options: short, medium, long, all
cache_size – estimated cache size for the trace, if -1, PyMimircache will estimate the cache size
kwargs – print_stat
- Returns:
trace stat string