API-cachecow

This module offers the top-level API to users. It currently supports four types of operations:

  • trace loading

  • trace information retrieving

  • trace profiling

  • plotting

Author: Jason Yang <peter.waynechina@gmail.com> 2017/08

class PyMimircache.top.cachecow.Cachecow(**kwargs)

cachecow class providing top level API

open(file_path, trace_type='p', data_type='c', **kwargs)

By default, this function opens a plain text trace, that is, a file in which each line contains one label.

Changing trace_type lets it open other types of trace. The supported trace types are:

trace_type   file type    require init_params
----------   ----------   -------------------
“p”          plain text   No
“c”          csv          Yes
“b”          binary       Yes
“v”          vscsi        No

The effect is the same as calling the corresponding function (csv, binary, vscsi).

Parameters:
  • file_path – the path to the data

  • trace_type – type of trace, “p” for plainText, “c” for csv, “v” for vscsi, “b” for binary

  • data_type – the type of request label, can be either “c” for string or “l” for number (for example block IO LBA)

  • kwargs – parameters for opening the trace

Returns:

reader object
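As a rough illustration of the plain text format described above, the sketch below writes a small trace with one label per line and reads it back with ordinary file I/O. The file name example_trace.txt, the sample labels, and the helper open_with_cachecow are hypothetical. The helper only shows how such a path might be handed to open with the documented defaults; it is not called here, so the block runs without PyMimircache installed.

```python
import os
import tempfile

# Labels of the requests, in arrival order (illustrative data only).
labels = ["a", "b", "c", "a", "b", "d"]

# A plain text trace: each line contains exactly one label.
trace_dir = tempfile.mkdtemp()
trace_path = os.path.join(trace_dir, "example_trace.txt")
with open(trace_path, "w") as f:
    for label in labels:
        f.write(label + "\n")

# Reading it back shows what a plain text reader has to parse.
with open(trace_path) as f:
    parsed = [line.strip() for line in f if line.strip()]


def open_with_cachecow(path):
    """Hypothetical helper: open the trace through the top-level API.

    Assumes PyMimircache is installed; the import path follows the class
    name given in this reference.
    """
    from PyMimircache.top.cachecow import Cachecow

    c = Cachecow()
    # trace_type="p" (plain text) and data_type="c" (string labels)
    # are the documented defaults, spelled out here for clarity.
    return c.open(path, trace_type="p", data_type="c")
```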

csv(file_path, init_params, data_type='c', block_unit_size=0, disk_sector_size=0, **kwargs)

Open a csv trace. init_params is a dictionary specifying the layout of the csv file; the possible keys are listed in the table below. Column (field) numbers begin at 1, so the first column is 1, the second is 2, and so on.

Parameters:
  • file_path – the path to the data

  • init_params – params related to the csv file, see the table below or csvReader for details

  • data_type – the type of request label, can be either “c” for string or “l” for number (for example block IO LBA)

  • block_unit_size – the block size for a cache, currently storage system only

  • disk_sector_size – the disk sector size of input file, storage system only

Returns:

reader object

Keyword Argument   file type     Value Type   Default Value      Description
----------------   -----------   ----------   ----------------   -----------
label              csv/ binary   int          this is required   the column of the label of the request
fmt                binary        string       this is required   fmt string of binary data, same as python struct
header             csv           True/False   False              whether csv data has header
delimiter          csv           char         “,”                the delimiter separating fields in the csv file
real_time          csv/ binary   int          NA                 the column of real time
op                 csv/ binary   int          NA                 the column of operation (read/write)
size               csv/ binary   int          NA                 the column of block/request size
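Because the 1-based column convention is easy to get wrong, here is a minimal sketch of an init_params dictionary for a four-column csv file. The sample rows and the helper read_labels are made up for illustration. read_labels uses the standard csv module to mimic how the label, header, and delimiter keys would be interpreted; it is not PyMimircache's csvReader.

```python
import csv
import io

# Sample csv content (illustrative): time, operation, label, size.
sample = "time,op,lba,size\n1,read,1024,512\n2,write,2048,512\n3,read,1024,512\n"

# init_params in the shape described by the table; columns are 1-based.
init_params = {
    "label": 3,       # third column holds the request label
    "real_time": 1,   # first column holds the timestamp
    "op": 2,          # second column holds the operation
    "size": 4,        # fourth column holds the request size
    "header": True,   # first row is a header and must be skipped
    "delimiter": ",",
}


def read_labels(text, params):
    """Return the label column of csv text, honouring 1-based columns."""
    reader = csv.reader(io.StringIO(text), delimiter=params.get("delimiter", ","))
    if params.get("header", False):
        next(reader, None)  # skip the header row
    label_index = params["label"] - 1  # convert 1-based column to 0-based index
    return [row[label_index] for row in reader if row]


labels = read_labels(sample, init_params)
```

With data_type left at “c”, the labels stay strings; a numeric label such as an LBA would correspond to data_type “l”.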

binary(file_path, init_params, data_type='l', block_unit_size=0, disk_sector_size=0, **kwargs)

Open a binary trace file. For init_params, see the csv function.

Parameters:
  • file_path – the path to the data

  • init_params – params related to the spec of data, see above csv for details

  • data_type – the type of request label, can be either “c” for string or “l” for number (for example block IO LBA)

  • block_unit_size – the block size for a cache, currently storage system only

  • disk_sector_size – the disk sector size of input file, storage system only

Returns:

reader object
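To make the fmt key more concrete, the sketch below packs a few records with Python's struct module and extracts the label field again. The record layout (a 32-bit timestamp followed by a 64-bit label) and the sample values are invented for illustration, and the sketch assumes that binary field numbers follow the same 1-based convention stated for csv columns.

```python
import struct

# Each record: a 32-bit unsigned timestamp followed by a 64-bit unsigned
# label, little-endian without padding (illustrative layout).
fmt = "<IQ"

# Assumption: fields are counted from 1, as for csv columns.
init_params = {"fmt": fmt, "real_time": 1, "label": 2}

# Build a tiny binary trace in memory.
records = [(1, 1024), (2, 2048), (3, 1024)]
data = b"".join(struct.pack(fmt, ts, label) for ts, label in records)

# With "<" there is no alignment padding, so a record is 4 + 8 bytes.
record_size = struct.calcsize(fmt)

# Decode every record and pick the label field.
label_index = init_params["label"] - 1
labels = [fields[label_index] for fields in struct.iter_unpack(fmt, data)]
```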

vscsi(file_path, block_unit_size=0, **kwargs)

open vscsi trace file

Parameters:
  • file_path – the path to the data

  • block_unit_size – the block size for a cache, currently storage system only

Returns:

reader object

reset()
Reset cachecow to its initial state, including resetting the reader to the beginning of the trace.

close()

Close the reader opened in cachecow; further clean-up is planned for the future.

stat(time_period=[-1, 0])

obtain the statistical information about the trace, including

  • number of requests

  • number of unique items

  • cold miss ratio

  • a list of the top 10 most popular objects, in the form (obj, num of requests)

  • number of obj/block accessed only once

  • frequency mean

  • time span

Returns:

a string of the information above
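The quantities listed above can be approximated with a few lines of plain Python. This is only a sketch: the function trace_stat and its key names are hypothetical, it returns a dict rather than a string, and the definitions of cold miss ratio and frequency mean are assumptions noted in the comments. Time span is left out because it needs timestamps.

```python
from collections import Counter


def trace_stat(labels):
    """Compute basic statistics for a list of request labels."""
    freq = Counter(labels)
    n_req = len(labels)
    n_uniq = len(freq)
    return {
        "num_of_requests": n_req,
        "num_of_unique_items": n_uniq,
        # Assumption: every first access to an item is a cold miss.
        "cold_miss_ratio": n_uniq / n_req if n_req else 0.0,
        # (obj, num of requests) pairs, most popular first.
        "top_10": freq.most_common(10),
        "accessed_once": sum(1 for count in freq.values() if count == 1),
        # Assumption: mean number of requests per unique item.
        "frequency_mean": n_req / n_uniq if n_uniq else 0.0,
    }


stats = trace_stat(["a", "b", "c", "a", "b", "a"])
```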

get_frequency_access_list(time_period=[-1, 0])

obtain the statistical information about the trace, including

  • number of requests

  • number of unique items

  • cold miss ratio

  • a list of the top 10 most popular objects, in the form (obj, num of requests)

  • number of obj/block accessed only once

  • frequency mean

  • time span

Returns:

a string of the information above

num_of_req()
Returns:

the number of requests in the trace

num_of_uniq_req()
Returns:

the number of unique requests in the trace

get_reuse_distance()
Returns:

an array of reuse distances
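For readers unfamiliar with the term, the sketch below computes reuse distances naively: for each request, it counts the distinct items accessed since the previous request to the same item. The function name reuse_distances is hypothetical, and marking first accesses with -1 is an assumption of this sketch. The loop is quadratic in the worst case and is meant only to show the definition.

```python
def reuse_distances(labels):
    """Return one reuse distance per request (-1 for a first access)."""
    last_seen = {}   # label -> index of its most recent request
    distances = []
    for i, label in enumerate(labels):
        if label in last_seen:
            # Distinct items requested strictly between the two accesses.
            between = labels[last_seen[label] + 1 : i]
            distances.append(len(set(between)))
        else:
            # Assumption: a first access has no reuse distance, shown as -1.
            distances.append(-1)
        last_seen[label] = i
    return distances


rd = reuse_distances(["a", "b", "c", "a", "b", "b"])
```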

get_hit_count_dict(algorithm, cache_size=-1, cache_params=None, bin_size=-1, use_general_profiler=False, **kwargs)

Get the hit count of the given algorithm and return a dict mapping cache size -> hit count. Note that the hit counts are not cumulative (not a CDF): the hit count of size 2 does not include the hit count of size 1, so you need to sum them up to get a CDF.

Parameters:
  • algorithm – cache replacement algorithms

  • cache_size – size of cache

  • cache_params – parameters passed to cache, some of the cache replacement algorithms require parameters, for example LRU-K, SLRU

  • bin_size – if the algorithm is not LRU, the hit ratio is calculated by simulating the cache at cache sizes [0, bin_size, bin_size*2 … cache_size]; this is not required for LRU

  • use_general_profiler – if the algorithm is LRU and you don’t want to use LRUProfiler, set this to True. Possible reasons for not using LRUProfiler: 1. LRUProfiler is too slow for your large trace, because the algorithm is O(NlogN) and single-threaded; 2. LRUProfiler has a bug (let me know if you find one).

  • kwargs – other parameters including num_of_threads

Returns:

a dict of hit counts of the given algorithm, mapping from cache_size -> hit count
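Since the per-size hit counts have to be summed to obtain a CDF, a short sketch of that step may help. The hit_count dict and the request total below are made-up sample values, not output from PyMimircache.

```python
from itertools import accumulate

# Illustrative per-size hit counts: hits gained at exactly this cache size.
hit_count = {1: 3, 2: 2, 3: 1}
num_requests = 10  # illustrative total number of requests in the trace

# Sum the counts in increasing cache size order to get cumulative hits.
sizes = sorted(hit_count)
cumulative = dict(zip(sizes, accumulate(hit_count[s] for s in sizes)))

# Dividing by the number of requests turns cumulative hits into hit ratios.
hit_ratio = {size: hits / num_requests for size, hits in cumulative.items()}
```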

get_hit_ratio_dict(algorithm, cache_size=-1, cache_params=None, bin_size=-1, use_general_profiler=False, **kwargs)

Get the hit ratio of the given algorithm and return a dict mapping cache size -> hit ratio.

Parameters:
  • algorithm – cache replacement algorithms

  • cache_size – size of cache

  • cache_params – parameters passed to cache, some of the cache replacement algorithms require parameters, for example LRU-K, SLRU

  • bin_size – if the algorithm is not LRU, the hit ratio is calculated by simulating the cache at cache sizes [0, bin_size, bin_size*2 … cache_size]; this is not required for LRU

  • use_general_profiler – if the algorithm is LRU and you don’t want to use LRUProfiler, set this to True. Possible reasons for not using LRUProfiler: 1. LRUProfiler is too slow for your large trace, because the algorithm is O(NlogN) and single-threaded; 2. LRUProfiler has a bug (let me know if you find one).

  • kwargs – other parameters including num_of_threads

Returns:

a dict of hit ratios of the given algorithm, mapping from cache_size -> hit ratio
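The role of bin_size can be illustrated by simulating a simple non-LRU policy at the sampled cache sizes. The sketch below uses FIFO, chosen only because it is short; the function names and the sample trace are hypothetical, and this is not how PyMimircache's profilers are implemented. It assumes cache_size is a multiple of bin_size.

```python
from collections import deque


def fifo_hit_ratio(trace, cache_size):
    """Simulate a FIFO cache of the given size and return its hit ratio."""
    if cache_size <= 0 or not trace:
        return 0.0
    queue, cached, hits = deque(), set(), 0
    for obj in trace:
        if obj in cached:
            hits += 1
        else:
            if len(queue) >= cache_size:
                # Evict the object that entered the cache first.
                cached.discard(queue.popleft())
            queue.append(obj)
            cached.add(obj)
    return hits / len(trace)


def sampled_hit_ratio_dict(trace, cache_size, bin_size):
    """Map each sampled size in [0, bin_size, ..., cache_size] to a hit ratio."""
    sizes = range(0, cache_size + 1, bin_size)
    return {size: fifo_hit_ratio(trace, size) for size in sizes}


result = sampled_hit_ratio_dict([1, 2, 3, 1, 2, 3, 4, 1], cache_size=4, bin_size=2)
```

A smaller bin_size gives a finer curve at the cost of more simulations, one per sampled size.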

profiler(algorithm, cache_params=None, cache_size=-1, bin_size=-1, use_general_profiler=False, **kwargs)

Get a profiler instance. Most users should not need this.

Parameters:
  • algorithm – name of algorithm

  • cache_params – parameters of given cache replacement algorithm

  • cache_size – size of cache

  • bin_size – bin_size for generalProfiler

  • use_general_profiler

    This option is for LRU only. If it is True, a cGeneralProfiler is returned for LRU; otherwise, an LRUProfiler is returned.

    Note: LRUProfiler does not require the cache_size/bin_size params. It does not sample and thus provides a smooth curve, but it is O(logN) at each step. In contrast, cGeneralProfiler samples the curve but uses O(1) at each step.

  • kwargs – num_of_threads

Returns:

a profiler instance

heatmap(time_mode, plot_type, time_interval=-1, num_of_pixels=-1, algorithm='LRU', cache_params=None, cache_size=-1, **kwargs)

Plot heatmaps. The following heatmaps are currently supported:

  • hit_ratio_start_time_end_time

  • hit_ratio_start_time_cache_size (python only)

  • avg_rd_start_time_end_time (python only)

  • cold_miss_count_start_time_end_time (python only)

  • rd_distribution

  • rd_distribution_CDF

  • future_rd_distribution

  • dist_distribution

  • reuse_time_distribution

Parameters:
  • time_mode – the type of time, can be “v” for virtual time, or “r” for real time

  • plot_type – the name of plot types, see above for plot types

  • time_interval – the time interval of one pixel

  • num_of_pixels – if you don’t want to use time_interval, you can instead specify how many pixels you want in one dimension; note that this feature is not well tested

  • algorithm – what algorithm to use for plotting heatmap, this is not required for distance related heatmap like rd_distribution

  • cache_params – parameters passed to cache, some of the cache replacement algorithms require parameters, for example LRU-K, SLRU

  • cache_size – The size of cache, this is required only for hit_ratio_start_time_end_time

  • kwargs – other parameters for computation and plotting such as num_of_threads, figname

diff_heatmap(time_mode, plot_type, algorithm1='LRU', time_interval=-1, num_of_pixels=-1, algorithm2='Optimal', cache_params1=None, cache_params2=None, cache_size=-1, **kwargs)

Plot the differential heatmap between two algorithms by alg2 - alg1

Parameters:
  • cache_size – size of cache

  • time_mode – the type of time, “v” for virtual time, “r” for real time

  • plot_type – same as the name in heatmap function

  • algorithm1 – name of the first alg

  • time_interval – same as in heatmap

  • num_of_pixels – same as in heatmap

  • algorithm2 – name of the second algorithm

  • cache_params1 – parameters of the first algorithm

  • cache_params2 – parameters of the second algorithm

  • kwargs – other parameters, including num_of_threads

twoDPlot(plot_type, **kwargs)

An aggregate function for plotting all two-dimensional plots except the hit ratio curve.

plot type            required parameters        Description
------------------   ------------------------   -----------
cold_miss_count      time_mode, time_interval   cold miss count VS time
cold_miss_ratio      time_mode, time_interval   cold miss ratio VS time
request_rate         time_mode, time_interval   num of requests VS time
popularity           NA                         Percentage of obj VS frequency
rd_popularity        NA                         Num of req VS reuse distance
rt_popularity        NA                         Num of req VS reuse time
scan_vis_2d          NA                         mapping from original objID to sequential number
interval_hit_ratio   cache_size                 hit ratio of interval VS time

Parameters:
  • plot_type – type of the plot, see above

  • kwargs – parameters related to plots, see twoDPlots module for detailed control over plots
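To show what time_interval means for a time-based plot such as request_rate, the sketch below buckets request timestamps into fixed-width intervals and counts the requests in each. The function request_rate and the sample timestamps are hypothetical, and the sketch assumes timestamps are sorted in increasing order; it does not reproduce the plotting itself.

```python
from collections import Counter


def request_rate(timestamps, time_interval):
    """Count requests per interval, starting from the first timestamp."""
    if not timestamps:
        return []
    start = timestamps[0]
    # Interval index of each request relative to the start of the trace.
    buckets = Counter(int((t - start) // time_interval) for t in timestamps)
    # Intervals without requests are reported as 0.
    return [buckets.get(i, 0) for i in range(max(buckets) + 1)]


rate = request_rate([0, 1, 2, 10, 11, 25], time_interval=10)
```

Each entry of the result corresponds to one point on the x axis of a "num of requests VS time" plot.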

plotHRCs(algorithm_list, cache_params=(), cache_size=-1, bin_size=-1, auto_resize=True, figname='HRC.png', **kwargs)

this function provides hit ratio curve plotting

Parameters:
  • algorithm_list – a list of algorithm(s)

  • cache_params – the corresponding cache params for the algorithms; use None for algorithms that don’t require cache params. If none of the algorithms requires cache params, you don’t need to set this

  • cache_size – maximal size of cache, use -1 for max possible size

  • bin_size – bin size for non-LRU profiling

  • auto_resize – when using the max possible size, or when the specified cache size is too large, you will get a huge plateau at the end of the hit ratio curve; set auto_resize to True to cut off most of the plateau

  • figname – name of figure

  • kwargs

    options: block_unit_size, num_of_threads, auto_resize_threshold, xlimit, ylimit, cache_unit_size

    save_gradually - save a figure every time the computation for one algorithm finishes

    label - instead of using the algorithm list as labels, specify user-defined labels
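The pairing between algorithm_list and cache_params can be sketched as below. The helper pair_algorithms_with_params is hypothetical and only illustrates the "one entry per algorithm, None where no params are needed" convention. The parameter dict {"N": 2} for SLRU is a placeholder, not a documented key.

```python
def pair_algorithms_with_params(algorithm_list, cache_params=()):
    """Pair each algorithm with its cache params (None if it needs none)."""
    if not cache_params:
        # No params given: every algorithm runs without cache params.
        cache_params = [None] * len(algorithm_list)
    if len(cache_params) != len(algorithm_list):
        raise ValueError("cache_params must have one entry per algorithm")
    return list(zip(algorithm_list, cache_params))


# {"N": 2} is a placeholder parameter dict, assumed for illustration only.
pairs = pair_algorithms_with_params(
    ["LRU", "SLRU", "Optimal"], [None, {"N": 2}, None]
)
default_pairs = pair_algorithms_with_params(["LRU", "Optimal"])
```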

characterize(characterize_type, cache_size=-1, **kwargs)

Use this function to obtain a series of plots about your trace. The types include:

  • short - short run time, fewer plots with less accuracy

  • medium

  • long

  • all - most of the available plots with high accuracy; note that it can take a LONG time on a big trace

Parameters:
  • characterize_type – see above, options: short, medium, long, all

  • cache_size – estimated cache size for the trace, if -1, PyMimircache will estimate the cache size

  • kwargs – print_stat

Returns:

trace stat string
