tsseg.algorithms.patss.embedding package
Submodules
tsseg.algorithms.patss.embedding.FrequentPatternMiningEmbedder module
- class tsseg.algorithms.patss.embedding.FrequentPatternMiningEmbedder.FrequentPatternMiningEmbedder(window_sizes=None, stride=1, word_size=16, alphabet_size=3, binning_method='global', top_k_patterns=25, min_pattern_size=3, max_pattern_size=10, duration=1.2, do_mdl=False, min_r_support=0.0, max_r_support=1.0, jaccard_similarity_threshold=0.9, nb_largest_variance=50, do_pca=False, n_jobs=1)[source]
Bases: PatternBasedEmbedder
Mines frequent sequential patterns from the time series data to construct a pattern-based embedding. This process consists of three steps.
First, the time series is preprocessed by extracting fixed-size subsequences at multiple resolutions from each attribute of the time series. These subsequences are then discretized through SAX. Using multiple resolutions makes it possible to capture both short-term and long-term behavior.
Second, patterns are mined in every resolution and for each attribute independently. The patterns are then pruned using Jaccard similarity and the relative support to obtain a set of more interesting patterns.
Third, the patterns are converted to an embedding matrix by computing at which intervals each pattern occurs and replacing those values in the matrix with the relative support of that pattern. If multiple subsequences overlap and therefore cover the same observations, the average value over those subsequences is taken.
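The preprocessing step can be illustrated with a minimal NumPy sketch. It is not the code used by this class: the helper names `sliding_windows` and `sax_words` are hypothetical, the 'global' binning is approximated by z-normalizing the whole series and using Gaussian breakpoints, and the window size is assumed to be divisible by the word size.

```python
from statistics import NormalDist

import numpy as np


def sliding_windows(values, window_size, stride):
    """Extract fixed-size subsequences with a sliding window (hypothetical helper)."""
    starts = range(0, len(values) - window_size + 1, stride)
    return np.array([values[s:s + window_size] for s in starts])


def sax_words(windows, word_size, alphabet_size):
    """Discretize windows: PAA down to word_size values, then bin into symbols."""
    # Breakpoints that split a standard normal into equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]
    # PAA: average over word_size equally sized chunks (assumes divisibility).
    paa = windows.reshape(len(windows), word_size, -1).mean(axis=2)
    # Each PAA value becomes an integer symbol in [0, alphabet_size).
    return np.digitize(paa, breakpoints)


rng = np.random.default_rng(0)
series = rng.normal(size=200)
# Assumption for this sketch: 'global' binning normalizes with whole-series statistics.
normalized = (series - series.mean()) / series.std()

for window_size in (16, 32):  # two resolutions
    windows = sliding_windows(normalized, window_size, stride=1)
    words = sax_words(windows, word_size=8, alphabet_size=3)
    print(window_size, words.shape)
```

Each resolution yields its own set of discretized words, which is what allows patterns to be mined per resolution afterwards.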
- Parameters:
window_sizes (List[int]) – The different window sizes or resolutions to use for extracting subsequences from the time series, from which patterns will be mined.
stride (int) – The amount by which the sliding windows shift when extracting fixed-size subsequences.
word_size (int) – The word size for SAX discretization. This is the number of discrete symbols to maintain in the discretized word through PAA.
alphabet_size (int) – The alphabet size for SAX discretization. This is the number of discrete symbols used to create words.
binning_method (str) – How to compute the discrete symbols with SAX. The possible options are:
'global': Consider all values in the time series and discretize them jointly;
'local': Discretize the values of each subsequence independently;
'k_means': Use K-Means clustering to cluster the values in the time series and assign a discrete label to each cluster.
top_k_patterns (int) – The number of patterns with maximum relative support to mine in each resolution.
min_pattern_size (int) – The minimum length a pattern must have to be considered relevant.
max_pattern_size (int) – The maximum length a pattern may have to be considered.
duration (float) – The constraint on the relative duration of a pattern before it covers a window. Specifically, for a pattern of length L, there may be at most np.floor(L * duration) gaps. The value should therefore be larger than 1.
do_mdl (bool) – Whether to prune the mined patterns based on the MDL principle.
min_r_support (float) – The minimum relative support a pattern must have to be considered relevant.
max_r_support (float) – The maximum relative support a pattern may have for it to still be interesting.
jaccard_similarity_threshold (float) – The threshold on the Jaccard similarity for two patterns to be considered similar.
nb_largest_variance (int) – The number of patterns with largest variance to maintain.
do_pca (bool) – Whether PCA should be performed to maintain the number of linear combinations of patterns with largest variance.
n_jobs (int) – The number of jobs that are allowed to run in parallel.
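As a small worked example of the duration and relative-support constraints described in the parameter list, the sketch below evaluates the bound np.floor(L * duration) with the standard library and checks a support band. Both function names are hypothetical, and treating the support bounds as inclusive is an assumption.

```python
import math


def max_gaps(pattern_length, duration):
    """Bound stated for `duration`: floor(L * duration) for a pattern of length L."""
    return math.floor(pattern_length * duration)


def has_relevant_support(r_support, min_r_support=0.0, max_r_support=1.0):
    """True if the relative support lies in [min_r_support, max_r_support] (inclusive by assumption)."""
    return min_r_support <= r_support <= max_r_support


for length in (3, 4, 10):
    print(length, max_gaps(length, duration=1.2))
```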
- fit(time_series, y=None)[source]
Fitting and transforming a time series using the FrequentPatternMiningEmbedder are tightly coupled, and therefore the fit() method should not be used directly. Instead, use the fit_transform() method.
- Raises:
Exception – Upon calling this method.
- Return type:
- fit_transform(time_series, y=None)[source]
Computes a pattern-based embedding for the given time series by mining frequent sequential patterns. This process happens in a few steps. (1) The time series is preprocessed by using multiple sliding windows of varying size to extract subsequences, which are discretized using SAX. (2) Frequent sequential patterns are mined within the discretized subsequences at each resolution and for each attribute. These patterns are pruned to obtain a set of more informative patterns. (3) The embedding matrix is computed by checking at which locations in the time series each pattern occurs.
- Parameters:
time_series (ndarray) – The time series from which the patterns should be mined, with n_samples the number of observations in the time series, and n_attributes the number of attributes.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns:
embedding – The computed pattern-based embedding.
- Return type:
- transform(trend_data)[source]
Fitting and transforming a time series using the FrequentPatternMiningEmbedder are tightly coupled, and therefore the transform() method should not be used directly. Instead, use the fit_transform() method.
- Raises:
Exception – Upon calling this method.
- Return type:
tsseg.algorithms.patss.embedding.PatternBasedEmbedder module
- class tsseg.algorithms.patss.embedding.PatternBasedEmbedder.PatternBasedEmbedder[source]
Bases: ABC
- abstractmethod fit(time_series, y=None)[source]
Fit this embedder to the given trend data, i.e., mine the patterns in the given time series.
- Parameters:
time_series (ndarray) – The time series from which the patterns should be mined, with n_samples the number of observations in the time series, and n_attributes the number of attributes.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns:
self – Returns the instance itself
- Return type:
- fit_transform(time_series, y=None)[source]
Fit the embedder on the given trend data and transform it into a pattern-based embedding.
- Parameters:
time_series (ndarray) – The time series from which the patterns should be mined, with n_samples the number of observations in the time series, and n_attributes the number of attributes.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns:
embedding – The computed pattern-based embedding.
- Return type:
- abstractmethod transform(time_series)[source]
Transforms the given trend data into a pattern-based embedding.
- Parameters:
time_series (ndarray) – The time series to transform into a pattern-based embedding, with n_samples the number of observations in the time series, and n_attributes the number of attributes.
- Returns:
embedding – The computed pattern-based embedding.
- Return type:
tsseg.algorithms.patss.embedding.PatternBasedEmbedding module
- class tsseg.algorithms.patss.embedding.PatternBasedEmbedding.PatternBasedEmbedding(time_series, embedding_matrix, patterns)[source]
Bases: object
A class to maintain all information of a pattern-based embedding. This includes the time series itself, the embedding matrix, as well as the corresponding patterns.
- Parameters:
time_series (ndarray) – The time series from which the patterns were mined, with n_samples the number of observations in the time series, and n_attributes the number of attributes.
embedding_matrix (ndarray) – The computed pattern-based embedding matrix, with n_patterns the number of patterns used to compute the embedding and n_samples the number of observations in the time series.
patterns (DataFrame) – The patterns that were used to compute the pattern-based embedding. Each row corresponds to a pattern, and the columns represent meta-information regarding the patterns, such as the exact pattern.
tsseg.algorithms.patss.embedding.embedding_matrix module
- tsseg.algorithms.patss.embedding.embedding_matrix.create_univariate_embedding(pbad_embedding, nb_patterns)[source]
Wrapper function in case you want to embed the time series differently using the patterns.
- Parameters:
pbad_embedding – The PBAD embedding that is computed by the pattern mining. This is a list of tuples, containing the ID of the pattern and the support of that pattern
nb_patterns (
int) – The total number of patterns mined
- Returns:
A 2D numpy array with the embedding of the time series
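To make the input and output of such a wrapper concrete, here is a hedged sketch. It assumes the embedding arrives as one list of (pattern ID, support) tuples per window and that the result has one row per window and one column per pattern; the real function may organize its data differently, and `embed_windows` is a hypothetical name.

```python
import numpy as np


def embed_windows(window_matches, nb_patterns):
    """Build a dense (n_windows, nb_patterns) matrix from (pattern_id, support) tuples."""
    embedding = np.zeros((len(window_matches), nb_patterns))
    for window_index, matches in enumerate(window_matches):
        for pattern_id, support in matches:
            # A pattern that occurs in a window contributes its support.
            embedding[window_index, pattern_id] = support
    return embedding


# Three windows, three patterns; the second window matches no pattern.
matches = [[(0, 0.5)], [], [(1, 0.25), (2, 0.75)]]
print(embed_windows(matches, nb_patterns=3))
```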
- tsseg.algorithms.patss.embedding.embedding_matrix.format_embedding_with_overlapping_windows(overlapping_windows_embedding, interval, stride)[source]
Format the embedding such that each individual time unit is embedded in case there are overlapping windows due to a stride smaller than the interval length
- Parameters:
overlapping_windows_embedding – The embedding of the time series per subsequence, thus each time unit is embedded in multiple subsequences
interval – The interval length of a subsequence used for mining patterns
stride – The stride used in extracting the subsequences with a rolling window
- Returns:
An embedding such that each time unit has a single feature representation
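The averaging over overlapping windows can be sketched as follows. This is an illustration under stated assumptions rather than the function's actual code: the per-window embedding is taken to have shape (n_windows, n_features), the stride is assumed not to exceed the interval, and `average_overlapping_windows` is a hypothetical name.

```python
import numpy as np


def average_overlapping_windows(window_embedding, interval, stride):
    """Average per-window features over every window that covers a time unit."""
    n_windows, n_features = window_embedding.shape
    n_samples = (n_windows - 1) * stride + interval
    totals = np.zeros((n_samples, n_features))
    counts = np.zeros(n_samples)
    for w in range(n_windows):
        start = w * stride
        # Every time unit inside the window receives the window's features.
        totals[start:start + interval] += window_embedding[w]
        counts[start:start + interval] += 1
    # With stride <= interval every time unit is covered at least once.
    return totals / counts[:, None]


per_window = np.array([[1.0], [3.0]])
print(average_overlapping_windows(per_window, interval=2, stride=1))
```

In this example the middle time unit is covered by both windows, so it receives the mean of their values.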
tsseg.algorithms.patss.embedding.pattern_filter module
- tsseg.algorithms.patss.embedding.pattern_filter.filter_jaccard_similarity(patterns, threshold)[source]
Filter the given patterns using the Jaccard similarity
- Parameters:
patterns (DataFrame) – The patterns to filter
threshold – The upper threshold for the Jaccard index
- Returns:
The patterns that remain after filtering
- tsseg.algorithms.patss.embedding.pattern_filter.filter_jaccard_similarity_in_embedding(embedding, patterns, threshold)[source]
Filter the patterns with the Jaccard similarity, but in the embedding space
- Parameters:
- Returns:
Both the embedding and patterns, but without those instances that did not satisfy the given threshold in the Jaccard index
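One way to picture Jaccard filtering in the embedding space is a greedy pass over the rows of the embedding. The sketch assumes an embedding of shape (n_patterns, n_samples), treats a pattern as occurring wherever its row is positive, and drops a pattern once its Jaccard index with an already kept pattern reaches the threshold. These choices and the function names are assumptions, not the library's documented behavior.

```python
import numpy as np


def jaccard_index(a, b):
    """Jaccard index of two boolean occurrence vectors."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0


def filter_similar_rows(embedding, threshold):
    """Return indices of rows kept by a greedy Jaccard filter."""
    occurs = embedding > 0
    kept = []
    for i in range(len(embedding)):
        # Keep row i only if it is dissimilar to every row kept so far.
        if all(jaccard_index(occurs[i], occurs[j]) < threshold for j in kept):
            kept.append(i)
    return kept


embedding = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.4, 0.4, 0.0, 0.0],  # occurs at the same positions as the first row
    [0.0, 0.0, 0.3, 0.3],
])
kept = filter_similar_rows(embedding, threshold=0.9)
print(kept)
```

The kept indices can then be used to select the matching rows of both the embedding and the patterns table.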
- tsseg.algorithms.patss.embedding.pattern_filter.filter_maximum_variance(embedding, patterns, nb_patterns, do_pca)[source]
Filter the embedding by taking the maximum variance
- Parameters:
embedding (ndarray) – The embedding to filter
patterns (DataFrame) – The patterns that should also be filtered
nb_patterns (int) – The number of patterns to keep after filtering
do_pca (bool) – Whether or not to do PCA. If this is True, then the patterns are first transformed to a new space using a linear transformation, after which the features with maximum variance are selected.
- Returns:
Both the embedding and the patterns, but with the given number of features.
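The variance-based selection can be sketched with plain NumPy. The example assumes an embedding of shape (n_features, n_samples) and implements the optional PCA with an SVD of the centered matrix; `keep_largest_variance` is a hypothetical name and the real function may differ in orientation and in how it reports the selected patterns.

```python
import numpy as np


def keep_largest_variance(embedding, nb_patterns, do_pca=False):
    """Return the nb_patterns rows with largest variance and their indices."""
    if do_pca:
        centered = embedding - embedding.mean(axis=1, keepdims=True)
        # Left singular vectors give the principal directions in feature space.
        u, _, _ = np.linalg.svd(centered, full_matrices=False)
        # Rows become principal component scores (linear combinations of features).
        embedding = u.T @ centered
    order = np.argsort(embedding.var(axis=1))[::-1][:nb_patterns]
    return embedding[order], order


embedding = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 10.0, 0.0, 10.0],
])
rows, order = keep_largest_variance(embedding, nb_patterns=2)
rows_pca, _ = keep_largest_variance(embedding, nb_patterns=2, do_pca=True)
print(order, rows_pca.shape)
```

Note that with PCA enabled the returned rows no longer correspond to individual patterns, only to combinations of them.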
tsseg.algorithms.patss.embedding.pattern_mining module
- tsseg.algorithms.patss.embedding.pattern_mining.mine_patterns_univariate(data, files_prefix='../temp/', interval=24, stride=24, nb_symbols=10, nb_bins=5, binning_method='global', top_k_patterns=10000, min_pattern_size=4, max_pattern_size=10, pattern_duration=1.2, do_mdl=True)[source]
Mine the frequent sequential patterns in the given time series.
- Parameters:
data (DataFrame) – A univariate time series, a DataFrame with two columns: ‘time’ and ‘average_value’
files_prefix (str) – The prefix for the temporary files, used to avoid race conditions
interval (int) – The interval length of the subsequences
stride (int) – The stride to use for extracting subsequences
nb_symbols (int) – The word size to use for SAX
nb_bins (int) – The alphabet size to use for SAX
binning_method (str) – The method used to assign symbols: ‘global’, ‘local’ (per subsequence) or ‘k_means’
top_k_patterns (int) – The number of patterns with highest frequency to mine
min_pattern_size (int) – The minimum pattern size
max_pattern_size (int) – The maximum pattern size
pattern_duration (float) – The constraint on the relative duration of a pattern
do_mdl (bool) – Whether to filter the patterns using MDL
- Returns:
A DataFrame containing the patterns, the PBAD embedding corresponding to these patterns, the windows with raw values and the discrete segments
Module contents
This module offers all functionality to create a pattern-based embedding! In essence,
a PatternBasedEmbedder is used to transform a time series into a
PatternBasedEmbedding.