tsseg.algorithms.patss.embedding package

Submodules

tsseg.algorithms.patss.embedding.FrequentPatternMiningEmbedder module

class tsseg.algorithms.patss.embedding.FrequentPatternMiningEmbedder.FrequentPatternMiningEmbedder(window_sizes=None, stride=1, word_size=16, alphabet_size=3, binning_method='global', top_k_patterns=25, min_pattern_size=3, max_pattern_size=10, duration=1.2, do_mdl=False, min_r_support=0.0, max_r_support=1.0, jaccard_similarity_threshold=0.9, nb_largest_variance=50, do_pca=False, n_jobs=1)[source]

Bases: PatternBasedEmbedder

Mines frequent sequential patterns from the time series data to construct a pattern-based embedding. This process happens in a few different steps.

First, the time series is preprocessed by extracting fixed-size subsequences at multiple resolutions from each attribute of the time series. These subsequences are subsequently discretized through SAX. Using multiple resolutions makes it possible to capture both short-term and long-term behavior.

Second, patterns are mined in every resolution and for each attribute independently. The patterns are then pruned using Jaccard similarity and the relative support to obtain a set of more interesting patterns.

Third, the patterns are converted to an embedding matrix by computing at which intervals each pattern occurs and replacing those values in the matrix by the relative support of that pattern. If multiple subsequences overlap and therefore cover the same observations, the average value over those subsequences is taken.
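
The first step can be illustrated with a minimal sketch that uses only the Python standard library. It is not the package's implementation: the helper names sliding_windows and sax_word are hypothetical, and the sketch assumes textbook SAX (z-normalization per window, PAA, then breakpoints that split a standard normal distribution into equiprobable regions), which is closest in spirit to per-subsequence discretization.

```python
from statistics import NormalDist, mean, pstdev


def sliding_windows(values, window_size, stride=1):
    """Return all fixed-size subsequences of `values` (hypothetical helper)."""
    return [values[i:i + window_size]
            for i in range(0, len(values) - window_size + 1, stride)]


def sax_word(window, word_size, alphabet_size):
    """Discretize one window into `word_size` symbols in 0..alphabet_size-1.

    Assumes word_size <= len(window) so that every PAA segment is non-empty.
    """
    # z-normalize the window; a constant window maps to all zeros.
    mu, sigma = mean(window), pstdev(window)
    z = [(v - mu) / sigma if sigma > 0 else 0.0 for v in window]
    # PAA: average the normalized values over `word_size` segments.
    n = len(z)
    paa = [mean(z[k * n // word_size:(k + 1) * n // word_size])
           for k in range(word_size)]
    # Breakpoints splitting a standard normal into equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(j / alphabet_size)
                   for j in range(1, alphabet_size)]
    # The symbol is the number of breakpoints below the PAA value.
    return [sum(p > b for b in breakpoints) for p in paa]


series = [0, 0, 0, 0, 5, 5, 5, 5]
# Two resolutions: short windows and one window spanning the whole series.
words = {size: [sax_word(w, word_size=2, alphabet_size=3)
                for w in sliding_windows(series, size, stride=2)]
         for size in (4, 8)}
```

Each entry of words holds the discretized subsequences for one window size; patterns would then be mined separately per resolution.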

Parameters:
  • window_sizes (List[int]) – The different window sizes or resolutions to use for extracting subsequences from the time series, from which patterns will be mined.

  • stride (int) – The amount with which the sliding windows will shift to extract fixed-size subsequences.

  • word_size (int) – The word size for SAX discretization. This is the number of discrete symbols to maintain in the discretized word through PAA.

  • alphabet_size (int) – The alphabet size for SAX discretization. This is the number of discrete symbols used to create words.

  • binning_method (str) –

    How to compute the discrete symbols with SAX. The possible options are:

    • 'global': Consider all values in the time series and discretize them jointly;

    • 'local': Discretize the values of each subsequence independently;

    • 'k_means': Use K-Means clustering to cluster the values in the time series and assign a discrete label to each cluster.

  • top_k_patterns (int) – The number of patterns with maximum relative support to mine in each resolution.

  • min_pattern_size (int) – The minimum length of a pattern before it should actually be considered relevant.

  • max_pattern_size (int) – The maximum length a pattern may have to be considered.

  • duration (float) – The constraint on the relative duration of a pattern before it covers a window. Specifically, for a pattern of length L, there may be at most np.floor(L * duration) gaps. The value should therefore be larger than 1.

  • do_mdl (bool) – Whether to prune the mined patterns based on the MDL-principle.

  • min_r_support (float) – The minimum relative support of a pattern before it should be considered relevant.

  • max_r_support (float) – The maximum relative support a pattern may have for it to still be interesting.

  • jaccard_similarity_threshold (float) – The threshold on the Jaccard similarity for two patterns to be considered similar.

  • nb_largest_variance (int) – The number of patterns with largest variance to maintain.

  • do_pca (bool) – Whether PCA should be applied so that the linear combinations of patterns with the largest variance are retained.

  • n_jobs (int) – The number of jobs that are allowed to run in parallel.
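
The difference between the 'global' and 'local' options of binning_method can be sketched as follows. This is an illustration only: it assumes equal-width bins, which may differ from how the package actually derives its bin edges, and the function names are hypothetical.

```python
def equal_width_symbols(values, lo, hi, alphabet_size):
    """Map values to symbols 0..alphabet_size-1 using equal-width bins on [lo, hi]."""
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / alphabet_size
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), alphabet_size - 1) for v in values]


def discretize(windows, alphabet_size, binning_method="global"):
    """Discretize every window, either with shared or with per-window bins."""
    if binning_method == "global":
        # One value range shared by all windows.
        flat = [v for window in windows for v in window]
        lo, hi = min(flat), max(flat)
        return [equal_width_symbols(w, lo, hi, alphabet_size) for w in windows]
    if binning_method == "local":
        # Each window is binned relative to its own minimum and maximum.
        return [equal_width_symbols(w, min(w), max(w), alphabet_size)
                for w in windows]
    raise ValueError(f"unsupported binning_method: {binning_method!r}")


windows = [[0, 1, 2], [7, 8, 9]]
global_symbols = discretize(windows, alphabet_size=3, binning_method="global")
local_symbols = discretize(windows, alphabet_size=3, binning_method="local")
```

With shared bins the two windows receive different symbols because their levels differ; with per-window bins they receive identical words because their shapes are the same.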

fit(time_series, y=None)[source]

Fitting and transforming a time series using the FrequentPatternMiningEmbedder are tightly coupled, and therefore the fit() method should not be used directly. Instead, use the fit_transform() method.

Raises:

Exception – Upon calling this method.

Return type:

FrequentPatternMiningEmbedder

fit_transform(time_series, y=None)[source]

Computes a pattern-based embedding for the given time series by mining frequent sequential patterns. This process happens in a few steps. (1) The time series is preprocessed by using multiple sliding windows of varying size to extract subsequences, which are discretized using SAX. (2) Frequent sequential patterns are mined within the discretized subsequences at each resolution and for each attribute. These patterns are pruned to obtain a set of more informative patterns. (3) The embedding matrix is computed by checking at which location in the time series each pattern occurs.

Parameters:
  • time_series (ndarray) – The time series from which the patterns should be mined, with n_samples the number of observations in the time series, and n_attributes the number of attributes.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns:

embedding – The computed pattern-based embedding.

Return type:

PatternBasedEmbedding

transform(trend_data)[source]

Fitting and transforming a time series using the FrequentPatternMiningEmbedder are tightly coupled, and therefore the transform() method should not be used directly. Instead, use the fit_transform() method.

Raises:

Exception – Upon calling this method.

Return type:

PatternBasedEmbedding

tsseg.algorithms.patss.embedding.PatternBasedEmbedder module

class tsseg.algorithms.patss.embedding.PatternBasedEmbedder.PatternBasedEmbedder[source]

Bases: ABC

abstractmethod fit(time_series, y=None)[source]

Fit this embedder to the given trend data, i.e., mine the patterns in the given time series.

Parameters:
  • time_series (ndarray) – The time series from which the patterns should be mined, with n_samples the number of observations in the time series, and n_attributes the number of attributes.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns:

self – Returns the instance itself

Return type:

PatternBasedEmbedder

fit_transform(time_series, y=None)[source]

Fit the embedder on the given trend data and transform it to a pattern-based embedding.

Parameters:
  • time_series (ndarray) – The time series from which the patterns should be mined, with n_samples the number of observations in the time series, and n_attributes the number of attributes.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns:

embedding – The computed pattern-based embedding.

Return type:

PatternBasedEmbedding

abstractmethod transform(time_series)[source]

Transforms the given trend data into a pattern-based embedding.

Parameters:

time_series (ndarray) – The time series to transform into a pattern-based embedding, with n_samples the number of observations in the time series, and n_attributes the number of attributes.

Returns:

embedding – The computed pattern-based embedding.

Return type:

PatternBasedEmbedding
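
The shape of this interface can be sketched with a self-contained toy. The classes below are hypothetical stand-ins, not the package's code, and the sketch assumes that the default fit_transform() simply chains fit() and transform(); a subclass such as FrequentPatternMiningEmbedder overrides fit_transform() instead.

```python
from abc import ABC, abstractmethod


class SketchEmbedder(ABC):
    """Stand-in for an abstract embedder with fit/transform/fit_transform."""

    @abstractmethod
    def fit(self, time_series, y=None):
        """Learn whatever is needed from the series and return self."""

    @abstractmethod
    def transform(self, time_series):
        """Return an embedding of the series."""

    def fit_transform(self, time_series, y=None):
        # Assumed default behavior: fit first, then transform the same data.
        return self.fit(time_series, y).transform(time_series)


class MeanThresholdEmbedder(SketchEmbedder):
    """Toy subclass: one 'pattern' that marks observations above the fitted mean."""

    def fit(self, time_series, y=None):
        self.mean_ = sum(time_series) / len(time_series)
        return self

    def transform(self, time_series):
        # One row per pattern, one column per observation.
        return [[1.0 if v > self.mean_ else 0.0 for v in time_series]]


embedding = MeanThresholdEmbedder().fit_transform([1, 2, 3, 6])
```

Because fit() and transform() are abstract, instantiating SketchEmbedder directly raises a TypeError; only concrete subclasses can be used.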

tsseg.algorithms.patss.embedding.PatternBasedEmbedding module

class tsseg.algorithms.patss.embedding.PatternBasedEmbedding.PatternBasedEmbedding(time_series, embedding_matrix, patterns)[source]

Bases: object

A class to maintain all information of a pattern-based embedding. This includes the time series itself, the embedding matrix, as well as the corresponding patterns.

Parameters:
  • time_series (ndarray) – The time series from which the patterns were mined, with n_samples the number of observations in the time series, and n_attributes the number of attributes.

  • embedding_matrix (ndarray) – The computed pattern-based embedding matrix, with n_patterns the number of patterns used to compute the embedding and n_samples the number of observations in the time series.

  • patterns (DataFrame) – The patterns that were used to compute the pattern-based embedding. Each row corresponds to a pattern, and the columns represent meta-information regarding the patterns, such as the exact pattern.

property embedding_matrix: ndarray
property patterns: DataFrame
property time_series: ndarray

tsseg.algorithms.patss.embedding.embedding_matrix module

tsseg.algorithms.patss.embedding.embedding_matrix.create_univariate_embedding(pbad_embedding, nb_patterns)[source]

Wrapper function in case you want to embed the time series differently using the patterns.

Parameters:
  • pbad_embedding – The PBAD embedding that is computed by the pattern mining. This is a list of tuples, containing the ID of the pattern and the support of that pattern

  • nb_patterns (int) – The total number of patterns mined

Returns:

A 2D numpy array with the embedding of the time series

tsseg.algorithms.patss.embedding.embedding_matrix.format_embedding_with_overlapping_windows(overlapping_windows_embedding, interval, stride)[source]

Format the embedding such that each individual time unit is embedded in case there are overlapping windows due to a stride smaller than the interval length

Parameters:
  • overlapping_windows_embedding – The embedding of the time series per subsequence, thus each time unit is embedded in multiple subsequences

  • interval – The interval length of a subsequence used for mining patterns

  • stride – The stride used in extracting the subsequences with a rolling window

Returns:

An embedding such that each time unit has a single feature representation
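
The idea of averaging overlapping windows can be sketched for a single pattern as follows. This is a simplified illustration with a hypothetical function name; it assumes that window i starts at position i * stride and that the per-window value is spread over all time units the window covers.

```python
def average_overlapping_windows(window_values, interval, stride):
    """Turn one value per window into one value per time unit.

    window_values[i] is the embedding value of the window that starts at
    i * stride and covers `interval` time units.
    """
    n_samples = (len(window_values) - 1) * stride + interval
    sums = [0.0] * n_samples
    counts = [0] * n_samples
    for i, value in enumerate(window_values):
        start = i * stride
        for t in range(start, start + interval):
            sums[t] += value
            counts[t] += 1
    # Time units covered by no window (stride > interval) default to 0.0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


per_time_unit = average_overlapping_windows([1.0, 0.0, 1.0], interval=2, stride=1)
```

Here the two inner time units are each covered by two windows with values 1.0 and 0.0, so they receive the average 0.5.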

tsseg.algorithms.patss.embedding.pattern_filter module

tsseg.algorithms.patss.embedding.pattern_filter.filter_jaccard_similarity(patterns, threshold)[source]

Filter the given patterns using the Jaccard similarity

Parameters:
  • patterns (DataFrame) – The patterns to filter

  • threshold – The upper threshold for the Jaccard index

Returns:

The patterns that remain after filtering
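
A minimal sketch of Jaccard-based filtering is given below. It is an assumption-laden illustration: patterns are represented by the set of window indices in which they occur, a pattern is dropped when its similarity to an already kept pattern reaches the threshold, and the function names are hypothetical. The package may compare patterns differently.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_by_jaccard(occurrences, threshold):
    """Greedily keep patterns that are not too similar to any kept pattern.

    `occurrences` maps a pattern name to the window indices where it occurs;
    patterns are visited in insertion order.
    """
    kept = []
    for name, windows in occurrences.items():
        if all(jaccard(windows, occurrences[k]) < threshold for k in kept):
            kept.append(name)
    return kept


occurrences = {"ab": [0, 1, 2, 3], "abc": [0, 1, 2, 3], "ba": [5, 6]}
remaining = filter_by_jaccard(occurrences, threshold=0.9)
```

In this toy input "abc" occurs in exactly the same windows as "ab", so it is considered redundant and removed.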

tsseg.algorithms.patss.embedding.pattern_filter.filter_jaccard_similarity_in_embedding(embedding, patterns, threshold)[source]

Filter the patterns with the Jaccard similarity, but in the embedding space

Parameters:
  • embedding (ndarray) – The embedding matrix that should be filtered with the jaccard similarity

  • patterns (DataFrame) – The patterns that should also be filtered, to still match the embedding

  • threshold – The threshold used in the Jaccard similarity

Returns:

Both the embedding and patterns, but without those instances that did not satisfy the given threshold in the Jaccard index

tsseg.algorithms.patss.embedding.pattern_filter.filter_maximum_variance(embedding, patterns, nb_patterns, do_pca)[source]

Filter the embedding by keeping the features with maximum variance

Parameters:
  • embedding (ndarray) – The embedding to filter

  • patterns (DataFrame) – The patterns that also should be filtered

  • nb_patterns (int) – The number of patterns to keep after filtering

  • do_pca (bool) – Whether or not to do PCA. If this is True, then the patterns are first transformed to a new space using a linear transformation, after which the features with maximum variance are selected.

Returns:

Both the embedding and the patterns, but with the given number of features.
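
The selection without PCA can be sketched as below. The sketch assumes that each row of the embedding corresponds to one pattern and uses the population variance from the standard library; the function name is hypothetical and the PCA branch is not covered.

```python
from statistics import pvariance


def keep_largest_variance(embedding, nb_patterns):
    """Keep the `nb_patterns` rows with the largest variance.

    Returns the kept rows and their original indices, so that the matching
    pattern metadata can be filtered in the same way.
    """
    order = sorted(range(len(embedding)),
                   key=lambda i: pvariance(embedding[i]),
                   reverse=True)
    kept = sorted(order[:nb_patterns])  # restore the original row order
    return [embedding[i] for i in kept], kept


embedding = [
    [0.5, 0.5, 0.5, 0.5],  # constant row: variance 0
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
rows, indices = keep_largest_variance(embedding, nb_patterns=2)
```

The constant row carries no information about where the series changes, which is why it is the first to be discarded.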

tsseg.algorithms.patss.embedding.pattern_filter.filter_pbad_embedding(filtered_patterns, pbad_embedding)[source]

tsseg.algorithms.patss.embedding.pattern_mining module

tsseg.algorithms.patss.embedding.pattern_mining.mine_patterns_univariate(data, files_prefix='../temp/', interval=24, stride=24, nb_symbols=10, nb_bins=5, binning_method='global', top_k_patterns=10000, min_pattern_size=4, max_pattern_size=10, pattern_duration=1.2, do_mdl=True)[source]

Mine the frequent sequential patterns in the given time series.

Parameters:
  • data (DataFrame) – A univariate time series, a DataFrame with two columns: ‘time’ and ‘average_value’

  • files_prefix (str) – The prefix for the temporary files, used to avoid race conditions

  • interval (int) – The interval length of the subsequences

  • stride (int) – The stride to use for extracting subsequences

  • nb_symbols (int) – The word size to use for SAX

  • nb_bins (int) – The alphabet size to use for SAX

  • binning_method (str) – The method used to assign symbols: ‘global’, ‘local’ (per subsequence) or ‘k_means’

  • top_k_patterns (int) – The number of patterns with highest frequency to mine

  • min_pattern_size (int) – The minimal length of the patterns

  • max_pattern_size (int) – The maximum pattern size

  • pattern_duration (float) – The constraint on the relative duration of a pattern

  • do_mdl (bool) – Whether to filter the patterns using MDL

Returns:

A DataFrame containing the patterns, the PBAD embedding corresponding to these patterns, the windows with raw values and the discrete segments
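
The notion of relative support used when ranking and pruning patterns can be sketched as follows. This is not the mining algorithm itself: the sketch assumes a pattern matches a discretized window when it occurs as a subsequence with gaps allowed, ignores the duration constraint, and uses hypothetical function names.

```python
def contains(word, pattern):
    """True if `pattern` occurs in `word` as a subsequence (gaps allowed)."""
    symbols = iter(word)
    # `in` on an iterator consumes it, so the match order is preserved.
    return all(symbol in symbols for symbol in pattern)


def relative_support(words, pattern):
    """Fraction of discretized windows in which the pattern occurs."""
    return sum(contains(w, pattern) for w in words) / len(words)


def within_support(words, patterns, min_r_support=0.0, max_r_support=1.0):
    """Keep the patterns whose relative support lies in the given range."""
    return [p for p in patterns
            if min_r_support <= relative_support(words, p) <= max_r_support]


words = ["abca", "abba", "cccc", "acab"]
support = relative_support(words, "aba")
selected = within_support(words, ["aba", "c", "a"],
                          min_r_support=0.4, max_r_support=0.6)
```

A lower bound removes patterns that are too rare to be reliable, while an upper bound removes patterns that occur almost everywhere and therefore do not discriminate between segments.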

Module contents

This module offers all functionality to create a pattern-based embedding! In essence, a PatternBasedEmbedder is used to transform a time series into a PatternBasedEmbedding.