Processing

This module contains some processing functions that while not belonging to a specific class, require them to already be loaded and DISSTANS to be initialized.

For general helper functions, see tools.

unwrap_dict_and_ts

@disstans.processing.unwrap_dict_and_ts(func)[source]

A wrapper decorator that aims at simplifying the coding of processing functions. Ideally, a new function that doesn’t need to know if its input is a Timeseries, DataFrame, ndarray or a dictionary containing them, should not need to reimplement a check and conversion for all of these because they just represent a data array of some form. So, by providing this function decorator, a wrapped function only needs to be able to work for a single array (plus some optional keyword arguments). The wrapping will extract the data of the input types and convert the returned array from func into the original format.

Example

A basic function of the form:

def func(in_array, **kw_args):
    # do some things
    return out_array

that takes and returns a NumPy array can be wrapped as follows to be able to also take and return all the other data forms:

@unwrap_dict_and_ts
def func(in_array, **kw_args):
    # do some things
    return out_array
Parameters

func (Callable) – Function to be wrapped.

Return type

Callable

Functions

clean

disstans.processing.clean(station, ts_in, reference, ts_out=None, clean_kw_args={}, reference_callable_args={})[source]

Function operating on a single station’s timeseries to clean it from outliers, and mask it out if the data is not good enough. The criteria are set by defaults but can be overriden by providing clean_kw_args. The criteria are:

  • 'min_obs': Minimum number of observations the timeseries has to contain.

  • 'std_outlier': Classify as an outlier any observation that is this many standard deviations away from the reference.

  • 'iqr_outlier': Classify as an outlier any observation that is this many inter-quartile ranges (IQR, difference between the 25th and 75th percentile) away from the reference’s 25th-75th percentile range.

  • 'std_thresh': After the removal of outliers, the maximum standard deviation that the residual between reference and input timeseries is allowed to have.

  • 'min_clean_obs': After the removal of outliers, the minimum number of observations the timeseries has to contain.

Parameters
  • station (Station) – Station to operate on.

  • ts_in (str) – Description of the timeseries to clean.

  • reference (str | Timeseries | Callable) – Reference timeseries. If string, checks for a timeseries with that description in the station. If a Timeseries instance, use it directly. If a function, the reference timeseries will be calculated as t_ref = reference(ts_in, **reference_callable_args).

  • ts_out (str, default: None) – If provided, duplicate ts_in to a new timeseries ts_out and clean the copy (to preserve the raw timeseries).

  • clean_kw_args (dict[str, Any], default: {}) – Override the default cleaning criteria in defaults, see the explanations above.

  • reference_callable_args (dict[str, Any], default: {}) – If reference is a function, reference_callable_args can be used to pass additional keyword arguments to the former when calculating the reference timeseries.

Return type

None

Warning

By default, this function operates in-place. If you don’t wish to overwrite the raw input timeseries, specify ts_out.

decompose

disstans.processing.decompose(array, method, num_components=1, return_sources=False, rng=None)[source]

Decomposes the input signal into different components using PCA or ICA.

Parameters
  • array (ndarray) – Input array of shape \((\text{num_observations},\text{n_stations})\) (can contain NaNs). Wrapped by unwrap_dict_and_ts() to also accept Timeseries, DataFrame and dictionaries of them as input (i.e. the output of export_network_ts()).

  • method (str) – Method to use to decompose the array. Possible values are 'pca' and 'ica': 'pca' uses PCA (motivated by [dong06]), whereas 'ica' uses FastICA (motivated by [huang12]).

  • num_components (int, default: 1) – Number of components to estimate. If None, all are used.

  • return_sources (bool, default: False) – If True, return not only the best-fit model, but also the sources themselves in space and time. Defaults to False.

  • rng (Optional[Generator], default: None) – Random number generator instance to use to fill missing values.

Return type

tuple[ndarray, Optional[ndarray], Optional[ndarray]]

Returns

  • model – Best-fit model with shape \((\text{num_observations},\text{n_stations})\).

  • temporal – Only if return_sources=True: Temporal source with shape \((\text{num_observations},\text{num_components})\).

  • spatial – Only if return_sources=True: Spatial source with shape \((\text{num_components},\text{n_stations})\).

References

dong06

Dong, D., Fang, P., Bock, Y., Webb, F., Prawirodirdjo, L., Kedar, S., & Jamason, P. (2006). Spatiotemporal filtering using principal component analysis and Karhunen-Loeve expansion approaches for regional GPS network analysis. Journal of Geophysical Research: Solid Earth, 111(B3). doi:10.1029/2005JB003806.

huang12

Huang, D. W., Dai, W. J., & Luo, F. X. (2012). ICA Spatiotemporal Filtering Method and Its Application in GPS Deformation Monitoring. Applied Mechanics and Materials, 204–208, 2806–2812. doi:10.4028/www.scientific.net/AMM.204-208.2806.

median

disstans.processing.median(array, kernel_size)[source]

Computes the rolling median filter column-wise. Missing observations (NaNs) are ignored during the median calculation, but missing observations are not imputed from the rolling median (i.e., a NaN value remains a NaN value, but does not affect surrounding values).

Parameters
  • array (ndarray) – 2D input array (can contain NaNs). Wrapped by unwrap_dict_and_ts() to also accept Timeseries, DataFrame and dictionaries of them as input.

  • kernel_size (int) – Kernel size (length of moving window to compute the median over). Has to be an odd number.

Return type

ndarray

Returns

2D filtered array (may still contain NaNs).

midas

disstans.processing.midas(ts, steps=None, tolerance=0.001)[source]

This function performs the MIDAS estimate as described by [blewitt16]. It is adapted from the Fortran code provided by the author (see selectpair() for more details and original copyright).

MIDAS returns the median estimate of secular (constant) velocities in all data components using data pairs spanning exactly one year. By not including pairs crossing known step epochs and using a fixed period, the influence of unmodeled jumps and seasonal variations can be minimized. At the end, an empirical estimate of the velocities’ uncertainty is calculated as well.

Parameters
  • ts (Timeseries) – Timeseries to perform the MIDAS algorithm on.

  • steps (UnionType[Series, DatetimeIndex, None], default: None) – If given, a pandas Series or Index of step times, across which no pairs should be formed.

  • tolerance (float, default: 0.001) – Tolerance when enforcing the one-year period of pairs (in 365.25-days-long years`).

Return type

tuple[Polynomial, Timeseries, dict[str, Any]]

Returns

  • mdl – Fitted polynomial (offset & constant velocity) model.

  • res – Residual timeseries.

  • stats – Fittings statistics computed along the way. 'num_epochs', 'num_used', 'num_pairs', and 'nstep' are the number of epochs in ts, the number of epochs used in the velocity pairs, the number of pairs formed, and the number of included steps, respectively. 'frac_removed' and 'sd_velpairs' are the fraction of removed pairs (because of velocity pairs more than two standard deviations away from their medians) and the estimated standard deviation of the velocity pairs, respectively, and for each component.

References

blewitt16

Blewitt, G., Kreemer, C., Hammond, W. C., & Gazeaux, J. (2016). MIDAS robust trend estimator for accurate GPS station velocities without step detection. Journal of Geophysical Research: Solid Earth, 121(3), 2054–2068. doi:10.1002/2015JB012552

Classes

StepDetector

class disstans.processing.StepDetector(kernel_size=None, kernel_size_min=0, x=None, y=None)[source]

This class implements a step detector based on the Akaike Information Criterion (AIC).

A window is moved over the input data, and two linear models are fit in the method search(): one containing only a linear polynomial, and one containing an additional step in the middle of the window. Then, using the AIC, the relative probabilities are calculated, and saved for each timestep.

In the final step, one can threshold these relative probabilities with the method steps(), and look for local maxima, which will correspond to probable steps.

If the class is constructed with kernel_size, x and y passed, it automatically calls its method search(), otherwise, search() needs to be called manually. Running the method again with a different kernel_size, x or y will overwrite previous results.

Parameters
  • kernel_size (Optional[int], default: None) – Window size of the detector. Must be odd.

  • kernel_size_min (int, default: 0) – Minimum window size of the detector (for edges). Must be smaller than or equal to kernel_size.

  • x (Optional[ndarray], default: None) – Input array of shape \((\text{num_observations},)\). Should not contain NaNs.

  • y (Optional[ndarray], default: None) – Input array of shape \((\text{num_observations}, \text{num_components})\). Can contain NaNs.

static AIC_c(rss, n, K)[source]

Calculates the Akaike Information Criterion for small samples for Least Squares regression results. Implementation is based on [burnhamanderson02] (ch. 2).

Parameters
  • rss (float) – Residual sum of squares.

  • n (int) – Number of samples.

  • K (int) – Degrees of freedom. If the Least Squares model has \(\text{num_parameters}\) parameters (including the mean), then the degrees of freedom are \(K = \text{num_parameters} + 1\)

Return type

float

Returns

The Small Sample AIC for Least Squares \(\text{AIC}_c\).

References

burnhamanderson02

(2002) Information and Likelihood Theory: A Basis for Model Selection and Inference. In: Burnham K.P., Anderson D.R. (eds) Model Selection and Multimodel Inference. Springer, New York, NY. doi:10.1007/978-0-387-22456-5_2.

property kernel_size: int

Kernel (window) size of the detector.

property kernel_size_min: int

Minimum kernel (window) size of the detector.

search(x, y, maxdel=10.0)[source]

Function that will search for steps in the data. Upon successful completion, it will return the relative step probabilities as well as the residuals variances of the two hypotheses tested (as reported by test_single()).

Parameters
  • x (ndarray) – Input array of shape \((\text{num_observations},)\). Should not contain NaNs.

  • y (ndarray) – Input array of shape \((\text{num_observations}, \text{num_components})\). Can contain NaNs.

  • maxdel (float, default: 10.0) – Difference in AIC that should be considered not significantly better. (Refers to \(\Delta_i = \text{AIC}_{c,i} - \text{AIC}_{c,\text{min}}\).)

Return type

tuple[ndarray, ndarray, ndarray]

Returns

  • probabilities – Contains the relative probabilities array. Has shape \((\text{num_observations}, \text{num_components})\).

  • var0 – Contains the array of the residuals variance of the hypothesis that no step is present. Has shape \((\text{num_observations}, \text{num_components})\).

  • var1 – Contains the array of the residuals variance of the hypothesis that a step is present. Has shape \((\text{num_observations}, \text{num_components})\).

See also

test_single()

For more explanations about the return values.

search_catalog(net, ts_description, catalog, threshold=None, gap=2.0, gap_unit='D', keep_nan_probs=True, no_pbar=False)[source]

Search a dictionary of potential step times for each station in the dictionary and assess the probability for each one.

Parameters
  • net – Network instance to operate on.

  • ts_descriptionTimeseries description that will be analyzed.

  • catalog – Dictionary where each key is a station name and its value is a list of Timestamp compatible potential times/dates. Alternatively, a DataFrame with at least the columns 'station' and 'time'.

  • threshold (default: None) – Minimum \(\Delta_i \geq 0\) that needs to be satisfied in order to be a step.

  • gap (default: 2.0) – Maximum gap between identified steps to count as a continuous period of possible steps.

  • gap_unit (default: 'D') – Time unit of gap.

  • keep_nan_probs (default: True) – (Only applies to a DataFrame-type catalog input.) If a catalogued station is not in the network, or if a catalogued timestamp is after the available timeseries, no step probability can be calculated and the results will contain NaNs. If True, those entries will be kept in the output, and if False, they will be dropped.

  • no_pbar (default: False) – Suppress the progress bar with True.

Returns

  • step_table – A DataFrame containing the columns 'station' (its name), 'time' (a timestamp of the station) and 'probability' (maximum \(\Delta_i\) over all components for this timestamp) for each potential step in catalog, as well as var0 and var1 (the two hypotheses’ residuals variances for the component of maximum step probability). If a DataFrame was passed as catalog, a copy of that will be returned, with the added columns specified above.

  • step_ranges – A list of lists containing continuous periods over all stations of the potential steps as determined by gap and gap_unit.

search_network(net, ts_description, maxdel=10.0, threshold=20.0, gap=2.0, gap_unit='D', aggregate_components=True, no_pbar=False)[source]

Function that searches for steps in an entire network (possibly in parallel), thresholds those probabilities, and identifies all the consecutive ranges in which steps happen over the network.

Parameters
  • net (Network) – Network instance to operate on.

  • ts_description (str) – Timeseries description that will be analyzed.

  • maxdel (float, default: 10.0) – Difference in AIC that should be considered not significantly better. (Refers to \(\Delta_i = \text{AIC}_{c,i} - \text{AIC}_{c,\text{min}}\).)

  • threshold (float, default: 20.0) – Minimum \(\Delta_i \geq 0\) that needs to be satisfied in order to be a step.

  • gap (float, default: 2.0) – Maximum gap between identified steps to count as a continuous period of possible steps.

  • gap_unit (str, default: 'D') – Time unit of gap.

  • aggregate_components (bool, default: True) – If True, use the maximum step probability across all data components when searching for steps. Otherwise, keep all steps from maximum probabilities in each component (which could lead to multiple close-by steps).

  • no_pbar (bool, default: False) – Suppress the progress bar with True.

Return type

tuple[DataFrame, list]

Returns

  • step_table – A DataFrame containing the columns 'station' (its name), 'time' (a timestamp of the station) and 'probability' (maximum \(\Delta_i\) over all components for this timestamp), as well as var0, var1 (the two hypotheses’ residuals variances for the component of maximum step probability) and varred (the variance reduction in percent, (var0 - var1) / var0).

  • step_ranges – A list of lists containing continuous periods over all stations of the potential steps as determined by gap and gap_unit.

static steps(probabilities, threshold=2.0, maxsteps=inf, aggregate_components=True, verbose=True)[source]

Threshold the probabilities to return a list of steps.

Parameters
  • probabilities (ndarray) – Array of probabilities.

  • threshold (float, default: 2.0) – Minimum \(\Delta_i \geq 0\) that needs to be satisfied in order to be a step.

  • maxsteps (int, default: inf) – Return at most maxsteps number of steps. Can be useful if a good value for threshold has not been found yet.

  • aggregate_components (bool, default: True) – If True, use the maximum step probability across all data components when searching for steps. Otherwise, keep all steps from maximum probabilities in each component (which could lead to multiple close-by steps).

  • verbose (bool, default: True) – If True, print warnings when there will be a large number of steps identified given the threshold.

Return type

list[ndarray]

Returns

Returns a list of ndarray arrays that contain the indices of steps for each component.

static test_single(xwindow, ywindow, valid=None, maxdel=10.0)[source]

For a single window (of arbitrary, but odd length), perform the AIC hypothesis test whether a step is likely present (H1) or not (H0) in the y data given x coordinates.

Parameters
  • xwindow (ndarray) – Time array of shape \((\text{num_window},)\). Should not contain NaNs.

  • ywindow (ndarray) – Data array of shape \((\text{num_window},)\). Should not contain NaNs.

  • valid (Optional[ndarray], default: None) – Mask array of the data of shape \((\text{num_window},)\), with 1 where the ywindow is finite (not NaN or infinity). If not passed to the function, it is calculated internally, which will slow down the computation.

  • maxdel (float, default: 10.0) – Difference in AIC that should be considered not significantly better. (Refers to \(\Delta_i = \text{AIC}_{c,i} - \text{AIC}_{c,\text{min}}\).)

Return type

tuple[int, float, (Optional[float], Optional[float])]

Returns

  • best_hyp – Best hypothesis (0 for no step, 1 for step).

  • rel_prob – If H1 is the best hypothesis (and suffices maxdel), its relative probability, otherwise the relative probability of H0 (which therefore can be 0 if H0 is also the best hypothesis in general).

  • msr_hyps – A 2-tuple of the two mean-squared residuals of the H0 and H1 hypotheses, respectively. Assuming the test is unbiased, this is the residual’s variance. Is NaN in an element if the least-squares model did not converge.

See also

AIC_c

For more information about the AIC hypothesis test.