Processing

This module contains some processing functions that while not belonging to a specific class, require them to already be loaded and DISSTANS to be initialized.

For general helper functions, see tools.

unwrap_dict_and_ts

@disstans.processing.unwrap_dict_and_ts(func)[source]

A wrapper decorator that aims at simplifying the coding of processing functions. Ideally, a new function that doesn’t need to know if its input is a Timeseries, DataFrame, ndarray or a dictionary containing them, should not need to reimplement a check and conversion for all of these because they just represent a data array of some form. So, by providing this function decorator, a wrapped function only needs to be able to work for a single array (plus some optional keyword arguments). The wrapping will extract the data of the input types and convert the returned array from func into the original format.

Example

A basic function of the form:

def func(in_array, **kw_args):
    # do some things
    return out_array

that takes and returns a NumPy array can be wrapped as follows to be able to also take and return all the other data forms:

@unwrap_dict_and_ts
def func(in_array, **kw_args):
    # do some things
    return out_array

Parameters: func (Callable) – Function to be wrapped.
Return type: Callable

Functions

clean

disstans.processing.clean(station, ts_in, reference, ts_out=None, clean_kw_args={}, reference_callable_args={})[source]

Function operating on a single station’s timeseries to clean it from outliers, and mask it out if the data is not good enough. The criteria are set by defaults but can be overriden by providing clean_kw_args. The criteria are:

'min_obs': Minimum number of observations the timeseries has to contain.
'std_outlier': Classify as an outlier any observation that is this many standard deviations away from the reference.
'iqr_outlier': Classify as an outlier any observation that is this many inter-quartile ranges (IQR, difference between the 25th and 75th percentile) away from the reference’s 25th-75th percentile range.
'std_thresh': After the removal of outliers, the maximum standard deviation that the residual between reference and input timeseries is allowed to have.
'min_clean_obs': After the removal of outliers, the minimum number of observations the timeseries has to contain.

Parameters

station (Station) – Station to operate on.
ts_in (str) – Description of the timeseries to clean.
reference (str | Timeseries | Callable) – Reference timeseries. If string, checks for a timeseries with that description in the station. If a Timeseries instance, use it directly. If a function, the reference timeseries will be calculated as t_ref = reference(ts_in, **reference_callable_args).
ts_out (str, default: None) – If provided, duplicate ts_in to a new timeseries ts_out and clean the copy (to preserve the raw timeseries).
clean_kw_args (dict[str, Any], default: {}) – Override the default cleaning criteria in defaults, see the explanations above.
reference_callable_args (dict[str, Any], default: {}) – If reference is a function, reference_callable_args can be used to pass additional keyword arguments to the former when calculating the reference timeseries.

Return type

None

Warning

By default, this function operates in-place. If you don’t wish to overwrite the raw input timeseries, specify ts_out.

decompose

disstans.processing.decompose(array, method, num_components=1, return_sources=False, rng=None)[source]

Decomposes the input signal into different components using PCA or ICA.

Parameters

array (ndarray) – Input array of shape \((\text{num_observations},\text{n_stations})\) (can contain NaNs). Wrapped by unwrap_dict_and_ts() to also accept Timeseries, DataFrame and dictionaries of them as input (i.e. the output of export_network_ts()).
method (str) – Method to use to decompose the array. Possible values are 'pca' and 'ica': 'pca' uses PCA (motivated by [dong06]), whereas 'ica' uses FastICA (motivated by [huang12]).
num_components (int, default: 1) – Number of components to estimate. If None, all are used.
return_sources (bool, default: False) – If True, return not only the best-fit model, but also the sources themselves in space and time. Defaults to False.
rng (Optional[Generator], default: None) – Random number generator instance to use to fill missing values.

Return type

tuple[ndarray, Optional[ndarray], Optional[ndarray]]

Returns

model – Best-fit model with shape \((\text{num_observations},\text{n_stations})\).
temporal – Only if return_sources=True: Temporal source with shape \((\text{num_observations},\text{num_components})\).
spatial – Only if return_sources=True: Spatial source with shape \((\text{num_components},\text{n_stations})\).

References

dong06: Dong, D., Fang, P., Bock, Y., Webb, F., Prawirodirdjo, L., Kedar, S., & Jamason, P. (2006). Spatiotemporal filtering using principal component analysis and Karhunen-Loeve expansion approaches for regional GPS network analysis. Journal of Geophysical Research: Solid Earth, 111(B3). doi:10.1029/2005JB003806.
huang12: Huang, D. W., Dai, W. J., & Luo, F. X. (2012). ICA Spatiotemporal Filtering Method and Its Application in GPS Deformation Monitoring. Applied Mechanics and Materials, 204–208, 2806–2812. doi:10.4028/www.scientific.net/AMM.204-208.2806.

median

disstans.processing.median(array, kernel_size)[source]

Computes the rolling median filter column-wise. Missing observations (NaNs) are ignored during the median calculation, but missing observations are not imputed from the rolling median (i.e., a NaN value remains a NaN value, but does not affect surrounding values).

Parameters

array (ndarray) – 2D input array (can contain NaNs). Wrapped by unwrap_dict_and_ts() to also accept Timeseries, DataFrame and dictionaries of them as input.
kernel_size (int) – Kernel size (length of moving window to compute the median over). Has to be an odd number.

Return type

ndarray

Returns

2D filtered array (may still contain NaNs).

midas

disstans.processing.midas(ts, steps=None, tolerance=0.001)[source]

This function performs the MIDAS estimate as described by [blewitt16]. It is adapted from the Fortran code provided by the author (see selectpair() for more details and original copyright).

MIDAS returns the median estimate of secular (constant) velocities in all data components using data pairs spanning exactly one year. By not including pairs crossing known step epochs and using a fixed period, the influence of unmodeled jumps and seasonal variations can be minimized. At the end, an empirical estimate of the velocities’ uncertainty is calculated as well.

Parameters

ts (Timeseries) – Timeseries to perform the MIDAS algorithm on.
steps (UnionType[Series, DatetimeIndex, None], default: None) – If given, a pandas Series or Index of step times, across which no pairs should be formed.
tolerance (float, default: 0.001) – Tolerance when enforcing the one-year period of pairs (in 365.25-days-long years`).

Return type

tuple[Polynomial, Timeseries, dict[str, Any]]

Returns

mdl – Fitted polynomial (offset & constant velocity) model.
res – Residual timeseries.
stats – Fittings statistics computed along the way. 'num_epochs', 'num_used', 'num_pairs', and 'nstep' are the number of epochs in ts, the number of epochs used in the velocity pairs, the number of pairs formed, and the number of included steps, respectively. 'frac_removed' and 'sd_velpairs' are the fraction of removed pairs (because of velocity pairs more than two standard deviations away from their medians) and the estimated standard deviation of the velocity pairs, respectively, and for each component.

References

blewitt16: Blewitt, G., Kreemer, C., Hammond, W. C., & Gazeaux, J. (2016). MIDAS robust trend estimator for accurate GPS station velocities without step detection. Journal of Geophysical Research: Solid Earth, 121(3), 2054–2068. doi:10.1002/2015JB012552

Classes

StepDetector

class disstans.processing.StepDetector(kernel_size=None, kernel_size_min=0, x=None, y=None)[source]

This class implements a step detector based on the Akaike Information Criterion (AIC).

A window is moved over the input data, and two linear models are fit in the method search(): one containing only a linear polynomial, and one containing an additional step in the middle of the window. Then, using the AIC, the relative probabilities are calculated, and saved for each timestep.

In the final step, one can threshold these relative probabilities with the method steps(), and look for local maxima, which will correspond to probable steps.

If the class is constructed with kernel_size, x and y passed, it automatically calls its method search(), otherwise, search() needs to be called manually. Running the method again with a different kernel_size, x or y will overwrite previous results.

Parameters

kernel_size (Optional[int], default: None) – Window size of the detector. Must be odd.
kernel_size_min (int, default: 0) – Minimum window size of the detector (for edges). Must be smaller than or equal to kernel_size.
x (Optional[ndarray], default: None) – Input array of shape \((\text{num_observations},)\). Should not contain NaNs.
y (Optional[ndarray], default: None) – Input array of shape \((\text{num_observations}, \text{num_components})\). Can contain NaNs.

static AIC_c(rss, n, K)[source]

Calculates the Akaike Information Criterion for small samples for Least Squares regression results. Implementation is based on [burnhamanderson02] (ch. 2).

Parameters

rss (float) – Residual sum of squares.
n (int) – Number of samples.
K (int) – Degrees of freedom. If the Least Squares model has \(\text{num_parameters}\) parameters (including the mean), then the degrees of freedom are \(K = \text{num_parameters} + 1\)

Return type

float

Returns

The Small Sample AIC for Least Squares \(\text{AIC}_c\).

References

burnhamanderson02: (2002) Information and Likelihood Theory: A Basis for Model Selection and Inference. In: Burnham K.P., Anderson D.R. (eds) Model Selection and Multimodel Inference. Springer, New York, NY. doi:10.1007/978-0-387-22456-5_2.

property kernel_size: int: Kernel (window) size of the detector.

property kernel_size_min: int: Minimum kernel (window) size of the detector.

search(x, y, maxdel=10.0)[source]

Function that will search for steps in the data. Upon successful completion, it will return the relative step probabilities as well as the residuals variances of the two hypotheses tested (as reported by test_single()).

Parameters

x (ndarray) – Input array of shape \((\text{num_observations},)\). Should not contain NaNs.
y (ndarray) – Input array of shape \((\text{num_observations}, \text{num_components})\). Can contain NaNs.
maxdel (float, default: 10.0) – Difference in AIC that should be considered not significantly better. (Refers to \(\Delta_i = \text{AIC}_{c,i} - \text{AIC}_{c,\text{min}}\).)

Return type

tuple[ndarray, ndarray, ndarray]

Returns

probabilities – Contains the relative probabilities array. Has shape \((\text{num_observations}, \text{num_components})\).
var0 – Contains the array of the residuals variance of the hypothesis that no step is present. Has shape \((\text{num_observations}, \text{num_components})\).
var1 – Contains the array of the residuals variance of the hypothesis that a step is present. Has shape \((\text{num_observations}, \text{num_components})\).