Processing
This module contains some processing functions that while not belonging to a specific class, require them to already be loaded and DISSTANS to be initialized.
For general helper functions, see tools
.
unwrap_dict_and_ts
- @disstans.processing.unwrap_dict_and_ts(func)[source]
A wrapper decorator that aims at simplifying the coding of processing functions. Ideally, a new function that doesn’t need to know if its input is a
Timeseries
,DataFrame
,ndarray
or a dictionary containing them, should not need to reimplement a check and conversion for all of these because they just represent a data array of some form. So, by providing this function decorator, a wrapped function only needs to be able to work for a single array (plus some optional keyword arguments). The wrapping will extract the data of the input types and convert the returned array fromfunc
into the original format.Example
A basic function of the form:
def func(in_array, **kw_args): # do some things return out_array
that takes and returns a NumPy array can be wrapped as follows to be able to also take and return all the other data forms:
@unwrap_dict_and_ts def func(in_array, **kw_args): # do some things return out_array
Functions
clean
- disstans.processing.clean(station, ts_in, reference, ts_out=None, clean_kw_args={}, reference_callable_args={})[source]
Function operating on a single station’s timeseries to clean it from outliers, and mask it out if the data is not good enough. The criteria are set by
defaults
but can be overriden by providingclean_kw_args
. The criteria are:'min_obs'
: Minimum number of observations the timeseries has to contain.'std_outlier'
: Classify as an outlier any observation that is this many standard deviations away from the reference.'iqr_outlier'
: Classify as an outlier any observation that is this many inter-quartile ranges (IQR, difference between the 25th and 75th percentile) away from the reference’s 25th-75th percentile range.'std_thresh'
: After the removal of outliers, the maximum standard deviation that the residual between reference and input timeseries is allowed to have.'min_clean_obs'
: After the removal of outliers, the minimum number of observations the timeseries has to contain.
- Parameters
station (
Station
) – Station to operate on.ts_in (
str
) – Description of the timeseries to clean.reference (
str
|Timeseries
|Callable
) – Reference timeseries. If string, checks for a timeseries with that description in thestation
. If aTimeseries
instance, use it directly. If a function, the reference timeseries will be calculated ast_ref = reference(ts_in, **reference_callable_args)
.ts_out (
str
, default:None
) – If provided, duplicatets_in
to a new timeseriests_out
and clean the copy (to preserve the raw timeseries).clean_kw_args (
dict
[str
,Any
], default:{}
) – Override the default cleaning criteria indefaults
, see the explanations above.reference_callable_args (
dict
[str
,Any
], default:{}
) – Ifreference
is a function,reference_callable_args
can be used to pass additional keyword arguments to the former when calculating the reference timeseries.
- Return type
Warning
By default, this function operates in-place. If you don’t wish to overwrite the raw input timeseries, specify
ts_out
.
decompose
- disstans.processing.decompose(array, method, num_components=1, return_sources=False, rng=None)[source]
Decomposes the input signal into different components using PCA or ICA.
- Parameters
array (
ndarray
) – Input array of shape \((\text{num_observations},\text{n_stations})\) (can contain NaNs). Wrapped byunwrap_dict_and_ts()
to also acceptTimeseries
,DataFrame
and dictionaries of them as input (i.e. the output ofexport_network_ts()
).method (
str
) – Method to use to decompose the array. Possible values are'pca'
and'ica'
:'pca'
usesPCA
(motivated by [dong06]), whereas'ica'
usesFastICA
(motivated by [huang12]).num_components (
int
, default:1
) – Number of components to estimate. IfNone
, all are used.return_sources (
bool
, default:False
) – IfTrue
, return not only the best-fit model, but also the sources themselves in space and time. Defaults toFalse
.rng (
Optional
[Generator
], default:None
) – Random number generator instance to use to fill missing values.
- Return type
- Returns
model – Best-fit model with shape \((\text{num_observations},\text{n_stations})\).
temporal – Only if
return_sources=True
: Temporal source with shape \((\text{num_observations},\text{num_components})\).spatial – Only if
return_sources=True
: Spatial source with shape \((\text{num_components},\text{n_stations})\).
References
- dong06
Dong, D., Fang, P., Bock, Y., Webb, F., Prawirodirdjo, L., Kedar, S., & Jamason, P. (2006). Spatiotemporal filtering using principal component analysis and Karhunen-Loeve expansion approaches for regional GPS network analysis. Journal of Geophysical Research: Solid Earth, 111(B3). doi:10.1029/2005JB003806.
- huang12
Huang, D. W., Dai, W. J., & Luo, F. X. (2012). ICA Spatiotemporal Filtering Method and Its Application in GPS Deformation Monitoring. Applied Mechanics and Materials, 204–208, 2806–2812. doi:10.4028/www.scientific.net/AMM.204-208.2806.
median
- disstans.processing.median(array, kernel_size)[source]
Computes the rolling median filter column-wise. Missing observations (NaNs) are ignored during the median calculation, but missing observations are not imputed from the rolling median (i.e., a NaN value remains a NaN value, but does not affect surrounding values).
- Parameters
array (
ndarray
) – 2D input array (can contain NaNs). Wrapped byunwrap_dict_and_ts()
to also acceptTimeseries
,DataFrame
and dictionaries of them as input.kernel_size (
int
) – Kernel size (length of moving window to compute the median over). Has to be an odd number.
- Return type
- Returns
2D filtered array (may still contain NaNs).
midas
- disstans.processing.midas(ts, steps=None, tolerance=0.001)[source]
This function performs the MIDAS estimate as described by [blewitt16]. It is adapted from the Fortran code provided by the author (see
selectpair()
for more details and original copyright).MIDAS returns the median estimate of secular (constant) velocities in all data components using data pairs spanning exactly one year. By not including pairs crossing known step epochs and using a fixed period, the influence of unmodeled jumps and seasonal variations can be minimized. At the end, an empirical estimate of the velocities’ uncertainty is calculated as well.
- Parameters
ts (
Timeseries
) – Timeseries to perform the MIDAS algorithm on.steps (
UnionType
[Series
,DatetimeIndex
,None
], default:None
) – If given, a pandas Series or Index of step times, across which no pairs should be formed.tolerance (
float
, default:0.001
) – Tolerance when enforcing the one-year period of pairs (in 365.25-days-long years`).
- Return type
tuple
[Polynomial
,Timeseries
,dict
[str
,Any
]]- Returns
mdl – Fitted polynomial (offset & constant velocity) model.
res – Residual timeseries.
stats – Fittings statistics computed along the way.
'num_epochs'
,'num_used'
,'num_pairs'
, and'nstep'
are the number of epochs ints
, the number of epochs used in the velocity pairs, the number of pairs formed, and the number of included steps, respectively.'frac_removed'
and'sd_velpairs'
are the fraction of removed pairs (because of velocity pairs more than two standard deviations away from their medians) and the estimated standard deviation of the velocity pairs, respectively, and for each component.
References
- blewitt16
Blewitt, G., Kreemer, C., Hammond, W. C., & Gazeaux, J. (2016). MIDAS robust trend estimator for accurate GPS station velocities without step detection. Journal of Geophysical Research: Solid Earth, 121(3), 2054–2068. doi:10.1002/2015JB012552
Classes
StepDetector
- class disstans.processing.StepDetector(kernel_size=None, kernel_size_min=0, x=None, y=None)[source]
This class implements a step detector based on the Akaike Information Criterion (AIC).
A window is moved over the input data, and two linear models are fit in the method
search()
: one containing only a linear polynomial, and one containing an additional step in the middle of the window. Then, using the AIC, the relative probabilities are calculated, and saved for each timestep.In the final step, one can threshold these relative probabilities with the method
steps()
, and look for local maxima, which will correspond to probable steps.If the class is constructed with
kernel_size
,x
andy
passed, it automatically calls its methodsearch()
, otherwise,search()
needs to be called manually. Running the method again with a differentkernel_size
,x
ory
will overwrite previous results.- Parameters
kernel_size (
Optional
[int
], default:None
) – Window size of the detector. Must be odd.kernel_size_min (
int
, default:0
) – Minimum window size of the detector (for edges). Must be smaller than or equal tokernel_size
.x (
Optional
[ndarray
], default:None
) – Input array of shape \((\text{num_observations},)\). Should not contain NaNs.y (
Optional
[ndarray
], default:None
) – Input array of shape \((\text{num_observations}, \text{num_components})\). Can contain NaNs.
- static AIC_c(rss, n, K)[source]
Calculates the Akaike Information Criterion for small samples for Least Squares regression results. Implementation is based on [burnhamanderson02] (ch. 2).
- Parameters
- Return type
- Returns
The Small Sample AIC for Least Squares \(\text{AIC}_c\).
References
- burnhamanderson02
(2002) Information and Likelihood Theory: A Basis for Model Selection and Inference. In: Burnham K.P., Anderson D.R. (eds) Model Selection and Multimodel Inference. Springer, New York, NY. doi:10.1007/978-0-387-22456-5_2.
- search(x, y, maxdel=10.0)[source]
Function that will search for steps in the data. Upon successful completion, it will return the relative step probabilities as well as the residuals variances of the two hypotheses tested (as reported by
test_single()
).- Parameters
x (
ndarray
) – Input array of shape \((\text{num_observations},)\). Should not contain NaNs.y (
ndarray
) – Input array of shape \((\text{num_observations}, \text{num_components})\). Can contain NaNs.maxdel (
float
, default:10.0
) – Difference in AIC that should be considered not significantly better. (Refers to \(\Delta_i = \text{AIC}_{c,i} - \text{AIC}_{c,\text{min}}\).)
- Return type
- Returns
probabilities – Contains the relative probabilities array. Has shape \((\text{num_observations}, \text{num_components})\).
var0 – Contains the array of the residuals variance of the hypothesis that no step is present. Has shape \((\text{num_observations}, \text{num_components})\).
var1 – Contains the array of the residuals variance of the hypothesis that a step is present. Has shape \((\text{num_observations}, \text{num_components})\).
See also
test_single()
For more explanations about the return values.
- search_catalog(net, ts_description, catalog, threshold=None, gap=2.0, gap_unit='D', keep_nan_probs=True, no_pbar=False)[source]
Search a dictionary of potential step times for each station in the dictionary and assess the probability for each one.
- Parameters
net – Network instance to operate on.
ts_description –
Timeseries
description that will be analyzed.catalog – Dictionary where each key is a station name and its value is a list of
Timestamp
compatible potential times/dates. Alternatively, a DataFrame with at least the columns'station'
and'time'
.threshold (default:
None
) – Minimum \(\Delta_i \geq 0\) that needs to be satisfied in order to be a step.gap (default:
2.0
) – Maximum gap between identified steps to count as a continuous period of possible steps.gap_unit (default:
'D'
) – Time unit ofgap
.keep_nan_probs (default:
True
) – (Only applies to a DataFrame-typecatalog
input.) If a catalogued station is not in the network, or if a catalogued timestamp is after the available timeseries, no step probability can be calculated and the results will contain NaNs. IfTrue
, those entries will be kept in the output, and ifFalse
, they will be dropped.no_pbar (default:
False
) – Suppress the progress bar withTrue
.
- Returns
step_table – A DataFrame containing the columns
'station'
(its name),'time'
(a timestamp of the station) and'probability'
(maximum \(\Delta_i\) over all components for this timestamp) for each potential step incatalog
, as well asvar0
andvar1
(the two hypotheses’ residuals variances for the component of maximum step probability). If a DataFrame was passed ascatalog
, a copy of that will be returned, with the added columns specified above.step_ranges – A list of lists containing continuous periods over all stations of the potential steps as determined by
gap
andgap_unit
.
- search_network(net, ts_description, maxdel=10.0, threshold=20.0, gap=2.0, gap_unit='D', aggregate_components=True, no_pbar=False)[source]
Function that searches for steps in an entire network (possibly in parallel), thresholds those probabilities, and identifies all the consecutive ranges in which steps happen over the network.
- Parameters
net (
Network
) – Network instance to operate on.ts_description (
str
) –Timeseries
description that will be analyzed.maxdel (
float
, default:10.0
) – Difference in AIC that should be considered not significantly better. (Refers to \(\Delta_i = \text{AIC}_{c,i} - \text{AIC}_{c,\text{min}}\).)threshold (
float
, default:20.0
) – Minimum \(\Delta_i \geq 0\) that needs to be satisfied in order to be a step.gap (
float
, default:2.0
) – Maximum gap between identified steps to count as a continuous period of possible steps.gap_unit (
str
, default:'D'
) – Time unit ofgap
.aggregate_components (
bool
, default:True
) – IfTrue
, use the maximum step probability across all data components when searching for steps. Otherwise, keep all steps from maximum probabilities in each component (which could lead to multiple close-by steps).no_pbar (
bool
, default:False
) – Suppress the progress bar withTrue
.
- Return type
- Returns
step_table – A DataFrame containing the columns
'station'
(its name),'time'
(a timestamp of the station) and'probability'
(maximum \(\Delta_i\) over all components for this timestamp), as well asvar0
,var1
(the two hypotheses’ residuals variances for the component of maximum step probability) andvarred
(the variance reduction in percent,(var0 - var1) / var0
).step_ranges – A list of lists containing continuous periods over all stations of the potential steps as determined by
gap
andgap_unit
.
- static steps(probabilities, threshold=2.0, maxsteps=inf, aggregate_components=True, verbose=True)[source]
Threshold the probabilities to return a list of steps.
- Parameters
probabilities (
ndarray
) – Array of probabilities.threshold (
float
, default:2.0
) – Minimum \(\Delta_i \geq 0\) that needs to be satisfied in order to be a step.maxsteps (
int
, default:inf
) – Return at mostmaxsteps
number of steps. Can be useful if a good value forthreshold
has not been found yet.aggregate_components (
bool
, default:True
) – IfTrue
, use the maximum step probability across all data components when searching for steps. Otherwise, keep all steps from maximum probabilities in each component (which could lead to multiple close-by steps).verbose (
bool
, default:True
) – IfTrue
, print warnings when there will be a large number of steps identified given thethreshold
.
- Return type
- Returns
Returns a list of
ndarray
arrays that contain the indices of steps for each component.
- static test_single(xwindow, ywindow, valid=None, maxdel=10.0)[source]
For a single window (of arbitrary, but odd length), perform the AIC hypothesis test whether a step is likely present (H1) or not (H0) in the
y
data givenx
coordinates.- Parameters
xwindow (
ndarray
) – Time array of shape \((\text{num_window},)\). Should not contain NaNs.ywindow (
ndarray
) – Data array of shape \((\text{num_window},)\). Should not contain NaNs.valid (
Optional
[ndarray
], default:None
) – Mask array of the data of shape \((\text{num_window},)\), with1
where theywindow
is finite (not NaN or infinity). If not passed to the function, it is calculated internally, which will slow down the computation.maxdel (
float
, default:10.0
) – Difference in AIC that should be considered not significantly better. (Refers to \(\Delta_i = \text{AIC}_{c,i} - \text{AIC}_{c,\text{min}}\).)
- Return type
- Returns
best_hyp – Best hypothesis (
0
for no step,1
for step).rel_prob – If H1 is the best hypothesis (and suffices
maxdel
), its relative probability, otherwise the relative probability of H0 (which therefore can be0
if H0 is also the best hypothesis in general).msr_hyps – A 2-tuple of the two mean-squared residuals of the H0 and H1 hypotheses, respectively. Assuming the test is unbiased, this is the residual’s variance. Is
NaN
in an element if the least-squares model did not converge.
See also
AIC_c
For more information about the AIC hypothesis test.