Tutorial 2: Advanced Models and Fitting
This tutorial recreates the basics of the synthetic timeseries example described in Bryan Riel’s [riel14] paper on detecting geodetic transients. We’ll build on the workflow from the first tutorial, but this time we’ll focus a bit more on making the synthetic data before creating the station itself.
Creating more complex synthetic data
Let’s start with creating the timestamps for our synthetic data:
>>> import pandas as pd
>>> t_start_str = "2000-01-01"
>>> t_end_str = "2020-01-01"
>>> timevector = pd.date_range(start=t_start_str, end=t_end_str, freq="1D")
Next up is the model collection we’re going to use to simulate our data. This time, we’ll be using a Polynomial, two Sinusoids, and some Arctangents. If you have any questions about these models, please refer to the previous tutorial.
>>> from disstans.models import Arctangent, Polynomial, Sinusoid
>>> mdl_secular = Polynomial(order=1, t_reference=t_start_str)
>>> mdl_annual = Sinusoid(period=365.25, t_reference=t_start_str)
>>> mdl_semiannual = Sinusoid(period=365.25/2, t_reference=t_start_str)
>>> mdl_transient_1 = Arctangent(tau=100, t_reference="2002-07-01")
>>> mdl_transient_2 = Arctangent(tau=50, t_reference="2010-01-01")
>>> mdl_transient_3 = Arctangent(tau=300, t_reference="2016-01-01")
>>> mdl_coll_synth = {"Secular": mdl_secular,
... "Annual": mdl_annual,
... "Semi-Annual": mdl_semiannual,
... "Transient_1": mdl_transient_1,
... "Transient_2": mdl_transient_2,
... "Transient_3": mdl_transient_3}
Now, if we give these model objects to our station and perform fitting, their parameters will be overwritten, because Python passes objects by reference rather than copying them. So, let’s take this opportunity to create a “deep” copy that we will then use for fitting. (Because fitting Arctangents directly is hard, we’ll omit them here and try to approximate them with other models later.)
>>> from copy import deepcopy
>>> mdl_coll = deepcopy({"Secular": mdl_secular,
... "Annual": mdl_annual,
... "Semi-Annual": mdl_semiannual})
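As a reminder of why the deep copy is needed, here is a minimal pure-Python illustration (unrelated to DISSTANS itself): plain assignment only copies the reference, so changes made through one name are visible through the other, whereas `deepcopy` creates an independent object.

```python
from copy import deepcopy

original = {"Secular": [1.0, 2.0]}
alias = original                  # same underlying object as `original`
independent = deepcopy(original)  # fully independent copy
alias["Secular"][0] = 99.0        # modifying through the alias...
print(original["Secular"][0])     # ...also changes `original`: prints 99.0
print(independent["Secular"][0])  # the deep copy is untouched: prints 1.0
```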
Now that we have a copy for safekeeping, we can add the “true” parameters to the models:
>>> import numpy as np
>>> mdl_secular.read_parameters(np.array([-20, 200/(20*365.25)]))
>>> mdl_annual.read_parameters(np.array([-5, 0]))
>>> mdl_semiannual.read_parameters(np.array([0, 5]))
>>> mdl_transient_1.read_parameters(np.array([40]))
>>> mdl_transient_2.read_parameters(np.array([-4]))
>>> mdl_transient_3.read_parameters(np.array([-20]))
We can evaluate them just like before:
>>> sum_seas_sec = mdl_secular.evaluate(timevector)["fit"] \
... + mdl_annual.evaluate(timevector)["fit"] \
... + mdl_semiannual.evaluate(timevector)["fit"]
>>> sum_transient = mdl_transient_1.evaluate(timevector)["fit"] \
... + mdl_transient_2.evaluate(timevector)["fit"] \
... + mdl_transient_3.evaluate(timevector)["fit"]
>>> sum_all_models = sum_seas_sec + sum_transient
Our noise this time has two components: white and colored. For the white noise, we can just use NumPy’s default functions, but for the colored noise, we have to use DISSTANS’s create_powerlaw_noise() function:
>>> from disstans.tools import create_powerlaw_noise
>>> rng = np.random.default_rng(0)
>>> white_noise = rng.normal(scale=2, size=(timevector.size, 1))
>>> colored_noise = create_powerlaw_noise(size=(timevector.size, 1),
... exponent=1.5, seed=0) * 2
>>> sum_noise = white_noise + colored_noise
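The key difference between the two noise components is their correlation structure: white noise is uncorrelated sample-to-sample, while power-law (colored) noise is not. As a quick standalone illustration (using a random walk, which corresponds to power-law noise with exponent 2, rather than DISSTANS’s exponent-1.5 noise), compare the lag-1 autocorrelation of both:

```python
import numpy as np

rng = np.random.default_rng(0)
white = rng.normal(size=100_000)  # independent samples
walk = np.cumsum(white)           # integrated white noise = random-walk ("brown") noise

def lag1_corr(x):
    """Correlation between each sample and its successor."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(lag1_corr(white))  # near 0: uncorrelated
print(lag1_corr(walk))   # near 1: strongly correlated
```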
Our synthetic data is then just the sum of the ground truth sum_all_models and the total noise sum_noise:
>>> synth_data = sum_all_models + sum_noise
Let’s have a look at what we fabricated:
>>> import matplotlib.pyplot as plt
>>> from pandas.plotting import register_matplotlib_converters
>>> register_matplotlib_converters() # improve how time data looks
>>> plt.figure()
>>> plt.plot(timevector, sum_seas_sec, c='C1', label="Seasonal + Secular")
>>> plt.plot(timevector, sum_transient, c='k', label="Transient")
>>> plt.plot(timevector, sum_noise, c='0.5', lw=0.3, label="Noise")
>>> plt.plot(timevector, synth_data, c='C0', ls='none', marker='.',
... markersize=2, alpha=0.5, label="Synthetic Data")
>>> plt.xlabel("Time")
>>> plt.ylim(-50, 250)
>>> plt.ylabel("Displacement [mm]")
>>> plt.legend(loc="upper left")
>>> plt.savefig("tutorial_2a.png")
This looks close to the example in [riel14]. We can see some significant transients alongside a strong secular signal, and the seasonal signals plus the colored noise make it look a bit more realistic.
Spline models for transients
How do we model the transients, though? For this, we will use an over-complete set of basis functions, built from a collection of integrated B-splines. For more on that, see the class documentation for BSpline and ISpline. There is a simple SplineSet constructor class that takes care of this for us, which we’ll add directly to our model collection from before:
>>> from disstans.models import ISpline, SplineSet
>>> mdl_coll["Transient"] = SplineSet(degree=2,
... t_center_start=t_start_str,
... t_center_end=t_end_str,
... list_num_knots=[4, 8, 16, 32, 64, 128],
... splineclass=ISpline)
This creates sets of integrated B-splines of degree 2, covering the timespan of our synthetic timeseries and dividing it into 4, 8, 16, 32, 64, and 128 subintervals. The splineclass parameter only makes explicit that we want a set of ISpline; we could have omitted it, as that is the default behavior.
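To get a feel for the timescales this covers, here is a standalone sketch of how evenly spaced spline centers could be laid out over the timespan (this is an assumed scheme for illustration only; consult the SplineSet documentation for the actual layout):

```python
import pandas as pd

t_start, t_end = pd.Timestamp("2000-01-01"), pd.Timestamp("2020-01-01")
for num_knots in [4, 8, 16, 32, 64, 128]:
    # place num_knots centers evenly between t_start and t_end
    spacing = (t_end - t_start) / (num_knots - 1)
    print(f"{num_knots:3d} centers, spacing of about {spacing.days:4d} days")
```

Each additional entry of list_num_knots thus roughly halves the spline spacing, from multi-year down to roughly two-month timescales, which is what lets the set capture both slow and fast transients.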
Building a Network
Now, we’re ready to build our synthetic network and add our generated data.
Again, we start by creating a Station
object, but this time,
we’ll also assign it to a Network
object:
>>> from disstans import Network, Station, Timeseries
>>> net_name = "TutorialLand"
>>> stat_name = "TUT"
>>> caltech_lla = (34.1375, -118.125, 263.0)
>>> net = Network(name=net_name)
>>> stat = Station(name=stat_name,
... location=caltech_lla)
>>> net[stat_name] = stat
Note
Note that the station’s internal name (name) does not have to match the key under which the station is stored in the network’s stations dictionary, but keeping the two identical avoids confusion. net[stat_name] = stat is equivalent to net.add_station(stat_name, stat).
Add the generated timeseries (including models), as well as the ground truth to the station:
>>> ts = Timeseries.from_array(timevector=timevector,
... data=synth_data,
... src="synthetic",
... data_unit="mm",
... data_cols=["Total"])
>>> truth = Timeseries.from_array(timevector=timevector,
... data=sum_all_models,
... src="synthetic",
... data_unit="mm",
... data_cols=["Total"])
>>> stat["Displacement"] = ts
>>> stat["Truth"] = truth
>>> stat.add_local_model_dict(ts_description="Displacement",
... model_dict=mdl_coll)
Fitting an entire network
At this point, we’re ready to do the fitting using the simple linear non-regularized least-squares we used in the previous tutorial:
>>> net.fit(ts_description="Displacement", solver="linear_regression")
>>> net.evaluate(ts_description="Displacement", output_description="Fit_noreg")
>>> stat["Res_noreg"] = stat["Displacement"] - stat["Fit_noreg"]
>>> stat["Err_noreg"] = stat["Fit_noreg"] - stat["Truth"]
By using the Network methods, we saved a lot of lines and hassle compared to the fitting in the previous tutorial. We already calculated the residuals and errors, so let’s print some statistics:
>>> _ = stat.analyze_residuals(ts_description="Res_noreg",
... mean=True, std=True, verbose=True)
TUT: Res_noreg Mean Standard Deviation
Total-Displacement_Model_Total -2.193582e-08 2.046182
Advanced plotting
What do our fit, residuals (between the observations and our fit) and errors (between the fit and the true displacement signal) look like compared to the data and noise?
>>> from matplotlib.lines import Line2D
>>> fig, ax = plt.subplots(nrows=2, sharex=True)
>>> ax[0].plot(stat["Displacement"].data, label="Synthetic")
>>> ax[0].plot(stat["Fit_noreg"].data, label="Fit")
>>> ax[0].set_ylim(-50, 250)
>>> ax[0].set_ylabel("Displacement [mm]")
>>> ax[0].legend(loc="upper left")
>>> ax[0].set_title("No Regularization")
>>> ax[1].plot(stat["Res_noreg"].data, c='0.3', ls='none',
... marker='.', markersize=0.5)
>>> ax[1].plot(stat["Err_noreg"].time, sum_noise, c='C1', ls='none',
... marker='.', markersize=0.5)
>>> ax[1].plot(stat["Err_noreg"].data, c="C0")
>>> ax[1].set_ylim(-15, 15)
>>> ax[1].set_ylabel("Error [mm]")
>>> custom_lines = [Line2D([0], [0], c="0.3", marker=".", linestyle='none'),
... Line2D([0], [0], c="C1", marker=".", linestyle='none'),
... Line2D([0], [0], c="C0")]
>>> ax[1].legend(custom_lines, ["Residual", "True Noise", "Error"],
... loc="lower left", ncol=3)
>>> fig.savefig("tutorial_2b.png")
Note
As expected with any regular least-squares minimization, the residuals look like a zero-mean Gaussian distribution. The true noise, plotted for comparison, contains colored noise, and is therefore not Gaussian. Because our solver has no way of distinguishing between true noise and small transient signals, it assumes that all the transients it sees are part of the displacement signal to fit. Therefore, our error tracks the noise. This behaviour will not change significantly throughout this second tutorial, but will be addressed in the third tutorial.
We can use a scalogram (see make_scalogram()) to visualize the coefficient values of our spline collection, and to quickly see that without regularization, the set is quite heavily populated in order to minimize the residuals:
>>> fig, ax = stat.models["Displacement"]["Transient"].make_scalogram(t_left=t_start_str,
... t_right=t_end_str,
... cmaprange=20)
>>> ax[0].set_title("No Regularization")
>>> fig.savefig("tutorial_2c.png")
Repeat with L2 regularization
Now, we can do the exact same thing as above, but choose a ridge regression (L2-regularized) solver:
>>> net.fit(ts_description="Displacement", solver="ridge_regression", penalty=10)
>>> net.evaluate(ts_description="Displacement", output_description="Fit_L2")
>>> stat["Res_L2"] = stat["Displacement"] - stat["Fit_L2"]
>>> stat["Err_L2"] = stat["Fit_L2"] - stat["Truth"]
Giving us the statistics:
>>> _ = stat.analyze_residuals(ts_description="Res_L2",
... mean=True, std=True, verbose=True)
TUT: Res_L2 Mean Standard Deviation
Total-Displacement_Model_Total -5.627219e-09 2.088274
>>> fig, ax = plt.subplots(nrows=2, sharex=True)
>>> ax[0].plot(stat["Displacement"].data, label="Synthetic")
>>> ax[0].plot(stat["Fit_L2"].data, label="Fit")
>>> ax[0].set_ylabel("Displacement [mm]")
>>> ax[0].legend(loc="upper left")
>>> ax[0].set_title("L2 Regularization")
>>> ax[1].plot(stat["Res_L2"].data, c='0.3', ls='none',
... marker='.', markersize=0.5)
>>> ax[1].plot(stat["Err_L2"].time, sum_noise, c='C1', ls='none',
... marker='.', markersize=0.5)
>>> ax[1].plot(stat["Err_L2"].data, c="C0")
>>> ax[1].set_ylim(-15, 15)
>>> ax[1].set_ylabel("Error [mm]")
>>> custom_lines = [Line2D([0], [0], c="0.3", marker=".", linestyle='none'),
... Line2D([0], [0], c="C1", marker=".", linestyle='none'),
... Line2D([0], [0], c="C0")]
>>> ax[1].legend(custom_lines, ["Residual", "True Noise", "Error"],
... loc="lower left", ncol=3)
>>> fig.savefig("tutorial_2d.png")
>>> fig, ax = stat.models["Displacement"]["Transient"].make_scalogram(t_left=t_start_str,
... t_right=t_end_str,
... cmaprange=20)
>>> ax[0].set_title("L2 Regularization")
>>> fig.savefig("tutorial_2e.png")
We can see that L2 regularization has significantly reduced the magnitude of the splines used in the fitting, and the fit overall (see the residual statistics) appears to be better. However, most splines are still non-zero. This might produce the best fit, but our physical knowledge of the processes involved tells us that our station is not always moving: the transients are discrete events. A higher penalty parameter would make those parameters even smaller, but it would not make the solution significantly sparser.
Let’s save the parameter values of the transient spline model such that we can compare them later to cases with different regularizations:
>>> params_transient_L2 = stat.models["Displacement"]["Transient"].parameters.ravel().copy()
Repeat with L1 regularization
Using L1-regularized lasso regression, we finally hope to get rid of all the small, basically-zero splines in the transient dictionary:
>>> net.fit(ts_description="Displacement", solver="lasso_regression", penalty=10)
>>> net.evaluate(ts_description="Displacement", output_description="Fit_L1")
>>> stat["Res_L1"] = stat["Displacement"] - stat["Fit_L1"]
>>> stat["Err_L1"] = stat["Fit_L1"] - stat["Truth"]
Giving us the statistics:
>>> _ = stat.analyze_residuals(ts_description="Res_L1",
... mean=True, std=True, verbose=True)
TUT: Res_L1 Mean Standard Deviation
Total-Displacement_Model_Total 2.171535e-10 2.07782
>>> fig, ax = plt.subplots(nrows=2, sharex=True)
>>> ax[0].plot(stat["Displacement"].data, label="Synthetic")
>>> ax[0].plot(stat["Fit_L1"].data, label="Fit")
>>> ax[0].set_ylabel("Displacement [mm]")
>>> ax[0].legend(loc="upper left")
>>> ax[0].set_title("L1 Regularization")
>>> ax[1].plot(stat["Res_L1"].data, c='0.3', ls='none',
... marker='.', markersize=0.5)
>>> ax[1].plot(stat["Err_L1"].time, sum_noise, c='C1', ls='none',
... marker='.', markersize=0.5)
>>> ax[1].plot(stat["Err_L1"].data, c="C0")
>>> ax[1].set_ylim(-15, 15)
>>> ax[1].set_ylabel("Error [mm]")
>>> custom_lines = [Line2D([0], [0], c="0.3", marker=".", linestyle='none'),
... Line2D([0], [0], c="C1", marker=".", linestyle='none'),
... Line2D([0], [0], c="C0")]
>>> ax[1].legend(custom_lines, ["Residual", "True Noise", "Error"],
... loc="lower left", ncol=3)
>>> fig.savefig("tutorial_2f.png")
>>> fig, ax = stat.models["Displacement"]["Transient"].make_scalogram(t_left=t_start_str,
... t_right=t_end_str,
... cmaprange=20)
>>> ax[0].set_title("L1 Regularization")
>>> fig.savefig("tutorial_2g.png")
This looks much better: the scalogram now shows that splines are only selected around where we put the Arctangent models, and that the coefficients are close to zero everywhere else. We again save the transient model parameters for later:
>>> params_transient_L1 = stat.models["Displacement"]["Transient"].parameters.ravel().copy()
Repeat with L0 regularization
Okay, one last thing about fitting, I promise. L1 regularization penalizes the sum of the absolute values of our model parameters. However, that’s not quite what we want either: transient signals in the real world are under no constraint to be as small as possible. Rather, it is the number of transients that should be minimized. Minimizing the number of non-zero parameters is what is mathematically referred to as L0 regularization, which is sadly not an easy problem to solve rigorously.
However, by adding a weight to each regularized parameter that drives small values even closer to zero but leaves significant values unperturbed, one can approximate such an L0 regularization by iteratively solving the L1-regularized problem. That is exactly what the reweight_max_iters option does. You can find more information about it in the notes of lasso_regression().
The solver requires the definition of a reweighting function, i.e., a function that helps the solver distinguish between significant and insignificant parameters. A reweighting function takes in parameter values and outputs a weight to use in the regularization process. Small (insignificant) parameters get a very high weight (penalty), whereas large parameters get a weight close to zero. For more information, refer to ReweightingFunction. Let’s define one:
>>> from disstans.solvers import InverseReweighting
>>> rw_func = InverseReweighting(1e-4, scale=1)
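To build intuition for what the iterative reweighting does, here is a toy standalone sketch (an illustration of the idea only, not DISSTANS’s actual lasso_regression solver). For an identity design matrix, the L1 solution reduces to elementwise soft-thresholding, which makes the reweighting loop easy to follow; the weight update below assumes the simple inverse form 1/(|m| + eps):

```python
import numpy as np

def soft_threshold(y, thresh):
    """Elementwise L1 (lasso) solution for an identity design matrix."""
    return np.sign(y) * np.maximum(np.abs(y) - thresh, 0.0)

y = np.array([0.05, -0.2, 5.0, -8.0])  # two "noise" and two "signal" parameters
penalty, eps = 0.5, 1e-4
weights = np.ones_like(y)              # first pass is a plain L1 solve
for _ in range(20):                    # cf. reweight_max_iters
    m = soft_threshold(y, penalty * weights)  # solve the weighted-L1 problem
    weights = 1.0 / (np.abs(m) + eps)         # small values get a large penalty
print(m)  # small parameters driven to exactly zero, large ones nearly unbiased
```

After the loop, the small parameters are exactly zero (their weight, and therefore their effective penalty, blew up), while the large parameters sit much closer to their unpenalized values than after a single L1 solve.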
With this, we can now run the L1 solver iteratively to approximate the L0 solution:
>>> net.fit(ts_description="Displacement", solver="lasso_regression",
... penalty=10, reweight_max_iters=20, reweight_func=rw_func)
>>> net.evaluate(ts_description="Displacement", output_description="Fit_L0")
>>> stat["Res_L0"] = stat["Displacement"] - stat["Fit_L0"]
>>> stat["Err_L0"] = stat["Fit_L0"] - stat["Truth"]
Giving us the statistics:
>>> _ = stat.analyze_residuals(ts_description="Res_L0",
... mean=True, std=True, verbose=True)
TUT: Res_L0 Mean Standard Deviation
Total-Displacement_Model_Total 4.663455e-13 2.071561
>>> fig, ax = plt.subplots(nrows=2, sharex=True)
>>> ax[0].plot(stat["Displacement"].data, label="Synthetic")
>>> ax[0].plot(stat["Fit_L0"].data, label="Fit")
>>> ax[0].set_ylabel("Displacement [mm]")
>>> ax[0].legend(loc="upper left")
>>> ax[0].set_title("Reweighted L1 Regularization")
>>> ax[1].plot(stat["Res_L0"].data, c='0.3', ls='none',
... marker='.', markersize=0.5)
>>> ax[1].plot(stat["Err_L0"].time, sum_noise, c='C1', ls='none',
... marker='.', markersize=0.5)
>>> ax[1].plot(stat["Err_L0"].data, c="C0")
>>> ax[1].set_ylim(-15, 15)
>>> ax[1].set_ylabel("Error [mm]")
>>> custom_lines = [Line2D([0], [0], c="0.3", marker=".", linestyle='none'),
... Line2D([0], [0], c="C1", marker=".", linestyle='none'),
... Line2D([0], [0], c="C0")]
>>> ax[1].legend(custom_lines, ["Residual", "True Noise", "Error"],
... loc="lower left", ncol=3)
>>> fig.savefig("tutorial_2h.png")
>>> fig, ax = stat.models["Displacement"]["Transient"].make_scalogram(t_left=t_start_str,
... t_right=t_end_str,
... cmaprange=20)
>>> ax[0].set_title("Reweighted L1 Regularization")
>>> fig.savefig("tutorial_2i.png")
As you can see, the significant spline components have now been emphasized compared to the previous scalogram, and all the values that were small but not exactly zero before are now very close to zero. This did not come at the cost of a significant increase in residual variance.
We can quantify this by plotting a histogram of the magnitudes of the transient parameters for the three different regularization schemes. One last time, we save the parameter values:
>>> params_transient_L0 = stat.models["Displacement"]["Transient"].parameters.ravel().copy()
>>> ZERO = 1e-4
>>> print("Number of nonzero spline parameters:\n"
... f"L2: {(np.abs(params_transient_L2) > ZERO).sum()}\n"
... f"L1: {(np.abs(params_transient_L1) > ZERO).sum()}\n"
... f"L0: {(np.abs(params_transient_L0) > ZERO).sum()}")
Number of nonzero spline parameters:
L2: 264
L1: 76
L0: 67
As we can see, the L0-regularized solver gives a sparser solution than both the L1- and L2-regularized solvers. A histogram shows this to us more visually:
>>> plt.figure()
>>> bins = np.linspace(-14.5, 1.5, 17)
>>> stacked_params = \
...     np.stack([params_transient_L2, params_transient_L1, params_transient_L0], axis=1)
>>> plt.hist(np.log10(np.abs(stacked_params)), bins=bins, label=["L2", "L1", "L0"])
>>> plt.xlabel(r"$\log_{10}$ Parameter Magnitude [-]")
>>> plt.ylabel("Count [-]")
>>> plt.legend()
>>> plt.savefig("tutorial_2j.png")
Comparing specific parameters
Before we finish up, let’s just print some differences between the ground truth and our L0-fitted model:
>>> reldiff_sec = (mdl_coll_synth["Secular"].parameters
... / stat.models["Displacement"]["Secular"].parameters).ravel() - 1
>>> reldiff_ann_amp = (mdl_coll_synth["Annual"].amplitude
... / stat.models["Displacement"]["Annual"].amplitude)[0] - 1
>>> reldiff_sem_amp = (mdl_coll_synth["Semi-Annual"].amplitude
... / stat.models["Displacement"]["Semi-Annual"].amplitude)[0] - 1
>>> absdiff_ann_ph = np.rad2deg(mdl_coll_synth["Annual"].phase
... - stat.models["Displacement"]["Annual"].phase)[0]
>>> if absdiff_ann_ph > 180:
... absdiff_ann_ph -= 360
>>> absdiff_sem_ph = np.rad2deg(mdl_coll_synth["Semi-Annual"].phase
... - stat.models["Displacement"]["Semi-Annual"].phase)[0]
>>> if absdiff_sem_ph > 180:
... absdiff_sem_ph -= 360
>>> print(f"Percent Error Constant: {reldiff_sec[0]: .3%}\n"
... f"Percent Error Linear: {reldiff_sec[1]: .3%}\n"
... f"Percent Error Annual Amplitude: {reldiff_ann_amp: .3%}\n"
... f"Percent Error Semi-Annual Amplitude: {reldiff_sem_amp: .3%}\n"
... f"Absolute Error Annual Phase: {absdiff_ann_ph: .3f}°\n"
... f"Absolute Error Semi-Annual Phase: {absdiff_sem_ph: .3f}°")
Percent Error Constant: 7.264%
Percent Error Linear: -1.378%
Percent Error Annual Amplitude: 2.308%
Percent Error Semi-Annual Amplitude: -0.407%
Absolute Error Annual Phase: -1.836°
Absolute Error Semi-Annual Phase: -1.284°
As we can see, we got pretty close to our ground truth. Let’s finish up by calculating an average velocity of the station using get_trend() around the time when it’s rapidly moving (the middle of 2002). We don’t want a regular trend through the data, since that would also be influenced by the secular velocity, the noise, etc., so we choose to only fit our transient model:
>>> trend, _ = stat.get_trend("Displacement", fit_list=["Transient"],
... t_start="2002-06-01", t_end="2002-08-01")
>>> print(f"Transient Velocity: {trend[0]:f} {ts.data_unit}/D")
Transient Velocity: 0.105952 mm/D
We can use average velocities like these when we want to create velocity maps for certain episodes.