conditional_expectation¶

conditional_expectation

Conditional expectation¶

Purpose¶

In statistics, the analysis of the relationship between two random variables leads to the concept of conditional expectation. This estimation technique will be considered here in the Gaussian space, as when considering a multi-gaussian random function, any weighted average of its variables follows a Gaussian distribution. This concept can be used in geostatistics for estimating a non-linear spatial random function.

This notebook shows how conditional expectation can be used with Deeplime's geolime library.

Required libraries¶

In [1]:

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
import geolime as glm

Dataset¶

The Walker Lake dataset, openly available for educational purposes, will be used in this notebook. It is directly included in geolime.

In [2]:

walker_dataset_path = "../../tests/data/walker_dataset/walker.csv"
# load input dataset
data = pd.read_csv(walker_dataset_path, sep=' ')
data.loc[data['v']>1e10,'v']=0
sample_coords = data[['x', 'y']].to_numpy()
sample_u = data[['u']].to_numpy()
sample_v = data[['v']].to_numpy()

Anamorphosis and variography in the Gaussian space¶

Anamorphosis¶

The anamorphosis purpose is to transform the original distribution of the considered spatial variable of interest (not necessarily Gaussian) in the Gaussian space (specifically: a centered and reduced Gaussian distribution with zero mean and variance equal to one). The anamorphosis transformation is a bijective and continuous operation (at least within the input variable boundaries).

For instance, when considering the u variable of interest, a Shapiro-Wilk test can be applied in order to check whether it follows or not a Gaussian distribution.

In [3]:

stats.shapiro(sample_u)

Out[3]:

ShapiroResult(statistic=0.9629311561584473, pvalue=1.6578262096444973e-09)

The test result p-value at a 5% level of confidence is very significant, leading us to decide that the null-hypothesis stating that the v variable follows a Gaussian distribution can be rejected. Gaussian anamorphosis appears shrewd in that case.

If considering the previously defined u variable of interest, anamorphosis in geolime can be defined with the following call.

In [4]:

anam = glm.Anamorphosis(sample_u.flatten())

The resulting anamorphosis can be approximated with a Hermite polynomials expansion. These polynomials appear handy for computation efficiency as it allows to consider a sum of polynomials instead of having to compute costly PDF integrals.

The first coefficient is equal to the variable's expected value and the sum of the remaining coefficients is equal to the variable's variance. They can be retrieved with the following call.

In [5]:

coeffs_point = anam.hermite_coefficients(50)

Experimental variography¶

As variography as been covered in a more in-depth notebook, we will use the same computation lags and directions for experimental semivariogram computation. They are computed directly in the Gaussian space.

In [6]:

# define input regular lags
lags, tol = glm.generate_lags(lag=10, plag=50, nlags=10)

vario_gauss_exp_160 = glm.variogram(coords=sample_coords,
                                    values=anam.gauss_scores.reshape(470,1),
                                    lags=lags,
                                    tol=tol,
                                    geographic_azimuth=160,
                                    dip=0,
                                    pitch=0,
                                    atol=40,
                                    slice_width=np.nan,
                                    slice_height=np.nan,
                                    method="semivariogram")

vario_gauss_exp_60 = glm.variogram(coords=sample_coords,
                                   values=anam.gauss_scores.reshape(470,1),
                                   lags=lags,
                                   tol=tol,
                                   geographic_azimuth=60,
                                   dip=0,
                                   pitch=0,
                                   atol=40,
                                   slice_width=np.nan,
                                   slice_height=np.nan,
                                   method="semivariogram")

The previously computed experimental variograms can be displayed with the following command.

In [7]:

glm.plot_semivariogram(variograms=[vario_gauss_exp_160,vario_gauss_exp_60], model=None)
plt.xlabel('Lag (m)')
plt.ylabel('Semivariance (ppm^2))')
plt.title('Experimental semivariograms in the Gaussian space')

Out[7]:

Text(0.5, 1.0, 'Experimental semivariograms in the Gaussian space')

Variogram model fit¶

While the experimental semivariogram shows the operator insightful information about a variable's spatial correlation based on distance and angular criterions, it can't be used directly for applying conditional expectation. A continuous covariance model is mandatory prior applying an estimation technique. This model can fitted manually from the previously computed experimental variogram, but geolime also provides an autofit function that comes handy when willing to adjust several variograms, which is our case here. One only needs to provide a covariance type, that can for instance be defined using the following call.

In [8]:

cov_gauss = glm.Nugget() + 0.5 * glm.Spherical(2) + 0.5 * glm.Spherical(2)

The automatic covariance fitting is applied which unique function, as such.

In [9]:

cov_gauss_model = glm.model_fit(variograms=[vario_gauss_exp_160, vario_gauss_exp_60], cov=cov_gauss, constraints=None)

The model can be plotted along with the previous experimental variograms in order to check the automatic covariance model fit using the previous plotting function.

In [10]:

glm.plot_semivariogram(variograms=[vario_gauss_exp_160, vario_gauss_exp_60], model= cov_gauss_model)

Out[10]:

(<Figure size 432x288 with 1 Axes>,
 <AxesSubplot:xlabel='Average distance', ylabel='Semivariance'>)

geolime provides insightful information about the resulting fit.

In [11]:

print(cov_gauss_model)

Model with 3 components

Component 1 :
Variance : 0.24281509823351183
Covariance type : Nugget

Component 2 :
Variance : 0.24939042694795258
Covariance type : Spherical
Scales : [33.03373991 17.31740456]
Angles : 35.4243588578317
Rotation matrix
 [[ 0.57962767 -0.81488145]
 [ 0.81488145  0.57962767]]

Component 3 :
Variance : 0.577185656821309
Covariance type : Spherical
Scales : [89.99999453 40.29051323]
Angles : 20.00192782357825
Rotation matrix
 [[ 0.34205176 -0.93968111]
 [ 0.93968111  0.34205176]]

Total sill = 1.0693911820027733

Change of support and variogram reuliarization¶

Prior applying conditional expectation as an spatial estimation technique, we need to create an estimation grid and a pattern generator allowing block computation. In order to match the input data geographic constraints, we define them with the following parameters.

Grid containing 26 x 30 cells, of 10 x 10 m square size
Discretization paramater of 5 x 5, implying the computation of 25 estimated values per 100 meter square blocks

These parameters can be applied in geolime with the two following calls.

In [12]:

target_grid = glm.create_grid_2d(origin=[0, 0], cell_size=[10, 10], n_dim=[26, 30], angle=[0, 0, 0])
pattern_smu = glm.GridPatternGenerator(target_grid, [5, 5])

We then proceed to the change of support coefficient. This scalar value is mandatory to allow the operator to smoothly switch from sample values to larger estimation blocks, in order to reduce estimation variability. This topic has been covered more in-depth in a separate notebook.

In [13]:

r = np.sqrt(glm.cvv(cov_gauss_model, pattern_smu))
r

Out[13]:

0.8114109044428001

For a 10 x 10 m block size, the change of support coefficient is here equal to 0.81. A coefficient near one would imply a smother change of support effect.

This coefficient is then used for computing block anamorphosis coefficients. The formula is provided in the next cell.

In [14]:

coeffs_block = coeffs_point * np.power(r, range(0, len(coeffs_point)))

In order to take the change of support effect on block, we need to regularize the previously computed punctual covariance model. This step is also mandatory when the operator is willing to switch from a punctual model to a block model.

In [15]:

varioreg_160 = glm.model_regularize(vario_gauss_exp_160, cov_gauss_model, pattern_smu, nlag=11, normalize=True)
varioreg_60 = glm.model_regularize(vario_gauss_exp_60, cov_gauss_model, pattern_smu, nlag=11, normalize=True)

The regularized experimental semivariograms must be fitted with the previously seen automatic fit function.

In [16]:

cov_gauss_block = glm.model_fit([varioreg_160, varioreg_60], glm.Spherical(2), constraints=[{'sill_fixed': 1.}])

In [17]:

glm.plot_semivariogram(variograms=[varioreg_160,varioreg_60], model=cov_gauss_block)

Out[17]:

(<Figure size 432x288 with 1 Axes>,
 <AxesSubplot:xlabel='Average distance', ylabel='Semivariance'>)

Discrete Gaussian model¶

Applying effectively the conditional expectation estimation technique requires one last step, which is computing the associated Discrete Gaussian model, as we lose the exact localization of the input data for a block localization. Grouping the block data can be achieved with the following call.

In [18]:

block_coord_centers, n_points_in_blocks, mean_by_block = glm.group_data(target_grid, sample_coords, anam.gauss_scores.flatten())
neigh = glm.Neighborhood()

Conditional expectation¶

Kriging is the best linear estimator. Conditional expectation can be, in the same fashion, the best estimator (it gives the best approximation of X knowing Y). This general mathematical concept can also be used in geostatistics for spatial interpolation.

In [19]:

res = glm.discrete_gaussian_kriging_by_index(target_grid.get_center(),
                                             block_coord_centers,
                                             mean_by_block,
                                             n_points_in_blocks,
                                             cov_gauss_block,
                                             neigh,
                                             r)

Conditional expectation is then computed on the resulting Discrete Gaussian Kriging result as such.

In [20]:

estim_block = glm.ce(res[:, 0], res[:, 1], coeffs_block)

geolime can display grids, and comes handy when willing to show the estimation results.

In [21]:

glm.display_grid(target_grid, estim_block, sample_coords, z=None, title="Conditional Expectation");

Out[21]:

(<Figure size 432x288 with 2 Axes>,
 <AxesSubplot:title={'center':'Conditional Expectation'}>)

More specifically to geological constraints, we can also define a cutoff value of 600 ppm in order to estimate the proportion of each block where the cosidered variable is greater than this cutoff.

In [22]:

cut_off = 600.

In [23]:

estim_ore = glm.ore(res[:, 0], res[:, 1], coeffs_block, cut_off)

In [24]:

glm.display_grid(target_grid, estim_ore, sample_coords, z=None, title="Conditional Expectation Ore");

Out[24]:

(<Figure size 432x288 with 2 Axes>,
 <AxesSubplot:title={'center':'Conditional Expectation Ore'}>)

We then compute the quantity of metal per volume unit contained in each blocks, with the assumption that only the grade greater the previously defined cutoff value will contain the considered estimated variable.

In [25]:

metal_block = glm.metal(res[:, 0], res[:, 1], coeffs_block, cut_off)

In [26]:

glm.display_grid(target_grid, metal_block, sample_coords, z=None, title="Conditional Expectation (Metal)");

Out[26]:

(<Figure size 432x288 with 2 Axes>,
 <AxesSubplot:title={'center':'Conditional Expectation (Metal)'}>)

Lastly, estimation variance can be computed and displayed by applying the following formula.

In [27]:

std_dev_estim_ore = estim_ore * (1 - estim_ore)

In [28]:

glm.display_grid(target_grid, std_dev_estim_ore, sample_coords, z=None, title="Conditional Expectation Metal");

Out[28]:

(<Figure size 432x288 with 2 Axes>,
 <AxesSubplot:title={'center':'Conditional Expectation Metal'}>)