Emulator

Note that some parameters are shared between different parts of the code.

Data collection and cache. The cache is shared between all chains and processes. These settings are independent of the sampling method.

parameter

default

description

cache_size

1000

Maximum number of stored training data points. If more data points are added, the one with the smallest loglikelihood is removed.

min_data_points

80

Minimum number of states in the cache before the emulator can be trained. This is an important parameter. If it is chosen too small, the emulator will require too many retrainings. If it is too large, the initial data collection phase of OLE is unnecessarily long.

cache_file

cache.pkl

File in which the cache is stored. The path is appended to working_directory.

share_cache

True

If this flag is set to True, the cache is shared between all chains and processes. This is useful if you want to run the sampler in parallel. If set to False, each chain has its own cache. A shared cache allows for faster training of the emulator. However, biases of the emulator can then be shared between chains. If each emulator uses its own cache, the emulation bias differs between chains, which can put a lower limit on the achievable R-1. Actually, that's a nice estimate of the emulation bias :)

load_cache

False

If set to True, the cache of a previous run is loaded. Note that if the likelihood has changed, this can corrupt your cache and lead to bugs! Thus, if you change the theory or likelihood code, always create a new cache or set this flag to False. In the latter case the old cache file will be overwritten.

delta_loglike

50.0

This parameter discriminates between relevant data points for the cache and outliers. All states in the cache with a loglikelihood smaller than the maximum loglikelihood in the cache minus delta_loglike are removed, since they are classified as outliers. If N_sigma and dimensionality are set, this parameter is ignored. In general it is better to give N_sigma and dimensionality!

dimensionality

None

As an alternative to delta_loglike, we can compute an educated guess for this parameter: the delta loglike of a Gaussian distribution of dimension dimensionality between its best-fit point and N_sigma. Thus, if the posterior were Gaussian, points inside the N_sigma contour would be kept in the cache, while all points outside would be classified as outliers. If no dimensionality is given, delta_loglike is used. Important for efficiency!

N_sigma

3.0

See dimensionality. Important parameter for efficiency.
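The relation between N_sigma, dimensionality and delta_loglike can be illustrated with a short sketch. It assumes one common convention, which this page does not confirm to be OLE's exact implementation: the probability mass inside N_sigma of a one-dimensional Gaussian is matched to a chi-squared distribution with dimensionality degrees of freedom, and half of the corresponding chi-squared value is used as the loglikelihood cut. The function names are hypothetical and only the standard library is used.

```python
import math


def chi2_cdf(x, k):
    """CDF of a chi-squared distribution with k degrees of freedom.

    Uses the power series of the regularized lower incomplete gamma
    function P(k/2, x/2), which is adequate for the moderate x used here.
    """
    if x <= 0.0:
        return 0.0
    a, z = 0.5 * k, 0.5 * x
    term = 1.0 / a
    total = term
    for n in range(1, 2000):
        term *= z / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def delta_loglike_from_sigma(n_sigma, dimensionality):
    """Loglikelihood drop from the best fit to the N_sigma contour (assumed convention)."""
    # Probability mass inside n_sigma for a one-dimensional Gaussian.
    target = math.erf(n_sigma / math.sqrt(2.0))
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, dimensionality) < target:  # bracket the root
        hi *= 2.0
    for _ in range(200):  # bisection on the chi-squared value
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, dimensionality) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * hi


def prune_outliers(loglikes, delta_loglike):
    """Keep only cache entries within delta_loglike of the best loglikelihood."""
    best = max(loglikes)
    return [ll for ll in loglikes if ll >= best - delta_loglike]


cut = delta_loglike_from_sigma(n_sigma=3.0, dimensionality=6)
kept = prune_outliers([-10.0, -12.5, -40.0, -11.0], cut)
```

Under this convention the cut grows with the number of dimensions (4.5 in one dimension for N_sigma = 3), which is why a fixed delta_loglike of 50 is often far more permissive than necessary.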

These parameters are used to specify the PCA compression of the data.

parameter

default

description

min_variance_per_bin

5e-6

The level of compression of each observable is determined by the number of PCA components. We increase the number of PCA components until the explained variance per bin times the bin size exceeds the parameter's value. A value of 1e-4, for example, can be interpreted such that for each observable the systematic uncertainty due to the incomplete PCA projection leads to a relative error (of the normalized observables) of 1e-2. It thus sets the maximum achievable precision of the emulator. If it is chosen too large, an error message appears that indicates possible biases. Here we can directly trade speed against accuracy. For highly correlated quantities it is advisable to reduce this number by 1-2 orders of magnitude! This is an important parameter.

max_output_dimensions

40

The maximum number of PCA components. This limit is unlikely to be reached.

data_covmat_directory

None

We can provide the emulator with a dictionary of data covmats (keys are the names of the observables). Each can be either the full (2-dimensional) covariance matrix or the (1-dimensional) diagonal of the covariance matrix. These covariance matrices are used to normalize the data. This is particularly helpful for telling the emulator which parts of the observable have to be computed precisely and which parts have only a low significance for the total likelihood. If no covariance matrices are provided, the normalization is performed bin-wise and the code assumes the entire range of the output to be of the same relevance for the total likelihood.

normalize_by_full_covmat

False

If this flag is set to True, we normalize the observables by the full covariance and thus go into the data eigenspace. This is partly what the PCA is supposed to do already. It can be computationally expensive for high-dimensional observables.
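The role of the data covariance in the normalization can be sketched as follows. This is a minimal illustration, not OLE's implementation: it assumes that an observable is expressed as its deviation from a reference mean in units of the per-bin data error, and the helper names and the observable key `cl_tt` are hypothetical.

```python
import math


def sigma_from_covmat(covmat):
    """Return per-bin standard deviations from a 1-D diagonal or a 2-D matrix."""
    if covmat and isinstance(covmat[0], (list, tuple)):
        variances = [covmat[i][i] for i in range(len(covmat))]  # full matrix
    else:
        variances = list(covmat)  # already the diagonal
    return [math.sqrt(v) for v in variances]


def normalize_observable(values, mean, covmat):
    """Express an observable as deviations from `mean` in units of the data error."""
    sigma = sigma_from_covmat(covmat)
    return [(v - m) / s for v, m, s in zip(values, mean, sigma)]


# Hypothetical observable with three bins; the last bin is poorly constrained.
data_covmats = {"cl_tt": [0.01, 0.04, 4.0]}
normalized = normalize_observable(
    values=[1.1, 2.2, 5.0],
    mean=[1.0, 2.0, 3.0],
    covmat=data_covmats["cl_tt"],
)
```

In this example all three bins end up at about one standard deviation, although the raw deviation in the last bin is twenty times larger than in the first. This is the sense in which the covariance tells the emulator where precision matters.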

The following parameters specify the training of the GP and when it is supposed to happen. They also cover the possible compression of data by sparse GPs.

parameter

default

description

kernel

RBF

GP kernel. Currently implemented: [RBF]. In fact it is an RBF + linear + WhiteNoise kernel.

learning_rate

0.1

Learning rate for the ADAM optimizer when fitting the GP parameters. Note that sparse GPs typically require a smaller learning rate than ordinary ones.

num_iters

None

Proposed number of training epochs. If the loss is still falling at that point (by more than early_stopping within two batches of early_stopping_window iterations), training continues, up to max_num_iters. If not set, it will be determined by the number of data points (see ‘num_epochs_per_dp’).

max_num_iters

None

Maximal training epochs if early stopping is not triggered. Should not be reached. Produces a warning when exceeded! If not set, it will be determined by the number of datapoints (see ‘num_epochs_per_dp’).

num_epochs_per_dp

30

Sets num_iters by multiplying the number of data points with this factor if num_iters is not set.

max_num_epochs_per_dp

120

Sets max_num_iters by multiplying the number of data points with this factor if max_num_iters is not set.

early_stopping

0.05

Early stopping criterion. See num_iters.

early_stopping_window

10

Window for early stopping. See num_iters.

kernel_fitting_frequency

40

Number of new data points added to the cache before a new compression is computed and the parameters of the GP are fitted again. Since this step is computationally rather expensive, we do not want to refit at every step. Note, however, that every new point in the cache will be used in the prediction even if the kernels are not refitted!

sparse_GP_points

0

If not set to 0, we try to condense the information of all training points into a reduced training set (sparse GPs). The initial guess for the number of sparse data points is sparse_GP_points. However, the iterative search for the best number of data points accepts a certain error tolerance in exchange for the acceleration. It should be chosen rather small, as the subleading PCA components can be fitted with very few data points.

white_noise_ratio

1.

If not set to 0, a noise term is added to the kernel that is determined by the explained_variance_cutoff for each PCA component. This prevents the GP from fitting random noise introduced in the PCA analysis. It is also a central component of the sparse GP method, since it is used to determine the optimal number of sparse points. A value of 1 sets the white noise error such that it is comparable to the dropped PCA components.

error_boost

2.

This parameter allocates a noise budget to the sparse GP relative to the existing white noise term. A value of 2. means that the total allowed error is twice the white noise and thus the average error of the sparse GP may be as large as the white noise term. A value of 1. means that the sparse GP error is zero, so it can never be used. Reasonable values are between 1.5 and 5.
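How the unset iteration counts follow from the number of data points, and how error_boost splits the error budget, can be written out in a few lines. This is a sketch of the arithmetic described in the table, with hypothetical function names; it is not taken from the OLE source.

```python
def resolve_iteration_budget(
    n_data_points,
    num_iters=None,
    max_num_iters=None,
    num_epochs_per_dp=30,
    max_num_epochs_per_dp=120,
):
    """Return (num_iters, max_num_iters), filling unset values from the cache size."""
    if num_iters is None:
        num_iters = n_data_points * num_epochs_per_dp
    if max_num_iters is None:
        max_num_iters = n_data_points * max_num_epochs_per_dp
    return num_iters, max_num_iters


def sparse_error_budget(white_noise, error_boost=2.0):
    """Error left for the sparse GP once the white noise term is accounted for."""
    return (error_boost - 1.0) * white_noise


# With the default min_data_points of 80 the first training is proposed
# to run for 80 * 30 epochs and is capped at 80 * 120 epochs.
proposed, maximum = resolve_iteration_budget(n_data_points=80)
budget = sparse_error_budget(white_noise=0.5, error_boost=2.0)
```

With error_boost = 1. the budget returned by `sparse_error_budget` is zero, which matches the statement that the sparse GP can then never be used.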

Uncertainty quantification: the precision criterion of the emulator and when to test it.

parameter

default

description

testing_strategy

'test_stochastic'

Specifies the testing strategy. Possible strategies: 'test_all', 'test_early', 'test_none', 'test_stochastic'. With 'test_all' every emulator call is tested. With 'test_none' no emulator call is tested. With 'test_early' all points are tested until test_early_points consecutive points have passed the test; afterwards testing is switched off. 'test_stochastic' starts with a 100% testing probability, which then decreases exponentially with the number of consecutive successful emulator calls. The scale of the exponential decrease is test_stochastic_scale times dimensionality. If test_stochastic_rate is set, points are still tested with at least this rate after the exponential decay. If it is not set, the rate is determined by test_stochastic_testing_time_fraction: the time for testing is balanced against the time for the actual emulator calls such that testing takes this fraction of the total time.

test_early_points

1000

Number of consecutive positive test calls until testing is switched off. See testing_strategy.

test_stochastic_scale

40

Scale of each dimension for the stochastic testing. See testing_strategy.

test_stochastic_rate

None

See testing_strategy.

test_stochastic_testing_time_fraction

0.15

See testing_strategy.

max_sigma

20

The emulator should only be used in the vicinity of the best-fit where it is trained. If the loglike is far away (like during burn-in) it should not be used.

N_quality_samples

5

Number of samples drawn from the emulator to estimate its performance. The runtime is roughly linear in this parameter! From these samples we compute the mean loglikelihood $m$ and its standard deviation $\sigma_m$. In general we want the emulator to be very precise at the best-fit point, with loglikelihood $b$, and allow it to be less accurate for points farther away. We accept the prediction of the emulator if $$\sigma_m < \mathrm{quality\_threshold\_constant} + \mathrm{quality\_threshold\_linear} \cdot (b-m) + \mathrm{quality\_threshold\_quadratic} \cdot (b-m)^2$$

quality_threshold_constant

0.1

See N_quality_samples

quality_threshold_linear

0.05

See N_quality_samples. Note that this factor can be reformulated as a precision criterion for your confidence bounds (for a Gaussian distribution). If we set this factor to 0.01, the emulator can estimate the position of the N sigma contour to a precision of N*0.01.

quality_threshold_quadratic

0.0001

See N_quality_samples. In general we want the quadratic term to express our complete ignorance outside the relevant parameter space. To give you a better handle, this parameter is overwritten if values for dimensionality and N_sigma are provided. In this case, the contribution of quality_threshold_quadratic starts to dominate over the constant and linear terms exactly at N_sigma.

burn_in_trigger

100

During the burn-in of the MCMC the emulator should not yet deploy the high-accuracy settings, since it needs to wait for all chains to leave burn-in. Thus, we deploy reduced precision settings. The emulator switches to high accuracy once there are burn_in_trigger consecutive points inside the max_sigma region.

quality_threshold_constant_early

1.0

See N_quality_samples

quality_threshold_linear_early

0.3

See N_quality_samples. Note that this factor can be reformulated as a precision criterion for your confidence bounds (for a Gaussian distribution). If we set this factor to 0.01, the emulator can estimate the position of the N sigma contour to a precision of N*0.01.

quality_threshold_quadratic_early

0.001

See N_quality_samples. In general we want the quadratic term to express our complete ignorance outside the relevant parameter space. To give you a better handle, this parameter is overwritten if values for dimensionality and N_sigma are provided. In this case, the contribution of quality_threshold_quadratic starts to dominate over the constant and linear terms exactly at N_sigma.

quality_points_radius

0.0

One way to reduce the number of performance tests is to create a sphere around each tested emulator call. Whenever a new emulator call lies within a radius of quality_points_radius (in normalized units) of a tested point, no testing is required and the emulator can be used. If set to 0.0, every call will be tested.

Other:

parameter

default

description

working_directory

./

This is the default directory in which all emulator-related files are stored: the cache file, the emulator file, the training data and the log file.

emulator_state_file

emulator_state.pkl

File in which the current state of the emulator is stored. This includes the normalization, PCA and GP kernel parameters.

normalized_cache_file

normalized_cache.pkl

File in which rank 0 stores the normalized training data.

load_initial_state

False

If this flag is set to True, the state from which the emulator is initialized is loaded from an already existing cache file. Otherwise the emulator is initialized once the theory code has been run for the first time. By setting this to True and test_emulator to False, one can use the emulator without calling the theory code at all.

skip_emulation_quantities

None

List of quantities that are provided by the theory code but should not be emulated. As a consequence, the output of these skipped quantities stays constant at the value the emulator was initialized with.

jit

True

Whether to use ‘jax.jit’ to accelerate the emulator by just-in-time compilation.

jit_threshold

60

Using ‘jit’ adds a small overhead due to compiling the code. In the early phase, when there are a lot of new data points, it can be inefficient to do that every time. Thus, we can wait for a certain number of successful emulator calls before we jit the emulator.

check_cache_for_new_points

1000

Every check_cache_for_new_points emulator calls, the cache is checked for new points. If new points are found, the emulator is retrained. This is important if the emulator is used in an MCMC where the emulator is called multiple times for the same point. If the emulator is used in an MCMC, it is recommended to set this to a large number.
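The file-related settings combine with working_directory as sketched below. The page states this explicitly only for cache_file; applying the same rule to the other two files is an assumption, and the function name and the directory `chains/run1` are hypothetical.

```python
import os


def resolve_emulator_paths(working_directory="./",
                           cache_file="cache.pkl",
                           emulator_state_file="emulator_state.pkl",
                           normalized_cache_file="normalized_cache.pkl"):
    """Join each file name onto the working directory."""
    names = {
        "cache_file": cache_file,
        "emulator_state_file": emulator_state_file,
        "normalized_cache_file": normalized_cache_file,
    }
    return {key: os.path.join(working_directory, name)
            for key, name in names.items()}


paths = resolve_emulator_paths(working_directory="chains/run1")
```

Keeping one working directory per run is a simple way to avoid loading a cache that was produced with a different theory or likelihood code.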

Debugging. Highly recommended when investigating a new problem:

parameter

default

description

plotting_directory

None

Path to a directory in which debugging plots are saved (if set).

testset_fraction

None

If set (for example to 0.1), this fraction of the training samples will not be used for training but for testing the performance of the emulator. Additional plots will be created in the plotting_directory.

logfile

None

If set to a text file, the emulator writes a log.

status_print_frequency

200

Every status_print_frequency runs the status of the emulator will be printed.

debug

False

If set to True the emulator will print out a lot of debugging information. This is very helpful when investigating a new problem.

training_verbose

True

If set to True the emulator will print a training bar. For clusters it is recommended to set this to False.
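A set of debugging overrides for a first look at a new problem might be collected as below. The defaults are the ones listed in the table. The override values (directory and file names) are hypothetical, and how such a dictionary is handed to the sampler is not specified on this page.

```python
# Defaults of the debugging parameters as listed in the table above.
debug_defaults = {
    "plotting_directory": None,
    "testset_fraction": None,
    "logfile": None,
    "status_print_frequency": 200,
    "debug": False,
    "training_verbose": True,
}

# Hypothetical overrides for investigating a new problem on a cluster.
debug_overrides = {
    "plotting_directory": "plots",
    "testset_fraction": 0.1,
    "logfile": "emulator.log",
    "debug": True,
    "training_verbose": False,  # no training bar in batch jobs
}

# Guard against misspelled parameter names before merging.
unknown = set(debug_overrides) - set(debug_defaults)
if unknown:
    raise KeyError(f"unknown emulator settings: {sorted(unknown)}")

emulator_settings = {**debug_defaults, **debug_overrides}
```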