Hyper-parameter optimization

The core of the OCTIS framework is an efficient and user-friendly way to select the best hyper-parameters of a topic model using Bayesian optimization.

To initialize an optimization, first create an instance of the Optimizer class:

from octis.optimization.optimizer import Optimizer
optimizer = Optimizer()

Choose the dataset you want to analyze.

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load("octis/preprocessed_datasets/M10")  # folder of a preprocessed dataset
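
Alternatively, recent versions of OCTIS can download a preprocessed dataset by name. A minimal sketch, assuming the fetch_dataset helper is available in your OCTIS version:

dataset = Dataset()
dataset.fetch_dataset("M10")  # downloads the preprocessed M10 dataset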

Choose a topic model.

from octis.models.LDA import LDA
model = LDA()
model.hyperparameters.update({"num_topics": 25})  # fix the number of topics; it is not part of the search space below

Choose the evaluation metric to optimize.

from octis.evaluation_metrics.coherence_metrics import Coherence
metric_parameters = {
    'texts': dataset.get_corpus(),
    'topk': 10,
    'measure': 'c_npmi'
}
npmi = Coherence(metric_parameters)
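
Before launching a full optimization, it can help to sanity-check the metric on a single trained model. A minimal sketch, assuming the standard OCTIS train_model/score API:

output = model.train_model(dataset)  # train one model with the current hyper-parameters
print(npmi.score(output))            # coherence score of this single run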

Create the search space for the optimization.

from skopt.space.space import Real
search_space = {
    "alpha": Real(low=0.001, high=5.0),
    "eta": Real(low=0.001, high=5.0)
}
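
The search space accepts any scikit-optimize dimension, so continuous, integer, and categorical hyper-parameters can be mixed. A hypothetical richer space (the parameter names must match hyper-parameters of the chosen model):

from skopt.space.space import Real, Integer, Categorical
search_space = {
    "alpha": Real(low=0.001, high=5.0),
    "eta": Real(low=0.001, high=5.0),
    "num_topics": Integer(low=10, high=50),  # hypothetical: also tune the number of topics
    "decay": Categorical([0.5, 0.7, 0.9])    # hypothetical: a discrete choice
}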

Finally, launch the optimization.

optimization_result = optimizer.optimize(
    model,
    dataset,
    npmi,
    search_space,
    number_of_call=10,
    n_random_starts=3,
    model_runs=3,
    save_name="result",
    surrogate_model="RF",
    acq_func="LCB"
)

where:

  • number_of_call: int, default: 5. Number of function evaluations.

  • n_random_starts: int, default: 1. Number of random points used to initialize the Bayesian optimization.

  • model_runs: int, default: 3. Number of times the model is trained and evaluated for each hyper-parameter configuration, to reduce the noise of the evaluation.

  • save_name: str, default: “results”. Name of the JSON file where all the results are saved.

  • surrogate_model: str, default: “RF”. Probabilistic surrogate model used to build the prior on the objective function. Can be one of:

    • “RF” for Random Forest regression

    • “GP” for Gaussian Process regression

    • “ET” for Extra Trees regression

  • acq_func: str, default: “EI”. Acquisition function used to optimize the surrogate model (see the example after this list). Can be one of:

    • “LCB” for lower confidence bound

    • “EI” for expected improvement

    • “PI” for probability of improvement
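
For example, to run the same optimization with a Gaussian Process surrogate and the default expected-improvement acquisition function:

optimization_result = optimizer.optimize(
    model,
    dataset,
    npmi,
    search_space,
    number_of_call=10,
    surrogate_model="GP",
    acq_func="EI"
)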

By default, the results of the optimization are saved in the JSON file. However, you can also export them to a more user-friendly CSV file:

optimization_result.save_to_csv("results.csv")
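
The exported file can then be inspected with standard tooling, e.g. pandas. A minimal sketch (the exact column layout depends on your OCTIS version):

import pandas as pd
results = pd.read_csv("results.csv")
print(results.head())  # inspect the first evaluations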

Resume the optimization

An optimization run can be interrupted for any number of reasons. With resume_optimization you can restart the run from the last saved iteration:

optimizer = Optimizer()
optimizer.resume_optimization(json_path)

where json_path is the path to the JSON file containing the results of the previous run.
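
For example, assuming the run above was launched with save_name="result" and therefore wrote result.json:

optimizer = Optimizer()
optimizer.resume_optimization("result.json")  # hypothetical path of the previous results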

Continue the optimization

Suppose that, after an optimization process has finished, you want to perform three extra evaluations. You can do this with the resume_optimization method as well.

optimizer = Optimizer()
optimizer.resume_optimization(json_path, extra_evaluations=3)

where extra_evaluations (int, default 0) is the number of extra evaluations to perform.

Inspect an extra-metric

Suppose that, during the optimization process, you also want to track the value of another metric. For example, define a second coherence metric:

metric_parameters = {
    'texts': dataset.get_corpus(),
    'topk': 10,
    'measure': 'c_npmi'
}
npmi2 = Coherence(metric_parameters)

You can pass it to optimize through the extra_metrics parameter.

optimization_result = optimizer.optimize(
    model,
    dataset,
    npmi,
    search_space,
    number_of_call=10,
    n_random_starts=3,
    extra_metrics=[npmi2]
)

where extra_metrics (list, default None) is the list of extra metrics to inspect. These metrics are computed at each iteration, but they do not drive the optimization.

Early stopping

Suppose that you want to terminate the optimization process when there has been no improvement for a certain number of iterations. You can apply an early-stopping criterion during the optimization.

optimization_result = optimizer.optimize(
    model,
    dataset,
    npmi,
    search_space,
    number_of_call=10,
    n_random_starts=3,
    early_stop=True,
    early_step=5
)

where early_step (int, default 5) is the number of consecutive function evaluations without improvement after which the optimization process is stopped.