Modules

Dataset

class octis.dataset.dataset.Dataset(corpus=None, vocabulary=None, labels=None, metadata=None, document_indexes=None)[source]

Dataset handles a dataset and offers methods to access, save, and edit its data.

fetch_dataset(dataset_name, data_home=None, download_if_missing=True)[source]

Load the filenames and data from a dataset.

Parameters

dataset_name : name of the dataset to download or retrieve

data_home : optional, default: None

Specify a download and cache folder for the datasets. If None, all data is stored in ‘~/octis’ subfolders.

download_if_missing : optional, True by default

If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.

load_custom_dataset_from_folder(path, multilabel=False)[source]

Loads the whole dataset from a folder.

Parameters

path : path of the folder to read

save(path, multilabel=False)[source]

Saves all the dataset info in a folder.

Parameters

path : path to the folder in which files are saved. If the folder doesn’t exist, it will be created.

Data Preprocessing

Evaluation Measures

class octis.evaluation_metrics.metrics.AbstractMetric[source]

Class structure of a generic metric implementation

abstract score(model_output)[source]

Retrieves the score of the metric

Parameters:

model_output (dict) – output of a topic model in the form of a dictionary. See model for details on the model output.
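A minimal sketch of how a custom metric could follow this structure. It does not import OCTIS: `SketchAbstractMetric` and `AverageTopicLength` are hypothetical stand-ins, and only the `score(model_output)` signature and the ‘topics’ key are taken from this page.

```python
from abc import ABC, abstractmethod


class SketchAbstractMetric(ABC):
    """Hypothetical stand-in for AbstractMetric (not the OCTIS class)."""

    @abstractmethod
    def score(self, model_output):
        """Return the metric value for a model output dictionary."""


class AverageTopicLength(SketchAbstractMetric):
    """Toy metric: mean number of words listed per topic."""

    def score(self, model_output):
        topics = model_output["topics"]  # list of lists of strings
        if not topics:
            return 0.0
        return sum(len(topic) for topic in topics) / len(topics)


toy_output = {"topics": [["cat", "dog", "pet"], ["car", "road", "fuel"]]}
metric_value = AverageTopicLength().score(toy_output)
```

Because `score` is abstract, the base class itself cannot be instantiated; each concrete metric has to supply its own implementation.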

class octis.evaluation_metrics.coherence_metrics.Coherence(texts=None, topk=10, processes=1, measure='c_npmi')[source]
score(model_output)[source]

Retrieve the score of the metric

Parameters

model_output : dictionary, output of the model. The ‘topics’ key is required.

Returns

score : coherence score
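To give a feel for what an NPMI-based coherence measures, here is a simplified sketch. It is an assumption-laden illustration, not the library's computation: probabilities are estimated from document-level co-occurrence (no sliding window), and the function name `npmi_coherence` is hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(topics, texts, topk=10):
    """Average NPMI over word pairs of each topic, then over topics.

    Simplification: probabilities are document-level co-occurrence
    frequencies, not sliding-window counts.
    """
    doc_sets = [set(doc) for doc in texts]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic[:topk], 2):
            p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
            if p12 == 0:
                pair_scores.append(-1.0)  # the words never co-occur
            elif p12 == 1:
                pair_scores.append(1.0)  # always co-occur (avoids 0/0)
            else:
                pair_scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
        if pair_scores:  # skip topics with fewer than two words
            topic_scores.append(sum(pair_scores) / len(pair_scores))
    if not topic_scores:
        raise ValueError("no topic has at least two words")
    return sum(topic_scores) / len(topic_scores)


texts = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"]]
coherent = npmi_coherence([["a", "b"]], texts)
incoherent = npmi_coherence([["a", "c"]], texts)
```

Words that always appear together score near 1, while words that never share a document score -1.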

class octis.evaluation_metrics.coherence_metrics.WECoherenceCentroid(topk=10, word2vec_path=None, binary=True)[source]
score(model_output)[source]

Retrieve the score of the metric

Parameters:

model_output – dictionary, output of the model. key ‘topics’ required.

Returns:

topic coherence computed on the word embeddings

class octis.evaluation_metrics.coherence_metrics.WECoherencePairwise(word2vec_path=None, binary=False, topk=10)[source]
score(model_output)[source]

Retrieve the score of the metric

Parameters

model_output : dictionary, output of the model. The ‘topics’ key is required.

Returns

score : topic coherence computed on the word embeddings similarities

class octis.evaluation_metrics.diversity_metrics.InvertedRBO(topk=10, weight=0.9)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters:

model_output – dictionary, output of the model. The ‘topics’ key is required.

class octis.evaluation_metrics.diversity_metrics.KLDivergence[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters:

model_output (dict) – output of a topic model in the form of a dictionary. See model for details on the model output.

class octis.evaluation_metrics.diversity_metrics.LogOddsRatio[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters:

model_output (dict) – output of a topic model in the form of a dictionary. See model for details on the model output.

class octis.evaluation_metrics.diversity_metrics.TopicDiversity(topk=10)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output : dictionary, output of the model. The ‘topics’ key is required.

Returns

td : score
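A minimal sketch of a topic diversity score, assuming the common definition (the fraction of unique words among the top-k words of all topics). The function name `topic_diversity` is hypothetical and the code does not call OCTIS.

```python
def topic_diversity(model_output, topk=10):
    """Fraction of unique words among the top-k words of all topics."""
    topics = model_output["topics"]
    if not topics:
        return 0.0
    if any(len(topic) < topk for topic in topics):
        raise ValueError("every topic needs at least topk words")
    unique_words = set()
    for topic in topics:
        unique_words.update(topic[:topk])
    return len(unique_words) / (topk * len(topics))


toy_output = {"topics": [["cat", "dog", "pet"], ["dog", "car", "road"]]}
td = topic_diversity(toy_output, topk=3)  # 5 unique words / (3 * 2)
```

Under this definition a value of 1 means no word is shared between topics, and values near 0 mean the topics largely repeat the same words.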

class octis.evaluation_metrics.diversity_metrics.WordEmbeddingsInvertedRBO(topk=10, weight=0.9, normalize=True, word2vec_path=None, binary=True)[source]
score(model_output)[source]
Returns:

rank_biased_overlap over the topics

class octis.evaluation_metrics.diversity_metrics.WordEmbeddingsInvertedRBOCentroid(topk=10, weight=0.9, normalize=True, word2vec_path=None, binary=True)[source]
score(model_output)[source]
Returns:

rank_biased_overlap over the topics

class octis.evaluation_metrics.classification_metrics.AccuracyScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output : dictionary, output of the model. The ‘topic-document-matrix’ and ‘test-topic-document-matrix’ keys are required.

Returns

score : score

class octis.evaluation_metrics.classification_metrics.ClassificationScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters:

model_output (dict) – output of a topic model in the form of a dictionary. See model for details on the model output.

class octis.evaluation_metrics.classification_metrics.F1Score(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output : dictionary, output of the model. The ‘topic-document-matrix’ and ‘test-topic-document-matrix’ keys are required.

Returns

score : score

class octis.evaluation_metrics.classification_metrics.PrecisionScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output : dictionary, output of the model. The ‘topic-document-matrix’ and ‘test-topic-document-matrix’ keys are required.

Returns

score : score

class octis.evaluation_metrics.classification_metrics.RecallScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output : dictionary, output of the model. The ‘topic-document-matrix’ and ‘test-topic-document-matrix’ keys are required.

Returns

score : score

class octis.evaluation_metrics.topic_significance_metrics.KL_background[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output : dictionary, output of the model. The ‘topic-document-matrix’ key is required.

Returns

result : score

class octis.evaluation_metrics.topic_significance_metrics.KL_uniform[source]
score(model_output, per_topic=False)[source]

Retrieves the score of the metric

Parameters

model_output : dictionary, output of the model. The ‘topic-word-matrix’ key is required.

per_topic : if True, returns the score for each topic

Returns

result : score
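As an illustration of the idea behind this measure, the sketch below computes the KL divergence of each topic's word distribution from the uniform distribution. The direction of the divergence and the averaging over topics are assumptions, and `kl_to_uniform` is a hypothetical name, not the OCTIS implementation.

```python
import math


def kl_to_uniform(model_output, per_topic=False):
    """KL(topic || uniform) for each row of the topic-word matrix."""
    matrix = model_output["topic-word-matrix"]
    scores = []
    for row in matrix:
        uniform = 1.0 / len(row)
        # Terms with p == 0 contribute 0 by the usual convention.
        scores.append(sum(p * math.log(p / uniform) for p in row if p > 0))
    if per_topic:
        return scores
    return sum(scores) / len(scores)


toy_output = {
    "topic-word-matrix": [
        [0.25, 0.25, 0.25, 0.25],  # uniform topic: divergence 0
        [1.0, 0.0, 0.0, 0.0],      # peaked topic: divergence log(4)
    ]
}
per_topic_scores = kl_to_uniform(toy_output, per_topic=True)
mean_score = kl_to_uniform(toy_output)
```

A topic that spreads its probability evenly over the vocabulary scores 0, so larger values indicate more sharply focused topics.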

class octis.evaluation_metrics.topic_significance_metrics.KL_vacuous[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output : dictionary, output of the model. The ‘topic-word-matrix’ and ‘topic-document-matrix’ keys are required.

Returns

result : score

Optimization

class octis.optimization.optimizer.Optimizer[source]

Class Optimizer to perform Bayesian Optimization on Topic Model

optimize(model, dataset, metric, search_space, extra_metrics=None, number_of_call=5, n_random_starts=1, initial_point_generator='lhs', optimization_type='Maximize', model_runs=5, surrogate_model='RF', kernel=1**2 * Matern(length_scale=1, nu=1.5), acq_func='LCB', random_state=False, x0=None, y0=None, save_models=True, save_step=1, save_name='result', save_path='results/', early_stop=False, early_step=5, plot_best_seen=False, plot_model=False, plot_name='B0_plot', log_scale_plot=False, topk=10)[source]

Perform hyper-parameter optimization for a Topic Model

Parameters:
  • model (OCTIS Topic Model) – model with hyperparameters to optimize

  • dataset (OCTIS dataset) – dataset used by the model

  • metric (OCTIS metric) – metric used for the optimization

  • search_space (skopt space object) – a dictionary of hyperparameters to optimize (each parameter is defined as a skopt space)

  • extra_metrics (list of metrics, optional) – list of extra-metrics to compute during the optimization

  • number_of_call (int, optional) – number of evaluations of metric

  • n_random_starts (int, optional) – number of evaluations of metric with random points before approximating it with surrogate model

  • initial_point_generator (str, optional) – set an initial point generator. Can be either “random”, “sobol”, “halton”, “hammersly” or “lhs”

  • optimization_type – Set “Maximize” to maximize the metric, “Minimize” to minimize it

  • model_runs (int, optional) –

  • surrogate_model (str, optional) – set a surrogate model. Can be either “GP” (Gaussian Process), “RF” (Random Forest) or “ET” (Extra-Tree)

  • kernel – set a kernel function

  • acq_func (str, optional) – Function to minimize over the surrogate model. Can be either “LCB” (Lower Confidence Bound), “EI” (Expected Improvement) or “PI” (Probability of Improvement)

  • random_state (int, optional) – Set random state to something other than None for reproducible results.

  • x0 (list, optional) – List of initial input points.

  • y0 (list, optional) – Evaluation of initial input points.

  • save_models (bool, optional) – if “True” save all the topic models generated during the optimization process

  • save_step (int, optional) – how often to save the results of the optimization

  • save_name (str, optional) – name of the file where the results of the optimization will be saved

  • save_path (str, optional) – Path where the results of the optimization (json file) will be saved

  • early_stop (bool, optional) – if “True” stop the optimization if there is no improvement after early_step evaluations

  • early_step (int, optional) – number of iterations with no improvement after which optimization will be stopped (if early_stop is True)

  • plot_best_seen (bool, optional) – If “True” save a convergence plot of the result of a Bayesian_optimization (i.e. the best seen for each iteration)

  • plot_model (bool, optional) – If “True” save the boxplot of all the model runs

  • plot_name (str, optional) – Set the name of the plots (best_seen and model_runs).

  • log_scale_plot (bool, optional) – if “True” use the logarithmic scale for the plots.

  • topk (int, optional) –

Returns:

OptimizerEvaluation object

Return type:

class

resume_optimization(name_path, extra_evaluations=0)[source]

Restart the optimization from the json file.

Parameters:
  • name_path (str) – path of the json file

  • extra_evaluations (int) – extra iterations for the BO optimization

Returns:

object with the results of the optimization

Return type:

object

octis.optimization.optimizer_tool.check_instance(obj)[source]

Check if a specific object can be inserted in the json file.

Parameters:

obj ([str,float, int, bool, etc.]) – an object of the optimization to be saved

Returns:

‘True’ if the object is in json format, ‘False’ otherwise

Return type:

bool
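One way to express such a check is to attempt the serialization directly, as in this sketch. The helper name `is_json_serializable` is hypothetical, and the library's own function may instead test against a list of allowed types.

```python
import json


def is_json_serializable(obj):
    """Return True if obj can be written to a json file as-is."""
    try:
        json.dumps(obj)
    except (TypeError, ValueError):
        # TypeError: unsupported type; ValueError: e.g. circular reference.
        return False
    return True


ok = is_json_serializable({"scores": [0.1, 0.2], "done": True})
not_ok = is_json_serializable({1, 2, 3})  # sets are not json types
```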

octis.optimization.optimizer_tool.choose_optimizer(optimizer)[source]

Choose a surrogate model for Bayesian Optimization

Parameters:

optimizer (Optimizer) – list of settings of the BO experiment

Returns:

surrogate model

Return type:

scikit object

octis.optimization.optimizer_tool.convergence_res(values, optimization_type='minimize')[source]

Compute the list of values to plot the convergence plot (i.e. the best seen at each iteration)

Parameters:
  • values (list) – the result(s) for which to compute the convergence trace.

  • optimization_type (str) – “minimize” if the problem is a minimization problem, “maximize” otherwise

Returns:

a list with the best min seen for each iteration

Return type:

list
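The "best seen at each iteration" trace can be sketched as a running minimum or maximum. The function below is a hypothetical stand-in (`best_seen_trace`), written under the assumption that `values` is a flat list of objective values in evaluation order.

```python
def best_seen_trace(values, optimization_type="minimize"):
    """Return the best value seen up to and including each iteration."""
    pick = min if optimization_type == "minimize" else max
    trace = []
    for value in values:
        # The first value starts the trace; later ones compete with the best so far.
        trace.append(value if not trace else pick(trace[-1], value))
    return trace


min_trace = best_seen_trace([3, 1, 2, 0])
max_trace = best_seen_trace([1, 3, 2, 5], optimization_type="maximize")
```

The resulting list is monotone, which is what makes it suitable for a convergence plot.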

octis.optimization.optimizer_tool.convert_type(obj)[source]

Convert a numpy object to a python object

Parameters:

obj (numpy object) – object to be checked

Returns:

python object

Return type:

python object

octis.optimization.optimizer_tool.early_condition(values, n_stop, n_random)[source]

Compute the early-stop criterion that decides whether to stop the optimization.

Parameters:
  • values (list) – values obtained by Bayesian Optimization

  • n_stop (int) – Range of points without improvement

  • n_random (int) – Random starting points

Returns:

‘True’ if early stop condition reached, ‘False’ otherwise

Return type:

bool
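The exact rule is not spelled out here, so the following is only one plausible reading of the criterion: stop when the last `n_stop` evaluations did not improve on the best value found earlier, ignoring the first `n_random` random starts. The sketch assumes a minimization problem, and `no_recent_improvement` is a hypothetical name.

```python
def no_recent_improvement(values, n_stop, n_random):
    """True if the last n_stop values did not beat the earlier best.

    Assumes a minimization problem and ignores the first n_random
    (random-start) evaluations.
    """
    if n_stop < 1:
        raise ValueError("n_stop must be at least 1")
    if len(values) < n_random + n_stop + 1:
        return False  # not enough evaluations to judge yet
    guided = values[n_random:]
    best_before = min(guided[:-n_stop])
    best_recent = min(guided[-n_stop:])
    return best_recent >= best_before


stalled = no_recent_improvement([5, 4, 3, 3, 3, 3], n_stop=3, n_random=1)
improving = no_recent_improvement([5, 4, 3, 2, 1, 0], n_stop=3, n_random=1)
```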

octis.optimization.optimizer_tool.importClass(class_name, module_name, module_path)[source]

Import a class at runtime based on its module and name

Parameters:
  • class_name (str) – name of the class

  • module_name (str) – name of the module

  • module_path (str) – absolute path to the module

Returns:

class object

Return type:

class
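Loading a class from a file path at runtime can be done with the standard `importlib` machinery, as sketched below. `import_class_from_path` is a hypothetical helper that mirrors the three parameters above; it is not the OCTIS implementation.

```python
import importlib.util


def import_class_from_path(class_name, module_name, module_path):
    """Load module_path as module_name and return its class_name attribute."""
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {module_name!r} from {module_path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the module's top-level code
    return getattr(module, class_name)
```

The returned object is the class itself, so it still has to be instantiated by the caller.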

octis.optimization.optimizer_tool.load_model(optimization_object)[source]

Load the topic model for the resume of the optimization

Parameters:

optimization_object (dict) – dictionary of optimization attributes saved in the json file

Returns:

topic model used during the BO.

Return type:

object model

octis.optimization.optimizer_tool.load_search_space(search_space)[source]

Load the search space from the json file

Parameters:

search_space – dictionary of the search space (insertable in a json file)

Returns:

dictionary for the search space (for scikit optimize)

Return type:

dict

octis.optimization.optimizer_tool.plot_bayesian_optimization(values, name_plot, log_scale=False, conv_max=True)[source]

Save a convergence plot of the result of a Bayesian_optimization.

Parameters:
  • values (list) – List of objective function values

  • name_plot (str) – Name of the plot

  • log_scale (bool, optional) – ‘True’ if log scale for y-axis, ‘False’ otherwise

  • conv_max (bool, optional) – ‘True’ for a minimization problem, ‘False’ for a maximization problem

octis.optimization.optimizer_tool.plot_model_runs(model_runs, current_call, name_plot)[source]

Save a boxplot of the data (Works only when optimization_runs is 1).

Parameters:
  • model_runs (dict) – dictionary of all the model runs.

  • current_call (int) – number of calls computed by BO

  • name_plot (str) – Name of the plot

octis.optimization.optimizer_tool.save_search_space(search_space)[source]

Save the search space in the json file

Parameters:

search_space (dict) – dictionary of the search space (skopt object)

Returns:

dictionary for the search space, which can be saved in a json file

Return type:

dict

octis.optimization.optimizer_tool.select_metric(metric_parameters, metric_name)[source]

Select the metric for the resume of the optimization

Parameters:
  • metric_parameters (list) – metric parameters

  • metric_name (str) – name of the metric

Returns:

metric

Return type:

metric object

Models

class octis.models.model.AbstractModel[source]

Class structure of a generic Topic Modeling implementation

set_hyperparameters(**kwargs)[source]

Set model hyperparameters

Parameters:

**kwargs

a dictionary in the form {hyperparameter name: value}
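A minimal sketch of the keyword-argument pattern described here. `SketchModel` and its two hyperparameters are hypothetical, and rejecting unknown names is a design choice of the sketch rather than documented behavior.

```python
class SketchModel:
    """Hypothetical stand-in for a topic model (not an OCTIS class)."""

    def __init__(self, num_topics=10, alpha=0.1):
        self.hyperparameters = {"num_topics": num_topics, "alpha": alpha}

    def set_hyperparameters(self, **kwargs):
        """Overwrite known hyperparameters; reject unknown names."""
        for name, value in kwargs.items():
            if name not in self.hyperparameters:
                raise KeyError(f"unknown hyperparameter: {name}")
            self.hyperparameters[name] = value


model = SketchModel()
model.set_hyperparameters(num_topics=25)
```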

abstract train_model(dataset, hyperparameters, top_words=10)[source]

Train the model.

Parameters:
  • dataset – Dataset

  • hyperparameters – dictionary in the form {hyperparameter name: value}

  • top_words – number of top significant words for each topic (default: 10)

Return model_output:

a dictionary containing up to 4 keys: topics, topic-word-matrix, topic-document-matrix, test-topic-document-matrix. topics is the list of the most significant words for each topic (list of lists of strings). topic-word-matrix is the matrix (num topics x ||vocabulary||) containing the probabilities of a word in a given topic. topic-document-matrix is the matrix (||topics|| x ||training documents||) containing the probabilities of the topics in a given training document. test-topic-document-matrix is the matrix (||topics|| x ||testing documents||) containing the probabilities of the topics in a given testing document.
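To make the shapes described above concrete, the sketch below builds a toy output dictionary with plain lists and checks that its dimensions agree. The helper `check_model_output` and the toy values are hypothetical; real model outputs would typically hold arrays rather than nested lists.

```python
def check_model_output(model_output):
    """Validate the shapes described above; return (n_topics, n_words, n_docs)."""
    topics = model_output["topics"]
    topic_word = model_output["topic-word-matrix"]
    topic_doc = model_output["topic-document-matrix"]
    n_topics = len(topics)
    if n_topics == 0:
        raise ValueError("model output contains no topics")
    if len(topic_word) != n_topics or len(topic_doc) != n_topics:
        raise ValueError("every matrix needs one row per topic")
    n_words = len(topic_word[0])
    n_docs = len(topic_doc[0])
    if any(len(row) != n_words for row in topic_word):
        raise ValueError("ragged topic-word-matrix")
    if any(len(row) != n_docs for row in topic_doc):
        raise ValueError("ragged topic-document-matrix")
    return n_topics, n_words, n_docs


toy_output = {
    # Two topics, each listed by its most significant words.
    "topics": [["cat", "dog"], ["car", "road"]],
    # 2 topics x 4 vocabulary words.
    "topic-word-matrix": [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]],
    # 2 topics x 3 training documents.
    "topic-document-matrix": [[0.9, 0.2, 0.5], [0.1, 0.8, 0.5]],
}
shape = check_model_output(toy_output)
```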

octis.models.model.load_model_output(output_path, vocabulary_path=None, top_words=10)[source]

Loads a model output from the chosen directory

Parameters:
  • output_path – path in which the model output is saved

  • vocabulary_path – path in which the vocabulary is saved (optional, used to retrieve the top k words of each topic)

  • top_words – top k words to retrieve for each topic (in case a vocabulary path is given)

octis.models.model.save_model_output(model_output, path='.', appr_order=7)[source]

Saves the model output in the chosen directory

Parameters:
  • model_output – output of the model

  • path – path in which the file will be saved and name of the file

  • appr_order – approximation order (used to round model_output values)

class octis.models.LDA.LDA(num_topics=100, distributed=False, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None)[source]
hyperparameters_info()[source]

Returns information about the hyperparameters

info()[source]

Returns information about the model

partitioning(use_partitions, update_with_test=False)[source]

Handle the partitioning system to use and reset the model to perform new evaluations

Parameters

use_partitions : True if train/test partitioning is needed, False otherwise

update_with_test : True if the model should be updated with the test set, False otherwise

set_hyperparameters(**kwargs)[source]

Set model hyperparameters

train_model(dataset, hyperparams=None, top_words=10)[source]

Train the model and return output

Parameters

dataset : dataset to use to build the model

hyperparams : hyperparameters to build the model

top_words : if greater than 0, returns the most significant words for each topic in the output (default: 10)

Returns

result : dictionary with up to 3 entries, ‘topics’, ‘topic-word-matrix’ and ‘topic-document-matrix’

class octis.models.NMF_scikit.NMF_scikit(num_topics=100, init=None, alpha=0, l1_ratio=0, regularization='both', use_partitions=True)[source]
hyperparameters_info()[source]

Returns information about the hyperparameters

partitioning(use_partitions, update_with_test=False)[source]

Handle the partitioning system to use and reset the model to perform new evaluations

Parameters

use_partitions : True if train/test partitioning is needed, False otherwise

update_with_test : True if the model should be updated with the test set, False otherwise

train_model(dataset, hyperparameters=None, top_words=10)[source]

Train the model and return output

Parameters

dataset : dataset to use to build the model

hyperparameters : hyperparameters to build the model

top_words : if greater than 0, returns the most significant words for each topic in the output (default: 10)

Returns

result : dictionary with up to 3 entries, ‘topics’, ‘topic-word-matrix’ and ‘topic-document-matrix’

class octis.models.CTM.CTM(num_topics=10, model_type='prodLDA', activation='softplus', dropout=0.2, learn_priors=True, batch_size=64, lr=0.002, momentum=0.99, solver='adam', num_epochs=100, reduce_on_plateau=False, prior_mean=0.0, prior_variance=None, num_layers=2, num_neurons=100, seed=None, use_partitions=True, num_samples=10, inference_type='zeroshot', bert_path='', bert_model='bert-base-nli-mean-tokens')[source]
train_model(dataset, hyperparameters=None, top_words=10)[source]

Trains the CTM model

Parameters:
  • dataset – octis Dataset for training the model

  • hyperparameters – dict, with (optionally) the following information:

  • top_words – number of top-n words of the topics (default 10)

class octis.models.ETM.ETM(num_topics=10, num_epochs=100, t_hidden_size=800, rho_size=300, embedding_size=300, activation='relu', dropout=0.5, lr=0.005, optimizer='adam', batch_size=128, clip=0.0, wdecay=1.2e-06, bow_norm=1, device='cpu', train_embeddings=True, embeddings_path=None, embeddings_type='pickle', binary_embeddings=True, headerless_embeddings=False, use_partitions=True)[source]
train_model(dataset, hyperparameters=None, top_words=10, op_path='checkpoint.pt')[source]

Train the model.

Parameters:
  • dataset – Dataset

  • hyperparameters – dictionary in the form {hyperparameter name: value}

  • top_words – number of top significant words for each topic (default: 10)

Return model_output:

a dictionary containing up to 4 keys: topics, topic-word-matrix, topic-document-matrix, test-topic-document-matrix. topics is the list of the most significant words for each topic (list of lists of strings). topic-word-matrix is the matrix (num topics x ||vocabulary||) containing the probabilities of a word in a given topic. topic-document-matrix is the matrix (||topics|| x ||training documents||) containing the probabilities of the topics in a given training document. test-topic-document-matrix is the matrix (||topics|| x ||testing documents||) containing the probabilities of the topics in a given testing document.