Modules

Dataset

class octis.dataset.dataset.Dataset(corpus=None, vocabulary=None, labels=None, metadata=None, document_indexes=None)[source]

Dataset handles a dataset and offers methods to access, save and edit the dataset data

fetch_dataset(dataset_name, data_home=None, download_if_missing=True)[source]

Load the filenames and data from a dataset.

Parameters
  • dataset_name – name of the dataset to download or retrieve

  • data_home (optional, default: None) – download and cache folder for the datasets. If None, all data is stored in ‘~/octis’ subfolders.

  • download_if_missing (optional, default: True) – if False, raise an IOError when the data is not locally available instead of trying to download it from the source site.

load_custom_dataset_from_folder(path, multilabel=False)[source]

Loads the whole dataset from a folder.

Parameters

path – path of the folder to read

save(path, multilabel=False)[source]

Saves all the dataset information in a folder.

Parameters

path – path to the folder in which the files are saved. If the folder doesn’t exist, it will be created.
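The three methods above can be combined as in the following minimal sketch. It assumes OCTIS is installed; the helper names `load_and_save_dataset` and `load_from_folder` are hypothetical and only wrap the calls documented above. The imports are placed inside the functions so that the definitions themselves do not require OCTIS.

```python
def load_and_save_dataset(dataset_name, output_folder):
    """Fetch a dataset by name and save a copy of it to a folder."""
    # Imported lazily so this sketch can be defined without OCTIS installed.
    from octis.dataset.dataset import Dataset

    dataset = Dataset()
    # Retrieves the dataset, downloading it if it is not cached locally.
    dataset.fetch_dataset(dataset_name)
    # Writes the dataset files to output_folder (created if missing).
    dataset.save(output_folder)
    return dataset


def load_from_folder(folder):
    """Load a dataset from a folder, for example one written by save()."""
    from octis.dataset.dataset import Dataset

    dataset = Dataset()
    dataset.load_custom_dataset_from_folder(folder)
    return dataset
```

A call such as `load_and_save_dataset("20NewsGroup", "my_dataset")` would then fetch and store a dataset; the dataset name and the folder name here are illustrative assumptions.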

Data Preprocessing

Evaluation Measures

class octis.evaluation_metrics.metrics.AbstractMetric[source]

Class structure of a generic metric implementation

abstract score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of a topic model in the form of a dictionary. See model for details on the model output.
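The `score(model_output)` contract can be illustrated with a toy metric. This is a sketch, not OCTIS code: `MetricSketch` is a stand-in for `AbstractMetric` so that the example runs without the library, and `MeanTopWordLength` is a hypothetical metric invented for illustration.

```python
from abc import ABC, abstractmethod


class MetricSketch(ABC):
    """Stand-in for AbstractMetric so the example runs without OCTIS."""

    @abstractmethod
    def score(self, model_output):
        """Return a number computed from the model output dictionary."""


class MeanTopWordLength(MetricSketch):
    """Toy metric: average character length of the top-k words per topic."""

    def __init__(self, topk=10):
        self.topk = topk

    def score(self, model_output):
        topics = model_output["topics"]  # list of lists of strings
        words = [word for topic in topics for word in topic[: self.topk]]
        if not words:
            return 0.0
        return sum(len(word) for word in words) / len(words)


example_output = {"topics": [["data", "model"], ["topic", "word"]]}
mean_length = MeanTopWordLength(topk=2).score(example_output)
```

A real metric would presumably subclass `AbstractMetric` instead of the stand-in, but the shape is the same: configuration goes in the constructor, and `score` receives only the model output dictionary.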

class octis.evaluation_metrics.coherence_metrics.Coherence(texts=None, topk=10, measure='c_npmi')[source]
score(model_output)[source]

Retrieve the score of the metric

Parameters

model_output (dict) – output of the model. The ‘topics’ key is required.

Returns

coherence score

class octis.evaluation_metrics.coherence_metrics.WECoherenceCentroid(topk=10, word2vec_path=None, binary=True)[source]
score(model_output)[source]

Retrieve the score of the metric

Parameters

model_output – dictionary, output of the model. key ‘topics’ required.

Returns

topic coherence computed on the word embeddings

class octis.evaluation_metrics.coherence_metrics.WECoherencePairwise(word2vec_path=None, binary=False, topk=10)[source]
score(model_output)[source]

Retrieve the score of the metric

Parameters

model_output (dict) – output of the model. The ‘topics’ key is required.

Returns

topic coherence computed on the word embeddings similarities

class octis.evaluation_metrics.diversity_metrics.InvertedRBO(topk=10, weight=0.9)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of the model. The ‘topics’ key is required.

class octis.evaluation_metrics.diversity_metrics.KLDivergence[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of a topic model in the form of a dictionary. See model for details on the model output.

class octis.evaluation_metrics.diversity_metrics.LogOddsRatio[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of a topic model in the form of a dictionary. See model for details on the model output.

class octis.evaluation_metrics.diversity_metrics.TopicDiversity(topk=10)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of the model. The ‘topics’ key is required.

Returns

topic diversity score
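Topic diversity is commonly defined as the fraction of unique words among the top-k words of all topics. The sketch below follows that definition; it is not taken from the OCTIS source, and the function name `topic_diversity` is an assumption made for illustration.

```python
def topic_diversity(topics, topk=10):
    """Fraction of unique words among the top-k words of all topics."""
    if not topics:
        raise ValueError("at least one topic is required")
    for topic in topics:
        if len(topic) < topk:
            raise ValueError("every topic needs at least topk words")

    unique_words = set()
    for topic in topics:
        unique_words.update(topic[:topk])
    # 1.0 means no word is shared between topics; lower values mean overlap.
    return len(unique_words) / (topk * len(topics))


example_topics = [
    ["apple", "banana", "cherry"],
    ["apple", "dog", "cat"],
]
td = topic_diversity(example_topics, topk=3)
```

In the example, five distinct words appear among six top-word slots, so the two topics share one word.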

class octis.evaluation_metrics.diversity_metrics.WordEmbeddingsInvertedRBO(topk=10, weight=0.9, normalize=True, word2vec_path=None, binary=True)[source]
score(model_output)[source]
Returns

rank_biased_overlap over the topics

class octis.evaluation_metrics.diversity_metrics.WordEmbeddingsInvertedRBOCentroid(topk=10, weight=0.9, normalize=True, word2vec_path=None, binary=True)[source]
score(model_output)[source]
Returns

rank_biased_overlap over the topics

class octis.evaluation_metrics.classification_metrics.AccuracyScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of the model. The ‘topic-document-matrix’ and ‘test-topic-document-matrix’ keys are required.

Returns

score

class octis.evaluation_metrics.classification_metrics.ClassificationScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of a topic model in the form of a dictionary. See model for details on the model output.

class octis.evaluation_metrics.classification_metrics.F1Score(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of the model. The ‘topic-document-matrix’ and ‘test-topic-document-matrix’ keys are required.

Returns

score

class octis.evaluation_metrics.classification_metrics.PrecisionScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of the model. The ‘topic-document-matrix’ and ‘test-topic-document-matrix’ keys are required.

Returns

score

class octis.evaluation_metrics.classification_metrics.RecallScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of the model. The ‘topic-document-matrix’ and ‘test-topic-document-matrix’ keys are required.

Returns

score

class octis.evaluation_metrics.topic_significance_metrics.KL_background[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of the model. The ‘topic-document-matrix’ key is required.

Returns

score

class octis.evaluation_metrics.topic_significance_metrics.KL_uniform[source]
score(model_output, per_topic=False)[source]

Retrieves the score of the metric

Parameters
  • model_output (dict) – output of the model. The ‘topic-word-matrix’ key is required.

  • per_topic – if True, return the score for each topic

Returns

score
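As an illustration of what a KL-to-uniform significance score measures, the sketch below computes the Kullback-Leibler divergence from each topic's word distribution to the uniform distribution over the vocabulary. It assumes that direction of the divergence and a mean over topics; the OCTIS implementation may differ in details such as zero handling. The function names are hypothetical.

```python
import math


def kl_to_uniform(distribution):
    """KL divergence from a (possibly unnormalized) distribution to uniform."""
    total = sum(distribution)
    vocabulary_size = len(distribution)
    divergence = 0.0
    for weight in distribution:
        p = weight / total
        if p > 0:  # terms with p == 0 contribute nothing
            # p * log(p / (1 / V)) == p * log(p * V)
            divergence += p * math.log(p * vocabulary_size)
    return divergence


def kl_uniform_score(topic_word_matrix, per_topic=False):
    """Per-topic divergences, or their mean when per_topic is False."""
    divergences = [kl_to_uniform(row) for row in topic_word_matrix]
    if per_topic:
        return divergences
    return sum(divergences) / len(divergences)


flat = kl_to_uniform([0.25, 0.25, 0.25, 0.25])
peaked = kl_to_uniform([1.0, 0.0, 0.0, 0.0])
```

A topic that spreads its probability evenly over the vocabulary scores zero, while a topic concentrated on a single word reaches the maximum, the logarithm of the vocabulary size.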

class octis.evaluation_metrics.topic_significance_metrics.KL_vacuous[source]
score(model_output)[source]

Retrieves the score of the metric

Parameters

model_output (dict) – output of the model. The ‘topic-word-matrix’ and ‘topic-document-matrix’ keys are required.

Returns

score

Optimization

class octis.optimization.optimizer.Optimizer[source]

Class Optimizer to perform Bayesian Optimization on Topic Model

optimize(model, dataset, metric, search_space, extra_metrics=None, number_of_call=5, n_random_starts=1, initial_point_generator='lhs', optimization_type='Maximize', model_runs=5, surrogate_model='RF', kernel=1**2 * Matern(length_scale=1, nu=1.5), acq_func='LCB', random_state=False, x0=None, y0=None, save_models=True, save_step=1, save_name='result', save_path='results/', early_stop=False, early_step=5, plot_best_seen=False, plot_model=False, plot_name='B0_plot', log_scale_plot=False, topk=10)[source]

Perform hyper-parameter optimization for a Topic Model

Parameters
  • model (OCTIS Topic Model) – model with hyperparameters to optimize

  • dataset (OCTIS dataset) – dataset for the model

  • metric (OCTIS metric) – metric used for the optimization

  • search_space (skopt space object) – a dictionary of hyperparameters to optimize (each parameter is defined as a skopt space)

  • extra_metrics (list of metrics, optional) – list of extra metrics to compute during the optimization

  • number_of_call (int, optional) – number of evaluations of metric

  • n_random_starts (int, optional) – number of evaluations of metric with random points before approximating it with the surrogate model

  • initial_point_generator (str, optional) – set an initial point generator. Can be “random”, “sobol”, “halton”, “hammersly” or “lhs”

  • optimization_type – set “Maximize” if you want to maximize metric, “Minimize” if you want to minimize it

  • model_runs (int, optional)

  • surrogate_model (str, optional) – set a surrogate model. Can be “GP” (Gaussian Process), “RF” (Random Forest) or “ET” (Extra-Tree)

  • kernel – set a kernel function

  • acq_func (str, optional) – function to minimize over the surrogate model. Can be “LCB” (Lower Confidence Bound), “EI” (Expected Improvement) or “PI” (Probability of Improvement)

  • random_state (int, optional) – set random state to something other than None for reproducible results

  • x0 (list, optional) – list of initial input points

  • y0 (list, optional) – evaluation of initial input points

  • save_models (bool, optional) – if “True” save all the topic models generated during the optimization process

  • save_step (int, optional) – how often to save the results of the optimization

  • save_name (str, optional) – name of the file where the results of the optimization will be saved

  • save_path (str, optional) – path where the results of the optimization (json file) will be saved

  • early_stop (bool, optional) – if “True” stop the optimization if there is no improvement after early_step evaluations

  • early_step (int, optional) – number of iterations with no improvement after which the optimization will be stopped (if early_stop is True)

  • plot_best_seen (bool, optional) – if “True” save a convergence plot of the result of a Bayesian optimization (i.e. the best seen for each iteration)

  • plot_model (bool, optional) – if “True” save the boxplot of all the model runs

  • plot_name (str, optional) – set the name of the plots (best_seen and model_runs)

  • log_scale_plot (bool, optional) – if “True” use the logarithmic scale for the plots

  • topk (int, optional)

Returns

OptimizerEvaluation object

Return type

class
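A call to `optimize` might look like the following sketch. It assumes OCTIS and scikit-optimize are installed and that the model is an LDA instance, whose `alpha` and `eta` hyperparameters are searched here; the wrapper `run_optimization`, the bounds, and the call counts are illustrative assumptions, not recommended settings.

```python
def run_optimization(model, dataset, metric, results_folder):
    """Search two LDA hyperparameters with Bayesian optimization."""
    # Imported lazily so this sketch can be defined without the libraries.
    from octis.optimization.optimizer import Optimizer
    from skopt.space.space import Real

    # Each hyperparameter is described by a skopt space object.
    search_space = {
        "alpha": Real(low=0.001, high=5.0),
        "eta": Real(low=0.001, high=5.0),
    }

    optimizer = Optimizer()
    return optimizer.optimize(
        model,
        dataset,
        metric,
        search_space,
        number_of_call=30,  # evaluations of the metric
        model_runs=5,
        save_path=results_folder,
    )
```

The returned object is the OptimizerEvaluation described above; the results are also written as a json file under `save_path`.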

resume_optimization(name_path, extra_evaluations=0)[source]

Restart the optimization from the json file.

Parameters
  • name_path (str) – path of the json file

  • extra_evaluations (int) – extra iterations for the BO optimization

Returns

object with the results of the optimization

Return type

object

octis.optimization.optimizer_tool.check_instance(obj)[source]

Check if a specific object can be inserted in the json file.

Parameters

obj ([str,float, int, bool, etc.]) – an object of the optimization to be saved

Returns

‘True’ if the object can be inserted in a json file, ‘False’ otherwise

Return type

bool
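One generic way to test whether an object can be written to a json file is to try serializing it. The sketch below takes that approach with the standard library; it is not necessarily how `check_instance` decides, and the name `is_json_serializable` is hypothetical.

```python
import json


def is_json_serializable(obj):
    """Return True if json.dumps accepts obj, False otherwise."""
    try:
        json.dumps(obj)
    except (TypeError, ValueError):
        # TypeError: unsupported type; ValueError: e.g. circular reference.
        return False
    return True


settings_ok = is_json_serializable({"topk": 10, "scores": [0.1, 0.2]})
set_ok = is_json_serializable({1, 2, 3})
```

Plain dictionaries, lists, strings, numbers, booleans and None pass the check, while objects such as sets or arbitrary class instances do not.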

octis.optimization.optimizer_tool.choose_optimizer(optimizer)[source]

Choose a surrogate model for Bayesian Optimization

Parameters

optimizer (Optimizer) – list of settings of the BO experiment

Returns

surrogate model

Return type

scikit object

octis.optimization.optimizer_tool.convergence_res(values, optimization_type='minimize')[source]

Compute the list of values to plot the convergence plot (i.e. the best seen at each iteration)

Parameters
  • values (list) – the result(s) for which to compute the convergence trace.

  • optimization_type (str) – “minimize” if the problem is a minimization problem, “maximize” otherwise

Returns

a list with the best min seen for each iteration

Return type

list
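The "best seen at each iteration" trace is a running minimum (or maximum) over the evaluated values. A minimal sketch of the idea, using the hypothetical name `best_seen` rather than the library function:

```python
def best_seen(values, optimization_type="minimize"):
    """Running best value after each evaluation."""
    pick = min if optimization_type == "minimize" else max
    trace = []
    for value in values:
        # The first value is its own best; later ones compare with the trace.
        trace.append(value if not trace else pick(trace[-1], value))
    return trace


min_trace = best_seen([3, 1, 2, 0.5])
max_trace = best_seen([3, 1, 4, 2], optimization_type="maximize")
```

Plotting such a trace against the iteration index gives the convergence plot: the curve only moves when an evaluation improves on everything seen so far.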

octis.optimization.optimizer_tool.convert_type(obj)[source]

Convert a numpy object to a python object

Parameters

obj (numpy object) – object to be checked

Returns

python object

Return type

python object

octis.optimization.optimizer_tool.early_condition(values, n_stop, n_random)[source]

Compute the early-stop criterion to decide whether to stop the optimization.

Parameters
  • values (list) – values obtained by Bayesian Optimization

  • n_stop (int) – Range of points without improvement

  • n_random (int) – Random starting points

Returns

‘True’ if early stop condition reached, ‘False’ otherwise

Return type

bool
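An early-stop rule of this kind can be sketched as follows. The sketch assumes a minimization problem and interprets "no improvement" as: the best of the last `n_stop` values is not better than the best value seen before them. This is an illustrative interpretation, not the library's exact rule, and `should_stop_early` is a hypothetical name.

```python
def should_stop_early(values, n_stop, n_random):
    """True if the last n_stop evaluations brought no improvement."""
    if n_stop < 1:
        raise ValueError("n_stop must be at least 1")
    # Never stop before the random starts plus a full window are evaluated,
    # and make sure there is at least one earlier value to compare against.
    minimum_length = max(n_random, 1) + n_stop
    if len(values) < minimum_length:
        return False

    best_before = min(values[:-n_stop])
    best_recent = min(values[-n_stop:])
    return best_recent >= best_before


stalled = should_stop_early([5, 4, 3, 3, 3.5, 3.2], n_stop=3, n_random=1)
improving = should_stop_early([5, 4, 3, 2, 1], n_stop=2, n_random=1)
```

For a maximization problem the same logic applies with `max` in place of `min` and the comparison reversed.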

octis.optimization.optimizer_tool.importClass(class_name, module_name, module_path)[source]

Import a class at runtime based on its module and name

Parameters
  • class_name (str) – name of the class

  • module_name (str) – name of the module

  • module_path (str) – absolute path to the module

Returns

class object

Return type

class
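Importing a class at runtime from a module file can be done with the standard `importlib` machinery. The sketch below shows one way to do it under the same three parameters; it is an assumption about the general technique, not a copy of `importClass`, and the temporary `toy_module.py` file exists only for the demonstration.

```python
import importlib.util
import os
import tempfile


def import_class_from_path(class_name, module_name, module_path):
    """Load module_path as module_name and return the named class."""
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {module_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the module's top-level code
    return getattr(module, class_name)


# Demonstration with a throwaway module written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_module.py")
    with open(path, "w") as handle:
        handle.write("class Toy:\n    value = 7\n")
    loaded_class = import_class_from_path("Toy", "toy_module", path)
```

The returned object is the class itself, so it can be instantiated afterwards like any class imported with a regular `import` statement.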

octis.optimization.optimizer_tool.load_model(optimization_object)[source]

Load the topic model for the resume of the optimization

Parameters

optimization_object (dict) – dictionary of optimization attributes saved in the json file

Returns

topic model used during the BO.

Return type

object model

octis.optimization.optimizer_tool.load_search_space(search_space)[source]

Load the search space from the json file

Parameters

search_space – dictionary of the search space (insertable in a json file)

Returns

dictionary for the search space (for scikit optimize)

Return type

dict

octis.optimization.optimizer_tool.plot_bayesian_optimization(values, name_plot, log_scale=False, conv_max=True)[source]

Save a convergence plot of the result of a Bayesian_optimization.

Parameters
  • values (list) – List of objective function values

  • name_plot (str) – Name of the plot

  • log_scale (bool, optional) – ‘True’ if you want a log scale for y-axis, ‘False’ otherwise

  • conv_max (bool, optional) – ‘True’ for a minimization problem, ‘False’ for a maximization problem

octis.optimization.optimizer_tool.plot_model_runs(model_runs, current_call, name_plot)[source]

Save a boxplot of the data (Works only when optimization_runs is 1).

Parameters
  • model_runs (dict) – dictionary of all the model runs.

  • current_call (int) – number of calls computed by BO

  • name_plot (str) – Name of the plot

octis.optimization.optimizer_tool.save_search_space(search_space)[source]

Save the search space in the json file

Parameters

search_space (dict) – dictionary of the search space (scikit-optimize object)

Returns

dictionary for the search space, which can be saved in a json file

Return type

dict

octis.optimization.optimizer_tool.select_metric(metric_parameters, metric_name)[source]

Select the metric for the resume of the optimization

Parameters
  • metric_parameters (list) – metric parameters

  • metric_name (str) – name of the metric

Returns

metric

Return type

metric object

Models

class octis.models.model.AbstractModel[source]

Class structure of a generic Topic Modeling implementation

set_hyperparameters(**kwargs)[source]

Set model hyperparameters

Parameters

**kwargs

a dictionary in the form {hyperparameter name: value}

abstract train_model(dataset, hyperparameters, top_words=10)[source]

Train the model.

Parameters
  • dataset – Dataset

  • hyperparameters – dictionary in the form {hyperparameter name: value}

  • top_words – number of top significant words for each topic (default: 10)

Returns

model_output, a dictionary containing up to 4 keys: topics, topic-word-matrix, topic-document-matrix, test-topic-document-matrix.

  • topics is the list of the most significant words for each topic (list of lists of strings).

  • topic-word-matrix is the matrix (||topics|| x ||vocabulary||) containing the probabilities of a word in a given topic.

  • topic-document-matrix is the matrix (||topics|| x ||training documents||) containing the probabilities of the topics in a given training document.

  • test-topic-document-matrix is the matrix (||topics|| x ||testing documents||) containing the probabilities of the topics in a given testing document.
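The layout of the output dictionary described above can be illustrated with a hand-written toy example: two topics, a six-word vocabulary and three training documents. Plain lists stand in for the matrices so that the sketch needs no dependencies; all values and the helper `top_words_from_matrix` are made up for illustration.

```python
vocabulary = ["game", "team", "season", "launch", "orbit", "space"]

model_output = {
    # One list of top words per topic.
    "topics": [["game", "team", "season"], ["space", "orbit", "launch"]],
    # Shape (topics x vocabulary): word probabilities within each topic.
    "topic-word-matrix": [
        [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],
        [0.02, 0.03, 0.05, 0.20, 0.30, 0.40],
    ],
    # Shape (topics x training documents): topic probabilities per document.
    "topic-document-matrix": [
        [0.9, 0.2, 0.5],
        [0.1, 0.8, 0.5],
    ],
}


def top_words_from_matrix(topic_word_matrix, vocabulary, top_words):
    """Recover the top words of each topic from the topic-word matrix."""
    topics = []
    for row in topic_word_matrix:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        topics.append([vocabulary[i] for i in ranked[:top_words]])
    return topics


recovered = top_words_from_matrix(
    model_output["topic-word-matrix"], vocabulary, top_words=3
)
```

Here the recovered top words match the `topics` entry, which shows how the two keys relate: `topics` is a readable summary of the highest-probability columns in each row of the topic-word matrix.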

octis.models.model.load_model_output(output_path, vocabulary_path=None, top_words=10)[source]

Loads a model output from the chosen directory

Parameters
  • output_path – path in which the model output is saved

  • vocabulary_path – path in which the vocabulary is saved (optional, used to retrieve the top k words of each topic)

  • top_words – top k words to retrieve for each topic (in case a vocabulary path is given)

octis.models.model.save_model_output(model_output, path='.', appr_order=7)[source]

Saves the model output in the chosen directory

Parameters
  • model_output – output of the model

  • path – path in which the file will be saved and name of the file

  • appr_order – approximation order (used to round model_output values)

class octis.models.LDA.LDA(num_topics=100, distributed=False, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None)[source]
hyperparameters_info()[source]

Returns hyperparameters information

info()[source]

Returns model information

partitioning(use_partitions, update_with_test=False)[source]

Handle the partitioning system to use and reset the model to perform new evaluations

Parameters
  • use_partitions – True if train/test partitioning is needed, False otherwise

  • update_with_test – True if the model should be updated with the test set, False otherwise

set_hyperparameters(**kwargs)[source]

Set model hyperparameters

train_model(dataset, hyperparams=None, top_words=10)[source]

Train the model and return output

Parameters
  • dataset – dataset to use to build the model

  • hyperparams – hyperparameters to build the model

  • top_words – if greater than 0, the output includes the most significant words for each topic (default: 10)

Returns

dictionary with up to 3 entries: ‘topics’, ‘topic-word-matrix’ and ‘topic-document-matrix’
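Training a model and scoring its output could be combined as in this sketch. It assumes OCTIS is installed and that `dataset` is an OCTIS Dataset; the wrapper name `train_and_score` and the default of 25 topics are illustrative choices. The classes and arguments used are the ones documented on this page.

```python
def train_and_score(dataset, num_topics=25, topk=10):
    """Train LDA on a dataset and compute the topic diversity of the result."""
    # Imported lazily so this sketch can be defined without OCTIS installed.
    from octis.evaluation_metrics.diversity_metrics import TopicDiversity
    from octis.models.LDA import LDA

    model = LDA(num_topics=num_topics)
    # The output dictionary includes the 'topics' key when top_words > 0.
    model_output = model.train_model(dataset, top_words=topk)

    diversity = TopicDiversity(topk=topk).score(model_output)
    return model_output, diversity
```

Any other metric that only requires the ‘topics’ key could be swapped in for `TopicDiversity` without changing the rest of the function.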

class octis.models.NMF_scikit.NMF_scikit(num_topics=100, init=None, alpha=0, l1_ratio=0, regularization='both', use_partitions=True)[source]
hyperparameters_info()[source]

Returns hyperparameters information

partitioning(use_partitions, update_with_test=False)[source]

Handle the partitioning system to use and reset the model to perform new evaluations

Parameters
  • use_partitions – True if train/test partitioning is needed, False otherwise

  • update_with_test – True if the model should be updated with the test set, False otherwise

train_model(dataset, hyperparameters=None, topics=10)[source]

Train the model and return output

Parameters
  • dataset – dataset to use to build the model

  • hyperparameters – hyperparameters to build the model

  • topics – if greater than 0, the output includes the most significant words for each topic (default: 10)

Returns

dictionary with up to 3 entries: ‘topics’, ‘topic-word-matrix’ and ‘topic-document-matrix’

class octis.models.CTM.CTM(num_topics=10, model_type='prodLDA', activation='softplus', dropout=0.2, learn_priors=True, batch_size=64, lr=0.002, momentum=0.99, solver='adam', num_epochs=100, reduce_on_plateau=False, prior_mean=0.0, prior_variance=None, num_layers=2, num_neurons=100, use_partitions=True, num_samples=10, inference_type='zeroshot', bert_path='', bert_model='bert-base-nli-mean-tokens')[source]
train_model(dataset, hyperparameters=None, top_words=10)[source]

Trains the CTM model

Parameters
  • dataset – octis Dataset for training the model

  • hyperparameters – dict, with (optionally) the following information:

  • top_words – number of top-n words of the topics (default 10)

class octis.models.ETM.ETM(num_topics=10, num_epochs=100, t_hidden_size=800, rho_size=300, embedding_size=300, activation='relu', dropout=0.5, lr=0.005, optimizer='adam', batch_size=128, clip=0.0, wdecay=1.2e-06, bow_norm=1, device='cpu', top_word=10, train_embeddings=True, embeddings_path=None, embeddings_type='pickle', binary_embeddings=True, headerless_embeddings=False, use_partitions=True)[source]
train_model(dataset, hyperparameters=None, top_words=10)[source]

Train the model.

Parameters
  • dataset – Dataset

  • hyperparameters – dictionary in the form {hyperparameter name: value}

  • top_words – number of top significant words for each topic (default: 10)

Returns

model_output, a dictionary containing up to 4 keys: topics, topic-word-matrix, topic-document-matrix, test-topic-document-matrix.

  • topics is the list of the most significant words for each topic (list of lists of strings).

  • topic-word-matrix is the matrix (||topics|| x ||vocabulary||) containing the probabilities of a word in a given topic.

  • topic-document-matrix is the matrix (||topics|| x ||training documents||) containing the probabilities of the topics in a given training document.

  • test-topic-document-matrix is the matrix (||topics|| x ||testing documents||) containing the probabilities of the topics in a given testing document.