Modules
Dataset
- class octis.dataset.dataset.Dataset(corpus=None, vocabulary=None, labels=None, metadata=None, document_indexes=None)[source]
Dataset handles a dataset and offers methods to access, save and edit the dataset data
- fetch_dataset(dataset_name, data_home=None, download_if_missing=True)[source]
Load the filenames and data from a dataset.
- Parameters:
dataset_name – name of the dataset to download or retrieve
data_home (optional, default: None) – download and cache folder for the datasets. If None, all data is stored in ‘~/octis’ subfolders.
download_if_missing (optional, default: True) – if False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
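The interaction between data_home and download_if_missing can be sketched as follows. This is a minimal illustration, not the OCTIS implementation: resolve_dataset_dir is a hypothetical helper, and the download step is replaced by creating an empty folder.

```python
import os
import tempfile


def resolve_dataset_dir(dataset_name, data_home=None, download_if_missing=True):
    """Hypothetical helper (not part of OCTIS): return the local dataset folder."""
    if data_home is None:
        # Documented default: data is stored under '~/octis'.
        data_home = os.path.join(os.path.expanduser("~"), "octis")
    dataset_dir = os.path.join(data_home, dataset_name)
    if not os.path.isdir(dataset_dir):
        if not download_if_missing:
            raise IOError(f"{dataset_name} is not available locally in {data_home}")
        # Placeholder for the real download: only create the cache folder.
        os.makedirs(dataset_dir)
    return dataset_dir


with tempfile.TemporaryDirectory() as cache:
    try:
        resolve_dataset_dir("toy_corpus", data_home=cache, download_if_missing=False)
        missing_raised = False
    except IOError:
        missing_raised = True
    created = resolve_dataset_dir("toy_corpus", data_home=cache)
    created_exists = os.path.isdir(created)
```

The first call fails because nothing is cached and downloading is disabled; the second call falls through to the placeholder download.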
Data Preprocessing
Evaluation Measures
- class octis.evaluation_metrics.metrics.AbstractMetric[source]
Class structure of a generic metric implementation
- class octis.evaluation_metrics.coherence_metrics.Coherence(texts=None, topk=10, processes=1, measure='c_npmi')[source]
- class octis.evaluation_metrics.coherence_metrics.WECoherenceCentroid(topk=10, word2vec_path=None, binary=True)[source]
- class octis.evaluation_metrics.coherence_metrics.WECoherencePairwise(word2vec_path=None, binary=False, topk=10)[source]
- class octis.evaluation_metrics.diversity_metrics.WordEmbeddingsInvertedRBO(topk=10, weight=0.9, normalize=True, word2vec_path=None, binary=True)[source]
- class octis.evaluation_metrics.diversity_metrics.WordEmbeddingsInvertedRBOCentroid(topk=10, weight=0.9, normalize=True, word2vec_path=None, binary=True)[source]
- class octis.evaluation_metrics.classification_metrics.AccuracyScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
- class octis.evaluation_metrics.classification_metrics.ClassificationScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
- class octis.evaluation_metrics.classification_metrics.F1Score(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
- class octis.evaluation_metrics.classification_metrics.PrecisionScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
- class octis.evaluation_metrics.classification_metrics.RecallScore(dataset, average='micro', use_log=False, scale=True, kernel='linear', same_svm=False)[source]
Optimization
- class octis.optimization.optimizer.Optimizer[source]
Class Optimizer to perform Bayesian Optimization on Topic Model
- optimize(model, dataset, metric, search_space, extra_metrics=None, number_of_call=5, n_random_starts=1, initial_point_generator='lhs', optimization_type='Maximize', model_runs=5, surrogate_model='RF', kernel=1**2 * Matern(length_scale=1, nu=1.5), acq_func='LCB', random_state=False, x0=None, y0=None, save_models=True, save_step=1, save_name='result', save_path='results/', early_stop=False, early_step=5, plot_best_seen=False, plot_model=False, plot_name='B0_plot', log_scale_plot=False, topk=10)[source]
Perform hyper-parameter optimization for a Topic Model
- Parameters:
model (OCTIS Topic Model) – model with hyperparameters to optimize
dataset (OCTIS dataset) – dataset used by the model
metric (OCTIS metric) – metric used for the optimization
search_space (skopt space object) – a dictionary of hyperparameters to optimize (each parameter is defined as a skopt space)
extra_metrics (list of metrics, optional) – list of extra-metrics to compute during the optimization
number_of_call (int, optional) – number of evaluations of metric
n_random_starts (int, optional) – number of evaluations of metric with random points before approximating it with surrogate model
initial_point_generator (str, optional) – set an initial point generator. Can be one of “random”, “sobol”, “halton”, “hammersly”, “lhs”
optimization_type – set “Maximize” to maximize the metric, “Minimize” to minimize it
model_runs (int, optional) –
surrogate_model (str, optional) – set a surrogate model. Can be “GP” (Gaussian Process), “RF” (Random Forest) or “ET” (Extra-Tree)
kernel – set a kernel function
acq_func (str, optional) – function to minimize over the surrogate model. Can be “LCB” (Lower Confidence Bound), “EI” (Expected Improvement) or “PI” (Probability of Improvement)
random_state (int, optional) – set random state to something other than None for reproducible results
x0 (list, optional) – list of initial input points
y0 (list, optional) – evaluation of initial input points
save_models (bool, optional) – if “True”, save all the topic models generated during the optimization process
save_step (int, optional) – how often the results of the optimization are saved
save_name (str, optional) – name of the file where the results of the optimization will be saved
save_path (str, optional) – path where the results of the optimization (json file) will be saved
early_stop (bool, optional) – if “True”, stop the optimization if there is no improvement after early_step evaluations
early_step (int, optional) – number of iterations with no improvement after which optimization will be stopped (if early_stop is True)
plot_best_seen (bool, optional) – if “True”, save a convergence plot of the result of a Bayesian optimization (i.e. the best seen for each iteration)
plot_model (bool, optional) – if “True”, save the boxplot of all the model runs
plot_name (str, optional) – set the name of the plots (best_seen and model_runs)
log_scale_plot (bool, optional) – if “True”, use the logarithmic scale for the plots
topk (int, optional) –
- Returns:
OptimizerEvaluation object
- Return type:
class
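Several of the parameters above accept only a fixed set of strings, so it can be convenient to check them before starting a long optimization. The sketch below is only an illustration: check_optimize_kwargs is a hypothetical helper that is not part of OCTIS, the allowed values are copied from the parameter descriptions, and the rule that n_random_starts should not exceed number_of_call is an assumption.

```python
# Allowed values copied from the parameter descriptions above.
ALLOWED = {
    "initial_point_generator": {"random", "sobol", "halton", "hammersly", "lhs"},
    "optimization_type": {"Maximize", "Minimize"},
    "surrogate_model": {"GP", "RF", "ET"},
    "acq_func": {"LCB", "EI", "PI"},
}


def check_optimize_kwargs(**kwargs):
    """Hypothetical helper (not part of OCTIS): return a list of problems found."""
    problems = []
    for name, allowed in ALLOWED.items():
        if name in kwargs and kwargs[name] not in allowed:
            problems.append(f"{name}={kwargs[name]!r} not in {sorted(allowed)}")
    # Assumption: random starts count toward the total number of calls.
    calls = kwargs.get("number_of_call", 5)
    starts = kwargs.get("n_random_starts", 1)
    if starts > calls:
        problems.append("n_random_starts exceeds number_of_call")
    return problems


good = check_optimize_kwargs(
    surrogate_model="RF", acq_func="LCB", number_of_call=30, n_random_starts=5
)
bad = check_optimize_kwargs(surrogate_model="SVM", number_of_call=3, n_random_starts=5)
```

Here good is an empty list, while bad reports both the unknown surrogate model and the inconsistent call counts.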
- octis.optimization.optimizer_tool.check_instance(obj)[source]
Check if a specific object can be inserted in the json file.
- Parameters:
obj ([str,float, int, bool, etc.]) – an object of the optimization to be saved
- Returns:
‘True’ if the object can be stored in json format, ‘False’ otherwise
- Return type:
bool
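One way to express such a check is to try serializing the object with the standard json module. This is a sketch of the idea under that assumption; is_json_serializable is a hypothetical name and the approach may differ from what check_instance actually does.

```python
import json


def is_json_serializable(obj):
    """Sketch: True if json.dumps accepts the object, False otherwise."""
    try:
        json.dumps(obj)
        return True
    except (TypeError, ValueError):
        return False


plain_ok = is_json_serializable({"a": [1, 2.5, True, None]})
set_ok = is_json_serializable({1, 2})
object_ok = is_json_serializable(object())
```

Dictionaries of numbers, strings, booleans and lists pass, while sets and arbitrary objects do not.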
- octis.optimization.optimizer_tool.choose_optimizer(optimizer)[source]
Choose a surrogate model for Bayesian Optimization
- Parameters:
optimizer (Optimizer) – list of settings of the BO experiment
- Returns:
surrogate model
- Return type:
scikit object
- octis.optimization.optimizer_tool.convergence_res(values, optimization_type='minimize')[source]
Compute the list of values for the convergence plot (i.e. the best seen at each iteration)
- Parameters:
values (list) – the result(s) for which to compute the convergence trace.
optimization_type (str) – “minimize” if the problem is a minimization problem, “maximize” otherwise
- Returns:
a list with the best min seen for each iteration
- Return type:
list
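The "best seen at each iteration" trace is a running minimum or maximum of the evaluated values. The sketch below illustrates the idea with a hypothetical function best_seen_trace that keeps values in their original sign; the sign convention of the list returned by convergence_res may differ.

```python
def best_seen_trace(values, optimization_type="minimize"):
    """Sketch: running best of `values` (min or max depending on the problem)."""
    pick = min if optimization_type == "minimize" else max
    trace = []
    for value in values:
        # The first value is the best so far; later ones compete with it.
        trace.append(value if not trace else pick(trace[-1], value))
    return trace


min_trace = best_seen_trace([3.0, 1.0, 2.0, 0.5])
max_trace = best_seen_trace([3.0, 1.0, 4.0, 2.0], "maximize")
```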
- octis.optimization.optimizer_tool.convert_type(obj)[source]
Convert a numpy object to a python object
- Parameters:
obj (numpy object) – object to be checked
- Returns:
python object
- Return type:
python object
- octis.optimization.optimizer_tool.early_condition(values, n_stop, n_random)[source]
Compute the early-stop criterion that decides whether to stop the optimization.
- Parameters:
values (list) – values obtained by Bayesian Optimization
n_stop (int) – Range of points without improvement
n_random (int) – Random starting points
- Returns:
‘True’ if early stop condition reached, ‘False’ otherwise
- Return type:
bool
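A possible reading of this criterion, for a maximization problem, is sketched below. The function no_recent_improvement is hypothetical, and two points are assumptions rather than documented behavior: the check only starts once n_random + n_stop values exist, and "no improvement" means the last n_stop values never exceed the best value seen before them.

```python
def no_recent_improvement(values, n_stop, n_random):
    """Sketch (maximization): True if the last n_stop values bring no improvement."""
    if n_stop <= 0 or len(values) < n_random + n_stop:
        # Not enough evaluations after the random starts yet.
        return False
    earlier = values[:-n_stop]
    best_before = max(earlier) if earlier else float("-inf")
    return max(values[-n_stop:]) <= best_before


stalled = no_recent_improvement([0.1, 0.5, 0.4, 0.45, 0.3], n_stop=3, n_random=1)
improving = no_recent_improvement([0.1, 0.5, 0.4, 0.6, 0.3], n_stop=3, n_random=1)
too_short = no_recent_improvement([0.1, 0.2], n_stop=3, n_random=1)
```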
- octis.optimization.optimizer_tool.importClass(class_name, module_name, module_path)[source]
Import a class at runtime based on its module and name
- Parameters:
class_name (str) – name of the class
module_name (str) – name of the module
module_path (str) – absolute path to the module
- Returns:
class object
- Return type:
class
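Loading a class from a file path at runtime can be done with the standard importlib machinery. The following is a minimal sketch with a hypothetical function import_class_from_path that takes the same three arguments; it is not necessarily how importClass is implemented.

```python
import importlib.util
import os
import tempfile


def import_class_from_path(class_name, module_name, module_path):
    """Sketch: load the file at module_path as module_name and return a class."""
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)


# Write a throwaway module to disk and load a class from it.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_module.py")
    with open(path, "w") as handle:
        handle.write("class ToyModel:\n    name = 'toy'\n")
    loaded = import_class_from_path("ToyModel", "toy_module", path)
    loaded_name = loaded.name
```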
- octis.optimization.optimizer_tool.load_model(optimization_object)[source]
Load the topic model to resume the optimization
- Parameters:
optimization_object (dict) – dictionary of optimization attributes saved in the json file
- Returns:
topic model used during the BO.
- Return type:
object model
- octis.optimization.optimizer_tool.load_search_space(search_space)[source]
Load the search space from the json file
- Parameters:
search_space – dictionary of the search space (insertable in a json file)
- Returns:
dictionary for the search space (for scikit optimize)
- Return type:
dict
- octis.optimization.optimizer_tool.plot_bayesian_optimization(values, name_plot, log_scale=False, conv_max=True)[source]
Save a convergence plot of the result of a Bayesian_optimization.
- Parameters:
values (list) – List of objective function values
name_plot (str) – Name of the plot
log_scale (bool, optional) – ‘True’ if log scale for y-axis, ‘False’ otherwise
conv_max (bool, optional) – ‘True’ for a minimization problem, ‘False’ for a maximization problem
- octis.optimization.optimizer_tool.plot_model_runs(model_runs, current_call, name_plot)[source]
Save a boxplot of the data (Works only when optimization_runs is 1).
- Parameters:
model_runs (dict) – dictionary of all the model runs.
current_call (int) – number of calls computed by BO
name_plot (str) – Name of the plot
Models
- class octis.models.model.AbstractModel[source]
Class structure of a generic Topic Modeling implementation
- set_hyperparameters(**kwargs)[source]
Set model hyperparameters
- Parameters:
**kwargs –
a dictionary in the form {hyperparameter name: value}
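The keyword-argument pattern can be illustrated with a small stand-in class. ToyModel below is hypothetical and not an OCTIS class, and storing the values in a hyperparameters dictionary is an assumption made for the example.

```python
class ToyModel:
    """Hypothetical stand-in for a topic model; not an OCTIS class."""

    def __init__(self):
        self.hyperparameters = {"num_topics": 10, "alpha": 0.1}

    def set_hyperparameters(self, **kwargs):
        # Each keyword argument is one {hyperparameter name: value} pair.
        for name, value in kwargs.items():
            self.hyperparameters[name] = value


model = ToyModel()
model.set_hyperparameters(num_topics=25, eta=0.01)
```

Existing entries are overwritten and new names are added, so the call above leaves alpha untouched.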
- abstract train_model(dataset, hyperparameters, top_words=10)[source]
Train the model.
- Parameters:
dataset – Dataset
hyperparameters – dictionary in the form {hyperparameter name: value}
top_words – number of top significant words for each topic (default: 10)
- Returns:
model_output – a dictionary containing up to 4 keys: topics, topic-word-matrix, topic-document-matrix, test-topic-document-matrix.
topics is the list of the most significant words for each topic (list of lists of strings).
topic-word-matrix is the matrix (||topics|| x ||vocabulary||) containing the probabilities of a word in a given topic.
topic-document-matrix is the matrix (||topics|| x ||training documents||) containing the probabilities of the topics in a given training document.
test-topic-document-matrix is the matrix (||topics|| x ||testing documents||) containing the probabilities of the topics in a given testing document.
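The relation between the keys of the output dictionary can be shown on a toy example. The vocabulary, the probabilities and the helper top_words_per_topic are invented for illustration; only the key names and matrix orientations come from the description above.

```python
vocabulary = ["cat", "dog", "stock", "market", "bank"]

# topic-word-matrix: one row per topic, one column per vocabulary word.
topic_word_matrix = [
    [0.40, 0.35, 0.05, 0.10, 0.10],
    [0.05, 0.05, 0.30, 0.35, 0.25],
]


def top_words_per_topic(matrix, vocab, top_words):
    """Return the top_words highest-probability words of each topic."""
    topics = []
    for row in matrix:
        ranked = sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)
        topics.append([vocab[i] for i in ranked[:top_words]])
    return topics


model_output = {
    "topics": top_words_per_topic(topic_word_matrix, vocabulary, top_words=2),
    "topic-word-matrix": topic_word_matrix,
    # topic-document-matrix: one row per topic, one column per training document.
    "topic-document-matrix": [[0.9, 0.2, 0.5], [0.1, 0.8, 0.5]],
}
```

In this toy output, the topics entry is a list of two word lists, one per row of the topic-word-matrix.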
- octis.models.model.load_model_output(output_path, vocabulary_path=None, top_words=10)[source]
Loads a model output from the chosen directory
- Parameters:
output_path – path in which the model output is saved
vocabulary_path – path in which the vocabulary is saved (optional, used to retrieve the top k words of each topic)
top_words – top k words to retrieve for each topic (in case a vocabulary path is given)
- octis.models.model.save_model_output(model_output, path='.', appr_order=7)[source]
Saves the model output in the chosen directory
- Parameters:
model_output – output of the model
path – path in which the file will be saved and name of the file
appr_order – approximation order (used to round model_output values)
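The effect of appr_order can be pictured as rounding every matrix entry to that many decimal places before saving. The sketch below uses a hypothetical helper round_matrix on nested lists; how save_model_output applies the rounding internally is not shown in this reference.

```python
def round_matrix(matrix, appr_order=7):
    """Sketch: round every entry of a nested list to appr_order decimals."""
    return [[round(value, appr_order) for value in row] for row in matrix]


rounded = round_matrix([[0.123456789, 0.5], [1 / 3, 2 / 3]], appr_order=3)
```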
- class octis.models.LDA.LDA(num_topics=100, distributed=False, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None)[source]
- partitioning(use_partitions, update_with_test=False)[source]
Handle the partitioning system to use and reset the model to perform new evaluations
- Parameters:
use_partitions – True if train/test partitioning is needed, False otherwise
update_with_test – True if the model should be updated with the test set, False otherwise
- train_model(dataset, hyperparams=None, top_words=10)[source]
Train the model and return output
- Parameters:
dataset – dataset to use to build the model
hyperparams – hyperparameters to build the model
top_words – if greater than 0, returns the most significant words for each topic in the output (default: 10)
- Returns:
result – dictionary with up to 3 entries: ‘topics’, ‘topic-word-matrix’ and ‘topic-document-matrix’
- class octis.models.NMF_scikit.NMF_scikit(num_topics=100, init=None, alpha=0, l1_ratio=0, regularization='both', use_partitions=True)[source]
- partitioning(use_partitions, update_with_test=False)[source]
Handle the partitioning system to use and reset the model to perform new evaluations
- Parameters:
use_partitions – True if train/test partitioning is needed, False otherwise
update_with_test – True if the model should be updated with the test set, False otherwise
- train_model(dataset, hyperparameters=None, top_words=10)[source]
Train the model and return output
- Parameters:
dataset – dataset to use to build the model
hyperparameters – hyperparameters to build the model
top_words – if greater than 0, returns the most significant words for each topic in the output (default: 10)
- Returns:
result – dictionary with up to 3 entries: ‘topics’, ‘topic-word-matrix’ and ‘topic-document-matrix’
- class octis.models.CTM.CTM(num_topics=10, model_type='prodLDA', activation='softplus', dropout=0.2, learn_priors=True, batch_size=64, lr=0.002, momentum=0.99, solver='adam', num_epochs=100, reduce_on_plateau=False, prior_mean=0.0, prior_variance=None, num_layers=2, num_neurons=100, seed=None, use_partitions=True, num_samples=10, inference_type='zeroshot', bert_path='', bert_model='bert-base-nli-mean-tokens')[source]
- class octis.models.ETM.ETM(num_topics=10, num_epochs=100, t_hidden_size=800, rho_size=300, embedding_size=300, activation='relu', dropout=0.5, lr=0.005, optimizer='adam', batch_size=128, clip=0.0, wdecay=1.2e-06, bow_norm=1, device='cpu', train_embeddings=True, embeddings_path=None, embeddings_type='pickle', binary_embeddings=True, headerless_embeddings=False, use_partitions=True)[source]
- train_model(dataset, hyperparameters=None, top_words=10, op_path='checkpoint.pt')[source]
Train the model.
- Parameters:
dataset – Dataset
hyperparameters – dictionary in the form {hyperparameter name: value}
top_words – number of top significant words for each topic (default: 10)
- Returns:
model_output – a dictionary containing up to 4 keys: topics, topic-word-matrix, topic-document-matrix, test-topic-document-matrix.
topics is the list of the most significant words for each topic (list of lists of strings).
topic-word-matrix is the matrix (||topics|| x ||vocabulary||) containing the probabilities of a word in a given topic.
topic-document-matrix is the matrix (||topics|| x ||training documents||) containing the probabilities of the topics in a given training document.
test-topic-document-matrix is the matrix (||topics|| x ||testing documents||) containing the probabilities of the topics in a given testing document.