nlpboost package

Subpackages

Submodules

nlpboost.autotrainer module

class nlpboost.autotrainer.AutoTrainer(model_configs: List, dataset_configs: List, metrics_dir: str = 'tmp_experiments_metrics', hp_search_mode: str = 'optuna', clean: bool = True, metrics_cleaner: str = 'tmp_metrics_cleaner', use_auth_token: bool = False, skip_mixes: Optional[List] = None)[source]

Bases: object

Main class of nlpboost. Fine-tune and evaluate several models on several datasets.

Useful for performing benchmarking of different models on the same datasets. The behavior of AutoTrainer is mainly configured through model_configs and dataset_configs, which define the datasets and the models to be used.

Parameters:
  • model_configs (List[nlpboost.ModelConfig]) – Configurations for the models, instances of ModelConfig, each describing the model's name in the hub or local directory, the name to save the model under, the dropout values to use, and so on.

  • dataset_configs (List[nlpboost.DatasetConfig]) – Configurations for the datasets, instances of DatasetConfig, each describing how each dataset should be processed.

  • metrics_dir (str) – Directory to save the metrics for the experiments, as returned by nlpboost.ResultsGetter.

  • hp_search_mode (str) – Mode for hyperparameter search; possibilities are optuna or fixed. If fixed, no hyperparameter tuning is carried out.

  • clean (bool) – Whether to clean checkpoints every 10 minutes to avoid using too much disk space, by using nlpboost.CkptCleaner. The best model checkpoint is saved when checkpoints that are no longer useful are deleted.

  • metrics_cleaner (str) – Path to the folder where the metrics of the checkpoint cleaner should be stored. These metrics are used to decide which checkpoints should be removed. Note: if an experiment fails and you re-launch it, remove this folder first. Otherwise the checkpoint cleaner will use metrics from past experiments rather than the running one, which will probably cause errors and incorrect checkpoint removals.

  • use_auth_token (bool) – Whether to use auth token to load datasets and models.

  • skip_mixes (List[nlpboost.SkipMix]) – List of SkipMix instances with combinations of datasets and models that must be skipped.

Carry out hyperparameter search with Optuna.

Use model_configs and dataset_configs passed in init. Iterate over each dataset, and then over each model, with hyperparameter tuning. Metrics over the test dataset are gathered and then saved in the metrics_dir specified in init for each of those models, for later comparison.

Returns:

all_results – Dictionary with results from the experiments.

Return type:

Dict
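
The iteration order described above (each dataset, then each model, minus skipped combinations) can be pictured with a small sketch. This is not nlpboost code: `plan_experiments` and the use of plain `(dataset, model)` tuples in place of SkipMix instances are assumptions made only for illustration.

```python
from itertools import product


def plan_experiments(dataset_aliases, model_names, skip_pairs=()):
    """Return (dataset, model) pairs in dataset-major order, minus skipped pairs.

    Hypothetical helper: plain tuples stand in for SkipMix instances.
    """
    skipped = set(skip_pairs)
    return [
        (dataset, model)
        # product() varies the last argument fastest, so models are the inner loop.
        for dataset, model in product(dataset_aliases, model_names)
        if (dataset, model) not in skipped
    ]


plan = plan_experiments(
    ["conll2002", "xnli"],
    ["bert", "roberta"],
    skip_pairs=[("xnli", "roberta")],
)
print(plan)
```

Each surviving pair corresponds to one training run whose test metrics would end up in metrics_dir.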

train_one_model_fixed_params(model_config, dataset_config, compute_metrics_func, test_dataset)[source]

Train one model with fixed params in one dataset, without tuning parameters.

Parameters:
  • model_config (nlpboost.ModelConfig) – Configuration for the model.

  • dataset_config (nlpboost.DatasetConfig) – Configuration for the dataset.

  • compute_metrics_func (Any) – Function to compute metrics.

  • test_dataset (datasets.Dataset) – Test dataset to get metrics on.

Returns:

test_results – Dictionary with results over the test set after training with fixed params.

Return type:

Dict

train_one_model_optuna(model_config, dataset_config, compute_objective, compute_metrics_func, output_dir, test_dataset)[source]

Train one model in one dataset, with hyperparameter tuning, using Optuna.

Load a checkpoint cleaner in the background to remove poorly performing checkpoints every 10 minutes while saving the best performing one. Then carry out the hyperparameter search and, if configured (see retrain_at_end in DatasetConfig), retrain at the end with the best hyperparameters. After that, results on the test set are obtained; ResultsGetter is used for dataset processing, prediction and metrics gathering. To change the behavior of this part, create a custom ResultsGetter that overrides the desired methods and pass it to ModelConfig as custom_results_getter. Metrics are saved in json or txt format and, if configured, the model is pushed to the hub.

Parameters:
  • model_config (nlpboost.ModelConfig) – Configuration for the model.

  • dataset_config (nlpboost.DatasetConfig) – Configuration for the dataset.

  • compute_objective (Any) – Function to return the computed metric objective.

  • compute_metrics_func (Any) – Function to compute metrics.

  • output_dir (str) – Directory where the model is saved.

  • test_dataset (datasets.Dataset) – Test dataset to get metrics on.

Returns:

test_results – Dictionary with the results in the test set.

Return type:

Dict

train_with_fixed_params()[source]

Train without hyperparameter search, with a fixed set of params.

The default parameters are defined in the fixed_training_args of DatasetConfig. However, ModelConfig.overwrite_training_args can change them for a given model, by passing a dictionary with the new parameters to use.
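
The override relationship between the dataset-level defaults and the model-level dictionary can be sketched as a plain dictionary merge. `resolve_training_args` is a hypothetical helper, and the assumption that model-level keys simply replace dataset-level keys is ours; the library may combine them differently.

```python
def resolve_training_args(fixed_training_args, overwrite_training_args=None):
    """Merge dataset defaults with per-model overrides (hypothetical helper).

    Assumption: a key present in the override dictionary wins.
    """
    merged = dict(fixed_training_args)  # copy, so the defaults stay untouched
    merged.update(overwrite_training_args or {})
    return merged


# Values that would live in DatasetConfig.fixed_training_args.
dataset_defaults = {"learning_rate": 3e-5, "num_train_epochs": 3, "optim": "adamw_torch"}
# Values that would live in ModelConfig.overwrite_training_args.
model_overrides = {"optim": "adafactor"}

resolved = resolve_training_args(dataset_defaults, model_overrides)
print(resolved)
```

Copying before updating keeps the dataset defaults intact for the next model trained on the same dataset.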

nlpboost.ckpt_cleaner module

class nlpboost.ckpt_cleaner.CkptCleaner(current_folder_clean: str, current_dataset_folder: str, metrics_save_dir: str, modelname: str, mode: str = 'max', try_mode: bool = False)[source]

Bases: object

Clean all checkpoints that are no longer useful.

Use a metrics dictionary to check the results of all runs of a model for a dataset, then sort these metrics to decide which checkpoints are removable and which are among the four best. When called, only those are kept, and all the other checkpoints are removed. This enables the user to effectively use their computer resources, so there is no need to worry about the disk usage, which is a typical concern when running multiple transformer models.
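
The keep-the-best-four rule can be illustrated with a minimal sketch. The shape of the metrics dictionary (checkpoint path mapped to a single score) and the helper name `split_checkpoints` are assumptions; the actual CkptCleaner internals may differ.

```python
def split_checkpoints(metrics, mode="max", keep=4):
    """Split checkpoint paths into (kept, removable) by metric value.

    Hypothetical helper. `mode` mirrors the CkptCleaner argument: "max" when
    higher scores are better, any other value when lower scores are better.
    """
    ranked = sorted(metrics, key=metrics.get, reverse=(mode == "max"))
    return ranked[:keep], ranked[keep:]


# Assumed layout: checkpoint directory -> objective metric.
scores = {
    "run-0/checkpoint-500": 0.81,
    "run-0/checkpoint-1000": 0.86,
    "run-1/checkpoint-500": 0.79,
    "run-1/checkpoint-1000": 0.88,
    "run-2/checkpoint-500": 0.84,
    "run-2/checkpoint-1000": 0.75,
}

kept, removable = split_checkpoints(scores, mode="max")
print("keep:", kept)
print("remove:", removable)
```

The first element of `kept` plays the role of the best model that save_best would copy.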

fix_dir(dir: str)[source]

Fix directory path for Windows file systems.

Parameters:

dir (str) – Directory to fix.

Returns:

dir – Fixed directory.

Return type:

str

get_best_name(metrics: Dict)[source]

Get the path of the best performing model.

Parameters:

metrics (Dict) – Metrics of all models in a dictionary.

Returns:

best – Path to the best performing model.

Return type:

str

remove_dirs(checkpoint_dirs: List)[source]

Delete checkpoint directories.

Parameters:

checkpoint_dirs (List) – List with the checkpoint directories to remove.

save_best(best_model: str)[source]

Save best model.

Parameters:

best_model (str) – Path of the best performing model.

Returns:

target – Complete path to the target directory where the best model has been copied.

Return type:

str

nlpboost.dataset_config module

class nlpboost.dataset_config.DatasetConfig(dataset_name: str, alias: str, task: str, fixed_training_args: ~typing.Dict, is_multilabel: bool = False, multilabel_label_names: ~typing.List = <factory>, hf_load_kwargs: ~typing.Optional[~typing.Dict] = None, type_load: str = 'json', files: ~typing.Optional[~typing.Dict] = None, data_field: str = 'data', partial_split: bool = False, split: bool = False, label_col: str = 'label_list', val_size: float = 0.15, test_size: float = 0.15, pre_func: ~typing.Optional[~typing.Any] = None, remove_fields_pre_func: bool = False, squad_v2: bool = False, text_field: str = 'text', is_2sents: bool = False, sentence1_field: ~typing.Optional[str] = None, sentence2_field: ~typing.Optional[str] = None, summary_field: str = 'summary', callbacks: ~typing.List = <factory>, metric_optimize: str = 'eval_loss', direction_optimize: str = 'minimize', custom_eval_func: ~typing.Optional[~typing.Any] = None, seed: int = 420, max_length_summary: int = 120, num_proc: int = 4, loaded_dataset: ~typing.Optional[~typing.Any] = None, additional_metrics: ~typing.Optional[~typing.List] = None, retrain_at_end: bool = True, config_num_labels: ~typing.Optional[int] = None, smoke_test: bool = False, augment_data: bool = False, data_augmentation_steps: ~typing.List = <factory>, id_field_qa: str = 'id', pretokenized_dataset: ~typing.Optional[~typing.Any] = None)[source]

Bases: object

Configure a dataset for use within the AutoTrainer class.

This determines how to load the dataset, whether local files are needed, whether additional splits are needed (for example when the original dataset only has train-test and we want also validation), and so on.

Parameters:
  • dataset_name (str) – The name of the dataset.

  • alias (str) – Alias for the dataset, for saving it.

  • task (str) – The task of the dataset. Currently, only classification, ner and qa (question answering) are available.

  • fixed_training_args (Dict) – The training arguments (to use in transformers.TrainingArguments) for every model on this dataset, in dictionary format.

  • is_multilabel (bool) – Whether it is multilabel classification.

  • multilabel_label_names (List) – Names of the labels for multilabel training.

  • hf_load_kwargs (Dict) – Arguments for loading the dataset from the Hugging Face datasets hub. Example: {'path': 'wikiann', 'name': 'es'}. If None, it is assumed that all necessary files exist locally and are passed in the files field.

  • type_load (str) – The type of load to perform in load_dataset; for example, if your data is in csv format (d = load_dataset('csv', …)), this should be csv.

  • files (Dict) – Files to load the dataset from, in Huggingface’s datasets format. Possible keys are train, validation and test.

  • data_field (str) – Field to load data from in the case of jsons loading in datasets.

  • partial_split (bool) – Whether a partial split is needed; that is, if you only have train and test sets, this should be True so that a new validation set is created.

  • split (bool) – This should be true when you only have one split, that is, a big train set; this creates new validation and test sets.

  • label_col (str) – Name of the label column.

  • val_size (float) – In case no validation split is provided, the proportion of the training data to leave for validation.

  • test_size (float) – In case no test split is provided, the proportion of the total data to leave for testing.

  • pre_func (Any) – Function to perform previous transformations. For example, if your dataset lacks a field (like xquad with title field for example), you can fix it in a function provided here.

  • squad_v2 (bool) – Only useful for question answering. Whether it is squad v2 format or not. Default is false.

  • text_field (str) – The name of the field containing the text. Only useful for datasets with a single text field, as most datasets are. For 2-sentence datasets like xnli or paws-x this is not used. Default is text.

  • is_2sents (bool) – Whether it is a 2 sentence dataset. Useful for processing datasets like xnli or paws-x.

  • sentence1_field (str) – In case this is a 2 sents dataset, the name of the first sentence field.

  • sentence2_field (str) – In case this is a 2 sents dataset, the name of the second sentence field.

  • summary_field (str) – The name of the field with summaries (we assume the long texts are in the text_field field). Only useful for summarization tasks. Default is summary.

  • callbacks (List) – Callbacks to use inside transformers.

  • metric_optimize (str) – Name of the metric you want to optimize in the hyperparameter search.

  • direction_optimize (str) – Direction of the optimization problem. Whether you want to maximize or minimize metric_optimize.

  • custom_eval_func (Any) – In case we want a special evaluation function, we can provide it here. It must receive EvalPredictions by trainer, like any compute_metrics function in transformers.

  • seed (int) – Seed for optuna sampler.

  • max_length_summary (int) – Max length of the summaries, for tokenization purposes. It will be changed depending on the ModelConfig.

  • num_proc (int) – Number of processes to preprocess data.

  • loaded_dataset (Any) – In case you need custom preparation, such as concatenating several datasets, you can pass the resulting (non-tokenized) dataset in this field.

  • additional_metrics (List) – List of additional metrics loaded from datasets, to compute over the test part.

  • retrain_at_end (bool) – Whether to retrain with the best performing model. In most cases this should be True, except when training 1 model with 1 set of hyperparameters.

  • config_num_labels (int) – Number of labels to set for the config, if None it will be computed based on number of labels detected.

  • smoke_test (bool) – Whether to select only top 10 rows of the dataset for smoke testing purposes.

  • augment_data (bool) – Whether to augment the data or not.

  • data_augmentation_steps (List) – List of data augmentation techniques to use from NLPAugPipeline.

  • pretokenized_dataset (Any) – Pre-tokenized dataset, to avoid tokenizing inside AutoTrainer, which may cause memory issues with huge datasets.
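
To make the val_size and test_size definitions above concrete, the sketch below computes split sizes for the split and partial_split cases. It assumes the test split is carved out of the total first, that validation is then taken from the remaining training rows, and that sizes are rounded to the nearest integer; the library's actual order and rounding may differ. `split_sizes` and `carve_test` are hypothetical names.

```python
def split_sizes(n_rows, val_size=0.15, test_size=0.15, carve_test=True):
    """Return the number of rows per split (hypothetical helper).

    carve_test=True  ~ split=True: only one big train set is available.
    carve_test=False ~ partial_split=True: a test set already exists.
    Assumption: test is a fraction of the total, validation a fraction of
    whatever remains for training.
    """
    n_test = round(n_rows * test_size) if carve_test else 0
    n_train_pool = n_rows - n_test
    n_val = round(n_train_pool * val_size)
    return {"train": n_train_pool - n_val, "validation": n_val, "test": n_test}


# One big train set of 2000 rows, default proportions.
print(split_sizes(2000))
# Train set of 2000 rows with an existing test set: only validation is created.
print(split_sizes(2000, carve_test=False))
```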

Examples

One can easily create a DatasetConfig for dataset conll2002 just with the following:

>>> from nlpboost import DatasetConfig
>>> config_kwargs = {'fixed_training_args': {}, 'dataset_name': 'conll2002', 'alias': 'conll2002', 'task': 'ner', 'hf_load_kwargs': {'path': 'conll2002', 'name': 'es'}, 'label_col': 'ner_tags'}
>>> config = DatasetConfig(**config_kwargs)
additional_metrics: List = None
alias: str
augment_data: bool = False
callbacks: List
config_num_labels: int = None
custom_eval_func: Any = None
data_augmentation_steps: List
data_field: str = 'data'
dataset_name: str
direction_optimize: str = 'minimize'
files: Dict = None
fixed_training_args: Dict
hf_load_kwargs: Dict = None
id_field_qa: str = 'id'
is_2sents: bool = False
is_multilabel: bool = False
label_col: str = 'label_list'
loaded_dataset: Any = None
max_length_summary: int = 120
metric_optimize: str = 'eval_loss'
multilabel_label_names: List
num_proc: int = 4
partial_split: bool = False
pre_func: Any = None
pretokenized_dataset: Any = None
remove_fields_pre_func: bool = False
retrain_at_end: bool = True
seed: int = 420
sentence1_field: str = None
sentence2_field: str = None
smoke_test: bool = False
split: bool = False
squad_v2: bool = False
summary_field: str = 'summary'
task: str
test_size: float = 0.15
text_field: str = 'text'
type_load: str = 'json'
val_size: float = 0.15

nlpboost.default_param_spaces module

nlpboost.default_param_spaces.hp_space_base(trial)[source]

Hyperparameter space in Optuna format for base-sized models (e.g. bert-base).

nlpboost.default_param_spaces.hp_space_large(trial)[source]

Hyperparameter space in Optuna format for large-sized models (e.g. bert-large).
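
A custom space in the same Optuna format is a function that receives a trial and returns a dictionary of suggested values. The sketch below is not the content of hp_space_base or hp_space_large: the ranges are arbitrary, `hp_space_custom` is a hypothetical name, and `FakeTrial` is a stand-in that mimics only the two `optuna.trial.Trial` methods used, so that the example runs without Optuna installed.

```python
import math
import random


def hp_space_custom(trial):
    """Hypothetical search space; keys are transformers.TrainingArguments fields."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 7e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
    }


class FakeTrial:
    """Stand-in for optuna.trial.Trial, for illustration only."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def suggest_float(self, name, low, high, *, step=None, log=False):
        if log:
            # Sample uniformly in log space, as a log-scaled range implies.
            return math.exp(self._rng.uniform(math.log(low), math.log(high)))
        return self._rng.uniform(low, high)

    def suggest_categorical(self, name, choices):
        return self._rng.choice(choices)


sampled = hp_space_custom(FakeTrial(seed=0))
print(sampled)
```

With Optuna available, the same function could be passed as hp_space when creating a ModelConfig, in place of hp_space_base.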

nlpboost.hfdatasets_manager module

class nlpboost.hfdatasets_manager.HFDatasetsManager(dataset_config, model_config)[source]

Bases: object

Utility for loading HF Datasets’ objects, using a DatasetConfig and a ModelConfig.

Parameters:
  • dataset_config (nlpboost.DatasetConfig) – Configuration for the dataset.

  • model_config (nlpboost.ModelConfig) – Configuration for the model.

get_dataset_and_tag2id(tokenizer: PreTrainedTokenizer)[source]

Get dataset and tag2id depending on dataset and model config.

Using dataset config (task, etc), a preprocessing is applied to the dataset, tokenizing text data, returning a processed dataset ready for the configured task.

Parameters:

tokenizer (transformers.PreTrainedTokenizer) – Tokenizer to process data.

Returns:

  • dataset (datasets.DatasetDict) – Tokenized dataset.

  • tag2id (Dict) – Dictionary with tags (labels) and their indexes.
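
As a rough picture of what tag2id holds, the sketch below builds the mapping and its inverse (the id2tag expected by the metric functions) from a list of label names. `build_tag_maps` is a hypothetical helper, and the assumption that ids follow the order of the label list is ours; HFDatasetsManager may order labels differently.

```python
def build_tag_maps(label_names):
    """Build label -> index and index -> label dictionaries (hypothetical helper)."""
    tag2id = {tag: idx for idx, tag in enumerate(label_names)}
    id2tag = {idx: tag for tag, idx in tag2id.items()}
    return tag2id, id2tag


tag2id, id2tag = build_tag_maps(["O", "B-PER", "I-PER", "B-LOC", "I-LOC"])
print(tag2id)
print(id2tag)
```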

nlpboost.hftransformers_manager module

class nlpboost.hftransformers_manager.HFTransformersManager(model_config: Optional[ModelConfig] = None, dataset_config: Optional[DatasetConfig] = None, use_auth_token: bool = True)[source]

Bases: object

Utility for loading HF Transformers’ objects, using a dataset config and a model config.

Parameters:
  • model_config (nlpboost.ModelConfig) – Configuration for the model.

  • dataset_config (nlpboost.DatasetConfig) – Configuration for the dataset.

get_model_cls()[source]

Get the class to use for a model.

Returns:

Class for the model.

Return type:

model_cls

load_config(tag2id: Dict, dropout: float)[source]

Load configuration for the model depending on the type of task we are doing.

Parameters:
  • tag2id (Dict) – Dictionary mapping labels to indices of those labels in the network output layer.

  • dropout (float) – Dropout proportion for the pooler layer.

Returns:

config – Configuration for use in the transformers module.

Return type:

transformers.PretrainedConfig

load_data_collator(tokenizer)[source]

Load data collator depending on the type of task we are doing.

Parameters:

tokenizer (transformers.PreTrainedTokenizer) – Tokenizer to process data.

Returns:

data_collator – DataCollator for use in the transformers library.

Return type:

transformers.DataCollator

load_model_init(model_cls, config, tokenizer)[source]

Load the model init function.

This function is useful for the Transformers integration with Optuna.

Parameters:
  • model_cls – Class for the model.

  • config (AutoConfig) – Configuration for the model.

  • tokenizer (transformers.PreTrainedTokenizer) – Tokenizer to preprocess text data.

Returns:

Function for initializing the model. It is later passed to the Trainer.

Return type:

model_init

load_tokenizer()[source]

Load tokenizer for the given model config and model name.

Returns:

Loaded tokenizer.

Return type:

tokenizer

load_train_args(output_dir)[source]

Load training args depending on the task.

Parameters:

output_dir (str) – Local directory name to save the model.

Returns:

args – Arguments for training.

Return type:

transformers.TrainingArguments

load_trainer(dataset, tokenizer, args, model_init, data_collator, compute_metrics_func, config)[source]

Load an instantiated Trainer object depending on the configuration.

Parameters:
  • dataset (datasets.DatasetDict) – Dataset with train and validation splits.

  • tokenizer (transformers.PreTrainedTokenizer) – Tokenizer from transformers.

  • args (transformers.TrainingArguments) – TrainingArguments for the Trainer.

  • model_init (Any) – Function that loads the model.

  • data_collator (Any) – Data Collator to use inside Trainer.

  • compute_metrics_func (Any) – Function to compute metrics.

  • config (transformers.PretrainedConfig) – Configuration for the model in Huggingface Transformers.

Returns:

Trainer – Trainer object loaded with the given configuration.

Return type:

transformers.Trainer

class nlpboost.hftransformers_manager.MultilabelTrainer(model: Optional[Union[PreTrainedModel, Module]] = None, args: Optional[TrainingArguments] = None, data_collator: Optional[DataCollator] = None, train_dataset: Optional[Dataset] = None, eval_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, model_init: Optional[Callable[[], PreTrainedModel]] = None, compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None, callbacks: Optional[List[TrainerCallback]] = None, optimizers: Tuple[Optimizer, LambdaLR] = (None, None), preprocess_logits_for_metrics: Optional[Callable[[Tensor, Tensor], Tensor]] = None)[source]

Bases: Trainer

Version of the trainer used for multilabel setting.

compute_loss(model, inputs, return_outputs=False)[source]

Compute loss of the model.

Parameters:
  • model (transformers.PreTrainedModel) – Model to compute loss.

  • inputs (torch.Tensor) – Model inputs.

  • return_outputs (bool) – Whether or not to return model outputs.

nlpboost.metrics module

nlpboost.metrics.compute_metrics_classification(pred, tokenizer=None, id2tag=None, additional_metrics=None)[source]

Compute metrics for classification (multi-class or binary) tasks.

Parameters:
  • pred (transformers.EvalPrediction) – Prediction as output by transformers.Trainer

  • tokenizer (transformers.Tokenizer) – Tokenizer from huggingface.

  • id2tag (Dict) – Dictionary mapping label ids to label names.

  • additional_metrics (List) – List with additional metrics to compute.

Returns:

metrics – Dictionary with metrics. For information regarding the exact metrics received in it, see the documentation for sklearn.metrics.classification_report.

Return type:

Dict

nlpboost.metrics.compute_metrics_multilabel(pred, tokenizer=None, id2tag=None, additional_metrics=None)[source]

Compute the metrics for a multilabel task.

Parameters:
  • pred (transformers.EvalPrediction) – Prediction as output by transformers.Trainer

  • tokenizer (transformers.Tokenizer) – Tokenizer from huggingface.

  • id2tag (Dict) – Dictionary mapping label ids to label names.

  • additional_metrics (List) – List with additional metrics to compute.

Returns:

best_metrics – Dictionary with best metrics, after trying different thresholds.

Return type:

Dict
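
The idea of trying different thresholds and keeping the best metrics can be sketched in plain Python. This is an illustration, not the library's implementation: the threshold grid, the choice of micro-F1 as the selection metric, and the helper names are all assumptions.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, label) cells."""
    pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, probs, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Binarize probabilities at each threshold and keep the best score."""
    best = {"threshold": None, "f1": -1.0}
    for thr in thresholds:
        y_pred = [[int(p >= thr) for p in row] for row in probs]
        score = micro_f1(y_true, y_pred)
        if score > best["f1"]:
            best = {"threshold": thr, "f1": score}
    return best


# Toy data: 3 samples, 3 labels; probabilities as if taken after a sigmoid.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
probs = [[0.9, 0.2, 0.45], [0.1, 0.8, 0.35], [0.65, 0.55, 0.05]]

best = best_threshold(y_true, probs)
print(best)
```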

nlpboost.metrics.compute_metrics_ner(p, tokenizer=None, id2tag=None, additional_metrics=None)[source]

Compute metrics for ner.

Use seqeval metric from HF Evaluate. Get the predicted label for each instance, then skip padded tokens and finally use seqeval metric, which takes into account full entities, not individual tokens, when computing the metrics.

Parameters:
  • p (transformers.EvalPrediction) – Instance of EvalPrediction from transformers.

  • tokenizer (transformers.Tokenizer) – Tokenizer from huggingface.

  • id2tag (Dict) – Dictionary mapping label ids to label names.

  • additional_metrics (List) – List with additional metrics to compute.

Returns:

Complete dictionary with all computed metrics on eval data.

Return type:

Metrics
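
The step of skipping padded tokens before scoring can be sketched as follows. The sketch assumes the common Hugging Face convention of marking ignored positions with the label id -100, and `align_predictions` is a hypothetical helper; the resulting tag lists have the list-of-lists shape that seqeval-style metrics take as input.

```python
def align_predictions(pred_ids, label_ids, id2tag, ignore_index=-100):
    """Drop ignored positions and map ids to tag names (hypothetical helper).

    Assumption: padded and special tokens carry the label id `ignore_index`.
    """
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != ignore_index]
        pred_tags.append([id2tag[p] for p, _ in kept])
        true_tags.append([id2tag[l] for _, l in kept])
    return true_tags, pred_tags


id2tag = {0: "O", 1: "B-PER", 2: "I-PER"}
label_ids = [[-100, 1, 2, 0, -100]]  # first and last positions are ignored
pred_ids = [[0, 1, 0, 0, 0]]         # argmax over the logits of each token

true_tags, pred_tags = align_predictions(pred_ids, label_ids, id2tag)
print(true_tags)
print(pred_tags)
```

Because the metric works on full entities, the prediction above would count as a miss for the two-token PER entity even though its first token is tagged correctly.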

nlpboost.metrics.compute_metrics_summarization(eval_pred, tokenizer, id2tag=None, additional_metrics: Optional[List] = None)[source]

Compute metrics for summarization tasks, by using rouge metrics in datasets library.

Parameters:
  • eval_pred (transformers.EvalPrediction) – Prediction as output by transformers.Trainer

  • tokenizer – Tokenizer from huggingface.

  • id2tag (Dict) – Dictionary mapping label ids to label names.

  • additional_metrics (List) – List with additional metrics to compute.

Returns:

metrics – Dictionary with relevant metrics for summarization.

Return type:

Dict

nlpboost.metrics_plotter module

class nlpboost.metrics_plotter.ResultsPlotter(metrics_dir: str, model_names: List, dataset_to_task_map: Dict, remove_strs: List = [], metric_field: str = 'f1-score')[source]

Bases: object

Tool for plotting the results of the models trained.

Parameters:
  • metrics_dir (str) – Directory name with metrics.

  • model_names (List) – List with the names of the models.

  • dataset_to_task_map (Dict) – Dictionary that maps dataset names to tasks. Can be built with the list of DatasetConfigs.

  • remove_strs (List) – List of strings to remove from filename.

  • metric_field (str) – Name of the field with the objective metric.

plot_metrics()[source]

Plot the metrics as a barplot.

read_metrics()[source]

Read the metrics in the self.metrics_dir directory, creating a dataset with the data.

nlpboost.model_config module

class nlpboost.model_config.ModelConfig(name: str, save_name: str, hp_space: ~typing.Optional[~typing.Any] = None, dropout_vals: ~typing.List = <factory>, custom_config_class: ~typing.Optional[~transformers.configuration_utils.PretrainedConfig] = None, custom_model_class: ~typing.Optional[~transformers.modeling_utils.PreTrainedModel] = None, custom_tokenization_func: ~typing.Optional[~typing.Any] = None, partial_custom_tok_func_call: ~typing.Optional[~typing.Any] = None, encoder_name: ~typing.Optional[str] = None, decoder_name: ~typing.Optional[str] = None, tie_encoder_decoder: bool = True, max_length_summary: int = 128, min_length_summary: int = 10, no_repeat_ngram_size: int = 3, early_stopping_summarization: bool = True, length_penalty: float = 2.0, num_beams: int = 1, dropout_field_name: str = 'cls_dropout', n_trials: int = 20, random_init_trials: int = 10, trainer_cls_summarization: ~typing.Optional[~typing.Any] = None, model_cls_summarization: ~typing.Optional[~typing.Any] = None, only_test: bool = False, test_batch_size: int = 32, overwrite_training_args: ~typing.Optional[~typing.Dict] = None, save_dir: str = '.', push_to_hub: bool = False, additional_params_tokenizer: ~typing.Optional[~typing.Dict] = None, resume_from_checkpoint: bool = False, config_problem_type: ~typing.Optional[str] = None, custom_trainer_cls: ~typing.Optional[~typing.Any] = None, do_nothing: bool = False, custom_params_config_model: ~typing.Optional[~typing.Dict] = None, generation_params: ~typing.Optional[~typing.Dict] = None, hf_hub_username: ~typing.Optional[str] = None, custom_results_getter: ~typing.Optional[~typing.Any] = None)[source]

Bases: object

Configure a model to use inside the AutoTrainer class.

With this we determine every choice related to the model, such as the original name, the name to save the model with, the hyperparameter space, and so on.

Parameters:
  • name (str) – Name of the model, either in the HF hub or a path to the local directory where it is stored.

  • save_name (str) – Alias for the model, used for saving it.

  • hp_space (Any) – The hyperparameter space for hyperparameter search with optuna. Must be a function receiving a trial and returning a dictionary with the corresponding suggest_categorical and float fields.

  • dropout_vals (List) – Dropout values to try.

  • custom_config_class (transformers.PretrainedConfig) – Custom configuration for a model. Useful for training ensembles of transformers.

  • custom_model_class (transformers.PreTrainedModel) – Custom model. None by default. Only used for ensemble models and other strange creatures of Nature.

  • partial_custom_tok_func_call (Any) – Partial call for a tokenization function, with all necessary parameters passed to it.

  • encoder_name (str) – Useful for summarization problems, when we want to create an encoder-decoder and want those models to be different.

  • decoder_name (str) – Useful for summarization problems, when we want to create an encoder-decoder and want those models to be different.

  • tie_encoder_decoder (bool) – Useful for summarization problems, when we want to have the weights of the encoder and decoder in an EncoderDecoderModel tied.

  • max_length_summary (int) – Max length of the summaries. Useful for summarization datasets.

  • min_length_summary (int) – Min length of the summaries. Useful for summarization datasets.

  • no_repeat_ngram_size (int) – Size of the n-grams that must not be repeated when doing summarization.

  • early_stopping_summarization (bool) – Whether to have early stopping when doing summarization tasks.

  • length_penalty (float) – Length penalty for summarization tasks.

  • num_beams (int) – Number of beams in beam search for summarization tasks.

  • dropout_field_name (str) – Name for the dropout field in the pooler layer.

  • n_trials (int) – Number of trials (trainings) to carry out with this model.

  • random_init_trials (int) – Argument for optuna sampler, to control number of initial trials to run randomly.

  • trainer_cls_summarization (Any) – Class for the trainer. Useful when it is desired to override the default trainer cls for summarization.

  • model_cls_summarization (Any) – Class for the model. Useful when it is desired to override the default model class for summarization.

  • custom_tokenization_func (Any) – Custom tokenization function for processing texts. When the user does not want to use the default tokenization function for the task at hand, one can create a custom tokenization function. This function must receive samples from a dataset, a tokenizer and a dataset config.

  • only_test (bool) – Whether to only test, not train (for already trained models).

  • test_batch_size (int) – Batch size for test; only used when doing only testing.

  • overwrite_training_args (Dict) – Arguments to overwrite the default arguments for the trainer, for example to change the optimizer for this concrete model.

  • save_dir (str) – The directory to save the trained model.

  • push_to_hub (bool) – Whether to push the best model to the hub.

  • additional_params_tokenizer (Dict) – Additional arguments to pass to the tokenizer.

  • resume_from_checkpoint (bool) – Whether to resume from checkpoint to continue training.

  • config_problem_type (str) – The type of the problem, for loss fct.

  • custom_trainer_cls (Any) – Custom trainer class to override the current one.

  • do_nothing (bool) – Whether to do nothing or not. If true, will not train nor predict.

  • custom_params_config_model (Dict) – Dictionary with custom parameters for loading AutoConfig.

  • generation_params (Dict) – Parameters for generative tasks, for the generate call.

  • hf_hub_username (str) – Username in HF Hub, to push models to hub.

  • custom_results_getter (Any) – Custom class to get test results after training.

Examples

With the following lines you can create a ModelConfig for bert-base-cased model.

>>> from nlpboost import ModelConfig
>>> from nlpboost.default_param_spaces import hp_space_base
>>> model_config = ModelConfig(name='bert-base-cased', save_name='bert', hp_space=hp_space_base)
additional_params_tokenizer: Dict = None
config_problem_type: str = None
custom_config_class: PretrainedConfig = None
custom_model_class: PreTrainedModel = None
custom_params_config_model: Dict = None
custom_results_getter: Any = None
custom_tokenization_func: Any = None
custom_trainer_cls: Any = None
decoder_name: str = None
do_nothing: bool = False
dropout_field_name: str = 'cls_dropout'
dropout_vals: List
early_stopping_summarization: bool = True
encoder_name: str = None
generation_params: Dict = None
hf_hub_username: str = None
hp_space: Any = None
length_penalty: float = 2.0
max_length_summary: int = 128
min_length_summary: int = 10
model_cls_summarization: Any = None
n_trials: int = 20
name: str
no_repeat_ngram_size: int = 3
num_beams: int = 1
only_test: bool = False
overwrite_training_args: Dict = None
partial_custom_tok_func_call: Any = None
push_to_hub: bool = False
random_init_trials: int = 10
resume_from_checkpoint: bool = False
save_dir: str = '.'
save_name: str
test_batch_size: int = 32
tie_encoder_decoder: bool = True
trainer_cls_summarization: Any = None

nlpboost.results_getter module

class nlpboost.results_getter.ResultsGetter(dataset_config: DatasetConfig, model_config: ModelConfig, compute_metrics_func: Any)[source]

Bases: object

Retrieve results on the test set for different tasks (seq2seq, different forms of classification, NER, QA…).

Parameters:
  • dataset_config (nlpboost.DatasetConfig) – Configuration for the dataset.

  • model_config (nlpboost.ModelConfig) – Configuration for the model.

  • compute_metrics_func (Any) – Function to compute metrics.

general_get_test_results(test_dataset, trainer, compute_metrics_func, additional_metrics=None)[source]

Compute metrics in general for every NLU task except for QA.

Parameters:
  • test_dataset (datasets.Dataset) – Dataset on any task except for QA.

  • trainer (transformers.Trainer) – Trainer trained on a dataset that is not a QA dataset.

Returns:

metrics – Metrics for the test dataset.

Return type:

Dict

get_test_results_qa(test_dataset, trainer, squad_v2=False, additional_metrics=None)[source]

Compute metrics on test for QA datasets.

Parameters:
  • test_dataset (datasets.Dataset) – QA dataset.

  • trainer (transformers.Trainer) – Trainer trained on QA dataset.

  • squad_v2 (bool) – Whether the dataset is in squad v2 format or not.

Returns:

metrics – Metrics for the test dataset.

Return type:

Dict

get_test_results_summarization(test_dataset, trainer, compute_metrics_func, additional_metrics=None)[source]

Compute and get the results in test for summarization tasks.

Parameters:
  • test_dataset (datasets.Dataset) – Test dataset.

  • trainer (transformers.Trainer) – HF’s transformers trainer.

  • compute_metrics_func (Any) – Function to compute metrics.

  • model_config (nlpboost.ModelConfig) – Configuration for the model.

  • additional_metrics (List) – List with additional metrics to compute.

Returns:

metrics – Dictionary with metrics for the summarization task.

Return type:

Dict

postprocess_qa_predictions(examples, features, raw_predictions, tokenizer, n_best_size=20, max_answer_length=30, squad_v2=False, min_score=None)[source]

Process raw predictions of a QA model.

Parameters:
  • examples (datasets.Dataset) – Samples from datasets.Dataset.

  • features – Validation features as processed by prepare_validation_features_squad.

  • raw_predictions – Predictions by trainer.

  • tokenizer (tokenizers.Tokenizer) – Instance of hf’s tokenizer.

  • n_best_size (int) – Number of best answers to get (maximum).

  • max_answer_length (int) – Maximum answer length in number of characters. Answers longer than this are not considered.

  • squad_v2 (bool) – Whether the dataset is in squad v2 format or not.

Returns:

predictions – An ordered dict with the predictions formatted so that we can compute metrics easily.

Return type:

collections.OrderedDict

prepare_validation_features_squad(examples, tokenizer, pad_on_right=True)[source]

Process features for validating on squad-like datasets.

Parameters:
  • examples (datasets.Dataset) – Samples from datasets.Dataset.

  • tokenizer (tokenizers.Tokenizer) – Instance of hf’s tokenizer.

  • pad_on_right (bool) – Whether or not to pad the samples on the right side. True for most models.

Returns:

Tokenized samples.

Return type:

tokenized_examples

nlpboost.tokenization_functions module

nlpboost.tokenization_functions.tokenize_classification(examples, tokenizer, dataset_config)[source]

Tokenize classification datasets.

Given a dataset, a tokenizer and a dataset configuration, returns the tokenized dataset.

Parameters:
  • examples (datasets.Dataset) – Samples from datasets.Dataset.

  • tokenizer (tokenizers.Tokenizer) – Instance of hf’s tokenizer.

  • dataset_config (nlpboost.DatasetConfig) – Instance of a DatasetConfig.

Returns:

Tokenized samples.

Return type:

tokenized

nlpboost.tokenization_functions.tokenize_ner(examples, tokenizer, dataset_config)[source]

Tokenize a dataset or dataset split.

This function is intended to be used inside the map method for the Dataset.

Parameters:
  • examples (datasets.Dataset) – Samples from datasets.Dataset.

  • tokenizer (tokenizers.Tokenizer) – Instance of hf’s tokenizer.

  • dataset_config (nlpboost.DatasetConfig) – Instance of a DatasetConfig.

Returns:

Tokenized samples.

Return type:

tokenized

nlpboost.tokenization_functions.tokenize_squad(examples, tokenizer, dataset_config=None, pad_on_right=True)[source]

Tokenize samples of squad-like datasets, in batches.

BPE tokenizers are handled separately from other tokenizers, because they produce errors when processed in the conventional way.

Parameters:
  • examples (datasets.Dataset) – Samples from datasets.Dataset.

  • tokenizer (tokenizers.Tokenizer) – Instance of hf’s tokenizer.

  • pad_on_right (bool) – Whether or not to pad the samples on the right side. True for most models.

Returns:

Tokenized samples.

Return type:

tokenized_examples

nlpboost.tokenization_functions.tokenize_summarization(examples, tokenizer, dataset_config)[source]

Tokenization function for summarization tasks.

Parameters:
  • examples (datasets.Dataset) – Samples from datasets.Dataset.

  • tokenizer (tokenizers.Tokenizer) – Instance of hf’s tokenizer.

  • dataset_config (nlpboost.DatasetConfig) – Instance of a DatasetConfig.

Returns:

examples – Tokenized samples with all necessary fields.

Return type:

datasets.Dataset

nlpboost.utils module

nlpboost.utils.chunks(lst, n)[source]

Split a list into n-sized chunks.

Parameters:
  • lst (List) – List containing any type of elements.

  • n (int) – Size of the chunks.

Returns:

Generates n-sized chunks.

Return type:

Chunks
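As an illustration of the chunking behavior described above, here is a minimal sketch. The name chunks_sketch is hypothetical and this is not the nlpboost source; it also assumes that the last chunk may be shorter than n when the list length is not a multiple of n, which the documentation does not state.

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def chunks_sketch(lst: List[T], n: int) -> Iterator[List[T]]:
    """Yield successive n-sized slices of lst (hypothetical stand-in for chunks)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Step through the list n items at a time; slicing never overruns,
    # so the final chunk is simply shorter when items run out (assumption).
    for start in range(0, len(lst), n):
        yield lst[start:start + n]


# A generator is returned, so materialize it to inspect the chunks.
batches = list(chunks_sketch([1, 2, 3, 4, 5], 2))
```

Returning a generator keeps memory use flat when the input list is large, for example when splitting a test set into prediction batches.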

nlpboost.utils.dict_to_list(example, nulltoken='O', entities_field='entities', sentence_field='sentence')[source]

Transform a dictionary of entities in the default format into lists of words and labels.

The default format gives the start and end characters of each entity, (ent_label, start_char, end_char); the output has one label per word. This is useful for NER tasks, where data usually comes in this format and the tokenizer needs two equally sized lists of words and labels.

Parameters:
  • example – Sample of huggingface Dataset, with an entities field containing the entities in the format mentioned above.

  • nulltoken (Union[str, int]) – Default label for words that do not belong to any entity. Usually O is used for this, which is the default value.

  • entities_field (str) – Name of the field which contains entities in (ent_label, start_char, end_char) format. Usually “entities” is used for this, which is the default value.

  • sentence_field (str) – Name of the field which contains the sentence. Usually “sentence” is used for this, which is the default value.

Returns:

Sample of huggingface dataset with 2 new fields: token list and label list.

Return type:

example
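The conversion described above could look roughly like the following sketch. It is not the nlpboost implementation: the function name dict_to_list_sketch, the output field names tokens and labels, whitespace word splitting, exclusive end offsets, and entities stored as (ent_label, start_char, end_char) tuples are all assumptions made for illustration.

```python
import re
from typing import Any, Dict, List


def dict_to_list_sketch(
    example: Dict[str, Any],
    nulltoken: Any = "O",
    entities_field: str = "entities",
    sentence_field: str = "sentence",
) -> Dict[str, Any]:
    """Turn character-span entities into parallel word and label lists."""
    tokens: List[str] = []
    labels: List[Any] = []
    # Assumption: words are whitespace-separated runs of characters.
    for match in re.finditer(r"\S+", example[sentence_field]):
        word_start, word_end = match.span()
        label = nulltoken
        for ent_label, ent_start, ent_end in example[entities_field]:
            # A word takes the label of the first entity whose span overlaps it.
            if word_start < ent_end and ent_start < word_end:
                label = ent_label
                break
        tokens.append(match.group())
        labels.append(label)
    # "tokens" and "labels" are hypothetical field names.
    return {**example, "tokens": tokens, "labels": labels}


sample = {
    "sentence": "Ana lives in Madrid",
    "entities": [("PER", 0, 3), ("LOC", 13, 19)],
}
converted = dict_to_list_sketch(sample)
# converted["tokens"] -> ["Ana", "lives", "in", "Madrid"]
# converted["labels"] -> ["PER", "O", "O", "LOC"]
```

The two lists always have the same length, which is what a token-classification tokenizer needs when it aligns word-level labels with subword tokens.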

nlpboost.utils.filter_empty(string_list)[source]

Remove empty characters and spaces from list.

Parameters:

string (str) – String to filter.

Returns:

result – Whether string is not in the empty characters list.

Return type:

bool

nlpboost.utils.get_tags(dataset, dataset_config)[source]

Get the list of unique tags for a dataset.

Parameters:
  • dataset (datasets.DatasetDict) – Dataset to tokenize.

  • dataset_config (nlpboost.DatasetConfig) – Dataset configuration.

Returns:

tags – List of unique labels for the dataset.

Return type:

List

nlpboost.utils.get_windowed_match_context_answer(context, answer, maxrange=100)[source]

Find the best possible match for an answer in the context.

Useful for translated QA datasets, where the answers are not exact translations and therefore no longer appear verbatim in the context. The same problem can arise from encoding issues or other causes that make the answer start at a different string index than the one recorded in the dataset.

Parameters:
  • context (str) – Context where we want to find the answer.

  • answer (str) – Answer that we want to find in the context.

  • maxrange (int) – Maximum size of the windows for matching, in number of words.

Returns:

  • beg (int) – Beginning character index of the answer.

  • end (int) – Ending character index of the answer.

  • new_answer (str) – Answer found in the context.
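To make the idea of windowed matching concrete, the sketch below scores every window of up to maxrange consecutive words against the answer and keeps the most similar one. It is only an illustration under stated assumptions: the name windowed_match_sketch is hypothetical, similarity is measured with difflib.SequenceMatcher from the standard library (the library may use a different measure), words are split on whitespace, and end is treated as an exclusive index.

```python
import re
from difflib import SequenceMatcher
from typing import Tuple


def windowed_match_sketch(
    context: str, answer: str, maxrange: int = 100
) -> Tuple[int, int, str]:
    """Return (beg, end, new_answer) for the word window closest to answer."""
    # Character span of every whitespace-separated word in the context.
    words = [match.span() for match in re.finditer(r"\S+", context)]
    target = answer.lower()
    best_score, best_span = -1.0, (0, 0)
    for i in range(len(words)):
        # Windows start at word i and cover at most maxrange words.
        for j in range(i, min(i + maxrange, len(words))):
            beg, end = words[i][0], words[j][1]
            candidate = context[beg:end].lower()
            score = SequenceMatcher(None, candidate, target).ratio()
            if score > best_score:
                best_score, best_span = score, (beg, end)
    beg, end = best_span
    return beg, end, context[beg:end]


context = "El Quijote fue escrito por Miguel de Cervantes en 1605"

# Exact case: the answer appears verbatim, so its own window scores highest.
exact = windowed_match_sketch(context, "Miguel de Cervantes")

# Inexact case: the answer is not a substring of the context,
# so the closest window is returned instead.
fuzzy = windowed_match_sketch(context, "Miguel Cervantes")
```

This brute-force search compares every window, so its cost grows with the context length times maxrange; it is meant to show the technique, not to be fast.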

nlpboost.utils.joinpaths(*paths)[source]

Join all paths passed as args.
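A helper like this is typically a thin wrapper over the standard library. The sketch below assumes that is the case here; the name joinpaths_sketch is hypothetical and the actual implementation may differ.

```python
import os


def joinpaths_sketch(*paths: str) -> str:
    """Join any number of path components with the platform separator."""
    # os.path.join needs at least one component and raises TypeError otherwise.
    return os.path.join(*paths)


# Build a nested path from its components, e.g. a metrics file location.
metrics_path = joinpaths_sketch("tmp_experiments_metrics", "dataset_a", "results.json")
```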

nlpboost.utils.match_questions_multiple_answers(formatted_predictions, references)[source]

Check if any of the given answers for a question coincides with our answer.

Parameters:
  • formatted_predictions (List) – List with the predictions.

  • references (List) – All references with the real answers for the questions. There may be more than one answer per question; these must first be grouped under the same id.

Returns:

final_references – Final references for the questions, built so that a prediction counts as correct if it matches any of the possible answers to a question.

Return type:

List
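One way to read the behavior described above is sketched here. The data layout is an assumption borrowed from the SQuAD metric convention (predictions with id and prediction_text, references with id and answers["text"]); the documentation does not specify it. The name match_multiple_answers_sketch and the fallback to the first reference answer are hypothetical as well.

```python
from collections import defaultdict
from typing import Any, Dict, List


def match_multiple_answers_sketch(
    formatted_predictions: List[Dict[str, Any]],
    references: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Pick, per question, the reference answer that the prediction matches."""
    # Group every accepted answer text under its question id.
    answers_by_id: Dict[str, List[str]] = defaultdict(list)
    for ref in references:
        answers_by_id[ref["id"]].extend(ref["answers"]["text"])

    final_references = []
    for pred in formatted_predictions:
        candidates = answers_by_id[pred["id"]]
        matched = [text for text in candidates if text == pred["prediction_text"]]
        if matched:
            chosen = matched[0]  # the prediction equals one accepted answer
        else:
            # Assumption: fall back to the first accepted answer, if any.
            chosen = candidates[0] if candidates else ""
        final_references.append({"id": pred["id"], "answers": {"text": [chosen]}})
    return final_references


predictions = [
    {"id": "q1", "prediction_text": "Paris"},
    {"id": "q2", "prediction_text": "1999"},
]
references = [
    {"id": "q1", "answers": {"text": ["the city of Paris"]}},
    {"id": "q1", "answers": {"text": ["Paris"]}},
    {"id": "q2", "answers": {"text": ["2001"]}},
]
final_references = match_multiple_answers_sketch(predictions, references)
```

Here q1 is paired with the reference "Paris" because the prediction matches the second accepted answer, while q2 keeps its only reference and remains a miss.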

Module contents