linktransformer

Linkage and Classification Inference

linktransformer.infer.aggregate_rows(df, ref_df, model, left_on, right_on, openai_key=None)

Aggregate the dataframe based on a reference dataframe using a language model.

Parameters:
  • df (DataFrame) – Dataframe to aggregate.

  • ref_df (DataFrame) – Reference dataframe to aggregate on.

  • model (str) – Language model to use.

  • left_on (Union[str, List[str]]) – Column(s) to aggregate on in df.

  • right_on (Union[str, List[str]]) – Reference column(s) to aggregate on in ref_df.

Return type:

DataFrame

Returns:

DataFrame: The aggregated dataframe.

linktransformer.infer.all_pair_combos_evaluate(df, model, left_on, right_on, openai_key=None)

Get similarity scores for every pair of rows in a dataframe. To keep this efficient, each string is embedded only once; all possible pairwise distances are then computed, and the expanded rows and their scores are added to the dataframe.

Parameters:
  • df (DataFrame) – Dataframe to evaluate.

  • model (str) – Language model to use.

  • left_on (Union[str, List[str]]) – Column(s) to evaluate on in df.

  • right_on (Union[str, List[str]]) – Reference column(s) to evaluate on in df.

  • openai_key (str) – OpenAI API key

Returns:

DataFrame: The evaluated dataframe.
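The "embed once, then score every combination" idea can be sketched in plain Python. This is a minimal illustration, not the library's implementation: toy_embed is a hypothetical stand-in for a language model (it just counts character bigrams), and the "left", "right" and "score" keys are made up for the example.

```python
import math
from collections import Counter
from itertools import product


def toy_embed(text):
    """Hypothetical stand-in for a language model: character-bigram counts."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def all_pair_scores(left_values, right_values):
    """Score every (left, right) combination, embedding each string once."""
    # Each distinct string is embedded a single time and then reused.
    cache = {s: toy_embed(s) for s in set(left_values) | set(right_values)}
    return [
        {"left": a, "right": b, "score": cosine(cache[a], cache[b])}
        for a, b in product(left_values, right_values)
    ]


pairs = all_pair_scores(["apple inc", "google"], ["apple inc.", "alphabet"])
for pair in pairs:
    print(pair)
```

With n left values and m right values the result has n × m rows, but only as many embeddings are computed as there are distinct strings.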

linktransformer.infer.classify_rows(df, on=None, model=None, num_labels=2, label_map=None, use_gpu=False, batch_size=128, openai_key=None, openai_topic=None, openai_prompt=None, openai_params={})

Classify the text in every row of one or more columns by whether it is relevant to a certain topic. The function uses either a trained classifier to make predictions or an OpenAI API key to send requests to the ChatCompletion endpoint and retrieve the classification results. It returns a copy of the input dataframe with a new column “clf_preds_{on}” that stores the classification results.

Parameters:
  • df (DataFrame) – (DataFrame) the dataframe.

  • on (Union[str, List[str], None]) – (Union[str, List[str]], optional) Column(s) to classify (if multiple columns are passed in, they will be joined).

  • model (Optional[str]) – (str) filepath to the model to use (to use OpenAI, see “https://platform.openai.com/docs/models”).

  • num_labels (int) – (int) number of labels to predict. Defaults to 2.

  • label_map (Optional[dict]) – (dict) a dictionary that maps text labels to numeric labels. Used for OpenAI predictions.

  • use_gpu (bool) – (bool) Whether to use GPU. Not supported yet. Defaults to False.

  • batch_size (int) – (int) Batch size for inferencing embeddings. Defaults to 128.

  • openai_key (Optional[str]) – (str, optional) OpenAI API key for InferKit API. Defaults to None.

  • openai_topic (Optional[str]) – (str, optional) The topic for which to predict whether the text is relevant or not. Defaults to None.

  • openai_prompt (Optional[str]) – (str, optional) Custom system prompt for OpenAI ChatCompletion endpoint. Defaults to None.

  • openai_params (Optional[dict]) – (dict, optional) Custom parameters for OpenAI ChatCompletion endpoint. Defaults to {}.

Returns:

DataFrame: The dataframe with a new column “clf_preds_{on}” that stores the classification results.
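As a rough illustration of the output shape described above (a copy of the input with a new “clf_preds_{on}” column, and a label_map that turns text labels into numeric ones), here is a minimal sketch over plain dictionaries. The helper names, the "Yes"/"No" labels and the keyword rule are assumptions for the example; the real function works on a DataFrame and calls a trained model or the OpenAI API.

```python
def classify_rows_sketch(rows, on, classify_text, label_map):
    """Return copies of the rows with a new 'clf_preds_<on>' column."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        text_label = classify_text(row[on])
        # Map the text label (e.g. "Yes") to its numeric label (e.g. 1).
        new_row["clf_preds_" + on] = label_map[text_label]
        out.append(new_row)
    return out


def toy_topic_classifier(text):
    """Hypothetical stand-in for a trained model or an OpenAI request."""
    return "Yes" if "bank" in text.lower() else "No"


rows = [
    {"headline": "Central bank raises rates"},
    {"headline": "Local team wins cup"},
]
label_map = {"Yes": 1, "No": 0}  # assumed labels, for illustration only
result = classify_rows_sketch(rows, "headline", toy_topic_classifier, label_map)
print(result)
```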

linktransformer.infer.cluster_rows(df, model, on, cluster_type='SLINK', cluster_params={'metric': 'cosine', 'min cluster size': 2, 'threshold': 0.5}, openai_key=None)

Deduplicate a dataframe based on a similarity threshold. Various clustering options are supported, with parameters such as:

{
    "agglomerative": {
        "threshold": 0.5,
        "clustering linkage": "ward",  # You can choose a default linkage method
        "metric": "euclidean",  # You can choose a default metric
    },
    "HDBScan": {
        "min cluster size": 5,
        "min samples": 1,
    },
    "SLINK": {
        "min cluster size": 2,
        "threshold": 0.1,
    },
}

Parameters:
  • df (DataFrame) – Dataframe to deduplicate.

  • model (str) – Language model to use.

  • on (Union[str, List[str]]) – Column(s) to deduplicate on.

  • cluster_type (str) – Clustering method to use. Defaults to “SLINK”.

  • cluster_params (Dict[str, Any]) – Parameters for clustering method. Defaults to {"threshold": 0.5, "min cluster size": 2, "metric": "cosine"}.

  • openai_key (str) – OpenAI API key

Return type:

DataFrame

Returns:

DataFrame: The deduplicated dataframe.
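SLINK is an algorithm for single-linkage clustering, and cutting a single-linkage hierarchy at a distance threshold gives the same groups as linking every pair of items whose distance is within that threshold. The sketch below does this with a naive quadratic loop and a union-find structure rather than SLINK itself. It assumes the threshold is a maximum cosine distance and that the rows have already been turned into embedding vectors; both are assumptions for illustration.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def single_linkage_clusters(vectors, threshold):
    """Return a cluster label per vector; close items end up linked."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair whose distance is within the threshold.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) <= threshold:
                parent[find(j)] = find(i)

    # Relabel the roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(vectors))]


embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
cluster_ids = single_linkage_clusters(embeddings, threshold=0.1)
print(cluster_ids)
```

Because single linkage only needs one close pair to join two groups, a chain of near neighbours can merge items that are far apart from each other, which is worth keeping in mind when choosing the threshold.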

linktransformer.infer.dedup_rows(df, model, on, cluster_type='SLINK', cluster_params={'metric': 'cosine', 'min cluster size': 2, 'threshold': 0.5}, openai_key=None)

Deduplicate a dataframe based on a similarity threshold. This is just clustering and keeping the first row in each cluster. Refer to the docs for the cluster_rows function for more details.

Parameters:
  • df (DataFrame) – Dataframe to deduplicate.

  • model (str) – Language model to use.

  • on (Union[str, List[str]]) – Column(s) to deduplicate on.

  • cluster_type (str) – Clustering method to use. Defaults to “SLINK”.

  • cluster_params (Dict[str, Any]) – Parameters for clustering method. Defaults to {"threshold": 0.5, "min cluster size": 2, "metric": "cosine"}.

  • openai_key (str) – OpenAI API key

Return type:

DataFrame

Returns:

DataFrame: The deduplicated dataframe.
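The "keep the first row in each cluster" step can be pictured as follows. This is a small sketch over plain lists, assuming cluster labels have already been computed for each row; the function name and the sample data are hypothetical.

```python
def keep_first_per_cluster(rows, cluster_ids):
    """Keep only the first row seen for each cluster label."""
    seen = set()
    kept = []
    for row, cluster_id in zip(rows, cluster_ids):
        if cluster_id not in seen:
            seen.add(cluster_id)
            kept.append(row)
    return kept


rows = ["IBM Corp", "I.B.M. Corporation", "Siemens AG", "IBM", "Siemens"]
cluster_ids = [0, 0, 1, 0, 1]  # assumed output of a clustering step
deduplicated = keep_first_per_cluster(rows, cluster_ids)
print(deduplicated)
```

Which row counts as "first" depends on the row order, so sorting the dataframe beforehand determines which representative of each cluster survives.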

linktransformer.infer.evaluate_pairs(df, model, left_on, right_on, openai_key=None)

This function evaluates paired columns in a dataframe and gives a match score (cosine similarity). Typically, this can be thought of as a way to evaluate dataframes that have already been merged.

Parameters:
  • df (DataFrame) – Dataframe to evaluate.

  • model (str) – Language model to use.

  • left_on (Union[str, List[str]]) – Column(s) to evaluate on in df.

  • right_on (Union[str, List[str]]) – Reference column(s) to evaluate on in df.

Returns:

DataFrame: The evaluated dataframe.

linktransformer.infer.merge(df1, df2, merge_type='1:1', on=None, model='all-MiniLM-L6-v2', left_on=None, right_on=None, suffixes=('_x', '_y'), use_gpu=False, batch_size=128, openai_key=None)

Merge two dataframes using language model embeddings.

Parameters:
  • df1 (DataFrame) – First dataframe (left).

  • df2 (DataFrame) – Second dataframe (right).

  • merge_type (str) – Type of merge to perform (1:m or m:1 or 1:1).

  • model (str) – Language model to use.

  • on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.

  • left_on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.

  • right_on (Union[str, List[str]], optional) – Column(s) to join on in df2. Defaults to None.

  • suffixes (Tuple[str, str]) – Suffixes to use for overlapping columns. Defaults to (‘_x’, ‘_y’).

  • use_gpu (bool) – Whether to use GPU. Not supported yet. Defaults to False.

  • batch_size (int) – Batch size for inferencing embeddings. Defaults to 128.

  • openai_key (str, optional) – OpenAI API key for InferKit API. Defaults to None.

Return type:

DataFrame

Returns:

DataFrame: The merged dataframe.

linktransformer.infer.merge_blocking(df1, df2, merge_type='1:1', on=None, model='all-MiniLM-L6-v2', left_on=None, right_on=None, blocking_vars=None, suffixes=('_x', '_y'), use_gpu=False, batch_size=128, openai_key=None)

Merge two dataframes using language model embeddings with optional blocking.

Parameters:
  • df1 (DataFrame) – First dataframe (left).

  • df2 (DataFrame) – Second dataframe (right).

  • merge_type (str) – Type of merge to perform (1:m or m:1 or 1:1).

  • model (str) – Language model to use.

  • on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.

  • left_on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.

  • right_on (Union[str, List[str]], optional) – Column(s) to join on in df2. Defaults to None.

  • blocking_vars (optional) – Columns to use for blocking. Defaults to None.

  • suffixes (Tuple[str, str]) – Suffixes to use for overlapping columns. Defaults to (‘_x’, ‘_y’).

  • use_gpu (bool) – Whether to use GPU. Not supported yet. Defaults to False.

  • batch_size (int) – Batch size for inferencing embeddings. Defaults to 128.

  • openai_key (str, optional) – OpenAI API key for InferKit API. Defaults to None.

Return type:

DataFrame

Returns:

DataFrame: The merged dataframe.
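Blocking restricts comparisons to rows that agree exactly on the blocking variables, so a left row is only ever matched against right rows in the same block. The sketch below shows the idea on plain dictionaries. It is not the library's implementation: difflib's string ratio stands in for embedding similarity, the "score" key is made up, and dropping left rows whose block is empty is an assumption.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for embedding similarity: difflib's ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_with_blocking(left_rows, right_rows, on, blocking_var):
    """Match each left row to its best right row within the same block."""
    # Group the right-hand rows by the value of the blocking variable.
    blocks = defaultdict(list)
    for row in right_rows:
        blocks[row[blocking_var]].append(row)

    merged = []
    for left in left_rows:
        candidates = blocks.get(left[blocking_var], [])
        if not candidates:
            continue  # assumption: rows without a block partner are dropped
        best = max(candidates, key=lambda r: similarity(left[on], r[on]))
        merged.append({
            on + "_x": left[on],
            on + "_y": best[on],
            blocking_var: left[blocking_var],
            "score": similarity(left[on], best[on]),
        })
    return merged


left_rows = [
    {"name": "Acme Corp", "country": "US"},
    {"name": "Acme Corp", "country": "DE"},
]
right_rows = [
    {"name": "ACME Corporation", "country": "US"},
    {"name": "Acme GmbH", "country": "DE"},
    {"name": "Acme Corp", "country": "FR"},
]
merged = merge_with_blocking(left_rows, right_rows, on="name", blocking_var="country")
for row in merged:
    print(row)
```

In this example the exact string "Acme Corp" exists on the right, but only in the FR block, so neither left row is matched to it.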

linktransformer.infer.merge_knn(df1, df2, merge_type='1:1', on=None, model='all-MiniLM-L6-v2', left_on=None, right_on=None, k=1, suffixes=('_x', '_y'), use_gpu=False, batch_size=128, openai_key=None, drop_sim_threshold=None)

Merge two dataframes using language model embeddings. This function supports k-nearest-neighbor matching for each row in df1. Merge is a special case of this function when k=1.

Parameters:
  • df1 (DataFrame) – First dataframe (left).

  • df2 (DataFrame) – Second dataframe (right).

  • on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.

  • model (str) – Language model to use.

  • left_on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.

  • right_on (Union[str, List[str]], optional) – Column(s) to join on in df2. Defaults to None.

  • k (int) – Number of nearest neighbors to match for each row in df1. Defaults to 1.

  • suffixes (Tuple[str, str]) – Suffixes to use for overlapping columns. Defaults to (‘_x’, ‘_y’).

  • use_gpu (bool) – Whether to use GPU. Not supported yet. Defaults to False.

  • batch_size (int) – Batch size for inferencing embeddings. Defaults to 128.

  • openai_key (str, optional) – OpenAI API key for InferKit API. Defaults to None.

Return type:

DataFrame

Returns:

DataFrame: The merged dataframe.
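To make the k-nearest-neighbor idea concrete, the sketch below returns, for each left value, its k most similar right values. It is only an illustration: difflib's string ratio is a stand-in for embedding similarity, and the function name and tuple layout are hypothetical.

```python
import heapq
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for embedding similarity: difflib's ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def knn_matches(left_values, right_values, k=1):
    """For each left value, return its k most similar right values."""
    rows = []
    for left in left_values:
        # nlargest keeps only the k best-scoring candidates.
        top = heapq.nlargest(k, right_values, key=lambda r: similarity(left, r))
        for right in top:
            rows.append((left, right, similarity(left, right)))
    return rows


cities = ["New York City", "Newark", "Boston"]
one_match = knn_matches(["new york"], cities, k=1)
two_matches = knn_matches(["new york"], cities, k=2)
print(one_match)
print(two_matches)
```

With k=1 each left value gets a single best match, which corresponds to the plain merge case; larger k yields k output rows per left value.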

Linkage Model Training

linktransformer.train_model.create_new_train_config(base_config_path='/home/docs/checkouts/readthedocs.org/user_builds/linktransformer/envs/latest/lib/python3.8/site-packages/linktransformer/configs/linkage.json', config_save_path='myconfig.json', model_save_dir=None, model_save_name=None, train_batch_size=None, num_epochs=None, warm_up_perc=None, learning_rate=None, val_perc=None, wandb_names=None, add_pooling_layer=None, opt_model_description=None, opt_model_lang=None, test_at_end=None, save_val_test_pickles=None, val_query_prop=None)

Function to create a training config.

Parameters:
  • config_save_path (str) – Path to save the config

  • base_config_path (str) – Path to the base config

  • model_save_dir (str) – Path to save the model

  • model_save_name (str) – Name of the model

  • train_batch_size (int) – Batch size for training

  • num_epochs (int) – Number of epochs

  • warm_up_perc (float) – Percentage of warmup steps

  • learning_rate (float) – Learning rate

  • val_perc (float) – Percentage of validation data

  • wandb_names (dict) – Dictionary of wandb names

  • add_pooling_layer (bool) – Whether to add pooling layer

  • language (str) – Language of the model

Returns:

Path to the saved config
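Since every optional argument defaults to None, one plausible reading is that only the values actually supplied replace entries from the base config. The sketch below illustrates that pattern with the standard json module. It is an assumption about the behavior, not the library's code, and it treats the config as a flat dictionary with made-up keys; the real linkage.json may be structured differently.

```python
import json
import os
import tempfile


def create_config_sketch(base_config_path, config_save_path, **overrides):
    """Copy a base JSON config, replacing only the values that were supplied."""
    with open(base_config_path) as f:
        config = json.load(f)
    # Arguments left as None keep whatever the base config already says.
    config.update({k: v for k, v in overrides.items() if v is not None})
    with open(config_save_path, "w") as f:
        json.dump(config, f, indent=2)
    return config_save_path


with tempfile.TemporaryDirectory() as tmp:
    base_path = os.path.join(tmp, "base.json")
    with open(base_path, "w") as f:
        # Hypothetical base values, for illustration only.
        json.dump({"num_epochs": 10, "learning_rate": 2e-5}, f)

    saved_path = create_config_sketch(
        base_path,
        os.path.join(tmp, "myconfig.json"),
        num_epochs=3,
        learning_rate=None,
    )
    with open(saved_path) as f:
        saved_config = json.load(f)

print(saved_config)
```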

linktransformer.train_model.train_model(data=None, train_data=None, val_data=None, test_data=None, model_path='sentence-transformers/paraphrase-xlm-r-multilingual-v1', left_col_names=None, right_col_names=None, left_id_name=None, right_id_name=None, label_col_name=None, clus_id_col_name=None, clus_text_col_names=None, config_path='/home/docs/checkouts/readthedocs.org/user_builds/linktransformer/envs/latest/lib/python3.8/site-packages/linktransformer/configs/linkage.json', training_args={'num_epochs': 10}, log_wandb=False)

Train the LinkTransformer model.

Parameters:
  • model_path (str) – The name of the model to use.

  • data (str) – Path to the dataset in Excel or CSV format, or a dataframe object.

  • left_col_names (List[str]) – List of column names to use as left side data.

  • right_col_names (List[str]) – List of column names to use as right side data.

  • left_id_name (List[str]) – List of column names to use as identifiers for the left data.

  • right_id_name (List[str]) – List of column names to use as identifiers for the right data.

  • label_col_name (str) – Name of the column to use as labels. Specify this if you have data of the form (left, right, label). This type supports both positive and negative examples.

  • clus_id_col_name (str) – Name of the column to use as cluster ids. Specify this if you have data of the form (text, cluster_id).

  • clus_text_col_names (str) – Name of the column to use as cluster text. Specify this if you have data of the form (text, cluster_id).

  • config_path (str) – Path to the JSON configuration file.

  • training_args (dict) – Dictionary of training arguments to override the config.

  • log_wandb (bool) – Whether to log the training run on wandb.

Return type:

str

Returns:

The path to the saved best model.
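The two data layouts mentioned above, (left, right, label) pairs and (text, cluster_id) rows, are related: texts that share a cluster id can be read as positive pairs. The sketch below makes that relationship explicit on plain dictionaries. It illustrates the data shapes only and is not a description of how the library builds its training examples; the column names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cluster_rows_to_pairs(rows, text_col, cluster_col):
    """Expand (text, cluster_id) rows into (left, right, label) positive pairs."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_col]].append(row[text_col])

    pairs = []
    for texts in groups.values():
        # Every unordered pair within a cluster is a positive example.
        for left, right in combinations(texts, 2):
            pairs.append({"left": left, "right": right, "label": 1})
    return pairs


cluster_data = [
    {"text": "Toyota Motor Corp", "cluster_id": "A"},
    {"text": "Toyota Motors", "cluster_id": "A"},
    {"text": "TOYOTA", "cluster_id": "A"},
    {"text": "Honda", "cluster_id": "B"},
]
pair_data = cluster_rows_to_pairs(cluster_data, "text", "cluster_id")
for pair in pair_data:
    print(pair)
```

A cluster with n texts yields n(n-1)/2 positive pairs, and singleton clusters yield none; negative examples only exist in the explicit (left, right, label) layout.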

Classification Model Training

linktransformer.train_clf_model.train_clf_model(data=None, model='distilroberta-base', on=[], label_col_name='label', train_data=None, val_data=None, test_data=None, data_dir='.', training_args={}, config='/home/docs/checkouts/readthedocs.org/user_builds/linktransformer/envs/latest/lib/python3.8/site-packages/linktransformer/configs/classification.json', eval_steps=None, save_steps=None, batch_size=None, lr=None, epochs=None, model_save_dir='.', weighted_loss=False, weight_list=None, wandb_log=False, wandb_name='topic', print_test_mistakes=False)

Trains a text classification model using Hugging Face’s Transformers library.

Parameters:
  • data – (str/DataFrame, optional) Path to the CSV file or a DataFrame object containing the training data.

  • model – (str, default="distilroberta-base") The name of the Hugging Face model to be used.

  • on – (list, default=[]) List of column names that are used as input features.

  • label_col_name – (str, default="label") The column name in the data that contains the labels.

  • train_data – (str/DataFrame, optional) Training dataset if data is not provided.

  • val_data – (str/DataFrame, optional) Validation dataset if data is not provided.

  • test_data – (str/DataFrame, optional) Test dataset if data is not provided.

  • data_dir – (str, default=".") Directory where training data splits are saved.

  • training_args – (dict, default={}) Training arguments for the Hugging Face Trainer.

  • config – (str, default=CLF_CONFIG_PATH) Path to the default config file.

  • eval_steps – (int, optional) Evaluation interval in terms of steps.

  • save_steps – (int, optional) Model saving interval in terms of steps.

  • batch_size – (int, optional) Batch size for training and evaluation.

  • lr – (float, optional) Learning rate.

  • epochs – (int, optional) Number of training epochs.

  • model_save_dir – (str, default=".") Directory where the trained model will be saved.

  • weighted_loss – (bool, default=False) If true, uses weighted loss based on class frequencies.

  • weight_list – (list, optional) Weights for each class in the loss function.

  • wandb_log – (bool, default=False) If true, logs metrics to Weights & Biases.

  • wandb_name – (str, default="topic") Name of the Weights & Biases project.

  • print_test_mistakes – (bool, default=False) If true, prints the misclassified samples in the test dataset.

Returns:

  • best_model_path (str): Path to the directory of the best saved model.

  • best_metric (float): The best metric value achieved during training.

  • label_map (dict): Mapping of labels to their respective integer values.

Note

Either the data parameter or all of train_data, val_data, and test_data should be provided. If only data is provided, it will be split into train, validation, and test sets.
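When only data is given, it has to be divided into three disjoint sets. The sketch below shows one common way to do such a split: shuffle once with a fixed seed, then slice. The 80/10/10 proportions, the seed and the function name are assumptions for illustration; the source does not state which proportions the library uses.

```python
import random


def split_data(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the rows once, then slice them into train, val and test."""
    shuffled = list(rows)  # copy, so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


examples = list(range(100))
train, val, test = split_data(examples)
print(len(train), len(val), len(test))
```

Fixing the seed makes the split reproducible, which matters when comparing runs that should be evaluated on the same test rows.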

Model Classes

class linktransformer.modelling.LinkTransformer.LinkTransformer(model_name_or_path=None, modules=None, device=None, cache_folder=None, use_auth_token=None, opt_model_description=None, opt_model_lang=None)

A wrapper around the SentenceTransformer class, modified for LinkTransformer models.

save(path, model_name=None, create_model_card=True, train_datasets=None, override_model_description=None, override_model_lang=None)

Saves all elements for this seq. sentence embedder into different sub-folders.

Parameters:
  • path (str) – Path on disc

  • model_name (Optional[str]) – Optional model name

  • create_model_card (bool) – If True, create a README.md with basic information about this model

  • train_datasets (Optional[List[str]]) – Optional list with the names of the datasets used to train the model

save_to_hub(repo_name, organization=None, private=None, commit_message='Add new LinkTransformer model.', local_model_path=None, exist_ok=False, replace_model_card=False, train_datasets=None, override_model_description=None, override_model_lang=None)

Uploads all elements of this LinkTransformer (inherited Sentence Transformer) to a new HuggingFace Hub repository.

Parameters:
  • repo_name (str) – Repository name for your model in the Hub.

  • organization (Optional[str]) – Organization in which you want to push your model or tokenizer (you must be a member of this organization).

  • private (Optional[bool]) – Set to true to host a private model.

  • commit_message (str) – Message to commit while pushing.

  • local_model_path (Optional[str]) – Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded

  • exist_ok (bool) – If true, saving to an existing repository is OK. If false, saving only to a new repository is possible

  • replace_model_card (bool) – If true, replace an existing model card in the hub with the automatically created model card

  • train_datasets (Optional[List[str]]) – Datasets used to train the model. If set, the datasets will be added to the model card in the Hub.

Returns:

The url of the commit of your model in the given repository.

class linktransformer.modelling.LinkTransformerClassifier.LinkTransformerClassifier(model_name_or_path, opt_model_description=None, opt_model_lang=None, label_map=None, model_card_text=None)

Modified Sequence Classification model and tokenizer to implement model card generation and save to hub functions

save(save_directory, model_name=None, override_model_description=None, override_model_lang=None, train_datasets=None)

Saves the model and tokenizer to the specified directory.

save_to_hub(repo_name, organization=None, private=None, commit_message='Add new LinkTransformer model.', local_model_path=None, exist_ok=False, override_model_description=None, override_model_lang=None, train_datasets=None)

Uploads all elements of this LinkTransformer (for classification) to a new HuggingFace Hub repository.

Parameters:
  • repo_name (str) – Repository name for your model in the Hub.

  • organization (Optional[str]) – Organization in which you want to push your model or tokenizer (you must be a member of this organization).

  • private (Optional[bool]) – Set to true to host a private model.

  • commit_message (str) – Message to commit while pushing.

  • local_model_path (Optional[str]) – Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded

  • exist_ok (bool) – If true, saving to an existing repository is OK. If false, saving only to a new repository is possible

  • replace_model_card – If true, replace an existing model card in the hub with the automatically created model card

  • train_datasets (Optional[List[str]]) – Datasets used to train the model. If set, the datasets will be added to the model card in the Hub.

Returns:

The url of the commit of your model in the given repository.