linktransformer
Linkage and Classification Inference
- linktransformer.infer.aggregate_rows(df, ref_df, model, left_on, right_on, openai_key=None)
Aggregate the dataframe based on a reference dataframe using a language model.
- Parameters:
df (DataFrame) – Dataframe to aggregate.
ref_df (DataFrame) – Reference dataframe to aggregate on.
model (str) – Language model to use.
left_on (Union[str, List[str]]) – Column(s) to aggregate on in df.
right_on (Union[str, List[str]]) – Reference column(s) to aggregate on in ref_df.
- Return type:
DataFrame
- Returns:
DataFrame: The aggregated dataframe.
- linktransformer.infer.all_pair_combos_evaluate(df, model, left_on, right_on, openai_key=None)
Get similarity scores for every pair of rows in a dataframe. This is made efficient by embedding each string only once, computing all possible pairwise distances, and adding the expanded rows and their scores to the dataframe.
- Parameters:
df (DataFrame) – Dataframe to evaluate.
model (str) – Language model to use.
left_on (Union[str, List[str]]) – Column(s) to evaluate on in df.
right_on (Union[str, List[str]]) – Reference column(s) to evaluate on in df.
openai_key (str) – OpenAI API key.
- Returns:
DataFrame: The evaluated dataframe.
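The "embed once, score every pair" idea can be sketched in plain Python. This is only an illustration of the technique, not the library's implementation: `toy_embed` is a hypothetical stand-in for a language model (it counts character bigrams), and `all_pair_scores` is a name invented for this example.

```python
from collections import Counter
from itertools import product
from math import sqrt


def toy_embed(text):
    """Hypothetical stand-in for a language model: character-bigram counts."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def all_pair_scores(left, right):
    """Score every (left, right) pair, embedding each distinct string only once."""
    cache = {s: toy_embed(s) for s in set(left) | set(right)}
    return [(l, r, cosine(cache[l], cache[r])) for l, r in product(left, right)]


pairs = all_pair_scores(["Acme Corp", "Globex"], ["ACME Corp.", "Globex Inc"])
for left_text, right_text, score in pairs:
    print(f"{left_text!r} vs {right_text!r}: {score:.3f}")
```

The cache is what keeps the cost of embedding linear in the number of distinct strings, even though the number of scored pairs grows with the product of the two column lengths.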
- linktransformer.infer.classify_rows(df, on=None, model=None, num_labels=2, label_map=None, use_gpu=False, batch_size=128, openai_key=None, openai_topic=None, openai_prompt=None, openai_params={})
Classify the texts in one or more columns by whether they are relevant to a certain topic. The function uses either a trained classifier to make predictions or an OpenAI API key to send requests and retrieve classification results from the ChatCompletion endpoint. The function returns a copy of the input dataframe with a new column “clf_preds_{on}” that stores the classification results.
- Parameters:
df (DataFrame) – The dataframe.
on (Union[str, List[str], None]) – Column(s) to classify (if multiple columns are passed in, they will be joined).
model (Optional[str]) – Filepath to the model to use (to use OpenAI, see “https://platform.openai.com/docs/models”).
num_labels (int) – Number of labels to predict. Defaults to 2.
label_map (Optional[dict]) – A dictionary that maps text labels to numeric labels. Used for OpenAI predictions.
use_gpu (bool) – Whether to use GPU. Not supported yet. Defaults to False.
batch_size (int) – Batch size for inferencing embeddings. Defaults to 128.
openai_key (Optional[str]) – OpenAI API key. Defaults to None.
openai_topic (Optional[str]) – The topic for which to predict whether the text is relevant. Defaults to None.
openai_prompt (Optional[str]) – Custom system prompt for the OpenAI ChatCompletion endpoint. Defaults to None.
openai_params (Optional[dict]) – Custom parameters for the OpenAI ChatCompletion endpoint. Defaults to {}.
- Returns:
DataFrame: The dataframe with a new column “clf_preds_{on}” that stores the classification results.
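Two details in the description above, joining several columns into one text and mapping text labels to numeric labels, can be pictured with a small sketch. Everything here is an assumption made for illustration: the space separator, the helper names `join_columns` and `to_numeric_label`, and the example `label_map` are not taken from the library.

```python
def join_columns(row, on, sep=" "):
    """Concatenate the selected columns of one row into a single text.

    The separator is an assumption; the docs only say the columns are joined.
    """
    cols = [on] if isinstance(on, str) else list(on)
    return sep.join(str(row[c]) for c in cols)


def to_numeric_label(text_label, label_map):
    """Map a text label (for example a model's 'Yes'/'No' answer) to a number."""
    return label_map[text_label.strip()]


row = {"title": "Fed raises rates", "body": "Markets react."}
label_map = {"No": 0, "Yes": 1}  # hypothetical mapping

text = join_columns(row, ["title", "body"])
pred = to_numeric_label(" Yes ", label_map)
print(text, "->", pred)
```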
- linktransformer.infer.cluster_rows(df, model, on, cluster_type='SLINK', cluster_params={'metric': 'cosine', 'min cluster size': 2, 'threshold': 0.5}, openai_key=None)
Deduplicate a dataframe based on a similarity threshold. Various clustering options are supported, each with its own parameters:
"agglomerative": {"threshold": 0.5, "clustering linkage": "ward", "metric": "euclidean"} (you can choose a default linkage method and a default metric)
"HDBScan": {"min cluster size": 5, "min samples": 1}
"SLINK": {"min cluster size": 2, "threshold": 0.1}
- Parameters:
df (DataFrame) – Dataframe to deduplicate.
model (str) – Language model to use.
on (Union[str, List[str]]) – Column(s) to deduplicate on.
cluster_type (str) – Clustering method to use. Defaults to "SLINK".
cluster_params (Dict[str, Any]) – Parameters for clustering method. Defaults to {"threshold": 0.5, "min cluster size": 2, "metric": "cosine"}.
openai_key (str) – OpenAI API key.
- Return type:
DataFrame
- Returns:
DataFrame: The deduplicated dataframe.
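As a rough picture of threshold-based single-linkage clustering (the kind of clustering SLINK computes), the sketch below links any two rows whose distance is at or below a threshold and returns the connected components as clusters. It uses a simple union-find rather than the SLINK algorithm itself, takes a precomputed distance matrix, ignores "min cluster size", and treats "threshold" as a distance cutoff; all of these are assumptions for illustration, not a description of the library's internals.

```python
def threshold_clusters(dist, threshold=0.5):
    """Return a cluster label per row from a symmetric distance matrix."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of rows that is close enough.
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(j)] = find(i)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.4, 0.9],
    [0.9, 0.4, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
cluster_labels = threshold_clusters(dist, threshold=0.5)
print(cluster_labels)
```

Rows 0 and 2 end up in the same cluster even though they are 0.9 apart, because both are close to row 1. This chaining behaviour is characteristic of single linkage and is worth keeping in mind when choosing a threshold.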
- linktransformer.infer.dedup_rows(df, model, on, cluster_type='SLINK', cluster_params={'metric': 'cosine', 'min cluster size': 2, 'threshold': 0.5}, openai_key=None)
Deduplicate a dataframe based on a similarity threshold. This is just clustering and keeping the first row in each cluster. Refer to the docs for the cluster_rows function for more details.
- Parameters:
df (DataFrame) – Dataframe to deduplicate.
model (str) – Language model to use.
on (Union[str, List[str]]) – Column(s) to deduplicate on.
cluster_type (str) – Clustering method to use. Defaults to "SLINK".
cluster_params (Dict[str, Any]) – Parameters for clustering method. Defaults to {"threshold": 0.5, "min cluster size": 2, "metric": "cosine"}.
openai_key (str) – OpenAI API key.
- Return type:
DataFrame
- Returns:
DataFrame: The deduplicated dataframe.
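The "keep the first row in each cluster" step described above can be sketched as follows, assuming cluster labels have already been computed by some clustering method. The helper name `keep_first_per_cluster` and the sample rows are hypothetical.

```python
def keep_first_per_cluster(rows, labels):
    """Keep only the first row seen for each cluster label, preserving order."""
    seen = set()
    kept = []
    for row, label in zip(rows, labels):
        if label not in seen:
            seen.add(label)
            kept.append(row)
    return kept


rows = ["Acme Corp", "ACME Corp.", "Globex Inc", "Acme Corporation"]
labels = [0, 0, 1, 0]  # assumed output of a clustering step
deduped = keep_first_per_cluster(rows, labels)
print(deduped)
```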
- linktransformer.infer.evaluate_pairs(df, model, left_on, right_on, openai_key=None)
This function evaluates paired columns in a dataframe and gives a match score (cosine similarity). Typically, this can be thought of as a way to evaluate already-merged dataframes.
- Parameters:
df (DataFrame) – Dataframe to evaluate.
model (str) – Language model to use.
left_on (Union[str, List[str]]) – Column(s) to evaluate on in df.
right_on (Union[str, List[str]]) – Reference column(s) to evaluate on in df.
- Returns:
DataFrame: The evaluated dataframe.
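The match score mentioned above is a cosine similarity computed row by row. A minimal sketch, assuming the embeddings for the left and right columns are already available as plain lists of floats (the vectors below are made up for illustration):

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# One embedding per row for each side of an already-merged dataframe.
left_vecs = [[1.0, 0.0], [1.0, 1.0]]
right_vecs = [[2.0, 0.0], [1.0, -1.0]]

scores = [cosine_similarity(u, v) for u, v in zip(left_vecs, right_vecs)]
print(scores)
```

Unlike the all-pairs case, only aligned rows are compared here, so the number of scores equals the number of rows.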
- linktransformer.infer.merge(df1, df2, merge_type='1:1', on=None, model='all-MiniLM-L6-v2', left_on=None, right_on=None, suffixes=('_x', '_y'), use_gpu=False, batch_size=128, openai_key=None)
Merge two dataframes using language model embeddings.
- Parameters:
df1 (DataFrame) – First dataframe (left).
df2 (DataFrame) – Second dataframe (right).
merge_type (str) – Type of merge to perform (1:m or m:1 or 1:1).
model (str) – Language model to use.
on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.
left_on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.
right_on (Union[str, List[str]], optional) – Column(s) to join on in df2. Defaults to None.
suffixes (Tuple[str, str]) – Suffixes to use for overlapping columns. Defaults to ('_x', '_y').
use_gpu (bool) – Whether to use GPU. Not supported yet. Defaults to False.
batch_size (int) – Batch size for inferencing embeddings. Defaults to 128.
openai_key (str, optional) – OpenAI API key. Defaults to None.
- Return type:
DataFrame
- Returns:
DataFrame: The merged dataframe.
- linktransformer.infer.merge_blocking(df1, df2, merge_type='1:1', on=None, model='all-MiniLM-L6-v2', left_on=None, right_on=None, blocking_vars=None, suffixes=('_x', '_y'), use_gpu=False, batch_size=128, openai_key=None)
Merge two dataframes using language model embeddings with optional blocking.
- Parameters:
df1 (DataFrame) – First dataframe (left).
df2 (DataFrame) – Second dataframe (right).
merge_type (str) – Type of merge to perform (1:m or m:1 or 1:1).
model (str) – Language model to use.
on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.
left_on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.
right_on (Union[str, List[str]], optional) – Column(s) to join on in df2. Defaults to None.
blocking_vars (List[str], optional) – Columns to use for blocking. Defaults to None.
suffixes (Tuple[str, str]) – Suffixes to use for overlapping columns. Defaults to ('_x', '_y').
use_gpu (bool) – Whether to use GPU. Not supported yet. Defaults to False.
batch_size (int) – Batch size for inferencing embeddings. Defaults to 128.
openai_key (str, optional) – OpenAI API key. Defaults to None.
- Return type:
DataFrame
- Returns:
DataFrame: The merged dataframe.
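Blocking, in the record-linkage sense, means comparing only rows that agree exactly on one or more blocking variables, and matching by similarity within each block. The sketch below illustrates that idea with a single blocking column. It is not the library's implementation: `similarity` is a hypothetical stand-in based on `difflib` string overlap rather than language model embeddings, and the function and column names are invented for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Hypothetical stand-in for embedding similarity: string overlap ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_with_blocking(left, right, on, block_var):
    """Match each left row to its most similar right row within the same block."""
    by_block = defaultdict(list)
    for r in right:
        by_block[r[block_var]].append(r)

    matches = []
    for l in left:
        candidates = by_block.get(l[block_var], [])
        if candidates:  # rows with an empty block stay unmatched
            best = max(candidates, key=lambda r: similarity(l[on], r[on]))
            matches.append((l[on], best[on]))
    return matches


left = [
    {"name": "Acme Corp", "state": "NY"},
    {"name": "Acme Corp", "state": "CA"},
]
right = [
    {"name": "ACME Corporation", "state": "CA"},
    {"name": "Acme Co", "state": "NY"},
    {"name": "Zenith", "state": "NY"},
]
matches = merge_with_blocking(left, right, on="name", block_var="state")
print(matches)
```

Besides preventing matches across blocks, blocking reduces the number of comparisons, since each left row is scored only against the right rows in its own block.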
- linktransformer.infer.merge_knn(df1, df2, merge_type='1:1', on=None, model='all-MiniLM-L6-v2', left_on=None, right_on=None, k=1, suffixes=('_x', '_y'), use_gpu=False, batch_size=128, openai_key=None, drop_sim_threshold=None)
Merge two dataframes using language model embeddings. This function supports k-nearest-neighbor matching for each row in df1. Merge is a special case of this function when k=1.
- Parameters:
df1 (DataFrame) – First dataframe (left).
df2 (DataFrame) – Second dataframe (right).
on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.
model (str) – Language model to use.
left_on (Union[str, List[str]], optional) – Column(s) to join on in df1. Defaults to None.
right_on (Union[str, List[str]], optional) – Column(s) to join on in df2. Defaults to None.
k (int) – Number of nearest neighbors to match for each row in df1. Defaults to 1.
suffixes (Tuple[str, str]) – Suffixes to use for overlapping columns. Defaults to ('_x', '_y').
use_gpu (bool) – Whether to use GPU. Not supported yet. Defaults to False.
batch_size (int) – Batch size for inferencing embeddings. Defaults to 128.
openai_key (str, optional) – OpenAI API key. Defaults to None.
- Return type:
DataFrame
- Returns:
DataFrame: The merged dataframe.
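To illustrate what k-nearest-neighbor matching means here, the sketch below picks, for each left row, the k right rows with the highest similarity. It assumes a precomputed similarity matrix (rows for df1, columns for df2) with made-up values, and `top_k_per_row` is a hypothetical helper, not part of the library.

```python
import heapq


def top_k_per_row(sim, k=1):
    """For each row of a similarity matrix, return the k best column indices,
    ordered from most to least similar."""
    return [
        heapq.nlargest(k, range(len(row)), key=row.__getitem__)
        for row in sim
    ]


# sim[i][j] is the similarity between left row i and right row j.
sim = [
    [0.9, 0.2, 0.7],
    [0.1, 0.8, 0.3],
]
nearest_two = top_k_per_row(sim, k=2)
nearest_one = top_k_per_row(sim, k=1)
print(nearest_two, nearest_one)
```

With k=1 each left row keeps only its single best match, which is the special case the description above equates with a plain merge.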
Linkage Model Training
- linktransformer.train_model.create_new_train_config(base_config_path='/home/docs/checkouts/readthedocs.org/user_builds/linktransformer/envs/latest/lib/python3.8/site-packages/linktransformer/configs/linkage.json', config_save_path='myconfig.json', model_save_dir=None, model_save_name=None, train_batch_size=None, num_epochs=None, warm_up_perc=None, learning_rate=None, val_perc=None, wandb_names=None, add_pooling_layer=None, opt_model_description=None, opt_model_lang=None, test_at_end=None, save_val_test_pickles=None, val_query_prop=None)
Create a training config.
- Parameters:
config_save_path (str) – Path to save the config.
base_config_path (str) – Path to the base config.
model_save_dir (str) – Path to save the model.
model_save_name (str) – Name of the model.
train_batch_size (int) – Batch size for training.
num_epochs (int) – Number of epochs.
warm_up_perc (float) – Percentage of warmup steps.
learning_rate (float) – Learning rate.
val_perc (float) – Percentage of validation data.
wandb_names (dict) – Dictionary of wandb names.
add_pooling_layer (bool) – Whether to add pooling layer.
language (str) – Language of the model.
- Returns:
Path to the saved config.
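One plausible reading of this function is that it loads the base config, overrides the fields whose arguments are not None, and writes the result to the save path. The sketch below shows that pattern with the standard `json` module. The flat key layout, the helper name `make_config`, and the sample values are assumptions; the actual structure of linkage.json is not described here.

```python
import json
import os
import tempfile


def make_config(base_config_path, config_save_path, **overrides):
    """Copy a base JSON config, replacing only the fields given as non-None."""
    with open(base_config_path) as f:
        config = json.load(f)
    config.update({k: v for k, v in overrides.items() if v is not None})
    with open(config_save_path, "w") as f:
        json.dump(config, f, indent=2)
    return config_save_path


with tempfile.TemporaryDirectory() as tmp:
    base_path = os.path.join(tmp, "base.json")
    with open(base_path, "w") as f:
        json.dump({"num_epochs": 10, "learning_rate": 2e-5}, f)  # made-up base values

    saved_path = make_config(
        base_path,
        os.path.join(tmp, "myconfig.json"),
        num_epochs=3,        # overridden
        learning_rate=None,  # left at the base value
    )
    with open(saved_path) as f:
        new_config = json.load(f)

print(new_config)
```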
- linktransformer.train_model.train_model(data=None, train_data=None, val_data=None, test_data=None, model_path='sentence-transformers/paraphrase-xlm-r-multilingual-v1', left_col_names=None, right_col_names=None, left_id_name=None, right_id_name=None, label_col_name=None, clus_id_col_name=None, clus_text_col_names=None, config_path='/home/docs/checkouts/readthedocs.org/user_builds/linktransformer/envs/latest/lib/python3.8/site-packages/linktransformer/configs/linkage.json', training_args={'num_epochs': 10}, log_wandb=False)
Train the LinkTransformer model.
- Parameters:
model_path (str) – The name of the model to use.
data (str) – Path to the dataset in Excel or CSV format, or a dataframe object.
left_col_names (List[str]) – List of column names to use as left side data.
right_col_names (List[str]) – List of column names to use as right side data.
left_id_name (List[str]) – List of column names to use as identifiers for the left data.
right_id_name (List[str]) – List of column names to use as identifiers for the right data.
label_col_name (str) – Name of the column to use as labels. Specify this if you have data of the form (left, right, label). This type supports both positive and negative examples.
clus_id_col_name (str) – Name of the column to use as cluster ids. Specify this if you have data of the form (text, cluster_id).
clus_text_col_names (str) – Name of the column to use as cluster text. Specify this if you have data of the form (text, cluster_id).
config_path (str) – Path to the JSON configuration file.
training_args (dict) – Dictionary of training arguments to override the config.
log_wandb (bool) – Whether to log the training run on wandb.
- Return type:
str
- Returns:
The path to the saved best model.
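The two data layouts mentioned above, (left, right, label) and (text, cluster_id), are related: texts that share a cluster id can be read as positive pairs. The sketch below shows one way to derive such pairs from cluster-form data. It illustrates the relationship between the formats only; how the library consumes cluster data internally is not stated here, and `positive_pairs` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import combinations


def positive_pairs(rows):
    """Turn (text, cluster_id) rows into all within-cluster text pairs."""
    clusters = defaultdict(list)
    for text, cluster_id in rows:
        clusters[cluster_id].append(text)

    pairs = []
    for texts in clusters.values():
        pairs.extend(combinations(texts, 2))
    return pairs


rows = [
    ("Acme Corp", 1),
    ("ACME Corporation", 1),
    ("Acme Co", 1),
    ("Globex", 2),
]
pairs = positive_pairs(rows)
print(pairs)
```

A cluster of n texts yields n(n-1)/2 pairs, and singleton clusters such as "Globex" contribute none.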
Classification Model Training
- linktransformer.train_clf_model.train_clf_model(data=None, model='distilroberta-base', on=[], label_col_name='label', train_data=None, val_data=None, test_data=None, data_dir='.', training_args={}, config='/home/docs/checkouts/readthedocs.org/user_builds/linktransformer/envs/latest/lib/python3.8/site-packages/linktransformer/configs/classification.json', eval_steps=None, save_steps=None, batch_size=None, lr=None, epochs=None, model_save_dir='.', weighted_loss=False, weight_list=None, wandb_log=False, wandb_name='topic', print_test_mistakes=False)
Trains a text classification model using Hugging Face’s Transformers library.
- Parameters:
data – (str/DataFrame, optional) Path to the CSV file or a DataFrame object containing the training data.
model – (str, default="distilroberta-base") The name of the Hugging Face model to be used.
on – (list, default=[]) List of column names that are used as input features.
label_col_name – (str, default="label") The column name in the data that contains the labels.
train_data – (str/DataFrame, optional) Training dataset if data is not provided.
val_data – (str/DataFrame, optional) Validation dataset if data is not provided.
test_data – (str/DataFrame, optional) Test dataset if data is not provided.
data_dir – (str, default=".") Directory where training data splits are saved.
training_args – (dict, default={}) Training arguments for the Hugging Face Trainer.
config – (str, default=CLF_CONFIG_PATH) Path to the default config file.
eval_steps – (int, optional) Evaluation interval in terms of steps.
save_steps – (int, optional) Model saving interval in terms of steps.
batch_size – (int, optional) Batch size for training and evaluation.
lr – (float, optional) Learning rate.
epochs – (int, optional) Number of training epochs.
model_save_dir – (str, default=".") Directory where the trained model will be saved.
weighted_loss – (bool, default=False) If true, uses weighted loss based on class frequencies.
weight_list – (list, optional) Weights for each class in the loss function.
wandb_log – (bool, default=False) If true, logs metrics to Weights & Biases.
wandb_name – (str, default="topic") Name of the Weights & Biases project.
print_test_mistakes – (bool, default=False) If true, prints the misclassified samples in the test dataset.
- Returns:
best_model_path (str): Path to the directory of the best saved model.
best_metric (float): The best metric value achieved during training.
label_map (dict): Mapping of labels to their respective integer values.
Note
Either the data parameter or all of train_data, val_data, and test_data should be provided. If only data is provided, it will be split into train, validation, and test sets.
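The note above says a single dataset is split into train, validation, and test sets, and the return value includes a label map. A minimal sketch of both steps follows. The 80/10/10 proportions, the fixed seed, the alphabetical label ordering, and the helper names are all assumptions for illustration; the library's actual split sizes and label ordering are not stated here.

```python
import random


def split_data(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle rows reproducibly and cut them into train/val/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * val_frac)
    n_test = int(len(rows) * test_frac)
    val = rows[:n_val]
    test = rows[n_val:n_val + n_test]
    train = rows[n_val + n_test:]
    return train, val, test


def build_label_map(labels):
    """Assign each distinct label an integer, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


train, val, test = split_data(range(20))
label_map = build_label_map(["spam", "ham", "spam"])
print(len(train), len(val), len(test), label_map)
```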
Model Classes
- class linktransformer.modelling.LinkTransformer.LinkTransformer(model_name_or_path=None, modules=None, device=None, cache_folder=None, use_auth_token=None, opt_model_description=None, opt_model_lang=None)
A wrapper around the SentenceTransformer class, modified for LinkTransformer models.
- save(path, model_name=None, create_model_card=True, train_datasets=None, override_model_description=None, override_model_lang=None)
Saves all elements for this seq. sentence embedder into different sub-folders.
- Parameters:
path (str) – Path on disk.
model_name (Optional[str]) – Optional model name.
create_model_card (bool) – If True, create a README.md with basic information about this model.
train_datasets (Optional[List[str]]) – Optional list with the names of the datasets used to train the model.
- save_to_hub(repo_name, organization=None, private=None, commit_message='Add new LinkTransformer model.', local_model_path=None, exist_ok=False, replace_model_card=False, train_datasets=None, override_model_description=None, override_model_lang=None)
Uploads all elements of this LinkTransformer (inherited Sentence Transformer) to a new HuggingFace Hub repository.
- Parameters:
repo_name (str) – Repository name for your model in the Hub.
organization (Optional[str]) – Organization in which you want to push your model or tokenizer (you must be a member of this organization).
private (Optional[bool]) – Set to true to host a private model.
commit_message (str) – Message to commit while pushing.
local_model_path (Optional[str]) – Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded.
exist_ok (bool) – If true, saving to an existing repository is OK. If false, saving only to a new repository is possible.
replace_model_card (bool) – If true, replace an existing model card in the hub with the automatically created model card.
train_datasets (Optional[List[str]]) – Datasets used to train the model. If set, the datasets will be added to the model card in the Hub.
- Returns:
The url of the commit of your model in the given repository.
- class linktransformer.modelling.LinkTransformerClassifier.LinkTransformerClassifier(model_name_or_path, opt_model_description=None, opt_model_lang=None, label_map=None, model_card_text=None)
Modified sequence classification model and tokenizer that implement model card generation and save-to-hub functions.
- save(save_directory, model_name=None, override_model_description=None, override_model_lang=None, train_datasets=None)
Saves the model and tokenizer to the specified directory.
- save_to_hub(repo_name, organization=None, private=None, commit_message='Add new LinkTransformer model.', local_model_path=None, exist_ok=False, override_model_description=None, override_model_lang=None, train_datasets=None)
Uploads all elements of this LinkTransformer (for classification) to a new HuggingFace Hub repository.
- Parameters:
repo_name (str) – Repository name for your model in the Hub.
organization (Optional[str]) – Organization in which you want to push your model or tokenizer (you must be a member of this organization).
private (Optional[bool]) – Set to true to host a private model.
commit_message (str) – Message to commit while pushing.
local_model_path (Optional[str]) – Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded.
exist_ok (bool) – If true, saving to an existing repository is OK. If false, saving only to a new repository is possible.
replace_model_card – If true, replace an existing model card in the hub with the automatically created model card.
train_datasets (Optional[List[str]]) – Datasets used to train the model. If set, the datasets will be added to the model card in the Hub.
- Returns:
The url of the commit of your model in the given repository.