minicons.scorer module¶
- class minicons.scorer.LMScorer(model_name: str, device: Optional[str] = 'cpu')¶
Bases: object
Base LM scorer class intended to store models and tokenizers along with methods to facilitate the analysis of language model output scores.
- add_special_tokens(text: Union[str, List[str]]) Union[str, List[str]] ¶
- distribution(batch: Iterable) torch.Tensor ¶
- topk(distribution: torch.Tensor, k: int = 1) Tuple ¶
- query(distribution: torch.Tensor, queries: List[str]) Tuple ¶
- logprobs(batch: Iterable, rank: bool = False) Union[float, List[float]] ¶
- compute_stats(batch: Iterable, rank: bool = False) Union[float, int, List[Union[float, int]]] ¶
- prepare_text(text: Union[str, List[str]]) Union[str, List[str]] ¶
- prime_text(preamble: Union[str, List[str]], stimuli: Union[str, List[str]]) Tuple ¶
- token_score(batch: Union[str, List[str]], surprisal: bool = False, prob: bool = False, base_two: bool = False, rank: bool = False) Union[List[Tuple[str, float]], List[Tuple[str, float, int]]] ¶
- For every input sentence, returns a list of tuples in the following format:
(token, score),
where score represents the log-probability (by default) of the token given context. Can also return ranks along with scores.
- Parameters
batch (Union[str, List[str]]) – a single sentence or a batch of sentences.
surprisal (bool) – If True, returns per-word surprisals instead of log-probabilities.
prob (bool) – If True, returns per-word probabilities instead of log-probabilities.
base_two (bool) – If True, uses log base 2 instead of the natural log (surprisals are then reported in bits).
rank (bool) – If True, also returns the rank of each word in context (based on the log-probability value)
- Returns
A List containing a Tuple consisting of the word, its associated score, and optionally, its rank.
- Return type
Union[List[Tuple[str, float]], List[Tuple[str, float, int]]]
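The relationship between the score variants described above (log-probability, surprisal, probability, base 2) can be illustrated with a minimal pure-Python sketch. The `convert` helper and the numbers are hypothetical and only mirror the flag names of `token_score`; this is not the library's implementation.

```python
import math

# Hypothetical per-token natural-log probabilities for one sentence
# (illustrative numbers, not produced by any model).
token_logprobs = [("The", -2.0), ("cat", -4.5), ("sat", -3.0)]


def convert(logprob, surprisal=False, prob=False, base_two=False):
    """Convert a natural-log probability into the requested kind of score."""
    if prob:
        # Probability: exponentiate the natural-log probability.
        return math.exp(logprob)
    # Change of base: log2(p) = ln(p) / ln(2).
    score = logprob / math.log(2) if base_two else logprob
    # Surprisal is the negative log-probability.
    return -score if surprisal else score


# Surprisal in bits and plain probabilities for every token.
bits = [(tok, convert(lp, surprisal=True, base_two=True)) for tok, lp in token_logprobs]
probs = [(tok, convert(lp, prob=True)) for tok, lp in token_logprobs]
```

Each result keeps the (token, score) tuple shape that `token_score` documents, so the variants differ only in how the score is expressed.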
- score(batch: Union[str, List[str]], pool: Callable = <built-in method mean of type object>, *args) Union[float, List[float]] ¶
DEPRECATED as of v0.1.18. Check out sequence_score or token_score instead!
Pooled estimates of sentence log probabilities, computed by the language model. Pooling is usually done using a function that is passed to the method.
- Parameters
batch (Union[str, List[str]]) – a list of sentences that will be passed to the language model to score.
pool (Callable) – Pooling function; torch.mean() by default.
- Returns
Float or list of floats specifying the log probabilities of the input sentence(s).
- Return type
Union[float, List[float]]
- adapt_score(preamble: Union[str, List[str]], stimuli: Union[str, List[str]], pool: Callable = <built-in method mean of type object>, *args) None ¶
DEPRECATED as of v0.1.18. Check out partial_score instead!
- partial_score(preamble: Union[str, List[str]], stimuli: Union[str, List[str]], reduction: Callable = <function LMScorer.<lambda>>, **kwargs) List[float] ¶
Pooled estimates of sequence log probabilities (or some modification of it), given a preamble. Pooling is usually done using a function that is passed to the method.
- Parameters
preamble (Union[str, List[str]]) – a batch of preambles or primes passed to the language model. This is what the sequence is conditioned on; the token probabilities of this part of the input are ignored when estimating the overall score.
stimuli (Union[str, List[str]]) – a batch of sequences (same length as preamble) that form the main input, i.e., the sequences whose score you want to calculate.
reduction (Callable) – Reduction function; lambda x: x.mean(0).item() by default, which gives the average log-probability per token for each sequence in the batch.
kwargs – parameters for the compute_stats call:
prob (bool): whether the returned value should be a probability (note that the default reduction method will have to be changed to lambda x: x.prod(0).item() to get a meaningful return value)
base_two (bool): whether the returned value should be in base 2 (only works when prob = False)
surprisal (bool): whether the returned value should be a surprisal (does not work when prob = True)
- Returns
List of floats specifying the desired score for the stimuli part of the input, e.g., P(stimuli | preamble).
- Return type
List[float]
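The note on `prob` above says the reduction has to change from a mean to a product when probabilities are requested. A minimal sketch of why, using plain Python lists in place of tensors; the helper names and scores are hypothetical, not part of minicons.

```python
import math

# Hypothetical natural-log probabilities of the stimulus tokens only;
# preamble tokens are assumed to have been dropped already.
stimulus_logprobs = [-1.2, -0.7, -2.1]


def mean_logprob(scores):
    """Average log-probability per token (analogue of the default reduction)."""
    return sum(scores) / len(scores)


def joint_prob(probs):
    """Product of per-token probabilities (analogue of x.prod(0).item())."""
    return math.prod(probs)


avg = mean_logprob(stimulus_logprobs)
p = joint_prob([math.exp(lp) for lp in stimulus_logprobs])

# The product of probabilities equals exp of the summed log-probabilities,
# i.e. an estimate of P(stimuli | preamble) under the chain rule.
assert math.isclose(p, math.exp(sum(stimulus_logprobs)))
```

Averaging probabilities instead would not correspond to a sequence probability, which is why the documentation pairs `prob=True` with a product.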
- encode(text: Union[str, List[str]], manual_special: bool = True, return_tensors: Optional[str] = 'pt') Dict ¶
Encode a batch of sentences using the model’s tokenizer. Equivalent to calling model.tokenizer(input).
- Parameters
text (Union[str, List[str]]) – Input batch/sentence to be encoded.
manual_special (bool) – Specification of whether special tokens will be manually encoded.
return_tensors (Optional[str]) – returned tensor format. Default ‘pt’.
- Returns
Encoded batch
- Return type
Dict
- decode(idx: List[int])¶
Decode input ids using the model’s tokenizer.
- Parameters
idx (List[int]) – List of ids.
- Returns
Decoded strings
- Return type
List[str]
- class minicons.scorer.MaskedLMScorer(model_name: str, device: Optional[str] = 'cpu')¶
Bases: minicons.scorer.LMScorer
Class for Masked Language Models such as BERT, RoBERTa, etc.
- Parameters
model_name (str) – name of the model; should be either a path to a locally stored model (.pt or .bin file) or the name of a pretrained model on the Huggingface Model Hub.
device (str, optional) – device type that the model should be loaded on, options: cpu or cuda:{0, 1, …}
- add_special_tokens(text: Union[str, List[str]]) List[str] ¶
Reformats input text to add special model-dependent tokens.
- Parameters
text (Union[str, List[str]]) – single string or batch of strings to be modified.
- Returns
Modified input, containing special tokens as per tokenizer specification
- Return type
List[str]
- mask(sentence_words: Union[Tuple[str, str], List[Tuple[str, str]]]) Tuple[str, str, int] ¶
Processes a list of (sentence, word) pairs into input in which the word is masked out of the sentence.
Note: only works for masked LMs.
- Parameters
sentence_words (Union[Tuple[str, str], List[Tuple[str, str]]]) – Input consisting of [(sentence, word)], where sentence is an input sentence, and word is a word present in the sentence that will be masked out.
- Returns
Tuple (sentence, word, length)
- cloze(sentence_words: Union[Tuple[str, str], List[Tuple[str, str]]]) torch.Tensor ¶
Runs inference on masked input. Note: only works for masked LMs.
- Parameters
sentence_words (Union[Tuple[str, str], List[Tuple[str, str]]]) – Input consisting of [(sentence, word)], where sentence is an input sentence, and word is a word present in the sentence that will be masked out and inferred.
- Returns
A tensor with log probabilities for the desired word in context
- prepare_text(text: Union[str, List[str]]) Iterable[Any] ¶
Prepares a batch of input text into a format fit to run MLM scoring on.
Borrows the preprocessing algorithm from Salazar et al. (2020), and modifies code from the following GitHub repository by simonepri: https://github.com/simonepri/lm-scorer
- Parameters
text – batch of sentences to be prepared for scoring.
- Returns
Batch of formatted input that can be passed to logprobs
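The preprocessing idea credited to Salazar et al. (2020) is that each token is masked in turn and scored from the remaining context. A minimal sketch of that step, assuming whitespace-level tokens and a literal `[MASK]` string; the `masked_copies` helper is hypothetical, and the real method works on tokenizer ids and returns a different structure.

```python
from typing import List, Tuple

MASK = "[MASK]"  # assumed mask token; real tokenizers define their own


def masked_copies(tokens: List[str]) -> List[Tuple[List[str], int, str]]:
    """Return one (masked_tokens, position, original_token) triple per token."""
    copies = []
    for i, tok in enumerate(tokens):
        # Replace only position i, leaving the rest of the sentence intact.
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        copies.append((masked, i, tok))
    return copies


copies = masked_copies(["the", "cat", "sat"])
```

A sentence of n tokens therefore yields n masked inputs, and the model's log-probability for the original token at each masked position gives the per-token score.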
- prime_text(preamble: Union[str, List[str]], stimuli: Union[str, List[str]]) Iterable[Any] ¶
Prepares a batch of input text into a format fit to run LM scoring on.
Borrows the preprocessing algorithm from Salazar et al. (2020), and modifies code from the following GitHub repository by simonepri: https://github.com/simonepri/lm-scorer
- Parameters
preamble (Union[str, List[str]]) – Batch of prefixes/prime/preambles on which the LM is conditioned.
stimuli (Union[str, List[str]]) – Batch of continuations that are scored based on the conditioned text (provided in the preamble). The positions of the elements match their counterparts in the preamble.
- Returns
Batch of formatted input that can be passed to compute_stats
- distribution(batch: Iterable) torch.Tensor ¶
Returns a distribution over the vocabulary of the model.
- Parameters
batch (Iterable) – A batch of inputs fit to pass to a transformer LM.
- Returns
Tensor consisting of log probabilities over vocab items.
- cloze_distribution(queries: Iterable) torch.Tensor ¶
Accepts as input batch of [(s_i, bw_i)] where s_i is a prompt with an abstract token (bw_i) representing a blank word and returns a distribution over the vocabulary of the model.
- Parameters
queries (Iterable) – A batch of [(s_i, bw_i)] where s_i is a prompt with an abstract token (bw_i) representing a blank word
- Returns
Tensor consisting of log probabilities over vocab items.
- logprobs(batch: Iterable, rank=False) Union[List[Tuple[torch.Tensor, str]], List[Tuple[torch.Tensor, str, int]]] ¶
Returns log probabilities
- Parameters
batch (Iterable) – A batch of inputs fit to pass to a transformer LM.
rank (bool) – Specifies whether to also return ranks of words.
- Returns
List of MLM score metrics and tokens.
- Return type
Union[List[Tuple[torch.Tensor, str]], List[Tuple[torch.Tensor, str, int]]]
- compute_stats(batch: Iterable, rank: bool = False, prob=False, base_two: bool = False, return_tensors: bool = False) Union[Tuple[List[float], List[float]], List[float]] ¶
Primary computational method that processes a batch of prepared sentences and returns per-token scores for each sentence. By default, returns log-probabilities.
- Parameters
batch (Iterable) – batched input as processed by prepare_text or prime_text.
rank (bool) – whether the model should also return ranks per word (based on the conditional log-probability of the word in context).
prob (bool) – whether the model should return probabilities instead of log-probabilities. Can only be True when base_two is False.
base_two (bool) – whether the base of the log should be 2 (usually preferred when reporting results in bits). Can only be True when prob is False.
return_tensors (bool) – whether the model should return scores as a list of tensors instead of a list of lists. Some convenience methods in the package rely on this.
- Returns
Either a list of per-token scores for each sentence passed in the input or, when rank is True, a tuple of two lists containing the per-token scores and the per-token ranks.
- Return type
Union[Tuple[List[float], List[float]], List[float]]
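The `rank` option refers to where the observed token falls when all vocabulary items are ordered by their log-probability in context. A minimal sketch with a toy three-word vocabulary; the `rank_of` helper is hypothetical, and 1-based ranking (1 = most probable) is an assumption, not something this page states.

```python
from typing import Dict


def rank_of(target: str, logprobs: Dict[str, float]) -> int:
    """Rank of `target` when items are sorted by descending log-probability.

    Assumes 1-based ranks; ties share the better rank.
    """
    target_lp = logprobs[target]
    # Count how many items are strictly more probable than the target.
    return 1 + sum(1 for lp in logprobs.values() if lp > target_lp)


# Toy distribution over a three-word vocabulary (illustrative numbers).
toy = {"cat": -0.5, "dog": -1.2, "car": -3.0}
```

A rank of 1 then means the model's top prediction at that position matched the token that actually occurred.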
- sequence_score(batch, reduction=<function MaskedLMScorer.<lambda>>, base_two=False)¶
TODO: reduction should be a string, if it’s a function, specify what kind of function. –> how to ensure it is always that type?
- token_score(batch: Union[str, List[str]], surprisal: bool = False, prob: bool = False, base_two: bool = False, rank: bool = False) Union[List[Tuple[str, float]], List[Tuple[str, float, int]]] ¶
- For every input sentence, returns a list of tuples in the following format:
(token, score),
where score represents the log-probability (by default) of the token given context. Can also return ranks along with scores.
- Parameters
batch (Union[str, List[str]]) – a single sentence or a batch of sentences.
surprisal (bool) – If True, returns per-word surprisals instead of log-probabilities.
prob (bool) – If True, returns per-word probabilities instead of log-probabilities.
base_two (bool) – If True, uses log base 2 instead of the natural log (surprisals are then reported in bits).
rank (bool) – If True, also returns the rank of each word in context (based on the log-probability value)
- Returns
A List containing a Tuple consisting of the word, its associated score, and optionally, its rank.
- Return type
Union[List[Tuple[str, float]], List[Tuple[str, float, int]]]
- class minicons.scorer.IncrementalLMScorer(model_name: str, device: Optional[str] = 'cpu')¶
Bases: minicons.scorer.LMScorer
Class for Autoregressive or Incremental (or left-to-right) language models such as GPT2, etc.
- Parameters
model_name (str) – name of the model; should be either a path to a locally stored model (.pt or .bin file) or the name of a pretrained model on the Huggingface Model Hub.
device (str, optional) – device type that the model should be loaded on, options: cpu or cuda:{0, 1, …}
- add_special_tokens(text: Union[str, List[str]]) Union[str, List[str]] ¶
Reformats input text to add special model-dependent tokens.
- Parameters
text (Union[str, List[str]]) – single string or batch of strings to be modified.
- Returns
Modified input, containing special tokens as per tokenizer specification
- Return type
Union[str, List[str]]
- encode(text: Union[str, List[str]]) dict ¶
Encode a batch of sentences using the model’s tokenizer. Equivalent to calling model.tokenizer(input).
- Parameters
text (Union[str, List[str]]) – Input batch/sentence to be encoded.
manual_special (bool) – Specification of whether special tokens will be manually encoded.
return_tensors – returned tensor format. Default ‘pt’
- Returns
Encoded batch
- Return type
Dict
- prepare_text(text: Union[str, List[str]]) Tuple ¶
Prepares a batch of input text into a format fit to run LM scoring on.
- Parameters
text – batch of sentences to be prepared for scoring.
- Returns
Batch of formatted input that can be passed to compute_stats
- prime_text(preamble: Union[str, List[str]], stimuli: Union[str, List[str]]) Tuple ¶
Prepares a batch of input text into a format fit to run LM scoring on.
- Parameters
preamble (Union[str, List[str]]) – Batch of prefixes/prime/preambles on which the LM is conditioned.
stimuli (Union[str, List[str]]) – Batch of continuations that are scored based on the conditioned text (provided in the preamble). The positions of the elements match their counterparts in the preamble.
- Returns
Batch of formatted input that can be passed to compute_stats
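Conditioning a continuation on a preamble amounts to scoring the joined sequence while remembering where the stimulus starts, so that preamble positions can be excluded afterwards. A minimal sketch of that bookkeeping, assuming whitespace tokenization; the `prime` and `stimulus_scores` helpers are hypothetical, and the real method works with tokenizer output.

```python
from typing import List, Tuple


def prime(preamble: str, stimulus: str) -> Tuple[List[str], int]:
    """Join preamble and stimulus; return tokens and the first stimulus index."""
    pre_tokens = preamble.split()
    tokens = pre_tokens + stimulus.split()
    return tokens, len(pre_tokens)


def stimulus_scores(token_scores: List[float], offset: int) -> List[float]:
    """Keep only the scores that belong to the stimulus part of the input."""
    return token_scores[offset:]


tokens, offset = prime("The keys to the cabinet", "are on the table")
```

The model still sees the preamble as context for every stimulus token; only the preamble's own scores are left out of the final estimate.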
- distribution(batch: Iterable) torch.Tensor ¶
Returns a distribution over the vocabulary of the model.
- Parameters
batch (Iterable) – A batch of inputs fit to pass to a transformer LM.
- Returns
Tensor consisting of log probabilities over vocab items.
- next_word_distribution(queries: List, surprisal: bool = False)¶
Returns the log probability distribution of the next word.
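A log probability distribution over the next word is typically obtained by applying a log-softmax to the model's output scores, and surprisal is its negation. A minimal sketch over a toy vocabulary in pure Python; the logits are made up, and the `log_softmax` helper is illustrative rather than the library's code.

```python
import math
from typing import Dict


def log_softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable log-softmax over a small vocabulary."""
    m = max(logits.values())
    # log-sum-exp with the maximum subtracted to avoid overflow.
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {w: v - log_z for w, v in logits.items()}


# Hypothetical unnormalized scores for the word following some prompt.
logits = {"mat": 2.0, "sofa": 1.0, "moon": -1.0}
logp = log_softmax(logits)

# Surprisal is the negative log-probability of each candidate.
surprisals = {w: -lp for w, lp in logp.items()}
```

Exponentiating the values in `logp` gives probabilities that sum to one, which is a quick check that the normalization is right.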
- compute_stats(batch: Iterable, rank: bool = False, prob: bool = False, base_two: bool = False, return_tensors: bool = False) Union[Tuple[List[float], List[float]], List[float]] ¶
Primary computational method that processes a batch of prepared sentences and returns per-token scores for each sentence. By default, returns log-probabilities.
- Parameters
batch (Iterable) – batched input as processed by prepare_text or prime_text.
rank (bool) – whether the model should also return ranks per word (based on the conditional log-probability of the word in context).
prob (bool) – whether the model should return probabilities instead of log-probabilities. Can only be True when base_two is False.
base_two (bool) – whether the base of the log should be 2 (usually preferred when reporting results in bits). Can only be True when prob is False.
return_tensors (bool) – whether the model should return scores as a list of tensors instead of a list of lists. Some convenience methods in the package rely on this.
- Returns
Either a list of per-token scores for each sentence passed in the input or, when rank is True, a tuple of two lists containing the per-token scores and the per-token ranks.
- Return type
Union[Tuple[List[float], List[int]], List[float]]
- sequence_score(batch, reduction=<function IncrementalLMScorer.<lambda>>, base_two=False)¶
TODO: reduction should be a string, if it’s a function, specify what kind of function. –> how to ensure it is always that type?
- token_score(batch: Union[str, List[str]], surprisal: bool = False, prob: bool = False, base_two: bool = False, rank: bool = False) Union[List[Tuple[str, float]], List[Tuple[str, float, int]]] ¶
- For every input sentence, returns a list of tuples in the following format:
(token, score),
where score represents the log-probability (by default) of the token given context. Can also return ranks along with scores.
- Parameters
batch (Union[str, List[str]]) – a single sentence or a batch of sentences.
surprisal (bool) – If True, returns per-word surprisals instead of log-probabilities.
prob (bool) – If True, returns per-word probabilities instead of log-probabilities.
base_two (bool) – If True, uses log base 2 instead of the natural log (surprisals are then reported in bits).
rank (bool) – If True, also returns the rank of each word in context (based on the log-probability value)
- Returns
A List containing a Tuple consisting of the word, its associated score, and optionally, its rank.
- Return type
Union[List[Tuple[str, float]], List[Tuple[str, float, int]]]
- logprobs(batch: Iterable, rank=False) Union[float, List[float]] ¶
Returns log probabilities
- Parameters
batch (Iterable) – A batch of inputs fit to pass to a transformer LM.
rank (bool) – Specifies whether to also return ranks of words.
- Returns
List of LM score metrics (probability and rank) and tokens.
- Return type
Union[List[Tuple[torch.Tensor, str]], List[Tuple[torch.Tensor, str, int]]]