minicons.scorer module

class minicons.scorer.LMScorer(model_name: str, device: Optional[str] = 'cpu')

Bases: object

Base LM scorer class intended to store models and tokenizers along with methods to facilitate the analysis of language model output scores.

add_special_tokens(text: Union[str, List[str]]) Union[str, List[str]]
distribution(batch: Iterable) torch.Tensor
topk(distribution: torch.Tensor, k: int = 1) Tuple
query(distribution: torch.Tensor, queries: List[str]) Tuple
logprobs(batch: Iterable, rank: bool = False) Union[float, List[float]]
compute_stats(batch: Iterable, rank: bool = False) Union[float, int, List[Union[float, int]]]
prepare_text(text: Union[str, List[str]]) Union[str, List[str]]
prime_text(preamble: Union[str, List[str]], stimuli: Union[str, List[str]]) Tuple
token_score(batch: Union[str, List[str]], surprisal: bool = False, prob: bool = False, base_two: bool = False, rank: bool = False) Union[List[Tuple[str, float]], List[Tuple[str, float, int]]]
For every input sentence, returns a list of tuples in the following format:

(token, score),

where score represents the log-probability (by default) of the token given context. Can also return ranks along with scores.

Parameters
  • batch (Union[str, List[str]]) – a single sentence or a batch of sentences.

  • surprisal (bool) – If True, returns per-word surprisals instead of log-probabilities.

  • prob (bool) – If True, returns per-word probabilities instead of log-probabilities.

  • base_two (bool) – If True, uses log base 2 instead of the natural log (surprisals are then reported in bits).

  • rank (bool) – If True, also returns the rank of each word in context (based on the log-probability value)

Returns

A List containing a Tuple consisting of the word, its associated score, and optionally, its rank.

Return type

Union[List[Tuple[str, float]], List[Tuple[str, float, int]]]
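The relationship between the score types selected by the flags above can be sketched in plain Python. This is a minimal illustration of the arithmetic only, not the library's implementation: convert_score is a hypothetical helper, and giving prob precedence over the other flags is an assumption made for the sketch.

```python
import math


def convert_score(logprob, surprisal=False, prob=False, base_two=False):
    """Convert a natural-log probability into the requested score type.

    Hypothetical helper for illustration; it is not part of minicons.
    """
    if prob:
        # A probability does not depend on the base of the logarithm.
        return math.exp(logprob)
    # Changing the base of a logarithm is a division by log(new_base).
    score = logprob / math.log(2) if base_two else logprob
    # Surprisal is the negative log-probability.
    return -score if surprisal else score


logprob = math.log(0.25)  # natural-log probability of a token with p = 0.25
print(convert_score(logprob))                                # log-probability (nats)
print(convert_score(logprob, prob=True))                     # probability
print(convert_score(logprob, surprisal=True, base_two=True)) # surprisal in bits
```

A token with probability 0.25 therefore carries a surprisal of about 2 bits, since 0.25 is 2 to the power of -2.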

score(batch: Union[str, List[str]], pool: Callable = <built-in method mean of type object>, *args) Union[float, List[float]]

DEPRECATED as of v 0.1.18. Check out sequence_score or token_score instead!

Pooled estimates of sentence log probabilities, computed by the language model. Pooling is usually done using a function that is passed to the method.

Parameters
  • batch (Union[str, List[str]]) – a list of sentences that will be passed to the language model to score.

  • pool (Callable) – Pooling function; torch.mean() by default.

Returns

Float or list of floats specifying the log probabilities of the input sentence(s).

Return type

Union[float, List[float]]

adapt_score(preamble: Union[str, List[str]], stimuli: Union[str, List[str]], pool: Callable = <built-in method mean of type object>, *args) None

DEPRECATED as of v 0.1.18. Check out partial_score instead!

partial_score(preamble: Union[str, List[str]], stimuli: Union[str, List[str]], reduction: Callable = <function LMScorer.<lambda>>, **kwargs) List[float]

Pooled estimates of sequence log probabilities (or some modification of it), given a preamble. Pooling is usually done using a function that is passed to the method.

Parameters
  • preamble (Union[str, List[str]]) – a batch of preambles or primes passed to the language model. This is what the sequence is conditioned on, and the model ignores the word probabilities of this part of the input in estimating the overall score.

  • stimuli (Union[str, List[str]]) – a batch of sequences (same length as preamble) that form the main input consisting of the sequence whose score you want to calculate.

  • reduction (Callable) – Reduction function; lambda x: x.mean(0).item() by default, which gives the average log-probability per token for each sequence in the batch.

  • kwargs – parameters passed on to the compute_stats call:

    • prob (bool): Whether the returned value should be a probability (note that the default reduction method will have to be changed to lambda x: x.prod(0).item() to get a meaningful return value)

    • base_two (bool): whether the returned value should be in base 2 (only works when prob = False)

    • surprisal (bool): whether the returned value should be a surprisal (does not work when prob = True)

Returns

List of floats specifying the desired score for the stimuli part of the input, e.g., P(stimuli | preamble).

Return type

List[float]
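The effect of the reduction argument can be sketched with plain Python lists standing in for tensors. The helper reduce_stimulus, the example log-probabilities, and the preamble length are all hypothetical; the library itself operates on torch tensors, and its handling of token boundaries may differ.

```python
import math


def reduce_stimulus(token_scores, preamble_len, reduction):
    """Apply `reduction` to the scores of the stimulus tokens only.

    Hypothetical helper: scores belonging to the preamble are skipped,
    mirroring the idea that the preamble only provides context.
    """
    return reduction(token_scores[preamble_len:])


def mean(values):
    return sum(values) / len(values)


# Made-up per-token log-probabilities: 2 preamble tokens, 2 stimulus tokens.
token_logprobs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]

avg_logprob = reduce_stimulus(token_logprobs, 2, mean)   # like the default reduction
total_logprob = reduce_stimulus(token_logprobs, 2, sum)  # log P(stimuli | preamble)

# With probabilities instead of log-probabilities, a product is the
# meaningful reduction, as noted for prob=True above.
token_probs = [math.exp(lp) for lp in token_logprobs]
joint_prob = reduce_stimulus(token_probs, 2, math.prod)

print(avg_logprob, total_logprob, joint_prob)
```

Summing log-probabilities and multiplying probabilities describe the same quantity, so exp(total_logprob) matches joint_prob up to floating-point error.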

encode(text: Union[str, List[str]], manual_special: bool = True, return_tensors: Optional[str] = 'pt') Dict

Encode a batch of sentences using the model’s tokenizer. Equivalent to calling model.tokenizer(input).

Parameters
  • text (Union[str, List[str]]) – Input batch/sentence to be encoded.

  • manual_special (bool) – Whether special tokens will be manually encoded.

  • return_tensors (str, optional) – Returned tensor format. Defaults to ‘pt’.

Returns

Encoded batch

Return type

Dict

decode(idx: List[int])

Decode input ids using the model’s tokenizer.

Parameters

idx (List[int]) – List of ids.

Returns

Decoded strings

Return type

List[str]

class minicons.scorer.MaskedLMScorer(model_name: str, device: Optional[str] = 'cpu')

Bases: minicons.scorer.LMScorer

Class for masked language models such as BERT and RoBERTa.

Parameters
  • model_name (str) – name of the model: either a path to a locally stored model (.pt or .bin file) or the name of a pretrained model on the Hugging Face Model Hub.

  • device (str, optional) – device type that the model should be loaded on, options: cpu or cuda:{0, 1, …}

add_special_tokens(text: Union[str, List[str]]) List[str]

Reformats input text to add special model-dependent tokens.

Parameters

text (Union[str, List[str]]) – single string or batch of strings to be modified.

Returns

Modified input, containing special tokens as per tokenizer specification

Return type

List[str]

mask(sentence_words: Union[Tuple[str, str], List[Tuple[str, str]]]) Tuple[str, str, int]

Processes a list of (sentence, word) pairs into input in which the word is masked out of the sentence.

Note: only works for masked LMs.

Parameters

sentence_words (Union[Tuple[str, str], List[Tuple[str, str]]]) – Input consisting of [(sentence, word)], where sentence is an input sentence, and word is a word present in the sentence that will be masked out.

Returns

Tuple (sentence, word, length)

cloze(sentence_words: Union[Tuple[str, str], List[Tuple[str, str]]]) torch.Tensor

Runs inference on masked input. Note: only works for masked LMs.

Parameters

sentence_words (Union[Tuple[str, str], List[Tuple[str, str]]]) – Input consisting of [(sentence, word)], where sentence is an input sentence, and word is a word present in the sentence that will be masked out and inferred.

Returns

A tensor with log probabilities for the desired word in context

prepare_text(text: Union[str, List[str]]) Iterable[Any]

Prepares a batch of input text into a format fit to run MLM scoring on.

Borrows the preprocessing algorithm from Salazar et al. (2020), and modifies code from the following GitHub repository by simonepri: https://github.com/simonepri/lm-scorer

Parameters

text – batch of sentences to be prepared for scoring.

Returns

Batch of formatted input that can be passed to logprobs
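The general idea behind masked-LM scoring in the style of Salazar et al. (2020) is to mask one token at a time and score the original token at that position. The sketch below shows only that preparation step on a list of string tokens. The function masked_variants and the "[MASK]" placeholder are assumptions for illustration; the library works on token ids and takes care of special tokens, which this sketch does not attempt to reproduce.

```python
def masked_variants(tokens, mask_token="[MASK]"):
    """Return one (masked_tokens, position, target) triple per token.

    Hypothetical helper: each variant hides exactly one token, so a masked
    LM can be asked for the log-probability of `target` at `position`.
    """
    variants = []
    for position, target in enumerate(tokens):
        masked = tokens[:position] + [mask_token] + tokens[position + 1:]
        variants.append((masked, position, target))
    return variants


for masked, position, target in masked_variants(["the", "cat", "sat"]):
    print(position, target, masked)
```

Summing the log-probabilities of the targets over all variants gives a pseudo-log-likelihood for the sentence, which is why a sentence of n tokens needs n masked copies.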

prime_text(preamble: Union[str, List[str]], stimuli: Union[str, List[str]]) Iterable[Any]

Prepares a batch of input text into a format fit to run LM scoring on.

Borrows the preprocessing algorithm from Salazar et al. (2020), and modifies code from the following GitHub repository by simonepri: https://github.com/simonepri/lm-scorer

Parameters
  • preamble (Union[str, List[str]]) – Batch of prefixes/prime/preambles on which the LM is conditioned.

  • stimuli (Union[str, List[str]]) – Batch of continuations that are scored based on the conditioned text (provided in the preamble). The positions of the elements match their counterparts in the preamble.

Returns

Batch of formatted input that can be passed to compute_stats

distribution(batch: Iterable) torch.Tensor

Returns a distribution over the vocabulary of the model.

Parameters

batch (Iterable) – A batch of inputs fit to pass to a transformer LM.

Returns

Tensor consisting of log probabilities over vocab items.

cloze_distribution(queries: Iterable) torch.Tensor

Accepts a batch of [(s_i, bw_i)] as input, where s_i is a prompt containing an abstract token (bw_i) that represents a blank word, and returns a distribution over the vocabulary of the model.

Parameters

queries (Iterable) – A batch of [(s_i, bw_i)] where s_i is a prompt with an abstract token (bw_i) representing a blank word

Returns

Tensor consisting of log probabilities over vocab items.

logprobs(batch: Iterable, rank=False) Union[List[Tuple[torch.Tensor, str]], List[Tuple[torch.Tensor, str, int]]]

Returns log probabilities

Parameters
  • batch (Iterable) – A batch of inputs fit to pass to a transformer LM.

  • rank (bool) – Specifies whether to also return ranks of words.

Returns

List of MLM score metrics and tokens.

Return type

Union[List[Tuple[torch.Tensor, str]], List[Tuple[torch.Tensor, str, int]]]

compute_stats(batch: Iterable, rank: bool = False, prob=False, base_two: bool = False, return_tensors: bool = False) Union[Tuple[List[float], List[float]], List[float]]

Primary computational method that processes a batch of prepared sentences and returns per-token scores for each sentence. By default, returns log-probabilities.

Parameters
  • batch (Iterable) – batched input as processed by prepare_text or prime_text.

  • rank (bool) – whether the model should also return ranks per word (based on the conditional log-probability of the word in context).

  • prob (bool) – whether the model should return probabilities instead of log-probabilities. Can only be True when base_two is False.

  • base_two (bool) – whether the base of the log should be 2 (usually preferred when reporting results in bits). Can only be True when prob is False.

  • return_tensors (bool) – whether the model should return scores as a list of tensors instead of a list of lists. Some other convenience methods in the package rely on this.

Returns

Either a list of per-token scores for each sentence passed in the input or, when rank is True, a tuple of two lists containing the per-token scores and the per-token ranks.

Return type

Union[Tuple[List[float], List[float]], List[float]]

sequence_score(batch, reduction=<function MaskedLMScorer.<lambda>>, base_two=False)

TODO: reduction should be a string, if it’s a function, specify what kind of function. –> how to ensure it is always that type?

token_score(batch: Union[str, List[str]], surprisal: bool = False, prob: bool = False, base_two: bool = False, rank: bool = False) Union[List[Tuple[str, float]], List[Tuple[str, float, int]]]
For every input sentence, returns a list of tuples in the following format:

(token, score),

where score represents the log-probability (by default) of the token given context. Can also return ranks along with scores.

Parameters
  • batch (Union[str, List[str]]) – a single sentence or a batch of sentences.

  • surprisal (bool) – If True, returns per-word surprisals instead of log-probabilities.

  • prob (bool) – If True, returns per-word probabilities instead of log-probabilities.

  • base_two (bool) – If True, uses log base 2 instead of the natural log (surprisals are then reported in bits).

  • rank (bool) – If True, also returns the rank of each word in context (based on the log-probability value)

Returns

A List containing a Tuple consisting of the word, its associated score, and optionally, its rank.

Return type

Union[List[Tuple[str, float]], List[Tuple[str, float, int]]]

class minicons.scorer.IncrementalLMScorer(model_name: str, device: Optional[str] = 'cpu')

Bases: minicons.scorer.LMScorer

Class for autoregressive (incremental, or left-to-right) language models such as GPT2.

Parameters
  • model_name (str) – name of the model: either a path to a locally stored model (.pt or .bin file) or the name of a pretrained model on the Hugging Face Model Hub.

  • device (str, optional) – device type that the model should be loaded on, options: cpu or cuda:{0, 1, …}

add_special_tokens(text: Union[str, List[str]]) Union[str, List[str]]

Reformats input text to add special model-dependent tokens.

Parameters

text (Union[str, List[str]]) – single string or batch of strings to be modified.

Returns

Modified input, containing special tokens as per tokenizer specification

Return type

Union[str, List[str]]

encode(text: Union[str, List[str]]) dict

Encode a batch of sentences using the model’s tokenizer. Equivalent to calling model.tokenizer(input).

Parameters
  • text (Union[str, List[str]]) – Input batch/sentence to be encoded.

  • manual_special (bool) – Whether special tokens will be manually encoded.

  • return_tensors – returned tensor format. Default ‘pt’

Returns

Encoded batch

Return type

Dict

prepare_text(text: Union[str, List[str]]) Tuple

Prepares a batch of input text into a format fit to run LM scoring on.

Parameters

text – batch of sentences to be prepared for scoring.

Returns

Batch of formatted input that can be passed to compute_stats

prime_text(preamble: Union[str, List[str]], stimuli: Union[str, List[str]]) Tuple

Prepares a batch of input text into a format fit to run LM scoring on.

Parameters
  • preamble (Union[str, List[str]]) – Batch of prefixes/prime/preambles on which the LM is conditioned.

  • stimuli (Union[str, List[str]]) – Batch of continuations that are scored based on the conditioned text (provided in the preamble). The positions of the elements match their counterparts in the preamble.

Returns

Batch of formatted input that can be passed to compute_stats

distribution(batch: Iterable) torch.Tensor

Returns a distribution over the vocabulary of the model.

Parameters

batch (Iterable) – A batch of inputs fit to pass to a transformer LM.

Returns

Tensor consisting of log probabilities over vocab items.
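A distribution of log probabilities over vocabulary items is typically obtained by applying a log-softmax to the model's raw output scores (logits). The sketch below shows that computation in plain Python for a single position; it assumes this standard normalization and is not taken from the library's source. The function name log_softmax and the example logits are illustrative.

```python
import math


def log_softmax(logits):
    """Turn raw scores into log-probabilities that exponentiate to sum to 1."""
    # Subtracting the maximum first keeps math.exp from overflowing.
    highest = max(logits)
    log_normalizer = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return [x - log_normalizer for x in logits]


logits = [2.0, 1.0, 0.1]  # made-up scores for a three-item vocabulary
log_probs = log_softmax(logits)
total = sum(math.exp(lp) for lp in log_probs)
print(log_probs, total)
```

Every entry is non-positive, and the ordering of the vocabulary items is the same as in the logits, so the most likely item can be read off either representation.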

next_word_distribution(queries: List, surprisal: bool = False)

Returns the log probability distribution of the next word.

compute_stats(batch: Iterable, rank: bool = False, prob: bool = False, base_two: bool = False, return_tensors: bool = False) Union[Tuple[List[float], List[float]], List[float]]

Primary computational method that processes a batch of prepared sentences and returns per-token scores for each sentence. By default, returns log-probabilities.

Parameters
  • batch (Iterable) – batched input as processed by prepare_text or prime_text.

  • rank (bool) – whether the model should also return ranks per word (based on the conditional log-probability of the word in context).

  • prob (bool) – whether the model should return probabilities instead of log-probabilities. Can only be True when base_two is False.

  • base_two (bool) – whether the base of the log should be 2 (usually preferred when reporting results in bits). Can only be True when prob is False.

  • return_tensors (bool) – whether the model should return scores as a list of tensors instead of a list of lists. Some other convenience methods in the package rely on this.

Returns

Either a list of per-token scores for each sentence passed in the input or, when rank is True, a tuple of two lists containing the per-token scores and the per-token ranks.

Return type

Union[Tuple[List[float], List[int]], List[float]]
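One common way to define the rank of a word in context is to count how many vocabulary items receive a higher log-probability at that position. The sketch below follows that definition; it is an assumption about what rank means here, and the library's tie-breaking may differ. The helper token_rank and the example values are hypothetical.

```python
def token_rank(log_probs, target_index):
    """Return the 1-based rank of the item at `target_index`.

    Rank 1 means no other vocabulary item has a higher log-probability.
    """
    target = log_probs[target_index]
    return 1 + sum(1 for lp in log_probs if lp > target)


# Made-up log-probabilities over a four-item vocabulary at one position.
log_probs = [-0.5, -2.0, -1.0, -3.0]
ranks = [token_rank(log_probs, i) for i in range(len(log_probs))]
print(ranks)
```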

sequence_score(batch, reduction=<function IncrementalLMScorer.<lambda>>, base_two=False)

TODO: reduction should be a string, if it’s a function, specify what kind of function. –> how to ensure it is always that type?

token_score(batch: Union[str, List[str]], surprisal: bool = False, prob: bool = False, base_two: bool = False, rank: bool = False) Union[List[Tuple[str, float]], List[Tuple[str, float, int]]]
For every input sentence, returns a list of tuples in the following format:

(token, score),

where score represents the log-probability (by default) of the token given context. Can also return ranks along with scores.

Parameters
  • batch (Union[str, List[str]]) – a single sentence or a batch of sentences.

  • surprisal (bool) – If True, returns per-word surprisals instead of log-probabilities.

  • prob (bool) – If True, returns per-word probabilities instead of log-probabilities.

  • base_two (bool) – If True, uses log base 2 instead of the natural log (surprisals are then reported in bits).

  • rank (bool) – If True, also returns the rank of each word in context (based on the log-probability value)

Returns

A List containing a Tuple consisting of the word, its associated score, and optionally, its rank.

Return type

Union[List[Tuple[str, float]], List[Tuple[str, float, int]]]

logprobs(batch: Iterable, rank=False) Union[float, List[float]]

Returns log probabilities

Parameters
  • batch (Iterable) – A batch of inputs fit to pass to a transformer LM.

  • rank (bool) – Specifies whether to also return ranks of words.

Returns

List of LM score metrics (probability and rank) and tokens.

Return type

Union[List[Tuple[torch.Tensor, str]], List[Tuple[torch.Tensor, str, int]]]