Tokenizers


class nemo.collections.common.tokenizers.AutoTokenizer(pretrained_model_name: str, vocab_file: str | None = None, merges_file: str | None = None, mask_token: str | None = None, bos_token: str | None = None, eos_token: str | None = None, pad_token: str | None = None, sep_token: str | None = None, cls_token: str | None = None, unk_token: str | None = None, additional_special_tokens: List | None = [], use_fast: bool | None = True, trust_remote_code: bool | None = False, include_special_tokens: bool = False, chat_template: str | None = None)[source]#

Bases: TokenizerSpec

Wrapper of HuggingFace AutoTokenizer https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer.

__init__(pretrained_model_name: str, vocab_file: str | None = None, merges_file: str | None = None, mask_token: str | None = None, bos_token: str | None = None, eos_token: str | None = None, pad_token: str | None = None, sep_token: str | None = None, cls_token: str | None = None, unk_token: str | None = None, additional_special_tokens: List | None = [], use_fast: bool | None = True, trust_remote_code: bool | None = False, include_special_tokens: bool = False, chat_template: str | None = None)[source]#
Parameters:
  • pretrained_model_name – corresponds to the HuggingFace AutoTokenizer’s pretrained_model_name_or_path input argument. For more details, refer to the documentation of the from_pretrained method here: https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer. The list of all supported models can be found here: https://huggingface.co/models

  • vocab_file – path to file with vocabulary which consists of characters separated by newlines.

  • mask_token – mask token

  • bos_token – the beginning of sequence token

  • eos_token – the end of sequence token. Usually equal to sep_token

  • pad_token – token to use for padding

  • sep_token – token used for separating sequences

  • cls_token – class token. Usually equal to bos_token

  • unk_token – token to use for unknown tokens

  • additional_special_tokens – list of other tokens beside standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)

  • use_fast – whether to use fast HuggingFace tokenizer

  • include_special_tokens – when True, converting text to IDs will include special tokens / prompt tokens (if any), i.e. it yields self.tokenizer(text).input_ids

  • chat_template – the chat template string used to format “messages”, applied via the underlying HF tokenizer’s apply_chat_template function
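
A minimal usage sketch of the constructor above (the model name "bert-base-uncased" is illustrative; any model from https://huggingface.co/models should work):

from nemo.collections.common.tokenizers import AutoTokenizer

# Wrap a pretrained HuggingFace tokenizer by name.
tokenizer = AutoTokenizer(pretrained_model_name="bert-base-uncased")

# Round-trip text through tokens and IDs.
tokens = tokenizer.text_to_tokens("hello world")
ids = tokenizer.text_to_ids("hello world")
print(tokenizer.ids_to_text(ids))  # "hello world"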

add_special_tokens(special_tokens_dict: dict) → int[source]#

Adds a dictionary of special tokens (eos, pad, cls…). If special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the current vocabulary).

Parameters:

special_tokens_dict – dict of string. Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens]. Tokens are only added if they are not already in the vocabulary.

Returns:

Number of tokens added to the vocabulary.
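
A short sketch of extending the vocabulary (the sentinel token strings are illustrative):

# Keys must come from the predefined special-attribute list above.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<extra_id_0>", "<extra_id_1>"]}
)
print(num_added)  # 0 if both tokens were already in the vocabulary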

property additional_special_tokens_ids#

Returns a list of the additional special tokens’ IDs (excluding bos, eos, pad, unk).

Returns:

List of token IDs for additional special tokens, such as sentinel tokens for T5.

Return type:

List[int]

apply_chat_template(*args, **kwargs)[source]#

Applies the chat template and tokenizes the result.
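
A hedged sketch, assuming arguments are forwarded to the underlying HuggingFace apply_chat_template and that the wrapped tokenizer defines a chat template:

messages = [{"role": "user", "content": "What is a tokenizer?"}]
# Formats the messages with the chat template, then tokenizes the result.
ids = tokenizer.apply_chat_template(messages)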

property bos_id#

Gets the ID of the beginning-of-sequence token.

Returns:

The ID of the BOS token if it exists, None otherwise.

Return type:

int or None

property cls_id#

Gets the ID of the classifier token.

Returns:

The ID of the classifier token if it exists, None otherwise.

Return type:

int or None

property eod#

Gets the ID of the end-of-document token (same as EOS token). Required for megatron-core compatibility.

Returns:

The ID of the EOD/EOS token.

Return type:

int

property eos_id#

Gets the ID of the end-of-sequence token.

Returns:

The ID of the EOS token if it exists, None otherwise.

Return type:

int or None
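
These special-token ID properties can be read directly; a brief sketch:

# Each property is an int, or None when the underlying tokenizer
# does not define that special token.
print(tokenizer.bos_id, tokenizer.eos_id, tokenizer.pad_id)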

ids_to_text(ids, remove_special_tokens=True)[source]#

Converts token IDs back to text.

Parameters:
  • ids (List[int]) – List of token IDs to convert to text.

  • remove_special_tokens (bool) – Whether to remove special tokens (like [PAD], [CLS], etc.) from the output text.

Returns:

The reconstructed text.

Return type:

str
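
A small sketch of the remove_special_tokens flag, assuming the tokenizer was constructed with include_special_tokens=True so that special tokens appear in the IDs:

ids = tokenizer.text_to_ids("hello world")
print(tokenizer.ids_to_text(ids))                               # special tokens stripped
print(tokenizer.ids_to_text(ids, remove_special_tokens=False))  # e.g. "[CLS] hello world [SEP]"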

ids_to_tokens(ids)[source]#

Converts a list of token IDs back to tokens.

Parameters:

ids (List[int]) – List of token IDs to convert.

Returns:

List of tokens.

Return type:

List[str]

property inv_vocab#

Returns the inverse vocabulary mapping (token to ID).

Returns:

Dictionary mapping tokens to their IDs.

Return type:

Dict[str, int]

property mask_id#

Gets the ID of the mask token.

Returns:

The ID of the mask token if it exists, None otherwise.

Return type:

int or None

property name#

Returns the name of the underlying HuggingFace tokenizer class.

Returns:

Name of the tokenizer class.

Return type:

str

property pad_id#

Gets the ID of the padding token.

Returns:

The ID of the padding token if it exists, None otherwise.

Return type:

int or None

save_pretrained(save_directory: str)[source]#

Saves the tokenizer’s vocabulary and other artifacts to the specified directory.
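
A minimal sketch (the target directory is illustrative):

# Writes vocabulary files and tokenizer configuration to disk.
tokenizer.save_pretrained("/tmp/my_tokenizer")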

save_vocabulary(save_directory: str, filename_prefix: str | None = None)[source]#

Saves the tokenizer’s vocabulary to the specified directory; filename_prefix, if given, is prepended to the names of the saved files.

property sep_id#

Gets the ID of the separator token.

Returns:

The ID of the separator token if it exists, None otherwise.

Return type:

int or None

text_to_ids(text)[source]#

Converts text directly to token IDs.

Parameters:

text (str) – Input text to be converted to IDs.

Returns:

List of token IDs. If include_special_tokens is True, will include special tokens from the tokenizer’s configuration.

Return type:

List[int]
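
A sketch contrasting the include_special_tokens setting (the model name is illustrative):

plain = AutoTokenizer("bert-base-uncased")
special = AutoTokenizer("bert-base-uncased", include_special_tokens=True)

print(plain.text_to_ids("hello"))    # subword IDs only
print(special.text_to_ids("hello"))  # additionally includes e.g. [CLS]/[SEP] IDs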

text_to_tokens(text)[source]#

Converts text into a list of tokens.

Parameters:

text (str) – Input text to be tokenized.

Returns:

List of tokens.

Return type:

List[str]

token_to_id(token)[source]#

Converts a single token to its corresponding ID.

Parameters:

token (str) – The token to convert.

Returns:

The ID corresponding to the token.

Return type:

int

tokens_to_ids(tokens)[source]#

Converts a list of tokens to their corresponding IDs.

Parameters:

tokens (List[str]) – List of tokens to convert.

Returns:

List of token IDs.

Return type:

List[int]

tokens_to_text(tokens)[source]#

Converts a list of tokens back into text.

Parameters:

tokens (List[str]) – List of tokens to be converted.

Returns:

The reconstructed text.

Return type:

str

property unk_id#

Gets the ID of the unknown token.

Returns:

The ID of the unknown token if it exists, None otherwise.

Return type:

int or None

property vocab#

Returns the vocabulary as a list where the index corresponds to the token ID.

Returns:

List of tokens in the vocabulary.

Return type:

List[str]

property vocab_size#

Returns the size of the tokenizer’s vocabulary.

Returns:

The number of tokens in the vocabulary.

Return type:

int
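
A quick sketch of the vocabulary accessors:

vocab = tokenizer.vocab         # list indexed by token ID
print(tokenizer.vocab_size)     # number of tokens in the vocabulary
print(vocab[tokenizer.unk_id])  # unknown-token string, e.g. "[UNK]"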

class nemo.collections.common.tokenizers.SentencePieceTokenizer(model_path: str, special_tokens: Dict[str, str] | List[str] | None = None, legacy: bool = False, ignore_extra_whitespaces: bool = True, chat_template: Dict | None = None, trim_spm_separator_after_special_token=True, spm_separator='▁')[source]#

Bases: TokenizerSpec, ChatTemplateMixin

Wrapper of the SentencePiece tokenizer (google/sentencepiece).

Parameters:
  • model_path – path to the SentencePiece tokenizer model. To create the model use create_spt_model()

  • special_tokens – either a list of special tokens or a dictionary of token name to token value

  • legacy – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the ability to add special tokens inside the wrapper.

  • ignore_extra_whitespaces – whether to ignore extra whitespace in the input text while encoding. Note: this exists for current models’ tokenizers that don’t handle extra whitespace, since by default the tokenizer learned to ignore it. To check whether a tokenizer ignores extra whitespace by default, refer to its self.removed_extra_spaces attribute. A parameter was added to process_asr_tokenizer.py so that upcoming models handle this in-built.
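
A minimal sketch (the model path is illustrative; the .model file would come from create_spt_model() or a standalone SentencePiece training run):

from nemo.collections.common.tokenizers import SentencePieceTokenizer

tokenizer = SentencePieceTokenizer(model_path="/path/to/tokenizer.model")
ids = tokenizer.text_to_ids("hello world")
print(tokenizer.ids_to_text(ids))  # "hello world"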

__init__(model_path: str, special_tokens: Dict[str, str] | List[str] | None = None, legacy: bool = False, ignore_extra_whitespaces: bool = True, chat_template: Dict | None = None, trim_spm_separator_after_special_token=True, spm_separator='▁')[source]#
add_special_tokens(special_tokens)[source]#

Adds new special tokens to the tokenizer’s vocabulary (only if legacy=True).

Parameters:

special_tokens – List or dict of special tokens to add.

Raises:
  • AttributeError – If not in legacy mode.

  • ValueError – If the input is not a list or dictionary.

property additional_special_tokens_ids#

Returns a list of the additional special tokens’ IDs (excluding bos, eos, pad, unk).

Used, for example, to return sentinel token IDs for T5.

property bos_id#

Returns the ID for the beginning-of-sequence token.

property cls_id#

Returns the ID for the classification token (only in legacy mode).

property eos_id#

Returns the ID for the end-of-sequence token.

ids_to_text(ids)[source]#

Decodes a list of token IDs into a string, handling special tokens if in legacy mode.

Parameters:

ids – A list or tensor/array of token IDs.

Returns:

The decoded string.

ids_to_tokens(ids)[source]#

Converts a list of token IDs into corresponding token strings.

Parameters:

ids – A list or array/tensor of token IDs.

Returns:

List of string tokens.

property mask_id#

Returns the ID for the mask token (only in legacy mode).

property pad_id#

Returns the ID for the padding token.

property sep_id#

Returns the ID for the separator token (only in legacy mode).

text_to_ids(text, sample_alpha=None)[source]#

Converts input text to a list of token IDs.

Handles chat formatting or raw string tokenization depending on input type.

Parameters:
  • text – A string or list representing chat template inputs.

  • sample_alpha – Optional float to enable subword sampling for data augmentation.

Returns:

A list of token IDs.
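
A short sketch of subword regularization via sample_alpha (assuming the underlying SentencePiece model supports sampling, e.g. a unigram model):

# Deterministic segmentation.
ids = tokenizer.text_to_ids("hello world")

# Stochastic segmentation for data augmentation; repeated calls may
# return different tokenizations of the same text.
sampled = tokenizer.text_to_ids("hello world", sample_alpha=0.1)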

text_to_tokens(text)[source]#

Converts input text to a list of tokens.

If legacy mode is enabled, handles special tokens separately.

Parameters:

text – The input string to tokenize.

Returns:

A list of string tokens.

token_to_id(token)[source]#

Gets the ID corresponding to a token.

Parameters:

token – Token string.

Returns:

Token ID as an integer.

tokens_to_ids(tokens: str | List[str], tokens_to_skip: List[str] = []) → int | List[int][source]#

Converts one or more tokens into their respective IDs, skipping any specified tokens.

Parameters:
  • tokens – A string or list of token strings.

  • tokens_to_skip – List of tokens to ignore during conversion.

Returns:

A single ID or list of IDs.
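
A brief sketch of skipping tokens during conversion (the pad token string is illustrative):

tokens = tokenizer.text_to_tokens("hello world")
# Convert to IDs while ignoring any padding tokens in the list.
ids = tokenizer.tokens_to_ids(tokens, tokens_to_skip=["<pad>"])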

tokens_to_text(tokens)[source]#

Converts a list of tokens back to the corresponding string.

Parameters:

tokens – A list of string tokens or a tensor/array of token IDs.

Returns:

The decoded string.

property unk_id#

Returns the ID for the unknown token.

property vocab#

Returns the combined vocabulary list, including base and special tokens.

class nemo.collections.common.tokenizers.TokenizerSpec[source]#

Bases: ABC

Inherit this class to implement a new tokenizer.
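
A hedged sketch of a minimal subclass, implementing the abstract methods listed below (the whitespace scheme and class name are purely illustrative):

from typing import Dict

from nemo.collections.common.tokenizers import TokenizerSpec

class WhitespaceTokenizer(TokenizerSpec):
    """Toy tokenizer that splits text on whitespace."""

    def __init__(self, vocab: Dict[str, int]):
        self.vocab = vocab
        self.inv = {i: t for t, i in vocab.items()}

    def text_to_tokens(self, text):
        return text.split()

    def tokens_to_text(self, tokens):
        return " ".join(tokens)

    def tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

    def ids_to_tokens(self, ids):
        return [self.inv[i] for i in ids]

    def text_to_ids(self, text):
        return self.tokens_to_ids(self.text_to_tokens(text))

    def ids_to_text(self, ids):
        return self.tokens_to_text(self.ids_to_tokens(ids))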

add_special_tokens(special_tokens: List[str])[source]#

Adds special tokens (eos, pad, cls…) to vocab.

apply_chat_template(*args, **kwargs)[source]#

Applies the chat template and tokenizes the result.

property bos#

Property alias to match MegatronTokenizer; returns bos_id if available.

property cls#

Property alias to match MegatronTokenizer; returns cls_id if available.

property eod#

Property alias to match MegatronTokenizer; returns eod_id if available.

property eos#

Property alias to match MegatronTokenizer; returns eos_id if available.

abstract ids_to_text(ids)[source]#

Converts token IDs back to text.

abstract ids_to_tokens(ids)[source]#

Converts a list of token IDs back to tokens.

property mask#

Property alias to match MegatronTokenizer; returns mask_id if available.

property name#

Name of the class.

property pad#

Property alias to match MegatronTokenizer; returns pad_id if available.

property sep#

Property alias to match MegatronTokenizer; returns sep_id if available.

abstract text_to_ids(text)[source]#

Converts text directly to token IDs.

abstract text_to_tokens(text)[source]#

Converts text into a list of tokens.

abstract tokens_to_ids(tokens)[source]#

Converts a list of tokens to their corresponding IDs.

abstract tokens_to_text(tokens)[source]#

Converts a list of tokens back into text.

property unique_identifiers#

Property required for use with megatron-core datasets.