Tokenizers#
- class nemo.collections.common.tokenizers.AutoTokenizer(pretrained_model_name: str, vocab_file: str | None = None, merges_file: str | None = None, mask_token: str | None = None, bos_token: str | None = None, eos_token: str | None = None, pad_token: str | None = None, sep_token: str | None = None, cls_token: str | None = None, unk_token: str | None = None, additional_special_tokens: List | None = [], use_fast: bool | None = True, trust_remote_code: bool | None = False, include_special_tokens: bool = False, chat_template: str | None = None)[source]#
Bases: TokenizerSpec
Wrapper of HuggingFace AutoTokenizer https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer.
- __init__(pretrained_model_name: str, vocab_file: str | None = None, merges_file: str | None = None, mask_token: str | None = None, bos_token: str | None = None, eos_token: str | None = None, pad_token: str | None = None, sep_token: str | None = None, cls_token: str | None = None, unk_token: str | None = None, additional_special_tokens: List | None = [], use_fast: bool | None = True, trust_remote_code: bool | None = False, include_special_tokens: bool = False, chat_template: str | None = None)[source]#
- Parameters:
pretrained_model_name – corresponds to HuggingFace-AutoTokenizer’s ‘pretrained_model_name_or_path’ input argument. For more details please refer to the documentation of the from_pretrained method here: https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer. The list of all supported models can be found here: https://huggingface.co/models
vocab_file – path to a file with the vocabulary, which consists of characters separated by newlines.
mask_token – mask token
bos_token – the beginning of sequence token
eos_token – the end of sequence token. Usually equal to sep_token
pad_token – token to use for padding
sep_token – token used for separating sequences
cls_token – class token. Usually equal to bos_token
unk_token – token to use for unknown tokens
additional_special_tokens – list of other tokens beside standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)
use_fast – whether to use fast HuggingFace tokenizer
include_special_tokens – when True, converting text to ids will include special tokens / prompt tokens (if any), yielding self.tokenizer(text).input_ids
chat_template – chat template string used to format “messages” via the underlying HF tokenizer’s apply_chat_template function
- add_special_tokens(special_tokens_dict: dict) → int[source]#
Adds a dictionary of special tokens (eos, pad, cls…). If special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the current vocabulary).
- Parameters:
special_tokens_dict – dict of string. Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens]. Tokens are only added if they are not already in the vocabulary.
- Returns:
Number of tokens added to the vocabulary.
- property additional_special_tokens_ids#
Returns a list of the additional special tokens’ IDs (excluding bos, eos, pad, unk).
- Returns:
List of token IDs for additional special tokens, such as sentinel tokens for T5.
- Return type:
List[int]
- property bos_id#
Gets the ID of the beginning-of-sequence token.
- Returns:
The ID of the BOS token if it exists, None otherwise.
- Return type:
int or None
- property cls_id#
Gets the ID of the classifier token.
- Returns:
The ID of the classifier token if it exists, None otherwise.
- Return type:
int or None
- property eod#
Gets the ID of the end-of-document token (same as EOS token). Required for megatron-core compatibility.
- Returns:
The ID of the EOD/EOS token.
- Return type:
int
- property eos_id#
Gets the ID of the end-of-sequence token.
- Returns:
The ID of the EOS token if it exists, None otherwise.
- Return type:
int or None
- ids_to_text(ids, remove_special_tokens=True)[source]#
Converts token IDs back to text.
- Parameters:
ids (List[int]) – List of token IDs to convert to text.
remove_special_tokens (bool) – Whether to remove special tokens (like [PAD], [CLS], etc.) from the output text.
- Returns:
The reconstructed text.
- Return type:
str
- ids_to_tokens(ids)[source]#
Converts a list of token IDs back to tokens.
- Parameters:
ids (List[int]) – List of token IDs to convert.
- Returns:
List of tokens.
- Return type:
List[str]
- property inv_vocab#
Returns the inverse vocabulary mapping (token to ID).
- Returns:
Dictionary mapping tokens to their IDs.
- Return type:
Dict[str, int]
- property mask_id#
Gets the ID of the mask token.
- Returns:
The ID of the mask token if it exists, None otherwise.
- Return type:
int or None
- property name#
Returns the name of the underlying HuggingFace tokenizer class.
- Returns:
Name of the tokenizer class.
- Return type:
str
- property pad_id#
Gets the ID of the padding token.
- Returns:
The ID of the padding token if it exists, None otherwise.
- Return type:
int or None
- save_pretrained(save_directory: str)[source]#
Saves the tokenizer’s vocabulary and other artifacts to the specified directory.
- save_vocabulary(save_directory: str, filename_prefix: str | None = None)[source]#
Saves the tokenizer’s vocabulary and other artifacts to the specified directory.
- property sep_id#
Gets the ID of the separator token.
- Returns:
The ID of the separator token if it exists, None otherwise.
- Return type:
int or None
- text_to_ids(text)[source]#
Converts text directly to token IDs.
- Parameters:
text (str) – Input text to be converted to IDs.
- Returns:
List of token IDs. If include_special_tokens is True, will include special tokens from the tokenizer’s configuration.
- Return type:
List[int]
- text_to_tokens(text)[source]#
Converts text into a list of tokens.
- Parameters:
text (str) – Input text to be tokenized.
- Returns:
List of tokens.
- Return type:
List[str]
- token_to_id(token)[source]#
Converts a single token to its corresponding ID.
- Parameters:
token (str) – The token to convert.
- Returns:
The ID corresponding to the token.
- Return type:
int
- tokens_to_ids(tokens)[source]#
Converts a list of tokens to their corresponding IDs.
- Parameters:
tokens (List[str]) – List of tokens to convert.
- Returns:
List of token IDs.
- Return type:
List[int]
- tokens_to_text(tokens)[source]#
Converts a list of tokens back into text.
- Parameters:
tokens (List[str]) – List of tokens to be converted.
- Returns:
The reconstructed text.
- Return type:
str
- property unk_id#
Gets the ID of the unknown token.
- Returns:
The ID of the unknown token if it exists, None otherwise.
- Return type:
int or None
- property vocab#
Returns the vocabulary as a list where the index corresponds to the token ID.
- Returns:
List of tokens in the vocabulary.
- Return type:
List[str]
- property vocab_size#
Returns the size of the tokenizer’s vocabulary.
- Returns:
The number of tokens in the vocabulary.
- Return type:
int
- class nemo.collections.common.tokenizers.SentencePieceTokenizer(model_path: str, special_tokens: Dict[str, str] | List[str] | None = None, legacy: bool = False, ignore_extra_whitespaces: bool = True, chat_template: Dict | None = None, trim_spm_separator_after_special_token=True, spm_separator='▁')[source]#
Bases: TokenizerSpec, ChatTemplateMixin
Wrapper for Google’s SentencePiece tokenizer: https://github.com/google/sentencepiece.
- Parameters:
model_path – path to the SentencePiece tokenizer model. To create the model, use create_spt_model().
special_tokens – either list of special tokens or dictionary of token name to token value
legacy – when set to True, the previous behavior of the SentencePiece wrapper is restored, including the ability to add special tokens inside the wrapper.
ignore_extra_whitespaces – whether to ignore extra whitespace in the input text while encoding. Note: this exists for current models whose tokenizers were trained to ignore extra whitespace by default. To check whether a tokenizer ignores extra whitespace by default, refer to its self.removed_extra_spaces attribute. A parameter was added to process_asr_tokenizer.py so that upcoming models can handle this natively.
- __init__(model_path: str, special_tokens: Dict[str, str] | List[str] | None = None, legacy: bool = False, ignore_extra_whitespaces: bool = True, chat_template: Dict | None = None, trim_spm_separator_after_special_token=True, spm_separator='▁')[source]#
- add_special_tokens(special_tokens)[source]#
Adds new special tokens to the tokenizer’s vocabulary (only if legacy=True).
- Parameters:
special_tokens – List or dict of special tokens to add.
- Raises:
AttributeError – If not in legacy mode.
ValueError – If the input is not a list or dictionary.
- property additional_special_tokens_ids#
Returns a list of the additional special tokens’ IDs (excluding bos, eos, pad, unk), e.g. sentinel tokens for T5.
- property bos_id#
Returns the ID for the beginning-of-sequence token.
- property cls_id#
Returns the ID for the classification token (only in legacy mode).
- property eos_id#
Returns the ID for the end-of-sequence token.
- ids_to_text(ids)[source]#
Decodes a list of token IDs into a string, handling special tokens if in legacy mode.
- Parameters:
ids – A list or tensor/array of token IDs.
- Returns:
The decoded string.
- ids_to_tokens(ids)[source]#
Converts a list of token IDs into corresponding token strings.
- Parameters:
ids – A list or array/tensor of token IDs.
- Returns:
List of string tokens.
- property mask_id#
Returns the ID for the mask token (only in legacy mode).
- property pad_id#
Returns the ID for the padding token.
- property sep_id#
Returns the ID for the separator token (only in legacy mode).
- text_to_ids(text, sample_alpha=None)[source]#
Converts input text to a list of token IDs.
Handles chat formatting or raw string tokenization depending on input type.
- Parameters:
text – A string or list representing chat template inputs.
sample_alpha – Optional float to enable subword sampling for data augmentation.
- Returns:
A list of token IDs.
- text_to_tokens(text)[source]#
Converts input text to a list of tokens.
If legacy mode is enabled, handles special tokens separately.
- Parameters:
text – The input string to tokenize.
- Returns:
A list of string tokens.
- token_to_id(token)[source]#
Gets the ID corresponding to a token.
- Parameters:
token – Token string.
- Returns:
Token ID as an integer.
- tokens_to_ids(tokens: str | List[str], tokens_to_skip: List[str] = []) → int | List[int][source]#
Converts one or more tokens into their respective IDs, skipping any specified tokens.
- Parameters:
tokens – A string or list of token strings.
tokens_to_skip – List of tokens to ignore during conversion.
- Returns:
A single ID or list of IDs.
- tokens_to_text(tokens)[source]#
Converts a list of tokens back to the corresponding string.
- Parameters:
tokens – A list of string tokens or a tensor/array of token IDs.
- Returns:
The decoded string.
- property unk_id#
Returns the ID for the unknown token.
- property vocab#
Returns the combined vocabulary list, including base and special tokens.
- class nemo.collections.common.tokenizers.TokenizerSpec[source]#
Bases: ABC
Inherit this class to implement a new tokenizer.
- add_special_tokens(special_tokens: List[str])[source]#
Adds special tokens (eos, pad, cls…) to vocab.
- property bos#
Property alias to match MegatronTokenizer; returns bos_id if available.
- property cls#
Property alias to match MegatronTokenizer; returns cls_id if available.
- property eod#
Property alias to match MegatronTokenizer; returns eod_id if available.
- property eos#
Property alias to match MegatronTokenizer; returns eos_id if available.
- property mask#
Property alias to match MegatronTokenizer; returns mask_id if available.
- property name#
Name of the class.
- property pad#
Property alias to match MegatronTokenizer; returns pad_id if available.
- property sep#
Property alias to match MegatronTokenizer; returns sep_id if available.
- property unique_identifiers#
Property required for use with megatron-core datasets.