Module tokenizers::tokenizer
Represents a tokenization pipeline.

A Tokenizer is composed of some of the following parts:
- Normalizer: Takes care of text normalization (like Unicode normalization).
- PreTokenizer: Takes care of pre-tokenization (i.e. how to split tokens and pre-process them).
- Model: Encapsulates the tokenization algorithm (like BPE, word-based, character-based, …).
- PostProcessor: Takes care of the processing after tokenization (like truncating, padding, …).
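The four stages above can be sketched as plain functions. This is an illustrative, std-only sketch of the data flow, not the crate's actual API: the function names, the lowercasing normalizer, the whitespace splitter, and the fixed vocabulary are all stand-ins invented for this example.

```rust
use std::collections::HashMap;

// Normalizer stand-in: lowercase the input (a toy substitute for Unicode normalization).
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

// PreTokenizer stand-in: split on whitespace.
fn pre_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

// Model stand-in: map each pre-token to an id from a fixed vocabulary.
fn model(tokens: &[String], vocab: &HashMap<&str, u32>) -> Vec<u32> {
    tokens
        .iter()
        .filter_map(|t| vocab.get(t.as_str()).copied())
        .collect()
}

// PostProcessor stand-in: truncate/pad the ids to a fixed length.
fn post_process(mut ids: Vec<u32>, len: usize, pad_id: u32) -> Vec<u32> {
    ids.truncate(len);
    while ids.len() < len {
        ids.push(pad_id);
    }
    ids
}

fn main() {
    let vocab = HashMap::from([("hello", 1), ("world", 2)]);
    // Run the text through all four stages in order.
    let ids = post_process(model(&pre_tokenize(&normalize("Hello World")), &vocab), 4, 0);
    println!("{:?}", ids); // [1, 2, 0, 0]
}
```

The point is the ordering: each stage consumes the previous stage's output, which is exactly how the real pipeline composes its configurable parts.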
Re-exports§
pub use crate::decoders::DecoderWrapper;
pub use crate::models::ModelWrapper;
pub use crate::normalizers::NormalizerWrapper;
pub use crate::pre_tokenizers::PreTokenizerWrapper;
pub use crate::processors::PostProcessorWrapper;
pub use crate::utils::iter::LinesWithEnding;
pub use crate::utils::padding::pad_encodings;
pub use crate::utils::padding::PaddingDirection;
pub use crate::utils::padding::PaddingParams;
pub use crate::utils::padding::PaddingStrategy;
pub use crate::utils::truncation::truncate_encodings;
pub use crate::utils::truncation::TruncationDirection;
pub use crate::utils::truncation::TruncationParams;
pub use crate::utils::truncation::TruncationStrategy;
pub use normalizer::NormalizedString;
pub use normalizer::OffsetReferential;
pub use normalizer::SplitDelimiterBehavior;
pub use pre_tokenizer::*;
Modules§
Structs§
- AddedToken: Represents a token added by the user on top of the existing Model vocabulary. An AddedToken can be configured to specify the behavior it should have in various situations.
- Encoding: Represents the output of a Tokenizer.
- TokenizerBuilder: Builder for Tokenizer structs.
- Tokenizer: A Tokenizer is capable of encoding/decoding any text.
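The builder/tokenizer relationship can be sketched with a toy, std-only example. `ToyTokenizer` and `ToyTokenizerBuilder` are invented names for illustration; the real TokenizerBuilder assembles far more parts (normalizer, pre-tokenizer, model, post-processor, padding, truncation).

```rust
use std::collections::HashMap;

// Toy stand-in for Tokenizer; invented for illustration.
struct ToyTokenizer {
    vocab: HashMap<String, u32>,
    unk_id: u32,
}

// Toy stand-in for TokenizerBuilder: accumulate configuration, then build.
#[derive(Default)]
struct ToyTokenizerBuilder {
    vocab: HashMap<String, u32>,
    unk_id: u32,
}

impl ToyTokenizerBuilder {
    fn vocab(mut self, entries: &[(&str, u32)]) -> Self {
        self.vocab = entries.iter().map(|(w, i)| (w.to_string(), *i)).collect();
        self
    }
    fn unk_id(mut self, id: u32) -> Self {
        self.unk_id = id;
        self
    }
    fn build(self) -> ToyTokenizer {
        ToyTokenizer { vocab: self.vocab, unk_id: self.unk_id }
    }
}

impl ToyTokenizer {
    // Encode: whitespace-split and look up ids, falling back to unk_id.
    fn encode(&self, text: &str) -> Vec<u32> {
        text.split_whitespace()
            .map(|w| self.vocab.get(w).copied().unwrap_or(self.unk_id))
            .collect()
    }
    // Decode: reverse lookup, skipping ids not in the vocabulary.
    fn decode(&self, ids: &[u32]) -> String {
        ids.iter()
            .filter_map(|id| {
                self.vocab.iter().find(|(_, v)| *v == id).map(|(k, _)| k.clone())
            })
            .collect::<Vec<_>>()
            .join(" ")
    }
}

fn main() {
    let tok = ToyTokenizerBuilder::default()
        .vocab(&[("hello", 1), ("world", 2)])
        .unk_id(0)
        .build();
    let ids = tok.encode("hello there world");
    println!("{:?}", ids); // [1, 0, 2]
    println!("{}", tok.decode(&ids)); // "hello world"
}
```

The consuming-builder pattern shown here (each setter takes `self` by value and returns `Self`) is the usual idiom for builders of this kind in Rust.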
Enums§
Traits§
- Decoder: A Decoder changes the raw tokens into a more readable form.
- Model: Represents a model used during tokenization (like BPE, Word, or Unigram).
- Normalizer: Takes care of pre-processing strings.
- PostProcessor: A PostProcessor has the responsibility to post-process an encoded output of the Tokenizer. It adds any special tokens that a language model would require.
- PreTokenizer: The PreTokenizer is in charge of the pre-segmentation step. It splits the given string into multiple substrings, keeping track of the offsets of those substrings from the NormalizedString. On some occasions, the PreTokenizer might need to modify the given NormalizedString to ensure we can entirely keep track of the offsets and the mapping with the original string.
- Trainer: A Trainer has the responsibility to train a model. We feed it lines/sentences and then it can train the given Model.
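This division of responsibilities can be sketched with simplified, std-only trait definitions. The signatures here are invented for illustration; the crate's real traits operate on NormalizedString and Encoding values and return Results.

```rust
// Simplified, illustrative versions of two of the traits above.
trait PreTokenizer {
    fn pre_tokenize(&self, text: &str) -> Vec<String>;
}

trait Decoder {
    fn decode(&self, tokens: &[String]) -> String;
}

// Toy pre-tokenizer: split on whitespace.
struct Whitespace;
impl PreTokenizer for Whitespace {
    fn pre_tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_string).collect()
    }
}

// Toy decoder: join tokens back together with single spaces.
struct SpaceJoin;
impl Decoder for SpaceJoin {
    fn decode(&self, tokens: &[String]) -> String {
        tokens.join(" ")
    }
}

fn main() {
    let tokens = Whitespace.pre_tokenize("the  quick fox");
    println!("{:?}", tokens); // ["the", "quick", "fox"]
    println!("{}", SpaceJoin.decode(&tokens)); // "the quick fox"
}
```

Because each stage is a trait, implementations can be swapped independently: any PreTokenizer can be paired with any Decoder, which is what makes the pipeline configurable.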