Module tokenizers::tokenizer
Represents a tokenization pipeline.

A Tokenizer is composed of some of the following parts:

- Normalizer: Takes care of text normalization (like Unicode normalization).
- PreTokenizer: Takes care of pre-tokenization (i.e. how to split the input and pre-process the resulting pieces).
- Model: Encapsulates the tokenization algorithm (like BPE, word-based, character-based, …).
- PostProcessor: Takes care of the processing after tokenization (like truncating, padding, …).
Re-exports§
pub use crate::decoders::DecoderWrapper;
pub use crate::models::ModelWrapper;
pub use crate::normalizers::NormalizerWrapper;
pub use crate::pre_tokenizers::PreTokenizerWrapper;
pub use crate::processors::PostProcessorWrapper;
pub use crate::utils::iter::LinesWithEnding;
pub use crate::utils::padding::pad_encodings;
pub use crate::utils::padding::PaddingDirection;
pub use crate::utils::padding::PaddingParams;
pub use crate::utils::padding::PaddingStrategy;
pub use crate::utils::truncation::truncate_encodings;
pub use crate::utils::truncation::TruncationDirection;
pub use crate::utils::truncation::TruncationParams;
pub use crate::utils::truncation::TruncationStrategy;
pub use normalizer::NormalizedString;
pub use normalizer::OffsetReferential;
pub use normalizer::SplitDelimiterBehavior;
pub use pre_tokenizer::*;
Modules§
Structs§
- Represents a token added by the user on top of the existing Model vocabulary. An AddedToken can be configured to specify the behavior it should have in various situations, such as whether it should only match single words.
- Represents the output of a Tokenizer.
- Builder for Tokenizer structs.
- A Tokenizer is capable of encoding/decoding any text.
Enums§
Traits§
- A Decoder changes the raw tokens into a more readable form.
- Represents a model used during tokenization (like BPE, WordLevel, or Unigram).
- Takes care of pre-processing strings.
- A PostProcessor has the responsibility to post-process an encoded output of the Tokenizer. It adds any special tokens that a language model would require.
- The PreTokenizer is in charge of doing the pre-segmentation step. It splits the given string into multiple substrings, keeping track of the offsets of said substrings from the NormalizedString. On some occasions, the PreTokenizer might need to modify the given NormalizedString to ensure we can entirely keep track of the offsets and the mapping with the original string.
- A Trainer has the responsibility to train a model. We feed it with lines/sentences and then it can train the given Model.