Struct tokenizers::models::bpe::trainer::BpeTrainer
#[non_exhaustive]
pub struct BpeTrainer {
pub min_frequency: u64,
pub vocab_size: usize,
pub show_progress: bool,
pub special_tokens: Vec<AddedToken>,
pub limit_alphabet: Option<usize>,
pub initial_alphabet: HashSet<char>,
pub continuing_subword_prefix: Option<String>,
pub end_of_word_suffix: Option<String>,
pub max_token_length: Option<usize>,
/* private fields */
}
In charge of training a BPE model.
§Examples
use tokenizers::tokenizer::Trainer;
use tokenizers::models::bpe::{BPE, BpeTrainer};
let sequences = vec![ "Hello", "World" ];
let mut trainer = BpeTrainer::default();
trainer.feed(sequences.iter(), |s| Ok(vec![s.to_owned()]));
let mut model = BPE::default();
let special_tokens = trainer.train(&mut model).unwrap();
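In the example above, the closure passed to feed turns each sequence into a list of words. As a rough mental model only (an assumption about what a BPE trainer needs from its input, not the crate's actual code), feeding can be thought of as tallying how often each word occurs. The helper name count_words below is hypothetical and uses only the standard library:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a word-counting step (not part of `tokenizers`):
/// apply `process` to every sequence and tally each resulting word.
fn count_words<'a, I, F>(sequences: I, process: F) -> HashMap<String, u64>
where
    I: Iterator<Item = &'a str>,
    F: Fn(&str) -> Vec<String>,
{
    let mut counts: HashMap<String, u64> = HashMap::new();
    for sequence in sequences {
        for word in process(sequence) {
            // Insert the word with a count of zero if unseen, then increment.
            *counts.entry(word).or_insert(0) += 1;
        }
    }
    counts
}
```

A closure that returns the whole sequence as a single word, as in the example, would make every distinct input string its own entry in the tally.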
Fields (Non-exhaustive)§
This struct is marked as non-exhaustive
Non-exhaustive structs could have additional fields added in future. Therefore, non-exhaustive structs cannot be constructed in external crates using the traditional Struct { .. } syntax; cannot be matched against without a wildcard ..; and struct update syntax will not work.
min_frequency: u64
The minimum frequency a pair must have to produce a merge operation
vocab_size: usize
The target vocabulary size
show_progress: bool
Whether to show progress while training
special_tokens: Vec<AddedToken>
A list of special tokens that the model should know of
limit_alphabet: Option<usize>
An optional limit on the number of initial tokens that can be kept before computing merges
initial_alphabet: HashSet<char>
The initial alphabet that must always be included. This makes it possible to cover characters that do not necessarily appear in the training set
continuing_subword_prefix: Option<String>
An optional prefix to use on any subword that exists only behind another one
end_of_word_suffix: Option<String>
An optional suffix to characterize an end-of-word subword
max_token_length: Option<usize>
An optional parameter to limit the max length of any single token
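To make the interplay of min_frequency, vocab_size and max_token_length concrete, here is a minimal toy merge loop. It is a sketch under stated assumptions, not the crate's implementation: the function toy_bpe_merges is hypothetical, token length is measured in characters, ties are broken lexicographically, and special tokens, alphabet limiting, prefixes and suffixes are all ignored.

```rust
use std::collections::{HashMap, HashSet};

/// Toy BPE merge loop (illustrative only). `words` maps each word to its
/// count; the return value is the list of merges in the order they were learned.
fn toy_bpe_merges(
    words: &HashMap<String, u64>,
    vocab_size: usize,
    min_frequency: u64,
    max_token_length: Option<usize>,
) -> Vec<(String, String)> {
    // Start with every word split into single-character symbols.
    let mut corpus: Vec<(Vec<String>, u64)> = words
        .iter()
        .map(|(w, &c)| (w.chars().map(|ch| ch.to_string()).collect(), c))
        .collect();
    let mut vocab: HashSet<String> = corpus
        .iter()
        .flat_map(|(symbols, _)| symbols.iter().cloned())
        .collect();
    let mut merges = Vec::new();

    // Stop once the target vocabulary size is reached.
    while vocab.len() < vocab_size {
        // Count adjacent symbol pairs, weighted by word count.
        let mut pair_counts: HashMap<(String, String), u64> = HashMap::new();
        for (symbols, count) in &corpus {
            for pair in symbols.windows(2) {
                let merged_len = pair[0].chars().count() + pair[1].chars().count();
                if let Some(max) = max_token_length {
                    if merged_len > max {
                        continue; // Merged token would be too long.
                    }
                }
                *pair_counts
                    .entry((pair[0].clone(), pair[1].clone()))
                    .or_insert(0) += *count;
            }
        }
        // Most frequent pair wins; ties go to the lexicographically smallest.
        let best = pair_counts
            .into_iter()
            .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)));
        let (pair, count) = match best {
            Some(found) => found,
            None => break, // Nothing left to merge.
        };
        if count < min_frequency {
            break; // No remaining pair is frequent enough.
        }
        // Replace every occurrence of the pair with the merged symbol.
        let merged = format!("{}{}", pair.0, pair.1);
        for (symbols, _) in corpus.iter_mut() {
            let mut i = 0;
            while i + 1 < symbols.len() {
                if symbols[i] == pair.0 && symbols[i + 1] == pair.1 {
                    symbols[i] = merged.clone();
                    symbols.remove(i + 1);
                }
                i += 1;
            }
        }
        vocab.insert(merged);
        merges.push(pair);
    }
    merges
}
```

With the words "ab" (count 5) and "abc" (count 2), a vocab_size of 5 and a min_frequency of 2, this sketch first merges ("a", "b") and then ("ab", "c"). Raising min_frequency to 3, lowering vocab_size to 4, or setting max_token_length to Some(2) each stops it after the first merge.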
Implementations§
Trait Implementations§
impl Clone for BpeTrainer
fn clone(&self) -> BpeTrainer
Returns a copy of the value.
1.0.0 · fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for BpeTrainer
impl Default for BpeTrainer
impl<'de> Deserialize<'de> for BpeTrainer
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
Deserialize this value from the given Serde deserializer.
impl From<BpeTrainer> for TrainerWrapper
fn from(from: BpeTrainer) -> Self
Converts to this type from the input type.
impl PartialEq for BpeTrainer
fn eq(&self, other: &BpeTrainer) -> bool
This method tests for self and other values to be equal, and is used by ==.
impl Serialize for BpeTrainer
impl Trainer for BpeTrainer
impl Eq for BpeTrainer
impl StructuralPartialEq for BpeTrainer
Auto Trait Implementations§
impl Freeze for BpeTrainer
impl RefUnwindSafe for BpeTrainer
impl Send for BpeTrainer
impl Sync for BpeTrainer
impl Unpin for BpeTrainer
impl UnwindSafe for BpeTrainer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.