Struct tokenizers::models::bpe::BPE
pub struct BPE {
pub dropout: Option<f32>,
pub unk_token: Option<String>,
pub continuing_subword_prefix: Option<String>,
pub end_of_word_suffix: Option<String>,
pub fuse_unk: bool,
pub byte_fallback: bool,
/* private fields */
}
A Byte Pair Encoding model.
Fields
dropout: Option<f32>
Dropout probability for merges. The default, 0, means no dropout. At 1.0, tokenization performs no merges, so the result is just individual characters.
unk_token: Option<String>
The unknown token to use when an unknown character is encountered.
continuing_subword_prefix: Option<String>
An optional prefix to use on any subword that exists only behind another one.
end_of_word_suffix: Option<String>
An optional suffix to characterize an end-of-word subword.
fuse_unk: bool
Whether multiple unk tokens get fused.
byte_fallback: bool
Byte fallback, as in SentencePiece: instead of UNK, uses a byte token such as "<0x00>" for each byte in the unk token.
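As a rough illustration of the `byte_fallback` token format, the sketch below spells out each UTF-8 byte of an unknown piece as a `<0xNN>` string. The helper `byte_fallback_tokens` is hypothetical and not part of the crate, and the two-digit uppercase hex format is an assumption modeled on the "<0x00>" example above. A real model would presumably also need these byte tokens to exist in its vocabulary.

```rust
/// Hypothetical helper (not a `tokenizers` API): spell out each UTF-8 byte of
/// `piece` as a "<0xNN>" token string, in the style of byte fallback.
fn byte_fallback_tokens(piece: &str) -> Vec<String> {
    piece
        .bytes()
        // Assumed format: two uppercase hex digits, matching the "<0x00>" example.
        .map(|b| format!("<0x{:02X}>", b))
        .collect()
}
```

For instance, `byte_fallback_tokens("é")` produces two strings, `<0xC3>` and `<0xA9>`, because `é` occupies two bytes in UTF-8.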
Implementations
impl BPE
pub fn builder() -> BpeBuilder
Initialize a BpeBuilder.
pub fn new(vocab: Vocab, merges: Merges) -> Self
Create a new BPE model with the given vocab and merges.
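To make the roles of `merges` and `dropout` more concrete, here is a simplified sketch of how ranked merges might be applied to a single word. It is not the crate's implementation: `apply_merges` is a hypothetical helper, a merge's rank is assumed to be its position in the list, and the random source is passed in as a closure (assumed to return values in [0, 1)) so the example needs only the standard library.

```rust
use std::collections::HashMap;

/// Hypothetical helper, not part of the `tokenizers` crate: applies ranked
/// merges to the characters of `word`. A candidate merge is skipped whenever
/// `rng() < dropout`, where `rng` is assumed to yield values in [0, 1).
fn apply_merges(
    word: &str,
    merges: &[(&str, &str)],
    dropout: f32,
    mut rng: impl FnMut() -> f32,
) -> Vec<String> {
    // Assumption: a merge's rank is its index in `merges`; lower ranks win.
    let ranks: HashMap<(String, String), usize> = merges
        .iter()
        .enumerate()
        .map(|(rank, (a, b))| ((a.to_string(), b.to_string()), rank))
        .collect();

    // Start from individual characters.
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the best-ranked adjacent pair that survives dropout: (rank, index).
        let mut best: Option<(usize, usize)> = None;
        for i in 0..symbols.len().saturating_sub(1) {
            let key = (symbols[i].clone(), symbols[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if rng() < dropout {
                    continue; // this candidate merge is dropped
                }
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            // Fuse the chosen pair into a single symbol and look again.
            Some((_, i)) => {
                let right = symbols.remove(i + 1);
                symbols[i].push_str(&right);
            }
            // No applicable merge remains.
            None => break,
        }
    }
    symbols
}
```

With `dropout` at 0.0 no merge is ever skipped, and at 1.0 every merge is skipped, which leaves the word as individual characters, consistent with the `dropout` field description.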
pub fn from_file(vocab: &str, merges: &str) -> BpeBuilder
Initialize a BpeBuilder from vocab and merges files.
pub fn read_file(vocab: &str, merges: &str) -> Result<(Vocab, Merges)>
Read the given files to extract the vocab and merges
pub fn clear_cache(&self)
Reset the cache.
pub fn get_vocab(&self) -> Vocab
pub fn get_unk_token(&self) -> &Option<String>
pub fn get_continuing_subword_prefix(&self) -> &Option<String>
Trait Implementations
impl<'de> Deserialize<'de> for BPE
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
    D: Deserializer<'de>,
Deserialize this value from the given Serde deserializer.
impl From<BPE> for ModelWrapper
impl Model for BPE
type Trainer = BpeTrainer
fn get_vocab(&self) -> HashMap<String, u32>
Retrieve the entire vocabulary mapping (token -> ID)
fn get_vocab_size(&self) -> usize
Retrieve the size of the vocabulary
fn tokenize(&self, sequence: &str) -> Result<Vec<Token>>
Tokenize the given sequence into multiple underlying Token. The offsets on the Token are expected to be relative to the given sequence.
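A small sketch of what "offsets relative to the given sequence" can mean in practice. `SketchToken` and `char_tokens` are hypothetical stand-ins rather than the crate's types, and the offsets are assumed here to be byte positions `(start, end)` into the input string.

```rust
/// Hypothetical stand-in for the crate's `Token`; the field names are assumptions.
#[derive(Debug, PartialEq)]
struct SketchToken {
    value: String,
    /// Assumed to be byte offsets `(start, end)` into the input sequence.
    offsets: (usize, usize),
}

/// Toy tokenizer: one token per character, with offsets relative to `sequence`.
fn char_tokens(sequence: &str) -> Vec<SketchToken> {
    sequence
        .char_indices()
        .map(|(start, c)| SketchToken {
            value: c.to_string(),
            // The end offset advances by the character's UTF-8 length.
            offsets: (start, start + c.len_utf8()),
        })
        .collect()
}
```

Under this assumption, slicing the original sequence with a token's offsets gives back that token's text, which is the property downstream code typically relies on.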
fn save(&self, folder: &Path, name: Option<&str>) -> Result<Vec<PathBuf>>
Save the current Model in the given folder, using the given prefix for the various files that need to be saved.
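One way to picture how the `name` prefix could affect the returned paths is sketched below. `planned_paths` is a hypothetical helper, and both the base file names (`vocab.json`, `merges.txt`) and the `{name}-` prefix convention are assumptions for illustration, not something the text above confirms.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: compute where a BPE model's files might be written.
/// The base names and the "{name}-" prefix convention are assumptions.
fn planned_paths(folder: &Path, name: Option<&str>) -> Vec<PathBuf> {
    ["vocab.json", "merges.txt"]
        .iter()
        .map(|base| match name {
            // With a prefix, each file name becomes "<prefix>-<base>".
            Some(prefix) => folder.join(format!("{}-{}", prefix, base)),
            // Without one, the base name is used as-is.
            None => folder.join(*base),
        })
        .collect()
}
```

The sketch only builds paths and writes nothing to disk; the actual `save` returns the paths of the files it created.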
fn get_trainer(&self) -> BpeTrainer
Get an instance of a Trainer capable of training this Model
impl StructuralPartialEq for BPE
Auto Trait Implementations
impl !Freeze for BPE
impl RefUnwindSafe for BPE
impl Send for BPE
impl Sync for BPE
impl Unpin for BPE
impl UnwindSafe for BPE
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.