Struct tokenizers::models::unigram::Unigram
pub struct Unigram {
pub min_score: f64,
/* private fields */
}
A Unigram model to encode sentences.
Fields
min_score: f64
Implementations
impl Unigram
pub fn from( vocab: Vec<(String, f64)>, unk_id: Option<usize>, byte_fallback: bool ) -> Result<Self>
Create a Unigram model from a given vocabulary.
The vocabulary is the list of tokens and their associated scores. Each score is roughly the log-probability of the token's frequency, which is what enables tokenization and sampling.
unk_id is the index of the unknown token within the vocabulary.
For now, Unigram requires at least an unk token, because the input might contain a character that was never seen.
Future versions might allow that part to be hidden.
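To make the notion of a score concrete, here is a minimal, std-only sketch that turns raw token counts into log-probability scores. The helper name `scores_from_counts` and the idea of starting from counts are assumptions made for illustration; this is not how the tokenizers trainer computes its scores. It only shows why such scores are at most zero and why more frequent tokens score higher.

```rust
/// Hypothetical helper (not part of tokenizers): turn raw token counts
/// into (token, score) pairs, the same shape as the `vocab` argument above.
/// Assumes every count is non-zero, otherwise the score would be -inf or NaN.
fn scores_from_counts(counts: &[(&str, u64)]) -> Vec<(String, f64)> {
    let total: u64 = counts.iter().map(|(_, c)| *c).sum();
    counts
        .iter()
        .map(|(tok, c)| {
            // ln(count / total) is always <= 0 and gets closer to 0
            // as the token becomes more frequent.
            let score = (*c as f64 / total as f64).ln();
            (tok.to_string(), score)
        })
        .collect()
}
```

For example, counts of 1 for "a" and 3 for "b" give "a" a score of ln(0.25) and "b" a score of ln(0.75), so "b" ranks higher.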
pub fn byte_fallback(&self) -> bool
pub fn encode(&self, sentence: &str) -> Result<Vec<String>>
This function takes a &str and encodes it into a Vec of Strings, using the best tokenization available to the current model.
use tokenizers::models::unigram::Unigram;
let pieces = vec![
("<unk>".to_string(), 0.0),
("a".to_string(), 0.0),
("b".to_string(), 0.0),
("c".to_string(), 0.0),
("d".to_string(), 0.0),
("cd".to_string(), 1.0),
("ab".to_string(), 2.0),
("abc".to_string(), 5.0),
("abcd".to_string(), 10.0),
];
let model = Unigram::from(pieces, Some(0), false).unwrap();
let result = model.encode("abcdacdxx").unwrap();
assert_eq!(result, vec!["abcd", "a", "cd", "xx"]);
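The "best tokenization" can be pictured as the split whose pieces have the highest total score. The following is a simplified, std-only Viterbi sketch of that idea. The function `best_segmentation` and its `unk_penalty` parameter are hypothetical and are not the tokenizers implementation, which also handles details such as byte fallback and `min_score`.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified Viterbi segmentation (not the tokenizers code):
/// return the split of `sentence` whose pieces have the highest total score.
/// A single character missing from `vocab` is allowed at cost `unk_penalty`.
fn best_segmentation(
    vocab: &HashMap<&str, f64>,
    sentence: &str,
    unk_penalty: f64,
) -> Vec<String> {
    // Byte offsets of every char boundary, including the end of the string.
    let bounds: Vec<usize> = sentence
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(sentence.len()))
        .collect();
    let n = bounds.len() - 1;

    // best[j] = (best score of the prefix ending at boundary j,
    //            start boundary of the last piece in that prefix)
    let mut best: Vec<(f64, usize)> = vec![(f64::NEG_INFINITY, 0); n + 1];
    best[0] = (0.0, 0);

    for j in 1..=n {
        for i in 0..j {
            let piece = &sentence[bounds[i]..bounds[j]];
            let piece_score = match vocab.get(piece) {
                Some(s) => *s,
                // Unknown single characters stay reachable, at a penalty.
                None if j - i == 1 => unk_penalty,
                None => continue,
            };
            let candidate = best[i].0 + piece_score;
            if candidate > best[j].0 {
                best[j] = (candidate, i);
            }
        }
    }

    // Walk the back-pointers from the end to recover the pieces in order.
    let mut pieces = Vec::new();
    let mut j = n;
    while j > 0 {
        let i = best[j].1;
        pieces.push(sentence[bounds[i]..bounds[j]].to_string());
        j = i;
    }
    pieces.reverse();
    pieces
}
```

With the vocabulary from the example above and the input "abcdacd", this sketch picks "abcd", "a", "cd" (total score 11) over alternatives such as "abcd", "a", "c", "d" (total score 10). Unlike the example above, where the unknown characters come back as a single "xx" piece, this sketch emits each unknown character as its own piece.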
pub fn iter(&self) -> UnigramIterator<'_>
Iterate over the vocabulary of the model as pairs of (token, score).
pub fn load<P: AsRef<Path>>(path: P) -> Result<Unigram>
Loads a SentencePiece output model that was trained with tokenizers. After that, you can use the model with the tokenizers library.
use tokenizers::models::unigram::Unigram;
use std::path::Path;
let model = Unigram::load("mymodel-unigram.json").unwrap();
Trait Implementations
impl<'de> Deserialize<'de> for Unigram
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
    D: Deserializer<'de>,
Deserialize this value from the given Serde deserializer.
impl From<Unigram> for ModelWrapper
impl Model for Unigram
type Trainer = UnigramTrainer
fn get_vocab(&self) -> HashMap<String, u32>
Retrieve the entire vocabulary mapping (token -> ID)
fn get_vocab_size(&self) -> usize
Retrieve the size of the vocabulary
fn tokenize(&self, sentence: &str) -> Result<Vec<Token>>
Tokenize the given sequence into multiple underlying Tokens. The offsets on each Token are expected to be relative to the given sequence.
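As an illustration of offsets that are relative to the input sequence, here is a small std-only sketch. The helper `piece_offsets` is hypothetical and not part of tokenizers. It assumes the pieces concatenate back to the original input and that offsets are counted in bytes.

```rust
/// Hypothetical helper (not part of tokenizers): given pieces that
/// concatenate back to the input, compute each piece's (start, end)
/// byte offsets relative to that input.
fn piece_offsets(pieces: &[&str]) -> Vec<(usize, usize)> {
    let mut start = 0;
    pieces
        .iter()
        .map(|p| {
            // str::len is measured in bytes, so the spans are byte offsets.
            let end = start + p.len();
            let span = (start, end);
            start = end;
            span
        })
        .collect()
}
```

For the pieces "abcd", "a", "cd", "xx" this yields (0, 4), (4, 5), (5, 7), (7, 9), and slicing the input "abcdacdxx" with each span gives back the corresponding piece.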
fn save(&self, folder: &Path, name: Option<&str>) -> Result<Vec<PathBuf>>
Save the current Model in the given folder, using the given prefix for the various files that need to be saved.
fn get_trainer(&self) -> Self::Trainer
Get an instance of a Trainer capable of training this Model
Auto Trait Implementations
impl !Freeze for Unigram
impl RefUnwindSafe for Unigram
impl Send for Unigram
impl Sync for Unigram
impl Unpin for Unigram
impl UnwindSafe for Unigram
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.