NLP.Vocabulary

This document is auto-generated for Owl’s APIs. #34 entries have been extracted.

Type definition

type t

Type of vocabulary (or dictionary).

Query vocabulary

val get_w2i : t -> (string, int) Hashtbl.t

get_w2i v returns the word -> index mapping of vocabulary v.

val get_i2w : t -> (int, string) Hashtbl.t

get_i2w v returns the index -> word mapping of vocabulary v.

val exits_w : t -> string -> bool

exits_w v w returns true if word w exists in the vocabulary v.

val exits_i : t -> int -> bool

exits_i v i returns true if index i exists in the vocabulary v.

val word2index : t -> string -> int

word2index v w converts word w to its index using vocabulary v.

val index2word : t -> int -> string

index2word v i converts index i to its corresponding word using vocabulary v.
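
The query functions above all rest on a pair of mutually inverse tables. The following is a minimal stand-alone sketch of that idea using only the OCaml standard library. It does not call Owl, and the names has_word, has_index, word_to_index and index_to_word are hypothetical stand-ins for exits_w, exits_i, word2index and index2word.

```ocaml
(* Stand-alone model of a word <-> index vocabulary; not Owl's implementation. *)
let words = [| "the"; "cat"; "sat" |]

(* word -> index table, the role played by the result of get_w2i *)
let w2i : (string, int) Hashtbl.t = Hashtbl.create 16

(* index -> word table, the role played by the result of get_i2w *)
let i2w : (int, string) Hashtbl.t = Hashtbl.create 16

(* Fill both tables so that they stay inverse to each other. *)
let () =
  Array.iteri
    (fun i w ->
      Hashtbl.replace w2i w i;
      Hashtbl.replace i2w i w)
    words

(* Membership tests in the spirit of exits_w and exits_i. *)
let has_word w = Hashtbl.mem w2i w
let has_index i = Hashtbl.mem i2w i

(* Lookups in the spirit of word2index and index2word.
   Hashtbl.find raises Not_found when the key is absent. *)
let word_to_index w = Hashtbl.find w2i w
let index_to_word i = Hashtbl.find i2w i
```

In this sketch, checking has_word or has_index before a lookup avoids the Not_found exception. How Owl itself reports an unknown word or index is not stated on this page.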

Obtain properties

val length : t -> int

length v returns the size of vocabulary v.

val freq_i : t -> int -> int

freq_i v i returns the frequency of the word with index i in the vocabulary v.

val freq_w : t -> string -> int

freq_w v w returns the frequency of word w in the vocabulary v.

val sort_freq : ?inc:bool -> t -> (int * int) array

sort_freq v returns the vocabulary as an (index, freq) array sorted by frequency, in increasing or decreasing order as specified by the optional parameter inc.
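
The sorting behaviour can be illustrated on a plain array. This is a sketch only: sort_by_freq is a hypothetical helper, it works on an (index, freq) array rather than on a value of type t, and the default inc = true is an assumption that this page does not confirm.

```ocaml
(* Hypothetical helper modelled on sort_freq; not Owl's implementation. *)
let sort_by_freq ?(inc = true) (pairs : (int * int) array) : (int * int) array =
  (* Work on a copy so the caller's array is left untouched. *)
  let sorted = Array.copy pairs in
  (* Compare on the frequency component only; flip the order when inc is false. *)
  let cmp (_, f1) (_, f2) = if inc then compare f1 f2 else compare f2 f1 in
  Array.stable_sort cmp sorted;
  sorted

let freqs = [| (0, 5); (1, 1); (2, 9) |]

(* Increasing frequency: index 1, then 0, then 2. *)
let ascending = sort_by_freq freqs

(* Decreasing frequency: index 2, then 0, then 1. *)
let descending = sort_by_freq ~inc:false freqs
```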

val top : t -> int -> (string * int) array

top v k returns the k most frequent words in vocabulary v.

val bottom : t -> int -> (string * int) array

bottom v k returns the k least frequent words in vocabulary v.
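
A small sketch of selecting the k highest and k lowest entries may help here. It assumes that the integer in each returned pair is the word's frequency, which the return type (string * int) array suggests but this page does not state. take_k is a hypothetical helper and does not call Owl.

```ocaml
(* Hypothetical helper: the k entries with the highest or lowest count. *)
let take_k ~most_frequent (wf : (string * int) array) (k : int) : (string * int) array =
  let sorted = Array.copy wf in
  (* Descending order for the top entries, ascending for the bottom ones. *)
  let cmp (_, a) (_, b) = if most_frequent then compare b a else compare a b in
  Array.stable_sort cmp sorted;
  (* Never ask for more entries than exist. *)
  Array.sub sorted 0 (min k (Array.length sorted))

let counts = [| ("the", 12); ("cat", 3); ("sat", 1); ("on", 7) |]
let top2 = take_k ~most_frequent:true counts 2
let bottom2 = take_k ~most_frequent:false counts 2
```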

val re_index : t -> t

re_index v re-indexes the words in vocabulary v.
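
One plausible reading of re-indexing is that, after words have been removed, the surviving words receive contiguous indices again. The sketch below models that reading on a plain word -> index table. reindex is a hypothetical helper, and ordering by old index is an assumption; this page does not say which order Owl uses.

```ocaml
(* Hypothetical model: assign indices 0 .. n-1 to the words, in order of old index. *)
let reindex (w2i : (string, int) Hashtbl.t) : (string, int) Hashtbl.t =
  (* Collect (old index, word) pairs and sort them by old index. *)
  let entries = Hashtbl.fold (fun w i acc -> (i, w) :: acc) w2i [] in
  let ordered = List.sort compare entries in
  let fresh = Hashtbl.create (Hashtbl.length w2i) in
  (* The position in the sorted list becomes the new index. *)
  List.iteri (fun new_i (_, w) -> Hashtbl.replace fresh w new_i) ordered;
  fresh

(* A table whose indices have gaps, as might remain after trimming. *)
let sparse : (string, int) Hashtbl.t = Hashtbl.create 8

let () =
  List.iter
    (fun (w, i) -> Hashtbl.replace sparse w i)
    [ ("cat", 4); ("the", 0); ("sat", 9) ]

let dense = reindex sparse
```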

Core functions

val build : ?lo:float -> ?hi:float -> ?alphabet:bool -> ?stopwords:(string, 'a) Hashtbl.t -> string -> t

build ~lo ~hi ~alphabet ~stopwords fname builds a vocabulary from a text corpus file named fname. If alphabet=false, the tokens are the words separated by white space; if alphabet=true, the tokens are the individual characters and a vocabulary of characters is returned.

Parameters:
  • lo: lower bound of word frequency, as a percentage.
  • hi: upper bound of word frequency, as a percentage.
  • alphabet: whether to build the vocabulary over characters (true) or words (false).
  • stopwords: hash table whose keys are the stopwords to exclude.
  • fname: file name of the text corpus; each line contains one document.
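
The difference between the word and character modes can be sketched with the standard library alone. This models only tokenisation and counting: the lo, hi and stopwords filters are left out, the documents are passed as a list of strings rather than read from a file, and tokens and count_tokens are hypothetical names that do not exist in Owl.

```ocaml
(* Hypothetical tokeniser: words split on spaces, or single characters
   when alphabet is true. In character mode every character, including
   a space, counts as a token in this sketch. *)
let tokens ?(alphabet = false) (line : string) : string list =
  if alphabet then
    List.init (String.length line) (fun i -> String.make 1 line.[i])
  else begin
    let parts = String.split_on_char ' ' line in
    (* Consecutive spaces produce empty strings; drop them. *)
    List.filter (fun t -> t <> "") parts
  end

(* Count how often each token occurs across all documents. *)
let count_tokens ?(alphabet = false) (docs : string list) : (string, int) Hashtbl.t =
  let freq = Hashtbl.create 64 in
  List.iter
    (fun doc ->
      List.iter
        (fun t ->
          let n = Option.value (Hashtbl.find_opt freq t) ~default:0 in
          Hashtbl.replace freq t (n + 1))
        (tokens ~alphabet doc))
    docs;
  freq

let word_freq = count_tokens [ "the cat sat"; "the cat" ]
let char_freq = count_tokens ~alphabet:true [ "aab" ]
```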

val build_from_string : ?lo:float -> ?hi:float -> ?alphabet:bool -> ?stopwords:(string, 'a) Hashtbl.t -> string -> t

build_from_string is similar to build but builds the vocabulary from an input string rather than a file.

val trim_percent : lo:float -> hi:float -> t -> t

trim_percent ~lo ~hi v removes extremely low and high frequency words from vocabulary v, with the bounds given as percentages of frequency.

Parameters:
  • lo: the lower bound, as a percentage.
  • hi: the upper bound, as a percentage.

val trim_count : lo:int -> hi:int -> t -> t

trim_count ~lo ~hi v removes extremely low and high frequency words from vocabulary v, with the bounds given as absolute word counts.

Parameters:
  • lo: the lower bound on the number of occurrences.
  • hi: the upper bound on the number of occurrences.
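
Count-based trimming can be pictured as a filter over a word -> count table. The sketch below is not Owl's code: trim_by_count is a hypothetical helper, and treating both bounds as inclusive is an assumption, since this page does not say whether a word occurring exactly lo or hi times is kept.

```ocaml
(* Hypothetical helper: keep words whose count c satisfies lo <= c <= hi.
   Inclusive bounds are an assumption of this sketch. *)
let trim_by_count ~lo ~hi (freq : (string, int) Hashtbl.t) : (string, int) Hashtbl.t =
  let kept = Hashtbl.create (Hashtbl.length freq) in
  Hashtbl.iter
    (fun w c -> if c >= lo && c <= hi then Hashtbl.replace kept w c)
    freq;
  kept

let freq : (string, int) Hashtbl.t = Hashtbl.create 8

let () =
  List.iter
    (fun (w, c) -> Hashtbl.replace freq w c)
    [ ("the", 120); ("cat", 7); ("zyx", 1) ]

(* Drops the very common "the" and the very rare "zyx". *)
let trimmed = trim_by_count ~lo:2 ~hi:100 freq
```

The sketch returns a new table and leaves its input unchanged, mirroring the t -> t shape of the signature.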

val remove_stopwords : ('a, 'b) Hashtbl.t -> ('a, 'c) Hashtbl.t -> unit

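remove_stopwords stopwords v removes the stopwords defined in the hash table stopwords from v. Per the signature, v is itself a hash table keyed by word rather than a value of type t, and it is modified in place (the function returns unit).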
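
A function with the same type as the signature above can be written in one line with the standard library. The sketch below shows the in-place removal; drop_stopwords is a hypothetical name and the code is not taken from Owl.

```ocaml
(* Hypothetical helper with the same shape as remove_stopwords:
   every key of stopwords is removed from v, which is mutated in place. *)
let drop_stopwords (stopwords : ('a, 'b) Hashtbl.t) (v : ('a, 'c) Hashtbl.t) : unit =
  Hashtbl.iter (fun w _ -> Hashtbl.remove v w) stopwords

(* Only the keys of the stopword table matter; the values are ignored. *)
let stop : (string, unit) Hashtbl.t = Hashtbl.create 4
let () = List.iter (fun w -> Hashtbl.replace stop w ()) [ "the"; "a" ]

let vocab : (string, int) Hashtbl.t = Hashtbl.create 4
let () = List.iter (fun (w, i) -> Hashtbl.replace vocab w i) [ ("the", 0); ("cat", 1) ]

let () = drop_stopwords stop vocab
```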

val copy : t -> t

copy v makes a copy of vocabulary v.

val tokenise : t -> string -> int array

tokenise v s tokenises the string s according to the vocabulary v.
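
Tokenising against a vocabulary amounts to splitting the string and looking up each word. The sketch below assumes whitespace-separated words and assumes every word is present in the table; an unknown word raises Not_found here, and this page does not say what Owl does in that case. tokenise_with is a hypothetical helper working on a plain word -> index table.

```ocaml
(* Hypothetical helper: map each space-separated word of s to its index. *)
let tokenise_with (w2i : (string, int) Hashtbl.t) (s : string) : int array =
  let parts = String.split_on_char ' ' s in
  (* Ignore empty strings produced by repeated spaces. *)
  let ws = List.filter (fun w -> w <> "") parts in
  (* Hashtbl.find raises Not_found for a word missing from the table. *)
  Array.of_list (List.map (fun w -> Hashtbl.find w2i w) ws)

let w2i : (string, int) Hashtbl.t = Hashtbl.create 8

let () =
  List.iter
    (fun (w, i) -> Hashtbl.replace w2i w i)
    [ ("the", 0); ("cat", 1); ("sat", 2) ]

let ids = tokenise_with w2i "cat sat the cat"
```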

val w2i_to_tuples : t -> (string * int) list

w2i_to_tuples v converts vocabulary v to a list of (word, index) tuples.

val to_array : t -> (int * string) array

to_array v converts vocabulary v to an (index, word) array.

val of_array : (int * string) array -> t

of_array x converts an (index, word) array x to a vocabulary.
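
The two conversions are inverse to each other, which the following stand-alone sketch illustrates on a plain index -> word table. table_to_array and table_of_array are hypothetical names, and sorting the output by index is a choice made in this sketch; this page does not state the order of the array that to_array returns.

```ocaml
(* Hypothetical helper: flatten an index -> word table into an array,
   sorted by index so that the result is deterministic. *)
let table_to_array (i2w : (int, string) Hashtbl.t) : (int * string) array =
  let a = Array.of_list (Hashtbl.fold (fun i w acc -> (i, w) :: acc) i2w []) in
  Array.sort compare a;
  a

(* Hypothetical helper: rebuild the table from (index, word) pairs. *)
let table_of_array (a : (int * string) array) : (int, string) Hashtbl.t =
  let t = Hashtbl.create (Array.length a) in
  Array.iter (fun (i, w) -> Hashtbl.replace t i w) a;
  t

let pairs = [| (0, "the"); (1, "cat"); (2, "sat") |]

(* Converting to a table and back yields the original pairs. *)
let round_trip = table_to_array (table_of_array pairs)
```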

I/O functions

val save : t -> string -> unit

save v fname serialises the vocabulary v and saves it to a file named fname.

val load : string -> t

load fname loads the serialised vocabulary from a file of name fname.
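
A save and load round trip can be sketched with the standard Marshal module. This is an illustration of binary serialisation in general: this page does not state which mechanism Owl uses, and save_table and load_table are hypothetical helpers that work on a plain word -> index table rather than on type t.

```ocaml
(* Hypothetical helper: write the table to fname in binary form. *)
let save_table (t : (string, int) Hashtbl.t) (fname : string) : unit =
  let oc = open_out_bin fname in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () -> Marshal.to_channel oc t [])

(* Hypothetical helper: read the table back. Marshal is not type-safe,
   so the annotation must match the type of the value that was saved. *)
let load_table (fname : string) : (string, int) Hashtbl.t =
  let ic = open_in_bin fname in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () -> (Marshal.from_channel ic : (string, int) Hashtbl.t))

let original : (string, int) Hashtbl.t = Hashtbl.create 4
let () = Hashtbl.replace original "cat" 1

(* Round trip through a temporary file, then clean up. *)
let path = Filename.temp_file "vocab" ".bin"
let () = save_table original path
let restored = load_table path
let () = Sys.remove path
```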

val save_txt : t -> string -> unit

save_txt v fname saves the vocabulary v in text format to a file named fname.

val to_string : t -> string

to_string v returns summary information about vocabulary v.

val pp_vocab : Format.formatter -> t -> unit

Pretty printer for vocabulary type.
