Owl_nlp_tfidf
NLP: TFIDF module
val length : t -> int
Size of the TFIDF model, i.e. the number of documents it contains.
val term_freq : tf_typ -> float -> float -> float
term_freq term_count num_words calculates the term frequency weight.
val doc_freq : df_typ -> float -> float -> float
doc_freq doc_count num_docs calculates the document frequency weight.
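As a rough illustration of how the two weights combine into a TFIDF score, the sketch below uses only the standard library. The types `tf_sketch` and `df_sketch` and all function names are hypothetical stand-ins, and the formulas are the textbook ones (raw count and log (num_docs / doc_count)); this reference does not state the exact formulas the module uses, so treat them as assumptions.

```ocaml
(* Hypothetical stand-ins for tf_typ and df_typ. Only the constructors
   named in this reference (Count and Idf) are sketched. *)
type tf_sketch = Count
type df_sketch = Idf

(* Assumption: Count uses the raw term count as the weight. *)
let term_freq_sketch typ term_count _num_words =
  match typ with
  | Count -> term_count

(* Assumption: Idf is the textbook log (num_docs / doc_count). *)
let doc_freq_sketch typ doc_count num_docs =
  match typ with
  | Idf -> log (num_docs /. doc_count)

(* TFIDF weight of one term in one document: tf weight times df weight. *)
let weight ~term_count ~num_words ~doc_count ~num_docs =
  term_freq_sketch Count term_count num_words
  *. doc_freq_sketch Idf doc_count num_docs
```

Under these assumptions, a term that appears in every document gets weight 0, since log 1 = 0.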
val get_uri : t -> string
Return the path of the TFIDF model.
val get_corpus : t -> Owl_nlp_corpus.t
Return the corpus contained in the TFIDF model.
val vocab_len : t -> int
Return the size of the vocabulary contained in the TFIDF model.
val get_handle : t -> Stdlib.in_channel
Get the file handle associated with the TFIDF model.
val doc_count_of : t -> string -> float
doc_count_of tfidf w calculates the document frequency for a given word w.
val doc_count : Owl_nlp_vocabulary.t -> string -> float array * int
doc_count vocab fname counts, for every word in vocab, its occurrences across all documents contained in the raw text corpus of file fname.
term_count count doc counts the term occurrences in document doc and saves the result in the hash table count.
val density : t -> float
Return the percentage of non-zero elements in the document-term matrix.
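A minimal sketch of the density computation, assuming the model is available as a plain array of sparse document vectors. The name `density_sketch` and its arguments are hypothetical, and the value is returned as a fraction between 0 and 1; whether the module scales it to 0-100 is not stated here.

```ocaml
(* docs: one sparse (vocabulary index, weight) array per document.
   Density = stored entries / (number of documents * vocabulary size). *)
let density_sketch vocab_len docs =
  let nnz = Array.fold_left (fun acc d -> acc + Array.length d) 0 docs in
  float_of_int nnz
  /. (float_of_int (Array.length docs) *. float_of_int vocab_len)
```

The sketch assumes every stored entry is non-zero, which is the usual convention for sparse vectors.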
val doc_to_vec :
(float, 'a) Stdlib.Bigarray.kind ->
t ->
(int * float) array ->
(float, 'a) Owl_dense.Ndarray.Generic.t
doc_to_vec kind tfidf vec converts a TFIDF vector from its sparse representation to a dense ndarray vector whose length equals the vocabulary size.
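The sparse-to-dense conversion can be sketched with a plain float array standing in for the ndarray. The name `dense_of_sparse` is hypothetical, and the vocabulary size is passed in directly instead of being read from a model.

```ocaml
(* Expand a sparse (vocabulary index, weight) array into a dense float
   array of length vocab_len; indices that are absent stay at 0. *)
let dense_of_sparse vocab_len sparse =
  let dense = Array.make vocab_len 0. in
  Array.iter (fun (i, w) -> dense.(i) <- w) sparse;
  dense
```

For example, with a vocabulary of five words, the sparse vector `[| (1, 0.5); (3, 2.) |]` becomes a dense vector with 0.5 at index 1, 2.0 at index 3, and zeros elsewhere.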
val get : t -> int -> (int * float) array
Return the ith TFIDF vector in the model. The return format is an array of (vocabulary index, weight) tuples for a document.
val next : t -> (int * float) array
Return the next document vector in the model. The return format is an array of (vocabulary index, weight) tuples for a document.
val next_batch : ?size:int -> t -> (int * float) array array
Return the next batch of document vectors in the model. The default size is 100.
val iteri : (int -> (int * float) array -> unit) -> t -> unit
Iterate over all the document vectors in a TFIDF model. Each document vector is an array of (vocabulary index, weight) tuples.
val mapi : (int -> (int * float) array -> 'a) -> t -> 'a array
Map over all the document vectors in a TFIDF model. Each document vector is an array of (vocabulary index, weight) tuples.
val reset_iterators : t -> unit
Reset the iterator to the beginning of the TFIDF model.
val build :
?norm:bool ->
?sort:bool ->
?tf:tf_typ ->
?df:df_typ ->
Owl_nlp_corpus.t ->
t
This function builds a TFIDF model according to the parameters passed in.
Parameters:
* norm: whether to normalise the vectors in the TFIDF model. The default is false.
* sort: whether to sort the terms in a TFIDF vector in increasing order of their vocabulary indices. The default is false.
* tf: type of term frequency used in building the TFIDF model. The default is Count.
* df: type of document frequency used in building the TFIDF model. The default is Idf.
* corpus: the corpus, built by the Owl_nlp_corpus module, on top of which the TFIDF model will be built.
val save : t -> string -> unit
save tfidf fname saves the TFIDF model to a file with the given name fname.
val load : string -> t
load fname loads a TFIDF model from the file named fname.
val to_string : t -> string
Convert a TFIDF model to its string representation, which contains summary information.
val print : t -> unit
Pretty-print the summary information of a TFIDF model.
val tf_typ_string : tf_typ -> string
Convert a term frequency type into a string.
val df_typ_string : df_typ -> string
Convert a document frequency type into a string.
val apply : t -> string -> (int * float) array
Convert a single document into a TFIDF vector according to a given model.
normalise x makes x a unit vector by dividing it by its L2 norm.
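A minimal sketch of L2 normalisation on a sparse vector, using only the standard library. The name `l2_normalise` is hypothetical, and unlike the description above this version returns a new array instead of modifying its argument. Leaving an all-zero vector unchanged is also an assumption made here to avoid dividing by zero.

```ocaml
(* Divide every weight by the Euclidean (L2) norm of the vector.
   An all-zero vector is returned unchanged. *)
let l2_normalise x =
  let norm =
    sqrt (Array.fold_left (fun acc (_, w) -> acc +. (w *. w)) 0. x)
  in
  if norm = 0. then x else Array.map (fun (i, w) -> (i, w /. norm)) x
```

For instance, the vector with weights 3 and 4 has norm 5, so its normalised weights are 0.6 and 0.8.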
val create : tf_typ -> df_typ -> Owl_nlp_corpus.t -> t
Wrap up a TFIDF model type. This is a low-level function that you are not supposed to use directly.
val all_pairwise_distance :
Owl_nlp_similarity.t ->
t ->
('a * float) array ->
(int * float) array
Calculate the pairwise distances for the whole model. The return format is an (id, dist) array.
val nearest :
?typ:Owl_nlp_similarity.t ->
t ->
('a * float) array ->
int ->
(int * float) array
Return the K nearest neighbours. This is very slow because it uses a linear search.
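To show why a linear search is slow, the sketch below scores the query against every document, sorts all the scores, and keeps the first k. It works on a plain array of sparse vectors instead of a model; `cosine_distance` and `nearest_sketch` are hypothetical names, and cosine distance is chosen only for illustration, not because it is known to be the module's default.

```ocaml
(* Cosine distance between two sparse (index, weight) vectors.
   The first vector is loaded into a hash table for index lookup. *)
let cosine_distance a b =
  let tbl = Hashtbl.create (Array.length a) in
  Array.iter (fun (i, w) -> Hashtbl.replace tbl i w) a;
  let dot =
    Array.fold_left
      (fun acc (i, w) ->
        match Hashtbl.find_opt tbl i with
        | Some v -> acc +. (v *. w)
        | None -> acc)
      0. b
  in
  let norm v =
    sqrt (Array.fold_left (fun acc (_, w) -> acc +. (w *. w)) 0. v)
  in
  let denom = norm a *. norm b in
  if denom = 0. then 1. else 1. -. (dot /. denom)

(* Linear-search k-nearest neighbours: compute the distance to every
   document, sort by distance, and keep the k closest (id, dist) pairs. *)
let nearest_sketch docs query k =
  let scored = Array.mapi (fun id d -> (id, cosine_distance query d)) docs in
  Array.sort (fun (_, d1) (_, d2) -> compare d1 d2) scored;
  Array.sub scored 0 (min k (Array.length scored))
```

Each query touches every document once, so the cost grows linearly with the number of documents, plus the cost of sorting the scores.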