Owl_nlp_corpus
NLP: Corpus module

val length : t -> int
Return the size of the corpus, i.e. the number of documents.

val get : t -> int -> string
Return the ith document in the corpus.

val get_tok : t -> int -> int array
Return the ith tokenised document in the corpus.

val get_uri : t -> string
Return the path of the corpus.

val get_bin_uri : t -> string
Return the path of the binary format of the corpus.

val get_bin_fh : t -> Stdlib.in_channel
Return the file handle of the binary format of the corpus.

val get_tok_uri : t -> string
Return the path of the tokenised corpus.

val get_tok_fh : t -> Stdlib.in_channel
Return the file handle of the tokenised corpus.

val get_vocab_uri : t -> string
Return the path of the vocabulary file associated with the corpus.

val get_vocab : t -> Owl_nlp_vocabulary.t
Return the vocabulary associated with the corpus.

val get_docid : t -> int array
Return an array of document ids, which map the documents back to the original file from which the corpus was built.

val next : t -> string
Return the next document in the corpus.

val next_tok : t -> int array
Return the next tokenised document in the corpus.

val iteri : (int -> string -> unit) -> t -> unit
Iterate over all the documents in the corpus. The index (line number) is passed in.

val iteri_tok : (int -> int array -> unit) -> t -> unit
Iterate over the tokenised documents in the corpus. The index (line number) is passed in.

val mapi : (int -> string -> 'a) -> t -> 'a array
Map all the documents in a corpus into another array. The index (line number) is passed in.

val mapi_tok : (int -> 'a -> 'b) -> t -> 'b array
Map all the tokenised documents in a corpus into another array. The index (line number) is passed in.

val next_batch : ?size:int -> t -> string array
Return the next batch of documents in a corpus as a string array. The default size is 100.

val next_batch_tok : ?size:int -> t -> int array array
Return the next batch of tokenised documents in a corpus as an array of int arrays. The default size is 100.

val reset_iterators : t -> unit
Reset the iterators to the beginning of the corpus.
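
The batch functions lend themselves to a simple drain loop. The following is a minimal sketch, written as a functor over only the two declarations it uses so that it type-checks without Owl installed. It assumes that `next_batch` returns an empty array once the corpus is exhausted; the descriptions above do not state the end-of-corpus behaviour, so this should be verified. The names `BATCHED`, `Drain` and `fold_batches` are hypothetical and not part of Owl.

```ocaml
(* The subset of the corpus interface used below, copied from the
   declarations in this page. *)
module type BATCHED = sig
  type t
  val next_batch : ?size:int -> t -> string array
  val reset_iterators : t -> unit
end

(* Hypothetical helper, not part of Owl. *)
module Drain (C : BATCHED) = struct
  (* Fold [f] over every document, fetching [size] documents at a time.
     ASSUMPTION: [next_batch] returns an empty array at the end of the
     corpus instead of raising an exception. *)
  let fold_batches ?(size = 100) f init corpus =
    (* Start from the first document regardless of earlier reads. *)
    C.reset_iterators corpus;
    let rec loop acc =
      let batch = C.next_batch ~size corpus in
      if Array.length batch = 0 then acc
      else loop (Array.fold_left f acc batch)
    in
    let result = loop init in
    (* Leave the iterators at the beginning for the next caller. *)
    C.reset_iterators corpus;
    result
end
```

With Owl installed, `module D = Drain (Owl_nlp_corpus)` would be the intended instantiation, after which `D.fold_batches (fun n doc -> n + String.length doc) 0 corpus` sums the document lengths. Resetting the iterators before and after the loop is a design choice of this sketch, made so the result does not depend on earlier calls to `next` or `next_batch`.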
val build :
  ?docid:int array ->
  ?stopwords:(string, 'a) Stdlib.Hashtbl.t ->
  ?lo:float ->
  ?hi:float ->
  ?vocab:Owl_nlp_vocabulary.t ->
  ?minlen:int ->
  string ->
  t
This function builds up a corpus of type t from a given raw text corpus. We assume that each line in the raw text corpus represents a document.

Parameters:
* ?docid: passed in docid can be used for tracking back to the original corpus, but this is not compulsory.
* ?stopwords: stopwords used in building vocabulary.
* ?lo: any word below this lower bound of the frequency is removed from vocabulary.
* ?hi: any word above this upper bound of the frequency is removed from vocabulary.
* ?vocab: an optional vocabulary, if it is not passed, the vocabulary is built from current corpus.
* ?(minlen=10): threshold of the document length, any document shorter than this is removed from the corpus.
* fname: the file name of the raw text corpus.
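
Two inputs to `build` have to be prepared by the caller: a raw text file with one document per line, and optionally a stopword table. The sketch below uses only the OCaml standard library to produce both. The helper names `stopwords_of_list` and `write_raw_corpus` are hypothetical, and storing `()` as the table value assumes that only the keys matter, which the `'a` in the signature suggests but the description does not confirm.

```ocaml
(* Build a table in the shape expected by ?stopwords. The keys are the
   words; the value type is unconstrained in the signature, so unit is
   used here (ASSUMPTION: only key membership is consulted). *)
let stopwords_of_list words =
  let tbl = Hashtbl.create 16 in
  List.iter (fun w -> Hashtbl.replace tbl w ()) words;
  tbl

(* Write [docs] to [fname], one document per line. Embedded line breaks
   are replaced by spaces so that each document stays on a single line. *)
let write_raw_corpus fname docs =
  let oc = open_out fname in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () ->
      List.iter
        (fun doc ->
          let one_line =
            String.map (fun c -> if c = '\n' || c = '\r' then ' ' else c) doc
          in
          output_string oc one_line;
          output_char oc '\n')
        docs)
```

With Owl installed, these helpers would feed the function along the lines of `Owl_nlp_corpus.build ~stopwords:(stopwords_of_list ["the"; "a"]) ~minlen:5 "corpus.txt"`, where the file name and the threshold are illustrative values only.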
val tokenise : t -> string -> int array
`tokenise corpus doc` tokenises the document `doc` using the corpus and its associated vocabulary.

Remove the duplicates in a text corpus; the ids of the removed files are returned.

`preprocess f input_file output_file` pre-processes a given file `input_file` with the passed-in function `f`, then saves the output to `output_file`.
E.g., you can plug in the `simple_process` function to clean up the text. Note that this function will not change the number of lines in a corpus.
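
The line-preserving behaviour described above can be modelled with the standard library alone. The sketch below is not Owl's implementation: `preprocess_lines` is a hypothetical stand-in for `preprocess`, `clean` is a hypothetical cleaning function, and the type `string -> string` for `f` is an assumption, since the signature is not shown on this page.

```ocaml
(* Example cleaner (ASCII only): lower-case letters, keep digits, and turn
   every other byte into a space. The length of the line is unchanged. *)
let clean line =
  String.map
    (fun c ->
      match c with
      | 'a' .. 'z' | '0' .. '9' -> c
      | 'A' .. 'Z' -> Char.lowercase_ascii c
      | _ -> ' ')
    line

(* Hypothetical stand-in for preprocess: apply [f] to each line of
   [input_file] and write the result as one line of [output_file].
   The line count is preserved as long as [f] never returns a string
   containing a line break. *)
let preprocess_lines f input_file output_file =
  let ic = open_in input_file in
  let oc = open_out output_file in
  Fun.protect
    ~finally:(fun () -> close_in ic; close_out oc)
    (fun () ->
      try
        while true do
          output_string oc (f (input_line ic));
          output_char oc '\n'
        done
      with End_of_file -> ())
```

Because each input line produces exactly one output line, document ids that refer to line numbers remain valid after cleaning.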
val save : t -> string -> unit
Serialise the corpus and save it to a file of the given name.

val load : string -> t
Load a serialised corpus from a file.

val save_txt : t -> string -> unit
Convert the tokenised corpus back to a text file.

val to_string : t -> string
The string representation of a corpus, which contains the summary of the corpus.

val print : t -> unit
Pretty print the summary of a text corpus.

val create :
  string ->
  int array ->
  int array ->
  Stdlib.in_channel option ->
  Stdlib.in_channel option ->
  Owl_nlp_vocabulary.t option ->
  int ->
  int array ->
  t
`create uri bin_ofs tok_ofs bin_fh tok_fh vocab minlen docid` wraps up the corpus into a record of type t.

val cleanup : t -> unit
Close the opened file handles associated with the corpus.
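
Since a corpus holds open file handles, it can be useful to tie `cleanup` to a scope. The following is a minimal sketch, again written as a functor over only the declarations it needs so that it compiles without Owl. `LOADABLE`, `Scoped` and `with_corpus` are hypothetical names, and calling `cleanup` exactly once after use is an assumption about the intended usage.

```ocaml
(* The subset of the corpus interface used below, copied from the
   declarations in this page. *)
module type LOADABLE = sig
  type t
  val load : string -> t
  val cleanup : t -> unit
end

(* Hypothetical helper, not part of Owl. *)
module Scoped (C : LOADABLE) = struct
  (* Load the corpus stored in [fname], run [f] on it, and close its file
     handles afterwards, even if [f] raises an exception. *)
  let with_corpus fname f =
    let corpus = C.load fname in
    Fun.protect
      ~finally:(fun () -> C.cleanup corpus)
      (fun () -> f corpus)
end
```

With Owl installed, `module S = Scoped (Owl_nlp_corpus)` would be the intended instantiation, and `S.with_corpus "my.corpus" Owl_nlp_corpus.length` would return the number of documents and then release the handles. The file name is illustrative.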