Natural Language Processing
Text is a dominant media type on the Internet along with images, videos, and audios. Many of our day-to-day tasks involve text analysis. Natural language processing is a powerful tool to extract insights from text copora.
This chapter focusses on the vector space models and topic modelling.
TODO: this chapter now mainly lacks general text introduction of NLP.
Survey the literature, give a high-level picture of NLP. Talk about classic NLP … structured and unstructured text …
In this chapter, we will use a news dataset crawled from the Internet. It contains 130000 pieces of news from various sources, each line in the file represents one entry.
We call a collection of documents a text corpus.
Show how to build a corpus from a collection of documents, in a step step way.
Preprocess the text, convert all the text into lowercase.
let simple_process s = Str.split Owl_nlp_utils.regexp_split s |> List.filter (fun x -> String.length x > 1) |> String.concat " " |> String.lowercase_ascii |> Bytes.of_string let preprocess input_file = let output_file = input_file ^ ".output" in Nlp.Corpus.preprocess simple_process input_file output_file
Then let’s build vocabulary out of the text corpus.
let build_vocabulary input_file = let vocab = Nlp.Vocabulary.build input_file in (* print out the words of highest frequency *) Nlp.Vocabulary.top vocab 10 |> Owl.Utils.Array.to_string ~sep:", " fst; (* print out the words of lowest frequency*) Nlp.Vocabulary.bottom vocab 10 |> Owl.Utils.Array.to_string ~sep:", " fst; (* save the vocabulary *) let output_file = input_file ^ ".vocab" in Nlp.Vocabulary.save vocab output_file
The progress of building the vocabulary is printed out. After the vocabular is built, the token of the highest frequency is printed out.
2020-01-28 20:29:41.592 INFO : processed 13485, avg. 2696 docs/s 2020-01-28 20:29:46.593 INFO : processed 24975, avg. 2497 docs/s 2020-01-28 20:29:51.593 INFO : processed 39728, avg. 2648 docs/s 2020-01-28 20:29:56.595 INFO : processed 53277, avg. 2663 docs/s 2020-01-28 20:30:01.595 INFO : processed 62908, avg. 2516 docs/s 2020-01-28 20:30:06.595 INFO : processed 69924, avg. 2330 docs/s ... - : string = "the, to, of, a, and, in, \", s, that, on" "eichorst, gcs, freeross, depoliticisation, humping, shopable, appurify, intersperse, vyaecheslav, raphaelle"
Now let’s trim off some most and least frequency words. You can trim either by absolute number or by percent. We use percent here, namely triming off top and bottom 1% of the words.
let trim_vocabulary vocab = Nlp.Vocabulary.trim_percent ~lo:0.01 ~hi:0.01 vocab
With a vocabulary at hands, now we are ready to tokenise a piece of text.
let tokenise vocab text = String.split_on_char ' ' text |> List.map (Nlp.Vocabulary.word2index vocab)
For example, if we tokenise “this is owl book”, you will get the following output.
tokenise vocab "this is an owl book";; - : int list = [55756; 18322; 109456; 90661; 22362]
Furthermore, we can now tokenise the whole news collection.
let tokenise_all vocab input_file = let doc_s = Owl_utils.Stack.make () in Owl_io.iteri_lines_of_file (fun i s -> let t = Str.split Owl_nlp_utils.regexp_split s |> List.filter (Owl_nlp_vocabulary.exits_w vocab) |> List.map (Owl_nlp_vocabulary.word2index vocab) |> Array.of_list in Owl_utils.Stack.push doc_s i) input_file; doc_s
Even though this is a simplified case, it well illustrates the typical starting point of text analysis before delving into any topic modelling.
Use Convenient Function
Show how to build corpus using a convenient function in
Nlp.Corpus. The convenient function builds the dictionary and tokienise the text corpus at the same time. You can further specify how to trim off the high-frequency and low-frenquency words when calling the function.
let main () = (* remove duplicates *) let ids = Nlp.Corpus.unique "news.txt" "clean.txt" in Printf.printf "removed %i duplicates." (Array.length ids); (* build vocabulary and tokenise *) let corpus = Nlp.Corpus.build ~lo:0.01 ~hi:0.01 "clean.txt" in Nlp.Corpus.save corpus "news.corpus"; Nlp.Corpus.print corpus
The output is like this …
2020-01-28 19:07:05.461 INFO : build up vocabulary ... 2020-01-28 19:07:10.461 INFO : processed 13587, avg. 2717 docs/s 2020-01-28 19:07:15.463 INFO : processed 26447, avg. 2644 docs/s 2020-01-28 19:07:20.463 INFO : processed 43713, avg. 2913 docs/s 2020-01-28 19:07:25.463 INFO : processed 57699, avg. 2884 docs/s 2020-01-28 19:07:30.463 INFO : processed 65537, avg. 2621 docs/s 2020-01-28 19:07:35.463 INFO : processed 76199, avg. 2539 docs/s ... 2020-01-28 19:08:09.125 INFO : convert to binary and tokenise ... 2020-01-28 19:08:14.126 INFO : processed 16756, avg. 3350 docs/s 2020-01-28 19:08:19.126 INFO : processed 32888, avg. 3288 docs/s 2020-01-28 19:08:24.127 INFO : processed 43110, avg. 2873 docs/s 2020-01-28 19:08:29.127 INFO : processed 48362, avg. 2417 docs/s 2020-01-28 19:08:34.130 INFO : processed 52628, avg. 2104 docs/s 2020-01-28 19:08:39.132 INFO : processed 55727, avg. 1857 docs/s ... corpus info file path : news.txt # of docs : 129968 doc minlen : 10 - : unit = ()
save function will create several files, explain these files …
Teach how to use
Nlp.Corpus.next and etc. to iterate and map documents.
Vector Space Models
Survey, explain what is VSM, documents become vectors, i.e. a point in high-dimensional space. With VSM, we can cluster the documents based on their proximity, i.e. similarity.
Bag of Words (BOW)
Explain what is BOW, simply counting the frequency. What are the pros and cons of this method?
(* count the term occurrency in a document *) let term_count htbl doc = Array.iter (fun w -> match Hashtbl.mem htbl w with | true -> let a = Hashtbl.find htbl w in Hashtbl.replace htbl w (a +. 1.) | false -> Hashtbl.add htbl w 1.) doc (* build bag-of-words for the corpus *) let build_bow corpus = Nlp.Corpus.mapi_tok (fun i doc -> let htbl = Hashtbl.create 128 in term_count htbl; htbl) corpus
Term Frequency–Inverse Document Frequency (TFIDF)
Explain what is TFIDF, mention Cambridge Wolfson fellow. The corpus we have built in the previous section is used as input to the following function.
Explain why TFIDF is better than BOW, what is the motivation. Give an examble to illustrate.
let build_tfidf corpus = (* configure and build the model *) let tf = Nlp.Tfidf.Count in let df = Nlp.Tfidf.Idf in let model = Nlp.Tfidf.build ~tf ~df corpus in (* print and save model *) Nlp.Tfidf.print model; Nlp.Tfidf.save model "news.tfidf"; model
After the model is build, illustrate how to find k similar documents. The following exmaple uses consine similarity, then convert a document into vector using previously trained TFIDF model. Note do NOT teach how to index and how the similarity is calculted here, teach in Indexing and Searching section.
let query model doc k = (* TODO: change api *) let typ = Owl_nlp_similarity.Cosine in let vec = Nlp.Tfidf.apply model doc in let knn = Nlp.Tfidf.nearest ~typ model vec k in knn
Latent Dirichlet Allocation (LDA)
Explain what is LDA.
topics is the number of topics. Owl supports the following types of LDA algorithms.
type lda_typ = | SimpleLDA | FTreeLDA | LightLDA | SparseLDA
How to train an LDA model.
(* change to ocaml when image progation finished *) let build_lda corpus topics = let model = Nlp.Lda.init ~iter:1000 topics in let lda_typ = Nlp.Lda.SparseLDA in Nlp.Lda.train lda_typ model; Owl.Log.info "LDA training finished."
Latent Semantic Analysis (LSA)
Explain what is LSA, and how it differs from LDA wrt to derived topics.
Read * https://en.wikipedia.org/wiki/Latent_semantic_analysis * https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/
Indexing and Searching
Topic models are effective tools for clustering documents based on their similarity or relevance. We can further use this tool to query relevant document given an input one. In this section, we will go through some techniques on how to index and query model built using the previous topic modeling method.
Euclidean and Consine Similarity
Define what is euclidean and consine similarity. Emphasise both are correlated on a high-dimensional ball model.
TODO: use an image to illustrate.
First implement linear search, in this case, we do not need index at all, but it is very slow. In the following, the corpus is an array of arrays (each of which is an document).
(* calculate pairwise distance for the whole model, format (id,dist) *) let all_pairwise_distance typ corpus x = let dist_fun = Owl_nlp_similarity.distance typ in let l = Array.mapi (fun i y -> i, dist_fun x y) corpus in Array.sort (fun a b -> Stdlib.compare (snd a) (snd b)) l; l (* return the k most relevant documents *) let query corpus doc k = let typ = Owl_nlp_similarity.Cosine in let l = all_pairwise_distance typ corpus doc in Array.sub l 0 k
Use Matrix Multiplication
Show that pairwise distance can be done in a matrix multiplication, which is often highly-optimised GEMM operation. We assume the corpus has been converted into a dense matrix, wherein each row vector represents a document.
let query corpus doc k = let vec = Mat.transpose doc in let l = Mat.(corpus *@ vec) in Mat.bottom l k
NOTE: give an image illustration on what is random project, but no need to implement. We will reserve in-depth discussion in Recommender System Chapter.
Only give a teaser here which goes like: “Linear Search does not rely on data structure, but is slow; matrix multiplication is paralleled version LS, faster, but still compute all the pairwise. If you want to simplify computation, you need to use index, and Random Projection is a widely used index method in industry. It is a big topic, and we will cover that in detail in the case chapter.”