Internal Utility Modules
During development of Owl, we find some utility modules are immensely handy. In this chapter, we share some of them. These are not the main feature of Owl, and perhaps you can implement your own version very quickly. But we hope to present how these features are used in Owl.
Dataset Module
The dataset modules provide easy access to various datasets to be used in Owl, mainly the MNSIT and CIFAR10 datasets. You can get all these data in Owl by executing: Dataset.download_all ()
. The data are downloaded in the home directory, for example, ~/.owl/dataset
on Linux.
MNIST
The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples. Each example is of size 28 x 28. It is a good starting point for deep neural network related tasks. You can get MNIST data via these Owl functions:
Dataset.load_mnist_train_data ()
: returns a tripletx, y, y'
.x
is a [60000, 784] ndarray (Owl_dense_ndarray.S.mat
) where each row represents a [28, 28] image.y
is a [60000, 1] label ndarray. Each row is an integer ranging from 0 to 9, indicating the digit on each image.y'
is a [60000, 10] label ndarray. Each one-hot row vector corresponds to one label.
Dataset.load_mnist_test_data ()
: returns a triplet. Similar toload_mnist_train_data
, only that it returns the test set, so the example size is 10,000 instead of 60,000.Dataset.load_mnist_train_data_arr ()
: similar toload_mnist_train_data
, but returnsx
as [60000,28,28,1] ndarrayDataset.load_mnist_test_data_arr ()
: similar toload_mnist_train_data_arr
, but it returns the test set, so the example size is 10, 000 instead of 60, 000.Dataset.draw_samples x y n
drawsn
random examples from images ndarrayx
and label ndarrayy
.
Here is what the dataset looks like when loaded into Owl:
# let x, _, y = Dataset.load_mnist_train_data_arr ();;
val x : Owl_dense_ndarray.S.arr =
C0
R[0,0,0] 0
R[0,0,1] 0
R[0,0,2] 0
R[0,0,3] 0
R[0,0,4] 0
...
R[59999,27,23] 0
R[59999,27,24] 0
R[59999,27,25] 0
R[59999,27,26] 0
R[59999,27,27] 0
val y : Owl_dense_matrix.S.mat =
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
R0 0 0 0 0 0 1 0 0 0 0
R1 1 0 0 0 0 0 0 0 0 0
R2 0 0 0 0 1 0 0 0 0 0
R3 0 1 0 0 0 0 0 0 0 0
R4 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ...
R59995 0 0 0 0 0 0 0 0 1 0
R59996 0 0 0 1 0 0 0 0 0 0
R59997 0 0 0 0 0 1 0 0 0 0
R59998 0 0 0 0 0 0 1 0 0 0
R59999 0 0 0 0 0 0 0 0 1 0
You can find the MNIST dataset used in training and testing a DNN in Owl:
let train network =
let x, _, y = Dataset.load_mnist_train_data_arr () in
Graph.train network x y |> ignore;
network
let test network =
let imgs, _, labels = Dataset.load_mnist_test_data () in
let m = Dense.Matrix.S.row_num imgs in
let imgs = Dense.Ndarray.S.reshape imgs [|m;28;28;1|] in
let mat2num x = Dense.Matrix.S.of_array (
x |> Dense.Matrix.Generic.max_rows
|> Array.map (fun (_,_,num) -> float_of_int num)
) 1 m
in
let pred = mat2num (Graph.model network imgs) in
let fact = mat2num labels in
let accu = Dense.Matrix.S.(elt_equal pred fact |> sum') in
Owl_log.info "Accuracy on test set: %f" (accu /. (float_of_int m))
CIFAR-10
The CIFAR-10 dataset include small scale color images for more realistic complex image classification tasks. It includes 10 classes of images: aeroplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. It consists of 60,000 32 x 32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
Due to the limit of file size on GitHub, the training set is cut into 5 smaller batches. You can get CIFAR-10 data using Owl
:
Dataset.load_cifar_train_data batch
: returns a tripletx, y, y'
.- The input paramter
batch
can range from 1 to 5, indicating which training set batch to choose. x
is an [10000, 32, 32, 3] ndarray (Owl_dense_ndarray.S.arr
). The last dimension indicates color channels (first Red, then Green, finally Blue).y
is an [10000, 1] label ndarray, each number representing an image class.y'
is the corresponding [10000, 10] one-hot label ndarray.
- The input paramter
Dataset.load_cifar_test_data ()
: similar toload_cifar_train_data
, only that it loads test data.Dataset.draw_samples_cifar x y n
drawsn
random examples from images ndarrayx
and label ndarrayy
.
Note that all elements in the loaded matrices and ndarrays are of float32
format. The CIFAR10 dataset can be loaded in a similar way as MNIST:
let train network =
let x, _, y = Dataset.load_cifar_train_data 1 in
Graph.train network x y
Graph Module
The Graph module in Owl provides a general data structure to manipulate graphs. It is defined as:
type 'a node =
{ mutable id : int
; (* unique identifier *)
mutable name : string
; (* name of the node *)
mutable prev : 'a node array
; (* parents of the node *)
mutable next : 'a node array
; (* children of the node *)
mutable attr : 'a (* indicate the validity *)
}
The attribution here is generic so that you can define your own graph where each node contains an integer, a string, or any data type you define. This makes the graph module extremely flexible.
Graph module provides a rich set of APIs. First, you can build a Graph using these methods:
node ~id ~name ~prev ~next attr
creates a node with given id and name string. The created node is also connected to parents inprev
and children innext
. Theattr
will be saved inattr
field.connect parents children
connects a set of parents to a set of children. The created links are the Cartesian product of parents and children. In other words, they are bidirectional links between parents and children. Note that this function does not eliminate any duplicates in the array.connect_descendants parents children
connects parents to their children. This function creates unidirectional links from parents to children. In other words, this function saveschildren
toparent.next
field.connect_ancestors parents children
connects children to their parents. This function creates unidirectional links from children to parents. In other words, this function savesparents
tochild.prev
field.remove_node x
removes nodex
from the graph by disconnecting itself from all its parent nodes and child nodes.remove_edge src dst
removes a linksrc -> dst
from the graph. Namely, the corresponding entry ofdst
insrc.next
andsrc
indst.prev
will be removed. Note that it does not remove [dst -> src] if there exists one.replace_child x y
replacesx
withy
inx
parents. Namely,x
parents now make link toy
rather thanx
innext
field. Note that the function does not make link fromy
tox
children. Namely, thenext
field ofy
remains intact.
Then, to obtain and update properties of a graph using these functions:
val id : 'a node -> int
(** ``id x`` returns the id of node ``x``. *)
val name : 'a node -> string
(** ``name x`` returns the name string of node ``x``. *)
val set_name : 'a node -> string -> unit
(** ``set_name x s`` sets the name string of node ``x`` to ``s``. *)
val parents : 'a node -> 'a node array
(** ``parents x`` returns the parents of node ``x``. *)
val set_parents : 'a node -> 'a node array -> unit
(** ``set_parents x parents`` set ``x`` parents to ``parents``. *)
val children : 'a node -> 'a node array
(** ``children x`` returns the children of node ``x``. *)
val set_children : 'a node -> 'a node array -> unit
(** ``set_children x children`` sets ``x`` children to ``children``. *)
val attr : 'a node -> 'a
(** ``attr x`` returns the ``attr`` field of node ``x``. *)
val set_attr : 'a node -> 'a -> unit
(** ``set_attr x`` sets the ``attr`` field of node ``x``. *)
Similarly, you can get other properties of a graph use the other functions:
indegree x
returns the in-degree of nodeoutdegree x
returns the out-degree of nodedegree x
returns the total number of links ofx
num_ancestor x
returns the number of ancestors ofx
num_descendant x
returns the number of descendants ofx
length x
returns the total number of ancestors and descendants ofx
Finally, we provide functions for traversing the graph in either Breadth-First order or Depth-First order. You can also choose to iterate the descendants or ancestors of a given node.
val iter_ancestors
: ?order:order
-> ?traversal:traversal
-> ('a node -> unit)
-> 'a node array
-> unit
(** Iterate the ancestors of a given node. *)
val iter_descendants
: ?order:order
-> ?traversal:traversal
-> ('a node -> unit)
-> 'a node array
-> unit
(** Iterate the descendants of a given node. *)
val iter_in_edges : ?order:order -> ('a node -> 'a node -> unit) -> 'a node array -> unit
(** Iterate all the in-edges of a given node. *)
val iter_out_edges : ?order:order -> ('a node -> 'a node -> unit) -> 'a node array -> unit
(** Iterate all the out-edges of a given node. *)
val topo_sort : 'a node array -> 'a node array
(** Topological sort of a given graph using a DFS order. Assumes that the graph is acyclic.*)
You can also use functions: filter_ancestors
, filter_descendants
, fold_ancestors
, fold_descendants
, fold_in_edges
, and fold_out_edges
to perform fold or filter operations when iterating the graph. Within Owl, the Graph module is heavily use to facilitate other modules such as the Computation Graph module to provide the basic graph structuring functionalities. We will discuss the Computation Graph in detail in later chapter in the book.
Stack and Heap Modules
Both Stack and Heap are two common abstract data types for collection of elements. They are also used in Owl code. Similar to graph, they use generic types so that any data type can be plugged in. Here is the definition of a stack:
type 'a t =
{ mutable used : int
; mutable size : int
; mutable data : 'a array
}
The stack and heap modules support four standards operations:
push
: push element into a stack/heappop
: pops the top element in stack/heap. It returns None if the container is emptypeek
: returns the value of top element in stack/heap but it does not remove the element from the stack.None
is returned if the container is empty.is_empty
: returns true if the stack/heap is empty
The stack data structure is used in many places in Owl: the filteri
operation in ndarray module, the topological sort in graph, the reading file IO operation for keeping the content from all the lines, the data frame… The heap structure is used in key functions such as search_top_elements
in the Ndarray
module, which searches the indices of top values in input ndarray according to the comparison function.
Count-Min Sketch
Count-Min Sketch is a probabilistic data structure for computing approximate counts. It is particularly ideal for use when the space is limited and exact results are not required. Imagine that we want to count how frequent certain elements are in a realtime stream, what would you do? An intuitive answer is that you can create a hash table, with the element as key and its count as value. The problem with this solution is that the stream could have millions and billions of elements. Even if you somehow manage to cut the long tail (such as the unique elements), the storage requirement is still terribly large.
Now that you think about it, you don’t really care about the precise count of an element from the stream. What you really need is an estimation that is not very far away from the true. That leaves space for optimising the solution. First, apply a hashing function and use h(e)
as the key, instead of the element e
itself. Besides, the total number of key-value pairs can be limited. Towards the end, this approach can be summarised as three steps:
- initialise an array of \(n\) elements, each set to 0;
- when processing one element \(e\), increase the count of the hashed index:
count[h(e)] += 1
; - when querying the count for certain element, just return
count[h(e)]
.
Obviously, this approach tends to give an overestimated answer because of the inevitable collision in hash table. Here the Count-Min Sketch method comes to help. Its basic idea is simple: follow the process stated above, but the only difference is that now instead of maintaining a vector of length \(n\), we now need to maintain a matrix of shape \(dxn\), i.e. \(d\) rows and \(n\) columns. Each row is assigned with a different hash function, and when processing one element \(e\), apply \(h_0, h_1, \ldots, h_d\) to it, and make \(count[i][h_i(e)] += 1\), for each \(i = 0, 1, 2, \ldots, d\). At any time if you want to know the count of an element \(e\), you again apply the same set of hash functions, retrieve the \(d\) counts of this element from all the rows, and choose the smallest count to return. This process is shown in [@fig:utilities:count-min] (Src).

In this way, the effect of collision is reduced in the counting. The reason is simple: if these different hashing functions are independent, then the probability that the same element leads to collision in multiple lines can be exponentially reduced with more hash function used.
Even though this method looks like just a heuristic, it actually provides a theoretical guarantee of its counting error. Specifically, we have two error bounds parameter: failure probability \(\sigma\), and the approximation ratio \(\epsilon\), and let \(s\) be the sum of all counts stored in the data structure. It can be proved that with a probability of \(1-\sigma\), the error between the estimated count and the true count is \(\epsilon~s\) at most. The detailed proof can be seen in the original paper [@cormode2005improved]. Note that this guarantee implies that elements that appear more frequently will have more accurate counts, since the maximum possible error in a count is linear in the total number of counts in the data structure.
Owl has provided this probabilistic data structure. It is implemented by Pratap Singh. Owl provides these interfaces for use:
module type Sig = sig
type 'a sketch
(** The type of Count-Min sketches *)
(** {6 Core functions} *)
val init : epsilon:float -> delta:float -> 'a sketch
(**
``init epsilon delta`` initializes a sketch with approximation ratio
``(1 + epsilon)`` and failure probability ``delta``.
*)
val incr : 'a sketch -> 'a -> unit
(** ``incr s x`` increments the frequency count of ``x`` in sketch ``s`` in-place. *)
val count : 'a sketch -> 'a -> int
(** ``count s x`` returns the estimated frequency of element ``x`` in ``s``. *)
val init_from : 'a sketch -> 'a sketch
(**
``init_from s`` initializes a new empty sketch with the same parameters as ``s``, which
can later be merged with ``s``.
*)
val merge : 'a sketch -> 'a sketch -> 'a sketch
(**
``merge s1 s2`` returns a new sketch whose counts are the sum of those in ``s1`` and ``s2``.
Raises ``INVALID_ARGUMENT`` if the parameters of ``s1`` and ``s2`` do not match.
*)
end
Owl provides two different implementations of the underlying table of counts, one based on the OCaml native array and one based on the Owl ndarray
. These are exported as Owl_base.Countmin_sketch.Native
and Owl_base.Countmin_sketch.Owl
respectively. In our testing, we have found the OCaml native array to be about 30% faster.
As an example, we can use the count-min sketch to calculate the frequencies of some words in a large corpus. The code below builds a count-min sketch and fills it with text data from here, a corpus of online news articles of about 61 million words. It then tests for the frequencies of some common words and one that doesn’t appear. WARNING: this code will download the file news.txt.gz (96.5MB) onto your machine and expand it into news.txt (340.3MB).
module CM = Owl_base.Countmin_sketch.Native
let get_corpus () =
let fn = "news.txt" in
if not (Sys.file_exists (Owl_dataset.local_data_path () ^ fn)) then
Owl_dataset.download_data (fn ^ ".gz");
open_in (Owl_dataset.local_data_path () ^ fn)
let get_line_words inch =
let regexp = Str.regexp "[^A-Za-z]+" in
try
Some ((input_line inch) |> Str.split regexp)
with
End_of_file -> None
let fill_sketch inch epsilon delta =
let c = CM.init ~epsilon ~delta in
let rec aux () =
match get_line_words inch with
| Some lst -> List.iter (CM.incr c) lst; aux ()
| None -> c in
aux ()
let _ =
let inch = get_corpus () in
let c = fill_sketch inch 0.001 0.001 in
let words = ["the"; "and"; "of"; "said"; "floccinaucinihilipilification"] in
List.iter (fun word -> Printf.printf "%s: %d\n" word (CM.count c word)) words
The example output is shown below. It shows that the common words appear with accurate counts, but the word which does not appear in the text gets a positive count.
the: 3378663
and: 1289949
of: 1404742
said: 463257
floccinaucinihilipilification: 15540
The count-min sketch is a useful data structure when we are interested in approximate counts of important objects in a data set. One such application is to find heavy hitters–for example, finding out the most popular web pages given a very long website access log. Formally, the \(k\)-heavy-hitters of a dataset are those elements that occur with relative frequency at least \(1/k\). So the 100-heavy-hitters are the elements which each appear at least 1% of the time in the dataset.
We can use the count-min sketch, combined with a min-heap, to find the \(k\)-heavy-hitters in a particular dataset. The general idea is to maintain in the heap all the current heavy hitters, with the lowest-count heavy hitter at the top. Whenever we get a new element, we add it to the count-min sketch, and then get its count from the sketch. If the relative frequency of that element is greater than \(1/k\), we add it to the heap. Then, we check if the current minimum element in the heap has gone below the relative frequency threshold of \(1/k\), and if so remove it from the heap. We repeat this process to remove all heavy hitters whose relative frequency is below \(1/k\). So the heap always contains only the elements which have relative frequency at least \(1/k\). To get the heavy hitters and their counts, we just get all the elements currently in the heap.
Owl implements this data structure on top of the count-min sketch. The interface is as follows:
module type Sig = sig
type 'a t
(** Core functions *)
val init : k:float -> epsilon:float -> delta:float -> 'a t
(**
`init k epsilon delta` initializes a sketch with threshold k, approximation
factor epsilon, and failure probability delta.
*)
val add : 'a t -> 'a -> unit
(** `add h x` adds value `x` to sketch `h` in-place. *)
val get : 'a t -> ('a * int) list
(**
`get h` returns a list of all heavy-hitters in sketch `h`, as a
(value, frequency) pair, sorted in decreasing order of frequency.
*)
end
Owl provides two implementations of the heavy-hitters data structure, as Owl_base.HeavyHitters_sketch.Native
and Owl_base.HeavyHitters_sketch.Owl
, using the two types of count-min sketch table. As described above, we have found the Native
implementation to be faster. An example use of this data structure to find the heavy hitters in the news.txt
corpus can be found in the Owl examples repository.
Summary
In this chapter, we introduce several internal modules in Owl. First we have the dataset module that provides access to the commonly used datasets such as MNIST and CIFAR10. Then we also have modules that can be used to build common data structures such as graph, stack, and heap. One less-known but interesting data structure is the Count-Min Sketch, which is a probabilistic data structure for computing approximate counts. It is also discussed in detail in this chapter. These modules are not complex and may even not quite necessary in all numerical libraries, but they may come surprisingly handy when you need them.