OCaml Scientific Computing

1st Edition (in progress)
Back
Table of Contents

Statistical Functions

Statistics is an indispensable tool for data analysis, it helps us to gain the insights from data. The statistical functions in Owl can be categorised into three groups: descriptive statistics, distributions, and hypothesis tests.

Random Variables

Stats module supports many distributions. For each distribution, there is a set of related functions using the distribution name as their common prefix. Here we use Gaussian distribution as an exmaple to illustrate the naming convention.

  • gaussian_rvs : random number generator.
  • gaussian_pdf : probability density function.
  • gaussian_cdf : cumulative distribution function.
  • gaussian_ppf : percent point function (inverse of CDF).
  • gaussian_sf : survival function (1 - CDF).
  • gaussian_isf : inverse survival function (inverse of SF).
  • gaussian_logpdf : logarithmic probability density function.
  • gaussian_logcdf : logarithmic cumulative distribution function.
  • gaussian_logsf : logarithmic survival function.

We generate two data sets in this example, and both contain 999 points drawn from different Gaussian distribution \(\mathcal{N} (\mu, \sigma^{2})\). For the first one, the configuration is \((\mu = 1, \sigma = 1)\); whilst for the second one, the configuration is \((\mu = 12, \sigma = 3)\).

let noise sigma = Stats.gaussian_rvs ~mu:0. ~sigma;;
let x = Array.init 999 (fun _ -> Stats.gaussian_rvs ~mu:1. ~sigma:1.);;
let y = Array.init 999 (fun _ -> Stats.gaussian_rvs ~mu:12. ~sigma:3.);;

We can visualise the data sets using histogram plot as below. When calling histogram, we also specify 30 bins explicitly. You can also fine tune the figure using spec named parameter to specify the colour, x range, y range, and etc. We will discuss in details on how to use Owl to plot in a separate chapter.

(* convert arrays to matrices *)

let x' = Mat.of_array x 1 999;;
let y' = Mat.of_array y 1 999;;

(* plot the figures *)

let h = Plot.create ~m:1 ~n:2 "plot_02.png" in

  Plot.subplot h 0 0;
  Plot.set_ylabel h "frequency";
  Plot.histogram ~bin:30 ~h x';
  Plot.histogram ~bin:30 ~h y';

  Plot.subplot h 0 1;
  Plot.set_ylabel h "PDF p(x)";
  Plot.plot_fun ~h (fun x -> Stats.gaussian_pdf ~mu:1. ~sigma:1. x) (-2.) 6.;
  Plot.plot_fun ~h (fun x -> Stats.gaussian_pdf ~mu:12. ~sigma:3. x) 0. 25.;

  Plot.output h;;

In subplot 1, we can see the second data set has much wider spread. In subplot 2, we also plot corresponding the probability density functions of the two data sets.

plot 02

As an exercise, you can also try out other distributions like gamma, beta, chi2, and student-t distribution.

Descriptive Statistics

Descriptive statistics are used to summarise the characteristics of data. The commonly used ones are mean, variance, standard deviation, skewness, kurtosis, and etc.

We first draw one hundred random numbers which are uniformly distributed between 0 and 10. Here we use Stats.uniform_rvs function to generate numbers following uniform distribution.

let data = Array.init 100 (fun _ -> Stats.uniform_rvs 0. 10.);;

Then We use mean function calculate sample average. As we can see, it is around 5. We can also calculate higher moments such as variance and skewness easily with corresponding functions.

Stats.mean data;;
>- : float = 5.31364843477812787
Stats.std data;;
>- : float = 2.84206433399567393
Stats.var data;;
>- : float = 8.0773296785702744
Stats.skew data;;
>- : float = -0.20440315951733054
Stats.kurtosis data;;
>- : float = 1.86457865927682298

The following code calculates different central moments of data. A central moment is a moment of a probability distribution of a random variable about the random variable’s mean. The zeroth central moment is always 1, and the first is close to zero, and the second is close to the variance.

Stats.central_moment 0 data;;
>- : float = 1.
Stats.central_moment 1 data;;
>- : float = -8.52651282912120191e-16
Stats.central_moment 2 data;;
>- : float = 7.99655638178457107
Stats.central_moment 3 data;;
>- : float = -4.69233832808677143

Correlations

Correlation studies how strongly two variables are related. There are different ways of calculating correlation. For the first example, let’s look at Pearson correlation.

x is our explanatory variable and we draw 50 random values uniformly from an interval between 0 and 10. Both y and z are response variables with a linear relation to x. The only difference is that we add different level of noise to the response variables. The noise values are generated from Gaussian distribution.

let noise sigma = Stats.gaussian_rvs ~mu:0. ~sigma;;
let x = Array.init 50 (fun _ -> Stats.uniform_rvs 0. 10.);;
let y = Array.map (fun a -> 2.5 *. a +. noise 1.) x;;
let z = Array.map (fun a -> 2.5 *. a +. noise 8.) x;;

It is easier to see the relation between two variables from a figure. Herein we use Owl’s Plplot module to make two scatter plots.

(* convert arrays to matrices *)

let x' = Mat.of_array x 1 50;;
let y' = Mat.of_array y 1 50;;
let z' = Mat.of_array z 1 50;;

(* plot the figures *)

let h = Plot.create ~m:1 ~n:2 "plot_01.png" in

  Plot.subplot h 0 0;
  Plot.set_xlabel h "x";
  Plot.set_ylabel h "y (sigma = 1)";
  Plot.scatter ~h x' y';

  Plot.subplot h 0 1;
  Plot.set_xlabel h "x";
  Plot.set_ylabel h "z (sigma = 8)";
  Plot.scatter ~h x' z';

  Plot.output h;;

The subfigure 1 shows the functional relation between x and y whilst the subfiture 2 shows the relation between x and z. Because we have added higher-level noise to z, the points in the second figure are more diffused.

plot 01

Intuitively, we can easily see there is stronger relation between x and y from the figures. But how about numerically? In many cases, numbers are preferred because they are easier to compare with by a computer. The following snippet calculates the Pearson correlation between x and y, as well as the correlation between x and z. As we see, the smaller correlation value indicates weaker linear relation between x and z comparing to that between x and y.

Stats.corrcoef x y;;
>- : float = 0.989229727850764129
Stats.corrcoef x z;;
>- : float = 0.722342652115260719

Hypothesis Tests

While decriptive statitics solely concern properties of the observed data, statistical inference focusses on studying whether the data set is sampled from a larger population. In other words, statistical inference make propositions about a population. Hypothesis test is an important method in inferential statistical analysis. There are two hypotheses proposed with regard to the statistical relationship between data sets.

  • Null hypothesis \(H_0\): there is no relationship between two data sets.
  • Alternative hypothesis \(H_1\): there is statistically significant relationship between two data sets.

The Stats module in Owl supports many different kinds of hypothesis tests.

  • Z-Test
  • Student’s T-Test
  • Paired Sample T-Test
  • Unpaired Sample T-Test
  • Kolmogorov-Smirnov Test
  • Chi-Square Variance Test
  • Jarque-Bera Test
  • Fisher’s Exact Test
  • Wald–Wolfowitz Runs Test
  • Mann-Whitney Rank Test
  • Wilcoxon Signed-rank Test

Now let’s see how to perform a z-test in Owl. We first generate two data sets, both are drawn from Gaussian distribution but with different parameterisation. The first one data_0 is drawn from \(\mathcal{N}(0, 1)\), while the second one data_1 is drawn from \(\mathcal{N}(3, 1)\).

let data_0 = Array.init 10 (fun _ -> Stats.gaussian_rvs ~mu:0. ~sigma:1.);;
let data_1 = Array.init 10 (fun _ -> Stats.gaussian_rvs ~mu:3. ~sigma:1.);;

Our hypothesis is that the data set is drawn from Gaussian distribution \(\mathcal{N}(0, 1)\). From the way we genereated the synthetic data, it is obvious that data_0 will pass the test, but let’s see what Owl will test us using its Stats.z_test function.

Stats.z_test ~mu:0. ~sigma:1. data_0;;
>- : Owl_stats.hypothesis =
>{Owl.Stats.reject = false; p_value = 0.0984716313263263171;
> score = -1.65230872621158897}

The returned result is a record with the following type definition. The fields are self-explained: reject field tells whether the null hypothesis is rejected, along with the p value and score calculated with the given data set.

type hypothesis = {
  reject : bool;
  p_value : float;
  score : float;
}

From the previous result, we can see reject = false, indicating null hypothesis is rejected, therefore the data set data_0 is drawn from \(\mathcal{N}(0, 1)\). How about the second data set then?

Stats.z_test ~mu:0. ~sigma:1. data_1;;
>- : Owl_stats.hypothesis =
>{Owl.Stats.reject = true; p_value = 1.24308805325057672e-22;
> score = 9.78998949198827439}

As we expected, the null hypothesis is accepted with a very small p value. This indicates that data_1 is drawn from a different distribution rather than assumed \(\mathcal{N}(0, 1)\).

Order Statistics

Order statistics and rank statistics are among the most fundamental tools in non-parametric statistics and inference. The \(k^{th}\) order statistic of a statistical sample is equal to its kth-smallest value. The example functions of ordered statistics are as follows.

Stats.min;; (* the mininum of the samples *)
Stats.max;; (* the maximum of the samples *)
Stats.median;; (* the median value of the samples *)
Stats.quartile;; (* quartile of the samples *)
Stats.first_quartile;; (* the first quartile of the samples *)
Stats.third_quartile;; (* the third quartile of the samples *)
Stats.interquartile;; (* the interquartile of the samples *)
Stats.percentile;; (* percentile of the samples *)

In addition to the aforementioned ones, there are many other ordered statitical functions in Owl for you to explore.

Next: Chapter 06N-Dimensional Arrays