 # Kolmogorov-Smirnov Test (KS-Test)

The KS-test is another example of a popular, rank-based test. Once again, the KS-test compares two lists of numbers under the null hypothesis that they are actually drawn from the same pool, but this time it uses a test statistic based on the “cumulative distributions” of the observations. The cumulative distribution at a given point is the sum of the probability distribution up to that point; for a sample, it is the fraction of observations at or below that value. The KS-test uses as the test statistic the maximum difference between the two cumulative distributions. Figure 2.5 illustrates the cumulative distribution and the KS-test statistic, usually referred to as D. Surprisingly, the distribution of this test statistic can be computed (approximately) under the null hypothesis, regardless of the distribution of the data. I reproduce the formula here only for aesthetic reasons, to show that there is no reason to expect that it should be simple: where

exp[x] is the exponential function

D is the observed value of the test statistic

n and m are the sizes of the two samples, as defined in the WMW section
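A widely used form of this approximation, written with the symbols defined above, is sketched below. The constants 0.12 and 0.11 are the small-sample correction popularized by Numerical Recipes; treating this as the exact form intended here is an assumption.

$$
P \approx 2 \sum_{j=1}^{\infty} (-1)^{j-1} \exp\left[ -2 j^{2} D^{2} \left( 0.12 + \sqrt{\frac{nm}{n+m}} + 0.11 \sqrt{\frac{n+m}{nm}} \right)^{2} \right]
$$

The sample sizes enter only through the combination nm/(n + m), so for a fixed D the P-value shrinks as either sample grows.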

FIGURE 2.5 The KS-test and the central limit theorem apply even when the data are not distributed according to a standard distribution. T cells show a different distribution of expression levels for CD4 than other cells (top right) even though neither distribution looks Gaussian. The KS-test measures the difference between the cumulative distributions (D) between two samples (bottom left). Convergence to the central limit theorem is illustrated in the top left panel. Averages (gray bars, 20 samples of 20 datapoints, randomly chosen from the T-cell data, unfilled bars) show approximately Gaussian behavior (dashed line), even though the real data is strongly non-Gaussian.

Figure 2.5 illustrates why the KS-test would be expected to work reasonably well, even on a strongly bimodal dataset like the CD4 data. As with the WMW test, you will usually use a statistics package to perform this test; be sure that if there are any tied ranks in your data, the statistics software is handling them correctly. In the KS-test, it’s particularly tricky to handle the ties correctly. Once again the KS-test tells us that the differences in both of these datasets are highly significant, with the Cdc6 data having D = 0.6832, P = 1.038e-05, and the CD4 data having D = 0.3946, P = 5.925e-06. Notice that the Cdc6 data have a larger D, but are not as significant (larger P-value). This is because there are more T cells than stem cells in the dataset. Here, you can see the nontrivial dependence on the size of the two sets in the formula for the KS-test P-value.
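To make the dependence on sample size concrete, here is a minimal Python sketch of the two steps: computing D as the largest gap between the two empirical cumulative distributions, and turning D into an approximate P-value. The function names `ks_statistic` and `ks_pvalue` are hypothetical. The P-value uses the common alternating-series approximation with the 0.12 and 0.11 correction constants, which is assumed here rather than taken from any particular statistics package. The data are synthetic, not the Cdc6 or CD4 measurements.

```python
import numpy as np


def ks_statistic(x, y):
    """Maximum absolute difference between the two empirical CDFs."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    # The largest gap can only occur at an observed value, so it is
    # enough to evaluate both empirical CDFs on the pooled data.
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))


def ks_pvalue(d, n, m, terms=100):
    """Approximate P-value for an observed D with sample sizes n and m.

    Assumes continuous data (no ties); this is a sketch of the
    asymptotic approximation, not an exact calculation.
    """
    root = np.sqrt(n * m / (n + m))
    lam = (root + 0.12 + 0.11 / root) * d
    if lam < 0.2:
        # The series converges too slowly here; the P-value is ~1.
        return 1.0
    j = np.arange(1, terms + 1)
    series = 2.0 * np.sum((-1.0) ** (j - 1) * np.exp(-2.0 * (j * lam) ** 2))
    return float(min(max(series, 0.0), 1.0))


# Synthetic example: two samples of unequal size from shifted Gaussians.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(1.0, 1.0, size=200)
d = ks_statistic(a, b)
print("D =", d, "P ~", ks_pvalue(d, a.size, b.size))

# The same D is far more significant when both samples are larger.
print(ks_pvalue(0.4, 20, 20), ks_pvalue(0.4, 100, 100))
```

In practice a statistics package is still the safer choice, especially when the data contain ties, because this approximation assumes continuous data.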