Debiasing for intrinsic dimension estimation

Kevin M. CarterRaviv RaichAlfred O. Hero III

Abstract

Many algorithms have been proposed for estimating the intrinsic dimension of high dimensional data. A phenomenon common to all of them is a negative bias, perceived to be the result of undersampling. We propose improved methods for estimating intrinsic dimension, taking manifold boundaries into consideration. By estimating dimension locally, we are able to analyze and reduce the effect that sample data depth has on the negative bias. Additionally, we offer improvements to an existing algorithm for dimension estimation, based on k-nearest neighbor graphs, and offer an algorithm for adapting any dimension estimation algorithm to operate locally. Finally, we illustrate the uses of local dimension estimation with data sets consisting of multiple manifolds, including applications such as diagnosing anomalies in router networks and image segmentation.

Paper

Kevin M. Carter, Raviv Raich, and Alfred O. Hero III, “Debiasing for intrinsic dimension estimation,” in Proc. IEEE Statistical Signal Processing Workshop (SSP), August 2007, pp. 601-605. (.pdf)

Matlab Code

(.zip) This code requires the Kernel Density Estimation Toolbox if you wish to plot the PDF of dimension vs data depth (Fig. 2 below).

Figures

  • Figure 1. The probability of randomly selecting a point on the boundary of an m-dimensional hypercube
  • Figure 2. Analysis of the effect of data depth on local dimension estimation. Points with less depth estimate at a lower dimension, contributing to the overall negative bias.
  • Figure 3a. Biased global dimension estimates of hypercube
  • Figure 3b. Developing a de-biased global dimension estimate by averaging over the 50% of points with the greatest depth on the manifold
  • Figure 4. The k-NN Algorithm applied to network traffic data
  • Figure 5. Satellite image of New York City; three regions of differing complexity are noted
  • Figure 6a. NYC Region 1 dimension estimation (image complexity)
  • Figure 6b. NYC Region 2 dimension estimation (image complexity)
  • Figure 6c. NYC Region 3 dimension estimation (image complexity)

Comments and Remarks