Introduction

We propose a novel geometric anomaly measure called PIDScore and present an algorithm that detects anomalies based on this definition.

Abstract

We consider the problem of detecting anomalies in a large dataset. We propose a definition that captures the intuition that anomalies are easy to distinguish from the overwhelming majority of points by relatively few attribute values; we call this partial identification. Formalizing this intuition, we propose a geometric anomaly measure for a point, which we call PIDScore, that measures the minimum density of data points over all subcubes containing the point. We present PIDForest, a random-forest-based algorithm that finds anomalies based on this definition. We show that it performs favorably in comparison to several popular anomaly detection methods across a broad range of benchmarks. PIDForest also provides a succinct explanation for why a point is labelled anomalous, in the form of a set of features and ranges for them that are relatively uncommon in the dataset.
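To make the "minimum density over all subcubes containing the point" idea concrete, the following is a minimal brute-force sketch, not the PIDForest algorithm and not the paper's exact normalization. It assumes that subcubes are axis-aligned boxes whose boundaries are drawn from a small, user-supplied set of cut points per attribute, and that density is the fraction of data points in a box divided by the box's volume. The function name `min_density_score`, the `cuts` parameter, and the toy data are hypothetical and chosen only for illustration.

```python
from itertools import combinations, product

def min_density_score(point, data, cuts):
    """Illustrative score: minimum density over candidate boxes containing `point`.

    Assumptions (not from the paper): boxes are half-open products of
    intervals [lo, hi) with endpoints taken from `cuts[j]` for attribute j,
    and density = (fraction of data in the box) / (volume of the box).
    A lower score means the point sits in a sparser region.
    """
    n = len(data)
    # For each attribute, list the candidate intervals that contain the point.
    per_dim = []
    for j, value in enumerate(point):
        intervals = [(lo, hi) for lo, hi in combinations(sorted(cuts[j]), 2)
                     if lo <= value < hi]
        per_dim.append(intervals)

    best = float("inf")
    # Enumerate every box formed by choosing one interval per attribute.
    for box in product(*per_dim):
        volume = 1.0
        for lo, hi in box:
            volume *= hi - lo
        count = sum(
            all(lo <= p[j] < hi for j, (lo, hi) in enumerate(box))
            for p in data
        )
        best = min(best, (count / n) / volume)
    return best

# Toy 1-D dataset: a tight cluster near 0.1 and one isolated point at 0.9.
data = [(0.10,), (0.12,), (0.14,), (0.16,), (0.18,), (0.90,)]
cuts = [[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]]

outlier_score = min_density_score((0.90,), data, cuts)
inlier_score = min_density_score((0.10,), data, cuts)
print(outlier_score < inlier_score)
```

In this toy example the isolated point can be enclosed in a wide interval that contains no other data, so its minimum density is lower than that of a point inside the cluster. The enumeration grows exponentially with the number of attributes, which is one reason a practical method needs a heuristic search rather than this exhaustive one.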

Date

August 2019

Research Areas

  • Algorithms
  • Machine Learning
  • Statistics

Type

Inproceedings

Journal

NeurIPS 2019