VMware Research | PIDForest: Anomaly Detection and Certification via Partial Identification

Introduction

We propose a novel geometric anomaly measure called PIDScore and present an algorithm that detects anomalies based on this definition.

Abstract

We consider the problem of detecting anomalies in a large dataset. We propose a definition that captures the intuition that anomalies are easy to distinguish from the overwhelming majority of points by relatively few attribute values: we call this partial identification. Formalizing this intuition, we propose a geometric anomaly measure for a point that we call PIDScore, which measures the minimum density of data points over all subcubes containing the point. We present PIDForest: a random-forest-based algorithm that finds anomalies based on this definition. We show that it performs favorably in comparison to several popular anomaly detection methods, across a broad range of benchmarks. PIDForest also provides a succinct explanation for why a point is labelled anomalous, by providing a set of features, and ranges for them, which are relatively uncommon in the dataset.
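To make the "minimum density over all subcubes containing the point" idea concrete, here is a minimal brute-force sketch. It is not the PIDForest algorithm, which is random-forest based; it only enumerates axis-aligned boxes on a small, hand-picked grid. The function names (`box_density`, `min_density_score`), the grid of cut points, the toy data, and the choice to define density as the fraction of points in a box divided by the box volume are all assumptions made for illustration.

```python
from itertools import combinations, product


def box_density(data, box):
    """Fraction of data points inside `box`, divided by the box volume.

    `box` is a tuple of (lo, hi) intervals, one per dimension.
    This normalization is an assumption for the sketch.
    """
    inside = sum(
        all(lo <= x <= hi for x, (lo, hi) in zip(p, box)) for p in data
    )
    volume = 1.0
    for lo, hi in box:
        volume *= hi - lo
    return (inside / len(data)) / volume


def min_density_score(point, data, cuts):
    """Minimum density over all grid-aligned boxes that contain `point`.

    `cuts[d]` lists the candidate interval endpoints for dimension d
    (a hypothetical discretization; the real search space is much larger).
    Returns (density, box) for the sparsest box found.
    """
    per_dim = []
    for x, dim_cuts in zip(point, cuts):
        intervals = [
            (lo, hi)
            for lo, hi in combinations(sorted(dim_cuts), 2)
            if lo <= x <= hi
        ]
        per_dim.append(intervals)

    best = None
    for box in product(*per_dim):
        density = box_density(data, box)
        if best is None or density < best[0]:
            best = (density, box)
    return best


# Toy data: five clustered points and one isolated point.
data = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2), (0.15, 0.15), (0.9, 0.9)]
cuts = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]

outlier_score, outlier_box = min_density_score((0.9, 0.9), data, cuts)
inlier_score, inlier_box = min_density_score((0.15, 0.15), data, cuts)
```

On this toy data the isolated point gets a lower minimum density than the clustered point, and the sparsest box found for it restricts only one of the two coordinates. That mirrors the abstract's point that an anomaly can be singled out by a few attribute ranges, which also serve as its explanation.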

Files

pidforest.pdf

Date

August, 2019

Authors

Related projects

Anomaly Detection

Research Areas

Algorithms
Machine Learning
Statistics

Type

Inproceedings

Journal

NeurIPS 2019