Monitoring large volumes of data and finding anomalous behavior in them is a ubiquitous challenge. The data are of typically high-dimensional, heterogeneous (categorical and numerical), and contain irrelevant attributes and noise. Labels are often scarce and/or expensive, hence unsupervised learning methods are called for.
Our goal is to come up with algorithms that
- Make minimal generative assumptions, and hence apply broadly.
- Give rigorous guarantees, and whose outcomes are easily interpretable.
- Can handle heterogeneous datasets.
- Are highly performant and scale to large volumes and high dimensions.