Hi, I'm the author of the MIDAS algorithm. We choose the number of hash functions and bucket according to the maximum error we can tolerate and the false positive probability theoretical guarantee we want. Please refer to the AAAI paper here: https://www.comp.nus.edu.sg/~sbhatia/assets/pdf/midas.pdf Let me know if you need more details.
Anomaly detection in graphs is a critical problem for finding suspicious behavior in innumerable systems, such as intrusion detection, fake ratings, and financial fraud. But most of the systems in place focus either on static graphs or on entire graph snapshots if they consider dynamic (time-evolving) graphs.
However, to minimize the effect of malicious activities and start recovery as soon as possible, we need to detect anomalies in real-time or near real-time i.e. to identify whether an incoming edge is anomalous or not, as soon as we receive it. In addition, since the number of vertices can increase as we process the stream of edges, we need an algorithm which uses constant memory in graph size. Moreover, fraudulent or anomalous events in many applications occur in microclusters or suddenly arriving groups of suspiciously similar edges e.g. denial of service attacks in network traffic data and lockstep behavior.
In this work, we propose MIDAS, which detects microcluster anomalies, or suddenly arriving groups of suspiciously similar edges, in edge streams, using constant time and memory. In addition, by using a principled hypothesis testing framework, MIDAS provides theoretical bounds on the false positive probability, which earlier methods do not provide. Also, we are up to 644 times faster and 48% more accurate than previous state of the art approaches.
For more details, read the paper at: https://www.comp.nus.edu.sg/~sbhatia/assets/pdf/midas.pdf
If you are in New York, you are welcome to attend the talk at AAAI in New York Hilton Midtown on 11th February at 2pm.