The inspiration for my recent Journal article, “Actionable Security Intelligence from Big, Midsize and Small Data,” came from The New York Times article “How Not to Drown in Numbers,” by Alex Peysakhovich, a behavioral economist and data scientist at Facebook, and Seth Stephens-Davidowitz, a former data scientist at Google. The premise of their article is that big data alone are not enough. This is an interesting assertion given the backgrounds of the writers. They say that big data have to be supplemented with what they call “small data,” which comprise information from surveys and human judgment. This claim rang true to me, as I have conducted a number of security-risk-related surveys throughout my career, the most recent being an Operationally Critical Threat, Asset, and Vulnerability Evaluation (OCTAVE) review, which I mention in my recent Journal article.
It occurred to me that there is a huge swath of data, which I refer to as “midsize data,” that were excluded from the previous discussion. Midsize data are what information security professionals collect and work with all the time: the logs from firewalls, intrusion detection systems and other monitoring tools. These logs typically feed into security information and event management (SIEM) systems, where they are aggregated, correlated and analyzed. The reported results, which may suggest possible anomalous behavior, become the basis of decisions about how to respond.
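As a minimal sketch of the aggregate-then-flag step described above (the log format, field names and threshold here are illustrative assumptions, not taken from any particular firewall or SIEM product):

```python
from collections import Counter

# Hypothetical, simplified firewall log lines.
log_lines = [
    "2016-01-04T10:00:01 src=10.0.0.5 action=DENY",
    "2016-01-04T10:00:02 src=10.0.0.5 action=DENY",
    "2016-01-04T10:00:03 src=10.0.0.9 action=ALLOW",
    "2016-01-04T10:00:04 src=10.0.0.5 action=DENY",
    "2016-01-04T10:00:05 src=10.0.0.7 action=DENY",
]

def flag_anomalies(lines, threshold=3):
    """Aggregate denied connections by source and flag heavy hitters."""
    denies = Counter()
    for line in lines:
        # Parse the key=value fields that follow the timestamp.
        fields = dict(f.split("=") for f in line.split()[1:])
        if fields.get("action") == "DENY":
            denies[fields["src"]] += 1
    # Sources at or above the threshold suggest possible anomalous behavior.
    return [src for src, n in denies.items() if n >= threshold]

print(flag_anomalies(log_lines))  # ['10.0.0.5']
```

A real SIEM correlates across many sources and time windows, but the essential move is the same: reduce raw midsize data to a short list that a human can act on.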
So many researchers, practitioners and reporters appear to be utterly astounded when they contemplate big data. They see unlimited opportunities emanating from analyzing huge volumes of data. Occasionally, but not often enough, reported results are accompanied by disclaimers that cause-and-effect should not be inferred from reported correlations. But that is like a judge telling a jury to disregard the last statement. Practically everybody assumes that the causal relationships are true when indeed they may not be.
In a 2008 article in the ISACA Journal titled “Accounting for Value and Uncertainty in Security Metrics,” I wrote that information security professionals tend to rely on metrics that are easy to gather and analyze rather than those that take more thought and effort to obtain. Big data analytics frequently fall into the former category. Little forethought is required to gather as much data as possible, run the analytics and see what relationships appear. But that is not how it should be done. Traditional design-of-experiments practice requires that a cause-and-effect model be created first; only then are correlations and regressions run to see whether the model is supported by the results. The analyze-now-think-later approach is just fishing, and the angler is just as likely to hook an old boot and think that the boot is a fish.
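A quick illustration of why fishing is risky (my own sketch, not from the article): if enough candidate metrics are correlated against an outcome, some correlation will look strong purely by chance, even when every series is pure noise with no causal link at all.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

random.seed(2016)
n = 20  # a small sample, as in many real security data sets
outcome = [random.gauss(0, 1) for _ in range(n)]

# "Fishing": try 500 candidate metrics that are, by construction,
# random noise unrelated to the outcome, and keep the best-looking one.
best_r = max(
    abs(pearson([random.gauss(0, 1) for _ in range(n)], outcome))
    for _ in range(500)
)
print(round(best_r, 2))  # some noise series will appear strongly "related"
```

Starting from a cause-and-effect model and testing one pre-stated hypothesis avoids this multiple-comparisons trap; trawling first and explaining later invites it.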
Researchers typically formulate cause-and-effect models from small and midsize data. Without such models, trawling through big data can often be an exercise in futility, or worse, it can result in bad decisions.
Read C. Warren Axelrod’s recent Journal article:
“Actionable Security Intelligence from Big, Midsize and Small Data,” ISACA Journal, volume 1, 2016.