In my book there is nothing as good as real data produced by a red team, except captured data produced by an NSA red team (even if it’s not really annotated or labeled well and their MO isn’t quite what it would be in practice). When I was involved with an IDS research effort in the early 2000s there was a great deal of emphasis on the DARPA/Lincoln Labs datasets, which were old even back then. They were a lot better than nothing, but one thing that concerns me, and most everyone else, is the lack of good common data, let alone reproducible testbeds like the ones the National Cyber Range is supposed to provide. So it is nice to see that the DARPA/LL datasets are dead–long live the 2009 CDX datasets.
(Wireshark opened the small border data capture fine on my laptop, so don’t let the lack of .pcap extensions bother you.)
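(And if you’d rather poke at a capture programmatically than in Wireshark, the missing extension doesn’t matter there either. Here is a minimal sketch using scapy, which reads the file by its contents rather than its name; the filename below is just a placeholder, not the actual CDX file name.)

```python
# Minimal sketch: read a pcap-format capture whose name lacks a .pcap extension.
# Scapy's rdpcap() looks at the file contents, not the extension.
from scapy.all import rdpcap

packets = rdpcap("border_capture")  # placeholder filename, not the real CDX name
print(len(packets), "packets")
for pkt in packets[:5]:
    print(pkt.summary())  # one-line summary per packet
```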
But one often-implied corollary of having common or reproducible input data troubles me. Some folks have got the idea that it is possible to scientifically evaluate computer security systems. Even with good input data, I don’t believe such a thing is really possible except in an extremely narrow sense. Let me explain by way of analogy.
Suppose someone came to you with a box of padlocks of the same model and asked you to scientifically evaluate the security of that padlock model. There are a few obvious things you could do. You could test mechanical properties scientifically, asking questions like: How much force, applied in such-and-such a way, does it take to produce a mechanical failure of the lock? What is the dominant failure mode that results? But it is very implausible to imagine that you could evaluate all the possible failure modes–and hence the actual security of the lock–scientifically.
Sticking with the mechanical failure modes: what if someone decides to use acid to dissolve the lock? Or liquid nitrogen to make it brittle? Or an acetylene torch to heat it? And maybe a cold, brittle lock is easier to pick; or a hot, ductile lock is…you get the picture. And this doesn’t even begin to address lockpicking in all its forms, which is equal parts art and science.
(BTW/FWIW: one of my favorite episodes from college involves breaking into a room [that I was allowed to be in] that was secured with a fancy keypad lock system using nothing more than a piece of string from an interoffice envelope. It was after a power outage, and the keypad was inoperative, but the folks that installed the lock didn’t think about a very simple mechanical failure mode.)
In the real world you can usually expect a combinatorial explosion of possible failure modes that would have to be tested to assure security. Even in quantum cryptography people rightly worry about things that aren’t in the formal protocols, like efficiencies and TEMPEST-type issues with photon detectors. One of the reasons people are so excited about quantum crypto in the first place is that it is, among other things, a truly credible attempt to use physical theory to reduce the number of failure modes in a security protocol. And one of the reasons I don’t bother to pay attention to formal security proofs outside of cryptography is that their assumptions are never credible to a degree comparable to the Bell inequalities.
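To put a toy number on the combinatorial point above: if an attacker can mix and match even a modest set of primitives, the space you would have to test exhaustively gets out of hand fast. The primitive count here is invented purely for illustration.

```python
# Toy illustration of the combinatorial explosion of failure modes:
# with n independent attack primitives, there are C(n, k) ways to combine
# any k of them, before even considering order or parameters.
from math import comb

n = 40  # hypothetical number of distinct attack primitives
for k in range(1, 6):
    print(f"choose {k} of {n}: {comb(n, k):,} combinations")
# Orderings, timings, and parameters (how hot, how cold, which acid...)
# multiply these numbers further.
```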
This is not to say that security systems shouldn’t be tested–of course they should (especially if there is a “proof” of security)–but it doesn’t make sense to read too much into the results if they’re good. (If your results from evaluating a security system are bad, then that security system is not for you, regardless of why.) In science a hypothesis can never be proved, only disproved. And in security evaluation a system can never be proven secure, only broken. The difference is that in science the hypotheses can be deductively identified and tailored to test good theories that seek to reflect an underlying objective truth of capital-N Nature; in security evaluation the system can only be used to test attacks that seek to reflect the ingenuity of one particular set of red team tactics, for which there is often no underlying objective validity, just a common-sense notion of what ought to be done. The domain of applicability of any security evaluation is fundamentally limited because there is no way to come up with a scientific theory of security. Science typically deals with establishing and understanding regularities in phenomena, while security evaluation typically deals with the opposite.
I was hoping to be able to (but can’t) make it to a meeting in Seattle at the end of the month that is trying to produce
“progress in the area of Quantifiable Scientific Evaluation of CyberSecurity research. Currently, there is no well understood scientific standard used to guage [sic] the quality of research results in this area. Instead, decisions are made by program committees and journal editors. Also, experimental results are often not repeatable, sometimes due to the proprietary nature of the code or the privacy of the data. This meeting seeks to establish the beginnings of an agreed-upon set of scientific standards whereby progress can be measured, and identify barriers to such standards.”
Since I can’t be there, I will just say this: Concentrate on getting good, normalized inputs and outputs for comparative security evaluations. That is plenty hard enough, even though it is not science except in a trivial sense. If the goal is to use nontrivial science in security research, try applying ideas from science (like immunology or, my favorite, statistical physics) and mathematics in the development of engineering principles for security systems–where it can be of some benefit–rather than in the evaluation of systems, where anything nontrivial you can do might be valid, statistically significant, and of practical engineering value, but is still probably not scientific.
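Even the “trivial” part of that, agreeing on one normalized output format and running the same scoring code over every system under test, buys a lot of comparability. Here is a minimal sketch of what I mean; the alert schema and field names are invented for illustration and are not any existing standard.

```python
# Minimal sketch of a normalized comparison: every system under test emits
# alerts in one agreed-upon form, and the same scoring code is run on all.
# The schema and field names here are invented for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    timestamp: float   # seconds since capture start
    src: str           # source IP
    dst: str           # destination IP
    label: str         # detector's verdict, e.g. "malicious"

def score(alerts, ground_truth):
    """Compare alerts against a set of (src, dst) pairs known to be attacks."""
    flagged = {(a.src, a.dst) for a in alerts if a.label == "malicious"}
    return {
        "true_pos": len(flagged & ground_truth),
        "false_pos": len(flagged - ground_truth),
        "false_neg": len(ground_truth - flagged),
    }

# Same inputs, same scoring, for every system evaluated:
truth = {("10.0.0.5", "10.0.0.9")}
alerts = [Alert(12.3, "10.0.0.5", "10.0.0.9", "malicious"),
          Alert(14.1, "10.0.0.7", "10.0.0.2", "malicious")]
print(score(alerts, truth))  # {'true_pos': 1, 'false_pos': 1, 'false_neg': 0}
```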
Anyway, my hat is off to the CDX guys for putting those pcap files and logs up.