Solution of second-order matrix difference equations

9 November 2009

While browsing through my notes recently I came across this cute and not-quite trivial, but entirely elementary result. Since it doesn’t seem to be available anywhere else, I present it as blog fodder. (As usual, I will offer $.50 and a Sprite to the first person who provides a usable reference in the event that this result is already known.)

Consider the second-order matrix difference equation

x_{k+1} = Ax_k + \tilde Ax_{k-1}

with x_0, x_1 given. Some algebra shows that (e.g.)

x_2 = (A)x_1 + \tilde A x_0,

x_3 = (A^2 + \tilde A)x_1 + (A) \tilde A x_0,

x_4 = (A^3 + bsp_{1,1}[A, \tilde A])x_1 + (A^2 + \tilde A) \tilde A x_0, etc.

where the bosonic symmetrized product bsp_{j,k}[A, \tilde A] for two “species” of order (j,k) is defined as the sum of all distinguishable products with j occurences of the first species and k of the second (if j or k = 0 then the BSP is a pure power of the nontrivial species).

For example, bsp_{3,2}[A,\tilde A] equals

A A A\tilde A \tilde A + A A\tilde A A \tilde A + A A \tilde A \tilde A A + A\tilde A A A \tilde A + A\tilde A A \tilde A A

+ A \tilde A \tilde A A A + \tilde A A A A \tilde A + \tilde A A A \tilde A A + \tilde A A \tilde A A A + \tilde A \tilde A A A A.

It is not hard to verify that

bsp_{j,k}[A, \tilde A] = A \cdot bsp_{j-1,k} + \tilde A \cdot bsp_{j,k-1}[A, \tilde A]

and more generally for \ell = 0,\dots \min(j,k) that

bsp_{j,k}[A, \tilde A] = \sum_{m=0}^\ell bsp_{\ell-m,m}[A, \tilde A] \cdot bsp_{j-(\ell-m),k-m}[A, \tilde A].

The same algebra that facilitated writing down the first few terms for the difference equation quickly leads to the general solution, viz.

x_k = \left( A^{k-1} + \sum_{j=1}^{(k-1)/2 - 1} bsp_{k-2j-1,j}[A,\tilde A] + \tilde A^{(k-1)/2} \right) x_1

+ \left( A^{k-2} + \sum_{j=1}^{(k-1)/2-1} bsp_{k-2j-2,j}[A, \tilde A] \right) \tilde A x_0

for k odd, and

x_k = \left( A^{k-1} + \sum_{j=1}^{k/2 - 1} bsp_{k-2j-1,j}[A,\tilde A] \right) x_1

+ \left( A^{k-2} + \sum_{j=1}^{k/2-2} bsp_{k-2j-2,j}[A, \tilde A] + \tilde A^{k/2-1} \right) \tilde A x_0

for k even.

I used this to understand what happens when one of the internal states of a classical Bose gas becomes “more fermionic” in the sense that fewer particles can occupy that state, but the microscopic transition probabilities are otherwise unchanged. (The underlying motivation was to apply this to get a handle on the effects of some complex configurations for an older mathematically-oriented network monitoring framework in a tractable traffic regime.) It turns out that this is completely uninteresting unless the remaining “bosonic” states are close to equally likely to occur: in this event you get a slight but also quite counter-intuitive effect on the state distributions. Pseudocolor figures of the energy functions corresponding to these distributions will illustrate:

3 totally bosonic interal states

3 totally bosonic interal states. The distribution is multigeometric, and the energy is affine.

Capping the occupancy of one internal state

Capping the occupancy of one internal state

A more restrictive cap

A more restrictive cap

A "fermionic" internal state

A "fermionic" internal state

Note that the color scales differ for each figure. I think the pictures indicate how surprising the behavior of this (completely classical) “poor man’s exclusion principle” is: the effect is as unexpected as it is slight.

Hopefully posting the main result will save some other folks a head-scratch or two. I’d be interested to know if it’s applied elsewhere.


A graded lexicographic index, part 3

26 October 2009

In an earlier post we covered some more sophisticated ways of implementing a graded lexicographic index, a/k/a the state index. One of these dealt with producing a superset of the state indices of the neighbors of a given state (here X_{B,b} is abusively interpreted as a graph embedded in \mathbb{R}^{B+1} where edges join any two vertices that are a distance \sqrt{2} apart). A careful exploitation of this technique leads to a very efficient way of updating the state index for random walks on X_{B,b}. Basically, the idea is to just determine which element of the neighborhood is the correct one, and restrict the neighborhood algorithm to a single vertex. In practice this means a lot of careful indexing and addition of appropriate binomial coefficients, but the basic idea is really just to omit the extra work in the neighborhood algorithm.

Here’s some MATLAB code that does the trick:

function y = stateindexupdate(a);

% a is an array of states (assumed to be) in a bucketspace
% this illustrates efficient stateindex updating

T = size(a,1);
B = size(a,2);
b = sum(a(1,:));

% Pascal matrix
mb = max(b+1,B+1);
P1 = pascal(mb);
P = P1(1:B-1,:);
PA = [zeros(1,mb);P];

ia(1) = stateindex(a(1,:));

for j = 2:T
 % compute tau:
 tau = 1 + cumsum(fliplr(a(j,:)));
 % compute the ball move that got us here:
 move = a(j,:) - a(j-1,:);
 n = 0; % init the "slot" index
 % compute the corresponding slot index...
 % first, ID the first nonzero element of the move a la Java/C:
 for k = 1:B
  if move(k)
   % now find the second nonzero element m:
   for i = k+1:B
    if move(i)
     m = i;
     break;
    end
   end
   % the slot n is as follows:
   n = move(k) * ( (B-k)*((B-k)+1)/2 - B + m);   % move(i) = sign(move(i))
   break;
  end
 end

 % compute the first (signed) triangular number further from zero:
 for k = 0:B
  n2 = k * (k+1)/2;
  if n2 >= abs(n)
   n2 = sign(n) * n2;
   break;
  end
 end

 d = abs(n2-n);
 l = ( -1 + sqrt(1+8*abs(n2)) )/2;  % l*(l+1)/2 = abs(n2);

 % the sum will go from d+2 to l+1...
 ind = (d+2):(l+1);
 % finally, perform the sum
 lind = length(ind);
 temp = zeros(1,lind);
 if n2 < 0
  for i = 1:lind
   temp(i) = PA(ind(i),tau(ind(i)-1)-1);
  end
  delta = -sum(temp);
 elseif n2 > 0
  for i = 1:lind
   temp(i) = PA(ind(i),tau(ind(i)-1));
  end
  delta = sum(temp);
 else
  delta = 0;
 end

 % finally, update the stateindex:
 ia(j) = ia(j-1) + delta;
end

y=ia;

This runs really fast. A Java version of this code that we produced performs many millions of updates per second on a decent processor core.

One of the nicest things about “grlex” ordering and indexing is that it can be used as a primitive to tackle lots of otherwise difficult problems. For instance, I’ve used it to help (among other things) emulate an older network security monitoring platform, analyze generalized multivariate de Moivre martingales, perform multivariate Lagrange interpolation, study energy functions on equivalence classes of directed graphs, and to construct propagation operators implementing periodic hexagonal, truncated-octahedral, and more generally permutohedral boundary conditions for lattice models (such as spin or lattice Boltzmann models) with surprising ease.

Hopefully some of these more advanced techniques for grlex stuff will be of use elsewhere. If you’d like to use some of this code, just ask: we’ll provide it under an open-source license.


Securing the Information Highway

20 October 2009

The November/December issue of Foreign Affairs (unfortunately not yet available online as of this writing) has an eponymous piece by Wesley Clark and Peter Levin: in it, they write that

[The limited success of the July 4 cyberattacks principally affecting the US and Korea] may embolden future hackers to attack critical infrastructure, such as power generators or air-traffic-control systems, with devastating consequences for the U.S. economy and national security…There is no form of military combat more irregular than an electronic attack: it is extremely cheap, is very fast, and can disrupt or deny critical services at the moment of maximum peril…Disturbingly, [other nations] seem to understand the vulnerabilities of the United States’ network infrastructure better than many Americans do…The longer the U.S government waits [to confront the real threats in cybersecurity], the more devastating the eventual assault is likely to be…the consequences of a major breach would be catastrophic.

Clark and Levin recount William Safire‘s claim that a 3-kiloton explosion of a Siberian natural gas pipeline in 1982—”the most monumental non-nuclear explosion and fire ever seen from space”—was the direct consequence of a Trojan inserted into Canadian SCADA software that the CIA allowed the KGB to steal. They recirculate the rumor that the Israeli destruction of a purported Syrian nuclear facility in 2007 was facilitated by a cyberattack targeting Syrian air defense systems. (This blog has linked to other reports of capabilities along similar lines, such as this one.)

But their real focus is (not surprisingly, given Levin’s history as founder of a hardware outfit specializing in the area) is on the problem of validating hardware. DoD has been very concerned with the idea of hardware Trojans the last few years. Nobody in the military/intelligence-industrial complex wants to take it on faith that chips that are manufactured in China or Taiwan don’t have backdoors. There are apparent precedents for the hardware Trojan such as old reports involving Crypto AG. So NSA started up a trusted foundry and DARPA started the TRUST program (whose PM funded some of my research some years back, so I applaud his taste on both counts). But that leaves the vast majority of chips in network components still unaccounted for, including a large number of counterfeit chips.

Clark and Levin propose an emphasis on reconfigurable hardware (such as FPGAs) and the sort of immunological paradigm started by the Forrest group at UNM as an example of a sound defensive strategy. While the practical utility (as compared to the undeniable conceptual elegance) of the paradigm for network defense is not clear to me (but then again I’m obviously a partisan when it comes to the best scientific principles for designing network defense infrastructure), the ideas of using reconfigurable hardware and avoiding a computational and network monoculture that goes hand-in-hand with immunological principles are sound ones that I’ve agreed with for some time. I gained an appreciation of the benefits of FPGAs from performing research on algorithms for reconfigurable computing architectures some years back, and at a conference last year I got into a brief argument on the security dangers of monocultures with a government sysadmin who lauded the monolithic computing infrastructure he maintained. So it’s not a stretch to say that I am extremely sympathetic to their point of view.

Clark and Levin close by highlighting the need for open infrastructure–both reconfigurable hardware and open source software, and (insofar as it can be implemented) this is the entirely correct approach to technological security of any form. As Reagan said: “trust, but verify”.

update 10/23: The article is available here (subscription required).


A graded lexicographic index, part 2

19 October 2009

The previous post on this topic covered basic code for producing a graded lexicographic index (a/k/a the state index) and its inverse. This post will cover some more advanced techniques. The first one of these is simple: as de Boor pointed out (and it took me quite a while to catch on) the state index I(\alpha) of \alpha \in X_{B,b} can be given by a simple formula. Let PA denote the matrix given by the following MATLAB code:

mb=max(b+1,B+1);  P1=pascal(mb);  P=P1(1:B-1,:);  PA=[zeros(1,mb);P];

For example, with B = 4, b = 8 this is

PA = \left( \begin{smallmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 1 & 3 & 6 & 10 & 15 & 21 & 28 & 36 & 45 \end{smallmatrix} \right).

(Identifying the elements of the Pascal matrix with binomial coefficients would introduce some minor technical complications w/r/t the top row, and so we avoid this.) Now let

\alpha \equiv \sum_{k=1}^K \alpha_{j_k} \delta_{j_k}, \ \ \alpha_{j_k} > 0 \ \ \forall k.

Then

I(\alpha) = 1 + \sum_{k=1}^K \sum_{m=1}^{\alpha_{j_k}} PA \left( B + 1 - j_k, y_{km}(\alpha)\right)

where

y_{km}(\alpha) = b + 2 - m - \sum_{\ell = 0}^{k-1} \alpha_{j_\ell}

and an empty sum is evaluated as zero. This is embodied in the following MATLAB code:

J = find(alpha);
K = length(J);

for k = 1:K
   temp1 = sum(alpha(J(1:(k-1))));
   for m = 1:alpha(J(k))
      stateindex = stateindex + PA(B+1-J(k),b+2-temp1-m);
   end
end

stateindex = stateindex + 1;

This is hard to beat in general. However, for some applications it is sufficient to compute the state index at a neighboring site. And this can actually be done much more effectively in practice.

A deceptively simple algorithm always produces a superset of the neighborhood of a state in bucketspace. Write

\tau_k (\alpha) := 1+ \sum_{j=1}^k \alpha_{B-j+1}

and for a finite subset Y of \mathbb{R}, let \mu_k(Y) and \mu^k(Y) respectively denote the k least and greatest elements of Y. Now set I_1 := \{I(\alpha)\} and given I_k for k > 0 form

I_{k+1} := I_{k+1}^- \cup \{I_k\} \cup I_{k+1}^+

where

I_{k+1}^- := \{ \left( \mu_{k-1}(I_k) \cup I_1 \right) - PA(k, \tau_k) \}

and

I_{k+1}^+ := \{ \left( \mu^{k-1}(I_k) \cup I_1 \right) - PA(k, \tau_k - 1) \}.

Then the state indices of the nearest neighbors of \alpha are contained in the set I_B. It is not hard to see that this algorithm requires \sim 3B^2 simple operations (e.g., lookup, addition/subtraction, min/max) if we store the (fairly small) Pascal matrix. On “interior” states (i.e., states with no zero entries), this algorithm produces the exact neighborhood. However, on “boundary” states (i.e., states with at least one zero entry), some spurious indices are returned. It is not obvious (especially given its structure) how the algorithm might be refined to deal with this problem directly.

However, the spurious indices can themselves be checked efficiently. Given B and b, the following MATLAB code illustrates how to invert the state index:

PA2 = flipud(fliplr(PA));
alpha = zeros(1,B);
ind0 = 0; % init
for j = 1:B
   temp1 = cumsum(PA2(j,(ind0+1):(mb-1)));
   ind1 = max(find(temp1<s));
   if ind1
      temp2 = temp1(ind1);
      s = s - temp2;
      alpha(j) = ind1;
      ind0 = ind0 + ind1;
   end
end

The inversion essentially takes O(b) operations. With this tool in hand, we can trivially establish which state indices are spurious and which are not.

In the next post on this topic, we’ll produce a technique which is actually much faster than any of these for updating state indices in both MATLAB and Java, even after accounting for considerable care in the optimization (such as accounting for the order in which bitshifts are executed).


VizSec09

12 October 2009

VizSec 2009 was yesterday; aside from Bill Cheswick’s keynote and participating in the poster session (the poster is available on our downloads page), I was pleasantly surprised by Joel Glanfield et al.’s OverFlow work. Like us, they have recognized that aggregating IPs (among other things) is a good thing for visualizing network traffic, particularly over time. One thing OverFlow does that we don’t is to explicitly show a representation of the aggregated connections as a graph drawing. When aggregation is done statically (even including layer 4) this seems like the sort of thing that can be very friendly to analysts, but there are occlusion issues that suggest focusing on one aggregated node at a time, especially for time series data. Anyway I look forward to seeing more of this sort of thing and fewer “yarn ball” visualizations and their ilk that too often convey little or no useful information because of a refusal to recognize one of the great lessons of physics: that successfully analyzing complex systems is largely dependent on identifying relevant spatial and time scales and then ignoring irrelevant details. When I heard people saying that analysts complain that visualizations frequently “get in the way of the data” I think I know what they meant.

One thing I was pleasantly not surprised by is that the afternoon panel seemed to repudiate the notional equation “Security + Visualization = Science”. As I’ve commented here (and there), there can be no truly scientific theory of security. Visualization doesn’t change this. The place where security and visualization can overlap with each other and with science is in the development of frameworks guided by scientific principles, both in architectural and cognitive aspects. For example, an immunology-based security visualization tool might seek to leverage some kind of corresponding visualization, like some sort of graph summarizing “antibodies” that draws from biologists’ experience.

But trying to compare different visualizations scientifically is almost surely doomed to failure outside of a “perturbative regime” where small elements of visualizations are altered and the cognitive effects are measured. For instance, comparing Wireshark and TNV might be done carefully and provide some insight, but it does not qualify as science. And it doesn’t need to. Engineering is a good thing, and so are usability studies. But while we certainly base our own framework on principles from physics, we haven’t bothered with trying to do formal usability studies, because people will make it known if or when they want minor improvements to an interface, and that’s precisely the sort of thing that falls into the “perturbative regime” anyway. I think the bottom line is that if you care about how users interact with your tool and what it can do for them, just let them have a say in the development process.


Cyberterror: menace or myth?

8 October 2009

Today I went to a talk by Irving Lachow at the think tank where I used to work on whether cyberterrorism is a myth or a menace. The conclusion is probably obvious: it’s basically not a real menace. There’s lots for everyone to worry about from organized crime and certainly lots more for the government-industrial complex to worry about from nation-state threats, but cyberterrorism is a mirage, and basically everyone who doesn’t have skin in the game (and even some people who do or might–such as myself) agree. I’ve talked about similar themes here and here and here, among other entries on this blog.

Lachow mentioned that he did his research on this topic through unclassified sources, then went onto JWICS with crossed fingers–and ended up standing by his initial conclusions. Terrorists do organizational and support stuff and “influence operations” (I guess this is a subtler version of PSYOPS) using networks, but they don’t really engage in cyberterrorism. And the stuff that gets called cyberterrorism basically doesn’t deserve the name. Lachow believes, as I do, that crime, espionage, and state-level network attacks are what we really ought to be concerned with: cyberterror should be considered a “lesser included threat”–although the risks for all of these threats will only increase with time.

One theme that he brought up that I and many others have mulled over is cyberdeterrence. Basically, nobody knows how to do it. The nuclear analogies are false. But (as I’ve mentioned here) for a really big attack, one that’s worthy of a strategic offensive move by a nation-state or terrorist group, there will be a kinetic component. Attribution won’t be a problem. Old-fashioned deterrence with guns still works just fine.


A graded lexicographic index, part 1

6 October 2009

This post is going to introduce a way to index elements of the bucketspace

X_{B,b} := \{\alpha \in \mathbb{N}^B : |\alpha| \equiv \sum_{j=1}^B \alpha_j = b \}.

This is basically equivalent to indexing the \binom{B+b-1}{b} ways you can put b indistinguishable balls into B distinguishable buckets, hence the name.

Why is this useful? There are a number of pretty good reasons. For one thing, such an index is the key to providing a total ordering of \mathbb{N}^B that is a graded lexicographic ordering (i.e., something akin to a dictionary ordering in which words of length 1 appear first in order, then words of length 2, and so on) which can probably be used in lots of places (it seems like the sort of thing that could be a 35-40 rated exercise in Knuth depending on hints). A more specific application derives from the fact that multivariate Lagrange interpolation is about as nice as it can possibly be on X_{B,b}, which is probably why it turned out that Carl de Boor of spline fame anticipated a lot of our early work from 2001 discussed in this post two years beforehand.

Example of Lagrange interpolation.

Example of Lagrange interpolation of a function on X_{3,4}.

But the reason we’re talking about it here is that random walks on X_{B,b} have been used as a way of describing network traffic in a past approach by DoD that we’ve drawn a lot of lessons from (the basic idea for translating network packets to random walks was similar to the approach described at the end of a previous post, except that the finite space X_{B,b} is used instead of the infinite root lattice A_{B-1}), and we seriously considered running with this framework in tandem with the one that we followed through with.

There are also some cute martingale results that derive from looking at X_{B,b} as the configuration space of states for a classical Bose gas and looking at the energy fluctuations of the gas. In fact this line of thinking led in a very circuitous way to the development of a lattice gauge theory for random walks that we’re working on in the hope that it could lead to descriptors generalizing electric and magnetic charges, fields, etc. for dealing with correlations within network traffic–but talking about that more would take us way too far afield, especially since that stuff is still in the research stage.

Getting back on topic, pictures of the graded lex ordering obtained by projecting X_{B,b} onto \mathbb{R}^{B-1} give a little more insight. For example, with B = 3 and b = 12, there are \binom{3+12-1}{12} = 91 elements or states, and they are ordered as

(0,0,12), (0,1,11), …, (0,12,0), [13 states]

(1,0,11), (1,1,10), …, (1,11,0), [12 states]

(2,0,10), (2,1,9), …, (2,10,0), [11 states]

(11,0,1), (11,1,0), [2 states]

(12,0,0). [1 state]

If we just project these points into two dimensions and literally connect the dots in order (except for joining the last point to the first one) we get a pretty clear picture of what’s going on:

State index for <img src='http://s0.wp.com/latex.php?latex=B%3D3%2C+b%3D12&bg=ffffff&fg=333333&s=0' alt='B=3, b=12' title='B=3, b=12' class='latex' />. A circle indicates the first index and "x" indicates the last index.” width=”300″ height=”225″ /><p class=State index for B=3, b=12. A circle indicates the first index and "x" indicates the last index. The coordinates shown are artifacts of the projection used.

The pattern becomes more obvious if we add a dimension:

State index for <img src='http://s0.wp.com/latex.php?latex=B%3D4%2C+b%3D12&bg=ffffff&fg=333333&s=0' alt='B=4, b=12' title='B=4, b=12' class='latex' />. A circle indicates the first index and "x" indicates the last index.” width=”300″ height=”225″ /><p class=State index for B=4, b=12. A circle indicates the first index and "x" indicates the last index.

It looks recursive, and it is. Since you can read about an essentially equivalent construction in de Boor’s 1999 paper, I won’t belabor the details, though our MATLAB code is different from his and appears to be more fully developed in many respects.

First, here’s how to produce X_{B,b}:

function y=lookup(B,b);

% Generates a lookup table with B buckets and b balls.

P1 = pascal(max(b+1,B+1));
P = P1(1:B,:);

numstates = P(B,:);
% ith element gives number of states for B buckets and (i-1) balls

% addermat is a matrix of "row adder" vectors:
addermat = fliplr(eye(B));

%init
look = zeros(1,B);

% Put C = max(c)...the number of balls is b = C-1, so b-1 = C-2
for c = 1:b
 for k = 1:B
 blocksize(c,k) = nchoosek(c+k-2,c-1);
 temp = blocksize(c,k);
 cellblock{k} = look(1:temp,:)+repmat(addermat(k,:),[temp 1]);
 end
 look = cat(1,cellblock{:});
end

y = look;

Next, here’s a recursive index:

function y = stateindex(state)

% B: number of buckets
% b: number of balls

B = length(state);
b = sum(state);

if min(state) < 0 | max(state) > b
 stateindex = 0;  % basic error handling
else
 P1 = pascal(max(b+1,B+1));
 P = P1(1:B,:);
 PA = cat(1,zeros(1,max(b+1,B+1)),P);

 % addermat is a matrix of "row adder" vectors:
 addermat = fliplr(eye(B));

 blocknum = zeros(1,b);
 numblocknum = zeros(1,b);

 substate = state;
 for i = 1:b
 blocknum(i) = max(find(fliplr(substate)));
 substate = substate-addermat(blocknum(i),:);
 numblocknum(i) = PA(blocknum(i),b+2-i);
 end

 stateindex = sum(numblocknum)+1;
end

y = stateindex;

Unfortunately, recursive code usually looks conceptually elegant but performs terribly. In the next post in this series I’ll discuss some more sophisticated methods for indexing states in X_{B,b} and related algorithms that avoid this pitfall.


Jaynes and the Gibbs paradox

21 September 2009

Ed Jaynes (see also here) quoted Eugene Wigner as saying that “entropy is an anthropomorphic concept” and provided his own elaboration of this idea:

It is necessary to decide at the outset of a problem which macroscopic variable or degrees of freedom we shall measure and/or control; and within the context of the thermodynamic system this defined, entropy will be some function S(X_1,\dots, X_n) of whatever variables we have chosen. We can expect this to obey the second law T dS \ge dQ only as long as all experimental manipulations are confined to that chosen set. If someone, unknown to us, were to vary a macrovariable X_{n+1} outside that set, he could produce what would appear to us as a violation of the second law, since our entropy function S(X_1,\dots, X_n) might decrease spontaneously, while his S(X_1,\dots, X_n, X_{n+1}) increases.

As part of Jaynes’ MAXENT program, he discoursed at some length on the Gibbs paradox. In one version of this, a box contains a monatomic ideal gas of 2N atoms in equilibrium. Now divide the box in half by an impermeable membrane. A naïve computation of the entropy might suggest that S_{box} - 2S_{half} \ne 0. The resolution of this is found in the recognition that introducing the membrane effectively distinguishes atoms on one side of the membrane from those on the other side. But the molecules in the undivided volume are indistinguishable, and if we divide the partition function by the right factorial, the paradox disappears.

Jaynes analyzed the Gibbs paradox in detail and concluded that

The rules of thermodynamics are valid and correctly describe the measurements that it is possible to make by manipulating the macrovariables within the set that we have chosen to use…The entropy of mixing does indeed represent human information; just the information needed to predict the work available from the mixing.

Lawrence Sklar cites an example of Gibbs’ paradox with hydrogen gas in singlet and triplet states that is attractive but that I think is misleading: molecular collisions in a gas will induce transitions in the spin states that would greatly complicate attempts to deal with entropies of mixing. However Sklar’s point that entropies depend on the level of physical description is correct, as an amended example with (e.g.) molecular enantiomers (i.e., chiral “mirror images”) quite clearly shows.

The dependence of entropies on levels of description is fully manifested in information theory, and it is important to keep issues like this in mind when using entropy methods for anomaly detection, or more generally techniques like some of ours, in which network traffic is mapped onto the model thermodynamic system of a (typically single-particle) Bose gas. The practical utility of this sort of technique was again anticipated by Jaynes:

A physical system always has more macroscopic degrees of freedom beyond what we control or observe, and by manipulating them a trickster can always make us see an apparent violation of the second law.

Therefore the correct statement of the second law is not that an entropy decrease is impossible in principle, or even improbable; rather that it cannot be achieved reproducibly by manipulating the macrovariables \{X_1,\dots, X_n\} that we have chosen to define our macrostate. Any attempt to write a stronger law than this will put one at the mercy of a trickster, who can produce a violation of it.

But recognizing this should increase rather than decrease our confidence in the future of the second law, because it means that if an experimenter ever sees an apparent violation, then instead of issuing a sensational announcement, it will be more prudent to search for that unobserved degree of freedom. That is, the connection of entropy with information works both ways; seeing an apparent decrease of entropy signifies ignorance of what were the relevant macrovariables.

One of the things that we’ve done is to demonstrate (look at the data from the paper “Effective temperature for finite systems” on our downloads page) that using a small set of source/destination attributes of packets to define macrovariables reflecting the bulk traffic characteristics and looking for “tricksters” can be done quite effectively using the mathematical apparatus of statistical physics.

I have mixed feelings about Jaynes’ legacy. The two things he is most known for are his advocacy of Bayesian inference (I personally feel like conditional probability is not a big deal and have never been able to understand why the choice of a justified prior is a big deal, but I have seen lots of smart people mess this sort of thing up) and maximum entropy (I think it’s usually a good approximation technique that has to be used with some care–sometimes more than its practitioners employ). But I’ve got to hand it to the man for his ability to flesh out hidden assumptions and to transform “obvious” facts into powerful tools. A referee said about his original maximum entropy work that

It has no apparent practical application whatsoever. The problem and the point of view are not familiar to physicists. As a physicist I would raise the question whether the point of view is entirely new, whether it has been discussed explicitly by Information Theory people, or whether it is implicit in the work of Information Theory experts. I would guess that it is at least implicit in their thinking.

Jaynes framed this review and hung it on his wall. I think it’s safe to say he got the last laugh there.


The fundamental law of statistical physics

9 September 2009

Feynman called the canonical ensemble

the summit of statistical mechanics, and the entire subject is either the slide-down from this summit, as the principle is applied to various cases, or the climb-up to where the fundamental law is derived and the concepts of thermal equilibrium and temperature T clarified.

In short, the canonical ensemble is the probabilistic description of a system in thermal equilibrium with its environment. It is expressed through the Boltzmann-Gibbs distribution (which in accordance with the more general mathematical usage I will simply call the Gibbs distribution):

\mathbb{P}(E_k) = Z^{-1} e^{-\beta E_k},

where E_k is the energy of the kth possible state of the system (for convenience, assume a finite number of such possible states), the partition function that serves as a normalizing factor on the right hand side above is given by Z := \sum_j e^{-\beta E_j}, and \beta := 1/k_B T, where k_B is the Boltzmann constant.

There is a quick derivation of the Gibbs distribution for a finite system from the basic postulate that the probability of a state depends only on its energy. I have seen elements of it in physics textbooks, but invariably I have found that the derivations introduce a function that counts the number of system states at a given energy. That approach has the clear advantage of motivating the introduction of entropy, but it also has the cost of clouding the essence of the derivation itself.

What follows is a quick derivation of the Gibbs distribution along different lines. It is not completely rigorous as presented here, and I cannot recall having seen it anywhere else, but it is almost certainly out there somewhere. (As usual in cases like this, I will give $.50 and a Sprite to the first person who can show me an essentially identical argument in the literature: claim your prize in a comment.)

The key to the derivation is that only differences in energy can be measured, or as Feynman put it in his derivation, “energy is only defined up to an additive constant”. This and the basic postulate imply that

\mathbb{P}(E_k) = \frac{f(E_k)}{\sum_j f(E_j)} = \frac{f(E_k + \epsilon)}{\sum_j f(E_j + \epsilon)}

for some function f and \epsilon arbitrary. Define

g_E(\epsilon) := \frac{\sum_j f(E_j + \epsilon)}{\sum_j f(E_j)}

and note that g_E(0) = 1 by definition. It follows that

\mathbb{P}(E_k) = \frac{f(E_k)}{\sum_j f(E_j + \epsilon)} g_E(\epsilon) = \frac{f(E_k + \epsilon)}{\sum_j f(E_j + \epsilon)}.

Therefore

f(E_k) g_E(\epsilon) = f(E_k + \epsilon),

implying that

f(E_k + \epsilon) - f(E_k) = (g_E(\epsilon) - 1) f(E_k),

and in turn, since g_E(0) = 1,

\frac{d}{dE_k}f(E_k) = g'_E(0) f(E_k).

As a result,

f(E_k) = A \exp (g'_E(0)E_k).

Setting, without loss of generality,

\beta := -g'_E(0) and A \equiv 1

produces the Gibbs distribution so long as the temperature is defined via T := 1/k_B\beta. Note also that

g_E(\epsilon) = e^{-\beta \epsilon}

so that g_E \equiv g, as required for the self-consistency of the argument. And although the derivation here is only appropriate for a fixed \beta, this amounts to the canonical ensemble.

In effect, this way of arriving at the Gibbs distribution takes the notion of temperature as a primitive, compared to the usual way which takes entropy as a primitive. There is a surprisingly good reason to do this: entropy can be introduced, defined, and applied in much broader contexts than physics, precisely because it is abstract. Temperature, on the other hand, has no obvious information-theoretical interpretation, and more often than not its physical interpretation has to be given in terms of the average kinetic energy of a gas particle in thermal equilibrium with whatever system is actually under consideration. That is, the meaning of temperature is only really clear in relation to a gas, though as the well-known kinetic theorist Harold Grad put it:

Any dynamical system can be used as a thermometer to measure the temperature of another large enough system.

Hasok Chang’s recent book Inventing Temperature: Measurement and Scientific Progress and Lawrence Sklar’s wonderful Physics and Chance: Philosophical Issues in the Foundation of Statistical Mechanics both discuss the nature of temperature, and with complementary emphases.

One of the things that we’ve done at EQ is to define a sensible notion of temperature for network traffic and demonstrate its utility for anomaly detection. The essential ideas and some actual data are detailed in a paper available from the downloads section of our website (note that this is a preprint addressed to physicists, and it’s still a work in progress). The question “what is the temperature of a network?” doesn’t have a “right answer”, but neither does the question “what is the entropy of X?” Even so, there are ways to define “appropriate” temperatures and/or entropies, some better than others. A 2000 paper by Mark Burgess of cfengine fame was among the first places that ideas like this were publicly applied to network security, though Burgess had floated these ideas a few years before then, and EQ’s technology drew a lot of lessons from the later stages of an effort initiated by Dave Ford at NSA that was declassified around the same time.

Because temperature is usually only defined in thermal equilibrium, where the Gibbs distribution applies, it makes sense to go in the opposite direction and use the Gibbs distribution to define an effective temperature for systems–whether or not they are actually in equilibrium. This is basically what we do: and it turns out that there is a unique way to do it provided that a characteristic timescale for the system is known. In situations like these I take to heart a comment of Henry McKean, a mathematician who is well known for contributions to probability theory and statistical physics (among other things) that

One of the faults of mathematicians is [that] when physicists give them an equation, they take it absolutely seriously.

In a nutshell, statistical physics is great at extracting the few most relevant parameters describing gigantic volumes of data. The detailed microstate of a cup of water encodes many orders of magnitude more information than can go over a 100 Gbps link in a year, and temperature, pressure, and volume do a pretty good job of describing the important features. It’s not much of a stretch to apply the same ideas to network traffic.

http://en.wikipedia.org/wiki/Canonical_ensemble

VizSec 2009

8 September 2009

EQ will have a poster at VizSec 2009. I’m looking forward to seeing lots of cool new ideas in security visualization there as well as explaining our own approach to providing a scalable visualization and analysis infrastructure for network traffic.


Follow

Get every new post delivered to your Inbox.