Thesis-related post, feel free to ignore
Jun. 25th, 2011 02:07 pm
So I had a bit of a freak-out early last week regarding being able to get my thesis done. But I met with my advisor, and he offered me words of encouragement and a plan to assuage my anxieties about writing.
Writing. I don't do it. At least, not very well. I have snippets of LaTeX files all over the place, a few starts at something like a motivating introduction, a proposal document, and a conference paper. I also have a macro file for notation and a BibTeX file for citations, which make me feel better about organization anyway. Notation was one of the reasons (I think) that I wigged out on my first thesis attempt. I didn't have any way of updating it well. I made stuff up as I went along and found myself quickly running out of letters and going into subscript/superscript hell. The macros (thanks astralagos for showing me how to use them) are so much easier to deal with.
But as for writing sections and stuff, I am still not very good at it. The problem is that I am still developing my bigger model, which means that the terminology is all in flux. And there are a lot of steps and things to consider. And I have an idea of how it should go but the "fiddly bits" of organization and additional notation are killing me.
My advisor suggested (very rationally) that I do have a section or two where the notation is quite well pinned down, and that I should write that up as a chapter. This will help on two fronts: it will document the first part of what is done, and it will help me get over my "OMG I need to have something written and every time I open a LaTeX file I just stare at it stupidly!" feelings.
So that's the plan. Get a good set of "first step" results, write up the whole first step bit. Then move on to the harder stuff.
----------
I am basically building a rather complicated modeling and estimation procedure, a little bit at a time. So this gives me a very nice structure for the dissertation chapters: start with the simplest case, then for each new chapter examine what happens when you add "the complication of X". Thing is, I keep thinking about more and more complications that are like sub-complications of the big ones... Anyway, I figured I'd write them down.
BASIC PROBLEM:
You want to know how big a botnet is. How many infected machines? You can monitor network traffic, and so you see machines from the botnet communicating with each other, or scanning you, or DDoS-ing you, etc. But you don't really see individual machines; you see IP addresses in the traffic you monitor. And IP addresses are to machines as real addresses are to people: some are single-person homes, some are high-rise apartment buildings, and some are time-shares. You have a pretty good idea (by which I mean a probability model) of how a single machine's activity looks when it is not distorted through its connections in IP space. How do you use this information to get an estimate of the number of machines? PS this problem will probably get worse with IPv6 adoption?
I can describe it with an agricultural metaphor? You are a farmer and want to know how many grackles are living in your very large and vast apple orchard. Grackles are pests and you want to get rid of them, but you want to know where the worst infestations are in order to best set your traps. You can very easily see and count your trees; you know where they all are. But it is much harder to see how many grackles are hiding in your trees, or to catch sight of them as they flit from tree to tree. Some of your trees are housing tons of grackles, some maybe only have a few.
(the metaphor breaks down a little here) The thing is, grackles are loud. You have microphones set up around the perimeter of your apple orchard and you can more easily HEAR these damn grackles squawking away all day long. Because your microphones are awesome, you can also pinpoint which tree each of the squawks is coming from. You also have some naturalist friends and one or two captive grackles that give you information on the amount of noise a single grackle makes when it squawks, as well as its tendency to squawk louder or softer, and more or less often, eg during the day or at night. To some extent, you also know how much it likes to flit from tree to tree.
Now this model of bird behavior depends on some parameters, and while you have some good guesses for the values of these parameters based on your captive birds, you would also like to use the data you get from your orchard to refine the model, and get better parameter estimates. Maybe your birds were sick or weird, and hey you only had one or two of them, that's not great for getting standard error estimates. You've got a freaking orchard full of them at your disposal and the naturalists really want to know if their initial guesses at behavior could be better.
SO, use that information from your captive birds, along with the information from about two months worth of your microphone recordings, to estimate how many birds have been infesting your giant apple orchard, and where the worst infestations are located, so you know where your traps will get the best bang for their buck.
In my world: Machines are birds, Networks (eg IP addresses) are trees. Network admins doing cleanups around the world are like feral cats roaming your orchard, picking off grackles here and there and upsetting their living conditions.
Here is where the metaphor really breaks down: You use a big computer program to start with an initial guess of how many birds are in your trees and where they were located across the span of time they were in your orchard. Then you use data and increasingly complex probability models to iteratively refine that guess, and build an entire history of grackle behavior in your orchard over the past two months.
----------
CODE:
So the code is set up as a big graph with two types of nodes. On one side are the Networks (the trees) where I have observable data of how many times my network got scanned by infected hosts hiding within those networks (eg, how loud the birds are squawking): these are hourly counts over a period of 2 months (1224 hours). These objects are static; networks can't be added or subtracted from the graph once they are initialized. Networks are linked to Machines, which I don't observe. I have to guess at how many machines there are and at how they are connected through to networks. So a Machine object also has a set of hourly counts for 1224 hours, and it has a breakdown of what network it sent its counts through for each of the 1224 hours.
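To make those two object types concrete, here is a minimal sketch of roughly how I picture them; the field names (block, observed, counts, network_idx) and the kHours constant are just my shorthand here, not the actual class definitions.

#include <cstdint>
#include <vector>

// Hypothetical shapes for the two node types; names and fields here are my shorthand,
// not the actual classes in the code.
static const int kHours = 1224;  // roughly two months of hourly bins

struct Network {
    uint32_t block;              // the /24 net block this touchpoint corresponds to
    std::vector<int> observed;   // observed squawk counts, one per hour (size kHours)
};

struct Machine {
    std::vector<int> counts;       // proposed counts emitted by this machine, per hour
    std::vector<int> network_idx;  // index of the Network each hour's counts went through
    // ...plus the 13 single-machine parameters and an hourly state (off / spike / decay)
};

// Invariant the sampler has to preserve: for every network n and hour t, the counts
// of all machines routed through n at hour t must sum to n.observed[t].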
But because I'm guessing at the configuration, Machines are dynamic. I can add a machine into the graph, linked up with a number of networks, or I can take a machine out of the graph and re-distribute its proposed counts among the other existing machines. I can also keep the number of machines constant but re-arrange the counts associated with each machine, as long as the counts sent through each network still add up to what was actually observed there at each point in time. I can also hold a machine's counts fixed and decide that they went through different network points. And I can propose new parameter values for the probability model of individual behavior associated with any single machine.
Ie, I can theorize a bunch of different configurations of numbers of birds squawking in a tree over time: lots squawking quietly, or a few squawking loudly, or maybe several that are squawking in sequential conversation with each other--but any of my theories must jibe with the timeline of the total number of squawks I heard from that tree ;) You can imagine right now, the way the information of a single bird's squawking and mobility patterns will help you decide between more or less likely configurations. If a tree is a hundred times louder than a single bird, it is very highly unlikely that the noise is being caused by a giant velociraptor-esque grackle hunching its massive form in behind the leaves. No, you've probably got about a hundred grackles in there. If grackles generally squawk for 20 minutes, then it is not likely that a 20-minute period of grackle squawking is really 10 birds talking to each other in 2-minute intervals.
So to recap, to get a good estimate of your Machine population you do the following:
1) Start with an initial guess of:
--(a) The total number of machines
--(b) The configuration of where and when these machines squawk through the network points
--(c) The parameter values associated with single-machine behavior. There are 13 parameters in my single machine model, as well as a set of "states" for each machine: Off (not squawking), Spiking (machines tend to squawk louder when they just turn on), and Decaying down to base rate (machine squawks tend to spike at around 15 squawks per hour and then trail off down to about 3 squawks per hour over a period of about 4 hours).
2) Start "messing" with this configuration in a very principled way:
--(a) Propose a new configuration, "close to" your current configuration
--(b) Compare the likelihood of this new configuration with the current configuration
--(c) Accept or reject the new configuration with a very specific probability (capped at 1; a small code sketch of this step follows the ratio):
(Likelihood of proposed config) * (probability of moving from proposed to current)
----------------------------------------------------------------------------------
(Likelihood of current config) * (probability of moving from current to proposed)
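In code, step 2c is just an accept/reject coin flip on that ratio. A minimal sketch, done in log space so tiny likelihoods don't underflow; the four log terms are stand-ins for whatever the single-machine model and the proposal mechanism actually supply:

#include <cmath>
#include <random>

// Step 2(c) as a log-space Metropolis-Hastings accept/reject.
bool accept_proposal(double loglik_proposed, double loglik_current,
                     double log_q_reverse,   // log prob. of moving proposed -> current
                     double log_q_forward,   // log prob. of moving current -> proposed
                     std::mt19937& rng) {
    double log_ratio = (loglik_proposed + log_q_reverse)
                     - (loglik_current + log_q_forward);
    if (log_ratio >= 0.0) return true;        // ratio >= 1: always accept
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    return std::log(unif(rng)) < log_ratio;   // otherwise accept with probability = ratio
}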
That's it. Not too hard, eh? Statistical theory tells us: do step 2 iteratively a bajillion times, and what you will eventually end up with is a sample (with replacement) of configurations that has the following property: for each different configuration, its posterior probability given your observed data is proportional to the number of times it shows up in the sample. That is, you get a Monte Carlo estimate of the posterior distribution of configurations given the observed squawks through the network and the assumptions you laid out in the single-machine model. Which, by the by, also includes a distribution of the total number of machines in the population, which is what we care about.
Great! Now of course, we just need to describe how to start off with an initial configuration for step 1, and how to formally "mess with it" ie implement step 2. BECAUSE, the deal is that I don't know if you noticed, but whole configurations of this graph are MASSIVE. Tons of different things you could change, and lots and lots of objects. Ha, for my entire data set I have 33 million IP addresses spread across 1.1 million net blocks (sets of 256 contiguous IP addresses) and probably about 10 million machines? Yeah, not gonna try to implement this for all of the data at once.
But even for a relatively small number of networks and machines, you don't want to just start proposing wholesale giant changes in configurations. The giant dimensionality of the changes means that the acceptance ratio in step 2c will be .... poor. BUT statistical theory also tells us that you can keep the vast majority of the graph fixed and "mess with it" in very much smaller ways. That ratio in 2c is built up of a whole bunch of little ratios multiplied together; if you change one little bit at a time, you keep most of those little ratios constant, and you are more likely to accept new configurations.
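Here is the cancellation in miniature. I'm pretending the per-hour term is a plain Poisson count, which it isn't (the real single-machine model is the 13-parameter thing with states), but the point is just that a proposal touching one machine at one hour only changes one factor of the 2c ratio:

#include <cmath>

// Stand-in per-hour log term: log Poisson(count | rate). The real model is richer,
// but the cancellation argument is the same.
double log_poisson(int count, double rate) {
    return count * std::log(rate) - rate - std::lgamma(count + 1.0);
}

// If a proposal only touches one machine at one hour, the log of the 2c ratio
// reduces to the change in that single factor; every untouched factor cancels.
double log_ratio_one_hour(int current_count, double current_rate,
                          int proposed_count, double proposed_rate) {
    return log_poisson(proposed_count, proposed_rate)
         - log_poisson(current_count, current_rate);
}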
(This is called "mixing". You are basically stepping through a high-dimensional space that is overlaid with a probability distribution at each point. You propose a new direction and hop there according to how likely it is relative to where you are. Big jumps will get you across the space better but they are harder to accept. Little jumps are easier to accept but you won't go as far. Good mixing tries to achieve a balance: it means that your hops are going to be correlated, but they will be big enough that you will not take too long to explore the whole space. Of course, sometimes correlations and restrictions can also make small-dimensional jumps "too small": think about, eg, having parameters A and B where A+B is fixed; you can't try to change A by itself, because you will never accept any proposal other than the current value of A; you must propose a new value for both A and B together.)
I've touched on the kinds of steps we can do in this graph but let's list them:
1) Single model parameters: There are 13 parameters for each machine, and a set of states (off, spike, decay) associated with each machine at each hour. A sub-step would be to randomly pick a machine out of the graph, and then to propose new values for any or all of these parameters or states. You can do this one parameter at a time. By the by, this is called updating the "single host model". I outlined all of the math for this already in my thesis proposal document. I implemented a version of it that just looked at a single trace of squawks and did this estimation, and I have very nearly finished with the bit that takes that old nasty kluge code and works it into the overall graph structure, where now it can pick out a machine at random and update its parameters.
2) Keep the number of machines fixed, and re-arrange the squawks for a single IP address for an hour amongst the machines squawking behind it. Two ways to do this: conditional on underlying machine state (smaller dimensional jump), or proposing both new states and new count rearrangements of machines at the same time (larger dimensional jump that might work better due to the "A+B" mixing problem).
3) Add or Remove a machine. This is going to be done in a very specific way. To add a machine, you randomly sample a single machine, then propose splitting that machine into two machines (generating new histories and counts for each machine, holding its network points constant); a rough sketch of the split half follows this list. To remove a machine you either (1) clean up any machines that are basically "off" for the duration of their history, or (2) randomly sample two machines and propose to merge them together into one. This is a large-dimensional jump, you'll notice, because it involves wholesale changes of the 13 parameters and states of the machines across their histories. I am a bit worried about it. Also the math of this involves implementing something called "reversible jumps" that tweak the ratio from 2c with some determinants or something, IDEK.
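Here is a very rough sketch of what the split half might look like, under one big simplifying assumption: each hour's count gets divided binomially between the two offspring, with the network points held fixed. The real move also has to propose new parameters and states for both offspring and carry the reversible-jump correction into the 2c ratio; none of that is shown.

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Local shorthand for a machine's history; network points are carried along unchanged.
struct MachineCounts {
    std::vector<int> counts;       // per-hour counts
    std::vector<int> network_idx;  // per-hour network point (held fixed by the split)
};

// Propose splitting one machine into two, preserving the per-hour totals.
std::pair<MachineCounts, MachineCounts>
propose_split(const MachineCounts& parent, std::mt19937& rng) {
    MachineCounts a = parent, b = parent;
    for (std::size_t t = 0; t < parent.counts.size(); ++t) {
        std::binomial_distribution<int> coin(parent.counts[t], 0.5);
        a.counts[t] = coin(rng);
        b.counts[t] = parent.counts[t] - a.counts[t];  // totals at each hour unchanged
    }
    return std::make_pair(a, b);
}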
--------------
SIMPLEST CASE: AT MOST ONE BIRD TO A TREE (AND THEY NEVER MOVE FROM TREE TO TREE)
Well, if this is the case, you have a very, very simple solution. Just update the parameters of your model of single-machine behavior for each machine. You are basically assuming that machines are essentially visible. So, initialize your graph: read in your Network data, and then spawn a single Machine for each Network. Never change the number of machines. Never rearrange machine counts among IP addresses; the only possible configuration is that the machine sent what you saw. Note that this is basically "implement the single machine model updater and run it over the population".
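As a sketch, using the same hypothetical shapes as before (trimmed down to what this case needs), the initialization is just:

#include <cstddef>
#include <vector>

struct Network { std::vector<int> observed; };
struct Machine { std::vector<int> counts; int network_idx; };

// One Machine per Network, carrying the network's entire observed history.
std::vector<Machine> init_one_per_network(const std::vector<Network>& nets) {
    std::vector<Machine> machines;
    machines.reserve(nets.size());
    for (std::size_t i = 0; i < nets.size(); ++i) {
        Machine m;
        m.counts = nets[i].observed;
        m.network_idx = static_cast<int>(i);
        machines.push_back(m);
    }
    return machines;  // from here on, only the single-machine parameter updates run
}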
COMPLICATIONS DUE TO NAT (MULTIPLE MACHINES BEHIND ONE ADDRESS)
In this case you assume that you have at most N birds to a tree, but that birds don't move from tree to tree, ie, they all send their squawks through a fixed network point. This is the case for example with Network Address Translation (NAT); a company or a dorm or an ISP or whatever has one outward facing proxy through which all of its sub-allocated machines send their traffic. Basically the "high rise" situation. In this case, if we assume that the distribution of the number of hosts behind a NAT is uninformative, the graph estimation includes the steps for updating single machines as before, but also, ONLY WITHIN NETWORKS, you do the merge/split and reallocation steps. You can either initialize the same way (one machine per block), or you can use prior information to eg, initialize 10 machines and divvy up their traffic behind a block that looks approximately 10 times bigger than normal according to your a priori single host model with its nice expertly informed (if not perfect) parameters.
COMPLICATIONS DUE TO DHCP (ONE MACHINE ZINGING AROUND LOTS OF ADDRESSES)
Dynamic Host Configuration Protocol (DHCP) is one way that Internet service providers (ISPs) "lease out" a pool of addresses to lots of underlying hosts. If you assume that it's going on in your graph, now you're in the situation where you can do all of the above from the first two cases, but you can also choose machines to merge or split across different network touchpoints. Before, in the NAT case, because the network touchpoint never changed and your model of the total number of machines behind a network point is uninformative, the math of merging and splitting depended only on the model of a single host's squawking behavior, and not on any of the models of its tendencies to flit from tree to tree. So in order to handle the DHCP case, the math of the merge/split steps needs to be updated to include the information about movement. Specifically, for example, do you really just want to randomly choose two machines to merge? Their current IP configurations could be in entirely different countries. You want to be able to pick a machine, then pick another "nearby" machine and see if it is likely for them to be merged together.
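One hedged sketch of the "pick a nearby machine" idea: weight candidate merge partners by how close their current network points are, crudely measured here by how many leading octets of the /24 prefixes agree. Purely illustrative; whatever weighting gets used has to be folded into the proposal probabilities in the 2c ratio.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Crude closeness: count how many leading octets of two /24 prefixes agree (0..3),
// shifted so every pair keeps some nonzero weight.
double closeness(uint32_t block_a, uint32_t block_b) {
    int match = 0;
    for (int shift = 24; shift >= 8; shift -= 8) {
        if (((block_a >> shift) & 0xFFu) == ((block_b >> shift) & 0xFFu)) ++match;
        else break;
    }
    return 1.0 + match;
}

// Pick a merge partner for machine `chosen`, favoring machines whose current network
// point is close to it. Assumes there is more than one machine in the graph.
int pick_merge_partner(const std::vector<uint32_t>& current_block, int chosen,
                       std::mt19937& rng) {
    std::vector<double> weights(current_block.size(), 0.0);
    for (std::size_t j = 0; j < weights.size(); ++j)
        if (static_cast<int>(j) != chosen)
            weights[j] = closeness(current_block[chosen], current_block[j]);
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    return pick(rng);  // this choice probability goes into the acceptance ratio
}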
COMPLICATIONS DUE TO NETWORK-INITIATED MOVES AND SHUTDOWNS
Sometimes, ISPs just move shit around, wholesale. IE, they will take a NAT of a thousand computers communicating through 1.2.3.4, and move it to 1.2.3.10 or whatever. This shows up as big jumps of lots of machines at once in network patterns. It should probably be incorporated as a step in the model that says "no, all of these computers did not just all turn off at once and disappear, see they reappeared over here, and so their behavior is not a highly unlikely coincidence of independent machine behavior, but correlated to the network through which they were beaconing, that just got turned off." Haha, yeah, this is way future work at the moment.
Some Current Issues and my Concerns:
--------------
WTF sort()?: I have to get one more little bit working in my current code in order to see some results, and then I can analyze the results and write them up. The segfault that I fixed the other day uncovered a very strange error involving a call to std::sort() on C++ vector objects. I have a vector of pointers to an object that I created (called a Network, so eg "vector<Network*> network_touchpoints;" that gets filled with pointers using push_back()), and I declared a sort function because I want to sort the vector of Networks by the first three octets of their IP space, and then remove duplicates if they occur. So I have a boolean sort predicate function that should work, and I call, eg:
sort(network_touchpoints.begin(), network_touchpoints.end(), NetworkPtrSortPredicate);
WELL, let me tell you, that command executes and COMPLETELY borks up all of the memory of the objects that the pointers point to! WHY would it do that? The vector is just rearranging the order of the POINTERS in its list. However, once that is done, the things that the pointers POINT to no longer have the same values. They have been written over with garbage. It's very strange, and it makes me nervously wonder if the problem isn't somewhere else entirely; that is, another bug that is overwriting memory it shouldn't overwrite, early on in the process, but that doesn't get exposed until I call sort(). Boo.
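For what it's worth, here is roughly what I'd expect the predicate and the de-duplication to look like; Network::prefix is my stand-in for "the first three octets". Two things worth ruling out: a predicate that isn't a strict weak ordering (eg using <= instead of <) really can send std::sort out of bounds and corrupt memory, and if the vector holds dangling or uninitialized pointers, sorting will shuffle garbage into view even though the sort itself is innocent.

#include <algorithm>
#include <cstdint>
#include <vector>

// Network::prefix stands in for the first three octets of the block.
struct Network { uint32_t prefix; /* ...hourly counts etc... */ };

// Strict weak ordering: must use '<', never '<='.
bool NetworkPtrSortPredicate(const Network* a, const Network* b) {
    return a->prefix < b->prefix;
}

bool NetworkPtrSamePrefix(const Network* a, const Network* b) {
    return a->prefix == b->prefix;
}

void sort_and_dedupe(std::vector<Network*>& network_touchpoints) {
    std::sort(network_touchpoints.begin(), network_touchpoints.end(),
              NetworkPtrSortPredicate);
    // std::unique only shuffles pointers; deleting duplicate heap objects, if any,
    // is a separate cleanup step.
    network_touchpoints.erase(
        std::unique(network_touchpoints.begin(), network_touchpoints.end(),
                    NetworkPtrSamePrefix),
        network_touchpoints.end());
}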
I have two options:
1) Try to fix this in the C++ code
2) Write a Python or R script that will pre-process the networks and make sure they are sorted and de-duplicated before I read them into the C++ objects.
I'm thinking that (2) is the way to go. Though, it will not solve the problem if it is indeed a symptom of an earlier segfault; it will just move any strangeness to somewhere else. Or, for now:
3) comment out those bits and come back to it later
I know for a fact that my current set of networks contains no duplicates, and actually, the part of the analysis where the sort order of networks even matters is not even programmed in yet. Perhaps then, *that* is what I should do.
The other weird thing is that this problem did not occur when I was running the code on a data set that is smaller than the one I'm using now. I'll have to go back and check and make sure--maybe the smaller data set DID have this problem and I didn't notice, or maybe the smaller data set was already sorted.
Right. Picky code errors. That's what I get for choosing a thesis topic that involves a mountain of code that really can't be done in anything nice like R. Anyway, it's okay. I like being able to have concrete goals of "chase down this error in the code".
-------------
The maths behind jumps: Assume that your single host model does not give you any information at all about the relative likelihoods of machine-to-IP-address allocations. IE, the model of how or why birds flit from tree to tree is completely uninformative: any configuration of a bird moving from tree to tree, or of the number of birds in trees, is equally likely. The first pass of the machine parameter steps, count rearrangements, and machine merging and splitting all make this assumption, implicitly. When you assume that network configurations give you information, you need to incorporate the likelihood of these configurations into the merge/split steps when you spawn new machines and divvy up their counts amongst the networks, or when you merge machines and agglomerate their counts and their network points.
---------------
Births and deaths and timezones?: Right now I have restricted the single model updater to just look at the patterns of behavior between the first time the network is seen turned on and the last time it is seen turned on. There are lots of rather ephemeral networks that, eg, only show up for a day out of the 50. Currently a machine spawned from a network in the initial setup stage takes on that begin and end point as one of its characteristics, and the single model updater ignores zero-counts outside that interval. It also ignores new infections and death due to cleanup. But if one is going to be rearranging machines across networks, then for any machine its arrangement of counts might change that beginning and ending point, and so really I should set begin=0 and end=1223 for all, and possibly add in a "could be uninfected up to point X, could be cleaned after point Y" step.
Oh also, the tendency of machines to "spike" also depends on the network. I did some pre-processing of the networks and calculated a time zone for each, and a spot that I'm pretty sure is the "turned on with a rate spike" time of day for each network (this informs some of those 13 single model parameters). Again, when merging machines randomly across networks, how do I update this value when it differs across the machines being merged? Well anyway, both of these are "Stage 3 concerns": I have to get the simple case and the NAT (within-network, uninformative) case working first.
-----------------
Reversible Jumps and high-dimensional steps: One could ostensibly start changing the dimension of the space not only by merging/splitting machines, but by taking OUT parameters in some machines. Spikes only happen in some machines; some don't spike at all. So you could take out the "spikiness" parameters for a machine if it looks like it's not spiking. But then what do you do if you want to merge a spiking and a non-spiking machine? Perhaps it is best to let sleeping dogs lie on that one.
On the other hand, the reversible jump for merging and splitting is going to be, as I mentioned, very high dimensional, because it's dealing with entire 1224-hour histories at once. I am rather worried about getting this working right. Once the single host stuff is done (hopefully running correctly this weekend), I will start implementing a very naive version of this merge/split. And then I will hope it works and mixes well despite its high dimensionality.
-----------------
Identifiability: Bah, yeah, I need to write about that. Particularly the possibility of smearing inference across parameter configurations as the possible counts among machines are rearranged. Good thing those are mostly nuisance parameters.
------------------
Urgh, that was long. I guess it's time to stop typing, comment out the offending code from issue number 1, and ... figure out where the next segfault is. Yoiks and away!