LJ won't take pdfs?
Aug. 19th, 2011 11:47 am

How annoying. So anyway, it is not as cool as a Steampunk femme Seven, but (admittedly more for me than for you all) I wanted to upload some pictures I made of the results from running my model on 154 different traces of counts. I spent a good while coming up with a big diagnostic picture that let me study the results machine by machine. Some are good (the model seems to fit well) and some are not so good (the model is unidentifiable in stupid ways).
Recall, if you will, that I am interested in modeling the rate at which machines infected with a particular virus (Conficker-C) scan the internet (and thus my network). Here is my diagnostic plot for model checking and visualization. I wrote all the plotting code in R.

The top left plot is a time series of the number of scan attempts I saw on my network originating from a single infected IP address. Tick marks are 24-hour intervals; the data set spans 51 days. The second plot is a breakdown of the model's performance via simulation. The blue line is the mean count the model simulations predict at that hour. The grey shading spans the 0.01 and 0.99 quantiles of the simulations. The dots are the counts again: black means "falls within the 99% bands" and red means "falls outside the 99% bands". Kind of a rough measure of goodness of fit.
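For the curious, the band/dot logic is simple enough to sketch. Here is roughly what it looks like in Python (my real code is in R, and the simulated counts below are fake stand-ins, not Conficker data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_hours = 24 * 51  # hourly counts over the 51-day window

# Stand-in data: rows are posterior simulations, columns are hours.
sims = rng.poisson(lam=20.0, size=(1000, n_hours))
counts = rng.poisson(lam=20.0, size=n_hours)  # "observed" counts

mean_pred = sims.mean(axis=0)                     # the blue line
lo, hi = np.quantile(sims, [0.01, 0.99], axis=0)  # the grey shading

inside = (counts >= lo) & (counts <= hi)  # black dots; ~inside are the red ones
coverage = inside.mean()                  # fraction of hours inside the bands
```

If the model fits, `coverage` should sit near the nominal level of the bands.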
The third plot shows a breakdown of the "states" the model thought this machine was in. Gray means "off", i.e., the machine was physically turned off and so was not scanning. Red means "spiking": at that hour the model thinks the machine had spiked up above its baseline rate. Blue means "decaying down to baseline". The height of the bar shows the probability of each state (between 0 and 1) at that hour. As a note, the predictive plot in part 2 was made by first generating a state from that distribution of states at each hour, then calculating the off/spike/decay rate for that set of states, then generating a Poisson-distributed count at that rate for each hour. It is conservative in that way, in that I did not, e.g., start with a single state and generate a string of states via their transition probabilities.
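That simulation scheme, sketched in Python with made-up parameter values (the state names and per-state rates here are illustrative, not the thesis's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(1)

q, omega = 15.0, 4.0  # baseline rate and spike multiplier (made up)
rate = {"off": 0.0, "spike": q * omega, "decay": q}  # simplified per-state rates

# Marginal state probabilities at each hour, as read off the state plot
# (columns: off, spike, decay). Two fake days of hours here.
state_probs = np.array([[0.7, 0.1, 0.2]] * 48)

def simulate_counts(state_probs, rng):
    """Draw a state independently at each hour from its marginal distribution,
    then draw a Poisson count at that state's rate -- no chaining of states."""
    states = [rng.choice(["off", "spike", "decay"], p=p) for p in state_probs]
    return np.array([rng.poisson(rate[s]) for s in states])

counts = simulate_counts(state_probs, rng)
```

Sampling each hour's state independently, rather than running the transition probabilities forward, is what makes the predictive check conservative.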
If you look at the traces on the right side of the graph, they show the estimates of various parameters. "OffLambda" is the average number of hours the machine stays turned off at a clip. "q" is the baseline scan rate. "alpha" is the rate per hour at which a spike decays down to the baseline. "omega" is the spike multiplier. I.e., when the model spikes, it jumps from a rate of q per hour up to a rate of q*omega per hour, and then decays back down by a multiplier of alpha for each hour after the spike.
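Taking that description literally (this is my reading of it, not necessarily the exact thesis formula), the rate k hours after a spike would be:

```python
def rate_after_spike(q, omega, alpha, k):
    """Rate k hours after a spike: starts at q*omega and decays geometrically
    by a factor of alpha each hour, never dropping below the baseline q."""
    return max(q, q * omega * alpha ** k)

# With q=10, omega=5, alpha=0.5 the rate goes 50, 25, 12.5, then pins at 10.
```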
The wavy plots show the remaining parameters (rho, nu, gamma), which determine the likelihood of transitioning between states. Since these machines tend to have daily patterns, the likelihood of switching to different kinds of states changes with the hour of the day. So the bottom three plots show, e.g., if the machine is transitioning out of an Off cycle, the probability it transitions to a spike vs. a decay state, depending on the hour of the day. Same thing for spike and decay (although those are allowed to transition back to the same state).
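The post doesn't pin down the functional form of that hour-of-day effect, but one plausible sketch is a daily sinusoid pushed through a logistic; the roles I've given rho, nu, and gamma below are guesses, not the thesis's definitions:

```python
import math

def p_off_to_spike(hour, rho=0.0, nu=1.0, gamma=6.0):
    """Hypothetical P(off -> spike | leaving off) as a function of hour of day.
    rho shifts the overall level, nu scales the daily wave, gamma is the peak hour."""
    wave = nu * math.cos(2 * math.pi * (hour - gamma) / 24.0)
    return 1.0 / (1.0 + math.exp(-(rho + wave)))  # logistic keeps it in (0, 1)
```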
The sparkline plots along the side show parameter estimates and the iterations that were used to get to them. This model is rather complicated, and if you recall my last thesis-related post, the way you estimate these parameters is to start with an "initial guess", then keep proposing new guesses and accepting or rejecting them according to their likelihood. Do this over and over ad infinitum and the sampler should converge to proposing guesses near the "correct" values. So the sparklines show the trace of guesses over 30,000 iterations. Actually they just show every 5th one, to save space. The "acc=X" is the proportion of proposals that were accepted, which indicates how well the chain is "mixing". The histograms show the density of the guesses, as well as the min and max. The blue line is the mean value of the guesses, and the black line is a smoothed trace of the guesses over time. For a well-behaved chain and model, the trace of guesses should look flat and stable over time. These ones are all pretty well-behaved :D
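That propose/accept/reject loop is a random-walk Metropolis sampler. Here is a minimal single-parameter version in Python against a toy Gaussian target (not the thesis model), including the thinning and the acc=X bookkeeping:

```python
import math
import random

def metropolis(log_post, init, n_iter=30_000, step=0.5, thin=5, seed=2):
    """Random-walk Metropolis: jitter the current guess, accept with
    probability min(1, posterior ratio), and keep every `thin`-th draw."""
    rng = random.Random(seed)
    x, lp = init, log_post(init)
    trace, accepted = [], 0
    for i in range(n_iter):
        prop = x + rng.gauss(0.0, step)
        lp_prop = log_post(prop)
        if math.log(rng.random()) < lp_prop - lp:  # accept/reject step
            x, lp, accepted = prop, lp_prop, accepted + 1
        if i % thin == 0:  # keep every 5th guess, like the sparklines
            trace.append(x)
    return trace, accepted / n_iter  # the trace and the "acc=X" number

# Toy target: a standard normal log-density (up to a constant).
trace, acc = metropolis(lambda v: -0.5 * v * v, init=0.0)
```

For a well-mixing chain the trace should wander around the true value with no drift, and the acceptance rate should be neither near 0 nor near 1.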
I described this machine (machine 24) as the "poster child" for the model. It's doing very well. But there is a lot of variation in behavior across the 154 IP addresses I modeled. I put up a gallery of a select few here to demonstrate typical patterns I saw.
Now if I can just (a) write all that above stuff all formal-like along with my explanations and (b) fix the identifiability problems that are shown in the gallery, chapter 4 of the thesis will be in the bag.