I have spent the last 12 years becoming well-acquainted with statistics, and yet I am still flummoxed by what others could possibly mean when they say "statistical analysis." It's either a magic box that solves everything, or a divining rod imbued with witchcraft.

Here is an article where it appears to be the latter: a long interview with linguist Noam Chomsky on AI and "big data".

What I think he means by the term "statistical analysis" in this context is black-box prediction engines, data mining, machine learning, etc. He throws around the terms Bayes and priors, but I don't think he realizes that it is specifically the machine-learning "flavor" of Bayesian statistics he is really thinking about (simple, modular models, lots of data, non-informative priors, and information-theoretic minimum-risk estimators). I would love to see someone relate theories to data without the field of statistics! As the interviewer notes, it is the "glue" that links noisy data to perception. Granularity, feature space, representation... how is that not also a facet of statistical modeling, if we would like it to be informed, adapted, and improved by observations in the world? Too much noise at the wrong level of granularity and, as the interviewer puts it, "you're screwed". There does need to be some amount of driving by theory and hypothesis--whether through classical hypothesis tests and simple experiments, or through more flexible model building and model checking--in order to understand causes.

This is precisely why I think cyber-security and infosec have, in general, failed for all their fancy math to answer hard questions: as Chomsky realizes, pattern-matching prediction engines are not set up for causal, interpretable modeling; they are built for pure prediction. As I was discussing with a colleague recently, there is a strong correlation between the methods used in academic cyber-security and whatever was fashionable in pattern-matching machine-learning circles six months ago. (Likely owing to the fact that most cyber-security researchers come out of computer science, and they quite like black boxes.)
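
To put a toy number on that machine-learning "flavor" of Bayes mentioned above (simple model, lots of data, flat prior), here is a minimal sketch of my own; the counts are invented, and the point is just that with that much data the prior, i.e. the "theory", barely matters:

```python
# A minimal sketch (my own illustration, not from the interview) of the
# "machine learning flavor" of Bayesian inference: with a conjugate
# Beta-Binomial model, a flat prior and a strongly informative prior give
# nearly the same posterior once the data swamp them.

flat_prior = (1, 1)            # Beta(1, 1): the flat "non-informative" prior
informative_prior = (20, 80)   # Beta(20, 80): strong belief the rate is near 0.2

successes, trials = 6_300, 10_000   # hypothetical "big data" counts

for name, (a, b) in [("flat", flat_prior), ("informative", informative_prior)]:
    post_a = a + successes
    post_b = b + (trials - successes)
    mean = post_a / (post_a + post_b)
    print(f"{name:12s} prior -> posterior mean {mean:.4f}")

# Both posteriors land near 0.63: the prior barely matters, which is exactly
# the regime Chomsky seems to equate with all of statistics.
```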

It is great to think about models--which I'd argue are indistinguishable from scientific "theories" in Chomsky's terms--to understand that they are mechanistic models and that they produce predictions from those mechanistic assumptions and relationships. The fake frictionless inclined plane. But the way you relate those back to reality is always to go and either experiment or observe, to validate those theories against what truly exists. The notion of inputs/outputs, algorithms, and implementation sounds rather like a more fleshed-out or comprehensive version of what I would call the "granularity" of a model. In cognitive science, for example, there are architectural languages used to build models that perform thought tasks... solving proportional reasoning problems or what have you... and they want the model to fit the data in two directions: they want the algorithm to match the input and output of the task, but also to match what is observed in the brain through, e.g., fMRI studies. Generalizing that to the broader arena of scientific pursuit, you want to understand which features of your model are causally relevant. Check out this paper by Giere on the role of models in science. This paper by Pitt, Myung and Zhang may also be helpful, as may this paper by Seth Roberts and Harold Pashler on theory testing in cognitive science.
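
As a throwaway illustration of that "predict, then go check" loop, here is a toy inclined-plane sketch of my own; all numbers are invented, and the friction coefficient is just a stand-in for the hidden reality the idealized model ignores:

```python
import math
import random

# The mechanistic model predicts distance travelled from first principles
# (frictionless plane); the "observations" are simulated here purely for
# illustration -- in real work they would come from an experiment.

g = 9.81                      # m/s^2
theta = math.radians(30)      # incline angle
mu = 0.05                     # hidden friction the idealized model ignores

def model_distance(t):
    """Frictionless prediction: d = 0.5 * g * sin(theta) * t^2."""
    return 0.5 * g * math.sin(theta) * t**2

def observed_distance(t):
    """Pretend measurement: friction slows the block, plus sensor noise."""
    a = g * (math.sin(theta) - mu * math.cos(theta))
    return 0.5 * a * t**2 + random.gauss(0, 0.02)

random.seed(0)
for t in [0.5, 1.0, 1.5, 2.0]:
    pred, obs = model_distance(t), observed_distance(t)
    print(f"t={t:.1f}s  predicted={pred:6.3f} m  observed={obs:6.3f} m  "
          f"residual={obs - pred:+.3f} m")

# Systematic negative residuals that grow with t are the data telling us
# which mechanistic assumption (no friction) needs revisiting.
```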

There is also a more fundamental link to Breiman's 2001 Statistical Science article that, oh, in the early days of the data-mining mystique, said, "screw statistical modeling; we don't need it, we just need prediction." This generated a large debate even 12 to 15 years ago. For practical answers to what's going on now and what may improve margins in the short term, I agree. But to link it to "How" or "Why" always requires science and not just pattern matching or number crunching. It requires assumptions that can be tested. To predict into the future, one needs at the bare minimum to suppose that the factors influencing the decisions recorded in that massive database collected yesterday are sufficiently similar to the factors influencing the decisions that will be made later today.
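
Here is a toy version of that assumption (entirely simulated data, slopes chosen arbitrarily): a model fit on "yesterday" keeps predicting well only so long as "today" was generated by a sufficiently similar process.

```python
import numpy as np

# A minimal illustration (mine, not Breiman's) of the stationarity assumption
# buried in "pure prediction".

rng = np.random.default_rng(42)

# "Yesterday": outcome depends on x with slope 2.0
x_old = rng.uniform(0, 10, 500)
y_old = 2.0 * x_old + rng.normal(0, 1, 500)

# Fit a purely predictive model (ordinary least squares line)
slope, intercept = np.polyfit(x_old, y_old, 1)

# "Today": the underlying relationship has drifted to slope 1.2
x_new = rng.uniform(0, 10, 500)
y_new = 1.2 * x_new + rng.normal(0, 1, 500)

for label, x, y in [("yesterday", x_old, y_old), ("today", x_new, y_new)]:
    pred = slope * x + intercept
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    print(f"RMSE on {label}'s data: {rmse:.2f}")

# The prediction engine says nothing about *why* it degraded; that question
# needs assumptions you can actually state and test.
```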

And making sense of these massive data collections at the most minute of levels also requires some manner of representation, feature selection, etc., guided by hypothesis. For example, Nate Silver is not exactly doing formal science when he applies his model to the polls and predicts the outcome of the election. Mechanistically, does he care about the interpretation of the specific variables that best predict the outcomes in the states? Hm, if his goal is pure prediction based on previous correlation, then not really; if election outcome had been strongly correlated with, e.g., amount of nose hair in the past, then he'd have a coefficient for amount of nose hair in the model (though of course that would likely be confounded with sex and age, I'd guess!).
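
That confounding quip is easy to demonstrate with simulated data (the "nose hair" variable below is obviously invented): regress the outcome on a variable that merely tracks the real driver and it still gets a healthy coefficient; adjust for the real driver and the effect evaporates.

```python
import numpy as np

# A toy confounding example: nose hair is correlated with age, but age is
# what actually drives the outcome. Pure prediction doesn't care; causal
# interpretation does.

rng = np.random.default_rng(7)
n = 5_000

age = rng.uniform(20, 80, n)
nose_hair = 0.1 * age + rng.normal(0, 0.5, n)     # merely correlated with age
outcome = 0.5 * age + rng.normal(0, 2, n)         # actually driven by age

# Regress outcome on nose hair alone (least squares with an intercept)
X = np.column_stack([np.ones(n), nose_hair])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(f"coefficient on nose hair: {coef[1]:.2f}")   # large, looks 'important'

# Control for the confounder and the nose-hair effect evaporates
X2 = np.column_stack([np.ones(n), nose_hair, age])
coef2, *_ = np.linalg.lstsq(X2, outcome, rcond=None)
print(f"after adjusting for age:  {coef2[1]:.2f}")  # near zero
```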

But actually he is a bit smarter than that, and it's the structural and sort of "artsy" side of model building that really gives him a hypothesis, beforehand, of what is important and why it might be important. I think this can be considered abductive reasoning from prior experience, used to build up a relevant structure and to relegate non-useful predictors to "the Archive" of irrelevant information (which, in Hugh Gauch's terms, somewhat magically results in external experimental validity, hooray!). Granularity is important too. Silver is certainly not modeling the individual vote of each of the 109 million or so voters; that would be ridiculous. Instead he is generalizing them to a feature space and looking at the aggregation of their activity. So it is "statistical analysis", but it is not blind number crunching. And of course someone must ask him, "on what basis do you believe your model, which worked so well on Tuesday, will work again in 2016?" Then we get into the difficult questions of "why" and "how" and what the assumptions behind that "blind big data" really are.
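
For the flavor of what aggregating over a feature space buys you, here is a deliberately crude sketch with made-up polls for a single state (nothing like Silver's actual model): weight the polls by sample size, assume an error distribution for the aggregate, and simulate a win probability, instead of modeling millions of individual voters.

```python
import random

polls = [  # (candidate share, sample size) -- hypothetical numbers
    (0.52, 800),
    (0.49, 1200),
    (0.51, 600),
]

total_n = sum(n for _, n in polls)
weighted_share = sum(share * n for share, n in polls) / total_n
polling_error = 0.03   # assumed standard deviation of the aggregate error

random.seed(1)
wins = sum(random.gauss(weighted_share, polling_error) > 0.5
           for _ in range(10_000))
print(f"poll average: {weighted_share:.3f}, win probability ~ {wins / 10_000:.2f}")

# The structural choices -- which polls to include, how to weight them, what
# the error distribution looks like -- are the hypothesis-driven, "artsy" part.
```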

Another thought here is that, before one can use data to guide the formation of hypotheses, one must in some way understand what is in that data, at a level that can be approached by the human brain and not the artificial one. It's amazing how little insight one gets by throwing fancy math at unstructured data. The process is instead iterative at best. A recent CMU HCI grad frames data mining as integrated with human analysis: the process of focusing attention and presenting information at the right granularity for the analyst to ask good questions. So really, the Big Data pattern-finding stuff is what happens before the scientific question is even asked. It's an attempt to summarize and structure, to explain "What" before you get to the harder questions of "Why" and "How".
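
A trivial sketch of that "What before Why" step, on an invented event log: just counting and grouping at a granularity a person can read is often what makes the next, sharper question possible.

```python
from collections import Counter

# Invented log records, purely for illustration.
events = [
    ("login_fail", "host_a"), ("login_fail", "host_a"), ("login_ok", "host_b"),
    ("login_fail", "host_a"), ("port_scan", "host_c"), ("login_ok", "host_a"),
    ("port_scan", "host_c"), ("login_fail", "host_d"),
]

by_type = Counter(kind for kind, _ in events)
by_host = Counter(host for kind, host in events if kind == "login_fail")

print("events by type:", by_type.most_common())
print("failed logins by host:", by_host.most_common())

# Only after a summary like this can the analyst ask the sharper question:
# why is host_a failing so often?
```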

If you have not seen it already, here's a true physicist's take on the fundamentals of science, on "why" and "how" and how difficult they are to answer. The interviewer asks Richard Feynman why (or how) magnets work, and the answer is brilliant: one must at some point take something as true or evident, or the "why" will never stop. Why and How are always built on presuppositions. I should think that Feynman would have had interesting interactions with, e.g., three-year-olds.

Thus I think the term "statistical analysis" is used rather unfairly, in the pejorative sense, by Chomsky and his MIT compatriots, who should know better. It should be replaced by something a little more finely scoped to what they mean, which is black-box prediction or association-finding. (Okay, I will be a bit petulant and say "data mining".) Because if one ignores the fact that statistical analysis links theories to data, then all "statistical analysis" amounts to is some number crunching. It is disingenuous, especially for a linguist, to separate statistics as fiddly black-box numbers from statistics as the language of scientific evidence.
