data, math and cooking
Mar. 2nd, 2013 03:58 pm
Still churning through the Teaspoon data. I made a graph of the number of reviews vs. word count, with stories grouped into word-count categories.

These are boxplots, with the mean of each distribution added as a little "+" to show the skewness of the distributions. I also attempted to normalize for repeat reviews by counting only the number of unique reviewers per story. (Honestly, I saw one outlier earlier that was a 1000-word story with over 80 reviews; when I looked at it, it was actually a conversation between one lone reviewer and the author, carried out over 8 pages of comments.) Note that the y-axis is on the log scale, so it de-emphasizes outliers. Medians and means steadily increase with word count: the median number of unique reviewers for shorter stories (under 1000 words) is 2, while for longer stories (over 25K words) the median is all the way up at 8 and the mean is 13.
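The unique-reviewer count can be sketched in R like this. It is only a minimal illustration: the `reviews` data frame, its column names, and the reviewer names are all made up for the example and are not the actual Teaspoon tables.

```r
# Hypothetical review log: one row per review left on a story.
# (Column names and values are assumptions for illustration only.)
reviews <- data.frame(
  story_id = c(1, 1, 1, 1, 2, 2, 3),
  reviewer = c("ann", "ann", "ann", "bob", "cat", "dan", "ann")
)

# Raw review counts per story: repeat reviewers are counted every time.
raw <- aggregate(reviewer ~ story_id, data = reviews, FUN = length)

# Unique reviewers per story: each reviewer counts once per story.
uniq <- aggregate(reviewer ~ story_id, data = reviews,
                  FUN = function(r) length(unique(r)))

print(raw)
print(uniq)
```

Story 1 has four reviews here but only two distinct reviewers, which is the kind of back-and-forth thread the unique count is meant to flatten.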
A few outliers of note, marked from the lowest category (0-100 words):
Three of these are actual stories: ids 562, 9402, and 28298. They are all brilliant in different ways; definitely check them out. The others are actually pictures/comics, which is why the word counts are so small. I didn't look up the Torchwood Babiez ones, but I've heard of them and know they were pretty popular. The Master one (16749) is also pretty funny. It was written by Adalia Zandra, who also used images in her hilarious and slashy (non-explicit but not quite vanilla) Passing Notes (I also note she is responsible for one of the 23 stories titled "Aftermath", heh).
ETA: Ha, "Passing Notes" is actually the highest outlier in the 1K-2K category too! Do you all think that humor stories garner more reviews than other kinds of stories? I see that as a kind of trend in my stories. I should check that out. (makes graph) Nothing really jumps out except "Het", which is likely highly correlated with Ten/Rose. I should mention that the numbers in these plots count a story multiple times if it is tagged with multiple genres.

And for good measure and aralias: Unique reviewers by category (again with the caveat that stories with multiple categories are counted multiple times)

Median review counts for Classic series and Tenth Doctor seem to be similar. These are not yet normalized by "time spent in archive" (see below). Nine is higher than I'd think, and Eleven is lower, but again, temporal effects may adjust that.
I want to divide those review counts by some measure of time in order to account for stories that have had a longer time to accrue reviews. I could use the date of the newest review minus the publish date; I could use the average of that same value, or likely a higher percentile like the 75th or 85th.
**** WARNING: MATH INTERLUDE AHEAD ****
How might one formally model a story's popularity in terms of reviews? I would start with a Poisson process, which is generally what you use when you want to count arrivals of things over time. The way it is modeled: suppose that arrivals follow a Poisson process with rate lambda (the mean number of arrivals per unit of time). Then for the span of time between t_1 and t_2, the number of arrivals within that span (t_1, t_2) has a Poisson distribution with mean equal to lambda*(t_2 - t_1). So the more time passes, the bigger the average number of arrivals you expect, which makes sense. If you think of lambda as a flat constant function defined over time, then the mean for any segment is the area of the rectangle under that function over the segment.
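Written out, the constant-rate case looks like the following. This is just the standard Poisson formula in LaTeX notation; N(t_1, t_2) is shorthand introduced here for the number of reviews arriving in the span, and mu for its mean.

```latex
N(t_1, t_2) \sim \mathrm{Poisson}(\mu), \qquad \mu = \lambda\,(t_2 - t_1)

P\bigl(N(t_1, t_2) = k\bigr) = \frac{e^{-\mu}\,\mu^{k}}{k!}, \qquad k = 0, 1, 2, \dots
```

For example, with a made-up rate of lambda = 2 reviews per day, a 3-day span has mean 2*3 = 6 reviews, and a 6-day span has mean 12.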
As Khan Academy explains in the youtube link, a basic Poisson process assumes that the rate of arrivals is constant over time. I think we can all agree that isn't the case for reviews; the average reviews per day will spike at publication and update times (also likely at times recced, and hey, I did just grab all of calufrax's history to integrate into the db and will work on that). So as a function of time, lambda is not constant. You can still get the probability of the number of arrivals in an interval of time, but now that nice "area of a rectangle" from the homogeneous case turns into "area under the curve" (i.e., the integral of lambda(t) between t_1 and t_2).
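To make "area under the curve" concrete, here is a small R sketch. The rate function and all of its numbers are invented for illustration (a burst at publication that fades); nothing here is fitted to the Teaspoon data.

```r
# Hypothetical review rate (reviews per day), t days after publication.
# The shape and constants are illustrative assumptions only.
lambda <- function(t) 5 * exp(-0.5 * t)

# Expected number of reviews between day t1 and day t2:
# the area under lambda(t), computed numerically with base R's integrate().
expected_reviews <- function(t1, t2) {
  integrate(lambda, lower = t1, upper = t2)$value
}

first_week  <- expected_reviews(0, 7)
second_week <- expected_reviews(7, 14)

# With a constant rate the same calculation collapses to a rectangle.
constant_rate <- 2
rectangle <- constant_rate * (7 - 0)

print(c(first_week = first_week, second_week = second_week, rectangle = rectangle))
```

Under a constant rate every week would get the same expected count; with the decaying rate, nearly all of the expected reviews land in the first week.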
Google image search fails me for clarifying pictures of this. I could draw one in R, and perhaps will, but I'm guessing most people skipped this lj-cut anyway.
Anyway, it would be interesting to model this mean function parametrically, as a function of different properties of the story (for example, the era, the content, the rating, when it was published). Most stories have so few reviews that we'd have to categorize them somehow to group them together so they could borrow information from each other. I imagine that authors who appear in others' "favorite author" lists (when those others are prone to reviewing) will have a stronger spike, due to their followers getting email notifications when they update/upload stories. Exponential decay of the average expected reviews after publication/update seems reasonable. That is doubly nice because the integral of e^x has a closed form (it is e^x), so the expected number of reviews in any interval is easy to write down. You could then model, e.g., the rate of that exponential decay as a kind of popularity metric.
The reason I suggest this is that the shape (or not) of this curve is really how I would want to normalize stories for the accumulation of reviews over time. In the short-term exploratory analysis, I should divide the number of unique reviewers by some measure that incorporates "t_2 - t_1", where "t_2" is the time of the last review and "t_1" is the time of publication. Those mathematically inclined will see that, if the process could be considered homogeneous, it is relatively simple to say that you should divide all story review counts by the total active time in the database, to get comparable "lambda" values for each story that will measure time-adjusted popularity. But the burstiness and decay of the process make the question of what cutoff or measure of time to divide by a bit more difficult. It becomes a question of quantiles.
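As a sketch of what the exponential-decay version could look like, take a story published at time t_0. The parameters a (height of the initial spike) and b (decay rate) are hypothetical names introduced here, not something estimated from the data.

```latex
\lambda(t) = a\,e^{-b\,(t - t_0)}, \qquad t \ge t_0

\int_{t_1}^{t_2} \lambda(t)\,dt
  = \frac{a}{b}\left(e^{-b\,(t_1 - t_0)} - e^{-b\,(t_2 - t_0)}\right),
  \qquad t_0 \le t_1 \le t_2
```

Letting t_1 = t_0 and t_2 grow without bound gives a/b as the total expected number of reviews under this assumed shape, so a slowly decaying story (small b) accumulates more reviews for the same initial spike.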
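The simple homogeneous version of that normalization might look like this in R. The `stories` data frame and every value in it are invented for the example; the real tables may be laid out differently.

```r
# Hypothetical per-story summary (all values are made up for illustration).
stories <- data.frame(
  story_id       = c(1, 2, 3),
  published      = as.Date(c("2008-01-01", "2008-01-01", "2012-01-01")),
  last_review    = as.Date(c("2008-01-11", "2010-01-01", "2012-01-21")),
  uniq_reviewers = c(20, 20, 20)
)

# Active time t_2 - t_1 in days, floored at 1 day so that stories whose
# only reviews arrived on the publication date do not divide by zero.
stories$active_days <- pmax(as.numeric(stories$last_review - stories$published), 1)

# Naive constant-rate estimate: unique reviewers per active day.
stories$rate <- stories$uniq_reviewers / stories$active_days

print(stories[, c("story_id", "uniq_reviewers", "active_days", "rate")])
```

All three made-up stories have 20 unique reviewers, but story 2 gets a much lower rate because one late review stretched its window to two years. That sensitivity to the last review is why a quantile of the review times may be a better cutoff than the maximum.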
Right, that's enough math. More for me than you all, I think.
**** This concludes the math interlude! ****
I am still working through how best to categorize multi-era fics in terms of Classic vs. New eras. This question/difficulty of classification has now come up in both Teaspoon and in the FF.net data. In FF.net I went through the character lists and gave each an associated era (with indicators for "more info needed" for characters like Rose and Sarah Jane who span several eras). I have to go through and normalize the characters for the Teaspoon data, which also includes having to correct for HTML errors/truncations when the character lists in the story blurbs got too long.
ETA: Ah, forgot about cooking! My co-worker (a fellow mathematician and baker) has suggested that we celebrate Pi day (3/14) with pies. I am thinking I would like to try a gluten-free version of something like this Mexican chocolate silk pie. It seems mostly gluten free already and the crust is an easy substitute of gf ginger snaps instead of graham crackers. But why go through all that trouble to make a from-scratch chocolate pudding and then top it with Cool Whip? Might as well make the whipped cream/buttercream topping from scratch at that point too!

A few outliers of note, marked from the lowest category (0-100 words):
tables$story[which((tables$story$uniqreview >= 20) & (tables$story$lengthcat == "0-100")),]
         id title                                         author rating
266     562 "Yeah, I know what an allegory is," said Ace.     81      1
4930   9402 Cliff, Shag, Marry?                             2066      1
9799  15715 Torchwood Babiez: Missing Shoes                 2255      1
10496 16552 A Very Baby Halloweenie                         2255      1
10662 16749 WHY THE MASTER PWNS EVERYTHING.                 3435      2
12525 19168 Torchwood Babiez: Missing Shoes Part 2          2255      1
17859 27078 Torchwood Babiez: Missing Shoes Part 3          2255      1
18718 28298 Pandora's Box                                    647      1
22483 33603 Torchwood Babiez: Missing Shoes Part 4          2255      1
      chapters words reviews  updated published uniquereview
266          1   100      21 10/09/03  10/09/03           21
4930         1    63      31 12/29/06  12/29/06           29
9799         9    25      69 09/25/07  09/25/07           59
10496        1     1      29 10/30/07  10/30/07           26
10662        1     1      20 11/09/07  11/09/07           20
12525        8     8      46 02/08/08  02/07/08           41
17859       12    12      29 11/11/08  11/11/08           23
18718        1   100      23 01/07/09  01/06/09           22
22483       10    10      35 10/20/09  10/20/09           24
no subject
Date: 2013-03-02 10:56 pm (UTC)
no subject
Date: 2013-03-02 11:32 pm (UTC)
PS: I added more graphs!
no subject
Date: 2013-03-03 02:47 am (UTC)
Course sex gets more reviews than anything, lol! (I exaggerate, yus). Slash and Crossovers don't get much love - poor things. And my main genres, too (besides horror), so...maybe I need to step away from my genres? *Laughs* Seriously...love the graphs - help me make sense of the math. I need...visual aids. *HEADDESK*
no subject
Date: 2013-03-03 02:53 am (UTC)
no subject
Date: 2013-03-03 02:54 am (UTC)
*SQUISHES*
no subject
Date: 2013-03-03 03:46 am (UTC)
I am all for doing gluten-free baking -- I'm learning more and more about baking substitutes for gluten and eggs and some of them are DARNED delicious! My Mom's friend sent us the name of a website, Chocolate Covered Katie, that is for healthy desserts, many of which are gluten free. There are also yummy non-dessert things, too, and I cannot wait to try the healthy tater tots (made with quinoa and white beans and other nummies).
Let us know which pie you pick!
Ooh, and I want to start baking with rhubarb! Have you tried it? In college, one of my friend's moms used to send rhubarb preserves and they were DELICIOUS.
no subject
Date: 2013-03-03 04:03 am (UTC)
I am interested in gluten free less for health reasons and more so that my one co-worker who is celiac can have a choice of at least one pie on Pi day. I found a good recipe for a flour mix that can substitute for all-purpose flour and not taste too much differently. On the other hand, I have also had some really good desserts made with the heavier flours too :) I have had some success with gf pie crusts but they are a bit picky. So a crushed graham cracker crust, substituting gf cookies for the grahams, works pretty well I think :)
I have not yet tried rhubarb! It seems like a summery kind of thing (strawberry rhubarb pie?) and it's still snowing out here. Do you think it would go well with blackberries?
no subject
Date: 2013-03-03 04:16 am (UTC)
So, my Mom's been using rice flour and had generally good success with the bread she made for my sister, as well as some cakes, and I know she's been trying some other flour options, too. Mom made a delicious gf crust for a gf pumpkin pie - if you want, I can send you the recipe. I don't believe it was complicated...
Rhubarb is typically (normally) a mid-spring/summer thing, but because of greenhouses, you can get it all year long. I haven't seen rhubarb since November in the store, though, but I've read that frozen rhubarb chunks (no added sugar, etc) work just as well as fresh stuff. I think it absolutely would go well with blackberries - a quick google search shows "blackberry rhubarb crumble" and this and that for many entries....
no subject
Date: 2013-03-03 01:39 pm (UTC)
Anyway, this is turning out to be v fascinating, so do keep on sharing please. :-)
no subject
Date: 2013-03-03 03:31 pm (UTC)
I am surprised that the median number of reviews for all of the classics from One to Eight is the same; it's saying that 50% of the stories have 3 or more reviews. The mean (the '+') for Six is higher than the other Classics, and also the 75th percentile. So yes, it is saying that more Six fans tend toward leaving feedback on stories. Probably because we are grateful that people write Six :D
no subject
Date: 2013-03-03 04:45 pm (UTC)
Anyway, forgive my dimness, but I wanted to check I was understanding things properly or not.
Six fans are the best, basically - now it's official! :lol: