eve11 | data, math and cooking

Still churning through the Teaspoon data. I made a graph of the number of reviews per story, broken out by word-count category.

ReviewsByWordCount

These are boxplots, with the mean of each distribution added as a little "+" to show how skewed the distributions are. I also attempted to normalize for repeat reviews by counting only the unique reviewers per story. (Honestly, I saw one outlier earlier that was a 1000-word story with over 80 reviews; when I looked at it, it was actually a conversation between one lone reviewer and the author, carried out over 8 pages of comments.) Note that the y-axis is on a log scale, which de-emphasizes the outliers. Medians and means both increase steadily with word count: the median number of unique reviewers for shorter stories (under 1000 words) is 2, while for longer stories (over 25K words) the median is all the way up at 8 and the mean is 13.
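For anyone who wants to try this kind of plot, here is a minimal sketch of the idea in base R. It is not the code behind the graph above: the data are simulated, the category labels are made up, and the column names `lengthcat` and `uniquereview` are just borrowed from the table further down. Counts are shifted to be at least 1 so the log axis is defined; real data with zero-review stories would need some other handling.

```r
# Toy data: hypothetical stories with a word-count category and a
# unique-reviewer count (simulated numbers, not Teaspoon data).
set.seed(1)
cats <- c("0-1K", "1K-5K", "5K-25K", "25K+")
story <- data.frame(
  lengthcat = factor(rep(cats, each = 50), levels = cats),
  uniquereview = 1 + rpois(200, lambda = rep(c(2, 4, 7, 12), each = 50))
)

# Boxplots on a log-scaled y-axis (all counts are >= 1, so the log is defined).
boxplot(uniquereview ~ lengthcat, data = story, log = "y",
        xlab = "Word count", ylab = "Unique reviewers (log scale)")

# Overlay each category's mean as a "+" (pch = 3).
means <- tapply(story$uniquereview, story$lengthcat, mean)
points(seq_along(means), means, pch = 3)
```

When a "+" sits well above the median line of its box, that is the right skew showing up.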

A few outliers of note, marked from the lowest category (0-100 words):

 tables$story[which((tables$story$uniquereview >= 20) & (tables$story$lengthcat == "0-100")),]
         id                                         title author rating
266     562 "Yeah, I know what an allegory is," said Ace.     81      1
4930   9402                           Cliff, Shag, Marry?   2066      1
9799  15715               Torchwood Babiez: Missing Shoes   2255      1
10496 16552                       A Very Baby Halloweenie   2255      1
10662 16749               WHY THE MASTER PWNS EVERYTHING.   3435      2
12525 19168        Torchwood Babiez: Missing Shoes Part 2   2255      1
17859 27078        Torchwood Babiez: Missing Shoes Part 3   2255      1
18718 28298                                 Pandora's Box    647      1
22483 33603        Torchwood Babiez: Missing Shoes Part 4   2255      1
      chapters words reviews  updated published uniquereview
266          1   100      21 10/09/03  10/09/03           21
4930         1    63      31 12/29/06  12/29/06           29
9799         9    25      69 09/25/07  09/25/07           59 
10496        1     1      29 10/30/07  10/30/07           26
10662        1     1      20 11/09/07  11/09/07           20
12525        8     8      46 02/08/08  02/07/08           41
17859       12    12      29 11/11/08  11/11/08           23
18718        1   100      23 01/07/09  01/06/09           22
22483       10    10      35 10/20/09  10/20/09           24

Three of these are actual stories: ids 562, 9402, and 28298. They are all also brilliant in different ways; definitely check them out. The others are actually pictures/comics, which is why the word counts are so small. I didn't look up the Torchwood Babiez ones, but I've heard of them and know they were pretty popular. The Master one (16749) is also pretty funny. It was written by Adalia Zandra, who also used images in her hilarious and slashy (non-explicit but not quite vanilla) Passing Notes (also, I note she is responsible for one of the 23 stories titled "Aftermath", heh).

ETA: Ha, "Passing Notes" is actually the highest outlier in the 1K-2K category too! Do you all think that humor stories garner more reviews than other kinds of stories? I see that as a kind of trend in my stories. I should check that out. (makes graph) Nothing really jumps out except "Het" which is likely highly correlated with Ten/Rose. I should mention that the numbers in these plots count stories multiple times if the story is marked by multiple genres.
reviewsbygenre

And for good measure and aralias: Unique reviewers by category (again with the caveat that stories with multiple categories are counted multiple times)
reviewsbycategory

Median review counts for Classic series and Tenth Doctor seem to be similar. These are not yet normalized by "time spent in archive" (see below). Nine is higher than I'd think, and Eleven is lower, but again, temporal effects may adjust that.

I want to divide those review counts by some measure of time in order to account for stories that have had a longer time to accrue reviews. I could use the date of the newest review minus the publish date; I could use the average lag between publication and each review, or more likely a higher percentile of those lags, like the 75th or 85th.
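A rough sketch of what those candidate denominators could look like in R. Everything here is hypothetical: the `reviews` data frame, its column names, and the dates are made up for illustration, and the real db tables may be organized quite differently.

```r
# Hypothetical review log: one row per review (made-up data).
reviews <- data.frame(
  story_id  = c(1, 1, 1, 1, 2, 2),
  reviewer  = c("a", "b", "b", "c", "a", "d"),
  reviewed  = as.Date(c("2007-09-26", "2007-09-27", "2007-10-01",
                        "2008-03-01", "2009-01-07", "2009-01-09")),
  published = as.Date(c(rep("2007-09-25", 4), rep("2009-01-06", 2)))
)
reviews$lag <- as.numeric(reviews$reviewed - reviews$published)  # in days

# Per story: unique reviewers plus three candidate "time" denominators.
per_story <- do.call(rbind, lapply(split(reviews, reviews$story_id), function(d) {
  data.frame(
    story_id     = d$story_id[1],
    uniquereview = length(unique(d$reviewer)),
    span_days    = max(d$lag),                      # newest review - publish date
    mean_lag     = mean(d$lag),                     # average review lag
    q75_lag      = unname(quantile(d$lag, 0.75))    # 75th percentile of lags
  )
}))

# One possible time-adjusted measure: unique reviewers per day of "active" span.
per_story$rate_per_day <- per_story$uniquereview / per_story$span_days
per_story
```

In the toy data, one late straggler review stretches the span for story 1 far more than it moves the 75th percentile, which is the reason a percentile might be the better denominator.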

**** WARNING: MATH INTERLUDE AHEAD ****
How might one formally model a story's popularity in terms of reviews? I would start with a Poisson process, which is generally what you use when you want to count arrivals of things over time. The way it is modeled: suppose that arrivals follow a Poisson process with rate lambda (the mean number of arrivals per unit of time). Then for the span of time between t_1 and t_2, the number of arrivals within that span (t_1, t_2) has a Poisson distribution with mean equal to lambda*(t_2 - t_1). So the more time passes, the bigger the average number of arrivals you expect, which makes sense. If you think of lambda as a flat constant function defined over time, then the mean for any segment is the area of the rectangle under that segment of time.
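Written out, with N(t_1, t_2) as my own notation for the number of arrivals in the span, the constant-rate case is:

$$
N(t_1, t_2) \sim \mathrm{Poisson}\big(\lambda\,(t_2 - t_1)\big),
\qquad
P\{N(t_1, t_2) = k\} = \frac{\big[\lambda\,(t_2 - t_1)\big]^{k}}{k!}\, e^{-\lambda\,(t_2 - t_1)},
\quad k = 0, 1, 2, \dots
$$

As a made-up example: with lambda = 0.5 reviews per day over a 10-day span, the expected count is 0.5 * 10 = 5, and the chance of seeing no reviews at all is e^{-5}, or about 0.007.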

As Khan Academy explains in the YouTube link, a basic Poisson process assumes that the likelihood of arrivals is constant over time. I think we can all agree that isn't the case for reviews; the average reviews per day will spike at publication and update times (also likely at times recced, and hey I did just grab all of calufrax's history to integrate into the db and will work on that). So as a function of time, the mean lambda is not constant. You can still get the probability of the number of arrivals in an interval of time, but now that nice "area of a rectangle" from the homogeneous case turns into "area under the curve" (i.e., the integral of lambda(t) between t_1 and t_2).
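In symbols, again with N(t_1, t_2) as my own shorthand for the count of arrivals in the span, and Lambda for the area under the curve:

$$
N(t_1, t_2) \sim \mathrm{Poisson}\big(\Lambda(t_1, t_2)\big),
\qquad
\Lambda(t_1, t_2) = \int_{t_1}^{t_2} \lambda(t)\, dt .
$$

If lambda(t) happens to be a constant lambda, the integral collapses back to lambda*(t_2 - t_1), which is the rectangle from before.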

Google image search fails me for clarifying pictures of this. I could draw one in R, and perhaps will, but I'm guessing most people skipped this lj-cut anyway.

Anyway, it would be interesting to model this mean function parametrically, as a function of different properties of the story (for example, the era, the content, the rating, when it was published). Most stories have so few reviews that we'd have to categorize them somehow to group them together so they could borrow information from each other. I imagine that authors who appear in others' "favorite author" lists (when those others are prone to reviewing) will have a stronger spike, due to their followers getting email notifications when they update/upload stories. Exponential decay of the average expected reviews after publication/update seems reasonable. That is also doubly nice because the integral of e^x has a closed form (it is e^x), so you could model, e.g., the rate of that exponential decay as a kind of popularity metric.
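To make the closed form concrete, here is one possible parametrization. It is only an assumption for illustration: a single spike of height a at publication time t_0, decaying at rate b, with no later update or rec spikes.

$$
\lambda(t) = a\, e^{-b\,(t - t_0)}, \quad t \ge t_0,
\qquad
\int_{t_0}^{T} \lambda(t)\, dt = \frac{a}{b}\,\Big(1 - e^{-b\,(T - t_0)}\Big)
\;\longrightarrow\; \frac{a}{b} \quad \text{as } T \to \infty .
$$

Under this sketch, a/b is the expected number of reviews the story would ever collect, a is the size of the initial spike, and b is how fast interest fades. Updates could be handled by adding one such term per update date.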

The reason I suggest this is that the shape (or not) of this curve is kind of how I would really want to normalize stories for the accumulation of reviews over time. In the short-term exploratory analysis, I should divide the number of unique reviewers by some measure that incorporates "t_2 - t_1", where "t_2" is the time of the last review and "t_1" is the time of publication. Those mathematically inclined will see that, if the process could be considered homogeneous, it is relatively simple to say that you should divide all story review counts by the total active time in the database, to get comparable "lambda" values for each story that will measure time-adjusted popularity. But the burstiness and decay of the process make the question of what cutoff or measure of time to divide by a bit more difficult. It becomes a question of quantiles.
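Roughly, in symbols. The first formula is the homogeneous case, with n the story's review count. The second assumes, purely for illustration, a rate that decays exponentially at some rate b after publication; tau_q is then the time by which a fraction q of the story's eventual reviews is expected to have arrived.

$$
\hat{\lambda} = \frac{n}{t_2 - t_1}
\qquad \text{versus} \qquad
1 - e^{-b\,\tau_q} = q
\;\;\Longrightarrow\;\;
\tau_q = \frac{-\ln(1 - q)}{b} .
$$

For q = 0.75 that gives tau_q = ln(4)/b, or about 1.39/b. So under the decay assumption, a high percentile of the review lags is a stand-in for how quickly interest in the story fades, which the raw span t_2 - t_1 is not, since one late straggler review can stretch it arbitrarily.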

Right, that's enough math. More for me than you all, I think.
**** This concludes the math interlude! ****

I am still working through how best to categorize multi-era fics in terms of Classic vs. New eras. This question/difficulty of classification has now come up in both Teaspoon and in the FF.net data. In FF.net I went through the character lists and gave each an associated era (with indicators for "more info needed" for characters like Rose and Sarah Jane who span several eras). I have to go through and normalize the characters for the Teaspoon data, which also includes having to correct for HTML errors/truncations when the character lists in the story blurbs got too long.

ETA: Ah, forgot about cooking! My co-worker (a fellow mathematician and baker) has suggested that we celebrate Pi day (3/14) with pies. I am thinking I would like to try a gluten-free version of something like this Mexican chocolate silk pie. It seems mostly gluten free already and the crust is an easy substitute of gf ginger snaps instead of graham crackers. But why go through all that trouble to make a from-scratch chocolate pudding and then top it with Cool Whip? Might as well make the whipped cream/buttercream topping from scratch at that point too!


a-phoenixdragon.livejournal.com

I tried to absorb a lot of the info - but pretty graph distracted me, lol!! Still...what I did grab boggled my mind and was yet interesting. *GLEES*

eve11

I was writing quickly. I could do a better job explaining it with pictures, but the pictures would take a while to make.

PS: I added more graphs!

Edited Date: 2013-03-03 12:39 am (UTC)

I SAW!! OHHHH, LOVELOVELOVE!!

Course sex gets more reviews than anything, lol! (I exaggerate, yus). Slash and Crossovers don't get much love - poor things. And my main genres, too (besides horror), so...maybe I need to step away from my genres? *Laughs* Seriously...love the graphs - help me make sense of the math. I need...visual aids. *HEADDESK*

Poor Femmeslash gets no love. No love!

S'a damned shame! Everyone should get love. S'why I've pretty much made that my fanfic mission, lol!! Love to all, dammit. Love. To. ALL.

*SQUISHES*

lostrack621.livejournal.com

DROOLS over the plots. ZOMG - I am learning R post-haste after finishing my dissertation and defending. I swear that your plots are so pretty! I am going to bother you BIG TIME. ;)

I am all for doing gluten-free baking -- I'm learning more and more about baking substitutes for gluten and eggs and some of them are DARNED delicious! My Mom's friend sent us the name of a website, Chocolate Covered Katie, that is for healthy desserts, many of which are gluten free. There are also yummy non-dessert things, too, and I cannot wait to try the healthy tater tots (made with quinoa and white beans and other nummies).

Let us know which pie you pick!

Ooh, and I want to start baking with rhubarb! Have you tried it? In college, one of my friend's moms used to send rhubarb preserves and they were DELICIOUS.

Heh, I have been building a "viz portfolio" of stuff I've done in R. I built my Machine diagnostic pictures from the ground up with R, and all of the other images in my dissertation. If you're interested in my recent training talk, which introduces how to do simple plots and stuff in R, grab it here. There is a gzipped TAR archive of supplemental materials here. It has a "copy/paste guided tour" that you can try out on your own if you can manage to get R installed. But yeah, R has extremely fine control over graphics and lots of different libraries to do lots of visualizations. I'd show you the code I wrote to do those boxplots, but it was on the command line so I shoved like what would have been 20 lines of code into 3 or 4 and then just threw it into a function so I wouldn't have to re-type things. ;)

I am interested in gluten free less for health reasons and more so that my one co-worker who is celiac can have a choice of at least one pie on Pi day. I found a good recipe for a flour mix that can substitute for all-purpose flour and not taste too much differently. On the other hand, I have also had some really good desserts made with the heavier flours too :) I have had some success with gf pie crusts but they are a bit picky. So a crushed graham cracker crust, substituting gf cookies for the grahams, works pretty well I think :)

I have not yet tried rhubarb! It seems like a summery kind of thing (strawberry rhubarb pie?) and it's still snowing out here. Do you think it would go well with blackberries?

Edited Date: 2013-03-03 04:05 am (UTC)

Ooh, goodie - I will download that stuff tomorrow to put in the "to do after diss" folder. Thank you! :)

So, my Mom's been using rice flour and had generally good success with the bread she made for my sister, as well as some cakes, and I know she's been trying some other flour options, too. Mom made a delicious gf crust for a gf pumpkin pie - if you want, I can send you the recipe. I don't believe it was complicated...

Rhubarb is typically (normally) a mid-spring/summer thing, but because of greenhouses, you can get it all year long. I haven't seen rhubarb since November in the store, though, but I've read that frozen rhubarb chunks (no added sugar, etc) work just as well as fresh stuff. I think it absolutely would go well with blackberries - a quick google search shows "blackberry rhubarb crumble" and this and that for many entries....

Edited Date: 2013-03-03 04:16 am (UTC)

thisbluespirit

Ha, I am not sure I am reading your graph correctly (I'm so ignorant, I only know very simple graphs, you know where there's these bars and these numbers, or a pie chart...) but it looks as though what I've always felt - that Six fans are more likely to leave a comment and encourage you - than other Classic Doctor fans. (Seven seems level with him, but then I've written very little Seven, strangely). But I don't know - is that what it means, the way they are? If so, I'm surprised Five has less.

Anyway, this is turning out to be v fascinating, so do keep on sharing please. :-)

Yeah, you have the idea :) The boxplot is a way of showing the five-number summary of the distribution: the minimum value, the 25th percentile, 50th percentile (median), 75th percentile and the maximum value. The 25th through 75th percentiles make up the box. I think boxplots are more descriptive than just writing out and comparing the averages. Although they do hide some things. (here is another informative post about boxplots that also shows them compared to stem-and-leaf plots). I think boxplots and percentiles are especially useful for data like the review data, that tends to be skewed with lots of low values for every category, countered by high outliers.
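Here is a small sketch of those summaries in R, using made-up review counts rather than anything from the archive. One caveat I'm fairly sure of: base R's `boxplot()` by default stops the whiskers at the most extreme points within 1.5 box-lengths of the box and draws anything farther out as individual points, so the whisker ends are the true minimum and maximum only when there are no such outliers.

```r
# Made-up, right-skewed review counts for one category.
x <- c(0, 1, 1, 2, 2, 3, 3, 4, 5, 8, 13, 40)

# Five-number summary: min, 25th, 50th, 75th percentiles, max.
five <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
five

# The mean sits above the median when a few high values stretch the right tail.
c(median = median(x), mean = mean(x))

# Points that boxplot() would draw individually, beyond the whiskers.
outliers <- boxplot.stats(x)$out
outliers
```

The box itself runs from the 25th to the 75th percentile, and the line inside it is the median.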

I am surprised that the median number of reviews for all of the classics from One to Eight is the same; it's saying that 50% of the stories have 3 or more reviews. The mean (the '+') for Six is higher than the other Classics, and also the 75th percentile. So yes, it is saying that more Six fans tend toward leaving feedback on stories. Probably because we are grateful that people write Six :D

Thanks. :-) I'm glad I'm following at least to some degree... and ahahaha, so I am statistically right in my hitherto unsupported assertion that you always get comments on Six fic - Six fans are just more appreciative.

Anyway, forgive my dimness, but I wanted to check I was understanding things properly or not.

Six fans are the best, basically - now it's official! :lol:

Disquisitiones Arithmeticae

complete, compact and asymptotically normal
