Reviews, Favorites and Follows
Feb. 25th, 2013 12:18 am
I wanted to write up a post on Eras and Characters in the ff.net database but I ran out of steam. I have to decide how pedantic I want to be about the stats, and how much information I want to impute before writing about it. So instead, here is a preliminary blurb on the relationship between the # of chapters, reviews, favorites and follows in the stories.
ETA: Updated with some info on Reviews and Completions
All of these distributions are highly skewed: there are lots of stories that have 1 chapter, lots of stories that have low numbers of favorites, follows and reviews, and then there are long tails of fics that have lots of all of these.
In R, my story data frame is called, aptly, "story", and here are some five-number summaries plus means for each of these variables:
summary(story[,c(6,8,13,14)])
    chapters          reviews             favs            follows
 Min.   :  1.000   Min.   :   0.00   Min.   :   0.00   Min.   :   0.000
 1st Qu.:  1.000   1st Qu.:   2.00   1st Qu.:   2.00   1st Qu.:   0.000
 Median :  1.000   Median :   4.00   Median :   5.00   Median :   1.000
 Mean   :  3.167   Mean   :  12.25   Mean   :  12.54   Mean   :   7.801
 3rd Qu.:  3.000   3rd Qu.:  10.00   3rd Qu.:  13.00   3rd Qu.:   5.000
 Max.   :370.000   Max.   :1710.00   Max.   :1603.00   Max.   :1054.000
You can see how the numbers jump from the 3rd quartile up to the max; it's ridiculous. The mean is quite a bit larger than the median for all of these, indicating that they are right-skewed, i.e., they have long right tails. Those outlying stories will be over-influential in determining correlations, and the relationship on the original scale is likely not linear. So a correlation matrix on this scale probably isn't capturing the true or typical correlation, but is instead driven by the high outliers and the massive number of low-lying points.
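To make the outlier worry concrete, here is a tiny sketch on made-up numbers (not the ff.net data; the vectors below are purely hypothetical). Five "typical" stories have a perfectly negative relationship, and a single huge story is enough to flip the raw-scale correlation to nearly +1.

```r
# Hypothetical toy data: five stories where more chapters go with fewer
# reviews, plus one extreme story that is huge on both variables.
chapters <- c(1, 2, 3, 4, 5, 1000)
reviews  <- c(5, 4, 3, 2, 1, 1000)

# Correlation with the outlier included: very close to +1
cor_all <- cor(chapters, reviews)

# Correlation for the five typical stories only: -1
cor_typical <- cor(chapters[-6], reviews[-6])

# The same outlier drags the mean far above the median
c(mean = mean(reviews), median = median(reviews))
```

One point out of six decides the sign of the correlation here, which is the kind of thing the log scale is meant to tame.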
To address this, I look at correlations on the log base 2 scale (adding 1 before taking the log to avoid -infinities for the 0's).
Here is a plot of the pairwise data points for these 4 variables, on the log scale:
pairs(log(story[,c(6,8,13,14)]+1,2), pch=".")

Now the bottom right corner of these graphs corresponds to a lot of the data; the points are overlapping. Those are the stories that have smaller numbers of favorites/reviews/follows etc. (1 through 8 or so). The axes are on the log scale, so for an axis value x, taking (2^x)-1 gives you the original number [e.g., a label of 6 corresponds to (2^6)-1 = 63]. Things look generally more linear. I am curious about what looks like a very sharp bound for favorites vs. follows.
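If you want to go back and forth between axis labels and raw counts, a minimal sketch looks like this. The helper names to_log2 and from_log2 are my own for illustration, not anything built into R, and the counts are just example values.

```r
# Forward transform used for the plots: log base 2 of (count + 1)
to_log2 <- function(count) log(count + 1, 2)

# Inverse transform: turn an axis label back into a raw count
from_log2 <- function(x) 2^x - 1

counts <- c(0, 1, 7, 63, 1710)   # example raw counts
axis_values <- to_log2(counts)   # a count of 0 maps to 0 instead of -Inf

from_log2(6)                     # the axis label 6 is a raw count of 63
from_log2(axis_values)           # round trip recovers the raw counts
```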
Everybody remember basic stats? Imagine, now, fitting a best fit straight line through each of these point clouds, and measuring how concentrated the relationship is around that line, with 0 being weakest (patternless noise) and 1 being strongest (points all line up exactly on a straight line). (Strictly speaking, correlations run from -1 to 1, with negative values for downward-sloping lines, but everything here is positive.) That number is what the correlation matrix below is giving us for each square in the above plot.
cor(log(story[,c(6,8,13,14)]+1,2))
          chapters   reviews      favs   follows
chapters 1.0000000 0.5591888 0.2840169 0.6217631
reviews  0.5591888 1.0000000 0.7429512 0.7022728
favs     0.2840169 0.7429512 1.0000000 0.6144298
follows  0.6217631 0.7022728 0.6144298 1.0000000
The diagonals are 1 (strongest) because each variable is of course perfectly correlated with itself. Oddly enough, I thought the relationship between chapters and reviews would be stronger (i.e., long stories or broken-up stories tend to get more reviews) but at 0.56 it is not as strong as most of the others. The correlation between favorites and chapters is low (0.28): length is only weakly associated with favoriting. The strongest correlation on the chart is between reviews and favorites: 0.74. That says that the number of favorites can be pretty strongly predicted by the number of reviews, but from the picture I'd be willing to bet that it works better for stories with larger numbers of both.
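For anyone who wants to see what cor() is doing under the hood, here is a small sketch that computes the Pearson correlation by hand on simulated data and checks it against the built-in. The data, the seed and the pearson helper are all made up for illustration.

```r
set.seed(1)                      # arbitrary seed so the example is repeatable
x <- rnorm(200)
y <- 0.7 * x + rnorm(200, sd = 0.5)

# Pearson correlation: the sum of products of deviations from the means,
# scaled by the spread of each variable
pearson <- function(a, b) {
  da <- a - mean(a)
  db <- b - mean(b)
  sum(da * db) / sqrt(sum(da^2) * sum(db^2))
}

r_manual  <- pearson(x, y)
r_builtin <- cor(x, y)           # should agree with r_manual

# The matrix form: symmetric, with 1s on the diagonal
m <- cor(cbind(x, y))
```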
A few stats on Reviews. In particular, I normalize for longer stories by looking at Reviews per Chapter. The first is a log-log plot of the distribution of Reviews per Chapter for stories in the database. Read it like a histogram, but for the kind of data that has many instances of low values and is skewed by far fewer instances of very high values.

So, e.g., 7185 stories (a little below the 10000 mark on the log scale) have fewer than one review per chapter, and you can see at the bottom right that 5 stories had over 100 reviews per chapter. (These are, incidentally, all 1-chapter stories from between 2007 and 2011. I should control for publish date as well as number of chapters.)
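A rough sketch of how counts like these can be computed, using a made-up data frame called toy rather than the real story data. The binning by whole numbers and the shift by 1 on the x-axis are my assumptions for the sketch, not necessarily how the plot above was built.

```r
# Hypothetical toy data with the same two columns as the real data frame
toy <- data.frame(
  chapters = c(1, 1, 2, 4, 10, 1),
  reviews  = c(0, 3, 1, 12, 5, 150)
)

# Normalize reviews by story length
toy$rpc <- toy$reviews / toy$chapters

n_below_one <- sum(toy$rpc < 1)     # stories with fewer than 1 review/chapter
n_over_100  <- sum(toy$rpc > 100)   # stories with more than 100 reviews/chapter

# Count stories per whole-number bin (bin "0" holds everything below 1)
bin_counts <- table(floor(toy$rpc))

# Log-log display; x is shifted by 1 so that bin 0 can be drawn on a log axis
if (interactive()) {
  plot(as.numeric(names(bin_counts)) + 1, as.numeric(bin_counts),
       log = "xy", xlab = "reviews per chapter (+1)", ylab = "number of stories")
}
```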
Next, a summary with boxplots of the relationship between completion and reviews/chapter. (Note that this does not take into account that incomplete stories may be over-represented among the younger stories.)

Boxplots are visual displays of the five-number summary (Min, Q1, Median, Q3, Max). This plot is labeled with the medians, and the number of stories in each boxplot is printed under the plot. After looking at several different cuts, the chapter breakdowns I went with were: 1, 2, 3, 4-9, 10-15, 16-20, 21-30, and 31+. The Y-axis is on the log-base-2 scale again, because all of these distributions are rather skewed and are easier to see on the log scale.
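The chapter breakdowns can be sketched with cut(). This is on a made-up data frame, and the complete column is a hypothetical name for the completion flag, so treat it as an illustration of the binning rather than the exact code behind the plot.

```r
# Hypothetical toy data; 'complete' is an assumed name for the completion flag
toy <- data.frame(
  chapters = c(1, 1, 2, 3, 4, 9, 10, 15, 16, 20, 21, 30, 31, 370),
  reviews  = c(0, 6, 4, 3, 8, 27, 10, 45, 16, 60, 21, 90, 62, 1710),
  complete = rep(c(TRUE, FALSE), 7)
)

# Bins are right-closed: (0,1], (1,2], (2,3], (3,9], (9,15], ... , (30,Inf]
breaks     <- c(0, 1, 2, 3, 9, 15, 20, 30, Inf)
bin_labels <- c("1", "2", "3", "4-9", "10-15", "16-20", "21-30", "31+")
toy$bin <- cut(toy$chapters, breaks = breaks, labels = bin_labels)

# Reviews per chapter on the log base 2 scale (adding 1 to handle zeros)
toy$log_rpc <- log(toy$reviews / toy$chapters + 1, 2)

# Median per chapter bin and completion status (NA where a cell is empty)
med <- tapply(toy$log_rpc, list(toy$bin, toy$complete), median)

if (interactive()) {
  boxplot(log_rpc ~ complete + bin, data = toy, las = 2)
}
```

With only a handful of rows some cells are empty, which is why the real plot prints the number of stories under each box.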
There is a much more marked difference in reviews per chapter between complete and incomplete stories among the low-chapter stories. The completed ones tend to have more reviews (but again, not accounting for publish date). The effect tails off for middle-length fics, and then we start to see it appear again for the tomes. Could this indicate the "Leave reviews and I will keep writing it!" effect? (Keep in mind, though, that as the categories include more chapters, there is more "wiggle room" in "reviews/chapter", so the distributions for the bigger categories likely include some more sources of variation.)
no subject
Date: 2013-02-25 05:52 am (UTC)
Yeah, I would have expected that to be stronger, too. I guess it makes sense, though, that the strongest correlation is between reviews and favorites: the things people like the most would be both the ones that they felt like commenting on and favoriting.
Are you planning to do anything with this data? Like trying to write up something for the Organization for Transformative Works?
no subject
Date: 2013-02-25 01:38 pm (UTC)
I really want to explore that data.
no subject
Date: 2013-02-26 05:24 am (UTC)
I don't know what your current Erdős number is, but if you want a 4, then we can talk.
no subject
Date: 2013-02-26 05:27 am (UTC)
My undergrad thesis was done with Ken Ono but the paper that was published just has my name on it so I don't think it counts. If it does, then mine is his plus one. If not, eh, I've not really been in the paper publishing business in a while. So it's probably infinity.