eve11: (chance)
[personal profile] eve11
I wanted to write up a post on Eras and Characters in the ff.net database but I ran out of steam. I have to decide how pedantic I want to be about the stats, and how much information I want to impute before writing about it. So instead, here is a preliminary blurb on the relationship between the # of chapters, reviews, favorites and follows in the stories.

ETA: Updated with some info on Reviews and Completions



All of these distributions are highly skewed: there are lots of stories that have 1 chapter, lots of stories that have low numbers of favorites, follows and reviews, and then there are long tails of fics that have lots of all of these.

In R, my story data frame is called, aptly, "story", and here are some five-number summaries plus means for each of these variables:
summary(story[,c(6,8,13,14)])
    chapters          reviews             favs            follows        
 Min.   :  1.000   Min.   :   0.00   Min.   :   0.00   Min.   :   0.000  
 1st Qu.:  1.000   1st Qu.:   2.00   1st Qu.:   2.00   1st Qu.:   0.000  
 Median :  1.000   Median :   4.00   Median :   5.00   Median :   1.000  
 Mean   :  3.167   Mean   :  12.25   Mean   :  12.54   Mean   :   7.801  
 3rd Qu.:  3.000   3rd Qu.:  10.00   3rd Qu.:  13.00   3rd Qu.:   5.000  
 Max.   :370.000   Max.   :1710.00   Max.   :1603.00   Max.   :1054.000  

You can see how the numbers jump from the 3rd quartile up to the max; it's ridiculous. The mean is quite a bit larger than the median for all of these, indicating that they are right-skewed, i.e., they have long right tails. Those outlying stories will be over-influential in determining correlations, and the relationship on the original scale is likely not linear. So a correlation matrix on this scale probably doesn't capture the true or typical correlation; instead it is driven by the high outliers and the massive number of low-lying points.
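
To make the outlier point concrete, here is a minimal sketch on simulated counts. None of this is the ff.net data: the Poisson draws and the single extreme "story" are assumptions chosen purely for illustration.

```r
set.seed(1)
n <- 200

# Two unrelated sets of small counts, standing in for typical stories
x <- rpois(n, 3)
y <- rpois(n, 3)
r_without <- cor(x, y)

# Append one hypothetical story with very large values on both variables
x_out <- c(x, 1500)
y_out <- c(y, 1700)
r_with <- cor(x_out, y_out)

# Compare the correlation before and after adding the single outlier
print(c(without_outlier = r_without, with_outlier = r_with))
```

The two columns are independent by construction, so any large correlation in the second number comes from that one added point.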

To address this, I look at correlations on the log base 2 scale (adding 1 before taking the log to avoid -infinities for the 0's).
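
As a small illustration of that transform, here is a sketch on a made-up vector of counts (again, not values from the database). The helper name log2p1 is one I'm introducing just for this example.

```r
# Hypothetical counts, chosen so the results are whole numbers
counts <- c(0, 1, 3, 7, 63, 1023)

# Log base 2 of (x + 1): a count of zero maps to 0 instead of -Inf
log2p1 <- function(x) log(x + 1, base = 2)

transformed <- log2p1(counts)
print(transformed)

# Without the +1, the zero count would become -Inf
print(log(counts[1], base = 2))
```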

Here is a plot of the pairwise data points for these 4 variables, on the log scale:
pairs(log(story[,c(6,8,13,14)]+1,2), pch=".")
[Figure "revfavfollow": pairwise scatterplots of chapters, reviews, favs and follows on the log base 2 scale]

Now the bottom right corner of these graphs corresponds to a lot of the data; the points are overlapping. Those are the stories that have smaller numbers of favorites/reviews/follows etc (1 through 8 or so). The axes are on the log scale so for axis value x if you take (2^x)-1 you will get the original number [eg, a label of 6 corresponds to (2^6)-1=63]. Things look generally more linear. I am curious about what looks like a very sharp bound for favorites vs. follows.
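
For reading the axes, a tiny sketch of the back-conversion described above. The function name axis_to_count is hypothetical, introduced only for this example.

```r
# Convert a log2(x + 1) axis label back to the original count
axis_to_count <- function(label) 2^label - 1

labels <- 0:10
lookup <- data.frame(axis_label = labels,
                     original_count = axis_to_count(labels))
print(lookup)
```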

Everybody remember basic stats? Imagine, now, fitting a best fit straight line through each of these point clouds, and measuring how concentrated the relationship is around that line, with 0 being weakest (patternless noise) and 1 being strongest (points all line up exactly on a straight line). (Correlations can also be negative, down to -1, when the line slopes downward, but everything here is positive.) That number is what the correlation matrix below is giving us for each square in the above plot.

cor(log(story[,c(6,8,13,14)]+1,2))
          chapters   reviews      favs   follows
chapters 1.0000000 0.5591888 0.2840169 0.6217631
reviews  0.5591888 1.0000000 0.7429512 0.7022728
favs     0.2840169 0.7429512 1.0000000 0.6144298
follows  0.6217631 0.7022728 0.6144298 1.0000000
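
The "best fit line" reading of correlation can be checked directly: for a single predictor, the squared correlation equals the R-squared of the fitted straight line. A minimal sketch on simulated data follows; the variables x and y and the slope of 0.7 are arbitrary assumptions, not anything estimated from the story data.

```r
set.seed(42)

# Simulated pair of variables with a linear relationship plus noise
x <- rnorm(500)
y <- 0.7 * x + rnorm(500, sd = 0.7)

# Pearson correlation, as computed by cor() in the matrix above
r <- cor(x, y)

# R-squared of the least-squares line through the same points
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared

# The squared correlation and the line's R-squared should agree
print(c(correlation = r, correlation_squared = r^2, r_squared = r_squared))
```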


The diagonals are 1 (strongest) because each variable is of course perfectly correlated with itself. Oddly enough, I thought the relationship between chapters and reviews would be stronger (eg, long stories or broken-up stories tend to get more reviews) but it is not as strong as others. The correlation between favorites and chapters is the lowest on the chart (0.28): length is only weakly associated with favoriting. The strongest correlation on the chart is between reviews and favorites: 0.74. That says that the number of favorites can be pretty strongly predicted by the number of reviews, but from the picture I'd be willing to bet that it works better for stories with larger numbers of both.





A few stats on Reviews. In particular, I normalize for longer stories by looking at Reviews per Chapter. The first is a log-log plot of the distribution of Reviews per Chapter for stories in the database. Read it like a histogram, but for the kind of data that has many instances of low values, and is skewed by far fewer instances of very high values.
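
Here is a rough sketch of how such counts could be tabulated. It uses a simulated stand-in data frame called sim: only the column names chapters and reviews come from the summary output earlier, while the distributions, the rpc column name and the doubling bin widths are my assumptions for illustration.

```r
set.seed(7)

# Simulated stand-in for the story data frame (not the real data)
sim <- data.frame(chapters = 1 + rgeom(5000, 0.4),
                  reviews  = rnbinom(5000, size = 0.5, mu = 12))

# Normalize for story length: reviews per chapter
sim$rpc <- sim$reviews / sim$chapters

# Bins that double in width, so they are evenly spaced on a log axis;
# the first bin, [0, 1), holds stories with fewer than one review per chapter
breaks <- c(0, 2^(0:ceiling(log2(max(sim$rpc) + 1))))
bins <- cut(sim$rpc, breaks = breaks, right = FALSE)

# Number of stories in each bin
counts <- table(bins)
print(counts)
```

The resulting counts could then be drawn with both axes on a log scale, which is what makes the long tail visible.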

[Figure "reviewsperchapter": log-log plot of the distribution of Reviews per Chapter]
So, e.g., 7185 stories (a little less than 10000 on the log scale) have fewer than one review per chapter, and you can see at the bottom right that 5 stories had over 100 reviews per chapter. (These are, incidentally, all 1-chapter stories from between 2007 and 2011. I should control for publish date as well as number of chapters.)

A summary with boxplots on the relationship between completion and reviews/chapter. (Note this does not take into account that incomplete stories may be over-represented as younger stories)

[Figure "chapcompletereview": boxplots of reviews per chapter by chapter group, complete vs. incomplete]

Boxplots are visual displays of the five-number summary (Min, Q1, Median, Q3, Max). This plot is labeled with the medians, and the number of stories in each boxplot is printed under the plot. After looking at several different cuts, the chapter breakdowns I went with were: 1, 2, 3, 4-9, 10-15, 16-20, 21-30, and 31+. The Y-axis is on the log base 2 scale again, because all of these distributions are rather skewed and are more easily visible on the log scale.
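
The chapter grouping can be sketched with cut(). As before, sim is simulated data; the complete column name, the rpc column name and the distributions are assumptions, and only the group boundaries are taken from the list above.

```r
set.seed(11)

# Simulated stand-in data; "complete" is a hypothetical column name
sim <- data.frame(chapters = 1 + rgeom(3000, 0.25),
                  reviews  = rnbinom(3000, size = 0.5, mu = 12),
                  complete = sample(c(TRUE, FALSE), 3000, replace = TRUE))
sim$rpc <- sim$reviews / sim$chapters

# Chapter groups: 1, 2, 3, 4-9, 10-15, 16-20, 21-30, 31+
# (intervals are closed on the right, e.g. (3, 9] covers 4 through 9)
sim$chap_group <- cut(sim$chapters,
                      breaks = c(0, 1, 2, 3, 9, 15, 20, 30, Inf),
                      labels = c("1", "2", "3", "4-9", "10-15",
                                 "16-20", "21-30", "31+"))

# Median reviews per chapter for each group and completion status
medians <- aggregate(rpc ~ chap_group + complete, data = sim, FUN = median)

# Number of stories behind each box
sizes <- table(sim$chap_group, sim$complete)

print(medians)
print(sizes)

# To draw the boxes on a log base 2 scale (the +1 is my assumption,
# to keep zero-review stories on the plot):
# boxplot(log(rpc + 1, 2) ~ complete + chap_group, data = sim)
```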

There is a much more marked difference of reviews per chapter for complete vs. incomplete on low-chapter stories. The completed ones tend to have more reviews (but again, not accounting for publish date). The effect tails off for middle-length chapter fics, and then we start to see it appear again for the tomes. Could this indicate the "Leave reviews and I will keep writing it!" effect? (Keeping in mind though, that as the categories include more chapters, there is more "wiggle room" in "reviews/chapter", so the distributions are likely including some more sources of variation for the bigger categories.)

Date: 2013-02-25 05:52 am (UTC)
From: [identity profile] ladymercury-10.livejournal.com
Oddly enough, I thought the relationship between chapters and reviews would be stronger (eg, long stories or broken-up stories tend to get more reviews) but it is not as strong as others.

Yeah, I would have expected that to be stronger, too. I guess it makes sense, though, that the strongest correlation is between reviews and favorites--the things people like the most would be both the ones that they felt like commenting on and favoriting.

Are you planning to do anything with this data? Like trying to write up something for the Organization for Transformative Works?

Date: 2013-02-25 05:56 pm (UTC)
From: [identity profile] ladymercury-10.livejournal.com
It definitely sounds like a fun investigation whatever you end up doing with it!

Date: 2013-02-26 05:24 am (UTC)
From: [identity profile] skurtchasor.livejournal.com
Google "analysis of 4chan" and click the first link. As far as I can tell, this is a real academic paper.

I don't know what your current Erdős number is, but if you want a 4, then we can talk.

Date: 2013-02-25 02:53 pm (UTC)
From: [identity profile] a-phoenixdragon.livejournal.com
I get more favs and follows than reviews - constantly. Seems to work that way over at AO3 as well...
