eve11: (chance)
[personal profile] eve11
In my previous post on categories and genres, I mentioned that it's a bit tricky to interpret the boxplots by category and genre because of the multiplicity of multi-categorized or multi-labeled stories. I theorized that at least part of the uniformity in reviews by era seen for the Classic era may be due to the fact that stories tend to overlap in eras more so for Classic than for New Who. Let us examine this in a bit more detail.


There are 34145 Teaspoon stories (as of the collection of my db on Feb 25), and 16 category labels, which appear prominently on the front page. These categories are not one-to-one with stories; you can see this reflected in the fact that if you take the sum of the counts listed next to each of the categories, it will be larger than the total number of stories listed at the top (on this Sunday the 3rd of March, the total reads 34165 and the sum of the category listings is 40333).

So, some stories are multiply labeled. So let's look at that. In R, I made a big matrix called "categorymat", with one row per story (34145 stories as of 25 Feb. 2013), and one column per category (16 categories). In the ij-th cell I place a 1 if story i has category j listed as one of its categories, and 0 otherwise. I call this an "indicator matrix".
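
For anyone who wants to try this at home, here is a minimal sketch of one way to build such an indicator matrix. It assumes the labels start out in a "long" table with one row per (story, category) pair; the data frame `labels`, its column names, and the toy values are all made up for illustration and are not the actual Teaspoon data or my actual code.

```r
# Toy long-format data: one row per (story, category) label.
# The ids and labels below are invented for illustration only.
labels <- data.frame(
  story    = c(1, 2, 2, 3, 4, 4, 4),
  category = c("Tenth Doctor", "First Doctor", "Tenth Doctor",
               "Torchwood", "First Doctor", "Tenth Doctor", "Torchwood")
)

# Cross-tabulate stories (rows) against categories (columns).
# unclass() turns the table into a plain matrix that keeps its row/column names.
toymat <- unclass(table(labels$story, labels$category))

# table() counts occurrences, so clamp to 0/1 in case a label is repeated.
toymat[toymat > 1] <- 1

toymat
```

Each row of `toymat` is a story and each column a category, with a 1 wherever the story carries that label.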

If I sum the indicator matrix down each column, I should get the number of stories per category, i.e., reproduce the numbers that appear on Teaspoon's main page:
> apply(categorymat,2, sum) # apply the function 'sum' to the matrix categorymat, (2) column-wise
         First Doctor         Second Doctor          Third Doctor 
                  530                   504                   665 
        Fourth Doctor          Fifth Doctor          Sixth Doctor 
                 1039                  1055                   434 
       Seventh Doctor         Eighth Doctor         Other Doctors 
                  634                   908                   662 
            Other Era             Multi-Era          Ninth Doctor 
                  698                  2975                  4637 
         Tenth Doctor             Torchwood Sarah Jane Adventures 
                17110                  5319                   490 
      Eleventh Doctor 
                 2555 
Pretty close to what you see today. Also, you can give column names to a matrix in R, and it will helpfully print out the labels, which is why it is nicely labeled here by category.
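
As a side note, base R also has `colSums` and `rowSums`, which give the same totals as `apply(..., sum)` on a numeric matrix. A small sketch on a made-up 4-story, 3-category indicator matrix (the matrix `m` is a toy, not the real data):

```r
# Toy indicator matrix: 4 stories (rows) by 3 categories (columns).
m <- matrix(c(1, 0, 0,
              1, 1, 0,
              0, 1, 1,
              1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("First Doctor", "Tenth Doctor", "Torchwood")))

percat   <- colSums(m)  # stories per category, same as apply(m, 2, sum)
perstory <- rowSums(m)  # categories per story, same as apply(m, 1, sum)

percat
table(perstory)
```

The column names carry through to `percat`, which is what produces the nicely labeled printout.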

Summing the indicator matrix across rows gives me the number of categories listed for each story. This will result in 34145 numbers which is too many to print, but I can look at the distribution by counts and percentages, using the handy "table" function, which tabulates the counts (bottom) for each of the values (top):
> sumcat = apply(categorymat, 1, sum) # apply the function 'sum' to the matrix categorymat, (1) row-wise
> table(sumcat)
    1     2     3     4     5     6     7     8     9    10    11    12    13 
29653  3528   715   123    61    12    16     6     7     6     9     7     2 

> round(table(sumcat)/sum(table(sumcat)),4)
     1      2      3      4      5      6      7      8      9     10     11 
0.8684 0.1033 0.0209 0.0036 0.0018 0.0004 0.0005 0.0002 0.0002 0.0002 0.0003 
    12     13 
0.0002 0.0001 

The majority of stories, 29653 or about 87%, have only one category listed. Note we are not looking at which category just yet, only that they have only one category, whatever that is. And we also see that 97.2% of the stories have either one or two categories listed. The most categories listed is 13. There are two stories whose intrepid archivists took the care to list 13 separate categories for them:
> tables$story[which(sumcat==13),]
         id                                           title author  authorname
17875 27099 Always Time For Tea: Drabbles In Time And Space   1355  TimeLord89
27706 40926                         The Sarah Jane Memories  12206 enchantment
      rating chapters words lengthcat reviews uniqreview  updated published
17875      1       22  2246     2K-5K      22          4 12/20/12  11/13/08
27706      1        1  1729     1K-2K       9          7 03/14/11  03/14/11
      complete
17875        0
27706        1
                                                                                                                                                                     description
17875 An ever-growing collection of drabbles taken from all eras of Doctor Who, and beyond! 
[Individual drabbles posted as separate chapters]
27706 It's Sarah Jane's 93rd birthday and all eleven Doctors arrive at Bannerman Road to 
enjoy her special day.  A few companions join in the fun and a good time is had by all.
(Good on you, Sarah Jane.)

Anyhow, these multi-multi-categorized stories are only a small fraction of the database. What this is telling us is, if we look at pairwise categories, we will capture most of the trends. I also hypothesize that Classic Era "plays well with others" more so than New Who does. That is, although only about 13% of the database is labeled with multiple categories, the multiplicity may be overrepresented in Classic as opposed to New Who stories.

To examine this, let's get a bit fancy, and take column-wise sums for the single-category stories and multiple-category stories separately:
> singlecat = apply(categorymat[sumcat==1,], 2, sum) # tabulate categories for the single-category stories
> multicat = apply(categorymat[sumcat>1,], 2, sum)   # tabulate categories for the multiple-category stories
>  data.frame(Single=singlecat, Multi=multicat, Total=singlecat+multicat, PctMulti=multicat/(singlecat+multicat) )

                      Single Multi Total  PctMulti
First Doctor             308   222   530 0.4188679
Second Doctor            321   183   504 0.3630952
Third Doctor             385   280   665 0.4210526
Fourth Doctor            662   377  1039 0.3628489
Fifth Doctor             750   305  1055 0.2890995
Sixth Doctor             256   178   434 0.4101382
Seventh Doctor           355   279   634 0.4400631
Eighth Doctor            583   325   908 0.3579295
Other Doctors            412   250   662 0.3776435
Other Era                340   358   698 0.5128940
Multi-Era               1572  1403  2975 0.4715966
Ninth Doctor            3677   960  4637 0.2070304
Tenth Doctor           14304  2806 17110 0.1639977
Torchwood               3848  1471  5319 0.2765557
Sarah Jane Adventures    142   348   490 0.7102041
Eleventh Doctor         1738   817  2555 0.3197652
So, although only about 13% of the database is multi-labeled, the multiple labels include significant proportions of all stories for what appear to be lesser-written Doctors. One, Six, and Seven, who are often the most difficult Doctors for new fans to "get", are the front-runners of playing nicely, with 41% to 44% of their stories cross-listed. Sarah Jane Adventures is an obvious high outlier; 71% of her stories are cross-listed with a different category. Ten, even in Teaspoon, appears to be the King of Lonely Emo, as his category only appears cross-listed in 16% of his stories. Note however, despite having the lowest cross-listed rate, Ten still has the highest sheer number of cross-listed stories. There is a lot of Ten in the database. Nine is only slightly more cross-listed than Ten. Eleven and Five seem to be about on par; perhaps it has something to do with the lack of eyebrows.

"Multi-Era" is... hard to interpret. About half-and-half, some folks just label a multi-era story "Multi-Era", while others may list the other categories it appears in. We are a bit blind to what categories actually appear in Multi-Era stories with no other listing (1572 stories or about 4.6% of the database).

Pairwise, what can we do? Well, my answer is, make some image plots. You can think of co-occurrence or cross-listing as a kind of correlation, and correlations are nice to display in image matrices.

The interpretation of correlation is that the categories are variables, and the units are stories. We want to know how category X and category Y co-occur by story, eg, their correlation is to what extent they are cross-listed. The measurement of each category for a story is binary: the story either is listed as the category or not, represented in the indicator matrix as either a 1 or a 0.

So each category has associated with it a string of 34145 1's or 0's. A standard metric is to take the Jaccard index between two such strings, which is a fancy way of saying:
J(X,Y) = (# of stories listing both X and Y)/(# listing both X and Y + # listing X without Y + # listing Y without X). I.e., it is the size of the intersection of X and Y divided by the size of the union. This metric is symmetric in that J(X,Y) = J(Y,X). It goes from 0 (mutually exclusive) to 1 (completely cross-listed).
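
Here is a minimal sketch of how the whole pairwise Jaccard matrix can be computed from an indicator matrix in one go. The function name `jaccard_matrix` and the toy matrix `m` are my own illustrations, not the code behind the plots; the sketch also assumes every category has at least one story (an empty category would give 0/0).

```r
# Toy indicator matrix: 4 stories (rows) by 3 categories (columns).
cats <- c("First Doctor", "Tenth Doctor", "Torchwood")
m <- matrix(c(1, 0, 0,
              1, 1, 0,
              0, 1, 1,
              1, 1, 1),
            nrow = 4, byrow = TRUE, dimnames = list(NULL, cats))

jaccard_matrix <- function(ind) {
  both  <- crossprod(ind)                  # both[x, y]: # stories listing x AND y
  total <- diag(both)                      # total[x]:   # stories listing x
  uni   <- outer(total, total, "+") - both # # stories listing x OR y
  both / uni
}

J <- jaccard_matrix(m)
round(J, 3)
```

`crossprod(ind)` is just `t(ind) %*% ind`: for 0/1 columns, the inner product of two columns counts the stories that carry both labels, and the diagonal holds the per-category totals.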

Here is a picture* of the Jaccard similarity for pairwise categories in Teaspoon:
[Image "dcat": Jaccard similarity matrix for Teaspoon categories]
The overall correlations are pretty small, maxing out at about 0.06 on a scale of 0.00 to 1.00. The biggest Jaccard similarity is for Six and Seven (and I am guessing that may be because they tend to show up most of the time in larger multi-category work, where eg, you get one chapter per Doctor and stuff like that). The diagonals are marked as white: every category is perfectly correlated with itself. You can see the tendency of insularity in Classic and New Who as the fact that the top left and bottom right squares are brighter than the off-quadrants.

I think though, that the Jaccard similarity is not exactly set up to deal well with large imbalances in the totals. For example, because there are so many more thousands of stories with Ten in them, it is hard to tell to what extent the Classic eras overlap with him on their own terms. They will always have small Jaccard similarities with him: even if every First Doctor story in the database also included Ten, the Jaccard index would still be only 530/17110 = 0.03.
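
To spell out that arithmetic, here is a small sketch using the category totals from the table above (530 First Doctor stories, 17110 Tenth Doctor stories). The helper `jaccard_from_counts` is a name I made up for illustration.

```r
# Jaccard index from counts: intersection / union,
# where union = n_x + n_y - n_both.
jaccard_from_counts <- function(n_both, n_x, n_y) {
  n_both / (n_x + n_y - n_both)
}

n_first <- 530    # First Doctor stories (from the table above)
n_ten   <- 17110  # Tenth Doctor stories (from the table above)

# Best case for One: every single First Doctor story is also labeled Ten.
best_case <- jaccard_from_counts(n_both = n_first, n_x = n_first, n_y = n_ten)
round(best_case, 3)
```

In general the index can never exceed (smaller total)/(larger total), so a small category paired with a huge one is capped at a low value no matter how thoroughly it is cross-listed.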

So I made a second matrix. The i-j'th cell of this matrix can be read as "the percent of stories in the category labeled (row) that are also labeled as the category (column)". That is:
R(X,Y) = (#stories listing X and Y)/(#stories listing X)
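
A minimal sketch of this matrix, again on a made-up toy indicator matrix; the function name `row_conditional` is hypothetical and this is not necessarily how the plotted matrix was produced.

```r
# Toy indicator matrix: 4 stories (rows) by 3 categories (columns).
cats <- c("First Doctor", "Tenth Doctor", "Torchwood")
m <- matrix(c(1, 0, 0,
              1, 1, 0,
              0, 1, 1,
              1, 1, 1),
            nrow = 4, byrow = TRUE, dimnames = list(NULL, cats))

row_conditional <- function(ind) {
  both <- crossprod(ind)           # both[x, y]: # stories listing x AND y
  # Divide each row x by the number of stories listing x (the diagonal).
  sweep(both, 1, diag(both), "/")
}

R <- row_conditional(m)
round(R, 3)
```

In the toy data, a third of First Doctor stories are also Torchwood, but half of Torchwood stories are also First Doctor, so `R["First Doctor", "Torchwood"]` and `R["Torchwood", "First Doctor"]` differ.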

It is NOT symmetric: the percent of First Doctor stories that are cross-listed with Ten is not going to be the same as the percent of Tenth Doctor stories that are cross-listed with One. I call it "row-conditional" because it is looking at conditional percentages by row. Here it is below (note that the colors are similar but the scale has changed):
[Image "dcat2": row-conditional cross-listing matrix for Teaspoon categories]
We now see the Black Hole that is Ten (he has a LOT of stories, and the rest of the eras cannot keep up with him), but we also see that bright streak of Multi-Era and Ten that shows up in the top right. Significant proportions of the Classic era stories are also cross-listed with Ten. Seven and Multi-Era get along well, as do Five and Six; it looks like ~15% of Sixth Doctor stories are cross-listed with Five as well. Sarah Jane is most highly associated with Ten, and there is a bright spot there with Four as well. The top left quadrant is Classic Who; from both the Jaccard and the row-conditional matrices, we can see that the Classic era definitely tends to show up cross-listed more often.

I did these matrices for genres as well. There are a lot more genres than categories, they are more cross-listed, and only 56% of stories have 2 or fewer genres listed (to get 97% of the database, we have to look at 7-way interactions among genres). But there are some obvious trends:

Jaccard:
[Image "dgen": Jaccard similarity matrix for Teaspoon genres]

row-conditional:
[Image "dgen2": row-conditional cross-listing matrix for Teaspoon genres]
Het/Romance, dominating the pairwise associations. Followed by Introspection/Character Study, and Fluff/Het, and Fluff/Humor. d'aw. Wait, am I 'shipping genres? That just seems too meta.

*These plots were done using the image() function in R, which I used a lot in my thesis work. Today I found an online resource with some code for adding the image color scale at the bottom to image plots. Brilliant!
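
For the curious, here is a rough sketch of drawing a labeled similarity matrix with `image()`. It is not the code used for the plots above: the function name `plot_similarity`, the palette, and the toy matrix are my own choices for illustration, and the color-scale legend from the online resource is not reproduced.

```r
# Toy symmetric similarity matrix with category names on both dimensions.
cats <- c("First Doctor", "Tenth Doctor", "Torchwood")
sim <- matrix(c(1,    0.5, 0.25,
                0.5,  1,   2/3,
                0.25, 2/3, 1),
              nrow = 3, byrow = TRUE, dimnames = list(cats, cats))

plot_similarity <- function(sim, main = "") {
  n <- nrow(sim)
  diag(sim) <- NA  # NA cells are left unpainted, so the diagonal shows as blank
  # image() draws z[i, j] at (x[i], y[j]) with y increasing upward, so
  # transpose and reverse the columns to put matrix row 1 at the top.
  z <- t(sim)[, n:1]
  image(1:n, 1:n, z, col = heat.colors(20), axes = FALSE,
        xlab = "", ylab = "", main = main)
  axis(1, at = 1:n, labels = colnames(sim), las = 2, cex.axis = 0.7)
  axis(2, at = 1:n, labels = rev(rownames(sim)), las = 2, cex.axis = 0.7)
  box()
  invisible(z)  # return the reoriented matrix that was actually drawn
}
```

Calling `plot_similarity(sim, main = "Toy similarity")` draws the matrix in the same orientation as it prints, with the first category in the top-left corner.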

Date: 2013-03-04 12:50 am (UTC)
From: [identity profile] doctorpancakes.livejournal.com
Call me crazy, but I ship Fluff/Character Study. OTP! These graphs totally fascinate me.

(They also confirm that I really need to get on with writing more Six fics)

Date: 2013-03-04 05:38 am (UTC)
From: [identity profile] doctorpancakes.livejournal.com
Drabble/Action!Adventure is a good ship. I'd sail it.

(I'm writing a series of 50 tweet-length fics for the anniversary or something, and I think I actually just wrote something that nearly fits the description. Go figure)

Date: 2013-03-04 07:08 am (UTC)
From: [identity profile] a-phoenixdragon.livejournal.com
These are gorgeous! And highly informative! Does seem Ten just eats the hell outta Teaspoon!!

*HUGS*

Date: 2013-03-04 02:00 pm (UTC)
thisbluespirit: (buffy - Giles librarian)
From: [personal profile] thisbluespirit
Oooh, okay, it was pretty interesting already - now it's getting fascinating.

And as someone who decided (I can't remember why) to read through Teaspoon category by category (well, classic Who!) it's really interesting to see that some of the impressions I got of some Doctors were correct. Other things are less expected, but this is v v interesting. (You can see the slight lightening of Nine crossovers with Eight - because there's a high number of Time War fics in Eight. And One gets more Other/Multi Era fics than Two, showing up the Academy era fics that screw up the First Doctor category (although it's less than I'd actually thought. Because it's not something I tend to want to read, it seems to represent more than it is when looking through One.) And I've been thoroughly frustrated by the amount of Seven fics that are just multi-era fics with pretty much all the Doctors and little Seven - and that shows up. (Well, that does happen with all the Classic Doctors. There still needs to be way more Classic Who fic, really.)

Also Torchwood does not go well with Classic Who, then? :lol:

Anyway, that was interesting. Thank you. I am a bit obsessive about Teaspoon, generally. Probably because of my various attempts to sort through all of it (not read all of it, that would be silly).

Date: 2013-03-05 12:27 pm (UTC)
thisbluespirit: (DW - Eight)
From: [personal profile] thisbluespirit
And his total crossover is right around 20 percent, so he doesn't seem to play well with the Classic Era.

Destroy Gallifrey and your former selves will never talk to you again. :-)
