Co-occurrences
Mar. 3rd, 2013 06:27 pm
In my previous post on categories and genres, I mentioned that it's a bit tricky to interpret the boxplots by category and genre because of the multiplicity of multi-categorized or multi-labeled stories. I theorized that at least part of the uniformity in reviews by era seen for the Classic era may be due to the fact that stories tend to overlap in eras more so for Classic than for New Who. Let us examine this in a bit more detail.
There are 34145 Teaspoon stories (as of the collection of my db on Feb 25), and 16 category labels, which appear prominently on the front page. These categories are not one-to-one with stories; you can see this reflected in the fact that if you take the sum of the counts listed next to each of the categories, it will be larger than the total number of stories listed at the top (on this Sunday the 3rd of March, the total reads 34165 and the sum of the category listings is 40333).
So, some stories are multiply labeled. So let's look at that. In R, I made a big matrix called "categorymat", with one row per story (34145 stories as of 25 Feb. 2013), and one column per category (16 categories). In the ij-th cell I place a 1 if story i has category j listed as one of its categories, and 0 otherwise. I call this an "indicator matrix".
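For readers who want to try this at home, here is a minimal sketch of how such an indicator matrix could be built. It assumes the listings start out as a long table with one row per (story, category) pair; the data frame `listing`, its column names, and the toy values are all made up for illustration and are not the actual Teaspoon database or the code used for this post.

```r
# Hypothetical long-format listing: one row per (story, category) pair.
# The story ids and pairings below are invented for illustration only.
listing <- data.frame(
  story    = c(101, 102, 102, 103, 104, 104, 104),
  category = c("Tenth Doctor", "First Doctor", "Tenth Doctor",
               "Torchwood", "Fifth Doctor", "Sixth Doctor", "Multi-Era"),
  stringsAsFactors = FALSE
)

# Cross-tabulate stories against categories. Each cell counts how often
# the pair occurs, so clamp to 0/1 in case a pair is listed twice.
toymat <- unclass(table(listing$story, listing$category))
toymat[toymat > 1] <- 1

dim(toymat)      # number of stories by number of categories
rowSums(toymat)  # categories listed per story
colSums(toymat)  # stories per category
```

The row and column sums at the end are the same two summaries computed on the real matrix in the rest of this post.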
If I sum the indicator matrix down each column, I should get the number of stories per category, eg, reproduce the numbers that appear on Teaspoon's main page:
> apply(categorymat,2, sum) # apply the function 'sum' to the matrix categorymat, (2) column-wise
         First Doctor         Second Doctor          Third Doctor
                  530                   504                   665
        Fourth Doctor          Fifth Doctor          Sixth Doctor
                 1039                  1055                   434
       Seventh Doctor         Eighth Doctor         Other Doctors
                  634                   908                   662
            Other Era             Multi-Era          Ninth Doctor
                  698                  2975                  4637
         Tenth Doctor             Torchwood Sarah Jane Adventures
                17110                  5319                   490
      Eleventh Doctor
                 2555
Pretty close to what you see today. Also, you can give column names to a matrix in R, and it will helpfully print out the labels, which is why it is nicely labeled here by category.
Summing the indicator matrix across rows gives me the number of categories listed for each story. This will result in 34145 numbers which is too many to print, but I can look at the distribution by counts and percentages, using the handy "table" function, which tabulates the counts (bottom) for each of the values (top):
> sumcat = apply(categorymat, 1, sum) # apply the function 'sum' to the matrix categorymat, (1) row-wise
> table(sumcat)
    1     2     3     4     5     6     7     8     9    10    11    12    13
29653  3528   715   123    61    12    16     6     7     6     9     7     2
> round(table(sumcat)/sum(table(sumcat)),4)
     1      2      3      4      5      6      7      8      9     10     11
0.8684 0.1033 0.0209 0.0036 0.0018 0.0004 0.0005 0.0002 0.0002 0.0002 0.0003
    12     13
0.0002 0.0001
The majority of stories, 29653 or about 87%, have only one category listed. Note we are not looking at which category just yet, only that they have only one category, whatever that is. And we also see that 97.2% of the stories (33181 of 34145) have either one or two categories listed. The most categories listed is 13. There are two stories whose intrepid archivists took the care to list 13 separate categories for them:
> tables$story[which(sumcat==13),]
id title author authorname
17875 27099 Always Time For Tea: Drabbles In Time And Space 1355 TimeLord89
27706 40926 The Sarah Jane Memories 12206 enchantment
rating chapters words lengthcat reviews uniqreview updated published
17875 1 22 2246 2K-5K 22 4 12/20/12 11/13/08
27706 1 1 1729 1K-2K 9 7 03/14/11 03/14/11
complete
17875 0
27706 1
description
17875 An ever-growing collection of drabbles taken from all eras of Doctor Who, and beyond!
[Individual drabbles posted as separate chapters]
27706 It's Sarah Jane's 93rd birthday and all eleven Doctors arrive at Bannerman Road to
enjoy her special day. A few companions join in the fun and a good time is had by all.
(Good on you, Sarah Jane.) Anyhow, these multi-multi-categorized stories are only a small fraction of the database. What this is telling us is, if we look at pairwise categories, we will capture most of the trends. I also hypothesize that Classic Era "plays well with others" more so than New Who might do. That is, that although only about 13% of the database is labeled with multiple categories, the multiplicity may be overrepresented in Classic as opposed to New Who stories.
To examine this, let's get a bit fancy, and take column-wise sums for the single-category stories and multiple-category stories separately:
> singlecat = apply(categorymat[sumcat==1,], 2, sum) # tabulate categories for the single-category stories
> multicat = apply(categorymat[sumcat>1,], 2, sum) # tabulate categories for the multiple-category stories
> data.frame(Single=singlecat, Multi=multicat, Total=singlecat+multicat, PctMulti=multicat/(singlecat+multicat) )
                      Single Multi Total  PctMulti
First Doctor             308   222   530 0.4188679
Second Doctor            321   183   504 0.3630952
Third Doctor             385   280   665 0.4210526
Fourth Doctor            662   377  1039 0.3628489
Fifth Doctor             750   305  1055 0.2890995
Sixth Doctor             256   178   434 0.4101382
Seventh Doctor           355   279   634 0.4400631
Eighth Doctor            583   325   908 0.3579295
Other Doctors            412   250   662 0.3776435
Other Era                340   358   698 0.5128940
Multi-Era               1572  1403  2975 0.4715966
Ninth Doctor            3677   960  4637 0.2070304
Tenth Doctor           14304  2806 17110 0.1639977
Torchwood               3848  1471  5319 0.2765557
Sarah Jane Adventures    142   348   490 0.7102041
Eleventh Doctor         1738   817  2555 0.3197652
So, although only about 13% of the database is multi-labeled, the multiple labels include significant proportions of all stories for what appear to be lesser-written Doctors. One, Six, and Seven, who are often the most difficult Doctors for new fans to "get", are the front-runners of playing nicely, with 41% to 44% of their stories cross-listed. Sarah Jane Adventures is an obvious high outlier; 71% of her stories are cross-listed with a different category. Ten, even in Teaspoon, appears to be the King of Lonely Emo, as his category only appears cross-listed in 16% of his stories. Note however, despite having the lowest cross-listed rate, Ten still has the highest sheer number of cross-listed stories. There is a lot of Ten in the database. Nine is only slightly more cross-listed than Ten. Eleven and Five seem to be about on par; perhaps it has something to do with the lack of eyebrows.
"Multi-Era" is... hard to interpret. About half-and-half, some folks just label a multi-era story "Multi-Era", while others may list the other categories it appears in. We are a bit blind to what categories actually appear in Multi-Era stories with no other listing (1572 stories or about 4.6% of the database).
Pairwise, what can we do? Well, my answer is, make some image plots. You can think of co-occurrence or cross-listing as a kind of correlation, and correlations are nice to display in image matrices.
The interpretation of correlation is that the categories are variables, and the units are stories. We want to know how category X and category Y co-occur by story, eg, their correlation is to what extent they are cross-listed. The measurement of each category for a story is binary: the story either is listed as the category or not, represented in the indicator matrix as either a 1 or a 0.
So each category has associated with it a string of 34145 1's or 0's. A standard metric for comparing two such strings is the Jaccard index, which is a fancy way of saying:
J(X,Y) = (# of stories listing both X and Y)/( #listing both X and Y + #listing X without Y + #listing Y without X). That is, it is the intersection of X and Y divided by the union. This metric is symmetric in that J(X,Y) = J(Y,X). It goes from 0 (mutually exclusive) to 1 (completely cross-listed).
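The formula above can be sketched in a few lines of R. This is only an illustration on a made-up toy matrix, not necessarily how the matrices in this post were computed; the helper name `jaccard_matrix` is hypothetical. It relies on the fact that for a 0/1 indicator matrix, `crossprod(m)` (that is, t(m) %*% m) counts the stories listing both categories in each pair, with the per-category totals on the diagonal.

```r
# Toy indicator matrix (made-up data): 6 stories by 3 categories.
toymat <- matrix(c(1, 0, 0,
                   1, 1, 0,
                   0, 1, 0,
                   0, 1, 1,
                   1, 1, 0,
                   0, 0, 1),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("A", "B", "C")))

# Hypothetical helper: pairwise Jaccard index between the columns of a
# 0/1 indicator matrix.
jaccard_matrix <- function(m) {
  both  <- crossprod(m)              # cell [X, Y] = # stories listing both X and Y
  n     <- diag(both)                # diagonal    = # stories listing each category
  union <- outer(n, n, "+") - both   # #X + #Y - #(X and Y) = # listing X or Y
  both / union                       # NaN for any pair of empty categories
}

J <- jaccard_matrix(toymat)
round(J, 3)
```

In the toy data, A and B share 2 stories out of 5 that list either one, so J(A,B) is 2/5; A and C never co-occur, so J(A,C) is 0.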
Here is a picture* of the Jaccard similarity for pairwise categories in Teaspoon:

The overall correlations are pretty small, maxing out at about 0.06 on a scale of 0.00 to 1.00. The biggest Jaccard similarity is for Six and Seven (and I am guessing that may be because they tend to show up most of the time in larger multi-category work, where eg, you get one chapter per Doctor and stuff like that). The diagonals are marked as white: every category is perfectly correlated with itself. You can see the tendency of insularity in Classic and New Who as the fact that the top left and bottom right squares are brighter than the off-quadrants.
I think though, that the Jaccard similarity is not exactly set up to deal well with large imbalances in the totals. For example, because there are so many more thousands of stories with Ten in them, it is hard to tell to what extent the Classic eras overlap with him on their own terms. They will always have small Jaccard similarities with him, because eg, if every First Doctor story in the database also included Ten, the intersection would be all 530 First Doctor stories and the union all 17110 Tenth Doctor stories, so the Jaccard index would still be 530/17110 = 0.03.
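The ceiling in that example generalizes: the intersection can be no bigger than the smaller category, and the union no smaller than the larger one, so the Jaccard index of two categories can never exceed the ratio of their sizes. A small sketch using the two totals from the column sums earlier in this post (the variable names are just for illustration):

```r
# Category totals taken from the column sums reported earlier in the post.
n_first <- 530    # First Doctor
n_ten   <- 17110  # Tenth Doctor

# Best case: every story in the smaller category also lists the larger one,
# so the intersection is the whole smaller category and the union is the
# whole larger category.
max_jaccard <- min(n_first, n_ten) / max(n_first, n_ten)
round(max_jaccard, 2)
```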
So I made a second matrix. The i-j'th cell of this matrix can be read as "the percent of stories in the category labeled (row) that are also labeled as the category (column)". EG:
R(X,Y) = (#stories listing X and Y)/(#stories listing X)
It is NOT symmetric: the percent of First Doctor stories that are cross-listed with Ten is not going to be the same as the percent of Tenth Doctor stories that are cross-listed with One. I call it "row-conditional" because it is looking at conditional percentages by row. Here it is below (note that the colors are similar but the scale has changed):

We now see the Black Hole that is Ten (he has a LOT of stories, and the rest of the eras cannot keep up with him), but we also see that bright streak of Multi-Era and Ten that shows up in the top right. Significant proportions of the Classic era stories are also cross-listed with Ten. Seven and Multi-Era get along well, as well as Five and Six; it looks like ~15% of Sixth Doctor stories are cross-listed with Five as well. Sarah Jane is most highly associated with Ten, and there is a bright spot there with Four as well. The top left quadrant is Classic Who; from both the Jaccard and the row-conditional matrices, we can see that the Classic era definitely tends to show up cross-listed more often.
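For reference, the row-conditional matrix R(X,Y) defined above can be sketched like this. Again it is a toy illustration with made-up data and a hypothetical helper name, `row_conditional`, rather than the exact code behind the picture.

```r
# Toy indicator matrix (made-up data): 6 stories by 3 categories.
toymat <- matrix(c(1, 0, 0,
                   1, 1, 0,
                   0, 1, 0,
                   0, 1, 1,
                   1, 1, 0,
                   0, 0, 1),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("A", "B", "C")))

# Hypothetical helper: cell [X, Y] is the fraction of stories listing X
# (the row category) that also list Y (the column category).
row_conditional <- function(m) {
  both <- crossprod(m)       # co-occurrence counts, totals on the diagonal
  n    <- diag(both)         # # stories listing each category
  sweep(both, 1, n, "/")     # divide row X by the total for X
}

R <- row_conditional(toymat)
round(R, 3)

# Not symmetric: 2 of the 3 "A" stories list "B",
# but only 2 of the 4 "B" stories list "A".
R["A", "B"]
R["B", "A"]
```

Unlike the Jaccard index, a small category that is entirely swallowed by a big one would show up here as a 1 in the small category's row, which is why the imbalance with Ten stops hiding the Classic overlaps.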
I did these matrices for genres as well. There are a lot more genres than categories, they are more cross-listed, and only 56% of stories have 2 or fewer genres listed (to get 97% of the database, we have to look at 7-way interactions among genres). But there are some obvious trends:
Jaccard:

row-conditional:

Het/Romance, dominating the pairwise associations. Followed by Introspection/Character Study, and Fluff/Het, and Fluff/Humor. d'aw. Wait, am I 'shipping genres? That just seems too meta.
*These plots were done using the image() function in R, which I used a lot in my thesis work. Today I found an online resource with some code for adding the image color scale at the bottom to image plots. Brilliant!
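For the curious, here is a rough sketch of that kind of plot in base R graphics. It is not the code from the online resource mentioned above; the function name `plot_similarity`, the colors, and the toy matrix are all assumptions made for illustration. The diagonal is set to NA so that image() leaves those cells blank.

```r
# Made-up symmetric similarity matrix for three hypothetical categories.
# The diagonal is NA so that image() leaves those cells uncolored.
sim <- matrix(c(NA,   0.40, 0.00,
                0.40, NA,   0.20,
                0.00, 0.20, NA),
              ncol = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Hypothetical helper: matrix image on top, color scale strip underneath.
plot_similarity <- function(s, ncolors = 20) {
  k    <- nrow(s)
  cols <- heat.colors(ncolors)
  rng  <- range(s, na.rm = TRUE)

  layout(matrix(1:2, nrow = 2), heights = c(4, 1))
  old <- par(mar = c(1, 6, 4, 1))
  on.exit({ par(old); layout(1) })

  # image() puts rows of z along the x axis, so transpose and flip the
  # matrix to show row 1 at the top and column 1 at the left.
  image(1:k, 1:k, t(s[k:1, , drop = FALSE]), col = cols, zlim = rng,
        axes = FALSE, xlab = "", ylab = "")
  axis(3, at = 1:k, labels = colnames(s), las = 2)
  axis(2, at = 1:k, labels = rev(rownames(s)), las = 2)

  # Color scale: a one-row image of evenly spaced values over the same range.
  par(mar = c(3, 6, 1, 1))
  scale_vals <- seq(rng[1], rng[2], length.out = ncolors)
  image(scale_vals, 1, matrix(scale_vals, ncol = 1), col = cols,
        axes = FALSE, xlab = "", ylab = "")
  axis(1)

  invisible(rng)
}

plot_similarity(sim)
```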
no subject
Date: 2013-03-04 12:50 am (UTC)
(They also confirm that I really need to get on with writing more Six fics)
no subject
Date: 2013-03-04 12:56 am (UTC)
no subject
Date: 2013-03-04 05:38 am (UTC)
(I'm writing a series of 50 tweet-length fics for the anniversary or something, and I think I actually just wrote something that nearly fits the description. Go figure)
no subject
Date: 2013-03-04 07:08 am (UTC)
*HUGS*
no subject
Date: 2013-03-04 01:19 pm (UTC)
no subject
Date: 2013-03-04 02:00 pm (UTC)
And as someone who decided (I can't remember why) to read through Teaspoon category by category (well, classic Who!) it's really interesting to see that some of the impressions I got of some Doctors were correct. Other things are less expected, but this is v v interesting. (You can see the slight lightening of Nine crossovers with Eight - because there's a high number of Time War fics in Eight. And One gets more Other/Multi Era fics than Two, showing up the Academy era fics that screw up the First Doctor category, although it's less than I'd actually thought. Because it's not something I tend to want to read, it seems to represent more than it is when looking through One.) And I've been thoroughly frustrated by the amount of Seven fics that are just multi-era fics with pretty much all the Doctors and little Seven - and that shows up. (Well, that does happen with all the Classic Doctors. There still needs to be way more Classic Who fic, really.)
Also Torchwood does not go well with Classic Who, then? :lol:
Anyway, that was interesting. Thank you. I am a bit obsessive about Teaspoon, generally. Probably because of my various attempts to sort through all of it (not read all of it, that would be silly).
no subject
Date: 2013-03-04 11:22 pm (UTC)
Which reminds me that
no subject
Date: 2013-03-05 12:27 pm (UTC)
Destroy Gallifrey and your former selves will never talk to you again. :-)