eve11: (writing_muse)
A discussion elsewhere (and a bit from wrisomifu earlier) had me thinking about dialogue words and "said" synonyms. In general, I am with the "go with 'said', it's a non-word" camp on the issue. I don't go around looking for synonyms. I also tend to structure sentences as ["dialogue." Description. "more dialogue."] and variants, without necessarily using any kind of "said" word. This helps me to avoid the dreaded synonym-for-said-that-doesn't-actually-involve-talking difficulty that seems to plague some writers.

I checked a few different stories and am seeing what looks like an interesting Zipf-like distribution in my dialogue word choices. The bulk of them fall into one of four options: no explicit 'said' word, "said", "asked" or "answered". No 'said' word (denoted by '-') and "said" together make up well over 50% of my dialogue choices in the stories I looked at. Have a table of 5 stories:

Story              # Dialogue lines   "said"      "-"         Total
WIP-o-Doom         431                0.2041763   0.3689095   0.5730858
HumpDay Party Mix  104                0.3653846   0.3365385   0.7019231
Sage Advice        46                 0.3913043   0.3695652   0.7608696
Avocado WIP        31                 0.3548387   0.2903226   0.6451613
Still Life         27                 0.3333333   0.3703704   0.7037037


Also I can't believe the WIP-o-Doom is nearly 40K words and has only 431 dialogue passages. It feels like way more.
Because I'm ridiculous I also did a graph of dialogue words for this story:
( under the cut )

The tails of this distribution also include "said [adverb]" phrases. Also, this is a nice way to figure out what kinds of phrases I might be over-using (not counting 'said'; it's a non-word).

I suppose I should actually try to write something tonight instead of mucking with numbers.

ETA: Nope! Another graph. I used The Hump Day Party Mix because it is the longest thing I've finished in a good long while. Graph below:
( Here! )
eve11: (chance)
Wanted to write down a notebook here just to remind myself of some things I was interested in doing.
( rambling )
eve11: (chance)
Guess who is the most well-rounded writer on Teaspoon?

By which I mean the entropy of the distribution of eras in your stories is as close to uniform across all eras as anyone gets in the database. And you have a lot of stories too!
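For anyone wondering what "entropy of the distribution of eras" means in practice, here is a minimal sketch in Python. It is an illustration only, not the code behind the actual analysis: the function names, the era labels, and the two example authors are all made up. The idea is that an author whose stories are spread evenly over the eras gets the maximum possible entropy, and dividing by that maximum gives a score between 0 and 1.

```python
import math
from collections import Counter

def era_entropy(eras):
    """Shannon entropy (in bits) of the empirical distribution of era labels."""
    counts = Counter(eras)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def normalized_entropy(eras, n_eras):
    """Entropy divided by its maximum, log2(n_eras); 1.0 means perfectly uniform."""
    return era_entropy(eras) / math.log2(n_eras)

# Hypothetical authors (one era label per story), assuming only 4 eras exist.
even_author = ["First", "Fourth", "Tenth", "Eleventh"] * 5
lopsided_author = ["Tenth"] * 17 + ["First", "Fourth", "Eleventh"]

print(normalized_entropy(even_author, 4))
print(normalized_entropy(lopsided_author, 4))
```

The even author scores at the top of the scale and the lopsided one well below it, even though both have 20 stories. A story tagged with several eras would need a decision about how to count it, which this sketch ignores.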

I shall have a true post up at some point but am off to a picnic now so pictures will have to wait.
eve11: (dw_something_new_TARDIS)
Comparing the ratings I got for Series 5 from the BARB website with the ratings I got from Wikipedia, I am coming up with differences due to the effects of "BBC HD".

( Data Below Cut )

All of the values match except for series 5. I looked at the source from the wiki, and it cites back to the BARB site. But the discrepancy is that the bigger number from the website is obtained from the sum of "BBC1" and "BBC HD" ratings. Now I'm wondering: (1) when did BBC HD launch? (2) when did DW start getting broadcast in HD and (3) did they eventually consolidate the numbers back again? Or am I going to find BBC HD simulcast numbers for series 6 and 7 as well?

ETA: I just checked for 2011 and it only has a Tuesday repeat and the BBC1 airing. So I'm guessing then it was just split for series 5.

EATA: For anyone interested in munging around with this data, here are the two files I've made:

https://dl.dropboxusercontent.com/u/71925398/format-DWratings.txt

https://dl.dropboxusercontent.com/u/71925398/format-ratings.txt

First is data on the 88 episodes from series 1 through series 7 "Journey to the Centre of the TARDIS", with some covariates.

Second is the top 30 shows for each of the big channels for that week, based on the BARB top 30 numbers. Both files are pipe ("|") delimited. Date formats should match up.
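If you want to read pipe-delimited files like these in Python, something along these lines should work. This is a sketch under assumptions: the column names (`date`, `title`, `rating`) and the two sample rows are placeholders I made up so the example runs on its own, and the real files' headers may well differ.

```python
import csv
import io

# Made-up stand-in for the contents of a pipe-delimited file.
# Column names and values are placeholders, not the real data.
sample = io.StringIO(
    "date|title|rating\n"
    "2010-01-02|Episode A|1.5\n"
    "2010-01-09|Episode B|2.5\n"
)

def read_pipe_delimited(handle):
    """Return a list of dicts, one per row, keyed by the header line."""
    return list(csv.DictReader(handle, delimiter="|"))

rows = read_pipe_delimited(sample)
for row in rows:
    print(row["date"], row["title"], float(row["rating"]))
```

For a downloaded file you would pass an open file handle instead of `sample`, for example `open("format-DWratings.txt", newline="")`. Every value comes back as a string, so numeric columns need converting, and matching the two files up would be a matter of comparing their date columns.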

You can apparently get all of the channels by top 10 list too. I didn't do that; I only took the channels covered by the BARB "Top 30" lists, which were BBC1, BBC2, ITV, Channel 4 and Channel 5, every week. But starting in 2008 the site also keeps track of the top 30 shows from "Other" channels, so you can bound the ratings of competitors that aren't on this list by the lowest rating on that list.
eve11: (dw_eleven_books_specs)
I realize I haven't been linking to my original images with my plots. Will fix that from now on. You should be able to click and hopefully eventually get to the original image, which is usually bigger and better resolution than what's on the blog.

I realized one way to look at the prevalence of eras is to look at the character list and see how many stories list different characters. These are cross-listed of course, but you can at least see how many stories each character appears in, marginally. Now this is just the list that the authors put in their main tags, so it's a set of characters the authors felt were worth tagging in the stories.
( Top characters )
eve11: (chance)
In my previous post on categories and genres, I mentioned that it's a bit tricky to interpret the boxplots by category and genre because of the multiplicity of multi-categorized or multi-labeled stories. I theorized that at least part of the uniformity in reviews by era seen for the Classic era may be due to the fact that stories tend to overlap in eras more so for Classic than for New Who. Let us examine this in a bit more detail.

( tables and pictures under the cut )
eve11: (rationalreal)
Still churning through the Teaspoon data. I made a graph of number of reviews vs. word count for a set of categories of word count.
( Below the cut! )
I want to divide those review counts by some measure of time in order to account for stories that have had a longer time to accrue reviews. I could use the date of the newest review minus the publish date; I could use the average of that same value, or likely a higher percentile like the 75th or 85th.
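To make the percentile idea concrete, here is a rough sketch in Python of one way it could go. This is an assumption about how the normalization might look, not something I have settled on: the function names are hypothetical and the example story is invented. The exposure time is the number of days from publication to the chosen percentile of review dates, and the rate is the review count divided by that.

```python
from datetime import date
from statistics import quantiles

def exposure_days(publish_date, review_dates, pct=75):
    """Days from publication to the pct-th percentile review (at least 1)."""
    ages = sorted((d - publish_date).days for d in review_dates)
    if len(ages) < 2:
        # quantiles() wants at least two points; fall back to the single age.
        return max(ages[0], 1) if ages else 1
    cut = quantiles(ages, n=100, method="inclusive")[pct - 1]
    return max(cut, 1)

def review_rate(publish_date, review_dates, pct=75):
    """Reviews per day of exposure, using the percentile-based exposure."""
    return len(review_dates) / exposure_days(publish_date, review_dates, pct)

# Invented story: four reviews in the first few days, one straggler much later.
published = date(2013, 1, 1)
reviewed = [date(2013, 1, 1), date(2013, 1, 2), date(2013, 1, 3),
            date(2013, 1, 4), date(2014, 2, 5)]

print(exposure_days(published, reviewed))
print(review_rate(published, reviewed))
```

With the newest review as the denominator, the single straggler would stretch the exposure to 400 days and make the story look nearly unreviewed. The 75th percentile ignores it and gives an exposure of 3 days, which is the reason for preferring a percentile over the maximum.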

( Math/stat modeling discussion below! )

I am still working through how best to categorize multi-era fics in terms of Classic vs. New eras. This question/difficulty of classification has now come up in both Teaspoon and in the FF.net data. In FF.net I went through the character lists and gave each an associated era (with indicators for "more info needed" for characters like Rose and Sarah Jane who span several eras). I have to go through and normalize the characters for the Teaspoon data, which also includes having to correct for HTML errors/truncations when the character lists in the story blurbs got too long.

ETA: Ah, forgot about cooking! My co-worker (a fellow mathematician and baker) has suggested that we celebrate Pi day (3/14) with pies. I am thinking I would like to try a gluten-free version of something like this Mexican chocolate silk pie. It seems mostly gluten free already and the crust is an easy substitute of gf ginger snaps instead of graham crackers. But why go through all that trouble to make a from-scratch chocolate pudding and then top it with Cool Whip? Might as well make the whipped cream/buttercream topping from scratch at that point too!
eve11: (chance)
I has it.




I am still navigating it. At the outset, I was curious about reviews. 87% of the stories on Teaspoon have at least one review (29732 out of 34145 stories). In comparison, 90% of the DW stories on FF.net have at least one review (40590 out of 44917).

There are 235262 reviews on Teaspoon (likely more now, as the scraping is a few hours old), and 550124 reviews on DW FF.net.

The 5-number summary of the number of reviews per story:
DW FF.net:
summary(story$reviews)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00    4.00   12.25   10.00 1710.00 

Teaspoon:
summary(story$reviews)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   0.000    1.000    3.000    6.888    6.000 4360.000* 
On the whole it looks like Teaspooners may be stingier, but I think also that FF.net may have a higher probability of readers leaving negative reviews than Teaspoon. I'd have to take a sample and read them to find out.
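The summaries above are R's `summary()` output, which reports the mean alongside the usual five numbers. For anyone without R, a rough Python equivalent is sketched below. The function name and the review counts are made up for illustration; the `"inclusive"` quantile method is used because it interpolates the same way as R's default quantile type.

```python
from statistics import mean, median, quantiles

def r_style_summary(values):
    """Min, quartiles, median, mean and max, laid out like R's summary()."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median(values),
        "Mean": mean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }

# Invented review counts with one heavily reviewed story in the tail.
reviews = [0, 1, 1, 2, 3, 3, 4, 6, 10, 40]

for label, value in r_style_summary(reviews).items():
    print(f"{label:>8} {value}")
```

Even in this toy example the mean lands above the third quartile, which is the same skew visible in both real summaries: a handful of very heavily reviewed stories pull the mean far above the median.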

One thing that I have for Teaspoon and not for FF.net is the dates of the reviews. So I can look at trends in when reviews are given vs. publish dates. The 5-number summary of the number of days between the initial publish date and the date of a given review is:
summary(reviews$afterpub)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     1.0    14.0   136.4   116.0  3451.0
So this is saying that 50% of the reviews that are in the database were given within two weeks of the initial publish date of the story. There is a big tail, but this also accounts for multi-chapter stories that are being updated in progress. I don't have the update times of the individual chapters.

But I also looked at the number of reviews vs last update time. Now, last update time can be misleading because authors tweak their stories without necessarily adding content (ETA: actually, I'm not sure now what an update means for a 1-chapter story. I've edited my stories quite a bit but the update dates seem to be static even if the edits happened months or years later. Mods? How does update vs. edit work?**). For example, 5085 stories have update dates not equal to publish dates, and only one chapter. On the other hand, at least for 1-chapter stories, 96% of the tweaks happen within 3 days of publication. So anyway, I'm looking at the 5-number summary of the number of days between the last update time and each review, for all reviews dated on or after the last update time:
summary(reviews$afterupdate[reviews$afterupdate >= 0])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     1.0   160.2   134.0  3451.0 
This is a weird distribution. 54% of these post-update reviews are given either 0, 1, or 2 days post update.

The true distribution I'm searching for, the distribution of reviews following the addition of a new chapter, is likely somewhere between those extremes.

No pictures yet; it's past my bedtime and I am meticulous about graphs.

*Yes, you saw that maximum correctly. There is a story (unslinky's 3-million word tome "Terminal Decay") that has more reviews than the number of words in most of the stories that most people write (!). Literally, 80% of the stories on Teaspoon are 4500 words or less.

**"Updated" is the last time updated, e.g., sent through the modding queue. The update date does not change with story edits. The outliers for the 1-chapter stories seem to have at one point had other chapters or notes as chapters to "bump" them up in the archive, that got either consolidated into one chapter or deleted. Thanks ghost2 and abates for the clarification.
eve11: (dw_something_new_TARDIS)
From my friend who is currently scraping info from Teaspoon:




whofic top 10 overused titles (case sensitive):
sqlite> select count(title) as c, title from story group by title order by c desc limit 10;
27|Home
25|Alone
23|Aftermath
22|Moving On
21|Memories
19|Lost
18|Choices
18|Consequences
18|Forever
18|Hope
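If you want to try this kind of query yourself, here is a small self-contained sketch using Python's built-in `sqlite3` module. The `story` table and `title` column match the query above, but the in-memory database and the handful of titles are invented for the example; the real schema surely has more columns.

```python
import sqlite3

# Toy in-memory database standing in for the scraped one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE story (id INTEGER PRIMARY KEY, title TEXT)")
titles = ["Home", "Home", "home", "Alone", "Alone", "Home", "Lost"]
conn.executemany("INSERT INTO story (title) VALUES (?)", [(t,) for t in titles])

# Same shape as the query above, with a tie-break on title added
# so that the output order is deterministic.
query = """
    SELECT COUNT(title) AS c, title
    FROM story
    GROUP BY title
    ORDER BY c DESC, title
    LIMIT 10
"""
top_titles = conn.execute(query).fetchall()
for count, title in top_titles:
    print(f"{count}|{title}")
conn.close()
```

SQLite compares text case-sensitively by default, so "Home" and "home" are counted as separate titles here, which is what "case sensitive" in the heading refers to. Grouping by `lower(title)` instead would merge them.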


I checked; 13 of the 18 labeled "Forever" are Ten and Rose, at least 10 of those seem to be post-Doomsday fics. Ah, fandom. Never change.

Out of curiosity, I checked my stories. There are:

- 6 stories called "Closure", but I think only mine is referencing the mathematical concept (e.g., the limits of all convergent sequences in the set are also members of the set)

- 3 stories labeled "Don't Wander Off": mine (Eleven and the TARDIS), a Ten/Rose PWP BDSM fic (!) and a Nine, Jack and Rose story.

- 3 stories called "Fair Trade" (mine, another Donna story, and a Nine, Jack and Rose story)

- 2 stories called "Resonance" (mine and a story where Ten and Martha meet up with Liz Shaw in 1969)

- 1 other story that also includes "What is Essential" (which is the worst title I've ever thought up, bleh), a Jack/Ianto 450K tome that uses the whole phrase ("What is essential is invisible to the eye") as the title.

So in the aftermath of that, as I sit here with my memories alone in my office, wondering how much productivity I've now lost forever, it appears that the consequences of my title choices are at least mostly unique, I hope. Right, then. Moving on...


ETA: for comparison, here are the top 10 DW titles for ff.net:
51|Forever
42|Alone
39|Memories
37|Nightmares
36|Home
35|Goodbye
33|Run
31|Running
29|Gone
26|Bad Wolf
So, perhaps unsurprisingly, there are (proportionately) more nightmares and running on FF.net, and less hope.

Another Edit: Would you like a bar plot? ( Here is a barplot of the top 50 DW titles on FF.net )
eve11: (chance)
I wanted to write up a post on Eras and Characters in the ff.net database but I ran out of steam. I have to decide how pedantic I want to be about the stats, and how much information I want to impute before writing about it. So instead, here is a preliminary blurb on the relationship between the # of chapters, reviews, favorites and follows in the stories.

ETA: Updated with some info on Reviews and Completions

( Summary stats and a pairs plot )

( Is there a relationship between Reviews and Completions? )
eve11: (dw_note_to_self)
Note to self: When randomly selecting a series of stories to look up on ff.net, arrange them in random order as well, so if 2/3 of the way through you decide that 200 is a plenty big enough sample size instead of 300, you can just stop, as opposed to having to finish looking up the rest.
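A minimal sketch of that note to self in Python, with hypothetical story ids and a made-up helper name. `random.sample` returns its picks in selection order rather than sorted, so any prefix of the list is itself a valid random sample; the trick is simply not to sort the list before working through it.

```python
import random

def sample_in_random_order(story_ids, k, seed=None):
    """Draw k ids without replacement, returned in random (selection) order."""
    rng = random.Random(seed)
    return rng.sample(story_ids, k)

# Hypothetical ids for the stories that list no characters.
unlisted_ids = list(range(1, 11001))

lookup_order = sample_in_random_order(unlisted_ids, 300, seed=42)

# Stopping early is fine: the first 200 are still a random sample.
first_200 = lookup_order[:200]
print(len(lookup_order), len(first_200))
```

Working through the sample chronologically or by id, on the other hand, means the first 200 looked up are the oldest 200 of the 300, which is not a random subset.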

I am trying to figure out if stories that don't bother listing characters, have basically the same distribution of characters as those that do list characters. There are over 11,000 stories that don't list any characters (about 25% of all of the stories in the db). So I randomly selected 300 of these and am using either the description or a skimming of the story to determine the characters. Then I will compare distributions. Of course my data provider has all of the characters listed numerically with the names in a lookup table. I've been going chronologically--came across a few stories I recognized: Astrogirl's story about the end of the world and the Doctor drinking tea, Calapine's story about Rose perhaps not wanting to find out about more companions, "Watson's Ghost" by Camilla Sandman... and oh my goodness a bunch of 'spork my eyes out' stories too. So, you might be able to guess, I've now so far got several numeric character sets memorized:

1267
1267,1279
1267,1279,590
490,1267
490,590
490,665
( answers )

Also, it is kind of frightening how many stories I honestly can't tell which Doctor they're writing about.

ETA: 222 into my chronologically ordered sample, I've got my first 1267,820 femmeslash story! (That would be Rose/Donna). Woot, thanks NetgirlY2K ;D
eve11: (find_x)
Exploring around the Doctor Who fanfic.net database that my friend sent me. Interesting stuff. The most popular story on there in terms of raw favorites is FrostFyre's "Man with No Name" a Tenth Doctor/Firefly crossover, 106K words in 32 chapters, with 1603 favorites and 879 reviews, published over 6 months in the 2nd half of 2007. 2nd most popular in terms of favorites is "That Which Holds the Image" by TheAngelsHaveThePhoneBox, a recent HP/DW crossover wherein Harry summons a weeping angel boggart. 9 chapters, 40K words, recently completed after 1.5 years. 1309 favorites, 794 reviews.

I should probably look at straight DW stories as opposed to crossovers. Luckily, my intrepid data gathering partner has given me that variable.

One graph so far: as per the interests of my data gathering partner

( percent completed vs. publish date. )
eve11: (find_x)
And encourage my fandom nerdiness. A friend and coworker (a software engineer) stopped by my office today asking me a statistics question. "It's not work-related" he told me. "I have a database with three variables for each record and I want to know how they are related."


Me: Numeric?
Him: Yeah
Me: Are they continuous variables? Or discrete?
Him: Okay, I will just tell you. They're follows, hits, and review counts for fanfiction stories.
Me: OMG can you actually scrape me that data???
Him: Yeah, I have a script that feeds a SQLite database and can grab everything on fanfic.net within a category. It gets all of the info from the title blurbs. It fills in 0s for when favorites and reviews don't exist.
Me: That is awesome. Can you scrape all of Doctor Who for me?

We know each other well enough to know we both intersect with the fanfic world. He is an avid reader of long anime-related stories and apparently his profile on ff.net has "A list of common grammatical errors." :D I have often wanted to collect the data but my script-fu is lacking. I always stop at the "build a script to scrape the websites" step. He has the data and is stumbling at the "make pretty pictures and describe relationships" step. He's curious about relationships between follows, favorites and time with respect to completion and popularity, etc.

It could be the start of a beautiful friendship!

He also says he can probably scrape all of Teaspoon for me too, as I showed him the format and he said, "Oh yeah, I'm familiar with that format". Also cool. What would be a little more time consuming would be to scrape everyone's favorites and match them up so we can look at it like a big graph.

A colleague alerted me to this web-based JavaScript viz library called D3: http://anna.ps/talks/fel/ Might be interesting to see if we can figure out who, empirically, is the "BNF" and to see who the cross-era people are or how insular different niches are, etc.

Page generated Aug. 19th, 2017 01:37 am