"whatcha talking 'bout?" version 4!

OK, this is pretty much the definitive version of my attempt to analyse the latest hot topics from library/librarian blogs…

hotstuff
http://161.112.232.18/hotstuff.php

The page is updated approximately every 15 minutes and uses the following methodology…

1) posts older than 48 hours are analysed and the frequency with which every unique word appears is calculated

2) posts from the last 48 hours are analysed in the same way and the word frequency is compared to the older posts

3) when a word has become noticeably more frequently used in the last 48 hours, it’ll appear in the word cloud — the bigger the increase in frequency, the larger the size in the cloud

4) if a word appears in multiple new blog posts then the shading is darker

5) if a word appears lots of times, but only in a small number of new posts, then the shading will be lighter
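For anyone who wants to play with the idea, here is a minimal Python sketch of the five steps above. It is not the code behind the page: the function names, the `min_ratio` cut-off, the `floor` value used for words never seen in older posts, and the sample posts are all made up for illustration. It also assumes "frequency" means a word's share of all the words in a set of posts, which the steps don't spell out.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def relative_frequencies(posts):
    # Share of all words in `posts` taken up by each unique word (steps 1 and 2).
    counts = Counter(word for post in posts for word in tokenize(post))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def hot_words(old_posts, new_posts, min_ratio=2.0, floor=1e-4):
    # min_ratio and floor are hypothetical tuning values, not taken from the post.
    old_freq = relative_frequencies(old_posts)
    new_freq = relative_frequencies(new_posts)
    # Number of distinct new posts each word appears in (steps 4 and 5).
    spread = Counter(word for post in new_posts for word in set(tokenize(post)))
    cloud = {}
    for word, freq in new_freq.items():
        # Step 3: how much more frequent is the word in the new posts?
        ratio = freq / max(old_freq.get(word, 0.0), floor)
        if ratio >= min_ratio:
            cloud[word] = {"size": ratio, "shade": spread[word]}
    return cloud


# Invented sample posts, purely for illustration.
old_posts = [
    "the library is open",
    "the library has new books",
    "the catalogue is in the library",
]
new_posts = [
    "proquest sale announced at the library",
    "the proquest sale surprised the library",
    "disallow disallow disallow in the robots file",
]

cloud = hot_words(old_posts, new_posts)
for word, info in sorted(cloud.items(), key=lambda kv: -kv[1]["size"]):
    print(word, round(info["size"], 1), info["shade"])
```

In this sketch "size" is the raw increase in frequency and "shade" is simply the number of new posts containing the word; a real cloud would need to map both onto font sizes and colours.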

So, in the last 24 hours, the usage of “2007” has increased substantially.  In new posts the word has a frequency of 28%, but only 7% in older posts.

Several bloggers have picked up on the sale of ProQuest — if other bloggers talk about it today, then it will grow in size and be shaded darker.

The usage of the word “disallow” has also increased, but it only appears in a single blog post (by the Baby Boomer Librarian) and is therefore shaded lightly.

Unlike the previous versions, this one doesn’t require a stop word list — words like “library” and “the” tend to have a high frequency of usage in both old and new posts, so the relative difference in usage is usually too small for the words to appear in the cloud.
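To see why common words drop out by themselves, here is a tiny Python illustration. The figures for “2007” are the ones quoted above; the figures for the common word, the `THRESHOLD` value and the function names are invented for the example and are not taken from the real code.

```python
THRESHOLD = 2.0  # hypothetical cut-off for "noticeably more frequent"


def relative_increase(new_freq, old_freq):
    # How many times more frequent a word is in new posts than in older ones.
    return new_freq / old_freq


def makes_the_cloud(new_freq, old_freq, threshold=THRESHOLD):
    return relative_increase(new_freq, old_freq) >= threshold


# "2007": 28% in new posts versus 7% in older posts (figures from the post).
print(makes_the_cloud(0.28, 0.07))

# A very common word: made-up figures, nearly identical in old and new posts.
print(makes_the_cloud(0.95, 0.93))
```

The first word is roughly four times more frequent than before, so it clears the cut-off; the common word barely moves, so it never reaches the cloud and no stop word list is needed.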

The other cool thing is that this version uses the “network effect”: the more posts it has to work with, the better the cloud becomes at delivering the latest hot topics.  For example, Stephen Abram’s RSS feed is currently delivering posts from the last 3 days and he usually ends them with “Stephen”, so he’s currently making a strong appearance in the cloud.  However, over time, the number of older archived posts containing the word will increase, which means he’ll no longer (relatively speaking) be a hot topic in the cloud… although not in real life, of course!


 

14 thoughts on “"whatcha talking 'bout?" version 4!”

  1. I’ve spent a couple of hours sifting through dozens of blogrolls and the code is now picking up feeds from around 450 library-related blogs.

    Given the increase in blogs, I’ve changed the page so that it updates every 30 minutes.

  2. I spotted this morning that the aggregator was choking on some of the Atom feeds, so I’ve fixed that. It does mean that there’s been a huge influx of new posts and it’ll take a few days for some of the more common words (like “it’s” and “new”) to drop in frequency.

  3. Whilst we wait for the archive of older posts to mature, I’ve tweaked the cloud to show words that rarely appear in the archived posts in green — these are potentially new buzzwords and new topics of conversation.

    Also, given the large number of feeds I’m pulling in, I’ve dropped to only updating the cloud every hour. This is partly because the physical server the code is running on is already overdue for retirement, but I’m in the middle of prep’ing a replacement.

  4. A little late to notice it, but I just noticed: Walt at Random isn’t in your master list. I may not be a librarian, but I’ll assert that W.a.R. is part of “libraryland.” Your call, of course.
