Mining Blogs: Using R

As I said here (http://www.andrerestivo.com/weblog/archives/001836.html),
I decided to give the word burst idea I saw in New Scientist a try.
I’m going to use R (http://www.R-project.org) to build a prototype, just to
test the concept.

R is a language for statistical computing and is both powerful and
confusing. At the moment I am able to analyze the word frequencies of a
single weblog using the following code:

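library(XML)  # provides htmlTreeParse

# Parse the page and flatten the HTML tree into a named character vector
hp <- htmlTreeParse('http://www.scripting.com/')
html.elem <- unlist(hp$children$html$children)
# Keep only the text nodes and join them into one string
text <- html.elem[which(regexpr("text.value",names(html.elem)) > 0)]
names(text) <- NULL
text2 <- paste(text,collapse=" ")
# Split into lower-case words and strip punctuation
wrds <- strsplit(text2," ")
wrds <- sapply(wrds,tolower)
wrds <- gsub("[,.!?;:]","",wrds)
# Count each word and list the counts in ascending order
f.wrds <- factor(wrds)
freqs <- table(f.wrds)
sort(freqs)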

Applied to Scripting News, this gives the following table of the most
used words:

  • google 22
  • have 23
  • my 24
  • was 25
  • with 29
  • be 32
  • on 43
  • it 44
  • for 46
  • is 50
  • in 56
  • that 65
  • and 71
  • of 72
  • i 82
  • a 105
  • to 130
  • the 177

Now it’s just a question of joining several important blogs
(http://www.technorati.com/cosmos/top100.html), doing the same type of
frequency analysis and comparing the results from one day to another.
This way we can see which words have big bursts and which have stopped
being in the spotlight.
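
The day-to-day comparison could be sketched roughly like this. This is only a guess at one way to do it, not a finished method: the helper names word_freqs and word_bursts are hypothetical, the "burst" score (change in a word's share of all words between two days) is an assumption of mine, and the two sample sentences are made up so the code runs without fetching any page.

```r
# Hypothetical helper: count lower-cased words in a character string,
# using the same punctuation stripping as the code above.
word_freqs <- function(txt) {
  wrds <- tolower(unlist(strsplit(txt, "\\s+")))
  wrds <- gsub("[,.!?;:]", "", wrds)
  table(wrds[wrds != ""])
}

# Hypothetical burst score: change in each word's relative frequency
# between an older and a newer frequency table.
word_bursts <- function(freq_old, freq_new) {
  all_words <- union(names(freq_old), names(freq_new))
  # Look up each word's count, treating words absent from a table as zero.
  count_of <- function(freqs, words) {
    counts <- as.vector(freqs)[match(words, names(freqs))]
    counts[is.na(counts)] <- 0
    counts
  }
  old_share <- count_of(freq_old, all_words) / sum(freq_old)
  new_share <- count_of(freq_new, all_words) / sum(freq_new)
  change <- new_share - old_share
  names(change) <- all_words
  sort(change, decreasing = TRUE)
}

# Made-up sample text standing in for two days of blog posts.
day1 <- word_freqs("the war on spam is the topic of the day")
day2 <- word_freqs("google buys blogger and the web talks about google")
bursts <- word_bursts(day1, day2)
head(bursts, 3)  # words gaining the most share
tail(bursts, 3)  # words losing the most share
```

Using shares instead of raw counts keeps a day with more posts from looking like a burst in every word at once.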

This entry was posted on Monday, February 24th, 2003 at 12:46 am and is filed under Uncategorized.

1 Comment

  1. #!/sablog : Shanti Braford's daily links says:

    Trackback: Word Bursts…: Daily Bytes:
    Mining Blogs: Using R. Andre Restivo is using this software to work
    on the word bursts problem mentioned in the New Scientist article a
    little while back. Good luck Andre!

