Pages

Wednesday, 19 May 2010

StackOverflow in RavenDB - question and answer basics


A while back I was reading through the series of posts on Ayende's blog titled "That No SQL Thing". If you haven't read it, it's worth a look as it's a great introduction to his new project RavenDB. The main site has loads more information, but in a short sentence, RavenDB is a document database for .NET.

This posts isn't going to be an general intro to RavenDB, see Rob Ashtons blog and the Code Project article for that

This series also assumes that you already have the StackOverflow data imported into RavenDB. For information on doing this see this post and the sample ETL application at on GitHub.

As a quick recap, here's what a typical question looks like.
         Score: 6         
         CreationDate: "/Date(1221793769413+0100)/"
         LastActivityDate: "/Date(1240901507540+0100)/"
         PostTypeId: 1
         AnswerCount: 8
         ViewCount: 1068
         Title: "Can you Distribute a Ruby on Rails Application without Source?"
         Body: "I'm wondering if it's possible t....."
         AcceptedAnswerId: 788676
         Tags: "ruby, ruby-on-rails, encoding, obfuscation, distribution"
         FavoriteCount: 3
         Id: 99553
         OwnerUserId: 14530

and a typical answer,
         Id= 7 
         PostTypeId= 2
         ParentId= 4
         CreationDate="2008-07-31T22:17:57.883"
         Score= 44
         ViewCount= 0
         Body="..."
         OwnerUserId= 9 
         LastEditorUserId= 1
         LastEditorDisplayName=""
         LastEditDate="2010-02-02T19:38:14.493"
         LastActivityDate="2010-02-02T19:38:14.493"
         CommentCount = 0
In StackOverflow questions and answers have similar fields and so in RavenDB they are stored in the same document, known as a "post". For instance the document above would be visible at http://localhost:8080/docs/posts/7. It's the field "PostTypeId" that allows you to distinguish between a question (1) and an answer (2).

In RavenDB, you can't perform joins, any "relational" information you need to get has to be obtained from an index. Again there is much more information about indexes on the main Raven site.

The first index we need to create has to contain the field "PostTypeId" so that we can actually get questions and answers seperately, it looks like this:
  Map = docs => from doc in docs               
    select new { PostTypeId = doc.PostTypeId },   
Pretty simple and familiar to anyone who's used LINQ before.

We can then write a query like the one shown below to get the 1st 10 questions.




A brief explanation of what's going on here
  • When we create our query RavenDB starts updating it in the background
  • It only stores the field "PostTypeId" (other than some internal fields used by Raven itself) and that's what we can query against
  • If we want to query against a number we have to use Hex notation, so 1 becomes "0x00000001", see this thread for more info.
  •  
Getting the answers for a question
Probably the most common page on StackOverflow shows a question and it's answers, ordered by the number of votes or the order they were written. To be able to get this info me need to create the following index:

 Map = docs => from doc in docs                      
   where doc.PostTypeId == 2
   select new { ParentId = doc.ParentId }

Now we can the do a query like so: 



and get a list of all the answers for a particular question.
  


That's it for part 1, next time we'll look at more complex queries such as "Active" and "Hot" questions.