A while back I was reading through the series of posts on Ayende's blog titled "That No SQL Thing". If you haven't read it, it's worth a look as it's a great introduction to his new project RavenDB. The main site has loads more information, but in a short sentence, RavenDB is a document database for .NET.
This posts isn't going to be an general intro to RavenDB, see Rob Ashtons blog and the Code Project article for that
This series also assumes that you already have the StackOverflow data imported into RavenDB. For information on doing this see this post and the sample ETL application at on GitHub.
As a quick recap, here's what a typical question looks like.
Score: 6 CreationDate: "/Date(1221793769413+0100)/" LastActivityDate: "/Date(1240901507540+0100)/" PostTypeId: 1 AnswerCount: 8 ViewCount: 1068 Title: "Can you Distribute a Ruby on Rails Application without Source?" Body: "I'm wondering if it's possible t....." AcceptedAnswerId: 788676 Tags: "ruby, ruby-on-rails, encoding, obfuscation, distribution" FavoriteCount: 3 Id: 99553 OwnerUserId: 14530
and a typical answer,
Id= 7
PostTypeId= 2 ParentId= 4 CreationDate="2008-07-31T22:17:57.883" Score= 44 ViewCount= 0 Body="..." OwnerUserId= 9 LastEditorUserId= 1 LastEditorDisplayName="" LastEditDate="2010-02-02T19:38:14.493" LastActivityDate="2010-02-02T19:38:14.493" CommentCount = 0
In StackOverflow questions and answers have similar fields and so in RavenDB they are stored in the same document, known as a "post". For instance the document above would be visible at http://localhost:8080/docs/posts/7. It's the field "PostTypeId" that allows you to distinguish between a question (1) and an answer (2).
In RavenDB, you can't perform joins, any "relational" information you need to get has to be obtained from an index. Again there is much more information about indexes on the main Raven site.
The first index we need to create has to contain the field "PostTypeId" so that we can actually get questions and answers seperately, it looks like this:
Map = docs => from doc in docs
select new { PostTypeId = doc.PostTypeId },
Pretty simple and familiar to anyone who's used LINQ before.We can then write a query like the one shown below to get the 1st 10 questions.
A brief explanation of what's going on here
- When we create our query RavenDB starts updating it in the background
- It only stores the field "PostTypeId" (other than some internal fields used by Raven itself) and that's what we can query against
- If we want to query against a number we have to use Hex notation, so 1 becomes "0x00000001", see this thread for more info.
Probably the most common page on StackOverflow shows a question and it's answers, ordered by the number of votes or the order they were written. To be able to get this info me need to create the following index:
Map = docs => from doc in docs
where doc.PostTypeId == 2
select new { ParentId = doc.ParentId }and get a list of all the answers for a particular question.
That's it for part 1, next time we'll look at more complex queries such as "Active" and "Hot" questions.
Great introduction!! I assume that this is your first blog and Good Luck with that.. :)
ReplyDeleteGreat Post and a great idea for a series! I love the "upcoming posts" section.
ReplyDelete@Luke, @Magesh thanks for the kind comments, I had fun doing this.
ReplyDelete