Schema Design for MongoDB

Schema design in MongoDB is very different than schema design in a relational DBMS. However it is still very important and the first step towards building an application.

In relational data models, conceptual there is a "correct" design for a given entity relationship model independent of the use case. This is typically a third normal form normalization. One typically only diverges from this for performance reasons. In MongoDB, the schema design is not only a function of the data to be modeled but also of the use case. The schema design is optimized for our most common use case. This has pros and cons – that use case is then typically highly performant; however there is a bias in the schema which may make certain ad hoc queries a little less elegant than in the relational schema.

As we design the schema, the questions we must answer are:

1. When do we embed data versus linking (see below)? Our decisions here imply the answer to question #2:

2. How many collections do we have, and what are they?

3. When do we need atomic operations? These operations can be performed within the scope of a BSON document, but not across documents.

4. What indexes will we create to make query and updates fast?

5. How will we shard? What is the shard key?

Embedding and Linking

A key question when designing a MongoDB schema is when to embed and when to link. Embedding is the nesting of objects and arrays inside a BSON document. Links are references between documents.

There are no joins in MongoDB – distributed joins would be difficult on a 1,000 server cluster. Embedding is a bit like "prejoined" data. Operations within a document are easy for the server to handle; these operations can be fairly rich. Links in contrast must be processed client-side by the application; the application does this by issuing a follow-up query.

Generally, for "contains" relationships between entities, embedding should be be chosen. Use linking when not using linking would result in duplication of data.

Collections

Collections in MongoDB are analogous to tables in a relational database. Each collection contains documents. As mentioned above these documents can be fairly rich.

There is no explicit declaration of the fields within the documents of a collection. However there is a conceptual design from the schema designer of what those fields will be and how the documents in the collection will be structured. MongoDB does not require documents in a collection to have the same structure. However, in practice, most collections are highly homogenous. We can move away from this when we need to though; for example when adding a new field. In a situation such as that, an "alter table" style operation is not necessary.

Atomic Operations

Some problems require the ability to perform atomic operations. For example, simply incrementing a counter is often a case where one wants atomicity. MongoDB can also perform more complex operations such as that shown in the pseudocode below:

atomically { 
    if( doc.credits > 5 ) { 
        doc.credits -= 5; 
        doc.debits += 5; 
    } 
}

Another example would be a user registration scenario. We would never want to users to register the same username simultaneously:

atomically { 
    if( exists a document with username='jane' ) { 
        print "username already in use please choose another"; 
    } else { 
        insert a document with username='jane' in the users collection; 
        print("thanks you have registered as user jane."); 
    } 
}

The key aspect here in terms of schema design is that our scope of atomicity / ACID properties is the document. Thus we need to assure that all fields relevant to the atomic operation are in the same document.

Indexes

MongoDB supports the declaration of indexes. Indexes in MongoDB are highly analogous to indexes in a relational database : they are needed for efficient query processing, and must be explicitly declared. Thus we need to think about what indexes we will define as part of the schema design process. Just like in a relational database, indexes can be added later – if we decide later to have a new one, we can do that.

Sharding

Another consideration for schema design is sharding. A BSON document (which may have significant amounts of embedding) resides on one and only one shard.

A collection may be sharded. When sharded, the collection has a shard key, which determines how the collection is partitioned among shards. Typically (but not always) queries on a sharded collection involve the shard key as part of the query expression.

The key here is that changing shard keys is difficult. You will want to choose the right key from the start.

Example

Let's consider an example, which is a content management system. The examples below use mongo shell syntax but could also be done in any programming language – just use the appropriate driver for that language.

Our content management system will have posts. Post have authors. We'd like to support commenting and voting on posts. We'd also like posts to be taggable for searching.

One good schema design for this scenario would be to have two MongoDB collections, one called posts and one called users. This is what we will use for the example.

Our users have a few properties - a user id they registered with, their real name, and their karma. For example we could invoke:

> db.users.insert( { _id : "alex", name: { first:"Alex", last:"Benisson" }, karma : 1.0 } )

The _id field is always present in MongoDB, and is automically indexed with a unique key constraint. That is perfect for our usernames so we store them in the _id field. We don't have to though; we could instead make a username field and let the system automically generate a unique id.

Let's now consider posts. We'll assume some posts are already populated. Let's query one:

> db.posts.findOne() { 
    _id : ObjectId("4e77bb3b8a3e000000004f7a"), 
    when : Date("2011-09-19T02:10:11.3Z", 
           author : "alex", 
           title : "No Free Lunch", 
           text : "This is the text of the post. It could be very long.", 
           tags : [ "business", "ramblings" ], 
           votes : 5, 
           voters : [ "jane", "joe", "spencer", "phyllis", "li" ], 
           comments : [ { 
               who : "jane", 
               when : Date("2011-09-19T04:00:10.112Z"), 
               comment : "I agree." 
               },{ 
               who : "meghan", 
               when : Date("2011-09-20T14:36:06.958Z"),     
               comment : "You must be joking. etc etc ..." 
               } 
           ] 
    }

It's interesting to contrast this with how we might design the same schema in a relation database. We would likely have a users collection and a posts collection. But in addition one would typically have a tags collection, a voters collection, and a comments collection. Grabbing all the information on a single post would then be a little involved. Here to grab the whole post we might execute:

> db.posts.findOne( { _id : ObjectId("4e77bb3b8a3e000000004f7a") } );

To get all posts written by alex:

> db.posts.find( { author : "alex" } )

If the above is a common query we would create an index on the author field:

> db.posts.ensureIndex( { author : 1 } )

The post documents can be fairly large. To get just the titles of the posts by alex :

> db.posts.find( { author : "alex" }, { title : 1 } )

We may want to search for posts by tag:

> // make and index of all tags so that the query is fast: 
> db.posts.ensureIndex( { tags : 1 } ) 
> db.posts.find( { tags : "business" } )

What if we want to find all posts commented on by meghan?

> db.posts.find( { comments.who : "meghan" } )

Let's index that to make it fast:

> db.posts.ensureIndex( { "comments.who" : 1 } )

We track voters above so that no one can vote more than once. Suppose calvin wants to vote for the example post above. The following update operation will record calvin's vote. Because of the $nin sub-expression, if calvin has already voted, the update will have no effect.

> db.posts.update( 
{ 
    _id : ObjectId("4e77bb3b8a3e000000004f7a"), 
    voters : 
    { 
        $nin : "calvin" 
    } 
}, 
{ 
    votes : 
    { 
        $inc : 1 
    }, 
    voters : 
    { 
        $push : "calvin" 
    } 
);

Note the above operation is atomic : if multiple users vote simultaneously, no votes would be lost.

Suppose we want to display the title of the latest post in the system as well as the full user name for the author. This is a case where we must use client-side linking:

> var post = db.posts.find().sort( { when : -1 } ).limit(1); 
> var user = db.users.find( { _id : post.author } ); 
> print( post.title + " " + user.name.first + " " + user.name.last );

A final question we might ask about our example is how we would shard. If the users collection is small, we would not need to shard it at all. If posts is huge, we would shard it. We would need to choose a shard key. The key should be chosen based on the queries that will be common. We want those queries to involve the shard key.

Sharding by _id is one option here.
If finding the most recent posts is a very frequent query, we would then shard on the when field. (There is also an optimization trickwhich might work here.)

Summary of Best Practices

"First class" objects, that are at top level, typically have their own collection.
Line item detail objects typically are embedded.
Objects which follow an object modelling "contains" relationship should generally be embedded.
Many to many relationships are generally done by linking.
Collections with only a few objects may safely exist as separate collections, as the whole collection is quickly cached in application server memory.
Embedded objects are a bit harder to link to than "top level" objects in collections.
It is more difficult to get a system-level view for embedded objects. When needed an operation of this sort is performed by using MongoDB's map/reduce facility.
If the amount of data to embed is huge (many megabytes), you may reach the limit on size of a single object. See also GridFS.
If performance is an issue, embed.

More Details

Choosing Indexes

A second aspect of schema design is index selection. As a general rule, where you want an index in a relational database, you want an index in Mongo.

The _id field is automatically indexed.
Fields upon which keys are looked up should be indexed.
Sort fields generally should be indexed.

The MongoDB profiling facility provides useful information for where an index should be added that is missing.

Note that adding an index slows writes to a collection, but not reads. Use lots of indexes for collections with a high read : write ratio (assuming one does not mind the storage overage). For collections with more writes than reads, indexes are expensive as keys must be added to each index for each insert.

How Many Collections?

As Mongo collections are polymorphic, one could have a collection objects and put everything in it! This approach is taken by some object databases. This is not recommended in MongoDB for several reasons, mainly performance. Data within a single collection is roughly contiguous on disk. Thus, "table scans" of a collection are possible, and efficient. Just like in relational dbs, independent collections are very important for high throughput batch processing.

Chan Chen Coding...

导航

统计

文章分类

文章档案

最新评论