Cloud computing is moving from being an “IT buzzword” to a reasonable and reliable way of deploying applications on the Internet. IT managers are considering deploying some of their applications in the cloud. A cloud-related trend that developers have been paying attention to is the idea of “NoSQL”, a set of operational-data technologies based on non-relational concepts. “NoSQL” represents a sea change: it invites us to consider data storage options beyond the traditional SQL-based relational database.
Accordingly, a new set of open source distributed databases is cropping up to leverage the facilities and services provided by the cloud architecture. Web applications and databases in the cloud are undergoing major architectural changes to take advantage of the scalability the cloud provides. This article is intended to provide insight into NoSQL in the context of cloud computing.
Face off ~ SQL, NOSQL & Cloud Computing
A key disadvantage of SQL databases is that they operate at a high level of abstraction. To execute a single statement, a SQL database often has to process the data multiple times, which costs time and hurts performance. For instance, a ‘Join’ operation requires several passes over the data. Cloud computing environments need high-performing and highly scalable databases.
NoSQL Databases are built without relations. But is it really that “good” to go for NoSQL Databases? A world without relations, no joins and pure scalability! NoSQL databases typically emphasize horizontal scalability via partitioning, putting them in a good position to leverage the elastic provisioning capabilities of the cloud.
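To make the idea of horizontal scalability via partitioning concrete, here is a minimal sketch of hash partitioning. The names `shard_for` and `PartitionedStore` are hypothetical and the "shards" are plain in-memory dictionaries standing in for separate servers; this is not how any particular product implements it.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class PartitionedStore:
    """Toy key-value store that spreads keys over several shards."""

    def __init__(self, num_shards: int) -> None:
        # Each dict stands in for one server (an assumption for illustration).
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str, default=None):
        return self.shards[shard_for(key, len(self.shards))].get(key, default)


store = PartitionedStore(num_shards=4)
for user_id in ("alice", "bob", "carol", "dave"):
    store.put(user_id, {"name": user_id.title()})

print(store.get("alice"))
print([len(shard) for shard in store.shards])
```

Because every key lives on exactly one shard, adding servers adds capacity. Note that plain modulo hashing moves most keys when the shard count changes; distributed stores commonly use consistent hashing to limit that movement.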
The general definition of a NOSQL data store is that it manages data that is not strictly tabular and relational, so it does not make sense to use SQL for the creation and retrieval of the data. NOSQL data stores are usually non-relational, distributed, open-source, and horizontally scalable.
If we look at the big platforms on the Web, like Facebook or Twitter, there are some datasets that do not need any relations. The challenge for NoSQL databases is to keep the data consistent. Imagine that a user deletes his or her account. If the account is hosted on a NoSQL database, every table has to be checked for data the user has produced in the past. With NoSQL, this has to be done in application code.
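The account-deletion scenario can be sketched as follows. This is only an illustration: the three dictionaries stand in for separate NoSQL collections, and `delete_user` is a hypothetical helper doing by hand what a relational database could do with a cascading foreign key.

```python
# Three independent "collections"; nothing in the store links them together.
users = {"u1": {"name": "Ann"}, "u2": {"name": "Ben"}}
posts = {
    "p1": {"author": "u1", "text": "hello"},
    "p2": {"author": "u2", "text": "hi"},
    "p3": {"author": "u1", "text": "bye"},
}
comments = {
    "c1": {"author": "u2", "post": "p1", "text": "nice"},
    "c2": {"author": "u1", "post": "p2", "text": "thanks"},
}


def delete_user(user_id: str) -> tuple:
    """Remove a user and everything the user produced, collection by collection."""
    removed_posts = {pid for pid, p in posts.items() if p["author"] == user_id}
    for pid in removed_posts:
        del posts[pid]

    # Drop the user's own comments and comments left on the removed posts.
    removed_comments = [
        cid
        for cid, c in comments.items()
        if c["author"] == user_id or c["post"] in removed_posts
    ]
    for cid in removed_comments:
        del comments[cid]

    users.pop(user_id, None)
    return len(removed_posts), len(removed_comments)


counts = delete_user("u1")
print(counts, sorted(posts), sorted(comments))
```

If the application forgets one collection, orphaned data stays behind, which is exactly the consistency burden described above.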
A major advantage of NoSQL databases is that data replication can be done more easily than it would be with SQL databases.
As there are no relations, tables don’t necessarily have to be on the same servers. Again, this allows better “scaling” than SQL databases. Don’t forget: scaling is one of the key aspects of cloud computing environments.
Another disadvantage of SQL databases is that there is always a schema involved. Over time, requirements will change, and the database somehow has to support these new requirements. This can lead to serious problems. Just imagine that an application needs two extra fields to store data. Solving this with a SQL database means changing the schema, which can get very hard on large production tables. NoSQL databases support a changing environment for data and are a better solution in this case as well.
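The "two extra fields" case might look like this in a schema-free store. It is a sketch under assumptions: the records are plain Python dictionaries, and the field names `twitter` and `timezone` are invented for illustration.

```python
# Existing records were written before the new requirement appeared.
profiles = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Ben"},
]

# New requirement: two extra fields. New records simply carry them;
# old records are left untouched and no schema migration is run.
profiles.append({"id": 3, "name": "Cy", "twitter": "@cy", "timezone": "UTC"})


def timezone_of(profile: dict) -> str:
    """Readers must tolerate records that predate the new fields."""
    return profile.get("timezone", "unknown")


zones = [timezone_of(p) for p in profiles]
print(zones)
```

The flexibility is not free: the knowledge of which fields may be missing moves out of the database schema and into the application code that reads the data.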
SQL databases have the advantage over NoSQL databases of better support for “Business Intelligence”.
Cloud Computing Platforms are made for a great number of people and potential customers. This means that there will be millions of queries over various tables, millions or even billions of read and write operations within seconds. SQL Databases are built to serve another market: the “business intelligence” one, where fewer queries are executed.
This implies that the way forward for many developers is a hybrid approach, with large sets of data stored in, ideally, cloud-scale NoSQL storage, and smaller specialized data remaining in relational databases. While this would seem to amplify management overhead, reducing the size and complexity of the relational side can drastically simplify things.
However, the use case determines whether you want a NoSQL approach or are better off staying with SQL.
“NOSQL” Databases for Cloud
The NoSQL (or “not only SQL”) movement is defined by a simple premise: Use the solution that best suits the problem and objectives.
If the data structure is more appropriately accessed through key-value pairs, then the best solution is likely a dedicated key-value database.
If the objective is to quickly find connections within data containing objects and relationships, then the best solution is a graph database that can get results without any need for translation (O/R mapping).
Today’s availability of numerous technologies that finally support this simple premise is helping to simplify the application environment and enable solutions that actually exceed the requirements, while also supporting performance and scalability objectives far into the future. Many cloud web applications have expanded beyond the sweet spot of relational database technologies. Many applications demand availability, speed, and fault tolerance over consistency.
Although the original emergence of NOSQL data stores was motivated by web-scale data, the movement has grown to encompass a wide variety of data stores that just happen to not use SQL as their processing language. There is no general agreement on the taxonomy of NOSQL data stores, but the categories below capture much of the landscape.
Tabular / Columnar Data Stores
Storing sparse tabular data, these stores look most like traditional tabular databases. Their primary data retrieval paradigm utilizes column filters, generally leveraging hand-coded map-reduce algorithms.
BigTable is a compressed, high-performance, proprietary database system built on Google File System (GFS), Chubby Lock Service, and a few other Google programs.
HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and written in Java. It runs on top of HDFS, providing a fault-tolerant way of storing large quantities of sparse data.
Hypertable is an open source database inspired by publications on the design of Google’s BigTable. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++ for performance.
VoltDB is an in-memory database. It is an ACID-compliant RDBMS that uses a shared-nothing architecture. VoltDB is based on the academic H-Store project and supports SQL access from within pre-compiled Java stored procedures.
Google Fusion Tables is a free service for sharing and visualizing data online. It allows you to upload and share data, merge data from multiple tables into interesting derived tables, and see the most up-to-date data from all sources.
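As a rough illustration of sparse tabular data and column-filter retrieval, the sketch below stores only the cells that exist for each row and scans for one column. The table contents, the `family:qualifier` column names, and the `scan_column` helper are assumptions made for this example; they do not reproduce the API of any store listed above.

```python
# Each row keeps only the columns it actually has, so absent cells cost nothing.
table = {
    "row1": {"info:name": "Ann", "info:city": "Oslo"},
    "row2": {"info:name": "Ben"},
    "row3": {"info:name": "Cy", "stats:logins": 7},
}


def scan_column(rows: dict, column: str) -> list:
    """Return (row_key, value) pairs for rows that contain the column, in key order."""
    return [
        (key, cells[column])
        for key, cells in sorted(rows.items())
        if column in cells
    ]


print(scan_column(table, "info:city"))
print(scan_column(table, "info:name"))
```

Rows are scanned in sorted key order here because stores modeled on BigTable keep rows sorted by key; the rest is deliberately simplified.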
Document Stores
These NOSQL data sources store unstructured (e.g., text) or semi-structured (e.g., XML) documents. Their data retrieval paradigm varies widely, but documents can always be retrieved by a unique handle. XML data sources leverage XQuery. Text documents are indexed, facilitating keyword search-like retrieval.
Apache CouchDB, commonly referred to as CouchDB, is an open source document-oriented database written in the Erlang programming language. It is designed for local replication and to scale horizontally across a wide range of devices.
MongoDB is an open source, scalable, high-performance, schema-free, document-oriented database written in the C++ programming language.
Terrastore is a distributed, scalable and consistent document store supporting single-cluster and multi-cluster deployments. It provides advanced scalability support and elasticity features without loosening consistency at the data level.
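The two retrieval styles mentioned for document stores, lookup by unique handle and keyword search over indexed text, can be sketched with a small inverted index. The `DocumentStore` class is hypothetical and kept in memory; none of the products above is implemented this way in detail.

```python
import re
from collections import defaultdict


class DocumentStore:
    """Toy document store: fetch by id, or search by keyword via an inverted index."""

    def __init__(self) -> None:
        self.docs = {}
        self.index = defaultdict(set)  # word -> ids of documents containing it

    def put(self, doc_id: str, text: str) -> None:
        # Simplification: overwriting an existing id would leave stale index entries.
        self.docs[doc_id] = text
        for word in re.findall(r"\w+", text.lower()):
            self.index[word].add(doc_id)

    def get(self, doc_id: str) -> str:
        return self.docs[doc_id]

    def search(self, keyword: str) -> list:
        return sorted(self.index.get(keyword.lower(), set()))


library = DocumentStore()
library.put("d1", "Cloud databases scale out")
library.put("d2", "Graph databases store relationships")

print(library.get("d1"))
print(library.search("databases"))
print(library.search("graph"))
```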
Graph Databases
These NOSQL sources store graph-oriented data with nodes, edges, and properties and are commonly used to store associations in social networks.
Neo4j is an open-source graph database, implemented in Java. It is an “embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs”.
AllegroGraph is a graph database. It considers each stored item to have any number of relationships. These relationships can be viewed as links, which together form a network, or graph.
FlockDB is an open source distributed, fault-tolerant graph database for managing data at webscale. It was initially used by Twitter to build its database of users and manage their relationships to one another. It scales horizontally and is designed for on-line, low-latency, high throughput environments such as websites.
VertexDB is a high-performance graph database server that supports automatic garbage collection. It uses the HTTP protocol for requests and JSON for its response data format. Its API is inspired by the FUSE file system API, plus a few extra methods for queries and queues.
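The kind of question a graph store is built to answer, such as "who is within two hops of this user", can be sketched with a breadth-first traversal over an adjacency map. The `follows` data and the `within_hops` function are invented for illustration and say nothing about the query interfaces of the databases above.

```python
from collections import deque

# Directed "follows" edges between users (made-up sample data).
follows = {
    "ann": {"ben", "cy"},
    "ben": {"dee"},
    "cy": {"dee", "eve"},
    "dee": set(),
    "eve": {"ann"},
}


def within_hops(graph: dict, start: str, max_hops: int) -> dict:
    """Breadth-first search returning each reachable node and its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


print(within_hops(follows, "ann", 2))
```

In a relational schema the same question would typically need one self-join per hop, which is why connection-heavy data is a natural fit for this category.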
Key/Value Stores
These sources store simple key/value pairs like a traditional hash table. Their data retrieval paradigm is simple: given a key, return the value.
Dynamo is a highly available, proprietary key-value structured storage system. It has properties of both databases and distributed hash tables (DHTs). It is not directly exposed as a web service, but is used to power parts of other Amazon Web Services.
Memcached is a general-purpose distributed memory caching system. It is often used to speed up dynamic database-driven websites by caching data and objects in RAM to reduce the number of times an external data source must be read.
Cassandra is an open source distributed database management system. It is designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure. It is a NoSQL solution that was initially developed by Facebook and powers their Inbox Search feature.
Amazon SimpleDB is a distributed database written in Erlang by Amazon.com. It is used as a web service in concert with EC2 and S3 and is part of Amazon Web Services.
Voldemort is a distributed key-value storage system. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient.
Kyoto Cabinet is a library of routines for managing a database. The database is a simple data file containing records; each is a pair of a key and a value. There is no concept of data tables or data types. Records are organized in a hash table or a B+ tree.
Scalaris is a scalable, transactional, distributed key-value store. It can be used for building scalable Web 2.0 services.
Riak is a Dynamo-inspired database that is being used in production by companies like Mozilla.
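The caching role described for Memcached above, keeping values in RAM so that an external data source is read less often, can be sketched with a read-through cache keyed by string. `SlowSource` and `CachedReader` are hypothetical classes; the cache is an ordinary dictionary rather than a real Memcached client, and eviction and expiry are left out.

```python
class SlowSource:
    """Stand-in for an external data source; counts how often it is read."""

    def __init__(self, data: dict) -> None:
        self.data = data
        self.reads = 0

    def read(self, key: str):
        self.reads += 1
        return self.data[key]


class CachedReader:
    """Given a key, return the value, consulting the source only on a cache miss."""

    def __init__(self, source: SlowSource) -> None:
        self.source = source
        self.cache = {}

    def get(self, key: str):
        if key not in self.cache:
            self.cache[key] = self.source.read(key)
        return self.cache[key]


source = SlowSource({"page:home": "<h1>Home</h1>"})
reader = CachedReader(source)

first = reader.get("page:home")
second = reader.get("page:home")
print(first == second, source.reads)
```

The second lookup never reaches the source, which is the whole point of putting a key/value cache in front of a slower database.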
Object and Multi-value Databases
These types of stores preceded the NOSQL movement, but they have found new life as part of the movement. Object databases store objects (as in object-oriented programming). Multi-value databases store tabular data, but individual cells can store multiple values. Examples include Objectivity, GemStone and Unidata. Proprietary query languages are used.
Miscellaneous NOSQL Sources
Several other data stores can be classified as NOSQL stores, but they don’t fit into any of the categories above. Examples include: GT.M, IBM Lotus/Domino, and the ISIS family.
Sources for further Reading
http://www.rackspacecloud.com/blog/2010/02/25/should-you-switch-to-nosql-too/
http://pro.gigaom.com/2010/03/what-cloud-computing-can-learn-from-nosql/
http://www.drdobbs.com/database/224900500