Commit

SQL, NoSQL, NewSQL, ACID, Normalization, CAP theorem
mistermicheels committed Jun 23, 2019
1 parent 35505b7 commit d51dfb7
Showing 4 changed files with 396 additions and 0 deletions.
45 changes: 45 additions & 0 deletions data/CAP-theorem.md
@@ -0,0 +1,45 @@
# CAP theorem

See:

- [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem)
- Designing Data-Intensive Applications (book by Martin Kleppmann)
- [SQL Server Availability Modes](https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/availability-modes-always-on-availability-groups?view=sql-server-2017)
- [Offload read-only workload to secondary replica of an Always On availability group](https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/active-secondaries-readable-secondary-replicas-always-on-availability-groups?view=sql-server-2017)

This theorem states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

- *Consistency*: Every read returns either the relevant value as it was written by the latest successful write or an error
- *Availability*: Every request receives a non-error response
- *Partition tolerance*: The system keeps working, even if any number of messages are dropped or delayed by the network that connects the different instances

Implications:

> No distributed system is safe from network failures, thus network partitioning generally has to be tolerated. In the presence of a partition, one is then left with two options: consistency or availability. When choosing consistency over availability, the system will return an error or a time-out if particular information cannot be guaranteed to be up to date due to network partitioning. When choosing availability over consistency, the system will always process the query and try to return the most recent available version of the information, even if it cannot guarantee it is up to date due to network partitioning.
>
> In the absence of network failure – that is, when the distributed system is running normally – both availability and consistency can be satisfied.

## Criticism

- Can be misleading if presented as "pick 2 out of 3"
  - Every distributed system has to assume the possibility of network failures, so the actual trade-off is between consistency and availability
  - When the network is working correctly, you can have both consistency and availability at the same time (you don't have to give up 1 out of the 3 at all times)
- Notion of consistency is limited and can be confusing
  - Consistency as defined in the CAP theorem is actually linearizability (basically, making it appear as if there is only a single copy of the data)
  - There are also other consistency models in distributed systems research, plus other uses of the term *consistency* in the data store world
- Only takes into account network partitions (nodes that are alive but disconnected from each other)
  - Ignores dead nodes, etc.

Still, the CAP theorem has been of great historical importance, as it encouraged people to also explore distributed systems that limit consistency in favor of availability, which can make sense for certain large-scale web services. This has been an important inspiration for the NoSQL movement.

## CAP consistency vs. ACID consistency

See also [ACID properties](./sql/ACID.md)

The two terms mean completely different things:

- CAP consistency: Every read returns either the relevant value as it was written by the latest successful write or an error

- ACID consistency: The execution of the transaction must bring the database to a valid state, respecting the database’s schema

In fact, when relational databases are deployed in a distributed fashion, there are typically different modes available that can have an impact on CAP consistency. For example, when setting up a high-availability cluster for Microsoft SQL Server, you have the choice between the availability modes *synchronous commit* and *asynchronous commit*. With synchronous commit, a transaction is only confirmed once it has effectively been synchronized to the other instances (secondary replicas). Asynchronous commit, on the other hand, does not wait for the secondary replicas to catch up. If asynchronous commit is used and the cluster is configured to allow reads to go directly to the secondary replicas, it is possible that reads return stale data.
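
The difference between the two modes can be illustrated with a toy model. This is only a sketch in plain Python, not SQL Server's actual mechanism: the `Primary` and `Replica` classes, the `pending` queue and the `flush` method are hypothetical names made up for the illustration.

```python
class Replica:
    """A secondary replica holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class Primary:
    """Toy primary that replicates writes synchronously or asynchronously."""

    def __init__(self, replica, synchronous):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []  # writes not yet applied on the replica

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Synchronous commit: the replica has the write before we return.
            self.replica.data[key] = value
        else:
            # Asynchronous commit: return now, replicate later.
            self.pending.append((key, value))

    def flush(self):
        """Simulate the replica catching up with the primary."""
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()


sync_replica = Replica()
sync_primary = Primary(sync_replica, synchronous=True)
sync_primary.write("stock", 5)
sync_read = sync_replica.data.get("stock")    # sees the latest write

async_replica = Replica()
async_primary = Primary(async_replica, synchronous=False)
async_primary.write("stock", 5)
stale_read = async_replica.data.get("stock")  # replica has not caught up yet
async_primary.flush()
fresh_read = async_replica.data.get("stock")  # replica has caught up
```

In this model, a read from the asynchronous replica between `write` and `flush` misses the latest value, which is the stale read described above.
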
187 changes: 187 additions & 0 deletions data/SQL-NoSQL-NewSQL.md
@@ -0,0 +1,187 @@
# SQL, NoSQL, NewSQL

See:

- [Relational database](https://en.wikipedia.org/wiki/Relational_database)
- [NoSQL](https://en.wikipedia.org/wiki/NoSQL)
- [Living Without Transactions](https://stackoverflow.com/a/39210371)
- [Patterns for Schema Changes in Document Databases](https://stackoverflow.com/questions/5029580/patterns-for-schema-changes-in-document-databases)
- [NewSQL](https://en.wikipedia.org/wiki/NewSQL)

## Relational databases (SQL)

- Also called RDBMS (relational database management system)
- Have been around for many decades
  - Popular RDBMSes are mature systems, and many developers and database administrators have years of experience with them
  - This means that there is a wealth of knowledge available on best practices, how to tackle certain issues, etc.
- The SQL standard describes the query language and behavior of relational databases
  - Still, different databases typically provide different dialects of that query language, and they may differ significantly in their behavior in some cases (although possibly all within the bounds of the standard)

### Tables, rows, relationships and schemas

- Tables (= *relations*)
  - Each table is a set of columns, each of a certain type, that can hold data for the rows in the table
  - Each table has a subset of columns, the table’s *primary key (PK)*, that uniquely identifies each row in the table
  - There may also be other subsets of columns uniquely identifying each row in the table, known as *alternate keys (AK)*
- Relationships
  - Possible to link rows in a table to rows in another table by including the column(s) of that other table’s primary key
    - This is called a *foreign key*
    - If defined correctly, the database enforces that a foreign key links to an actual row in the other table
  - This way, we can define one-to-one relationships and one-to-many relationships
  - Also possible to represent many-to-many relationships using an intermediate table to store foreign keys to both tables in the relationship

All of the tables, columns, keys, relationships, etc. are defined in the database schema. The database actively enforces the schema and forbids data that doesn’t match it (incorrect data type for a column, foreign key linking to a row that doesn’t exist, …).
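
As a minimal sketch of schema enforcement, the following uses Python's built-in `sqlite3` module. The `actor`, `movie` and `movie_actor` tables and the `try_insert_link` helper are hypothetical names chosen for illustration, and other database systems will differ in syntax and defaults (SQLite, for instance, only enforces foreign keys after `PRAGMA foreign_keys = ON`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked to

conn.executescript("""
    CREATE TABLE actor (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE movie (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    -- Intermediate table representing the many-to-many relationship.
    CREATE TABLE movie_actor (
        movie_id INTEGER NOT NULL REFERENCES movie(id),
        actor_id INTEGER NOT NULL REFERENCES actor(id),
        PRIMARY KEY (movie_id, actor_id)
    );
""")

conn.execute("INSERT INTO actor (id, name) VALUES (1, 'Actor A')")
conn.execute("INSERT INTO movie (id, title) VALUES (10, 'Movie X')")
conn.execute("INSERT INTO movie_actor (movie_id, actor_id) VALUES (10, 1)")


def try_insert_link(movie_id, actor_id):
    """Return True if the link was accepted, False if the schema rejected it."""
    try:
        conn.execute(
            "INSERT INTO movie_actor (movie_id, actor_id) VALUES (?, ?)",
            (movie_id, actor_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


rejected = not try_insert_link(10, 999)  # actor 999 does not exist
link_count = conn.execute("SELECT COUNT(*) FROM movie_actor").fetchone()[0]
```

The insert that points at a non-existent actor is refused by the database itself, so the application does not need its own check for this case.
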

### SQL

SQL (Structured Query Language):

- Declarative language that allows performing CRUD operations on the data and the database schema
- Declarative = you specify *what* you want your query to do instead of *how* to do it. The database system itself figures out how exactly the query will be performed. This can simplify things, but it can also make it challenging to optimize queries that get executed in a sub-optimal way.
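
To make the *what* versus *how* distinction concrete, here is a small sketch using Python's built-in `sqlite3` module. The `person` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person (name, age) VALUES (?, ?)",
    [("Ann", 34), ("Bob", 17), ("Cid", 52)],
)

# Imperative: the application spells out how to get the result
# (fetch every row, test it, collect it, sort the collection).
adults_imperative = []
for name, age in conn.execute("SELECT name, age FROM person"):
    if age >= 18:
        adults_imperative.append(name)
adults_imperative.sort()

# Declarative: the query states what is wanted; the database decides
# how to execute it (full scan, index lookup, sort strategy, ...).
adults_declarative = [
    row[0]
    for row in conn.execute("SELECT name FROM person WHERE age >= 18 ORDER BY name")
]
```

Both approaches produce the same list, but only the declarative version leaves the database free to pick a different execution strategy, for example when an index on `age` is added later.
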

### ACID transactions

See [ACID properties](./sql/ACID.md)

### Normalization

See [Normalization](./sql/Normalization.md)

## NoSQL

- Became popular in the early twenty-first century as an alternative to relational databases
- The term NoSQL encompasses lots of different data stores with different concepts, approaches, query languages, etc. that offer a solution to some problem for which relational databases are maybe not an ideal solution
  - In order to achieve this, they generally need to make compromises in terms of features and the guarantees offered by the data store
  - Depending on the application, developers may need to handle some things on the application side that a relational database would simply handle for them
  - Developers used to working with a non-distributed relational database should be especially careful when working with distributed NoSQL databases, as those may introduce the possibility of inconsistencies in areas where the developers take consistency for granted
- Typical selling point: specialized solution for particular use cases
  - Example: graph databases (see below)
- Typical selling point: horizontal scalability
  - Vertical scaling: make your machines more powerful by adding CPU power, memory, faster disks, etc.
    - Becomes expensive or practically impossible once you reach a certain point
  - Horizontal scaling: add more machines and distribute the load between them
    - Becomes cheaper than vertical scaling once a certain scale is reached
    - Allows you to keep scaling further by adding more machines
  - NoSQL databases tend to be built for horizontal scaling, while relational databases are typically not very good at it
  - See also [CAP theorem](./CAP-theorem.md)

### Transactions

NoSQL databases typically don't offer transactions with the ACID guarantees that relational databases provide

- Some only provide transactional integrity at the level of a single entry (which may still contain structured data or an array of values)
- Some don't even provide any form of transactions at all

Strategies for dealing with lack of transactional support:

- Redesign your data model so you don’t need more transactional support than what the system offers
- Perform the required concurrency control at the level of your application
- Tolerate the possible concurrency issues caused by not having transactions and adjust your application and possibly your users’ expectations to this
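
One common way to do concurrency control at the application level is optimistic locking with version numbers. The sketch below assumes a store that can atomically update a single entry if its version is unchanged; `VersionedStore`, `put_if_version` and `add_to_counter` are hypothetical names, and the in-memory dictionary only stands in for a real data store.

```python
class VersionedStore:
    """Toy store where each key maps to a (version, value) pair."""

    def __init__(self):
        self._entries = {}

    def get(self, key):
        return self._entries.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        """Write only if nobody else has written since we read; return success."""
        current_version, _ = self.get(key)
        if current_version != expected_version:
            return False
        self._entries[key] = (current_version + 1, value)
        return True


def add_to_counter(store, key, amount, max_attempts=5):
    """Read-modify-write with retries instead of a database transaction."""
    for _ in range(max_attempts):
        version, value = store.get(key)
        if store.put_if_version(key, version, (value or 0) + amount):
            return True
    return False


store = VersionedStore()
add_to_counter(store, "visits", 1)

# Two clients read the same version; only the first write wins.
version, value = store.get("visits")
first = store.put_if_version("visits", version, value + 10)
second = store.put_if_version("visits", version, value + 100)
```

The second client's write is rejected rather than silently overwriting the first one, so it can re-read the entry and retry, which is what `add_to_counter` does.
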

### Types of NoSQL data stores

Note: this is not intended to be a complete list of all possible types

#### Document store

- Data is stored as documents containing structured data (think something JSON-like)
- When performing queries, you can typically retrieve or filter on data inside the documents
- Typically the main candidate for storing your application’s domain data if you don’t want to store that data in a relational database
- Example: MongoDB

Fit:

- Can be a good fit for data that has a hierarchical structure (looks like a tree), as you can just put the entire structure in a document
- Works well for one-to-one and one-to-many relationships
- Many-to-many relationships can be hard to model
  - Example: you want to store information on actors, movies and which actors played in which movies
  - One option: include the data regarding actors inside the documents for the movies or vice versa
    - This is *denormalization* (see also [Normalization](./sql/Normalization.md)) and will lead to duplicate data and the possibility of inconsistencies
  - Other option: have documents for actors, documents for movies, and store references to movies inside actors
    - Similar to the concept of foreign keys in relational databases
    - Problem: document stores often do not offer real foreign key constraints, so there is nothing at the database level preventing you from deleting a movie that an actor still refers to
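
The two modeling options and their pitfalls can be sketched with plain Python dictionaries standing in for documents. All document contents, field names (`movie_ids`, `born`) and helper functions here are made up for the illustration.

```python
# Option 1: embed actor data inside each movie document (denormalized).
movies_embedded = [
    {"title": "Movie X", "actors": [{"name": "Actor A", "born": 1970}]},
    {"title": "Movie Y", "actors": [{"name": "Actor A", "born": 1971}]},  # inconsistent copy
]

# Option 2: separate collections, with references to movies stored inside actors.
movies = {"m1": {"title": "Movie X"}, "m2": {"title": "Movie Y"}}
actors = {"a1": {"name": "Actor A", "born": 1970, "movie_ids": ["m1", "m2"]}}


def conflicting_birth_years(movie_docs):
    """Find actors whose embedded copies disagree with each other."""
    seen = {}
    for movie in movie_docs:
        for actor in movie["actors"]:
            seen.setdefault(actor["name"], set()).add(actor["born"])
    return {name for name, years in seen.items() if len(years) > 1}


def dangling_references(actor_docs, movie_docs):
    """Find (actor id, movie id) pairs that point at a missing movie."""
    return [
        (actor_id, movie_id)
        for actor_id, actor in actor_docs.items()
        for movie_id in actor["movie_ids"]
        if movie_id not in movie_docs
    ]


del movies["m2"]  # nothing at the "database" level stops this delete
conflicts = conflicting_birth_years(movies_embedded)
dangling = dangling_references(actors, movies)
```

Checks like these two helpers are the kind of work that moves into the application when the data store does not enforce references or prevent duplicated data from drifting apart.
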

Schemaless?:

- Often, document stores are *schemaless*
  - This means that the database does not enforce a certain structure of the documents you store in it
  - Typically, this does not mean that there is no schema for the data, but rather that the schema is implicitly or explicitly defined by your application instead of at the database level
- Offers more flexibility in the face of changes to the structure of your data
  - Specifically, it allows data with the old structure to sit next to data with the new structure, without forcing you to migrate the old data to the new structure (yet)
- Drawbacks of having documents of the same type with different structures:
  - Your application needs to be able to handle the different structures (can lead to loads and loads of if-statements)
  - Can make maintenance difficult
- Take care to document the changes to the data’s structure and migrate old data when it makes sense
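
One way to keep the if-statements in a single place is to upgrade documents to the newest structure when they are read. The following is a sketch under assumptions: the `schema_version` field, the `name` to `first_name`/`last_name` change and the `upgrade_user` function are all hypothetical.

```python
def upgrade_user(doc):
    """Return the document in the newest structure, whatever version it was stored in."""
    version = doc.get("schema_version", 1)
    upgraded = dict(doc)  # leave the stored document untouched
    if version == 1:
        # Hypothetical v1 stored a single "name" string; v2 splits it in two.
        first, _, last = upgraded.pop("name").partition(" ")
        upgraded["first_name"] = first
        upgraded["last_name"] = last
        upgraded["schema_version"] = 2
    return upgraded


old_doc = {"name": "Ada Lovelace"}
new_doc = {"schema_version": 2, "first_name": "Alan", "last_name": "Turing"}
users = [upgrade_user(doc) for doc in (old_doc, new_doc)]
```

The rest of the application then only deals with the newest structure, and the upgraded documents can be written back whenever migrating the old data makes sense.
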

Note: some relational databases actually offer document store capabilities!

- Example: recent versions of PostgreSQL allow storing JSON data and performing queries based on the contents of that JSON data
- This can be a good option if some of your data is hierarchical in nature but you still want ACID capabilities
- If you don’t need to query based on the actual contents of the structured data, you can even just use any relational database and store the data as text in a column
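
The last option, storing structured data as text in an ordinary column, could look roughly like this. The sketch uses Python's built-in `sqlite3` and `json` modules; the `product` table and its `specs` column are invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL, specs TEXT NOT NULL)"
)

specs = {"dimensions": {"width": 30, "height": 20}, "tags": ["office", "desk"]}
with conn:  # commits on success, rolls back if an exception escapes
    conn.execute(
        "INSERT INTO product (id, name, specs) VALUES (?, ?, ?)",
        (1, "Lamp", json.dumps(specs)),  # serialize the hierarchy to text
    )

row = conn.execute("SELECT name, specs FROM product WHERE id = ?", (1,)).fetchone()
loaded_specs = json.loads(row[1])  # the application restores the structure
```

The database treats `specs` as an opaque string here, so filtering on its contents has to happen in the application after loading.
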

#### Key-value store

- Made for storing data as a dictionary (key-value map)
- All the data is stored in the database as a value with a unique key identifying that value.
- Values for different keys can have different data types. Data types offered by a key-value store may include strings, lists of strings, sets of strings and even key-value maps.
- It is typically up to the application to determine what the keys look like. For example, if you want to store data for users, you may use the key `user:1` for the user with id 1.
- Example: Redis

Fit:

- Useful if your data looks like a key-value map
- Popular use case is using a cluster of in-memory key-value stores as a very fast distributed cache for often-retrieved data
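
The caching use case is often implemented with a pattern where the application checks the cache first and only falls back to the slower data store on a miss. In this sketch, `Cache` is a dictionary-backed stand-in rather than a real key-value store client, and `load_user_from_database` is a hypothetical slow lookup.

```python
class Cache:
    """Stand-in for an in-memory key-value store; a plain dict behind get/set."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


slow_lookups = []  # records every time the slow path is taken


def load_user_from_database(user_id):
    """Hypothetical slow lookup in the primary data store."""
    slow_lookups.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(cache, user_id):
    key = f"user:{user_id}"  # application-defined key format
    user = cache.get(key)
    if user is None:
        user = load_user_from_database(user_id)
        cache.set(key, user)
    return user


cache = Cache()
first = get_user(cache, 1)   # cache miss: goes to the database
second = get_user(cache, 1)  # cache hit: served from the cache
```

A real setup would also need an expiry or invalidation strategy so that cached values do not stay around after the underlying data changes.
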

#### Graph database

- Represents data as a graph of nodes and relationships between those nodes
- Typically offer some specialized graph-based algorithms for analyzing the data (shortest path, clustering, ...)
- Example: Neo4j

Fit:

- Good fit when your data can naturally be represented as a network of nodes connected by edges that represent relationships between nodes
- Example: people on a social network site and their friends. If you model this as each person being a node and each friendship being an edge connecting nodes, storing the data in a graph database helps you recommend friends of friends, identify clusters of people that are all friends of each other, etc.
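
The friends-of-friends idea can be sketched in plain Python with an adjacency map. This only illustrates the data model; a graph database would express the same traversal in its own query language. The names and friendships are made up.

```python
friendships = [("ann", "bob"), ("bob", "cid"), ("bob", "dee"), ("ann", "dee")]


def build_graph(edges):
    """Turn a list of undirected edges into a node -> set of neighbours map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def recommend_friends(graph, person):
    """Friends of friends who are not already friends with the person."""
    direct = graph.get(person, set())
    candidates = set()
    for friend in direct:
        candidates |= graph[friend]
    return candidates - direct - {person}


graph = build_graph(friendships)
suggestions = recommend_friends(graph, "ann")
```

With the sample data, `ann` is friends with `bob` and `dee`, so the only suggestion left is `cid`, who is reachable through `bob`.
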

Note: there exist extensions to RDBMSes (for example PostgreSQL) that offer graph database capabilities as well

#### Time-series database

- Aimed at storing values that change throughout time
- Have storage engines and query languages that are optimized for storing time-series data, making it easy and efficient to perform time-based queries that can aggregate huge amounts of data
- Examples: InfluxDB, SiriDB

Fit:

- Typical use case: storing data obtained from sensors that are constantly measuring values like temperature, humidity, etc.
- Time-series database can make it easy to store a year’s worth of temperature measurements (one measurement each minute) and then retrieve the maximum and minimum measured temperature per week
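
To show what such a query computes, here is the weekly minimum and maximum aggregation written out in plain Python. It is a sketch with synthetic data covering two weeks instead of a year; `weekly_min_max` and the temperature values are made up, and a time-series database would do this far more efficiently inside its storage engine.

```python
from datetime import datetime, timedelta


def weekly_min_max(measurements):
    """Group (timestamp, value) pairs by ISO week: {(year, week): (min, max)}."""
    result = {}
    for timestamp, value in measurements:
        iso = timestamp.isocalendar()
        key = (iso[0], iso[1])  # ISO year and ISO week number
        low, high = result.get(key, (value, value))
        result[key] = (min(low, value), max(high, value))
    return result


start = datetime(2019, 1, 7)  # a Monday, so each 7-day block is one ISO week
measurements = [
    # One synthetic measurement per minute: a ramp from 20 to almost 21 each day.
    (start + timedelta(minutes=m), 20.0 + (m % 1440) / 1440)
    for m in range(14 * 24 * 60)
]
per_week = weekly_min_max(measurements)
```
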

Note that there exist extensions to RDBMSes that offer time-series database capabilities.

- Example: Timescale (builds upon PostgreSQL)

## NewSQL

NewSQL systems are a class of relational database management systems that aim at providing the ACID guarantees of relational databases with the horizontal scalability of NoSQL databases. There are several categories of NewSQL databases:

- Completely new systems, often built from scratch with distributed deployment being a major focus. They often use techniques that are similar to the techniques used by NoSQL databases. Examples include Google Spanner and CockroachDB. These systems typically have some limitations with regard to the features they support or the extent to which they provide true ACID guarantees.
- SQL storage engines optimized for horizontal scalability, replacing the default storage engines of relational databases. These storage engines may have some limitations that are not present in the database’s default storage engine.
- Middleware that sits on top of a cluster of relational database instances. An example is Vitess. Note that these systems may not offer ACID guarantees.

## Which one to use?

- Choosing which data store to use is a tradeoff and there is likely no “wrong” or “right” choice.
- Choice will likely depend on the kind of data you need to store, the scalability you need, the consistency you need, the knowledge of your team, etc.
- There is no rule stating that you must pick exactly one of SQL, NoSQL and NewSQL.
  - Example: it is very common to use a relational database for your application’s domain data but use a key-value store for caching purposes.
  - It could also be a good idea to store parts of your domain data in a relational database and other parts in a document database, depending on which one is a better fit for which part of the data. Of course, using multiple systems also means having to keep multiple systems running smoothly.

## Hosted data stores

When you are evaluating data stores for your project, it is a good idea to also consider the hosted data stores that are offered by cloud providers like AWS or Microsoft Azure. These hosted data stores include SQL, NoSQL and NewSQL data stores and using them could save you the headaches involved in managing your own data store or data store cluster. However, you should be careful regarding the amount of vendor lock-in this generates.
22 changes: 22 additions & 0 deletions data/sql/ACID.md
@@ -0,0 +1,22 @@
# ACID

See:

- [ACID (computer science)](https://en.wikipedia.org/wiki/ACID_(computer_science))

ACID = acronym describing four properties that transactions in a relational database must have

Also provided to some extent by some NoSQL or NewSQL databases

The four ACID properties:

- *Atomicity*:
  - A transaction is treated as a single unit that either succeeds completely or fails completely. If some operation fails as part of a transaction, the entire transaction is rolled back, including the changes that other operations may have performed in the same transaction. The system must guarantee this in every situation, even if the system crashes partway through a transaction.
- *Consistency*:
  - The execution of the transaction must bring the database to a valid state, respecting the database’s schema
  - Example: no matter how many concurrent transactions are executing, you will never be able to set a foreign key from a row to another row that does not exist (but maybe did exist when you retrieved your data)
- *Isolation*:
  - Although multiple transactions may be running concurrently, their effects on each other’s execution are limited.
  - Relational database systems typically provide multiple isolation levels, where higher levels protect against more concurrency-induced phenomena than lower levels. For more details, see [transaction isolation levels](./Transaction-isolation-levels).
- *Durability*:
  - Once a transaction has been successfully committed, it will remain so, even if the system crashes, power is shut off, etc. right after the transaction has completed.
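
Atomicity in particular is easy to demonstrate. The sketch below uses Python's built-in `sqlite3` module; the `account` table, its `CHECK` constraint and the `transfer` function are hypothetical and only serve to force a failure halfway through a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account (id, balance) VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()


def transfer(amount, source, target):
    """Move money between accounts; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, target)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(60, 1, 2)      # succeeds: balances become 40 and 60
failed = transfer(60, 1, 2)  # second update would make the balance negative

balances = dict(conn.execute("SELECT id, balance FROM account"))
```

In the failed transfer, the first update (crediting the target account) had already been executed, yet the rollback undoes it, so the balances stay at 40 and 60.
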