GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.
Spring 2016: Introduction to Database Systems
[TOC]
- data : facts, basis for reasoning, useful, irrelevant, redundant
- information : processed and organised data, meaningful, relevant, actionable
- why study databases
- corporate : supply chain management, data analytics, data science
- scientific : digital humanities, human brain project, sensor network
- database management system DBMS : software system designed to store, manage, facilitate access to databases
- efficient : thousands of queries/updates per second
- reliable : guarantees for 24x7 availability
- convenient : physical data independence, declarative high-level query languages
- safe : protects data from failures (software, hardware, power, malicious)
- multi-user : concurrency control
- massive : extremely large, terabytes every day
- persistent : data outlives the programs that operate on it
- ad hoc queries : formed or used for specific or immediate problems or needs
- database DB : a large, integrated, structured collection of data
- entities vs relationships : usually intended to model some real-world enterprise
- WWW != DB : correct answer not defined, cannot manipulate data, few guarantees
- search vs query : what's stored vs answers
- file system != DB : no guarantees on how changes survive concurrent editing or power outages
- database design
- requirements analysis : users needs
- conceptual design : high level description, ER model
- logical design : translate ER into DBMS data model
- schema refinement : consistency, normalization
- physical design : indexes, disk layout
- security design : who accesses what
- data independence : ability to change the schema at one level of a DB system without changing the schema at the next higher level
- logical data independence : capacity to change the conceptual schema without changing the user views
- physical data independence : capacity to change the internal schema without having to change the conceptual schema or user views
- data model : collection of concepts for describing data at a high level
- relational : set of records
- relation : table with rows and cols
- schema (types) : describes the structure of a relation
- hierarchical : nested data models
- graph
- object-oriented
- relational data model : rows and cols with keys and foreign keys to link relations
- entity : real-world object distinguishable from others, described as a set of attributes
- entity set : collection of similar entities, each entity has a key, each attribute has a domain
- relationship : association among entities, can have attributes
- relationship set : collection of similar relationships
- integrity constraints
- key constraints
- participation constraints
- weak entities : can be identified uniquely only by considering the primary key of another (owner) entity, the owner has a one-to-many relationship with it, its primary key combines a partial key with the owner's primary key
- ternary relationships : 3 linked entities (drawn in * or T form)
- n-ary relationships : n linked entities
- complex hierarchies
- ISA (triangle) : attributes are inherited
- good : adding descriptive attributes, identify particular relationship
- overlap constraints (allow/disallow) : can an entity belong to multiple subclasses at the same time
- covering constraints (yes/no) : must every parent entity belong to some subclass
- aggregation : treat a relationship set as an entity set
- entity vs attribute
- entity : if the concept is used several times
- entity : if structure is important
- database : set of named relations (tables)
- relations : set of named attributes (columns)
- tuple (row) : a value for each attribute
- cardinality : # rows
- arity/degree : # attributes
- attributes : has a domain (type)
- relation : set of tuples
- null value : unknown, undefined
- SQL
- DDL : data definition language
- DML : data manipulation language
- keys : set of attributes with unique value in each tuple
- candidate keys : minimal sets of attributes that are unique, one is chosen as the primary key
- foreign keys : set of attributes used to refer to another tuple in another relation
- rejects addition if the referred tuple does not exist
- referential integrity : all foreign key enforced
- on deletion : cascade, set null, or no action
- integrity constraints (ICs) : condition that must be true for any instance
- specified on schema definitions
- checked on relations modifications
- legal instance : satisfies all ICs
- weak entity : on delete cascade
- not null
- translating ISA
- join with parent relation when full info is needed
- or define all attributes in each child relation
- nosql : schema flexibility, less constraints, expensive queries
- object-centric representation
- key value data model
- hierarchies, arrays
- query languages : manipulation and retrieval of data, strong formal foundation, easy and efficient access to large data sets, not Turing complete
- relational algebra : operational, useful for representing execution plans
- relational calculus : describe what you want not how, declarative
- relation instance : schemas of input and output are fixed
- positional field notation : easier for formal definition
- named field notation : more readable
- operations : with composition (closed), see the sketch after this list
- selection ($\sigma$) : selects subset of rows (horizontal), distinct
- projection ($\pi$) : retains only wanted columns (vertical), eliminates duplicates
- set-difference ($-$) : same number of fields and same types
- union ($\cup$) : same number of fields and same types
- intersection ($\cap$) : $R\cap S = R-(R-S)$
- renaming ($\rho$)
- cross-product ($\times$) : combines two relations, need for renaming in case of field conflicts
- division ($/$) : computes all values not disqualified by some values in the divisor, useful for expressing forall
- joins
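A minimal sketch of a few of these operators, with relations represented as Python lists of dicts; the relation names, attributes and data are made up for illustration:

```python
# Relations as lists of dicts (attribute name -> value); duplicates removed by projection.

def select(relation, predicate):
    """sigma: keep the rows satisfying the predicate (horizontal subset)."""
    return [t for t in relation if predicate(t)]

def project(relation, attrs):
    """pi: keep only the wanted columns and eliminate duplicates."""
    seen, out = set(), []
    for t in relation:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def cross(r, s):
    """x: pair every tuple of r with every tuple of s (attribute names assumed disjoint)."""
    return [{**t, **u} for t in r for u in s]

def difference(r, s):
    """-: tuples in r that are not in s (same schema assumed)."""
    s_keys = {tuple(sorted(t.items())) for t in s}
    return [t for t in r if tuple(sorted(t.items())) not in s_keys]

# usage with hypothetical data
students = [{"sid": 1, "name": "Ada"}, {"sid": 2, "name": "Bob"}]
print(project(select(students, lambda t: t["sid"] == 1), ["name"]))  # [{'name': 'Ada'}]
```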
- normalized relation : like weak entities, factor out repeated values to save space
- buffer pool management : policy to move pages between RAM and disk
- unit
- I/O : transfers whole pages
- records : the unit used at higher levels of the DBMS
- system catalog : stored themselves as relations
- for each relation : name, file name, file structure (heap), attribute names and types, index names, integrity constraints
- for each index : structure (tree) and search key fields
- for each view : view name, definition
- stats, authorization, buffer pool size, etc.
- records storing
- quick cost model : # of disk I/O, ignore CPU costs and prefetching, average case
- file organization : collection of pages
- file (traditional slotted pages) : collection of pages containing collections of records, support insert, delete, modify, scan all, select
- heap files : suitable when retrieving all records by scanning, simple, no order, disk pages allocated and freed as the file grows and shrinks
- sorted files : best for retrieval in some order or range of records
- indexes : speed up selections on search key fields for index
- search conditions : <search key, comparison operator>
- index data entries : depend on indexing technique and data, can have many indexes per file
- actual data record with key value : structured as a file, saves pointer lookups but expensive for insertion and deletion, at most one index can use that mode, implies clustered
- <key value, rid of matching data record> : easier to maintain
- <key value, list of rids of matching data records> : more flexible but variable length, could span multiple pages
- index classification
- clustered (vs unclustered) : order of data records is close to order of index data entries
- clustered implied by index data entries directly with values
- at most one search key
- clustered cost : # pages in file with matching records
- unclustered cost : # of matching index data entries
- pros : efficient for ranges, may do some compression
- cons : expensive to maintain
- primary (vs secondary) index : index key includes file primary key (vs any other index), primary never contains duplicates
- dense (vs sparse) : at least one data entry per key value (vs an entry per data page in file, clustered)
- every sparse index is clustered
- dense index implied by index data entries directly with values
- composite search key : search on a combination of fields
- equality query : every field equal to constant value
- range query : some field value is not a constant
- lexicographical order : usually
- indexing technique
- tree based : good for range, hierarchical, B+ trees (all root to leaf paths have equal length), R trees
- hash based : good for equality, file is collection of buckets
- storage hierarchy
- main-memory database : smaller size, performance optimized, volatility, expensive
- flash DBMS : main storage, accelerator (specialized cache, logging device)
- DBMS with disks : costly READ and WRITE memory operations
- disk vs tape : random access (slower than RAM 1000x) vs sequential (faster than random access 10x)
- disk blocks (pages) : unit of data, multiple of sector size, retrieval time vary on disk location
- anatomy
- accessing a disk page : delays (cost sketch below)
- seek time : moving arm to position disk head on track, dominated by settle time (moving to one of many nearby tracks, larger with disk track density), 1-20ms
- max rotational delay : 1/RPM in ms
- transfer rate : sectors / max rotational delay
- rotational delay : waiting for block to rotate under head, 0-10ms
- transfer time : moving data to/from disk surface, < 1ms for 4KB page
- shared disks : most of time spent waiting in queue for access
- arranging pages on disk
- pre-fetching
- rules : memory access much faster than disk I/O (1000x), sequential I/O faster than random I/O (10x)
- space management : lower layer manage space on disk, higher level (de)allocate or read/write a page
- disk arrays : RAID, higher throughput (data striping), longer MTTF (mean time to failure, redundancy)
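A rough, illustrative page-access-time calculation combining the delays above; the numbers are assumptions within the stated ranges, not measurements:

```python
# Back-of-the-envelope page access time = seek + average rotational delay + transfer time.

def page_access_ms(seek_ms, rpm, transfer_mb_s, page_kb):
    max_rotational_ms = 60_000 / rpm           # one full revolution, in ms
    avg_rotational_ms = max_rotational_ms / 2  # on average half a revolution
    transfer_ms = page_kb / 1024 / transfer_mb_s * 1000
    return seek_ms + avg_rotational_ms + transfer_ms

# e.g. 9 ms seek, 7200 RPM, 100 MB/s transfer rate, 4 KB page
print(round(page_access_ms(9, 7200, 100, 4), 2))  # ~13.2 ms, dominated by seek + rotation
```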
- flash disk : secondary storage, caching layer
- random reads as fast as sequential, slow random writes
- organized in flash blocks (pages)
- retrieval time not related to location on disk
- access time : device organization (internal parallelism), software efficiency (driver), bandwidth of flash packages
- flash translation layer : FTL, complex device driver, tunes performance and device lifetime
- flash disk vs HDD
- buffer management : hides the fact that not all data is in RAM
- why not OS : abstraction is good but gets in the way of DBMS
- specialized prefetching
- control over buffer replacement policy : LRU not always the best
- control over thread/process scheduling : convoy problem (OS scheduling conflict with DBMS locking)
- control over flushing data to disk : WAL protocol requires flushing log entries to disk
- page request : buffer pool information table contains <frame#, pageid (disk location), pin_count, dirty>
- if page not in pool : choose a frame for replacement (only unpinned frames are candidates), write the replaced page if dirty, read requested page into chosen frame
- pin the page and return its address
- pre-fetch several pages : for sequential scans or similar queries
- CC and recovery : may require additional I/O (write-ahead log protocol)
- requestor : must unpin the page, indicate whether the page is dirty
- replacement policy : # of I/O's depending on access pattern
- LRU : least recently used, replace the unpinned page with the oldest access time, bad with sequential flooding (# buffer frames < # pages in file, each request causes an I/O, MRU better)
- MRU : most recently used
- clock : approximation of LRU, arrange frames into a cycle with one reference bit per frame, unpinning sets the reference bit, to replace advance until a frame whose bit is off is found (turning set bits off along the way), see the sketch below
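A minimal sketch of the clock policy above, assuming at least one unpinned frame always exists; names are illustrative:

```python
# Clock replacement sketch: one reference bit per frame; a pinned frame is never evicted.
class ClockBuffer:
    def __init__(self, n_frames):
        self.frames = [None] * n_frames    # page id held by each frame
        self.ref = [False] * n_frames      # reference bit, set when a page is unpinned
        self.pinned = [False] * n_frames
        self.hand = 0

    def unpin(self, frame):
        """Release a pin; the reference bit is set so the page survives one clock sweep."""
        self.pinned[frame] = False
        self.ref[frame] = True

    def find_victim(self):
        """Advance the hand until an unpinned frame with the reference bit off is found."""
        while True:
            if not self.pinned[self.hand]:
                if self.ref[self.hand]:
                    self.ref[self.hand] = False  # second chance
                else:
                    victim = self.hand
                    self.hand = (self.hand + 1) % len(self.frames)
                    return victim
            self.hand = (self.hand + 1) % len(self.frames)
```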
- backup
- tree-structured indexing : supports range searches and equality searches
- costly : maintaining sorted file, performing binary search, better use index
- B+ tree : most widely-used index, usually preferable to ISAM (modulo locking considerations)
- insert/delete : $\log_F N$ cost, keeps tree height-balanced ($F$ = fanout, $N$ = # leaf pages)
- order of tree : $d$, with $d\le m\le 2d$ where $m$ = # entries per node, typically $d=100$
- typical fill factor : 67%
- average fanout : $134$
- capacities : $F^{h}$ with $h$ the height, can often hold top levels in buffer pool (level 1 = 1 page = 8KB, level 2 = 134 pages = 1MB, level 3 = 17956 pages = 140MB), see the capacity sketch below
- minimum 50% occupancy (except root) : each node contains $d\le m\le 2d$ entries
- dynamic structure : each search goes from root to leaves
- insert
- find correct leaf $L$
- put data entry onto $L$, if not enough space split $L$ into $L$ and a new node $L2$, redistribute entries evenly, copy up middle key
- insert index entry pointing to $L2$ into parent of $L$
- split index node : recursively, redistribute entries evenly, push up middle key (no copy)
- split root : increases height, tree grows wider or one level taller at top
- data vs index page split : data entry copied up, index entry pushed up
- delete
- prefix key compression : important to increase fan-out, key values in index entries only direct traffic, truncate key for having each index entry greater than every key value (in any subtree)
- bulk loading : much faster than repeated inserts, can control fill factor, sequential storage
- order : replaced by "at least half-full" in practice
- index pages can typically hold many more entries than leaf pages
- variable sized records and search keys : different nodes will contain different numbers of entries
- real system : even sloppier, reclaim space only when a page is completely empty
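A quick sanity check of the capacity figures above (fanout 134, 8 KB pages) as a small calculation:

```python
# With average fanout F = 134 and 8 KB pages, level h of the tree holds F**(h-1) pages.
F, page_kb = 134, 8
for level in range(1, 4):
    pages = F ** (level - 1)                       # 1, 134, 17956 pages
    size_kb = pages * page_kb
    label = f"{size_kb / 1024:.0f} MB" if size_kb >= 1024 else f"{size_kb} KB"
    print(f"level {level}: {pages} pages = {label}")
# level 1: 1 pages = 8 KB; level 2: 134 pages = 1 MB; level 3: 17956 pages = 140 MB
```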
- streaming data through RAM : read a page into the input buffer (reload when consumed), compute $f(x)$ into the output buffer, write it out when full
- if $f$ compresses : many reads per write
- if $f$ decompresses : many writes per read
- two-way sorting : 3-page buffer (2 input pages for comparison, 1 output page)
- external merge sort : pivot is middle, $N$ pages, $\lceil\log_2 N\rceil + 1$ passes, total I/O cost $2N(\lceil\log_2 N\rceil + 1)$
- general external merge sort : $\lceil\log_{B-1} \lceil N/B\rceil\rceil + 1$ passes, total I/O cost $2N(\lceil\log_{B-1}\lceil N/B\rceil\rceil + 1)$, see the cost sketch below
- in-memory quicksort : select pivot, partition, recursively sort, combine, fastest but more passes
- in-memory heapsort : average run length $2(B-2)$
- B+ tree for sorting : B+ tree index on sorting columns, retrieve records in order by traversing leaf pages
- sorting networks : all operations planned in advance (data independent), works only on fixed size input, efficient parallel execution or GPU
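A small sketch computing the pass count and I/O cost formula above for the general external merge sort ($N$ pages, $B$ buffer pages); the example values are made up:

```python
import math

def external_sort_cost(n_pages, n_buffers):
    """Passes and total I/O for general external merge sort:
    pass 0 creates ceil(N/B) runs, each later pass merges B-1 runs."""
    runs = math.ceil(n_pages / n_buffers)
    passes = 1 + math.ceil(math.log(runs, n_buffers - 1)) if runs > 1 else 1
    return passes, 2 * n_pages * passes   # each pass reads and writes every page

print(external_sort_cost(108, 5))   # e.g. 108 pages, 5 buffers -> (4, 864)
```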
- smart query processing
- clever implementation techniques for operators
- exploiting equivalences of relational operations
- using statistics and cost models to choose among these
- query execution
- optimization : misnamed, goal is to pick a good plan with low expected cost, not the true optimum
- optimize relational operations
- selection : $\sigma$
- depends on indexes, access paths, expected size of the results
- selectivity : size of result estimated as size of table x reduction factor (based on stats)
- with no index, unsorted : scan whole relation, cost = # pages $M$
- with no index, sorted : cost of binary search + # pages containing results (selectivity x # pages)
- with an index on selection attribute : use index + retrieve corresponding data (hash index only useful for equality)
- unclustered refinement : find qualifying data entries, sort rids of data records to be retrieved, fetch rids in order, ensure at most 1 lookup per page
- conjunctive normal form CNF : OR clauses combined with AND
- 1st approach : find cheapest access path (fewest estimated I/Os), retrieve tuples using it, apply the terms that don't match the index (other terms are used to discard only)
- 2nd approach : for data entry alternatives 2 or 3, get sets of rids of data records using each matching index, intersect the sets, retrieve the records and apply remaining terms
- projection : $\pi$
- distinct : remove duplicates, scan table, extract only the needed attributes, sort, remove adjacent duplicates
- modify external sort algorithm : pass 0 eliminates unwanted fields, merge passes eliminate duplicates
- sorting (standard approach) : handles skew better and result is sorted
- hashing
- same cost : $M+2T$ with $M$ pages in the relation and $T$ pages after removing unneeded attributes
- if all attributes indexed : index-only scan, even better if indexed as prefix of search key
- join : $\Join$
- equality with one join column : $M$ pages in table A, $N$ pages in table B (costs compared in the sketch below)
- nested-loops join : for each tuple of A, for each tuple of B, good if inner table can fit in memory, cost p x M x N + M
- page-oriented nested loops join : for each page of A, for each page of B, compare each pair of tuples, cost M x N + M
- smaller relation as outer
- index-nested loops join : exploit the index in the inner loop, M + (M x p x cost of finding matching S tuples), cost of probing index is about 1.2 for hash, 2-4 for B+ tree, depends on clustering
- block nested loops join : one page as input buffer for scanning inner table, one page as output buffer, all remaining pages to hold a block of the outer table, # outer blocks = # outer pages / block size
- sort-merge join : sort both on the join column, scan and merge, useful if one of the inputs is already sorted or the output must be sorted
- hash join : build in-memory hash to speed the matching of tuples in second phase, recursive if not in memory
- hybrid hash join
- equality over several attributes : index nested-loop, combine index or use combination of results of sort-merge/hash-join
- inequality conditions : e.g. comparison, hash join and sort-merge join not applicable, use block nested-loops
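A sketch comparing the I/O cost formulas above for a few join algorithms; M, N, p, B and the probe cost are made-up illustrative values:

```python
import math

# M, N = # pages of outer/inner relation, p = tuples per outer page, B = buffer pages.

def page_nested_loops(M, N):
    # for each outer page, scan the whole inner relation
    return M + M * N

def block_nested_loops(M, N, B):
    # B-2 pages per outer block (1 input + 1 output buffer reserved)
    return M + math.ceil(M / (B - 2)) * N

def index_nested_loops(M, p, probe_cost):
    # probe an index on the inner relation once per outer tuple
    return M + M * p * probe_cost

M, N, p, B = 1000, 500, 100, 102
print(page_nested_loops(M, N))          # 501000
print(block_nested_loops(M, N, B))      # 6000
print(index_nested_loops(M, p, 1.2))    # 121000.0
```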
- union : $\cup$, special case of join, sort both relations and merge, or hash partition and build an in-memory hash table while scanning and discarding duplicates
- aggregation : sum, min, etc.
- without grouping : scanning the relation, or index-only scan
- with grouping : sort on group-by attributes, scan relation and compute aggregate for each group, also possible with hashing, index-only scan when index includes all fields
- buffer pool : memory as a cache, works but does not take into account advantages of memory characteristics
- main-memory database management system MMDBMS
- memory basics
- improving cache behavior : cache capacity, spatial and temporal locality (non random access, squeeze most useful info into a line), trade CPU for memory access
- data access
- memory map (mmap) : access scheduling and cache done by OS, e.g. MongoDB
- database organize everything : recognize query, keep historical access patterns, cache and schedule accordingly
- data organisation
- NSM : memory access quantum is a cache line, word-aligned (64 bytes)
- DSM : fixed length offsets (value = position x column offset), embedded tuple IDs
- data compression : reduce data size, produce fixed-length values, allow DBMS to query over compressed data
- lossless compression
- lossy compression : synopses (approximate data structures that use a sample of the original data)
- granularity
- block-level : compress block of tuples for same table
- tuple-level : compress content of entire tuple
- attribute-level : compress single attribute value within one tuple
- column-level : compress multiple values for one or more attributes for multiple tuples
- techniques
- run-length encoding : triplet (value, start, number of appearances)
- frame of reference (FOR)
- patching technique
- bit-vector encoding
- dictionary encoding : replace frequent patterns with smaller codes
- building
- compute dictionary for all tuples at a given point in time, new tuples use a separate dictionary or everything is recomputed
- merge new tuples with existing dictionary
- scope
- block-level : only includes a subset of tuples within a single table, lower compression but easier to add new tuples
- table-level : entire table, better ratio but expensive to update
- multi-table : sometimes helps with joins and set operations
- implementation
- hash table : fast, compact, unable to support range and prefix queries
- B+ tree : slower than hash table, takes more memory, support range and prefix queries
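Minimal sketches of two of the compression techniques above (run-length and dictionary encoding) over a made-up column of values:

```python
# Run-length encoding: (value, start position, number of appearances) triplets.
def rle_encode(column):
    runs, start = [], 0
    for i in range(1, len(column) + 1):
        if i == len(column) or column[i] != column[start]:
            runs.append((column[start], start, i - start))
            start = i
    return runs

# Dictionary encoding: replace values with small integer codes.
def dict_encode(column):
    codes = {v: c for c, v in enumerate(sorted(set(column)))}
    return codes, [codes[v] for v in column]

col = ["CH", "CH", "CH", "FR", "FR", "DE"]
print(rle_encode(col))    # [('CH', 0, 3), ('FR', 3, 2), ('DE', 5, 1)]
print(dict_encode(col))   # ({'CH': 0, 'DE': 1, 'FR': 2}, [0, 0, 0, 2, 2, 1])
```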
- building
- index : specialized one, T-tree or B+ tree, can be rebuilt on restart to avoid storing them
- equality encoding : all distinct values as separate bit
- range encoding : store a bit per range interval
- bit-sliced encoding : bitmap per each bit in every value
- cache sensitive search tree : good cache locality, similar to B+ tree, fit each node into L2 cache line
- adaptive indexing : cracking, incrementally build and refine indexes as part of query processing
- query processing and operators
- once cache misses are no longer the bottleneck : optimize data flow, data structures, algorithms, branch mispredictions instead
- existing equi-join methods
- sort-merge : bad since one of the relations will not fit in cache
- hash join : bad if inner relation cannot fit in cache
- clustered hash join : one pass to generate cache-sized partitions, bad if number of partitions exceeds number of cache lines (cache thrashing occurs)
- radix join
- optimization : cache-line misses, pointer chasing, predicate evaluation, data copying
- techniques : late materialization, compression, block iteration, efficient algorithms
- performance
- query optimization : given a query
- query first broken into blocks : nested block as subroutines
- each block converted to relational algebra
- for each block several alternative query plans are considered
- lowest estimated cost is selected
- select-project-join optimization (SPJ) : core of every query
- relational algebra equivalences
- selection : cascade, commute, push some operations first
- projection : cascade, push some operations first
- join : associative, commutative
- converting selection + cross-product to join
- selection on just attributes of one table commutes with join
- push down projection : project on id before making join
- query rewriting
- decorrelate subqueries : one execution instead of many (WHERE IN)
- flattening : convert query with nesting into one without (JOIN)
- cost estimation : what plans, cost of the plan, ideally find best, reality avoid worst
- system R optimizer : widely used, works well for under 10 joins, inexact
- cost estimation : statistics maintained in system catalogs used to estimate combination of CPU and I/O costs
- plan space : too large, must be pruned, only left-deep plans considered, avoid cross product
- for each plan and for each operation
- estimate cost : depends on cardinality
- estimate size of result : use information about input relations, assume independence of predicates
- cost boils down to a single number consisting of # I/O + factor x # CPU instructions
- catalogs : contains info about relations and indexes, updated periodically
- number of tuples : NTuples
- number of pages per relation : NPages
- number of distinct key values for each index : NKeys
- low/high key values for each index : Low/High
- index height for each tree index : IHeight
- number of index pages for each index : INPages
- sometimes histograms
- reduction factor (RF) : selectivity, reflects the impact of a term in reducing result size (see the estimate sketch below)
- uniform distribution assumption : crude, better to use histograms (equiwidth, equidepth)
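A small sketch of the result-size estimate above under the uniform-distribution and independence assumptions; the catalog numbers are made up:

```python
# Estimated result size = NTuples x product of reduction factors (independence assumed).

def estimate_result_size(n_tuples, reduction_factors):
    size = n_tuples
    for rf in reduction_factors:
        size *= rf
    return size

# col = value  -> RF = 1 / NKeys(index on col); col > value -> RF = (High - value) / (High - Low)
n_tuples = 40_000
rf_equality = 1 / 50                 # assuming NKeys = 50 distinct values
rf_range = (100 - 60) / (100 - 0)    # assuming Low = 0, High = 100, predicate col > 60
print(estimate_result_size(n_tuples, [rf_equality, rf_range]))  # 320.0
```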
- plan enumeration
- improved strategy : used in system R
- enumerate plans using $N$ passes ($N$ relations joined)
- pass 1 : find best 1-relation plan for each relation
- pass 2 : find best ways to join the result of each 1-relation plan, as outer, to another relation
- pass N. : etc.
- for each subset of relations, retain only : cheapest subplan overall (unordered) and cheapest subplan for each interesting order (order by, group by, join attributes)
- for each subplan retained : remember cost and result size estimate
- always push selections and projections as far down in the plans as possible
- enumerate plans using
- concurrent database access : isolate sensible query statements
- resilience to system failures : guarantee all or nothing
- transaction : sequence of one or more SQL operations treated as a unit
- ACID properties
- atomicity : all actions in the transaction happen or none happen
- logging : log all actions so it can undo if aborted, need for audit, efficient, good for crashes
- consists of records written sequentially : often archived on stable storage, chained together by transaction id
- steal buffer management : undo required if uncommitted data can overwrite stable version of committed data
- no force buffer management : redo required if transaction can commit before all its updates are on disk
- recorded : writes, commits, aborts
- write-ahead logging protocol : log must go to disk before the changed page, implemented via log manager and buffer manager handshake, all logs must be written for the transaction to be considered committed
- shadow pages
- cascading aborts : releasing locks too early forces other transactions' actions to be cancelled too
- consistency : if each transaction is consistent and the db starts consistent, it ends up consistent
- integrity constraints : if check fail, the transaction rolls back
- isolation : execution of one transaction is isolated from that of other transactions
- pessimistic isolation : do not let problems arise in the first place
- optimistic : assume conflicts are rare, deal with them after they happen
- interleaved execution anomalies
- reading uncommitted data : WR conflict, dirty reads
- unrepeatable read : RW conflict
- overwriting uncommitted data : WW conflict
- lock-based control : strict two-phase locking (Strict 2PL) protocol
- each transaction must obtain an S (shared) lock before reading and X (exclusive) lock before writing
- if transaction holds X, no other can acquire S or X
- if transaction holds S, no other can acquire X
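A tiny sketch of the S/X compatibility rules above (not a full lock manager; names are illustrative):

```python
# S (shared) and X (exclusive) compatibility: X conflicts with everything, S only with X.
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

def can_grant(requested, held_locks):
    """Grant only if the requested lock is compatible with every lock already held
    by other transactions on the same object."""
    return all(COMPATIBLE[(held, requested)] for held in held_locks)

print(can_grant("S", ["S", "S"]))  # True: several readers may share
print(can_grant("X", ["S"]))       # False: a writer must wait for readers
```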
- durability : if a transaction commits, its effects persist
- recovering from a crash
- analysis : scan log from most recent checkpoint, identify all transactions that were active during the crash
- redo : redo updates as needed to ensure all logged updates are carried out and written
- undo : undo writes of all transactions that were active at the time of the crash, working backwards in the log
- schedule
- serial : does not interleave actions of different transactions
- equivalent : effect of executing the first schedule is identical to the effect of executing the second schedule
- serializable : schedule that is equivalent to some serial execution of transactions
- conflict equivalence : most commonly used
- conflicting operations : two operations conflict if different transaction, same object and at least one of them is a write
- two schedules are conflict equivalent : iff they involve the same actions of the same transactions and every pair of conflicting actions is ordered the same way
- conflict serializable schedule S : if S is conflict equivalent to some serial schedule, ability to transform S into a serial schedule by swapping consecutive non-conflicting operations of different transactions
- dependency graph : one node per transaction, edge from $T_i$ to $T_j$ if an operation of $T_i$ conflicts with one of $T_j$ and happens earlier
- schedule is conflict serializable iff dependency graph is acyclic (see the sketch below)
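A minimal sketch of that test: build the dependency graph from a schedule and check it for cycles; transaction and object names are made up:

```python
# Schedule as (transaction, action, object) triples; edge Ti -> Tj whenever an
# operation of Ti conflicts with a later operation of Tj (same object, at least one write).
def dependency_graph(schedule):
    edges = set()
    for i, (ti, ai, oi) in enumerate(schedule):
        for tj, aj, oj in schedule[i + 1:]:
            if ti != tj and oi == oj and "W" in (ai, aj):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def visit(node, stack):
        if node in stack:
            return True
        return any(visit(n, stack | {node}) for n in graph.get(node, ()))
    return any(visit(n, frozenset()) for n in graph)

s = [("T1", "R", "A"), ("T2", "W", "A"), ("T2", "W", "B"), ("T1", "W", "B")]
g = dependency_graph(s)   # {('T1', 'T2'), ('T2', 'T1')}
print(has_cycle(g))       # True -> not conflict serializable
```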
- two phase locking : sufficient to guarantee conflict serializability, but subject to cascading aborts
- lock management and deadlocks
- lock manager : contains entry for each currently held lock
- lock table entry : pointer to list of transactions currently holding locks, type of lock held, pointer to queue of lock requests
- lock upgrade : a transaction holding a shared lock can upgrade it to an exclusive lock
- deadlock : cycle of transactions waiting on each other release
- prevention
- detection : create waits-for graph, nodes for transactions, edge from $T_i$ to $T_j$ if $T_i$ is waiting for $T_j$ to release a lock
- timeout : bad
- locking granularity : allow transactions to lock at each level
- intention lock special protocol : before locking, a transaction must hold proper intention locks on all its ancestors in the granularity hierarchy
- each transaction starts at root : database, tables, pages, tuples
- release in bottom up order
- IS : intent to get S lock(s) at finer granularity, must hold IS or IX on parent
- IX : intent to get X lock(s) at finer granularity, must hold IX or SIX on parent
- SIX mode : S and IX at the same time, allows more concurrent transactions
- lock escalation : dynamically asks for coarser-grained locks when too many low level locks acquired
- phantom tuples : tuples that appear between two checks, no equivalent serial execution
- predicate locking : grant lock on all records that satisfy some logical predicate, big overhead
- index locking : special case of predicate locking
- dense index : lock index pages where the data entry would be
- no suitable index : lock on every page and lock for the file itself
- replication : copies data and use it on crashes, good for load balancing
- eventually consistent : if no further changes, eventually all copies will have the same value
- write success : only if predetermined number of servers agree on same value
- N : # server storing data copies
- R : # server needs to agree on a value for read operations
- W : # server needs to agree on a value for write operations
- sacrifice consistency for faster response time
- strong consistency : R + W > N
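A one-line check of that condition with example values:

```python
# Quorum configuration check: reads see the latest write when R + W > N.
def is_strongly_consistent(N, R, W):
    return R + W > N

print(is_strongly_consistent(N=3, R=2, W=2))  # True
print(is_strongly_consistent(N=3, R=1, W=1))  # False: eventual consistency only
```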