CS322

GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.

Spring 2016: Introduction to Database Systems

[TOC]

Introduction

  • data : facts and the basis for reasoning; may be useful, irrelevant, or redundant
  • information : processed and organised data; meaningful, relevant, actionable
  • study database
    • corporate : supply chain management, data analytics, data science
    • scientific : digital humanities, human brain project, sensor network
  • database management system DBMS : software system designed to store, manage, facilitate access to databases
    • efficient : thousands of queries/updates per second
    • reliable : guarantees for 24x7 availability
    • convenient : physical data independence, declarative high-level query languages
    • safe : protects data from failures (software, hardware, power, malicious)
    • multi-user : concurrency control
    • massive : extremely large volumes, terabytes every day
    • persistent : data outlives the programs that operate on it
    • ad hoc queries : formed or used for specific or immediate problems or needs
  • database DB : a large, integrated, structured collection of data
    • entities vs relationships : usually intended to model some real-world enterprise
    • WWW != DB : correct answer not defined, cannot manipulate data, few guarantees
    • search vs query : what's stored vs answers
    • file system != DB : does not specify how changes survive concurrent editing or power outages
  • database design
    • requirements analysis : users' needs
    • conceptual design : high level description, ER model
    • logical design : translate ER into DBMS data model
    • schema refinement : consistency, normalization
    • physical design : indexes, disk layout
    • security design : who accesses what
  • data independence : ability to change the schema at one level of the DB system without changing the schema at the next higher level
  • logical data independence : capacity to change the conceptual schema without changing the user views
  • physical data independence : capacity to change the internal schema without having to change the conceptual schema or user views

Entity-relationship model

  • data model : collection of concepts for describing high-level data
    • relational : set of records
      • relation : table with rows and cols
      • schema (types) : describes the structure of a relation
    • hierarchical : nested data models
    • graph
    • object-oriented
  • relational data model : rows and cols with keys and foreign keys to link relations
    • entity : real-world object distinguishable from others, described as a set of attributes
    • entity set : collection of similar entities, each entity has a key, each attribute has a domain
    • relationship : association among two or more entities, can have attributes
    • relationship set : collection of similar relationships
    • integrity constraints
  • key constraints
    • primary key : underline
    • partial key : dashed underline
    • arrow : at most one
  • participation constraints
    • total participation : at least
  • weak entities : can be identified uniquely only by considering the primary key of another entity (the owner), the owner has a one-to-many relationship with it, identified by a partial key combined with the owner's primary key
  • ternary relationships : 3 linked entity sets (in * or T)
  • n-ary relationships : n linked entity sets
  • complex hierarchies
    • ISA (triangle) : attributes are inherited
      • good for : adding descriptive attributes to a subclass, identifying entities that participate in a particular relationship
      • overlap constraints (allow/disallow) : can an entity belong to multiple child entity sets at the same time
      • covering constraints (yes/no) : must every parent entity belong to some child entity set
    • aggregation : treat a relationship set as an entity set
  • entity vs attribute
    • entity : if used several times (by several other entities)
    • entity : if its structure is important

Relational model

  • database : set of named relations (tables)
  • relations : set of named attributes (columns)
  • tuple (row) : a value for each attribute
  • cardinality : # rows
  • arity/degree : # attributes
  • attributes : has a domain (type)
  • relation : set of tuples
  • null value : unknown, undefined
  • SQL
    • DDL : data definition language
    • DML : data manipulation language
  • keys : set of attributes with unique value in each tuple
    • candidate keys : unique, use only the necessary attributes (minimal)
    • foreign keys : set of attributes used to refer to another tuple in another relation
      • rejects insertion if the referred tuple does not exist
      • referential integrity : all foreign key constraints enforced
      • on deletion : cascade, set null, or do nothing (see the sketch at the end of this section)
  • integrity constraints (ICs) : condition that must be true for any instance
    • specified on schema definitions
    • checked on relations modifications
    • legal instance : satisfies all ICs
    • weak entity : on delete cascade
    • not null
  • translating ISA
    • join with the parent relation when needed
    • or define all attributes in each child relation
  • NoSQL : schema flexibility, fewer constraints, expensive queries
    • object-centric representation
    • key value data model
    • hierarchies, arrays
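
A minimal sketch of the foreign-key behaviour listed above (reject insertions whose referenced tuple is missing, cascade or set null on deletion). The tables, attribute names, and the dict-based representation are made up for illustration:

```python
# Toy referential-integrity check; relations are dicts keyed by primary key.
students = {1: {"name": "Ada"}, 2: {"name": "Alan"}}     # referenced relation
enrolled = {(1, "CS322"): {"grade": None}}               # referencing relation (sid, cid)

def insert_enrollment(sid, cid):
    if sid not in students:                              # foreign key must exist
        raise ValueError("insert rejected: referenced student does not exist")
    enrolled[(sid, cid)] = {"grade": None}

def delete_student(sid, policy="cascade"):
    students.pop(sid, None)
    for key in [k for k in enrolled if k[0] == sid]:
        if policy == "cascade":                          # delete referencing tuples
            del enrolled[key]
        elif policy == "set null":                       # keep tuple, null the reference
            enrolled[(None, key[1])] = enrolled.pop(key)

insert_enrollment(2, "CS322")
delete_student(1)                                        # cascades to (1, "CS322")
```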

Relational algebra

  • query languages : manipulation and retrieval of data, strong formal foundation, easy and efficient access to large data sets, not Turing complete
  • relational algebra : operational, useful for representing execution plans
  • relational calculus : describe what you want not how, declarative
  • relation instance : schemas of input and output are fixed
    • positional field notation : easier for formal definition
    • named field notation : more readable
  • operations : with composition (closed)
    • selection ($\sigma$) : selects subset of rows (horizontal), distinct
    • projection ($\pi$) : retains only wanted columns (vertical), eliminate duplicates
    • set-difference ($-$) : same number of fields and same types
    • union ($\cup$) : same number of fields and same types
    • intersection ($\cap$) : $R\cap S= R-(R-S)$
    • renaming ($\rho$)
    • cross-product ($\times$) : combine two relations, need for rename in case of field conflicts
    • division ($/$) : compute all values not disqualified by some value in the divisor, useful for expressing "for all" queries
  • joins
    • natural ($\Join$) : compute the cross product, select rows that satisfy equality on the common attributes, project so that each common attribute appears only once
    • condition/theta ($\Join_c$)
    • equi-join
  • normalized relation : factor values out into a weak entity to save space
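
A small sketch of the core operators over relations represented as lists of dicts (the sailors/reserves data and helper names are invented); it mirrors selection, projection, and natural join as defined above:

```python
def select(pred, r):                        # sigma: keep rows satisfying the predicate
    return [t for t in r if pred(t)]

def project(attrs, r):                      # pi: keep wanted columns, drop duplicates
    seen, out = set(), []
    for t in r:
        row = tuple((a, t[a]) for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(dict(row))
    return out

def natural_join(r, s):                     # cross product + equality on common attributes
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [{**tr, **ts} for tr in r for ts in s
            if all(tr[a] == ts[a] for a in common)]

sailors = [{"sid": 1, "rating": 7}, {"sid": 2, "rating": 9}]
reserves = [{"sid": 1, "bid": 101}]
print(project(["sid"], select(lambda t: t["rating"] > 5, sailors)))
print(natural_join(sailors, reserves))      # [{'sid': 1, 'rating': 7, 'bid': 101}]
```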

File organization & indexing

  • buffer pool management : policy to move pages between RAM and disk
  • unit
    • I/O : transfer page
    • records : high level of DBMS
  • system catalog : stored themselves as relations
    • for each relation : name, file name, file structure (heap), attribute names and types, index names, integrity constraints
    • for each index : structure (tree) and search key fields
    • for each view : view name, definition
    • stats, authorization, buffer pool size, etc.
  • records storing
    • format : depending on field type
      • fixed-length records : same field types for all records in the file, field offsets can be computed arithmetically
      • variable-length records : offset table (best when field sizes change) vs special delimiter symbols
    • page : bunch of records
    • records identification (rid) : page id. + slot number (in page)
    • row vs column store
    • NSM : n-ary storage model, row store
    • decomposition storage model (DSM) : column subfile
    • partition attributes across (PAX)
  • quick cost model : # of disk I/O, ignore CPU costs and prefetching, average case
  • file organization : collection of pages
    • file (traditional slotted pages) : collection of pages containing collections of records, support insert, delete, modify, scan all, select
    • heap files : suitable when scanning to retrieve all records, simple, no order, disk pages allocated and freed as the file grows and shrinks
    • sorted files : best for retrieval in some order or range of records
  • indexes : speed up selections on search key fields for index
    • search conditions : <search key, comparison operator>
    • index data entries : depend on the indexing technique and the data, can have many indexes per file
      • actual data record with key value : index structured as a file, saves pointer lookups but expensive for addition and deletion, at most one index per file can use this alternative, implies clustered
      • <key value, rid of matching data record> : easier to maintain
      • <key value, list of rids of matching data records> : more flexible but variable length, could span multiple pages
  • index classification
    • clustered (vs unclustered) : order of data records is close to order of index data entries
      • clustered implied by index data entries directly with values
      • at most one search key
      • clustered cost : # pages in file with matching records
      • unclustered cost : # of matching index data entries
      • pros : efficient for ranges, may do some compression
      • cons : expensive to maintain
    • primary (vs secondary) index : index key includes file primary key (vs any other index), primary never contains duplicates
    • dense (vs sparse) : at least one data entry per key value (vs an entry per data page in file, clustered)
      • every sparse index is clustered
      • dense index implied by index data entries directly with values
    • composite search key : search on a combination of fields
      • equality query : every field equal to constant value
      • range query : some field value is not a constant
      • lexicographical order : usually
    • indexing technique
      • tree based : good for ranges, hierarchical, B+ trees (all root-to-leaf paths have equal length), R trees
      • hash based : good for equality, file is collection of buckets
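
A rough sketch of the clustered vs unclustered cost rules above, counting disk I/Os only; the file size, tuples per page, and selectivity are hypothetical numbers:

```python
# Approximate I/O costs for a selection, following the notes:
#   full scan         ~ all pages in the file
#   clustered index   ~ pages containing matching records
#   unclustered index ~ one I/O per matching data entry (worst case)
def scan_cost(n_pages):
    return n_pages

def clustered_index_cost(n_pages, selectivity):
    return max(1, int(n_pages * selectivity))

def unclustered_index_cost(n_tuples, selectivity):
    return max(1, int(n_tuples * selectivity))

n_pages, tuples_per_page, selectivity = 10_000, 100, 0.01
print(scan_cost(n_pages))                                              # 10000
print(clustered_index_cost(n_pages, selectivity))                      # 100
print(unclustered_index_cost(n_pages * tuples_per_page, selectivity))  # 10000, as bad as a scan
```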

The storage layer

  • storage hierarchy
    • main-memory database : smaller size, performance optimized, volatility, expensive
    • flash DBMS : main storage, accelerator (specialized cache, logging device)
  • DBMS with disks : costly READ and WRITE memory operations
    • disk vs tape : random access (about 1000x slower than RAM) vs sequential access only (sequential about 10x faster than random)
    • disk blocks (pages) : unit of data, multiple of the sector size, retrieval time varies with location on disk
    • anatomy
    • accessing a disk page delays
      • seek time : moving the arm to position the disk head on a track, dominated by settle time (moving to one of many nearby tracks, $D$ larger with disk track density), 1-20ms
      • max rotational delay : time for one full revolution, 60000/RPM in ms
      • transfer rate : sectors per track / max rotational delay
      • rotational delay : waiting for the block to rotate under the head, 0-10ms
      • transfer time : moving data to/from the disk surface, < 1ms for a 4KB page
      • shared disks : most of time spent waiting in queue for access
    • arranging pages on disk
      • next block : blocks on the same track (first), then on the same cylinder, then on an adjacent cylinder; proximity gives almost constant access time, recent HDDs have about 100 "next" neighbors
      • blocks in a file : arranged by adjacent (next) block to minimize seek and rotational delay
    • pre-fetching
    • rules : memory access much faster than disk I/O (1000x), sequential I/O faster than random I/O (10x)
    • space management : lower layer manage space on disk, higher level (de)allocate or read/write a page
  • disk arrays : RAID, higher throughput (data striping), longer MTTF (mean time to failure, redundancy)
  • flash disk : secondary storage, caching layer
    • random reads as fast as sequential, slow random writes
    • organized in flash blocks (pages)
    • retrieval time not related to location on disk
    • access time : device organization (internal parallelism), software efficiency (driver), bandwidth of flash packages
    • flash translation layer : FTL, complex device driver, tunes performance and device lifetime
  • flash disk vs HDD
  • buffer management : hides the fact that not all data is in RAM
    • why not OS : abstraction is good but gets in the way of DBMS
      • specialized prefetching
      • control over buffer replacement policy : LRU not always the best
      • control over thread/process scheduling : convoy problem (OS scheduling conflict with DBMS locking)
      • control over flushing data to disk : WAL protocol requires flushing log entries to disk
    • page request : buffer pool information table contains <frame#, pageid (disk location), pin_count, dirty>
      • if page not in pool : choose a frame for replacement (only unpinned frames are candidates), write the replaced page back if it is dirty, read the requested page into the chosen frame
      • pin the page and returns its address
      • pre-fetched several pages : sequential scans or similar queries
      • CC and recovery : may require additional I/O (write-ahead log protocol)
    • requestor : must unpin the page, indicate whether the page is dirty
    • replacement policy : # of I/O's depending on access pattern
      • LRU : least recently used, replace the unpinned page with the oldest access time, bad with sequential flooding (#buffer frames < #pages in file, each request causes an I/O, MRU is better)
      • MRU : most recently used
      • clock : approximation of LRU, arrange frames in a cycle with one reference bit per frame; unpinning turns a frame's reference bit on; to replace, advance until a frame with its bit off is found, turning bits off along the way (see the sketch at the end of this section)
  • backup
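
A minimal sketch of the clock replacement policy referenced above; pin counts and dirty bits are left out, and the class name and page labels are invented:

```python
# Clock (second-chance) replacement: frames form a cycle, one reference bit each.
class Clock:
    def __init__(self, n_frames):
        self.pages = [None] * n_frames       # page held by each frame
        self.ref = [False] * n_frames        # reference bit per frame
        self.hand = 0

    def access(self, page):
        if page in self.pages:               # hit: set the reference bit
            self.ref[self.pages.index(page)] = True
            return
        while True:                          # miss: advance until a bit is off
            if self.ref[self.hand]:
                self.ref[self.hand] = False  # turn the bit off and keep sweeping
            else:
                self.pages[self.hand] = page # victim found: replace it
                self.ref[self.hand] = True
                self.hand = (self.hand + 1) % len(self.pages)
                return
            self.hand = (self.hand + 1) % len(self.pages)

pool = Clock(3)
for p in ["A", "B", "C", "A", "D"]:          # "D" misses: the sweep clears all bits, evicts frame 0
    pool.access(p)
print(pool.pages)                            # ['D', 'B', 'C']
```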

Tree-structured indexing

  • tree-structured indexing : support range searches, equality searches
    • costly : maintaining sorted file, performing binary search, better use index
  • B+ tree : most widely-used index, usually preferable to ISAM (modulo locking considerations)
    • insert/delete : $\log_F N$ cost, keeps the tree height-balanced ($F$ = fanout, $N$ = # leaf pages)
    • order of tree : $d$, with $d\le m\le 2d$ where $m$ entries, typically $d=100$
    • typical fill factor : 67%
    • average fanout : $134$
    • capacities : $F^{h}$ with $h$ height, can often hold top levels in buffer pool (level 1 = 1 page = 8KB, level 2 = 134 pages = 1MB, level 3 = 17956 pages = 140MB)
    • minimum 50% occupancy (except root) : each node contains $d\le m\le 2d$
    • dynamic structures : from root to leaves for each search
    • insert
      • find correct leaf $L$
      • put data entry onto $L$, if not enough space
        • split into $L$ and a new node $L2$
        • redistribute entries evenly, copy up middle key
        • insert index entry pointing to $L2$ into parent of $L$
      • split index node : recursively, redistribute entries evenly, push up middle key (no copy)
      • split root : increase height, wider or one level taller at top
      • data vs index page split : data copied up, index push up
    • delete
      • start at root, find leaf $L$ where entry belongs
      • remove the entry; if the leaf has only $d-1$ entries (no longer at least half full)
        • try to redistribute, borrowing from sibling
        • if redistribution fails, merge $L$ and sibling, delete entry pointer from parent, could propagate to root and decrease height
  • prefix key compression : important to increase fan-out, key values in index entries only direct traffic, so keys can be truncated as long as each index entry still separates the key values in the subtrees around it
  • bulk loading : much faster than repeated inserts, can control the fill factor, leaves stored sequentially
    • sort all data entries
    • insert a pointer to the first leaf page into a new (root) page
    • index entries for leaf pages are always entered into the right-most index page just above the leaf level
  • order : in practice the order $d$ is replaced by an "at least half-full" criterion
    • index pages can typically hold many more entries than leaf pages
    • variable sized records and search keys : different nodes will contain different numbers of entries
    • real systems : even sloppier, reclaim space only when a page is completely empty
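
A back-of-the-envelope check of the capacity figures above (fanout 134 after the 67% fill factor); the helper functions are only for illustration:

```python
import math

# Pages reachable per level of a B+ tree with average fanout F: level h holds F**h pages.
def pages_at_level(fanout, level):
    return fanout ** level

def levels_needed(n_leaf_pages, fanout):
    return math.ceil(math.log(n_leaf_pages, fanout))   # index levels above the leaves

F = 134
for level in range(4):
    print(level, pages_at_level(F, level))              # 1, 134, 17956, 2406104
print(levels_needed(1_000_000, F))                      # 3 index levels cover 1M leaf pages
```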

External sorting

  • streaming data through RAM : read a page from input (reloading as pages are consumed), compute $f(x)$ into an output buffer, write when the buffer is full
    • if $f$ compress : read many per write
    • if $f$ decompress : write many per read
  • two-way sorting : 3 buffer pages (2 input for comparison, 1 output)
    • external merge sort : pivot is the middle, $N$ pages, $\lceil\log_2 N\rceil + 1$ passes, total I/O cost $2N(\lceil\log_2 N\rceil + 1)$
  • general external merge sort : $\lceil\log_{B-1} \lceil N/B\rceil\rceil +1$ passes, total I/O cost $2N(\lceil\log_{B-1}\lceil N/B\rceil\rceil + 1)$ (see the sketch at the end of this section)
    • read blocks (chunks) of pages sequentially : into each buffer, potentially reduces fan-in during the merge passes
    • double buffering : reduces wait time for I/O requests (output), potentially more passes, can prefetch into a shadow block
  • in-memory quicksort : select a pivot, partition, recursively sort, combine; fastest in memory but leads to more passes
  • in-memory heapsort : produces runs of average length $2(B-2)$
  • B+ tree for sorting : B+ tree index on sorting columns, retrieve records in order by traversing leaf pages
    • clustered : good idea
    • unclustered : bad, up to one I/O per data record, external sorting is almost always better
  • sorting networks : all operations planned in advance (data independent), work only on fixed-size inputs, efficient parallel execution or GPU
    • comparator : given two inputs, outputs the min and the max
    • zero-one principle
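
A quick sketch of the pass-count and total I/O formulas for general external merge sort given above, with a hypothetical file of 10,000 pages and 10 buffer pages:

```python
import math

# Pass 0 produces ceil(N/B) sorted runs of B pages; each later pass merges B-1 runs,
# and every pass reads and writes each of the N pages once.
def n_passes(N, B):
    return math.ceil(math.log(math.ceil(N / B), B - 1)) + 1

def total_io(N, B):
    return 2 * N * n_passes(N, B)

N, B = 10_000, 10
print(n_passes(N, B), total_io(N, B))   # 5 passes, 100000 page I/Os
```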

Query processing with relation operations

  • smart query processing
    • clever implementation techniques for operators
    • exploiting equivalences of relational operations
    • using statistics and cost models to choose among them
  • query execution
  • optimization : misnamed, picks a good plan with low expected cost, not necessarily the best one
  • optimize relational operations
    • selection : $\sigma$
      • depends on indexes, access paths, expected size of the results
      • selectivity : size of result estimated as size of table x reduction factor (based on stats)
      • with no index, unsorted : scan the whole relation, cost = number of pages $M$
      • with no index, sorted : cost of binary search + number of pages containing results (selectivity x #pages)
      • with an index on the selection attribute : use the index + retrieve the corresponding data (a hash index is only useful for equality)
        • unclustered refinement : find qualifying data entries, sort rids of data records to be retrieved, fetch rids in order, ensure at most 1 lookup per page
      • conjunctive normal form CNF : OR-clauses (disjunctions of terms) connected by AND
      • 1st approach : find the cheapest access path (fewest estimated I/Os), retrieve tuples using it, apply the terms that don't match the index on the fly (those terms only discard tuples)
      • 2nd approach : for data entry alternatives 2 or 3, get the sets of rids of data records using each matching index, intersect the sets, retrieve the records and apply the remaining terms
    • projection : $\pi$
      • distinct : remove duplicates; scan the table, extract only the needed attributes, sort, remove adjacent duplicates
      • modified external sort algorithm : pass 0 eliminates unwanted fields, the merge passes eliminate duplicates
        • standard approach : handles skew better and the result is sorted
      • hashing
        • partitioning : read the table using one input buffer, for each tuple discard unwanted fields and apply a hash function to choose one of $B-1$ output buffers
        • duplicates : for each partition build an in-memory hash table while discarding duplicate tuples, if a partition does not fit in memory apply the scheme recursively
      • same cost : $M+2T$ with $M$ pages and $T$ pages of projected tuples (unwanted fields removed)
      • if all wanted attributes are indexed : index-only scan, even better if they form a prefix of the search key
    • join : $\Join$
      • equality with one join column : $M$ pages for outer table R, $N$ pages for inner table S
        • nested-loops join : for each tuple of R, for each tuple of S; good if the inner table fits in memory, cost M + p x M x N
        • page-oriented nested loops join : for each page of R, for each page of S, compare every tuple pair; cost M + M x N
        • use the smaller relation as the outer
        • index nested loops join : exploit an index on the inner relation, cost M + M x p x (cost of finding matching S tuples); probing the index costs about 1.2 I/Os for hash, 2-4 for B+ tree, retrieval depends on clustering
        • block nested loops join : one page as input buffer for scanning the inner table, one page as output buffer, all remaining pages hold a block of the outer table; # outer blocks = ⌈# outer pages / blocksize⌉ (cost sketch at the end of this section)
        • sort-merge join : sort both on the join column, then scan and merge; useful if one of the inputs is already sorted or the output must be sorted
          • cost : sort R + sort S + M + N
          • better : less sensitive to data skew, result is sorted, faster if the inputs are already sorted
        • hash join : build in-memory hash to speed the matching of tuples in second phase, recursive if not in memory
          • partitioning : 2(M+N)
          • matching : M+N
          • better : if relation size differ greatly, highly parallelizable
        • hybrid hash join
      • equality over several attributes : index nested loops, combine indexes, or use a combination of the results of sort-merge/hash join
      • inequality conditions : e.g. comparisons; hash join and sort-merge join are not applicable, use block nested loops
    • union : $\cup$, special case of join; sort both relations and merge, or hash partition and build an in-memory hash table, scanning and discarding duplicates
    • aggregation : sum, min
      • without grouping : scan the relation, or use an index-only scan
      • with grouping : sort on the group-by attributes, scan the relation and compute the aggregate for each group; also possible with hashing, or an index-only scan when the index includes all needed fields
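
A sketch gathering the join cost formulas above in one place; M and N are page counts, B is a made-up buffer size, and the helper names are invented:

```python
import math

# I/O cost estimates for R join S (M, N pages; B buffer pages).
def page_nested_loops(M, N):
    return M + M * N

def block_nested_loops(M, N, B):
    return M + math.ceil(M / (B - 2)) * N        # B-2 pages hold each outer block

def sort_merge(M, N, B):
    ext_sort = lambda P: 2 * P * (math.ceil(math.log(math.ceil(P / B), B - 1)) + 1)
    return ext_sort(M) + ext_sort(N) + M + N     # assumes a single merge pass at the end

def hash_join(M, N):
    return 2 * (M + N) + (M + N)                 # one partitioning pass + matching phase

M, N, B = 1000, 500, 100
print(page_nested_loops(M, N))                   # 501000
print(block_nested_loops(M, N, B))               # 6500
print(sort_merge(M, N, B))                       # 7500
print(hash_join(M, N))                           # 4500
```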

Main-memory databases

  • buffer pool : treats memory as a cache, works but does not take advantage of memory characteristics
  • main-memory database management system MMDBMS
  • memory basics
  • improving cache behavior : cache capacity, spatial and temporal locality (non random access, squeeze most useful info into a line), trade CPU for memory access
  • data access
    • memory map (mmap) : access scheduling and cache done by OS, e.g. MongoDB
    • database organize everything : recognize query, keep historical access patterns, cache and schedule accordingly
  • data organisation
    • direct memory pointers : rather than rids
    • fixed and variable length data pools : no dead space
    • block checksums : detect software errors
  • memory access quantum NSM : cache line, word-aligned (64 bytes)
  • DSM : fixed length offsets (value = position x column offset), embedded tuple IDs
  • data compression : reduce data size, produce fixed-length values, allow DBMS to query over compressed data
    • lossless compression
    • lossy compression : synopses (approximate data structures that use a sample of the original data)
    • granularity
      • block-level : compress block of tuples for same table
      • tuple-level : compress content of entire tuple
      • attribute-level : compress single attribute value within one tuple
      • column-level : compress multiple values for one or more attributes for multiple tuples
    • techniques
      • run-length encoding : triplet (value, start position, number of appearances) (see the sketch at the end of this section)
      • frame of reference (FOR)
      • patching technique
      • bit-vector encoding
      • dictionary encoding : replace frequent patterns with smaller codes
        • building
          1. : compute the dictionary for all tuples at a given point in time, new tuples use a separate dictionary or everything is recomputed
          2. : merge new tuples with the existing dictionary
        • scope
          1. block-level : only includes a subset of the tuples within a single table, lower compression but easier to add new tuples
          2. table-level : entire table, better ratio but expensive to update
          3. multi-table : sometimes helps with joins and set operations
        • implementation
          1. hash table : fast, compact, unable to support range and prefix queries
          2. B+ tree : slower than hash table, takes more memory, support range and prefix queries
  • index : specialized one, T-tree or B+ tree, can be rebuilt on restart to avoid storing them
    1. columnar index
    2. bitmap index
    • equality encoding : a separate bitmap for each distinct value
    • range encoding : store a bit per range interval
    • bit-sliced encoding : bitmap per each bit in every value
    • cache sensitive search tree : good cache locality, similar to B+ tree, fit each node into L2 cache line
      • increase fan-out : converting variable length keys to fixed length, eliminating child pointers
    • adaptive indexing : cracking, incrementally build and refine indexes as part of query processing
  • query processing and operators
    • once cache misses are no longer the bottleneck : optimize data flow, data structures, algorithms, and branch mispredictions instead
    • existing equi-join methods
      • sort-merge : bad since one of the relations will not fit in the cache
      • hash join : bad if inner relation cannot fit in cache
      • clustered hash join : one pass to generate cache-sized partitions, bad if the number of partitions exceeds the number of cache lines (cache thrashing occurs)
      • radix join
        • join matching partitions : nested loops for small partitions (< 8 tuples)
        • beats conventional join methods : savings on cache misses > extra partitioning cost
        • multi-pass radix cluster
  • optimization : cache-line misses, pointer chasing, predicate evaluation, data copying
  • techniques : late materialization, compression, block iteration, efficient algo
  • performance
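
A minimal sketch of the run-length encoding scheme mentioned in the compression list above, over a made-up sorted column:

```python
# Run-length encode a column into (value, start, count) triplets; works best on
# sorted or low-cardinality columns, as in a column store.
def rle_encode(column):
    runs = []
    for i, v in enumerate(column):
        if runs and runs[-1][0] == v:
            value, start, count = runs[-1]
            runs[-1] = (value, start, count + 1)
        else:
            runs.append((v, i, 1))
    return runs

def rle_decode(runs):
    return [value for value, _, count in runs for _ in range(count)]

col = ["CH", "CH", "CH", "DE", "FR", "FR"]
encoded = rle_encode(col)
print(encoded)                          # [('CH', 0, 3), ('DE', 3, 1), ('FR', 4, 2)]
assert rle_decode(encoded) == col
```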

Relational query optimization

  • query optimization : given a query
    1. query first broken into blocks : nested blocks treated as subroutines
    2. each block converted to relational algebra
    3. for each block several alternative query plans are considered
    4. lowest estimated cost is selected
  • select-project-join optimization (SPJ) : core of every query
  • relational algebra equivalence
    • selection : cascade, commute, push some operations first
    • projection : cascade, push some operations first
    • join : associative, commutative
      • converting selection + cross-product to join
      • selection on just attributes of one table commutes with join
      • push down projection : project onto the needed attributes (e.g. ids) before joining
  • query rewriting
    • decorrelate subqueries : one execution instead of many (WHERE IN)
    • flattening : convert query with nesting into one without (JOIN)
  • cost estimation : what plans, cost of the plan, ideally find best, reality avoid worst
  • system R optimizer : widely used, works well for fewer than 10 joins, inexact
    • cost estimation : statistics maintained in system catalogs used to estimate combination of CPU and I/O costs
    • plan space : too large, must be pruned, only left-deep plans considered, avoid cross product
    • for each plan and for each operation
      • estimate cost : depends on cardinality
      • estimate size of result : use information about the input relations, assume independence of predicates
      • cost boils down to a single number consisting of # I/Os + factor x # CPU instructions
    • catalogs : contains info about relations and indexes, updated periodically
      • number of tuples : NTuples
      • number of pages per relation : NPages
      • number of distinct key values for each index : NKeys
      • low/high key values for each index : Low/High
      • index height for each tree index : IHeight
      • number of index pages for each index : INPages
      • sometimes histograms
    • reduction factor (RF) : selectivity, reflects the impact of a term in reducing the result size (estimation sketch at the end of this section)
      • for selection
      • for joins
    • uniform distribution assumption : crude, better to use histograms (equiwidth, equidepth)
  • plan enumeration
    • single relation plans : each available access path (file scan/index) considered
    • multiple relation plans : restrict search space, only left-deep trees (fully pipelined join without intermediate value)
    • hard way : NP-complete
      1. select the order of relations : at most $N!$
      2. for each join, select a join algorithm : # join algorithms to the power $N-1$
      3. for each input relation, select an access method : with $I$ indexes per relation, $(I+1)^N$
  • improved strategy : used in system R
    • enumerate plans using $N$ passes ($N$ relations joined)
      • pass 1. : find best 1 relation plans for each relation
      • pass 2. : find best ways to join the result of each 1-relation plan (as outer) to another relation
      • pass N. : etc.
    • for each subset of relations, retain only : cheapest subplan overall (unordered) and cheapest subplan for each interesting order (order by, group by, join attributes)
    • for each subplan retained : remember its cost and result size estimate
    • always push selections and projections as far down in the plans as possible
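
A sketch of how the reduction factors above turn catalog statistics into a result-size estimate; the query, table, and catalog values are hypothetical:

```python
# Reduction factors (selectivity) in the style of the System R cost model.
def rf_equality(nkeys):                 # col = value  ->  1 / NKeys(I)
    return 1.0 / nkeys

def rf_range(value, low, high):         # col > value  ->  (High - value) / (High - Low)
    return (high - value) / (high - low)

def estimate_result_size(ntuples, reduction_factors):
    size = ntuples
    for rf in reduction_factors:        # independence assumption: multiply the RFs
        size *= rf
    return int(size)

# e.g. a selection "rating = 7 AND age > 30" on a 40,000-tuple table
ntuples, nkeys_rating, low_age, high_age = 40_000, 10, 20.0, 70.0
print(estimate_result_size(ntuples, [rf_equality(nkeys_rating),
                                     rf_range(30.0, low_age, high_age)]))   # 3200
```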

Transaction management overview

  • concurrent database access : isolate sensitive query statements from one another
  • resilience to system failures : guarantee all or nothing
  • transaction : sequence of one or more SQL operations treated as a unit
  • ACID properties
    • atomicity : all actions in the transaction happen or none happen
      • logging : log all actions so it can undo if aborted, need for audit, efficient, good for crashes
        • consist of records written sequentially : often archived on stable storage, chained together by transaction id
        • steal buffer management : undo required if uncommitted data can overwrite stable version of committed data
        • no force buffer management : redo required if transaction can commit before all its updates are on disk
        • recorded : writes, commits, aborts
        • write-ahead logging protocol : the log record must go to disk before the changed page, implemented via a log manager / buffer manager handshake, all log records must be written for the transaction to be considered committed (toy sketch at the end of this section)
      • shadow pages
      • cascading aborts : releasing locks too early forces other transactions' actions to be cancelled as well
    • consistency : if each transaction is consistent and the db starts consistent, it ends up consistent
      • integrity constraints : if check fail, the transaction rolls back
    • isolation : execution of one transaction is isolated from that of other transactions
      • pessimistic isolation : do not let problems arise in the first place
      • optimistic : assume conflicts are rare, deal with them after they happen
      • interleaved execution anomalies
        • reading uncommitted data : WR conflict, dirty reads
        • unrepeatable read : RW conflict
        • overwriting uncommitted data : WW conflict
      • lock-based control : strict two-phase locking (Strict 2PL) protocol
        • each transaction must obtain an S (shared) lock before reading and X (exclusive) lock before writing
        • if transaction holds X, no other can acquire S or X
        • if transaction holds S, no other can acquire X
    • durability : if a transaction commits, its effects persist
      • recovering from a crash
        • analysis : scan log from most recent checkpoint, identify all transactions that were active during the crash
        • redo : redo updates as needed to ensure all logged updates are carried out and written
        • undo : undo the writes of all transactions that were active during the crash, working backwards in the log
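
A toy sketch of the write-ahead logging rule referenced above: log records reach stable storage before the corresponding page, and all of a transaction's log records are flushed before it counts as committed. The in-memory lists standing in for disk and the function names are invented:

```python
# Write-ahead logging reduced to its two ordering rules.
stable_log, stable_pages, log_buffer = [], {}, []

def log_update(txn, page, new_value):
    log_buffer.append(("UPDATE", txn, page, new_value))

def flush_log():
    stable_log.extend(log_buffer)
    log_buffer.clear()

def flush_page(page, value):
    flush_log()                      # rule 1: log records for this change go to disk first
    stable_pages[page] = value

def commit(txn):
    log_buffer.append(("COMMIT", txn))
    flush_log()                      # rule 2: force all log records before acknowledging

log_update("T1", "P1", 42)
commit("T1")
flush_page("P1", 42)                 # the page may reach disk any time after its log records
print(stable_log, stable_pages)
```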

Concurrency control

  • schedule
    • serial : does not interleave actions of different transactions
    • equivalent : the effect of executing the first schedule is identical to the effect of executing the second schedule
    • serializable : schedule that is equivalent to some serial execution of transactions
    • conflict equivalence : most commonly used
  • conflicting operations : two operations conflict if different transaction, same object and at least one of them is a write
  • conflict equivalent two schedules : iff involve same actions of same transactions and every pair of conflicting actions is ordered the same way
  • conflict serializable schedule S : if S is conflict equivalent to some serial schedule, ability to transform S into a serial schedule by swapping consecutive non-conflicting operations of different transactions
  • dependency graph : one node per transaction, edge from $T_i$ to $T_j$ if operation of $T_i$ conflict with one of $T_j$ and happen earlier
    • schedule is conflict serializable iff its dependency graph is acyclic (cycle-check sketch at the end of this section)
  • two phase locking : sufficient to guarantee conflict serializability, but subject to cascading aborts
    • growing phase : locks are acquired and no lock released
    • shrinking phase : locks are released and no lock is acquired
    • strict : avoids cascading aborts, as all locks held by a transaction are released only when the transaction completes
  • lock management and deadlocks
    • lock manager : contains entry for each currently held lock
    • lock table entry : pointer to list of transaction currently holding locks, type of lock held, pointer to queue of lock requests
    • lock upgrade : a transaction holding a shared lock can upgrade it to an exclusive lock
    • deadlock : cycle of transactions waiting for each other to release locks
      • prevention
      • detection : create a waits-for graph, nodes for transactions, edge from $T_i$ to $T_j$ if $T_i$ is waiting for $T_j$ to release a lock
      • timeout : bad
  • locking granularity : allow transactions to lock at each level
    • intention lock protocol : before locking an item, a transaction must hold proper intention locks on all its ancestors in the granularity hierarchy
    • each transaction starts at root : database, tables, pages, tuples
    • release in bottom up order
    • IS : intent to get S lock(s) at finer granularity, must hold IS or IX on parent
    • IX : intent to get X lock(s) at finer granularity, must hold IX or SIX on parent
    • SIX mode : S and IX at the same time, allows more concurrent transactions
    • lock escalation : dynamically asks for coarser-grained locks when too many low level locks acquired
  • phantom tuples : tuples that appear between two identical checks, so no equivalent serial execution exists
    • predicate locking : grant lock on all records that satisfy some logical predicate, big overhead
    • index locking : special case of predicate locking
      • dense index : lock index pages where the data entry would be
      • no suitable index : lock every page of the file and lock the file itself
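
A short sketch of the dependency-graph test for conflict serializability described earlier in this section: add an edge for every pair of conflicting operations and check for a cycle. The schedule encoding as (transaction, action, object) triples is invented for the example:

```python
# Conflict = same object, different transactions, at least one write.
# A schedule is conflict serializable iff its dependency graph is acyclic.
def dependency_graph(schedule):
    edges = set()
    for i, (ti, ai, oi) in enumerate(schedule):
        for tj, aj, oj in schedule[i + 1:]:
            if ti != tj and oi == oj and "W" in (ai, aj):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def reachable(src, dst, seen=()):
        return any(n == dst or (n not in seen and reachable(n, dst, seen + (n,)))
                   for n in graph.get(src, ()))
    return any(reachable(t, t) for t in graph)

# T1 reads A after T2 wrote it, and T2 reads B after T1 wrote it: a classic cycle.
s = [("T1", "W", "B"), ("T2", "W", "A"), ("T1", "R", "A"), ("T2", "R", "B")]
print(has_cycle(dependency_graph(s)))     # True -> not conflict serializable
```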

Eventual consistency

  • replication : copies data and uses the copies on crashes, good for load balancing
  • eventually consistent : if no further changes are made, eventually all copies will have the same value
    • write success : only if predetermined number of servers agree on same value
    • N : # servers storing data copies
    • R : # servers that need to agree on a value for read operations
    • W : # servers that need to agree on a value for write operations
    • sacrifice consistency for faster response time
    • strong consistency : R + W > N
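
A tiny sketch of the quorum arithmetic above: with N copies, read quorum R and write quorum W always overlap (strong consistency) exactly when R + W > N; the example values are arbitrary:

```python
# Quorum-based replication: R + W > N guarantees that read and write sets overlap.
def is_strongly_consistent(n, r, w):
    return r + w > n

for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 1, 1)]:
    print(n, r, w, is_strongly_consistent(n, r, w))
# (3, 2, 2) and (3, 1, 3) -> True; (3, 1, 1) -> False (eventual consistency only)
```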