Hashing and ADTs

herrera-ignacio · Nov 14, 2024 · b6eec65 · b6eec65
1 parent 47bfa0b
commit b6eec65

Showing 11 changed files with 170 additions and 22 deletions.
diff --git a/.obsidian/workspace.json b/.obsidian/workspace.json
@@ -8,20 +8,49 @@
         "type": "tabs",
         "children": [
           {
-            "id": "5052b16fd8bdb461",
+            "id": "5be967f0881160fb",
             "type": "leaf",
             "state": {
               "type": "markdown",
               "state": {
-                "file": "Data structures/Arrays/Sliding window.md",
+                "file": "Data structures/Hashing/Universal hashing.md",
+                "mode": "source",
+                "source": false
+              },
+              "icon": "lucide-file",
+              "title": "Universal hashing"
+            }
+          },
+          {
+            "id": "73686db3a8fa2556",
+            "type": "leaf",
+            "state": {
+              "type": "markdown",
+              "state": {
+                "file": "Data structures/Hashing/Hash map.md",
                 "mode": "source",
                 "source": false
               },
               "icon": "lucide-file",
-              "title": "Sliding window"
+              "title": "Hash map"
+            }
+          },
+          {
+            "id": "26cfaac60005359e",
+            "type": "leaf",
+            "state": {
+              "type": "markdown",
+              "state": {
+                "file": "Data structures/Hashing/Sets.md",
+                "mode": "source",
+                "source": false
+              },
+              "icon": "lucide-file",
+              "title": "Sets"
             }
           }
-        ]
+        ],
+        "currentTab": 2
       }
     ],
     "direction": "vertical"
@@ -174,20 +203,31 @@
       "table-editor-obsidian:Advanced Tables Toolbar": false
     }
   },
-  "active": "5052b16fd8bdb461",
+  "active": "26cfaac60005359e",
   "lastOpenFiles": [
+    "Data structures/README.md",
+    "Data structures/Hashing/Sets.md",
+    "Data structures/Hashing/Collisions.md",
+    "Data structures/Hashing/Hash map.md",
+    "Data structures/Hashing/Universal hashing.md",
+    "Data structures/Hashing/Hash function.md",
+    "Data structures/Hashing/Hashing.md",
+    "Data structures/Hashing/_attachments/Pasted image 20241114124706.png",
+    "Data structures/Hashing/_attachments",
+    "Data structures/Data structure.md",
+    "READMEv2.md",
+    "README.md",
+    "Data structures/Arrays/Sliding window.md",
+    "Data structures/Abstract Data Type.md",
+    "Data structures/Arrays/Array.md",
     "Data structures/Arrays/Prefix sum.md",
+    "Data structures/Hashing",
     "Data structures/Arrays/_attachments/Pasted image 20241112114017.png",
-    "Data structures/Arrays/Sliding window.md",
     "Data structures/Arrays/Two pointers.md",
-    "Data structures/README.md",
     "Data structures/Arrays/_attachments/Pasted image 20241111194533.png",
-    "Data structures/Arrays/Array.md",
     "Data structures/Arrays/_attachments/Pasted image 20241111172708.png",
-    "README.md",
     "Data structures/Arrays/_attachments",
     "Pasted image 20241111172836.png",
-    "READMEv2.md",
     "Data Structures.md",
     "Data structures/Arrays",
     "Data structures",
@@ -205,21 +245,11 @@
     "legacy/glossary/side-effect",
     "legacy/glossary/pattern",
     "legacy/glossary/modularization",
-    "legacy/glossary/first-class-citizen",
-    "legacy/glossary/crud",
     "legacy/frontend/legacy-lifecycle/README.md",
     "legacy/frontend/legacy-lifecycle/2022-12-28-22-40-07.png",
     "legacy/frontend/legacy-lifecycle/2022-12-28-22-39-07.png",
-    "legacy/frontend/accessibility-testing/README.md",
     "legacy/frontend/design-system/2022-11-09-09-19-31.png",
-    "legacy/frontend/metrics/README.md",
-    "legacy/frontend/csr-vs-ssr/README.md",
-    "legacy/frontend/design-system/README.md",
     "legacy/frontend/design-system/2022-11-09-09-25-46.png",
-    "legacy/frontend/design-system/2022-11-09-09-23-27.png",
-    "legacy/frontend/design-system/2022-11-09-09-22-44.png",
-    "legacy/frontend/core-team/README.md",
-    "legacy/frontend/areas/README.md",
-    "legacy/frontend/backbone-OKRs/README.md"
+    "legacy/frontend/design-system/2022-11-09-09-23-27.png"
   ]
 }
diff --git a/Data structures/Abstract Data Type.md b/Data structures/Abstract Data Type.md
@@ -0,0 +1,7 @@
+# Abstract data type
+## Definition
+An abstract data type (ADT) is a **mathematical model** for data types, analogous to an algebraic structure. It consists of a domain, a collection of operations, and a set of constraints the operations must satisfy.
+ADTs are a **theoretical concept**, used in formal semantics and program verification and, less strictly, in the design and analysis of algorithms.
+## ADT vs data structures
+An ADT, as a mathematical model, contrasts with a data structure, which is a concrete representation of data and reflects the point of view of an implementer, not a user.
+For example, a stack has push/pop operations that follow a LIFO rule, and can be concretely implemented using either a list or an array.
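Note on `Abstract Data Type.md`: the stack example can be made concrete with a small sketch. The class name `ListStack` and the method `is_empty` are hypothetical and not part of the note; the point is only that push/pop with a LIFO rule is the ADT, while the Python list underneath is one possible implementation.

```python
from typing import Generic, List, TypeVar

T = TypeVar("T")


class ListStack(Generic[T]):
    """One concrete implementation of the stack ADT, backed by a Python list."""

    def __init__(self) -> None:
        self._items: List[T] = []

    def push(self, item: T) -> None:
        # The end of the list plays the role of the top of the stack.
        self._items.append(item)

    def pop(self) -> T:
        # LIFO rule: the most recently pushed item is returned first.
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def is_empty(self) -> bool:
        return not self._items
```

A caller that only uses `push` and `pop` would not need to change if the list were swapped for a linked structure, which is the separation the note describes.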
diff --git a/Data structures/Data structure.md b/Data structures/Data structure.md
@@ -0,0 +1,11 @@
+# Data structures
+
+## What is a data structure?
+A data structure is a data organization and storage format for efficient access to data. More precisely, it's a collection of data values, the relationships among them, and the operations that can be applied to the data (i.e., it's an *algebraic structure* about data).
+We can split a data structure into two parts: the interface and the implementation.
+## Interface
+The interface is like a contract that specifies how we can interact with the data structure -- what operations we can perform on it, what inputs it expects, and what outputs we can expect.
+For example, consider a dynamic array. The interface would include operations like appending, insertion, removal, updating, and more.
+## Implementation
+The implementation is the code that actually makes the data structure work. It includes the details of how the data is stored and how the operations are performed.
+For example, the implementation of a dynamic array might involve allocating memory for the elements, tracking the size, and shifting the remaining elements when an operation like remove is called.
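Note on `Data structure.md`: a minimal sketch of the interface/implementation split for a dynamic array. The class `DynamicArray`, its starting capacity of 2, and the doubling growth policy are assumptions chosen for illustration; real implementations differ in their growth factors and memory handling.

```python
class DynamicArray:
    """Public methods form the interface; the backing list, the size counter,
    and the resizing logic are implementation details."""

    def __init__(self) -> None:
        self._capacity = 2                    # assumed starting capacity
        self._size = 0                        # number of slots in use
        self._data = [None] * self._capacity  # fixed-size backing storage

    def __len__(self) -> int:
        return self._size

    def get(self, index: int):
        self._check(index)
        return self._data[index]

    def append(self, value) -> None:
        if self._size == self._capacity:
            self._grow()
        self._data[self._size] = value
        self._size += 1

    def remove(self, index: int):
        self._check(index)
        removed = self._data[index]
        # Shift every later element one slot to the left to close the gap.
        for i in range(index, self._size - 1):
            self._data[i] = self._data[i + 1]
        self._size -= 1
        self._data[self._size] = None
        return removed

    def _grow(self) -> None:
        # Allocate a larger backing list and copy the existing elements over.
        self._capacity *= 2
        new_data = [None] * self._capacity
        for i in range(self._size):
            new_data[i] = self._data[i]
        self._data = new_data

    def _check(self, index: int) -> None:
        if not 0 <= index < self._size:
            raise IndexError("index out of range")
```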
diff --git a/Data structures/Hashing/Collisions.md b/Data structures/Hashing/Collisions.md
@@ -0,0 +1,8 @@
+# Collisions
+## Definition
+When different keys convert to the same integer, it is called a collision. Without handling collisions, older keys will get overwritten and data will be lost.
+## Collision resolution
+There are [multiple ways](https://en.wikipedia.org/wiki/Hash_table#Collision_resolution) to handle collisions.
+### Chaining
+We store linked lists inside the hash map's array instead of the elements themselves. The linked list nodes store both the key and the value.
+If there are collisions, the colliding key-value pairs are linked together in a linked list. Then, when trying to access one of these key-value pairs, we traverse the linked list until the key matches.
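Note on `Collisions.md`: the chaining scheme described in the note can be sketched as follows. The names `_Node`, `ChainedHashMap`, `put`, and `get` are hypothetical, the bucket count is fixed (no resizing), and Python's built-in `hash` stands in for the hash function.

```python
class _Node:
    """Linked-list node holding both the key and the value."""

    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedHashMap:
    def __init__(self, num_buckets: int = 8) -> None:
        # Each slot holds the head of a linked list (or None if empty).
        self._buckets = [None] * num_buckets

    def _index(self, key) -> int:
        return hash(key) % len(self._buckets)

    def put(self, key, value) -> None:
        i = self._index(key)
        node = self._buckets[i]
        while node is not None:
            if node.key == key:
                node.value = value  # key already present: update in place
                return
            node = node.next
        # Key not found: link a new node at the head of the chain.
        self._buckets[i] = _Node(key, value, self._buckets[i])

    def get(self, key):
        node = self._buckets[self._index(key)]
        # Traverse the chain until the key matches.
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        raise KeyError(key)
```

Constructing the map with `num_buckets=1` forces every key into the same chain, which is a quick way to see that colliding keys no longer overwrite each other.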
diff --git a/Data structures/Hashing/Hash function.md b/Data structures/Hashing/Hash function.md
@@ -0,0 +1,50 @@
+# Hash function
+## Definition
+A hash function is any function that can be used to map data of arbitrary size to fixed-size values (though there are some hash functions that support variable-length output).
+A hash function may be considered to perform **three functions**:
+- Convert variable-length keys into fixed-length values, by folding them by words or other units using a parity-preserving operator like ADD or XOR.
+- Scramble the bits of the key so that the resulting values are uniformly distributed over the keyspace.
+- Map the key values into the index range of the table (values less than the table size when indices start at 0).
+### Hash values and digests
+The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes.
+The values are usually used to index a fixed-size table called a *hash table*.
+## Usage
+- **Data storage and retrieval**: Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval.
+- **Integrity checking**: Identical hash values for two files indicate, with very high probability for a well-designed hash, that their contents are equal, while differing values prove that a file was modified.
+- **Key derivation**: Minor input changes produce random-looking changes in the output, which makes hash functions useful building blocks for deriving keys.
+- **Message authentication codes** (MACs): By combining a secret key with the input data, hash functions can generate MACs that ensure the authenticity of the data (e.g., HMACs).
+- **Signatures**: Message hashes are signed rather than the whole message.
+## What makes a good hash function?
+A good hash function satisfies two basic properties:
+- Should be very fast to compute.
+- Should minimize duplication of output values (collisions).
+Hash functions rely on generating favorable probability distributions for their effectiveness, reducing access time to nearly constant.
+High table loading factors, pathological key sets, and poorly designed hash functions can result in access times approaching linear in the number of items in the table.
+## Collision resolution
+A necessary adjunct to the hash function is a collision-resolution method that employs an auxiliary data structure like linked lists, or systematic probing of the table to find an empty slot.
+## Properties
+### Uniformity
+It should map the expected inputs as evenly as possible over its output range. That is, every hash value should be generated with roughly the same probability.
+
+> [!tip] Uniformly distributed is not random
+> This criterion only requires the value to be *uniformly distributed*, not *random* in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.
+#### Absolute uniformity
+In special cases when the keys are known in advance and the key set is static, a hash function can be found that achieves absolute (or collision-less) uniformity. Such a hash function is said to be *perfect*.
+There is no general direct way of constructing such a function -- the cost of a naive search for one grows factorially with the number of keys to be mapped and the number of table slots that they are mapped into.
+Finding a perfect hash function over more than a very small set of keys is usually computationally infeasible; the resulting function is likely to be more computationally complex than a standard hash function and provides only a marginal advantage over a function with good statistical properties that yields a minimum number of collisions.
+### Testing and measurement
+When testing a hash function, the uniformity of the distribution can be evaluated with the [chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test), a goodness-of-fit measure that compares the actual distribution of items in buckets with the expected (uniform) distribution.
+### Efficiency
+A hash function takes a finite amount of time to map a potentially large keyspace to a feasible amount of storage space searchable in a bounded amount of time regardless of the number of keys.
+In most applications, the hash function should be computable with minimum latency and secondarily in a minimum number of instructions.
+
+> [!tip] Space-time trade-off
+> In data storage and retrieval applications, the use of a hash function is a trade-off between search time and data storage space.
+> If [memory](https://en.wikipedia.org/wiki/Computer_memory "Computer memory") is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a [binary search](https://en.wikipedia.org/wiki/Binary_search "Binary search") or [linear search](https://en.wikipedia.org/wiki/Linear_search "Linear search") can be used to retrieve the element.
+
+### Applicability
+A hash function that allows only certain table sizes or strings only up to a certain length, or cannot accept a seed, is less useful than one that does.
+### Deterministic
+A hash procedure **must be deterministic** -- for a given input value, it must always generate the same hash value.
+### Defined range
+It is often desirable that the output of a hash function has a fixed size.
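Note on `Hash function.md`: a rough sketch of the three steps listed in the definition (fold, scramble, map to the table range), together with the chi-squared statistic mentioned under testing. The function names `fold_hash` and `chi_squared`, the multiplier 31, and the 32-bit mask are illustrative assumptions, not a recommended hash function.

```python
def fold_hash(key: str, table_size: int) -> int:
    """Deterministic toy hash: fold the key byte by byte, then reduce it."""
    h = 0
    for byte in key.encode("utf-8"):
        # Fold each byte into a fixed-width (32-bit) value; the multiply
        # spreads the influence of earlier bytes across the bits.
        h = (h * 31 + byte) & 0xFFFFFFFF
    # Map the fixed-size value into a valid table index.
    return h % table_size


def chi_squared(keys, table_size: int) -> float:
    """Chi-squared statistic of observed bucket counts against a uniform spread."""
    counts = [0] * table_size
    for key in keys:
        counts[fold_hash(key, table_size)] += 1
    expected = len(keys) / table_size  # items per bucket if perfectly uniform
    return sum((c - expected) ** 2 / expected for c in counts)
```

A smaller statistic for a given key set and table size suggests a more even spread over the buckets; interpreting the value properly requires the chi-squared distribution with `table_size - 1` degrees of freedom.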
diff --git a/Data structures/Hashing/Hash map.md b/Data structures/Hashing/Hash map.md
@@ -0,0 +1,13 @@
+# Hash map
+## Definition
+A hash map (also known as hash table or dictionary) is an **unordered data structure that stores key-value pairs**.
+It implements an associative array, which is an abstract data type that maps keys to values. It uses a [[Hash function]] to compute an *index* into an array of *buckets* or *slots*, from which the desired value can be found.
+![[Pasted image 20241114124706.png]]
+Typically, the only constraint on a hash map's key is that it has to be hashable, which in practice usually means **immutable**.
+## Advantages
+- It can reduce the time complexity of an algorithm by a factor of $O(n)$ for a large number of problems.
+- It can add, update, remove, and check for the existence of elements in $O(1)$ on average.
+## Disadvantages
+- For smaller input sizes, they can be slower due to overhead.
+- Can take up more space than arrays.
+- When implemented using a fixed-size array, resizing is much more expensive than for a normal array because every existing key needs to be re-hashed. A hash table may also use an array that is significantly larger than the number of elements stored, which wastes space.
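Note on `Hash map.md`: one common illustration of the "factor of $O(n)$" advantage is the pair-sum problem, where a brute-force search over all pairs is $O(n^2)$ and a hash map brings it down to $O(n)$ on average. The function name `two_sum` is hypothetical, and Python's built-in `dict` is used as the hash map.

```python
def two_sum(nums, target):
    """Return indices (j, i) with nums[j] + nums[i] == target, or None."""
    seen = {}  # value -> index where it was first seen
    for i, n in enumerate(nums):
        # Average O(1) lookup replaces an inner loop over earlier elements.
        j = seen.get(target - n)
        if j is not None:
            return (j, i)
        if n not in seen:
            seen[n] = i
    return None
```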
diff --git a/Data structures/Hashing/Hashing.md b/Data structures/Hashing/Hashing.md
@@ -0,0 +1,4 @@
+# Hashing
+## Definition
+The use of a hash function to index a [[Hash map]] is called *hashing* or *scatter-storage addressing*.
+Hashing is a computationally- and storage-space-efficient form of data access that **avoids the non-constant access time** of ordered and unordered lists and structured trees, and the often-exponential storage requirements of direct access of state spaces of large or variable-length keys.
diff --git a/Data structures/Hashing/Sets.md b/Data structures/Hashing/Sets.md
@@ -0,0 +1,11 @@
+# Sets
+## Definition
+A set is an **unordered** abstract data type that can **store unique values**. It is a computer implementation of the mathematical concept of a *finite set*.
+
+> [!tip] Test membership
+> Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.
+## Advantages
+You can add, remove, and check if an element exists in a set in $O(1)$ on average.
+## Sets vs hash table
+Sets use the same mechanism for hashing keys into integers but the difference is that sets do not map their keys to anything.
+Sets are convenient to use when you only care about checking whether an element exists.
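Note on `Sets.md`: a short sketch of the membership-testing use case, using Python's built-in `set`. The function name `has_duplicates` is hypothetical.

```python
def has_duplicates(values) -> bool:
    """Return True if any value appears more than once."""
    seen = set()
    for v in values:
        # Membership test: we only ask whether v exists, never retrieve it.
        if v in seen:
            return True
        seen.add(v)
    return False
```

Because the set stores keys without mapping them to anything, it fits here better than a hash map whose values would go unused.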
diff --git a/Data structures/Hashing/Universal hashing.md b/Data structures/Hashing/Universal hashing.md
@@ -0,0 +1,4 @@
+# Universal hashing
+## Definition
+Universal hashing refers to selecting a hash function at random from a family of hash functions with a certain mathematical property. This guarantees a low number of collisions in expectation.
+Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient.
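Note on `Universal hashing.md`: a sketch of one well-known universal family for integer keys, $h_{a,b}(x) = ((a x + b) \bmod p) \bmod m$, with $a$ and $b$ drawn at random. The helper name `make_universal_hash`, the choice of prime `P`, and the assumption that keys satisfy `0 <= x < P` are all illustrative.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1; keys are assumed to lie in [0, P)


def make_universal_hash(m: int, rng=random):
    """Pick one function at random from the family h(x) = ((a*x + b) % P) % m."""
    a = rng.randrange(1, P)  # a must be non-zero
    b = rng.randrange(0, P)

    def h(x: int) -> int:
        return ((a * x + b) % P) % m

    return h
```

Each call returns a different function with high probability, but any single returned function is deterministic, so it can still be used to index a table.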
diff --git a/Data structures/Hashing/_attachments/Pasted image 20241114124706.png b/Data structures/Hashing/_attachments/Pasted image 20241114124706.png
diff --git a/Data structures/README.md b/Data structures/README.md
@@ -1,6 +1,16 @@
 # Data structures
+## General
+- [[Data structure|What's a data structure?]]
+- [[Abstract Data Type]]
 ## Arrays
 - [[Array|What's an array?]]
 - [[Two pointers]]
 - [[Sliding window]]
-- [[Prefix sum]]
+- [[Prefix sum]]
+## Hashing
+- [[Hash function]]
+- [[Collisions]]
+- [[Hashing]]
+- [[Universal hashing]]
+- [[Hash map]]
+- [[Sets]]