Hashing and ADTs
herrera-ignacio committed Nov 14, 2024
1 parent 47bfa0b commit b6eec65
Showing 11 changed files with 170 additions and 22 deletions.
72 changes: 51 additions & 21 deletions .obsidian/workspace.json
@@ -8,20 +8,49 @@
"type": "tabs",
"children": [
{
"id": "5052b16fd8bdb461",
"id": "5be967f0881160fb",
"type": "leaf",
"state": {
"type": "markdown",
"state": {
"file": "Data structures/Arrays/Sliding window.md",
"file": "Data structures/Hashing/Universal hashing.md",
"mode": "source",
"source": false
},
"icon": "lucide-file",
"title": "Universal hashing"
}
},
{
"id": "73686db3a8fa2556",
"type": "leaf",
"state": {
"type": "markdown",
"state": {
"file": "Data structures/Hashing/Hash map.md",
"mode": "source",
"source": false
},
"icon": "lucide-file",
"title": "Sliding window"
"title": "Hash map"
}
},
{
"id": "26cfaac60005359e",
"type": "leaf",
"state": {
"type": "markdown",
"state": {
"file": "Data structures/Hashing/Sets.md",
"mode": "source",
"source": false
},
"icon": "lucide-file",
"title": "Sets"
}
}
]
],
"currentTab": 2
}
],
"direction": "vertical"
@@ -174,20 +203,31 @@
"table-editor-obsidian:Advanced Tables Toolbar": false
}
},
"active": "5052b16fd8bdb461",
"active": "26cfaac60005359e",
"lastOpenFiles": [
"Data structures/README.md",
"Data structures/Hashing/Sets.md",
"Data structures/Hashing/Collisions.md",
"Data structures/Hashing/Hash map.md",
"Data structures/Hashing/Universal hashing.md",
"Data structures/Hashing/Hash function.md",
"Data structures/Hashing/Hashing.md",
"Data structures/Hashing/_attachments/Pasted image 20241114124706.png",
"Data structures/Hashing/_attachments",
"Data structures/Data structure.md",
"READMEv2.md",
"README.md",
"Data structures/Arrays/Sliding window.md",
"Data structures/Abstract Data Type.md",
"Data structures/Arrays/Array.md",
"Data structures/Arrays/Prefix sum.md",
"Data structures/Hashing",
"Data structures/Arrays/_attachments/Pasted image 20241112114017.png",
"Data structures/Arrays/Sliding window.md",
"Data structures/Arrays/Two pointers.md",
"Data structures/README.md",
"Data structures/Arrays/_attachments/Pasted image 20241111194533.png",
"Data structures/Arrays/Array.md",
"Data structures/Arrays/_attachments/Pasted image 20241111172708.png",
"README.md",
"Data structures/Arrays/_attachments",
"Pasted image 20241111172836.png",
"READMEv2.md",
"Data Structures.md",
"Data structures/Arrays",
"Data structures",
@@ -205,21 +245,11 @@
"legacy/glossary/side-effect",
"legacy/glossary/pattern",
"legacy/glossary/modularization",
"legacy/glossary/first-class-citizen",
"legacy/glossary/crud",
"legacy/frontend/legacy-lifecycle/README.md",
"legacy/frontend/legacy-lifecycle/2022-12-28-22-40-07.png",
"legacy/frontend/legacy-lifecycle/2022-12-28-22-39-07.png",
"legacy/frontend/accessibility-testing/README.md",
"legacy/frontend/design-system/2022-11-09-09-19-31.png",
"legacy/frontend/metrics/README.md",
"legacy/frontend/csr-vs-ssr/README.md",
"legacy/frontend/design-system/README.md",
"legacy/frontend/design-system/2022-11-09-09-25-46.png",
"legacy/frontend/design-system/2022-11-09-09-23-27.png",
"legacy/frontend/design-system/2022-11-09-09-22-44.png",
"legacy/frontend/core-team/README.md",
"legacy/frontend/areas/README.md",
"legacy/frontend/backbone-OKRs/README.md"
"legacy/frontend/design-system/2022-11-09-09-23-27.png"
]
}
7 changes: 7 additions & 0 deletions Data structures/Abstract Data Type.md
@@ -0,0 +1,7 @@
# Abstract data type
## Definition
An abstract data type (ADT) is a **mathematical model** for data types, analogous to an algebraic structure. It consists of a domain, a collection of operations, and a set of constraints the operations must satisfy.
ADTs are a **theoretical concept**, used in formal semantics and program verification and, less strictly, in the design and analysis of algorithms.
## ADT vs data structures
An ADT, as a mathematical model, contrasts with a data structure, which is a concrete representation of data. The ADT reflects the point of view of a user; the data structure reflects that of an implementer.
For example, a stack has push/pop operations that follow a LIFO rule, and can be concretely implemented using either a linked list or an array.
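As a rough illustration of the stack example, the sketch below writes the ADT as an interface and gives two interchangeable implementations. The names `Stack`, `ArrayStack`, `LinkedStack`, and `drain` are hypothetical and exist only for this note.

```python
from typing import Any, Optional, Protocol


class Stack(Protocol):
    """The ADT: operations and their LIFO contract, no storage details."""

    def push(self, item: Any) -> None: ...
    def pop(self) -> Any: ...


class ArrayStack:
    """Implementation backed by a Python list (a dynamic array)."""

    def __init__(self) -> None:
        self._items: list = []

    def push(self, item: Any) -> None:
        self._items.append(item)

    def pop(self) -> Any:
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()


class _Node:
    def __init__(self, value: Any, below: Optional["_Node"]) -> None:
        self.value = value
        self.below = below


class LinkedStack:
    """Implementation backed by a singly linked list."""

    def __init__(self) -> None:
        self._top: Optional[_Node] = None

    def push(self, item: Any) -> None:
        self._top = _Node(item, self._top)

    def pop(self) -> Any:
        if self._top is None:
            raise IndexError("pop from empty stack")
        node = self._top
        self._top = node.below
        return node.value


def drain(stack: Stack, items: list) -> list:
    """Push every item, then pop them all; relies only on the ADT."""
    for item in items:
        stack.push(item)
    return [stack.pop() for _ in items]
```

Code written against `Stack`, such as `drain`, behaves the same with either implementation, which is the sense in which the ADT is the user's view.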
11 changes: 11 additions & 0 deletions Data structures/Data structure.md
@@ -0,0 +1,11 @@
# Data structures

## What is a data structure?
A data structure is a data organization and storage format for efficient access to data. More precisely, it's a collection of data values, the relationships among them, and the operations that can be applied to the data (i.e., it's an *algebraic structure* about data).
We can split data structures into two things: the interface and the implementation.
## Interface
The interface is like a contract that specifies how we can interact with the data structure -- what operations we can perform on it, what inputs it expects, and what outputs we can expect.
For example, consider a dynamic array. The interface would include operations like appending, insertion, removal, updating, and more.
## Implementation
The implementation is the code that actually makes the data structure work. It includes the details of how the data is stored and how the operations are performed.
For example, the implementation of a dynamic array might involve allocating memory for the array, tracking its size, and shifting the elements when an operation like remove is called.
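A minimal sketch of that split, assuming a hypothetical `DynamicArray` class: the public methods are the interface, while the backing list, the capacity doubling, and the element shifting are implementation details that could change without affecting callers.

```python
class DynamicArray:
    """Hypothetical dynamic array over a fixed-capacity backing list."""

    def __init__(self) -> None:
        self._capacity = 2
        self._size = 0
        # A pre-sized list stands in for a block of allocated memory.
        self._slots = [None] * self._capacity

    def __len__(self) -> int:
        return self._size

    def get(self, index: int):
        self._check(index)
        return self._slots[index]

    def append(self, value) -> None:
        if self._size == self._capacity:
            self._grow()
        self._slots[self._size] = value
        self._size += 1

    def remove(self, index: int):
        self._check(index)
        removed = self._slots[index]
        # Shift later elements one slot left to close the gap.
        for i in range(index, self._size - 1):
            self._slots[i] = self._slots[i + 1]
        self._size -= 1
        self._slots[self._size] = None
        return removed

    def _grow(self) -> None:
        # Allocate a block twice as large and copy the elements over.
        self._capacity *= 2
        new_slots = [None] * self._capacity
        new_slots[: self._size] = self._slots[: self._size]
        self._slots = new_slots

    def _check(self, index: int) -> None:
        if not 0 <= index < self._size:
            raise IndexError("index out of range")
```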
8 changes: 8 additions & 0 deletions Data structures/Hashing/Collisions.md
@@ -0,0 +1,8 @@
# Collisions
## Definition
When different keys hash to the same integer, it is called a collision. Without handling collisions, older keys will get overwritten and data will be lost.
## Collision resolution
There are [multiple ways](https://en.wikipedia.org/wiki/Hash_table#Collision_resolution) to handle collisions.
### Chaining
We store linked lists inside the hash map's array instead of the elements themselves. The linked list nodes store both the key and the value.
If there are collisions, the collided key-value pairs are linked together in a linked list. Then, when trying to access one of these key-value pairs, we traverse through the linked list until the key matches.
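The chaining scheme described above can be sketched as follows. `ChainedHashMap` and `_Node` are hypothetical names, the capacity is fixed, and Python's built-in `hash` stands in for the hash function.

```python
class _Node:
    """Linked-list node holding one key-value pair."""

    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedHashMap:
    def __init__(self, capacity: int = 8) -> None:
        # Each array slot holds the head of a linked list (or None).
        self._buckets = [None] * capacity

    def _index(self, key) -> int:
        return hash(key) % len(self._buckets)

    def put(self, key, value) -> None:
        i = self._index(key)
        node = self._buckets[i]
        while node is not None:
            if node.key == key:
                node.value = value  # existing key: update in place
                return
            node = node.next
        # Key not present: prepend a new node to this bucket's chain.
        self._buckets[i] = _Node(key, value, self._buckets[i])

    def get(self, key):
        node = self._buckets[self._index(key)]
        # Traverse the chain until the key matches.
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        raise KeyError(key)
```

With `capacity=1` every key lands in the same bucket, which makes it easy to confirm that colliding pairs are kept rather than overwritten.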
50 changes: 50 additions & 0 deletions Data structures/Hashing/Hash function.md
@@ -0,0 +1,50 @@
# Hash function
## Definition
A hash function is any function that can be used to map data of arbitrary size to fixed-size values (though there are some hash functions that support variable-length output).
A hash function may be considered to perform **three functions**:
- Convert variable-length keys into fixed-length values, by folding them by words or other units using a parity-preserving operator like ADD or XOR.
- Scramble the bits of the key so that the resulting values are uniformly distributed over the keyspace.
- Map the resulting values into the index range of the table (for a zero-indexed table of size $m$, values from $0$ to $m - 1$).
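The three steps above can be sketched as a toy hash function. This is an illustration under assumptions, not a recommended design: the function names are hypothetical, keys are assumed to be `bytes`, folding uses XOR over 8-byte words, and the scrambling step borrows the 64-bit finalizer constants from MurmurHash3.

```python
MASK64 = (1 << 64) - 1


def fold(key: bytes) -> int:
    """Step 1: fold a variable-length key into one 64-bit word with XOR."""
    word = len(key)  # mix in the length so trailing zero bytes still matter
    for start in range(0, len(key), 8):
        chunk = key[start:start + 8]
        word ^= int.from_bytes(chunk, "little")
    return word & MASK64


def scramble(word: int) -> int:
    """Step 2: mix the bits (64-bit finalizer from MurmurHash3)."""
    word ^= word >> 33
    word = (word * 0xFF51AFD7ED558CCD) & MASK64
    word ^= word >> 33
    word = (word * 0xC4CEB9FE1A85EC53) & MASK64
    word ^= word >> 33
    return word


def to_index(word: int, table_size: int) -> int:
    """Step 3: map the scrambled word into the range 0..table_size-1."""
    return word % table_size


def toy_hash(key: bytes, table_size: int) -> int:
    return to_index(scramble(fold(key)), table_size)
```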
### Hash values and digests
The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes.
The values are usually used to index a fixed-size table called a *hash table*.
## Usage
- **Data storage and retrieval**: Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval.
- **Integrity checking**: Identical hash values for two files indicate, with very high probability when a cryptographic hash is used, that their contents are equal, providing a reliable means to detect file modifications.
- **Key derivation**: Minor input changes result in a random-looking output alteration.
- **Message authentication codes** (MACs): Through the integration of a confidential key with the input data, hash functions can generate MACs ensuring the genuineness of the data (e.g., HMACs).
- **Signatures**: Message hashes are signed rather than the whole message.
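Two of the uses listed above, integrity checking and MACs, can be sketched with Python's standard `hashlib` and `hmac` modules. The helper names `content_digest`, `sign`, and `verify` are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import hmac


def content_digest(data: bytes) -> str:
    """Integrity checking: equal content always yields an equal digest."""
    return hashlib.sha256(data).hexdigest()


def sign(secret_key: bytes, message: bytes) -> bytes:
    """MAC: combine a secret key with the message (HMAC-SHA256)."""
    return hmac.new(secret_key, message, hashlib.sha256).digest()


def verify(secret_key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(secret_key, message), tag)
```

A changed message or a different key produces a different tag, so `verify` fails for tampered data.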
## What makes a good hash function?
A good hash function satisfies two basic properties:
- Should be very fast to compute.
- Should minimize duplication of output values (collisions).
Hash functions rely on generating favorable probability distributions for their effectiveness, reducing access time to nearly constant.
High table loading factors, pathological key sets, and poorly designed hash functions can result in access times approaching linear in the number of items in the table.
## Collision resolution
A necessary adjunct to the hash function is a collision-resolution method that employs an auxiliary data structure like linked lists, or systematic probing of the table to find an empty slot.
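As a sketch of the "systematic probing" option, the hypothetical `ProbingHashMap` below uses linear probing: on a collision it checks the next slot, wrapping around, until it finds the key or an empty slot. Deletion and resizing are left out to keep the example short, and Python's built-in `hash` stands in for the hash function.

```python
class ProbingHashMap:
    """Fixed-capacity open-addressing map using linear probing."""

    def __init__(self, capacity: int = 8) -> None:
        # Each slot is either None or a (key, value) pair.
        self._slots = [None] * capacity

    def _probe(self, key):
        """Yield slot indices from the key's home slot, wrapping around."""
        capacity = len(self._slots)
        start = hash(key) % capacity
        for step in range(capacity):
            yield (start + step) % capacity

    def put(self, key, value) -> None:
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None or slot[0] == key:
                self._slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None:
                break  # an empty slot ends the probe sequence
            if slot[0] == key:
                return slot[1]
        raise KeyError(key)
```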
## Properties
### Uniformity
It should map the expected inputs as evenly as possible over its output range. That is, every hash value should be generated with roughly the same probability.

> [!tip] Uniformly distributed is not random
> This criterion only requires the value to be *uniformly distributed*, not *random* in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.
#### Absolute uniformity
In special cases when the keys are known in advance and the key set is static, a hash function can be found that achieves absolute (or collision-less) uniformity. Such a hash function is said to be *perfect*.
There is no algorithmic way of constructing such a function -- searching for one is a factorial function of the number of keys to be mapped versus the number of table slots that they are mapped into.
Finding a perfect hash function over more than a very small set of keys is usually computationally infeasible; the resulting function is likely to be more computationally complex than a standard hash function and provides only a marginal advantage over a function with good statistical properties that yields a minimum number of collisions.
### Testing and measurement
When testing a hash function, the uniformity of the distribution can be evaluated with the [chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test), a goodness-of-fit measure that compares the actual distribution of items in buckets with the expected (uniform) distribution.
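A minimal sketch of the statistic itself, assuming a hypothetical helper `chi_squared` that receives the bucket index produced for each test key. With $n$ items and $m$ buckets the expected count per bucket is $n / m$, and the statistic is the sum over buckets of $(observed - expected)^2 / expected$.

```python
from collections import Counter


def chi_squared(bucket_indices, num_buckets: int) -> float:
    """Chi-squared statistic of observed bucket counts vs. a uniform spread."""
    counts = Counter(bucket_indices)
    total = sum(counts.values())
    expected = total / num_buckets
    return sum(
        (counts.get(bucket, 0) - expected) ** 2 / expected
        for bucket in range(num_buckets)
    )
```

A perfectly even spread gives 0; larger values indicate a less uniform distribution. To turn the value into a test, it is compared against the chi-squared distribution with `num_buckets - 1` degrees of freedom.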
### Efficiency
A hash function takes a finite amount of time to map a potentially large keyspace to a feasible amount of storage space searchable in a bounded amount of time regardless of the number of keys.
In most applications, the hash function should be computable with minimum latency and secondarily in a minimum number of instructions.

> [!tip] Space-time trade-off
> In data storage and retrieval applications, the use of a hash function is a trade-off between search time and data storage space.
> If [memory](https://en.wikipedia.org/wiki/Computer_memory "Computer memory") is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a [binary search](https://en.wikipedia.org/wiki/Binary_search "Binary search") or [linear search](https://en.wikipedia.org/wiki/Linear_search "Linear search") can be used to retrieve the element.
### Applicability
A hash function that allows only certain table sizes, accepts strings only up to a certain length, or cannot accept a seed is less useful than one without those restrictions.
### Deterministic
A hash procedure **must be deterministic** -- for a given input value, it must always generate the same hash value.
### Defined range
It is often desirable that the output of a hash function have a fixed size.
13 changes: 13 additions & 0 deletions Data structures/Hashing/Hash map.md
@@ -0,0 +1,13 @@
# Hash map
## Definition
A hash map (also known as a hash table or dictionary) is an **unordered data structure that stores key-value pairs**.
It implements an associative array, which is an abstract data type that maps keys to values. It uses a [[Hash function]] to compute an *index* into an array of *buckets* or *slots*, from which the desired value can be found.
![[Pasted image 20241114124706.png]]
Typically, the only constraint on a hash map's key is that it has to be **immutable**.
## Advantages
- For many problems, it reduces the time complexity of an algorithm by a factor of $O(n)$, typically by replacing an $O(n)$ search with an $O(1)$ lookup.
- It can add, update, check for existence, and remove elements in $O(1)$ time on average.
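As one hedged example of the first advantage, consider finding two numbers in a list that add up to a target. The function names below are hypothetical; Python's `dict` plays the role of the hash map.

```python
def pair_with_sum_scan(nums, target):
    """Nested scan: O(n^2) time, no extra space."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None


def pair_with_sum_hashed(nums, target):
    """Single pass with a hash map: O(n) time on average, O(n) space."""
    seen = {}  # value -> index where it was first seen
    for i, n in enumerate(nums):
        if target - n in seen:
            return (seen[target - n], i)
        if n not in seen:
            seen[n] = i
    return None
```

The inner loop of the first version is replaced by a constant-time lookup in the second, which is where the factor of $O(n)$ comes from.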
## Disadvantages
- For smaller input sizes, they can be slower due to overhead.
- Can take up more space than arrays.
- Because the underlying array has a fixed size, growing a hash map means resizing it, which is much more expensive than resizing a normal array because every existing key needs to be re-hashed.
- A hash table may use an array that is significantly larger than the number of elements stored, which wastes space.
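To see why resizing forces a re-hash, note that a key's bucket depends on the table capacity. The sketch below assumes integer keys and the simple index rule `key % capacity`; `bucket_index` and `resize` are hypothetical helpers, and each bucket is a plain list of `(key, value)` pairs.

```python
def bucket_index(key: int, capacity: int) -> int:
    """Index rule assumed for this sketch: key modulo capacity."""
    return key % capacity


def resize(buckets, new_capacity: int):
    """Build a larger table by re-hashing every stored pair."""
    new_buckets = [[] for _ in range(new_capacity)]
    for bucket in buckets:
        for key, value in bucket:
            # The index must be recomputed because the capacity changed.
            new_buckets[bucket_index(key, new_capacity)].append((key, value))
    return new_buckets
```

With capacity 8, the keys 3, 11, and 19 all share bucket 3; after resizing to 16, key 11 moves to bucket 11 while 3 and 19 stay in bucket 3. Copying the old array as-is would leave 11 where lookups no longer search for it.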
4 changes: 4 additions & 0 deletions Data structures/Hashing/Hashing.md
@@ -0,0 +1,4 @@
# Hashing
## Definition
The use of a hash function to index a [[Hash map]] is called *hashing* or *scatter-storage addressing*.
Hashing is a computationally- and storage-space-efficient form of data access that **avoids the non-constant access time** of ordered and unordered lists and structured trees, and the often-exponential storage requirements of direct access of state spaces of large or variable-length keys.
11 changes: 11 additions & 0 deletions Data structures/Hashing/Sets.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# Sets
## Definition
A set is an **unordered** abstract data type that can **store unique values**. It is a computer implementation of the mathematical concept of a *finite set*.

> [!tip] Test membership
> Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.
## Advantages
You can add, remove, and check if an element exists in a set in $O(1)$ time on average.
## Sets vs hash table
Sets use the same mechanism for hashing keys into integers but the difference is that sets do not map their keys to anything.
Sets are convenient to use when you only care about checking whether an element exists.
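A small sketch of that membership-only use, with a hypothetical helper `first_duplicate` built on Python's `set`: the set stores keys with no associated values, and the only question asked of it is whether a value has been seen.

```python
def first_duplicate(values):
    """Return the first value that repeats, or None if all are unique."""
    seen = set()
    for value in values:
        if value in seen:  # membership test, O(1) on average
            return value
        seen.add(value)
    return None
```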
4 changes: 4 additions & 0 deletions Data structures/Hashing/Universal hashing.md
@@ -0,0 +1,4 @@
# Universal hashing
## Definition
Universal hashing refers to selecting a hash function at random from a family of hash functions with a certain mathematical property. This guarantees a low number of collisions in expectation.
Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient.
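One well-known universal family for integer keys is $h_{a,b}(x) = ((a x + b) \bmod p) \bmod m$, where $p$ is a prime larger than every key, $m$ is the number of buckets, and $a \in [1, p)$, $b \in [0, p)$ are drawn at random. The sketch below is an illustration under assumptions: `make_universal_hash` is a hypothetical name, $p = 2^{61} - 1$ is an assumed choice of prime, and keys are assumed to be integers in $[0, p)$.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1  # the prime p; keys must lie in [0, p)


def make_universal_hash(num_buckets: int, rng: random.Random):
    """Draw one function h(x) = ((a*x + b) mod p) mod m from the family."""
    a = rng.randrange(1, MERSENNE_PRIME)
    b = rng.randrange(0, MERSENNE_PRIME)

    def h(key: int) -> int:
        return ((a * key + b) % MERSENNE_PRIME) % num_buckets

    return h
```

Each call picks new random `a` and `b`, so it selects a different member of the family. For any two distinct keys, the probability over that random choice that they collide is at most $1/m$, which is the property behind the expected-collision guarantee.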
12 changes: 11 additions & 1 deletion Data structures/README.md
@@ -1,6 +1,16 @@
# Data structures
## General
- [[Data structure|What's a data structure?]]
- [[Abstract Data Type]]
## Arrays
- [[Array|What's an array?]]
- [[Two pointers]]
- [[Sliding window]]
- [[Prefix sum]]
## Hashing
- [[Hash function]]
- [[Collisions]]
- [[Hashing]]
- [[Universal hashing]]
- [[Hash map]]
- [[Sets]]
