If you have worked with asynchronous programming in Python, you may have used the async and await keywords before. It turns out that Python generators are actually the building blocks of these abstractions. This article explains their relationship in greater detail.
For single-threaded asynchronous programming to work in Python, we need a mechanism to "pause" function calls. For example, if a particular function involves fetching something from a database, we would like to "pause" its execution and schedule something else until the response is received. However, in traditional Python functions, the return keyword frees up the internal state at the end of the invocation.
It turns out that generators in Python can achieve a similar purpose! With generators, the yield keyword gives up control of the thread while the internal state is saved until the next invocation. So we can do some multitasking with a scheduler as shown below.
def gen_one():
    print("Gen one doing some work")
    yield
    print("Gen one doing more work")
    yield

def gen_two():
    print("Gen two doing some work")
    yield
    print("Gen two doing more work")
    yield

def scheduler():
    g1 = gen_one()
    g2 = gen_two()
    next(g1)
    next(g2)
    next(g1)
    next(g2)
>>> scheduler()
Gen one doing some work
Gen two doing some work
Gen one doing more work
Gen two doing more work
Coroutine is the term for suspendable functions. As generators cannot take in values like normal functions, new methods were introduced in PEP 342, including .send() that allows passing in of values (and also .throw() and .close()).
def coroutine_one():
    print("Coroutine one doing some work")
    data = (yield)
    print(f"Received data: {data}")
    print("Coroutine one doing more work")
    yield

cor1 = coroutine_one()
cor1.send(None)
cor1.send("lorem ipsum")
Coroutine one doing some work
Received data: lorem ipsum
Coroutine one doing more work
Let's refer to generators as coroutines from now on.
Another problem is that nested coroutines would not work with the current syntax. As shown below, how will coroutine_three() call coroutine_one() and coroutine_two()? It is just a function that creates two coroutine objects but has no ability to schedule them!
def coroutine_one():
    print("Coroutine one doing some work")
    yield
    print("Coroutine one doing more work")
    yield

def coroutine_two():
    print("Coroutine two doing some work")
    yield
    print("Coroutine two doing more work")
    yield

# Will not work as intended
def coroutine_three():
    coroutine_one()
    coroutine_two()
To solve this, PEP 380 introduces the yield from operator. This allows a section of code containing yield to be factored out and placed in another generator. In essence, the yield calls are "flattened" so that the same scheduler that handles coroutine_three() can handle the nested coroutines. Furthermore, if the inner coroutines use return, the values can be made available to coroutine_three(), just like traditional nested functions!
def coroutine_three():
    yield from coroutine_one()
    yield from coroutine_two()

# Equivalent code
# The 'yield' calls in subgenerators are flattened
def coroutine_three():
    print("Coroutine one doing some work")
    yield
    print("Coroutine one doing more work")
    yield
    print("Coroutine two doing some work")
    yield
    print("Coroutine two doing more work")
    yield
The previous scheduler in the example interleaves the two coroutines manually. A more automatic implementation would be to use a queue as shown below.
from collections import deque

def scheduler(coroutines):
    q = deque(coroutines)
    while q:
        coroutine = q.popleft()
        try:
            coroutine.send(None)
            q.append(coroutine)  # Reschedule until the coroutine is exhausted
        except StopIteration:
            pass
>>> scheduler([coroutine_one(), coroutine_two()])
Coroutine one doing some work
Coroutine two doing some work
Coroutine one doing more work
Coroutine two doing more work
During I/O operations, a synchronous function blocks the main thread until the I/O is ready. To carry out asynchronous work on a single thread, a good approach is for the scheduler to check all the coroutines in the queue and only allow those which are "ready" to run.
In the example below, coroutine_four() has to fetch data through an I/O operation. While it is suspended as the kernel populates the read buffer, the scheduler allows other coroutines to occupy the thread. The scheduler only allows coroutine_four() to execute again when the I/O is ready.
def fetch_data():
    print("Fetching data awaiting IO..")
    # Suspends coroutine while awaiting IO to be ready
    yield

    # Let's assume that the scheduler only reschedules
    # the coroutine again when IO is ready
    print("Fetching data IO ready..")
    # Mocked data
    return 10

def coroutine_four():
    print("Coroutine four doing some work")
    data = yield from fetch_data()  # I/O related coroutine
    print("Coroutine four doing more work with data: " + str(data))
    yield
>>> scheduler([coroutine_one(), coroutine_four()])
Coroutine one doing some work
Coroutine four doing some work
Fetching data awaiting IO..
Coroutine one doing more work
Fetching data IO ready..
Coroutine four doing more work with data: 10
In the previous example, the I/O completion is mocked inside the fetch_data() coroutine. In reality, how does the scheduler know when the I/O is complete?
This is where the asyncio library comes in. It introduces the concepts of Future and the Event Loop. Future objects track whether results (such as I/O) are ready or not and can be awaited like coroutines. The Event Loop is a loop that continuously schedules and runs coroutines, similar to our scheduler() function in the examples. At each iteration, it also polls for I/O operations to see which file descriptors are ready for I/O.
In essence, this is the pseudocode of the Event Loop in asyncio.
while the event loop is not stopped:
    poll for I/O and schedule reads/writes that are ready
    schedule the coroutines set for a 'later' time
    run the scheduled coroutines
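To make the polling loop more concrete, below is a minimal sketch of such a scheduler (my own illustration, not asyncio's actual implementation). Instead of real file descriptors, a coroutine yields a number of seconds to stand in for "wake me up when the I/O is ready", while a plain yield means "reschedule me".

import heapq
import time
from collections import deque

def event_loop(coroutines):
    ready = deque(coroutines)
    sleeping = []  # Heap of (wake_up_time, sequence_number, coroutine)
    seq = 0
    while ready or sleeping:
        if not ready:
            # Nothing is runnable: sleep until the earliest wake-up time
            time.sleep(max(0, sleeping[0][0] - time.monotonic()))
        # Move coroutines whose wake-up time has passed back onto the ready queue
        now = time.monotonic()
        while sleeping and sleeping[0][0] <= now:
            ready.append(heapq.heappop(sleeping)[2])
        if not ready:
            continue
        coroutine = ready.popleft()
        try:
            request = coroutine.send(None)
        except StopIteration:
            continue
        if request is None:
            ready.append(coroutine)   # Plain yield: run it again later
        else:
            seq += 1                  # Yielded a delay: park it until the time passes
            heapq.heappush(sleeping, (time.monotonic() + request, seq, coroutine))

Calling event_loop([gen_one(), gen_two()]) interleaves the two generators exactly like the earlier scheduler() did.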
Even though coroutines work well with the yield keyword of generators, it was not the original intention of the feature. From Python 3.5 onwards, coroutines were made first-class features with the introduction of the async and await keywords.
There are some implementation differences but the main features remain the same. For example, assuming that fetch_data() returns an awaitable object, coroutine_four() can be rewritten as shown below.
async def coroutine_four():
    print("Coroutine four doing some work")
    data = await fetch_data()
    print("Coroutine four doing more work with data: " + str(data))
The same coroutine methods such as .send() will still work, but the purpose is now a lot clearer!
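For completeness, here is a small runnable sketch of the same flow with asyncio, assuming fetch_data() is rewritten as a native coroutine that simulates the I/O wait with asyncio.sleep() (an assumption for illustration; any awaitable would work).

import asyncio

async def fetch_data():
    print("Fetching data awaiting IO..")
    await asyncio.sleep(0.1)  # Simulated I/O wait; the event loop can run other work meanwhile
    print("Fetching data IO ready..")
    return 10                 # Mocked data

async def coroutine_four():
    print("Coroutine four doing some work")
    data = await fetch_data()
    print("Coroutine four doing more work with data: " + str(data))

asyncio.run(coroutine_four())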
Have you occasionally chosen a character encoding such as UTF-8 while reading and writing files and wondered about its purpose? I have! This post explains various UTF (Unicode Transformation Format) algorithms such as UTF-8, UTF-16 and UTF-32, and how to choose between them.
The Unicode character set defines a unique number for almost all characters used in modern texts today. The standard ensures that given a number, also known as a code point, different software will decode it as the same character.
Character | Decimal Representation | Code Point (Hexadecimal)
---|---|---
A | 65 | U+41
B | 66 | U+42
我 | 25105 | U+6211
😋 | 128523 | U+1F60B
The Unicode character set ranges from 0x0 to 0x10FFFF (a 21-bit range).
UTF stands for Unicode Transformation Format. It encodes integer code points into byte representations on a machine. For example, if 4 bytes are allocated to each character, the four-byte representations are shown below.
Character | Byte Representation (Hexadecimal)
---|---
A | 0x00000041
B | 0x00000042
我 | 0x00006211
😋 | 0x0001F60B
This is exactly what UTF-32 does. It pads every code point with zeros into 32 bits. This is more than sufficient for the 21-bit range of the Unicode character set.
However, the approach is space-inefficient. For example, if there are only English letters in a document (U+41 to U+7A), only one byte is necessary to represent each character. However, UTF-32 will still pad with three extra bytes to form four-byte representations, resulting in a 300% increase in storage.
UTF-16 mitigates the problem by representing U+0 to U+FFFF with two bytes and U+10000 to U+10FFFF with four bytes.
Characters from almost all modern languages are found in the first 2^16 code points (see Basic Multilingual Plane). If a document only contains these code points, UTF-16 will mainly use two-byte representations, meaning storage is cut by 50% compared to UTF-32.
To represent larger code points, UTF-16 employs a concept called surrogate pairs. High surrogates are code points from U+D800 to U+DBFF and low surrogates are code points from U+DC00 to U+DFFF. There are no character mappings at these ranges and they only have meaningful representations when paired. The example below may present a clearer picture.
High surrogate --> U+D800 to U+DBFF --> 110110 concat with any 10 bits
Low surrogate  --> U+DC00 to U+DFFF --> 110111 concat with any 10 bits

Character: 😋
Unicode  : U+1F60B
Offset   : 0x1F60B - 0x10000 = 0xF60B (surrogate pairs encode the offset above the BMP)

Offset padded to 20 bits: 0b00001111011000001011
                            <--- A --><--- B -->
                            (10 bits) (10 bits)

High surrogate: 110110 concat A = 1101100000111101 (16 bits = 0xD83D)
Low surrogate : 110111 concat B = 1101111000001011 (16 bits = 0xDE0B)
If a decoder sees a two-byte representation starting with the bits 110110 or 110111, it can infer that this is part of a surrogate pair and immediately identify the other surrogate. The binary representation of the original character can be reconstructed afterwards.
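A quick way to sanity-check the arithmetic above is to ask Python for the UTF-16 encoding directly (big-endian so that the byte order matches the worked example):

# The surrogate pair computed above: 0xD83D followed by 0xDE0B
print("😋".encode("utf-16-be").hex())  # d83dde0b
print("我".encode("utf-16-be").hex())  # 6211 (inside the BMP, so no surrogate pair is needed)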
ASCII characters compose the first 2^7 code points. Most of the time when coding or writing English articles, you may mostly end up using these characters. As these code points can be represented with one byte, the two-byte representations of UTF-16 still result in wasted storage.
Depending on the range of the code point, UTF-8 uses one, two, three or four-byte representations. The encoding pseudocode is shown below.
if code point < 2^7              # Covers ASCII
    pad with zeros till 8 bits
    1st byte = 8 bits

else if code point < 2^11        # Covers other Latin alphabets
    pad with zeros till 11 bits  # (5 + 6)
    1st byte = "110" concat 5 bits
    2nd byte = "10" concat 6 bits

else if code point < 2^16        # Covers Basic Multilingual Plane
    pad with zeros till 16 bits  # (4 + 6 + 6)
    1st byte = "1110" concat 4 bits
    2nd byte = "10" concat 6 bits
    3rd byte = "10" concat 6 bits

else if code point < 2^21        # Covers 21-bit Unicode range
    pad with zeros till 21 bits  # (3 + 6 + 6 + 6)
    1st byte = "11110" concat 3 bits
    2nd byte = "10" concat 6 bits
    3rd byte = "10" concat 6 bits
    4th byte = "10" concat 6 bits
As text encoded in ASCII never appears as multi-byte sequences, UTF-8 can decode it directly. This backward compatibility is one of the reasons why it has been adopted at a large scale.
If backward compatibility to ASCII is preferred and most characters are English text, UTF-8 is a good choice.
If most characters are from non-Latin scripts (such as CJK), UTF-16 can be preferable because it uses two-byte representations for the Basic Multilingual Plane, whereas UTF-8 uses three-byte representations for those ranges.
UTF-32 is rarely used but in theory, the fixed-width encoding without transformations allows faster encoding and decoding of characters.
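To see the trade-offs concretely, here is a small Python check of how many bytes each encoding spends on the characters from the earlier table:

for ch in ("A", "我", "😋"):
    print(ch,
          len(ch.encode("utf-8")),      # 1, 3, 4 bytes
          len(ch.encode("utf-16-be")),  # 2, 2, 4 bytes
          len(ch.encode("utf-32-be")))  # 4, 4, 4 bytes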
Over the last 1.5 years, I studied in the Master of Computational Data Science (MCDS) program at Carnegie Mellon University. Inspired by blogs such as fanpu.io and wanshenl.me, I am going to outline my experiences for each course to hopefully help future students.
The underlying data structure of a Git repository is just a directed acyclic graph (DAG). Not only is the core idea simple, the implementation can be easily inspected in the .git directory. Let's break it down.
There are three types of “nodes”, also known as Git Objects - Blobs, Trees and Commits. The article will run through an example usage of Git so that we can observe how each of them is created.
After initializing an empty Git repository, the .git directory is shown below. For the rest of the article, our focus will be on the .git/objects directory.
.
|____branches
|____config
|____description
|____HEAD
|____hooks
| |____applypatch-msg.sample
| |____commit-msg.sample
| |...
|____info
| |____exclude
|____objects <--- Our focus
| |____info
| |____pack
|____refs
| |____heads
| |____tags
We will create a new file and add it to staging.
echo "Hello World" > hello.txt
+git add hello.txt
+
Now we can find a new directory and file in .git/objects.
.
|____objects
| |____55
| | |____7db03de997c86a4a028e1ebd3a1ceb225be238
| |____info
| |____pack
...
Note that the concatenation of the directory name and file name is a 40-character hexadecimal hash digest (20 bytes).
If we inspect the file, we will see that it is not in a human-readable format.
cat .git/objects/55/7db03de997c86a4a028e1ebd3a1ceb225be238
xKOR04bH/IAI
This is because Git stores content in a compressed binary format. If we uncompress it with the appropriate algorithm (zlib), the contents can be seen below. This Git Object is indicated to be a Blob with 12 bytes of data, Hello World\n.
blob 12Hello World
Remember the hash digest? To produce it, Git has actually run SHA-1 on the uncompressed data shown above. In other words, a Blob is simply the contents of one file and is identified by the SHA-1 hash of its contents.
Note that our Blob does not contain any information about the file name hello.txt.
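Out of curiosity, we can reproduce both steps from Python, since loose objects are zlib-compressed and the object ID is the SHA-1 of the uncompressed header plus content (a quick sketch using the object created above):

import hashlib
import zlib

# Decompress the loose object and recompute its SHA-1
raw = zlib.decompress(open(".git/objects/55/7db03de997c86a4a028e1ebd3a1ceb225be238", "rb").read())
print(raw)                            # b'blob 12\x00Hello World\n'
print(hashlib.sha1(raw).hexdigest())  # 557db03de997c86a4a028e1ebd3a1ceb225be238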
Conveniently, Git provides commands to inspect the compressed data in Git Objects.
# Inspect the type of Git Object
git cat-file -t 557db

# Inspect the content of Git Object
git cat-file -p 557db
# Type of Git Object
blob

# Content of Git Object
Hello World
Now that we understand what a Blob is, let’s create a new commit.
git commit -m "First commit"
+
Looking at .git/objects again, there are two new Git Objects created.
.
|____objects
| |____55
| | |____7db03de997c86a4a028e1ebd3a1ceb225be238
| |____97
| | |____b49d4c943e3715fe30f141cc6f27a8548cee0e <-- New file 1
| |____c5
| | |____5df28adf8320cc4d15637b82e8a0b13422d955 <-- New file 2
...
If we inspect 97b49 with cat-file, the Git Object type and its contents are shown below.
# Type of Git Object
tree

# Content of Git Object
100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238 hello.txt
It can be seen that this particular Git Object is a Tree. More specifically, it has a pointer to a Blob with hash digest 557db while naming it hello.txt. It also states that the file has a 644 permission.
In the example, the Tree has one Blob pointer but in reality it can have multiple Blob pointers and even other Tree pointers. In other words, a Tree simply contains pointers to other Git Objects and is identified by the SHA-1 hash of its contents.
Excluding the file names from Blobs is an intentional optimization by Git. If there are two files with duplicate content but different names, Git’s representation will be multiple pointers pointing to the same Blob.
There is still one more Git Object. If we inspect c55df with cat-file, the results are shown below.
# Type of Git Object
commit

# Content of Git Object
tree 97b49d4c943e3715fe30f141cc6f27a8548cee0e
author yarkhinephyo <yarkhinephyo@gmail.com> 1652402598 +0800
committer yarkhinephyo <yarkhinephyo@gmail.com> 1652402598 +0800
It can be seen that a Commit contains a pointer to a single Tree encompassing the contents of the commit and other bookkeeping details (such as author and timestamp). Similar to other Git Objects, a Commit is also identified by the SHA-1 hash of its contents.
echo "Hello World" > hello.txt
+git add hello.txt
+git commit "First commit"
+
Considering all the pointers, the Git Objects resulting from these commands can be represented as a graph.
DAG after first commit - Diagram by author
This is essentially the data structure powering Git repositories, stored right in the .git/objects directory as compressed binary files.
Let's see what happens with a new commit. We will modify hello.txt and add new_file.txt in the second commit.
echo "Bye" >> hello.txt
+echo "I love git" > new_file.txt
+git add hello.txt new_file.txt
+git commit -m "Second commit"
+
If we look at the .git/objects directory and inspect the new Git Objects with the cat-file tool, it is possible to manually update the graph.
DAG after second commit - Diagram by author
There are two interesting observations.
First, the new Commit has a pointer to the parent Commit in its contents. This means that whatever is in the ancestor Commits affects the SHA-1 calculation of the new Commit. Therefore, as long as we have the SHA-1 calculation of the latest commit, the integrity of Git history can be verified.
Second, a new Blob is created after hello.txt has been modified, and a new Tree stores a pointer to it. This is because Git Objects are immutable. Whatever changes are made in a new commit do not mutate the previous Git Objects or modify their SHA-1 calculations.
This DAG, where each node has an identifier resulting from hashing its contents, is called a Merkle DAG. This data structure also plays an important role in Web3 applications.
Git Submodules allow one Git repository to be a subdirectory of another. I keep forgetting the commands, so I have created a 2-minute refresher for my future reference.
To add a submodule to a project, run the command as shown below. Git will clone the submodule to the path provided and create a new .gitmodules file to store the information.
git submodule add <remote-url> <path-to-module>
Note that the <path-to-module> is now tracked by the parent repository as a commit ID instead of a subdirectory of contents. Treat it as a file for all practical purposes.
git add <path-to-module> .gitmodules
git commit -m "Added submodule"
Only the submodule’s commit ID is inspected by the parent repository. When the submodule’s commit is modified, the parent repository will react similarly to how a file has been modified. Add the modified “file” to staging and commit as usual.
git add <path-to-module>
git commit -m "Updated submodule"
After pulling changes from the parent repository, only the submodule’s tracked commit ID will be updated, not its contents. Manually update the contents of the submodule to synchronize with the updated commit ID.
# This updates the commit IDs of submodules
git pull origin main

# Update the contents of the submodules
git submodule update --init --recursive
To clone a repository together with its submodules, add a --recursive flag.
git clone --recursive <module>
A cross-site request forgery (CSRF) attack occurs when a web browser is tricked into executing an unwanted action in an application that a user is logged into.
For example, User A may be logged onto Bank XYZ in the browser, which uses cookies for authentication. Let's say a transfer request looks like this:
GET http://bank-xyz.com/transfer?from=UserA&to=UserB&amt=100 HTTP/1.1
Then a malicious actor can embed a request with similar signatures inside an innocent looking hyperlink.
<a href="http://bank-xyz.com/transfer?from=UserA&to=MaliciousActor&amt=10000 HTTP/1.1></a>
+
If User A clicks on the hyperlink, the web browser sends the request together with the session cookie. The funds are then unintentionally transferred out of User A’s account.
The same-origin policy prevents a page from accessing results of cross domain requests. It prevents a malicious website from accessing another website’s resources such as static files and cookies.
Even though the policy prevents cross-origin access of resources, it does not prevent the requests from being sent.
In the Bank XYZ example, a GET request with the relevant cookies triggers the server to transfer the funds before returning the 200 OK response. As shown in the diagram below, the same-origin policy only prevents the access of resources, which in this case is reading the HTTP response. Since the request (with cookies) can still be sent, the hyperlink can still trigger the transfer of funds.
Image by Author - The policy would only prevent a cross-origin access of HTTP response (Step 3)
Note: For more complex HTTP requests, a preflight OPTIONS request is sent beforehand to check for relevant CORS headers. In that scenario, an unexpected cross-origin request will not reach Bank XYZ's website at all.
To prevent CSRF, Bank XYZ can generate an unpredictable token for each client which is validated in the subsequent requests. For example, a hidden HTML field can allow the token to be included in subsequent form submissions.
<input type="hidden" name="csrf-token" value="CIwNZNlR4XbisJF39I8yWnWX9wX4WFoz" />
+
Other websites running in User A's browser do not have access to the form field due to the same-origin policy. So malicious scripts from other origins can no longer make the same request to transfer funds.
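As a rough server-side sketch (not any particular framework's API), the token can be generated per session and compared against the submitted form value on every state-changing request:

import hmac
import secrets

def issue_csrf_token(session):
    # Generate an unpredictable token once per session and remember it server-side
    session["csrf_token"] = secrets.token_urlsafe(32)
    return session["csrf_token"]

def is_valid_csrf_token(session, submitted_token):
    # Constant-time comparison of the hidden form field against the stored token
    expected = session.get("csrf_token", "")
    return bool(expected) and hmac.compare_digest(expected, submitted_token)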
Even though I frequently use transformers for NLP projects, I have struggled with the intuition behind the multi-head attention mechanism outlined in the paper "Attention Is All You Need". This post will act as a memo for my future self.
Consider the sequence of words - pool beats badminton. For the purpose of machine learning tasks, we can use word embeddings to represent each of them. The representation can be a matrix of three word embeddings.
If we take a closer look, the word pool has multiple meanings. It can mean a swimming pool, some cue sports or a collection of things such as money. Humans can easily perceive the correct interpretation because of the word badminton. However, the word embedding of pool includes all the possible interpretations learnt from the training corpus.
Can we add more context to the embedding representing pool? Optimally, we want it to be “aware” of the word badminton more than the word beats.
Consider that matrix A represents the sequence - pool beats badminton. There are three words (rows) and the word embedding has four dimensions (columns). The first dimension represents the concept of sports. Naturally, we expect the words pool and badminton to have more similarity in this dimension.
import numpy as np

A = np.array([
    [0.5, 0.1, 0.1, 0.2],
    [0.1, 0.5, 0.2, 0.1],
    [0.5, 0.1, 0.2, 0.1],
])
If we do a matrix multiplication between A and its transpose Aᵀ, the resulting matrix contains the dot-product similarities between all possible pairs of words. For example, the word pool is more similar to badminton than to the word beats. In other words, this matrix hints that the word badminton should be more important than the word beats when adding more context to the word embedding of pool.
A_At = np.matmul(A, A.T)
>>> A_At
array([[0.31, 0.14, 0.3 ],
       [0.14, 0.31, 0.15],
       [0.3 , 0.15, 0.31]])
By applying the softmax function across each word, we can ensure that these “similarity scores” add up to 1.0.
The last step is to do another matrix multiplication with matrix A. In a way, this step consolidates the contexts of the entire sequence to each embedding in an “intelligent” manner. In the example below, both embeddings of beats and badminton are added to pool but with different weights depending on their similarities with pool.
from scipy.special import softmax  # one possible softmax implementation

output = np.round(np.matmul(softmax(A_At, axis=1), A), 2)
>>> output
array([[0.38, 0.22, 0.16, 0.14],
       [0.35, 0.25, 0.17, 0.13],
       [0.38, 0.22, 0.17, 0.13]])
Notice that the output matrix has the same dimensions (3 x 4) as the original input A. The intuition is that each word vector is now enriched with more information. This is the gist of the self-attention mechanism.
The picture below shows the Scaled Dot-Product Attention from the paper. The core operations are the same as the example we explored. Notice that scaling is added before the softmax to ensure stable gradients, and there is an optional masking operation. The inputs are also termed Q, K and V.
Image taken from Attention Is All You Need paper
The Scaled Dot-Product Attention can be represented as an attention(Q, K, V) function.
Diagram by Frank Odom on Medium
The initial example that we used can be represented as attention(A, A, A), where matrix A contains the word embeddings of pool, beats and badminton. So far there are no weights involved. We can make a simple adjustment to add trainable parameters.
Imagine we have (m x m) matrices M_Q, M_K and M_V, where m matches the dimension of the word embeddings in A. Instead of passing matrix A directly to the function, we calculate Q = A·M_Q, K = A·M_K and V = A·M_V, which have the same shape as A. Then we apply attention(Q, K, V) afterwards. In a neural network, this is akin to adding a linear layer before each input into the Scaled Dot-Product Attention.
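Here is a small numpy sketch of that adjustment, reusing matrix A and the softmax import from the earlier snippets. The projection matrices are randomly initialized stand-ins for learned weights.

rng = np.random.default_rng(0)
d_k = A.shape[1]  # Embedding dimension (4 in the running example)

# Stand-ins for the learned projection matrices M_Q, M_K and M_V
M_Q, M_K, M_V = (rng.normal(size=(d_k, d_k)) for _ in range(3))

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    weights = softmax(scores, axis=1)
    return np.matmul(weights, V)

output = attention(np.matmul(A, M_Q), np.matmul(A, M_K), np.matmul(A, M_V))
print(output.shape)  # (3, 4), the same shape as A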
To complete the Single-Head Attention mechanism, we just need to add another linear layer after the output from the Scaled Dot-Product Attention. The idea of expanding to the Multi-Head Attention in the paper is then relatively simple to grasp.
Diagram by Frank Odom on Medium
Immediately-invoked Function Expression are anonymous functions that wrap around code blocks to be imported. In the example below, the inner function sayHi()
cannot be accessed outside the anonymous function. The anonymous function itself also does not have a name so it does not pollute the global scope.
// script1.js
(function () {
  var userName = "Steve";
  function sayHi(name) {
    console.log("Hi " + name);
  }
  sayHi(userName);
})();
If this script is included as shown below, no variable name collision can occur with other scripts such as script2.js.
<!DOCTYPE html>
<html>
  <head>
    <title>JavaScript Demo</title>
    <script src="script1.js"></script>
    <script src="script2.js"></script>
  </head>
  <body>
    <h1>IIFE Demo</h1>
  </body>
</html>
What if script2.js wants to use the sayHi() function defined in script1.js? We can pass a common global variable through the two IIFE modules as shown below.
// script1.js
(function (window) {
  function sayHi(name) {
    console.log("Hi " + name);
  }
  window.script1 = { sayHi };
})(window);
// script2.js
(function (window) {
  function sayHiBye(name) {
    window.script1.sayHi(name);
    console.log("Bye " + name);
  }
  var userName = "Jenny";
  sayHiBye(userName);
})(window);
This solves the immediate problem, but generates other issues.
If we reorder script1.js and script2.js, the code will break as the window object will not have the script1 object by the time script2.js starts to load.
There is also the problem of which common variable to pass between the two IIFEs. One company may use the window object but another may create a new app object in the global scope. No strict standards means incompatibility issues.
CommonJS is a series of specifications for the development of JavaScript applications in non-browser environments. One of the specifications is the API for importing and exporting modules. This is where require() and module.exports are introduced.
There is no more need for passing around a global variable or wrapping an anonymous function around every code block for export.
// script1.js
function sayHi(name) {
  console.log("Hi " + name);
}
module.exports.sayHi = sayHi;
// script2.js
var script1 = require("./script1.js");
function sayHiBye(name) {
  script1.sayHi(name);
  console.log("Bye " + name);
}
var userName = "Jenny";
sayHiBye(userName);
However, CommonJS was not meant for the browser environment. The specifications also do not support asynchronous loading of modules, which is important in the browser for the user experience.
Module bundlers such as Webpack solve the incompatibility problem by bundling CommonJS modules for usage in the browser. The modules are loaded into a single bundle.js file such that individual dependencies are satisfied, and the bundle can be loaded onto the page with a single <script> tag.
For the example above, Webpack can produce a single bundle.js with script2.js as the entry. The bundle will include script1.js first as it understands the dependency graph. By including bundle.js in the HTML as shown below, the abovementioned problems with CommonJS are fixed.
<!DOCTYPE html>
<html>
  <head>
    <title>JavaScript Demo</title>
    <script src="bundle.js"></script>
  </head>
  <body>
    <h1>Webpack Demo</h1>
  </body>
</html>
ES6 is a JavaScript standard introduced in 2015 that finally introduced a module system for JavaScript in browsers. ES6 modules utilize the import and export keywords. Unlike CommonJS, Webpack is not necessary for browser compatibility. We only need to add a type="module" attribute inside the HTML <script> tag and everything will work out of the box.
<!DOCTYPE html>
<html>
  <head>
    <title>JavaScript Demo</title>
    <script type="module" src="script2.js"></script>
  </head>
  <body>
    <h1>ES6 Demo</h1>
  </body>
</html>
// script1.js
function sayHi(name) {
  console.log("Hi " + name);
}
export default { sayHi };
// script2.js
import script1 from './script1.js';
function sayHiBye(name) {
  script1.sayHi(name);
  console.log("Bye " + name);
}
var userName = "Jenny";
sayHiBye(userName);
Backward Compatibility: ES6 modules are not recognized in older browser versions. Bundlers allow developers to work with the more modern ES6 syntax while the code remains compatible with older browsers.
Size Reduction: Minifying code with bundlers reduces file sizes which will lead to faster page loads.
Code Splitting: Bundlers can split code into chunks which can then be loaded on demand or in parallel.
Caching Support: Webpack can be configured to name the bundles with the hash of their contents. Browsers will only fetch scripts from the server if the hashes no longer match.
Depending on the resources available and the performance metric of an application, different Garbage Collectors (GC) should be considered for the underlying Java Virtual Machine. This post explains the main idea behind the garbage collection process in JVM and summarizes the pros and cons of Serial GC, Parallel GC and Concurrent-Mark-Sweep GC.
Garbage-First GC (G1) is out of scope for this post as it works very differently from the other algorithms (and I still have not wrapped my head around it). This post also assumes familiarity with heap memory.
These symbols will be used to illustrate the memory allocation in heap.
o - unvisited
x - visited
<empty> - free
Mark-Sweep: The objects in the heap that can be reached from root nodes (such as stack references) are marked as visited. While sweeping the memory, the regions occupied by the unvisited objects are updated to be free. As there are likely to be fewer contiguous free regions after a collection, external fragmentation is likely to occur.
Marked | x |o| x | o |x|
Swept  | x | | x | |x|
Mark-Sweep-Compact: After marking, the visited objects are identified and compacted to the beginning of the memory region. This solves the external fragmentation issue, but more time is required as objects have to be moved and references have to be updated accordingly.
Marked    | x |o| x | o |x|
Swept     | x | | x | |x|
Compacted | x | x |x| |
Mark-Copy: After marking, the visited objects are relocated to another region. This accomplishes compaction of allocated memory at the same step. However, the disadvantage is that there is a need to maintain one more memory region.
Marked | x |o| x | o |x| |
Copied | | x | x |x| |
During parts of a garbage collection, all application threads may be suspended. This is called stop-the-world pause. Long pauses are especially undesirable in interactive applications.
The Weak Generational Hypothesis states that most objects die young.
In JVM, heap memory is divided into two regions - Young Generation and Old Generation. Newly created objects are stored in the Young Generation and the older ones are promoted to the Old Generation. With this separation, GC can work more often in a smaller region where dead objects are more likely to be found.
 <- Young Generation ->
+--------------------+--------------------+
|     Eden Space     |                    |
+----------+---------+   Old Generation   |
|    S0    |   S1    |                    |
+----------+---------+--------------------+
Young Generation: The region is divided into the Eden Space where new objects are created, and the S0/S1 Space where the visited objects from each garbage collection can be copied to. Naturally, the Mark-Copy algorithm is used.
Old Generation: As there is no delimited region for the visited objects to be copied to, only the Mark-Sweep and Mark-Sweep-Compact algorithms can be used.
JVM option: -XX:+UseSerialGC
This option uses Mark-Copy for the Young Generation and Mark-Sweep-Compact for the Old Generation. Both of the collectors are single-threaded. Without leveraging multiple cores present in modern processors, the stop-the-world pauses are longer. The advantage is that there is less resource overhead compared to other options.
JVM option: -XX:+UseParallelGC -XX:+UseParallelOldGC
Similarly to Serial GC, this option uses Mark-Copy for the Young Generation and Mark-Sweep-Compact for the Old Generation. Unlike Serial GC, multiple threads are run for the respective algorithms. As less time is spent on garbage collection, there is higher throughput for the application.
JVM option: -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
For the Young Generation, this option is the same as Parallel GC. For the Old Generation, this option runs most of the job in Mark-Sweep concurrently with the application. This means that the application threads continue running during some parts of the garbage collection. Hence this option is less affected by stop-the-world pauses compared to the other two, making it preferred for interactive applications.
As at least one thread is used for garbage collection all the time, the application has lower throughput. Without compaction, external fragmentation may also occur. When this happens, there is a fallback with Serial GC but it is very time-consuming.
This video is about how networking works in Kubernetes by Bowei Du and Tim Hockin from Google.
All pods can reach all other pods across nodes. The network drivers on each node and the networking between pods are implemented by the CNI implementation used by the Kubelet.
One implementation is using hosts as a router device while a routing entry is added for each pod. For example, Flannel host-gw mode and Calico BGP mode. Another implementation is using overlay networks where layer 2 frames are encapsulated into layer 4 UDP packets alongside a VxLAN header. For example, Flannel and Calico VxLAN mode.
Pod IP addresses are ephemeral. Service API exposes a group of pods via one IP (ClusterIP). This is how the API works:
kind: Service
apiVersion: v1
metadata:
  name: my-service
  namespace: default
spec:
  selector:
    app: my-app
  ports:
    - port: 80          # for clients
      targetPort: 9376  # for backend pods
KubeProxy runs on every node in the cluster. It uses iptables, IPVS or userspace options to proxy traffic from pods.
The KubeProxy control plane accumulates changes to Endpoints and Services, then updates rules on the node. In the data plane, the sending KubeProxy recognizes the ClusterIP/port and rewrites packets to the new destination (DNAT). The recipient KubeProxy un-DNATs the packets.
To disambiguate, CNI ensures the Pod IPs work. KubeProxy redirects ClusterIP to Pod IP before sending over the network.
Endpoint objects are a list of IPs behind a Service. An Endpoint Controller manages them automatically.
When a Service object is created, an Endpoint object is created that has a mapping of service name to pod addresses and ports. This object is fed into the rest of the system such as KubeDNS and KubeProxy.
The Ingress API provides HTTP proxying and L7 routing rules, each targeting a Service. Kubernetes defines the API but the implementations are all third party.
Unlike the Ingress API, Service-type load balancers only work at L4 level.
Ingress {
  hostname: foo.com
  paths:
  - path: /foo
    service: foo-svc
  - path: /bar
    service: bar-svc
}
DNS resource cost is high. There are more microservices addressed by names and more application libraries tending to use DNS names. The solution is to run a DNS cache on every node.
The NodeLocal DNS implementation is deployed on each node as a DaemonSet.
A dummy network interface is created that binds to the ClusterIP address of KubeDNS. The Linux NOTRACK target is added for the KubeDNS ClusterIP before any KubeProxy rules. This ensures that NodeLocal DNS can process the packets without them reaching KubeProxy.
A watcher process removes the NOTRACK entries in the event that NodeLocal DNS fails. This defaults back to the original KubeDNS infrastructure.
Endpoint objects are stored in the Etcd database. When one pod IP changes, the entire object has to be redistributed to every KubeProxy. If Endpoint objects are large, they may also hit the maximum storage limit in Etcd.
The solution is to represent one original Endpoint object with a set of EndpointSlice objects. A single update to pod IP will only require redistribution of one EndpointSlice object.
The EndpointSlice controller slices from a Service object to create EndpointSlice objects.
Interesting optimization problem:
Scale-invariant Feature Transform, also known as SIFT, is a method to consistently represent features in an image even under different scales, rotations and lighting conditions. Since the video series by First Principles of Computer Vision covers the details very well, the post covers mainly my intuition. The topic requires prior knowledge on using Laplacian of Gaussian for edge detection in images.
Image by First Principles of Computer Vision
Consider the two images. How can the computer recognize that the object in the left is included inside the image on the right? One way is to use template-based matching where the left image is overlapped onto the right. Then some form of similarity measure can be calculated as it is shifted across the right image.
Problem: To ensure different scales are accounted for, we would need templates of different sizes. To check for different orientations, we would need a template for every unit of angle. To overcome occlusion, we may even need to split the left image into multiple pieces and check if each of them matches.
For the example above, our brains recognize the eye and the faces to locate the book. Our eyes do not scan every pixel, and we are not affected by the differences in scale and rotation. Similarly, it would be great if we could 1) extract only interesting features from an image and 2) transform them into representations that are consistent across different scenes.
Points of Interest: Blob-like features with rich details are preferred over simple corners or edges.
Insensitive to Scale: The feature representation should be normalized to its size.
Insensitive to Rotation: The feature representation should be able to undo the effects of rotation.
Insensitive to Lighting: The feature representation should be consistent under different lighting conditions.
Image from Princeton CS429 - 1D edge detection
In traditional edge detection, a Laplacian operator can be applied to an image through convolution. Edges can be identified from the ripples in the response.
Image from Princeton CS429 - 1D blob detection
If multiple edges are at the right distance, there will be a single strong ripple caused by constructive interference. If this response is sufficiently strong, the location is identified as a blob representing a feature. Intuitively, complex features will be chosen over simple edges, as constructive interference cannot be produced by a single edge.
From the same diagram, we can also see that not all collections of edges result in a single ripple with a particular Laplacian operator. By increasing the sigma (σ) of the Laplacian (making the kernel "fatter"), the constructive interference will occur when edges are further apart. If we apply Laplacian operators with varying σ's, blobs of different scales can be identified each time.
Image from Princeton CS429 - Increasing σ to identify larger blobs
Wait, but if σ is larger, the Laplacian response will be weaker (shown above). Intuitively, the responses from larger blobs are penalized for their size. Does that mean the selected features will be mostly tiny?
Image from Princeton CS429 - Normalized Laplacian of the Gaussian (NLoG)
We solve this by multiplying the Laplacian response with σ² for normalization (this works out because the Laplacian is the second Gaussian derivative). Intuitively, this means that the response now only indicates the complexity of the features without any effect from their sizes.
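Written out in the usual scale-space notation (added here for reference), the value compared across scales is the scale-normalized Laplacian \( \text{NLoG}(x, y; \sigma) = \sigma^{2}\,\big(L_{xx}(x, y; \sigma) + L_{yy}(x, y; \sigma)\big) \), where \(L\) is the image smoothed by a Gaussian of standard deviation \(\sigma\). Blob centers and sizes correspond to extrema of this response over \((x, y, \sigma)\).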
3 x 3 x 3 kernels to find local extrema
Imagine the Laplacian response represented as a matrix with the x-y plane for image dimensions and the z axis for various σ. We can slide an n x n x n kernel to find the local extrema. The resulting x-y coordinates would represent the centers of the blobs and σ would correspond to their sizes.
With this technique, blobs can be extracted to represent complex features with the sizes normalized.
To assign an orientation to each feature, it can be divided into smaller windows as shown above. Then the pixel gradients for each window can be computed to produce a histogram of gradient directions. The most prominent direction can be assigned as the principal orientation of the feature.
In the example above, blobs are identified in both images representing the same feature. The black arrows are the principal orientations. After rescaling the blob sizes with the corresponding σ's, the effect of rotation is eliminated by aligning with respect to the principal orientations.
Image from Princeton CS429 - Pixels to SIFT descriptors
Instead of comparing each blob directly (pixel-by-pixel), we can produce a unique representation that is invariant to lighting conditions. As shown above, the image can be broken into smaller windows (4 x 4) where a histogram of the gradients is computed for each. If each histogram only considers 8 directions, there will be 8 dimensions per window. Even with only 16 windows per blob, each feature representation will have 128 dimensions (16 x 8), which makes it robust.
These feature representations are known more formally as SIFT descriptors.
Image from OpenCV documentation
For matching images, SIFT descriptors in two images can be directly compared against one another through similarity measurements. If a large number of them match, it is likely that the same objects are observed in both images. In practice, nearest neighbor algorithms such as FLANN are used to match the features between images.
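As a rough usage sketch with OpenCV (assuming two local images img1.jpg and img2.jpg; recent opencv-python builds expose SIFT as cv2.SIFT_create):

import cv2

img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints (blob centers with scale and orientation) and compute 128-dimensional descriptors
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors with FLANN and keep the matches that pass Lowe's ratio test
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
matches = flann.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
print(len(good), "good matches")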
I enjoy watching 45-minute to 1-hour long technical talks at conferences. Unfortunately, I am not retaining the knowledge as long as I would like to. From now on, I am going to try summarizing my takeaways for each video to improve my own retention.
This video is about scaling monitoring from Prometheus to M3 at Databricks presented by YY Wan and Nick Lanham.
A Prometheus-based monitoring system has been used at Databricks since 2016. Most internal services run on Kubernetes and Spark workloads run in VMs in customer environments. PromQL is widely used by engineers.
In each region, there are two Prometheus servers, Prom-Normal and Prom-Proxied. Prom-Normal scrapes metrics from internal services in k8s pods. Metrics from external services are pushed by Kafka to the Metrics Proxy Service (on k8s). Prom-Proxied scrapes metrics from the Metrics Proxy Service. Having two servers also means metrics can be sharded logically (internal/external) as all the metrics would not fit on one. Disks are attached to each Prometheus server to store metrics.
Globally, there is a Prometheus server that contains a subset of metrics federated from all the regions.
Users interact with the monitoring system in two ways: alerting and querying. Regional Prometheus servers issue alerts to the Alert Manager Service which notifies engineers via PagerDuty. Users also query regional or global servers for insights.
50 regions across multiple cloud providers with 4 million VMs of Databricks services and Spark workers.
Must:
Nice to have:
Why M3 solves the problem for Databricks:
Application --- M3 collector
                     |
                M3 aggregator
                     |
                   M3DB --- M3 query --- Grafana
Prom-Normal and Prom-Proxied remote-write data in M3DB instead of local disks.
However, remote-writes by only two Prometheus servers could not be achieved at the scale that Databricks required.
More servers would achieve higher write throughput into M3DB.
To replace Prom-Normal, multiple Grafana Scrape Agents scrape metrics from internal services and write to M3DB.
To replace Prom-Proxied, Metrics Proxy Service directly writes to M3DB. Note that this service is already made up of multiple servers. This reduces end-to-end latency of external metrics too.
Originally, the alerting rule configurations were used in Prometheus servers to issue alerts to the Alert Manager Service.
Databricks built its own rule engine that takes the same configurations and interacts with M3DB and Alert Manager Service.
M3 Coordinators were having noisy neighbor issues. If users submit heavy queries, the coordinators would not be able to serve the write paths from Metrics Proxy Service and Grafana Scrape Agents.
To solve this, M3 Coordinators were separately deployed for read and writes. CPU-heavy machines for write-coordinators and Memory-heavy machines for read-coordinators.
Vanilla Prometheus servers scrape the M3-related components. The metrics retention period is short but sufficient for this use case.
A global Prometheus server federates metrics from all the Prometheus servers.
This video is about Snowflake Iceberg Tables, Streaming Ingest and Unistore. The presenters are N.Single, T.Jones and A.Motivala as part of the Database Seminar Series by CMU Database Group.
Traditional data lakes use file systems as the metadata layer. For example, data for each table is organized in a directory. Partitioning is implemented through nested directories. Using directories as database tables causes problems.
Table metadata and data are stored as Parquet files in the customer's bucket.
                 +-------------------------------------------------+
 Cloud services  | Authentication and Authorization                |
                 +-------------------------------------------------+
                 | Infra Manager | Optimizer | Transaction Manager |
                 +-------------------------------------------------+
                 | Metadata Storage (Customer's Bucket)            |
                 +-------------------------------------------------+

                 +-------------------+   +-------------------+
 Compute         | Warehouse         |   | Warehouse         |
                 +-------------------+   +-------------------+

                 +-------------------------------------------------+
 Storage         | Data (Customer's Bucket)                        |
                 +-------------------------------------------------+
Customers have to provide Snowflake External Volumes on any of the cloud providers with access credentials. Data and metadata files are written to the External Volume.
Snowflake originally has its own files to store snapshot metadata. To support the Iceberg format, each table commit requires the generation of both Iceberg metadata and internal Snowflake metadata.
The generation of additional metadata files (Iceberg) would increase query latency significantly, so the Iceberg metadata files are generated in the background at the same time.
When the Snowflake metadata files are generated, the transaction is considered committed. If the server crashes before the Iceberg metadata is generated, the request would come to the new Snowflake server and the Iceberg metadata would be generated on the fly.
The Iceberg SDK accesses a catalog which returns the location of metadata files in customers’ buckets. Then the SDK interprets the metadata files and returns the locations of data files in an API to Spark.
Spark ---> Iceberg SDK ---> 1. Catalog (Hive, Glue)
                |
                |---------> 2. Storage (Snapshot Metadata)
                |
                |---------> 3. Data Files
Before this feature, the original Snowpipe did continuous copying from a bucket to a table behind the scenes, in batches. However, there was no low latency, high throughput, in-order processing feature. Snowpipe Streaming provides:
New concepts include:
The implementation details:
Snowflake's product for combining transactional and analytical workloads on one platform.
A new table type that works with existing Snowflake tables and supports transactional features such as unique keys, referential integrity constraints and cross-domain transactions.
CREATE HYBRID TABLE CustomerTable (
  customer_id int primary key,
  full_name varchar(256),
  ...
);
HTTP/2 has made our applications faster and more robust by providing protocol enhancements over HTTP/1. This post only focuses on the major pain points of HTTP/1 and how the new protocol has been engineered to overcome them. It is assumed that the reader is familiar with how HTTP/1 works.
Head-of-line blocking: Browsers typically only allow 6 parallel TCP connections per domain. If the initial requests are not complete, the subsequent requests will be blocked.
Unnecessary resource utilization: With HTTP/1, a single connection is created for every request, even if multiple requests are directed to the same server. As the server has to maintain states for each connection, there is an inefficient utilization of resources.
Overhead of headers: Headers in HTTP/1 are in the human-readable text format instead of being encoded to be more space-efficient. As there are numerous headers in complex HTTP requests and responses, it can become a significant overhead.
In HTTP/2, there is a binary framing layer that exists between the HTTP API exposed to applications and the transport layer. The rest of the TCP/IP stack is unaffected. As long as both the client and server implement HTTP/2, the applications will continue to function as usual.
In HTTP/1, each HTTP request creates a separate TCP connection as shown below.
| - Application - | - Transport - |

request_1 --> connection_1
request_2 --> connection_2
In HTTP/2, the binary framing layer breaks down each request into units called frames. These frames are interleaved and sent to the transport layer as application data. The transport layer is oblivious to the process and carries on with its own responsibilities. At the server's end, the binary framing layer reconstructs the requests from the frames.
| - Application ---------------- | - Transport - |
|         Binary Framing         |

request_1 ---> frames ---> connection_1
request_2 -/
To be specific, each HTTP request is broken down into a HEADERS frame and DATA frame/s. The names are self-explanatory: the HEADERS frame includes the HTTP headers and the DATA frame/s include the body.
The diagram below shows the structure of a frame.
+-----------------------------------------------+
|                 Length (24)                   |
+---------------+---------------+---------------+
|   Type (8)    |   Flags (8)   |
+-+-------------+---------------+-------------------------------+
|R|                 Stream Identifier (31)                      |
+=+=============================================================+
|                   Frame Payload (0...)                      ...
+---------------------------------------------------------------+
Notice that each HTTP/2 frame has an associated stream identifier which identifies each bidirectional flow of bytes. For example, all the frames in a single request-response exchange will have the same stream identifier.
This means that when frames from different requests are interleaved with one another, the receiving binary framing layer can reconstruct them back into independent streams.
Interleaving frames with different stream identifiers
In other words, the hierarchical relationship between connection, stream and frame can be represented as shown below.
Logical relationship between frames and streams
Aside from multiplexing HTTP requests over a single TCP connection, HTTP/2 also provides a mechanism for header compression. Instead of transporting textual data, both the server and client maintain identical lookup tables to remember the headers that have been used. In subsequent communication, only the pointers into the lookup table are sent over the network. Tests have shown that on average, the header size is reduced by around 85%-88%.
HTTP/2 solves the head-of-line blocking that came from the limited number of parallel TCP connections. However, this creates another problem at the TCP level: due to the nature of TCP, one lost packet makes all the streams wait until that packet is re-transmitted and received.
HTTP/3 addresses this issue by communicating over QUIC (TCP-like protocol over UDP) instead of TCP.
There are multiple ways to speed up computations in Python. The cython language compiles Python-like syntax into CPython extensions. Libraries such as numpy provide methods to manipulate large arrays and matrices efficiently with underlying C data structures.
In this post, I will be discussing the ctypes module. It provides C-compatible data types so that Python functions can use C-compiled shared libraries. Therefore, we can offload computationally intensive modules of a Python application into C, where developers have more fine-grained control. To my surprise, this comes as part of the Python standard library, so no external dependencies are required!
I have created a sample program that we can speed up afterwards using the ctypes module. num_primes() calculates the total number of primes in a list by looping through each item.
# prime.py
from typing import List

def is_prime(num: int):
    for i in range(2, int(num**(0.5))):
        if num % i == 0:
            return 1
    return 0

def num_primes(num_list: List[int]):
    count = 0
    for num in num_list:
        count += is_prime(num)
    return count
Let’s see the number of primes in a list of 1 million integers. Note that we use consecutive numbers for the example, but they do not have to be.
# test_python.py
from prime import num_primes

MAX_NUM = 1000000
num_list = list(range(MAX_NUM))

def timeit_function():
    return num_primes(num_list)

print(f"Primes: {timeit_function()}")
It takes around 3.4 seconds to run. How can we speed this up?
>>> python -m timeit -n 5 -s 'import test_python as t' 't.timeit_function()'
Primes: 921295
5 loops, best of 5: 3.4 sec per loop
As Python has a threading module, one idea is to parallelize the calculation across the list by using multiple threads. However, this does not help due to Python’s Global Interpreter Lock (GIL), which prevents multiple threads in a process from executing Python bytecode at the same time. Hence, for non-I/O workloads, there will not be any speed-up.
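For example, a thread-based version of the pure-Python code (a sketch of mine, reusing prime.py from above) would still take roughly the same wall-clock time, because only one thread can execute Python bytecode at any moment:

import threading
from prime import num_primes

MAX_NUM = 1000000
NUM_THREADS = 4

# Split the numbers into 4 chunks, one per thread
chunks = [list(range(i, MAX_NUM, NUM_THREADS)) for i in range(NUM_THREADS)]
counts = [0] * NUM_THREADS

def worker(i):
    counts[i] = num_primes(chunks[i])

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# CPU-bound work under the GIL: no faster than the single-threaded version
print(sum(counts))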
The prime checker is reimplemented in C as shown below and compiled into a shared library libprime.so. Note that the program logic is exactly the same.
// prime.c
#include <stdio.h>
#include <math.h>

int is_prime(int num) {
    for (int i=2; i<(int)sqrt(num); i++) {
        if (num % i == 0)
            return 1;
    }
    return 0;
}

int num_primes(int arrSize, int *numArr) {
    int count = 0;
    for (int i=0; i<arrSize; i++)
        count += is_prime(numArr[i]);
    return count;
}
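A typical way to build the shared library on Linux would be something like the following (flags may vary by platform and compiler):

gcc -shared -fPIC -O2 -o libprime.so prime.c -lm   # -lm links the math library for sqrt()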
The ctypes library provides C-compatible data types in Python. All we need to do is load the shared library with the CDLL() API and then declare the parameter and return types with the argtypes and restype attributes.
from ctypes import *

# Load the shared library
lib = CDLL("./libprime.so")
# Declare the return data as 32-bit int
lib.num_primes.restype = c_int32
# Declare the arguments as a 32-bit int and a pointer for 32-bit int (for list)
lib.num_primes.argtypes = [c_int32, POINTER(c_int32)]
Afterwards, num_primes() in the shared library can be called! Note that num_list has to be converted from a Python list into a contiguous C array with a method provided by ctypes.
MAX_NUM = 1000000
num_list = list(range(MAX_NUM))

def timeit_function():
    # num_list is converted into an integer array of size MAX_NUM
    return lib.num_primes(MAX_NUM, (c_int32 * MAX_NUM)(*num_list))

print(f"Primes: {timeit_function()}")
For the same input of 1 million integers, the speed-up is significant just from offloading the same program logic to C. This makes sense because contiguous C arrays leverage CPU caching better than Python lists, and the inner loop avoids interpreter overhead.
>>> python -m timeit -n 5 -s 'import test_ctypes as t' 't.timeit_function()'
Primes: 921295
5 loops, best of 5: 482 msec per loop
There is one more benefit of offloading the work to C. Since the shared library is not under Python’s GIL, we can now use multithreading in C to parallelize the computations!
In the code below, the integer array is split evenly into 4 subarrays, and 4 threads are spawned with POSIX pthreads to do the work in parallel. Each thread runs thread_function() to check its share of the numbers without any overlap between threads. The per-thread prime counts are written into the countByThreads array, which is summed up after the child threads have terminated.
#include <pthread.h>    // pthread_create, pthread_join
#include <stdlib.h>     // malloc, free, exit

#define NUM_THREADS 4 // 4 threads used

// Global variables for spawned threads to access
int *gArrSize = 0; // Ptr for array size
int *gNumArr = 0; // Ptr for input array
int countByThreads[NUM_THREADS] = { 0 }; // Prime counts of each thread
pthread_t tids[NUM_THREADS] = { 0 }; // IDs of each thread

// Function run by each thread
void *thread_function(void *vargp) {
    // Each thread has a different offset
    int offset = (*(int*) vargp);
    int count = 0;
    // Split the array items evenly across threads
    for (int i=offset; i < *gArrSize; i+=NUM_THREADS)
        count += is_prime(gNumArr[i]);
    countByThreads[offset] += count;
    free(vargp);
    return NULL;
}

int num_primes(int arrSize, int *numArr) {
    gArrSize = &arrSize;
    gNumArr = numArr;
    for(int i=0; i < NUM_THREADS; i++) {
        int *offset = (int*) malloc(sizeof(int));
        *offset = i;
        if(pthread_create(&tids[i], NULL, thread_function, (void *) offset) != 0)
            exit(1);
    }
    int count = 0;
    for(int i=0; i < NUM_THREADS; i++) {
        if(pthread_join(tids[i], NULL) != 0)
            exit(1);
        // Combine counts from each thread after termination
        count += countByThreads[i];
        countByThreads[i] = 0;
    }
    return count;
}
We have further sped up the code execution although there is an additional overhead of managing threads.
>>> python -m timeit -n 5 -s 'import test_ctypes_pthread as t' 't.timeit_function()'
Primes: 921295
5 loops, best of 5: 322 msec per loop
Remember the threading module in Python from earlier? Another neat thing about ctypes is that the program releases the GIL as long as execution is inside the C-compiled shared library. So instead of POSIX pthreads in C, we can spawn the threads with threading instead!
import threading
from ctypes import *

# Load the shared library
lib = CDLL("./libprime.so")
# Declare the return data as 32-bit integer
lib.num_primes.restype = c_int32
# Declare the arguments as a 32-bit integer & a pointer for 32-bit integer (for list)
lib.num_primes.argtypes = [c_int32, POINTER(c_int32)]
As before, num_primes() in the shared library is called after converting each Python list into a contiguous C array with ctypes.
MAX_NUM = 1000000
NUM_THREADS = 4

# Prime counts per thread
count_list = [0 for _ in range(NUM_THREADS)]
# One list of numbers for each thread
num_list_list = []

# Split the list for multiple threads
for i in range(NUM_THREADS):
    num_list = list(range(i, MAX_NUM, NUM_THREADS))
    num_list_list.append(num_list)

# Function run by each thread
def thread_function(i, num_list, count_list):
    len_num_list = len(num_list)
    count_list[i] = lib.num_primes(len_num_list, (c_int32 * len_num_list)(*num_list))

def timeit_function():
    threads = []
    for i in range(NUM_THREADS):
        t = threading.Thread(target=thread_function, args=(i, num_list_list[i], count_list))
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()
    return sum(count_list) # Combine counts from each thread

print(f"Primes: {timeit_function()}")
For this example, the speed-up is comparable to using pthreads.
>>> python -m timeit -n 5 -s 'import test_ctypes_threading as t' 't.timeit_function()'
Primes: 921295
5 loops, best of 5: 313 msec per loop
The code demonstrations can be found here.
This video is about streamlining FedRAMP compliance with CNCF technologies. The presenters are Ali Monfre and Vlad Ungureanu from Palantir Technologies.
FedRAMP is the accreditation required for companies to sell SaaS solutions to the government instead of on-prem solutions. General steps include:
For operating systems, major vendors have STIGs published. Palantir started running immutable machine images which were scanned during the CI process. This provided a faster feedback loop for the developers. Every host was also terminated every 72 hours. One side effect was that the vulnerabilities would be patched within three days.
For container images, an internal “golden image” was used by all products. The downstream images that used this were built automatically. Trivy (a scanning tool) was also embedded into CI.
Regarding FIPS, there is a long processing time for NIST (a government agency) to validate new kernels and cryptographic libraries. Thus, Palantir cannot use features offered by newer kernel versions.
Regarding service-to-service communication, Cilium CNI is used for k8s clusters. IPSec encryption in Cilium provides FIPS-validated encryption between pods. Cilium also has powerful network policy primitives, which made it easier to adhere to FedRAMP standards.
Regarding ingress/egress traffic, NGINX+ provides FIPS validation, but Palantir encountered performance problems with it. The decision was made to switch to Envoy, an open-source service proxy designed for cloud-native applications. BoringSSL with FIPS configured was used as the TLS provider.
Regarding the Host Intrusion Detection System, the osquery tool was originally used. However, it did not integrate well with k8s, so all the pods showed up as similar processes. The decision was made to switch to Isovalent Tetragon, an eBPF-based tool that integrates well with k8s.
There are more challenges that are not solved by CNCF technologies out of the box. To address these, Palantir created Apollo and FedStart, which help companies deploy software to federal environments.
This video is a deep dive on the vector database Weaviate as part of the weekly database seminars by CMU Database Group. The presenter is Etienne Dilocker from Weaviate.
Instead of indexing literal keywords from paragraphs, meanings (embeddings) can be indexed for search purposes.
LLMs tend to hallucinate less when some context around the question is given.
Retrieval Augmented Generation (RAG) retrieves the top few relevant documents from a vector database and provides them as context to the LLM. There is no need to retrain LLMs to keep up with the latest information.
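As a rough sketch of that flow (a toy example of mine, not Weaviate's client API; embed() below is a stand-in for a real embedding model):

import math

def embed(text):
    # Stand-in for a real embedding model: a character-frequency vector
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# In-memory stand-in for the vector database
documents = [
    "Weaviate shards collections across nodes",
    "HNSW is an approximate nearest neighbor index",
    "BM25 is a bag-of-words retrieval function",
]
store = [(doc, embed(doc)) for doc in documents]

def retrieve(question, k=2):
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

question = "How does Weaviate index vectors?"
context = "\n".join(retrieve(question))
# The retrieved documents are prepended to the prompt sent to the LLM
prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(prompt)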
Collections are logical groups defined by the user. Shards distribute data across multiple nodes.
In each shard, the HNSW index is used most of the time. The object store can keep any binary files related to the embeddings, so there is no need for secondary storage for non-key-value data. The inverted index allows searching by properties and BM25 (bag-of-words retrieval) queries.
    +-----------------------+
    | Weaviate Setup        |
    +-----------------------+
+-- | Collection "Articles" |
|   | Collection "Authors"  |
|   | ...                   |
|   +-----------------------+
|
|   +-----------------------+
+-> | Collection "Articles" |
    +-----------------------+
+-- | Shard A               |
|   | Shard B               |
|   | ...                   |
|   +-----------------------+
|
|   +-----------------------+
+-> | Shard A               |
    +-----------------------+
    | HNSW Index            |
    | Object Store (LSM)    |
    | Inverted Index (LSM)  |
    +-----------------------+
Consistent hashing on a specific key is used for sharding. On each node (physical shard), there can be multiple logical shards.
If the number of shards is changed on the fly, there are measures to ensure that only a minimal amount of data is moved between nodes.
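A minimal sketch of the consistent-hashing idea (my own illustration, not Weaviate's implementation): each shard owns several points on a hash ring and a key is assigned to the first shard clockwise from the key's hash, so adding or removing a shard only remaps the keys in the affected ring segments.

import bisect
import hashlib

def ring_hash(value):
    return int(hashlib.sha256(value.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, shards, vnodes=8):
        # Each shard gets several virtual nodes for a more even spread
        self.ring = sorted(
            (ring_hash(f"{shard}-{v}"), shard)
            for shard in shards
            for v in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def shard_for(self, key):
        # First ring point clockwise from the key's hash (wraps around)
        idx = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
for key in ["article-1", "article-2", "article-3"]:
    print(key, "->", ring.shard_for(key))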
The HNSW index is an approximate nearest neighbor proximity graph with multiple layers. Compared to other indexes, it is slower to build but faster to query.
Algorithm for querying a Navigable Small World (NSW) graph:
Considerations when building a NSW graph:
HNSW is a hierarchy of NSW layers. There are fewer connections per point on the higher layers, and fewer connections also means each connection “travels” a longer distance. The search starts from the higher layers and then moves down to the lower layers.
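A minimal sketch of the greedy traversal on a single layer (my own illustration of the general idea, not Weaviate's code); HNSW repeats this from the top layer down, using each layer's result as the entry point for the next:

import math

def dist(a, b):
    return math.dist(a, b)

def greedy_search(graph, vectors, entry, query):
    # Walk to whichever neighbor is closest to the query until stuck
    current = entry
    while True:
        neighbors = graph[current]
        best = min(neighbors, key=lambda n: dist(vectors[n], query), default=None)
        if best is None or dist(vectors[best], query) >= dist(vectors[current], query):
            return current  # no neighbor is closer; local minimum reached
        current = best

vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 1.0), 3: (3.0, 3.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(graph, vectors, entry=0, query=(2.9, 2.8)))  # -> 3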
Adding new data points does not degrade the graph. When one point ends up with too many connections, pruning is done by reducing first-grade (direct) connections to second-grade (indirect) connections.
Deleting points degrades query time. When a point is marked as a tombstone, it can still be used for traversing the graph but is not included in the result set. When the proportion of tombstones grows large, the graph is rebuilt. On the fly, the tombstone's connections are also reassigned to other points to make sure clusters remain connected. This operation is expensive but works well if there are not too many deletes.