Building something 👨‍🍳🚀 • Former Machine Learning (AI) Research Scientist, Full-Stack Software Engineer, & Data Engineer at Expedock Software Inc. • 2x IOI & 2x ICPC World Finalist • Mathematics at the Ateneo de Manila University
At Expedock, I was in charge of researching, building, training, and managing the deployment of hundreds of multi-modal machine learning models fine-tuned for information extraction from semi-structured documents in the logistics industry. More generally, I was also responsible for improving our entire ML system: from making our data collection jobs more robust, to managing our data warehouse and feature stores, to building the charts and dashboards we present to our customers.
Recently, I've also been exploring model inference optimizations at lower levels of abstraction. I know how to implement most machine learning building blocks in C++ (see my implementations of Meta's Llama 2 in C++ and Flash Attention 1 & 2 in CUDA). At Expedock, I also worked on reducing the memory consumption of PyTorch (and its CUDA kernels) so we could run more inference jobs in parallel per GPU instance. TL;DR: I'm very comfortable working at every level of abstraction in machine learning.
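To make the memory point concrete, here is a minimal sketch of the standard PyTorch levers for shrinking a single inference job's GPU footprint; this is illustrative, not Expedock's actual setup, and the allocator setting is just one commonly-used value.

```python
import os

# Reduce allocator fragmentation; must be set before the first CUDA allocation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

def run_inference(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    model.eval()
    # inference_mode() skips autograd bookkeeping entirely, so no memory is
    # spent holding a computation graph.
    with torch.inference_mode():
        # autocast runs eligible ops in fp16, roughly halving activation memory.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            return model(batch.cuda(non_blocking=True))

def release_gpu_cache() -> None:
    # Hand cached blocks back to the driver so sibling inference processes
    # on the same GPU can claim them.
    torch.cuda.empty_cache()
```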
Before Expedock, I studied Mathematics at the Ateneo de Manila University. I also dabbled a lot in competitive programming. In fact, I managed to be a 2-time IOI and a 2-time ICPC World Finalist representing the Philippines.
Layer | Tools |
---|---|
Cloud | |
Infra | |
DB | |
Backend | |
API | |
Frontend | |
ML Platform | |
ML Inference Server | |
ML APIs | |
ML Frameworks | |
Data Viz | |
- Information Retrieval from Semi-Structured Documents. Research on information retrieval (colloquially, "search") mostly focuses on purely text-based documents and structured documents, both of which are now largely solved problems. For context, structured documents are PDFs, scanned documents, screenshots of Excel sheets, etc. where (1) the borders of the tables (if present) and (2) the ordering of the word-blocks are very clear. But most real-world documents, especially in the logistics industry, are semi-structured: documents where either (1) the tables don't have clear borders (or may even be implicit tables) and/or (2) the word-blocks are scattered all over the place. This is a surprisingly difficult problem; even the big cloud platforms (GCP, AWS, & Azure) have difficulty handling such documents. But it can be very profitable if you get it right, which is why Expedock is now a multi-million-dollar startup. (A minimal reading-order sketch appears after this list.)
- ML on Non-Euclidean Geometry. More specifically, I'm interested in embedding high-dimensional data into lower-dimensional non-Euclidean spaces. Although embedding into Euclidean spaces, $\mathbb{R}^n$, is good enough for most cases, there are cases where non-Euclidean spaces are more appropriate. For example:
  - Embedding hierarchical data such as the phylogenetic-tree representation of single-cell specialization data. Real-world hierarchical data are usually tree-like with near-constant branching factors, so they grow exponentially with respect to depth (e.g. the $k^{\text{th}}$ level of a binary tree has $2^k$ nodes). However, Euclidean spaces $\mathbb{R}^n$ only grow polynomially with respect to $n$. Negatively-curved spaces such as the Poincaré disc, on the other hand, grow exponentially, so it's better to embed hierarchical data into them; we just need to be careful with floating-point errors (see the Poincaré-distance sketch after this list).
  - Embedding complex cyclical data. During my stint at ExoraPH, I worked on uncovering the lower-dimensional, torus-like structure of the Philippines' energy supply-and-demand curves.
- Geometric Deep Learning. I'm interested in unifying various concepts in machine learning through the lens of the Erlangen Program. I'm especially fascinated by the following:
  - How we can derive linear regression, convolution, the attention mechanism, and message-passing from the geometric transformations we want our models to respect. For example:
    - If we want translation equivariance, then we have to use convolutions, as convolutions are exactly the linear maps that commute with translations (see the equivariance check after this list).
    - If we want color- and shade-invariance, then we can use batch normalization.
    - If we let the weights of the convolutions be learnable (and depend on the neighbors' features), then we end up with the attention mechanism. And
    - If we generalize the attention mechanism to all graph structures (not just regular graphs), then we end up with message-passing.
  - In almost all unsupervised learning models, we fix two of (a) the manifold $X$, (b) the metric $d_X$ on the manifold, and (c) the probability measure $\mu_X$ over the metric space $(X, d_X)$, and then try to estimate the remaining one of the three. For example:
    - In dimensionality reduction, we usually fix $d_{X,p}(x, y) = \sqrt[p]{\sum_i |x_i - y_i|^p}$ and $\mu_X =$ the uniform distribution (such as in UMAP), then try to find a low-dimensional manifold $X$ that preserves the local distances of the original data as much as possible.
    - In metric learning, we usually fix $X = \mathbb{R}^n$ and $\mu_X =$ the uniform distribution, then try to find $d_X$ such that similar datapoints are close together and dissimilar datapoints are far away from each other. And, finally,
    - In density estimation, we usually fix $X = \mathbb{R}^n$ and $d_{X,p}(x, y) = \sqrt[p]{\sum_i |x_i - y_i|^p}$, then try to find the probability distribution $\mu_X$ of our dataset.
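On the information-retrieval interest above: the structured/semi-structured split comes down to word-blocks and their coordinates. Below is a minimal, self-contained sketch of the naive top-to-bottom, left-to-right reading-order heuristic that structured documents give you essentially for free; the `WordBlock` class, the tolerance, and the sample data are my own illustration, not Expedock's pipeline.

```python
from dataclasses import dataclass

@dataclass
class WordBlock:
    """One OCR token with its bounding box (page coordinates, top-left origin)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

def naive_reading_order(blocks: list[WordBlock], line_tol: float = 5.0) -> list[WordBlock]:
    """Sort word-blocks top-to-bottom, then left-to-right.

    On semi-structured documents (implicit tables, scattered blocks), this
    heuristic interleaves unrelated columns, which is why layout has to be
    learned rather than assumed.
    """
    # Bucket blocks into lines by their top edge (within a tolerance),
    # then order within each line by the left edge.
    return sorted(blocks, key=lambda b: (round(b.y0 / line_tol), b.x0))

# Two side-by-side key-value columns: the naive order zig-zags across both,
# separating each label from its value.
blocks = [
    WordBlock("Shipper:", 10, 10, 60, 20), WordBlock("Consignee:", 300, 10, 370, 20),
    WordBlock("Acme", 10, 30, 45, 40),     WordBlock("Globex", 300, 30, 350, 40),
]
print([b.text for b in naive_reading_order(blocks)])
# -> ['Shipper:', 'Consignee:', 'Acme', 'Globex'] (labels split from values)
```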
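On the non-Euclidean embedding interest: here is a minimal sketch of the Poincaré-disc distance, including the floating-point clipping mentioned above; the epsilon value and function name are my own choices.

```python
import math

EPS = 1e-9  # keep points strictly inside the unit disc: distances blow up at the boundary

def poincare_distance(u: tuple[float, float], v: tuple[float, float]) -> float:
    """Geodesic distance on the Poincaré disc:
    d(u, v) = arccosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))."""
    duv = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    nu = min(u[0] ** 2 + u[1] ** 2, 1 - EPS)  # clip norms that round up to 1,
    nv = min(v[0] ** 2 + v[1] ** 2, 1 - EPS)  # which would divide by ~0 below
    arg = 1 + 2 * duv / ((1 - nu) * (1 - nv))
    return math.acosh(max(arg, 1.0))  # rounding can push arg just below 1

# Why trees fit: a hyperbolic circle of radius r has circumference 2*pi*sinh(r),
# i.e. exponential in r, matching the exponential growth of tree levels.
for r in (1, 5, 10):
    print(r, 2 * math.pi * math.sinh(r))
```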
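On the Geometric Deep Learning interest: a small NumPy check (the toy signal and kernel are mine) that circular convolution commutes with translation, which is the equivariance property the first sub-bullet appeals to.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)  # toy 1-D signal
w = rng.normal(size=5)   # toy convolution kernel

def circ_conv(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Circular convolution via the FFT (wrap-around, so cyclic shifts are exact)."""
    w_padded = np.zeros(len(x))
    w_padded[:len(w)] = w
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w_padded)))

shift = 3
translated_then_convolved = circ_conv(np.roll(x, shift), w)
convolved_then_translated = np.roll(circ_conv(x, w), shift)

# Equivariance: translating the input translates the output by the same amount.
assert np.allclose(translated_then_convolved, convolved_then_translated)
print("circular convolution commutes with translation")
```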
If you're interested in collaborating on a research project with me, just email me at [email protected]
Please visit my personal website at leloykun.github.io for a more detailed portfolio.
Project | Description |
---|---|
Expedock's AutoML Library | Train a model on data from Snowflake with just one line of code and run predictions with another line of code. |
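Since the library's actual interface isn't shown here, the following is a hypothetical sketch of what that "one line to train, one line to predict" shape could look like; every name below (the functions, the table, and the target column) is an illustrative placeholder, not Expedock's API.

```python
from typing import Any

def train(table: str, target: str) -> dict[str, Any]:
    """Placeholder: the real library would pull `table` from Snowflake,
    pick a model family, and fit it against the `target` column."""
    return {"table": table, "target": target}  # stands in for a fitted model

def predict(model: dict[str, Any], table: str) -> list[float]:
    """Placeholder: the real library would score every row of `table`."""
    return []  # stands in for per-row predictions

# The advertised shape: one line to train, one line to predict.
model = train(table="analytics.shipments", target="invoice_total")
predictions = predict(model, table="analytics.shipments_new")
```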