diff --git a/data/embeddings.rst b/data/embeddings.rst
index 1e5a598..d6deb44 100644
--- a/data/embeddings.rst
+++ b/data/embeddings.rst
@@ -5,7 +5,7 @@ Embeddings are underrated
 =========================
 
 Machine learning (ML) has the potential to greatly advance the state of the
-art in technical writing. No, I'm not talking about Claude Opus, Gemini Pro,
+art in technical writing. No, I'm not talking about text generation models like Claude Opus, Gemini Pro,
 LLaMa, etc. The ML technology that might end up having the biggest impact on
 technical writing is **embeddings**.
 
@@ -198,16 +198,35 @@ texts.
 .. _Word2vec paper: https://arxiv.org/pdf/1301.3781
 
 The concept of positioning items in a multi-dimensional
-space like this goes by the wonderful name of `latent space`_.
+space like this, where related items are clustered near each other, goes by the wonderful name of `latent space`_.
 
 The most famous example of the weird utility of this technology comes from
-the `Word2vec paper`_, the foundational research that got more people
-interested in embeddings 11 years ago. In the paper they shared an anecdote
-where they started with the embedding for ``king``, then subtracted the embedding
-for ``man``, and then added the embedding for ``woman``. When they looked around
-that area of the latent space, they found that the word for ``queen`` was close-by.
+the `Word2vec paper`_, the foundational research that kickstarted interest in embeddings 11 years ago. In the paper they shared this anecdote:
 
-The ``king - man + woman = queen`` anecdote must always be followed by this
+.. code-block:: text
+
+   embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")
+
+Start with the embedding for ``king``, subtract the embedding for ``man``, then add the embedding for ``woman``. When you look around that region of the latent space, you find the embedding for ``queen`` nearby.
+
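+You can try the arithmetic yourself. Here's a minimal sketch, assuming you
+have the gensim library installed and don't mind downloading the pretrained
+Google News vectors (roughly 1.6 GB) on first use:
+
+.. code-block:: python
+
+   import gensim.downloader
+
+   # Pretrained Word2vec embeddings trained on the Google News corpus.
+   vectors = gensim.downloader.load("word2vec-google-news-300")
+
+   # most_similar() sums the "positive" embeddings, subtracts the "negative"
+   # ones, and returns the nearest neighbors of the resulting point in the
+   # latent space. In other words: king - man + woman ≈ ?
+   print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
+   # Expect 'queen' to come back as the nearest neighbor.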
+
+There appears to be an unspoken rule in ML culture that this anecdote must always be followed by this
 quote from John Rupert Firth:
 
     You shall know a word by the company it keeps!
@@ -236,8 +255,36 @@ Applications
 ------------
 
 I could tell you exactly how I think we can advance the state of the art
-in technical writing with embeddings, but where's the fun in that?
-Let's just cover a basic example to put the ideas into practice and then
+in technical writing with embeddings, but where's the fun in that? Here are two gigantic hints:
+
+* A lot of documentation tasks revolve around detecting *discrepancies*.
+* You can generate embeddings for *any* type of text, not just *documentation*.
+
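+Combining the two hints: embed a piece of documentation, embed the thing it
+describes, and compare the two. Here's a hedged sketch, using the
+sentence-transformers library as a stand-in for whatever embedding model you
+prefer; the doc sentence and code line are invented for illustration:
+
+.. code-block:: python
+
+   import numpy as np
+   from sentence_transformers import SentenceTransformer
+
+   model = SentenceTransformer("all-MiniLM-L6-v2")
+
+   def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
+       """Cosine similarity: how aligned two embeddings are (higher = more similar)."""
+       return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+   # Embed a line of documentation and the code it claims to describe.
+   docs = model.encode("The request times out after 30 seconds by default.")
+   code = model.encode("DEFAULT_TIMEOUT_SECONDS = 60")
+
+   # An unusually low score is a signal that the docs and the code
+   # may have drifted apart.
+   print(cosine_similarity(docs, code))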
+
+Let's cover a basic example to put the intuition we've built into practice and then
 wrap up this post.
 
 Related pages