Naming Things

Nilesh edited this page Dec 2, 2024 · 17 revisions

This is part of my notes as I attempt to build the domain model of humanity's universal learning map using first principles:


The Naming Problem

Names (in our case, names of topics like "quantum physics" or "C++" and of people/orgs like "Bill Gates" or "Hacker News") are very useful as identifiers because humans can recognize them quickly and can type them manually (a search field with autocomplete might not be available in all contexts). We also need identifiers to establish links. For example, I would like to have ["comp.lang", "comp.lang.python"] instead of [{id: 16534, name: "lang"}, {id: 27654, name: "python", parent: 16534}].

But names which also work as identifiers are not quite as simple as we programmers would like.

  • To have names act as identifiers, we want them to be URL-safe, hashtaggable for mentions, preferably unique, and case-insensitive (when using English/Latin characters).
  • But real-world human-readable names require case-preservation (eg: "AT&T") and special characters (eg: "C++ 20"). Sometimes they use non-Latin characters or emojis. After all, "😂" should itself be a topic in a learning map that claims to encompass all human knowledge.

This is how others have attempted to solve this:

  • Wikipedia uses URL escape codes that make names/identifiers hard to manually write. Eg: Zorn%27s_lemma or C%2B%2B (for "C++")
  • Usenet groups used a custom naming scheme of short names (like comp.lang.python) with some special characters allowed.
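The percent-encoding in the Wikipedia examples can be reproduced with Python's standard library. This is only an illustration of why such identifiers are awkward to type by hand; the helper name `encode_title` is made up for this sketch.

```python
from urllib.parse import quote, unquote

def encode_title(title: str) -> str:
    """Percent-encode a title; letters, digits and '_.-~' pass through unchanged."""
    return quote(title)

for title in ["Zorn's_lemma", "C++"]:
    encoded = encode_title(title)
    # Decoding gives back the original title, so no information is lost,
    # but the encoded form is hard to remember or type.
    assert unquote(encoded) == title
    print(title, "->", encoded)
```

The apostrophe becomes %27 and each plus sign becomes %2B, which matches the two examples above.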

Currently, our Topic and People schemas each have two attributes: name (a unique, lowercase, URL-safe identifier) and hname (a case-preserving, human-preferable name that may contain special characters and need not be unique).
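A minimal sketch of how the name/hname split could behave in practice. The exact character rule for `name` is not spelled out in these notes, so the pattern below is an assumption, and the registry entries (`cpp`, `att`) are hypothetical.

```python
import re
from typing import Optional

# Assumed rule, not taken from the notes: lowercase ASCII letters, digits,
# '.', '_' and '-', starting with a letter or digit.
NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]*")

def is_valid_name(name: str) -> bool:
    """Return True if `name` fits the assumed identifier rule."""
    return NAME_PATTERN.fullmatch(name) is not None

# Hypothetical registry: identifier (name) -> display name (hname).
HNAMES = {"cpp": "C++", "att": "AT&T", "comp.lang.python": "Python"}

def display_name(typed: str) -> Optional[str]:
    """Look up the hname for a typed identifier, ignoring case and outer spaces."""
    return HNAMES.get(typed.strip().lower())

print(is_valid_name("comp.lang.python"))  # fits the assumed rule
print(is_valid_name("C++ 20"))            # uppercase, '+' and space are rejected
print(display_name("CPP"))                # case-insensitive lookup
```

The point of the split is that "C++" and "AT&T" never have to pass the identifier rule; only their short `name` does.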

The Disambiguation Problem

This arises because the real-world relationships between things and names are many-to-many.

  • Some things have more than one name: Soccer vs Football, Graph vs Network. Emojis, which are pictures as characters, make this even more complex. Did you know that https://en.wikipedia.org/wiki/🤔 redirects to https://en.wikipedia.org/wiki/Thought?
  • Some things have no well-defined name and may require an entire phrase to indicate. For example, libraries which use the Colon Classification System identify the subject of "Research in the cure of tuberculosis of lungs by x-ray conducted in India in 1950" with the identifier "Medicine,Lungs;Tuberculosis:Treatment;X-ray:Research.India'1950".
  • Some names refer to more than one thing. See how big a variety of things are named "Lua" on Wikipedia.
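The many-to-many relationship above can be sketched as a two-way index between human labels and topic identifiers. The class `NameIndex` and all topic identifiers below (`sports.football`, `comp.lang.lua`, `mythology.lua`, `thought`) are hypothetical and only serve the illustration.

```python
from collections import defaultdict

class NameIndex:
    """Two-way index: one label may map to many topics, and vice versa."""

    def __init__(self) -> None:
        self._topics_by_label: dict[str, set[str]] = defaultdict(set)
        self._labels_by_topic: dict[str, set[str]] = defaultdict(set)

    def add(self, label: str, topic: str) -> None:
        # casefold() makes label lookup case-insensitive; emojis are unchanged.
        self._topics_by_label[label.casefold()].add(topic)
        self._labels_by_topic[topic].add(label)

    def resolve(self, label: str) -> list[str]:
        """All topics a label may refer to (more than one means ambiguous)."""
        return sorted(self._topics_by_label.get(label.casefold(), ()))

    def labels(self, topic: str) -> list[str]:
        """All labels known for a topic."""
        return sorted(self._labels_by_topic.get(topic, ()))

index = NameIndex()
index.add("Soccer", "sports.football")
index.add("Football", "sports.football")
index.add("Lua", "comp.lang.lua")
index.add("Lua", "mythology.lua")
index.add("🤔", "thought")

print(index.resolve("lua"))             # two candidates, so the UI must disambiguate
print(index.labels("sports.football"))  # two labels for one topic
```

A lookup that returns more than one topic is the signal that a disambiguation step is needed.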

Other challenges with human-recognizable names:

  • Should we support multiple languages and scripts/characters/math symbols?
    • For learndb, I am taking the easy way out and limiting to building this knowledge map in English only.
  • Who assigns and maintains these names and what are their incentives?
    • If names are modeled as property rights (eg: DNS or ENS or social media handles), it opens a Pandora's box of problems and would be completely inappropriate for naming topics and subjects, since they are a common good.

The Taxonomy Problem

Then there are the issues of taxonomy which, in our case, apply to the naming of topics or concepts, but not to the naming of people. The simplest taxonomy is a hierarchy (eg: comp.lang.python). We currently keep a "parent" attribute in the Topic schema. But this too makes many assumptions:

  • Should topic names always be fully specified, like math.algebra.quadratics, or just quadratics? Short names are briefer and easier to use, but they make disambiguation harder.
  • Taxonomy maintenance, even for a hierarchy, is not easy:
    • When does a concept/subtopic deserve its own topic?
    • What happens when topics get merged or retired?
    • Is the parent-child relationship an "is-a" relationship (like nations/india) or an "includes" relationship (like math/algebra)?
    • What separator should we use: period or slash? Why or why not?
    • What if a topic belongs under two separate parent topics, eg: statistics.machine_learning as well as computer_science.machine_learning? Will we need symlinks in our topic hierarchy?
    • Too many existing standards
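To make a few of these questions concrete, here is a small sketch that derives fully specified names from a "parent" attribute and treats a second parent as a kind of symlink. The data, the `ALSO_UNDER` table, and both function names are hypothetical. The sketch also assumes short names are unique, which is itself one of the open questions above.

```python
from typing import Optional

# Hypothetical data: short topic name -> primary parent (None for a root).
PARENT: dict[str, Optional[str]] = {
    "math": None,
    "algebra": "math",
    "quadratics": "algebra",
    "computer_science": None,
    "statistics": None,
    "machine_learning": "computer_science",
}

# Hypothetical "symlinks": extra parents beyond the primary one.
ALSO_UNDER: dict[str, list[str]] = {"machine_learning": ["statistics"]}

def full_name(topic: str, sep: str = ".") -> str:
    """Walk the parent links up to the root and join the parts with `sep`."""
    parts: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = topic
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parent links at {current!r}")
        seen.add(current)
        parts.append(current)
        current = PARENT[current]
    return sep.join(reversed(parts))

def all_paths(topic: str, sep: str = ".") -> list[str]:
    """Primary path first, then one path per extra parent."""
    paths = [full_name(topic, sep)]
    for extra in ALSO_UNDER.get(topic, []):
        paths.append(full_name(extra, sep) + sep + topic)
    return paths

print(full_name("quadratics"))        # period-separated
print(full_name("quadratics", "/"))   # slash-separated
print(all_paths("machine_learning"))  # one topic reachable under two parents
```

Keeping the separator as a parameter means the period-versus-slash decision stays a presentation choice and does not leak into the stored data.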

The Usenet newsgroups' approach to naming topics is quite nice, but the taxonomy is not big or granular enough for us to build a universal knowledge map. For example, there is no name yet for quadratic equations. Also, everything other than the Big-8 (comp, humanities, misc, news, rec, sci, soc, and talk) gets shoved into the alt hierarchy (btw, the historical reason for that is that European networks did not want to pay for groups about religion or racism).

The Semantic Search Problem

Once you have named things, traditional search can do fuzzy matching or keyword-based search. But in this age of large language models, we should also be able to search for things in the semantic space. If I don't know the name icosahedron, the search query octahedron should be able to surface it among the top similar results.

This is technically feasible. Mistral's 1024-dimension embeddings, when binary quantized, come to 1024 bits (128 bytes), which would take about 175 chars in Base58 (172 in padded Base64). But it seems too early to build this into names. Embedding vectors are model-specific, so we may not want to standardize on a specific model too early. Leaving this out of scope for now.
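The size estimate can be checked with a short sketch. It assumes the simplest form of binary quantization (one bit per dimension, set when the component is positive) and the Bitcoin-style Base58 alphabet; neither choice is specified in these notes, and the input vector is a synthetic stand-in for a real embedding.

```python
import base64
import math

# Bitcoin-style Base58 alphabet (no 0, O, I or l), assumed for this sketch.
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def binary_quantize(vector: list[float]) -> bytes:
    """Keep one bit per dimension: 1 if the component is positive, else 0."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits.to_bytes(math.ceil(len(vector) / 8), "big")

def base58_encode(data: bytes) -> str:
    """Encode bytes as Base58; leading zero bytes are written as '1'."""
    number = int.from_bytes(data, "big")
    digits = []
    while number > 0:
        number, remainder = divmod(number, 58)
        digits.append(BASE58_ALPHABET[remainder])
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + "".join(reversed(digits))

# Synthetic 1024-dimension vector standing in for a real embedding.
vector = [math.sin(i) for i in range(1024)]
code = binary_quantize(vector)

print(len(code), "bytes")
print(len(base58_encode(code)), "Base58 chars")
print(len(base64.urlsafe_b64encode(code)), "Base64 chars (padded)")
print(math.ceil(1024 / math.log2(58)), "Base58 chars at most for 1024 bits")
```

A 1024-bit code needs at most 175 Base58 characters, while padded Base64 always gives 172 characters for 128 bytes. Either way the result is far too long to type by hand, which supports keeping embeddings out of names.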

The Sorting Problem

In a hierarchy, we would often like to preserve a sort order (eg: chapter names in a book). This can be achieved with names like 100-physics, 200-chemistry, 300-biology etc. However, this quickly becomes unwieldy (eg: 100-math.400-algebra.200-polynomials). This is why I decided to keep a separate rank attribute for topics, which is not part of the topic's name itself.


All of this led me to define the Topic schema (for now) as (name, hname, parent, rank) and the People (Creators) schema as (name, hname, links[]).
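As a rough sketch, the two schemas could be written as Python dataclasses. The field types, the defaults, and the helper `children_of` are assumptions made for illustration; only the attribute names come from the notes above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Topic:
    name: str                     # unique, lowercase, URL-safe identifier
    hname: str                    # human-preferred display name
    parent: Optional[str] = None  # name of the parent topic (None for a root)
    rank: int = 0                 # sort position among siblings, kept out of the name

@dataclass
class Person:
    name: str
    hname: str
    links: list[str] = field(default_factory=list)  # assumed to hold URLs

def children_of(parent: Optional[str], topics: list[Topic]) -> list[Topic]:
    """Children of `parent`, ordered by rank (name breaks ties)."""
    siblings = [t for t in topics if t.parent == parent]
    return sorted(siblings, key=lambda t: (t.rank, t.name))

# Hypothetical sample data.
topics = [
    Topic("science", "Science"),
    Topic("chemistry", "Chemistry", parent="science", rank=200),
    Topic("biology", "Biology", parent="science", rank=300),
    Topic("physics", "Physics", parent="science", rank=100),
]

for topic in children_of("science", topics):
    print(topic.rank, topic.name)
```

Because rank lives in its own attribute, reordering chapters changes a number but never changes an identifier that other pages may already link to.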

Next, we have got to deal with the problem of links as I don't find plain URLs good enough.