Add Typesense #44

Merged
merged 33 commits into from
Dec 13, 2024

ruslandoga
Contributor

@ruslandoga ruslandoga commented Sep 27, 2024

This PR integrates Typesense into Hexdocs.

TODOs:

  • should Broadway task fail on failed indexing, or should it just log an error?
    Logging excessively for now
  • should indexing happen in the same step as upload to object storage or can it be parallel?
    Indexing in the same step for now
  • index proglang
  • index Erlang, what's its search_data equivalent?
    The format is the same: Add Typesense #44 (comment)
  • index Gleam: ce7af3b
    Commit reverted, approach needs discussion with Gleam team
  • more tests

CI results: ruslandoga#1
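
The first TODO (fail the task or just log) depends on how indexing failures surface. Typesense's bulk import endpoint replies with one JSON object per line, each carrying a `success` flag, so a caller can count rejected documents and decide what to do. Below is a minimal Python sketch of the "log, don't fail" choice; it is not the PR's Elixir code, and the function name and the sample error message are hypothetical.

```python
import json


def summarize_import(response_body: str) -> tuple[int, list[str]]:
    """Return (number of indexed documents, list of error messages)."""
    indexed = 0
    errors = []
    for line in response_body.splitlines():
        if not line.strip():
            continue  # ignore blank lines in the response
        result = json.loads(line)
        if result.get("success"):
            indexed += 1
        else:
            errors.append(result.get("error", "unknown error"))
    return indexed, errors


# Hypothetical response for a three-document batch with one rejected document.
sample = "\n".join([
    '{"success": true}',
    '{"success": false, "error": "hypothetical validation error"}',
    '{"success": true}',
])
indexed, errors = summarize_import(sample)
if errors:
    # "Log, don't fail": report the rejected documents and carry on.
    print(f"indexed {indexed} documents, {len(errors)} rejected: {errors}")
```

Switching to "fail the task" would only mean raising an exception where the sketch prints.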

@josevalim
Member

Thank you @ruslandoga! I am currently on holidays but I will try to carve some time sooner than later to give you feedback. /cc @wojtekmach

@josevalim
Member

> index erlang, what's its search_data equivalent?

Erlang uses ExDoc now, so they have the exact same structure. However, we will need to either poll them or ask them to ping us once they publish a new version or ask them to push their docs to Hexdocs! I'd say we can postpone this to a follow up pull request.

On the other hand, I believe Gleam does not have the data in the format we need (we created our search_data.js somewhat recently thinking exactly about this). They would need to update their tools first, so we can postpone this conversation too.

@wojtekmach
Member

@ruslandoga I'll try to review this sooner than later but I have a lot on my plate until October 15th, after that it's gonna be my number one priority to see this through. Sorry about that.

Member

@wojtekmach wojtekmach left a comment

This is very exciting! I left some comments below.

@ruslandoga ruslandoga mentioned this pull request Oct 21, 2024
@ruslandoga ruslandoga marked this pull request as ready for review November 10, 2024 13:57
@ruslandoga
Contributor Author

ruslandoga commented Nov 10, 2024

👋

I think it's ready for review now :)

Updates since the last review:

  • removed Gleam indexing (for now)
  • added proglang field to the schema
  • moved some code around in attempt to make it cleaner
  • added more tests

search_data_js =
  Enum.find_value(files, fn {path, content} ->
    case Path.basename(path) do
      "search_data-" <> _digest -> content
Member

We cannot trust this data since it's user provided, can they do anything dangerous by providing something we don't expect? Maybe we should do some rudimentary validation?

Contributor Author

@ruslandoga ruslandoga Nov 12, 2024

They can provide long strings like https://github.com/cloudpods-dev/docker-engine-api-elixir/blob/813cc557da483f623a8f484db04efc7e58db0376/lib/docker_engine_api/api/container.ex#L67, but Typesense seems to handle it fine. We can check for content size, maybe. I think if Typesense doesn't like the payload, it would simply reject it.

Contributor Author

@ruslandoga ruslandoga Nov 13, 2024

I added a test that checks that invalid fields in search items (like type being a map instead of a string, or doc being a list) are rejected: 8d58e4f
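
The kind of rudimentary validation discussed in this thread can be sketched as follows. This is a minimal Python illustration rather than the PR's Elixir implementation: the function name is hypothetical, and it assumes each search item should carry string-valued `type`, `title`, and `doc` fields, matching the cases mentioned above.

```python
def valid_search_item(item: object) -> bool:
    """Accept only dicts whose type, title and doc fields are all strings."""
    if not isinstance(item, dict):
        return False
    return all(isinstance(item.get(field), str) for field in ("type", "title", "doc"))


items = [
    {"type": "module", "title": "Example", "doc": "An example module."},
    {"type": {"nested": "map"}, "title": "Example", "doc": "type is a map"},
    {"type": "function", "title": "Example.run/0", "doc": ["doc", "is", "a", "list"]},
]
# Only the first item passes; the other two mirror the rejected cases.
accepted = [item for item in items if valid_search_item(item)]
print(len(accepted))  # prints 1
```

A content-size limit, as floated earlier in the thread, could be added as one more check in the same function.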

@@ -5,6 +5,9 @@ if config_env() == :prod do
     port: System.fetch_env!("HEXDOCS_PORT"),
     hexpm_url: System.fetch_env!("HEXDOCS_HEXPM_URL"),
     hexpm_secret: System.fetch_env!("HEXDOCS_HEXPM_SECRET"),
+    typesense_url: System.fetch_env!("TYPESENSE_URL"),
+    typesense_api_key: System.fetch_env!("TYPESENSE_API_KEY"),
+    typesense_collection: System.fetch_env!("TYPESENSE_COLLECTION"),
Contributor Author

@ruslandoga ruslandoga Nov 13, 2024

I created a hexdocs-test collection on Typesense Cloud: https://cloud.typesense.org/clusters/ent97o5sv4dzx2f0p/collections

I think we can use it during alpha/beta testing.

@@ -5,6 +5,9 @@ if config_env() == :prod do
     port: System.fetch_env!("HEXDOCS_PORT"),
     hexpm_url: System.fetch_env!("HEXDOCS_HEXPM_URL"),
     hexpm_secret: System.fetch_env!("HEXDOCS_HEXPM_SECRET"),
+    typesense_url: System.fetch_env!("TYPESENSE_URL"),
+    typesense_api_key: System.fetch_env!("TYPESENSE_API_KEY"),
Contributor Author

@ruslandoga ruslandoga Nov 13, 2024

This would be the "Admin API Key" for the https://cloud.typesense.org/clusters/ent97o5sv4dzx2f0p cluster. It can be downloaded from the dashboard.

@@ -5,6 +5,9 @@ if config_env() == :prod do
     port: System.fetch_env!("HEXDOCS_PORT"),
     hexpm_url: System.fetch_env!("HEXDOCS_HEXPM_URL"),
     hexpm_secret: System.fetch_env!("HEXDOCS_HEXPM_SECRET"),
+    typesense_url: System.fetch_env!("TYPESENSE_URL"),
Contributor Author

@ruslandoga ruslandoga Nov 13, 2024

And this would be https://ent97o5sv4dzx2f0p.a1.typesense.net
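
The three settings annotated above are read with `System.fetch_env!`, which raises when a variable is unset, so a misconfigured production release fails at boot rather than at the first indexing request. A rough Python analogue of that fail-fast behavior, with a hypothetical function name and placeholder values:

```python
import os

REQUIRED = ("TYPESENSE_URL", "TYPESENSE_API_KEY", "TYPESENSE_COLLECTION")


def load_typesense_config(env=os.environ) -> dict:
    """Read the three Typesense settings, raising KeyError if any is missing."""
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "url": env["TYPESENSE_URL"],
        "api_key": env["TYPESENSE_API_KEY"],
        "collection": env["TYPESENSE_COLLECTION"],
    }


# Placeholder values for a local setup; real values come from the deployment.
config = load_typesense_config({
    "TYPESENSE_URL": "http://localhost:8108",
    "TYPESENSE_API_KEY": "hexdocs",
    "TYPESENSE_COLLECTION": "hexdocs-local",
})
print(config["collection"])
```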

@ruslandoga
Contributor Author

ruslandoga commented Nov 18, 2024

👋

Just wanted to check if there’s anything else we'd need to address before it's merged?

Member

@wojtekmach wojtekmach left a comment

Looks good to me. We plan to test this out in our staging server before merging. Thank you for all the work!

@wojtekmach
Member

@ruslandoga this is now deployed to https://staging.hexdocs.pm. I published a test package, and it was correctly indexed into the hexdocs-test collection.

Is this ready to publish to prod? What are the next steps?

Thank you for all of your work on this and apologies for delays.

@ruslandoga
Contributor Author

ruslandoga commented Dec 12, 2024

👋 @wojtekmach

Yes, I think it's ready for prod! I think the next step would be integrating global search into ex_doc. I can open a PR!

@ruslandoga
Contributor Author

ruslandoga commented Dec 12, 2024

We can create a new collection if needed, or we can continue using hexdocs-test.

I think it's possible to clone / fork collections in Typesense Cloud, and if not, it's pretty easy to move the data around, since by the nature of Hexdocs the indexed data is immutable. So it should be OK either way.

@wojtekmach
Member

Got it, excellent!

Could you create hexdocs-staging and hexdocs-prod collections for us? We tend to follow that particular convention for external services.

Should we backfill some data? Latest versions of all packages? All versions of all packages? We can also do nothing for now and make a decision when all pieces are ready. cc @josevalim

@ruslandoga
Contributor Author

ruslandoga commented Dec 12, 2024

I've created hexdocs-staging and hexdocs-prod collections with the same schema.

Regarding backfilling, it definitely helps during development, but I tend to use a local Typesense instance. Here are some docs I used before; the dump is a bit outdated by now (e.g. missing proglang): https://hexdocs-artifacts.s3.eu-central-003.backblazeb2.com/docs_from_tarballs_all_versions.jsonl.zst (it's 190MB compressed, 4.4G uncompressed)

$ curl https://hexdocs-artifacts.s3.eu-central-003.backblazeb2.com/docs_from_tarballs_all_versions.jsonl.zst -O
$ zstd docs_from_tarballs_all_versions.jsonl.zst -d
# add proglang=elixir to all entries (-c keeps one JSON object per line, as the JSONL import requires)
$ jq -c '. + {proglang: "elixir"}' docs_from_tarballs_all_versions.jsonl > docs.jsonl
$ docker compose up typesense -d

# https://typesense.org/docs/27.1/api/collections.html#with-pre-defined-schema
$ curl "http://localhost:8108/collections" \
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: hexdocs" \
       -d '{"fields": [
    {"facet": true, "name": "proglang", "type": "string"},
    {"facet": true, "name": "type", "type": "string"},
    {"name": "title", "type": "string"},
    {"name": "doc", "type": "string"},
    {"facet": true, "name": "package", "type": "string"}
  ],
  "name": "hexdocs-local",
  "token_separators": [".", "_", "-", " ", ":", "@", "/"]
}'

# https://typesense.org/docs/27.1/api/documents.html#import-a-jsonl-file
$ curl "http://localhost:8108/collections/hexdocs-local/documents/import?action=create" \
       -X POST \
       -T docs.jsonl \
       -H "X-TYPESENSE-API-KEY: hexdocs"

# sanity check
$ curl -H "X-TYPESENSE-API-KEY: hexdocs" "http://localhost:8108/collections/hexdocs-local/documents/777777"
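
For readers without jq, the "add proglang to all entries" step above can be sketched in Python. This is an illustration only: the function name and the sample record are hypothetical, and the real dump's fields may differ. Like the jq expression, it sets `proglang` on every record and writes one JSON object per line.

```python
import json


def add_proglang(lines, proglang="elixir"):
    """Yield each JSONL record with a proglang field added, one object per line."""
    for line in lines:
        if line.strip():  # skip blank lines
            record = json.loads(line)
            record["proglang"] = proglang
            yield json.dumps(record, ensure_ascii=False)


# Hypothetical record shaped like the schema fields used above.
sample = ['{"type": "module", "title": "Enum", "doc": "...", "package": "elixir"}']
for out in add_proglang(sample):
    print(out)
```

For the full 4.4G file, passing an open file handle as `lines` streams it line by line instead of loading it into memory.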

@josevalim
Member

@ericmj @wojtekmach @ruslandoga let's definitely backfill. We only support recent ExDoc versions anyway, which will act as a filter.

A couple things to figure out:

  1. Will Elixir releases be indexed automatically since we push to hexdocs.pm or does it require special treatment?
  2. We should do something for Erlang in particular
  3. Are we going to make the Typesense address public, or are we going to expose it through hex.pm and use Fastly to rewrite to Typesense?

Overall, my next step suggestion is to build a home page for searching within a given set of packages. This thread on ElixirForum has a good example: https://elixirforum.com/t/hexdocs-search-engine-for-us-devs/46814/1

We could use it at least in two places:

  1. A new hexdocs.pm home? We can store the selected packages in your cookies

  2. mix hex.search which opens up a page with all of the hex packages of your lock file filled in (similar to mix hex.outdated). We can rename the existing mix hex.search to mix hex.find

For ExDoc, we would need to work on the related packages feature: elixir-lang/ex_doc#1811

My idea is that we would be able to store this information as a .json file as well. So phoenix_html can say related_deps: :phoenix and we just query https://hexdocs.pm/phoenix/related_deps.json
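
Searching within a given set of packages, as suggested above, maps onto Typesense's `filter_by` search parameter. The Python sketch below builds such a request URL under a few assumptions: the collection uses the `title`, `doc`, and `package` fields from the schema shown earlier in this thread, the `package` field holds plain package names, and the function name is hypothetical. The request would also need an `X-TYPESENSE-API-KEY` header carrying a search-only key.

```python
from urllib.parse import urlencode


def search_url(base_url: str, collection: str, query: str, packages: list[str]) -> str:
    """Build a Typesense search URL restricted to the given packages."""
    params = {
        "q": query,
        "query_by": "title,doc",
        # Exact match against any of the listed package names.
        "filter_by": "package:=[" + ",".join(packages) + "]",
    }
    return f"{base_url}/collections/{collection}/documents/search?{urlencode(params)}"


print(search_url("http://localhost:8108", "hexdocs-local", "changeset", ["phoenix", "ecto"]))
```

The package list could come from a cookie on the hexdocs.pm home page or from a project's lock file, as proposed for the two use cases above.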

@josevalim
Member

I opened new issues: #46 #47 #48 #49.

@josevalim
Member

This one looks good to merge to me. We can continue in the other issues.

@wojtekmach
Member

Sounds good, I'll finish infrastructure setup and deploy this to prod soon.

@wojtekmach wojtekmach merged commit b1f48e9 into hexpm:main Dec 13, 2024
3 checks passed
@wojtekmach
Member

This is now running on prod, so far so good!

@josevalim
Member

Amazing work @ruslandoga !

4 participants