Support for multi-lingual candidate names #138

Open
saumier opened this issue Sep 14, 2023 · 25 comments

Comments

@saumier

saumier commented Sep 14, 2023

As a service provider, I would like clients to be able to query in any language and to return candidate names in one or more languages specified by the client request.

Use Case

A client is reconciling a place in Canada using the Artsdata.ca Reconciliation service with the name "Studio Azrieli".

Current solution (not ideal)

The service returns multiple candidates, including K11-15 "National Arts Centre - Azrieli Studio" and K11-15 "Centre National des Arts - Studio Azrieli", which appear as separate entities but have the same URI. Seeing two candidates may look incorrect to the user. If the user doesn't notice that they have the same URI, they may be mistaken for duplicates.

[Screenshot 2023-09-14 at 9:52:11 AM]

Ideal solution

The service returns multiple entities but only a single K11-15 displaying both names "National Arts Centre - Azrieli Studio" and "Centre National des Arts - Studio Azrieli" together. Parameters can specify the languages the client would like to display.

@fsteeg
Member

fsteeg commented Nov 9, 2023

So on the protocol level, would this mean to allow arrays of objects for candidate name and description?

"candidates": [
  {
    "id": "K11-15",
    "name": [
      {
        "str": "National Arts Centre - Azrieli Studio",
        "lang": "en"
      },
      {
        "str": "Centre National des Arts - Studio Azrieli",
        "lang": "fr"
      }
    ],
    ...
  }
]
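To make the shape above concrete, here is a minimal sketch of how a service might fold the duplicate candidates from the current workaround into a single candidate carrying an array of name objects. The function name `merge_candidates` and the flat input shape (one candidate per language, each with an optional `lang`) are assumptions for illustration, not part of the spec.

```python
from typing import Any


def merge_candidates(flat: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge candidates that share an id into one candidate with a list of names.

    Hypothetical helper: assumes each flat candidate has "id" and "name",
    plus an optional "lang".
    """
    merged: dict[str, dict[str, Any]] = {}
    for cand in flat:
        entry = merged.setdefault(cand["id"], {"id": cand["id"], "name": []})
        name = {"str": cand["name"]}
        if "lang" in cand:  # lang stays optional in the merged form
            name["lang"] = cand["lang"]
        if name not in entry["name"]:  # skip exact repeats
            entry["name"].append(name)
    return list(merged.values())


flat = [
    {"id": "K11-15", "name": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"id": "K11-15", "name": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
result = merge_candidates(flat)
```

Other candidate fields (score, type, and so on) are left out here; a real service would need to decide how to combine them.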

@wetneb
Member

wetneb commented Nov 9, 2023

If we go down that route, I wonder if we should also add support for that for multiple names for properties (when returned in a property suggest response, or in a data extension response) or for types (when returned in a type suggest response, or in a reconciliation response as part of the reconciliation candidates). I guess it would make things look more uniform but I am not really sure about the use case. What do you think @saumier?

@saumier
Author

saumier commented Mar 14, 2024

If we go down that route, I wonder if we should also add support for that for multiple names for properties (when returned in a property suggest response, or in a data extension response) or for types (when returned in a type suggest response, or in a reconciliation response as part of the reconciliation candidates). I guess it would make things look more uniform but I am not really sure about the use case. What do you think @saumier?

Yes. Since the group is not recommending JSON-LD, then I think this is the next best approach.

I am building a bilingual website (en, fr) that implements a client for the reconciliation API here kg.artsdata.ca. The UI of this site can switch between English and French. When querying using the reconciliation API, a query string can be in any language. For example, I could query a Place using "Studio Azrieli" or "Azrieli Studio". The response would return candidates including K11-15. With this new approach, the website could display the name and description in the UI language.

It would also be good to add support for property and type suggestions.

@wetneb
Member

wetneb commented Apr 5, 2024

Summary of our discussion on the monthly call of last month: we could either

  • always wrap the name and description fields of entities, properties and types in additional arrays/objects so that multiple values can be specified depending on the language (see #138 (comment)). This would be done even if only a single language is used, with the benefit of a consistent JSON structure regardless of the data.
  • do this wrapping only in cases where multiple languages need to be returned, and fall back on the current syntax (bare strings) when a single language is provided. This has the benefit of a simpler JSON structure for most use cases.

Maybe there are other options?

We thought that it is worth bringing more attention to this issue from the broader community, to gather more feedback.

@tfmorris
Member

tfmorris commented Apr 5, 2024

Unless the variable structure is backward compatible when the simple variant is used, I think it's better to be consistent and always use the array form, even for a single entry. I suspect that things have diverged enough that there's not a compatibility benefit.

@thadguidry
Contributor

I second @tfmorris's opinion. I like the consistency: when our API standards have a context that could be "one or many", we resort to the array form. (Mostly because the idea of a "simpler JSON structure" presumes that JSON arrays of objects are complicated or noisy, when they really are not for developers and our 2024+ tooling.)

@acka47
Member

acka47 commented Apr 8, 2024

Generally, this seems to be related to #52, as a solution to this issue will also resolve #52, won't it?

Maybe there are other options?

I am late to the party (sorry) but am adding this for reference. Generally, I like the "language map" approach from JSON-LD (examples) for providing labels in multiple languages as it is simple, terse and easy to read. The example from #138 (comment) would look like this with language maps:

{
   "candidates":[
      {
         "id":"K11-15",
         "name":{
            "en":"National Arts Centre - Azrieli Studio",
            "fr":"Centre National des Arts - Studio Azrieli"
         }
      }
   ]
}

@thadguidry
Contributor

@acka47 If we went that route, we'd have to adopt a convention and document it. Namely, should the key be an ISO 639-3 three-letter code? Hmm, what else?

@wetneb
Member

wetneb commented Apr 8, 2024

@acka47 I like the conciseness, but how would a service represent a name or description for which it does not know the language? (Use case: a tool like CSV-reconcile, which spins up a reconciliation service on arbitrary datasets, generally will not have access to this sort of information and shouldn't make up a language for the sake of fitting in.)

@acka47
Member

acka47 commented Apr 8, 2024

If we went that route, we'd have to adopt a convention and document it. That being the key should be an ISO 639-3 three letter code?

Yes, we could define it similar to JSON-LD like this: "keys must be strings representing [BCP47] language codes and the values must be a string."

how would a service represent a name or description for which it does not know the language?

Good question. I guess for the other approach from #138 (comment) you would just omit the optional lang key. With the language map approach you would have to use und as the key (for "undetermined"), I guess.

@awagner-mainz

Would the array approach allow for multiple alias names in the same language whereas the map approach would not? That could be an argument for choosing the array approach. On the other hand, I am not sure we actually want to allow this?
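A small sketch of the difference raised here: collapsing an array of name objects into a language map keeps only one value per language key, so a second name in the same language is dropped. The helper `to_language_map` and the alias "NAC Azrieli Studio" are hypothetical, and the `und` key for untagged strings follows the suggestion made earlier in this thread.

```python
def to_language_map(names: list[dict[str, str]]) -> dict[str, str]:
    """Collapse name objects into a language map, keeping the first name per language.

    Hypothetical helper: names without "lang" are filed under "und".
    """
    lang_map: dict[str, str] = {}
    for name in names:
        # setdefault keeps the first value seen for a key; later ones are dropped
        lang_map.setdefault(name.get("lang", "und"), name["str"])
    return lang_map


names = [
    {"str": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"str": "NAC Azrieli Studio", "lang": "en"},  # made-up alias for illustration
    {"str": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
lang_map = to_language_map(names)
```

Three name objects go in and two map entries come out, which is the limitation of the map form if same-language aliases are wanted.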

@fsteeg
Member

fsteeg commented Apr 11, 2024

Another aspect to consider for the lang field vs. language maps is that the field provides a general approach for all objects. To quote from the current draft:

All objects used in this protocol (entities, types, properties, queries, candidates, features, etc.) MAY declare an explicit text-processing language in a lang field.

@fsteeg
Member

fsteeg commented Apr 11, 2024

[...] I think it's better to be consistent and always use the array form [...]

To be clear, this is not only about array vs. non-array, but also object vs. string.

The common, simple case currently:

"name": "National Arts Centre - Azrieli Studio"

The common case in the unified syntax:

"name": [
  {
    "str": "National Arts Centre - Azrieli Studio"
  }
]

If this was the first and only place where we introduce optional structure (string or array of objects), I'd agree we might want to avoid that. But since we do the same thing in other places (e.g. property values), I feel like the much simpler common case is worth having the option.
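If the optional structure were allowed, a client would likely normalize both variants on input. The following is a rough sketch of that normalization, assuming the field is either a bare string or an array of objects with a `str` key; `normalize_names` is an illustrative name, not something defined by the spec.

```python
from typing import Union

StringObject = dict[str, str]
NameField = Union[str, list[StringObject]]


def normalize_names(value: NameField) -> list[StringObject]:
    """Return the name field as a list of string objects, whichever form was sent."""
    if isinstance(value, str):
        # Simple case: wrap the bare string; no language information is available
        return [{"str": value}]
    # Unified case: copy each object so callers cannot mutate the response
    return [dict(item) for item in value]


simple = normalize_names("National Arts Centre - Azrieli Studio")
unified = normalize_names(
    [{"str": "Centre National des Arts - Studio Azrieli", "lang": "fr"}]
)
```

The cost of the optional form is this one branch in every consumer; the cost of the always-array form is the extra wrapping in every response.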

@saumier
Author

saumier commented Apr 11, 2024

how would a service represent a name or description for which it does not know the language?

From JSON-LD https://www.w3.org/TR/json-ld/#example-102-indexing-languaged-tagged-strings-using-none-for-no-language

... the special index @none is used for indexing strings which do not have a language; this is useful to maintain a normalized representation for string values not having a datatype.

Example if there was no language for a name.

{
   "candidates":[
      {
         "id":"K11-15",
         "name":{
            "@none":"National Arts Centre - Azrieli Studio"
         }
      }
   ]
}

@wetneb
Member

wetneb commented Apr 11, 2024

I'm not really enthusiastic about any of the solutions, but the one that I find the least bad is @fsteeg's suggestion to use the existing language (+ text direction) mechanisms we have, and simply switch to this default syntax:

"name": [
  {
    "str": "National Arts Centre - Azrieli Studio"
  }
]

with the option to add lang and dir attributes at the same level as the str if needed, and to add more objects in the array.
This also has the benefit of allowing multiple names to be returned in the same language (for alternate names, such as acronyms for instance).
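On the client side, this syntax means choosing which entry to display. Below is a minimal sketch of one possible selection rule, assuming exact tag comparison (no BCP 47 range matching) and a hypothetical helper name `pick_name`: prefer a name tagged with the UI language, then an untagged name, then whatever comes first.

```python
from typing import Optional


def pick_name(names: list[dict[str, str]], ui_lang: str) -> Optional[str]:
    """Pick a display string from a list of {"str", optional "lang"} objects."""
    for name in names:
        if name.get("lang") == ui_lang:  # exact tag match only (a simplification)
            return name["str"]
    for name in names:
        if "lang" not in name:  # fall back to a name with no declared language
            return name["str"]
    return names[0]["str"] if names else None


names = [
    {"str": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"str": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
french = pick_name(names, "fr")
fallback = pick_name(names, "de")
```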

@wetneb
Member

wetneb commented Apr 11, 2024

And I agree with @tfmorris on the preference to stick to the array form.

@saumier
Author

saumier commented Apr 11, 2024

I also agree with @wetneb and @tfmorris to use an array of objects with the str attribute and optional lang and dir.

For the sake of comparison with other patterns, this somewhat resembles the keys @value, @language and @direction used in JSON-LD.

@acka47
Member

acka47 commented Apr 12, 2024

I have no preference here but just felt that the language map approach should at least be discussed in this context. Thus, I am fine with an array of objects containing at least the str with optional lang and dir.

@saumier
Author

saumier commented Jul 8, 2024

@wetneb My team has implemented an endpoint for the current draft spec and updated our branch of the test bench to support both v0.2 and v0.3 (draft).

Here are 2 screen grabs from our branch of test bench. One showing our production reconciliation endpoint v0.2 and a second screen grab showing our test reconciliation endpoint v0.3 with multi-lingual support meeting the needs of this use case. This is a work in progress.

v0.2 - current spec - showing Azrieli Studio returned 2 times with the same ID K11-15

[Screenshot 2024-07-08 at 10:11:54 AM]

v0.3 - draft spec - showing Azrieli Studio entity combined in a single response with en and fr.

[Screenshot 2024-07-08 at 10:18:29 AM]

fsteeg added a commit that referenced this issue Jul 10, 2024
With required `str`, optional `lang` and `dir` fields
fsteeg added a commit that referenced this issue Jul 10, 2024
And use for candidate `description`
fsteeg added a commit that referenced this issue Oct 9, 2024
- Extract existing string object definition to its own schema file
- Reference string-object.json in the suggest response schemas
- Update spec & examples to use string objects in suggest responses
- Redefine types used in suggest response as described in the spec
(instead of referencing the actual type.json schema)
- Clarify in the spec that we don't return actual full entity,
property, or type objects in the suggest response's `result` field
fsteeg added a commit that referenced this issue Oct 9, 2024
Also add `description` to spec and example (was in schema already)
@fsteeg
Member

fsteeg commented Oct 10, 2024

Quoting myself in the related PR #176 (comment):

Did we ever consider implementing it by (only) specifying the language(s) in the client request (in the Accept-Language header), and returning the old structure? Do we actually need to return multiple languages at the same time, for a single request?

I feel like we, in particular myself in #138 (comment), might have jumped to the solution of changing the data structure too quickly. One approach we discussed in today's meeting is using multiple requests, one for each language, each returning the current, simple structure.

So instead of a single response in the new format:

"candidates": [
  {
    "id": "K11-15",
    "name": [
      {
        "str": "National Arts Centre - Azrieli Studio",
        "lang": "en"
      },
      {
        "str": "Centre National des Arts - Studio Azrieli",
        "lang": "fr"
      }
    ],
    ...
  }
]

We'd have two responses (for two requests with different Accept-Language headers) in the old format:

"candidates": [
  {
    "id": "K11-15",
    "name": "National Arts Centre - Azrieli Studio",
    "lang": "en",
    ...
  }
]

"candidates": [
  {
    "id": "K11-15",
    "name": "Centre National des Arts - Studio Azrieli",
    "lang": "fr",
    ...
  }
]

This seems way more lightweight and in line with the other internationalization support, which is completely optional (request and response headers, optional lang and dir fields on existing objects), instead of determining the structure of the protocol.

It's actually kind of close to the original workaround of returning multiple candidates with the same ID but different labels by @saumier in #138 (comment), but I guess in all cases the client will have to handle something (grouping candidates with the same ID or displaying the new structure).

So not sure how that would be implemented exactly, but wanted to ask for feedback on the basic idea.
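As a rough illustration of the client-side work this would imply, the sketch below combines the candidate lists from several per-language responses into one lookup of names by candidate ID. It leaves out the HTTP requests themselves; the function name `combine_responses` and the input shape (a dict keyed by the language that was requested) are assumptions for the example.

```python
def combine_responses(
    responses: dict[str, list[dict[str, str]]]
) -> dict[str, dict[str, str]]:
    """Map candidate id -> {language: name} across per-language responses.

    `responses` is keyed by the language sent in Accept-Language; a candidate's
    own "lang" field, when present, takes precedence over that key.
    """
    names_by_id: dict[str, dict[str, str]] = {}
    for requested_lang, candidates in responses.items():
        for cand in candidates:
            lang = cand.get("lang", requested_lang)
            names_by_id.setdefault(cand["id"], {})[lang] = cand["name"]
    return names_by_id


responses = {
    "en": [{"id": "K11-15", "name": "National Arts Centre - Azrieli Studio", "lang": "en"}],
    "fr": [{"id": "K11-15", "name": "Centre National des Arts - Studio Azrieli", "lang": "fr"}],
}
combined = combine_responses(responses)
```

Scores and ordering may differ between the two responses, so a real client would also have to decide which response's ranking to trust.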

@tfmorris
Member

Using multiple queries seems inefficient to me. I think doing it the way the Google KG Search does with an ordered list of requested languages would be simpler:

https://kgsearch.googleapis.com/v1/entities:search?languages=fr&languages=en&query=etage&key=<key>

which then returns the results in the same order as specified by the request:

        "@id": "kg:/m/02vk6kk",
        "name": [
          {
            "@language": "fr",
            "@value": "Étage"
          },
          {
            "@value": "Storey",
            "@language": "en"
          }
        ],
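A service following this pattern would need to filter and order a candidate's names by the requested language list. Here is a minimal sketch of that step, written against the `str`/`lang` keys discussed in this thread rather than Google's `@value`/`@language`; `order_by_requested` is a hypothetical helper and tag matching is exact.

```python
def order_by_requested(
    names: list[dict[str, str]], requested: list[str]
) -> list[dict[str, str]]:
    """Keep only names in a requested language, ordered as in the request."""
    rank = {lang: position for position, lang in enumerate(requested)}
    wanted = [name for name in names if name.get("lang") in rank]
    # sorted() is stable, so same-language names keep their original order
    return sorted(wanted, key=lambda name: rank[name["lang"]])


names = [
    {"str": "Storey", "lang": "en"},
    {"str": "Stockwerk", "lang": "de"},  # not requested below, so filtered out
    {"str": "Étage", "lang": "fr"},
]
ordered = order_by_requested(names, ["fr", "en"])
```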

@saumier
Author

saumier commented Nov 4, 2024

I feel like we, in particular myself in #138 (comment), might have jumped to the solution of changing the data structure too quickly.

@fsteeg I am also coming around to the idea that we maybe changed the data structure too quickly.

In my specific use case, the implementation of the reconciliation service is such that it always processes "matchType": "name" requests in both languages (ignoring the Text-processing language if specified). This is because in Canada it is not uncommon to speak one language but use the name of an entity in another. So a person speaking English may refer to a place by its French name while continuing to speak in English.

The root of my problem is how to return the response that the user is expecting to see in the UI, especially when there is an exact match in one language but not the other. To illustrate my use case with a concrete example (as in the original use case), imagine a reconciliation query for "Studio Azrieli" with the Language of the intended audience set to "en" in the Accept-Language request header. The service processes the request by searching in both languages: "en" and "fr". The response is formatted in "en" because of the Accept-Language. I could just display the English name "National Arts Centre - Azrieli Studio" and stop there. But ideally I would like to display the French name so the user will recognize their search and see the exact match. The current live production server for Artsdata.ca returns 2 candidates with the same URI but different names for English and French, with the exact match candidate highlighted. This has the down side (as mentioned in my original use case) that returning 2 candidates may appear incorrect to the user and be mistaken as duplicates.

This seems way more lightweight and in line with the other internationalization support, which is completely optional (request and response headers, optional lang and dir fields on existing objects), instead of determining the structure of the protocol.

It's actually kind of close to the original workaround of returning multiple candidates with the same ID but different labels by @saumier in #138 (comment), but I guess in all cases the client will have to handle something (grouping candidates with the same ID or displaying the new structure).

So not sure how that would be implemented exactly, but wanted to ask for feedback on the basic idea.

@fsteeg I understand your idea, but instead of doing two requests with different Accept-Language request headers, I am thinking of sacrificing one display language instead. The user may not recognize their "exact" match in the response if the candidate is in a different language than their initial search string, but this may not really be a show stopper. I plan to do some usability testing on my end with the idea of displaying only the Language of the intended audience in the UI.

@fsteeg
Member

fsteeg commented Nov 12, 2024

This has the down side (as mentioned in my original use case) that returning 2 candidates may appear incorrect to the user and be mistaken as duplicates. [...] I am thinking of sacrificing one display language instead.

Isn't that mainly a client / UI issue? You could return both candidates, no matter which language(s) were requested:

"candidates": [
  {
    "id": "K11-15",
    "name": "National Arts Centre - Azrieli Studio",
    "lang": "en",
    ...
  },
  {
    "id": "K11-15",
    "name": "Centre National des Arts - Studio Azrieli",
    "lang": "fr",
    ...
  }
]

As mentioned above, it's up to the client to display multi-language candidates properly (no matter which approach we take), e.g. by grouping these candidates by ID, or by language, or by simply adding the language as a field in the UI.
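One possible client-side treatment of such a flat list, sketched under assumptions: group by ID and keep a single candidate per ID, preferring the one whose `lang` matches the UI language. This corresponds to showing only the language of the intended audience. The helper name `one_per_id` and the exact-match rule are illustrative only.

```python
def one_per_id(
    candidates: list[dict[str, str]], preferred_lang: str
) -> list[dict[str, str]]:
    """Keep one candidate per id, preferring the one in `preferred_lang`."""
    chosen: dict[str, dict[str, str]] = {}
    for cand in candidates:
        current = chosen.get(cand["id"])
        is_preferred = cand.get("lang") == preferred_lang
        # Take the first candidate seen for an id, then upgrade it only if a
        # later one is in the preferred language and the current one is not
        if current is None or (is_preferred and current.get("lang") != preferred_lang):
            chosen[cand["id"]] = cand
    return list(chosen.values())


candidates = [
    {"id": "K11-15", "name": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"id": "K11-15", "name": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
shown = one_per_id(candidates, "fr")
```

A client that prefers to show every language could group the same list by ID and render all names in one row instead of discarding any.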

@saumier
Author

saumier commented Dec 11, 2024

@fsteeg I like this direction; however, I think we are still missing something in how "lang" is defined.

The property "lang" is clearly defined in the schema for queries as "The text-processing language for the query".

I don't see the "lang" property defined yet in the schema A.3 Reconciliation Result Batch Schema.

So if we kept the same "lang" property definition and added it to the schema for reconciliation results like this:

"candidates": {
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      ...
      "lang": {
        "type": "string",
        "description": "The text-processing language for the query"
      }
      ...

then we could have responses like this:

"candidates": [
  {
    "id": "K11-15",
    "name": "National Arts Centre - Azrieli Studio",
    "description": "black box theatre within the National Arts Centre, in Ottawa, Canada",
    "lang": "en",
    ...
  }
]

However is this the correct use of "lang"? The "text-processing language" is to help the service understand how to treat "v" in the query.

According to the spec, the name and description strings in the response should be in the language of the intended audience which is set in the Accept-Language header.

Is there another definition of "lang" for the response schema? Like "The language of all plain literals in the object".

Another consideration is if "lang" is added to the schema, then should there not also be a "dir" property defined in the schema?

@fsteeg
Member

fsteeg commented Dec 12, 2024

Is there another definition of "lang" for the response schema? Like "The language of all plain literals in the object".

Right, the schemas don't fully specify the allowed fields, many allow additionalProperties, but the spec itself says:

All objects used in this protocol (entities, types, properties, queries, candidates, features, etc.) MAY declare an explicit text-processing language in a lang field. [...] This text-processing language applies to the natural language fields of the object: name, description, query (for reconciliation queries), v and str (for property values). [...]

Same for text direction.

fsteeg added a commit that referenced this issue Dec 12, 2024
…dates

Add paragraph on multi-lingual reconciliation candidates (#138)