Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with Name String Array primitive array type during deserialization #257

Merged
merged 3 commits into from
Jul 6, 2022

Conversation

simonmeoni
Copy link
Contributor

I had a problem when I parse XMI and typesystem containing String Array primitive type. The parser gave me the following error :

self = <cassis.xmi.CasXmiDeserializer object at 0x7ff109435430>
type_ = Type(name=uima.cas.StringArray), value = ''

    def _parse_primitive_array(self, type_: Type, value: Union[str, List[str]]) -> List:
        """Primitive collections are serialized as white space separated primitive values"""
    
        if value is None:
            return None
    
        # TODO: Use type name global variable here instead of hardcoded string literal
        elements = value.split() if isinstance(value, str) else value
    
        type_name = type_.name
        if type_name in [TYPE_NAME_FLOAT_ARRAY, TYPE_NAME_DOUBLE_ARRAY]:
            return [float(e) for e in elements] if value else []
        elif type_name in [TYPE_NAME_INTEGER_ARRAY, TYPE_NAME_SHORT_ARRAY, TYPE_NAME_LONG_ARRAY]:
            return [int(e) for e in elements] if value else []
        elif type_name == TYPE_NAME_BOOLEAN_ARRAY:
            return [self._parse_bool(e) for e in elements] if value else []
        elif type_name == TYPE_NAME_BYTE_ARRAY:
            return list(bytearray.fromhex(value)) if value else []
        else:
>           raise ValueError(f"Not a primitive collection type: {type_name}")
E           ValueError: Not a primitive collection type: uima.cas.StringArray

../../../../miniconda/envs/fhir-lighter/lib/python3.9/site-packages/cassis/xmi.py:445: ValueError

I modified the code to accept and parse String Array type as the all Primitive Array type inside the method.
Is it the right place to get able to process String Array type, or is it against the UIMA CAS format philosophy (I'm not an expert)?
Thanks in advance 🤗

@reckart
Copy link
Member

reckart commented Jun 30, 2022

Could you please provide a minimal sample XMI file that cassis fails to parse?
Could you maybe add a unit test for the case that fails to your PR?

@reckart
Copy link
Member

reckart commented Jun 30, 2022

Also, which version of cassis did you encounter the problem in?

@simonmeoni
Copy link
Contributor Author

Also, which version of cassis did you encounter the problem in?

I tested on these both versions : dkpro-cassis-0.7.1 and dkpro-cassis-0.8.0.dev0. Is there another version that could be probably working for my case?

@simonmeoni
Copy link
Contributor Author

simonmeoni commented Jun 30, 2022

Could you please provide a minimal sample XMI file that cassis fails to parse? Could you maybe add a unit test for the case that fails to your PR?

I provide you an example who didn't work for me without my change. It concerns the typesystem.xml above all (I suppose)

@reckart
Copy link
Member

reckart commented Jul 1, 2022

I have not looked at it in detail yet, but if I remember correctly, the place where you added the string array handling is about inlined arrays in the form <XXX arrayFeature="true false true true"/> where the array values are space separated. In that code, we only see a single XML element and its attributes. Now, for a string array this kind of inlining is not possible because a string can contain spaces. So for string arrays, the form

<XXX>
  <arrayFeature>value 1</arrayFeature>
  <arrayFeature>value 2</arrayFeature>
  <arrayFeature>value 3</arrayFeature>
</XXX>

must be used - which is done in a slightly different part of the code where we walk the XML element tree.

@simonmeoni
Copy link
Contributor Author

The typesystem I provided you was generated by Inception. When I keep the string array element inside the typesystem, the parser crashes even if my xmi file doesn't have any string array annotation. If I remove it, the parser works.
Is it a problem from Inception who generates an unreadable typesystem for cassis?

@reckart
Copy link
Member

reckart commented Jul 6, 2022

So the problem here seems to be that Assertion is a StringArray feature but in the XMI file it appears as an attribute with an empty value:

<custom:Span xmi:id="298" sofa="1" begin="13" end="19" Assertion="" label="LABEL1"/>

... another of the XMI oddities ... I need to check how UIMA treats this in the Java implementation to decide how to treat it in cassis.

@simonmeoni
Copy link
Contributor Author

I tried to use a previous version of cassis. And version 0.5.3 seems to work with string array. The only difference is type checking :

  • dkpro-cassis/cassis/xmi.py

    Lines 336 to 379 in 12de459

    def _parse_feature_structure(self, typesystem: TypeSystem, elem, children: Dict[str, List[str]]):
    # Strip the http prefix, replace / with ., remove the ecore part
    # TODO: Error checking
    type_name: str = elem.tag[9:].replace("/", ".").replace("ecore}", "").strip()
    if type_name.startswith("uima.noNamespace."):
    type_name = type_name[17:]
    AnnotationType = typesystem.get_type(type_name)
    attributes = dict(elem.attrib)
    attributes.update(children)
    # Map the xmi:id attribute to xmiID
    attributes["xmiID"] = int(attributes.pop("{http://www.omg.org/XMI}id"))
    if "begin" in attributes:
    attributes["begin"] = int(attributes["begin"])
    if "end" in attributes:
    attributes["end"] = int(attributes["end"])
    if "sofa" in attributes:
    attributes["sofa"] = int(attributes["sofa"])
    # Remap features that use a reserved Python name
    if "self" in attributes:
    attributes["self_"] = attributes.pop("self")
    if "type" in attributes:
    attributes["type_"] = attributes.pop("type")
    # Arrays which were represented as nested elements in the XMI have so far have only been parsed into a Python
    # arrays. Now we convert them to proper UIMA arrays/lists
    if not typesystem.is_primitive_array(type_name):
    for feature_name, feature_value in children.items():
    feature = AnnotationType.get_feature(feature_name)
    if typesystem.is_primitive_array(feature.rangeType):
    ArrayType = feature.rangeType
    attributes[feature_name] = ArrayType(elements=attributes[feature_name])
    if typesystem.is_primitive_list(feature.rangeType):
    attributes[feature_name] = self._parse_primitive_list(feature.rangeType, attributes[feature_name])
    self._max_xmi_id = max(attributes["xmiID"], self._max_xmi_id)
    return AnnotationType(**attributes)
  • dkpro-cassis/cassis/xmi.py

    Lines 254 to 283 in 1dd129d

    def _parse_feature_structure(self, typesystem: TypeSystem, elem, children: Dict[str, List[str]]):
    # Strip the http prefix, replace / with ., remove the ecore part
    # TODO: Error checking
    typename = elem.tag[9:].replace("/", ".").replace("ecore}", "").strip()
    AnnotationType = typesystem.get_type(typename)
    attributes = dict(elem.attrib)
    attributes.update(children)
    # Map the xmi:id attribute to xmiID
    attributes["xmiID"] = int(attributes.pop("{http://www.omg.org/XMI}id"))
    if "begin" in attributes:
    attributes["begin"] = int(attributes["begin"])
    if "end" in attributes:
    attributes["end"] = int(attributes["end"])
    if "sofa" in attributes:
    attributes["sofa"] = int(attributes["sofa"])
    # Remap features that use a reserved Python name
    if "self" in attributes:
    attributes["self_"] = attributes.pop("self")
    if "type" in attributes:
    attributes["type_"] = attributes.pop("type")
    self._max_xmi_id = max(attributes["xmiID"], self._max_xmi_id)
    return AnnotationType(**attributes)

Maybe it lacks some additional checking for primitive arrays?

@reckart
Copy link
Member

reckart commented Jul 6, 2022

We don't have a case of an emtpy string array in our test suite so far. So they need to be updated to include such a case.

As for reading: your change works, but IMHO it goes to far. It works because it splits an empty attribute value and then creates an empty array from that. The only thing we should support there is if the attribute is empty, then create an empty string. If the attribute is not empty, then we have a bug because that case should be handled elsewhere.

And then we need to check if writing empty string arrays using cassis works properly...

@reckart reckart self-assigned this Jul 6, 2022
@reckart reckart added this to the 0.7.2 milestone Jul 6, 2022
@reckart
Copy link
Member

reckart commented Jul 6, 2022

We have an issue for this now: Empty string arrays are not supported #258

- Extend test data to cover empty string array
- Fix parsing empty string array from XMI
- Fix serializing empty string array to XMI
@reckart reckart merged commit 0c6315f into dkpro:main Jul 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants