Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to convert single character to Enum using parser #998

Open
hantsy opened this issue Dec 11, 2024 · 16 comments
Open

Failed to convert single character to Enum using parser #998

hantsy opened this issue Dec 11, 2024 · 16 comments

Comments

@hantsy
Copy link

hantsy commented Dec 11, 2024

After upgrading to v0.15, a new issue occurred in our application.

There is a column that accepts a single character in the CSV, and we use the following parser to convert it to Enum from a raw string directly.

df.
...
.convertTo<xxx>{
    parser { ACHType.fromSymbol(it) }
}

which worked well with the former version, but failed now with the following information.

Type converter from kotlin.Char to xxx.ACHType? is not found for column 'ACH Type'
@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Dec 11, 2024

Ah it's because we added Char recognition to our parser system.

parser {} here is just a shorthand of writing convert<String>.with {} I believe.

It should work if you write
convert<Char>.with { ACHType.fromSymbol(it.toString()) }

I'll have a look whether we can pass Chars (as String) through the parser function as well, that way we can keep the previous notation working. Also, I'll add some tests to catch it in the future. Thanks!

@hantsy
Copy link
Author

hantsy commented Dec 11, 2024

When using

.convertTo<DataSchemaType>{
    parser { ACHType.fromSymbol(it) }
    parser { other type parsing from it}
}

Here, it is always a string.

@hantsy
Copy link
Author

hantsy commented Dec 11, 2024

In the original post, I think the problem is the parser is not called.

It did not work as other parsers.

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Dec 11, 2024

@hantsy yes that's true. Could you share df.schema() in your example?

I suspect convertTo { parser { ... } } (being the same as convertTo { convert<String>().with { ... } }) is not called because you have no String columns in df.

Reading CSV can now give Char columns because of changes in our df.parse() implementation. (Note convertTo { parser {} } is a completely different function than df.parse {})

We'd have to change convertTo { parser { ... } } to handle both Char columns and String columns to keep the previous behavior the same.

@hantsy
Copy link
Author

hantsy commented Dec 12, 2024

I create an example project to produce the issue, https://github.com/hantsy/dataframe-sandbox/blob/master/src/main/kotlin/Main.kt

When the mainKt, it will display the following info.

12:58:11.221 [pool-1-thread-4] DEBUG org.jetbrains.kotlinx.dataframe.impl.io.FastDoubleParser -- Could not parse 'N' as Double with NumberFormat with locale 'en_US'.
Amount: Double
Last 4: Int
ACH Type: Char
Exception in thread "main" org.jetbrains.kotlinx.dataframe.exceptions.TypeConverterNotFoundException: Type converter from kotlin.Char to org.example.ACHType? is not found for column 'ACH Type'
	at org.jetbrains.kotlinx.dataframe.impl.api.ConvertKt.convertToTypeImpl$convertPerCell(convert.kt:189)
	at org.jetbrains.kotlinx.dataframe.impl.api.ConvertKt.convertToTypeImpl(convert.kt:209)
	at org.jetbrains.kotlinx.dataframe.api.ConvertKt.convertTo(convert.kt:137)
	at org.jetbrains.kotlinx.dataframe.impl.api.ConvertToKt.convertToImpl$convertToSchema(convertTo.kt:175)
	at org.jetbrains.kotlinx.dataframe.impl.api.ConvertToKt.convertToImpl(convertTo.kt:277)
	at org.example.MainKt.main(Main.kt:51)
	at org.example.MainKt.main(Main.kt)

Process finished with exit code 1

@hantsy
Copy link
Author

hantsy commented Dec 12, 2024

We'd have to change convertTo { parser { ... } } to handle both Char columns and String columns to keep the previous behavior the same.

it seems this is a free-style conversion for developers.

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Dec 12, 2024

it seems this is a free-style conversion for developers.

What do you mean with "free-style" conversion?
At the moment it's just a shortcut for df.convertTo<Schema> { convert<String>.with { ... } }

I create an example project to produce the issue, https://github.com/hantsy/dataframe-sandbox/blob/master/src/main/kotlin/Main.kt

When the mainKt, it will display the following info.

12:58:11.221 [pool-1-thread-4] DEBUG org.jetbrains.kotlinx.dataframe.impl.io.FastDoubleParser -- Could not parse 'N' as Double with NumberFormat with locale 'en_US'.
Amount: Double
Last 4: Int
ACH Type: Char
Exception in thread "main" org.jetbrains.kotlinx.dataframe.exceptions.TypeConverterNotFoundException: Type converter from kotlin.Char to org.example.ACHType? is not found for column 'ACH Type'
	at org.jetbrains.kotlinx.dataframe.impl.api.ConvertKt.convertToTypeImpl$convertPerCell(convert.kt:189)
	at org.jetbrains.kotlinx.dataframe.impl.api.ConvertKt.convertToTypeImpl(convert.kt:209)
	at org.jetbrains.kotlinx.dataframe.api.ConvertKt.convertTo(convert.kt:137)
	at org.jetbrains.kotlinx.dataframe.impl.api.ConvertToKt.convertToImpl$convertToSchema(convertTo.kt:175)
	at org.jetbrains.kotlinx.dataframe.impl.api.ConvertToKt.convertToImpl(convertTo.kt:277)
	at org.example.MainKt.main(Main.kt:51)
	at org.example.MainKt.main(Main.kt)

Process finished with exit code 1

Ah yes, as I expected. Just to check we both understand the situation:

Your CSV is parsed as Double, Int, Char.

Then you want to convert it to CsvExampleDataModel with convertTo, which DataFrame can do, but it needs some extra information, because it doesn't know how to convert Char -> your custom enum class.

As you can see in the docs, you can provide your strategies to the DSL of how it can convert type A to type B, like df.convertTo<Schema> { convert<A>().with { it.toB() } }. parser {} just defines how to convert from String to type B.

In the DSL you write:

.convertTo<CsvExampleDataModel> {
    // freely convert
    parser { ACHType.fromSymbol(it) }
    parser { BigDecimal(it) }
}

which tells dataframe how to convert String -> ACHType and String -> BigDecimal, but it still has no idea how to convert Char -> ACHType, which is necessary for converting your df to CsvExampleDataModel, so that's where it throws the exception.

Were you to add the strategy convert<Char>().with { ACHType.fromSymbol(it.toString()) }, the conversion would succeed.

So far the current situation. To make this situation easier in the future, I suggest either of the following changes:

  • we introduce the charParser { it: Char -> ... } notation besides parser { it: String -> ... } as a shortcut for convert<Char>().with { ... }
  • we make it so that any Char can be converted to type B if we know of a strategy to convert String -> type B, because we can do Char -> String -> B under the hood (This is implemented in Handling Chars as Strings implicitly #999). This would make it that you can write parser { it: String -> it.toTypeB() } once and provide the strategies Char -> B and String -> B at the same time.

Which would you think is the best solution?

(On a slightly related note, if you want your CSV to not parse String -> Char, but instead keep them as Strings, as in older versions, you can now supply parserOptions = ParserOptions(skipTypes = setOf(typeOf<Char>())) to the readCsv() function)
(alternatively, there's readCsv(colTypes = mapOf("ACH Type" to ColType.String))

@hantsy
Copy link
Author

hantsy commented Dec 12, 2024

What do you mean with "free-style" conversion?

It means I can not convert the original string value to my type as expected and ignore the Dataframe built-in conversion.

@hantsy
Copy link
Author

hantsy commented Dec 12, 2024

Were you to add the strategy convert().with { ACHType.fromSymbol(it.toString()) }, the conversion would succeed.

It breaks the existing rules, currently I found when using parser for all columns values in the convertTo<MyType>{} block is string values, not converted types.

@hantsy
Copy link
Author

hantsy commented Dec 12, 2024

Which would you think is the best solution?

I am not sure which one is better. But I think the original convertTo<DataSchemaType>{} is a good place for developers to convert the raw strings to the expected types.

(of course, I do not know if it is the design purpose. If it is not, as a developer, I think it is better to leave such room for developers to handle conversion manually instead of applying built-in converters)

@hantsy
Copy link
Author

hantsy commented Dec 12, 2024

And from my sample codes, in the schema.print result, the decimal-like literals are treated as Double type.

But in the convertTo<CsvExampleDateModel>{ BigDecimal(it)}, here it is still recognized as String type. It is not converted from a Double.

See the picture.

2024-12-12 20 17 48

@Jolanrensen
Copy link
Collaborator

DataFrame.readCsv and convertTo are two different operations with different use cases and options. (Remember each dataframe operation produces a new immutable dataframe, we have no reference back to the source after an operation)

readCsv

DataFrame.readCsv() reads a CSV file to a DataFrame. By default it performs type inference in the form of parsing, however this parsing can be disabled per-column by telling it which column types you expect, like:

val df = DataFrame.readCsv(file,
    colTypes = mapOf(
        "Amount" to ColType.Double, // per column name
        ColType.DEFAULT to ColType.String, // or for all columns at once
    ),
)

After calling this example, we'll get a dataFrame with Amount: Double (if a column with that name existed in the CSV) and all other columns as String.
There is no mechanism built into the parsing system which would allow for custom ColTypes at the moment.

convert

You can, however, perform a convert operation on the resulting dataframe, to convert any column to any type you like. So if you'd like to convert the column `ACH Type` (which, if we set the colTypes correctly before is now of type String) to your custom enum, you could write:

val newDf = df.convert { "ACH Type"<String>() }.with { ACHType.fromSymbol(it) }

This results in a new dataframe with the columns Amount: Double, Last 4: String, and ACH Type: ACHType.

convertTo

Another operation is DataFrame.convertTo<Schema> {} which tries its best to make the dataframe adhere to the schema you supply by automatically finding converters for each source column to the target column of the same name.

Calling

val result1 = newDf.convertTo<CsvExampleDataModel>()

tells DataFrame to do the following conversions:
Amount: Double -> Amount: BigDecimal
Last 4: String -> Last: 4: String
ACH Type: ACHType -> ACH Type: ACHType

which it can do just fine, because we support automatic type conversion between any number types.

However, calling this on the previous dataframe:

df.convertTo<CsvExampleDataModel>()

will fail, because it tells DataFrame to do:
Amount: Double -> Amount: BigDecimal (automatic number conversion)
Last 4: String -> Last: 4: String
ACH Type: String -> ACH Type: ACHType (no automatic type conversion is known to do this)

You can fix this by telling DataFrame how to do this type of conversion like:

df.convertTo<CsvExampleDataModel> {
    convert<String>().with { ACHType.fromSymbol(it) }
}

this tells DataFrame to do:
Amount: Double -> Amount: BigDecimal (automatic number conversion)
Last 4: String -> Last: 4: String
ACH Type: String -> ACH Type: ACHType (by calling your given lambda: { ACHType.fromSymbol(it) })

This succeeds :)

An alternate notation which does exactly the same is:

df.convertTo<CsvExampleDataModel> {
    parser { ACHType.fromSymbol(it) }
}

Amount: Double -> Amount: BigDecimal (automatic number conversion)
Last 4: String -> Last: 4: String
ACH Type: String -> ACH Type: ACHType (by calling your given lambda: { ACHType.fromSymbol(it) })

Hopefully this explanation helps to understand how these operations work and how they can be used.

All examples I just wrote above can be executed in this notebook if you want to get a better feeling for them.

@hantsy
Copy link
Author

hantsy commented Dec 13, 2024

Thanks for your explanation. In my example, when reading the CSV, I did not add colTypes to declare the expected column types, then I used the default type inferences.

The schema.print() got the result:

Amount: Double
Last 4: Int
ACH Type: Char

I want to convert to the expected types defined in CsvExampleDataModel.

Amount: Double   -> BigDecimal
Last 4: Int      -> String
ACH Type: Char   -> ACHType (enum class)

I call convertTo<CsvExampleDataModel> directly as the following.

.convertTo<CsvExampleDataModel> {
    // freely convert
    parser { ACHType.fromSymbol(it) } // `it` always refers to a String type 
    parser { BigDecimal(it) }
}

The it refers to a String type in all parser blocks.

The Amount can be converted by Double -> String (in BigDecimal(it) it is a String, see above picture) -> BigDecimal. I did not apply a convert before the parser to convert Amount from Double to String.

Why ACHType can not be converted using the same flow as previous versions: Char -> String (in ACHType.fromSymbol(it) it is a String) -> ACHType?

@hantsy
Copy link
Author

hantsy commented Dec 13, 2024

Anyway, using parserOptions = ParserOptions(skipTypes = setOf(typeOf<Char>())) option in the readCsv resolved my problem. Thanks.

@Jolanrensen
Copy link
Collaborator

I want to convert to the expected types defined in CsvExampleDataModel.

Amount: Double   -> BigDecimal
Last 4: Int      -> String
ACH Type: Char   -> ACHType (enum class)

I call convertTo<CsvExampleDataModel> directly as the following.

.convertTo<CsvExampleDataModel> {
    // freely convert
    parser { ACHType.fromSymbol(it) } // `it` always refers to a String type 
    parser { BigDecimal(it) }
}

The it refers to a String type in all parser blocks.

The Amount can be converted by Double -> String (in BigDecimal(it) it is a String, see above picture) -> BigDecimal. I did not apply a convert before the parser to convert Amount from Double to String.

Why ACHType can not be converted using the same flow as previous versions: Char -> String (in ACHType.fromSymbol(it) it is a String) -> ACHType?

In your example, you can remove the line parser { BigDecimal(it) }. It is never called and does nothing, because DataFrame your source df has no String column that needs to be converted to BigDecimal. Remember that parser {} just provides the convertTo<Schema> operation the strategy for how to convert from String to your target type. This doesn't mean it has to use it. It is only called when your source dataframe has a column of String that needs to be converted to a column of your target type.

In this example, convertTo<CsvExampleDataModel> does its own magic conversion from Double -> BigDecimal, there's no String in between.

So, again, to fix the example, you should write:

DataFrame.readCsv(...)
    .convertTo<CsvExampleDataModel> {
        // supplying convertTo the function (Char) -> ACHType it can use when needed
        convert<Char>().with { ACHType.fromSymbol(it.toString) }
    }

@hantsy
Copy link
Author

hantsy commented Dec 14, 2024

In this example, convertTo does its own magic conversion from Double -> BigDecimal, there's no String in between.

Got it, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants