readDelimiter variant for Regex as delimiter #746

dave08 · 2024-06-20T11:38:14Z

Since this function is specifically for reading delimited data, it might be useful to have an overload that takes a Regex as the delimiter. This would help with command-line output tables, which are usually space-separated but sometimes contain a single space inside a column value, so I need to use "\s\s+" to read them correctly.

koperagen · 2024-06-20T12:39:22Z

Hi. The library we're using now only has String and Char options for the delimiter. Is your file a CSV/TSV or just a plain txt with some special format you want to parse?

dave08 · 2024-06-20T12:49:26Z

Say I have (output from kubectl get namespaces):

NAME                     STATUS   AGE      LABELS
argo-events              Active   2y77d    app.kubernetes.io/instance=argo-events,kubernetes.io/metadata.name=argo-events
argo-workflows           Active   2y77d    app.kubernetes.io/instance=argo-workflows,kubernetes.io/metadata.name=argo-workflows
argocd                   Active   5y18d    kubernetes.io/metadata.name=argocd
beta                     Active   4y235d   kubernetes.io/metadata.name=beta

Then I have multiple spaces as delimiters...

In some command line outputs, I have two words in one column:

NAME                                                                     CLUSTER        CDS        LDS        EDS        RDS          ECDS         ISTIOD                             VERSION
foo-5fcd67944f-2t97k.dev                                           Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED       NOT SENT     istiod-1-18-7-dbcdbb5f4-nth9n      1.18.7
foo-6f8bf4c9b9-qrwf9.prod                                          Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED       NOT SENT     istiod-1-16-7-6d46d45875-gxtzw     1.16.7

Like that NOT SENT... that's where a regex can help here. It's not just tabs, it's a bunch of spaces.
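To make the request concrete, here is a minimal sketch in plain Kotlin of what splitting on a regex delimiter would do for such output. The function name `parseSpaceAlignedTable` is hypothetical and not part of the DataFrame API; it only uses the standard library and assumes that columns are separated by at least two whitespace characters and that no cell is empty.

```kotlin
// Hypothetical helper (not a DataFrame function): treats any run of two or
// more whitespace characters as the column delimiter.
fun parseSpaceAlignedTable(text: String): List<Map<String, String>> {
    val delimiter = Regex("""\s{2,}""")
    val lines = text.lines().filter { it.isNotBlank() }
    // The first non-blank line is taken as the header row.
    val header = lines.first().trim().split(delimiter)
    return lines.drop(1).map { line ->
        val cells = line.trim().split(delimiter)
        // zip stops at the shorter list, so a row with missing cells yields fewer keys.
        header.zip(cells).toMap()
    }
}
```

With this, a cell such as NOT SENT stays in one piece because it contains only a single space. The approach breaks down when a cell is empty, since the surrounding runs of spaces merge and the remaining values shift one column to the left.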

Also, how would you parse Markdown tables (or similar)...? Unless the library trims all those extra spaces... but I guess with Markdown there might be more complications than just a delimiter.

koperagen · 2024-06-20T13:57:56Z

Good questions indeed. I think such tables should be parsed by readDelimStr in the future. For now I can only suggest something like this for Markdown.

fun String.markdownCells() = trim('|').split("|").map { it.trim() }

val s = """
| Month    | Savings |
| -------- | ------- |
| January  | $250    |
| February | $80     |
| March    | $420    |""".trimIndent()

val lines = s.lineSequence()
lines.drop(2).toList().toDataFrame().split { value }.by { it.markdownCells() }.into(lines.first().markdownCells())

dave08 · 2024-06-20T15:25:06Z

I think that's a bit of an advanced technique for most people with this kind of use case... and it involves parsing in two steps...

I wonder if some kind of readDSL would be better here... it could possibly work by line and give helpers for extracting the titles and values?

koperagen · 2024-06-20T15:54:32Z

Please share the desired API or usage examples that you have in mind. Maybe something like this could be added.

Jolanrensen · 2024-10-22T11:45:10Z

I'm closing this for now. We're working on a new CSV implementation based on Deephaven CSV (#827) since it's faster and lighter; however, it also doesn't allow a Regex as the delimiter, unfortunately, just a Char.
It does have multiple other options, like ignoreSurroundingSpaces, which can trim leading and trailing blanks around values, and it can recognize quote characters. That might help :).

We plan to have an experimental version of it in 0.15. If that still does not work, I'd recommend modifying the string manually, potentially adding quote characters and then parsing it as delimStr.
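One way the manual preprocessing could look, as a rough sketch: rewrite the space-aligned text into quoted CSV before handing it to the reader. `toQuotedCsv` is a hypothetical helper name, and the two-or-more-spaces rule is an assumption about the input rather than something the library does.

```kotlin
// Hypothetical preprocessing step: turn space-aligned text into quoted CSV.
fun toQuotedCsv(text: String): String {
    val columnGap = Regex("""\s{2,}""")
    return text.lines()
        .filter { it.isNotBlank() }
        .joinToString("\n") { line ->
            line.trim()
                .split(columnGap)
                // Quote every cell and double any embedded quote characters.
                .joinToString(",") { cell -> "\"" + cell.replace("\"", "\"\"") + "\"" }
        }
}
```

The resulting string uses a plain comma as the delimiter, so it should be readable with a Char delimiter and quote handling. As with any split on runs of spaces, empty cells are not preserved.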

Edit: well, apparently it seems to have some issues with delimiter = ' ', ignoreSurroundingSpaces = true. I'll make an issue over at deephaven XD deephaven/deephaven-csv#212

kosak · 2024-10-26T04:41:00Z

Hi, since you mentioned you were developing your own CSV library, I thought I would comment here.

Whether you decide to use Deephaven's CSV library or develop your own, there are a variety of things we learned along the way that may benefit you. We used some clever ideas for high performance and also some cute tricks for automatic "type inference". I'd be happy to discuss in more detail in some appropriate forum if you would find that helpful. Best, Corey Kosak @ Deephaven

Jolanrensen · 2024-10-28T09:53:13Z

@kosak We're not developing our own CSV library. We're simply replacing our Apache commons CSV integration in DataFrame with Deephaven's :) exactly for the reasons you mentioned; performance, type inference, etc. Plus, while we currently don't store our data primitively, using Deephaven, that remains a viable option in the future.

Jolanrensen · 2024-11-11T12:26:36Z

deephaven/deephaven-csv#212 is merged :)

We'll add it in #903. Simply set hasFixedWidthColumns = true and the column widths are determined by the width of the headers + spaces.

You can also manually specify fixedColumnWidths if this goes wrong.
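For readers unfamiliar with fixed-width parsing, the idea of deriving column widths from the header can be sketched in plain Kotlin. This only illustrates the principle and is not the Deephaven or DataFrame implementation; `parseFixedWidth` is a hypothetical name, and the sketch assumes header names contain no spaces and that every value starts at or after its header's position.

```kotlin
// Illustration only: derive column start offsets from the header row,
// then slice every row (including the header) at those offsets.
fun parseFixedWidth(text: String): List<List<String>> {
    val lines = text.lines().filter { it.isNotBlank() }
    val header = lines.first()
    // A column starts at each non-space character that follows a space (or at index 0).
    val starts = header.indices.filter { i ->
        header[i] != ' ' && (i == 0 || header[i - 1] == ' ')
    }
    return lines.map { line ->
        starts.mapIndexed { idx, start ->
            // The last column runs to the end of the line.
            val end = if (idx + 1 < starts.size) starts[idx + 1] else line.length
            if (start >= line.length) "" else line.substring(start, minOf(end, line.length)).trim()
        }
    }
}
```

Unlike splitting on runs of spaces, slicing by offset keeps empty cells in place and leaves values such as NOT SENT intact. A value that is wider than its column would be cut, which is the kind of case where specifying the widths manually would be needed.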

dave08 mentioned this issue Jun 20, 2024

Add delimiter parameter to readDelimStr #743

Merged

koperagen added the enhancement New feature or request label Jun 20, 2024

zaleslaw modified the milestones: 0.14.0, Backlog Jul 19, 2024

zaleslaw added the files reading/writing from/to files label Jul 19, 2024

Jolanrensen added the csv CSV / delim related issues label Aug 20, 2024

Jolanrensen self-assigned this Aug 20, 2024

Jolanrensen mentioned this issue Aug 20, 2024

☂ CSV rework #827

Open

27 tasks

Jolanrensen closed this as not planned Oct 22, 2024
