readDelimiter variant for Regex as delimiter #746

Closed
dave08 opened this issue Jun 20, 2024 · 9 comments
Assignees
Labels
csv CSV / delim related issues enhancement New feature or request files reading/writing from/to files
Milestone

Comments

@dave08

dave08 commented Jun 20, 2024

Since this function exists specifically to read delimited data, it might be useful to have an overload that takes a Regex as the delimiter. This would help with command-line output tables, which are usually space-separated but sometimes contain a single space inside a column value, so I need to use "\s\s+" to read them in correctly.

@koperagen
Collaborator

Hi. The library we're using now only has String and Char options for the delimiter. Is your file a CSV/TSV, or just a plain txt with some special format you want to parse?

@dave08
Author

dave08 commented Jun 20, 2024

Say I have (output from kubectl get namespaces):

NAME                     STATUS   AGE      LABELS
argo-events              Active   2y77d    app.kubernetes.io/instance=argo-events,kubernetes.io/metadata.name=argo-events
argo-workflows           Active   2y77d    app.kubernetes.io/instance=argo-workflows,kubernetes.io/metadata.name=argo-workflows
argocd                   Active   5y18d    kubernetes.io/metadata.name=argocd
beta                     Active   4y235d   kubernetes.io/metadata.name=beta

Then I have multiple spaces as delimiters...

In some command line outputs, I have two words in one column:

NAME                                                                     CLUSTER        CDS        LDS        EDS        RDS          ECDS         ISTIOD                             VERSION
foo-5fcd67944f-2t97k.dev                                           Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED       NOT SENT     istiod-1-18-7-dbcdbb5f4-nth9n      1.18.7
foo-6f8bf4c9b9-qrwf9.prod                                          Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED       NOT SENT     istiod-1-16-7-6d46d45875-gxtzw     1.16.7

Like that NOT SENT... that's where a regex can help here. It's not just tabs, it's a bunch of spaces.
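
For illustration, here is a minimal sketch of the regex idea using only the Kotlin standard library. `parseAlignedTable` and `columnGap` are hypothetical names, not part of the DataFrame API. It assumes columns are separated by at least two whitespace characters and that no cell is empty.

```kotlin
// Hypothetical helper, not part of DataFrame: a column gap is a run of
// two or more whitespace characters, so "NOT SENT" stays a single cell.
val columnGap = Regex("""\s{2,}""")

fun parseAlignedTable(text: String): List<Map<String, String>> {
    val lines = text.lines().filter { it.isNotBlank() }
    if (lines.isEmpty()) return emptyList()
    // The first non-blank line supplies the column names.
    val header = lines.first().trim().split(columnGap)
    return lines.drop(1).map { line ->
        val cells = line.trim().split(columnGap)
        // Pair each header with its cell; missing trailing cells become "".
        header.mapIndexed { i, name -> name to cells.getOrElse(i) { "" } }.toMap()
    }
}
```

An empty cell in the middle of a row would shift the remaining values to the left, so this only works for output where every column is filled.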

Also, how would you parse Markdown tables (or similar)...? Unless the library trims all those extra spaces... but I guess with Markdown there might be more complications than just a delimiter.

@koperagen
Collaborator

Good questions indeed. I think such tables should be parsed by readDelimStr in the future. For now I can only suggest something like this for Markdown.

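import org.jetbrains.kotlinx.dataframe.api.*

// Split one Markdown table row into trimmed cell values.
fun String.markdownCells() = trim('|').split("|").map { it.trim() }

val s = """
| Month    | Savings |
| -------- | ------- |
| January  | $250    |
| February | $80     |
| March    | $420    |""".trimIndent()

val lines = s.lineSequence()
// Skip the header and separator rows, then split the single "value" column into the header's column names.
lines.drop(2).toList().toDataFrame().split { value }.by { it.markdownCells() }.into(lines.first().markdownCells())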

@dave08
Author

dave08 commented Jun 20, 2024

I think that's a bit of an advanced technique for most people with this kind of use case... and it involves parsing in two steps...

I wonder if some kind of readDSL would be better here... it could possibly work by line and give helpers for extracting the titles and values?

@koperagen koperagen added the enhancement New feature or request label Jun 20, 2024
@koperagen
Collaborator

Please share the desired API or usage examples that you have in mind. Maybe something like this could be added.

@zaleslaw zaleslaw modified the milestones: 0.14.0, Backlog Jul 19, 2024
@zaleslaw zaleslaw added the files reading/writing from/to files label Jul 19, 2024
@Jolanrensen Jolanrensen added the csv CSV / delim related issues label Aug 20, 2024
@Jolanrensen Jolanrensen self-assigned this Aug 20, 2024
@Jolanrensen Jolanrensen mentioned this issue Aug 20, 2024
27 tasks
@Jolanrensen
Collaborator

Jolanrensen commented Oct 22, 2024

I'm closing this for now. We're working on a new CSV implementation based on Deephaven CSV #827 since it's faster and lighter; unfortunately, it doesn't allow a Regex as the delimiter either, just a Char.
It does have multiple other options like ignoreSurroundingSpaces, which can trim leading and trailing blanks around values and it can recognize quote characters. That might help :).

We plan to have an experimental version of it in 0.15. If that still does not work, I'd recommend modifying the string manually, potentially adding quote characters and then parsing it as delimStr.
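
As a rough sketch of that manual workaround, the text can be rewritten into quoted, comma-separated form before it is parsed as a delimited string. `alignedTextToCsv` is a hypothetical helper built on the Kotlin standard library only. It assumes that two or more whitespace characters mark a column boundary.

```kotlin
// Hypothetical pre-processing step, not part of DataFrame: turn
// whitespace-aligned text into quoted CSV so that a Char delimiter suffices.
fun alignedTextToCsv(text: String, gap: Regex = Regex("""\s{2,}""")): String =
    text.lines()
        .filter { it.isNotBlank() }
        .joinToString("\n") { line ->
            line.trim().split(gap).joinToString(",") { cell ->
                // Double any embedded quotes, then wrap the cell in quotes
                // so single spaces and commas inside a value survive.
                "\"" + cell.replace("\"", "\"\"") + "\""
            }
        }
```

The resulting string can then be handed to a regular comma-delimited reader. Quoting every cell keeps the comma-separated label values from the kubectl example in one column.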

Edit: well, apparently it seems to have some issues with delimiter = ' ', ignoreSurroundingSpaces = true. I'll make an issue over at deephaven XD deephaven/deephaven-csv#212

@Jolanrensen Jolanrensen closed this as not planned Oct 22, 2024
@kosak

kosak commented Oct 26, 2024

Hi, since you mentioned you were developing your own CSV library, I thought I would comment here.

Whether you decide to use Deephaven's CSV library or develop your own, there are a variety of things we learned along the way that may benefit you. We used some clever ideas for high performance and also some cute tricks for automatic "type inference". I'd be happy to discuss in more detail in some appropriate forum if you would find that helpful. Best, Corey Kosak @ Deephaven

@Jolanrensen
Collaborator

@kosak We're not developing our own CSV library. We're simply replacing our Apache commons CSV integration in DataFrame with Deephaven's :) exactly for the reasons you mentioned; performance, type inference, etc. Plus, while we currently don't store our data primitively, using Deephaven, that remains a viable option in the future.

@Jolanrensen
Collaborator

deephaven/deephaven-csv#212 is merged :)

We'll add it in #903. Simply set hasFixedWidthColumns = true and the column widths are determined by the width of the headers + spaces.

You can also manually specify fixedColumnWidths if this goes wrong.
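
To make the fixed-width idea concrete, here is a small standalone sketch in plain Kotlin. It is not the Deephaven or DataFrame implementation. `columnStarts` and `parseFixedWidth` are hypothetical names, and the sketch assumes that header names contain no spaces and that every value starts at the same offset as its header.

```kotlin
// A column starts wherever a non-space character follows a space
// (or the start of the line) in the header row.
fun columnStarts(header: String): List<Int> =
    header.indices.filter { i -> header[i] != ' ' && (i == 0 || header[i - 1] == ' ') }

fun parseFixedWidth(text: String): List<List<String>> {
    val lines = text.lines().filter { it.isNotBlank() }
    if (lines.isEmpty()) return emptyList()
    val starts = columnStarts(lines.first())
    return lines.map { line ->
        starts.mapIndexed { i, start ->
            // Each column runs up to the next column's start; the last
            // column runs to the end of the line.
            val end = if (i + 1 < starts.size) starts[i + 1] else line.length
            if (start >= line.length) ""
            else line.substring(start, minOf(end, line.length)).trim()
        }
    }
}
```

Because cells are cut by position rather than by delimiter, a value such as NOT SENT stays intact, and an empty cell does not shift its neighbours. A value wider than its header column would be cut incorrectly, which is presumably the case where explicit widths are needed.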


5 participants