Request for .frompandas() function #215
Comments
If you don't have any df["foo"].values
df["bar"].values If you want to wrap these up in an awkward Let me know if that's something you need. I didn't think a |
@jpivarski Could you give me a working example with the DataFrame above? Because these are not working for me:
1:
import awkward
import pandas as pd
df = pd.DataFrame({"foo": [2, 8], "bar": [0.3, -0.9]})
aw_table = awkward.Table(df)
awkward.topandas(aw_table, flatten=True) # or False
# True: ValueError: this array has unflattenable substructure: [0, 2) -> float64
# False: ValueError: If using all scalar values, you must pass an index
2:
aw_table = awkward.Table(awkward.fromiter(df))
awkward.topandas(aw_table, flatten=True) # or False
# 0
# 0 foo
# 1 bar
3:
aw_table = awkward.Table([df["foo"].values, df["bar"].values])
awkward.topandas(aw_table, flatten=True) # or False
# True: ValueError: this array has unflattenable substructure: [0, 2) -> float64
# False: ValueError: If using all scalar values, you must pass an index
4:
aw_table = awkward.Table(awkward.fromiter([df["foo"].values, df["bar"].values]))
awkward.topandas(aw_table, flatten=True) # or False
# True:
# 0
# 0 0 2.0
# 1 8.0
# 1 0 0.3
# 1 -0.9
# False:
# 0
# 0 [2. 8.]
# 1 [ 0.3 -0.9]
In all 4 attempts the original DataFrame was not correctly reconstructed. I think converting from and to DataFrames would be quite a standard operation for people in the field of Machine Learning. So if it is indeed me not understanding this properly, I think it would be good to include an example like this in the documentation. An argument for EDIT: Something like
About closing issues: Non-contributors cannot reopen them. From other issues here, you seem very keen on closing issues quickly, probably to keep only meaningful issues in the tracker? This page gives a good overview of best practices for open source projects: http://zguide.zeromq.org/page:chapter6
More reading on building online communities (if you're interested): https://hintjens.gitbooks.io/social-architecture/content/ If you still want to keep the issue tracker clean, I would recommend using a Stale bot (https://github.com/apps/stale). After some inactivity, it will mark it as stale, and eventually automatically close it. |
Sorry—I didn't realize you couldn't open it. My only reason for closing it was so that I would have a better idea of which ones I need to worry about (i.e. the open ones). The volume of questions (not just from GitHub Issues) is getting to be a difficult management problem. Whenever I've been closing them, I've included some text to explain that it's not final—I've been considering them "done for now"—but that was based on the assumption that you could reopen them. I won't do that anymore. (Maybe I'll have to find a label or something, but I can't set labels on my phone.) What happens when you call Does awkward.Table(foo=df["foo"].values, bar=df["bar"].values) do what you want? |
I just got a chance to try this out on a computer and it works. More generally, awkward.Table({name: df[name].values for name in df.columns}) for all your DataFrame's columns.
Back to the question of closing issues: it is very rare for the original poster to close the issue. The issues usually lie open for weeks after I think I've answered the question but the users don't follow up. I end up closing old issues in sweeps. I just did a sweep recently (more on uproot than awkward, I think) and resolved to start closing early. Each close had a message to try to avoid sounding dismissive. I don't have any centralized tracking, but I should figure out how to do that. I'm getting bug reports, usage questions, and feature requests from GitHub Issues (where the bug reports and usage questions belong), StackOverflow (where I'm trying to redirect the usage questions), Slack, Skype, and email (where I'd rather not get any, since they're not public and they get mixed in with a lot of other conversations). I'm reading the GitBook you sent. Thanks! |
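As a minimal sketch of the column-by-column conversion suggested above, assuming only pandas and NumPy are installed: the helper names `columns_from_dataframe` and `dataframe_from_columns` are hypothetical and not part of awkward. The dict built here is the same mapping the thread passes to awkward.Table in Awkward 0.x; the round trip back to pandas is shown without awkward so the example stays self-contained.

```python
import numpy as np
import pandas as pd


def columns_from_dataframe(df):
    """Hypothetical helper: map each column name to its NumPy array."""
    return {name: df[name].values for name in df.columns}


def dataframe_from_columns(columns, index=None):
    """Hypothetical inverse: rebuild a DataFrame from equal-length arrays."""
    return pd.DataFrame(columns, index=index)


df = pd.DataFrame({"foo": [2, 8], "bar": [0.3, -0.9]})
columns = columns_from_dataframe(df)

# In Awkward 0.x, this dict is what the thread hands over: awkward.Table(columns)
roundtrip = dataframe_from_columns(columns, index=df.index)
```

Passing the original index back in keeps row labels intact, which matters once the DataFrame has something other than the default integer index.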
Thank you! It seems to mostly work. Shouldn't
import awkward
import pandas as pd
df = pd.DataFrame({"foo": [2, 8], "bar": [0.3, -0.9]})
print(df.foo) # works
aw_table = awkward.Table({name: df[name].values for name in df.columns})
# no MultiIndex in this DataFrame, so flatten=True should work
aw_pd = awkward.topandas(aw_table, flatten=True) # False works correctly though
print(aw_pd.foo)
Expected with I would still argue for a
df.x
# Name: x, dtype: awkward
there should be a faster way of loading it than a for loop through columns, right? I mean, the DataFrame will already be in an Awkward-compatible format, so something faster than If this still is not convincing, I give up and stop ranting about this ^-^ I just think
Off-topic: About Awkward (You can make the above line in Markdown with ----- + newline)
I seriously think you're doing a great job! This library is very powerful and useful (as far as I've tested it), so I can totally understand that you're swamped by questions and such, and it's hard to deal with all that (mostly alone?). I'm actually surprised that this library is not more popular among the Machine Learning community. I, for example, have long been looking for a way to deal with audio of varying length. I need to apply functions to each row, slice among them and store the result in a format like HDF5. With NumPy I was limited to for-loops. If e.g. the PyTorch community, and specifically the Audio branch of it (
I also see potential in better integration with Pandas. Right now Pandas is limited to a single value per column-row cell. With this library I think it's possible for a single cell to hold a whole array (image, audio, etc.), meaning that you can have one searchable reference for data and labels. |
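To make the varying-length audio case above concrete, here is a small sketch in plain NumPy of the flat-content-plus-offsets layout that jagged arrays are generally built on. It illustrates the idea only and does not use awkward's API; the clip values are toy numbers, and the variable names are chosen for this example.

```python
import numpy as np

# Three "recordings" of different lengths (toy numbers, not real audio).
clips = [np.array([0.1, 0.2, 0.3]), np.array([0.5]), np.array([1.0, -1.0])]

# Store every sample in one flat buffer, plus offsets marking clip boundaries.
content = np.concatenate(clips)
counts = np.array([len(c) for c in clips])
offsets = np.concatenate([[0], np.cumsum(counts)])

# Per-clip sums and means without a Python loop over the samples.
# Note: np.add.reduceat gives misleading results for empty clips,
# so this sketch assumes every clip has at least one sample.
sums = np.add.reduceat(content, offsets[:-1])
means = sums / counts

# Clip i is just a slice (a view) of the flat buffer.
second = content[offsets[1]:offsets[2]]
```

Because the data is one contiguous array plus a small integer array, it can also be written to a format like HDF5 as two ordinary datasets.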
I'll make sure that Awkward 1.0 has a Also, the explicit structure classes like
You're right that the only Awkward 0.x documentation is on the README. When it became clear that Awkward 0.x was headed for a brick wall of maintainability and I had to fix the technical debt with a redesign, I was left with the problem that Awkward 0.x had no documentation. There's a trade-off between adding more to Awkward 0.x and getting the Awkward 1.0 sprint done, so I compromised by writing that really long README. Awkward 1.0 will have good reference documentation, but I still need to learn how to write "how to" documentation. The biggest problem for me is to figure out what problems people need "how tos" for. That's why I'm trying to encourage the use of StackOverflow: it will give me more feedback on where the desire paths are.
I'm currently working alone on the maintenance of uproot and awkward while trying to get Awkward 1.0 into a usable state. Having summer students generally means developing more features, rather than easing the load, because the students' projects have to be somewhat separate to be well-defined. I really like that book you pointed me to, since it addresses exactly my problem: how to scale up a project beyond the individual developer level. (Before uproot, I never had enough of a userbase that maintenance was hard to keep up with.) Incidentally, the author of that book said that ZeroMQ had a no-feature-request policy: they grew a community of developers by only accepting contributed features, not requests. :) But I think a project needs to reach a certain maturity level before that's possible, so that user-developers can see how a contribution fits in. |
Thanks for the explanation :) The README is pretty good for something that was just added as a temporary solution to bridge the transition period to 1.0 ^-^ For your documentation structure, maybe it would be clearer if you separated it into 4(+) topics: concepts, beginner examples, interoperability with other libraries out there, and how Awkward works under the hood, in that order.
Maybe this better first in your Awkward 1.0 docs? Feel free to copy it. That's true about students for a project like this indeed. Glad you like the book ^-^ It's true about asking |
Uproot uses readthedocs, but it also has docstrings, which can be automatically rendered there. In my experience, users overwhelmingly have read only the README, because a lot of their questions would have been answered if they went to the readthedocs. It might have something to do with it being two sites: scrolling is easier/more discoverable than clicking? That's a good breakdown for the documentation, though the problem I need to solve is, "Which 'how to' articles to write?" My guesses about what people need to know have been a little off, which is why I want to crowdsource it. And, of course, all of this takes time! You're free to write a PR for You also found a bug in |
Sorry it took so long. I created a function based on your insights. Please review it :) |
This issue would probably automatically close when I merge the pull request, but since you've done all the work and it will be merged (because I approve), I'll close this now, just in case. |
In your documentation you often mention awkward.topandas(), but how about the other way, awkward.frompandas()?
I looked in the Python file where .topandas() is defined: https://github.com/scikit-hep/awkward-array/blob/d942fb8d4fae5e1dec35c70938e24c05207b3f31/awkward/util.py#L213, but found nothing about loading DataFrames there.
I also tried with some code, but this failed:
Applying .fromiter() only gets the column names.
TL;DR How to convert a Pandas DataFrame to an awkward-array and vice-versa?
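The observation that .fromiter() only picks up the column names is consistent with how pandas itself behaves: iterating over a DataFrame yields its column labels, not its rows or values. The short sketch below uses only pandas (no awkward calls) to show the three ways of walking the same small DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"foo": [2, 8], "bar": [0.3, -0.9]})

# Iterating a DataFrame walks its column labels, not its rows or values.
labels = list(df)

# Row-wise iteration has to be requested explicitly.
rows = list(df.itertuples(index=False, name=None))

# Column-wise access gives the underlying NumPy arrays.
arrays = {name: df[name].values for name in df.columns}
```

So anything that consumes the DataFrame as a generic iterable only ever sees "foo" and "bar"; the values have to be pulled out per column or per row first.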