fix: r concurrent handler #1295
base: main
Hi @sorhawell and welcome back! As you may have seen mentioned in some issues, we're currently rewriting Therefore it would be great if you could contribute there directly. From what I can see your changes don't involve |
Many thanks. I see you and eitsupi have surpassed me in contributions. You have put in a lot of hard work. I'd like to support any attempt towards a better polars experience; maybe the neo-polars rewrite will be the best thing since sliced bread :) When do you plan to switch to neo-polars? What is your policy on patching "r-polars" main in the meantime? As neo-polars stands now, it is still early.
It is a bit exotic to do. So maybe just say no :) |
No problem about that, it would be great to have someone experienced on those. I can focus on the things that are not very technically challenging, so it's good if you focus on the thing where you have the most added value.
We don't have a clear timeline in mind. @eitsupi built the infrastructure from May I think and I started contributing in October. I think merging Regarding patching
Does that work in py-polars? That's often the main argument for deciding whether or not we should support something. IMO that's not a priority and I'd rather leave it aside if it adds a lot of complexity. Having a clean implementation of |
I would like to rewrite this PR to fix the current bugs with the smallest changes possible. That would give value and trust today.
py-polars does not explicitly have a "collect_in_background" like r-polars, where a query is offloaded to another OS thread via Rust. I made that up. py-polars does not support running map_batches with a process pool backend. However, Python supports threads natively, and nested map_batches calls will not deadlock because the threads can advance independently. Also, PyO3, the Python counterpart of extendr/savvy, has GIL abstractions such as Python::with_gil and Python::acquire_gil that help a lot in making a clean, simple map_batches implementation. In R there is only one thread per session/process. In Python it is also fairly straightforward to offload a polars query to a Python thread without blocking other work or other queries. See examples below.
I will think about whether there is a way to emulate threads well enough to avoid deadlocks without resorting to an R process pool. Otherwise the docs should state that nested queries have limited support. I don't know of any way in R to run a polars query in the background without enabling that feature on the Rust side explicitly. Alternatively, the R user must resort to clusters for non-blocking queries, which is not ideal. Mirai is getting close.
import polars as pl
from concurrent.futures import ThreadPoolExecutor
import time

# Example 1: py-polars supports embedded (nested) queries
def f(s: pl.Series) -> pl.Series:
    # Run a second polars query inside the user function of the first one
    s2 = s.to_frame().lazy().select(
        pl.col("a").map_batches(lambda s: s * 3)
    ).collect().to_series(0)
    return s2 + 1

pl.DataFrame({"a": [1, 2, 3]}).select(
    pl.col("a").map_batches(f)
)

# Example 2: py-polars supports running a query "in background"
# Python supports threads, and even with the GIL this is quite useful.
# While one thread is performing rust-polars work or sleeping,
# other Python threads can advance in the meantime.

# Sample DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 32, 29, 40, 23],
    "city": ["New York", "London", "Berlin", "Paris", "Madrid"]
})

# Function to simulate batch processing with sleep
def process_batch(s: pl.Series) -> pl.Series:
    print(f"Processing batch:\n{s}")
    time.sleep(2)  # Simulate delay
    return s * 2

# Function to execute a Polars job that calls back into Python
def job_with_map_batches(df: pl.DataFrame) -> pl.DataFrame:
    print("Starting job...")
    result = df.select(pl.col("age").map_batches(process_batch))
    print("Job finished!")
    return result

# Function to execute a job that returns immediately
def job_without_map_batches(df: pl.DataFrame) -> pl.DataFrame:
    print("Starting job...")
    result = df
    print("Job finished!")
    return result

# Run multiple jobs concurrently
with ThreadPoolExecutor() as executor:
    future_1 = executor.submit(job_with_map_batches, df)
    future_2 = executor.submit(job_without_map_batches, df)
    # Get results
    result_2 = future_2.result()  # can read result 2 before job 1 has finished
    print(result_2)
    result_1 = future_1.result()  # and about 2 sec later result 1 is ready
    print(result_1)
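To make the deadlock argument above concrete, here is a minimal sketch using only the Python standard library (no polars). It assumes that a pool with a single worker is a fair stand-in for R's one-thread-per-session limit; `run_nested`, `outer`, and `inner` are hypothetical names for illustration and are not part of any polars API. A timeout is used so that the stalled case is detected and reported instead of hanging.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_nested(max_workers: int, timeout: float = 0.5) -> str:
    """Submit a task that itself submits a nested task to the same pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:

        def inner() -> int:
            # Stand-in for the nested query's user function.
            return 3

        def outer() -> str:
            # Stand-in for a user function that starts a nested query
            # and waits for its result.
            nested = pool.submit(inner)
            try:
                return f"ok: {nested.result(timeout=timeout) + 1}"
            except FutureTimeout:
                # No free worker could pick up the nested task in time.
                return "stalled"

        return pool.submit(outer).result()


if __name__ == "__main__":
    print(run_nested(max_workers=1))  # the only worker is busy running outer()
    print(run_nested(max_workers=2))  # a second worker can run inner()
```

With one worker, the outer task occupies the only thread, so the nested task cannot start until the outer one stops waiting. With two workers, the nested task advances independently, which is the property the threaded Python examples rely on.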
Thanks both! I don't have a clear timeline either, but the current main branch is not compatible with R 4.5 (now r-devel) and shouldn't be unless someone works on it, so it's probably worth replacing it with the next branch around February. In the meantime, I am happy to do minimal fixes or review someone else's PR. As for |
THIS PR is a DRAFT DO NOT MERGE.
Not fully tested, but promising.
The R concurrent handler was reported to crash for the rest of the R session if an R error is raised in an R function embedded in a polars query (reported in #1253).
Another edge case: starting a new r-polars query inside an R user function that is itself running within an r-polars query did not work well.
Solution:
EDIT: I looked up InitCell from the wrong crate :/ The actual docs say it is fine to mutate later:
https://docs.rs/state/0.6.0/state/struct.InitCell.html
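As an illustration of the first bug described in this PR (an error raised in an embedded user function leaving the handler unusable), the sketch below shows the general pattern of a single handler thread that catches the user function's error, passes it back to the caller, and keeps serving later requests. It is written in Python with the standard library only, as an analogy and not as the Rust code in this PR; `handler_loop`, `call`, and `bad` are hypothetical names.

```python
import queue
import threading
from concurrent.futures import Future


def handler_loop(requests: queue.Queue) -> None:
    """Serve (func, arg, future) requests until a None sentinel arrives."""
    while True:
        item = requests.get()
        if item is None:
            break
        func, arg, fut = item
        try:
            fut.set_result(func(arg))
        except Exception as exc:
            # Hand the error back to the caller instead of letting it
            # kill the handler thread.
            fut.set_exception(exc)


def call(requests: queue.Queue, func, arg):
    """Send one request to the handler and wait for its outcome."""
    fut: Future = Future()
    requests.put((func, arg, fut))
    return fut.result(timeout=5)  # re-raises the user function's error


def bad(x):
    raise ValueError("user function failed")


requests: queue.Queue = queue.Queue()
worker = threading.Thread(target=handler_loop, args=(requests,), daemon=True)
worker.start()

try:
    call(requests, bad, 1)
    first = "no error"
except ValueError as err:
    first = f"error: {err}"

# The handler is still alive and answers the next request.
second = call(requests, lambda x: x * 2, 21)

requests.put(None)  # ask the handler to stop
worker.join()
print(first, second)
```

The design choice here is that the failure travels through the same channel as a normal result, so the caller sees the error while the handler state stays valid for the next query.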