-
-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make SmarterCSV thread-safe #277
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #277 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 11 11
Lines 380 379 -1
=========================================
- Hits 380 379 -1 ☔ View full report in Codecov by Sentry. |
} | ||
} | ||
|
||
it 'at least returns the right number of results from each thread' do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you run this same spec on master, you'll see something like this (it's possible it could produce non-deterministic results)
Failures:
1) thread safety checks at least returns the right number of results from each thread
Failure/Error: expect(d[1].size).to eq(correct_sizes[d[0]])
expected: 5
got: 16
(compared using ==)
# ./spec/features/threading_spec.rb:22:in `block (3 levels) in <top (required)>'
# ./spec/features/threading_spec.rb:21:in `each'
# ./spec/features/threading_spec.rb:21:in `block (2 levels) in <top (required)>'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpcamara thank you for finding this issue, and your PR!
* SmarterCSV used instance variables on a module, so they were shared across all threads * When different threads ran SmarterCSV, they could overwrite the instance values that were set, and effectively corrupt eachothers day. The simplest way to show this was to run a full `process` call from multiple threads - each threads data gets aggregated together into one large result * By using an instance, and passing it around, we simulate a insance approach while maintaining the current API * We also expose a thread-local set of instance variables which should provide backwards compatibility if someone were to directly access the global instance variables on the SmarterCSV module (which a spec did, and that's how I noticed this behavior)
7003a32
to
90ec4c5
Compare
Hi @jpcamara, can you help me understand the use case where you run into issues with threading? In the systems where I've used SmarterCSV over the years, I typically have several Sidekiq workers:
In systems where users upload CSV files, I upload/capture the file in S3, and then kick-off the Sidekiq worker to download and run SmarterCSV.process on the file, kicking-off the other workers in the process. |
Hey @tilo! There are a few issues i'm encountering. To start - I've committed more threading scenarios (in In testing my upgrade to the newest version (1.10.3) I saw these issues I could easily reproduce and I needed to hold off using it as a result. Most of my usages of I use
I am processing from a variety of sources in all different CSV formats with different headers/delimiters/file sizes. I don't have one dedicated worker doing the As a result, it's very easy for me to encounter the scenarios I have laid out in the
In CRuby, I'm confident my usage of |
Hi @tilo! Any thoughts on this? |
I realized it's actually pretty easy to get With the following CSVs, and the following code, you'll be able to easily recreate broken headers. Sometimes headers are swapped between threads, and sometimes they are nil and raise a NilClass error on
require "concurrent-ruby"
def run_forever(pool_size: 10)
pool = Concurrent::FixedThreadPool.new(pool_size)
i = Concurrent::AtomicFixnum.new
loop do
pool.post do
local_i = i.increment
yield local_i
end
end
end
run_forever(pool_size: 10) do |i|
if i.even?
SmarterCSV.process('csv_one.csv', chunk_size: 5) do |chunk|
raise "Wrong header! #{chunk.first.keys.join(',')}" if chunk.first.keys != %i[a b c d e f g]
end
else
SmarterCSV.process('csv_two.csv', chunk_size: 5) do |chunk|
raise "Wrong header! #{chunk.first.keys.join(',')}" if chunk.first.keys != %i[h i j k l m n]
end
end
rescue StandardError => e
puts "Iteration[#{i}]: #{e.message}"
end |
Sorry to be so noisy! This'll be my last comment on the issue for awhile. I'm writing a series on concurrency in ruby, so I share some of my thoughts on threading issues and class-level ivars in there: https://jpcamara.com/2024/06/04/your-ruby-programs.html. Just more context on this issue. |
Confirming that we're seeing this with 1.10.3, too. Had a worker processing two CSVs (one more than a million rows) and had header/column corruption midway through processing. Please consider merging @tilo |
I'll have a look at the examples you provided |
I'm looking into it, but wanted to point out that when calling it concurrently,
|
My approach for handling thread safety issues in gems is to isolate that code to a "single_threaded" queue: class UnsafeJob
include Sidekiq::Job
sidekiq_options queue: "single_threaded"
def perform; end
end Then I configure it to run with a concurrency of 1, and I instead run multiple processes or servers.
This approach will work the same as a global mutex, but also allows horizontal scaling. Using a global mutex technically works, but it means whichever thread acquires that mutex then completely hogs the GVL for any CSV processing, including bypassing the 100ms thread scheduler swap. But if this was running on a web server like Puma, a mutex is the only reasonable option, like suggested. |
@contentfree @jpcamara please check out this PR: #279 |
Looks good to me. The "Breaking Change" note in the changelog isn't quite
accurate: It's not a breaking change (as SmarterCSV.process still works).
It's merely deprecated with a warning.
👍
…On Wed, Jul 3, 2024 at 8:33 AM Tilo ***@***.***> wrote:
@contentfree <https://github.com/contentfree> @jpcamara
<https://github.com/jpcamara> please check out this PR: #279
<#279>
—
Reply to this email directly, view it on GitHub
<#277 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAACVTQMEXDXDO2NHOPLETLZKPVRZAVCNFSM6AAAAABFYK6RT2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBVHE3TANBRHE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
FYI: The fix will be released soon from this PR |
SmarterCSV used instance variables on a module, so they were shared across all threads
When different threads ran SmarterCSV, they could overwrite the instance values that were set, and effectively corrupt eachothers data. The simplest way to show this was to run a full
process
call from multiple threads - each threads data gets aggregated together into one large resultBy using an instance, and passing it around, we simulate an instance approach while maintaining the current API
We also expose a thread-local set of instance variables which should provide backwards compatibility if someone were to directly access the global instance variables on the SmarterCSV module (which a spec did, and that's how I noticed this behavior)