Ruby queries hang when returning too much data #334

Open · harelsafra opened this issue Jul 22, 2024 · 18 comments

harelsafra commented Jul 22, 2024

Ruby queries hang when returning "too much" data.
I put this in quotes since we don't understand what counts as too much, but when limiting the result set the queries return.
When the Ruby process hangs, the session is idle in the database.

The same queries work on Postgres.

For example, with the products table:

CREATE TABLE public.products (
	id INT8 NOT NULL DEFAULT 27000000000:::INT8 + unordered_unique_rowid(),
	order_id INT8 NULL,
	merchant_id INT8 NULL,
	quantity INT8 NULL,
	price FLOAT8 NULL,
	title VARCHAR(255) NULL,
	product_id VARCHAR(255) NULL,
	created_at TIMESTAMP NOT NULL,
	updated_at TIMESTAMP NOT NULL,
	orig_price FLOAT8 NULL,
	category VARCHAR(255) NULL,
	sub_category VARCHAR(255) NULL,
	section VARCHAR(255) NULL,
	requires_shipping BOOL NULL,
	date TIMESTAMP NULL,
	brand VARCHAR(255) NULL,
	seller_id VARCHAR(255) NULL,
	product_type VARCHAR(255) NULL,
	norm_title VARCHAR(255) NULL,
	delivered_to VARCHAR(255) NULL,
	delivered_at TIMESTAMP NULL,
	parsed_title VARCHAR(255) NULL,
	is_gift_wrapped BOOL NULL,
	message VARCHAR(3724) NULL,
	fulfillment_id INT8 NULL,
	sku VARCHAR NULL,
	registry_type VARCHAR NULL,
	CONSTRAINT products_pk PRIMARY KEY (id ASC),
	INDEX products_order_id_idx_new (order_id ASC),
	INDEX products_merchant_id_idx (merchant_id ASC),
	INDEX products_updated_at_idx (updated_at ASC)
)
Product.find_by_sql("select * from products where order_id=38313780972592 limit 114").size
# => 114

Product.find_by_sql("select * from products where order_id=38313780972592 limit 115").size
# <<<<<THIS HANGS>>>>>>
@harelsafra (Author)

ruby '3.1.4'
activerecord (6.1.7.3)
pg (1.5.3)

@dikshant (Contributor)

@harelsafra are you using this adapter or upstream activerecord? Does the issue happen with this adapter specifically?

@harelsafra (Author)

Hi @dikshant, we're currently using upstream Active Record, but we tested activerecord-cockroachdb-adapter and we're seeing the same behavior.

rimadeodhar commented Jul 22, 2024

It appears that this issue happens when sql.defaults.results_buffer.size is set to 500KiB. This setting was added in 23.2.7. Setting it to a lower value like 16KiB does not cause the hang.

rafiss commented Jul 22, 2024

To clarify, the sql.defaults.results_buffer.size setting has been in CockroachDB since v19.1. (see cockroachdb/cockroach#32366)

The recent change that happened is in v24.1.1, where the default value of this setting was changed from 16KiB to 512KiB. (see https://www.cockroachlabs.com/docs/releases/v24.1#v24-1-1-sql-language-changes)

We are still investigating why this causes the query result to hang.
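
If it helps to narrow things down, here is a rough sketch for checking the current value and temporarily reverting to the old default from a Rails console (assuming a user that is allowed to change cluster settings; the values are only illustrative):

# Show the current buffer size (512KiB by default as of v24.1.1).
puts ActiveRecord::Base.connection.execute(
  "SHOW CLUSTER SETTING sql.defaults.results_buffer.size"
).values

# Temporarily revert to the previous 16KiB default to see whether the hang goes away.
ActiveRecord::Base.connection.execute(
  "SET CLUSTER SETTING sql.defaults.results_buffer.size = '16KiB'"
)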

rafiss commented Jul 22, 2024

To assist with the investigation, it would help tremendously if you could share instructions on how to reproduce the issue. In our own testing, we have not yet seen the behavior you describe. We used a scenario along the lines of:

ActiveRecord::Base.connection.execute("
  CREATE TABLE public.products (
	id INT8 NOT NULL DEFAULT 27000000000:::INT8 + unordered_unique_rowid(),
	order_id INT8 NULL
	-- other columns omitted
)")

class Product < ActiveRecord::Base
end

128.times { Product.create!(order_id: 38313780972592) }

puts "Running 114"
p Product.find_by_sql("select * from products where order_id=38313780972592 limit 114").size

puts "Running 115"
p Product.find_by_sql("select * from products where order_id=38313780972592 limit 115").size

Ideally, the instructions you share should be a self-contained example, so we can follow them starting from an empty cluster. Also, just to confirm, are you using CockroachDB v23.2.3?

Thank you!

rafiss commented Jul 23, 2024

Here's one more thing to try, which should be easier than coming up with a self-contained example that reproduces the problem.

I read more about concurrency and deadlocks in Rails in these docs.

It has a debugging tip for investigating deadlocks/hangs:

If your application is deadlocking and you think the Load Interlock may be
involved, you can temporarily add the ActionDispatch::DebugLocks middleware to
config/application.rb:

config.middleware.insert_before Rack::Sendfile,
                                  ActionDispatch::DebugLocks

If you then restart the application and re-trigger the deadlock condition,
/rails/locks will show a summary of all threads currently known to the
interlock, which lock level they are holding or awaiting, and their current
backtrace.

Generally a deadlock will be caused by the interlock conflicting with some other
external lock or blocking I/O call.

Can you try using this setting then re-run your example that causes the hanging query and share the results from /rails/locks?
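
For example, while the query is stuck, something along these lines from a second console should dump the lock summary (assuming the app is served locally on port 3000):

require "net/http"

# Fetch the DebugLocks summary while the hanging request is in flight.
puts Net::HTTP.get(URI("http://localhost:3000/rails/locks"))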

harelsafra commented Jul 23, 2024

Hi @rafiss
I worked on it and got a reproducer working with obfuscated data, on a different table. This way I hope you can debug it more efficiently.

CREATE TABLE public.user_shops (
    id INT8 NOT NULL DEFAULT nextval('public.user_shops_id_seq'::REGCLASS),
    user_id INT8 NULL,
    merchant_id INT8 NULL,
    notifications JSONB NULL,
    merchant_role_id INT8 NULL,
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP NOT NULL,
    status VARCHAR NOT NULL DEFAULT 'inactive':::STRING::VARCHAR,
    activation_token VARCHAR NULL,
    activated_at TIMESTAMP NULL,
    CONSTRAINT user_shops_pkey PRIMARY KEY (id ASC),
    INDEX index_user_shops_on_activation_token (activation_token ASC),
    INDEX index_user_shops_on_merchant_id (merchant_id ASC),
    INDEX index_user_shops_on_merchant_role_id (merchant_role_id ASC)
)

user_shops__202407230902.sql.zip

The attached zip has insert statements for the needed data.

If you run
SELECT user_shops.* FROM user_shops WHERE user_shops.merchant_id = 24813 order by id limit 193
it hangs. Limiting to 192 or 194 works:

UserShop.find_by_sql("SELECT user_shops.* FROM user_shops WHERE user_shops.merchant_id = 24813 order by id limit 192").size
# => 192

UserShop.find_by_sql("SELECT user_shops.* FROM user_shops WHERE user_shops.merchant_id = 24813 order by id limit 194").size

# => 194
UserShop.find_by_sql("SELECT user_shops.* FROM user_shops WHERE user_shops.merchant_id = 24813 order by id limit 193").size

# <<<<<<<<<<<< HANGS >>>>>>>>>>

BuonOmo commented Jul 23, 2024

Prefixing the SQL with CREATE SEQUENCE public.user_shops_id_seq;, I don't see any issue by default on either v23.2.5 or v24.1, with either a 16 KiB or 512 KiB buffer size. A self-contained example would be very helpful indeed, as there are many settings in both CRDB and Ruby that might be the root cause of your issue. Here's a good template to start with.
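
For reference, a rough sketch of what such a self-contained script could look like, modeled on the Active Record bug report template (the adapter version, connection details, and generated data below are placeholders, not a verified reproduction):

require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"
  gem "activerecord", "~> 6.1"
  gem "activerecord-cockroachdb-adapter", "~> 6.1"
end

require "active_record"

# Placeholder connection details; point these at the cluster that shows the hang.
ActiveRecord::Base.establish_connection(
  adapter:  "cockroachdb",
  host:     "localhost",
  port:     26257,
  username: "root",
  database: "defaultdb"
)

ActiveRecord::Schema.define do
  create_table :user_shops, force: true do |t|
    t.bigint :merchant_id
    t.timestamps
  end
end

class UserShop < ActiveRecord::Base; end

# Seed enough rows to exceed the limit that hangs.
200.times { UserShop.create!(merchant_id: 24813) }

p UserShop.find_by_sql(
  "SELECT * FROM user_shops WHERE merchant_id = 24813 ORDER BY id LIMIT 193"
).size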

rafiss commented Jul 23, 2024

Thanks for sharing more details @harelsafra. As part of your reproducer example, can you please also share the commands you use to connect to CockroachDB and the definition of the UserShop model?

The goal is to have an example that matches as closely as possible the exact steps you take when you encounter the problem, so that when we debug this, we can be sure we are experiencing the same behavior you are. Otherwise, we will have to guess at some of the details and may not do the same thing you do. (I also tried with your latest example, but did not encounter the problem.)

The link shared by @BuonOmo is an excellent starting point.

@harelsafra (Author)

This is the section from database.yml:

webapp:
  adapter: postgresql
  port: 5432
  encoding: unicode
  host: <%= ENV['DB_WEBAPP_HOST'] %>
  database: <%= ENV['DB_WEBAPP_DATABASE'] %>
  username: <%= ENV["DB_WEBAPP_USERNAME"] %>
  password: <%= ENV["DB_WEBAPP_PASSWORD"] %>
  pool: <%= (ENV['DB_CONNECTION_POOL_SIZE'] || 3).to_i %>
  statement_limit: 100
  connect_timeout: 2
  checkout_timeout: 2
  prepared_statements: true
  variables:
    statement_timeout: <%= ENV['DATABASE_STATEMENT_TIMEOUT'] || case ENV['APP_TYPE']; when 'webapp'; '30s'; when 'webhooks'; '5s'; when 'webhooks_sync'; '1s'; when 'webhooks-queue-worker'; '5s'; when 'adminserver'; '120s'; else '30s' end %>

The table's model calls other stuff and isn't clean enough to share.

I must confess that I'm a Rails noob. I'll try to get the simple example that @BuonOmo sent working.

We're also trying in parallel to get deadlock debugging to work.

BuonOmo commented Jul 24, 2024

Could you show the output of:

puts({
	DB_CONNECTION_POOL_SIZE: ENV['DB_CONNECTION_POOL_SIZE'],
	DATABASE_STATEMENT_TIMEOUT: ENV['DATABASE_STATEMENT_TIMEOUT'],
	APP_TYPE: ENV['APP_TYPE']
})

I'm not sure the deadlock debugging will yield much, unfortunately, as it is a single query running. You could also make sure of that by setting DB_CONNECTION_POOL_SIZE=1 in your env.

harelsafra commented Jul 24, 2024

Attached

puts({
 DB_CONNECTION_POOL_SIZE: ENV['DB_CONNECTION_POOL_SIZE'],
 DATABASE_STATEMENT_TIMEOUT: ENV['DATABASE_STATEMENT_TIMEOUT'],
 APP_TYPE: ENV['APP_TYPE']
})
{:DB_CONNECTION_POOL_SIZE=>nil, :DATABASE_STATEMENT_TIMEOUT=>nil, :APP_TYPE=>"adminserver"}

I also made some progress in creating a reproducer:
I used code similar to the one in https://github.com/rails/rails/blob/main/guides/bug_report_templates/active_record.rb (attached zip).
When it connects to our 3-node staging cluster (v23.2.3) it hangs.
When it connects to a local Docker container on my Mac (v23.2.3) it works.
When it connects to a serverless cluster in CockroachDB Cloud (v23.2.7) it works.

I'm trying to get a local v23.2.3 cluster running to test against.

active_record.rb.zip

BuonOmo commented Aug 8, 2024

The reproduction you sent uses an old version of the adapter, and doesn't specify which one in the inline Gemfile. Could you specify which version you are using? And try to use the latest possible, of course.

I tried to reproduce your bug with version 6.1.11 of this adapter, so far without result. I think I am also misunderstanding something:

When it connects to our 3 node staging cluster v.23.2.3 it hangs

Are you sharding from the Rails client? And if so, how are you doing it? Or are the three nodes accessed with only one URI?

@harelsafra (Author)

Hi @BuonOmo, we're using activerecord (6.1.7.3).
Unfortunately this can't be changed, because upgrading Rails is a big effort that isn't planned at this stage.

The 3 nodes are accessed through an AWS Network Load Balancer. The NLB connects the client to a single server in the cluster.

BuonOmo commented Aug 12, 2024

@harelsafra I'm asking about the adapter version (i.e. the gem activerecord-cockroachdb-adapter). That said, I'd strongly advise updating Rails to 6.1.11 (those are patch releases, so no interface changes, only security fixes...).

rafiss commented Aug 12, 2024

The 3 nodes are accessed through an AWS network load balancer. The NLB connects the client to a single server in the cluster

@harelsafra Can you try to see if you can reproduce the problem if you make your code connect directly to a CockroachDB node, and bypass the NLB? This will help us determine if the problem only occurs while using an NLB.
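
One quick way to test that from the reproduction script, assuming a node is reachable directly (the hostname below is a placeholder):

# Point the connection at a single CockroachDB node instead of the NLB address.
ActiveRecord::Base.establish_connection(
  adapter:  "postgresql",
  host:     "crdb-node-1.internal",   # direct node address, bypassing the NLB
  port:     26257,
  username: ENV["DB_WEBAPP_USERNAME"],
  password: ENV["DB_WEBAPP_PASSWORD"],
  database: ENV["DB_WEBAPP_DATABASE"]
)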

rafiss commented Aug 19, 2024

This issue in the ruby-pg repo might also be related. It describes a situation where the driver can hang. ged/ruby-pg#583

To test if that is at play here, could you try setting async_api to false as indicated in that issue?
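
For reference, a minimal sketch of what that could look like, assuming it runs in an initializer before the first database connection is opened:

require "pg"

# Force ruby-pg to use the synchronous (blocking) libpq calls instead of
# its async API, to test whether the async path is involved in the hang.
PG::Connection.async_api = false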
