Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ActiveRecord Truncation does not work when using multiple schemas #6

Open
abhchand opened this issue Dec 29, 2017 · 17 comments
Open

ActiveRecord Truncation does not work when using multiple schemas #6

abhchand opened this issue Dec 29, 2017 · 17 comments

Comments

@abhchand
Copy link

Hello - thanks for putting together database_cleaner! I've been using the gem for a while and I know maintenance takes effort, so appreciate those who contribute to this project.

I'm using the following setup -

  • Rails 5.1.0
  • ActiveRecord 3.0.0
  • Postgres 9.6

Like many multi-tenant applications, my app utilizes postgres schemas which store parrallel, independent copies of my database tables.

So if I have two tenants (earth, mars) and two tables (users, companies) I would have the following tables in my DB -

schema_migrations
ar_internal_metadata
public.users
public.companies
earth.users
earth.companies
mars.users
mars.companies

When using DatabaseCleaner.strategy = :truncation, it looks for a list of tables in my DB to truncate (see here).

When I run that query against my local test database -

SELECT schemaname || '.' || tablename
FROM pg_tables
WHERE
  tablename !~ '_prt_' AND
  tablename <> 'schema_migrations' AND tablename <> 'ar_internal_metadata'
  AND schemaname = ANY (current_schemas(false))
;
                 ?column?
-------------------------------------------
 public.users
 public.companies
(2 rows)

As you can see, it only returns the public schemas, because of the schemaname = criteria. This effectively doesn't wipe my non-public schema tables, causing run-over between tests.

I understand that there are options to use a transaction instead, or even the only: [..] parameter to list all my tables. But it seems like this should work as is without having to use one of those as a workaround (and if my DB is quite large, a whitelist can get messy and inefficient across thousands of tests).

Would it make sense to update the query logic to something like

SELECT schemaname || '.' || tablename
FROM pg_tables
WHERE
  tablename !~ '_prt_' AND
  tablename <> 'schema_migrations' AND tablename <> 'ar_internal_metadata'
  AND schemaname NOT IN ('pg_catalog', 'information_schema')
;

Thanks!

Related to: https://github.com/DatabaseCleaner/database_cleaner/issues/225

@fcheung
Copy link

fcheung commented Jan 8, 2018

I've come across a similar thing. For now doing

  def with_all_schemas
    old = ActiveRecord::Base.connection.schema_search_path
    begin
      ActiveRecord::Base.connection.schema_search_path = [
        your_list_of_schemas_here
      ].join(',')
      yield
    ensure
      ActiveRecord::Base.connection.schema_search_path = old
    end
  end

and then wrapping calls to Databasecleaner.clean with with_all_schemas seems to do the trick

@deathwish
Copy link

Honestly, this is pretty awful.

But the solution above works and can be improved by using

select schema_name from information_schema.schemata where schema_name not like 'information_schema' and schema_name not like 'pg_%'

to fetch all schemas in the active database for cleaning, which is the behavior I expected database_cleaner to exhibit.

@alejandro-falkowski-gelato

For me as I am using sequel was to use search_path as described here https://github.com/jeremyevans/sequel/blob/master/doc/opening_databases.rdoc#postgres

@botandrose botandrose transferred this issue from DatabaseCleaner/database_cleaner Feb 18, 2020
@emad-elsaid
Copy link

My workaround was to just truncate the tables I need myself after each test example:

  config.after(:each) do
    DatabaseCleaner.clean
    ['schema.tablename'].each do |table|
      ActiveRecord::Base.connection.execute("TRUNCATE #{table}")
    end
  end

@botandrose
Copy link
Contributor

Hi, so I'd love to help get this problem solved, but honestly I'm not familiar with with the concept of multiple schemas in PostgreSQL, so I'm going to need some help understanding. Can someone explain why an app might have more than one schema?

Also, about the current clause that is causing the issue: schemaname = ANY (current_schemas(false)):
Looking in the git history, the commit that added this line has the following to say:

commit 77bca129d6031571b4da1edf63b9f5ae93fc6ee2
Author: Billy Watson <[email protected]>
Date:   Tue Jul 22 12:31:03 2014 -0400

    always return Postgres table names with schema to avoid awfulness
    
    - without the schema, the migrations table gets truncated since the default AR method returns ‘public.schema_migrations’
    - also a possibility is truncation of tables other than desired once since without the schema, Postgres has no way of knowing which table to truncate if two schemas have the same table

Can someone unpack this for me?

Finally, in the OP @abhchand has suggested the possible solution of replacing the line with schemaname NOT IN ('pg_catalog', 'information_schema'). Can I get some opinions on this suggestion? Would it address the problem exhibited here, while still addressing the concerns that Billy Watson was trying to solve in his commit?

@Adamantish
Copy link

Adamantish commented Apr 16, 2020

Hiya @botandrose . The reason we're using multiple schemata is to do close integration with a proprietary 3rd party application. There are some good reasons not to do it this way (it's a microservices sin and I feel a little bit ill about it) but amongst the advantages we find are...

  • We get to have multiple services doing a lot of otherwise complex work with remote data but with the code simplicity as if it were local.
  • We get a lot of the database efficiency being able to do SQL joins between the public data we fully own and the data we don't.
  • A lot of rails stuff by default is just unaware of the other schemata. Which is good and the right thing in most cases.

I think a general solution which would ensure truncate strategy works might be to count records in all tables everywhere at the start of the whole test suite and truncate all that had zero records.

@erlinis
Copy link

erlinis commented Apr 16, 2020

There are many reasons why some apps use several schemas, but for sure it is a thing.

What Billy Watson was trying to solve is totally fine, but there is one piece in his solution
that is not working as as he is expecting (current_schemas(false), even when the database has multiple schemas current_schemas(false) is only returning {public} because it depends what is set in the search_path.

select current_schemas(false);

 current_schemas
-----------------
 {public}

While if you execute the query suggested by @deathwish you will indeed get all your schemas (excluding the Postgres' ones of course)

 SELECT schema_name FROM information_schema.schemata WHERE schema_name not like 'information_schema' AND schema_name not like 'pg_%';
 
 schema_name
-------------
 public
 archive

So, combining the query above with the current code, it will solve the issue while still addressing the concern that Billy Watson was trying to solve since it now lists the tables in all schemas and still appending the schema name.

SELECT schemaname || '.' || tablename as table_name
  FROM pg_tables
 WHERE tablename !~ '_prt_'  
   AND tablename <> 'schema_migrations' AND tablename <> 'ar_internal_metadata' 
   AND schemaname = ANY (
     	SELECT schema_name 
         FROM information_schema.schemata 
        WHERE schema_name not like 'information_schema' 
          AND schema_name not like 'pg_%'
       );


                table_name
-----------------------------------------
 public.users
 public.questions
 public.responses
 public.versions
 archive.logs
 archive.versions

@erlinis
Copy link

erlinis commented Apr 16, 2020

Or even simpler as @alejandro-falkowski-gelato suggested, setting up the search_path in the database configuration will in fact lists all the desired schemas when running current_schemas(false)

test:
  adapter: postgresql
  host: localhost
  database: db_test
  ...
  schema_search_path: 'schema1,public'

So it will worth at least mention in the docs that you should set the search_path if you expect DatabaseCleaner also truncate the tables in all schemas rather than only public

@botandrose
Copy link
Contributor

Thank you, @Adamantish, for the description of your use-case... I think I have a good handle on this now! And thank you, @erlinis, for your superb detective work and proposed solution, which appears to me to be solid.

Is there any reason why one might not want to clean all tables outside of the public schema? Let's say we were to fold this change into a hypothetical v1.8.5 bugfix release of database_cleaner. Would this, in fact, be a breaking change for some users? In that case, would it be better to release it in the form of a new option to the truncation strategy, like all_schemas: true etc?

@erlinis
Copy link

erlinis commented Apr 16, 2020

Hi @botandrose,
A new option to the truncation strategy sounds great, but instead of all_schemas: true/false,
I think would be better to have something like included_schemas: [ 'public', 'schema1'] and if nothing is specified ['public'] is the default value.

This way current behavior won't be affected and the developer will have more control over what clean.

@botandrose
Copy link
Contributor

@erlinis I like it! What do others in this thread think? Are you game for taking a stab at it in a PR?

@deathwish
Copy link

Like the original submitter's per-tenant schema use case, I'd definitely prefer to be able to clean all schemata rather than saying which ones should be cleaned. That said, I agree that being able to specify which schemata are cleaned is also important.

We are in dynamic-language land here, so why not both? Something like schemata: :all is not incompatible with schemata: ['public', 'schema1'], and allows the later addition of additional functionality like schemata: {except: 'schema2'}.

The only other note I have immediately is that this should interact correctly with the cache: true|false option as documented, which also implies that it should be possible to have a non-fixed list of schemata to be cleaned.

@botandrose
Copy link
Contributor

@deathwish All great points! Like you, I think :except can be punted on into its own subsequent PR, and agree that this functionality should provide an :all option, for those who don't want to enumerate every schema. Also agree on respecting the :cache option. Thank you for pointing these things out!

@jherdman
Copy link

jherdman commented Apr 8, 2022

Is this issue still relevant with the latest version of this gem?

@timscott
Copy link

@jherdman - I am on the latest version (2.0.1) and can confirm it's still an issue. @botandrose, anything in the works?

@timscott
Copy link

This line seems to be the problem.

@only = all_tables.map { |table| table.split(".").last }

For me all_tables contains the following two entries: public.items and events.items. So database cleaner will execute DELETE FROM items twice.

Is there a reason to remove the schema? Seems like a simple bug fix, change this:

all_tables = cache_tables? ? connection.database_cleaner_table_cache : connection.database_tables
@only = all_tables.map { |table| table.split(".").last }

To this:

@only = cache_tables? ? connection.database_cleaner_table_cache : connection.database_tables

@kanmeiban
Copy link

kanmeiban commented Apr 5, 2024

@timscott A few lines below is

@except += connection.database_cleaner_view_cache + migration_storage_names

which should also be turned into FQN for your suggestion to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests