Checker that makes sure that enum type is used to store ActiveRecord's enums #169

Open
toydestroyer opened this issue Dec 6, 2022 · 14 comments
Labels
enhancement New feature or request

Comments

@toydestroyer
Contributor

Let's extract our discussion from #160.

One of the things that I want to enforce in the project I'm working on right now is to stop using ActiveRecord's default integer enums and instead store string values backed by an actual enum type in the database (we're using PostgreSQL).

Regarding the general pros and cons of enum columns over integers/strings: I did some quick research (a.k.a. googled "postgresql enum pros and cons") and found this question on StackOverflow: https://stackoverflow.com/q/2318123

If I may just quote some of the answers from there:

The advantages of enums are:

  • Performance is better. You can just display what you get out of the core table instead of either having a separate lookup table that translates a code to a value or having app logic that translates a code to a value. This can be especially useful in data warehouse applications.
  • Ad hoc SQL is easier to write

The disadvantages are:

  • Encoding display values into your database DDL is bad form. If you translate the enum value into a different display value in your app code, then you lose a lot of the advantages of using enums.
  • Adding values requires DDL changes
  • Makes language localization difficult
  • Database portability is decreased

and this one:

Advantages

  • Reduce Storage: Postgres uses only 1 byte per tuple when 255 or fewer ENUM elements are defined, or 2 bytes for 256~65535 elements. This is because, rather than storing the constant literal of the value, Postgres stores the index in the ordered set of that value. For very large tables, this might prove to be a significant storage space saving.
  • Arbitrary sorting:
CREATE TABLE opening_hours(
    week_day ENUM ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'),
    opening_time TIME,
    closing_time TIME
);

If you sort by week_day, it will sort in the order you specified, which is convenient in the above case.

  • Cheap constraints: Instead of doing checks in your application code or adding some complicated DB constraint, enums ensure that only certain values can be added, in a cheap way.

Disadvantages

  • The list of options CANNOT be controlled by end-users since ENUM is part of the schema
  • An additional query is required to see the list of options
  • String operations and functions do not work on ENUM. This is due to ENUM being a separate data type from the built-in data types like NUMERIC or TEXT. This can be overcome by casting ENUM values to TEXT when being operated on. However, it can be a pain when using an ORM.
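
As an aside, the "additional query to see the list of options" mentioned above boils down to reading the system catalogs. A minimal sketch, assuming a hypothetical enum type named user_status:

SELECT e.enumlabel
FROM pg_enum e
JOIN pg_type t ON t.oid = e.enumtypid
WHERE t.typname = 'user_status' -- hypothetical type name
ORDER BY e.enumsortorder;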

In my case, this is definitely a win! But I understand that this is not for everyone. Especially considering that Rails added an API to create enum columns only from version 7, and only for PostgreSQL: rails/rails@4eef348. And it doesn't provide an API for adding/deleting/changing values once an enum is created. So you have to write raw SQL in your migration files or use third-party gems (#1, #2).
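
For illustration, the kind of raw SQL that ends up in such a migration could look like this sketch (the user_status type name is hypothetical):

-- a hypothetical enum type, as it would be created by a migration
CREATE TYPE user_status AS ENUM ('active', 'inactive');

-- adding a value after the type exists (note: older PostgreSQL versions don't allow this inside a transaction block)
ALTER TYPE user_status ADD VALUE 'pending' AFTER 'active';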

So what does everyone think about having such a checker as part of the DatabaseConsistency gem? It could be disabled by default so people opt into enforcing it consciously.

@djezzzl
Owner

djezzzl commented Dec 6, 2022

Hi @toydestroyer,

Thank you for the follow-up discussion!

If I may just quote some of the answers from there:

I saw that topic too. Some of the answers indeed sound interesting but researching them proves they are not worthy, IMHO.

In my case, this is definitely a win!

Could you please check it in your database and share the results? Would there be a win in terms of memory reduction or performance gain?

But I understand that this is not for everyone. Especially considering that Rails added an API to create enum columns only from version 7, and only for PostgreSQL: rails/rails@4eef348. And it doesn't provide an API for adding/deleting/changing values once an enum is created. So you have to write raw SQL in your migration files or use third-party gems (#1, #2).

I think it's still fine to have the checker in DatabaseConsistency even if it helps only you. In the end, anybody can simply disable it if they don't want it or disagree. Giving the customers (engineers) this choice is a big plus.

To sum up:

  • I'm fine with having a checker that promotes enums.
  • I would be happy to have the rationale so we can add it to the wiki.
  • Maybe we should change the severity from "issue" to "suggestion/warning" for this checker.

Some of my findings:

create type example_enum as enum ('value1', 'value2', 'value3');
create table tmp_varchar (example varchar);
create table tmp_enum (example example_enum);

insert into tmp_varchar (select (ARRAY['value1', 'value2', 'value3'])[generate_series % 3 + 1] from generate_series(1, 1000000));
insert into tmp_enum(select example::example_enum from tmp_varchar);

create index tmp_varchar_index on tmp_varchar(example);
create index tmp_enum_index on tmp_enum(example);

analyze tmp_varchar;
analyse tmp_enum;
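
-- note: pg_column_size('example') in the queries below measures the quoted string literal
-- (8 bytes here), not the column; use pg_column_size(example) to measure the stored values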

select
    pg_size_pretty(sum(pg_column_size('example'))) as total_size,
    pg_size_pretty(avg(pg_column_size('example'))) as average_size,
    count(*) count
from tmp_varchar;
-- 7813 kB,8.0000000000000000 bytes,1000000

select
    pg_size_pretty(sum(pg_column_size('example'))) as total_size,
    pg_size_pretty(avg(pg_column_size('example'))) as average_size,
    count(*) count
from tmp_enum;
-- 7813 kB,8.0000000000000000 bytes,1000000

select version();
-- PostgreSQL 12.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit


explain analyze select * from tmp_varchar where example = 'value2';
-- Bitmap Heap Scan on tmp_varchar  (cost=6218.32..14788.33 rows=331600 width=7) (actual time=14.082..45.372 rows=333334 loops=1)
--   Recheck Cond: ((example)::text = 'value2'::text)
--   Heap Blocks: exact=4425
--   ->  Bitmap Index Scan on tmp_varchar_index  (cost=0.00..6135.43 rows=331600 width=0) (actual time=13.688..13.688 rows=333334 loops=1)
--         Index Cond: ((example)::text = 'value2'::text)
-- Planning Time: 0.065 ms
-- Execution Time: 54.115 ms

explain analyze select * from tmp_enum where example = 'value2';
-- Bitmap Heap Scan on tmp_enum  (cost=6294.10..14915.35 rows=335700 width=4) (actual time=11.765..41.599 rows=333334 loops=1)
--   Recheck Cond: (example = 'value2'::example_enum)
--   Heap Blocks: exact=4425
--   ->  Bitmap Index Scan on tmp_enum_index  (cost=0.00..6210.18 rows=335700 width=0) (actual time=11.381..11.382 rows=333334 loops=1)
--         Index Cond: (example = 'value2'::example_enum)
-- Planning Time: 0.069 ms
-- Execution Time: 50.327 ms

@toydestroyer
Contributor Author

Could you please check it in your database and share the results? Would there be a win in terms of memory reduction or performance gain?

I think the main benefits in our use case are not performance related, but:

  • Consistency. Nobody with direct access to the DB will be able to typo the value.
  • Readability (compared to integer-based enums). Data analysts will be able to see what the data is about, instead of asking developers questions like "what does user status 6 mean?"

I would be happy to have the rationale so we can add it to the wiki.

Got it, I'll do more research on this!

@djezzzl
Owner

djezzzl commented Dec 6, 2022

Readability (compared to integer-based enums). Data analysts will be able to see what the data is about, instead of asking developers questions like "what does user status 6 mean?"

I absolutely agree that integer-based enums with ActiveRecord are hard to read. It can simply be solved with a string-based enum, though.

For example, enum :status, {active: 'active', inactive: 'inactive', ...}.

Consistency. Nobody with direct access to the DB will be able to typo the value.

That's true. And I agree that could be critical for data-based logic.
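
A tiny sketch of that point (table and type names are hypothetical): with a native enum, a typo is rejected by the database itself, no matter who writes to the table.

CREATE TYPE user_status AS ENUM ('active', 'inactive');
CREATE TABLE users_demo (id serial PRIMARY KEY, status user_status);

INSERT INTO users_demo (status) VALUES ('active');  -- ok
INSERT INTO users_demo (status) VALUES ('activ');   -- fails: invalid input value for enum user_status: "activ"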

So, we have a "consistency" reason right now, which for me is already good. However, it would be great to have some more. If you find any, please feel free to share them.

In any case, I will be focused on something else in the upcoming weeks, so if you want, feel free to open a PR.

@toydestroyer
Contributor Author

While I'm still researching, I want to point out an interesting note in the ActiveRecord::Enum documentation regarding string columns:

Finally it's also possible to use a string column to persist the enumerated value. Note that this will likely lead to slower database queries:

@djezzzl
Owner

djezzzl commented Dec 6, 2022

I understand it makes sense that comparing integers should be faster than comparing strings; however, I see the following results:

create table tmp_1(a integer);
create table tmp_2(a varchar);

insert into tmp_1 (select generate_series % 3 from generate_series(1, 1000000));
insert into tmp_2 (select (ARRAY['active', 'inactive', 'invalid'])[a + 1] from tmp_1);

create index tmp_1_index on tmp_1(a);
create index tmp_2_index on tmp_2(a);

analyse tmp_1;
analyze tmp_2;

explain analyse select count(*) from tmp_1 where a = 1;
-- Finalize Aggregate  (cost=6386.73..6386.74 rows=1 width=8) (actual time=11.920..13.236 rows=1 loops=1)
--   ->  Gather  (cost=6386.51..6386.72 rows=2 width=8) (actual time=11.847..13.229 rows=3 loops=1)
--         Workers Planned: 2
--         Workers Launched: 2
--         ->  Partial Aggregate  (cost=5386.51..5386.52 rows=1 width=8) (actual time=10.235..10.236 rows=1 loops=3)
--               ->  Parallel Index Only Scan using tmp_1_index on tmp_1  (cost=0.42..5038.14 rows=139347 width=0) (actual time=0.021..6.298 rows=111111 loops=3)
--                     Index Cond: (a = 1)
--                     Heap Fetches: 0
-- Planning Time: 0.062 ms
-- Execution Time: 13.260 ms

explain analyse select count(*) from tmp_2 where a = 'active';
-- Finalize Aggregate  (cost=6379.72..6379.73 rows=1 width=8) (actual time=12.200..13.513 rows=1 loops=1)
--   ->  Gather  (cost=6379.51..6379.72 rows=2 width=8) (actual time=12.125..13.507 rows=3 loops=1)
--         Workers Planned: 2
--         Workers Launched: 2
--         ->  Partial Aggregate  (cost=5379.51..5379.52 rows=1 width=8) (actual time=10.350..10.350 rows=1 loops=3)
--               ->  Parallel Index Only Scan using tmp_2_index on tmp_2  (cost=0.42..5032.04 rows=138986 width=0) (actual time=0.021..6.398 rows=111111 loops=3)
--                     Index Cond: (a = 'active'::text)
--                     Heap Fetches: 0
-- Planning Time: 0.067 ms
-- Execution Time: 13.540 ms

drop table tmp_1;
drop table tmp_2;

I would say the gain isn't worth it.

@djezzzl
Owner

djezzzl commented Dec 15, 2022

Hi @toydestroyer,

Could you please share if there is any update on the topic? Just wondering if you've had a moment to work on the PR.

@toydestroyer
Contributor Author

Hey @djezzzl, no updates ☹️

I don't think I'll be able to do anything productive this year, my mind is already on holiday 😂

@djezzzl
Owner

djezzzl commented Dec 15, 2022

That is understandable to me!

I wish you and your family a Merry Christmas and a Happy New Year!

@djezzzl
Owner

djezzzl commented Jan 4, 2023

Hi @toydestroyer,

Happy New Year!

I'm reading the book "SQL Antipatterns" by Bill Karwin at the moment, and one thing I liked is its solution to what we're discussing here.
You can read more about it on Medium.

Put shortly, the best way to handle it is to have a separate table as "enum storage" and define a foreign key constraint.
This gives a flexible, portable, performant, and durable solution.

CREATE TABLE user_statuses (
  status varchar primary key,
  active boolean -- optional if we want to deprecate some of the statuses but keep them in the data
);

CREATE TABLE users (
  id bigserial primary key,
  full_name varchar,
  status varchar,

  FOREIGN KEY (status) REFERENCES user_statuses(status)
);
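
With this setup, the lookup table plays the role of the enum: any value missing from user_statuses is rejected by the foreign key. A quick sketch using the tables above:

INSERT INTO user_statuses (status) VALUES ('active'), ('inactive');

INSERT INTO users (full_name, status) VALUES ('Jane', 'active'); -- ok
INSERT INTO users (full_name, status) VALUES ('John', 'actve');  -- fails: violates the foreign key constraint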

What do you think about enforcing this instead of enum?

@toydestroyer
Contributor Author

Hey @djezzzl 🎉

I see how it is better than enums, especially for adding and removing values. It also looks like a nice solution to enforce consistency if you already store enums as strings.

But as a downside of this approach, I see maintenance overhead. For example, ~60% of our models have enums (some with more than one). We'll need to create many extra tables that will add unnecessary noise to our schema.rb. And we'll need to ensure that records in those tables are in sync between environments.

But I do think it is perfectly fine to have multiple checkers and allow developers to choose the ones that they think are good for their use case, like RuboCop allows you to choose the styles you want to enforce (" vs. ').

What do you think about extending the configuration options of DatabaseConsistency to be something like:

DatabaseConsistencyCheckers:
  EnumTypeChecker:
    enforced_style: enum # foreign_key, string, integer

So far, the gem has mainly been ensuring consistency between model definitions and the database schema, but it can do more. Namely, checking style consistency (good enough for v2? 🤔). Because at the end of the day, the gem is called DatabaseConsistency.

What do you think?

P. S. Saw your recent medium post. Will there be a checker for that? 😉
P. P. S. I hope I'll be able to allocate some time in the coming weeks to work on the checker to ensure the native enum type.

@djezzzl
Owner

djezzzl commented Jan 12, 2023

Hi @toydestroyer,

Sorry for taking so long to reply.

We'll need to create many extra tables that will add unnecessary noise to our schema.rb.

This point should be fine, as you would have to define the same number of enums either way. They may have shorter definitions, though.

And we'll need to ensure that records in those tables are in sync between environments.

This one is a bit more annoying, but:

  • Rails has a sophisticated way of producing seeds (see the sketch after this list);
  • some companies prefer obfuscated dumps to populate other environments, so it should be trivial to keep them in sync.
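
For the sync concern, an idempotent seed can be as small as an upsert against the lookup table (a sketch, reusing the user_statuses table from the earlier example):

-- safe to run on every deploy; existing rows are left untouched
INSERT INTO user_statuses (status) VALUES ('active'), ('inactive')
ON CONFLICT (status) DO NOTHING;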

But I do think it is perfectly fine to have multiple checkers and allow developers to choose the ones that they think are good for their use case

Good suggestion! As long as we have a rationale behind every strategy, we can keep them all.

So far, in your case, PostgreSQL enums make more sense to you, and I respect that; you still prefer them even after considering the approach with tables and foreign keys. So the only thing that remains is to write up both directions in plain text, with the reasoning for when to choose which, so everybody can learn from this.

Could you please help me with that? It could be a good topic for the next article 😄

Namely, checking style consistency (good enough for v2? 🤔). Because at the end of the day, the gem is called DatabaseConsistency.

Brilliant!

P. S. Saw your recent medium post. Will there be a checker for that?

Glad to hear that! Do you have any feedback?

And yes, that checker is planned, along with "hash index promotion." So, adding to your idea, the library could now cover three things:

  • optimizations
  • consistency
  • styles

What do you think?

P. P. S. I hope I'll be able to allocate some time in the coming weeks to work on the checker to ensure the native enum type.

That would be awesome! If you are short on time for coding, I can do it, but I would like to ask for your help with finishing the preparations first so we can release it smoothly without surprising anybody.

@toydestroyer
Contributor Author

Okay, so I did some benchmarking myself on my laptop (MBP M1 16GB), and here are my results:

-- 0. Postgres version
SELECT version();
-- => PostgreSQL 15.1 (Homebrew) on aarch64-apple-darwin22.1.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.202), 64-bit

-- 1. Tablespaces
-- create a tablespace for enum data
CREATE TABLESPACE enum_tablespace LOCATION './enum_test/enum';

-- create a tablespace for string data
CREATE TABLESPACE string_tablespace LOCATION './enum_test/string';

-- create a tablespace for string data with a foreign key
CREATE TABLESPACE string_fk_tablespace LOCATION './enum_test/string_fk';

-- create a tablespace for smallint data
CREATE TABLESPACE int2_tablespace LOCATION './enum_test/int2';

-- create a tablespace for integer data
CREATE TABLESPACE int4_tablespace LOCATION './enum_test/int4';

-- create a tablespace for bigint data
CREATE TABLESPACE int8_tablespace LOCATION './enum_test/int8';

-- 2. Support
-- enum storage for string_fk
CREATE TABLE transaction_statuses (
  status text primary key
) TABLESPACE string_fk_tablespace;

INSERT INTO transaction_statuses (status) VALUES ('pending'), ('processing'), ('completed'), ('failed');

-- enum
CREATE TYPE transaction_status AS ENUM ('pending', 'processing', 'completed', 'failed');

-- 3. Tables
-- create a table with an enum column to store transaction status in the enum_tablespace tablespace
CREATE TABLE enum_transactions (
    id SERIAL PRIMARY KEY,
    status transaction_status
) TABLESPACE enum_tablespace;

-- create a table with a string column to store transaction status in the string_tablespace tablespace
CREATE TABLE string_transactions (
    id SERIAL PRIMARY KEY,
    status text
) TABLESPACE string_tablespace;

-- create a table with a string column and a foreign key to store transaction status in the string_fk_tablespace tablespace
CREATE TABLE string_fk_transactions (
    id SERIAL PRIMARY KEY,
    status text,
    FOREIGN KEY (status) REFERENCES transaction_statuses(status)
) TABLESPACE string_fk_tablespace;

-- create a table with a smallint column to store transaction status in the int2_tablespace tablespace
CREATE TABLE int2_transactions (
    id SERIAL PRIMARY KEY,
    status INT2
) TABLESPACE int2_tablespace;

-- create a table with an integer column to store transaction status in the int4_tablespace tablespace
CREATE TABLE int4_transactions (
    id SERIAL PRIMARY KEY,
    status INT4
) TABLESPACE int4_tablespace;

-- create a table with a bigint column to store transaction status in the int8_tablespace tablespace
CREATE TABLE int8_transactions (
    id SERIAL PRIMARY KEY,
    status INT8
) TABLESPACE int8_tablespace;

-- insert 1 million rows of random data into the enum_transactions table
INSERT INTO enum_transactions (status) (SELECT ((ARRAY['pending', 'processing', 'completed', 'failed'])[generate_series % 4 + 1])::transaction_status FROM generate_series(1, 1000000));
-- => 2.343000s

-- insert 1 million rows of random data into the string_transactions table
INSERT INTO string_transactions (status) (SELECT ((ARRAY['pending', 'processing', 'completed', 'failed'])[generate_series % 4 + 1]) FROM generate_series(1, 1000000));
-- => 2.431000s

-- insert 1 million rows of random data into the string_fk_transactions table
INSERT INTO string_fk_transactions (status) (SELECT ((ARRAY['pending', 'processing', 'completed', 'failed'])[generate_series % 4 + 1]) FROM generate_series(1, 1000000));
-- => 4.561000s

-- insert 1 million rows of random data into the int2_transactions table
INSERT INTO int2_transactions (status) (SELECT generate_series % 4 FROM generate_series(1, 1000000));
-- => 2.083000s

-- insert 1 million rows of random data into the int4_transactions table
INSERT INTO int4_transactions (status) (SELECT generate_series % 4 FROM generate_series(1, 1000000));
-- => 1.898000s

-- insert 1 million rows of random data into the int8_transactions table
INSERT INTO int8_transactions (status) (SELECT generate_series % 4 FROM generate_series(1, 1000000));
-- => 1.845000s

Before proceeding, note that inserting into the table with a foreign key takes about twice as long as the other examples.

Vacuuming and analyzing:

VACUUM ANALYZE enum_transactions;
VACUUM ANALYZE string_fk_transactions;
VACUUM ANALYZE string_transactions;
VACUUM ANALYZE int2_transactions;
VACUUM ANALYZE int4_transactions;
VACUUM ANALYZE int8_transactions;

Check the disk usage of each table:

SELECT pg_total_relation_size('enum_transactions') as enum_size,
       pg_total_relation_size('string_transactions') as string_size,
       pg_total_relation_size('string_fk_transactions') as string_fk_size,
       pg_total_relation_size('int2_transactions') as int2_size,
       pg_total_relation_size('int4_transactions') as int4_size,
       pg_total_relation_size('int8_transactions') as int8_size;

-- => enum_size:      58777600 (56 MB)
-- => string_size:    66822144 (64 MB)
-- => string_fk_size: 66822144 (64 MB)
-- => int2_size:      58777600 (56 MB)
-- => int4_size:      58777600 (56 MB)
-- => int8_size:      66813952 (64 MB)

I also used tablespaces because I wanted to see actual disk usage. So here's du -d1:

-- 71888 ./int4      (35M)
-- 71888 ./int2      (35M)
-- 88320 ./string_fk (43M)
-- 71408 ./enum      (35M)
-- 88272 ./int8      (43M)
-- 88288 ./string    (43M)

Cleanup:

-- tables
DROP TABLE IF EXISTS enum_transactions;
DROP TABLE IF EXISTS string_transactions;
DROP TABLE IF EXISTS string_fk_transactions;
DROP TABLE IF EXISTS int2_transactions;
DROP TABLE IF EXISTS int4_transactions;
DROP TABLE IF EXISTS int8_transactions;

DROP TABLE IF EXISTS transaction_statuses;

-- types
DROP TYPE transaction_status;

-- tablespaces
DROP TABLESPACE IF EXISTS enum_tablespace;
DROP TABLESPACE IF EXISTS string_tablespace;
DROP TABLESPACE IF EXISTS string_fk_tablespace;
DROP TABLESPACE IF EXISTS int4_tablespace;
DROP TABLESPACE IF EXISTS int2_tablespace;
DROP TABLESPACE IF EXISTS int8_tablespace;

Conclusion

Native enum is a clear winner for me. It combines the space efficiency of integers and the readability of strings. It also enforces consistency.

It would be nice to have some sanity check on my code because I'm pretty sure I'm very biased towards native enums 😆

I used the text datatype for simplicity. varchar doesn't make a difference (I checked). varchar(255) is less space-consuming but insignificantly.

I was also super surprised that int2 and int4 consume an equal amount of disk space.

And I'm even more skeptical about strings with foreign keys now. They are still strings and consume disk space as regular strings, and it seems like having a foreign key does affect the insert performance 🤷‍♂️

Thoughts? 😊

@djezzzl
Owner

djezzzl commented Apr 14, 2023

Wow! Such tremendous benchmarking, @toydestroyer!

I'm so sorry I haven't looked at this as early as I should have. I will do my best to have a good look this weekend.

@djezzzl
Owner

djezzzl commented Apr 17, 2023

Hi @toydestroyer,

I just finished reading this. Thank you again for the extraordinary explorations! I'm pretty confident this benchmarking deserves a post to share.

-- => enum_size: 58777600 (56 MB)
-- => int2_size: 58777600 (56 MB)
-- => int4_size: 58777600 (56 MB)

The above looks odd to me, but I think it is because of how PostgreSQL lays rows out on pages; there is probably some minimum aligned row size. That's why int2 and int4 produce the same results.
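
One way to see where the bytes go (a sketch, assuming the benchmark tables from the earlier comment are recreated): the stored values themselves do differ in size, but every heap row also carries a ~23-byte header and is padded to an 8-byte alignment boundary, so the 2-byte and 4-byte status columns end up in rows of the same width.

SELECT pg_column_size(status) AS int2_value_bytes FROM int2_transactions LIMIT 1; -- 2
SELECT pg_column_size(status) AS int4_value_bytes FROM int4_transactions LIMIT 1; -- 4
SELECT pg_column_size(status) AS enum_value_bytes FROM enum_transactions LIMIT 1; -- 4
-- with the row header and alignment padding, these rows all end up the same size on disk,
-- which is why the int2, int4, and enum tables report identical relation sizes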

It would be nice to have some sanity check on my code because I'm pretty sure I'm very biased towards native enums 😆

The code looks legit to me.

I was also super surprised that int2 and int4 consume an equal amount of disk space.

Same feeling.

I used the text datatype for simplicity. varchar doesn't make a difference (I checked). varchar(255) is less space-consuming but insignificantly.

Makes sense to me.

And I'm even more skeptical about strings with foreign keys now. They are still strings and consume disk space as regular strings, and it seems like having a foreign key does affect the insert performance 🤷‍♂️

Yes, this is disappointing, to be honest. The table that stores the statuses is tiny, yet the insert time almost doubles.

=========================================================================

I think we should try to conclude.

Native enum type in PostgreSQL 15+.

Pros:

  • has the best performance
  • has the best memory efficiency
  • ensures consistency
  • straightforward to read

Cons:

  • Hard to manipulate (rename/drop)

I think we can promote enums with the checker. It would also be great to highlight the reasoning and potential drawbacks. It would be super cool to mention how to overcome them.
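
For the "hard to manipulate" part, a sketch with hypothetical names: renaming a value has native support (PostgreSQL 10+), while dropping one usually means recreating the type.

-- renaming an enum value (supported natively since PostgreSQL 10)
ALTER TYPE user_status RENAME VALUE 'inactive' TO 'archived';

-- dropping a value is not supported directly; the usual workaround is to recreate the type
ALTER TYPE user_status RENAME TO user_status_old;
CREATE TYPE user_status AS ENUM ('active', 'archived');
ALTER TABLE users
  ALTER COLUMN status TYPE user_status USING status::text::user_status;
DROP TYPE user_status_old;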

P.S. Maybe it isn't as efficient on lower PostgreSQL versions, but we have the benchmarking to confirm that.

Thank you again for the topic, and please feel free to open a PR; I will be happy to merge and release it.

@djezzzl added the enhancement label on Dec 26, 2024