Checker that makes sure that enum type is used to store ActiveRecord's enums #169
Hi @toydestroyer, Thank you for the follow-up discussion!
I saw that topic too. Some of the answers indeed sound interesting, but researching them suggests they are not worthwhile, IMHO.
Could you please check it in your database and share the results? Would there be a win in terms of memory reduction or performance gain?
I think it's still fine to have the checker in DatabaseConsistency even if it helps only you. In the end, anybody can simply disable it if they don't want it or disagree. To sum up: giving the customers (engineers) this choice is a big plus.
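Disabling a checker would be a one-liner in the gem's `.database_consistency.yml` (a sketch only; it assumes the configuration format discussed later in this thread, and the `enabled` flag is an assumption about the gem's options):

```yaml
# hypothetical entry opting out of the proposed checker project-wide
DatabaseConsistencyCheckers:
  EnumTypeChecker:
    enabled: false
```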
Some of my findings:

```sql
create type example_enum as enum ('value1', 'value2', 'value3');
create table tmp_varchar (example varchar);
create table tmp_enum (example example_enum);

insert into tmp_varchar (select (ARRAY['value1', 'value2', 'value3'])[generate_series % 3 + 1] from generate_series(1, 1000000));
insert into tmp_enum (select example::example_enum from tmp_varchar);

create index tmp_varchar_index on tmp_varchar(example);
create index tmp_enum_index on tmp_enum(example);

analyze tmp_varchar;
analyze tmp_enum;

select
  pg_size_pretty(sum(pg_column_size('example'))) as total_size,
  pg_size_pretty(avg(pg_column_size('example'))) as average_size,
  count(*) as count
from tmp_varchar;
-- 7813 kB, 8.0000000000000000 bytes, 1000000

select
  pg_size_pretty(sum(pg_column_size('example'))) as total_size,
  pg_size_pretty(avg(pg_column_size('example'))) as average_size,
  count(*) as count
from tmp_enum;
-- 7813 kB, 8.0000000000000000 bytes, 1000000
```
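Note that the two size queries above report identical numbers for both tables because `pg_column_size('example')` measures the quoted literal `'example'` itself, not the column, so every row contributes the same 8 bytes. A corrected sketch would drop the quotes so the actual column datum is measured (the results would then differ between the varchar and enum tables):

```sql
-- measure the column, not the string literal
select
  pg_size_pretty(sum(pg_column_size(example))) as total_size,
  avg(pg_column_size(example)) as average_size,
  count(*) as count
from tmp_varchar;
```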
```sql
select version();
-- PostgreSQL 12.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit

explain analyze select * from tmp_varchar where example = 'value2';
-- Bitmap Heap Scan on tmp_varchar  (cost=6218.32..14788.33 rows=331600 width=7) (actual time=14.082..45.372 rows=333334 loops=1)
--   Recheck Cond: ((example)::text = 'value2'::text)
--   Heap Blocks: exact=4425
--   ->  Bitmap Index Scan on tmp_varchar_index  (cost=0.00..6135.43 rows=331600 width=0) (actual time=13.688..13.688 rows=333334 loops=1)
--         Index Cond: ((example)::text = 'value2'::text)
-- Planning Time: 0.065 ms
-- Execution Time: 54.115 ms

explain analyze select * from tmp_enum where example = 'value2';
-- Bitmap Heap Scan on tmp_enum  (cost=6294.10..14915.35 rows=335700 width=4) (actual time=11.765..41.599 rows=333334 loops=1)
--   Recheck Cond: (example = 'value2'::example_enum)
--   Heap Blocks: exact=4425
--   ->  Bitmap Index Scan on tmp_enum_index  (cost=0.00..6210.18 rows=335700 width=0) (actual time=11.381..11.382 rows=333334 loops=1)
--         Index Cond: (example = 'value2'::example_enum)
-- Planning Time: 0.069 ms
-- Execution Time: 50.327 ms
```
I think the main benefits in our use case are not performance-related, but the following:
Got it, I'll do more research on this!
I absolutely agree that integer-based enums with ActiveRecord are hard to read. It can be simply solved with a string-based enum, though. For example:
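A minimal sketch of such a string-based ActiveRecord enum (the model and status names here are illustrative, not from the thread):

```ruby
# Illustrative model: `status` is a varchar/text column, and the enum
# maps each state to the string stored in the database, so the raw data
# stays readable ('active' instead of 0).
class Subscription < ApplicationRecord
  enum status: { active: "active", paused: "paused", canceled: "canceled" }
end
```

With this mapping, `Subscription.active` scopes and `subscription.active?` predicates work as usual, while the database column holds the human-readable string.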
That's true. And I agree that could be critical for data-based logic. So, we have a "consistency" reason right now, which for me is already a good one. However, it would be great to have some more; if you can find them, please feel free to share. In any case, I think I will be focused on something else in the upcoming weeks, so if you want, feel free to do a PR.
While I'm still researching, I want to point out an interesting note in the
I understand that it makes sense that comparing integers should be faster than comparing strings; however, I see the following results:

```sql
create table tmp_1 (a integer);
create table tmp_2 (a varchar);

insert into tmp_1 (select generate_series % 3 from generate_series(1, 1000000));
insert into tmp_2 (select (ARRAY['active', 'inactive', 'invalid'])[a + 1] from tmp_1);

create index tmp_1_index on tmp_1(a);
create index tmp_2_index on tmp_2(a);

analyze tmp_1;
analyze tmp_2;

explain analyze select count(*) from tmp_1 where a = 1;
-- Finalize Aggregate  (cost=6386.73..6386.74 rows=1 width=8) (actual time=11.920..13.236 rows=1 loops=1)
--   ->  Gather  (cost=6386.51..6386.72 rows=2 width=8) (actual time=11.847..13.229 rows=3 loops=1)
--         Workers Planned: 2
--         Workers Launched: 2
--         ->  Partial Aggregate  (cost=5386.51..5386.52 rows=1 width=8) (actual time=10.235..10.236 rows=1 loops=3)
--               ->  Parallel Index Only Scan using tmp_1_index on tmp_1  (cost=0.42..5038.14 rows=139347 width=0) (actual time=0.021..6.298 rows=111111 loops=3)
--                     Index Cond: (a = 1)
--                     Heap Fetches: 0
-- Planning Time: 0.062 ms
-- Execution Time: 13.260 ms

explain analyze select count(*) from tmp_2 where a = 'active';
-- Finalize Aggregate  (cost=6379.72..6379.73 rows=1 width=8) (actual time=12.200..13.513 rows=1 loops=1)
--   ->  Gather  (cost=6379.51..6379.72 rows=2 width=8) (actual time=12.125..13.507 rows=3 loops=1)
--         Workers Planned: 2
--         Workers Launched: 2
--         ->  Partial Aggregate  (cost=5379.51..5379.52 rows=1 width=8) (actual time=10.350..10.350 rows=1 loops=3)
--               ->  Parallel Index Only Scan using tmp_2_index on tmp_2  (cost=0.42..5032.04 rows=138986 width=0) (actual time=0.021..6.398 rows=111111 loops=3)
--                     Index Cond: (a = 'active'::text)
--                     Heap Fetches: 0
-- Planning Time: 0.067 ms
-- Execution Time: 13.540 ms

drop table tmp_1;
drop table tmp_2;
```

I would say the gain isn't worth it.
Hi @toydestroyer, could you please share whether there is any update on the topic? Just wondering if you had a moment to work on the PR.
Hey @djezzzl, no updates. I don't think I'll be able to do anything productive this year; my mind is already on holiday 😂
That is understandable to me! I wish you and your family a Merry Christmas and a Happy New Year!
Hi @toydestroyer, Happy New Year! I'm reading the book "SQL Antipatterns" by Bill Karwin at the moment, and one thing I liked is its solution for our discussion here. Put shortly, the best way to handle it is to have a separate table as "enum storage" and define a foreign key constraint:

```sql
CREATE TABLE user_statuses (
  status varchar PRIMARY KEY,
  active boolean -- optional, if we want to deprecate some of the statuses but keep them in the data
);

CREATE TABLE users (
  id bigserial PRIMARY KEY,
  full_name varchar,
  status varchar,
  FOREIGN KEY (status) REFERENCES user_statuses (status)
);
```

What do you think about enforcing this instead of
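As a sketch of why value changes are easier with the lookup table, compare it to a hypothetical native `user_status` enum type (the `'suspended'` and `'invited'` values are made up for illustration):

```sql
-- With the lookup table, adding a status is plain DML:
INSERT INTO user_statuses (status, active) VALUES ('suspended', true);

-- Deprecating one is an UPDATE, leaving historical rows intact:
UPDATE user_statuses SET active = false WHERE status = 'invited';

-- A native enum needs DDL to grow (supported since PostgreSQL 9.1):
ALTER TYPE user_status ADD VALUE 'suspended';
-- ...and PostgreSQL has no ALTER TYPE ... DROP VALUE, so removing an
-- enum value requires recreating the type.
```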
Hey @djezzzl 🎉 I see how it is better than enums, especially for adding and removing values. It also looks like a nice solution for enforcing consistency if you already store enums as strings.

But as a downside of this approach, I see maintenance overhead. For example, ~60% of our models have enums (some more than one). We'll need to create many extra tables that will add unnecessary noise to our

But I do think it is perfectly fine to have multiple checkers and allow developers to choose the ones that they think are good for their use case, like rubocop allows you to choose the styles you want to enforce (

What do you think about extending the configuration options of DatabaseConsistency to be something like:

```yaml
DatabaseConsistencyCheckers:
  EnumTypeChecker:
    enforced_style: enum # foreign_key, string, integer
```

So far, the gem has mainly been ensuring consistency between model definitions and the database schema, but it can do more, namely checking style consistency (good enough for v2? 🤔). Because at the end of the day, the gem is called DatabaseConsistency. What do you think?

P.S. Saw your recent Medium post. Will there be a checker for that? 😉
Hi @toydestroyer, sorry for taking so long to reply.

This point should be fine, as you have to define the same number of enums. They may have shorter definitions, though.

This one is a bit more annoying, but:

Good suggestion! As long as we have a rationale behind every strategy, we can keep them all. So far, in your case, PostgreSQL enums make more sense to you, and I respect that; you still prefer them after considering the approach with tables and foreign keys. So the only thing that remains is to write down, in plain text, both directions and the reasoning for when to choose which, so everybody can learn from this. Could you please help me with that? It could be a good theme for a follow-up article 😄

Brilliant!

Glad to hear that! Would you have any feedback? And yes, the check is planned along with "hash index promotion." So, adding to your idea, the library could now cover three things:

What do you think?

That would be awesome! If you are short on time for coding, I can do that, but I would like to ask for your help with finishing the preparations first so we can release it smoothly without surprising anybody.
Okay, so I did some benchmarking myself on my laptop (MBP M1 16GB), and here are my results:

```sql
-- 0. Postgres version
SELECT version();
-- => PostgreSQL 15.1 (Homebrew) on aarch64-apple-darwin22.1.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.202), 64-bit

-- 1. Tablespaces
-- create a tablespace for enum data
CREATE TABLESPACE enum_tablespace LOCATION './enum_test/enum';
-- create a tablespace for string data
CREATE TABLESPACE string_tablespace LOCATION './enum_test/string';
-- create a tablespace for string data with a foreign key
CREATE TABLESPACE string_fk_tablespace LOCATION './enum_test/string_fk';
-- create a tablespace for smallint data
CREATE TABLESPACE int2_tablespace LOCATION './enum_test/int2';
-- create a tablespace for integer data
CREATE TABLESPACE int4_tablespace LOCATION './enum_test/int4';
-- create a tablespace for bigint data
CREATE TABLESPACE int8_tablespace LOCATION './enum_test/int8';

-- 2. Support
-- enum storage for string_fk
CREATE TABLE transaction_statuses (
  status text PRIMARY KEY
) TABLESPACE string_fk_tablespace;
INSERT INTO transaction_statuses (status) VALUES ('pending'), ('processing'), ('completed'), ('failed');

-- enum
CREATE TYPE transaction_status AS ENUM ('pending', 'processing', 'completed', 'failed');

-- 3. Tables
-- create a table with an enum column to store transaction status in the enum_tablespace tablespace
CREATE TABLE enum_transactions (
  id SERIAL PRIMARY KEY,
  status transaction_status
) TABLESPACE enum_tablespace;

-- create a table with a string column to store transaction status in the string_tablespace tablespace
CREATE TABLE string_transactions (
  id SERIAL PRIMARY KEY,
  status text
) TABLESPACE string_tablespace;

-- create a table with a string column and a foreign key to store transaction status in the string_fk_tablespace tablespace
CREATE TABLE string_fk_transactions (
  id SERIAL PRIMARY KEY,
  status text,
  FOREIGN KEY (status) REFERENCES transaction_statuses (status)
) TABLESPACE string_fk_tablespace;

-- create a table with a smallint column to store transaction status in the int2_tablespace tablespace
CREATE TABLE int2_transactions (
  id SERIAL PRIMARY KEY,
  status INT2
) TABLESPACE int2_tablespace;

-- create a table with an integer column to store transaction status in the int4_tablespace tablespace
CREATE TABLE int4_transactions (
  id SERIAL PRIMARY KEY,
  status INT4
) TABLESPACE int4_tablespace;

-- create a table with a bigint column to store transaction status in the int8_tablespace tablespace
CREATE TABLE int8_transactions (
  id SERIAL PRIMARY KEY,
  status INT8
) TABLESPACE int8_tablespace;

-- 4. Inserts
-- insert 1 million rows into the enum_transactions table
INSERT INTO enum_transactions (status) (SELECT ((ARRAY['pending', 'processing', 'completed', 'failed'])[generate_series % 4 + 1])::transaction_status FROM generate_series(1, 1000000));
-- => 2.343000s
-- insert 1 million rows into the string_transactions table
INSERT INTO string_transactions (status) (SELECT (ARRAY['pending', 'processing', 'completed', 'failed'])[generate_series % 4 + 1] FROM generate_series(1, 1000000));
-- => 2.431000s
-- insert 1 million rows into the string_fk_transactions table
INSERT INTO string_fk_transactions (status) (SELECT (ARRAY['pending', 'processing', 'completed', 'failed'])[generate_series % 4 + 1] FROM generate_series(1, 1000000));
-- => 4.561000s
-- insert 1 million rows into the int2_transactions table
INSERT INTO int2_transactions (status) (SELECT generate_series % 4 FROM generate_series(1, 1000000));
-- => 2.083000s
-- insert 1 million rows into the int4_transactions table
INSERT INTO int4_transactions (status) (SELECT generate_series % 4 FROM generate_series(1, 1000000));
-- => 1.898000s
-- insert 1 million rows into the int8_transactions table
INSERT INTO int8_transactions (status) (SELECT generate_series % 4 FROM generate_series(1, 1000000));
-- => 1.845000s
```

Before proceeding, note that inserting into the table with a foreign key takes about twice as long as the other examples.

Vacuuming and analyzing:

```sql
VACUUM ANALYZE enum_transactions;
VACUUM ANALYZE string_fk_transactions;
VACUUM ANALYZE string_transactions;
VACUUM ANALYZE int2_transactions;
VACUUM ANALYZE int4_transactions;
VACUUM ANALYZE int8_transactions;
```

Check the disk usage of each table:

```sql
SELECT pg_total_relation_size('enum_transactions') AS enum_size,
       pg_total_relation_size('string_transactions') AS string_size,
       pg_total_relation_size('string_fk_transactions') AS string_fk_size,
       pg_total_relation_size('int2_transactions') AS int2_size,
       pg_total_relation_size('int4_transactions') AS int4_size,
       pg_total_relation_size('int8_transactions') AS int8_size;
-- => enum_size:      58777600 (56 MB)
-- => string_size:    66822144 (64 MB)
-- => string_fk_size: 66822144 (64 MB)
-- => int2_size:      58777600 (56 MB)
-- => int4_size:      58777600 (56 MB)
-- => int8_size:      66813952 (64 MB)
```

I also used tablespaces because I wanted to see actual disk usage. So here's
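As a rough per-value illustration of why the tables fall into 56 MB and 64 MB buckets, one can measure single datums (a sketch only: this shows in-memory datum sizes, while on-disk tuples are additionally padded and aligned per row, so table sizes do not scale linearly with these numbers):

```sql
-- pg_column_size on individual values of each candidate type
SELECT pg_column_size('pending'::text)               AS text_size,  -- varlena: header plus 7 bytes of payload
       pg_column_size('pending'::transaction_status) AS enum_size,  -- 4 bytes: enum values are stored as OIDs
       pg_column_size(1::int2)                       AS int2_size,  -- 2 bytes
       pg_column_size(1::int8)                       AS int8_size;  -- 8 bytes
```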
Cleanup:

```sql
-- tables
DROP TABLE IF EXISTS enum_transactions;
DROP TABLE IF EXISTS string_transactions;
DROP TABLE IF EXISTS string_fk_transactions;
DROP TABLE IF EXISTS int2_transactions;
DROP TABLE IF EXISTS int4_transactions;
DROP TABLE IF EXISTS int8_transactions;
DROP TABLE IF EXISTS transaction_statuses;
-- types
DROP TYPE transaction_status;
-- tablespaces
DROP TABLESPACE IF EXISTS enum_tablespace;
DROP TABLESPACE IF EXISTS string_tablespace;
DROP TABLESPACE IF EXISTS string_fk_tablespace;
DROP TABLESPACE IF EXISTS int2_tablespace;
DROP TABLESPACE IF EXISTS int4_tablespace;
DROP TABLESPACE IF EXISTS int8_tablespace;
```

**Conclusion**

The native enum is a clear winner for me. It combines the space efficiency of integers with the readability of strings, and it also enforces consistency. It would be nice to have some sanity check on my reasoning, because I'm pretty sure I'm very biased towards native enums 😆

I used the

I was also super surprised about

And I'm even more skeptical about strings with foreign keys now. They are still strings and consume disk space like regular strings, and it seems like having a foreign key does affect insert performance 🤷‍♂️

Thoughts? 😊
Wow! Such tremendous benchmarking, @toydestroyer! I'm so sorry I haven't looked at this as early as I should have. I will do my best to have a good look this weekend.
Hi @toydestroyer, I just finished reading this. Thank you again for the extraordinary exploration! I'm pretty confident this benchmarking deserves a post to share.

The above looks odd to me, but I think it is because of how PostgreSQL allocates pages. There is probably some minimum byte size. That's why

The code looks legit to me.

Same feeling.

Makes sense to me.

Yes, this is disappointing, to be honest. The table that stores the statuses is tiny, yet the inserting time almost doubles.

=========================================================================

I think we should try to conclude. Native enum pros:

I think we can promote enums with the checker. It would also be great to highlight the reasoning and potential drawbacks, and it would be super cool to mention how to overcome them.

P.S. Maybe it isn't as efficient in lower versions, but we have benchmarking to confirm that. Thank you again for the topic, and please feel free to do a PR; I will be happy to merge and release it.
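The "hash index promotion" check mentioned above refers to hash indexes, which support only equality lookups; for a status-style column, equality is usually the only access pattern anyway. A sketch against the benchmark's tables (hash indexes have been crash-safe and WAL-logged since PostgreSQL 10, so this applies to the versions benchmarked here):

```sql
-- a btree (the default) supports <, <=, =, >=, >, but a status column
-- is queried almost exclusively with equality:
CREATE INDEX enum_transactions_status_btree ON enum_transactions (status);

-- a hash index supports only =, and stores fixed-size hashes of the
-- keys, which can make it more compact for wide values:
CREATE INDEX enum_transactions_status_hash ON enum_transactions USING hash (status);
```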
Let's extract our discussion from #160.
One of the things that I want to enforce in the project I'm working on right now is to stop using ActiveRecord's default integer enums and use strings with an actual enum type in the database (we're using PostgreSQL).
Regarding the general pros and cons of enum columns over integers/strings: I did some quick research (aka googled "postgresql enum pros and cons") and found this question on StackOverflow: https://stackoverflow.com/q/2318123
If I may just quote some of the answers from there:
and this one:
In my case, this is definitely a win! But I understand that this is not for everyone. Especially considering that Rails added an API to create enum columns only from version 7, and only for PostgreSQL: rails/rails@4eef348. And it doesn't provide an API for adding/deleting/changing values once an enum is created. So you have to write raw SQL in your migration files or use third-party gems (#1, #2).
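For reference, the Rails 7 migration API mentioned above looks roughly like this (a sketch; the migration, table, and type names are illustrative, and `create_enum` is PostgreSQL-only):

```ruby
# Illustrative Rails 7 migration using the PostgreSQL-only enum support
class AddStatusToTransactions < ActiveRecord::Migration[7.0]
  def change
    # emits CREATE TYPE transaction_status AS ENUM (...) under the hood
    create_enum :transaction_status, %w[pending processing completed failed]

    # the column references the enum type via enum_type:
    add_column :transactions, :status, :enum,
               enum_type: :transaction_status, default: "pending", null: false
  end
end
```

As noted above, there is no built-in API for adding, removing, or renaming enum values afterwards, so those changes would still go through `execute` with raw `ALTER TYPE` SQL.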
So what does everyone think about having such a checker as a part of the DatabaseConsistency gem? It might be disabled by default so people are cautious about enforcing it.