Description
We have changed one of the most time-consuming insert queries into a bulk insert query and noticed a positive difference in DB performance.
POC PR: #996
DB Metrics And Observations
Before this change
Dashboard link here in case you want to look at other metrics as well in a sample snapshot.
After this change
Dashboard link here in case you want to look at other metrics as well in a sample snapshot.
If we look at the spikes after this bulk insert change, they are much narrower, IOPS dropped by the thousands, and the Clair app side also performs better (i.e. the I/O wait time of requests is reduced). This will ultimately help us reduce cost on both the Clair DB and app side.
Profiling Report And Observations
Before this change. Complete report here
After this change. Complete report here
If we look at the average query execution times, there is a 100-fold difference after moving to a bulk insert strategy (i.e. 250 ms down to 2.5 ms). The overall query run time also went down, along with the number of queries hitting the DB, ultimately resulting in fewer IOPS on the DB. (This keeps the DB less busy, allowing more incoming requests to be served, and also helps reduce the spikes seen in both the Clair app and the DB.)
Summary And Conclusion
Still, a better suggestion would be either:
Moving the entire schema to column-based storage, as columnar stores are well optimized to run this kind of bulk periodic workload (i.e. every 6 hours) using their distributed execution pattern (i.e. sharding, which is not present in PostgreSQL). A good list of column-store DBs worth exploring, in case we decide to move to one in the future: https://www.g2.com/categories/columnar-databases.
OR
Taking the below action items on PostgreSQL:
https://www.postgresql.org/docs/13/performance-tips.html
https://www.postgresql.org/docs/13/parallel-query.html
Moving from single INSERT/UPDATE/DELETE statements to bulk INSERT/UPDATE/DELETE statements, where a single statement acts on multiple rows of a table. Supporting posts and example snippets to perform bulk operations:
def approval_query_executor(self, table_id, table_name, columns, approval_type):
    """
    executes approval query for cntl_approval_info in rds
    :param table_id: table id according to cntl_table_info in rds
    :param table_name: table name according to cntl_table_info in rds
    :param columns: columns to be approved
    :param approval_type: type of approval like add, update and delete
    :return: None
    """
    executer = QueryExecuter(self.default_args['rds_host'], self.default_args['rds_port'],
                             self.default_args['rds_dbName'], self.default_args['rds_password'],
                             self.default_args['rds_username'])
    approval_info = self.default_args['approval_info']
    # Build a single multi-row INSERT instead of issuing one INSERT per column.
    cntl_columns_approval_query = """INSERT INTO %s (table_id, src_table_name, src_column_name, column_datatype,
    approval_type, approval_flag, last_updated_date, approved_by, active_flag, status, data_length, data_precision,
    data_scale) VALUES """
    params = (approval_info,)
    if len(columns) > 0:
        for column_number in range(0, len(columns)):
            # Append one VALUES tuple per column that needs approval.
            cntl_columns_approval_query += """(%s, '%s', '%s', '%s', '%s', %s, '%s', '%s', %s, '%s', %s, %s, %s),"""
            params += (table_id, table_name, columns[column_number][1], columns[column_number][2], approval_type,
                       0, str(datetime.now()), 'Dejavu-Gatekeeper', 1, 'pending approval for approval in rds',
                       columns[column_number][3], columns[column_number][4], columns[column_number][5])
        # Replace the trailing comma with a semicolon and run the whole batch as one statement.
        cntl_columns_approval_query = cntl_columns_approval_query[:-1] + ";"
        executer.execute_query(cntl_columns_approval_query, params, approval_info)
    else:
        logging.info("No columns to send for approval to {0} in {1}".format(approval_type, table_name))
def update_sequence_number(self, new_columns, table_name, target=None):
    """
    updates sequence number of a table in cntl_column_info in rds
    :param new_columns: list of new columns
    :param table_name: table name
    :param target: flag to determine update in source or target sequence number
    :return: None
    """
    executer = QueryExecuter(self.default_args['rds_host'], self.default_args['rds_port'],
                             self.default_args['rds_dbName'], self.default_args['rds_password'],
                             self.default_args['rds_username'])
    column_info = self.default_args['column_info']
    columns = list()
    new_columns = tuple(new_columns)
    seq_fashion = 'src_seq_no' if target is None else 'trgt_seq_no'
    if len(new_columns) != 0:
        # Build one bulk UPDATE that sets every sequence number through a CASE expression,
        # instead of issuing one UPDATE per column.
        update_sequence_number_query = """UPDATE %s SET %s = CASE src_column_name """
        params = (column_info, seq_fashion)
        for index in range(0, len(new_columns)):
            update_sequence_number_query += """WHEN '%s' THEN %s """
            params += (new_columns[index][1], index + 1)
            columns.append(new_columns[index][1])
        update_sequence_number_query += """END WHERE src_column_name IN """ + str(
            columns).replace('[', '(').replace(']', ')') + """ AND src_table_name='%s' AND src_active_flag=%s
            AND active_flag=%s;"""
        params += (table_name, 1, 1)
        executer.execute_query(update_sequence_number_query, params, column_info)
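If the app talks to PostgreSQL through psycopg2, an alternative to hand-building the VALUES list is psycopg2.extras.execute_values, which batches many rows into a few multi-row INSERT statements. This is only a minimal sketch under that assumption; the table and column names below are illustrative placeholders, not the actual cntl schema:

# Minimal sketch (assumes psycopg2 is available); table/column names are illustrative only.
import psycopg2
from psycopg2.extras import execute_values

def bulk_insert_rows(conn, rows):
    # rows: list of tuples, e.g. [(table_id, 'src_table', 'src_col'), ...]
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO cntl_approval_info (table_id, src_table_name, src_column_name) VALUES %s",
            rows,
            page_size=1000,  # rows packed into each generated multi-row INSERT
        )
    conn.commit()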
We can also think of modifying the current ER model of the DB (which appears to be in 3NF). But modifying the DB schema requires thorough knowledge of how each attribute/column and entity relates to the business, and of what activities are expected to happen between the entities in the model. One simple option would be to keep the data in a flat format instead of normalizing it further, which would make querying faster but would introduce data duplication, which is again a trade-off. If we do decide to modify the existing schema, we need to consider the points above so that we can design it in such a way that we can leverage bulk statements like INSERT/UPDATE/DELETE/COPY on top of it to perform bulk operations.
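Since COPY is mentioned above: COPY is typically the fastest bulk-load path in PostgreSQL. A minimal sketch using psycopg2's copy_expert follows, with an assumed table name and CSV layout (placeholders, not the real schema):

# Minimal sketch of a COPY-based bulk load (assumed table name and CSV layout).
import csv
import io
import psycopg2

def copy_rows(conn, rows):
    # Serialize the rows into an in-memory CSV buffer and stream it with COPY.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY cntl_approval_info (table_id, src_table_name, src_column_name) FROM STDIN WITH CSV",
            buf,
        )
    conn.commit()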
Why is the max and average duration so much worse? I don't understand why the graphs don't seem to correlate at all to the numbers in the table. It looks like the execution time is worse in the giant-query version.
Yes, there are some queries that ran for a long time (the bulk ones), and those appear in the top-twenty list. But the overall average time is much lower with the giant-query execution. If we look at the average times for all the inserts before and after this change, they dropped from 250 ms to 2.5 ms (see the average query duration screenshot). The queries listed in the top 20 as time-consuming are just outliers. Also, if those queries had been running that long overall, we would have seen sustained IOPS spikes in the DB metrics, which is not happening in our case (instead, IOPS dropped from about 3000 to 2000 on average).