Description
We have changed one of the most time-consuming insert queries into a bulk insert query and noticed a positive difference in DB performance.
POC PR: #996
DB Metrics And Observations
Before this change
Dashboard link here in case you want to look at other metrics as well in a sample snapshot.
After this change
Dashboard link here in case you want to look at other metrics as well in a sample snapshot.
If we look at the spikes after this bulk insert change, they are much narrower, IOPS dropped by the thousands, and the Clair app side also performs better (i.e. the I/O wait time of requests is reduced). This will ultimately help us reduce cost on both the Clair DB and app side.
Profiling Report And Observations
Before this change. Complete report here
After this change. Complete report here
If we look at the average query execution times, there is a 100-fold difference after moving to a bulk insert strategy (i.e. 250 ms down to 2.5 ms). The overall query run time also went down, along with the number of queries hitting the DB, ultimately resulting in fewer IOPS on the DB. (This keeps the DB less busy, allowing more incoming requests to be served, and also helps reduce the spikes seen in both the Clair app and the DB.)
Summary And Conclusion
Still, a better suggestion would be either:
Moving the entire schema to column-based storage, as columnar stores are well optimized to run this kind of bulk periodic workload (i.e. every 6 hours) using their distributed execution pattern (i.e. sharding, which is not present in PostgreSQL). A good list of column-store DBs worth exploring, in case we decide to move to one in the future: https://www.g2.com/categories/columnar-databases.
OR
Taking the below action items on PostgreSQL:
https://www.postgresql.org/docs/13/performance-tips.html
https://www.postgresql.org/docs/13/parallel-query.html
Moving from single INSERT/UPDATE/DELETE statements to bulk INSERT/UPDATE/DELETE statements, where a single statement acts on multiple rows of a table. Supporting posts and example snippets to perform bulk operations:
def approval_query_executor(self, table_id, table_name, columns, approval_type):
    """
    executes approval query for cntl_approval_info in rds
    :param table_id: table id according to cntl_table_info in rds
    :param table_name: table name according to cntl_table_info in rds
    :param columns: columns to be approved
    :param approval_type: type of approval like add, update and delete
    :return: None
    """
    executer = QueryExecuter(self.default_args['rds_host'], self.default_args['rds_port'],
                             self.default_args['rds_dbName'], self.default_args['rds_password'],
                             self.default_args['rds_username'])
    approval_info = self.default_args['approval_info']
    # Build a single multi-row INSERT instead of issuing one INSERT per column.
    cntl_columns_approval_query = """INSERT INTO %s (table_id, src_table_name, src_column_name, column_datatype,
    approval_type, approval_flag, last_updated_date, approved_by, active_flag, status, data_length, data_precision,
    data_scale) VALUES """
    params = (approval_info,)
    if len(columns) > 0:
        for column_number in range(0, len(columns)):
            # Append one VALUES tuple per column that needs approval.
            cntl_columns_approval_query += """(%s, '%s', '%s', '%s', '%s', %s, '%s', '%s', %s, '%s', %s, %s, %s),"""
            params += (table_id, table_name, columns[column_number][1], columns[column_number][2], approval_type,
                       0, str(datetime.now()), 'Dejavu-Gatekeeper', 1, 'pending approval for approval in rds',
                       columns[column_number][3], columns[column_number][4], columns[column_number][5])
        # Replace the trailing comma with a semicolon and run the whole batch as one statement.
        cntl_columns_approval_query = cntl_columns_approval_query[:-1] + ";"
        executer.execute_query(cntl_columns_approval_query, params, approval_info)
    else:
        logging.info("No columns to send for approval to {0} in {1}".format(approval_type, table_name))
def update_sequence_number(self, new_columns, table_name, target=None):
    """
    updates sequence number of a table in cntl_column_info in rds
    :param new_columns: list of new columns
    :param table_name: table name
    :param target: flag to determine update in source or target sequence number
    :return: None
    """
    executer = QueryExecuter(self.default_args['rds_host'], self.default_args['rds_port'],
                             self.default_args['rds_dbName'], self.default_args['rds_password'],
                             self.default_args['rds_username'])
    column_info = self.default_args['column_info']
    columns = list()
    new_columns = tuple(new_columns)
    seq_fashion = 'src_seq_no' if target is None else 'trgt_seq_no'
    if len(new_columns) != 0:
        # Build one bulk UPDATE that sets every sequence number through a CASE expression,
        # instead of issuing one UPDATE per column.
        update_sequence_number_query = """UPDATE %s SET %s = CASE src_column_name """
        params = (column_info, seq_fashion)
        for index in range(0, len(new_columns)):
            update_sequence_number_query += """WHEN '%s' THEN %s """
            params += (new_columns[index][1], index + 1)
            columns.append(new_columns[index][1])
        update_sequence_number_query += """END WHERE src_column_name IN """ + str(
            columns).replace('[', '(').replace(']', ')') + """ AND src_table_name='%s' AND src_active_flag=%s
            AND active_flag=%s;"""
        params += (table_name, 1, 1)
        executer.execute_query(update_sequence_number_query, params, column_info)
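If the app talks to PostgreSQL through psycopg2, an alternative to hand-building the VALUES list is psycopg2.extras.execute_values, which batches many rows into a few multi-row INSERT statements. This is only a minimal sketch under that assumption; the table and column names below are illustrative placeholders, not the actual cntl schema:

# Minimal sketch (assumes psycopg2 is available); table/column names are illustrative only.
import psycopg2
from psycopg2.extras import execute_values

def bulk_insert_rows(conn, rows):
    # rows: list of tuples, e.g. [(table_id, 'src_table', 'src_col'), ...]
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO cntl_approval_info (table_id, src_table_name, src_column_name) VALUES %s",
            rows,
            page_size=1000,  # rows packed into each generated multi-row INSERT
        )
    conn.commit()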
We can also think of modifying the current ER model of the DB (which appears to be in 3NF). But modifying the DB schema requires thorough knowledge of how each attribute/column and entity relates to the business, and of what activities are expected to happen between the entities in the model. One simple option would be to keep the data in a flat format instead of normalizing it further, which would make querying faster but would introduce data duplication, which is again a trade-off. If we do decide to modify the existing schema, we need to consider the points above so that we can design it in such a way that we can leverage bulk statements like INSERT/UPDATE/DELETE/COPY on top of it to perform bulk operations.
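Since COPY is mentioned above: COPY is typically the fastest bulk-load path in PostgreSQL. A minimal sketch using psycopg2's copy_expert follows, with an assumed table name and CSV layout (placeholders, not the real schema):

# Minimal sketch of a COPY-based bulk load (assumed table name and CSV layout).
import csv
import io
import psycopg2

def copy_rows(conn, rows):
    # Serialize the rows into an in-memory CSV buffer and stream it with COPY.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY cntl_approval_info (table_id, src_table_name, src_column_name) FROM STDIN WITH CSV",
            buf,
        )
    conn.commit()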
Why is the max and average duration so much worse? I don't understand why the graphs don't seem to correlate at all to the numbers in the table. It looks like the execution time is worse in the giant-query version.
Yes, there are some queries that ran for a long time (the bulk ones), and those appear in the top-twenty list. But the overall average time is much lower with the giant-query execution. If we look at the average times for all the inserts before and after this change, they dropped from 250 ms to 2.5 ms (see the average query duration screenshot). The queries listed in the top 20 as time-consuming are just outliers. Also, if those queries had been running that long overall, we would have seen sustained IOPS spikes in the DB metrics, which is not happening in our case (instead, IOPS dropped from about 3000 to 2000 on average).