This project showcases my work on SQL queries and data handling within a protected work environment. While the source data is secured and parts of the code completed by others are excluded, the following demonstration provides a comprehensive view of the techniques and methodologies employed. This project involves transforming high-dimensional, nested financial data delivered annually by Skatteetaten (The Norwegian Tax Authority) to SSB (Statistics Norway) into a flattened, query-optimized format.
Every year, Skatteetaten delivers extensive company financial data to SSB. This data is characterized by its high dimensionality and nested schema, making direct analysis and processing challenging. An example of this nested structure can be seen in the 'naeringsspesifikasjon_2023_prod' dataset. The primary task is to flatten, clean, and store this data efficiently, ensuring it is easily queryable for subsequent statistical analysis.
To address the complexity of the nested financial data, SQL queries are utilized to flatten the data and store it as Parquet files. These files are structured in a manner that facilitates rapid querying. The transformation process includes:
1. Flattening the Nested Data: SQL queries are employed to denormalize and flatten the nested financial data.
2. Filtering and Cleaning: Data is filtered based on specific requirements and cleaned to ensure accuracy and consistency.
3. Storage Optimization: Data is stored in both large main files and smaller subsets categorized by financial data types, using a single unique identifier to link all tables. This strategy enhances query performance.
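The first two steps can be illustrated with a minimal sketch. The project itself does this work in SQL against the protected source data; the plain Python below only demonstrates the idea. The record layout, the field names under `resultatregnskap`, and the rule of dropping rows without an identifier are all hypothetical and not taken from the real schema.

```python
def flatten_record(record, parent_key="", sep="_"):
    """Recursively flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Descend into the nested struct and prefix its keys with the parent name.
            flat.update(flatten_record(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical nested records; the real data is not public.
raw_records = [
    {
        "id": "A1",
        "resultatregnskap": {
            "driftsinntekt": {"salgsinntekt": 1200.0},
            "driftskostnad": {"loennskostnad": 400.0},
        },
    },
    {
        "id": None,  # assumed cleaning rule: rows without an identifier are dropped
        "resultatregnskap": {"driftsinntekt": {"salgsinntekt": 50.0}},
    },
]

flat_rows = [flatten_record(r) for r in raw_records]
clean_rows = [row for row in flat_rows if row.get("id") is not None]
```

Each nested path becomes one column name, so the cleaned rows can be written out as a flat table.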
The data transformation process successfully flattens, cleans, and stores the financial data. Performance testing showed that storing the data in a wide format was optimal, contrary to initial assumptions. The choice of a wide format was also influenced by the additional data handling requirements of the downstream data editing software, which benefits from that structure.
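To make the wide-versus-long distinction concrete, here is a small sketch that pivots long-format rows (one row per identifier and field) into wide-format rows (one row per identifier, one column per field). The helper `long_to_wide` and the field names are hypothetical; this is not the project's own code.

```python
from collections import defaultdict


def long_to_wide(long_rows, id_key="id"):
    """Pivot rows of (id, field, value) into one dict per id with a column per field."""
    wide = defaultdict(dict)
    for row in long_rows:
        wide[row[id_key]][row["field"]] = row["value"]
    return [{id_key: ident, **fields} for ident, fields in wide.items()]


# Long format: three rows describing two companies.
long_rows = [
    {"id": "A1", "field": "salgsinntekt", "value": 1200.0},
    {"id": "A1", "field": "loennskostnad", "value": 400.0},
    {"id": "B2", "field": "salgsinntekt", "value": 800.0},
]

# Wide format: one row per company.
wide_rows = long_to_wide(long_rows)
```

In the wide layout, every value for a company sits in a single row, so a consumer that works on whole company records does not need to pivot first.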
The datasets are generated by querying the financial data provided by Skatteetaten. Two primary datasets are created: one for the income statement (resultatregnskap) and another for the balance sheet (balanseregnskap). To optimize query performance, subsets are created for specific financial themes such as operating income, current assets, and long-term liabilities. This segmentation ensures that queries target only relevant data segments, improving efficiency.
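The subset strategy can be sketched as follows. The project stores its tables as Parquet files; an in-memory SQLite database stands in here only so that the example runs with the standard library. The table names, column names, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical main balance sheet table, keyed by a single unique identifier.
conn.execute(
    "CREATE TABLE balanseregnskap ("
    "id TEXT PRIMARY KEY, varelager REAL, kundefordringer REAL, langsiktig_laan REAL)"
)
conn.executemany(
    "INSERT INTO balanseregnskap VALUES (?, ?, ?, ?)",
    [("A1", 50.0, 20.0, 300.0), ("B2", 10.0, 5.0, 0.0)],
)

# Thematic subsets: each keeps the identifier plus only the columns for its theme.
conn.execute(
    "CREATE TABLE omloepsmidler AS "
    "SELECT id, varelager, kundefordringer FROM balanseregnskap"
)
conn.execute(
    "CREATE TABLE langsiktig_gjeld AS "
    "SELECT id, langsiktig_laan FROM balanseregnskap"
)

# A query about current assets only needs to read the small subset.
current_assets = conn.execute(
    "SELECT id, varelager + kundefordringer FROM omloepsmidler ORDER BY id"
).fetchall()

# The shared identifier links subsets back together when needed.
joined = conn.execute(
    "SELECT o.id, o.varelager, g.langsiktig_laan "
    "FROM omloepsmidler AS o JOIN langsiktig_gjeld AS g ON o.id = g.id "
    "ORDER BY o.id"
).fetchall()
```

Because every subset carries the same identifier, a query can stay within one theme or join across themes without going back to the main table.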
Comprehensive speed tests were conducted to determine the most efficient data structures and query tools. Some of those tests are included in this repo. Various tools, including PySpark, PyArrow, Dask, and DuckDB, were evaluated. Additionally, data partitioning strategies were employed to enhance query performance. The results indicated that PyArrow offered the best performance for this use case.
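A speed test of this kind can be structured roughly as below. This is a generic harness built on the standard library's `timeit`, not the tests included in the repo. The two candidate functions are hypothetical stand-ins that sum one field from a row-oriented and a column-oriented layout; in practice each candidate would wrap a read with one of the evaluated tools.

```python
import timeit


def benchmark(candidates, repeat=3, number=10):
    """Return the best observed time per call, in seconds, for each named callable."""
    results = {}
    for name, func in candidates.items():
        timings = timeit.repeat(func, repeat=repeat, number=number)
        # The minimum is the least affected by background noise on the machine.
        results[name] = min(timings) / number
    return results


# Hypothetical stand-in data: the same values in two in-memory layouts.
rows = [{"id": i, "value": float(i)} for i in range(10_000)]
columns = {"id": list(range(10_000)), "value": [float(i) for i in range(10_000)]}


def sum_row_layout():
    return sum(r["value"] for r in rows)


def sum_column_layout():
    return sum(columns["value"])


timings = benchmark({"row_layout": sum_row_layout, "column_layout": sum_column_layout})
fastest = min(timings, key=timings.get)
```

Timing every candidate with the same `repeat` and `number` settings keeps the comparison fair, and checking that all candidates return the same result guards against benchmarking different work.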
The project's success lies in the efficient transformation and storage of high-dimensional financial data, enabling fast and effective querying. The cleaned and structured data is now securely stored in Google Cloud, where it is readily accessible to the statistics teams for further analysis and production of official statistics. The datasets, though large, support rapid queries, ensuring smooth performance of the in-house data editing software.