Part of a project I work on, copied from a work repository. Demonstrates use of SQL and basic data handling. No sensitive data is included. The program will not run as-is, because the source data is protected and I have excluded parts of the code written by others.

jep739/sql-naeringsspesifikasjon

SQL-Naeringsspesifikasjon

Project Overview

This project showcases my work on SQL queries and data handling in a protected work environment. The source data is secured and code written by others is excluded, so the repository demonstrates the techniques and methods used rather than a runnable pipeline. The project transforms high-dimensional, nested financial data, delivered annually by Skatteetaten (the Norwegian Tax Administration) to SSB (Statistics Norway), into a flattened, query-optimized format.

Problem Statement

Every year, Skatteetaten delivers extensive company financial data to SSB. The data is high-dimensional and uses a nested schema, which makes direct analysis and processing difficult. The 'naeringsspesifikasjon_2023_prod' dataset is an example of this nested structure. The primary task is to flatten, clean, and store this data efficiently so that it is easy to query for subsequent statistical analysis.

Solution

SQL queries flatten the nested financial data, and the result is stored as Parquet files structured for fast querying. The transformation has three steps:

1. Flattening the Nested Data: SQL queries are employed to denormalize and flatten the nested financial data.

2. Filtering and Cleaning: Data is filtered based on specific requirements and cleaned to ensure accuracy and consistency.

3. Storage Optimization: Data is stored in both large main files and smaller subsets categorized by financial data types, using a single unique identifier to link all tables. This strategy enhances query performance.
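The flattening and filtering steps can be sketched with a small, self-contained example. It uses Python's built-in sqlite3 module and SQLite's json_each table-valued function as a stand-in for whatever SQL engine the real pipeline uses. The table name raw_delivery, the JSON layout, and the field names are all hypothetical, since the real schema is protected, and the sketch assumes the SQLite build includes the JSON functions.

```python
import json
import sqlite3

# Hypothetical nested delivery: one JSON document per company id.
deliveries = [
    ("A1", {"driftsinntekt": [{"type": "salgsinntekt", "beloep": 1000},
                              {"type": "annen_driftsinntekt", "beloep": 50}]}),
    ("B2", {"driftsinntekt": [{"type": "salgsinntekt", "beloep": 250}]}),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_delivery (id TEXT PRIMARY KEY, payload TEXT)")
con.executemany(
    "INSERT INTO raw_delivery VALUES (?, ?)",
    [(company_id, json.dumps(doc)) for company_id, doc in deliveries],
)

# json_each expands the nested array into one row per element, so each
# company id is repeated once per post (the flattening step). The WHERE
# clause drops posts without an amount (a simple cleaning step).
flat_rows = con.execute(
    """
    SELECT r.id,
           json_extract(post.value, '$.type')   AS type,
           json_extract(post.value, '$.beloep') AS beloep
    FROM raw_delivery AS r,
         json_each(r.payload, '$.driftsinntekt') AS post
    WHERE json_extract(post.value, '$.beloep') IS NOT NULL
    ORDER BY r.id, post.key
    """
).fetchall()

for row in flat_rows:
    print(row)
```

Every flat row carries the company id, which is what allows the flattened tables to be linked back together later.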

Results

The transformation successfully flattens, cleans, and stores the financial data. Performance testing showed that storing the data in a wide format was the best option, contrary to initial assumptions. The deciding factor was the additional data handling required by the downstream data editing software, which works best with a wide structure.
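To make the wide-versus-long distinction concrete, here is a minimal sketch that pivots a long table (one row per company and post) into a wide table (one row per company, one column per post) using conditional aggregation. It runs on an in-memory SQLite database; the table and column names are hypothetical and are not taken from the real datasets.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Long format: one row per (company id, post).
con.execute("CREATE TABLE long_format (id TEXT, post TEXT, beloep REAL)")
con.executemany(
    "INSERT INTO long_format VALUES (?, ?, ?)",
    [
        ("A1", "salgsinntekt", 1000.0),
        ("A1", "loennskostnad", 400.0),
        ("B2", "salgsinntekt", 250.0),
    ],
)

# Wide format: one row per company id, one column per post.
# A post that is missing for a company becomes NULL in its column.
wide_rows = con.execute(
    """
    SELECT id,
           SUM(CASE WHEN post = 'salgsinntekt'  THEN beloep END) AS salgsinntekt,
           SUM(CASE WHEN post = 'loennskostnad' THEN beloep END) AS loennskostnad
    FROM long_format
    GROUP BY id
    ORDER BY id
    """
).fetchall()

for row in wide_rows:
    print(row)
```

In the wide layout a consumer that needs all posts for one company reads a single row, which is one plausible reason a record-oriented editing tool would favor it.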

Data Queries

The datasets are generated by querying the financial data provided by Skatteetaten. Two primary datasets are created: one for the income statement (resultatregnskap) and another for the balance sheet (balanseregnskap). To optimize query performance, subsets are created for specific financial themes such as operating income, current assets, and long-term liabilities. This segmentation ensures that queries target only relevant data segments, improving efficiency.
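As an illustration of the subset idea, the sketch below derives a theme table from a main income statement table and links the two through a shared identifier. It uses an in-memory SQLite database instead of Parquet files, and the column names and values are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical main table for the income statement, one row per company.
con.execute(
    """
    CREATE TABLE resultatregnskap (
        id TEXT PRIMARY KEY,
        salgsinntekt REAL,
        annen_driftsinntekt REAL,
        loennskostnad REAL
    )
    """
)
con.executemany(
    "INSERT INTO resultatregnskap VALUES (?, ?, ?, ?)",
    [("A1", 1000.0, 50.0, 400.0), ("B2", 250.0, 0.0, 100.0)],
)

# Theme subset: only the operating income columns, plus the linking id.
con.execute(
    """
    CREATE TABLE driftsinntekt AS
    SELECT id, salgsinntekt, annen_driftsinntekt
    FROM resultatregnskap
    """
)

# A theme-specific query touches only the small subset.
theme_rows = con.execute(
    """
    SELECT id, salgsinntekt + annen_driftsinntekt AS sum_driftsinntekt
    FROM driftsinntekt
    ORDER BY id
    """
).fetchall()

# The shared id links the subset back to the main table when needed.
joined_rows = con.execute(
    """
    SELECT d.id, d.salgsinntekt, r.loennskostnad
    FROM driftsinntekt AS d
    JOIN resultatregnskap AS r ON r.id = d.id
    ORDER BY d.id
    """
).fetchall()

print(theme_rows)
print(joined_rows)
```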

Performance Testing

Speed tests were run to find the most efficient data structures and query tools; some of these tests are included in this repo. Several tools were evaluated, including PySpark, PyArrow, Dask, and DuckDB, and data partitioning strategies were used to improve query performance. PyArrow gave the best performance for our use case.
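A comparison of this kind can be set up with a small timing harness like the one below. This is only a sketch of the general approach, not the tests from the repo: the two workloads are plain Python stand-ins, and in a real comparison each callable would wrap a query in PySpark, PyArrow, Dask, or DuckDB. The stand-in timings say nothing about how those tools compare.

```python
import time
from typing import Callable, Dict


def best_time(func: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in data: hypothetical wide records keyed by company id.
data = {
    f"id{i}": {"salgsinntekt": float(i), "loennskostnad": float(i) / 2}
    for i in range(10_000)
}


def read_all_columns() -> list:
    # Stand-in for a query that reads every column.
    return [tuple(row.values()) for row in data.values()]


def read_one_column() -> list:
    # Stand-in for a query that selects a single column.
    return [row["salgsinntekt"] for row in data.values()]


candidates: Dict[str, Callable[[], object]] = {
    "all_columns": read_all_columns,
    "one_column": read_one_column,
}

results = {name: best_time(func) for name, func in candidates.items()}

# Print candidates from fastest to slowest.
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Taking the minimum over several runs reduces noise from caching and background load, which matters when the candidates are close.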

Conclusion

The project's success lies in the efficient transformation and storage of high-dimensional financial data, enabling fast and effective querying. The cleaned and structured data is now securely stored in Google Cloud, where it is readily accessible to the statistics teams for further analysis and production of official statistics. The datasets, though large, support rapid queries, ensuring smooth performance of the in-house data editing software.
