
GDPR right to be forgotten #153

Closed
12 tasks done
snocke opened this issue Oct 26, 2022 · 3 comments

@snocke
Contributor

snocke commented Oct 26, 2022

Scenario:

Event data from your e-commerce webshop is stored in HBase.
Detail data and master data are stored somewhere else.
Typical questions around this are:

  • how can I connect the data?
  • how can I easily manipulate the data?

Therefore, analyzing the event data can require an additional ETL job and use up more space. Since event data can get very large, you may need to aggregate the data and thus lose details.
Another challenge with event data from a webshop arises from the GDPR:

A customer exercises their right to be forgotten and wants all of their data deleted.
With HBase, you have access to a single row. Thus, a company is able to quickly find the customer, the events connected with the customer, and the master data.
Customers won't store all their data in HBase, so they need to query different sources to get the big picture. This demo will try to connect event data stored in HBase with data stored in S3 to enrich the events (data federation).
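
A minimal sketch of such a federated query, assuming the Phoenix-backed events table created later in this issue and a hypothetical hive catalog with a customers table on S3 (the hive catalog, schema, and its columns are illustrative, not part of the demo yet):

-- Hypothetical federated query: enrich the HBase events (via Phoenix)
-- with customer master data stored in S3 (via a Hive catalog).
SELECT e.event,
       e.itemid,
       e."timestamp",
       c.name,
       c.email
FROM phoenix.demo_schema.t_events AS e
JOIN hive.master_data.customers AS c
  ON e.visitorid = c.visitorid;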

As of now, we would first need to introduce ACLs on the HBase side, so we cannot execute the deletion query yet. This will become possible with a future enhancement of the demo.
Another future development could be modeling the data as a data vault and applying data warehouse concepts.

Tasks:

  • Find suitable clickstream data
  • Set up the stack: HDFS, HBase, Phoenix, Trino
  • Create a load job to import the data from Nexus to MinIO
  • Create a load job to import the HBase-relevant data into HDFS
  • Create a job that loads the file from S3
  • Create a script to transform the S3 file and add an hbase-row-key. We want to create a hash derived from the content, so we have a key that works in every "world" (see the sketch after this list)
  • Upload the new file with the hbase-row-key to S3
  • Find a good way to load the data into HBase, create a table/view in Phoenix and have it available in Trino (trino-phoenix-connector)
  • Create a view in Phoenix
  • Create Trino views with hashes that make the data joinable
  • Test query federation
  • Query the data and delete rows (deletion won't be possible at this point in time; ACLs need to be working on the HBase side)
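
One possible way to derive the content hash mentioned above, sketched in Trino SQL (the column names are the ones used later in this issue; the same hash could just as well be computed in the transformation script before uploading to S3):

-- Illustrative only: derive the row key as a SHA-256 hash over the event's
-- content, so the same key can be computed on the HBase and on the S3 side.
SELECT lower(to_hex(sha256(to_utf8(
         concat_ws('|', event, itemid, "timestamp", transactionid, visitorid)
       )))) AS hash_col,
       event, itemid, "timestamp", transactionid, visitorid
FROM phoenix.demo_schema.t_events;
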
@snocke snocke self-assigned this Oct 26, 2022
@snocke
Contributor Author

snocke commented Nov 30, 2022

The configuration and connection between HBase - Phoenix - Trino has been established.
So far, the hbase-site.xml needs to be adjusted to:

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.regionserver.wal.codec</name>
    <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper-server-default-0.zookeeper-server-default.gdpr.svc.cluster.local:2282/znode-d6933fb5-a660-4fe7-802e-7770877663b2</value>
  </property>
  <property>
    <name>phoenix.log.saltBuckets</name>
    <value>2</value>
  </property>
  <property>
    <name>phoenix.schema.isNamespaceMappingEnabled</name>
    <value>true</value>
  </property>
</configuration>

The trino-coordinator-default-catalog and trino-worker-default-catalog need the following adjustment:

data:
  phoenix.properties: |
    connector.name=phoenix5
    phoenix.config.resources=/stackable/config/catalog/phoenix/hbase-config/hbase-site.xml
    phoenix.connection-url=jdbc\:phoenix\:zookeeper-server-default-0.zookeeper-server-default.gdpr.svc.cluster.local\:2282\:/znode-d6933fb5-a660-4fe7-802e-7770877663b2/hbase
    unsupported-type-handling=CONVERT_TO_VARCHAR

The HBase discovery ConfigMap needs to be extended in the following way:

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper-server-default-0.zookeeper-server-default.gdpr.svc.cluster.local:2282/znode-d6933fb5-a660-4fe7-802e-7770877663b2</value>
  </property>
  <property>
    <name>phoenix.schema.isNamespaceMappingEnabled</name>
    <value>true</value>
  </property>
</configuration>

However, with DBeaver we can't see the actual data.
The row count returns the correct number of rows, but the actual data does not get displayed. This needs further investigation.

create schema "demo_schema";
SHOW SCHEMAS FROM phoenix;
                  
set session phoenix.unsupported_type_handling='CONVERT_TO_VARCHAR';
USE phoenix.demo_schema;
CREATE TABLE t_events (
  hash_col varchar,
  event varchar,
  itemid varchar,
  "timestamp" varchar,
  transactionid varchar,
  visitorid varchar
)
WITH (
  rowkeys = 'hash_col',
  default_column_family = 'cf1'
);

describe t_events;
SHOW TABLES FROM phoenix.demo_schema;
show columns from t_events;

select *
from phoenix.demo_schema.t_events
where visitorid in ('964404');

drop table t_events;
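
One avenue that might be worth checking for the empty-result problem is how the data was written to HBase: if the rows were loaded into HBase directly rather than through Phoenix, mapping the existing table with a Phoenix view (as planned in the task list) may be needed before the values become readable. A hedged sketch, assuming an HBase table demo_schema.t_events with column family cf1, executed against Phoenix itself:

-- Sketch only (not verified in this setup): map an existing HBase table
-- into Phoenix so its columns become queryable, then expose it via Trino.
CREATE VIEW "demo_schema"."t_events" (
  "hash_col"              VARCHAR PRIMARY KEY,
  "cf1"."event"           VARCHAR,
  "cf1"."itemid"          VARCHAR,
  "cf1"."timestamp"       VARCHAR,
  "cf1"."transactionid"   VARCHAR,
  "cf1"."visitorid"       VARCHAR
);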

@snocke
Contributor Author

snocke commented Dec 14, 2022

@lfrancke
Member

We have row-level deletes in the Trino Iceberg demo, which demonstrates what we wanted to achieve here. Closing.
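
For reference, such a row-level delete in Trino on an Iceberg table looks roughly like this (catalog, schema, and table names are placeholders, not taken from that demo):

-- Illustrative right-to-be-forgotten delete: remove all rows that belong
-- to the customer who requested deletion.
DELETE FROM iceberg.demo_schema.t_events
WHERE visitorid = '964404';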
