
GDPR right to be forgotten #153

Closed
12 tasks done
snocke opened this issue Oct 26, 2022 · 3 comments

@snocke
Contributor

snocke commented Oct 26, 2022

Scenario:

Event data from your e-commerce webshop is stored in HBase.
Detail data and master data are stored somewhere else.
Typical questions around this are:

  • how can I connect the data?
  • how can I easily manipulate the data?

Therefore, analyzing the event data can require an additional ETL job and use up more space. Since event data can get very large, you may need to aggregate the data and thus lose details.
Another challenge with event data from a webshop arises from the GDPR:

A customer exercises their right to be forgotten and wants all of their data deleted.
With HBase, you have access to a single row. Thus, a company is able to quickly find the customer, the events connected with the customer, and the master data.
Customers won't store all their data in HBase, so they need to query different sources to get the big picture. This demo will try to connect event data stored in HBase with data stored in S3 to enrich the events (data federation).
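
A minimal sketch of such a federated query, assuming the Phoenix-backed events table created later in this issue and a hypothetical hive catalog with a customers table on S3 (the hive catalog, schema, and its columns are illustrative, not part of the demo yet):

-- Hypothetical federated query: enrich the HBase events (via Phoenix)
-- with customer master data stored in S3 (via a Hive catalog).
SELECT e.event,
       e.itemid,
       e."timestamp",
       c.name,
       c.email
FROM phoenix.demo_schema.t_events AS e
JOIN hive.master_data.customers AS c
  ON e.visitorid = c.visitorid;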

As of now, we would first need to introduce ACLs on the HBase side, so we cannot execute the deletion query yet. This will become possible with a future enhancement of the demo.
Another future development could be modeling the data as a data vault and applying data warehouse concepts.

Tasks:

  • Find suitable clickstream data
  • Set up the stack: HDFS, HBase, Phoenix, Trino
  • Create a load job to import the data from Nexus to MinIO
  • Create a load job to import the HBase-relevant data into HDFS
  • Create a job that loads the file from S3
  • Create a script to transform the S3 file and add an hbase-row-key. We want to create a hash derived from the content, so we have a key that works in every "world" (see the sketch after this list)
  • Upload the new file with the hbase-row-key to S3
  • Find a good way to load the data into HBase, create a table/view in Phoenix and have it available in Trino (trino-phoenix-connector)
  • Create a view in Phoenix
  • Create Trino views with hashes that make the data joinable
  • Test query federation
  • Query the data and delete rows (deletion won't be possible at this point in time; ACLs need to be working on the HBase side)
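
One possible way to derive the content hash mentioned above, sketched in Trino SQL (the column names are the ones used later in this issue; the same hash could just as well be computed in the transformation script before uploading to S3):

-- Illustrative only: derive the row key as a SHA-256 hash over the event's
-- content, so the same key can be computed on the HBase and on the S3 side.
SELECT lower(to_hex(sha256(to_utf8(
         concat_ws('|', event, itemid, "timestamp", transactionid, visitorid)
       )))) AS hash_col,
       event, itemid, "timestamp", transactionid, visitorid
FROM phoenix.demo_schema.t_events;
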
@snocke snocke self-assigned this Oct 26, 2022
@snocke
Contributor Author

snocke commented Nov 30, 2022

The configuration and connection between HBase - Phoenix - Trino has been established.
So far, the hbase-site.xml needs to be adjusted to:

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.regionserver.wal.codec</name>
    <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper-server-default-0.zookeeper-server-default.gdpr.svc.cluster.local:2282/znode-d6933fb5-a660-4fe7-802e-7770877663b2</value>
  </property>
  <property>
    <name>phoenix.log.saltBuckets</name>
    <value>2</value>
  </property>
  <property>
    <name>phoenix.schema.isNamespaceMappingEnabled</name>
    <value>true</value>
  </property>
</configuration>

The trino-coordinator-default-catalog and trino-worker-default-catalog need the following adjustment:

data:
  phoenix.properties: |
    connector.name=phoenix5
    phoenix.config.resources=/stackable/config/catalog/phoenix/hbase-config/hbase-site.xml
    phoenix.connection-url=jdbc\:phoenix\:zookeeper-server-default-0.zookeeper-server-default.gdpr.svc.cluster.local\:2282\:/znode-d6933fb5-a660-4fe7-802e-7770877663b2/hbase
    unsupported-type-handling=CONVERT_TO_VARCHAR

The HBase discovery ConfigMap needs to be extended in the following way:

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper-server-default-0.zookeeper-server-default.gdpr.svc.cluster.local:2282/znode-d6933fb5-a660-4fe7-802e-7770877663b2</value>
  </property>
  <property>
    <name>phoenix.schema.isNamespaceMappingEnabled</name>
    <value>true</value>
  </property>
</configuration>

However, with DBeaver we can't see the actual data.
The row count returns the correct number of rows, but the actual data does not get displayed. This needs further investigation.

create schema "demo_schema";
SHOW SCHEMAS FROM phoenix;
                  
set session phoenix.unsupported_type_handling='CONVERT_TO_VARCHAR';
USE phoenix.demo_schema;
CREATE TABLE t_events (
  hash_col varchar,
  event varchar,
  itemid varchar,
  "timestamp" varchar,
  transactionid varchar,
  visitorid varchar
)
WITH (
  rowkeys = 'hash_col',
  default_column_family = 'cf1'
);

describe t_events;
SHOW TABLES FROM phoenix.demo_schema;
show columns from t_events;

select *
from phoenix.demo_schema.t_events
where visitorid in ('964404');

drop table t_events;
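
One avenue that might be worth checking for the empty-result problem is how the data was written to HBase: if the rows were loaded into HBase directly rather than through Phoenix, mapping the existing table with a Phoenix view (as planned in the task list) may be needed before the values become readable. A hedged sketch, assuming an HBase table demo_schema.t_events with column family cf1, executed against Phoenix itself:

-- Sketch only (not verified in this setup): map an existing HBase table
-- into Phoenix so its columns become queryable, then expose it via Trino.
CREATE VIEW "demo_schema"."t_events" (
  "hash_col"              VARCHAR PRIMARY KEY,
  "cf1"."event"           VARCHAR,
  "cf1"."itemid"          VARCHAR,
  "cf1"."timestamp"       VARCHAR,
  "cf1"."transactionid"   VARCHAR,
  "cf1"."visitorid"       VARCHAR
);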

@snocke
Contributor Author

snocke commented Dec 14, 2022

@lfrancke
Member

We have row-level deletes in the Trino Iceberg demo, which demonstrates what we wanted to achieve here. Closing.
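
For reference, such a row-level delete in Trino on an Iceberg table looks roughly like this (catalog, schema, and table names are placeholders, not taken from that demo):

-- Illustrative right-to-be-forgotten delete: remove all rows that belong
-- to the customer who requested deletion.
DELETE FROM iceberg.demo_schema.t_events
WHERE visitorid = '964404';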
