Private information on exported files for Data Mining / Machine learning (refs: GDPR and LGPD) #5

fititnt · 2021-01-23T08:11:33Z

General Data Protection Regulation / GDPR
- https://en.wikipedia.org/wiki/General_Data_Protection_Regulation
Lei Geral de Proteção de Dados Pessoais / LGPD
Affected issues
- At least all outputs that return user data
  - joomla-users.sql (#4)
- In theory, text output (the content of articles) may also contain user data and e-mails.
  - For this issue, we may just warn the user about the implications.
Concepts
- Pseudonymization - https://en.wikipedia.org/wiki/Pseudonymization
- Full anonymization - https://en.wikipedia.org/wiki/Data_anonymization

As far as the default SQL queries and documentation of joomla-data-mining-and-machine-learning go, what would be published is mostly:

- E-mails, the username and the name (Joomla does not enforce real names, but some users may enter them).
- Data from users who created an account on the site AND planned to create content (but since we also allow joomla-users.sql (#4), the full users table would be there).
- PLANNED, but not implemented yet:
  - Some special cases, like the User Action Logs (not created yet), which could also be used to detect fraud or misbehavior.
  - Strategies to process server access logs (like Apache and NGinx); these contain at least the user's IP (which can be used for fraud detection).

With all this in mind, while private use may allow outputting the full name, e-mail, and IP, I think the default output should at least take reasonable steps to avoid simply exposing identifiable data.

Affected SQL exported files

Note: this only covers the files that exist at the time this issue is written.

joomla-users.sql

See this comment on #4. For now, joomla-users.sql v1.1 hides only user_name by default; both user_username and user_email are still there.

joomla-content-*.sql

All queries that output articles also reference the user. If joomla-users.sql adopts some way of abstracting the user, the same default strategy should be applied consistently to these files.

Strategies to avoid exposing private information by default

[full anonymization] Manually crafted identifier per project, keeping the references private

Maybe the ideal solution for serious projects is not to use any hash based on personal information at all, since hashes can in some special cases be used to reconstruct the original data (explained further below).

With this strategy, the person who exports the dataset would craft a dedicated table that maps each user_id to whatever the anonymized identifier is, in a way that cannot be reversed without that (private) table. This is likely to be considered full anonymization, not pseudonymization.

But for this project, maybe we just document that this is an option.
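If we do document it, a minimal sketch of the idea could look like the following. It assumes the export is post-processed in Python and that the mapping is stored somewhere private and never published; the function names, the `user_` prefix and the sample rows are all hypothetical, not something this project ships.

```python
import secrets


def build_private_mapping(user_ids, prefix="user_", nbytes=8):
    """Assign each user_id a random identifier that is not derived from user data."""
    mapping = {}
    used = set()
    for user_id in user_ids:
        token = prefix + secrets.token_hex(nbytes)
        while token in used:  # collisions are very unlikely, but retry anyway
            token = prefix + secrets.token_hex(nbytes)
        used.add(token)
        mapping[user_id] = token
    return mapping


def anonymize_rows(rows, mapping, key="user_id"):
    """Return copies of the rows with the user id replaced by its random identifier."""
    return [{**row, key: mapping[row[key]]} for row in rows]


# Hypothetical sample data; the mapping itself must stay private.
mapping = build_private_mapping([592, 593, 594])
rows = [{"user_id": 592, "article_id": 10}, {"user_id": 594, "article_id": 11}]
public_rows = anonymize_rows(rows, mapping)
```

Since the identifiers are random, losing the mapping means nobody (including the site owner) can link the exported rows back to users, which may or may not be what a given project wants.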

Just use the user.id

Maybe one strategy could be to simply use whatever user_id the site is already using.

Good points
- People who have access to the source dataset would still be able to know which user each ID refers to.
Bad points
- It is hard to see the difference between numbers like 592, 593 and 594. This not only causes eye strain, it can also lead a human to wrong conclusions.
- Numbers, when using Weka, Orange (and likely other data mining tools), may be interpreted as... numbers. But in fact they are categorical/nominal values rather than numeric ones.
- In the very extreme situation where previous data was already exposed, exporting IDs that cannot be changed can still be bad.
  - And yes, several countries have already had full private data leaked on the darknet. So if someone publishes a new dataset, whoever possesses the older exposed data could use it to update their dataset.
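On the "numbers interpreted as numbers" point only, one possible workaround is to export the ID as a text label. The sketch below assumes the export is post-processed in Python and that a non-numeric label is read as a categorical value by the tool; `as_nominal`, `export_csv` and the sample rows are hypothetical names for illustration. It does nothing for the privacy problem, since the original ID is still visible.

```python
import csv
import io


def as_nominal(user_id, prefix="u", width=6):
    """Turn a numeric id into a fixed-width text label such as 'u000592'."""
    return f"{prefix}{int(user_id):0{width}d}"


def export_csv(rows, fieldnames):
    """Write rows as CSV text, replacing user_id with its text label."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "user_id": as_nominal(row["user_id"])})
    return buffer.getvalue()


# Hypothetical sample data.
rows = [{"user_id": 592, "hits": 10}, {"user_id": 593, "hits": 3}]
csv_text = export_csv(rows, ["user_id", "hits"])
```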

Hash based pseudonymization

https://en.wikipedia.org/wiki/Pseudonymization

Pseudonymization can be a good default strategy.

Good points
- If very, very well done, it can be better than using the plain user_id.
Bad points
- Getting it very well done is not trivial. Some hashes, like MD5, are weak, and even when a salt is used to add entropy, some sources (under "Using Two-way Encryption") warn that with sufficient data the original salt can in some cases be discovered.
- Strong hashes produce far too many characters. SHA-512 (128 hexadecimal characters) is overkill.
- The hash algorithm has to be available on an average MariaDB/MySQL server.
- The result of the hash MUST NOT contain characters that would make Weka and other data mining tools complain. This is actually pretty important.
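To make the trade-offs above more concrete, here is a rough sketch of one keyed-hash approach: HMAC-SHA256 over the user_id with a private key, truncated to a short hexadecimal string. This is only an illustration under assumptions, not a decision for this project: it runs in Python rather than inside MariaDB/MySQL, and `PROJECT_KEY`, `pseudonymize` and the 16-character length are hypothetical choices.

```python
import hashlib
import hmac

# Hypothetical placeholder; a real key would be random and kept private.
PROJECT_KEY = b"replace-with-a-private-random-key"


def pseudonymize(user_id, secret_key, length=16):
    """Return a short hex pseudonym for user_id using HMAC-SHA256.

    The output only contains the characters 0-9 and a-f, so it should not
    need quoting or escaping in CSV-like exports.
    """
    message = str(user_id).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    # Truncation shortens the identifier but raises the chance of collisions.
    return digest[:length]


pseudonym_a = pseudonymize(592, PROJECT_KEY)
pseudonym_b = pseudonymize(593, PROJECT_KEY)
```

The same user_id always gives the same pseudonym for a given key, so joomla-users.sql and joomla-content-*.sql outputs would stay consistent with each other. Anyone who learns the key can recompute the pseudonyms from known IDs, so this is still pseudonymization, not anonymization.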
