Private information on exported files for Data Mining / Machine learning (refs: GDPR and LGPD) #5

Open
fititnt opened this issue Jan 23, 2021 · 0 comments


fititnt commented Jan 23, 2021


The default SQL queries and documentation in joomla-data-mining-and-machine-learning would mostly publish:

  1. E-mails, the username, and the name (Joomla does not enforce real names, but some users may enter them).
  2. Data from users who created an account on the site AND planned to create content (but since we also allow joomla-users.sql, see #4, the full users table would be there).
  3. PLANNED, but not implemented yet:
  • Some special cases, like the User Action Logs (not created yet), which could also be used to detect fraud or misbehavior.
  • Strategies to process server access logs (like Apache and NGINX); these contain at least the user's IP (which can be used for fraud detection).

With all this in mind, while private uses may output the full name, e-mail, and IP, I think the default output should at least take reasonable steps to avoid simply exposing identifiable data.

Affected SQL exported files

Note: this list only covers the files that exist at the time this issue was written.

joomla-users.sql

See this comment: #4 (comment). For now, joomla-users.sql v1.1 hides only user_name by default, while both user_username and user_email are still there.

joomla-content-*.sql

All tables that output articles also mention the user. If some way to abstract the user is adopted in joomla-users.sql, the default strategy should be consistent across the other files.

Strategies to avoid exposing private information by default

[full anonymization] Manually crafted identifier per project, keeping the references private

Maybe the ideal solution for serious projects is to not use hashes based on any personal information at all, since in some special cases hashes could be used to reconstruct the original data (explained further below).
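To make the reconstruction risk concrete, here is a minimal sketch, assuming the pseudonym is a plain, unkeyed SHA-256 of the e-mail address. The function names and the example addresses are hypothetical and not part of this project. Anyone holding a list of candidate e-mails (for instance from an older leak) can hash them and match the results against the published values.

```python
import hashlib


def unkeyed_pseudonym(email: str) -> str:
    """Plain SHA-256 of a normalized e-mail (the weak approach, for illustration)."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def reidentify(published_hashes, candidate_emails):
    """Map published hashes back to e-mails found in a candidate list."""
    lookup = {unkeyed_pseudonym(e): e for e in candidate_emails}
    return {h: lookup[h] for h in published_hashes if h in lookup}


# Hypothetical published dataset and a hypothetical older leak.
published = [
    unkeyed_pseudonym("alice@example.org"),
    unkeyed_pseudonym("bob@example.org"),
]
leaked = ["carol@example.org", "Alice@Example.org"]

recovered = reidentify(published, leaked)
```

In this sketch `recovered` contains one entry, because one leaked address matches a published hash after normalization.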

With this strategy, the person who exports the dataset would specially craft a table with a non-reversible mapping between the user_id and whatever the anonymized identifier is. This is likely to be considered full anonymization, not pseudonymization.

But for this project, maybe we just document that this is an option.
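If we do document it, a rough sketch of the idea could look like the following. It assumes the exported rows are available as Python dictionaries with a `user_id` field; the function names and the row layout are hypothetical. The identifiers are drawn at random, so they carry no information derived from personal data, and the mapping itself is what has to stay private.

```python
import secrets


def build_private_mapping(user_ids, token_bytes=8):
    """Assign each user_id a random token that is not derived from any user data."""
    mapping = {}
    used = set()
    for uid in user_ids:
        token = secrets.token_hex(token_bytes)
        # Extremely unlikely, but retry if a token was already handed out.
        while token in used:
            token = secrets.token_hex(token_bytes)
        used.add(token)
        mapping[uid] = token
    return mapping


def anonymize_rows(rows, mapping, id_field="user_id"):
    """Return copies of the rows with the id field replaced by its token."""
    result = []
    for row in rows:
        new_row = dict(row)
        new_row[id_field] = mapping[row[id_field]]
        result.append(new_row)
    return result


# Hypothetical exported rows.
rows = [
    {"user_id": 592, "title": "First article"},
    {"user_id": 593, "title": "Second article"},
    {"user_id": 592, "title": "Third article"},
]
mapping = build_private_mapping({row["user_id"] for row in rows})
public_rows = anonymize_rows(rows, mapping)
```

Only `public_rows` would be published; `mapping` is kept private by whoever exported the dataset.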

Just use the user.id

One strategy could be to simply use whatever user_id the site is already using.

  • Good points
    • People who have access to the source dataset would still be able to tell which user each ID refers to.
  • Bad points
    • It is hard to see the difference between numbers like 592, 593, and 594. This not only causes eye strain, it can also lead a human to draw wrong conclusions.
    • Numbers, when using Weka, Orange (and likely other data mining tools), may be interpreted as... numbers. But in fact they are more a categorical/nominal value than a number.
    • In the very extreme situation where previous data has already been exposed, exporting IDs that cannot be changed can still be bad.
      • And yes, several countries have already had full private data leaked on the darknet. So if someone publishes a new dataset, whoever possesses the older exposed data could use it to update their dataset.
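For the numeric-interpretation point only, one possible workaround is to export the ID as a non-numeric label. This is a minimal sketch; the `as_nominal` helper and the `u` prefix are hypothetical. It does nothing for the readability or re-identification concerns listed above.

```python
def as_nominal(user_id: int, prefix: str = "u") -> str:
    """Turn a numeric ID into a string label so it is not parsed as a number."""
    return f"{prefix}{user_id}"


labels = [as_nominal(i) for i in (592, 593, 594)]
```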

Hash based pseudonymization

Pseudonymization can be a good default strategy.
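As a rough illustration, here is a minimal sketch of one way this could be done, assuming a keyed hash (HMAC-SHA256) over the user_id with a secret key that is never published. The function name, the truncation length, and the placeholder key are assumptions for the example, not something this project already implements.

```python
import hashlib
import hmac


def pseudonymize(user_id, secret_key: bytes, length: int = 12) -> str:
    """Derive a stable pseudonym from user_id using a keyed hash (HMAC-SHA256)."""
    message = str(user_id).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    # Truncated for readability; a shorter value raises the chance of collisions.
    return digest[:length]


# Hypothetical placeholder; a real key would be random and stored privately.
example_key = b"replace-with-a-private-random-key"
pseudonym = pseudonymize(592, example_key)
```

The same user_id and key always give the same pseudonym, so joomla-users.sql and joomla-content-*.sql exports would stay consistent with each other, while a different key per published dataset gives unrelated pseudonyms.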

  • Good points
    • If done very, very well, it can be better than using the plain user_id.
  • Bad points