As far as the default SQL queries and documentation in joomla-data-mining-and-machine-learning go, what would be published is mostly
e-mails, the username and the name (Joomla does not enforce real names, but some users may enter them).
This comes from users who created an account on the site AND planned to create content (but since we also allow joomla-users.sql #4, the full table would be there).
PLANNED, but not implemented yet
Some special cases, like the User Action Logs (not created yet), could also be used to detect fraud or misbehavior.
Strategies to process server access logs (like Apache and Nginx); these can at least contain the user's IP (which can be used for fraud detection).
With all this in mind, while private uses may output the full name, e-mail and IP, I think the default output should at least take reasonable steps to not simply expose identifiable data.
Affected SQL exported files
Note: this only covers the data available at the time this issue was written.
joomla-users.sql
See this comment #4 (comment). For now, joomla-users.sql v1.1 only hides user_name by default, while both user_username and user_email are still there.
joomla-content-*.sql
All tables that output articles also mention the user. If some way to abstract the user is used in joomla-users.sql, the default strategy should be consistent with the other files.
Strategies to mitigate exposing private information by default
[full anonymization] Manually crafted identifier per project, keeping the references private
Maybe the ideal solution for serious projects is to not use any hash based on personal information at all, since in some special cases hashes could be used to reconstruct the original data (explained further below).
With this strategy, the person who exports the dataset would craft a specific table with a non-reversible mapping between the user_id and whatever the anonymized identifier is. This is likely to be considered full anonymization, not pseudonymization.
But for this project, maybe we just document that this is an option.
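As a rough illustration of the manually crafted identifier idea, the sketch below assigns each user_id a random identifier that is not derived from any personal data. The helper names (`build_private_mapping`, `anonymize_rows`), the `user_` prefix and the sample rows are hypothetical and not part of this project; the only assumption is that the exported rows carry a `user_id` field.

```python
import secrets


def build_private_mapping(user_ids, prefix="user_"):
    """Assign each user_id a random identifier not derived from personal data.

    The returned dict is the private reference table: whoever exports the
    dataset keeps it (or destroys it) and never publishes it.
    """
    mapping = {}
    used = set()
    for user_id in user_ids:
        if user_id in mapping:
            continue  # same user always keeps the same identifier
        token = prefix + secrets.token_hex(4)
        while token in used:  # very unlikely collision, but keep identifiers unique
            token = prefix + secrets.token_hex(4)
        used.add(token)
        mapping[user_id] = token
    return mapping


def anonymize_rows(rows, mapping, id_field="user_id"):
    """Return copies of the rows with the real id swapped for the random one."""
    return [{**row, id_field: mapping[row[id_field]]} for row in rows]


# Hypothetical exported rows (for example from a joomla-content-*.sql query).
rows = [
    {"user_id": 592, "article_title": "Hello"},
    {"user_id": 593, "article_title": "World"},
    {"user_id": 592, "article_title": "Again"},
]
mapping = build_private_mapping(row["user_id"] for row in rows)  # keep private
public_rows = anonymize_rows(rows, mapping)
```

Because the identifiers come from a random source rather than from the e-mail or username, nothing in the published rows can be recomputed from personal data; consistency across files depends only on reusing the same private mapping.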
Just use the user.id
Maybe one strategy could be to simply use whatever user_id the site is already using.
Good points
People who have access to the source dataset would still be able to tell what each ID means.
Bad points
It is hard to see the difference between numbers like 592, 593 and 594. This not only causes eye strain, it can also lead a human to draw wrong conclusions.
Numbers, when using Weka, Orange (and likely other data mining tools), may be interpreted as... numbers. But in fact they're more a categorical/nominal value than a number.
In very extreme situations where previous data has already been exposed, exporting IDs that cannot be changed can still be bad.
And yes, several countries have already had full private data leaked on the darknet. So if someone exports a new dataset to the public, whoever possesses older exposed data could use it to update their dataset.
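The point about IDs being read as numbers can be shown with a small sketch. `looks_numeric` is a hypothetical stand-in for the kind of type guessing a tool might do when loading a column; it is not how Weka or Orange actually decide, and the `u` prefix is only an assumption for illustration.

```python
def looks_numeric(values):
    """Naive type guess: a column is 'numeric' when every value parses as a float."""
    try:
        for value in values:
            float(value)
    except (TypeError, ValueError):
        return False
    return True


def as_nominal(user_ids, prefix="u"):
    """Turn raw ids into labels that can no longer be mistaken for quantities."""
    return [f"{prefix}{user_id}" for user_id in user_ids]


raw_ids = [592, 593, 594]
labels = as_nominal(raw_ids)

print(looks_numeric(raw_ids))  # raw ids pass the numeric guess
print(looks_numeric(labels))   # prefixed labels do not
```

A prefix like this only addresses the numeric-versus-nominal problem; it does nothing for privacy, since the original user_id is still visible.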
Hash based pseudonymization
Pseudonymization can be a good default strategy.
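A minimal sketch of what hash based pseudonymization could look like, assuming a keyed hash (HMAC-SHA256 from the Python standard library) rather than a plain hash. The keyed variant is a choice made for this example, not something the project has decided: a plain, unkeyed hash of a value such as an e-mail can be matched by hashing candidate values, which is one way a hash can lead back to the original data. The function name `pseudonymize`, the key value and the truncation length are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(user_id, secret_key, length=12):
    """Derive a stable pseudonym from user_id using a private key.

    The same user_id and key always give the same pseudonym, so exports from
    different tables stay consistent. Without the key, the pseudonym cannot
    be recomputed from the user_id.
    """
    message = str(user_id).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return digest[:length]  # truncated only to keep the exported column readable


# Hypothetical key: in practice this must be random and kept out of the repository.
secret_key = b"replace-with-a-private-random-key"

pseudonym_a = pseudonymize(592, secret_key)
pseudonym_b = pseudonymize(593, secret_key)
```

Anyone holding the key can recompute the pseudonym for a known user_id, which is why this counts as pseudonymization rather than full anonymization.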