Consistent protection of additional tables #7

mancag commented Apr 15, 2024

Dear all,
If all tables are protected by small count rounding at the same time, consistent protection among them is guaranteed. Is it possible to protect all hypercubes for census 2021 at the same time, or is this set of tables too big? And if additional tables are defined later, can consistency with the already protected tables be achieved?
Best regards,
Manca

olangsrud (Collaborator) commented

Dear Manca,

Is it possible to protect all hypercubes for census 2021 at the same time, or is this set of tables too big?

The set of tables is too big to be handled straightforwardly, but with the help of some tricks the Norwegian data has been consistently protected using SmallCountRounding. The limiting factor is the maximum length of data vectors in R, which in turn caps the number of non-zero elements in a sparse matrix. The formula for all hypercubes in the 32 total population groups, written so that all levels of aggregation are included, gives a total of 3340 individual terms in the model. Since geo_n was constant, this could be reduced to 2283 terms. To ensure the model matrix does not exceed the limit, the maximum number of rows is then (2^31 - 1)/2283 = 940641 (integer division). Consequently, the number of inner cells involved in the algorithm must not exceed 940641.
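
For reference, this bound can be checked with a few lines of base R. The term counts 3340 and 2283 are the ones quoted above; the calculation just divides the integer limit on vector length by the number of model terms:

```r
# Maximum number of elements the (sparse) model matrix may hold,
# limited by the maximum length of an integer-indexed vector in R.
max_elements <- .Machine$integer.max   # 2^31 - 1 = 2147483647

n_terms_all   <- 3340   # all hypercube terms, all aggregation levels included
n_terms_fixed <- 2283   # after treating geo_n as constant

# Worst-case bound on the number of rows (inner cells) in the model matrix
max_rows <- max_elements %/% n_terms_fixed
max_rows
#> [1] 940641
```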

Here are some points describing some of what was done (a sketch of the corresponding call follows the list):

  • The sparse model matrix within the algorithm became larger than what is technically possible in the current version of R, because the maximum length of data vectors in R was exceeded. Input rows that could safely remain unchanged were therefore removed. Since the execution is done in a way that should guarantee no 1s and 2s in the final result (identifyNew = TRUE), this had to be done carefully.
  • All input variables were retained in the output by using the parameter avoidHierarchical = TRUE. This ensures full control in cases where the same code is used for different hierarchical levels.
  • The parameter maxIterRows = 46000 was used (the default is 1000). This setting is close to the limit caused by the maximum length of data vectors in R; it relates to a cross-product matrix within the algorithm.
  • The time-consuming generation of the model matrix was done once (taking 3.5 hours), and the algorithm was run multiple times with varied rndSeed (random generator seed) and step parameters. The best solution, judged by deviations, used step = 600.

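As a rough illustration, the points above correspond to a call of the following shape. This is only a sketch, not the production code: census_inner, the frequency variable "freq" and the formula are placeholders, and exactly where each parameter is accepted (directly by PLSrounding or passed on to RoundViaDummy) may depend on the package version. The removal of safe input rows and the reuse of a precomputed model matrix mentioned above are not shown.

```r
library(SmallCountRounding)

# census_inner: placeholder for the inner cells, with frequencies in "freq"
rounded <- PLSrounding(
  data              = census_inner,
  freqVar           = "freq",
  roundBase         = 3,                  # the package default
  formula           = ~ geo * sex * age,  # placeholder for the full hypercube formula
  identifyNew       = TRUE,   # guarantees no 1s and 2s in the final result
  avoidHierarchical = TRUE,   # keep all input variables in the output
  maxIterRows       = 46000,  # default is 1000
  step              = 600,    # the value that gave the smallest deviations
  rndSeed           = 123     # varied over several runs; 123 is only an example
)
```
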
If additional tables are defined later, can consistency with the already protected tables be achieved?

The method can be viewed as one that changes microdata, and microdata can always be aggregated in new ways. Tables that were not initially included may, however, show large deviations. In any case, this only works if the same variables are present in the microdata; I have not dealt with the problem of handling new variables.
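
Concretely, consistency for tables defined later comes from keeping the rounded inner cells and aggregating them again. A minimal sketch, assuming the rounded inner cells from the original run are stored in a data frame inner_rounded with the dimension variables and a column "rounded" (all names here are placeholders):

```r
# Any new table over the same variables is obtained by plain aggregation of the
# rounded inner cells, and is then automatically consistent with the tables
# already published from the same run.
new_table <- aggregate(rounded ~ geo + age, data = inner_rounded, FUN = sum)
```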

Best,
Øyvind
