Data Generation Service

Stephen Cote edited this page Oct 7, 2019 · 16 revisions

Overview

The data generation service is used to create and evolve populations of people data, with corresponding events and locations, within a community project. While relatively simplistic, the logic follows an iterative evolution cycle by location, including immigration, emigration, marriage, divorce, birth, death, alignment, and event impact.

Setup

In order to use the data generation service, the data files must first be obtained, and then made accessible to the Web application. Refer to AccountManager/data/readme.txt for instructions to obtain the location and dictionary files.

Note about Database Tuning

If you haven't tuned your database before attempting any of the following, do so now. There are a number of functions and views that will wind up needing to process fairly large queries. Failure to tune appropriately will result in massive query times.

Staging

Stage the data directory in a filesystem location that the account running the AccountManagerService application can read. Read-only access is sufficient.

Configuration

Configure the following parameters in AccountManagerService/WEB-INF/web.xml:

  • data.generator.names: The path to the directory containing the names.json file.
  • data.generator.dictionary: The path to the Princeton WordNet dictionary files.
  • data.generator.location: The path to the GeoNames location exports. Obtain all of the code and info files listed in data/location/readme.txt, plus one or more countries of your choice.

Configure Organization for Communities

Due to the size of the data import, the DataGenerator is coded to work within a community rather than within the scope of a single user. Therefore, it is necessary to configure an organization to use communities, and to add a new community lifecycle and project. The following example uses the shell scripts to create a new organization and configure it to use communities.

Add a new organization under Development.

 ./auth.sh /Development Admin password
 ./addOrganization.sh /Development Admin "Test Community 1" password

Configure the new organization to use communities. This process adds a group and role structure appropriate for creating and sharing data within Lifecycle and Project structures.

 ./configureCommunity.sh "/Development/Test Community 1" Admin

Web Configuration

The following instructions apply to completing the configuration and data generation from the Web console.

Open the Web interface (with default settings: http://localhost:8080/AccountManagerService) and authenticate to the new organization by selecting Specify ... and entering the new organization path (/Development/Test Community 1), the user name (Admin), and the password.

After authenticating, enter community mode by clicking the community mode icon (black silhouette).

Add a new community.

From the menu, select Project. The default tab is Lifecycle.

Create a new lifecycle named Data Generator.

The new lifecycle will appear in the list, and will also appear in the community drop down menu. Select Data Generator from the community drop down menu. This will shift the UI focus to the selected lifecycle for related operations.

Importing Location Data

Open the newly created Data Generator lifecycle object by selecting the list item and clicking the open button. A section titled Configure Regions should be visible, with two country codes populated for the United States and Canada. Any country codes may be entered here, keeping in mind that the entirety of location data for those countries will be imported. For example, to load country data for Mexico and France, enter MX and FR, separated by a comma. Click the Configure button to load the specified datasets. Note that country data may be loaded only once per lifecycle, so an alternate combination or a refreshed dataset must be loaded into a separate lifecycle.

Once the data is imported, the Configure button will no longer be visible.

Because the location data exists at the lifecycle level, it is not immediately visible from the UI, which is primarily organized around selecting and viewing shared community projects. It can be found by unsetting the community project and then navigating to the specific community group rather than the project group. Also, due to the authorization rules, it is unlikely that any user other than the organization administrator will have access to that path unless specifically granted authorization to a community level role.

Create a Project

From the Project menu item, and with Data Generator selected from the Community list at the top of the screen, click the Projects tab, and click the New icon. Create a new project named Dataset 1.

Troubleshooting

If the status indicates the project failed to be created, it may be due to an existing issue in importing shared community methodology assets. Refresh the project list, delete the new project if it appears, and try creating the project again. This error only seems to happen sporadically when adding the first project to a newly created lifecycle.

Configure Project Regions and Initial Population

With the community and newly created project selected in the community selector menus, open the newly created project from within the Projects tab. Expand the Configure Region section to see the region count and initial population settings. Specify the number of regions (default: 3), which will be randomly selected locations from those loaded in the previous operation, and the initial population (default: 250) for each region. If the defaults are selected, the total initial population would be 750 people.

Evolve the Population

After configuring the project regions, that configuration section is replaced with an Evolve Region section. The two fields specify the epoch iteration (current iteration plus 1; e.g., the first value after configuring is 2) and the number of evolutions per epoch. Because the initial value is used for inception, it may be helpful to leave the default number of evolutions (12) to simulate a standard year, which will keep progressive date evolution consistent.

For example, to evolve the population by one year, click the Evolve button.

Navigate to the Events menu item to see the project related events. After the first evolution, there should be two events: first, an event to construct the region, and second, an event for the epoch. If the project object is still open, the epoch count will automatically increment, and clicking Evolve again will add another epoch (you will need to refresh the Events list to see the new entry). Each epoch event comprises events for each region, and each region event includes all of the individual events, such as immigrations, deaths, births, marriages, and divorces.

Navigate to the Identity menu item to see the generated people. Each evolution will affect this list. For example, after two evolutions, the total population size in one instance grew to 879. After evolving 100 years (set the incrementing epoch counter to 100; there is no need to click the button that many times), the total population grew to over 60 thousand.

Generating Identity and Access Management Data

Identity and access management data may be generated against the initial or evolved populations. Applications may be created with accounts linked to the person object, and affected by entitlements in the form of permissions, groups, or both. Database views provide easy access to the identity and account information for ingestion into identity management tools. This provides a convenient way to test large volumes of sanitary operational data, including access modeling and HR events (via the evolution).

To generate a new application, navigate to the Identity menu item and click the Applications tab. Click new application, specify a name, and expand the Generate Application section. Specify whether the generated application should include permissions or groups, the seed amount (minimum), the maximum value, and the distribution (the proportion of people who receive accounts; default is 1.0, or 100%). If both permissions and groups are selected, then permissions are assigned to groups, and accounts are made members of those groups. If only permissions are selected, the accounts are granted the permissions related to the application. If only groups are selected, the accounts are made members of those groups. Note that these permissions and groups grant no privilege to the system itself.

Create as many applications as desired.

Refer to the identity data queries for methods to access the generated identity data for integration as a test data source to identity management systems.

Console Configuration

The following instructions apply to completing the configuration and data generation from the console.

Add a new community.

 ./auth.sh "/Development/Test Community 1" Admin password
 ./addCommunity.sh "/Development/Test Community 1" Admin "Data Generator"

It may be necessary to clear the local script cache to flush the temporary community object prior to running the import command.

 ./clearLocalCache.sh

Importing Location Data

Use the configureCommunityLocations.sh script to direct AccountManagerService to import the GeoNames location and trait data into the specified community. These values are stored at the community lifecycle level, so when attempting to view with any user other than the Admin user, that user will need read access to the lifecycle in addition to any subordinate project.

WARNING: The following process is somewhat resource intensive depending on the number and size of country data being imported. Additional time is spent normalizing the location relationships and hierarchies, so this is not a raw bulk import.

 ./configureCommunityLocations.sh "/Development/Test Community 1" Admin "Data Generator" "US,CA,MX" true
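
Because the import is resource intensive, it may be worth checking the country list before invoking the script. The following is a minimal sketch, assuming the list must consist of two-letter uppercase codes separated by commas with no spaces, as in the example above. The helper name is_valid_country_list is hypothetical and is not part of the AccountManager scripts.

```shell
# Hypothetical helper: succeeds only if $1 looks like "US,CA,MX".
# Assumption: codes are two uppercase letters, comma-separated, no spaces.
is_valid_country_list() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[A-Z]{2}(,[A-Z]{2})*$'
}

countries="US,CA,MX"
if is_valid_country_list "$countries"; then
    echo "Country list looks valid: $countries"
else
    echo "Unexpected country list format: $countries" >&2
fi
```

The check only validates the shape of the list. It does not confirm that the corresponding GeoNames country files were staged in the data directory.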

Validate

Authenticate to the AccountManagerService Web application using the organization path /Development/Test Community 1, the user Admin, and the password.

Click the person icon in the top toolbar to switch to community mode, and select "Data Generator". (Bug note: It may be necessary to reselect this item)

Open the fly-out menu and select Events. Select the Locations tab and verify that some locations were loaded. Select the traits tab and verify the traits were loaded.

Generating Data

Following the setup process, an organization (/Development/Test Community 1) should now be configured for communities, with a new community (Data Generator) created and GeoNames location and trait data loaded into that community.

Create a Project

Create a new community project in the Data Generator community.

 ./addCommunityProject.sh "/Development/Test Community 1" Admin "Data Generator" "Dataset 1"

This can also be done in the example UI by changing to community mode, selecting Data Generator, and then creating a new Project. Note that if the path is "~/Projects" and the community selector is not visible in the title bar then the view is not in community mode.

Troubleshooting Tips

Creating communities and projects should be relatively quick. However, there are still some outstanding issues and nuances that may arise. When this happens, it may seem difficult to continue without seeing a bevy of errors. Issues include:

  • An error occurs on the server when populating default community project methodology artifacts.
  • The request is interrupted, or otherwise blocked due to database tuning, leaving an incomplete project model.

To recover and continue, clear the local cache and delete the community project using either the shell scripts or the UI. If deleting the community project fails due to a group or urn constraint, check whether the underlying group for the community was created, and delete that group if it exists. Then clean up any orphaned objects by running the following SQL commands (maintenance workers are provided to do this periodically, although a service is not yet provided to specifically invoke this action):

 select * from cleanup_orphans();
 select * from cleanup_rocket_orphans();

Try to create the community project again and it should succeed. The issue likely stems from a bug in the community project artifacts code on the initial create.

Configure Project Regions and Initial Population

Create an initial population of locations and a group of people for each location. Note the fifth parameter is the number of locations, and the sixth parameter is the initial population.

 ./configureCommunityProjectRegion.sh "/Development/Test Community 1" Admin "Data Generator" "Dataset 1" 3 250

In the example UI, change to community mode, select the Data Generator community and the Dataset 1 project, then navigate to the Events menu. One event with six child events (3 x 2) should be present, and one location with three child locations should be present. In the Identity menu, there should be 750 Persons constructed (3 x 250). In the Assets menu, under groups, a new group named Populations should be created, and navigating into that group will show a population group for each location.
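
As a quick sanity check when using other values, the expected person count is the number of regions multiplied by the initial population per region. A minimal sketch of that arithmetic follows. The helper name expected_population is hypothetical.

```shell
# Hypothetical helper: expected person count after the initial configuration.
# $1 = number of regions, $2 = initial population per region.
expected_population() {
    echo $(( $1 * $2 ))
}

expected_population 3 250   # prints 750
```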

Each new person includes home and work contacts, and an email contact, plus a number of attributes, including alignment, manager, trade, email (copied from but not kept in sync with the contact value), and a short uid.

Evolve the Population

The population can be evolved by specifying a number of epochs and evolutions per epoch. Each epoch and evolution generates events, in which each person is evaluated for person lifecycle events: birth, death, marriage, divorce, and migration. Subsequent evolutions are based on the total epoch count, so the epoch size must increase with each subsequent invocation. The timing of each evolution is tentatively framed as a month, so to keep the dates in sync it is suggested to leave the evolution count at 12.

For example, to evolve the population one year:

 ./evolveCommunityProjectRegion.sh "/Development/Test Community 1" Admin "Data Generator" "Dataset 1" 1 12

To view a text report of the population:

 ./reportCommunityProjectRegion.sh "/Development/Test Community 1" Admin "Data Generator" "Dataset 1"

The report shows the initial construction event, a summary of subsequent events, and the population details and demographics of the most recent event.

To continue evolving the community, such as for another 99 years, increase the epoch size to 100 (epoch 1 plus another 99).

 ./evolveCommunityProjectRegion.sh "/Development/Test Community 1" Admin "Data Generator" "Dataset 1" 100 12
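
Since the epoch argument is a cumulative total rather than an increment, it can help to compute the target before running the script. The sketch below only prints the command it would run. The helper name evolve_command is hypothetical, and the organization, community, and project values are simply the ones used in this walkthrough.

```shell
# Hypothetical helper: print the evolve command for a cumulative epoch target.
# $1 = epochs already evolved, $2 = additional epochs wanted,
# $3 = evolutions per epoch (defaults to 12, i.e. one year per epoch).
evolve_command() {
    target=$(( $1 + $2 ))
    printf './evolveCommunityProjectRegion.sh "%s" %s "%s" "%s" %s %s\n' \
        "/Development/Test Community 1" Admin "Data Generator" "Dataset 1" \
        "$target" "${3:-12}"
}

# One epoch already evolved, 99 more wanted: the target becomes 100.
evolve_command 1 99
```

Printing the command rather than executing it makes it easy to review the epoch target first, which matters because a long evolution can take a while to complete.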

The results will vary. For example, one test run saw the population grow to over sixty-nine thousand people.

Generating Identity and Access Management Data

The following example shows how to create an application for Dataset 1, which includes one hundred groups, but no permissions, distributed across every person (in the example evolution: nearly seventy thousand people).

 ./configureCommunityProjectApplication.sh "/Development/Test Community 1" Admin "Data Generator" "Dataset 1" "System of Record" false true 0 100 1.0
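
The trailing positional arguments are easy to mix up. The sketch below labels them and prints the resulting command. The order shown (permissions flag, groups flag, seed, maximum, distribution) is inferred from the example and its description above, not taken from the script's own documentation, and the helper name application_command is hypothetical.

```shell
# Hypothetical helper: print the application command with labelled arguments.
# Assumed order (inferred from the example above):
# $1 = application name, $2 = include permissions (true/false),
# $3 = include groups (true/false), $4 = seed (minimum),
# $5 = maximum, $6 = distribution (1.0 = every person gets an account).
application_command() {
    printf './configureCommunityProjectApplication.sh "%s" %s "%s" "%s" "%s" %s %s %s %s %s\n' \
        "/Development/Test Community 1" Admin "Data Generator" "Dataset 1" \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# Groups only, no permissions, up to 100, distributed across every person.
application_command "System of Record" false true 0 100 1.0
```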

Troubleshooting

If an error is received while invoking the shell scripts, try clearing the local cache to flush out any locally persisted objects.

 ./clearLocalCache.sh
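
When scripting several steps in a row, the clear-and-retry advice can be wrapped in a small helper. This is a minimal sketch, assuming a single retry is enough. The function name run_with_cache_retry and the CLEAR_CACHE_CMD variable are hypothetical and not part of the provided scripts.

```shell
# Command used to clear the cache; override it to test without the real script.
CLEAR_CACHE_CMD="${CLEAR_CACHE_CMD:-./clearLocalCache.sh}"

# Hypothetical helper: run a command; if it fails, clear the local cache
# and try the same command exactly once more.
run_with_cache_retry() {
    if "$@"; then
        return 0
    fi
    echo "Command failed; clearing the local cache and retrying once" >&2
    $CLEAR_CACHE_CMD || return 1
    "$@"
}
```

For example, run_with_cache_retry ./addCommunityProject.sh "/Development/Test Community 1" Admin "Data Generator" "Dataset 1" would retry the project creation after a cache clear. It does not perform the project deletion or database cleanup described under Troubleshooting Tips.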

Identity Data Queries

The following queries are provided as examples for retrieving generated identity information directly from the database for use as test data with other identity management systems. While some views are provided that aggregate the data, the queries are otherwise presented in long form because certain attributes and relationships, such as manager, are not first-class properties within Account Manager and are therefore related via foreign-keyed attribute values.

In order to obtain results for the account queries, at least one application must be generated.

Persons

 SELECT P.id, P.name, P.firstname,P.middlename,P.lastname,P.gender,ATM.value as manager,ATM2.value as email from identityservicepersons ISP
 inner join Persons P on P.id = ISP.personid
 left join Attribute ATM on ATM.referencetype = 'PERSON' AND ATM.referenceid = P.id AND ATM.name = 'manager'
 left join Attribute ATM2 on ATM2.referencetype = 'PERSON' AND ATM2.referenceid = P.id AND ATM2.name = 'email'
 where projectname = 'Dataset 1'
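
To reuse this query for other projects, the project name can be substituted from a script. The following is a minimal sketch that prints the Persons query for a given project name, doubling any single quotes so the SQL string literal stays valid. The helper name persons_query is hypothetical.

```shell
# Hypothetical helper: print the Persons query for the project named in $1.
# Single quotes in the name are doubled to keep the SQL literal well formed.
persons_query() {
    project=$(printf '%s' "$1" | sed "s/'/''/g")
    cat <<EOF
SELECT P.id, P.name, P.firstname,P.middlename,P.lastname,P.gender,ATM.value as manager,ATM2.value as email from identityservicepersons ISP
inner join Persons P on P.id = ISP.personid
left join Attribute ATM on ATM.referencetype = 'PERSON' AND ATM.referenceid = P.id AND ATM.name = 'manager'
left join Attribute ATM2 on ATM2.referencetype = 'PERSON' AND ATM2.referenceid = P.id AND ATM2.name = 'email'
where projectname = '$project'
EOF
}

persons_query "Dataset 1"
```

The printed statement can then be piped to whichever database client is in use. Connection details are site specific and are not shown here.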

Persons with Addresses

 SELECT P.id, P.name, P.firstname,P.middlename,P.lastname,P.gender,ATM.value as manager,ATM2.value as email,
 A1.addressLine1 as workAddress,A1.city as workCity,A1.state as workState,A1.region as workRegion,A1.postalCode as workPostalCode,A1.country as workCountry,
 A2.addressLine1 as homeAddress,A2.city as homeCity,A2.state as homeState,A2.region as homeRegion,A2.postalCode as homePostalCode,A2.country as homeCountry
 FROM identityservicepersons ISP
 inner join Persons P on P.id = ISP.personid
 inner JOIN contactinformation c ON c.referenceid = p.id AND c.contactinformationtype = 'PERSON'
 inner JOIN contactinformationparticipation cp ON cp.participationid = c.id
 inner JOIN addresses A1 ON A1.id = cp.participantid AND cp.participanttype = 'ADDRESS' AND A1.locationtype = 'WORK'
 inner JOIN contactinformationparticipation cp2 ON cp2.participationid = c.id
 inner JOIN addresses A2 ON A2.id = cp2.participantid AND cp2.participanttype = 'ADDRESS' AND A2.locationtype = 'HOME'
 left join Attribute ATM on ATM.referencetype = 'PERSON' AND ATM.referenceid = P.id AND ATM.name = 'manager'
 left join Attribute ATM2 on ATM2.referencetype = 'PERSON' AND ATM2.referenceid = P.id AND ATM2.name = 'email'
 where projectname = 'Dataset 1'

Accounts

 SELECT A.id, A.name, AT.value as owner FROM identityserviceapplicationaccounts ISA
 INNER JOIN Accounts A on A.id = ISA.accountid
 INNER JOIN Attribute AT on AT.referenceid = A.id AND AT.referencetype = 'ACCOUNT' AND AT.name = 'owner'
 WHERE projectname = 'Dataset 1' AND applicationname = 'System of Record'

Accounts with Groups

 SELECT A.id, A.name, AT.value as owner,string_agg(G.name,',') as groups FROM identityserviceapplicationaccounts ISA
 INNER JOIN Accounts A on A.id = ISA.accountid
 LEFT JOIN Attribute AT on AT.referenceid = A.id AND AT.referencetype = 'ACCOUNT' AND AT.name = 'owner'
 LEFT JOIN groupparticipation GP on GP.participantid = A.id AND gp.participanttype = 'ACCOUNT'
 LEFT JOIN groups G on G.id=GP.participationid
 WHERE projectname = 'Dataset 1' AND applicationname = 'System of Record'
 GROUP BY A.id,A.name,AT.value

Accounts with Permissions

 SELECT A.id, A.name, AT.value as owner,string_agg(GAR.permissionname,',') as permission FROM identityserviceapplicationaccounts ISA
 INNER JOIN Accounts A on A.id = ISA.accountid
 LEFT JOIN Attribute AT on AT.referenceid = A.id AND AT.referencetype = 'ACCOUNT' AND AT.name = 'owner'
 LEFT JOIN groupAccountRights GAR on GAR.accountId = A.id AND GAR.groupid = ISA.applicationid
 WHERE projectname = 'Dataset 1' AND applicationname = 'System of Record'
 GROUP BY A.id,A.name,AT.value

Groups with Permissions

 SELECT G.id, G.name, AT.value as owner,string_agg(P.name,',') as permission FROM identityserviceapplicationgroups
 INNER JOIN groups G on G.id=applicationgroupid
 LEFT JOIN Attribute AT on AT.referenceid = applicationgroupid AND AT.referencetype = 'GROUP' AND AT.name = 'owner'
 LEFT JOIN groupRights GAR on GAR.referenceid = applicationgroupid AND GAR.referencetype = 'GROUP'
 LEFT JOIN permissions P on P.id=GAR.affectid
 WHERE projectname = 'Dataset 1' AND applicationname = 'System of Record'
 GROUP BY G.id,G.name,AT.value

Permissions

 SELECT permissionname, ATT.value as businessdescription from identityServiceApplicationPermissions ISAP
 left join Attribute ATT on ATT.referencetype = 'PERMISSION' AND ATT.referenceid = ISAP.permissionid AND ATT.name = 'businessDescription'
 WHERE ISAP.applicationName = 'System of Record'