test metadig-engine on k8s against a hashstore #453

Open
jeanetteclark opened this issue Oct 2, 2024 · 14 comments

jeanetteclark commented Oct 2, 2024

Testing locally has gone well, but it would be nice to test the engine against a hashstore on the dev cluster.

To that end, I've mounted the tdg subvolume on metadig-worker; that subvolume is also mounted on dev.nceas, where a hashstore-backed metacat is running. See helm/metadig-worker/pv.yaml and helm/metadig-worker/pvc.yaml for details on the existing mounts.
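For context, a minimal sketch of what a pre-provisioned claim for that mount could look like (the name, namespace, and size here are assumptions for illustration; the real definitions live in helm/metadig-worker/pv.yaml and pvc.yaml):

# hypothetical PVC binding to a statically provisioned CephFS-backed PV
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tdg-repos-dev           # assumed name
  namespace: metadig
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""          # skip dynamic provisioning; bind to the named PV below
  volumeName: tdg-repos-dev     # the pre-created PV for the CephFS tdg subvolume
  resources:
    requests:
      storage: 1Gi              # placeholder size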

In order to actually test, though, the following steps are needed:

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be var/data/repos/dev/hashstore
  • deploy metadig-engine to the test cluster
  • submit datasets to dev.nceas (via metacatUI or any other client)
@jeanetteclark jeanetteclark moved this from Ready to In Progress in Metadig Data Quality Oct 2, 2024
@doulikecookiedough doulikecookiedough self-assigned this Oct 8, 2024

doulikecookiedough commented Oct 11, 2024

Update:

The rsync + parallel process to copy the contents of /var/metacat/hashstore to /mnt/tdg-repos/dev/metacat/hashstore has been completed.

  • The re-sync process takes approximately 25 minutes to complete using just the first-level folders/items in the /var/metacat/hashstore folder.
    • In the meantime, I am re-running the process with individual rsync commands for each file to see if it is any faster

Next Steps:

  • Sync up with Jing to coordinate the metacat switchover (and symlinking the new directory)

To Do List:

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • deploy metadig-engine to the test cluster
  • submit datasets to dev.nceas (via metacatUI or any other client)

For reference:

# How to produce a text file with just the first level of hashstore folders to rsync
mok@dev:~/testing$ sudo find /var/metacat/hashstore -mindepth 1 -maxdepth 1 > mc_hs_dir_list.txt
mok@dev:~/testing$ cat mc_hs_dir_list.txt
/var/metacat/hashstore/objects
/var/metacat/hashstore/metadata
/var/metacat/hashstore/refs
/var/metacat/hashstore/hashstore.yaml

# How to use rsync with a list of folders
mok@dev:~/testing$ cat mc_hs_dir_list.txt | parallel --eta sudo rsync -aHAX {} /mnt/tdg-repos/dev/metacat/hashstore/
# First get the list of files found under `/hashstore`
mok@dev:~/testing$ sudo find /var/metacat/hashstore -type f -printf '%P\n' > mc_obj_list.txt

# How to feed rsync one file at a time
# The /./ between `metacat` and `hashstore` marks where the relative path begins, so rsync -R copies paths starting at hashstore/ (omitting the leading directories) into the destination folder
mok@dev:~/testing$ parallel --eta sudo rsync -aHAXR /var/metacat/./hashstore/{} /mnt/tdg-repos/dev/metacat :::: mc_obj_list.txt
  • Note: If the number of jobs is not specified, parallel determines its own limit (here it went to the max # of cores, e.g. 44). Adding -j 30 limited it to 30 concurrent jobs.


doulikecookiedough commented Oct 14, 2024

Metacat on dev.nceas.ucsb.edu has been moved over to write to the CephFS mount point - a symlink has been created between /var/metacat/hashstore and /mnt/tdg-repos/dev/metacat/hashstore.

  • Note: We initially ran into a read-only file system issue caused by how Tomcat sets up its access control rules (the actual write path above needed to be added to its configuration settings).

rsync was re-run, and syncing with the list of direct subfolders under /var/metacat/hashstore was the fastest approach. I also tested feeding rsync individual commands (ex. via :::: list_of_files.txt), but this was very slow. The re-sync took approximately 5 minutes.
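For reference, a sketch of that switchover (exact commands may have differed; the backup directory name is illustrative):

# move the original directory aside, then point /var/metacat/hashstore at the CephFS copy
sudo mv /var/metacat/hashstore /var/metacat/hashstore.orig
sudo ln -s /mnt/tdg-repos/dev/metacat/hashstore /var/metacat/hashstore
# verify that the symlink resolves to the CephFS mount
ls -l /var/metacat/hashstore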


doulikecookiedough commented Oct 20, 2024

Current Status:

It appears the 'Assessment Reports' (Metadig) for datasets at dev.nceas.ucsb.edu are not working as expected.

Next Steps:

  1. Restoring expected Metadig functionality @ dev.nceas.ucsb.edu

    • The dev cluster's metadig-controller, metadig-scorer and metadig-scheduler are all on image v3.0.2; metadig-worker is using the feature-hashstore-support image. Before attempting to deploy the feature-hashstore-support image to the scorer, scheduler and controller per Jeanette's instructions, I will restore metadig-worker to image v3.0.2 to try and resolve the issue on the test site.
  2. Obtaining the last missing feature-hashstore-support image for metadig-controller

    • metadig-controller also does not have a feature-hashstore-support image. This will require the execution of mvn publish while on the correct branch for the metadig-engine. I likely do not have appropriate permissions and will seek assistance from Jing to move forward here.
  3. Deploying feature-hashstore-support for Metadig in full on the dev cluster

    • Once all the images are available, I will deploy feature-hashstore-support after updating the image.tag in the respective values.yaml files (four total, one for each metadig-engine component)
    • Below are the commands for quick reference:
      helm upgrade metadig-scheduler ./metadig-scheduler --namespace metadig --set image.pullPolicy=Always --recreate-pods=true --set k8s.cluster=dev
      helm upgrade metadig-scorer ./metadig-scorer --namespace metadig --set image.pullPolicy=Always --recreate-pods=true --set k8s.cluster=dev
      helm upgrade metadig-worker ./metadig-worker --namespace metadig --set image.pullPolicy=Always --set replicaCount=1 --recreate-pods=true --set k8s.cluster=dev
      helm upgrade metadig-controller ./metadig-controller --namespace metadig --set image.pullPolicy=Always --recreate-pods=true --set k8s.cluster=dev
      

To Do List & Follow-up Questions

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • Fix broken Assessment reports on dev.nceas.ucsb.edu
  • Deploy metadig-engine to the test cluster
    • Once deployed, determine how to check that a data quality check can be requested/is working as expected (a way to directly communicate with Metadig?)
  • Submit datasets to dev.nceas (via metacatUI or any other client)
    • Running the quality checks at scale (many objects) to make sure the system is performing without errors


doulikecookiedough commented Oct 22, 2024

Update:

  • The Assessment Reports began generating successfully again after reverting the metadig-controller back to the v3.0.2 image
  • Successfully deployed the metadig-scheduler, metadig-scorer and metadig-worker with the feature-hashstore-support images - however, the Assessment Reports did not work.
  • After reviewing the existing changes in feature-hashstore-support, and the logs from the metadig-controller & metadig-worker - it looks like the engine is unable to communicate with solr to get the list of data pids. Currently debugging.
    20241022-16:43:28: [ERROR]: Unable to run quality suite. [edu.ucsb.nceas.mdqengine.Worker:224]
    edu.ucsb.nceas.mdqengine.exception.MetadigException: Unable to run quality suite for pid urn:uuid:761e9125-f775-4bf8-9a80-8cc970a52353, suite FAIR-suite-0.4.0Failed : HTTP error code : 403
        at edu.ucsb.nceas.mdqengine.Worker.processReport(Worker.java:568)
        at edu.ucsb.nceas.mdqengine.Worker$1.handleDelivery(Worker.java:212)
        at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
        at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:111)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
    Caused by: java.lang.RuntimeException: Failed : HTTP error code : 403
        at edu.ucsb.nceas.mdqengine.MDQEngine.findDataPids(MDQEngine.java:271)
        at edu.ucsb.nceas.mdqengine.MDQEngine.runSuite(MDQEngine.java:120)
        at edu.ucsb.nceas.mdqengine.Worker.processReport(Worker.java:564)
        ... 6 more
    
  • A new controller image does not seem necessary at this time as the feature-hashstore-support code changes do not involve the metadig-controller, so I've paused here for now.

To Do List & Follow-up Questions

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • Fix broken Assessment reports on dev.nceas.ucsb.edu
  • Deploy feature-hashstore-support image to metadig-worker, metadig-scheduler and metadig-scorer pods in the dev test cluster
  • Fix bug/issue with retrieving data objects from solr (http error code 403)
    • Once deployed, determine how to check that a data quality check can be requested/is working as expected (a way to directly communicate with Metadig?)
  • Submit datasets to dev.nceas (via metacatUI or any other client)
    • Running the quality checks at scale (many objects) to make sure the system is performing without errors


doulikecookiedough commented Oct 23, 2024

Update:

  • The Metadig Assessment Reports are still unable to generate.
    • The URL that appears to be causing the issue should be, and is, publicly accessible:
      https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/q=isDocumentedBy:%22urn:uuid:c559c233-8bf9-42b4-98df-8558f4a4776a%22
      try {
          String nodeEndpoint = D1Client.getMN(nodeId).getNodeBaseServiceUrl();
          String encodedId = URLEncoder.encode(identifier, "UTF-8");
          String queryUrl = nodeEndpoint + "/query/solr/?q=isDocumentedBy:" + "\"" + encodedId + "\"" + "&fl=id";
      
          URL url = new URL(queryUrl);
          HttpURLConnection connection = (HttpURLConnection) url.openConnection();
          connection.setRequestMethod("GET");
          connection.setRequestProperty("Accept", "application/xml");
          if (dataOneAuthToken != null) {
              connection.setRequestProperty("Authorization", "Bearer " + dataOneAuthToken);
          }
      
          if (connection.getResponseCode() != 200) {
              // Line 271
              throw new RuntimeException("Failed : HTTP error code : " + connection.getResponseCode());
          }
          ...
      
    • Adjusting the metadig-worker deployment's environment variable to make use of the dataone-secret does not appear to have any effect (below for quick reference).
      • In the findDataPids method, it appears that we want to include a token when requesting the data objects (in case we are searching for private datasets?). If it cannot find an environment variable, it falls back to the config, which notes that the token is not set there.
      # /metadig-worker/templates/deployment.yaml
      
      ...
      env:
          - name: JAVA_OPTS
            value: "-Dlog4j2.formatMsgNoLookups=true"
          - name: DATAONE_AUTH_TOKEN
            valueFrom:
              secretKeyRef:
                name: dataone-token
                key: DataONEauthToken
      


doulikecookiedough commented Oct 25, 2024

Update:

Even after fixing the connection URL (below), I am still experiencing an HTTP 403 Forbidden error.

String encodedId = URLEncoder.encode(identifier, "UTF-8");
// This is necessary for metacat's solr to process the requested queryUrl
String encodedQuotes = URLEncoder.encode("\"", "UTF-8");
String queryUrl = nodeEndpoint + "/query/solr/?q=isDocumentedBy:" + encodedQuotes + encodedId + encodedQuotes + "&fl=id";

The endpoint shown in the logging message is accessible both via the browser and from within the metadig-worker pod itself. Metacat's solr index does not have specific access control rules, so this GET request from the metadig-worker should be processable.

[email protected]:~/Code/testing/metadig $ kubectl exec -it metadig-worker-75c5689d69-4tt4v /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.

# curl "https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/?q=isDocumentedBy:%22urn%3Auuid%3Aae970e0a-3a26-4af7-8a84-235c9a8e3a5d%22&fl=id"
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">19</int>
  <lst name="params">
    <str name="q">isDocumentedBy:"urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d"</str>
    <str name="fl">id</str>
    <str name="fq">(readPermission:"public")OR(writePermission:"public")OR(changePermission:"public")OR(isPublic:true)</str>
    <str name="wt">javabin</str>
    <str name="version">2</str>
  </lst>
</lst>
<result name="response" numFound="5" start="0" numFoundExact="true">
  <doc>
    <str name="id">urn:uuid:9ebcadac-b015-48fb-a2c5-1ff7db692f19</str></doc>
  <doc>
    <str name="id">urn:uuid:75db2307-4b78-4a8b-bc59-5b2ce318519f</str></doc>
  <doc>
    <str name="id">urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d</str></doc>
  <doc>
    <str name="id">urn:uuid:52106ea7-f24b-4247-a697-272023fb158e</str></doc>
  <doc>
    <str name="id">urn:uuid:b3dd42d8-7489-4d95-bcba-81940bdefbe2</str></doc>
</result>
</response>

The DATAONE_AUTH_TOKEN does not seem to make any difference (confirmed that it is set as an environment variable, both in the logs and with the command kubectl exec -t metadig-worker-75c5689d69-4tt4v -- env)

# Error log

20241025-21:43:14: [DEBUG]: Running suite: FAIR-suite-0.4.0 [edu.ucsb.nceas.mdqengine.MDQEngine:97]
20241025-21:43:14: [DEBUG]: Got token from env. [edu.ucsb.nceas.mdqengine.MDQEngine:241]
20241025-21:43:16: [DEBUG]: queryURL: https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/?q=isDocumentedBy:%22urn%3Auuid%3Aae970e0a-3a26-4af7-8a84-235c9a8e3a5d%22&fl=id [edu.ucsb.nceas.mdqengine.MDQEngine:264]
20241025-21:43:16: [ERROR]: Unable to run quality suite. [edu.ucsb.nceas.mdqengine.Worker:224]
edu.ucsb.nceas.mdqengine.exception.MetadigException: Unable to run quality suite for pid urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suite FAIR-suite-0.4.0Failed : HTTP error code : 403
	at edu.ucsb.nceas.mdqengine.Worker.processReport(Worker.java:568)
	at edu.ucsb.nceas.mdqengine.Worker$1.handleDelivery(Worker.java:212)
	at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
	at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:111)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Failed : HTTP error code : 403
	at edu.ucsb.nceas.mdqengine.MDQEngine.findDataPids(MDQEngine.java:275)
	at edu.ucsb.nceas.mdqengine.MDQEngine.runSuite(MDQEngine.java:120)
	at edu.ucsb.nceas.mdqengine.Worker.processReport(Worker.java:564)
	... 6 more
20241025-21:43:16: [DEBUG]: Saving quality run status after error [edu.ucsb.nceas.mdqengine.Worker:240]
20241025-21:43:16: [DEBUG]: Saving to persistent storage: metadata PID: urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suite id: FAIR-suite-0.4.0 [edu.ucsb.nceas.mdqengine.model.Run:272]
20241025-21:43:16: [DEBUG]: Done saving to persistent storage: metadata PID: urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suite id: FAIR-suite-0.4.0 [edu.ucsb.nceas.mdqengine.model.Run:277]
20241025-21:43:16: [DEBUG]: Saved quality run status after error [edu.ucsb.nceas.mdqengine.Worker:249]
20241025-21:43:16: [DEBUG]: Sending report info back to controller... [edu.ucsb.nceas.mdqengine.Worker:390]
20241025-21:43:16: [INFO]: Elapsed time processing (seconds): 0 for metadataPid: urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suiteId: FAIR-suite-0.4.0
 [edu.ucsb.nceas.mdqengine.Worker:422]

I have a feeling that this is related to how k8s allows external REST API calls to be made (or not). The specific Java code making the GET request appears to be fine (since it can communicate and receive a 403 error). Investigation continues.


mbjones commented Oct 25, 2024

I have a feeling that this is related to how k8s allows external REST API calls to be made (or not).

k8s does not restrict pods from originating web connections to external hosts in any way unless it is configured to do so. MetaDIG is not configured to restrict anything afaik. You and I should touch base on this because I think you are following a red herring and the problem originates elsewhere. Your curl command from the pod shows that the connection is not blocked. So it's something else about how you deployed. Let's chat.


doulikecookiedough commented Oct 25, 2024

@mbjones I think so too - I can't find anything related to that. I just pushed a commit to test whether the request is getting rejected because it's missing a User-Agent property. I'll send you a PM via Slack and/or send you a calendar invite.

Deployment code for quick reference (taken from hand-off notes):
helm upgrade metadig-worker ./metadig-worker --namespace metadig --set image.pullPolicy=Always --set replicaCount=1 --recreate-pods=true --set k8s.cluster=dev

With the following changes in the respective metadig-worker deployment files:

  • values.yaml:

    image:
      repository: ghcr.io/nceas/metadig-worker
      pullPolicy: Always
      tag: "feature-hashstore-support"
    
  • deployment.yaml under env

    - name: DATAONE_AUTH_TOKEN
      valueFrom:
        secretKeyRef:
          name: dataone-token
          key: DataONEauthToken
    


doulikecookiedough commented Oct 25, 2024

@mbjones The Assessment Report generated after adding the User-Agent property to the Java code!
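For reference, a minimal sketch of the change, assuming the same HttpURLConnection setup shown in the earlier findDataPids snippet (the header value shown here is the one settled on later in this thread):

// solr behind dev.nceas appeared to reject requests without a browser-like User-Agent,
// so set one explicitly alongside the existing Accept/Authorization headers
connection.setRequestProperty("User-Agent", "Mozilla/MetadigEngine (feature-hashstore-support)");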

To Do List & Follow-up Questions

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • Fix broken Assessment reports on dev.nceas.ucsb.edu
  • Deploy feature-hashstore-support image to metadig-worker, metadig-scheduler and metadig-scorer pods in the dev test cluster
  • Fix bug/issue with retrieving data objects from solr (http error code 403)
    • Determine how to properly include/handle User-Agent as part of the get request (can I simply use something like java/17.0.1-temurin as the value - will it be accepted by solr?)
  • Once deployed, determine how to check that a data quality check can be requested/is working as expected (a way to directly communicate with Metadig?)
  • Submit datasets to dev.nceas (via metacatUI or any other client)
    • Running the quality checks at scale (many objects) to make sure the system is performing without errors


doulikecookiedough commented Oct 28, 2024

Update:

  • While the Assessment Reports generated after updating the User-Agent, it appears that the RabbitMQ queue is no longer being populated when new datasets are added to dev.nceas.ucsb.edu
  • I thought something might have prevented the new datasets from entering the queue/postgres, so I triggered a re-run by setting metadig-postgres's last_harvest_datetime for the urn:node:mnTestKNB node to a date in the past (ex. 2024-10-24T00:00:00.000Z) per the operations manual (a rough sketch of this appears after this list). This produced an uptick in RabbitMQ, but it only affects datasets up to a specific date (not the new ones I added).
  • Restarting the controller, scheduler, scorer and worker does not seem to have helped. These new datasets are not found in the metadig-postgres runs table. Currently investigating where the breakdown in communication is occurring.
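For reference, a rough psql sketch of that re-trigger (the table and column names below are assumptions for illustration only; the operations manual documents the actual metadig-postgres schema):

-- hypothetical table/column names; verify against the real schema before running
UPDATE node_harvest
   SET last_harvest_datetime = '2024-10-24T00:00:00.000Z'
 WHERE node_id = 'urn:node:mnTestKNB';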


mbjones commented Oct 28, 2024

@doulikecookiedough regarding your question on how to directly communicate with metadig, that would be via the API. Most operations require authentication, but you can, for example, access completed run reports with a request like:

https://api.test.dataone.org/quality/runs/FAIR-suite-0.4.0/urn:uuid:0b44a2d5-dcd5-4798-8072-4030b14e8936

This one doesn't work, as it appears the FAIR-suite-0.4.0 was not run for the PID listed. You can get an overview of the whole API at https://api.test.dataone.org/quality/ -- but note that only a portion of the planned methods were implemented - others are still TBD, and some were disabled for security reasons. A useful one is getting the list of current suites, which is at https://api.test.dataone.org/quality/suites/.

If the API doesn't provide what you need, you can query the database itself via psql.
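For quick reference, the two read-only calls above as curl commands (replace <pid> with an actual metadata PID):

# list the quality suites currently registered
curl "https://api.test.dataone.org/quality/suites/"

# fetch a completed run report for a given suite and PID
curl "https://api.test.dataone.org/quality/runs/FAIR-suite-0.4.0/<pid>"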


doulikecookiedough commented Oct 28, 2024

Thank you for the clarification/direction @mbjones. Currently it looks like there's an issue with the scheduler - after restarting the pods (making sure the chart and app versions were both updated), some NullPointerExceptions are being thrown. This may explain why the FAIR-suite-0.4.0 check isn't being run for the new PIDs being added to the urn:node:mnTestKNB node.

20241028-18:02:10: [ERROR]: quality-test-dataone-fair: error creating rest client: Cannot assign field "after" because "link.before" is null [edu.ucsb.nceas.mdqengine.scheduler.RequestReportJob:190]
20241028-18:02:10: [INFO]: Job metadig.quality-test-dataone-fair threw a JobExecutionException:  [org.quartz.core.JobRunShell:218]
org.quartz.JobExecutionException: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null [See nested exception: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null]
	at edu.ucsb.nceas.mdqengine.scheduler.RequestReportJob.execute(RequestReportJob.java:191)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
Caused by: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null
	at org.apache.commons.collections.map.AbstractLinkedMap.removeEntry(AbstractLinkedMap.java:293)
	at org.apache.commons.collections.map.AbstractHashedMap.removeMapping(AbstractHashedMap.java:543)
	at org.apache.commons.collections.map.AbstractHashedMap.remove(AbstractHashedMap.java:325)
	at org.apache.commons.configuration.BaseConfiguration.clearPropertyDirect(BaseConfiguration.java:133)
	at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
	at org.apache.commons.configuration.CompositeConfiguration.clearPropertyDirect(CompositeConfiguration.java:269)
	at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
	at org.apache.commons.configuration.AbstractConfiguration.setProperty(AbstractConfiguration.java:483)
	at org.dataone.client.rest.HttpMultipartRestClient.setDefaultTimeout(HttpMultipartRestClient.java:588)
	at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:222)
	at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:199)
	at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:184)
	at edu.ucsb.nceas.mdqengine.scheduler.RequestReportJob.execute(RequestReportJob.java:188)
	... 2 more
	
20241028-18:30:00: [ERROR]: Job metadig.downloads threw an unhandled Exception:  [org.quartz.core.JobRunShell:222]
java.lang.NullPointerException
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at java.base/java.io.FileReader.<init>(Unknown Source)
	at edu.ucsb.nceas.mdqengine.scheduler.AcquireWebResourcesJob.execute(AcquireWebResourcesJob.java:97)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
20241028-18:30:00: [ERROR]: Job (metadig.downloads threw an exception. [org.quartz.core.ErrorLogger:2360]
org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.NullPointerException]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
Caused by: java.lang.NullPointerException
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at java.base/java.io.FileReader.<init>(Unknown Source)
	at edu.ucsb.nceas.mdqengine.scheduler.AcquireWebResourcesJob.execute(AcquireWebResourcesJob.java:97)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	... 1 more


doulikecookiedough commented Oct 29, 2024

Check-in:

  • The newest datasets did not have their Assessment Reports generated because cn-stage was not harvesting from the urn:node:mnTestKNB node. So when I set last_harvest_datetime back in metadig-postgres, it could not pick up the latest datasets.

    • This was resolved by seeking Jing's assistance to kickstart the process on https://cn-stage.test.dataone.org/cn/v2/node
    /etc/init.d/d1-index-task-processor start
    /etc/init.d/d1-index-task-generator start
    /etc/init.d/d1-processing
    
  • AcquireWebResourcesJob Exception

    • This was resolved by adding downloadsList to metadig.properties
    • However, the file that should be placed into /opt/local/metadig/data/ is not being saved, even though the download appears as part of the process.
    https://cn.dataone.org/cn/v2/formats ~> /opt/local/metadig/data/all-dataone-formats.xml
    
    • Exception below for reference
    20241027-23:30:00: [ERROR]: Job metadig.downloads threw an unhandled Exception:  [org.quartz.core.JobRunShell:222]
    java.lang.NullPointerException
        at java.base/java.io.FileInputStream.<init>(Unknown Source)
        at java.base/java.io.FileInputStream.<init>(Unknown Source)
        at java.base/java.io.FileReader.<init>(Unknown Source)
        at edu.ucsb.nceas.mdqengine.scheduler.AcquireWebResourcesJob.execute(AcquireWebResourcesJob.java:97)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
    20241027-23:30:00: [ERROR]: Job (metadig.downloads threw an exception. [org.quartz.core.ErrorLogger:2360]
    org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.NullPointerException]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
    Caused by: java.lang.NullPointerException
        at java.base/java.io.FileInputStream.<init>(Unknown Source)
        at java.base/java.io.FileInputStream.<init>(Unknown Source)
        at java.base/java.io.FileReader.<init>(Unknown Source)
        at edu.ucsb.nceas.mdqengine.scheduler.AcquireWebResourcesJob.execute(AcquireWebResourcesJob.java:97)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        ... 1 more
    
  • RequestScorerJob Class (...and RequestReportJob)

    • RequestReportJob is no longer experiencing the NullPointerException relating to "before" and "link.after" - but now RequestScorerJob is. This bug likely affects both classes and needs to be investigated.
            20241029-21:16:10: [ERROR]: Error creating rest client: Cannot assign field "before" because "link.after" is null [edu.ucsb.nceas.mdqengine.DataONE:74]
        20241029-21:16:10: [ERROR]: portal-test-arctic-FAIR: unable to create connection to service URL https://test.arcticdata.io/metacat/d1/mn [edu.ucsb.nceas.mdqengine.scheduler.RequestScorerJob:187]
        edu.ucsb.nceas.mdqengine.exception.MetadigProcessException: Unable to get collection pids
            at edu.ucsb.nceas.mdqengine.DataONE.getMultipartD1Node(DataONE.java:75)
            at edu.ucsb.nceas.mdqengine.scheduler.RequestScorerJob.execute(RequestScorerJob.java:185)
            at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
            at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
        Caused by: java.lang.NullPointerException: Cannot assign field "before" because "link.after" is null
            at org.apache.commons.collections.map.AbstractLinkedMap.removeEntry(AbstractLinkedMap.java:294)
            at org.apache.commons.collections.map.AbstractHashedMap.removeMapping(AbstractHashedMap.java:543)
            at org.apache.commons.collections.map.AbstractHashedMap.remove(AbstractHashedMap.java:325)
            at org.apache.commons.configuration.BaseConfiguration.clearPropertyDirect(BaseConfiguration.java:133)
            at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
            at org.apache.commons.configuration.CompositeConfiguration.clearPropertyDirect(CompositeConfiguration.java:269)
            at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
            at org.apache.commons.configuration.AbstractConfiguration.setProperty(AbstractConfiguration.java:483)
            at org.dataone.client.rest.HttpMultipartRestClient.setDefaultTimeout(HttpMultipartRestClient.java:588)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:222)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:199)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:184)
            at edu.ucsb.nceas.mdqengine.DataONE.getMultipartD1Node(DataONE.java:72)
            ... 3 more
        20241029-21:16:10: [INFO]: Job metadig.portal-test-arctic-FAIR threw a JobExecutionException:  [org.quartz.core.JobRunShell:218]
        org.quartz.JobExecutionException: portal-test-arctic-FAIR: unable to create connection to service URL https://test.arcticdata.io/metacat/d1/mn [See nested exception: edu.ucsb.nceas.mdqengine.exception.MetadigProcessException: Unable to get collection pids]
            at edu.ucsb.nceas.mdqengine.scheduler.RequestScorerJob.execute(RequestScorerJob.java:188)
            at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
            at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
        Caused by: edu.ucsb.nceas.mdqengine.exception.MetadigProcessException: Unable to get collection pids
            at edu.ucsb.nceas.mdqengine.DataONE.getMultipartD1Node(DataONE.java:75)
            at edu.ucsb.nceas.mdqengine.scheduler.RequestScorerJob.execute(RequestScorerJob.java:185)
            ... 2 more
        Caused by: java.lang.NullPointerException: Cannot assign field "before" because "link.after" is null
            at org.apache.commons.collections.map.AbstractLinkedMap.removeEntry(AbstractLinkedMap.java:294)
            at org.apache.commons.collections.map.AbstractHashedMap.removeMapping(AbstractHashedMap.java:543)
            at org.apache.commons.collections.map.AbstractHashedMap.remove(AbstractHashedMap.java:325)
            at org.apache.commons.configuration.BaseConfiguration.clearPropertyDirect(BaseConfiguration.java:133)
            at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
            at org.apache.commons.configuration.CompositeConfiguration.clearPropertyDirect(CompositeConfiguration.java:269)
            at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
            at org.apache.commons.configuration.AbstractConfiguration.setProperty(AbstractConfiguration.java:483)
            at org.dataone.client.rest.HttpMultipartRestClient.setDefaultTimeout(HttpMultipartRestClient.java:588)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:222)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:199)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:184)
            at edu.ucsb.nceas.mdqengine.DataONE.getMultipartD1Node(DataONE.java:72)
      

To Do

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • Fix broken Assessment reports on dev.nceas.ucsb.edu
  • Deploy feature-hashstore-support image to metadig-worker, metadig-scheduler and metadig-scorer pods in the dev test cluster
  • Fix bug/issue with retrieving data objects from solr (http error code 403)
    • Determine the User-Agent to use in the GET request that retrieves data objects from metacat-solr
      • A generic value like java/17.0.1-temurin does not appear to be acceptable
  • Investigate the purpose of acquiring the web resource from the CN and placing it into /opt/local/metadig/data/all-dataone-formats.xml (and why it is not currently saving as expected)
  • Refactor code in AcquireWebResources where we do not check for a null value after retrieving a path that causes a NullPointerException
  • Fix bug in metadig-scheduler where Caused by: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null
    • This is now being thrown in RequestScorerJob relating to the configuration of URLs to collect pids
    • Review and apply any fix to RequestReportJob which now appears ok but observed the issue previously
  • Confirm that the expected data quality checks defined are executed
  • Submit datasets to dev.nceas (via metacatUI or any other client)
    • Running the quality checks at scale (many objects) to make sure the system is performing without errors


doulikecookiedough commented Nov 4, 2024

Check in:

  • RE: User-Agent Value
    • It seems like any value that does not begin with Chrome or Mozilla is rejected by solr
    • For now, will use Mozilla/MetadigEngine (feature-hashstore-support)
  • RE: AcquireWebResourcesJob
    • After k8s was redeployed/restarted, the process to acquire the defined resources executed as expected. The exception is no longer observed, and the expected file is now present in /opt/local/metadig/data.
  • RE: RequestReportJob & RequestScorerJob
    • These exceptions appeared to be thrown when datasets were uploaded to test.arcticdata.io - however, after submitting datasets through test.adc's respective GUI, the exception could not be reproduced.
      • Note: metacatui was adjusted at test.arcticdata.io to allow submitters to set datasets to private/public (this was previously turned off; all datasets were private by default and had to be approved by an admin). My gut feeling is that this setting prevented metadig-scheduler from loading up its respective quality suites to run (quality-test-dataone-fair, portal-test-arctic-FAIR), leading to this exception every time a dataset was submitted to this node. Submitting new datasets does not trigger any exceptions after the metacatui setting change.
      showDatasetPublicToggle: true
      showDatasetPublicToggleForSubjects: []
      
    • Additional context: the issue was thrown during HTTP client instantiation (HttpMultipartRestClient):
      try {
          mrc = new HttpMultipartRestClient();
      } catch (Exception ex) {
          log.error("Error creating rest client: " + ex.getMessage());
          metadigException = new MetadigProcessException("Unable to get collection pids");
          metadigException.initCause(ex);
          throw metadigException;
      }
      
    • Reviewing the logs shows this exception is still occurring. Investigation continues.
  • RE: Private Datasets
    • It seems that metadig is having trouble accessing private datasets despite having the DATAONE_AUTH_TOKEN present. Should quality checks be run on private datasets, or only if they are made public?
    • Investigate and determine what should happen.

To Do

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • Fix broken Assessment reports on dev.nceas.ucsb.edu
  • Deploy feature-hashstore-support image to metadig-worker, metadig-scheduler and metadig-scorer pods in the dev test cluster
  • Fix bug/issue with retrieving data objects from solr (http error code 403)
    • Determine the User-Agent to use in the GET request that retrieves data objects from metacat-solr
      • A generic value like java/17.0.1-temurin does not appear to be acceptable
  • Investigate the purpose of acquiring the web resource from the CN and placing it into /opt/local/metadig/data/all-dataone-formats.xml (and why it is not currently saving as expected)
  • Refactor code in AcquireWebResources where we do not check for a null value after retrieving a path that causes a NullPointerException
    • Null check was added.
  • Fix bug in metadig-scheduler where Caused by: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null
    • This is now being thrown in RequestScorerJob relating to the configuration of URLs to collect pids
    • Review and apply any fix to RequestReportJob which now appears ok but observed the issue previously
    • Cannot reproduce this consistently on the affected test node
  • Confirm that the expected data quality checks defined are executed
    • I can see that the newly added quality and downloads tasks appear to be executing, but it is unclear which aspect of the Assessment Report represents the new data quality check.
  • Confirm if private datasets should still have quality checks being run when they're private
  • Submit datasets to dev.nceas (via metacatUI or any other client)
    • Running the quality checks at scale (many objects) to make sure the system is performing without errors
