test metadig-engine on k8s against a hashstore #453

Open
jeanetteclark opened this issue Oct 2, 2024 · 14 comments

jeanetteclark commented Oct 2, 2024

Testing locally has gone well, but it would be nice to test the engine against a hashstore on the dev cluster.

To that end, I've mounted the tdg subvolume on metadig-worker; that subvolume is also mounted on dev.nceas, where a hashstore-backed metacat is running. See helm/metadig-worker/pv.yaml and helm/metadig-worker/pvc.yaml for details on the existing mounts.
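For context, a minimal sketch of what a pre-provisioned claim for that mount could look like (the name, namespace, and size here are assumptions for illustration; the real definitions live in helm/metadig-worker/pv.yaml and pvc.yaml):

# hypothetical PVC binding to a statically provisioned CephFS-backed PV
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tdg-repos-dev           # assumed name
  namespace: metadig
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""          # skip dynamic provisioning; bind to the named PV below
  volumeName: tdg-repos-dev     # the pre-created PV for the CephFS tdg subvolume
  resources:
    requests:
      storage: 1Gi              # placeholder size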

In order to actually test, though, the following steps are needed:

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be var/data/repos/dev/hashstore
  • deploy metadig-engine to the test cluster
  • submit datasets to dev.nceas (via metacatUI or any other client)
@jeanetteclark jeanetteclark moved this from Ready to In Progress in Metadig Data Quality Oct 2, 2024
@doulikecookiedough doulikecookiedough self-assigned this Oct 8, 2024

doulikecookiedough commented Oct 11, 2024

Update:

The rsync + parallel process to copy the contents of /var/metacat/hashstore to /mnt/tdg-repos/dev/metacat/hashstore has been completed.

  • The re-sync process takes approximately 25 minutes to complete using just the first-level folders/items in the /var/metacat/hashstore folder.
    • In the meantime, I am re-running the process with individual rsync commands for each file to see if it is any faster

Next Steps:

  • Sync up with Jing to coordinate the metacat switchover (and symlinking the new directory)

To Do List:

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • deploy metadig-engine to the test cluster
  • submit datasets to dev.nceas (via metacatUI or any other client)

For reference:

# How to produce a text file with just the first level of hashstore folders to rsync
mok@dev:~/testing$ sudo find /var/metacat/hashstore -mindepth 1 -maxdepth 1 > mc_hs_dir_list.txt
mok@dev:~/testing$ cat mc_hs_dir_list.txt
/var/metacat/hashstore/objects
/var/metacat/hashstore/metadata
/var/metacat/hashstore/refs
/var/metacat/hashstore/hashstore.yaml

# How to use rsync with a list of folders
mok@dev:~/testing$ cat mc_hs_dir_list.txt | parallel --eta sudo rsync -aHAX {} /mnt/tdg-repos/dev/metacat/hashstore/
# First get the list of files found under `/hashstore`
mok@dev:~/testing$ sudo find /var/metacat/hashstore -type f -printf '%P\n' > mc_obj_list.txt

# How to feed rsync one file at a time
# The /./ between `metacat` and `hashstore` marks where the relative path begins, so rsync -R copies paths starting at hashstore/ (omitting the leading directories) into the destination folder
mok@dev:~/testing$ parallel --eta sudo rsync -aHAXR /var/metacat/./hashstore/{} /mnt/tdg-repos/dev/metacat :::: mc_obj_list.txt
  • Note: If the number of jobs is not specified, parallel determines its own limit (here it went to the max # of cores, e.g. 44). Adding -j 30 limited it to 30 concurrent jobs.


doulikecookiedough commented Oct 14, 2024

Metacat on dev.nceas.ucsb.edu has been moved over to write to the CephFS mount point - a symlink has been created between /var/metacat/hashstore and /mnt/tdg-repos/dev/metacat/hashstore.

  • Note: We initially ran into a read-only file system issue caused by how Tomcat sets up its access control rules (the actual write path above needed to be added to its configuration settings).

rsync was re-run, and syncing with the list of direct subfolders under /var/metacat/hashstore was the fastest approach. I also tested feeding rsync individual commands (ex. via :::: list_of_files.txt), but this was very slow. The re-sync took approximately 5 minutes.
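For reference, a sketch of that switchover (exact commands may have differed; the backup directory name is illustrative):

# move the original directory aside, then point /var/metacat/hashstore at the CephFS copy
sudo mv /var/metacat/hashstore /var/metacat/hashstore.orig
sudo ln -s /mnt/tdg-repos/dev/metacat/hashstore /var/metacat/hashstore
# verify that the symlink resolves to the CephFS mount
ls -l /var/metacat/hashstore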


doulikecookiedough commented Oct 20, 2024

Current Status:

It appears the 'Assessment Reports' (Metadig) for datasets at dev.nceas.ucsb.edu are not working as expected.

Next Steps:

  1. Restoring expected Metadig functionality @ dev.nceas.ucsb.edu

    • The dev cluster's metadig-controller, metadig-scorer and metadig-scheduler are all on image v3.0.2; metadig-worker is using the feature-hashstore-support image. Before attempting to deploy the feature-hashstore-support image to the scorer, scheduler and controller per Jeanette's instructions, I will restore metadig-worker to image v3.0.2 to try and resolve the issue on the test site.
  2. Obtaining the last missing feature-hashstore-support image for metadig-controller

    • metadig-controller also does not have a feature-hashstore-support image. This will require the execution of mvn publish while on the correct branch for the metadig-engine. I likely do not have appropriate permissions and will seek assistance from Jing to move forward here.
  3. Deploying feature-hashstore-support for Metadig in full on the dev cluster

    • Once all the images are available, I will deploy feature-hashstore-support after updating the image.tag in the respective values.yaml files (four total, one for each metadig-engine component)
    • Below are the commands for quick reference:
      helm upgrade metadig-scheduler ./metadig-scheduler --namespace metadig --set image.pullPolicy=Always --recreate-pods=true --set k8s.cluster=dev
      helm upgrade metadig-scorer ./metadig-scorer --namespace metadig --set image.pullPolicy=Always --recreate-pods=true --set k8s.cluster=dev
      helm upgrade metadig-worker ./metadig-worker --namespace metadig --set image.pullPolicy=Always --set replicaCount=1 --recreate-pods=true --set k8s.cluster=dev
      helm upgrade metadig-controller ./metadig-controller --namespace metadig --set image.pullPolicy=Always --recreate-pods=true --set k8s.cluster=dev
      

To Do List & Follow-up Questions

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • Fix broken Assessment reports on dev.nceas.ucsb.edu
  • Deploy metadig-engine to the test cluster
    • Once deployed, determine how to check that a data quality check can be requested/is working as expected (a way to directly communicate with Metadig?)
  • Submit datasets to dev.nceas (via metacatUI or any other client)
    • Running the quality checks at scale (many objects) to make sure the system is performing without errors


doulikecookiedough commented Oct 22, 2024

Update:

  • The Assessment Reports began generating successfully again after reverting the metadig-controller back to the v3.0.2 image
  • Successfully deployed the metadig-scheduler, metadig-scorer and metadig-worker with the feature-hashstore-support images - however, the Assessment Reports did not work.
  • After reviewing the existing changes in feature-hashstore-support, and the logs from the metadig-controller & metadig-worker - it looks like the engine is unable to communicate with solr to get the list of data pids. Currently debugging.
    20241022-16:43:28: [ERROR]: Unable to run quality suite. [edu.ucsb.nceas.mdqengine.Worker:224]
    edu.ucsb.nceas.mdqengine.exception.MetadigException: Unable to run quality suite for pid urn:uuid:761e9125-f775-4bf8-9a80-8cc970a52353, suite FAIR-suite-0.4.0Failed : HTTP error code : 403
        at edu.ucsb.nceas.mdqengine.Worker.processReport(Worker.java:568)
        at edu.ucsb.nceas.mdqengine.Worker$1.handleDelivery(Worker.java:212)
        at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
        at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:111)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
    Caused by: java.lang.RuntimeException: Failed : HTTP error code : 403
        at edu.ucsb.nceas.mdqengine.MDQEngine.findDataPids(MDQEngine.java:271)
        at edu.ucsb.nceas.mdqengine.MDQEngine.runSuite(MDQEngine.java:120)
        at edu.ucsb.nceas.mdqengine.Worker.processReport(Worker.java:564)
        ... 6 more
    
  • A new controller image does not seem necessary at this time as the feature-hashstore-support code changes do not involve the metadig-controller, so I've paused here for now.

To Do List & Follow-up Questions

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • Fix broken Assessment reports on dev.nceas.ucsb.edu
  • Deploy feature-hashstore-support image to metadig-worker, metadig-scheduler and metadig-scorer pods in the dev test cluster
  • Fix bug/issue with retrieving data objects from solr (http error code 403)
    • Once deployed, determine how to check that a data quality check can be requested/is working as expected (a way to directly communicate with Metadig?)
  • Submit datasets to dev.nceas (via metacatUI or any other client)
    • Running the quality checks at scale (many objects) to make sure the system is performing without errors


doulikecookiedough commented Oct 23, 2024

Update:

  • The Metadig Assessment Reports are still unable to generate.
    • The URL that appears to be causing the issue should be, and is, publicly accessible:
      https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/q=isDocumentedBy:%22urn:uuid:c559c233-8bf9-42b4-98df-8558f4a4776a%22
      try {
          String nodeEndpoint = D1Client.getMN(nodeId).getNodeBaseServiceUrl();
          String encodedId = URLEncoder.encode(identifier, "UTF-8");
          String queryUrl = nodeEndpoint + "/query/solr/?q=isDocumentedBy:" + "\"" + encodedId + "\"" + "&fl=id";
      
          URL url = new URL(queryUrl);
          HttpURLConnection connection = (HttpURLConnection) url.openConnection();
          connection.setRequestMethod("GET");
          connection.setRequestProperty("Accept", "application/xml");
          if (dataOneAuthToken != null) {
              connection.setRequestProperty("Authorization", "Bearer " + dataOneAuthToken);
          }
      
          if (connection.getResponseCode() != 200) {
              // Line 271
              throw new RuntimeException("Failed : HTTP error code : " + connection.getResponseCode());
          }
          ...
      
    • Adjusting the metadig-worker deployment's environment variable to make use of the dataone-secret does not appear to have any effect (below for quick reference).
      • In the findDataPids method, it appears that we want to include a token when requesting the data objects (in case we are searching for private datasets?). If it cannot find an environment variable, it falls back to the config, which notes that the token is not set there.
      # /metadig-worker/templates/deployment.yaml
      
      ...
      env:
          - name: JAVA_OPTS
            value: "-Dlog4j2.formatMsgNoLookups=true"
          - name: DATAONE_AUTH_TOKEN
            valueFrom:
              secretKeyRef:
                name: dataone-token
                key: DataONEauthToken
      


doulikecookiedough commented Oct 25, 2024

Update:

Even after fixing the connection URL (below), I am still experiencing an HTTP 403 Forbidden error.

String encodedId = URLEncoder.encode(identifier, "UTF-8");
// This is necessary for metacat's solr to process the requested queryUrl
String encodedQuotes = URLEncoder.encode("\"", "UTF-8");
String queryUrl = nodeEndpoint + "/query/solr/?q=isDocumentedBy:" + encodedQuotes + encodedId + encodedQuotes + "&fl=id";

The endpoint shown in the logging message is accessible both via the browser and from within the metadig-worker pod itself. Metacat's solr index does not have specific access control rules, so this GET request from the metadig-worker should be processable.

[email protected]:~/Code/testing/metadig $ kubectl exec -it metadig-worker-75c5689d69-4tt4v /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.

# curl "https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/?q=isDocumentedBy:%22urn%3Auuid%3Aae970e0a-3a26-4af7-8a84-235c9a8e3a5d%22&fl=id"
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">19</int>
  <lst name="params">
    <str name="q">isDocumentedBy:"urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d"</str>
    <str name="fl">id</str>
    <str name="fq">(readPermission:"public")OR(writePermission:"public")OR(changePermission:"public")OR(isPublic:true)</str>
    <str name="wt">javabin</str>
    <str name="version">2</str>
  </lst>
</lst>
<result name="response" numFound="5" start="0" numFoundExact="true">
  <doc>
    <str name="id">urn:uuid:9ebcadac-b015-48fb-a2c5-1ff7db692f19</str></doc>
  <doc>
    <str name="id">urn:uuid:75db2307-4b78-4a8b-bc59-5b2ce318519f</str></doc>
  <doc>
    <str name="id">urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d</str></doc>
  <doc>
    <str name="id">urn:uuid:52106ea7-f24b-4247-a697-272023fb158e</str></doc>
  <doc>
    <str name="id">urn:uuid:b3dd42d8-7489-4d95-bcba-81940bdefbe2</str></doc>
</result>
</response>

The DATAONE_AUTH_TOKEN does not seem to make any difference (confirmed that it is set as an environment variable, both in the logs and with the command kubectl exec -t metadig-worker-75c5689d69-4tt4v -- env)

# Error log

20241025-21:43:14: [DEBUG]: Running suite: FAIR-suite-0.4.0 [edu.ucsb.nceas.mdqengine.MDQEngine:97]
20241025-21:43:14: [DEBUG]: Got token from env. [edu.ucsb.nceas.mdqengine.MDQEngine:241]
20241025-21:43:16: [DEBUG]: queryURL: https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/?q=isDocumentedBy:%22urn%3Auuid%3Aae970e0a-3a26-4af7-8a84-235c9a8e3a5d%22&fl=id [edu.ucsb.nceas.mdqengine.MDQEngine:264]
20241025-21:43:16: [ERROR]: Unable to run quality suite. [edu.ucsb.nceas.mdqengine.Worker:224]
edu.ucsb.nceas.mdqengine.exception.MetadigException: Unable to run quality suite for pid urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suite FAIR-suite-0.4.0Failed : HTTP error code : 403
	at edu.ucsb.nceas.mdqengine.Worker.processReport(Worker.java:568)
	at edu.ucsb.nceas.mdqengine.Worker$1.handleDelivery(Worker.java:212)
	at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
	at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:111)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Failed : HTTP error code : 403
	at edu.ucsb.nceas.mdqengine.MDQEngine.findDataPids(MDQEngine.java:275)
	at edu.ucsb.nceas.mdqengine.MDQEngine.runSuite(MDQEngine.java:120)
	at edu.ucsb.nceas.mdqengine.Worker.processReport(Worker.java:564)
	... 6 more
20241025-21:43:16: [DEBUG]: Saving quality run status after error [edu.ucsb.nceas.mdqengine.Worker:240]
20241025-21:43:16: [DEBUG]: Saving to persistent storage: metadata PID: urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suite id: FAIR-suite-0.4.0 [edu.ucsb.nceas.mdqengine.model.Run:272]
20241025-21:43:16: [DEBUG]: Done saving to persistent storage: metadata PID: urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suite id: FAIR-suite-0.4.0 [edu.ucsb.nceas.mdqengine.model.Run:277]
20241025-21:43:16: [DEBUG]: Saved quality run status after error [edu.ucsb.nceas.mdqengine.Worker:249]
20241025-21:43:16: [DEBUG]: Sending report info back to controller... [edu.ucsb.nceas.mdqengine.Worker:390]
20241025-21:43:16: [INFO]: Elapsed time processing (seconds): 0 for metadataPid: urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suiteId: FAIR-suite-0.4.0
 [edu.ucsb.nceas.mdqengine.Worker:422]

I have a feeling that this is related to how k8s allows external REST API calls to be made (or not). The specific Java code making the GET request appears to be fine (since it can communicate and receive a 403 error). Investigation continues.


mbjones commented Oct 25, 2024

I have a feeling that this is related to how k8s allows external REST API calls to be made (or not).

k8s does not restrict pods from originating web connections to external hosts in any way unless it is configured to do so. MetaDIG is not configured to restrict anything afaik. You and I should touch base on this because I think you are following a red herring and the problem originates elsewhere. Your curl command from the pod shows that the connection is not blocked. So it's something else about how you deployed. Let's chat.


doulikecookiedough commented Oct 25, 2024

@mbjones I think so too - I can't find anything related to that. I just pushed a commit to test whether the request is getting rejected because it's missing a User-Agent property. I'll send you a PM via Slack and/or send you a calendar invite.

Deployment code for quick reference (taken from hand-off notes):
helm upgrade metadig-worker ./metadig-worker --namespace metadig --set image.pullPolicy=Always --set replicaCount=1 --recreate-pods=true --set k8s.cluster=dev

With the following changes in the respective metadig-worker deployment files:

  • values.yaml:

    image:
      repository: ghcr.io/nceas/metadig-worker
      pullPolicy: Always
      tag: "feature-hashstore-support"
    
  • deployment.yaml under env

    - name: DATAONE_AUTH_TOKEN
      valueFrom:
        secretKeyRef:
          name: dataone-token
          key: DataONEauthToken
    


doulikecookiedough commented Oct 25, 2024

@mbjones The Assessment Report generated after adding the User-Agent property to the Java code!
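For reference, a minimal sketch of the change, assuming the same HttpURLConnection setup shown in the earlier findDataPids snippet (the header value shown here is the one settled on later in this thread):

// solr behind dev.nceas appeared to reject requests without a browser-like User-Agent,
// so set one explicitly alongside the existing Accept/Authorization headers
connection.setRequestProperty("User-Agent", "Mozilla/MetadigEngine (feature-hashstore-support)");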

To Do List & Follow-up Questions

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • Fix broken Assessment reports on dev.nceas.ucsb.edu
  • Deploy feature-hashstore-support image to metadig-worker, metadig-scheduler and metadig-scorer pods in the dev test cluster
  • Fix bug/issue with retrieving data objects from solr (http error code 403)
    • Determine how to properly include/handle User-Agent as part of the get request (can I simply use something like java/17.0.1-temurin as the value - will it be accepted by solr?)
  • Once deployed, determine how to check that a data quality check can be requested/is working as expected (a way to directly communicate with Metadig?)
  • Submit datasets to dev.nceas (via metacatUI or any other client)
    • Running the quality checks at scale (many objects) to make sure the system is performing without errors


doulikecookiedough commented Oct 28, 2024

Update:

  • While the Assessment Reports generated after updating the User-Agent, it appears that the RabbitMQ queue is no longer being populated when new datasets are added to dev.nceas.ucsb.edu
  • I thought something might have prevented the new datasets from entering the queue/postgres, so I triggered a re-run by setting metadig-postgres's last_harvest_datetime for the urn:node:mnTestKNB node to a date in the past (ex. 2024-10-24T00:00:00.000Z) per the operations manual (a rough sketch of this appears after this list). This produced an uptick in RabbitMQ, but it only affects datasets up to a specific date (not the new ones I added).
  • Restarting the controller, scheduler, scorer and worker does not seem to have helped. These new datasets are not found in the metadig-postgres runs table. Currently investigating where the breakdown in communication is occurring.
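For reference, a rough psql sketch of that re-trigger (the table and column names below are assumptions for illustration only; the operations manual documents the actual metadig-postgres schema):

-- hypothetical table/column names; verify against the real schema before running
UPDATE node_harvest
   SET last_harvest_datetime = '2024-10-24T00:00:00.000Z'
 WHERE node_id = 'urn:node:mnTestKNB';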


mbjones commented Oct 28, 2024

@doulikecookiedough regarding your question on how to directly communicate with metadig, that would be via the API. Most operations require authentication, but you can, for example, access completed run reports with a request like:

https://api.test.dataone.org/quality/runs/FAIR-suite-0.4.0/urn:uuid:0b44a2d5-dcd5-4798-8072-4030b14e8936

This one doesn't work, as it appears the FAIR-suite-0.4.0 was not run for the PID listed. You can get an overview of the whole API at https://api.test.dataone.org/quality/ -- but note that only a portion of the planned methods were implemented - others are still TBD, and some were disabled for security reasons. A useful one is getting the list of current suites, which is at https://api.test.dataone.org/quality/suites/.

If the API doesn't provide what you need, you can query the database itself via psql.
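For quick reference, the two read-only calls above as curl commands (replace <pid> with an actual metadata PID):

# list the quality suites currently registered
curl "https://api.test.dataone.org/quality/suites/"

# fetch a completed run report for a given suite and PID
curl "https://api.test.dataone.org/quality/runs/FAIR-suite-0.4.0/<pid>"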


doulikecookiedough commented Oct 28, 2024

Thank you for the clarification/direction @mbjones. Currently it looks like there's an issue with the scheduler - after restarting the pods (making sure the chart and app versions were both updated), some NullPointerExceptions are being thrown. This may explain why the FAIR-suite-0.4.0 check isn't being run for the new PIDs being added to the urn:node:mnTestKNB node.

20241028-18:02:10: [ERROR]: quality-test-dataone-fair: error creating rest client: Cannot assign field "after" because "link.before" is null [edu.ucsb.nceas.mdqengine.scheduler.RequestReportJob:190]
20241028-18:02:10: [INFO]: Job metadig.quality-test-dataone-fair threw a JobExecutionException:  [org.quartz.core.JobRunShell:218]
org.quartz.JobExecutionException: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null [See nested exception: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null]
	at edu.ucsb.nceas.mdqengine.scheduler.RequestReportJob.execute(RequestReportJob.java:191)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
Caused by: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null
	at org.apache.commons.collections.map.AbstractLinkedMap.removeEntry(AbstractLinkedMap.java:293)
	at org.apache.commons.collections.map.AbstractHashedMap.removeMapping(AbstractHashedMap.java:543)
	at org.apache.commons.collections.map.AbstractHashedMap.remove(AbstractHashedMap.java:325)
	at org.apache.commons.configuration.BaseConfiguration.clearPropertyDirect(BaseConfiguration.java:133)
	at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
	at org.apache.commons.configuration.CompositeConfiguration.clearPropertyDirect(CompositeConfiguration.java:269)
	at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
	at org.apache.commons.configuration.AbstractConfiguration.setProperty(AbstractConfiguration.java:483)
	at org.dataone.client.rest.HttpMultipartRestClient.setDefaultTimeout(HttpMultipartRestClient.java:588)
	at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:222)
	at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:199)
	at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:184)
	at edu.ucsb.nceas.mdqengine.scheduler.RequestReportJob.execute(RequestReportJob.java:188)
	... 2 more
	
20241028-18:30:00: [ERROR]: Job metadig.downloads threw an unhandled Exception:  [org.quartz.core.JobRunShell:222]
java.lang.NullPointerException
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at java.base/java.io.FileReader.<init>(Unknown Source)
	at edu.ucsb.nceas.mdqengine.scheduler.AcquireWebResourcesJob.execute(AcquireWebResourcesJob.java:97)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
20241028-18:30:00: [ERROR]: Job (metadig.downloads threw an exception. [org.quartz.core.ErrorLogger:2360]
org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.NullPointerException]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
Caused by: java.lang.NullPointerException
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at java.base/java.io.FileReader.<init>(Unknown Source)
	at edu.ucsb.nceas.mdqengine.scheduler.AcquireWebResourcesJob.execute(AcquireWebResourcesJob.java:97)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	... 1 more


doulikecookiedough commented Oct 29, 2024

Check-in:

  • The newest datasets did not have their Assessment Reports generated because cn-stage was not harvesting from the urn:node:mnTestKNB node. So when I set last_harvest_datetime back in metadig-postgres, it could not pick up the latest datasets.

    • This was resolved by seeking Jing's assistance to kickstart the process on https://cn-stage.test.dataone.org/cn/v2/node
    /etc/init.d/d1-index-task-processor start
    /etc/init.d/d1-index-task-generator start
    /etc/init.d/d1-processing
    
  • AcquireWebResourcesJob Exception

    • This was resolved by adding downloadsList to metadig.properties
    • However, the file that should be placed into /opt/local/metadig/data/ is not being saved, even though the download appears as part of the process.
    https://cn.dataone.org/cn/v2/formats ~> /opt/local/metadig/data/all-dataone-formats.xml
    
    • Exception below for reference
    20241027-23:30:00: [ERROR]: Job metadig.downloads threw an unhandled Exception:  [org.quartz.core.JobRunShell:222]
    java.lang.NullPointerException
        at java.base/java.io.FileInputStream.<init>(Unknown Source)
        at java.base/java.io.FileInputStream.<init>(Unknown Source)
        at java.base/java.io.FileReader.<init>(Unknown Source)
        at edu.ucsb.nceas.mdqengine.scheduler.AcquireWebResourcesJob.execute(AcquireWebResourcesJob.java:97)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
    20241027-23:30:00: [ERROR]: Job (metadig.downloads threw an exception. [org.quartz.core.ErrorLogger:2360]
    org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.NullPointerException]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
    Caused by: java.lang.NullPointerException
        at java.base/java.io.FileInputStream.<init>(Unknown Source)
        at java.base/java.io.FileInputStream.<init>(Unknown Source)
        at java.base/java.io.FileReader.<init>(Unknown Source)
        at edu.ucsb.nceas.mdqengine.scheduler.AcquireWebResourcesJob.execute(AcquireWebResourcesJob.java:97)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        ... 1 more
    
  • RequestScorerJob Class (...and RequestReportJob)

    • RequestReportJob is no longer experiencing the NullPointerException relating to "before" and "link.after" - but now RequestScorerJob is. This bug likely affects both classes and needs to be investigated.
            20241029-21:16:10: [ERROR]: Error creating rest client: Cannot assign field "before" because "link.after" is null [edu.ucsb.nceas.mdqengine.DataONE:74]
        20241029-21:16:10: [ERROR]: portal-test-arctic-FAIR: unable to create connection to service URL https://test.arcticdata.io/metacat/d1/mn [edu.ucsb.nceas.mdqengine.scheduler.RequestScorerJob:187]
        edu.ucsb.nceas.mdqengine.exception.MetadigProcessException: Unable to get collection pids
            at edu.ucsb.nceas.mdqengine.DataONE.getMultipartD1Node(DataONE.java:75)
            at edu.ucsb.nceas.mdqengine.scheduler.RequestScorerJob.execute(RequestScorerJob.java:185)
            at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
            at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
        Caused by: java.lang.NullPointerException: Cannot assign field "before" because "link.after" is null
            at org.apache.commons.collections.map.AbstractLinkedMap.removeEntry(AbstractLinkedMap.java:294)
            at org.apache.commons.collections.map.AbstractHashedMap.removeMapping(AbstractHashedMap.java:543)
            at org.apache.commons.collections.map.AbstractHashedMap.remove(AbstractHashedMap.java:325)
            at org.apache.commons.configuration.BaseConfiguration.clearPropertyDirect(BaseConfiguration.java:133)
            at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
            at org.apache.commons.configuration.CompositeConfiguration.clearPropertyDirect(CompositeConfiguration.java:269)
            at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
            at org.apache.commons.configuration.AbstractConfiguration.setProperty(AbstractConfiguration.java:483)
            at org.dataone.client.rest.HttpMultipartRestClient.setDefaultTimeout(HttpMultipartRestClient.java:588)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:222)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:199)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:184)
            at edu.ucsb.nceas.mdqengine.DataONE.getMultipartD1Node(DataONE.java:72)
            ... 3 more
        20241029-21:16:10: [INFO]: Job metadig.portal-test-arctic-FAIR threw a JobExecutionException:  [org.quartz.core.JobRunShell:218]
        org.quartz.JobExecutionException: portal-test-arctic-FAIR: unable to create connection to service URL https://test.arcticdata.io/metacat/d1/mn [See nested exception: edu.ucsb.nceas.mdqengine.exception.MetadigProcessException: Unable to get collection pids]
            at edu.ucsb.nceas.mdqengine.scheduler.RequestScorerJob.execute(RequestScorerJob.java:188)
            at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
            at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
        Caused by: edu.ucsb.nceas.mdqengine.exception.MetadigProcessException: Unable to get collection pids
            at edu.ucsb.nceas.mdqengine.DataONE.getMultipartD1Node(DataONE.java:75)
            at edu.ucsb.nceas.mdqengine.scheduler.RequestScorerJob.execute(RequestScorerJob.java:185)
            ... 2 more
        Caused by: java.lang.NullPointerException: Cannot assign field "before" because "link.after" is null
            at org.apache.commons.collections.map.AbstractLinkedMap.removeEntry(AbstractLinkedMap.java:294)
            at org.apache.commons.collections.map.AbstractHashedMap.removeMapping(AbstractHashedMap.java:543)
            at org.apache.commons.collections.map.AbstractHashedMap.remove(AbstractHashedMap.java:325)
            at org.apache.commons.configuration.BaseConfiguration.clearPropertyDirect(BaseConfiguration.java:133)
            at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
            at org.apache.commons.configuration.CompositeConfiguration.clearPropertyDirect(CompositeConfiguration.java:269)
            at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
            at org.apache.commons.configuration.AbstractConfiguration.setProperty(AbstractConfiguration.java:483)
            at org.dataone.client.rest.HttpMultipartRestClient.setDefaultTimeout(HttpMultipartRestClient.java:588)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:222)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:199)
            at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:184)
            at edu.ucsb.nceas.mdqengine.DataONE.getMultipartD1Node(DataONE.java:72)
      

To Do

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • Fix broken Assessment reports on dev.nceas.ucsb.edu
  • Deploy feature-hashstore-support image to metadig-worker, metadig-scheduler and metadig-scorer pods in the dev test cluster
  • Fix bug/issue with retrieving data objects from solr (http error code 403)
    • Determine the User-Agent to use in the GET request that retrieves data objects from metacat-solr
      • A generic value like java/17.0.1-temurin does not appear to be acceptable
  • Investigate the purpose of acquiring the web resource from the CN and placing it into /opt/local/metadig/data/all-dataone-formats.xml (and why it is not currently saving as expected)
  • Refactor code in AcquireWebResources where we do not check for a null value after retrieving a path that causes a NullPointerException
  • Fix bug in metadig-scheduler where Caused by: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null
    • This is now being thrown in RequestScorerJob relating to the configuration of URLs to collect pids
    • Review and apply any fix to RequestReportJob which now appears ok but observed the issue previously
  • Confirm that the expected data quality checks defined are executed
  • Submit datasets to dev.nceas (via metacatUI or any other client)
    • Running the quality checks at scale (many objects) to make sure the system is performing without errors


doulikecookiedough commented Nov 4, 2024

Check in:

  • RE: User-Agent Value
    • It seems like any value that does not begin with Chrome or Mozilla is rejected by solr
    • For now, will use Mozilla/MetadigEngine (feature-hashstore-support)
  • RE: AcquireWebResourcesJob
    • After k8s was redeployed/restarted, the process to acquire the defined resources executed as expected. The exception is no longer observed, and the expected file is now present in /opt/local/metadig/data.
  • RE: RequestReportJob & RequestScorerJob
    • These exceptions appeared to be thrown when datasets were uploaded to test.arcticdata.io - however, after submitting datasets through test.adc's respective GUI, the exception could not be reproduced.
      • Note: metacatui was adjusted at test.arcticdata.io to allow submitters to set datasets to private/public (this was previously turned off; all datasets were private by default and had to be approved by an admin). My gut feeling is that this setting prevented metadig-scheduler from loading up its respective quality suites to run (quality-test-dataone-fair, portal-test-arctic-FAIR), leading to this exception every time a dataset was submitted to this node. Submitting new datasets does not trigger any exceptions after the metacatui setting change.
      showDatasetPublicToggle: true
      showDatasetPublicToggleForSubjects: []
      
    • Additional context: the issue was thrown during HTTP client instantiation (HttpMultipartRestClient):
      try {
          mrc = new HttpMultipartRestClient();
      } catch (Exception ex) {
          log.error("Error creating rest client: " + ex.getMessage());
          metadigException = new MetadigProcessException("Unable to get collection pids");
          metadigException.initCause(ex);
          throw metadigException;
      }
      
    • Reviewing the logs shows this exception is still occurring. Investigation continues.
  • RE: Private Datasets
    • It seems that metadig is having trouble accessing private datasets despite having the DATAONE_AUTH_TOKEN present. Should quality checks be run on private datasets, or only if they are made public?
    • Investigate and determine what should happen.

To Do

  • copy the contents of metacat/hashstore to /mnt/tdg-repos/dev via parallel Rsync
  • symlink that hashstore to metacat
  • update the metacat.properties store.store_path field to be /mnt/tdg-repos/dev/metacat/hashstore
    • Note: This does not have to be done since we created a symlink
  • Fix broken Assessment reports on dev.nceas.ucsb.edu
  • Deploy feature-hashstore-support image to metadig-worker, metadig-scheduler and metadig-scorer pods in the dev test cluster
  • Fix bug/issue with retrieving data objects from solr (http error code 403)
    • Determine the User-Agent to use in the GET request that retrieves data objects from metacat-solr
      • A generic value like java/17.0.1-temurin does not appear to be acceptable
  • Investigate the purpose of acquiring the web resource from the CN and placing it into /opt/local/metadig/data/all-dataone-formats.xml (and why it is not currently saving as expected)
  • Refactor code in AcquireWebResources where we do not check for a null value after retrieving a path that causes a NullPointerException
    • Null check was added.
  • Fix bug in metadig-scheduler where Caused by: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null
    • This is now being thrown in RequestScorerJob relating to the configuration of URLs to collect pids
    • Review and apply any fix to RequestReportJob which now appears ok but observed the issue previously
    • Cannot reproduce this consistently on the affected test node
  • Confirm that the expected data quality checks defined are executed
    • I can see that the newly added quality and downloads tasks appear to be executing, but it is unclear which aspect of the Assessment Report represents the new data quality check.
  • Confirm if private datasets should still have quality checks being run when they're private
  • Submit datasets to dev.nceas (via metacatUI or any other client)
    • Running the quality checks at scale (many objects) to make sure the system is performing without errors
