Indexing: N/A values should not be indexed #1883

Closed
pdurbin opened this issue Apr 3, 2015 · 8 comments

@pdurbin
Member

pdurbin commented Apr 3, 2015

"I noticed in #1246 that we don't want to display N/A values in the UI, but as of this writing it's easy to create a dataset like this via SWORD by simply not including any dcterms:subject elements in the XML"

I first made that comment a month ago at #1430 (comment) but I'm opening this ticket (with a screenshot) to make sure we're ok with showing N/A in the Subject facet:

[Screenshot: subjectna]

This happens all the time on https://apitest.dataverse.org right now where I'm not specifying a subject in the XML I use to create a dataset via SWORD.

@pdurbin added this to the In Review - 4.0.x milestone Apr 3, 2015
@scolapasta modified the milestones: Beta 15 - Dataverse 4.0, In Review - 4.0.x Apr 4, 2015
@scolapasta
Contributor

We should not be indexing these N/A values.

@scolapasta
Contributor

Notice with production data we also see several N/As.

@pdurbin
Member Author

pdurbin commented Apr 6, 2015

"We should not be indexing these N/A values."

@scolapasta please review a195b17 (pushed to a branch) and let me know if it's ok to merge in. The question I have is for the dataset level. In that commit the logic is to never index an "N/A" value for any dataset field that has a controlled vocabulary that includes "N/A" (not just the "subject" field).

@pdurbin assigned scolapasta and unassigned pdurbin Apr 6, 2015
@pdurbin assigned pdurbin and unassigned scolapasta Apr 6, 2015
@pdurbin
Member Author

pdurbin commented Apr 6, 2015

I chatted with @scolapasta and it sounds like we want to avoid indexing "N/A" for as many fields as possible, including fields that don't have controlled vocabularies (which is most fields).
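
The rule described above (skip "N/A" for every field, with or without a controlled vocabulary) can be sketched roughly as follows. This is only an illustration in Python, not the code from the commits referenced in this issue; the function names and the sample metadata are hypothetical.

```python
NA_VALUE = "N/A"  # the placeholder string discussed in this issue


def indexable_values(values):
    """Return only the values that should reach the index, dropping "N/A"."""
    return [v for v in values if v != NA_VALUE]


def build_index_document(fields):
    """Build a field -> values mapping, omitting fields left with no values."""
    document = {}
    for name, values in fields.items():
        kept = indexable_values(values)
        if kept:  # a field whose only values were "N/A" is not indexed at all
            document[name] = kept
    return document


# Hypothetical dataset metadata, for illustration only.
dataset_fields = {
    "title": ["Example dataset"],
    "authorName": ["N/A"],
    "dsDescriptionValue": ["N/A"],
    "subject": ["N/A", "Social Sciences"],
}
solr_document = build_index_document(dataset_fields)
print(solr_document)
```

With this approach a field that held only "N/A" never appears in the indexed document, so it cannot show up in a facet.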

Since the logic I'll need to put in the code is spread across various lines, I just ran a query on our migrated data to make sure I adjust indexing for as many fields as possible. The main ones to adjust will be authors and descriptions:

[pdurbin@dvn-vm7 ~]$ curl -s 'http://localhost:8983/solr/collection1/select?rows=1000000&wt=json&indent=true&q="N/A"' | grep '"N/A"' | sort | uniq -c | sort -nr
   6563         "dsDescriptionValue":["N/A"],
   6062         "authorName_ss":["N/A"],
   6062         "authorName":["N/A"],
     12         "producerURL":["N/A"],
      6         "universe_ss":["N/A"],
      6         "universe":["N/A"],
      4         "relatedMaterial":["N/A"],
      4         "otherGeographicCoverage":["N/A"],
      3         "relatedDatasets":["N/A"],
      2         "gsdSiteType_ss":["N/A"],
      2         "gsdSiteType":["N/A"],
      2         "geographicUnit_ss":["N/A"],
      2         "geographicUnit":["N/A"],
      2         "dvDescription":"N/A",
      2         "description":"N/A",
      1         "title":"N/A",
      1         "publicationCitation":["N/A"],
      1         "nameSort":"N/A",
      1         "gsdProgramBrief_ss":["N/A"],
      1         "gsdProgramBrief":["N/A"],
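
A similar tally can be made by parsing the JSON response instead of grepping it. The sketch below assumes the standard `wt=json` response shape (`response` containing `docs`); the sample documents are hypothetical and are not taken from the migrated data. It counts a field once per document whenever any of its values equals "N/A", which is not guaranteed to match the line-based counts above.

```python
import json
from collections import Counter


def count_na_fields(solr_response_text, na_value="N/A"):
    """Count, per field, how many documents hold the na_value placeholder."""
    response = json.loads(solr_response_text)
    counts = Counter()
    for doc in response["response"]["docs"]:
        for field, value in doc.items():
            # Single-valued fields come back as plain strings, multi-valued as lists.
            values = value if isinstance(value, list) else [value]
            if na_value in values:
                counts[field] += 1
    return counts


# Hypothetical two-document response, shaped like Solr's wt=json output.
sample = json.dumps({
    "response": {
        "numFound": 2,
        "docs": [
            {"id": "dataset_1", "authorName": ["N/A"], "dsDescriptionValue": ["N/A"]},
            {"id": "dataset_2", "authorName": ["Jane Doe"], "dsDescriptionValue": ["N/A"]},
        ],
    }
})

na_counts = count_na_fields(sample)
for field, count in na_counts.most_common():
    print(f"{count:7d} {field}")
```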

@pdurbin changed the title from "N/A in Subject facet" to "Indexing: N/A values should not be indexed" Apr 6, 2015
@pdurbin
Member Author

pdurbin commented Apr 6, 2015

@scolapasta please review the new-and-improved version at 269081b and let me know if I should merge it to master.

@pdurbin assigned scolapasta and unassigned pdurbin Apr 6, 2015
scolapasta added a commit that referenced this issue Apr 7, 2015
prevent N/A values from being indexed #1883
@scolapasta assigned kcondon and unassigned scolapasta Apr 7, 2015
@pdurbin
Member Author

pdurbin commented Apr 7, 2015

@scolapasta merged #1897, but we decided to change `break` to `continue` in 50a4b9a so that indexing doesn't stop at the first "N/A" in a list of controlled vocabulary values. Other values in that list may not be "N/A".
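
To illustrate why the loop keyword matters, here is a minimal Python sketch (not the code from the commit; the function names and the sample values are hypothetical). With `break`, the loop gives up at the first "N/A" and loses any real values that follow; with `continue`, only the "N/A" entries are skipped.

```python
NA_VALUE = "N/A"


def collect_with_break(values):
    """Stops at the first "N/A", so later real values are lost."""
    kept = []
    for value in values:
        if value == NA_VALUE:
            break
        kept.append(value)
    return kept


def collect_with_continue(values):
    """Skips each "N/A" but keeps scanning the rest of the list."""
    kept = []
    for value in values:
        if value == NA_VALUE:
            continue
        kept.append(value)
    return kept


# Hypothetical controlled vocabulary values for a single field.
subjects = ["N/A", "Social Sciences", "Law"]
print(collect_with_break(subjects))     # []
print(collect_with_continue(subjects))  # ['Social Sciences', 'Law']
```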

Ready for QA.

@kcondon assigned posixeleni and unassigned kcondon Apr 7, 2015
@posixeleni
Contributor

@scolapasta @kcondon I can't test this in build since we need to see how migrated local studies got indexed. Can you please tell me when this fix is in the new production site (harvard.dataverse.edu) or vm6?

cc/ @sbarbosadataverse

@posixeleni
Contributor

Searched for N/A in production and was unable to find it indexed. Also looked at facets to confirm this was the case. Closing this ticket.
