Example solr queries for obtaining counts for quality assurance

Does anyone happen to have example solr queries for checking counts on common items like SRC_SYSTEM?

I’d like to double-check what solr contains and what our ETL pushed into it, but the solr query syntax gets a little confusing when trying to do grouped counts.

I’m assuming SRC_SYSTEM is a field in your solr document. If so, you should be able to get a count of documents for each value of that field pretty easily with facets. Something like:

http://localhost:8983/solr/#/documents/query?q=*:*&q.op=OR&indent=true&facet=true&facet.field=SOURCE&rows=0&facet.limit=1000000

(That’s for SOURCE instead of SRC_SYSTEM.) Specifically, add the query parameters

facet=true
facet.field=SRC_SYSTEM
facet.limit=10000000
rows=0

The first three configure counting by the values of a field: facet=true turns faceting on, facet.field names the field to count by, and facet.limit is the maximum number of field values to include in the response, so just set it very high. Set rows to zero so the search returns no documents, only the counts. You can add filters and searches as usual with &fq or &q, so you can, for instance, look at the distribution of documents indexed within the last day to check the last ETL job.
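If you'd rather script the check than paste URLs, here is a minimal Python sketch that builds the same kind of request. It assumes the core is named "documents" (as in the URL above) and uses the standard /select handler; the INDEXED_AT field in the filter is purely hypothetical, so swap in whatever timestamp field your ETL writes. It also passes facet.limit=-1, which Solr treats as "no limit", instead of a very large number.

```python
from urllib.parse import urlencode


def facet_count_url(base_url, core, field, fq=None, limit=-1):
    """Build a /select URL that returns only facet counts for `field`."""
    params = [
        ("q", "*:*"),             # match every document
        ("rows", 0),              # return no documents, only the counts
        ("facet", "true"),        # turn faceting on
        ("facet.field", field),   # field whose values are counted
        ("facet.limit", limit),   # -1 means no cap on the number of values
        ("wt", "json"),
    ]
    if fq:
        params.append(("fq", fq))  # optional filter, normal Solr syntax
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical filter: INDEXED_AT is an assumed field name, not from the thread.
url = facet_count_url(
    "http://localhost:8983/solr",
    "documents",
    "SOURCE",
    fq="INDEXED_AT:[NOW-1DAY TO NOW]",
)
print(url)
```

Open the printed URL in a browser or fetch it with any HTTP client; the counts come back under facet_counts in the JSON response.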

The Solr admin interface lets you do this pretty easily. Check the facet box at the bottom of the query panel on the left, which opens up more input boxes. Add the field name to facet.field and set facet.limit as well. Rows is set further up the panel, below the sort field and above the fl field. (It’s the second box on the line between those two. The line is labeled “start, rows”.)

The response should have something like the below in it:

  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "SOURCE":[
        "source2",159548,
        "source1",158832,
        "source4",158767,
        "source3",158682,
        "source5",32]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

Meaning, there are about 160k documents that contain the value “source2” for the SOURCE field, “source1” has 159k, etc.
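Note that the facet_fields list is flat: it alternates value, count, value, count. A small Python sketch for turning it into a dictionary and comparing it against whatever your ETL reports is below. The sample response is the one above; the helper names and the idea of an "expected" dictionary from the ETL side are my assumptions, not anything Solr provides.

```python
import json


def facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def diff_counts(expected, actual):
    """Return {value: (expected, actual)} for every value whose counts differ."""
    values = sorted(set(expected) | set(actual))
    return {
        v: (expected.get(v, 0), actual.get(v, 0))
        for v in values
        if expected.get(v, 0) != actual.get(v, 0)
    }


# The facet_counts portion of the response shown above.
sample = json.loads("""
{"facet_counts": {"facet_queries": {}, "facet_fields": {"SOURCE": [
    "source2", 159548, "source1", 158832, "source4", 158767,
    "source3", 158682, "source5", 32]}}}
""")

counts = facet_counts(sample, "SOURCE")
print(counts["source2"])     # count for one value
print(sum(counts.values()))  # total across the listed values
```

Calling diff_counts(etl_counts, counts) with a dictionary built from your ETL logs would then list only the values where the two sides disagree.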

Let me know if this didn’t answer your question.

I was able to get this working for SOURCE and that’s sufficient for my needs.

Out of curiosity, we also have SRC_SYSTEM in our records, but it doesn’t seem to work for that by simply swapping in the field name – it comes back empty []. I thought maybe the _ was confusing solr but I can do a facet query on DOC_TYPE. One of my SRC_SYSTEMs has a slash in it – so maybe that’s confusing things.

If you don’t know, that’s OK because using SOURCE is good enough. Thanks!

It depends on the index settings of the field. Sometimes fields are stored but not indexed, in which case you can’t count based on the values of that field. If a field is indexed, there are many ways it can be indexed, which may or may not handle the slash how you want, but I’d guess you don’t have that field indexed, since at least something should come back if it were. You can see if it’s indexed by looking at the documents/conf/managed-schema XML file in the solr core. Its syntax can be a bit complex, but if the field element has indexed="false" on it, it’s definitely not indexed. There are some other settings which could maybe also produce an empty result for this particular query, but I’d have to look through them closely. If you want to post what you find, I might recognize the problem.
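If you want to check that from a script instead of reading the file by eye, something like the following Python sketch would do it. The schema snippet is a made-up example of what a managed-schema might contain, not your actual file, and the function only reports the indexed attribute written on the field element itself; when the attribute is missing there, the setting comes from the field type, which this sketch does not resolve.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; a real managed-schema has many more elements.
SAMPLE_SCHEMA = """
<schema name="example" version="1.6">
  <field name="SOURCE" type="string" indexed="true" stored="true"/>
  <field name="SRC_SYSTEM" type="string" indexed="false" stored="true"/>
  <field name="DOC_TYPE" type="string" stored="true"/>
</schema>
"""


def field_indexed_attr(schema_xml, field_name):
    """Return the `indexed` attribute of a field, or None if it is not set.

    Raises KeyError if the schema has no <field> with that name.
    """
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            return field.get("indexed")
    raise KeyError(field_name)


for name in ("SOURCE", "SRC_SYSTEM", "DOC_TYPE"):
    print(name, field_indexed_attr(SAMPLE_SCHEMA, name))
```

For a real core you would read the file contents first and pass them in as schema_xml.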

That’s totally what it is – the SRC_SYSTEM field is stored but not indexed (I just looked at the solr admin schema page). That’s OK – the SOURCE field should work for what I needed. Thanks!