@pdurbin
Created July 8, 2015 15:06

I'm seeing strange hl.fragsize behavior in Solr 4.6.0, the version I happen to be using.

I've been testing with this "mp500.xml" file...

http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_6_0/solr/example/exampledocs/mp500.xml?view=markup

... using the query "q=indication" and I get some highlights:

$ curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication" | jq '.highlighting'
{
  "MA147LL/A": {
    "features": [
      ", Battery level <em>indication</em>"
    ]
  }
}

Great! I got a highlight snippet back! But what if I start playing with "fragsize"? According to https://wiki.apache.org/solr/HighlightingParameters#hl.fragsize , with fragsize=0 the "whole field value should be used with no fragmenting." And that's what happens:

$ curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication&hl.fragsize=0" | jq '.highlighting'
{
  "MA147LL/A": {
    "features": [
      "Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level <em>indication</em>"
    ]
  }
}

As the docs indicate, fragsize=100 is the default and gives me the same results as we saw above when we left out fragsize:

$ curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication&hl.fragsize=100" | jq '.highlighting'
{
  "MA147LL/A": {
    "features": [
      ", Battery level <em>indication</em>"
    ]
  }
}

But wait a minute... fragsize is defined as "the size, in characters, of the snippets (aka fragments) created by the highlighter". Is that really 100 characters? More like 27 if I strip out the HTML tags:

$ echo -n ", Battery level <em>indication</em>" | awk '{gsub("<[^>]*>", "")}1'
, Battery level indication
$ echo -n ", Battery level <em>indication</em>" | awk '{gsub("<[^>]*>", "")}1' | wc -c
      27

So that's weird. I ask for 100 characters but only get 27?
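
A side note on the counting itself: awk's print appends a newline even though "echo -n" suppressed one, and "wc -c" counts that newline too, so the figure above is one higher than the number of visible characters. Here is a minimal sketch of a count that avoids this. The helper name plain_length is my own invention, not part of Solr or jq, and the tag-stripping regex is the same naive one used in the awk command.

```shell
# plain_length is a hypothetical helper: strip anything that looks like an
# HTML tag from its argument and print the length of what is left.
plain_length() {
  # Command substitution drops trailing newlines, so none is counted.
  plain=$(printf '%s' "$1" | sed 's/<[^>]*>//g')
  printf '%s\n' "${#plain}"
}

plain_length ', Battery level <em>indication</em>'
```

This prints 26 for the fragsize=100 snippet. The same off-by-one applies to every count taken with the awk and "wc -c" pipeline, so comparisons between snippets are unaffected.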

Let's try asking for 110 characters:

$ curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication&hl.fragsize=110" | jq '.highlighting'
{
  "MA147LL/A": {
    "features": [
      ", Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level <em>indication</em>"
    ]
  }
}

That's better. With fragsize=110 we got back a snippet of 121 characters. But why did we only get back 27 characters from fragsize=100?

Here's something else that's strange. With fragsize=120 I get back fewer characters than with fragsize=110 (only 108 rather than 121):

$ curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication&hl.fragsize=120" | jq '.highlighting'
{
  "MA147LL/A": {
    "features": [
      " firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level <em>indication</em>"
    ]
  }
}

As I increase the fragsize, shouldn't I get more characters back? And again, why do I only get 27 characters back from fragsize=100?
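
To make the comparison easier to repeat, here is a rough sketch of a loop over several fragsize values that prints the tag-stripped length of each snippet. It assumes the same local Solr URL and query parameters as the curl commands above, and it assumes the "highlighting" JSON always has the document, field, list-of-snippets shape shown in the output. The function names (fragsize_url, plain_length, sweep) are made up for illustration.

```shell
# Hypothetical helpers for comparing fragsize values. The endpoint and query
# parameters are copied from the curl commands used earlier.
SOLR_SELECT='http://localhost:8983/solr/collection1/select'

# Build the select URL for one fragsize value.
fragsize_url() {
  printf '%s\n' "${SOLR_SELECT}?wt=json&indent=true&hl=true&hl.fl=*&q=indication&hl.fragsize=$1"
}

# Length of a snippet after removing anything that looks like an HTML tag.
plain_length() {
  plain=$(printf '%s' "$1" | sed 's/<[^>]*>//g')
  printf '%s\n' "${#plain}"
}

# For each fragsize given, fetch the highlights and print one line per snippet.
# Assumes .highlighting is {doc id: {field: [snippet, ...]}}.
sweep() {
  for size in "$@"; do
    curl -s "$(fragsize_url "$size")" |
      jq -r '.highlighting[][][]' |
      while IFS= read -r snippet; do
        printf 'fragsize=%s length=%s\n' "$size" "$(plain_length "$snippet")"
      done
  done
}
```

With Solr running and the example documents indexed, something like "sweep 0 100 110 120" should print one line per snippet, which makes any non-monotonic jump in length easy to spot.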

I'm concerned about this because my fix for IQSS/dataverse#2191 is to make fragsize configurable, but I'm getting such unexpected results playing with different fragsize values that I'm losing faith in that approach. We use highlighting heavily to indicate where in the document a query matched. To be clear, I haven't lost faith in Solr itself. It's a great project. I'm just trying to understand what's going on above.

Any advice is welcome!

Phil
