@softwaredoug
Last active March 6, 2021 20:04
Opinions on using Solr effectively from Doug Turnbull

Solr needs opinions, because the Solr documentation gives you way too many options, and it's hard to navigate to the best practices. Here are some of my opinions, based on dozens of Solr projects :)

Prefer preloading static, classic config files over managed schema, config API, or schemaless

Schema files are a good thing. They are declarative, and not letting them change at runtime prevents all kinds of security issues. Further, the classic schema and solrconfig files support all of Solr's functionality and are well documented, with tons of examples online in blog articles and on Stack Overflow. Using managed schema or the config API takes a lot of experimentation.

Static configurations can also be easily version controlled. As I've learned as a long-time Elasticsearch user, this is one of Solr's advantages. When there is an API for changing every underlying config option of your index, finding the code that made a given change is rather time consuming.

Static configuration is also a good separation of concerns. You cleanly separate Solr configuration from your application. Having worked in Elasticsearch, I've seen how clever Elasticsearch libraries (such as searchkick) can manipulate your index in weird ways. That makes working directly with Elasticsearch difficult, because you don't know what searchkick has done to it.

In SolrCloud mode, this means running upconfig to push the config to ZooKeeper. In standalone mode, it means having a configset set up in the configsets folder.
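A minimal sketch of what that deploy step might look like from a CI script, assuming the `bin/solr zk upconfig` command with its `-n`, `-d`, and `-z` flags and the Collections API `RELOAD` action (check both against your Solr version). The configset name, directory, and hosts below are hypothetical placeholders, and the functions only build the command and URL rather than run them.

```python
from urllib.parse import urlencode


def upconfig_command(config_name, config_dir, zk_host, solr_bin="bin/solr"):
    """Build the argv that uploads a version-controlled configset to ZooKeeper."""
    return [solr_bin, "zk", "upconfig",
            "-n", config_name,   # name the configset will have in ZooKeeper
            "-d", config_dir,    # local directory checked out from version control
            "-z", zk_host]       # ZooKeeper connection string


def reload_url(solr_base_url, collection):
    """Build the Collections API URL that reloads a collection after an upconfig."""
    query = urlencode({"action": "RELOAD", "name": collection})
    return f"{solr_base_url.rstrip('/')}/admin/collections?{query}"


if __name__ == "__main__":
    # Hypothetical names and hosts, for illustration only.
    print(" ".join(upconfig_command("products", "solr/products/conf", "localhost:2181")))
    print(reload_url("http://localhost:8983/solr", "products"))
```

A CI job could pass the first result to `subprocess.run` and then request the second URL, so every config change traces back to a commit.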

Push work to the search engine - Learn to love plugins

Here's a fallacy I see on a lot of search teams: they keep the search engine at arm's length. They prefetch thousands of search results and then run some kind of complicated model or process on top of those. This can lead to a very slow and complex search application, and it forces application code to take on a lot of the search engine's responsibilities (pagination, faceting, highlighting). Don't be shy about pushing your work into Solr!

Solr wraps Lucene, and its biggest strength is the underlying Lucene library. Lucene is so extensible and so fast for search and language-like problems that you do yourself a disservice by keeping it at arm's length.

Lucene has access to term statistics in a fast index. Many information retrieval models use document frequency or total term frequency, which can be quickly accessed via the inverted index.
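To make those two statistics concrete, here is a toy in-memory inverted index. It is only an illustration of the idea, not how Lucene stores or exposes its postings; the whitespace tokenizer and function names are assumptions made for the sketch.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency within that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive lowercase + whitespace tokenization, for illustration only.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def doc_freq(index, term):
    """Number of documents that contain the term at least once."""
    return len(index.get(term, {}))


def total_term_freq(index, term):
    """Number of occurrences of the term across all documents."""
    return sum(index.get(term, {}).values())


if __name__ == "__main__":
    docs = {1: "solr wraps lucene", 2: "lucene lucene index"}
    index = build_index(docs)
    print(doc_freq(index, "lucene"), total_term_freq(index, "lucene"))
```

Both lookups touch only the postings for one term, which is why computing them inside the engine is so much cheaper than fetching documents and counting in application code.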

Lucene also increasingly uses columnar doc-values for fast retrieval of numeric attributes. Indeed, this is how a lot of vector scoring is often done.
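As a rough illustration of why a columnar layout suits this kind of scoring, the sketch below keeps one vector per document in a column indexed by doc id and scores only the documents a query matched. This is a toy model with hypothetical names, not Lucene's doc-values API.

```python
def dot_product_scores(vector_column, query_vector, candidate_doc_ids):
    """Score candidates by dot product against a per-document vector column.

    vector_column is a list indexed by doc id, standing in for a columnar
    store: looking up one document's vector is a direct positional read.
    """
    scores = {}
    for doc_id in candidate_doc_ids:
        doc_vector = vector_column[doc_id]
        scores[doc_id] = sum(q * d for q, d in zip(query_vector, doc_vector))
    return scores


if __name__ == "__main__":
    column = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
    # Only docs 0 and 2 matched the (hypothetical) text query.
    print(dot_product_scores(column, [1.0, 1.0], [0, 2]))
```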

Push the right work to the search engine

You shouldn't be shy about pushing work to Solr, but it IS important to push the right work to Solr. You should push the smallest amount of functionality that enables your application. A good plugin:

(a) helps you avoid negative patterns like prefetching thousands of results, or asking the search engine for every term's document frequency, etc. These smell like "I'm recreating the search engine in application logic"

(b) can be configured to solve your problem

(c) fits cleanly into the existing extension points (like analyzers, custom Lucene queries, query parsers, etc)

(d) doesn't implement application logic / decisions in the plugin, unless that application logic is unavoidably entangled with the search engine problem

(e) is a solution to a somewhat generic IR problem, like something you wish were part of Solr more generally

Deciding the right set of responsibilities for the search engine vs application code can be subjective. Here are some examples of good plugins:

  • A plugin that removes a specified suffix (like 'js' from Angularjs), replacing a regex. Just removing the suffix is much faster than a regex char replace factory.

  • A query parser that quotes specified, unique phrases, such as collocations.
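The core logic of both examples is small, which is part of what makes them good plugin candidates. A minimal sketch in Python, standing in for what would be a Java analysis filter and query parser in a real Solr plugin; the function names and the phrase list are hypothetical.

```python
def strip_suffix(token, suffixes=("js",)):
    """Remove the first matching suffix, but never reduce a token to nothing."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def quote_phrases(query, phrases):
    """Wrap known multi-word phrases in quotes so they are searched as phrases."""
    tokens = query.split()
    # Try longer phrases first; skip empty entries so the scan always advances.
    phrase_tokens = sorted(
        (p.lower().split() for p in phrases if p.split()), key=len, reverse=True
    )
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrase_tokens:
            n = len(phrase)
            if [t.lower() for t in tokens[i:i + n]] == phrase:
                out.append('"' + " ".join(tokens[i:i + n]) + '"')
                i += n
                break
        else:
            # No phrase starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(strip_suffix("Angularjs"))
    print(quote_phrases("cheap new york hotels", ["new york"]))
```

Note that the suffix check is a plain string comparison with no pattern matching, which is where the speedup over a regex comes from.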

Let's say you encountered a problem like code search, and you wanted to implement it in Solr. How should you go about thinking of a 'plugin' for this problem? You could:

  1. Go "all Solr" and build programming language parsing, etc into Solr itself.

  2. Go "Solr at arm's length" and barely use Solr, perhaps building a completely different index yourself and reserving Solr for natural language problems.

  3. Analyze the problem of searching programming languages. Build some supporting Solr functionality. For example, trigram indexing might be a problem that comes out of code search, and you could build some Solr functionality that analyzes, indexes, and scores using trigram indexes more efficiently for your use case.
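To show what the trigram idea in option 3 buys you, here is a toy in-memory version: index every three-character window, intersect the postings for the query's trigrams to get candidates, then verify the candidates with a real substring check. This is a sketch of the general technique with hypothetical names, not a Solr plugin.

```python
from collections import defaultdict


def trigrams(text):
    """All distinct three-character windows in the text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_trigram_index(docs):
    """Map each trigram to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, docs, substring):
    """Find docs containing the substring, using trigrams to narrow candidates."""
    grams = trigrams(substring)
    if not grams:
        # Query shorter than three characters: trigrams cannot filter, so scan all.
        candidates = set(docs)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Trigram matches are necessary but not sufficient, so verify each candidate.
    return sorted(d for d in candidates if substring in docs[d])


if __name__ == "__main__":
    docs = {
        "a.py": "def parse_query(q):",
        "b.py": "class QueryParser:",
        "c.py": "import os",
    }
    index = build_trigram_index(docs)
    print(search(index, docs, "Query"))
```

The application never needs to pull thousands of documents to do substring matching; the index does the narrowing.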

Put plugins in the filesystem, not zookeeper

Solr has myriad options for where to store plugins. You can place them in several places on the node, the zookeeper blob store, etc.

From a security perspective, uploading runtime libs just seems like a bad idea. I mean look at all the security requirements. Indeed the

With modern deployment infrastructure, it's not hard to create a Solr container that places a plugin in the right spot.

ZooKeeper's blob store has also had issues dealing with plugins.

Prefer standalone mode if everything fits on a single shard

Learn to Love Docker-Solr

Solr isn't for passive consumers, join the community

Solr can be the wild west. An Apache project like this is best suited for active organizations that can deal with warts and are comfortable getting under the hood. Don't expect a consistent user experience, but DO expect pluggability and power.

Solr doesn't have a single vendor like Elastic (for better or worse). The upside is that any organization can be as core to the community as any other. The downside is that it's a confederation of lots of voices and opinions.

If you want to passively consume something, with solid opinions, think about whether you should use Elastic or another technology.

Streaming expressions are awesome

Only index a view of documents you care about for search & scoring

See Relevant Search, chapter 5, and What Should Your Search Document Be.

Avoid hierarchical documents like the plague

Search engines are flat, and the nested features don't work well. Think hard about whether you truly need them.
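One common alternative is to denormalize children into multi-valued fields on the parent before indexing. A minimal sketch, assuming a hypothetical product-with-variants shape and a `<children_key>_<field>` naming convention chosen for illustration:

```python
def flatten(parent, children_key, child_fields):
    """Fold child documents into multi-valued fields on a single flat document."""
    flat = {k: v for k, v in parent.items() if k != children_key}
    for field in child_fields:
        # One multi-valued field per child attribute, e.g. variants_color.
        flat[f"{children_key}_{field}"] = [
            child[field] for child in parent.get(children_key, []) if field in child
        ]
    return flat


if __name__ == "__main__":
    product = {
        "id": "p1",
        "name": "shirt",
        "variants": [
            {"color": "red", "size": "M"},
            {"color": "blue", "size": "L"},
        ],
    }
    print(flatten(product, "variants", ["color", "size"]))
```

The trade-off is that the flat document no longer records which color goes with which size, so a query for "red AND L" would match this product. If that kind of cross-field correlation matters to your users, that is the case where nesting (or indexing one document per variant) may be worth its cost.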

Be cautious with the shiny new Solr feature

Solr is a very "bazaar" project (pun intended). It can be a fun, organic free-for-all where new ideas come in, but often in half-completed form, not ready for prime time. Don't expect the new Solr features to be production ready :)

@rrampage

Searchkick should link to https://github.com/ankane/searchkick . Current link points back to the gist.

@rrampage

At one of my previous workplaces, we used SolrCloud extensively where we followed these practices:

  • Maintain a git repo for each index, plus scripts that push changes to ZooKeeper and reload all members of the cluster. These scripts were triggered either by an admin or by CI. Each index also needed a script that could rebuild it from scratch. This could either call the DB (using the MySQL connector) or an internal endpoint if we needed to combine multiple data sources.
  • We used a well-defined schema for most indexes, with the exception of an events index, which had some core fields and leveraged the dynamic fields feature, e.g. an _i suffix for arbitrary int fields.
  • Velocity templates are useful for creating quick prototypes of how a new query can affect the search page.

@arafalov

arafalov commented Mar 6, 2021

I am starting to think that we may need some sort of "Solr for Simple Projects" guide. Because it is hard to see an easy path through all the options. And, sometimes, those easy options are not even documented all the way through.

And because many complex projects start as simple projects, it would be sad if they became abandoned or frozen at the "collection1" schema because it was too hard.
