SOLR Flashcards
What is it good for? What are the common use cases?
Text-centric (Can even do spell-check and synonym handling)
Read-dominant
Document-oriented (Flat structure, no sub-fields)
Flexible schema
Common use cases:
- Keyword search
- Ranked retrieval
- Faceting
- Geospatial search (indexing latitude and longitude values, as well as sorting or ranking documents by geographical distance)
Don't use it for deep analytics, document relationships, or permission-based storage.
What is an inverted index in the context of Lucene?
Every term maps to a list of document IDs (a postings list). For each document in that list, the index also stores the term frequency and the term's positions within the document. The positions are what make exact phrase matching possible.
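As a rough illustration of the structure described above, here is a minimal Python sketch of an inverted index with positions. It is a toy model, not Lucene's actual on-disk format; the function name, the sample docs, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Hypothetical sample documents.
docs = {
    1: "homes for rent in Boston",
    2: "rent a boat",
    3: "homes for sale and homes for rent",
}
index = build_inverted_index(docs)

# The keys of a posting are the matching doc IDs; the length of each
# position list is the term frequency in that doc.
for term in ("homes", "rent"):
    print(term, dict(index[term]))
```

Term frequency falls out of the position lists for free here; a real index stores it separately so scoring does not need to read positions.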
What are some of the query features?
- Conditional logic using AND, OR, and NOT
- Wildcard matching
- Range queries for dates and numbers
- Phrase queries with slop to allow for some distance between terms
- Fuzzy string matching
- Regular expression matching
- Function queries
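To make these features concrete, the sketch below lists one example query string per feature and URL-encodes it for a select request. The field names (title, price, timestamp, popularity), the host, and the collection name are assumptions for illustration only; adjust them to your own schema and deployment.

```python
from urllib.parse import urlencode

# One illustrative query per feature; all field names are hypothetical.
EXAMPLE_QUERIES = {
    "boolean": "title:solr AND NOT title:lucene",
    "wildcard": "title:sear*",
    "date range": "timestamp:[2013-01-01T00:00:00Z TO NOW]",
    "numeric range": "price:[100 TO 500]",
    "phrase with slop": 'title:"homes rent"~2',
    "fuzzy": "title:hous~1",
    "regex": "title:/h[ou]mes/",
    "function": "{!func}log(popularity)",
}


def select_url(query, base="http://localhost:8983/solr/collection1/select"):
    """Build a select URL; the base URL is an assumed local default."""
    return base + "?" + urlencode({"q": query, "wt": "json"})


for feature, query in EXAMPLE_QUERIES.items():
    print(f"{feature:18} {select_url(query)}")
```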
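How can leading wildcards be supported? e.g. *ing
Enable reversed wildcard filtering (ReversedWildcardFilterFactory) on the field's index analyzer. Each term is indexed in reversed form as well as in its original form, so a leading wildcard becomes a fast prefix match on the reversed term. The cost is roughly twice the index space for that field.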
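The idea behind reversing terms can be sketched in a few lines of Python. This is a toy model of the technique, not Solr's implementation: a sorted list of reversed terms stands in for the term dictionary, and the function names are made up for the example.

```python
import bisect


def build_reversed_terms(terms):
    """Store every term reversed, in sorted order, like a term dictionary."""
    return sorted(term[::-1] for term in terms)


def leading_wildcard(pattern, reversed_terms):
    """Resolve a pattern such as '*ing' with a prefix scan over reversed terms."""
    if not pattern.startswith("*"):
        raise ValueError("expected a leading wildcard such as '*ing'")
    prefix = pattern[1:][::-1]  # '*ing' -> 'gni'
    start = bisect.bisect_left(reversed_terms, prefix)
    matches = []
    for reversed_term in reversed_terms[start:]:
        if not reversed_term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        matches.append(reversed_term[::-1])
    return sorted(matches)


reversed_terms = build_reversed_terms(["running", "sing", "ring", "rent", "home"])
print(leading_wildcard("*ing", reversed_terms))
```

Without the reversed copy, a leading wildcard would have to scan every term in the dictionary, since sorted order only helps when the known characters come first.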
How is an exact phrase match achieved? e.g. “homes for rent”
Solr still looks up the individual terms, but after finding the documents that contain all of them it also checks the term positions stored in the index to confirm that the positions are consecutive.
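The two-step check can be sketched as follows. This is a simplified Python model that assumes a positional index shaped as term -> {doc_id: [positions]}; the sample index and the function name are hypothetical, and real phrase scoring in Lucene is considerably more involved.

```python
def phrase_match(index, phrase):
    """Return IDs of docs where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # Step 1: find docs that contain every term.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    # Step 2: keep docs where term i appears at (start + i) for some start.
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


# Hypothetical positional index; doc 4 is "rent for homes" (same terms, wrong order).
index = {
    "homes": {1: [0], 3: [0, 4], 4: [2]},
    "for": {1: [1], 3: [1, 5], 4: [1]},
    "rent": {1: [2], 2: [0], 3: [6], 4: [0]},
}
print(phrase_match(index, "homes for rent"))
```

Doc 4 passes step 1 but fails step 2, which is exactly the case that term positions exist to rule out.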
What is inverse document frequency?
Inverse document frequency (idf) is a measure of how “rare” a search term is. It is calculated by finding the document frequency (the number of documents the search term appears in) and taking its inverse, so rarer terms score higher.
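A small numeric sketch may help. The formula below follows the form used by Lucene's classic TF-IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)). Treat it as one illustrative variant, since the exact formula depends on the Lucene version and the configured similarity (newer releases default to BM25, which computes idf differently). The function name and sample numbers are assumptions.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf in the style of Lucene's classic TF-IDF similarity (illustrative)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


num_docs = 1000
for term, doc_freq in [("the", 990), ("rent", 120), ("geospatial", 3)]:
    print(f"{term:12} df={doc_freq:4}  idf={classic_idf(doc_freq, num_docs):.3f}")
```

The logarithm dampens the effect, so a term that is 100 times rarer gets a noticeably higher weight but not a 100 times higher one.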
Limitations of Solr?
Relationships and joins are poorly supported. The denormalized schema makes it unsuitable for some kinds of updates.
In general, fields cannot be updated in place. Changing a field means reindexing the whole document.
What does ‘warming a new searcher’ mean?
Solr runs queries against a searcher, which has caching built in. On any commit, a new searcher needs to be created. For performance, the config can define warming queries to run on any new searcher. Until the new searcher is warmed, the old searcher continues to serve current traffic.
Hence a commit can be a costly operation.
What are autowarming and warming queries?
There are two searcher events: firstSearcher and newSearcher. The former fires when the first searcher is created on startup, and the latter when a new searcher is created after a commit.
Autowarming is the ability to carry cache entries over from the old searcher to the new one. A warming query is a query run when a searcher is created in order to warm its caches.
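The interplay of the two mechanisms can be modeled with a toy searcher. Everything in this Python sketch (ToySearcher, open_new_searcher, the string "results") is hypothetical and only mimics the behavior described above: autowarming replays the most recently used cache keys against the new searcher, and warming queries are then run on top.

```python
from collections import OrderedDict


class ToySearcher:
    """Toy stand-in for a searcher bound to one version of the index."""

    def __init__(self, index_version):
        self.index_version = index_version
        self.cache = OrderedDict()  # query -> cached result, oldest first

    def run(self, query):
        if query in self.cache:
            self.cache.move_to_end(query)  # mark as most recently used
        else:
            # Pretend to execute the query against this index version.
            self.cache[query] = f"results for {query} @v{self.index_version}"
        return self.cache[query]


def open_new_searcher(old, autowarm_count, warming_queries):
    """Build the next searcher; the old one keeps serving until this returns."""
    new = ToySearcher(old.index_version + 1)
    # Autowarming: re-run the most recently used keys of the old cache.
    recent = list(old.cache)[-autowarm_count:] if autowarm_count > 0 else []
    for query in recent:
        new.run(query)
    # Warming queries: explicitly configured queries.
    for query in warming_queries:
        new.run(query)
    return new


old = ToySearcher(index_version=1)
for q in ("a", "b", "c"):
    old.run(q)
new = open_new_searcher(old, autowarm_count=2, warming_queries=["manu:Belkin"])
print(list(new.cache))
```

Note that the cached values are regenerated rather than copied, because results computed against the old index version may no longer be valid.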
What kinds of caches are there in Solr?
Filter cache: Results of filters such as fq=manu:Belkin are cached so that later queries using the same filter run faster.
Query result cache: Caches the document IDs of all results of a query. A max setting limits how many docs per query go into the cache.
Document cache: Caches the actual docs so that disk seeks are avoided. The IDs held in the query result cache are resolved to docs through this cache.
Field value cache: A Lucene-level concept used internally to return results faster.
An autowarm count can be set for the filter and query result caches. The document cache cannot be autowarmed, because internal document IDs change between searchers.
What are the special field types in Solr?
- Dynamic fields allow you to apply the same definition to any fields in your documents whose names match a prefix or suffix pattern, such as s_* or *_s.
- Copy fields allow you to populate one field from one or more other fields. This can support a catch-all search field, or apply a different analyzer to the same content (for example, a case-insensitive one).
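Both mechanisms can be approximated in a few lines of Python. This sketch is a toy model under stated assumptions: the patterns, type names, and the text_all destination are invented, and matching is first-match-wins, which may differ from how Solr prioritizes overlapping patterns.

```python
from fnmatch import fnmatchcase

# Hypothetical schema fragments: (pattern, field type) and (source, destination).
DYNAMIC_FIELDS = [("*_s", "string"), ("*_i", "int"), ("attr_*", "text")]
COPY_FIELDS = [("title", "text_all"), ("body", "text_all")]


def resolve_type(field_name):
    """Return the type of the first dynamic field pattern that matches, if any."""
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


def apply_copy_fields(doc):
    """Return a copy of doc with each copy-field destination populated."""
    out = dict(doc)
    for source, dest in COPY_FIELDS:
        if source in doc:
            out[dest] = list(out.get(dest, [])) + [doc[source]]
    return out


print(resolve_type("color_s"), resolve_type("attr_size"), resolve_type("unknown"))
print(apply_copy_fields({"id": "1", "title": "Solr", "body": "Search server"}))
```

The destination field ends up multi-valued, which is why a catch-all field fed by several sources generally needs to allow multiple values.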
How does Solr handle dates?
Using the ‘tdate’ field type. Range queries are supported, e.g. timestamp:[NOW/DAY TO NOW/DAY+1DAY} (a square bracket is inclusive, a curly brace is exclusive).
It uses a trie data structure under the covers.
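To show what the date math in that range evaluates to, here is a Python equivalent using datetime. It assumes UTC and mirrors only this one expression (NOW/DAY rounds down to the start of the day, +1DAY adds a day); the helper names are made up.

```python
from datetime import datetime, timedelta, timezone


def today_range(now):
    """Equivalent of [NOW/DAY TO NOW/DAY+1DAY} for a given 'now'."""
    start = now.replace(hour=0, minute=0, second=0, microsecond=0)  # NOW/DAY
    end = start + timedelta(days=1)                                 # NOW/DAY+1DAY
    return start, end


def in_range(timestamp, start, end):
    """'[' makes the lower bound inclusive, '}' makes the upper bound exclusive."""
    return start <= timestamp < end


now = datetime(2013, 5, 17, 14, 30, tzinfo=timezone.utc)
start, end = today_range(now)
print(start.isoformat(), "to", end.isoformat())
print(in_range(now, start, end), in_range(end, start, end))
```

The exclusive upper bound keeps midnight of the next day out of today's range, so consecutive daily ranges never overlap.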
What tools allow data import?
SolrJ - Java client that uses the efficient javabin binary protocol
Data Import Handler - can support JDBC connections
ExtractingRequestHandler - can extract content from rich docs like PDF
Nutch - a web crawler that integrates with Solr by default
What is soft commit?
Soft commit is a mechanism to make documents searchable in near real-time by skipping the costly aspects of hard commits.
Does Solr support field-level updates?
Yes. The primary key needs to be provided. Under the covers Solr fetches the whole doc, applies the changes, and reindexes it. Optimistic concurrency control is in place using the _version_ field.
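An update request of this kind might look like the JSON built below. The "set" modifier and the _version_ field are Solr's; the helper name, the doc ID, the field, and the version number are hypothetical. This sketch only constructs the payload and does not send it anywhere.

```python
import json


def atomic_update(doc_id, changes, expected_version=None):
    """Build a JSON body that sets the given fields on one document."""
    doc = {"id": doc_id}  # the primary key identifies the doc to change
    for field, value in changes.items():
        doc[field] = {"set": value}  # replace the field's current value
    if expected_version is not None:
        # Optimistic concurrency: the update should be rejected if the
        # stored version no longer matches this one.
        doc["_version_"] = expected_version
    return json.dumps([doc])


payload = atomic_update("book1", {"price": 9.99}, expected_version=42)
print(payload)
```

If the version check fails, the usual pattern is to re-read the document, reapply the change, and retry with the fresh version.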