Solr Flashcards

1
Q

What is Solr good for? What are its common use cases?

A

Text-centric (it can even do spell-checking and synonym handling)
Read-dominant
Document-oriented (flat structure, no sub-fields)
Flexible schema

Common use cases:

  • Keyword search
  • Ranked retrieval
  • Facets
  • Geospatial search (indexing latitude and longitude values, as well as sorting or ranking documents by geographical distance)

Don't use it for deep analytics, document relationships, or permission-based storage.

2
Q

What is an inverted index in the context of Lucene?

A

Every term maps to a list of document IDs. For each document in that list, the index also stores the term frequency and the positions of the term within the document (the positions are what make exact phrase matching possible).
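
The structure described above can be sketched in a few lines of Python. This is only an illustration with whitespace tokenization and made-up documents, not how Lucene stores its index on disk; the function name build_inverted_index is hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}.

    The term frequency in a document is the length of its position list.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Toy analysis: lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

# Hypothetical sample documents.
docs = {
    1: "homes for rent",
    2: "rent homes for students",
    3: "new homes",
}
index = build_inverted_index(docs)
homes_postings = index["homes"]  # doc IDs with the positions of "homes"
```

Looking up a term is then a single dictionary access, which is why keyword search over an inverted index is fast.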

3
Q

What are some of the query features?

A
  • Boolean logic using AND, OR, and NOT
  • Wildcard matching
  • Range queries for dates and numbers
  • Phrase queries with slop to allow for some distance between terms
  • Fuzzy string matching
  • Regular expression matching
  • Function queries
4
Q

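How can leading wildcards be supported efficiently? e.g. *ing

A

Add a reversed wildcard filter (ReversedWildcardFilterFactory) to the field's index analyzer. Each term is then indexed in reversed form as well, so a leading-wildcard query such as *ing can be run as a prefix query on the reversed terms (gni*). The cost is roughly twice the index space for that field.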
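
A minimal sketch of the reversed-term idea, assuming a sorted in-memory term list stands in for the index. The function names are hypothetical and this is not Solr's actual implementation.

```python
import bisect

def build_reversed_terms(terms):
    """Store every term reversed, in sorted order (like a term dictionary)."""
    return sorted(term[::-1] for term in terms)

def leading_wildcard(reversed_terms, suffix):
    """Answer a *suffix query as a prefix scan over the reversed terms."""
    prefix = suffix[::-1]
    start = bisect.bisect_left(reversed_terms, prefix)
    matches = []
    for rev in reversed_terms[start:]:
        if not rev.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(rev[::-1])
    return sorted(matches)

reversed_terms = build_reversed_terms(["running", "sing", "rent", "solr"])
result = leading_wildcard(reversed_terms, "ing")
```

Because the reversed terms are sorted, the scan touches only the matching terms instead of every term in the dictionary.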

5
Q

How is an exact phrase match achieved, e.g. “homes for rent”?

A

Solr still looks up the individual terms, but once it has the documents that contain all of them it also checks the ‘term position’ entries in the index to confirm that the terms appear at consecutive positions, in order.
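
The position check can be sketched as follows, assuming postings of the form term -> {doc_id: [positions]}. The sample postings and the phrase_match helper are hypothetical.

```python
def phrase_match(postings, phrase):
    """Return doc IDs in which the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    # Step 1: candidate documents must contain every term.
    candidates = set(postings.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(postings.get(term, {}))
    # Step 2: verify that positions are consecutive and in order.
    hits = []
    for doc_id in sorted(candidates):
        for start in postings[terms[0]][doc_id]:
            if all(start + i in postings[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

# Doc 1 is "homes for rent", doc 2 is "rent homes for".
postings = {
    "homes": {1: [0], 2: [1]},
    "for": {1: [1], 2: [2]},
    "rent": {1: [2], 2: [0]},
}
exact = phrase_match(postings, "homes for rent")
```

Both documents contain all three terms, but only the first has them in consecutive positions.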

6
Q

What is inverse document frequency?

A

Inverse document frequency (idf) is a measure of how “rare” a search term is. It is calculated by finding the document frequency (the number of documents the search term appears in) and taking its inverse, so rarer terms get a higher value.
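
As a rough illustration, here is one common formulation, 1 + ln(N / (df + 1)), which is similar to what Lucene's classic TF-IDF similarity has used. Other similarities such as BM25 use different formulas, so treat the exact expression as an assumption.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Higher value for rarer terms; the +1 in the denominator avoids division by zero."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

rare = classic_idf(doc_freq=9, num_docs=1000)      # term in 9 of 1000 docs
common = classic_idf(doc_freq=499, num_docs=1000)  # term in 499 of 1000 docs
```

The logarithm dampens the effect, so a term that is 50 times rarer does not score 50 times higher.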

7
Q

What are the limitations of Solr?

A

Relationships and joins are poorly supported. The denormalized schema makes it unsuitable for some kinds of updates.
Fields cannot be updated in place. Even when only one field changes, the whole document is re-indexed.

8
Q

What does ‘warming a new searcher’ mean?

A

Solr runs queries against a searcher, which has caching built in. On a commit, a new searcher has to be created, and its caches start out empty. For performance, the config can define warming queries to run on any new searcher. Until the new searcher is warmed, the old searcher continues to serve current traffic.
Hence a commit can be a costly operation.

9
Q

What are autowarming and warming queries?

A

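There are two searcher events: firstSearcher and newSearcher. The former fires when the first searcher is created on startup, and the latter when a new searcher is created after a commit.
Autowarming is the ability to populate the new searcher's caches from the most recently used entries in the old searcher's caches. A warming query is a query configured to run when a new searcher is created, in order to warm its caches.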

10
Q

What kinds of caches does Solr have?

A

Filter cache: The document sets matched by filters such as fq=manu:Belkin are cached, so later queries that use the same filter are faster.

Query result cache: The ordered document IDs of a query's results. A max setting limits how many documents per query go into the cache.

Document cache: Caches the stored documents so that disk seeks are avoided. Documents listed in the query result cache are fetched through this cache.

Field value cache: A Lucene-level concept used internally to return results faster.

An autowarm count can be set for the filter cache and the query result cache. The document cache cannot be autowarmed, because internal document IDs change between searchers.

11
Q

What are special field types in Solr?

A
  • Dynamic fields allow you to apply the same definition to any fields in your documents whose names match either a prefix or suffix pattern, such as s_* or *_s
  • Copy fields allow you to populate one field from one or more other fields (this can support a catch-all search field, or apply a different analyzer to the same content, such as a case-insensitive one)
12
Q

How does Solr handle dates?

A

Using the ‘tdate’ field type. Range queries are supported, including date math, e.g. timestamp:[NOW/DAY TO NOW/DAY+1DAY} (from the start of today, inclusive, to the start of tomorrow, exclusive).
It uses a trie data structure under the covers.
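
To make the date math concrete, the sketch below computes the same half-open interval in Python, assuming UTC (Solr's default for date math). NOW/DAY rounds down to the start of the day, and +1DAY adds one day. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def today_range(now):
    """Equivalent of [NOW/DAY TO NOW/DAY+1DAY}: start of day up to start of next day."""
    start = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return start, start + timedelta(days=1)

def in_range(ts, start, end):
    """'[' makes the lower bound inclusive, '}' makes the upper bound exclusive."""
    return start <= ts < end

now = datetime(2024, 5, 17, 15, 30, tzinfo=timezone.utc)  # arbitrary example instant
start, end = today_range(now)
```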

13
Q

What tools allow data import?

A

SolrJ - Java client that uses the efficient javabin protocol
Data Import Handler - can import over JDBC connections
ExtractingRequestHandler - can extract text from rich documents like PDF
Nutch - web crawler that integrates with Solr by default

14
Q

What is a soft commit?

A

A soft commit is a mechanism to make documents searchable in near real-time by skipping the costly aspects of a hard commit, such as flushing the index to durable storage.

15
Q

Does Solr support field-level updates?

A

Yes (atomic updates). The document's unique key needs to be provided. Under the covers Solr fetches the whole document, applies the changes, and re-indexes it. Optimistic concurrency control is in place using the _version_ field.
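
A minimal sketch of what such an update request body can look like in Solr's JSON update format, built with Python's json module. It assumes the unique key field is named id; the field name price and the version number are made-up examples, and the body would be posted to the collection's update handler.

```python
import json

def atomic_update(doc_id, changes, version=None):
    """Build a JSON body that sets the given fields on one document."""
    doc = {"id": doc_id}  # assumption: the unique key field is "id"
    for field, value in changes.items():
        doc[field] = {"set": value}  # "set" replaces the field's value
    if version is not None:
        # Optimistic concurrency: the update fails if the stored version differs.
        doc["_version_"] = version
    return json.dumps([doc])

body = atomic_update("book1", {"price": 99}, version=12345)
```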

16
Q

What are some types of text transformations Solr does?

A

Lowercasing, removing apostrophes, stemming (drinking -> drink), stop word removal (a, an, with), handling hyphenated words (i-pad -> ipad and i pad), collapsing repeated letters (yummmm -> yumm).

Text analyzers need to be added to the fields that need to be transformed. In Solr terms they are chains of filters that act on text fields in the order in which they are defined. They can be defined for index time or for query time, and often for both: lowercasing, for example, has to be done on both sides for terms to match.
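
The idea of a filter chain applied in order can be sketched in Python. These are toy filters written for illustration (the stemmer only strips a trailing "ing"), not Solr's actual tokenizers and filter factories.

```python
import re

STOP_WORDS = {"a", "an", "with"}

def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_apostrophes(tokens):
    return [t.replace("'", "") for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(tokens):
    # Toy stemmer: strip a trailing "ing" from longer words.
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]

def collapse_repeats(tokens):
    # Three or more repeats of a character collapse to two (yummmm -> yumm).
    return [re.sub(r"(.)\1{2,}", r"\1\1", t) for t in tokens]

# Filters run in the order they are listed, as in a field type definition.
ANALYZER = [lowercase, remove_apostrophes, remove_stop_words, naive_stem, collapse_repeats]

def analyze(text, filters=ANALYZER):
    tokens = text.split()  # whitespace tokenizer
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

indexed = analyze("Drinking a Yummmm smoothie")
queried = analyze("drinking")
```

Running the same chain at index time and at query time is what lets the query "drinking" match the indexed term "drink".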

17
Q

What is the omitNorms attribute on fields?

A

The document length norm is used to boost shorter documents: Lucene gives shorter documents a slight boost over longer ones to improve relevance scoring.
omitNorms can be enabled on a field so that this boost is disabled for it.

18
Q

How do you decide what to send in the fq and q parameters? What is the ‘cost’ param on an fq?

A

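fq = filter query. An fq restricts the result set without contributing to the relevance score, and its results are cached in the filter cache independently of q, so a repeated fq is a cache hit before the rest of the query is looked at. (There can be multiple fq params, while there can be only a single q.) So put the scored search terms in q and reusable restrictions in fq.
You can also give an fq a numeric ‘cost’: among filters that are not cached, lower-cost ones run first (give the lowest cost to the filter that eliminates the most docs). A cost of 100 or above, on a non-cached fq whose query type supports it, turns on ‘post filtering’, where the fq is applied only AFTER a document has matched q and the other fqs.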
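
As a sketch, the request parameters might be assembled like this with Python's standard library. The field names are hypothetical, and the frange parser is used as an example of a query type that can run as a post filter when given cache=false and a cost of 100 or more.

```python
from urllib.parse import urlencode, parse_qs

def build_query(q, fqs):
    """Encode one q parameter and any number of fq parameters."""
    params = [("q", q)] + [("fq", fq) for fq in fqs]
    return urlencode(params)

query_string = build_query(
    "title:solr",
    [
        "manu:Belkin",  # ordinary filter, cached in the filter cache
        "{!frange l=10 u=100 cache=false cost=100}price",  # candidate post filter
    ],
)
decoded = parse_qs(query_string)
```

The local params in {!...} are where per-filter options such as cache and cost are set.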

19
Q

How do ‘term proximity’ based queries work?

A

It’s also possible to search for terms that are close together, but not necessarily right beside each other, by adding a tilde and the number of positions the terms can be moved apart (the slop) after the quoted phrase:
"apache solr"~3
(The default slop is 0, which is why plain quoted text is an exact phrase match.)

20
Q

How do ‘character proximity’ based queries work?

A

Not only can you perform proximity searching between terms, you can also perform fuzzy matching based on the edit distance between characters, to find terms spelled similarly to the one you’re querying. The number after the tilde is the maximum edit distance.
solr~1 (Matches sol, sor …)
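
Edit distance can be illustrated with a plain Levenshtein implementation (insertions, deletions, and substitutions each cost 1). This is a sketch of the concept only: Lucene's own fuzzy matching differs in details and is implemented far more efficiently than scanning a vocabulary like this.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

def fuzzy_matches(term, vocabulary, max_edits):
    return [w for w in vocabulary if edit_distance(term, w) <= max_edits]

matches = fuzzy_matches("solr", ["sol", "sor", "solar", "lucene"], max_edits=1)
```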

21
Q

What is the eDisMax parser?

A

The Extended Disjunction Maximum (eDisMax) parser is generally used for user-facing applications where users type in their own search queries. Key differences from the Lucene parser:

  • Searches multiple fields
  • Forgiving of errors in query syntax
  • By default, when a term matches in more than one field, uses the maximum of the field scores rather than their sum (the Lucene parser sums up the scores)
  • Flexible relevancy calculation

But flexibility comes with a slight overhead.
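
The "maximum rather than sum" behaviour follows the disjunction-max scoring rule: take the best field score and add a tie-breaker fraction of the remaining scores. A small sketch, with made-up scores:

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus tie * (sum of the other field scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

# A term scoring 0.8 in title and 0.3 in content (hypothetical values).
pure_max = dismax_score([0.8, 0.3])          # tie=0: only the best field counts
summed = dismax_score([0.5, 0.25], tie=1.0)  # tie=1: behaves like a plain sum
```

With tie=0 a document cannot outrank another just by repeating the term across many fields.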

22
Q

What is the boost query parser?

A

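A parser that boosts relevancy and does no filtering of its own: it multiplies the score of the query it wraps by a function value. e.g.
/select?q=_query_:"{!edismax qf='title content'}data science" AND
_query_:"{!boost b=log(popularity)}*:*" AND
_query_:"{!boost b=recip(ms(NOW,articledate),3.16e-11,1,1)}category:news"

Above is a nested query. The edismax clause matches on data science. The boost clause wrapped around *:* matches every document and only affects ranking. The last clause matches category:news and boosts recent articles.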

23
Q

What fields can facets be on? What are some of the features for facets?

A

Only on indexed fields. You can specify the sort order and the number of values to return for each facet. You can also specify how many threads to use to calculate facets when facets on multiple fields are requested.

24
Q

What is query faceting and range faceting?

A
Query faceting returns the counts for arbitrary queries, which is useful when no single field holds the buckets you want, e.g.
select?q=*:*&facet=true&
  facet.query=price:[* TO 5}&
  facet.query=price:[5 TO 10}&
  facet.query=price:[10 TO 20}
Range faceting is similar but better suited to evenly spaced buckets:
select?q=*:*&facet=true&
  facet.range=price&
  facet.range.start=0&
  facet.range.end=50&
  facet.range.gap=5
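
What the range facet computes can be sketched as bucket counting, assuming each bucket includes its lower bound and excludes its upper bound. The prices are made-up sample data.

```python
def range_facet(values, start, end, gap):
    """Count values per [lower, lower + gap) bucket between start and end."""
    counts = {}
    lower = start
    while lower < end:
        counts[lower] = 0
        lower += gap
    for v in values:
        if start <= v < end:
            bucket = start + ((v - start) // gap) * gap
            counts[bucket] += 1
    return counts

prices = [3, 7, 12, 12, 49, 50]  # hypothetical price values
counts = range_facet(prices, start=0, end=50, gap=5)
```

Query faceting produces the same kind of counts, but each bucket is spelled out as its own query, so the buckets can be uneven.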
25
Q

How are spell-check and auto-suggest implemented?

A

The spell checker runs as part of the search handler and uses Levenshtein distance to find suggestions. It also gives the number of hits for each suggestion.
Auto-suggest only has a prefix to work with and needs to be fast. It's a separate handler and uses a prefix-tree data structure to quickly match against the index.

26
Q

What is results grouping?

A

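Similar to faceting, but the matching documents themselves are returned, grouped by the value of a field. For example, to return the top 3 results for each type:
group=true&group.field=type&group.limit=3

Grouping has a performance impact.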