Lucene Query Syntax

Lucene has a custom query syntax for querying its indexes. Here are some query examples demonstrating the query syntax.

Keyword matching

Search for the word "foo" in the title field.

title:foo

Search for the phrase "foo bar" in the title field.

title:"foo bar"

Search for the phrase "foo bar" in the title field AND the phrase "quick fox" in the body field.

title:"foo bar" AND body:"quick fox"

Search for either the phrase "foo bar" in the title field AND the phrase "quick fox" in the body field, or the word "fox" in the title field.

(title:"foo bar" AND body:"quick fox") OR title:fox

Search for the word "foo" and not "bar" in the title field.

title:foo -title:bar

Wildcard matching

Search for any word that starts with "foo" in the title field.

title:foo*

Search for any word that starts with "foo" and ends with "bar" in the title field.

title:foo*bar

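Note that, by default, Lucene's query parser doesn't support using a * symbol as the first character of a search. Leading wildcards can be enabled with QueryParser.setAllowLeadingWildcard(true), but such queries can be very slow on large indexes.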
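To make the wildcard semantics concrete, here is a minimal sketch that translates a wildcard pattern into a java.util.regex pattern and tests single terms against it. It uses only the Java standard library and is not how Lucene evaluates wildcards internally. The names WildcardSketch, toRegex and matches are hypothetical, and the sketch assumes that * matches zero or more characters and ? matches exactly one.

```java
import java.util.regex.Pattern;

// Illustrative only: mimics wildcard term matching with java.util.regex.
// WildcardSketch and its methods are hypothetical names, not Lucene APIs.
final class WildcardSketch {

    // Translate a wildcard pattern into a regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");   // zero or more characters
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("foo*", "foobar"));    // true
        System.out.println(matches("foo*bar", "foobar")); // true: * may match nothing
        System.out.println(matches("foo*bar", "foobaz")); // false
    }
}
```

In this model, title:foo*bar would match the terms "foobar" and "foo123bar" but not "foobaz".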

Proximity matching

Lucene supports finding words that are within a specific distance of each other.

Search for "foo" and "bar" within 4 words of each other.

"foo bar"~4

Note that for proximity searches, exact matches are proximity zero, and word transpositions (bar foo) are proximity 2, because swapping two adjacent words counts as two position moves.

A query such as "foo bar"~10000000 is an interesting alternative to foo AND bar.

Whilst both queries are effectively equivalent with respect to the documents that are returned, the proximity query assigns a higher score to documents for which the terms foo and bar are closer together.

The trade-off is that the proximity query is slower to perform and requires more CPU.
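The scoring difference described above can be illustrated with the sloppy-frequency formula used by Lucene's DefaultSimilarity, where a phrase match at distance d counts as 1 / (d + 1) of an exact match. The sketch below only reproduces that formula with the standard library. The class name SloppyFreqSketch is hypothetical, and the final score in Lucene also depends on other factors such as term frequency and field norms.

```java
// Illustrative only: reproduces the 1 / (distance + 1) weighting that
// DefaultSimilarity applies to sloppy phrase matches.
// SloppyFreqSketch is a hypothetical name, not a Lucene class.
final class SloppyFreqSketch {

    // Weight of a single phrase match found at the given distance.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // Closer matches contribute more to the phrase frequency.
        for (int d : new int[] {0, 1, 3, 9}) {
            System.out.println("distance " + d + " -> weight " + sloppyFreq(d));
        }
    }
}
```

An exact match (distance 0) gets the full weight of 1, and the weight shrinks as the terms drift apart, which is why documents with foo and bar close together rank higher.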

Solr DisMax and eDisMax query parsers can add phrase proximity matches to a user query.

Range searches

Range queries match documents whose field values fall between a lower and an upper bound. The bounds can be inclusive (square brackets) or exclusive (curly braces). Values are compared lexicographically.

mod_date:[20020101 TO 20030101]

Solr's built-in field types are very convenient for performing range queries on numbers without requiring padding.
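Because plain term ranges compare values as strings, unpadded numbers sort in an unexpected order. The following sketch uses only String.compareTo from the standard library to show the problem and the zero-padding workaround. The class and method names and the width of 8 digits are assumptions made for illustration.

```java
import java.util.Locale;

// Illustrative only: shows why lexicographic ranges need zero-padded numbers.
// PaddingSketch and its methods are hypothetical names, not Lucene APIs.
final class PaddingSketch {

    // Left-pad a non-negative number with zeros to a fixed width of 8.
    static String pad(long value) {
        return String.format(Locale.ROOT, "%08d", value);
    }

    // Inclusive lexicographic range check, analogous to field:[lower TO upper].
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10", so 9 falls outside [1 TO 10].
        System.out.println(inRange("9", "1", "10"));          // false
        // Padded: string order now agrees with numeric order.
        System.out.println(inRange(pad(9), pad(1), pad(10))); // true
    }
}
```

Dates written as yyyyMMdd, as in the mod_date example, already have a fixed width, so they compare correctly without extra padding.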

Boosts

Query-time boosts allow one to specify which terms/clauses are "more important". The higher the boost factor, the more relevant the term will be, and therefore the higher the corresponding document scores.

A typical boosting technique is assigning higher boosts to title matches than to body content matches:

(title:foo OR title:bar)^1.5 (body:foo OR body:bar)

You should carefully examine explain output to determine the appropriate boost weights.
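As a rough illustration of why boost weights matter, the sketch below uses a deliberately simplified model in which a document's score is the boosted title score plus the body score. This is not Lucene's actual scoring formula, which also applies normalization and coordination factors. The class name and all the raw scores are hypothetical values chosen for the example.

```java
// Illustrative only: a toy scoring model, not Lucene's real formula.
// BoostSketch and the sample scores are hypothetical.
final class BoostSketch {

    // Toy model: total score = boosted title score + body score.
    static float combined(float titleScore, float bodyScore, float titleBoost) {
        return titleBoost * titleScore + bodyScore;
    }

    public static void main(String[] args) {
        // Hypothetical raw clause scores for two documents.
        float aTitle = 0.4f, aBody = 0.0f; // document A matches only in the title
        float bTitle = 0.0f, bBody = 0.5f; // document B matches only in the body

        // Without a boost, B outranks A; with a title boost of 1.5, A outranks B.
        System.out.println(combined(aTitle, aBody, 1.0f) > combined(bTitle, bBody, 1.0f)); // false
        System.out.println(combined(aTitle, aBody, 1.5f) > combined(bTitle, bBody, 1.5f)); // true
    }
}
```

Even in this toy model a modest boost is enough to reorder results, which is why the real weights are best tuned against explain output rather than guessed.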

The official docs for the query parser syntax are here: http://lucene.apache.org/java/3_5_0/queryparsersyntax.html

The query syntax has not changed significantly since Lucene 1.3 (it is now 3.5.0).

Parsing Queries

Queries can be parsed by constructing a QueryParser object and invoking the parse() method.

String querystr = args.length > 0 ? args[0] : "lucene";
Query q = new QueryParser(Version.LUCENE_CURRENT, "title", analyzer).parse(querystr);
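When the query string comes from user input, characters such as : or ( are interpreted as syntax and can cause parse() to fail. Lucene provides the static method QueryParser.escape(String) for this purpose. The sketch below mimics the idea with only the standard library so that the effect is easy to see. The class name EscapeSketch and the exact character set are assumptions, and real code should call QueryParser.escape instead.

```java
// Illustrative only: a stand-in for QueryParser.escape(String).
// EscapeSketch is a hypothetical name; the character set is an assumption.
final class EscapeSketch {

    // Characters treated as query syntax in this sketch:
    // \ + - ! ( ) : ^ [ ] " { } ~ * ? | &
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Prefix every special character with a backslash.
    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("title:foo")); // title\:foo
        System.out.println(escape("(1+1)"));     // \(1\+1\)
    }
}
```

Escaping the whole input turns it into literal text, so it only suits applications that do not want to expose the query syntax to their users.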

Programmatic construction of queries

Lucene queries can also be constructed programmatically. This can be really handy at times. Besides, there are some queries which are not possible to construct by parsing.

Available query objects as of 3.4.0 are:

  • BooleanQuery
  • ConstantScoreQuery
  • CustomScoreQuery
  • DisjunctionMaxQuery
  • FilteredQuery
  • MatchAllDocsQuery
  • MultiPhraseQuery
  • MultiTermQuery
  • PhraseQuery
  • TermRangeQuery (replaces the older RangeQuery; NumericRangeQuery covers numeric fields)
  • SpanQuery
  • TermQuery
  • ValueSourceQuery

Use the BooleanQuery object to join and nest queries.

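Most of these classes are part of the org.apache.lucene.search package. SpanQuery is in org.apache.lucene.search.spans, and CustomScoreQuery and ValueSourceQuery are in org.apache.lucene.search.function.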

Here's a simple example:

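String str = "foo bar";
String id = "123456";
// qp is a QueryParser, constructed as shown under "Parsing Queries"
BooleanQuery bq = new BooleanQuery();
Query query = qp.parse(str);
// Documents must match the parsed query...
bq.add(query, BooleanClause.Occur.MUST);
// ...and must not have the given id
bq.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.MUST_NOT);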
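The effect of the Occur flags can be modelled without Lucene. The sketch below is a toy model, using only the standard library, of how MUST, SHOULD and MUST_NOT clauses decide whether a document matches. It ignores scoring, reduces every clause to a single term, and assumes the default behaviour in which SHOULD clauses are optional once at least one MUST clause is present. All names in it (BooleanSketch, Clause, the sample terms) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy model of BooleanQuery matching, without scoring.
// BooleanSketch and Clause are hypothetical names, not Lucene APIs.
final class BooleanSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    // A clause here is just a required, optional or prohibited term.
    static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    // A document matches if every MUST term is present, no MUST_NOT term is
    // present, and, when there are no MUST clauses, at least one SHOULD matches.
    static boolean matches(Set<String> docTerms, List<Clause> clauses) {
        boolean hasMust = false;
        boolean anyShould = false;
        for (Clause c : clauses) {
            boolean present = docTerms.contains(c.term);
            switch (c.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) return false;
                    break;
                case MUST_NOT:
                    if (present) return false;
                    break;
                case SHOULD:
                    anyShould |= present;
                    break;
            }
        }
        return hasMust || anyShould;
    }

    public static void main(String[] args) {
        List<Clause> query = Arrays.asList(
                new Clause("foo", Occur.MUST),
                new Clause("123456", Occur.MUST_NOT));

        Set<String> kept = new HashSet<String>(Arrays.asList("foo", "bar"));
        Set<String> excluded = new HashSet<String>(Arrays.asList("foo", "123456"));

        System.out.println(matches(kept, query));     // true
        System.out.println(matches(excluded, query)); // false
    }
}
```

In this model a query made up only of MUST_NOT clauses matches nothing, which mirrors why a prohibited clause is normally combined with at least one MUST or SHOULD clause.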