Semantic search

From semantic-mediawiki.org
Figure: Simplified example explaining the difference between a condition and a printout result

Semantic MediaWiki includes an easy-to-use query language that lets users access the wiki's knowledge. The syntax of this query language is similar to the syntax of annotations in Semantic MediaWiki. The query language can be used on the special page "Ask", in concepts, and in inline queries. This page provides a short introduction to semantic search in general. More detailed explanations are found on other pages of this manual:

  • Selecting pages: explains the basic way to describe what pages should appear in a query result. This is the core of SMW's query language.
  • Displaying information: introduces the printout statements as a way of showing additional information for the queried pages, such as their property values or category assignments.
  • Result formats: describes the available formats and shapes for the results as a whole.
  • Concepts: shows how queries can be saved in concepts, which are a kind of «dynamic categories» offered by SMW.
  • Inline queries: explains ways of including query results into wiki pages, and shows how to format the query results for display. This is the purpose of the SMW parser functions #ask and #show.
  • Inferencing: explains how one can specify general schematic knowledge in SMW (and what this is in the first place). This feature is used by SMW to smartly deduce facts that were not directly entered into the wiki.

Naturally, answering queries requires additional resources, and the administrators of some sites can decide to switch off or restrict query features in order to ensure that even high-traffic sites can handle the additional load.

Introduction

Semantic queries specify two things:

  1. Which pages to select
  2. What information to display about those pages

All queries must state some conditions that describe what is asked for. You can select pages by name, namespace, category, and most importantly by property values. For example, the query

[[Located in::Germany]]

selects all pages whose "Located in" property has the value "Germany". If you enter this on the special page "Ask" and click "Find results", SMW executes the query and displays the results as a simple table of all matching page titles. If there are many results, they can be browsed via the navigation links at the top and bottom of the query results, as in a query for all persons on semanticweb.org.
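The same condition can also be sent to a wiki from a script. The sketch below only builds the request URL and sends nothing. It assumes the wiki exposes SMW's `ask` API module through `api.php`; the endpoint `https://wiki.example.org/w/api.php` and the helper name `build_ask_url` are hypothetical and not part of this manual.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the api.php URL of your own wiki.
API_ENDPOINT = "https://wiki.example.org/w/api.php"


def build_ask_url(condition: str, endpoint: str = API_ENDPOINT) -> str:
    """Return a URL asking the wiki for all pages that match `condition`."""
    params = {
        "action": "ask",     # assumption: SMW's "ask" API module is enabled
        "query": condition,  # the same text one would type into the "Ask" page
        "format": "json",
    }
    # urlencode percent-encodes the brackets and colons of the condition.
    return f"{endpoint}?{urlencode(params)}"


url = build_ask_url("[[Located in::Germany]]")
print(url)
```

Administrators may have restricted or disabled querying, so a real request against such a URL may be refused or truncated.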

The second point, what to display, matters as soon as more than the page titles is wanted. In the example above, one might be interested in the population of the things located in Germany. To display it on the special page "Ask", enter the following into the printout box on the right:

?Population

and SMW displays the same page titles and the values of the Population property on those pages, if any. Printout statements may have some additional settings to further control how the property is displayed.
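The difference between a condition and a printout can be illustrated with a toy in-memory model. This is not how SMW stores or evaluates data; the page names, the population figures, and the function `ask` are invented for illustration only.

```python
# Toy "wiki": page title -> property values. All names and numbers are made up.
PAGES = {
    "City A": {"Located in": "Germany", "Population": 100},
    "City B": {"Located in": "Germany", "Population": 200},
    "Region C": {"Located in": "Germany"},  # no Population annotation
    "City D": {"Located in": "France", "Population": 300},
}


def ask(pages, prop, value, printouts=()):
    """Select pages where `prop` equals `value` and collect printout values."""
    rows = []
    for title, props in sorted(pages.items()):
        if props.get(prop) != value:
            continue  # only the condition decides whether a page is selected
        # Printouts never filter: a missing value is reported as None.
        rows.append((title, *(props.get(p) for p in printouts)))
    return rows


for row in ask(PAGES, "Located in", "Germany", printouts=["Population"]):
    print(row)
```

"Region C" still appears in the result although it has no population value, which mirrors the "if any" behaviour described above: adding a printout changes what is shown, not which pages are selected.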

Summary table

The following cheatsheet compares the matching strategies that are available, depending on the search engine and configuration used. SPARQLStore is not yet included in this comparison because details for it are currently lacking.

| Feature | Standard SQLStore | Full-Text Search | ElasticStore |
|---|---|---|---|
| Properties indexed | All, incl. type Text, Page and URL | Configurable. Default: user-defined properties of type Text and URL.[1] | All, incl. type Text, Page and URL |
| Regular | | | |
| LIKE/NOT LIKE operators | ~/!~ and like:/nlike: (standard); like:/nlike: (fallback if FTS is enabled)[2] | ~/!~[3] | ~/!~ and like:/nlike: |
| Wide proximity operators | Unsupported | ~~/!~~ (two tildes not one) | ~~/!~~; also supported by in: and phrase: |
| Wildcards | standalone + (any value); * (0 or more)[4]; ? (any single character) | standalone + (any value); * (0 or more); ? (any single character) | standalone + (any value); * (0 or more; see also in:); ? (any single character) |
| in: (shorthand) | Unsupported | Unsupported | Equivalent to ~* ... * (with named property), or ~~* ... * (wide proximity) |
| Tokenisation | No | Yes | Yes (different tokenizers available) |
| Maximum searchable string length | first 255 per value (type Page); first 40 or 72, or 300 (type Text)[5] | ? per token[6] | maximum token length configurable (Length token filter) |
| Supported wildcard (*/?) positions | start, middle, end (of value) | end (of token) only[7] | start, middle, end (of token) |
| Match only at beginning/end of property value | Supported | Unsupported (tokens or phrases only) | Unsupported (tokens or phrases only) except with e.g. Keyword tokenizer.[8] |
| Meaning of whitespace between characters | part of string | token delimiter | token delimiter (usually, but see Keyword tokenizer) |
| Case folding | No | Yes | Possible (lowercase filter) |
| Accent folding[9] | No | Yes | Possible (asciifolding filter) |
| Features unique to tokenisation | | | |
| Minimum token length | - | Configurable (default: 3)[10] | Configurable (Length token filter) |
| Boolean operators (+/-) | - | Yes; caveat: do not apply to strings below minimum token length (Runtime Exception) | Yes |
| Stopword filter | - | Yes | Yes (Stop token filter) |
| CJK support | - | Yes (onoi/tesa; see documentation) | Yes (CJK language analyzer) |
| Phrase matching | | | |
| Phrase matching | Already the default. Matching is exact. | Use double quotes (" ... "). Case/accent-insensitive. Does not support wildcards. | Use double quotes, or phrase: (see below). Level of precision depends on case/accent-sensitivity. Does not support wildcards. |
| phrase: (shorthand) | Unsupported | Unsupported | Equivalent to ~" ... ", or ~~" ... " (wide proximity) |
| Other features | | | |
| Search highlighting with #-hl | Yes | Yes | Yes |
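The wildcard semantics listed for the standard SQLStore (* for zero or more characters, ? for any single character, allowed at the start, middle, or end of a value, with no case folding) can be emulated in a few lines. This is a sketch of the described matching behaviour only, not SMW's implementation; the function names `like_to_regex` and `like` are hypothetical.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern using * and ? into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # zero or more characters
        elif ch == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile("".join(parts), re.DOTALL)


def like(value: str, pattern: str) -> bool:
    """Return True if the whole `value` matches `pattern` (case-sensitive)."""
    return like_to_regex(pattern).fullmatch(value) is not None


print(like("Semantic MediaWiki", "*Media*"))  # wildcard at start and end
print(like("Semantic MediaWiki", "Media"))    # no wildcard: whole value must match
```

Because the pattern is matched against the entire value, whitespace is just another character here, in contrast to the tokenising engines in the other two columns.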
  1. The data types to be indexed are set in $smwgFulltextSearchIndexableDataTypes. The setting defaults to the constants SMW_FT_BLOB (used for type Text) and SMW_FT_URI (used for type URL). Specific properties can be exempted from indexing.
  2. No such fallback is available for ElasticStore.
  3. However, behaviour falls back to regular SQL if the property is not indexed.
  4. In a very early version of SMW, * may have meant 1 or more.
  5. See the documentation on search operators.
  6. Tokenisation comes with the benefit that length restrictions pertaining to the full string no longer apply. Even Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu (85 characters) should be fine.
  7. The CONTAINS predicate does not allow for other positions.
  8. The tokenizer does record the "order or position of each term", but it is unclear if this information is or can be used for anything other than "phrase and word proximity queries".
  9. Accent folding is a common technique that maps Unicode characters to their ASCII equivalents so that a query can be agnostic of any diacritics or accents being used.
  10. Configured via $smwgFulltextSearchMinTokenSize.
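To make tokenisation, case folding, accent folding, and the minimum token length more concrete, here is a minimal sketch. It assumes a plain whitespace tokenizer and Unicode decomposition for folding. Real engines use more elaborate analyzers, and none of the names below (`fold`, `tokenize`, `matches_all`, `MIN_TOKEN_SIZE`) come from SMW.

```python
import unicodedata

MIN_TOKEN_SIZE = 3  # illustrative; mirrors the default of 3 listed in the table


def fold(text: str) -> str:
    """Lowercase `text` and strip combining accents (for example é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokenize(text: str, min_size: int = MIN_TOKEN_SIZE) -> list:
    """Fold the text, split it on whitespace, and drop tokens that are too short."""
    return [t for t in fold(text).split() if len(t) >= min_size]


def matches_all(value: str, query: str) -> bool:
    """Return True if every query token occurs among the value's tokens."""
    tokens = set(tokenize(value))
    return all(q in tokens for q in tokenize(query))


print(tokenize("Café de Flore"))
print(matches_all("Café de Flore", "CAFE"))
```

In this sketch "de" is dropped for being shorter than the minimum token length, and a query for "flor" would not match because whole tokens are compared. That is why the tokenising engines need wildcards for partial matches.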