| Status: | effective |
| Progress: | 100% |
| Version: | 1.2.0+ |
Semantic search
Semantic MediaWiki includes an easy-to-use query language that lets users access the wiki's knowledge. The syntax of this query language is similar to the syntax of annotations in Semantic MediaWiki. The query language can be used on the special page "Ask", in concepts, and in inline queries. This page provides a short introduction to semantic search in general. More detailed explanations are found on other pages of this manual:
- Selecting pages: explains the basic way to describe what pages should appear in a query result. This is the core of SMW's query language.
- Displaying information: introduces the printout statements as a way of showing additional information for the queried pages, such as their property values or category assignments.
- Result formats: describes the available formats and shapes for the results as a whole.
- Concepts: shows how queries can be saved in concepts, which are a kind of «dynamic categories» offered by SMW.
- Inline queries: explains ways of including query results into wiki pages, and shows how to format the query results for display. This is the purpose of the SMW parser functions #ask and #show.
- Inferencing: explains how one can specify general schematic knowledge in SMW (and what this is in the first place). This feature is used by SMW to smartly deduce facts that were not directly entered into the wiki.
Naturally, answering queries requires additional resources, so site administrators may switch off or restrict query features to ensure that even high-traffic sites can handle the additional load.
Introduction
Semantic queries specify two things:
- Which pages to select
- What information to display about those pages
All queries must state some conditions that describe what is asked for. You can select pages by name, namespace, category, and most importantly by property values. For example, the query
[[Located in::Germany]]
is a query for all pages that have the "Located in" property with a value of "Germany". If you enter this on the special page "Ask" and click "Find results", SMW executes the query and displays the results as a simple table of all matching page titles. If there are many results, they can be browsed via the navigation links at the top and bottom of the query results, as in, for example, a query for all persons on semanticweb.org.
The second point, what information to display, matters when more than the page titles is wanted. In the example above, one might be interested in the population of the things located in Germany. To display that on the special page "Ask", one just enters the following into the printout box on the right:
?Population
and SMW displays the same page titles and the values of the Population property on those pages, if any. Printout statements may have some additional settings to further control how the property is displayed.
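The same two parts, a condition and a printout statement, can also be combined into a single query string, with the parts separated by `|`. The following is a minimal sketch rather than a definitive client: it assumes the wiki exposes SMW's `ask` API module, and the wiki address and the helper names `build_ask_query` and `build_ask_url` are hypothetical.

```python
from urllib.parse import urlencode


def build_ask_query(conditions, printouts=(), params=None):
    """Join conditions, printouts and parameters into one ask query string."""
    parts = [" ".join(conditions)]            # adjacent conditions must all hold
    parts += ["?" + p for p in printouts]     # printout statements start with "?"
    parts += [f"{k}={v}" for k, v in (params or {}).items()]
    return "|".join(parts)


def build_ask_url(api_url, query):
    """Build a request URL for the ask API module (api_url is a placeholder)."""
    return api_url + "?" + urlencode({"action": "ask", "query": query, "format": "json"})


query = build_ask_query(["[[Located in::Germany]]"], ["Population"], {"limit": 10})
print(query)
print(build_ask_url("https://wiki.example.org/w/api.php", query))
```

The first line printed is `[[Located in::Germany]]|?Population|limit=10`, which is the condition and printout from the example above plus an illustrative `limit` parameter.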
Summary table
The following table is a comparison, or cheat sheet, of the matching strategies that are available depending on the search engine and configuration used:
- SMW with the standard setup of SQLStore. The documentation in 'Selecting pages' and child pages usually refers to the standard setup.
- SMW with Full-Text Search (FTS) enabled for the SQLStore, which is still listed as experimental. A guide is separately available here. When Full-Text Search is enabled, SQLStore remains available as a fallback as detailed below.
- SMW with ElasticStore using Elasticsearch (ES) as a search engine. A guide is available from GitHub. Of these three, ES is the most flexible, allowing for a wider range of configurations, including different analyzers for different data types. Some alternative configuration options may not have been taken into account yet. For the default settings and examples of other possibilities, see DefaultSettings.php and these JSON files.
Not yet included in this comparison is SPARQLStore, details for which are currently lacking.
| | Standard SQLStore | Full-Text Search | ElasticStore |
|---|---|---|---|
| Properties indexed | All, incl. type Text, Page and URL | Configurable. Default: user-defined properties of type Text and URL.[1] | All, incl. type Text, Page and URL |
| Regular | | | |
| LIKE/NOT LIKE operators | ~/!~ and like:/nlike: (standard); like:/nlike: (fallback if FTS is enabled)[2] | ~/!~[3] | ~/!~ and like:/nlike: |
| Wide proximity operators | Unsupported | ~~/!~~ (two tildes not one) | ~~/!~~; also supported by in: and phrase: |
| Wildcards | standalone + (any value); * (0 or more)[4]; ? (any single character) | standalone + (any value); * (0 or more); ? (any single character) | standalone + (any value); * (0 or more; see also in:); ? (any single character) |
| in: (shorthand) | Unsupported | Unsupported | Equivalent to ~* ... * (with named property), or ~~* ... * (wide proximity) |
| Tokenisation | No | Yes | Yes (different tokenizers available) |
| Maximum searchable string length | first 255 per value (type Page); first 40 or 72, or 300 (type Text)[5] | ? per token[6] | maximum token length configurable (Length token filter) |
| Supported wildcard (*/?) positions | start, middle, end (of value) | end (of token) only[7] | start, middle, end (of token) |
| Match only at beginning/end of property value | Supported | Unsupported (tokens or phrases only) | Unsupported (tokens or phrases only) except with e.g. Keyword tokenizer.[8] |
| Meaning of whitespace between characters | part of string | token delimiter | token delimiter (usually, but see Keyword tokenizer) |
| Case folding | No | Yes | Possible (lowercase filter) |
| Accent folding[9] | No | Yes | Possible (asciifolding filter) |
| Features unique to tokenisation | | | |
| Minimum token length | - | Configurable (default: 3)[10] | Configurable (Length token filter) |
| Boolean operators (+/-) | - | Yes; caveat: do not apply to strings below minimum token length (Runtime Exception) | Yes |
| Stopword filter | - | Yes | Yes (Stop token filter) |
| CJK support | - | Yes (onoi/tesa; see documentation) | Yes (CJK language analyzer) |
| Phrase matching | | | |
| Phrase matching | Already the default. Matching is exact. | Use double quotes (" ... "). Case/accent-insensitive. Does not support wildcards. | Use double quotes, or phrase: (see below). Level of precision depends on case/accent-sensitivity. Does not support wildcards. |
| phrase: (shorthand) | Unsupported | Unsupported | Equivalent to ~" ... ", or ~~" ... " (wide proximity) |
| Other features | | | |
| Search highlighting with #-hl | Yes | Yes | Yes |
- ↑ The data types to be indexed are set in $smwgFulltextSearchIndexableDataTypes. The setting defaults to the constants SMW_FT_BLOB (used for type Text) and SMW_FT_URI (used for type URL). Specific properties can be exempted from indexing.
- ↑ No such fallback is available for ElasticStore.
- ↑ However, behaviour falls back to regular SQL if the property is not indexed.
- ↑ In a very early version of SMW, * may have meant 1 or more.
- ↑ See the documentation on search operators.
- ↑ Tokenisation comes with the benefit that length restrictions pertaining to the full string no longer apply. Even Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu (85 characters) should be fine.
- ↑ The CONTAINS predicate does not allow for other positions.
- ↑ The tokenizer does record the "order or position of each term", but it is unclear if this information is or can be used for anything other than "phrase and word proximity queries".
- ↑ Accent folding is a common technique that maps Unicode characters to their ASCII equivalents so that a query can be agnostic of any diacritics or accents being used.
- ↑ $smwgFulltextSearchMinTokenSize
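
The difference between whole-value matching (standard SQLStore) and token-based matching (Full-Text Search, ElasticStore) summarised in the table can be illustrated with a small toy model. This is only a sketch of the general ideas of tokenisation, case folding, accent folding and minimum token length; it is not how MySQL, Elasticsearch or SMW implement them, and all function names are hypothetical.

```python
import re
import unicodedata

MIN_TOKEN_LEN = 3  # assumption: mirrors the documented Full-Text Search default


def fold(text):
    """Lowercase and strip diacritics (case folding plus accent folding)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def tokenize(text, min_len=MIN_TOKEN_LEN):
    """Split on non-word characters, fold, and drop tokens shorter than min_len."""
    return [t for t in re.findall(r"\w+", fold(text)) if len(t) >= min_len]


def wildcard_regex(pattern):
    """Translate * (0 or more characters) and ? (exactly one) into a regex."""
    parts = [".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern]
    return re.compile("".join(parts), re.DOTALL)


def value_match(value, pattern):
    """Whole-value, case-sensitive matching: whitespace is part of the string."""
    return wildcard_regex(pattern).fullmatch(value) is not None


def token_match(value, pattern):
    """Match if any single folded token of the value matches the folded pattern."""
    regex = wildcard_regex(fold(pattern))
    return any(regex.fullmatch(tok) for tok in tokenize(value))


value = "Technische Universität Berlin"
print(value_match(value, "*Berlin"))      # wildcard needed to reach the end of the value
print(value_match(value, "berlin"))       # no folding and no partial match
print(token_match(value, "universitat"))  # folded token matches without the umlaut
print(tokenize("Go to Berlin"))           # tokens below the minimum length are dropped
```

In this model a search term has to cover the whole value unless wildcards are used, whereas with tokenisation a single word is enough and case and diacritics no longer matter. The price is that very short tokens and positions within the full value are lost.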