How to do searching with the full-text search

From semantic-mediawiki.org

Full-text search, once enabled for SQL, offers features that help overcome the native limitations of searching property values. It adds syntax options and changes the standard behaviour of existing syntax such as the tilde (~) and other search operators.

Full-text search is typically word-driven, language-dependent and optimised for large chunks of text. Features include word tokenization, phrase matching, wide proximity and search highlighting.

It may not provide the right solution for you if you require more exact forms of pattern-matching. For those use cases, queries using the like:/not like: notation should continue to provide a fallback.

It is not possible to come up with encompassing hard rules that reliably predict the outcome of full-text searches. Much is likely to depend on your SQL type and version, storage engine (MyISAM, InnoDB) and the language used, notably CJK. The outcome is also dictated by the chosen configuration settings. The examples below are currently incomplete as they exclusively focus on property values of type Text.

A sample

For the examples in this guide, we store a paragraph as a semantic annotation using a property "Has text" of datatype "Text" (which holds text of arbitrary length) [1]:

... The principles of definition, the law of contradiction, the fallacy of arguing in a circle, the distinction between the essence and accidents of a thing or notion, between means and ends, between causes and conditions; also the division of the mind into the rational, concupiscent, and irascible elements, or of pleasures and desires into necessary and unnecessary --these and other great forms of thought are all of them to be found in the Republic, and were probably first invented by Plato. The greatest of all logical truths, and the one of which writers on philosophy are most apt to lose sight, the difference between words and things, has been most strenuously insisted on by him, although he has not always avoided the confusion of them in his own writings. ...

Pattern matching using the single tilde for LIKE/NOT LIKE (no quotes)

With full-text search enabled, LIKE/NOT LIKE queries using the tilde (~ / !~) operator will no longer behave the same way. Special features that deviate from standard behaviour include:

  • Spaces are treated as delimiters that subdivide a string into individual words or 'tokens', making it easier to match on them. If a space is required as part of the match, use phrase matching instead, as illustrated below.
  • This approach does away with the need for wildcards if you need to match full words, e.g. [[Has text::~contradiction]].
  • Depending on the kind of language support provided, certain 'stopwords' (words considered too common or too functional to be meaningful) are not indexed. For example, our sample contains the adverb "probably", which is apparently a stopword, because a query condition like [[Has text::~probably]] returns nothing. For details on stopwords, see the documentation published by MariaDB and MySQL. Additional stopword detection is done with the onoi/tesa library.
  • It allows for case-insensitive searches, e.g. [[Has text::~conTradiCtIon]].
  • To an extent, it also allows for diacritic-insensitive searches.
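
The behaviour described in the list above can be sketched in a few lines of Python. This is only a rough model for building intuition, not how the database or Semantic MediaWiki implements indexing: the stopword list, the minimum token length and the function names (`fold`, `index_tokens`, `matches`) are all assumptions made for this illustration.

```python
import re
import unicodedata

# Hypothetical stopword list and minimum token length, for illustration only.
# Real values depend on the database, storage engine and wiki configuration.
STOPWORDS = {"the", "of", "and", "probably", "first"}
MIN_TOKEN_LENGTH = 3


def fold(text: str) -> str:
    """Lowercase the text and strip diacritics ('Café' becomes 'cafe')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_tokens(text: str) -> set:
    """Split on non-word characters, then drop stopwords and short tokens."""
    tokens = re.findall(r"\w+", fold(text))
    return {t for t in tokens if len(t) >= MIN_TOKEN_LENGTH and t not in STOPWORDS}


def matches(query: str, text: str) -> bool:
    """True if at least one query word survives indexing and occurs in the text."""
    return bool(index_tokens(query) & index_tokens(text))


sample = "the law of contradiction ... were probably first invented by Plato"
print(matches("conTradiCtIon", sample))  # case does not matter
print(matches("probably", sample))       # stopword in this sketch, so no match
```

In this model a query made up only of stopwords produces an empty token set, which is one way to picture why such a condition returns nothing.
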
Example

Where the standard search requires you to use explicit OR statements, as in

[[Category:Semantic MediaWiki documentation]] [[Has text::~*ontradiction]] OR [[Category:Semantic MediaWiki documentation]] [[Has text::~differenc*]]

you can now write:

[[Category:Semantic MediaWiki documentation]] [[Has text::~*ontradiction differenc*]]
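
To make the equivalence concrete, the following Python sketch expands a space-separated search string into the explicit OR form that the standard search would need. The helper name `expand_to_or_query` is hypothetical and exists only for this illustration; Semantic MediaWiki does not need or perform this rewriting when full-text search is enabled.

```python
def expand_to_or_query(base_condition: str, prop: str, terms: str) -> str:
    """Build the explicit OR form for a space-separated list of search terms."""
    branches = [f"{base_condition} [[{prop}::~{term}]]" for term in terms.split()]
    return " OR ".join(branches)


# Prints the long form of the short query shown above.
print(expand_to_or_query(
    "[[Category:Semantic MediaWiki documentation]]",
    "Has text",
    "*ontradiction differenc*",
))
```
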


Limitations and side-effects
  • For MATCH operations, some words/tokens are ignored by the indexer. This improves efficiency and reduces index pollution, but it can come at the expense of leaving out meaningful words that are short. We already mentioned stopword detection; another reason may be the configurable minimum word/token length, which is 3 by default. Try this condition: [[Has text::~law]]
  • Wildcard searches:
    • Because of tokenization, it is not possible to anchor a match to the beginning or end of the full string. All 'tokens' are evaluated.
    • While you can use the asterisk wildcard at either the beginning or the end of a word, you may not get any results if you place one on both sides (e.g. [[Has text::~*ontradicti*]]) or somewhere in between (e.g. [[Has text::~co*tradiction]]). Note that this does work on the Sandbox wiki.
    • This being an experimental feature, not all combinations are guaranteed to work. On this wiki, for instance, [[Has text::~contradictio* suberat*]] works, returning results from two different pages, but [[Has text::~contradictio* *uberate]] omits a result. What does work is [[Has text::~*ontradiction differenc*]], in which case the matches come from the same property value.

Phrase matching using double quotes

Phrase matching is done by putting double quotes on either side of the phrase ("...").

  • It works for both uppercase and lowercase characters.
  • Phrases consisting solely of stopwords may not yield results, e.g. [[Has text::~"probably first"]], even if the longer phrase below is fine.

Query:

{{#ask:
 [[Has text::~"probably first invented by plato"]]
 |?Has text
 |format=plainlist
}}
Note: Phrase matching is case-insensitive. The query string has plato with a lowercase initial, while the source text has Plato with an uppercase initial.

Result:

How to do searching with the full-text search (Has text: ... The principles of definition, the law of contradiction, the fallacy of arguing in a circle, the distinction between the essence and accidents of a thing or notion, between means and ends, between causes and conditions; also the division of the mind into the rational, concupiscent, and irascible elements, or of pleasures and desires into necessary and unnecessary --these and other great forms of thought are all of them to be found in the Republic, and were probably first invented by Plato. The greatest of all logical truths, and the one of which writers on philosophy are most apt to lose sight, the difference between words and things, has been most strenuously insisted on by him, although he has not always avoided the confusion of them in his own writings. ...)

Wide proximity

Wide proximity search tries to match a search string without naming a specific property, which broadens the set of possible matches. An additional ~ tells the QueryEngine to use the full-text index. In other words, if the property is unknown, use two tildes (~~) instead of one:

Query:

{{#ask:
 [[~~probably first invented by Plato]]
 |format=broadtable
 |link=all
 |headers=show
}}

How it works:

  • [[~~ first invented by Plato]] is translated into the SQL-query (MATCH(t0.o_text) AGAINST ('first invented by plato' IN BOOLEAN MODE) )
  • [[!~~ first invented by Plato]] (the negative) is translated into the SQL-query (MATCH(t0.o_text) AGAINST ('-first invented by plato' IN BOOLEAN MODE) )
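
The two translations above follow a simple pattern, which the Python sketch below reproduces as plain string handling. It is not the QueryEngine's actual code: the function name `to_match_against` is hypothetical, the column name is taken from the examples above, and the quote escaping is deliberately naive (a real implementation would rely on the database layer for escaping).

```python
def to_match_against(condition: str, column: str = "t0.o_text") -> str:
    """Mimic the translation of [[~~...]] and [[!~~...]] shown above."""
    inner = condition.strip()
    if not (inner.startswith("[[") and inner.endswith("]]")):
        raise ValueError("expected a condition wrapped in [[ ]]")
    inner = inner[2:-2]

    negated = inner.startswith("!~~")
    if not negated and not inner.startswith("~~"):
        raise ValueError("expected a ~~ or !~~ condition")

    # Drop the operator, then normalise whitespace and case as in the examples.
    search = inner[3:] if negated else inner[2:]
    search = search.strip().lower().replace("'", "''")  # naive escaping, illustration only

    prefix = "-" if negated else ""
    return f"(MATCH({column}) AGAINST ('{prefix}{search}' IN BOOLEAN MODE) )"


print(to_match_against("[[~~ first invented by Plato]]"))
print(to_match_against("[[!~~ first invented by Plato]]"))
```
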

Search highlighting

Search highlighting [2] is done by adding the #-hl formatter to the printout statement of the respective property.

Query:

{{#ask:
 [[Has text::~"probably first invented by plato"]]
 |?Has text#-hl
}}

Result:

 Has text
How to do searching with the full-text search... The principles of definition, the law of contradiction, the fallacy of arguing in a circle, the distinction between the essence and accidents of a thing or notion, between means and ends, between causes and conditions; also the division of the mind into the rational, concupiscent, and irascible elements, or of pleasures and desires into necessary and unnecessary --these and other great forms of thought are all of them to be found in the Republic, and were probably first invented by Plato. The greatest of all logical truths, and the one of which writers on philosophy are most apt to lose sight, the difference between words and things, has been most strenuously insisted on by him, although he has not always avoided the confusion of them in his own writings. ...
Note:
  • All occurrences of strings included in the query statement will be highlighted as well, e.g. "by" in the above example.
  • Highlighting is case-insensitive: the original text has Plato (with an uppercase initial), while the query string has plato (in lowercase).
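
The two notes can be illustrated with a small Python sketch of case-insensitive highlighting. It is an approximation for illustration only: the `highlight` function, the `<b>` marker tags and the use of word boundaries are assumptions, and the markup that the #-hl formatter actually emits is not shown here.

```python
import re


def highlight(text: str, query: str, open_tag: str = "<b>", close_tag: str = "</b>") -> str:
    """Wrap every case-insensitive occurrence of a query word in marker tags."""
    # Treat a quoted phrase as a list of individual words.
    words = [re.escape(w) for w in query.replace('"', " ").split()]
    if not words:
        return text
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    # Keep the original spelling (e.g. 'Plato') and only add the tags around it.
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)


print(highlight("first invented by Plato, insisted on by him", '"invented by plato"'))
```

In this sketch both occurrences of "by" are marked, and "Plato" keeps its uppercase initial even though the query used lowercase.
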

More examples

References

  1. Semantic MediaWiki: GitHub issue #1481 (example)
  2. Semantic MediaWiki: GitHub pull request #2253