Full-text search

From semantic-mediawiki.org
Full-text search support for properties whose data types store their values as strings of characters or text, e.g. datatype "Page" (holds names of wiki pages and displays them as links), datatype "Text" (holds text of arbitrary length), datatype "Code" (holds technical, pre-formatted texts, similar to datatype Text) or datatype "URL" (holds URIs, URNs and URLs).

Semantic MediaWiki 2.5.0 (released on 14 March 2017 and compatible with MW 1.23.0 - 1.29.x) adds experimental support for accessing the full-text capabilities of the relational database (SQL back-end) for properties whose data types store their values as strings of characters or text, e.g. datatype "Page" (holds names of wiki pages and displays them as links), datatype "Text" (holds text of arbitrary length), datatype "Code" (holds technical, pre-formatted texts, similar to datatype Text) or datatype "URL" (holds URIs, URNs and URLs).

Features

General notes

  • The FT_SEARCH table aggregates search content for datatypes storing their data as BLOB and URI values, e.g. datatype "Page" (holds names of wiki pages and displays them as links), datatype "Text" (holds text of arbitrary length), datatype "Code" (holds technical, pre-formatted texts, similar to datatype Text) or datatype "URL" (holds URIs, URNs and URLs).
  • These datatypes use either CHAR, VARCHAR, or TEXT to store their data in the database tables.
  • Supported operations rely on the relational backend database (MySQL, MariaDB and SQLite).
  • For MySQL and MariaDB databases, IN BOOLEAN MODE is used as default search mode. This allows for a number of special operators to be used by the software.
  • Relevance scores are not used for sorting, e.g. to rank results by best match.
  • TextSanitizer relies on the "onoi/tesa" library [1] to sanitize text and string elements, to provide some text manipulation support, and to allow language detection if enabled. This library is pre-installed for use by Semantic MediaWiki.
  • Custom stopwords are applied by the "onoi/tesa" library [1] only if language detection is enabled; MySQL and MariaDB provide their own standard stopword list [2], which is enabled by default.
  • Starting with Semantic MediaWiki 3.0.0 (released on 11 October 2018 and compatible with MW 1.27.0 - 1.31.x):
    • If the SMW_FIELDT_CHAR_NOCASE option of configuration parameter $smwgFieldTypeFeatures (sets relational database specific field type features) is enabled, the full-text search only comes into effect for selections using the comparators ~ and !~. [3]
    • The API module "smwtask" (allows invoking and executing internal Semantic MediaWiki tasks) is used instead of a socket connection via a special page to invoke extra "work" after an update has been completed, as part of an independent transaction. [4] See also configuration parameter $smwgPostEditUpdate (sets how many jobs should be executed as part of a post-edit event).
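
To illustrate the note on IN BOOLEAN MODE above, the following is a minimal Python sketch of how a boolean-mode search string can be composed from the standard MySQL/MariaDB operators (+ for required terms, - for excluded terms, a trailing * for prefix matching). The function name build_boolean_query, the table name ft_search_example and the column name text_column are hypothetical and used only for illustration; this is not how Semantic MediaWiki itself builds its queries.

```python
def build_boolean_query(required=(), excluded=(), optional=(), prefix=()):
    """Compose a search string using MySQL/MariaDB boolean-mode operators.

    Hypothetical helper for illustration only, not part of Semantic MediaWiki.
    """
    parts = []
    parts += [f"+{term}" for term in required]  # '+': term must be present
    parts += [f"-{term}" for term in excluded]  # '-': term must not be present
    parts += list(optional)                     # bare term: optional match
    parts += [f"{term}*" for term in prefix]    # trailing '*': prefix match
    return " ".join(parts)


search_string = build_boolean_query(
    required=["semantic"], excluded=["draft"], prefix=["wiki"]
)

# Assumed table and column names; the search string would be passed as a
# bound parameter by whatever database driver is in use.
sql = (
    "SELECT * FROM ft_search_example "
    "WHERE MATCH(text_column) AGAINST (%s IN BOOLEAN MODE)"
)

print(search_string)
```

Passing the search string as a bound parameter rather than concatenating it into the SQL statement keeps user input out of the statement text.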

Notes on language support for Chinese, Japanese, and Korean (CJK)

  • General CJK support is challenging because text has to be broken into tokens that are not separated by spaces.
  • The "onoi/tesa" library [1] provides some simple tokenizers which do not require language detection and try to provide rudimentary CJK search out of the box. This requires ICU 54+.
  • Mroonga is a MySQL storage engine that is said to provide CJK-ready full-text search and a column store.
  • MySQL comes with an optional ngram Full-Text Parser and MeCab Full-Text Parser Plugin.
  • According to this issue, MariaDB is missing those parser plug-ins. Support was still lacking as of 2023.
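
The tokenization problem described in the notes above can be sketched with a simple n-gram approach, which is the general idea behind ngram-style full-text parsers: text without word separators is cut into overlapping character sequences of a fixed length. The Python function below is a hypothetical illustration (the name ngram_tokens is an assumption) and does not reproduce the exact behaviour of the MySQL ngram parser or of the "onoi/tesa" tokenizers.

```python
def ngram_tokens(text, n=2):
    """Split text into overlapping character n-grams.

    Whitespace is treated as a delimiter, so n-grams never span two
    whitespace-separated segments. Segments shorter than n yield no token.
    Hypothetical sketch for illustration only.
    """
    tokens = []
    for segment in text.split():
        # Slide a window of n characters across the segment.
        for i in range(len(segment) - n + 1):
            tokens.append(segment[i:i + n])
    return tokens


# A four-character Japanese string yields three overlapping bigrams.
print(ngram_tokens("全文検索"))
```

Because every bigram of the indexed text becomes a searchable token, a query for a two-character word can match even though the original text contains no spaces to mark word boundaries.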

Instructions

For users
  • Searching contains some examples and descriptions about the available search syntax
For system administrators
For developers
  • Technical notes provides some information on the technical implementation, fine-tuning, and performance