|Full-text search support for properties whose datatypes use strings of characters or text to store their data in database tables, e.g. "Text", "Page" or "URL".|
Semantic MediaWiki 2.5.0 adds experimental support for accessing the full-text capabilities of relational databases (SQL back-end) for properties whose datatypes use strings of characters or text to store their data, e.g. "Text", "Page" or "URL". These datatypes use column types such as TEXT to store their data in the database tables.
- This feature is not enabled by default because it is still considered experimental. It may be enabled for the wiki with the configuration parameter $smwgEnabledFulltextSearch.
- Support was added for MySQL/MariaDB [1] and SQLite [2], while PostgreSQL [3][4] is currently not supported.
- SMWSQLStore3 is supported, since the SPARQLStore would require native support of full-text search capabilities by the triple-store.
Features and limitations
- General notes
- The FT_SEARCH table aggregates search content for datatypes storing their data as URI values, against which an index search is executed
- Supported operations rely on the relational backend database (MySQL, MariaDB and SQLite)
- For MySQL and MariaDB databases, IN BOOLEAN MODE is used as the default search mode
- Relevance and scores are not used for any sorting purpose, e.g. as in best match
- TextSanitizer relies on the "onoi/tesa" library [5] to sanitize text or string elements, to provide some text manipulation support, and to offer language detection if enabled. This library is pre-installed for use by Semantic MediaWiki.
- Custom stopwords are only applied by the "onoi/tesa" library [5] if language detection is enabled, but MySQL/MariaDB provide their own standard lists [6], which are enabled by default
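To make the boolean search mode and the absence of relevance sorting more concrete, here is a minimal Python sketch of simplified boolean-mode semantics: "+term" must be present, "-term" must be absent, and bare terms are optional. The function name and sample documents are hypothetical, and the sketch does not reproduce the MySQL/MariaDB parser or any Semantic MediaWiki code.

```python
def matches_boolean_query(text, query):
    """Return True if `text` satisfies a simplified boolean-mode query.

    Assumption: only the "+" (required) and "-" (excluded) operators are
    handled; phrase, wildcard and grouping operators are ignored.
    """
    words = set(text.lower().split())
    required, excluded, optional = [], [], []
    for term in query.lower().split():
        if term.startswith("+") and len(term) > 1:
            required.append(term[1:])
        elif term.startswith("-") and len(term) > 1:
            excluded.append(term[1:])
        else:
            optional.append(term)
    # Any excluded term rules the text out.
    if any(term in words for term in excluded):
        return False
    # With required terms, all of them must be present.
    if required:
        return all(term in words for term in required)
    # Otherwise at least one optional term must be present.
    return any(term in words for term in optional)


documents = [
    "semantic mediawiki full-text search",
    "mediawiki installation guide",
    "full-text search with sqlite",
]

# Matches are returned in stored order; no relevance score is computed.
hits = [doc for doc in documents if matches_boolean_query(doc, "+search -sqlite")]
print(hits)
```

Because the result is a plain filter, a text either matches or it does not, which mirrors the note above that scores are not used for sorting.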
- Notes on Chinese, Japanese, and Korean support (CJK)
- General CJK support is a challenging endeavour because text has to be broken into tokens that are not separated by spaces
- The "onoi/tesa" library [5] provides some simple tokenizers which do not require language detection and will try to provide rudimentary CJK search out of the box. This however requires ICU 54+, which is still not being used by MediaWiki as of version 1.29-alpha.
- Mroonga is a MySQL storage engine and is said to provide CJK-ready full-text search and a column store
- MySQL comes with an optional ngram Full-Text Parser and MeCab Full-Text Parser Plugin.
- According to this issue, MariaDB is missing those parser plug-ins
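The tokenization problem described in the CJK notes can be illustrated with a character n-gram tokenizer, the general idea behind ngram-style parsers. The following Python sketch is an assumption-laden illustration only: the function name is hypothetical, and it is not the tokenizer used by "onoi/tesa", MySQL, or Mroonga.

```python
def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams.

    Assumption: whitespace is dropped and every remaining character is
    treated as one unit; no language detection or dictionary is involved.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Text shorter than n yields a single short token (or nothing).
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


# A four-character string without spaces yields three overlapping bigrams.
print(char_ngrams("全文検索"))
```

Indexing overlapping bigrams lets a search for a two-character word find it anywhere in a longer run of text, at the cost of a larger index and some false matches across word boundaries.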
$smwgEnabledFulltextSearch − Enables the feature
$smwgFulltextDeferredUpdate − Throttles the number of expected index updates
$smwgFulltextSearchTableOptions − Sets database-related options
$smwgFulltextSearchMinTokenSize − Sets the minimum word/token size
$smwgFulltextLanguageDetection − Enables language detection (experimental setting)
$smwgFulltextSearchIndexableDataTypes − Lists the datatypes that should be indexed
$smwgFulltextSearchPropertyExemptionList − Lists the properties that should not be indexed
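As a rough illustration of what a minimum token size and a stopword list mean for indexing, the Python sketch below filters the tokens of a text before they would be added to an index. The function name, the stopword set and the default of 3 are hypothetical values chosen for the example; they are not taken from Semantic MediaWiki or from the settings above.

```python
import re

# Hypothetical stopword list, for illustration only.
EXAMPLE_STOPWORDS = {"the", "and", "for"}


def indexable_tokens(text, min_token_size=3, stopwords=EXAMPLE_STOPWORDS):
    """Return the tokens that would survive a size and stopword filter."""
    tokens = re.findall(r"\w+", text.lower())
    return [
        token for token in tokens
        if len(token) >= min_token_size and token not in stopwords
    ]


print(indexable_tokens("The API for full-text search"))
print(indexable_tokens("The API for full-text search", min_token_size=4))
```

Raising the minimum size shrinks the index but makes short terms such as abbreviations unsearchable, which is the trade-off such a setting controls.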
Usage and instructions
- for users
- Searching contains some examples and descriptions about the available search syntax
- for system administrators
- Indexing describes some methods on how to manually create and update the index table
- for developers
- Technical notes provides some information on the technical implementation, fine-tuning, and performance
- [1] Semantic MediaWiki: GitHub pull request #1481
- [2] Semantic MediaWiki: GitHub pull request #1801
- [3] Semantic MediaWiki: GitHub pull request #1956 notes that "... any interested developer who is eager to help with implementing a PostgreSQL solution ..."
- [4] Postgres is not supported due to a different index schema (e.g. to_tsquery), but users interested in making it available are encouraged to have a look at the MySQLValueMatchConditionBuilder to see how to create a Postgres-specific implementation.
- [5] "onoi/tesa" - A small library to help with the sanitization of text or string elements.
- [6] https://dev.mysql.com/doc/refman/5.6/en/fulltext-stopwords.html and https://mariadb.com/kb/en/mariadb/stopwords/