Status: effective
Progress: 100%
Version: 3.0.0+
Full-text search
From semantic-mediawiki.org
Full-text search | |
---|---|
Full-text search support for properties whose datatypes store their values as character strings or text in the database tables, e.g. datatype "Page" (holds names of wiki pages and displays them as links), datatype "Text" (holds text of arbitrary length), datatype "Code" (holds technical, pre-formatted text, similar to datatype Text) or datatype "URL" (holds URIs, URNs and URLs). | |
Semantic MediaWiki 2.5.0 (released on 14 March 2017 and compatible with MW 1.23.0 - 1.29.x) adds experimental support for accessing the full-text capabilities of the relational databases (SQL back-end) for properties whose datatypes store their values as character strings or text, e.g. datatype "Page", "Text", "Code" or "URL". These datatypes use either CHAR, VARCHAR, or TEXT columns to store their data in the database tables.
This feature is not enabled by default because it is still considered experimental. It may be enabled for the wiki with the configuration parameter $smwgEnabledFulltextSearch, which sets whether full-text search support for properties may be used.
- Support was added for MySQL/MariaDB [1] and SQLite [2], while PostgreSQL [3][4] is currently not supported.
- Only SMWSQLStore3 is supported, since the SPARQLStore would require native support of full-text search capabilities by the triple-store.
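To give a feel for what "full-text capabilities of the relational database" means in the SQLite case, here is a minimal sketch using Python's standard sqlite3 module. It is not Semantic MediaWiki code: the table name ft_demo, the column name o_text and the choice of the FTS4 module are assumptions made for this illustration only, and the function simply returns None when the local SQLite build lacks FTS4.

```python
import sqlite3


def fulltext_demo(query):
    """Return the indexed texts matching `query`, or None if FTS4 is unavailable."""
    texts = [
        "Full-text search for properties",
        "Semantic annotations in wiki pages",
        "Searching text values",
    ]
    con = sqlite3.connect(":memory:")
    try:
        # Hypothetical table and column names, chosen for this sketch only.
        con.execute("CREATE VIRTUAL TABLE ft_demo USING fts4(o_text)")
    except sqlite3.OperationalError:
        con.close()
        return None  # this SQLite build was compiled without FTS4
    con.executemany(
        "INSERT INTO ft_demo (o_text) VALUES (?)", [(t,) for t in texts]
    )
    # MATCH consults the full-text index instead of scanning every row.
    rows = con.execute(
        "SELECT o_text FROM ft_demo WHERE o_text MATCH ? ORDER BY rowid",
        (query,),
    ).fetchall()
    con.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(fulltext_demo("search"))   # whole-token query
    print(fulltext_demo("search*"))  # prefix query
```

With the default tokenizer there is no stemming, so a whole-token query and a prefix query can return different rows; this is one reason the available search syntax depends on the database back-end.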
Requirements
- Semantic MediaWiki 2.5.0+
- SMWSQLStore3 using MySQL 5.5+ [1], MariaDB 10.0.5+ [1] or SQLite 3.8+ [2]
- PHP 5.5+
Features and limitations
- General notes
  - The FT_SEARCH table aggregates the search content for datatypes that store their data as BLOB and URI values, and the index search is executed against it, e.g. for datatype "Page", "Text", "Code" or "URL".
  - Supported operations rely on the relational backend database (MySQL, MariaDB and SQLite).
  - For MySQL and MariaDB databases, IN BOOLEAN MODE is used as the default search mode.
  - Relevance and scores are not used for any sorting purpose, e.g. as in best match.
  - TextSanitizer relies on the "onoi/tesa" library [5] to help with the sanitization of text or string elements, to provide some text manipulation support, and to allow language detection if enabled. This library is pre-installed for use by Semantic MediaWiki.
  - Custom stopwords are only applied by the "onoi/tesa" library [5] if language detection is enabled, but MySQL/MariaDB provide their own standard lists [6], which are enabled by default.
  - Starting with Semantic MediaWiki 3.0.0 (released on 11 October 2018 and compatible with MW 1.27.0 - 1.31.x):
    - If the SMW_FIELDT_CHAR_NOCASE option of configuration parameter $smwgFieldTypeFeatures (sets relational database specific field type features) is enabled, the full-text search only comes into effect for selections using the comparators ~ and !~. [7]
    - The API module "smwtask" (allows invoking and executing internal Semantic MediaWiki tasks) is used instead of a socket connection via a special page to invoke extra "work" after an update has been completed as part of an independent transaction. [8] See also configuration parameter $smwgPostEditUpdate (sets how many jobs should be executed as part of a post-edit event).
- Notes on Chinese, Japanese, and Korean support (CJK)
  - General CJK support is a challenging endeavour because text elements have to be broken into corresponding tokens that are not separated by spaces.
  - The "onoi/tesa" library [5] provides some simple Tokenizers which do not require language detection and which try to provide rudimentary CJK search out of the box. This, however, requires ICU 54+, which is still not used by MediaWiki as of version 1.29-alpha.
  - Mroonga is a MySQL storage engine and said to be a CJK-ready full-text search, column store.
  - MySQL comes with an optional ngram Full-Text Parser and a MeCab Full-Text Parser Plugin.
  - According to this issue, MariaDB is missing those parser plug-ins.
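The CJK notes above mention that text without spaces has to be broken into tokens before it can be indexed. One common approach is n-gram tokenization, sketched below. This is an illustration of the general idea only, not the code of the "onoi/tesa" Tokenizers or of the MySQL ngram parser; the function name ngram_tokens and the default size of 2 are assumptions.

```python
def ngram_tokens(text, n=2):
    """Split `text` into overlapping character n-grams per whitespace-separated chunk.

    Chunks that are not longer than `n` are kept whole (an assumption of
    this sketch; real parsers may drop or handle them differently).
    """
    tokens = []
    for chunk in text.split():
        if len(chunk) <= n:
            tokens.append(chunk)
        else:
            # Slide a window of width n over the chunk, one character at a time.
            tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


if __name__ == "__main__":
    for token in ngram_tokens("全文検索"):
        print(token)
```

Because every overlapping pair of characters becomes a token, a query for a two-character word can be matched anywhere inside a longer run of text, at the cost of a larger index.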
Configuration
- Configuration parameter $smwgEnabledFulltextSearch (sets whether full-text search support for properties may be used): enables the feature
- Configuration parameter $smwgFulltextDeferredUpdate (sets the number of expected full-text search index updates): throttles the number of expected index updates
- Configuration parameter $smwgFulltextSearchTableOptions (sets the full-text search table options to use during installation or update): sets database-related options
- Configuration parameter $smwgFulltextSearchMinTokenSize (sets the minimum word/token length that helps decide whether MATCH or LIKE operators are used for a condition statement): describes the minimum word/token length
- Configuration parameter $smwgFulltextLanguageDetection (sets which languages to detect for the full-text search from an indexable text): enables language detection (experimental setting)
- Configuration parameter $smwgFulltextSearchIndexableDataTypes (sets which datatypes are allowed to be indexed using the full-text search): lists the datatypes that should be indexed
- Configuration parameter $smwgFulltextSearchPropertyExemptionList (sets the property keys for which value assignments are exempted from the full-text indexing): lists the properties that should not be indexed
Changing any of the above settings requires re-running the maintenance script "rebuildFulltextSearchTable.php", which rebuilds the full-text search data table.
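To make the role of $smwgFulltextSearchMinTokenSize more concrete, the following sketch shows how a minimum token length could decide between a LIKE condition and a MATCH condition in boolean mode (MySQL/MariaDB syntax). It is a hypothetical illustration, not the Semantic MediaWiki implementation: the function name build_condition, the column name o_text, the default of 3 and the exact comparison used are all assumptions.

```python
def build_condition(term, min_token_size=3, column="o_text"):
    """Return (sql_fragment, parameter) for a search term.

    Terms shorter than `min_token_size` fall back to a LIKE condition;
    longer terms use MATCH ... AGAINST in boolean mode. The %s marker is
    a driver placeholder, so the term itself is never spliced into SQL.
    """
    term = term.strip()
    if len(term) < min_token_size:
        # Note: LIKE wildcards (% and _) inside the term are not escaped here.
        return f"{column} LIKE %s", f"%{term}%"
    return f"MATCH({column}) AGAINST (%s IN BOOLEAN MODE)", term


if __name__ == "__main__":
    for example in ("ab", "search"):
        print(build_condition(example))
```

The fallback matters because full-text indexes typically ignore tokens below a minimum length, so a MATCH condition for a very short term could return nothing even though the text contains it.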
Usage and instructions
- for users
  - "Searching" contains some examples and descriptions of the available search syntax
- for system administrators
  - "Indexing" describes some methods for manually creating and updating the index table
- for developers
  - "Technical notes" provides some information on the technical implementation, fine-tuning, and performance
References
1. Semantic MediaWiki: GitHub pull request gh:smw:1481
2. Semantic MediaWiki: GitHub pull request gh:smw:1801
3. Semantic MediaWiki: GitHub pull request gh:smw:1956
4. PostgreSQL is not supported due to a different index schema (e.g. to_tsvector, to_tsquery), but users who want to make it available are encouraged to have a look at "MySQLValueMatchConditionBuilder" to see how a PostgreSQL-specific implementation could be created.
5. "onoi/tesa": a small library to help with the sanitization of text or string elements.
6. https://dev.mysql.com/doc/refman/5.6/en/fulltext-stopwords.html and https://mariadb.com/kb/en/mariadb/stopwords/
7. Semantic MediaWiki: GitHub issue comment gh:smw:2499:307624826
8. Semantic MediaWiki: GitHub pull request gh:smw:3318