Problem with selecting pages with long SMW Property Values and LIKE

From semantic-mediawiki.org

First and foremost I know that searching pages with a semantic property that contains some text is only allowed for the first 70 characters or less, using the syntax [[Property_name::~*needle*]], however I would like to extend this functionality.


The example I will show you in this topic could not be reproduced on sandbox.semantic-mediawiki or similar wikis, probably due to different versions of the platform, I will give you my specifics:

MediaWiki 1.25.1

PHP 5.3.3 (apache2handler)

MySQL 5.1.73

Semantic MediaWiki 2.2.2

Semantic Result Formats 2.3


I will show you screenshots generated from my wiki/DB


Let's take 2 different pages, both part of the same category (Impresa), created with the same form etc. One, Very Short Page has a very short value for the Comment property, the other, Entity with a very long labeltest1234567890, has a very long value for the Comment property as shown in the pictures below

Very Short Page: http://imgur.com/aQBu4nY

The other one: http://imgur.com/TtTymSl


Now if I want to select these pages I could see that there is a common string in their comment, "This is a page", so i could think about doing an ask on [[Comment::~*is a page*]]

The result only includes Very Short Page because it's the only one with that substring in its first characters as shown here: http://imgur.com/Wd64Njs

Thanks to the debug result format I can see the SQL query in which the ask was translated, running the query directly on the database has the same result: http://imgur.com/rxTFi0r

But there is a problem with this: If I call [[Comment::~*c9b45d*]] I would expect to hit no results, as no pages in my wiki has that string in the Comment property, however I do hit the Entity with a very long... because its hashed content contains that string, this leads to a wrong result: http://imgur.com/ULYlQwz

While you might think this is a stupid search, in Italian there are words that can be formed with just letters from "a" to "f" such as "bebe", "caffe" etc. so it's possible to hit hashed properties while looking for something else entirely, these are false positive matches that I absolutely do not want.


The solution to both problems (false negatives while looking for "is a page" and false positives while looking for a gibberish hexadecimal string) is to look in the blob when it is not null and only look directly in the hash only if the blob is null, like Extension "Semantic Drilldown" does, such as: http://imgur.com/3wICFp0


I would like to know if it is possible to hotfix this functionality in the core SemanticMediaWiki implementation or why this is a bad idea, I would appreciate guidance on how to use this workaround for all ask queries myself if needed, or any other workaround that would get me the same result.

Thanks

17:37, 27 December 2016

As a hack/workaround I made the following modifications:

in extensions/SemanticMediaWiki/src/SQLStore/QueryEngine/QueryEngine.php -> private function getInstanceQueryResult( Query $query, $rootid ):

added near the beginning:


global $wgCustomImSearch;

if(preg_match("/([a-zA-Z][0-9]+).o_hash LIKE /", $this->querySegments[$rootid]->where) && $wgCustomImSearch) {

$this->querySegments[$rootid]->where = preg_replace("/([a-zA-Z][0-9]+).o_hash LIKE /", "(IF( \${1}.o_blob IS NULL , \${1}.o_hash, CONVERT( \${1}.o_blob USING utf8 ) )) LIKE ", $this->querySegments[$rootid]->where);

}


In extensions/SemanticMediaWiki/src/SQLStore/QueryEngine/QuerySegmentListResolver.php -> public function resolveForSegment( QuerySegment &$query ) -> case QuerySegment::Q_DISJUNCTION:

added a similar check to the subquery where(s) before they are translated in SQL, after sql = ; I added:

global $wgCustomImSearch;

if(preg_match("/([a-zA-Z][0-9]+).o_hash LIKE /", $subQuery->where) && $wgCustomImSearch) {

$subQuery->where = preg_replace("/([a-zA-Z][0-9]+).o_hash LIKE /", "(IF( \${1}.o_blob IS NULL , \${1}.o_hash, CONVERT( \${1}.o_blob USING utf8 ) )) LIKE ", $subQuery->where);

}

$wgCustomImSearch is a global variable that I adedd for now, defined in LocalSettings.php so that I can troubleshoot performance differences when this hack is active or not, but it can be avoided. So far I do not see any drawback for this hack and I have not found any other points in the code where it is needed. If you think this could be useful for the project, please let me know. If you know of any reason why this is a bad idea, once again, I'm all ears.

18:22, 28 December 2016

> Semantic MediaWiki 2.2.2 > Semantic Result Formats 2.3

Version 2.2.2 is no longer in active maintenance and therefore should not be avoided as reference when trying to modify code or cite code snippets.

> some text is only allowed for the first 70 characters or less, using the syntax ~*needle*, however I would like to extend this functionality.

To cover search in regards to longer text (or better text that is longer than the 70 char limit) or case insensitivity we have approached the problem differently in 2.5 [0] by relying on the full-text index for enabled dataTypes that are provided by the DB back-end.

> could not be reproduced on sandbox.semantic-mediawiki

The sandbox has the Fulltext search enabled where ~/!~ expressions do match conditions against the available full-text index (if you use format=debug then you should see that the SQL is generated using the MATCH ... AGAINST instead of LIKE).

> So far I do not see any drawback for this hack and I have not found any other points in the code where it is needed. If you think this could be useful for the project, please let me know. If you know of any reason why this is a bad idea

[0] cites several sources as to why using LIKE on a "long text" creates potential bottlenecks and is the reason why LIKE/NLIKE should only be considered on a limited VARCHAR field.

I think it is commendable that you dig into the code trying to find a workaround, unfortunately using an arbitrary `preg_replace` or `preg_match` approach is unlikely to be implemented due to SQL convention, support for different DB back-ends.

If you still think your approach should be examined more thoroughly then you may create a PR and run it against our test suite.

> only look directly in the hash only if the blob is null, like Extension:Semantic Drilldown

We consider this the wrong approach and may break on any schema change we apply to the SQLStore in future.

> I would like to know if it is possible to hotfix this functionality in the core SemanticMediaWiki implementation or why this is a bad idea

We try to find implementations that are hopefully stable and provide features without impacting existing functionality or risking the introduction of instabilities.

PS: Using [1] is a better place to discuss technical (code) changes.

[0] https://github.com/SemanticMediaWiki/SemanticMediaWiki/pull/1481

[1] https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues

04:09, 31 December 2016