Considerations for assigning fixed property?

From semantic-mediawiki.org

Short of doing a bunch of deep digging into the generated SQL and doing performance comparisons I'm wondering if anyone can shed some light on the tradeoffs of using fixed properties. Maybe Nischay Nahata? Some things that come to mind:

  1. How important is the datatype for fixed properties? On WikiApiary I'm using a lot of booleans. Do booleans get a benefit greater or lesser than text? Does it matter at all?
  2. While moving to separate tables should make searches faster (smaller indexes?), does it increase the number of joins that the database has to do?

I hope this doesn't sound lazy. It's great to now have this option, but I'm guessing everyone would benefit from understanding performance tradeoffs (it's never something for nothing right?) and impact before we all start making fixed properties. WikiApiary has 2.5M properties, and if moving my extremely often used boolean fields to fixed properties would improve performance I'd do it in a second.

🐝 thingles (talk)

21:49, 18 June 2013

First of all I am no MySQL expert, that said fixed properties is an experimental feature I did in my GSoC project and its performance benefits yet to be analysed.

1. How important is the datatype?

SMW has dedicated tables for each datatype. 'smw_di_bool' would be the name of the table for booleans having three attributes s_id (ID of the page), p_id (ID of the property), o_value (true or false). Now if you have a fixed property of the same type it would only have two fields s_id and o_value - that's because we already know the property. Therefore, in queries we have one less thing to compare; this is one benefit if you have lots of properties for the same datatype. This could be the queries used to display the Property:Page for that boolean property

Using fixed_table: Select s_id, o_value from fixed_table

Not using fixed_table: Select s_id, o_value from boolean_table where p_id=ID_OF_THE_PROPERTY

2. does it increase the number of joins that the database has to do?

Yes, in some cases it would. But I am unsure about the tradeoffs here as I haven't yet dealt with complex SQL generated from queries. Let's ask Markus :)

I hope you and everybody understands that all the predefined properties (Modification date, redirects, etc ) already are fixed properties having their own dedicated tables.

05:26, 19 June 2013
1. How important is the datatype?

The main consideration for selecting the datatype should be your application needs. If the property value can only be true or false, then it makes sense to be a Boolean. Regarding technical ramifications: roughly speaking, Booleans are simpler than pages, which in turn are simpler than integers and dates, which in turn are simpler than texts. "Simpler" mostly means "should require less disk space and memory", which may (or may not) translate into performance since more query results/indexes fit into memory. Selecting a value uses a single index in every case, so a greater number of fields in the value table does not imply a greater number of comparisons. However, larger data require more comparisons to match a value (the identity of two Booleans is easier to check than the identity of two strings).

2. Does it increase the number of joins that the database has to do?

No, I don't think there is any such situation. The use of any property condition in a query always requires one join with the property's table, no matter whether this table is a fixed table or shared with other properties. Special statistics pages (like Special:UnusedProperties) do not query property value tables at all. The use of fixed tables merely reduces the size of the index needed to find the right entries in the table (we don't need to check that the property is the same). This means that comparisons are done with a smaller index structure, which again should lead to performance increases in some cases. (This is what Nischay has already explained above by referring to the smaller number of comparisons.)

There is only one case I can think of where fixed property tables create an extra cost (and only if they are of type wikipage). This happens for value-based queries that do not refer to a specific property. For example, to find all pages that have "Berlin" as a page value for some property (without saying which), one needs to search all property tables for that value, and take the union of all the results (this is not a greater number of joins, but a greater number of queries/unions). The main situation where this type of query is used is in the lower backlinks section of Special:Browse. This only affects fixed property tables with datatype wikipage.

In general, it therefore makes sense to create fixed property tables for all heavily used properties (that usually are not expected to change their datatype). Doing this will save disk space and memory, and can potentially speed up some comparisons at query time (smaller indexes = less things to compare when searching a value). The number of tables that a DBMS can practically handle is limited. A few hundreds should not be a problem, but some ten-thousands would most likely lead to performance degradation. Moreover, a property with less than a thousand values does not need an own table. Fixed property tables are meant for very large wikis.

08:46, 19 June 2013

Super helpful Nischayn and Markus! Thank you! I'm wondering if I could help generate some performance data to help the cause? I would like to switch all of my status flags on WikiApiary to fixed properties Is active, Is audited and probably my collection flags as well like Collect statistics, Collect extension data (see all properties). I'm wondering if there is a way that I could collect some data to help show the benefit of using fixed properties? If there were some before and after I could collect that would be helpful I would happily do that.

🐝 thingles (talk)

22:25, 19 June 2013

One could compare query-response times but we don't have a benchmark or anything, so probably have to do things manually. Is it possible for you to try this on a test wiki rather than the production one?

14:25, 20 June 2013