SQLStore update

'''This page collects design notes for the new SQLStore3. It was a development document that should not be confused with the actual documentation of what is implemented now.

Discussions are done at the talk page. Also see /Tasks/ page for the process.

Issues and concerns
Here we collect the things that we would like to improve:


 * The code is too complex and thus hard to optimise.
 * The code is not really aware of the possible kinds of dataitems, but uses another way of describing data.
 * The table layout is historical and makes some operations harder (e.g., one cannot find out the type of a value from its table)
 * Data is written too often, even if it did not change.
 * Statistical information as needed for the special pages is very expensive to calculate.
 * String data has unnecessary length restrictions 34511

Unordered ideas

 * Align the table layout with the types of dataitems to make everything clearer
 * Use hashes to avoid unnecessary write operations
 * Split the code in multiple files based on functionality (ID management, writing, simple reading, statistics, queries)

Reading methods
The following table shows all reading methods that a store must implement. The query pattern describes input and output: bold parts are given as an input, parts with ? are left variable, underlined variables are returned as an output. All methods that take properties as inputs will also accept inverse properties: in this case subject and value are swapped and the according methods is called. For example, calling getPropertyValues with an inverse property leads to a call of getPropertySubjects.

(*) There is no method getAllPropertyValues in SMWStore yet, but one can call getPropertyValues with NULL as subject to get the same result.

Sorting and filtering results
All reading methods with one return variable support SMWRequestOptions that can be used to sort or to filter the list of return values. Filtering involves settings like Limit and Offset.

Sorting will sort values according to their natural order (if any), usually numerical or lexicographic. Geographic coordinates do not have a natural linear order to use for sorting. However, every kind of data must be sorted in some way (even if the order is not useful in practice). Otherwise the order of results could change between two requests, which would make it impossible to go through all results using Limit and Offset.

Properties are handled like wiki pages and sorted by their title. Wikipages and properties require a JOIN with the smw_ids table for sorting. If no sorting is needed, it is supposed to be cheaper to fetch their ID separately and to use it to select the right values.

Some types of data have multiple versions of each value stored in the database, one for sorting and one for storage. For example, numbers are stored as strings (reliable precision) and as floats (efficient sorting). In this case, it is likely that only the numerical version is needed for selecting values, so no index should be required on the other column.

Sorting and filtering can be done efficiently by the database management system in many cases. However, in order to use data that was cached, it might be necessary to sort and filter in PHP as well. SMWSQLStore2 has methods for both cases: one for translating SMWRequestOptions into SQL options, and one for applying SMWRequestOptions to a list of values.

Table layout
Here we document all database tables and related questions.

Table smw_object_ids
This table is used to store integer IDs for SMWDIWikiPage objects. These objects can represent wiki pages (possibly with some external interwiki prefix), or subobjects of wiki pages. IDs are also kept for properties (identified by a wiki page) and for special properties (no wiki page).

Columns:
 * smw_id (integer NOT NULL KEY AUTO_INCREMENT): the integer id
 * smw_namespace (integer NOT NULL): namespace of the wiki page
 * smw_title (titletext NOT NULL): title of the wiki page
 * smw_iw (interwikitext NOT NULL): interwiki prefix of the wiki page (or empty string)
 * smw_subobject (titletext NOT NULL): name for subobjects of a page (empty string if the entry is about the page)
 * smw_sortkey (titletext NOT NULL): string used for sorting
 * smw_data_hash (varchar NULL): string used to store hash value of the Semantic Data for this page to compare with newly generated ones on page edits

Indexes:
 * smw_id: primary key
 * smw_title,smw_namespace,smw_iw: selection by wiki page
 * smw_title,smw_namespace,smw_iw,smw_subobject: selection by the full data stored in SMWDIWikiPage
 * smw_sortkey: selection by sortkey (relevant, e.g., when comparators are used with page-type properties in #ask queries)

There are also some special objects that are stored here. These are marked by special interwiki prefixes that cannot occur in real pages:
 * SMW_SQL2_SMWREDIIW: entries for wiki pages that are redirects. Otherwise, such rows are similar to normal entries for (internal) wiki pages.
 * SMW_SQL2_SMWPREDEFIW: predefined entries, usually for predefined properties that have user-visible name (and a wiki page)
 * SMW_SQL2_SMWINTDEFIW: predefined entries for objects that have an Id but no user-visible name (internal objects)
 * SMW_SQL2_SMWBORDERIW: this is used in one row to mark the border between predefined ids (rows that are reserved for hardcoded ids built into SMW) and normal entries. This row is no object, but makes sure that SQL's auto increment counter is high enough to not add any objects before that marked "border".

Predefined entries for predefined properties are created on setup. SMW's internal predefined properties have fixed IDs to avoid lookups. Having them in the table is useful so that they can be found by functions that cannot not check this (e.g., statistical functions that show all properties). It is also important that no other object has the same ID, so it makes sense to occupy the space in the table.

Table smw_property_ids
Similar to table smw_ids but has some new fields to store important information about that property

Columns:
 * prop_id (integer NOT NULL KEY AUTO_INCREMENT): the integer id
 * prop_name (titletext NOT NULL): name of the property
 * prop_table (titletext NOT NULL): name of the property-value table for this property (we are planning to have dedicated tables for some properties)

Indexes:
 * prop_id: primary key
 * prop_title: selection by property name

Property-Value tables
Various tables store property-value pairs that are assigned to a given page. All of these tables have the following two columns:
 * s_id (integer): ID of the subject, that is, the page that has the value
 * p_id (integer): ID of the property to which the value was assigned

All IDs refer to the entries in smw_object_ids. The columns have the following indexes:
 * s_id: selection by subject
 * p_id: selection by property

For different types of values, there are different further columns as shown next.

SMWDINumber: smw_di_number
This table will store property-values of type SMWDINumber.

Value columns: Indexes:
 * value: the value of the property(Datatype:varchar).
 * sort_value: value used for sorting purpose(Datatype:double)
 * value: selection by value.

SMWDIBool: smw_di_bool
This table will store property-values of type SMWDIBool.

Value columns:
 * value: the value of the property(Datatype: Boolean).

SMWDITime: smw_di_time
This table will store property-values of type SMWDITime.

Value columns:
 * value_smwformat : the time stored as is now in smw_atts2.value_xsd column.(Datatype:string)
 * value_timestampformat: the time stored as a SMW timestamp(used for sorting and other purposes).(Datatype: double)

Indexes:
 * time_timestampformat: selection by value.

SMWDIString: smw_di_string
This table will store property-values of type SMWDIString.

Value columns:
 * value: the value of the property as blob.(Datatype:Blob)
 * hash_value: Hash value used for sorting and selection.(Datatype:varchar)

Indexes:
 * hash_value: Indexing strings by Hashing will allow effective searching and selection.

SMWDIWikiPage: smw_di_wikipage
This table stores property values of type SMWDIWikiPage. This table is also used to store "compound values". Such values are encoded as subobjects that act as "internal wiki pages", which can have further property values.

Value columns:
 * o_id (integer): ID of the object (value) that was assigned.(Datatype:int)

Indexes:
 * o_id: selection by value

SMWDIGeoCoord: smw_di_geocoord
This table stores property values of type SMWDIGeoCoord.

Value Columns:
 * lat (float): latitude of a coordinate
 * lon (float): longitude of a coordinate
 * alt (float): altitude of a coordinate

Indexes:
 * lat: selection by latitude
 * lon: selection by longitude
 * alt: selection by altitude

Questions and Answers

 * Shouldn't there be an index for lat,lon,alt (whole value)?
 * Do we use the indexes for the individual components anywhere?
 * The current database layout can only store limited precision numbers, so many user inputs are not going to be preserved accurately when storing them. We probably should use varchars to store accurate values. Should these be the only representations or do we also need the floats for anything? (this relates to the previous question)

Maintenance Scripts

 * Property Table migration - This script will make a new table for a property (and put that table name in the property_table field of table smw_property_ids) and move all the property-value tuples into this new table.