Talk:SQLStore update

Suggestions

 * Using the interwiki prefixes field to also identify special types of objects seems rather contrived. Having a separate field is probably better. --Jeroen De Dauw 19:19, 6 May 2012 (CEST)
 * If you think of interwiki prefixes as a kind of "super-namespace" then it is not so contrived. Having two fields would require a lot more memory, which relates to access speed and performance. The code would not become much simpler, since it would still have to check the same cases. So I am not convinced. --Markus Krötzsch 21:31, 7 May 2012 (CEST)


 * Field naming: s_id and p_id to subject_id and property_id --Jeroen De Dauw 19:19, 6 May 2012 (CEST)
 * This would require a custom update script to change; could be done but has low priority. For the new DB layout, we can use better names right away. --Markus Krötzsch 21:31, 7 May 2012 (CEST)


 * Would be nice to also have a draft of what the classes handling the DB stuff will look like, in particular how the current SQLStore functionality will be split. --Jeroen De Dauw 19:19, 6 May 2012 (CEST)
 * Good idea. --Markus Krötzsch 21:31, 7 May 2012 (CEST)


 * sphinxsearch could be useful to us, not sure though --Nischayn22 18:10, 7 May 2012 (CEST)
 * Can't really imagine where Sphinx (see also mws:Extension:SphinxSearch; its MVA (multi-valued attributes) feature could be of interest) would be helpful at this stage, as it is just an external search engine optimized for searching MySQL databases. Sphinx runs as an external daemon (similar to that of a triple store), but it doesn't store any blob objects; instead it generates highly optimized search keys together with their corresponding page IDs. So instead of making a DB select/join on a MySQL table, you could ask Sphinx to do the same, and I assume it would probably be faster than any optimization you could come up with in MySQL, especially in cases where you have to read sequentially through the DB. --MWJames 19:19, 7 May 2012 (CEST)
 * That seems to be similar to Lucene as a search backend. I am not sure how good these tools are for the (join-intensive) queries we have, but it could be an interesting project. However, this seems to be a whole GSoC project on its own, since one would have to translate SMW queries into calls/queries for Sphinx (not sure what query language they support), and then translate the results back. So one would replicate the effort of the existing SQL and SPARQL query backend bindings. Also, I am not sure to which extent Sphinx even supports all query features we use; this was not even easy with SPARQL and SQL in all cases. --Markus Krötzsch 21:31, 7 May 2012 (CEST)
 * We could have store implementations for many storage/query systems we currently do not support, but let's not get distracted by this and optimize the main store, MySQL, first. This is what people are using, and even if there are well-supported faster alternatives, it is likely to remain the most used option. --Jeroen De Dauw 17:12, 14 May 2012 (CEST)

Questions

 * Why are time values stored in two different formats? Just store it in one field as TS_MW, like the rest of MW does... (unless this notation does not support all the stuff SMW does, of course; not sure here). Also, the field names seem rather odd at the moment. --Jeroen De Dauw 19:19, 6 May 2012 (CEST)
 * I think the other format is stored to preserve the format in which the user entered the value, so that it can be output in the same format if required. --Nischayn22 12:40, 7 May 2012 (CEST)
 * Yes, and there are more reasons. SMW time values support imprecision. For example, if a user enters "2010" as a time, then this is what will be returned in queries. On the other hand, SMW supports sorting by times, which requires every time to be precise. For this purpose, "2010" is treated like "1 Jan 2010 00:00:01". It might still be possible to combine this into one column by using a custom format (note that we also need much longer years than MW, and I don't recall what exactly we do with calendar models on the DB level). But even if this can all be expressed in one string that sorts in the correct way (lexicographically), it might be less efficient than sorting by a floating point number as we do now; I don't know if this is the case. I would also suppose that an unindexed varchar + an indexed float might need less memory than an indexed varchar. So a one-column solution would possibly use more memory, be slower to query, and require a cumbersome encoding. This could all be wrong, but at least it is not obvious what would be better here. But I agree that the column names could be clearer. --Markus Krötzsch 21:40, 7 May 2012 (CEST)
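The two-column idea described above can be sketched as follows. The encoding and the function below are hypothetical, purely to illustrate keeping the user's input for display while sorting by a numeric key; SMW's actual serialization differs.

```python
# Sketch: keep the value as the user entered it for display, plus a
# numeric sort key so that imprecise dates like "2010" sort correctly
# among precise ones. Encoding is illustrative only.

def make_time_row(user_input, year, month=1, day=1, hour=0, minute=0, second=1):
    """Return (display_value, sort_key); the defaults mimic treating
    "2010" like "1 Jan 2010 00:00:01" as described above."""
    sort_key = (year + (month - 1) / 12.0 + (day - 1) / 372.0
                + (hour * 3600 + minute * 60 + second) / 32140800.0)
    return (user_input, sort_key)

rows = [
    make_time_row("2010", 2010),
    make_time_row("3 Mar 2009", 2009, 3, 3),
    make_time_row("15 Jun 2010 12:00:00", 2010, 6, 15, 12, 0, 0),
]

# Sorting by the float key gives the intended chronological order while
# each display value stays exactly as entered.
rows.sort(key=lambda r: r[1])
print([r[0] for r in rows])  # ['3 Mar 2009', '2010', '15 Jun 2010 12:00:00']
```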

Remarks

 * It's unclear to me what smw_rels2 is supposed to do... --Jeroen De Dauw 19:19, 6 May 2012 (CEST)
 * It stores property values of type SMWDIWikiPage. Is this what you meant? --Markus Krötzsch 21:46, 7 May 2012 (CEST)


 * The indexes on the geocoords table are fine for some very basic things, but for efficient handling of geospatial operations we need geospatial indexes. MySQL does not come with these by default, so we do not really need this that much, but it would be great to have an optional addition of such indexes if the DB has the right components installed. Both MySQL and PostgreSQL have additional modules for this. --Jeroen De Dauw 19:19, 6 May 2012 (CEST)
 * My question was a different one: in all SMW read operations, we filter by all three columns (the whole value). For this, one would like to use an index over "lat,lon,alt" (btw., since you mentioned it, these are bad names too ;-). Instead, the current layout has separate indexes over "lat", "lon", and "alt". This is likely to be less efficient and more memory consuming. --Markus Krötzsch 21:46, 7 May 2012 (CEST)
 * Why always filter over all columns? That seems the part that should get fixed to me. Either way, unless I'm mistaken it's perfectly fine to have separate indexes. What makes you think this is not the case? --Jeroen De Dauw 17:21, 14 May 2012 (CEST)
 * Bad names? I'd not object to having them written out here, but I don't think the current naming is bad, since it's obvious what it means in this context and is not redundant (everything I remarked upon had a problem with one of those). --Jeroen De Dauw 17:21, 14 May 2012 (CEST)
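The difference between the two index layouts discussed above can be illustrated with a small experiment (SQLite here as a stand-in for MySQL; table and index names are made up):

```python
import sqlite3

# Compare the current layout (three single-column indexes) with the
# suggested composite index over the whole value.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE smw_coords (s_id INTEGER, lat REAL, lon REAL, alt REAL)")

# Current layout: one index per column.
for col in ("lat", "lon", "alt"):
    cur.execute("CREATE INDEX idx_%s ON smw_coords(%s)" % (col, col))

# Alternative: a single composite index that a filter on all three
# columns (the whole value) can use directly.
cur.execute("CREATE INDEX idx_latlonalt ON smw_coords(lat, lon, alt)")

cur.execute("INSERT INTO smw_coords VALUES (1, 52.52, 13.405, 34.0)")
row = cur.execute(
    "SELECT s_id FROM smw_coords "
    "WHERE lat = 52.52 AND lon = 13.405 AND alt = 34.0").fetchone()
print(row)  # (1,)

# The planner picks the composite index for the three-column filter,
# since it covers all three equality constraints at once.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT s_id FROM smw_coords "
    "WHERE lat = 52.52 AND lon = 13.405 AND alt = 34.0").fetchall()
print(plan[0][-1])
```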


 * Layout of the geo_coords table: storing a value in floats as the primary representation seems to be a bad idea, since floats cannot represent all numbers accurately. So it is likely that values "change" when being stored (i.e., they do not round-trip). It would probably be better to store the coordinate as three numbers, either in one varchar or in three. --Markus Krötzsch 21:46, 7 May 2012 (CEST)
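The round-trip problem is easy to demonstrate; the snippet below simulates a single-precision FLOAT column (as MySQL's FLOAT is) with Python's struct module:

```python
import struct

def roundtrip_float32(x):
    """Simulate writing a value to a single-precision FLOAT column
    and reading it back."""
    return struct.unpack("f", struct.pack("f", x))[0]

lat = 52.520008                # the coordinate as the user entered it
stored = roundtrip_float32(lat)
print(stored)                  # no longer exactly 52.520008
assert stored != lat           # the value "changed" when being stored

# Keeping the original digits as text round-trips by construction.
original = "52.520008"
assert float(original) == lat
```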

Indexing the String/Blob fields
AFAIK we currently don't index fields of type string/blob. This article suggests how querying strings can be made faster using indexes: "If a table has 1,000 rows, this is at least 100 times faster than reading sequentially". However, it also says "If you need to access most of the rows, it is faster to read sequentially". --Nischayn22 12:25, 7 May 2012 (CEST)

Can we support searching using both techniques? If the user only wants to know the page where Email=xyz@xyz.com, a sequential search will be very slow and an indexed search will be very fast. --Nischayn22 12:25, 7 May 2012 (CEST)


 * You are right: one should always have indexes on the column (or column combination) that one uses to select rows. However, SMW does not select by Blobs, so it does not need those indexes. Normal strings are indexed (they are now stored in smw_atts2 in the value_xsd column). The original idea of splitting blobs from strings was that strings are length-limited but can be used in search, while blobs are unlimited but can not be used to select rows (or to sort results). The idea of the new design is to have both. --Markus Krötzsch 21:50, 7 May 2012 (CEST)
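The "have both" idea could look roughly like this (SQLite as a stand-in; the table/column names smw_blobs, o_blob, o_key and the length limit are invented for illustration):

```python
import sqlite3

MAX_KEY = 72  # hypothetical length limit for the indexed search key

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# The full text goes into an unindexed blob column; a length-limited
# copy goes into an indexed key column used for selecting and sorting.
cur.execute("CREATE TABLE smw_blobs (s_id INTEGER, o_blob TEXT, o_key TEXT)")
cur.execute("CREATE INDEX idx_key ON smw_blobs(o_key)")

def insert_value(s_id, text):
    cur.execute("INSERT INTO smw_blobs VALUES (?, ?, ?)",
                (s_id, text, text[:MAX_KEY]))

insert_value(1, "xyz@xyz.com")
insert_value(2, "some very long article text " * 20)

# A select like the Email example above hits the index instead of
# scanning sequentially.
row = cur.execute("SELECT s_id FROM smw_blobs WHERE o_key = ?",
                  ("xyz@xyz.com",)).fetchone()
print(row)  # (1,)
```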

Migration and DB scheme update
I am surely stating something obvious, but nevertheless I wanted to emphasize that any change to the DB schema should be thoroughly tested in order to avoid any interruptions for production systems. Running an unselective update over every object (the last time we did a complete update, it took three days to finish) would be a very costly endeavour and should be avoided at all cost, while at the same time the migration should ensure that any conversion of values results in the same output for the user. --MWJames 05:20, 8 May 2012 (CEST)


 * I have now made the basic scripts to migrate data: https://gerrit.wikimedia.org/r/#/c/16995/ :). They still only work in one run (no limits can be specified). Feel free to hack on them. --Nischayn22 06:39, 31 July 2012 (CEST)

Database design for faster query answering
A few suggestions to improve our query answering time:
 * The table smw_ids stores the ids of both properties and wiki pages; the idea was that properties also have pages. But the number of pages can be very large while there will be few properties, so maybe we can have a separate table for storing property ids and also store prop_table there. (This will definitely decrease our query response time to some small extent.) --Nischayn22 19:45, 16 May 2012 (CEST)
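A sketch of that split (SQLite as a stand-in; the table names smw_property_ids and smw_page_ids and their columns are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Small table: one row per property, with the prop_table field recording
# which value table holds that property's data.
cur.execute("CREATE TABLE smw_property_ids "
            "(p_id INTEGER PRIMARY KEY, title TEXT, prop_table TEXT)")
# Large table: one row per wiki page, with no property-specific columns.
cur.execute("CREATE TABLE smw_page_ids "
            "(s_id INTEGER PRIMARY KEY, title TEXT, namespace INTEGER)")

cur.execute("INSERT INTO smw_property_ids VALUES (1, 'Population', 'smw_atts2')")
# Looking up a property no longer touches the much larger page-id table.
row = cur.execute("SELECT p_id, prop_table FROM smw_property_ids "
                  "WHERE title = 'Population'").fetchone()
print(row)  # (1, 'smw_atts2')
```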


 * We could consider horizontal partitioning of the property-value tables based on some categorization/grouping, such that we know which table to look in for particular types of properties. For example, for properties with pid between 1-10 and 11-20 we would have two different tables, so when we want to look for some specific property-value tuples, we only scan through one of the two smaller tables. (Not sure how we can maintain separate tables for some properties as you suggested; switching those tables on and off will be a time-intensive task and will lead to database locking for that time.) --Nischayn22 19:45, 16 May 2012 (CEST)
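The range-based routing could be as simple as computing the partition table name from the pid; the table names and range size below are invented for illustration:

```python
# Route a property id to one of several smaller partition tables so
# that only that table needs scanning. Names and range size are
# hypothetical.
RANGE_SIZE = 10

def table_for_pid(pid):
    """Map a property id to its partition table name."""
    lower = ((pid - 1) // RANGE_SIZE) * RANGE_SIZE + 1
    upper = lower + RANGE_SIZE - 1
    return "smw_rels2_%d_%d" % (lower, upper)

print(table_for_pid(3))   # smw_rels2_1_10
print(table_for_pid(10))  # smw_rels2_1_10
print(table_for_pid(15))  # smw_rels2_11_20
```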


 * Instead, we will now be using separate tables for some properties (see the section below). --Nischayn22 19:45, 16 May 2012 (CEST)


 * We can trade off space for time and have some duplicate tables with the same data in different structures, so that we don't have to look at all the tables to search for all properties used on a particular page (subject). Maybe we can have separate tables wherein the data is grouped on the basis of sids (similar to my example above with pids). --Nischayn22 19:45, 16 May 2012 (CEST)
 * This will be low priority for now; also, we have to keep in mind that with such duplication the database would no longer be normalized. --Nischayn22 19:45, 16 May 2012 (CEST)

New Special Property "Uses Table"
I propose that we have a new special property "Uses Table" which will be used to assign a new DB table to that property. The table will be created automatically with the fields required for that property's type. The name of the table will be something like smw_yourproperty, and it will be stored in the prop_table field of the smw_property_ids table which I proposed earlier (which means the table for this special property will be smw_property_ids).

Note: --Nischayn22 19:45, 16 May 2012 (CEST)
 * If you change the type of the property later on, it will mess up the table's fields and produce errors; checking for this can be done in later stages.
 * Only administrators should be able to use this property.
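As a rough sketch of how such a table could be generated from the property's type (the type-to-fields mapping and the column names below are guesses for illustration, not SMW's actual layout):

```python
# Hypothetical mapping from an SMW type id to the fields its per-property
# table would need; the field names are invented.
FIELDS_BY_TYPE = {
    "_wpg": "o_id INTEGER",                      # wiki page: object id
    "_num": "o_serialized TEXT, o_sortkey REAL", # number: value + sort key
    "_txt": "o_blob TEXT",                       # text: unindexed blob
}

def create_table_sql(property_name, property_type):
    """Build the CREATE TABLE statement for a property's own table,
    named smw_<property> as proposed above."""
    fields = FIELDS_BY_TYPE[property_type]
    table = "smw_" + property_name.lower().replace(" ", "_")
    return "CREATE TABLE %s (s_id INTEGER, %s)" % (table, fields)

print(create_table_sql("Population", "_num"))
# CREATE TABLE smw_population (s_id INTEGER, o_serialized TEXT, o_sortkey REAL)
```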