Architecture guide

This documentation is work in progress; some sections are not finished yet.

This page explains the software architecture of SMW on a high level. It is intended to help developers get started with SMW development by understanding the main concepts. Another important goal of this page is to clarify the architectural choices that underlie the PHP implementation. Due to the nature of PHP code and extension development, the actual source code may make it difficult to separate architecture (deliberate, essential design decisions) from implementation (pragmatic, opportunistic realization decisions). This page tries to pin down at least the main concepts.

The SMW source code documentation should be consulted for details on the code described below. While general architecture decisions are rather stable, this page can still become outdated. The last review of this page was in Feb 2011, SMW 1.5.5. A new revision is underway; the first parts already reflect the status in SMW 1.6.0.

Functional components

The user-visible functions of SMW can roughly be divided into the following main groups:

  • Data input (annotation, writing)
  • Data inspection and output (browsing, special pages)
  • Query answering and query formatting
  • Data export
  • Maintenance tools (scripts and special pages used by site admins)

Naturally, this is only a rough user view, and some features may belong to more than one group. Each of these units has dedicated code, but all of this code is based on a common data model that provides a unified internal representation for handling semantic data. This is the core of the SMW architecture, explained in more detail below. Moreover, there are additional facilities for internationalization that are relevant throughout the code.

The data model of SMW

The most essential architectural choice of SMW is in its data model. This specifies the way in which semantic information is represented within the system – actually it specifies what semantic information is in SMW. The main concepts of this agree with the view of SMW users:

Data in SMW is represented by property-value pairs that are assigned to objects.

For example:

Dresden (object) has a population (property) of 517,052 (value).

We can elaborate this schema with additional facts:

  • The described objects are normally wiki pages.
  • Properties can be used everywhere. They do not belong to specific wiki pages, categories, etc.
  • Values can be of different types (numbers, dates, other wiki pages, ...).
  • The datatype is part of the value's identity (values of different types are different, even if they are based on the same user input).
  • Objects can have zero, one, or many values for any property.
  • A semantic fact is completely specified by its object, property, and value. For instance, it does not matter who specified a fact (in contrast to tagging systems where each user can have individual tags for the same thing).
  • Facts are either given or not given, but they cannot be given more than once (again in contrast to tagging systems where we count how often a tag has been given to a resource).
  • "Object" is a very general term, so we often use "subject" when we want to emphasize that a thing is the subject of a property-value assignment in a fact.
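
The points above can be made concrete with a small model. The following Python sketch is not SMW code (SMW itself is written in PHP); the names Value, Fact, and state_fact are invented for illustration. It only shows that facts behave like a set of subject-property-value triples and that the datatype belongs to the value's identity.

```python
from typing import NamedTuple


class Value(NamedTuple):
    datatype: str  # part of the value's identity
    content: str


class Fact(NamedTuple):
    subject: str
    prop: str
    value: Value


facts: set = set()


def state_fact(subject: str, prop: str, datatype: str, content: str) -> None:
    """Record a fact; a set ignores repeated statements of the same fact."""
    facts.add(Fact(subject, prop, Value(datatype, content)))


state_fact("Dresden", "population", "number", "517052")
state_fact("Dresden", "population", "number", "517052")  # same fact, no effect
state_fact("Dresden", "population", "string", "517052")  # other type, other value
state_fact("Dresden", "located in", "page", "Germany")

print(len(facts))  # 3 distinct facts
```

Note that the model stores no author and no counter per fact, which is the difference from tagging systems mentioned above.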

These ideas are reflected in the basic data model. All elements of a fact are represented by PHP objects of (subclasses of) the class SMWDataItem. Sets of semantic facts in turn are represented by objects of type SMWSemanticData. The main concepts of both are described below; details are explained in the source code documentation.

The System Perspective on Data

Developers who directly work with data do not need functionality for user input and output. They can work with SMW like with a database that allows data to be managed, stored, and queried. This section describes the basic software components that are involved in representing data on such a level.

SMWDataItem: SMW's way of representing elements

This class and its subclasses are the basic building blocks of all SMW elements. Their purpose is to provide a unified interface for all semantic entities that SMW deals with, e.g., numbers, dates, geo coordinates, wiki pages, and properties. It might be surprising that not only values but also subjects and properties are represented by subclasses of SMWDataItem. This makes sense since wiki pages can be both subjects and values, and since properties have many similarities with wiki pages (in particular they have associated articles).

Objects of class SMWDataItem represent very simple pieces of data. A dataitem is like a primitive type (e.g. a PHP string or number): its identity is determined by its contents and nothing else. Dataitems should thus be thought of as "primitive values" that are merely a bit more elaborate than the primitive types in PHP. Their main characteristics are:

  • Immutable: Once created, a dataitem cannot be changed.
  • Context independent: The meaning of a dataitem is only based on its content, not on any contextual information (such as the information about the property it is assigned to).
  • Limited shape: The kinds of dataitems (numbers, URLs, pages, ...) that SMW supports are limited and fixed. Extensions cannot add new kinds of dataitems, and programmers only need to handle a fixed list of possible kinds of dataitems.

Being immutable is essential for dataitems to behave like simple values. It imposes a restriction on programmers, but it also simplifies programming a lot since one does not have to be concerned about dataitems being changed by code that happens to have a reference to them.
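
As a rough illustration of what "immutable" and "identity determined by contents" mean in practice, here is a minimal Python sketch. NumberItem is a hypothetical stand-in and not the SMWDINumber class; it only demonstrates the two properties.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class NumberItem:
    """Hypothetical stand-in for a number dataitem."""
    number: float


a = NumberItem(517052)
b = NumberItem(517052)

equal_by_content = a == b  # identity is determined by content alone

try:
    a.number = 0  # any attempt to change the item is rejected
    was_mutated = True
except FrozenInstanceError:
    was_mutated = False

print(equal_by_content, was_mutated)
```

Because such items cannot change, they can be shared freely between callers and used as dictionary keys.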

The available kinds of dataitems correspond to subclasses of SMWDataItem. For convenience, each kind of dataitem is also associated with a PHP constant called its "DIType". For example, instead of using a nested if-then-else statement with many instanceof checks, one can use a switch over this DIType to handle different cases. The following table gives all dataitems:

Class | DIType | Description
SMWDIWikiPage | SMWDataItem::TYPE_WIKIPAGE | Dataitems that represent a page in a wiki or a "subobject" of such a page. They are determined by the page title (string in MediaWiki DBkey format), namespace, interwiki code, and a subobject name (can be empty).
SMWDIProperty | SMWDataItem::TYPE_PROPERTY | Dataitems that represent an SMW property. They are determined by the property key (which is the page DBKey string for user-defined properties), and the information whether or not they are inverted.
SMWDINumber | SMWDataItem::TYPE_NUMBER | Dataitems that represent some number.
SMWDIString | SMWDataItem::TYPE_STRING | Dataitems that represent a string that is not longer than MediaWiki titles (256 characters).
SMWDIBlob | SMWDataItem::TYPE_BLOB | Dataitems that represent a string (of any length).
SMWDIBoolean | SMWDataItem::TYPE_BOOLEAN | Dataitems that represent a truth value (true or false).
SMWDIUri | SMWDataItem::TYPE_URI | Dataitems that represent a URI (or IRI) according to RFC 3987.
SMWDITime | SMWDataItem::TYPE_TIME | Dataitems that represent a point in time in human or geological history. They are determined by a year, month, day, hour, minute, and (decimal) second, as well as a calendar model to interpret these values in (Julian or Gregorian).
SMWDIGeoCoord | SMWDataItem::TYPE_GEO | Dataitems that represent a location on earth, represented by latitude and longitude.
SMWDIContainer | SMWDataItem::TYPE_CONTAINER | Dataitems that represent a set of SMW facts, represented by an object of type SMWSemanticData (see below).
SMWDIConcept | SMWDataItem::TYPE_CONCEPT | Dataitems that represent the input and feature information for some SMW concept (query, description, features in query, size and depth).
SMWDIError | SMWDataItem::TYPE_ERROR | Dataitems that represent a list of errors (array of string). Used to gently pass on errors when dataitem return types are expected.
no class | SMWDataItem::TYPE_NOTYPE | Additional DIType constant that is used to indicate that the type is not known at all.
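
The idea of branching on a type constant instead of chaining instanceof checks can be sketched as follows. This is an illustrative Python model: the DIType enum below is a made-up subset and does not carry the real values of the SMWDataItem::TYPE_* constants, and Item and describe are hypothetical names.

```python
from enum import Enum


class DIType(Enum):
    """Illustrative subset of type constants; not the real SMW values."""
    NUMBER = "number"
    BOOLEAN = "boolean"
    WIKIPAGE = "wikipage"


class Item:
    def __init__(self, di_type: DIType, content):
        self.di_type = di_type
        self.content = content


def describe(item: Item) -> str:
    """Handle each kind of item by branching on its type constant."""
    kind = item.di_type
    if kind is DIType.NUMBER:
        return f"number {item.content}"
    if kind is DIType.BOOLEAN:
        return "true" if item.content else "false"
    if kind is DIType.WIKIPAGE:
        return f"page [[{item.content}]]"
    raise ValueError(f"unhandled kind of item: {kind}")


print(describe(Item(DIType.WIKIPAGE, "Dresden")))
```

Since the list of kinds is fixed, such a dispatch can be exhaustive; the final error branch only guards against kinds the author forgot.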

The restriction to these types of dataitem may at first look like a major limitation, since it means that SMW can only represent limited forms of data. For example, there is no dataitem for storing the structure of chemical formulae. Doesn't this mean that SMW can never handle such data? No, because the existing dataitems can be used to keep all required information (for example by representing chemical formulae as strings). The task of interpreting this basic data as a chemical formula has to be handled on higher levels that deal with user input and output (the user view is explained in later sections). There is one kind of dataitem, SMWDIContainer, that represents "values" that consist of many SMW facts (subject-property-value triples); almost all complex forms of data that SMW does not have a dataitem for could be accurately represented in this format.

Creating dataitems is very easy: just call the constructor of the dataitem with the required values. Note that dataitems are strict about data quality: they are not meant to show the error-tolerance of the SMW user interface. For a programmer, it is more useful to see a clear error than to have SMW use some "repaired" or partly "guessed" value when a problem occurred. When trying to create dataitems from illegal data (e.g. trying to make a wikipage for an empty page title), an exception will be thrown. Usually dataitems will only implement basic data validation to avoid complex computations. If strict validation of, say, a URI string is needed, then you need to implement your own validation methods.
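
The "fail clearly instead of repairing" policy might look like the following in a simplified Python model. WikiPageItem and InvalidDataError are hypothetical names chosen for this sketch; the parameters merely mirror the components listed for wiki page dataitems above.

```python
class InvalidDataError(ValueError):
    """Raised when an item is constructed from illegal data."""


class WikiPageItem:
    def __init__(self, dbkey: str, namespace: int,
                 interwiki: str = "", subobject: str = ""):
        # Only a cheap check: reject obviously illegal input, never repair it.
        if dbkey == "":
            raise InvalidDataError("page title must not be empty")
        self.dbkey = dbkey
        self.namespace = namespace
        self.interwiki = interwiki
        self.subobject = subobject


page = WikiPageItem("Dresden", 0)

try:
    WikiPageItem("", 0)
except InvalidDataError as error:
    print(f"rejected: {error}")
```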

Dataitems implement a standard interface that allows useful operations like serialization and unserialization (a second way to create them from serialized strings). They also can generate a string hash code to efficiently compare their contents. Each dataitem also implements basic get methods to access the data, and sometimes other helper methods that are useful for the given kind of data. See the online documentation for details. The important thing is to keep dataitems lean: they are simple data containers, and complex parsing or formatting functions are implemented elsewhere.
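
A serialization round trip and a content-based hash string could be modelled like this. The sketch is in Python and purely illustrative: DateItem, its methods, and the "year/month/day" format are assumptions made for the example and do not describe the serialization format SMW actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DateItem:
    year: int
    month: int
    day: int

    def serialize(self) -> str:
        return f"{self.year}/{self.month}/{self.day}"

    @classmethod
    def unserialize(cls, text: str) -> "DateItem":
        parts = text.split("/")
        if len(parts) != 3:
            raise ValueError(f"not a serialized date: {text!r}")
        year, month, day = (int(part) for part in parts)
        return cls(year, month, day)

    def content_hash(self) -> str:
        # Equal contents give equal hash strings, so comparing the
        # strings is enough to compare two items.
        return "date:" + self.serialize()


original = DateItem(2011, 2, 14)
restored = DateItem.unserialize(original.serialize())
print(restored == original)
```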

SMWSemanticData and other ways to represent facts

To represent semantic information, dataitems need to be combined into facts. For this purpose, the class SMWSemanticData provides a basic container for handling sets of facts that refer to the same subject. This makes sense since it is by far the most common case that the subject is the same for many facts (e.g. all facts on one page, or all facts in one row of a query result). An SMWSemanticData container further groups values by property: it has a list of properties, and for each property a list of values. Again this reflects common access patterns and avoids duplication of information. The data contained in SMWSemanticData can still be viewed as a set of subject-property-value triples, but SMW has no dedicated way to represent such triples, i.e. there is no special class for representing one fact.
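
The grouping described here (one subject, a list of properties, and a list of values per property) can be sketched in a few lines. The Python class below is a simplified model with invented method names, not the interface of SMWSemanticData; plain strings and numbers stand in for dataitems.

```python
class SemanticData:
    """Facts about one subject, grouped by property."""

    def __init__(self, subject: str):
        self.subject = subject
        self._values = {}  # property -> list of values, in insertion order

    def add(self, prop: str, value) -> None:
        values = self._values.setdefault(prop, [])
        if value not in values:  # a fact cannot be given more than once
            values.append(value)

    def properties(self) -> list:
        return list(self._values)

    def values(self, prop: str) -> list:
        return list(self._values.get(prop, []))

    def triples(self):
        """View the container as subject-property-value triples."""
        for prop, values in self._values.items():
            for value in values:
                yield (self.subject, prop, value)


data = SemanticData("Dresden")
data.add("population", 517052)
data.add("located in", "Saxony")
data.add("located in", "Germany")
data.add("located in", "Germany")  # repeated, ignored

for triple in data.triples():
    print(triple)
```

The triples are only generated on request; the subject is stored once and each property once, which is the saving the text refers to.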

SMW generally uses SMWSemanticData whenever sets of triples are needed. If many subjects are involved, then one may use an array of SMWSemanticData objects. In other cases, one only wants to consider a list of SMWDataValues (introduced below) instead of whole facts, e.g. when fetching the list of all pages that are annotated with a given property-value pair (e.g. all things located in France). In this case, the facts are implicit (one could combine the query parameters "located_in" and "France" with each of the result values). Another important case is that of query results. They have their own special container class SMWQueryResult which is similar to a list of SMWSemanticData containers for each row, but has some more intelligence to fetch the required data only on demand (implementation detail, but part of the raison d'etre for this class).

The User Perspective on Data

The earlier section provided a basic data model and its technical realization with dataitems and SMWSemanticData containers. This allows us to accurately represent data in PHP, but it does not yet provide the necessary input and output functionality to communicate this data to the user. Much of this user interface is provided by SMWDataValue and its subclasses. The underlying idea is that values are organized in datatypes that define how user inputs are interpreted and how data is presented in outputs.

The rest of this section needs revision. It is still at the state of SMW 1.5.6.

SMWDataValue

This class and its subclasses are the basic building blocks of all SMW elements. Their purpose is to provide a unified interface for all semantic entities that SMW deals with, e.g., numbers, dates, geo coordinates, wiki pages, and properties. It might be surprising that not only values but also objects and properties are represented by the SMWDataValue class. This makes sense since wiki pages can be both objects and values, and since properties have many similarities with wiki pages (in particular they have associated articles).

The various subclasses of SMWDataValue roughly correspond to the datatypes that a value can have, and they implement the specific behaviour of particular values. In the current architecture, SMWDataValue subclasses implement all functions that are specific to a particular datatype:

  • Interpreting user input (e.g. parsing a string into a calendar date)
  • Encoding values in a format that is suitable for storage (e.g. computing a standardized, language-independent string for representing a date and a floating point number that can be used for sorting dates)
  • Generating readable output (e.g. converting the internal representation of a date back into a text that is readable in the current language)
  • Generating serializations in the RDF export format (e.g. representing dates with the XML Schema type "date" if they are in the range of this datatype, and using some fallback encoding otherwise)

Each data value thus has many forms of representation: the text a user writes on a wiki page (this can have many forms that lead to the same value), the various display versions (e.g. augmented with links or tooltips), a unique internal representation (the value as the software sees it), and an RDF encoding. Some subclasses of SMWDataValue have additional representations, e.g. SMWTimeValue provides an output that is formatted according to ISO 8601. The class SMWDataValue has different get methods for obtaining these representations, and programmers should read the software documentation of that class to understand when to use which method.
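
To illustrate how several user inputs can lead to one internal value that is then rendered in different ways, consider this small Python model. CountValue and all of its methods are hypothetical; they do not correspond to the get methods of SMWDataValue and only show the separation between input text, internal value, and output forms.

```python
import html


class CountValue:
    """Hypothetical datavalue for whole numbers with tolerant input parsing."""

    def __init__(self) -> None:
        self.number = None
        self.errors = []

    def set_user_value(self, text: str) -> None:
        # Tolerate surrounding whitespace and thousands separators.
        cleaned = text.strip().replace(",", "")
        try:
            self.number = int(cleaned)
        except ValueError:
            self.errors.append(f"'{text}' is not a whole number")

    def is_valid(self) -> bool:
        return self.number is not None and not self.errors

    def storage_value(self) -> int:
        return self.number  # the single internal representation

    def wiki_text(self) -> str:
        return f"{self.number:,}"

    def html_text(self) -> str:
        return f'<span class="count">{html.escape(self.wiki_text())}</span>'


first, second = CountValue(), CountValue()
first.set_user_value("517,052")
second.set_user_value("  517052 ")
print(first.storage_value() == second.storage_value(), first.wiki_text())
```

Unlike a dataitem, such an object collects errors instead of throwing, since it sits on the user-facing side.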

A general challenge here is that the datatypes are very diverse. Many values have an internal structure, e.g. dates have a year, month, and day component, whereas wiki page titles have a namespace identifier, title text, and interwiki component. This diversity makes it hard to treat values as single values of some primitive datatype (e.g. representing wiki pages as strings would not be accurate, since it would be impossible to filter such values by namespace). The internal representation of a data value therefore is a list of values, obtained with the method getDBkeys(). How long this list is, and which type each of its entries has is determined by the method getSignature(). There is also a method getHash() that returns a string that can be used to compare two datavalues without looking at the details of their format.
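
The relation between a list-shaped internal representation and its signature can be modelled as below. This Python sketch makes several assumptions: the signature is expressed as a tuple of Python types, and db_keys, signature, and hash_string are invented names that only loosely mirror getDBkeys(), getSignature(), and getHash(). The namespace numbers are arbitrary.

```python
class PageValue:
    # One Python type per entry of the internal representation:
    # title text, namespace identifier, interwiki prefix.
    SIGNATURE = (str, int, str)

    def __init__(self, title: str, namespace: int, interwiki: str = ""):
        self.title = title
        self.namespace = namespace
        self.interwiki = interwiki

    def db_keys(self) -> list:
        return [self.title, self.namespace, self.interwiki]

    def signature(self) -> tuple:
        return self.SIGNATURE

    def hash_string(self) -> str:
        # Lets callers compare values without knowing their structure.
        return "|".join(str(key) for key in self.db_keys())


def conforms(value) -> bool:
    """Check that the key list has the length and types the signature states."""
    keys, types = value.db_keys(), value.signature()
    return len(keys) == len(types) and all(
        isinstance(key, expected) for key, expected in zip(keys, types)
    )


pages = [PageValue("Dresden", 0), PageValue("Population", 102), PageValue("Berlin", 0)]
# Filtering by namespace works because the namespace is its own entry.
in_namespace_zero = [page.title for page in pages if page.db_keys()[1] == 0]
print(in_namespace_zero)
```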

Another challenge is the diversity of desirable output formats. Users typically want a large number of formatting options that are very specific to certain datatypes, so that it is hard to provide them via a unified interface. Moreover, output is used in MediaWiki both within HTML and within Wikitext contexts, requiring different formatting and treatment of special characters. The output methods of SMWDataValue reflect some of this diversity, and an additional facility of "output formats" (see SMWDataValue::setOutputFormat()) is provided for more fine-grained control. But obviously there must be limits of what can be achieved without cluttering the architecture, and users are advised to subclass their own datatype implementations for special formatting.

DataValue objects can be created directly via their own methods, but it is usually advisable to use the SMWDataValueFactory instead. See below for an explanation of SMW's datatype system.


Datatypes: user and system perspective

Users can pick many datatypes for their data. Yet they do not specify the type for each value they write, but assign one global type to a property. This is slightly different from SMW's internal architecture, where datatypes influence a value's identity, whereas all properties are represented by values of a single type (class SMWPropertyValue). This is not a problem, it simply says that the type information that users provide for each property is interpreted as "management information" that SMW uses to interpret user inputs. The data model is still as described above, with types being part of the values (which is where they are mainly needed). Again: the typing approach in the user interface does not affect the data model but helps SMW to make sense of user input. One could equivalently dispense with the property-related types and require users to write the type for each input value instead. This would simply be cumbersome and would prevent some implementation optimizations that are now possible since we can assume that properties have values of only one type that we know.

Users refer to types by using natural language labels such as datatype Date. These labels are subject to internationalization. There can also be aliases for some types to allow multiple spellings in one language. To make SMW code independent from the selected language, SMW uses internal type IDs for referring to datatypes. These are short strings that typically start with "_" to avoid confusion with page titles. For example, '_dat' is the type ID for Type:Date. Developers should always use the internal type IDs. The correspondence of user labels and internal IDs is defined in language files, e.g. in the file SMW_LanguageEn.php.
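
The mapping from language-dependent labels to language-independent type IDs amounts to a table lookup. The following Python sketch assumes a simple dictionary layout; only the ID '_dat' is taken from the text above, while the German label, the alias, and the function name are examples invented here and not the contents of SMW's language files.

```python
# Per-language tables mapping user-visible labels to internal type IDs.
LABELS = {
    "en": {"Date": "_dat"},
    "de": {"Datum": "_dat"},
}

# Additional spellings that resolve to the same ID (made-up alias).
ALIASES = {
    "en": {"Calendar date": "_dat"},
}


def find_type_id(label: str, language: str):
    """Return the internal type ID for a label, or None if it is unknown."""
    for table in (LABELS, ALIASES):
        type_id = table.get(language, {}).get(label)
        if type_id is not None:
            return type_id
    return None


print(find_type_id("Date", "en"), find_type_id("Datum", "de"))
```

Code that works with the returned ID stays unchanged when a wiki switches its content language.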

How do type IDs relate to the subclasses of SMWDataValue that implement type-specific behavior? The answer is that one such class may take care of one or more type IDs. For example, the handling of URLs and Email addresses has many things in common, so there is just one class SMWURIValue that handles both. The datavalue object is told its type ID on creation, so it can adjust its behavior to suit more than one type.
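
A class that serves more than one type ID can adjust its behavior based on the ID it receives on creation. The Python sketch below is a simplified model: UriValue is not SMWURIValue, and the two IDs "_uri" and "_ema" as well as the mailto handling are assumptions made for the example.

```python
class UriValue:
    """Hypothetical datavalue class that serves two type IDs."""

    SUPPORTED = ("_uri", "_ema")  # assumed IDs for URL and email types

    def __init__(self, type_id: str):
        if type_id not in self.SUPPORTED:
            raise ValueError(f"unsupported type ID: {type_id}")
        self.type_id = type_id
        self.uri = None

    def set_user_value(self, text: str) -> None:
        text = text.strip()
        if self.type_id == "_ema" and not text.startswith("mailto:"):
            # Email input is stored as a mailto URI.
            text = "mailto:" + text
        self.uri = text


email = UriValue("_ema")
email.set_user_value("someone@example.org")
link = UriValue("_uri")
link.set_user_value("http://example.org/")
print(email.uri, link.uri)
```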

The association of internal type IDs with the classes that should be used to represent their objects is done in the file SMW_DataValueFactory.php. The static class SMWDataValueFactory also introduces some hooks that can be used to extend and change these associations. So developers can add new types and even register their own implementations for existing types without patching SMW code. The factory class should therefore be used to create most data value objects in SMW, since otherwise the associations (that someone might have overwritten) would not be honored.

Note: Some datavalue classes provide special methods, e.g. for getting the year component of a date, and parts of SMW (extension) code that use such methods must first check if they are dealing with the right class (you cannot rely on the type ID to determine the class). This also means that developers who overwrite SMW implementations may want to subclass the existing classes to ensure that checks like ( $datavalue instanceof SMWTimeValue ) succeed (if not, a modified time class might not work with some time-specific features).
Note: Custom datatypes should always use type IDs that start with "___" (three underscores) to avoid (future) name clashes with SMW types.
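
The factory and the two notes above can be pictured together in one small model. This Python sketch is an assumption-laden illustration: DataValueFactory, register, and new_value are hypothetical names rather than the interface or hooks of SMWDataValueFactory, and "___era" is an invented custom type ID.

```python
class DataValue:
    def __init__(self, type_id: str):
        self.type_id = type_id


class TimeValue(DataValue):
    def __init__(self, type_id: str, year=None):
        super().__init__(type_id)
        self.year = year

    def get_year(self):
        return self.year


class DataValueFactory:
    _classes = {"_dat": TimeValue}  # type ID -> implementing class

    @classmethod
    def register(cls, type_id: str, value_class) -> None:
        """Add a new type or replace the class used for an existing one."""
        cls._classes[type_id] = value_class

    @classmethod
    def new_value(cls, type_id: str) -> DataValue:
        if type_id not in cls._classes:
            raise ValueError(f"unknown type ID: {type_id}")
        return cls._classes[type_id](type_id)


class FancyTimeValue(TimeValue):
    """Replacement implementation; subclassing keeps class checks working."""


DataValueFactory.register("_dat", FancyTimeValue)    # override an existing type
DataValueFactory.register("___era", FancyTimeValue)  # custom type, three underscores

value = DataValueFactory.new_value("_dat")
# Check the class, not the type ID, before calling class-specific methods.
if isinstance(value, TimeValue):
    print(value.get_year())
```

Code that constructed TimeValue directly would bypass the registered replacement, which is why going through the factory matters.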

There are some exceptions where the use of the datavalue factory is not needed/recommended. In particular, properties are always represented by the SMWPropertyValue class and it does not make sense to do additional type ID lookups for each of the cases when a property is needed. But even there you would not use the normal class constructor and rather pick a static creator method that takes care of the full initialization that the datavalue factory may otherwise do. See the documentation of SMWPropertyValue for details.

Finally, there are some datatypes that are internal to SMW. They use IDs starting with two underscores, and are not assigned to any user label. So they cannot be named in a wiki and are only available to developers. The purpose of these types is usually to achieve a special handling when storing data. For example, values of Property:Subproperty of could be represented by datatype Page but subproperty_of uses a special internal type that ensures that the data can be stored separately and in a form that simplifies its use in query answering. A datatype that is added by an extension becomes internal if it is not given any user label. In this case, users cannot create properties with this type, but everything else works normally (in particular, SMW will not know anything special about internal extension types and will just treat them like any other extension type).

Query answering

TODO


Storage backend

TODO


RDF export

TODO


Editing and update control flow

TODO

Jobs

TODO

I18N

TODO

Hooks

Semantic MediaWiki provides several hooks that can help other extensions to make use of SMW's internal data interface. SMW Hooks lists existing hooks with their input/output parameters.

This page is a DRAFT!
The content of this page is incomplete and might contain errors.
You may consult this page which contains more credible information: Architecture guide
This documentation page applies to all SMW versions from 1.7.1 to the most current version.
     
