Archive:Using SPARQL and RDF stores 1.7.1 - 1.9.2

From semantic-mediawiki.org

By default, SMW stores all data in the same relational database (usually, a MySQL database) that is used by MediaWiki. This ensures a simple setup, but a relational database is not an ideal type of storage for semantic data. A more natural data model for SMW data is RDF, a data format that organizes information in graphs rather than in fixed database tables. Fortunately, it is possible to use RDF-based systems, in conjunction with the standard SQL database, to manage and query SMW's data. This page explains the details.

Pros and cons of using an RDF database

Whether or not to use an RDF store in a specific wiki depends on a number of factors, including the specific RDF database being used. Nonetheless, we can reasonably hope for the following advantages:

  • Better query performance
    RDF stores are designed to answer queries in the SPARQL query language. SMW queries can be expressed in this language much more naturally than in the SQL query language of relational databases. In this sense, SMW queries are a mostly typical use case for RDF database systems while they are a rather exotic use case for relational database systems. Moreover, many important optimization methods for relational database queries are useless or misleading for SMW queries. It can therefore be expected that RDF stores should provide superior query performance.
  • Additional interfaces
    RDF stores that support the SPARQL standard also allow other applications to run SPARQL queries against their data without going through the SMW web frontend. This allows efficient use of wiki data in other applications. Some SPARQL-capable databases further support (parts of) the OWL 2 ontology language and provide corresponding interfaces to query the stored data (e.g. via the OWL Link protocol). Semantic Web applications also use a number of common programming libraries (such as librdf or the OWL API) that can be useful for integrating them with other tools on a lower level.
  • Reasoning features and ontology-based data access
    Semantic Web languages such as RDF Schema and OWL provide additional expressive features for modeling, for example by allowing the declaration of derived classes or the declaration of further property characteristics (e.g. transitivity of properties). Some SPARQL-capable databases can evaluate these features for query answering, e.g. for ontology-based data access (OBDA), the method of creating "virtual views" on data by means of semantic modeling constructs.
  • Data integration and ontology re-use
    It is possible to store additional data in the RDF database that is updated by SMW. In this way, the RDF store can act as a platform for data integration and ontology re-use.
  • Physical separation of computing resources
    Using a database backend that is not the same as in MediaWiki provides an easy way to distribute tasks across multiple servers. In particular, complex queries can thus be prevented from affecting the basic operation of the wiki, even if they unexpectedly consume a prohibitive amount of computing power and, in the worst case, bring down the server that hosts the RDF database.

Nevertheless, there are a number of possible disadvantages as well:

  • Higher storage requirements
    The data is only mirrored in RDF databases, not removed from SQL. Hence additional storage space is required.
  • Additional maintenance effort
    The setup of RDF backends in SMW is easy, but there is still some effort in running an additional database-management system.
  • Questions regarding performance and stability
    There are a number of industry-strength RDF databases available today, some of them free/open source. Yet, the experience of using these systems with SMW is still limited, so some testing is helpful before deciding on a particular backend for a large-scale SMW application.

Luckily, it is possible to switch back and forth between SQL-based and RDF-based storage backends without major effort, so that the decision can be revisited after trying it for a while.

Deciding on an RDF database

In principle, SMW supports any database that supports the SPARQL query language and SPARQL Update (formerly known as SPARUL or SPARQL/Update), as introduced in SPARQL 1.1. In Semantic MediaWiki 1.7.0, stores are required to accept updates and queries that do not specify a graph, but it is planned to remove this limitation in the future. Two places where lists of RDF stores are maintained are:

Note: RDF stores are sometimes called "triple stores" even though many modern stores are actually "quad stores" that also assign a named graph to each RDF triple.

As of 2012, two particularly notable free and open source RDF databases are 4Store and Virtuoso. Both have been used with SMW successfully, though Virtuoso currently still needs minor changes in SMW due to the aforementioned restriction that no named graphs are used in SMW queries.

Configuring SMW

SPARQL requests, whether queries or updates, are exchanged through web services. This means that requests are sent to, and data is received from, URLs that specify the location of the corresponding service. This location is determined by the RDF database and by its configuration. For example, a typical default location for the SPARQL query web service ("endpoint") of 4Store on a local machine might be http://localhost:8080/sparql/.
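
To make the idea of a query endpoint more concrete, the following minimal sketch shows how an external PHP script might build the URL for a SPARQL query sent via HTTP GET. The helper name buildSparqlQueryUrl and the example query are assumptions made for illustration only; the endpoint URL is the 4Store example location mentioned above, and the "query" parameter name follows the SPARQL protocol.

```php
<?php
// Hypothetical helper (not part of SMW): builds the request URL for a
// SPARQL query that is sent to a query endpoint via HTTP GET.
function buildSparqlQueryUrl( $endpoint, $sparql ) {
	// Use "&" if the endpoint URL already carries parameters, "?" otherwise.
	$separator = ( strpos( $endpoint, '?' ) === false ) ? '?' : '&';
	// The SPARQL protocol passes the query text in the "query" parameter;
	// http_build_query() takes care of URL-encoding it.
	return $endpoint . $separator . http_build_query( array( 'query' => $sparql ) );
}

// Example endpoint location; adapt it to the local store configuration.
$url = buildSparqlQueryUrl(
	'http://localhost:8080/sparql/',
	'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
);
echo $url, "\n";
```

Sending the resulting URL with any HTTP client should return the query results from the store, provided that the endpoint is reachable and accepts GET requests.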

To configure SMW to use a SPARQL database, you need to know the location of the SPARQL query service and the location of the SPARQL update service. Optionally, you can also make use of a service that supports the SPARQL over HTTP protocol for updates (if used, SMW will prefer this method over SPARQL Update for simple update requests; it can be omitted if problems occur). The locations of these web services then must be given in LocalSettings.php, as in the following example:

$smwgDefaultStore = 'SMWSparqlStore';
$smwgSparqlQueryEndpoint = 'http://localhost:8080/sparql/';          # location of query service
$smwgSparqlUpdateEndpoint = 'http://localhost:8080/update/';         # location of update service
$smwgSparqlDataEndpoint = 'http://localhost:8080/data/';             # location of SPARQL over HTTP service
                                                                      # set it to ''; in case of problems
$smwgSparqlDefaultGraph = 'http://example.org/mydefaultgraphname';   # optional name of default graph

The first line tells SMW to use the SPARQL store implementation to store data (instead of the SMWSQLStore2 that is the default). The remaining lines provide the relevant service locations; the last two settings (data endpoint and default graph) are optional. In many cases, $smwgSparqlDataEndpoint is best left empty, since most stores have their own protocol for posting.
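
A minimal variant that follows this advice might look as follows. This is only a sketch: it sets the data endpoint to the empty string explicitly, as the comment in the example above suggests for cases of trouble, and it leaves out the optional default graph. The URLs are the placeholder values from the example and need to be adapted to the local installation.

```php
$smwgDefaultStore = 'SMWSparqlStore';
$smwgSparqlQueryEndpoint = 'http://localhost:8080/sparql/';          # location of query service (placeholder)
$smwgSparqlUpdateEndpoint = 'http://localhost:8080/update/';         # location of update service (placeholder)
$smwgSparqlDataEndpoint = '';                                        # do not use the SPARQL over HTTP service
```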

By default, SMW will use a generic SPARQL connector that is based on recent SPARQL documents. Some RDF databases might not be fully compatible with this or might need special tweaks to make use of advanced, non-standard features. Special settings therefore should be used with some stores.

4Store

Users of 4Store should use the following settings:

$smwgDefaultStore = 'SMWSparqlStore';
$smwgSparqlDatabase = 'SMWSparqlDatabase4Store';                     # using 4Store as connector
$smwgSparqlQueryEndpoint = 'http://localhost:8080/sparql/';          # location of query service
$smwgSparqlUpdateEndpoint = 'http://localhost:8080/update/';         # location of update service
$smwgSparqlDataEndpoint = '';                                        # location of SPARQL over HTTP service
                                                                      # optional value; leave as is in case of problems
$smwgSparqlDefaultGraph = 'http://example.org/mydefaultgraphname';   # optional name of default graph

In addition to the regular settings, the 4Store connector must be selected explicitly with: $smwgSparqlDatabase = 'SMWSparqlDatabase4Store';

The 4Store connector is available since Semantic MediaWiki 1.6.0. Using it ensures that 4Store's soft limit feature is used to restrict the resources needed for query answering. Moreover, the protocol used for the data endpoint (if configured) is specific to 4Store. It is recommended to use this feature, since it is more efficient for writing.

Virtuoso

Virtuoso is supported since Semantic MediaWiki 1.7.1. Users of Virtuoso should use the following settings:

$smwgDefaultStore = 'SMWSparqlStore';
$smwgSparqlDatabase = 'SMWSparqlDatabaseVirtuoso';                   # using Virtuoso as connector
$smwgSparqlQueryEndpoint = 'http://localhost:8890/sparql/';          # location of query service
$smwgSparqlUpdateEndpoint = 'http://localhost:8890/sparql/';         # location of update service
$smwgSparqlDataEndpoint = '';                                        # location of SPARQL over HTTP service
                                                                      # optional value; leave as is in case of problems
$smwgSparqlDefaultGraph = 'http://example.org/mydefaultgraphname';   # name of default graph

In addition to the regular settings, the Virtuoso connector must be selected explicitly with: $smwgSparqlDatabase = 'SMWSparqlDatabaseVirtuoso';

The exact URLs depend on the local configuration. The URI of the default graph can be chosen arbitrarily but must be set. There are some known limitations with (at least some versions of) Virtuoso:

  • Numerical datatypes are not supported properly, and Virtuoso will miss query results when query conditions require number values. This also affects properties of datatype Date, since they use numerical values for querying.
  • Some edit (insert) queries fail for unknown reasons, probably related to unusual/complex input data (e.g., using special characters in strings); errors will occur when trying to store such values on a page.
  • Virtuoso stumbles over XSD dates with negative years, even if they have only four digits as per ISO. Trying to store such data will cause errors.
  • More information on combining SMW with Open Virtuoso can be found at [1].

Other RDF stores

Stores that conform to the official SPARQL and SPARQL Update standards should mostly work out of the box. For example, Sesame has been reported to work with the default connector. In many cases, the setting $smwgSparqlDataEndpoint = ''; should be used, since support for the (optional) data endpoint protocol is not well tested.

To take advantage of special features of your store, a modified connector can be created by subclassing SMWSparqlDatabase and setting $smwgSparqlDatabase to the name of the new class. Examples can be found in the files for 4Store (SMW_SparqlDatabase4Store.php) and Virtuoso (SMW_SparqlDatabaseVirtuoso.php), which are located in the directory "[wikipath]/extensions/SemanticMediaWiki/includes/sparql/". Suggestions and files can be placed in the bugtracker (see link "Report a bug" in the sidebar on the left) or sent via email.

Jena Fuseki

The following steps may be taken to integrate SMW with Jena Fuseki 1.0.2:

1. Include the following lines in LocalSettings.php in the wiki folder:

$smwgDefaultStore = 'SMWSparqlStore';
$smwgSparqlQueryEndpoint = 'http://localhost:3030/db/query';  
$smwgSparqlUpdateEndpoint = 'http://localhost:3030/db/update'; 
$smwgSparqlDataEndpoint = '';

2. Start up Fuseki with this command line:

fuseki-server.bat --update --loc=<database location> /db

3. Refresh all existing data in SMW. See the help page on repairing SMW's data.

Note: At this point, Fuseki should communicate correctly with SMW, but inline queries in SMW will not work, because Fuseki returns a JSON response to queries by default, whereas SMW expects an XML response. To request an XML response for inline queries, open the file "SMW_SparqlDatabase.php" in the directory <wiki folder>\extensions\SemanticMediaWiki\includes\sparql\ and append the string "&output=xml" to $parameterString in the doQuery function.
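
The change described in the note can be pictured with the following standalone sketch. It is not the actual SMW source code: the function buildQueryParameters is a hypothetical stand-in for the relevant part of doQuery, and only the variable name $parameterString and the "&output=xml" suffix are taken from the note above.

```php
<?php
// Hypothetical stand-in for the part of doQuery() that assembles the
// request parameters; the real code in SMW_SparqlDatabase.php may differ.
function buildQueryParameters( $sparql ) {
	// The query text is passed URL-encoded in the "query" parameter.
	$parameterString = 'query=' . urlencode( $sparql );
	// The addition described in the note: ask Fuseki for XML results.
	$parameterString .= '&output=xml';
	return $parameterString;
}

echo buildQueryParameters( 'ASK { ?s ?p ?o }' ), "\n";
```

Note that a change of this kind modifies a file of the extension itself and would therefore have to be reapplied after each update of SMW.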

Moving data to the new database

After the configuration has been changed, the RDF database does not yet contain any data. To fill it with the current content of the wiki, it is necessary to refresh all data. See Repairing data and data structures for details. Any method that refreshes the data will work. All SMW queries (inline or semantic search) will be executed against the RDF database, so their results will only be correct once all data has been refreshed.

Known limitations

There are still a few features that are not supported when using query answering via an RDF database:

  • Concept queries: There is no support for concepts in RDF stores yet.
  • Category and property hierarchies: Hierarchies are only taken into account if the RDF database has built-in support for the rdfs:subClassOf and rdfs:subPropertyOf features of RDF Schema. Otherwise hierarchy information will not lead to additional query results.


This documentation page applies to all SMW versions from 1.7.1 to 1.9.2.