Help:Blocking robots from Semantic MediaWiki special pages

From semantic-mediawiki.org

It has been reported that special pages like Special:ExportRDF, Special:SearchByAttribute, or Special:Browse can tie up search engine crawlers for long periods of time and prevent them from indexing the actual wiki pages. In the worst case, a search engine may offer users an RDF page instead of the wiki page they searched for, which is likely to confuse and discourage them.

If this is a problem on a site, it may be appropriate to exclude robots from all RDF-generating special pages. Note that this is not required on most sites: popular sites like semanticweb.org are easily found in Google without any blocks on their RDF content. Also note that robots that specifically index RDF (e.g. Semantic Web search engines) will be locked out as well if all bot access to the RDF pages is forbidden. Moreover, the SMW registry will not work properly if Special:ExportRDF is not accessible.

Disallowing robots from SMW special pages

For this to work, you will need to enable short URLs on your wiki, with the script path different from the article URL path. You can then block any robots you wish to discourage from these (and other problematic MediaWiki pages) with the following rules in your robots.txt file:

Disallow: /w/
Disallow: /wiki/Special:Ask/
Disallow: /wiki/Special:Browse/
Disallow: /wiki/Special:SearchByProperty/
Disallow: /wiki/Special:ExportRDF/
Disallow: /wiki/Special:PageProperty/
Disallow: /wiki/Special:Properties/
Disallow: /wiki/Special:UnusedProperties/
Disallow: /wiki/Special:WantedProperties/
Disallow: /wiki/Special:SMWAdmin/
Disallow: /wiki/Special:Types/
Disallow: /wiki/Special:URIResolver/
Disallow: /wiki/Special:QueryCreator/

In this case, /w/ is the script directory and /wiki is the virtual path that is internally rewritten to /w/index.php. Note the trailing slash on /w/: robots.txt rules match by URL prefix, so Disallow: /w without it would also block every page under /wiki/.

This also prevents the disallowed bots from indexing script URLs such as edit pages, which is usually desirable.

To disallow all bots from these pages, your robots.txt would look like this:

User-agent: *
Disallow: /w/
Disallow: /wiki/Special:Ask/
Disallow: /wiki/Special:Browse/
Disallow: /wiki/Special:SearchByProperty/
Disallow: /wiki/Special:ExportRDF/
Disallow: /wiki/Special:PageProperty/
Disallow: /wiki/Special:Properties/
Disallow: /wiki/Special:UnusedProperties/
Disallow: /wiki/Special:WantedProperties/
Disallow: /wiki/Special:SMWAdmin/
Disallow: /wiki/Special:Types/
Disallow: /wiki/Special:URIResolver/
Disallow: /wiki/Special:QueryCreator/
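You can sanity-check such rules locally with Python's standard urllib.robotparser module before deploying them. The host name example.org below is a placeholder, and the rule set is an abbreviated version of the file above; note that the script-path rule carries a trailing slash, because robots.txt rules match by URL prefix and Disallow: /w alone would also block everything under /wiki/.

```python
# Sanity-check robots.txt rules with the standard library's parser.
# example.org is a placeholder host; the rules are an abbreviated
# version of the robots.txt shown above.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /w/
Disallow: /wiki/Special:Ask/
Disallow: /wiki/Special:Browse/
Disallow: /wiki/Special:ExportRDF/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Ordinary article pages remain crawlable, while the
# RDF-generating special pages and script URLs are blocked.
print(rp.can_fetch("*", "https://example.org/wiki/Main_Page"))                    # True
print(rp.can_fetch("*", "https://example.org/wiki/Special:ExportRDF/Main_Page"))  # False
```

The same prefix check applies to script URLs, so edit links like /w/index.php?action=edit are covered by the /w/ rule as well.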

Alternatively, you can run the generateSitemap.php script in the MediaWiki maintenance directory to generate a sitemap for your site, which encourages search engines to index the wiki pages rather than the special pages.
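As a sketch, the script could be invoked as below. The file-system path, URL path, and server name are placeholder assumptions, and the available options may differ between MediaWiki versions (run the script with --help to check):

```shell
# Run from the wiki installation directory; all paths and the
# server URL here are placeholders for this example.
php maintenance/generateSitemap.php \
    --fspath=/var/www/wiki/sitemap \
    --urlpath=/sitemap \
    --server=https://example.org
```

You can then advertise the generated sitemap to crawlers with a Sitemap: line in your robots.txt.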