Using Elasticsearch store / file ingestion
Elasticsearch uses the "Apache Tika" plugin [1] for its file indexing support: Tika parses files, extracts the text, and returns it to Elasticsearch. The "Ingest Attachment Processor Plugin" [2] is required to make this functionality available via the Elasticsearch REST API.
Features and requirements
File ingestion is an experimental feature [3] of the Elasticsearch store (SMWElasticStore). To enable it, install the "Ingest Attachment Processor Plugin" on your Elasticsearch cluster; the plugin makes unstructured content from files available to Elasticsearch and Semantic MediaWiki.
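At the REST level, the attachment processor is used by defining an ingest pipeline and then indexing a document whose file content is base64-encoded. The following is a minimal sketch of the two request bodies only; the source field name `data` is an illustrative assumption, and the names Semantic MediaWiki uses internally may differ.

```python
import base64
import json


def build_pipeline_body(source_field: str = "data") -> dict:
    """Body for `PUT _ingest/pipeline/<name>` with an attachment processor.

    `source_field` is an assumed name for the field carrying the file bytes.
    """
    return {
        "description": "Extract text and metadata from attachments",
        "processors": [{"attachment": {"field": source_field}}],
    }


def build_document_body(raw: bytes, source_field: str = "data") -> dict:
    """Body for `PUT <index>/_doc/<id>?pipeline=<name>`.

    The attachment processor expects the file content base64-encoded.
    """
    return {source_field: base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    print(json.dumps(build_pipeline_body(), indent=2))
    print(json.dumps(build_document_body(b"Hello file ingestion"), indent=2))
```

The sketch sends nothing over the network; it only shows the shape of the payloads an HTTP client would submit to the cluster.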
File content indexed and ingested by Elasticsearch and the Elasticsearch store (SMWElasticStore) is not made available within the wiki itself, i.e. it is not copied to or otherwise stored in an SQL table. The content of ingested files is therefore searchable only via the Elasticsearch Query Engine.
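At the Elasticsearch level, searching extracted file text amounts to a full-text query against the field that holds it. The sketch below builds such a query body; `attachment.content` is the attachment processor's default target field and is used here as an assumption, since the field Semantic MediaWiki actually writes to is not stated on this page.

```python
import json


def build_content_query(text: str, field: str = "attachment.content") -> dict:
    """Body for `GET <index>/_search` matching `text` in the extracted content.

    `field` defaults to the attachment processor's default output field;
    treat it as a placeholder for whatever field the store really uses.
    """
    return {"query": {"match": {field: text}}}


if __name__ == "__main__":
    print(json.dumps(build_content_query("annual report"), indent=2))
```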
Because of the size and memory requirements of Elasticsearch and Tika, file content ingestion happens exclusively in the background through the "smw.elasticFileIngest" job, which makes the actual file ingestion request to Elasticsearch.
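The general pattern of deferring an expensive ingestion request to a background job can be sketched as follows. This is an illustration of the pattern only, not the implementation of the "smw.elasticFileIngest" job; `ingest_file`, `run_worker`, and `process_in_background` are hypothetical names, and the real job runs in the MediaWiki job queue rather than in a thread.

```python
import queue
import threading


def ingest_file(title: str) -> str:
    """Hypothetical stand-in for the ingestion request sent to Elasticsearch."""
    return f"ingested:{title}"


def run_worker(jobs: queue.Queue, results: list) -> None:
    """Take queued file titles one by one until the None sentinel arrives."""
    while True:
        title = jobs.get()
        if title is None:
            jobs.task_done()
            break
        results.append(ingest_file(title))
        jobs.task_done()


def process_in_background(titles: list) -> list:
    """Queue each title without blocking on ingestion, then wait for the worker."""
    jobs: queue.Queue = queue.Queue()
    results: list = []
    worker = threading.Thread(target=run_worker, args=(jobs, results), daemon=True)
    worker.start()
    for title in titles:
        jobs.put(title)  # enqueueing does not wait for the ingestion itself
    jobs.put(None)  # sentinel: no more jobs
    jobs.join()  # block until every queued item has been processed
    return results


if __name__ == "__main__":
    print(process_in_background(["File:Report.pdf", "File:Slides.ppt"]))
```

The point of the design is that the action triggering ingestion (for example a file upload) only pays the cost of enqueueing; the memory-heavy extraction runs later and separately.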
More details about file ingestion and the respective indexing process can be found in the "replication.md" file [4].
If the file content ingestion and extraction was successful, a file attachment annotation will appear on the specific file entity. Depending on the extraction quality of Elasticsearch and Tika, the file attachment can contain metadata such as:
- Content type
- Content author
- Content length
- Content language
- Content title
- Content date
- Content keyword
See also
- Help page on media files and metadata
- Semantic MediaWiki: GitHub issue #4488 – Further information on file ingestion
- Semantic MediaWiki: GitHub issue #4503 – Elasticsearch, file ingestion, and the display of text excerpts
- Semantic MediaWiki: GitHub issue #4528 – Elasticsearch, file ingestion, and the "Content keyword" property
References
- [1] Apache Tika, a content analysis toolkit: Detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
- [2] Ingest Attachment Processor Plugin: Lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.
- [3] Semantic MediaWiki: GitHub pull request #3054
- [4] Elasticsearch replication: This document on index creation describes how indexing works in Elasticsearch for Semantic MediaWiki.