Using Elasticsearch store / file ingestion

[Screenshot: replication monitoring indicating a failed (or not yet executed) ingest attempt for a file]
[Screenshot: information retrieved from the Elasticsearch (Tika) document ingest processor]

Elasticsearch uses "Apache Tika"[1] for its file indexing support: Tika parses a file, extracts the text, and returns it to Elasticsearch. The "Ingest Attachment Processor Plugin"[2] is required to make this functionality available via the REST API.
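
To make the discussion concrete, here is a minimal sketch (Python with the "requests" library) of how an ingest pipeline using the attachment processor can be defined through the REST API. The host, port, and the pipeline name "attachment" are assumptions for illustration, not values mandated by Semantic MediaWiki.

  import requests

  ES = "http://localhost:9200"  # assumed local cluster

  pipeline = {
      "description": "Extract file content and metadata with Tika",
      "processors": [
          {"attachment": {"field": "data"}},  # "data" holds the base64-encoded file
          {"remove": {"field": "data"}},      # drop the raw binary after extraction
      ],
  }

  resp = requests.put(f"{ES}/_ingest/pipeline/attachment", json=pipeline)
  resp.raise_for_status()
  print(resp.json())  # {'acknowledged': True} on success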

Features and requirements

Enabling file ingestion, an experimental feature[3] of the Elasticsearch store (SMWElasticStore), requires installing the "Ingest Attachment Processor Plugin" on the Elasticsearch cluster so that unstructured content from files can be made available to Elasticsearch and Semantic MediaWiki.
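
The plugin is typically installed with Elasticsearch's own plugin tool (bin/elasticsearch-plugin install ingest-attachment) on every node. A hedged sketch for verifying that the plugin is actually present on all nodes, using the cluster's nodes info API (host and port again assumed):

  import requests

  ES = "http://localhost:9200"

  nodes = requests.get(f"{ES}/_nodes/plugins").json()["nodes"]
  for node_id, info in nodes.items():
      names = [p["name"] for p in info["plugins"]]
      status = "ok" if "ingest-attachment" in names else "MISSING"
      print(f'{info["name"]}: ingest-attachment {status}')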

File content indexed and ingested by Elasticsearch and the Elasticsearch store (SMWElasticStore) is not made available within the wiki itself, i.e. it is not copied or otherwise stored in an SQL table. The content of ingested files is therefore only searchable via the Elasticsearch query engine.
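
For illustration, ingested content can be inspected with a direct full-text query against Elasticsearch. The index name "smw-data" and the "attachment.content" field follow the attachment processor's default output; both are assumptions here, since the actual index naming is determined by the Elasticsearch store setup.

  import requests

  ES = "http://localhost:9200"

  query = {"query": {"match": {"attachment.content": "invoice"}}}  # hypothetical search term
  hits = requests.get(f"{ES}/smw-data/_search", json=query).json()["hits"]["hits"]
  for hit in hits:
      print(hit["_id"], hit["_score"])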

Due to the size and memory consumption requirements of Elasticsearch and Tika, file content ingestion happens exclusively in the background via the "smw.elasticFileIngest" job, which makes the actual ingest request to Elasticsearch.
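
A rough illustration of the kind of request such a job has to make: the file is base64-encoded and indexed through the attachment pipeline defined earlier. The file path, index name, document id, and pipeline name are all hypothetical; the exact payload the job sends is described in the replication documentation, not here.

  import base64
  import requests

  ES = "http://localhost:9200"

  with open("Example.pdf", "rb") as f:  # hypothetical file
      data = base64.b64encode(f.read()).decode("ascii")

  resp = requests.put(
      f"{ES}/smw-data/_doc/42",           # assumed index and id
      params={"pipeline": "attachment"},  # pipeline defined above
      json={"data": data},
  )
  resp.raise_for_status()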

More details about file ingestion and the respective indexing process can be found in the "replication.md" file[4].

File metadata

If the file content ingestion and extraction were successful, a file attachment annotation appears on the file entity. Depending on the extraction quality achieved by Elasticsearch and Tika, the file attachment can contain metadata such as (see the sketch after the list):

  • Content type
  • Content author
  • Content length
  • Content language
  • Content title
  • Content date
  • Content keyword
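
As a sketch of where these annotations come from, the snippet below reads back a previously ingested document and prints the metadata fields the attachment processor emits, which map onto the list above. The index name and document id are assumptions, and each lookup is guarded because Tika extracts only what the file actually provides.

  import requests

  ES = "http://localhost:9200"

  doc = requests.get(f"{ES}/smw-data/_doc/42").json()["_source"]
  attachment = doc.get("attachment", {})
  for field in ("content_type", "author", "content_length",
                "language", "title", "date", "keywords"):
      print(f"{field}: {attachment.get(field, '(not extracted)')}")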

See also