Help:ElasticStore/File ingestion

Elasticsearch uses the "Apache Tika" libraryCiteRef::web:tika.apache.org for its file indexing support: Tika parses files, extracts their text, and returns it to Elasticsearch. The "Ingest Attachment Processor Plugin"CiteRef::web:www.elastic.co:ingest-attachment is required to make this functionality available via the Elasticsearch REST API.

Features and requirements
Enabling file ingestion as an experimental featureCiteRef::gh:smw:3054 in the ElasticStore requires installing the "Ingest Attachment Processor Plugin" on the Elasticsearch cluster, which makes unstructured content from files available to Elasticsearch and Semantic MediaWiki.
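The plugin contributes an `attachment` ingest processor that runs Tika during indexing. As a rough illustration of what a pipeline using that processor looks like at the REST level, the following sketch builds such a definition (the pipeline id `smw-file-attachment` and the `file_content` field name are illustrative, not necessarily what Semantic MediaWiki registers):

```python
import json

# Illustrative pipeline definition for the "attachment" processor provided by
# the Ingest Attachment Processor Plugin. It would be registered with a
# PUT request to /_ingest/pipeline/smw-file-attachment (the id is hypothetical).
pipeline = {
    "description": "Extract file content and metadata with Apache Tika",
    "processors": [
        {
            "attachment": {
                "field": "file_content",       # base64-encoded file data
                "target_field": "attachment",  # where Tika's output lands
                "indexed_chars": -1,           # do not truncate extracted text
            }
        }
    ],
}

body = json.dumps(pipeline)
```

The `field`, `target_field`, and `indexed_chars` options are part of the processor's documented settings; everything else here is a sketch.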

File content indexed and ingested by Elasticsearch and the ElasticStore is not made available within the wiki itself, i.e. it is not copied or otherwise stored in a SQL table. The content of ingested files is therefore only searchable via the Elasticsearch query engine.

Due to the size and memory requirements of Elasticsearch and Tika, file content ingestion happens exclusively in the background via the "smw.elasticFileIngest" job, which makes the actual ingestion request to Elasticsearch.
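Conceptually, such a job has to base64-encode the file's raw bytes before sending them, because the attachment processor expects its input field to hold base64 data. A minimal sketch of building that request body, assuming the hypothetical `file_content` field from above (endpoint and pipeline wiring omitted):

```python
import base64
import json

def build_ingest_request(raw_bytes: bytes) -> str:
    """Build a JSON body for indexing a file through an attachment
    pipeline; the attachment processor expects base64-encoded data."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({"file_content": encoded})

# The body would be sent to an index endpoint with ?pipeline=<pipeline-id>
# so the attachment processor runs during indexing.
body = build_ingest_request(b"%PDF-1.4 example bytes")
```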

More details about file ingestion and the respective indexing process can be found in the "replication.md" fileCiteRef::web:github.com:smw:Elasticsearch-replication.

File metadata
If the file content ingestion and extraction was successful, a file attachment annotation will appear on the specific file entity. Depending on the extraction quality achieved by Elasticsearch and Tika, the file attachment can contain metadata such as:
 * Content type
 * Content author
 * Content length
 * Content language
 * Content title
 * Content date
 * Content keyword
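On the Elasticsearch side, this metadata originates from the object the attachment processor writes into its target field. A sketch of pulling the metadata out of an indexed document's source, assuming the processor's default output field names (the surrounding document shape is illustrative):

```python
# Example of what a document source might contain after ingestion; the
# keys inside "attachment" follow the processor's documented output
# (content_type, author, language, title, date, content_length, content).
doc_source = {
    "attachment": {
        "content_type": "application/pdf",
        "author": "Jane Doe",
        "content_length": 20,
        "language": "en",
        "title": "Example",
        "date": "2020-01-01T00:00:00Z",
        "content": "Extracted plain text",
    }
}

def attachment_metadata(source: dict) -> dict:
    """Return only the metadata fields, dropping the extracted text body."""
    attachment = source.get("attachment", {})
    return {k: v for k, v in attachment.items() if k != "content"}

meta = attachment_metadata(doc_source)
```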