Help:Property similarity

From semantic-mediawiki.org

Semantic MediaWiki's schema-free approach lets users create and define properties freely. With that freedom, conceptually identical or near-duplicate properties can arise and be used for value annotations without being detected by an agent engaged in a data curation [1] task.

Syntactic similarity is understood as a function that "analyzes the syntactic similarity of a pair of tags" using the "Levenshtein Distance, the Cosine Similarity, the Jaccard Similarity, the Jaro Distance" [2]:100, while semantic similarity analyzes the "semantic relations defined between tags as well as their frequency" [2]:101.
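As an illustration of two of the measures named above, the following is a minimal Python sketch of the Levenshtein distance and a word-based Jaccard similarity applied to property labels. It is not Semantic MediaWiki's own implementation; the function names and the labels ("Has author", "Has authors", "Has population") are hypothetical, and tokenizing labels on whitespace is an assumption made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets of two labels
    (whitespace tokenization is an assumption of this sketch)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical property labels, used only for illustration.
pairs = [("Has author", "Has authors"), ("Has author", "Has population")]
for left, right in pairs:
    print(left, "|", right, "|", levenshtein(left, right), "|", jaccard(left, right))
```

A small edit distance, as between "Has author" and "Has authors", suggests a near-duplicate label, although which threshold counts as "similar" depends on the wiki and is left open here.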

Several methods can help mitigate label similarity issues, such as:

  • Use of templates to formalize user input
  • Use of #REDIRECT to build a pool of synonyms around a canonical property and allow them to be merged [3] into a coherent extension of a property's semantics.
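The redirect approach can be sketched as follows. This minimal Python example only generates the wikitext that each synonym property page would contain; it does not edit a wiki. The helper names and the property labels are hypothetical and chosen purely for illustration.

```python
def redirect_wikitext(canonical: str) -> str:
    """Wikitext that redirects a page to the canonical property page."""
    return f"#REDIRECT [[Property:{canonical}]]"


def redirect_pages(synonyms: dict[str, list[str]]) -> dict[str, str]:
    """Map each synonym property page title to its redirect wikitext."""
    pages = {}
    for canonical, labels in synonyms.items():
        for label in labels:
            pages[f"Property:{label}"] = redirect_wikitext(canonical)
    return pages


# Hypothetical pool of synonyms around one canonical property.
pool = {"Has author": ["Has authors", "Author"]}
for title, content in redirect_pages(pool).items():
    print(f"{title}: {content}")
```

Keeping the synonym pool in one mapping makes it easy to review which labels point at which canonical property before any page is actually changed.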

Notes

  •  Mikhail Bilenko, Raymond J Mooney. "Adaptive duplicate detection using learnable string similarity measures". ACM (2003): 39--48.
  •  Kenji Sagae, Andrew S Gordon. "Clustering words by syntactic similarity improves dependency parsing of predicate-argument structures". ACM (2009): 192--201.
  •  Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka. "Measuring semantic similarity between words using web search engines". 7 (2007): 757--766.

References

  1. ^  "...term used to indicate processes and activities related to the organization and integration of data collected from various sources, annotation of the data, and publication and presentation of the data..." from https://en.wikipedia.org/wiki/Data_curation
  2. a b  Nik Bessis, Fatos Xhafa (eds.); Richard Mordinyi, Eva Kühn (auth.). "Next Generation Data Technologies for Collective Computational Intelligence". Springer-Verlag Berlin Heidelberg (2011).
  3. ^  Iulia Dănăilă, Liviu P Dinu, Vlad Niculae, Octavia-Maria Sulea. "String Distances for Near-duplicate Detection". Instituto Politécnico Nacional, Centro de Innovación y Desarrollo Tecnológico en Cómputo (2012): 21--25.