SemGen - Towards a Semantic Data Generator for Benchmarking Duplicate Detectors
Proceedings of the 4th International Workshop on Data Quality in Integration Systems in conjunction with DASFAA 2011
Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in a test data set,
as well as test data of sufficient size and variability. Extending real-world data sets with artificially created data is promising,
but current approaches to such synthetic data generation work solely on a quantitative level. As a result, duplicate semantics are
represented only implicitly, and the variability of the generated data cannot be configured sufficiently.
In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects
on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we show how
duplicate semantics can be defined for the domain of road traffic management. A discussion of lessons learned concludes the paper.
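
The two-step idea described in the abstract (a qualitative diversification followed by quantitative value generation) might be sketched as follows. This is only an illustration under stated assumptions, not SemGen's actual implementation: the `TrafficReport` type, the variation names, and the kilometre ranges are all hypothetical and do not come from the paper.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrafficReport:
    """Hypothetical real-world object from road traffic management."""
    kind: str   # e.g. "accident"
    road: str   # e.g. "A1"
    km: float   # position along the road in kilometres


# Hypothetical qualitative variations: each name describes HOW a duplicate
# differs from the original, mapped to an assumed offset range in kilometres.
QUALITATIVE_VARIATIONS = {
    "identical_position": (0.0, 0.0),
    "nearby": (0.1, 1.0),
    "same_road_section": (1.0, 5.0),
}


def diversify(rng: random.Random) -> str:
    """Step 1 (qualitative): choose a semantic relation for the duplicate."""
    return rng.choice(sorted(QUALITATIVE_VARIATIONS))


def quantify(report: TrafficReport, variation: str, rng: random.Random) -> TrafficReport:
    """Step 2 (quantitative): turn the chosen relation into concrete values."""
    low, high = QUALITATIVE_VARIATIONS[variation]
    offset = rng.uniform(low, high)
    return replace(report, km=round(report.km + offset, 3))


def generate_duplicate(report: TrafficReport, seed: int = 0):
    """Return the chosen qualitative variation and the generated duplicate."""
    rng = random.Random(seed)
    variation = diversify(rng)
    return variation, quantify(report, variation, rng)


original = TrafficReport(kind="accident", road="A1", km=42.0)
variation, duplicate = generate_duplicate(original, seed=7)
print(variation, duplicate)
```

Because the qualitative relation is chosen explicitly before any numbers are drawn, it can be stored alongside each generated pair, which makes the duplicate semantics explicit rather than implicit in the values.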