Semantic Data Type Detection via Machine Learning using Semi-synthetic Data and Robust Features

Marc Chevallier; Faouzi Boufarès; Nicoleta Rogovschi; Nistor Grozavu

Keywords:

Type detection, Tabular data, Semantic types, Data Profiling, Knowledge discovery, Machine learning

Abstract

Automatically identifying the semantic type of tabular data is useful in many areas of the data landscape, and is especially important for data integration, data science, and data cleaning. Traditional semantic type detection tools based on dictionaries and regular expressions have recently been challenged by machine learning methods. These newer methods are very effective but require a large amount of training data, which restricts them to semantic types and languages for which such data is available. To overcome this drawback, we introduce a data generation method that produces training data from a minimal amount of real data. In addition, we propose several new feature extraction methods that are less dependent on column length, language independent (for a given alphabet), and robust to errors in the data. Experiments conducted on synthetic and real data indicate an accuracy higher than 0.9, which is equivalent to that of classical methods.

Published

2023-06-01

How to Cite

Marc Chevallier, Faouzi Boufarès, Nicoleta Rogovschi, & Nistor Grozavu. (2023). Semantic Data Type Detection via Machine Learning using Semi-synthetic Data and Robust Features. International Journal of Computer Information Systems and Industrial Management Applications, 15, 11. Retrieved from https://cspub-ijcisim.org/index.php/ijcisim/article/view/570

Issue

Vol. 15 (2023)

Section

Original Articles

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.