NARESUAN UNIVERSITY JOURNAL OF SCIENCE AND TECHNOLOGY


Volume 28, No. 4, October 2020, Pages 36-49


Dataspace management with ETL and RDF support

Marko Niinimaki, Tapio Niemi, Peter Thanisch


Abstract

Dataspaces have become popular in data modeling and business intelligence. In this paper, we introduce a dataspace management system with Extract-Transform-Load (ETL) capabilities and RDF (Resource Description Framework) data export. Moreover, we demonstrate how distributed processing based on the MapReduce framework can be used to process the exported RDF data. Specifically, our system helps the user (i) discover potential problems with data integration and then (ii) carry out the actual integration. In the first case, the software generates a file of comma-separated values that users can load into their statistics software. In the second case, the user can transform the files into RDF format and analyze the data using Python tools, or export the final data set to a visualization package or business intelligence software. Our method can therefore be seen as constructive: it builds a tool for data professionals. We demonstrate the viability of both the dataspace management system and RDF processing using a Hadoop cluster, where distributing the work with Hadoop improved the processing speed by 85 to 86%.
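As a rough illustration of the workflow the abstract describes, the sketch below turns comma-separated values into RDF triples (serialized as N-Triples lines) and then counts predicates with a map step and a reduce step. It is not the authors' system and does not use Hadoop: the namespace `http://example.org/`, the sample data, and the function names `csv_to_ntriples`, `map_predicate`, and `reduce_counts` are all assumptions made for this example, and the map/reduce pattern is only simulated in a single Python process.

```python
import csv
import io
from collections import Counter
from itertools import chain

BASE = "http://example.org/"  # hypothetical namespace, not taken from the paper


def csv_to_ntriples(csv_text, id_column):
    """Turn each non-empty CSV cell into one line: <row> <column> "value" ."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}row/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # skip the key column and empty cells
            # Simplified escaping: only backslashes and double quotes.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{BASE}{column}> "{escaped}" .')
    return triples


def map_predicate(triple):
    """Map step: emit (predicate, 1) for one triple line."""
    predicate = triple.split(" ", 2)[1]
    return [(predicate, 1)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per predicate."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Made-up sample data; the second row has an empty "city" cell.
SAMPLE = "id,name,city\n1,Ann,Oslo\n2,Bob,\n"
TRIPLES = csv_to_ntriples(SAMPLE, "id")
COUNTS = reduce_counts(chain.from_iterable(map_predicate(t) for t in TRIPLES))
```

A per-predicate count like `COUNTS` is one simple way to spot columns that are sparsely populated before integrating sources. In a real cluster the map and reduce functions would run as separate distributed tasks rather than in one process.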


Keywords

dataspace, MapReduce, RDF, cloud, Hadoop



Published by: Naresuan University
Contributions welcome at: http://www.journal.nu.ac.th/NUJST