Dataspaces have become popular in data modeling and business intelligence. In this paper, we introduce a dataspace management system with Extract-Transform-Load (ETL) capabilities and RDF (Resource Description Framework) data export. Moreover, we demonstrate how distributed processing based on the MapReduce framework can be used to process the exported RDF data.
Specifically, our system helps the user (i) discover potential problems with data integration and then (ii) carry out the actual integration. In the first case, the software generates a file of comma-separated values that users can load into their statistics software. In the second case, the user can transform the files into RDF format and analyze the data using Python tools, or export the final data set to a visualization package or business intelligence software. Our method is therefore constructive: its contribution is a tool for data professionals.
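To make the CSV-to-RDF step concrete, the following is a minimal sketch in Python, not the system's actual exporter. It assumes a hypothetical namespace `http://example.org/`, a hypothetical identifier column, and N-Triples as the output serialization; every value is emitted as a plain string literal.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace, chosen only for illustration.
BASE = "http://example.org/"


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text: str, id_column: str) -> list[str]:
    """Turn each non-empty CSV cell into one triple about its row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The subject IRI is built from the (assumed) identifier column.
        subject = f"<{BASE}record/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the identifier itself and empty cells
            predicate = f"<{BASE}field/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


if __name__ == "__main__":
    sample = "id,name,city\n1,Alice,Oslo\n2,Bob,\n"
    for triple in csv_to_ntriples(sample, "id"):
        print(triple)
```

Mapping one cell to one triple keeps the output line-oriented, which suits line-based distributed processing later on.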
We demonstrate the viability of both the dataspace management system and RDF processing using a Hadoop cluster. In this setting, distributed processing with Hadoop improved processing speed by 85 to 86%.
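As an illustration of how MapReduce applies to RDF data, the sketch below counts how often each predicate occurs in a set of N-Triples lines. It is an in-process simulation of the map, shuffle, and reduce phases under assumed function names (`map_predicate`, `reduce_counts`, `run_job`); it is not the job used in the experiments and does not call any Hadoop API.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_predicate(line: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit (predicate, 1) for each well-formed triple line."""
    line = line.strip()
    if not line or line.startswith("#"):
        return  # ignore blank lines and comments
    parts = line.split(None, 2)  # subject, predicate, rest of the line
    if len(parts) == 3:
        yield parts[1], 1


def reduce_counts(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Reduce phase: sum all counts emitted for one predicate."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> dict[str, int]:
    """Simulate the shuffle by grouping mapper output by key, then reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_predicate(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())


if __name__ == "__main__":
    triples = [
        '<http://example.org/record/1> <http://example.org/field/name> "Alice" .',
        '<http://example.org/record/1> <http://example.org/field/city> "Oslo" .',
        '<http://example.org/record/2> <http://example.org/field/name> "Bob" .',
    ]
    print(run_job(triples))
```

On a cluster, the map and reduce functions would run on separate nodes over partitions of the input, and the framework would perform the grouping step.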
Keywords
dataspace, MapReduce, RDF, cloud, Hadoop
NARESUAN UNIVERSITY JOURNAL OF SCIENCE AND TECHNOLOGY