Poster in Workshop: Table Representation Learning
SiMa: Federating Data Silos using GNNs
Christos Koutras · Rihan Hai · Kyriakos Psarakis · Marios Fragkoulis · Asterios Katsifodimos
Keywords: [ data silos ] [ data integration ] [ graph neural networks ] [ matching ]
Virtually every sizable organization nowadays is building some form of data lake. In theory, every department or team in the organization would enrich its datasets with metadata and store them in a central data lake. Those datasets could then be combined in different ways to add value for the organization. In practice, though, the situation is vastly different: each department has its own privacy policies, data release procedures, and goals. As a result, each department maintains its own data lake, leading to data silos. For such data silos to be of any use, they need to be integrated. This paper presents SiMa, a method for federating data silos that consistently finds more correct relationships than state-of-the-art matching methods, while minimizing wrong predictions and requiring 20x to 1000x less time to execute. SiMa leverages Graph Neural Networks (GNNs) to learn from the existing column relationships and automated data profiles found in data silos. Our method uses the trained GNN to perform link prediction and find new column relationships across data silos. Most importantly, SiMa can be trained incrementally on the column relationships within each silo individually, and does not require consolidating all datasets into one place.
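To make the idea of link prediction over column profiles more concrete, the following is a minimal illustrative sketch, not SiMa's actual implementation. It assumes columns are graph nodes carrying small hand-made profile vectors, that known relationships exist only inside each silo, and it replaces the trained GNN with a single untrained round of mean-neighbor aggregation followed by cosine scoring. All column names, profile features, and the `alpha` mixing weight are hypothetical.

```python
from math import sqrt

# Hypothetical column profiles: [fraction numeric, fraction unique, mean length / 10].
profiles = {
    "hr.employee_id": [1.0, 1.0, 0.6],
    "hr.salary": [1.0, 0.7, 0.5],
    "sales.rep_id": [1.0, 1.0, 0.6],
    "sales.region": [0.0, 0.1, 0.8],
}

# Known column relationships; each edge stays inside a single silo (assumption).
edges = [("hr.employee_id", "hr.salary"), ("sales.rep_id", "sales.region")]


def neighbors(node):
    """Return the columns linked to `node` by a known relationship."""
    linked = [b for a, b in edges if a == node]
    linked += [a for a, b in edges if b == node]
    return linked


def aggregate(node, alpha=0.8):
    """One round of mean-neighbor aggregation (a stand-in for a trained GNN layer)."""
    own = profiles[node]
    nbrs = neighbors(node)
    if not nbrs:
        return list(own)
    mean = [sum(profiles[n][i] for n in nbrs) / len(nbrs) for i in range(len(own))]
    return [alpha * o + (1 - alpha) * m for o, m in zip(own, mean)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def score_cross_silo_links():
    """Score every pair of columns from different silos, best candidates first."""
    emb = {col: aggregate(col) for col in profiles}
    cols = sorted(profiles)
    scored = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a.split(".")[0] != b.split(".")[0]:  # silo name is the prefix
                scored.append((cosine(emb[a], emb[b]), a, b))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, a, b in score_cross_silo_links():
        print(f"{a} <-> {b}: {score:.3f}")
```

In this toy setup the two identifier-like columns receive the highest cross-silo score. A learned model would instead fit its layer weights and scoring function on the relationships already known within each silo, which is the part this sketch leaves out.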