In the age of machine learning, the more data, the better. Pooling medical data and running models across it leads to new insights about genetics, disease, and treatment; combining laboratory results helps scientists more quickly discover new materials for next-generation batteries or superconductors. But many obstacles — technical, legal, and ethical — prevent the open sharing of data between organizations or research groups.
Data Stations, a new project from University of Chicago researchers, will attempt to eliminate these hurdles by reversing the flow of big data science. Instead of the arduous process of tracking down the right dataset, receiving permission to use it, downloading it, and working with it on their own computational resources, researchers can simply query a Data Station, where all of these steps are automated behind the scenes. Data providers control how their data can be used or combined, protecting sensitive information and intellectual property.
“The Data Station is a radically new approach that is needed to change how people and organizations think about, access, and use data,” said Ian Foster, Arthur Holly Compton Distinguished Service Professor in the UChicago Department of Computer Science, Distinguished Fellow and Senior Scientist at Argonne National Laboratory, and principal investigator of the project. “This platform will ease access to sensitive data, assist with data discovery and integration, and facilitate data governance and compliance across fields of inquiry.”
The project was funded by a $1 million grant from the National Science Foundation as part of its Convergence Accelerator program, which boosts collaborations between academia and industry. Other investigators on the project include Michael J. Franklin and Raul Castro Fernandez of UChicago Computer Science and Sendhil Mullainathan of the Booth School of Business.