ITI Researchers Work to "Debloat" Data in Applications

10/29/2020 10:14:00 AM

When working with large datasets, carving out the data relevant to a specific research project can be like finding a needle in a haystack. While a database may contain terabytes of data, a researcher may need only a few hundred megabytes. On top of that, once the researcher retrieves the data and runs an analysis, sharing that information and reproducing the research become even more difficult, according to the Information Trust Institute’s (ITI) Sibin Mohan. Mohan is working to make the entire process easier, from finding data to sharing it to reproducing results, in a new project called “Minimizing Data Sets: MiDas.”

“We call this work ‘data-based debloating,’ and it is exciting work with a lot of great potential,” said Mohan, research assistant professor in computer science (CS). “It makes it much easier to do research in collaboration and reproduce data efficiently.” 

An example of an area of research that could greatly benefit from Mohan’s project is climatology. There are datasets that contain climate information for the entire United States over long periods of time. Of this vast data, most researchers need only a small subset; for example, researchers may just want to look at data for California from 2000 to 2009. Yet to share the data behind their research, they would still need to share the entire dataset, not just the subset used in their work, making it incredibly unlikely that their research could be exactly reproduced.
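To make the subsetting step concrete, here is a minimal sketch of what such a researcher might do today, assuming a hypothetical CSV file and column names (“state,” “year”) that are invented for illustration and are not from the project itself:

```python
# Hypothetical illustration only: the file name and column names are
# invented for this sketch and do not come from the MiDas project.
import pandas as pd

SOURCE = "us_climate_records.csv"  # stand-in for a terabyte-scale dataset

# Stream the file in chunks so the full dataset never has to fit in
# memory, keeping only the California rows from 2000 through 2009.
subset_chunks = []
for chunk in pd.read_csv(SOURCE, chunksize=1_000_000):
    mask = (chunk["state"] == "CA") & chunk["year"].between(2000, 2009)
    subset_chunks.append(chunk[mask])

subset = pd.concat(subset_chunks, ignore_index=True)
subset.to_csv("ca_2000_2009.csv", index=False)  # megabytes, not terabytes
```

Even then, the extracted file is divorced from the application that produced the analysis, which is the gap MiDas targets by packaging the two together.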

Mohan and his student, Chaitra Niddodi, have been working to debloat data associated with an application and provide it to researchers in a packaged bundle.

“Most of the current debloating frameworks focus on code-based debloating by trying to prune off unnecessary code in an application,” said Niddodi, a CS PhD student. “This research focuses on the data perspective. Having the application and the relevant data packaged together improves both speed and accuracy when replicating the results in a given setting.”

The project, MiDas, maps high-level user inputs to file offsets in order to identify the subset of data accessed by the application. The application is then modified and packaged along with this data. More details can be found in the duo’s short paper, “MiDas: Containerizing Data-Intensive Applications with I/O Specialization,” presented at the P-RECS’20 workshop affiliated with the HPDC 2020 conference.
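The paper describes the full mechanism; as a rough sketch of the underlying idea only, and not the MiDas implementation, the example below records which byte ranges of a file an application actually reads and then copies just those ranges into a smaller packaged file. All file and class names here are invented for illustration:

```python
# Sketch of the idea, not the MiDas implementation: record the
# (offset, length) byte ranges an application actually reads, then
# copy only those ranges into a smaller "debloated" file.
import io

class OffsetTracingFile(io.FileIO):
    """File wrapper that logs the offset and length of every read."""

    def __init__(self, path):
        super().__init__(path, "r")  # FileIO is always binary
        self.accessed = []  # (offset, length) ranges touched by the app

    def read(self, size=-1):
        offset = self.tell()
        data = super().read(size)
        self.accessed.append((offset, len(data)))
        return data

def debloat(path, accessed, out_path):
    """Copy only the accessed byte ranges into the packaged file.

    A real system must also remap offsets (MiDas modifies the
    application so its reads resolve against the repacked data);
    this sketch skips that step.
    """
    with open(path, "rb") as src, open(out_path, "wb") as dst:
        for offset, length in sorted(accessed):
            src.seek(offset)
            dst.write(src.read(length))

# Demo: make a dummy 1 MB "dataset," read a 4 KB slice of it, then
# package only the bytes that were actually touched.
with open("huge_dataset.bin", "wb") as fh:
    fh.write(bytes(1_000_000))

traced = OffsetTracingFile("huge_dataset.bin")
traced.seek(1024)
_ = traced.read(4096)
traced.close()

debloat("huge_dataset.bin", traced.accessed, "debloated.bin")  # 4 KB out
```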

Niddodi worked on this project for two summers during her internship at SRI, which started the collaboration between the research lab and Mohan’s research group. Since the work began, a researcher from DePaul University in Chicago has also joined the project. The team believes this work can be applied to a wide range of research areas and industries.

“It can apply to any field where you have to deal with huge data sets and you only want to work on parts of them,” said Mohan, who also has an affiliation with electrical and computer engineering. “Future research becomes easier, performance gets better, reproducibility is better, transmission is easier, and archival becomes easier. It’s not just academia. Any field that has tremendous amounts of data analysis can benefit from this work.”

