Enhancing cache content management in a data lake architecture using Reinforcement Learning

Tracolli, Mirco

In the past few years, data have become so large and complex that a dedicated field has emerged: Big Data. Besides the size of the data, which continues to grow in every sector, from business to scientific domains, the advent of the IoT (Internet of Things) and of sensor data introduces a large volume of information that is not simple to manage or to extract valuable knowledge from. The process of extracting useful information and value from such data is mainly composed of two phases: first the processing, and then the data access. One of the main requirements for data access is a fast response time, whose order of magnitude can vary widely depending on the specific type of processing and on the processing patterns. Therefore, besides the specific optimization of algorithms and software processes, several aspects of the infrastructure level of the analysis environment could be enhanced. From this point of view, optimizing the access layer becomes increasingly important when dealing with a geographically distributed environment where data must be retrieved from the remote servers of a Data Lake. From the infrastructural perspective, caching systems are used to mitigate latency and to better serve popular data. Thus, the role of the cache becomes key to effective and efficient data access. In this thesis, we explore how to make a cache autonomous and adaptable in order to improve the performance of a system in terms of data management, with the aim of reducing cache costs, such as the amount of data written to and read from the cache memory.

Enhancing cache content management in a data lake architecture using Reinforcement Learning / Mirco Tracolli. - (2021).