- #1
lomidrevo
I think the basic idea is quite clear, as defined for example by Wikipedia:
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
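To make that definition concrete, here is a minimal sketch in Python of how I read it: a plain directory stands in for the lake, files are stored byte-for-byte in their original format, and a schema is applied only at read time. The `raw/<source>/` layout and the helper names `ingest_raw`, `read_events`, and `demo` are my own assumptions for illustration, not part of any real data lake product.

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest_raw(lake: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a payload byte-for-byte under raw/<source>/ (hypothetical layout)."""
    target = lake / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, no schema imposed at write time
    return target


def read_events(lake: Path) -> list:
    """Schema-on-read: interpret the raw files only when someone asks for them."""
    rows = []
    for path in sorted((lake / "raw").rglob("*")):
        if path.suffix == ".json":
            rows.append(json.loads(path.read_text()))
        elif path.suffix == ".csv":
            with path.open(newline="") as f:
                rows.extend(csv.DictReader(f))
        # Anything else (images, PDFs, ...) simply stays in the lake untouched.
    return rows


def demo() -> list:
    """Ingest three differently shaped sources, then read back what is parseable."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        ingest_raw(lake, "sensors", "reading.json", b'{"id": "s1", "temp": 21.5}')
        ingest_raw(lake, "crm", "users.csv", b"id,name\nu1,Ada\n")
        ingest_raw(lake, "uploads", "photo.png", b"\x89PNG\r\n")
        return read_events(lake)
```

Calling `demo()` returns one record per parseable row, while the binary file is kept but not interpreted. In this reading, what makes it a "lake" is the storage convention (keep everything raw, decide on structure later) rather than any particular engine.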
But when I google more about this "technology", I find quite varied ideas about what counts as a data lake. Some of them:
- just a synonym for the ETL approach to data processing
- a distributed file system, like Apache Hadoop HDFS
- a NoSQL database with additional SQL support, for example MongoDB
- or some proprietary architecture involving all of that, plus maybe some extra tools for reporting, visualization, and machine learning?
How do you understand the term data lake? Is it just a buzzword?