Data Lake Security
- By Vivek Shitole (Information Security & Data Privacy Leader)
VTCC Education | 30 Jun 2024
Picture Source: https://opendatascience.com/best-practices-for-data-lake-security/
One buzzword is common these days, whether in technology discussions, business strategy discussions, or projections about industry growth, stocks, and innovation: ‘DATA’. With close to half a billion terabytes of data created every day, and a significant share of it transacted, the need for effective ways to collect, store, and analyze data is clear. The data lake has emerged as an efficient and effective technology to enable this. With this gigantic growth of data comes an equal, if not greater, need to secure it from breaches and cybersecurity threats. This article outlines potential security threats and risks around the data lake and some effective ways to tackle them. Let’s dive in!
Let us first understand why the data lake has become an important part of today’s modern data architecture.
Structured data has long been managed and used effectively for day-to-day operations and for accurate decision making. It is unstructured data that has made its presence felt in data analytics relatively recently. It has also made data analytics more complicated, which is why a repository such as a data lake becomes critical: it is a centralized repository for both structured and unstructured data. A data lake stores structured and unstructured data as-is, in open-source file formats, to enable direct analytics. Using big data via tools such as the data lake has propelled technological advancements, which in turn have given organizations the ability to uncover insights about their customers’ needs and to fulfill those needs while growing revenue.
But this advancement in data analytics and data science does not come without challenges. It complicates information security and data security in multiple directions. The sections below focus on these complications around data lake security and potential remediations.
Many companies have shifted their data lakes to cloud platforms, having discovered the core advantages of cloud computing and storage. Lower infrastructure and maintenance costs, great customizability, and broad accessibility have allowed them to manage and store huge volumes of data while saving money on data infrastructure. But as companies are allured by the promise of cloud technology, many still do not understand the vulnerabilities and challenges associated with the migration and integration process, especially the security risk it entails. All of this means more data vulnerabilities, which creates a need for data security policies. From data loss to defending against cyberattacks during migration and operation, there are inherent security vulnerabilities that should be understood.
Like many standard information security programs, data lake security comprises a set of processes and procedures to protect data from cyberattacks. Depending on the industry or the organization deploying the data lake, huge volumes of sensitive information such as credit card numbers, medical test results, and customer data may be at stake, potentially creating many cybersecurity risks.
Given below are some of the best practices to remediate key risks emerging from cyberattacks:
1. Data Governance and Compliance
2. Data administration program structure
3. Data access and control processes
4. Data protection controls
Data Governance and Compliance
Data governance is about managing an organization’s data assets, but it is equally about ensuring that data is accurate, reliable, and secure, which is crucial for making informed decisions, complying with regulatory requirements, and driving business success. By establishing a robust and comprehensive data governance framework, an organization can manage its data effectively and efficiently, protect sensitive information, and meet regulatory requirements, enhancing its overall performance and reputation.
Policies, procedures, and standards are three critical components of any data governance framework. A data governance framework is a valuable guide for organizations navigating an abundance of data. In the current digital era, data is both a valuable asset and a liability, so effective data governance is critical for managing, utilizing, and protecting data precisely and purposefully. It involves maintaining the authenticity of data and complying with regulations. Data governance fosters trust, enables informed decision-making, and ultimately guides organizations toward success. Throughout all of this, the increasing interconnectedness of operational and information technology systems presents key security risks.
Data Administration
Data administration focuses on managing data from a conceptual, database-independent perspective. It coordinates the strategies for information and metadata management by controlling the requirements gathering and modeling functions. Data modeling supports individual application development with tools, methodologies, naming standards, and internal modeling consulting. It also provides the upward integration and bridging of disparate application and software package models into the overall data architecture. This overall data architecture is the enterprise data model, and it is critical to the organization’s ability to assess business risk and the impact of business changes. Each data tool may need a unique approach to administration. Data administration allows the organization to maintain consistent security standards throughout the data lake. Another aspect of data lake administration is auditing data lake usage, which helps in understanding the importance of the data asset and in defining best practices for securing it.
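As a rough illustration of the auditing point above, the short Python sketch below records data lake access events to an append-only audit log. The AuditLogger class, the log file path, and the event fields are hypothetical examples chosen for this sketch, not any particular vendor’s API.

```python
import json
import time
from pathlib import Path

class AuditLogger:
    """Minimal, illustrative append-only audit log for data lake access events.
    Class name, file path, and event fields are hypothetical, not a vendor API."""

    def __init__(self, log_path: str = "datalake_audit.jsonl"):
        self.log_path = Path(log_path)

    def record(self, user: str, dataset: str, action: str, allowed: bool) -> None:
        # One JSON line per event: who touched which dataset, when, and whether it was permitted.
        event = {
            "timestamp": time.time(),
            "user": user,
            "dataset": dataset,
            "action": action,   # e.g. "read", "write", "delete"
            "allowed": allowed,
        }
        with self.log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

# Example usage: log a read of a hypothetical customer dataset.
if __name__ == "__main__":
    audit = AuditLogger()
    audit.record(user="analyst_01", dataset="s3://lake/customers/", action="read", allowed=True)
```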
Data Access and Control
Generally, an organization can define data access and controls through authentication and authorization.
• Authentication: Verifies the user’s identity. Doing this through a multi-factor authentication mechanism is the norm these days.
• Authorization: Determines each user’s level of access to the data, based on specified policies, as well as the actions the user can take on it. Security principal-based authorization, where the system evaluates permissions in a defined policy order, is one of the effective approaches to authorization.
Both authentication and authorization need to be implemented properly across the organization for data lake access controls to be effective and adequate. In addition, no single approach to managing data lake access suits everyone: in practice, different organizations want different levels of governance and control over their data. Organizations must choose the approach that meets the required level of governance, without introducing undue delays or friction in gaining access to data in a given data lake.
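To make the authentication and authorization distinction concrete, here is a minimal Python sketch of security principal-based authorization, assuming a simple policy that maps principals (users or roles) to allowed actions on data lake path prefixes. The policy structure, rule names, and is_authorized function are illustrative assumptions, not a specific product’s API.

```python
from dataclasses import dataclass

# Hypothetical policy model: each rule grants a principal (user or role)
# a set of actions on data lake paths under a given prefix.
@dataclass(frozen=True)
class PolicyRule:
    principal: str       # e.g. "role:data_analyst" or "user:alice"
    path_prefix: str     # e.g. "lake/sales/"
    actions: frozenset   # e.g. frozenset({"read"})

POLICY = [
    PolicyRule("role:data_analyst", "lake/sales/", frozenset({"read"})),
    PolicyRule("role:data_engineer", "lake/", frozenset({"read", "write"})),
]

def is_authorized(principals: set, path: str, action: str) -> bool:
    """Return True if any of the caller's principals is granted the action on the path.
    Rules are evaluated in policy order, with deny-by-default if nothing matches."""
    for rule in POLICY:
        if rule.principal in principals and path.startswith(rule.path_prefix) and action in rule.actions:
            return True
    return False

# Example: an authenticated analyst may read sales data but not write it.
if __name__ == "__main__":
    caller = {"user:alice", "role:data_analyst"}   # principals established at authentication time
    print(is_authorized(caller, "lake/sales/2024/q1.parquet", "read"))   # True
    print(is_authorized(caller, "lake/sales/2024/q1.parquet", "write"))  # False
```

Evaluating rules in policy order with deny-by-default mirrors the policy-driven evaluation described above; a production system would load the policy from a central store rather than hard-coding it.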
Data Protection
Encryption of data at rest is a requirement of most information security standards. It has traditionally been implemented through third-party database encryption products, but for enterprises using cloud data lake vendors, encryption at rest is typically offered as a bundled service at no extra cost.
For data lake security, though encryption is desired and often required, it is not a complete solution, especially for analytics and machine learning applications.
Security needs to be a primary focus of data operations, and the same applies to the data lake. This is best achieved through a simple, always-on security posture that makes security the default: an approach that is both integrated and prescriptive in securing your most sensitive assets. Following industry best practices, encrypting your data in transit and at rest is a must. With encryption comes an encryption key that must itself be protected and secured, which calls for both on-premises and cloud-native operations to provide secure storage and management of encryption keys, not only for internal applications but for third parties as well.
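As a hedged sketch of the encrypt-at-rest-and-protect-the-key guidance above, the example below uses the third-party cryptography package (an assumption; the article does not name a tool) to encrypt a small file with a symmetric key. In practice the key would be generated and held in a cloud KMS or on-premises HSM rather than created inline as it is here.

```python
# Requires the third-party "cryptography" package: pip install cryptography
from cryptography.fernet import Fernet

def encrypt_file(plaintext_path: str, ciphertext_path: str, key: bytes) -> None:
    """Encrypt a file at rest with a symmetric key (Fernet: AES in CBC mode plus an HMAC)."""
    with open(plaintext_path, "rb") as f:
        data = f.read()
    with open(ciphertext_path, "wb") as f:
        f.write(Fernet(key).encrypt(data))

def decrypt_file(ciphertext_path: str, key: bytes) -> bytes:
    """Decrypt a previously encrypted file and return the plaintext bytes."""
    with open(ciphertext_path, "rb") as f:
        return Fernet(key).decrypt(f.read())

if __name__ == "__main__":
    # The key would normally come from a managed key store (cloud KMS or on-prem HSM)
    # and never sit next to the encrypted data; generating it inline is only for this demo.
    key = Fernet.generate_key()
    with open("record.csv", "w", encoding="utf-8") as f:
        f.write("customer_id,card_number\n1001,4111111111111111\n")
    encrypt_file("record.csv", "record.csv.enc", key)
    print(decrypt_file("record.csv.enc", key).decode())
```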
With encryption, there are two challenges.
• Encrypted values change the data field format, which may cause many applications to break.
• Encryption is only as secure as the key used to encrypt and decrypt the data, and that key becomes a single point of failure.
Unlike encryption, tokenization keeps the format intact, so even if a hacker gets the key, they still do not have access to the underlying data.
A best practice is to use the built-in encryption from the cloud provider and then add additional security from a third party. This vendor should decrypt the data, tokenize it, and provide custom views depending on the user’s access rights, all done dynamically at run time.
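Below is a minimal sketch of that tokenization flow, assuming a simple in-memory token vault and two illustrative roles. A real deployment would use a hardened, audited vault service and richer access policies, but the shape is the same: sensitive values are swapped for format-preserving tokens, and only privileged users see them de-tokenized at query time.

```python
import secrets

class TokenVault:
    """Illustrative in-memory token vault; a real system would use a hardened,
    audited vault service. Tokens preserve the length and digit format of the original."""

    def __init__(self):
        self._token_to_value = {}

    def tokenize(self, value: str) -> str:
        # Format-preserving: same length, all digits, no mathematical relationship to the original.
        token = "".join(secrets.choice("0123456789") for _ in value)
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]

def card_number_view(stored_token: str, vault: TokenVault, user_role: str) -> str:
    """Dynamic view at run time: a privileged (hypothetical) role sees the real value,
    everyone else only ever sees the token."""
    if user_role == "fraud_investigator":
        return vault.detokenize(stored_token)
    return stored_token

if __name__ == "__main__":
    vault = TokenVault()
    token = vault.tokenize("4111111111111111")  # the token, not the real card number, lands in the lake
    print(card_number_view(token, vault, "data_analyst"))        # token only
    print(card_number_view(token, vault, "fraud_investigator"))  # original value
```

Because the tokens preserve length and digit format, downstream applications that expect a 16-digit field keep working, which addresses the format-breakage challenge noted above.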
Importance of Metadata
A lot of data lake security, and of data security in general, can be achieved via data governance, and metadata is an integral part of data governance. Let’s understand a bit more about metadata and how it enables effective data governance.
Metadata simply means data about data. For example, it helps us understand the characteristics of the data in question: the format of the data, the length of data fields, the number of data fields, and the type of data are all part of metadata. It describes data assets along various dimensions, such as technical information (technical metadata: data structure, data schema, technical characteristics of data fields, data transfer protocol, etc.) and business information (business metadata: data owner, data processor, types of data access roles, etc.).
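To make the split between technical and business metadata concrete, here is a small, hypothetical metadata record for a single data lake table, written as a Python dictionary; the field names and values are illustrative, not a specific catalog’s schema.

```python
# Hypothetical metadata record for one data lake table, split into the technical
# and business dimensions described above. Field names and values are illustrative.
dataset_metadata = {
    "technical": {
        "location": "s3://lake/sales/orders/",
        "format": "parquet",
        "schema": [
            {"name": "order_id",    "type": "bigint",        "nullable": False},
            {"name": "customer_id", "type": "bigint",        "nullable": False},
            {"name": "amount",      "type": "decimal(10,2)", "nullable": True},
        ],
        "transfer_protocol": "https",
    },
    "business": {
        "owner": "sales-data-team@example.com",
        "processor": "order-ingestion-service",
        "classification": "confidential",
        "access_roles": ["data_analyst", "data_engineer"],
    },
}

if __name__ == "__main__":
    # Simple governance checks can be driven straight from the metadata,
    # e.g. flagging datasets without an owner or a classification.
    assert dataset_metadata["business"]["owner"], "every dataset needs an owner"
    print(dataset_metadata["business"]["classification"])
```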
If used strategically, metadata can help in understanding key attributes of data such as:
- Data ownership
- Data accuracy
- Data classification
- Effective approach towards data governance
- Data sources and destinations
- Reliability of data
Metadata management is critical for effective data governance. It is enabled through various processes such as access controls, data control processes, data schema management, data field edit management, data classification, data quality management, data inflow and outflow tracking, data search features, and the required data compliance controls.
Effective metadata management leads to a solid data governance structure, which ultimately leads to high-quality data, data accuracy, data integrity, and intelligent usability. These are critical building blocks that allow users to exploit data assets to their full potential.
Conclusion: Due to technological advancements and the latest innovative data management techniques, data lake security has become a dynamic and challenging topic. The right combination of processes, tools, integrations, and skills needs to be adopted to ensure appropriate controls around data lake security, leading to a better security posture and a reduction in data security risks and vulnerabilities for the organization as a whole.