14. Related Technologies Flashcards
What is “Big Data”?
The term “big data” refers to extremely large data sets from which you can derive valuable information. Big data systems can handle volumes of data that traditional data-processing tools are simply unable to manage. You can’t go to a store and buy a big data solution, and big data isn’t a single technology; it refers to a set of distributed collection, storage, and data-processing frameworks.
According to Gartner, “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization.”
The CSA refers to the qualities from the Gartner quote as the “Three Vs.” Let’s define those now:
*High Volume - A large amount of data in terms of the number of records or attributes
*High Velocity - Fast generation and processing of data (such as real-time or data stream)
*High Variety - Structured, semistructured, or unstructured data
The Three Vs of big data make it very practical for cloud deployments because of the elasticity and massive storage capabilities available in Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) deployment models. Additionally, big data technologies can be integrated into cloud-computing applications.
Big data systems are typically associated with three common components, which describe how data gets collected, stored, and processed:
*Distributed data collection = This component refers to the system’s ability to ingest large volumes of data, often as streamed data. Ingested data could range from simple web clickstream analytics to scientific and sensor data. Not all big data relies on distributed or streaming data collection, but it is a core big data technology.
*Distributed storage = This refers to the system’s ability to store large data sets in distributed file systems (such as Google File System, Hadoop Distributed File System, and so on) or databases (such as NoSQL). NoSQL (Not only SQL) is a nonrelational distributed and scalable database system that works well in big data scenarios and is often required because of the limitations of nondistributed storage technologies.
*Distributed processing = Tools and techniques (such as MapReduce, Spark, and so on) that distribute processing jobs for the effective analysis of data sets so massive and rapidly changing that single-origin processing can’t handle them.
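To make the distributed-processing idea concrete, here is a minimal sketch of the MapReduce pattern in Python. The worker pool stands in for cluster nodes, and the chunk strings stand in for partitions of a large data set; a real system such as Hadoop or Spark distributes the same map and reduce phases across machines.

```python
from collections import Counter
from multiprocessing import Pool

def map_phase(chunk):
    """Map: emit a partial word count for one chunk of the data set."""
    return Counter(chunk.split())

def reduce_phase(partials):
    """Reduce: merge the partial counts produced by every worker."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    # Each chunk would be one partition of a huge data set.
    chunks = ["big data big", "data velocity variety", "big volume"]
    with Pool(processes=3) as pool:   # each worker plays the role of a node
        partials = pool.map(map_phase, chunks)
    print(reduce_phase(partials)["big"])  # -> 3
```

The key property is that no single node ever needs the whole data set: each worker sees only its chunk, and only the small partial results are combined.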
You know that big data is a framework that uses multiple modules across multiple nodes to process high volumes of data with a high velocity and high variety of sources. This makes security and privacy challenging when you’re using a patchwork of different tools and platforms.
This is a great opportunity to discuss how security basics can be applied to technologies with which you may be unfamiliar, such as big data.
At its most basic level, you need to authenticate, authorize, and audit (AAA) least-privilege access to all components and modules in the Hadoop environment. This, of course, includes everything from the physical layer all the way up to the modules themselves.
For application-level components, your vendor should have its best practices documented (for example, Cloudera’s security document is roughly 500 pages long) and should quickly address any vulnerabilities with patches. Only after these AAA basics are addressed should you consider encryption requirements, both in transit and at rest as required.
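The AAA idea above can be sketched in a few lines. This is a hypothetical illustration, not a real Hadoop control: the in-memory user store and grant table stand in for the platform’s actual identity system (for example, Kerberos in a Hadoop cluster), and the logger stands in for a proper audit trail.

```python
import hmac
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
audit_log = logging.getLogger("audit")

# Hypothetical identity store and least-privilege grants; a real
# deployment would use the platform's own identity provider.
SECRETS = {"analyst": b"s3cret"}
GRANTS = {"analyst": {"hdfs:read"}}

def authenticate(user, password):
    """Authentication: verify the caller's credential."""
    secret = SECRETS.get(user)
    return secret is not None and hmac.compare_digest(secret, password)

def authorize(user, action):
    """Authorization: allow only explicitly granted actions."""
    return action in GRANTS.get(user, set())

def access(user, password, action):
    """AAA gate: authenticate, authorize, and audit every request."""
    ok = authenticate(user, password) and authorize(user, action)
    audit_log.info("user=%s action=%s allowed=%s", user, action, ok)
    return ok
```

Note the default-deny shape: an unknown user or an ungranted action falls through to `False`, and every decision, allowed or not, is written to the audit log.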
Data Collection
When data is collected, it will likely go through some form of intermediary storage device before it is stored in the big data analytics system. Data in this device (virtual machine, instance, container, and so on) will also need to be secured, as discussed in the previous section. Intermediary storage could even be swap space held in memory.
Your provider should have documentation available for customers to address their own security requirements.
EXAM TIP
All components and workloads required by any technology must have secure AAA in place. This remains true when underlying cloud services are consumed to deliver big data analytics for your organization. An example of a cloud-based big data system could consist of processing nodes running in instances that collect data in volume storage.
Key Management
If encryption at rest is required as part of a big data implementation (everything is risk-based, after all), implementation may be complicated by the distributed nature of the nodes. For the protection of data at rest, encryption capabilities in a cloud environment will likely be defined by the provider’s ability to expose appropriate controls to secure data, and this includes key management. Key management systems need to be able to support distribution of keys to multiple storage and analysis tools.
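One common way to support key distribution across many nodes is a key hierarchy: the key management system holds a single master key and derives a distinct data key per node, so the master key itself is never shipped anywhere. The sketch below is a simplified, hypothetical illustration of that idea using an HMAC-based derivation; a production system would use a real KMS and a standard KDF such as HKDF.

```python
import hashlib
import hmac
import secrets

# Master key held only by the (hypothetical) key management system.
MASTER_KEY = secrets.token_bytes(32)

def derive_node_key(master_key: bytes, node_id: str) -> bytes:
    """Derive a per-node data key (HKDF-style, via HMAC-SHA256)."""
    return hmac.new(master_key, node_id.encode(), hashlib.sha256).digest()

# Each storage/analysis node receives its own derived key; rotating
# MASTER_KEY rotates every node key at once.
node_keys = {node: derive_node_key(MASTER_KEY, node)
             for node in ("storage-node-1", "storage-node-2")}
```

Because derivation is deterministic, the KMS can re-derive any node’s key on demand instead of storing thousands of keys, while a compromise of one node key does not expose the master key or any sibling node’s key.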
Security Capabilities
CSP controls can be used to address your security requirements as far as the services that may be consumed (such as object storage) as part of your big data implementation. If you need your data to be encrypted, see if your cloud provider can do that for you. If you need very granular access control, see if the provider’s service includes it.
The details of the security configuration of these services and controls should be included in your security architecture.
Identity and Access Management
As mentioned, authorization and authentication are the most important controls. You must ensure that they are done correctly. In your cloud environment, this means starting with ensuring that every entity that has access to the management plane is restricted based on least-privilege principles.
Moving from there, you need to address access to the services that are used as part of your big data architecture.
Finally, all application components of the big data system itself need to have appropriate access controls established.
Considering the number of areas where identity and access management (IAM) must be implemented (cloud platform, services, and big data tool level), entitlement matrices can be complicated.
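A simple way to picture such an entitlement matrix is a lookup table keyed by layer and role. The layers below mirror the three levels named above (cloud platform, services, and the big data tool itself); the roles and action names are hypothetical examples, not any provider’s actual IAM vocabulary.

```python
# Hypothetical entitlement matrix spanning the three IAM layers:
# cloud platform, cloud services, and the big data tool itself.
ENTITLEMENTS = {
    ("platform", "cloud-admin"):   {"manage-iam", "view-billing"},
    ("service",  "data-engineer"): {"object-storage:write"},
    ("tool",     "analyst"):       {"spark:submit-job"},
}

def is_entitled(layer: str, role: str, action: str) -> bool:
    """Least privilege: deny unless the matrix explicitly grants it."""
    return action in ENTITLEMENTS.get((layer, role), set())
```

Even this toy version shows why the matrix grows quickly: every new role must be mapped at every layer, and a gap at any one layer either blocks legitimate work or, worse, leaves an unreviewed default in place.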
PaaS Benefits
Cloud providers may offer big data services as a PaaS. Numerous benefits can be associated with consuming a big data platform instead of building your own. Cloud providers may implement advanced technologies, such as machine learning, as part of their offerings.
PaaS Risks
You need to have an adequate understanding of potential data exposure, compliance, and privacy implications. Is there a compliance exposure if the PaaS vendor employees can technically access enterprise data? How does the vendor address this insider threat? These are the types of questions that must be addressed before you embrace a big data PaaS service.
Risk-based decisions must be made and appropriate security controls implemented to satisfy your organizational requirements.
Internet of Things (IoT)
The Internet of Things includes connected devices throughout the physical world, ranging from power and water systems to fitness trackers, home assistants, medical devices, and other industrial and retail technologies.
Beyond these products, enterprises are adopting IoT for applications such as the following:
*Supply chain management
*Physical logistics management
*Marketing, retail, and customer relationship management
*Connected healthcare and lifestyle applications for employees and consumers
The following cloud-specific IoT security elements are identified in the CSA Guidance:
*Secure data collection and sanitization = This could include, for example, stripping collected data of sensitive and/or malicious content.
*Device registration, authentication, and authorization = One common issue encountered today is the use of stored credentials to make direct API calls to the backend cloud provider. There are known cases of attackers decompiling applications or device software and then using those credentials for malicious purposes.
*API security for connections from devices back to the cloud infrastructure = In addition to the stored credentials issue just mentioned, the APIs themselves could be decoded and used for attacks on the cloud infrastructure.
*Encrypted communications = Many current devices use weak, outdated, or nonexistent encryption, which places data and the devices at risk.
*Ability to patch and update devices so they don’t become a point of compromise = Currently, it is common for devices to be shipped as-is and never receive security updates for operating systems or applications. This has already caused multiple significant and highly publicized security incidents, such as massive botnet attacks based on compromised IoT devices.
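The registration/authentication item above is worth illustrating. The risk is a single shared credential baked into firmware that an attacker can decompile and reuse against the cloud backend; the mitigation is for the backend to issue a unique credential per device at registration time. This is a hypothetical sketch of that flow, with an in-memory registry standing in for the provider’s device directory.

```python
import hmac
import secrets

# Hypothetical backend device registry; a real one would be the cloud
# provider's device-management service, not an in-memory dict.
_registry = {}

def register_device(device_id: str) -> str:
    """Issue a unique token to one device instead of a shared secret."""
    token = secrets.token_hex(16)
    _registry[device_id] = token
    return token  # provisioned to this device only, ideally at manufacture

def authenticate_device(device_id: str, token: str) -> bool:
    """Constant-time check of the device's own token."""
    expected = _registry.get(device_id, "")
    return hmac.compare_digest(expected, token)
```

With per-device credentials, a token extracted from one compromised device can be revoked individually without bricking the whole fleet, which is impossible when every unit ships with the same embedded API key.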
Mobile Computing
Companies don’t require cloud services to support mobile applications, but many mobile applications depend on cloud services for backend processing. Mobile applications leverage the cloud not only for its processing power for highly dynamic workloads but also for its geographic distribution.
The CSA Guidance identifies the following security issues for mobile computing in a cloud environment:
*Device registration, authentication, and authorization are issues for mobile applications, as they are for IoT devices, especially when stored credentials are used to connect directly to provider infrastructure and resources via an API. If an attacker can decompile the application and obtain these stored credentials, they will be able to manipulate or attack the cloud infrastructure.
*Any application APIs that run within the cloud environment are also listed as a potential source of compromise. If attackers can run local proxies that intercept these API calls, they may be able to decode the likely unencrypted traffic and explore it for security weaknesses. Certificate pinning/validation inside the application may help mitigate this risk.
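The certificate-pinning idea mentioned above reduces to one check: the app ships with the expected certificate’s fingerprint and refuses any connection whose presented certificate doesn’t match, even one signed by a CA that a local proxy has tricked the OS into trusting. The sketch below shows just that comparison; the DER bytes would come from the TLS handshake (for example, via `ssl.SSLSocket.getpeercert(binary_form=True)`), which is omitted here.

```python
import hashlib

def fingerprint(der_cert: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(der_cert).hexdigest()

def connection_allowed(der_cert: bytes, pinned: str) -> bool:
    """Pinning check: the presented cert must match the shipped pin."""
    return fingerprint(der_cert) == pinned
```

The trade-off is operational: rotating the server certificate now requires shipping an app update (or pinning a backup key in advance), which is why pins are often set on the issuing key rather than the leaf certificate.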