Data Science at Scale Flashcards
How have companies changed in the big data era and what has enabled this?
They have become more social, customer-orientated and dynamic by collecting data, learning from it, and improving and adapting in response. This was enabled by cheaper storage and processing, faster networks, and free open-source tools.
What technological shift enabled widespread data analytics?
Cloud-based infrastructure (e.g., AWS, GCP) and Infrastructure as a Service solutions from internet giants like Google, Amazon, and Microsoft.
How has big data transformed marketing?
Customer profiling, targeted ads, and personalised communication and recommendations.
What are the 3 Vs of big data?
Volume, Velocity and Variety
What are the four fundamental functionalities that data-intensive applications are built from?
Database - Store data so it can be retrieved later.
Caching - Store the results of expensive operations to be used again soon.
Indexing - Allow users to efficiently search the data.
Batch Processing - Periodically run specific routines on large amounts of accumulated data.
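As a rough illustration of the caching and indexing entries above, here is a minimal Python sketch using only the standard library. The names `expensive_square`, `records`, `build_index` and `index_by_city` are hypothetical and chosen purely for this example; a real system would use a dedicated cache and a database index instead.

```python
from collections import defaultdict
from functools import lru_cache


@lru_cache(maxsize=128)
def expensive_square(n: int) -> int:
    """Stand-in for an expensive operation; repeated calls hit the cache."""
    return n * n


# Hypothetical records standing in for rows stored in a database.
records = [
    {"id": 1, "city": "Leeds"},
    {"id": 2, "city": "York"},
    {"id": 3, "city": "Leeds"},
]


def build_index(rows, key):
    """Map each value of `key` to the ids of the rows that contain it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row["id"])
    return index


index_by_city = build_index(records, "city")

expensive_square(12)  # computed
expensive_square(12)  # served from the cache
print(expensive_square.cache_info())
print(index_by_city["Leeds"])
```

The index trades extra memory and build time for lookups that no longer scan every record, which is the same trade-off a database index makes.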
What are the three important things that a data-intensive application needs to be?
Reliable, Scalable, Maintainable
What does it mean for a system to be reliable?
It performs the function the user expected. It can tolerate the user making mistakes. Its performance is good enough for the required use case, under the expected load and volume. It prevents any unauthorised access.
What are faults and failures?
A fault is when a component (hardware/software) of the system works in an unexpected way, and a failure is when the entire system stops providing the service.
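To make the distinction concrete, the sketch below shows a fault in one component being tolerated so that it does not become a failure of the whole system. Everything here (`ReplicaDown`, `read_with_failover`, the two replica functions) is a hypothetical illustration, not a real library API.

```python
class ReplicaDown(Exception):
    """Raised by a hypothetical replica that has suffered a fault."""


def read_with_failover(replicas, key):
    """Try each replica in turn; fail only if every replica is faulty."""
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaDown:
            continue  # a fault in one component, tolerated by trying the next
    # Only now has the system as a whole failed to provide the service.
    raise RuntimeError(f"all {len(replicas)} replicas failed")


def broken_replica(key):
    raise ReplicaDown("disk error")


def healthy_replica(key):
    return f"value-for-{key}"


print(read_with_failover([broken_replica, healthy_replica], "a"))
```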
What are Hardware Faults and what measures can be taken to stop them?
Usually when an HDD, memory module or PSU stops working. In large data centres this is common. We can use hardware measures such as RAID for HDDs, redundant PSUs, and hot-swappable CPUs.
What are Software Faults?
When the software stops working. These are harder to anticipate, and can be present on many nodes of a system at once, causing widespread failures.
What is Scalability?
A system’s ability to cope with increased load
What is Load?
A measure of the amount of use of a system, for example: requests per second, number of players, read/write ratio.
What is performance?
How well the system is responding to the load, for example: response time, or time taken to process a dataset. The average and distribution are both important.
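A small sketch of why the distribution matters as well as the average: with a few slow outliers, the mean says little about what a typical request experiences. The sample data and the `percentile` helper are made up for illustration, and the helper uses the simple nearest-rank definition (other definitions interpolate between values).

```python
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, including two slow outliers.
response_times_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 900]

mean_ms = statistics.mean(response_times_ms)
median_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)

print(f"mean={mean_ms} median={median_ms} p95={p95_ms}")
```

Here the median is 13 ms while the mean is 125.6 ms, so the mean describes neither the typical request nor the slowest ones.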
What is Vertical Scaling?
Upgrading the specs of the current system (scaling up). This does not scale linearly, and has limited fault tolerance.
What is Horizontal Scaling?
Increasing the number of devices in the system (scaling out), which scales better and has better fault tolerance.
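One common way to spread load across several devices is to route each key to a node by hashing it. The sketch below assumes a fixed list of node names (`nodes`) and a hypothetical helper `node_for_key`; it is only one simple scheme among many.

```python
import hashlib
from collections import Counter


def node_for_key(key, nodes):
    """Pick a node for `key` by hashing it, so the same key always maps to the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Hypothetical node names.
nodes = ["node-a", "node-b", "node-c"]

keys = [f"user-{i}" for i in range(1000)]
load_per_node = Counter(node_for_key(k, nodes) for k in keys)
print(load_per_node)
```

With plain modulo hashing, adding or removing a node remaps most keys; schemes such as consistent hashing are designed to reduce that reshuffling.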
What is Maintainability?
The overall cost of keeping a system operational and up to date.
What is Operability?
How easy it is for the operations team to keep the system running. Includes good monitoring, automation, and predictable behaviour.
What is Simplicity?
How easy it is for new people working on the system to understand it. The aim is to remove accidental complexity, not to reduce functionality.
What is Evolvability?
How easy it is to make changes and update the system, which is closely linked with simplicity.