Cloud Computing Flashcards
What is cloud computing?
Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. The term is generally used to describe data centers available to many users over the Internet.
It is a core component for big data and data science.
What are some of the key benefits to cloud computing?
- Cost efficient: cheaper for maintenance and upgrades
- Unlimited storage available in seconds
- Improved backup and recovery
- Easy access to information
- On-demand model
What are some of the key cons of cloud computing?
- Technical issues
    - Can arise if your internet connection is not working
- Security concerns
    - You’re sending your information to a third-party service provider such as AWS
- Prone to attack
    - Storing info in the cloud makes you vulnerable to external hacks and threats
If Data Science is not a science, what is it?
- A methodology based on multidisciplinary knowledge
- It is a data processing model focused on extracting insights from data using machine learning and predictive analytics
What are the 3 roles in a Data Science team?
- Data Scientist = statisticians, data managers
- Data engineer = Data managers, database administrators
- Data analyst = business analysts
By definition, what is a computer?
an electronic machine that takes input, processes it and returns output
how does an analog computer work?
Analog computers represent values as continuous signals, e.g. points along a wave, so a quantity can take any value in a range.
They are NOT binary like digital computers, which work only with 0 and 1; the binary approach is simpler to build and use.
what is moore’s law?
the number of transistors in a dense integrated circuit doubles approximately every two years
Adjusted variants of the law track operations per second, transistor density, power consumption and cost (in constant dollars)
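The doubling rule lends itself to a quick back-of-the-envelope projection. A minimal sketch, where the 1971 starting point (Intel 4004, ~2,300 transistors) is an illustrative assumption:

```python
# Back-of-the-envelope Moore's-law projection: transistor counts double
# roughly every two years. The 1971 starting point (Intel 4004, ~2,300
# transistors) is an illustrative assumption.
def projected_transistors(start_count, start_year, year, doubling_period=2):
    doublings = (year - start_year) / doubling_period
    return start_count * 2 ** doublings

# 25 doublings between 1971 and 2021 -> roughly 7.7e10 transistors
print(f"{projected_transistors(2300, 1971, 2021):.2e}")
```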
What are the key components of a computer (system resources)?
- Motherboard
    - Main part of the computer
    - Has connectors to plug in the other components and allow communication between them
- Central Processing Unit (CPU)
    - This is where the magic happens. The CPU runs arithmetic, logical, control and I/O operations, like a calculator but much more complex, at billions of operations per second
- Memory (RAM)
    - Temporary, volatile storage area that holds the running program and its data
- Hard Disk Drive (HDD) / Solid State Drive (SSD)
    - Persistent data storage unit
    - Slower than memory but with more capacity
- Graphics Processing Unit (GPU)
    - Like a CPU but specialized for graphics
    - GPUs work with ARRAYS and MATRICES and can thus do many more operations at the same time than a CPU: GPUs work in PARALLEL while CPUs work largely one operation at a time
    - GPUs are now also used for physics, artificial intelligence and blockchain
- Network Interface
    - Connects the computer to a network (LAN, Internet, Ethernet, WiFi)
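The GPU's data-parallel idea can be sketched in plain Python. This is a conceptual sketch only, not real GPU code: the point is that the same operation is applied independently to every element, which is exactly what a GPU can do simultaneously across thousands of cores.

```python
# Conceptual sketch of the data-parallel pattern GPUs exploit (plain Python,
# not real GPU code): one operation applied independently to every element.
def scale_vector(vec, factor):
    # A CPU steps through this loop one element at a time;
    # a GPU would assign each multiplication to its own core and
    # run them all in parallel.
    return [x * factor for x in vec]

print(scale_vector([1, 2, 3, 4], 10))  # [10, 20, 30, 40]
```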
What is a TPU?
Tensor Processing Unit
An application-specific chip (ASIC) developed by Google for neural-network workloads.
For those workloads it can outperform a GPU because it is purpose-built to run large matrix (tensor) operations in parallel.
What if you need more of a resource, say memory, but adding more to the computer is super expensive?
Build a cluster = connect multiple computers and pool the resources.
What are typical bottlenecks in computing?
- CPU
    - Math calculations, simulations, compression
    - Example: training AI models (build a cluster OR, better, use a GPU or TPU)
- Memory
    - Databases, processing data in real time
    - Example: doing complex searches
- Storage
    - Storing huge amounts of data
    - Example: content delivery network for video (YouTube)
When was the first message sent on the internet?
In 1969, from UCLA in Los Angeles to the Stanford Research Institute; “LO”
(they wanted to say “login” but it crashed)
What are some of the main milestones for the internet?
- 1950s: Electronic computers
- 1969: First message “LO”
- 1970s: TCP/IP Protocols, LANs (local area networks)
- 1980: USENET (newsgroups), ethernet cable
- 1982: SMTP (email)
- 1983: DNS (domains)
- 1991: WWW (invented by British scientist, Tim Berners-Lee at CERN), URLs, HTML
- 1999: Napster (p2p protocol) for file-sharing
- 2000: Dot-com bubble
- 2008: Amazon EC2 (Elastic Compute Cloud) = start of AWS
What are physical servers, virtual machines and containers?
- Physical servers
    - Hardware + operating system (the old standard)
- Virtual machines (VirtualBox, KVM, VMware)
    - Emulated computer system = virtual hardware
    - Different virtual machines, even on the same physical computer (partitions), do not affect each other. We can have multiple “tenants” on the same machine.
- Containers (chroot, Docker, LXC)
    - Like virtual machines BUT much lighter weight, optimized for density (avoiding unused capacity)
    - Say we have a machine with 16 CPUs: we would most likely run 16 virtual machines on it, but we can run hundreds of containers.
https://www.youtube.com/watch?v=L1ie8negCjc
What does “on premises” mean?
- The hardware is in your own building, where you control the hardware and the network
- Better latency for employees in the office
- but… limited connectivity to the outside + centralization risk
In Data science, it’s common to use local servers to train machine-learning algorithms, deep neural networks etc. Why? For speed!
And… for security, some critical systems shouldn’t even be connected to the internet.
What is co-location in data centers?
You hire space in cabinets to co-locate your servers
The data center provides power, internet and physical security.
Data Centers are located in places with cheap electricity, good defenses and upstream connections.
NOTE!! You still own the hardware and will in some instances have to go to the data center for some interventions.
What is IaaS and examples thereof?
Infrastructure as a Service
Provider offers instances (servers) in their cloud (virtual or physical).
- You manage the operating system of the instance
- No need to worry about the hardware. The provider will fix it or move you to another machine.
AWS EC2, Google Compute Engine, Microsoft Azure
- Virtual Machines, Servers, Storage, Load Balancers, Network
Usage setups:
- All your infrastructure is IaaS (good for cash flow, since you rent everything)
- Own your infrastructure and use IaaS for temporary workloads (e.g. scaling up for Christmas; this is riskier security-wise but better for cash flow)
- Use IaaS just for external storage (as a CDN, content delivery network)
What is PaaS and examples thereof?
Platform as a Service
Provider offers an application platform and you just manage the settings of the service.
Cons:
- you cannot fully control or customize the system
Examples:
- Web hosting
- Databases (no need to worry about scalability, monitoring, availability)
- Heroku: deploy your software and they will host it, run it and scale it
- Development tools
What is SaaS and examples thereof?
Software as a Service
You use the software as is; the provider runs the infrastructure and maintains the software.
Usually pay-per-use or subscriptions.
Examples:
- CRM (Salesforce)
- Email (Google Apps)
- Customer Management (Zendesk)
- Virtual desktop, Communication, Games
What is “serverless”?
Not literally “no server”: the servers still exist, but the provider manages them entirely. You deploy individual functions that run on demand.
Providers:
- AWS Lambda
- Google Cloud Functions
- Azure Functions
Advantages:
- Cost, if the usage is low it will be cheaper than having an instance
- No need to tune or scale the setup
- Small, simple functions can make development more productive
Disadvantages:
- Performance, resources (time to start up)
- Difficult to monitor, complexity
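A minimal sketch of a serverless function in the AWS Lambda Python handler style (the event payload and greeting logic are illustrative assumptions): the platform calls `handler` per request, and you never manage the underlying server.

```python
import json

# Minimal AWS Lambda-style function (Python runtime). The platform invokes
# `handler` for each request; scaling and servers are the provider's problem.
def handler(event, context):
    # `event` carries the request payload; `context` holds runtime metadata.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Local smoke test (no cloud needed):
print(handler({"name": "cloud"}, None))
```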
How can you compare IaaS, PaaS and SaaS?
By how much of the stack you manage versus the provider:
- IaaS: provider manages the hardware; you manage the operating system and everything above it
- PaaS: provider also manages the OS and platform; you manage only your application and its settings
- SaaS: provider manages everything; you just use the software
When is something scalable?
“A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added”
Werner Vogels, CTO Amazon
What does he mean by “Performance != Scalability”?
Performance = quality metric; the time it takes to execute one request
Scalability = the ability to maintain that performance under increasing load OR increase performance when adding resources
What is the Big O?
- Mathematical notation describing the limiting behavior of a function
- It is used to classify algorithms according to the complexity class (how their requirements grow as the input size grows)
- It gives us an upper bound (worst case) of how much time/space the algorithm will need
What does the Big O complexity chart show?
It illustrates how fast the number of operations for a problem of size n grows depending on the complexity class
What makes logarithmic algorithms efficient?
An O(log n) algorithm is highly efficient because the work grows far more slowly than the input: doubling the input size adds only one extra step.
What are logarithmic algorithms often used for?
- Binary trees and binary search
- Binary search; algorithm that finds the position of a target value within a sorted array
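Binary search can be sketched as follows; each comparison halves the remaining search range, giving O(log n):

```python
def binary_search(sorted_list, target):
    """Return the index of target in sorted_list, or -1 if absent.
    O(log n): each comparison halves the remaining range."""
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1  # target can only be in the upper half
        else:
            hi = mid - 1  # target can only be in the lower half
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # 3
```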
When is a linear algorithm optimal?
- in situations where the algorithm has to sequentially read its entire input
- Example; a procedure that adds up all elements of a list requires time proportional to the length of the list
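The summation example above, sketched minimally; the loop must visit every element exactly once, so the time is proportional to the list length:

```python
def total(values):
    # O(n): every element is read exactly once.
    s = 0
    for v in values:
        s += v
    return s

print(total([2, 4, 6, 8]))  # 20
```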
When do you use quadratic algorithms? O(n^2)
- Common with algorithms that involve nested iterations over the data set
- Examples:
- multiplying two n-digit numbers by a simple algorithm
- simple sorting algorithms such as bubble sort, selection sort and insertion sort
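Bubble sort, one of the simple sorts listed above, sketched minimally; the nested loops are what make it O(n^2):

```python
def bubble_sort(items):
    # O(n^2): nested loops compare adjacent pairs over and over,
    # bubbling the largest remaining value to the end each pass.
    a = list(items)
    n = len(a)
    for i in range(n):
        for j in range(n - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

print(bubble_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```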
What do you often use polynomial for? O(n^c)
- Public Key Cryptography
- It is computationally hard to find prime factors of large numbers so we use such numbers on purpose to make decryption unfeasible
running time is upper bounded by a polynomial expression T(n) = O(n^c)
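A toy sketch of the asymmetry cryptography exploits (the primes chosen are illustrative, far smaller than real keys): multiplying two primes is cheap, but recovering them by trial division takes on the order of sqrt(n) steps, which is exponential in the number of digits.

```python
# Multiplying two primes is polynomial in the digit count;
# factoring the product by trial division is not.
def smallest_factor(n):
    i = 2
    while i * i <= n:   # up to ~sqrt(n) divisions in the worst case
        if n % i == 0:
            return i
        i += 1
    return n  # n itself is prime

p, q = 104729, 1299709           # two known primes (toy-sized)
n = p * q                        # fast to compute
print(smallest_factor(n) == p)   # True, but needs ~1e5 trial divisions
```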
What has been used to make algorithms win in chess and go?
Exponential algorithms, T(n) = O(2^(n^c))
How can you explain the use case of Factorial O(n!) ?
With current hardware, brute-force search over all n! permutations is feasible only up to around n = 20–21 (21! ≈ 5.1 × 10^19 possibilities).
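A brute-force traveling-salesman sketch illustrates the factorial blow-up (the distance matrix is an illustrative assumption): with n cities there are (n-1)! visiting orders to enumerate, which is why this only works for tiny inputs.

```python
from itertools import permutations

# Brute-force TSP: O(n!) — every visiting order of the cities is checked.
def shortest_tour(dist):
    n = len(dist)
    best = None
    for perm in permutations(range(1, n)):  # fix city 0 as the start
        route = (0, *perm, 0)
        length = sum(dist[a][b] for a, b in zip(route, route[1:]))
        if best is None or length < best:
            best = length
    return best

dist = [[0, 2, 9, 10],
        [1, 0, 6, 4],
        [15, 7, 0, 8],
        [6, 3, 12, 0]]
print(shortest_tour(dist))  # 21, after enumerating (4-1)! = 6 tours
```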
Other than speed concerns, what other scalability types are important?
- Administrative
- increase number of organisations/users to easily share a single distributed system
- Functional
- enhance the system by adding new functionality at minimal effort
- Geographic
- expand from concentration in a local area to a more distributed geographic pattern, and keep performance
- Load
- expand/contract the resource pool to accommodate heavier/lighter loads or numbers of inputs
- Generation
- the ability of a system to scale up by using new generations of components (& different vendors)
What should you ask yourself when building a new system for scalability?
scalability is only possible if we architect and engineer our system to take scalability into account.
We must ask ourselves
- which axis do we expect the system to grow?
- where is redundancy required?
- how do we manage heterogeneity?
- where are the pitfalls and bottlenecks?
- etc.
What are the requirements to build a good distributed system?
- Scalability
- enlarge by adding more resources
- Elasticity
- able to provision resources at any time
- Performance
- good response time
- High availability
- avoid downtime / low downtime
- the 5 9s: 99.999% availability ≈ only about 5 minutes of downtime per year; only companies like Amazon, Google, Apple etc. operate at this level
- Maintainability
- automate deploys (DevOps)
- Monitoring
- know the status of the system
- Security
- keep the data secure
Why is monitoring so important?
because you can’t improve what you don’t measure
- measure your system to find bottlenecks
- optimize those bottlenecks
- verify the improvements
- rinse and repeat!
Why is the idea of ‘microservices’ powerful?
It is so complicated to create well-working distributed systems and software BUT… if you structure it as a collection of loosely coupled microservices, you will make your life much easier.
you make it easy to scale each microservice individually (usually in a container)
What can be said about the importance of data structure vs code?
Smart data structures and dumb code work a lot better than the other way around
importance:
Data structures > Code
- Code is easy to change
- Data schemas are difficult to migrate