Cloud Computing Flashcards

1
Q

What is cloud computing?

A

Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. The term is generally used to describe data centers available to many users over the Internet.

It is a core component for big data and data science.

2
Q

What are some of the key benefits to cloud computing?

A
  • Cost efficient: cheaper for maintenance and upgrades
  • Virtually unlimited storage, available in seconds
  • Improved backup and recovery
  • Easy access to information
  • On-demand model
3
Q

What are some of the key cons of cloud computing?

A
  • Technical issues
    • Can arise if your internet connection is not working
  • Security concerns
    • You’re sending your information to a third-party service provider such as AWS
  • Prone to attack
    • Storing info in the cloud makes you vulnerable to external hack attacks and threats
4
Q

If data science is not a science, what is it?

A
  • A methodology based on multidisciplinary knowledge
  • It is a data processing model focused on extracting insights from data using machine learning and predictive analytics
5
Q

What are the 3 roles in a Data Science team?

A
  • Data Scientist = statisticians, data managers
  • Data engineer = Data managers, database administrators
  • Data analyst = business analysts
6
Q

By definition, what is a computer?

A

An electronic machine that takes input, processes it, and returns output.

7
Q

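How does an analog computer work?

A

Analog computers represent values as continuously varying physical quantities (such as the voltage of a wave), so a signal can take any value in a range rather than one of a fixed set of digits.

They are NOT binary. Digital computers work in discrete 0s and 1s, which is simpler to use.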

8
Q

What is Moore’s law?

A

The number of transistors in a dense integrated circuit doubles approximately every two years.

Similar trends are tracked for operations per second, density, power consumption and cost (in constant dollars).
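
As a rough worked example of the doubling rule, the sketch below projects a transistor count forward in time. The function name and the starting count are made up for illustration, and the model is idealized (a perfectly steady two-year doubling period).

```python
def projected_transistors(initial_count, years, doubling_period_years=2.0):
    """Idealized Moore's-law projection: the count doubles every doubling period."""
    return initial_count * 2 ** (years / doubling_period_years)


# Hypothetical chip with 1,000 transistors, projected 10 years ahead:
# 10 years = 5 doublings, so the count grows by a factor of 2**5 = 32.
print(projected_transistors(1_000, 10))
```

Ten doublings (20 years) already give a factor of about a thousand, which is why exponential trends like this one are so hard to sustain.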

9
Q

What are the key components of a computer (system resources)?

A
  • Motherboard
    • Main part of the computer
    • Has connectors to plug the other components and allow communication between them
  • Central Processing Unit (CPU)
    • This is where the magic happens. The CPU runs arithmetic, logical, control and I/O operations, like a calculator but far more complex, at billions of operations per second
  • Memory (RAM)
    • Temporary, volatile storage area that holds the running program and its data
  • Hard Disk Drive (HDD) or Solid-State Drive (SSD)
    • Persistent data storage unit.
    • This is slower than the memory but has more capacity
  • Graphics Processing Unit (GPU)
    • Like a CPU but specialized for graphics
    • GPUs work on ARRAYS and MATRICES and can thus do many more operations at the same time than a CPU. GPUs run thousands of operations in PARALLEL, while a CPU core works through its instructions largely one at a time.
    • GPUs are now also used for physics, artificial intelligence, blockchain
  • Network Interface
    • Connects the computer to a network (LAN, Internet, Ethernet, Wi-Fi)
10
Q

What is a TPU?

A

Tensor Processing Unit

A chip designed by Google specifically for machine-learning workloads.

For those workloads it can outperform a GPU, because its hardware is built around large array and matrix (tensor) operations.

11
Q

What if you need more of a resource, say memory, but adding more to the computer is super expensive?

A

Build a cluster = connect multiple computers and pool the resources.

12
Q

What are typical bottlenecks in computing?

A
  • CPU
    • Math calculations, simulations, compression
    • Example: training AI models (make a cluster OR better, use GPU or TPU)
  • Memory
    • Databases, processing data in real time
    • Example: doing complex searches
  • Storage
    • Storing huge amount of data
    • Example: content delivery network for video (YouTube)
13
Q

When was the first message sent on the internet?

A

In 1969, from UCLA in Los Angeles to the Stanford Research Institute: “LO”

(they wanted to say “login” but it crashed)

14
Q

What are some of the main milestones for the internet?

A
  • 1950s: Electronic computers
  • 1969: First message “LO”
  • 1970s: TCP/IP Protocols, LANs (local area networks)
  • 1980: USENET (newsgroups), ethernet cable
  • 1982: SMTP (email)
  • 1983: DNS (domains)
  • 1991: WWW (invented by British scientist, Tim Berners-Lee at CERN), URLs, HTML
  • 1999: Napster (p2p protocol) for file-sharing
  • 2000: Dot-com bubble
  • 2006: Amazon S3 and EC2 (Elastic Compute Cloud) launch = start of AWS
15
Q

What are physical servers, virtual machines and containers?

A
  • Physical servers
    • Hardware + operating system (the traditional setup)
  • Virtual machine (VirtualBox, KVM, VMware)
    • Emulated computer system = virtual hardware
    • Different virtual machines, even when they share the same physical computer (partition), do not affect each other. We can have multiple “tenants” on the same machine.
  • Containers (chroot, Docker, LXC)
    • Like virtual machines BUT they share the host’s operating system kernel, which makes them lighter and better suited to parallel usage (avoiding unused capacity)
      • Say we have a machine with 16 CPUs: we would most likely run about 16 virtual machines, but we could run hundreds of containers.

https://www.youtube.com/watch?v=L1ie8negCjc

16
Q

What does “on premises” mean?

A
  • The hardware is in your own building, where you control both the hardware and the network
  • Better latency for employees in the office
  • but… limited connectivity to the outside + centralization risk

In Data science, it’s common to use local servers to train machine-learning algorithms, deep neural networks etc. Why? For speed!

And… for security, some critical systems shouldn’t even be connected to the internet.

17
Q

What is co-location in data centers?

A

You hire space in cabinets to co-locate your servers

The data center provides power, internet and physical security.

Data Centers are located in places with cheap electricity, good defenses and upstream connections.

NOTE!! You still own the hardware and will in some instances have to go to the data center for some interventions.

18
Q

What is IaaS and examples thereof?

A

Infrastructure as a Service

Provider offers instances (servers) in their cloud (virtual or physical).

  • You manage the operating system of the instance
  • No need to worry about the hardware. The provider will fix it or move you to another machine.

AWS EC2, Google Compute Engine, Microsoft Azure

  • Virtual Machines, Servers, Storage, Load Balancers, Network

Typical setups

  • All your infrastructure is IaaS (this is good for cash flow as you rent everything)
  • Own your infrastructure and use IaaS for temporary workloads (e.g. scaling up for Christmas). This is riskier in terms of security but better for cash flow
  • Use IaaS just for external storage (as a CDN, content delivery network)
19
Q

What is PaaS and examples thereof?

A

Platform as a Service

Provider offers an application platform and you just manage the settings of the service.

Cons:

  • you cannot fully control or customize the system

Examples:

  • Web hosting
  • Databases (no need to worry about scalability, monitoring, availability)
  • Heroku: deploy your software and they will host it, run it and scale it
  • Development tools
20
Q

What is SaaS and examples thereof?

A

Software as a Service

You use the software as is, and the provider runs the infrastructure and maintains the software.

Usually pay-per-use or subscriptions.

Examples:

  • CRM (Salesforce)
  • Email (Google Apps)
  • Customer Management (Zendesk)
  • Virtual desktop, Communication, Games
21
Q

What is “serverless”?

A

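A model where you deploy individual functions and the provider runs them on demand. Servers still exist, but you never provision or manage them yourself.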

Providers:

  • AWS Lambda
  • Google Cloud Functions
  • Azure Functions

Advantages:

  • Cost, if the usage is low it will be cheaper than having an instance
  • No need to tune or scale the setup
  • Simpler functions can be more productive

Disadvantages:

  • Performance and resource limits (e.g. time to start up)
  • Difficult to monitor, complexity
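
To make the idea concrete, here is a minimal sketch of what a serverless function can look like. AWS Lambda's Python runtime calls a handler with `(event, context)`; the shape of the `event` dictionary and of the returned response below is an assumption chosen for illustration, not something any provider prescribes for every trigger.

```python
import json


def handler(event, context):
    """Entry point the platform would call once per request (no server code here)."""
    # 'name' is a hypothetical field of the incoming event.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local try-out: the platform normally supplies both arguments.
print(handler({"name": "Ada"}, None))
```

The whole deployment unit is this one function, which is why low-traffic workloads can be cheaper than keeping an instance running.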
22
Q

How can you compare IaaS, PaaS and SaaS?

A
23
Q

When is something scalable?

A

“A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added”

Werner Vogels, CTO Amazon

24
Q

What does he mean by “Performance != Scalability”?

A

Performance = quality metric; the time it takes to execute one request

Scalability = the ability to maintain that performance under increasing load OR increase performance when adding resources

25
Q

What is the Big O?

A
  • Mathematical notation describing the limiting behavior of a function
  • It is used to classify algorithms according to the complexity class (how their requirements grow as the input size grows)
  • It gives us an upper bound (worst case) of how much time/space the algorithm will need
26
Q

What does this show?

A

It illustrates how fast the cost of solving a problem of size n grows, depending on the complexity class

27
Q

What makes logarithmic algorithms efficient?

A

An O(log n) algorithm is highly efficient because each step discards a constant fraction of the remaining input (half, in binary search), so the number of steps grows very slowly as n grows.

28
Q

What are logarithmic algorithms often used for?

A
  • Binary trees and binary search
    • Binary search: an algorithm that finds the position of a target value within a sorted array by repeatedly halving the search range
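
A minimal sketch of binary search, with a step counter added purely to show the logarithmic behaviour (the counter and the sample data are illustrative additions):

```python
def binary_search(sorted_values, target):
    """Return (index, steps); index is -1 when target is absent."""
    low, high = 0, len(sorted_values) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_values[mid] == target:
            return mid, steps
        if sorted_values[mid] < target:
            low = mid + 1   # discard the lower half
        else:
            high = mid - 1  # discard the upper half
    return -1, steps


values = list(range(0, 2_000_000, 2))  # one million sorted even numbers
index, steps = binary_search(values, 1_337_000)
print(index, steps)
```

With one million elements the loop never needs more than 20 iterations, because 2^20 is just over one million.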
29
Q

When is a linear algorithm optimal?

A
  • in situations where the algorithm has to sequentially read its entire input
    • Example: a procedure that adds up all elements of a list requires time proportional to the length of the list
30
Q

When do you use quadratic algorithms? O(n^2)

A
  • Common with algorithms that involve nested iterations over the data set
    • Examples:
      • multiplying two n-digit numbers by a simple algorithm
      • simple sorting algorithms such as bubble sort, selection sort and insertion sort
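
The nested iteration is easy to see in bubble sort. The sketch below also counts comparisons (an illustrative addition) to show that the work grows as n(n-1)/2, i.e. O(n^2):

```python
def bubble_sort(values):
    """Return a sorted copy of values and the number of comparisons made."""
    items = list(values)
    comparisons = 0
    n = len(items)
    for i in range(n - 1):             # outer pass over the data
        for j in range(n - 1 - i):     # inner pass: the nested iteration
            comparisons += 1
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items, comparisons


for size in (10, 100, 1000):
    _, count = bubble_sort(range(size, 0, -1))
    print(size, count)
```

Multiplying the input size by 10 multiplies the number of comparisons by roughly 100.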
31
Q

What do you often use polynomial for? O(n^c)

A
  • Public Key Cryptography
    • Multiplying two large primes takes polynomial time, but no polynomial-time algorithm is known for finding the prime factors of the result, so we use such numbers on purpose to make decryption without the key infeasible

Running time is upper bounded by a polynomial expression: T(n) = O(n^c)

32
Q

What has been used to make algorithms win in chess and go?

A

Exponential algorithms T(n) = O(2^(n^c))

33
Q

How can you explain the use case of Factorial O(n!) ?

A

With current hardware, the practical limit is around n = 21, since 21! is already about 5.1 × 10^19 operations.
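
To get a feel for how quickly n! grows, the short sketch below prints the number of possible orderings of n items. The helper name and the sample sizes are arbitrary choices for illustration.

```python
import math


def orderings(n):
    """Number of ways to arrange n distinct items, i.e. n!."""
    return math.factorial(n)


for n in (5, 10, 15, 21):
    print(f"{n:>2} items -> {orderings(n):,} orderings")
```

Any algorithm that has to try every ordering (brute-force route planning, for example) hits this wall after only a couple of dozen items.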

34
Q

Other than speed concerns, what other scalability types are important?

A
  • Administrative
    • increase number of organisations/users to easily share a single distributed system
  • Functional
    • enhance the system by adding new functionality at minimal effort
  • Geographic
    • expand from concentration in a local area to a more distributed geographic pattern, and keep performance
  • Load
    • expand/contract the resource pool to accommodate heavier/lighter loads or numbers of inputs
  • Generation
    • the ability of a system to scale up by using new generations of components (& different vendors)
35
Q

What should you ask yourself when building a new system for scalability?

A

Scalability is only possible if we architect and engineer our system to take scalability into account.

We must ask ourselves:

  • along which axis do we expect the system to grow?
  • where is redundancy required?
  • how do we manage heterogeneity?
  • where are the pitfalls and bottlenecks?
  • etc.
36
Q

What are the requirements to build a good distributed system?

A
  • Scalability
    • enlarge by adding more resources
  • Elasticity
    • able to provision resources at any time
  • Performance
    • good response time
  • High availability
    • avoid downtime / low downtime
    • the “five nines” = 99.999% uptime (roughly 5 minutes of downtime per year); only the likes of Amazon, Google, Apple etc. reach this level
  • Maintainability
    • automate deploys (DevOps)
  • Monitoring
    • know the status of the system
  • Security
    • keep the data secure
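
As a rough worked example of what each extra “nine” of availability buys, the sketch below converts an uptime percentage into the downtime it allows per year. It assumes a 365-day year; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime allowed in the period while still meeting the availability target."""
    return period_minutes * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% uptime -> {max_downtime_minutes(target):.1f} minutes of downtime per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the last nines are so expensive to achieve.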
37
Q

Why is monitoring so important?

A

because you can’t improve what you don’t measure

  • measure your system to find bottlenecks
  • optimize those bottlenecks
  • verify the improvements
  • rinse and repeat!
38
Q

Why is the idea of ‘microservices’ powerful?

A

Building distributed systems and software that work well is complicated, BUT if you structure the system as a collection of loosely coupled microservices, you make your life much easier.

It becomes easy to scale each microservice individually (usually in a container).

39
Q

What can be said about the importance of data structure vs code?

A

Smart data structures and dumb code work a lot better than the other way around

importance:

Data structures > Code

  • Code is easy to change
  • Data schemas are difficult to migrate
40
Q

What are MQs?

A

Message Queues

Software services to pass messages between programs/components

Used to scale components horizontally => all workers read from one queue and write to another queue

Examples: AMQP (a protocol), AWS SQS, RabbitMQ, Redis
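
The read-from-one-queue, write-to-another pattern can be sketched with Python's standard library. This is an in-process stand-in using `queue.Queue` and threads, not RabbitMQ or SQS; the task strings and the upper-casing "work" are made up for illustration.

```python
import queue
import threading


def worker(inbox, outbox):
    """Read messages from inbox, process them, write results to outbox."""
    while True:
        message = inbox.get()
        if message is None:          # sentinel: no more work for this worker
            break
        outbox.put(message.upper())  # placeholder for real processing


inbox, outbox = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(inbox, outbox)) for _ in range(3)]
for thread in workers:
    thread.start()

for task in ["resize image", "send email", "index page"]:
    inbox.put(task)
for _ in workers:
    inbox.put(None)                  # one sentinel per worker
for thread in workers:
    thread.join()

results = sorted(outbox.get() for _ in range(3))
print(results)
```

Scaling horizontally here just means starting more workers on the same queue; the producer does not need to know how many there are.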

41
Q

What is the difference between vertical and horizontal scalability? (IMPORTANT FOR EXAM)

A
  • Vertical
    • add better hardware to the single node (single computer)
      • faster CPU, more CPUs, more memory, better disk
      • Often the easier solution (when possible)
      • very common for databases and other stateful systems
      • BUT
        • there are hardware limits
        • diminishing returns (gets expensive to improve)
        • involves downtime
          • so keep a hot spare, e.g. an RDS Multi-AZ “standby” replica with automatic failover
  • Horizontal
    • using multiple nodes (computers in a cluster)
    • common on stateless web servers
    • high availability because one node can crash without problems
    • cheap solution (cheap hardware, cheap virtual machines)
    • BUT…
      • architecture must support horizontal scaling
      • management overhead cost
42
Q

What are the pros and cons of vertical scaling?

A
  • Often the easier solution (when possible)
  • BUT
    • there are hardware limits
    • diminishing returns (gets expensive to improve)
    • involves downtime
      • so keep a hot spare, e.g. an RDS Multi-AZ “standby” replica with automatic failover

Very common for databases and other stateful systems

43
Q

What are the pros and cons of horizontal scaling?

A
  • high availability because one node can crash without problems
  • cheap solution (cheap hardware, cheap virtual machines)
  • BUT…
    • architecture must support horizontal scaling
    • management overhead cost

common on stateless web servers

44
Q

What are important concepts within distributed file systems and databases?

A
  • Distributed FILE SYSTEMS
    • File Transfer Protocol (FTP)
    • Content Delivery Network (CDNs)
    • Amazon Simple Storage Service (S3)
    • Hadoop Distributed File System (HDFS)
  • Distributed DATABASES
    • Relational vs. Non-relational
    • ACID Properties
    • Scaling Relational DBs
45
Q

What is FTP?

A

File Transfer Protocol

  • Standard network protocol for the transfer of files between a client and server
    • It allows authentication
    • By default it is insecure (credentials and data travel unencrypted), so use SFTP instead
    • Can be scaled to some extent
    • Very simple and also too simple for many use cases
46
Q

What are CDNs?

A
  • A CDN is a geographically distributed network of proxy servers and their data centers (near the end-users at the ISP)
    • From Denmark, you get your content on Netflix, Spotify etc. from a server close by, in Denmark, Germany, Norway, Sweden or the UK.
  • CDNs serve most of the internet content such as live and on-demand video-streaming, downloadable files, software, updates etc. and generally content on mobile and web.
47
Q

What is AWS S3?

A

Amazon Simple Storage Service

  • Provides storage through web services interfaces and APIs
  • Store arbitrary files up to XX terabytes
  • Guarantees 99.9 % monthly uptime (less than 43m of downtime monthly)

Competitors: Google Cloud Storage and Microsoft Azure Storage

48
Q

What is Hadoop?

A

Hadoop is an open-source framework for distributed storage and processing. Its Hadoop Distributed File System (HDFS) is a distributed, scalable and portable file system written in Java

Stores large files, replicated on commodity machines (small and cheap)

49
Q

What is a relational database?

A
  • Great for typical tabular data (think Excel structure)
  • Examples: MySQL, PostgreSQL
  • It is really difficult to distribute relational databases
  • ACID
    • Atomicity, Consistency, Isolation, Durability
    • Guarantees validity even in the event of errors, power failures, etc.
50
Q

What is a non-relational database?

A

NoSQL

  • Easier to distribute (scale horizontally)
    • (easier because most of them relax the ACID guarantees)
  • Other models
    • Column: Cassandra, HBase
    • Document: CouchDB, MongoDB, IBM Domino
    • Key-value: Dynamo, Redis, Riak
    • Graph: Neo4J, OrientDB
    • Multi-model: ArangoDB, Couchbase
  • Usually NO full ACID guarantees!!
  • CAP theorem (choose 2 of 3)
    • Consistency
    • Availability
    • Partition-Tolerance
51
Q

What are the properties of ACID?

A

For relational databases…

  • Atomicity
    • Each transaction is “all or nothing” => if part of the transaction fails, the entire transaction fails and the database state is left unchanged
    • Must provide atomicity in each and every situation, including power failures, errors and crashes
  • Consistency
    • One transaction will bring the database from one valid state to another WHILE no programming errors will result in the violation of any defined rules (including constraints, cascades, triggers and so forth)
  • Isolation
    • concurrent transactions leave the database in the same state as if they had been executed sequentially => one after the other
  • Durability
    • Once a transaction has been committed, it will remain so, even in the event of power loss, crashes or errors => SQL statements are stored permanently after execution
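
Atomicity can be seen in a small experiment with SQLite (bundled with Python). The table, account names and amounts are made up for illustration; the sketch relies on `sqlite3`'s connection context manager, which commits on success and rolls back when an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move amount between accounts; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, receiver)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, sender)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it must fail
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to bob is applied first and then undone when the debit violates the constraint, so the database never shows money that appeared from nowhere.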
52
Q

How can you scale relational databases?

A

There are many approaches to scale them

  • Partitioning of large tables
  • Primary / Replicas
  • Other Tricks:
    • Add indexes, improve the schema
    • UUIDs instead of sequential IDs
    • Preload data in a cache
    • Queries in batches
    • Persistent connections
53
Q

What is “primary and replicas”?

A

referred to as master/slave by some databases

1 primary and n replicas

  • Write in the primary (will replicate)
  • Read on the replicas
    • especially, the complex and expensive queries such as backups, consolidations, aggregations etc.

Hot spare:

  • If the primary dies, promote a replica to be the new primary => minimum downtime
54
Q

What is partitioning of databases and examples hereof?

A

A partition is a division of a logical database into independent parts

  • Split in databases or in tables
  • split according to a given criterion

Examples of partition practices:

  • Round-Robin partitioning
    • simplest strategy
    • with n partitions, assign each row (i) to a partition, sequentially (i mod n)
  • Range partitioning
    • the key is inside a certain range (example, range of zip codes)
  • List partitioning
    • assigned from a list of values (example: a partition containing a list of countries. If the key is one of these countries, the partition is used)
  • Hash partitioning
    • Applies a hash function (“signature”) to yield the partition number
  • Composite partitioning
    • Use combinations of the other partitioning schemes
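
The strategies above can be sketched as small functions that map a row or key to a partition number. All names, bounds and country lists here are hypothetical; real databases implement these schemes internally.

```python
import zlib


def round_robin_partition(row_index, n_partitions):
    """Row i goes to partition i mod n."""
    return row_index % n_partitions


def range_partition(zip_code, upper_bounds):
    """upper_bounds is sorted; partition k holds keys below upper_bounds[k]."""
    for partition, bound in enumerate(upper_bounds):
        if zip_code < bound:
            return partition
    return len(upper_bounds)  # last partition takes everything else


def list_partition(country, country_lists):
    """Return the partition whose list of values contains the key."""
    for partition, countries in enumerate(country_lists):
        if country in countries:
            return partition
    raise KeyError(country)


def hash_partition(key, n_partitions):
    """Hash the key, then reduce the hash to a partition number."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings
    return zlib.crc32(key.encode("utf-8")) % n_partitions


print(round_robin_partition(7, 3))
print(range_partition(2100, [3000, 6000]))
print(list_partition("DK", [{"DK", "SE", "NO"}, {"DE", "FR"}]))
print(hash_partition("user-42", 4))
```

Composite partitioning would chain two of these, for example range-partitioning by date and then hash-partitioning within each range.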
55
Q

What is the CAP Theorem?

A

Choose 2 of 3

  • Consistency
    • Every read receives the most recent write or an error
  • Availability
    • Every request receives a non-error response - without guarantee that it contains the most recent write
  • Partition tolerance
    • The system continues to operate despite messages being dropped/delayed by the network between nodes

With…

  • Consistency + Availability
    • Only works while the network is healthy, and network failures do happen
  • Consistency + Partition Tolerance
    • You’ll get errors/timeouts
  • Availability + Partition Tolerance
    • You’ll get out-of-date responses
56
Q

What is Open-source?

A

type of computer software in which the copyright holder grants users the right to study, change, and distribute the software for any purpose

57
Q

What characterizes relational databases?

A
  • designed for all purposes
  • ACID (Atomicity, Consistency, Isolation, Durability)
  • Mathematical background
  • Vertically scalable (but hard to scale horizontally = over multiple computers)
58
Q

What does SQL mean?

A

Structured Query Language

59
Q

What characterizes NoSQL databases?

A
  • Better read as “not only SQL” than “no SQL”
  • Non-relational
  • Cluster friendly, Horizontal scaling
  • Schema-less = No burden of up-front schema design
  • Built for the 21st-century web
  • Open source
  • Minimum overhead
  • Solution to impedance mismatch
  • Examples: Redis, MongoDB, Cassandra etc.
60
Q

What are the pros and cons of SQL databases?

A
61
Q

What are the pros and cons of NoSQL databases?

A
62
Q

What are the file formats used in NoSQL?

A

JSON, XML,

63
Q

What are the 4 different aggregate data model families?

A
  • Key-value data models
  • Column-family
  • Document-based
  • Graph
64
Q

What is a 404 error?

A

HTTP status code for “Not Found”: the server cannot find the requested page or file

65
Q

What is a 403 error?

A

HTTP status code for “Forbidden”: the server refuses the request, for example because the client does not look human (looks like a computer)
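
Python's standard library knows the official names of these status codes, which is handy when a scraper needs to log what went wrong. The `describe` helper is a made-up name for illustration.

```python
from http import HTTPStatus


def describe(status_code):
    """Return the code together with its standard reason phrase."""
    status = HTTPStatus(status_code)
    return f"{status.value} {status.phrase}"


for code in (200, 403, 404):
    print(describe(code))
```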

66
Q

What is HTML and CSS used for respectively? (at basic form)

A

Both used for front-end development:

  • HTML (Hypertext Markup Language)
    • Gives websites structure and stores the content
      • Our target for web scraping
    • HTML contains tags and references to style….
  • CSS (Cascading Style Sheets)
    • Style… gives format to content and provides visualization opportunities (style, font, color, border, images, positioning etc.)
67
Q

What are the key questions that make public, private and hybrid clouds differ?

A
  • Who owns the hardware?
  • Who can customize the infrastructure?
  • Flexibility to scale?
  • Security?
  • Cost? Is it predictable?
68
Q

What characterizes a PUBLIC cloud?

A
  • SHARED physical hardware
    • owned and operated by a 3rd-party provider
    • Multi-tenant environment with pay-as-you-grow scalability
    • Best for non-sensitive, public-facing operations and unpredictable traffic
  • Examples: AWS, MS Azure, Google Cloud
69
Q

What characterizes a PRIVATE cloud?

A
  • Infrastructure DEDICATED to your business
    • hosted on-site or in data center
  • Greater level of control and security
    • for strict regulations and governance obligations
  • Customizable
  • Best for sensitive, business-critical operations
  • Example: OpenStack
70
Q

What characterizes a HYBRID cloud?

A
  • Combine public cloud with private cloud
  • Leverage the best of both worlds:
    • Public cloud for non-sensitive operations
    • Private cloud for business-critical operations
    • Highly flexible/agile and cost-effective solution

You use different providers to achieve this

71
Q

What is OpenStack and who uses it?

A

It is a free software platform for cloud computing that offers servers, services and resources to deploy a private cloud.

Clients:

  • Auto companies, Hollywood, airlines, banks (Visa, PayPal go through OpenStack), supermarkets (Walmart, Target etc.), telecom (e.g. AT&T), energy companies, academic researchers, Bloomberg = everybody, including huge companies
72
Q

What is the definition of “free software”?

A
  1. Run the program for any purpose
  2. Study how the program works, and change it to make it do what you wish
  3. Redistribute and make copies so you can help your neighbor
  4. Improve the program, and release your improvements to the public, so that the whole community benefits
73
Q

What are some great success stories of companies that have worked intelligently with cloud architecture?

A
  • Dropbox
    • Simplicity
  • Instagram
    • Scaling slowly by invitation and thus also managing costs to AWS
  • WhatsApp
    • MVP first and scale from there
  • Snapchat
    • Invested $3B in Google Cloud ($2B) and AWS ($1B) after IPO
  • Netflix
    • Use multiple public cloud providers but also have their own private
    • Multi-CDN near end-users
74
Q

What is stackshare.io?

A

A great online tool where you can see the tech stack of many startups and larger ventures such as Netflix, Airbnb, Spotify etc.

75
Q

What can you do to prevent outages?

A

Even minutes of downtime can be hugely costly for companies, as seen with, for example, Gmail, PayPal (1h, but had to reimburse merchants for lost sales), Hotmail (deleting 17k accounts), AWS (major outage, 2011), Salesforce, etc.

Ways to prevent it:

  1. Learn from your mistakes
    1. Post-mortems used to build up an institutional memory
  2. Expect failures
    1. Use Simian Army (inc. Chaos Monkey) to test your infrastructure
    2. Use multiple cloud providers
  3. Transparency helps rebuilding trust
    1. Should it still happen, make sure to write a proper post-mortem
76
Q

What is Chaos Monkey and what does it do?

A

Chaos Monkey is an open-source software tool that was developed by Netflix engineers to test the resiliency and recoverability of their Amazon Web Services (AWS) infrastructure. The software simulates failures of instances of services running within Auto Scaling Groups (ASG) by shutting down one or more of the virtual machines.

It works by intentionally disabling computers in Netflix’s production network to test how remaining systems respond to the outage.

Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.

77
Q

What are post-mortems and why do you write them?

A

A record explaining what went right and wrong over the course of a project. Always blameless and stating how we can prevent failures from happening again.

Goal: build up an institutional memory and develop a set of best practices.

Also great for trust-building.

78
Q

What is the Simian Army?

A

The Simian Army is a collection of open source cloud testing tools created by the online video streaming company, Netflix. The tools allow engineers to test the reliability, security, resiliency and recoverability of the cloud services that Netflix runs on Amazon Web Services (AWS) infrastructure.

79
Q

How can you compare oil spills to data spills?

A

“Data spills occur with the regularity of oil spills. The victim of identity theft, bogged down in unwanted credit cards and bills, is just as trapped and unable to fly as the bird caught in the oil slick, its wings coated with a glossy substance from which it struggles to free itself.”

80
Q

What are intellectual property, copyright, patent law and trademark law?

A
  • Intellectual property
    • catch-all phrase for the concepts below
  • Copyright
    • promoting authorship and art (covers the details of expression of a work)
  • Patent law
    • promote the publication of useful ideas and provide a great incentive in the form of a temporary monopoly
  • Trademark law
    • enabling buyers to know what they are buying (brands, advertising)
81
Q

What is GNU GPL?

A

GNU General Public License is a widely used free software license, which guarantees end users the freedom to RUN, STUDY, REDISTRIBUTE and IMPROVE the software.

The GPL is a copyleft license: if you distribute a modified version, you must disclose its source code and release it under the GPL as well. Under GPL you can’t sub-license, meaning you can’t change any of the original license terms or introduce any of your own. You’re also required to state all the changes you make to the original code.

82
Q

What is BSD License?

A

Family of permissive free software licenses, imposing minimal restrictions on the use and redistribution of covered software.

BSD is more relaxed / free so you can do what you want.

83
Q

What happens to BSD and GNU licenses when working in the cloud?

A

Software is not run on the user’s machine but in the cloud.

Hence, the freedoms don’t apply as the user can’t see the code, study it or adapt it.

84
Q

What is AFFERO GPL?

A
  • The GNU Affero General Public License is a modified version of the ordinary GNU GPL version 3.
  • It has one added requirement: if you run a modified program on a server and let other users communicate with it there, your server must also allow them to download the source code corresponding to the modified version running there.
85
Q

What is the idea of “open data”?

A

“Some data should be freely available to everyone to use and re-publish as they wish, without restrictions from copyright, patents or other mechanisms of control”

Used to be a requirement for Science. Now it is discussed.

86
Q

What is the idea of “open government”?

A

“Citizens have the right to access the documents and proceedings of the government to allow for effective public oversight”

important for public scrutiny - especially with finances

87
Q

What does Net Neutrality imply?

A
  • ISPs must treat all data on the internet the same
  • Don’t discriminate or charge differently by user, content, site…
  • The internet was built with net neutrality, as it allows for innovation, competition and equality
88
Q

What is a brute force attack?

A

A systematic approach to hacking passwords by trying all combinations.

optimized by prioritizing likely possibilities through frequency tables, dictionary attacks, and most common passwords
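
Why “trying all combinations” quickly becomes impractical can be shown with a little arithmetic: the number of candidates is the alphabet size raised to the password length. The function names and the guessing rate below are assumptions for illustration only.

```python
def keyspace_size(alphabet_size, length):
    """Number of candidate passwords of exactly this length."""
    return alphabet_size ** length


def worst_case_seconds(alphabet_size, length, guesses_per_second):
    """Time to try every candidate at a fixed (hypothetical) guessing rate."""
    return keyspace_size(alphabet_size, length) / guesses_per_second


RATE = 1_000_000_000  # assumed: one billion guesses per second

print(keyspace_size(10, 4))               # 4-digit PIN
print(keyspace_size(26, 8))               # 8 lowercase letters
print(worst_case_seconds(26, 8, RATE))    # seconds for 8 lowercase letters
print(worst_case_seconds(62, 12, RATE))   # seconds for 12 letters and digits
```

This is also why attackers prioritize likely passwords first: the full keyspace of a long password is far too large to walk through.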

89
Q

What is digital phishing?

A

When an attacker poses as a trustworthy entity to trick you into handing over sensitive information.

90
Q

What is spear-phishing?

A

Targeted attacks where the attacker gathers personal information about a specific target.

This is generally very successful.

Can be targeted at executives (in such cases CEO fraud / whaling)

91
Q

What is MITM?

A

Man-in-the-middle

When two parties communicate with each other but an attacker sits in the middle, collecting credentials and altering messages.

You need encrypted message services.

HTTPS > HTTP

92
Q

In which 4 ways have computers changed over time?

A
  • Mechanical => Electronic
  • Manual => Automatic
  • Analog => Digital
  • Single-purpose => General-purpose
93
Q

What is important to remember regarding redundancy in regards to scalability?

A

Adding redundancy should not deteriorate the performance! (then it is not scalable)

Redundancy = duplication of critical components in order to improve system reliability

94
Q

What characterizes constant algorithms and when are they useful?

A

The value of T(n) is bounded by a value that does not depend on the size of the input.

Examples

  • Accessing any single element in an array
  • Determining if an integer is odd or even
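
Both examples can be written as tiny functions whose cost does not depend on how much data there is. The helper names are made up for illustration.

```python
def is_even(n):
    """A single remainder check, however many numbers you have elsewhere."""
    return n % 2 == 0


def first_and_last(values):
    """Index lookups in a Python list take the same time for any list length."""
    return values[0], values[-1]


print(is_even(10), is_even(7))
print(first_and_last(list(range(1_000_000))))
```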
95
Q

What are the components of ACID?

A
  • Atomicity
  • Consistency
  • Isolation
  • Durability

Relates to relational databases

96
Q

What is Data privacy?

A

“The relationship between the collection and dissemination of data, technology, the public expectation of privacy, and the legal and political issues surrounding them”

97
Q

What does “bad” data science concern?

A

“If you torture the data long enough, it will confess to anything”

=> Fake news

98
Q

What is internet tracking?

A