Big Data Lecture 01 Introduction Flashcards
What is the main learning objective of the course?
Learn to query gigantic amounts of data even when it is a bit messy.
How is Data Science similar to Physics?
It is an epistemic science of artificial data: Data Science relates to Computer Science the way Physics relates to Mathematics.
What was the first way humans transmitted data, what was its problem, and how was it solved?
People would speak or sing, but the message got distorted over time. This was solved by writing.
What was the first data storing format? What is its problem and how was it solved?
Tables on clay tablets; tables are the most natural form of storing data.<br><br>Copying them was problematic; this was solved by the printing press.
How was data stored in computers in history?
- 1960s: file-based systems
- 1970s: relational databases
- 2000s: NoSQL era
Three Vs of Big Data
- Volume
- Variety
- Velocity
- (Veracity)
Why do we store more data?
<ul><li>We can: storage is cheap.</li><li>It carries value.</li><li>Combined data is worth more than the sum of its parts.</li><li>We need data totality: some sites only operate well if they have all the data.</li></ul>
Name prefix for unit: 1 000 (3 zeros)
kilo (k)
Name prefix for unit: 1 000 000 (6 zeros)
Mega (M)
Name prefix for unit: 1 000 000 000 (9 zeros)
Giga (G)
Name prefix for unit: 1 000 000 000 000 (12 zeros)
Tera (T)
Name prefix for unit: 1 000 000 000 000 000 (15 zeros)
Peta (P)
Name prefix for unit: 1 000 000 000 000 000 000 (18 zeros)
Exa (E)
Name prefix for unit: 1 000 000 000 000 000 000 000 (21 zeros)
Zetta (Z)
Name prefix for unit: 1 000 000 000 000 000 000 000 000 (24 zeros)
Yotta (Y)
Name prefix for unit: 1 000 000 000 000 000 000 000 000 000 (27 zeros)
Ronna (R)
Name prefix for unit: 1 000 000 000 000 000 000 000 000 000 000 (30 zeros)
Quetta (Q)
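To practise the prefixes above, here is a minimal Python sketch that renders a byte count with the largest decimal prefix that fits. The name `format_bytes` and the one-decimal rounding are illustrative choices, not part of the lecture.

```python
# Decimal SI prefixes from the cards above, as (symbol, power of ten),
# ordered from largest to smallest.
SI_PREFIXES = [
    ("Q", 30), ("R", 27), ("Y", 24), ("Z", 21), ("E", 18),
    ("P", 15), ("T", 12), ("G", 9), ("M", 6), ("k", 3),
]


def format_bytes(n: int) -> str:
    """Render a non-negative byte count with the largest SI prefix that fits."""
    for symbol, power in SI_PREFIXES:
        if n >= 10 ** power:
            return f"{n / 10 ** power:.1f} {symbol}B"
    return f"{n} B"  # smaller than 1 kB: no prefix needed


print(format_bytes(1_500))        # 1.5 kB
print(format_bytes(2 * 10**15))   # 2.0 PB
```

Note that these are powers of ten (1 kB = 1 000 B), matching the cards, not the binary units (KiB, MiB, ...).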
What are examples of different data shapes?
<ul><li>Tables,</li><li>trees,</li><li>graphs,</li><li>cubes,</li><li>text (unstructured).</li></ul>
What is capacity?
How much data we can store.
What is throughput?
How fast we can transmit data.
What is latency?
How long we wait until we start receiving data.
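The three definitions above combine into a simple first-order model: the time to fetch data is roughly the latency plus the size divided by the throughput. The sketch below assumes this simplified model; the function name and the example numbers (10 ms latency, 100 MB/s throughput) are hypothetical, not values from the lecture.

```python
def transfer_time(size_bytes: float, latency_s: float,
                  throughput_bytes_per_s: float) -> float:
    """Simplified model: wait for the first byte, then stream the rest."""
    return latency_s + size_bytes / throughput_bytes_per_s


# Hypothetical device: 10 ms latency, 100 MB/s throughput.
LATENCY_S = 0.01
THROUGHPUT = 100e6

small = transfer_time(1e3, LATENCY_S, THROUGHPUT)  # 1 kB read
large = transfer_time(1e9, LATENCY_S, THROUGHPUT)  # 1 GB read

print(f"1 kB: {small:.5f} s")  # dominated by latency
print(f"1 GB: {large:.2f} s")  # dominated by throughput
```

Under this model, small reads are bounded by latency and large reads by throughput, which is why both quantities matter alongside capacity.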
What progress has been made in capacity, throughput and latency in the last 70 years? What does this mean?
<ul><li>Capacity 23 000 000 000x,</li><li>Throughput 20 800x,</li><li>Latency 144x.</li></ul>
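This is a big problem: capacity has grown far faster than throughput, and throughput far faster than latency has improved, so reading everything stored on a single device keeps taking longer. We now need to parallelize.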
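The imbalance can be made concrete with a little arithmetic on the factors above. This is a rough sketch that assumes the time to read a whole device scales as capacity divided by throughput; the variable names are illustrative.

```python
# Improvement factors over the last 70 years, taken from the card above.
capacity_growth = 23_000_000_000
throughput_growth = 20_800
latency_growth = 144

# Time to read a full device scales as capacity / throughput, so this ratio
# says how much longer a full scan takes today relative to 70 years ago.
full_scan_slowdown = capacity_growth / throughput_growth

# Throughput has also outpaced latency by this factor.
throughput_vs_latency = throughput_growth / latency_growth

print(f"Full scan takes about {full_scan_slowdown:,.0f}x longer")
print(f"Throughput improved about {throughput_vs_latency:.0f}x more than latency")
```

Under this assumption a full scan of one device takes on the order of a million times longer than it used to, which is the motivation for spreading data over many devices and reading them in parallel.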
What is Big Data?
A portfolio of technologies designed to <i>store, manage and analyze data</i> that is too large to fit on a single machine, while accommodating the growing discrepancy between capacity, throughput and latency.