Big Data for Dummies Flashcards

1
Q

What are the three Vs of big data?

A

Extremely large volumes of data, extremely high velocity of data, and extremely wide variety of data.

2
Q

Why is big data important?

A

It enables organizations to gather, store, manage, and manipulate vast amounts of data at the right speed, at the right time, to gain the right insights.

3
Q

Data warehouses vs. data marts

A

Data warehouses could be too complex and large, and they didn’t offer the speed and agility that the business required. The answer was a further refinement of the data being managed: data marts. Data marts focus on specific business issues and are more streamlined, supporting the business need for speedy queries.

4
Q

Data warehouses are typically fed in…

A

Batch intervals, such as daily or weekly. This limits their usefulness in real-time business and consumer environments.

5
Q

What is a BLOB?

A

A binary large object: it stores an unstructured data element. An ODBMS (object database management system) stores the BLOB as an addressable set of pieces so that we can see what is in there.

6
Q

What is the advantage of an object database?

A

Includes a programming language and a structure for the data elements so that it is easier to manipulate various data objects without programming and complex joins.

7
Q

What are some of the technologies at the heart of big data? (4)

A
  1. Virtualization
  2. Parallel processing
  3. Distributed file systems
  4. In-memory databases
8
Q

Different approaches to handling data exist based on whether it is data in motion or data at rest. What is data in motion vs. data at rest?

A

Data in motion is analyzed as it is generated, for example when a company analyzes the quality of its products during the manufacturing process to avoid costly errors. Data at rest is stored data that is analyzed later, for example when a business analyst studies customers’ current buying patterns.
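
The contrast can be sketched in a few lines of Python. This is only an illustration: the function names, the sensor readings, the threshold, and the purchase amounts are all hypothetical and not taken from the book. The first function reacts to each reading as it arrives (data in motion); the second works on a complete stored history (data at rest).

```python
from statistics import mean


def monitor_stream(readings, limit):
    """Data in motion: inspect each reading as it arrives and flag it immediately."""
    for reading in readings:
        if reading > limit:
            yield reading  # surface the problem before the next item arrives


def analyze_stored(purchases):
    """Data at rest: the full history is already stored, so analyze it as a whole."""
    return mean(purchases)


# Hypothetical sensor readings arriving one at a time from a production line.
sensor_feed = iter([9.8, 10.1, 12.7, 10.0, 13.2])
alerts = list(monitor_stream(sensor_feed, limit=12.0))

# Hypothetical stored purchase amounts for one customer segment.
stored_purchases = [20.0, 35.0, 50.0]
average_purchase = analyze_stored(stored_purchases)
```

The generator never needs the whole feed in memory, which is the property that matters for data in motion; the stored analysis can afford to look at everything at once.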

9
Q

Is big data a single technology? What does it help companies gain?

A

No. Big data is a combination of old and new technologies that helps companies gain actionable insight.

10
Q

What are the 5 components of the cycle of big data management?

A
  1. Capture
  2. Organize
  3. Integrate
  4. Analyze
  5. Act
11
Q

Why is validation an important issue in big data management?

A

If your organization is combining data sources, it is critical that you have the ability to validate that these sources make sense when combined. Also, certain data sources may contain sensitive information, so you must implement sufficient levels of security and governance.

12
Q

Where would you start in big data management?

A

Start with the problem you’re trying to solve. That will dictate the kind of data that you need and what the architecture might look like.

13
Q

How do you determine what performance requirements will be when setting up a big data management system?

A

Your needs will depend on the nature of the analysis you are supporting. You will need the right amount of computational power and speed. Some analysis will be real time, but you will be storing some amount of data as well. Questions to ask:
- How much data will my organization need to manage today and in the future?
- How often will my organization need to manage data in real time or near real time?
- How much risk can my organization afford? Is my industry subject to strict security, compliance, and governance requirements?

14
Q

Why do you need redundancy in your data management system?

A

So that you are protected from unanticipated latency and downtime.

15
Q

What is in a big data tech stack?

A
16
Q

What makes big data big?

A

It relies on picking up lots of data from lots of sources.

17
Q

Why are APIs important in the big data stack?

A

To get massive amounts of data in, you need integration. Open application programming interfaces (APIs) will be core to any big data architecture. Interfaces exist at every level and between every layer of the stack. Without integration services, big data can’t happen.

18
Q

Why does big data need different infrastructure than traditional data?

A

To support an unanticipated or unpredictable volume of data, big data infrastructure is based on a distributed computing model. This means that data may be physically stored in many different locations and linked together through networks, a distributed file system, and various big data analytic tools and applications.

19
Q

What is a distributed computing model?

A

A model in which data may be physically stored in many different locations and linked together through networks, a distributed file system, and various big data analytic tools and applications.

20
Q

Why is redundant physical infrastructure important?

A

Because we’re dealing with so much data from so many different sources. Redundancy comes in many forms. If your company has created a private cloud, you will want to have redundancy built within the private environment so that it can scale out to support changing workloads. In some cases, this redundancy may come in the form of a Software as a Service (SaaS) offering that allows companies to do sophisticated data analysis as a service.

21
Q

Why use SaaS for redundant physical infrastructure?

A

Lower costs, quicker startup, and seamless evolution of the underlying technology.

22
Q

Why is security infrastructure important?

A

If you have to comply with regulations or keep customer information secure, you will need to take into account who is allowed to see the data and under what circumstances they are allowed to do so.

23
Q

What is an operational data source?

A

In big data you have to incorporate all the data sources that will give you a complete picture of your business and show how the data impacts the way you operate your business. In the past this was highly structured data managed in a relational database. But operational data now has to encompass a broader set of data sources, including unstructured sources such as customer and social media data.

24
Q

What characteristics does a good operational data source have?

A
  1. Represents systems of record that keep track of the critical data required for real-time, day-to-day operation of the business
  2. Is continually updated based on transactions wherever they take place
  3. Blends structured and unstructured data
  4. Scales to support many users on a consistent basis
25
Q

What do MapReduce-type computing technologies accomplish?

A

They provide the ability to process massive amounts of data efficiently, cost-effectively, and in a timely fashion.

26
Q

What does MapReduce do?

A

Designed by Google to efficiently execute a set of functions against a large amount of data in batch mode. The map component distributes the programming problem or tasks across a large number of computers; the reduce component then aggregates the results back together.
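
As a minimal single-machine sketch of the idea (not Google's or Hadoop's actual implementation), the classic word-count example below runs a map step, groups the emitted pairs by key, and then runs a reduce step. The function names and the sample documents are assumptions made for illustration; in a real cluster the map calls would run in parallel on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for an input split handled by a separate worker.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data at rest"])
```

Because each map call depends only on its own document and each reduce call only on one key, both phases can be spread across many machines without coordination beyond the grouping step.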

27
Q

What does Big Table do?

A

Developed by Google as a distributed storage system intended to manage highly scalable structured data. It stores huge volumes of data across commodity servers.

28
Q

What does Hadoop do?

A

Derived from MapReduce and Big Table. Hadoop allows applications based on MapReduce to run on large clusters of commodity hardware. Designed to parallelize data processing across computing nodes to speed computation and hide latency. Two major components:

  1. Massively scalable distributed file system that can support huge amounts of data
  2. Massively scalable MapReduce engine that computes results in batch.
29
Q

What changes between dealing with lots of data previously and dealing with it now in big data?

A

Big data changes what you can do with that information. Organizations can anticipate and solve business problems and react to opportunities.

30
Q

When would you need to deal with data in real time vs. not?

A
  1. Monitoring traffic data – real time.
  2. Analyze big set of data to mine for patterns – not real time

Deciding when you need which informs the technology purchases that you make

31
Q

What is structured data?

A

Data that has a defined length and format: numbers, dates, and groups of words and numbers called strings. You can query it with SQL.

32
Q

What is a string?

A

A piece of structured data made up of a group of words and numbers.

33
Q

What is machine-generated data vs. human-generated data?

A

Machine-generated: web log data, sales data, etc. Human-generated: input data, clickstream data, moves you make in a game.

34
Q

What is data persistence?

A

How a database retains versions of itself when modified. The great-granddaddy of persistent data stores is the relational database management system (RDBMS).

35
Q

What is a schema?

A

In a relational database, a schema is a structural representation of what is in the database: the tables, the fields in the tables, and the relationships between them.
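
To make this concrete, here is a small sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables and their columns are hypothetical examples, not something defined in the book. The schema declares the tables, their fields, and the relationship between them, and it can be read back from the database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declare two tables, their fields, and the relationship between them.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    );
""")

# The schema is itself stored in the database and can be queried.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
order_columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
```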

36
Q

How do you update, read and delete data in a relational model?

A

SQL (Structured Query Language). Tables can be queried together using a common key (for example, a customer ID).
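
A short sketch of these operations, again using Python's built-in sqlite3 module with hypothetical `customers` and `orders` tables and made-up rows: it inserts, updates, and deletes data, then reads across both tables by joining on the shared customer ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

# Insert rows (sample data invented for this example).
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)],
)

# Update one order and delete another.
conn.execute("UPDATE orders SET total = ? WHERE order_id = ?", (45.0, 11))
conn.execute("DELETE FROM orders WHERE order_id = ?", (12,))

# Read: join the two tables on the common key, customer_id.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()
```

After the delete, the second customer has no remaining orders, so the inner join returns a single row for the first customer.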