Revature NoSQL, Hadoop, Hive, UNIX, Distributed Systems Flashcards

1
Q

What does BASE stand for?

A

BASE stands for Basically Available, Soft state, Eventual consistency, which contrasts with the ACID properties of traditional databases.

2
Q

What is the CAP Theorem?

A

CAP Theorem states that a distributed data store can only provide two out of the following three guarantees: Consistency, Availability, and Partition Tolerance.

3
Q

What does CAP mean for our distributed data stores when they have network problems?

A

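When a network partition occurs, a distributed data store must sacrifice either consistency or availability. Because partitions cannot be ruled out in a distributed system, partition tolerance is effectively mandatory, and the real choice is between the other two.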

4
Q

What is a database in Mongo?

A

In MongoDB, a database is a container for collections, which are groups of documents.

5
Q

What is a collection?

A

A collection is a group of MongoDB documents, similar to a table in a relational database.

6
Q

What is a document?

A

A document is a record in MongoDB, represented in BSON (Binary JSON) format.

7
Q

What rules does Mongo enforce about the structure of documents inside a collection?

A

By default, MongoDB does not enforce a schema: documents in the same collection can have different fields and structures. Schema validation rules can be added to a collection, but they are optional.
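
As a rough illustration of this flexibility, the sketch below models a collection as a plain Python list of dicts. No MongoDB driver is involved, and the `users` data, the field names, and the `field_names` helper are all made up for the example. The two documents sit in the same collection but carry different fields.

```python
import json

# A hypothetical "users" collection modeled as a list of dicts.
# Real MongoDB stores documents as BSON; JSON is used here only for display.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Grace", "languages": ["COBOL", "FORTRAN"],
     "address": {"city": "Arlington"}},
]

def field_names(collection):
    """Return the set of top-level field names used by each document."""
    return [set(doc) for doc in collection]

for doc in users:
    print(json.dumps(doc))
```

Only `_id` and `name` are shared here; nothing in the collection itself requires the remaining fields to match.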

8
Q

What is a distributed application?

A

A distributed application is software that runs on multiple computers or nodes, often to provide scalability and fault tolerance.

9
Q

What is a distributed data store?

A

A distributed data store is a database that stores data across multiple nodes or servers.

10
Q

What is High Availability? How is it achieved in Mongo?

A

High Availability ensures a system remains operational even during failures. In MongoDB, it is achieved using replica sets.

11
Q

What is Scalability? How is it achieved in Mongo?

A

Scalability refers to the ability to handle increased loads by scaling horizontally or vertically. MongoDB achieves this using sharding for horizontal scaling.

12
Q

Explain replica sets and sharding.

A

Replica sets are groups of MongoDB servers that provide redundancy and high availability. Sharding splits data across multiple servers to handle large datasets and high throughput.

13
Q

What are NoSQL databases? What are the different types of NoSQL databases?

A

NoSQL databases are non-relational databases. Types include document, key-value, column-family, and graph databases.

14
Q

What kind of NoSQL database is MongoDB?

A

MongoDB is a document-based NoSQL database.

15
Q

Which are the most important features of MongoDB?

A

MongoDB features include schema-less design, high performance, horizontal scaling, built-in replication, and support for geospatial queries.

16
Q

What is a Namespace in MongoDB?

A

A Namespace in MongoDB is a combination of the database name and the collection name.

17
Q

Which languages can be used with MongoDB?

A

MongoDB supports many languages including Python, Java, Node.js, Ruby, PHP, C#, and more.

18
Q

Compare SQL databases and MongoDB at a high level.

A

SQL databases are relational and enforce schemas, while MongoDB is schema-less and uses documents for flexibility.

19
Q

How is MongoDB better than SQL databases?

A

MongoDB provides flexibility, horizontal scaling, and faster iteration for applications with varying data structures.

20
Q

Compare MongoDB and CouchDB at a high level.

A

MongoDB uses a BSON-based document model and supports rich queries and high write throughput. CouchDB stores JSON documents and focuses on replication and offline-first design.

21
Q

Does MongoDB support foreign key constraints?

A

No, MongoDB does not support foreign key constraints.

22
Q

Does MongoDB support ACID transaction management and locking functionalities?

A

Operations on a single document are always atomic in MongoDB. Multi-document ACID transactions are supported on replica sets starting from version 4.0 and on sharded clusters starting from version 4.2.

23
Q

How can you achieve primary key - foreign key relationships in MongoDB?

A

You can achieve this by embedding related documents or using references between documents.

24
Q

When should we embed one document within another in MongoDB?

A

Embedding is ideal when related data is frequently accessed together and has a one-to-one or one-to-few relationship.
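
To make the embed-versus-reference choice concrete, here is a minimal sketch using plain Python dicts in place of real MongoDB documents. The collections, field names, and the `orders_for` helper are hypothetical and exist only for illustration.

```python
# Embedding: the addresses live inside the user document
# (one-to-few, and normally read together with the user).
user_embedded = {
    "_id": 1,
    "name": "Ada",
    "addresses": [{"city": "London", "zip": "N1"}],
}

# Referencing: each order stores the user's _id, much like a foreign key,
# but nothing here enforces that the referenced user exists.
users = {1: {"_id": 1, "name": "Ada"}}
orders = [
    {"_id": 100, "user_id": 1, "total": 25.0},
    {"_id": 101, "user_id": 1, "total": 40.0},
]

def orders_for(user_id, orders):
    """Resolve the reference by hand: a second lookup done by the application."""
    return [order for order in orders if order["user_id"] == user_id]

print(user_embedded["addresses"])   # one read returns the user and its addresses
print(orders_for(1, orders))        # referenced data needs a separate lookup
```

The embedded form answers "user plus addresses" in a single read, while the referenced form keeps orders independent at the cost of an extra lookup.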

25
Q

Can you move old files in the moveChunk directory?

A

Yes, you can move or delete old files in the moveChunk directory after ensuring they are no longer in use.

26
Q

What is an ObjectId composed of?

A

An ObjectId is a 12-byte value made up of a 4-byte timestamp (seconds since the Unix epoch), a 5-byte random value, and a 3-byte incrementing counter.
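
The 4 + 5 + 3 byte layout can be checked by hand. The sketch below uses only the Python standard library (not a MongoDB driver) to split a 24-character hex string into those three parts; `parse_object_id` is a hypothetical helper and the sample id is used purely for illustration.

```python
from datetime import datetime, timezone

def parse_object_id(hex_id: str) -> dict:
    """Split a 24-character hex ObjectId into its three parts."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[0:4], "big")  # 4-byte timestamp, seconds since the Unix epoch
    return {
        "timestamp": seconds,
        "created_at": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "random": raw[4:9].hex(),                     # 5-byte random value
        "counter": int.from_bytes(raw[9:12], "big"),  # 3-byte counter
    }

parts = parse_object_id("507f1f77bcf86cd799439011")  # sample id, illustration only
print(parts["created_at"].isoformat())
```

Because the timestamp comes first and is stored big-endian, ObjectIds sort roughly by creation time.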

27
Q

What was the Hadoop Explosion?

A

The Hadoop Explosion refers to the rapid adoption and development of the Hadoop ecosystem for big data processing in the 2000s.

28
Q

What about CDH?

A

CDH is Cloudera's distribution of Apache Hadoop. It is one of the ways Hadoop is deployed in the wild. Cloudera and Hortonworks, two major companies supporting Hadoop clusters, completed their merger in 2019.

29
Q

What are some differences between hard disk space and RAM?

A

Hard disk space provides long-term storage with slower access times, while RAM provides temporary, faster-access storage for active processes.

30
Q

What is a VM?

A

A VM (Virtual Machine) is a software-based emulation of a computer system that runs its own operating system.

31
Q

What is AWS?

A

AWS (Amazon Web Services) is a cloud computing platform offering services like computing, storage, and databases.

32
Q

What is/was Unix? Why is Ubuntu a Unix-like operating system?

A

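Unix is a multiuser, multitasking operating system first developed at Bell Labs around 1970. Ubuntu is Unix-like because it is a Linux distribution, and Linux was written to behave like Unix (same design principles, similar commands and file system layout) without being derived from Unix source code.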

33
Q

Know basic file manipulation and navigation commands in Unix: ls -al

A

Lists all files and directories, including hidden ones (-a), in long format with permissions, owner, group, size, and modification time (-l).

34
Q

pwd

A

Prints the working directory.

35
Q

mkdir

A

Creates a new directory.

36
Q

touch

A

Creates a new, empty file if it does not exist; if the file already exists, updates its access and modification times.

37
Q

nano

A

Opens a simple text editor.

38
Q

man

A

Displays the manual for a command.

39
Q

less

A

Views file contents one screen at a time.

40
Q

cat

A

Prints the contents of one or more files to standard output, concatenating them in order.

41
Q

mv

A

Moves or renames a file or directory.

42
Q

cp

A

Copies files or directories.

43
Q

history

A

Displays the history of executed commands.

44
Q

What’s the difference between an absolute and a relative path?

A

An absolute path starts from the root directory (/), while a relative path is relative to the current directory.
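
A short sketch of the distinction, using `pathlib.PurePosixPath` from the Python standard library so that no real files are touched. The `resolve` helper and the example directories are made up for illustration, and the helper does not collapse `..` components.

```python
from pathlib import PurePosixPath

def resolve(current_dir: str, path: str) -> str:
    """Return the absolute form of `path`, reading relative paths against `current_dir`."""
    p = PurePosixPath(path)
    if p.is_absolute():
        return str(p)                           # absolute paths ignore the current directory
    return str(PurePosixPath(current_dir) / p)  # relative paths are joined onto it

print(resolve("/home/adam", "/etc/hosts"))      # absolute: unchanged
print(resolve("/home/adam", "notes/todo.txt"))  # relative: joined onto /home/adam
```

The same relative path names a different file whenever the current directory changes, which is why scripts often prefer absolute paths.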

45
Q

How do permissions work in Unix?

A

Permissions are divided into read, write, and execute, and are set separately for the file's owner, its group, and all other users.

46
Q

What are users, what are groups?

A

Users are individual accounts; groups are collections of users sharing permissions.

47
Q

How does the chmod command change file permissions?

A

The chmod command modifies file permissions using symbolic or numeric notation.
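
Numeric notation packs the three permission sets into one octal digit each (read = 4, write = 2, execute = 1). The sketch below converts a numeric mode to its symbolic form; `symbolic` is a hypothetical helper written for this card, not part of chmod.

```python
def symbolic(mode: int) -> str:
    """Convert a numeric mode such as 0o754 to its rwx string."""
    out = []
    for shift in (6, 3, 0):            # owner, group, others
        bits = (mode >> shift) & 0b111
        out.append("r" if bits & 4 else "-")
        out.append("w" if bits & 2 else "-")
        out.append("x" if bits & 1 else "-")
    return "".join(out)

print(symbolic(0o754))  # the mode that `chmod 754 file` would set
print(symbolic(0o644))  # a common mode for regular files
```

In symbolic notation the same kind of change is written relative to the current mode, for example `chmod g+w file` to add write permission for the group.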

48
Q

What is a package manager? What package manager do we have on Ubuntu?

A

A package manager installs, updates, and manages software. Ubuntu uses APT (Advanced Package Tool).

49
Q

What is ssh?

A

SSH (Secure Shell) is a protocol for securely accessing remote servers.

50
Q

Be able to explain the significance of Mapper[LongWritable, Text, Text, IntWritable] and Reducer[Text, IntWritable, Text, IntWritable].

A

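The four type parameters are, in order, the input key, input value, output key, and output value. Mapper[LongWritable, Text, Text, IntWritable] reads (byte offset, line of text) pairs, as produced by the default text input format, and emits (Text, IntWritable) pairs. Reducer[Text, IntWritable, Text, IntWritable] receives each Text key with all of its IntWritable values and emits (Text, IntWritable) pairs as the final output.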

51
Q

What needs to be true about the types contained in the above generics?

A

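The types must implement Hadoop's Writable interface so they can be serialized, and key types must implement WritableComparable so they can be sorted. The Mapper's output key and value types must also match the Reducer's input key and value types.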
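
The shape of those generics can be mimicked in plain Python. This is only an analogy, assuming a word-count job: `int` and `str` stand in for LongWritable/IntWritable and Text, and `map_fn`, `reduce_fn`, and `word_count` are hypothetical names, not Hadoop APIs.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def map_fn(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Like Mapper[LongWritable, Text, Text, IntWritable]: (offset, line) -> (word, 1)."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Like Reducer[Text, IntWritable, Text, IntWritable]: (word, [1, 1, ...]) -> (word, total)."""
    return word, sum(counts)

def word_count(lines: Iterable[str]) -> dict:
    grouped = defaultdict(list)          # stands in for the shuffle + sort phase
    offset = 0
    for line in lines:
        for word, one in map_fn(offset, line):
            grouped[word].append(one)
        offset += len(line) + 1          # next offset, assuming ASCII text and "\n" line endings
    return dict(reduce_fn(word, counts) for word, counts in sorted(grouped.items()))

print(word_count(["the cat", "The dog"]))
```

Note how the mapper's output types `(str, int)` are exactly the reducer's input types, which is the constraint the generics express.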

52
Q

What are the 3 Vs of big data?

A

Volume, Variety, and Velocity.

53
Q

What are some examples of structured data? Unstructured data?

A

Structured data: Tables, CSVs; Unstructured data: Text files, images.

54
Q

What is a daemon?

A

A daemon is a background process running without direct user interaction.

55
Q

What is data locality and why is it important?

A

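Data locality means running computation on (or near) the node where the data is stored rather than moving the data to the computation. It matters because shipping large datasets across the network is slow and congests the cluster.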

56
Q

How many blocks will a 200MB file occupy in HDFS, if we assume the default HDFS block size for Hadoop v2+?

A

With the default block size of 128MB, the file is stored in 2 blocks: one full 128MB block and one 72MB block.
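
The block count is a ceiling division of the file size by the block size. A minimal sketch of that arithmetic, with `hdfs_blocks` as a hypothetical helper name:

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size in Hadoop 2+

def hdfs_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return (number of blocks, size of the last block in MB)."""
    if file_size_mb <= 0:
        raise ValueError("file size must be positive")
    count = math.ceil(file_size_mb / block_size_mb)
    last = file_size_mb - (count - 1) * block_size_mb
    return count, last

print(hdfs_blocks(200))
print(hdfs_blocks(256))
```

The last block only takes up as much disk space as the data it holds, so the 200MB file does not consume 256MB.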

57
Q

What is the default number of replications for each block?

A

The default replication factor is 3, so each block is stored three times.

58
Q

How are these replications typically distributed across the cluster? What is rack awareness?

A

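Replicas are distributed across racks to enhance fault tolerance. Rack awareness means the NameNode knows which rack each DataNode sits in and places replicas so that the loss of a single rack cannot destroy every copy of a block.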
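
For a replication factor of 3, the HDFS documentation describes the default placement as: one replica on the writer's node, a second on a node in a different rack, and the third on a different node in that same remote rack. The sketch below follows that description under simplifying assumptions: the topology dict and the `place_replicas` function are hypothetical, and nodes are picked deterministically, whereas real HDFS chooses among eligible nodes at random and takes load into account.

```python
def place_replicas(writer, nodes_by_rack):
    """Pick three DataNodes for one block.

    writer: name of the DataNode writing the block.
    nodes_by_rack: dict mapping rack name -> list of node names (made-up topology).
    """
    rack_of = {node: rack for rack, nodes in nodes_by_rack.items() for node in nodes}
    local_rack = rack_of[writer]
    # Find a different rack that can hold the second and third replicas.
    remote_rack = next(
        (rack for rack in sorted(nodes_by_rack)
         if rack != local_rack and len(nodes_by_rack[rack]) >= 2),
        None,
    )
    if remote_rack is None:
        raise ValueError("need a second rack with at least two nodes")
    second, third = nodes_by_rack[remote_rack][:2]
    return [writer, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topology))
```

With this layout the block survives the loss of either rack, while only one copy has to cross between racks during the write.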

59
Q

What is the job of the NameNode? What about the DataNode?

A

The NameNode manages metadata and file system structure; DataNodes store the actual data blocks.

60
Q

How many NameNodes exist on a cluster?

A

Typically, there is one active NameNode per cluster.

61
Q

How are DataNodes fault tolerant?

A

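DataNodes are made fault tolerant through replication: each block is stored on several DataNodes, every DataNode sends regular heartbeats to the NameNode, and when one stops responding the NameNode re-replicates its blocks from the surviving copies.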

62
Q

How does a Standby NameNode make the NameNode fault tolerant?

A

The Standby NameNode keeps its metadata in sync with the active NameNode so that it can take over if the active NameNode fails.

63
Q

What purpose does a Secondary NameNode serve?

A

It periodically checkpoints the NameNode metadata by merging the edit log into the fsimage. Despite its name, it is not a failover NameNode.

64
Q

How might we scale a HDFS cluster past a few thousand machines?

A

HDFS Federation, in which multiple independent NameNodes each manage part of the namespace, can be used to scale to tens of thousands of machines.

65
Q

In a typical Hadoop cluster, what’s the relationship between HDFS data nodes and YARN node managers?

A

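HDFS DataNodes store data, while YARN NodeManagers manage resources for processing jobs. The two typically run on the same worker machines, so tasks can be scheduled on nodes that already hold the data they read.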

66
Q

When does the combine phase run, and where does each combine task run?

A

The combine phase runs after the map phase and before the shuffle, on the node where the map task ran.
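
A combiner is essentially a local reduce over one map task's output. The sketch below assumes a word-count job, where summing is safe to apply early; `combine` is a hypothetical function, not a Hadoop API.

```python
from collections import Counter

def combine(map_output):
    """Locally pre-aggregate one map task's (word, count) pairs before the shuffle."""
    totals = Counter()
    for word, count in map_output:
        totals[word] += count
    return sorted(totals.items())

one_map_task = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
print(combine(one_map_task))  # fewer pairs cross the network than the mapper emitted
```

Hadoop does not guarantee how many times a combiner runs (it may be zero), so the job has to produce the same result with or without it.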

67
Q

Know the input and output of the shuffle + sort phase.

A

Input: the key-value pairs emitted by the mappers (or combiners). Output: for each reducer, its keys in sorted order, each delivered with all of the values emitted for that key.
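
A minimal single-process sketch of what this phase does to the data. It ignores partitioning across several reducers, and `shuffle_and_sort` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(mapper_output):
    """Sort (key, value) pairs by key and group the values for each key."""
    ordered = sorted(mapper_output, key=itemgetter(0))  # only the key is compared
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

pairs = [("dog", 1), ("cat", 1), ("dog", 1), ("ant", 1)]
print(shuffle_and_sort(pairs))
```

Each `(key, [values])` entry corresponds to one call of the reduce function.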

68
Q

What does the NodeManager do?

A

The NodeManager manages resources and monitors container processes on a node.

69
Q

What about the ResourceManager?

A

The ResourceManager allocates cluster resources for jobs.

70
Q

Which responsibilities does the Scheduler have?

A

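The Scheduler allocates resources to running applications, subject to capacity and queue constraints. It is a pure scheduler: it does not monitor application status or restart failed tasks.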

71
Q

What about the ApplicationsManager?

A

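The ApplicationsManager accepts job submissions, negotiates the first container in which the job's ApplicationMaster runs, and restarts the ApplicationMaster container if it fails.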

72
Q

What is an ApplicationMaster? How many of them are there per job?

A

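The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor the tasks of a single job. There is one ApplicationMaster per job (application).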

73
Q

What is a Container in YARN?

A

A container is a resource allocation (CPU, memory) for running tasks on a node.

74
Q

How do we interact with the distributed filesystem?

A

Through the hdfs dfs command-line interface, with subcommands such as -ls, -get, and -put.

75
Q

What do the following commands do? hdfs dfs -get /user/adam/myfile ~

A

Copies /user/adam/myfile from HDFS into the local user's home directory.

76
Q

hdfs dfs -put ~/coolfile /user/adam/

A

Copies coolfile from the local user's home directory into /user/adam/ on HDFS.
