Linux Commands Flashcards

1
Q

5V’s of big data

A
Volume - the amount of data that exists
Velocity - how quickly data is generated and moved
Variety - the diversity of data types
Veracity - the quality and accuracy of data
Value - the worth that can be derived from the data
2
Q

Hadoop cluster

A

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets. Hadoop clusters consist of a network of connected master and slave nodes built from high-availability, low-cost commodity hardware.

3
Q

Definition of Hadoop

A

Hadoop is an open-source framework that uses a network of many computers to solve problems involving massive amounts of data and computation.

Hadoop is basically three things: a file system (the Hadoop Distributed File System, HDFS), a computation framework (MapReduce), and a management bridge (YARN, Yet Another Resource Negotiator). HDFS lets you store huge amounts of data in a distributed (faster read/write access) and redundant (better availability) manner, and MapReduce lets you process that data in a distributed and parallel manner, though MapReduce is not limited to HDFS. Being a file system, HDFS lacks random read/write capability; it is good for sequential data access. This is where HBase comes into the picture: it is a NoSQL database that runs on top of your Hadoop cluster and provides random, real-time read/write access to your data.
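
A minimal end-to-end sketch of the store-then-process flow (the file names here are hypothetical, and the examples JAR name varies by distribution):

hdfs dfs -mkdir -p /user/adam/in           # create an HDFS directory
hdfs dfs -put ~/data.txt /user/adam/in     # store a local file in HDFS
hadoop jar hadoop-mapreduce-examples.jar wordcount /user/adam/in /user/adam/out
hdfs dfs -cat /user/adam/out/part-r-00000  # read the job's output sequentially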

4
Q

What is Hive?

A

Hive is an application that runs on top of the Hadoop framework and provides a SQL-like interface for processing and querying data. Hive queries are written in HQL (Hive Query Language), which has much the same structure as SQL in an RDBMS.
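
A minimal sketch of that SQL-like interface, assuming a hypothetical Hive table named employees already exists:

hive -e "SELECT department, COUNT(*) FROM employees GROUP BY department;"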

5
Q

What is HBase?

A

HBase is a column-oriented non-relational database management system that runs on top of Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.
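
A minimal sketch of that random read/write access using the HBase shell (the table and column family names are hypothetical):

echo "create 'users', 'info'" | hbase shell                    # table with one column family
echo "put 'users', 'row1', 'info:name', 'Adam'" | hbase shell  # random write
echo "get 'users', 'row1'" | hbase shell                       # random real-time read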

6
Q

What is Pig?

A

Pig is a high-level platform that lets you create programs that run on Apache Hadoop. Its language is called Pig Latin. The results of Pig are always stored in HDFS.
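
A minimal Pig Latin sketch (the input path and field name are hypothetical); the result lands in HDFS:

pig -e "lines = LOAD '/user/adam/in/data.txt' AS (line:chararray); STORE lines INTO '/user/adam/pig_out';"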

7
Q

What is YARN?

A

YARN is an Apache Hadoop technology and stands for Yet Another Resource Negotiator. YARN is the resource-managing component of Hadoop and consists of a ResourceManager, NodeManagers, and ApplicationMasters.

8
Q

df command

A

The df command (short for disk free) is used to display information about file systems: their total space and available space.
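
For example (-h prints sizes in human-readable units):

df -h        # all mounted file systems
df -h /home  # only the file system containing /home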

9
Q

man command

A

The man command in Linux is used to display the user manual of any command that we can run in the terminal.
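
For example:

man ls       # manual page for ls
man -k disk  # search manual page descriptions for "disk"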

10
Q

fdisk

A

fdisk (format disk) lets you create and manipulate disk partitions. Disk partitioning allows your system to run as if it were actually multiple independent systems, even though it is all on the same hardware.
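
For example (the device name is illustrative; partitioning requires root):

sudo fdisk -l        # list the partition tables of all disks
sudo fdisk /dev/sdb  # interactively partition /dev/sdb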

11
Q

sfdisk

A

sfdisk reads and writes partition tables, but is not interactive like fdisk or cfdisk (it reads input from a file or stdin)
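
A common scripted use is dumping one disk's partition table and replaying it onto another (device names are illustrative):

sudo sfdisk -d /dev/sda > sda.dump  # dump the partition table to a file
sudo sfdisk /dev/sdb < sda.dump     # write the same layout to another disk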

12
Q

cfdisk

A

cfdisk provides basic partitioning functionality with a friendly text-based user interface.
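
For example (the device name is illustrative):

sudo cfdisk /dev/sda  # open the menu-driven partition editor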

13
Q

lsblk

A

lsblk is used to display details about block devices; these block devices (except RAM disks) are basically the files that represent devices connected to the PC.
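
For example:

lsblk     # tree of disks and their partitions, with sizes and mount points
lsblk -f  # also show file system type, label, and UUID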

14
Q

What was the hadoop explosion?

A

Different tools were needed for different tasks, so Hadoop inspired a lot of other technologies that were built alongside and on top of it, like Spark, Hive, and Kafka.

15
Q

What’s the CDH?

A

Cloudera Distribution of Hadoop: an open-source Apache Hadoop distribution.

16
Q

What are some differences between hard disk space and RAM?

A

RAM is short-term memory; it is where programs, outputs, and the inputs to the commands running on the processor are stored.
The hard disk is long-term memory, the only part of the computer that preserves anything when you shut down. It stores files.
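
For example, free reports RAM while df reports disk space:

free -h  # total/used/available RAM
df -h    # total/used/available disk space per file system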

17
Q

What is a VM?

A

A virtual machine is a program on a computer that works as if it were a separate computer inside the main computer. It is a simple way to run more than one operating system on the same computer.

18
Q
Know basic file manipulation commands
ls -al
cd
pwd
mkdir
touch
nano
man 
less 
cat
mv
cp
rm
history
A

ls - displays the contents of a directory (with -al: all entries, including hidden ones, in long format)
cd - change directory
pwd - print working directory: prints the file path you are currently in
mkdir - makes a new directory
touch - creates a file or changes/modifies its timestamps
nano - command-line text editor
man - see the manual for a command
less - view the contents of a file one page at a time
cat - concatenates files and prints their contents to stdout
mv - moves a file from source to destination
cp - copies the specified file
rm - removes a file
history - shows the history of commands
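
A minimal example session tying these together (file names are hypothetical):

mkdir demo && cd demo    # make a directory and enter it
touch notes.txt          # create an empty file
nano notes.txt           # edit it
cat notes.txt            # print its contents
cp notes.txt backup.txt  # copy it
mv backup.txt old.txt    # rename (move) the copy
ls -al                   # list everything, including hidden files
rm old.txt               # remove the copy
history                  # review the commands we just ran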

19
Q

What’s the differences between an absolute and relative path?

A

The difference between an absolute and a relative path is that an absolute path specifies the location from the root directory, whereas a relative path is relative to the current directory.
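
For example, both of these refer to the same (hypothetical) file when the current directory is /home/adam:

cat /home/adam/docs/notes.txt  # absolute: starts from the root /
cat docs/notes.txt             # relative: starts from the current directory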

20
Q

How do permissions work in Unix?

A

There are 3 user types on a Linux system: owner, group, and all (other) users.
Linux divides the file permissions into read, write, and execute, denoted by r, w, and x.
The permissions on a file can be changed by the 'chmod' command.
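
For example, ls -l shows each file's permissions; a hypothetical line:

-rwxr-xr-- 1 adam devs 512 Jan 1 12:00 script.sh

Reading left to right: the owner adam has rwx, the group devs has r-x, and all other users have r--.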

21
Q

What are users, what are groups?

A

Groups can contain multiple users; all users belonging to a group have the same Linux group permissions for a file.
A user or account of a system is uniquely identified by a number called the UID (unique identification number).
A root or super user can access all the files, while a normal user has limited access to files.
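
For example, the id command shows the current user's UID, primary group, and group memberships:

id      # e.g. uid=1000(adam) gid=1000(adam) groups=1000(adam),27(sudo)
groups  # just the group names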

22
Q

How does the chmod command change file permissions?

A

The chmod command can be used to explicitly assign privileges to the owner, the group, and other users. This can be accomplished either using octal number format, i.e. 777 for all privileges for everyone, or through letter (symbolic) format, i.e. u+rwx (user/owner), g+rwx (group), o+rwx (others).
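
For example (the file name is hypothetical):

chmod 777 script.sh      # octal: rwx for owner, group, and others
chmod u+x script.sh      # symbolic: add execute for the owner
chmod g-w,o-r script.sh  # remove write from the group and read from others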

23
Q

What is a package manager? What package manager do we have on Ubuntu?

A

Package managers download applications or parts of applications for you, installing and managing dependencies automatically. On Ubuntu we'll use APT (Advanced Package Tool).
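
For example (the package name is illustrative; installing requires root):

sudo apt update                 # refresh the package index
sudo apt install openjdk-8-jdk  # install a package plus its dependencies
sudo apt remove openjdk-8-jdk   # uninstall it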

24
Q

What is ssh?

A

SSH, also known as Secure Shell or Secure Socket Shell, is a network protocol that gives users, particularly system administrators, a secure way to access a computer over an unsecured network.
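
For example (the user and host names are hypothetical):

ssh adam@cluster-node-01          # log in to a remote machine
ssh adam@cluster-node-01 'df -h'  # or run a single command remotely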

25
Q

Explain difference between Mapper and Reducer?

A

The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. The Reducer is likewise generic; it receives each key together with the grouped values produced by the mappers (after shuffle and sort) and emits the final output.

Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package.

Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).

26
Q

What are some examples of structured and unstructured data?

A

Structured data is highly-organized and formatted in a way so it’s easily searchable in relational databases (dates, phone numbers, ssn, addresses, etc). Unstructured data has no pre-defined format or organization, making it much more difficult to collect, process, and analyze (text files, reports, images, video files, etc).

27
Q

What is a daemon?

A

A daemon is just a long-running process that runs in the background. Hadoop has five such daemons, namely NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker. Each daemon runs separately in its own JVM.
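
Assuming a JDK is installed, the jps tool lists the JVM processes (and thus the Hadoop daemons) running on a machine:

jps  # might print NameNode, DataNode, SecondaryNameNode, etc., depending on the node and Hadoop version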

28
Q

What is data locality and why is it important?

A

In Hadoop, data locality is the practice of moving the computation close to where the actual data resides on a node, instead of moving large data to the computation. This minimizes network congestion and increases the overall throughput of the system.

29
Q

What is the default number of replications for each block? How are these replications typically distributed across the cluster?

A

Three. Replication information and other metadata are stored on the NameNode, and the NameNode makes all decisions about where data/replicas will be stored on the cluster. Each block of a file is replicated across the cluster; with the default rack-aware policy, one replica is written to the local node and the other two are placed on two different nodes of a remote rack.
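
A hedged sketch for inspecting and changing replication (the path is hypothetical):

hdfs fsck /user/adam/data.txt -files -blocks  # show a file's blocks and replication
hdfs dfs -setrep -w 2 /user/adam/data.txt     # change a file's replication factor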

30
Q

What is the job of the NameNode? What about the DataNode?

A

The NameNode, also known as the master daemon, stores the metadata of HDFS, which is the directory tree of all files in the system. It also tracks where file blocks are kept across the cluster.

The DataNodes are responsible for storing the actual data in HDFS. They perform operations like creation, replication, and deletion of data blocks.
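
For example, this asks the NameNode to report cluster state, including each DataNode's capacity and usage:

hdfs dfsadmin -report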

31
Q

How many namenodes exist in a cluster?

A

Typically 1 NameNode exists in a cluster (a Standby NameNode may be added for failover, and HDFS Federation allows multiple independent NameNodes).

32
Q

How are datanodes fault tolerant?

A

For DataNodes, their fault tolerance is handled by the NameNode. DNs send heartbeats to the NN, so when a DN goes down, it stops sending those heartbeats, and the NN knows to make new replicas of all the data stored on the downed DN.

33
Q

How does a standby namenode make the namenode fault tolerant?

A

This is a daemon that runs on another machine and follows the same steps as the NameNode while they are occurring in real time: it receives the information in the EditLog and keeps its own FSImage. Then, if the real NameNode fails, the Standby NameNode steps in and becomes the new NameNode. This is called failover. This is the best option, but it requires more resources.

34
Q

What purpose does a secondary namenode serve?

A

The Secondary NameNode takes the edit logs from the primary NameNode at regular intervals and merges them into the fsimage.

Once it has the updated fsimage, it copies the fsimage back to the NameNode.

So, whenever the NameNode restarts, it will use this fsimage and the startup time will be reduced accordingly.

It does not provide failover, which is the job of the Standby NameNode.

35
Q

How might we scale a HDFS cluster past a few thousand machines?

A

HDFS Federation, with multiple NameNodes, can be used if you need tens of thousands of machines.

36
Q

In a typical hadoop cluster, what’s the relationship between HDFS data nodes and Yarn node managers?

A

Typically one of each runs per worker machine as the worker daemons. Node managers manage bundles of resources called containers running on their machine and report the status back to the ResourceManager.
We submit jobs to the ResourceManager; tasks are the individual pieces jobs are broken up into, and tasks are what run inside containers. Because DataNodes share machines with node managers, the map and reduce tasks can run on the nodes that hold the data.

37
Q

When does the combine phase run and where does each combine task run?

A

The Combiner performs a partial reduction of each map task's output, running on the same node as that map task, before the shuffle and sort phase.

The output of the combiner is then sent over the network to the actual reduce task as input.

38
Q

Know the input and output of the shuffle + sort phase?

A

The shuffle and sort phase takes the output from the mappers and groups and orders all values by their associated keys before passing them to the reducer, making the data easier to parse.

39
Q

What does the nodemanager do?

A

Node managers manage bundles of resources called containers running on their machine and report the status back to the ResourceManager.
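
For example, this asks the ResourceManager to list the NodeManagers and their container counts:

yarn node -list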

40
Q

What does the resourcemanager do?

A

The ResourceManager manages the resource allocation in the cluster.
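
For example, applications (jobs) submitted to the ResourceManager can be listed and inspected:

yarn application -list                     # running applications
yarn application -status <application-id>  # details for one application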

41
Q

Which responsibilities does the scheduler have?

A

The scheduler is responsible for allocating resources (containers) across the cluster based on requests.

42
Q

What about the applications manager?

A

The applications manager accepts job submissions and creates the ApplicationMaster for each submitted job.

43
Q

What is the applications master and how many of them are there per job?

A

There is 1 per job (managed by the ApplicationsManager).
ApplicationMasters run in containers on the cluster and are responsible for communicating with the scheduler to obtain the resources their jobs need. This allows the ApplicationsManager to be ultimately responsible for job completion, while offloading most of the work to ApplicationMasters running on worker nodes.

44
Q

What is a container in Yarn?

A

Containers are bundles of resources; tasks are what run inside them.
Tasks run in containers across the cluster, and the scheduler allocates those containers across the cluster based on requests.
ApplicationMasters themselves also run in containers on the cluster.

45
Q

How do we interact with distributed file system?

A

Through the FS shell commands (hdfs dfs). There is also jar, which runs a JAR file: users can bundle their MapReduce code in a JAR file and execute it using this command.
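
For example (the JAR, class, and paths are hypothetical):

hdfs dfs -ls /user/adam                 # an FS shell command
hadoop jar myjob.jar MyDriver /in /out  # run bundled MapReduce code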

46
Q

What do the following commands do?
hdfs dfs -get /user/adam/myfile ~
hdfs dfs -put ~/coolfile /user/adam/

A

1) gets a file from HDFS and puts it into our local system

2) puts a file from our local environment into HDFS

47
Q

What is/was Unix? Why is Ubuntu a Unix-like operating system?

A

Unix is an operating system originally developed at Bell Labs in the 1970s. Ubuntu is Unix-like because it is built on Linux, which follows Unix's design and interfaces without being derived from the original Unix code.

48
Q

What is Big Data?

A

Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently.

49
Q

What are the 4 types of data?

A

Unstructured data - data that has no inherent structure and is usually stored as different types of files (PDFs, images, video).

Quasi-structured - textual data with erratic formats that can be formatted with effort and software tools.

Semi-structured - textual data files with an apparent pattern, enabling analysis (spreadsheets and XML files).

Structured - data having a defined data model, format, and structure (databases).

50
Q

What is HDFS?

A

HDFS stands for Hadoop Distributed File System, and it is the storage system of the Hadoop framework. It stores data in blocks of 128 MB (by default) and reads data sequentially in a single seek operation.
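
A hedged check of the configured block size (dfs.blocksize is the standard Hadoop 2+ key):

hdfs getconf -confKey dfs.blocksize  # prints e.g. 134217728 (128 MB)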

51
Q

What are the characteristics of HDFS?

A
Fault tolerant
Scalable
Rack-aware
Support for heterogeneous clusters
Built for large datasets