Linux Commands Flashcards
The 5 V’s of big data
Volume - the amount of data that exists
Velocity - how quickly data is generated and moved
Variety - the diversity of data types
Veracity - the quality and accuracy of the data
Value - the usefulness of the data, that is, what can be gained from it
Hadoop cluster
A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets. … Hadoop clusters consist of a network of connected master and slave nodes that use high-availability, low-cost commodity hardware.
Definition of Hadoop
Hadoop is an open-source framework that uses a network of many computers to solve problems involving massive amounts of data and computation.
Hadoop is essentially three things: a file system (HDFS, the Hadoop Distributed File System), a computation framework (MapReduce), and a resource management layer (YARN, Yet Another Resource Negotiator). HDFS lets you store huge amounts of data in a distributed manner (for faster read/write access) and a redundant manner (for better availability). MapReduce lets you process that data in a distributed, parallel manner, and it is not limited to data stored in HDFS. HDFS is designed for sequential data access and lacks random read/write capability. This is where HBase comes into the picture: it is a NoSQL database that runs on top of your Hadoop cluster and provides random, real-time read/write access to your data.
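The map, shuffle, and reduce phases that MapReduce runs across a cluster can be imitated on one machine with a shell pipeline. This is only a local analogy, not Hadoop itself: the function name word_count and the sample text are made up for illustration, and a real job would read its input from files in HDFS and run the phases in parallel on many nodes.

```shell
# Local word-count sketch of the MapReduce phases (hypothetical helper).
word_count() {
  tr -s ' ' '\n' |        # map: emit one word per line
    sort |                # shuffle: bring identical words together
    uniq -c |             # reduce: count the lines in each group
    awk '{print $2, $1}'  # tidy the output as "word count"
}

# Sample input, made up for this example.
printf 'big data\nbig cluster\ndata node\n' | word_count
# prints:
# big 2
# cluster 1
# data 2
# node 1
```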
What is Hive?
Hive is an application that runs on top of the Hadoop framework and provides a SQL-like interface for querying and processing data. Queries are written in HQL (Hive Query Language), and data is organized into tables, much as in an RDBMS.
What is HBase?
HBase is a column-oriented non-relational database management system that runs on top of Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.
What is Pig?
Pig is a high-level platform for creating programs that run on Apache Hadoop. Its language is called Pig Latin. When Pig runs on a Hadoop cluster, its results are stored in HDFS.
What is YARN?
YARN is an Apache Hadoop technology; the name stands for Yet Another Resource Negotiator. It is the resource-management component of Hadoop and consists of a ResourceManager, NodeManagers, and ApplicationMasters.
df command
The df command (short for disk free) displays the total and available space on each file system.
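A short sketch of how df is commonly used. The variable name avail_kb is made up for this example, and the parsing step assumes the file system name contains no spaces.

```shell
# Show total, used, and available space for every mounted file system
# in human-readable units (K, M, G).
df -h

# Limit the report to the file system that holds the current directory.
df -h .

# -P selects the portable output format and -k reports 1024-byte blocks;
# row 2, column 4 is then the available space.
avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')
echo "Available in current directory: ${avail_kb} KB"
```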
man command
The man command displays the user manual for any command that can be run on the terminal.
fdisk
fdisk (sometimes read as “format disk”) lets you create and manipulate disk partitions through an interactive, text-based dialog. Partitioning divides one physical disk into sections that the system treats as independent disks, even though they all sit on the same hardware.
sfdisk
sfdisk reads and writes partition tables, but is not interactive like fdisk or cfdisk (it reads input from a file or stdin)
cfdisk
cfdisk provides basic partitioning functionality with a friendly, menu-driven user interface.
lsblk
lsblk displays details about block devices (except RAM disks). Block devices are the special files that represent storage devices connected to the computer.
What was the Hadoop explosion?
Different tasks needed different tools, which inspired many other technologies that were built alongside and on top of Hadoop, such as Spark, Hive, and Kafka.
What is CDH?
CDH is the Cloudera Distribution of Hadoop, an open-source distribution of Apache Hadoop.
What are some differences between hard disk space and RAM?
RAM is short-term memory. It holds running programs along with the inputs and outputs of the commands the processor is executing, and its contents are lost when the computer shuts down.
A hard disk is long-term storage. It stores files and is the part of the computer that preserves data when you shut down.
What is a VM?
A virtual machine is a program on a computer that works like it is a separate computer inside the main computer. … It is a simple way to run more than one operating system on the same computer.
Know the basic file manipulation commands: ls -al, cd, pwd, mkdir, touch, nano, man, less, cat, mv, cp, rm, history
ls - lists the contents of a directory (-a includes hidden entries, -l shows the long format)
cd - change directory
pwd - print working directory : prints out current file path you are in
mkdir - makes a new directory
touch - creates an empty file if it does not exist, otherwise updates the file's timestamps
nano - command line text editor
man - displays the manual page for a command
less - shows the contents of a file one screen at a time, with scrolling
cat - concatenates files and prints their contents to stdout
mv - moves a file from source to destination (also used to rename)
cp - copy specified file
rm - removes file
history - shows history of commands
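The non-interactive commands above can be tried together in a scratch directory. This is a minimal sketch: the directory and file names (notes, todo.txt, and so on) are made up, and nano, man, less, and history are left out because they are meant for interactive use.

```shell
# Work in a throwaway directory so nothing important is touched.
workdir=$(mktemp -d)
cd "$workdir"
pwd                                   # print the current directory

mkdir notes                           # create a directory
touch notes/todo.txt                  # create an empty file
echo "learn hadoop" > notes/todo.txt  # write one line into it
cat notes/todo.txt                    # print the file's contents

cp notes/todo.txt notes/backup.txt    # copy the file
mv notes/backup.txt notes/old.txt     # rename the copy
ls -al notes                          # long listing, hidden entries included

rm notes/old.txt                      # delete the copy
```

When finished, the scratch directory can be removed with rm -r "$workdir".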
What is the difference between an absolute and a relative path?
An absolute path specifies a location starting from the root directory (/), whereas a relative path is interpreted relative to the current directory.
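A small sketch of the two kinds of path reaching the same directory. The directory names (projects, hadoop) and the variable names are made up for illustration.

```shell
# Build a scratch directory tree to navigate.
base=$(mktemp -d)
mkdir -p "$base/projects/hadoop"

# Relative path: "hadoop" only makes sense from inside $base/projects.
cd "$base/projects"
cd hadoop
via_relative=$(pwd)

# Absolute path: starts at / and works from any directory.
cd /
cd "$base/projects/hadoop"
via_absolute=$(pwd)

# Both routes end in the same directory.
echo "$via_relative"
echo "$via_absolute"
```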
How do permissions work in Unix?
There are 3 user classes on a Linux system: Owner, Group, and Others (all other users).
Linux divides file permissions into read, write, and execute, denoted by r, w, and x.
The permissions on a file can be changed with the chmod command.
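A brief sketch of chmod in both its octal and symbolic forms. The scratch file and variable names are made up, and stat -c assumes the GNU coreutils version of stat found on most Linux systems.

```shell
f=$(mktemp)                          # scratch file for the demo

# Octal form: one digit each for owner, group, and others (r=4, w=2, x=1).
chmod 640 "$f"                       # owner rw-, group r--, others ---
ls -l "$f"                           # mode column begins with -rw-r-----
mode_octal=$(stat -c '%a' "$f")      # 640

# Symbolic form: add execute for the owner and read for others.
chmod u+x,o+r "$f"
mode_symbolic=$(stat -c '%A' "$f")   # -rwxr--r--

echo "$mode_octal $mode_symbolic"
```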