Keywords Flashcards
Data Warehouse
A centralized repository for storing and managing large volumes of data from various sources, used for querying, reporting, and data analysis to support business intelligence activities.
Multidimensional Data
Data organized into multiple dimensions, allowing for analysis from different perspectives, such as time, geography, and product categories.
OLAP
Online analytical processing (OLAP) is a database technology optimized for querying and reporting rather than transaction processing. It allows businesses to analyze data from multiple sources in different ways.
What is a server?
In computing, a server is a computer that provides services, data, or resources to other devices, called clients, over a network.
What is data mining?
The process of searching and analyzing a large batch of raw data in order to identify patterns and extract useful information.
What is Operational BI?
Operational Business Intelligence provides near real-time to short-term insights focused on daily operations, enabling immediate decision-making rather than long-term strategic planning.
What is analytics?
The methodology of transforming data into insight for making better decisions. It answers questions like why is this happening, what if these trends continue, what will happen next (i.e., predict), and what is the best that can happen.
What is web analytics?
The measurement, collection, analysis, and reporting of web data.
Explain the data analytics process
Identify business needs (choose KPIs)
Collect the data (from the subject matter experts)
Review and clean the data
Model the data
Analyze the data
Interpret the results
Predict and optimize
Communicate
What is the Gartner Magic Quadrant for BI?
An annual report that evaluates and ranks leading business intelligence (BI) vendors based on their ability to execute and their completeness of vision, helping organizations choose the best BI tools for their needs.
What is structured data?
It is highly organized data stored in databases and spreadsheets in columns and rows, ready to be integrated into a database or a structured file format such as XML. Only 20% of available data is like this.
What is unstructured data?
Unstructured data is raw and unorganized. It does not have a predefined data model. It has no identifiable internal structure.
What are three components of a data warehouse system?
Acquisition Component: Interfaces with source systems to import data into the data warehouse. ETL Tools (Extract, Transform, Load): Examples include Apache NiFi, Talend, and Microsoft SQL Server Integration Services (SSIS), which help import data from various sources into the data warehouse.
Storage Component: A large physical database used to store the imported data. Databases: Examples include Amazon Redshift, Google BigQuery, and Snowflake, which store large volumes of data in a structured format.
Access Component: Enables accessing and analyzing the data in the data warehouse. BI Tools: Examples include Tableau, Power BI, and Looker, which allow users to access, query, and analyze data stored in the data warehouse.
What are some of the benefits of metadata?
- It records the source of the imported data and the operations that were applied to it
- It documents relationships between data structures
- It provides useful mapping information
- It can be used to review how business definitions and calculations changed over time, and it provides a history of extracts and changes in data over time
What is business metadata?
Provides information about the data, its sources, definitions, etc. in business terminology
What is Technical metadata?
It defines the objects and processes in the data warehouse
What is process metadata?
It documents the data warehouse operations.
What is access metadata?
Access metadata provides the dynamic link between a data warehouse and its associated applications.
What is data conversion?
Refers to the process of converting data from one format into another due to differences in storage types and data structures, as well as variations in data encoding across computer systems
What is data integration?
Data Integration: Imagine you work with data from multiple sources, like sales records, customer information, and inventory lists. Data integration is the process of combining this data into a single, unified view. It’s like creating a master spreadsheet where all this information is brought together, making it easier to analyze and share with your team or partners.
What is data migration?
Data Migration: Data migration is when you move data from one system to another. For example, if your company is switching from an old database to a new one, you would transfer all the data from the old system to the new one. Once the data is successfully moved, the old system is no longer needed and can be retired.
What is data quality?
How accurate, complete, reliable, and relevant data is for its intended use, ensuring that data is consistent, free of errors, and useful for making decisions and analysis.
What is Master Data Management?
Master Data Management (MDM): MDM focuses on creating a single, consistent, and accurate view of key business data entities, such as customers, products, and suppliers. It involves processes and tools for integrating, cleansing, and maintaining this master data across different systems and departments to ensure consistency and reliability.
What are some data cleansing and tool categories?
Data error discovery tools and data correction tools
What is a relational database?
Relational Databases: Relational databases are a type of database management system (DBMS) that stores data in tables with rows and columns, where each table represents a relation. The tables are related to each other through keys (primary and foreign keys), allowing for efficient querying and retrieval of data using structured query language (SQL). Relations between tables enforce data integrity and enable complex queries and transactions to be performed on the data. Examples include MySQL, PostgreSQL, Oracle Database, and Microsoft SQL Server.
What is decentralized processing?
Decentralized Processing: Decentralized processing means spreading out computing tasks and data management across many separate devices or nodes in a network. Each device works on its own and collaborates with others to get things done without needing one main server to control everything. This setup makes it easier to handle big amounts of data and complex jobs across different parts of a network, making systems more flexible and reliable. Example would be a peer-to-peer (P2P) file-sharing network like BitTorrent
What is extract processing?
Extract processing typically refers to the process of extracting data from one system or source for use in another system or for further analysis. The term is often used in the context of ETL (Extract, Transform, Load) processes in data integration and data warehousing.
What is an OLTP?
Online Transaction Processing Systems (OLTPs): OLTP systems are specialized databases designed to manage and facilitate a large number of short, atomic transactions in real time. These systems are optimized for tasks such as data entry, retrieval, and processing, commonly used in environments where fast query processing and maintaining data integrity in multi-user access scenarios are critical. Examples of OLTP applications include banking systems for handling ATM transactions, retail systems for processing sales, and reservation systems for booking flights or hotels.
OLTP System is an Operational Database.
What is an ERP?
Enterprise Resource Planning (ERP):
Definition: ERP systems are integrated software platforms that manage and automate core business processes across various departments within an organization. These processes include finance, human resources, manufacturing, supply chain, procurement, and more.
SAP, Oracle ERP, and Microsoft Dynamics are popular ERP systems used by organizations to manage their day-to-day business activities.
What’s a data mart?
A data mart is a smaller, more focused version of a data warehouse designed to meet the specific needs of a department or a smaller group within an organization.
What's an operational data store (ODS)?
An Operational Data Store (ODS) serves as a central database that captures a snapshot of the latest data from multiple transactional systems. Its purpose is to support operational reporting and provide a source of data for the enterprise data warehouse (EDW). An ODS is a subject-oriented database that contains structured data extracted directly from OLTP systems.
Examples include patient records, inventory management data, transaction data, and meter readings.
Used as a staging area before data is imported into a data warehouse
It contains current, or near current, data and its objective is to meet the ad hoc query, tactical day-to-day, needs of operational users.
An ODS can be updated frequently from operational systems.
What is data federation?
Data federation is a data integration technique that provides a unified view of data from multiple sources without physically consolidating it. Imagine it as a sophisticated mechanism that allows you to access and query data across various systems in real time, as if it were all stored in a single location. Essentially, it creates a virtual database that maps several distinct data sources within an enterprise, making them accessible through a single interface.
What’s an Online Analytical Processing (OLAP) system?
Online Analytical Processing (OLAP) system is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to interactively analyze multidimensional data from multiple perspectives. They are a crucial part of business intelligence systems, facilitating complex queries and analysis that support decision-making processes.
What is a MOLAP (Multidimensional OLAP)?
Data is stored in a multidimensional cube, which allows for fast retrieval and analysis.
What is a ROLAP (Relational OLAP)?
Data is stored in a relational database, and complex queries are used to perform multidimensional analysis. ROLAP can handle large volumes of data but may have slower query performance compared to MOLAP.
What is a HOLAP (Hybrid OLAP)?
Combines features of both MOLAP and ROLAP, storing part of the data in multidimensional cubes and part in a relational database.
Compare relational databases, OLTP, and OLAP.
Relational Databases: A Relational Database Management System (RDBMS) is a type of database management system that organizes data into tables (relations) where data points are related to one another through common fields.
OLTP (Online Transaction Processing): Transactional systems that require fast query processing and maintain data integrity in multi-access environments. They are optimized for inserting, updating, and deleting small amounts of data, and they manage and facilitate the transactional operations of day-to-day business activities, e.g., Point-of-Sale (POS) systems and online banking. De-normalization creates redundant data and suits a data warehouse; it is not appropriate for a transaction database, which avoids redundancy to protect data integrity and write performance.
OLAP Systems (Online Analytical Processing): Designed for analysis and reporting. They deal with historical, summarized, and aggregated data rather than real-time transactional data, and they are optimized for complex queries and aggregations over large volumes of data, often supporting data warehousing and business intelligence applications. Data is organized in multidimensional cubes with measures and dimensions; cubes allow for fast retrieval of aggregated data. OLAP systems use specialized query languages like MDX (Multidimensional Expressions). SSAS, part of Microsoft's SQL Server suite, is a powerful tool for creating OLAP cubes and supports both multidimensional (MOLAP) and tabular data models. Use cases include business reporting, budgeting and forecasting, and market research.
What is Denormalization?
Denormalization is the process of intentionally introducing redundancy into a database by merging tables and reducing the complexity of relationships. This is often done to improve the read performance of the database by reducing the number of joins needed to retrieve related data. Denormalization can be particularly beneficial in scenarios where query performance is more critical than the maintenance of strict data integrity and normalization rules.
Denormalization introduces redundancy, meaning the same piece of data is stored in multiple places. This can lead to inconsistencies if not managed properly. More storage space is required since data is duplicated. Updates become more complex because changes to redundant data must be propagated to multiple places, increasing the risk of data anomalies.
What's a join?
A join is an operation in SQL that allows you to combine rows from two or more tables based on a related column between them. Joins are fundamental in relational databases as they enable the retrieval of data spread across multiple tables by establishing relationships between them.
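As a quick sketch, the idea can be shown with Python's built-in sqlite3 module; the customers/orders tables and their columns below are made up for illustration:

```python
import sqlite3

# Two hypothetical related tables: customers and their orders.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Bo")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)])

# The JOIN matches each order to its customer via the related column.
rows = cur.execute(
    "SELECT c.name, o.amount FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()
```

Each result row combines columns from both tables, e.g. ('Ada', 25.0).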
What is normalization?
Normalization is a process in database design that organizes columns and tables of a database to reduce data redundancy and improve data integrity. The main objective of normalization is to separate data into distinct, related tables to minimize redundancy and dependency. This process involves structuring a relational database according to a series of normal forms to ensure that data is stored logically and efficiently.
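A toy sketch of the idea in Python (the employee/department fields are hypothetical): a table where the department name repeats on every employee row is split so the department data is stored once.

```python
# Denormalized rows: the department name is repeated for every employee.
employees = [
    {"emp": "Ada", "dept_id": 1, "dept_name": "Engineering"},
    {"emp": "Bo",  "dept_id": 1, "dept_name": "Engineering"},
    {"emp": "Cy",  "dept_id": 2, "dept_name": "Sales"},
]

# Normalization: move the repeating department data into its own table,
# leaving only the foreign key (dept_id) on the employee rows.
departments = {row["dept_id"]: row["dept_name"] for row in employees}
employees_normalized = [{"emp": r["emp"], "dept_id": r["dept_id"]} for r in employees]
```

Renaming a department is now a single update instead of one update per employee row, which is the redundancy-and-integrity benefit normalization aims for.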
What are three types of data warehouses?
There are three types - Enterprise Data Warehouses (EDW), Data Marts, and Operational Data Stores (ODS). EDWs are comprehensive and cover the entire organization, Data Marts cater to specific departmental needs, and ODSs provide near-current data for operational use.
What is a star schema?
A star schema is a type of data warehouse schema where a central fact table is connected to multiple dimension tables through foreign key relationships.
Dimension tables are de-normalized and are linked to the fact table through unique keys (one per dimension table).
Imagine a retail company analyzing sales data. The fact table contains sales transactions, and the dimension tables include products, time, and stores. Since the relationships are straightforward (e.g., sales per product per store), a star schema simplifies querying.
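The retail example above can be sketched with sqlite3; the table and column names (fact_sales, dim_product, dim_store) are hypothetical, chosen only to illustrate the fact-plus-dimensions layout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Dimension tables: descriptive attributes to slice the facts by.
cur.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT)")
# Central fact table: foreign keys to each dimension plus the measures.
cur.execute("""CREATE TABLE fact_sales (
    product_key INTEGER, store_key INTEGER, units INTEGER, revenue REAL)""")
cur.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Widget"), (2, "Gadget")])
cur.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "Austin"), (2, "Boston")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 3, 30.0), (2, 1, 1, 50.0), (1, 2, 2, 20.0)])

# A typical star-schema query: join the fact to a dimension and aggregate.
rows = cur.execute("""
    SELECT p.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.name ORDER BY p.name
""").fetchall()
```

Every dimension is one join away from the fact table, which is what keeps star-schema queries simple.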
What’s an ERD?
An Entity-Relationship Diagram (ERD) is a visual representation that illustrates the relationships between entities in a database. ERDs are essential tools in database design and development, helping to organize and structure data models. ERDs serve as blueprints for database design, helping developers and stakeholders visualize and understand the structure and relationships within a database system.
What’s a snowflake schema?
The snowflake schema is a multi-dimensional data model commonly used in business intelligence (BI) and reporting. It’s an extension of the star schema, with dimension tables broken down into subdimensions. It’s normalized, so there isn’t data redundancy.
Consider a healthcare organization analyzing patient data. The fact table represents medical procedures, and the dimension tables include patients, doctors, hospitals, and diagnoses. Since there are multiple levels of hierarchy (e.g., patient demographics, doctor specialties), a snowflake schema allows for more detailed analysis.
What is a Database Management System (DBMS)?
a database management system is software that allows users to define, create, maintain, and control access to databases.
What is middleware?
Middleware is software that sits between different software applications or services, enabling them to communicate and work together. It’s like the glue that connects different components of a system, ensuring smooth data flow and interaction.
Middleware provides common services and capabilities such as messaging, authentication, and data integration, facilitating communication and management of data between different systems and applications.
What are Apache HTTP Server and Nginx examples of? What do they do?
Examples of Middleware: These are Web Servers. They Serve web pages and handle HTTP requests from clients.
What are IBM WebSphere and Oracle WebLogic examples of? What do they do?
Examples of Middleware: Application Servers. Function: Host and manage web applications, providing services like transaction management and security, e.g., hosting your online store application and making it accessible to users via the web.
What are ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity) examples of? What do they do?
Examples of Database Middleware. Function: Facilitate interaction between applications and databases.
What are RabbitMQ and Apache Kafka examples of? What do they do?
Examples of Message-Oriented Middleware (MOM). Function: Handle message passing between distributed systems.
What are gRPC and Apache Thrift examples of? What do they do?
Examples of Remote Procedure Call (RPC) Middleware: Function: Enable functions in different systems to call each other as if they were local.
You (Client): Make a request (order a pizza).
Restaurant (Server): Receives your request, performs the task (makes the pizza), and sends back the result.
Phone Call (RPC Mechanism): Facilitates the communication between you and the restaurant.
In RPC, instead of calling a restaurant, you're calling a function or procedure on another computer as if it were on your own computer. The "restaurant" (server) does the work and sends the result back to you (client), allowing you to use the result just like you would use the pizza you ordered.
What is the ETL Process?
Extraction: pulling data from the source system
Transformation: subjecting the data to a number of operations before it can be imported
Loading: Involves physically placing extracted and transformed data in the target database
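The three steps above can be sketched in a few lines of Python; the source rows and target list are in-memory stand-ins for a real source system and target database:

```python
# A minimal ETL sketch with in-memory stand-ins for source and target.
source_rows = ["  alice ,120 ", "BOB,80", "carol, 200"]  # extract: raw CSV-like lines
target_table = []                                        # load destination

def transform(line):
    # Transformation: trim whitespace, normalize case, cast types.
    name, amount = line.split(",")
    return {"name": name.strip().title(), "amount": int(amount)}

for line in source_rows:          # extraction
    record = transform(line)      # transformation
    target_table.append(record)   # loading
```

In a real pipeline each stage would talk to actual systems (source databases, staging areas, the warehouse), but the extract-transform-load shape is the same.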
What is Back flushing?
Back flushing refers to feeding cleansed and validated data from the data warehouse back to the source system(s).
What is Purge processing?
refers to the methodical removal or deletion of obsolete, unnecessary, or outdated data from a database, system, or storage to improve performance, manage storage space, and maintain system efficiency. This process is crucial for maintaining data hygiene and ensuring that only relevant and current data is kept within the system.
What is merge processing?
Merge processing refers to the operation of combining two or more datasets, files, or data streams into a single, unified dataset. This process is commonly used in data management, particularly when dealing with sorted data, where the goal is to maintain a specific order in the resulting merged dataset.
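Python's standard library has this operation built in: heapq.merge combines already-sorted inputs into one sorted output. The two streams below are hypothetical stand-ins for, say, sorted daily extracts from two systems.

```python
import heapq

# Two already-sorted data streams, e.g. sorted extracts from two systems.
stream_a = [1, 4, 7]
stream_b = [2, 4, 9]

# heapq.merge lazily combines sorted inputs into one sorted output.
merged = list(heapq.merge(stream_a, stream_b))
```

Because the inputs are already sorted, the merge is a single linear pass rather than a full re-sort of the combined data.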
What is parallel processing?
Parallel processing is a method in computing where multiple processors or computers work simultaneously on different parts of a task to complete it faster. It divides a large problem into smaller sub-problems, solves them concurrently, and then combines the results.
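As a small illustration, the divide/solve/combine pattern can be sketched with Python's concurrent.futures. Threads are used here for brevity; genuinely CPU-bound work would typically use processes (or separate machines) instead.

```python
from concurrent.futures import ThreadPoolExecutor

def part_sum(chunk):
    # Each worker solves one sub-problem independently.
    return sum(chunk)

data = list(range(100))
# Divide the large problem into smaller sub-problems.
chunks = [data[i:i + 25] for i in range(0, 100, 25)]

# Solve the sub-problems concurrently, then combine the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(part_sum, chunks))
```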
What is a pipeline?
A pipeline in the context of computing and data processing refers to a series of processing steps arranged so that the output of one step is the input to the next.
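A simple way to picture this in Python is a chain of generators, where each stage consumes the previous stage's output (the stage names below are hypothetical):

```python
# Each stage consumes the previous stage's output, like a Unix pipe.
def read(lines):
    for line in lines:
        yield line.strip()

def parse(lines):
    for line in lines:
        yield int(line)

def keep_even(nums):
    for n in nums:
        if n % 2 == 0:
            yield n

raw = [" 1 ", "2", " 3", "4 "]
result = list(keep_even(parse(read(raw))))
```

Because generators are lazy, items flow through all three stages one at a time instead of each stage materializing a full intermediate dataset.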
What is partitioning in the ETL process?
Partitioning involves dividing large tables or datasets into smaller, more manageable parts based on a defined criteria (e.g., range of values, hash value). Each partition can then be processed independently.
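A hash-based variant of this can be sketched in a few lines; the rows and the "region" key are made up for illustration:

```python
# Hash partitioning: assign each row to one of n partitions by key,
# so rows sharing a key always land in the same partition.
def partition(rows, key, n):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

rows = [{"id": i, "region": r}
        for i, r in enumerate(["east", "west", "east", "north"])]
parts = partition(rows, "region", 3)
```

Each partition can then be handed to a separate ETL worker and processed independently.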
What is indexing in the ETL process?
Indexing involves creating indexes on columns in a database table to speed up data retrieval operations. Indexes allow the database engine to quickly locate rows that match certain criteria.
What is Parallel bulk loading in the ETL process?
Parallel bulk loading involves splitting data loading tasks into multiple concurrent processes or threads, each handling a portion of the data simultaneously.
What is a dimension?
A dimension is a data element that categorizes each item in a dataset into non-overlapping regions; examples are customer, region, and time. It represents an attribute such as product, region, sales channel, or time, and it is used to analyze facts, known as business metrics (the ways we measure the business).
What is a fact?
A fact is a business measure or metric used to measure business performance, such as sales, revenue, units sold, and costs. Facts are the values that change over time.
What are the three types of facts?
Additive: This type of fact, which is the most common, is a measurement that can be added across all dimensions in a fact table; examples include revenue, profit, sales, and cost
Semi-additive: This type of fact or measure can be added for some dimensions only, such as headcount
Non-additive: This type of fact cannot be summed for any dimension
What is Granularity?
Granularity refers to the lowest level of detail that is stored in a data warehouse fact table. For example, the lowest level of data can be maintained at the yearly, quarterly, monthly, weekly, daily, or hourly level. For more in-depth reporting capability, a finer (lower) level of granularity is preferred.
What is OLAP?
OLAP is a BI tool that addresses the need to perform multi-dimensional analysis. Query outputs are presented in a matrix or pivot, where the columns and rows are the dimensions. The values in the matrix are obtained from the measures, which are derived from the fact table records. The dimensions are derived from the dimension tables.
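The matrix/pivot idea can be sketched in plain Python; the fact records and dimension names (region, year) below are hypothetical:

```python
from collections import defaultdict

# Fact records: each row carries dimension values and a measure.
facts = [
    {"region": "East", "year": 2023, "sales": 100},
    {"region": "East", "year": 2024, "sales": 150},
    {"region": "West", "year": 2023, "sales": 80},
]

# Pivot: rows = region, columns = year, cell value = SUM(sales).
pivot = defaultdict(dict)
for f in facts:
    cell = pivot[f["region"]].get(f["year"], 0)
    pivot[f["region"]][f["year"]] = cell + f["sales"]
```

The resulting nested dict is the matrix an OLAP tool would display, with the dimensions as axes and the aggregated measure in the cells.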
How do you identify big data?
Volume (data at rest): The quantity of data that is generated.
Velocity (data in motion): The speed at which data arrives. Clickstreams and ad impressions capture user behavior at millions of events per second; high-frequency stock trading algorithms reflect market changes within microseconds; machine-to-machine processes exchange data between billions of devices; infrastructure and sensors generate massive log data in real time; online gaming systems support millions of concurrent users.
Variety (data in many forms): Nowadays, data is generated in both structured and unstructured formats.
Variability: Daily, seasonal, and event-triggered peak data loads can be challenging to manage, especially where unstructured data is involved.
Complexity: It is necessary to connect and correlate relationships, hierarchies, and multiple data linkages or the data can quickly spiral out of control. This characteristic is referred to as the 'complexity' of Big Data.
Veracity (data in doubt): Uncertainty about the reliability of the data.
What is MPP Database Systems (Massively Parallel Processing)?
MPP database systems are designed to handle large-scale data processing by distributing data and processing tasks across multiple nodes or servers. Each node operates independently and processes a subset of the data in parallel. Examples: Amazon Redshift, Google BigQuery, Teradata.
What is MapReduce?
MapReduce is a programming model and framework developed by Google for processing and generating large datasets in parallel across a distributed cluster of computers. Suited for processing unstructured or semi-structured data. MapReduce is like big detective kits for data. MapReduce is the processing component of Hadoop. It allows processing of large datasets in parallel across the nodes in the cluster. Divides tasks into smaller sub-tasks (Map phase) and processes them independently, then aggregates results (Reduce phase). Typically used for batch processing of data, such as log processing, data transformation, and ETL (Extract, Transform, Load) operations.
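The classic word-count example shows the Map and Reduce phases; this is a single-machine sketch of the model, not Hadoop itself:

```python
from collections import defaultdict

documents = ["the cat sat", "the cat ran"]

# Map phase: emit (word, 1) pairs from each input split.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group independently.
counts = {word: sum(vals) for word, vals in groups.items()}
```

On a real cluster the map tasks run on different nodes over different input splits, and the shuffle moves each key's values to the node running its reduce task.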
What is operational big data and whats’ a tool you would use for it?
Operational Big Data is about handling data quickly for everyday tasks, where NoSQL databases shine.
What is Hadoop?
Hadoop is like a powerful toolbox that can handle both fast data processing (like NoSQL databases do) and deep analysis (like MPP databases and MapReduce do). It’s a framework that lets you store lots of data across many computers and then process that data in different ways.
What is analytical Big Data and what are some tools you would use for it?
Analytical Big Data is about finding patterns and insights from large amounts of data over time, where MPP databases and tools like MapReduce come in.
What is MongoDB?
MongoDB is a type of software used to store and manage data. It’s part of a broader category of databases called NoSQL databases, which means it doesn’t use the traditional relational database structures that you might be familiar with from systems like MySQL or PostgreSQL.
What are Samza, Spark Streaming, Storm used for?
These are commonly used for processing streaming data in real-time.
What are these used for Drill and Impala?
Drill, Impala: These are focused on enabling fast SQL queries on large datasets.
What is Apache Impala?
Apache Impala is like a super-fast detective that helps you find answers in a very big library of information (data). Imagine you have a huge library (like a library with millions of books). Apache Impala can split up the work of finding information across many librarians (computers). Each librarian (computer) works on their own set of books (data), which makes it much faster to find what you need. Apache Impala understands a language called SQL, which is like a common language used to ask questions to databases. You can ask Apache Impala questions in SQL, and it will find the answers for you in your data.
What is Hadoop Distributed File System (HDFS)?
HDFS is the storage component of Hadoop. It stores large files in a distributed manner across multiple nodes in a Hadoop cluster.
What are nodes in Hadoop?
Nodes: A Hadoop cluster consists of multiple nodes (computers), typically divided into two types: NameNode (manages file system metadata) and DataNode (stores actual data).
What is a data lake?
A data lake is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. It can store data from different sources such as websites, mobile apps, and social media platforms. An example is Azure Data Lake Storage.
What are the 4 core components of Hadoop?
Hadoop Distributed File System (HDFS): HDFS is a distributed file system that stores data across multiple nodes in a Hadoop cluster. It provides high throughput and fault tolerance by replicating data blocks across nodes. Ideal for storing large datasets (structured, semi-structured, and unstructured) with high aggregate bandwidth for data processing.
Hadoop MapReduce: MapReduce is a programming model and processing framework for large-scale data processing across a Hadoop cluster. It breaks down tasks into smaller Map and Reduce operations that can be executed in parallel. Batch processing tasks such as data transformation, ETL (Extract, Transform, Load), and analytical processing on large datasets stored in HDFS.
Hadoop YARN (Yet Another Resource Negotiator): YARN is a resource management and job scheduling platform in Hadoop. It manages and allocates computing resources (CPU, memory) across nodes in the cluster to run various applications. Enables multi-tenancy and supports diverse workloads including MapReduce, Spark, and other distributed computing frameworks.
Hadoop Common: Hadoop Common includes libraries, utilities, and necessary components shared by other Hadoop modules. It provides core functionalities like file system interfaces, networking, security, and configuration management. It is the foundation for building and running Hadoop-based applications, ensuring compatibility and abstraction of underlying complexities.
What is Apache Pig?
A high-level data flow scripting language and execution framework for parallel computation.
What is Apache Hive?
A data warehouse infrastructure that provides SQL-like querying (HiveQL) and metadata management on large datasets stored in Hadoop.
What is Apache HBase?
It is a distributed, scalable, and NoSQL database that provides real-time read/write access to data stored in HDFS.
What is Apache Spark?
It is an open-source, distributed computing system that provides in-memory processing for real-time data analytics and machine learning.
What is Apache Kafka ?
Kafka is a distributed streaming platform that handles real-time data feeds.
What is Apache Zeppelin?
Apache Zeppelin is an open-source web-based notebook that provides an interactive and collaborative environment for data analysis, visualization, and exploration. It supports a variety of data sources and processing engines, making it a versatile tool for data scientists, analysts, and developers.
What does DQ stand for?
Data Quality
What are six steps you can take to make sure your data is cleaned?
Step 1: Remove irrelevant data.
Step 2: Deduplicate your data.
Step 3: Fix structural errors.
Step 4: Deal with missing data.
Step 5: Filter out data outliers.
Step 6: Validate your data.
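The six steps above can be sketched on a toy record set; the field names and thresholds (age, the 120 cutoff) are made up for illustration:

```python
# Toy raw records exhibiting the problems each cleaning step targets.
raw = [
    {"name": "Ada", "age": 30, "note": "ok"},
    {"name": "Ada", "age": 30, "note": "ok"},   # duplicate
    {"name": "bo ", "age": None, "note": "ok"}, # structural error + missing value
    {"name": "Cy", "age": 999, "note": "ok"},   # outlier
]

cleaned, seen = [], set()
for row in raw:
    name = row["name"].strip().title()                # step 3: fix structural errors
    age = row["age"] if row["age"] is not None else 0 # step 4: deal with missing data
    if age > 120:                                     # step 5: filter out outliers
        continue
    key = (name, age)
    if key in seen:                                   # step 2: deduplicate
        continue
    seen.add(key)
    cleaned.append({"name": name, "age": age})        # step 1: drop irrelevant "note"

assert all(0 <= r["age"] <= 120 for r in cleaned)     # step 6: validate
```

Real cleaning pipelines apply the same steps with domain-specific rules (valid ranges, canonical formats, match keys for deduplication).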
What is Oracle Business Intelligence Enterprise Edition (OBIEE)?
OBIEE is a comprehensive suite of enterprise BI tools designed to deliver a full range of analytic and reporting capabilities.
What is IBM Cognos?
It is a BI and performance management software suite designed to enable business users without technical knowledge to extract corporate data, analyze it, and assemble reports.
What is SAP BusinessObjects?
SAP is a suite of front-end applications that allow business users to view, sort, and analyze business intelligence data.
What is a Client-Server Model?
The client-server model is a network architecture where client devices request services and resources from a central server. Clients send requests to the server over a network, and the server processes these requests and returns the appropriate responses. This model can be used for a wide variety of services, including file sharing, email, and web services.
An example is Email clients retrieving emails from an email server.
What is virtualization?
Virtualization is the process of creating a virtual version of something, such as a server, a desktop, a storage device, an operating system, or network resources. You might use this to test software in different environments without needing separate physical hardware. Example brands: VMware, Microsoft Hyper-V, Oracle VM VirtualBox, Citrix XenServer.
Virtualization software creates an abstraction layer over computer hardware that allows the hardware elements of a single computer—processors, memory, storage, and more—to be divided into multiple virtual computers, commonly called virtual machines (VMs). Each VM runs its own operating system and applications, independently of the other VMs.
What is grid computing?
Grid computing involves a distributed architecture of large numbers of computers connected to solve a complex problem. Computers in the grid work together to complete tasks by dividing the workload among them. Each computer works on a small part of the task independently and then combines the results. Software brands: Globus Toolkit, Apache Hadoop, BOINC (Berkeley Open Infrastructure for Network Computing).
An example of use might be Scientific research projects requiring large-scale computation, such as climate modeling or genome sequencing.
What is a Mainframe Computer?
A mainframe computer is a large, powerful, and expensive computer system capable of handling and processing very large amounts of data quickly. Mainframes are designed to manage high volumes of input and output and support numerous simultaneous users. They use specialized operating systems to manage hardware and software resources efficiently.
Examples are IBM Z Series, Unisys ClearPath, and Fujitsu BS2000.
What is Service-Oriented Architecture (SOA)?
SOA is a design pattern where services are provided to other components by application components, through a communication protocol over a network.
In SOA, services are modular and can be independently deployed and scaled. Each service has a well-defined interface and communicates with other services over standard protocols.
Here are some of the software brands: IBM WebSphere, Oracle SOA Suite, Microsoft BizTalk Server.
An example of use is an e-commerce system integrating various services such as payment processing, inventory management, and customer support.
What is code on demand?
Code on demand is a web application design pattern where code is sent from the server to the client and executed on the client-side. When a client requests data, the server sends back not just the data but also code that can process and display that data. This code is executed on the client’s machine, often improving performance and interactivity.
Software Brands:
JavaScript libraries (e.g., jQuery, React.js), Adobe Flash (historically), Microsoft Silverlight (historically).
Examples of Use:
Dynamic web applications where parts of the UI update without a full page reload.
What is OS-level Virtualization?
OS-level virtualization is a method of virtualization in which the kernel of an operating system allows multiple isolated user-space instances. Instead of emulating an entire machine as in full virtualization, OS-level virtualization runs multiple isolated systems on a single host with a shared OS kernel.
Software Brands:
Docker, OpenVZ, LXC (Linux Containers).
Containerized applications where each container runs a single application in an isolated environment.
What is Infrastructure Utilization?
Infrastructure utilization refers to the efficient use of IT infrastructure resources such as servers, storage, and networking.
Techniques such as virtualization, load balancing, and resource management are employed to ensure that infrastructure resources are used effectively and efficiently, minimizing waste and maximizing performance.
Software Brands:
VMware vSphere, Microsoft System Center, Red Hat Ansible.
Cloud computing environments where resources are allocated dynamically based on demand.
What is Distributed Computing?
Distributed computing involves multiple computers working together over a network to achieve a common goal.
Tasks are divided into smaller subtasks, which are distributed among the networked computers. Each computer processes its subtask independently, and the results are combined to complete the overall task.
Apache Spark, Hadoop, Microsoft Azure Batch, Amazon EC2.
Big data processing where datasets are too large for a single machine.
Scientific simulations that require extensive computational power.
What is Parallel computing?
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems are divided into smaller ones, which are then solved concurrently. It leverages multiple processors or computers to perform these tasks simultaneously, either by dividing tasks among multiple processors within a single computer or distributing tasks across multiple networked computers.
Software Brands:
MPI (Message Passing Interface), OpenMP, CUDA (for GPU computing), Apache Spark.
Examples are: Weather forecasting models that require massive amounts of data processing.
What is a NoSQL Database?
A NoSQL database is a non-relational database designed to store, retrieve, and manage large volumes of unstructured or semi-structured data.
NoSQL databases do not use the traditional table-based relational database structure. They can store data in various formats like key-value pairs, document-oriented, column-family, and graph formats.
MongoDB, Cassandra, Redis, Couchbase, Amazon DynamoDB.
example Storing user session information in distributed systems.
What’s a Virtual Machine?
A virtual machine (VM) is a software-based emulation of a physical computer that runs an operating system and applications just like a physical computer.
A hypervisor or virtual machine monitor (VMM) creates and runs VMs by allocating hardware resources such as CPU, memory, and storage from the host system. Each VM operates independently with its own OS and applications.
Running multiple operating systems on a single physical machine for development and testing.
What is a disk image?
A disk image is a snapshot of a storage device captured at a specific point in time. It includes all the files, directories, and metadata, as well as the file system structure, partition table, and other system-specific information.
What is a firewall?
A firewall is a network security device that monitors and controls incoming and outgoing network traffic based on predetermined security rules.
What is a virtual-machine disk image library?
A virtual-machine disk image library is a collection of disk images used by virtual machines, containing the complete content and structure of a storage device.
Maintaining a library of standard OS images for rapid deployment of VMs.
What is Raw Block Storage?
Raw block storage is a type of storage where data is organized and managed in small, fixed-size pieces called blocks. These blocks are directly controlled by the operating system or applications, giving more flexibility in how data is used and stored.
In raw block storage, the storage device (like an SSD or a hard drive) divides its capacity into uniformly sized blocks, each identified by a unique address. The operating system or application can read from or write to these blocks directly, without needing to worry about the structure of the data within them.
For example, think of raw block storage like a large grid of numbered boxes. The OS or application can put data into any box it chooses and later retrieve it by referring to the box’s number. This method is efficient and allows for high performance and flexibility, especially useful for databases and virtual machines that need fast, low-level access to the storage medium. Block storage is optimized for performance and low-latency access.
This is different than file storage, which organizes data into a hierarchical structure of files and folders, similar to how documents are stored in a physical filing cabinet.
File Storage: Shared drives, user directories, and document management systems.
Block Storage: Databases, virtual machines, and high-performance applications.
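The "grid of numbered boxes" idea above can be sketched in a few lines of Python. This is a toy model, not a real storage driver: the block size and device capacity are illustrative assumptions.

```python
# Toy sketch of raw block storage: a device's capacity divided into
# fixed-size, numbered blocks, read and written by block address.
BLOCK_SIZE = 16  # bytes per block (real devices commonly use 512 or 4096)

class BlockDevice:
    def __init__(self, num_blocks):
        # The whole device is just one flat run of bytes.
        self.data = bytearray(num_blocks * BLOCK_SIZE)

    def write_block(self, block_no, payload):
        # Pad the payload to exactly one block, then store it at its address.
        assert len(payload) <= BLOCK_SIZE
        start = block_no * BLOCK_SIZE
        self.data[start:start + BLOCK_SIZE] = payload.ljust(BLOCK_SIZE, b"\x00")

    def read_block(self, block_no):
        start = block_no * BLOCK_SIZE
        return bytes(self.data[start:start + BLOCK_SIZE])

dev = BlockDevice(num_blocks=4)
dev.write_block(2, b"hello")
print(dev.read_block(2).rstrip(b"\x00"))  # b'hello'
```

Note that the device knows nothing about what the bytes mean; imposing files and folders on top of such blocks is exactly what a file system does.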
What is Object Storage?
Object storage stores data as discrete units called objects, each with its own unique identifier and metadata, in a flat structure without a hierarchy. It is used for storing large amounts of unstructured data such as photos, videos, backups, and logs.
What is a load balancer?
A load balancer is a device or software that distributes network or application traffic across multiple servers to ensure reliability and performance. Load balancers use algorithms to distribute incoming traffic. Common methods include round-robin, least connections, and IP hash. This helps prevent any single server from becoming overwhelmed, ensuring that applications remain responsive and available.
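The round-robin method mentioned above can be sketched as follows; the server names are illustrative placeholders.

```python
from itertools import cycle

# Minimal round-robin load balancing: hand each incoming request
# to the next server in a fixed rotation.
servers = ["server-a", "server-b", "server-c"]
rotation = cycle(servers)

def route(request_id):
    # Each call picks the next server in the rotation.
    return next(rotation)

assigned = [route(i) for i in range(6)]
print(assigned)
# ['server-a', 'server-b', 'server-c', 'server-a', 'server-b', 'server-c']
```

A least-connections strategy would instead track how many open requests each server has and pick the minimum; round-robin is the simplest option when servers are roughly equal in capacity.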
What is a VLAN?
A VLAN is a virtualized version of a physical local area network, allowing devices on different physical LANs to be grouped together into a single logical network.
What is Microsoft Azure?
Microsoft Azure is a cloud computing platform and service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers.
What is Google App Engine?
Google App Engine is a platform-as-a-service (PaaS) offering from Google that allows developers to build and host web applications in Google-managed data centers.
What is Ajax (Asynchronous JavaScript and XML)?
Ajax (Asynchronous JavaScript and XML) is a set of web development techniques using many web technologies to create asynchronous web applications. Ajax allows web pages to be updated asynchronously by exchanging small amounts of data with the server behind the scenes. This means parts of a web page can be updated without reloading the entire page.
What is a Native Application?
A native application is a software program developed for use on a particular platform or device, typically using platform-specific programming languages and tools.
What is a hybrid cloud?
A hybrid cloud is a computing environment that combines a private cloud and a public cloud, allowing data and applications to be shared between them.
Explain what Infrastructure as a Service (IaaS) means?
IaaS provides virtualized computing resources over the internet. It offers fundamental computing, networking, and storage resources to consumers on a pay-as-you-go basis.
How It Works:
In an IaaS model, a cloud provider hosts the infrastructure components traditionally present in an on-premises data center, including servers, storage, and networking hardware, as well as the virtualization or hypervisor layer. Users can provision and manage these resources through a web-based console or API.
Amazon Web Services (AWS) EC2: Provides scalable virtual servers.
Microsoft Azure Virtual Machines: Offers Windows and Linux virtual machines.
Google Compute Engine: Provides infrastructure for running large-scale workloads.
Use Cases:
Hosting websites and web applications.
Setting up development and test environments.
Running enterprise applications and big data analytics.
What is Platform as a Service (PaaS)?
PaaS provides a platform allowing customers to develop, run, and manage applications without dealing with the underlying infrastructure. It includes operating systems, middleware, and development tools.
PaaS delivers a framework for developers to build upon and use to create customized applications. The infrastructure (servers, storage, networking) is managed by the cloud provider, while developers focus on writing code and integrating it into the platform.
Developing and deploying web applications.
Rapid prototyping and development of software.
Building scalable mobile and web apps.
What is Software as a Service (SaaS)?
SaaS delivers software applications over the internet, on a subscription basis. Users access the software via a web browser, and the provider manages the infrastructure, middleware, and application software.
In the SaaS model, the cloud provider hosts the software application and manages all the technical aspects, such as infrastructure, maintenance, and security. Users simply access the application through a web browser or a thin client.
Microsoft Office 365: Provides cloud-based versions of Microsoft Office applications like Word, Excel, and PowerPoint.
What are the key differences between IaaS, PaaS, and SaaS?
IaaS: Users have the most control over the infrastructure. They are responsible for managing operating systems, applications, and data. Offers the highest level of flexibility, allowing customization of the infrastructure. Ideal for companies needing full control over their infrastructure.
PaaS: Users focus on application development and management. The provider handles the underlying infrastructure and platform maintenance. Balances flexibility and ease of use, providing tools and frameworks for development. Suitable for developers who want to focus on coding without worrying about the underlying infrastructure.
SaaS: Users have the least control and responsibility. They only use the software application and the provider manages everything else. Offers the least flexibility but the highest convenience and ease of use. Perfect for end-users who need to use software applications without dealing with maintenance or updates.
What are the seven types of quantitative messaging?
Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.
Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the sales persons.
Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period. A bar chart can show the comparison of the actual versus the reference amount.
Frequency distribution: Shows the number of observations of a particular variable for a given interval, such as the number of years in which the stock market return is between intervals such as 0-10%, 11-20%, etc. A histogram, a type of bar chart, may be used for this analysis.
Correlation: Comparison between observations represented by two variables (X,Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.
Nominal comparison: Comparing categorical subdivisions in no particular order, such as the sales volume by product code. A bar chart may be used for this comparison.
Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.
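The frequency-distribution message above can be sketched by binning values into intervals; the yearly return figures below are made-up illustrative values, not real market data.

```python
# Count how many yearly stock-market returns fall into each interval,
# producing the counts a histogram would display.
returns = [3, 7, 12, 18, 24, 5, 15, 9, 21, 2]  # illustrative percentages

bins = {"0-10%": 0, "11-20%": 0, "21-30%": 0}
for r in returns:
    if r <= 10:
        bins["0-10%"] += 1
    elif r <= 20:
        bins["11-20%"] += 1
    else:
        bins["21-30%"] += 1

print(bins)  # {'0-10%': 5, '11-20%': 3, '21-30%': 2}
```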
What’s a pareto chart?
A Pareto chart is a type of bar chart that represents the frequency or impact of problems or causes in descending order, combined with a line chart that shows the cumulative percentage of the total. Pareto charts are used to identify the most significant factors in a dataset. The principle behind the Pareto chart is the 80/20 rule, which suggests that roughly 80% of effects come from 20% of the causes.
Use Cases:
Quality Control: Identifying the most common defects or issues in a manufacturing process.
Sales Analysis: Highlighting the top products or customers that generate the most revenue.
Problem Solving: Focusing on the most impactful issues to prioritize improvement efforts.
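The arithmetic behind a Pareto chart is simple to sketch: sort the causes by frequency in descending order and compute the cumulative percentage at each step. The defect counts below are illustrative.

```python
# Pareto ordering: causes sorted by count (descending) with a running
# cumulative percentage of the total.
defects = {"scratch": 45, "dent": 30, "misalignment": 15, "discoloration": 10}

total = sum(defects.values())
cumulative, running = [], 0
for cause, count in sorted(defects.items(), key=lambda kv: kv[1], reverse=True):
    running += count
    cumulative.append((cause, count, round(100 * running / total)))

print(cumulative)
# [('scratch', 45, 45), ('dent', 30, 75), ('misalignment', 15, 90), ('discoloration', 10, 100)]
```

In a Pareto chart, the counts become the bars and the third column becomes the cumulative-percentage line.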
What is a Sparkline chart?
A sparkline is a very small chart, typically a simple line drawn without axes or labels. Sparklines are used to give a quick visual summary of data, without the need for a detailed chart or graph. They can show trends, patterns, and variations in data in a space-efficient manner.
What is a tree map chart?
A tree map is a visualization that displays hierarchical data using nested rectangles. Each rectangle represents a category, with the size of the rectangle proportional to its value.
Tree maps are used to visualize large amounts of hierarchical data in a compact space, making it easy to spot patterns, trends, and outliers. Colors can be added to differentiate categories or to represent additional dimensions of data.
What is predictive analytics?
Predictive analytics is a set of BI technologies that uncovers relationships and patterns within large volumes of data that can be used to predict behavior and events.
What is Interoperability?
Interoperability is the ability of different systems and software to exchange and make use of each other's data and models. The process of creating and deploying predictive models traditionally involves accessing or moving data and models among multiple machines, operating platforms, and applications, which requires interoperable software.
What is count regression?
Count regression is a type of regression analysis used for modeling count data, where the response variable is a count of occurrences of an event.
What is linear regression?
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. Linear regression is used to predict the value of a dependent variable based on the values of independent variables. It provides insights into the strength and nature of the relationship between variables.
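The fit that linear regression computes can be sketched with the closed-form least-squares formulas for a single predictor; the data points below are illustrative and deliberately lie on the line y = 2x.

```python
# Ordinary least squares for y = slope * x + intercept, one predictor.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]  # perfectly linear: y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Slope = covariance of x and y divided by variance of x.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 2.0 0.0
```

With noisy real-world data the fitted line will not pass through every point; the formulas minimize the sum of squared vertical distances to the line.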
What is logistic regression?
Logistic regression is a statistical method used to model binary outcome variables (i.e., variables that have two possible outcomes) by estimating the probability of a certain event occurring. Logistic regression is used for classification tasks where the goal is to predict the probability of an outcome falling into one of two categories.
What is cluster analysis?
Cluster analysis is a technique used to group a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups.
Cluster analysis is used for exploratory data analysis to identify patterns or groupings in data without predefining categories.
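One common clustering technique, k-means, can be sketched in one dimension: assign each point to its nearest center, then move each center to the mean of its assigned points, and repeat. The data and starting centers are illustrative.

```python
# Tiny 1-D k-means with two clusters. Assumes neither cluster ever
# ends up empty, which holds for this illustrative data.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [0.0, 5.0]  # deliberately poor starting guesses

for _ in range(10):  # a few iterations are plenty here
    clusters = {0: [], 1: []}
    for p in points:
        nearest = min((0, 1), key=lambda c: abs(p - centers[c]))
        clusters[nearest].append(p)
    centers = [sum(c) / len(c) for c in clusters.values()]

print(centers)  # [1.5, 10.5]
```

The two groups in the data are recovered without any predefined category labels, which is the exploratory point of cluster analysis.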
What is time series?
Time series analysis involves statistical techniques for analyzing time-ordered data points to identify patterns, trends, and seasonal variations.
What is association analysis?
Association analysis, also known as market basket analysis, is a data mining technique used to discover relationships between variables in large datasets.
What is A/B analysis (or A/B testing)?
A/B analysis (or A/B testing) is a method of comparing two versions of a webpage, product, or feature to determine which one performs better.
A/B testing involves dividing users into two groups: one group sees version A, and the other group sees version B. The performance of each version is then measured to identify which one achieves better results.
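Evaluating an A/B test typically comes down to comparing the two conversion rates and checking whether the difference is statistically significant. A common approach is a two-proportion z-test; the visitor and conversion counts below are illustrative.

```python
from math import sqrt, erf

# Version A vs. version B conversion counts (illustrative).
conversions_a, visitors_a = 120, 1000  # 12.0% conversion
conversions_b, visitors_b = 150, 1000  # 15.0% conversion

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
# Pooled rate under the null hypothesis that A and B convert equally.
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
# Two-sided p-value from the normal CDF, written via the error function.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(round(p_b - p_a, 3), round(z, 2))  # 0.03 1.96
```

Here the 3-point lift in version B comes out just significant at the conventional 5% level; with smaller samples the same lift might not.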
What is Scoring?
Scoring refers to the process of assigning values (scores) to different entities based on their characteristics or behaviors to rank or categorize them.
What are the four types of data analyses?
Simple summation and statistics
Predictive (forecasting)
Descriptive (business intelligence and data mining)
Prescriptive (optimization and simulation)
What is transaction profiling?
Transaction profiling is the process of analyzing and categorizing transactions to understand patterns, detect anomalies, and gain insights into transactional behavior. This technique is commonly used in financial services, e-commerce, and other industries where understanding the nature and behavior of transactions is crucial.
What is Predictive search?
Predictive search is a search feature that suggests likely queries or results in real time as the user types, based on popular searches and the user's context or history.
What is Multinomial logistic regression?
Multinomial logistic regression is a type of regression analysis used for modeling outcomes where the dependent variable has more than two categorical outcomes.
A use case might be Marketing: Predicting the likelihood of a customer choosing among multiple product categories.
What is probit regression?
Probit regression is a type of regression where the dependent variable is binary, and the link function used is the cumulative normal distribution function (the probit link).
Probit regression estimates the relationship between the predictors and the probability of a binary outcome. Unlike logistic regression, which uses the logistic function, probit regression assumes a normal distribution of the error terms.
Credit Scoring: Estimating the probability of loan default.
Medicine: Modeling the likelihood of a patient having a disease based on various risk factors.
Time Series Models - What is it?
Statistical techniques used to analyze time-ordered data points to identify patterns, trends, and seasonal effects, and to make forecasts.
What is encapsulation?
Encapsulation isolates each application within its VM, ensuring that the application operates independently without interfering with other applications. This isolation helps maintain stability and security across multiple VMs running on the same physical hardware.
What is a Logical data warehouse?
A logical data warehouse (LDW) is an architectural approach to data management that combines traditional data warehousing with modern data integration techniques. Unlike a traditional data warehouse that relies on a single, physical repository of data, a logical data warehouse allows data to be accessed and analyzed across multiple, disparate data sources without requiring all data to be physically moved to a single location.
An e-commerce company using a logical data warehouse can combine customer data from their CRM system, transaction data from their order management system, and web analytics data from their online store to provide a holistic view of customer behavior and sales performance. This integrated view enables real-time analytics and informed decision-making without the need for complex ETL processes and extensive data replication.
Survival or duration analysis - what is it
A branch of statistics that deals with the analysis of time-to-event data, modeling the time until an event of interest occurs.
What are Classification and Regression Trees (CART)?
Decision tree techniques used for predictive modeling, with classification trees for categorical outcomes and regression trees for continuous outcomes.
CART creates a decision-making process by asking a series of yes-or-no questions.
The Tree: The result is a tree-like diagram where each branch represents a decision point, and each leaf represents a final prediction or outcome.
Use: It helps in making predictions or decisions based on data by breaking down complex information into simple, manageable parts.
What is Multivariate adaptive regression splines?
MARS is a technique used to predict outcomes when there are complex relationships between variables. Think of it as a smart way to fit a flexible, detailed curve or surface to your data.
Let’s say you’re trying to predict test scores based on hours studied and amount of sleep. Here’s how MARS might work:
Initial Guess: You start with a basic guess, like a simple line.
Refine: MARS adds flexible pieces to the model:
For students who studied a lot, it might add a curve that reflects a different pattern.
For students who didn’t study much, it might use a different curve or line.
Combine: All these pieces are put together to make a detailed prediction model that fits the data better than a simple line.
What is a Neural Network?
Neural Networks are a type of computer model designed to recognize patterns and make decisions, inspired by how the human brain works. They are used for tasks like recognizing images, understanding speech, and making predictions.
Neurons: The basic units of a neural network, similar to brain cells. Each neuron takes in information, processes it, and passes it to the next neuron.
Layers: Neurons are organized into layers:
Input Layer: The first layer, where data enters the network (e.g., an image or text).
Hidden Layers: Layers in the middle where the actual processing happens. Each neuron in these layers transforms the data in some way.
Output Layer: The final layer, where the network’s result or prediction is produced (e.g., identifying if an image is a cat or a dog).
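The layer structure above can be sketched as a toy forward pass: inputs flow through a hidden layer to a single output neuron. The weights are made-up, untrained values chosen purely for illustration.

```python
from math import exp

def sigmoid(x):
    # Squashes any weighted sum into the range (0, 1).
    return 1 / (1 + exp(-x))

inputs = [0.5, 0.8]                          # input layer: two features
hidden_weights = [[0.4, 0.6], [0.9, -0.2]]   # two hidden neurons
output_weights = [0.7, 0.3]                  # one output neuron

# Hidden layer: each neuron takes a weighted sum of the inputs,
# then applies the activation function.
hidden = [sigmoid(sum(w * i for w, i in zip(ws, inputs)))
          for ws in hidden_weights]
# Output layer: weighted sum of the hidden activations.
output = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

print(round(output, 3))
```

Training a real network means adjusting those weights (via backpropagation) so the output matches known examples; this sketch shows only the forward direction.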
What is Multilayer Perceptron (MLP) ?
Multilayer Perceptron (MLP) is a type of neural network with multiple layers of neurons. It’s like a more complex and powerful version of a basic neural network.
What is Radial Basis Functions (RBF) ?
Radial Basis Functions (RBF) are used in certain types of neural networks and machine learning algorithms to model complex data.
Function: RBF uses functions that depend on the distance from a central point (the “center”). It’s like creating a map where each point influences nearby areas more than distant areas.
Learning: It helps in making predictions based on how close data points are to known examples.
Example:
If you’re trying to predict the price of a house based on its location, RBF can help by focusing on houses nearby and adjusting predictions based on their prices.
What is Support Vector Machines (SVM) ?
Support Vector Machines (SVM) are used to classify data into categories by finding the best boundary (or “line”) that separates them.
How It Works:
Boundary: SVM finds a line or curve that best divides the data into different groups. It aims to maximize the distance between the boundary and the nearest data points from each group.
Classification: Once the boundary is found, new data can be classified based on which side of the boundary it falls on.
Example:
Imagine you have a bunch of apples and oranges with different sizes. SVM helps draw a line that best separates apples from oranges based on their sizes.
What is Naïve Bayes?
Naïve Bayes is a simple but powerful method for classifying data based on probability.
How It Works:
Assumptions: It assumes that the features (attributes) of data are independent of each other, which makes the calculations easier.
Probability: It uses probabilities to predict which category new data belongs to based on its features.
Example:
If you want to classify emails as “spam” or “not spam,” Naïve Bayes looks at the words in the email and calculates the probability of it being spam based on previous emails.
What is k-Nearest Neighbors (k-NN) ?
k-Nearest Neighbors (k-NN) is a simple method for classifying data based on the similarity to its neighbors.
Neighbors: It looks at the “k” closest data points (neighbors) to the new data point and classifies it based on the majority class of those neighbors.
Distance: Uses distance (e.g., how close or far away) to decide which data points are neighbors.
Example:
If you’re trying to classify a new fruit as an apple, orange, or banana, k-NN looks at the closest fruits in your dataset and classifies it based on which type is most common among those closest neighbors.
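The fruit example above can be sketched directly: measure the distance from the new fruit to every known fruit, take the k closest, and vote. All measurements below are illustrative.

```python
from collections import Counter

# (size_cm, weight_g, label) -- illustrative training examples.
fruits = [
    (7, 150, "apple"), (8, 170, "apple"),
    (8, 140, "orange"), (9, 160, "orange"),
    (18, 120, "banana"), (20, 130, "banana"),
]

def classify(size, weight, k=3):
    # Sort known fruits by squared distance to the new point.
    by_distance = sorted(
        fruits,
        key=lambda f: (f[0] - size) ** 2 + (f[1] - weight) ** 2)
    # Majority vote among the k nearest neighbors.
    labels = [label for _, _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

print(classify(7.5, 155))  # apple
```

One practical caveat: because distance mixes centimetres and grams here, real k-NN pipelines usually scale features first so no single unit dominates.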
What is Geospatial Predictive Modeling?
Geospatial Predictive Modeling is about predicting outcomes based on geographic or spatial data.
How It Works:
Location Data: It uses data that includes locations, like maps or coordinates, to make predictions. This could include factors like climate, terrain, or population density.
Modeling: Builds models to predict things like where a new restaurant might be successful or how weather patterns affect agriculture.
Example:
If you want to predict where to build a new store for the best sales, geospatial predictive modeling uses location data, like population density and competition, to find the best spots.
What are the two different types of data storage?
There are two types of data storage mechanisms:
- Disk (hard disk): In traditional disk-based technology, a query accesses data from multiple tables stored on a server’s hard disk. Disk-based technologies include relational database management systems such as Oracle, SQL Server, MySQL, and DB2.
- RAM (Random Access Memory): In-memory computing primarily relies on keeping data in a server’s RAM so that processing can be performed at very fast speeds.
Modern computers have more available disk storage than RAM.
How does in-memory processing work?
It’s like having access to all your files quickly compared to a database where you have to take time to search and find your files.
Uses techniques like compression to save space.
Data can be accessed within seconds by multiple concurrent users at a detailed level, offering the potential for excellent analytics. Reading data from disk is much slower (possibly hundreds of times) than reading the same data from RAM.