Keywords Flashcards
Data Warehouse
A centralized repository for storing and managing large volumes of data from various sources, used for querying, reporting, and data analysis to support business intelligence activities.
Multidimensional Data
Data organized into multiple dimensions, allowing for analysis from different perspectives, such as time, geography, and product categories.
OLAP
Online analytical processing (OLAP) is a database technology optimized for querying and reporting rather than transaction processing. It allows businesses to analyze data from multiple sources in different ways.
What is a server?
In computing, a server is a computer that provides services, data, or resources to other devices, called clients, over a network.
What is data mining?
The process of searching and analyzing a large batch of raw data in order to identify patterns and extract useful information.
What is Operational BI?
Operational Business Intelligence provides near real-time to short-term insights focused on daily operations, enabling immediate decision-making rather than long-term strategic planning.
What is analytics?
The methodology of transforming data into insight for making better decisions. It answers questions like why is this happening, what if these trends continue, what will happen next (i.e., predict), and what is the best that can happen.
What is web analytics?
The measurement, collection, analysis, and reporting of web data.
Explain the data analytics process
Identify business needs (choose KPIs)
Collect the data (from the subject matter experts)
Review and clean the data
Model the data
Analyze the data
Interpret the results
Predict and optimize
Communicate
What is the Gartner Magic Quadrant for BI?
An annual report that evaluates and ranks leading business intelligence (BI) vendors based on their ability to execute and their completeness of vision, helping organizations choose the best BI tools for their needs.
What is structured data?
It is highly organized data stored in databases and spreadsheets in columns and rows, ready to be integrated into a database or a structured file format such as XML. Only 20% of available data is like this.
What is unstructured data?
Unstructured data is raw and unorganized. It does not have a predefined data model. It has no identifiable internal structure.
What are three components of a data warehouse system?
Acquisition Component: Interfaces with source systems to import data into the data warehouse. ETL Tools (Extract, Transform, Load): Examples include Apache NiFi, Talend, and Microsoft SQL Server Integration Services (SSIS), which help import data from various sources into the data warehouse.
Storage Component: A large physical database used to store the imported data. Databases: Examples include Amazon Redshift, Google BigQuery, and Snowflake, which store large volumes of data in a structured format.
Access Component: Enables accessing and analyzing the data in the data warehouse. BI Tools: Examples include Tableau, Power BI, and Looker, which allow users to access, query, and analyze data stored in the data warehouse.
What are some of the benefits of metadata?
- It records the source of the imported data and the operations that were applied to it
- It documents relationships between data structures
- It provides useful mapping information
- It can be used to review how business definitions and calculations changed over time, and it provides a history of extracts and changes in data over time
What is business metadata?
Provides information about the data, its sources, definitions, etc. in business terminology
What is Technical metadata?
It defines the objects and processes in the data warehouse
What is process metadata?
It documents the data warehouse operations.
What is access metadata?
Access metadata provides the dynamic link between a data warehouse and its associated applications.
What is data conversion?
Refers to the process of converting data from one format into another due to differences in storage types and data structures, as well as variations in data encoding across computer systems
What is data integration?
Data Integration: Imagine you work with data from multiple sources, like sales records, customer information, and inventory lists. Data integration is the process of combining this data into a single, unified view. It’s like creating a master spreadsheet where all this information is brought together, making it easier to analyze and share with your team or partners.
What is data migration?
Data Migration: Data migration is when you move data from one system to another. For example, if your company is switching from an old database to a new one, you would transfer all the data from the old system to the new one. Once the data is successfully moved, the old system is no longer needed and can be retired.
What is data quality?
How accurate, complete, reliable, and relevant data is for its intended use, ensuring that data is consistent, free of errors, and useful for making decisions and analysis.
What is Master Data Management?
Master Data Management (MDM): MDM focuses on creating a single, consistent, and accurate view of key business data entities, such as customers, products, and suppliers. It involves processes and tools for integrating, cleansing, and maintaining this master data across different systems and departments to ensure consistency and reliability.
What are some data cleansing and tool categories?
Data error discovery tools and data correction tools
What is a relational database?
Relational Databases: Relational databases are a type of database management system (DBMS) that stores data in tables with rows and columns, where each table represents a relation. The tables are related to each other through keys (primary and foreign keys), allowing for efficient querying and retrieval of data using structured query language (SQL). Relations between tables enforce data integrity and enable complex queries and transactions to be performed on the data. Examples include MySQL, PostgreSQL, Oracle Database, and Microsoft SQL Server.
What is decentralized processing?
Decentralized Processing: Decentralized processing means spreading out computing tasks and data management across many separate devices or nodes in a network. Each device works on its own and collaborates with others to get things done without needing one main server to control everything. This setup makes it easier to handle big amounts of data and complex jobs across different parts of a network, making systems more flexible and reliable. Example would be a peer-to-peer (P2P) file-sharing network like BitTorrent
What is extract processing?
Extract processing typically refers to the process of extracting data from one system or source for use in another system or for further analysis. The term is often used in the context of ETL (Extract, Transform, Load) processes in data integration and data warehousing.
What is an OLTP?
Online Transaction Processing Systems (OLTPs): OLTP systems are specialized databases designed to manage and facilitate a large number of short, atomic transactions in real time. These systems are optimized for tasks such as data entry, retrieval, and processing, commonly used in environments where fast query processing and maintaining data integrity in multi-user access scenarios are critical. Examples of OLTP applications include banking systems for handling ATM transactions, retail systems for processing sales, and reservation systems for booking flights or hotels.
OLTP System is an Operational Database.
What is an ERP?
Enterprise Resource Planning (ERP):
Definition: ERP systems are integrated software platforms that manage and automate core business processes across various departments within an organization. These processes include finance, human resources, manufacturing, supply chain, procurement, and more.
SAP, Oracle ERP, and Microsoft Dynamics are popular ERP systems used by organizations to manage their day-to-day business activities.
What’s a data mart?
A data mart is a smaller, more focused version of a data warehouse designed to meet the specific needs of a department or a smaller group within an organization.
What's an operational data store (ODS)?
An Operational Data Store (ODS) serves as a central database that captures a snapshot of the latest data from multiple transactional systems. Its purpose is to support operational reporting and provide a source of data for the enterprise data warehouse (EDW). An ODS is a subject-oriented database that contains structured data extracted directly from OLTP systems.
Examples include patient records, inventory management data, transaction data, and meter readings.
Used as a staging area before data is imported into a data warehouse
It contains current, or near current, data and its objective is to meet the ad hoc query, tactical day-to-day, needs of operational users.
An ODS can be updated frequently from operational systems.
What is data federation?
Data federation is a data integration technique that provides a unified view of data from multiple sources without physically consolidating it. Imagine it as a sophisticated mechanism that allows you to access and query data across various systems in real time, as if it were all stored in a single location. Essentially, it creates a virtual database that maps several distinct data sources within an enterprise, making them accessible through a single interface.
What’s an Online Analytical Processing (OLAP) system?
Online Analytical Processing (OLAP) system is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to interactively analyze multidimensional data from multiple perspectives. They are a crucial part of business intelligence systems, facilitating complex queries and analysis that support decision-making processes.
What is a MOLAP (Multidimensional OLAP)?
Data is stored in a multidimensional cube, which allows for fast retrieval and analysis.
What is a ROLAP (Relational OLAP)?
Data is stored in a relational database, and complex queries are used to perform multidimensional analysis. ROLAP can handle large volumes of data but may have slower query performance compared to MOLAP.
What is a HOLAP (Hybrid OLAP)?
Combines features of both MOLAP and ROLAP, storing part of the data in multidimensional cubes and part in a relational database.
Compare relational databases, OLTP, and OLAP.
Relational Databases: A Relational Database Management System (RDBMS) is a type of database management system that organizes data into tables (relations) where data points are related to one another through common fields.
OLTP (Online Transaction Processing): Transactional systems that require fast query processing and maintain data integrity in multi-access environments. They are optimized for inserting, updating, and deleting small amounts of data, and they manage and facilitate the transactional operations of day-to-day business activities, e.g., Point-of-Sale (POS) systems and online banking. De-normalization creates redundant data and suits a data warehouse; it is not appropriate for a transaction database, which avoids redundancy to protect data integrity and write performance.
OLAP Systems (Online Analytical Processing): Designed for analysis and reporting. They deal with historical, summarized, and aggregated data rather than real-time transactional data, and they are optimized for complex queries and aggregations over large volumes of data, often supporting data warehousing and business intelligence applications. Data is organized in multidimensional cubes with measures and dimensions; cubes allow for fast retrieval of aggregated data. OLAP systems use specialized query languages like MDX (Multidimensional Expressions). SSAS, part of Microsoft's SQL Server suite, is a powerful tool for creating OLAP cubes and supports both multidimensional (MOLAP) and tabular data models. Use cases include business reporting, budgeting and forecasting, and market research.
What is Denormalization?
Denormalization is the process of intentionally introducing redundancy into a database by merging tables and reducing the complexity of relationships. This is often done to improve the read performance of the database by reducing the number of joins needed to retrieve related data. Denormalization can be particularly beneficial in scenarios where query performance is more critical than the maintenance of strict data integrity and normalization rules.
Denormalization introduces redundancy, meaning the same piece of data is stored in multiple places. This can lead to inconsistencies if not managed properly. More storage space is required since data is duplicated. Updates become more complex because changes to redundant data must be propagated to multiple places, increasing the risk of data anomalies.
What's a join?
A join is an operation in SQL that allows you to combine rows from two or more tables based on a related column between them. Joins are fundamental in relational databases as they enable the retrieval of data spread across multiple tables by establishing relationships between them.
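As a quick sketch, the idea can be shown with Python's built-in sqlite3 module; the customers/orders tables and their columns below are made up for illustration:

```python
import sqlite3

# Two hypothetical related tables: customers and their orders.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Bo")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)])

# The JOIN matches each order to its customer via the related column.
rows = cur.execute(
    "SELECT c.name, o.amount FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()
```

Each result row combines columns from both tables, e.g. ('Ada', 25.0).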
What is normalization?
Normalization is a process in database design that organizes columns and tables of a database to reduce data redundancy and improve data integrity. The main objective of normalization is to separate data into distinct, related tables to minimize redundancy and dependency. This process involves structuring a relational database according to a series of normal forms to ensure that data is stored logically and efficiently.
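A toy sketch of the idea in Python (the employee/department fields are hypothetical): a table where the department name repeats on every employee row is split so the department data is stored once.

```python
# Denormalized rows: the department name is repeated for every employee.
employees = [
    {"emp": "Ada", "dept_id": 1, "dept_name": "Engineering"},
    {"emp": "Bo",  "dept_id": 1, "dept_name": "Engineering"},
    {"emp": "Cy",  "dept_id": 2, "dept_name": "Sales"},
]

# Normalization: move the repeating department data into its own table,
# leaving only the foreign key (dept_id) on the employee rows.
departments = {row["dept_id"]: row["dept_name"] for row in employees}
employees_normalized = [{"emp": r["emp"], "dept_id": r["dept_id"]} for r in employees]
```

Renaming a department is now a single update instead of one update per employee row, which is the redundancy-and-integrity benefit normalization aims for.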
What are three types of data warehouses?
There are three types - Enterprise Data Warehouses (EDW), Data Marts, and Operational Data Stores (ODS). EDWs are comprehensive and cover the entire organization, Data Marts cater to specific departmental needs, and ODSs provide near-current data for operational use.
What is a star schema?
A star schema is a type of data warehouse schema where a central fact table is connected to multiple dimension tables through foreign key relationships.
Dimension tables are de-normalized and are linked to the fact table through unique keys (one per dimension table).
Imagine a retail company analyzing sales data. The fact table contains sales transactions, and the dimension tables include products, time, and stores. Since the relationships are straightforward (e.g., sales per product per store), a star schema simplifies querying.
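The retail example above can be sketched with sqlite3; the table and column names (fact_sales, dim_product, dim_store) are hypothetical, chosen only to illustrate the fact-plus-dimensions layout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Dimension tables: descriptive attributes to slice the facts by.
cur.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT)")
# Central fact table: foreign keys to each dimension plus the measures.
cur.execute("""CREATE TABLE fact_sales (
    product_key INTEGER, store_key INTEGER, units INTEGER, revenue REAL)""")
cur.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Widget"), (2, "Gadget")])
cur.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "Austin"), (2, "Boston")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 3, 30.0), (2, 1, 1, 50.0), (1, 2, 2, 20.0)])

# A typical star-schema query: join the fact to a dimension and aggregate.
rows = cur.execute("""
    SELECT p.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.name ORDER BY p.name
""").fetchall()
```

Every dimension is one join away from the fact table, which is what keeps star-schema queries simple.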
What’s an ERD?
An Entity-Relationship Diagram (ERD) is a visual representation that illustrates the relationships between entities in a database. ERDs are essential tools in database design and development, helping to organize and structure data models. ERDs serve as blueprints for database design, helping developers and stakeholders visualize and understand the structure and relationships within a database system.
What’s a snowflake schema?
The snowflake schema is a multi-dimensional data model commonly used in business intelligence (BI) and reporting. It’s an extension of the star schema, with dimension tables broken down into subdimensions. It’s normalized, so there isn’t data redundancy.
Consider a healthcare organization analyzing patient data. The fact table represents medical procedures, and the dimension tables include patients, doctors, hospitals, and diagnoses. Since there are multiple levels of hierarchy (e.g., patient demographics, doctor specialties), a snowflake schema allows for more detailed analysis.
What is a Database Management System (DBMS)?
a database management system is software that allows users to define, create, maintain, and control access to databases.
What is middleware?
Middleware is software that sits between different software applications or services, enabling them to communicate and work together. It’s like the glue that connects different components of a system, ensuring smooth data flow and interaction.
Middleware provides common services and capabilities such as messaging, authentication, and data integration, facilitating communication and management of data between different systems and applications.
What are Apache HTTP Server and Nginx examples of? What do they do?
Examples of Middleware: These are Web Servers. They Serve web pages and handle HTTP requests from clients.
What are IBM WebSphere and Oracle WebLogic examples of? What do they do?
Examples of Middleware: Application Servers. Function: Host and manage web applications, providing services like transaction management and security, e.g., hosting your online store application and making it accessible to users via the web.
What are ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity) examples of? What do they do?
Examples of Database Middleware. Function: Facilitate interaction between applications and databases.
What are RabbitMQ and Apache Kafka examples of? What do they do?
Examples of Message-Oriented Middleware (MOM). Function: Handle message passing between distributed systems.
What are gRPC and Apache Thrift examples of? What do they do?
Examples of Remote Procedure Call (RPC) Middleware: Function: Enable functions in different systems to call each other as if they were local.
You (Client): Make a request (order a pizza).
Restaurant (Server): Receives your request, performs the task (makes the pizza), and sends back the result.
Phone Call (RPC Mechanism): Facilitates the communication between you and the restaurant.
In RPC, instead of calling a restaurant, you're calling a function or procedure on another computer as if it were on your own computer. The "restaurant" (server) does the work and sends the result back to you (client), allowing you to use the result just like you would use the pizza you ordered.
What is the ETL Process?
Extraction: pulling data from the source system
Transformation: subjecting the data to a number of operations before it can be imported
Loading: Involves physically placing extracted and transformed data in the target database
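The three steps above can be sketched in a few lines of Python; the source rows and target list are in-memory stand-ins for a real source system and target database:

```python
# A minimal ETL sketch with in-memory stand-ins for source and target.
source_rows = ["  alice ,120 ", "BOB,80", "carol, 200"]  # extract: raw CSV-like lines
target_table = []                                        # load destination

def transform(line):
    # Transformation: trim whitespace, normalize case, cast types.
    name, amount = line.split(",")
    return {"name": name.strip().title(), "amount": int(amount)}

for line in source_rows:          # extraction
    record = transform(line)      # transformation
    target_table.append(record)   # loading
```

In a real pipeline each stage would talk to actual systems (source databases, staging areas, the warehouse), but the extract-transform-load shape is the same.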
What is Back flushing?
Back flushing refers to feeding cleansed and validated data from the data warehouse back to the source system(s).
What is Purge processing?
refers to the methodical removal or deletion of obsolete, unnecessary, or outdated data from a database, system, or storage to improve performance, manage storage space, and maintain system efficiency. This process is crucial for maintaining data hygiene and ensuring that only relevant and current data is kept within the system.
What is merge processing?
Merge processing refers to the operation of combining two or more datasets, files, or data streams into a single, unified dataset. This process is commonly used in data management, particularly when dealing with sorted data, where the goal is to maintain a specific order in the resulting merged dataset.
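Python's standard library has this operation built in: heapq.merge combines already-sorted inputs into one sorted output. The two streams below are hypothetical stand-ins for, say, sorted daily extracts from two systems.

```python
import heapq

# Two already-sorted data streams, e.g. sorted extracts from two systems.
stream_a = [1, 4, 7]
stream_b = [2, 4, 9]

# heapq.merge lazily combines sorted inputs into one sorted output.
merged = list(heapq.merge(stream_a, stream_b))
```

Because the inputs are already sorted, the merge is a single linear pass rather than a full re-sort of the combined data.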
What is parallel processing?
Parallel processing is a method in computing where multiple processors or computers work simultaneously on different parts of a task to complete it faster. It divides a large problem into smaller sub-problems, solves them concurrently, and then combines the results.
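As a small illustration, the divide/solve/combine pattern can be sketched with Python's concurrent.futures. Threads are used here for brevity; genuinely CPU-bound work would typically use processes (or separate machines) instead.

```python
from concurrent.futures import ThreadPoolExecutor

def part_sum(chunk):
    # Each worker solves one sub-problem independently.
    return sum(chunk)

data = list(range(100))
# Divide the large problem into smaller sub-problems.
chunks = [data[i:i + 25] for i in range(0, 100, 25)]

# Solve the sub-problems concurrently, then combine the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(part_sum, chunks))
```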
What is a pipeline?
A pipeline in the context of computing and data processing refers to a series of processing steps arranged so that the output of one step is the input to the next.
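A simple way to picture this in Python is a chain of generators, where each stage consumes the previous stage's output (the stage names below are hypothetical):

```python
# Each stage consumes the previous stage's output, like a Unix pipe.
def read(lines):
    for line in lines:
        yield line.strip()

def parse(lines):
    for line in lines:
        yield int(line)

def keep_even(nums):
    for n in nums:
        if n % 2 == 0:
            yield n

raw = [" 1 ", "2", " 3", "4 "]
result = list(keep_even(parse(read(raw))))
```

Because generators are lazy, items flow through all three stages one at a time instead of each stage materializing a full intermediate dataset.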
What is partitioning in the ETL process?
Partitioning involves dividing large tables or datasets into smaller, more manageable parts based on a defined criteria (e.g., range of values, hash value). Each partition can then be processed independently.
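A hash-based variant of this can be sketched in a few lines; the rows and the "region" key are made up for illustration:

```python
# Hash partitioning: assign each row to one of n partitions by key,
# so rows sharing a key always land in the same partition.
def partition(rows, key, n):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

rows = [{"id": i, "region": r}
        for i, r in enumerate(["east", "west", "east", "north"])]
parts = partition(rows, "region", 3)
```

Each partition can then be handed to a separate ETL worker and processed independently.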
What is indexing in the ETL process?
Indexing involves creating indexes on columns in a database table to speed up data retrieval operations. Indexes allow the database engine to quickly locate rows that match certain criteria.
What is Parallel bulk loading in the ETL process?
Parallel bulk loading involves splitting data loading tasks into multiple concurrent processes or threads, each handling a portion of the data simultaneously.
What is a dimension?
A dimension is a data element that categorizes each item in a dataset into non-overlapping regions; examples are customer, region, and time. It represents an attribute such as product, region, sales channel, or time, and it is used to analyze facts, known as business metrics (the ways we measure the business).
What is a fact?
A fact is a business measure or metric used to measure business performance, such as sales, revenue, units sold, and costs. Facts are the values that change over time.
What are the three types of facts?
Additive: This type of fact, which is the most common, is a measurement that can be added across all dimensions in a fact table; examples include revenue, profit, sales, and cost
Semi-additive: This type of fact or measure can be added for some dimensions only, such as headcount
Non-additive: This type of fact cannot be summed for any dimension
What is Granularity?
Granularity refers to the lowest level of detail that is stored in a data warehouse fact table. For example, the lowest level of data can be maintained at the yearly, quarterly, monthly, weekly, daily, or hourly level. For more in-depth reporting capability, a finer (lower) level of granularity is preferred.
What is OLAP?
OLAP is a BI tool that addresses the need to perform multi-dimensional analysis. Query outputs are presented in a matrix or pivot, where the columns and rows are the dimensions. The values in the matrix are obtained from the measures, which are derived from the fact table records. The dimensions are derived from the dimension tables.
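The matrix/pivot idea can be sketched in plain Python; the fact records and dimension names (region, year) below are hypothetical:

```python
from collections import defaultdict

# Fact records: each row carries dimension values and a measure.
facts = [
    {"region": "East", "year": 2023, "sales": 100},
    {"region": "East", "year": 2024, "sales": 150},
    {"region": "West", "year": 2023, "sales": 80},
]

# Pivot: rows = region, columns = year, cell value = SUM(sales).
pivot = defaultdict(dict)
for f in facts:
    cell = pivot[f["region"]].get(f["year"], 0)
    pivot[f["region"]][f["year"]] = cell + f["sales"]
```

The resulting nested dict is the matrix an OLAP tool would display, with the dimensions as axes and the aggregated measure in the cells.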
How do you identify big data?
Volume (data at rest): The quantity of data that is generated.
Velocity (data in motion): The speed at which data arrives. Clickstreams and ad impressions capture user behavior at millions of events per second; high-frequency stock trading algorithms reflect market changes within microseconds; machine-to-machine processes exchange data between billions of devices; infrastructure and sensors generate massive log data in real time; online gaming systems support millions of concurrent users.
Variety (data in many forms): Nowadays, data is generated in both structured and unstructured formats.
Variability: Daily, seasonal, and event-triggered peak data loads can be challenging to manage, especially where unstructured data is involved.
Complexity: It is necessary to connect and correlate relationships, hierarchies, and multiple data linkages or the data can quickly spiral out of control. This characteristic is referred to as the 'complexity' of Big Data.
Veracity (data in doubt): Uncertainty about the reliability of the data.
What is MPP Database Systems (Massively Parallel Processing)?
MPP database systems are designed to handle large-scale data processing by distributing data and processing tasks across multiple nodes or servers. Each node operates independently and processes a subset of the data in parallel. Examples: Amazon Redshift, Google BigQuery, Teradata.
What is MapReduce?
MapReduce is a programming model and framework developed by Google for processing and generating large datasets in parallel across a distributed cluster of computers. Suited for processing unstructured or semi-structured data. MapReduce is like big detective kits for data. MapReduce is the processing component of Hadoop. It allows processing of large datasets in parallel across the nodes in the cluster. Divides tasks into smaller sub-tasks (Map phase) and processes them independently, then aggregates results (Reduce phase). Typically used for batch processing of data, such as log processing, data transformation, and ETL (Extract, Transform, Load) operations.
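The classic word-count example shows the Map and Reduce phases; this is a single-machine sketch of the model, not Hadoop itself:

```python
from collections import defaultdict

documents = ["the cat sat", "the cat ran"]

# Map phase: emit (word, 1) pairs from each input split.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group independently.
counts = {word: sum(vals) for word, vals in groups.items()}
```

On a real cluster the map tasks run on different nodes over different input splits, and the shuffle moves each key's values to the node running its reduce task.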
What is operational big data and whats’ a tool you would use for it?
Operational Big Data is about handling data quickly for everyday tasks, where NoSQL databases shine.
What is Hadoop?
Hadoop is like a powerful toolbox that can handle both fast data processing (like NoSQL databases do) and deep analysis (like MPP databases and MapReduce do). It’s a framework that lets you store lots of data across many computers and then process that data in different ways.
What is analytical Big Data and what are some tools you would use for it?
Analytical Big Data is about finding patterns and insights from large amounts of data over time, where MPP databases and tools like MapReduce come in.
What is MongoDB?
MongoDB is a type of software used to store and manage data. It’s part of a broader category of databases called NoSQL databases, which means it doesn’t use the traditional relational database structures that you might be familiar with from systems like MySQL or PostgreSQL.
What are Samza, Spark Streaming, Storm used for?
These are commonly used for processing streaming data in real-time.
What are these used for Drill and Impala?
Drill, Impala: These are focused on enabling fast SQL queries on large datasets.
What is Apache Impala?
Apache Impala is like a super-fast detective that helps you find answers in a very big library of information (data). Imagine you have a huge library (like a library with millions of books). Apache Impala can split up the work of finding information across many librarians (computers). Each librarian (computer) works on their own set of books (data), which makes it much faster to find what you need. Apache Impala understands a language called SQL, which is like a common language used to ask questions to databases. You can ask Apache Impala questions in SQL, and it will find the answers for you in your data.
What is Hadoop Distributed File System (HDFS)?
HDFS is the storage component of Hadoop. It stores large files in a distributed manner across multiple nodes in a Hadoop cluster.
What are nodes in Hadoop?
Nodes: A Hadoop cluster consists of multiple nodes (computers), typically divided into two types: NameNode (manages file system metadata) and DataNode (stores actual data).
What is a data lake?
A data lake is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. It can store data from different sources such as websites, mobile apps, and social media platforms. An example is Azure Data Lake Storage.
What are the 4 core components of Hadoop?
Hadoop Distributed File System (HDFS): HDFS is a distributed file system that stores data across multiple nodes in a Hadoop cluster. It provides high throughput and fault tolerance by replicating data blocks across nodes. Ideal for storing large datasets (structured, semi-structured, and unstructured) with high aggregate bandwidth for data processing.
Hadoop MapReduce: MapReduce is a programming model and processing framework for large-scale data processing across a Hadoop cluster. It breaks down tasks into smaller Map and Reduce operations that can be executed in parallel. Batch processing tasks such as data transformation, ETL (Extract, Transform, Load), and analytical processing on large datasets stored in HDFS.
Hadoop YARN (Yet Another Resource Negotiator): YARN is a resource management and job scheduling platform in Hadoop. It manages and allocates computing resources (CPU, memory) across nodes in the cluster to run various applications. Enables multi-tenancy and supports diverse workloads including MapReduce, Spark, and other distributed computing frameworks.
Hadoop Common: Hadoop Common includes libraries, utilities, and necessary components shared by other Hadoop modules. It provides core functionalities like file system interfaces, networking, security, and configuration management. It is the foundation for building and running Hadoop-based applications, ensuring compatibility and abstraction of underlying complexities.
What is Apache Pig?
A high-level data flow scripting language and execution framework for parallel computation.
What is Apache Hive?
A data warehouse infrastructure that provides SQL-like querying (HiveQL) and metadata management on large datasets stored in Hadoop.
What is Apache HBase?
It is a distributed, scalable, and NoSQL database that provides real-time read/write access to data stored in HDFS.
What is Apache Spark?
It is an open-source, distributed computing system that provides in-memory processing for real-time data analytics and machine learning.
What is Apache Kafka ?
Kafka is a distributed streaming platform that handles real-time data feeds.
What is Apache Zeppelin?
Apache Zeppelin is an open-source web-based notebook that provides an interactive and collaborative environment for data analysis, visualization, and exploration. It supports a variety of data sources and processing engines, making it a versatile tool for data scientists, analysts, and developers.
What does DQ stand for?
Data Quality
What are six steps you can take to make sure your data is cleaned?
Step 1: Remove irrelevant data.
Step 2: Deduplicate your data.
Step 3: Fix structural errors.
Step 4: Deal with missing data.
Step 5: Filter out data outliers.
Step 6: Validate your data.
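The six steps above can be sketched on a toy record set; the field names and thresholds (age, the 120 cutoff) are made up for illustration:

```python
# Toy raw records exhibiting the problems each cleaning step targets.
raw = [
    {"name": "Ada", "age": 30, "note": "ok"},
    {"name": "Ada", "age": 30, "note": "ok"},   # duplicate
    {"name": "bo ", "age": None, "note": "ok"}, # structural error + missing value
    {"name": "Cy", "age": 999, "note": "ok"},   # outlier
]

cleaned, seen = [], set()
for row in raw:
    name = row["name"].strip().title()                # step 3: fix structural errors
    age = row["age"] if row["age"] is not None else 0 # step 4: deal with missing data
    if age > 120:                                     # step 5: filter out outliers
        continue
    key = (name, age)
    if key in seen:                                   # step 2: deduplicate
        continue
    seen.add(key)
    cleaned.append({"name": name, "age": age})        # step 1: drop irrelevant "note"

assert all(0 <= r["age"] <= 120 for r in cleaned)     # step 6: validate
```

Real cleaning pipelines apply the same steps with domain-specific rules (valid ranges, canonical formats, match keys for deduplication).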
What is Oracle Business Intelligence Enterprise Edition (OBIEE)?
OBIEE is a comprehensive suite of enterprise BI tools designed to deliver a full range of analytic and reporting capabilities.
What is IBM Cognos?
It is a BI and performance management software suite designed to enable business users without technical knowledge to extract corporate data, analyze it, and assemble reports.
What is SAP BusinessObjects?
SAP is a suite of front-end applications that allow business users to view, sort, and analyze business intelligence data.
What is a Client-Server Model?
The client-server model is a network architecture where client devices request services and resources from a central server. Clients send requests to the server over a network, and the server processes these requests and returns the appropriate responses. This model can be used for a wide variety of services, including file sharing, email, and web services.
An example is Email clients retrieving emails from an email server.
What is virtualization?
Virtualization is the process of creating a virtual version of something, such as a server, a desktop, a storage device, an operating system, or network resources. You might use this to test software in different environments without needing separate physical hardware. Example brands: VMware, Microsoft Hyper-V, Oracle VM VirtualBox, Citrix XenServer.
Virtualization software creates an abstraction layer over computer hardware that allows the hardware elements of a single computer—processors, memory, storage, and more—to be divided into multiple virtual computers, commonly called virtual machines (VMs). Each VM runs its own operating system and applications, independently of the other VMs.
What is grid computing?
Grid computing involves a distributed architecture of large numbers of computers connected to solve a complex problem. Computers in the grid work together to complete tasks by dividing the workload among them. Each computer works on a small part of the task independently and then combines the results. Software brands: Globus Toolkit, Apache Hadoop, BOINC (Berkeley Open Infrastructure for Network Computing).
An example of use might be Scientific research projects requiring large-scale computation, such as climate modeling or genome sequencing.
What is a Mainframe Computer?
A mainframe computer is a large, powerful, and expensive computer system capable of handling and processing very large amounts of data quickly. Mainframes are designed to manage high volumes of input and output and support numerous simultaneous users. They use specialized operating systems to manage hardware and software resources efficiently.
Examples are IBM Z Series, Unisys ClearPath, and Fujitsu BS2000.
What is Service-Oriented Architecture (SOA)?
SOA is a design pattern where services are provided to other components by application components, through a communication protocol over a network.
In SOA, services are modular and can be independently deployed and scaled. Each service has a well-defined interface and communicates with other services over standard protocols.
Here are some of the software brands: IBM WebSphere, Oracle SOA Suite, Microsoft BizTalk Server.
An example of use is an e-commerce system integrating various services such as payment processing, inventory management, and customer support.
What is code on demand?
Code on demand is a web application design pattern where code is sent from the server to the client and executed on the client-side. When a client requests data, the server sends back not just the data but also code that can process and display that data. This code is executed on the client’s machine, often improving performance and interactivity.
Software Brands:
JavaScript libraries (e.g., jQuery, React.js), Adobe Flash (historically), Microsoft Silverlight (historically).
Examples of Use:
Dynamic web applications where parts of the UI update without a full page reload.
What is OS-level Virtualization?
OS-level virtualization is a method of virtualization in which the kernel of an operating system allows multiple isolated user-space instances. Instead of emulating an entire machine as in full virtualization, OS-level virtualization runs multiple isolated systems on a single host with a shared OS kernel.
Software Brands:
Docker, OpenVZ, LXC (Linux Containers).
Containerized applications where each container runs a single application in an isolated environment.
What is Infrastructure Utilization?
Infrastructure utilization refers to the efficient use of IT infrastructure resources such as servers, storage, and networking.
Techniques such as virtualization, load balancing, and resource management are employed to ensure that infrastructure resources are used effectively and efficiently, minimizing waste and maximizing performance.
Software Brands:
VMware vSphere, Microsoft System Center, Red Hat Ansible.
Cloud computing environments where resources are allocated dynamically based on demand.
What is Distributed Computing?
Distributed computing involves multiple computers working together over a network to achieve a common goal.
Tasks are divided into smaller subtasks, which are distributed among the networked computers. Each computer processes its subtask independently, and the results are combined to complete the overall task.
Apache Spark, Hadoop, Microsoft Azure Batch, Amazon EC2.
Big data processing where datasets are too large for a single machine.
Scientific simulations that require extensive computational power.
What is Parallel computing?
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems are divided into smaller ones, which are then solved concurrently. It leverages multiple processors or computers to perform these tasks simultaneously, either by dividing tasks among multiple processors within a single computer or distributing tasks across multiple networked computers.
Software Brands:
MPI (Message Passing Interface), OpenMP, CUDA (for GPU computing), Apache Spark.
Examples are: Weather forecasting models that require massive amounts of data processing.
What is a NoSQL Database?
A NoSQL database is a non-relational database designed to store, retrieve, and manage large volumes of unstructured or semi-structured data.
NoSQL databases do not use the traditional table-based relational database structure. They can store data in various formats like key-value pairs, document-oriented, column-family, and graph formats.
MongoDB, Cassandra, Redis, Couchbase, Amazon DynamoDB.
example Storing user session information in distributed systems.
What’s a Virtual Machine?
A virtual machine (VM) is a software-based emulation of a physical computer that runs an operating system and applications just like a physical computer.
A hypervisor or virtual machine monitor (VMM) creates and runs VMs by allocating hardware resources such as CPU, memory, and storage from the host system. Each VM operates independently with its own OS and applications.
Running multiple operating systems on a single physical machine for development and testing.
What is a disk image?
A disk image is a snapshot of a storage device captured at a specific point in time. It includes all the files, directories, and metadata, as well as the file system structure, partition table, and other system-specific information.
What is a firewall?
A firewall is a network security device that monitors and controls incoming and outgoing network traffic based on predetermined security rules.
What is a virtual-machine disk image library?
A virtual-machine disk image library is a collection of disk images used by virtual machines, containing the complete content and structure of a storage device.
Maintaining a library of standard OS images for rapid deployment of VMs.
What is Raw Block Storage?
Raw block storage is a type of storage where data is organized and managed in small, fixed-size pieces called blocks. These blocks are directly controlled by the operating system or applications, giving more flexibility in how data is used and stored.
In raw block storage, the storage device (like an SSD or a hard drive) divides its capacity into uniformly sized blocks, each identified by a unique address. The operating system or application can read from or write to these blocks directly, without needing to worry about the structure of the data within them.
For example, think of raw block storage like a large grid of numbered boxes. The OS or application can put data into any box it chooses and later retrieve it by referring to the box’s number. This method is efficient and allows for high performance and flexibility, especially useful for databases and virtual machines that need fast, low-level access to the storage medium. Block storage is optimized for performance and low-latency access.
This is different than file storage, which organizes data into a hierarchical structure of files and folders, similar to how documents are stored in a physical filing cabinet.
File Storage: Shared drives, user directories, and document management systems.
Block Storage: Databases, virtual machines, and high-performance applications.
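The "grid of numbered boxes" idea above can be sketched in a few lines of Python. This is a toy model, not a real storage driver: the block size and device capacity are illustrative assumptions.

```python
# Toy sketch of raw block storage: a device's capacity divided into
# fixed-size, numbered blocks, read and written by block address.
BLOCK_SIZE = 16  # bytes per block (real devices commonly use 512 or 4096)

class BlockDevice:
    def __init__(self, num_blocks):
        # The whole device is just one flat run of bytes.
        self.data = bytearray(num_blocks * BLOCK_SIZE)

    def write_block(self, block_no, payload):
        # Pad the payload to exactly one block, then store it at its address.
        assert len(payload) <= BLOCK_SIZE
        start = block_no * BLOCK_SIZE
        self.data[start:start + BLOCK_SIZE] = payload.ljust(BLOCK_SIZE, b"\x00")

    def read_block(self, block_no):
        start = block_no * BLOCK_SIZE
        return bytes(self.data[start:start + BLOCK_SIZE])

dev = BlockDevice(num_blocks=4)
dev.write_block(2, b"hello")
print(dev.read_block(2).rstrip(b"\x00"))  # b'hello'
```

Note that the device knows nothing about what the bytes mean; imposing files and folders on top of such blocks is exactly what a file system does.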
What is Object Storage?
Object storage stores data as discrete units called objects, each with its own unique identifier and metadata, in a flat structure without a hierarchy. It is used for storing large amounts of unstructured data such as photos, videos, backups, and logs.
What is a load balancer?
A load balancer is a device or software that distributes network or application traffic across multiple servers to ensure reliability and performance. Load balancers use algorithms to distribute incoming traffic. Common methods include round-robin, least connections, and IP hash. This helps prevent any single server from becoming overwhelmed, ensuring that applications remain responsive and available.
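The round-robin method mentioned above can be sketched as follows; the server names are illustrative placeholders.

```python
from itertools import cycle

# Minimal round-robin load balancing: hand each incoming request
# to the next server in a fixed rotation.
servers = ["server-a", "server-b", "server-c"]
rotation = cycle(servers)

def route(request_id):
    # Each call picks the next server in the rotation.
    return next(rotation)

assigned = [route(i) for i in range(6)]
print(assigned)
# ['server-a', 'server-b', 'server-c', 'server-a', 'server-b', 'server-c']
```

A least-connections strategy would instead track how many open requests each server has and pick the minimum; round-robin is the simplest option when servers are roughly equal in capacity.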
What is a VLAN?
A VLAN is a virtualized version of a physical local area network, allowing devices on different physical LANs to be grouped together into a single logical network.
What is Microsoft Azure?
Microsoft Azure is a cloud computing platform and service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers.
What is Google App Engine?
Google App Engine is a platform-as-a-service (PaaS) offering from Google that allows developers to build and host web applications in Google-managed data centers.
What is Ajax (Asynchronous JavaScript and XML)?
Ajax (Asynchronous JavaScript and XML) is a set of web development techniques using many web technologies to create asynchronous web applications. Ajax allows web pages to be updated asynchronously by exchanging small amounts of data with the server behind the scenes. This means parts of a web page can be updated without reloading the entire page.
What is a Native Application?
A native application is a software program developed for use on a particular platform or device, typically using platform-specific programming languages and tools.
What is a hybrid cloud?
A hybrid cloud is a computing environment that combines a private cloud and a public cloud, allowing data and applications to be shared between them.
Explain what Infrastructure as a Service (IaaS) means?
IaaS provides virtualized computing resources over the internet. It offers fundamental computing, networking, and storage resources to consumers on a pay-as-you-go basis.
How It Works:
In an IaaS model, a cloud provider hosts the infrastructure components traditionally present in an on-premises data center, including servers, storage, and networking hardware, as well as the virtualization or hypervisor layer. Users can provision and manage these resources through a web-based console or API.
Amazon Web Services (AWS) EC2: Provides scalable virtual servers.
Microsoft Azure Virtual Machines: Offers Windows and Linux virtual machines.
Google Compute Engine: Provides infrastructure for running large-scale workloads.
Use Cases:
Hosting websites and web applications.
Setting up development and test environments.
Running enterprise applications and big data analytics.
What is Platform as a Service (PaaS)?
PaaS provides a platform allowing customers to develop, run, and manage applications without dealing with the underlying infrastructure. It includes operating systems, middleware, and development tools.
PaaS delivers a framework for developers to build upon and use to create customized applications. The infrastructure (servers, storage, networking) is managed by the cloud provider, while developers focus on writing code and integrating it into the platform.
Developing and deploying web applications.
Rapid prototyping and development of software.
Building scalable mobile and web apps.
What is Software as a Service (SaaS)?
SaaS delivers software applications over the internet, on a subscription basis. Users access the software via a web browser, and the provider manages the infrastructure, middleware, and application software.
In the SaaS model, the cloud provider hosts the software application and manages all the technical aspects, such as infrastructure, maintenance, and security. Users simply access the application through a web browser or a thin client.
Microsoft Office 365: Provides cloud-based versions of Microsoft Office applications like Word, Excel, and PowerPoint.
What are the key differences between IaaS, PaaS, and SaaS?
IaaS: Users have the most control over the infrastructure. They are responsible for managing operating systems, applications, and data. Offers the highest level of flexibility, allowing customization of the infrastructure. Ideal for companies needing full control over their infrastructure.
PaaS: Users focus on application development and management. The provider handles the underlying infrastructure and platform maintenance. Balances flexibility and ease of use, providing tools and frameworks for development. Suitable for developers who want to focus on coding without worrying about the underlying infrastructure.
SaaS: Users have the least control and responsibility. They only use the software application and the provider manages everything else. Offers the least flexibility but the highest convenience and ease of use. Perfect for end-users who need to use software applications without dealing with maintenance or updates.
What are the seven types of quantitative messaging?
Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.
Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the sales persons.
Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period. A bar chart can show the comparison of the actual versus the reference amount.
Frequency distribution: Shows the number of observations of a particular variable for a given interval, such as the number of years in which the stock market return is between intervals such as 0-10%, 11-20%, etc. A histogram, a type of bar chart, may be used for this analysis.
Correlation: Comparison between observations represented by two variables (X,Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.
Nominal comparison: Comparing categorical subdivisions in no particular order, such as the sales volume by product code. A bar chart may be used for this comparison.
Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.
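The frequency-distribution message above can be sketched by binning values into intervals; the yearly return figures below are made-up illustrative values, not real market data.

```python
# Count how many yearly stock-market returns fall into each interval,
# producing the counts a histogram would display.
returns = [3, 7, 12, 18, 24, 5, 15, 9, 21, 2]  # illustrative percentages

bins = {"0-10%": 0, "11-20%": 0, "21-30%": 0}
for r in returns:
    if r <= 10:
        bins["0-10%"] += 1
    elif r <= 20:
        bins["11-20%"] += 1
    else:
        bins["21-30%"] += 1

print(bins)  # {'0-10%': 5, '11-20%': 3, '21-30%': 2}
```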
What’s a pareto chart?
A Pareto chart is a type of bar chart that represents the frequency or impact of problems or causes in descending order, combined with a line chart that shows the cumulative percentage of the total. Pareto charts are used to identify the most significant factors in a dataset. The principle behind the Pareto chart is the 80/20 rule, which suggests that roughly 80% of effects come from 20% of the causes.
Use Cases:
Quality Control: Identifying the most common defects or issues in a manufacturing process.
Sales Analysis: Highlighting the top products or customers that generate the most revenue.
Problem Solving: Focusing on the most impactful issues to prioritize improvement efforts.
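The arithmetic behind a Pareto chart is simple to sketch: sort the causes by frequency in descending order and compute the cumulative percentage at each step. The defect counts below are illustrative.

```python
# Pareto ordering: causes sorted by count (descending) with a running
# cumulative percentage of the total.
defects = {"scratch": 45, "dent": 30, "misalignment": 15, "discoloration": 10}

total = sum(defects.values())
cumulative, running = [], 0
for cause, count in sorted(defects.items(), key=lambda kv: kv[1], reverse=True):
    running += count
    cumulative.append((cause, count, round(100 * running / total)))

print(cumulative)
# [('scratch', 45, 45), ('dent', 30, 75), ('misalignment', 15, 90), ('discoloration', 10, 100)]
```

In a Pareto chart, the counts become the bars and the third column becomes the cumulative-percentage line.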
What is a Sparkline chart?
A sparkline is a very small chart, typically a simple line drawn without axes or labels. Sparklines are used to give a quick visual summary of data, without the need for a detailed chart or graph. They can show trends, patterns, and variations in data in a space-efficient manner.
What is a tree map chart?
A tree map is a visualization that displays hierarchical data using nested rectangles. Each rectangle represents a category, with the size of the rectangle proportional to its value.
Tree maps are used to visualize large amounts of hierarchical data in a compact space, making it easy to spot patterns, trends, and outliers. Colors can be added to differentiate categories or to represent additional dimensions of data.
What is predictive analytics?
Predictive analytics is a set of BI technologies that uncovers relationships and patterns within large volumes of data that can be used to predict behavior and events.
What is Interoperability?
Interoperability is the ability of different systems and software to exchange and make use of each other's data and models. The process of creating and deploying predictive models traditionally involves accessing or moving data and models among multiple machines, operating platforms, and applications, which requires interoperable software.
What is count regression?
Count regression is a type of regression analysis used for modeling count data, where the response variable is a count of occurrences of an event.
What is linear regression?
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. Linear regression is used to predict the value of a dependent variable based on the values of independent variables. It provides insights into the strength and nature of the relationship between variables.
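The fit that linear regression computes can be sketched with the closed-form least-squares formulas for a single predictor; the data points below are illustrative and deliberately lie on the line y = 2x.

```python
# Ordinary least squares for y = slope * x + intercept, one predictor.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]  # perfectly linear: y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Slope = covariance of x and y divided by variance of x.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 2.0 0.0
```

With noisy real-world data the fitted line will not pass through every point; the formulas minimize the sum of squared vertical distances to the line.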
What is logistic regression?
Logistic regression is a statistical method used to model binary outcome variables (i.e., variables that have two possible outcomes) by estimating the probability of a certain event occurring. Logistic regression is used for classification tasks where the goal is to predict the probability of an outcome falling into one of two categories.
What is cluster analysis?
Cluster analysis is a technique used to group a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups.
Cluster analysis is used for exploratory data analysis to identify patterns or groupings in data without predefining categories.
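One common clustering technique, k-means, can be sketched in one dimension: assign each point to its nearest center, then move each center to the mean of its assigned points, and repeat. The data and starting centers are illustrative.

```python
# Tiny 1-D k-means with two clusters. Assumes neither cluster ever
# ends up empty, which holds for this illustrative data.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [0.0, 5.0]  # deliberately poor starting guesses

for _ in range(10):  # a few iterations are plenty here
    clusters = {0: [], 1: []}
    for p in points:
        nearest = min((0, 1), key=lambda c: abs(p - centers[c]))
        clusters[nearest].append(p)
    centers = [sum(c) / len(c) for c in clusters.values()]

print(centers)  # [1.5, 10.5]
```

The two groups in the data are recovered without any predefined category labels, which is the exploratory point of cluster analysis.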
What is time series?
Time series analysis involves statistical techniques for analyzing time-ordered data points to identify patterns, trends, and seasonal variations.
What is association analysis?
Association analysis, also known as market basket analysis, is a data mining technique used to discover relationships between variables in large datasets.
What is A/B analysis (or A/B testing)?
A/B analysis (or A/B testing) is a method of comparing two versions of a webpage, product, or feature to determine which one performs better.
A/B testing involves dividing users into two groups: one group sees version A, and the other group sees version B. The performance of each version is then measured to identify which one achieves better results.
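Evaluating an A/B test typically comes down to comparing the two conversion rates and checking whether the difference is statistically significant. A common approach is a two-proportion z-test; the visitor and conversion counts below are illustrative.

```python
from math import sqrt, erf

# Version A vs. version B conversion counts (illustrative).
conversions_a, visitors_a = 120, 1000  # 12.0% conversion
conversions_b, visitors_b = 150, 1000  # 15.0% conversion

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
# Pooled rate under the null hypothesis that A and B convert equally.
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
# Two-sided p-value from the normal CDF, written via the error function.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(round(p_b - p_a, 3), round(z, 2))  # 0.03 1.96
```

Here the 3-point lift in version B comes out just significant at the conventional 5% level; with smaller samples the same lift might not.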
What is Scoring?
Scoring refers to the process of assigning values (scores) to different entities based on their characteristics or behaviors to rank or categorize them.
What are the four types of data analyses?
Simple summation and statistics
Predictive (forecasting)
Descriptive (business intelligence and data mining)
Prescriptive (optimization and simulation)
What is transaction profiling?
Transaction profiling is the process of analyzing and categorizing transactions to understand patterns, detect anomalies, and gain insights into transactional behavior. This technique is commonly used in financial services, e-commerce, and other industries where understanding the nature and behavior of transactions is crucial.
What is Predictive search?
Predictive search is a search feature that suggests likely queries or results in real time as the user types, based on popular searches and the user's context or history.
What is Multinomial logistic regression?
Multinomial logistic regression is a type of regression analysis used for modeling outcomes where the dependent variable has more than two categorical outcomes.
A use case might be Marketing: Predicting the likelihood of a customer choosing among multiple product categories.
What is probit regression?
Probit regression is a type of regression where the dependent variable is binary, and the link function used is the cumulative normal distribution function (the probit link).
Probit regression estimates the relationship between the predictors and the probability of a binary outcome. Unlike logistic regression, which uses the logistic function, probit regression assumes a normal distribution of the error terms.
Credit Scoring: Estimating the probability of loan default.
Medicine: Modeling the likelihood of a patient having a disease based on various risk factors.
Time Series Models - What is it?
Statistical techniques used to analyze time-ordered data points to identify patterns, trends, and seasonal effects, and to make forecasts.
What is encapsulation?
Encapsulation isolates each application within its VM, ensuring that the application operates independently without interfering with other applications. This isolation helps maintain stability and security across multiple VMs running on the same physical hardware.
What is a Logical data warehouse?
A logical data warehouse (LDW) is an architectural approach to data management that combines traditional data warehousing with modern data integration techniques. Unlike a traditional data warehouse that relies on a single, physical repository of data, a logical data warehouse allows data to be accessed and analyzed across multiple, disparate data sources without requiring all data to be physically moved to a single location.
An e-commerce company using a logical data warehouse can combine customer data from their CRM system, transaction data from their order management system, and web analytics data from their online store to provide a holistic view of customer behavior and sales performance. This integrated view enables real-time analytics and informed decision-making without the need for complex ETL processes and extensive data replication.
Survival or duration analysis - what is it
A branch of statistics that deals with the analysis of time-to-event data, modeling the time until an event of interest occurs.
What are Classification and Regression Trees (CART)?
Decision tree techniques used for predictive modeling, with classification trees for categorical outcomes and regression trees for continuous outcomes.
CART creates a decision-making process by asking a series of yes-or-no questions.
The Tree: The result is a tree-like diagram where each branch represents a decision point, and each leaf represents a final prediction or outcome.
Use: It helps in making predictions or decisions based on data by breaking down complex information into simple, manageable parts.
What is Multivariate adaptive regression splines?
MARS is a technique used to predict outcomes when there are complex relationships between variables. Think of it as a smart way to fit a flexible, detailed curve or surface to your data.
Let’s say you’re trying to predict test scores based on hours studied and amount of sleep. Here’s how MARS might work:
Initial Guess: You start with a basic guess, like a simple line.
Refine: MARS adds flexible pieces to the model:
For students who studied a lot, it might add a curve that reflects a different pattern.
For students who didn’t study much, it might use a different curve or line.
Combine: All these pieces are put together to make a detailed prediction model that fits the data better than a simple line.
What is a Neural Network?
Neural Networks are a type of computer model designed to recognize patterns and make decisions, inspired by how the human brain works. They are used for tasks like recognizing images, understanding speech, and making predictions.
Neurons: The basic units of a neural network, similar to brain cells. Each neuron takes in information, processes it, and passes it to the next neuron.
Layers: Neurons are organized into layers:
Input Layer: The first layer, where data enters the network (e.g., an image or text).
Hidden Layers: Layers in the middle where the actual processing happens. Each neuron in these layers transforms the data in some way.
Output Layer: The final layer, where the network’s result or prediction is produced (e.g., identifying if an image is a cat or a dog).
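The layer structure above can be sketched as a toy forward pass: inputs flow through a hidden layer to a single output neuron. The weights are made-up, untrained values chosen purely for illustration.

```python
from math import exp

def sigmoid(x):
    # Squashes any weighted sum into the range (0, 1).
    return 1 / (1 + exp(-x))

inputs = [0.5, 0.8]                          # input layer: two features
hidden_weights = [[0.4, 0.6], [0.9, -0.2]]   # two hidden neurons
output_weights = [0.7, 0.3]                  # one output neuron

# Hidden layer: each neuron takes a weighted sum of the inputs,
# then applies the activation function.
hidden = [sigmoid(sum(w * i for w, i in zip(ws, inputs)))
          for ws in hidden_weights]
# Output layer: weighted sum of the hidden activations.
output = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

print(round(output, 3))
```

Training a real network means adjusting those weights (via backpropagation) so the output matches known examples; this sketch shows only the forward direction.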
What is Multilayer Perceptron (MLP) ?
Multilayer Perceptron (MLP) is a type of neural network with multiple layers of neurons. It’s like a more complex and powerful version of a basic neural network.
What is Radial Basis Functions (RBF) ?
Radial Basis Functions (RBF) are used in certain types of neural networks and machine learning algorithms to model complex data.
Function: RBF uses functions that depend on the distance from a central point (the “center”). It’s like creating a map where each point influences nearby areas more than distant areas.
Learning: It helps in making predictions based on how close data points are to known examples.
Example:
If you’re trying to predict the price of a house based on its location, RBF can help by focusing on houses nearby and adjusting predictions based on their prices.
What is Support Vector Machines (SVM) ?
Support Vector Machines (SVM) are used to classify data into categories by finding the best boundary (or “line”) that separates them.
How It Works:
Boundary: SVM finds a line or curve that best divides the data into different groups. It aims to maximize the distance between the boundary and the nearest data points from each group.
Classification: Once the boundary is found, new data can be classified based on which side of the boundary it falls on.
Example:
Imagine you have a bunch of apples and oranges with different sizes. SVM helps draw a line that best separates apples from oranges based on their sizes.
What is Naïve Bayes?
Naïve Bayes is a simple but powerful method for classifying data based on probability.
How It Works:
Assumptions: It assumes that the features (attributes) of data are independent of each other, which makes the calculations easier.
Probability: It uses probabilities to predict which category new data belongs to based on its features.
Example:
If you want to classify emails as “spam” or “not spam,” Naïve Bayes looks at the words in the email and calculates the probability of it being spam based on previous emails.
What is k-Nearest Neighbors (k-NN) ?
k-Nearest Neighbors (k-NN) is a simple method for classifying data based on the similarity to its neighbors.
Neighbors: It looks at the “k” closest data points (neighbors) to the new data point and classifies it based on the majority class of those neighbors.
Distance: Uses distance (e.g., how close or far away) to decide which data points are neighbors.
Example:
If you’re trying to classify a new fruit as an apple, orange, or banana, k-NN looks at the closest fruits in your dataset and classifies it based on which type is most common among those closest neighbors.
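The fruit example above can be sketched directly: measure the distance from the new fruit to every known fruit, take the k closest, and vote. All measurements below are illustrative.

```python
from collections import Counter

# (size_cm, weight_g, label) -- illustrative training examples.
fruits = [
    (7, 150, "apple"), (8, 170, "apple"),
    (8, 140, "orange"), (9, 160, "orange"),
    (18, 120, "banana"), (20, 130, "banana"),
]

def classify(size, weight, k=3):
    # Sort known fruits by squared distance to the new point.
    by_distance = sorted(
        fruits,
        key=lambda f: (f[0] - size) ** 2 + (f[1] - weight) ** 2)
    # Majority vote among the k nearest neighbors.
    labels = [label for _, _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

print(classify(7.5, 155))  # apple
```

One practical caveat: because distance mixes centimetres and grams here, real k-NN pipelines usually scale features first so no single unit dominates.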
What is Geospatial Predictive Modeling?
Geospatial Predictive Modeling is about predicting outcomes based on geographic or spatial data.
How It Works:
Location Data: It uses data that includes locations, like maps or coordinates, to make predictions. This could include factors like climate, terrain, or population density.
Modeling: Builds models to predict things like where a new restaurant might be successful or how weather patterns affect agriculture.
Example:
If you want to predict where to build a new store for the best sales, geospatial predictive modeling uses location data, like population density and competition, to find the best spots.
What are the two different types of data storage?
There are two types of data storage mechanisms:
- Disk (hard disk): In traditional disk-based technology, a query accesses data from multiple tables stored on a server’s hard disk. Disk-based technologies include relational database management systems such as Oracle, SQL Server, MySQL, and DB2.
- RAM (Random Access Memory): In-memory computing primarily relies on keeping data in a server’s RAM so that processing can be performed at very fast speeds.
Modern computers have more available disk storage than RAM.
How does in-memory processing work?
It’s like having access to all your files quickly compared to a database where you have to take time to search and find your files.
Uses techniques like compression to save space.
Data can be accessed within seconds by multiple concurrent users at a detailed level, offering the potential for excellent analytics. Reading data from disk is much slower (possibly hundreds of times) than reading the same data from RAM.