C1 : What is Data Science? Flashcards
Understand Introductory concepts.
What is MLOps?
- Machine learning operations.
- Tools that provide ongoing monitoring of models and automated retraining of drifted models.
What is an Algorithm?
A set of step-by-step instructions to solve a problem or complete a task.
What is a Model?
A representation of the relationships and patterns found in data.
* They are useful for making predictions or when analyzing complex systems.
* They retain the essential elements of the data needed for analysis.
What’s an Outlier?
A data point that differs significantly from other observations.
It may indicate anomalies, errors, or unique phenomena that could impact statistical analysis or modeling.
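One common way to flag outliers is the interquartile range (IQR) rule, sketched below. The 1.5 multiplier, the function name `iqr_outliers`, and the sample readings are illustrative assumptions, not part of the card's definition.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 95]  # hypothetical sample data
flagged = iqr_outliers(readings)
print(flagged)
```

Whether a flagged point is an error or a genuine phenomenon still has to be judged from context.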
What is Structured Data?
Data that is organized and formatted into a predictable schema, usually related tables with rows and columns.
What is Unstructured Data?
- Unorganized data that lacks a predefined data model.
- It is harder to analyze using traditional methods.
- This data type often includes text, images, videos, and other content that doesn’t fit neatly into rows and columns like structured data.
What does .CSV stand for?
Comma-separated values.
What does .XLSX stand for?
Microsoft Excel Open XML Spreadsheet.
What does .XML stand for?
Extensible Markup Language.
What does .PDF stand for?
Portable Document Format (Adobe).
What does .JSON stand for?
JavaScript Object Notation.
What does .TSV stand for?
Tab-separated values.
What are some of the benefits of the .JSON file format?
- Language-independent data format.
- It is considered one of the best formats for sharing data of any size and type, even audio and video.
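As a small illustration of the language-independent point, the sketch below round-trips a record through JSON text using Python's standard `json` module. The record itself is a made-up example.

```python
import json

# Hypothetical record; any JSON-capable language could read the encoded text.
record = {"id": 7, "tags": ["a", "b"], "active": True, "note": None}

encoded = json.dumps(record)   # Python object -> JSON text
decoded = json.loads(encoded)  # JSON text -> Python object
print(encoded)
```

The encoded string is plain text, so it can be written to a file or sent over a network and parsed elsewhere.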
What are some of the benefits of the .XLSX file format?
- XLSX uses the open file format.
- It can use and save all functions available in Excel.
- It is known to be one of the more secure Excel file formats, as it cannot store macros (macro-enabled workbooks use .XLSM instead).
What are some of the benefits of the .XML file format?
- Readable by humans and machines.
- It is a self-descriptive language.
- It does not use predefined tags like HTML does.
- XML is platform independent.
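The sketch below shows what "self-descriptive" and "no predefined tags" can look like in practice, using Python's standard `xml.etree.ElementTree`. The tag and attribute names (`library`, `book`, `title`, `year`) are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the author chooses the tag names freely.
xml_text = """
<library>
  <book year="2020"><title>Data Basics</title></book>
  <book year="2022"><title>Cloud Notes</title></book>
</library>
"""

root = ET.fromstring(xml_text.strip())
titles = [book.findtext("title") for book in root.findall("book")]
years = [int(book.get("year")) for book in root.findall("book")]
print(titles, years)
```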
What is a Data Visualization?
A visual way of representing data and its trends that is easily comprehensible.
What defines a Delimited Text File?
It is a plain text file where a specific character separates the data values.
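A minimal sketch of reading delimited text with Python's standard `csv` module follows. The helper name `parse_delimited` and the sample rows are assumptions made for illustration; only the delimiter differs between the two inputs.

```python
import csv
import io

def parse_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

csv_text = "name,score\nAda,90\nLin,85\n"     # hypothetical comma-separated sample
tsv_text = "name\tscore\nAda\t90\nLin\t85\n"  # the same rows, tab-separated

csv_rows = parse_delimited(csv_text, ",")
tsv_rows = parse_delimited(tsv_text, "\t")
print(csv_rows == tsv_rows)
```

Note that every value is read as a string; converting to numbers is a separate step.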
What is Hadoop?
An open-source framework designed to store and process large datasets across clusters of computers.
What are Jupyter Notebooks?
An IDE and type of computational notebook that allows researchers to create and share code, equations, visualizations, and explanatory text.
(Also known as Python notebooks.)
What is the Nearest Neighbor algorithm?
An algorithm that uses proximity to make classifications or predictions about how to group an individual data point.
Also known as KNN or k-NN (k-nearest neighbors).
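A minimal sketch of the idea, assuming Euclidean distance and a simple majority vote, is shown below. The function name `knn_predict`, the choice of k = 3, and the training points are illustrative, not a reference implementation.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled points: (features, label)
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```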
What is a Neural Network?
A computational model used in deep learning that mimics the structure and functioning of the human brain’s neural pathways. It takes an input, processes it using previous learning, and produces an output.
What is Pandas?
- An open-source Python library that provides tools for working with structured data.
- It is often used for data manipulation and analysis.
What is R?
An open-source programming language used for statistical computing, data analysis, and data visualization.
What is a recommendation engine?
A computer program that analyzes user input, such as behaviors or preferences, and makes personalized recommendations based on that analysis.
What is regression?
A statistical model that identifies the strength of the relationship between one or more inputs and an output.
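For the simplest case of one input and one output, a least-squares line can be fitted by hand, as sketched below. The function name `fit_line` and the data are assumptions for illustration; real analyses would typically use a statistics library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # hypothetical data lying exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The slope estimates how much the output changes per unit change in the input.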
What defines Tabular Data?
Data that is organized into rows and columns.
What are the five characteristics of Cloud Computing?
- On-demand self-service.
- Broad network access.
- Resource pooling.
- Rapid elasticity.
- Measured service.
What is on-demand self-service in cloud computing?
Access to cloud resources such as processing power, storage, and network without requiring human interaction with the service provider.
What is broad network access in cloud computing?
When cloud computing resources can be accessed via the network through standard mechanisms and platforms such as mobile phones, tablets, laptops, and workstations.
What is resource pooling in cloud computing?
A scheme that gives cloud providers economies of scale.
* Cloud resources are dynamically assigned and reassigned according to demand, without customers needing to know the physical location of these resources.
What is rapid elasticity in cloud computing?
A characteristic of cloud computing whereby organizations are able to access more cloud resources when they need them, and scale back when they don’t.
What is measured service in cloud computing?
A scheme by which an organization pays only for what it uses or reserves as it goes.
* Resource usage is monitored, measured, and reported transparently based on an organization’s utilization.
* If they’re not using resources, they’re not paying.
What are the three Cloud Deployment Models?
- Public Cloud.
- Hybrid Cloud.
- Private Cloud.
What is a public cloud in cloud computing?
When an organization leverages cloud services over the open internet on hardware that is owned by the cloud provider and shared with other companies.
What is private cloud in cloud computing?
Infrastructure provisioned for exclusive use by a single organization. It could run on-premises or it could be owned, managed, and operated by a service provider.
What is hybrid cloud in cloud computing?
When an organization is leveraging a mix of public cloud(s) and private cloud(s) that are configured to work together seamlessly.
What are the three cloud service models?
- IaaS
- PaaS
- SaaS
What does IaaS stand for?
Infrastructure as a Service.
A cloud computing service model that gives an organization access to the infrastructure and physical computing resources such as servers, networking, storage, and data center space without the need to manage or operate them.
What does PaaS stand for?
Platform as a Service.
A cloud computing service model that gives an organization access to the platform that comprises the hardware and software tools that are usually needed to develop and deploy applications to users over the Internet.
What does SaaS stand for?
Software as a Service.
A cloud computing service model that gives an organization access to a software licensing and delivery model whereby software and applications are centrally hosted on the cloud and licensed on a subscription basis.
What are the V’s of Big Data?
- Velocity - The speed at which data is accumulating.
- Volume - The scale of the data accumulating.
- Variety - The vast and growing number of data types.
- Veracity - The quality and origin of data.
- Value - Utility implicit in data.
What is a Hadoop node?
A single computer.
What is a Hadoop cluster?
A network of Hadoop nodes.
What does HDFS stand for?
Hadoop Distributed File System.
What is Apache Hive?
Data warehouse software.
It is open-source and excels at reading, writing, and managing large data sets that are stored directly in either HDFS or other data storage systems such as Apache HBase.
What is the best use case for Apache Hive and why?
It is suited for data warehousing tasks such as ETL, reporting, and data analysis.
* This is because it is built for read-heavy workloads with few writes, where high query latency is acceptable.
* Think, “slow and meticulous tasks”.
What is Apache Spark?
A distributed data analytics framework designed to perform complex data analytics in real-time.
What is the best use case for Apache Spark and why?
Extracting and processing large volumes of data for a wide range of applications.
E.g.:
* Interactive Analytics,
* Streams Processing,
* Machine Learning,
* Data Integration,
* ETL.
It is a general-purpose data processing engine that takes advantage of in-memory processing and only writes to disk when its memory is constrained.
What is an in-sample forecast?
A test of the predictive capabilities of a model on observed data.
What are the seven steps of a Data Mining exercise?
- Goal Setting.
- Selecting Data.
- Preprocessing.
- Transforming Data.
- Storing Data.
- Data Mining.
- Evaluating.
In a data mining exercise, what is goal setting?
Identifying the key questions that need to be answered.
Also, performing a cost-benefit analysis of collecting the data vis-à-vis the expected level of accuracy and usefulness of the results obtained from the data mining exercise.
In a data mining exercise, describe the process of selecting data.
Identifying relevant existing data, and/or collecting new data.
Costs in time and money should be kept in mind when acquiring any data.
In a data mining exercise, what is preprocessing?
Developing and/or employing a formal method of dealing with missing data and determining whether the data are missing randomly or systematically.
Also, employing regular checks to ensure data integrity.
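One simple formal method for missing data is to measure how much is missing and then impute, as sketched below. Median imputation is just one possible choice and is an assumption here, as are the function names and the sample values; the card itself does not prescribe a method.

```python
import statistics

def missing_rate(values):
    """Fraction of entries that are missing (represented as None)."""
    return sum(v is None for v in values) / len(values)

def impute_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

ages = [34, None, 29, 41, None, 38]  # hypothetical column with gaps
rate = missing_rate(ages)
cleaned = impute_with_median(ages)
print(rate, cleaned)
```

Imputation of this kind is only reasonable when values are missing at random; systematic gaps call for investigating why the data is missing.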
In a data mining exercise, what is transforming?
- Determining the appropriate format in which data must be stored.
- Prioritizing reducing the number of attributes needed to explain the phenomena.
- Using algorithms to convert the data to fit those determinations and priorities.
In a data mining exercise, what considerations must be made when storing transformed data?
- Is the format conducive to data mining?
- Does the storage grant the data scientist expedited read/write privileges?
- Are you taking into account data safety and privacy concerns?
In a data mining exercise, what is “data mining”?
Using data analysis methods, including parametric and non-parametric methods, and machine-learning algorithms to discover insights in the cleaned, transformed, and stored data.
In a data mining exercise, what is evaluation?
The formal evaluation of data mining results.
What language is Hadoop implemented in?
Java.
In business, what is Digital Change / Digital Transformation?
A strategic and cultural organizational change driven by data science, especially Big Data, where digital technology is integrated across the organization, resulting in fundamental operational and value delivery changes.
What is Data Replication in cloud computing?
A strategy in which data is duplicated across multiple nodes in a cluster to ensure data durability and availability, reducing the risk of data loss due to hardware failures.
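The toy simulation below illustrates why duplication helps availability. The round-robin placement, the node and block names, and the helper functions are all hypothetical; real systems such as HDFS use their own placement policies (HDFS defaults to three replicas per block).

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so the chosen nodes are distinct.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def available_blocks(placement, failed):
    """Blocks that still have at least one replica on a healthy node."""
    return [b for b, holders in placement.items()
            if any(n not in failed for n in holders)]

nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_a", "blk_b", "blk_c"]
placement = place_replicas(blocks, nodes)
survivors = available_blocks(placement, failed={"node1", "node2"})
print(survivors)
```

With three replicas on four nodes, every block in this example remains readable even after two nodes fail.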
What is Commodity Hardware in cloud computing?
Standard, off-the-shelf hardware components that can be used in a big data cluster, offering cost-effective solutions for storage and processing without relying on specialized hardware.
What is Data Science?
The process and method for extracting knowledge and insights from large volumes of disparate data.
What is Data Mining?
Automatically searching and analyzing data, and discovering previously unrevealed patterns.
What is Machine Learning?
A subset of AI that uses computer algorithms to analyze data and make intelligent decisions based on what it has previously learned without being explicitly programmed.
What is Deep Learning?
A specialized subset of machine learning that uses layered neural networks to simulate human decision-making.
* Deep learning algorithms can label and categorize information and identify patterns.
What is Generative AI?
A subset of AI that focuses on creating new data, such as images, music, text, or code, rather than just analyzing existing data.
What does GAN stand for?
Generative Adversarial Networks.
What does VAE stand for?
Variational Autoencoders.
In general what do VAEs and GANs do?
These models create new instances of data that replicate the underlying distribution of the original data by learning patterns from enormous volumes of data.
What is synthetic data?
Artificial data with properties similar to the real data, such as its distribution, clustering, and many other factors an AI learned about the real data set.
What is an Artificial Neural Network?
Collections of small computing units (neurons) that process data and learn to make decisions over time.
What is Bayesian Analysis?
Using Bayes’ theorem to update probabilities based on new evidence.
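A worked example may help: the sketch below updates the probability of a hypothesis after a positive test result. The prior, the true-positive rate, and the false-positive rate are invented numbers, and the function name `posterior` is an assumption.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' theorem for a binary hypothesis H and evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% base rate, 95% sensitivity, 5% false positives.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))
```

Even with a fairly accurate test, the updated probability stays modest because the prior is so low.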
What is Cluster Analysis?
Grouping similar data points together based on certain features or attributes.
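One well-known clustering technique is k-means, sketched below for one-dimensional data. The starting centroids, the fixed iteration count, and the data points are illustrative assumptions; other clustering methods group points differently.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points around centroids by repeated assign-and-average steps."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]  # hypothetical measurements
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```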
What is a Decision Tree?
A type of machine learning algorithm used for decision-making that creates a tree-like structure of decisions.
Name two deep learning models.
- Generative Adversarial Networks (GANs)
- Variational Autoencoders (VAEs)
What is NLP?
Natural Language Processing.
A field of AI that enables machines to understand, generate, and interact with human language.
What is an Arithmetic Model?
A mathematical model to analyze data and predict outcomes.
What is a Data Cluster?
A group of similar, related data points distinct from other clusters.
What is an HPC?
High-performance computing cluster.
A computing technology that uses a system of networked computers designed to solve complex and computationally intensive problems in traditional environments.
What is Stata?
A software package used for statistical analysis.
What is SQL?
Structured Query Language.
What is EDA?
Exploratory Data Analysis.
What is Technical Metadata?
Technical definitions of the data structures.
What is Process Metadata?
Data that describe the processes that operate behind business systems such as data warehouses, accounting systems, or CRM tools.
What is Business Metadata?
It is information about the data described in readily interpretable ways.
What is a NoSQL database?
A database designed to store and manage unstructured data.
What does RDBMS mean?
Relational Database Management System.
In a DB, Rows are called?
Records.
In a DB, Columns are called?
Attributes.
What is ACID?
Atomicity, Consistency, Isolation, and Durability.
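Atomicity in particular can be demonstrated with Python's built-in `sqlite3` module, as sketched below. The `accounts` table, the `transfer` helper, and the balances are hypothetical; the point is that a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it fails
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to bob is applied first and then undone by the rollback, so the balances are unchanged after the failed transfer.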
What is a Data Mart?
A sub-section of the data warehouse, built specifically for a particular function.
What is a Data Lake?
A pool of raw data, where data is simply tagged with a UID for future use.
What does NoSQL stand for?
Not Only SQL.