What is Data Science? Flashcards
IBM Data Science Professional Certificate (Course 1/10)
What is data science?
The translation of data into a story, and the use of those stories to generate insights. It is with these insights that you are then able to develop strategies for companies, for example.
How does digital transformation affect business operations?
It affects them by updating existing processes and operations and creating new ones to harness the benefits of new technologies (e.g. harnessing the benefits of Big Data).
optical tracking
An example of how Big Data can trigger a digital transformation, not just within an organisation, but within an entire industry
Manchester City has embraced the use of Big Data to improve their game.
They have a team of data analysts who use millions of stats about players’ performance and the upcoming opposition to help the club’s chances of winning.
One of the tools they use is optical tracking, which can be used to pinpoint the position of players on the pitch 25 times a second, in relation to the ball, opposition, and teammates. This data, along with other ball-related data such as passes, shots, and turnovers, is analysed to gain insights into the team’s performance.
These insights can then be used to inform the team’s strategy in future games. For example, they might adjust their formation, change their passing strategy, or alter player positions based on the data.
It’s a great example of how Big Data can transform not just a single team, but the entire sport of football.
What is cloud computing?
The delivery of on-demand computing resources such as:
* Networks
* Servers
* Storage
* Applications
* Services
* Data centres
over the Internet on a pay-for-use basis.
What are some of the benefits of cloud computing?
- Users do not need to purchase and install software on their local systems; they can simply use the online version of the software and pay a monthly subscription.
- This makes everything more cost-effective as well as ensuring you always have access to the most up-to-date version of the software. Think of Microsoft 365, for example.
- Other benefits include saving the user local storage space and encouraging collaboration among colleagues and project teams, since the software is hosted online.
What is cloud computing composed of?
- 5 characteristics
- 3 service models
- 3 deployment models
Only Brave Rabbits Run Marathons
What are the five characteristics of cloud computing?
- On-demand self-service
- this means getting access to cloud resources such as processing power, storage, and network without requiring human interaction with each service provider
- Broad network access
- this means that cloud computing resources can be accessed via the network through standard mechanisms and platforms such as mobile phones, tablets, laptops, and workstations.
- Resource pooling
- this is what gives cloud providers economies of scale, which they pass on to their users, making the cloud cost-efficient
- using a multi-tenant model, computing resources are pooled to serve multiple customers, and cloud resources are dynamically assigned and reassigned according to demand without customers needing to know the physical location of these resources
- Rapid elasticity
- this implies that you can access more resources when you need them and scale things back when you don’t, because resources are elastically provisioned and released
- Measured service
- this implies that you only pay for what you use as you go; if you’re not using those resources, you’re not paying
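A toy sketch of the pay-for-use idea behind measured service; the resource names, rates, and usage figures below are made-up illustrative values, not real provider prices.

```python
# Toy illustration of "measured service": you pay only for the resources you
# actually use. Rates and usage numbers are hypothetical, not real prices.
hourly_rates = {"small_vm": 0.05, "storage_gb_month": 0.02}   # assumed rates
usage = {"small_vm_hours": 120, "storage_gb_months": 50}      # metered usage

bill = (usage["small_vm_hours"] * hourly_rates["small_vm"]
        + usage["storage_gb_months"] * hourly_rates["storage_gb_month"])

print(f"Monthly bill: ${bill:.2f}")   # 120 * 0.05 + 50 * 0.02 = $7.00
```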
What is cloud computing really about?
It is about using technology “as a service”, leveraging remote systems on-demand over the open Internet, scaling up and scaling back, and only paying for what you use.
What do cloud deployment models indicate?
They indicate where the infrastructure resides, who owns and manages it, and how cloud resources and services are made available to users.
What are the three types of cloud deployment models?
- Public
- this is when you leverage cloud services over the open internet on hardware owned by the cloud provider, but its usage is shared by other companies
- Private
- this means that the cloud infrastructure is provisioned for exclusive use by a single organisation
- it could run on-premises or it could be owned, managed, and operated by a service provider
- Hybrid
- this is when you use a mix of both the public and private deployment models.
What are the three cloud service models based on?
The three layers in a computing stack: infrastructure, platform, and application.
What are the three cloud service models?
- Infrastructure as a Service (IaaS)
- In this model, you can access the infrastructure and physical computing resources such as servers, networking, storage, and data centre space without the need to manage or operate them
- Platform as a Service (PaaS)
- you can access the platform that comprises the hardware and software tools that are usually needed to develop and deploy applications to users over the Internet.
- Software as a Service (SaaS)
- this is a software licensing and delivery model in which software and applications are centrally hosted and licensed on a subscription basis. It is sometimes referred to as “on-demand software.”
Why is the cloud such a positive for data science?
It allows a data scientist to bypass the physical limitations of their computer and the system they’re using.
What is Big Data?
Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools, and machines.
What does Big Data need in order to be effective?
It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered.
What does Big Data aim to do?
It aims to derive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.
What are the V’s of Big Data?
- Velocity
- This is the speed at which data is accumulated
- Volume
- This is the scale of the data, or the increase in the amount of data stored
- Variety
- This is the diversity of the data
- Veracity
- This is the quality and origin of data and its conformity to facts and accuracy
- Value
- This refers to our need and ability to turn data into value
What are the drivers of Big Data Volume?
- The increase in data sources
- Higher resolution sensors
- Scalable infrastructure
What is the difference between structured and unstructured data?
- Structured data fits neatly into rows and columns in relational databases.
- For example, employee details at a company.
- These employee details would include fields like job title, employee number, and age, which every employee at the company would have, with each field holding a consistent data type.
- Unstructured data is data that is not organised in a predefined way.
- For example, this could be tweets, blog posts, and videos (see the sketch below).
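As a minimal sketch of the contrast (using made-up employee records and an invented tweet), structured data drops straight into a table, while unstructured data has no predefined schema:

```python
# Made-up data contrasting structured and unstructured forms.
import pandas as pd

# Structured: fixed columns with consistent types, ready for a relational table.
employees = pd.DataFrame({
    "employee_number": [101, 102, 103],
    "job": ["Analyst", "Engineer", "Manager"],
    "age": [29, 34, 41],
})
print(employees.dtypes)   # each column has a single, well-defined type

# Unstructured: free-form text with no predefined schema; analysing it
# typically needs parsing or NLP rather than a simple SQL query.
tweet = "Great win tonight! #matchday"
print(len(tweet.split()), "tokens")
```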
What does variety reflect?
That data comes from different sources.
What are the drivers of variety?
- Mobile technologies
- Social media
- Wearable technologies
- Geo technologies
- Video
- Many more
CCIA
What are the attributes of veracity?
- Consistency
- Completeness
- Integrity
- Ambiguity
CN
What are the drivers of veracity?
- Cost
- Need for traceability
What is the main reason people take time to understand Big Data?
In order to derive value from it.
airport
An example of velocity in action
- Imagine a bustling airport during peak travel hours.
- Hundreds of flights are landing and taking off every minute.
- Each flight generates data: passenger lists, flight paths, fuel consumption, baggage handling, and more.
- The velocity of this data is immense, accumulating rapidly as planes taxi, ascend, and descend.
- Airlines must process this real-time data to optimise flight schedules, ensure safety, and enhance passenger experiences.
amazon
An example of volume in action
- Consider an e-commerce giant like Amazon.
- Millions of customers browse products, add items to their carts, and make purchases simultaneously.
- The sheer volume of data generated—product listings, customer profiles, transaction histories, reviews, and shipping details—is staggering.
- Amazon’s servers handle petabytes of data daily.
- To manage this volume, they employ distributed databases, data lakes, and scalable cloud infrastructure.
An example of variety
- Picture a social media platform like Instagram.
- Users share diverse content: photos, videos, stories, captions, hashtags, and geotags.
- Additionally, Instagram collects metadata (likes, comments, timestamps).
- This mix of structured (metadata) and unstructured (visual content) data creates variety.
- Instagram’s challenge lies in organising and analysing this eclectic data to personalise feeds, recommend content, and detect trends.
An example of veracity
- Think about a healthcare system that collects patient data from various sources: electronic health records, wearable devices, diagnostic images, and clinical notes.
- However, not all data is equally reliable.
- Some entries may contain errors, missing values, or inconsistencies.
- Veracity refers to the trustworthiness and accuracy of data.
- Healthcare institutions invest in data validation, quality checks, and anomaly detection to ensure reliable insights for patient care and research.
An example of value
- Think about financial trading.
- Traders analyse stock market data—stock prices, trading volumes, news sentiment, and economic indicators—to make informed decisions.
- The value lies in identifying patterns, predicting market movements, and executing profitable trades.
- However, not all data contributes equally to value.
- Extracting actionable insights requires sophisticated algorithms, machine learning models, and real-time analytics.
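As a toy illustration of pulling a simple "insight" out of price data (randomly generated numbers, not a real trading strategy), a moving-average crossover is about the simplest pattern-detection rule:

```python
# Toy sketch: derive a simple buy/sell signal from randomly generated prices.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
prices = pd.Series(100 + rng.normal(0, 1, 250).cumsum(), name="close")

short_ma = prices.rolling(window=10).mean()   # fast moving average
long_ma = prices.rolling(window=50).mean()    # slow moving average

# +1 ("long") when the fast average sits above the slow one, -1 otherwise.
signal = np.where(short_ma > long_ma, 1, -1)
print(pd.Series(signal).value_counts())
```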
What are data scientists today expected to do with Big Data?
They are expected to derive insights from Big Data and cope with the challenges that come with these massive datasets.
What are some challenges that Big Data has presented for data scientists?
The scale of the data means that it is not feasible to use conventional data analysis tools
How are the challenges associated with Big Data overcome?
Alternative tools such as Hadoop and Apache Spark are used.
How does data science differ from traditional subjects like statistics?
- Scope:
- It combines multi-disciplinary fields and computing to interpret data for decision-making.
- Applications:
- Data Science involves data cleaning, integration, visualisation, and statistical analysis of data sets to uncover patterns and trends.
- Decision Making:
- Data science uses scientific methods to discover and understand patterns, performance, and trends, often comparing numerous models to produce the best outcome.
- Meanwhile, statistics focuses on using mathematical analysis with quantified models to represent a given data set.
What do Big Data Processing Tools help you do?
These processing technologies provide ways to work with structured, semi-structured, and unstructured data so that value can be derived from big data
What are the most common open-source Big Data computing tools?
- Apache Hadoop
- Apache Hive
- Apache Spark
What is Apache Hadoop?
Apache Hadoop is a collection of tools that provide distributed storage and processing of big data
What is Apache Hive?
Apache Hive is a data warehouse for data query and analysis
What is Apache Spark?
Apache Spark is a distributed analytics framework for complex, real-time data analytics
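A minimal PySpark sketch of what "distributed analytics" looks like in practice; it assumes a local Spark installation, and the file name and column names (`events.csv`, `event_type`, `player`) are hypothetical placeholders.

```python
# Minimal PySpark sketch: filter and aggregate a dataset in parallel.
# The CSV file and its columns are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

events = spark.read.csv("events.csv", header=True, inferSchema=True)

passes_per_player = (
    events.filter(F.col("event_type") == "pass")   # keep only pass events
          .groupBy("player")                        # aggregation is distributed
          .count()
)
passes_per_player.show()

spark.stop()
```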
What does a Hadoop framework allow?
It allows distributed storage and processing of large datasets across clusters of computers.
What happens in a Hadoop distributed system?
In a Hadoop distributed system:
- a node is a single computer, and a collection of nodes forms a cluster
- Hadoop can scale up from a single node, to any number of nodes each providing local storage and computation
- Hadoop provides a reliable, scalable, and cost-effective solution for storing data with no format requirements (see the word-count sketch below)
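The classic way to see the distributed processing side is a word count written as a Hadoop Streaming mapper and reducer; this is a minimal sketch (the file names `mapper.py` and `reducer.py` are just conventional), with Hadoop distributing the input, sorting mapper output by key, and feeding it to the reducers.

```python
# mapper.py -- minimal Hadoop Streaming mapper sketch.
# Reads raw text from stdin and emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- minimal Hadoop Streaming reducer sketch.
# Hadoop sorts mapper output by key, so all counts for a word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Run locally, `cat input.txt | python mapper.py | sort | python reducer.py` mimics what Hadoop does in parallel across many nodes.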
What are the benefits that come with using Hadoop?
- Better real-time data-driven decisions
- incorporates emerging data formats not typically used in data warehouses
- Improved data access and analysis
- provides real-time self-service access to stakeholders
- Data offload and consolidation
- optimises and streamlines costs by consolidating data, including cold data, across the organisation
What is one of the four main components of Hadoop?
Hadoop Distributed File System (HDFS).
HDFS is a storage system for big data that runs on multiple commodity machines connected through a network.
What does HDFS do?
HDFS:
- provides scalable and reliable big data storage by partitioning files over multiple nodes
- splits files over multiple computers, allowing parallel access to them
- replicates file blocks on different nodes to prevent data loss
How is HDFS different from other distributed file systems?
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
An example of how HDFS is different from other distributed file systems
Consider an example where we have a file that contains the phone numbers of everyone in South Africa.
If we were to store this file on a single machine, there would be several challenges:
1. Storage: The file could be larger than the storage capacity of a single machine.
2. Processing Speed: Processing this file (e.g., searching for a specific number) could take a long time because only one machine’s resources (CPU, memory) are being used.
3. Fault Tolerance: If the machine fails, we lose access to the file.
Now, let’s see how HDFS addresses these challenges:
1. Distributed Storage: In HDFS, data is split into blocks (default size is 128MB in Hadoop 2.x), and these blocks are distributed across multiple nodes in the cluster. So, our phone directory file would be split into many blocks, and these blocks would be stored on different machines. This allows HDFS to store a file that is larger than the storage capacity of a single machine.
2. Parallel Processing: Each block of the file is stored on a separate machine, and processing can happen on all machines simultaneously. This means that if we want to search for a phone number, the search operation can be carried out on all machines at the same time, significantly speeding up the process.
3. Fault Tolerance: HDFS is designed to continue operating without a noticeable interruption to the user, even when a machine fails. This is achieved by replicating each block of data across multiple machines. So, if one machine fails, the same block can be found on another machine.
In conclusion, HDFS provides a scalable, fault-tolerant, distributed storage system that works closely with distributed processing frameworks like MapReduce. This makes it an excellent choice for storing and processing big data.
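As a back-of-the-envelope sketch of the block arithmetic described above (the 1 GB file size and replication factor of 3 are illustrative assumptions; 128 MB is the default block size already mentioned):

```python
# Rough arithmetic for how HDFS would split and replicate a file.
# File size and replication factor are illustrative assumptions.
import math

file_size_mb = 1024        # hypothetical 1 GB phone-directory file
block_size_mb = 128        # default HDFS block size (Hadoop 2.x)
replication_factor = 3     # each block is copied to 3 different nodes

num_blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication_factor

print(f"{num_blocks} blocks of up to {block_size_mb} MB")   # 8 blocks
print(f"~{raw_storage_mb} MB of raw cluster storage used")  # 1024 * 3 = 3072 MB
```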
What are some other benefits that come with using HDFS?
- Fast recovery from hardware failures
- HDFS is built to detect faults and automatically recover
- Access to streaming data
- HDFS supports high data throughput rates
- Accommodation of large datasets
- HDFS can scale to hundreds of nodes, or computers, in a single cluster
- Portability
- HDFS is portable across multiple hardware platforms and compatible with a variety of underlying operating systems
What is Hadoop intended for?
Long, sequential scans
What is Apache Hive?
Hive is an open-source data warehouse software for reading, writing, and managing large data set files that are stored directly in either HDFS or other storage systems such as Apache HBase
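A minimal sketch of what querying a Hive-managed table can look like, here through Spark's Hive integration; it assumes a Spark build with Hive support and a hypothetical `phone_directory` table already registered in the Hive metastore.

```python
# Minimal sketch: query a Hive table via Spark's Hive support.
# The table name and columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-demo")
    .enableHiveSupport()     # lets Spark read tables in the Hive metastore
    .getOrCreate()
)

result = spark.sql(
    "SELECT city, COUNT(*) AS people FROM phone_directory GROUP BY city"
)
result.show()

spark.stop()
```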