What is Data Science? Flashcards

IBM Data Science Professional Certificate (Course 1/10)

1
Q

What is data science?

A

The translation of data into a story, and the use of those stories to generate insights. It is with these insights that you can then develop strategies for companies, for example.

2
Q

How does digital transformation affect business operations?

A

It affects them by updating existing processes and operations and creating new ones to harness the benefits of new technologies (e.g. harnessing the benefits of Big Data).

3
Q

optical tracking

An example of how Big Data can trigger a digital transformation, not just within an organisation, but within an entire industry

A

Manchester City has embraced the use of Big Data to improve their game.

They have a team of data analysts who use millions of stats about players’ performance and the upcoming opposition to help the club’s chances of winning.

One of the tools they use is optical tracking, which can be used to pinpoint the position of players on the pitch 25 times a second, in relation to the ball, opposition, and teammates. This data, along with other ball-related data such as passes, shots, and turnovers, is analysed to gain insights into the team’s performance.

These insights can then be used to inform the team’s strategy in future games. For example, they might adjust their formation, change their passing strategy, or alter player positions based on the data.

It’s a great example of how Big Data can transform not just a single team, but the entire sport of football.

4
Q

What is cloud computing?

A

The delivery of on-demand computing resources such as:
* Networks
* Servers
* Storage
* Applications
* Services
* Data centres
over the Internet on a pay-for-use basis.

5
Q

What are some of the benefits of cloud computing?

A
  • Users do not need to purchase and install software on their local systems; they can just use the online version of the software and pay a monthly subscription.
  • This is more cost-effective and ensures you always have access to the most up-to-date version of the software. Think of Microsoft 365, for example.
  • Other benefits include saving the user local storage space and encouraging collaboration among colleagues and project teams, since the software is hosted online.
6
Q

What is cloud computing composed of?

A
  • 5 characteristics
  • 3 service models
  • 3 deployment models
7
Q

Only Brave Rabbits Run Marathons

What are the five characteristics of cloud computing?

A
  1. On-demand self-service
    1. this means getting access to cloud resources such as processing power, storage, and networking without requiring human interaction with each service provider
  2. Broad network access
    1. this means that cloud computing resources can be accessed via the network through standard mechanisms and platforms such as mobile phones, tablets, laptops, and workstations.
  3. Resource pooling
    1. this is what gives cloud providers economies of scale, which they pass on to their users, making the cloud cost-efficient
    2. using a multi-tenant model, computing resources are pooled to serve multiple customers, and cloud resources are dynamically assigned and reassigned according to demand without customers needing to know the physical location of these resources
  4. Rapid elasticity
    1. this implies that you can access more resources when you need them and scale things back when you don’t, because resources are elastically provisioned and released
  5. Measured service
    1. this implies that you only pay for what you use as you go; if you’re not using those resources, you’re not paying
8
Q

What is cloud computing really about?

A

It is about using technology “as a service”, leveraging remote systems on-demand over the open Internet, scaling up and scaling back, and only paying for what you use.

9
Q

What do cloud deployment models indicate?

A

They indicate where the infrastructure resides, who owns and manages it, and how cloud resources and services are made available to users.

10
Q

What are the three types of cloud deployment models?

A
  1. Public
    1. this is when you leverage cloud services over the open internet on hardware owned by the cloud provider, but its usage is shared by other companies
  2. Private
    1. this means that the cloud infrastructure is provisioned for exclusive use by a single organisation
    2. it could run on-premises or it could be owned, managed, and operated by a service provider
  3. Hybrid
    1. this is when you use a mix of both the public and private deployment models.
11
Q

What are the three cloud service models based on?

A

The three layers in a computing stack: infrastructure, platform, and application.

12
Q

What are the three cloud service models?

A
  1. Infrastructure as a Service (IaaS)
    1. In this model, you can access the infrastructure and physical computing resources such as servers, networking, storage, and data centre space without the need to manage or operate them
  2. Platform as a Service (PaaS)
    1. you can access the platform that comprises the hardware and software tools that are usually needed to develop and deploy applications to users over the Internet.
  3. Software as a Service (SaaS)
    1. this is a software licensing and delivery model in which software and applications are centrally hosted and licensed on a subscription basis. It is sometimes referred to as “on-demand software.”
13
Q

Why is the cloud such a positive for data science?

A

It allows a data scientist to bypass the physical limitations of their computer and the system they’re using.

14
Q

What is Big Data?

A

Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools, and machines.

15
Q

What does Big Data need in order to be effective?

A

It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered.

16
Q

What does Big Data aim to do?

A

It aims to derive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.

17
Q

What are the V’s of Big Data?

A
  • Velocity
    • This is the speed at which data is accumulated
  • Volume
    • This is the scale of the data, or the increase in the amount of data stored
  • Variety
    • This is the diversity of the data
  • Veracity
    • This is the quality and origin of data and its conformity to facts and accuracy
  • Value
    • This refers to our need and ability to turn data into value
18
Q

What are the drivers of Big Data Volume?

A
  • The increase in data sources
  • Higher resolution sensors
  • Scalable infrastructure
19
Q

What is the difference between structured and unstructured data?

A
  • Structured data fits neatly into rows and columns in relational databases.
    • For example, employee details at a company.
    • These employee details would include things like job, employee number, age etc. which would be criteria that everyone at the company would have, with all of it being the same data type.
  • Unstructured data is data that is not organised in a predefined way.
    • For example, this could be tweets, blog posts, and videos (see the sketch below).
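
A minimal Python sketch of the contrast, using pandas and hypothetical values: structured data fits a typed table, while unstructured data has no predefined schema.

```python
import pandas as pd

# Structured: fixed columns, one consistent type per column,
# ready for a relational table (hypothetical employee records).
employees = pd.DataFrame({
    "employee_number": [101, 102, 103],
    "job": ["Analyst", "Engineer", "Manager"],
    "age": [29, 34, 41],
})

# Unstructured: no predefined organisation; each item is free-form.
tweets = [
    "Loving the new stadium! #matchday",
    "New blog post: behind the scenes at training...",
]

print(employees.dtypes)  # every column reports a single dtype
print(tweets[0])         # free text has no schema to inspect
```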
20
Q

What does variety reflect?

A

That data comes from different sources.

21
Q

What are the drivers of variety?

A
  • Mobile technologies
  • Social media
  • Wearable technologies
  • Geo technologies
  • Video
  • Many more
22
Q

CCIA

What are the attributes of veracity?

A
  • Consistency
  • Completeness
  • Integrity
  • Ambiguity
23
Q

CN

What are the drivers of veracity?

A
  • Cost
  • Need for traceability
24
Q

What is the main reason people take time to understand Big Data?

A

In order to derive value from it.

25
Q

airport

An example of velocity in action

A
  • Imagine a bustling airport during peak travel hours.
    • Hundreds of flights are landing and taking off every minute.
    • Each flight generates data: passenger lists, flight paths, fuel consumption, baggage handling, and more.
    • The velocity of this data is immense, accumulating rapidly as planes taxi, ascend, and descend.
    • Airlines must process this real-time data to optimise flight schedules, ensure safety, and enhance passenger experiences.
26
Q

amazon

An example of volume in action

A
  • Consider an e-commerce giant like Amazon.
    • Millions of customers browse products, add items to their carts, and make purchases simultaneously.
    • The sheer volume of data generated—product listings, customer profiles, transaction histories, reviews, and shipping details—is staggering.
    • Amazon’s servers handle petabytes of data daily.
    • To manage this volume, they employ distributed databases, data lakes, and scalable cloud infrastructure.
27
Q

An example of variety.

A
  • Picture a social media platform like Instagram.
    • Users share diverse content: photos, videos, stories, captions, hashtags, and geotags.
    • Additionally, Instagram collects metadata (likes, comments, timestamps).
    • This mix of structured (metadata) and unstructured (visual content) data creates variety.
    • Instagram’s challenge lies in organising and analysing this eclectic data to personalise feeds, recommend content, and detect trends.
28
Q

An example of veracity

A
  • Think about a healthcare system that collects patient data from various sources: electronic health records, wearable devices, diagnostic images, and clinical notes.
    • However, not all data is equally reliable.
    • Some entries may contain errors, missing values, or inconsistencies.
    • Veracity refers to the trustworthiness and accuracy of data.
    • Healthcare institutions invest in data validation, quality checks, and anomaly detection to ensure reliable insights for patient care and research.
29
Q

An example of value

A
  • Think about financial trading.
    • Traders analyse stock market data—stock prices, trading volumes, news sentiment, and economic indicators—to make informed decisions.
    • The value lies in identifying patterns, predicting market movements, and executing profitable trades.
    • However, not all data contributes equally to value.
    • Extracting actionable insights requires sophisticated algorithms, machine learning models, and real-time analytics.
30
Q

What are data scientists today expected to do with Big Data?

A

They are expected to derive insights from Big Data and cope with the challenges that come with these massive datasets.

31
Q

What are some challenges that Big Data has presented for data scientists?

A

The scale of the data means that it is not feasible to use conventional data analysis tools

32
Q

How are the challenges associated with Big Data overcome?

A

Alternative tools such as Hadoop and Apache Spark are used.

33
Q

How does data science differ from traditional subjects like statistics?

A
  • Scope:
    • It combines multi-disciplinary fields and computing to interpret data for decision-making.
  • Applications:
    • Data Science involves data cleaning, integration, visualisation, and statistical analysis of data sets to uncover patterns and trends.
  • Decision Making:
    • Data science uses scientific methods to discover and understand patterns, performance, and trends, often comparing numerous models to produce the best outcome.
    • Meanwhile, statistics focuses on using mathematical analysis with quantified models to represent a given data set.
34
Q

What do Big Data Processing Tools help you do?

A

These processing technologies provide ways to work with structured, semi-structured, and unstructured data so that value can be derived from big data

35
Q

What are the most common open-source Big Data computing tools?

A
  • Apache Hadoop
  • Apache Hive
  • Apache Spark
36
Q

What is Apache Hadoop?

A

Apache Hadoop is a collection of tools that provide distributed storage and processing of big data

37
Q

What is Apache Hive?

A

Apache Hive is a data warehouse for data query and analysis

38
Q

What is Apache Spark?

A

Apache Spark is a distributed analytics framework for complex, real-time data analytics

39
Q

What does a Hadoop framework allow?

A

It allows distributed storage and processing of large datasets across clusters of computers.

40
Q

What happens in a Hadoop distributed system?

A

In a Hadoop distributed system:
- a node is a single computer, and a collection of nodes forms a cluster
- Hadoop can scale up from a single node to any number of nodes, each providing local storage and computation
- Hadoop provides a reliable, scalable, and cost-effective solution for storing data with no format requirements

41
Q

What are the benefits that come with using Hadoop?

A
  • Better real-time data-driven decisions
    • incorporates emerging data formats not typically used in data warehouses
  • Improved data access and analysis
    • provides real-time self-service access to stakeholders
  • Data offload and consolidation
    • optimises and streamlines costs by consolidating data, including cold data, across the organisation
42
Q

What is one of the four main components of Hadoop?

A

Hadoop Distributed File System (HDFS).
HDFS is a storage system for big data that runs on multiple pieces of commodity hardware connected through a network.

43
Q

What does HDFS do?

A

HDFS:
- provides scalable and reliable big data storage by partitioning files over multiple nodes
- splits files over multiple computers, allowing parallel access to them
- replicates file blocks on different nodes to prevent data loss

44
Q

How is HDFS different from other distributed file systems?

A

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

45
Q

An example of how HDFS is different from other distributed file systems

A

Consider an example where we have a file that contains the phone numbers of everyone in South Africa.

If we were to store this file on a single machine, there would be several challenges:
1. Storage: The file could be larger than the storage capacity of a single machine.
2. Processing Speed: Processing this file (e.g., searching for a specific number) could take a long time because only one machine’s resources (CPU, memory) are being used.
3. Fault Tolerance: If the machine fails, we lose access to the file.

Now, let’s see how HDFS addresses these challenges:
1. Distributed Storage: In HDFS, data is split into blocks (default size is 128MB in Hadoop 2.x), and these blocks are distributed across multiple nodes in the cluster. So, our phone directory file would be split into many blocks, and these blocks would be stored on different machines. This allows HDFS to store a file that is larger than the storage capacity of a single machine.
2. Parallel Processing: Each block of the file is stored on a separate machine, and processing can happen on all machines simultaneously. This means that if we want to search for a phone number, the search operation can be carried out on all machines at the same time, significantly speeding up the process.
3. Fault Tolerance: HDFS is designed to continue operating without a noticeable interruption to the user, even when a machine fails. This is achieved by replicating each block of data across multiple machines. So, if one machine fails, the same block can be found on another machine.

In conclusion, HDFS provides a scalable, fault-tolerant, distributed storage system that works closely with distributed processing frameworks like MapReduce. This makes it an excellent choice for storing and processing big data.
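
A back-of-the-envelope Python sketch of the splitting and replication described above, using the 128MB default block size from Hadoop 2.x and HDFS's default replication factor of 3; the file size itself is hypothetical.

```python
import math

file_size_mb = 50_000   # hypothetical 50 GB phone-directory file
block_size_mb = 128     # HDFS default block size in Hadoop 2.x
replication = 3         # HDFS default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication

print(blocks)          # 391 blocks spread across the cluster's nodes
print(raw_storage_mb)  # 150000 MB of raw storage once replicated
```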

46
Q

What are some other benefits that come with using HDFS?

A
  • Fast recovery from hardware failures
    • HDFS is built to detect faults and automatically recover
  • Access to streaming data
    • HDFS supports high data throughput rates
  • Accommodation of large datasets
    • HDFS can scale to hundreds of nodes, or computers, in a single cluster
  • Portability
    • HDFS is portable across multiple hardware platforms and compatible with a variety of underlying operating systems
47
Q

What is Hadoop intended for?

A

Long, sequential scans

48
Q

What is Apache Hive?

A

Hive is an open-source data warehouse software for reading, writing, and managing large data set files that are stored directly in either HDFS or other storage systems such as Apache HBase

49
Q

What is the problem with Hive?

A

Because Hive is based on Hadoop, queries have very high latency. This makes Hive less appropriate for applications that need a very fast response time.

50
Q

What is the issue with Hive being read-based?

A

It makes it unsuitable for transaction processing that typically involves a high percentage of write operations.

51
Q

What is Hive suited for?

A

Hive is suited for data warehousing tasks such as ETL, reporting, and data analysis and includes tools that enable easy access to data via SQL.
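
A minimal sketch of that SQL-based access, assuming a PySpark session with Hive support enabled and a hypothetical `sales` table already registered in the Hive metastore.

```python
from pyspark.sql import SparkSession

# Hive support lets Spark read tables defined in Hive's metastore.
spark = (SparkSession.builder
         .appName("hive-reporting-sketch")
         .enableHiveSupport()
         .getOrCreate())

# A typical warehousing/reporting query: an aggregation, not
# write-heavy transaction processing (table and columns hypothetical).
report = spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
""")
report.show()
```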

52
Q

What does ETL stand for?

A

ETL stands for Extract, Transform, and Load.

53
Q

In the context of Apache Spark, what is ETL?

A

In the context of Apache Spark, ETL is a process that:
1. Extracts data from the source (original database or data source).
2. Transforms the data, which involves changing the structure of the information so it integrates with the target data system and the rest of the data in that system. This transformation can include cleaning (such as mapping NULL to 0 or changing date format consistency), deduplication (identifying and removing duplicate records), and format revision (like character set conversion, unit of measurement conversion, date/time conversion, etc.).
3. Loads the transformed data into a target database.
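
A minimal PySpark sketch of those three steps; the file paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# 1. Extract: read from the source system (hypothetical CSV path).
raw = spark.read.csv("/data/customers.csv", header=True, inferSchema=True)

# 2. Transform: the cleaning steps described above, on hypothetical
#    columns -- map NULLs to 0, deduplicate, revise the date format.
clean = (raw
         .fillna({"balance": 0})
         .dropDuplicates(["customer_id"])
         .withColumn("signup_date", F.to_date("signup_date", "dd/MM/yyyy")))

# 3. Load: write the result into the target system (hypothetical path).
clean.write.mode("overwrite").parquet("/warehouse/customers")
```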

54
Q

What is Apache Spark?

A

Spark is a general-purpose data processing engine designed to extract and process large volumes of data for a wide variety of applications, including Interactive Analytics, Machine Learning, and ETL.

55
Q

How does Spark work?

A

It takes advantage of in-memory processing to significantly increase the speed of computations and spilling to disk only when memory is constrained.

56
Q

What are some key attributes of Spark?

A
  • Spark can run using its standalone clustering technology
  • It can run on top of other infrastructures, such as Hadoop
  • It can access data in a wide variety of data sources, including HDFS and Hive
  • Spark is able to process streaming data fast
  • Spark is able to perform complex analytics in real-time
57
Q

What are the seven steps in the data mining process?
(ESPTSME)

A
  1. Establishing data mining goals
  2. Selecting data
  3. Preprocessing data
  4. Transforming data
  5. Storing data
  6. Mining data
  7. Evaluating Mining Results
58
Q

What are the goals you need to set up as the first step in the data mining exercise?

A
  • Identifying the key questions that need to be answered
  • Considering the costs and benefits of the exercise
  • Determining, in advance, the expected level of accuracy and usefulness of the results obtained from data mining
59
Q

What is always instrumental in determining the goals and scope of the data mining exercise, and why?

A

The cost-benefit trade-off is always important in the data mining exercise.
- The level of accuracy expected from the results influences the costs.
- High levels of accuracy from data mining would cost more and vice versa.
- Furthermore, beyond a certain level of accuracy, you do not gain much from the exercise, given the diminishing returns.

60
Q

What is the output of a data mining exercise largely dependent on?

A

The quality of data being used.

61
Q

What must you do if data are not available for further processing?

A

In such cases, you must identify other sources of data or even plan new data collection initiatives, including surveys.

62
Q

What factors have a direct bearing on the cost of data mining exercise?

A

The size, type, and frequency of collection of data.

63
Q

Why is preprocessing an important step in the data mining process?

A

This is because in the preprocessing stage:
- you identify the irrelevant attributes of data and expunge such attributes from further consideration.
- During preprocessing, you also identify the erroneous aspects of the data set and flag them as such.
- Lastly, in the preprocessing stage, you develop a formal method of dealing with missing data and determine whether the data are missing randomly or systematically.
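
A minimal pandas sketch (hypothetical data) of those three preprocessing steps: expunging an irrelevant attribute, flagging erroneous values, and dealing formally with missing data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [29, np.nan, 41, 250],                         # 250 is erroneous
    "income": [52000, 61000, np.nan, 58000],
    "favourite_colour": ["red", "blue", "green", "red"],  # irrelevant here
})

df = df.drop(columns=["favourite_colour"])  # expunge irrelevant attributes
df["age_suspect"] = df["age"] > 120         # flag erroneous entries
df["age_missing"] = df["age"].isna()        # record where data are missing
df["age"] = df["age"].fillna(df["age"].median())  # one formal imputation method

print(df)
```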

64
Q

What is the next step after the relevant attributes of data have been retained?

A

The next step is to determine the appropriate format in which data must be stored.

65
Q

What is an important consideration in data mining?

A

To reduce the number of attributes needed to explain the phenomena.

66
Q

How should the transformed data be stored?

A

In a format that gives unrestricted and immediate read/write privileges to the data scientist.

67
Q

Why does the data storage scheme need to facilitate efficiently reading from and writing to the database?

A

During data mining, new variables are created, which are written back to the original database.

68
Q

When is data subject to data mining?

A

After data is appropriately processed, transformed, and stored.

69
Q

What does data mining cover?

A

Data analysis methods, including parametric and non-parametric methods, and machine-learning algorithms.

70
Q

What is a good starting point for data mining and why?

A

A good starting point for data mining is data visualisation.

Multidimensional views of the data using the advanced graphing capabilities of data mining software are very helpful in developing a preliminary understanding of the trends hidden in the data set.

71
Q

What do you do after extracting results from data mining?

A

You perform a formal evaluation of the results.

Data mining and evaluating the results becomes an iterative process: analysts use better and improved algorithms to raise the quality of the results in light of the feedback received from key stakeholders.

72
Q

What is big data?

A

Big Data refers to datasets that are so massive, so quickly built, and so varied that they defy traditional data analysis methods, such as those you might perform with a relational database.

73
Q

What is data mining?

A

The process of automatically searching and analysing data to discover previously unrevealed patterns

74
Q

What is machine learning?

A

A subset of AI that uses computer algorithms to analyse data and make intelligent decisions based on what it has learned without being explicitly programmed.

75
Q

What are some important characteristics of machine learning algorithms?

A
  • They are trained with large datasets
  • They learn from examples
  • They do not follow rules-based algorithms
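A minimal scikit-learn sketch of those characteristics in miniature: the model is fitted to labelled examples rather than hand-written rules. The data here is tiny and hypothetical; real training sets are far larger.

```python
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [10], [11], [12]]  # feature values (e.g. hours of activity)
y = [0, 0, 0, 1, 1, 1]                 # labels the model learns from

model = LogisticRegression().fit(X, y)  # trained on examples, not rules
print(model.predict([[2.5], [9.5]]))    # generalises to unseen inputs
```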
76
Q

Why is machine learning special?

A

It enables machines to make decisions on their own and make accurate predictions using the provided data

77
Q

What is deep learning?

A

It is a specialised subset of machine learning that uses layered neural networks to simulate human decision-making.

78
Q

What is an important characteristic of deep learning algorithms?

A

They can label and characterise information and identify patterns.

It is what enables AI systems to continuously learn on the job and improve the quality and accuracy of results by determining whether decisions were correct.

79
Q

What is a neural network in AI?

A

It is a collection of small computing units (neurons) that take incoming data and learn to make decisions over time.

80
Q

Why do deep learning algorithms become more efficient as the data set increases, as opposed to other machine learning algorithms that may plateau as data increases?

A

Because neural networks are often many layers deep, their performance continues to improve as more data is provided, whereas shallower algorithms level off.

81
Q

What is the difference between AI and Data Science?

A
  • Data Science is an interdisciplinary field that employs techniques from fields like statistics, computer science, and information science to create actionable intelligence from data.
    • It utilises technologies like machine learning to interpret and analyse data, discover patterns, make predictions, and generate insights.
  • Artificial Intelligence is broader and involves the creation of intelligent machines capable of learning and decision-making.
    • It denotes the emulation of human cognition in machines designed to mimic human thought and behaviour.
82
Q

What is Generative AI?

A

Generative AI is a subset of Artificial Intelligence that focuses on producing new data rather than just analysing existing data.

83
Q

What does GenAI allow machines to do?

A

It allows machines to create content such as:
- Images
- Music
- Language
- Code
- And more

84
Q

How does Generative AI work?

A

Deep learning models such as generative adversarial networks (GANs) or variational autoencoders (VAEs) form part of the foundation of what allows Generative AI to create new content.

These models create new instances that replicate the underlying features of the original data by learning patterns from enormous volumes of data.

85
Q

What are some applications of Generative AI?

A
  • Natural Language Processing
    • e.g. OpenAI’s GPT-4
  • Healthcare
    • GenAI can synthesise medical images aiding in the training of medical professionals
  • Art and Design
    • GenAI can create visually stunning artworks
  • Gaming
    • Game developers can use GenAI to generate realistic environments and characters
86
Q

How do data scientists make use of Generative AI?

A
  • By way of synthetic data
  • By way of coding automation
  • By way of uncovering insights
87
Q

Why would a data scientist make use of synthetic data?

A
  • Building data models takes a lot of data, and sometimes data sets may not contain enough to build a model
  • Generative AI makes data augmentation possible
  • Data scientists can then use this synthetic data along with real data for model training and testing
88
Q

Why is coding automation beneficial for a data scientist?

A

Generative AI can generate the software code needed to construct models, allowing the data scientist to focus on higher-level tasks

89
Q

What are some use cases of deep learning?

A
  • Speech recognition
  • Facial recognition
90
Q

Applications of Machine Learning

A
  • Market basket analysis
  • Predictive analytics
  • Recommendation engines (e.g. how YouTube recommends videos to you)
  • In FinTech, fraud detection systems are an application of machine learning
91
Q

What are decision trees?

A

A type of machine learning algorithm used for decision-making by creating a tree-like structure of decisions.
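
A minimal scikit-learn sketch with hypothetical data; `export_text` prints the learned tree of if/else decisions.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 0], [45, 1], [35, 1], [22, 0]]  # e.g. [age, owns_home]
y = ["reject", "approve", "approve", "reject"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "owns_home"]))
```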

92
Q

What is synthetic data?

A

Artificially generated data with properties similar to real data, used by data scientists to augment their datasets and improve model training.
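
A minimal NumPy sketch with hypothetical numbers: generating synthetic points whose mean and spread match a small "real" sample, so the augmented set can be used for training.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

real = rng.normal(loc=170, scale=10, size=50)  # stands in for real measurements
synthetic = rng.normal(real.mean(), real.std(), size=500)  # similar properties

augmented = np.concatenate([real, synthetic])  # real + synthetic for training
print(augmented.size, round(synthetic.mean(), 1))
```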

93
Q

What is market basket analysis?

A

Analysing which goods tend to be bought together is often used for marketing insights.

94
Q

What is Natural Language Processing (NLP)?

A

A field of AI that enables machines to understand, generate, and interact with human language, revolutionising content creation and chatbots.

95
Q

What is Bayesian Analysis?

A

A statistical technique that uses Bayes’ theorem to update probabilities based on new evidence.
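
A minimal Python sketch of that update, with hypothetical numbers: revising the probability of a disease after observing a positive test result.

```python
p_disease = 0.01             # prior probability
p_pos_given_disease = 0.95   # test sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Total probability of a positive test, then Bayes' theorem.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
posterior = p_pos_given_disease * p_disease / p_pos

print(round(posterior, 3))  # ~0.161: the evidence updates the prior
```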

96
Q

What are Artificial Neural Networks?

A

Collections of small computing units (neurons) that process data and learn to make decisions over time.

97
Q

What is Cluster Analysis?

A

The process of grouping similar data points together based on certain features or attributes.
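
A minimal scikit-learn sketch of that grouping, on hypothetical 2-D points.

```python
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # points near x=1 vs x=10 fall into separate clusters
```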

98
Q

What is Naive Bayes?

A

A simple probabilistic classification algorithm based on Bayes’ theorem.
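
A minimal scikit-learn sketch (hypothetical data) using the Gaussian variant of the algorithm.

```python
from sklearn.naive_bayes import GaussianNB

X = [[180, 80], [175, 70], [160, 55], [155, 50]]  # e.g. [height, weight]
y = ["adult", "adult", "teen", "teen"]

clf = GaussianNB().fit(X, y)
print(clf.predict([[172, 68]]))        # predicted class label
print(clf.predict_proba([[172, 68]]))  # probabilities behind the prediction
```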

99
Q

What does regression do?

A

Regression identifies the strength and amount of the correlation between one or more inputs and an output.
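
A minimal scikit-learn sketch (hypothetical data): the fitted coefficient gives the "amount" of the relationship and R² its "strength".

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])  # input, e.g. ad spend
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # output, e.g. sales

model = LinearRegression().fit(X, y)
print(model.coef_[0])     # the "amount": output change per unit of input
print(model.score(X, y))  # the "strength": R squared, close to 1 here
```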

100
Q

What is the true essence of digital transformation?

A

Digital transformation is not simply duplicating existing processes in digital form. An in-depth analysis of how the business operates helps organisations discover how to improve their processes and operations, and how to harness the benefits of integrating data science into their workflows.

101
Q

What can cloud computing also refer to?

A

Cloud computing can also refer to applications and data that users access on the Internet rather than locally.

102
Q

How should companies get started in Data Science?

A
  • They should start recording information and capturing data.
  • It is also very important that they measure any pre-existing data to try and understand it.
  • Companies should also build a team of data scientists.
103
Q

What should you do after you’ve started capturing data?

A

Archive your data; never overwrite old data. Data never gets old.

104
Q

What do all organisations use data science for?

A

To discover optimum solutions to existing problems.

105
Q

What is the ultimate purpose of analytics?

A

To communicate findings to the concerned stakeholders, who might use these insights to formulate policy or strategy.

106
Q

What should a data scientist do with their findings?

A
  • The data scientist should use the insights to build the narrative to communicate the findings.
  • In academia, the final deliverable is in the form of essays and reports. Such deliverables are usually 1,000 to 7,000 words in length.
  • In consulting and business, the final deliverable takes on several forms. It can be a small document of fewer than 1,500 words illustrated with tables and plots, or it could be a comprehensive document comprising several hundred pages.
107
Q

What mistake would result in a poor-quality document where the analytics and narrative struggle to blend?

A

Embarking on analytics without due consideration of the final deliverable.

108
Q

What do organisations use data science for?

A
  • Drive business goals
  • Improve efficiency
  • Make predictions
  • Save lives
109
Q

Why is data science used?

A

Data science allows organisations to discover optimal solutions and helps in establishing a clear understanding of the problem.

110
Q

What is the first step in helping an organisation solve its problems using data?

A

Measurement. The organisation needs to start by measuring their data.

111
Q

How can someone become a data scientist?

A

If you’re coming into a data science team, the first skill you need would be to know how to program.
You also need to know:
- algebra
- analytical geometry
- calculus
- basic probability and statistics
- databases

112
Q

As you go further up the data science field, what do you need to know?

A
  • Computer Science theory
  • Statistics
  • Probability
    The intersection of the three is very important in data science and is what is common amongst higher-end data scientists who possess PhDs.
113
Q

What is one of the most important skills that a data scientist should possess?

A

Curiosity (along with a sense of humour).

114
Q

What motivates someone going into a Data Science career?

A

They should enjoy:
- working with data
- coding
- mathematics and statistics
- telling stories

115
Q

Before thinking of the analysis in your report, what should you think of?

A

Think about the structure of the report.

116
Q

What does the structure of the report depend on?

A

The length of the document.

117
Q

What is the difference between a brief report and a detailed report?

A

A brief report is more to the point and presents a summary of key findings.
A detailed report incrementally builds the argument and contains details about other relevant works, research methodology, data sources, and intermediate findings along with the main results.

118
Q

What can cause the length of the report to be varied?

A

The purpose of the report.

119
Q

What should all reports have?

A
  • a cover page,
  • table of contents,
  • executive summary,
  • detailed contents,
  • acknowledgments,
  • references, and
  • appendices (if needed).
120
Q

What should a cover page have?

A

At a minimum, the cover page should include
- the title of the report,
- names of authors,
- their affiliations, and contacts,
- the name of the institutional publisher (if any), and
- the date of publication.