What is Data Science? Flashcards
IBM Data Science Professional Certificate (Course 1/10)
What is data science?
The translation of data into a story, and the use of those stories to generate insights. It is with these insights that you are then able to develop strategies for companies, for example.
How does digital transformation affect business operations?
It affects them by updating existing processes and operations and creating new ones to harness the benefits of new technologies (e.g. harnessing the benefits of Big Data).
optical tracking
An example of how Big Data can trigger a digital transformation, not just within an organisation, but within an entire industry
Manchester City has embraced the use of Big Data to improve their game.
They have a team of data analysts who use millions of stats about players’ performance and the upcoming opposition to help the club’s chances of winning.
One of the tools they use is optical tracking, which can be used to pinpoint the position of players on the pitch 25 times a second, in relation to the ball, opposition, and teammates. This data, along with other ball-related data such as passes, shots, and turnovers, is analysed to gain insights into the team’s performance.
These insights can then be used to inform the team’s strategy in future games. For example, they might adjust their formation, change their passing strategy, or alter player positions based on the data.
It’s a great example of how Big Data can transform not just a single team, but the entire sport of football.
What is cloud computing?
The delivery of on-demand computing resources such as:
* Networks
* Servers
* Storage
* Applications
* Services
* Data centres
over the Internet on a pay-for-use basis.
What are some of the benefits of cloud computing?
- Users do not need to purchase and install software on their local systems; they can simply use the online version of the software and pay a monthly subscription.
- This makes everything more cost-effective as well as ensuring you always have access to the most up-to-date version of the software. Think of Microsoft 365, for example.
- Other benefits include saving the user local storage space and encouraging collaboration among colleagues and project teams, since the software is hosted online.
What is cloud computing composed of?
- 5 characteristics
- 3 service models
- 3 deployment models
Only Brave Rabbits Run Marathons
What are the five characteristics of cloud computing?
- On-demand self-service
- this means getting access to cloud resources such as processing power, storage, and network without requiring human interaction with each service provider
- Broad network access
- this means that cloud computing resources can be accessed via the network through standard mechanisms and platforms such as mobile phones, tablets, laptops, and workstations.
- Resource pooling
- this is what gives cloud providers economies of scale, which they pass on to their users, making the cloud cost-efficient
- using a multi-tenant model, computing resources are pooled to serve multiple customers, and cloud resources are dynamically assigned and reassigned according to demand without customers needing to know the physical location of these resources
- Rapid elasticity
- this implies that you can access more resources when you need them and scale things back when you don’t, because resources are elastically provisioned and released
- Measured service
- this implies that you only pay for what you use as you go; if you’re not using those resources, you’re not paying
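A toy sketch of the pay-for-use idea behind measured service; the resource names, rates, and usage figures below are made-up illustrative values, not real provider prices.

```python
# Toy illustration of "measured service": you pay only for the resources you
# actually use. Rates and usage numbers are hypothetical, not real prices.
hourly_rates = {"small_vm": 0.05, "storage_gb_month": 0.02}   # assumed rates
usage = {"small_vm_hours": 120, "storage_gb_months": 50}      # metered usage

bill = (usage["small_vm_hours"] * hourly_rates["small_vm"]
        + usage["storage_gb_months"] * hourly_rates["storage_gb_month"])

print(f"Monthly bill: ${bill:.2f}")   # 120 * 0.05 + 50 * 0.02 = $7.00
```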
What is cloud computing really about?
It is about using technology “as a service”, leveraging remote systems on-demand over the open Internet, scaling up and scaling back, and only paying for what you use.
What do cloud deployment models indicate?
They indicate where the infrastructure resides, who owns and manages it, and how cloud resources and services are made available to users.
What are the three types of cloud deployment models?
- Public
- this is when you leverage cloud services over the open internet on hardware owned by the cloud provider, but its usage is shared by other companies
- Private
- this means that the cloud infrastructure is provisioned for exclusive use by a single organisation
- it could run on-premises or it could be owned, managed, and operated by a service provider
- Hybrid
- this is when you use a mix of both the public and private deployment models.
What are the three cloud service models based on?
The three layers in a computing stack: infrastructure, platform, and application.
What are the three cloud service models?
- Infrastructure as a Service (IaaS)
- In this model, you can access the infrastructure and physical computing resources such as servers, networking, storage, and data centre space without the need to manage or operate them
- Platform as a Service (PaaS)
- you can access the platform that comprises the hardware and software tools that are usually needed to develop and deploy applications to users over the Internet.
- Software as a Service (SaaS)
- this is a software licensing and delivery model in which software and applications are centrally hosted and licensed on a subscription basis. It is sometimes referred to as “on-demand software.”
Why is the cloud such a positive for data science?
It allows a data scientist to bypass the physical limitations of their computer and the system they’re using.
What is Big Data?
Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools, and machines.
What does Big Data need in order to be effective?
It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered.
What does Big Data aim to do?
It aims to derive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.
What are the V’s of Big Data?
- Velocity
- This is the speed at which data is accumulated
- Volume
- This is the scale of the data, or the increase in the amount of data stored
- Variety
- This is the diversity of the data
- Veracity
- This is the quality and origin of data and its conformity to facts and accuracy
- Value
- This refers to our need and ability to turn data into value
What are the drivers of Big Data Volume?
- The increase in data sources
- Higher resolution sensors
- Scalable infrastructure
What is the difference between structured and unstructured data?
- Structured data fits neatly into rows and columns in relational databases.
- For example, employee details at a company.
- These employee details would include fields like job title, employee number, and age, which every employee at the company would have, with each field holding a consistent data type.
- Unstructured data is data that is not organised in a predefined way.
- For example, this could be tweets, blog posts, and videos (see the sketch below).
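As a minimal sketch of the contrast (using made-up employee records and an invented tweet), structured data drops straight into a table, while unstructured data has no predefined schema:

```python
# Made-up data contrasting structured and unstructured forms.
import pandas as pd

# Structured: fixed columns with consistent types, ready for a relational table.
employees = pd.DataFrame({
    "employee_number": [101, 102, 103],
    "job": ["Analyst", "Engineer", "Manager"],
    "age": [29, 34, 41],
})
print(employees.dtypes)   # each column has a single, well-defined type

# Unstructured: free-form text with no predefined schema; analysing it
# typically needs parsing or NLP rather than a simple SQL query.
tweet = "Great win tonight! #matchday"
print(len(tweet.split()), "tokens")
```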
What does variety reflect?
That data comes from different sources.
What are the drivers of variety?
- Mobile technologies
- Social media
- Wearable technologies
- Geo technologies
- Video
- Many more
CCIA
What are the attributes of veracity?
- Consistency
- Completeness
- Integrity
- Ambiguity
CN
What are the drivers of veracity?
- Cost
- Need for traceability
What is the main reason people take time to understand Big Data?
In order to derive value from it.
airport
An example of velocity in action
- Imagine a bustling airport during peak travel hours.
- Hundreds of flights are landing and taking off every minute.
- Each flight generates data: passenger lists, flight paths, fuel consumption, baggage handling, and more.
- The velocity of this data is immense, accumulating rapidly as planes taxi, ascend, and descend.
- Airlines must process this real-time data to optimise flight schedules, ensure safety, and enhance passenger experiences.
amazon
An example of volume in action
- Consider an e-commerce giant like Amazon.
- Millions of customers browse products, add items to their carts, and make purchases simultaneously.
- The sheer volume of data generated—product listings, customer profiles, transaction histories, reviews, and shipping details—is staggering.
- Amazon’s servers handle petabytes of data daily.
- To manage this volume, they employ distributed databases, data lakes, and scalable cloud infrastructure.
An example of variety
- Picture a social media platform like Instagram.
- Users share diverse content: photos, videos, stories, captions, hashtags, and geotags.
- Additionally, Instagram collects metadata (likes, comments, timestamps).
- This mix of structured (metadata) and unstructured (visual content) data creates variety.
- Instagram’s challenge lies in organising and analysing this eclectic data to personalise feeds, recommend content, and detect trends.
An example of veracity
- Think about a healthcare system that collects patient data from various sources: electronic health records, wearable devices, diagnostic images, and clinical notes.
- However, not all data is equally reliable.
- Some entries may contain errors, missing values, or inconsistencies.
- Veracity refers to the trustworthiness and accuracy of data.
- Healthcare institutions invest in data validation, quality checks, and anomaly detection to ensure reliable insights for patient care and research.
An example of value
- Think about financial trading.
- Traders analyse stock market data—stock prices, trading volumes, news sentiment, and economic indicators—to make informed decisions.
- The value lies in identifying patterns, predicting market movements, and executing profitable trades.
- However, not all data contributes equally to value.
- Extracting actionable insights requires sophisticated algorithms, machine learning models, and real-time analytics.
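As a toy illustration of pulling a simple "insight" out of price data (randomly generated numbers, not a real trading strategy), a moving-average crossover is about the simplest pattern-detection rule:

```python
# Toy sketch: derive a simple buy/sell signal from randomly generated prices.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
prices = pd.Series(100 + rng.normal(0, 1, 250).cumsum(), name="close")

short_ma = prices.rolling(window=10).mean()   # fast moving average
long_ma = prices.rolling(window=50).mean()    # slow moving average

# +1 ("long") when the fast average sits above the slow one, -1 otherwise.
signal = np.where(short_ma > long_ma, 1, -1)
print(pd.Series(signal).value_counts())
```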
What are data scientists today expected to do with Big Data?
They are expected to derive insights from Big Data and cope with the challenges that come with these massive datasets.
What are some challenges that Big Data has presented for data scientists?
The scale of the data means that it is not feasible to use conventional data analysis tools
How are the challenges associated with Big Data overcome?
Alternative tools such as Hadoop and Apache Spark are used.
How does data science differ from traditional subjects like statistics?
- Scope:
- It combines multi-disciplinary fields and computing to interpret data for decision-making.
- Applications:
- Data Science involves data cleaning, integration, visualisation, and statistical analysis of data sets to uncover patterns and trends.
- Decision Making:
- Data science uses scientific methods to discover and understand patterns, performance, and trends, often comparing numerous models to produce the best outcome.
- Meanwhile, statistics focuses on using mathematical analysis with quantified models to represent a given data set.
What do Big Data Processing Tools help you do?
These processing technologies provide ways to work with structured, semi-structured, and unstructured data so that value can be derived from big data
What are the most common open-source Big Data computing tools?
- Apache Hadoop
- Apache Hive
- Apache Spark
What is Apache Hadoop?
Apache Hadoop is a collection of tools that provide distributed storage and processing of big data
What is Apache Hive?
Apache Hive is a data warehouse for data query and analysis
What is Apache Spark?
Apache Spark is a distributed analytics framework for complex, real-time data analytics
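A minimal PySpark sketch of what "distributed analytics" looks like in practice; it assumes a local Spark installation, and the file name and column names (`events.csv`, `event_type`, `player`) are hypothetical placeholders.

```python
# Minimal PySpark sketch: filter and aggregate a dataset in parallel.
# The CSV file and its columns are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

events = spark.read.csv("events.csv", header=True, inferSchema=True)

passes_per_player = (
    events.filter(F.col("event_type") == "pass")   # keep only pass events
          .groupBy("player")                        # aggregation is distributed
          .count()
)
passes_per_player.show()

spark.stop()
```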
What does a Hadoop framework allow?
It allows distributed storage and processing of large datasets across clusters of computers.
What happens in a Hadoop distributed system?
In a Hadoop distributed system:
- a node is a single computer, and a collection of nodes forms a cluster
- Hadoop can scale up from a single node, to any number of nodes each providing local storage and computation
- Hadoop provides a reliable, scalable, and cost-effective solution for storing data with no format requirements (see the word-count sketch below)
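The classic way to see the distributed processing side is a word count written as a Hadoop Streaming mapper and reducer; this is a minimal sketch (the file names `mapper.py` and `reducer.py` are just conventional), with Hadoop distributing the input, sorting mapper output by key, and feeding it to the reducers.

```python
# mapper.py -- minimal Hadoop Streaming mapper sketch.
# Reads raw text from stdin and emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- minimal Hadoop Streaming reducer sketch.
# Hadoop sorts mapper output by key, so all counts for a word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Run locally, `cat input.txt | python mapper.py | sort | python reducer.py` mimics what Hadoop does in parallel across many nodes.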
What are the benefits that come with using Hadoop?
- Better real-time data-driven decisions
- incorporates emerging data formats not typically used in data warehouses
- Improved data access and analysis
- provides real-time self-service access to stakeholders
- Data offload and consolidation
- optimises and streamlines costs by consolidating data, including cold data, across the organisation
What is one of the four main components of Hadoop?
Hadoop Distributed File System (HDFS).
HDFS is a storage system for big data that runs on multiple commodity machines connected through a network.
What does HDFS do?
HDFS:
- provides scalable and reliable big data storage by partitioning files over multiple nodes
- splits files over multiple computers, allowing parallel access to them
- replicates file blocks on different nodes to prevent data loss
How is HDFS different from other distributed file systems?
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
An example of how HDFS is different from other distributed file systems
Consider an example where we have a file that contains the phone numbers of everyone in South Africa.
If we were to store this file on a single machine, there would be several challenges:
1. Storage: The file could be larger than the storage capacity of a single machine.
2. Processing Speed: Processing this file (e.g., searching for a specific number) could take a long time because only one machine’s resources (CPU, memory) are being used.
3. Fault Tolerance: If the machine fails, we lose access to the file.
Now, let’s see how HDFS addresses these challenges:
1. Distributed Storage: In HDFS, data is split into blocks (default size is 128MB in Hadoop 2.x), and these blocks are distributed across multiple nodes in the cluster. So, our phone directory file would be split into many blocks, and these blocks would be stored on different machines. This allows HDFS to store a file that is larger than the storage capacity of a single machine.
2. Parallel Processing: Each block of the file is stored on a separate machine, and processing can happen on all machines simultaneously. This means that if we want to search for a phone number, the search operation can be carried out on all machines at the same time, significantly speeding up the process.
3. Fault Tolerance: HDFS is designed to continue operating without a noticeable interruption to the user, even when a machine fails. This is achieved by replicating each block of data across multiple machines. So, if one machine fails, the same block can be found on another machine.
In conclusion, HDFS provides a scalable, fault-tolerant, distributed storage system that works closely with distributed processing frameworks like MapReduce. This makes it an excellent choice for storing and processing big data.
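As a back-of-the-envelope sketch of the block arithmetic described above (the 1 GB file size and replication factor of 3 are illustrative assumptions; 128 MB is the default block size already mentioned):

```python
# Rough arithmetic for how HDFS would split and replicate a file.
# File size and replication factor are illustrative assumptions.
import math

file_size_mb = 1024        # hypothetical 1 GB phone-directory file
block_size_mb = 128        # default HDFS block size (Hadoop 2.x)
replication_factor = 3     # each block is copied to 3 different nodes

num_blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication_factor

print(f"{num_blocks} blocks of up to {block_size_mb} MB")   # 8 blocks
print(f"~{raw_storage_mb} MB of raw cluster storage used")  # 1024 * 3 = 3072 MB
```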
What are some other benefits that come with using HDFS?
- Fast recovery from hardware failures
- HDFS is built to detect faults and automatically recover
- Access to streaming data
- HDFS supports high data throughput rates
- Accommodation of large datasets
- HDFS can scale to hundreds of nodes, or computers, in a single cluster
- Portability
- HDFS is portable across multiple hardware platforms and compatible with a variety of underlying operating systems
What is Hadoop intended for?
Long, sequential scans
What is Apache Hive?
Hive is an open-source data warehouse software for reading, writing, and managing large data set files that are stored directly in either HDFS or other storage systems such as Apache HBase
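A minimal sketch of what querying a Hive-managed table can look like, here through Spark's Hive integration; it assumes a Spark build with Hive support and a hypothetical `phone_directory` table already registered in the Hive metastore.

```python
# Minimal sketch: query a Hive table via Spark's Hive support.
# The table name and columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-demo")
    .enableHiveSupport()     # lets Spark read tables in the Hive metastore
    .getOrCreate()
)

result = spark.sql(
    "SELECT city, COUNT(*) AS people FROM phone_directory GROUP BY city"
)
result.show()

spark.stop()
```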