Intro To Data Science Flashcards
What is Data Science?
The act of analysing structured and unstructured big data to drive social and commercial insight.
What is meant by “Big Data”?
Large-scale collection and management of STRUCTURED and UNSTRUCTURED data.
Generally speaking, what does the Data Scientist do?
The Data Scientist exposes insight from RAW DATA, and then builds operational systems to utilise that value, in order to provide social or commercial INSIGHT.
When did the term Data Science arise, and when and why did it become a common term?
1974 - Peter Naur
2008 - The job title "Data Scientist" is coined by team leads at LinkedIn and Facebook, followed by mass expansion of world data
What are some key challenges facing Data Science implementation?
Tools - delivery of end-user tools, incl. self-service reporting/analysis
Compatibility - reporting/analysis across multiple systems
Knowledge - unlocking lost data within disparate databases/systems
Diversity - catering for all users’ needs - desktop, mobile, tablet, etc.
Briefly describe the core components of Hadoop.
Hadoop Common is the module that consists of all the basic utilities and libraries that are required by the other modules.
HDFS is the “secret sauce” that enables Hadoop to store huge files. It’s a scalable file system that distributes and stores data across all machines in a Hadoop cluster.
MapReduce is the system used to efficiently process the large amounts of data Hadoop stores in HDFS.
YARN is a resource manager that determines which MapReduce jobs run and when. It allows multiple MapReduce jobs to run simultaneously.
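Worked example: the MapReduce idea can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce steps running in a single process; it does not use Hadoop, HDFS or YARN, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this card.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big insight", "data drives insight"])
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on many machines over blocks of a file held in HDFS; here everything runs locally so the data flow is easy to follow.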
Describe 3 future trends of Data Science?
Cloud Deployment: Services held on the cloud => access, maintenance, environment taken care of
Big Insights: Big data => big insights; combining internal and external data builds the big picture
Context Driven Visualisations: show the data for the end users, answering the business question
Geospatial Augmentation: car windscreen example
Augmented Intelligence: computer + human = unbeatable
Elastic Open Environments: easily modified environments, links with cloud deployment
Freemium Community: many programs that can do the job are becoming free
Which challenge(s) to Data Science has Cascading broken and how?
Cascading provides a higher level of abstraction for Hadoop, allowing developers to create complex jobs quickly, easily and in several different languages. In effect, this has shattered the knowledge barrier, enabling Twitter, for example, to use Hadoop more broadly.
What is meant by cloud computing?
A distributed computing system over a network used for storing data off-premises
What is a Data Visualisation?
A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively