01. Introduction to Big Data Analytics Flashcards
Name the three main attributes which define big data characteristics
Huge volumes of data
Complexity of data types and structures
Speed of new data creation and growth
What is the definition of big data
Big data is data whose scale, distribution, diversity, and/or timeliness requires the use of new technical architectures and analytics to enable insights that unlock new sources of business value
What is driving the data deluge
Mobile sensors
Social media
Video surveillance
Video rendering
Smart grids
Geophysical exploration
Medical imaging
Gene sequencing
What is structured data
Data containing a defined data type, format, and structure (for example, transactional data, online analytical processing [OLAP] data cubes, traditional RDBMS tables, CSV files, and even simple spreadsheets).
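As a small illustration of structured data, the sketch below reads a CSV snippet where every row has the same named columns. The column names and values are made up for the example and do not come from any real data set.

```python
import csv
import io

# Hypothetical CSV content: each row follows the same fixed schema.
csv_text = "order_id,customer,total\n1,Ana,19.99\n2,Ben,5.00\n"

# DictReader maps each row to the column names in the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Because the structure is known in advance, fields can be used directly.
total_revenue = sum(float(row["total"]) for row in rows)
customers = [row["customer"] for row in rows]

print(customers, round(total_revenue, 2))
```

The defined schema is what makes the aggregation a one-liner; no pattern matching or clean-up is needed first.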
What is semi-structured data
Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
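A minimal sketch of why self-describing XML is easy to parse: the tags name each field, so a standard parser can pull values out without guessing. The element and attribute names here are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the tags describe the data they contain.
xml_text = """
<orders>
  <order id="1"><customer>Ana</customer><total>19.99</total></order>
  <order id="2"><customer>Ben</customer><total>5.00</total></order>
</orders>
""".strip()

root = ET.fromstring(xml_text)

# Walk the discernible pattern: each <order> holds a customer and a total.
orders = [
    {
        "id": int(order.get("id")),
        "customer": order.findtext("customer"),
        "total": float(order.findtext("total")),
    }
    for order in root.findall("order")
]

print(orders)
```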
What is quasi-structured data
Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance, web clickstream data that may contain inconsistencies in data values and formats).
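The "effort, tools, and time" can be sketched with a few clickstream-like lines whose formats disagree with each other. The log lines, the `extract_query` helper, and the `q` parameter name are all assumptions made for this example, not a real log format.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical clickstream lines with inconsistent layouts.
raw_lines = [
    "2024-01-05 10:02:11 GET https://example.com/search?q=big+data",
    "05/01/2024 10:02 https://example.com/search?q=Hadoop&page=2",
    "garbage line with no url",
]

# The URL is the only part that appears reliably, so search for it.
URL_PATTERN = re.compile(r"https?://\S+")

def extract_query(line):
    """Return the search term from a line, or None if no usable URL is found."""
    match = URL_PATTERN.search(line)
    if match is None:
        return None
    params = parse_qs(urlparse(match.group()).query)
    values = params.get("q")
    return values[0] if values else None

queries = [extract_query(line) for line in raw_lines]
print(queries)
```

Lines that do not fit the pattern yield `None` rather than raising an error, which reflects the tolerance that quasi-structured data usually needs.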
What is unstructured data
Data that has no inherent structure, which may include text documents, PDFs, images, and video.
What are the four Vs of big data
Volume (amount)
Velocity (speed)
Variety (types of data)
Veracity (accuracy)
What is a data repository
A data repository is a general term used to refer to a destination designated for data storage
Name the five main skill sets of a Data Scientist
Quantitative Skill - maths or stats
Technical Aptitude - software, programming and machine learning
Sceptical/Critical thinking - examine their own work
Curious/Creative - passionate about solving problems
Communicative and collaborative - work with the business
What is the difference between BI and data science
BI presents insight on the past, typically by way of dashboards and reports
Data Science is trying to predict the future and is capable of dealing with a wider variety of data types
What are the characteristics of supervised machine learning
It uses labelled historical data (a training data set) to build a model, which is then used to predict outcomes for new, unseen data.
What should you use supervised machine learning to do
To predict a continuous variable (regression) or to assign a category (classification)
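The two uses of supervised learning can be sketched in plain Python. This is a toy illustration under stated assumptions: the training data is invented, `fit_line` is ordinary least squares for a single input, and `classify` is a simple 1-nearest-neighbour rule. Neither is presented as the method any particular course or library uses.

```python
# Hypothetical training data: hours studied and the known (labelled) outcomes.
hours = [1.0, 2.0, 3.0, 4.0]
scores = [52.0, 54.0, 56.0, 58.0]

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Regression: learn from past examples, then predict a continuous value.
slope, intercept = fit_line(hours, scores)
predicted_score = slope * 5.0 + intercept

# Classification: labelled examples, predict the label of the closest one.
labelled = [(1.0, "fail"), (2.0, "fail"), (6.0, "pass"), (7.0, "pass")]

def classify(x):
    """Return the label of the nearest training example (1-nearest-neighbour)."""
    nearest = min(labelled, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

print(predicted_score, classify(5.5))
```

In both cases the model is built only from examples whose correct answer is already known, which is what distinguishes supervised learning.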
What is machine learning
Machine learning is an application of artificial intelligence that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed
What is unsupervised machine learning
Unsupervised learning algorithms are used when the information used to train is neither classified nor labelled. The system doesn’t figure out the right output, but it explores the data and can draw inferences from data sets to describe hidden structures from unlabelled data.
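One common way to find hidden structure in unlabelled data is clustering. The sketch below is a minimal one-dimensional k-means, assuming two clusters and hand-picked starting centroids; the `k_means_1d` function and the sample values are illustrative only.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group values around centroids by repeated assign-and-update steps."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabelled data: no category is given for any value.
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(values, centroids=[0.0, 5.0])

print(centroids, clusters)
```

No labels are supplied at any point; the two groups emerge from the distances between the values alone.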