U1T3.2 - Applications of DT Flashcards
Data Mining & Cloud Computing
What is data mining?
Process of analysing large data sets (big data) with a view to discovering patterns + trends that go beyond simple analysis. Combines AI, stats + database systems in analysis of groups of (un)structured data sets which are difficult to analyse using traditional methods. Extracts info from data set + transforms it into appropriate format for use (summary of input data for analysis). Stops at process of pattern extraction.
What is big data?
Data sets so complex that traditional databases + other processing applications can’t capture, curate, manage + process them in acceptable time frame.
What does curate mean?
Process of organising data from range of data sources.
What are the 3 big data challenges?
Volume (amount of data to be processed), variety (num of types of data to be analysed) + velocity (speed of data processing)
How does DT allow us to collect data for analysis?
Online forms, mobile phone data transmissions, email data, stock market data, market research, PDAs, smartphones, tablets + netbooks etc.
How can data sources be categorised and what are the differences?
Internal + external. Internal = customer details, product details, sales data. External = business partners, data suppliers, internet, govt + market research companies.
What are the most commonly used data sources?
Social media, machine data (generated from devices like RFID chip readers, GPS results) + transactional data (data from companies like eBay, Amazon, Tesco)
What are the key requirements of big data storage?
Handle large amounts of data + keep scaling up to handle growth of data sets. High-speed input/output operations to support delivery of data analytics as they're carried out. Big data practitioners often run hyperscale computing environments.
What are hyperscale computing environments?
Consist of many servers with DAS; each unit has PCIe flash storage devices to support data storage + high-speed access to data sets.
What is DAS?
Direct Attached Storage.
How can smaller organisations support the storage of big data?
Use of NAS devices, which can scale outward, though this can make them difficult to manage as they span out in a hierarchical manner (many devices, many folders within folders).
What is NAS?
Network Attached Storage. File access shared storage, easily scaled out to meet increased capacity/computing requirements for big data analysis.
What are object-based storage systems?
Alt to NAS devices + their issues. Each stored file given a unique identifier + index to support high-speed access to a particular data file/set.
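The flat, ID-indexed layout above can be sketched in code. This is a minimal in-memory illustration, not a real object storage system; the `ObjectStore` class and use of `uuid4` identifiers are assumptions for the example.

```python
# Sketch of object-based storage: a flat store keyed by unique IDs
# instead of nested folders (contrast with NAS folder hierarchies).
import uuid

class ObjectStore:
    def __init__(self):
        self._objects = {}  # flat index: id -> data, no hierarchy to walk

    def put(self, data):
        object_id = str(uuid.uuid4())  # unique identifier per object
        self._objects[object_id] = data
        return object_id

    def get(self, object_id):
        return self._objects[object_id]  # single index lookup

store = ObjectStore()
oid = store.put(b"sales-2024 data set")
print(store.get(oid))
```

Because every object is reached by one index lookup rather than a folder-by-folder traversal, access stays fast as the store grows.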
What do big data processing techniques do?
Analyse data sets at terabyte/petabyte scale. Some methods include cluster analysis, classification, anomaly detection, association rule mining + sequential pattern mining, regression + summarisation.
What is cluster analysis?
Groups of data records identified.
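A minimal sketch of how records might be grouped, using a simple 1-D k-means loop in pure Python. The data (customer ages) and choice of k are invented for illustration.

```python
# Sketch of cluster analysis: group numeric records around k centroids.

def kmeans_1d(points, k, iterations=10):
    """Group 1-D records into k clusters by nearest centroid."""
    centroids = points[:k]  # naive initialisation: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# e.g. customer ages falling into two natural groups
ages = [18, 21, 19, 45, 48, 50, 22, 47]
print(kmeans_1d(ages, 2))
```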
What is classification?
Data mining process used to determine the appropriate structure/category for new data, e.g. the way an email application classifies some emails as spam.
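The spam example can be sketched as a tiny keyword-based classifier. The keyword list and sample emails are invented; a real filter would learn its features from labelled training data.

```python
# Sketch of classification: assign new data (an email) to a category
# based on features seen in previously labelled examples.

SPAM_WORDS = {"winner", "prize", "free", "claim"}  # assumed learned keywords

def classify(email_text):
    """Label an email 'spam' if it contains enough spam keywords."""
    words = set(email_text.lower().split())
    hits = len(words & SPAM_WORDS)
    return "spam" if hits >= 2 else "ham"

print(classify("Claim your free prize now"))
print(classify("Meeting moved to 3pm"))
```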
What is anomaly detection?
Unusual records identified. Some anomalies merit investigation as points of interest to the organisation; others may be representative of errors.
What is association rule mining + sequential pattern mining?
Dependencies between data items identified, e.g. use of data sets by a supermarket to determine which products are frequently bought together.
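The supermarket example can be sketched by counting how often product pairs appear in the same basket. The baskets and the minimum-support threshold are invented for illustration; real association rule mining (e.g. the Apriori algorithm) extends this idea to larger item sets.

```python
# Sketch of association rule mining: find pairs of products that
# appear together in at least `min_support` baskets.
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support=2):
    counts = Counter()
    for basket in baskets:
        # count each unordered pair of distinct items once per basket
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]
print(frequent_pairs(baskets))
```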