U1T3.2 - Applications of DT Flashcards
Data Mining & Cloud Computing
What is data mining?
Process of analysing large data sets (big data) with a view to discovering patterns + trends that go beyond simple analysis. Combines AI, stats + database systems to analyse groups of (un)structured data sets which are difficult to analyse using traditional methods. Extracts info from data set + transforms it into appropriate format for use. (Summary of input data for analysis) Stops at process of pattern extraction.
What is big data?
Data sets so complex that traditional databases + other processing applications can’t capture, curate, manage + process them in acceptable time frame.
What does curate mean?
Process of organising data from range of data sources.
What are the 3 big data challenges?
Volume (amount of data to be processed), variety (num of types of data to be analysed) + velocity (speed of data processing)
How does DT allow us to collect data for analysis?
Online forms, mobile phone data transmissions, email data, stock market data, market research, PDAs, smartphones, tablets + netbooks etc.
How can data sources be categorised and what are the differences?
Internal + external. Internal = customer details, product details, sales data. External = business partners, data suppliers, internet, govt + market research companies.
What are the most commonly used data sources?
Social media, machine data (generated from devices like RFID chip readers, GPS results) + transactional data (data from companies like eBay, Amazon, Tesco)
What are the key requirements of big data storage?
Handle large amounts of data + keep scaling up to handle growth of data sets. High speed input/output operations to support delivery of data analytics as they’re carried out. Big data practitioners run hyperscale computing environments.
What are hyperscale computing environments?
Consists of many servers with DAS, each unit has PCIe flash storage devices to support data storage + high speed access to data sets.
What is DAS?
Direct Attached Storage.
How can smaller organisations support the storage of big data?
Use of NAS devices; can scale outward but can become difficult to manage as they span out in hierarchical manner (many devices, many folders within folders)
What is NAS?
Network Attached Storage. File access shared storage, easily scaled out to meet increased capacity/computing requirements for big data analysis.
What are object-based storage systems?
Alt to NAS devices + their issues. Each stored file given unique identifier + index to support high speed access to particular data file/set.
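The identifier-plus-index idea can be sketched in a few lines. This is a hypothetical toy model (the `ObjectStore` class and its data are invented), showing only the principle: a flat index maps unique identifiers straight to objects, with no folder hierarchy to traverse.

```python
import uuid

# Toy sketch of object-based storage: each stored object gets a
# unique identifier, and a flat index maps identifiers to objects
# for direct lookup (no nested folders).
class ObjectStore:
    def __init__(self):
        self.index = {}  # identifier -> object data

    def put(self, data):
        object_id = str(uuid.uuid4())  # unique identifier per object
        self.index[object_id] = data
        return object_id

    def get(self, object_id):
        return self.index[object_id]

store = ObjectStore()
oid = store.put(b"sensor readings 2024-01")
print(store.get(oid) == b"sensor readings 2024-01")  # True
```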
What do big data processing techniques do?
Analyse data sets at terabyte/petabyte scale. Some methods include cluster analysis, classification, anomaly detection, association rule mining + sequential pattern mining, regression + summarisation.
What is cluster analysis?
Groups of data records identified.
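A minimal sketch of the idea, using a few iterations of 1-D k-means with two invented clusters of values (the data and starting centres are made up for illustration):

```python
# Toy cluster analysis: group 1-D values around two centres using a
# few k-means iterations. Values and initial centres are invented.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centres = [0.0, 10.0]

for _ in range(5):
    clusters = [[], []]
    for v in values:
        # assign each value to its nearest centre
        nearest = min(range(2), key=lambda i: abs(v - centres[i]))
        clusters[nearest].append(v)
    # move each centre to the mean of its cluster
    centres = [sum(c) / len(c) for c in clusters]

print(sorted(clusters[0]), sorted(clusters[1]))
# [0.8, 1.0, 1.2] [8.7, 9.0, 9.5]
```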
What is classification?
Data mining process used to assign new data to an appropriate structure/class. e.g. way email application classifies some emails as spam.
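The spam example can be sketched with a deliberately naive keyword classifier. The keyword set and threshold are invented for illustration; real filters use learned models, not fixed word lists.

```python
import re

# Hedged sketch of classification: label an email "spam" if it
# contains enough "spammy" keywords. Keywords + threshold invented.
SPAM_WORDS = {"winner", "prize", "free", "urgent"}

def classify(email_text):
    words = set(re.findall(r"[a-z]+", email_text.lower()))
    return "spam" if len(words & SPAM_WORDS) >= 2 else "not spam"

print(classify("URGENT: you are a WINNER, claim your prize"))  # spam
print(classify("Minutes from Monday's meeting attached"))      # not spam
```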
What is anomaly detection?
Unusual records identified. Some anomalies merit investigation as points of interest to organisation or may be representative of errors.
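One simple anomaly-detection sketch flags values far from the mean. The transaction amounts are invented, and the 2-standard-deviation cutoff is just one common rule of thumb:

```python
import statistics

# Sketch of anomaly detection: flag values more than 2 standard
# deviations from the mean. Sample amounts are invented.
amounts = [20, 22, 19, 21, 23, 20, 250]

mean = statistics.mean(amounts)
sd = statistics.stdev(amounts)
anomalies = [a for a in amounts if abs(a - mean) > 2 * sd]
print(anomalies)  # [250]
```

The flagged record might be an error, or a genuine point of interest (e.g. a fraudulent transaction) worth investigating.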
What is association rule mining + sequential pattern mining?
Dependencies between data items identified. e.g. use of data sets by supermarket to determine which products are bought together.
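The supermarket example reduces to counting which product pairs co-occur across baskets. The baskets below are invented; real association mining (e.g. the Apriori algorithm) scales this idea to much larger itemsets:

```python
from collections import Counter
from itertools import combinations

# Sketch of association mining: count which pairs of products appear
# together across supermarket baskets. Baskets are invented.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "tea"},
    {"bread", "butter", "tea"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # bread + butter co-occur most often
```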
What is regression?
Relationships between data variables investigated to help see how change in independent variable impacts on dependent data variable.
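A least-squares linear fit makes the independent/dependent relationship concrete. The (x, y) points are invented and lie exactly on y = 2x + 1, so the fitted slope says each unit change in x changes y by 2:

```python
# Sketch of regression: fit y = slope*x + intercept by least squares.
# Data points are invented (they lie exactly on y = 2x + 1).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
print(slope, intercept)  # 2.0 1.0
```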
What is summarisation?
Data summarised in visual format.
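In the simplest case, a visual summary can be as basic as a text bar chart of category totals (the sales figures here are invented):

```python
# Sketch of summarisation: render category totals as a text bar
# chart. Figures are invented.
sales = {"Mon": 4, "Tue": 7, "Wed": 3}
lines = [f"{day} {'#' * n}" for day, n in sales.items()]
print("\n".join(lines))
# Mon ####
# Tue #######
# Wed ###
```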
What are some of the key objectives of collecting and using big data by the financial services sector?
Ensure they comply with regulations (using fuzzy matching to check customer names + aliases against customer blacklist, at lower cost), improve risk analysis (algorithms run on transaction data to identify fraudulent activity/perform risk analysis, support trading decisions), understand customer behaviour/transaction patterns + improve services (identify what leads to dissatisfaction)
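The fuzzy-matching compliance check can be sketched with the standard library's `difflib`. The blacklist names and the 0.8 similarity cutoff are invented for illustration:

```python
import difflib

# Sketch of fuzzy matching for compliance screening: check a customer
# name against a blacklist, tolerating spelling variations.
# Blacklist entries and cutoff are invented.
blacklist = ["john smith", "maria garcia", "wei chen"]

def screen(name):
    # returns up to one close match, or an empty list if none
    return difflib.get_close_matches(name.lower(), blacklist,
                                     n=1, cutoff=0.8)

print(screen("Jon Smith"))    # ['john smith']
print(screen("Alice Brown"))  # []
```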
How does the health sector use big data?
Predict epidemics, cure disease, improve life quality + avoid preventable deaths. Smartphones measure steps, diet + sleep patterns, which in future could be shared with GP to help diagnosis. Supports clinical trials by selecting best subjects. Phone location data can track population movement and predict spread of Ebola virus.
How does the retail sector use big data?
Predict trends + forecast demand, price optimisation (spending habits + demand) + identify potential customers (data collected through transactional records + loyalty programs allows demand to be forecast on basis of geographical areas)
What is cloud computing?
Use of internet by large computing companies to provide services normally provided by LAN. Use server farms to host services they provide for other organisations who can access these services from any computer w/ internet connection. Users don’t know where data stored. Virtual servers form foundation of cloud servers. Capitalises on principle of virtual clusters.