Book - Chapter 1: Intro to Big Data Analytics Flashcards
What are the Vs of big data
Volume. Velocity. Variety.
What is metadata
The minimum you should know about the data
What is paradata
How the data has been processed, and what artefacts the processing has left in the data
What is velocity
It is speed
What are the three attributes that stand out as defining big data characteristics
Huge volume of data
Complexity of data types and structures
Speed of new data creation and growth
What is huge volume of data
Rather than thousands of rows, big data can be billions of rows and millions of columns
What is complexity of data types and structures
It reflects the variety of new data sources, formats and structures, including digital traces being left on the web and in other digital repositories for subsequent analysis
What is speed of new data creation and growth
Big data can describe high-velocity data, with rapid data ingestion and near real-time analysis
What is big data sometimes described as having
The big three Vs
What are the big three Vs
Volume, variety and velocity
Can big data be efficiently analysed using only traditional databases or methods
No, it requires new tools and technologies to store, manage and realise the business benefits
What two main forms can big data come in
Structured and non-structured
What form does most big data take
Usually unstructured or semistructured in nature, which requires different techniques and tools to process and analyse
Where does 80 to 90% of future data growth come from
Non-structured data types
What sort of data could the RDBMS hold in addition to structured data
Quasi- or semistructured data, such as free-form call log information taken from an email ticket of the problem, or customer chat history
What are the four parts of big data characteristics: data structures
Bottom: unstructured
Third: quasi-structured
Second: semistructured
Top: structured
What is quasi-structured
Erratic structure, such as web clickstream data
What is semistructured
Structure definition is embedded in the data
What is structured
External definition of structure
What does structured data consist of
A defined data type, format, and structure (transaction data, online analytical processing (OLAP) data cubes, traditional RDBMS, CSV files and even simple spreadsheets such as Excel)
What does semistructured data consist of
Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language (XML) data files that are self-describing and defined by an XML schema)
Scripts
What does quasi-structured data consist of
Textual data with erratic data formats that can be formatted with effort, time, and tools (for instance, web clickstream data that may contain inconsistencies in data values and formats)
What does unstructured data consist of
Text documents, PDFs, images and video, i.e. data that has no inherent structure
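As a rough illustration of the difference between structured and semistructured data, the sketch below reads the same two order records once from CSV and once from XML. The sample records and the names `csv_text`, `xml_text`, `csv_total` and `xml_total` are made up for this example and do not come from the book.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: a fixed tabular layout; every row follows the same columns.
csv_text = "order_id,amount\n1,9.50\n2,20.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
csv_total = sum(float(row["amount"]) for row in rows)

# Semistructured: the tags embedded in the data describe its structure.
xml_text = (
    "<orders>"
    "<order id='1'><amount>9.50</amount></order>"
    "<order id='2'><amount>20.00</amount></order>"
    "</orders>"
)
root = ET.fromstring(xml_text)
xml_total = sum(float(order.findtext("amount")) for order in root.findall("order"))

print(csv_total, xml_total)
```

Both forms can be parsed reliably because their structure is discernible; unstructured data such as free text or images offers no such handle.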
How can a clickstream be used
It can be parsed and mined by data scientists to discover usage patterns and uncover relationships among clicks and areas of interest on a website or group of sites
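A minimal sketch of parsing and mining a clickstream, assuming the clicks arrive as a list of inconsistently formatted URLs. The sample URLs and the helper `section` are hypothetical; real clickstream logs would need far more cleaning.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical quasi-structured clickstream: same site, inconsistent formats.
clicks = [
    "http://example.com/products/shoes?ref=home",
    "HTTP://Example.com/products/shoes/",
    "example.com/products/hats",
    "http://example.com/cart",
]

def section(url: str) -> str:
    """Normalise one click to the first two path segments of the page visited."""
    if "://" not in url:
        url = "http://" + url  # add a scheme so urlparse can locate the path
    path = urlparse(url).path.strip("/").lower()
    return "/".join(path.split("/")[:2]) or "(home)"

# Count visits per site area to surface areas of interest.
counts = Counter(section(url) for url in clicks)
print(counts.most_common())
```

The normalisation step is where the "effort, time, and tools" go; once the clicks share one format, counting them is straightforward.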
How does big data describe data
It describes new kinds of data with which most organisations may not be used to working
Is database administration training required to create spreadsheets
No
What is an EDW
Enterprise data warehouse
What are enterprise data warehouses critical for
Reporting and BI tasks; they also solve many of the problems that proliferating spreadsheets introduce, such as which of multiple versions of a spreadsheet is correct
Despite the benefits of EDW and BI, what do these systems tend to restrict
The flexibility needed to perform robust or exploratory data analysis
With the EDW model who is the data managed and controlled by
IT groups and database administrators (DBAs); data analysts must depend on IT for access and changes to the data schemas
What new problems do EDW and BI introduce
Problems of flexibility and agility, which were less pronounced when dealing with spreadsheets
What is the solution to the flexibility and agility problems faced with EDW and BI
The analytic sandbox
What does the analytic sandbox attempt to resolve
The conflict for analysts and data scientists with EDW and more formally managed corporate data
How are analytic sandboxes purposely designed
To enable robust analytics while being centrally managed and secured
What are analytic sandboxes often referred to as
Workspaces, as they are designed to enable teams to explore many datasets in a controlled fashion and are not typically used for enterprise-level financial reporting and sales dashboards
What do Analytic sandboxes enable
High-performance computing using in-database processing
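The idea of in-database processing can be sketched as follows: the aggregation runs inside the database engine, and only the summary rows are returned to the analyst. SQLite, the `sales` table and its values are stand-ins chosen for this example; the book does not prescribe a particular database.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a sandbox database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 2.5)],
)

# The GROUP BY runs inside the database; only the summary rows are fetched.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(summary)
```

The alternative, fetching every row and summing in application code, moves far more data out of the database, which is the cost in-database processing is meant to avoid.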