Full Study Flashcards

Question 1

Q

What is Data Analytics?

Answer

A

The process of extracting useful insights from raw data.

Question 2

Q

What is Data Analysis?

Answer

A

Data analysis refers to the process of compiling and analysing data to support decision making.

Question 3

Q

How do Data Analytics and Data Analysis compare?

Answer

A

Data analytics is the broader term, and data analysis is one of its subcomponents.
Data analytics also includes the tools and techniques used to carry out the analysis.

Question 4

Q

What is Big Data?

Answer

A

Data characterised by volume, velocity, variety, and veracity, often coming from multiple sources.
Unlocking the value of Big Data allows businesses to better sense and respond to their environment, which is key to creating competitive advantage in a complex and rapidly changing market. Governments are also taking notice of the Big Data phenomenon.
Traditional processing and analysis of structured data using RDBMS and data warehousing no longer meet the challenges of Big Data.
Data is created constantly, and at an ever-increasing rate. Mobile phones, social media, and medical imaging technologies all create new data, which must be stored somewhere for some purpose.

Question 5

Q

What are the Technology Trends in Big Data?

Answer

A

open-source software,
commodity servers,
massively parallel-distributed processing platforms.

Question 6

Q

What are the challenges of Big Data?

Answer

A

Data at Rest – Terabytes to exabytes of existing data to process.
Data in Motion – Streaming data, requiring seconds to respond.
Data in Many Forms – Structured, Unstructured, Text, Multimedia.
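
As a rough illustration of "data in motion", the sketch below handles readings one at a time and keeps only a small sliding window in memory, instead of storing everything before analysing it. The sensor values, the window size, and the function name rolling_mean are made-up assumptions for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_mean(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the last `window` readings as each reading arrives."""
    recent = deque(maxlen=window)  # older readings are dropped automatically
    for reading in stream:
        recent.append(reading)
        yield sum(recent) / len(recent)


# Hypothetical sensor readings arriving one by one.
readings = [10.0, 12.0, 14.0, 40.0, 16.0]
means = list(rolling_mean(readings, window=3))
print(means)
```

Because the function is a generator, a response is available after every reading, which is the property streaming workloads tend to need.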

Question 7

Q

What are the Characteristics of Big Data?

Answer

A

Volume: Big Data involves a huge volume of data.
Variety: Big Data reflects the variety of new data sources, formats, and structures.
Velocity: Big Data can describe high-velocity data, with rapid data ingestion and near real-time analysis, driven by the speed of new data creation and growth.

Question 8

Q

Where does Big Data get the information?

Answer

A

Mobile Devices
Social Media and Networks
Scientific Instruments
Sensor Technology and Networks

Question 9

Q

What are the Drivers of Big Data?

Answer

A

Medical information, such as genomic sequencing and diagnostic imaging

Photos and video footage uploaded to the World Wide Web.
Video surveillance, such as the thousands of video cameras spread across a city.

Mobile devices, which provide geospatial location data of the users, as well as metadata about text messages, phone calls, and application usage on smart phones.

Smart devices, which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures.

Non-traditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS navigation systems, and seismic processing

Question 10

Q

What is Data?

Answer

A

The amount and rate of data produced in any particular discipline now exceed our ability to effectively treat and analyze them.

Question 11

Q

What are some Data Sources?

Answer

A

Digital instruments, high-resolution cameras, medical scanners, simulations, transactional data, social media

Question 12

Q

What are the main players in the Big Data ecosystem?

Answer

A

Data devices and the “Sensor Net” gather data from multiple locations and continuously generate new data about this data.

Data collectors include sample entities that collect data from the device and users.

Data aggregators make sense of the data collected from the various entities from the “Sensor Net” or the “Internet of Things”. These organizations compile data from the devices and usage patterns collected by government agencies, retail stores, and websites. In turn, they can choose to transform and package the data as products to sell to list brokers, who may want to generate marketing lists of people who may be good targets for specific ad campaigns.

Data users and buyers directly benefit from the data collected and aggregated by others within the data value chain.

Question 13

Q

What are the Key Roles in the Big Data ecosystem?

Answer

A

Deep Analytical Talent - technically savvy, with strong analytical skills. This group has advanced training in quantitative disciplines, such as mathematics, statistics, and machine learning. To do their jobs, members need access to a robust analytic sandbox or workspace where they can perform large-scale analytical data experiments.

Data Savvy Professionals - Have less technical depth but have a basic knowledge of statistics or machine learning and can define key questions that can be answered using advanced analytics. These people tend to have a base knowledge of working with data, or an appreciation for some of the work being performed by data scientists and others with deep analytical talent.

Technology and Data Enablers - People who provide technical expertise to support analytical projects, such as provisioning and administering analytical sandboxes and managing large-scale data architectures that enable widespread analytics within companies and other organizations. This role requires skills related to computer engineering, programming, and database administration.

Question 14

Q

What are the activities performed by Data Scientists?

Answer

A

Reframe business challenges as analytics challenges - Diagnose business problems, consider the core of a given problem, and determine which kinds of candidate analytical methods can be applied to solve it.

Design, implement, and deploy statistical models and data mining techniques on Big Data - Applying complex or advanced analytical methods to a variety of business problems using data.

Develop insights that lead to actionable recommendations - Draw insights out of the data and communicate them effectively.

Question 15

Q

What are the skills and behavioral characteristics of a data scientist?

Answer

A

Quantitative skill: such as mathematics or statistics

Technical aptitude: namely, software engineering, machine learning, and programming skills

Skeptical mind-set and critical thinking: It’s important that data scientists can examine their work critically rather than in a one-sided way.

Curious and creative: Data scientists are passionate about data and finding creative ways to solve problems and portray information.

Communicative and collaborative: Data scientists must be able to articulate the business value in a clear way and collaboratively work with other groups, including project sponsors and key stakeholders.

Question 16

Q

What is the profile of a data scientist?

Answer

A

1) Quantitative
2) Curious and Creative
3) Skeptical
4) Technical
5) Communicative and Collaborative

Question 17

Q

What are the types of Data Structures?

Answer

A

Structured data: Data containing a defined data type, format, and structure.

Semi-structured data: Textual data files with a discernible pattern that enables parsing.

Quasi-structured data: Textual data with erratic data formats that can be formatted with effort and tools.

Unstructured Data: Data that has no inherent structure.
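
The four types above can be sketched with small stand-in samples. The CSV text, JSON document, clickstream URL, and free-text note below are invented examples, not taken from any real dataset.

```python
import csv
import io
import json
from urllib.parse import urlparse, parse_qs

# Structured: fixed columns with a defined type and format (CSV stand-in for a table).
csv_text = "order_id,amount\n1,19.99\n2,5.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: self-describing text with a discernible pattern (JSON).
doc = json.loads('{"user": "ana", "tags": ["new", "mobile"]}')

# Quasi-structured: erratic format that needs effort and tools to parse (a clickstream URL).
click = "https://example.com/search?q=big+data&page=2"
query = parse_qs(urlparse(click).query)

# Unstructured: free text with no inherent structure; only crude handling is possible here.
note = "Customer called twice about a late delivery."
word_count = len(note.split())

print(total, doc["tags"], query["q"], word_count)
```

The further down the list, the more work is needed before the data fits into rows and columns.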

Question 18

Q

What are the Disadvantages of EDW?

Answer

A

EDW - Enterprise Data Warehouse

EDW and BI systems tend to restrict the flexibility needed to perform robust or exploratory data analysis.

With the EDW model, data is managed and controlled by IT groups and database administrators (DBAs), and data analysts must depend on IT for access and changes to the data schemas, which imposes major lead times.

EDW rules may restrict analysts from building datasets.

EDW and BI introduce new problems related to flexibility and agility, which were less pronounced when dealing with spreadsheets.

A solution to this problem is the analytic sandbox, which attempts to resolve the conflict for analysts and data scientists with EDW and more formally managed corporate data.

Question 19

Q

What is a Sandbox?

Answer

A

Sandboxes, often referred to as workspaces, are designed to enable teams to explore many datasets in a controlled fashion and are not typically used for enterprise-level financial reporting and sales dashboards. Analytic sandboxes often enable high-performance computing using in-database processing, where the analytics occur within the database itself.
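
A minimal sketch of the in-database idea, assuming an in-memory SQLite database as a stand-in for the sandbox (the sales table and its values are hypothetical). The first query aggregates inside the database engine; the second pulls every row out and aggregates on the client.

```python
import sqlite3

# In-memory SQLite stands in for the sandbox database (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 70.0)],
)

# In-database: the aggregation runs inside the database engine,
# and only the small result set is returned to the client.
in_db = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Client-side alternative: pull every row out, then aggregate in application memory.
client_side = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    client_side[region] = client_side.get(region, 0.0) + amount

conn.close()
print(in_db)
```

Both approaches give the same totals; the difference is how much data has to leave the database, which matters as tables grow.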

Question 20

Q

What are types of Data Repositories?

Answer

A

Spreadsheets and data marts (“spreadmarts”): Spreadsheets and low-volume databases for record keeping.

Data Warehouses: Centralized data containers in a purpose-built space. They support BI and reporting but restrict robust analyses.

Question 21

Q

What are the advantages of using Analytical Sandboxes?

Answer

A

Enables flexible, high-performance analysis in a nonproduction environment.

Can leverage in-database processing.
Reduces costs and risks associated with data replication into “shadow” file systems.

“Analyst owned” rather than “DBA owned”.

Question 22

Q

What are the Business Drivers?

Answer

A

Optimize Business Operations
Identify Business Risk
Predict New Business Opportunities
Comply with laws or regulatory requirements

Question 23

Q

What are the fundamental differences of BI and Data Science?

Answer

A

BI:
BI tends to provide reports, dashboards, and queries on business questions for the current period or in the past.
BI systems make it easy to answer questions related to quarter-to-date revenue, progress toward quarterly targets, and how much of a given product was sold in a prior quarter or year.
BI questions tend to be closed-ended and explain current or past behaviour, by aggregating historical data and grouping it in some way.
BI provides hindsight and some insight and generally answers questions related to “when” and “where” events occurred.
BI problems tend to require highly structured data organized in rows and columns for accurate reporting.

Data Science:
Data Science tends to use disaggregated data in a more forward-looking, exploratory way, focusing on analysing the present and enabling informed decisions about the future.
Rather than aggregating historical data to look at how many of a given product sold in the previous quarter, a team may employ Data Science techniques to forecast future product sales and revenue more accurately than extending a simple trend line.
Data Science tends to be more exploratory in nature and may use scenario optimization to deal with more open-ended questions and provides insight into current activity and foresight into future events, while generally focusing on questions related to “how” and “why” events occur.
Data Science projects tend to use many types of data sources, including large or unconventional datasets.
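
The backward-looking versus forward-looking contrast can be sketched as follows. The transactions are invented, and simple exponential smoothing is only one possible forecasting technique, chosen here for brevity; it is not a method prescribed by the material above.

```python
from collections import defaultdict

# Hypothetical transaction-level sales records: (quarter, units).
transactions = [
    ("2023-Q1", 40), ("2023-Q1", 60),
    ("2023-Q2", 70), ("2023-Q2", 50),
    ("2023-Q3", 90), ("2023-Q3", 50),
]

# BI-style, closed-ended question: how many units were sold in each past quarter?
units_by_quarter = defaultdict(int)
for quarter, units in transactions:
    units_by_quarter[quarter] += units


def exponential_smoothing_forecast(values, alpha=0.5):
    """One-step-ahead forecast; recent values get more weight than older ones."""
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Data Science-style, forward-looking question: what might next quarter look like?
history = [units_by_quarter[q] for q in sorted(units_by_quarter)]
forecast = exponential_smoothing_forecast(history)
print(dict(units_by_quarter), forecast)
```

The first half reports what already happened; the second half produces an estimate that can be wrong, which is why Data Science work tends to be iterative and exploratory.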

Question 24

Q

What is an Analytics Architecture?

Answer

A

Analytics architecture refers to the applications, infrastructures, tools, and leading practices that enable access to and analysis of information to optimize business decisions and performance.

Question 25

Q

Why would you use Data Warehouses?

Answer

A

Data warehouses provide excellent support for traditional reporting and simple data analysis activities but may not support more robust analyses.

Question 26

Q

What is a typical Analytical Architecture?

Answer

A

1) Data Sources
2) Departmental Warehouse
3) Dashboards, Reports and Alerts
4) Data Science Users

Question 27

Q

What are the requirements for Data Sources to enter controlled environments?

Answer

A

Well understood.
Structured
Normalized with the appropriate data type definitions.
Pre-processed and checked at multiple points.
Not subjected to data exploration and iterative analytics.

Question 28

Q

What are departmental warehouses and local data marts?

Answer

A

Created by business users to accommodate their need for flexible analysis.
Local data marts may not have the same constraints for security and structure as the main EDW.
Local data marts allow users to do some level of more in-depth analysis.

Question 29

Q

What is an EDW?

Answer

A

Once in the data warehouse, data is read by additional applications across the enterprise for BI and reporting purposes.
These are high-priority operational processes getting critical data feeds from the data warehouses and repositories.

Question 30

Q

What are Data Science users and their rules?

Answer

A

Analysts get data provisioned for their downstream analytics.
Users are not allowed to run custom or intensive analytics on production databases.
Analysts create data extracts from the EDW to analyse data offline in R or other local analytical tools.
These tools are often limited to in-memory analytics on desktops, analysing samples of data rather than the entire population of a dataset.
Because these analyses are based on data extracts, they reside in a separate location, and the results of the analysis (and any insights on the quality of the data or anomalies) are rarely fed back into the main data repository.
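
To illustrate the point about analysing samples rather than the entire population, here is a small sketch with synthetic data. The population size, sample size, and distribution are arbitrary assumptions.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "full population" held in the warehouse.
population = [random.gauss(100, 15) for _ in range(100_000)]

# A desktop tool with limited memory might only analyse an extract like this.
sample = random.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# The sample estimate is close, but it is not the population value,
# and rare records may not appear in the sample at all.
print(round(population_mean, 2), round(sample_mean, 2))
```

For a simple average the sample usually does well; the limitation bites harder when the analysis depends on rare events or on every record being present.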

Question 31

Q

What are the limitations on the typical architecture?

Answer

A

Data is slow to move into the EDW, and the data schema is slow to change.

EDWs generally limit the ability of analysts to iterate on the data in a separate nonproduction environment.

Departmental data warehouses may have been originally designed for a specific purpose and set of business needs.

The typical data architectures inhibit data exploration and more sophisticated analysis.

Question 32

Q

What are some implications for Data Scientists?

Answer

A

High-value data is hard to reach and leverage, and predictive analytics and data mining activities are last in line for data.

Data moves in batches from EDW to local analytical tools.

Data Science projects will remain isolated and ad hoc, rather than centrally managed.

Question 33

Q

How do data science projects differ from Business Intelligence projects?

Answer

A

Data science projects are exploratory.

A process is necessary to govern the projects.
Ensure the participants are thorough and rigorous in approach.

A problem that appears complicated at first can be broken down into smaller pieces or actionable phases that can be more easily addressed.

A good and documented process ensures a comprehensive and repeatable method for conducting analysis.

It helps focus time and energy early in the process to get a clear grasp of the business problem to be solved.

A well-defined process offers a common framework for others to adopt, especially new members.

Question 34

Q

What are the key roles for a successful analytics project?

Answer

A

Business User:
Someone who understands the domain area and usually benefits from the results.
This person can consult and advise the project team on the context of the project, the value of the results, and how the outputs will be operationalized.

Project Sponsor:
Provides the impetus and requirements for the project and defines the core business problem. Generally, provides the funding and gauges the degree of value from the final outputs of the working team.

Project Manager:
Ensures that key milestones and objectives are met on time and at the expected quality.

Business Intelligence Analyst:
Provides business domain expertise based on a deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective.

Database Administrator (DBA):
Provisions and configures the database environment to support the analytics needs of the working team.

Data Engineer:
Has deep technical skills to assist with tuning SQL queries for data management and data extraction. Provides support for data ingestion into the analytic sandbox.

Data Scientist:
Provides subject matter expertise for analytical techniques, data modelling, and applying valid analytical techniques to given business problems. Ensures overall analytics objectives are met.

Question 35

Q

What are the methods used in the data analytical lifecycle?

Answer

A

Scientific method: relates to forming hypotheses and finding ways to test ideas.
CRISP-DM: useful input on ways to frame analytics problems.

Tom Davenport’s DELTA framework: offers an approach for data analytics projects (Five Stages of Analytics Maturity; DELTA stands for Data, Enterprise, Leadership, Targets, and Analysts).

Doug Hubbard’s Applied Information Economics (AIE): good at deriving the expected value of information (a rigorous, quantitative approach to improving IT investment decision making).

“MAD Skills” by Cohen et al.: best at phases that focus on model planning, execution, and key findings (Magnetic, Agile, Deep (MAD) data analysis).

Question 36

Q

What are the Data Analytics Lifecycle steps?

Answer

A

1) Discovery: Involves learning the business domain, assessing the resources available, framing the business problem as an analytics challenge, and formulating initial hypotheses (IHs) to test and begin learning the data.
2) Data Preparation: Involves executing extract, load, and transform (ELT) or extract, transform and load (ETL) to get data into the sandbox, so the team can work with it and analyse it.
3) Model Planning: The team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. It explores the data to learn about the relationships between variables, then selects key variables and the most suitable models.
4) Model Building: The team develops data sets for testing, training, and production purposes.
5) Communicate Results: The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
6) Operationalize: The team delivers final reports, briefings, code, and technical documents and may run a pilot project to implement the models in a production environment.
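
As a small illustration of the Model Building step, where datasets are developed for training and testing, the sketch below splits a list of records into two sets. The function name train_test_split, the 20% test fraction, and the record ids are assumptions made for this example.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(10))  # hypothetical record ids
train_set, test_set = train_test_split(records)
print(len(train_set), len(test_set))
```

Keeping the test records out of training gives a fairer picture of how a model might behave on data it has not seen.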

Question 37

Q

What are the data preparation steps?

Answer

A

1) Explore the data.
2) Pre-process the data.
3) Condition the data prior to modelling and analysis.
4) Create a robust environment, separate from production, in which the team can explore the data (an analytics sandbox).
5) Perform ETLT: a combination of extracting, transforming, and loading data into the sandbox.
6) Decide how to condition and transform data to get it into a format that facilitates subsequent analysis.
Note: Data preparation tends to be the most labour-intensive step in the analytics lifecycle.
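
A minimal extract, transform, load sketch, assuming an inline CSV as the source and an in-memory SQLite database as the sandbox. The column names and cleaning rules are hypothetical.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (an inline CSV stands in for a real feed).
raw_csv = "customer,amount\n alice ,10.5\nBOB,\ncarol,7\n"
raw_rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: condition the data (trim and normalise names, drop rows missing an amount).
clean_rows = [
    (row["customer"].strip().lower(), float(row["amount"]))
    for row in raw_rows
    if row["amount"].strip()
]

# Load: write the conditioned rows into a sandbox kept apart from production.
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
sandbox.executemany("INSERT INTO purchases VALUES (?, ?)", clean_rows)

loaded = sandbox.execute(
    "SELECT customer, amount FROM purchases ORDER BY customer"
).fetchall()
sandbox.close()
print(loaded)
```

An ELT variant would load the raw rows first and run the cleaning inside the sandbox, which keeps the untouched raw data available for later exploration.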

Question 38

Q

What are the categories of Data Analytics?

Answer

A

Descriptive analytics - help answer questions about what has happened based on historical data. (Can help track the success or failure of key objectives e.g., sales reports)

Diagnostic analytics - help answer questions about why events happened (supplement basic descriptive analytics to discover the cause of these events/anomalies e.g., unexpected changes in a metric or a particular market.)

Predictive analytics - help answer questions about what will happen in the future. (Use historical data to identify trends and determine if they’re likely to recur.)

Prescriptive analytics - help answer questions about which actions should be taken to achieve a goal or target. (allows businesses to make informed decisions in the face of uncertainty.)

Cognitive analytics - attempt to draw inferences from existing data and patterns, derive conclusions based on existing knowledge bases, and then add these findings back into the knowledge base for future inferences, a self-learning feedback loop.
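
The first four categories can be sketched on one tiny, invented sales series. The least-squares trend line, the residual check, and the stock-level rule are simple stand-ins chosen for illustration; cognitive analytics is left out because it does not reduce to a few lines.

```python
# Hypothetical monthly sales figures.
sales = [100, 110, 120, 80, 140, 150]
months = list(range(1, len(sales) + 1))

# Descriptive: what happened?
total_sales = sum(sales)
mean_sales = total_sales / len(sales)

# Predictive: what is likely to happen? Fit a least-squares trend line.
mean_month = sum(months) / len(months)
covariance = sum((m - mean_month) * (s - mean_sales) for m, s in zip(months, sales))
variance = sum((m - mean_month) ** 2 for m in months)
slope = covariance / variance
intercept = mean_sales - slope * mean_month
next_month_forecast = intercept + slope * (len(sales) + 1)

# Diagnostic (first step): which month deviates most from the trend and needs investigating?
residuals = [s - (intercept + slope * m) for m, s in zip(months, sales)]
worst_index = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
most_unusual_month = months[worst_index]

# Prescriptive: which action meets the goal? Pick the smallest stock level covering the forecast.
stock_options = [120, 140, 160, 180]
recommended_stock = min(level for level in stock_options if level >= next_month_forecast)

print(total_sales, most_unusual_month, round(next_month_forecast, 1), recommended_stock)
```

Finding the unusual month is only the start of diagnostic work; explaining why it happened still needs additional data and domain knowledge.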

Question 39

Q

What are the roles in Data Analytics?

Answer

A

1) Business Analyst
2) Data Analyst
3) Data Engineer
4) Data Scientist
5) Database Administrator

Question 40

Q

What are the Power BI project steps?

Answer

A

1) Gather
2) Store
3) Model
4) Visualize
5) Share

Question 41

Q

What can you do in Power BI Desktop?

Answer

A

1) Connect with your data.
2) Set up transformation rules.
3) Perform data wrangling operations.
4) Massage the data into the format that your reports require.
5) Build your reports.

Question 42

Q

What are the three types of Power BI?

Answer

A

Power BI Desktop
Power BI Service
Power BI Mobile

Question 43

Q

What is Power BI?

Answer

A

It is a collection of Software Services, Apps and Connectors that work together to turn unrelated sources of data into coherent, and interactive insights.

Question 44

Q

What are the Capabilities of Power BI?

Answer

A

Creating quick insights from an Excel workbook or a local database.
Extensive modelling and real-time analytics
Custom development
Serve as the analytics and decision engine behind group projects, divisions, or entire corporations.

Question 45

Q

What is the workflow in Power BI?

Answer

A

1) A report is created.
2) The report is then published to the Power BI service.
3) The report is shared, so that users of Power BI Mobile apps can consume the information.

Question 46

Q

What are the Building blocks of Power BI?

Answer

A

Visualizations
Datasets
Reports
Dashboards
Tiles

Question 47

Q

What are the main areas of a report?

Answer

A

1) The ribbon, which displays common tasks associated with reports and visualizations.
2) The Report view, or canvas, where visualizations are created and arranged.
3) The Pages tab area along the bottom, which lets you select or add a report page.
4) The Visualizations pane, where you can change visualizations, customize colours or axes, apply filters, drag fields, and more.
5) The Fields pane, where query elements and filters can be dragged onto the Report view or dragged to the Filters area of the Visualizations pane.

Question 48

Q

What is the Query Editor?

Answer

A

The Query Editor ribbon contains additional tools, such as changing the data type of columns, adding scientific notation, or extracting elements from dates, such as day of the week.
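
The kinds of transformation mentioned above (changing a column's data type, extracting the day of the week from a date) can be sketched in Python. This is only an analogy for what the Query Editor does through its ribbon, not Power BI code, and the rows are invented.

```python
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Hypothetical rows as they might arrive from a source: every value is text.
raw = [
    {"order_date": "2024-03-04", "amount": "19.50"},
    {"order_date": "2024-03-09", "amount": "5"},
]

shaped = []
for row in raw:
    order_date = date.fromisoformat(row["order_date"])  # change type: text to date
    shaped.append(
        {
            "order_date": order_date,
            "amount": float(row["amount"]),  # change type: text to number
            "day_of_week": DAY_NAMES[order_date.weekday()],  # extract an element from the date
        }
    )

print([r["day_of_week"] for r in shaped])
```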

Question 49

Q

What are the key points of interest when importing data from SQL Server?

Answer

A

1) It is more efficient to import data from SQL Server views than directly from database tables.
2) Each view should create a data-model-ready table, containing only relevant columns, with report-friendly names.
3) A data model containing imported SQL Server data can be mashed up with data from other sources, as well as with calculated tables.
4) There are no restrictions on the creation of DAX calculated tables, measures and calculated columns.
5) When uploaded to the Power BI service, the imported data needs to be updated via an on-premises gateway using a refresh schedule.
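
Point 2 can be sketched with a view that exposes only the relevant columns under report-friendly names. SQLite is used here as a stand-in for SQL Server so the example runs anywhere, and the table, column, and view names are hypothetical.

```python
import sqlite3

# SQLite stands in for SQL Server here; the table and view names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_cust (cust_id INTEGER, cust_nm TEXT, internal_flag INTEGER, crt_ts TEXT)"
)
conn.execute("INSERT INTO tbl_cust VALUES (1, 'Ana', 0, '2024-01-01')")

# The view exposes only the relevant columns, under report-friendly names.
conn.execute(
    """
    CREATE VIEW v_customer AS
    SELECT cust_id AS "Customer ID", cust_nm AS "Customer Name"
    FROM tbl_cust
    """
)

cursor = conn.execute("SELECT * FROM v_customer")
column_names = [col[0] for col in cursor.description]
view_rows = cursor.fetchall()
conn.close()
print(column_names, view_rows)
```

A reporting tool that reads the view sees a table that is already shaped for the data model, without the internal columns.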

Question 50

Q

Compare Import and Direct Query

Answer

A

With DirectQuery, you can connect to the same SQL Server views used with the Import option, but the data stays in the source database rather than being imported.
In DirectQuery mode, only the Report and Relationship views are available on the left of the Power BI Desktop screen; the Data view is not present.
In Relationship view, Power BI does not automatically create relationships as it does when data is imported.
To create a relationship in DirectQuery mode, you must drag from the “many” table to the “one” table to create the many-to-one relationship required by Power BI; otherwise you get an error message.
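
A many-to-one relationship needs key values that are unique on the “one” side. The sketch below checks that condition on two invented tables before a relationship would be created; the helper duplicate_keys and the table contents are assumptions for illustration, not part of Power BI.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the key values that appear more than once in `rows`."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical "one" side (each product appears once) and "many" side (repeated products).
products = [{"product_id": 1, "name": "Pen"}, {"product_id": 2, "name": "Ink"}]
sales = [
    {"product_id": 1, "qty": 3},
    {"product_id": 1, "qty": 1},
    {"product_id": 2, "qty": 5},
]

one_side_ok = duplicate_keys(products, "product_id") == []
many_side_duplicates = duplicate_keys(sales, "product_id")
print(one_side_ok, many_side_duplicates)
```

Here the products table can act as the “one” side, while the sales table repeats product_id and therefore belongs on the “many” side.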

Question 51

Q

What are the key points of a DirectQuery connection to a SQL Server database?

Answer

A

1) It is still more efficient to connect to SQL Server views than directly to database tables.
2) All tables and views used must come from the same database.
3) A data model containing a DirectQuery connection to SQL Server data cannot be mashed up with data from any other sources.
4) Calculated tables cannot be added to the data model.
5) There are severe restrictions on the use of DAX when creating measures and calculated columns. (These options can be overridden in Power BI Desktop’s options settings.)
6) When a DirectQuery report is uploaded to the Power BI service, a data source must be created to enable Power BI to connect to the database via an on-premises gateway; however, no refresh schedule needs to be created.

Question 52

Q

What is R?

Answer

A

The R programming language is an offshoot of a programming language called S, which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks.
R is a free software environment for data manipulation, calculation and graphical display.
It provides a wide variety of statistical and graphical techniques.
R is very much a vehicle for newly developing methods of interactive data analysis.
It has developed rapidly and has been extended by a large collection of packages.
R currently is free to download and use.