Week 1 Flashcards

Week 1 course content

1
Q

What are the types of data?

A

Structured Data, Semi-Structured Data, and Unstructured Data.

2
Q

How do the data types differ in terms of format?

A

Structured data has a predefined schema. Semi-structured data has some structure, often expressed with tags. Unstructured data has no fixed format.

3
Q

How do the data types differ in terms of analysis?

A

Structured data is easy to analyze, semi-structured data is moderately difficult, and unstructured data is difficult.

4
Q

How do the data types differ in terms of tools?

A

Structured data: SQL and traditional databases. Semi-structured data: specialized tools such as JSON parsers. Unstructured data: natural language processing and computer vision.

5
Q

What are examples of each data type?

A

Structured data: databases and spreadsheets. Semi-structured data: JSON, HTML, and XML. Unstructured data: text, images, and videos.
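
As a minimal sketch of the three types, the snippet below holds the same hypothetical customer record as a CSV row (structured), a JSON document (semi-structured), and a free-text note (unstructured). The field names and values are invented for illustration only.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
csv_text = "id,name,city\n1,Ana,Lisbon\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys tag each value, but nesting and fields can vary per record.
json_text = '{"id": 1, "name": "Ana", "tags": ["vip"], "address": {"city": "Lisbon"}}'
doc = json.loads(json_text)

# Unstructured: free text with no schema; finding the city needs text processing.
note = "Ana called from Lisbon about her order."

structured_city = rows[0]["city"]             # direct column lookup
semi_structured_city = doc["address"]["city"] # walk the nested keys
unstructured_mentions_city = "Lisbon" in note # crude substring search
```

The lookups get less direct from top to bottom, which is roughly why analysis difficulty rises from structured to unstructured data.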

6
Q

What is ‘Structured Data’?

A

Data organized in a predefined format with a fixed schema, typically stored in rows and columns like a spreadsheet or database table. Examples include customer information, spreadsheet sales data, and machine sensor readings.

7
Q

What is ‘Unstructured Data’?

A

Data that lacks a predefined structure or format. It is often text-heavy or multimedia-based. Typical examples are social media posts, emails, images, videos, and audio files.

8
Q

What is ‘Semi-Structured Data’?

A

Data that has some structure but lacks the rigid format of structured data. It often includes tags or markers that indicate the meaning of different parts of the data. Examples include JSON data, XML data, and HTML documents.

9
Q

What is Data?

A

The raw material that fuels insights and informed decision-making. It originates from various data sources.

10
Q

What is a Data Source?

A

The place where data is stored or generated, such as sensors, social media platforms, customer interactions, databases, and public records.

11
Q

What is ‘Data Collection’?

A

Systematically gathering data from diverse data sources.

12
Q

What are the two categories of Data Sources?

A

Primary Data Source and Secondary Data Source

13
Q

What is a Primary Data Source?

A

A source of data collected firsthand by the researcher for a specific purpose or project. Primary data is collected through surveys, experiments, and observations.

14
Q

What is a Secondary Data Source?

A

A source of data that was already collected by someone else for another purpose and is being repurposed for a new analysis. Secondary data sources include public databases, published research, and third-party sources.

15
Q

What is a Database?

A

An organized collection of data that allows for efficient storage, retrieval, and manipulation. It is designed for transactional processing and day-to-day operations: creating, reading, updating, and deleting data (CRUD). Examples of database systems are Microsoft SQL Server, MySQL, and MongoDB.
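
The four CRUD operations can be sketched as below. This uses SQLite through Python's built-in sqlite3 module rather than any of the systems named on the card, purely because it runs without a server. The table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# Create: insert a new row (it receives id 1 in this empty table).
conn.execute("INSERT INTO customers (name) VALUES (?)", ("Ana",))

# Read: fetch the stored rows.
after_create = conn.execute("SELECT name FROM customers").fetchall()

# Update: change an existing row.
conn.execute("UPDATE customers SET name = ? WHERE id = ?", ("Ana Silva", 1))
after_update = conn.execute("SELECT name FROM customers WHERE id = 1").fetchone()

# Delete: remove the row again.
conn.execute("DELETE FROM customers WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

conn.close()
```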

16
Q

What is a Data Warehouse?

A

A large, centralized repository that aggregates data from various sources. It is designed for analytical processing, historical data storage, and decision-making.

17
Q

What is ‘Knowledge Discovery in Databases (KDD)’?

A

A process that offers a structured framework for extracting valuable insights from data.

18
Q

What are the KDD Steps?

A

Data Selection, Data Preprocessing, Data Transformation, Data Mining, Pattern Evaluation, and Knowledge Representation.

19
Q

What is ‘Data Selection’?

A

Identifying the relevant data sources for analysis by selecting the target dataset(s) or focusing on a subset of variables or data samples.

20
Q

What is ‘Data Preprocessing’?

A

Cleaning and preparing the data for analysis: removing outliers, handling missing values, and correcting errors and inconsistencies.
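
A minimal preprocessing sketch on hypothetical sensor readings follows. Filling gaps with the median and dropping outliers with the 1.5 x IQR rule are common conventions chosen for illustration here, not the only valid options.

```python
from statistics import median, quantiles

# Hypothetical readings: None marks a missing value, 250.0 is a likely glitch.
readings = [10.0, 11.0, None, 9.0, 250.0, 10.0, None, 12.0]

# Handle missing values: fill each gap with the median of the observed values.
observed = [x for x in readings if x is not None]
fill_value = median(observed)
filled = [fill_value if x is None else x for x in readings]

# Remove outliers: keep values within 1.5 * IQR of the first and third quartiles.
q1, _, q3 = quantiles(filled, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [x for x in filled if lower <= x <= upper]
```

The median is used for filling because, unlike the mean, it is barely affected by the 250.0 glitch.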

21
Q

What is ‘Data Transformation’?

A

Converting the data into formats suitable for mining. Data transformation includes reduction, normalization, discretization, and feature engineering.
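
Two of the transformations named above, normalization and discretization, can be sketched as follows. The age values and the bin boundaries (30 and 60) are arbitrary assumptions for the example.

```python
ages = [18, 25, 40, 60, 78]

# Normalization (min-max): rescale every value into the range [0, 1].
lowest, highest = min(ages), max(ages)
normalized = [(a - lowest) / (highest - lowest) for a in ages]


# Discretization: replace each number with a labelled bin.
def age_band(age):
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"


bands = [age_band(a) for a in ages]
```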

22
Q

What is ‘Data Mining’?

A

Applying algorithms such as classification, clustering, and association rules to extract patterns or models from the processed data.

23
Q

How does ‘Data Mining’ work?

A

Collect Data: Start with a lot of information (like sales records, customer reviews, or website activity).

Analyze: Use tools and algorithms (special formulas) to look for patterns.

Find Insights: Spot trends, such as which products sell best during certain months or what type of customers are most loyal.

24
Q

What is ‘Pattern Evaluation’?

A

Identifying the interesting patterns in the dataset by assessing their relevance, validity, novelty, and potential usefulness for action.

25
Q

What is ‘Knowledge Representation’?

A

Sharing the discovered knowledge through reports, visualizations, or decision support systems to communicate the findings to stakeholders.

26
Q

How does ‘Data Transformation’ work?

A

Collect Data: Gather a lot of small details (e.g., daily sales from multiple stores).

Group It: Organize the data (e.g., by month, region, or product type).

Summarize: Use calculations like totals, averages, or counts to create a summary (e.g., “Total sales for January = $10,000”).
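
The collect, group, and summarize steps above can be sketched as below. The sales records are invented, with January chosen to total 10,000 so it matches the example on the card.

```python
from collections import defaultdict

# Collect: hypothetical daily sales as (date, region, amount).
sales = [
    ("2024-01-03", "North", 4000),
    ("2024-01-15", "South", 3500),
    ("2024-01-28", "North", 2500),
    ("2024-02-02", "South", 3000),
    ("2024-02-20", "North", 1000),
]

# Group: bucket the amounts by month (the first 7 characters, "YYYY-MM").
by_month = defaultdict(list)
for day, region, amount in sales:
    by_month[day[:7]].append(amount)

# Summarize: compute a total, an average, and a count per group.
summary = {
    month: {
        "total": sum(amounts),
        "average": sum(amounts) / len(amounts),
        "count": len(amounts),
    }
    for month, amounts in by_month.items()
}
```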

27
Q

What is ‘Data Dredging’?

A

Data dredging (also called data fishing or data snooping) is when someone looks through a lot of data to find patterns or results, but they do it without a clear plan or hypothesis.

28
Q

What is ‘Data Discrepancy’?

A

A mismatch or inconsistency in data that should be the same or aligned.

29
Q

What is ‘Data Regression’?

A

A system bug or error that causes old, incorrect data to reappear.

30
Q

What is ‘Regression Analysis’?

A

A statistical tool to predict and explain data relationships.

31
Q

What is the difference between ‘Data Analysis’ and ‘Data Mining’?

A

Data analysis answers predefined questions using statistical techniques. Data mining discovers hidden patterns in large datasets without a specific question in mind, using methods such as clustering or association rule mining.

32
Q

What is ‘Feature Engineering’?

A

The process of creating, selecting, or transforming data into meaningful inputs (features) that can improve the performance of a machine learning model.

33
Q

How does ‘Feature Engineering’ work?

A

Identify Raw Data: Start with the original dataset (e.g., sales data, customer details, sensor readings).

Create New Features: Derive useful information from the raw data.
Example: Use “date of birth” to calculate “age.”

Transform Features: Apply techniques like scaling or encoding to make the data usable for models.

Select Features: Pick the most relevant features and remove unnecessary ones to avoid overcomplicating the model.
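
The four steps above can be sketched on a few made-up customer records. The field names, the fixed reference date, and the choice of min-max scaling and a 0/1 encoding are all assumptions for illustration.

```python
from datetime import date

# Identify raw data: hypothetical customer records.
raw = [
    {"date_of_birth": date(1990, 5, 17), "plan": "basic", "monthly_spend": 20.0},
    {"date_of_birth": date(2000, 1, 1), "plan": "premium", "monthly_spend": 80.0},
    {"date_of_birth": date(1975, 12, 31), "plan": "basic", "monthly_spend": 50.0},
]
reference_day = date(2024, 1, 1)  # fixed so the ages are reproducible


def age_on(dob, day):
    # Subtract one year if the birthday has not happened yet in that year.
    return day.year - dob.year - ((day.month, day.day) < (dob.month, dob.day))


spends = [r["monthly_spend"] for r in raw]
low, high = min(spends), max(spends)

features = []
for r in raw:
    features.append({
        # Create: derive age from date of birth.
        "age": age_on(r["date_of_birth"], reference_day),
        # Transform: min-max scale the spend into [0, 1].
        "spend_scaled": (r["monthly_spend"] - low) / (high - low),
        # Transform: encode the plan category as a 0/1 flag.
        "is_premium": 1 if r["plan"] == "premium" else 0,
    })
# Select: only the three derived features are kept; the raw columns are dropped.
```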

34
Q

What is ‘Classification Technique’?

A

A technique used to categorize data into predefined classes or categories based on the features or attributes of the data instances. It involves training a model on labeled data and using it to predict the class labels of new, unseen data instances.
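
One of the simplest classifiers, 1-nearest-neighbor, shows the idea of learning from labeled data and predicting labels for unseen points. The training points and class names are invented, and nearest-neighbor is just one of many classification algorithms.

```python
import math

# Labeled training data: (features, class label).
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]


def predict(point):
    # Return the label of the closest training example (Euclidean distance).
    nearest = min(training, key=lambda example: math.dist(example[0], point))
    return nearest[1]


label_a = predict((2.0, 1.5))  # unseen point near the "small" examples
label_b = predict((7.0, 9.0))  # unseen point near the "large" examples
```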

35
Q

What is ‘Regression Technique’?

A

A technique used to predict numeric or continuous values based on the relationship between input variables and a target variable. It aims to find a mathematical function or model that best fits the data to make accurate predictions.
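
A minimal sketch of fitting a straight line by ordinary least squares follows. The data points are invented and lie exactly on y = 2x + 1, so the fitted line is easy to check by hand. Real data would be noisy.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Least-squares slope: covariance of x and y divided by the variance of x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x


def predict(x):
    # Use the fitted line to predict a continuous value for a new input.
    return slope * x + intercept


prediction = predict(6.0)
```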

36
Q

What is ‘Clustering Technique’?

A

A technique used to group similar data instances together based on their intrinsic characteristics or similarities. It aims to discover natural patterns or structures in the data without any predefined classes or labels.
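
A bare-bones k-means run on one-dimensional data illustrates grouping without labels. The points, the choice of two clusters, and the deterministic starting centroids are assumptions for the sketch. Practical implementations usually initialize randomly and check for convergence.

```python
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids = [points[0], points[-1]]  # simple deterministic starting guess

for _ in range(10):  # a few iterations are plenty for this tiny dataset
    # Assignment step: put each point in the cluster of its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [
        sum(cluster) / len(cluster) if cluster else centroids[i]
        for i, cluster in enumerate(clusters)
    ]
```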

37
Q

What is ‘Association Rule’?

A

A technique that discovers interesting relationships or patterns among a set of items in transactional or market basket data. It identifies frequently co-occurring items and generates rules such as “if X, then Y” to reveal associations between items.
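
The two standard measures for a rule “if X, then Y” are support (how often X and Y appear together) and confidence (how often Y appears when X does). A small sketch on invented shopping baskets:

```python
# Hypothetical market-basket transactions.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "butter", "jam"},
]


def count(items):
    # Number of baskets that contain every item in `items`.
    return sum(items <= basket for basket in baskets)


antecedent = {"bread"}   # X
consequent = {"butter"}  # Y

# Support: fraction of all baskets containing both X and Y.
support = count(antecedent | consequent) / len(baskets)
# Confidence: among baskets with X, the fraction that also contain Y.
confidence = count(antecedent | consequent) / count(antecedent)
```

Here the rule “if bread, then butter” holds in 3 of the 4 baskets that contain bread.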

38
Q

What is ‘Anomaly Detection’?

A

A technique that identifies rare or unusual data instances that deviate significantly from the expected patterns. It is useful for detecting fraudulent transactions, network intrusions, manufacturing defects, or any other abnormal behavior.
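
One simple way to flag unusual values is a robust score based on the median and the median absolute deviation (MAD). The transaction amounts are invented, and the 0.6745 constant and 3.5 cutoff are a commonly quoted convention, treated here as an assumption rather than a rule.

```python
from statistics import median

# Hypothetical transaction amounts; 500 is the suspicious one.
amounts = [20, 22, 19, 21, 23, 20, 500]

center = median(amounts)
mad = median(abs(x - center) for x in amounts)  # median absolute deviation


def robust_score(x):
    # Distance from the median, scaled by the MAD.
    return 0.6745 * abs(x - center) / mad


anomalies = [x for x in amounts if robust_score(x) > 3.5]
```

A plain mean-and-standard-deviation score would be distorted by the outlier itself in a sample this small, which is why the median-based version is used.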

39
Q

What is ‘Time Series Analysis Technique’?

A

A technique that analyzes and predicts data points collected over time. It involves forecasting, trend analysis, seasonality detection, and anomaly detection in time-dependent datasets.
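
A moving average is among the simplest time series tools: it smooths short-term noise to expose the trend and can serve as a naive forecast. The weekly sales figures and the window of 3 are made up for this sketch.

```python
# Hypothetical weekly sales, oldest first.
sales = [10, 12, 14, 13, 15, 17, 16, 18]
window = 3

# Trend: average each run of `window` consecutive observations.
moving_average = [
    sum(sales[i : i + window]) / window for i in range(len(sales) - window + 1)
]

# Naive forecast: assume next week looks like the latest smoothed value.
forecast = moving_average[-1]
```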

40
Q

What is ‘Neural Networks Technique’?

A

Machine learning models inspired by the human brain’s structure and function. They are composed of interconnected nodes (neurons) arranged in layers that learn from data to recognize patterns and perform classification, regression, or other tasks.
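
The smallest building block of a neural network is a single neuron. The sketch below trains one (a perceptron with a step activation) to reproduce the logical AND function. It is a toy under stated assumptions: real networks stack many such units in layers and use smoother activations and gradient-based training.

```python
# Training data for logical AND: (inputs, target output).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights = [0, 0]
bias = 0


def neuron(inputs):
    # Weighted sum of the inputs plus bias, passed through a step activation.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


# Perceptron learning rule: nudge weights and bias by the prediction error.
for _ in range(20):
    for inputs, target in data:
        error = target - neuron(inputs)
        weights = [w + error * x for w, x in zip(weights, inputs)]
        bias += error

predictions = [neuron(inputs) for inputs, _ in data]
```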

41
Q

What is ‘Decision Trees Technique’?

A

Graphical models that use a tree-like structure to represent decisions and their possible consequences. They recursively split the data based on different attribute values to form a hierarchical decision-making process.
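
A single split can be sketched as a one-level tree (often called a decision stump). The code tries each candidate threshold on one attribute and keeps the one with the fewest errors. A full decision tree would repeat this search recursively inside each branch. The data (hours studied, passed or not) is invented.

```python
# Hypothetical labeled data: (hours_studied, passed).
samples = [(2, "no"), (3, "no"), (5, "no"), (7, "yes"), (8, "yes"), (10, "yes")]

values = sorted(x for x, _ in samples)
# Candidate thresholds: midpoints between neighbouring attribute values.
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]


def errors(threshold):
    # Count mistakes of the rule "predict yes when the value exceeds threshold".
    return sum(("yes" if x > threshold else "no") != label for x, label in samples)


best_threshold = min(candidates, key=errors)


def predict(x):
    return "yes" if x > best_threshold else "no"
```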

42
Q

What is ‘Ensemble Methods Technique’?

A

Techniques that combine multiple models to improve prediction accuracy and generalization. Random Forests and Gradient Boosting, for example, combine many weak learners into a stronger, more accurate model.
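
The core idea of combining weak learners can be sketched with a majority vote over three hand-written rules. This is only the voting idea: Random Forests and Gradient Boosting build and combine their learners in more elaborate ways. The message fields and thresholds are hypothetical.

```python
from collections import Counter


# Three weak learners: each looks at one feature and is often wrong alone.
def rule_links(msg):
    return "spam" if msg["links"] > 2 else "ham"


def rule_caps(msg):
    return "spam" if msg["caps_ratio"] > 0.5 else "ham"


def rule_offer(msg):
    return "spam" if msg["has_offer"] else "ham"


def ensemble_predict(msg):
    # Majority vote: the label chosen by most of the weak learners wins.
    votes = Counter(rule(msg) for rule in (rule_links, rule_caps, rule_offer))
    return votes.most_common(1)[0][0]


first = ensemble_predict({"links": 5, "caps_ratio": 0.1, "has_offer": True})
second = ensemble_predict({"links": 0, "caps_ratio": 0.9, "has_offer": False})
```

In the second message one rule votes "spam" but is outvoted, which is how an ensemble can correct the mistakes of individual members.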
