Midterm Flashcards

1
Q

What are the most common data formats for data analysis?

A
  • Structured data (databases)
  • Semi-structured data (XML/JSON data, email, web pages)
  • Unstructured data (audio, video, image data, natural language)
2
Q

What are the types of qualitative data?

A
  • Nominal data: data that can be labelled or classified into mutually exclusive categories within a variable (hair color, gender, ethnicity)
  • Ordinal data: data grouped into ordered categories (grades, economic status)
3
Q

What are the types of quantitative data?

A
  • Discrete data: data that includes nondivisible figures and statistics that can be counted (number of people)
  • Continuous data: data that can take any value, including decimals (height, length, temperature)
4
Q

What are alternative data sources?

A

Non-traditional or unconventional sources of information that can be used to gain insights and make informed decisions. These sources complement or supplement traditional data sources such as official statistics or financial reports.

5
Q

Where can you find alternative data sources?

A
  • Development partners (CAF, European Bank, OECD, UNDP)
  • The Development Data Partnership (solving development challenges through data science collaboration between companies and international organizations)
  • Data providers (GitHub, Google, Meta, LinkedIn, etc.)
6
Q

What is the Humanitarian Data Exchange (HDX)?

A

An alternative data source: an open platform for sharing data across crises and organizations. The goal of HDX is to make humanitarian data easy to find and use for analysis.

7
Q

What is ETL (Extract, Transform, Load)?

A

It’s a process used in data integration and data warehousing to extract data from various sources, transform it into a desired format, and load it into a target system or data repository for further analysis or storage.
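
A minimal end-to-end sketch in Python with pandas, using an in-memory CSV and an in-memory SQLite database so it runs as-is (the column and table names are made up for illustration):

```python
import io
import sqlite3

import pandas as pd

# Extract: read raw data from a source (an in-memory CSV stands in for a real file)
raw = pd.read_csv(io.StringIO("household_id,income_usd\n1,500\n1,500\n2,\n3,800"))

# Transform: clean and reshape into the desired format
clean = raw.drop_duplicates().dropna(subset=["income_usd"])
clean["income_usd"] = clean["income_usd"].astype(float)

# Load: write the result into a target repository (here, SQLite)
con = sqlite3.connect(":memory:")
clean.to_sql("households", con, if_exists="replace", index=False)
con.close()
```
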

8
Q

What are the possible variants of the extract step in the ETL process?

A
  1. Database extraction: extracting data from relational databases such as Oracle, MySQL, or SQL Server. This involves querying the database using SQL statements to retrieve the required data.
  2. File extraction: extracting data from flat files, such as CSV, Excel, XML, JSON, or text files. The extraction process involves reading the file content and parsing it to extract the relevant data.
  3. Web scraping: extracting data from websites by crawling web pages and scraping the required information. This could involve using tools and libraries like BeautifulSoup or Scrapy to navigate web pages, locate specific elements, and extract the desired data.
  4. API extraction: extracting data from web APIs (Application Programming Interfaces). APIs provide a structured way to access and retrieve data from various sources (see the sketch after this list).
  5. Log file extraction: extracting data from log files generated by systems, applications, or devices. Log files often contain valuable information that can be extracted and analyzed for troubleshooting, performance monitoring, or security purposes.
  6. Sensor or IoT data extraction: extracting data from sensors or Internet of Things devices. This involves capturing data from sensors or devices that collect and transmit real-time data, such as temperature sensors, GPS devices, or smart meters.
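
A sketch of two of these variants, file extraction (2) and API extraction (4); the file name and URL below are placeholders, not real sources:

```python
import pandas as pd
import requests

# File extraction: read and parse a flat file (hypothetical CSV)
file_df = pd.read_csv("indicators.csv")

# API extraction: request structured data from a web API
# (placeholder URL; a real API documents its endpoints and parameters)
response = requests.get(
    "https://api.example.org/v1/indicators",
    params={"country": "BR", "format": "json"},
)
response.raise_for_status()             # fail early on HTTP errors
api_df = pd.DataFrame(response.json())  # tabulate the JSON payload for analysis
```
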
9
Q

What are the most common transformation processes?

A

Data cleaning, data integration, data enrichment, data standardization, aggregation and summarization, and feature engineering.

10
Q

What is data cleaning?

A

The transformation process includes cleaning the extracted data to ensure its quality and consistency. This may involve handling missing values, removing duplicates, correcting data formatting errors, and validating data against predefined rules or constraints.
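
A minimal pandas sketch of these operations (the data and column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ana", "Luis", None],
    "age":  ["34", "34", "29", "41"],
})

df = df.drop_duplicates()               # remove duplicates
df = df.dropna(subset=["name"])         # handle missing values
df["age"] = df["age"].astype(int)       # correct a data formatting error
assert df["age"].between(0, 120).all()  # validate against a predefined rule
```
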

11
Q

What is data standardization?

A

Transforming the data into a consistent format is often necessary for efficient analysis or loading into a target system. This may involve converting data types, harmonizing units of measurement, and applying standard coding schemes to ensure data consistency across different sources.
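
A short pandas sketch, assuming toy data with inconsistent units and country codes:

```python
import pandas as pd

df = pd.DataFrame({"height_in": [65, 70], "country": ["br", "BR "]})

# Harmonize units of measurement: inches -> centimeters
df["height_cm"] = df["height_in"] * 2.54

# Apply a standard coding scheme: trimmed, uppercase country codes
df["country"] = df["country"].str.strip().str.upper()
```
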

12
Q

What is data integration?

A

If data is extracted from multiple sources, the transformation process involves integrating the data into a unified format. This may include resolving inconsistencies, merging overlapping data, and reconciling differences in data structures or naming conventions.
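
A small pandas sketch: two made-up sources describe the same households under different naming conventions, so the columns are reconciled before merging:

```python
import pandas as pd

surveys = pd.DataFrame({"hh_id": [1, 2], "income": [500, 800]})
census = pd.DataFrame({"household_id": [1, 2], "size": [4, 3]})

# Reconcile naming conventions, then merge into a unified format
census = census.rename(columns={"household_id": "hh_id"})
unified = surveys.merge(census, on="hh_id", how="outer")
```
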

13
Q

What is feature engineering/data enrichment?

A

In some cases, additional data may need to be added to the extracted data to enhance its value or context. This could involve merging with external data sources, performing data lookups, or applying algorithms or advanced analytics techniques to derive new attributes or insights.

14
Q

Examples of feature engineering?

A

-Calculate the time difference between two events
-Calculate the average age in a group of people
-Find the closest supermarket to a certain point
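
The first two examples in pandas (made-up data; the supermarket example would additionally need coordinates and a distance calculation):

```python
import pandas as pd

events = pd.DataFrame({
    "start": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 09:30"]),
    "end":   pd.to_datetime(["2024-01-01 08:45", "2024-01-01 11:00"]),
    "group": ["A", "A"],
    "age":   [25, 35],
})

# Time difference between two events, in minutes
events["duration_min"] = (events["end"] - events["start"]).dt.total_seconds() / 60

# Average age in a group of people
avg_age = events.groupby("group")["age"].mean()
```
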

15
Q

Examples of data enrichment?

A

Using natural language processing (NLP) to extract insights from a tweet
Using machine learning algorithms to extract insights from a picture
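
A sketch of the first example, assuming the third-party textblob package is installed (pip install textblob); its built-in sentiment analyzer is used purely as an illustration of deriving new attributes from raw text:

```python
from textblob import TextBlob

tweet = "The new water point saved our village hours of walking every day!"

# Enrich the raw text with derived sentiment attributes
sentiment = TextBlob(tweet).sentiment
print(sentiment.polarity)      # -1.0 (negative) .. 1.0 (positive)
print(sentiment.subjectivity)  # 0.0 (objective) .. 1.0 (subjective)
```
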

16
Q

What is the load step in the ETL process?

A

This step refers to loading the transformed and processed data into a target system or data repository for storage, analysis, or further processing.
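
A minimal sketch loading a transformed DataFrame into a SQLite database (the file and table names are illustrative):

```python
import sqlite3

import pandas as pd

transformed = pd.DataFrame({"region": ["North", "South"], "avg_income": [520.0, 610.0]})

con = sqlite3.connect("warehouse.db")  # the target repository
transformed.to_sql("regional_income", con, if_exists="replace", index=False)
con.close()
```
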

17
Q

What is an API?

A

API: an Application Programming Interface is a software intermediary that allows two applications to talk to each other. APIs are an accessible way to extract and share data within and across organizations.
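
A short example with Python's requests library against the World Bank's public data API (the endpoint pattern is shown as a typical illustration; check the API's documentation for current parameters):

```python
import requests

# Request Brazil's total population series from the World Bank API
url = "https://api.worldbank.org/v2/country/BR/indicator/SP.POP.TOTL"
response = requests.get(url, params={"format": "json", "per_page": 5})
response.raise_for_status()

metadata, records = response.json()  # this API returns [metadata, data]
for row in records:
    print(row["date"], row["value"])
```
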

18
Q

Who is a chief data scientist?

A

They focus on applying advanced analytics and machine learning algorithms to solve complex problems, build predictive models, and develop data-driven solutions. They have a deep understanding of statistical modeling, data mining, and programming in R and Python.

19
Q

Who is a data engineer?

A

They focus on building and maintaining the infrastructure required for data processing and storage. They are responsible for data extraction, setting up databases and data pipelines, and ensuring data quality and reliability. They work with technologies like Hadoop, Spark, SQL, and data integration tools.

20
Q

Who is a data analyst?

A

They act as a bridge between data scientists/data engineers and business needs. They interact with customers to understand their business challenges and solve them through data. They usually work with Power BI, Tableau, and reporting tools, and have data storytelling skills.

21
Q

What are the types of algorithms?

A
  • Logic rule-based systems
  • AI systems
  • Generative AI systems
22
Q

What is a logic rule-based system?

A

A type of computer program or AI that operates by following predefined rules and logical principles to make decisions or draw conclusions: if X, then Y.
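
A tiny illustration of the "if X, then Y" pattern (the rules themselves are made up):

```python
def loan_decision(income: float, has_default: bool) -> str:
    """Hypothetical rule-based decision: predefined rules, no learning."""
    if has_default:          # if X, then Y
        return "reject"
    if income >= 3000:
        return "approve"
    return "manual review"

print(loan_decision(income=4500, has_default=False))  # approve
```
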

23
Q

What are AI systems?

A

Artificial Intelligence (AI) systems refer to computer programs or machines that exhibit intelligence and perform tasks that typically require human intelligence. These systems are designed to simulate human cognitive functions such as learning, reasoning, problem-solving, perception, and decision-making. AI systems can range from simple rule-based programs to complex neural networks and deep learning algorithms.

24
Q

What is a generative AI system?

A

Generative AI systems are a type of artificial intelligence that focuses on generating new content, such as images, text, music, or even entire pieces of software code. Unlike traditional AI systems that primarily analyze existing data or make predictions based on patterns, generative AI systems have the ability to create entirely new content that was not explicitly present in the training data.

25
Q

What are image generation models?

A

Machine learning (ML) models trained on large amounts of images that are able to generate visual content from a text description.

26
Q

What are large language models?

A

Large language models are advanced artificial intelligence systems designed to understand and generate human-like text based on vast amounts of data. These models, like GPT-3 (Generative Pre-trained Transformer 3), are characterized by their extensive size, complexity, and ability to perform a wide range of natural language processing tasks.

27
Q

What key advances have enabled the creation of LLMs?

A

-deep neural networks
-transformers
-transfer learning
-increased computing capacity
-large text data sets
-advances in optimization and training

28
Q

What is the model size of LLMs?

A

The size of the model is measured by the number of parameters.
Parameters indicate the size and capacity of the model (the weights and biases of the neurons that are adjusted during the training process).

29
Q

What is a token?

A

A part of a word; the atomic unit that LLMs work with. Tokens can be characters, words, subwords, other segments of text, or code, depending on the chosen tokenization method or scheme.
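
A concrete illustration using OpenAI's tiktoken library (assumed installed via pip install tiktoken; other LLMs use different tokenization schemes, so counts will differ):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models
ids = enc.encode("Tokenization splits text into subwords.")

print(len(ids))                        # number of tokens
print([enc.decode([i]) for i in ids])  # the text of each token
```
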

30
Q

What are some of the most important LLMs?

A

GPT (OpenAI), Claude (Anthropic), Bard and Gemini (Google), Perplexity (Perplexity AI), Grok (xAI)

31
Q

What are potential risks of LLMs (at the corporate/social scale) and how can they be mitigated?

A
  1. Hallucinations (incorrect information given as true): *implement robust data validation and fact-checking protocols, *regularly update models with accurate data, *educate users about AI limitations
  2. Biased and unfair models (AI models can inherit or amplify biases present in their training data, leading to unfair or discriminatory outcomes): *use diverse and representative datasets for training, *apply fairness algorithms and bias detection tools, *continuously monitor for biases
  3. Data privacy (risk of capturing or releasing private information): *build a compliant technology architecture, *use anonymization techniques
  4. Loss of competitiveness (if LLMs are not adopted, efficiency could be lost to competitors): *foster a culture of continuous learning, *invest in AI research, *partner with AI innovation leaders
  5. Intellectual property (AI can generate content that blurs the lines of intellectual property ownership, leading to legal challenges): *develop clear guidelines on IP for AI-generated content, *stay informed about evolving IP laws, *use AI ethically and responsibly
32
Q

What are the risks of LLMs (at the government scale) and how can they be mitigated?

A
  1. Economic and employment impact (job displacement in sectors where tasks done by humans are automated): *develop workforce training programs, *support the sectors most affected by AI
  2. Misinformation and propaganda (propagation of fake news to manipulate people): *implement stringent laws against AI-generated misinformation, *develop AI tools to detect fake news, *educate the public on media literacy and critical thinking, *develop AI regulation and close collaboration with AI entities
33
Q

What is prompt engineering?

A

A discipline focused on designing and refining text inputs for LLMs to obtain optimal results.

34
Q

What are key points about raw data?

A
  • In its basic unstructured form, raw data is completely useless and meaningless
    -it needs to be cleaned, processed, organized, analyzed, and visualized in order to become meaningful or informative
  • If this information leads to a deeper understanding of a given situation (the how and why), then it becomes knowledge and allows you to make evidence-based decisions
35
Q

What is primary data?

A

It’s data collected by you, through surveys, focus groups, interviews, observations, and experiments. It can be either qualitative or quantitative, it is specific to your needs, and you control the quality. The disadvantage is that it usually costs more and takes more time.

36
Q

What is secondary data?

A

Data collected by someone else. It can be either qualitative or quantitative and is usually cheap and quick to obtain. The key disadvantage is that the data can be too old and/or not specific enough for your needs.

37
Q

What is an analysis framework?

A

An analysis framework is a structured approach or methodology used to systematically analyze data, information, processes, systems, or phenomena in order to gain insights, draw conclusions, and make informed decisions. It provides a framework or structure for organizing, categorizing, and interpreting data or observations within a specific context or domain.

38
Q

What are other types of data (besides qualitative and quantitative)?

A
  • Audiovisual data (neither qualitative nor quantitative)
  • Geospatial data
  • PII (personally identifiable information): any type of information relating to a physical person that can lead to their identification
39
Q

What are the primary data collection methods?

A

-Observations
-Key informant interviews
-Participatory approaches
-Household surveys

40
Q

What are field observations?

A

Field observations can be used to rapidly collect different types of information. They don't require costly resources or detailed training, which makes observation a quick data collection process that is easy to implement. Observations can be structured (the observer looks for a specific behavior, object, or event) or unstructured (the observer looks at how things are done and what issues exist).

41
Q

What are key informant interviews?

A

A qualitative data collection method used to gather in-depth information from individuals who have specialized knowledge of or insights about a particular topic and/or a particular group of people.

42
Q

What is a focus group discussion?

A

A participatory and qualitative research method involving a small, diverse group of people who are part of a target population. Participants are engaged in structured discussions to explore their perceptions, opinions, and experiences on specific development-related issues or interventions.

43
Q

What is a household survey?

A

A quantitative data collection method where information is gathered from a sample of households, typically regarding demographic, economic, and social characteristics.

44
Q

What is random sampling?

A

This method involves selecting individuals from a population entirely by chance, where each member has an equal probability of being chosen.
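
A minimal sketch with pandas (toy population):

```python
import pandas as pd

population = pd.DataFrame({"person_id": range(1, 101)})

# Every member has an equal probability of being chosen
sample = population.sample(n=10, random_state=42)
```
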

45
Q

What is systematic sampling?

A

This method involves selecting every nth individual from the population list.
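
A minimal sketch with pandas (toy population list):

```python
import pandas as pd

population = pd.DataFrame({"person_id": range(1, 101)})

n = 10                         # sampling interval
sample = population.iloc[::n]  # every nth individual from the list
```
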

46
Q

What is stratified sampling?

A

In this method, the population is divided into smaller groups, or strata, based on shared characteristics (like age, income, or location). A random sample is then taken from each stratum.
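
A minimal sketch with pandas (region is the made-up stratum variable):

```python
import pandas as pd

population = pd.DataFrame({
    "person_id": range(1, 9),
    "region": ["N", "N", "N", "N", "S", "S", "S", "S"],
})

# Take a random sample of 2 from each stratum
sample = population.groupby("region", group_keys=False).sample(n=2, random_state=0)
```
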

47
Q

What is cluster sampling?

A

Cluster sampling involves dividing the population into clusters and then randomly selecting entire clusters for study.
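
A minimal sketch with pandas (village is the made-up cluster variable):

```python
import pandas as pd

population = pd.DataFrame({
    "person_id": range(1, 9),
    "village": ["A", "A", "B", "B", "C", "C", "D", "D"],
})

# Randomly select entire clusters, then keep all of their members
chosen = pd.Series(population["village"].unique()).sample(n=2, random_state=0)
sample = population[population["village"].isin(chosen)]
```
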

48
Q

What is convenience sampling?

A

In convenience sampling, individuals are selected based on their availability and willingness to participate.

49
Q

What is snowball sampling?

A

Used primarily in qualitative research, snowball sampling involves existing study subjects recruiting future subjects from among their acquaintances.

50
Q

What are the data collection biases?

A

-Sampling bias: occurs when the sample isn't representative of the population. For example, only surveying accessible or certain demographic groups can lead to skewed results
-Response bias: can happen if respondents give answers they think the interviewer wants to hear, rather than their true opinion
-Nonresponse bias: occurs when a significant portion of selected participants doesn't respond, and their nonresponse is correlated with the outcome of interest
-Data entry errors: mistakes in data entry can occur, especially with large datasets and manual entry
-Cultural bias: not considering cultural nuances can lead to misunderstandings or misinterpretations of data
-Observer bias: the presence of an observer can sometimes influence the behavior of those being observed
-Recall bias: in surveys or interviews, participants might not accurately remember past events or experiences, leading to inaccurate responses
-Selection bias: happens when the procedure used to select participants leads to a sample that is not representative of the population