Midterm Flashcards
What are the most common data formats for data analysis?
- Structured data (databases)
- Semi-structured data (XML/JSON data, email, web pages)
- Unstructured data (audio, video, image data, natural language)
What are the types of qualitative data?
- Nominal data: data that can be labelled or classified into mutually exclusive categories within a variable (hair color, gender, ethnicity)
- Ordinal data: groups variables into ordered categories (grades, economic status)
What are the types of quantitative data?
- Discrete data: data that includes nondivisible figures and statistics that can be counted (number of people)
- Continuous data: data that can take any value, including decimals (height, length, temperature)
What are the alternative data sources?
Non-traditional or unconventional sources of information that can be used to gain insights and make informed decisions. These sources complement or supplement traditional data sources such as official statistics or financial reports.
Where to find alternative data sources?
- Development partners (CAF, European Bank, OECD, UNDP);
- Data Development Partnership (solving development challenges through data science collaboration between companies and international organizations);
- Data providers (GitHub, Google, Meta, LinkedIn, etc.)
What is the Humanitarian Data Exchange (HDX)?
An alternative data source: an open platform for sharing data across crises and organizations. The goal of HDX is to make humanitarian data easy to find and use for analysis.
What is ETL (Extract, Transform, Load)?
It’s a process used in data integration and data warehousing to extract data from various sources, transform it into a desired format, and load it into a target system or data repository for further analysis or storage.
What are the possible variants of the extract step in the ETL process?
- Database extraction: extracting data from relational databases such as Oracle, MySQL, or SQL Server. This involves querying the database using SQL statements to retrieve the required data.
- File extraction: extracting data from flat files, such as CSV, Excel, XML, JSON, or text files. The extraction process involves reading the file content and parsing it to extract the relevant data.
- Web scraping: extracting data from websites by crawling web pages and scraping the required information. This could involve using tools and libraries like BeautifulSoup or Scrapy to navigate web pages, locate specific elements, and extract the desired data.
- API extraction: extracting data from web APIs (Application Programming Interfaces). APIs provide a structured way to access and retrieve data from various sources.
- Log file extraction: extracting data from log files generated by systems, applications, or devices. Log files often contain valuable information that can be extracted and analyzed for troubleshooting, performance monitoring, or security purposes.
- Sensor or IoT Data Extraction: extracting data from sensors or Internet of Things devices. This involves capturing data from sensors or devices that collect and transmit real-time data, such as temperature sensors, GPS devices or smart meters.
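As an illustration of the file-extraction variant above, a minimal sketch using only Python's standard library; the CSV content and field names are invented, and in practice the text would be read from a file on disk:

```python
import csv
import io
import json

# Hypothetical flat-file content; a real extraction would open() a file instead.
csv_content = "country,population\nKenya,54985698\nNepal,30034989\n"

# Parse the CSV into a list of dictionaries, one per row.
rows = list(csv.DictReader(io.StringIO(csv_content)))

# The same records could equally arrive as a JSON file and be parsed the same way.
json_content = json.dumps(rows)
parsed = json.loads(json_content)
```

The extracted `rows` are plain Python dictionaries, a convenient handoff format for the transformation step that follows.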
What are the most common transformation processes?
Data cleaning, data integration, data enrichment, data standardization, aggregation and summarization, and feature engineering.
What is data cleaning?
The transformation process includes cleaning the extracted data to ensure its quality and consistency. This may involve handling missing values, removing duplicates, correcting data formatting errors, and validating data against predefined rules or constraints.
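A minimal data-cleaning sketch in plain Python, covering two of the operations named above (removing duplicates, handling missing values); the records and rules are invented for illustration:

```python
# Hypothetical extracted records: one exact duplicate and one missing age.
records = [
    {"name": "Ana", "age": "34"},
    {"name": "Ana", "age": "34"},   # duplicate to remove
    {"name": "Luis", "age": ""},    # missing value to handle
    {"name": "Mia", "age": "29"},
]

seen = set()
cleaned = []
for rec in records:
    key = (rec["name"], rec["age"])
    if key in seen:
        continue                    # drop exact duplicates
    seen.add(key)
    # Handle the missing value by substituting None; dropping the row or
    # imputing a default are equally common choices.
    age = int(rec["age"]) if rec["age"] else None
    cleaned.append({"name": rec["name"], "age": age})
```

In practice a library such as pandas (`drop_duplicates`, `fillna`) would do this at scale; the loop just makes the logic explicit.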
What is data standardization?
Transforming the data into a consistent format is often necessary for efficient analysis or loading into a target system. This may involve converting data types, harmonizing units of measurement, or applying standard coding schemes to ensure data consistency across different sources.
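A standardization sketch covering two of the operations above, converting data types and harmonizing units; the two sources and their fields are invented:

```python
# Source A reports height in centimetres as strings; source B in metres as floats.
source_a = [{"id": 1, "height": "172"}]
source_b = [{"id": 2, "height": 1.65}]

# Standardize both sources to a single type (float) and unit (metres).
standardized = []
for rec in source_a:
    standardized.append({"id": rec["id"], "height_m": int(rec["height"]) / 100})
for rec in source_b:
    standardized.append({"id": rec["id"], "height_m": float(rec["height"])})
```

After this step every record uses the same field name, type, and unit, so downstream analysis or loading does not need per-source logic.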
What is data integration?
If data is extracted from multiple sources, the transformation process involves integrating the data into a unified format. This may include resolving inconsistencies, merging overlapping data, and reconciling differences in data structures or naming conventions.
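An integration sketch: merging two invented sources keyed by country code, where the same attribute is named differently in each (`country` vs. `country_name`):

```python
# Two hypothetical sources with overlapping keys and inconsistent naming.
survey = {"KE": {"country": "Kenya", "respondents": 1200}}
finance = {"KE": {"country_name": "Kenya", "gdp_usd": 110_000_000_000}}

unified = {}
for code in set(survey) | set(finance):
    rec = {"code": code}
    if code in survey:
        rec["country"] = survey[code]["country"]
        rec["respondents"] = survey[code]["respondents"]
    if code in finance:
        # Reconcile the naming convention: country_name -> country.
        rec.setdefault("country", finance[code]["country_name"])
        rec["gdp_usd"] = finance[code]["gdp_usd"]
    unified[code] = rec
```

The same join is what a pandas `merge` or a SQL `JOIN` would express more concisely on larger datasets.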
What is feature engineering/data enrichment?
In some cases, additional data may need to be added to the extracted data to enhance its value or context. This could involve merging with external data sources, performing data lookups, or applying algorithms or advanced analytics techniques to derive new attributes or insights.
Examples of feature engineering?
Calculate the time difference between two events
Calculate the average age in a group of people
Find the closest supermarket to a certain point
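The first two examples above can be sketched with the standard library; the timestamps and ages are invented:

```python
from datetime import datetime
from statistics import mean

# Feature 1: time difference between two events (in hours).
start = datetime(2024, 1, 1, 9, 0)
end = datetime(2024, 1, 1, 17, 30)
hours_between = (end - start).total_seconds() / 3600

# Feature 2: average age in a group of people.
ages = [23, 35, 41, 29]
average_age = mean(ages)
```

Each derived value becomes a new attribute (column) that the raw data did not contain, which is the essence of feature engineering.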
Examples of data enrichment?
Using natural language processing (NLP) to extract insights from a tweet
Using machine learning algorithms to extract insights from a picture
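A toy stand-in for the NLP example above: deriving a sentiment attribute for a tweet from keyword lists. A real pipeline would use an NLP library or model; the word lists and tweet here are invented and only illustrate the idea of attaching a derived insight to a record:

```python
# Invented keyword lists; a real system would use a trained sentiment model.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "awful", "hate"}

def enrich(tweet: str) -> dict:
    words = {w.strip(".,!?").lower() for w in tweet.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    # The original record is returned with the derived attribute attached.
    return {"text": tweet, "sentiment": sentiment}

record = enrich("I love this new open data portal, it is great!")
```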
What is the load step in the ETL process?
This step refers to loading the transformed and processed data into a target system or data repository for storage, analysis, or further processing.
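A minimal load sketch using SQLite as the target repository; the table name, schema, and rows are invented, and an in-memory database stands in for a real warehouse:

```python
import sqlite3

# Transformed rows ready to load (hypothetical data).
rows = [("Kenya", 54985698), ("Nepal", 30034989)]

conn = sqlite3.connect(":memory:")  # in-memory database for illustration
conn.execute("CREATE TABLE population (country TEXT, people INTEGER)")
conn.executemany("INSERT INTO population VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM population").fetchone()[0]
```

The same pattern (create or truncate the target table, bulk-insert, commit) applies whether the target is SQLite, a production database, or a data warehouse.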
What is an API?
API (Application Programming Interface): a software intermediary that allows two applications to talk to each other. APIs are an accessible way to extract and share data within and across organizations.
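A sketch of consuming an API response: web APIs typically return JSON over HTTP, so extraction reduces to making a request and parsing the body. The payload below is a hand-written stand-in for what a real endpoint (queried with `urllib.request` or the `requests` library) would return; the field names are invented:

```python
import json

# Stand-in for the body an HTTP call to a web API might return.
response_body = '{"data": [{"indicator": "population", "value": 54985698}]}'

# Parse the JSON body into Python objects and pull out the values of interest.
payload = json.loads(response_body)
values = [item["value"] for item in payload["data"]]
```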
Who is a chief data scientist?
They focus on applying advanced analytics and machine learning algorithms to solve complex problems, build predictive models, and develop data-driven solutions. They have a deep understanding of statistical modeling, data mining, and programming in R and Python.
Who is a data engineer?
They focus on building and maintaining the infrastructure required for data processing and storage. They are responsible for data extraction, setting up databases and data pipelines, and ensuring data quality and reliability. They work with technologies like Hadoop, Spark, SQL, and data integration tools.
Who is a data analyst?
They act as a bridge between data scientists and data engineers on one side and business needs on the other. They interact with customers to understand their business challenges and solve them through data. They usually work with Power BI, Tableau, and reporting tools, and have data storytelling skills.