Module 3 - Preparing and Cleaning Data for Analysis Flashcards
What does selecting relevant data for your analysis include?
Determining the type(s) of data that you need and finding a source for the data.
Explain this statement: “When selecting data for a project, it is important to focus on finding data that may provide insights into your original business question.”
For example, if you are seeking to understand demographic characteristics of people who bought Product X in the past year, you should only be using data that is directly related to Product X.
What should you do if the data you need to answer your questions isn’t readily available?
It may be necessary to establish new procedures to collect the data required for your analysis. Other times, it may involve combining data from multiple sources into a format that can be analyzed.
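Combining data from multiple sources can be sketched as follows. This is a minimal illustration, not a prescribed method: the two record lists, the field names (customer_id, age_group), and the helper combine are all hypothetical and the values are made up.

```python
# Hypothetical records from two separate sources (all values are made up).
sales = [
    {"customer_id": 1, "product": "Product X", "amount": 25.0},
    {"customer_id": 2, "product": "Product X", "amount": 40.0},
    {"customer_id": 3, "product": "Product Y", "amount": 15.0},
]
customers = [
    {"customer_id": 1, "age_group": "18-24"},
    {"customer_id": 2, "age_group": "35-44"},
]


def combine(sales_rows, customer_rows):
    """Attach each customer's age group to their sales rows, matching on customer_id."""
    by_id = {row["customer_id"]: row for row in customer_rows}
    combined = []
    for sale in sales_rows:
        demo = by_id.get(sale["customer_id"], {})
        # age_group is None when the second source has no matching customer.
        combined.append({**sale, "age_group": demo.get("age_group")})
    return combined


# Keep only the rows directly related to the business question about Product X.
product_x = [r for r in combine(sales, customers) if r["product"] == "Product X"]
```

Rows with no match in the second source keep a None age group rather than being dropped, so gaps in the data stay visible.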
Explain what can be done in this situation: “An entertainment producer is gathering data about the viability of a movie project.”
If the movie is an adaptation of a book, they need data on the sales of books by that author, within that genre, and across a variety of population demographics. They might compare the profitability of other movies with similar plots or characters, and their release dates, to determine the best time of the year to release a picture of that genre. Producers may also analyze data on the actors and locations that appear in the most successful recent movies to make casting and production decisions.
Enumerate the questions that you should ask yourself when selecting a data source.
Some questions that you should ask yourself when selecting a data source:
a. What data points are necessary to inform your analysis?
b. Do I already have access to this data, or must I find a dataset from another source?
c. Where are reliable and verifiable sources of this data?
d. How often is the relevant data collected and updated?
e. How is the data licensed for use, and is there a cost?
f. Is the data in a format that I can use, or convert to use, with my tools?
What are the two types of data that analysts work with?
Static data and streaming data
Data that is received and stored before analysis is performed on it is considered what?
Static data
When each event is processed and analyzed as it is received, and the subsequent results are used or stored, the data is referred to as what?
Streaming data
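The difference between the two can be illustrated with a small sketch. The readings and the function name running_average are hypothetical. The static case computes one result over a stored dataset, while the streaming case updates its result as each event arrives.

```python
from statistics import mean

# Static data: the full dataset is received and stored before analysis begins.
stored_readings = [10.0, 12.0, 11.0, 13.0]
static_average = mean(stored_readings)


def running_average(events):
    """Streaming: process each event as it is received and emit an updated result."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


# An iterator stands in for a live feed; each event yields a fresh average.
streaming_results = list(running_average(iter(stored_readings)))
```

After the last event, the streaming result matches the static result, but the streaming version never needed the whole dataset at once.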
is a data type used to represent text or sequences of characters. It can include letters, numbers, symbols, and spaces. Strings are commonly used to store names, addresses, sentences, and any other textual information. In programming, strings are typically enclosed within single (‘’) or double (“”) quotation marks.
String
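A short sketch of strings in Python, with made-up example values. Note that digits inside quotation marks are still text, which matters for identifiers such as postal codes.

```python
# Strings may use single or double quotation marks.
name = 'Ada Lovelace'
address = "12 Example St., Apt. 4"  # letters, digits, symbols, and spaces

# Stored as a string, this code keeps its leading zero.
zip_code = "02134"
zip_as_number = int(zip_code)  # converting to an integer drops the leading zero

# Strings can be joined and transformed.
label = name + ", " + address
shout = name.upper()
```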
is a whole number without any decimal or fractional parts. It represents a count or quantity that can be positive, negative, or zero. Integers are used to store values like counts of items, ages, and identifiers. They are typically represented without a decimal point.
Integer
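A brief sketch of integers in Python, using made-up values for counts, ages, and an identifier.

```python
item_count = 42          # a count of items
temperature_change = -7  # integers can be negative
balance = 0              # or zero
ages = [34, 27, 45]

total_age = sum(ages)       # arithmetic on integers stays exact
customer_id = int("1001")   # text from a file often has to be converted first
is_whole = isinstance(item_count, int)
```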
is a data type used to represent numbers with decimal points. Floats can represent a wide range of values, including both whole numbers and fractions. They are used for calculations that involve fractional values, such as scientific calculations, measurements, and financial computations.
A floating-point number, often referred to as a “float”
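A small sketch of float behavior in Python, with made-up values. Floats are stored as binary approximations, so some decimal fractions are not represented exactly. For money, one common option (shown here as an assumption about what a project might choose, not a requirement) is the standard library's decimal module.

```python
import math
from decimal import Decimal

measurement = 3.75     # a value with a fractional part
whole_as_float = 2.0   # a whole number stored as a float

total = 0.1 + 0.2
exactly_equal = total == 0.3                # False: binary rounding error
close_enough = math.isclose(total, 0.3)     # True: compare with a tolerance

# Decimal keeps exact decimal digits, which suits currency amounts.
price_total = Decimal("0.10") + Decimal("0.20")
price_exact = price_total == Decimal("0.30")
```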
This data type is used to represent specific points in time. It can include information about the year, month, day, hour, minute, second, and sometimes even milliseconds. This data type is crucial for recording events, scheduling, and performing temporal calculations. Formats for representing date and time values can vary depending on the programming language and system.
The date and time data type
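A sketch using Python's standard datetime module, with a made-up timestamp. It shows a value with sub-second detail, two different text formats, and simple temporal calculations.

```python
from datetime import datetime, timedelta

# Year, month, day, hour, minute, second, microsecond (250000 us = 250 ms).
event = datetime(2024, 3, 15, 14, 30, 45, 250000)

# The same point in time can be written in different formats.
iso_text = event.isoformat()
parsed = datetime.strptime("15/03/2024 14:30", "%d/%m/%Y %H:%M")

# Temporal calculations: shift a date, or measure the gap between two.
one_week_later = event + timedelta(days=7)
elapsed = event - parsed
```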
This data type represents binary values: either true or false, yes or no, on or off. Booleans are used to make logical comparisons and decisions in programming. They are often used in conditional statements and expressions to control the flow of a program. Booleans help determine the validity of conditions or statements.
The boolean data type
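A minimal sketch of Booleans controlling program flow in Python. The variable names and the age threshold are made up for illustration.

```python
age = 20
has_ticket = True

# A comparison produces a Boolean value.
is_adult = age >= 18

# Booleans combine with logical operators.
can_enter = is_adult and has_ticket

# A conditional statement uses the Boolean to choose a branch.
if can_enter:
    message = "Welcome"
else:
    message = "Entry denied"
```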
refers to data that is entered and maintained in defined fields within a file or record.
Structured data
is easily entered, classified, queried, and analyzed by a computer.
Structured data
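As a sketch of why structured data is easy to query, the example below reads a small made-up table with defined fields (name, age, city) using Python's standard csv module. The records are hypothetical.

```python
import csv
import io

# Each record has the same defined fields, named in the header row.
raw = "name,age,city\nAna,34,Lisbon\nBen,27,Oslo\nCara,45,Lisbon\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Query by field: who lives in Lisbon?
lisbon_names = [r["name"] for r in rows if r["city"] == "Lisbon"]

# Analyze a field: CSV values arrive as strings, so convert before averaging.
average_age = sum(int(r["age"]) for r in rows) / len(rows)
```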