Data analysis with R Programming Flashcards
What you have learnt so far
- Use structured thinking to define a problem and ask the right questions.
- Work with spreadsheets, databases, and tools like SQL to organize and transform data.
- Clean your data to make sure it has integrity before you analyze it.
- Create impactful data visualizations to illustrate key points.
- Craft a compelling story to communicate insights to stakeholders.
Computer programming
Giving instructions to a computer to perform an action or set of instructions.
What you will learn
- Introduction to programming languages.
- Explore main features and functions.
- Basic programming concepts in R.
- How to work with data in R.
- Clean, transform, visualize, report data in R.
R programming language
Used for statistical analysis, visualization, and other data analysis.
Programming Languages
- The words and symbols we use to write instructions for computers to follow.
Coding
- is writing instructions to the computer in the syntax of a specific programming language.
Programming languages
- R
- Python
- JavaScript
- SAS
- Scala
- Julia
Benefits of using programming languages
- Clarify the steps of your analysis.
- Save time.
- Reproduce and share your work.
R
A programming language frequently used for statistical analysis, visualization, and other data analysis.
Open Source
Code that is freely available and may be modified and shared by the people who use it.
R Benefits
- Accessible
- Data-centric
- Open source
- Community
Uses of R
- Reproducing your analysis
- Processing lots of data
- Creating data visualizations
Integrated Development Environment (IDE)
A software application that brings together all the tools you may want to use in a single place.
R code known as a pipe
Helps make a sequence of code easier to work with and read.
The Basic concepts of R
- Functions
- Comments
- Variables
- Data types
- Vectors
- Pipes
Functions (R)
A body of reusable code to perform specific tasks in R.
Argument (R)
Information that a function in R needs in order to run.
Variable (R)
A representation of a value in R that can be stored for use later during programming.
Vector (R)
A group of data elements of the same type stored in a sequence in R.
Pipe (R)
A tool in R for expressing a sequence of multiple operations, represented with "%>%".
Pipe (R) example
ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)
Data Structure
Data structure is a format for organizing and storing data.
Types of atomic vectors
- Logical
- Double
- Integer
- Character
Logical vector
TRUE/FALSE
Logical vector example
TRUE
Integer vector
Positive and negative whole values
Integer vector example
3L (the L suffix tells R to store the value as an integer)
Double vector
Decimal values
Double vector example
101.175
Character vector
String/ character values
Character vector example
"Coding"
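The four atomic vector types above can be checked with base R's typeof(). This is a minimal sketch; the variable names and values are made up for illustration.

```r
# One vector of each atomic type described above (hypothetical values)
flags  <- c(TRUE, FALSE, TRUE)     # logical
counts <- c(3L, -7L, 12L)          # integer (the L suffix marks an integer)
prices <- c(101.175, 2.5, 40)      # double
words  <- c("Coding", "in", "R")   # character

# typeof() reports how R stores each vector
typeof(flags)
typeof(counts)
typeof(prices)
typeof(words)

# A plain number without the L suffix is stored as a double
typeof(3)
```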
Data Frames
are the most common way of storing and analyzing data in R.
Matrix
is a two-dimensional collection of data elements. This means it has both rows and columns.
Operator
A symbol that names the type of operation or calculation to be performed in a formula.
Assignment operators
Used to assign values to variables and vectors.
Assignment operator Example
sales_1 <- c(67.00, 75.50, 90.00, 54.75)
Arithmetic Operators
Used to complete math calculations.
Arithmetic operators
+ (addition)
- (subtraction)
* (multiplication)
/ (division)
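As a rough illustration, the four arithmetic operators can be applied to the sales_1 vector from the assignment example. The operations chosen here (adding 10, dividing by 4, and so on) are arbitrary and only show that each operator works element by element.

```r
# Values reused from the assignment operator example
sales_1 <- c(67.00, 75.50, 90.00, 54.75)

sales_1 + 10     # add 10 to every element
sales_1 - 4.75   # subtract 4.75 from every element
sales_1 * 2      # double every element
sales_1 / 4      # divide every element by 4

# Operators combine with functions: total divided by the number of elements
average_sale <- sum(sales_1) / length(sales_1)
average_sale
```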
Function
A body of reusable code for performing specific tasks in R.
Argument
Information needed by a function in R in order to run.
Comment
Helpful text that describes or explains R code, preceded by #.
Variable
A representation of a value in R that can be stored for later use.
Data Types
An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform.
Vector
A group of data elements of the same type stored in a one-dimensional sequence in R.
Pipe
A tool in R for expressing a sequence of multiple operations, represented with %>%.
Packages (R)
Units of reproducible R code
Packages include:
- Reusable R functions
- Documentation about the functions
- Sample datasets
- Tests for checking your code.
CRAN (Comprehensive R Archive Network)
An online archive with R packages, source code, manuals, and documentation.
R Packages
Packages offer a helpful combination of code, reusable R functions, descriptive documentation, tests for checking operability, and sample data sets.
Tidyverse (R)
A system of packages in R with a common design philosophy for data manipulation, exploration, and visualization.
How do conflicts in RStudio happen?
Conflicts happen when packages have functions with the same names as other functions.
8 core tidyverse packages
- ggplot2
- tibble
- tidyr
- readr
- purrr
- dplyr
- stringr
- forcats
Conflict notifications
are just one type of message that can show up in the console.
Vignette
is documentation that acts as a guide to an R package.
Four Packages that are an essential part of the workflow for data analysts:
- ggplot2
- dplyr
- tidyr
- readr
ggplot2 (R)
Creates a variety of data visualizations by applying different visual properties to the data variables in R.
tidyr (R)
A package used for data cleaning to make tidy data.
readr (R)
Used for importing data.
dplyr (R)
Offers a consistent set of functions that help you complete some common data manipulation tasks.
Factors (R)
Store categorical data in R where the data values are limited and usually based on a finite group like country or year.
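A small sketch of a factor, using base R's factor() with made-up survey responses. The level names and their order are assumptions chosen for the example.

```r
# Hypothetical survey responses stored as plain text
responses <- c("low", "high", "medium", "low", "high", "low")

# factor() records the finite set of allowed values (its levels)
rating <- factor(responses, levels = c("low", "medium", "high"))

levels(rating)   # the categories, in the order given
table(rating)    # how many times each category occurs
```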
What you have learnt so far
- Fundamentals of R, from variables to vectors and more.
- Explored the different operations in R and saw how they can help you complete calculations.
- Checked out pipes and how they can make your programming more efficient.
- Unpacked packages to find out how they are a big part of what you can do in R.
Nested
In programming, describes code that performs a particular function and is contained within code that performs a broader function.
Nested function
A function that is completely contained within another function.
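For instance, mean() can be nested inside round(). This is only a sketch with made-up scores; the second half shows the same calculation written as two separate steps.

```r
scores <- c(82.4, 91.7, 76.3)   # hypothetical values

# mean() is nested inside round(): R evaluates the inner call first,
# then passes its result to the outer call.
rounded_mean <- round(mean(scores), 1)
rounded_mean

# The same calculation without nesting
avg <- mean(scores)
same_result <- round(avg, 1)
```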
Keyboard shortcuts for inserting pipe operators
- PC/Chromebook: Ctrl+Shift+M
- Mac: Cmd+Shift+M
Things to consider when using pipes:
- Add the pipe operator at the end of each line of the piped operation except the last one.
- Check your code after you have programmed your pipe.
- Revisit piped operations to check for parts of your code to fix.
Data Frame
A collection of columns
Data Frames rules
- Columns should be named
- Data stored can be many different types, like numeric, factor, or character.
- Each column should contain the same number of data items.
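The rules above can be seen in a small data frame built with base R's data.frame(). The column names and values are hypothetical.

```r
# Three named columns of different types, each with three items
products <- data.frame(
  name     = c("Desk", "Chair", "Lamp"),
  price    = c(120.00, 45.50, 19.99),
  in_stock = c(TRUE, FALSE, TRUE)
)

names(products)   # column names
nrow(products)    # number of rows
str(products)     # type of each column
```

If the columns had different lengths that R could not recycle evenly (for example three names and two prices), data.frame() would stop with an error.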
In the tidyverse
- Tibbles are like streamlined data frames
Tibbles
- Never change the data types of the inputs.
- Never change the names of your variables.
- Never create row names.
- Make printing easier.
Tidy data (R)
A way of standardizing the organization of data within R.
Tidy data standards
- Variables are organized into columns.
- Observations are organized into rows.
- Each value must have its own cell.
.CSV (comma-separated values)
A .csv file is a plain text file that contains a list of data. These files mostly use commas to separate (or delimit) data, but sometimes they use other characters, like semicolons.
.TSV (tab-separated values)
A .tsv file stores a data table in which the columns of data are separated by tabs. For example, a database table or spreadsheet data.
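As a rough illustration of the two delimiters, the sketch below reads the same small table from comma-separated and tab-separated text. It uses base R's read.csv() and read.delim() rather than readr so that it runs without extra packages, and the file contents are made up and supplied inline through the text argument.

```r
# Hypothetical file contents, supplied inline so the example is self-contained
csv_text <- "product,price\nDesk,120\nChair,45.5"
tsv_text <- "product\tprice\nDesk\t120\nChair\t45.5"

from_csv <- read.csv(text = csv_text)     # comma-delimited
from_tsv <- read.delim(text = tsv_text)   # tab-delimited

from_csv
from_tsv
```

With a real file, the file path would be passed as the first argument instead of text.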
.FWF (fixed-width files)
A .fwf file has a specific format that allows textual data to be saved in an organised fashion.
.LOG
a log file is a computer-generated file that records events from operating systems and other software programs.
Arithmetic Operators
let you perform basic math operations like addition, subtraction, multiplication, and division.
Relational Operators
Relational operators, also known as comparators, allow you to compare values. They identify how one R object relates to another, for example <, >, and <=.
Logical operators
allow you to combine logical values. Logical operators return a logical data type or Boolean (TRUE or FALSE).
Assignment Operators
let you assign values to variables, for example <-.
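A short sketch of relational and logical operators working together, using made-up ratings. The &, | and ! operators shown here are base R's element-wise AND, OR and NOT.

```r
ratings <- c(1.5, 3.2, 4.8, 2.0)   # hypothetical product ratings

# Relational operators compare each element and return TRUE or FALSE
low <- ratings < 2
low

# Logical operators combine or flip those logical values
mid_range <- ratings >= 2 & ratings < 4   # AND: both comparisons must hold
extreme   <- ratings < 2 | ratings > 4    # OR: either comparison may hold
not_low   <- !low                         # NOT: flips each value
```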
Organizational functions
Help you sort, filter, and summarize your data.
Cleaning functions
help you preview and rename data so it's easier to work with.
Transformational functions
help you separate and combine data, as well as create new variables.
Anscombe’s quartet
Four datasets that have nearly identical summary statistics.
Popular visualization packages in R
- ggplot2
- plotly
- lattice
- rgl
- dygraphs
- leaflet
- highcharter
- patchwork
- gganimate
- ggridges
The basics of ggplot2
The ggplot2 package lets you make high-quality, customizable plots of your data. ggplot2 is based on the grammar of graphics, which is a system for describing and building visualizations.
Benefits of ggplot2
- Create different types of plots
- Customize the look and feel of plots
- Create high-quality visuals
- Combine data manipulation and visualization
Our focus on core concepts in ggplot2
- Aesthetics
- Geoms
- Facets
- Labels and annotations
Aesthetic (R)
A visual property of an object in your plot.
Geom (R)
The geometric object used to represent your data.
Facets (R)
Let you display smaller groups, or subsets, of your data.
Labels and annotations (R)
Let you customize your plot
Mapping (R)
Matching up a specific variable in your dataset with a specific aesthetic.
Steps to Create your plot in R programming
1) Start with the ggplot function and choose a dataset to work with.
2) Add a geom function to display your data.
3) Map the variables you want to plot in the arguments of the aes() function.
Aesthetics for points
- x
- y
- color
- shape
- size
- alpha
Geom functions
- geom_point()
- geom_bar()
- geom_line()
Smoothing
enables the detection of a data trend when you can't easily notice one from the plotted data points.
Loess smoothing
The loess smoothing process is best for smoothing plots with fewer than 1,000 points.
Gam smoothing
Gam smoothing, or generalized additive model smoothing, is useful for smoothing plots with a large number of points, i.e. more than 1,000 points.
Facet functions
- facet_wrap()
- facet_grid()
To add a title to a chart
Use a label function: labs(title = "Average product rating").
Blue and yellow bars
To highlight underperforming products, use an aesthetics function: col = ifelse(x < 2, 'blue', 'yellow').
Bar chart
To create the bars on the chart, use a geom function: geom_bar().
Trend line
To create a trend line, use a geom function: geom_smooth().
Scatter plot chart
To create the scatter plot, use a geom function: geom_point().
Compare data
To compare data trends across average ratings, use a facets function: facet_wrap(~`Average Rating`).
Axis labels
To label the axes, use an aesthetics function: aes(x = `Average price (USD)`, y = Product).
Annotate
To add notes to a document or diagram to explain or comment upon it.
R Markdown
A file format for making dynamic documents with R.
Course Overview for R markdown
- An Overview for R Markdown
- How to install R Markdown in RStudio
- How to create an R Markdown document
- The Structure and components of the document
- How to insert and edit pieces of code called chunks in your document.
- The Process of exporting your documentation.
Markdown
A syntax for formatting plain text files.
Markdown formatting
- Wrap text in a single underscore, like _this_
- or a single asterisk, like *this*
Markdown report output
Text wrapped in a single underscore or asterisk is rendered in italics.
R Notebook
Lets users run your code and show the graphs and charts that visualize the code.
R Markdown file formats
- HTML, PDF, and Word documents
- Slide presentation
- Dashboard
HTML
The set of markup symbols or codes used to create a webpage.
Other notebook options
- Jupyter
- Kaggle
- Google Colab
Jupyter notebooks
are documents that contain computer code and rich text elements, such as comments, links, or descriptions of your analysis and results.
YAML
A language for data that translates it so it's readable.
Code Chunk
Code added in an .Rmd file
Delimiter
A character that indicates the beginning or end of a data item.
Code chunk delimiters
Three backticks followed by {r} to open a chunk, and three backticks to close it.
Code chunk keyboard shortcuts
PC/Chromebook: ctrl+alt+I
What we have explored so far
- What R Markdown is
- How to use R Markdown in RStudio to create .Rmd files
- Structure of these files and how to format them to make reports.
- What code chunks are and how to include them in your documentation.
- How to take all of your analyses and transform them from an .Rmd file into a report.
Case study
A common way for employers to assess job skills and gain insight into how you approach common data-related challenges.
Portfolio
Collection of case studies that can be shared with potential employers.
Best Practices for Case studies and Portfolios
1) Make sure your case study answers the questions being asked.
2) Make sure that you are communicating the steps you have taken and the assumptions you have made.
3) The best portfolios are personal, unique and simple.
4) Make sure your portfolio is relevant and presentable.