Introduction to Data Science Flashcards
Web scraping tools: ready-made
import.io, ScraperWiki, Tabula, Google Sheets (=IMPORTHTML), Excel
Web scraping tools: custom
R, Python, Bash, Java, PHP
Steps to explore data
Review data
Check assumptions
Check for anomalies
Data Suggestions
Exploratory Graphics
Coding: R, Python, JavaScript
Applications: Tableau, Qlik
Exploratory Graphics
Bar Charts: for categories and can be grouped
Box Plots: for quantitative variables, in quartiles, show outliers
Histograms: show shape of distribution
Scatter plot matrices: show pairwise relationships among several variables
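The numbers behind a box plot can be computed directly. This is a minimal sketch using only the standard library; the sample values are made up, and the 1.5 × IQR fence for flagging outliers is the common convention, assumed here rather than taken from the card.

```python
import statistics

def box_plot_summary(values):
    """Return (min, Q1, median, Q3, max) and the points outside the 1.5*IQR fences."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return (min(values), q1, q2, q3, max(values)), outliers

data = [2, 4, 4, 5, 6, 7, 8, 9, 40]  # hypothetical sample with one extreme value
summary, outliers = box_plot_summary(data)
print(summary)   # quartile positions a box plot would draw
print(outliers)  # values a box plot would show as separate points
```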
Questions to ask in exploratory process
Do you have what you need?
Are there clumps or gaps?
Are there exceptional cases?
Are there errors in the data?
Exploratory Statistics
Robust Statistics: stable, less affected by outliers, skewness, kurtosis
Resampling: empirical estimate of sampling variability; jackknife, bootstrap, permutation, cross-validation
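Both ideas can be illustrated in a few lines. The sketch below uses a made-up sample with one outlier, compares the mean with the more robust median, and then bootstraps the median to estimate its sampling variability. The function name `bootstrap_medians`, the number of resamples, and the percentile interval are illustrative choices, not something the cards specify.

```python
import random
import statistics

data = [12, 13, 13, 14, 15, 15, 16, 120]  # hypothetical sample with one outlier

mean = statistics.mean(data)      # pulled strongly toward the outlier
median = statistics.median(data)  # robust: barely affected by it

def bootstrap_medians(values, n_resamples=2000, seed=0):
    """Resample with replacement and record the median of each resample."""
    rng = random.Random(seed)
    return [
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]

medians = sorted(bootstrap_medians(data))
# Rough 95% percentile interval for the median
lower = medians[int(0.025 * len(medians))]
upper = medians[int(0.975 * len(medians))]
print(mean, median, (lower, upper))
```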
Advantages of Excel
Good for browsing, sorting, rearranging, getting a visual picture, finding and replacing
More uses: formatting, transposing, making pivot tables
SQL
Three things to know about SQL. First, it is the language used for getting data out of relational databases. Second, a few commands go a very long way. Third, the data is usually pulled out of the database and then passed to other programs such as R or Python for analysis.
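As a small illustration of pulling data out of a relational database and handing it to Python, the sketch below uses the standard-library sqlite3 module with an in-memory database. The `sales` table and its rows are invented for the example.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 20.0), ("east", 5.0)],
)

# A few commands go a long way: SELECT, GROUP BY, ORDER BY
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
conn.close()

# The rows are now ordinary Python tuples, ready for further analysis
print(rows)
```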
HTML
HTML, or HyperText Markup Language, is the language of web pages; it marks up what is body text, what the headings are, and where the links go. The information on a web page is styled with CSS, which stands for Cascading Style Sheets.
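Because the markup labels headings and links explicitly, a program can pick them out. This is a minimal sketch with the standard-library html.parser module; the page content, the URL, and the class name `OutlineParser` are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page: a CSS rule in <style>, one heading, one link
PAGE = """<html><head><style>h1 { color: navy; }</style></head>
<body><h1>Data Science</h1>
<p>See the <a href="https://example.com/notes">notes</a>.</p></body></html>"""

class OutlineParser(HTMLParser):
    """Collect the text of <h1> headings and the targets of <a> links."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_heading = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings.append(data.strip())

parser = OutlineParser()
parser.feed(PAGE)
print(parser.headings, parser.links)
```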
XML
XML, or Extensible Markup Language, is a data encoding that is both human-readable and machine-readable, a combination that not every data format offers.
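The same text a person can read is straightforward for a program to parse. A minimal sketch with the standard-library xml.etree.ElementTree module follows; the `cards` document and its element and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: readable by eye, parseable by machine
DOC = """<cards>
  <card topic="SQL"><term>SELECT</term></card>
  <card topic="HTML"><term>a</term></card>
</cards>"""

root = ET.fromstring(DOC)
# Pull each card's topic attribute and the text of its <term> child
records = [(card.get("topic"), card.findtext("term")) for card in root.iter("card")]
print(records)
```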