The Data Science Handbook Flashcards
What is data wrangling?
The nitty-gritty task of cleaning data and getting it into a standard format that is suitable for downstream analysis
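As a rough illustration of what that cleaning can look like, here is a minimal sketch using only the Python standard library. The records, field names, and the two date formats are assumptions made up for this example; real data will need its own rules.

```python
from datetime import datetime

# Hypothetical raw records with inconsistent formatting (invented for illustration).
raw_rows = [
    {"name": "  Alice ", "signup": "2021-03-05", "spend": "$1,200.50"},
    {"name": "BOB", "signup": "03/07/2021", "spend": "300"},
    {"name": "carol", "signup": "2021-03-09", "spend": ""},
]

def parse_date(text):
    """Try the date formats we assume appear in the data; return an ISO string or None."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # format not recognized

def parse_amount(text):
    """Strip currency symbols and thousands separators; empty strings become None."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None

def wrangle(rows):
    """Return the rows in one standard format: trimmed names, ISO dates, float amounts."""
    return [
        {
            "name": row["name"].strip().title(),
            "signup": parse_date(row["signup"]),
            "spend": parse_amount(row["spend"]),
        }
        for row in rows
    ]

clean_rows = wrangle(raw_rows)
```

After this step every row has the same shape and types, which is what downstream analysis generally expects.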
What is exploratory analysis?
A stage of analysis that focuses on exploring the data to generate hypotheses about it. Exploratory data analysis (EDA) relies heavily on visualization
What is a feature?
A feature is a small piece of data, usually a number or label, that is extracted from your data and characterizes some entity in your dataset. For example, you might extract the average word length or the number of characters from a text document.
Feature extraction means taking your raw datasets and distilling them down into a table of rows and columns (tabular data), with each row corresponding to some real-world entity and each column giving a single piece of information (generally a number) that describes the entity.
Extracting good features is the most important thing for getting your analysis to work
Feature extraction is the most creative part of data science and the one most closely tied to domain expertise; typically a really good feature will correspond to some real-world phenomenon. Data scientists should work closely with domain experts and understand what these phenomena mean and how to distill them into numbers.
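The text-document example above can be sketched in a few lines of Python. This is only an illustration: the documents, the function name `extract_features`, and the choice of splitting words on whitespace are assumptions, not part of the handbook.

```python
# Hypothetical documents keyed by an ID (invented for illustration).
docs = {
    "doc_a": "the cat sat on the mat",
    "doc_b": "feature extraction turns raw text into numbers",
}

def extract_features(doc_id, text):
    """Distill one raw document into one row of numeric features."""
    words = text.split()  # assumption: words are separated by whitespace
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "doc_id": doc_id,
        "num_chars": len(text),
        "num_words": len(words),
        "avg_word_length": avg_word_length,
    }

# One row per document (the entity), one column per feature: tabular data.
feature_table = [extract_features(doc_id, text) for doc_id, text in docs.items()]
```

Each dictionary in `feature_table` is one row, and each key other than `doc_id` is one column describing that document.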
What is a PRD?
A product requirements document is a document that specifies exactly what functionality a planned product should have
What is production code?
Software that is run repeatedly and maintained. It especially refers to the source code of a software product that is distributed to other people
What is an SOW?
A statement of work is a document that specifies what work is to be done in a project, relevant timelines, and specific deliverables
What is a target variable?
A feature you are trying to predict in machine learning. Sometimes it is already in your data and other times you must construct it yourself. For example, if you want to figure out whether a client’s customers will lose their brand loyalty, there may be no loyalty field in the data; it is just a log of various customer interactions and transactions, and you need to figure out how to measure “loyalty.”
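One way to construct such a target, sketched under loudly stated assumptions: here "loyal" is defined (arbitrarily, for illustration) as a customer who bought before a cutoff date and bought again after it. The log, the cutoff, and the function name are all hypothetical; a real project would settle the definition with domain experts.

```python
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date), invented for illustration.
transactions = [
    ("c1", date(2023, 1, 10)),
    ("c1", date(2023, 7, 2)),
    ("c2", date(2023, 2, 15)),
    ("c3", date(2023, 3, 1)),
    ("c3", date(2023, 9, 20)),
]

CUTOFF = date(2023, 6, 30)  # assumed split between the "before" and "after" periods

def build_loyalty_target(log, cutoff):
    """Label each customer seen on or before the cutoff: 1 if they bought again later, else 0."""
    before, after = set(), set()
    for customer_id, purchase_date in log:
        if purchase_date <= cutoff:
            before.add(customer_id)
        else:
            after.add(customer_id)
    # Only customers with activity before the cutoff receive a label.
    return {cid: int(cid in after) for cid in sorted(before)}

target = build_loyalty_target(transactions, CUTOFF)
```

The resulting dictionary is the constructed target variable: it did not exist as a field in the log and had to be derived from it.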
What is the data science roadmap?
The data science roadmap: 1. Frame the problem 2. Understand the data 3. Extract features 4. Model and analyze 5a. Present results to a human (give business insights, likely in the form of a deck or report) OR 5b. Deploy code (the deliverable is a piece of software that performs some analytics work, e.g. an implementation of an algorithm)
What is Excel best used for?
Simple data analysis
What is Tableau best for?
Visualizing data in relational databases. It’s pretty limited in its functionality but makes beautiful graphics.
What is Weka? What are its advantages?
A tool for applying pre-canned machine learning algorithms to datasets that are already well formatted and contain relevant features.
Weka's advantage is that it is a user-friendly graphical front end (GUI) for powerful tools written in Java. Models you build in Weka during initial exploration of the data can then be used in your production code, especially if that code is written in Java, so the move from exploration to production is smooth.
What is a GUI?
Graphical user interface. It provides a more user-friendly way to interact with software compared to text-based interfaces. It allows users to perform tasks by clicking on visual elements rather than entering commands manually. GUIs are commonly used in applications, operating systems, and software tools to enhance the user experience and make it more intuitive.
What do Excel, Tableau, and Weka all have in common?
They all assume the data is in tabular form to begin with. Because each dataset requires its own idiosyncratic data wrangling, you need to be creative and flexible in what features you extract from raw data, which is why you need to be proficient in at least one programming language
What is Python? (4)
Best programming language for general-purpose use and a popular choice among data scientists. Balances the flexibility of a conventional scripting language with the numerical muscle of a good mathematics package.
Released in 1991.
High-level scripting language with functionality similar to Perl and Ruby, with a clean, self-consistent syntax.
Has open-source technical computing libraries that make it powerful for analytics
Designed for computer programmers and augmented with libraries for technical computing
What is R?
Another popular programming language. While Python is designed for computer programmers and augmented with libraries for technical computing, R was designed by and for statisticians and is natively integrated with graphics capabilities and extensive statistical functions.