06_secondary data 2 Flashcards
What is secondary data?
Secondary data: Data gathered and recorded by someone else prior to and for a purpose other than the current project.
What are the advantages of secondary data?
- Available
- Faster and less expensive than acquiring primary data
- Requires no access to subjects
- Inexpensive—government data is often free
- May provide information otherwise not accessible
- Often: high data quality (particularly large samples and established measures)
- Availability of time series data
- Simplification of cross-cultural studies
What are the disadvantages of secondary data?
- No control over data quality and representativeness
- Lack of familiarity with variables
- Lack of key variables
- Uncertain validity
- Data not consistent with needs
- Inappropriate units of measurement
- (Probably) too old
What are the key characteristics of big data?
Volume:
- Huge amount of data created each day
- Data explosion
- “Big” is relative
Variety:
- Internal vs. external data sources
- Mix of three different data formats: structured, semi-structured, and unstructured data
Velocity:
- Speed of data production and processing
- Real-time information creates the highest value
- Quick response to big data can offer a competitive advantage
Veracity:
- Term coined by IBM
- Accuracy, quality, truthfulness, trustworthiness of data
Variability:
- Data flows can be inconsistent with periodic peaks
- Can be challenging to manage
Value proposition:
- Key purpose
- Assumption: Big data more valuable than small data
What is unstructured and structured data?
Unstructured: video, image, text, and voice data
Structured: numeric secondary data, categorical data, geographic data
Evaluate secondary data as a method for inferring causal relationships.
Distinct entities: maybe / none (depends on the conceptual elaboration)
Association: yes, easily identifiable
Temporal precedence: yes, secondary data are often collected at various points in time
Eliminating rival causal explanations: maybe, can be problematic
–>In the case of panel data, all time-invariant variables can be controlled for; otherwise problematic…
How can rival causal explanations be eliminated in experiments, surveys, and secondary data?
- Experiments: Can be eliminated by randomization
- Surveys: in theory, all rival explanations can be measured and included as controls, but this is difficult in practice
- Secondary data: measuring rival explanations is possible in principle, but in many cases the relevant variables are not available
What is longitudinal data?
Longitudinal data refers to data collected from the same subjects or units over a period of time.
What are the two forms of longitudinal data?
One unit of analysis: Time series analysis
Several units of analysis: Panel data analysis
What are the challenges of longitudinal data?
Challenges:
- Violation of the OLS assumption that residuals are uncorrelated
–>Different estimators necessary; panel estimators, time series estimators
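A minimal sketch (not from the slides) of a fixed-effects (within) panel estimator, assuming a pandas DataFrame with hypothetical columns unit, year, x, and y; demeaning per unit removes all time-invariant unit characteristics before the OLS fit.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: three units observed over three years
df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "year": [2019, 2020, 2021] * 3,
    "x":    [1.0, 2.0, 3.0, 2.0, 2.5, 3.5, 0.5, 1.0, 1.5],
    "y":    [2.1, 3.9, 6.2, 5.0, 5.9, 8.1, 1.2, 2.2, 3.1],
})

# Within transformation: demean x and y per unit; this removes all
# time-invariant unit-level variables, observed or unobserved
demeaned = df[["x", "y"]] - df.groupby("unit")[["x", "y"]].transform("mean")

# OLS on the demeaned data gives the fixed-effects (within) estimate
# (standard errors would still need a degrees-of-freedom correction)
X = demeaned[["x"]].to_numpy()
y = demeaned["y"].to_numpy()
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("within estimate of the effect of x on y:", beta[0])
```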
What are the advantages of longitudinal data?
- Allow distinguishing true loyalty effects from spurious effects
- Allow including lagged values of the dependent variable as predictors and analyzing novel research questions
- Repeated measurements allow addressing omitted variable bias in different ways (Chapter 4.7)
What is unstructured data?
Unstructured Data is a single data unit in which the information offers a relatively concurrent representation of its multifaceted nature without predefined organization or numeric values.
What are the advantages, challenges, and forms of unstructured data?
Advantages:
- Large amounts of unstructured data are available at companies
- Derive deeper and novel insights
Challenges:
- Complex data structure
Forms:
- Text, images, audio, video
What is text analytics?
Idea
* Can investigate “what” is being said and “how” it is said, using both qualitative and quantitative inquiries with various degrees of human involvement
* Based on content analysis (Chapter 5)
What is the distinction and purpose of text analytics?
Distinctions
* Text as a reflection of the producer
* Text’s impact on receivers
* Important to consider contextual influences on text
Purpose of text analytics
* Text for prediction
* Text for understanding
What are the approaches, tools, and application areas of text analytics?
Approaches
* Top-down
* Bottom-up
Tools for text analytics
* Entity extraction (e.g., dictionary approaches, sentiment analysis [mostly])
* Topic modeling (e.g., latent Dirichlet allocation [LDA]; see the sketch after this list)
* Relation extraction (e.g., deep learning, supervised machine learning)
Application areas
* Sentiment analysis (investor sentiment, consumer sentiment)
* Word-of-mouth
* Creation of positioning maps
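A minimal topic-modeling sketch (not from the slides) using scikit-learn's LatentDirichletAllocation; the example documents, the number of topics, and the vectorizer settings are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up example documents (e.g., online reviews)
docs = [
    "battery life is great and the screen is bright",
    "terrible battery, dies after one hour",
    "delivery was fast and the packaging was fine",
    "slow delivery, the package arrived damaged",
]

# Bag-of-words representation (bottom-up, no predefined dictionary)
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

# Fit an LDA model with an assumed number of topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Show the top words per topic
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
```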
What is the process of text analytics?
1. Develop a research question
2. Data collection
3. Data preprocessing
4. Construct definition
5. Operationalization
6. Run text analysis and measurement validation
7. Study the research question
Where can data for text analytics come from? (step 2 in the process)
Data collection
Potential data sources, e.g.:
- E-Mails, Online reviews
- Company websites, Financial reports
- Shareholder letters
- Online databases or newspapers
- Field interviews (Digital converters for printed text)
Access to data
- Receive access to a set of documents
- Web scraping (legal aspects important)
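A minimal web-scraping sketch (not from the slides) with requests and BeautifulSoup; the URL and the CSS class are placeholders, and the legal aspects (e.g., terms of service) still have to be checked before scraping any real site.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace with a page you are allowed to scrape
url = "https://example.com/reviews"

response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical markup: each review sits in a <div class="review">
reviews = [div.get_text(strip=True) for div in soup.find_all("div", class_="review")]
print(f"collected {len(reviews)} review texts")
```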
What is step 3, data preprocessing, about in text analytics?
Text is unstructured and messy, requires cleaning before data analysis
Typical steps:
- Tokenization: Break text into tokens (often words or sentences), e.g., text: “I love my ipod.” → tokens: “I” “love” “my” “ipod” “.”
- Cleaning: Remove non-meaningful text (e.g., HTML tags) and nontextual information
- Removing stop words (e.g., common words such as ‘a’ or ‘the’ that appear in most documents)
- Correct spelling mistakes: common spell checkers are available but require careful application (e.g., language in specific domains is not part of the spell checker, or worse, may be incorrectly “fixed”)
- Stemming and lemmatization:
  - Stemming: “car” and “cars” are stemmed to “car,” but “automobile” is not.
  - Lemmatization: “Car,” “cars,” and “automobile” are reduced to the lemma “automobile.”
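A minimal preprocessing sketch (not from the slides) covering cleaning, tokenization, stop-word removal, and stemming; it assumes the nltk package is installed, and the example text and the tiny stop-word list are illustrative only.

```python
import re
from nltk.stem import PorterStemmer  # assumes nltk is installed

text = "I love my iPod. <br> Visit www.example.com!"

# Cleaning: strip HTML tags and URLs (non-meaningful text)
text = re.sub(r"<[^>]+>", " ", text)
text = re.sub(r"\bwww\.\S+|\bhttps?://\S+", " ", text)

# Tokenization: break the text into word tokens
tokens = re.findall(r"[a-zA-Z]+", text.lower())

# Stop-word removal (tiny illustrative list; real lists are much longer)
stop_words = {"i", "my", "a", "the"}
tokens = [t for t in tokens if t not in stop_words]

# Stemming: reduce tokens to their stems
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print(stems)  # ['love', 'ipod', 'visit']
```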
What is step 4, “construct definition,” in text analytics about?
Qualitatively analyze a subsample of the data
Create a word list for each concept (dictionary development)
- Deductive: create a wordlist from theoretical concepts or categories
- Inductive: concordance (all words in a document listed in terms of frequency) and group words into categories; qualitative approaches (open and axial coding) to identify categories (see the sketch after this list)
- Have human coders check and refine dictionary
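A minimal sketch (not from the slides) of the inductive route: build a concordance (word frequency list) with collections.Counter and then group frequent words into candidate categories by hand; the documents and the resulting word list are made up.

```python
from collections import Counter

# Made-up documents (e.g., a subsample of online reviews)
docs = [
    "the battery is great but the screen is dim",
    "great battery life, poor screen",
]

# Concordance: all words listed in terms of frequency
counts = Counter(word.strip(",.") for doc in docs for word in doc.lower().split())
for word, n in counts.most_common(10):
    print(word, n)

# The researcher then groups frequent words into categories by hand,
# e.g., a hypothetical word list for a "product quality" construct:
quality_dictionary = {"great", "poor", "dim"}
```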
What is operationalization in the text analytics process about?
Conduct computer analysis to process the raw data (Diction, LIWC, WordStat, R, Python)
- Make measurement decisions based on the research question:
– Percent of all words
– Percent of words within the time period or category
– Percent of all coded words
– Binary (“about” or “not about” a topic)
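A minimal operationalization sketch (not from the slides), assuming a hypothetical word list for a “positive tone” construct; it computes two of the measures named above for one document: dictionary hits as a percent of all words, and a binary “about / not about” indicator.

```python
# Hypothetical dictionary for a "positive tone" construct
positive_words = {"great", "love", "excellent", "good"}

doc = "the battery is great and i love the screen"
tokens = doc.lower().split()

hits = sum(1 for t in tokens if t in positive_words)

percent_of_all_words = 100 * hits / len(tokens)  # percent of all words
is_about_topic = hits > 0                        # binary: "about" or "not about"

print(percent_of_all_words, is_about_topic)      # 22.2..., True
```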
What is important in the step of running text analysis and measurement validation?
Establish and/or enhance validity:
- Construct validity, e.g.,
- Pull a subsample and have coded by research assistant or researcher
- Calculate Krippendorff’s alpha or a hit/miss rate (see the sketch after this list)
- Survey participants evaluate the dictionary
- Concurrent validity (e.g., calculate and compare multiple textual measures)
- Discriminant validity
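A minimal validation sketch (not from the slides): compare the computer-generated codes for a subsample against a human coder and report a simple hit/miss rate (percent agreement); the codes are made up, and Krippendorff's alpha would additionally correct for chance agreement.

```python
# Made-up codes for a subsample of 10 documents (1 = positive tone, 0 = not)
computer_codes = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
human_codes    = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]

# Hit/miss rate: share of documents on which computer and human coder agree
hits = sum(c == h for c, h in zip(computer_codes, human_codes))
hit_rate = hits / len(human_codes)
print(f"hit rate: {hit_rate:.0%}")  # 80% in this example
```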
last step of text analytics process: “Study the research question”
Choose the appropriate statistical method for the research question:
– Analysis of variance (ANOVA)
– Regression analysis
– Multidimensional scaling
– Correlational analysis
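A minimal sketch (not from the slides) of the last step: regress an outcome on a textual measure with ordinary least squares via numpy; the sentiment scores and the outcome (e.g., sales) are hypothetical.

```python
import numpy as np

# Hypothetical data: sentiment score per review and an outcome (e.g., sales)
sentiment = np.array([0.2, 0.5, 0.1, 0.8, 0.6])
sales     = np.array([10.0, 14.0, 9.0, 19.0, 15.0])

# Simple OLS regression: sales = a + b * sentiment
X = np.column_stack([np.ones_like(sentiment), sentiment])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(f"intercept = {coef[0]:.2f}, slope = {coef[1]:.2f}")
```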
What are further tools and further issues of text analytics?
Further tools:
* Topic Modeling
* Relation extraction
Further issues:
Problems:
- Can miss subtleties and cannot code finer shades of meaning, homonyms, or sarcasm
- Cultural issues and language affect documents
- Semantic relationships (instead of a bag-of-words approach)
- Joint analysis of text and other unstructured data (e.g., image, audio)
What are the key challenges in evaluation of secondary data analyses?
Problems with secondary data research:
- Complexity of data and analyses
- Data accuracy (and age)
- Measurement comparability
- Data lumping and level of aggregation
- Predictive accuracy
- Measurement reliability over time