06_secondary data 2 Flashcards by Lukas Fahrländer

What is secondary data?

Secondary data: Data gathered and recorded by someone else prior to and for a purpose other than the current project.

What are the advantages of secondary data?

Available
Faster and less expensive than acquiring primary data
Requires no access to subjects
Inexpensive—government data is often free
May provide information otherwise not accessible
Often: high data quality (particularly large samples and established measures)
Availability of time series data
Simplification of cross-cultural studies

What are the disadvantages of secondary data?

No control over data quality and representativeness
Lack of familiarity with variables
Lack of key variables
Uncertain validity
Data not consistent with needs
Inappropriate units of measurement
(Probably) too old

What are the key characteristics of big data?

Volume:
- Huge amount of data created each day
- Data explosion:
- “Big” is relative

Variety:
- Internal vs. external data sources
- Mix of three different data formats:

Velocity:
- Speed of data production and processing
- Real-time information creates the highest value
- Quick response to big data can offer a competitive advantage

Veracity:
- Term coined by IBM
- Accuracy, quality, truthfulness, trustworthiness of data

Variability:
- Data flows can be inconsistent with periodic peaks
- Can be challenging to manage

Value proposition:
- Key purpose
- Assumption: Big data is more valuable than small data

What are unstructured and structured data?

Unstructured: Video, image, text data, voice

Structured: Numeric secondary data, categorical data, geographic data

Evaluate secondary data as a method for inferring causal relationships.

Distinct entities: maybe, none (depends on the conceptual elaborations)

Association: yes, easily identifiable

Temporal precedence: yes, secondary data are often collected at various points in time

Eliminating rival causal relationships: maybe, can be problematic
→ In the case of panel data, control of all time-invariant variables is possible; otherwise problematic…
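
The panel-data point above can be illustrated with a within (fixed-effects) transformation: subtracting each unit's own mean removes everything that is constant over time for that unit. The following is a minimal Python sketch, not a full panel estimator. The numbers are invented toy data, and the names `panel`, `within_transform`, and `slope_through_origin` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy panel (invented numbers): (unit, period, x, y).
# Built as y = 2 * x + unit effect (10 for firm A, 50 for firm B), no noise.
panel = [
    ("A", 1, 1.0, 12.0),
    ("A", 2, 2.0, 14.0),
    ("A", 3, 3.0, 16.0),
    ("B", 1, 1.0, 52.0),
    ("B", 2, 3.0, 56.0),
    ("B", 3, 5.0, 60.0),
]


def within_transform(rows):
    """Subtract each unit's mean from x and y (demeaning)."""
    by_unit = defaultdict(list)
    for unit, _period, x, y in rows:
        by_unit[unit].append((x, y))
    demeaned = []
    for obs in by_unit.values():
        x_bar = mean(x for x, _ in obs)
        y_bar = mean(y for _, y in obs)
        demeaned.extend((x - x_bar, y - y_bar) for x, y in obs)
    return demeaned


def slope_through_origin(pairs):
    """OLS slope for already demeaned data: sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / sxx


# The time-invariant unit effects (10 and 50) drop out after demeaning,
# so the slope recovers the coefficient used to build the toy data.
beta_within = slope_through_origin(within_transform(panel))
print(beta_within)
```

Variables that change over time within a unit are not removed by this transformation, which is why rival explanations can remain a problem even with panel data.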

How can rival causal explanations be eliminated in experiments, surveys, and secondary data?

Experiments: Can be eliminated by randomization
Survey: In theory, rival explanations can be eliminated by asking about all of them and using them as controls, but this is difficult
Secondary data: Controlling for rival explanations is possible if they were measured; in many cases, however, they are not available in the data

What is longitudinal data?

Longitudinal data refers to data collected from the same subjects or units over a period of time.

What are the two forms of longitudinal data?

One unit of analysis: Time series analysis

Several units of analysis: Panel data analysis

What are the challenges of longitudinal data?

Challenges:
- Violation of the OLS assumption “Residuals are uncorrelated”

→ Different estimators are necessary: panel estimators, time series estimators
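
One common way to check the “uncorrelated residuals” assumption is the Durbin-Watson statistic, which is not named on the card and is used here only as an illustration. It compares successive residuals: values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. A minimal Python sketch with invented residual series (the function and variable names are hypothetical):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Invented residuals that drift slowly: neighbouring values are similar,
# which is the typical pattern of positive autocorrelation.
drifting = [1.0, 0.8, 0.6, 0.4, 0.2, -0.2, -0.4, -0.6]

# Invented residuals that flip sign every period: negative autocorrelation.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(durbin_watson(drifting))     # well below 2
print(durbin_watson(alternating))  # well above 2
```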

What are the advantages of longitudinal data?

Allows true loyalty effects to be distinguished from spurious effects
Allows lagged values of the dependent variable to be included as predictors, and novel research questions to be analyzed
Repeated measurements allow omitted variable bias to be addressed in different ways (Chapter 4.7)
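
Including a lagged dependent variable requires pairing each observation with the same unit's value from the previous period. A minimal Python sketch of that data step, using invented purchase indicators and hypothetical names (`purchases`, `add_lagged_outcome`):

```python
from itertools import groupby
from operator import itemgetter

# Toy panel (invented): (customer, period, bought_brand) with 1 = bought.
purchases = [
    ("c1", 1, 0), ("c1", 2, 1), ("c1", 3, 1),
    ("c2", 1, 1), ("c2", 2, 0), ("c2", 3, 0),
]


def add_lagged_outcome(rows):
    """Return (unit, period, y, y_lag) rows.

    The first period of each unit is dropped because it has no lag.
    """
    ordered = sorted(rows, key=itemgetter(0, 1))  # sort by unit, then period
    lagged = []
    for unit, group in groupby(ordered, key=itemgetter(0)):
        previous = None
        for _unit, period, y in group:
            if previous is not None:
                lagged.append((unit, period, y, previous))
            previous = y
    return lagged


for row in add_lagged_outcome(purchases):
    print(row)
```

The resulting `y_lag` column can then enter a model as a predictor of `y`.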

What is unstructured data?

Unstructured Data is a single data unit in which the information offers a relatively concurrent representation of its multifaceted nature without predefined organization or numeric values.

What are the advantages, challenges, and forms of unstructured data?

Advantages:
- Large amounts of unstructured data are available at companies
- Can yield deeper and novel insights

Challenges:
- Complex data structure

Forms:
- Text, images, audio, video

What is text analytics?

Idea
* Can investigate “what” is being said and “how” it is said, using both qualitative and quantitative inquiries with various degrees of human involvement
* Based on content analysis (Chapter 5)

What is the distinction and purpose of text analytics?

Distinctions
* Text as a reflection of the producer
* Text’s impact on receivers
* Important to consider contextual influences on text

Purpose of text analytics
* Text for prediction
* Text for understanding

What are the approaches, tools, and application areas of text analytics?

Approaches
* Top-down
* Bottom-up

Tools for text analytics
* Entity extraction (e.g., dictionary approaches, sentiment analysis [mostly])
* Topic modeling (e.g., latent Dirichlet allocation [LDA])
* Relation extraction (e.g., deep learning, supervised machine learning)

Application areas
* Sentiment analysis (investor sentiment, consumer sentiment)
* Word-of-mouth
* Creation of positioning maps

What is the process of text analytics?

1. Develop a research question
2. Data collection
3. Data preprocessing
4. Construct definition
5. Operationalization
6. Run text analysis and measurement validation
7. Study the research question

Where can data for text analytics come from? (Step 2 of the process)

Data collection
Potential data from: e.g.,
- Emails, online reviews
- Company websites, Financial reports
- Shareholder letters
- Online databases or newspapers
- Field interviews (Digital converters for printed text)

Access to data
- Receive access to a set of documents
- Web scraping (legal aspects important)

What is step 3, data preprocessing, about in text analytics?

Text is unstructured and messy and requires cleaning before data analysis.

Typical steps:
Tokenization: Break text into tokens (often words or sentences), e.g., Text: “I love my ipod.” → Tokens: “I” “love” “my” “ipod” “.”
Cleaning: Remove non-meaningful text (e.g., HTML tags) and nontextual information
Removing stop words (e.g., common words such as ‘a’ or ‘the’ that appear in most documents)
Correcting spelling mistakes: common spell checkers are available but require careful application (e.g., language in specific domains may not be part of the spell checker or, worse, may be incorrectly “fixed”)
Stemming and lemmatization:
- Stemming: “car” and “cars” are stemmed to “car,” but “automobile” is not.
- Lemmatization: “car,” “cars,” and “automobile” are reduced to the lemma “automobile.”
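
The preprocessing steps above can be sketched in a few lines of standard-library Python. This is only an illustration under simplifying assumptions: the stop-word list is a tiny invented sample, the HTML removal is a crude regular expression, and `naive_stem` is a toy plural stripper rather than a real stemmer such as Porter's. Spell correction and lemmatization are left out because they need external resources.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "i", "my", "is", "are"}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def strip_html(text):
    """Crudely replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def naive_stem(word):
    """Toy stemmer: strip a plural 's' (so 'cars' becomes 'car')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Clean, tokenize, lowercase, drop punctuation and stop words, stem."""
    tokens = tokenize(strip_html(text))
    words = [t.lower() for t in tokens if t.isalpha()]
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


print(tokenize("I love my ipod."))
print(preprocess("<p>The cars are fast.</p>"))
```

Note that the toy stemmer maps “cars” to “car” but leaves “automobile” untouched, which mirrors the stemming example on the card.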

What is step 4, “Construct definition,” in text analytics about?

Qualitatively analyze a subsample of the data

Create a word list for each concept (dictionary development)
- Deductive: create a wordlist from theoretical concepts or categories
- Inductive: build a concordance (all words in a document listed in terms of frequency) and group words into categories; use qualitative approaches such as open and axial coding to identify categories

Have human coders check and refine the dictionary
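
A concordance in the sense used above (words listed by frequency) and a first-pass dictionary can be sketched as follows. The two documents, the candidate word lists, and the function names are invented for illustration; real dictionary development would still involve qualitative reading and human coders.

```python
import re
from collections import Counter

# Invented mini-corpus.
documents = [
    "Great product, great price.",
    "The price is too high and the service is slow.",
]


def concordance(docs):
    """Return (word, count) pairs for the corpus, most frequent first."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return counts.most_common()


def build_dictionary(frequencies, candidate_words):
    """Keep only the candidate words that actually occur in the corpus."""
    seen = {word for word, _count in frequencies}
    return {
        concept: sorted(words & seen)
        for concept, words in candidate_words.items()
    }


word_freq = concordance(documents)

# Deductive starting point (hypothetical categories and words).
candidates = {
    "price": {"price", "cost", "expensive"},
    "quality": {"great", "slow", "poor"},
}
dictionary = build_dictionary(word_freq, candidates)

print(word_freq[:5])
print(dictionary)
```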

What is operationalization in the text analytics process about?

Conduct computer analysis to compute the raw data (Diction, LIWC, WordStat, R, Python)

Make measurement decisions based on the research question:
– Percent of all words
– Percent of words within the time period or category
– Percent of all coded words
– Binary (“about” or “not about” a topic)
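
Three of the measurement options above can be computed directly from a token list and a dictionary. The sketch below uses an invented review and a hypothetical two-concept dictionary; the per-period or per-category percentage is omitted because it only changes which tokens are passed in.

```python
# Hypothetical dictionary: concept -> set of words that indicate it.
DICTIONARY = {
    "positive": {"great", "good"},
    "negative": {"terrible", "bad"},
}


def measure(tokens, dictionary):
    """Compute three dictionary-based measures per concept."""
    counts = {
        concept: sum(token in words for token in tokens)
        for concept, words in dictionary.items()
    }
    total_words = len(tokens)
    total_coded = sum(counts.values())
    result = {}
    for concept, count in counts.items():
        result[concept] = {
            # Share of all words in the text that belong to the concept.
            "pct_all_words": 100 * count / total_words if total_words else 0.0,
            # Share of all dictionary hits that belong to the concept.
            "pct_coded_words": 100 * count / total_coded if total_coded else 0.0,
            # Is the text "about" the concept at all?
            "binary": count > 0,
        }
    return result


# Invented example text, already lowercased and tokenized.
review = "great service but terrible price and great staff".split()
scores = measure(review, DICTIONARY)
print(scores)
```

Which measure fits depends on the research question: percentages of all words are sensitive to text length, while percentages of coded words describe the balance between concepts.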

What is important in the step of running text analysis and measurement validation?

Establish and/or enhance validity:
- Construct validity, e.g.,
  - Pull a subsample and have it coded by a research assistant or researcher
  - Calculate Krippendorff’s alpha or a hit/miss rate
  - Have survey participants evaluate the dictionary
- Concurrent validity (e.g., calculate and compare multiple textual measures)
- Discriminant validity
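
The hit rate and Krippendorff's alpha mentioned above can be computed for the simplest case: two coders (for example, the computer dictionary and a research assistant), nominal codes, and no missing values. The sketch below makes exactly those assumptions; the codes are invented and the function names are hypothetical. Other data types or more coders need the general form of alpha.

```python
from collections import Counter
from itertools import permutations


def hit_rate(coder_a, coder_b):
    """Share of units on which both coders assigned the same code."""
    hits = sum(a == b for a, b in zip(coder_a, coder_b))
    return hits / len(coder_a)


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal data, no missing codes.

    Raises ZeroDivisionError if only one category is ever used,
    because expected disagreement is then zero.
    """
    # Coincidence matrix: every unit contributes both ordered value pairs.
    coincidences = Counter()
    for a, b in zip(coder_a, coder_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    # Marginal totals per category and overall number of coded values.
    totals = Counter()
    for (category, _other), count in coincidences.items():
        totals[category] += count
    n = sum(totals.values())
    observed = sum(cnt for (c, k), cnt in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c, k in permutations(totals, 2))
    # alpha = 1 - observed disagreement / expected disagreement
    return 1 - (n - 1) * observed / expected


# Invented codes for ten text units (1 = "about the topic", 0 = "not about").
dictionary_codes = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
human_codes = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

print(hit_rate(dictionary_codes, human_codes))
print(krippendorff_alpha_nominal(dictionary_codes, human_codes))
```

In this toy example the hit rate looks moderate while alpha is close to zero, because alpha corrects for the agreement that would be expected by chance given how often each code is used.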

What is the last step of the text analytics process, “Study the research question,” about?

Choose the appropriate statistical method for the research question:
– Analysis of variance (ANOVA)
– Regression analysis
– Multidimensional scaling
– Correlational analysis

What are further tools and further issues of text analytics?

Further tools:
* Topic modeling
* Relation extraction

Further issues:
- Problems: can miss subtleties and cannot code finer shades of meaning, homonyms, or sarcasm
- Cultural issues and language affect documents
- Semantic relationships (instead of a bag-of-words approach)
- Joint analysis of text and other unstructured data (e.g., image, audio)

What are the key challenges in evaluation of secondary data analyses?

Problems with secondary data research:
- Complexity of data and analyses
- Data accuracy (and age)
- Measurement comparability
- Data lumping and level of aggregation
- Predictive accuracy
- Measurement reliability over time

(25 cards)