06_secondary data 2 Flashcards

1
Q

What is secondary data?

A

Secondary data: Data gathered and recorded by someone else prior to and for a purpose other than the current project.

2
Q

What are the advantages of secondary data?

A
  • Available
  • Faster and less expensive than acquiring primary data
  • Requires no access to subjects
  • Inexpensive—government data is often free
  • May provide information otherwise not accessible
  • Often: high data quality (particularly large samples and established measures)
  • Availability of time series data
  • Simplification of cross-cultural studies
3
Q

What are the disadvantages of secondary data?

A
  • No control over data quality and representativeness
  • Lack of familiarity with variables
  • Lack of key variables
  • Uncertain validity
  • Data not consistent with needs
  • Inappropriate units of measurement
  • (Probably) too old
4
Q

What are the key characteristics of big data?

A

Volume:
- Huge amount of data created each day
- Data explosion
- “Big” is relative

Variety:
- Internal vs. external data sources
- Mix of three different data formats

Velocity:
- Speed of data production and processing
- Real-time information creates the highest value
- Quick response to big data can offer a competitive advantage

Veracity:
- Term coined by IBM
- Accuracy, quality, truthfulness, trustworthiness of data

Variability:
- Data flows can be inconsistent with periodic peaks
- Can be challenging to manage

Value proposition:
- Key purpose
- Assumption: Big data more valuable than small data

5
Q

What are unstructured and structured data?

A

Unstructured: video, images, text, voice

Structured: numeric secondary data, categorical data, geographic data

6
Q

Evaluate secondary data as a method for inferring causal relationships.

A

Distinct entities: maybe or not at all (depends on the conceptual elaboration)

Association: yes, easily identifiable

Temporal precedence: yes, secondary data are often collected at several points in time

Eliminating rival causal relationships: maybe, can be problematic
–> With panel data, all time-invariant variables can be controlled for; otherwise problematic
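
One way to see why panel data helps with rival explanations is the within (fixed-effects) transformation: subtracting each unit's own mean removes anything that stays constant over time for that unit. The following is a minimal sketch in plain Python, assuming a made-up two-firm panel; the function names and all numbers are illustrative and not taken from the course material.

```python
from collections import defaultdict

def pooled_slope(x, y):
    """OLS slope of y on x, ignoring which unit each observation belongs to."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

def within_slope(units, x, y):
    """Slope of y on x after subtracting each unit's own means of x and y."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # unit -> [sum of x, sum of y, count]
    for u, xi, yi in zip(units, x, y):
        sums[u][0] += xi
        sums[u][1] += yi
        sums[u][2] += 1
    num = den = 0.0
    for u, xi, yi in zip(units, x, y):
        sx, sy, n = sums[u]
        dx = xi - sx / n  # deviation from the unit's mean of x
        dy = yi - sy / n  # deviation from the unit's mean of y
        num += dx * dy
        den += dx * dx
    return num / den

# Hypothetical panel built as y = 2 * x + unit effect (+10 for firm A, -5 for firm B)
units = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [12.0, 14.0, 16.0, 3.0, 5.0, 7.0]

beta_pooled = pooled_slope(x, y)
beta_within = within_slope(units, x, y)
```

In this invented example the pooled slope comes out negative because the unit effects are correlated with x, while the within slope recovers the value 2 used to build the data. Variables that change over time are not removed by this transformation, which is why the card says "maybe".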

7
Q

How can rival causal explanations be eliminated in experiments, surveys, and secondary data?

A
  • Experiments: Rival explanations can be eliminated by randomization
  • Surveys: In theory, rival explanations can be eliminated by measuring all of them and using them as controls, but this is difficult in practice
  • Secondary data: Controlling for rival explanations is possible if they were measured, but in many cases they are not part of the data
8
Q

What is longitudinal data?

A

Longitudinal data are data collected from the same subjects or units over a period of time.

9
Q

What are the two forms of longitudinal data?

A

One unit of analysis: Time series analysis

Several units of analysis: Panel data analysis

10
Q

What are the challenges of longitudinal data?

A

Challenges:
- Violation of the OLS assumption “Residuals are uncorrelated”

–> Different estimators are necessary: panel estimators, time series estimators
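
Whether the residuals of an OLS fit are correlated over time can be checked with the Durbin-Watson statistic, which is roughly 2 when there is no first-order autocorrelation, close to 0 for strong positive autocorrelation, and close to 4 for strong negative autocorrelation. The sketch below is plain Python; both residual series are invented for illustration.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Hypothetical residuals that drift slowly (each value resembles the previous one)
drifting = [1.0, 0.8, 0.6, 0.3, -0.1, -0.4, -0.7, -0.9]
# Hypothetical residuals that flip sign every period
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_drifting = durbin_watson(drifting)
dw_alternating = durbin_watson(alternating)
```

The drifting series gives a value well below 2 and the alternating series a value well above 2, the two patterns that signal that plain OLS standard errors should not be trusted for longitudinal data.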

11
Q

What are the advantages of longitudinal data?

A
  • Allows distinguishing true loyalty effects from spurious effects
  • Allows including lagged values of the dependent variable as predictors and analyzing novel research questions
  • Repeated measurements make it possible to address omitted variable bias in different ways (Chapter 4.7)
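
A lagged dependent variable is simply the previous period's value of the outcome for the same unit. A minimal sketch of how such a column could be built, assuming a list of dictionaries with hypothetical keys `unit`, `t`, and `y` and no gaps between periods:

```python
def add_lag(rows, unit_key="unit", time_key="t", value_key="y"):
    """Return rows sorted by unit and time, each with the previous value of the same unit.

    The lag is stored under value_key + "_lag" and is None for a unit's first observation.
    Assumes consecutive periods without gaps.
    """
    out = []
    last_value = {}  # unit -> value seen in that unit's previous period
    for row in sorted(rows, key=lambda r: (r[unit_key], r[time_key])):
        new_row = dict(row)
        new_row[value_key + "_lag"] = last_value.get(row[unit_key])
        last_value[row[unit_key]] = row[value_key]
        out.append(new_row)
    return out

# Hypothetical purchase panel: y = 1 if the customer bought the brand in period t
rows = [
    {"unit": "c1", "t": 2, "y": 0},
    {"unit": "c1", "t": 1, "y": 1},
    {"unit": "c2", "t": 1, "y": 0},
    {"unit": "c2", "t": 2, "y": 1},
]
lagged = add_lag(rows)
lags = [r["y_lag"] for r in lagged]
```

Rows whose lag is None would be dropped before estimation, since no previous observation exists for them.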
12
Q

What is unstructured data?

A

Unstructured Data is a single data unit in which the information offers a relatively concurrent representation of its multifaceted nature without predefined organization or numeric values.

13
Q

What are the advantages, challenges, and forms of unstructured data?

A

Advantages:
- Large amounts of unstructured data are available at companies
- Can yield deeper and novel insights

Challenges:
- Complex data structure

Forms:
- Text, images, audio, video

14
Q

What is text analytics?

A

Idea
* Can investigate “what” is being said and “how” it is said, using both qualitative and quantitative inquiries with various degrees of human involvement
* Based on content analysis (Chapter 5)

15
Q

What are the distinctions and purposes of text analytics?

A

Distinctions
* Text as a reflection of the producer
* Text’s impact on receivers
* Important to consider contextual influences on text

Purpose of text analytics
* Text for prediction
* Text for understanding

16
Q

What are the approaches, tools, and application areas of text analytics?

A

Approaches
* Top-down
* Bottom-up

Tools for text analytics
* Entity extraction (e.g., dictionary approaches, sentiment analysis [mostly])
* Topic modeling (e.g., latent Dirichlet allocation [LDA])
* Relation extraction (e.g., deep learning, supervised machine learning)

Application areas
* Sentiment analysis (investor sentiment, consumer sentiment)
* Word-of-mouth
* Creation of positioning maps

17
Q

What is the process of text analytics?

A

1. Develop a research question
2. Data collection
3. Data preprocessing
4. Construct definition
5. Operationalization
6. Run text analysis and measurement validation
7. Study the research question

18
Q

Where can data for text analytics come from? (Step 2 of the process)

A

Data collection
Potential data sources, e.g.:
- E-Mails, Online reviews
- Company websites, Financial reports
- Shareholder letters
- Online databases or newspapers
- Field interviews (Digital converters for printed text)

Access to data
- Receive access to a set of documents
- Web scraping (legal aspects important)

19
Q

What is step 3, data preprocessing, about in text analytics?

A

Text is unstructured and messy, so it requires cleaning before data analysis.

Typical steps:
  • Tokenization: Break text into tokens (often words or sentences), e.g., Text: “I love my ipod.” → Tokens: “I” “love” “my” “ipod” “.”
  • Cleaning: Remove non-meaningful text (e.g., HTML tags) and nontextual information
  • Removing stop words (e.g., common words such as ‘a’ or ‘the’ that appear in most documents)
  • Correcting spelling mistakes: Common spell checkers are available but require careful application (e.g., domain-specific language may not be part of the spell checker or, worse, may be incorrectly “fixed”)
  • Stemming and lemmatization:
    – Stemming: “car” and “cars” are stemmed to “car,” but “automobile” is not.
    – Lemmatization: “car,” “cars,” and “automobile” are reduced to the lemma “automobile.”
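
The preprocessing steps can be sketched with the Python standard library alone. This is a minimal illustration, assuming a tiny hand-made stop word list and a deliberately crude stemmer that only strips a plural “s”; real projects would normally rely on an established NLP library instead.

```python
import re

STOP_WORDS = {"a", "an", "the", "my", "i", "is", "and"}  # tiny illustrative list

def clean(text):
    """Remove HTML tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def naive_stem(token):
    """Crude stemmer: lower-case the token and strip one trailing plural 's'."""
    t = token.lower()
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        return t[:-1]
    return t

raw = "<p>I love my ipod.</p>"
tokens = tokenize(clean(raw))
# Keep alphanumeric tokens only, then remove stop words and stem
content = [naive_stem(t) for t in remove_stop_words(tokens) if t.isalnum()]
```

The crude stemmer maps “cars” to “car” but leaves “automobile” untouched, which mirrors the stemming example on the card. Lemmatization is not sketched here because it needs a lexical resource that knows which words share a lemma.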
20
Q

What is step 4, “Construct definition,” about in text analytics?

A

Qualitatively analyze a subsample of the data

Create a word list for each concept (dictionary development)
- Deductive: create a wordlist from theoretical concepts or categories
- Inductive: build a concordance (all words in a document listed by frequency) and group words into categories; use qualitative approaches such as open and axial coding to identify categories

  • Have human coders check and refine the dictionary
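
For the inductive route, a frequency-ordered word list can be produced with the standard library. This is a minimal sketch, assuming two invented review snippets; grouping the words into categories remains a manual, qualitative step.

```python
from collections import Counter
import re

def concordance(documents):
    """Return (word, count) pairs for all documents, most frequent first."""
    words = []
    for doc in documents:
        words.extend(re.findall(r"[a-z']+", doc.lower()))
    return Counter(words).most_common()

# Hypothetical subsample of online reviews
docs = ["Great battery, great screen.", "Battery died quickly."]
freq = concordance(docs)
```

Words at the top of the list are candidates for the dictionary, which human coders then check and refine.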
21
Q

What is operationalization in the text analytics process about?

A

Conduct computer analysis to compute the raw data (Diction, LIWC, WordStat, R, Python)

  • Make measurement decisions based on the research question:
    – Percent of all words
    – Percent of words within the time period or category
    – Percent of all coded words
    – Binary (“about” or “not about” a topic)
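
The measurement options can be illustrated with a small dictionary-based count. This is a sketch under stated assumptions: the word lists are invented, they do not overlap, and the text has already been tokenized and lower-cased. The “percent of words within the time period or category” variant would apply the same function to the subset of documents in that period or category.

```python
def dictionary_measures(tokens, dictionary):
    """Compute simple measures per concept.

    tokens: list of lower-case words from one document.
    dictionary: mapping from concept name to a set of words (assumed non-overlapping).
    """
    total = len(tokens)
    counts = {
        concept: sum(1 for t in tokens if t in words)
        for concept, words in dictionary.items()
    }
    coded = sum(counts.values())  # words matched by any concept
    result = {}
    for concept, n in counts.items():
        result[concept] = {
            "pct_all_words": 100.0 * n / total if total else 0.0,
            "pct_coded_words": 100.0 * n / coded if coded else 0.0,
            "about": n > 0,  # binary: document is "about" the concept or not
        }
    return result

# Hypothetical dictionary and review text
dictionary = {
    "positive": {"great", "good"},
    "negative": {"bad", "slow"},
    "price": {"price", "cost"},
}
tokens = "staff was great but food was bad and service slow".split()
measures = dictionary_measures(tokens, dictionary)
```

Which of the measures is appropriate depends on the research question, for example whether document length should influence the score.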
22
Q

What is important in the step of running text analysis and measurement validation?

A

Establish and/or enhance validity:
- Construct validity, e.g.,
  - Pull a subsample and have it coded by a research assistant or researcher
  - Calculate Krippendorff’s alpha or a hit/miss rate
  - Have survey participants evaluate the dictionary
- Concurrent validity (e.g., calculate and compare multiple textual measures)
- Discriminant validity
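
Both agreement measures can be computed by hand for a small subsample. The sketch below covers only the simplest case of Krippendorff's alpha: two coders, nominal categories, and no missing codes. The codes are invented, with one list standing for the human coder and the other for the computer classification.

```python
from collections import Counter

def hit_rate(coder_a, coder_b):
    """Share of units on which both coders assigned the same category."""
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)

def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    n = 2 * len(coder_a)  # total number of codes assigned
    disagreements = sum(a != b for a, b in zip(coder_a, coder_b))
    observed = 2 * disagreements / n  # observed disagreement
    totals = Counter(coder_a) + Counter(coder_b)  # how often each category was used
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))  # disagreement expected by chance
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1 - observed / expected

# Hypothetical codes for six text units
human = ["pos", "pos", "neg", "neg", "pos", "neg"]
computer = ["pos", "pos", "neg", "pos", "pos", "neg"]

rate = hit_rate(human, computer)
alpha = krippendorff_alpha_nominal(human, computer)
```

Alpha is lower than the raw hit rate here because it corrects for the agreement that would be expected by chance.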

23
Q

What is the last step of the text analytics process, “Study the research question,” about?

A

Choose the appropriate statistical method for the research question:
– Analysis of variance (ANOVA)
– Regression analysis
– Multidimensional scaling
– Correlational analysis

24
Q

What are further tools and further issues of text analytics?

A

Further tools:
* Topic modeling
* Relation extraction

Further issues:

Problems:
- Can miss subtleties and cannot code finer shades of meaning, homonyms, or sarcasm
- Cultural issues and language affect documents
- Semantic relationships (instead of a bag-of-words approach)
- Joint analysis of text and other unstructured data (e.g., image, audio)

25
Q

What are the key challenges in evaluation of secondary data analyses?

A

Problems with secondary data research:

  • Complexity of data and analyses
  • Data accuracy (and age)
  • Measurement comparability
  • Data lumping and level of aggregation
  • Predictive accuracy
  • Measurement reliability over time