Intro Flashcards

1
Q

Corpus

A

A corpus is […] a systematic, computerised
collection of authentic language used for
linguistic analysis.

• A corpus is a body of naturally occurring language; corpora are “generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type.”

2
Q

Questions for Corpus Linguistics

A

• What are the most frequent words in English?
• Which prepositions do particular verbs
combine with?
• Which linguistic constructions (words, phrases
etc) are used more often in written English,
which in spoken English?
• Which linguistic constructions are commonly
used in formal situations, which in informal
situations?
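
Questions like the first two can be approached by simple counting. The following is a minimal sketch, not a definitive method: `TOY_TEXTS` is a set of invented sentences standing in for a real corpus, and the helper names `tokenize`, `word_frequencies` and `following_words` are hypothetical.

```python
import re
from collections import Counter

# Toy stand-in for a corpus: invented sentences, NOT authentic language data.
TOY_TEXTS = [
    "We rely on the data and we depend on the results.",
    "They rely on intuition, but the data point to the corpus.",
    "The results depend on the size of the corpus.",
]

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(texts):
    """Count how often each word form occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def following_words(texts, target):
    """Count the word that immediately follows each occurrence of `target`."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for current, nxt in zip(tokens, tokens[1:]):
            if current == target:
                counts[nxt] += 1
    return counts

freqs = word_frequencies(TOY_TEXTS)
print(freqs.most_common(2))                 # the two most frequent word forms
print(following_words(TOY_TEXTS, "rely"))   # what comes right after "rely"
```

A real study would use an authentic, well-designed corpus and usually part-of-speech tags rather than raw adjacency, so this only illustrates the kind of evidence a corpus can supply.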

3
Q

What is not a corpus?

A
  • A list of words is not a corpus
    • Words are only the building blocks of language
  • A text archive is not a corpus
    • It is a random collection of texts
  • A collection of citations is not a corpus
    • A citation is a short quotation which contains a word or phrase that is the reason for its selection
  • A collection of quotations is not a corpus
    • A quotation is a short selection from a text chosen on internal criteria by human beings
  • A text is not a corpus
    • A text is intended to be read in different ways
  • The Web is not a corpus
    • Its dimensions are unknown and constantly changing, and it was not designed from a linguistic perspective
4
Q

Purpose of a corpus

A

Study language in a broad sense
• Test linguistic theory and hypotheses
• Generate and verify new linguistic hypotheses
• Beyond linguistics, provide textual evidence in text-based
humanities and social sciences subjects
⇒ The purpose is reflected in a well-designed corpus

5
Q

Why corpora?

A

Even expert speakers:
• Have only a partial knowledge of a language
⇒ A corpus can be more comprehensive and balanced
• Tend to notice the unusual and think of what is possible
⇒ A corpus can show us what is common and typical
• Cannot quantify their knowledge of language
⇒ A corpus can readily give us accurate statistics
• Cannot remember everything they know
⇒ A corpus can store and recall all the information that has been stored in it
• Cannot make up natural examples
⇒ A corpus can provide us with a vast number of examples in real
communicative contexts
• Have prejudices and preferences, and every language carries cultural
connotations and an underlying ideology
⇒ A corpus can give us more objective evidence
• Are not always available to be consulted
⇒ A corpus can be made permanently accessible to all
• Cannot keep up with language change
⇒ A constantly updated corpus can reflect even recent changes in
the language
• Lack authority: they can be challenged by other expert speakers
⇒ A corpus can encompass the actual language use of many expert
speakers

6
Q

Intuition as alternative?

A

Intuitions are useful in linguistics
• To invent (grammatical, ungrammatical, or questionable) example
sentences for linguistic analysis
• To make judgments about the acceptability / grammaticality or
meaning of an expression
• To help with categorization
BUT intuitions should be applied with caution!
• Possibly biased (influenced by one’s dialect or sociolect)
• May blind the analyst (who notices the unusual but overlooks the
commonplace)
• Not observable and verifiable by everyone (corpora are)
• Cannot be used reliably in all linguistic areas (e.g. language variation, historical
linguistics, register and style, first and second language acquisition)
• Human beings have only the vaguest notion of the frequency of a construct
or a word
• Introspective data
  • is artificial and may not represent typical language use
  • is decontextualized (it exists in the analyst’s mind, not in a real context)

7
Q

Benefits of corpora

A

• Reliability
⇒ A corpus pools together the linguistic intuitions of a range of language speakers,
which offsets the potential biases in the intuitions of individual speakers
• Naturally occurring data
⇒ The data is used in real communication instead of being invented specifically for
linguistic analysis
• Contextualization
⇒ Attested language use which has already occurred in a real linguistic context
• Quantitative
⇒ Corpora can provide frequencies and statistics readily
• Corpus data can find differences that intuitions alone cannot perceive
⇒ E.g. synonyms totally, absolutely, utterly, completely, entirely
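
One way to look for such differences is to compare what each near-synonym is followed by. The sketch below assumes a toy set of invented sentences (`TOY_TEXTS`) in place of a real corpus, and the function name `collocate_profiles` is hypothetical. It shows the technique only and makes no claim about how these adverbs actually behave in English.

```python
import re
from collections import Counter, defaultdict

# Invented example sentences, NOT authentic corpus data.
TOY_TEXTS = [
    "That is utterly ridiculous.",
    "The plan was utterly hopeless.",
    "I totally agree with you.",
    "She totally forgot the meeting.",
    "The two cases are entirely different.",
    "It is completely different now.",
]

ADVERBS = {"totally", "absolutely", "utterly", "completely", "entirely"}

def collocate_profiles(texts, nodes):
    """For each node word, count the words that immediately follow it."""
    profiles = defaultdict(Counter)
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for word, nxt in zip(tokens, tokens[1:]):
            if word in nodes:
                profiles[word][nxt] += 1
    return profiles

profiles = collocate_profiles(TOY_TEXTS, ADVERBS)
for adverb in sorted(profiles):
    print(adverb, dict(profiles[adverb]))
```

With a large authentic corpus, the same counts would normally be compared with an association measure rather than read off raw frequencies.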

8
Q

Corpora vs./and intuitions

A

Corpora and intuitions are not antagonistic but corroborate each other ⇒ complementary
• Armchair linguists and corpus linguists “need each other. Or better,
[…] the two kinds of linguists, wherever possible, should exist in the
same body.” (Fillmore 1992)
• “Neither the corpus linguist of the 1950s, who rejected intuitions, nor
the general linguist of the 1960s, who rejected corpus data, was able
to achieve the interaction of data coverage and the insight that
characterize the many successful corpus analyses of recent years.”
(Leech 1991)
⇒ Find a balance between the use of corpus data and one’s intuitions

9
Q

Methodology or theory?

A

In spite of its name, corpus “linguistics” is in fact a methodology
• It differs from branches of linguistics such as phonetics, syntax,
semantics or pragmatics
  • These describe or explain a particular aspect of language
  • CL is not restricted to one aspect of language ⇒ it can be employed in almost any area of
  linguistic research

10
Q

History

A

The term corpus linguistics first appeared only in
the early 1980s, but corpus-based language study
has a substantial history
• The history of CL can be split into two periods:
before and after Chomsky
Pre-Chomskyan era
• Field linguists and linguists of the structuralist tradition used “shoebox corpora”
(shoeboxes filled with paper slips)
• Methodology essentially “corpus-based”
(i.e. empirical and based on observed data)
• Work of early corpus linguistics was underpinned by two fundamental, yet flawed
assumptions:
  • The sentences of a natural language are finite.
  • The sentences of a natural language can be collected and enumerated.
• Most linguists saw the “corpus” as the only source of linguistic evidence in the
formation of linguistic theories
Chomsky revolution: Between 1957 and 1965 Chomsky changed the
direction of linguistics from empiricism towards rationalism
• “Any natural corpus will be skewed. Some sentences won’t occur because they
are obvious, others because they are false, still others because they are impolite.
The corpus, if natural, will be so wildly skewed that the description would be no
more than a mere list.” (Chomsky 1962)
• Our internal knowledge of language in the human brain (competence) replaces
observed data (performance)
• Intuitions started to be relied on as evidence

11
Q

Performance vs. Competence

A

Corpus linguistics ‘concentrates on linguistic performance rather than
linguistic competence’
• ‘We can study performance either as process or as product’
• ‘CL studies performance as product: a corpus consists of spoken or
written texts in themselves, the physical manifestations of language,
independent (in principle) of the mental processes of their addressers
and addressees.’

12
Q

Revival of CL

A

Corpus research was continued in a few centres (Brown, Lancaster) in the
1960s and 1970s
• The Brown University Standard Corpus of Present-day American English (Brown
corpus)
• Lancaster-Oslo-Bergen Corpus of British English (LOB)
• Hardware still imposed restrictions until the real development started in
the 1980s
• Corpora + computer technology rekindled interest in the corpus methodology
• Since then, the number and size of corpora and corpus-based studies have increased
dramatically

13
Q

CL (history)

A

takes us back towards the empiricist end of the spectrum
• observations contribute to theory more than theory contributes to
observation
• data is independent of the tenets of the theory it is required to test
• does not deny the more rationalist principle that the way we construct our
theory determines the way we categorise and interpret our data
• nowadays enjoys widespread popularity, and has opened up or
foregrounded many new areas of research

14
Q

Falsifiability

A

‘A corpus-based linguistic model is falsifiable in the sense that it can be
tested on a new sample of corpus material, distinct from the sample
which was employed in the development of the model itself. And if it is
found wanting, it can be replaced by a model which fits the data more precisely.’
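
The idea of testing a model on a fresh sample can be sketched in a few lines. This is a deliberately tiny illustration under stated assumptions: the "model" is just a claim about which of two words is more frequent, the samples (`DEV_SAMPLE`, `HELD_OUT_A`, `HELD_OUT_B`) are invented sentences rather than authentic data, and all function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(texts, word):
    """Share of all tokens in `texts` that are `word` (0.0 if there are no tokens)."""
    tokens = [token for text in texts for token in tokenize(text)]
    return tokens.count(word) / len(tokens) if tokens else 0.0

def develop_model(dev_texts, word_a, word_b):
    """Derive a claim from the development sample: which of two words is more frequent."""
    freq_a = relative_frequency(dev_texts, word_a)
    freq_b = relative_frequency(dev_texts, word_b)
    return word_a if freq_a >= freq_b else word_b

def survives_test(predicted, other, test_texts):
    """The claim survives if `predicted` is at least as frequent as `other` in new material."""
    return relative_frequency(test_texts, predicted) >= relative_frequency(test_texts, other)

# Invented samples, NOT authentic corpus data.
DEV_SAMPLE = ["I totally agree.", "We totally forgot.", "It was utterly absurd."]
HELD_OUT_A = ["They totally missed it.", "That is utterly wrong.", "He totally lost it."]
HELD_OUT_B = ["It was utterly ruined.", "An utterly silly idea."]

predicted = develop_model(DEV_SAMPLE, "totally", "utterly")
print(predicted)
print(survives_test(predicted, "utterly", HELD_OUT_A))  # claim holds on this sample
print(survives_test(predicted, "utterly", HELD_OUT_B))  # claim is found wanting here
```

The point is the separation of the two samples: the claim is formed on one and checked on another, so a failed check is a reason to replace it with a model that fits the data better.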

15
Q

Completeness

A

‘The model should be “complete” in the sense of accounting for all the
corpus data in the relevant samples. The “robustness” of corpus
linguistics is that the language model is required to account for
unrestricted data: no theoretically-motivated selection process
intervenes to choose suitable data, as so often happens in other
varieties of linguistics’

16
Q

Objectivity

A

‘This is less a property of the model itself, than of the circumstances in
which it is tested. A corpus-based model can be objectively tested in
that the test can be successfully replicated by independent observers
or investigators, including those who do not have any emotional
commitment to the success or failure of the model.’

17
Q

Typical focus

A

(1) Focus on a more empiricist, rather than rationalist view of
scientific inquiry.
(2) Focus on linguistic performance, rather than
competence
(3) Focus on linguistic description, rather than linguistic
universals
(4) Focus on quantitative, as well as qualitative models of
language