Corpus Linguistics Flashcards

1
Q

What is a corpus, and what are its features?

A

A collection of electronic texts assembled according to specific criteria

  • REPRESENTATIVE of a LANGUAGE or a TEXT TYPE (which has specific features)
  • ELECTRONIC because they are often stored in .txt format
  • BALANCED because their content has to be consistent and representative of ONE specific field (I cannot have a disproportionate share of spoken language compared with written language: I NEED THE RIGHT AMOUNT of each)
  • they portray the REAL USAGE of a language, therefore they are interested in language as DISCOURSE
  • they are a PRACTICE RATHER THAN A THEORY because they are based on REAL FACTS, REAL USAGES
  • the FREQUENCY of words is very important in corpora, because the more frequent a word is, the more common it is
2
Q

Difference between USE and USAGE

A
USE = invented examples illustrating a language's grammar
USAGE = native speakers' language use, how they actually speak it
3
Q

DISCOURSE

A

A discourse is not a speech, but the entirety of texts produced by a discourse community, e.g. academic discourse is produced by the university community

4
Q

Four main features of corpora

A
  • Authentic texts (produced by native speakers)
  • Machine readable
  • Sampled to be
  • Representative
5
Q

Differences between TEXT and CORPUS

A

TEXT

  • read as a whole
  • read horizontally
  • read for its CONTENT
  • read as a unique event (just one time)
  • read as an individual act of will
  • instance of parole (individual use of a language)
  • it is a coherent communicative event

CORPUS

  • read fragmented
  • read vertically
  • read for formal patterning (the way words work at a grammatical level)
  • read for repeated events
  • read as a sample of social practice
  • gives insight into the LANGUAGE
  • it is NOT a coherent communicative event
6
Q

What is NOT a corpus?

A
  • the web
  • a text
  • a text archive
  • a list of words
  • a collection of citations
7
Q

Why do we use corpora?

A

We use corpora because
- they give us information about the real usage of a language or of a certain word, because they are representative and balanced
- they focus on typical and common words, which have to do with repetition
- they can store and recall all the information they contain and provide us with a vast number of examples in real communicative contexts

8
Q

Areas that use corpora

A
  • Lexicography
  • LSP (Language for Specific Purposes)
  • Grammatical studies
  • Language teaching
  • Translation studies
  • Semantics
  • Pragmatics
  • Discourse analysis
  • Forensic linguistics
  • Language variation
    and so on
9
Q

Types of corpora

A

PARALLEL CORPORA
a corpus composed of a source text in one language and its translation into another language.
They can be aligned at word level, at phrase level or at sentence level, establishing correspondences between units of bilingual or multilingual texts.
They provide insights into the languages being compared, allowing us to analyze:
- typological differences
- cultural differences
- language specific differences
- universal features

They are used also in language teaching for a number of practical applications.
These corpora can be MONOLINGUAL, BILINGUAL or PLURILINGUAL.

COMPARABLE
a corpus composed of original texts in 2 or more languages regarding the same topic/speaker/time/channel/function

Monolingual comparable corpora are used for:
- studying the intrinsic features of translation
- improving the translator's understanding of the SUBJECT DOMAIN
- TERMINOLOGY or EXPRESSIONS of a specific field
They help a lot to understand text type conventions and an author’s specific style

In both cases, the focus of translation studies with corpora is both the PRODUCT of translation and the PROCESS of translation!!!

10
Q

UNIVERSALS OF TRANSLATION

A

Features that can be found in almost every translated text and that do not vary across cultures. They differ from norms of translation because norms are culturally, socially and historically bound, while universals are not.
This is because humans universally tend to translate in the same way.

They were theorized by Mona Baker in the 1990s and they are:
- EXPLICITATION: the tendency to spell things out rather than leave them implicit. As a result the target text will be longer.
  By using PARALLEL corpora we can investigate the length of these texts,
  while by using COMPARABLE corpora we can investigate SYNTACTIC and LEXICAL explicitation by looking at the frequency of conjunctions and explanatory vocabulary

- SIMPLIFICATION: reflected in strategies such as
  • breaking up longer sentences
  • omission of repetitions
  • shortening complex collocations
  The main aim is to adhere to the target language norms and rules
- NORMALIZATION and CONVENTIONALIZATION: these consist in conforming to patterns and practices typical of the target language, to the point of almost EXAGGERATING them in order to make them familiar to the target readers
  They focus on
  • collocational patterns
  • clichés
  • grammatical structure and punctuation
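
The explicitation check on comparable corpora (frequency of conjunctions) could be sketched roughly as follows. This is only a minimal illustration: the conjunction list, the helper names `tokenize` and `per_thousand`, and the two toy sentences are all invented for the example and do not come from any real corpus.

```python
import re

# Hypothetical, deliberately short list of conjunctions (an assumption for this
# sketch); a real study would motivate and extend it.
CONJUNCTIONS = {"because", "therefore", "although", "so", "that", "and", "but"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def per_thousand(text, items):
    """Relative frequency of the given items per 1,000 tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in items)
    return 1000 * hits / len(tokens)

# Invented toy sentences standing in for translated vs. non-translated texts.
translated = "He left because he was tired, and therefore he missed the train."
original = "He left tired. He missed the train."

rate_translated = per_thousand(translated, CONJUNCTIONS)
rate_original = per_thousand(original, CONJUNCTIONS)
print(rate_translated, rate_original)
```

Normalizing per 1,000 tokens matters because the two halves of a comparable corpus rarely have the same size, so raw counts alone would not be comparable.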
11
Q

Corpora in translation training

A

Corpora can be very useful for training translators, and students are often asked to build their own corpus
PARALLEL CORPORA can be used to
- retrieve terminology or collocations
- find phrasal patterns
- look into lexical polysemy
- see how to translate idioms and collocations

COMPARABLE CORPORA can be used to:

  • check terminology and collocates
  • identify text-type specific features
  • validate intuitions and provide explanations for some solutions to problems
12
Q

SPECIALIZED CORPORA

A

WHAT ARE THEY
They are corpora which contain a collection of texts belonging to a certain type and representing the language of that type
WHAT ARE THEY USED FOR
Used to:
- help translators familiarize themselves with concepts and terms from a specific domain
- understand text-type conventions
- study an author’s style

EXAMPLES

  • Michigan Corpus of Academic Spoken English (MICASE) - language from university settings
  • CHILDES Corpus - corpus of child language
  • Michigan Corpus of Upper-Level Student Papers (MICUSP)
  • Medical corpora with language used by doctors or nurses

WHERE ARE THEY USED?
They are used especially in ESP settings; for example, the AWL (Academic Word List) was generated from a specialized corpus of academic texts

13
Q

ESP

A

English for Specific Purposes (ESP) is the most obvious application of corpus linguistics. Areas like register, lexicogrammar and phraseology can all be applied to specific purposes
For example, by investigating a corpus of academic language, Coxhead (a scholar) was able to pinpoint the most frequent vocabulary used in academic texts; she then made the list available to instructors to help students focus their vocabulary study.

14
Q

Monolingual corpora and learner corpora

A

MONOLINGUAL CORPORA
The Bank of English
British National Corpus
CORIS (Corpus di Italiano Scritto)

LEARNER CORPORA
Longman Learners' Corpus
Cambridge Learner Corpus

15
Q

Useful concepts for corpora

A

NODE: the word we are investigating
COLLOCATES: the words that often co-occur with the node
CO-TEXT: what comes before and after the node
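
To make the three terms concrete, here is a small sketch that treats a word as the node, takes a fixed span of co-text on each side, and counts the words found there as candidate collocates. The function name `collocates`, the span of 2 and the sample sentence are assumptions made up for illustration; dedicated corpus software would normally also apply a statistical measure rather than raw counts.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens to the left or right of the node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]      # co-text before the node
            right = tokens[i + 1:i + 1 + span]     # co-text after the node
            counts.update(left + right)
    return counts

# Invented sample text, not taken from a real corpus.
text = "Not a shred of evidence was found. There is not a shred of doubt."
result = collocates(tokenize(text), "shred", span=2)
print(result.most_common())
```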

16
Q

COLLOCATION

A

The way words combine regularly and predictably with other specific words –> they do not occur randomly

The criteria for collocations are
- non-substitutability
- non-compositionality
- non-modifiability
They are useful to investigate because we can find specific collocations in a specific text type and different kinds of collocation in other text types

e.g. “water” collocates differently according to the type of text

We can have MARKED or UNMARKED collocations, with marked ones being the uncommon ones

17
Q

Sinclair's understanding of meaning

A

When we read corpora we need to look at the extended unit of meaning; to identify it, Sinclair proposed these steps:

  • identify the COLLOCATIONAL PROFILE (lexical realizations): how a word is used
  • identify the COLLIGATIONAL PATTERNS (lexico-grammatical realizations), which are the collocations at a grammatical level
  • consider the common SEMANTIC FIELD (SEMANTIC PREFERENCE)
  • consider PRAGMATIC REALIZATIONS (SEMANTIC PROSODY)
18
Q

SEMANTIC PREFERENCE

A

The semantic preference refers to the semantic field a word commonly belongs to
e.g. the verb “break out” is used in situations of CONFLICT, DISEASE or PROBLEMATIC CIRCUMSTANCES

19
Q

SEMANTIC PROSODY

A

The pragmatic usage of a word; it describes the way in which certain words can be perceived as POSITIVE or NEGATIVE through frequent occurrences with particular collocates

e.g. set in, cause, commit, rife HAVE A NEGATIVE SEMANTIC PROSODY; impressive, which occurs with words like dignity, talent, achievement, HAS A POSITIVE PROSODY

It refers to STRETCHES of words, because it arises from the collocates.
As a result, if we use the word in a different context, we get a clash or an ironic effect.

IT IS DIFFERENT FROM CONNOTATION BECAUSE IT REFERS TO THE WORD IN ITS REAL CONTEXT OF USAGE, while a connotation refers to just one word

20
Q

What is a WORD LIST

A

It refers to the most frequent words we can find in a corpus, and it reveals numerous regularities which can be useful to language researchers searching for patterns of importance

It includes

  1. Alphabetical order, used for reference;
  2. Frequency order, mainly used for comparison with other texts; it gives information about the local use of the specialized language.

In a corpus, at the TOP of a word list, we will ALWAYS find grammar words, which are the most common ones
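
As a rough illustration of the two orderings, the sketch below builds a word list from a single invented sentence and sorts it both ways. The helper name `word_list` and the sample text are assumptions for the example only; corpus tools offer the same views through their own interfaces.

```python
from collections import Counter
import re

def word_list(text):
    """Return a Counter mapping each word form to its frequency."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Invented sample text standing in for a corpus.
text = "The cat sat on the mat and the dog sat on the rug."
freq = word_list(text)

by_frequency = freq.most_common()   # frequency order, most frequent first
alphabetical = sorted(freq.items()) # alphabetical order, for reference

print(by_frequency[:3])
print(alphabetical[:3])
```

Even in this tiny sample the grammar word "the" ends up at the top of the frequency-ordered list.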

21
Q

Concordance lines

A

Concordance lines are core tools in corpus linguistics: corpus software is used to find EVERY OCCURRENCE OF A WORD in its surrounding phrase

There are different options for searching for words, e.g. the asterisk (*) used as a wildcard
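
A very small key-word-in-context (KWIC) sketch, assuming the asterisk behaves as a "match anything" wildcard as it does in shell-style patterns. The function name `concordance`, the `width` parameter and the sample text are made up for this illustration and are not the interface of any particular concordancer.

```python
import re
from fnmatch import fnmatchcase

def concordance(text, pattern, width=3):
    """Return (left co-text, node, right co-text) for every token matching `pattern`.

    `pattern` may contain `*` as a wildcard, e.g. "translat*".
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if fnmatchcase(token.lower(), pattern.lower()):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Invented sample text, not taken from a real corpus.
text = ("Translators use corpora. A translation is checked against the corpus, "
        "and translated texts are compared.")
lines = concordance(text, "translat*")
for left, node, right in lines:
    print(f"{left:>20} | {node} | {right}")
```

Printing the node in a fixed column is what lets the lines be read vertically, one occurrence under the other.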

22
Q

Clusters

A

Sequences of words (from 2 to a maximum of 8 items) which occur with a particular frequency, fixed by the inquirer, in the set of texts being examined. They are a kind of extended collocation.
They are also multiword units

For some, they constitute missing links from the linguistic morass to the abstraction of discourse, revealing typical ways of saying things

How to cluster items?

1) From the Concordance programme by clicking directly on the cluster menu option,
2) Cluster lists can be prepared from WordList
3) Key-cluster lists can be compiled by COMPARING cluster lists.

The higher the number of components in the cluster list, the greater the amount of phraseology found
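
Outside a dedicated concordance programme, the same idea can be approximated by counting fixed-length word sequences and keeping only those that reach a minimum frequency. This is a simplified sketch: the function name `clusters`, its parameters and the sample text are assumptions, and sequences are allowed to run across sentence boundaries, which real tools may handle differently.

```python
from collections import Counter
import re

def clusters(text, size=3, min_freq=2):
    """Return word sequences of length `size` occurring at least `min_freq` times."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = Counter(
        " ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
    )
    return {gram: n for gram, n in grams.items() if n >= min_freq}

# Invented sample text, not taken from a real corpus.
text = ("At the end of the day we went home. "
        "At the end of the day it was fine.")
found = clusters(text, size=4, min_freq=2)
print(found)
```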

23
Q

Phraseology

A

The essence of a word-based approach: it refers to the fact that words have specific sets of behaviour, and this is linked to the meaning of the word

PHRASE = MULTI WORD UNIT WHOSE MEANING IS NOT GIVEN BY THE SUM OF THE INDIVIDUAL WORDS

24
Q

A study on “naked eye”

A

We have seen a study of the collocation “naked eye”, which is a 2-word unit.
First, we have seen that the word “eye”, if used in the plural form, refers to the PHYSICAL EYE, while if it is used in the singular form, it has a metaphorical and figurative use (keep an eye on, catch someone’s eye)

By analyzing the corpus we saw that
- “naked eye” occurs frequently at the end of a phrase
- in terms of collocation we find words referring to the field of vision
- in terms of colligation we have THE, which occurs in every occurrence
We can therefore say that the collocation “naked eye” occurs in the SEMANTIC FIELD of vision and visibility, and has a NEGATIVE SEMANTIC PROSODY, given by the fact that we find the words “not” or “invisible”, which are lexemes considered to be negative

25
Q

IDIOM PRINCIPLE and OPEN CHOICE PRINCIPLE

A

These principles have been introduced by Sinclair and refer to how language is put together.
  • The idiom principle is at work when there is a SET PHRASEOLOGY
    It occurs when we say something and the person we are speaking to, by contextualizing the speech, understands what we’re talking about.

  • The open choice principle is at work when THERE IS NOT A FIXED PHRASEOLOGY; it refers to how the language is put together by combining lexis in accordance with the rules of grammar
    According to Sinclair, the tendency towards the open choice principle is found in words that tend to have a fixed meaning in reference to the world

The crucial point about Sinclair’s argument is that he suggests that the idiom principle is the NORM and the open-choice principle is the EXCEPTION

26
Q

PRACTICAL APPLICATIONS OF CORPORA

A
  1. Study of specialized domains and extraction of conceptual knowledge;
  2. Search for translation equivalents;
  3. Study of phraseology;
  4. Construction of glossaries;
27
Q

Glossary

A

What is a glossary?
A glossary is a list of terms in one or more languages.
It is useful if you need to acquire the terminology in another language and you are interested in finding equivalents between two languages

Corpus linguistic tools can be used to speed up the glossary production by using:

  • word lists or key words in context, and lists of clusters, which allow you to judge which terms are the most appropriate to include in your glossary
  • concordances allow you to collect information about what terms mean and how they are used
28
Q

Study on GENE FLOW

A

We identified:
• Domain (biotechnology)
• Subfield (transgenic plants)
• Language (English)
• Term (in its base form) + gender (if applicable) (gene flow, no plural attested)
• Grammatical category (noun, verb, etc)
• Definition (the transfer of genes from cultivars to related wild populations)
• Synonym (gene escape)
• Abbreviated form
• Contextual fragment
• Related terms (long-distance gene flow, horizontal gene flow)

29
Q

A study on idiom principle and open choice principle

A

Connect the dots: an expression coined soon after the tragic attacks of September 11th; to date, no guidance is given in dictionaries on the use of this phrase. Commonly, connect the dots (or dot to dot, or join the dots) is a paper puzzle game which consists of drawing lines between the dots in numerical order. The drawn lines reveal a hidden picture, so it is sometimes used as a metaphor to illustrate a person's ability to associate one idea with another
We saw this expression in Steve Jobs’ speech, where it is used in the majority of cases according to the idiom principle, and in about 20% of cases in a literal sense, according to the open-choice principle

JUST LIKE THAT
This expression has two distinct meanings:
a literal one, which can be translated as “proprio così” (“exactly like that”), and a metaphorical one, which means “very quickly”
AT THE END OF THE DAY
This expression can mean “alla fine della giornata” (the literal sense) or “alla fine di qualcosa” (“at the end of something”) in an idiomatic meaning

AND ALL THE REST OF IT
This expression shows 50 occurrences in BE vs 2 occurrences in AE, which suggests that it is more commonly used in British English.

30
Q

A study on “shred”

A

In this study we have “shred” as the node, and by investigating it we can see that on the left side we often have “a”, while on the right side we have “of”, showing how the word SHRED commonly collocates with “a” and “of”.
Since this expression collocates with WORDS RELATED TO FACTUALITY (evidence, doubt, truth, credibility) and WORDS RELATED TO HUMAN ATTRIBUTES (dignity, pity, decency), we can say that it has a NEGATIVE SEMANTIC PROSODY

31
Q

A study on SHALL

A

We saw that “shall” has a deontic function that doesn’t exist in Italian; therefore, in Italian target texts it is simply translated as “è” (“is”)

32
Q

A study on “to develop”

A

The verb “to develop” has many meanings according to the context. We saw the challenges of translating it into Italian in a medical context
A literal translation would be sviluppare i sintomi (“to develop symptoms”), but that is not how we say it in Italian –> presentare, avvertire i sintomi
Italian corpora confirm this

33
Q

A study on FLU VACCINE and MEDICATION

A

By looking at corpora, we can see what the most common translations of a collocation, word or expression are

In a corpus we saw 2 different ways of translating FLU VACCINE:

  • vaccino antinfluenzale (most suitable)
  • vaccino influenzale (occurs twice)

MEDICATION has many translations, but in our corpus we found it occurring only 3 times, while drugs has far more occurrences (107)

We can translate it as

  • medicamento
  • medicazione

However, if we search for these translations in the corpus, we get 0 hits. This is because the most adequate translation is FARMACO, even though that is also commonly associated with “drugs”

34
Q

A study on the word Welcome in Tourism

A

This study was carried out to see on which occasions the word welcome was translated with “benvenut”

Since language and culture are closely related, in translation too we have to consider:

  • the context of culture
  • the context of the situation
  • the function of the text

For this reason we have to consider:

  • Collocation - lexical attraction
  • Colligation - grammatical attraction
  • Semantic preference - semantic attraction
  • Semantic prosody - pragmatic attraction

CORPORA USED:
COMPARABLE CORPORA ON ITALIAN AGRITURISMI WEBSITES AND BRITISH FARMHOUSE HOLIDAYS WEBSITES, which are made up of similar texts

  1. By looking at the frequency list, we see that the word WELCOME is very frequent, while BENVENUTO occurs just a few times
    This means that the concept of welcome is expressed differently in Italian

WE ARE IN A CASE OF NON-EQUIVALENCE

  2. Identify the 4 types of attraction of welcome and benvenuto and compare them in order to find equivalences
    Then we have to find THE MOST FREQUENT COLLOCATES: pets and dogs, children, and guests and visitors
    After identifying them, they are translated prima facie, so we have animali/cani, bambini and ospiti; each of the collocates is then investigated in turn
  • Children: the collocational range suggests a case of non-equivalence –> the word never occurs with words that translate welcome
    However, the sentences in which it is used either share a RESTRICTION concerning little kids or belong to the semantic field of DISCOUNTS (tariffe speciali, “special rates”)
  • Pets and dogs: the patterns developed around the word animali suggested, at the functional level, the EQUIVALENCE of the translation “welcome - si accettano/sono ammessi” (“are accepted/are allowed”)
  • Guests and visitors: here corpus data suggest the EQUIVALENCE of the translation “welcome - possono, vi è la possibilità” (“can, there is the possibility”)

CONCLUSION:
- Equivalence is function in context
- Equivalence is culture bound
- Equivalence may not exist across cultures
This shows us that the translator needs to observe the collocational patterning and its cumulative effect very closely in order to identify functional equivalence.

35
Q

Study on NATURA

A

This study investigated the concept of nature as realised by the words NATURE in English and NATURA in Italian, in both their nominal and adjectival functions

STEP 1: the first step in the analysis was to check the occurrences of both words in the English and Italian corpora:
- The Italian corpus confirms the central importance of natura, which proves to be very frequent: 147 occurrences.
- The Farmhols corpus shows that the word ‘nature’ has a very low frequency: only 27 occurrences.

We are in a case of NON-EQUIVALENCE

STEP 2: The first thing we notice scanning the concordance of natura in the language of Agriturismo is that it is qualified by ADJECTIVES and that it frequently collocates with:
- incontaminata
- circostante
- a contatto con/in simbiosi/immersa
- amanti
- bellezze
- suoni
- pace
These words reveal the collocational profile of natura

STEP 3: we have to check the COLLOCATIONAL PROFILE of the word “nature” and see if the two profiles coincide
By analyzing the corpus we saw that nature frequently collocates with
- Reserve/trail
- Lover/s

At word level we find ourselves in a situation of non-equivalence

STEP 4: we have to try to translate the Italian collocates and see if they ever appear with the word nature. This never actually happens, because the translated English words frequently collocate with COUNTRYSIDE, and never with nature

This means that at word level, with NATURE-NATURA, we find ourselves in a case of NON-EQUIVALENCE
By analyzing the intercollocation of collocates we found that, at the collocational level, COUNTRYSIDE was the word which actually collocated with the translated collocates of natura, leading us to a case of FUNCTIONAL EQUIVALENCE
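
The check in STEP 4 can be pictured with a small sketch: take the prima facie translations of the Italian collocates and count, for each candidate headword, the sentences in which it appears together with at least one of them. Everything here is hypothetical: the function name `cooccurrence_counts`, the English glosses chosen for the collocates and the three sentences are invented for illustration and are not data from the study.

```python
import re

def cooccurrence_counts(sentences, headwords, collocates):
    """For each headword, count sentences where it co-occurs with any listed collocate."""
    collocate_set = {c.lower() for c in collocates}
    counts = {h: 0 for h in headwords}
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        if tokens & collocate_set:          # sentence contains a translated collocate
            for h in headwords:
                if h in tokens:
                    counts[h] += 1
    return counts

# Invented sentences and assumed glosses (incontaminata, circostante, pace).
sentences = [
    "Enjoy the peace of the unspoilt countryside.",
    "The farmhouse is set in beautiful surrounding countryside.",
    "There is a nature reserve nearby.",
]
translated_collocates = ["unspoilt", "surrounding", "peace"]
counts = cooccurrence_counts(sentences, ["nature", "countryside"], translated_collocates)
print(counts)
```

In this toy data the translated collocates pattern with "countryside" and never with "nature", which is the shape of evidence that points to functional rather than word-level equivalence.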