Corpus Linguistics Flashcards
What is a corpus and what are its features?
A collection of electronic texts assembled according to specific criteria
- REPRESENTATIVE of a LANGUAGE or a TEXT TYPE (which has specific features)
- ELECTRONIC because they are often stored in .txt format
- BALANCED because their content has to be consistent and representative of ONE specific field (I cannot have a much larger proportion of spoken language than written language - I NEED THE RIGHT AMOUNT)
- they portray the REAL USAGE of a language, therefore they deal with language as DISCOURSE
- they are a PRACTICE RATHER THAN A THEORY because they are based on REAL FACTS, REAL USAGES
- the FREQUENCY of words is very important in corpora because the more frequent a word is in the corpus, the more common it is in real usage
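A minimal sketch of how word frequency can be counted, assuming a very rough tokenizer (lowercased runs of letters and apostrophes). The function name and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word form occurs in a text (case-insensitive)."""
    # Assumption: a "word" is any run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Invented sample text, not taken from a real corpus.
sample = "The cat sat on the mat. The dog sat too."
freq = word_frequencies(sample)

# The most frequent word forms come first.
print(freq.most_common(2))
```

On a real corpus the top of such a list is usually dominated by function words like "the" and "of".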
Difference between USE and USAGE
USE = invented examples of a language's grammar
USAGE = native speakers' language use, how they actually speak it
DISCOURSE
A discourse is not a speech, but the entirety of texts produced by a discourse community, e.g. academic discourse is produced by the university community
Four main features of corpora
- Authentic texts (produced by native speakers)
- Machine readable
- Sampled to be
- Representative
Differences between TEXT and CORPUS
TEXT
- read as a whole
- read horizontally
- read for its CONTENT
- read as a unique event (just one time)
- read as an individual act of will
- instance of parole (individual use of a language)
- it is a coherent communicative event
CORPUS
- read fragmented
- read vertically
- read for formal patterning (the way words work at a grammatical level)
- read for repeated events
- read as a sample of social practice
- gives insight into the LANGUAGE
- it is NOT a coherent communicative event
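Reading a corpus "vertically" is usually done with a concordance (KWIC, keyword in context): every occurrence of a word is stacked in one column with a few words of context on each side. A minimal sketch, assuming whitespace tokenization; the function name and the sample sentence are invented.

```python
def kwic(tokens, node, window=3):
    """Return one concordance line per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Right-align the left context so the keyword forms a column.
            lines.append(f"{left:>20}  [{tok}]  {right}")
    return lines

# Invented sample, not taken from a real corpus.
tokens = "we take a break and then we take a decision before they take part".split()
for line in kwic(tokens, "take"):
    print(line)
```

Reading down the keyword column shows the repeated patterns (take a break, take a decision, take part) that are hard to notice when reading a text horizontally.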
What is NOT a corpus?
- the web
- a text
- a text archive
- a list of words
- a collection of citations
Why do we use corpora?
We use corpora because
- they give us information about the real usage of a language or of a certain word because they are representative and balanced
- they focus on typical and common words, which are revealed by repetition
- they can store and recall all the information they contain and provide us with a vast number of examples in real communicative contexts
Areas that use corpora
- Lexicography
- LSP
- Grammatical studies
- Language teaching
- Translation studies
- Semantics
- Pragmatics
- Discourse analysis
- Forensic linguistics
- Language variation
and so on
Types of corpora
PARALLEL CORPORA
A corpus composed of source texts in one language and their translations into another language.
They can be aligned at word level, at phrase level or at sentence level, establishing correspondences between units of bilingual or multilingual texts.
They provide insights into the languages being compared, in order to analyze:
- typological differences
- cultural differences
- language specific differences
- universal features
They are used also in language teaching for a number of practical applications.
These corpora can be MONOLINGUAL, BILINGUAL or PLURILINGUAL.
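A minimal sketch of what sentence-level alignment looks like as data, assuming each aligned unit is stored as a (source, target) pair. The English-Italian sentences and the helper name are invented for illustration.

```python
# Invented sentence-aligned English-Italian sample (not from a real corpus).
aligned = [
    ("The contract was signed yesterday.", "Il contratto è stato firmato ieri."),
    ("Both parties agreed.", "Entrambe le parti hanno accettato."),
    ("The contract expires in May.", "Il contratto scade a maggio."),
]

def find_translations(pairs, term):
    """Return the aligned pairs whose source sentence contains `term`."""
    term = term.lower()
    return [(src, tgt) for src, tgt in pairs if term in src.lower()]

# Look up how sentences containing "contract" were translated.
hits = find_translations(aligned, "contract")
for src, tgt in hits:
    print(src, "->", tgt)
```

Because the units are aligned, searching the source side immediately gives the corresponding translation on the target side.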
COMPARABLE CORPORA
A corpus composed of original texts in two or more languages sharing the same topic/speaker/time/channel/function
Monolingual comparable corpora are used for:
- studying the intrinsic features of translation
- improving the translator's understanding of the SUBJECT DOMAIN
- finding the TERMINOLOGY or EXPRESSIONS of a specific field
They help a lot to understand text type conventions and an author’s specific style
In both cases, the focus of translation studies with corpora is both the PRODUCT of translation and the PROCESS of translation!!!
UNIVERSALS OF TRANSLATION
Features that can be found in almost every translated text and that do not vary across cultures. They differ from norms of translation because norms are culturally, socially and historically bound, while universals are not.
This is because humans tend universally to translate in the same way.
They were theorized by Mona Baker in the ’90s and they are:
- EXPLICITATION: the tendency to spell things out rather than leave them implicit. As a result, the target text will be longer.
By using PARALLEL corpora we can investigate the length of these texts.
By using COMPARABLE corpora we can investigate SYNTACTIC and LEXICAL explicitation by looking at the frequency of conjunctions and explanatory vocabulary.
- SIMPLIFICATION: reflected in strategies such as
- breaking up longer sentences
- omission of repetitions
- shortening complex collocations
The main aim is to adhere to the target language norms and rules
- NORMALIZATION and CONVENTIONALIZATION: these consist in conforming to patterns and practices typical of the target language, to the point of almost EXAGGERATING them in order to make them familiar to the target readers
They focus on:
- collocational patterns
- clichés
- grammatical structure and punctuation
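The two ways of checking EXPLICITATION mentioned above (text length in parallel corpora, frequency of conjunctions in comparable corpora) can be sketched roughly as follows. Everything here is an assumption made for illustration: the sample sentences are invented, the conjunction list is arbitrary, and token counts across two languages are only a rough proxy for length.

```python
def length_ratio(source, target):
    """Target length divided by source length, in whitespace-separated tokens."""
    return len(target.split()) / len(source.split())

def per_thousand(tokens, word_list):
    """Occurrences of the listed words per 1,000 tokens."""
    hits = sum(1 for t in tokens if t.lower() in word_list)
    return 1000 * hits / len(tokens)

# Parallel pair (invented): Italian source, English target.
source = "Se ne andò arrabbiato."
target = "He left because he was angry."
ratio = length_ratio(source, target)  # above 1 means the target is longer

# Comparable texts (invented): translated English vs. original English.
CONJUNCTIONS = {"because", "therefore", "so", "although"}  # arbitrary list
translated = "he left because he was angry so she stayed".split()
original = "he left angry and she stayed".split()

print(ratio)
print(per_thousand(translated, CONJUNCTIONS), per_thousand(original, CONJUNCTIONS))
```

A higher conjunction rate in the translated texts than in the original texts would be compatible with explicitation, but a claim like that needs large corpora, not two sentences.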
Corpora in translation training
Corpora can be very useful for training translators, and students are often asked to build their own corpus
PARALLEL CORPORA can be used to:
- retrieve terminology or collocations
- find phrasal patterns
- look into lexical polysemy
- see how idioms and collocations are translated
COMPARABLE CORPORA can be used to:
- check terminology and collocates
- identify text-type specific features
- validate intuitions and provide explanations for some solutions to problems
SPECIALIZED CORPORA
WHAT ARE THEY?
They are corpora which contain a collection of texts belonging to a certain type and representing the language of that type
WHAT ARE THEY USED FOR?
Used to:
- help translators familiarize themselves with concepts and terms from a specific domain
- understand text-type conventions
- study an author’s style
EXAMPLES
- Michigan Corpus of Academic Spoken English (MICASE) - language from university settings
- CHILDES corpus - corpus of child language
- Michigan Corpus of Upper-Level Student Papers (MICUSP)
- Medical corpora with language used by doctors or nurses
WHERE ARE THEY USED?
They are used especially in ESP settings; for example, the AWL (Academic Word List) was generated from a specialized corpus of academic texts
ESP
English for Specific Purposes (ESP) is the most obvious application of corpus linguistics. Areas like register, lexicogrammar and phraseology can all be applied to specific purposes.
For example, by investigating a corpus composed of academic language, Coxhead (a scholar) was able to pinpoint the most frequent vocabulary words used in academic texts; she then made the list available for instructors to help students focus their vocabulary study.
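A rough sketch of how a word list can be pulled out of a specialized corpus by combining frequency with range (in how many texts a word appears). This is not Coxhead's actual procedure: the thresholds, the function name and the three mini "texts" are all invented for illustration.

```python
from collections import Counter

def frequent_across_texts(texts, min_freq=3, min_range=2):
    """Words that occur at least `min_freq` times overall
    and in at least `min_range` different texts."""
    total = Counter()       # overall frequency of each word
    text_range = Counter()  # number of texts containing each word
    for text in texts:
        tokens = text.lower().split()
        total.update(tokens)
        text_range.update(set(tokens))
    return sorted(
        w for w in total
        if total[w] >= min_freq and text_range[w] >= min_range
    )

# Invented mini-corpus of three "academic" texts.
texts = [
    "data analysis requires data",
    "the analysis of data",
    "the theory of the method",
]
word_list = frequent_across_texts(texts)
print(word_list)
```

In this toy example "the" passes the thresholds as well as "data", so a real academic word list would also need to exclude general high-frequency words.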
Monolingual corpora and learner corpora
The Bank of English
British National Corpus
CORIS (Corpus di Italiano Scritto)
Longman Learners' Corpus
Cambridge Learner Corpus
Useful concepts for corpora
NODE: the word we are investigating
COLLOCATE: a word that often co-occurs with the node
CO-TEXT: what comes before and after the node
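The three concepts fit together like this: for each occurrence of the NODE, take its CO-TEXT (a few words on each side) and count which words keep turning up there; the most frequent ones are candidate COLLOCATES. A minimal sketch with an invented sample and an assumed span of one word on each side. It uses raw counts only, while real collocation studies normally rely on statistical association measures.

```python
from collections import Counter

def collocates(tokens, node, span=2):
    """Count the words found within `span` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Co-text: the words just before and just after the node.
            co_text = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(co_text)
    return counts

# Invented sample, not taken from a real corpus.
tokens = "strong tea and strong coffee but never strong tea with milk".split()
result = collocates(tokens, "strong", span=1)
print(result.most_common(1))
```

Here "tea" is the word that appears most often next to the node "strong".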