C10 Flashcards

Question 1

Q

goals of biomedical text mining

Answer

A

interactive knowledge discovery: assisting the expert in finding the information they need

Text mining (TM) can assist researchers in:
- finding, evaluating and interpreting the scientific literature and patented biomedical inventions
- generating new medical hypotheses using information extracted from patient information (health records, social media data)

Question 2

Q

topics for TM in biomedical research

Answer

A

gene/protein/disease extraction
adverse events (side-effects)
predicting time to death
drug interactions

Question 3

Q

What steps are needed in bio-TM, e.g. for the task of finding side effects of medications on online forums?

Answer

A

Filter the potentially relevant messages
Get/create training data for NER
Train an NER model to identify drug names and side effects in the messages
Normalize the side effects (map to ontology)
Relation extraction: co-occurrences of drug names and side effects in one message
(Match the found relations to an existing knowledge base to identify which relations are new)

Needed:
- lists/ontologies of drug names and known side effects
- pre-processing
- pre-trained BERT models for NER and ontology linking
- labelled data for supervised NER finetuning and evaluation
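
The steps above can be sketched end to end on toy data. This is a minimal sketch, not a real system: dictionary lookup stands in for the trained NER model, and the drug names, side-effect lexicon, concept labels, knowledge base and forum messages are all invented for illustration.

```python
import re

# Hypothetical toy resources (all invented). A real pipeline would use a
# trained NER model and a full ontology instead of these small dictionaries.
DRUGS = {"examplol", "fictamine"}
SIDE_EFFECTS = {              # surface form -> normalized concept label
    "nausea": "NAUSEA",
    "queasy": "NAUSEA",
    "headache": "HEADACHE",
    "dizzy": "DIZZINESS",
}
KNOWN_RELATIONS = {("fictamine", "NAUSEA")}   # stand-in for a knowledge base


def tokenize(message):
    """Lowercase the message and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", message.lower())


def extract_relations(messages):
    """Return (drug, normalized side effect) pairs that co-occur in a message."""
    relations = set()
    for message in messages:
        tokens = tokenize(message)
        drugs = {t for t in tokens if t in DRUGS}                         # "NER" for drugs
        effects = {SIDE_EFFECTS[t] for t in tokens if t in SIDE_EFFECTS}  # NER + normalization
        if not drugs or not effects:
            continue          # filtering: message is not potentially relevant
        relations.update((d, e) for d in drugs for e in effects)          # co-occurrence
    return relations


messages = [
    "Started fictamine last week and I feel queasy every morning.",
    "Examplol gave me a headache on day two.",
    "Does anyone know a good pharmacy nearby?",
]
found = extract_relations(messages)
new_relations = found - KNOWN_RELATIONS   # relations not yet in the knowledge base
print(sorted(new_relations))
```

Note that "queasy" and "nausea" map to the same concept label, which is the point of the normalization step: different surface forms count as the same side effect.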

Question 4

Q

one of the biggest challenges in bio-NER

Answer

A

recognition of gene and protein names in scientific text: the same gene is often described using different names and symbols, and multiple genes share symbols and names

Question 5

Q

relation extraction: co-occurrence based methods

Answer

A

assume that two concepts that often occur together in the same text are related

Statistics for co-occurrence frequencies:
- actual number of co-occurrences
- expected number of co-occurrences based on the frequencies of both entities
- a statistical test to decide if the co-occurrence is statistically significant
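
The three statistics above can be computed from four document counts. The following is a minimal sketch that assumes document-level co-occurrence and uses a chi-square test on a 2x2 contingency table as the significance test (one possible choice; the card does not name a specific test). The function name and the example counts are made up.

```python
import math


def cooccurrence_stats(n_docs, n_a, n_b, n_ab):
    """Expected co-occurrence count plus a chi-square test (1 degree of freedom).

    n_docs: total documents, n_a / n_b: documents mentioning entity A / B,
    n_ab: documents mentioning both. Assumes 0 < n_a < n_docs and
    0 < n_b < n_docs so that no expected cell count is zero.
    """
    expected = n_a * n_b / n_docs          # expected co-occurrences if A and B are independent
    # 2x2 contingency table: rows = A present/absent, columns = B present/absent
    observed = [
        [n_ab, n_a - n_ab],
        [n_b - n_ab, n_docs - n_a - n_b + n_ab],
    ]
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n_docs   # expected cell count under independence
            chi2 += (observed[i][j] - e) ** 2 / e
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value


expected, chi2, p_value = cooccurrence_stats(n_docs=1000, n_a=100, n_b=50, n_ab=20)
print(f"observed=20 expected={expected:.1f} chi2={chi2:.2f} p={p_value:.2g}")
```

With these toy counts the pair co-occurs in 20 documents where only 5 would be expected by chance, so the test flags the co-occurrence as significant.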

Question 6

Q

relation extraction: structure-based methods

Answer

A

phrase based, able to detect triples in text, e.g. gene A inhibits gene B or gene C is involved in disease G

provides information about the type of relationship between two concepts
structure-based methods often have a higher precision than co-occurrence based methods but a lower recall, because they only cover a limited set of relations
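
A minimal sketch of a phrase-based extractor, assuming a single hand-written regular expression as the "structure" (real systems typically rely on syntactic parses or larger pattern sets). The pattern, the function name and the example sentences are illustrative only.

```python
import re

# Hypothetical hand-written pattern: subject, relation phrase, object.
# Only the three listed relation phrases are recognized.
PATTERN = re.compile(
    r"(?P<subj>gene \w+) "
    r"(?P<rel>inhibits|activates|is involved in) "
    r"(?P<obj>(?:gene|disease) \w+)"
)


def extract_triples(text):
    """Return (subject, relation, object) triples matched by PATTERN."""
    return [(m["subj"], m["rel"], m["obj"]) for m in PATTERN.finditer(text)]


text = "We show that gene A inhibits gene B, and gene C is involved in disease G."
triples = extract_triples(text)
print(triples)

# A relation phrased differently is missed, which illustrates the lower recall.
missed = extract_triples("Expression of gene B drops when gene A is active.")
print(missed)
```

Each extracted triple carries the relation type, which co-occurrence counts alone do not provide. The second call returns an empty list because the sentence does not use one of the listed phrases.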

Question 7

Q

6 modules in bio-TM

Answer

A

Information Retrieval
Named Entity Recognition
Ontology linking
Relation Extraction
Knowledge Discovery
Visualization

Question 8

Q

identifying biomedical entities in retrieved documents

Answer

A

mentions of entities are highlighted and linked to the specific concept in the controlled vocabulary (thesaurus or ontology)

Unified Medical Language System (UMLS)
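
Highlighting and linking can be illustrated with a small dictionary lookup. This is a minimal sketch under stated assumptions: the mini-thesaurus and its concept identifiers are made up (they are not real UMLS identifiers), and exact string matching stands in for a proper entity linker.

```python
import re

# Hypothetical mini-thesaurus: surface form -> concept identifier.
# The identifiers are invented for illustration and are NOT real UMLS CUIs.
THESAURUS = {
    "heart attack": "CONCEPT:0001",
    "myocardial infarction": "CONCEPT:0001",   # synonym of the same concept
    "aspirin": "CONCEPT:0002",
}


def link_entities(text):
    """Mark each known mention as [mention|concept id] in the text."""
    # Longest terms first, so multi-word terms win over shorter overlaps.
    terms = sorted(THESAURUS, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: f"[{m.group(0)}|{THESAURUS[m.group(0).lower()]}]", text
    )


linked = link_entities("Aspirin is given after a myocardial infarction (heart attack).")
print(linked)
```

Both "myocardial infarction" and "heart attack" are linked to the same identifier, which is what makes concept-level search and counting possible.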

Question 9

Q

differences in pre-training of domain-specific models vs general models

Answer

A

further pre-training vs. pre-training from scratch: pre-training from scratch requires a huge domain collection, so domain-specific models are often created by further pre-training a general model on domain text

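the WordPiece vocabulary is optimized for the pre-training corpus: a model pre-trained from scratch gets a domain-specific vocabulary, while a further pre-trained model typically keeps the vocabulary of the general model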
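
The effect of the vocabulary can be shown with the greedy longest-match-first splitting that WordPiece applies to a single word. This is a minimal sketch: both vocabularies below are tiny invented examples, not the vocabularies of any released model.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter substring
        if piece is None:
            return [unk]                       # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabularies (invented for illustration).
GENERAL_VOCAB = {"car", "##dio", "##my", "##op", "##athy"}
DOMAIN_VOCAB = {"cardio", "##myopathy"}

general_split = wordpiece("cardiomyopathy", GENERAL_VOCAB)
domain_split = wordpiece("cardiomyopathy", DOMAIN_VOCAB)
print(general_split)
print(domain_split)
```

With the toy general vocabulary the term breaks into five short pieces, while the toy domain vocabulary keeps it in two meaningful pieces, which is the motivation for building the vocabulary on the domain corpus.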

(9 cards)