C10 Flashcards
goals of biomedical text mining
interactive knowledge discovery: assisting the expert in finding the information they need
TM can assist researchers in
- finding, evaluating and interpreting the scientific literature and patented biomedical inventions
- generating new medical hypotheses using information extracted from patient information (health records, social media data)
topics for TM in biomedical research
- gene/protein/disease extraction
- adverse events (side-effects)
- predicting time to death
- drug interactions
Which steps to take in bio-TM, e.g. for the task of finding side effects of medications on online forums?
- Filter the potentially relevant messages
- Get/create training data for NER
- Train an NER model to identify drug names and side effects in the messages
- Normalize the side effects (map to ontology)
- Relation extraction: co-occurrences of drug names and side effects in one message
- (Match the found relations to an existing knowledge base to identify which relations are new)
Needed:
- lists/ontologies of drug names and known side effects
- pre-processing
- pre-trained BERT models for NER and ontology linking
- labelled data for supervised NER finetuning and evaluation
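Sketch of the steps above in Python. This is only a toy illustration under loud assumptions: the lexicons, the example messages and the knowledge base are invented, and a plain dictionary lookup stands in for the trained NER model and the ontology linker.

```python
import re

# Hypothetical toy lexicons; a real system would use an ontology and a trained NER model.
DRUGS = {"ibuprofen", "metformin"}
SIDE_EFFECTS = {  # surface form -> normalized concept label (stand-in for ontology linking)
    "nausea": "Nausea",
    "felt sick": "Nausea",
    "headache": "Headache",
    "dizzy": "Dizziness",
}
KNOWN_RELATIONS = {("metformin", "Nausea")}  # stand-in for an existing knowledge base


def find_terms(text, terms):
    """Return the terms that occur in the text as whole words or phrases."""
    lowered = text.lower()
    return {t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", lowered)}


def extract_relations(messages):
    """Filter messages, tag entities, normalize side effects, pair by co-occurrence."""
    relations = set()
    for msg in messages:
        drugs = find_terms(msg, DRUGS)
        if not drugs:  # filter: skip messages that mention no drug
            continue
        # "NER" plus normalization: map each surface form to its concept label
        effects = {SIDE_EFFECTS[s] for s in find_terms(msg, SIDE_EFFECTS)}
        # relation extraction: every drug/side-effect pair in the same message
        relations |= {(d, e) for d in drugs for e in effects}
    return relations


messages = [
    "Started metformin last week and felt sick every morning.",
    "Ibuprofen made me dizzy, anyone else?",
    "The weather is nice today.",
]
relations = extract_relations(messages)
new_relations = relations - KNOWN_RELATIONS  # relations not yet in the knowledge base
print(sorted(new_relations))
```

Note that "felt sick" and "nausea" map to the same concept, which is the point of the normalization step.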
one of the biggest challenges in bio-NER
recognition of gene and protein names in scientific text: a single gene is often described with different names and symbols, and multiple genes share the same symbols and names
relation extraction: co-occurrence based methods
assume that two concepts that often occur together in the same text are related
Statistics for co-occurrence frequencies:
- actual number of co-occurrences
- expected number of co-occurrences based on the frequencies of both entities
- a statistical test to decide if the co-occurrence is statistically significant
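Worked example of the three statistics, as a minimal sketch. Assumptions: documents are the counting unit, the expected count is computed under independence, and a chi-squared test on the 2x2 table is used as the significance test (other tests, such as Fisher's exact test, are possible and preferable for small counts). The function name and the counts are made up.

```python
import math


def cooccurrence_stats(n_docs, n_a, n_b, n_ab):
    """Expected co-occurrence count, chi-squared statistic and p-value (1 degree of freedom).

    n_docs: total number of documents
    n_a, n_b: number of documents mentioning entity A / entity B
    n_ab: number of documents mentioning both (the actual co-occurrence count)
    Assumes 0 < n_a < n_docs and 0 < n_b < n_docs, so no expected cell is zero.
    """
    expected = n_a * n_b / n_docs  # expected co-occurrences if A and B were independent
    observed = [
        [n_ab, n_a - n_ab],
        [n_b - n_ab, n_docs - n_a - n_b + n_ab],
    ]
    row_totals = [n_a, n_docs - n_a]
    col_totals = [n_b, n_docs - n_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / n_docs
            chi2 += (observed[i][j] - e) ** 2 / e
    # For 1 degree of freedom, the chi-squared tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value


expected, chi2, p_value = cooccurrence_stats(n_docs=1000, n_a=100, n_b=50, n_ab=20)
print(f"expected={expected:.1f} chi2={chi2:.2f} p={p_value:.2e}")
```

Here the two entities co-occur 20 times where 5 would be expected by chance, so the test flags the pair as significantly associated.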
relation extraction: structure-based methods
phrase based, able to detect triples in text, e.g. gene A inhibits gene B or gene C is involved in disease G
- provides information about the type of relationship between two concepts
- structure-based methods often have a higher precision than co-occurrence based methods but lower recall (limited set of relations)
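Toy illustration of a phrase-based extractor, assuming two hand-written regular-expression patterns (hypothetical, not taken from any real system). It returns typed triples, and it also shows the recall limitation: a sentence that matches no pattern yields nothing.

```python
import re

# Hypothetical pattern set: each pattern encodes one relation type.
PATTERNS = [
    (re.compile(r"(?:gene )?(\w+) inhibits (?:gene )?(\w+)", re.IGNORECASE), "inhibits"),
    (re.compile(r"(?:gene )?(\w+) is involved in (?:disease )?(\w+)", re.IGNORECASE), "involved_in"),
]


def extract_triples(text):
    """Return (subject, relation, object) triples for every pattern match in the text."""
    triples = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group(1), relation, match.group(2)))
    return triples


text = (
    "We show that gene A inhibits gene B. "
    "Gene C is involved in disease G. "
    "Gene D and gene E were both measured."
)
triples = extract_triples(text)
print(triples)
```

The third sentence mentions two genes but produces no triple, whereas a co-occurrence method would have paired D and E without knowing the relation type.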
6 modules in bio-tm
- Information Retrieval
- Named Entity Recognition
- Ontology linking
- Relation Extraction
- Knowledge Discovery
- Visualization
identifying biomedical entities in retrieved documents
mentions of entities are highlighted and linked to the specific concept in the controlled vocabulary (thesaurus or ontology)
example of such a vocabulary: the Unified Medical Language System (UMLS)
differences in pre-training of domain-specific models vs general models
further pre-training vs. pre-training from scratch: pre-training from scratch requires a huge collection, so domain-specific models are often general models that are further pre-trained on domain text
WordPiece vocabulary is optimized for the pre-training corpus
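Why the vocabulary matters, as a simplified sketch of greedy longest-match-first WordPiece tokenization. The two tiny vocabularies are invented for illustration; real vocabularies are learned from the pre-training corpus and contain tens of thousands of pieces.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right.

    Non-initial pieces carry the "##" continuation prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:  # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no piece fits: the whole word becomes unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabularies: the general one only knows short fragments,
# the biomedical one was built on a corpus where the drug name is frequent.
general_vocab = {"ace", "##ta", "##min", "##op", "##hen"}
biomedical_vocab = general_vocab | {"acetaminophen"}

general_pieces = wordpiece_tokenize("acetaminophen", general_vocab)
biomedical_pieces = wordpiece_tokenize("acetaminophen", biomedical_vocab)
print(general_pieces)
print(biomedical_pieces)
```

A model that is only further pre-trained keeps the general vocabulary and sees the drug name as five fragments; a model pre-trained from scratch on domain text can keep it as one token.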