Test 2 Flashcards
Chapters 8, 16, 17, 21
Natural Language
Unfettered spoken or written language
-Primary means of human communication
Natural Language Processing (NLP)
Enabling the use of automated methods that represent the relevant information in the text with high validity and reliability.
Patrick Suppes
-Pioneer in computerized learning
“…the challenge to psychological theory made by linguists to provide an adequate theory of language learning may well be regarded as the most significant intellectual challenge to theoretical psychology in this century.”
Bag-of-Words
A language model where text is represented as a collection of words, independent of each other and disregarding word order.
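A minimal sketch of the bag-of-words idea: two sentences with the same words in a different order produce the same representation. The helper name `bag_of_words` and the lowercase, whitespace-based splitting are assumptions made for illustration only.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for text, ignoring word order (hypothetical helper)."""
    # Assumption: lowercase and split on whitespace; real systems tokenize more carefully.
    return Counter(text.lower().split())

a = bag_of_words("patient denies chest pain")
b = bag_of_words("chest pain patient denies")
print(a == b)  # the two bags are equal because word order is discarded
```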
Keyword
A word or phrase that conveys special meaning or refers to information that is relevant to such a meaning.
Machine Learning
A computer technique in which information learned from data is used to improve system performance.
NLP Text Processing
- Lexical: Tokenization, part of speech, head, lemma
- Parsing and Chunking
- Semantic Tagging: Semantic role, word sense
- Certain Expressions: Named entities
- Discourse: coreference, discourse segments
NLP Speech Processing
- Phonetic transcription
- Segmentation (Punctuation)
- Prosody
Types of NLP: Information Extraction
Methods that process text to capture and organize specific information in the text and also to capture and organize specific relations between the pieces of information.
-Most common form in biomedicine.
Biosurveillance
A public health activity that monitors a population for occurrence of a rare disease or increased occurrence of a common one.
Named-entity Recognition
In language processing, a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories
Named-entity Normalization
The natural language processing method, after finding a named entity in a document, for linking (normalizing) that mention with appropriate database identifiers.
Modifiers of Interest
In NLP, a term used to describe or otherwise modify a named-entity that has been recognized.
Relations Among Named Entities
A characterization of two entities in NLP with respect to the semantic nature of the relationship between them.
Reference Resolution
In NLP, recognizing that two mentions in two different textual locations refer to the same entity.
Question Answering (QA)
A computer-based process whereby a user submits a natural language question that is then automatically answered by returning a specific response.
Text Summarization
Takes one or several documents as input and produces a single, coherent text that synthesizes the main points of the input documents.
Text Generation
Methods that create coherent natural language text from structured data or from textual documents in order to satisfy a communication goal.
Machine Translation
Automatic mapping of text written in one natural language into text of another language.
Text Readability Assessment and Simplification
An application of NLP in which computational methods are used to assess the clarity of writing for a certain audience or to revise the exposition using simpler terminology and sentence construction.
Linguistic Steps in NLP: Morphology
The way words are built up from smaller, meaning-bearing units; the structure of words
- Various forms of basic words
- Make more words from fewer.
Linguistic Steps in NLP: Syntax
How words are put together to form correct sentences and what structural role each word has.
-Syntax tree assigned by grammar or lexicon.
Linguistic Steps in NLP: Semantics
What words mean and how these meanings combine in sentences to form sentence meanings.
Linguistic Steps in NLP: Pragmatics
How sentences are used in different situations and how use affects the interpretation of the sentence.
Linguistic Steps in NLP: Discourse
How the immediately preceding sentences affect the interpretation of the next sentence.
Natural Language Understanding (NLU)
Subtopic of NLP in Artificial Intelligence that deals with machine reading comprehension.
Applications of NLP
- Intelligent computer systems
- NLU interfaces to databases
- Computer-aided instruction
- Information Retrieval
- Intelligent web searching
- Data mining
- Machine translation
- Speech Recognition
- Natural Language Generation
- Question Answering
Difficulties of NLP
- Different ways of parsing a sentence.
- Word category ambiguity
- Word sense ambiguity
- Words can mean more than the sum of their parts
- Imparting world knowledge is difficult
- Fictitious worlds
- Defining scope
- Language is changing and evolving
- Complex ways of interaction between the kinds of knowledge
- Exponential complexity at each point in using the knowledge
Ambiguity
The fundamental problem of computational linguistics
Morpheme
The smallest unit in grammar that has a meaning or linguistic function.
-Generally a root of a word, a prefix, or a suffix
Free Morpheme
A morpheme that is a word and does not contain another morpheme
Bound Morpheme
A morpheme that creates a different form of a word but must always occur with another morpheme.
Inflectional Morpheme
A morpheme that creates a different form of a word without changing the meaning or part of speech.
Derivational Morpheme
A morpheme that changes the meaning or part of speech of a morpheme.
Regular Expression
A mathematical model of a set of strings, defined using characters of an alphabet and the operators concatenation, union, and closure.
-Closure: zero or more occurrences of an expression
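The three operators can be seen together in one small pattern. This is a sketch using Python's `re` module; the pattern itself and the helper name `matches` are made up for illustration.

```python
import re

# Hypothetical pattern: concatenation "ab", union "(c|d)", and closure "*".
pattern = re.compile(r"ab(c|d)*")

def matches(s):
    """True when the whole string s belongs to the set described by the pattern."""
    return pattern.fullmatch(s) is not None

print(matches("ab"))      # closure allows zero occurrences of (c|d)
print(matches("abcdcd"))  # several occurrences are also allowed
print(matches("abx"))     # "x" is not part of the pattern
```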
Lexicon
A catalogue of words in a language, usually containing syntactic information such as parts of speech, pluralization rules, etc.
Finite State Automaton
An abstract, computer-based representation of the state of some entity together with a set of actions that can transform the state.
-Collections of finite state automata can be used to model complex systems.
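A minimal sketch of a finite state automaton as a transition table. The automaton below is an invented example that accepts one or more digits followed by "mg" (such as "50mg"); the state names and the helpers `classify` and `accepts` are assumptions, not taken from the source.

```python
# Hypothetical automaton: accepts one or more digits followed by "mg".
TRANSITIONS = {
    ("start", "digit"): "number",
    ("number", "digit"): "number",
    ("number", "m"): "unit_m",
    ("unit_m", "g"): "accept",
}
ACCEPTING = {"accept"}

def classify(ch):
    """Map a character to the input symbol used by the transition table."""
    return "digit" if ch.isdigit() else ch

def accepts(text):
    """Run the automaton over text and report whether it ends in an accepting state."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:  # no transition defined for this input: reject
            return False
    return state in ACCEPTING

print(accepts("50mg"))
print(accepts("mg"))
```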
Tokens (NLP)
The composite entities constructed from individual characters, typically words, numbers, dates, or punctuation.
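A rough sketch of tokenization with a single regular expression. The pattern is an assumption chosen for illustration (ISO-style dates, then numbers, then words, then single punctuation marks); production tokenizers handle many more cases.

```python
import re

# Assumed token pattern: dates like 2020-01-31, numbers, words, single punctuation marks.
TOKEN_RE = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:\.\d+)?|\w+|[^\w\s]")

def tokenize(text):
    """Split text into tokens using the assumed pattern above."""
    return TOKEN_RE.findall(text)

print(tokenize("Seen 2020-01-31; temp 38.5 C."))
```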
Markov Process
A mathematical model of a set of strings in which the probability of a given symbol occurring depends on the identity of the immediately preceding symbol or the two immediately preceding symbols.
Lexemes
A minimal lexical unit in a language that represents different forms of the same word.
Telegraphic (NLP)
Language that does not follow the usual rules of grammar but is compact and efficient.
Grammar (NLP)
A mathematical model of a potentially infinite set of strings.
Nested Structures (NLP)
A phrase or phrases that are used in place of simpler words within other phrases.
Probabilistic Context-Free Grammar
CFG in which the possible ways to expand a given symbol have varying probabilities rather than equal weight.
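A small sketch of how a probabilistic CFG scores one derivation: the probability is the product of the probabilities of the rules used. The toy grammar, its probabilities, and the helper name `derivation_probability` are invented for illustration.

```python
from math import prod

# Hypothetical toy grammar: the expansions of each left-hand symbol sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("patient",)): 0.6,
    ("NP", ("doctor",)): 0.4,
    ("VP", ("sleeps",)): 0.7,
    ("VP", ("coughs",)): 0.3,
}

def derivation_probability(rules_used):
    """Multiply the probabilities of the rules applied in one derivation."""
    return prod(RULES[rule] for rule in rules_used)

# Derivation of "patient coughs": S -> NP VP, NP -> patient, VP -> coughs
p = derivation_probability([
    ("S", ("NP", "VP")),
    ("NP", ("patient",)),
    ("VP", ("coughs",)),
])
print(p)
```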
Dependency Grammar (NLP)
A linguistic theory of syntax that is based on dependency relations between words, where one word in the sentence is independent and other words are dependent on that word.
Logic-based Semantics
A knowledge representation method based on the use of predicates.
Conceptual Graph (Semantics)
A formal notation in which knowledge is represented through explicit relationships between concepts.
Word Senses
Possible meanings of a word
Semantic Types
The categorization of words into semantic classes according to meaning.
Semantic Patterns
The study of the patterns formed by the co-occurrence of individual words in a phrase or the co-occurrence of the associated semantic types of words.
Semantic Relations
A classification of the meaning of a linguistic relationship.
Referential Expression
A sequence of one or more words that refer to a particular person, object, or event.
Coreference Chains
Provide a compact representation for encoding the words and phrases in a text that all refer to the same entity.
Parse Tree
The representation of structural relationships that results when using a grammar to analyze a given sentence.
Transition Matrix
A table of numbers giving the probability of moving from one state in a Markov model to another state, or the state that is reached in a finite-state machine, depending on the current character of the alphabet.
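A sketch of estimating a transition matrix for a first-order Markov model from one observed sequence, stored as nested dictionaries rather than a numeric array. The function name `transition_matrix` and the sample string are assumptions for illustration.

```python
from collections import Counter, defaultdict

def transition_matrix(sequence):
    """Estimate first-order transition probabilities from an observed sequence."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[prev][nxt] += 1
    # Normalize each row so the probabilities of leaving a state sum to 1.
    return {
        prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for prev, row in counts.items()
    }

# Transitions observed in "aabab": a->a, a->b, b->a, a->b
matrix = transition_matrix("aabab")
print(matrix["a"]["b"])  # two of the three transitions out of "a" go to "b"
```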
Chunking (NLP)
A processing method for determining non-recursive phrases where each phrase corresponds to a specific part of speech.
Chart Parsing
A dynamic programming algorithm for structuring a sentence according to grammar by saving and reusing segments of the sentence that have been parsed.
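One member of the chart parsing family is the CYK algorithm, sketched below as a recognizer (it only answers whether a sentence is grammatical). The sketch assumes a grammar in Chomsky normal form; the toy grammar and the function name `cyk_recognize` are invented for illustration.

```python
def cyk_recognize(words, lexical, binary, start="S"):
    """Return True if words can be derived from start under the given CNF grammar."""
    n = len(words)
    # chart[i][j] holds the symbols that can span words[i:j]; filled once and reused.
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexical.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between two saved segments
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= binary.get((left, right), set())
    return start in chart[0][n]

# Hypothetical toy grammar
lexical = {"the": {"Det"}, "patient": {"N"}, "coughs": {"VP"}}
binary = {("Det", "N"): {"NP"}, ("NP", "VP"): {"S"}}
print(cyk_recognize("the patient coughs".split(), lexical, binary))
```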
Predicate
The part of a sentence or clause containing a verb and stating something about the subject.
Argument
A word or phrase that helps complete a predicate.
Cascading Finite State Automata (FSA)
A tagging method in NLP in which a series of FSA are employed such that the output of one FSA becomes the input of another
Centering Theory
A theory that attempts to explain what entities are indicated by referential expressions by noting how the center of each sentence changes across the text.
Extrinsic Evaluation
An evaluation of a component of a system based on an evaluation of the performance of the entire system.
Intrinsic Evaluation
An evaluation of a component of a system that focuses only on the performance of the component.
Recall
The percentage of results that should have been obtained according to the test set that actually were obtained.
Precision
The percent of results that the system obtained that were actually correct according to the test set.
F Measure
A measure of overall accuracy that is a combination of Recall and Precision.
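A worked sketch of Recall, Precision, and the balanced F measure (F1, the harmonic mean of the two). The function name `precision_recall_f1` and the document ids are made up for illustration.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and balanced F measure for one result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The system returned 4 items; the test set says 3 items are correct; 2 overlap.
p, r, f = precision_recall_f1(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
print(p, r, f)
```

Here precision is 2/4, recall is 2/3, and F1 works out to 4/7.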
Harmonic Mean
The reciprocal of the arithmetic mean of the reciprocals of a set of values; in the weighted form, the weights reflect the relative importance of each value's contribution to the average.
Information Retrieval (IR)
Finding material of an unstructured nature that satisfies an information need from within large collections.
-aka: SEARCH
Challenges in Biomedical IR
- Transition from little information to information overload
- Multiple expressions for search topics
- Multiple meanings for each expression
- Balancing open access vs. providing for cost of production and maintenance.
3 Major Uses of the Web
- Informational: Seeking information
- Navigational: Looking for a specific page
- Transactional: Exchanges of goods and services
IR’s Relevance to Biomedicine and Health
- Growth of knowledge surpassed human memory capabilities
- Clinicians have frequent and unmet information needs
- Researchers must frequently update their knowledge in new areas quickly
- Primary literature can be scattered and hard to synthesize.
- Non-primary literature sources are neither comprehensive nor systematic
- Web is increasingly used as a source of health and biomedical information.
Classification of Knowledge-Based Scientific Information: Primary
Original Research:
- Mainly published in journals but also conference proceedings, technical reports, books, etc.
- Can include re-analysis, meta-analysis, and systematic reviews
Classification of Knowledge-Based Scientific Information: Secondary
Reviews, Condensations, Synopses of primary literature:
- Textbooks and handbooks are staples of clinicians, researchers, and others
- Guidelines are important for normalizing care and measuring quality
Classification of Knowledge-Based Content: Bibliographic
Contains databases of collections involving citations or pointers to published literature.
-Mainly primary sources
-Rich in metadata
-One of the oldest mainstays of IR sources.
Classification of Knowledge-Based Content: Full-Text
Involves the complete textual information contained in a bibliographic source.
-Everything online
Classification of Knowledge-Based Content: Annotated
Content that has been annotated to describe its type, subject matter, and other attributes.
- Non-text
- Structured text annotated with text
Includes:
- Image collections
- Citation databases
- Evidence-based Medicine databases
- Clinical Decision Support
- Genomic Databases
Classification of Knowledge-Based Content: Aggregations
Collections of content from a variety of types.
Types of Bibliographic Content
- Old databases have been revised
- New databases have emerged
- Web Catalogs
- Really Simple Syndication/Rich Site Summary (RSS)
MEDLINE
- Contains references to biomedical journal literature
- Launched in 1971; database dates back to 1966 (MEDLARS)
- Free for use as of 1997 via PubMed
National Guideline Clearinghouse
- Produced by Agency for Healthcare Research and Quality (AHRQ)
- Contains detailed information about guidelines
Web Catalogs
Aim to provide quality-filtered web sites to specific audiences
-Some geared towards physicians while others are for patients.
Really Simple Syndication/Rich Site Summary (RSS)
Feeds providing short summaries, typically of news, journal articles, or other recent web postings.
-Can be filtered by user with an aggregation tool.
Full-Text Content
Contains complete texts as well as tables, figures, images, etc
-Usually identical to the print version
Includes:
- Periodicals
- Books
- Web sites
Full-Text Content: Books
- Textbooks: Most well-known clinical textbooks now available online in e-text, accessible on mobile devices
- Compendia of drugs, diseases, evidence, etc
- Handbooks: Popular among clinicians
Value of E-Books
- Added multimedia
- Bundling of multiple books
- Can be updated between editions
- Links to related information
Full-Text Content: Websites
- Defined more narrowly to refer to coherent collections of information on the Web
- Includes added links and multimedia
- Increasingly integrated with other resources and available in different platforms.
Annotated Content: Image Collections
- Most prominent in visual medical specialties
- Many have associated text to support indexing and retrieval
Indexing
Assignment of metadata to content to facilitate retrieval
Two major types:
- Manual
- Automated
Human Indexing
- Usually performed by a professional with some background in biomedicine
- Follows protocol to scan resource and select terms from a controlled vocabulary
- Most vocabularies are hierarchical and have specific definitions for when term is to be assigned
Medical Subject Heading (MeSH)
- Over 26,000 terms with many synonyms
- Hierarchical based on 16 trees
- Contains 83 subheadings, for specificity
- MeSH browser allows exploration
MEDLINE Indexing
- Done by professionals who follow protocols first derived by Bachrach (1978)
- –Read: Title, Intro, and Conclusion
- –Scan: Methods, Results, Figures, Tables, and lastly Abstract
- Ignore publisher’s “key words”
- Assign 2-4 headings as central concepts and 5-10 as minor headings
- Use most specific heading in assigned hierarchy
- Publication Type is an important secondary tag
- Modern tools have been created to assist in the task.
Metadata in Indexing
Indexing covers more than content, such as:
- Author(s)
- Source
- Publication/Resource Type
- Relationship to Information
Automated Indexing
Indexing of all words that appear in the content.
- Often use stop words to rule out common words.
- Some systems stem words to root form
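A minimal sketch of automated word indexing: every word except stop words is mapped to the documents that contain it. The tiny stop list, the sample titles, and the name `build_index` are assumptions; stemming is left out to keep the sketch short.

```python
from collections import defaultdict

STOP_WORDS = {"the", "of", "in", "a", "and"}  # assumed, deliberately tiny stop list

def build_index(documents):
    """Map each non-stop word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

docs = {1: "Treatment of asthma in children", 2: "Asthma and the common cold"}
index = build_index(docs)
print(sorted(index["asthma"]))  # ids of the documents that mention "asthma"
```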
Weighted Indexing
- Usually used with Automated
- Gives weight to words that are frequent but discriminating
- Most common approach is for the weight to equal the product TF*IDF
- –Inverse Document Frequency (IDF)
- –Term Frequency (TF)
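A sketch of TF*IDF weighting. It assumes one common variant, with TF as the raw count of a term in a document and IDF as log(N / number of documents containing the term); other variants exist. The function name `tf_idf` and the sample documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return {doc_id: {term: TF * IDF}} for already-tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in documents.items():
        tf = Counter(tokens)
        # Assumed variant: raw term count times log(N / document frequency).
        weights[doc_id] = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
    return weights

docs = {
    "d1": ["asthma", "asthma", "treatment"],
    "d2": ["asthma", "diagnosis"],
}
w = tf_idf(docs)
print(w["d1"])
```

In this toy collection "asthma" appears in every document, so its weight is zero even though it is frequent, while "treatment" appears in only one document and receives a positive weight.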
Citation Indexing
Citation databases list all other articles that cite a specific article in journals
- Index articles that cite other articles
- Performed at content item level
- Goal is to designate related or important content items
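A small sketch of the idea behind a citation index: reference lists (article to the articles it cites) are inverted so that each article can be looked up by what cites it. The article labels and the name `cited_by` are hypothetical.

```python
from collections import defaultdict

def cited_by(references):
    """Invert {citing: [cited, ...]} into {cited: [citing, ...]}."""
    index = defaultdict(list)
    for citing, cited_list in references.items():
        for cited in cited_list:
            index[cited].append(citing)
    return index

# Hypothetical reference lists: article A cites C; article B cites C and A.
refs = {"A": ["C"], "B": ["C", "A"]}
citations = cited_by(refs)
print(citations["C"])  # articles that cite C
```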
Limitations of Human Indexing
- Inconsistency
- –Frequent Duplications
- Inadequate Indexing Vocabulary
- –Up to 25% of all concepts NOT in MeSH
- –Ambiguities and other naming problems
Limitations of Word Indexing: Synonymy
Different words have the same meaning.
Limitations of Word Indexing: Polysemy
The same word may have different meanings or senses.
Limitations of Word Indexing: Content
Words in a document might not reflect its focus.
Limitations of Word Indexing: Context
Words take on meaning based on the words around them
Limitations of Word Indexing: Morphology
Words can have suffixes that do not change the underlying meaning
Limitations of Word Indexing: Granularity
Queries and documents may describe concepts at different levels of a hierarchy.
IR System Evaluation
- Is the system used?
- Are users satisfied?
- Do they find relevant information?
- Do they complete their desired task?
Physicians are the most studied group
The Impact of IR on Physicians (4 Themes)
1) Recall - of forgotten information
2) Learning - of new information
3) Confirmation - of existing knowledge
4) Frustration - that the system used was not successful
Also:
1) Reassurance - that the system is available
2) Practice Improvement - of patient-physician relationship
Future challenges of IR Evaluation
- Must understand the tasks of the user and focus evaluation accordingly
- Ultimate measure may be a health outcome
Personal Health Records (PHR)
An electronic application through which individuals can access, manage, and share their health information, and that of others for whom they are authorized, in a private, secure, and confidential environment.
Common Areas Included in PHR
- Allergies
- Medications
- Personal Medical History
- Past and Future Doctor’s visits
- Vaccinations
- Surgeries/Procedures
- Past Diagnoses
Origin of PHR Data
- Doctors
- Self-report
- Health plans/Government insurance plans
Consumer Benefits of PHR
- Insight into Medical Record
- –Help uncover errors
- –Help develop ownership of one’s own health
- –Improves care when changing providers
- Improves Convenience
- –Referrals
- –Appointments
-Health education personalized to patient
Proprietary State of PHR Today
- Storage repositories for medical information
- Not well-defined across the industry
- Many systems that are institution-specific
- Information does not transfer well to other institutions
Future of Connectivity: Patient
- Secure communication between patient and provider
- Encompasses medical records from multiple providers
- Direct connectivity to Biomonitors
- Responsive to varying levels of health literacy, self-efficacy, and tech fluency
- High level of individually customizable security
Future of Connectivity: Provider
- Increased frequency of data collection from patients
- Understanding notes/results from other providers
- Reduced cost for chronic disease management
Personally Controlled Health Records (PCHR)
- Subset of PHR
- Enables a patient to assemble, maintain, and manage a secure copy of their medical data
- Designed on the principle that patients should be allowed to own and manage copies of their own health records
Personal Internetworked Notary and Guardian (PING)
A system designed as a fully distributed electronic medical record in which patients have control over who can read, write, or modify components
- Developed in 1998
- Renamed Indivo in 2006
Consumer Health Informatics (CHI)
Examines patient information from points of view such as health literacy, consumer knowledge, and education.
- Intended to empower patients while giving them the knowledge they need to make their own health decisions
- Couples the consumer’s needs for information with their healthcare preferences to create a tailor-made medical experience.
CHI defined by The National Center for Biomedicine
- Any tool or system primarily responsible for interacting with health information users or health information consumers.
- Any tool into which a patient inputs their health information and receives a body of health information
- Tool or system where information or other benefits may be used with the assistance of a healthcare professional, but not dependent on one.
Consumer Application
- Apps facilitating knowledge and understanding of disease management
- Apps facilitating the knowledge of observations of daily living (ODLs)
- Apps facilitating and promoting lifestyle management assistance
- Apps facilitating patient health, preventative care and self-care/assisted care.
Self-Management Systems
- Highly varied and usable on multiple platforms.
- Best when providing a timely response with information regarding the user’s current state of health.
Electronic PHR and Patient Portals
- Contain an individual’s health information, conforming to nationally recognized standards.
- Info can be pulled from multiple sources while managed and controlled by the user.
- Information stored can be:
- –Identifiers
- –Contact Info
- –Medication History
- –Allergies
- –Immunizations
Peer Interaction Systems
- Can operate alone or as part of a set of applications
- Use online forums or discussion groups to help patients communicate with others who have similar conditions.
Public Health Informatics
The systematic application of information and computer science and technology to public health practice, research, and learning.
Work of Informatics in Public Health
- Formulate models for acquiring, representing, processing, displaying, or transmitting health information or knowledge.
- Develop computer systems that use the models to deliver the information/knowledge
- Install systems to support the models
- Assess outcomes regarding the effects on the overall health care system.
Ten Essential Services of Public Health (1-5)
1) Monitor the health of individuals in the community to identify community health problems
2) Diagnose and investigate community health problems and hazards.
3) Inform, educate, and empower the community with respect to health issues.
4) Mobilize community partnerships in identifying and solving community health problems
5) Develop policies and plans that support individual and community health efforts
Ten Essential Services of Public Health (6-10)
6) Enforce laws and rules that protect public health and ensure safety in accordance with these laws.
7) Link individuals who have a need for community and personal health services to appropriate providers
8) Ensure a competent workforce for the provision of essential health services.
9) Research new insights and innovate solutions to community health problems.
10) Evaluate the effectiveness, accessibility, and quality of personal and population-based health services in a community
Epidemiology
The study of the prevalence and determinants of disability and disease in populations.
Core Functions of Public Health: Assessment
Tracking and monitoring the health status of populations; identifying and controlling disease outbreaks and epidemics.
Core Functions of Public Health: Policy Development
Utilizes the results of assessment activities and etiologic research in concert with local values and culture to recommend interventions and public policies that improve health status
Core Functions of Public Health: Assurance
The duty of public health agencies to assure their populations that the services necessary to achieve agreed-upon goals are provided.
Public Health Surveillance
The ongoing collection, analysis, interpretation, and dissemination of data on health conditions and threats to health.
- Represents one of the fundamental means by which priorities for public health action are set.
- Data collected for the purpose of action.
Coordinated Function of Public Health Informatics: Detection and Monitoring
Support of disease and threat surveillance, national health status indicators
Coordinated Function of Public Health Informatics: Analysis
Facilitating real-time evaluation of live data feeds, turning data into information for people at all levels of public health.
Coordinated Function of Public Health Informatics: Information Resources/Knowledge Management
Reference information, distance learning, decision support
Coordinated Function of Public Health Informatics: Alerting and Communications
Transmission of emergency alerts, routine professional discussions, collaborative activities
Coordinated Function of Public Health Informatics: Response
Management support of recommendations, prophylaxis, vaccinations, etc
National Electronic Disease Surveillance System (NEDSS)
Major CDC initiative that addresses public health issues by promoting the use of data and information system standards to advance the development of efficient, integrated, and interoperable surveillance systems at federal, state, and local levels.
-Designed to facilitate electronic transfer of appropriate information from clinical systems to public health systems, reduce provider burden in the provision of information, and enhance both the timeliness and quality of info.
Components of NEDSS
- Browser-based data entry
- Person-centric
- Case investigation capabilities
- ELR messages can be received
- Security to meet HIPAA standards
Geographic Information Systems (GIS)
- Support data warehouse capabilities
- Optimized for retrieval from very large record databases
- Can quickly cross-tabulate
- Study seasonal and secular trends
- Look for patterns by person, place, and time.
CDC’s National Health and Nutrition Examination Survey (NHANES)
Used to assess the health and nutritional status of children and adults in the U.S.
- Combines a home interview and health exam in mobile clinic.
- 50 years of experience conducting surveys using direct physical measures
Sample for NHANES
- Civilian, non-institutionalized household population
- Residents of all states and D.C.
- All ages
- 5,000 individuals annually
Immunization Information Systems (IIS)
Confidential, population-based, computerized databases that record all immunization doses administered by participating providers to persons residing within a given geopolitical area.
-Assists in designing and sustaining effective immunization strategies at the provider and program levels
Point-of-Care IIS
Can provide consolidated immunization histories for use by a vaccination provider in determining appropriate client vaccines.
Population Level IIS
Provides aggregate data on vaccinations for use in surveillance and program operations, and in guiding public health action with the goals of improving vaccination rates and reducing vaccine-preventable disease.