Lesson 1 Intro to NLP Flashcards
What is Natural language
Natural language refers to the language that is used for communication between humans
What is NLP
NLP is the process of using computers to extract meaning from text
What are the 2 components in NLP
Natural language understanding (NLU)
Natural language generation (NLG)
What is AI-complete in NLP
- Requires all the types of knowledge and context awareness humans possess
- AI-complete problems are the most difficult problems in the field of AI
What are the 3 fields that interact with one another to form NLP
- Computer Science
- Artificial Intelligence
- Linguistics
What are the ambiguities and challenges of language (6 points)
- Synonymy: different words with the same meaning
- Polysemy: same word, different meaning, different usage
- Text and speech are unstructured data.
- No fixed structure in sentence format
- No fixed schema: grammar
- Sometimes dirty: misspellings, slang, abbreviations
What are the 3 stages of history of NLP
- 1950s-1980s: linguistics methods and rules
- 1980s-Now: Statistical + Machine learning methods
- Now-Future: Deep learning
How has NLP evolved over the years
1950s-1980s: linguistics methods and rules
o Approach focused on
Linguistics: grammar rules, sentence structure parsing
Handwritten rules: huge sets of logical (if/else) statements
Phase structure grammar: conversion of sentences into forms that computers can understand
o Problems:
Too complex to maintain
Cannot scale
Cannot generalize
1980s-Now: Statistical + Machine learning methods
o Approach shifted from linguistics to data driven
o Increasing computational power and easier access to text
Web page
Digital archives
o NLP starts using statistical and probabilistic models
Data mining -> text mining
o Generic machine learning algorithms applied to NLP tasks
Sentiment analysis using logistic regression
Language models with Markov models
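The Markov-model language models above can be sketched in a few lines. This is a toy example (the corpus and words are made up, not from the lesson): a bigram Markov model that estimates the probability of the next word given only the current word, using raw counts.

```python
from collections import defaultdict

# Tiny made-up corpus; real models are trained on millions of sentences.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (bigram counts).
counts = defaultdict(lambda: defaultdict(int))
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def next_word_prob(current, nxt):
    """Maximum-likelihood estimate of P(nxt | current)."""
    total = sum(counts[current].values())
    return counts[current][nxt] / total if total else 0.0

print(next_word_prob("the", "cat"))  # "the" is followed by cat 2 of 4 times -> 0.5
```

The Markov assumption (only the current word matters) is what keeps the model simple, and also what limits it: it cannot use longer context the way later neural models can.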
What can we expect for the future of NLP
Now-Future: Deep learning
o More advances in computing power with parallelization (GPU)
o Availability of large datasets becomes the norm
o Neural Networks
Learned word representations with finite dimensions
Capture semantics and relationships among words
o RNN (recurrent neural network) / LSTM (Long Short-Term Memory)
Allow sequential processing and learning of text
Applied to machine translation tasks and question-answering systems
o Attention-based model
A way to place varying degrees of focus (attention) on different parts of the text
Breakthrough in machine translation and text generation tasks
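The core idea of attention-based models can be sketched numerically. This is a toy illustration (the 2-d word vectors are invented; real models learn them in hundreds of dimensions): attention weights come from a softmax over similarity scores between a query and each word's vector.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-d word vectors (for illustration only).
words = {"bank": [1.0, 0.0], "river": [0.9, 0.1], "money": [0.1, 0.9]}
query = [1.0, 0.0]  # what the model is currently "looking for"

# Dot-product similarity between the query and each word vector.
scores = [sum(q * k for q, k in zip(query, vec)) for vec in words.values()]
weights = softmax(scores)  # higher weight = more attention on that word
```

Here the query is most similar to "bank", so "bank" receives the largest attention weight; the weights always sum to 1.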
What is Natural Language Understanding (NLU)
Converting text/speech into a concept space / computer-readable format is Natural Language Understanding (NLU)
What is Natural Language Generation (NLG)
Process of going from the concept space back to either speech or text is Natural Language Generation (NLG)
What are the 6 examples of NLU applications
Document classification
o Classify documents into categories
o Classify emails as spam or not spam
o Classify product reviews as positive or negative
o Assign labels to documents
Document Recommendation
o Choosing the most relevant document based on some information or 'fingerprint'
o Choosing the most relevant webpages based on query to search engine
o Recommend news articles based on past articles liked or read
o Recommend restaurants based on restaurant reviews
Topic modelling
o Breaking a set of documents into topics at the word level
o See how prevalence of certain topics covered in a magazine changes over time
o Find documents belonging to a certain topic
Intent Matching
o Understanding that there are many ways to say or ask for the same thing
o Use in dialog systems
Natural Language Search
o Speaking or typing into a device using everyday language rather than keywords
o Natural language question answering
o Chatbot
Language Identification
o Determining which natural language given content is in
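The document classification example above (spam vs. not spam) can be illustrated with a deliberately crude sketch. This is not a real classifier, just a keyword-counting stand-in (the keyword list is invented); in practice a trained model such as logistic regression would learn these signals from labeled data.

```python
# Hypothetical spam keyword list (for illustration only).
SPAM_WORDS = {"free", "winner", "prize", "click"}

def classify_email(text):
    """Label an email 'spam' if it contains two or more spam keywords."""
    tokens = text.lower().split()
    hits = sum(1 for t in tokens if t in SPAM_WORDS)
    return "spam" if hits >= 2 else "not spam"

print(classify_email("click now free prize inside"))  # spam
print(classify_email("Meeting moved to 3pm today"))   # not spam
```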
What are the 4 examples of NLG applications
Machine Translation
o Automatically translate text between languages
Document Summarization
o Automatically generate text summaries of documents
Text generation
Question and answering
o Concerned with building unsupervised models/systems that provide answers to questions based on large and diverse text sources
Image Captioning
What are Large Language Models (LLMs)
- Deep learning models trained to produce text
- Backbone of modern Natural Language Processing
- Pre-trained by academic institutions and big tech companies such as OpenAI
- As the number of parameters increases, the model can acquire more granular knowledge and improve its predictions
What are the 10 steps in training an LLM from scratch
Data collection
o Gather a vast and diverse text corpus from the internet: websites, social media platforms, academic sources, etc. The quality and quantity of data are crucial
Data pre-processing
o Clean and format the data, including tokenization and handling special characters
Model Architecture
o Choose a Transformer-based model and design its structure, including the number of layers, attention mechanisms, and other hyperparameters
Initialization
o Initialize model parameters with random values or pre-trained weights from a similar model
Training objective
o The objective of LLMs is to enable the model to predict the next word or sequence of words in a sentence based on input data
Training
o Use backpropagation and optimization algorithms (e.g. Adam) to update model parameters, minimizing the chosen loss function
Regularization
o Apply techniques like dropout and layer normalization to prevent overfitting
Hyperparameter Tuning
o Fine-tune various hyperparameters to optimize model performance
Evaluation
o Assess the model’s performance using appropriate metrics and validation datasets
Inference
o Once trained, the LLM can be used for various natural language processing tasks, such as text generation, translation, and summarization
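The training objective in the steps above (predict the next word) can be made concrete with a toy loss computation. This is a sketch, not a real model: the vocabulary and predicted probabilities are invented, and the loss shown is the standard cross-entropy, i.e. the negative log-probability the model assigned to the true next word.

```python
import math

# Hypothetical 4-word vocabulary (for illustration only).
vocab = ["the", "cat", "sat", "mat"]

# Pretend the model, after seeing "the cat", output these probabilities.
predicted = [0.1, 0.1, 0.7, 0.1]
true_next = "sat"

# Cross-entropy loss: penalize low probability on the correct next word.
loss = -math.log(predicted[vocab.index(true_next)])
print(round(loss, 3))  # -ln(0.7) -> 0.357; lower loss = better prediction
```

Backpropagation (step 6) adjusts the model's parameters to push this loss down across the whole training corpus.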