Data Science Flashcards
Mesh TensorFlow
Mesh TensorFlow (mtf) is a “language” for “distributed deep learning”
Supports:
Data-parallel training (batch-splitting)
Model-parallel training (model-splitting) - the parameters of the model do not fit on one device, e.g. a 5-billion-parameter language model.
Spatial partitioning - an example is so large that the activations do not fit on one device, e.g. a large 3D image model (experimental/unet.py).
Lower-latency parallel inference (at batch size 1).
https://github.com/tensorflow/mesh
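A minimal sketch of the mtf style, loosely following the repo's README (names and sizes are illustrative, and the API may differ between versions): tensors carry named dimensions, and a layout decides which dimensions get split across the processor mesh.

import mesh_tensorflow as mtf

# Build a graph whose tensors have named dimensions.
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

batch_dim = mtf.Dimension("batch", 100)
io_dim = mtf.Dimension("io", 784)
hidden_dim = mtf.Dimension("hidden", 1024)

x = mtf.get_variable(mesh, "x", shape=[batch_dim, io_dim])
w = mtf.get_variable(mesh, "w", shape=[io_dim, hidden_dim])

# Splitting "batch" across devices gives data parallelism;
# splitting "hidden" gives model parallelism.
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]))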
T5
Text-to-Text Transfer Transformer
https://www.machinecurve.com/index.php/question/what-is-the-t5-transformer-and-how-does-it-work/
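A small usage sketch with the Hugging Face transformers library ("t5-small" and the task prefix follow the standard transformers example):

from transformers import T5ForConditionalGeneration, T5Tokenizer

# T5 casts every task as text-to-text; the prefix tells the model which task to run.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))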
Transfer learning
pre-training a model on “large unlabeled text data” with a “self-supervised task”, such as language modeling or filling in missing words.
After that, the model can be fine-tuned on smaller labeled datasets, often resulting in (far) better performance than training on the labeled data alone.
General Language Understanding Evaluation (GLUE)
The GLUE benchmark is a collection of nine natural language understanding tasks:
single-sentence tasks: CoLA and SST-2;
similarity and paraphrase tasks: MRPC, STS-B, and QQP;
natural language inference tasks: MNLI, QNLI, RTE, and WNLI.
ROUGE-N
ROUGE-N measures the number of matching ‘n-grams’ between our model-generated text and a ‘reference’; the recall variant divides these matches by the number of n-grams in the reference.
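A minimal sketch of the recall flavour of ROUGE-N (function name and whitespace tokenization are just for illustration):

from collections import Counter

def rouge_n_recall(candidate, reference, n=2):
    # Count each n-gram in both texts, then take the overlap.
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    matches = sum((cand & ref).values())
    return matches / max(sum(ref.values()), 1)  # recall: matches / reference n-grams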
Fine-tuning
“unfreezing” the pre-trained model (or part of it) and “re-training” it on the new data with a “very low learning rate”.
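An illustrative Keras sketch (the backbone, head, and dataset are stand-ins, not a fixed recipe):

import tensorflow as tf

# Pre-trained backbone; include_top=False drops the original classifier head.
base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg")
base.trainable = True  # "unfreeze" the pre-trained weights

model = tf.keras.Sequential([base, tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # very low learning rate
    loss="binary_crossentropy",
)
# model.fit(new_dataset, epochs=3)  # new_dataset: the smaller labeled dataset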
checkpoint
A checkpoint is an intermediate “dump of a model’s entire internal state (its weights, current learning rate, etc.)”
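A small TensorFlow sketch (the model and path are placeholders):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

# tf.train.Checkpoint dumps model weights plus optimizer state
# (which carries the position in any learning-rate schedule).
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
path = ckpt.save("/tmp/demo_ckpt")
ckpt.restore(path)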
natural language inference
Natural language inference is the task of “determining” whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise” - e.g. the premise “A soccer game with multiple males playing” entails the hypothesis “Some men are playing a sport”.
How to automatically compute weights?
Construct a simple neural network
network = (scores * weights) -> activation function -> 0/1 output -> grid search to update the weights (sketch below)
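A hedged sketch of that idea (the sigmoid activation and the grid values are assumptions):

import itertools
import numpy as np

def predict(X, weights, threshold=0.5):
    z = X @ weights                            # scores * weights
    return 1 / (1 + np.exp(-z)) > threshold    # activation function -> 0/1

def grid_search_weights(X, y, grid=np.linspace(0.0, 1.0, 5)):
    # Try every weight combination and keep the most accurate one.
    best_w, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=X.shape[1]):
        acc = (predict(X, np.array(w)) == y).mean()
        if acc > best_acc:
            best_w, best_acc = np.array(w), acc
    return best_w, best_acc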
NER models
spaCy, Med7, DeepPavlov, Flair, AllenNLP, Polyglot, BioBERT
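For example, named entities with spaCy (assumes en_core_web_sm is installed via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, U.K. GPE, $1 billion MONEY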
extract important information from text
- using the structural integrity of the document (PDF, MS Word doc, speaker conversation)
- identify entities and extract the surrounding text
- remove noise like greetings (see the sketch after this list)
- remove tables and paragraphs using anchor points/placeholders
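A hedged sketch of the greeting-removal step (the pattern list is illustrative, not exhaustive):

import re

# Drop lines that start with common greetings/sign-offs before extraction.
GREETINGS = re.compile(r"^(hi|hello|dear|regards|thanks)\b", re.IGNORECASE)

def strip_greetings(raw_text):
    return "\n".join(
        line for line in raw_text.splitlines()
        if not GREETINGS.match(line.strip())
    )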
Things to research before implementing a pipeline
- understand the code requirements (size of the libraries used, how much memory the code consumes during computation, hard-to-install libraries, safety)
- run a demo on an instance before finalising the requirements
- understand your platform (AWS, GCP, Databricks, Azure, etc.)
- make sure the data model gets proper updates
- ALWAYS start off with a pipeline flow (process flow and data flow)
list of dict to pandas df
df = pd.DataFrame(list_of_dicts)  # each dict in the list becomes one row
STT connection error while loading models
- download the model folder
- set a local path to the model folder
# Q&A
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large")
# instead of the hub id 'bert-large-uncased-whole-word-masking-finetuned-squad'

# Extractive summariser
from transformers import AutoConfig, AutoModel, AutoTokenizer
from summarizer import Summarizer

custom_config = AutoConfig.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large")
custom_config.output_hidden_states = True
custom_tokenizer = AutoTokenizer.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large")
custom_model = AutoModel.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large", config=custom_config)

body = "text"
summ_model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)
summ_model(body)
filter dataframe rows based on column values
df.loc[df["col_name"] == value]  # keep rows where col_name equals value