Data Science Flashcards
Mesh TensorFlow
Mesh TensorFlow (mtf) is a “language” for “distributed deep learning”
Supports:
Data-parallel training (batch-splitting)
Model-parallel training (model-splitting) - the parameters of the model do not fit on one device, e.g. a 5-billion-parameter language model.
Spatial partitioning - an example is so large that the activations do not fit on one device, e.g. a large 3D image model (experimental/unet.py).
Lower-latency parallel inference (at batch size 1).
https://github.com/tensorflow/mesh
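A minimal sketch of the mtf style, loosely following the repo's README (names and sizes are illustrative, and the API may differ between versions): tensors carry named dimensions, and a layout decides which dimensions get split across the processor mesh.

import mesh_tensorflow as mtf

# Build a graph whose tensors have named dimensions.
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

batch_dim = mtf.Dimension("batch", 100)
io_dim = mtf.Dimension("io", 784)
hidden_dim = mtf.Dimension("hidden", 1024)

x = mtf.get_variable(mesh, "x", shape=[batch_dim, io_dim])
w = mtf.get_variable(mesh, "w", shape=[io_dim, hidden_dim])

# Splitting "batch" across devices gives data parallelism;
# splitting "hidden" gives model parallelism.
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]))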
T5
Text-to-Text Transfer Transformer
https://www.machinecurve.com/index.php/question/what-is-the-t5-transformer-and-how-does-it-work/
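A small usage sketch with the Hugging Face transformers library ("t5-small" and the task prefix follow the standard transformers example):

from transformers import T5ForConditionalGeneration, T5Tokenizer

# T5 casts every task as text-to-text; the prefix tells the model which task to run.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))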
Transfer learning
pre-training a model on “large unlabeled text data” with a “self-supervised task”, such as language modeling or filling in missing words.
After that, the model can be fine-tuned on smaller labeled datasets, often resulting in (far) better performance than training on the labeled data alone.
General Language Understanding Evaluation (GLUE)
The GLUE benchmark is a collection of nine natural language understanding tasks:
single-sentence tasks: CoLA and SST-2;
similarity and paraphrase tasks: MRPC, STS-B, and QQP;
natural language inference tasks: MNLI, QNLI, RTE, and WNLI.
ROUGE-N
ROUGE-N measures the number of matching ‘n-grams’ between our model-generated text and a ‘reference’; the recall variant divides these matches by the number of n-grams in the reference.
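A minimal sketch of the recall flavour of ROUGE-N (function name and whitespace tokenization are just for illustration):

from collections import Counter

def rouge_n_recall(candidate, reference, n=2):
    # Count each n-gram in both texts, then take the overlap.
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    matches = sum((cand & ref).values())
    return matches / max(sum(ref.values()), 1)  # recall: matches / reference n-grams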
Fine-tuning
“unfreezing” the pre-trained model (or part of it) and “re-training” it on the new data with a “very low learning rate”.
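An illustrative Keras sketch (the backbone, head, and dataset are stand-ins, not a fixed recipe):

import tensorflow as tf

# Pre-trained backbone; include_top=False drops the original classifier head.
base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg")
base.trainable = True  # "unfreeze" the pre-trained weights

model = tf.keras.Sequential([base, tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # very low learning rate
    loss="binary_crossentropy",
)
# model.fit(new_dataset, epochs=3)  # new_dataset: the smaller labeled dataset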
checkpoint
A checkpoint is an intermediate “dump of a model’s entire internal state (its weights, current learning rate, etc.)”
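A small TensorFlow sketch (the model and path are placeholders):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

# tf.train.Checkpoint dumps model weights plus optimizer state
# (which carries the position in any learning-rate schedule).
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
path = ckpt.save("/tmp/demo_ckpt")
ckpt.restore(path)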
natural language inference
Natural language inference is the task of “determining” whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise” - e.g. the premise “A soccer game with multiple males playing” entails the hypothesis “Some men are playing a sport”.
How to automatically compute weights?
Construct a simple neural network
network = (scores * weights) -> activation function -> 0/1 output -> grid search to update the weights (sketch below)
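A hedged sketch of that idea (the sigmoid activation and the grid values are assumptions):

import itertools
import numpy as np

def predict(X, weights, threshold=0.5):
    z = X @ weights                            # scores * weights
    return 1 / (1 + np.exp(-z)) > threshold    # activation function -> 0/1

def grid_search_weights(X, y, grid=np.linspace(0.0, 1.0, 5)):
    # Try every weight combination and keep the most accurate one.
    best_w, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=X.shape[1]):
        acc = (predict(X, np.array(w)) == y).mean()
        if acc > best_acc:
            best_w, best_acc = np.array(w), acc
    return best_w, best_acc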
NER models
spaCy, Med7, DeepPavlov, Flair, AllenNLP, Polyglot, BioBERT
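For example, named entities with spaCy (assumes en_core_web_sm is installed via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, U.K. GPE, $1 billion MONEY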
extract important information from text
- using the structural integrity of the document (PDF, MS Word doc, speaker conversation)
- identify entities and extract the surrounding text
- remove noise like greetings (see the sketch after this list)
- remove tables and paragraphs using anchor points/placeholders
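A hedged sketch of the greeting-removal step (the pattern list is illustrative, not exhaustive):

import re

# Drop lines that start with common greetings/sign-offs before extraction.
GREETINGS = re.compile(r"^(hi|hello|dear|regards|thanks)\b", re.IGNORECASE)

def strip_greetings(raw_text):
    return "\n".join(
        line for line in raw_text.splitlines()
        if not GREETINGS.match(line.strip())
    )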
Things to research before implementing a pipeline
- understand the code requirements (size of the libraries used, how much memory the code consumes during computation, hard-to-install libraries, safety)
- run a demo on an instance before finalising the requirements
- understand your platform (AWS, GCP, Databricks, Azure, etc.)
- make sure the data model gets proper updates
- ALWAYS start off with a pipeline flow (process flow and data flow)
list of dict to pandas df
df = pd.DataFrame(list_of_dicts)  # each dict in the list becomes one row
STT connection error while loading models
- download the model folder
- set a local path to the model folder
# Q&A
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large")
# instead of the hub id 'bert-large-uncased-whole-word-masking-finetuned-squad'

# Extractive summariser
from transformers import AutoConfig, AutoModel, AutoTokenizer
from summarizer import Summarizer

custom_config = AutoConfig.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large")
custom_config.output_hidden_states = True
custom_tokenizer = AutoTokenizer.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large")
custom_model = AutoModel.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large", config=custom_config)

body = "text"
summ_model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)
summ_model(body)
filter dataframe rows based on column values
df.loc[df["col_name"] == value]  # keep rows where col_name equals value