DataScience Flashcards

1
Q

Mesh TensorFlow

A

Mesh TensorFlow (mtf) is a “language” for “distributed deep learning”

Supports:
1. Data-parallel training (batch-splitting).
2. Model-parallel training, where the parameters of the model do not fit on one device, e.g. a 5-billion-parameter language model.
3. Spatial splitting, where a single example is so large that the activations do not fit on one device, e.g. a large 3D image model (experimental/unet.py).
4. Lower-latency parallel inference (at batch size 1).

https://github.com/tensorflow/mesh
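
A minimal graph-construction sketch in the style of the repo README (names like mtf.Dimension and mtf.einsum come from the README; exact signatures may vary by version, and actually executing the graph additionally needs a mesh implementation/layout, not shown):

import tensorflow.compat.v1 as tf
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# every tensor dimension is named, not positional
batch_dim = mtf.Dimension("batch", 100)
io_dim = mtf.Dimension("io", 784)
hidden_dim = mtf.Dimension("hidden", 1024)

x = mtf.import_tf_tensor(mesh, tf.zeros([100, 784]), shape=[batch_dim, io_dim])
w = mtf.get_variable(mesh, "w", [io_dim, hidden_dim])
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]))

# splitting "batch" across devices = data parallelism;
# splitting "hidden" = model parallelism (chosen via a mesh layout)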

2
Q

T5

A

Text-To-Text Transfer Transformer

https://www.machinecurve.com/index.php/question/what-is-the-t5-transformer-and-how-does-it-work/

3
Q

Transfer learning

A

Pre-training a model on “large unlabeled text data” with a “self-supervised task”, such as language modeling or filling in missing words.

After that, the model can be fine-tuned on smaller labeled datasets, often resulting in (far) better performance than training on the labeled data alone.

4
Q

General Language Understanding Evaluation (GLUE)

A

The GLUE benchmark is a collection of nine natural language understanding tasks:

single-sentence tasks: CoLA and SST-2
similarity and paraphrase tasks: MRPC, STS-B and QQP
natural language inference tasks: MNLI, QNLI, RTE and WNLI

5
Q

ROUGE-N

A

ROUGE-N measures the number of matching ‘n-grams’ between our model-generated text and a ‘reference’.
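
A minimal sketch of ROUGE-N recall (pure Python; whitespace tokenisation is just for illustration):

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=2):
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(c, cand[g]) for g, c in ref.items())  # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)              # matches / n-grams in reference

rouge_n_recall("the cat sat on the mat", "the cat lay on the mat")  # 0.6 for n=2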

6
Q

Fine-tuning

A

“Unfreezing” the pre-trained model (or part of it) and “re-training” it on the new data with a “very low learning rate”.
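
A minimal Keras sketch of the pattern (the base model and learning rates are illustrative):

import tensorflow as tf

base = tf.keras.applications.ResNet50(include_top=False, pooling="avg")
base.trainable = False  # freeze the pre-trained weights while the new head trains
model = tf.keras.Sequential([base, tf.keras.layers.Dense(2, activation="softmax")])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy")
# ... fit the head on the new data ...

base.trainable = True   # "unfreeze" the pre-trained model
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # very low learning rate
              loss="sparse_categorical_crossentropy")
# ... fit again: this is the fine-tuning step ...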

7
Q

checkpoint

A

A checkpoint is an intermediate “dump of a model’s entire internal state (its weights, current learning rate, etc.)”
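
A minimal PyTorch sketch (the model and saved fields are illustrative):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# dump the training state mid-run ...
torch.save({"model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "epoch": 5}, "ckpt.pt")

# ... and resume from it later
state = torch.load("ckpt.pt")
model.load_state_dict(state["model_state_dict"])
optimizer.load_state_dict(state["optimizer_state_dict"])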

8
Q

natural language inference

A

Natural language inference is the task of “determining” whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”. E.g. the premise “A soccer game with multiple males playing” entails the hypothesis “Some men are playing a sport”.

9
Q

How to automatically compute weights?

A

Construct a simple neural network

network: weighted sum (scores * weights) -> activation function -> 0/1 decision -> grid search to update the weights (see the sketch below)
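
A minimal NumPy sketch of one reading of this card (the scores, labels and grid are illustrative): combine per-item scores with a weight vector, squash through a sigmoid to a 0/1 decision, and grid-search the weights against known labels:

import itertools
import numpy as np

scores = np.array([[0.2, 0.9], [0.8, 0.1], [0.7, 0.6]])  # two scores per item
labels = np.array([1, 0, 1])                             # known 0/1 outcomes

def predict(weights):
    z = scores @ weights                               # weighted sum of scores
    return (1 / (1 + np.exp(-z)) >= 0.5).astype(int)   # sigmoid -> 0/1

grid = np.linspace(-1, 1, 21)
best = max(itertools.product(grid, repeat=2),
           key=lambda w: (predict(np.array(w)) == labels).mean())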

10
Q

NER models

A
spaCy (see the sketch below)
Med7
DeepPavlov
Flair
AllenNLP, Polyglot
BioBERT
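
A quick sanity check with the first of these (en_core_web_sm is the standard small English pipeline):

import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("Pfizer met the FDA in New York in March 2021.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Pfizer ORG, New York GPE
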
11
Q

extract important information from text

A
  1. use the structural integrity of the document (PDF, MS Word doc, speaker conversation)
  2. identify entities and extract the surrounding text
  3. remove noise like greetings
  4. remove tables and paragraphs with anchor points/placeholders
12
Q

Things to research before implementing a pipeline

A
  1. understand the code requirements (size of the libraries used, how much memory your code consumes during computation, libraries that are difficult to install, safety)
  2. run a demo on an instance before finalising the requirements
  3. understand your platform (AWS, GCP, Databricks, Azure, etc.)
  4. the data model should have proper updates
  5. ALWAYS start off with a pipeline flow (process flow and data flow)
13
Q

list of dict to pandas df

A

df = pd.DataFrame(list_of_dicts)
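
For example (names are illustrative):

import pandas as pd

records = [{"name": "a", "score": 1}, {"name": "b", "score": 2}]
df = pd.DataFrame(records)  # one row per dict; columns come from the keys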

14
Q

SSL connection error while loading models

A
  1. download the model folder
  2. set a local path to the model folder

from transformers import AutoConfig, AutoModel, AutoTokenizer, BertForQuestionAnswering

# Q&A
model = BertForQuestionAnswering.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large")  # 'bert-large-uncased-whole-word-masking-finetuned-squad'

# Extractive summariser
custom_config = AutoConfig.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large")
custom_config.output_hidden_states = True
custom_tokenizer = AutoTokenizer.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large")
custom_model = AutoModel.from_pretrained(r"C:\Users\sd22768\Downloads\bert_uncased_large", config=custom_config)

from summarizer import Summarizer

body = "text"
summ_model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)
summ_model(body)

15
Q

filter dataframe rows based on column values

A

df.loc[df["col_name"] == value]

16
Q

decode ASCII from a byte string

A

s.decode("ascii")  # s is a bytes object; returns a str

17
Q

apply function to pandas column

A

df["col name"] = df["col name"].apply(lambda x: x.decode("ascii"))

18
Q

question tagger

A
  1. nltk nps_chat (fast; see the sketch below)
  2. stanza (dead slow)
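
A minimal sketch of the nltk nps_chat route, following the classic NLTK-book recipe (the feature function is illustrative):

import nltk
nltk.download("nps_chat", quiet=True)
nltk.download("punkt", quiet=True)

posts = nltk.corpus.nps_chat.xml_posts()

def features(sentence):
    return {word.lower(): True for word in nltk.word_tokenize(sentence)}

data = [(features(p.text), p.get("class")) for p in posts]  # dialogue-act labels
classifier = nltk.NaiveBayesClassifier.train(data)
classifier.classify(features("what time is it"))  # e.g. 'whQuestion'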

19
Q

keyword extraction

A
  1. KeyBERT
20
Q

keyphrase extraction

A
  1. KeyBERT
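
Assuming “kBERT” on these two cards means the KeyBERT library, a minimal sketch (the default embedding model is a sentence-transformers model):

from keybert import KeyBERT  # pip install keybert

doc = "Supervised learning is the task of learning a function that maps an input to an output."
kw_model = KeyBERT()
kw_model.extract_keywords(doc)                                # single-word keywords
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2))  # keyphrases up to bigrams
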
21
Q

replace string

A

string.replace("?", "word")

22
Q

replace string with a replacement count

A

string.replace("?", "word", 2)  # the third argument caps the number of replacements (first 2 occurrences); it is not an index

23
Q

syntax

A

structure of words in a sentence

24
Q

semantic

A

the meaning of a word, phrase, or text.

25
Q

Haystack

A

Open-source framework for building search systems

26
Q

call assessment -

A

phrase: filter with USE (Universal Sentence Encoder); on the filtered sentences, apply semantic search
embedding: try new things like BERT, USE, etc.
keyword search: POS patterns, expansion using word2vec/fastText trained models, and pruning using a masked BERT and its score
multiple similarity scores: embedding score, aspect score, number score
final match

27
Q

TOPIC SEARCH

A

keyword search: exact keyword match, lemma-based search, noun-noun/keyword POS pattern search
-> identify duplicates

28
Q

keyword expansion

A

embedding search: fastText/USE/BERT with cosine-similarity search
reduce computation by selecting nouns and other POS patterns (see the sketch below)
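
A minimal sketch of the embedding-expansion step with gensim (the pretrained vectors named here are one of gensim's downloadable sets; any word2vec/fastText vectors work the same way):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pretrained word vectors
vectors.most_similar("heart", topn=5)          # cosine-nearest neighbours as expansion candidates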

29
Q

Zero-shot learning

A

BART/RoBERTa -> NLI model for topic classification without keywords (see the sketch below)
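
A minimal sketch with the transformers zero-shot pipeline (facebook/bart-large-mnli is the usual default checkpoint; text and labels are illustrative):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier("The drug showed a strong response in the trial.",
           candidate_labels=["efficacy", "safety", "pricing"])
# each label is scored via NLI: premise = the text, hypothesis = "This example is about {label}."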

30
Q

Taxonomy

A

hierarchical ontology (here, for medical affairs)

31
Q

sentiment search

A

BERT-based ABSA (aspect-based sentiment analysis) model - mind-blowing performance

32
Q

emerging themes

A

Kendall trend test / slope estimation / time-series decomposition using a stats library (see the sketch below)
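
Assuming “stas library” means statsmodels, a minimal decomposition sketch (the series and period are illustrative):

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

counts = pd.Series(range(24), index=pd.date_range("2021-01-01", periods=24, freq="M"))
result = seasonal_decompose(counts, model="additive", period=12)
result.trend  # a rising trend component flags a candidate emerging theme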

33
Q

q&a

A

Haystack pipeline-based implementation (sketch below)
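
A minimal sketch against the Haystack 1.x API (class names changed across versions, and use_bm25 may not exist in very old 1.x releases; deepset/roberta-base-squad2 is a common public reader checkpoint):

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

store = InMemoryDocumentStore(use_bm25=True)
store.write_documents([{"content": "Haystack is an open-source framework for building search systems."}])

pipe = ExtractiveQAPipeline(
    reader=FARMReader(model_name_or_path="deepset/roberta-base-squad2"),
    retriever=BM25Retriever(document_store=store),
)
result = pipe.run(query="What is Haystack?",
                  params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 1}})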