Network, Graph and Data Science Flashcards
Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling methods.
What is topic modeling?
Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds some natural groups of items (topics) even when we’re not sure what we’re looking for.
https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
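As a quick sketch of what topic modeling looks like in practice, here is a minimal LDA run using scikit-learn's `LatentDirichletAllocation`. The four-document corpus and the choice of two topics are illustrative assumptions, not from the article:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus: two docs about pets, two about finance.
docs = [
    "cats dogs pets animal fur",
    "dogs pets leash walk fur",
    "stocks market trading price",
    "market price invest stocks",
]

# LDA expects bag-of-words counts as input.
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: documents, cols: topic proportions

print(doc_topics.shape)  # (4, 2); each row is a distribution over the 2 topics
```

With no labels given, the model tends to place the pet documents and the finance documents in different topics, which is the "natural groups even when we're not sure what we're looking for" idea.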
Supervised Machine Learning - When you provide the algorithm/model/etc. with pre-made categories and examples from those categories. Used in classification tasks - for example, finding which photos are cats, which are dogs, which are birds, etc.
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
Unsupervised Machine Learning - the algorithm attempts to automatically find structure in the data by extracting useful features and analyzing how the data is organized.
Clustering:
Without being an expert ornithologist, it’s possible to look at a collection of bird photos and separate them roughly by species, relying on cues like feather color, size or beak shape. That’s how the most common application for unsupervised learning, clustering, works: the deep learning model looks for training data that are similar to each other and groups them together.
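The bird-photo intuition can be sketched numerically with a from-scratch k-means (real work would use a library implementation). The two synthetic "species" blobs are a made-up assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic "species": 20 points around (0, 0) and 20 around (5, 5).
data = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

def kmeans(points, k, iters=10):
    # Toy initialization: first and last points (one from each blob here).
    centroids = np.vstack([points[0], points[-1]])
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(data, 2)
# No labels were provided, yet the two blobs end up in separate groups.
```

The model never sees a species name; it groups points purely by similarity, just as the clustering description above says.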
Anomaly detection:
Banks detect fraudulent transactions by looking for unusual patterns in a customer’s purchasing behavior. For instance, if the same credit card is used in California and Denmark within the same day, that’s cause for suspicion. Similarly, unsupervised learning can be used to flag outliers in a dataset.
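A minimal, hypothetical version of the same idea: flag any transaction amount more than two standard deviations from this card's mean. The amounts and the 2-sigma cutoff are made up for illustration:

```python
# Hypothetical transaction amounts for one card; one is wildly atypical.
amounts = [23.5, 41.0, 18.2, 30.7, 25.1, 980.0, 22.9, 35.4]

mean = sum(amounts) / len(amounts)
var = sum((a - mean) ** 2 for a in amounts) / len(amounts)
std = var ** 0.5

# |z| > 2 is a common (but arbitrary) threshold for "unusual".
outliers = [a for a in amounts if abs(a - mean) / std > 2]
print(outliers)  # [980.0]
```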
Association: Fill an online shopping cart with diapers, applesauce and sippy cups and the site just may recommend that you add a bib and a baby monitor to your order. This is an example of association, where certain features of a data sample correlate with other features. By looking at a couple key attributes of a data point, an unsupervised learning model can predict the other attributes with which they’re commonly associated.
Autoencoders:
Autoencoders take input data, compress it into a code, then try to recreate the input data from that summarized code. It’s like starting with Moby Dick, creating a SparkNotes version and then trying to rewrite the original story using only SparkNotes for reference. While autoencoders are a neat deep learning trick, there are few real-world cases where a simple one is useful on its own. But add a layer of complexity and the possibilities multiply: by using both noisy and clean versions of an image during training, autoencoders can remove noise from visual data like images, video or medical scans to improve picture quality.
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
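The compress-then-reconstruct idea can be sketched with a tiny linear autoencoder trained by plain gradient descent. The data, layer sizes, and learning rate are toy assumptions; real autoencoders are nonlinear neural networks:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 4-D data that actually lies on a 2-D subspace, so a 2-number code suffices.
basis = rng.normal(size=(2, 4))
data = rng.normal(size=(50, 2)) @ basis

W_enc = rng.normal(scale=0.1, size=(4, 2))  # encoder: 4 numbers -> 2-number code
W_dec = rng.normal(scale=0.1, size=(2, 4))  # decoder: code -> 4 numbers

loss_init = float(np.mean((data @ W_enc @ W_dec - data) ** 2))
lr = 0.02
for _ in range(3000):
    code = data @ W_enc   # compress (the "SparkNotes version")
    recon = code @ W_dec  # try to rewrite the original from the summary
    err = recon - data
    # Gradient steps on the mean squared reconstruction error.
    W_dec -= lr * (code.T @ err) / len(data)
    W_enc -= lr * (data.T @ (err @ W_dec.T)) / len(data)

loss_final = float(np.mean((data @ W_enc @ W_dec - data) ** 2))
# Training drives the reconstruction error down from its initial value.
```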
Semi-Supervised Machine Learning - a middle ground between supervised and unsupervised learning: the model is trained on a small amount of labeled data together with a much larger amount of unlabeled data, using the labeled examples to guide the structure it discovers in the rest.
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
Homophily - the tendency of nodes to connect to other nodes that are similar to them; “birds of a feather flock together.”
?
Stylometric analysis - identification using language patterns
?
Causal inference - determining what caused a phenomenon by analyzing the phenomenon itself.
?
Automorphic equivalence - nodes that occupy the same structural role in a network, even if they are not connected to the same others; for example, all store managers can be considered automorphically equivalent based on their role.
?
Structural equivalence - parties in a network that have identical connections to the same others, making them completely interchangeable and substitutable for each other.
?
Eigenvector centrality - a ranking of a node’s power or influence based on the cumulative influence of the nodes it is connected to; a node is central if its neighbors are themselves central.
?
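A minimal power-iteration sketch on a made-up four-node network (the adjacency matrix is an assumption for illustration). Each node's score is proportional to the sum of its neighbors' scores, which is exactly the leading eigenvector of the adjacency matrix:

```python
import numpy as np

# Hypothetical undirected network: node 0 connects to everyone;
# the others have fewer, less-connected neighbors.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], float)

# Power iteration: repeatedly multiply by A and renormalize;
# the vector converges to the leading eigenvector.
x = np.ones(4)
for _ in range(100):
    x = A @ x
    x = x / np.linalg.norm(x)

print(x.argmax())  # 0 — the node with the most-central neighborhood
```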
PageRank - a ranking based on the cumulative number and importance of the links or mentions an article receives from other articles.
?
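A toy PageRank by hand (the link graph is made up; 0.85 is the conventional damping factor): each page repeatedly shares its rank among the pages it links to, so rank accumulates at pages that are linked from important pages.

```python
# Tiny hypothetical directed "citation" graph: key links to each page listed.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85  # damping factor
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}

for _ in range(50):
    # Every page keeps a small base amount, plus shares from its in-links.
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)
    rank = new

best = max(rank, key=rank.get)
print(best)  # "C" — the page that receives the most links
```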
Entrainment - The study of how the movement or actions of one party affect other parties, for example, being ‘drawn into their wake.’
?
Map generalization - simplifying map data by removing or abstracting detail so the map is clearer at a given scale.
?
Directed / Undirected Networks - whether or not you pay attention to the directionality of the connections (e.g., who follows whom vs. mutual friendship).
?
Edges - AKA Links or Lines
Different terminology is prevalent in different communities. “Graph theory” people tend to prefer “vertices and edges”, but “network science” people tend to prefer “nodes and links”. Early 20th century topologists called them “points and lines”. The definitions strongly suggest “things and pairs of things”
https://academia.stackexchange.com/questions/52659/vertices-edges-vs-nodes-links
Vertices AKA Nodes or Points
Different terminology is prevalent in different communities. “Graph theory” people tend to prefer “vertices and edges”, but “network science” people tend to prefer “nodes and links”. Early 20th century topologists called them “points and lines”. The definitions strongly suggest “things and pairs of things”
https://academia.stackexchange.com/questions/52659/vertices-edges-vs-nodes-links
Betweenness - the number of shortest paths between other actors that a given actor lies on.
?
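On a small made-up network, betweenness can be computed by brute force (real libraries use Brandes' algorithm): enumerate all shortest paths between every ordered pair and count how often each other node lies on them. Node "C" bridges the two triangles here, so it dominates. Counting ordered pairs doubles the usual unordered-pair totals but does not change the ranking:

```python
from collections import deque
from itertools import permutations

# Hypothetical undirected network: two triangles joined at node "C".
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
nodes = sorted(adj)

def shortest_paths(s, t):
    """Enumerate every shortest path from s to t by breadth-first search."""
    paths, best = [], None
    q = deque([[s]])
    while q:
        path = q.popleft()
        if best is not None and len(path) > best:
            continue  # longer than a shortest path already found
        if path[-1] == t:
            best = len(path)
            paths.append(path)
            continue
        for nxt in adj[path[-1]]:
            if nxt not in path:
                q.append(path + [nxt])
    return paths

# Betweenness of n: over all ordered pairs (s, t), the fraction of
# shortest s-t paths on which n appears as an intermediate node.
betweenness = {n: 0.0 for n in nodes}
for s, t in permutations(nodes, 2):
    paths = shortest_paths(s, t)
    for n in nodes:
        if n not in (s, t):
            betweenness[n] += sum(n in p for p in paths) / len(paths)

print(max(betweenness, key=betweenness.get))  # "C"
```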
Closeness - an actor’s relative distance to all other actors; actors with high closeness can reach everyone else in few steps.
?
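A sketch of closeness on a hypothetical five-node path network: compute shortest-path distances from each node by breadth-first search, then take the (normalized) inverse of their sum. The middle of the path is closest to everyone:

```python
from collections import deque

# Hypothetical network: a simple path A-B-C-D-E.
adj = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"],
}

def bfs_dists(start):
    # Shortest-path distance from start to every reachable node.
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Closeness of n: (number of other nodes) / (sum of distances to them).
closeness = {}
for n in adj:
    d = bfs_dists(n)
    closeness[n] = (len(adj) - 1) / sum(d.values())

print(max(closeness, key=closeness.get))  # "C", the middle of the path
```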
“Invariably, simple models and a lot of data trump more elaborate models based on less data.” - http://www.cs.cmu.edu/~jegonzal/talks/biglearning_with_graphs.pptx
Alex Ruch, Graphika
Signal - When a fact X is a signal of a fact Y, we mean simply that knowing X tells us something about, or reduces our uncertainty about, Y. This usage contrasts a little with the standard use, where a signal often indicates some kind of intentionality (X is about Y), or agency (a person uses X deliberately to inform you about Y), or causality (X signals Y only if, for example, X precedes Y in time).
Two examples of how signal is used in the broader sense:
(1) “Zip code is a signal of income.” This means that if I know the zip code (postal code) that you live in, I gain some information about your income. I won’t necessarily know your income precisely, but it will lead me to refine my beliefs about your income. For example, if your zip code is associated with Greenwich, Connecticut (a fancy part of the East Coast of the USA), I’ll consider it more likely than usual that your income is high.
If zip code is a signal of income it means that “in general” knowledge of zip code helps improve the accuracy of the belief about income; however, signals can be imperfect. An imperfect signal may only give a little information for everyone, or it may sometimes work great, and other times not at all.
(2) “Income is a signal of zip code.” Signal relationships are (usually) symmetric: if knowledge of X tells you about Y, then knowledge of Y tells you about X. Knowing that someone has a high income, for example, tells you that they’re more likely to live in one of a small number of zip codes, usually located in fancy parts of major coastal cities or in vacation spots nearby.
In information theory (see “information-theoretic” below), this symmetry is enforced precisely; if you measure the strength of the signal that X gives you for Y, it’s precisely equal to the strength of the signal that Y gives you for X.
https://www.complexityexplorer.org/courses/135-foundations-applications-of-humanities-analytics/segments/13508
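The symmetry claim can be checked directly. Using a made-up joint distribution for the zip-code/income example, the mutual information comes out identical whichever variable plays the role of the signal, because both computations reduce to the same sum over the shared joint distribution:

```python
import math

# Made-up joint distribution for (zip_area, income): p[(x, y)] = P(X=x, Y=y).
p = {
    ("fancy", "high"): 0.30, ("fancy", "low"): 0.05,
    ("plain", "high"): 0.15, ("plain", "low"): 0.50,
}
px, py = {}, {}
for (x, y), v in p.items():
    px[x] = px.get(x, 0.0) + v  # marginal P(X = x)
    py[y] = py.get(y, 0.0) + v  # marginal P(Y = y)

def H(dist):
    # Shannon entropy in bits.
    return -sum(v * math.log2(v) for v in dist.values())

# Conditional entropies: H(X|Y) = -sum p(x,y) log2 p(x|y), p(x|y) = p(x,y)/p(y).
H_x_given_y = -sum(v * math.log2(v / py[y]) for (x, y), v in p.items())
H_y_given_x = -sum(v * math.log2(v / px[x]) for (x, y), v in p.items())

mi_xy = H(px) - H_x_given_y  # how much knowing Y cuts uncertainty about X
mi_yx = H(py) - H_y_given_x  # how much knowing X cuts uncertainty about Y

assert abs(mi_xy - mi_yx) < 1e-12  # the signal strength is symmetric
```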
Information Theory - the science of signals, patterns, and prediction in data. An information-theoretic account of an archive is an investigation built around the idea of discovering patterns and signals, and understanding the underlying events in terms of how those patterns and signals interact.
https://www.complexityexplorer.org/courses/135-foundations-applications-of-humanities-analytics/segments/13508
Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.
https://en.wikipedia.org/wiki/Word2vec
https://jalammar.github.io/illustrated-word2vec/
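The cosine-similarity idea can be shown with tiny hand-made "embeddings". These three-number vectors are invented for illustration; real word2vec vectors have hundreds of dimensions and are learned from a large corpus:

```python
import math

# Hypothetical toy vectors: "king" and "queen" point in similar directions,
# "apple" points elsewhere.
vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Semantically similar words get higher cosine similarity.
print(cosine(vectors["king"], vectors["queen"])
      > cosine(vectors["king"], vectors["apple"]))  # True
```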
Mortality displacement is a phenomenon where a period of excess deaths (i.e., more deaths than expected) is followed by a period of mortality deficit (i.e., fewer deaths than expected). It is also known as “harvesting”.
It is usually attributable to events like epidemics and pandemics (especially influenza pandemics), heat waves, cold spells, famine, or war.
https://en.wikipedia.org/wiki/Mortality_displacement
Network Centrality - The extent to which a given entity, or node, in a network is connected to other nodes in the network; a highly connected node is often referred to as a “hub”. For example, in a network of UK train stations (nodes) linked by rail lines (connections or edges), London is a hub with high network centrality: one can get directly to many other stations from London.
Complexity Explorer foundations course
Overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”.[1] An overfitted model is a statistical model that contains more parameters than can be justified by the data.[2] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented underlying model structure.
https://en.wikipedia.org/wiki/Overfitting
https://www.ibm.com/cloud/learn/overfitting
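The "more parameters than the data can justify" point can be demonstrated with polynomial fitting. The data here are invented: the underlying truth is a straight line, and a degree-9 polynomial (10 coefficients for 10 points) threads every noisy training point, extracting the noise as if it were structure:

```python
import numpy as np

rng = np.random.default_rng(1)
# Underlying truth is y = 2x; observations carry Gaussian noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.1, 10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(0, 0.1, 10)

def errors(deg):
    # Fit a degree-`deg` polynomial and report train/test mean squared error.
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_err = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_err, test_err

train9, test9 = errors(9)  # fits the training noise almost exactly
train1, test1 = errors(1)  # fits only the underlying line

print(train9 < train1)  # True: the overfitted model "wins" on training data
```

The degree-9 model's near-zero training error is exactly the trap described above: it corresponds too closely to this particular dataset, so it typically generalizes worse to new points than the honest straight-line fit.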