Network, Graph and Data Science Flashcards
Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling methods.
What is topic modeling?
Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds some natural groups of items (topics) even when we’re not sure what we’re looking for.
https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
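As a quick sketch of what topic modeling looks like in practice, here is a minimal LDA run using scikit-learn's `LatentDirichletAllocation`. The four-document corpus and the choice of two topics are illustrative assumptions, not from the article:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus: two docs about pets, two about finance.
docs = [
    "cats dogs pets animal fur",
    "dogs pets leash walk fur",
    "stocks market trading price",
    "market price invest stocks",
]

# LDA expects bag-of-words counts as input.
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: documents, cols: topic proportions

print(doc_topics.shape)  # (4, 2); each row is a distribution over the 2 topics
```

With no labels given, the model tends to place the pet documents and the finance documents in different topics, which is the "natural groups even when we're not sure what we're looking for" idea.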
Supervised Machine Learning - When you provide the algorithm/model/etc. with pre-made categories and examples from those categories. Used in classification tasks - for example, finding which photos are cats, which are dogs, which are birds, etc.
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
Unsupervised Machine Learning - the algorithm attempts to automatically find structure in the data by extracting useful features and analyzing how the data is organized.
Clustering:
Without being an expert ornithologist, it’s possible to look at a collection of bird photos and separate them roughly by species, relying on cues like feather color, size or beak shape. That’s how the most common application for unsupervised learning, clustering, works: the deep learning model looks for training data that are similar to each other and groups them together.
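The bird-photo intuition can be sketched numerically with a from-scratch k-means (real work would use a library implementation). The two synthetic "species" blobs are a made-up assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic "species": 20 points around (0, 0) and 20 around (5, 5).
data = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

def kmeans(points, k, iters=10):
    # Toy initialization: first and last points (one from each blob here).
    centroids = np.vstack([points[0], points[-1]])
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(data, 2)
# No labels were provided, yet the two blobs end up in separate groups.
```

The model never sees a species name; it groups points purely by similarity, just as the clustering description above says.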
Anomaly detection:
Banks detect fraudulent transactions by looking for unusual patterns in a customer’s purchasing behavior. For instance, if the same credit card is used in California and Denmark within the same day, that’s cause for suspicion. Similarly, unsupervised learning can be used to flag outliers in a dataset.
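A minimal, hypothetical version of the same idea: flag any transaction amount more than two standard deviations from this card's mean. The amounts and the 2-sigma cutoff are made up for illustration:

```python
# Hypothetical transaction amounts for one card; one is wildly atypical.
amounts = [23.5, 41.0, 18.2, 30.7, 25.1, 980.0, 22.9, 35.4]

mean = sum(amounts) / len(amounts)
var = sum((a - mean) ** 2 for a in amounts) / len(amounts)
std = var ** 0.5

# |z| > 2 is a common (but arbitrary) threshold for "unusual".
outliers = [a for a in amounts if abs(a - mean) / std > 2]
print(outliers)  # [980.0]
```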
Association: Fill an online shopping cart with diapers, applesauce and sippy cups and the site just may recommend that you add a bib and a baby monitor to your order. This is an example of association, where certain features of a data sample correlate with other features. By looking at a couple key attributes of a data point, an unsupervised learning model can predict the other attributes with which they’re commonly associated.
Autoencoders:
Autoencoders take input data, compress it into a code, then try to recreate the input data from that summarized code. It’s like starting with Moby Dick, creating a SparkNotes version and then trying to rewrite the original story using only SparkNotes for reference. While autoencoders are a neat deep learning trick, there are few real-world cases where a simple one is useful on its own. But add a layer of complexity and the possibilities multiply: by using both noisy and clean versions of an image during training, autoencoders can remove noise from visual data like images, video or medical scans to improve picture quality.
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
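The compress-then-reconstruct idea can be sketched with a tiny linear autoencoder trained by plain gradient descent. The data, layer sizes, and learning rate are toy assumptions; real autoencoders are nonlinear neural networks:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 4-D data that actually lies on a 2-D subspace, so a 2-number code suffices.
basis = rng.normal(size=(2, 4))
data = rng.normal(size=(50, 2)) @ basis

W_enc = rng.normal(scale=0.1, size=(4, 2))  # encoder: 4 numbers -> 2-number code
W_dec = rng.normal(scale=0.1, size=(2, 4))  # decoder: code -> 4 numbers

loss_init = float(np.mean((data @ W_enc @ W_dec - data) ** 2))
lr = 0.02
for _ in range(3000):
    code = data @ W_enc   # compress (the "SparkNotes version")
    recon = code @ W_dec  # try to rewrite the original from the summary
    err = recon - data
    # Gradient steps on the mean squared reconstruction error.
    W_dec -= lr * (code.T @ err) / len(data)
    W_enc -= lr * (data.T @ (err @ W_dec.T)) / len(data)

loss_final = float(np.mean((data @ W_enc @ W_dec - data) ** 2))
# Training drives the reconstruction error down from its initial value.
```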
Semi-Supervised Machine Learning - a middle ground between supervised and unsupervised learning: the model is trained on a small amount of labeled data together with a much larger amount of unlabeled data, using the labeled examples to guide the structure it discovers in the rest.
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
Homophily - the tendency of nodes to connect to other nodes that are similar to them; “birds of a feather flock together.”
?
Stylometric analysis - identification using language patterns
?
Causal inference - determining what caused a phenomenon by analyzing the phenomenon itself.
?
Automorphic equivalence - nodes that occupy the same structural role in a network, even if they are not connected to the same others; for example, all store managers can be considered automorphically equivalent based on their role.
?
Structural equivalence - parties in a network that have identical connections to the same others, making them completely interchangeable and substitutable for each other.
?
Eigenvector centrality - a ranking of a node’s power or influence based on the cumulative influence of the nodes it is connected to; a node is central if its neighbors are themselves central.
?
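A minimal power-iteration sketch on a made-up four-node network (the adjacency matrix is an assumption for illustration). Each node's score is proportional to the sum of its neighbors' scores, which is exactly the leading eigenvector of the adjacency matrix:

```python
import numpy as np

# Hypothetical undirected network: node 0 connects to everyone;
# the others have fewer, less-connected neighbors.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], float)

# Power iteration: repeatedly multiply by A and renormalize;
# the vector converges to the leading eigenvector.
x = np.ones(4)
for _ in range(100):
    x = A @ x
    x = x / np.linalg.norm(x)

print(x.argmax())  # 0 — the node with the most-central neighborhood
```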
PageRank - a ranking based on the cumulative number and importance of the links or mentions an article receives from other articles.
?
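A toy PageRank by hand (the link graph is made up; 0.85 is the conventional damping factor): each page repeatedly shares its rank among the pages it links to, so rank accumulates at pages that are linked from important pages.

```python
# Tiny hypothetical directed "citation" graph: key links to each page listed.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85  # damping factor
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}

for _ in range(50):
    # Every page keeps a small base amount, plus shares from its in-links.
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)
    rank = new

best = max(rank, key=rank.get)
print(best)  # "C" — the page that receives the most links
```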
Entrainment - The study of how the movement or actions of one party affect other parties, for example, being ‘drawn into their wake.’
?
Map generalization - simplifying map data by removing or abstracting detail so the map is clearer at a given scale.
?
Directed / Undirected Networks - whether or not you pay attention to the directionality of the connections (e.g., who follows whom vs. mutual friendship).
?
Edges - AKA Links or Lines
Different terminology is prevalent in different communities. “Graph theory” people tend to prefer “vertices and edges”, but “network science” people tend to prefer “nodes and links”. Early 20th century topologists called them “points and lines”. The definitions strongly suggest “things and pairs of things”
https://academia.stackexchange.com/questions/52659/vertices-edges-vs-nodes-links
Vertices AKA Nodes or Points
Different terminology is prevalent in different communities. “Graph theory” people tend to prefer “vertices and edges”, but “network science” people tend to prefer “nodes and links”. Early 20th century topologists called them “points and lines”. The definitions strongly suggest “things and pairs of things”
https://academia.stackexchange.com/questions/52659/vertices-edges-vs-nodes-links
Betweenness - the number of shortest paths between other actors that a given actor lies on.
?
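On a small made-up network, betweenness can be computed by brute force (real libraries use Brandes' algorithm): enumerate all shortest paths between every ordered pair and count how often each other node lies on them. Node "C" bridges the two triangles here, so it dominates. Counting ordered pairs doubles the usual unordered-pair totals but does not change the ranking:

```python
from collections import deque
from itertools import permutations

# Hypothetical undirected network: two triangles joined at node "C".
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
nodes = sorted(adj)

def shortest_paths(s, t):
    """Enumerate every shortest path from s to t by breadth-first search."""
    paths, best = [], None
    q = deque([[s]])
    while q:
        path = q.popleft()
        if best is not None and len(path) > best:
            continue  # longer than a shortest path already found
        if path[-1] == t:
            best = len(path)
            paths.append(path)
            continue
        for nxt in adj[path[-1]]:
            if nxt not in path:
                q.append(path + [nxt])
    return paths

# Betweenness of n: over all ordered pairs (s, t), the fraction of
# shortest s-t paths on which n appears as an intermediate node.
betweenness = {n: 0.0 for n in nodes}
for s, t in permutations(nodes, 2):
    paths = shortest_paths(s, t)
    for n in nodes:
        if n not in (s, t):
            betweenness[n] += sum(n in p for p in paths) / len(paths)

print(max(betweenness, key=betweenness.get))  # "C"
```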
Closeness - an actor’s relative distance to all other actors; actors with high closeness can reach everyone else in few steps.
?
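A sketch of closeness on a hypothetical five-node path network: compute shortest-path distances from each node by breadth-first search, then take the (normalized) inverse of their sum. The middle of the path is closest to everyone:

```python
from collections import deque

# Hypothetical network: a simple path A-B-C-D-E.
adj = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"],
}

def bfs_dists(start):
    # Shortest-path distance from start to every reachable node.
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Closeness of n: (number of other nodes) / (sum of distances to them).
closeness = {}
for n in adj:
    d = bfs_dists(n)
    closeness[n] = (len(adj) - 1) / sum(d.values())

print(max(closeness, key=closeness.get))  # "C", the middle of the path
```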
“Invariably, simple models and a lot of data trump more elaborate models based on less data.” - http://www.cs.cmu.edu/~jegonzal/talks/biglearning_with_graphs.pptx
Alex Ruch, Graphika
Signal - When a fact X is a signal of a fact Y, we mean simply that knowing X tells us something about, or reduces our uncertainty about, Y. This usage contrasts a little with the standard use, where a signal often indicates some kind of intentionality (X is about Y), or agency (a person uses X deliberately to inform you about Y), or causality (X signals Y only if, for example, X precedes Y in time).
Two examples of how signal is used in the broader sense:
(1) “Zip code is a signal of income.” This means that if I know the zip code (postal code) that you live in, I gain some information about your income. I won’t necessarily know your income precisely, but it will lead me to refine my beliefs about your income. For example, if your zip code is associated with Greenwich, Connecticut (a fancy part of the East Coast of the USA), I’ll consider it more likely than usual that your income is high.
If zip code is a signal of income it means that “in general” knowledge of zip code helps improve the accuracy of the belief about income; however, signals can be imperfect. An imperfect signal may only give a little information for everyone, or it may sometimes work great, and other times not at all.
(2) “Income is a signal of zip code.” Signal relationships are (usually) symmetric: if knowledge of X tells you about Y, then knowledge of Y tells you about X. Knowing that someone has a high income, for example, tells you that they’re more likely to live in one of a small number of zip codes, usually located in fancy parts of major coastal cities or in vacation spots nearby.
In information theory (see “information-theoretic” below), this symmetry is enforced precisely; if you measure the strength of the signal that X gives you for Y, it’s precisely equal to the strength of the signal that Y gives you for X.
https://www.complexityexplorer.org/courses/135-foundations-applications-of-humanities-analytics/segments/13508
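The symmetry claim can be checked directly. Using a made-up joint distribution for the zip-code/income example, the mutual information comes out identical whichever variable plays the role of the signal, because both computations reduce to the same sum over the shared joint distribution:

```python
import math

# Made-up joint distribution for (zip_area, income): p[(x, y)] = P(X=x, Y=y).
p = {
    ("fancy", "high"): 0.30, ("fancy", "low"): 0.05,
    ("plain", "high"): 0.15, ("plain", "low"): 0.50,
}
px, py = {}, {}
for (x, y), v in p.items():
    px[x] = px.get(x, 0.0) + v  # marginal P(X = x)
    py[y] = py.get(y, 0.0) + v  # marginal P(Y = y)

def H(dist):
    # Shannon entropy in bits.
    return -sum(v * math.log2(v) for v in dist.values())

# Conditional entropies: H(X|Y) = -sum p(x,y) log2 p(x|y), p(x|y) = p(x,y)/p(y).
H_x_given_y = -sum(v * math.log2(v / py[y]) for (x, y), v in p.items())
H_y_given_x = -sum(v * math.log2(v / px[x]) for (x, y), v in p.items())

mi_xy = H(px) - H_x_given_y  # how much knowing Y cuts uncertainty about X
mi_yx = H(py) - H_y_given_x  # how much knowing X cuts uncertainty about Y

assert abs(mi_xy - mi_yx) < 1e-12  # the signal strength is symmetric
```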
Information Theory - the science of signals, patterns, and prediction in data. An information-theoretic account of an archive is an investigation built around the idea of discovering patterns and signals, and understanding the underlying events in terms of how those patterns and signals interact.
https://www.complexityexplorer.org/courses/135-foundations-applications-of-humanities-analytics/segments/13508
Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.
https://en.wikipedia.org/wiki/Word2vec
https://jalammar.github.io/illustrated-word2vec/
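The cosine-similarity idea can be shown with tiny hand-made "embeddings". These three-number vectors are invented for illustration; real word2vec vectors have hundreds of dimensions and are learned from a large corpus:

```python
import math

# Hypothetical toy vectors: "king" and "queen" point in similar directions,
# "apple" points elsewhere.
vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Semantically similar words get higher cosine similarity.
print(cosine(vectors["king"], vectors["queen"])
      > cosine(vectors["king"], vectors["apple"]))  # True
```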
Mortality displacement is a phenomenon where a period of excess deaths (i.e., more deaths than expected) is followed by a period of mortality deficit (i.e., fewer deaths than expected). It is also known as “harvesting”.
It is usually attributable to events like epidemics and pandemics (especially influenza pandemics), heat waves, cold spells, famine, or war.
https://en.wikipedia.org/wiki/Mortality_displacement
Network Centrality - The extent to which a given entity, or node, in a network is connected to other nodes in the network; a highly connected node is often referred to as a “hub”. For example, in a network of UK train stations (nodes) linked by rail lines (connections or edges), London is a hub with high network centrality: one can get directly to many other stations from London.
Complexity Explorer foundations course
Overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”.[1] An overfitted model is a statistical model that contains more parameters than can be justified by the data.[2] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented underlying model structure.
https://en.wikipedia.org/wiki/Overfitting
https://www.ibm.com/cloud/learn/overfitting
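The "more parameters than the data can justify" point can be demonstrated with polynomial fitting. The data here are invented: the underlying truth is a straight line, and a degree-9 polynomial (10 coefficients for 10 points) threads every noisy training point, extracting the noise as if it were structure:

```python
import numpy as np

rng = np.random.default_rng(1)
# Underlying truth is y = 2x; observations carry Gaussian noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.1, 10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(0, 0.1, 10)

def errors(deg):
    # Fit a degree-`deg` polynomial and report train/test mean squared error.
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_err = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_err, test_err

train9, test9 = errors(9)  # fits the training noise almost exactly
train1, test1 = errors(1)  # fits only the underlying line

print(train9 < train1)  # True: the overfitted model "wins" on training data
```

The degree-9 model's near-zero training error is exactly the trap described above: it corresponds too closely to this particular dataset, so it typically generalizes worse to new points than the honest straight-line fit.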