Text Analysis and Predictive Analytics Tools Flashcards
Give three examples of algorithms that measure relationships in a network of documents.
* PageRank: This is a widely used algorithm for ranking web pages based on their
importance or relevance. It measures the importance of a page based on the number
and quality of links pointing to that page.
* HITS (Hyperlink-Induced Topic Search): Like PageRank, this algorithm scores pages from
the link structure, but it is typically run on a query-specific subgraph and assigns each
page two scores. It identifies authority pages (pages that provide valuable information)
and hub pages (pages that link to many relevant authority pages).
* SALSA (Stochastic Approach for Link-Structure Analysis): This algorithm considers both
the inbound and outbound links of a page. It combines ideas from HITS and PageRank,
computing hub and authority scores through random walks on the link graph.
* SimRank (Similarity Ranking): This algorithm measures the similarity between two pages
based on the similarity of their neighbors. It assumes that two pages are similar if they
are referenced by similar pages.
* TrustRank: This algorithm is used to identify and filter out spam pages. It starts from a
small seed set of pages judged trustworthy and propagates trust along their outgoing links,
so a page is trusted to the extent that trustworthy pages link to it.
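To make the PageRank idea concrete, here is a minimal sketch of the usual power-iteration formulation. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the four-page toy graph are all assumptions chosen for illustration, not part of the flashcards.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page first receives the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: A, B and C all link to D; D links back to A.
toy_links = {"A": ["D"], "B": ["D"], "C": ["D"], "D": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph D should end up with the highest score because three pages link to it, and A should outrank B and C because its single inbound link comes from the highly ranked D. That is the "number and quality of links" idea in miniature.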
Give three examples of algorithms that analyze the content of a document.
Latent Semantic Analysis, Latent Dirichlet Allocation, Text summarization, Named Entity Recognition, Sentiment Analysis
Describe Latent Semantic Analysis (LSA)
LSA analyzes the relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
Describe Latent Dirichlet Allocation (LDA)
LDA is a topic modeling technique. Topic modeling is a method for unsupervised
classification of documents, which finds natural groups of items (topics) even when
we're not sure what we're looking for.
* Topic modeling provides methods for automatically organizing, understanding,
searching, and summarizing large electronic archives.
Describe Named Entity Recognition (NER)
Named-entity recognition (NER) (also known as entity identification, entity chunking,
and entity extraction) seeks to locate and classify named entities mentioned in
unstructured text into pre-defined categories such as person names, organizations,
locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Describe sentiment analysis
Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural
language processing, text analysis, computational linguistics, and biometrics to
systematically identify, extract, quantify, and study affective states and subjective
information.
The goal is to answer the question: “What do people feel about a certain topic?”
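As a rough illustration of "identify and quantify" in the definition above, the sketch below scores a text by counting words from two tiny word lists. The lexicons `POSITIVE` and `NEGATIVE` and the function names are hypothetical; practical systems typically rely on much larger lexicons or trained models, and this toy version ignores negation and context entirely.

```python
# Hypothetical toy lexicons, chosen only for illustration.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}


def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_label(text):
    """Map the numeric score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For instance, `sentiment_label("I love this phone, the camera is great!")` would come out as positive under these assumed word lists, while a purely factual sentence with no listed words is labeled neutral.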
Compare data mining and text mining
Similarities:
- Both look for patterns
- Both are semi-automated processes
- Both involve both search and discovery
Differences:
- Data mining is applied to structured data, whereas text mining is applied to unstructured data.
What is the vector space model (VSM)?
The vector space model (VSM) is a popular, widely used algebraic
model for representing text documents as vectors of identifiers.
Every document is represented as a multidimensional vector
of keywords (i.e., keywords extracted from that document). The weight
associated with each keyword indicates the relevance of the
keyword in the document.
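A minimal sketch of the idea, assuming raw term counts as the keyword weights (real systems often use TF-IDF instead) and cosine similarity as the way to compare two document vectors. The helper names and example sentences are made up for illustration.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a document as a sparse vector: term -> raw count."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


doc_a = to_vector("text mining finds patterns in text")
doc_b = to_vector("data mining finds patterns in data")
doc_c = to_vector("the weather is sunny today")

sim_ab = cosine_similarity(doc_a, doc_b)  # documents sharing several terms
sim_ac = cosine_similarity(doc_a, doc_c)  # documents sharing no terms
```

Documents that share many heavily weighted keywords point in similar directions and get a similarity near 1, while documents with no terms in common get 0.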
In text retrieval, what are precision and recall?
* Precision: the percentage of retrieved documents that are in
fact relevant to the query (i.e., “correct” responses)
* Recall: the percentage of documents that are relevant to
the query and were, in fact, retrieved
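The two definitions translate directly into set arithmetic. The sketch below is one way to compute them; the function name and the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, as fractions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 relevant documents in the collection.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7"],
)
```

Here two of the four retrieved documents are relevant (precision 50%), and two of the five relevant documents were found (recall 40%).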
What are the three steps in the Text Mining Process?
- Establish the Corpus: Collect & Organize the domain-specific unstructured data
- Create the Term-Document-Matrix: Introduce structure into the corpus
- Extract knowledge: Discover novel patterns from the T-D-Matrix.
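The second step can be sketched as follows, assuming whitespace tokenization and raw counts as cell values (stemming, stop-word removal, and TF-IDF weighting are common refinements that are left out here). The three-document corpus is a made-up example.

```python
from collections import Counter


def term_document_matrix(corpus):
    """Return (terms, matrix) where matrix[i][j] is the count of terms[i] in document j."""
    counts = [Counter(doc.lower().split()) for doc in corpus]
    terms = sorted(set().union(*counts))  # vocabulary across all documents
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


# Step 1 (establish the corpus): a hypothetical three-document collection.
corpus = [
    "text mining finds patterns",
    "data mining finds structure",
    "patterns in text",
]

# Step 2 (create the term-document matrix): rows are terms, columns are documents.
terms, tdm = term_document_matrix(corpus)
```

Step 3 then operates on `tdm`, for example by clustering its columns or comparing them with a similarity measure.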