8 - Pulling it Together Flashcards

1
Q

What new division was Steve transferred to after his success in the Fraud Department?

A

Consumer real estate business

Shu Financial aimed to buy, renovate, and rent used homes.

2
Q

What is the primary business model for Shu Financial’s new venture?

A

Buy used homes, renovate them, and rent them out

The business could also make money by selling homes if they appreciated in value.

3
Q

Who is Jerry, and what is his role in the Real Estate Division?

A

Runs operations and has a hybrid skill set in business development and data science

He is expected to be a rising star at Shu Financial.

4
Q

What methodologies did Steve need to use in his new role?

A

Different methodologies that he hadn’t previously used in other rotations

This included gaining business-specific knowledge about real estate.

5
Q

What did Charissa refer to Steve’s new opportunity as?

A

‘Pulling it together’ opportunity

It involved leveraging past knowledge and building new skills.

6
Q

What was Jerry’s advice regarding the use of AI?

A

Focus on solving problems rather than getting caught up in buzzwords

He emphasized understanding AI as having computers perform tasks that typically require human intelligence.

7
Q

Define deep learning.

A

Involves neural networks with many layers of processing

Different layers serve different functions, such as identifying edges in images.

8
Q

What is autoencoding in neural networks?

A

A clever application where the input is the same as the output

It is used to compress data and understand important features.

9
Q

What key variable drives the financial model for Shu Financial’s real estate business?

A

Future rental price

Other inputs have limited variability and can be managed with existing models.

10
Q

How is the future rental price currently estimated?

A

Using an off-the-shelf solution from specialized companies

These solutions are often expensive and not very accurate.

11
Q

What data sources did Steve identify for estimating rental prices?

A
  • Real estate listings
  • Neighborhood-level information
  • Government data sources
  • Home descriptions
  • Photos from sellers

These sources can provide critical information for model building.

12
Q

What is natural language processing (NLP)?

A

An application of AI focusing on understanding and analyzing human language

It deals with written or spoken communications rather than structured data.

13
Q

List some applications of NLP.

A
  • Speech recognition
  • Automatic language translation
  • Intelligent searching
  • Spam filters
  • Sentiment analysis
  • Chatbots

NLP has a vast range of applications across different fields.

14
Q

What is the first step in the data cleaning process for NLP?

A

Data usually undergoes some form of a cleaning process

The cleaning process varies depending on the problem being solved.

15
Q

What does sentiment analysis involve?

A

Analyzing feedback to identify positive or negative sentiments

It can be applied to reviews of rental properties among other uses.

16
Q

True or False: Jerry believes that deep learning is the best solution for every problem.

A

False

He advised solving the problem first rather than reaching for buzzwords.

17
Q

Fill in the blank: The goal of AI is to create a _______ system that can solve complex problems.

A

smart computer

AI aims to replicate human-like problem-solving abilities.

18
Q

What should Steve focus on when working with NLP according to Jerry?

A

Understand NLP beyond simple text search

He should prepare precise requests and requirements for the data science team.

19
Q

What did Steve prepare for the blue sky session?

A

A discussion guide summarizing rental price modeling research

He included key questions about features influencing rental prices and potential data sources.

20
Q

What is the first step in data preprocessing?

A

Data cleaning process.

21
Q

Why might you retain certain symbols like $ in data cleaning?

A

They are critically important for problems involving multiple currencies.

22
Q

What is tokenization in NLP?

A

Converts sentences into individual words called tokens.

23
Q

What are stop words?

A

Common words with little distinguishing value in predictive analysis.

24
Q

What is stemming?

A

Reduces words to their root form.

25
What is lemmatization?
Similar to stemming, but the stem must be a valid word.
26
What are N-grams?
Sequences of N words that preserve order.
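A minimal sketch of tokenization, stop-word removal, and N-gram extraction. The regex tokenizer, the tiny stop-word list, and the function names are illustrative assumptions, not part of the source material; real projects usually take these from an NLP library.

```python
import re

# Illustrative stop-word list only (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "with", "to"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and digits)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop common words that carry little distinguishing value."""
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, preserving word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = remove_stop_words(tokenize("The kitchen is newly renovated with a gas range"))
bigrams = ngrams(tokens, 2)
```

Here `bigrams` keeps pairs such as ("gas", "range") together, which a plain list of single words would lose.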
27
Define bag-of-words analysis.
Examines documents to create a list of words and measure overlap.
28
What does term frequency measure?
How often a word appears in a document, typically as a count divided by the document's length.
29
What is inverse document frequency?
Measures whether a term is common or rare in a set of documents.
30
What is the purpose of multiplying term frequency by inverse document frequency?
To reflect the importance of a specific word in a document collection.
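The TF-IDF product can be sketched as follows. This assumes one common variant: term frequency as count divided by document length, and inverse document frequency as log(N / number of documents containing the term). Libraries often use smoothed variants, and the sample listings are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical home descriptions, already tokenized.
docs = [
    ["sunny", "two", "bedroom", "apartment"],
    ["renovated", "two", "bedroom", "house"],
    ["sunny", "studio", "apartment"],
]
scores = tf_idf(docs)
```

A term that appears in only one description ("renovated") scores higher than one shared across descriptions ("two"), and a term present in every document scores zero.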
31
What is word2vec?
An algorithm that identifies similarities between words based on their usage.
32
What is sentiment analysis?
Uses text to determine positive, negative, or neutral reactions.
33
True or False: Sentiment analysis can only be performed using rules-based approaches.
False.
34
What is the output variable in supervised machine learning for sentiment analysis?
Categorical (positive, negative, neutral) or numerical (e.g., -10 to 10).
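For contrast with the supervised approach, a rules-based scorer can be sketched in a few lines. The word lists and the function name are hypothetical; a supervised model would instead learn the mapping from labeled examples.

```python
# Hypothetical sentiment lexicons for rental reviews (assumptions for illustration).
POSITIVE = {"great", "clean", "quiet", "spacious"}
NEGATIVE = {"noisy", "dirty", "broken", "cramped"}

def sentiment(text):
    """Return a categorical label from a simple count of lexicon hits."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The intermediate `score` is the numerical form of the output; the returned label is the categorical form.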
35
What is a common misconception about NLP?
That it is merely text searching.
36
What are large language models (LLMs)?
Generative AI models trained on large amounts of text to produce human-like text.
37
What is the role of geospatial information in data analysis?
Uses location data as part of modeling.
38
What types of geospatial data do properties come with?
Latitude and longitude information.
39
What is an example of demographic data that can be linked to geospatial information?
Income, family size, population density.
40
What is spatial smoothing in rental price analysis?
Computing average rental price within a fixed distance from a property.
41
What is a key consideration when choosing geospatial data sources?
Frequency of data updates and level of spatial resolution.
42
What is the purpose of linking census tracts and zip codes with rental properties?
To assign the value of features to the correct zip code, census tract, or other identifier for that house.
43
What is spatial smoothing in rental price modeling?
Computing the average rental price within a fixed distance from the property of interest.
44
How does weighting in spatial smoothing work?
Properties within distance X are all weighted equally, while those farther away receive a weight of 0.
45
What factors can influence the choice of weighting in geospatial analytics?
The scale can be linear or nonlinear, affecting how properties' distances are factored into weights.
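A minimal sketch of fixed-radius spatial smoothing with an optional distance weighting. The function names, coordinates, and rents are made up for illustration, and great-circle (haversine) distance is an assumption about how distance is measured.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def smoothed_rent(target, neighbors, radius_km, weight=None):
    """Weighted average rent of neighbors within radius_km of target.

    target: (lat, lon); neighbors: list of (lat, lon, rent).
    weight: function of distance; defaults to equal weights inside the radius.
    Properties outside the radius always get weight 0.
    """
    if weight is None:
        weight = lambda d: 1.0
    num = den = 0.0
    for lat, lon, rent in neighbors:
        d = haversine_km(target[0], target[1], lat, lon)
        if d <= radius_km:
            w = weight(d)
            num += w * rent
            den += w
    return num / den if den > 0 else None

target = (40.0, -75.0)
neighbors = [(40.0, -75.0, 1000.0), (40.005, -75.0, 1500.0), (40.5, -75.0, 3000.0)]
uniform = smoothed_rent(target, neighbors, radius_km=1.0)
linear = smoothed_rent(target, neighbors, radius_km=1.0, weight=lambda d: 1.0 - d / 1.0)
```

With equal weights the two nearby properties average out; with the linear decay, the closer property pulls the estimate toward its own rent, and the distant property is ignored in both cases.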
46
What types of points of interest are important in rental price modeling?
  • Transportation hubs
  • Shopping malls
  • Parks
47
What is the difference between vector data and raster data in GIS?
Vector data represents features like properties and roads, while raster data is gridded or pixelated data.
48
What is a convolutional neural network (CNN)?
A type of neural network used in computer vision that processes data through multiple layers to identify features.
49
What is the role of the initial layers in a CNN?
To extract features such as brightness, edges, and direction of shading from images.
50
What is the significance of translational and rotational invariance in CV?
They allow the system to identify objects regardless of their position or orientation in the image.
51
What is the primary focus of recurrent neural networks (RNNs)?
To analyze time-varying information, often used for video content.
52
What does the output of computer vision analysis include?
Probabilities related to property features, such as the condition of roofs or details of interiors.
53
What is the purpose of network analysis in real estate?
To identify the structure of the brokerage system and the connections between brokers.
54
What are nodes and relations in a social network analysis?
Nodes are individuals or groups, and relations are the connections between them.
55
What are the two main data sets in network analysis?
  • Nodelist: identifies all brokers in the network
  • Relation data: defines how nodes are connected
56
What is an adjacency matrix in network analysis?
A matrix where rows and columns represent nodes and cells define their relations.
57
What is an edgelist in the context of network analysis?
A list of all relations between different nodes, detailing pairs and types of relations.
58
Fill in the blank: Computer vision (CV) is a field of AI that enables computers to extract useful information from _______.
Digital images, videos, and other visual inputs
59
True or False: The final layer of a CNN produces an estimate of the probability of what is contained in the image.
True
60
What is the main function of the output layer in a CNN?
To produce estimates of the probability of different features contained in the image.
61
What is the significance of conducting a literature search in data analysis?
To gather information about potential features and data sources that can enhance modeling.
62
What is the role of social media networks like Agentster in real estate network analysis?
They provide connections and interactions between real estate agents that can be analyzed.
63
What defines how nodes are connected in a network?
The network structure, which can take forms such as adjacency matrix or edgelist.
64
What is an adjacency matrix?
A matrix where columns and rows are nodes, and cells define the relations between them.
65
What does an edgelist provide?
A list of all relations between different nodes along with the type of relation.
66
What is the simplest form of an adjacency matrix or edgelist?
Binary data consisting of 0s and 1s that show whether a connection exists.
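The two representations can be converted into each other. Below is a sketch that builds a binary adjacency matrix from an edgelist for an undirected network; the broker names and the function name are hypothetical.

```python
def adjacency_matrix(nodes, edges):
    """Build a binary adjacency matrix (list of lists) from an undirected edgelist."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1  # undirected: mirror the connection
    return matrix

nodes = ["Ana", "Ben", "Cy"]                 # nodelist
edges = [("Ana", "Ben"), ("Ben", "Cy")]      # edgelist
matrix = adjacency_matrix(nodes, edges)
```

Each 1 records that a connection exists and nothing more, which is why binary data cannot express how strong a connection is.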
67
True or False: Binary data conveys the strength of connections.
False.
68
What is a valued network?
A network where connections are represented with values that reflect the strength of the connection.
69
What distinguishes directed networks from undirected networks?
In directed networks, connections do not necessarily go both ways, creating asymmetric relationships.
70
What is in-degree in a directed network?
The number of inbound connections to a node.
71
What is out-degree in a directed network?
The number of outbound connections from a node.
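In-degree and out-degree can be counted directly from a directed edgelist. A short sketch, assuming each edge is a hypothetical (follower, followed) pair:

```python
from collections import Counter

def degrees(edges):
    """Return (in_degree, out_degree) counters for a directed edgelist."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    return in_deg, out_deg

# Hypothetical "follows" relations: (follower, followed)
follows = [("ana", "ben"), ("cy", "ben"), ("ben", "ana")]
in_deg, out_deg = degrees(follows)
```

Here "ben" has two inbound connections but only one outbound one, which is the asymmetry that distinguishes a directed network.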
72
What is an example of a directed network?
Twitter.
73
What is an example of an undirected network?
Facebook.
74
What is homophily in network analysis?
The tendency of individuals to connect with those who are similar to them.
75
What does triadic closure refer to?
The concept that friends of friends are more likely to be friends.
76
What are cliques in the context of networks?
Groups of individuals who spend a lot of time together with limited interactions outside their group.
77
What is network density?
The percentage of possible connections that actually exist in a network.
78
What is the maximum possible network density?
100 percent.
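For a binary undirected network with N nodes, the number of possible connections is N(N - 1)/2, so density can be sketched as below. The function name and sample edges are illustrative assumptions.

```python
def density(n_nodes, edges):
    """Share of possible undirected connections that actually exist (0.0 to 1.0)."""
    possible = n_nodes * (n_nodes - 1) / 2
    # Ignore self-loops and count each undirected pair once.
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    return len(unique) / possible if possible else 0.0

sparse = density(4, [("a", "b"), ("b", "c"), ("c", "d")])
complete = density(3, [("a", "b"), ("b", "c"), ("a", "c")])
```

A fully connected network reaches the maximum of 1.0, that is, 100 percent.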
79
What is the average path length in a network?
The average number of edges along the shortest paths connecting all possible pairs of network nodes.
80
What does closeness centrality measure?
How close a node is to all other nodes in the network.
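Both average path length and closeness centrality rest on shortest-path distances. The sketch below uses breadth-first search and assumes a connected, unweighted, undirected network with at least two nodes, stored as a hypothetical dict of neighbor lists; closeness is taken as (number of other nodes) divided by the sum of distances to them, one common definition.

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """Breadth-first search: number of edges from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def average_path_length(graph):
    """Mean shortest-path length over all pairs of distinct nodes."""
    total = pairs = 0
    for source in graph:
        for target, d in shortest_path_lengths(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs

def closeness(graph, node):
    """(Number of other reachable nodes) / (sum of distances to them)."""
    dist = shortest_path_lengths(graph, node)
    return (len(dist) - 1) / sum(dist.values())

# Hypothetical three-broker chain: A - B - C
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
apl = average_path_length(graph)
```

In the chain, the middle broker B is one step from everyone, so it has the highest closeness.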
81
What is eigenvector centrality?
A metric that measures the influence of a node based on the quality of its connections.
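Eigenvector centrality can be approximated by repeatedly letting each node absorb its neighbors' scores. This is a sketch using power iteration on a hypothetical undirected network; letting each node also keep its own score at every step is a standard trick that leaves the ranking vector unchanged while avoiding oscillation on some networks. The fixed iteration count is an assumption rather than a tuned setting.

```python
import math

def eigenvector_centrality(graph, iterations=100):
    """Approximate eigenvector centrality for an undirected graph (dict of neighbor lists)."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # New score = own score + sum of neighbors' scores.
        new = {
            node: scores[node] + sum(scores[nb] for nb in graph[node])
            for node in graph
        }
        # Normalize so the scores do not grow without bound.
        norm = math.sqrt(sum(v * v for v in new.values()))
        scores = {node: v / norm for node, v in new.items()}
    return scores

# Hypothetical network: A, B, C form a triangle; D connects only to C.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
centrality = eigenvector_centrality(graph)
```

C scores highest because it is connected to well-connected nodes, while D, whose only link is to C, scores lowest.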
82
Fill in the blank: The brokerage network is a _______ network.
binary undirected
83
What is one limitation of simply counting connections in network analysis?
It does not distinguish the quality of the connections.
84
What can affect the virality of a post in a social network?
The connectivity of the friends of the poster.
85
What is the relationship between broker influence and property showings?
More influential brokers are likely to bring more people to view properties.
86
What is one consideration when evaluating network analysis for modeling?
The cost of acquiring and maintaining data.
87
What was the outcome of the network analysis in this case?
It did not add value to the rental price model.
88
What was a key takeaway from the network analysis process?
Testing assumptions about data can lead to informed decisions without significant risk.
89
What specific steps did you take in the data cleaning and preprocessing?
Steps taken in data cleaning and preprocessing involve various techniques.
Steps may include removing duplicates, handling missing values, and normalizing data formats.
90
Did you remove stop words?
Yes, stop words were removed.
It is important to review removed stop words to ensure none are relevant to the business.
91
Is sentiment analysis appropriate for this problem?
Yes or No.
The appropriateness of sentiment analysis depends on the context of the problem.
92
With what frequency will the text undergo sentiment analysis?
Frequency of sentiment analysis must be defined.
This could be daily, weekly, or based on specific events.
93
Did you use Generative AI?
Yes or No.
If yes, specify whether an open source app or an API was used for prompt engineering.
94
What security measures did you take to safeguard data fed into the large language model?
Security measures include encryption and access controls.
These measures ensure data protection during processing.
95
What data sources do we have that include geospatial information?
Geospatial data sources may include GIS databases and satellite imagery.
These sources provide valuable location-based insights.
96
How frequently are the different geospatial data points updated?
Update frequency varies by data source.
Some data points may be updated in real time, while others may be updated weekly or monthly.
97
Can we use some form of spatial smoothing to obtain estimates of the average rental value?
Yes, spatial smoothing techniques can be applied.
This helps in estimating average values across geographical areas.
98
What radius should we use for spatial smoothing?
The radius for spatial smoothing needs to be determined.
The choice of radius affects the accuracy of estimates.
99
How did we weigh the closest properties versus the farthest properties?
The weighting method must be defined (e.g., inverse distance weighting).
This affects the influence of nearby properties on estimates.
100
Did we test the weighting algorithm?
Yes or No.
Testing the algorithm is crucial for validating its effectiveness.
101
Did we examine whether the distance to points of interest was a significant feature?
Yes or No.
This analysis determines the relevance of proximity to key locations.
102
Which points of interest were examined?
Points of interest include schools, parks, and shopping centers.
These locations can significantly impact property values.
103
What information can be obtained from the new data source?
Unique insights that cannot be obtained through other means.
This may include specific demographic or behavioral data.
104
What other data sources can be used?
Additional data sources may include social media, public records, and surveys.
These sources can complement existing data.
105
How much will it cost to access this data?
Cost includes initial access and ongoing expenses.
Understanding costs is essential for budgeting.
106
Are there simpler or better ways to obtain information than using computer vision?
Yes or No.
Alternatives may include manual data collection or other analytical methods.
107
How complete is the coverage of the images?
The percentage of items with image data must be assessed.
Completeness affects the reliability of insights drawn from images.
108
What specific information can and cannot be obtained from these images?
Images can provide visual features; they may not provide contextual data.
Understanding limitations is important for analysis.
109
Is the network a binary network or a valued network?
The network type must be identified.
Binary networks have two states, while valued networks have weights assigned to connections.
110
What metrics were used to describe the network?
Metrics may include degree centrality, betweenness centrality, etc.
These metrics help in understanding the network's structure.
111
Is this considered a dense or a distributed network?
The network's density must be evaluated.
Dense networks have many connections, while distributed networks have fewer.
112
What information about the network is actionable?
Actionable information includes insights that can drive decisions.
Non-actionable information informs but does not lead to specific actions.
113
What metrics were used to identify key connectors in the network?
Metrics include degree and closeness centrality.
These metrics highlight influential nodes within the network.
114
Why were those metrics chosen?
Metrics were chosen for their relevance in identifying key players.
Selection criteria may include effectiveness and ease of interpretation.