Explain Data Discovery Flashcards
Generally speaking, there are several discovery approaches. The simplest one looks at metadata conditions, such as the permissions of a file, to see whether it is accessible to all. This simple check can help companies identify over-exposed objects.
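As an illustration of a metadata-only check, here is a minimal Python sketch that flags files whose POSIX permission bits let every user read or write them. The function names `is_world_accessible` and `find_over_exposed` are hypothetical and are not part of any product; the sketch assumes a POSIX file system and never opens the files themselves.

```python
import os
import stat


def is_world_accessible(path):
    """Return True if the 'others' permission bits allow reading or writing."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IROTH | stat.S_IWOTH))


def find_over_exposed(root):
    """Walk a directory tree and list the files that are accessible to all."""
    exposed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if is_world_accessible(full_path):
                exposed.append(full_path)
    return sorted(exposed)
```

Because only metadata is read, a check like this is cheap to run, but it says nothing about what the files contain.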
Another discovery approach goes into the data itself and is pattern-based, usually leveraging regular expressions to find PII. You cannot use it to connect the data to its owner. You can, however, use it to see whether a data source contains a certain type of data, and it produces significant value with very little effort.
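A minimal Python sketch of pattern-based discovery follows. The two regular expressions (a simplified email pattern and a US Social Security number pattern) and the name `classify` are illustrative assumptions, not the patterns any specific product ships with. The result reports which types of data appear in the text, not whose data it is.

```python
import re

# Illustrative patterns only; production classifiers are usually stricter
# and add validation (checksums, context keywords, and so on).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def classify(text):
    """Return the set of PII types whose pattern appears in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}
```

For example, `classify("Contact jane.doe@example.com, SSN 123-45-6789")` returns a set containing both `"email"` and `"us_ssn"`, while text with no match returns an empty set.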
Going one level up, we can leverage NLP (natural language processing) to identify names, addresses, phone numbers and other contextual data. It also cannot connect data to its owner, but it would be enough for some legacy regulations.
The highest level of data discovery, which is the most complex to implement but also the most extensive, uses smart value matching and machine learning to correlate entities with their data. This level is required for modern privacy regulations like GDPR and CCPA, and it is really the only way to fulfil use cases like DSAR and breach response.
On top of all discovery approaches, a company would greatly benefit from a catalog/registry that shows the results from all those discovery levels in one place.
BigID Discovery Types - Correlation:
find connected and associated data to an entity or person
BigID Discovery Types - Classification:
locate specific format or type of data
BigID Discovery Types - Clustering:
find duplicate and related data by topic
BigID Discovery Types - Catalog:
metadata collection for fast PII catalog view
BigID Discovery Methods - Reference Set (IDSoR)
Discovery Algorithm? Value Matching
When? Scan
Correlated? Yes
BigID Discovery Methods - Enrichment
Discovery Algorithm? Proximity Analysis
When? Scan
Correlated? Yes
BigID Discovery Methods - Data Classification
Discovery Algorithm? Pattern Matching
When? Scan
Correlated? No
BigID Discovery Methods - Advanced Classification
Discovery Algorithm? Machine Learning (NLP)
When? Scan
Correlated? No
BigID Discovery Methods - Document Classification
Discovery Algorithm? Machine Learning
When? Scan
Correlated? No
BigID Discovery Methods - Subject Access Request
Discovery Algorithm? Index, Value Matching and Proximity
When? Report
Correlated? Yes
There are different ways in which BigID can identify personal data. The default method uses value matching and leverages correlation. For each data source, one can also choose to enable enrichment and/or classification.
Machine learning is used in different areas of the platform:
- Correlation
- Cleansing the information we find
- Advanced Classification
- Document classification
The value matching method requires the use of attributes from entity sources.
When the correlation process reveals unknown personal data (i.e. “dark data”), the BigID ML automatically correlates this data to an entity based on parameters like uniqueness, proximity, frequency, etc., and then calculates the quality of the correlation using only metadata and not the private data itself.
BigID uses intelligent correlation algorithms utilizing entity sources to understand basic identifiers, relationships, and distributions in other data stores.
Confidence levels are only calculated for structured data, not unstructured data.
For unstructured data, we rely on what we discovered during the initial scan, because with unstructured data we try not to sample.
Value Matching Logic
- Break the data field into segments. A segment starts at the string start or a delimiter, and ends at the string end or a delimiter.
- A delimiter is a whitespace or punctuation character.
- Segments of 4 characters or fewer are ignored.
- Perform a case-insensitive match of the (as-is) entity field to each segment.
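The value matching logic above can be sketched in Python as follows. This is an interpretation of the card, not BigID's actual implementation: the names `segments` and `value_matches` are hypothetical, and "punctuation" is assumed to mean the ASCII characters in Python's `string.punctuation`.

```python
import re
import string

# Assumption: a delimiter is any whitespace or ASCII punctuation character.
# Runs of consecutive delimiters are treated as a single split point.
DELIMITERS = re.compile(r"[\s" + re.escape(string.punctuation) + r"]+")


def segments(data_field):
    """Split a data field on delimiters, ignoring segments of 4 characters or fewer."""
    return [seg for seg in DELIMITERS.split(data_field) if len(seg) > 4]


def value_matches(entity_value, data_field):
    """Case-insensitively compare the as-is entity value to each segment."""
    target = entity_value.casefold()
    return any(seg.casefold() == target for seg in segments(data_field))
```

For example, `value_matches("Johnson", "Mr. Alice JOHNSON, 42 Main St")` is true, while `value_matches("Main", "42 Main St")` is false because "Main" has only 4 characters and is ignored. Under these assumptions, an entity value that itself contains a delimiter (such as an email address) would never equal a single segment.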
Enrichment is based on ____, and is only applicable to ____.
proximity - structured data.