Electronic Discovery Quiz Flashcards
Deposition
interrogation of a party or witness (“deponent”) under oath, where both the questions and responses are recorded for later use in hearings or at trial
Interrogatories
Interrogatories are written questions posed by one party to another to be answered under oath. No recording taken
Requests for Production:
demand to inspect or obtain copies of tangible evidence and documents
Requests for Physical and Mental Examination:
request to have a party examined to assess the party’s physical or mental condition
Requests for Admission:
used to require parties to concede, under oath, that particular facts and matters are true or that a document is genuine.
Subpoena:
a directive requiring the recipient to take some action, typically to appear and give testimony or hand over or permit inspection of specified documents or tangible evidence.
Scope of discovery:
nonprivileged, proportional, and relevant
Protective order:
“The court may, for good cause, issue an order to protect a party or person from annoyance, embarrassment, oppression, or undue burden or expense”
ESI competence
Assessment, Preservation, Sources, Custodian, Search, Collection, Counsel, Conference, Production
EDRM
Electronic Discovery Reference Model
Stages in EDRM
Information Governance, Identification, Preservation, Collection, Processing, Review, Analysis, Production, Presentation
Information Governance
Getting your electronic house in order to mitigate risk & expenses, from initial creation of ESI (electronically stored information) through its final disposition.
Identification:
Locating potential sources of ESI & determining its scope, breadth & depth.
Preservation
Ensuring that ESI is protected against inappropriate alteration or destruction.
Collection
Gathering ESI for further use in the e-discovery process (processing, review, etc.)
Processing
Reducing the volume of ESI and converting it, if necessary, to forms more suitable for review & analysis.
Review
Evaluating ESI for relevance & privilege.
Analysis
Evaluating ESI for content & context, including key patterns, topics, people & discussion.
Production
Delivering ESI to others in appropriate forms & using appropriate delivery mechanisms.
Presentation
Displaying ESI before audiences
Analog
analog recording— the variations in the recording are analogous to the variations in the music
Areal Density
the quantity of data (in bits) that can be stored on a given surface area of a computer storage medium.
Actuator Arm
on an electromagnetic hard drive, holds the read/write heads
Bates Numbering
an organizational method to label and identify legal documents, especially those produced in discovery.
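A minimal sketch of sequential Bates labels, assuming a hypothetical prefix and a seven-digit zero-padded counter; real productions follow whatever prefix and width the parties agree on.

```python
def bates_numbers(prefix, start, count, width=7):
    """Return `count` sequential labels such as ABC0000001 (format is an assumption)."""
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + count)]

print(bates_numbers("ABC", 1, 3))
# ['ABC0000001', 'ABC0000002', 'ABC0000003']
```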
Behind the Firewall:
refers to on-premises (“on prem”) computers and networks that exist within a party’s physical dominion
CD
optical media, holds about 700 MB
CHS Addressing
file storage locations were based on the physical geometry of the platters, addressed by Cylinder, Head and Sector tuples
Cloud:
typically reside in facilities not physically accessible to persons using the servers, and servers are not typically dedicated to a single user.
Clusters
operating systems speed the process by grouping sectors into contiguous chunks of data called clusters.
Cylinders
Tracks that overlie one-another on both sides of a platter and across multiple platters
De-Duplication
Hashing serves to flag identical documents, permitting a single, consistent assessment of an item that might otherwise have cropped up hundreds of times and been differently characterized
De-NISTing
cull data collected from computers that couldn’t be evidence because it isn’t a custodian’s work product. It’s done by matching hash values of collected data files to hash values corresponding to common retail software and operating systems.
DVD
Optical media, holds 4.7 GB
Electromagnetic
examples include magnetic tape, floppy disks and electromagnetic hard drives
Encoding
how data is translated to be stored in various forms of media, such as binary
Form Factor
hardware design aspect that defines and prescribes the size, shape, and other physical specifications of components, particularly in electronics
Hashing
the use of mathematical algorithms to calculate a sequence of letters and numbers that is, for practical purposes, unique to the data and so serves as a reliable digital “fingerprint” for electronic data.
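As an illustration, Python's standard hashlib module can compute such a fingerprint. The helper name `fingerprint` and the sample messages are made up for this sketch; SHA-256 is just one of several algorithms that could be used.

```python
import hashlib

def fingerprint(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of `data` under the named hash algorithm."""
    return hashlib.new(algorithm, data).hexdigest()

# Changing a single character yields a completely different hash value.
a = fingerprint(b"Please see the attached contract.")
b = fingerprint(b"Please see the attached contract!")
print(a == b)  # False
```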
Master File Table
NTFS uses a powerful and complex file system database called the Master File Table or MFT to manage file storage.
Network Share
When the user stores data to the mapped drive, that data is backed up along with the contents of the file server. Although network shares are not local to the user’s computer, they are typically addressed using drive letters (e.g., M: or T:) as if they were local hard drives. An allocation of remote storage employed to facilitate routine backup of user data.
NTFS:
Windows file system, uses MFT
Platters
round, flat discs on an EM hard drive, coated on both sides with a special material able to store data as magnetic patterns
RAID Array:
Redundant Arrays of Independent Disks. Data divided across multiple drives using a technique called striping. “When a drive fails using RAID 1, you’ve still got one copy of the data; when a drive fails using RAID 0, you’ve got nothing—zip, ZERO!.”
Read-Write Head:
on an electromagnetic hard drive, reads and writes data on a platter. The gap between the head and the platter is a thin cushion of air
SAN and NAS:
storage devices, SAN (for Storage Area Network) or a NAS (for Network Attached Storage).
Sectors
Disk formatting, first with various concentric rings of data called tracks, and then with tracks further subdivided into tiny arcs called sectors
SIM card
serve both to authenticate and identify a communications device on a cellular network and to store SMS messages and phone book contacts.
Solid State Storage
storage devices with no moving parts where the data resides entirely within the solid semiconductor material which comprise the memory chips. Examples include Flash Drives, Memory Cards, SIMs and Solid-State Drives
Media Tracks:
low level formatting divides each platter into tens of thousands of densely packed concentric circles called tracks. Each track is broken down into physical sectors, conventionally 512 bytes each (4,096 bytes on newer drives).
Partitioning
divides drives into volumes, which users see as drive letters (e.g., C:, E:, F: and so on).
Formatting
defines the logical structures on the partition and places necessary operating system files at the start of the disk to facilitate booting.
low level = carving out tracks and sectors in the old days; high level = defines structures on a partition
“Local” or “on-prem” servers
employ hardware that’s physically available to the party that owns or leases the servers
“Peer-to-Peer” (P2P) networks:
exploit the fact that any computer connected to a network has the potential to serve data across the network.
In order of data capacity:
Bits < Bytes (8 bits) < Sectors (512 bytes conventional) < Clusters (8 sectors = 4096 bytes) < Tracks < Cylinders < Platters < Drive < Array
EXIF data
photo metadata, detailing information about the date and time the photo was taken, the camera, settings, exposure, lighting, even precise geolocation data.
Load files:
ancillary files that accompany a production and carry information about the produced items, such as metadata that was stripped away when files were converted to TIFF images
Header data:
detailing the routing and other information about message transit and delivery for email
“MAC dates”
Last Modified, Last Accessed and Created. Last modified is most useful, the last accessed is least useful
Chain of custody
describes the processes used to track and document the acquisition, storage and handling of evidence to be able to demonstrate that the integrity of the evidence has not been compromised.
Preserving family relationships
safeguarding the association between the data and metadata
e-discovery review platforms
the software tools lawyers use to search, sort, read and tag electronic evidence
TIFF images:
strips away all the metadata
Williams v. Sprint/United Mgmt Co., 230 F.R.D. 640 (D. Kan. 2005)
The court responded by ordering production of all metadata as maintained in the ordinary course of business, save only privileged and expressly protected metadata.
Privilege log
disclose what’s been withheld or redacted
Delimited load file
Metadata may be produced as a database or in a delimited load file
Crucial Distinctions: System versus Application Metadata
File tables hold system metadata about the file (e.g., name, locations on disk, MAC dates): it’s CONTEXT residing outside the file
Files hold application metadata (e.g., geolocation data in photos, comments in docs): it’s CONTENT embedded in the file.
System Metadata Examples: File names, file sizes, Modified, Accessed and Created (MAC) dates, file locations (path), custodian.
Application Metadata Examples: Comments, tracked changes, editing times, last printed dates.
System Metadata values must be collected and produced in delimited text files called “Load Files.”
Application Metadata is embedded in native files, but when files are not produced in native formats, Application Metadata must likewise be extracted and produced in load files.
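A rough sketch of a delimited load file built with Python's csv module. The field names (BegBates, FileName, Custodian, LastModified), the sample record and the pipe delimiter are all assumptions for illustration; actual load file layouts are set by the review platform or by agreement of the parties.

```python
import csv
import io

def build_load_file(records, fields, delimiter="|"):
    """Serialize metadata records as delimited text, one row per produced item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

fields = ["BegBates", "FileName", "Custodian", "LastModified"]
records = [{"BegBates": "ABC0000001", "FileName": "memo.docx",
            "Custodian": "J. Smith", "LastModified": "2020-01-15T14:30:00Z"}]
print(build_load_file(records, fields))
```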
Active data
available to users
Encoded data
Log files and system files are examples of Encoded data that reveal info about a user’s behavior
Unallocated clusters and slack space
holds discarded data, is a forensic artifact
Slack space
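the unused space between the end of a file and the end of the last cluster allocated to it; it is not used by the current file and may still hold remnants of older data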
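The arithmetic can be sketched as follows, assuming 4,096-byte clusters and ignoring special cases such as small files stored resident in the MFT.

```python
def slack_bytes(file_size: int, cluster_size: int = 4096) -> int:
    """Bytes left over in the last cluster allocated to a file of `file_size` bytes."""
    if file_size == 0:
        return 0
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size - file_size

print(slack_bytes(5000))  # 3192: two clusters (8192 bytes) minus 5000
print(slack_bytes(4096))  # 0: the file exactly fills one cluster
```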
Computer forensics
is the expert acquisition, interpretation and presentation of the data within these three categories (Active, Encoded and Forensic data)
Master file table
AKA MFT, used by Windows’ NTFS file system to track location of files
Resident files
files small enough to be stored fully in the MFT
Partitioning:
into volumes, such as C: or E:
Bro-Tech Corp. v. Thermax,
an examiner should have no trouble understanding what he or she is expected to examine
Examination protocol
an order of a court or an agreement between parties that governs the scope and procedures for testing and inspection of a source of electronic evidence
Windows registry system
central database that stores information the OS needs to manage in hives
Shellbags
maintain information about folder configuration, such as when a folder was opened
swap/page file
File on disk that’s an extension of RAM
Named entities
passwords, phone numbers, English text, etc.
Volume shadow copies
Windows feature, store a copy of basically everything
Fragmentation
splitting files for storage
File table
file directory
File carving
copying/recovering a deleted file by binary signature, remnant directory data, or keyword
Binary signature
unique signature identifying file type
Noise hits
false positives when searching by keyword
Hexavigesimal:
Base 26 encoding
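A small sketch of base 26 conversion, assuming the digits are the letters A through Z with A = 0 and Z = 25. Other conventions exist (for example, spreadsheet column labels start at A = 1), so treat the mapping as an assumption.

```python
import string

DIGITS = string.ascii_uppercase  # assumed digit set: A=0 ... Z=25

def to_base26(n: int) -> str:
    """Encode a non-negative integer using letters as base 26 digits."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "A"
    out = []
    while n:
        n, remainder = divmod(n, 26)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))

def from_base26(text: str) -> int:
    """Decode a string of letters back to an integer."""
    value = 0
    for ch in text.upper():
        value = value * 26 + DIGITS.index(ch)
    return value

print(to_base26(25), to_base26(26))  # Z BA
```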
Steganography
Hidden messages, one system was invented by Francis Bacon
File format:
establishes the way to encode and order data for storage within a file.
.TGA
Targa Graphics files
.TAG
Dataflex data file
File type identification
done using binary file signatures and file extensions
Binary file signature
(also called a magic number) will typically occupy the first few bytes of a file’s contents. It is conventionally written as hexadecimal values
Pdf files binary signature
Starts with hex corresponding to %PDF- in ASCII
MS Tape archive
Starts with hex corresponding to TAPE in ASCII
Adobe Shockwave Flash
Extension .SWF but file signature starts with FWS
JPG image
Starts with hex FF D8 FF E0, which displays as ÿØÿà in extended ASCII (Latin-1)
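The signatures on the preceding cards can be checked with a simple prefix lookup. This is only a sketch: the table holds just the four signatures listed above, and real identification tools consult far larger signature databases.

```python
# Leading bytes taken from the cards above; the labels are for display only.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"TAPE": "MS Tape archive",
    b"FWS": "Adobe Shockwave Flash",
    b"\xff\xd8\xff\xe0": "JPG image",
}

def identify(header: bytes) -> str:
    """Return the file type whose signature matches the start of `header`."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"

print(identify(b"%PDF-1.7\n"))                # PDF document
print(b"\xff\xd8\xff\xe0".decode("latin-1"))  # ÿØÿà
```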
Offset addressing
beginning retrieval at a specified number of bytes from the start of the file (offset from the start) and retrieving a specified extent of data from that offset forward
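In code, offset addressing amounts to a seek followed by a read. The sketch below uses an in-memory stream with made-up contents in place of a real file.

```python
import io

def read_at(stream, offset: int, length: int) -> bytes:
    """Retrieve `length` bytes beginning `offset` bytes from the start of the stream."""
    stream.seek(offset)
    return stream.read(length)

data = io.BytesIO(b"HEADERpayload-bytesFOOTER")
print(read_at(data, 6, 13))  # b'payload-bytes'
```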
Chunk structure
data is labeled within the file to indicate its beginning and ending, or it may be tagged (“marked up”) for identification. Most common file structure
Directory structure
constructs a file as a small operating environment. The directory keeps track of what’s in the file, what it’s called and where it begins and ends. Examples: ZIP, MS office files after Office 2007
Lossless compression
the compression algorithm preserves all of the original data, so the file can be restored exactly. Example: ZIP uses an algorithm called DEFLATE (free, efficient, most common)
Lossy compression
jettisons data. Example: JPEG. Sharpness and color depth are lost in JPEG compression, leaving rough margins called “jaggies”. MPEG and MP3 are also lossy
Identification tool exception
If file type cannot be determined from metadata or file signature, flag the file as unknown or pursue other methods such as byte frequency analysis (BFA)
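To give a feel for byte frequency analysis, the sketch below tallies how often each byte value occurs and applies a crude printable-text test. The 0.9 threshold and the `looks_like_text` heuristic are assumptions for illustration, not how any particular tool classifies files.

```python
from collections import Counter

def byte_frequencies(data: bytes) -> dict:
    """Map each byte value present in `data` to its share of the total."""
    total = len(data)
    if total == 0:
        return {}
    return {byte: n / total for byte, n in Counter(data).items()}

def looks_like_text(data: bytes, threshold: float = 0.9) -> bool:
    """Crude heuristic: mostly printable ASCII plus tab, LF and CR suggests text."""
    if not data:
        return False
    printable = sum(1 for b in data if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(data) >= threshold

print(looks_like_text(b"Hello, world\n"))   # True
print(looks_like_text(bytes(range(256))))   # False
```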
Only binary files have signatures
Yes
Run-Length Encoding
It works especially well for images containing consecutive, identical data elements, like the ample white space of a fax transmission.
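A minimal run-length encoder and decoder, working on text for readability. Production formats pack the counts into bytes, so this only illustrates the principle.

```python
from itertools import groupby

def rle_encode(data: str) -> list:
    """Collapse each run of identical characters into a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(data)]

def rle_decode(pairs) -> str:
    """Expand (character, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)

print(rle_encode("WWWWBWW"))  # [('W', 4), ('B', 1), ('W', 2)]
```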
Media (MIME) Type Detection
MIME, which stands for Multipurpose Internet Mail Extensions, is a seminal Internet standard that enables the grafting of text enhancements, foreign language character sets (Unicode) and multimedia content (e.g., photos, video, sounds and machine code) onto plain text e-mails. Linux and Mac OS also use media types to identify file types. All e-mail is in MIME format
Internet Assigned Numbers Authority (IANA):
oversees global Internet addressing and defines the hierarchy of media type designation. IANA now uses the term Media Types in place of MIME Types
Media types follow a path-like tree structure under one of the following standard types: application, audio, image, text and video (collectively called discrete media types) and message and multipart (called composite media types)
Just study
Not IANA
File types prefixed with x- are not IANA
Vendor specific
subtype prefixed vnd. (e.g., application/vnd.ms-excel)
Octet stream
When the file type cannot be identified, the file is treated as an exception and labeled an octet stream, which is an arbitrary sequence or “stream” of data presumed to be binary data stored as eight-bit bytes or “octets.” Applies to any file the processor fails to recognize
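The media type cards above can be tied together with a small classifier. The function name and the returned labels are invented for this sketch; it only inspects the string and does not consult the IANA registry.

```python
DISCRETE = {"application", "audio", "image", "text", "video"}
COMPOSITE = {"message", "multipart"}

def describe_media_type(media_type: str) -> tuple:
    """Classify a media type string by top-level type and subtype prefix."""
    top, _, subtype = media_type.partition("/")
    if top in DISCRETE:
        kind = "discrete"
    elif top in COMPOSITE:
        kind = "composite"
    else:
        kind = "unknown"
    if subtype.startswith("x-"):
        registration = "not IANA-registered"
    elif subtype.startswith("vnd."):
        registration = "vendor-specific"
    else:
        registration = "other"
    return kind, registration

print(describe_media_type("application/vnd.ms-excel"))  # ('discrete', 'vendor-specific')
print(describe_media_type("multipart/mixed"))           # ('composite', 'other')
```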
ESI is much different than paper documents in crucial ways:
- ESI collections tend to be exponentially more voluminous than paper collections
- ESI is stored digitally, rendering it unintelligible absent electronic processing
- ESI is electronically searchable while paper documents require laborious human scrutiny
- ESI is readily culled, filtered and deduplicated, and inexpensively stored and transmitted
- ESI carries metainformation that is always of practical use and may be probative evidence
- ESI and associated metadata change when opened in native applications
Native applications are not suited to e-discovery
and you shouldn’t use them for review. E-discovery review tools are the only way to go.
Two broad approaches used by processing tools to extract content from files.
One is to use the Application Programming Interface (API) of the application that created the file. The other is to turn to a published file specification or reverse engineer the file (Document Filters) to determine where the data sought to be extracted resides and how it’s encoded.
Document filters:
lay out where content is stored within each filetype and how that content is encoded and interpreted. The leading product is Oracle’s Outside In, used by most e-discovery tools
Aspose Pty. Ltd
an Australian concern, licenses libraries of commercial APIs, enabling software developers to read and write to, e.g., Word documents, Excel spreadsheets, PowerPoint presentations, PDF files and multiple e-mail container formats. Aspose tools can both read from and write to the various formats, the latter considerably more challenging.
Hyland Software’s Document Filters
is another developer’s toolkit that facilitates file identification and content extraction for 500+ file formats, as well as support for OCR, redaction and image rendering. Per Hyland’s website, its extraction tools power e-discovery products from Catalyst and Reveal Software.
dtSearch
commercial product that lies at the heart of several e-discovery and computer forensic tools, serving as both content extractor and indexing engine.
open source side, Apache’s Tika
is a free toolkit for extracting text and metadata from over a thousand file types, including most encountered in e-discovery. Tika was a subproject of the open source Apache Lucene project, Lucene being an indexing and search tool at the core of several commercial e-discovery tools.
Compound files
Modern productivity files like Microsoft Office documents are rich, layered containers
OLE (Object Linking and Embedding)
OLE supports dragging and dropping content between applications and the dynamic updating of embedded content
unitization
update the database with information about what data came from what file, a relationship called unitization.
Family Tracking
In the context of e-mail, recording the relationship between a transmitting message and its attachments is called family tracking: the transmitting message is the parent object and the attachments are child objects.
important metadata values to preserve and pair
One of the most important metadata values to preserve and pair with each object is the object’s custodian or source.
non-searchable documents
Common examples of non-searchable documents are faxes and scans, as well as TIFF images and Adobe PDF documents lacking a text layer.
Exceptions report:
A processing tool must track all exceptions and be capable of generating an exceptions report to enable counsel and others with oversight responsibility to act to rectify exceptions by, e.g., securing passwords, repairing or replacing corrupt files and running OCR against the files. Exceptions resolution is key to a defensible e-discovery process.
Lexical preprocessing
computers apply rules assigned by programmers to normalize, tokenize, and segment natural language
Character Normalization
Unicode equivalency, diacriticals (accents) and case (capitalization).
Unicode Normalization
All accented characters are normalized in the same way. The Unicode Consortium promulgates normalization algorithms that produce a consistent (“normalized”) encoding for each identical character
Diacritical Normalization
This requires normalizing the data to forge a false equivalency between accented characters and their non-accented ASCII counterparts. So, if you search for “resume” or “cafe,” you will pick up instances of “resumé” and “café.” As well, we must normalize ligatures like the German Eszett (ß) seen in the word “straße,” or “street.”
Case Normalization:
Treat upper and lower case the same
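The three character normalizations above (Unicode, diacritical and case) can be sketched with Python's standard unicodedata module. The function name is made up, and real processing tools apply their own rules, so this is one plausible approach rather than any vendor's method.

```python
import unicodedata

def normalize_for_search(text: str) -> str:
    """Apply Unicode, diacritical and case normalization to `text`."""
    # Unicode normalization: decompose characters into base letters plus marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Diacritical normalization: drop the combining accent marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case normalization: casefold() also expands the Eszett to "ss".
    return stripped.casefold()

print(normalize_for_search("Résumé"))  # resume
print(normalize_for_search("CAFÉ"))    # cafe
print(normalize_for_search("Straße"))  # strasse
```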
Time Zone Normalization:
a common processing task is to normalize date and time values according to a single temporal baseline, often Coordinated Universal Time (UTC)— essentially Greenwich Mean Time—or to any other time zone the parties choose
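A short sketch of normalizing a timestamp to UTC with Python's datetime module. The UTC-6 offset and the sample date are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

def to_utc(local: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC."""
    if local.tzinfo is None:
        raise ValueError("timestamp must carry its time zone")
    return local.astimezone(timezone.utc)

central = timezone(timedelta(hours=-6))  # assumed fixed offset of UTC-6
sent = datetime(2023, 1, 15, 17, 30, tzinfo=central)
print(to_utc(sent).isoformat())  # 2023-01-15T23:30:00+00:00
```

Rejecting timestamps that lack a time zone is a deliberate choice here: guessing the zone would silently shift dates.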
Normalization vs Tokenization(Important)
Normalization is the process of reformatting data to a standardized form, such as setting the date and time stamp of files to a uniform time zone or converting all content to the same character encoding. Normalization facilitates search and data organization.
Tokenization is a method of document parsing that identifies words (“tokens”) to be used in a full-text index. Because computers cannot read as humans do but only see sequences of bytes, computers employ programmed tokenization rules to identify character sequences that constitute words and punctuation.
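A toy tokenizer shows how programmed rules turn a byte sequence into words. The rule used here (lowercase, then keep runs of letters and digits, treating everything else as a break) is an assumption; each indexing engine defines its own.

```python
import re

def tokenize(text: str) -> list:
    """Split text into lowercase tokens of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("E-mail the Q3 report to jdoe@example.com!"))
# ['e', 'mail', 'the', 'q3', 'report', 'to', 'jdoe', 'example', 'com']
```

Note how the hyphen, the @ sign and the period all act as word breaks under this rule, so "e-mail" is indexed as two tokens.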
Relativity
uses dtSearch as an indexing tool, and dtSearch has reserved the character “%”. Relativity treats all of the following characters as spaces:
!”#$&’()*+,./:;<=>?@[\5c]^`{|}~. The following characters CANNOT be made searchable in dtSearch and Relativity: ( ) * ? % @ ~ & : =
The Concordance Index:
The term “concordance” describes an alphabetical listing, particularly a mapping, of the important words in a text.
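A concordance can be sketched as an alphabetized mapping from each word to the documents containing it. The document IDs and text below are invented, and this sketch indexes every token rather than only the "important" words.

```python
import re
from collections import defaultdict

def build_index(documents: dict) -> dict:
    """Map each token to the sorted list of document IDs in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    # Alphabetical order, as in a printed concordance.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {"DOC1": "The merger closed Friday", "DOC2": "Merger talks stalled"}
index = build_index(docs)
print(index["merger"])  # ['DOC1', 'DOC2']
```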
Culling and selecting dataset
We can also cull the dataset by immaterial item suppression, de-NISTing and deduplication, all discussed infra. The crudest but most common culling method is keyword and query filtering; that is, lexical search.
Immaterial Item Suppression
Immaterial items are those extracted for forensic completeness but having little or no intrinsic value as discoverable evidence. Common examples of immaterial items include the folder structure in which files are stored and the various container files (like ZIP, RAR files and other containers, e.g., mailbox files like Outlook PST and MBOX, and forensic disk image wrapper files like .E0x or .AFF) that tend to have no relevance apart from their contents.
De-NISTing
De-NISTing is a technique used in e-discovery and computer forensics to reduce the number of files requiring review by excluding standard components of the computer’s operating system and off-the-shelf software applications like Word, Excel and other parts of Microsoft Office. Eliminating this noise is called “de-NISTing” because those noise files are identified by matching their cryptographic hash values (i.e., digital fingerprints, explanation to follow) to a huge list of software hash values maintained and published by the National Software Reference Library, a branch of the National Institute of Standards and Technology (NIST). The better focused the e-discovery collection effort (i.e., the more targeted the collection), the smaller the volume of data culled via de-NISTing.
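The matching step can be sketched as a set lookup. The function name, the file names and the tiny stand-in hash list are hypothetical, and SHA-1 is used here as an assumption about which hash the reference list is keyed on.

```python
import hashlib

def de_nist(files: dict, known_hashes: set) -> dict:
    """Keep only files whose hash is absent from the known-software hash list."""
    kept = {}
    for name, content in files.items():
        if hashlib.sha1(content).hexdigest() not in known_hashes:
            kept[name] = content
    return kept

# Stand-in for a reference hash list (hypothetical content).
known = {hashlib.sha1(b"os file").hexdigest()}
collected = {"system.dll": b"os file", "memo.txt": b"user memo"}
print(list(de_nist(collected, known)))  # ['memo.txt']
```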
Near-Deduplication
the first file is sometimes called the “pivot file,” and subsequent files with matching hashes are suppressed as duplicates, and the instances of each duplicate and certain metadata are typically noted in a deduplication or “occurrence” log
Deduplication by hashing
requires the same source data and the same, consistent application of algorithms
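A sketch of hash-based deduplication with a pivot file and an occurrence log. The file names are invented, and SHA-256 is an arbitrary choice; what matters is that every item is hashed with the same algorithm.

```python
import hashlib

def deduplicate(items):
    """Return (pivot file names, occurrence log) for (name, content) pairs."""
    pivots = {}          # hash value -> name of the first file seen (the pivot)
    occurrence_log = {}  # pivot name -> names of suppressed duplicates
    for name, content in items:
        digest = hashlib.sha256(content).hexdigest()
        if digest in pivots:
            occurrence_log[pivots[digest]].append(name)
        else:
            pivots[digest] = name
            occurrence_log[name] = []
    return list(pivots.values()), occurrence_log

kept, log = deduplicate([("a.txt", b"x"), ("b.txt", b"y"), ("c.txt", b"x")])
print(kept)  # ['a.txt', 'b.txt']
print(log)   # {'a.txt': ['c.txt'], 'b.txt': []}
```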
When parties cannot deduplicate e-mail, the reasons will likely be one or more of the following:
1. They are working from different forms of the ESI
2. They are failing to consistently exclude inherently non-identical data (like message headers and IDs) from the hash calculation
3. They are not properly normalizing the message data (such as by ordering all addresses alphabetically without aliases)
4. They are using different hash algorithms
5. They are not preserving the hash values throughout the process; or
6. They are changing the data.
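Several of the reasons above come down to what gets fed into the hash. The sketch below hashes only selected fields after normalizing addresses (aliases dropped, lowercased, sorted). The choice of fields, the separator and the function name are assumptions; parties would need to agree on the actual recipe.

```python
import hashlib

def email_hash(sender, recipients, subject, body) -> str:
    """Hash normalized message fields, excluding headers and message IDs."""
    def clean(address: str) -> str:
        # Drop any display-name alias, keeping only the address inside <...>.
        address = address.strip()
        if "<" in address and ">" in address:
            address = address[address.index("<") + 1:address.index(">")]
        return address.lower()

    parts = [
        clean(sender),
        ";".join(sorted(clean(r) for r in recipients)),  # alphabetical order
        subject.strip(),
        body.strip(),
    ]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()

h1 = email_hash("Ann <ann@example.com>",
                ["Bob <bob@example.com>", "cy@example.com"], "Hi", "Body")
h2 = email_hash("ann@example.com",
                ["CY@example.com", "bob@example.com"], "Hi", "Body")
print(h1 == h2)  # True
```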
Entropy Testing
Entropy testing is a statistical method by which to identify encrypted files and flag them for special handling.
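One common statistic for this purpose is Shannon entropy measured in bits per byte; values near 8 indicate data that looks random, as encrypted (and also compressed) files do. Any cutoff for flagging files would be an assumption, so none is applied here.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(data).values())

print(shannon_entropy(b"aaaaaaaa"))        # 0.0
print(shannon_entropy(bytes(range(256))))  # 8.0
```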