Concepts Advanced Flashcards

Learn about digital preservation concepts.

1
Q

What is…

Format Identification

A
  • Format identification has traditionally been accepted as the first step of digital preservation: ‘knowing what you’ve got’. The volume of material that some organisations are responsible for, however, makes this an ideal rather than always a practicality.
  • File format identification means looking at a digital file’s data (its binary content) for patterns that match the structures of specific file formats.
  • Reading the pattern “%PDF-1.4” at the beginning of a file may, for example, be a good clue that the file is a PDF.
  • Where a binary pattern cannot be ascribed to a digital file, either because one isn’t known or because the file doesn’t conform to one, other clues may be used.
  • A file extension may be a clue to a file format, e.g. .csv (Comma Separated Values).
  • The file name may be another, e.g. consistently named files such as .DS_Store or Thumbs.db.
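The order of clues described above (binary pattern first, then extension, then file name) can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool works: the three lookup tables and the `identify` function are hypothetical and hold only the examples mentioned on this card.

```python
from pathlib import PurePath

# Hypothetical, deliberately tiny lookup tables; real registries are far larger.
SIGNATURES = {b"%PDF-": "PDF"}
EXTENSIONS = {".csv": "Comma Separated Values"}
FILE_NAMES = {".DS_Store": "macOS folder metadata", "Thumbs.db": "Windows thumbnail cache"}


def identify(file_name, data):
    """Return (format, basis), trying the strongest clue first."""
    # 1. Binary pattern at the start of the content.
    for magic, fmt in SIGNATURES.items():
        if data.startswith(magic):
            return fmt, "signature"
    # 2. File extension (weaker: it is only part of the name).
    extension = PurePath(file_name).suffix.lower()
    if extension in EXTENSIONS:
        return EXTENSIONS[extension], "extension"
    # 3. Whole file name, for consistently named files.
    base_name = PurePath(file_name).name
    if base_name in FILE_NAMES:
        return FILE_NAMES[base_name], "file name"
    return "unknown", "none"
```

Returning the basis alongside the format keeps a record of how strong the evidence was, which matters when an identification is later questioned.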
2
Q

What is a…

File Format Extension

A
  • A file extension is part of a file’s name.
  • The extension is commonly three characters in length and prefixed with a dot.
  • For the file name ‘example.txt’, the extension is .txt.
  • Registries of extensions exist; they can be searched and an ID assigned to a file.
  • A file extension has no bearing on the content of a file; as such, a file that has the extension .pdf is not guaranteed to be a PDF.
  • A file may not have the right extension for a number of reasons, including to circumvent information security measures (e.g. restrictions on certain upload types on a website).
  • Users may also adopt a temporary naming scheme, e.g. renaming a file .backup or .tmp.
  • Users may not know the appropriate extension and so might provide another, e.g. assigning .xls (Microsoft Excel) to a .csv (comma-separated-values table format) file.
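To illustrate why an extension is only a hint, the sketch below extracts the extension from a name and then checks whether a file named .pdf actually starts with the PDF header. The function names are made up for this example, and treating “starts with %PDF-” as the test for a PDF is a simplifying assumption.

```python
from pathlib import PurePath


def extension_of(file_name):
    """Return the lower-cased extension including its dot, or '' if there is none."""
    return PurePath(file_name).suffix.lower()


def extension_agrees_with_content(file_name, data):
    """Check a .pdf name against the content; other extensions are not checked.

    Returns True or False for names ending in .pdf, and None otherwise.
    """
    if extension_of(file_name) != ".pdf":
        return None
    # Simplification: a PDF is assumed to begin with the bytes '%PDF-'.
    return data.startswith(b"%PDF-")
```

For instance, `extension_agrees_with_content("report.pdf", b"a,b,c\n")` reports a mismatch: the name claims PDF while the bytes look like a comma-separated table.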
3
Q

What is a…

File Format Signature

A
  • A file format signature is a sequence, or sequences, of bytes inside a digital file. Bytes become human readable when looked at through a hex (hexadecimal) editing tool.
  • At the beginning of a Microsoft Word file, for example, one may find the hexadecimal values 0xD0 0xCF 0x11 0xE0 (read together, D0CF11E0 resembles ‘DOC FILE’). The prefix 0x denotes hexadecimal.
  • Taken verbatim, these four bytes can be used by a tool to categorise any files that also begin with the same sequence.
  • The skill in crafting a good file format signature is finding a set of sequences unique enough to group all files belonging to a single file format; broad enough so as not to miss a single file; and narrow enough not to falsely identify other files (a false positive).
  • File format signatures are often described in file format specifications, but they may still need crafting into something more useful that can be used by tools such as DROID.
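As a rough sketch of how a tool might apply the four bytes above, the following Python compares a byte sequence against the content at a given offset. `DOC_MAGIC` and `matches_signature` are names invented for this example; real signature files, such as those used by DROID, are considerably richer.

```python
# The four bytes quoted on this card, written as a bytes literal.
DOC_MAGIC = bytes([0xD0, 0xCF, 0x11, 0xE0])


def matches_signature(data, magic, offset=0):
    """True if `magic` appears in `data` exactly at `offset`."""
    return data[offset:offset + len(magic)] == magic


def as_hex(data):
    """Render bytes the way a hex editor would show them, e.g. 'D0 CF 11 E0'."""
    return " ".join(f"{byte:02X}" for byte in data)
```

A four-byte signature like this is on the broad side: other files that happen to start with the same bytes would match as well, which is the false-positive risk described above. Lengthening the sequence, or adding a second sequence at another offset, narrows the match.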
4
Q

What is…

Web Archiving

A

The process of crawling a website in its current form and duplicating it, and its resources (images, sound files, etc.), offline or simply elsewhere.

Web archives will often also make the snapshot available through an online portal.

Snapshots are sophisticated ‘crawls’ of the domain home page and all the hyperlinks stemming from it, followed recursively. How deep, or how far, a crawler will go is determined by the archiving institution.

Jurisdictions will often be responsible for websites on various top-level domains, e.g. the British Library are responsible for archiving .co.uk, while The National Archives, UK are responsible for .gov.uk.

5
Q

What is…

Web Crawling

A
  • Web crawling is the process by which the content of a single web page is read, archived, and all of the hyperlinks referenced on it are indexed.
  • The ‘crawl’ process then visits the indexed hyperlinks, and repeats the process.
  • The number of successive links the crawler will follow from the starting page, repeating the process each time, is called the crawl depth.
  • Organisations may have different strategies for setting crawl depth.
  • Tools distributed with Linux, such as Wget, can crawl websites; a common tool used in the digital preservation community is Heritrix.
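The read, index, visit, repeat loop and the effect of crawl depth can be sketched without touching the network. In this illustration the website is a hypothetical in-memory dictionary (`SITE`) mapping each page to the links found on it; a real crawler such as Wget or Heritrix fetches and parses pages instead, and does much more besides.

```python
from collections import deque

# Hypothetical site: each page maps to the hyperlinks found on it.
SITE = {
    "/": ["/about", "/news"],
    "/about": ["/"],
    "/news": ["/news/1"],
    "/news/1": [],
}


def crawl(start, get_links, max_depth):
    """Visit pages breadth-first, following links at most `max_depth` steps from `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)  # stands in for "read and archive the page"
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in get_links(url):
            if link not in seen:  # index each hyperlink only once
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With `max_depth=1` the sketch archives the home page and the pages it links to directly; raising the depth to 2 also picks up `/news/1`.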
6
Q

What is…

Robots.txt

A
  • A mechanism that websites can employ to tell web crawlers which parts of the site they should not access.
  • Robots.txt can be employed to prevent spurious requests from non-altruistic bots, or for other practical reasons such as the domain only having a limited amount of bandwidth available to it per month.
  • Robots.txt can be configured for all, or parts, of a website. Crawlers may not always cooperate with the protocol.
  • The Internet Archive ignores robots.txt for government archives.
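A cooperative crawler consults the rules before each request. The sketch below uses Python’s standard `urllib.robotparser` module on a made-up robots.txt; the rules, the `example.org` URLs and the `example-crawler` user agent are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that closes one part of the site to every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, user_agent="example-crawler"):
    """Ask the parsed rules whether this crawler is allowed to request the URL."""
    return parser.can_fetch(user_agent, url)
```

Nothing in the protocol enforces the answer: a crawler that never calls `may_fetch`, or ignores what it returns, can still request the page.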
7
Q

What is…

Fixity

A
  • Fixity is the mechanism used to identify whether the content of something has changed at all; literally, the things that are fixed and should not change.
  • We can use the file name, for example, when comparing a file with what is listed in an archival catalogue.
  • File system dates are useful, e.g. the last modified date on a file.
  • A more robust way of ensuring that the content of a digital object hasn’t changed is to use a checksum algorithm to create an effectively unique mathematical digest of the file’s contents.
  • Example checksum algorithms include MD5, SHA-1, and BLAKE-512.
  • More often than not, when discussing fixity in digital preservation we will be discussing checksums.
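A checksum-based fixity check amounts to recomputing the digest and comparing it with the one recorded earlier. The sketch below uses Python’s standard `hashlib` module with two of the algorithms named above; the function names are illustrative, and it works on bytes already in memory rather than reading a file from disk.

```python
import hashlib


def checksum(data, algorithm="sha1"):
    """Return the hexadecimal digest of `data` using the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def is_unchanged(data, recorded_checksum, algorithm="sha1"):
    """Fixity check: does the content still produce the checksum recorded earlier?"""
    return checksum(data, algorithm) == recorded_checksum
```

Changing even a single byte produces a different digest, so a recorded checksum catches alterations that a file name or last modified date would not reveal.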
8
Q

What is…

Obsolescence

A
  • Obsolescence is the process of becoming obsolete.
  • Obsolescence is identified as a cause for data potentially becoming unreadable.
  • Obsolescence is synonymous with the terms outdated and no longer used.
  • A piece of technology relies on a number of dependencies, for example:
    • Operating system,
    • Memory type,
    • Mains power voltage,
    • Creating application,
    • etc.
  • Each of these is a component that we monitor for obsolescence in digital preservation.
  • We can mitigate obsolescence through a number of means, but we will select whichever strategy is appropriate to the archival institution’s requirements.
9
Q

What is…

Data Obfuscation

A
  • When information is hidden or modified in a way that cannot be easily read then it is said to be obfuscated.
  • Redaction, password protection, and encryption are three such methods of obfuscation.
  • The latter two pose a risk to successful digital preservation as they impact the ability to read the information in a record.
  • Encryption prevents the binary content of a file from being read at all unless an encryption key and a known algorithm are available to decrypt the file’s contents.
10
Q

What is…

Metadata

A
  • Metadata is any information that describes a digital record.
  • Records may be self-describing, meaning that the metadata can be read from the file, e.g. the author is sometimes encoded in a file separately from the content.
  • Metadata can be derived from the digital object’s content, e.g. word count, audio length.
  • The file system also has metadata which the operating system uses to describe a digital object; modification and creation dates are two such examples.
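Two of these kinds of metadata are easy to show in Python: metadata derived from content, and metadata held by the file system. The function names and the dictionary keys below are illustrative choices, not part of any metadata standard.

```python
import os
from datetime import datetime, timezone


def derived_metadata(text):
    """Metadata computed from the content itself."""
    return {"word_count": len(text.split()), "character_count": len(text)}


def file_system_metadata(path):
    """Metadata the file system holds about an object, read with os.stat."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    # Creation dates are omitted: how they are exposed differs between platforms.
    return {"size_bytes": info.st_size, "modified": modified.isoformat()}
```

File system metadata sits outside the file, so it can change when a file is copied or moved even though the content has not.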
11
Q

What are…

Significant Properties

A
  • Properties of individual records or groups of records that may be prioritised for preservation, and used as a measure of a successful ‘preservation action’.
  • Significant properties may become important when it is deemed impossible to preserve all aspects of a digital object.
  • Examples of significant properties may be word count, colour profile, interactivity, etc.
  • Significant properties are not universal. They are specific to the record and the community the record belongs to.
  • Strategies for preservation should be developed on the basis of a full analysis of the users’ requirements.
12
Q

What is…

Digital Preservation (AV Preserve definition)

A
  • Digital preservation is a function of digital curation, in which digital content is prepared and actively managed for long-term access.
  • Digital content requires constant, active management.
  • At the most basic level, this includes managing multiple copies in different geographic locations, and ongoing and consistent comparison of the same files in multiple locations to ensure that no changes have occurred to them (this is called fixity checking).
  • It also involves performing healing procedures when files no longer match up, and maintaining audit logs from the time of ingest into the archival system that track all activities, like access and changes to the files over time.
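The combination of multiple copies, fixity checking, healing and audit logging can be sketched as one small routine. Everything here is a simplifying assumption for illustration: copies are held in a dictionary keyed by a location name, SHA-1 is the recorded checksum, and ‘healing’ means overwriting a bad copy with a good one.

```python
import hashlib


def check_and_heal(copies, recorded_sha1, audit_log):
    """Compare every copy with the recorded checksum and repair any that differ.

    `copies` maps a location name to the bytes stored there. Returns False
    only when no copy matches, in which case nothing can be healed.
    """
    good = [loc for loc, data in copies.items()
            if hashlib.sha1(data).hexdigest() == recorded_sha1]
    bad = [loc for loc in copies if loc not in good]
    if not good:
        audit_log.append(("fixity failed at every location", sorted(bad)))
        return False
    for loc in bad:
        copies[loc] = copies[good[0]]  # heal from a copy that still matches
        audit_log.append(("healed", loc, good[0]))
    return True
```

Keeping the audit log as an append-only list mirrors the idea that every action on the files is recorded from ingest onwards.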
13
Q

What is an…

Intellectual Entity (PREMIS definition)

A
  • An Intellectual Entity is a distinct intellectual or artistic creation that is considered relevant to a designated community.
  • For example, a particular book, map, photograph, database, or hardware or software.
  • An Intellectual Entity can include other Intellectual Entities; for example, a web site can include a web page and a web page can include an image.
  • An Intellectual Entity may have one or more digital or non-digital Representations.
14
Q

What is a…

Representation (PREMIS definition)

A
  • A Representation is the set of files, including structural metadata, needed for a complete rendition of an Intellectual Entity.
  • For example, a journal article may be complete in one PDF file; this single file constitutes the Representation.
  • Another journal article may consist of one HTML file and two image files; these three files constitute the Representation.
  • A third article may be represented by one TIFF image for each of 12 pages plus an XML file of structural metadata showing the order of the pages; these 13 files constitute the Representation.
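The three journal articles above can be modelled with two small Python data classes. This is only a sketch of the relationship between an Intellectual Entity and its Representations; the class fields and the file names are invented for the example and are not PREMIS semantic units.

```python
from dataclasses import dataclass, field


@dataclass
class Representation:
    """The set of files needed for a complete rendition."""
    files: list = field(default_factory=list)


@dataclass
class IntellectualEntity:
    """A distinct creation with one or more Representations."""
    title: str
    representations: list = field(default_factory=list)


# Hypothetical file names for the three articles described above.
pdf_article = Representation(["article.pdf"])
html_article = Representation(["article.html", "figure-1.png", "figure-2.png"])
tiff_article = Representation(
    [f"page-{number:02d}.tif" for number in range(1, 13)] + ["structure.xml"]
)

third_article = IntellectualEntity("Third article", [tiff_article])
```

Counting the files in each Representation gives 1, 3 and 13, matching the three cases described.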
15
Q

What is…

Emulation

A
  • The recreation of legacy, or current, computer architecture in software (an ‘emulator’) such that it can then be used to run the operating system and software of said original hardware.
  • Emulation is a potential method of delivery of ‘preserved’ content to users.
  • Those who want to access content will do so by interacting with the computer system as-was, or as-is.
  • A popular JavaScript emulator called JSMESS is used by the Internet Archive to enable full interaction with, and playability of, retro PC/DOS computer games archived by the service.