Data Flashcards
Data integration can be problematic because of what factors? (2 Factors)
- Differences in analysis methodology (assay, variant caller, etc.)
- Different formats
What must be considered when determining the minimum information required to describe a genomic sequence? (Requirements)
The minimum information should include everything needed to make the data useful in the future. Aspects could include:
- Description of the experimental method
- Description of the environmental context
- Pathogenicity
- Extraction method
- Assay
- Tissue type, etc.
Which file formats are often used to capture minimal information data?
XML and JSON
What is XML?
Extensible Markup Language.
XML looks similar to HTML but is a meta-language: it gives meaning to data so that another application can use it. It is designed to store and exchange data.
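As an illustration (not from the source material), a minimal metadata record for a sequencing run could be serialized to both JSON and XML with the Python standard library; the field names here are hypothetical examples of the aspects listed above.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical minimal-information record for a sequencing run.
record = {
    "sample_id": "S001",
    "assay": "whole-exome sequencing",
    "extraction_method": "phenol-chloroform",
    "tissue_type": "blood",
}

# JSON: a direct dump of the dictionary.
print(json.dumps(record, indent=2))

# XML: the same record as nested elements.
root = ET.Element("sample")
for key, value in record.items():
    ET.SubElement(root, key).text = value
print(ET.tostring(root, encoding="unicode"))
```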
What is data encryption? (2 types)
Data encryption translates data from one form into another form, or code, so that only people with a decryption key can access the data. Encrypted data is commonly referred to as ciphertext, while unencrypted data is plaintext. Currently, encryption is one of the most popular and effective data security methods used by organizations. Two main types of data encryption exist: asymmetric encryption, also known as public-key encryption, and symmetric encryption.
What is the main function of data encryption?
The purpose of data encryption is to protect digital data confidentiality as it is stored on computer systems and transmitted using the internet or other computer networks. The outdated data encryption standard (DES) has been replaced by modern encryption algorithms that play a critical role in the security of IT systems and communications.
These algorithms provide confidentiality and drive key security initiatives including authentication, integrity, and non-repudiation. Authentication allows for the verification of a message’s origin, and integrity provides proof that a message’s contents have not changed since it was sent. Additionally, non-repudiation ensures that a message sender cannot deny sending the message.
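To make authentication, integrity, and non-repudiation concrete, here is a minimal sketch that signs a message with a private key and verifies it with the matching public key. It assumes the third-party Python cryptography package is installed, and the message content is made up.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()
message = b"variant calls for sample S001"  # hypothetical message

# Only the holder of the private key can produce this signature (non-repudiation).
signature = private_key.sign(
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Anyone with the public key can check the message's origin (authentication) and
# that it was not altered in transit (integrity). Verification raises
# cryptography.exceptions.InvalidSignature if either check fails.
public_key.verify(
    signature,
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
print("signature verified")
```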
What is the process of data encryption?
Data, or plaintext, is encrypted with an encryption algorithm and an encryption key. The process results in ciphertext, which can only be viewed in its original form if it is decrypted with the correct key.
Symmetric-key ciphers use the same secret key for encrypting and decrypting a message or file. While symmetric-key encryption is much faster than asymmetric encryption, the sender must exchange the encryption key with the recipient before the recipient can decrypt the data. As companies find themselves needing to securely distribute and manage huge quantities of keys, most data encryption services have adapted and use an asymmetric algorithm to exchange the secret key after using a symmetric algorithm to encrypt data.
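A minimal sketch of symmetric encryption, assuming the Python cryptography package is installed; Fernet is one symmetric scheme that package provides, used here purely as an illustration.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # the single shared secret key
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"plaintext message")  # encrypt with the shared key
plaintext = cipher.decrypt(ciphertext)             # decrypt with the same key
print(plaintext)  # b'plaintext message'
```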
On the other hand, asymmetric cryptography, sometimes referred to as public-key cryptography, uses two different keys, one public and one private. The public key, as it is named, may be shared with everyone, but the private key must be protected. The Rivest-Shamir-Adleman (RSA) algorithm is a cryptosystem for public-key encryption that is widely used to secure sensitive data, especially when it is sent over an insecure network like the internet. The RSA algorithm’s popularity comes from the fact that both the public and private keys can encrypt a message to assure the confidentiality, integrity, authenticity, and non-repudiability of electronic communications and data through the use of digital signatures.
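And a matching sketch of asymmetric (public-key) encryption with RSA, again assuming the Python cryptography package; the key size and padding choices are illustrative defaults, not prescriptions from the source.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# The key pair: the public key may be shared; the private key must be protected.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

ciphertext = public_key.encrypt(b"sensitive data", oaep)  # anyone can encrypt
plaintext = private_key.decrypt(ciphertext, oaep)         # only the key owner can decrypt
print(plaintext)  # b'sensitive data'
```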
What is data integrity?
Data integrity is the maintenance of, and the assurance of the accuracy and consistency of, data over its entire life-cycle, and is a critical aspect of the design, implementation and usage of any system which stores, processes, or retrieves data. The term is broad in scope and may have widely different meanings depending on the specific context, even under the same general umbrella of computing. It is at times used as a proxy term for data quality, while data validation is a prerequisite for data integrity. Data integrity is the opposite of data corruption. The overall intent of any data integrity technique is the same: ensure data is recorded exactly as intended (such as a database correctly rejecting mutually exclusive possibilities) and, upon later retrieval, ensure the data is the same as it was when it was originally recorded. In short, data integrity aims to prevent unintentional changes to information. Data integrity is not to be confused with data security, the discipline of protecting data from unauthorized parties.
What are the two types of data integrity?
Physical integrity
Physical integrity deals with challenges associated with correctly storing and fetching the data itself. Challenges with physical integrity may include electromechanical faults, design flaws, material fatigue, corrosion, power outages, natural disasters, acts of war and terrorism, and other environmental hazards such as ionizing radiation, extreme temperatures, pressures and g-forces. Methods of ensuring physical integrity include redundant hardware, an uninterruptible power supply, certain types of RAID arrays, radiation-hardened chips, error-correcting memory, use of a clustered file system, use of file systems that employ block-level checksums such as ZFS, storage arrays that compute parity calculations such as exclusive or (XOR) or use a cryptographic hash function, and even having a watchdog timer on critical subsystems.
Logical integrity
This type of integrity is concerned with the correctness or rationality of a piece of data, given a particular context. This includes topics such as referential integrity and entity integrity in a relational database, or correctly ignoring impossible sensor data in robotic systems. These concerns involve ensuring that the data “makes sense” given its environment. Challenges include software bugs, design flaws, and human errors. Common methods of ensuring logical integrity include check constraints, foreign key constraints, program assertions, and other run-time sanity checks, as sketched below.
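As a small illustration of check constraints and referential integrity, the sketch below uses Python's built-in sqlite3 module; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE variant (
        id INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patient(id),  -- referential integrity
        read_depth INTEGER CHECK (read_depth >= 0)            -- check constraint
    )
""")

conn.execute("INSERT INTO patient (id) VALUES (1)")
conn.execute("INSERT INTO variant (patient_id, read_depth) VALUES (1, 30)")  # accepted

for bad_insert in (
    "INSERT INTO variant (patient_id, read_depth) VALUES (1, -5)",   # impossible depth
    "INSERT INTO variant (patient_id, read_depth) VALUES (99, 30)",  # unknown patient
):
    try:
        conn.execute(bad_insert)
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)
```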
What is a checksum?
A checksum is the outcome of running an algorithm, called a cryptographic hash function, on a piece of data, usually a single file. Comparing the checksum that you generate from your version of the file, with the one provided by the source of the file, helps ensure that your copy of the file is genuine and error free.
A checksum is also sometimes called a hash sum and less often a hash value, hash code, or simply a hash.
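A minimal sketch of generating and comparing a checksum with Python's hashlib; the file name and the "published" checksum value are hypothetical.

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the checksum published alongside the download (hypothetical value).
published = "d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26"
if sha256sum("reads.fastq.gz") == published:
    print("file is genuine and error free")
else:
    print("file is corrupted or has been altered")
```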
What is rsync?
rsync is a utility for efficiently transferring and synchronizing files across computer systems, by checking the timestamp and size of files. It is commonly found on Unix-like systems and functions as both a file synchronization and file transfer program. The rsync algorithm is a type of delta encoding, and is used for minimizing network usage. Zlib may be used for additional compression, and SSH or stunnel can be used for data security.
Rsync is typically used for synchronizing files and directories between two different systems. For example, if the command rsync local-file user@remote-host:remote-file is run, rsync will use SSH to connect as user to remote-host. Once connected, it will invoke the remote host’s rsync and then the two programs will determine what parts of the file need to be transferred over the connection.
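A small sketch of driving rsync from Python with subprocess, assuming rsync and SSH access are available; the paths and host name are hypothetical, and the flags shown (-a archive mode, -v verbose, -z zlib compression, --checksum to compare file contents rather than timestamp and size) are standard rsync options.

```python
import subprocess

# Mirror a local results directory to a remote archive host over SSH.
subprocess.run(
    [
        "rsync",
        "-avz",        # archive mode, verbose, compress with zlib
        "--checksum",  # compare checksums instead of timestamp and size
        "-e", "ssh",   # transport the data over SSH
        "results/",                           # local source directory (hypothetical)
        "user@archive-host:/data/results/",   # remote destination (hypothetical)
    ],
    check=True,  # raise CalledProcessError if rsync reports a failure
)
```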
What data from a pipeline/NGS analysis is it necessary to store? (3 types)
Raw data – Need a secure means of archiving data. May never need to look at it again.
Results data – The variants we identified for a patient. Useful for analysis of other patients.
Meta data – Data describing how we achieved our results.