Data and AI Flashcards
IBM Storage Ceph
IBM Storage Ceph is an enterprise-level, IBM-supported version of the open-source Ceph storage platform. It provides scalable solutions for object, block, and file storage, making it suitable for environments that require high levels of data scalability and operational resiliency.
Designed to be software-defined, IBM Storage Ceph abstracts storage resources from the underlying hardware, allowing for dynamic allocation and efficient utilization of data storage. This setup not only simplifies management but also enhances flexibility to adapt to changing business needs and workload demands. It’s particularly well-suited for modern data management tasks such as supporting data lakehouses, AI, and machine learning frameworks.
Key components of IBM Storage Ceph include Ceph OSDs (Object Storage Daemons), which handle data storage, replication, and recovery; Ceph Monitors, which maintain the master copy of the cluster map to keep cluster state consistent; and Ceph Managers, which track runtime metrics and cluster state and host management interfaces such as the dashboard.
IBM Storage Ceph is engineered to be self-healing and self-managing, with features that support scalability from a few nodes to thousands, making it ideal for handling vast amounts of data across various deployment scenarios. Additionally, it offers integration capabilities with cloud-native applications and existing infrastructures, providing a seamless bridge from legacy systems to modern, scalable solutions.
MySQL
An open-source relational database management system (RDBMS).
Data Lakehouse
Lakehouse solutions typically provide
* a high-performance query engine
* over low-cost object storage
* along with a metadata governance layer.
Data lakehouses are built around open-standard object storage and enable multiple analytics and AI workloads to operate simultaneously on top of the data lake without requiring the data to be duplicated and converted.
ADDI
Application Discovery and Delivery Intelligence
Spark
Good in SQL, not good in …
IBM Spectrum Symphony
Databricks
only in Spark
Encoder Models
An “encoder-only” model is a neural network architecture closely related to the autoencoder, used in unsupervised learning tasks such as feature learning, data compression, and anomaly detection.
In a traditional autoencoder architecture, there are two main components: an encoder and a decoder. The encoder processes input data and compresses it into a lower-dimensional representation, often called a “latent space” or “encoding.” The decoder then takes this compressed representation and reconstructs the original input data from it. The goal of training an autoencoder is typically to minimize the reconstruction error, encouraging the model to learn a compact and informative representation of the input data.
However, in an encoder-only model, only the encoder component is used, and there is no decoder. This means that the model takes input data and maps it directly to a lower-dimensional representation without attempting to reconstruct the original data. Encoder-only models are often used for tasks such as dimensionality reduction, feature learning, or pre-training for downstream supervised learning tasks.
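A minimal PyTorch sketch of the distinction, with illustrative dimensions (784-dimensional inputs, a 32-dimensional latent code): the full autoencoder trains on reconstruction error, while “encoder-only” use keeps just the encoder to produce the latent representation.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),       # compress to the latent code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, in_dim),           # reconstruct the input
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.randn(16, 784)                      # a dummy batch
loss = nn.functional.mse_loss(model(x), x)    # reconstruction error (autoencoder training)
z = model.encoder(x)                          # "encoder-only" use: map input to latent space
```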
Apache Spark
Apache Spark is designed to perform large-scale data processing and analytics across clustered computers, providing faster and more generalized processing capabilities compared to other big data technologies like Hadoop MapReduce. Here are some specific tasks and capabilities of Apache Spark:
General Execution Graphs: Spark’s advanced Directed Acyclic Graph (DAG) engine supports both batch and real-time data processing. The DAG capabilities allow for more complex, multi-step data pipelines that involve branching and reusing intermediate results.
In-Memory Computing: One of Spark’s standout features is its ability to process data in memory. This can dramatically increase the speed of iterative algorithms and interactive data mining tasks.
Fault Tolerance: Even though Spark processes data in memory, it recovers from failures through lineage: it remembers the series of transformations applied to the input data and can recompute lost partitions on a node that fails.
Libraries and APIs: Spark provides a rich ecosystem of development libraries, including:
Spark SQL: For processing structured data; it lets you run SQL queries or use the DataFrame API alongside conventional programming operations.
MLlib: For machine learning, this library provides common machine learning algorithms like clustering, regression, classification, and collaborative filtering.
GraphX: For graph processing, GraphX allows for the creation, transformation, and querying of graphs.
Spark Streaming: For real-time data processing, this library enables the processing of live streams of data. Examples include data from sensors, financial systems, or social media feeds.
Polyglot Programming: Spark supports multiple programming languages for data science and development, including Scala, Java, Python, and R. This makes it accessible to a wider range of users, from application developers to data scientists.
Hadoop Integration: Spark can run on top of existing Hadoop clusters to leverage Hadoop’s storage systems (HDFS, HBase) and resource management (YARN), making it a versatile choice for processing data stored on Hadoop.
Scalability: Spark is designed to scale up from a single server to thousands of machines, each offering local computation and storage. This scalability makes it effective at handling a wide variety of big data processing tasks.
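To make this concrete, here is a minimal PySpark sketch of the DataFrame and Spark SQL capabilities described above; the application name, data, and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flashcard-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)
df.cache()  # keep the DataFrame in memory for reuse across queries

# Equivalent queries via the DataFrame API and plain SQL:
df.filter(df.age > 30).show()
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```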
AAP
Ansible Automation Platform (Red Hat)
PEPT
Parameter-efficient prompt tuning
LoRA
Low-Rank Adaptation (LoRA) is a technique for fine-tuning LLMs in a parameter-efficient way: instead of fine-tuning the whole base model, which can be huge and costly in time and money, it freezes the pretrained weights and trains a small number of additional low-rank matrices.
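A minimal sketch of the core idea, with illustrative shapes (not taken from any particular model): the pretrained weight W stays frozen, and only a low-rank update B @ A of rank r is trained.

```python
import torch

d, k, r = 1024, 1024, 8
W = torch.randn(d, k)                        # frozen pretrained weight
A = torch.randn(r, k, requires_grad=True)    # trainable, r*k parameters
B = torch.zeros(d, r, requires_grad=True)    # trainable, d*r parameters (starts at zero)

W_adapted = W + B @ A                        # effective weight during fine-tuning
# Trainable parameters: d*r + r*k = 16,384 vs. d*k = 1,048,576 for full fine-tuning.
```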
PEFT
Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by only fine-tuning a small number of (extra) model parameters instead of all the model’s parameters. This significantly decreases the computational and storage costs.
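As a hedged sketch of how this looks in practice with the Hugging Face peft library: the model name ("gpt2") and the target_modules value below are assumptions that depend on the model architecture.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor
    target_modules=["c_attn"],  # attention projection in GPT-2 (assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```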
PyTorch
PyTorch is an open-source machine learning library developed by Facebook’s AI Research lab (FAIR). It is widely used for various tasks in artificial intelligence and deep learning, such as neural network modeling, natural language processing, computer vision, and reinforcement learning.
Key features of PyTorch include:
Dynamic Computational Graphs: PyTorch uses dynamic computation graphs, allowing for more flexible and intuitive model building compared to static graph frameworks. This enables users to define and modify computation graphs on-the-fly, making it easier to debug and experiment with models.
Tensors: PyTorch provides a multi-dimensional array data structure called “tensors,” which is similar to NumPy arrays but with additional GPU acceleration and support for automatic differentiation. Tensors are the fundamental building blocks for constructing neural networks and performing computations in PyTorch.
Automatic Differentiation: PyTorch offers automatic differentiation through its autograd module, which automatically computes gradients of tensor operations. This makes it easy to implement and train complex neural network models using gradient-based optimization algorithms like stochastic gradient descent (SGD).
Neural Network Modules: PyTorch provides a rich set of pre-defined neural network modules and layers in the torch.nn module, making it easy to build and customize neural network architectures for various tasks. Users can also define custom layers and models by subclassing PyTorch’s Module class.
GPU Acceleration: PyTorch leverages GPU acceleration using CUDA, allowing for efficient training and inference on GPUs. This enables faster computation and scalability for deep learning models, especially for large-scale datasets and complex architectures.
Support for Dynamic and Static Graphs: While PyTorch primarily uses dynamic computation graphs, it also supports static graph execution through the torch.jit module, enabling optimizations and deployment of models in production environments.
Integration with Other Libraries: PyTorch integrates well with other popular libraries and frameworks in the Python ecosystem, such as NumPy, SciPy, and scikit-learn, allowing for seamless interoperability and integration with existing workflows.
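As a minimal illustration of tensors, autograd, and the torch.nn module working together, here is a short training-step sketch; the layer sizes and data are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                          # a predefined nn module
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3)                            # tensor inputs
y = torch.randn(8, 1)                            # tensor targets

pred = model(x)
loss = nn.functional.mse_loss(pred, y)
loss.backward()                                  # autograd computes gradients
opt.step()                                       # gradient-based update (SGD)
opt.zero_grad()

if torch.cuda.is_available():                    # GPU acceleration when present
    model = model.to("cuda")
```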
CRUD
CRUD is an acronym that stands for Create, Read, Update, and Delete. These are the four basic functions of persistent storage, often used when dealing with databases or data storage systems in software development. Here’s a breakdown of each function:
Create: This operation involves adding new records or data to a database. In programming, this could be handled by an SQL statement like INSERT in SQL databases or a method call in object-oriented programming that saves a new object.
Read: This operation retrieves data from a database. It can involve querying the database to get specific records or a subset of data based on certain criteria. SQL databases use the SELECT statement for this purpose.
Update: This function modifies existing data within the database. This might involve changing values in existing rows or records. In SQL, this is typically achieved using the UPDATE statement along with conditions that specify which records to update.
Delete: This involves removing existing records from a database. In SQL, this is done using the DELETE statement, often with conditions to select the specific records to be deleted.
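As a concrete sketch, the four operations map directly onto SQL statements; the example below uses Python's built-in sqlite3 module with a made-up users table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

cur.execute("INSERT INTO users (name) VALUES (?)", ("alice",))      # Create
rows = cur.execute("SELECT id, name FROM users").fetchall()         # Read
cur.execute("UPDATE users SET name = ? WHERE id = ?", ("bob", 1))   # Update
cur.execute("DELETE FROM users WHERE id = ?", (1,))                 # Delete

conn.commit()
conn.close()
```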
RRF
Reciprocal Rank Fusion
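RRF combines multiple ranked result lists (e.g., from keyword and vector search) by scoring each document as the sum of 1/(k + rank) across the lists in which it appears. A minimal sketch, using the k = 60 constant from the original RRF paper and made-up document IDs:

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists; each list contributes 1/(k + rank) per document."""
    scores = {}
    for ranked_list in rankings:
        for rank, doc in enumerate(ranked_list, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fusing a keyword-search ranking with a vector-search ranking:
print(rrf([["d1", "d2", "d3"], ["d3", "d1", "d4"]]))
```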