NiFi & StreamSets Flashcards
What is Apache NiFi?
It is a dataflow management tool for automating the movement of data between different systems. It manages the movement of data between any source and destination. It is data source agnostic, supporting disparate and distributed sources of differing formats, schemas, protocols, speeds and sizes.
Describe NiFi components
FlowFile: Represents each object moving through the processors of the pipeline. It contains attributes of the data as well as a reference to the associated content, but it does not contain the data itself.
Processor: Responsible for performing the work. It performs some combination of data routing, transformation or mediation between systems.
Connection: Provides the actual linkage between processors. These act as queues.
Flow Controller: Maintains the knowledge of how processes connect and manages the threads all processes use.
Process Group: Groups components together in order to organize the DataFlow in a way that makes it more understandable from a higher level.
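The relationship between FlowFiles, Processors and Connections can be sketched as a toy model. This is only an illustration under stated assumptions, not NiFi's actual API: the class names, the `on_trigger` method and the `content_ref` field are hypothetical stand-ins.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FlowFile:
    """Toy stand-in: attributes plus a reference to content stored elsewhere."""
    attributes: dict
    content_ref: str  # hypothetical key into an external content store


class Connection:
    """Links two processors and buffers FlowFiles as a FIFO queue."""

    def __init__(self):
        self._queue = deque()

    def put(self, flowfile):
        self._queue.append(flowfile)

    def poll(self):
        # Return the oldest queued FlowFile, or None if the queue is empty.
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)


class AddAttribute:
    """Toy processor: pulls one FlowFile, sets an attribute, passes it on."""

    def __init__(self, key, value):
        self.key, self.value = key, value

    def on_trigger(self, upstream, downstream):
        flowfile = upstream.poll()
        if flowfile is not None:
            flowfile.attributes[self.key] = self.value
            downstream.put(flowfile)


incoming, outgoing = Connection(), Connection()
incoming.put(FlowFile({"filename": "a.csv"}, content_ref="claim-1"))
AddAttribute("source", "demo").on_trigger(incoming, outgoing)
result = outgoing.poll()
```

The processor only touches attributes here; the content stays wherever `content_ref` points, which mirrors the idea that a FlowFile carries a reference rather than the data.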
Describe Apache NiFi architecture
NiFi runs within a JVM on a host operating system.
Primary components are:
Web Server: Hosts NiFi's HTTP-based command and control API.
Flow Controller: Brain of the operations. Allocates and manages threads for processors.
FlowFile Repository: Contains metadata for all current FlowFiles in the flow.
Content Repository: Holds the content for current and past FlowFiles.
Provenance Repository: Holds the history of FlowFiles.
Each node participating in a cluster performs the same operations on data but each operates on a different set of data.
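The idea that every node runs the same flow on a different slice of the data can be sketched as follows. The round-robin split and the function names are assumptions made for illustration; how NiFi actually distributes data across nodes depends on the flow design.

```python
def flow(record):
    """The same flow definition runs on every node."""
    return record.strip().lower()


def run_cluster(records, node_count):
    # Assumed distribution: node i receives every node_count-th record.
    partitions = [records[i::node_count] for i in range(node_count)]
    # Each node applies the identical flow to its own partition only.
    return [[flow(r) for r in part] for part in partitions]


per_node = run_cluster([" A", "B ", " C ", "D"], node_count=2)
```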
What is StreamSets Data Collector?
It lets you build continuous data pipelines. Each pipeline consumes record-oriented data from a single origin, optionally operates on those records in one or more processors, and writes the data to one or more destinations.
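The origin, processor and destination roles can be sketched in plain Python. This is a conceptual model only, not Data Collector code: `origin`, `cast_age`, `uppercase_name` and `run_pipeline` are hypothetical names, and records are modelled as dictionaries.

```python
def origin():
    """Hypothetical single origin: yields records one at a time."""
    yield {"name": "alice", "age": "31"}
    yield {"name": "bob", "age": "27"}


def cast_age(record):
    """Processor: convert the age field from string to integer."""
    return {**record, "age": int(record["age"])}


def uppercase_name(record):
    """Processor: upper-case the name field."""
    return {**record, "name": record["name"].upper()}


def run_pipeline(origin, processors, destinations):
    for record in origin():
        # Processors are optional and applied in order.
        for process in processors:
            record = process(record)
        # Every destination receives each processed record.
        for write in destinations:
            write(record)


sink_a, sink_b = [], []
run_pipeline(origin, [cast_age, uppercase_name], [sink_a.append, sink_b.append])
```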
How do you add custom processors on StreamSets?
You can create and build your own processor using Java. Generally speaking, you first create a processor template using a Maven archetype, add your custom logic, build the processor, and place the JAR inside the lib folder.
Alternatively, you could use a Jython, Groovy or JavaScript processor, which lets you add custom code without having to create a custom processor from scratch.
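A script for the Jython processor typically loops over the batch, edits each record, and writes it to the output or to the error stream. The sketch below assumes the script bindings are named `records`, `output` and `error`; the exact names depend on the Data Collector version (newer releases expose them through an `sdc` object), so check the documentation for your release. `FakeRecord` and `FakeSink` are hypothetical stand-ins so the loop can run outside Data Collector.

```python
class FakeRecord(object):
    """Hypothetical stand-in for a record; only .value is modelled."""

    def __init__(self, value):
        self.value = value


class FakeSink(object):
    """Hypothetical stand-in for the evaluator's output and error objects."""

    def __init__(self):
        self.written = []

    def write(self, record, message=None):
        self.written.append((record, message))


# Stand-ins for the bindings the processor would normally provide.
records = [FakeRecord({"price": "10"}), FakeRecord({"price": "n/a"})]
output, error = FakeSink(), FakeSink()

# The part below is what would go into the processor's script box.
for record in records:
    try:
        # Custom logic: convert the price field to an integer.
        record.value["price"] = int(record.value["price"])
        output.write(record)
    except Exception as e:
        # Records that fail are sent to the error stream with a message.
        error.write(record, str(e))
```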
What is a NiFi template?
It is a reusable workflow that you can export and import.
Does NiFi use a master-slave design?
No. Since version 1.0, NiFi has used a zero-master clustering design. Each node in the NiFi cluster is identical. Apache ZooKeeper manages the cluster and elects a single node as the Cluster Coordinator.
What are the components of a FlowFile in NiFi?
It is made up of two parts:
- Content: the stream of bytes being processed. Keep in mind that the FlowFile itself does not contain the data; it holds a pointer to the content, which is stored in NiFi's Content Repository.
- Attributes: key-value metadata about the data, such as filename, UUID, file type, etc.
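The split between attributes and content can be sketched as follows. This is a toy model, not NiFi's storage format: the dictionary used as a content repository, the `content_claim` key and both helper functions are hypothetical.

```python
import uuid

# Toy content repository: maps a claim id to the stored bytes.
content_repository = {}


def create_flowfile(filename, data):
    """Store the bytes in the repository and return a FlowFile that points at them."""
    claim_id = "claim-%d" % len(content_repository)
    content_repository[claim_id] = data
    return {
        "attributes": {"filename": filename, "uuid": str(uuid.uuid4())},
        "content_claim": claim_id,  # pointer to the content, not the content itself
    }


def read_content(flowfile):
    """Follow the pointer to fetch the actual bytes."""
    return content_repository[flowfile["content_claim"]]


flowfile = create_flowfile("report.txt", b"hello nifi")
```

Passing `flowfile` between steps copies only the attributes and the claim id, so the size of the payload does not affect how cheaply the FlowFile moves.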
What happens to data if NiFi goes down?
NiFi keeps a record of what is happening to each FlowFile in the FlowFile Repository, so it can restore its state after a restart.
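The recovery idea can be sketched as an append-only log that is replayed on startup. This is a simplified illustration, not NiFi's on-disk format: the `FlowFileLog` class, the JSON line layout and the queue names are assumptions.

```python
import json
import os
import tempfile


class FlowFileLog:
    """Toy append-only log of where each FlowFile currently sits."""

    def __init__(self, path):
        self.path = path

    def record(self, flowfile_id, queue):
        # Append the change and force it to disk before continuing.
        with open(self.path, "a") as f:
            f.write(json.dumps({"id": flowfile_id, "queue": queue}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def recover(self):
        # Replay every entry in order; the last entry per FlowFile wins.
        state = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    if entry["queue"] is None:
                        state.pop(entry["id"], None)  # FlowFile left the flow
                    else:
                        state[entry["id"]] = entry["queue"]
        return state


log_path = os.path.join(tempfile.mkdtemp(), "flowfile.log")
log = FlowFileLog(log_path)
log.record("ff-1", "to-transform")
log.record("ff-2", "to-transform")
log.record("ff-1", "to-publish")
log.record("ff-2", None)

# Simulate a restart: a fresh object rebuilds the state from the log file.
restored = FlowFileLog(log_path).recover()
```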
What are the repositories in NiFi?
FlowFile Repository: Contains metadata for all the current FlowFiles in the flow.
Content Repository: Holds the content for current and past FlowFiles.
Provenance Repository: Holds the history of FlowFiles.