Data-Intensive Apps Flashcards
Avro
Binary encoding format, originated in the Hadoop project
Reader must know the precise writer's schema in order to decode -
Not self-describing
Very small encoded size
Good handling of schema evolution: readers and writers can work with both old and new schemas
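For comparison with the Thrift and Protobuf definitions on the other cards, an equivalent schema can be written in Avro IDL (Avro schemas can also be expressed as JSON); a sketch:

```
record Person {
  string userName;
  union { null, long } favoriteNumber = null;
  array<string> interests;
}
```

Note there are no field tag numbers: Avro matches fields by name between the writer's and reader's schemas, which is why the reader must have the precise writer's schema to decode.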
MessagePack
Binary JSON format
Not as compact as Thrift or Protocol Buffers, since field names are still included in the encoded data
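A minimal sketch of how MessagePack lays out bytes for a tiny document. `encode_small_map` is a hand-rolled illustrative helper (real code would use the msgpack library) that assumes short string keys and small positive integer values:

```python
import json

def encode_small_map(d):
    """Encode a dict of short-string keys -> small positive ints per MessagePack."""
    out = bytearray([0x80 | len(d)])   # fixmap: high nibble 1000, low nibble = entry count
    for key, value in d.items():
        kb = key.encode("utf-8")
        out.append(0xA0 | len(kb))     # fixstr: 101xxxxx, low bits = byte length
        out += kb
        out.append(value)              # positive fixint 0x00-0x7f encodes itself
    return bytes(out)

doc = {"a": 1}
packed = encode_small_map(doc)
print(packed.hex())                                  # 81a16101 -> 4 bytes
print(len(json.dumps(doc, separators=(",", ":"))))   # 7 bytes as compact JSON
```

Even here the key `"a"` is carried in the encoding, which is why MessagePack cannot match schema-based formats that replace names with numeric tags.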
Thrift
Binary encoding format, developed at Facebook. Similar to Protocol Buffers.
Not self-describing: a hard schema is needed for both reading and writing.
Thrift interface definition language (IDL) example:

struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}
Protobufs
Binary message format, similar to Thrift; developed at Google
Must know schema to read

message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}
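A sketch of why Protocol Buffers can be so compact: on the wire, each field is keyed by `(field_number << 3) | wire_type`, so field names never appear in the encoded data. The helpers below are illustrative, not the official API:

```python
def varint(n):
    """Encode a non-negative int as a varint: 7 bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def field_string(number, s):
    data = s.encode("utf-8")
    # wire type 2 = length-delimited
    return varint((number << 3) | 2) + varint(len(data)) + data

def field_varint(number, n):
    # wire type 0 = varint
    return varint((number << 3) | 0) + varint(n)

# Encodes the first two fields of the Person message above.
encoded = field_string(1, "Martin") + field_varint(2, 1337)
print(encoded.hex())   # 0a064d617274696e10b90a -> 11 bytes
```

The decoder only needs the field numbers from the schema to map `1` back to `user_name` and `2` back to `favorite_number`.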
Apache arrow
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication…
Apache pig
High-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to what SQL does for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby, or Groovy and then call directly from the language.
ASN.1
Super old
Abstract Syntax Notation One (ASN.1) is a standard interface description language for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.
Commonly encoded as DER, e.g. in SSL/TLS certificates.
Advantages of schema-based binary encodings over JSON, XML, CSV
They can be much more compact than the various “binary JSON” variants, since they can omit field names from the encoded data.
The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date (whereas manually maintained documentation may easily diverge from reality).
Keeping a database of schemas allows you to check forward and backward compatibility of schema changes before anything is deployed.
For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.
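One way to see why schema evolution works: a decoder written against an old schema can skip fields with tag numbers it does not know (protobuf-style), giving forward compatibility. A simplified sketch handling only varint fields; `decode_known` and `varint_decode` are hypothetical helpers:

```python
def varint_decode(buf, i):
    """Decode a varint starting at index i; return (value, next index)."""
    shift, n = 0, 0
    while True:
        b = buf[i]
        i += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, i
        shift += 7

def decode_known(buf, known_fields):
    """Decode varint fields; silently skip field numbers the reader doesn't know."""
    i, result = 0, {}
    while i < len(buf):
        key, i = varint_decode(buf, i)
        number, wire_type = key >> 3, key & 7
        assert wire_type == 0            # this sketch handles only varint fields
        value, i = varint_decode(buf, i)
        if number in known_fields:       # unknown tags are consumed but ignored
            result[known_fields[number]] = value
    return result

# Writer uses a newer schema with an extra field 3; the old reader knows only 1 and 2.
new_record = bytes([0x08, 0x2A,    # field 1 (varint) = 42
                    0x10, 0x07,    # field 2 (varint) = 7
                    0x18, 0x63])   # field 3 (varint) = 99, unknown to the reader
print(decode_known(new_record, {1: "a", 2: "b"}))   # {'a': 42, 'b': 7}
```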
Three common ways data flows between processes
Via databases (see “Dataflow Through Databases”)
Via service calls (see “Dataflow Through Services: REST and RPC”)
Via asynchronous message passing (see “Message-Passing Dataflow”)
Web service
When HTTP is used as the underlying protocol for talking to the service, it is called a web service.
This is perhaps a slight misnomer, because web services are not only used on the web, but in several different contexts. For example:
A client application running on a user’s device (e.g., a native app on a mobile device, or JavaScript web app using Ajax) making requests to a service over HTTP. These requests typically go over the public internet.
One service making requests to another service owned by the same organization, often located within the same datacenter, as part of a service-oriented/microservices architecture. (Software that supports this kind of use case is sometimes called middleware.)
One service making requests to a service owned by a different organization, usually via the internet. This is used for data exchange between different organizations' backend systems.
Finagle
Thrift-based RPC framework from Twitter, using futures (aka promises) to represent asynchronous results
GRPC
RPC framework from Google, built on Protocol Buffers
Asynchronous message-passing systems
Somewhere between RPC and databases: a client's message is delivered to another process with low latency (like RPC), but it goes via an intermediary, a message broker, which stores it temporarily (like a database).
Using a message broker has several advantages compared to direct RPC:
It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).
It allows one message to be sent to several recipients.
It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).
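The advantages above can be sketched with a toy in-process broker (the `TinyBroker` class is an assumed illustration, nothing like a production broker): the sender publishes to a named topic without knowing who the recipients are, messages are buffered per consumer, and one message fans out to several subscribers:

```python
import queue
from collections import defaultdict

class TinyBroker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> list of subscriber queues

    def subscribe(self, topic):
        q = queue.Queue()                 # buffers messages if the consumer is slow or absent
        self.topics[topic].append(q)
        return q

    def publish(self, topic, message):
        # Sender never sees an IP/port; one message reaches every subscriber.
        for q in self.topics[topic]:
            q.put(message)

broker = TinyBroker()
a = broker.subscribe("orders")
b = broker.subscribe("orders")
broker.publish("orders", {"id": 1})
print(a.get(), b.get())   # both subscribers receive the same message
```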
Message brokers
In the past, the landscape of message brokers was dominated by commercial enterprise software from companies such as TIBCO, IBM WebSphere, and webMethods. More recently, open source implementations such as RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka have become popular.
shared-disk architecture
Uses several machines with independent CPUs and RAM, but stores data on an array of disks shared between the machines, connected via a fast network. This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit its scalability.