Diploma exam Flashcards
Describe the map-reduce data processing model and its assumptions.
PIS F SDD
P parallel computations
I independence of the tasks
S simplicity
F fault tolerance
S scalability
D deterministic
D data splitting
mapper, combiner, shuffle and sorting, reducer
assumptions:
- the output of one job can be used as the input to a second job
- typical example: calculating the number of occurrences of tokens (words)
- scalability - can be easily used with horizontal scaling (more nodes added)
- filtering can occur at different stages (map to discard, reduce to no
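The mapper, shuffle-and-sort and reducer stages can be sketched on a single machine with the word-count example. This is only an illustration in plain Python, not Hadoop code; the function names `mapper`, `shuffle`, `reducer` and `word_count` are made up for this card.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit a (word, 1) pair for every token in one input split."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle and sort: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reducer(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Each line is an independent task, so the map calls could run in parallel.
    mapped = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())

counts = word_count(["big data big", "data tools"])
print(counts)
```

Because every `mapper` call depends only on its own line, the splits could be processed on different nodes, which is where the independence and data-splitting assumptions come in.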
Enumerate and describe the assumptions of the Apriori algorithm
FTMLCD (ftorek ML na CD)
it is mainly used for market basket analysis to find goods which are bought together
F frequent itemsets - its goal is to find …
T transactional databases - a set of transactions, each consisting of items kept in lexicographical order
M MST (minimum support threshold)
L level-wise search
C candidate generation
D downward closure property
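A minimal sketch of the level-wise search, assuming the support threshold is given as an absolute count of transactions (it is often given as a fraction instead). The function name `apriori` and the sample baskets are invented for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate in one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Downward closure: drop candidates that have an infrequent subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
result = apriori(baskets, min_support=2)
```

Here `{milk, butter}` occurs only once, so the three-item candidate is pruned by downward closure without counting it.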
Describe clustering methods you got to know
2HEK2 CD
- hierarchical grouping
- hierarchical kmeans
- EM algorithm - it is based on the assumption that the data within one cluster follow a normal distribution.
- k-means
- k-medoids - each center is one of the data points
- COBWEB - incremental conceptual clustering; a hierarchy of groups is built during the run of the algorithm, which uses a quality measure (category utility) to decide how to divide the data into groups
- DNNs can be used as an initial step producing feature vectors, which can later be divided into clusters using standard algorithms
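As an illustration of k-means from the list above, here is a small sketch of the assign-and-update loop (Lloyd's algorithm) on 2-D points. The initial centers are passed in by the caller to keep the run deterministic; real implementations usually pick them randomly. The function name and data are made up.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points with caller-supplied initial centers."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for x, y in points:
            dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            clusters[dists.index(min(dists))].append((x, y))
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                n = len(cluster)
                new_centers.append((sum(p[0] for p in cluster) / n,
                                    sum(p[1] for p in cluster) / n))
            else:
                new_centers.append(center)  # an empty cluster keeps its center
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
```

k-medoids would differ only in the update step: instead of the mean, the new center would be the cluster member with the smallest total distance to the others.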
Areas and mechanisms for providing safety and security in database management systems
firewalls, IDS (intrusion detection systems), antiviruses, personnel training, qualified personnel, physical - good servers stored in safe places
UI BAPCI
* U updates and patches
* I… identification, authentication, authorisation
* B backup
* A audits and monitoring (user activities)
* P privileges
* C cryptography
* I isolation and containerisation
Areas and mechanisms for providing performance in database management systems
well trained personnel, documentation, physical
Druqi PM
* DS - data model (dm) that is optimal for performance reasons (normalisation)
* RW optimisation by partitioning
* Updates and patches
* queries
* indexes
* precompiled structures (materialized views)
* memory tuning (SGA, PGA)
isolation and containerisation
auditing and monitoring - the extra workload leads to lower performance
Discuss regularization methods for deep neural networks.
and his name is: EL DBA MEWA!!!
- early stopping
- label smoothing - the model is less certain of its predictions, therefore it is less prone to overfitting
- dropout
- balancing input
- augmentation
- multitask-learning
- ensemble
- weight decay - more generalized models, not overly focusing on some specific features
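Two of the methods above are easy to show without a deep learning framework. The sketch below assumes the common form of label smoothing (mix the one-hot target with a uniform distribution) and a patience-based early stopping rule; the function names and the loss values are invented.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: move epsilon of the probability mass to all classes."""
    k = len(one_hot)
    return [(1 - epsilon) * y + epsilon / k for y in one_hot]

def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would be stopped."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0      # improvement: reset the counter
        else:
            waited += 1                 # no improvement on validation data
            if waited >= patience:
                return epoch
    return len(val_losses) - 1          # never triggered: ran all epochs

smoothed = smooth_labels([0, 0, 1, 0], epsilon=0.1)
stop_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.66, 0.5], patience=2)
```

In the example the validation loss stops improving after epoch 2, so with a patience of 2 training ends at epoch 4 and never sees the later improvement; choosing the patience is a trade-off.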
Discuss types of neural layers which can be found in a typical convolutional neural network, like AlexNet.
- convolutional - using kernels to detect features
- pooling - avg or max - reducing the spatial size of the feature maps
- flatten
- dense
- dropout
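How convolution and pooling shrink the feature maps can be checked with the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch assumes the commonly quoted AlexNet first layer (227x227 input, 11x11 kernel, stride 4, then 3x3 max pooling with stride 2); the helper names are made up.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

def max_pool(fmap, size=2, stride=2):
    """Max pooling over a 2-D feature map given as a list of rows."""
    rows = out_size(len(fmap), size, stride)
    cols = out_size(len(fmap[0]), size, stride)
    return [[max(fmap[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]

conv1 = out_size(227, kernel=11, stride=4)    # first convolutional layer
pool1 = out_size(conv1, kernel=3, stride=2)   # first max-pooling layer

pooled = max_pool([[1, 3, 2, 0],
                   [4, 2, 1, 1],
                   [0, 1, 5, 6],
                   [2, 2, 7, 8]])
flat = [v for row in pooled for v in row]     # flatten before the dense layers
```

The flatten step only reshapes the data; it is what lets the dense layers consume the pooled feature maps as one vector.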
List and describe data processing tools/solutions in the Hadoop ecosystem
- MapReduce
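- Hive - data warehouse layer with an SQL-like query language (HiveQL) to work with Hadoop
- Pig - for writing scripts, data preparation purpose
- Spark - engine for fast, in-memory batch and stream processing (analytical workloads, not OLTP)
- HBase - NoSQL for OLTP
- Kafka - event streaming platform for building data pipelines, producer and consumer support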
Reliability and high availability of cloud web services
reliability: MTBF = (total elapsed time - total downtime) / number of failures
availability = (total elapsed time - total downtime) / total elapsed time
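A worked example, assuming the usual definitions: MTBF is uptime divided by the number of failures, and availability is uptime divided by total elapsed time. The numbers (a 30-day month, two failures, 3.6 hours of downtime) are invented.

```python
def mtbf(total_hours, downtime_hours, failures):
    """Mean time between failures: uptime per failure."""
    return (total_hours - downtime_hours) / failures

def availability(total_hours, downtime_hours):
    """Fraction of the elapsed time in which the service was up."""
    return (total_hours - downtime_hours) / total_hours

total = 30 * 24        # hours in a 30-day month
downtime = 3.6         # hours of outage in that month
failures = 2

print(f"MTBF: {mtbf(total, downtime, failures):.1f} h")
print(f"availability: {availability(total, downtime):.3%}")
```

With these numbers the service is up 716.4 of 720 hours, which is an availability of 99.5% and an MTBF of 358.2 hours.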
- complete backups distributed around the world
- internet connection quality
- data center security, stemming from location and security safeguard standards
- monitoring error detection and
- using trusted cloud services
- parallel computations of the same operations
- load balancing
- auto scaling
Describe main intelligent agents properties.
agapa aria
A - autonomy
G - goal oriented
A - actions on environment
P - perception of environment
A - adaptation
A - approximate results
R - robustness and error handling
I - intelligent behaviour
A - asynchronous actions
Describe data models used by Geographic Information Systems
raster vs vector (each point below: raster / vector)
* fast, easy collection / a human cannot always reach some places in order to measure easily
* fast
* cheap
* simple structure (pixel matrix)
* scaling only up to some resolution limit
* / attaching metadata
* / grouping objects in layers
* huge amount of data
* generalisation simple / complex algorithms required
* no isolation of individual objects
DSM (digital surface model) - two planes flying
DTM (digital terrain model) - algorithms removing or extracting objects such as buildings and trees
IP protocols - features, advantages, disadvantages, necessity to change
common:
* connectionless - does not establish a connection before sending a packet
* media independent (operates over various media types: Ethernet, Wi-Fi)
* best effort - no guarantee packet will be delivered
IPv4 vs IPv6
* fragmentation - handled by routers (and the sender) / handled only by the sender; the maximum packet size for the path (path MTU) is established before sending a packet
* header - complex, variable size / simplified, fixed size
* IPsec - optional / part of standard
necessity to change:
* limited address space: 32-bit / 128-bit
* security limitations of IPv4
* NAT (network address translation) - required / not required
* simplified address configuration - manual or DHCP / autoconfiguration
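The difference in address space can be checked with Python's standard `ipaddress` module. The addresses below come from the documentation ranges and a private range and are only examples.

```python
import ipaddress

v4 = ipaddress.ip_address("192.0.2.1")      # IPv4 documentation range
v6 = ipaddress.ip_address("2001:db8::1")    # IPv6 documentation range

# max_prefixlen is the number of bits in an address of that version.
v4_space = 2 ** v4.max_prefixlen
v6_space = 2 ** v6.max_prefixlen

# A single typical IPv6 subnet (/64) already dwarfs the whole IPv4 space.
lan = ipaddress.ip_network("2001:db8::/64")

# Private IPv4 addresses like this one are what NAT hides behind a public one.
home = ipaddress.ip_address("192.168.1.10")

print(v4.version, v4_space)
print(v6.version, v6_space)
print(lan.num_addresses > v4_space, home.is_private)
```

The 32-bit space gives about 4.3 billion addresses, which is why NAT became necessary for IPv4 and is not required for IPv6.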
Characterize the architectures of the large-scale business systems.
Microservices Architecture:
Advantages:
* Flexibility: Independent operation of each department or service.
* Technology Stack Variation: Diverse technology stacks for each service.
* Independent Scaling: Services can be scaled independently.
Disadvantages:
* Complexity: Managing many small services with network latency and monitoring considerations.
—
Cloud-Based Architecture:
Advantages:
* Scalability: Cloud platforms offer scalability and high availability.
* Backup Solutions: Offers backup solutions for data reliability.
Disadvantages:
* Dependency on Third-Party: Dependency on a third-party provider.
* Cost Management: Requires ongoing cost management.
—
Distributed Architecture:
Advantages:
* Compute-Intensive Tasks: Beneficial for compute-intensive tasks.
Disadvantages:
* Management: Requires careful management for consistency and reliability.
—
Client-Server Architecture:
Advantages:
* Standard Use: Effective for standard web applications and services.
Disadvantages:
* Scalability: May lack scalability and flexibility for complex systems.
—
Tiered Architecture (N-Tier):
Advantages:
* Separation of Concerns: Separates concerns across layers.
* Balance: Provides a balance between complexity and modularity.
Disadvantages:
* Complexity: Can become complex when integrating across many systems and services.
Monolith architecture
Differences and similarities between Apache Hadoop and Apache Spark frameworks
sim:
* big data
* distributed processing
* support horizontal scaling and can work on commodity hardware
* fault tolerance mechanisms to recover from node failures and ensure data integrity.
dif Hadoop / Spark:
* batch analytics / batch, streaming and interactive analytics
* more complex, low-level programming required / Spark offers high-level APIs, easier to use
* Java / more languages supported (Scala, Python, R, Java)
* narrower usage (MapReduce) / broad usage, with graph libraries or ML tasks
* disc / RAM
* slower / faster
* Limited caching support / Extensive in-memory caching
* Uses HDFS (Hadoop Distributed File System) / has no storage layer of its own; can use HDFS or other storage systems
Explain a race condition. Provide a code sample with such a condition.
- Thread A reads the current value of the shared variable.
- Thread B reads the same current value of the shared variable.
- Thread A performs its addition operation based on the value it read.
- Thread B performs its addition operation based on the value it read.
- Both Thread A and Thread B write their result back to the shared variable; the later write overwrites the earlier one (a lost update), leading to incorrect results because they didn’t take each other’s changes into account.
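The steps above can be reproduced in Python. In real programs the bad interleaving happens only occasionally, so this sketch uses a `threading.Barrier` purely as a teaching device to force both threads to read before either writes; the function names are made up. The second version shows the usual fix with a lock.

```python
import threading

barrier = threading.Barrier(2)   # only here to make the race reproducible
lock = threading.Lock()

def unsafe_increment(state):
    value = state["counter"]       # both threads read the same value (0)
    barrier.wait()                 # wait until the other thread has read too
    state["counter"] = value + 1   # both write 1: one update is lost

def safe_increment(state):
    with lock:                     # read-modify-write is now one critical section
        state["counter"] += 1

def run(increment):
    """Run the given increment function in two threads on a fresh counter."""
    state = {"counter": 0}
    threads = [threading.Thread(target=increment, args=(state,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]

unsafe_result = run(unsafe_increment)
safe_result = run(safe_increment)
print(unsafe_result, safe_result)
```

Two increments should give 2, but the unsafe version ends at 1 because each thread adds to the stale value it read earlier.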
XML: properties, structure of the document and standards allowing for transformations and description of the structure.
properties:
* extensible
* flexible
* can be converted to HTML, plain text, PDF
* structure:
* consists of a header (XML declaration) with the encoding type
* body: **custom tags** and their content, attributes can be added to elements
* human understandable
* hierarchical
* modular
* metalanguage - language to describe another language
* platform independent
* self descriptive
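DTD - description of the document structure
XML Schema - description of the structure, including data types
XSLT - transformations (e.g. to HTML, plain text)
XSL-FO - formatting for output such as PDF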
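A small document showing the header, custom tags, attributes and hierarchy, parsed with Python's standard `xml.etree.ElementTree`. The `library` and `book` tags are invented for the example.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="b1" lang="en">
    <title>Dune</title>
    <year>1965</year>
  </book>
  <book id="b2" lang="pl">
    <title>Solaris</title>
    <year>1961</year>
  </book>
</library>"""

# Parse from bytes so the encoding named in the header is honoured.
root = ET.fromstring(document.encode("utf-8"))

titles = [book.findtext("title") for book in root.findall("book")]
first_id = root.find("book").get("id")          # attribute of an element
years = {book.get("id"): int(book.findtext("year")) for book in root}

print(root.tag, titles, first_id, years)
```

The tag names carry the meaning of the data (self-descriptive), and the nesting of `book` inside `library` is the hierarchical structure.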
Memory management in modern operating systems
usual structure of the process from the top looks as follows:
code
heap - for storing dynamic objects; more flexible than the stack
stack - for storing local variables, call parameters and return addresses
- garbage collection - the garbage collector finds objects which are no longer used and will never be used again during the program's execution, and reclaims their space
between the heap and the stack there is usually free space, which is why the segmentation approach was introduced
it allows memory to be divided so that the parts of a program are placed in different places in memory
thanks to this the OS is able to manage memory more flexibly
however, it causes the problem of so-called fragmentation
external fragmentation: this happens when segments are allocated and deallocated over time, and the freed segments leave gaps or holes in memory that cannot be used for other purposes.
paging and swapping
* paging - physical memory is divided into fixed-size blocks called frames, and a process's address space into blocks of the same size called pages; while the process runs, its pages are assigned to free physical frames.
* swapping - processes which are not currently in use are moved to disk; once they are needed they are moved back to physical memory. Thanks to this operation RAM is used more efficiently.
* demand paging - pages are loaded from disk into memory only when they are needed (demanded)
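Address translation under paging can be sketched with a dictionary as the page table. The 4 KiB page size and the table contents are assumptions for the example; a missing entry stands for a page that has not been loaded yet, which is the case demand paging handles.

```python
PAGE_SIZE = 4096   # assumed page (and frame) size in bytes

def translate(virtual_address, page_table):
    """Map a virtual address to a physical one using the page table."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table.get(page)
    if frame is None:
        # With demand paging the OS would now load the page from disk.
        raise LookupError(f"page fault on page {page}")
    return frame * PAGE_SIZE + offset

page_table = {0: 5, 1: 2}                 # page number -> frame number
physical = translate(4100, page_table)    # page 1, offset 4 -> frame 2

try:
    translate(3 * PAGE_SIZE, page_table)  # page 3 is not resident
    faulted = False
except LookupError:
    faulted = True
```

Neighbouring pages can land in arbitrary frames, so a process does not need one contiguous block of physical memory.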
Concepts and paradigms of object-oriented programming.
A - abstraction: exposing only the essential features and hiding implementation details
P - polymorphism
I - inheritance
E - encapsulation: methods and attributes grouped together
classes and objects
polymorphism - the situation when two subclasses have different implementations of a method from the parent class
interface - set of methods which are to be implemented
encapsulation - it gives the class the possibility to hide chosen attributes and methods from other classes through access modifiers, and promotes getter and setter methods for the purpose of editing attributes
To interact with the encapsulated data, classes provide public methods, often referred to as accessors (getters) and mutators (setters).
The combination of private or protected attributes and public methods for accessing and modifying those attributes results in data hiding.
Constructor and Destructor: Constructors are special methods that initialize objects when they are created. Destructors (not present in all OOP languages) clean up resources and memory when objects are destroyed. Constructors ensure objects are in a valid state upon creation.
methods overloading
methods overriding
Message Passing: In OOP, objects communicate by sending messages to each other. When one object wants to invoke a method or access data from another object, it sends a message.
class
objects
singleton - a class which has only one object (instance). If we assign it to another variable, it is only a reference; it is still the same object
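Several of the concepts above fit into one short sketch. Python has no enforced access modifiers, so the leading underscore is only a convention standing in for `private`/`protected`; all class names are invented.

```python
import math
from abc import ABC, abstractmethod

class Shape(ABC):
    """Abstraction: callers only know that a shape has a name and an area."""
    def __init__(self, name):
        self._name = name              # encapsulation: hidden by convention

    @property
    def name(self):                    # getter exposing the hidden attribute
        return self._name

    @abstractmethod
    def area(self):
        ...

class Square(Shape):                   # inheritance
    def __init__(self, side):          # constructor puts the object in a valid state
        super().__init__("square")
        self._side = side

    def area(self):                    # method overriding
        return self._side ** 2

class Circle(Shape):
    def __init__(self, radius):
        super().__init__("circle")
        self._radius = radius

    def area(self):
        return math.pi * self._radius ** 2

class Config:
    """Singleton: every call to Config() returns the same object."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

# Polymorphism: the same call works on different subclasses.
areas = [shape.area() for shape in (Square(2), Circle(1))]
a, b = Config(), Config()
```

Calling `shape.area()` is also an example of message passing: the caller sends the same message and each object answers with its own implementation.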
Basic data structures and their applications
arrays - decks
tuples - map reduce, multiple return parameters
vectors - 3d graphics
matrices - images, linear algebra
lists - playlists
maps - dictionary lookups
queue - printing queue
stack - stack trace, storing local variables
heap - CPU task scheduling
trees - ordering, clustering, file system
sets - tagging systems, to avoid a post having the same tag assigned multiple times
hash tables - indexes, fast access
strings - searching words, phrases
graphs - social networks
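A few of the structures above with their typical operations, using only Python built-ins and the standard library. The data values are invented.

```python
from collections import deque
import heapq

# Stack (LIFO): the call stack unwinds from the most recent call.
stack = []
for frame in ("main", "parse", "read"):
    stack.append(frame)
last_call = stack.pop()

# Queue (FIFO): print jobs are served in arrival order.
print_queue = deque(["a.pdf", "b.pdf"])
print_queue.append("c.pdf")
first_job = print_queue.popleft()

# Heap as a priority queue: the lowest priority number is scheduled first.
tasks = []
heapq.heappush(tasks, (2, "backup"))
heapq.heappush(tasks, (1, "interrupt"))
heapq.heappush(tasks, (3, "cleanup"))
next_task = heapq.heappop(tasks)[1]

# Set: adding the same tag twice has no effect.
tags = set()
for tag in ("python", "db", "python"):
    tags.add(tag)

# Map (hash table): fast lookup by key.
phone_book = {"alice": "555-0100"}
number = phone_book.get("alice")
```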
Computational and memory complexity
complexity - the amount of time or memory (RAM) needed to complete a task, expressed as a function of the input size
computational - connected with the number of operations required to finish a computational task
memory - refers to how much memory an algorithm and its data structures use. For example, fixed-size arrays have constant-time data access (good computational complexity of operations), but they can be inefficient in terms of memory because they occupy a fixed amount of memory however many elements are stored
O - upper bound (usually quoted for the pessimistic case), Ω - lower bound (usually quoted for the optimistic case), Θ - tight bound (upper and lower bounds match); the expected (average) case is analysed separately
constant O(1) - when access to memory or time of execution is constant, independent of input data size
logarithmic O(log n) - operations on a balanced tree, heap
linear O(n) - calculating the mean; the median with a selection algorithm (O(n log n) if found by sorting)
quadratic O(n^2), cubic O(n^3)
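The gap between linear and logarithmic growth is easy to see by counting comparisons in a linear search and a binary search over the same sorted list. The function names are made up and the step counters are there only for illustration.

```python
def linear_search_steps(items, target):
    """Number of comparisons a linear scan needs: O(n) in the worst case."""
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps
    return len(items)

def binary_search_steps(items, target):
    """Number of probes a binary search needs on sorted data: O(log n)."""
    lo, hi, steps = 0, len(items) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return steps
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

data = list(range(1024))
linear = linear_search_steps(data, 1023)   # worst case: the last element
binary = binary_search_steps(data, 1023)
print(linear, binary)
```

For 1024 elements the linear scan needs 1024 comparisons in the worst case, while binary search needs at most 11, roughly log2(n) + 1.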
Discuss the diagrams used in UML notation and indicate which aspects of an IT system can be modelled using particular kinds of diagrams.
PSS CUDAC
- Package - Grouping of Elements, Dependencies
- Sequence - Dynamic Behavior, Interactions (how objects interact in a particular scenario of a use case, sequence of messages exchanged)
- State - Object States, Lifecycles
- Class - Data Models, Static Structure
- Use Case - Functional Requirements (interaction between users and the system)
- Deployment - Physical Deployment of Software (physical deployment of artifacts on nodes, such as hardware or network nodes)
- Activity - Workflow, Business Processes (object flow, sequence and conditions of the flow)
- Communication - Object Collaboration (how objects interact with each other)
Describe the process of the development of a relational database for a given problem
- specification of client e.g. dental clinic
- identification of the database's aim
- defining system users
- usage scenarios: as a user A I would like the system to do…
- more specific: kinds of queries which can be asked; identification of assumptions and limitations (only one patient can be treated during one visit)
- identifying the data needed and forming initial classes of data
- listing the queries which the database needs to be able to answer
- ERD diagram (for presentation to the client as well as documentation)
- also a Data Dictionary for each attribute, and a list of relationships; consulting the ERD, Data Dictionary and relationships
- relational db implementation, testing and deployment
- definition of the relational database schema: it is again creation of the documentation of entities with their attributes
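The implementation step for the dental clinic example might look like the sketch below, using Python's built-in `sqlite3`. The table and column names are hypothetical; the limitation "only one patient per visit" is expressed by giving each visit exactly one mandatory `patient_id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only on request

# Schema derived from the ERD: one patient has many visits.
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    full_name  TEXT NOT NULL
);
CREATE TABLE visit (
    visit_id   INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    visit_date TEXT NOT NULL
);
""")

conn.execute("INSERT INTO patient VALUES (1, 'Anna Kowalska')")
conn.execute("INSERT INTO visit VALUES (10, 1, '2024-03-01')")
conn.execute("INSERT INTO visit VALUES (11, 1, '2024-04-15')")

# One of the queries listed during analysis: number of visits per patient.
rows = conn.execute("""
    SELECT p.full_name, COUNT(*)
    FROM patient p JOIN visit v ON v.patient_id = p.patient_id
    GROUP BY p.patient_id
""").fetchall()
print(rows)
```

Testing then means running every query collected during analysis against sample data and checking that the constraints reject invalid rows.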
Discuss the basic concepts of functional programming: side effects, pure functions, partial application, map, filtration, reduction.
functional programming assumptions:
* immutability, data cannot be modified after it’s created
* Declarative programming - what to do rather than how to do it
* expression is referentially transparent if it can be replaced with its value without changing the program’s behavior
—
no side effects - a side effect is any effect of an expression or function call that is more than just returning a value, such as interacting with the operating system, changing the value of a global variable or editing external files.
—
pure functions - without side effects and for the same input they always produce the same output
* they simplify finding troublesome code and detecting misbehaviour
* safely used by multiple threads
—
Partial application - fixing only part of the arguments of a multi-argument function, which gives a new function that takes the remaining arguments.
–
Mapping - applying a given function on each list item. e.g. squaring all elements.
—
Filtration - selection of those elements for which a given condition is true (the filter function)
—
Reduction (or folding, or aggregation) is a sort of “summing up” of all elements into one: the process of combining elements in a collection to produce a single result by applying a binary operation cumulatively.
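The concepts above map directly onto Python's built-ins and the `functools` module. The helper functions `power` and `impure_square` are invented for the example.

```python
from functools import partial, reduce
from operator import add

def power(base, exponent):
    """Pure function: the result depends only on the arguments."""
    return base ** exponent

# Partial application: fix the exponent, get a one-argument function back.
square = partial(power, exponent=2)

numbers = [1, 2, 3, 4, 5]
squares = list(map(square, numbers))                  # mapping
evens = list(filter(lambda n: n % 2 == 0, numbers))   # filtration
total = reduce(add, squares, 0)                       # reduction (fold)

# For contrast, a function with a side effect: it modifies outside state.
log = []
def impure_square(n):
    log.append(n)          # side effect: changes a global list
    return n * n

impure_square(3)
print(squares, evens, total, log)
```

`square(3)` can always be replaced by `9` without changing the program (referential transparency), while a call to `impure_square(3)` cannot, because the log would be different.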
Characterize the multidimensional model and explain where and why we use it.
OLTP vs OLAP
purpose:
* make analysis of the data stored within the data warehouse faster
* periodic reporting and trend analysis
* Facilitates the aggregation and comparison of data over different dimensions
dimension and fact tables
drill-down or slice-and-dice
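A tiny star schema in plain Python shows how facts are aggregated over dimensions. The product dimension, the sales fact table and all values are invented; a real warehouse would do this in SQL or an OLAP engine.

```python
from collections import defaultdict

# Dimension table: descriptive attributes of products.
dim_product = {
    1: {"name": "laptop", "category": "electronics"},
    2: {"name": "phone", "category": "electronics"},
    3: {"name": "desk", "category": "furniture"},
}

# Fact table: numeric measures plus keys pointing to the dimensions.
fact_sales = [
    {"product_id": 1, "year": 2023, "amount": 1000},
    {"product_id": 2, "year": 2023, "amount": 500},
    {"product_id": 3, "year": 2023, "amount": 200},
    {"product_id": 1, "year": 2024, "amount": 1200},
]

def aggregate(facts, key):
    """Sum the measure for every value of the chosen dimension attribute."""
    totals = defaultdict(int)
    for row in facts:
        totals[key(row)] += row["amount"]
    return dict(totals)

# Aggregation at category level.
by_category = aggregate(
    fact_sales, lambda r: dim_product[r["product_id"]]["category"])

# Slice: fix one value of the time dimension.
sales_2023 = [r for r in fact_sales if r["year"] == 2023]

# Drill-down: go from category level to individual products.
by_product_2023 = aggregate(
    sales_2023, lambda r: dim_product[r["product_id"]]["name"])
```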