Diploma exam Flashcards
Describe the map-reduce data processing model and its assumptions.
PIS F SDD
P parallel computations
I independence of the tasks
S simplicity
F fault tolerance
S scalability
D deterministic
D data splitting
mapper, combiner, shuffle and sorting, reducer
assumptions:
- the output of one job can be used as the input to a second job
- classic example: counting the number of occurrences of tokens (words)
- scalability - works well with horizontal scaling (more nodes added)
* filtering can occur at different stages (map to discard, reduce to no
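The mapper / shuffle and sort / reducer stages can be sketched on the word-count example. This is a minimal single-process illustration, not Hadoop code; the function names (`mapper`, `shuffle_and_sort`, `reducer`, `word_count`) are made up for the sketch.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Emit a (token, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group values by key and order the keys, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Sum the counts collected for one token."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, so the map calls could run in parallel.
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(key, values) for key, values in shuffle_and_sort(mapped))
```

Because each mapper call depends only on its own line, the input can be split across nodes; a combiner would simply apply `reducer` locally before the shuffle.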
Enumerate and describe the assumptions of the Apriori algorithm
FTMLCD (ftorek ML na CD)
it is mainly used for market basket analysis, to find goods that are bought together
F frequent itemsets - its goal is to find …
T transactional databases - a set of transactions, each consisting of items kept in lexicographical order
M MST (minimum support threshold)
L level-wise-search
C candidate generation
D downward closure property - every subset of a frequent itemset is also frequent
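A minimal sketch of the level-wise search with candidate generation and pruning by downward closure. It assumes the minimum support threshold is given as an absolute transaction count; the function name `apriori` and the sample data in the usage note are illustrative only.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        level_set = set(level)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Downward closure: drop candidates that have an infrequent k-subset.
        candidates = [
            c for c in candidates
            if all(frozenset(sub) in level_set for sub in combinations(c, k))
        ]
        level = [c for c in candidates if support(c) >= min_support]
        k += 1
    return frequent
```

For the baskets `{bread, milk}`, `{bread, butter}`, `{bread, milk, butter}`, `{milk}` with a threshold of 2, the pair `{milk, butter}` is infrequent, so the triple is pruned without ever being counted.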
Describe clustering methods you got to know
2HEK2 CD
- hierarchical grouping
- hierarchical kmeans
- EM algorithm - based on the assumption that the data within each cluster follows a normal (Gaussian) distribution
- k-means
- k-medoids - cluster centers are restricted to actual data points
- COBWEB - incremental conceptual clustering; the hierarchy is built during the run of the algorithm, using a quality measure (category utility) to decide how to divide the data into groups
- DNNs can be used as an initial step that produces feature vectors, which can later be divided into clusters using standard algorithms
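As an illustration of the k-means entry, here is a minimal pure-Python sketch of the assign-then-update loop. It assumes the initial centers are passed in explicitly (real implementations usually pick them randomly); the function name `kmeans` is hypothetical.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on tuples of coordinates; returns (centers, clusters)."""
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for cluster, old in zip(clusters, centers):
            if cluster:
                new_centers.append(tuple(sum(xs) / len(cluster) for xs in zip(*cluster)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, clusters
```

k-medoids would differ only in the update step, where the new center must be one of the points in the cluster rather than their mean.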
Areas and mechanisms for providing safety and security in database management systems
firewalls, IDS (intrusion detection systems), antivirus software, personnel training, qualified personnel, physical security - good servers stored in safe places
UI BAPCI
* U updates and patches
* I identification, authentication, authorisation
* B backup
* A audits and monitoring (user activities)
* P privileges
* C cryptography
* I isolation and containerisation
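A small sketch tying together authentication, privileges and cryptography from the list above. It is an application-level illustration, not how any particular DBMS implements them; the role names and the `PRIVILEGES` table are invented for the example.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None):
    """Derive a salted hash so that plain-text passwords are never stored."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest


def authenticate(password, salt, expected_digest):
    """Authentication: does the supplied password match the stored hash?"""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


# Hypothetical role-to-privilege mapping (least privilege: analysts only read).
PRIVILEGES = {
    "analyst": {"SELECT"},
    "admin": {"SELECT", "INSERT", "UPDATE", "DELETE"},
}


def authorise(role, operation):
    """Authorisation: is this role allowed to perform the operation?"""
    return operation in PRIVILEGES.get(role, set())
```

In a real DBMS the same ideas appear as user accounts with hashed credentials and `GRANT`/`REVOKE` statements on roles.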
Areas and mechanisms for providing performance in database management systems
well trained personnel, documentation, physical
Druqi PM
* DS - a data model that is optimal for performance reasons (normalisation)
* RW - read/write optimisation by partitioning
* Updates and patches
* queries
* indexes
* precompiled structures (materialized views)
* memory tuning (SGA, PGA)
isolation and containerisation
auditing and monitoring - the extra workload leads to lower performance
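The effect of an index on a query can be seen with SQLite from the Python standard library. This is a sketch on a made-up `orders` table; the exact wording of the plan text differs between SQLite versions, so the comments describe it only loosely.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the textual plan SQLite would use for the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the description


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"
plan_before = query_plan(conn, sql, (42,))  # a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))   # a search using idx_orders_customer
```

The trade-off is that every index must be maintained on writes, so indexes speed up reads at the cost of slower inserts and updates.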
Discuss regularization methods for deep neural networks.
and his name is: EL DBA MEWA!!!
- early stopping
- label smoothing - the model is less certain of its predictions, so it is less prone to overfitting
- dropout
- balancing input
- augmentation
- multitask-learning
- ensemble
- weight decay - more generalized models, not overly focusing on some specific features
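Two of the methods above are easy to show without a deep learning framework. The sketch below assumes uniform label smoothing with factor `epsilon` and a patience-based early stopping rule; both function names are hypothetical.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: move epsilon of the probability mass to a uniform distribution."""
    k = len(one_hot)
    return [(1 - epsilon) * y + epsilon / k for y in one_hot]


def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch at which training stops.

    Training stops once the validation loss has not improved
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1
```

With validation losses `[1.0, 0.8, 0.7, 0.75, 0.9, 0.6]` and patience 2, training stops at epoch 4 and the weights from epoch 2 would be restored; the later improvement to 0.6 is never seen.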
Discuss types of neural layers which can be found in a typical convolutional neural network, like AlexNet.
- convolutional - using kernels to detect features
- pooling - average or max - reduces the spatial size of the feature maps
- flatten
- dense
- dropout
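How convolution and pooling layers shrink the feature maps can be checked with the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch below traces AlexNet-style layer settings and assumes a 227x227 input (the size for which the arithmetic works out) and 256 channels after the last convolution.

```python
def output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


size = 227                                           # assumed input width/height
size = output_size(size, kernel=11, stride=4)        # conv1 -> 55
size = output_size(size, kernel=3, stride=2)         # max pool -> 27
size = output_size(size, kernel=5, padding=2)        # conv2 -> 27
size = output_size(size, kernel=3, stride=2)         # max pool -> 13
for _ in range(3):
    size = output_size(size, kernel=3, padding=1)    # conv3-5 -> 13
size = output_size(size, kernel=3, stride=2)         # max pool -> 6

flattened = size * size * 256                        # input length of the first dense layer
```

The flatten layer does no computation; it only reshapes the final 6x6x256 volume into a vector for the dense layers.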
List and describe data processing tools/solutions in the Hadoop ecosystem
- MapReduce
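- Hive - data warehouse layer providing SQL-like queries (HiveQL) over data stored in Hadoop
- Pig - high-level scripting (Pig Latin), used for data preparation
- Spark - in-memory engine for large-scale analytical processing (batch, streaming, SQL, ML)
- HBase - NoSQL column-oriented store for OLTP-style random reads and writes
- Kafka - distributed message streaming for data pipelines, built around producers and consumers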
Reliability and high availability of cloud web services
reliability: MTBF = (total elapsed time - total downtime) / number of failures
availability = (total elapsed time - total downtime) / total elapsed time
- complete backups distributed around the world
- internet connection quality
- data center security stems from location and from the standard of security safeguards
- monitoring and error detection
- using trusted cloud services
- redundancy: the same operations computed in parallel
- load balancing
- auto scaling
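A short worked example of the two measures, where MTBF is uptime divided by the number of failures and availability is uptime divided by total elapsed time. The sample figures (a 720-hour month with 6 hours of downtime over 3 failures) are invented, and `redundant_availability` assumes replicas fail independently.

```python
def mtbf(total_time, downtime, failures):
    """Mean time between failures: uptime divided by the number of failures."""
    return (total_time - downtime) / failures


def availability(total_time, downtime):
    """Fraction of the elapsed time during which the service was up."""
    return (total_time - downtime) / total_time


def redundant_availability(single, replicas):
    """Availability of N replicas in parallel, assuming independent failures."""
    return 1 - (1 - single) ** replicas


month_mtbf = mtbf(720, 6, 3)              # 238.0 hours between failures
month_availability = availability(720, 6)  # about 0.9917
two_replicas = redundant_availability(0.99, 2)  # about 0.9999
```

The last function shows why redundancy and load balancing matter: two 99% replicas together are unavailable only when both are down at once.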
Describe main intelligent agents properties.
agapa aria
A - autonomy
G - goal oriented
A - actions on environment
P - perception of environment
A - adaptation
A - approximate results
R - robustness and error handling
I - intelligent behaviour
A - asynchronous actions
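Perception, goal orientation and action on the environment can be illustrated with a toy agent. The thermostat below is a deliberately simple, hypothetical example; it does not show adaptation or asynchronous actions.

```python
class ThermostatAgent:
    """Keeps the temperature within one degree of its goal."""

    def __init__(self, target):
        self.target = target  # the agent's goal

    def perceive(self, environment):
        return environment["temperature"]

    def decide(self, temperature):
        # Autonomy: the agent picks its action without outside control.
        if temperature < self.target - 1:
            return "heat"
        if temperature > self.target + 1:
            return "cool"
        return "idle"

    def act(self, environment, action):
        change = {"heat": 1.0, "cool": -1.0, "idle": 0.0}[action]
        environment["temperature"] += change


def run(agent, environment, steps):
    """Repeat the perceive-decide-act cycle for a number of steps."""
    for _ in range(steps):
        action = agent.decide(agent.perceive(environment))
        agent.act(environment, action)
    return environment
```

Starting from 15 degrees with a target of 20, the agent heats for four steps and then idles once the temperature is within its tolerance band.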
Describe data models used by Geographic Information Systems
raster vs vector (each line: raster / vector)
* fast, easy collection / people cannot always easily reach the places that need to be measured
* fast
* cheap
* simple structure (matrix of pixels)
* scales only up to the resolution limit
* / metadata can be attached
* / objects can be grouped into layers
* huge amount of data
* generalisation: simple / requires complex algorithms
* no isolation of individual objects
DSM (digital surface model) - two planes flying
DTM (digital terrain model) - algorithms removing extracted objects such as buildings and trees
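The two data models can be contrasted in a few lines. In this sketch the raster is a small matrix of elevation values (a made-up 2x2 DSM and DTM), and the vector feature is a polygon with attached attributes; all values and names are illustrative.

```python
# Raster model: a matrix of cells, each holding one value (here elevation in metres).
dsm = [[102.0, 110.0],
       [101.0, 101.5]]   # surface, including buildings and trees
dtm = [[102.0, 102.0],
       [101.0, 101.5]]   # bare terrain

# Subtracting the terrain from the surface leaves the heights of objects.
object_heights = [[s - t for s, t in zip(srow, trow)] for srow, trow in zip(dsm, dtm)]

# Vector model: explicit geometry plus metadata, grouped into layers.
building = {
    "geometry": [(0, 0), (4, 0), (4, 3), (0, 3)],  # polygon vertices
    "attributes": {"type": "building", "layer": "buildings"},
}


def polygon_area(vertices):
    """Area of a simple polygon using the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2
```

The raster knows only cell values, so the 8-metre cell is not an identified object, while the vector feature is an object with its own geometry and attributes.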
IP protocols - features, advantages, disadvantages, necessity to change
common:
* connectionless - does not establish a connection before sending a packet
* medium independent (operates over various media types: Ethernet, Wi-Fi)
* best effort - no guarantee packet will be delivered
IPv4 vs IPv6
* fragmentation - handled by routers / handled only by the sender, which determines the maximum packet size for the path before sending
* header - complex, variable size / simplified, fixed size
* IPSec - optional / part of standard
necessity to change:
* limited address space: 32-bit / 128-bit
* security limitations of IPv4
* NAT (network address translation) - required / not required
* simplified address configuration - manual or DHCP / autoconfiguration
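The difference in address space can be checked with the standard `ipaddress` module. The helper name `describe` is made up, and the sample addresses come from the ranges reserved for documentation.

```python
import ipaddress


def describe(address):
    """Report the IP version, address length in bits, and total address space."""
    ip = ipaddress.ip_address(address)
    return {
        "version": ip.version,
        "bits": ip.max_prefixlen,
        "address_space": 2 ** ip.max_prefixlen,
    }


v4 = describe("192.0.2.1")     # 32 bits -> 4 294 967 296 addresses
v6 = describe("2001:db8::1")   # 128 bits -> 2**128 addresses
```

Roughly 4.3 billion IPv4 addresses are fewer than the number of connected devices, which is why NAT became necessary and why the move to 128-bit addresses is needed.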
Characterize the architectures of the large-scale business systems.
Microservices Architecture:
Advantages:
* Flexibility: Independent operation of each department or service.
* Technology Stack Variation: Diverse technology stacks for each service.
* Independent Scaling: Services can be scaled independently.
Disadvantages:
* Complexity: Managing many small services with network latency and monitoring considerations.
—
Cloud-Based Architecture:
Advantages:
* Scalability: Cloud platforms offer scalability and high availability.
* Backup Solutions: Offers backup solutions for data reliability.
Disadvantages:
* Dependency on Third-Party: Dependency on a third-party provider.
* Cost Management: Requires ongoing cost management.
—
Distributed Architecture:
Advantages:
* Compute-Intensive Tasks: Beneficial for compute-intensive tasks.
Disadvantages:
* Management: Requires careful management for consistency and reliability.
—
Client-Server Architecture:
Advantages:
* Standard Use: Effective for standard web applications and services.
Disadvantages:
* Scalability: May lack scalability and flexibility for complex systems.
—
Tiered Architecture (N-Tier):
Advantages:
* Separation of Concerns: Separates concerns across layers.
* Balance: Provides a balance between complexity and modularity.
Disadvantages:
* Complexity: Can become complex when integrating across many systems and services.
Monolith architecture
Differences and similarities between Apache Hadoop and Apache Spark frameworks
sim:
* big data
* distributed processing
* support horizontal scaling and can run on commodity hardware
* fault tolerance mechanisms to: recover from node failures and ensure data integrity.
dif Hadoop / Spark:
* batch analytical processing / batch and near-real-time (streaming) analytical processing
* requires more complex, low-level programming / offers high-level APIs that are easier to use
* Java / more languages supported (Scala, Java, Python, R)
* narrower usage (MapReduce) / broad usage, including graph libraries and ML tasks
* disc / RAM
* slower / faster
* Limited caching support / Extensive in-memory caching
* has its own storage layer, HDFS (Hadoop Distributed File System) / has no storage layer of its own; can read from HDFS and from other storage systems
Explain a race condition. Provide a code sample with such a condition.
- Thread A reads the current value of the shared variable.
- Thread B reads the same current value of the shared variable.
- Thread A performs its addition operation based on the value it read.
- Thread B performs its addition operation based on the value it read.
- Both Thread A and Thread B update the shared variable simultaneously, possibly leading to incorrect results because they didn’t take each other’s changes into account.
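A race condition occurs when the result depends on the timing of threads that access shared data without synchronisation. The sketch below first replays the interleaving from the steps above deterministically, then shows an unsynchronised and a lock-protected counter; all names are illustrative, and how often the unsafe version actually loses updates depends on the interpreter and the scheduler.

```python
import threading


def lost_update_demo():
    """Replay the interleaving step by step: two increments, but the result is 1."""
    shared = 0
    a = shared      # Thread A reads 0
    b = shared      # Thread B reads 0
    shared = a + 1  # Thread A writes 1
    shared = b + 1  # Thread B overwrites it with 1; A's update is lost
    return shared


class Counter:
    def __init__(self):
        self.value = 0


def unsafe_add(counter, n, lock):
    for _ in range(n):
        current = counter.value       # read
        counter.value = current + 1   # write; another thread may have written in between


def safe_add(counter, n, lock):
    for _ in range(n):
        with lock:                    # read-modify-write becomes one critical section
            counter.value += 1


def run(worker, n_threads=4, n=100_000):
    """Run the worker in several threads on one shared counter."""
    counter = Counter()
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(counter, n, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

`run(safe_add)` always returns 400000, while `run(unsafe_add)` may return less, and a different value on each run.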