Diploma exam Flashcards

1
Q

Describe the map-reduce data processing model and its assumptions.

A

PIS F SDD
P parallel computations
I independence of the tasks
S simplicity
F fault tolerance
S scalability
D deterministic
D data splitting

mapper, combiner, shuffle and sorting, reducer
assumptions:

  • output of one job can be used as an input to a second job
  • e.g. to calculate the number of occurrences of tokens (words)
  • filtering can occur at different stages (map to discard, reduce to no…)
  • scalability - can be easily used with horizontal scaling (more nodes added)
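
The mapper, shuffle-and-sort and reducer stages listed above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the model, not Hadoop code; the function names `mapper`, `shuffle`, `reducer` and `word_count` are made up for the example.

```python
from collections import defaultdict


def mapper(line):
    """Map: emit a (token, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce: combine all counts of one token into a single total."""
    return key, sum(values)


def word_count(lines):
    # Each line is an independent split, so the mapper calls could run in parallel.
    mapped = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(mapped))
```

Calling `word_count(["a b a", "b c"])` counts tokens across both splits. Because the mapper calls do not depend on each other, they are the part that a real framework distributes across nodes.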

2
Q

Enumerate and describe the assumptions of the Apriori algorithm

A

FTMLCD (ftorek ML na CD)
it is mainly used for market basket analysis, to find goods which are bought together

F frequent itemsets - its goal is to find …
T transactional databases - a set of transactions, each consisting of items kept in lexicographic order
M MST (minimum support threshold)
L level-wise search
C candidate generation
D downward closure property - every subset of a frequent itemset is also frequent
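
A minimal sketch of how these pieces fit together, assuming the support threshold is given as an absolute transaction count. The function name `apriori` and the sample baskets in the usage note are illustrative, not taken from any library.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Keep only candidates whose support count meets the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = count({frozenset([item]) for item in items})
    frequent = dict(level)
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Downward closure: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent
```

For the baskets `[{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]` with `min_support=2`, the pair milk and butter is not frequent, so the three-item candidate is pruned by downward closure without its support being counted.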

3
Q

Describe clustering methods you got to know

A

2HEK2 CD

  • hierarchical grouping
  • hierarchical kmeans
  • EM algorithm - it is based on the assumption that the data within one cluster follow a normal distribution.
  • k-means
  • k-medoids - cluster centers are always assigned to one of the data points
  • COBWEB - the hierarchy of groups is built while the algorithm runs, using a quality measure (category utility) to decide how to divide the data into groups??
  • DNNs can be used as an initial step producing feature vectors, which can later be divided into clusters using standard algorithms
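
As an illustration of the k-means entry, here is a small sketch for one-dimensional data, assuming the initial centers are supplied by the caller. The function name `k_means` and the fixed iteration count are choices made for the example.

```python
def k_means(points, centers, iterations=10):
    """Cluster 1-D points around the given initial centers (Lloyd's algorithm)."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters
```

k-medoids would differ only in the update step: instead of the mean, it would pick the cluster member with the smallest total distance to the other members.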
4
Q

Areas and mechanisms for providing safety and security in database management systems

A

firewalls, IDS (intrusion detection systems), antivirus software, personnel training, qualified personnel, physical security - reliable servers kept in safe places
UI BAPCI
* U updates and patches
* I… identification, authentication, authorisation
* B backup
* A audits and monitoring (user activities)
* P privileges
* C cryptography
* I isolation and containerisation

5
Q

Areas and mechanisms for providing performance in database management systems

A

well trained personnel, documentation, physical
Druqi PM
* DS - data model (dm) designed optimally for performance reasons (normalisation)
* RW (read/write) optimisation by partitioning
* Updates and patches
* queries
* indexes
* precompiled structures (materialized views)
* memory tuning (SGA, PGA)

isolation and containerisation
auditing and monitoring - the extra workload they add leads to lower performance

6
Q

Discuss regularization methods for deep neural networks.

A

and his name is: EL DBA MEWA!!!

  • early stopping
  • label smoothing - the model is less certain of its predictions, therefore it is less prone to overfitting
  • dropout
  • balancing input
  • augmentation
  • multitask-learning
  • ensemble
  • weight decay - more generalized models, not overly focusing on some specific features
7
Q

Discuss types of neural layers which can be found in a typical convolutional neural network, like AlexNet.

A
  • convolutional - using kernels to detect features
  • pooling - avg or max - reducing size of
  • flatten
  • dense
  • dropout
8
Q

List and describe data processing tools/solutions in the Hadoop ecosystem

A
  • MapReduce
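  • Hive - data warehouse layer with SQL-like queries (HiveQL) over data stored in Hadoop
  • Pig - for writing scripts, data preparation purpose
  • Spark - in-memory engine for large-scale analytical processing (batch and streaming)
  • HBase - NoSQL column-oriented store for fast random reads and writes (OLTP-like workloads)
  • Kafka - managing streaming data pipelines, producer and consumer support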
9
Q

Reliability and high availability of cloud web services

A

reliability - MTBF = (total elapsed time - sum of downtime) / number of failures
availability = (total elapsed time - sum of downtime) / total elapsed time

  • complete backups distributed around the world
  • internet connection quality
  • data centre security stems from location and from security safeguard standards
  • monitoring, error detection and…
  • using trusted cloud services
  • parallel computations of the same operations
  • load balancing
  • auto scaling
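
The two measures can be checked with a short calculation. The sketch below assumes MTBF is operating time divided by the number of failures and availability is the share of elapsed time during which the service was up; the function names and the sample figures in the usage note are illustrative.

```python
def mtbf(total_hours, downtime_hours, failures):
    """Mean time between failures: operating time divided by the failure count."""
    return (total_hours - downtime_hours) / failures


def availability(total_hours, downtime_hours):
    """Fraction of the elapsed time during which the service was up."""
    return (total_hours - downtime_hours) / total_hours
```

For a 30-day month (720 hours) with 3 failures and 6 hours of downtime in total, `mtbf(720, 6, 3)` gives 238 hours and `availability(720, 6)` is a little above 99%.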
10
Q

Describe main intelligent agents properties.

A

agapa aria

A - autonomy
G - goal oriented
A - actions on environment
P - perception of environment
A - adaptation

A - approximate results
R - robustness and error handling
I - intelligent behaviour
A - asynchronous actions

11
Q

Describe data models used by Geographic Information Systems

A

raster vs vector (raster / vector)
* fast, easy collection / a human cannot always easily reach some places in order to measure them
* fast
* cheap
* simple structure (pixel matrix)
* scaling only up to some resolution limit
* / attaching metadata
* / grouping objects on layers
* huge amount of data
* generalisation simple / complex algorithms required
* no object isolation

DSM (digital surface model) - two planes flying
DTM (digital terrain model) - algorithms detecting and removing objects such as buildings and trees

12
Q

IP protocols - features, advantages, disadvantages, necessity to change

A

common:
* connectionless - does not establish a connection before sending a packet
* medium independent (operates over various media types: Ethernet, Wi-Fi)
* best effort - no guarantee that a packet will be delivered
IPv4 vs IPv6
* fragmentation - handled by routers / done only by the sender, which establishes the largest packet size the path allows (path MTU) before sending
* Header - complex varied size / simplified, fixed size
* IPSec - optional / part of standard

necessity to change:
* limited address space: 32-bit / 128-bit
* security limitations of IPv4
* NAT (network address translation) - required / not required
* simplified Address configuration - Manual or DHCP / Autoconfiguration
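
The difference in address space can be made concrete with Python's standard `ipaddress` module. The addresses below come from the ranges reserved for documentation and are only examples.

```python
import ipaddress

# Total number of addresses each protocol version can express.
IPV4_ADDRESSES = 2 ** 32
IPV6_ADDRESSES = 2 ** 128

# Example addresses from the documentation ranges (illustrative only).
v4 = ipaddress.ip_address("192.0.2.1")
v6 = ipaddress.ip_address("2001:db8::1")

# max_prefixlen is the address length in bits: 32 for IPv4, 128 for IPv6.
address_bits = {v4.version: v4.max_prefixlen, v6.version: v6.max_prefixlen}

# A /24 IPv4 network holds 256 addresses; a single /64 IPv6 subnet holds 2**64.
small_v4_net = ipaddress.ip_network("192.0.2.0/24").num_addresses
one_v6_subnet = ipaddress.ip_network("2001:db8::/64").num_addresses
```

A single /64 IPv6 subnet already contains more addresses than the whole IPv4 Internet, which is why NAT is no longer needed to stretch the address space.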

13
Q

Characterize the architectures of the large-scale business systems.

A

Microservices Architecture:
Advantages:
* Flexibility: Independent operation of each department or service.
* Technology Stack Variation: Diverse technology stacks for each service.
* Independent Scaling: Services can be scaled independently.
Disadvantages:
* Complexity: Managing many small services with network latency and monitoring considerations.

Cloud-Based Architecture:
Advantages:
* Scalability: Cloud platforms offer scalability and high availability.
* Backup Solutions: Offers backup solutions for data reliability.
Disadvantages:
* Dependency on Third-Party: Dependency on a third-party provider.
* Cost Management: Requires ongoing cost management.

Distributed Architecture:
Advantages:
* Compute-Intensive Tasks: Beneficial for compute-intensive tasks.
Disadvantages:
* Management: Requires careful management for consistency and reliability.

Client-Server Architecture:
Advantages:
* Standard Use: Effective for standard web applications and services.
Disadvantages:
* Scalability: May lack scalability and flexibility for complex systems.

Tiered Architecture (N-Tier):
Advantages:
* Separation of Concerns: Separates concerns across layers.
* Balance: Provides a balance between complexity and modularity.
Disadvantages:
* Complexity: Can become complex when integrating across many systems and services.

Monolith architecture

14
Q

Differences and similarities between Apache Hadoop and Apache Spark frameworks

A

similarities:
* big data
* distributed processing
* support horizontal scaling and can work on commodity hardware
* fault tolerance mechanisms to recover from node failures and ensure data integrity.
differences (Hadoop / Spark):
* batch analytical processing / batch analytical and near-real-time (streaming) processing
* requires more complex, low-level programming / offers high-level APIs that are easier to use
* mainly Java / more languages supported (Scala, Java, Python, R)
* narrower usage (MapReduce) / broad usage, with libraries for graphs or ML tasks
* disk / RAM
* slower / faster
* Limited caching support / Extensive in-memory caching
* Uses HDFS (Hadoop Distributed File System) / can use HDFS but does not depend on it and can work with other storage systems

15
Q

Explain a race condition. Provide a code sample with such a condition.

A
  • Thread A reads the current value of the shared variable.
  • Thread B reads the same current value of the shared variable.
  • Thread A performs its addition operation based on the value it read.
  • Thread B performs its addition operation based on the value it read.
  • Both Thread A and Thread B update the shared variable simultaneously, possibly leading to incorrect results because they didn’t take each other’s changes into account.
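
The steps above can be reproduced with a short Python sketch. It is an illustration under assumptions: the function name `run`, the number of threads and the number of increments are arbitrary, and whether the unsafe variant actually loses updates on a given run depends on how the interpreter schedules the threads.

```python
import threading


def run(workers=4, increments=10_000, use_lock=False):
    """Increment a shared counter from several threads and return its final value."""
    state = {"counter": 0}
    lock = threading.Lock()

    def unsafe_add():
        for _ in range(increments):
            current = state["counter"]      # read the shared value
            state["counter"] = current + 1  # write back; another thread may have written in between

    def safe_add():
        for _ in range(increments):
            with lock:                      # the read-modify-write is now one critical section
                state["counter"] += 1

    target = safe_add if use_lock else unsafe_add
    threads = [threading.Thread(target=target) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]
```

`run(use_lock=True)` always returns `workers * increments`. `run(use_lock=False)` can return less, because two threads may read the same value and one of the two additions is then lost.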
16
Q

XML: properties, structure of the document and standards allowing for transformations and description of the structure.

A

properties:
* extensible
* flexible
* can be converted to HTML, plain text, PDF
* human understandable
* hierarchical
* modular
* metalanguage - language to describe another language
* platform independent
* self descriptive

structure:
* consists of a header (XML declaration) with the encoding type
* body: custom tags and their content; attributes can be added to elements

description of the structure:
DTD
XML Schema

transformations:
XSLT
XSL-FO - formatting for output such as PDF
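
The document structure can be illustrated by parsing a tiny XML document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`library`, `book`, `id`, `title`) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Header (XML declaration with encoding) followed by a body of custom tags.
DOCUMENT = """<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="1">
    <title>Example title</title>
  </book>
</library>"""

# Parse from bytes so that the encoding declaration in the header is honoured.
root = ET.fromstring(DOCUMENT.encode("utf-8"))

book = root.find("book")                # hierarchical: child element of the root
book_id = book.get("id")                # attribute attached to an element
title = root.find("book/title").text    # text content of a nested element
```

The tag names carry the meaning of the data, which is what makes the format self-descriptive. A DTD or XML Schema would additionally state that every `book` must have an `id` and a `title`.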

17
Q

Memory management in modern operating systems

A

usual structure of the process from the top looks as follows:
code
heap - for storing dynamically allocated objects; more flexible than the stack
stack - for storing local variables, call parameters and return addresses

  • garbage collection - the garbage collector looks for objects which are no longer used and will never be used again during the program's execution, and reclaims their space

between heap | | stack there is usually free space, therefore the segmentation approach was introduced
it allows memory to be divided and the parts of a program to be placed in different places in memory
thanks to it the OS is able to manage memory more flexibly
however, it causes the problem of so-called fragmentation

segmentation fragmentation - this happens when segments are allocated and deallocated over time, and the freed segments leave gaps or holes in memory that cannot be used for other purposes.

paging and swapping
* paging - physical memory is divided into fixed-size blocks (frames); a running process is divided into blocks of the same size (pages), which are assigned to free frames.
* swapping - processes which are not currently in use are moved to disk; once they are needed, they are moved back to physical memory. Thanks to this operation RAM is used more efficiently.
* demand paging - pages are loaded from disk into memory only when they are needed (demanded)
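
Paging can be illustrated with a simplified address translation. The sketch assumes a 4096-byte page size and a page table held in a dictionary; real systems do this in hardware with multi-level tables, and the name `translate` is made up for the example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def translate(virtual_address, page_table):
    """Map a virtual address to a physical one using a page -> frame table."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in page_table:
        # With demand paging the OS would now load the page from disk.
        raise LookupError(f"page fault on page {page}")
    return page_table[page] * PAGE_SIZE + offset
```

With the table `{0: 5, 1: 2}`, virtual address 4100 lies in page 1 at offset 4, so it maps into frame 2. The pages of one process do not have to be stored next to each other in physical memory.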

18
Q

Concepts and paradigms of object-oriented programming.

A

A - abstraction
P - polymorphism
I - inheritance
E - encapsulation - grouped methods and attributes
classes and objects

abstraction - exposing only the essential behaviour of an object and hiding the implementation details
polymorphism - situation when two subclasses have different implementations of a method from the parent class
interface - set of methods which are to be implemented
encapsulation - it gives the class the possibility to hide chosen methods from other classes through access modifiers, and promotes getter and setter methods for the purpose of editing attributes

To interact with the encapsulated data, classes provide public methods, often referred to as accessors (getters) and mutators (setters).
The combination of private or protected attributes and public methods for accessing and modifying those attributes results in data hiding.

Constructor and Destructor: Constructors are special methods that initialize objects when they are created. Destructors (not present in all OOP languages) clean up resources and memory when objects are destroyed. Constructors ensure objects are in a valid state upon creation.

method overloading

method overriding

Message Passing: In OOP, objects communicate by sending messages to each other. When one object wants to invoke a method or access data from another object, it sends a message.

class

objects

singleton - a class which has only one object (instance). If we assign it to another variable it acts like a reference; it is still the same object
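
A compact Python sketch of these concepts follows. The class names `Shape`, `Square`, `Rectangle` and `Registry` are invented for the illustration, and Python marks private attributes only by convention (a leading underscore).

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    """Abstraction: callers only need to know that a shape has an area."""

    def __init__(self, name):
        self._name = name  # encapsulation: underscore marks the attribute as internal

    @property
    def name(self):
        """Getter exposing the internal attribute read-only."""
        return self._name

    @abstractmethod
    def area(self):
        """Interface method every concrete shape has to implement."""


class Square(Shape):  # inheritance
    def __init__(self, side):
        super().__init__("square")  # constructor puts the object in a valid state
        self._side = side

    def area(self):  # method overriding
        return self._side ** 2


class Rectangle(Shape):
    def __init__(self, width, height):
        super().__init__("rectangle")
        self._width, self._height = width, height

    def area(self):
        return self._width * self._height


class Registry:
    """Singleton: every call to Registry() returns the same object."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance
```

Calling `area()` on a list of mixed shapes is polymorphism: the same message is sent to each object and each class answers with its own implementation.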

19
Q

Basic data structures and their applications

A

arrays - decks
tuples - map reduce, multiple return parameters
vectors - 3d graphics
matrices - images, linear algebra
lists - playlists
maps - dictionary lookups
queue - printing queue
stack - stack trace, storing local variables
heap - CPU task scheduling
trees - ordering, clustering, file system
sets - tagging systems to avoid situation when post will have same tag multiple times assigned
hash tables - indexes, fast access
strings - searching words, phrases
graphs - social networks
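
A few of these structures and their typical uses, sketched with the Python standard library. The file names, function names and task names are invented sample data.

```python
import heapq
from collections import deque

# Queue (FIFO): the document that arrived first is printed first.
print_queue = deque(["a.pdf", "b.pdf"])
print_queue.append("c.pdf")
first_printed = print_queue.popleft()

# Stack (LIFO): the function called last returns first.
call_stack = ["main", "parse"]
call_stack.append("read_token")
last_called = call_stack.pop()

# Heap (priority queue): the task with the lowest priority number runs next.
tasks = []
heapq.heappush(tasks, (2, "backup"))
heapq.heappush(tasks, (1, "handle interrupt"))
heapq.heappush(tasks, (3, "cleanup"))
next_task = heapq.heappop(tasks)

# Set: adding the same tag twice leaves a single copy.
tags = set()
tags.update(["python", "oop", "python"])

# Map (hash table): constant-time lookup by key on average.
dictionary = {"cat": "kot", "dog": "pies"}
translation = dictionary["cat"]
```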

20
Q

Computational and memory complexity

A

computational complexity - the amount of time/memory (RAM) needed to complete a task, expressed as a function of the input size

computational - more connected with the number of operations required to finish some computational task

memory - refers more to how efficiently data structures use memory. For example, fixed-size arrays have instant data access (good computational complexity of operations), but they can be inefficient in terms of memory management, as they occupy a fixed amount of memory regardless of how much of it is used

O - upper bound (usually quoted for the pessimistic case), Ω - lower bound (optimistic case), Θ - tight bound (often used for the expected/average case)

constant O(1) - when access to memory or time of execution is constant, independent of input data size
logarithmic O(log n) - operations on balanced trees, heaps
linear O(n) - calculating mean, median
quadratic O(n^2), cubic O(n^3)
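
The gap between linear and logarithmic growth can be seen by counting comparisons. In this sketch both functions return the number of steps taken rather than the position found; the names are made up for the example, and binary search assumes the list is sorted.

```python
def linear_search_steps(items, target):
    """Count comparisons made by a left-to-right scan: O(n)."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(items, target):
    """Count comparisons made by binary search on a sorted list: O(log n)."""
    steps = 0
    low, high = 0, len(items) - 1
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps
```

Searching for the last element of `list(range(1024))` takes 1024 steps with the linear scan and at most 11 with binary search. Doubling the input adds only one more step to the binary search.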

21
Q

Discuss the diagrams used in UML notation and indicate which aspects of an IT system can be modelled using particular kinds of diagrams.

A

PSS CUDAC

  • Package - Grouping of Elements, Dependencies
  • Sequence - Dynamic Behavior, Interactions (how objects interact in a particular scenario of a use case, sequence of messages exchanged)
  • State - Object States, Lifecycles
  • Class - Data Models, Static Structure
  • Use Case - Functional Requirements (interaction between users and the system)
  • Deployment - Physical Deployment of Software (physical deployment of artifacts on nodes, such as hardware or network nodes)
  • Activity - Workflow, Business Processes (object flow, sequence and conditions of the flow)
  • Communication - Object Collaboration (how objects interact with each other)
22
Q

Describe the process of the development of a relational database for a given problem

A
  • specification of client e.g. dental clinic
  • db aim identification
  • defining system users
  • usage scenarios
    as a user A I would like the system to do…
    more specific - kinds of queries which can be asked
  • identification of assumptions and limitations (only one patient can be treated during one visit)
  • identifying data needed and forming initial classes of data
  • Listing queries which the database needs to support
  • ERD diagram (presentation to client as well as documentation)
    but also Data Dictionary - For each attribute and list of relationships
  • consulting ERD, DD and Rel
  • relational db implementation, testing and deployment
  • Definition of relational database schema. It is again creation of the documentation of entities with their attributes.
23
Q

Discuss the basic concepts of functional programming: side effects, pure functions, partial application, map, filtration, reduction.

A

functional programming assumptions:
* immutability, data cannot be modified after it’s created
* Declarative programming - what to do rather than how to do it
* expression is referentially transparent if it can be replaced with its value without changing the program’s behavior

side effects - any effect of an expression or function call that is more than just returning a value, such as interacting with the operating system, changing the value of a global variable or editing external files. Functional programming tries to avoid them.

pure functions - without side effects, and for the same input they always produce the same output
* they make it easier to locate troublesome code and detect misbehaviour
* they can be safely used by multiple threads

Partial application - fixing only part of the arguments of a multi-argument function, which gives a new function that takes the remaining arguments.

Mapping - applying a given function to each list item, e.g. squaring all elements.

Filtration - selection of those elements for which a given condition is true (command filter)

Reduction (or folding, or aggregation) - a sort of “summing up” of all elements into one: the process of combining the elements of a collection into a single result by applying a binary operation cumulatively.
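
These concepts map directly onto Python's built-ins and the `functools` module. The function `power` and the sample list are invented for the illustration.

```python
from functools import partial, reduce


def power(base, exponent):
    """Pure function: the result depends only on the arguments, no side effects."""
    return base ** exponent


# Partial application: fix the exponent, get a one-argument function back.
square = partial(power, exponent=2)

numbers = [1, 2, 3, 4, 5]

# Mapping: apply the function to every element.
squares = list(map(square, numbers))

# Filtration: keep only the elements for which the condition holds.
even_squares = list(filter(lambda n: n % 2 == 0, squares))

# Reduction: fold the elements into one value with a binary operation.
total = reduce(lambda acc, n: acc + n, even_squares, 0)
```

None of the steps modifies `numbers`; each one produces a new value, which is the immutability assumption in practice.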

24
Q

Characterize the multidimensional model and explain where and why we use it.

A

OLTP vs OLAP

purpose:
* make analysis of the data stored within the data warehouse faster
* periodic reporting and trend analysis
* Facilitates the aggregation and comparison of data over different dimensions

dimensional and fact tables
drill-down or slice-and-dice

25
Q

Describe data classification methods and methods of training classifiers

A
  • DNN
  • k-nearest neighbours (KNN)
  • knn center
  • decision trees
  • SVM - decisions based on a hyperplane separating the classes
  • bayesian classification - works based on the conditional probability of each known feature …?

all of the above methods are first trained and then tested; the data can be split by:
* half by half
* cross validation
* k-fold validation
* leave one out
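
The splitting schemes differ only in how the samples are divided into training and test parts. Below is a minimal sketch of k-fold splitting without shuffling; the function name `k_fold_splits` is made up for the example.

```python
def k_fold_splits(samples, k):
    """Yield (train, test) pairs; each sample lands in the test part exactly once."""
    # Deal the samples into k folds round-robin (no shuffling in this sketch).
    folds = [samples[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test
```

Leave-one-out is the special case where `k` equals the number of samples, so every test part holds a single sample. A half-by-half split corresponds to taking just one of the two pairs produced with `k=2`.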

26
Q

Characterize the phases of a data mining process, along with methods and tools used in each phase

A

business understanding
- reports for understanding business goals

data understanding
- creating and using documentation
- graphs (histograms, scatter plots)
- regression (test of correlations between the values)

data preparation
- cleaning data (removing missing values)
- data transformation (normalisation, merging, concatenation, focusing on important things)
- grouping the data and keeping only the fields important for this research (reducing the number of dimensions)

modeling
- choosing model
- test scenarios for the purpose of model validation
- methods: association rules, decision trees, random forest (group of trees which help to make a more informed decision), clustering methods, time series analysis, bayesian classification, KNN - k-nearest neighbours

evaluating results
- if the results are not good enough, go back to step 1
- ROC curve, confusion matrix, MSE, cross validation, k-fold, half to half, leave one out

deployment of the model
- final report, implementation plan, monitoring and maintenance plan.

27
Q

Describe the reasons for the usage of nonrelational databases and their applications due to the scalability and data structures they operate on.

A
  • flexibility (good for storing data where each object has different parameters, as in e-commerce)
  • appropriate for horizontal scaling - NoSQL databases are built so that the data can easily be distributed between many servers, which helps with effective access to the data and its safety
  • distributed architecture
  • cost effective - as it can use multiple cheap servers instead of investing in one expensive server
  • because it is able to hold complex records, it can store redundant structures which are well suited for the purpose of performance

their applications due to scalability:
* web applications - because their horizontal scaling capabilities make them appropriate for handling a fluctuating number of users
* e-commerce
* Big Data
* IoT data - because these devices together generate a huge amount of data, they require a scalable solution for storing it

due to the data structures (DS) they operate on:
e-commerce - appropriate when the attributes of the diverse products in the catalog change constantly, as it is able to cope with flexible data structures
Big Data - good for big data which consists of many different types of data that are constantly changing or being extended with new types of data

28
Q

Main data quality dimensions

A

CC-TURA

Consistency
Completeness
Timeliness - data is current enough to be used
Understandability
Relevance
Accuracy

29
Q

What are goods and services and what are the main differences between them?

A

Goods: Tangible items that are produced, sold, and used, such as electronics, clothing, or furniture.
Services: Intangible offerings that involve performing tasks or providing expertise for others, like consulting services, healthcare, or restaurant dining.

G/S (goods / services)
* can be stored / cannot be stored
* tangible / intangible
* can be consumed later / consumed instantly
* transfer from one place to another is possible / not possible

30
Q

Discuss the differences between profit maximisation carried out in a perfect competition market and pure monopoly in the short and the long run.

A

Monopoly
short and long run are the same due to the lack of competition. The MR curve typically lies below D, which makes the MC = MR intersection fall below the optimal quantity level. As there is no competition, the price is then set to its limit (the highest price demand allows at that quantity), so at a higher level than in competitive markets

The company's strategy regarding price and quantity does not change, because there are no competitors who could compete on the market. However, the profit of a monopoly may change if it works on reducing costs. Depending on the price of the good:
* above ATC - income (profit)
* = ATC - break-even point
* between ATC and AVC - operating at a loss
* below AVC - shut down

Perfect Market
short - companies may make profits as well as losses; a firm that sets a higher price than the competition loses its buyers, because all firms sell the same good.

long - a price that is too high eventually leads to earning no profit, as other firms produce exactly the same good. A price that is too low, if manageable without making a loss, eventually causes other firms to lower their prices as they sell the same good; if it means operating at a loss, it causes the company to fail

31
Q

Please describe communication model

A

Sender-Message-Channel-Receiver

sender encodes, the medium transfers the message (with potential interruptions), receiver decodes

written formal
* scientific papers, books, articles
* required by law or other regulations
* slowest medium

verbal formal
* conferences, business presentations, official meetings, job interview

written informal
* typical emails

verbal informal
* talking
* solving simple cases
* fastest medium

communication methods:

interactive:
two or more sides
multidirectional

push:
sent to specific receiver

pull:
sent once and available for multiple receivers

32
Q

Please define assets

A

economic resources owned or controlled by the individual or company which have potential to provide the future benefits.

tangible - office, furnishing, cash, machinery, vehicles
intangible - patents, shares, trademarks, copyright

current - expected to be used or converted to cash within one year
* cash,
* inventory - finished goods ready for sale,
* raw materials used in production,
* goods and materials that a company holds for the purpose of resale or for use in its production process
noncurrent - assets expected to provide benefits beyond one year
* equipment - tangible assets that a company uses in its operations to generate revenue or facilitate its business activities
* property
* long-term investments

33
Q

List one by one the stages of the marketing research process and briefly describe each of them.

A

define research problem
* main research problem
* specific research problems - supplementary questions in relation to the main research problem
* research objectives - formulate research objectives (RO) for the purpose of evaluating the main research problem (RP)

prepare research design
* establish available data sources
* based on them, create a questionnaire which will be able to address the research objectives

data collection
* establish target group
* find the ways to collect enough responses from the target group
* data processing and coding (e.g. from descriptive scale to numeric)

analyzing and interpreting data:
* analyzing, calculating statistics, averages, preparing comparisons, graphs
* interpreting results of analysis in the context of research objectives if they are fulfilled or not

creating research report
* its purpose is to help company to decide whether to invest in the idea they had or not

34
Q

List and shortly characterize the main methods of assessing investment projects in the enterprise.

A

PP (payback period)

DPP (discounted payback period)

IRR (internal rate of return) - the discount rate at which the project's NPV equals zero; the project is worthwhile when the IRR is higher than the cost of capital.

MIRR (modified IRR)
it solves two IRR problems:
* the multiple IRR problem and
* the reinvestment assumption: whereas the regular IRR assumes that the cash flows from each project are reinvested at the IRR itself, the MIRR assumes that cash flows are reinvested at the cost of capital.

PI (profitability index) = discounted cash inflows / cash outflows
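
A short sketch of three of these measures, assuming cash flows are given per period with the initial outlay as a negative number at period 0. The function names and the sample project in the usage note are illustrative.

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[0] is the (negative) initial outlay at t=0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def payback_period(cash_flows):
    """First period in which the cumulative cash flow is non-negative, else None."""
    cumulative = 0
    for t, cf in enumerate(cash_flows):
        cumulative += cf
        if cumulative >= 0:
            return t
    return None


def profitability_index(rate, cash_flows):
    """Discounted later cash flows divided by the initial outlay."""
    discounted_inflows = sum(
        cf / (1 + rate) ** t for t, cf in enumerate(cash_flows) if t > 0
    )
    return discounted_inflows / -cash_flows[0]
```

For a project with flows `[-1000, 500, 500, 500]` the payback period is 2 periods. At a 10% discount rate the NPV is positive and the PI is above 1, so the project would be accepted on both criteria. The IRR is the rate at which `npv` returns zero and is usually found numerically.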

35
Q

Please describe three categories of user interface: CLI (command line based), GUI (Graphical User Interface), NUI (Natural User Interface). Please point out the main differences among them.

A

CLI / GUI / NUI
less user friendly
text interaction (keyboard mainly)
low level operations
lower flexibility
higher, more complex tasks which can be done using that

36
Q

List and characterize the most commonly used business process modeling techniques. Compare them due to e.g. the possibility of modifying the model or the purpose of modeling.

A

FUGI CD
F - flow chart
U - UML
G - Gantt Chart
I - IDEF0
C - Coloured Petri Nets
D - Data Flow Diagram
https://docs.google.com/document/d/1dZogcIPGFFFDJO0Gxu9ZnKyDv7ll1dXCxj6Zq74ciD0/edit?usp=sharing

37
Q

Gestalt Principles for Data Visualization

A

low FPS in CS on CCC map
F - figure-ground
P - proximity
S - similarity
C - common fate
S - symmetry
C - common region
C - closure
C - continuity

38
Q

What are the three key attributes (properties) of information security that are included in its most common definition? Explain the attributes and for each of them provide an example of its violation.

A

CIA
Confidentiality - information is kept private and can be accessed only by authorized users - violation: data breaches
Integrity - ensures accuracy and reliability of the data - violation: “data manipulation” attack
Availability - data is accessible at any time - violation: DoS attacks