Resume Flashcards
Resume Bullet Point: Caching and Distributed Lock
The idea here is that different services can share a common SLI, so the dashboard ended up fetching the same SLI multiple times. To optimise the dashboard loading time, we decided to implement a caching system plus a distributed lock.
I initially used Redis to build the distributed locking system: essentially we spun up another Redis instance and used the atomic SETNX and DEL operations to keep a consistent record of which locks are held. We also set a time-to-live (TTL) on each lock so the system stays fault tolerant if a lock holder crashes. Additionally, I used fencing tokens to make the lock safe and to guarantee that only the right machine releases the lock; if a delayed request arrives with an old token, we know it is stale and can simply reject it. A rough sketch of this pattern is below.
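A minimal sketch of the acquire/release pattern described above, using redis-py. The key naming and the token scheme are illustrative assumptions, not the exact internal implementation (the modern SET with NX and PX replaces a separate SETNX + EXPIRE):

```python
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

def acquire_lock(resource, ttl_ms=10_000):
    token = str(uuid.uuid4())                      # ownership / fencing token for this holder
    # SET ... NX PX atomically sets the key only if it does not exist, with a TTL.
    if r.set(f"lock:{resource}", token, nx=True, px=ttl_ms):
        return token
    return None                                    # someone else holds the lock

# Release only if the stored token is still ours, done atomically in Lua so a
# slow client cannot delete a lock that has since expired and been re-acquired.
RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
end
return 0
"""

def release_lock(resource, token):
    return r.eval(RELEASE_SCRIPT, 1, f"lock:{resource}", token) == 1
```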
I initially ended the project here, but later, when I was blocked on another project, I came back to this one to try to make it better.
I realised that to ensure stronger consistency, we need a consensus algorithm so that any change is agreed upon by a majority of the nodes before it is committed, preventing split-brain scenarios where different parts of the system have a different idea of who is holding a lock.
I tried the Raft consensus algorithm: the cluster has an odd number of nodes, one of which is the leader. Every time a write is proposed, the leader proposes it to all the other nodes; as long as a majority of the follower nodes accept it, the leader goes ahead and commits the read/write, then tells all the follower nodes about it.
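A highly simplified sketch of that majority-commit idea (not a full Raft implementation; leader election, log matching, and persistence are all omitted, and the node classes are hypothetical):

```python
class Leader:
    """Leader node that only commits an entry once a majority acknowledges it."""

    def __init__(self, followers):
        self.followers = followers   # follower node stubs with append_entry()/commit()
        self.log = []                # committed lock operations

    def propose(self, entry):
        acks = 1                     # the leader counts itself as one vote
        for f in self.followers:
            if f.append_entry(entry):        # follower persists the entry and acks
                acks += 1
        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:         # strict majority reached
            self.log.append(entry)
            for f in self.followers:
                f.commit(entry)              # tell followers the entry is committed
            return True
        return False                         # not enough acks; entry is not committed
```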
TikTok: Tell me more about your experience
- I was a Backend Engineer Intern @ Bytedance, and I worked mostly on building and optimising tools for SREs and on ensuring the stability and quality of TikTok’s services.
- I worked on optimising an in-house monitoring tool by developing a caching and distributed lock system using Redis to speed up the loading time of the dashboard for SREs.
- In addition to that, I worked on pre-calculating the SLIs (Service Level Indicators) and storing them in a database, repeating this process daily through a cronjob so that we can fetch the necessary data without calling the expensive service over and over each time the dashboard loads.
- I also worked on another smaller scale project briefly towards the end of internship to ensure the high availability of TikTok’s content creation capabilities.
- Essentially, that project was about routing traffic to a mock service when some of the strong dependencies for our content checks are down and we cannot determine whether a particular piece of content is “safe” for upload or not.
Resume Bullet Point: RPC Handlers using Kitex and Thrift
- Another optimisation for the in-house monitoring tool, but in a different area: this deals with a process of larger, more general scope compared to the caching and distributed lock system.
- Essentially, for the SLIs of the different service chains and the services under those chains, we want to pre-calculate the data for every time granularity the user can select on the dashboard and store it in MongoDB.
- This is because multiple services may share the same SLI, so pre-calculating lets us avoid repeating the same work.
- Essentially, I wrote RPC handlers to call the services, retrieve and aggregate the data, then store it in MongoDB.
- To keep the data relevant and up to date, I wrote a CronJob that repeats this process daily and refreshes the data for the past week (rough sketch below).
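A minimal sketch of that daily pre-calculation job, assuming a hypothetical fetch_sli() RPC wrapper, a local MongoDB instance, and illustrative granularity options (none of these names come from the real internal system):

```python
from datetime import datetime, timedelta
from pymongo import MongoClient, UpdateOne

GRANULARITIES = ["1m", "5m", "1h", "1d"]        # options selectable on the dashboard

def fetch_sli(service, granularity, start, end):
    # Placeholder for the expensive RPC call that aggregates raw metrics.
    raise NotImplementedError

def precalculate(services):
    coll = MongoClient("mongodb://localhost:27017").monitoring.sli_cache
    end = datetime.utcnow()
    start = end - timedelta(days=7)             # refresh the past week's data
    ops = []
    for service in services:
        for g in GRANULARITIES:
            points = fetch_sli(service, g, start, end)
            ops.append(UpdateOne(
                {"service": service, "granularity": g},
                {"$set": {"points": points, "updated_at": end}},
                upsert=True,
            ))
    if ops:
        coll.bulk_write(ops)                    # write everything in one batch

# A cron entry such as "0 2 * * *" would invoke precalculate() daily.
```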
Resume Bullet Point: Mock Service
- Essentially, TikTok has safety checks for user content creation (videos, comments, etc.) to catch things that violate the community guidelines (e.g. violence).
- However, when the strong dependencies that the service relies on go down, the system cannot check whether the content is “safe” or not; the check then returns an empty struct, the content is deemed unsafe and blocked from upload, and some users with perfectly “safe” content are denied upload.
- In order to ensure high availability during this short period of downtime, my team built a mock service: traffic is routed to it instead, and it returns a dummy struct response with a tag, allowing content to be uploaded successfully during this window.
- We then worked with the content safety team to ensure that all content carrying this tag is scanned frequently to confirm it does not violate the community guidelines and is indeed “safe” to publish; if it isn’t, it is quickly taken down.
- Essentially, I worked on the routing portion of the mock service, which checks whether the strong dependencies are down and switches traffic to the mock service immediately.
How did you detect when the strong dependencies were down, triggering the need to redirect to the mock service?
- We utilized health-check endpoints and monitoring APIs (such as Grafana’s API) to constantly assess the health of our strong dependencies.
- If any anomalies or downtime were detected, our system would trigger the routing logic to redirect traffic to the mock service, ensuring continuous content upload capabilities for the users (rough sketch below).
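An illustrative sketch of health-check-driven routing. The endpoint URLs, service names, and the choose_backend() switch are hypothetical placeholders, not the real internal setup:

```python
import requests

DEPENDENCY_HEALTH_URLS = [
    "http://safety-dep-a/health",
    "http://safety-dep-b/health",
]

def dependencies_healthy(timeout=1.0):
    # A dependency counts as unhealthy on any non-200 response or timeout.
    for url in DEPENDENCY_HEALTH_URLS:
        try:
            if requests.get(url, timeout=timeout).status_code != 200:
                return False
        except requests.RequestException:
            return False
    return True

def choose_backend():
    # Route to the real safety check when dependencies are up, otherwise fall
    # back to the mock service that tags content for later review.
    return "content-safety-service" if dependencies_healthy() else "mock-safety-service"
```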
What is Kitex, and why did you choose it over other RPC frameworks?
- Kitex is an RPC framework developed by Bytedance, optimized for efficient network communication and low overhead serialization/deserialization.
- Kitex is reportedly faster than gRPC, although gRPC has larger community support and broader feature support
- Kitex has built in load balancing to distribute requests effectively across multiple service instances
- Middleware support –> supports integrating middleware to enable functionality like tracing and monitoring
- We chose it due to its efficiency and because it’s tailored to our specific needs at Bytedance.
- Being an in-house tool, Kitex integrates seamlessly with other Bytedance systems and tools. This ensures smoother operations and reduces the time spent on integration challenges.
- We also have direct access to the development team behind Kitex.
- This ensures faster problem resolution, direct feedback loops, and access to internal documentation that might not be available for external tools.
Why did you choose Redis for the distributed lock?
- I chose Redis initially because I wanted to use Redlock, which is built for distributed locking and even uses a custom consensus algorithm, but I later realised it limits each user to one session, so if a user loads two dashboards the caching doesn’t really work anymore.
- Redis provides atomic operations like SETNX, which allows for implementing locking mechanisms.
- Additionally, its in-memory data store ensures high-speed access to lock states.
- It has support for TTLs, which ensure locks are not held indefinitely, improving system reliability.
SAF: Tell me more about your experience
- During my time at the Singapore Armed Forces as a Machine Learning Engineer, I worked on many Machine Learning projects including:
- Anomaly Detection on Cyber Physical Systems
- Optimising and automating Geo-rectification using feature matching
- CV Object Detection
- Weather Prediction using Tabular Data
Most of these projects were started to improve operational processes, for example by reducing the manpower those processes require.
Resume Bullet Point: Anomaly Detection
- Essentially, an anomaly detection project (I can’t say too much about the specifics), but it automated some manual tasks, and our work allowed our sister/parent unit to drop their 24/7 shift work and run a day-shift-only system.
- Built an Isolation Forest + K-means clustering ensemble to find anomalous data points
- One problem is finding a balance between false negatives (FN) and false positives (FP). We don’t want a high FN rate, because if we cannot even detect anomalies there is no point implementing an automated system; but a high FP rate also defeats the purpose, since a human would need to check and verify constantly, at which point the human might as well do everything instead. (We used the F1-score, the harmonic mean of precision and recall, to evaluate our predictions.)
- One way to tackle this is to ensemble models that detect anomalies by different means. We found this worked especially well in our use case: we simply mark a point as positive only if both algorithms deem it anomalous (rough sketch below).
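A minimal sketch of the two-model ensemble using scikit-learn, assuming an unlabelled feature matrix X and a small labelled hold-out set for the F1 check; the hyperparameters and the distance threshold are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score

def ensemble_anomalies(X, n_clusters=5, distance_quantile=0.99):
    # Isolation Forest: points that are easy to isolate are labelled -1.
    iso = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
    iso_flags = iso.fit_predict(X) == -1

    # K-means: points far from every centroid are treated as anomalous.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dist_to_centroid = np.min(km.transform(X), axis=1)
    km_flags = dist_to_centroid > np.quantile(dist_to_centroid, distance_quantile)

    # Mark a point as anomalous only when both detectors agree.
    return iso_flags & km_flags

# Example F1 check against a labelled validation slice (y_val: 1 = anomaly):
# preds = ensemble_anomalies(X_val)
# print(f1_score(y_val, preds))
```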
- Another problem we faced is a shift of the mean / distribution over time. We used a Variational Autoencoder (VAE) to deal with this problem.
BACKGROUND KNOWLEDGE:
* Autoencoders are essentially neural networks trained by unsupervised learning to produce reconstructions that are close to their inputs
* Essentially, an autoencoder seeks to produce outputs identical to its inputs and uses unlabelled data for this task (which is very fitting for an anomaly detection problem)
* An AE has 2 parts: an encoder and a decoder. The encoder receives the data input x and compresses it into a smaller dimension while feeding it forward to the next layer in the encoder; this can be accomplished for h layers, which are referred to as hidden layers
* The final compression of the input occurs at the bottleneck of the AE. The input representation there is referred to as z, the latent representation of x
* The decoder then takes the latent representation z and attempts to reconstruct the original input x by expanding it through the same number of hidden layers with identical corresponding neurons as the encoder; ideally the output x’ will be identical to the input x
* In effect, the AE learns a compressed (lower-dimensional) version of the identity function
* We then use the reconstruction error (the difference between x’ and x) to detect anomalies
* A VAE differs from a standard AE in that its bottleneck at the encoder is a probabilistic distribution rather than a deterministic value
1) Probabilistic Bottleneck:
- Unlike deterministic autoencoders, which just map an input to a point in the latent space, VAEs map inputs to a distribution in the latent space.
- This probabilistic bottleneck allows VAEs to generate a variety of plausible outputs for a given input, making them more robust to changes in input distributions.
2) Regularization in Latent Space:
- The regularization term in the VAE’s loss function encourages the latent space to follow a specific distribution, typically a multivariate Gaussian.
- This regularization ensures that the latent representations are well-distributed and not overly concentrated, making it more resilient to shifts in the input data distribution.
- So we can exploit this feature of a VAE to obtain a probabilistic description of our data.
- In our testing, this proved to help with mean and distribution shifts (minimal VAE sketch below)
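A minimal PyTorch sketch of a VAE used for reconstruction-error anomaly scoring; the layer sizes and the thresholding strategy are illustrative, not the exact model we used:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim, hidden=64, latent=8):
        super().__init__()
        self.enc = nn.Linear(in_dim, hidden)
        self.mu = nn.Linear(hidden, latent)        # mean of q(z|x)
        self.logvar = nn.Linear(hidden, latent)    # log-variance of q(z|x)
        self.dec1 = nn.Linear(latent, hidden)
        self.dec2 = nn.Linear(hidden, in_dim)

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: sample z from the probabilistic bottleneck.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.dec2(F.relu(self.dec1(z)))
        return recon, mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error + KL term that regularises the latent space
    # towards a standard Gaussian.
    recon_err = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + kl

# At inference time, points whose reconstruction error exceeds a chosen
# threshold (e.g. a high quantile of errors on normal data) are flagged.
```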
Resume Bullet Point: Geo-rectification
- Geo-rectification was one big pain point for us; we use it in our operational processes, and we essentially performed it with ArcGis.
- This is time-consuming, degrades image quality and is not very accurate since it involves manual cropping
- So we essentially downloaded QGis (an open-source alternative to ArcGis) and wrote additional features on top of it using OpenCV for faster, automatic geo-rectification.
- Instead of requiring the person to compare the 2 images manually, we use OpenCV’s ORB (Oriented FAST {Features from Accelerated Segment Test} and Rotated BRIEF {Binary Robust Independent Elementary Features}), which detects keypoints by considering pixel brightness around a given area
- A problem we then ran into is that satellite images are blurry and don’t have distinct features, so we need to do pansharpening (panchromatic sharpening)
- Pansharpening has its own issue: high-frequency noise is amplified
- Denoising also has issues: blurring of edges + elimination of smaller features
- Combining both gives the best result (ORB matching sketch below)
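A rough sketch of ORB feature matching between an already geo-referenced reference image and a new capture, with the matched keypoints feeding a homography; the file names are placeholders:

```python
import cv2
import numpy as np

ref = cv2.imread("reference_georeferenced.tif", cv2.IMREAD_GRAYSCALE)
new = cv2.imread("new_capture.tif", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=5000)
kp_ref, des_ref = orb.detectAndCompute(ref, None)
kp_new, des_new = orb.detectAndCompute(new, None)

# BRIEF descriptors are binary, so Hamming distance is the appropriate metric.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_ref, des_new), key=lambda m: m.distance)

# The best matches become tie points; a homography maps the new image
# onto the reference's coordinate frame.
src = np.float32([kp_new[m.trainIdx].pt for m in matches[:200]]).reshape(-1, 1, 2)
dst = np.float32([kp_ref[m.queryIdx].pt for m in matches[:200]]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
warped = cv2.warpPerspective(new, H, (ref.shape[1], ref.shape[0]))
```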
Resume Bullet Point: MySQL DB Design
Initially we used a MySQL DB design, but this became undesirable: we started to accumulate a lot of data, and because the metadata structure is complex and highly variable, we decided to swap to MongoDB instead. There were some small issues there as well, such as migrating our existing data over to MongoDB.
* 1. MongoDB schema: since MySQL has a well-defined schema and Mongo is more free-form, we needed to standardise our schema to ensure that future pulling / access of data would not be problematic
* 2. MySQL was still being accessed in production, so we essentially created a replica and performed the migration from the replica to avoid affecting the performance of the primary DB, but this process took WAY too long (rough migration sketch below)
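An illustrative sketch of a replica-to-MongoDB migration in batches; the connection details, table, and collection names are hypothetical, and real metadata would normally be reshaped into nested documents rather than copied row-for-row:

```python
import mysql.connector
from pymongo import MongoClient

def migrate(batch_size=1000):
    # Read from the MySQL replica so the primary's performance is unaffected.
    src = mysql.connector.connect(host="mysql-replica", user="reader",
                                  password="***", database="ops")
    dst = MongoClient("mongodb://localhost:27017").ops.records

    cur = src.cursor(dictionary=True)       # rows come back as dicts
    cur.execute("SELECT * FROM records")
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        dst.insert_many(rows)               # each row becomes one document
    cur.close()
    src.close()
```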
Resume Bullet Point: Multiple End-to-End ML Pipelines
- I also jumped in and out of other projects:
1) Weather Prediction:
* We had images + tabular data; essentially I focused purely on the tabular data while the other team dealt with the images / CV part of things
* Essentially just another tabular data competition kind of idea:
* CV (cross-validation) —> MSE as the metric
* LightGBM + XGBoost, some simple Bayesian optimisation (Optuna), and we’re done (tuning sketch below)
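A minimal sketch of Optuna-style Bayesian hyperparameter tuning for LightGBM with cross-validated MSE; the search ranges are illustrative and X, y stand in for the tabular features/targets:

```python
import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def tune(X, y, n_trials=50):
    def objective(trial):
        params = {
            "num_leaves": trial.suggest_int("num_leaves", 16, 256),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        }
        model = lgb.LGBMRegressor(**params)
        # 5-fold cross-validation scored with negative MSE (higher is better).
        return cross_val_score(model, X, y, cv=5,
                               scoring="neg_mean_squared_error").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```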
2) Object detection System development (for Ops process)
* Worked on downgrading the ML model to MobileNet to get better FPS, since we were missing some objects: the frame rate was too low to catch them and they would move across the screen before our model could detect them.
* This meant our detection accuracy took a big hit, so we had to build a feedback loop and store videos for human vetting every other day.
* Essentially we annotated some of the images ourselves using an open source but customised CV annotation tool (we worked on development of this annotation tool as well)
* Essentially, this optimised version is more suited to our use case and the objects we are detecting. Specifically, we added classification plus helper tools for the human to identify any object the model missed.
* Our accuracy peaked at ~80%, so we were trying to find ways to push past that.
3) Also in charge of writing the training documents and the training process, where the ML part was based on the object detection problem we had.
* Essentially Python basics / programming basics, and the evaluation was on computer vision instead: the trainees build the same object detection system we built, but with higher-quality videos —> so no MobileNet limitation
4) Also had a lot of research-based projects:
* WebODM —> Drone Imagery & Trying to fit lightweight NN
* NN Quantization
* Executes operations at reduced numerical precision
* More compact model representation —> PyTorch INT8 instead of FP32, etc.
* Knowledge Distillation to preserve accuracy, where our smaller model can generalise from the “soft targets” provided by the teacher model (sketch below)
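A sketch of the two compression ideas above: post-training dynamic quantization in PyTorch and a knowledge-distillation loss. The temperature, weighting, and model objects are placeholder assumptions:

```python
import torch
import torch.nn.functional as F

def quantize_dynamic(model):
    # Converts Linear layers to INT8 weights with dynamically quantized activations.
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: the student matches the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```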
Why did you choose Isolation Forest and K-means clustering for your project?
Isolation Forest and K-means clustering were chosen because they are popular algorithms for anomaly detection and they complement each other in an ensemble, since they detect anomalies in different manners:
Isolation Forest: a tree-based ensemble algorithm
- How it works: It randomly selects features and splits them at random values, with the idea that anomalies require fewer random partitions to be isolated than regular data points.
- Advantage: Efficient with high-dimensional data and can achieve good performance with a smaller number of trees.
K-means Clustering: Unsupervised learning algorithm
- Why use it here: Anomalies typically do not fit well into any of the established clusters. Observations that are distant from the centroids of all clusters can be flagged as anomalies.
- Number of clusters picked using the common elbow method and some intuition: we pick the k value at the elbow of the Within Sum of Squares (WSS) plot, where adding more clusters stops reducing WSS much
Combining K-means with Isolation Forest gives a dual-layer check. If both algorithms flag a data point as an outlier, we can be more confident in its anomalous nature (elbow-method sketch below).
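A small sketch of the elbow method for picking k, assuming a feature matrix X; the range of k values is illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=15):
    wss = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wss.append(km.inertia_)              # within-cluster sum of squares
    plt.plot(range(1, k_max + 1), wss, marker="o")
    plt.xlabel("k (number of clusters)")
    plt.ylabel("Within sum of squares (WSS)")
    plt.show()
    # Pick the k at the "elbow", where adding clusters stops reducing WSS much.
```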
What are the hyperparameters for Isolation Forest and K-means that require tuning? How did you approach the tuning process?
Isolation Forest:
n_estimators: Number of trees in the forest.
max_samples: The number of samples to draw while building individual trees.
contamination: Proportion of outliers in the dataset - this affects the threshold for anomaly scoring.
K-means Clustering:
Number of Clusters: The number of clusters to form. Deciding on the right number is crucial as too many clusters might lead to normal clusters with few members being treated as anomalies.
Number of Runs: Number of times the k-means algorithm will be run with different centroid seeds.
Tuning Approach:
Grid Search: Systematically worked through multiple combinations of hyperparameter values, training a model for each combination.
Validation: Used a validation set or cross-validation to determine the performance of each hyperparameter combination.
Evaluation Metric: In anomaly detection, precision, recall, and the F1-score are more informative than accuracy due to the imbalanced nature of the data (grid-search sketch below).
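A sketch of the manual grid search over Isolation Forest settings, scored with F1 against a labelled validation split (X_val, y_val with 1 = anomaly are assumed to exist); the grid values are illustrative, and in practice the model would be fit on training data and scored on held-out labels:

```python
from itertools import product
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

def grid_search_iforest(X_val, y_val):
    grid = {
        "n_estimators": [100, 200, 400],
        "max_samples": [256, 512, "auto"],
        "contamination": [0.01, 0.05, 0.1],
    }
    best_params, best_f1 = None, -1.0
    for n_est, max_s, cont in product(*grid.values()):
        model = IsolationForest(n_estimators=n_est, max_samples=max_s,
                                contamination=cont, random_state=0)
        preds = model.fit_predict(X_val) == -1      # True where flagged anomalous
        score = f1_score(y_val, preds)
        if score > best_f1:
            best_params, best_f1 = {"n_estimators": n_est,
                                    "max_samples": max_s,
                                    "contamination": cont}, score
    return best_params, best_f1
```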
How did you validate the performance of your anomaly detection model, especially given the challenges of imbalanced anomaly data?
- Cross-validation or hold-out validation
- Used precision, recall and F1-score, since accuracy isn’t the best metric due to class imbalance.