Resume Flashcards

Q

Resume Bullet Point: Caching and Distributed Lock

A

The idea here is that multiple services were fetching the same SLI on the dashboard, since different services can share a common SLI. So, to optimise our dashboard loading time, we decided to implement a caching system together with a distributed lock system.

I initially used Redis to build the distributed locking system: essentially spinning up another Redis DB and using the SETNX and DELETE operations, which are atomic, to keep a consistent record of who holds each lock. We also set a time-to-live (TTL) when acquiring the lock (via SET with the NX and EX options) so the system is more fault tolerant if a lock holder crashes. Additionally, I used fencing tokens to make sure the lock is safe and that the right machine is releasing it; this also means that if a delayed request reaches our system, we know it is an old request and can simply reject it.
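To make this concrete, here is a minimal sketch of the acquire/release flow using redis-py. Key names and the counter-based fencing token are illustrative assumptions, not the exact production setup:

```python
# Minimal sketch of the lock flow described above, using redis-py.
# Key names and the fencing-token counter are illustrative assumptions.
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

def acquire_lock(lock_key: str, ttl_seconds: int = 10):
    """Try to acquire the lock; returns (owner_id, fencing_token) or None."""
    owner_id = str(uuid.uuid4())
    # SET key value NX EX ttl -> atomic "set if not exists" with a TTL,
    # so a crashed holder cannot block everyone forever.
    if r.set(lock_key, owner_id, nx=True, ex=ttl_seconds):
        # Monotonically increasing fencing token from a Redis counter.
        token = r.incr(lock_key + ":fence")
        return owner_id, token
    return None

def release_lock(lock_key: str, owner_id: str) -> bool:
    """Release only if we still own the lock (check-and-delete atomically via Lua)."""
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    else
        return 0
    end
    """
    return bool(r.eval(script, 1, lock_key, owner_id))

# Downstream writes would carry the fencing token and be rejected
# if a higher token has already been seen (i.e. a stale / delayed holder).
```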

I initially ended my project here, but when I was blocked on another project later on, I came back to this project to try to make it better.

I realised that to ensure stronger consistency, we need a consensus algorithm, so that any change is agreed upon by a majority of the nodes before it is committed, preventing split-brain scenarios where different parts of the system have a different idea of who is holding a lock.

I tried using the Raft consensus algorithm, which runs on an odd number of nodes with one of them acting as leader. Every time a read or write is proposed, the leader proposes it to all the other nodes, and as long as a majority of the follower nodes accept it, the leader goes ahead with the read/write and then informs all the follower nodes of the result.
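To make the majority rule concrete, here is a toy sketch of just the quorum-counting idea (not a real Raft implementation); the follower stubs are hypothetical:

```python
# Toy illustration of the quorum rule only (not production Raft):
# the leader commits an entry only after a majority of the cluster accepts it.

def replicate(entry, followers) -> bool:
    """followers: list of callables that return True if they accept the entry."""
    cluster_size = len(followers) + 1          # followers plus the leader
    acks = 1                                   # the leader counts as one ack
    for follower in followers:
        if follower(entry):
            acks += 1
    return acks > cluster_size // 2            # strict majority -> safe to commit

# Example with 4 followers (5-node cluster): 3 acks are enough.
followers = [lambda e: True, lambda e: True, lambda e: False, lambda e: False]
print(replicate({"lock": "dashboard", "holder": "svc-a"}, followers))  # True
```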

Q

TikTok: Tell me more about your experience

A
  • I was a Backend Engineer Intern @ Bytedance, where I worked mostly on building and optimising tools for SREs and on ensuring the stability and quality of TikTok’s services.
  • I worked on optimising an in-house monitoring tool by developing a caching and distributed lock system using Redis to speed up the loading time of the dashboard for SREs.
  • In addition, I worked on pre-calculating SLIs (Service Level Indicators) and storing them in a database, repeating this process daily through a cronjob so that we can fetch the necessary data without calling the expensive service many times every time the dashboard loads.
  • I also worked briefly on another, smaller-scale project towards the end of the internship to ensure the high availability of TikTok’s content creation capabilities.
  • Essentially, that project was about routing traffic to a mock service when some of the strong dependencies for our content check are down and we cannot determine whether a particular piece of content is “safe” for upload.
Q

Resume Bullet Point: RPC Handlers using Kitex and Thrift

A
  • This is another optimisation for the in-house monitoring tool, but in a different area; it deals with a process that is larger in scope and more general than the caching and distributed lock system.
  • Essentially, for the SLIs of the different service chains and the services under those chains, we want to pre-calculate all the necessary data for every time granularity the user can select on the dashboard and store it in MongoDB.
  • This is because multiple services may share the same SLI, which lets us save on duplicated work.
  • Essentially, I wrote RPC handlers to call the relevant services to retrieve and aggregate the data, then store it in MongoDB.
  • To ensure our data is always relevant and up to date, I wrote a CronJob to repeat this process daily and update the data for the past week.
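A minimal sketch of that daily pre-calculation job, assuming a hypothetical fetch_sli() RPC wrapper and illustrative collection and field names (not the real internal schema):

```python
# Minimal sketch of the daily pre-calculation job (run by cron), assuming a
# hypothetical fetch_sli() RPC wrapper and illustrative collection/field names.
import datetime
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
collection = client["monitoring"]["sli_precalc"]

GRANULARITIES = ["1h", "6h", "1d"]           # granularities selectable on the dashboard

def fetch_sli(service: str, day: datetime.date, granularity: str) -> dict:
    """Placeholder for the RPC call that retrieves and aggregates raw SLI data."""
    raise NotImplementedError

def run_daily(services: list) -> None:
    today = datetime.date.today()
    ops = []
    for offset in range(7):                  # refresh the past week's data
        day = today - datetime.timedelta(days=offset)
        for service in services:
            for g in GRANULARITIES:
                doc = fetch_sli(service, day, g)
                ops.append(UpdateOne(
                    {"service": service, "date": day.isoformat(), "granularity": g},
                    {"$set": doc},
                    upsert=True,             # idempotent: re-running the job is safe
                ))
    if ops:
        collection.bulk_write(ops)
```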
Q

Resume Bullet Point: Mock Service

A
  • Essentially, TikTok has safety checks for user content creation (videos, comments, etc.) to catch things that violate the community guidelines (e.g. violence).
  • However, when the strong dependencies that the service relies on go down, the system cannot check whether the content is “safe”. The check then returns an empty struct, which prevents the content from being uploaded because it is deemed unsafe, so some users with perfectly “safe” content are denied upload.
  • In order to ensure high availability during this short period of downtime, my team created a mock service: all traffic is routed to it instead, and it returns a dummy struct response with a tag, allowing all content to be uploaded successfully during this window.
  • We then work with the content safety team to ensure all content with this tag is scanned frequently to confirm it does not violate the community guidelines and is indeed “safe” to publish; if it isn’t, it is quickly taken down.
  • Essentially, I worked on the routing portion of the mock service, which checks whether the strong dependencies are down and immediately switches traffic to the mock service.
Q

How did you detect when the strong dependencies were down, triggering the need to redirect to the mock service?

A
  • We utilized health check endpoints and monitoring APIs, like Grafana’s API, to constantly assess the health of our strong dependencies.
  • If any anomalies or downtime were detected, our system would trigger the routing logic to redirect traffic to the mock service, ensuring continuous content upload capability for users (see the sketch below).
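A hedged sketch of that detection side: poll the dependency health endpoints and flip a flag that the routing layer reads. The URLs, threshold and callback are illustrative, not the actual internal mechanism:

```python
# Hedged sketch: poll dependency health endpoints and flip a routing flag.
import time
import requests

DEPENDENCIES = ["http://safety-check.internal/health"]   # hypothetical endpoints
FAILURE_THRESHOLD = 3                                     # consecutive failures before failover

def dependencies_healthy() -> bool:
    for url in DEPENDENCIES:
        try:
            resp = requests.get(url, timeout=2)
            if resp.status_code != 200:
                return False
        except requests.RequestException:
            return False
    return True

def monitor(set_use_mock) -> None:
    """set_use_mock: callback that tells the router to switch traffic."""
    failures = 0
    while True:
        if dependencies_healthy():
            failures = 0
            set_use_mock(False)
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                set_use_mock(True)            # route traffic to the mock service
        time.sleep(5)
```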
Q

What is Kitex, and why did you choose it over other RPC frameworks?

A
  • Kitex is an RPC framework developed by Bytedance, optimized for efficient network communication and low-overhead serialization/deserialization.
  • Supposedly Kitex is faster than gRPC, although gRPC has larger community support and a broader feature set.
  • Kitex has built-in load balancing to distribute requests effectively across multiple service instances.
  • Middleware support: it supports middleware integration to enable functionality like tracing and monitoring.
  • We chose it for its efficiency and because it is tailored to our specific needs at Bytedance.
  • Being an in-house tool, Kitex integrates seamlessly with other Bytedance systems and tools, which ensures smoother operations and reduces time spent on integration challenges.
  • We also have direct access to the development team behind Kitex, which means faster problem resolution, direct feedback loops, and internal documentation that might not be available for external tools.
Q

Why did you choose Redis for the distributed lock?

A
  • I initially chose Redis because I wanted to use Redlock, which is built for distributed locking and even uses a custom consensus algorithm, but I later realised it limits each user to one session, so if a user loads two dashboards the caching no longer really works.
  • Redis provides atomic operations like SETNX, which allow locking mechanisms to be implemented.
  • Additionally, its in-memory data store ensures high-speed access to lock state.
  • It supports TTLs, which ensure locks are not held indefinitely, improving system reliability.
Q

SAF: Tell me more about your experience

A
  • During my time at the Singapore Armed Forces as a Machine Learning Engineer, I worked on many Machine Learning projects including:
  1. Anomaly Detection on Cyber Physical Systems
  2. Optimising and automating Geo-rectification using feature matching
  3. CV Object Detection
  4. Weather Prediction using Tabular Data

Most of these projects were started to improve operational processes, for example by reducing the manpower those processes require.

Q

Resume Bullet Point: Anomaly Detection

A
  • I did an anomaly detection project (I can’t really say too much about the details), but essentially we automated some manual tasks, and our project allowed our sister/parent unit to replace their 24/7 shift work with a day-shift-only system.
  • Built an Isolation Forest + K-Means clustering ensemble to find anomalous data points.
  • One problem was finding a balance between false negatives (FN) and false positives (FP). We don’t want a high FN rate, because if the system cannot even detect anomalies there is no point implementing it; but a high FP rate also defeats the purpose of automated detection, since a human would need to check and verify constantly, at which point they might as well do everything themselves. (We used the F1-score, the harmonic mean of precision and recall, to evaluate our predictions.)
  • One way to tackle this is to ensemble models that detect anomalies in different ways. We found this worked especially well in our use case: we only mark a point as positive if both algorithms deem it anomalous (see the sketch below).
  • Another problem we faced was shifts in the mean / distribution of the data. We used a Variational Autoencoder (VAE) to deal with this.
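A minimal sketch of the "flag only if both algorithms agree" ensemble using scikit-learn; the thresholds and parameters are illustrative:

```python
# Minimal sketch of the dual-detector ensemble: a point is flagged only if
# both Isolation Forest and the K-Means distance rule agree.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans

def ensemble_anomalies(X: np.ndarray, k: int = 5, contamination: float = 0.01) -> np.ndarray:
    # Isolation Forest: predict() returns -1 for anomalous points.
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X)
    iso_flags = iso.predict(X) == -1

    # K-Means: points far from their nearest centroid are suspicious.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    km_flags = dist > np.quantile(dist, 1 - contamination)

    # Mark positive only when both detectors agree.
    return iso_flags & km_flags
```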

BACKGROUND KNOWLEDGE:

        * AutoEncoders are essentially a NN that is trained by unsupervised learning to produce reconstructions that are close to its inputs
        * Essentially, auto encoder is simply a process that seeks to produce outputs identical to its input and uses unlabelled data for this task (which is essentially very fitting for an Anomaly Detection Problem)
        * AE has 2 parts: Encoder and Decoder, Encoder essentially receives data input x and compresses it into a smaller dimension while feeding it forward to the next layer in the encoder. This can be accomplished for h layers which are referred to as hidden layers
        * Final Compression of input occurs at the bottleneck of the AE. The input representation is now referred to as z, which is the latent representation of x
        * Now the decoder takes the input’s latent representation z and attempts to reconstruct the original input x by expanding it through the same number of hidden layers, with corresponding neurons matching the encoder; ideally the output x’ will be identical to the input x.
        * The AE thereby learns a compressed (lower-dimensional) version of the identity function.
        * We then use the reconstruction error (the difference between x’ and x) to detect anomalies.
        * VAEs differ from a standard AE in that the bottleneck at the encoder is a probabilistic distribution rather than a deterministic value.

1) Probabilistic Bottleneck:
- Unlike deterministic autoencoders, which just map an input to a point in the latent space, VAEs map inputs to a distribution in the latent space.
- This probabilistic bottleneck allows VAEs to generate a variety of plausible outputs for a given input, making them more robust to changes in input distributions.

2) Regularization in Latent Space:
- The regularization term in the VAE’s loss function encourages the latent space to follow a specific distribution, typically a multivariate Gaussian.
- This regularization ensures that the latent representations are well-distributed and not overly concentrated, making it more resilient to shifts in the input data distribution.

  • So we can exploit this feature of a VAE to obtain a probabilistic description of our data.
  • Through some testing, this proved to help us deal with mean and distribution shifts.
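A minimal PyTorch sketch of the VAE idea described above: a probabilistic bottleneck (mean and log-variance) with the reparameterisation trick, and the reconstruction error used as an anomaly score. Layer sizes are illustrative:

```python
# Minimal VAE sketch: probabilistic bottleneck + reconstruction-error scoring.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)        # mean of q(z|x)
        self.logvar = nn.Linear(64, latent_dim)    # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        return self.decoder(z), mu, logvar

def loss_fn(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="mean")
    # KL term regularises the latent space towards a standard Gaussian.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def anomaly_score(model: VAE, x: torch.Tensor) -> torch.Tensor:
    x_hat, _, _ = model(x)
    return ((x - x_hat) ** 2).mean(dim=1)   # high reconstruction error -> likely anomaly
```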
Q

Resume Bullet Point: Geo-rectification

A
  • Geo-rectification was one big pain point for us; we use it in our operational processes, and we were performing it with ArcGIS.
  • This is time-consuming, degrades image quality, and is not very accurate since it involves manual cropping.
  • So we downloaded QGIS (an open-source alternative to ArcGIS) and wrote additional features on top of it using OpenCV for faster, automatic geo-rectification.
  • Instead of requiring a person to compare the two images, we use OpenCV’s ORB (Oriented FAST {Features from Accelerated Segment Test} and Rotated BRIEF {Binary Robust Independent Elementary Features}), which detects keypoints by considering pixel brightness around a given area (see the sketch below).
  • A problem we then ran into is that satellite images are blurry and don’t have distinct features, so we needed to do pansharpening (panchromatic sharpening).
  • Pansharpening has its own issue: high-frequency noise is amplified.

Denoising has issues too: it blurs edges and eliminates smaller features.

Combining both gave the best result.
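A hedged sketch of the ORB matching step with OpenCV: detect keypoints on the source image and a georeferenced reference image, match them, and estimate a homography that can be used to warp (rectify) the source. File paths and parameters are illustrative:

```python
# ORB keypoint matching + homography estimation for automatic geo-rectification.
import cv2
import numpy as np

def estimate_homography(src_path: str, ref_path: str):
    src = cv2.imread(src_path, cv2.IMREAD_GRAYSCALE)
    ref = cv2.imread(ref_path, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=5000)
    kp1, des1 = orb.detectAndCompute(src, None)
    kp2, des2 = orb.detectAndCompute(ref, None)

    # Hamming distance for ORB's binary descriptors; keep the best matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]

    src_pts = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    ref_pts = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC discards bad matches; H maps source pixels onto the reference frame.
    H, _ = cv2.findHomography(src_pts, ref_pts, cv2.RANSAC, 5.0)
    return H
```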

Q

Resume Bullet Point: MySQL DB Design

A

Initially we used a MySQL DB design, but this became undesirable: we started getting a lot of data, and because the metadata structure is complex and highly variable, we decided to switch to MongoDB instead. There were some smaller issues there as well, such as migrating our existing data to MongoDB:

    * 1. MongoDB schema: since MySQL enforces a well-defined schema and Mongo is more free-form, we needed to standardise our schema to ensure that future pulling / access of the data would not be problematic.
    * 2. MySQL was still being accessed in production, so we created a replica and performed the migration from the replica to avoid affecting the performance of the primary DB, but this process took far too long.
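A hedged sketch of that replica-to-MongoDB migration in batches; the connection details, table and field names are illustrative, not the real schema:

```python
# Batch migration from a MySQL read replica into MongoDB (illustrative names).
import mysql.connector
from pymongo import MongoClient, UpdateOne

mysql_conn = mysql.connector.connect(host="replica-host", user="reader",
                                     password="change-me", database="metadata")
mongo = MongoClient("mongodb://localhost:27017")["metadata"]["records"]

def migrate(batch_size: int = 1000) -> None:
    cur = mysql_conn.cursor(dictionary=True)
    cur.execute("SELECT id, name, meta_json, updated_at FROM records")
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        # Upserts keep the job idempotent if it has to be re-run.
        mongo.bulk_write([
            UpdateOne({"_id": row["id"]}, {"$set": row}, upsert=True)
            for row in rows
        ])
    cur.close()
```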
Q

Resume Bullet Point: Multiple End-to-End ML Pipelines

A
  • I also jumped in and out of other projects:
    1) Weather Prediction:
    * We had images + tabular data; I focused purely on the tabular data while another team dealt with the images / CV part of things.
    * Essentially just another tabular-data-competition kind of problem.
    * The CV metric was MSE.
    * LightGBM + XGBoost with some simple Bayesian optimisation (Optuna) and we were done (a quick sketch follows).
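A quick sketch of that tabular modelling loop with LightGBM tuned via Optuna; the features, search ranges and CV setup are illustrative stand-ins:

```python
# LightGBM tuned with Optuna, scored by cross-validated MSE (illustrative data).
import lightgbm as lgb
import numpy as np
import optuna
from sklearn.model_selection import cross_val_score

X, y = np.random.rand(500, 10), np.random.rand(500)   # stand-in for the tabular data

def objective(trial: optuna.Trial) -> float:
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = lgb.LGBMRegressor(**params)
    # cross_val_score returns negative MSE, so negate it to minimise MSE.
    score = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    return -score

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```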

2) Object detection system development (for an ops process)
* Worked on downgrading the ML model to MobileNet to get a better FPS, since we were missing some objects: the frame rate was too low and objects moved across the screen before our model could catch them.
* This meant our detection accuracy took a big hit, so we had to build a feedback loop and store videos for human vetting every other day.
* We annotated some of the images ourselves using an open-source but customised CV annotation tool (we worked on the development of this annotation tool as well).
* This customised version is better suited to our use case and the objects we are detecting; specifically, we added classification plus helper tools for the human to identify objects the model missed.
* Our accuracy peaked at around ~80%, so we were trying to find ways to push past that.

3) I was also in charge of writing the training documents and process, where the ML part is based on the object detection problem we had.
* It covers Python basics / programming basics, and the evaluation was on computer vision: essentially building the same object detection system we built, but with higher-quality videos, so there is no MobileNet limitation.

4) Also had a lot of research-based projects:
* WebODM: drone imagery, and trying to fit a lightweight NN.
* NN quantization:
* Executes operations with reduced numerical precision.
* More compact model representation, e.g. PyTorch INT8 instead of FP32.
* Knowledge distillation to preserve accuracy, where our smaller model can generalise from the “soft targets” provided by the teacher model (see the sketch below).
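A hedged PyTorch sketch of the two compression ideas above: post-training dynamic quantization and a knowledge-distillation loss. The model architecture, temperature and mixing weight are illustrative:

```python
# Dynamic quantization (INT8 Linear layers) + a distillation loss sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic quantization: Linear layers run with INT8 weights instead of FP32.
quantized_student = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with KL to the teacher's softened outputs."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1 - alpha) * soft
```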

Q

Why did you choose Isolation Forest and K-means clustering for your project

A

Isolation Forest and K-means clustering were chosen because they are popular anomaly detection approaches and they complement each other in an ensemble, since they identify anomalies in different ways:

Isolation Forest: This is an ensemble-based algorithm (Tree)

  • How it works: It randomly selects features and splits them at random values, with the idea that anomalies require fewer random partitions to be isolated than regular data points.
  • Advantage: Efficient with high-dimensional data and can achieve good performance with a smaller number of trees.

K-means Clustering: Unsupervised learning algorithm
- Why use it here: Anomalies typically do not fit well into any of the established clusters. Observations that are distant from the centroids of all clusters can be flagged as anomalies.

  • The number of clusters was picked using the common elbow method plus some intuition: we plot the Within-cluster Sum of Squares (WSS) against k and pick the k where the decrease starts to level off.

Combining K-means with Isolation Forest gives a dual-layer check. If both algorithms flag a data point as an outlier, we can be more confident in its anomalous nature.

Q

What are the hyperparameters for Isolation Forest and K-means that require tuning? How did you approach the tuning process?

A

Isolation Forest:

n_estimators: Number of trees in the forest.
max_samples: The number of samples to draw while building individual trees.
contamination: Proportion of outliers in the dataset - this affects the threshold for anomaly scoring.

K-means Clustering:

Number of Clusters: The number of clusters to form. Deciding on the right number is crucial as too many clusters might lead to normal clusters with few members being treated as anomalies.
Number of Runs: Number of times the k-means algorithm will be run with different centroid seeds.

Tuning Approach:

Grid Search: Systematically worked through multiple combinations of hyperparameter values, training a model for each combination.

Validation: Used a validation set or cross-validation to determine the performance of each hyperparameter combination.

Evaluation Metric: In anomaly detection, precision, recall, and the F1-score are more informative than accuracy due to the imbalanced nature of data.
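A minimal sketch of this grid-search approach for the Isolation Forest side, assuming a labelled validation set where anomalies are marked 1 (the data here is a random stand-in):

```python
# Grid search over Isolation Forest hyperparameters, scored with F1 on a
# labelled validation set (1 = true anomaly).
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid

X_train, X_val = np.random.rand(1000, 8), np.random.rand(200, 8)   # stand-in data
y_val = np.random.randint(0, 2, size=200)                          # 1 = anomaly

grid = ParameterGrid({
    "n_estimators": [100, 200, 400],
    "max_samples": [0.5, 0.8, "auto"],
    "contamination": [0.01, 0.05, 0.1],
})

best_params, best_f1 = None, -1.0
for params in grid:
    model = IsolationForest(random_state=0, **params).fit(X_train)
    preds = (model.predict(X_val) == -1).astype(int)   # -1 -> anomaly -> 1
    score = f1_score(y_val, preds)
    if score > best_f1:
        best_params, best_f1 = params, score
print(best_params, best_f1)
```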

Q

How did you validate the performance of your anomaly detection model, especially given the challenges of class imbalance in anomaly data?

A
  1. CV or Holdout Validation
  2. Use Precision, Recall and F1-Score since Accuracy isn’t the best metric due to class imbalance.
Q

Why didn’t you simply use VAE for Anomaly Detection

A
  • Ensemble Approach: Combining multiple models like the Isolation Forest and K-Means allowed us to capture different aspects of the data’s structure.
  • By feeding the processed data from the VAE into these models, we aimed to benefit from both the VAE’s feature extraction capabilities and the diverse anomaly detection mechanisms of the ensemble models.
  • Using Isolation Forest + K-Means on top of the VAE encoder simply yielded better results, and it actually takes about the same time as using the VAE alone.
Q

Why did you choose a VAE to handle the shift in mean/distribution? Were there other methods considered, and if so, why did you choose the VAE over them?

A
  • We actually didn’t have a good alternative for dealing with this, so if the VAE hadn’t worked out, we would have been at a loss for ideas.
  • One sub-par method we considered was to enlarge our ensemble of models and only mark points as anomalous if all of the models deem a point anomalous (a voting scheme).
Q

What evaluation metric did you use for the object detection system

A

We used

==> Intersection over Union (IoU): measures the overlap between the predicted bounding box and the ground-truth bounding box.

==> Although we did consider Mean Average Precision (mAP): a commonly used metric that considers both precision and recall across different thresholds.

  • We decided to use IoU: even though mAP is well suited to multi-class detection, the location / spatial accuracy of the bounding boxes is what matters most in our use case (a small IoU helper is sketched below).
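A small IoU helper for axis-aligned boxes in (x1, y1, x2, y2) format, matching the definition above:

```python
# IoU for axis-aligned boxes in (x1, y1, x2, y2) format.
def iou(box_a, box_b) -> float:
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```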
Q

DSTA: Tell me more about your experience

A
  • At DSTA, I was involved in creating an end-to-end data pipeline from scratch, primarily focused on automating the cleaning and processing of data.
  • One of our significant accomplishments was achieving a 95% accuracy rate in classifying open-ended responses.
  • Our project facilitated better policy decision-making for National Service.
  • Together with 3 other analysts, we managed to accelerate the project’s completion timeframe from an expected five years down to just two.
Q

Resume Bullet Point: Data Pipeline

A
  • End-to-end data pipeline from scratch, automating the cleaning and processing of data with 95% accuracy in classifying open-ended responses.
    • Explain NS (National Service).
    • NSmen fill out surveys covering their occupation, level of seniority, job preferences, etc.
    • Essentially we have access to this data (through the DB team; we don’t query the DB ourselves for security reasons). We then clean the responses using regex and spaCy, and build a rule-based matching system on top (Levenshtein distance).
    • Then created a Qlik Sense dashboard based on this data.
Q

Can you explain more about your rule-based matching system?

A

In our rule-based matching system, we employed a combination of regex and Levenshtein distance to group similar responses.

Regex (Regular Expressions): At the foundational level, we utilized regular expressions to clean and preprocess the responses. This involved tasks like stripping unwanted characters, converting text to a standard format, and removing common but unnecessary words or phrases.

Levenshtein Distance: Also known as the ‘edit distance’, this metric helped us to quantify how many edits (like insertions, deletions, or substitutions) it would take to transform one response into another. It’s a useful measure when dealing with typos or slight variations in phrasing.

The combination of these techniques allowed us to robustly group similar responses.
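A hedged sketch of that regex-normalise-then-Levenshtein grouping; the threshold and canonical list are illustrative:

```python
# Normalise with regex, then match responses to canonical labels by edit distance.
import re

def normalise(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)        # strip unwanted characters
    return re.sub(r"\s+", " ", text).strip()

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match(response, canonical, max_dist=2):
    cleaned = normalise(response)
    best = min(canonical, key=lambda c: levenshtein(cleaned, normalise(c)))
    return best if levenshtein(cleaned, normalise(best)) <= max_dist else None

print(match("Sofware Enginer", ["Software Engineer", "Data Analyst"]))  # Software Engineer
```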

Q

EY: Tell me more about this competition

A
  • The EY BWWDC is an annual competition hosted by EY that aims to bring data enthusiasts around the world together to solve real-world problems with data-driven solutions.
  • In the 2022 iteration, which I participated in, the theme was biodiversity conservation. I took part in the highest level of the challenge, which was to build a computational model to predict the count of frogs at specific locations.
  • Participants were also allowed to use any dataset available on the Microsoft Planetary Computer Data Catalog.
  • This was a pretty fun competition with a lot of freedom given to participants; I thoroughly enjoyed being able to explore what I wanted and approach the problem however I wanted.
  • My Approach to this competition:

1) Understanding Problem Statement
- Learn more about the problem + Read up on research papers on this particular problem
- Learnt about what kind of habitats frogs like to be in, what are some factors that could affect where they choose to live etc. (Temperature, Humidity, Predators)
- Read up on some articles / papers to understand what has been done as of now

2) Learn more about Evaluation Metric
- In this case —> F1-Score
- However, I felt it was not a good metric in the end: with highly imbalanced data, the F1-score may not give a clear picture of the model’s performance, and since we are dealing with frog counts, the F1-score doesn’t seem appropriate anyway because it is not meant for regression-type problems where you’re predicting a continuous variable.
- Instead, I decided to use WMAPE; since this metric weights the magnitude of the error by the actual frog counts, it helps ensure our model accounts for the skew in the data.

======== SKIPPABLE START ========
3) Dealing with Imbalanced Data
- I tried some of the following:
- Resampling (Over / Undersampling)
- Generating Synthetic Data
- Ensembling and Stacking
- Cost-sensitive Learning –> Higher Penalties for mispredicting frog counts
- Realised it’s actually about how you frame your approach (i.e. how you select areas: if the selected areas are small you get small frog counts but more data points; if they are large, you get more frogs per area but fewer data points)
- Theoretically if the imbalanced distribution in the data reflects the true distribution in the real world, then it’s not necessarily a problem that needs to be corrected
- So ultimately, I did not really deal with this “problem”
======== SKIPPABLE END ========

4) Modelling and Experimenting
- Experimented with GBDTs + NNs —> Chose Model based on CV
- GroupKFold, grouped by location, since locations are also unevenly represented: there is far more data for certain locations than for others
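A small sketch of that location-grouped CV, so all samples from one location stay in the same fold; the column names are illustrative:

```python
# Location-grouped cross-validation: no location leaks across folds.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(100, 5)
y = np.random.rand(100)
groups = np.random.randint(0, 10, size=100)    # location id per sample

for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # No location appears in both the training and validation split.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```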

5) Considerations
- Risk of overfitting due to the skewness in the data, so I had to be more careful. Other than looking at loss curves, I also looked at heatmaps of the prediction results overlaid on the world map to see if there were any issues (large variances in predictions)

6) Interesting Ideas / Techniques

  1. Pseudo-labelling-style technique to increase the data
    - The value we were trying to predict was the frog count aggregated over 20+ years
    - Because it was averaged in the end, I decided to split the data up into individual years, effectively multiplying my dataset by 20+
  2. Generating bounding box sizes for the training data
    - For this competition you are given the coordinates of frog counts over a large area, and you’re supposed to generate training data yourself
    - So I decided to split the area up into equal bounding boxes, get the total frog count of each box, and use that as my training data (see the sketch below)
    - This fits really well with the additional data on the Microsoft Planetary Computer Data Catalog, since almost all of its datasets provide coordinates as well
    - By tuning the bounding box size, we can tune the amount of data we extract accordingly
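A hedged sketch of that binning idea: bucket frog occurrence coordinates into a regular lat/lon grid and aggregate counts per cell, so the cell size controls how many training rows you get. Column names are illustrative:

```python
# Bin frog observations into a lat/lon grid and aggregate counts per cell.
import numpy as np
import pandas as pd

def grid_counts(df: pd.DataFrame, box_deg: float = 0.5) -> pd.DataFrame:
    """df has 'lat', 'lon' and 'frog_count' columns; box_deg is the cell size."""
    out = df.copy()
    out["lat_bin"] = np.floor(out["lat"] / box_deg) * box_deg
    out["lon_bin"] = np.floor(out["lon"] / box_deg) * box_deg
    return out.groupby(["lat_bin", "lon_bin"], as_index=False)["frog_count"].sum()

# Smaller box_deg -> more rows with smaller counts; larger -> fewer rows, bigger counts.
```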
Q

Why did you use WMAPE over F1-Score

A

1) F1-Score:
- F1-Score is the harmonic mean of precision and recall. It is normally used in binary classification tasks
- However, since the problem is regression-like, I didn’t really understand why F1-Score was used as the LB evaluation metric
- And also, In cases of highly imbalanced data, F1-Score might not provide a clear picture of the model’s performance. It also doesn’t account for the true negatives (a large number in imbalanced datasets).

2) WMAPE (Weighted Mean Absolute Percentage Error):
- WMAPE is the mean absolute percentage error, weighted by the volume of data points. In other words, errors in predicting larger values (or counts) have a proportionally larger impact on the metric than errors in predicting smaller values.
- WMAPE seems a more appropriate choice. It would help ensure that errors in high-density areas are penalized more, leading the model to focus on getting those predictions more accurate.
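WMAPE as described above, written out: absolute errors divided by the total of the actual values, so mistakes on large counts dominate the metric.

```python
# WMAPE: sum of absolute errors weighted by the total of the actual values.
import numpy as np

def wmape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

print(wmape(np.array([100, 10, 1]), np.array([90, 12, 3])))  # (10+2+2)/111 ≈ 0.126
```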

Q

Expert.ai: Tell me more about this competition

A

Description
- Essentially a hackathon to build anything you want with their expert.ai API
- We used their emotion-label API to build PagePal, an assistive tool that labels emotional traits in texts and storybooks to help children with ASD better recognise emotions through storytelling

Solution

  • This was kind of my first time working on REST / backend.
  • I worked on the backend, specifically the REST API part and the expert.ai API part.
  • For the REST API part, we used Firebase + FastAPI.
  • Essentially CRUD operations to create/read/update the selection of books we have on our website.
  • We also allow users to update their own texts, and we then run expert.ai’s emotional traits analysis on them.
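A hedged sketch of those CRUD-style endpoints with FastAPI; an in-memory dict stands in for the Firebase collection and the expert.ai call is a stub:

```python
# Minimal FastAPI CRUD sketch (in-memory store stands in for Firebase).
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
books = {}                           # stand-in for the Firebase collection

class Book(BaseModel):
    title: str
    text: str

def analyse_emotions(text: str) -> list:
    """Placeholder for the expert.ai emotional-traits API call."""
    return []

@app.post("/books/{book_id}")
def create_book(book_id: int, book: Book):
    books[book_id] = {**book.dict(), "emotions": analyse_emotions(book.text)}
    return books[book_id]

@app.get("/books/{book_id}")
def read_book(book_id: int):
    if book_id not in books:
        raise HTTPException(status_code=404, detail="Book not found")
    return books[book_id]

@app.put("/books/{book_id}")
def update_book(book_id: int, book: Book):
    books[book_id] = {**book.dict(), "emotions": analyse_emotions(book.text)}
    return books[book_id]
```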
Q

Why did you choose FastAPI and Firebase for the backend? Were there other technologies you considered

A
  • We chose FastAPI because of its fast performance, simplicity, and its built-in support for asynchronous tasks, which we believed would be crucial for handling potential I/O-bound tasks, especially when interacting with the expert-ai API.
  • We also have more familiarity with FastAPI since we’ve worked on it before
  • Firebase, on the other hand, offered a scalable NoSQL cloud database, authentication services, and storage solutions that are easy to set up and manage.
  • Its serverless architecture ensured we could focus on building features without the overhead of managing server infrastructure.
  • And so while we were considering our tech stack, we did look into alternatives like Flask for the API and MongoDB Atlas for the database.
  • However, given the time constraints of the hackathon and the out-of-the-box solutions provided by FastAPI and Firebase, we felt that they were the best fit for our needs.
  • And also we had teammates who were more familiar with these technologies in general.
Q

How did you handle the scalability of your solution? Was the backend designed to support many users simultaneously?

A
  • FastAPI inherently supports asynchronous request handling.
  • This meant that while a particular API request might be waiting for a response from the expert-ai API, our server could still handle other incoming requests, enhancing concurrency.
  • Firebase also plays a part in our scalability considerations.
  • Being a managed cloud solution, it’s designed to scale automatically to handle the application’s load, meaning we didn’t have to provision or scale servers manually.
Q

How did you handle authentication and authorization for your API?

A
  • We used Firebase for authentication. Firebase provides an array of authentication methods, like email/password, OAuth, etc. We used OAuth.
  • Once authenticated, Firebase generates a JWT (JSON Web Token) which the client sends with each subsequent request.
  • On the backend, we verified this token to ensure the user’s identity and determine their permissions.
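A hedged sketch of that verification step with firebase_admin, exposed as a FastAPI dependency; the header handling and endpoint are illustrative:

```python
# Verify the Firebase-issued JWT on each request via a FastAPI dependency.
import firebase_admin
from firebase_admin import auth
from fastapi import Depends, FastAPI, Header, HTTPException

firebase_admin.initialize_app()      # uses GOOGLE_APPLICATION_CREDENTIALS by default

app = FastAPI()

def current_user(authorization: str = Header(...)) -> dict:
    token = authorization.removeprefix("Bearer ").strip()
    try:
        return auth.verify_id_token(token)          # decoded claims (uid, email, ...)
    except Exception:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

@app.get("/me")
def me(user: dict = Depends(current_user)):
    return {"uid": user["uid"]}
```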
Q

Why did you choose REST over other architectures

A

Several factors made REST a more suitable choice for us:

1) Simplicity & Familiarity:
- REST has been around for a long time and is widely accepted in the industry.
- Many of our team members were already familiar with designing RESTful APIs, which meant a shorter learning curve and faster development time.

2) CRUD Alignment:
- Our primary requirements were centered around CRUD (Create, Read, Update, Delete) operations for managing the book selection and user texts on our website.
- REST naturally aligns with these CRUD operations, making it a logical fit for our use case.

3) Maturity of Tools:
- We used FastAPI for the backend, which provides an efficient and straightforward setup for building RESTful APIs.
- The ecosystem around REST is mature, with plenty of libraries and tools available to help streamline development.

4) Scalability & Performance:
- While both REST and GraphQL can be optimized for performance, REST, given its stateless nature, allowed us to easily scale our application by adding more instances behind a load balancer.

5) Scope of Project:
- Given that our project was part of a hackathon, we wanted to focus on delivering a functional product within a limited timeframe. With REST, we could leverage existing knowledge, tools, and patterns without the overhead of adapting to a new paradigm.

Q

MMLM: Tell me more about this competition

A

1) Description
- a competition to predict the outcomes of that year’s college basketball tournaments

2) Solution
- Built an XGBoost + Logistic Regression ensemble for the Men’s tourney prediction and a Logistic Regression model for the Women’s tourney prediction

3) Logistic Regression
- Simple model that only uses Win Ratio, Average Point Difference per game, seedings and an external 538 Ratings

4) XGB
- More detailed feature engineering with player box scores, e.g. 3-pt rate, 2-pt rate, conversion rate, etc.
- Take into account time blocks for each season (i.e if a team/player is improving or performing worse as season is going on)
- Consider more external ratings

5) The CV metric is just log loss, since it penalises confident but incorrect predictions quite harshly. Used GroupKFold, grouped by team (a sketch of this setup follows below).

6) Extra
- Some other things I tried included building a rating system, but it is very hard to do fairly since not every team plays every other team (they are in different regions / leagues), so it is hard to standardise the “ratings”.
- In hindsight, one interesting idea is using the matchup structure: the 1st seed faces the 16th seed and the 8th seed faces the 9th seed in the first round, so if we are predicting an 8th seed vs a 16th seed, the 16th seed must have beaten the 1st seed to face the 8th seed, which evidently beat the 9th seed. Using the initial matchup results, we could probably build a better system, but there was no time to try (just interesting food for thought for next year’s competition).
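A sketch of the men's-bracket setup: blend XGBoost and Logistic Regression probabilities and evaluate with log loss under team-grouped CV. The features, group ids and blend weight are illustrative stand-ins:

```python
# Blend XGBoost and Logistic Regression probabilities, scored by log loss
# under team-grouped cross-validation (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GroupKFold
from xgboost import XGBClassifier

X = np.random.rand(300, 12)
y = np.random.randint(0, 2, size=300)              # 1 = first-listed team wins
groups = np.random.randint(0, 60, size=300)        # team id used for grouping

scores = []
for tr, va in GroupKFold(n_splits=5).split(X, y, groups):
    xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                        eval_metric="logloss").fit(X[tr], y[tr])
    lr = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    blend = 0.6 * xgb.predict_proba(X[va])[:, 1] + 0.4 * lr.predict_proba(X[va])[:, 1]
    scores.append(log_loss(y[va], blend))
print(np.mean(scores))
```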

Q

Why did you choose GBDTs and Logistic Regression? Did you consider any other model types?

A

I opted for GBDTs due to their capacity to capture complex non-linear relationships and handle mixed data types. They’re also pretty robust to outliers and are the staple model for competitive ML (tabular data competitions), since their inherent ability to capture interactions between features means you don’t have to come up with elaborate hand-crafted features.

On the other hand, Logistic Regression is less complex and can provide a good contrast to a complex GBDT when used in an ensemble. Specifically for the Women’s side of the competition, I decided to use LR alone because I felt the women’s bracket is less prone to upsets and, most of the time, the higher seed wins.

OTHER MODELS:

  • I also explored neural networks given their flexibility and expressiveness, but they just didn’t perform as well and are more difficult to push to a high score compared to a single out-of-the-box GBDT.
Q

Severity of Toxic Comments: Tell me more about this competition

A

Description
- Rank the severity of a set of toxic comments given to you
- The data is derived by giving 10+ human reviewers pairs of toxic comments and having them pick which one they think is more toxic

Solution

  • This competition was a bit more unusual, in the sense that they didn’t really give us a lot of data for training and evaluation.
  • I decided to use the competition data purely for validation and used past data from previous iterations of the Jigsaw competitions as training data.
  • Previous competitions were classification competitions, so I essentially assigned a score to each class and trained on that.
  • The trick to doing well in this competition was realising that the training data was not representative of the evaluated data (the organisers talked about it in a discussion post), so it pays to be more generic: the training / validation data != the final LB data, and additionally the public LB covered only 5% of the total data, so you don’t get a good gauge of how well you’re actually doing.
  • In order to diversify and reduce variance, I used linear models in addition to RoBERTa / DeBERTa models built on the validation data, to improve the robustness of the solution.
  • This was key to doing well in the competition; most high-ranking solutions had considered this approach.

On how to build the linear models (EXTRA)
- Create targets by assigning each category a value (i.e. 1 - 5)
- Vectorize the texts (TF-IDF), then simply perform regression analysis (Ridge Regression); a small sketch follows below
* TF-IDF: similar to the count vectorization method, but TF-IDF additionally takes into account a weight that signifies how important a word is to an individual text message or document
- Find which words / vectors carry the most importance in the scores
- Then essentially compute the total score of the sentence
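A sketch of that linear-model side: TF-IDF features plus Ridge regression on numeric severity targets. The example texts and target values are illustrative:

```python
# TF-IDF features + Ridge regression on numeric severity targets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

texts = ["you are wonderful", "you are an idiot", "I will hurt you"]
severity = [0.0, 2.0, 5.0]           # targets derived from the old class labels

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    Ridge(alpha=1.0),
)
model.fit(texts, severity)
print(model.predict(["you wonderful idiot"]))   # higher score -> more toxic
```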

Q

Chaii Hindi/Tamil QA: Tell me more about this competition

A

Description
- Simple / Basic Question Answering competition but with texts in Hindi and Tamil

Solution:
- This was one of my first NLP competitions; I essentially just submitted a simple RoBERTa, pretrained on SQuAD data and fine-tuned on the training data we had. I also used some additional data (the Tamil and Hindi parts of the TyDi QA dataset).

Some of the things I tried

  • Tried back translation, which was a very popular method at the time because it had recently been used to win a previous Quora NLP competition.
  • Progressive resizing:
  • Changing the sequence length from 256 —> 384 —> 448 etc. to allow the model to learn different “levels” of features (good if done right).
  • Cutout:
  • Essentially the idea of replacing some tokens with [MASK] (~0 - 10% is ideal); a small sketch follows below.
  • Fine-tuned the learning rate and gradient accumulation settings.
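A small sketch of the cutout augmentation described above: randomly replace roughly 10% of token ids with the tokenizer's [MASK] id. It operates on plain id lists so it is framework-agnostic; the mask id, special ids and probability are illustrative:

```python
# Cutout-style augmentation: randomly replace a fraction of token ids with [MASK].
import random

def cutout(token_ids, mask_id, prob=0.1, special_ids=frozenset()):
    return [
        mask_id if (tid not in special_ids and random.random() < prob) else tid
        for tid in token_ids
    ]

random.seed(0)
# Illustrative ids: 101/102 are treated as special (e.g. [CLS]/[SEP]), 103 as [MASK].
print(cutout([101, 7592, 2088, 999, 102], mask_id=103, special_ids={101, 102}))
```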