🔥 Top companies to work for in AI and 100+ ML interview questions open sourced from neuraprep.com
🙏 Feel free to submit a new job posting or suggest a change at team@neuraprep.com
🎓 On the hedge fund side, only firms that actively and continuously hire for ML were included
Top companies hiring for ML (subjectively ranked based on perception, culture, program, prestige, and pay):
1️⃣ Meta - OpenAI - Anthropic - Nvidia.
2️⃣ Citadel (Securities) - Netflix - Google - TwoSigma.
3️⃣ RunwayML - Uber - xAI.
4️⃣ Microsoft - Tesla - Tiktok - Stripe - Cruise.
5️⃣ Lambda - Figure AI - Scale - Coinbase - Reddit - Adobe.
-
[Startup] Learning Rate Significance
Why do we use small learning rates during model training instead of larger values like 1 or 2? -
[Startup] Train-Test Split Ratio
Is it always necessary to use an 80:20 ratio for the train-test split? If not, how would you decide on a split? -
Covariance vs Correlation
What is the difference between covariance and correlation? -
Skewed Distributions Tendencies
What happens to the mean, median, and mode when your data distribution is right skewed and left skewed? -
[Amazon] Robustness to Outliers
Which of the following is most robust to outliers: MAE, MSE, or RMSE? -
[Automattic] Content vs Collaborative Filtering
What is the difference between the content-based and collaborative filtering algorithms of recommendation systems? -
[TripAdvisor] Restaurant Recommendation System
How would you build a restaurant recommendation for TripAdvisor? -
[Stanford] Ensemble Model Performance
Why do ensembles typically have higher scores than the individual models? Can an ensemble be worse than one of the constituents? Give a concrete example. -
[Bosch] Focal Loss in Object Detection
Elaborate on the focal loss and its application in object detection. -
[Hedge Fund] Clock Hands Angle
What is the angle between the hands of a clock when the time is 3:15? -
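A quick way to sanity-check the arithmetic is the generic hand-angle formula; the sketch below is illustrative, not part of any official solution.
```python
# Hand-angle sketch: the minute hand moves 6 deg/min, the hour hand 0.5 deg/min.
def clock_angle(hour: int, minute: int) -> float:
    minute_angle = 6.0 * minute
    hour_angle = 30.0 * (hour % 12) + 0.5 * minute
    diff = abs(hour_angle - minute_angle)
    return min(diff, 360.0 - diff)

print(clock_angle(3, 15))  # 7.5 degrees
```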
[Startup] Optimizing Labeled Data
Labeled data is expensive to obtain in real-world applications. How do you minimize the amount of labeled data you need? Give 3 popular strategies used in the industry to solve this problem. -
Few-Shot Learning Steps
What steps does few-shot learning (sometimes grouped with meta learning) involve? -
[Startup] Greedy Layer-wise Pretraining 1
What is greedy layer-wise pretraining? How does it compare to freezing transfer learning layers? -
Freezing Transformer Layers
Why might you want to freeze transfer learning layers in the context of transformers? -
Dropout During Inference
What happens to dropout during inference? If at the training stage we randomly deactivate neurons, then do we do the same when predicting? -
[Tiktok] Importance of Variation in VAEs
Why do we need the 'variation' in a variational autoencoder, and what would happen if we removed it? Explain how this relates to the difference between NLU and Natural Language Generation. -
Generative Model: Training vs Inference
How does a generative model differ during training and inference in the context of text generation? -
Subword Tokenization Explanation
What is subword tokenization, and why is it preferable to word tokenization? Name a situation when it is NOT preferable. -
Use of Sigmoid for Numerical Prediction
Suppose you want to build a model that predicts a numerical quantity such as loan amount, investment amount, product price, etc. Why might you feed the final layer through a sigmoid function? -
[Hedge Fund] Function Derivative Zero Sum
What function yields 0 when added to its own derivative? -
Continuous Binary State Function
In a binary state, there are only two possible values: 0 or 1, which can represent off/on, false/true, or any two distinct states without any intermediate values. However, in many computational and real-world scenarios, we often need a way to express not just the two extreme states but also a spectrum of possibilities between them. Give an example of a function that represents a continuous version of a binary state (bit) and explain why. -
[Circle K] PCA and Correlated Variables
You are given a dataset. The dataset contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run a PCA. Would you remove correlated variables and why? -
Dot Product Complexity
How does the dot product of two vectors scale with N? -
Deep Learning Success
The fundamentals of neural nets have been known since the 80s; what explains the success of deep learning in recent times? -
[Startup] Model Convergence Evaluation
You are training a neural network and observe that training and testing accuracy converge to about the same value. The train and test sets are built well. Is this a success? What would you do to improve the model? -
[Facebook] Unfair Coin Probability
There is a fair coin (one side heads, one side tails) and an unfair coin (both sides tails). You pick one at random, flip it 5 times, and observe that it comes up as tails all five times. What is the chance that you are flipping the unfair coin? -
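A minimal Bayes-rule check for this setup (the calculation below is just one way to organize it):
```python
# Posterior probability of holding the two-tailed coin after observing 5 tails.
p_fair = p_unfair = 0.5
likelihood_fair = 0.5 ** 5    # P(5 tails | fair coin)
likelihood_unfair = 1.0       # P(5 tails | two-tailed coin)

posterior_unfair = (p_unfair * likelihood_unfair) / (
    p_fair * likelihood_fair + p_unfair * likelihood_unfair
)
print(posterior_unfair)  # 32/33, roughly 0.97
```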
[Quora] Drawing normally
You are drawing from a normally distributed random variable X ~ N(0, 1) once a day. What is the approximate expected number of days until you get a value of more than 2? -
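One way to reason about it is as a geometric waiting time with success probability P(X > 2); the simulation below is a sketch under that framing.
```python
import numpy as np

rng = np.random.default_rng(0)
p_exceed = 1.0 - 0.97725        # P(X > 2) for X ~ N(0, 1), roughly 0.0228

print(1.0 / p_exceed)           # analytic expectation, roughly 44 days

waits = rng.geometric(p_exceed, size=100_000)  # simulated waiting times in days
print(waits.mean())             # should also be close to 44
```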
[Airbnb] Fair Odds from Unfair Coin
Say you are given an unfair coin, with an unknown bias towards heads or tails. How can you generate fair odds using this coin? -
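One classic construction is the von Neumann trick; the sketch below assumes an arbitrary (unknown) bias and is only one of several valid answers.
```python
import random

def biased_flip(p_heads: float = 0.7) -> str:
    return "H" if random.random() < p_heads else "T"

def fair_flip(p_heads: float = 0.7) -> str:
    # Flip twice: HT counts as heads, TH as tails; HH/TT are discarded and we retry.
    while True:
        a, b = biased_flip(p_heads), biased_flip(p_heads)
        if a != b:
            return a  # "H" with probability exactly 1/2, regardless of the bias

print(sum(fair_flip() == "H" for _ in range(10_000)) / 10_000)  # close to 0.5
```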
[Airbnb] Customer Churn MLE
Say you model the lifetime for a set of customers using an exponential distribution with parameter l, and you have the lifetime history (in months) of n customers. What is the Maximum Likelihood Estimator (MLE) for l? -
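A numeric sanity check of the closed-form estimator (lambda_hat = n / sum of lifetimes), assuming i.i.d. exponential lifetimes; variable names are illustrative.
```python
import numpy as np

rng = np.random.default_rng(1)
true_lambda = 0.25
lifetimes = rng.exponential(scale=1.0 / true_lambda, size=5_000)  # months

lambda_hat = len(lifetimes) / lifetimes.sum()
print(lambda_hat)  # close to 0.25
```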
[Lyft] Probability of Feature Shipping
Say that you are pushing a new feature X out. You have 1000 users, and each user is either a fan or not a fan of X, at random. There are 50 users out of 1000 that do not like X. You will decide whether to ship the feature based on sampling 5 distinct users independently; if they all like the feature, you will ship it. What's the probability you ship the feature? How does the approach change if, instead of 50 users, we have N users who do not like the feature? What is the maximum number of unhappy users for which you would still ship the feature? -
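The first part reduces to drawing 5 fans out of 950 from a pool of 1000 without replacement; the sketch below is one way to set up the calculation (the ship-probability threshold in the helper is an assumption for illustration, not given in the question).
```python
from math import comb

# P(all 5 sampled users are fans) when 950 of 1000 users like the feature.
p_all_fans = comb(950, 5) / comb(1000, 5)
print(p_all_fans)  # roughly 0.77

def ship_prob(n_unhappy: int, total: int = 1000, k: int = 5) -> float:
    # Probability that all k sampled users are fans when n_unhappy users dislike X.
    return comb(total - n_unhappy, k) / comb(total, k)

threshold = 0.5  # illustrative choice of acceptable ship probability
print(max(n for n in range(996) if ship_prob(n) >= threshold))
```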
[Hedge Fund] Probability of Ruin in Gambling
I have $50 and I'm gambling on a series of coin flips. For each head I win $2 and for each tail I lose $1. What's the probability that I will run out of money? -
[Hedge Fund] Regression Slope Inversion
Suppose that X and Y are mean-zero, unit-variance random variables. If least squares regression (without intercept) of Y against X gives a slope of b (i.e. it minimizes E[(Y - bX)^2]), what is the slope of the regression of X against Y? -
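A quick empirical check, assuming zero-mean, unit-variance variables so that both slopes reduce to E[XY] (purely illustrative):
```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)
y = 0.6 * x + np.sqrt(1 - 0.6 ** 2) * rng.standard_normal(100_000)  # roughly unit variance

slope_y_on_x = (x @ y) / (x @ x)  # least-squares slope of Y against X (no intercept)
slope_x_on_y = (x @ y) / (y @ y)  # least-squares slope of X against Y (no intercept)
print(slope_y_on_x, slope_x_on_y)  # both close to 0.6
```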
[SimPrints] Face Verification System Steps
What do you think are the main steps for a face verification system powered by ML? Would CNN work well as a model? -
Mini-Max Optimization
Which method is used for optimizing a mini-max based solution? -
[Microsoft] Handling Missing Data
You are given a dataset consisting of variables having more than 30% missing values. Let's say out of 50 variables, 8 have more than 30% missing values. How do you deal with them? -
Type I vs Type II Errors
What is the difference between Type I and Type II errors? Follow up: better to have too many type I or type II errors in a solution? -
[Scaleai] Multimodal Attention in LLMs
How do you align information from different modalities in a multimodal Large Language Model? How does the attention mechanism still work with cross-modal inputs? -
[SentinelOne] RNN vs Transformer
How do RNNs differ from transformers? Mention 1 similarity and 2 differences. -
[Stanford] SGD Loss Function Behavior
Is it guaranteed that SGD will always decrease the loss function? -
Approximate Solutions in Training
Why might it be fine to get an approximate solution to an optimization problem during the training stage? -
Backpropagation Computational Cost
Which operation is the most computationally expensive in backpropagation, and why? -
Noisy Label Classification
How would you do classification with noisy labels or many incorrect labels? -
Logistic Regression on Linearly Separable Data
What would happen if you tried to fit logistic regression to a perfectly linearly separable binary classification dataset? What would you do in this situation, assuming you must keep logistic regression as the model? -
[Faang] Model Production Issue
Your training, validation, and test accuracies are all above 90%. Once in production, the model starts to behave strangely. How will you identify what's happening, and how will you correct it? -
[Faang] Self-attention Time Complexity
What is the time complexity of the Self-attention layer? -
'Random' in Random Forest
What is random in a random forest, and what are the benefits of this randomness? -
Gradient Descent vs Analytical Solution
Why do we need gradient descent instead of just taking the minimum of the N dimensional surface that is the loss function? -
Hessian Matrix in Optimization
What is the role of the Hessian matrix in optimization, and why is it not commonly used in training deep neural networks? -
Combining Normalized Features
If two features are embedding outputs - dimensions 1xN, 1xM - and one feature is single value output - 1x1 - and all feature values are normalized to between -1 and 1, how can these be combined to create a classification or regression output? -
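One simple combination, sketched under the assumption that plain concatenation followed by a small dense head is acceptable (names and shapes are illustrative):
```python
import numpy as np

rng = np.random.default_rng(6)
N, M = 16, 8
emb_a = rng.uniform(-1, 1, size=(1, N))    # first embedding feature, 1xN
emb_b = rng.uniform(-1, 1, size=(1, M))    # second embedding feature, 1xM
scalar = rng.uniform(-1, 1, size=(1, 1))   # single-value feature, 1x1

x = np.concatenate([emb_a, emb_b, scalar], axis=1)  # shape (1, N + M + 1)
W = 0.1 * rng.standard_normal((N + M + 1, 1))       # toy regression head
print(x @ W)  # single regression output; add a sigmoid for binary classification
```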
A/B Test Analysis
A company runs an A/B test for a donation group, but conversion does not increase (or the increase is not statistically significant). What would you do? -
[Meta] Impact Analysis of New User
What kind of analysis would you run to measure the effect on a Facebook user when their younger cousin joins? -
Logistic Regression vs Decision Tree
When would you use logistic regression over a decision tree? Which one would you use when the classification problem deals with perfectly linearly separable data? -
Improving Model Prediction
You have a model with a high number of predictors but poor prediction power. What would you do in this case? -
Naive Bayes with Laplace Smoothing
When should we use Naive Bayes with Laplace smoothing? Give a practical example. -
Activation Function Comparison
Which of the following activations has the highest output for x=2: Tanh, ReLu, Sigmoid, ELU? Without computing the functions, provide an explanation. -
ReLU Activation Issue
You are training a model using ReLU activation functions. After some training, you notice that many units never activate. What are some plausible actions you could take to get more units to activate? -
Variance of Duplicated Data
What would happen to the variance of the data if the entire dataset were duplicated? -
[Google] Identify Synonyms from Corpus
Say you are given a large corpus of words. How would you identify synonyms? Mention several methods in a short answer. -
[Airbnb] Model Airbnb Revenue
Say you are modeling the yearly revenue of new Airbnb rental listings. What kinds of features would you use? What data processing steps need to be taken, and what kind of model would you run? Would a neural network work? -
[Google] Linear Regression with Noise
Say we are running a linear regression which does a good job modeling the underlying relationship between some y and x. Now assume all inputs have some noise added, which is independent of the training data. What is the new objective function, and what are the effects of the noise on it? -
[Netflix] Gradient Descent in K-Means
If you had to choose, would you use stochastic gradient descent or batch gradient descent in k-means? Does k-means use any gradient descent to optimize the weights in practice? -
[Netflix] EM Algorithm Use Cases
When is Expectation-Maximization useful? Give a few examples. -
Dependence vs Correlation
What's the difference between dependence and correlation? -
[BioRender] Transformers in Computer Vision
How can transformers be used for tasks other than natural language processing, such as computer vision (ViT)? -
Convert NN Classification to Regression
How would you change a pre-trained neural network from classification to regression? -
Train Loss Stagnation
When it comes to training a neural network, what could be the reasons for the train loss not decreasing in a few epochs? -
High Momentum in SGD
What might happen if you set the momentum hyperparameter too close to 1 (e.g., 0.9999) when using an SGD optimizer? -
Law of Large Numbers
What is the Law of Large Numbers in statistics and how can it be used in data science? -
Selection Bias
What is the meaning of selection bias and how to avoid it? -
Weight Decay Scaling Factor
Why do we need a scaling factor in weight decay? Is it independent of the batch size and learning rate? -
Fuzzy Logic Explanation
What is fuzzy logic? -
Latent Variables in Stable Diffusion
Why do we call the hidden states "latent variables" instead of embeddings in stable diffusion? -
[Robinhood] User Churn Prediction Model
Walk me through how you'd build a model to predict whether a particular Robinhood user will churn. -
Ensemble Logistic Regression as Network
Consider a binary classification problem and N distinct logistic regression models. You decide to take a weighted ensemble of these to make your prediction. Can you express the ensemble in terms of an artificial network? How? -
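A minimal sketch of the equivalence: the N base models form a hidden layer of sigmoid units, and the ensemble weights form a fixed linear output layer (the shapes and weights below are illustrative).
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
N, d = 3, 5                        # N base logistic regressions, d input features
W = rng.standard_normal((N, d))    # row i: weights of base model i
b = rng.standard_normal(N)         # biases of the base models
alpha = np.array([0.5, 0.3, 0.2])  # ensemble weights (sum to 1)

x = rng.standard_normal(d)
hidden = sigmoid(W @ x + b)        # "hidden layer": the N base-model probabilities
ensemble_prob = alpha @ hidden     # fixed output layer = weighted ensemble
print(ensemble_prob)
```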
Feature Selection with Mutual Information
Consider learning a classifier in a situation with 1000 features total. 50 of them are truly informative about class. Another 50 features are direct copies of the first 50 features. The final 900 features are not informative. Assume there is enough data to reliably assess how useful features are, and the feature selection methods are using good thresholds. How many features will be selected by mutual information filtering? -
Optimal Polynomial Degree for Regression
We are trying to learn regression parameters for a dataset which we know was generated from a polynomial of a certain degree, but we do not know what this degree is. Assume the data was actually generated from a polynomial of degree 5 with some added Gaussian noise. For training we have 1000 pairs and for testing we are using an additional set of 100 pairs. Since we do not know the degree of the polynomial we learn two models from the data. Model A learns parameters for a polynomial of degree 4 and model B learns parameters for a polynomial of degree 6. Which of these two models is likely to fit the test data better? -
Softmax and Scaling
For an n-dimensional vector y, the softmax of y will be the same as the softmax of c * y, where c is any non-zero real number since softmax normalizes the predictions to yield a probability distribution. Am I correct in this statement? -
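A small counterexample sketch (softmax is invariant to additive shifts, not to scaling):
```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

y = np.array([1.0, 2.0, 3.0])
print(softmax(y))         # [0.09, 0.24, 0.67]
print(softmax(2.0 * y))   # [0.016, 0.117, 0.867]: a different distribution
print(softmax(y + 5.0))   # identical to softmax(y)
```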
Evaluation Metric for Criminal Identification
You are hired by LAPD as a machine learning expert, and they require you to identify criminals, given their data. Since being imprisoned is a very severe punishment, it is very important for your deep learning system to not incorrectly identify the criminals, and simultaneously ensure that your city is as safe as possible. What evaluation metric would you choose and why? -
Batch Size and Minima
Is it always a good strategy to train with large batch sizes? How is this related to flat and sharp minima? -
Logistic Regression on Synthetic Data
You are building a classification model to distinguish between labels from a synthetically generated dataset. Half of the training data is generated from N(2,2) and half of it is generated from N(0,3). As a baseline, you decide to use a logistic regression model to fit the data. Since the data is synthesized easily, you can assume you have infinitely many samples. Can your logistic regression model achieve 100% training accuracy? -
ReLU before Sigmoid Issue
You decide to use ReLU as your hidden layer activation, and also insert a ReLU before the sigmoid activation such that y_hat = sigmoid(ReLU(z)), where z is the preactivation value for the output layer. What problem are you going to encounter? -
Handling Class Imbalance in Medical Imaging
You're asked to build an algorithm estimating the risk of premature birth for pregnant women using ultrasound images. You have 500 examples in total, of which only 175 were examples of preterm births (positive examples, label = 1). To compensate for this class imbalance, you decide to duplicate all of the positive examples, and then split the data into train, validation and test sets. Explain what is a problem with this approach. -
[Stanford] Model-based Car Optimization
Suppose you have built a model to predict a car's fuel performance (e.g. how many miles per gallon) based on engine size, car weight, and many other attributes of the car. Your boss now has the great idea of using your trained model to build a car with the best possible fuel performance. This will be done by varying the car's parameters, e.g. weight and engine size, and using your model to predict fuel performance; the parameters will then be chosen so that the predicted fuel performance is the best. Is this a good idea? Why? Why not? -
Improving High Training Loss
You want to solve a classification task with a neural network. You first train your network on 20 samples. Training converges, but the training loss is very high. You then decide to train this network on 10,000 examples. Is your approach to fixing the problem correct? If yes, explain the most likely results of training with 10,000 examples. If not, give a solution to this problem. -
CNN vs. Fully-connected for Images
Alice recommends the use of convolutional neural networks instead of fully-connected networks for image recognition tasks since convolutions can capture the spatial relationship between nearby image pixels. Bob points out that fully-connected layers can capture spatial information since each neuron is connected to all of the neurons in the previous layer. Both are correct, but describe two reasons we should prefer Alice's approach to Bob's. -
Weight Initialization in Neural Networks
You try a 4-layer neural network in a binary classification problem. You initialize all weights to 0.5. Is this a good idea? Briefly explain why or why not? -
Impact of Weight Sharing
Does weight sharing increase the bias or the variance of a model? Why? -
PDF value range
A probability density function (PDF) cannot be less than 0 or bigger than 1. Is that true? Why or why not? -
KNN Bias-Variance Tradeoff
How does the bias-variance tradeoff play out for the k nearest neighbor algorithm as we increase k? -
RAG Explanation
What is Retrieval-Augmented Generation (RAG)? -
RAG Limitations
What are some limitations of RAG? -
LLM Tuning Techniques
In the world of LLMs, choosing between fine-tuning, Parameter-Efficient Fine-Tuning (PEFT), prompt engineering, and retrieval-augmented generation (RAG) depends on the specific needs and constraints of your application. Explain what each one of these does. -
Biases in CNN
You are building the next SOTA CNN for vision tasks following the architecture: (Layer input) => (Conv Layer) => (Batch Norm) => (Activation) => (Next Layer Input). The Conv layer has a set of learnable weights and biases but you decide not to train biases (hardcode them all to 0). Would the performance be affected compared to the same system with biases learning turned on? -
Transformer Encoder vs Decoder
There are 2 deeper technical differences between the encoder and decoder in a transformer. Can you mention them? -
[Scaleai] Decoder vs Encoder Popularity
Decoder-only models have recently become much more popular than encoder models. The majority of NLU models are decoder-only; why is that? Think about the advantage that encoder models have over decoders: why didn't that matter? -
Encoder-Decoder vs Decoder-Only
Why do we need encoder-decoder models while decoder-only models can do everything? -
[Scaleai] GPT-4 Architecture and Training
Can you describe the process by which GPT-4 generates coherent and contextually relevant text and why a decoder-only architecture was chosen? What input/output was it used for training? -
T5 vs GPT-4 for LLM
A T5 or FlanT5 model is considered one of the best encoder-decoder models out there (as of 2024). In technical detail, why aren't people using it at scale to train a large LLM that can compete with GPT-4? -
[Spotify] Clustering Performance with Labels
Can you suggest some ways in which the performance of a clustering algorithm can be measured when the labels are given? -
[Uber] Simulating Fair Die Roll
How would you simulate the roll of a fair six-sided die using U(0,1) (uniform distribution) random number generator? How would you validate that the rolls are indeed fair? -
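A minimal sketch: partition [0, 1) into six equal bins and test fairness with a goodness-of-fit test (the scipy chi-square call is one option among several).
```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(3)
u = rng.uniform(0.0, 1.0, size=60_000)
rolls = np.floor(u * 6).astype(int) + 1       # map U(0,1) to {1, ..., 6}

counts = np.bincount(rolls, minlength=7)[1:]  # observed frequency of each face
print(counts)
print(chisquare(counts))  # a large p-value is consistent with a fair die
```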
[Palo Alto Networks] Batch vs Instance Normalization Differences
What are the differences between batch normalisation and instance normalisation? Give an example where the instance norm would be preferred. -
Bias-Variance Tradeoff
Explain the concept of bias-variance tradeoff and its significance in machine learning. -
Curse of Dimensionality
What is the curse of dimensionality and how does it affect machine learning algorithms? -
[Kayzen] Gradient Boosting Explanation
How does gradient boosting work and why is it effective? -
[SYZYGY] Confusion Matrix Explanation
Describe the concept of a confusion matrix and its components. -
Backpropagation Explanation
How does backpropagation work in neural networks? -
Transfer Learning in Deep Learning
What is transfer learning and how is it applied in deep learning? -
[JPMorgan] Bagging vs Boosting
What is the difference between bagging and boosting in ensemble learning? -
[Apple] Feature Engineering with XGBoost
Explain the concept of feature engineering and its importance in the context of tabular data with XGBoost as a predictive modeling tool. -
Activation Functions in Neural Networks
What is the role of activation functions in neural networks? -
[Apple] ROC Curve Concept
Describe the concept of an ROC curve and how it is used. -
Binary Classification Metrics
What are the common metrics used for evaluating binary classification models? -
Feature Selection
How do you perform feature selection in machine learning? -
Clustering Performance Metrics
How do you measure the performance of a clustering algorithm? -
[Meta] Challenges with Large Datasets
What are the challenges of working with large-scale datasets? -
Softmax in Multi-Class Classification
How does the softmax function work in multi-class classification problems? -
Decision Trees Data Handling
Explain how decision trees handle categorical and numerical data. -
[Amazon] Handling Outliers
How do you handle outliers in a dataset? -
ML Deployment Challenges
What are the challenges in deploying machine learning models to production? -
Feature Scaling Importance
How do you perform feature scaling and why is it important? -
Linear Regression Coefficients
How do you interpret the coefficients of a linear regression model? -
Tree-based Algorithms Advantages
What are the advantages of using tree-based algorithms in machine learning? Mention as many as you know. -
Data Augmentation in Deep Learning
Explain the concept of data augmentation and its importance in training deep learning models. -
Early Stopping in Neural Networks
What is the role of early stopping in training neural networks? -
Time Series Cross-Validation
How do you implement cross-validation for time series data? -
Homoscedasticity vs Heteroscedasticity
Describe the difference between homoscedasticity and heteroscedasticity in regression analysis. -
[detikcom] Embedding Layer Purpose
What is the purpose of using an embedding layer in neural networks? -
Batch Normalization in Neural Networks
How does the batch normalization technique work in neural networks? -
Collaborative Filtering in Recommenders
How does the collaborative filtering algorithm work in recommendation systems? -
Bag-of-Words Explanation
Explain the concept of bag-of-words in natural language processing. -
Feature Selection Pitfalls
What are the common pitfalls in feature selection? -
Gradient Boosting Overfitting
How does the gradient boosting algorithm handle overfitting? -
Model Stacking Implementation
How do you implement model stacking in ensemble learning? -
Random Forest Missing Values
How does the random forest algorithm handle missing values? -
Training with Noisy Data
What are the challenges in training machine learning models with noisy data? -
Data Augmentation in CV
How do you perform data augmentation in computer vision tasks? -
Interpreting Hierarchical Clustering
How do you interpret the results of a hierarchical clustering algorithm? -
CNN Feature Extraction
How do you perform feature extraction using convolutional neural networks (CNNs)? -
[Startup] Challenges of Limited Data
What are the challenges in training machine learning models with limited data? -
GMM Clustering
How does the Gaussian Mixture Model (GMM) perform clustering? -
PDP Model Interpretation
Describe the process of model interpretation using partial dependence plots (PDP). -
Feature Selection with Info Gain
How do you implement feature selection using information gain? -
PPO in Continuous Spaces
How does the reinforcement learning algorithm PPO optimize policies for continuous action spaces? -
[Imubit] Optimal Clusters in K-Means
How does the k-means clustering algorithm determine the optimal number of clusters? -
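Worth noting that vanilla k-means does not pick k by itself; a common heuristic is to sweep k and inspect inertia (elbow) or silhouette scores. The sketch below assumes scikit-learn is acceptable and uses made-up synthetic data.
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```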
[Meta] Large-scale Image Classification System
Architect a large-scale image classification system that can process and categorize billions of images efficiently. -
[Tesla] Autonomous Vehicle Perception System
Develop a system for autonomous vehicle perception that can process sensor data in real-time and make safe driving decisions. -
Personalized News Feed Ranking
Develop a personalized news feed ranking system that can handle millions of users and articles while maintaining freshness and relevance. -
Large Language Model Training System
You're designing a distributed training system for large language models. How would you implement model parallelism and data parallelism? What are the trade-offs between them? -
Feature Store Design
For a recommendation system that needs to serve millions of users, how would you design the feature store? Consider aspects like real-time vs. batch features, storage choices, and serving architecture. -
Supply Chain Demand Forecasting
Develop a system for demand forecasting in supply chain management, processing large volumes of historical sales data to predict future demand accurately. -
Reinforcement Learning Training System
Design a scalable reinforcement learning system for training autonomous agents in complex simulated environments. -
Genomic Analysis Platform
Design a large-scale genomic analysis platform that uses machine learning to identify disease markers and predict patient outcomes from DNA sequencing data. -
Real-time Speech Recognition
Design a real-time speech recognition system that can transcribe live audio streams with high accuracy and support multiple languages. -
MLOps Infrastructure
Can you describe the key steps involved in creating infrastructure for MLOps? -
[Startup] Concept Drift Detection and Fixes
When could concept drift occur? How do you detect and address model or concept drift in deployed machine learning models? -
MLOps Data Privacy
What strategies can be implemented to ensure data privacy and security in MLOps pipelines? -
MLOps Scalability
What strategies do you employ to ensure scalability in your MLOps processes? -
Infrastructure as Code in MLOps
Can you explain the concept of Infrastructure-as-Code (IaC) and its role in MLOps? -
Model Bias Detection in MLOps
How do you detect and address model bias within an MLOps framework? -
Batch vs Real-time Inference
What distinguishes batch inference from real-time inference in machine learning applications? -
Search Engine Related Queries
When a user enters a search query on a search engine like Google, a list of related searches is displayed to enhance the search experience. How would you design a system to generate relevant related search suggestions for each query? -
[Airbnb] Property Search
How would you design a system to display the top 10 rental listings when a user searches for properties in a specific location on a platform like Airbnb? -
Trigger Word Detection
How would you design an algorithm to accurately detect the trigger word 'activate' within a 10-second audio clip? -
Document QA System
How would you design a question-answering system that can extract an answer from a large collection of documents given a user query? -
[Meta] Instagram Content Recommendations
How would you design a machine learning pipeline for Instagram's content recommendation system to handle data freshness, mitigate novelty effects in A/B testing, and ensure personalized content delivery to over a billion users? -
Food Delivery Query Understanding
How would you design a machine learning system to understand and expand user queries for a food delivery platform like Uber Eats, addressing challenges such as building a domain-specific knowledge graph, implementing representation learning for query expansion, handling ambiguous user intent, and ensuring real-time performance? -
Soft Prompt Tuning for Large Language Models
Explain how soft prompt tuning can be utilized to adapt a large language model to a new NLP task without modifying the model's core parameters. Discuss the benefits and potential challenges associated with this approach. -
Caching Strategies for Large Language Models
How would you design a caching system for serving large language model (LLM) responses to reduce latency and cost, while ensuring the accuracy and reliability of the responses? Discuss the key components of your caching mechanism, how you would handle semantic similarity, and the potential challenges you might face. -
Implementing Defensive UX for LLM-Based Products
How would you design a Defensive UX strategy for a product that utilizes large language models (LLMs) to anticipate and gracefully handle errors, guide user behavior, and prevent misuse, while ensuring accessibility, trust, and a seamless user experience? -
Implementing Cascade Pattern in ML Systems
How would you design a machine learning system using the cascade pattern to solve a complex problem by breaking it down into smaller, sequential tasks? Discuss the advantages and potential drawbacks of this approach, and provide real-world examples of its application. -
[Openai] Human-In-The-Loop for Collecting Explicit Labels
How would you design a Human-In-The-Loop (HITL) system to collect explicit labels for a supervised learning task, balancing the need for high-quality annotations with the constraints of cost and scalability? Discuss the methods you would use, the advantages and disadvantages of HITL, and how you might leverage large language models to enhance the labeling process. -
[Openai] Reframing Machine Learning Problems for Enhanced Performance
How would you apply the reframing technique to simplify a machine learning problem or its labels to improve model performance? Provide two examples of successful reframing and discuss the potential challenges of this approach. -
Evaluating LLM Outputs with LLM-Evaluators
How would you design and implement a system that uses large language models (LLMs) as evaluators to assess the quality, correctness, and safety of another LLM's responses, considering factors such as evaluation methods, alignment with human judgments, and scalability? -
Aligning LLM-Evaluators to User-Defined Criteria
How would you design a system to align large language model (LLM) evaluators with user-defined criteria, ensuring accurate and reliable assessments of another LLM's responses? Discuss the methods for defining and refining evaluation criteria, the role of interactive systems in this alignment, and the challenges involved in maintaining consistency with human judgments. -
Advancements in Evaluation Metrics for Machine Learning Models
How do newer evaluation metrics such as BERTScore and MoverScore address the challenges posed by traditional metrics in evaluating machine learning models? -
Mitigating Biases in Automated Evaluations Using Large Language Models
What strategies can be employed to mitigate biases in automated evaluations using large language models, and how do these strategies enhance the reliability of model assessments? -
Enhancing Open-Domain QA Systems with Fusion-in-Decoder (FiD)
How does Fusion-in-Decoder (FiD) enhance open-domain question-answering systems, and what are its key advantages over other retrieval-based models? -
Internet-Augmented LLM System Design
How do internet-augmented language models utilize search engines to enhance their performance in question-answering tasks, and what are the key components and processes involved? -
RAG System Design with Hybrid Retrieval
How can machine learning teams effectively apply Retrieval-Augmented Generation (RAG) using hybrid retrieval methods, and what are the key considerations and technologies involved in optimizing retrieval for large-scale applications? -
[Royal Bank Canada] Test if it's Gaussian
How would you determine that your dataset follows a Gaussian distribution? What if the data is univariate vs multivariate? -
[Microsoft] T-distribution vs normal distribution
When is the t-distribution used as opposed to the normal distribution? Roughly how many data points are considered enough to use a z-test instead of a t-test? -
[Microsoft] Linear regression assumptions test
The assumptions of linear regression are known to be: linearity, independence of errors, homoscedasticity, normality of residuals and no multicollinearity. How is each assumption tested? -
[Google Deepmind] Strong scaling
Can you explain, in 1 sentence, the concept of strong scaling in the context of large language models? -
[Google Deepmind] Possible Scenario
When a GPU or TPU rated at 500 TFLOPS only sustains ~50 TFLOPS on a large-model kernel, what possible scenarios related to chip-level factors can explain this tenfold performance drop? -
[Google Deepmind] Total memory computation.
Let A be an array with shape int8[128, 2048] sharded as A[I_XY, J] over a device mesh Mesh({'X': 2, 'Y': 8, 'Z': 2}) (so 32 devices total). How much memory does A use per device? How much total memory does A use across all devices? -
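A back-of-the-envelope sketch, assuming the notation means the first axis is split across the X and Y mesh axes (16-way) and the array is replicated across Z:
```python
elements = 128 * 2048            # int8 => 1 byte per element, 262,144 bytes unsharded
per_device = (128 // 16) * 2048  # 8 x 2048 = 16,384 bytes on each device
total = per_device * 32          # 524,288 bytes across all 32 devices (2x from Z replication)
print(per_device, total)
```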
[OpenAI] Total number of parameters.
Consider a transformer with D=4096 (hidden size), F=4D (width of the feed-forward layers), V=32000 (vocabulary size), and L=64 (depth of the network). Assume multi-head attention with int8 KVs, where the hidden size D is split across N heads, each of size H. How many parameters does the model have, and what fraction of them are attention parameters? -
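A rough count under one common convention (ignoring biases and layer norms; the embedding treatment is an assumption for illustration):
```python
D, F, V, L = 4096, 4 * 4096, 32000, 64

attn_per_layer = 4 * D * D    # W_Q, W_K, W_V, W_O projections
mlp_per_layer = 2 * D * F     # up- and down-projections of the feed-forward block
embeddings = V * D            # token embedding table

total = L * (attn_per_layer + mlp_per_layer) + embeddings
print(total / 1e9)                 # roughly 13B parameters
print(L * attn_per_layer / total)  # roughly one third are attention parameters
```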
[OpenAI] Tricks to improve generation throughput.
Mention 4 methods/tricks to improve generation throughput and latency in the context of large language models. -
[OpenAI] Mixing in Local Attention Layers
How would you interleave local-window attention with global attention to curb KV-cache growth at long contexts? What are the pros and cons? -
[Meta] Data and model parallelism
How does reducing the per-device batch size—especially when scaling out with fully-sharded data parallelism (FSDP)—impact the communication cost of 1D tensor (model) parallelism compared to data parallelism?