AI Glossary: 50 Must-Know Terms Explained Simply

by AiScoutTools

Artificial Intelligence (AI) is transforming industries and reshaping the way we work, learn, and interact with technology. From autonomous vehicles to intelligent virtual assistants, AI is becoming an integral part of our daily lives. However, the field of AI can be overwhelming due to its complex terminology and rapid advancements. To help you navigate this ever-evolving landscape, we’ve created this AI glossary of 50 essential AI terms. Each definition is designed to demystify the concepts and provide clear, concise explanations, allowing you to build a solid understanding of AI and its key components. Whether you’re a beginner or looking to expand your knowledge, this AI glossary will equip you with the confidence to dive into the world of Artificial Intelligence.

1. Artificial Intelligence (AI)

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines programmed to think and learn. These systems are capable of understanding data, recognizing patterns, and making decisions based on input. AI encompasses a wide range of technologies, including expert systems, natural language processing, machine learning, and robotics. Its goal is to automate tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI is used in sectors such as healthcare, finance, education, and transportation. Virtual assistants like Siri and Alexa are consumer-facing examples. AI continues to evolve, driving innovation across nearly every industry.

2. Machine Learning (ML)

Machine Learning (ML) is a core subfield of AI that enables systems to learn from data and improve over time without being explicitly programmed. ML algorithms detect patterns, make predictions, and automate decision-making tasks based on experience. There are various types of ML, including supervised, unsupervised, and reinforcement learning. ML is widely used in recommendation engines, fraud detection, spam filtering, and image recognition. It relies heavily on large datasets and statistical techniques. The effectiveness of ML models depends on the quality and quantity of training data. As data grows, machine learning becomes increasingly powerful in business and research.

3. Deep Learning

Deep Learning is an advanced subset of machine learning that uses neural networks with many layers (hence “deep”) to analyze data. These models can extract high-level features automatically and are particularly effective in tasks such as image and speech recognition, language translation, and autonomous driving. Deep learning systems learn complex patterns from massive datasets and require significant computing power. Technologies like ChatGPT, facial recognition systems, and AlphaGo are built using deep learning architectures. Its foundation lies in artificial neural networks inspired by the human brain. As hardware improves, deep learning continues to achieve breakthroughs in AI performance.

4. Neural Network

A Neural Network is a computing system designed to simulate the way the human brain analyzes and processes information. It consists of layers of interconnected “neurons” that transform input data into meaningful output through weighted connections. Each layer performs mathematical transformations, and the system adjusts weights during training to improve accuracy. Neural networks are fundamental to deep learning and power applications like handwriting recognition and real-time translation. They excel in modeling non-linear relationships and can generalize well to unseen data. Despite their complexity, neural networks are highly flexible and are used in various AI fields. They continue to drive advancements in modern AI systems.

5. Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field of AI focused on enabling machines to understand, interpret, and generate human language. It combines computational linguistics, machine learning, and deep learning to process text and voice data. Key tasks include language translation, sentiment analysis, speech recognition, and text summarization. NLP is used in chatbots, search engines, and virtual assistants. It faces challenges like ambiguity, sarcasm, and varying linguistic styles. Techniques such as tokenization, part-of-speech tagging, and entity recognition are essential in NLP pipelines. As models like GPT improve, NLP becomes more capable of understanding natural conversation.

6. Computer Vision

Computer Vision is the field of AI that trains computers to interpret and process visual information from the world. It involves techniques such as image classification, object detection, facial recognition, and motion tracking. By analyzing pixels and patterns in digital images and videos, computer vision enables machines to understand their environment. Applications include self-driving cars, medical imaging diagnostics, security surveillance, and augmented reality. Deep learning, especially Convolutional Neural Networks (CNNs), plays a vital role in this area. As datasets and processing power grow, computer vision systems become increasingly accurate and reliable in real-world applications.

7. Supervised Learning

Supervised Learning is a machine learning approach where models are trained on labeled data — input-output pairs where the correct answer is already known. This allows the algorithm to learn a mapping from inputs to outputs and apply it to new, unseen data. Common algorithms include linear regression, logistic regression, support vector machines, and decision trees. Supervised learning is used in spam detection, credit scoring, and image classification. It requires large, high-quality datasets to generalize well. Training performance is measured with metrics like accuracy and precision. It’s one of the most widely used and well-understood ML methods.
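
For a concrete picture, here is a minimal sketch of supervised learning using scikit-learn and one of its built-in labeled datasets (both are illustrative choices, not requirements):

```python
# Minimal supervised-learning sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # inputs and their known labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)    # learn a mapping from inputs to labels
model.fit(X_train, y_train)

print("accuracy on unseen data:", model.score(X_test, y_test))
```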

8. Unsupervised Learning

Unsupervised Learning involves analyzing and finding hidden patterns or structures in data without labeled outputs. Algorithms group similar data points (clustering) or find associations between variables. Examples include K-means clustering, hierarchical clustering, and principal component analysis (PCA). It’s useful for market segmentation, anomaly detection, and recommendation systems. Because there’s no predefined outcome, evaluating performance is more complex than in supervised learning. These methods are powerful for exploratory data analysis and reducing dimensionality. Unsupervised learning helps discover insights that may not be apparent through manual analysis, especially in large datasets.

9. Reinforcement Learning

Reinforcement Learning is a machine learning approach where agents learn to make decisions by interacting with an environment. They receive rewards or penalties based on their actions and aim to maximize cumulative rewards over time. This trial-and-error learning method is used in robotics, game-playing AI (e.g., AlphaGo), and resource optimization. Core elements include states, actions, rewards, and policies. Reinforcement learning balances exploration (trying new things) with exploitation (choosing the best-known option). It often uses simulated environments to train safely and efficiently. As algorithms improve, RL is increasingly applied to real-world complex decision-making problems.
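
To make the reward-driven loop concrete, here is a small sketch of tabular Q-learning on an invented five-state "corridor" environment; the environment, rewards, and hyperparameters are made up purely for illustration:

```python
import random

# Toy corridor: states 0..4, actions 0 (left) and 1 (right); reaching state 4 gives reward 1.
N_STATES, ACTIONS = 5, [0, 1]
alpha, gamma, epsilon = 0.1, 0.9, 0.2          # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]      # Q-table: expected return per state-action pair

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1     # next state, reward, episode done?

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy policy: explore with a small probability, otherwise exploit the best-known action.
        action = random.choice(ACTIONS) if random.random() < epsilon else max(ACTIONS, key=lambda a: Q[state][a])
        nxt, reward, done = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted best future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt

print("learned Q-values:", [[round(q, 2) for q in row] for row in Q])
```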

10. Algorithm

An Algorithm is a finite set of well-defined instructions used to solve a problem or perform a computation. In machine learning, algorithms process data to learn patterns or make predictions. Examples include decision trees for classification and gradient descent for model optimization. Algorithms can vary in complexity and performance, impacting accuracy and efficiency. Choosing the right algorithm depends on the problem type, data characteristics, and computational resources. They form the backbone of all AI and ML models. Optimizing algorithms is essential for scalable and responsive applications.

11. Training Data

Training Data refers to the dataset used to teach a machine learning model how to make predictions or decisions. It includes input data along with corresponding correct outputs (labels) for supervised learning, or just inputs for unsupervised learning. The model uses this data to learn patterns and relationships. The quality, size, and diversity of training data significantly affect model performance and accuracy. Poor or biased training data can lead to models that produce incorrect or unfair results. Data preprocessing techniques such as normalization, augmentation, and cleaning are often applied to enhance the dataset. Effective training data enables a model to generalize well to new, unseen examples.

12. Test Data

Test Data is a separate portion of the dataset used to evaluate a trained machine learning model’s performance. Unlike training data, the model has not seen test data during learning, making it useful for assessing generalization to new inputs. It helps determine how well the model is likely to perform in real-world scenarios. Metrics such as accuracy, precision, recall, and F1-score are calculated using test data results. Using test data helps prevent overfitting and identifies if further tuning or retraining is needed. It plays a critical role in validating the reliability and robustness of a model. Proper separation between training and testing datasets is essential for objective evaluation.
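
A minimal sketch of the training/testing separation, assuming scikit-learn and its bundled iris dataset as stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Hold out 25% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy :", accuracy_score(y_test, model.predict(X_test)))   # estimate of real-world performance
```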

13. Overfitting

Overfitting occurs when a machine learning model learns not only the general patterns in the training data but also the noise or random fluctuations. As a result, it performs very well on training data but poorly on unseen or test data. This happens when the model is too complex relative to the size or variability of the dataset. Common signs include high accuracy on training data and low accuracy on test data. Techniques to address overfitting include simplifying the model, applying regularization, using dropout in neural networks, or collecting more training data. Overfitting leads to high variance and reduces a model’s ability to generalize.

14. Underfitting

Underfitting happens when a machine learning model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test sets. It fails to learn the relationships between input and output effectively. Causes may include using too few features, oversimplified algorithms, or insufficient training time. Underfitted models exhibit high bias and often make systematic errors. Solutions include increasing model complexity, selecting more relevant features, or using more powerful algorithms. Detecting underfitting early is important to avoid wasted time on tuning ineffective models. It highlights the need for balance between bias and variance in model design.

15. Bias


Bias in machine learning refers to systematic errors that cause a model to consistently predict certain outcomes inaccurately. It often stems from assumptions in the algorithm or from unrepresentative training data. For instance, a model trained on a dataset lacking diversity may perform poorly on underrepresented groups. This can lead to unfair or discriminatory outcomes in critical applications like hiring, lending, or criminal justice. Addressing bias involves using balanced datasets, implementing fairness-aware algorithms, and continuous auditing. While some bias is necessary to simplify complex data, excessive bias reduces accuracy. Ensuring fairness and transparency is key in responsible AI development.

16. Variance


Variance is the degree to which a model’s predictions fluctuate for different training sets. A high variance model learns patterns too closely from the training data, leading to overfitting and poor generalization to new data. For example, a high-degree polynomial regression might perfectly fit training data but fail on unseen samples. Variance is influenced by model complexity, feature noise, and training data size. Reducing variance often involves techniques like cross-validation, regularization, and simplifying the model. Striking a balance between variance and bias is crucial for optimal performance. This trade-off is known as the bias-variance tradeoff.

17. Feature


A feature is a measurable property or characteristic of the data used by a machine learning model. In a dataset about houses, features might include square footage, number of bedrooms, or location. Good features allow a model to learn useful patterns and make accurate predictions. Feature engineering, which involves selecting, creating, or transforming features, is a crucial part of the ML workflow. Irrelevant or redundant features can confuse models and reduce accuracy. Dimensionality reduction or feature selection techniques help streamline this process. High-quality features are key to the success of any ML model.

18. Label


A label is the output or target value associated with each data point in supervised learning. For example, in an image classification task, the image might be the input and the label could be “cat” or “dog.” Labels are used during training to teach the model what the correct output should be. The accuracy of a model heavily depends on the quality of the labeled data. Noisy or incorrect labels can misguide the learning process, resulting in poor predictions. Labeling can be manual, automated, or crowdsourced depending on the task complexity. In unsupervised learning, labels are not required.

19. Classification


Classification is a supervised learning task where the model assigns a category label to an input based on learned patterns. It’s used in applications like email spam detection, sentiment analysis, and medical diagnosis. Binary classification involves two classes, while multi-class classification handles more than two. Algorithms commonly used include logistic regression, decision trees, and neural networks. Performance is evaluated using accuracy, precision, recall, and F1 score. Imbalanced datasets can skew performance, often requiring data resampling or adjusted class weights. Effective classification models are integral to intelligent systems used every day.

20. Regression


Regression is a type of supervised learning where the goal is to predict continuous values. Examples include predicting house prices, stock values, or temperature. Unlike classification, regression outputs a number rather than a category. Algorithms like linear regression, decision trees, and gradient boosting are commonly used. Key evaluation metrics include mean squared error (MSE), root mean squared error (RMSE), and R-squared. Regression models require careful feature engineering and regularization to avoid overfitting. When designed well, regression models can provide high-accuracy predictions for a wide range of real-world problems.
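
Here is a small regression sketch on synthetic data (the data and the choice of scikit-learn are illustrative), scored with MSE and R-squared:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                    # e.g. house size in some unit
y = 3.5 * X[:, 0] + 20 + rng.normal(0, 2, size=200)      # continuous target with noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MSE:", round(mean_squared_error(y_test, pred), 2))
print("R^2:", round(r2_score(y_test, pred), 3))
```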

21. Clustering


Clustering is an unsupervised learning technique that groups data points based on similarity. It does this without any labeled outputs, relying solely on inherent patterns in the data. Common clustering algorithms include K-Means, DBSCAN, and hierarchical clustering. It’s often used for customer segmentation, anomaly detection, or image grouping. A major challenge is choosing the number of clusters, which may require domain knowledge or methods like the elbow method. Clustering results can be sensitive to distance metrics and data scaling. It’s a powerful tool for exploratory data analysis and uncovering hidden structures.
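
A short clustering sketch, assuming scikit-learn and three invented groups of points; note the scaling step, since distance-based methods are sensitive to feature scale:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic "customer" groups with different centers (invented for the demo).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

X_scaled = StandardScaler().fit_transform(X)             # scale features before computing distances
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

print("cluster sizes:", np.bincount(kmeans.labels_))
print("inertia (used by the elbow method):", round(kmeans.inertia_, 2))
```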

22. Dimensionality Reduction


Dimensionality reduction is the process of decreasing the number of input variables or features in a dataset. It simplifies models, reduces training time, and improves interpretability. Techniques like Principal Component Analysis (PCA), t-SNE, and UMAP are popular. It helps combat the curse of dimensionality, where too many features can lead to overfitting and increased noise. While it simplifies data, care must be taken not to lose important information. Visualization of high-dimensional data becomes easier after reduction. Dimensionality reduction is commonly used before training machine learning models or for data exploration.

23. Principal Component Analysis (PCA)


PCA is a statistical method used to reduce the number of variables in a dataset while preserving its variance. It transforms the original features into a set of linearly uncorrelated components called principal components. These components capture the directions of maximum variance in the data. PCA is widely used for data compression, visualization, and noise reduction. It requires standardized input data and assumes linear relationships among variables. While it simplifies datasets, it may obscure interpretability of individual features. PCA is especially useful in fields like genomics, finance, and image processing.
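
A minimal PCA sketch, assuming scikit-learn and its iris dataset; the inputs are standardized first, as the entry notes:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)          # PCA expects standardized inputs

pca = PCA(n_components=2)                          # keep the two directions of maximum variance
X_2d = pca.fit_transform(X_std)

print("original shape:", X.shape, "-> reduced shape:", X_2d.shape)
print("variance explained by each component:", pca.explained_variance_ratio_.round(3))
```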

24. K-Nearest Neighbors (KNN)


KNN is a simple, non-parametric algorithm used for classification and regression. It predicts the label of a data point based on the majority label (or average value) of its k closest neighbors. Distance metrics such as Euclidean or Manhattan distance determine similarity. KNN requires no training phase, but prediction can be slow with large datasets. Choosing the right value of k is critical—too small can make the model sensitive to noise, while too large can smooth out patterns. KNN works best when the data has clear groupings and low dimensionality. Feature scaling is important for fair distance calculation.
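
A brief KNN sketch, assuming scikit-learn; the pipeline scales features before computing distances, as recommended above:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale first so no single feature dominates the distance calculation.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

print("test accuracy with k=5:", round(knn.score(X_test, y_test), 3))
```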

25. Support Vector Machine (SVM)


SVM is a powerful supervised learning algorithm that aims to find the best hyperplane to separate different classes. It maximizes the margin between the closest data points (support vectors) of different classes. SVMs can also handle non-linear data using kernel functions like the radial basis function (RBF). They are effective in high-dimensional spaces and commonly used for text classification and bioinformatics. While they offer strong theoretical guarantees, they can be slow on large datasets. Hyperparameter tuning is crucial for good performance. SVMs are less prone to overfitting but can be sensitive to noisy data.

26. Decision Tree


A decision tree is a flowchart-like model that splits data into branches based on feature values. Each internal node represents a test on a feature, each branch a decision outcome, and each leaf a predicted label. It’s easy to understand and visualize, making it a popular choice in interpretability-focused applications. However, decision trees can easily overfit the training data if not pruned properly. They can handle both numerical and categorical data. Decision trees form the building blocks of ensemble methods like Random Forest and Gradient Boosting. Their simplicity makes them ideal for quick prototyping.

27. Random Forest


Random Forest is an ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting. Each tree is trained on a random sample of the data, and predictions are made by majority vote (classification) or averaging (regression). This randomness reduces correlation between trees, enhancing robustness. Random Forests are capable of handling large datasets with high dimensionality. They also provide measures of feature importance, helping in feature selection. While they are more accurate than individual trees, they are less interpretable. Still, they are widely used in finance, medicine, and e-commerce.
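
A short Random Forest sketch, assuming scikit-learn, that also prints the feature importances mentioned above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("test accuracy:", round(forest.score(X_test, y_test), 3))
# Feature importances can guide feature selection.
top = sorted(zip(data.feature_names, forest.feature_importances_), key=lambda t: -t[1])[:3]
print("three most important features:", [(name, round(score, 3)) for name, score in top])
```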

28. Gradient Boosting


Gradient Boosting is an ensemble method that builds models sequentially, each correcting the errors of the previous one. Unlike Random Forests, which build trees independently, Gradient Boosting focuses on improving model accuracy step-by-step. Algorithms like XGBoost, LightGBM, and CatBoost are popular implementations. These models are highly effective for structured/tabular data. They can be tuned with parameters like learning rate, number of estimators, and tree depth. Gradient Boosting is prone to overfitting if not regularized, but techniques like early stopping can help. It’s one of the most powerful tools in a data scientist’s toolkit.

29. Hyperparameter


A hyperparameter is a configuration value set before the learning process begins that affects how a model trains. Examples include the learning rate, number of hidden layers, batch size, and regularization strength. These are different from parameters, which are learned from the data during training. Choosing the right hyperparameters is crucial for model performance: poor choices can lead to underfitting, overfitting, or excessive resource usage. Techniques like grid search, random search, and Bayesian optimization are used to find good values, and automated tools like AutoML can help streamline this process in complex projects. Proper tuning often requires experimentation, cross-validation, and significant computational resources.
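
As an illustration, here is a grid search sketch, assuming scikit-learn; the candidate values for C and gamma are invented for the example:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Hyperparameters are fixed before training; grid search tries each combination with cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```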

30. Cross-Validation

Cross-validation is a technique used to evaluate how well a machine learning model performs on unseen data.
It works by dividing the dataset into multiple parts, or “folds,” and training the model on some while validating on others.
The most common type is k-fold cross-validation, where the data is split into k groups, and the model is trained and tested k times, each time with a different validation fold.
This method helps ensure that the model isn’t just performing well on a lucky split of the data.
It reduces the risk of overfitting by making sure the model generalizes across different data subsets.
Cross-validation is often used to compare model performance and to fine-tune hyperparameters.
Overall, it helps you estimate how your model will perform in real-world situations.
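
A minimal 5-fold cross-validation sketch, assuming scikit-learn and its iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on four folds, validate on the remaining fold, repeat five times.
scores = cross_val_score(model, X, y, cv=5)

print("per-fold accuracy:", scores.round(3))
print("mean +/- std:", scores.mean().round(3), "+/-", scores.std().round(3))
```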


31. Confusion Matrix

A confusion matrix is a table that helps evaluate the performance of a classification model.
It compares actual target labels with the model’s predicted labels.
The matrix shows four key outcomes: True Positives, False Positives, True Negatives, and False Negatives.
These values help you calculate accuracy, precision, recall, and F1 score.
It’s especially useful for identifying specific types of errors, such as false alarms or missed detections.
For example, in a medical test, a false negative could mean missing a sick patient.
Using the confusion matrix, you can better understand how your model behaves under different conditions.
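
The sketch below builds a confusion matrix from invented labels and predictions, and also derives the precision, recall, and F1 metrics defined in the next three entries (scikit-learn is assumed):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Invented ground-truth labels and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)

# The same four counts drive the metrics defined in the next three entries.
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```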


32. Precision

Precision is a metric that measures how many of the predicted positive results were actually correct.
It’s calculated as True Positives divided by (True Positives + False Positives).
In simpler terms, it answers the question: “Of all the items labeled as positive, how many were truly positive?”
High precision means that the model makes fewer false positive errors.
This is crucial in fields like medical diagnosis or spam detection, where incorrect positive results can cause problems.
For instance, flagging a legitimate email as spam would be a false positive.
Precision helps evaluate how trustworthy a model’s positive predictions are.


33. Recall

Recall, also known as sensitivity or true positive rate, measures how many actual positives the model correctly identified.
The formula is: Recall = True Positives / (True Positives + False Negatives).
It answers the question: “Of all the actual positive cases, how many did the model catch?”
High recall means fewer false negatives, which is vital in high-risk areas like disease detection or fraud prevention.
If a model misses actual cases, it may fail its purpose even with high overall accuracy.
For example, missing a cancer diagnosis is a serious false negative.
Therefore, recall is essential when missing a positive outcome is more dangerous than a false alarm.


34. F1 Score

The F1 Score combines precision and recall into one number using the harmonic mean.
The formula is: F1 = 2 * (Precision * Recall) / (Precision + Recall).
It balances the trade-off between false positives and false negatives, especially in imbalanced datasets.
A high F1 score means the model performs well in both detecting and correctly classifying positives.
It’s particularly useful when one class is much more common than the other, like fraud vs. normal transactions.
The score ranges from 0 (worst) to 1 (perfect).
F1 Score gives a single performance metric to compare different models fairly.

35. ROC Curve

The ROC Curve (Receiver Operating Characteristic) is a graphical tool used to evaluate the performance of classification models.
It plots the True Positive Rate (Recall) on the Y-axis against the False Positive Rate on the X-axis across different threshold settings.
Each point on the curve represents a different trade-off between sensitivity and specificity.
A curve closer to the top-left corner indicates a better-performing model.
The ROC curve helps you visualize how well your model separates the positive and negative classes.
It’s especially useful when you need to compare multiple classifiers on the same task.
If your model’s ROC curve is close to the diagonal, it’s performing no better than random guessing.


36. AUC (Area Under Curve)

AUC stands for “Area Under the ROC Curve” and provides a single number to summarize classifier performance.
It ranges from 0 to 1, where 1 means perfect classification and 0.5 means the model is guessing randomly.
Higher AUC values indicate that the model is better at distinguishing between the classes.
Unlike accuracy, AUC is not affected by class imbalance, making it more reliable in some situations.
It’s particularly helpful in medical or fraud detection scenarios where false positives and false negatives have different costs.
AUC helps you decide which model to deploy based on overall performance across all classification thresholds.
It’s a commonly used metric for comparing binary classifiers.
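
A short sketch covering both the ROC curve and AUC, assuming scikit-learn and its breast-cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]           # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)     # one (FPR, TPR) point per threshold
print("points on the ROC curve:", len(thresholds))
print("AUC:", round(roc_auc_score(y_test, probs), 3))   # area under that curve; 0.5 = random guessing
```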


37. Loss Function

A loss function is a mathematical formula that measures how far off a model’s predictions are from the actual values.
It provides the feedback needed to improve the model during training.
For regression problems, Mean Squared Error (MSE) is commonly used, while Cross-Entropy is popular for classification.
The goal of training is to minimize the loss, which means improving the model’s predictions.
Choosing the right loss function is crucial for your model to learn correctly and efficiently.
Some functions penalize large errors more than others, affecting how learning progresses.
Loss functions are essential for guiding the optimization process during model training.
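
For illustration, here are hand-written versions of the two loss functions mentioned above (NumPy is assumed; production code would normally use a framework's built-ins):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error, typical for regression."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy for binary classification; p_pred are predicted probabilities."""
    y, p = np.asarray(y_true), np.clip(np.asarray(p_pred), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print("MSE          :", mse([3.0, 5.0], [2.5, 6.0]))                    # 0.625
print("cross-entropy:", round(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7]), 3))
```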


38. Gradient Descent

Gradient Descent is an optimization algorithm used to minimize the loss function in machine learning models.
It works by calculating the gradient (slope) of the loss with respect to each model parameter.
Then it updates the parameters by moving in the opposite direction of the gradient.
The size of each step is determined by the learning rate, a key hyperparameter.
There are several variations like Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent.
Gradient Descent is used to make the model learn by reducing the error over time.
Proper tuning is essential—if the learning rate is too high, the model may not converge.
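
A minimal gradient-descent sketch, fitting a straight line to synthetic data with plain NumPy; the learning rate and step count are invented for the example:

```python
import numpy as np

# Fit y = w*x + b by gradient descent on the MSE loss (synthetic data).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)

w, b, lr = 0.0, 0.0, 0.1              # initial parameters and learning rate
for step in range(500):
    error = (w * x + b) - y
    grad_w = 2 * np.mean(error * x)   # gradient of the loss with respect to w
    grad_b = 2 * np.mean(error)       # gradient of the loss with respect to b
    w -= lr * grad_w                  # move against the gradient
    b -= lr * grad_b

print("learned w, b:", round(w, 2), round(b, 2))   # should be close to 2.0 and 1.0
```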


39. Backpropagation

Backpropagation is a technique used in training neural networks by adjusting weights to minimize the loss function.
It uses the chain rule from calculus to calculate how changes in each weight affect the final loss.
The calculated gradients are then used by an optimizer (like Adam or SGD) to update the weights.
This process is repeated across many epochs to gradually improve model performance.
Backpropagation allows deep learning models to learn complex patterns from large datasets.
It makes training neural networks efficient and scalable.
Without backpropagation, modern deep learning wouldn’t be possible.
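
As a rough illustration, the sketch below hand-codes backpropagation for a tiny two-layer network on the XOR problem; the architecture, data, and learning rate are invented, and real projects would rely on a framework's automatic differentiation:

```python
import numpy as np

# A tiny 2-4-1 network trained on XOR with hand-written backpropagation.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 0.5

for _ in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: apply the chain rule layer by layer (cross-entropy loss with a sigmoid output).
    d_out = out - y                       # gradient at the output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)    # gradient propagated back to the hidden layer

    # Gradient-descent updates using the computed gradients.
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print("predictions after training:", out.round(2).ravel())   # typically close to [0, 1, 1, 0]
```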


40. Epoch

An epoch is one complete pass through the entire training dataset by the model.
During each epoch, the model makes predictions, compares them to the actual values, and updates weights.
Multiple epochs are usually needed to properly train a model and reach a point of convergence.
The number of epochs is a hyperparameter that can affect both underfitting and overfitting.
Too few epochs might result in an undertrained model, while too many can cause overfitting.
Techniques like early stopping help monitor performance and stop training at the right time.
Each epoch includes several iterations if batch training is used.


41. Batch Size

Batch size is the number of training examples used to calculate each update to the model’s weights.
Smaller batch sizes (like 32 or 64) tend to use less memory and introduce more noise, which can help generalization.
Larger batch sizes can speed up training but often generalize less well, tending to settle into sharper minima.
Choosing the right batch size affects training time, stability, and model accuracy.
It’s a crucial hyperparameter in deep learning.
Batch size also determines how many iterations are in an epoch—total samples divided by batch size.
Experimentation is often required to find the optimal size for a specific model and dataset.
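
The sketch below shows how batch size and epochs relate in a typical mini-batch training loop (the numbers and the placeholder data are invented):

```python
import math
import numpy as np

n_samples, batch_size, n_epochs = 1000, 32, 3
X = np.random.rand(n_samples, 10)                       # placeholder data

iterations_per_epoch = math.ceil(n_samples / batch_size)
print("iterations per epoch:", iterations_per_epoch)    # 1000 / 32 -> 32 updates (31 full + 1 partial batch)

for epoch in range(n_epochs):
    indices = np.random.permutation(n_samples)          # reshuffle the data each epoch
    for start in range(0, n_samples, batch_size):
        batch = X[indices[start:start + batch_size]]    # one mini-batch -> one weight update
        # model.train_step(batch) would go here
    print(f"epoch {epoch + 1} finished: one full pass over all {n_samples} samples")
```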


42. Activation Function

Activation functions are mathematical operations used in neural networks to introduce non-linearity.
Without them, neural networks would behave like simple linear models and fail to learn complex patterns.
Common activation functions include ReLU, Sigmoid, and Tanh.
They decide whether a neuron should be “activated” or not, based on input.
ReLU is fast and often used in hidden layers, while Sigmoid is useful for probabilities.
The choice of activation function affects the model’s ability to learn and converge.
Different problems require different activation functions for optimal performance.
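
Here are the three activation functions named above, written out in NumPy for illustration (ReLU and Sigmoid are also the subjects of the next two entries):

```python
import numpy as np

def relu(z):      # outputs z where z > 0, otherwise 0
    return np.maximum(0, z)

def sigmoid(z):   # squashes values into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):      # squashes values into the range (-1, 1)
    return np.tanh(z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("relu   :", relu(z))
print("sigmoid:", sigmoid(z).round(3))
print("tanh   :", tanh(z).round(3))
```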


43. ReLU (Rectified Linear Unit)

ReLU is an activation function that outputs the input directly if it’s positive; otherwise it returns zero.
It’s widely used in deep learning because it’s simple and computationally efficient.
ReLU helps avoid the vanishing gradient problem that occurs with functions like Sigmoid.
However, it can lead to “dead neurons” if too many outputs are zero.
Variants like Leaky ReLU and Parametric ReLU address this issue by allowing a small slope for negative values.
ReLU accelerates training and improves model performance in many real-world tasks.
It’s the default activation in most convolutional and deep neural networks.


44. Sigmoid Function

The Sigmoid function transforms input values into a range between 0 and 1, making it ideal for binary classification.
It squashes large input values into a smooth S-shaped curve.
This function is often used in the final layer of a binary classifier to produce probabilities.
However, it can suffer from vanishing gradients in deep networks, which slows learning.
Despite this limitation, it’s still useful in certain contexts like logistic regression.
The Sigmoid function outputs can be interpreted as confidence levels for a prediction.
It provides smooth and differentiable transitions, which is helpful in gradient-based optimization.


45. Softmax Function

Softmax is an activation function used in the output layer of multi-class classification models.
It converts raw scores (logits) into probabilities that add up to 1.
Each class’s score is exponentiated and divided by the sum of all exponentials.
The highest probability indicates the model’s predicted class.
Softmax is sensitive to extreme input values, which can affect stability.
It’s commonly used in image recognition tasks, NLP, and deep learning applications.
The output can be easily interpreted as confidence scores across classes.
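
A small NumPy sketch of softmax; subtracting the maximum logit is the usual trick for the numerical-stability issue mentioned above:

```python
import numpy as np

def softmax(logits):
    # Subtracting the largest logit keeps the exponentials from overflowing
    # without changing the resulting probabilities.
    z = logits - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])           # raw class scores (logits)
probs = softmax(scores)
print("probabilities:", probs.round(3))       # the largest score gets the highest probability
print("sum:", probs.sum())                    # probabilities add up to 1
```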


46. Dropout

Dropout is a regularization technique used to reduce overfitting in neural networks.
During training, it randomly deactivates a fraction of neurons in each layer on every pass.
This prevents the network from becoming too dependent on any one node.
It effectively simulates an ensemble of smaller networks and improves generalization.
The dropout rate (e.g., 0.5) determines the fraction of units dropped during training.
At test time, all neurons are used, and their outputs are scaled accordingly.
Dropout is simple to implement and widely adopted in deep learning frameworks.
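
For illustration, here is the common "inverted dropout" variant in NumPy, which rescales at training time rather than at test time (this is how most modern frameworks implement it):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: randomly zero a fraction of units and rescale the rest during training."""
    if not training or rate == 0.0:
        return activations                        # at test time every neuron is used unchanged
    mask = rng.random(activations.shape) >= rate  # keep each unit with probability (1 - rate)
    return activations * mask / (1.0 - rate)      # rescale so the expected activation stays the same

h = np.ones((1, 10))                              # a toy layer of activations
print("training output:", dropout(h, training=True))
print("test output    :", dropout(h, training=False))
```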


47. Regularization

Regularization helps prevent overfitting by adding a penalty to the loss function for large or complex model parameters.
It encourages simpler models that generalize better to new data.
Common types include L1 (Lasso), L2 (Ridge), and techniques like dropout.
Regularization balances the bias-variance tradeoff by discouraging the model from memorizing the training data.
In deep learning, it helps manage the high number of parameters effectively.
The strength of regularization is controlled by a hyperparameter (often called lambda or alpha).
Proper use of regularization improves robustness and performance on unseen data.


48. L1 Regularization

L1 Regularization, also known as Lasso, adds the sum of the absolute values of the weights to the loss function as a penalty.
This technique encourages sparsity, meaning it pushes less important feature weights to zero.
As a result, it can automatically perform feature selection by ignoring irrelevant inputs.
L1 is useful when dealing with high-dimensional data where many features may be unnecessary.
It helps simplify models while maintaining accuracy.
However, it may not perform well when features are highly correlated.
L1 regularization is often chosen when interpretability and simplicity are priorities.


49. L2 Regularization

L2 Regularization, also known as Ridge, adds the sum of the squared weights to the loss function as a penalty.
Unlike L1, it doesn’t push weights to zero but penalizes large weights to keep them small.
It helps reduce overfitting by smoothing the model’s response to input changes.
L2 is ideal when all input features are expected to contribute a little to the output.
It improves model stability, especially when inputs are noisy.
This form of regularization is widely used in linear regression, logistic regression, and neural networks.
L2 reduces a model’s variance at the cost of only a small increase in bias.
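
The sketch below contrasts the two penalties on synthetic data where only two of ten features matter (scikit-learn is assumed); Lasso typically zeroes out the irrelevant weights, while Ridge merely shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 100)   # only the first two features matter

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: pushes irrelevant weights toward exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: shrinks all weights but keeps them non-zero

print("OLS coefficients  :", ols.coef_.round(2))
print("Lasso coefficients:", lasso.coef_.round(2))   # typically sparse
print("Ridge coefficients:", ridge.coef_.round(2))   # small but non-zero
```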


50. Transfer Learning

Transfer Learning is a method where a model trained on one task is reused for a different but related task.
This is useful when you have limited data for the new task but want to benefit from large pre-trained models.
Popular examples include using BERT for text or ResNet for image classification.
You can either fine-tune the entire model or freeze earlier layers and only train the final ones.
Transfer learning saves time, computational resources, and often improves performance.
It is especially useful in fields like medical imaging, NLP, and facial recognition.
By leveraging prior knowledge, it reduces the need to start model training from scratch.
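
A hedged sketch of the freeze-and-replace recipe described above, assuming PyTorch and torchvision are installed; the weights string follows recent torchvision versions, and the three-class task is hypothetical:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the earlier layers so their pre-trained weights are not updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new head for a hypothetical 3-class task.
num_classes = 3
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters remain trainable; the usual training loop would go here.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print("trainable parameters:", trainable)   # expected: ['fc.weight', 'fc.bias']
```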

© 2025 AiScoutTools.com. All rights reserved.
