Easy Level
1. What is data preprocessing?
Answer: Data preprocessing involves cleaning, transforming, and organizing raw data to make it suitable for analysis.
2. What function is used to check for missing values in Pandas?
Answer: isnull()
3. What is the purpose of normalization?
Answer: To scale data to a specific range, often [0,1] or [-1,1].
4. How do you handle missing values in Pandas?
Answer: Using fillna(), dropna(), or imputation methods.
5. What does df.dropna() do in Pandas?
Answer: Removes rows with missing values.
6. What is the difference between normalization and standardization?
Answer: Normalization scales data to a fixed range, while standardization transforms data to have mean 0 and standard deviation 1.
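For illustration, a minimal sketch of both using scikit-learn (assuming it is installed; the data here is made up for the example):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Normalization: rescale values into the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: rescale values to mean 0, standard deviation 1
X_std = StandardScaler().fit_transform(X)

print(X_norm.ravel())             # values spread across [0, 1]
print(X_std.mean(), X_std.std())  # approximately 0.0 and 1.0
```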
7. What does df.info() do in Pandas?
Answer: Displays summary information about a DataFrame.
8. What is one-hot encoding?
Answer: A method to convert categorical variables into binary format.
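A quick sketch with Pandas (the 'color' column is an invented example):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One-hot encode: each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)
```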
9. Which library in Python is commonly used for data manipulation?
Answer: Pandas.
10. What function is used to read a CSV file in Pandas?
Answer: pd.read_csv()
Intermediate Level
11. What is an outlier in data?
Answer: An extreme value that deviates significantly from other observations.
12. How do you detect outliers in Python?
Answer: Using box plots, z-scores, or the IQR method.
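As a small sketch of the IQR method (the sample data is invented, with one obvious outlier):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95]
```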
13. What is feature engineering?
Answer: Creating new features from existing data to improve model performance.
14. How do you handle categorical data in machine learning?
Answer: Using one-hot encoding, label encoding, or embedding techniques.
15. What does df.duplicated() do?
Answer: Identifies duplicate rows in a DataFrame.
16. How do you remove duplicates in Pandas?
Answer: Using df.drop_duplicates().
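A minimal example of both calls together (the DataFrame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3], 'value': ['a', 'b', 'b', 'c']})

# duplicated() marks every repeat of an earlier row as True
print(df.duplicated().tolist())  # [False, False, True, False]

# drop_duplicates() keeps only the first occurrence of each row
deduped = df.drop_duplicates()
print(len(deduped))              # 3
```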
17. What is data augmentation?
Answer: The process of generating additional training data by modifying existing samples.
18. What is feature scaling?
Answer: The process of transforming data to fall within a specific range.
19. What does df.describe() do in Pandas?
Answer: Provides summary statistics of numerical columns.
20. Write a Python code snippet to replace missing values with the mean.
Answer:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, None, 4]})
df['A'] = df['A'].fillna(df['A'].mean())
print(df)
Advanced Level
1. What is the difference between missing completely at random (MCAR) and missing at random (MAR)?
Answer:
MCAR (Missing Completely at Random): Missing values are independent of both observed and unobserved data.
MAR (Missing at Random): Missingness depends only on observed data but not on unobserved values.
2. How do you handle missing data in a dataset?
Answer:
Remove rows with missing values (if the missing percentage is low).
Impute missing values using mean, median, mode, KNN, or regression-based imputation.
Use advanced techniques like MICE (Multiple Imputation by Chained Equations).
3. What is the curse of dimensionality? How do you handle it?
Answer:
The curse of dimensionality refers to the problem where increasing the number of features leads to sparsity and increased computational complexity.
Solutions: PCA, t-SNE, Autoencoders, Feature Selection methods (Lasso, RFE, etc.).
4. How does PCA work for dimensionality reduction?
Answer:
PCA transforms correlated features into a set of uncorrelated components.
It finds principal components (eigenvectors) that capture maximum variance.
The data is projected onto these components to reduce dimensionality.
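The steps above can be sketched with scikit-learn's PCA on synthetic data (two correlated features plus one noisy feature; all names and data here are invented for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,                                          # feature 1
    2 * x + rng.normal(scale=0.1, size=200),    # strongly correlated with feature 1
    rng.normal(size=200),                       # independent noise
])

# Project onto the top-2 principal components (directions of max variance)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (200, 2)
print(pca.explained_variance_ratio_)  # first component dominates
```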
5. What is the difference between Standardization and Normalization?
Answer:
Normalization (min-max scaling) rescales features to a fixed range, typically [0,1].
Standardization (z-score scaling) rescales features to mean 0 and standard deviation 1.
6. What are outliers? How can you detect and handle them?
Answer:
Outliers are extreme values deviating significantly from the dataset’s distribution.
Detection: Z-score, IQR method, Boxplots, DBSCAN clustering.
Handling: Remove, transform (log/square root), or use robust models.
7. What is the role of logarithmic transformation in data preprocessing?
Answer:
Reduces right-skewness, stabilizes variance, and improves model performance on non-normal data.
8. Explain the difference between underfitting and overfitting in preprocessing.
Answer:
Underfitting: Model is too simple, missing patterns in data.
Overfitting: Model learns noise instead of patterns.
Solution: Proper feature selection, data augmentation, regularization.
9. What is one-hot encoding, and when should you use it?
Answer:
One-hot encoding transforms categorical variables into binary columns.
Use it when categorical data is nominal (no inherent order).
10. What is label encoding, and when should it be used?
Answer:
Assigns numeric values (0,1,2, etc.) to categorical labels.
Suitable for ordinal categorical variables.
11. What is target encoding, and what are its risks?
Answer:
Replaces categories with the mean of the target variable for that category.
Risks: Overfitting, especially in small datasets.
12. How does KNN imputation work for handling missing values?
Answer:
It replaces missing values using the average (or majority class) of the k-nearest neighbors based on Euclidean distance.
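A minimal sketch with scikit-learn's KNNImputer (the tiny matrix is invented; with k=2 the missing entry is filled with the mean of its two nearest rows):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# The NaN is replaced by the mean of that feature over the
# k nearest rows (Euclidean distance on the observed features)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # 4.0, the mean of neighbors' values 2.0 and 6.0
```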
13. What is data leakage, and how can it be prevented?
Answer:
Data leakage occurs when information from outside the training dataset is used inappropriately.
Prevention: Ensure proper train-test split, use pipelines, avoid data preprocessing on full dataset before splitting.
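A sketch of the pipeline-based prevention with scikit-learn (synthetic data; the point is that the scaler is fit only on training data inside the pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler's statistics come only from X_train during fit,
# so test-set information never leaks into preprocessing
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```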
14. What is SMOTE, and when do you use it?
Answer:
Synthetic Minority Over-sampling Technique (SMOTE) creates synthetic samples for minority class to balance datasets.
Used in imbalanced classification problems.
15. What is Winsorization?
Answer:
A method to limit extreme values by capping them at a percentile threshold (e.g., 95th and 5th percentile).
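A small sketch of percentile capping with NumPy (the data is invented, with one extreme value):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Cap values below the 5th and above the 95th percentile
low, high = np.percentile(data, [5, 95])
winsorized = np.clip(data, low, high)
print(winsorized.max())  # capped at the 95th percentile, not 100
```

scipy.stats.mstats.winsorize offers the same operation as a ready-made function.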
16. What is Box-Cox transformation?
Answer:
A power transformation that reduces skewness and makes data more nearly normal; it requires strictly positive values and estimates an optimal transformation parameter (lambda) from the data.
17. What is the difference between TfidfVectorizer and CountVectorizer?
Answer:
CountVectorizer: Converts text into a matrix of token counts.
TfidfVectorizer: Adjusts word frequencies by importance using Term Frequency-Inverse Document Frequency (TF-IDF).
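A side-by-side sketch with scikit-learn (the three toy documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

# CountVectorizer: raw token counts per document
counts = CountVectorizer().fit_transform(docs)

# TfidfVectorizer: downweights terms like "the" that occur in every document
tfidf = TfidfVectorizer().fit_transform(docs)

print(counts.toarray())
print(tfidf.toarray().round(2))
```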
18. What is Variance Threshold in feature selection?
Answer:
A technique to remove low-variance features (those that provide little information).
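A minimal sketch with scikit-learn (the matrix is invented; its first column is constant and gets removed):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 1.0, 5.0],
              [0, 2.0, 5.1],
              [0, 3.0, 4.9]])  # column 0 is constant (zero variance)

# With threshold=0.0, features whose variance is 0 are dropped
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X)
print(X_selected.shape)  # (3, 2)
```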
19. How do you detect multicollinearity in a dataset?
Answer:
Check Variance Inflation Factor (VIF), correlation matrix, or condition index.
20. What are some feature engineering techniques?
Answer:
Polynomial features, interaction terms, binning, log transformations, aggregations.
21. What is min-max scaling?
Answer:
A normalization technique that rescales a feature to a fixed range, usually [0,1], using (x - min) / (max - min).
22. How do you handle imbalanced datasets?
Answer:
Resampling (Oversampling, Undersampling), SMOTE, Cost-sensitive learning.
23. What is the difference between ordinal and nominal data?
Answer:
Nominal: No order (e.g., colors, names).
Ordinal: Ordered categories (e.g., low, medium, high).
24. How does an autoencoder help in dimensionality reduction?
Answer:
An autoencoder compresses data into a lower-dimensional space and reconstructs it, learning efficient representations.
25. What is the difference between t-SNE and UMAP?
Answer:
t-SNE: Non-linear dimensionality reduction, good for visualization.
UMAP: Faster, preserves more global structure.
26. What is the purpose of feature scaling?
Answer:
Ensures numerical stability, improves convergence in gradient-based algorithms.
27. How do you handle categorical features with many unique values?
Answer:
Use target encoding, hashing, embedding-based methods.
28. What is the difference between L1 and L2 regularization?
Answer:
L1 (Lasso): Shrinks some coefficients to zero (feature selection).
L2 (Ridge): Reduces weights but doesn’t remove features.
29. Why is data augmentation used in preprocessing?
Answer:
Increases data variability, helps models generalize better (e.g., flipping, rotating images).
30. What are some advanced imputation techniques?
Answer:
Bayesian imputation, Multiple Imputation by Chained Equations (MICE), Deep Learning-based methods.