1] AIM:- Introduction to Excel
- Perform conditional formatting on a dataset using various criteria.
- Create a pivot table to analyze and summarize data.
- Use the VLOOKUP function to retrieve information from a different worksheet or table.
- Perform what-if analysis using Goal Seek to determine input values for a desired output.

PERFORM CONDITIONAL FORMATTING ON A DATASET USING VARIOUS CRITERIA.
- Open the Titanic.csv file.
- To highlight the cells in the population column with values greater than 16114236.5, perform the following steps.
STEPS:-
1. Go to the "Home" tab on the ribbon.
2. Click "Conditional Formatting" in the toolbar.
3. Choose "Highlight Cells Rules" and then "Greater Than".
4. Enter the threshold value 16114236.5.
5. Customize the formatting options (e.g. choose a fill colour).
6. Click "OK" to apply the rule.

CREATE A PIVOT TABLE TO ANALYZE AND SUMMARIZE DATA.
Following are the steps to create a pivot table to analyze and summarize the data.
STEPS:-
1. Select the entire dataset, including headers.
2. Go to the "Insert" tab on the ribbon.
3. Click "PivotTable".
4. Choose where you want to place the PivotTable (e.g. a new worksheet).
5. a] Drag "state" into the Report Filter.
   b] Drag "country" into the Column Labels.
   c] Drag "city name" into the Row Labels.
   d] Drag "longitude" and "latitude" into the Values.
6. After creating the PivotTable, open the "All" option and select multiple items to analyze the data.
7. Selecting multiple items produces the summarized output in the form of the pivot table.

USE VLOOKUP FUNCTION TO RETRIEVE INFORMATION FROM A DIFFERENT WORKSHEET OR TABLE.
Use the VLOOKUP function to retrieve the category of "temp" from a separate worksheet named "Sheet3" using the following steps:
STEPS:-
1. Create the temp column in a separate sheet, where sr_no. will act as the primary key for the VLOOKUP.
2. Also add the sr_no.
column in the Titanic sheet.
3. In a cell in your main dataset (i.e. the Titanic sheet), enter the formula: =VLOOKUP(A2,Sheet3!A1:D151,4,FALSE).
4. Applying this formula gives the required output.

PERFORM WHAT-IF ANALYSIS USING GOAL SEEK TO DETERMINE INPUT VALUES FOR DESIRED OUTPUT.
STEPS:-
1. Identify the cell containing the formula for "population_proper" (computed from "population").
2. Go to the "Data" tab on the ribbon.
3. Click "What-If Analysis" and select "Goal Seek".
4. Set "Set cell" to the population_proper cell (J2), "To value" to 200000, and "By changing cell" to the population cell (I2).
5. Click "OK" to let Excel determine the required population.

2] Practical 2 : Data Frames and Basic Data Pre-processing

∙ Read data from CSV and JSON files into a data frame.

import pandas as pd

dataframe = pd.read_csv("C:\\Data Science\\customers-10000.csv")
print("Our DataFrame:")
print(dataframe)

∙ Perform basic data pre-processing tasks such as handling missing values and outliers.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Age': [25, None, 30, 22, None],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\nMissing values:")
print(df.isna().sum())

# 1. Fill missing 'Age' with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# 2. Fill missing 'Name' with the mode
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])
# 3. Drop rows with missing 'City'
df_cleaned = df.dropna(subset=['City'])
print("\nCleaned DataFrame:")
print(df_cleaned)

∙ Manipulate and transform data using functions like filtering, sorting, and grouping.
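One such transformation is joining tables: the Excel VLOOKUP used in Practical 1 corresponds to a left merge in pandas. A minimal sketch with made-up data (the `sr_no` and `temp` columns mirror the worksheet example above; the values are illustrative):

```python
import pandas as pd

# Main sheet with a key column (stands in for the Titanic sheet)
main = pd.DataFrame({'sr_no': [1, 2, 3], 'name': ['A', 'B', 'C']})
# Lookup sheet keyed by the same column (stands in for Sheet3)
lookup = pd.DataFrame({'sr_no': [1, 2, 3], 'temp': [21.5, 19.0, 23.2]})

# how='left' keeps every row of the main sheet, like VLOOKUP with exact match (FALSE)
result = main.merge(lookup, on='sr_no', how='left')
print(result)
```

Rows of `main` whose key has no match in `lookup` would get NaN, just as VLOOKUP returns #N/A.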
import pandas as pd
import seaborn as sns

iris = sns.load_dataset('iris')
print("Original Iris Dataset:")
print(iris.head())

>> Filtering Data
filtered_data = iris[iris['sepal_length'] > 5.0]
print("\nFiltered Data (Sepal Length > 5.0):")
print(filtered_data)

>> Sorting Data
sorted_data = iris.sort_values(by='sepal_length', ascending=False)
print("\nSorted Data (Sepal Length Descending):")
print(sorted_data)

>> Grouping Data
# numeric_only=True avoids an error on the non-numeric 'species' column in recent pandas
grouped_data = iris.groupby('species').mean(numeric_only=True)
print("\nGrouped Data by Species (Mean of Numerical Columns):")
print(grouped_data)

Practical 3. Feature Scaling and Dummification

∙ Apply feature-scaling techniques like standardization and normalization to numerical features.

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

data = {
    'Product': ['Apple_Juice', 'Banana_Smoothie', 'Orange_Jam', 'Grape_Jelly', 'Kiwi_Parfait',
                'Mango_Chutney', 'Pineapple_Sorbet', 'Strawberry_Yogurt', 'Blueberry_Pie', 'Cherry_Salsa'],
    'Category': ['Apple', 'Banana', 'Orange', 'Grape', 'Kiwi', 'Mango', 'Pineapple', 'Strawberry', 'Blueberry', 'Cherry'],
    'Sales': [1200, 1700, 2200, 1400, 2000, 1000, 1500, 1800, 1300, 1600],
    'Cost': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800],
    'Profit': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the original dataset
print("Original Dataset:")
print(df)

>> Apply feature-scaling techniques like standardization and normalization to numerical features

numeric_columns = ['Sales', 'Cost', 'Profit']

# Apply StandardScaler for standardization
scaler_standardization = StandardScaler()
df_scaled_standardized = pd.DataFrame(
    scaler_standardization.fit_transform(df[numeric_columns]),
    columns=numeric_columns
)

# Apply MinMaxScaler for normalization
scaler_normalization = MinMaxScaler()
df_scaled_normalized = pd.DataFrame(
    scaler_normalization.fit_transform(df[numeric_columns]),
    columns=numeric_columns
)

# Combine the scaled numeric features with the categorical features (for standardized data)
df_scaled_standardized = pd.concat(
    [df_scaled_standardized, df.drop(numeric_columns, axis=1)], axis=1
)

# Display the dataset after standardization
print("\nDataset after Standardization:")
print(df_scaled_standardized)

# Combine the scaled numeric features with the categorical features (for normalized data)
df_scaled_normalized = pd.concat(
    [df_scaled_normalized, df.drop(numeric_columns, axis=1)], axis=1
)

# Display the dataset after normalization
print("\nDataset after Normalization:")
print(df_scaled_normalized)

∙ Perform dummification (one-hot encoding) on categorical features.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Define the data
data = {
    'Product': ['Apple_Juice', 'Banana_Smoothie', 'Orange_Jam', 'Grape_Jelly', 'Kiwi_Parfait'],
    'Category': ['Apple', 'Banana', 'Orange', 'Grape', 'Kiwi'],
    'Sales': [1200, 1700, 2200, 1400, 2000]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Select the categorical columns
categorical_columns = ['Product', 'Category']

# Instantiate the OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the categorical columns
encoded_data = encoder.fit_transform(df[categorical_columns])

# Retrieve feature names
feature_names = encoder.get_feature_names_out(categorical_columns)

# Convert the encoded data to a DataFrame
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=feature_names)

# Combine the encoded columns with the original DataFrame (excluding original categorical columns)
df_dummified = pd.concat([encoded_df, df.drop(columns=categorical_columns)], axis=1)

# Display the dummified DataFrame
print("\nDummified DataFrame:")
print(df_dummified)

Practical : 4 Hypothesis Testing

∙ Formulate null and alternative hypotheses for a given problem.
∙ Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
∙ Interpret the results and draw conclusions based on the test outcomes.

Code :

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)

t_statistic, p_value = stats.ttest_ind(sample1, sample2)
alpha = 0.05

print("Results of Two-Sample t-test:")
print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")

plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')
plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)
plt.title('Distributions of Sample 1 and Sample 2')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()

if p_value < alpha:
    # Illustrative shading of the data range when the null is rejected
    critical_region = np.linspace(min(sample1.min(), sample2.min()),
                                  max(sample1.max(), sample2.max()), 1000)
    plt.fill_between(critical_region, 0, 5, color='red', alpha=0.3, label='Critical Region')
    plt.text(11, 5, f'T-statistic: {t_statistic:.2f}', ha='center', va='center',
             color='black', backgroundcolor='white')

plt.show()  # added so the figure is actually displayed

Practical 5 ANOVA (Analysis of Variance)

∙ Perform one-way ANOVA to compare means across multiple groups.
∙ Conduct post-hoc tests to identify significant differences between group means.
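Before calling the library, it helps to see what the one-way F statistic is: the ratio of between-group to within-group mean squares. A minimal sketch, using the same four groups as the code in this practical, cross-checked against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

groups = [
    np.array([23, 25, 29, 34, 30]),
    np.array([19, 20, 22, 25, 24]),
    np.array([15, 18, 20, 21, 17]),
    np.array([28, 24, 26, 30, 29]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group sum of squares (k - 1 degrees of freedom)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares (N - k degrees of freedom)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

k, N = len(groups), len(all_values)
f_manual = (ss_between / (k - 1)) / (ss_within / (N - k))

f_lib, _ = stats.f_oneway(*groups)
print(f_manual, f_lib)  # the two values agree
```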
Code :

import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

group1 = [23, 25, 29, 34, 30]
group2 = [19, 20, 22, 25, 24]
group3 = [15, 18, 20, 21, 17]
group4 = [28, 24, 26, 30, 29]

all_data = group1 + group2 + group3 + group4
group_labels = (['Group1'] * len(group1) + ['Group2'] * len(group2) +
                ['Group3'] * len(group3) + ['Group4'] * len(group4))

f_statistic, p_value = stats.f_oneway(group1, group2, group3, group4)
print("One-way ANOVA:")
print("F-statistic:", f_statistic)
print("P-value:", p_value)

tukey_results = pairwise_tukeyhsd(all_data, group_labels)
print("\nTukey-Kramer post-hoc test:")
print(tukey_results)

Practical 6 Aim:- Regression and Its Types
- Implement simple linear regression using a dataset.
- Explore and interpret the regression model coefficients and goodness-of-fit measures.
- Extend the analysis to multiple linear regression and assess the impact of additional predictors.

A] Implement simple linear regression using a dataset

1)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

2)
data = {
    'X': np.arange(1, 21),  # Feature
    'Y': np.array([2.3, 2.5, 3.1, 3.6, 4.0, 4.5, 5.2, 5.8, 6.1, 6.9,
                   7.5, 8.0, 8.6, 9.1, 9.8, 10.3, 11.0, 11.5, 12.1, 12.8])  # Target
}
df = pd.DataFrame(data)

3)
X = df[['X']]
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4)
model = LinearRegression()
model.fit(X_train, y_train)

5)
y_pred = model.predict(X_test)

6)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
o/p:-
Mean Squared Error: 0.08604929198462502
R^2 Score: 0.9952843243192424

7)
plt.scatter(X_test, y_test, color='blue', label='Actual values')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression line')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()

B] Explore and interpret the regression model coefficients and goodness-of-fit measures.

1)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

2)
data = {
    'X': np.arange(1, 21),  # Feature
    'Y': np.array([2.3, 2.5, 3.1, 3.6, 4.0, 4.5, 5.2, 5.8, 6.1, 6.9,
                   7.5, 8.0, 8.6, 9.1, 9.8, 10.3, 11.0, 11.5, 12.1, 12.8])  # Target
}
df = pd.DataFrame(data)

3)
# The train/test split was missing here in the original; it is needed before fitting
X = df[['X']]
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

4)
intercept = model.intercept_
coefficient = model.coef_[0]
print(f'Intercept: {intercept}')
print(f'Coefficient: {coefficient}')
o/p:-
Intercept: 1.1707887196501297
Coefficient: 0.5743779218820689

5)
y_pred = model.predict(X_test)

6)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
o/p:-
Mean Squared Error: 0.08604929198462502
R^2 Score: 0.9952843243192424

7)
print("Interpretation:")
print(f'The regression equation is Y = {intercept:.2f} + {coefficient:.2f}X')
print('The coefficient represents the expected change in Y for a one-unit increase in X.')
print(f'An R^2 score of {r2:.2f} indicates the proportion of variance in Y explained by X.')
o/p:-
Interpretation:
The regression equation is Y = 1.17 + 0.57X
The coefficient represents the expected change in Y for a one-unit increase in X.
An R^2 score of 1.00 indicates the proportion of variance in Y explained by X.

8)
plt.scatter(X_test, y_test, color='blue', label='Actual values')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression line')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()

C] Extend the analysis to multiple linear regression and assess the impact of additional predictors.
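When assessing whether an added predictor actually helps, note that plain R^2 can only stay equal or rise as predictors are added; adjusted R^2 penalizes extra terms. A minimal sketch of the formula with illustrative numbers (not taken from the model outputs in this practical):

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Example: the same R^2 = 0.995 on 20 samples
print(adjusted_r2(0.995, 20, 1))  # one predictor
print(adjusted_r2(0.995, 20, 2))  # two predictors: slightly lower for the same fit
```

If the second predictor does not raise R^2 enough to offset the penalty, adjusted R^2 drops, signalling that the predictor adds little.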
1)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

2)
data = {
    'X1': np.arange(1, 21),              # Feature 1
    'X2': np.random.uniform(1, 10, 20),  # Additional predictor (unseeded, so the exact outputs below vary run to run)
    'Y': np.array([2.3, 2.5, 3.1, 3.6, 4.0, 4.5, 5.2, 5.8, 6.1, 6.9,
                   7.5, 8.0, 8.6, 9.1, 9.8, 10.3, 11.0, 11.5, 12.1, 12.8])  # Target
}
df = pd.DataFrame(data)

3)
X = df[['X1', 'X2']]
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4)
model = LinearRegression()
model.fit(X_train, y_train)

5)
intercept = model.intercept_
coefficients = model.coef_
print(f'Intercept: {intercept}')
print(f'Coefficients: {coefficients}')
o/p:-
Intercept: 1.1845776949791853
Coefficients: [ 0.574281   -0.00331089]

6)
y_pred = model.predict(X_test)

7)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
o/p:-
Mean Squared Error: 0.08321736352773812
R^2 Score: 0.9954395197409104

8)
print("Interpretation:")
print(f'The regression equation is Y = {intercept:.2f} + {coefficients[0]:.2f}X1 + {coefficients[1]:.2f}X2')
print('Each coefficient represents the expected change in Y for a one-unit increase in the corresponding predictor, holding the other predictor constant.')
print(f'An R^2 score of {r2:.2f} indicates the proportion of variance in Y explained by X1 and X2.')
o/p:-
Interpretation:
The regression equation is Y = 1.18 + 0.57X1 + -0.00X2
Each coefficient represents the expected change in Y for a one-unit increase in the corresponding predictor, holding the other predictor constant.
An R^2 score of 1.00 indicates the proportion of variance in Y explained by X1 and X2.

9)
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_test['X1'], X_test['X2'], y_test, color='blue', label='Actual values')
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('Y')
ax.set_title('Multiple Linear Regression')
plt.legend()
plt.show()

Practical 7 Aim:- Logistic Regression and Decision Tree
- Build a logistic regression model to predict a binary outcome.
- Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
- Construct a decision tree model and interpret the decision rules for classification.

1)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

2)
df = pd.read_csv('diabetes.csv')
print(df.head())

3)
X = df.drop("Outcome", axis=1)
y = df["Outcome"]

4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

5)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

6)
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

7)
y_pred = log_reg.predict(X_test)

8)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
o/p:-
Accuracy: 0.7359

9)
print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)
o/p:-
Confusion Matrix:
[[120  31]
 [ 30  50]]

10)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

11)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix Heatmap")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

C] Construct a decision tree model and interpret the decision rules for classification.
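The precision and recall that `classification_report` prints for either model can be derived by hand from a confusion matrix. A sketch using the logistic regression matrix shown above ([[120 31], [30 50]]):

```python
# Confusion matrix from the logistic regression output above:
# rows = actual (0, 1), columns = predicted (0, 1)
tn, fp = 120, 31
fn, tp = 30, 50

precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

These should match the class-1 row of the classification report (up to rounding).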
1)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

2)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)  # was lowercase x; the name must match the X used below

3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4)
tree_model = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_model.fit(X_train, y_train)

5)
y_pred = tree_model.predict(X_test)

6)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:\n', classification_report(y_test, y_pred))

7)
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

8)
plt.figure(figsize=(12, 8))
plot_tree(tree_model, feature_names=[f'Feature {i}' for i in range(X.shape[1])],
          class_names=['class 0', 'class 1'], filled=True)
plt.title('Decision Tree Visualization')
plt.show()

9)
# Interpret the decision rules as text (export_text was imported above for this)
rules = export_text(tree_model, feature_names=[f'Feature {i}' for i in range(X.shape[1])])
print(rules)

Practical 8: K-Means Clustering

# Import required packages
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv('Wholesale customers data.csv')
data.head()

categorical_features = ['Channel', 'Region']
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
data[continuous_features].describe()

# One-hot encode the categorical columns
for col in categorical_features:
    dummies = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, dummies], axis=1)
    data.drop(col, axis=1, inplace=True)
data.head()

# Scale all features to [0, 1]
mms = MinMaxScaler()
mms.fit(data)
data_transformed = \
mms.transform(data)

# Elbow method: fit k-means for a range of k and record the inertia
Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)  # explicitly set n_init
    km.fit(data_transformed)
    Sum_of_squared_distances.append(km.inertia_)

# Plot Elbow Method
plt.figure(figsize=(8, 5))
plt.plot(K, Sum_of_squared_distances, 'bo-', markersize=5, label='SSD')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of Squared Distances (Inertia)')
plt.title('Elbow Method for Optimal k')
plt.legend()
plt.show()

Practical 9: Principal Component Analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Step 1: Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Apply PCA without specifying components to analyze variance
pca = PCA()
X_pca_full = pca.fit_transform(X_scaled)

# Step 4: Plot explained variance ratio
explained_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio,
         marker='o', linestyle='--', color='b')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. '
          'Number of Components')
plt.axhline(y=0.95, color='r', linestyle='-')  # 95% variance threshold
plt.grid(True)
plt.show()

# Step 5: Select optimal number of components (e.g., 2 for visualization)
optimal_components = np.argmax(explained_variance_ratio >= 0.95) + 1
print(f"Optimal number of components to retain 95% variance: {optimal_components}")
o/p:-
Optimal number of components to retain 95% variance: 2

# Step 6: Apply PCA with selected components
pca_optimal = PCA(n_components=optimal_components)
X_pca_optimal = pca_optimal.fit_transform(X_scaled)

# Step 7: Visualize data in reduced space (only if reduced to 2D)
if optimal_components == 2:
    plt.figure(figsize=(8, 6))
    plt.scatter(X_pca_optimal[:, 0], X_pca_optimal[:, 1], c=y, cmap='viridis', edgecolor='k')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.title(f'PCA Visualization with {optimal_components} Components')
    plt.colorbar(label='Target Classes')
    plt.show()

Practical 10 Aim : Data Visualization and Storytelling
∙ Create meaningful visualizations using data visualization tools.
∙ Combine multiple visualizations to tell a compelling data story.
∙ Present the findings and insights in a clear and concise manner.
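One storytelling device is annotating the key point of a chart so the reader sees the insight immediately. A self-contained sketch with made-up monthly sales (the numbers and the "campaign spike" label are illustrative, not from the CSV used below):

```python
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [100, 120, 180, 150, 140, 160]

fig, ax = plt.subplots()
ax.plot(months, sales, marker='o')

# Call out the peak month with an annotation arrow
peak = sales.index(max(sales))
ax.annotate('March campaign spike', xy=(peak, sales[peak]),
            xytext=(peak + 0.8, sales[peak] - 10),
            arrowprops=dict(arrowstyle='->'))
ax.set_title('Sales Trend with the Key Insight Called Out')
ax.set_xlabel('Month')
ax.set_ylabel('Sales')
plt.show()
```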
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('income (1).csv')
df = pd.DataFrame(data)

# Ensure Sales is numeric before plotting (the original float-to-string conversion broke the line plot)
df["Sales"] = pd.to_numeric(df["Sales"], errors="coerce")

plt.figure()
plt.plot(df["Month"], df["Sales"], marker='o')
plt.title("Sales Trend Over Time")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()

plt.figure()
sns.scatterplot(x="Marketing_Spend", y="Sales", data=df)
plt.title("Marketing Spend vs Sales")
plt.show()

# Combine multiple views into one figure to tell the story
plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
plt.plot(df["Month"], df["Sales"], marker='o')
plt.title("Sales Trend")
plt.subplot(2, 2, 2)
sns.barplot(x="Region", y="Sales", data=df)
plt.title("Sales by Region")
plt.subplot(2, 2, 3)
sns.scatterplot(x="Marketing_Spend", y="Sales", data=df)
plt.title("Marketing Spend vs Sales")
plt.tight_layout()
plt.show()
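The story told by the scatter plot ("does marketing spend drive sales?") can also be stated numerically with a correlation coefficient. A sketch on made-up data, since the columns of income (1).csv are not shown here:

```python
import pandas as pd

# Illustrative data standing in for the Marketing_Spend and Sales columns
df = pd.DataFrame({
    'Marketing_Spend': [10, 20, 30, 40, 50, 60],
    'Sales': [100, 130, 170, 200, 240, 260],
})

corr = df['Marketing_Spend'].corr(df['Sales'])  # Pearson correlation
print(f"Correlation between marketing spend and sales: {corr:.2f}")
```

A single number like this makes a concise closing line for the presentation, alongside the charts.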