Packt_DataPro: Bringing data-driven insights to your fingertips.
Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 3 (Mar 7)
👋 Hey,

As data science professionals, we are constantly seeking effective tools and automation techniques that can streamline the process of building accurate prediction models. To help with this quest, our author Adi has come up with a straightforward solution using Google Cloud's BigQueryML AutoML Classifier. Having examined data preparation and feature engineering in the previous weeks, we will now discuss the final part of the process: model training.

For those of you who may not be familiar with our expert author, Adi Wijaya is a Strategic Cloud Data Engineer at Google. Throughout this tutorial, he elaborates on each step. Whether you are new to the tool or looking to expand your knowledge, this part promises valuable insights and tips to optimize your workflow. Get ready as we conclude our deep dive into the BigQueryML AutoML Classifier with the final installment, Part 3 in the series!

If you have not read Parts 1 & 2, you can find them here:

- Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 1
- Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 2

If you find the newsletter useful, share it with your friends! And if you are interested in sharing ideas and suggestions to foster the growth of the professional data community, this survey is for you. Consider it your space to share your thoughts! Jump on in!

[TELL US WHAT YOU THINK]

Cheers,
Merlyn Shelley
Associate Editor-in-Chief, Packt

Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 3

By Adi Wijaya

Disclosure: At the time I write this blog, I am employed by Google Cloud; however, the opinions expressed here are mine alone and do not reflect the views of my employer.

Model Training

This is the final step: model training. Here we will be using the BQML AutoML function. On top of that, for the purpose of comparison, we will also create models using Logistic Regression and XGBoost. Here are the scripts to create the three models:

Logistic Regression Script:

```sql
## Use Logistic Regression -- quick train (less than 2 minutes) ##
create or replace model `bqml-project-370802.kaggleworkshops_shopper.model_v1`
transform(
    ML.ONE_HOT_ENCODER(f00_top_category, 'none', 85) OVER() f00
    ,f01_offer_value
    ,f02_offer_rule
    ,ML.STANDARD_SCALER(f03_trx_cnt) OVER() f03
    ,ML.STANDARD_SCALER(f04_total_quantity) OVER() f04
    ,ML.STANDARD_SCALER(f05_total_amt) OVER() f05
    ,f06_weekend_pct
    ,ML.STANDARD_SCALER(f07_same_brand_trx_cnt) OVER() f07
    ,ML.STANDARD_SCALER(f08_same_category_trx_cnt) OVER() f08
    ,ML.STANDARD_SCALER(f09_top_category_trx_cnt) OVER() f09
    ,f10_region_repeat_prob
    ,ML.IMPUTER(f11_chain_repeat_prob, 'median') OVER() f11
    ,general_split_col
    ,label
)
OPTIONS(MODEL_TYPE = 'LOGISTIC_REG'
    ,INPUT_LABEL_COLS = ['label']
    ,DATA_SPLIT_METHOD = 'CUSTOM'
    ,DATA_SPLIT_COL = 'general_split_col'
    ,L1_REG = 0.2
    ,L2_REG = 0.1
    ,AUTO_CLASS_WEIGHTS = TRUE
    ,OPTIMIZE_STRATEGY = 'BATCH_GRADIENT_DESCENT'
) as
select * from `bqml-project-370802.kaggleworkshops_shopper.feat_train`
```

XGBoost Script:

```sql
## Use XGBoost -- mid train (around 10 minutes) ##
create or replace model `bqml-project-370802.kaggleworkshops_shopper.model_v2`
transform(
    f00_top_category
    ,f01_offer_value
    ,f02_offer_rule
    ,ML.STANDARD_SCALER(f03_trx_cnt) OVER() f03
    ,ML.STANDARD_SCALER(f04_total_quantity) OVER() f04
    ,ML.STANDARD_SCALER(f05_total_amt) OVER() f05
    ,f06_weekend_pct
    ,ML.STANDARD_SCALER(f07_same_brand_trx_cnt) OVER() f07
    ,ML.STANDARD_SCALER(f08_same_category_trx_cnt) OVER() f08
    ,ML.STANDARD_SCALER(f09_top_category_trx_cnt) OVER() f09
    ,f10_region_repeat_prob
    ,ML.IMPUTER(f11_chain_repeat_prob, 'median') OVER() f11
    ,general_split_col
    ,label
)
OPTIONS(MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER'
    ,INPUT_LABEL_COLS = ['label']
    ,DATA_SPLIT_METHOD = 'CUSTOM'
    ,DATA_SPLIT_COL = 'general_split_col'
    ,MAX_TREE_DEPTH = 5
    ,SUBSAMPLE = 0.75
    ,AUTO_CLASS_WEIGHTS = TRUE
) as
select * from `bqml-project-370802.kaggleworkshops_shopper.feat_train`
```
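As a quick aside (this check is not part of the original walkthrough), the two baseline models can be compared side by side with BQML's ML.EVALUATE function. A minimal sketch, assuming we want to use the evaluation split defined by general_split_col:

```sql
-- Sketch only: compare the two baseline models on held-out data.
-- With no input data argument, ML.EVALUATE uses the evaluation split
-- that was reserved during training via general_split_col.
select 'logistic_reg' as model_name, roc_auc, log_loss, f1_score
from ML.EVALUATE(MODEL `bqml-project-370802.kaggleworkshops_shopper.model_v1`)
union all
select 'xgboost' as model_name, roc_auc, log_loss, f1_score
from ML.EVALUATE(MODEL `bqml-project-370802.kaggleworkshops_shopper.model_v2`)
```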
AutoML

The key difference between the logistic and tree-based models and the AutoML model is that the latter does not need a TRANSFORM stage. The important option in AutoML is budget hours: how much we are willing to pay, and how long we are willing to wait, before the model stops searching for the best combination. During the given time, AutoML will run multiple model algorithms, feature engineering steps, and hyperparameter tuning rounds to find the best model. By default, the value is 1 hour.

For this experiment, I'll try using 3 hours, as shown in this script:

```sql
## Use AutoML ##
create or replace model `bqml-project-370802.kaggleworkshops_shopper.model_v3`
OPTIONS(MODEL_TYPE = 'automl_classifier'
    ,INPUT_LABEL_COLS = ['label']
    ,BUDGET_HOURS = 3
    ,OPTIMIZATION_OBJECTIVE = 'MAXIMIZE_AU_ROC'
    ,DATA_SPLIT_COL = 'automl_split_col'
) as
select
    ML.LABEL_ENCODER(cast(f00_top_category as string), 100) OVER() f00
    ,f01_offer_value
    ,f02_offer_rule
    ,ML.STANDARD_SCALER(f03_trx_cnt) OVER() f03
    ,ML.STANDARD_SCALER(f04_total_quantity) OVER() f04
    ,ML.STANDARD_SCALER(f05_total_amt) OVER() f05
    ,f06_weekend_pct
    ,ML.STANDARD_SCALER(f07_same_brand_trx_cnt) OVER() f07
    ,ML.STANDARD_SCALER(f08_same_category_trx_cnt) OVER() f08
    ,ML.STANDARD_SCALER(f09_top_category_trx_cnt) OVER() f09
    ,f10_region_repeat_prob
    ,f11_chain_repeat_prob
    ,automl_split_col
    ,label
from `bqml-project-370802.kaggleworkshops_shopper.feat_train`
```

Note that with AutoML we do not need to create train and test datasets manually: BigQuery ML splits the input data for us (here via the automl_split_col column). When we run the query, model training starts automatically, and we can follow its progress in the BigQuery UI:

[Image: Model results execution details]

After training finishes, the model appears in the dataset on the left-hand side, under 'Models'.
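If you prefer SQL over the console for monitoring, similar information can be pulled with ML.TRAINING_INFO. This is a sketch, not from the original post, and for AutoML models the output may be reduced to a single summary row:

```sql
-- Sketch only: inspect training statistics from SQL instead of the UI.
-- For AutoML models this may return a single summary row rather than
-- one row per iteration.
select iteration, loss, eval_loss, duration_ms
from ML.TRAINING_INFO(MODEL `bqml-project-370802.kaggleworkshops_shopper.model_v3`)
order by iteration
```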
Model Evaluation

Once the model is trained, we can evaluate its performance. BigQuery calculates the relevant performance metrics depending on the model type (precision, recall, accuracy, F1 score, log loss, or ROC AUC). We can preview them on the EVALUATION tab in the BigQuery console:

[Image: Model evaluation]

The console also shows very useful visualizations to better understand the model's accuracy:

[Image: Model evaluation visualized in BigQuery]

Lastly, it also comes with a visualized confusion matrix, so we don't need to write SQL manually to calculate it:

[Image: Confusion Matrix in BigQuery]

At this point, you can decide whether to proceed with the trained model, or to go back to the model training step and improve performance by defining a more complex model.
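If you ever do need the matrix programmatically, BQML also exposes it through the ML.CONFUSION_MATRIX function. A minimal sketch, not shown in the original post:

```sql
-- Sketch only: the same confusion matrix via SQL. With no input data
-- supplied, the model's held-out evaluation split is used, and the
-- classification threshold defaults to 0.5.
select *
from ML.CONFUSION_MATRIX(MODEL `bqml-project-370802.kaggleworkshops_shopper.model_v1`)
```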
Model Predict

In this final sub-step, let's run the prediction function on the test data from Kaggle. Using the ML.PREDICT function, we will get the row id and the probability. Here are the example syntaxes and results for the three models:

Logistic Regression

[Image: Example prediction using Logistic Regression]

XGBoost

[Image: Example prediction using XGBoost]

AutoML

[Image: Example prediction using AutoML]
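The original post showed these prediction queries as screenshots, which are not reproduced here. As an illustration only, a query along the following lines would return one probability per row; the feat_test table name, the id column, and the assumption that the test features were prepared the same way as the training select are all mine, not the author's:

```sql
-- Illustrative sketch only (the original queries were screenshots).
-- Assumes a `feat_test` table whose columns mirror the training
-- select, and an `id` column identifying each Kaggle row.
select id, p.prob as predicted_probability
from ML.PREDICT(
       MODEL `bqml-project-370802.kaggleworkshops_shopper.model_v3`,
       (select * from `bqml-project-370802.kaggleworkshops_shopper.feat_test`)),
     unnest(predicted_label_probs) as p
where p.label = 1
```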
We can save the results table as a CSV file, ready for submission.

Submission

Back on the Kaggle website, we submit using the Late Submission button. Here are the scores!

[Image: Kaggle submission scores]

The experiment is done successfully! We have finished all the end-to-end steps needed to submit a solution for a Kaggle challenge using BQML!

Understanding the outcome

As can be seen in the results, AutoML scored 0.51434 on the private leaderboard, compared to the competition winner's 0.62703. It is not the best, but it is not that bad either! With a small effort in feature engineering and hyperparameter tuning, we achieved the expected results. Since this is a one-time submission, the feature engineering part is crucial. Feature engineering requires a lot of creativity and business understanding, which is probably why many people love being Data Scientists. In my personal view, this part will not be replaced by any automation tool.

Wrap Up

BQML AutoML is a great starting point for machine learning projects since it greatly reduces development time while producing high-quality models. With BQML AutoML, baseline accuracy can be achieved quickly, facilitating faster decision-making and more precise project direction.

See you next time!