Packt_DataPro: Bringing data-driven insights to your fingertips.
Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 3 (Mar 7)
👋 Hey,

As data science professionals, we are constantly seeking effective tools and automation techniques that can streamline the process of building accurate prediction models. To help with this quest, our author Adi has come up with a straightforward solution using Google Cloud's BigQueryML AutoML Classifier. Having examined data preparation and feature engineering in the previous weeks, we will now discuss the final part of the process: model training.

For those of you who may not be familiar with our expert author, Adi Wijaya is a Strategic Cloud Data Engineer at Google. Throughout this tutorial, he elaborates on each step. Whether you are new to the tool or looking to expand your knowledge, this part promises valuable insights and tips to optimize your workflow. Get ready as we conclude our deep dive into the BigQueryML AutoML Classifier with the final installment, Part 3 in the series!

If you have not read Parts 1 & 2, you can find them here:

- Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 1
- Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 2

If you find the newsletter useful, share it with your friends! And if you are interested in sharing ideas and suggestions to foster the growth of the professional data community, this survey is for you. Consider it your space to share your thoughts! Jump on in!

[TELL US WHAT YOU THINK]

Cheers,
Merlyn Shelley
Associate Editor-in-Chief, Packt

Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 3

By Adi Wijaya

Disclosure: At the time I write this blog, I am employed by Google Cloud; however, the opinions expressed here are mine alone and do not reflect the views of my employer.

Model Training

This is the final step: model training. Here we will be using the BQML AutoML function. On top of that, for the purpose of comparison, we will also create models using Logistic Regression and XGBoost. Here are the scripts to create the three models:

Logistic Regression Script:

```sql
## Use Logistic Regression -- quick train (less than 2 minutes) ##
create or replace model `bqml-project-370802.kaggleworkshops_shopper.model_v1`
transform(
    ML.ONE_HOT_ENCODER(f00_top_category, 'none', 85) OVER() f00
    ,f01_offer_value
    ,f02_offer_rule
    ,ML.STANDARD_SCALER(f03_trx_cnt) OVER() f03
    ,ML.STANDARD_SCALER(f04_total_quantity) OVER() f04
    ,ML.STANDARD_SCALER(f05_total_amt) OVER() f05
    ,f06_weekend_pct
    ,ML.STANDARD_SCALER(f07_same_brand_trx_cnt) OVER() f07
    ,ML.STANDARD_SCALER(f08_same_category_trx_cnt) OVER() f08
    ,ML.STANDARD_SCALER(f09_top_category_trx_cnt) OVER() f09
    ,f10_region_repeat_prob
    ,ML.IMPUTER(f11_chain_repeat_prob, 'median') OVER() f11
    ,general_split_col
    ,label
)
OPTIONS(MODEL_TYPE = 'LOGISTIC_REG'
    ,INPUT_LABEL_COLS = ['label']
    ,DATA_SPLIT_METHOD = 'CUSTOM'
    ,DATA_SPLIT_COL = 'general_split_col'
    ,L1_REG = 0.2
    ,L2_REG = 0.1
    ,AUTO_CLASS_WEIGHTS = TRUE
    ,OPTIMIZE_STRATEGY = 'BATCH_GRADIENT_DESCENT'
) as
select * from `bqml-project-370802.kaggleworkshops_shopper.feat_train`
```

XGBoost Script:

```sql
## Use XGBoost -- mid train (around 10 minutes) ##
create or replace model `bqml-project-370802.kaggleworkshops_shopper.model_v2`
transform(
    f00_top_category
    ,f01_offer_value
    ,f02_offer_rule
    ,ML.STANDARD_SCALER(f03_trx_cnt) OVER() f03
    ,ML.STANDARD_SCALER(f04_total_quantity) OVER() f04
    ,ML.STANDARD_SCALER(f05_total_amt) OVER() f05
    ,f06_weekend_pct
    ,ML.STANDARD_SCALER(f07_same_brand_trx_cnt) OVER() f07
    ,ML.STANDARD_SCALER(f08_same_category_trx_cnt) OVER() f08
    ,ML.STANDARD_SCALER(f09_top_category_trx_cnt) OVER() f09
    ,f10_region_repeat_prob
    ,ML.IMPUTER(f11_chain_repeat_prob, 'median') OVER() f11
    ,general_split_col
    ,label
)
OPTIONS(MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER'
    ,INPUT_LABEL_COLS = ['label']
    ,DATA_SPLIT_METHOD = 'CUSTOM'
    ,DATA_SPLIT_COL = 'general_split_col'
    ,MAX_TREE_DEPTH = 5
    ,SUBSAMPLE = 0.75
    ,AUTO_CLASS_WEIGHTS = TRUE
) as
select * from `bqml-project-370802.kaggleworkshops_shopper.feat_train`
```
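As a quick aside (this check is not part of the original walkthrough), the two baseline models can be compared side by side with BQML's ML.EVALUATE function. A minimal sketch, assuming we want to use the evaluation split defined by general_split_col:

```sql
-- Sketch only: compare the two baseline models on held-out data.
-- With no input data argument, ML.EVALUATE uses the evaluation split
-- that was reserved during training via general_split_col.
select 'logistic_reg' as model_name, roc_auc, log_loss, f1_score
from ML.EVALUATE(MODEL `bqml-project-370802.kaggleworkshops_shopper.model_v1`)
union all
select 'xgboost' as model_name, roc_auc, log_loss, f1_score
from ML.EVALUATE(MODEL `bqml-project-370802.kaggleworkshops_shopper.model_v2`)
```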
AutoML

The key difference between the logistic and tree-based models and the AutoML model is that the latter does not need a TRANSFORM stage. The important option in AutoML is budget hours: how much we are willing to pay, and how long we are willing to wait, before the model stops searching for the best combination. During the given time, AutoML will run multiple model algorithms, feature engineering steps, and hyperparameter tuning rounds to find the best model. By default, the value is 1 hour.

For this experiment, I'll try using 3 hours, as shown in this script:

```sql
## Use AutoML ##
create or replace model `bqml-project-370802.kaggleworkshops_shopper.model_v3`
OPTIONS(MODEL_TYPE = 'automl_classifier'
    ,INPUT_LABEL_COLS = ['label']
    ,BUDGET_HOURS = 3
    ,OPTIMIZATION_OBJECTIVE = 'MAXIMIZE_AU_ROC'
    ,DATA_SPLIT_COL = 'automl_split_col'
) as
select
    ML.LABEL_ENCODER(cast(f00_top_category as string), 100) OVER() f00
    ,f01_offer_value
    ,f02_offer_rule
    ,ML.STANDARD_SCALER(f03_trx_cnt) OVER() f03
    ,ML.STANDARD_SCALER(f04_total_quantity) OVER() f04
    ,ML.STANDARD_SCALER(f05_total_amt) OVER() f05
    ,f06_weekend_pct
    ,ML.STANDARD_SCALER(f07_same_brand_trx_cnt) OVER() f07
    ,ML.STANDARD_SCALER(f08_same_category_trx_cnt) OVER() f08
    ,ML.STANDARD_SCALER(f09_top_category_trx_cnt) OVER() f09
    ,f10_region_repeat_prob
    ,f11_chain_repeat_prob
    ,automl_split_col
    ,label
from `bqml-project-370802.kaggleworkshops_shopper.feat_train`
```

Note that with AutoML we do not need to create train and test datasets manually: BigQuery ML splits the input data for us (here via the automl_split_col column). When we run the query, model training starts automatically, and we can follow its progress in the BigQuery UI:

[Image: Model results execution details]

After training finishes, the model appears in the dataset on the left-hand side, under 'Models'.
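If you prefer SQL over the console for monitoring, similar information can be pulled with ML.TRAINING_INFO. This is a sketch, not from the original post, and for AutoML models the output may be reduced to a single summary row:

```sql
-- Sketch only: inspect training statistics from SQL instead of the UI.
-- For AutoML models this may return a single summary row rather than
-- one row per iteration.
select iteration, loss, eval_loss, duration_ms
from ML.TRAINING_INFO(MODEL `bqml-project-370802.kaggleworkshops_shopper.model_v3`)
order by iteration
```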
Model Evaluation

Once the model is trained, we can evaluate its performance. BigQuery calculates the relevant performance metrics depending on the model type (precision, recall, accuracy, F1 score, log loss, or ROC AUC). We can preview them on the EVALUATION tab in the BigQuery console:

[Image: Model evaluation]

The console also shows very useful visualizations to better understand the model's accuracy:

[Image: Model evaluation visualized in BigQuery]

Lastly, it also comes with a visualized confusion matrix, so we don't need to write SQL manually to calculate it:

[Image: Confusion Matrix in BigQuery]

At this point, you can decide whether to proceed with the trained model, or to go back to the model training step and improve performance by defining a more complex model.
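If you ever do need the matrix programmatically, BQML also exposes it through the ML.CONFUSION_MATRIX function. A minimal sketch, not shown in the original post:

```sql
-- Sketch only: the same confusion matrix via SQL. With no input data
-- supplied, the model's held-out evaluation split is used, and the
-- classification threshold defaults to 0.5.
select *
from ML.CONFUSION_MATRIX(MODEL `bqml-project-370802.kaggleworkshops_shopper.model_v1`)
```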
Model Predict

In this final sub-step, let's run the prediction function on the test data from Kaggle. Using the ML.PREDICT function, we will get the row id and the probability. Here are the example syntaxes and results for the three models:

Logistic Regression

[Image: Example prediction using Logistic Regression]

XGBoost

[Image: Example prediction using XGBoost]

AutoML

[Image: Example prediction using AutoML]
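The original post showed these prediction queries as screenshots, which are not reproduced here. As an illustration only, a query along the following lines would return one probability per row; the feat_test table name, the id column, and the assumption that the test features were prepared the same way as the training select are all mine, not the author's:

```sql
-- Illustrative sketch only (the original queries were screenshots).
-- Assumes a `feat_test` table whose columns mirror the training
-- select, and an `id` column identifying each Kaggle row.
select id, p.prob as predicted_probability
from ML.PREDICT(
       MODEL `bqml-project-370802.kaggleworkshops_shopper.model_v3`,
       (select * from `bqml-project-370802.kaggleworkshops_shopper.feat_test`)),
     unnest(predicted_label_probs) as p
where p.label = 1
```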
We can save the results table as a CSV file, ready for submission.

Submission

Back on the Kaggle website, we submit using the Late Submission button. Here are the scores!

[Image: Kaggle submission scores]

The experiment is done successfully! We have finished all the end-to-end steps needed to submit a solution for a Kaggle challenge using BQML!

Understanding the outcome

As can be seen in the results, AutoML scored 0.51434 on the private leaderboard, compared to the competition winner's 0.62703. It is not the best, but it is not that bad either! With a small effort in feature engineering and hyperparameter tuning, we achieved the expected results. Since this is a one-time submission, the feature engineering part is crucial. Feature engineering requires a lot of creativity and business understanding, which is probably why many people love being Data Scientists. In my personal view, this part will not be replaced by any automation tool.

Wrap Up

BQML AutoML is a great starting point for machine learning projects since it greatly reduces development time while producing high-quality models. With BQML AutoML, baseline accuracy can be achieved quickly, facilitating faster decision-making and more precise project direction.

See you next time!