Packt_DataPro: Bringing data-driven insights to your fingertips.

February 28, 2023 | DataPro ~ Special Edition #12

Hey {NAME},

As professionals in the field of data science, we are constantly looking for effective tools and automation techniques that streamline the process of building accurate prediction models. To help you with that quest, our author Adi Wijaya has put together a straightforward solution using Google Cloud's BigQueryML AutoML Classifier. Having covered how to prepare the data last week, we'll examine the feature engineering part of the process this week, with the remaining sections covered in upcoming episodes.

For those of you who may not be familiar with our expert author, Adi Wijaya is a Strategic Cloud Data Engineer at Google. Throughout this tutorial, he elaborates on each step. Whether you are new to the tool or looking to expand your knowledge, this part promises valuable insights and tips to optimize your workflow. Get ready to dig in as we continue our deep dive into the BigQueryML AutoML Classifier with Part 2 of the series! If you haven't read Part 1 yet, you can find it here:

- [Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 1](

If you find the newsletter useful, share it with your friends! And if you are interested in sharing ideas and suggestions to foster the growth of the professional data community, this survey is for you. Consider it your space to share your thoughts. Jump on in!

[TELL US WHAT YOU THINK](

Cheers,
Merlyn Shelley
Associate Editor in Chief, Packt

Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 2 - By Adi Wijaya

Disclosure: At the time of writing this blog, I am employed by Google Cloud; however, the opinions expressed here are mine alone and do not reflect the views of my employer.

Upload Kaggle files to the GCS bucket

Click the bucket name that you've created, and then the UPLOAD FILES button.

The UPLOAD FILES button

After browsing for and selecting all of the Kaggle files on your system, they will appear in the bucket as shown below:

All files successfully uploaded to the bucket

Next, we need to create a BigQuery dataset.

Creating a BigQuery Dataset

Move to the BigQuery console by choosing BigQuery in the Navigation Bar. We need a dataset to hold the tables and ML models, so let's create one by clicking the "Create Dataset" drop-down option.

BigQuery console, create dataset

In my case, I created a dataset called kaggleworkshops_shopper under the project ID bqml-project-370802 in the US location.

Create BigQuery tables

Now you can create tables in your dataset. Right-click your dataset, choose CREATE TABLE, and let's use the transactions data as an example.

Create a table in the dataset

Since our data is already in the bucket, we choose Google Cloud Storage as the data source, pick the CSV format, and tick Auto Detect for the schema.

BigQuery table creation form

Because the file has a header, we skip the first row during the load, and then click CREATE TABLE.

After clicking the CREATE TABLE button, the load job from GCS to BigQuery starts; the process may take around 30 minutes. This only applies to the transactions table, since that file is quite big. If successful, you can see the table below:

Transaction table preview

Continue with the same table-creation steps for all the other files. After completing this step and creating all the tables that you want to analyze, you'll see your dataset in the left sidebar.

Final check of the dataset

With that, our data is prepared and we are ready for feature engineering. Note that these steps are unnecessary if your organization's main data warehouse is already BigQuery; in this tutorial we only did them because the source files come from Kaggle's website.

Feature Engineering

In this step, we will create the features table. Here are the three sub-steps of the feature engineering step:

- Split Dataset
- Creating the features using SQL
- Using BQML Transform

The first step is to split the dataset into Train, Test, and Validation.

Split Dataset

This sub-step is not strictly required for AutoML, because splitting is already part of its automatic mechanism. But since we later want to compare the results against other model algorithms, let's split the dataset manually using a BigQuery function.

In the BigQuery console, click COMPOSE A NEW QUERY and write SQL that splits the dataset into TRAIN, TEST, and VALIDATE from "train_history"; we'll name the resulting table "split_dim". (Check out: [How data splits are used](

Split table syntax

The query creates a split_dim table with one column named split_col, which labels each row as TRAIN, TEST, or VALIDATE, as shown in the example screenshot below:

Split table result

After this step, we will have 120K customers as TRAIN data, 23K customers as TEST data, and 15K customers as VALIDATE data.
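The split query itself only appears as a screenshot in the original post. For reference, here is a minimal sketch of one way to build such a table with a deterministic hash split in BigQuery SQL. The dataset name, the id column, and the split thresholds are assumptions, so adjust them to your own setup:

```sql
-- Hedged sketch of a hash-based split table (not the exact query from the post).
-- Assumes train_history was loaded into the kaggleworkshops_shopper dataset and
-- that the customer identifier column is named id, as in the Kaggle files.
CREATE OR REPLACE TABLE `kaggleworkshops_shopper.split_dim` AS
SELECT
  id,
  CASE
    -- FARM_FINGERPRINT gives a stable hash per customer, so the split is repeatable.
    WHEN MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) < 8 THEN 'TRAIN'
    WHEN MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) = 8 THEN 'TEST'
    ELSE 'VALIDATE'
  END AS split_col
FROM `kaggleworkshops_shopper.train_history`;
```

The thresholds above give a roughly 80/10/10 split; tweak them if you want to reproduce the roughly 76/15/9 proportions quoted above.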
We can count the number of rows in each split with a SQL query:

Split column row counts

After splitting the dataset, we will add more columns to a table that will later become our feature table.

Creating the features using SQL

First, I would like to create two features from the train_history table itself. These features will be temporarily stored in BigQuery tables as dim_1 and dim_2.

dim_1 captures, for each region, the probability of having a repeater user, stored as region_repeated_prob.

Create dim1 table

dim_2 captures, for each store, the probability of having a repeater user, stored as chain_repeated_prob.

Next, let's join those two tables with the features built from the other tables. The final table will be called feat_train, as shown here:

Create a feature train table

[REPLACE WITH scripts]

Here are the features that we generated:

- Top category: f00_top_category
- Value of the offer: f01_offer_value
- Quantity of the offer: f02_offer_rule
- Count of transactions: f03_trx_cnt
- Total quantity purchased: f04_total_quantity
- Total amount purchased: f05_total_amount
- Percentage of purchases made on weekends: f06_weekend_pct
- Number of transactions with the same brand as the offer: f07_same_brand_trx_cnt
- Number of transactions with the same category as the offer: f08_same_category_trx_cnt
- Number of the customer's transactions in the top category: f09_top_category_cnt_trx
- First additional dimension: f10_region_repeat_prob
- Second additional dimension: f11_chain_repeat_prob
- The training labels; only 27% of the TRAIN data are repeaters, so the data is imbalanced: label and label_bool
- The TRAIN/TEST/VALIDATE split label used for AutoML models: automl_split_col
- The TRAIN/TEST/VALIDATE split label used for models other than AutoML: general_split_col

After the query is done, the schema of the TRAIN data can be seen below:

Example train table columns

And this is what it looks like for the testing table:

Create a feature test table

The features that we build for the test data are the same as for the training data, except that label and label_bool are excluded from the dataset with the above schema.

After creating our features, the next step is to transform them further. Some additional transformations are needed, for example a one-hot encoder. Luckily, BQML provides these functionalities out of the box, so we just need to use them in the SQL statement!

Using BQML Transform

In this sub-step, we will define the BQML Transform syntax. We define the data transformations along with the model definition using the TRANSFORM statement in the following way:

Using Transform statement in the model creation

As you can see, we applied three functions there: a one-hot encoder, a standard scaler, and an imputer (a rough sketch of this kind of statement follows after the wrap-up below). Now our table is ready for machine learning model training.

With this, we have come to the end of Part 2. Keep an eye out for more detailed tips and tricks in Part 3 next week!
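Since the TRANSFORM statement above only appears as a screenshot, here is a minimal, illustrative sketch of what a CREATE MODEL statement with a TRANSFORM clause using those three preprocessing functions can look like. The model name, the selection of columns, and the OPTIONS (a logistic regression placeholder rather than the AutoML classifier, which Part 3 covers) are all assumptions rather than Adi's actual statement:

```sql
-- Illustrative sketch only: model name, transformed columns, and OPTIONS are
-- assumptions; the actual AutoML classifier training is covered in Part 3.
CREATE OR REPLACE MODEL `kaggleworkshops_shopper.repeater_model_sketch`
TRANSFORM(
  -- Only the outputs of TRANSFORM become model inputs, so in practice every
  -- feature you want the model to use must be listed here; three are shown
  -- to keep the sketch short.
  ML.ONE_HOT_ENCODER(CAST(f00_top_category AS STRING)) OVER() AS f00_top_category_enc,
  -- Standardize a numeric feature to zero mean and unit variance.
  ML.STANDARD_SCALER(f05_total_amount) OVER() AS f05_total_amount_scaled,
  -- Fill missing values with the column mean.
  ML.IMPUTER(f10_region_repeat_prob, 'mean') OVER() AS f10_region_repeat_prob_imp,
  -- The label column must be passed through the TRANSFORM clause as well.
  label_bool
)
OPTIONS(
  model_type = 'LOGISTIC_REG',        -- placeholder model type to illustrate TRANSFORM
  input_label_cols = ['label_bool']
) AS
SELECT *
FROM `kaggleworkshops_shopper.feat_train`
WHERE general_split_col = 'TRAIN';    -- assuming feat_train carries the split column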
As a GDPR-compliant company, we want you to know why you're getting this email. The _datapro team, as a part of Packt Publishing, believes that you have a legitimate interest in our newsletter and its products. Our research shows that you, {EMAIL}, opted in for email communication with Packt Publishing in the past, and we think your previous interest warrants our communication. If you do not feel that you should have received this or are no longer interested in _datapro, you can opt out of our emails by clicking the link below.
Our mailing address is:
Packt Publishing
Livery Place
35 Livery Street
Birmingham, West Midlands B3 2PB
United Kingdom
[Add us to your address book](
Want to change how you receive these emails?
You can [update your preferences]( or [unsubscribe](