Packt_DataPro: Bringing data-driven insights to your fingertips.

February 28, 2023 | DataPro ~ Special Edition #12

Hey {NAME},

As professionals in the field of data science, we are constantly looking for effective tools and automation techniques that streamline the process of building accurate prediction models. To help you with that quest, our author Adi Wijaya has put together a straightforward solution using Google Cloud's BigQueryML AutoML Classifier. Having covered how to prepare the data last week, we'll examine the feature engineering part of the process this week, with the remaining sections covered in upcoming episodes.

For those of you who may not be familiar with our expert author, Adi Wijaya is a Strategic Cloud Data Engineer at Google. Throughout this tutorial, he elaborates on each step. Whether you are new to the tool or looking to expand your knowledge, this part promises valuable insights and tips to optimize your workflow. Get ready to dig in as we continue our deep dive into the BigQueryML AutoML Classifier with Part 2 of the series! If you haven't read Part 1 yet, you can find it here:

- [Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 1](

If you find the newsletter useful, share it with your friends! And if you are interested in sharing ideas and suggestions to foster the growth of the professional data community, this survey is for you. Consider it your space to share your thoughts. Jump on in!

[TELL US WHAT YOU THINK](

Cheers,
Merlyn Shelley
Associate Editor in Chief, Packt

Using BigQueryML AutoML Classifier to Solve the Kaggle competition! - Part 2 - By Adi Wijaya

Disclosure: At the time of writing this blog, I am employed by Google Cloud; however, the opinions expressed here are mine alone and do not reflect the views of my employer.

Upload Kaggle files to the GCS bucket

Click the bucket name that you've created, and then the UPLOAD FILES button.

The UPLOAD FILES button

After browsing for and selecting all of the Kaggle files on your system, they will appear in the bucket as shown below:

All files successfully uploaded to the bucket

Next, we need to create a BigQuery dataset.

Creating a BigQuery Dataset

Move to the BigQuery console by choosing BigQuery in the Navigation Bar. We need a dataset to hold the tables and ML models, so let's create one by clicking the "Create Dataset" drop-down option.

BigQuery console, create dataset

In my case, I created a dataset called kaggleworkshops_shopper under the project ID bqml-project-370802 in the US location.

Create BigQuery tables

Now you can create tables in your dataset. Right-click your dataset, choose CREATE TABLE, and let's use the transactions data as an example.

Create a table in the dataset

Since our data is already in the bucket, we choose Google Cloud Storage as the data source, pick the CSV format, and tick Auto Detect for the schema.

BigQuery table creation form

Because the file has a header, we skip the first row during the load, and then click CREATE TABLE.

After clicking the CREATE TABLE button, the load job from GCS to BigQuery starts; the process may take around 30 minutes. This only applies to the transactions table, since that file is quite big. If successful, you can see the table below:

Transaction table preview

Continue with the same table-creation steps for all the other files. After completing this step and creating all the tables that you want to analyze, you'll see your dataset in the left sidebar.

Final check of the dataset

With that, our data is prepared and we are ready for feature engineering. Note that these steps are unnecessary if your organization's main data warehouse is already BigQuery; in this tutorial we only did them because the source files come from Kaggle's website.

Feature Engineering

In this step, we will create the features table. Here are the three sub-steps of the feature engineering step:

- Split Dataset
- Creating the features using SQL
- Using BQML Transform

The first step is to split the dataset into Train, Test, and Validation.

Split Dataset

This sub-step is not strictly required for AutoML, because splitting is already part of its automatic mechanism. But since we later want to compare the results against other model algorithms, let's split the dataset manually using a BigQuery function.

In the BigQuery console, click COMPOSE A NEW QUERY and write SQL that splits the dataset into TRAIN, TEST, and VALIDATE from "train_history"; we'll name the resulting table "split_dim". (Check out: [How data splits are used](

Split table syntax

The query creates a split_dim table with one column named split_col, which labels each row as TRAIN, TEST, or VALIDATE, as shown in the example screenshot below:

Split table result

After this step, we will have 120K customers as TRAIN data, 23K customers as TEST data, and 15K customers as VALIDATE data.
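The split query itself only appears as a screenshot in the original post. For reference, here is a minimal sketch of one way to build such a table with a deterministic hash split in BigQuery SQL. The dataset name, the id column, and the split thresholds are assumptions, so adjust them to your own setup:

```sql
-- Hedged sketch of a hash-based split table (not the exact query from the post).
-- Assumes train_history was loaded into the kaggleworkshops_shopper dataset and
-- that the customer identifier column is named id, as in the Kaggle files.
CREATE OR REPLACE TABLE `kaggleworkshops_shopper.split_dim` AS
SELECT
  id,
  CASE
    -- FARM_FINGERPRINT gives a stable hash per customer, so the split is repeatable.
    WHEN MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) < 8 THEN 'TRAIN'
    WHEN MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) = 8 THEN 'TEST'
    ELSE 'VALIDATE'
  END AS split_col
FROM `kaggleworkshops_shopper.train_history`;
```

The thresholds above give a roughly 80/10/10 split; tweak them if you want to reproduce the roughly 76/15/9 proportions quoted above.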
We can count the number of rows in each split with a SQL query:

Split column row counts

After splitting the dataset, we will add more columns to a table that will later become our feature table.

Creating the features using SQL

First, I would like to create two features from the train_history table itself. These features will be temporarily stored in BigQuery tables as dim_1 and dim_2.

dim_1 captures, for each region, the probability of having a repeater user, stored as region_repeated_prob.

Create dim1 table

dim_2 captures, for each store, the probability of having a repeater user, stored as chain_repeated_prob.

Next, let's join those two tables with the features built from the other tables. The final table will be called feat_train, as shown here:

Create a feature train table

[REPLACE WITH scripts]

Here are the features that we generated:

- Top category: f00_top_category
- Value of the offer: f01_offer_value
- Quantity of the offer: f02_offer_rule
- Count of transactions: f03_trx_cnt
- Total quantity purchased: f04_total_quantity
- Total amount purchased: f05_total_amount
- Percentage of purchases made on weekends: f06_weekend_pct
- Number of transactions with the same brand as the offer: f07_same_brand_trx_cnt
- Number of transactions with the same category as the offer: f08_same_category_trx_cnt
- Number of the customer's transactions in the top category: f09_top_category_cnt_trx
- First additional dimension: f10_region_repeat_prob
- Second additional dimension: f11_chain_repeat_prob
- The training labels; only 27% of the TRAIN data are repeaters, so the data is imbalanced: label and label_bool
- The TRAIN/TEST/VALIDATE split label used for AutoML models: automl_split_col
- The TRAIN/TEST/VALIDATE split label used for models other than AutoML: general_split_col

After the query is done, the schema of the TRAIN data can be seen below:

Example train table columns

And this is what it looks like for the testing table:

Create a feature test table

The features that we build for the test data are the same as for the training data, except that label and label_bool are excluded from the dataset with the above schema.

After creating our features, the next step is to transform them further. Some additional transformations are needed, for example a one-hot encoder. Luckily, BQML provides these functionalities out of the box, so we just need to use them in the SQL statement!

Using BQML Transform

In this sub-step, we will define the BQML Transform syntax. We define the data transformations along with the model definition using the TRANSFORM statement in the following way:

Using Transform statement in the model creation

As you can see, we applied three functions there: a one-hot encoder, a standard scaler, and an imputer (a rough sketch of this kind of statement follows after the wrap-up below). Now our table is ready for machine learning model training.

With this, we have come to the end of Part 2. Keep an eye out for more detailed tips and tricks in Part 3 next week!
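Since the TRANSFORM statement above only appears as a screenshot, here is a minimal, illustrative sketch of what a CREATE MODEL statement with a TRANSFORM clause using those three preprocessing functions can look like. The model name, the selection of columns, and the OPTIONS (a logistic regression placeholder rather than the AutoML classifier, which Part 3 covers) are all assumptions rather than Adi's actual statement:

```sql
-- Illustrative sketch only: model name, transformed columns, and OPTIONS are
-- assumptions; the actual AutoML classifier training is covered in Part 3.
CREATE OR REPLACE MODEL `kaggleworkshops_shopper.repeater_model_sketch`
TRANSFORM(
  -- Only the outputs of TRANSFORM become model inputs, so in practice every
  -- feature you want the model to use must be listed here; three are shown
  -- to keep the sketch short.
  ML.ONE_HOT_ENCODER(CAST(f00_top_category AS STRING)) OVER() AS f00_top_category_enc,
  -- Standardize a numeric feature to zero mean and unit variance.
  ML.STANDARD_SCALER(f05_total_amount) OVER() AS f05_total_amount_scaled,
  -- Fill missing values with the column mean.
  ML.IMPUTER(f10_region_repeat_prob, 'mean') OVER() AS f10_region_repeat_prob_imp,
  -- The label column must be passed through the TRANSFORM clause as well.
  label_bool
)
OPTIONS(
  model_type = 'LOGISTIC_REG',        -- placeholder model type to illustrate TRANSFORM
  input_label_cols = ['label_bool']
) AS
SELECT *
FROM `kaggleworkshops_shopper.feat_train`
WHERE general_split_col = 'TRAIN';    -- assuming feat_train carries the split column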
As a GDPR-compliant company, we want you to know why you're getting this email. The _datapro team, as a part of Packt Publishing, believes that you have a legitimate interest in our newsletter and its products. Our research shows that you, {EMAIL}, opted in for email communication with Packt Publishing in the past, and we think your previous interest warrants our communication. If you do not feel that you should have received this or are no longer interested in _datapro, you can opt out of our emails by clicking the link below.
Our mailing address is:
Packt Publishing
Livery Place
35 Livery Street
Birmingham, West Midlands B3 2PB
United Kingdom
[Add us to your address book](
Want to change how you receive these emails?
You can [update your preferences]( or [unsubscribe](