Packt_DataPro: Bringing data-driven insights to your fingertips.

February 21, 2023 | DataPro ~ Special Edition #11

Hey {NAME},

As professionals in the field of data science, we are constantly seeking effective tools and automation techniques that can streamline the process of building accurate prediction models. To help you in this quest, our author Adi Wijaya has come up with a straightforward solution using Google Cloud's BigQuery ML AutoML Classifier. This week the focus is on the data preparation stage of the process, with the remaining stages to be covered in upcoming episodes.

For those of you who may not be familiar with our expert author, Adi Wijaya is a Strategic Cloud Data Engineer at Google. Throughout this tutorial, he outlines each step in detail. Whether you are new to the tool or looking to expand your knowledge, this part promises valuable insights and tips to optimize your workflow. Get ready to be intrigued as we begin our deep dive into the BigQuery ML AutoML Classifier with Part 1 of the series!

If you find the newsletter useful, share it with your friends! And if you are interested in sharing ideas and suggestions to foster the growth of the professional data community, our survey is your space to share your thoughts. Jump on in!

TELL US WHAT YOU THINK

Cheers,
Merlyn Shelley
Associate Editor in Chief, Packt

Using BigQuery ML AutoML Classifier to Solve a Kaggle Competition! - Part 1 - By Adi Wijaya

Disclosure: At the time I write this blog, I am employed by Google Cloud; however, the opinions expressed here are mine alone and do not reflect the views of my employer.

While there may be ongoing discussions in 2023 about whether Data Scientist is still the most desirable job of the 21st century, the importance of building machine learning models remains significant. Data scientists, regardless of level, are expected to have the technical knowledge to develop machine learning models. In the initial stages, we experiment with different algorithms such as regression, decision trees, random forests, XGBoost, deep learning, and more. As we gain expertise, the choice of algorithm becomes a process of trial and error, guided by predetermined accuracy metrics. Data cleansing and operationalizing ML models remain topics of discussion in their own right.

In this article, you will learn about a compelling technology called BigQuery Machine Learning AutoML, demonstrated through a step-by-step experiment on the Google Cloud Platform, using a dataset from Kaggle.

The AutoML

In 2015, I came to the realization that it was possible to automate the creation of a machine learning model. At the time, my initial idea was to develop simple software for the trial-and-selection logic. I soon discovered that this idea was already a burgeoning concept, now known as Automated Machine Learning (AutoML), which has matured significantly since then and is gaining more prominence in 2023.

AutoML is not a replacement for traditional model algorithms such as regression or decision trees, but rather a tool that can try out various model algorithms within a limited time frame to obtain the highest accuracy possible. There are several AutoML providers available, including the H2O library, which was the first one I used, and the auto-sklearn package, which builds on scikit-learn, the widely used machine learning package in Python.

Interface and data store

When building a machine learning model, the interface and the data store are two crucial aspects to consider. It is reassuring that most Python machine learning packages (though not all) are compatible with pandas DataFrames, which have become the de facto standard for data storage and analysis in Python. However, this also means that the data must be moved from the source database into the memory of a single machine. This is not a concern when working with small datasets or datasets that are already in CSV format. With larger datasets, however, data is often stored in a data warehouse that uses modern technologies to distribute the data, and moving it out can frequently become a bottleneck.

The BigQuery Machine Learning (BQML)

Now comes BQML, a SQL interface for building machine learning models using data directly from BigQuery storage (the data warehouse). The three key technical aspects of BQML are:

- The SQL interface
- The AutoML mechanism
- The BigQuery storage

What does that mean? The first two aspects mean using Google infrastructure with SQL, the most common data language, enabling Data Scientists to leverage advanced AutoML modelling. This flattens the learning curve and also gives Data Analysts and business analysts the opportunity to experiment with building machine learning models. The third and most crucial aspect of this use case is BigQuery storage.
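To make this concrete, here is a minimal sketch of what an AutoML model definition looks like in BQML. The dataset, model, and table names are placeholders, and the actual model-building step for this challenge comes later in the series:

```sql
-- Minimal sketch of a BQML AutoML classifier (all names are placeholders).
-- AUTOML_CLASSIFIER tries multiple algorithms within the given time budget
-- and keeps the most accurate model it finds.
CREATE OR REPLACE MODEL `your_dataset.repeat_buyer_model`
OPTIONS (
  model_type = 'AUTOML_CLASSIFIER',  -- delegate algorithm choice and tuning to AutoML
  input_label_cols = ['repeater'],   -- the column we want to predict
  budget_hours = 1.0                 -- cap on the training time
) AS
SELECT * FROM `your_dataset.training_features`;  -- training data read directly from BigQuery storage
```

Because the SELECT runs directly against BigQuery storage, the training data never leaves the warehouse.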
With BQML, Data Scientists no longer need to move data from a data warehouse into file storage or third-party databases in CSV format. With the required permissions (IAM), Data Scientists can build models directly on top of existing tables in BigQuery. Now, let's get started with the experimentation!

The Experimentation

This experiment aims to tackle a Kaggle challenge using a BQML AutoML model, with the results submitted to Kaggle. Prior to model development, we will conduct feature engineering within BigQuery. To simplify the process, I will present the experiment as a step-by-step tutorial. The primary objective is to provide a comprehensive understanding of BQML AutoML, enabling readers to determine whether this approach can enhance their daily work. Although accuracy and winning the Kaggle competition are not the tutorial's focus, readers can use the steps as a reference in their own work.

The Kaggle Challenge

The challenge that we will use is: Acquire Valued Shoppers Challenge; Predict which shoppers will become repeat buyers. As part of the challenge, we'll build a machine learning model that predicts which shoppers are likely to become repeat customers.

Data Understanding

This data captures the process of offering incentives (a.k.a. coupons) to a large number of customers and predicting which of them will be loyal to the product. Let's say 100 customers are offered a discount to buy two bottles of water. Of the 100 customers, 60 choose to redeem the offer. These 60 clients are the focus of this competition: you are asked to predict which of the 60 will return (during or after the promotion period) to buy the same product again.

To make this prediction, you are given at least a year of shopping history prior to each customer's incentive, plus the purchase histories of many other shoppers (some of whom received the same offer). The transaction history contains all items purchased, not just items related to the offer. Only one offer per customer is included in the data. The training set consists of offers issued before 2013-05-01; the test set consists of offers issued on or after 2013-05-01.

Fields

Here are the available fields as described on the Kaggle challenge page:

history
- id - A unique id representing a customer
- chain - An integer representing a store chain
- offer - An id representing a certain offer
- market - An id representing a geographical region
- repeattrips - The number of times the customer made a repeat purchase
- repeater - A boolean, equal to repeattrips > 0
- offerdate - The date a customer received the offer

transactions
- id - A unique id representing a customer
- chain - An integer representing a store chain
- dept - An aggregate grouping of the category (e.g. water)
- category - The product category (e.g. sparkling water)
- company - An id of the company that sells the item
- brand - An id of the brand to which the item belongs
- date - The date of purchase
- productsize - The amount of the product purchase (e.g. 16 oz of water)
- productmeasure - The units of the product purchase (e.g. ounces)
- purchasequantity - The number of units purchased
- purchaseamount - The dollar amount of the purchase

offers
- offer - An id representing a certain offer
- category - The product category (e.g. sparkling water)
- quantity - The number of units one must purchase to get the discount
- company - An id of the company that sells the item
- offervalue - The dollar value of the offer
- brand - An id of the brand to which the item belongs

Key Columns

The transactions file can be joined to the history file by (id, chain). The history file can be joined to the offers file by (offer). The transactions file can be joined to the offers file by (category, brand, company). A negative value in purchasequantity and purchaseamount indicates a return.
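To make these relationships concrete, here is a minimal join sketch in BigQuery SQL; the dataset and table names are assumptions based on the data preparation step below:

```sql
-- Sketch: joining the three files on their documented keys
-- (the dataset and table names are placeholder assumptions).
SELECT
  h.id,
  h.repeater,
  o.offervalue,
  t.purchaseamount
FROM `your_dataset.history` AS h
JOIN `your_dataset.offers` AS o
  ON o.offer = h.offer                  -- history <-> offers: (offer)
JOIN `your_dataset.transactions` AS t
  ON t.id = h.id
  AND t.chain = h.chain                 -- transactions <-> history: (id, chain)
WHERE t.purchasequantity > 0;           -- negative values indicate returns
```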
Size

Here are the file sizes. The reason I like this Kaggle challenge is that the data is big enough to represent a realistic data size in a company. Here you go:

[Image: competition file sizes]

Step-by-Step Tutorial

There will be 5 main steps in this tutorial, as follows:

- Data preparation
- Feature engineering
- Machine learning model
- Evaluation
- Submission

Data Preparation

In this step, we will prepare all the GCP objects that we need and upload the Kaggle files. If you are new to GCP, please follow the GCP console configuration to set up. For more advanced options, you can also follow the steps from my book, Data Engineering with Google Cloud Platform.

After getting yourself familiar with the GCP console, here are the 4 sub-steps in the data preparation step:

- Create a Google Cloud Storage (GCS) bucket
- Upload the competition files to the GCS bucket
- Create a BigQuery dataset
- Create the BigQuery tables

Let's start with creating the GCS bucket.

- Create a Google Cloud Storage (GCS) bucket

In the GCP console, go to the GCS bucket page by clicking Cloud Storage in the Navigation Bar:

[Image: Navigation Bar, Cloud Storage button]

On the next page, click CREATE BUCKET.

[Image: CREATE BUCKET button in the Cloud Storage console]

Now, add the bucket name. Make sure the bucket name is globally unique. For me, the bucket name is bucket-ws; please choose your own bucket name to ensure uniqueness.

[Image: Define the Cloud Storage bucket name]

Choose the location type Multi-region US. Note that at this early stage it's important to choose a location that already supports BigQuery AutoML.

[Image: Cloud Storage bucket location]

If it succeeds, the result is shown below:

[Image: The Cloud Storage bucket created]

Next, we will upload the files.
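The console walkthrough of the remaining sub-steps continues in Part 2. As a hedged preview, the two BigQuery sub-steps (creating the dataset and loading a table) can also be expressed in SQL, assuming the competition CSVs have already been uploaded to the bucket and relying on schema auto-detection; the project, dataset, and file names below are placeholders:

```sql
-- Sketch: the BigQuery sub-steps in SQL, assuming the competition CSVs
-- were already uploaded to the bucket created above (bucket-ws).
-- Project, dataset, and file names are assumptions.
CREATE SCHEMA IF NOT EXISTS `your-project.kaggle_shoppers`
OPTIONS (location = 'US');  -- must match the bucket's US multi-region

LOAD DATA OVERWRITE `your-project.kaggle_shoppers.history`
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,                       -- skip the header row
  uris = ['gs://bucket-ws/trainHistory.csv']   -- assumed file name
);
-- Repeat the LOAD DATA statement for the transactions and offers files.
```

Bucket creation itself has no BigQuery SQL equivalent, which is why that step is done in the console.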
With this, we have come to the end of Part #1. Keep an eye out for more elaborate tips and tricks in Part #2 next week!

Thank you for being a part of the DataPro weekly newsletter. Team Packt.

Copyright (C) 2023 Packt Publishing. All rights reserved.