Packt DataPro#34: Memoizing DataFrame Functions

TensorFlow, CNN in PyTorch & Studio notebooks with Python

Mar 9
👋 Hey,

"Machine learning allows us to build software solutions that exceed human understanding and shows us how AI can innervate every industry." - [Steve Jurvetson, Board Member of SpaceX and Tesla.](

AI and ML technologies have become the backbone of our everyday activities! As the digital arena continues to evolve, these tools are now considered essential. Whether you are running a business or simply trying to stay current, it pays to know how to use these tools to get the results you want. With that in mind, our goal for this week is to focus on building data pipelines that streamline data preparation and enhance our ability to fine-tune machine learning models.

Key Insights:

- [Using TensorFlow Serving to serve models](
- [Building a Convolutional Neural Network in PyTorch](
- [Creating a Data Pipeline with Spark, Google Cloud Storage and Big Query](

If you are interested in sharing ideas to foster the growth of the professional data community, then this survey is for you. Consider it your space to share your thoughts! Jump on in!

[TELL US WHAT YOU THINK](

Cheers,
Merlyn Shelley
Associate Editor in Chief, Packt

Keep up with cutting-edge Research on GitHub

- [jupyterlab]( This is a monorepo that houses the core jupyter_ai package in addition to the default supported AI modules.
- [sdv-dev]( Python library for modeling multivariate distributions and sampling from them using copula functions.
- [jina-ai]( MLOps framework to build multimodal AI services and pipelines, then serve, scale, and deploy them to a production-ready environment like Kubernetes or Jina AI Cloud.
- [pyvista]( This is a helper module for the Visualization Toolkit (VTK) that wraps the VTK library through NumPy and provides direct array access through a variety of methods and classes.
- [deepset-ai]( An end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases.
- [castorini]( [Diffusion Attentive Attribution Maps (DAAM)]( a cross-attention-based approach for interpreting Stable Diffusion.
- [handrew]( An intelligent web browsing agent controlled by natural language.

[Pledge your support](

Stay informed about Data & ML Industry Insights

AWS

- [Hosting YOLOv8 PyTorch models on Amazon SageMaker Endpoints:]( Launching the [YOLOv8]( object detection model from [Ultralytics]( on Amazon SageMaker endpoints enables efficient, scalable, and cost-optimized model deployment. The solution uses [AWS CloudFormation]( to automate the creation of a SageMaker instance: clone the [GitHub repository]( to the instance, download the YOLOv8 PyTorch model via a SageMaker notebook, and store the custom inference code in an [Amazon S3 bucket](.
- [Training large language models on Amazon SageMaker: Best practices:]( Large language models (LLMs) are neural networks with hundreds of millions ([BERT]( to more than a trillion parameters ([MiCS](. Models of this size make single-GPU training impractical. [Amazon SageMaker Training]( is a managed batch ML compute service that reduces the time and cost to train and tune models at scale without the need to manage infrastructure. This post covers tips and best practices for successfully training LLM workloads on SageMaker Training, including all stages and related infrastructure features.
- [Four approaches to manage Python packages in Amazon SageMaker Studio notebooks:]( [Studio notebooks]( are collaborative Jupyter notebooks for easy data processing that can launch tasks without setting up compute instances and file storage. [Amazon SageMaker Studio]( is a web-based IDE for machine learning that supports building, training, debugging, deploying, and monitoring your ML models. This post compares options and recommended practices for managing Python packages and virtual environments in SageMaker Studio notebooks. You can find hands-on examples in a public [GitHub repo](.

Google Cloud

- [Rapidly expand the reach of Spanner databases with read-only replicas and zero-downtime moves:]( Spanner has an exclusive capability to deliver high performance across vast geographic territories using [read-only replicas](. It offers [regional]( and [multi-regional]( configurations with high availability, near-unlimited scale, and strong consistency. Spanner's zero-downtime instance move service allows production Spanner instances to be moved from any configuration to another on the fly, with zero downtime. This includes regional, multi-regional, or custom configurations with configurable read-only replicas.
- [Enriching Knowledge Graphs in Neo4j with Google Enterprise Knowledge Graph:]( [Neo4j Graph Database]( can be enhanced by integrating data from Google Enterprise Knowledge Graph, producing more comprehensive knowledge graphs that can answer more queries. This integration can lead to better business decisions and improved customer experiences. The blog post details the process of combining these two tools to enrich a knowledge graph.
- [At Box, a game plan for migrating critical storage services from HBase to Cloud Bigtable:]( Box, Inc. has migrated from Apache HBase to Cloud Bigtable for cloud-based content management, collaboration, and file sharing. Bigtable requires no maintenance downtime, which enabled consolidating three HBase clusters down to just two Bigtable clusters in separate regions. The Google-provided [Dataproc Import Job]( was used to import three HBase snapshots of 200 TB each into Bigtable, enabling ad hoc retrieval of analytics data without additional jobs. Additionally, querying BigQuery is faster than running MapReduce jobs.

Just for Laughs!

Why did the data pipeline refuse to work with the ML model? Because the model kept trying to predict its own input!

Understanding Data & ML Core Concepts

Using TensorFlow Serving to serve models – By [Md Johirul Islam](

In this section, we will use TensorFlow Serving to serve models. First, we will use the recommended mechanism of running TensorFlow Serving with Docker. Here is the [official page]( presenting this example.

TensorFlow Serving with Docker

Make sure Docker is installed on your platform. Now, let's work through the following steps to serve our dummy example model:

Step 1: First of all, start Docker and make sure it is running. You can verify whether Docker is running with the docker --version command in your operating system's terminal. It should give you output similar to the following:

    → TF_SERVE docker --version
    Docker version 20.10.11, build dea9396

Step 2: Now, let's pull the latest TensorFlow Serving Docker image using the docker pull tensorflow/serving command:

    → TF_SERVE docker pull tensorflow/serving
    Using default tag: latest
    latest: Pulling from tensorflow/serving
    a1e1d413e326: Pull complete
    Digest: sha256:6c3c199683df6165f5ae28266131722063e9fa012c15065fc4e245ac7d1db980
    Status: Downloaded newer image for tensorflow/serving:latest
    docker.io/tensorflow/serving:latest

Step 3: Next, let's clone the GitHub repository for TensorFlow Serving using the git clone command. We will use the saved_model_half_plus_two_cpu model from this folder. This model takes an input, halves it, and then adds two to the result.

Step 4: Let's create an environment variable in the terminal with the path to the example demo models by running the following command:

    TESTDATA="$(pwd)/serving/tensorflow_serving/servables/tensorflow/testdata"

Step 5: This test data directory contains a bunch of servables that can be loaded and served directly, without us having to create our own servable from scratch.

Step 6: We will be using the saved_model_half_plus_two_cpu servable for our example. Its 0000123 subdirectory denotes the model version; numbered subdirectories like this are used to distinguish different versions.

Step 7: The version directory contains a saved TensorFlow model in .pb format, along with a directory for variables and another directory for assets. This is the ideal structure that we get after saving a TensorFlow model.

Step 8: Now, we can serve the model using the following command:

    docker run -t --rm -p 8501:8501 \
        -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" \
        -e MODEL_NAME=half_plus_two \
        tensorflow/serving &

Step 9: Now, let's call the inference API. Run the following command in your terminal to call the predict API (the endpoint follows TensorFlow Serving's REST pattern, using the model name from Step 8):

    curl -d '{"instances": [1.0, 2.0, 5.0]}' \
        -X POST http://localhost:8501/v1/models/half_plus_two:predict
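If you would rather call the endpoint from Python, here is a minimal equivalent of the curl call above using the requests library; it assumes the container from Step 8 is still running on localhost:8501.

```python
# Minimal Python equivalent of the curl call above; assumes the serving
# container from Step 8 is running and listening on localhost:8501.
import requests

response = requests.post(
    "http://localhost:8501/v1/models/half_plus_two:predict",
    json={"instances": [1.0, 2.0, 5.0]},
)
# The model computes 0.5 * x + 2 for each instance.
print(response.json())  # {'predictions': [2.5, 3.0, 4.5]}
```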
If you are interested in what the model we have demonstrated can do, you can look at the [source code in the file for the model](. The code is quite long because it is used to generate models for multiple platforms.

This content was curated from the book, [Machine Learning Model Serving Patterns and Best Practices (packtpub.com).]( To explore more, click the button below!

[SIT BACK, RELAX & START READING!](

Find Out What's New in Data & ML

- [Introducing Quix Streams: an open-source Python library for Kafka:]( [Quix Streams]( is a library that enables easy production and consumption of time-series data streams, with a Pandas-like interface optimized for telemetry data. It handles stateful processing of streaming data and uses binary tables to significantly enhance performance. Quix Streams simplifies time-series data stream handling, making it more accessible to users.
- [Memoizing DataFrame Functions:]( The article discusses using function memoization to optimize software performance, and how StaticFrame provides tools to implement in-memory and disk-based memoization for DataFrames. Proper cache invalidation and collision avoidance are crucial to achieving performance benefits by eliminating repetitive work. (A minimal sketch of the idea appears just after this list.)
- [Time Series Forecasting with statsmodels and Prophet:]( Developing a forecast model for time series data can be done easily with popular Python packages like statsmodels and Prophet. These packages help analyze data over time and create future predictions. This article demonstrates how to use these packages to develop a forecast model and evaluate its effectiveness. (A second sketch follows this list.)
- [Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages:]( The [Universal Speech Model]( (USM) is a speech model with 2 billion parameters, trained on 12 million hours of speech and 28 billion sentences of text covering 300+ languages. It uses the [Conformer]( encoder-decoder architecture with [attention]( [feed-forward]( and [convolutional modules](. Compared to [Whisper (large-v2)]( which was trained on 400k hours of labeled data, USM achieves a 32.7% relatively lower WER on average across the 18 languages that Whisper can decode with less than 40% WER.
- [Building a Convolutional Neural Network in PyTorch:]( Neural networks consist of layers connected to each other, with convolutional layers commonly used for image-related applications. These layers are powerful because they preserve the spatial structure of an image, leading to state-of-the-art results in computer vision neural networks. This post explains the convolutional layer, the network it creates in [PyTorch]( and how to use it to handle image input, as well as how to visualize feature maps. (See the PyTorch sketch below this list.)
- [EgoObjects: A large-scale egocentric dataset for object understanding:]( [Meta]( introduced EgoObjects, a large-scale egocentric dataset focused on object detection and recognition, intended to push the boundaries of object understanding in the metaverse. State-of-the-art category- and instance-level detectors will be benchmarked to better understand their capabilities, while data collection and annotation continue to grow in scale and diversity. The dataset will also be open-sourced to the research community to facilitate further research in egocentric object understanding.
- [Creating a Data Pipeline with Spark, Google Cloud Storage and Big Query:]( Learn how to develop a free-tier data pipeline using Apache Spark, Google Cloud Storage, and Google BigQuery in this post. Apache Spark is the data processing tool that separates processing logic from connection logic, while Google Cloud Storage and Google BigQuery are used for storage. The guide is ideal for those looking to build a data pipeline using these tools. (A sketch of the pipeline shape closes out the examples below.)
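As promised above, here is a generic sketch of DataFrame memoization, caching results keyed on a content digest of the input frame. It uses pandas and a plain dict purely for illustration; this is not StaticFrame's API, which the article shows providing in-memory and disk-based variants out of the box.

```python
# A generic illustration of DataFrame memoization (not StaticFrame's API):
# cache results keyed on a digest of the input frame's values and index,
# so a changed frame naturally misses the cache (invalidation) and distinct
# frames are very unlikely to share a key (collision avoidance).
import hashlib
import pandas as pd

_cache: dict[str, pd.DataFrame] = {}

def _digest(df: pd.DataFrame) -> str:
    h = hashlib.sha256()
    h.update(pd.util.hash_pandas_object(df, index=True).values.tobytes())
    return h.hexdigest()

def expensive_transform(df: pd.DataFrame) -> pd.DataFrame:
    key = _digest(df)
    if key not in _cache:                    # recompute only on a miss
        _cache[key] = df.rolling(3).mean()   # stand-in for costly work
    return _cache[key]
```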
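For a quick taste of that forecasting workflow, here is a minimal statsmodels sketch; the toy series and the ARIMA order are our own illustrative choices, not values from the article.

```python
# A minimal forecasting sketch with statsmodels; the series values and the
# (p, d, q) order are illustrative assumptions only.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

y = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.date_range("2022-01-31", periods=12, freq="M"),
)

result = ARIMA(y, order=(1, 1, 1)).fit()
print(result.forecast(steps=3))  # predictions for the next three months
```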
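To make the convolutional-layer idea concrete, here is a minimal PyTorch sketch; the layer sizes and the 3-channel 32x32 input are our assumptions, not the post's exact network.

```python
# A minimal convolutional block in PyTorch; sizes assume 3-channel 32x32
# images (e.g., CIFAR-10-like input).
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # keeps 32x32
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)  # feature maps preserve spatial structure
        return self.classifier(x.flatten(1))

# The intermediate feature maps are what you would visualize.
maps = SmallCNN().features(torch.randn(1, 3, 32, 32))
print(maps.shape)  # torch.Size([1, 16, 16, 16])
```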
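Finally, here is a hedged sketch of that pipeline's shape; the bucket, dataset, and column names are placeholders, and it assumes the GCS and spark-bigquery connectors are available to the Spark session.

```python
# A sketch of a Spark -> GCS -> BigQuery pipeline; bucket/table/column names
# are placeholders, and the GCS and spark-bigquery connectors are assumed to
# be on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-to-bigquery").getOrCreate()

# Read raw data from Google Cloud Storage.
df = spark.read.csv("gs://example-bucket/raw/events.csv",
                    header=True, inferSchema=True)

# Processing logic lives in Spark, independent of the storage layer.
daily = df.groupBy("event_date").count()

# Write the result to BigQuery via the spark-bigquery connector.
(daily.write.format("bigquery")
      .option("table", "example_dataset.daily_event_counts")
      .option("temporaryGcsBucket", "example-tmp-bucket")
      .mode("overwrite")
      .save())
```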
See you next time!

As a GDPR-compliant company, we want you to know why you're getting this email. The _datapro team, as a part of Packt Publishing, believes that you have a legitimate interest in our newsletter and its products. Our research shows that you opted in to email communication with Packt Publishing in the past, and we think your previous interest warrants our appropriate communication. If you do not feel that you should have received this, or are no longer interested in _datapro, you can opt out of our emails by clicking the link below.
© 2023 Packt Publishing, All rights reserved.
Our mailing address is: Packt Publishing
Livery Place, 35 Livery Street, Birmingham, West Midlands B3 2PB
United Kingdom
[Unsubscribe]()