#AWSGlue
#うひーメモ
2023-11-30 04:05:54
[Breaking] Amazon Q integration for AWS Glue announced: build data integration pipelines in natural language (Coming Soon) #AWSreInvent
#技術系ブログ等
#amazonq
#awsglue
#comingsoonawsreinvent
[Breaking] Amazon Q integration for AWS Glue announced: build data integration pipelines in natural language (Coming Soon) #AWSreInvent
This is Niino from the consulting team, Integration Department, Data Analytics Division. During the keynote by Swami Sivasubramanian at re:Invent (Japan time), AW...
dev.classmethod.jp
November 29, 2023 at 7:05 PM
🆕 AWS Glue Data Catalog now supports Apache Iceberg automatic table optimization through Amazon VPC

#AWS #AwsLakeFormation #AwsGlue
AWS Glue Data Catalog now supports Apache Iceberg automatic table optimization through Amazon VPC
AWS Glue Data Catalog now supports automatic optimization of Apache Iceberg tables that can only be accessed from a specific Amazon Virtual Private Cloud (VPC) environment. You can enable automatic optimization by providing a VPC configuration to optimize storage and improve query performance while keeping your tables secure. AWS Glue Data Catalog supports compaction, snapshot retention, and unreferenced file management, which help you reduce metadata overhead, control storage costs, and improve query performance. Customers whose governance and security configurations require an Amazon S3 bucket to reside within a specific VPC can now use this capability with the Glue Catalog. This gives you broader capabilities for automatic management of your Apache Iceberg data, regardless of where it's stored on Amazon S3. Automatic optimization for Iceberg tables through Amazon VPC is available in 13 AWS regions: US East (N. Virginia, Ohio), US West (Oregon), Europe (Ireland, London, Frankfurt, Stockholm), Asia Pacific (Tokyo, Seoul, Mumbai, Singapore, Sydney), and South America (São Paulo). Customers can enable this through the AWS Console, AWS CLI, or AWS SDKs. To get started, provide the Glue network connection as an additional configuration along with optimization settings such as the default retention period and days to keep unreferenced files. The AWS Glue Data Catalog will use the VPC information in the Glue connection to access Amazon S3 buckets and optimize Apache Iceberg tables. To learn more, read the blog and visit the AWS Glue Data Catalog documentation.
aws.amazon.com
November 21, 2024 at 10:23 PM
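A minimal sketch of what enabling this might look like with boto3. The `create_table_optimizer` operation is the documented Glue API for table optimizers, but the exact shape of the VPC fields (`vpcConfiguration`/`glueConnectionName`) is an assumption from the announcement, and the database, table, role, and connection names are placeholders — verify against the current API reference before use.

```python
# Sketch: enabling automatic compaction for an Iceberg table through a VPC,
# per the announcement above. Field names under vpcConfiguration are assumed.
params = {
    "CatalogId": "123456789012",           # your AWS account ID (example value)
    "DatabaseName": "analytics_db",        # hypothetical database
    "TableName": "events_iceberg",         # hypothetical Iceberg table
    "Type": "compaction",
    "TableOptimizerConfiguration": {
        "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizerRole",
        "enabled": True,
        # The Glue network connection carrying the VPC information (assumed field):
        "vpcConfiguration": {"glueConnectionName": "my-vpc-connection"},
    },
}

# import boto3
# boto3.client("glue").create_table_optimizer(**params)
```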
Amazon SageMaker now offers 9 additional visual ETL transforms

Visual ETL in Amazon SageMaker now offers 9 new built-in transforms: “Derived column”, “Flatten”, “Add current timestamp”, “Explode array or map into rows”, “To timestamp”, “Arr...

#AWS #AmazonSagemaker #AwsGlue
Amazon SageMaker now offers 9 additional visual ETL transforms
Visual ETL in Amazon SageMaker now offers 9 new built-in transforms: “Derived column”, “Flatten”, “Add current timestamp”, “Explode array or map into rows”, “To timestamp”, “Array to columns”, “Intersect”, “Limit” and “Concatenate columns”. Visual ETL in Amazon SageMaker provides a drag-and-drop interface for building ETL flows and authoring flows with Amazon Q Developer. With these new transforms, ETL developers can quickly build more sophisticated data pipelines without having to write custom code for common transform tasks. Each of these new transforms addresses a unique data processing need. For example, use “Derived column” to define a new column based on a math formula or SQL expression, use “To timestamp” to convert a column to timestamp type, or use “Concatenate columns” to build a new string column from the values of other columns with an optional spacer. This new feature is now available in all AWS regions where Amazon SageMaker is available. See the supported region list (https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/supported-regions.html) for the most up-to-date availability information. To learn more, visit the Amazon SageMaker supported-transforms documentation (https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/visual-etl-supported-transforms.html).
aws.amazon.com
April 2, 2025 at 6:05 PM
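To make the semantics concrete, here is a plain-Python illustration of what two of the transforms described above do to row data. This mimics the behavior of “Derived column” and “Concatenate columns” on a list of row dicts; it is not SageMaker code, and the column names are made up.

```python
# Illustration only: the semantics of "Derived column" and "Concatenate
# columns" expressed as plain Python over a list of row dicts.
rows = [
    {"price": 10.0, "qty": 3, "first": "Ada", "last": "Lovelace"},
    {"price": 4.5, "qty": 2, "first": "Alan", "last": "Turing"},
]

# "Derived column": define a new column from a formula over existing columns.
for r in rows:
    r["total"] = r["price"] * r["qty"]

# "Concatenate columns": build a string column from others, with a spacer.
for r in rows:
    r["name"] = " ".join([r["first"], r["last"]])  # spacer = " "
```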
🆕 Amazon SageMaker now offers unified scheduling for visual ETL and query editors, simplifying scheduling via Amazon EventBridge Scheduler. This new feature is available in all AWS regions where Amazon SageMaker is supported.

#AWS #AmazonEventBridge #AwsGlue #AmazonSagemaker
Amazon SageMaker scheduling experience for Visual ETL and Query editors
Amazon SageMaker now offers a unified scheduling experience for visual ETL flows and queries. The next generation of Amazon SageMaker is the center for all your data, analytics, and AI, and includes SageMaker Unified Studio, a single data and AI development environment. Visual ETL in Amazon SageMaker provides a drag-and-drop interface for building ETL flows and authoring flows with Amazon Q. The query editor tool provides a place to write and run queries, view results, and share your work with your team. This new scheduling experience simplifies the scheduling process for Visual ETL and Query editor users. With unified scheduling you can now schedule your workloads with Amazon EventBridge Scheduler from the same visual interface you use to author your query or visual ETL flow. Previously, you needed to create a code-based workflow in order to run a single flow or query on a schedule. You can also view, modify, or pause and resume these schedules and monitor the runs they invoke. This new feature is now available in all AWS regions where Amazon SageMaker is available. Access the supported region list for the most up-to-date availability information. To learn more, visit our Amazon SageMaker Unified Studio documentation, blog post and Amazon EventBridge Scheduler pricing page.
aws.amazon.com
April 30, 2025 at 10:40 PM
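Under the hood this experience drives Amazon EventBridge Scheduler, so a sketch of the equivalent schedule is a useful mental model. The `create_schedule` parameters below are from the EventBridge Scheduler API; the schedule name, target ARN, role, and job name are hypothetical.

```python
# Sketch: the kind of EventBridge Scheduler schedule that unified scheduling
# creates for a recurring flow run. Names and ARNs are placeholders.
schedule = {
    "Name": "nightly-visual-etl-flow",
    "ScheduleExpression": "cron(0 2 * * ? *)",   # 02:00 UTC daily
    "FlexibleTimeWindow": {"Mode": "OFF"},
    "Target": {
        "Arn": "arn:aws:scheduler:::aws-sdk:glue:startJobRun",  # hypothetical target
        "RoleArn": "arn:aws:iam::123456789012:role/SchedulerInvokeRole",
        "Input": '{"JobName": "my-visual-etl-flow"}',
    },
}

# import boto3
# boto3.client("scheduler").create_schedule(**schedule)
```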
🆕 AWS Glue expands connectivity to 16 native connectors for applications

#AWS #AwsGlue
AWS Glue expands connectivity to 16 native connectors for applications
AWS Glue announces 16 new connectors for applications, expanding its connectivity portfolio. Now, customers can use AWS Glue native connectors to ingest data from Adobe Analytics, Asana, Datadog, Facebook Page Insights, Freshdesk, Freshsales, Google Search Console, LinkedIn, Mixpanel, PayPal Checkout, QuickBooks, SendGrid, Smartsheet, Twilio, WooCommerce, and Zoom Meetings. As enterprises increasingly rely on data-driven decisions, they need to integrate data from various applications. With 16 new connectors, customers have more options to easily establish a connection to their applications using the AWS Glue console or AWS Glue APIs without the need to learn application-specific APIs. Glue native connectors provide the scalability and performance of the AWS Glue Spark engine along with support for standard authorization and authentication methods like OAuth 2. With these connectors, customers can test connections, validate their connection credentials, preview data, and browse metadata. AWS Glue native connectors to these 16 applications are available in all AWS commercial regions. To get started, create new AWS Glue connections with these connectors and use them as a source in AWS Glue Studio. To learn more, visit the AWS Glue documentation for connectors.
aws.amazon.com
December 18, 2024 at 6:23 PM
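A hedged sketch of registering one of these connectors as a Glue connection via boto3. `create_connection` with a `ConnectionInput` is the documented Glue API, but the `ConnectionType` string for Zoom Meetings and the authentication property shape shown here are assumptions — check the connector documentation for the exact values.

```python
# Sketch: creating a Glue connection to one of the new application connectors.
# ConnectionType and the OAuth property names are assumed, not confirmed.
connection_input = {
    "Name": "zoom-meetings-conn",
    "ConnectionType": "ZOOM",              # assumed type string for Zoom Meetings
    "ConnectionProperties": {},
    "AuthenticationConfiguration": {       # assumed shape for OAuth 2 credentials
        "AuthenticationType": "OAUTH2",
    },
}

# import boto3
# boto3.client("glue").create_connection(ConnectionInput=connection_input)
```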
🆕 AWS Glue Data Catalog now provides usage metrics in Amazon CloudWatch for improved API monitoring and optimization. Track lakehouse resource reads, updates, and deletions, and set alarms for proactive management. Available globally.

#AWS #AwsGlue #AmazonCloudwatch
AWS Glue Data Catalog usage metrics now available with Amazon CloudWatch
AWS Glue Data Catalog now offers usage metrics for APIs in Amazon CloudWatch, enabling you to monitor, troubleshoot, and optimize your API usage with greater visibility. The insights from these API usage metrics will help you better understand your lakehouse runtime API usage in production environments. Customers seek better observability of their API usage to identify bottlenecks, detect anomalies, and understand usage patterns in their lakehouse architecture. With Data Catalog usage metrics in CloudWatch, you can track critical API usage performance indicators per minute, including reads, updates, and deletions of lakehouse resources such as catalogs, tables, partitions, connections, and statistics. You can set up CloudWatch alarms to receive notifications when metrics exceed specified thresholds, allowing proactive management of your lakehouse. You can get started by navigating to Metrics in the CloudWatch console and filtering usage by AWS Glue resource. You can then graph the metrics and configure alarms that alert you when usage approaches specified thresholds. This feature is available in all AWS Regions where Data Catalog is available. To get started, read the launch blog and the Data Catalog documentation.
aws.amazon.com
June 26, 2025 at 11:40 PM
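A sketch of the alarm step with boto3. `put_metric_alarm` and its parameters are the standard CloudWatch API, but the namespace and metric name below are assumptions — look up the actual names under Metrics in the CloudWatch console as the announcement suggests.

```python
# Sketch: a CloudWatch alarm on a Data Catalog usage metric.
# Namespace and MetricName are assumed placeholders.
alarm = {
    "AlarmName": "glue-catalog-table-reads-high",
    "Namespace": "AWS/Glue",               # assumed namespace for catalog usage metrics
    "MetricName": "TableReads",            # hypothetical metric name
    "Statistic": "Sum",
    "Period": 60,                          # the metrics are emitted per minute
    "EvaluationPeriods": 5,
    "Threshold": 10000.0,
    "ComparisonOperator": "GreaterThanThreshold",
}

# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```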
AWS Glue Data Catalog now supports scheduled generation of column level statistics

AWS Glue Data Catalog (https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html) now supports the scheduled generation of column-level statistics for Apache Iceberg tables and file formats su...

#AWS #AwsGlue #AwsLakeFormation
AWS Glue Data Catalog now supports scheduled generation of column level statistics
AWS Glue Data Catalog (https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html) now supports the scheduled generation of column-level statistics for Apache Iceberg tables and for file formats such as Parquet, JSON, CSV, XML, ORC, and ION. With this launch, you can simplify and automate the generation of statistics by creating a recurring schedule in the Glue Data Catalog. These statistics are integrated with the cost-based optimizer (CBO) of Amazon Redshift Spectrum and Amazon Athena, resulting in improved query performance and potential cost savings. Previously, to set up a recurring statistics generation schedule, you had to call AWS services using a combination of AWS Lambda and Amazon EventBridge Scheduler. With this new feature, you can now provide the recurring schedule as an additional configuration to the Glue Data Catalog, along with a sampling percentage. For each scheduled run, the number of distinct values (NDVs) is collected for Apache Iceberg tables, and additional statistics such as the number of nulls, maximum, minimum, and average length are collected for other file formats. As the statistics are updated, Amazon Redshift and Amazon Athena use them to optimize queries, applying optimizations such as optimal join order or cost-based aggregation pushdown. You have visibility into the status and timing of each statistics generation run, as well as the updated statistics values. To get started, you can schedule statistics generation using the AWS Glue Data Catalog console or the AWS Glue APIs. Support for scheduled generation of AWS Glue Catalog statistics is generally available in all regions where Amazon EventBridge Scheduler is available. Visit the AWS Glue Catalog documentation (https://docs.aws.amazon.com/glue/latest/dg/column-statistics.html) to learn more.
aws.amazon.com
November 14, 2024 at 2:05 AM
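A sketch of the "schedule plus sampling percentage" configuration the announcement describes, as a boto3 call. The operation name `create_column_statistics_task_settings` and its parameter names are assumptions drawn from the announcement text; the database, table, and role are placeholders — verify against the current Glue API reference.

```python
# Sketch: a recurring column-statistics schedule for the Glue Data Catalog.
# Operation and parameter names are assumed from the announcement.
settings = {
    "DatabaseName": "analytics_db",        # hypothetical database
    "TableName": "sales",                  # hypothetical table
    "Role": "arn:aws:iam::123456789012:role/GlueStatsRole",
    "Schedule": "cron(0 3 * * ? *)",       # nightly at 03:00 UTC
    "SampleSize": 25,                      # sampling percentage
}

# import boto3
# boto3.client("glue").create_column_statistics_task_settings(**settings)
```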
AWS Glue G4 and G8 worker types now available in six new regions

Today, AWS announces the general availability of AWS Glue G.4X and G.8X workers in the
US West (N. California), Asia Pacific (Seoul), Asia Pacific (Mumbai), Europe (London),
Europe (Spain), and South America (S...

#AWS #AwsGlue
AWS Glue G4 and G8 worker types now available in six new regions
Today, AWS announces the general availability of AWS Glue G.4X and G.8X workers in the US West (N. California), Asia Pacific (Seoul), Asia Pacific (Mumbai), Europe (London), Europe (Spain), and South America (São Paulo) AWS regions. Glue G.4X and G.8X workers enable you to run your most demanding serverless data integration workloads in these additional regions. AWS Glue is a serverless, scalable data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources. AWS Glue G.4X and G.8X workers provide higher compute, memory, and storage resources than other Glue workers, helping you scale and run your most demanding data integration workloads, such as memory-intensive data transforms, skewed aggregations, machine learning transforms, and entity detection checks with petabytes of data. To learn more, visit the AWS Glue product page (https://aws.amazon.com/glue/) and our documentation (https://docs.aws.amazon.com/glue/). For AWS Glue region availability, please see the AWS Regional Services List (https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).
aws.amazon.com
April 3, 2025 at 7:05 PM
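Selecting the larger worker types is a job-level setting. Below is a minimal sketch of requesting G.8X workers when creating a job with boto3; `create_job` and the `WorkerType` values `G.4X`/`G.8X` are documented Glue API parameters, while the job name, role, and script location are placeholders.

```python
# Sketch: a Glue job definition using the larger worker types.
job = {
    "Name": "heavy-transform-job",
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
    "Command": {"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},
    "GlueVersion": "4.0",
    "WorkerType": "G.8X",      # or "G.4X"
    "NumberOfWorkers": 10,
}

# import boto3
# boto3.client("glue").create_job(**job)
```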
Amazon EMR enables enhanced Apache Spark capabilities for Lake Formation tables with full table access

Amazon EMR now supports read and write operations from Apache Spark jobs on AWS Lake Formation registered tables when the job role has full table access. This capabi...

#AWS #AmazonEmr #AwsGlue
Amazon EMR enables enhanced Apache Spark capabilities for Lake Formation tables with full table access
Amazon EMR now supports read and write operations from Apache Spark jobs on AWS Lake Formation registered tables when the job role has full table access. This capability enables SQL operations, including CREATE and ALTER (DDL) as well as DELETE, UPDATE, and MERGE INTO (DML) statements, on Apache Hive and Iceberg tables from within the same Apache Spark application. While Lake Formation's fine-grained access control (FGAC) offers granular security controls at row, column, and cell levels, many ETL workloads simply need full table access. This new feature enables Apache Spark to directly read and write data when full table access is granted, removing FGAC limitations that previously restricted certain ETL operations. You can now leverage advanced Spark capabilities including RDDs, custom libraries, UDFs, and custom images (AMIs for EMR on EC2, custom images for EMR Serverless) with Lake Formation tables. Additionally, data teams can run complex, interactive Spark applications through SageMaker Unified Studio in compatibility mode while maintaining Lake Formation's table-level security boundaries. This feature is available in all AWS Regions where Amazon EMR and AWS Lake Formation are supported. To learn more about this feature, visit the full table access section (https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/lake-formation-unfiltered-access.html) in the EMR Serverless documentation.
aws.amazon.com
May 29, 2025 at 10:05 PM
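As an illustration of the kind of statement full table access unlocks from a single Spark application, here is a standard Spark SQL MERGE on an Iceberg table. The database, table, and column names are hypothetical; in an EMR Spark job with full table access granted, this would run via `spark.sql`.

```python
# Sketch: a MERGE INTO statement on an Iceberg table, as standard Spark SQL.
# Table and column names are hypothetical.
merge_sql = """
MERGE INTO lakehouse.orders AS t
USING staging.orders_updates AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.status = s.status
WHEN NOT MATCHED THEN INSERT *
"""

# In an EMR Spark job with full table access granted:
# spark.sql(merge_sql)
```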
🆕 AWS Glue now supports write operations for SAP OData, Adobe Marketo Engage, Salesforce Marketing Cloud, and HubSpot, enabling direct data writing from ETL jobs. This simplifies end-to-end ETL pipelines, eliminating custom scripts. Available in all AWS Glue regions.

#AWS #AwsGlue
AWS Glue adds write operations for SAP OData, Adobe Marketo Engage, Salesforce Marketing Cloud, and HubSpot connectors
AWS Glue adds write operations support for SAP OData, Adobe Marketo Engage, Salesforce Marketing Cloud, and HubSpot connectors. This allows you to not only extract data from those applications, but also write data to them directly from your AWS Glue ETL jobs. With the new write functionality, you can create and update records in SAP systems; sync leads into Adobe Marketo Engage; update subscriber and campaign data in Salesforce Marketing Cloud; manage contacts, companies, and deals in HubSpot; and more. This feature simplifies building end-to-end ETL pipelines that both extract data from and write processed results back to target applications, eliminating the need for custom scripts or intermediate systems. Write operations support for SAP OData, Adobe Marketo Engage, Salesforce Marketing Cloud, and HubSpot connectors is available in all Regions where AWS Glue is available. To learn more and see the list of supported entities, visit the AWS Glue documentation.
aws.amazon.com
October 3, 2025 at 10:40 PM
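A hedged sketch of what writing back to one of these applications could look like inside a Glue ETL script. `write_dynamic_frame.from_options` is the standard GlueContext write path, but the `connection_type` string and the option names shown are assumptions from the announcement — check the connector's supported-entities documentation for the real values.

```python
# Sketch: options for writing processed results back to an application
# from a Glue ETL job. Option names and values are assumed placeholders.
connection_options = {
    "connectionName": "hubspot-conn",      # hypothetical Glue connection
    "ENTITY_NAME": "contacts",             # assumed option for the target entity
    "API_VERSION": "v3",                   # assumed option
}

# Inside a Glue ETL script (awsglue is available within Glue jobs):
# glueContext.write_dynamic_frame.from_options(
#     frame=transformed, connection_type="hubspot",
#     connection_options=connection_options)
```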
🆕 AWS Glue Data Catalog now automates generating statistics for new tables

#AWS #AwsGlue
AWS Glue Data Catalog now automates generating statistics for new tables
AWS Glue Data Catalog now automates generating statistics for new tables. These statistics are integrated with the cost-based optimizer (CBO) of Amazon Redshift and Amazon Athena, resulting in improved query performance and potential cost savings. Table statistics are used by a query engine, such as Amazon Redshift or Amazon Athena, to determine the most efficient way to execute a query. Previously, creating statistics for Apache Iceberg tables in AWS Glue Data Catalog required you to continuously monitor and update configurations for your tables. Now, AWS Glue Data Catalog lets you generate statistics automatically for new tables with a one-time catalog configuration. You can get started by selecting the default catalog in the Lake Formation console and enabling table statistics in the table optimization configuration tab. As new tables are created or existing tables are updated, statistics are generated using a sample of rows for all columns and are refreshed periodically. For Apache Iceberg tables, these statistics include the number of distinct values (NDVs). For other file formats like Parquet, additional statistics are collected, such as the number of nulls, maximum and minimum values, and average length. Amazon Redshift and Amazon Athena use the updated statistics to optimize queries, applying optimizations such as optimal join order or cost-based aggregation pushdown. The Glue Catalog console provides visibility into the updated statistics and statistics generation runs. Automation for AWS Glue Catalog statistics is generally available in the following AWS regions: US East (N. Virginia, Ohio), US West (N. California, Oregon), Europe (Ireland), and Asia Pacific (Tokyo). Read the blog post and visit the AWS Glue Catalog documentation to learn more.
aws.amazon.com
December 3, 2024 at 8:23 PM
#うひーメモ
2023-10-28 18:04:18
Making SSH connections with paramiko from an AWS Glue job
#Program
#awsglue
#paramiko
#python
Making SSH connections with paramiko from an AWS Glue job
Hello, this is Matsuo from Insight Technology. This post shows how to make SSH connections from Python in an AWS Glue job, using paramiko for the SSH connection.
qiita.com
October 28, 2023 at 9:04 AM
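A minimal sketch of the approach from the post: open an SSH connection with paramiko from a Glue Python job and run a command. Host, user, and key path are placeholders, and paramiko must be supplied to the job (for example via `--additional-python-modules paramiko`), which is why the import sits inside the function here.

```python
def run_remote_command(host: str, user: str, key_path: str, command: str) -> str:
    """Run `command` over SSH and return its stdout."""
    import paramiko  # supplied to the Glue job, e.g. --additional-python-modules paramiko

    client = paramiko.SSHClient()
    # Demo only: auto-accept unknown host keys. Verify host keys in production.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, username=user, key_filename=key_path)
    try:
        _stdin, stdout, _stderr = client.exec_command(command)
        return stdout.read().decode()
    finally:
        client.close()
```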
🆕 AWS Glue Studio now supports more file types, including Excel, XML, and Tableau Hyper files, plus single file output. New compression types like LZ4, SNAPPY, and ZSTD are available. These features are now available globally where AWS Glue is supported.

#AWS #AwsGlue #AwsGovcloudUs
AWS Glue Studio now supports additional file types and single file output
Today, AWS Glue Studio announces support for additional compressed file types, Excel files (as a source), and XML and Tableau Hyper files (as targets). We are also introducing the option to select the number of output files for an S3 target. These enhancements allow you to use visual ETL jobs for data processing workflows that were previously unsupported, for example loading data from an Excel file into a single XML output file. The new experience enables you to produce one single file as the output of your Glue job, or to specify a custom number of output files. Further, Glue now supports Excel files via S3 file source nodes, and XML or Tableau Hyper files for S3 file target nodes. The new compression types available are LZ4, SNAPPY, DEFLATE, LZO, BROTLI, ZSTD, and ZLIB. These new features are now available in all AWS commercial Regions and AWS GovCloud (US) Regions where AWS Glue is available. Access the AWS Regional Services List for the most up-to-date availability information. To learn more, visit the AWS Glue documentation.
aws.amazon.com
May 15, 2025 at 8:40 PM
🆕 Announcing Amazon S3 Metadata (Preview) – Easiest and fastest way to manage your metadata

#AWS #AmazonS3 #AwsGlue #AmazonBedrock
Announcing Amazon S3 Metadata (Preview) – Easiest and fastest way to manage your metadata
Amazon S3 Metadata is the easiest and fastest way to help you instantly discover and understand your S3 data with automated, easily queried metadata that updates in near real-time. This helps you to curate, identify, and use your S3 data for business analytics, real-time inference applications, and more. S3 Metadata supports object metadata, which includes system-defined details like size and the source of the object, and custom metadata, which allows you to use tags to annotate your objects with information like product SKU, transaction ID, or content rating, for example. S3 Metadata is designed to automatically capture metadata from objects as they are uploaded into a bucket, and to make that metadata queryable in a read-only table. As data in your bucket changes, S3 Metadata updates the table within minutes to reflect the latest changes. These metadata tables are stored in S3 Tables, the new S3 storage offering optimized for tabular data. S3 Tables integration with AWS Glue Data Catalog is in preview, allowing you to stream, query, and visualize data—including S3 Metadata tables—using AWS Analytics services such as Amazon Data Firehose, Athena, Redshift, EMR, and QuickSight. Additionally, S3 Metadata integrates with Amazon Bedrock, allowing for the annotation of AI-generated videos with metadata that specifies their AI origin, creation timestamp, and the specific model used for generation. Amazon S3 Metadata is currently available in preview in the US East (N. Virginia), US East (Ohio), and US West (Oregon) Regions, and coming soon to additional Regions. For pricing details, visit the S3 pricing page. To learn more, visit the product page, documentation, and AWS News Blog.
aws.amazon.com
December 3, 2024 at 5:23 PM
Announcing support of AWS Glue Data Catalog views with AWS Glue 5.0

Today, we announce support for AWS Glue Data Catalog views with AWS Glue 5.0 for Apache Spark jobs. AWS Glue Data Catalog views with AWS Glue 5.0 allows customers to create views from Glue 5.0 Spa...

#AWS #AwsGlue #AwsGovcloudUs
Announcing support of AWS Glue Data Catalog views with AWS Glue 5.0
Today, we announce support for AWS Glue Data Catalog views with AWS Glue 5.0 for Apache Spark jobs. AWS Glue Data Catalog views with AWS Glue 5.0 allow customers to create views from Glue 5.0 Spark jobs that can be queried from multiple engines without requiring access to the referenced tables. AWS Glue is a serverless, scalable data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources. AWS Glue Data Catalog views are virtual tables in which the contents are defined by a SQL query that references one or more tables. These views support multiple SQL query engines, so you can access the same view across different AWS services. Administrators can control underlying data access using the rich SQL dialect provided by AWS Glue 5.0 Spark jobs. Access is managed with AWS Lake Formation permissions, including named resource grants, data filters, and Lake Formation tags. All requests are logged in AWS CloudTrail. AWS Glue Data Catalog views are generally available on AWS Glue 5.0, in all AWS Glue 5.0 regions. To learn more, visit the AWS Glue product page (https://aws.amazon.com/glue/) and our documentation (https://docs.aws.amazon.com/glue/).
aws.amazon.com
March 14, 2025 at 9:05 PM
AWS Glue Schema Registry adds support for C#

AWS Glue Schema Registry (GSR) (https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html) has now expanded the programming language support for the GSR client library to...

#AWS #AmazonMsk #AmazonManagedServiceForApacheFlink #AwsGovcloudUs #AwsGlue #AmazonKinesisStreams
AWS Glue Schema Registry adds support for C#
AWS Glue Schema Registry (GSR) (https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html) has now expanded the programming language support for the GSR client library to include C# alongside the existing Java support. C# applications integrating with Apache Kafka or Amazon MSK (https://aws.amazon.com/msk/), Amazon Kinesis Data Streams (https://aws.amazon.com/kinesis/data-streams/), and Apache Flink or Amazon Managed Service for Apache Flink (https://aws.amazon.com/managed-service-apache-flink/) can now interact with AWS Glue Schema Registry to maintain data quality and schema compatibility in streaming data applications. AWS Glue Schema Registry, a serverless feature of AWS Glue (https://aws.amazon.com/glue/), enables you to validate and control the evolution of streaming data using registered schemas at no additional charge. Schemas define the structure and format of data records produced by applications. Using AWS Glue Schema Registry, you can centrally manage and enforce schema definitions across your data ecosystem. This ensures consistency of schemas across applications and enables seamless data integration between producers and consumers. Through centralized schema validation, teams can maintain data quality standards and evolve their schemas in a controlled manner. C# support is available in all AWS Regions (https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/) where Glue Schema Registry is available. Visit the Glue Schema Registry C# documentation (https://docs.aws.amazon.com/glue/latest/dg/schema-registry-gs-serde-csharp.html) and the NuGet package (https://www.nuget.org/packages/AWS.Glue.SchemaRegistry) to get started with C# integration.
aws.amazon.com
November 5, 2025 at 11:05 PM
AWS Glue Data Catalog offers advanced automatic optimization for Apache Iceberg tables

AWS Glue Data Catalog (https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html) now offers advanced automatic optimization for Apache Iceberg tables. This update includes supporting compaction of delete ...

#AWS #AwsGlue
AWS Glue Data Catalog offers advanced automatic optimization for Apache Iceberg tables
AWS Glue Data Catalog (https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html) now offers advanced automatic optimization for Apache Iceberg tables. This update adds support for compaction of delete files, nested data types, partial progress commits, and partition evolution, making it easier to maintain consistently performant transactional data lakes. These features address challenges faced by customers with streaming data continuously ingested into Apache Iceberg tables, which results in a large number of delete files that track changes in data files. With this new capability, the Glue Data Catalog constantly monitors table partitions for positional and equality delete files, initiates the compaction process, and regularly commits partial progress to reduce conflicts. Glue Catalog optimizers now support schema evolution as you reorder or rename columns, as well as partition spec evolution. In addition, the Glue Catalog has expanded support for heavily nested complex data and for the Parquet compression codecs zstd, brotli, lz4, gzip, and snappy. Enabling automatic compaction reduces delete files and metadata overhead on your Iceberg tables and improves query performance. These new features are automatically applied to existing and new Glue Catalog optimizers. In addition to the AWS console, customers can also use the AWS CLI or AWS SDKs to automate optimization for Apache Iceberg tables. The feature is available in 14 AWS regions: US East (N. Virginia, Ohio), US West (Oregon), Europe (Ireland, London, Frankfurt, Stockholm), Canada (Central), Asia Pacific (Tokyo, Seoul, Mumbai, Singapore, Sydney), and South America (São Paulo). To learn more, read the blog post (https://aws.amazon.com/blogs/big-data/accelerate-queries-on-apache-iceberg-tables-through-aws-glue-auto-compaction/) and visit the AWS Glue Data Catalog documentation (https://docs.aws.amazon.com/glue/latest/dg/table-optimizers.html).
aws.amazon.com
December 19, 2024 at 10:05 PM
AWS Glue is now available in Asia Pacific (Malaysia)

We are happy to announce that AWS Glue, a serverless data integration service, is now available in the AWS Asia Pacific (Malaysia) Region.

AWS Glue is a serverless data integration service that makes it simple to discove...

#AWS #AwsGlue
AWS Glue is now available in Asia Pacific (Malaysia)
We are happy to announce that AWS Glue, a serverless data integration service, is now available in the AWS Asia Pacific (Malaysia) Region. AWS Glue makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development, and provides both visual and code-based interfaces to make data integration simpler, so you can analyze your data and put it to use in minutes instead of months. To learn more, visit the AWS Glue product page (https://aws.amazon.com/glue/) and our documentation (https://docs.aws.amazon.com/glue/). For AWS Glue region availability, please see the AWS Regional Services List (https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).
aws.amazon.com
November 14, 2024 at 10:05 PM
AWS Glue enables enhanced Apache Spark capabilities for AWS Lake Formation tables with full table access

AWS Glue now supports read and write operations from AWS Glue 5.0 Apache Spark jobs on AWS Lake Formation registered tables when the job role has full table...

#AWS #AwsGlue #AwsLakeFormation
AWS Glue enables enhanced Apache Spark capabilities for AWS Lake Formation tables with full table access
AWS Glue now supports read and write operations from AWS Glue 5.0 Apache Spark jobs on AWS Lake Formation registered tables when the job role has full table access. This capability enables SQL operations, including CREATE and ALTER (DDL) as well as DELETE, UPDATE, and MERGE INTO (DML) statements, on Apache Hive and Iceberg tables from within the same Apache Spark application. While Lake Formation's fine-grained access control (FGAC) offers granular security controls at row, column, and cell levels, many ETL workloads simply need full table access. This new feature enables AWS Glue 5.0 Spark jobs to directly read and write data when full table access is granted, removing limitations that previously restricted certain Extract, Transform, and Load (ETL) operations. You can now leverage advanced Spark capabilities, including Resilient Distributed Datasets (RDDs), custom libraries, and user-defined functions (UDFs), with Lake Formation tables. Additionally, data teams can run complex, interactive Spark applications through SageMaker Unified Studio in compatibility mode while maintaining Lake Formation's table-level security boundaries. This feature is available in all AWS Regions where AWS Glue and AWS Lake Formation are supported. To learn more, visit the AWS Glue product page (https://aws.amazon.com/glue/) and the full table access documentation (https://docs.aws.amazon.com/glue/latest/dg/security-access-control-fta.html).
aws.amazon.com
June 25, 2025 at 6:05 PM
🔊 Create external volumes on #s3, #azure, and #gcp
📚 Connect your existing catalog of choice––think #openCatalog, #awsGlue, or #apachePolaris (incubating)
🔗 [Optional] Create a catalog-linked database to auto-discover and auto-refresh your existing Iceberg tables
✏️ Write to Iceberg tables
June 12, 2025 at 4:29 PM
🚀 We recently tackled a client's complex data loading process using AWS Glue, leading to major improvements! 🌟 We optimized the process, boosting efficiency by 80% and reducing costs by nearly 50%. #CenitSolutions #AWSGlue #DataTransformation
June 19, 2024 at 3:23 PM
Automate CSV ingestion into #DynamoDB with #AWSLambda & #S3! Process millions of rows instantly, track failed writes, and scale with #AWSGlue. Perfect for finance, healthcare & regulated industries. No more manual uploads—just seamless automation! Watch now to transform your workflow. #AWS #ETL
May 23, 2025 at 3:31 PM
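One concrete detail behind the pipeline described above: DynamoDB's `BatchWriteItem` accepts at most 25 items per request, so rows parsed from the CSV must be chunked before writing. The sketch below shows the parse-and-chunk step in plain Python; the column names are hypothetical, and the actual writes (from the Lambda handler) are only indicated in comments.

```python
# Sketch: parse CSV rows and chunk them to DynamoDB's 25-item batch limit.
import csv
import io

def chunk(items, size=25):
    """Yield lists of at most `size` items (the BatchWriteItem limit)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical CSV payload with 60 data rows.
raw = "order_id,amount\n" + "\n".join(f"{i},{i * 10}" for i in range(60))
rows = list(csv.DictReader(io.StringIO(raw)))

batches = list(chunk(rows))  # 60 rows -> batches of 25, 25, and 10
# for batch in batches:
#     client.batch_write_item(...)  # issued from the Lambda handler
```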