re:Invent 2022 release summary — Data, Artificial Intelligence and Machine Learning

Mark Ross
8 min read · Jan 5, 2023


Well, a month went quickly! Having set out with the best of intentions to write a series of posts on re:Invent announcements, the weeks since I got back home have gone in a flash and it’s already 2023… on the plus side, it’s only 11 months until re:Invent 2023! If you want to read a summary of the Compute, Networking and Storage announcements, see my previous post here.

I thought I’d turn my attention next to AI and ML announcements, and of course it’s hard to mention these topics without touching on the data storage and analytics services, given their importance in providing good-quality data to the AI and ML services.

Overview of an end-to-end data strategy

To derive actionable insights from data you need good-quality data. As the saying goes, ‘garbage in, garbage out’, and if you’re not careful you can certainly get a lot of garbage in.

Another important consideration is understanding what data you have. With the advent of relatively cheap cloud storage and the explosion of data sources, it is possible to be overwhelmed by the amount of data you hold. Data lake strategies work on the principle of schema-on-read, which encourages the storage of all types of data: you might not know how you’ll want to interpret it in the future, and you’re not forced into a fixed schema just to store it in the first place. This makes cataloguing your data important; you want to know what you’ve got and where it is, otherwise you’re going to struggle to extract value from it.

The last area I’ll touch on is governance. As more and more data is aggregated together, its value can increase significantly; for example, I’ve seen some government customers increase the classification of data as it gets aggregated. It’s therefore important to ensure the data is adequately protected, and that it’s only available to people with a genuine need to access it. The latter point is a careful balancing act against granting access to enough data to enable business improvements and decisions.

AWS has a broad set of services that cover an end-to-end data strategy.

There’s a complete range of data storage services covering relational, NoSQL, document, ledger and time-series databases — https://aws.amazon.com/products/databases/.

There’s a comprehensive set of data analytics services covering batch and real-time requirements, data manipulation and extract, transform, load (ETL) — https://aws.amazon.com/big-data/datalakes-and-analytics/

There’s a broad set of AI/ML services available, covering specific use cases out of the box — speech-to-text, text-to-speech, language translation, comprehension and object detection, to name a few. There are also services to develop your own capabilities, ranging from low-code / no-code options through to writing your own code with SageMaker — https://aws.amazon.com/machine-learning/

It’s fair to say my summary of re:Invent announcements won’t be exhaustive; there are simply too many to keep abreast of them all. AWS have a summary across the board for all services here, and if you want to keep up to date on an ongoing basis I’d recommend consuming the RSS feed here. I’ve broken down the announcements I’ll cover into some loosely coupled areas below…

Security

AWS continue to innovate in the security space with AI/ML services, in an effort to improve the security posture of the cloud, and of customer workloads running in the cloud, without necessitating huge investment on the customer side. A couple of announcements caught my eye in this area related to GuardDuty, the AI/ML-powered threat detection service. Having worked on projects with significant security requirements in the past, I really like GuardDuty.

When you hear people talking about ‘undifferentiated heavy lifting’, there is nothing worse than a Security Information and Event Management (SIEM) tool that can ingest any logs you want, but where the only way you’re going to get actionable insights out of it is to write something like a regular expression (regex) for the data you’re looking to extract from the logs. At best, perhaps you have a partner or some out-of-the-box capability for popular components, but often you’re playing catch-up: reacting after the event, finding the log patterns, and then looking for them in the future. GuardDuty brings cloud scale to the party, so everyone benefits from all the behaviour AWS has seen or determined to be a likely threat. As AWS provide more and more ‘higher value’ services, I envisage GuardDuty will support more and more of them.

GuardDuty now supports RDS, albeit in preview. This brings the power of GuardDuty protection to your databases, so you can find out about potential threats and take event-driven action to mitigate them if you desire.
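If you fancy experimenting with the event-driven side, here’s a minimal sketch (Python with boto3) of routing GuardDuty findings to a notification topic via EventBridge. The rule name, the finding-type prefix and the SNS topic ARN are all my own illustrative assumptions, not from the announcement:

```python
import json
import boto3

events = boto3.client("events")

# Match GuardDuty findings relating to RDS protection.
# The "CredentialAccess:RDS" prefix is an assumption for illustration only.
rule_name = "guardduty-rds-findings"  # hypothetical name
events.put_rule(
    Name=rule_name,
    EventPattern=json.dumps({
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"type": [{"prefix": "CredentialAccess:RDS"}]},
    }),
    State="ENABLED",
)

# Route matched findings to an SNS topic (the ARN is a placeholder).
events.put_targets(
    Rule=rule_name,
    Targets=[{
        "Id": "notify-security-team",
        "Arn": "arn:aws:sns:eu-west-1:123456789012:security-alerts",
    }],
)
```

In practice you’d likely point the target at a Lambda function or incident-response workflow rather than a plain topic, but the pattern is the same.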

Earlier in 2022, GuardDuty support was extended to EKS to provide threat detection for your Kubernetes clusters. It was announced at re:Invent that this support is being further enhanced to provide threat detection for the containers themselves, not just the clusters, integrated with EKS.

Away from GuardDuty, Amazon Security Lake was announced, centralising AWS and third-party log sources into a data lake normalised to the Open Cybersecurity Schema Framework (OCSF) format. We’re proud at Atos to be a launch service partner for this capability.

Data Governance

Amazon SageMaker ML Governance was released. Through a combination of SageMaker Role Manager (permission management), SageMaker Model Cards (centralised, standardised model documentation) and SageMaker Model Dashboard (tracking and monitoring of deployed models and their violations), it helps you simplify access control and enhance transparency over your ML projects.
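As a flavour of the Model Cards piece, here’s a minimal sketch using boto3 to create a model card. The card name and content fields are my own illustrative choices; the real content follows the SageMaker Model Card JSON schema, of which this shows only a tiny subset:

```python
import json
import boto3

sm = boto3.client("sagemaker")

# A minimal, illustrative subset of the model card content schema.
content = {
    "model_overview": {
        "model_description": "Churn prediction model for the retail team.",
    },
    "intended_uses": {
        "purpose_of_model": "Prioritise customer retention offers.",
    },
}

sm.create_model_card(
    ModelCardName="churn-model-card",  # hypothetical name
    ModelCardStatus="Draft",           # Draft | PendingReview | Approved | Archived
    Content=json.dumps(content),
)
```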

Amazon Redshift centralised access controls were announced in preview, enabling you to efficiently share live data across Amazon Redshift data warehouses using AWS Lake Formation to centrally manage permissions.

Amazon DataZone was announced, providing the capability to share, search and discover data at scale across organisational boundaries, with collaboration on data projects through a unified data analytics portal. It supports personalised views and the ability to use business terms to search for data, which is another regular theme AWS pursue to lower the barrier to entry of their services for consumers.

Productivity

I’ve loosely bundled together a number of announcements that I see as being designed to increase productivity. This is great for customers, as it lowers the cost of exploiting AWS services, and great for AWS, as their annual recurring revenue doesn’t hit a glass ceiling imposed by the number of people who can write code, build machine learning models and so on!

Amazon CodeWhisperer is a machine-learning-powered tool to support developer productivity. It generates code recommendations and can even write entire functions based on a developer’s natural-language comments. Although it was already available prior to re:Invent, there were a number of announcements extending its functionality, for example additional language support and code recommendations for AWS APIs.
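To give a flavour of the comment-driven style, a developer might type the comment below and accept a suggestion along these lines. To be clear, this is my own illustration of the kind of output, not a captured CodeWhisperer suggestion:

```python
import boto3

# upload a file to an S3 bucket and return the object URL
def upload_file_to_s3(file_path: str, bucket: str, key: str) -> str:
    s3 = boto3.client("s3")
    s3.upload_file(file_path, bucket, key)
    return f"https://{bucket}.s3.amazonaws.com/{key}"
```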

Amazon QuickSight Q is an extension to the Amazon QuickSight business intelligence solution that allows you to query your data using natural language. Again, this service existed prior to re:Invent, but it was enhanced with forecasting capabilities.

We then have a number of announcements in the data integration space, where AWS are supporting improved productivity with the manipulation and sharing of data in what they describe as a move towards ‘zero ETL’.

Amazon Aurora zero-ETL integration with Amazon Redshift was released, making transactional data from Aurora available within Redshift for analytics within seconds. No ETL pipelines to maintain.

Amazon Redshift auto-copy from Amazon S3 simplifies loading S3 data into Redshift and managing that ingestion. You can now set up continuous file-ingestion rules based on S3 paths, and the data will be loaded for you automatically.
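Here’s a sketch of what setting that up might look like, using the Redshift Data API to run the preview COPY JOB syntax. The cluster, database, table, bucket and role are placeholders, and the JOB CREATE syntax is as I understood it at preview, so treat it as indicative:

```python
import boto3

rsd = boto3.client("redshift-data")

# Create a copy job so new files under the S3 prefix are loaded automatically.
# The table, bucket, role ARN and the preview COPY JOB syntax are assumptions.
sql = """
COPY sales FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
JOB CREATE sales_auto_copy AUTO ON;
"""

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster
    Database="dev",
    DbUser="admin",
    Sql=sql,
)
```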

Amazon Redshift integration for Apache Spark now allows you to build and run Spark applications on Redshift data without needing to move the data from one to the other.
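A minimal PySpark sketch of the pattern, assuming the connector that ships with EMR/Glue; the JDBC URL, table name, temp directory and IAM role are all placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-spark-demo").getOrCreate()

# Read a Redshift table directly into a DataFrame via the Spark connector.
sales_df = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "sales")
    .option("tempdir", "s3://my-bucket/redshift-temp/")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftSparkRole")
    .load()
)

# Filter and inspect the data without ever exporting it from Redshift by hand.
sales_df.filter(sales_df.amount > 100).show()
```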

AWS Glue Data Quality automatically monitors and measures the quality of your data lakes and data pipelines, automatically assessing your data and recommending rules to save you writing them from scratch yourself.
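As an illustration, rules are expressed in Glue’s Data Quality Definition Language (DQDL). Here’s a minimal sketch creating a ruleset against a hypothetical Glue catalog table; the database, table and rules are my own examples:

```python
import boto3

glue = boto3.client("glue")

# A small DQDL ruleset: completeness, a value constraint and a volume check.
ruleset = """Rules = [
    IsComplete "order_id",
    ColumnValues "amount" > 0,
    RowCount > 1000
]"""

glue.create_data_quality_ruleset(
    Name="orders-quality-checks",  # hypothetical name
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)
```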

Scale / Agility / Availability

There were a number of announcements I’d group into the scalability, agility and availability area. One of the Well-Architected principles is ‘stop guessing capacity’, so it’s no surprise to see AWS continue to iterate, moving services from ‘push button’ scaling (i.e. where you have to define your requirements) towards serverless capabilities or, failing that, some form of auto-scaling, which still ultimately has a defined capacity ceiling.

OpenSearch went serverless, at least in preview, allowing use of OpenSearch without having to configure, manage or scale your clusters. It also supports scaling ingestion and indexing independently from search and querying of the data.
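Creating a collection is a single API call. A minimal sketch below; note it assumes an encryption security policy covering the collection name already exists in the account, and the collection name itself is illustrative:

```python
import boto3

aoss = boto3.client("opensearchserverless")

# Create a serverless collection; no clusters to size or manage.
response = aoss.create_collection(
    name="app-logs",        # hypothetical name
    type="TIMESERIES",      # SEARCH or TIMESERIES
    description="Serverless collection for application log analytics",
)
print(response["createCollectionDetail"]["status"])
```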

DocumentDB went elastic. You can now elastically scale your document databases to handle millions of reads and petabytes of storage, with sharding that grows with your application, in a fraction of the time that would be required manually.
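A minimal sketch of creating an elastic cluster with boto3; all of the names and sizing values are illustrative, with shardCapacity and shardCount controlling the vCPUs per shard and the initial shard count respectively:

```python
import boto3

docdb = boto3.client("docdb-elastic")

# Create an elastic (sharded) DocumentDB cluster; values are placeholders.
docdb.create_cluster(
    clusterName="orders-elastic",                  # hypothetical name
    adminUserName="dbadmin",
    adminUserPassword="example-password-change-me",
    authType="PLAIN_TEXT",
    shardCapacity=2,   # vCPUs per shard
    shardCount=2,      # initial number of shards
)
```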

Amazon Redshift now supports multi-AZ in preview. This will simplify disaster recovery for people with mission-critical analytics. Previously, clusters were limited to a single AZ, and you needed your own solution to either monitor and react to failure or invoke recovery manually.
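For illustration, here’s what provisioning a Multi-AZ cluster might look like through the API; the MultiAZ flag and the placeholder values are my assumptions based on the existing CreateCluster shape, and the preview may differ:

```python
import boto3

redshift = boto3.client("redshift")

# Create an RA3 cluster spanning two AZs; identifier and credentials are
# placeholders, and the MultiAZ flag assumes the post-preview API shape.
redshift.create_cluster(
    ClusterIdentifier="analytics-cluster",
    NodeType="ra3.xlplus",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="Example-Password-1",
    MultiAZ=True,
)
```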

Specific Industry Use Cases

There were a couple of specific topics that I thought worthy of calling out.

Amazon Omics will help healthcare and life science organisations and their partners store, query and analyse genomic, transcriptomic and other omics data. This should help accelerate advancements in the field of healthcare and life sciences.
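As a small taste of the API, here’s a hedged sketch creating a sequence store for genomic read data; the store name and description are my own illustrative choices:

```python
import boto3

omics = boto3.client("omics")

# Create a sequence store to hold genomic read data (e.g. FASTQ/BAM/CRAM).
store = omics.create_sequence_store(
    name="genomics-reads",  # hypothetical name
    description="Sequence store for the research team's read sets",
)
print(store["id"], store["arn"])
```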

Amazon SageMaker Geospatial ML allows data scientists and ML engineers to easily build, train and deploy ML models using geospatial data (e.g. satellite, mapping and location data). No doubt there are many applications here for connected car and other IoT use cases.

So that’s my quick summary of data, AI and ML from AWS re:Invent 2022. If you’ve not already seen it I’d recommend catching Swami’s keynote here, and the related breakout sessions here.

My takeaway was that AWS will continue to innovate in this area, and the sentiment of the phrase ‘data is the new oil’ isn’t going away any time soon. Expect to see further improvements to ease integration between services so you can get your data where it needs to be, and further enhancements to visibility and governance tooling so you can remain compliant. I also expect further development of the range of consumption options, with low-code / no-code (with less flexibility) at one end to accommodate those who aren’t coders but understand business challenges, and fully flexible options that require coding experience at the other end of the spectrum.
