Vast Data Engine: Revolutionizing AI Data Management for 2024
Discover the Vast Data Engine, a groundbreaking AI-focused data platform set to launch in 2024, designed to transform unstructured data into structured insights and enhance deep learning applications.
Video Summary
In a recent announcement, Jeff Denworth, the co-founder of Vast Data, unveiled the Vast Data Engine, a pivotal element of their AI-centric data platform, which is set to launch in 2024. This innovative platform is grounded in three fundamental principles: adherence to industry standards, simplicity in both infrastructure and data management, and empowering customers with control over their data and infrastructure. The Vast Data Engine is designed to elevate AI capabilities by converting unstructured data into structured insights, building upon previous advancements such as the Vast Data Store introduced in 2019 and the Vast Database.
The architecture of the Vast Data Engine is tailored for global-scale distributed computing, facilitating seamless integration across hybrid and multi-cloud environments. Denworth highlighted the necessity for a system that emulates human learning processes, advocating for a shift from traditional batch processing to real-time, continuous computing. A significant feature of the engine is the introduction of a new AI data format that accommodates both structured and unstructured data, thereby enabling sophisticated deep learning applications. This forward-thinking approach addresses the shortcomings of existing data platforms, which are predominantly designed for structured data and often struggle to manage the vast quantities of unstructured data generated in today's digital landscape.
The DAIS architecture, which Denworth describes as the first modern systems architecture since the Google File System in 2003, plays a crucial role in this initiative. It supports global computing across extensive data resources, creating a scalable and resilient data center computer that tolerates failures while maintaining six nines of availability. The architecture combines state with logic and accommodates events, files, objects, and tables within a unified stateful namespace.
Vast Data's platform introduces new data types, specifically triggers and functions, which significantly enhance the system's capacity to process data dynamically. By integrating structured and unstructured data, the platform allows for real-time querying and continuous analytics through its Vast Streams interface, which merges a streaming engine with a high-performance tabular data store so users can query both new and existing data effortlessly.
The Vast Database is presented as the world's first system to combine transactional and analytical performance, enabling continuous processing and real-time interaction with data. A built-in notification system keeps consumers informed of data updates, ensuring efficient communication within the platform. The discussion also covers Vast Data's AI pipeline, emphasizing its efficient programming environment built on a Python SDK and a C++ architecture. As an example, the system catalogs incoming photos in the Vast Database in real time, executing functions such as creating thumbnails and performing data augmentations.
Designed for high computing efficiency, the architecture allows for substantial infrastructure savings and a transition from batch processing to real-time continuous computing. The event notification engine plays a vital role in tracking data changes and function executions, enhancing communication within the system. Furthermore, the integrated system supports serverless computing, enabling deployment across diverse environments, including cloud and edge computing.
The concept of 'code as data' is introduced, allowing for dynamic management of data and functions within the pipeline. The Vast data set concept is presented as a method to create materialized views for training models without necessitating specific file creation. The pipeline processes data through various functions, including inference and augmentation, all while maintaining version control and the ability to revert to previous states. As the presentation concluded, Denworth provided a glimpse into future developments anticipated in 2024, underscoring the vision of integrating logic with state and compute with storage to foster advanced deep learning applications.
Keypoints
00:00:06
Introduction
Jeff Denworth, co-founder of Vast Data, introduces the Vast Data Engine, a product aimed at creating a 'thinking machine' through advanced data processing capabilities.
00:00:29
Architecture Principles
Denworth emphasizes three core architectural principles: adherence to standards to avoid code rewriting, simplicity in infrastructure and data management, and customer control over data and infrastructure, allowing for a personalized zero-trust agenda.
00:01:30
Vast Data Engine Overview
The Vast Data Engine is described as the logic behind Vast Data's distributed AI computer, designed to evolve and discover insights as AI applications advance. This initiative builds on work that began in 2016, culminating in the introduction of the Vast Data Store in 2019.
00:02:04
Vast Data Store
The Vast Data Store, launched in 2019, serves as a universal storage system that addresses the long-standing performance and capacity trade-offs faced by customers, particularly in managing unstructured data for AI applications.
00:02:28
Vast Database
Denworth introduces the Vast Database, the world's first transactional and analytical database that integrates both functionalities down to the archive level, establishing a semantic layer for AI computing.
00:03:00
Vast Data Space
The Vast Data Space is created to facilitate global data processing, allowing customers to seamlessly transition between different cloud environments, thus enhancing the overall data management experience.
00:03:35
Engine Availability
The Vast Data Engine is set to be available in 2024, marking the final component of Vast Data's infrastructure aimed at simplifying AI deployment for customers at any scale.
00:03:52
Data Processing Pipeline
Denworth outlines a classic data processing pipeline, which includes various stages such as data landing, transformation in data lakes, and preparation for AI computing, highlighting the complexity and multiple steps involved in managing data effectively.
00:04:48
Real-Time Learning
The discussion begins with the observation that traditional computing methods do not align with human learning processes. The speaker emphasizes the need for a system that mirrors human learning, which involves continuous interaction with the environment and real-time data processing. This necessitates a redefinition of 'real-time' to ensure that learning occurs from events as they happen.
00:05:24
Data Processing Challenges
The speaker highlights the limitations of current data platforms, which are primarily designed for structured data and traditional analytics. These platforms typically handle data in the terabyte to petabyte range, processed by CPUs in centralized data lakes. However, deep learning requires handling vast amounts of unstructured data, such as images and text, which cannot be easily categorized or processed using existing architectures.
00:06:54
Need for Distributed Systems
The speaker points out that deep learning involves processing data at an unprecedented scale, often reaching petabytes to exabytes. This necessitates distributed systems that operate globally, as centralized data lake architectures cannot overcome the data gravity of such large datasets. The need for GPUs over CPUs is also emphasized due to the nature of the data being processed.
00:07:30
VAST Data Engine
In response to the identified challenges, the speaker introduces the VAST data engine, designed for continuous and recursive computing. This system allows data to flow through it as events, triggering additional processing and correlations. Unlike traditional batch-oriented systems, the VAST data engine supports real-time processing and introduces a new AI data format that accommodates both structured and unstructured data, complete with versioning for model lineage.
00:08:34
DAIS Architecture
The speaker explains the DAIS architecture, which is the first modern systems architecture since the introduction of the Google File System in 2003. DAIS enables a data center-scale computer with a collection of CPUs and a high-performance, low-latency network. This architecture allows for explicit parallel processing and resilience, adopting a 'cattle versus pets' mentality where infrastructure can fail without affecting overall system availability.
00:09:40
Infrastructure Availability
Customers are deploying infrastructure with six nines of availability, supporting hundreds of petabytes in a common cluster, a feat achieved with the DAIS architecture. This level of scale and availability is unprecedented, showcasing the capabilities of the system.
00:09:59
Cost Efficiency
A core principle of the DAIS architecture is achieving efficiency, ensuring that all-flash infrastructure costs no more than traditional hard drive-based infrastructure. This is accomplished through various innovative efficiency approaches, allowing for a single tier of flash for all data.
00:10:17
Data Processing Architecture
The DAIS architecture combines state with logic at a data center scale, processing data through multiple interfaces. It supports protocols for unstructured data, query handling, stream ingestion, and notifications, alongside a serverless environment for executing built-in and customer-provided functions across numerous containers.
00:11:03
Data Representation
The system supports various data representations necessary for building a distributed AI pipeline, including events, files, objects, tables, functions, and triggers. Policies for quality of service, security, and replication can be applied from a single stateful namespace, enabling comprehensive data management without the need for additional systems.
00:11:37
Vast Data Engine Introduction
The Vast Data Engine aims to 'breathe life into data' by adding two new types of data elements: triggers and functions. These elements interact with data and metadata, creating a dynamic system in which incoming data can fire triggers, which in turn call functions that generate additional data.
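The trigger-and-function loop described above can be pictured in a few lines of Python. This is a minimal sketch of the pattern only; every class and method name below is a hypothetical illustration, not the actual Vast SDK API.

```python
# Minimal sketch of triggers and functions over a stateful namespace.
# All names here are hypothetical illustrations, not the Vast SDK API.

class Engine:
    def __init__(self):
        self.data = {}        # stateful namespace: path -> object
        self.triggers = []    # (predicate, function) pairs

    def register_trigger(self, predicate, function):
        self.triggers.append((predicate, function))

    def put(self, path, obj):
        # Writing data is the event: matching triggers fire functions,
        # which may in turn write more data (recursive computing).
        self.data[path] = obj
        for predicate, function in self.triggers:
            if predicate(path):
                function(self, path, obj)

def make_thumbnail(engine, path, obj):
    # A function invoked by a trigger; it writes derived data back.
    engine.put(path + ".thumb", {"source": path, "size": "64x64"})

engine = Engine()
engine.register_trigger(lambda p: p.endswith(".jpg"), make_thumbnail)
engine.put("photos/cat.jpg", {"bytes": b"..."})
print(sorted(engine.data))  # ['photos/cat.jpg', 'photos/cat.jpg.thumb']
```

Note how the thumbnail write goes back through `put`, so it could itself fire further triggers, which is the recursive behavior the talk emphasizes.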
00:12:30
Structured and Unstructured Data
The Vast data platform effectively marries structured and unstructured data, allowing files and objects to be stored in a high-scale, low-cost architecture while also supporting structured data tables with ACID transactional guarantees. This integration facilitates high-performance querying across all data types within a single system.
00:13:16
Vast Streams Interface
The new Vast Streams interface combines a streaming engine with a high-performance tabular data store, eliminating the need for separate systems for data ingestion and storage. This allows topics to be treated as tables, enabling ACID transactions and real-time querying of both new and existing data, thus transforming the streaming paradigm.
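The "topics as tables" idea can be sketched as a single structure that serves both stream subscribers and historical queries. This is a hypothetical illustration of the concept, not the Vast Streams API.

```python
# Sketch: a topic treated as a table. Each append is simultaneously a
# stream event and a queryable row. Hypothetical, not the real API.

class TopicTable:
    def __init__(self):
        self.rows = []        # the topic IS the table
        self.consumers = []   # stream subscribers

    def produce(self, row):
        self.rows.append(row)       # queryable immediately
        for callback in self.consumers:
            callback(row)           # and delivered as a stream event

    def query(self, predicate):
        # Old and new records answer the same query; no separate store.
        return [r for r in self.rows if predicate(r)]

seen = []
topic = TopicTable()
topic.consumers.append(seen.append)
topic.produce({"ts": 1, "msg": "ok"})
topic.produce({"ts": 2, "msg": "alert"})
print(len(seen), topic.query(lambda r: r["msg"] == "alert"))
```

The point of the sketch is that there is no hand-off between a broker and a warehouse: producing a record and making it queryable are one operation.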
00:14:01
Reactive Computing
The Vast database underpins a reactive mode of computing, allowing systems to interact with the natural world at the speed of thought. This innovative approach combines transactional and analytical performance, enabling comprehensive analytics down to the archive level, marking a significant advancement in data processing capabilities.
00:14:27
Streaming Integration
The discussion emphasizes a shift from traditional batch computing to continuous streaming, highlighting the integration of a streaming interface with the Vast database. This innovative approach eliminates the need for separate data stores, allowing real-time queries on vast amounts of data, including exabytes, thus enabling immediate insights and interactions with the natural world.
00:15:56
Vast Database Features
The Vast database is presented as the world's first transactional and analytical database that maintains high transactional performance even at the archive level. This system supports continuous queries and analytics, merging structured and unstructured data, which is crucial for real-time data processing and correlation.
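The combination of transactional writes and analytical reads over the same rows can be illustrated with any SQL store. The sqlite snippet below stands in for the idea only; it is in no way the Vast database itself.

```python
# Sketch: one store serving transactional writes and analytical reads.
# sqlite is a stand-in for the concept, not the Vast database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

# Transactional path: an atomic multi-row write.
with db:
    db.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 10.0), (2, 30.0), (3, 20.0)])

# Analytical path: an aggregate over the same, current rows.
total, = db.execute("SELECT SUM(amount) FROM orders").fetchone()
print(total)  # 60.0
```

In conventional stacks these two paths run against different systems with a batch copy in between; the claim in the talk is that one system serves both, down to archived data.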
00:16:01
Notification System
A built-in notification system within the Vast platform is described, which alerts users to every data update, including creates, deletes, and updates. This notification engine serves as a communication bus, facilitating interactions between various components of the system and external applications, thereby enhancing the overall functionality of AI pipelines.
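A notification bus of this kind can be sketched as a store that broadcasts every create, update, and delete to its subscribers. The names below are hypothetical illustrations of the pattern, not the platform's actual interface.

```python
# Sketch of the notification bus: every create, update, and delete is
# broadcast to subscribers. Class and method names are hypothetical.

class NotifyingStore:
    def __init__(self):
        self.objects = {}
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def _emit(self, event, key):
        for callback in self.subscribers:
            callback(event, key)

    def put(self, key, value):
        event = "update" if key in self.objects else "create"
        self.objects[key] = value
        self._emit(event, key)

    def delete(self, key):
        del self.objects[key]
        self._emit("delete", key)

log = []
store = NotifyingStore()
store.subscribe(lambda event, key: log.append((event, key)))
store.put("a.txt", 1)
store.put("a.txt", 2)
store.delete("a.txt")
print(log)  # [('create', 'a.txt'), ('update', 'a.txt'), ('delete', 'a.txt')]
```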
00:16:38
Programming Environment
The platform includes a programming environment supported by a Python SDK, allowing users to efficiently manage data. An example is provided where photos are cataloged in real time as they enter the system, demonstrating how unstructured data is processed and stored in the Vast database, including the generation of thumbnails.
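The cataloging step can be sketched as follows: each arriving photo has structured metadata derived from its raw bytes and written to a catalog table. Everything here is a stdlib-only illustration, with a byte string standing in for real image data and the names being assumptions.

```python
# Sketch of the cataloging flow: as each photo "arrives", a row of
# derived metadata is written to a catalog table. Names are
# hypothetical; a byte string stands in for real image data.
import hashlib

catalog = []  # stands in for a table in the database

def ingest_photo(name, data):
    # Derive structured metadata from the unstructured payload.
    catalog.append({
        "name": name,
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest()[:12],
    })

ingest_photo("cat.jpg", b"\xff\xd8fake-jpeg-bytes")
print(catalog[0]["name"], catalog[0]["bytes"])  # cat.jpg 17
```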
00:17:52
Efficiency and Architecture
The architecture of the Vast data platform is designed for high efficiency, implemented in C++ to maximize machine utilization. This single-system approach not only enhances data and compute efficiency but also transforms how data platforms are built, promoting a real-time, continuous computing paradigm.
00:18:40
Event Tracking
The event notification engine is crucial for tracking data changes and function executions within the system. This capability allows for seamless integration of internal functions and external applications, ensuring that all events are communicated effectively, thus supporting a robust programming environment capable of operating at exabyte scale.
00:19:03
DAIS Architecture
The DAIS architecture is highlighted as a web-scale flash solution priced comparably to archive infrastructure. This architecture is foundational for the Vast data platform, enabling it to deliver high performance at a cost-effective rate, making it suitable for extensive data processing needs.
00:19:13
Integrated System
The discussion emphasizes the implementation of functions within a C++-based environment, highlighting its efficiency from a hardware perspective. This integrated system simplifies management and reduces costs, marking a shift from a batch environment to a real-time, continuous computing environment where data is central to operations.
00:19:44
Serverless Computing
The programming environment is described as serverless, utilizing a Python-based SDK that facilitates autoscaling of execution nodes. This allows for deployment across various locations, including the cloud, edge, and core data centers, all contributing to a unified global computing system.
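A minimal autoscaling policy of the kind described might size execution nodes from queue depth. The policy, thresholds, and parameter names below are illustrative assumptions, not the platform's actual scaling rules.

```python
# Sketch: an autoscaling policy that sizes execution nodes from queue
# depth. The policy and its parameters are illustrative assumptions.
import math

def scale(queue_depth, per_node=10, min_nodes=1, max_nodes=100):
    # One node per `per_node` queued items, clamped to a sane range.
    wanted = math.ceil(queue_depth / per_node)
    return max(min_nodes, min(max_nodes, wanted))

print(scale(0), scale(35), scale(5000))  # 1 4 100
```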
00:20:30
Dynamic Data Management
The speaker introduces a novel approach to data management, where code and triggers are treated as data. This paradigm shift enables the creation of a pipeline that dynamically links code and data, allowing for seamless updates across the system whenever a function is modified.
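'Code as data' can be pictured as function source stored alongside the data it operates on, so replacing the stored text changes behavior everywhere the function is used. This is a toy sketch of the idea; the storage layout and names are assumptions.

```python
# Sketch of 'code as data': function source lives in the store, so
# replacing the stored text changes the pipeline's behavior in place.
store = {"fn/label": "def label(x): return 'big' if x > 10 else 'small'"}

def call(name, arg):
    namespace = {}
    exec(store[name], namespace)     # load the function from the store
    return namespace[name.split("/")[1]](arg)

print(call("fn/label", 42))          # big
store["fn/label"] = "def label(x): return 'huge' if x > 10 else 'tiny'"
print(call("fn/label", 42))          # huge
```

Because callers always fetch the current source, updating the stored function is the "seamless update across the system" the talk describes.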
00:21:01
Vast Data Set
The concept of a 'Vast data set' is introduced, serving as a foundation for training models. This approach allows for the creation of materialized views around data without the need for specific file structures, as the information is organized within the Vast database.
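A data set as a materialized view can be sketched as a saved selection over the catalog, with no files created for the selected records. Field names and the selection predicate below are illustrative assumptions.

```python
# Sketch: a 'data set' as a materialized view, i.e. a saved selection
# over the catalog with no files created. Names are illustrative.
catalog = [
    {"id": 1, "label": "cat", "split": "train"},
    {"id": 2, "label": "dog", "split": "test"},
    {"id": 3, "label": "cat", "split": "train"},
]

def materialize(predicate):
    # The 'view' is just the selected rows, ready for a training job.
    return [row for row in catalog if predicate(row)]

train_cats = materialize(lambda r: r["split"] == "train" and r["label"] == "cat")
print([r["id"] for r in train_cats])  # [1, 3]
```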
00:22:05
Data Processing Pipeline
A detailed description of a data processing pipeline is provided, illustrating how data flows into the system, is stored in the Vast data store, and cataloged in the Vast database. The pipeline includes functions for inference, data augmentation, and training model refinement, all while maintaining a structure that accommodates both structured and unstructured data.
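The staged flow described above can be sketched as a record passed through a chain of functions. The stage functions here are toy stand-ins for real inference and augmentation, and all names are assumptions.

```python
# Sketch of the pipeline stages: a record lands, then passes through
# inference and augmentation functions. All names are hypothetical.

def infer(record):
    # Stand-in for model inference: tag the record with a label.
    record["label"] = "cat" if "cat" in record["name"] else "unknown"
    return record

def augment(record):
    # Stand-in for data augmentation: derive extra training variants.
    record["variants"] = [record["name"] + ":flip", record["name"] + ":crop"]
    return record

def pipeline(record, stages):
    for stage in stages:
        record = stage(record)
    return record

out = pipeline({"name": "cat_001.jpg"}, [infer, augment])
print(out["label"], len(out["variants"]))  # cat 2
```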
00:23:14
Deployment Flexibility
The system's deployment capabilities are highlighted, indicating that it can operate in any environment with a vast data space. It supports various hardware configurations, including CPU and GPU, and features auto-scaling based on workload policies, showcasing its versatility and adaptability.
00:23:36
Future Developments
The speaker hints at future developments, with more details to be revealed ahead of the product's launch in 2024. The overarching vision is to integrate logic with state and compute with storage, enhancing the functionality of data in the context of next-generation deep learning.