Looking to build, train, deploy, or implement Generative AI? (Sponsored)
Meet Innodata — offering high-quality solutions for developing and implementing industry-leading generative AI, including:
Diverse Golden Datasets
Supervised Fine-Tuning Data
Human Preference Optimization (e.g. RLHF)
RAG Development
Model Safety, Evaluation, & Red Teaming
Data Collection, Creation, & Annotation
Prompt Engineering
With 5,000+ in-house SMEs and support for expansion and localization across 85+ languages, Innodata drives AI initiatives for enterprises globally.
Data and AI systems are a mess.
They are complex and hard to grasp.
Whether you have just started working in AI or have been in the field for a few years, it’s hard to see how the worlds of data engineering, research (DS, ML), and production (AIE, MLE, MLOps) come together into a single homogeneous system.
As a data engineer, you finish your work by ingesting the standardized data into a data warehouse or lake.
As a researcher, your work finishes when you train the best model on a static dataset and push it to a model registry.
As an AIE or MLE, your work finishes when the model is served in production.
As an MLOps engineer, your work finishes when the operations are automated and monitored adequately for long-term robustness.
But is there a more accessible and intuitive way to understand the entire end-to-end data and AI system?
Yes! Through the FTI architecture.
Let’s quickly dig into the FTI architecture and apply it to a production LLM & RAG use case.
Introducing the FTI architecture
The FTI architecture proposes a clear and straightforward mind map that any team or person can follow to compute the features, train the model, and deploy an inference pipeline to make predictions.
The pattern suggests that any ML system can be boiled down to these 3 pipelines: feature, training, and inference.
This is powerful, as we can clearly define the scope and interface of each pipeline. Ultimately, we have just 3 instead of 20 moving pieces, as suggested in Figure 1, which is much easier to work with and define.
Figure 2 shows the feature, training, and inference pipelines. We will zoom in on each one to understand its scope and interface.
Before going into the details, I would like to point out that each pipeline is a separate component that can run on different processes or hardware. Thus, each pipeline can be written using a different technology, by a different team, or scaled differently.
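To make these interfaces concrete, here is a minimal sketch (in Python, using typing.Protocol) of how the scope of each pipeline could be expressed. The class and method names are illustrative, not tied to any specific framework.

```python
# A minimal sketch of the three FTI interfaces; all names are illustrative.
from typing import Any, Protocol


class FeaturePipeline(Protocol):
    def run(self, raw_data: Any) -> None:
        """Compute features/labels from raw data and write them to the feature store."""


class TrainingPipeline(Protocol):
    def run(self, feature_store: Any) -> Any:
        """Read features/labels from the feature store, train a model,
        and push it to the model registry."""


class InferencePipeline(Protocol):
    def run(self, feature_store: Any, model_registry: Any) -> Any:
        """Load the model and the features, then return predictions
        (in batch or real-time mode)."""
```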
The feature pipeline
The feature pipeline takes raw data as input, processes it, and outputs the features and labels required by the model for training or inference.
Instead of directly passing them to the model, the features and labels are stored inside a feature store. Its responsibility is to store, version, track, and share the features.
By saving the features into a feature store, we always have a state of our features. Thus, we can easily send the features to the training and inference pipelines.
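As a rough illustration of the store, version, and share responsibilities, here is a toy, in-memory stand-in for a feature store. It is not a real feature store product, just a sketch of the behavior described above.

```python
# A toy, in-memory stand-in for a feature store (illustrative only).
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FeatureStore:
    _versions: dict[str, list[Any]] = field(default_factory=dict)

    def write(self, name: str, features: Any) -> int:
        """Store a new version of a feature group and return its version number."""
        self._versions.setdefault(name, []).append(features)
        return len(self._versions[name]) - 1

    def read(self, name: str, version: int = -1) -> Any:
        """Share a specific (or the latest) version with training or inference."""
        return self._versions[name][version]


store = FeatureStore()
v = store.write("articles", [{"text": "cleaned article", "label": None}])
training_view = store.read("articles", version=v)  # the same state is visible
inference_view = store.read("articles")            # to training and inference
```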
The training pipeline
The training pipeline takes the features and labels from the feature store as input and outputs a trained model (or models).
The models are stored in a model registry. Its role is similar to that of feature stores, but the model is the first-class citizen this time. Thus, the model registry will store, version, track, and share the model with the inference pipeline.
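Similarly, a purely illustrative model registry could look like the sketch below, with the model as the first-class citizen. The class and the tags are hypothetical, not a specific product’s API.

```python
# A toy model registry sketch: store, version, and share models with inference.
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ModelRegistry:
    _models: dict[str, dict[str, Any]] = field(default_factory=dict)

    def register(self, name: str, model: Any, version: str, tags: Optional[dict] = None) -> None:
        """Store and version a model, together with metadata such as its status."""
        self._models[f"{name}:{version}"] = {"model": model, "tags": tags or {}}

    def load(self, name: str, version: str) -> Any:
        """Share a specific model version with the inference pipeline."""
        return self._models[f"{name}:{version}"]["model"]


registry = ModelRegistry()
registry.register("llm-twin", model=object(), version="1.0.0", tags={"status": "staging"})
production_model = registry.load("llm-twin", version="1.0.0")
```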
The inference pipeline
The inference pipeline takes as input the features and labels from the feature store and the trained model from the model registry. With these two, predictions can be easily made in either batch or real-time mode.
As this is a versatile pattern, it is up to you to decide what you do with your predictions. If it’s a batch system, they will probably be stored in a DB. If it’s a real-time system, the predictions will be served to the client who requested them.
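A hedged sketch of the two serving modes, assuming a generic model object with a predict method and a placeholder persistence step:

```python
# Illustrative only: the same inference logic can back both serving modes.
from typing import Any, Iterable


def save_to_db(predictions: list) -> None:
    """Placeholder: persist batch predictions to a DB of your choice."""


def predict_batch(model: Any, feature_rows: Iterable[dict]) -> list:
    """Batch mode: score many feature rows at once and store the results."""
    predictions = [{"id": row["id"], "prediction": model.predict(row)} for row in feature_rows]
    save_to_db(predictions)
    return predictions


def predict_realtime(model: Any, features: dict) -> dict:
    """Real-time mode: score a single request and return it to the client."""
    return {"prediction": model.predict(features)}
```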
The most important thing you must remember about the FTI pipelines is their interface. It doesn’t matter how complex your ML system gets — these interfaces will remain the same.
The final thing you must understand about the FTI pattern is that the system doesn’t have to contain only 3 pipelines. In most cases, it will include more.
For example, the feature pipeline can be composed of a service that computes the features and one that validates the data. Also, the training pipeline can comprise the training and evaluation components.
Consider reading the following article for an in-depth dive into the FTI architecture and its benefits:
Applying the FTI architecture to a use case
The FTI architecture is tool-agnostic, but to better understand how it works, let’s present a concrete use case and tech stack.
Use case: Fine-tune an LLM on your social media data (LinkedIn, Medium, GitHub) and expose it as a real-time RAG application. Let’s call it your LLM Twin.
We will split the system into four core components.
You might ask yourself: “Four? Why not three, as the FTI pipeline design clearly states?” That is a great question.
Fortunately, the answer is simple. We must also implement the data pipeline alongside the three feature/training/inference pipelines. According to best practices:
The data engineering team owns the data pipeline.
The ML engineering team owns the FTI pipelines.
Figure 3 shows the LLM system architecture. The best way to understand it is to review the four components individually and explain how they work.
Data collection pipeline
The data collection pipeline involves crawling your personal data from Medium, Substack, LinkedIn, and GitHub.
As a data pipeline, we will use the extract, transform, load (ETL) pattern to extract data from social media platforms, standardize it, and load it into a data warehouse.
The output of this component will be a NoSQL DB, which will act as our data warehouse. As we work with text data, which is naturally unstructured, a NoSQL DB fits like a glove.
Even though a NoSQL DB, such as MongoDB, is not labeled as a data warehouse, from our point of view, it will act as one. Why? Because it stores standardized raw data gathered by various ETL pipelines that are ready to be ingested into an ML system.
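A hedged sketch of this ETL flow, assuming a local MongoDB instance and a hypothetical LinkedIn crawler. The pymongo calls are standard; everything else is a placeholder.

```python
# A sketch of the ETL flow into MongoDB (used here as the data warehouse).
from datetime import datetime, timezone

from pymongo import MongoClient


def crawl_linkedin_posts(profile_url: str) -> list:
    """Placeholder crawler: returns raw posts for a given profile."""
    return [{"text": "raw post", "url": profile_url}]


def standardize(raw: dict, category: str, platform: str) -> dict:
    """Transform a platform-specific payload into a standardized document."""
    return {
        "category": category,      # document type, used to pick the processing strategy later
        "platform": platform,      # kept only as metadata
        "source_url": raw["url"],
        "content": raw["text"],
        "crawled_at": datetime.now(timezone.utc),
    }


client = MongoClient("mongodb://localhost:27017")  # assumed local instance
warehouse = client["llm_twin"]["raw_documents"]

for raw_post in crawl_linkedin_posts("https://linkedin.com/in/your-profile"):
    warehouse.insert_one(standardize(raw_post, category="post", platform="linkedin"))
```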
The collected digital data is binned into 3 categories:
Articles (Medium, Substack, other blogs)
Posts (LinkedIn)
Code (GitHub)
We want to abstract away the platform where the data was crawled. For example, when feeding an article to the LLM, knowing it came from Medium or Substack is not essential. We can keep the source URL as metadata to give references.
However, from the processing, fine-tuning, and RAG points of view, it is vital to know what type of data we ingested, as each category must be processed differently. For example, the chunking strategy between a post, article, and piece of code will look different.
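For example, a minimal sketch of category-aware processing could dispatch on the document’s category while keeping only the source URL as metadata. The chunking rules below are simplistic placeholders, not production logic.

```python
# Illustrative only: each category gets its own chunking strategy.
def chunk_article(text: str) -> list:
    return [p for p in text.split("\n\n") if p]   # e.g. split by paragraphs/sections


def chunk_post(text: str) -> list:
    return [text]                                  # posts are short: keep one chunk


def chunk_code(text: str) -> list:
    return [b for b in text.split("\n\n") if b]   # e.g. split by functions/classes


CHUNKERS = {"post": chunk_post, "article": chunk_article, "code": chunk_code}


def chunk_document(document: dict) -> list:
    chunks = CHUNKERS[document["category"]](document["content"])
    return [{"chunk": c, "source_url": document["source_url"]} for c in chunks]
```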
Feature pipeline
The feature pipeline’s role is to take raw articles, posts, and code data points from the data warehouse, process them, and load them into the feature store. The characteristics of the FTI pattern are already present.
Here are some custom properties of the LLM Twin’s feature pipeline (a code sketch follows the list):
It processes 3 data types differently: articles, posts, and code.
It contains 3 main processing steps for fine-tuning and RAG: cleaning, chunking, and embedding.
It creates 2 snapshots of the digital data, one after cleaning (used for fine-tuning) and one after embedding (used for RAG).
It uses a logical feature store instead of a specialized feature store.
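Here is a minimal sketch of the three processing steps and the two snapshots they produce, assuming sentence-transformers as an illustrative embedding model and trivial placeholder cleaning and chunking logic.

```python
# Clean -> chunk -> embed, producing one snapshot for fine-tuning and one for RAG.
import re

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model


def clean(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()  # placeholder cleaning logic


def chunk(text: str, max_chars: int = 500) -> list:
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]


def process(document: dict) -> tuple:
    cleaned = {**document, "content": clean(document["content"])}   # snapshot 1: fine-tuning
    embedded = [                                                    # snapshot 2: RAG
        {
            "chunk": c,
            "embedding": embedding_model.encode(c).tolist(),
            "source_url": document["source_url"],
        }
        for c in chunk(cleaned["content"])
    ]
    return cleaned, embedded
```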
Let’s zoom in on the logical feature store for a bit. As with any RAG-based system, one of the central pieces of the infrastructure is a vector DB. Instead of integrating yet another DB (more concretely, a specialized feature store), we use the vector DB plus some additional logic to provide all the properties of a feature store that our system needs.
The vector DB doesn’t offer the concept of a training dataset, but it can be used as a NoSQL DB. This means we can access data points using their ID and collection name.
Thus, we can easily query the vector DB for new data points without any vector search logic. Ultimately, we will wrap the retrieved data into a versioned, tracked, and shareable artifact.
How will the rest of the system access the logical feature store? The training pipeline will use the instruct datasets as artifacts, and the inference pipeline will query the vector DB for additional context using vector search techniques.
For our use case, this is more than enough because of the following reasons:
The artifacts work great for offline use cases such as training.
The vector DB is built for online access, which we require for inference.
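Here is a sketch of how such a logical feature store could be wrapped. The VectorDB class is a hypothetical, in-memory stand-in for whatever vector DB client you use, and the artifact is reduced to a versioned dict.

```python
# The "logical feature store": the vector DB plus a little extra logic.
from dataclasses import dataclass, field


@dataclass
class VectorDB:  # hypothetical minimal client
    collections: dict = field(default_factory=dict)

    def insert(self, collection: str, point_id: str, payload: dict) -> None:
        self.collections.setdefault(collection, {})[point_id] = payload

    def get(self, collection: str, point_id: str) -> dict:
        return self.collections[collection][point_id]  # plain NoSQL-style access by ID

    def search(self, collection: str, query_embedding: list, k: int = 3) -> list:
        return list(self.collections.get(collection, {}).values())[:k]  # placeholder ranking


def build_instruct_dataset_artifact(db: VectorDB, collection: str, ids: list, version: str) -> dict:
    """Offline path (training): fetch points by ID and wrap them in a versioned artifact."""
    return {"version": version, "samples": [db.get(collection, i) for i in ids]}


def retrieve_context(db: VectorDB, query_embedding: list) -> list:
    """Online path (inference): vector search for RAG context."""
    return db.search("cleaned_documents", query_embedding, k=3)
```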
To conclude, we input raw articles, posts, or code data points, process them, and store them in a feature store to make them accessible to the training and inference pipelines.
Note that trimming all the complexity away and focusing only on the interface perfectly matches the FTI pattern. Beautiful, right?
Training pipeline
The training pipeline consumes instruct datasets from the feature store, fine-tunes an LLM with them, and stores the tuned LLM weights in a model registry.
More concretely, when a new instruct dataset is available in the logical feature store, we will trigger the training pipeline, consume the artifact, and fine-tune the LLM.
In the initial stages, the data science team owns this step. They run multiple experiments to find the best model and hyperparameters for the job, either through automatic hyperparameter tuning or manually.
To compare and pick the best set of hyperparameters, we will use an experiment tracker to log everything of value and compare it between experiments. Ultimately, they will choose the best hyperparameters and fine-tuned LLM and propose it as the LLM production candidate. The proposed LLM is then stored in the model registry.
After the experimentation phase, we store and reuse the best hyperparameters found, removing the manual steps from the process. Now, we can completely automate the training process, known as continuous training.
The testing pipeline is then triggered for a more detailed analysis than what happens during fine-tuning. Before pushing the new model to production, it is critical to assess it against a stricter set of tests to confirm that the latest candidate is better than what is currently in production. If this step passes, the model is ultimately tagged as accepted and deployed to the production inference pipeline.
Even in a fully automated ML system, it is recommended to have a manual step before accepting a new production model. It is like pushing the red button before a significant action with high consequences. Thus, at this stage, an expert looks at a report generated by the testing component. If everything looks good, they approve the model, and the automation can continue.
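A sketch of this gated promotion flow, where every function is a hypothetical placeholder for your own fine-tuning, testing, experiment-tracking, and approval logic:

```python
# Gated promotion: fine-tune -> stricter tests -> manual approval -> accept.
def fine_tune_llm(instruct_dataset: list, **hyperparameters) -> str:
    """Placeholder: fine-tune with the stored best hyperparameters and log
    everything of value to the experiment tracker."""
    return "candidate-llm"


def run_test_suite(candidate: str, baseline: str) -> dict:
    """Placeholder: the stricter test suite comparing the candidate to production."""
    return {"passes": True, "summary": f"{candidate} vs {baseline}"}


def expert_approves(report: dict) -> bool:
    """Placeholder: the manual 'red button' review of the generated report."""
    return True


def continuous_training_run(instruct_dataset: list, best_hyperparameters: dict):
    candidate = fine_tune_llm(instruct_dataset, **best_hyperparameters)
    report = run_test_suite(candidate, baseline="current-production-llm")
    if not report["passes"] or not expert_approves(report):
        return None  # the candidate is never tagged as accepted
    return candidate  # tag as accepted and deploy to the production inference pipeline
```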
One last aspect we want to clarify is continuous training (CT).
Our modular design lets us quickly leverage an ML orchestrator like ZenML to schedule and trigger different system parts.
For example, we can schedule the data collection pipeline to crawl data every week. Then, we can trigger the feature pipeline when new data is available in the data warehouse and the training pipeline when new instruction datasets are available… and BAM, we have CT!
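A minimal sketch of how the training pipeline could be wired in ZenML, assuming a recent ZenML version with the top-level step and pipeline decorators. The step bodies are placeholders, and scheduling or triggering is configured on the orchestrator side, so it is omitted here.

```python
# Wiring the training pipeline with ZenML (step bodies are placeholders).
from zenml import pipeline, step


@step
def generate_instruct_dataset() -> list:
    return []  # placeholder: build the instruct dataset artifact from the feature store


@step
def fine_tune(instruct_dataset: list) -> str:
    return "fine-tuned-llm"  # placeholder: fine-tune and push the LLM to the registry


@pipeline
def training_pipeline():
    dataset = generate_instruct_dataset()
    fine_tune(dataset)


if __name__ == "__main__":
    training_pipeline()  # scheduled (e.g. weekly) or triggered when new data lands
```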
Inference pipeline
The inference pipeline is the last piece of the puzzle.
It is connected to the model registry and logical feature store. From the model registry, it loads a fine-tuned LLM, and from the logical feature store, it accesses the vector DB for RAG.
It receives client requests as queries through a REST API. It uses the fine-tuned LLM and the vector DB to carry out RAG and answer the queries.
All the client queries, enriched prompts using RAG, and generated answers are sent to a prompt monitoring system to analyze, debug, and better understand the system. Based on specific requirements, the monitoring system can trigger alarms to take action either manually or automatically.
At the interface level, this component follows the FTI architecture exactly, but when zooming in, we can observe characteristics unique to an LLM and RAG system, such as the following (sketched in code after the list):
A retrieval client is used to do vector searches for RAG.
Prompt templates map user queries and external information to LLM inputs.
Special tools for prompt monitoring.
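Here is a hedged sketch of the real-time endpoint, using FastAPI as an illustrative choice. The retrieve_context, generate, and log_prompt functions are hypothetical stand-ins for the retrieval client, the fine-tuned LLM, and the prompt monitoring tool.

```python
# A sketch of the RAG inference endpoint (placeholders for retrieval, LLM, monitoring).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

PROMPT_TEMPLATE = "Answer using only this context:\n{context}\n\nQuestion: {query}"


class Query(BaseModel):
    text: str


def retrieve_context(query: str) -> str:
    return "retrieved chunks"  # placeholder: vector search against the vector DB


def generate(prompt: str) -> str:
    return "generated answer"  # placeholder: call the fine-tuned LLM


def log_prompt(query: str, prompt: str, answer: str) -> None:
    """Placeholder: send the query, enriched prompt, and answer to prompt monitoring."""


@app.post("/query")
def answer_query(query: Query) -> dict:
    context = retrieve_context(query.text)
    prompt = PROMPT_TEMPLATE.format(context=context, query=query.text)
    answer = generate(prompt)
    log_prompt(query.text, prompt, answer)
    return {"answer": answer}
```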
Summary
The FTI architecture is a powerful mind map that helps you connect the dots in the complex data and AI world, as illustrated in the LLM Twin use case.
This article was inspired by our latest book, “LLM Engineer’s Handbook.”
If you liked this article, consider supporting our work by buying our book and getting access to an end-to-end framework on how to engineer LLM & RAG applications, from data collection to fine-tuning, serving, and LLMOps.
Images
If not otherwise stated, all images are created by the author.