How I Turned ChatGPT into an SQL-Like Translator for Image and Video Datasets | by Jacob Marks, Ph.D. | Jun, 2023



Though the prompt for these two examples was structured in the same way, the responses differed in a few key ways. Response 1 attempts to create a DatasetView by adding a ViewStage to the dataset. Response 2 defines and applies a MongoDB aggregation pipeline, followed by the limit() method (applying a Limit stage) to limit the view to 10 samples, as well as a non-existent (AKA hallucinated) show() method. Additionally, while Response 1 loads in an actual dataset (Open Images V6), Response 2 is effectively template code, as "your_dataset_name" and "your_model_name" need to be filled in.

These examples also highlighted the following issues:

  1. Boilerplate code: some responses contained code for importing modules, instantiating datasets (and models), and visualizing the view (session = fo.launch_app(dataset)).
  2. Explanatory text: in many cases, including educational contexts, the fact that the model explains its “reasoning” is a positive. If we want to perform queries on the user’s behalf, however, this explanatory text just gets in the way. Some queries even resulted in multiple code blocks, split up by text.

What we really wanted was for the LLM to respond with code that could be copied and pasted into a Python process, without all of the extra baggage. As a first attempt at prompting the model, I started to give the following text as a prefix to any natural language query I wanted it to translate:

Your task is to convert input natural language queries into Python code to generate ViewStages for the computer vision library FiftyOne.
Here are some rules:
- Avoid all header code like importing packages, and all footer code like saving the dataset or launching the FiftyOne App.
- Just give me the final Python code, no intermediate code snippets or explanation.
- always assume the dataset is stored in the Python variable `dataset`
- you can use the following ViewStages to generate your response, in any combination: exclude, exclude_by, exclude_fields, exclude_frames, …

Crucially, I defined a task and set rules, instructing the model what it was allowed and not allowed to do.

Note: with responses coming in a more uniform format, it was at this point that I moved from the ChatGPT chat interface to using GPT-4 via OpenAI’s API.

Limiting Scope

Our team also decided that, at least to start, we’d limit the scope of what we were asking the LLM to do. While the FiftyOne query language itself is full-bodied, asking a pre-trained model to do arbitrarily complex tasks without any fine-tuning is a recipe for disappointment. Start simple, and iteratively add in complexity.

For this experiment, we imposed the following bounds:

  • Just images and videos: don’t expect the LLM to query 3D point clouds or grouped datasets.
  • Ignore fickle ViewStages: most ViewStages abide by the same basic rules, but a few buck the trend. Concat is the only ViewStage that takes in a second DatasetView; Mongo uses MongoDB Aggregation syntax; GeoNear has a query argument, which takes in a fiftyone.utils.geojson.geo_within() object; and GeoWithin requires a 2D array to define the region to which the “within” applies. We decided to ignore Concat, Mongo, and GeoWithin, and to support all GeoNear usage except for the query argument.
  • Stick to two stages: while it would be great for the model to compose an arbitrary number of stages, in most workflows I’ve seen, one or two ViewStages suffice to create the desired DatasetView. The goal of this project was not to get caught in the weeds, but to build something useful for computer vision practitioners.

VoxelGPT using natural language to query an image dataset. Image courtesy of the author.

In addition to giving the model an explicit “task” and providing clear instructions, we found that we could improve performance by giving the model more information about how FiftyOne’s query language works. Without this information, the LLM is flying blind. It’s just grasping, reaching out into the darkness.

For example, in Prompt 2, when I asked for false positive predictions, the response tried to reference these false positives with predictions.errors.false_positive. As far as ChatGPT was concerned, this seemed like a reasonable way to store and access information about false positives.

The model didn’t know that in FiftyOne, the truth/falsity of detection predictions is evaluated with dataset.evaluate_detections(), and that after running said evaluation, you can retrieve all images with a false positive by matching for eval_fp>0 with:

images_with_fp = dataset.match(F("eval_fp")>0)

I tried to clarify the task by providing additional rules, such as:

- When a user asks for the most "unique" images, they are referring to the "uniqueness" field stored on samples.
- When a user asks for the most "wrong" or "mistaken" images, they are referring to the "mistakenness" field stored on samples.
- If a user doesn't specify a label field, e.g. "predictions" or "ground_truth" to which to apply certain operations, assume they mean "ground_truth" if a ground_truth field exists on the data.

I also provided information about label types:

- Object detection bounding boxes are in [top-left-x, top-left-y, width, height] format, all relative to the image width and height, in the range [0, 1]
- possible label types include Classification, Classifications, Detection, Detections, Segmentation, Keypoint, Regression, and Polylines

Additionally, while providing the model with a list of allowed view stages let me nudge it towards using them, it didn’t know

  • When a given stage was relevant, or
  • How to use the stage in a syntactically correct manner

To fill this gap, I wanted to give the LLM information about each of the view stages. I wrote code to loop through view stages (which you can list with fiftyone.list_view_stages()), store the docstring, and then split the text of the docstring into description and inputs/arguments.
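Here is a minimal sketch of that docstring-splitting step, assuming Google-style docstrings in which an "Args:" line separates the description from the argument list. The split_docstring helper and the ExampleStage class are hypothetical stand-ins rather than FiftyOne code; in practice the objects would come from the view stage listing mentioned above.

```python
import inspect


def split_docstring(obj):
    """Split an object's docstring into (description, arguments) text.

    Assumes a Google-style docstring in which an "Args:" line separates
    the description from the argument list.
    """
    doc = inspect.getdoc(obj) or ""
    description, _, arguments = doc.partition("Args:")
    # Collapse the description onto one line; keep one stripped line per argument line
    description = " ".join(description.split())
    arguments = "\n".join(
        line.strip() for line in arguments.splitlines() if line.strip()
    )
    return description, arguments


class ExampleStage:
    """Limits the view to the given number of samples.

    Args:
        limit: the maximum number of samples to return
    """


description, arguments = split_docstring(ExampleStage)
print(description)
print(arguments)
```

Keeping the description and the arguments as separate strings makes it possible to include or drop each part independently when assembling a prompt.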

However, I soon ran into a problem: context length.

Using the base GPT-4 model via the OpenAI API, I was already bumping up against the 8,192 token context length. And this was before adding in examples, or any information about the dataset itself!

OpenAI does have a GPT-4 model with a 32,768 token context which in theory I could have used, but a back-of-the-envelope calculation convinced me that this could get expensive. If we filled the entire 32k token context, given OpenAI’s pricing, it would cost about $2 per query!
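The back-of-the-envelope arithmetic looks roughly like this. The price below is an assumption based on the GPT-4 32k prompt rate listed around the time of writing ($0.06 per 1,000 prompt tokens); completion tokens were billed at a higher rate and are ignored here.

```python
# Assumed price, not an official figure: USD per 1,000 prompt tokens
PRICE_PER_1K_PROMPT_TOKENS = 0.06
CONTEXT_TOKENS = 32_768

# Cost of a single query whose prompt fills the whole context window
cost_per_query = CONTEXT_TOKENS / 1000 * PRICE_PER_1K_PROMPT_TOKENS
print(f"${cost_per_query:.2f} per query")
```

Under that assumption, a full context comes to just under $2 per query before any completion tokens are counted.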

Instead, our team rethought our approach and did the following:

  • Switch to GPT-3.5
  • Minimize token count
  • Be more selective with input data

Switching to GPT-3.5

There’s no such thing as a free lunch: this did lead to slightly lower performance, at least initially. Over the course of the project, we were able to recover and far surpass this through prompt engineering! In our case, the effort was worth the cost savings. In other cases, it might not be.

Minimizing Token Count

With context length becoming a limiting factor, I employed the following simple trick: use ChatGPT to optimize prompts!

One ViewStage at a time, I took the original description and list of inputs, and fed this information into ChatGPT, along with a prompt asking the LLM to minimize the token count of that text while retaining all semantic information. Using tiktoken to count the tokens in the original and compressed versions, I was able to reduce the number of tokens by about 30%.
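A sketch of how the before/after comparison could be measured. The token_reduction helper and the sample strings are hypothetical. The counter is passed in as a function, so the example runs with a plain whitespace splitter, while tiktoken_counter shows how tiktoken could be plugged in to count real model tokens.

```python
def token_reduction(original, compressed, count_tokens):
    """Return the fractional reduction in token count after compression."""
    before = count_tokens(original)
    after = count_tokens(compressed)
    return (before - after) / before


def tiktoken_counter(model="gpt-3.5-turbo"):
    """Build a token counter backed by tiktoken (imported lazily)."""
    import tiktoken

    encoding = tiktoken.encoding_for_model(model)
    return lambda text: len(encoding.encode(text))


# Hypothetical description text, before and after compression
original = "Limits the view to the given number of samples in the collection"
compressed = "Limits view to given number of samples"

# A whitespace splitter stands in for a real tokenizer here;
# pass tiktoken_counter() instead to count model tokens.
reduction = token_reduction(original, compressed, lambda text: len(text.split()))
print(f"{reduction:.0%} fewer tokens")
```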

Being More Selective

While it’s great to provide the model with context, some information is more helpful than other information, depending on the task at hand. If the model only needs to generate a Python query involving two ViewStages, it probably won’t benefit terribly from information about what inputs the other ViewStages take.

We knew that we needed a way to select relevant information depending on the input natural language query. However, it wouldn’t be as simple as performing a similarity search on the descriptions and input parameters, because the former often comes in very different language than the latter. We needed a way to link input and information selection.

That link, as it turns out, was examples.

Generating Examples

If you’ve ever played around with ChatGPT or another LLM, you’ve probably experienced first-hand how providing the model with even just a single relevant example can drastically improve performance.

As a starting point, I came up with 10 completely synthetic examples and passed these along to GPT-3.5 by adding this below the task rules and ViewStage descriptions in my input prompt:

Here are a few examples of Input-Output Pairs in A, B form:

A) "Filepath starts with '/Users'"
B) `dataset.match(F("filepath").starts_with("/Users"))`

A) "Predictions with confidence > 0.95"
B) `dataset.filter_labels("predictions", F("confidence") > 0.95)`
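As a rough illustration, a block like this can be generated from a list of (query, code) pairs rather than written by hand. The format_examples helper below is hypothetical; it simply reproduces the A/B layout shown above.

```python
def format_examples(pairs):
    """Render (query, code) pairs in the A/B form used in the prompt."""
    header = "Here are a few examples of Input-Output Pairs in A, B form:"
    blocks = [f'A) "{query}"\nB) `{code}`' for query, code in pairs]
    return header + "\n\n" + "\n\n".join(blocks)


pairs = [
    (
        "Predictions with confidence > 0.95",
        'dataset.filter_labels("predictions", F("confidence") > 0.95)',
    ),
]
prompt_section = format_examples(pairs)
print(prompt_section)
```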

With just these 10 examples, there was a noticeable improvement in the quality of the model’s responses, so our team decided to be systematic about it.

  1. First, we combed through our docs, finding any and all examples of views created through combinations of ViewStages.
  2. We then went through the list of ViewStages and added examples so that we had as close to complete coverage as possible over usage syntax. To do this, we made sure that there was at least one example for each argument or keyword, to give the model a pattern to follow.
  3. With usage syntax covered, we varied the names of fields and classes in the examples so that the model wouldn’t generate any false assumptions about names correlating with stages. For instance, we don’t want the model to strongly associate the “person” class with the match_labels() method just because all of the examples for match_labels() happen to include a “person” class.

Selecting Relevant Examples

At the end of this example generation process, we already had hundreds of examples, far more than could fit in the context length. Fortunately, these examples contained (as input) natural language queries that we could directly compare with the user’s input natural language query.

To perform this comparison, we pre-computed embeddings for these example queries with OpenAI’s text-embedding-ada-002 model. At run-time, the user’s query is embedded with the same model, and the examples with the most similar natural language queries, by cosine distance, are selected. Initially, we used ChromaDB to construct an in-memory vector database. However, given that we were dealing with hundreds or thousands of vectors, rather than hundreds of thousands or millions, it actually made more sense to switch to an exact vector search (plus we limited dependencies).
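At this scale, an exact search is just a brute-force comparison against every stored vector. The sketch below uses toy 3-dimensional vectors in place of real embeddings, and the example records are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_embedding, examples, k=2):
    """Return the k examples whose embeddings are closest to the query."""
    ranked = sorted(
        examples,
        key=lambda ex: cosine_similarity(query_embedding, ex["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors; real embeddings would come from the embedding model
examples = [
    {"query": "images with a dog", "embedding": [0.9, 0.1, 0.0]},
    {"query": "first 10 samples", "embedding": [0.0, 0.2, 0.9]},
    {"query": "samples with cats", "embedding": [0.8, 0.3, 0.1]},
]
user_embedding = [1.0, 0.2, 0.0]

selected = [ex["query"] for ex in most_similar(user_embedding, examples)]
print(selected)
```

With only hundreds or thousands of stored vectors, scoring every one of them per query is cheap, which is why a dedicated vector database is not strictly needed.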

It was becoming difficult to manage these examples and the components of the prompt, so it was at this point that we started to use LangChain’s Prompts module. Initially, we were able to use their Similarity ExampleSelector to select the most relevant examples, but eventually we had to write a custom ExampleSelector so that we had more control over the pre-filtering.

Filtering for Appropriate Examples

In the computer vision query language, the appropriate syntax for a query can depend on the media type of the samples in the dataset: videos, for example, sometimes need to be treated differently than images. Rather than confuse the model by giving seemingly conflicting examples, or complicating the task by forcing the model to infer based on media type, we decided to only give examples that would be syntactically correct for a given dataset. In the context of vector search, this is known as pre-filtering.

This idea worked so well that we eventually applied the same considerations to other features of the dataset. In some cases, the differences were merely syntactic: when querying labels, the syntax for accessing a Detections label is different from that of a Classification label. Other filters were more strategic: sometimes we didn’t want the model to know about a certain feature of the query language.

For instance, we didn’t want to give the LLM examples utilizing computations it would not have access to. If a text similarity index had not been constructed for a particular dataset, it would not make sense to feed the model examples of searching for the best visual matches to a natural language query. In a similar vein, if the dataset did not have any evaluation runs, then querying for true positives and false positives would yield either errors or null results.
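A simplified sketch of this kind of pre-filtering, applied before any similarity search. The record layout (media_type, requires) is an assumption made for illustration and is not the structure used in the actual repo.

```python
def prefilter_examples(examples, media_type, available_runs):
    """Keep only examples that are valid for this dataset.

    An example is kept if it matches the dataset's media type and every
    computation it relies on has actually been run on the dataset.
    """
    return [
        ex
        for ex in examples
        if ex["media_type"] == media_type
        and set(ex["requires"]) <= set(available_runs)
    ]


# Hypothetical example records; the field names are illustrative only
examples = [
    {"query": "images with a false positive", "media_type": "image", "requires": ["evaluation"]},
    {"query": "images that look like a beach", "media_type": "image", "requires": ["text_similarity"]},
    {"query": "frames with more than 3 cars", "media_type": "video", "requires": []},
    {"query": "first 10 images", "media_type": "image", "requires": []},
]

kept = prefilter_examples(examples, media_type="image", available_runs=["evaluation"])
print([ex["query"] for ex in kept])
```

Here the text-similarity example is dropped because no such index exists, and the video example is dropped because the dataset contains images.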

You can see the entire example pre-filtering pipeline in the GitHub repo.

Choosing Contextual Information Based on Examples

For a given natural language query, we then use the examples selected by our ExampleSelector to decide what additional information to provide in the context.

Specifically, we count the occurrences of each ViewStage in these selected examples, identify the five most frequent ViewStages, and add the descriptions and information about the input parameters for these ViewStages as context in our prompt. The rationale for this is that if a stage frequently occurs in similar queries, it is likely (but not guaranteed) to be relevant to this query.

If it’s not relevant, then the description will help the model to determine that it’s not relevant. If it is relevant, then information about input parameters will help the model generate a syntactically correct ViewStage operation.
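The counting step can be sketched as follows. The regular expression and the KNOWN_STAGES set are simplifications chosen for illustration: the set restricts matches to stage methods, so that expression methods such as starts_with() are not counted.

```python
import re
from collections import Counter

# Illustrative subset of stage method names; the real list is much longer
KNOWN_STAGES = {"match", "filter_labels", "sort_by", "limit", "take", "select_fields"}


def most_frequent_stages(example_outputs, n=5):
    """Return up to n stage names, ordered by how often they appear."""
    counts = Counter()
    for code in example_outputs:
        # Find every ".name(" method call, then keep only known stages
        for name in re.findall(r"\.(\w+)\(", code):
            if name in KNOWN_STAGES:
                counts[name] += 1
    return [name for name, _ in counts.most_common(n)]


selected_examples = [
    'dataset.match(F("filepath").starts_with("/Users"))',
    'dataset.filter_labels("predictions", F("confidence") > 0.95)',
    'dataset.match(F("uniqueness") > 0.8).limit(10)',
]
top_stages = most_frequent_stages(selected_examples)
print(top_stages)
```

The descriptions and argument lists for the returned stages are then the ones added to the prompt.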

VoxelGPT using natural language to query an image dataset. Image courtesy of the author.

Up until this point, we had focused on squeezing as much relevant information as possible, and only relevant information, into a single prompt. But this approach was reaching its limits.

Even without accounting for the fact that every dataset has its own names for fields and classes, the space of possible Python queries was just too large.

To make progress, we needed to break the problem down into smaller pieces. Taking inspiration from recent approaches, including Chain-of-thought prompting and Selection-inference prompting, we divided the problem of generating a DatasetView into four distinct selection subproblems:

  1. Algorithms
  2. Runs of algorithms
  3. Relevant fields
  4. Relevant class names

We then chained these selection “links” together, and passed their outputs along to the model in the final prompt for DatasetView inference.

For each of these subtasks, the same principles of uniformity and simplicity apply. We tried to recycle the natural language queries from existing examples wherever possible, but made a point to simplify the formats of all inputs and outputs for each selection task. What is simplest for one link may not be simplest for another!
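Structurally, such a chain can be as simple as a list of steps that each read from and add to a shared context. In the sketch below, the link functions are trivial stand-ins; in practice each link would wrap its own prompt, examples, and LLM call.

```python
def run_chain(query, links):
    """Run each selection link in order, accumulating outputs in a context dict."""
    context = {"query": query}
    for name, link in links:
        context[name] = link(context)
    return context


# Stand-in links with hard-coded logic, for illustration only
links = [
    ("algorithms", lambda ctx: ["uniqueness"] if "unique" in ctx["query"] else []),
    ("runs", lambda ctx: {alg: f"{alg}_run" for alg in ctx["algorithms"]}),
    ("fields", lambda ctx: ["ground_truth"]),
    ("classes", lambda ctx: ["dog"] if "dog" in ctx["query"] else []),
]

context = run_chain("most unique images with a dog", links)
print(context)
```

Because each link only depends on the context it receives, any one of them can be swapped out or tested in isolation.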


Algorithms

In FiftyOne, information resulting from a computation on a dataset is stored as a “run”. This includes computations like uniqueness, which measures how unique each image is relative to the rest of the images in the dataset, and hardness, which quantifies the difficulty a model will experience when trying to learn on this sample. It also includes computations of similarity, which involve generating a vector index for embeddings associated with each sample, and even evaluation computations, which we touched upon earlier.

Each of these computations generates a different type of results object, which has its own API. Furthermore, there’s no one-to-one correspondence between ViewStages and these computations. Let’s take uniqueness as an example.

A uniqueness computation result is stored in a float-valued field ("uniqueness" by default) on each image. This means that depending on the situation, you may want to sort by uniqueness:

view = dataset.sort_by("uniqueness")

Retrieve samples with uniqueness above a certain threshold:

from fiftyone import ViewField as F
view = dataset.match(F("uniqueness") > 0.8)

Or even just show the uniqueness field:

view = dataset.select_fields("uniqueness")

In this selection step, we task the LLM with predicting which of the possible computations might be relevant to the user’s natural language query. An example for this task looks like:

Query: "most unique images with a false positive"
Algorithms used: ["uniqueness", "evaluation"]

Runs of Algorithms

Once potentially relevant computational algorithms have been identified, we task the LLM with selecting the most appropriate run of each computation. This is essential because some computations can be run multiple times on the same dataset with different configurations, and a ViewStage may only make sense with the right “run”.

A great example of this is similarity runs. Suppose you are testing out two models (InceptionV3 and CLIP) on your data, and you have generated a vector similarity index on the dataset for each model. When using the SortBySimilarity view stage, which images are determined to be most similar to which other images can depend quite strongly on the embedding model, so the following two queries would need to generate different results:

## query A:
"show me the 10 most similar images to image 1 with CLIP"

## query B:
"show me the 10 most similar images to image 1 with InceptionV3"

This run selection process is handled separately for each type of computation, as each requires a modified set of task rules and examples.

Relevant Fields

This link in the chain involves identifying all field names relevant to the natural language query that are not related to a computational run. For instance, not all datasets with predictions have these labels stored under the name "predictions". Depending on the person, dataset, and application, predictions might be stored in a field named "pred", "resnet", "fine-tuned", "predictions_05_16_2023", or something else entirely.

Examples for this task included the query, the names and types of all fields in the dataset, and the names of relevant fields:

Query: "Exclude model2 predictions from all samples"
Available fields: "[id: string, filepath: string, tags: list, ground_truth: Detections, model1_predictions: Detections, model2_predictions: Detections, model3_predictions: Detections]"
Required fields: "[model2_predictions]"

Relevant Class Names

For label fields like classifications and detections, translating a natural language query into Python code requires using the names of actual classes in the dataset. To accomplish this, I tasked GPT-3.5 with performing named entity recognition for label classes in input queries.

In the query “samples with at least one cow prediction and no horses”, the model’s job is to identify "horse" and "cow". These identified names are then compared against the class names for label fields selected in the prior step: first case sensitive, then case insensitive, then plurality insensitive.

If no matches are found between named entities and the class names in the dataset, we fall back to semantic matching: "people" → "person", "table" → "dining table", and "animal" → ["cat", "dog", "horse", …].

Whenever the match isn’t identical, we use the names of the matched classes to update the query that is passed into the final inference step:

query: "20 random images with a table"
## becomes:
query: "20 random images with a dining table"
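The exact-matching cascade described above (case sensitive, then case insensitive, then plurality insensitive) can be sketched as follows. The helper names are hypothetical, and the plural handling is deliberately naive: it only strips a trailing "s". A return value of None signals that the semantic-matching fallback is needed.

```python
def _singular(word):
    """Naive singularization: drop one trailing 's' if present."""
    return word[:-1] if word.endswith("s") else word


def match_class_name(entity, class_names):
    """Match a named entity to a dataset class name, strictest check first."""
    # 1. Case-sensitive exact match
    if entity in class_names:
        return entity
    # 2. Case-insensitive match
    lowered = {name.lower(): name for name in class_names}
    if entity.lower() in lowered:
        return lowered[entity.lower()]
    # 3. Plurality-insensitive match
    singular = {_singular(name.lower()): name for name in class_names}
    return singular.get(_singular(entity.lower()))


class_names = ["cow", "horse", "dining table"]
for entity in ["cow", "Horse", "horses", "table"]:
    print(entity, "->", match_class_name(entity, class_names))
```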

ViewStage Inference

Once all of these selections have been made, the similar examples, relevant descriptions, and relevant dataset info (selected algorithmic runs, fields, and classes) are passed in to the model, along with the (potentially modified) query.

Rather than instruct the model to return code to me in the form dataset.view1().view2()…viewn() as we were doing initially, we ended up nixing the dataset part, and instead asking the model to return the ViewStages as a list. At the time, I was surprised to see this improve performance, but in hindsight, it fits with the insight that the more you split the task up, the better an LLM can do.

Creating an LLM-powered toy is cool, but turning the same kernel into an LLM-powered application is much cooler. Here’s a brief overview of how we did it.

Unit Testing

As we turned this from a proof-of-principle into a robustly engineered system, we used unit testing to stress test the pipeline and identify weak points. The modular nature of links in the chain means that each step can individually be unit tested, validated, and iterated on without needing to run the entire chain.

This leads to faster improvement, because different individuals or groups of people within a prompt-engineering team can work on different links in the chain in parallel. Additionally, it results in reduced costs, as in theory, you should only need to run a single step of LLM inference to optimize a single link in the chain.

Evaluating LLM-Generated Code

We used Python’s eval() function to turn GPT-3.5’s response into a DatasetView. We then set the state of the FiftyOne App session to display this view.
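To illustrate the idea without depending on FiftyOne, the sketch below applies a list of stage strings to a tiny stand-in dataset class. The FakeDataset class, the build_view helper, and the response format (stage calls as strings) are all assumptions made for this example, not the project's actual code. Note that eval() executes arbitrary code, so it should only ever see trusted or validated input.

```python
class FakeDataset:
    """Tiny stand-in for a dataset, so the sketch runs without FiftyOne."""

    def __init__(self, samples):
        self.samples = list(samples)

    def limit(self, n):
        return FakeDataset(self.samples[:n])

    def skip(self, n):
        return FakeDataset(self.samples[n:])


def build_view(dataset, stage_strings):
    """Apply each stage string to the running view, in order."""
    view = dataset
    for stage in stage_strings:
        # Evaluate with no builtins and only `view` in scope
        view = eval(f"view.{stage}", {"__builtins__": {}}, {"view": view})
    return view


# Hypothetical model response, already parsed into a list of stage calls
stages = ["skip(2)", "limit(3)"]
view = build_view(FakeDataset(range(10)), stages)
print(view.samples)
```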

Input Validation

Garbage input → garbage output. To avoid this, we run validation to make sure that the user’s natural language query is sensible.

First, we use OpenAI’s moderation endpoint. Then we categorize any prompt into one of the following four cases:

1: Sensible and complete: the prompt can reasonably be translated into Python code for querying a dataset.

All images with dog detections

2: Sensible and incomplete: the prompt is reasonable, but cannot be converted into a DatasetView without additional information. For example, if we have two models with predictions on our data, then the following prompt, which just refers to “my model”, is insufficient:

Retrieve my model’s incorrect predictions

3: Out of scope: we are building an application that generates queried views into computer vision datasets. While the underlying GPT-3.5 model is a general purpose LLM, our application should not turn into a disconnected ChatGPT session next to your dataset. Prompts like the following should be snuffed out:

Explain quantum computing like I’m 5

4: Not sensible: given a random string, it would not make sense to attempt to generate a view of the dataset. Where would one even start?!