Better Understanding Image Queries
Years ago, I wouldn't have anticipated a search engine telling a searcher about objects in a photograph or video, but search engines have been evolving and getting better at what they do.
In February, Google was granted a patent about answering image queries by identifying objects in photographs and videos. A search engine may have trouble trying to understand what a human is asking in a natural language query, and this patent focuses on disambiguating image queries.
The patent provides the following example:
For example, a user may ask a question about a photograph that the user is viewing on the computing device, such as "What is this?"
The patent tells us that the process it describes may apply to image queries, text queries, or video queries, or any combination of these.
In response to a searcher asking to identify something in an image, a computing device may:
- Capture the image that the user is viewing
- Transcribe the question
- Transmit the transcription and the image to a server
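As a rough sketch, the client side of this exchange amounts to bundling the captured image and the transcribed question into one request. The JSON wire format below is an assumption for illustration; the patent does not specify one.

```python
import base64
import json

def build_visual_query_request(image_bytes: bytes, transcription: str) -> str:
    """Bundle the captured screen image and the transcribed utterance
    into a single request for the server (hypothetical wire format)."""
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "transcription": transcription,
    })
```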
The server may receive the transcription and the image from the computing device, and:
- Identify visual and text content in the image
- Generate labels for images within the content of the image, such as locations, entities, names, types of animals, and so on
- Identify a particular sub-image within the image, which may be a photograph or drawing
The server may also:
- Identify the part of a particular sub-image that may be of primary interest to a searcher, such as a historic landmark in the image
- Perform image recognition on the particular sub-image to generate labels for that sub-image
- Generate labels for text in the image, such as comments about the sub-image, by performing text recognition on the part of the image other than the particular sub-image
- Generate a search query based on the transcription and the generated labels
- Provide that query to a search engine
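Stripped of the recognition models themselves, the rewrite step the server performs reduces to substituting a generated label for the ambiguous word in the transcription. The sketch below assumes labels arrive pre-ranked and treats a few deictic words as the ambiguity to resolve; it is a minimal illustration, not the patent's actual implementation.

```python
def generate_search_query(transcription: str,
                          image_labels: list,
                          text_labels: list) -> str:
    """Rewrite an ambiguous spoken query using labels from image
    recognition (first labels) and text recognition (second labels),
    replacing a deictic word like "this" with the top-ranked label."""
    labels = image_labels + text_labels
    deictics = {"this", "that", "it"}
    words = transcription.rstrip("?").split()
    rewritten = [labels[0] if w.lower() in deictics and labels else w
                 for w in words]
    return " ".join(rewritten)
```

For example, with an image-recognition label of "Eiffel Tower", the spoken question "What is this?" becomes the searchable query "What is Eiffel Tower".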
The Process Behind Disambiguating a Visual Query
The process described in this patent includes:
- Receiving an image presented on, or corresponding to, at least part of a display of a computing device
- Receiving a transcription of an utterance spoken by a searcher while the image is being presented
- Identifying a particular sub-image included in the image
- Determining, based on performing image recognition on the particular sub-image, one or more first labels that indicate a context of the particular sub-image
- Performing text recognition on the part of the image other than the particular sub-image
- Determining one or more second labels that indicate the context of the particular sub-image
- Generating a search query based on the transcription, the first labels, and the second labels
- Providing the search query for output
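The text-recognition step might be approximated, purely for illustration, by pulling candidate label terms out of OCR text found near the sub-image. Real label generation would use entity recognition; the capitalization heuristic below is an assumption standing in for it.

```python
def second_labels_from_text(ocr_text: str, stopwords: set) -> list:
    """Derive crude context labels from text recognized outside the
    sub-image (e.g. a caption or comments): keep capitalized terms
    that are not stopwords."""
    tokens = [t.strip(".,!?") for t in ocr_text.split()]
    return [t for t in tokens
            if t.lower() not in stopwords and t[:1].isupper()]
```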
Other aspects of performing such image query searches may involve:
- Weighting a first label differently than a second label; the search query may substitute one or more of the first labels or the second labels based on terms in the transcription
- Generating, for each of the first labels and the second labels, a label confidence score that indicates a likelihood that the label corresponds to the part of the particular sub-image that is of primary interest to the user
- Selecting one or more of the first labels and second labels based on their respective label confidence scores, where the search query is based on the selected first labels and second labels
- Accessing historical query data that includes previous search queries submitted by other users
- Generating, based on the transcription, the first labels, and the second labels, one or more candidate search queries
- Comparing the historical query data to the one or more candidate search queries
- Selecting a search query from among the one or more candidate search queries, based on comparing the historical query data to those candidate search queries
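A minimal sketch of that selection step, assuming candidate queries already carry confidence scores and that a match against historical query data simply adds a fixed boost (the 0.2 value and the dict/set interfaces are arbitrary assumptions):

```python
def select_query(candidates: dict, historical_queries: set) -> str:
    """Pick a search query from scored candidate rewrites. Candidates
    that also appear in historical query data from other users get a
    fixed boost; the highest-scoring candidate wins."""
    def score(item):
        query, confidence = item
        boost = 0.2 if query in historical_queries else 0.0
        return confidence + boost
    return max(candidates.items(), key=score)[0]
```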
The method may also include:
- Generating, based on the transcription, the first labels, and the second labels, one or more candidate search queries
- Determining, for each of the one or more candidate search queries, a query confidence score that indicates a likelihood that the candidate search query is an accurate rewrite of the transcription
- Selecting, based on the query confidence scores, a particular candidate search query as the search query
- Identifying one or more images included in the image
- Generating, for each of the one or more images included in the image, an image confidence score that indicates a likelihood that the image is the image of primary interest to the user
- Selecting the particular sub-image based on the image confidence scores for the one or more images
- Receiving data indicating a control event at the computing device, where the control event identifies the particular sub-image. (The computing device may capture the image, and capture audio data corresponding to the utterance, in response to detecting a predefined hotword.)
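Sketching the sub-image selection described above: an explicit control event (such as the user tapping a region of the screen) overrides the scores; otherwise the highest-scoring sub-image wins. The dict-of-scores interface is an assumption.

```python
from typing import Optional

def select_sub_image(image_scores: dict,
                     control_event: Optional[str] = None) -> str:
    """Pick the sub-image of primary interest to the user. A control
    event naming a sub-image wins outright; otherwise choose the
    sub-image with the highest image confidence score."""
    if control_event is not None and control_event in image_scores:
        return control_event
    return max(image_scores, key=image_scores.get)
```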
Further, the method may also include:
- Receiving an additional image from the computing device and an additional transcription of an additional utterance spoken by a user of the computing device
- Identifying an additional particular sub-image included in the additional image
- Determining, based on performing image recognition on the additional particular sub-image, one or more additional first labels that indicate a context of the additional particular sub-image
- Determining, based on performing text recognition on a portion of the additional image other than the additional particular sub-image, one or more additional second labels that indicate the context of the additional particular sub-image
- Generating a command based on the additional transcription, the additional first labels, and the additional second labels, and performing the command
Performing the command can include:
- Storing the additional image in memory
- Storing the particular sub-image in memory
- Uploading the additional image to a server
- Uploading the particular sub-image to the server
- Providing the additional image to an application of the computing device
- Providing the particular sub-image to that application
- Identifying metadata associated with the particular sub-image, where determining the one or more first labels that indicate the context of the particular sub-image is further based on that metadata
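The command execution step might look like a small dispatch table; the command names and the dict-backed stores below are illustrative stand-ins, not terms from the patent.

```python
def perform_command(command: str, image: bytes, sub_image: bytes,
                    memory: dict, server: dict) -> None:
    """Carry out the command generated from the utterance and labels:
    store or upload either the whole additional image or just the
    particular sub-image."""
    actions = {
        "store_image":      lambda: memory.update(image=image),
        "store_sub_image":  lambda: memory.update(sub_image=sub_image),
        "upload_image":     lambda: server.update(image=image),
        "upload_sub_image": lambda: server.update(sub_image=sub_image),
    }
    actions[command]()  # raises KeyError for an unrecognized command
```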
Advantages of following the image query process described in the patent can include:
- The methods can determine the context of an image corresponding to a portion of a display of a computing device, to assist in the processing of natural language queries
- The context of the image may be determined through image and/or text recognition
- The context of the image may be used to rewrite a transcription of a user's utterance
- The methods may generate labels that refer to the context of the image, and substitute those labels for portions of the transcription, such as "Where was this taken?"
- The methods may determine that the user is referring to the image on the screen of the computing device
- The methods can extract information about the image to determine its context, as well as a context of other portions of the screen that do not include the image, such as a location where the image was taken
This patent can be found at:
Contextually disambiguating queries
Inventors: Ibrahim Badr, Nils Grimsmo, Gokhan H. Bakir, Kamil Anikiej, Aayush Kumar, and Viacheslav Kuznetsov
Assignee: Google LLC
US Patent: 10,565,256
Granted: February 18, 2020
Filed: March 20, 2017
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for contextually disambiguating queries are disclosed. In one aspect, a method includes receiving an image being presented on a display of a computing device and a transcription of an utterance spoken by a user of the computing device, identifying a particular sub-image that is included in the image, and, based on performing image recognition on the particular sub-image, determining one or more first labels that indicate a context of the particular sub-image. The method also includes, based on performing text recognition on a portion of the image other than the particular sub-image, determining one or more second labels that indicate the context of the particular sub-image, based on the transcription, the first labels, and the second labels, generating a search query, and providing, for output, the search query.