Probably some of you have used Google reverse image search: you drag an image from your computer onto the search field, or paste an image URL after clicking the camera icon. But there is no API for it that returns the results nicely in JSON or XML without any hassle. There was an API for Google image search, which is now deprecated, but it didn't provide reverse image search anyway.
So I searched for other APIs. The first one I found, and the one recommended around the internet as an alternative to Google, is TinEye. I tried uploading some pictures on their website, but the results weren't as rich as those from Google reverse image search.
The other alternative was the Bing Search API. I didn't find anything about reverse image search in its description, so I quickly set up the Bing Search API to test it. All it had was the normal search API, with no reverse search. So if you want a regular search API for images, consider using the Bing Search API.
Okay, let's jump into Google reverse image search.
I bet you're wondering how to automate the part where you drag an image onto the Google search box, or what the full URL is when you search by image URL. The full URL is
https://www.google.com/searchbyimage?&image_url=<YOUR URL>
For example
https://www.google.com/searchbyimage?&image_url=http://kaizern.com/blog/beautiful-landscapes-1.jpg
If you open the above address in your browser, you will get the search results and see that the address is now different: you were redirected. So if you use something like this in your code
file_get_contents('https://www.google.com/searchbyimage?&image_url=http://kaizern.com/blog/beautiful-landscapes-1.jpg');
you will only get the first page, with status 302. What you need is to follow the redirect chain. At this point cURL comes to the rescue. The cURL function below works like a charm and opens Google's search results page.
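function open_url($full_url)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $full_url);
    curl_setopt($curl, CURLOPT_HEADER, 0);             // leave response headers out of the output
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);     // return the page as a string instead of printing it
    curl_setopt($curl, CURLOPT_REFERER, 'http://www.kaizern.com');
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // skips certificate checks; fine for experiments only
    // Send a browser-like user agent with the request
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11");
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow the 302 redirect chain
    $content = utf8_decode(curl_exec($curl));          // note: utf8_decode() is deprecated as of PHP 8.2
    curl_close($curl);
    return $content;
}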
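When the redirect chain misbehaves, it helps to see where the request actually ended up. The following is only a sketch, not part of my class: the helper name open_url_with_info and the $max_redirects parameter are made up for illustration. It sends a similar request but also returns the final status code, the final URL and the number of redirects followed.

```php
<?php
// Hypothetical helper (not from the original class): a request similar to
// open_url(), plus details about how the redirect chain ended.
function open_url_with_info(string $full_url, int $max_redirects = 5): array
{
    $curl = curl_init();
    curl_setopt_array($curl, [
        CURLOPT_URL            => $full_url,
        CURLOPT_RETURNTRANSFER => true,           // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,           // follow 3xx redirects
        CURLOPT_MAXREDIRS      => $max_redirects, // stop on redirect loops
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11',
    ]);

    $body = curl_exec($curl); // false on failure

    $info = [
        'ok'        => $body !== false,
        'error'     => $body === false ? curl_error($curl) : null,
        'status'    => curl_getinfo($curl, CURLINFO_HTTP_CODE),      // status of the last response
        'final_url' => curl_getinfo($curl, CURLINFO_EFFECTIVE_URL),  // URL after all redirects
        'redirects' => curl_getinfo($curl, CURLINFO_REDIRECT_COUNT), // how many hops were followed
        'body'      => $body === false ? '' : $body,
    ];

    curl_close($curl);
    return $info;
}
```

If 'status' is still 302 or 'redirects' is 0, the redirect was not followed and the body is not the results page.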
The $full_url variable is the full URL for the reverse image search, like
https://www.google.com/searchbyimage?&image_url=http://kaizern.com/blog/beautiful-landscapes-1.jpg
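Because the image address travels as a query parameter, an address that itself contains characters like ? or & can break the search URL. Here is a minimal sketch of building the URL with proper encoding. The function name build_search_url is hypothetical, and I am assuming the endpoint accepts a URL-encoded image_url value, as query parameters normally do.

```php
<?php
// Hypothetical helper: builds the reverse image search URL for an image address.
// http_build_query() URL-encodes the value, so characters such as "?" or "&"
// inside the image address cannot be mistaken for part of the search URL.
function build_search_url(string $image_url): string
{
    return 'https://www.google.com/searchbyimage?'
        . http_build_query(['image_url' => $image_url]);
}

echo build_search_url('http://kaizern.com/blog/beautiful-landscapes-1.jpg'), "\n";
// prints https://www.google.com/searchbyimage?image_url=http%3A%2F%2Fkaizern.com%2Fblog%2Fbeautiful-landscapes-1.jpg
```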
The open_url function returns the whole results page. The next step is to drop the HTML we don't need. Let's say we need everything inside the <body> tag, and the <head> tag will be dropped.
function get_tag_content_as_dom($img_res_url, $tag_name = 'body')
{
    // Despite its name, $img_res_url holds the HTML returned by open_url()
    $dom = new DOMDocument();
    $dom->strictErrorChecking = false; // turn off warnings and errors when parsing
    @$dom->loadHTML($img_res_url);

    // Take the first element with the requested tag name
    $body = $dom->getElementsByTagName($tag_name);
    $body = $body->item(0);

    // Copy that element, with all its children, into a fresh document
    $new_dom = new DOMDocument();
    $node = $new_dom->importNode($body, true);
    $new_dom->appendChild($node);
    return $new_dom;
}
So let's walk through that function. The first argument is the result of the open_url function, and the second is the HTML tag whose contents we need. We use PHP's DOMDocument class. We get the element by tag name ('body' by default), then build a new DOMDocument that contains that element with all its child nodes, copied recursively, so that we can traverse it later with XPath.
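To see the extraction step in isolation, here is a small sketch that runs on a hard-coded HTML string instead of a real results page. The function name extract_tag_as_dom is hypothetical. It is a variant of the function above that collects parser warnings with libxml_use_internal_errors() instead of the @ operator, and returns null when the tag is missing.

```php
<?php
// Hypothetical variant of get_tag_content_as_dom(), for illustration only.
function extract_tag_as_dom(string $html, string $tag_name = 'body'): ?DOMDocument
{
    // Collect HTML parse warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $element = $dom->getElementsByTagName($tag_name)->item(0);
    if ($element === null) {
        return null; // the requested tag is not in the document
    }

    // Deep-copy the element into a new document so it becomes the root
    $new_dom = new DOMDocument();
    $new_dom->appendChild($new_dom->importNode($element, true));
    return $new_dom;
}

$sample = '<html><head><title>Ignored</title></head>'
        . '<body><div id="main"><p>Hello</p></div></body></html>';

$body_dom = extract_tag_as_dom($sample);
echo $body_dom->saveHTML(); // the <body> element and its children, without <head>
```

Since <body> becomes the root of the new document, an XPath query against it can start with /body.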
Now it's time to analyze the HTML structure of the Google search results so we can write a correct XPath query. Try playing around by uploading different pictures to Google image search. You will see that Google recognizes some images and shows a "best guess for this image". I opened the Chrome developer tools and wrote down the path to where the best guess is. Here's the path:
<div id="main"> <div> <div id="cnt"> <div id="rcnt"> <div id="center_col"> <div class="med" id="res" role="main"> <div id="topstuff">
The xpath query for that is
/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='topstuff']
My function for that is
function get_xpath_result($dom, $xpath_query)
{
    $dom_xpath = new DOMXPath($dom);
    return $dom_xpath->query($xpath_query);
}
The first argument is the DOM document that you got from the previous function, get_tag_content_as_dom, and the second argument is the XPath query for the element with id topstuff in the HTML.
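A full absolute path like the one above breaks as soon as any wrapper div changes. As a sketch of an alternative, the // axis finds the element by id wherever it sits in the tree. The helper name xpath_texts and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: runs an XPath query on an HTML string and
// returns the text content of every matching node.
function xpath_texts(string $html, string $xpath_query): array
{
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $texts = [];
    foreach ($xpath->query($xpath_query) as $node) {
        $texts[] = $node->nodeValue;
    }
    return $texts;
}

$html = '<html><body><div id="main"><div id="cnt"><div id="topstuff">'
      . 'Best guess for this image: example</div></div></div></body></html>';

// "//" matches the div at any depth, so the wrappers around it do not matter
print_r(xpath_texts($html, "//div[@id='topstuff']"));
```

The trade-off is that a loose query could match an unrelated element if the page ever reuses that id somewhere else.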
Don't worry if you don't grasp the whole picture yet. My whole class is further down in this post.
I keep the XPath query in the class's scope as a static variable
static $topstuff_div_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='topstuff']";
In my class's constructor this part looks like
$topstuff_div = $this->get_xpath_result($body_dom, self::$topstuff_div_query);
Next we need to get the text after "best guess for this image". I chose a string parsing approach instead of XPath, using this method
function get_best_guess($topstuff_div)
{
    // Join the text of every matched node into one string
    $topstuff_result = '';
    foreach ($topstuff_div as $val) {
        $topstuff_result .= $val->nodeValue . " ";
    }
    // Keep only the text that follows the label
    $best_guess = $this->strstr_after($topstuff_result, 'Best guess for this image:');
    return trim($best_guess, ' ');
}
This function's argument is the result of our previous function, get_xpath_result, as you can guess from the argument name, and get_best_guess returns the text that comes after "best guess for this image". In our example of the beautiful landscape image it is "sentieri del cuore".
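The strstr_after helper is part of my class and not shown here. As a stand-in, here is one way such a helper could look. The name text_after and its behavior when the label is missing (an empty string) are assumptions for this sketch, not necessarily what my class does.

```php
<?php
// Hypothetical stand-in for strstr_after(): returns everything in $haystack
// that comes after the first occurrence of $needle, or '' if it is not found.
function text_after(string $haystack, string $needle): string
{
    $pos = strpos($haystack, $needle);
    if ($pos === false) {
        return '';
    }
    return (string) substr($haystack, $pos + strlen($needle));
}

$topstuff_text = 'Best guess for this image: sentieri del cuore ';
echo trim(text_after($topstuff_text, 'Best guess for this image:')), "\n";
// prints sentieri del cuore
```

The match is case-sensitive, so the label has to be written exactly as it appears on the page.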
What if there is no best guess? Then there is no div with id topstuff, and we have to dig into the search results themselves. Again, look at the HTML of the search results to see how they are structured. Here is what I wrote down for myself
<div id="main">
  <div>
    <div id="cnt">
      <div id="rcnt">
        <div id="center_col">
          <div class="med" id="res" role="main">
            <div id="topstuff">
            <div id="search">
              <div id="ires">
                <ol id="rso">
                  <li class="g">
                  NOT <li id="imagebox_bigimages">
Each result is in a list element with class g. The similar images are also in a list element, which additionally has the id imagebox_bigimages. If you don't need the similar images, exclude that element in your XPath query.
Going deeper into the results HTML, we see that each result's title is in an <h3> tag with class r and the description is in a <span> element with class st. Here are my XPath queries for the title and description of each search result, with the similar images list element excluded.
static $title_h3_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='search']/div[@id='ires']/ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']/div[@class='vsc']//h3[@class='r']";
static $span_text_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='search']/div[@id='ires']/ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']/div[@class='vsc']//span[@class='st']";
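Two separate queries return two separate lists, and they stop lining up if some result has a title but no description. As a sketch, one can select each result's list element first and then look up the title and description relative to it. The function name extract_results and the sample HTML are made up for illustration, and the shortened // query is my simplification of the full paths above.

```php
<?php
// Hypothetical helper: returns title/description pairs for every result,
// skipping the "similar images" list element.
function extract_results(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $items = $xpath->query(
        "//ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']"
    );

    $results = [];
    foreach ($items as $li) {
        // The leading "." makes each query relative to the current <li>
        $title = $xpath->query(".//h3[@class='r']", $li)->item(0);
        $text  = $xpath->query(".//span[@class='st']", $li)->item(0);
        $results[] = [
            'title'       => $title ? trim($title->nodeValue) : '',
            'description' => $text ? trim($text->nodeValue) : '',
        ];
    }
    return $results;
}

$sample = '<html><body><ol id="rso">'
    . '<li class="g"><div class="vsc"><h3 class="r"><a href="#">First title</a></h3>'
    . '<span class="st">First description</span></div></li>'
    . '<li class="g" id="imagebox_bigimages"><div class="vsc"><h3 class="r">Similar images</h3></div></li>'
    . '<li class="g"><div class="vsc"><h3 class="r"><a href="#">Second title</a></h3>'
    . '<span class="st">Second description</span></div></li>'
    . '</ol></body></html>';

print_r(extract_results($sample)); // two pairs; the similar images item is skipped
```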