This is a written version of my talk for Confab Central, given May 9, 2014 (based on a blog post of the same name I wrote in late 2013), which broke down some of the key features of Google’s Hummingbird algorithm update and tried to make sense of it for people who create content on the internet. I also threw in a few extra details I didn’t cover in the talk and links to sources.
Let’s talk about some scary SEO stuff.
Actually, let’s talk about how to make some of the scary SEO stuff not so scary anymore. Because, in reality, Hummingbird represents a major change in the nature of search, and it’s important that we—as content strategists and content creators—understand how the content we produce is found by our audience on the internet.
I’d like to start off by asking a question: What does Google want?
I believe this question is at the heart of understanding how to make search-friendly content on the internet.
What does Google want? What is it trying to achieve? Why does it even exist?
Some may answer that Google strives to create a structured system of ranking websites in order to create a level playing field for all content on the internet—to give it all the same chance at gaining exposure through search.
The truth is, Google doesn’t care. It doesn’t care about you, your website, or your company. It doesn’t care how much time you spent writing that blog post. It doesn’t care how much money you’ve spent buying links.
To Google, a single page on your website is like a grain of sand on a beach.
In the Milky Way.
There is so much content out there that Google simply doesn’t have the resources to care about anything other than algorithmically delivering the best information and value it can provide to its users.
Google is waaaaaay past ranking websites.
Google is now an information engine. And the only way it will continue to make piles and piles of cash is if it can continue to provide the best, most valuable information to its users. Google is not trying to send people to websites, it’s trying to provide immediate answers and information to users as quickly as possible.
The only way you’re going to matter to Google’s search algorithm at all is if you provide value to Google's users.
And this is what brings us to Hummingbird.
Hummingbird is Google’s latest big algorithm update, launched in September 2013. It was named Hummingbird because it’s supposed to be light, fast, and accurate.
But Hummingbird is not just another algorithm update. Hummingbird represents a major philosophical shift in the way Google approaches the idea of search on the internet.
It’s essential to understand that Hummingbird isn’t just another update.
It is a brand new search engine.
Think of Hummingbird like replacing a car engine. Many of the parts are the same, but the major factor that pushes the car forward is completely new.
In fact, Google has been building to Hummingbird for a while with updates like Panda and Penguin and elements like synonym recognition, co-occurrence, and others.
If the old Google search engine were a cocktail, it would be a big blue monstrosity with elements hanging off the sides and a big straw sticking out the top. But Hummingbird has turned the once-clunky search engine into a sexy little cocktail that you’d order in a bar if you wanted to look hip and sophisticated.
The alcohol is still in there, but the sexy cocktail is stronger and more effective. As a result, this new sexy version of the search engine allows Google to change direction slightly.
But it shouldn’t have come as a surprise. Google had already made its intention clear. Back in May 2012, it publicly introduced the knowledge graph and stated that it was moving “from strings to things.”
That means Google is moving away from keyword strings as the primary driver of search and toward semantically understanding the “thing” you are actually searching for.
For example, in the old engine, if you searched for “Cat’s in the Cradle,” the engine would recognize that query as a string of words. And it would then go out and search the internet for that exact string of words.
It would bring back all the pages that contained that exact keyword string. Usually the webpage with the most links pointing to it that also contained the exact phrase would appear at the top of the results.
But with the new engine, that’s changed slightly.
Today, if you type in the same string of keywords: “Cat’s in the Cradle,” Google understands your query as a song by Harry Chapin. (Go ahead, look it up, it’s pretty neat. I’ll wait.)
Essentially, Google is finally taking advantage of the semantic web.
That is, Google is moving beyond keyword strings and toward attempting to understand the real-world meaning and intent of searches.
Semantics itself is the study of the relationship between the signifier and the signified (i.e., the words we use to indicate a thing in the real world and the thing itself).
As such, it is the study of something that is nebulous and constantly shifting.
In essence, the words we use are completely arbitrary and have no inherent connection to the things they represent. Meaning is only created socially. We create meaning together because we all agree that certain sounds we utter represent the things, people, places, and ideas around us.
Google is now trying to algorithmically deconstruct the semantic meaning of our language by examining all the content we've given it on the internet. It's decoding the context and relationships inherent in our language in order to understand what it is we're actually talking about. And it's doing this so that it can give us better search results.
This is great news for content creators. It's great news because creating meaning is what we know how to do. It’s what we do best. It’s why we’re good at our jobs. Understanding how to create emotionally compelling, quality content is what we’re all about. And FINALLY Google is beginning to understand that content, judge its quality, and use its mathematical understanding of semantic meaning to evaluate and rank our content. Because if it can find the best content on the internet, it can serve it to its users, and they'll keep coming back for more.
For many years, content and search optimization were two different things. Google could only match keywords and count links, so optimization consisted basically of matching keywords and counting links. Google couldn’t judge quality algorithmically, so quality was not an important factor in search optimization.
However, that's changing. Today, Hummingbird is the marriage of content and optimization.
Increasingly, quality is optimization.
If you need any proof of the increasing importance of quality content, you only need to look at the SEO industry over the past couple of years to observe the massive shift toward content!, content!, content!
Which, again, is awesome for people who already know how to create quality, engaging content. However, if quality content is optimization, then content creators should be taking a bigger role in search optimization, for the simple reason that we already know how to do it.
That's one of the reasons I believe that it is easier to teach content creators SEO than it is to teach SEOs to create good content.
I'm not saying that SEOs always make terrible content. I am saying that the SEO industry as a whole has found itself trying to play catch up to understand what makes good content work. Whereas a content creator with years of experience creating meaning and effectively expressing information can start learning SEO today and master a lot of the basics very quickly.
Learning the basics of SEO isn't that hard, although understanding how search engines crawl and evaluate content takes work and practice.
With that in mind, let's talk about the impact Hummingbird has on the way we create and distribute content on the internet. And there is no better place to start than with entities.
An entity is a person, place, thing, or idea that exists in the real world. If you’re imagining the semantic web, think about entities as the nodes or intersections on that web.
To give you some background: In 2010, Google bought a company called Metaweb. (Metaweb owns Freebase, which we’ll talk about later on). Metaweb compiles semantic data from the internet in order to discern relationships and patterns.
One of the ways they do this is through “triples.” A triple is a grammatical construction of subject, predicate, and object. (Grammar nerds unite!)
For example: "Bill Murray plays Steve Zissou."
The triple breakdown for that sentence is:
Bill Murray = Subject
Plays = Predicate
Steve Zissou = Object
That breakdown isn’t very helpful unless, like Metaweb, you can compile these triples across thousands of pages of content on the internet.
When you do, the triples pile up and reveal relationships and patterns.
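To make that compiling step concrete, here's a toy sketch in Python. The triples below are invented stand-ins for data extracted from many pages; the point is that simple counting across documents surfaces the recurring relationships.

```python
from collections import Counter

# Hypothetical triples extracted from crawled pages. In reality these
# would come from parsing thousands of sentences across the web.
triples = [
    ("Bill Murray", "plays", "Steve Zissou"),
    ("Bill Murray", "plays", "Peter Venkman"),
    ("Bill Murray", "stars in", "Ghostbusters"),
    ("Bill Murray", "plays", "Steve Zissou"),  # repeated on another page
]

# Count how often each (subject, predicate, object) pattern recurs;
# repetition across many documents is what signals a real relationship.
counts = Counter(triples)
for (subj, pred, obj), n in counts.most_common():
    print(f"{subj} -> {pred} -> {obj}: seen {n} time(s)")
```

The triples that pile up the highest become the strongest candidate relationships.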
Then you can take all these patterns and relationships and create a rich search result like this:
As you can see, this results page isn’t just a list of links to webpages that include the keyword “Bill Murray.” This result is a comprehensive representation of the entity called "Bill Murray."
Google recognizes that Bill Murray isn’t just a keyword string. Rather, Bill Murray is an entity that exists in the real world. Google recognizes that when people search for Bill Murray, they’re not searching for keywords. They're looking for THIS MAN. And the results page reflects that intent.
This page isn't just a list of links. It’s an answer to the questions Google thinks you might possibly be asking about the entity Bill Murray.
Google is guessing at your intent and trying to serve you the information it thinks you want.
Take a look at the wide variety of content formats it presents: links, pictures, video, news, data, in-depth articles, and more. Google is trying to create a complete picture of the entity known as "Bill Murray," without the user having to go anywhere else.
Chances are you aren’t as famous as Bill Murray. But that shouldn’t keep you from trying to become recognized as an entity by the Google algorithm.
In terms of your content strategy, you need to ask yourself:
What relationships are you creating through your content?
Use plain language to talk about yourself (both on- and off-site) that is simple, clear, and draws a direct connection with the concepts, ideas, and entities you want to be associated with.
Are you creating the type of content Google wants to display?
It’s clear that Google doesn’t want to simply display a page of links anymore (remember, it’s an information engine now). It wants to display a diverse set of results. If that's the case, are you creating a wide variety of types of content—videos, pictures, in-depth articles, etc.—that Google can draw from when people search for you online?
But why does all this matter anyway? Why is becoming an entity so beneficial? Because Google favors entities.
If you follow SEO blogs or stay in tune with the search industry, you've probably noticed the blanket hand-wringing about Google increasingly favoring big brands in the results.
In fact, if you enter a single letter into the Google search bar, 80% of the top three auto-suggestions will be a brand name.
“Pretty crazy. Enter one letter in Google and the first thing that pops up is usually a brand. @jamescgunter #ConfabMN” — Amanda Gallucci (@agalluch), May 9, 2014
Be that as it may, I think Google favors more than brands—Google favors entities. (Brands just happen to be the most visible entities.)
Favoring entities makes complete sense for Google. If Google wants to give people the best, most reliable results, it will rely more heavily on content from and about real world entities that its users already know and trust.
Remember, Google is simply trying to reflect popularity and trust in the real world. If more people trust Target than Joe’s Variety Store, then Google is going to reflect that preference in the search results.
Now let’s take a look at another way Google is determining entities and relationships on the internet.
Co-occurrence is an aspect of the Google algorithm that uncovers patterns of grammatical duplication. In other words, if certain words or entities appear together in text at a high rate, Google assumes a correlation. It can then use that correlation to deliver better search results to users.
For example, if we look back at Bill Murray’s search result, you’ll see people that are related to Bill Murray.
Clearly people who searched for Bill Murray may have also searched for Harold Ramis, Dan Aykroyd, and Wes Anderson, but these are also entities that co-occur with Bill Murray at a high rate—as you can imagine.
Let’s take a look at another example and break down the way Google is using co-occurrence to determine relationships on the internet.
If you search for “greatest basketball player,” you’ll get this nifty little carousel of pictures displaying the people you would expect to be mentioned in a list of the greatest basketball players.
But how does Google know that these are the greatest basketball players? Google is just a robot. It can’t actually take part in a debate about great basketball players. And there’s no one at Google sitting behind a desk curating this list. This carousel is generated algorithmically. At least in part, this carousel is an effect of co-occurrence.
Bill Slawski did a great breakdown of Google’s co-occurrence patent on his blog, SEO by the Sea, and I’m going to use his analysis as a jumping off point to break it down into even simpler terms, so you can see how it works—at least in broad strokes.
Google analyzes co-occurrence roughly like this:
First, it will crawl the top 1000 (or so) results for a search term, like “Michael Jordan.” (I say 1000, but that may or may not be an actual number. It’s just a nice round number that’s easy to work with.)
Second, the algorithm will weed out the most commonly used words on the internet—articles, conjunctions, some prepositions and adverbs, etc. (again, keep in mind this is rough).
So a sentence, like this:
Becomes a jumble of words, like this:
Third, the algorithm scores words based on their proximity to the prime term. For example, words that are right next to the prime term could receive a score of 1. Words that are two words away could receive a score of 2. And so on.
Scoring one document like this doesn’t create much insight. However, when you take those results and automatically compare them across thousands (nay, tens of thousands) of documents on the internet, patterns inevitably arise.
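The rough steps above can be sketched in Python. This is a toy illustration of the idea, not Google's actual patent logic: the stopword list is a tiny stand-in and the scoring scheme is simplified.

```python
from collections import defaultdict

# Tiny stand-in for the "most common words" filter described above.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "is", "was", "in", "to"}

def proximity_scores(text, prime):
    """Score each remaining word by its distance from the prime term.

    Lower scores mean the word sits closer to the prime term.
    """
    # Step 1 & 2: tokenize and weed out very common words.
    words = [w.strip(".,").lower() for w in text.split()]
    words = [w for w in words if w not in STOPWORDS]
    # Step 3: score remaining words by proximity to the prime term.
    positions = [i for i, w in enumerate(words) if w == prime]
    scores = defaultdict(list)
    for i, w in enumerate(words):
        if w == prime:
            continue
        scores[w].append(min(abs(i - p) for p in positions))
    # Keep each word's best (closest) score.
    return {w: min(d) for w, d in scores.items()}

scores = proximity_scores(
    "Michael Jordan is the greatest basketball player of all time", "jordan"
)
print(scores)  # "greatest" scores 1, "basketball" scores 2, etc.
```

Run across enough documents, words like “greatest” and “basketball” would consistently score close to “jordan,” and that consistency is the pattern.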
Now you can see how Google—at least roughly—determines that these specific basketball-player-entities should show up in a search for “greatest basketball player." They have a natural pattern of co-occurrence with that term.
But wait, there's more. This next part blows my mind...
You can clearly see that LeBron James ranks #2 on this list.
When you search for LeBron James, the first two results that appear are his official website and his Wikipedia page. Do you want to take a guess at which phrase never appears in either of the most authoritative sources on LeBron James?
Yep, “greatest basketball player.”
That means LeBron James is showing up in this search without any conscious effort on his part. Google has scoured the internet for mentions of LeBron James and has found a naturally occurring semantic pattern. As a result, he appears in results he isn't even targeting, based on content created and published by other people.
Here’s another example that is even more mind-blowing to me.
This is the results page for “online photoshop.” (A great example, originally shared by Rand Fishkin.)
As you can see, the first result is Pixlr, an online photo editing site. What’s crazy is that it ranks for this term, despite not using the term “photoshop” anywhere on the site. That means there are enough people on the internet referring to Pixlr as an “online photoshop” that Google recognizes the association and ranks it well for that term.
The Pixlr marketing team has done such a good job convincing people that their tool is an “online photoshop” that it even ranks above photoshop.com and ribbet.com—which, as you can see, is explicitly trying to rank for Photoshop terms, based on its meta title and description.
In the old search engine, it was nearly impossible to appear in a search for a keyword that didn’t appear on your site at all. Today, it’s entirely possible.
If there are enough people talking about you on the internet in certain terms, Google will recognize the association and could rank you for them, which can be either a good or a bad thing.
(Note: since giving this presentation, the "online photoshop" SERP has changed. Photoshop now ranks #1 for “online photoshop” and Pixlr ranks #2 and #3. Still, this is an impressive feat for a site that doesn’t use the keyword “photoshop” at all.)
How should you incorporate the principle of co-occurrence into your content?
How is your brand talked about online?
Are you being associated with the terms, keywords, and entities that you want to be associated with? If not, maybe it’s time to make a change. Define the language you want to be associated with both on- and off-site in your content strategy and style guides. Work with your SEO and content marketing teams to ensure they are consistent in the way they talk about your brand online.
What phrases are you co-occurring with?
Are they the terms and phrases you want to co-occur with? Again, take the time to evaluate what terms you are currently being associated with and decide whether they are helping you or hurting you. If you need to, make a change, document it in your strategy and style guide, and make it consistent in all your communication.
Another aspect of the algorithm that goes hand-in-hand with co-occurrence is co-citation. Whereas co-occurrence deals with grammatical pattern repetition, co-citation deals with citation (or link) pattern repetition.
In essence, it works like this:
(HT to Haris Bascic)
When site A links to sites B and C, Google infers an association between sites B and C. As stated previously, this citation analysis wouldn’t be very useful or accurate if you analyzed just one site. But if you could analyze the linking patterns of tens of thousands of sites, you’d start to uncover recurring patterns. And those patterns could become strong indicators of associations between websites.
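Here's a toy sketch of that pattern-counting in Python. The site names and link lists are invented; the point is that pairs of sites cited together across many pages accumulate evidence of a relationship.

```python
from collections import Counter
from itertools import combinations

# Hypothetical outbound links found on three different pages.
outbound_links = {
    "site-a.com": ["site-b.com", "site-c.com"],
    "site-d.com": ["site-b.com", "site-c.com", "site-e.com"],
    "site-f.com": ["site-b.com", "site-c.com"],
}

# Every pair of sites cited together on one page counts as one co-citation.
pair_counts = Counter()
for links in outbound_links.values():
    for pair in combinations(sorted(set(links)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))
# site-b.com and site-c.com are co-cited by all three pages
```

Scaled up to the whole web, the pairs that keep showing up together become strong association signals.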
These relationships and associations feed into the same results you’d get from co-occurrence. Essentially, co-occurrence and co-citation work together to determine strong associations between entities and keywords.
Questions to ask yourself about co-citation and your content strategy:
Where are you mentioned?
What citations are currently pointing to your site and where are they coming from? Make sure that you’re being mentioned in the right neighborhoods and contexts as well as with brands that are similar to yours. If you’re not being linked to in the same contexts or from the same sites as your competitors, maybe that’s something you can work into your strategy.
Are you linking out?
There was once this idea that “link juice” flowed from one site to another through citations. You wanted juice flowing to your site, but you didn’t want to link out to other sites because that would mean losing link juice—like poking holes in the bottom of a boat. But that is a false analogy. Google wants to see relationships between content and entities through links. Would you trust an article that didn’t cite its sources or point you toward more information? Neither does Google.
And that brings us to...
These are pretty well-known aspects of the algorithm, but content creators and strategists may not be as familiar with them as many people in the SEO world. Let’s go over them briefly and talk about the impact they have on your content strategy.
Panda rolled out in February 2011 and was Google’s first real attempt to separate good content from bad. Primarily, Panda penalized thin, low-quality, over-optimized content. Another name for Panda at the time was “Farmer” because it hit content farms pretty hard. Sites like Demand Media, Hubspot, and others that published very short (200-word), keyword-stuffed articles had ranked well and garnered a lot of traffic in the previous years, but rarely provided any usable or reliable information to actual users. Panda sought to change that by favoring quality content and suppressing content of no actual value to users.
Then in April 2012, Google rolled out Penguin. Penguin was a reaction against spammy linkbuilding practices, over-optimized anchor text, and linkbuilding through article marketing (spinning). A lot of sites that had over-optimized backlink profiles from gibberish content were hit with Penguin.
So, what’s the best way to avoid pandas and penguins (besides going to the desert)?
Create good, quality content.
“Create quality content” is easy to say, but what does "quality" actually mean for Google?
First of all, longer is generally better than shorter. I say this with a big caveat: it holds generally, but not in every instance or context. Although 2,000 words might often be better than 500, length is relative. There is no magic number of words on a page that will ensure it ranks #1 in the results.
Take a look at your industry and the content that is already being produced on the topics you want to publish content on. If the field is full of 300-word blog posts, you might be able to write a quality 700-word piece and have it rank better than your competitor's content.
Also keep in mind that Google’s in-depth results (which I’ve seen creeping up the page from position 10 to position 7 and 8 from time to time) are usually 3000-5000 words long. Again, it’s all relative to the search you’re doing and the content that already exists on that topic.
This isn’t a mandate to write a novel on every page of your site. Write for users first and algorithms second.
However, if there is a lack of amazing resources on a topic closely related to your business or organization, don’t be afraid to tackle it at length or in as much detail as necessary to provide a good resource to users. Length is nothing to be afraid of.
In addition to length and depth, a Mathsight study meant to deconstruct the Penguin 2.0 update found that, for content that ranked well in search, “More rare words are good and generally rewarded—i.e., those that are not in the 5,000 most common words in the English language.”
As we all know, correlation does not equal causation. I don’t believe an elevated reading level and unique words are explicitly rewarded by the algorithm.
Quality, in-depth, valuable content in general doesn’t shy away from more unique words, scientific or industry-specific terms, or higher reading levels. That is, if you’re churning out 5 blog posts a day or spinning content, your content is likely to be both shorter and at a more basic reading level than content that is more in-depth. In-depth content is likely to be more useful to people and is more likely to be shared and linked to, thus sending more quality signals to Google that the content is good and should be ranked well.
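As a hypothetical illustration of the Mathsight finding, here's how you might measure a text's “rare word” ratio in Python. The tiny COMMON set below is an invented stand-in for a real list of the 5,000 most common English words.

```python
# Invented stand-in for a list of the most common English words.
COMMON = {"the", "a", "is", "of", "and", "to", "in", "content", "good"}

def rare_word_ratio(text):
    """Fraction of words that fall outside the common-words list."""
    words = [w.strip(".,").lower() for w in text.split()]
    if not words:
        return 0.0
    rare = [w for w in words if w not in COMMON]
    return len(rare) / len(words)

thin = "The content is good and the content is good."
deep = "Semantic co-occurrence patterns reveal latent entity relationships."
print(rare_word_ratio(thin))  # every word is common
print(rare_word_ratio(deep))  # every word is rare
```

A heuristic like this would only ever be one weak signal among many, which is exactly the correlation-not-causation point above.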
So what does all this mean? Create a unique voice and perspective, and don’t be afraid to go deep when you think it will benefit your audience.
Keep in mind, bad content has neither a voice nor a perspective.
Communicate like a normal person, use the terms that fit best to explain what you’re trying to get across, and take the time to explore your topic fully. As a result, you’ll be more likely to do well in search.
Moving on: There are a couple things I want to touch on that are more technical in nature but can help Google understand your content.
Although Google is getting better at identifying the subject and quality of your content, it’s still a long way off from inherently understanding it. Google needs your help.
A little while back, Google got together with Yahoo and Bing to create a shared markup language that content creators and developers could use to tag their content in ways that search engines could easily identify and understand. The result was Schema.org.
Go to Schema.org and look through hundreds of ways to tag your content. It’s got tags for people, places, movies, businesses, organizations, and more. For example, you can use Schema.org to tag your product pages so Google can recognize product names, serial numbers, descriptions, reviews, and more.
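For example, product markup is often expressed as JSON-LD using Schema.org types. Here's a minimal sketch in Python with invented product details; in practice, the resulting JSON would be embedded in a script tag of type "application/ld+json" on the product page.

```python
import json

# Hypothetical product described with Schema.org's Product vocabulary.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",  # invented product name
    "sku": "EW-1234",          # invented serial number
    "description": "A sturdy example widget.",
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.4",
        "reviewCount": "89",
    },
}

# Serialize to the JSON that would be embedded in the page.
print(json.dumps(product, indent=2))
```

With markup like this in place, a crawler doesn't have to guess which string is the product name and which is the review score; you've told it explicitly.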
As of November of last year, less than a quarter of websites on the internet used structured content markup like Schema.org. Chances are your competitors aren't doing this.
That means, if you can get on the ball with structured markup, you'll have the edge over 75% of the websites out there.
Remember way back at the top of this page when I said Google owns a company called Metaweb? Well, Metaweb owns and runs Freebase—a free, open database for just about everything you can imagine. Seriously, everything.
Freebase is one of the sources Google uses to build the knowledge graph and recognize entities. Again, it is free and open for anyone to use. You can sign up for an account right now and edit or add to the database and help Google figure out entities, relationships, and associations for your brand, company, or organization.
Keep in mind, Google does not use Freebase as a sole source of information when building the knowledge graph and determining entities. It uses Freebase as a foundation and corroborates information on other sites before it “trusts” the information it’s been given. If Google can find multiple instances of the same information relating to a specific entity across the internet, it is more likely to trust the information and use it to affect search results.
Let me show you a quick example of how a simple implementation of Schema.org markup and Freebase can help you.
At the end of last year while researching Hummingbird, entities, and the knowledge graph and trying to make sense of it, I decided to conduct a simple experiment.
One thing you have to understand before we go further: I’m not famous. I'm nobody. I’m not known for anything in particular. When you searched for my name on Google, I did not appear in the results at all. (It turns out there are a fair number of James Gunters in the world, and I’m not a special one.)
For the sake of this experiment, my unremarkableness was absolutely perfect. I decided to find out if I could rank for my own name simply by adding Schema.org markup to my website, creating a Freebase entry for myself, and leaving some breadcrumbs for Google to follow. Keep in mind, my personal website was brand spanking new at the time and there were no inbound links pointing to it—zero inbound links.
I created a Freebase entry for myself, entered a little bit of personal information, and linked to my personal site and some social profiles. Very minimal. Next, I used Schema.org to mark up my site for authorship, pointing back to my Freebase entry and to the same social profiles. Then I updated my Google+ profile to ensure that all the same information from Freebase matched my profile and pointed toward my website.
Lo and behold!
Just a short while later, when you searched for “James Gunter,” my picture showed up in the results, and my Twitter profile showed up just underneath. And when you searched for “James C Gunter”—which I generally use, so as to differentiate myself from all the other James Gunters in the world—my Facebook profile showed up #2, my picture was the first to show in the images results, and my Twitter profile and official website showed up just below.
Note: See that knowledge graph box you can see for the “James Gunter” result? That’s not me; that’s a long-dead English confectioner from the 1700s who is more famous than I am.
Granted, searches for my name are not extremely competitive and there aren’t hundreds of thousands of searches for “James Gunter” each month. Also there are many other ways to "tell" Google what your content is about. But, I feel I can go out on a limb and say that by simply implementing a few small changes and formally giving Google corroborating information through sources it trusts, I was able to get Google to recognize me in at least a small way. Maybe doing simple things like this can also help your content succeed on the internet.
Update: Since taking the screenshot above, the results have actually improved without any additional external links to my site.
We’ve covered a lot of information. A lot of it was fairly technical in nature. But you don’t have to remember all of it. I don’t expect you to be experts in optimizing your content after walking away from this.
I just want you to remember that all Google wants to do is find and deliver the best content to its users. We—as content creators and strategists—already know how to create great, audience-targeted content. It’s what we’re good at.
We are the meaning makers.
That’s why I believe it’s our turn to step up to the plate and take more responsibility for ensuring that the content we produce has the best shot at being discovered by Google and devoured by real people.
It’s not that hard to do.
The three things I want you to take away:
1. Focus on becoming an entity through semantic association
2. Audit your content for thin, keyword-stuffed crap, and turn it into content with a voice and purpose
3. Implement Schema.org and Freebase so that Google has a better chance of understanding your content
Overall, I want you to keep our original question in mind: What does Google want?
It wants good content.
Make it. Publish it. And ensure Google can understand it.
Hummingbird is not that scary. SEO is not that scary. Optimizing your content is not scary. It just takes a willingness to learn how to do it.
Now it’s your turn to take charge of your content.
Copyright © 2019 James C. Gunter