Adding Support for Search Engines to your Javascript Applications


#1

It’s a myth that if you use a client side MVC framework, your application’s content cannot be indexed by search engines. In fact, Discourse forums were indexable by Google the day we launched.

Search engine visibility does, however, require a little more work to implement. This is a real trade off you’ll have to consider before you decide to go with an MVC framework instead of an application that does its rendering on the server side.

Before you get scared off: I’d like to point out that our search engine code was done by Sam Saffron in a day! This extra work might take you less time than you thought.

Getting Started: Pretty URLs

Out of the box, most client side MVC frameworks default to hash-based URLs, which take advantage of the fact that characters in a URL after a # are not passed through to the server. Once the Javascript application boots up, it looks at the hash data and figures out what it has to do.

Modern browsers have a better alternative to hash-based URLs: The HTML5 History API. The History API allows your Javascript code to modify the URL without reloading the entire page. Instead of URLs like http://yoursite.com/#/users/eviltrout you can support http://yoursite.com/users/eviltrout.

There are two downsides to using the History API. The first is that Internet Explorer only started supporting it in IE10. If you have to support IE9, you’ll want to stick with hashes. (Note: Discourse actually works on IE9, but the URL does not update as the user navigates around. We’ve accepted this trade off.)

The second downside is that you have to modify your server to serve up your Javascript application regardless of what URL is requested. You need to do this because if you change the browser URL and the user refreshes their browser the server will look for a document at that path that doesn’t exist.
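
How you do this depends on your server framework. In Rails, one straightforward approach is a low-priority catch-all route that hands any unmatched request to the action that renders your Javascript application. The sketch below is illustrative only (the application#index action name is hypothetical, and this isn’t necessarily how Discourse wires it up):

# config/routes.rb -- a sketch, not Discourse's actual routing.
YourApp::Application.routes.draw do
  # ... your JSON/API routes and other explicit routes come first ...

  # Anything that falls through gets the page that boots the Javascript app.
  # 'application#index' is a hypothetical action rendering that page.
  get '*path', to: 'application#index', format: false
end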

Serving Content

The second downside I mentioned actually has a nice upside to it. Even if you are serving up the same Javascript code regardless of URL, there is still an opportunity for the server to do some custom work.

The trick is to serve up two things in one document: your Javascript application and the basic markup for search engines in a <noscript> tag. If you’re unfamiliar with the <noscript> tag, it’s designed for rendering versions of a resource to clients, like search engines, that don’t support Javascript.

This is really easy to do in Ruby on Rails (and probably other frameworks that I’m less familiar with!). Your application.html.erb can look like this:

<html>
  <body>
    <section id='main'></section>
    <noscript>
      <%= yield %>
    </noscript>
    <!-- ... load your Javascript code here; it renders into #main ... -->
  </body>
</html>

With this approach, if any server side route renders a simple HTML document, it will end up in the <noscript> tag for indexing. I wouldn’t spend much time on what the HTML looks like. It’s meant to be read by a robot! Just use very basic HTML. To preview what a search engine will see, you can turn off Javascript support in your browser and hit refresh.

We’ve found it advantageous to use the same URLs for our JSON API as for our routes in the Javascript application. If a URL is requested via XHR or otherwise specifies the JSON content type, it will receive JSON back.

In Rails, you can reuse the same logic for finding your objects, and then choose the JSON or HTML rendering path in the end. Here’s a simplified version of our user#show route:

def show
  @user = fetch_user_from_params

  respond_to do |format|
    format.html do
      # doing nothing here renders show.html.erb with the basic user HTML in <noscript>
    end

    format.json do
      render_json_dump(UserSerializer.new(@user))
    end
  end
end

Note that you don’t have to implement HTML views for all your routes, just the ones that you want to index. The others will just render nothing into <noscript>.

One More Thing

If you get an HTML request for a URL that also responds with JSON, there is a good chance your application will make a call to the same API endpoint after it loads, to retrieve the data as JSON so it can be rendered.

You can avoid this unnecessary round trip by rendering the JSON result into a variable in a <script> tag. Then, when your Javascript application looks for your JSON, have it check to see if it exists in the document already. If it’s there, use it instead of making the extra request.
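
As a rough sketch of that idea (not Discourse’s actual code; preloaded_json is a hypothetical helper that returns the same JSON string the format.json branch above would render), the layout can embed the payload like this:

<script>
  <%# `preloaded_json` is a hypothetical helper returning the JSON payload. %>
  <%# In a real app, escape it (e.g. with ERB::Util.json_escape) so content %>
  <%# cannot close the script tag early. %>
  window.preloadedData = <%= raw preloaded_json %>;
</script>

When the Javascript application boots, it can check for window.preloadedData and use it instead of firing the XHR; clearing the variable after use keeps later route transitions from picking up stale data.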

This approach is much faster for initial loads! If you’re interested in how it’s implemented in Discourse, check out:


This is a companion discussion topic for the original entry at http://eviltrout.com/2013/06/19/adding-support-for-search-engines-to-your-javascript-applications.html

#2

Thanks for publishing your experience.

I'm still not completely sold on the "big ball of javascript" approach, but Ember looks impressive—and Discourse does too, by the way! Clearly, this approach is not as bad as it appeared before. We've had experience with spaghetti code that really turned into a nightmare, but it looks like Ember would have made that much more manageable.

For anyone starting an "ambitious web app" today, I wouldn't compare Ember with Backbone and Angular. I would suggest comparing it to some completely different approaches: http://derbyjs.com/ (on Node) and https://github.com/chrismccord... (on Rails). Both are very young projects, but both look really nice in completely different ways. Oh, and we're using Sync in production with great results.

The reason these other approaches appeal to me on a gut level is they present the same copy of the content to both humans and robots. This is DRY from the start, and it couldn't be construed as misleading or manipulative to the search engines. I guess the fatter client + API approach means your API is DRY. So, there's an interesting trade-off. Again, I think it's worth reflecting on these different paradigms side by side.

(Side note: I came into this conversation from a tweet that critiqued Github's pjax-based approach. So I'm starting out in a slightly contrary posture, as I think Github's UX comes out superb. Hopefully I'm not just adding heat with no light. I think there are multiple promising ways to approach the same problems, and you guys blazing a trail in this direction is great for all of us.)


#3

Re avoiding unnecessary round trips, you could also use Gon (https://github.com/gazay/gon) to initialize some JS objects that you can use.
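
For reference, a rough sketch of how that looks (check the gem's README for the exact API):

# In the controller action: anything assigned to gon is serialized for the client.
def show
  @user = User.find(params[:id])
  gon.user = @user.as_json
end

# The layout also needs the gem's include_gon helper in its <head>; the data is
# then available in Javascript as gon.user.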


#4

Being aware of keeping your application DRY, I can understand why some people feel this solution is sufficient. I feel that the best way to keep a SPA SEO friendly is to render out the DOM response on the server and then hijack anchor tags and form submissions, keeping your HTML5 History routing equivalent to your server side routing.

Furthermore, there should be a single templating language shared by your MVCs on client and server; there are examples of this with Mustache and Handlebars. The API should also make the same context it renders on the server available to your client side. I like to keep my JS data in a JSON object that is received after page load. I've seen some duplicated functionality between client and server when rendering out HTML, but for the most part the JS templates are specific to certain pieces of the DOM, while the server side templates utilize inheritance and other powerful features available through the framework.

I've chosen BackboneJS and Django, and extended Django's templating language by creating a custom template tag to render out Handlebars or Mustache, plus a custom template tag to bootstrap data on initial page load (or bootstrap it with AJAX after page load). Next step is to create a client side system that preloads assets by predicting what the user is going to do within the application state.


#5

"The second downside is that you have to modify your server to serve up your Javascript application regardless of what URL is requested. You need to do this because if you change the browser URL and the user refreshes their browser the server will look for a document at that path that doesn’t exist."

How did you do it in Rails?


#6

It is not a solution so much as a band-aid. Rendering the full content in a noscript tag defeats the whole purpose of making the first load lighter by "loading only the styles and scripts and the body tag (sort of)". Basically you're loading the full content page all the time, albeit not showing it, and then loading it a second time (behaviour which was patched, according to the One More Thing section).

Also, you end up rendering the content on the client side. Now, since you are using Rails and not NodeJS, I assume you are not using the same templates. I'd say you're using Handlebars on the client side; what are you using on the server side? Are you mixing ERB with Mustache/Poirot? Even if your resource templates are fairly simple, most of the projects out there are not.


#7

It doesn't render the full content. It only renders the stuff that needs to be indexable, such as the text. All the other chrome, such as buttons, is ignored.

Lowering the initial load is not the only intent. Lowering the number of requests makes a large difference in rendering time. Additionally, we're optimized for people who spend time browsing, not those who arrive and bounce off.

Try on Discourse! I think you'll be surprised how fast it is!


#8

Thanks for this article! I've been looking around a lot for a way to make my EmberJS application (Empress - https://github.com/hodgesmr/Em...) crawlable by Google. The methods you've described seem the most promising. One caveat on my end: I don't have control over any server-side code. The application is completely client-side, and simply loads Markdown files through AJAX calls (no RESTful API). The project is very much a hack, but I like to think it's a "cool" hack to host an EmberJS application on GitHub Pages. Any additional thoughts on what to do without having any server side to the application?


#10

Looks like Google is indexing Javascript created elements and such after all:

https://www.distilled.net/blog/seo/google-stop-playing-the-jig-is-still-up-guest-post/

So if you are only concerned about Google (and who isn’t concerned about Google?), then maybe you don’t need to worry too much about filling out those <noscript> tags. It’s probably still a good idea, but maybe it won’t have quite the same level of importance.


#11

We’ve noticed Google is indexing JS, but it’s not very sophisticated at this point. The noscript stuff gives much better results.


#12

Hi @eviltrout,

I followed your “noscript” method, and at first blush it seemed to be working fine, but then I noticed a pretty huge problem.

Basically, here is what is happening:

  1. Google fetches a page. The “title” tag is set via Ember, and my “noscript” section for this particular page is generated by my backend and contains an “h1” tag similar to the title tag. Google indexes the page title for this page properly.
  2. Since it appears Google is executing JavaScript now, Google “clicks” a JavaScript link, and attempts to index the next page.
  3. Now, here is where it gets problematic. Google attempts to index this page. It doesn’t use the new “title” tag for the title (as set by JavaScript), but still uses the OLD “noscript” section to pull out a title (likely from the “h1” tag). Since this “noscript” section is from the original page, the page title of this second page (as shown on Google Search) is obviously incorrect.

Any insights into this conundrum?

What I’m going to do now is disable the “noscript” tag generation on the server-side, and see if Google fixes the page titles on its next indexing run…


#13

We had the same problem. At some point after this article was written Google started crawling Javascript sites and it broke our noscript trick. I’d like to go back and explain how we fixed it, but the crux is:

  • We now sniff for crawlers. If we see a crawler, we serve the noscript version of the page.

  • We also support Google’s AJAX crawling API, where we include a meta tag that tells search engines to crawl again with an _escaped_fragment_ parameter. If that’s present we also serve up the noscript simple version of the page. This is a good fallback for crawlers we don’t correctly sniff but that support the AJAX crawling standard. Both checks are sketched below.
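
Roughly, those two checks amount to something like this (a simplified sketch with hypothetical names, not our actual code; the opt-in meta tag from Google’s spec is <meta name="fragment" content="!">):

# A sketch only -- the crawler pattern is illustrative and incomplete.
CRAWLER_USER_AGENTS = /Googlebot|bingbot|Slurp|DuckDuckBot|YandexBot/i

def serve_crawler_page?
  # Serve the plain <noscript>-style HTML when the user agent looks like a
  # crawler, or when the request carries the _escaped_fragment_ parameter.
  request.user_agent =~ CRAWLER_USER_AGENTS ||
    params.key?("_escaped_fragment_")
end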


#14

Ah, well I’m glad it’s not just me then!

Thanks for the tips. I assume the project in question where this is implemented is Discourse? I’m going to poke around the source shortly to see how you guys solved this issue, but given that it’s a pretty large project, any pointers about where to look would be greatly appreciated!

With regards to Google’s AJAX crawling API, would that still work for my case even though I use pushState? (Ember router location: “auto”, currently). It seems to me I’d have to use the “#!” URLs, which I am trying to avoid for all but the oldest browsers…

EDIT: I looked into how the crawler sniffing is happening in Discourse. Pretty straightforward – thanks!

Additional question: if we’re doing the sniffing and simply serving up the plain HTML content to crawlers, is there really any benefit to supporting Google’s AJAX crawling API?


#15

It’s good in case a web crawler supports that standard but is not whitelisted. For example, a large Russian one does, and we previously did not whitelist it. I also just feel that supporting the standard where available makes us a good citizen.


#16

Ah, fair enough. I wish there was a nicer API, in either case. The “escaped_fragment” thing seems really ugly. That’s web development for ya…