Google Taking Blog Comments Searching Real-Time?
Tuesday, January 22 2008 @ 02:05 PM EST
A reader sent me an article about something intriguing he noticed in Google results. He was wondering if any of you have noticed it too. I contacted Google to see what they might say, and here's what they told me: "We crawl the web continuously and schedule visits to each page intelligently to maximize freshness." Of course, to maximize intelligently, one has to experiment. In fact, Google has a blog post about their various experiments:
From time to time, we run live experiments on Google -- tests visible to a relatively few people -- to discover better ways to search. We do this because there’s no good substitute for understanding how real people, in real-world situations, actually operate. Theories are fine, but “improving the user experience” really happens best when we understand what people do online.
So to learn more, we sometimes randomly select a group of people to see a possible improvement to search options. Or we may select a group of people and try out a new element while they're searching. If you ever wonder why your Google site looks slightly different from that of the person sitting next to you, this is why.
He seems to have hit on one. So, take a look and see what you think.
*********************************
Google Taking Blog Comments Searching Real-Time?
~ by Bill Binko
I have come across some anecdotal evidence that something is
happening at Google that I thought Groklaw members would
appreciate. I have checked with other search engines (Yahoo!, MSN,
and Ask plus some smaller players) and this seems to be uniquely a Googlism.
It seems that Google has massively increased its capacity to perform
full-text searches for current, dynamic content. Specifically,
posts and public comments are now indexed and full-text-searchable
in minutes, not days, if the page is listed in the site's RSS or Atom
feed. Even smaller sites, such as small-town newspapers and niche
community sites, seem to be included in this new system.
Perhaps as interesting, these recent posts are often being given
top billing in the search results, showing that Google may be
shifting its bias away from "established" pages with a longer
history, and towards "current" information. This seems in line with
the FAQ on FeedBurner.com (which Google recently acquired):
Q. Why did Google acquire FeedBurner?
A. Google believes that feed-based content and advertising is a
developing space where we can add value for users, advertisers
and publishers. FeedBurner's technology and talented team are a
great addition to Google's current solutions for advertisers and
publishers.
Background
Over the past couple of weeks, I have noticed that my Google Alerts
were getting more and more rapid. That is, the time between the
content being made available on the net and the time Google notified
me of it via an Alert was dropping fast. This was even true when
the content was posted on smaller sites, such as a local
(small-town) newspaper site.
I assumed that Google was focusing on RSS/Atom feeds and that they
had just increased their crawl rate. However, I believe it's much
more than that.
A few days later, I was reading an article on Slashdot, and one of
the comments held this quote (which I really like):
"Nihilism means nothing to the dancing peasants."
It wasn't attributed to anyone (it's from Oscar Wilde), so I went to
look it up on Google. Because I wanted the entire quote, I
surrounded it with double-quotes, and got this result.
Notice that the *first* result is the Slashdot story. What was
fascinating was that at the time I searched, this story had only
been posted for about an *hour*, and the comment was about a half-hour
old! Also interesting is that the comment was absolutely *not* in
the RSS feed itself -- it was only on the page whose URL was listed
in the feed.
The other night, I started doing some investigating and found something
that seems amazing to me. Google seems to now be full-text indexing not
only RSS feeds but the entire contents of all of the pages listed
in the feed at a refresh rate of less than 2 hours, not just for
big RSS feeds like Slashdot, but for many small ones as well.
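To make "the pages listed in the feed" concrete, here is a minimal Python sketch of what a feed exposes: a link to each page plus a short summary, and nothing more. It is purely an illustration and says nothing about how Google's crawler actually works. The sample feed, its example.com URLs, and the function name `feed_items` are all made up for this sketch, and it assumes a plain RSS 2.0 feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, invented for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Forum</title>
    <item>
      <title>First post</title>
      <link>http://example.com/posts/1</link>
      <description>Only the opening paragraph appears here.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/posts/2</link>
      <description>Another short teaser.</description>
    </item>
  </channel>
</rss>"""


def feed_items(feed_xml):
    """Return (link, summary) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        summary = item.findtext("description", default="").strip()
        if link:  # skip items that do not point at a page
            items.append((link, summary))
    return items
```

Calling `feed_items(SAMPLE_FEED)` yields the two page URLs with their teasers. Anything on those pages beyond the teaser text (the rest of the story, the comments) can only have been found by fetching the page itself, which is what the tests described here try to detect.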
To verify this, I started with a site that I understood. TribalPages.com is a family tree site that
I have done some consulting for, and they have RSS feeds for their
two forums. The traffic on those forums is very low by any
standards, with 20-50 posts per day. Yet when I searched Google for
distinctive phrases from those posts, such as "about the search in
the forum", I found results in the main Google index less than an
hour after the posts were made. Additional tests showed that it
didn't really matter whether I included any meaningful keywords in
the text: the entire text was already searchable on Google.
My next thought was that Google was personalizing the results or
searching sites that I regularly visited (which would be interesting
enough). Several friends confirmed what I'd found, and they were
not regulars to TribalPages or TBO.com, another site I tested. However,
just to be sure, I ssh'd into a server that was just put
online in a colo site last Friday and used lynx to visit a
small-town newspaper site that I'd never visited
(bradenton.com). I found a unique phrase on a news story
less than a few hours old ("detectives also had found two burn
barrels") and searched Google for it. The story was the only
result returned. Again, the search phrase was not in the RSS feed.
Try it yourself
Here's a walk-through of how to test this for yourself. I'd be
interested to hear whether others can confirm this or have a better
explanation.
1. Pick a fairly low volume website that has articles with
comments. It must also have an RSS feed of the articles. For
this example, we'll choose the American Bar Association
Journal's Daily News page. (Yes,
I realize that isn't small, but I don't want to hammer some
small site from Groklaw).
2. Find a story that is between 1 and 12 hours old. Here, we'll
use this one.
3. Go to a section of the page that is not in the RSS feed.
Most feeds contain only the first paragraph or two, so any
text after that should be a good test. It's tempting to use
comments, and that often works, but some sites (like Groklaw)
do not allow Google to index pages with comments and others
use Javascript to display them. Pick out a unique phrase that
is unlikely to be found elsewhere, but is really unrelated to
the content. We'll use the phrase "pose significant issues
for employers concerning".
4. Search Google using double quotes around the phrase, and you
will see search results like this that point right back to your
article. When I ran that search, the results said it was posted
"2 hours ago".
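Steps 3 and 4 above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the sample texts and the helper names `pick_test_phrase` and `exact_phrase_url` are hypothetical, and the phrase is taken from the end of the page on the assumption that feeds usually carry only the opening paragraph or two. The query URL uses the ordinary `https://www.google.com/search?q=` form with the phrase wrapped in double quotes.

```python
import re
from urllib.parse import quote_plus

# Hypothetical sample data, invented for illustration only.
FEED_SUMMARY = "Short teaser here."
PAGE_TEXT = (
    "Short teaser here. The ruling may pose significant issues "
    "for employers concerning overtime pay"
)


def normalize(text):
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return re.sub(r"\s+", " ", text).strip().lower()


def pick_test_phrase(page_text, feed_summary, words=6):
    """Return a run of `words` consecutive words that is on the page but
    not in the feed summary, searching backwards from the end of the page.
    Returns None if no such run exists."""
    summary = normalize(feed_summary)
    tokens = page_text.split()
    for start in range(len(tokens) - words, -1, -1):
        phrase = " ".join(tokens[start:start + words])
        if normalize(phrase) not in summary:
            return phrase
    return None


def exact_phrase_url(phrase):
    """Build a Google query URL that searches for the phrase in double quotes."""
    return "https://www.google.com/search?q=" + quote_plus('"' + phrase + '"')
```

With the sample data, `pick_test_phrase(PAGE_TEXT, FEED_SUMMARY)` picks a phrase from the tail of the story, and `exact_phrase_url` turns it into a link you can open in a browser. Choosing the phrase by hand, as described in step 3, is still the better test, since a script cannot tell whether a phrase is truly unique or whether the site hides that part of the page from robots.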
Here are some sites that I've tested that now seem to have all of
the pages listed in their feeds fully indexed within 2 hours of
being posted.
- Large News Sites
  - http://slashdot.org
  - http://news.com
  - http://nytimes.com
- Legal Sites
  - http://law.com
  - http://abajournal.com
  - http://medicalfutility.blogspot.com
- Niche Sites
  - http://www.tribalpages.com (Genealogy)
  - http://www.scrapbookinggems.com/ (Scrapbooking)
  - http://coastalsurfing.com/ (Surfing)
- Small-town Newspapers
  - http://www.fredericksburgstandard.com/news/ (Fredericksburg, TX)
  - http://www.bradenton.com/local/ (Bradenton, FL)
  - http://www.cordeledispatch.com/local/ (Cordele, GA)
I couldn't find a single significant site that has news
articles posted with a valid RSS feed that didn't
seem to have all of its robot-visible content available for
full-text search on Google -- even those articles posted late in the
day. There are some borderline cases: for example, Google has
indexed PatentlyO's news text but not its comments (even on the
same page), and ESPN.com uses Javascript to display its comments, so
they are not searchable. But by and large, I've been amazed at the
breadth of this change and the fact that we haven't heard anything about it.
Implications
If my observations are correct, the implications of this are huge. Even given Google's history, this seems to push the boundaries of
what they are capable of to a new level: there is now no lag time
between posting a comment and the world finding it.
Researchers and reporters will undoubtedly find the new, timely
information invaluable.
It seems this could also magnify the power of "New Media", in that there is no
longer any time lag: as long as you are of a size that gets you on
Google's radar, and particularly if you are first with a story, you are in the game in a new way.
Perhaps I'm wrong
and misreading the tea leaves. If so, I'm sure a Groklaw member (or
two) will explain it to me! One way or another, something
interesting is going on.

Authored by: artp on Tuesday, January 22 2008 @ 02:20 PM EST
Summary of change in the title block, please.
---
Userfriendly on WGA server outage:
When you're chained to an oar you don't think you should go down when the galley
sinks ?

Authored by: artp on Tuesday, January 22 2008 @ 02:23 PM EST
Change the Title block.
Read the Instructions below the text entry box.
Change Mode to HTML if necessary.
HTMLify Web links, please.
Review recent article on using HTML in Groklaw Comments.
---
Userfriendly on WGA server outage:
When you're chained to an oar you don't think you should go down when the galley
sinks ?

Authored by: tiger99 on Tuesday, January 22 2008 @ 02:26 PM EST
Comments about items in the Groklaw newspicks can go here. Please put the title
of the Newspick item in your title, so we can see what your comment is about at
a glance.

Authored by: Anonymous on Tuesday, January 22 2008 @ 02:28 PM EST
What is this doing to the access times, etc. for regular traffic? If they are
indexing all web pages every few hours, is there a performance penalty that could
affect other Internet users in some way?

Authored by: joef on Tuesday, January 22 2008 @ 02:32 PM EST
As of 1928Z I searched for the string "lag time between posting a comment
and the world" (past paragraph of the article) and got no hit. I'll retry
periodically and see how soon it comes up. 1928Z is some 23 minutes after the
timestamp on PJ's article.

Authored by: cmc on Tuesday, January 22 2008 @ 03:03 PM EST
I personally wish Google would leave well enough alone. I've used a lot of
search engines through the years. My personal favorites were WebCrawler, then
MetaCrawler, and now Google. But Google used to be so much better than it is
now. Remember when you got meaningful results? Remember when the text you
searched for was actually on the pages returned in the results?
Nowadays, more often than not, the text I enter is not on the pages returned.
So why are the pages listed in the results? Because somewhere on the internet,
a page which linked to the page in the results contains the text I entered. How
on earth Google thinks that will help me is beyond comprehension. PageRank is
only useful for poisoning search results.
PageRank was the worst thing to ever happen to search results. And putting
more-current pages at the top of results is just as bad in my opinion. More
frequent does not mean more worthy.

Authored by: Anonymous on Tuesday, January 22 2008 @ 03:04 PM EST
Does that mean that we have to wait until after 4 pm before this very thread
supplants the Slashdot article?

Authored by: Holocene Epoch on Tuesday, January 22 2008 @ 03:08 PM EST
That would explain why my personal site's photo gallery was top of the Google
search when I changed the software and had "random Photography" as the
title [since back to the original software].
I also unlocked a Domain Name so that I could change registrars, would this be
the same tech / similar tech to that which sent me an email to say that the
status of the Domain Name had changed??
Oh, Seattle could use some global warming if anyone has some extra, finally
reached freezing.

Authored by: jeevesbond on Tuesday, January 22 2008 @ 04:03 PM EST
I've seen this also. On Slashdot a few weeks ago a user posted: 'just Google
for "xyz"'. Someone else came back an hour later complaining that the only result
for 'xyz' was the user's comment about Googling for it. :)
PJ: have you looked at Google Webmaster Tools? In there, under
Tools -> Set crawl rate, you can see graphs of what Google is downloading
from your site per day. What's really interesting, in relation to this
story, is the 'Your current speed' section. For me the 'Faster' speed
setting is greyed out, but maybe for sites like Slashdot and Groklaw it isn't?
I'd be intrigued to find out whether you can access that 'faster' setting. :)

Authored by: Stevieboy on Tuesday, January 22 2008 @ 04:12 PM EST
I personally find software that tries to predict what you might be wanting to do
or that gives you 'helpful information' telling you what you're doing wrong or
what you should do next extremely irritating and productivity reducing.
Windows is the biggest (but not the only) culprit repeatedly telling you to
update this or that over and over again - to the guys at Microsoft and any other
software producer, if I've clicked the message once it means I've taken it on
board and don't want to see the same message again!! Ever!!!!!
Give me predictable software rather than predictive. I just want software that
does what it says on the tin - no less and no more - and doesn't try to second
guess me.
Or have I misunderstood what Google is trying to do?

Authored by: Anonymous on Tuesday, January 22 2008 @ 04:27 PM EST
Think of those that would try and flood a site with comments - for some
reason I can't think of the silly term at the moment. Basically, simulate a
grass-roots movement.
Now they can go one better, flood Google so their
responses hit in the top sections. Potentially useful to those that wish to do
research but it's also useful to those who wish to prevent research.
RAS

Authored by: cwbinko on Tuesday, January 22 2008 @ 04:56 PM EST
Google results for the unique phrase
"We crawl the web continuously and schedule
visits"
are now resolving to this page.
Seems like Groklaw is at
worst being indexed at 3 hour intervals (sorry I wasn't testing
earlier).
BTW: Great feedback from the community - keep it coming.
- Bill

Authored by: ssavitzky on Tuesday, January 22 2008 @ 08:30 PM EST
Some sites, Livejournal for example, have a "live feed" of new
articles that Google undoubtedly taps into. Basically they just connect to a
socket, and drink from the firehose of new articles. Wouldn't surprise me if
Slashdot has one too. And of course a lot of blogs are on Blogger, which they
own.
---
Never anger a bard, for they are not subtle and people remember funny songs.

Authored by: DodgeRules on Tuesday, January 22 2008 @ 09:35 PM EST
... and used lynx to visit a small-town newspaper site that I'd
never visited (bradenton.com).
<sarcasm>
Well I don't know if I should be happy to see my local paper listed in a GL
posting or be insulted.
</sarcasm>

Authored by: Anonymous on Tuesday, January 22 2008 @ 09:49 PM EST
Bill Binko:
I'm wondering about the system you did this testing from.
1. What browser did you use? (Probably not important, I'm just curious.)
2. Is the Google toolbar installed in this browser?
3. Does this browser accept (and preserve) Google's persistent cookies?

Authored by: SoundChsr on Wednesday, January 23 2008 @ 12:53 AM EST
I noticed this tonight myself. I re-posted a couple sections of my gOS review
on linuxtweakers.org (my new site - notice no clickie - not trying to
advertise). When I went over to Google to verify that the site was getting
indexed, the search results showed that the articles I had just posted were
indexed in under 20 minutes. Astounding!
// George