10 Comments

It is not easy to separate the relative importance of user data and innovation in Google's success. As Hal Varian, Google's chief economist, once commented, ‘every action a user performs is considered a signal to be analyzed and fed back into the system.’ It is the innovative use of such user signals that has improved search results. Saying that Google gained its advantage from a larger sample of user data is questionable. It is quite possible that similar innovations could be achieved with the smaller collections of user data available to other search engines such as DuckDuckGo.

LLMs such as BERT are trained on scraped data but refined later with user data (e.g., by analyzing feedback on the results produced).

What is more pertinent is the line of questioning examining the default access to Google Search on user devices. Why would the average user change the default search engine (Google) if it does the job? Even if switching search engines is easy, why bother?


“Saying that Google gained advantage from a larger sample size of user data is questionable.”

Larger sample sizes ALWAYS provide better data. It may be a diminishing advantage as sample size increases. 100 is much better than 10. 1,000,100 is only incrementally better than 1,000,000, but is still better.
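To put a number on that diminishing advantage, here's a minimal sketch (illustrative only, not anything Google actually runs): the standard error of an estimated proportion shrinks like 1/√n, so a larger sample always helps, but each additional order of magnitude helps less.

```python
import math

def std_error(p: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1.0 - p) / n)

p = 0.5  # worst case: the proportion with the widest error
for n in (10, 100, 1_000, 1_000_000, 1_000_100):
    print(f"n={n:>9,}: standard error = {std_error(p, n):.6f}")
```

Going from 10 to 100 samples cuts the error by more than a factor of three; going from 1,000,000 to 1,000,100 improves it by a sliver. Still better, just barely.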


Hey Christopher

When Larry Page, Google's co-founder, came up with PageRank, the signal that ordered search results in a logical way, he had no user data; Google didn't exist yet. It was just a good idea that worked, and it ended up revolutionizing search on the web. You might be surprised at how much published scientific research is compiled from very small sample sizes.

Let me give you a relevant example that might help.

Let's say we want to improve search results for people searching for 'cute puppies' online. If you take a random sample of 1,000 user searches for 'cute puppies', it may show that 78% of users click on the image tab to look at pictures. From this we may conclude that it is a good idea to show images on the initial page of results whenever anyone searches for 'cute puppies' in future. Machine learning looks for patterns like this. With a larger sample of user data, we may get a more precise picture of the distribution of search behavior; you might learn that in fact it is more like 75.6425% of users who want to see images when they search for 'cute puppies'. But this doesn't change the initial insight that it is a good idea to display images from the get-go. If you run searches for 'cute puppies' on all of the popular search engines, what do you see? Who has the advantage?
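A minimal simulation of that point (every number here is hypothetical, matching the made-up figures above): whether we sample 1,000 users or 1,000,000 from a population where ~75.6% click the image tab, the estimate sharpens, but the product decision, show images up front when a clear majority wants them, stays the same.

```python
import random

random.seed(42)  # reproducible illustration

TRUE_RATE = 0.756425  # hypothetical share of users who click the image tab

def estimate_image_click_rate(n_users: int) -> float:
    """Simulate n_users searches and return the observed image-click rate."""
    clicks = sum(random.random() < TRUE_RATE for _ in range(n_users))
    return clicks / n_users

for n in (1_000, 1_000_000):
    rate = estimate_image_click_rate(n)
    decision = "show images first" if rate > 0.5 else "show links first"
    print(f"n={n:>9,}: estimated rate = {rate:.4f} -> {decision}")
```

The million-user sample pins down more decimal places, but both samples lead to the same design choice.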

Sep 24, 2023·edited Sep 24, 2023

Hey, Boodey.

Not sure what you’re disagreeing with in what I wrote. I didn’t say small sample sizes couldn’t be used to provide insight. And I acknowledged the diminishing size of the benefits as sample size continues to increase.

But nearly from the beginning Google has had a large amount of data, whether it’s the behaviours of puppy enthusiasts or any other group you wish to look at. (People who clicked on an ad for skis, people using the Firefox browser, those with slow connections, those who ...)

Past a certain point they probably don’t get any significant additional info that changes or adds noticeably to what they’ve deduced/learned about a particular group. But the sheer mass of data they’ve had does provide them info on a larger number of groups than competitors have.

Adding a few significant digits to what they know about puppy fans is not meaningful (although being able to add significant digits does support what I said about larger samples always adding more info). What is meaningful is that they have info on more than just the puppy-fans group.


Sure, I get your points, but Google's argument here is: we are number one because we are the best, we are the most innovative, and that's why we have more data and more users. Let's watch how this unfolds and see if the DOJ can counter this.


Sure, that’s what they say. But they clearly have a vested interest in that being true, so they’re not likely to say otherwise regardless of what may actually be true.

My personal opinion is that, beyond having a very clean homepage compared to the other sites (Yahoo, Altavista, Ask Jeeves) in the early days, a key part of their advantage comes from the amount of data they have.

Intuit’s Quicken was in a similar situation. They were something like the 26th personal accounting software on the market, but what gave them an early boost was that they were the first to actually put a picture of a check on the screen, something familiar to people who’d been balancing their checkbooks manually. And once that edge in their UI helped them grab market share, they leveraged the size of their user base to do deals that then helped them grow that user base faster than their competitors. That they were better gave them an advantageous toehold. Then that they were larger became a major component of their continued growth.

So let’s watch how this unfolds.


Let's not forget little Neeva! Who were they? If you know, congratulations; I'll bet you miss them too. If you don't know who they were, I have only one comment: "Exactly." And that's the point.


Hmmm, I find it very difficult to believe that training data doesn't include (or wasn't shaped by) scraped search data.

Sep 25, 2023·edited Sep 25, 2023

Regarding AI, Google, and the search engine: I was happy to see the AI responses in search; at first, they were often more pertinent to what I sought. Lately, however, they can't seem to recommend anything not listed on the first page, no matter how emphatically I state *inexpensive*. There might be a point to argue there!

As far as scraping goes, I'm not sure which LLM is being used to answer in-search queries. Is it Bard or a small front-end LLM? I don't know. Depending on which model they're talking about, they can technically refer to it and probably say, "No, we didn't use search scrapes to train it." If, on the other hand, they're talking about the LLM used to respond to in-search queries... that's complicated!

Before I could guess, I'd have to know how the LLM is being used. The original LLM response isn't needed for training. However, with an LLM's response to a user's request for more information, they could use both the user's *request* and the *LLM's answer* in training! If so, then in principle it is a scrape. Using the LLM's first generated response for training doesn't make sense, because there's nothing to train on; it is already a trained response, and the only information in it comes from the already-scraped question. It is the follow-up questions that would be used for training. The question is: is that considered a scrape, or an LLM query and response, and therefore training data to be evaluated under the rules?

I know all LLM providers say they don't use any personal data. Still, you don't have to be an AI expert to understand that any response to an AI, or request for additional data, is going to contain *some* form of personal information, even if it's only a request for more details on a subject, a question about a product to buy, or a question about something the user is working on, interested in, or planning to do. That sounds personal to me.

Ask me how removing personal data from a writing project is impossible. I'm working on a piece of fiction. Do I use AI to write it? Please clarify the use of the word "use"! The data is personal if I ask even a simple question, whether it's grammar-related or a question about the book or something I wrote. The truth is, I have to work hard to create analogies that remove the explicit idea conveyed in my questions. Still, even *that* idea is personal. Frankly, concepts can be inferred even if the context is unrelated. It sucks! Since I've been writing, there has been a preponderance of coincidence in the fictional medium lately. Is that a coincidence, or is there nothing new under the sun? Either way, it's a punch in the gut!
