Much has been made of the human ability to recognize patterns, to see shapes and trends in complex scenes. We can pick out familiar faces in crowds, clouds that look like dragons, trends in data. Just as notable is the human ability to filter and ignore data; filtering is essential to processing anything new at all.
Data overload is arguably not a new problem. Consider a World Cup spectator in a crowded stadium. A pre-literate tribe member in a dense rainforest. Charles Darwin exploring new ecosystems during the voyage of the Beagle. Sherlock Holmes at a crime scene. Human neurology and cognition enable these observers to filter potentially paralyzing volumes of data, focus on relevant information, and detect patterns.
The Sherlock Holmes example is different, and not just because he is a fictional character (supposedly). Dr. Watson might not agree that a crime scene is rich with observable data. But to Sherlock Holmes it always is, because he is an exceptionally skilled detective, highly trained in esoteric arts such as the classification of tobacco ash and natural fibers. So data overload can depend on the tools of observation used -- or misused. A modern example: a handful of cheap cameras can easily generate terabytes of mostly useless data per day (e.g. surveillance video), all of which can be stored and must then be sifted from more valuable data. We can agree that Hubble telescope imagery is more precious than surveillance footage, yet still question the costs versus the benefits of the data volumes produced by our most prized and advanced observational tools.
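That camera figure is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes continuous streaming at typical bitrates; the numbers are illustrative assumptions, not measurements of any particular camera.

```python
# Rough sanity check of the "terabytes per day" claim.
# Bitrates below are assumed typical values, not measurements.

SECONDS_PER_DAY = 24 * 60 * 60

def daily_terabytes(num_cameras, bitrate_mbps):
    """Daily storage in terabytes for cameras streaming continuously."""
    bytes_per_day = num_cameras * bitrate_mbps * 1e6 / 8 * SECONDS_PER_DAY
    return bytes_per_day / 1e12

# A few high-resolution cameras, or a modest building's worth of cheap ones,
# clears a terabyte per day.
print(daily_terabytes(num_cameras=4, bitrate_mbps=25))   # ~1.08 TB/day
print(daily_terabytes(num_cameras=20, bitrate_mbps=5))   # ~1.08 TB/day
```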
If a crime scene included a corpse lying prone with a dagger in the back, I think Sherlock Holmes would refrain from tobacco ash analysis and focus on the dagger, at least initially. Since he is not called to easy cases, the dagger may eventually prove to be a false lead and the tobacco ash may deserve attention after all. But Holmes, like any rational observer, will prioritize -- will rank the evidence. If Holmes were unable to focus, filter, and rank, the bloody dagger would blend into the thousands of observable coat fibers and flecks of tobacco ash. Which brings us to Google.
The original Google PageRank algorithm was a game changer, enabling navigation of an expanding WWW and propelling the success of Google. Crawling and caching the web was difficult, but it had been done (e.g. AltaVista). Google’s key innovation was to replicate, in some sense, the essential human need to focus, filter, and rank. Yet despite the genius behind it, Google is fundamentally a crude, blunt observational tool. The top hits in a Google search can only be as good as the query, which until now has essentially been a free-text search. And the PageRank algorithm depends on page links to implement a kind of crowdsourced, popularity-contest scoring, vulnerable to manipulation and misinformation. We can do better than Google, and often do, but only for realms smaller than “all the World’s information”.
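To make the popularity contest concrete, here is a minimal sketch of PageRank-style scoring by power iteration over a toy link graph. The graph, damping factor, and iteration count are illustrative; the real algorithm runs over billions of pages with refinements not shown here.

```python
# Minimal PageRank-style power iteration over a hypothetical toy link graph.

DAMPING = 0.85      # standard damping factor from the original paper
ITERATIONS = 50     # plenty for a graph this small to converge

# adjacency: page -> list of pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],     # D links to C, but nothing links to D
}

pages = list(links)
n = len(pages)
rank = {p: 1.0 / n for p in pages}   # start from a uniform distribution

for _ in range(ITERATIONS):
    new_rank = {p: (1.0 - DAMPING) / n for p in pages}
    for page, outlinks in links.items():
        if not outlinks:             # dangling page: spread its rank evenly
            for p in pages:
                new_rank[p] += DAMPING * rank[page] / n
        else:
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += DAMPING * share
    rank = new_rank

# Pages with more (and better-ranked) inbound links float to the top --
# the crowdsourced popularity contest described above.
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

Notice that nothing in the scoring looks at what a page actually says, which is exactly why link farms and other manipulation can distort the results.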
Finally, some consideration must be given to math and statistics. Recognizing faces in crowds and patterns in data means gambling with some reasonable probabilities of success and failure. Too many false matches incur costs; too much caution and we miss opportunities -- the so-called opportunity costs. Some evolutionary biologists suggest that Homo sapiens is inclined toward false positives, perhaps to maximize food gathering and predator avoidance. However, humans are not cognitively equipped to assess probabilities when the data volumes or probabilities involved lie far outside everyday experience -- consider the perceived dangers of automotive travel vs. aircraft travel, shark attack, alien abduction, etc. In any case, in designing data analysis algorithms, wise use of statistics is imperative.
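The false-positive bias has a simple cost-based rationale. The sketch below uses made-up costs and probabilities (every value is an illustrative assumption) to show that when a missed threat is far costlier than a false alarm, the lowest-expected-cost policy tolerates a great many false alarms.

```python
# Illustrative asymmetric-cost model of the false-positive bias.
# All constants are made-up values, not empirical measurements.

P_PREDATOR = 0.01           # chance a given rustle is a real threat
COST_FALSE_ALARM = 1.0      # fleeing a harmless rustle: cheap
COST_MISSED_THREAT = 500.0  # ignoring a real predator: catastrophic

def expected_cost(threshold):
    """Expected cost per observation for a crude detector that ignores any
    rustle whose suspicion score falls below `threshold`
    (0 = always flee, 1 = never flee)."""
    p_false_alarm = (1.0 - threshold) * (1.0 - P_PREDATOR)
    p_missed_threat = threshold * P_PREDATOR
    return (p_false_alarm * COST_FALSE_ALARM
            + p_missed_threat * COST_MISSED_THREAT)

# Sweep thresholds: the cheapest setting sits far toward "flee often",
# i.e. the optimal detector accepts frequent false positives.
best_cost, best_threshold = min(
    (expected_cost(t / 100), t / 100) for t in range(101))
print(f"lowest expected cost {best_cost:.3f} at threshold {best_threshold:.2f}")
```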