Preamble

Flash back nearly a decade and you’ll find me toiling away in a filthy basement lab (custodians typically wouldn’t go into the labs for fear of getting blamed for something going wrong), working on an algorithm for my doctoral thesis. Identifying exotic particles (e.g., magnetic monopoles, Q-balls, strangelets) in cosmic ray datasets is not exactly what you’d call the most employable pursuit. However, it was definitely more useful than SJW grievance studies and more interesting than working as a glorified proofreader for other people’s code like some of my friends, and I wasn’t paying for it, so what the Hell? Everyone knows the real reason you get into physics is for the pussy anyway (hahahahaha, oh I almost made it through typing that without LOLing).

So here I am cannibalizing, er, standing on the shoulders of giants, using previous theoretical mathematical work on Bayesian predictive inference. Mathematics like this had been around for decades; mine was just a novel application of it, and it formed the basis of my thesis work. I was creating an algorithm that used simulated training data, and a Bayesian comparison between said training data and real data, to try to identify compositional limits on particles theorized to exist but never observed (the aforementioned MMs, strangelets, Q-balls, etc.). While certainly fun to talk about at parties and a real panty peeler (more LOL), the thought that I’d use any of this stuff in the real world seemed remote. I had already ruled out pursuing a career in academia, so I figured I’d just go become a code monkey like my friends. Little did I know that I was inadvertently making myself eminently employable in a field that has become the new “hot thing” in tech.

A Rose By Any Other Name is Just as Confusing

At the time, this field was limited to academia and a few tech companies that were using it to claw their way to the top (see: Google, Facebook, Amazon, et al.). It didn’t even have a name other than just “statistics” or “data analytics”: boring, pedestrian things that only the pocket protector squad cared about. Glamorous Silicon Valley VCs would never get on board with such dull nonsense. So, being the innovators that they are, techies rebranded this field “data science” employing “artificial intelligence” and “machine learning”. I personally have issues with all these monikers. “Data science” is just meaningless (in spite of that being my job title), and “artificial intelligence” and “machine learning” both suffer from the same problem: they imply that a computer is learning in the same fashion as a human brain. My preferred moniker is “predictive analytics”, since I think it captures reality better and doesn’t inflate what the algorithm is doing into some kind of mind reading and/or Skynet AI.

So what exactly is it? Well, the short explanation is that any predictive algorithm takes parametric data inputs and builds a statistical model that will predict the outcome of future iterations within some uncertainty. Essentially, you start with a set of “training data” with known outcomes; the algorithm processes that data to build a model of how each parameter affects the outcome. You then feed the algorithm a set of test data whose outcomes are also known but withheld. It applies the model to all the parameters, makes a prediction, then looks at the known outcome and scores whether the prediction was correct, a false positive or a false negative. If the algorithm passes some human-defined threshold, it starts making predictions on real-world data, all the while refining its model to get better as it processes more data. This real-time refinement is where the “learning” and “artificial intelligence” stuff comes in. To an external observer, it looks like the computer is learning and adapting, which in a way it is, but only in a narrowly defined, brute-force, iterative way within specific parameters. It has none of the heuristic properties of human intelligence. Perhaps someday we’ll unlock the secrets of the human mind and be able to simulate true intelligence, but I see that as a long way off.
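To make the train, test, score and threshold loop concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data is randomly generated, the "model" is a deliberately simple nearest-average classifier (not what any of the companies above actually run), and the 0.85 threshold is an arbitrary stand-in for the "human-defined threshold".

```python
import random

def train(rows):
    """Build a 'model': the average of each parameter for each known outcome."""
    sums, counts = {}, {}
    for features, outcome in rows:
        acc = sums.setdefault(outcome, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[outcome] = counts.get(outcome, 0) + 1
    return {o: [s / counts[o] for s in acc] for o, acc in sums.items()}

def predict(model, features):
    """Predict the outcome whose average parameters are closest to the input."""
    def distance(center):
        return sum((f - c) ** 2 for f, c in zip(features, center))
    return min(model, key=lambda outcome: distance(model[outcome]))

def score(model, rows):
    """Compare each prediction against the known (True/False) outcome."""
    tally = {"correct": 0, "false_positive": 0, "false_negative": 0}
    for features, outcome in rows:
        guess = predict(model, features)
        if guess == outcome:
            tally["correct"] += 1
        elif guess:   # predicted True, actually False
            tally["false_positive"] += 1
        else:         # predicted False, actually True
            tally["false_negative"] += 1
    return tally

def make_rows(n, rng):
    """Made-up data: two parameters that tend to run higher when the outcome is True."""
    rows = []
    for _ in range(n):
        outcome = rng.random() < 0.5
        center = 2.0 if outcome else 0.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], outcome))
    return rows

rng = random.Random(0)
training, testing = make_rows(500, rng), make_rows(200, rng)

model = train(training)
tally = score(model, testing)
accuracy = tally["correct"] / len(testing)

THRESHOLD = 0.85  # arbitrary, human-defined
print(tally, "-> deploy" if accuracy >= THRESHOLD else "-> keep training")
```

The part that matters is the separation: the model only ever sees the training rows, and it gets judged on rows it has never seen.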

How It Makes Your Life Better

As stated, this kind of analysis has been used in mathematical and academic settings for a long time, but the first exposure I ever had to it in the real world was a fun little quiz called the Gender Test at www.thespark.com (to early internet denizens, this was kind of a forerunner to places like College Humor, Ebaum’s World and finally the Glib-approved favorite, The Chive). This test asked a series of seemingly irrelevant questions such as “Which word is more gross, used or moist?” and showed pictures of two different cartoon monkeys, asking “Which one will win?” After 50 or so of these kinds of questions, the quiz would predict whether you were male or female and ask if it got it right. This was long before the misgendering insanity, so it was a binary choice. Each time it got it right, it increased the relative weights of the preceding answers toward that gender. Each time it was wrong, it reduced the weights. The very first time someone took the test, the prediction was pure chance. But after a couple hundred thousand iterations, the relative gender weighting on the questions got pretty good and the algorithm could predict male or female almost all the time. In this case, the answers to the questions were the parameters and the gender was the predicted variable. While it may seem simple minded, this basic paradigm is what drives most of our modern computational conveniences.
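The weighting scheme described above can be sketched in a few lines of Python. This is a guess at the general idea, not the actual code behind the Gender Test: the class name, the question keys, the answers and the step size are all hypothetical. Each (question, answer) pair carries one weight. Nudging it toward the confirmed gender reinforces weights that pointed the right way and reduces ones that pointed the wrong way.

```python
from collections import defaultdict

class QuizPredictor:
    """Hypothetical sketch: one weight per (question, answer) pair."""

    def __init__(self, step=1.0):
        # weight > 0 leans "male", weight < 0 leans "female"; everything starts at 0
        self.weights = defaultdict(float)
        self.step = step

    def predict(self, answers):
        total = sum(self.weights[(q, a)] for q, a in answers.items())
        # With no history the total is 0, so the very first guess is arbitrary
        return "male" if total >= 0 else "female"

    def learn(self, answers, actual):
        # Nudge every answer the user gave toward the gender they confirmed
        direction = 1.0 if actual == "male" else -1.0
        for q, a in answers.items():
            self.weights[(q, a)] += direction * self.step

# Made-up quiz history: (answers given, gender the user confirmed afterwards)
history = [
    ({"grosser_word": "moist", "monkey": "left"}, "male"),
    ({"grosser_word": "used", "monkey": "right"}, "female"),
    ({"grosser_word": "moist", "monkey": "right"}, "male"),
]

quiz = QuizPredictor()
for answers, actual in history:
    quiz.learn(answers, actual)

print(quiz.predict({"grosser_word": "used", "monkey": "right"}))
```

Three quiz takers is obviously a joke of a training set. Run the same loop a couple hundred thousand times and the weights start to mean something.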

Every time you search something in Google, that’s a set of parameters used to refine its model. It gets better and better at searching. Each time you “like” something on Facebook or click a link in Twitter or look at a job posting on LinkedIn, their models refine and get a little bit better. Each time you ask Siri something, she gets a little better at understanding you (remember when you first unboxed your new iPhone and Siri asked you to say a few things at startup? There’s your training data).

Of course the most important innovation is in the industry that is always the tip of the technological spear: porn. This goes way beyond dumbly suggesting videos tagged “big tits” after you’ve searched for big tits. EVERYTHING you do is a parametric data point. Among the videos you watch, are the tits real or fake? How big are they exactly? Is this lesbian, one on one hetero, threesome, group or something more exotic? What parts of the scene do you linger on? Go even further and perhaps there’s eye tracking technology (tape over your webcam people). What part of the tits do you look at the longest? In what sequence do you look at them? Is there a type of nipple you gaze at longer? Can the nipples themselves be broken down into parametric data for classification? The possibilities are endless. In this way, the porn site “learns” not only what your revealed preferences are, but it also can use data from other users with similar preferences to suggest things that you yourself might not even know you like. Like big tits? Might we suggest these ebony strap-on compilations for you?
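Suggesting things based on "other users with similar preferences" is usually called collaborative filtering. Here is a minimal sketch, assuming each user is just a dictionary of minutes watched per category. The category names, the numbers and the choice of cosine similarity are all made up for illustration; whether any particular site does it this way is an assumption.

```python
from math import sqrt

def cosine(u, v):
    """Similarity between two users' viewing profiles (0 = nothing in common)."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def recommend(target, others, top_n=1):
    """Rank categories the target has never watched, weighted by how
    similar each other user is to the target."""
    scores = {}
    for other in others:
        similarity = cosine(target, other)
        for category, minutes in other.items():
            if category not in target:
                scores[category] = scores.get(category, 0.0) + similarity * minutes
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical viewing data: minutes watched per category
me = {"category_a": 30, "category_b": 10}
others = [
    {"category_a": 25, "category_b": 12, "category_c": 40},  # tastes close to mine
    {"category_b": 2, "category_d": 50},                     # barely overlaps
]

print(recommend(me, others))
```

The first user looks a lot like me and also watches category_c, so that gets suggested, even though I have never searched for it.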

There are of course more pedestrian applications, like what I’m working on professionally now. We have biopsy slides that have been pre-tagged by experienced pathologists as cancerous or non-cancerous. The algorithm does pixel-by-pixel image analysis to classify features that indicate cancer or not. The hope is that eventually the algorithm will get good enough that it can identify cancer on its own, even in stages too early for a human to see. It’s not nearly as cool as porn, but a guy’s gotta eat, right?

How It Ruins Your Life

Coolness factor aside, this way of doing things can quickly cross over from nifty to creepy. Target famously has an algorithm that not only tracks what you buy, but will automatically latch onto your smartphone and track your movements in the store. The most amazing (read: creepy) application of this is its ability, through lots of training and refinement, to tell the gender of the customer, her approximate age, whether she is pregnant and her approximate due date before she herself even knows she’s pregnant. All this is possible from millions of data points on known pregnant women (going from buying prenatal vitamins to stretch mark cream to, eventually, diapers and formula) and their purchases and movements around the store leading up to the birth. The more times this happens, the better the algorithm gets.
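How do you get from shopping baskets to "probably pregnant"? One textbook way is a naive Bayes classifier, sketched below. To be clear about the assumptions: nothing here is Target's actual method, the six baskets are invented, and the function names are hypothetical. It only shows how labeled purchase histories can be turned into a probability.

```python
from math import exp, log

def fit(baskets):
    """baskets: list of (set_of_items, is_pregnant). Counts how often each
    item shows up for each label. Needs at least one example of each label."""
    items = set().union(*(basket for basket, _ in baskets))
    n = {True: 0, False: 0}
    have = {True: {i: 0 for i in items}, False: {i: 0 for i in items}}
    for basket, label in baskets:
        n[label] += 1
        for item in basket:
            have[label][item] += 1
    return items, n, have

def probability_pregnant(model, basket):
    """Naive Bayes: start from the prior odds, then let every item
    (bought or conspicuously not bought) shift the odds."""
    items, n, have = model
    log_odds = log(n[True] / n[False])
    for item in items:
        # Laplace-smoothed estimate of P(item bought | label)
        p_yes = (have[True][item] + 1) / (n[True] + 2)
        p_no = (have[False][item] + 1) / (n[False] + 2)
        if item in basket:
            log_odds += log(p_yes / p_no)
        else:
            log_odds += log((1 - p_yes) / (1 - p_no))
    return 1 / (1 + exp(-log_odds))

# Invented training data: purchases from shoppers whose status is already known
known = [
    ({"prenatal vitamins", "stretch mark cream"}, True),
    ({"prenatal vitamins", "lotion"}, True),
    ({"stretch mark cream"}, True),
    ({"beer", "chips"}, False),
    ({"lotion", "chips"}, False),
    ({"beer"}, False),
]

model = fit(known)
print(round(probability_pregnant(model, {"prenatal vitamins"}), 3))
print(round(probability_pregnant(model, {"beer", "chips"}), 3))
```

With six baskets this is a toy. With millions of them, plus timestamps and in-store movement, the same arithmetic gets uncomfortably sharp.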

One might be tempted to actually put this in the “how it improves your life” column. After all, Target can offer you discounts on things it knows you’ll need and make your life more convenient in the process. However, it doesn’t take much imagination to see how this can morph into something very sinister, very quickly.

Creepy when a private company does it, this becomes nefarious when a government does it. Even worse is when government gets in bed with private companies to start profiling you based on your data. Buying a lot of fertilizer? Maybe you’re making a bomb. Let’s look at literally every parameter that comprises your life for the past decade to see (at a 95% confidence level) if you’re a terrorist. G-d help us if we ever get to a point in which this kind of shit is accepted in a court of law. We would literally have a Minority Report Pre-Crime situation on our hands.

Every single thing you do, seemingly significant or not, is a parametric data point that can be fed into an ML algorithm to extract features, classify them and make predictions about you. Not just what toothpaste you use, but how long and how often you brush. Do you start from the molars or the incisors? Do you gargle your mouthwash? What are your favorite sexual positions? How loud are your orgasms? Do you own a tabby or a tuxedo cat? Do you typically move your bowels in the morning or the evening? Do you configure your toilet paper over or under? People like to think that this kind of data collection is limited to conscious decisions like the products they buy or the places they go, but that is barely scratching the surface. Emotions, unconscious behaviors, pointless or useless decisions of daily life; these things are the treasure trove that gives insight into your essence. The eyes are not the window to the soul, Big Data is. The only way to escape it is to forsake all modern technology, retreat to the woods and live as if it’s the 18th century (behavior which itself, by the way, offers a ton of data about you).

Now of course all of this can be used for good or ill. In all seriousness, a change in bowel habits could indicate a health problem. But let’s not be naive about the true nature of how these technologies are/will be used. To those who crave power and long to rule us, these developments are a gift from Heaven (or, more likely, Hell). These analytical techniques, so seemingly innocuous when Thomas Bayes pioneered them more than 250 (!) years ago, have opened a can of worms that could enslave the human race in ways Big Brother could only dream of. If Bayes could see what’s happening now he might echo Oppenheimer: “Now I am become Death, the destroyer of worlds.”

Unfortunately, I don’t hold out a lot of hope for the future. Constitutional protections have proven toothless, people stupidly *volunteer* massive amounts of data and the data that they don’t volunteer gets vacuumed up by an ever more intrusive State. The campus #metoo squad is just the advanced scouting group checking out how fortified the “innocent until proven guilty” doctrine is; a trial balloon for the destruction of due process.

Working in the field I do only makes me more pessimistic because I see how powerful this is firsthand. My advice: well, I don’t really have any, aside from the aforementioned retreat into the woods. Other than that, all you can do is continue to support causes that shore up data privacy protections and defend against 4th Amendment violations. That’s at least a finger in the dike (not finger in the dyke, you perverts).

But, hey, at least PornHub’s suggested viewing is spot on, right?