There’s been much recent conjecture on whether book sales can be predicted by text analysis alone. My company, Kadaxis, has dedicated the past few years to machine learning research and product development for the publishing industry. In our early days, we set out to build an algorithm to predict bestsellers, and tested it in the wild. In this post, I’ll share my perspectives on why the text alone isn’t enough.
If You Publish It, Will They Come?
To predict book sales, you need to account for the factors that influence book sales. The text of a book is core to the product, but many other factors, such as sales and marketing, influence whether a customer will discover and buy it. An algorithm predicting book sales using only the text as input will only work in a book market meritocracy, where the best-written books always sell the most copies.
Author platform (brand awareness) is one such non-text factor that influences sales, as in the following examples:
– The Cuckoo’s Calling hits the top of Amazon’s bestseller list only after Robert Galbraith is revealed to be J.K. Rowling.
– Amy Schumer’s memoir, The Girl with the Lower Back Tattoo, hits the New York Times bestseller list in its first week—an improbable feat without her strong personal brand.
– Dave Eggers publishes multiple books, receives several award nominations and appears on television numerous times before releasing the bestseller The Circle.
Even amongst well-discovered books, the relationship between reader satisfaction and sales volume can be tenuous. Consider Harry Potter and the Cursed Child and Go Set a Watchman: each has sold millions of copies, yet they hold Amazon star ratings of just 3.6 and 3.4, respectively, below what typically signals satisfied readers.
Many other factors might also influence book sales, such as the editorial process, cover design, marketing budget, seasonal trends and book metadata. To make an accurate prediction, a machine, just like a human, needs to weigh how each of these factors affects sales.
Machine Reading
Assume for a moment that a linear relationship exists between reader satisfaction, discoverability and sales (i.e., the best-written books are found most often and sell the most copies). In this author’s utopia, we can reliably predict sales volume directly from a book’s text, as long as we can measure what’s important to readers. As products go, books are nuanced and complex, and the reasons they resonate with us are equally complex (compared to, say, a toothbrush). How do we uniformly distill the unique traits of a book into data?
This is, of course, where machine learning helps us. One approach, which is also the method used by the authors of the much-talked-about The Bestseller Code, is topic analysis (or latent Dirichlet allocation). This technique allows us to define a book in terms of how much of a topic it contains, such as “Homicide – 8.7 percent.”
If you’d like to see the data a topic model creates, you can view an example from our systems here (or upload your own book for analysis at authorcheckpoint.com). Topic modeling gives us a good snapshot of the content of a book, and allows us to make apples-to-apples comparisons between them. It is also useful data to use as input to training a predictive algorithm.
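To make this concrete, here is a minimal sketch of topic analysis using scikit-learn’s LDA implementation. It is illustrative only, not our production pipeline, and the “books” are invented snippets; it simply shows how each document is reduced to a vector of topic proportions:

```python
# A minimal topic-modeling sketch using scikit-learn's LDA.
# Illustrative only: the "books" are invented snippets, not real manuscripts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

books = [
    "the detective examined the body and questioned the suspect",
    "she kissed him under the stars and promised to return",
    "the starship crew charted a course through the asteroid field",
]

# Convert raw text to word counts, then fit a two-topic model.
counts = CountVectorizer(stop_words="english").fit_transform(books)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row is a topic-proportion vector

for title, mix in zip(["crime", "romance", "sci-fi"], doc_topics):
    print(title, [f"{p:.1%}" for p in mix])
```

In practice each document would be a full manuscript and the model would learn hundreds or thousands of topics, but the output has the same shape: how much of each topic does this book contain?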
The Curse of Dimensionality
Our machine reader might define thousands of topics for each book we’re analyzing. While more data points might seem like a good thing, the more we add, the more books we need to read in order to make reliable predictions. If, for example, we had 2,500 different data points about a book, we’d likely need several tens of thousands of books to be confident our algorithm is accurate. Even 20,000 books (the data set used in The Bestseller Code) is likely far too few books, and puts us at risk of the curse of dimensionality.
(A quick tech sidebar: even with cross-validation we’re still likely overfitting our data, and a hold-out set is no guarantee against this, especially when classifying on heavily unbalanced classes such as “bestsellers.”)
Too many data points, and not enough books, means our algorithm will probably find patterns that say whatever we want them to say. The patterns exist in the data, but they aren’t representative of the real cause of what we’re trying to predict. In the world of black-box trading systems, this phenomenon is well known.
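A toy experiment makes the risk concrete. In the sketch below (a contrived setup, not our data), both the “topics” and the “bestseller” labels are pure random noise, yet a standard classifier still looks deceptively accurate:

```python
# Curse-of-dimensionality demo: far more features than books, random labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_books, n_features = 200, 2500
X = rng.normal(size=(n_books, n_features))  # random "topic" data
y = np.zeros(n_books, dtype=int)
y[:10] = 1  # only 10 "bestsellers" out of 200, assigned arbitrarily

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print("training accuracy:", model.score(X, y))  # ~1.0: the model memorized noise
print("cross-validated accuracy:", cross_val_score(model, X, y).mean())
# CV accuracy also looks strong, mostly because predicting "not a bestseller"
# for nearly everything matches the 95% base rate of the unbalanced classes.
```

Neither number tells us anything about real predictive power: the features were noise by construction.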
So is there value in analyzing the intrinsic qualities of books such an algorithm might identify as selling well? It might be an interesting exercise, and the similarities the algorithm finds might make sense to a human observer. But you couldn’t reliably conclude that those similarities were the reason the books sold well. As a contrived example, we might conclude that books with a red cover, more than 250 pages and a dog rather than a cat will sell more copies than books without those traits.
There is, of course, a simple way to prove the efficacy of any predictive model, and that is to apply it to new, unseen books before publication.
Predicting What’s Important
Even with access to enough books in our author’s utopia, we of course need a reliable metric to measure. Bestseller lists are a weak proxy for actual sales volume for many reasons, not least because they reflect “fast sellers”: a book that makes a list may sell fewer copies over time than a book that doesn’t.
But rather than searching for a magic formula to help move more copies of a book, a more valuable and attainable goal is to solve for reader satisfaction. By tying together data about the content of a book with data capturing a reader’s reaction to it (beyond tracking where they stopped reading), we can begin to understand the true impact a book has on a particular audience, and why. Armed with this insight, we can better match books to readers (recommendation systems) and books to markets, as sketched below.
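As a rough sketch of what that matching can look like, assuming we already have topic-proportion vectors like those above (the titles and numbers here are invented), a simple content-based recommender ranks unread books by how closely their topic mix matches the books a reader loved:

```python
# A hedged content-based recommendation sketch over topic-proportion vectors.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two topic vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical topic mixes: [homicide, romance, space travel]
catalog = {
    "Thriller A": np.array([0.80, 0.05, 0.15]),
    "Romance B": np.array([0.05, 0.90, 0.05]),
    "Sci-fi C": np.array([0.10, 0.10, 0.80]),
}

# Average the topic mixes of books the reader loved into a taste profile.
loved = ["Thriller A"]
profile = np.mean([catalog[t] for t in loved], axis=0)

# Rank everything the reader hasn't read by similarity to that profile.
ranked = sorted(
    (t for t in catalog if t not in loved),
    key=lambda t: cosine(profile, catalog[t]),
    reverse=True,
)
print(ranked)  # books most similar to the reader's taste come first
```

A real system would also fold in the reader-reaction data described above, but the core move, comparing content vectors rather than guessing at sales, is the same.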
This article originally appeared on the DBW blog on September 28, 2016.