Sunday, October 25, 2015

Shakespeare vs Computer Science: "Can a Computer Write a Sonnet?"

Following up on one of my latest book reviews, I dug a little deeper into the theme of stylometry. Several chapters in Edmondson and Wells's book deal with this topic in terms of proving Shakespeare's authorship of his plays. Very interesting stuff to follow up on.

It was recently in the news that computer analysis had been used to determine that the play “Double Falsehood”, published by Lewis Theobald in 1728, is possibly the work of William Shakespeare in conjunction with John Fletcher. Computer programs were used to analyze the writings of all three men, and as a result some people now think this is a lost play by Shakespeare. It is still quite controversial; I am just offering it up as an example of computers being used in conjunction with literary analysis.
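Attribution studies of this kind typically rest on function-word frequencies. Here is a minimal sketch of one classic measure, Burrows' Delta, on toy data; the word list, the texts, and the function names are all illustrative, not the actual Double Falsehood study (real studies use hundreds of function words and long texts):

```python
from collections import Counter
import statistics

# a handful of common function words; real studies use hundreds
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "is"]

def profile(text):
    """Relative frequency of each function word in a text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def burrows_delta(disputed_text, candidate_texts):
    """Burrows' Delta: mean absolute difference of z-scored
    function-word frequencies; a lower score means the disputed
    text is stylistically closer to that candidate."""
    profiles = {name: profile(t) for name, t in candidate_texts.items()}
    n = len(FUNCTION_WORDS)
    means = [statistics.mean(p[i] for p in profiles.values()) for i in range(n)]
    # guard against zero spread when a word appears in no candidate
    stdevs = [statistics.stdev(p[i] for p in profiles.values()) or 1e-9
              for i in range(n)]

    def z(p):
        return [(p[i] - means[i]) / stdevs[i] for i in range(n)]

    dz = z(profile(disputed_text))
    return {name: sum(abs(a - b) for a, b in zip(dz, z(p))) / n
            for name, p in profiles.items()}

candidates = {
    "Author A": "the cat sat on the mat and the dog slept",
    "Author B": "cats and dogs and birds and fish run fast",
}
disputed = "the dog sat on the mat and the cat slept"
print(burrows_delta(disputed, candidates))
```

On this toy data the disputed line shares Author A's function-word habits exactly, so its delta against A comes out near zero while the delta against B does not — which is the shape of the argument made for Shakespeare's and Fletcher's hands in “Double Falsehood”.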

Personally, I think computer programs are only as good as the people who write them, meaning that they may inadvertently contain flaws or biases. An additional layer of human error enters the mix in how the data is interpreted. Then there is the question: is there such a thing as a fixed and rigid interpretation? I have been doing my own personal literary analysis for many years, and when I revisit something like a play by Shakespeare I find that my views and interpretations have changed over the years. However, I wouldn't totally discard Data Science: it produces interesting facts and features, which might serve to reinforce our initial human reactions to what we are reading.

Having said that, let’s delve into it some more.

Performing a textual analysis on a Shakespeare text nevertheless has some interesting points, namely our ability to improve on the techniques we use in Data Science. Is there a worthier subject than devising a machine learning algorithm that would enable us to pinpoint what makes Shakespeare? When I say "pinpoint" I'm thinking in mathematical terms.

Everyone will have a distinct opinion on what makes Shakespeare the greatest playwright of the English language. My (Shakespeare) Nirvana would be some kind of enlightenment coming from the field of Computer Science, telling us that certain "traits" are what differentiate Shakespeare from the rest of the pack (e.g., Ben Jonson and Thomas Middleton).

Shakespeare's flair is one of a kind, granted, but is it possible to name instances where we can say for sure why Jonson and Middleton did not capture people's imaginations the way Shakespeare did? Again, I'm not talking about individual opinions; I have my own on what makes Shakespeare. I'm more interested in identifying data (e.g., patterns) that would have the weight of science behind it. Might these "patterns" be identified and supported through the use of word selection and frequency?

I've just run a statistical analysis on "Much Ado About Nothing" in order to identify all of the atomic components of the text, and this is what came out:
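A toy sketch of the kind of counting involved, using tiny hand-made word lists rather than a real part-of-speech tagger (NLTK or spaCy would be the proper tools for the full play; the line quoted below is from the play, but everything else here is illustrative):

```python
from collections import Counter

# tiny hand-made category lists for illustration; a real analysis
# would run a part-of-speech tagger (e.g. NLTK or spaCy) over the
# full play text
PREPOSITIONS = {"of", "to", "in", "on", "into", "with", "by",
                "for", "at", "from"}
PRONOUNS = {"i", "thou", "thee", "thy", "he", "she", "it", "its",
            "we", "you", "they", "mine"}

def category_counts(text):
    """Rough count of how many tokens fall into each category."""
    tokens = [w.strip(".,;:!?'\"").lower() for w in text.split()]
    counts = Counter()
    for t in tokens:
        if t in PREPOSITIONS:
            counts["preposition"] += 1
        elif t in PRONOUNS:
            counts["pronoun"] += 1
        else:
            counts["other"] += 1
    return counts

line = "I pray thee cease thy counsel, which falls into mine ears"
print(category_counts(line))
# → Counter({'other': 6, 'pronoun': 4, 'preposition': 1})
```

Scaled up to the whole play, counts like these are what sit behind the frequency observations below.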

Surprisingly (or not), the number of prepositions is not very high, maybe due to the fact that in Elizabethan times their use was not as widespread as it is today (e.g., the pronoun "its" is seldom used by Shakespeare). It would be interesting, just for analysis' sake, to compare the works of Jonson, Marlowe, and Middleton, to name three of the icons writing at around the same time.

Another analysis I did was with the Google Books Ngram Viewer, selecting three words that Shakespeare is said to have coined. I chose the first three words that appeared on the list: academe, accused, and addicted. I selected a start date of 1600, and this is what I came up with. It is interesting that the first large spike for the word "accused" occurs around the time of the English Civil War. I also reran the data with a start date earlier than 1600, and the word "accused" may actually predate Shakespeare's writing! So one does tend to wonder where the data came from.
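For anyone who wants to rerun this, the Ngram Viewer can be queried programmatically. A minimal sketch that just builds the query URL — note that the JSON endpoint is unofficial and undocumented, and the corpus identifier below is an assumption that may need adjusting:

```python
from urllib.parse import urlencode

def ngram_query_url(words, start=1600, end=2008,
                    corpus="en-2019", smoothing=3):
    """Build a query URL for the Google Books Ngram Viewer's JSON
    endpoint. The endpoint and the corpus id are unofficial and
    undocumented, so both may change."""
    params = {
        "content": ",".join(words),
        "year_start": start,
        "year_end": end,
        "corpus": corpus,
        "smoothing": smoothing,
    }
    return "https://books.google.com/ngrams/json?" + urlencode(params)

url = ngram_query_url(["academe", "accused", "addicted"])
print(url)
```

Fetching that URL (e.g. with `urllib.request`) returns per-year frequency series for each word, which is the data behind the Viewer's charts.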

It seems that the number of words Shakespeare likely coined has been exaggerated, for a couple of reasons. Not all that much survives from Shakespeare's day and before, so he's one of the few places to look for any words in usage at the time. Of the works available, Shakespeare's are far and away the most famous and a 'go to' source. According to one article I read, the originators of the OED had a tendency to stop if they found something in Shakespeare and attribute the first usage to him. Many words attributed to him have since been found in earlier works, but often the attribution to Shakespeare hasn't been changed. It has never made sense to me that Shakespeare could have had SO many new words in individual plays: he wasn't writing cutting-edge, pretentious plays; he was writing plays that average people would go to, and if every fifth word was 'new', the plays would have been practically unintelligible to their initial audience. Shakespeare certainly invented some words, and skewed the meaning of others by using them in intelligible ways they hadn't been used in before. I'm sure he came up with even more new phrases which have become common currency and which could have been understood on first hearing, but he didn't 'invent' nearly as many words as some claim.

Nevertheless I completely agree about the priority of the text. Even when I use computational analyses, I find they should be coupled with close readings of the texts. The promise I see with computer analysis is that it can point out large-scale patterns that may not have been visible with close reading alone, particularly when you have a large corpus that you're working with -- it would be difficult to compare 1,000 early modern plays with only close reading (and the reason we might want to look at 1,000 early modern plays is to better characterize early modern literature and the individual texts therein). There is a lot of talk recently among digital humanists about how computers can help us access what Margaret Cohen calls the "great unread" of literature, which would include texts that are left out of traditional canons and that we can't feasibly close read because there are so many.

There is so much to gain from Shakespeare. Why would we limit our gains by restricting the methods with which we can derive meaning? Watching a performance, making a close reading, performing algorithmic analysis--all these methods can work hand-in-hand.

Shakespeare reflects life and life is a glorious muddle of comedy, tragedy, romance and problem plays! And let's not forget history. Indeed, how can we even say that Shakespeare writes history plays? They are not accurate enough to be used as a history source, but they are wonderfully rich dramas.

I'm not sure whether a machine would be able to write Shakespeare-like literature. But let's lower the bar: what about a sonnet of average quality? Could a machine write "something" that we'd consider to have some quality? (I'm not going into the debate of what I mean by quality.)
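To make that concrete, here is a minimal sketch of one naive approach — a word-level Markov chain trained on a couple of lines of verse. The training snippet and function names are illustrative; real verse generators add constraints for meter and rhyme on top of this kind of model:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it
    in the training text."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate_line(chain, start, length=8, seed=None):
    """Random-walk the chain to emit a line of verse."""
    rng = random.Random(seed)
    line = [start]
    for _ in range(length - 1):
        followers = chain.get(line[-1])
        if not followers:
            break
        line.append(rng.choice(followers))
    return " ".join(line)

training = ("shall i compare thee to a summers day "
            "thou art more lovely and more temperate")
chain = build_chain(training)
print(generate_line(chain, "shall"))
# → shall i compare thee to a summers day
```

With only two training lines the walk mostly regurgitates the source; trained on all 154 sonnets it starts producing novel (if metrically shaky) lines, which is roughly the family of tricks behind machine-written sonnets.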

Let's do a little test. 

Would you say the following sonnet was written by a human or by a machine?
(This is another form of Turing Test.)

"Whose shade in dreams doth wake the sleeping morn,
The daytime shadow of my love betrayed
Lends hideous night to dreaming’s faded form;
Were painted frowns to gild mere false rebuff,
Then shouldst my heart be patient as the sands,
For nature’s smile is ornament enough
When thy gold lips unloose their drooping bands.
As clouds occlude the globe’s enshrouded fears,
Which can by no astronomy be assail’d,
Thus thine appearance, tears in atmospheres,
No fond perceptions, nor no gaze unveils.
Disperse the clouds which banish light from thee,
For no tears be true until we truly see."

(Later on I'll post the provenance of the abovementioned sonnet...)
