Finding sound in books, or my data science project
ANOUK DYUSSEMBAYEVA | SEPTEMBER, 1 / 2020
Photo by Min An from Pexels
For the past few months, I've been brainstorming (or trying to brainstorm) an idea for a data science project that would be about music. Although there are many articles and blog posts about people doing lyrics analysis, predicting whether a song will make it to the top charts, and even what characteristics a song should have to be a hit, it wasn't quite appealing to me — I wanted to do something different.

At the time, I thought about ways to find correlations where people didn't expect them. That was when I embarked on my first, and very short, journey: analyzing the emodiversity of songs. I had discovered a research paper about emodiversity and how it affects people's mental health, and the idea of trying to calculate it in music seemed too interesting to miss. However, I soon realized that doing that for songs would be quite tricky, and even subjective.

This was when I dove into the idea of audiobooks and audio drama. Being my curious self, while writing last week's article, I listened to a couple of podcast dramas to really hear how music is woven into the story. What I noticed was that there wasn't a lot of sound: sure, there was the narrator, but the music mostly existed at the beginning and the end, with other sound effects scattered sparsely across the timeline.

I knew that books had much more sound in them… Or did they? What if I tried to count how much sound a certain novel has? What if I could automate that process? What if I could find correlations between the time period a story was written in and the amount of sound it has? What if I could add actual sound to the text in real time by using the APIs of soundbanks?
This became the baseline for my project. Grabbing Fahrenheit 451 off my bookshelf, I started underlining the words that could "produce" a sound. Reading through 53 pages took a whole evening, and I quickly learned that I would be better off doing the analysis with my data science skills. My knowledge was rusty after almost a year without Jupyter Notebook, so it took another day just to brush up on my toolbox.

I dedicated an evening to searching for lists of words that describe sounds, all the way from onomatopoeia, or words that imitate a sound (for example "splash"), to words for actions that produce sound ("bark"), to words that literally mean sound ("music").

I wish I could have automated this process, but I constructed the list manually, which took two days. Because these words came from a number of websites that each have their own formatting (some use bullet points for words while others put each word in a separate box), I had to copy and paste, copy and paste, clean, delete bullet points, clean, add commas, and create a separate document. After that, I inserted the words into Jupyter Notebook and spent the next three hours adding quotation marks and commas to every single word, since a Python list literal needs each string quoted and separated by commas.
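In hindsight, the quoting step can probably be scripted. Here is a minimal sketch, assuming the words were pasted into the notebook as one raw block of text with bullets, dashes, commas, or line breaks between them. The names `parse_word_list` and `raw_words` are hypothetical and do not come from the notebook:

```python
import re

def parse_word_list(raw_words):
    """Turn a pasted block of text into a list of lowercase words or phrases."""
    # Split on line breaks, commas, and common bullet characters
    pieces = re.split(r"[\n,•*]+", raw_words)
    # Trim surrounding whitespace and leading dashes from each piece
    cleaned = [piece.strip(" \t-").lower() for piece in pieces]
    # Drop the empty strings left over from the split
    return [piece for piece in cleaned if piece]

# Hypothetical pasted text, mixing the formats found on different websites
raw_words = """
• splash
• whip crack
- bark, music
"""

sound_words_example = parse_word_list(raw_words)
print(sound_words_example)
```

Because only the edges of each piece are trimmed, multi-word entries such as "whip crack" and hyphenated words stay intact.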

From there, I was able to combine the three lists and count how many words there were.

I also knew that there would probably be some repetitions, so I looked for the words that appeared more than once across the lists. I then deleted the duplicates, leaving 642 words in the list.

boombox_words = ['crackle', 'crash', 'bodyfall', 'boing', 'boom', 'buzz', 'chomp', 'click', 'creak', 'flutter', 'glug', 'groan', 'honk', 'ahoogah', 'jingle', 'neigh', 'poof', 'pop', 'puff', 'rattle', 'ribbit', 'quack', 'rustle', 'rumble', 'scream', 'screech', 'skid', 'slurp', 'splash', 'splat', 'splatter', 'squawk', 'squeak', 'squish', 'swish', 'swoosh', 'thunk', 'twang', 'whip crack', 'whoosh', 'woof', 'yelp', 'zap'] 

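print(type(boombox_words))  # <class 'list'>

# sound_words and soundwords are the other two lists, defined earlier in the notebook
words = boombox_words + sound_words + soundwords
# dict.fromkeys removes duplicates while keeping the original order
word = list(dict.fromkeys(words))
len(word)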
The code in my Jupyter Notebook
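The notebook code above removes the duplicates but does not show which words were repeated. One way to list them is with `collections.Counter`. This is only a sketch: the three short lists below are hypothetical stand-ins for the real ones, which are much longer.

```python
from collections import Counter

# Hypothetical stand-ins for the three source lists in the notebook
list_a = ["boom", "crash", "splash"]
list_b = ["bark", "crash", "music"]
list_c = ["music", "hum", "boom"]

combined = list_a + list_b + list_c

# Words that occur more than once across the combined list
counts = Counter(combined)
duplicates = sorted(w for w, n in counts.items() if n > 1)

# dict.fromkeys keeps the first occurrence of each word, in order
unique_words = list(dict.fromkeys(combined))

print(duplicates)
print(len(combined), len(unique_words))
```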
The second step was to find a PDF version of Fahrenheit 451, which I mistakenly thought would be easy. Not only did I download seven versions before finding one that would actually return text when I ran the program in Jupyter, but it also took me two hours to find a library that would open the PDF.

The first six versions either produced random symbols instead of text, couldn't be read at all, or only returned the title. You should have heard my excitement when I finally found the right version, because I was frustrated to the point of almost giving up. My happiness knew no bounds until I realized that the extractText() command could only extract one page at a time.

After googling a plethora of combinations in hopes of finding out how to get the whole text from the PDF and not just a particular page, I stumbled upon code that used the same command, but in a loop, which was perfect… except I now had to figure out how to compare the output to the list of sound words.

As I was about to give up yet again, Google pointed me towards an article that also implemented a loop, though a slightly different one, and incorporated a tokenizer. The tokenizer let me get rid of obstacles such as the random symbols that had made it impossible to compare words.

# Uses the legacy PyPDF2 API (versions before 3.0): PdfFileReader, getPage, extractText
import PyPDF2
Frn451 = open('Fahrenheit 451.pdf', 'rb') 
Fahrenheit451 = PyPDF2.PdfFileReader(Frn451) 
Fahrenheit451.numPages 
page1 = Fahrenheit451.getPage(0) 
page1.extractText() #the first PDF couldn't be read

FRN451 = open('F451.pdf', 'rb') 
F451 = PyPDF2.PdfFileReader(FRN451) 
pg1 = F451.getPage(0) 
pg1.extractText() 

pdf = open('F451.pdf', 'rb')

#create a loop
F451 = PyPDF2.PdfFileReader(pdf) 

numOfPages = F451.getNumPages()

for i in range(0, numOfPages):
    print("Page Number: " + str(i))
    print("- - - - - - - - - - - - - - - - - - - -")
    pageObj = F451.getPage(i)
    print(pageObj.extractText())
    print("- - - - - - - - - - - - - - - - - - - -")

pdf.close()  # the output looked relatively normal, but couldn't be compared with the list of sound words

#the loop that finally worked
F451 = PyPDF2.PdfFileReader(FRN451)

num_pages = F451.numPages
count = 0
text = ""

while count < num_pages:
    pageObj = F451.getPage(count)
    count +=1
    text += pageObj.extractText()

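# Fall back to OCR only if PyPDF2 returned no text. The original snippet passed
# an undefined "fileurl"; the file name below assumes the same PDF is meant.
if text == "":
    import textract
    # textract.process returns bytes, so decode it into a string
    text = textract.process('F451.pdf', method='tesseract', language='eng').decode('utf-8')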
When I was finally able to see which words in the book were sound words, it turned out there were 123 of them. This number looked minuscule to me, and for good reason: the set intersection I used counts each word only once, even if that word is repeated several times throughout the book…

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

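# word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first
tokens = word_tokenize(text)

# Drop punctuation and line-break tokens before comparing
punctuations = ['(',')',';',':','[',']',',', '.', '\n\nby', '\n\n', '\n']
keywords = [token for token in tokens if token not in punctuations]

# Number of distinct sound words that appear in the book
len(set(keywords) & set(word))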
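The set intersection above counts each distinct sound word once. To count every occurrence instead, one option is a `collections.Counter` over the tokens. The sketch below is an illustration, not the notebook's code: it swaps NLTK for a simple regular-expression tokenizer so that it runs on its own, and `count_sound_words`, `sample_text`, and `sample_words` are hypothetical names and data.

```python
import re
from collections import Counter

def count_sound_words(text, sound_words):
    """Count every occurrence of each single-word entry in sound_words."""
    # Lowercase so that "Crash" and "crash" are treated as the same word
    tokens = re.findall(r"[a-z]+", text.lower())
    vocabulary = {w.lower() for w in sound_words}
    return Counter(t for t in tokens if t in vocabulary)

# Hypothetical sample data standing in for the book text and the word list
sample_text = "A crash. Then a hum, another CRASH, and a distant hum of music."
sample_words = ["crash", "hum", "music", "whip crack"]

occurrences = count_sound_words(sample_text, sample_words)
print(occurrences)
# Total occurrences versus number of distinct sound words found
print(sum(occurrences.values()), len(occurrences))
```

Two limitations are worth keeping in mind: multi-word entries such as "whip crack" never match a single token, and inflected forms such as "crashed" are not counted unless they are in the list.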