Harry Potter Book Summary - Information Retrieval Tool

Dataset Description

This dataset is a collection of chapter summaries from the Harry Potter series by J.K. Rowling, concentrating on the first three of the seven books: Harry Potter and the Philosopher's Stone, Harry Potter and the Chamber of Secrets, and Harry Potter and the Prisoner of Azkaban. It consisted of 57 documents, compiled at the beginning of our project, when we worked on the query and calculated the tf-idf scores and cosine similarity. We later extended the dataset with chapter summaries from the fourth and fifth books so that we would have enough documents to run a kmeans() clustering algorithm successfully as part of our algorithm, which was put together by Lino Virgen and Patrice Fote.


1. Pre-processing

a) Tokenization

b) Lemmatization

c) Stemming
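
The three pre-processing steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual code: the `LEMMAS` lookup table and the suffix list are hypothetical stand-ins for a real lemmatizer dictionary and a real stemming algorithm such as Porter's, and the sample summary sentence is made up.

```python
import re

# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
LEMMAS = {"wizards": "wizard", "went": "go", "was": "be", "children": "child"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lemmatize(token):
    """Map a token to its dictionary form if known, else return it unchanged."""
    return LEMMAS.get(token, token)


def stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Made-up sentence used only to exercise the functions.
summary = "Harry was sorted into Gryffindor and the wizards cheered."
tokens = tokenize(summary)
lemmas = [lemmatize(t) for t in tokens]
stems = [stem(t) for t in tokens]
```

Lemmatization and stemming are shown side by side on the same tokens because they are alternative ways of normalizing a word: the lemmatizer maps "was" to "be" through the dictionary, while the stemmer only trims suffixes and leaves "was" untouched.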


2. Query


3. Term Frequency and Inverse Document Frequency


4. Tf-Idf
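
As a rough sketch of how a tf-idf score can be computed, the snippet below multiplies a relative term frequency by an unsmoothed inverse document frequency, log(N / df). The exact weighting variant used in the project is not stated here, so this formula is an assumption, and the three tiny token lists are made-up stand-ins for the chapter summaries.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(docs):
    """log(N / df) per term, where df counts the documents containing the term."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(docs):
    """One {term: weight} dictionary per document."""
    idf = inverse_document_frequency(docs)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(doc).items()}
        for doc in docs
    ]


# Made-up, already pre-processed documents for illustration only.
docs = [
    ["harry", "finds", "the", "stone"],
    ["harry", "opens", "the", "chamber"],
    ["sirius", "escapes", "azkaban"],
]
weights = tf_idf(docs)
```

With this variant, a term that appears in every document gets weight zero, and a term unique to one document (such as "stone" here) outweighs one shared by several (such as "harry").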


5. Visualization


6. Tf-Idf Vector Space


7. Cosine Similarity
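
To illustrate how cosine similarity can rank documents against a query, here is a minimal sketch that treats each tf-idf vector as a sparse `{term: weight}` dictionary. The function names `cosine_similarity` and `rank` and the weights below are hypothetical and chosen only for the example; they are not taken from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (document index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up weights for illustration only.
doc_vecs = [
    {"harry": 0.1, "stone": 0.3},
    {"harry": 0.1, "chamber": 0.3},
    {"sirius": 0.4, "azkaban": 0.4},
]
query_vec = {"chamber": 1.0}
ranking = rank(query_vec, doc_vecs)
```

Because the score depends only on the angle between the vectors, a long summary and a short one that use the same terms in the same proportions score identically against a query.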

a) Visualization