Chapter 3
Project Plan
3.1 Procedures and Deliverables
This project was divided into a number of procedures to achieve the stated aim and
requirements. To keep track of progress, a blog was created to record all the information
taken from the papers used in the background reading chapter, the steps carried out to
define the features, and the design and implementation of the classifier.
Understanding the problem and the requirements was the starting point, after which a
suitable schedule was planned. The implementation involved a number of Java
programs and reports to deliver. The procedure is explained in the following steps:
1. Gain a better understanding of the problem and the possible ways of implementing a
solution.
2. Understand the natural language processing algorithms that help in text classification,
as described in the background research.
3. Design a system that defines a possible classification of the Arabic Quran. The system
should classify Quran verses into two classes: interesting and not interesting.
4. The system should be able to classify another Arabic data set of the same format into
interesting and non-interesting.
5. Classification is done based on predefined features that will characterise the interesting
verses.
6. This project used a supervised learning algorithm, which requires training data. The
training data was provided for the Holy Quran data set and was created for the hadith
data set. A number of text files were used later in the main implementation steps to
train and test the system.
7. Define the features that will classify the verses, then train the system on the interesting
and non-interesting text using WEKA.
8. Implement a Java program that creates the complement set used as training data
(Deliverables: Java program, complement set text file, random subset, WEKA .arff file).
9. Write up a mid-project report that includes the introduction, background research, and
the planned schedule (Deliverables: mid-term report).
10. Carry out the further work specified (Deliverables: classification of verses).
11. A demonstration of the accomplished work was organised and presented to the
assessor and supervisor.
12. Evaluation of the system implemented.
13. Write-up of the final report. (Deliverables: The final report).
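The complement set and random subset named in step 8 could be produced along the following
lines. This is a minimal sketch and not the program delivered in the project. It assumes that
the complement set consists of every line of the full data set that does not appear in the
provided set of interesting verses, and that each verse occupies one line. The class name
ComplementSetSketch and the methods complement and randomSubset are hypothetical names chosen
for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not the project's actual program.
final class ComplementSetSketch {

    // Returns every line of the full data set that is NOT in the interesting set,
    // keeping the original order. Assumption: one verse per line, compared after trimming.
    static List<String> complement(List<String> allVerses, List<String> interesting) {
        Set<String> marked = new HashSet<>();
        for (String verse : interesting) {
            marked.add(verse.trim());
        }
        List<String> rest = new ArrayList<>();
        for (String verse : allVerses) {
            if (!marked.contains(verse.trim())) {
                rest.add(verse);
            }
        }
        return rest;
    }

    // Picks up to `size` lines at random. A fixed seed makes the selection repeatable,
    // so the same subset can be regenerated for later experiments.
    static List<String> randomSubset(List<String> lines, int size, long seed) {
        List<String> shuffled = new ArrayList<>(lines);
        Collections.shuffle(shuffled, new Random(seed));
        return new ArrayList<>(shuffled.subList(0, Math.min(size, shuffled.size())));
    }
}
```

In use, the two lists would be read from the data set text files (for example with
java.nio.file.Files.readAllLines) and the results written back out as the complement set text
file and the random subset file.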
3.2 Schedule
At the beginning of the project, a suitable schedule was planned. The schedule was
organised to fit all requirements into the time available. Some adjustments proved
necessary despite the effort made to keep to time. One adjustment was made to include a
presentation that had not been planned for previously. In addition, the background reading
took more time than expected, for two reasons. The first was the need to obtain recent papers,
published in 2000 or later. The second was the need to find papers related specifically to
this project, since text classification is a broad area of study. It was also necessary to
obtain other texts, besides the Holy Quran, that contain the concept of the hereafter, in
order to test the performance of the classifier on different texts. One such data set is
hadith: Sahih Muslim was used to create a data set with a format similar to that of the Holy
Quran data set. Designing this data set took some time, since it had to be created manually
because a hadith corpus was not available on the internet. This data set was then used to
test the classification on this text and to test whether the selected features were
appropriate for this classification.
3.3 Methodology
The classification in this project was carried out in a number of steps to achieve the
requirements and evaluate the language model that was built. To accomplish this, two Java
programs were designed and run on the English data set of the Holy Quran. However, changes were
made later to achieve the same results on different Arabic data sets, including the Arabic version
of the Holy Quran. The changes were needed because the classification did not perform well when
it was implemented on the English data set of the Holy Quran, and it was obvious that it would not
perform well on the Arabic data set of the Holy Quran either. One example of the changes is a
single Java program that returns the files required for classification. The final Java program
extracts features from the random subset that was selected from the complement text file,
counts the frequencies of these features line by line, and outputs an .arff file. In addition, the
lines in the .arff file were labelled (Yes or No) in reference to existence. Another change was
the use of additional Arabic text files, on which the classification was also performed. The
changes are explained in full in the design and implementation part. Based on the classification that was
implemented, combining it with another classification is an additional option for
trying to increase the accuracy of the results. According to [ CITATION Atw1 \l 2057 ], Meccan
chapters give more emphasis to end-of-days topics. This can be used as an extra feature, in which
a verse that was classified as interesting should be contained in a Meccan chapter. Furthermore,
implementation and tests on the hadith data set were carried out after completing the main steps of
designing the data set and the Java program that retrieves the features and creates the
required .arff file used for classification in WEKA. Since the testing sets and the training sets
were available for both data sets, 10-fold cross-validation was an option for evaluating some of
the results. Additionally, one option for evaluating the results of the classification is to use
decision trees, and another is to use Naïve Bayes classifiers. The results of the classification
will be evaluated using the output produced by WEKA.
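The feature counting and .arff output described above can be illustrated with a short sketch.
It is not the project's final program. It assumes that each feature is a single word, that a
feature's frequency is the number of whitespace-separated tokens in a line that match it
exactly, and that the Yes or No label for each line is supplied by the caller. The class name
ArffSketch and the methods count and toArff are hypothetical. Only plain Java is used, so the
ARFF text is assembled by hand and no WEKA classes are called.

```java
import java.util.List;

// Hypothetical helper, not the project's actual program.
final class ArffSketch {

    // Counts how many whitespace-separated tokens in the line equal the feature word.
    static int count(String line, String feature) {
        int n = 0;
        for (String token : line.trim().split("\\s+")) {
            if (token.equals(feature)) {
                n++;
            }
        }
        return n;
    }

    // Builds ARFF text: one numeric attribute per feature, then a nominal class {Yes,No}.
    // Assumption: feature words contain no quote characters, so simple quoting is enough.
    static String toArff(String relation, List<String> features,
                         List<String> lines, List<Boolean> interesting) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String feature : features) {
            sb.append("@attribute '").append(feature).append("' numeric\n");
        }
        sb.append("@attribute class {Yes,No}\n\n@data\n");
        for (int i = 0; i < lines.size(); i++) {
            // One data row per line of text: the feature frequencies followed by the label.
            for (String feature : features) {
                sb.append(count(lines.get(i), feature)).append(',');
            }
            sb.append(interesting.get(i) ? "Yes" : "No").append('\n');
        }
        return sb.toString();
    }
}
```

The returned string can be written to a file with the .arff extension and opened in WEKA, where
a decision tree or Naïve Bayes classifier can then be evaluated with 10-fold cross-validation.
For Arabic text, the file would need to be written in an encoding that WEKA is configured to
read, such as UTF-8.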