This is a Julia-based tutorial covering the following topics:

- Introduction to Linear Algebra
- Applications of Matrix Factorizations
- Introduction to Text Mining
- Introduction to Recommender Systems

The package provides each of the topics above as a sub-module. At present only the Text Mining tutorial is available; the rest are under construction.

This is an unregistered package, and can be installed in either of the following two ways:

```
Pkg.clone("https://github.com/abhijithch/JuliaTutorial.jl.git")
```

Alternatively, the repository can be cloned directly from GitHub:

```
git clone https://github.com/abhijithch/JuliaTutorial.jl.git
```

With a direct clone, the dependent packages have to be installed by hand. If the package is installed through the package manager with `Pkg.clone()`, the dependent packages are installed automatically.

To start using the package, first run `using JuliaTutorial`. Then, according to the options given, load a sub-module, for example `using JuliaTutorial.TextMining`, to enable all the functions of the Text Mining tutorial.

Please refer to `docs/Julia_TextMining.pdf` for the theoretical concepts. The Text Mining module depends on `TextAnalysis.jl` for most of the preprocessing and for the preparation of the Term Document Matrix.

The first step is to generate a corpus from a collection of textual data. In this module, documents are the source of textual data. These documents could be a collection of research articles, HTML files, etc. The function `PrepDocCorpus(dirname::String,DocType::Type)` prepares a corpus, i.e., a collection of all the documents under one entity. It also standardizes all the documents to a single type, specified by `DocType`, which can be any of `StringDocument`, `TokenDocument` or `NGramDocument`.

The query corpus is obtained using the function `PrepQueriesCorpus(NoQueries::Int,QueryFile::String)`. The `NoQueries` queries are stored in a single text file, `QueryFile`, in which consecutive queries are delimited by 2 blank lines.
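
The query-file layout described above can be illustrated with a small, self-contained sketch. It is not the code behind `PrepQueriesCorpus`; the helper name `split_queries` is hypothetical, and the sketch assumes that "2 blank lines" means two (possibly whitespace-only) empty lines between consecutive queries. It is written for Julia 1.x.

```julia
# Hypothetical helper, not part of the package: split the text of a query file
# into individual queries, assuming queries are separated by two blank lines.
function split_queries(text::AbstractString)
    normalized = replace(text, "\r\n" => "\n")          # tolerate Windows line endings
    parts = split(normalized, r"\n[ \t]*\n[ \t]*\n")    # newline, blank line, blank line
    return [String(strip(p)) for p in parts if !isempty(strip(p))]
end

queries = split_queries("first query\n\n\nsecond\nquery\n\n\nthird")
```

A single blank line inside a query is left alone, so multi-paragraph queries stay in one piece under this assumption.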

The `PreProcess!(crps::Corpus)` function does all the preprocessing, such as the removal of articles, pronouns, prepositions and stop words.

The functions `dtm` or `tdm` from the `TextAnalysis` package are used to generate the TDM (Term Document Matrix). All the models end up factoring this TDM.
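
To make the shape of a TDM concrete, here is a minimal sketch that builds one by hand from already tokenized documents. It deliberately does not call `TextAnalysis`; the function name `term_document_matrix` and the toy documents are assumptions made for illustration (Julia 1.x).

```julia
# Hypothetical helper, not the TextAnalysis API: rows are terms, columns are
# documents, and each entry counts how often a term occurs in a document.
function term_document_matrix(docs::Vector{Vector{String}})
    terms = sort(unique(reduce(vcat, docs; init=String[])))
    index = Dict(t => i for (i, t) in enumerate(terms))
    A = zeros(Float64, length(terms), length(docs))
    for (j, doc) in enumerate(docs), tok in doc
        A[index[tok], j] += 1.0
    end
    return terms, A
end

docs = [["julia", "text", "mining"], ["text", "text", "matrix"]]
terms, A = term_document_matrix(docs)
```

Each column of `A` is then one document vector in term space, which is the form the models below work with.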

The proximity measure used is the cosine measure. The function `CosTheta(q::Array{Float64,1},d::Array{Float64,1})` returns the cosine of the angle between the query vector `q` and the document vector `d`.
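
As a rough illustration of the cosine measure, the sketch below computes it with Julia's standard `LinearAlgebra` library. The name `cos_theta` is a hypothetical stand-in, not the package's `CosTheta`, and returning 0 for a zero vector is a convention chosen here rather than documented behavior (Julia 1.x).

```julia
using LinearAlgebra

# Hypothetical stand-in for the cosine measure; not the package's own code.
function cos_theta(q::AbstractVector{<:Real}, d::AbstractVector{<:Real})
    length(q) == length(d) || throw(DimensionMismatch("q and d must have the same length"))
    nq, nd = norm(q), norm(d)
    # Assumed convention: a zero vector has similarity 0 with everything.
    (nq == 0 || nd == 0) && return 0.0
    return dot(q, d) / (nq * nd)
end

c_parallel   = cos_theta([1.0, 2.0], [2.0, 4.0])   # same direction
c_orthogonal = cos_theta([1.0, 0.0], [0.0, 1.0])   # perpendicular vectors
```

Because the measure depends only on direction, scaling a document vector (for example a longer document with the same term proportions) does not change its score.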

As in any information retrieval task, *Recall*, `R`, and *Precision*, `P`, measure the performance. `R=Dr/Nr`, where `Dr` is the number of relevant documents retrieved and `Nr` is the total number of relevant documents in the database, and `P=Dr/Dt`, where `Dt` is the total number of documents retrieved. The function `PrepTest()` prepares the test matrix, which is a human-verified list of the relevant documents for the corresponding queries.
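
The two ratios can be worked through on a toy example. The sketch below is only an illustration: `recall_precision` is a hypothetical helper, documents are identified by integer ids, and returning 0 when a denominator is 0 is an assumption (Julia 1.x).

```julia
# Hypothetical helper, not part of the package.
function recall_precision(retrieved::AbstractVector{<:Integer},
                          relevant::AbstractVector{<:Integer})
    Dr = length(intersect(retrieved, relevant))   # relevant documents retrieved
    Dt = length(unique(retrieved))                # total documents retrieved
    Nr = length(unique(relevant))                 # total relevant documents
    R = Nr == 0 ? 0.0 : Dr / Nr                   # assumed: 0 when nothing is relevant
    P = Dt == 0 ? 0.0 : Dr / Dt                   # assumed: 0 when nothing is retrieved
    return R, P
end

# Documents 1-4 were retrieved; documents 2, 4 and 6 are the relevant ones.
R, P = recall_precision([1, 2, 3, 4], [2, 4, 6])
```

Here `Dr = 2`, `Dt = 4` and `Nr = 3`, so recall is 2/3 and precision is 1/2.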

The Vector Space Model (VSM) is the basic model, in which the column vectors of the TDM constitute the Document space, whose dimension equals the number of terms (keywords). A new query is another vector in the same space, and the VSM simply computes the cosine similarity between the query and every document. A tolerance value decides how many documents are returned. The performance analysis is done for various tolerance levels.
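
A minimal sketch of this matching step follows. It is not the package's `VSMModel`: the helper `vsm_retrieve` is hypothetical, and it assumes that the tolerance acts as a lower threshold on the cosine, so a document is returned when its score is at least `tol` (Julia 1.1 or later, for `eachcol`).

```julia
using LinearAlgebra

# Hypothetical helper: score every document column of A against the query q
# and keep the documents whose cosine is at least tol (assumed thresholding).
function vsm_retrieve(A::AbstractMatrix{<:Real}, q::AbstractVector{<:Real}, tol::Real)
    size(A, 1) == length(q) || throw(DimensionMismatch("q needs one entry per row (term) of A"))
    nq = norm(q)
    scores = zeros(size(A, 2))
    for (j, d) in enumerate(eachcol(A))
        nd = norm(d)
        # Assumed convention: a zero vector scores 0 against everything.
        scores[j] = (nq == 0 || nd == 0) ? 0.0 : dot(q, d) / (nq * nd)
    end
    return findall(s -> s >= tol, scores), scores
end

# Toy TDM: 3 terms (rows) and 3 documents (columns).
A = [1.0 0.0 1.0;
     1.0 0.0 0.0;
     0.0 1.0 1.0]
q = [1.0, 1.0, 0.0]
hits, scores = vsm_retrieve(A, q, 0.4)
```

Raising `tol` returns fewer documents, which typically trades recall for precision; that is why the analysis is repeated over several tolerance levels.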

The VSM can be tested using the function `VSMModel()`, whose methods constrain the parameter types. The method `VSMModel(A::Array{Float64,2},nq::Int64)` gives the *Precision* and *Recall* for `nq` queries, which form the first `nq` columns of the matrix `A`. The documents are the remaining column vectors of `A`.

The method `VSMModel(Q_C::Corpus,D_C::Corpus)` forms the TDM from the query and document corpora and gives the average *Recall* and *Precision*.

The method `VSMModel(QueryNum::Int64,A::Array{Float64,2},nq::Int64)` gives the *Recall* and *Precision* for a single query identified by `QueryNum`.

The LSI (latent semantic indexing) model computes the SVD of the Term Document Matrix and decomposes it into a *Document Space* and a *Query Space*. The method `SVDModel(A::Array{Float64,2},nq::Int64,rank::Int64)` uses the reduced-rank approximation and returns the average *Recall* and *Precision*. The method `SVDModel(Q_C::Corpus,D_C::Corpus,rank::Int64)` does the same for the query and document corpora. The method `SVDModel(QueryNum::Int64,A::Array{Float64,2},NumQueries::Int64,rank::Int64)` gives the *Recall* and *Precision* for the single query `QueryNum`.
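
One common way to carry out reduced-rank matching is sketched below; it may differ from what `SVDModel` does internally. With the rank-`k` truncation `A ≈ U_k Σ_k V_kᵀ`, the documents are represented by the columns of `Σ_k V_kᵀ`, the query by `U_kᵀ q`, and the cosine is taken in that `k`-dimensional space. The function name `lsi_scores` and the toy data are assumptions (Julia 1.1 or later).

```julia
using LinearAlgebra

# Hypothetical helper: cosine scores between a query and all documents after
# projecting both onto the leading k left singular vectors of A.
function lsi_scores(A::AbstractMatrix{<:Real}, q::AbstractVector{<:Real}, k::Integer)
    1 <= k <= minimum(size(A)) || throw(ArgumentError("k must lie between 1 and min(size(A))"))
    F = svd(A)
    Uk = F.U[:, 1:k]
    Hk = Diagonal(F.S[1:k]) * F.Vt[1:k, :]   # k x n: documents in the reduced space
    qk = Uk' * q                              # query in the same reduced space
    nq = norm(qk)
    scores = zeros(size(Hk, 2))
    for (j, h) in enumerate(eachcol(Hk))
        nh = norm(h)
        # Assumed convention: a zero vector scores 0 against everything.
        scores[j] = (nq == 0 || nh == 0) ? 0.0 : dot(qk, h) / (nq * nh)
    end
    return scores
end

A = [1.0 0.0 1.0;
     1.0 0.0 0.0;
     0.0 1.0 1.0]
q = [1.0, 1.0, 0.0]
full_rank_scores = lsi_scores(A, q, 3)   # no truncation: matches the plain cosine
rank2_scores     = lsi_scores(A, q, 2)   # truncated to the two leading directions
```

With `k` equal to the full rank the scores coincide with the plain vector space cosines; smaller `k` discards the weakest directions, which is where the reduced-rank model can differ from the VSM.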

Considering the documents to be points in `m`-dimensional space, documents with similar content tend to lie close to each other. The documents are therefore clustered into `k` clusters, with the *centroid* of each cluster representing it. The `k` centroid vectors, collected as the columns of a matrix `C`, then represent the entire Document Space. To obtain an orthonormal basis of this space, we compute the QR factorization of `C`; the resulting orthonormal basis is denoted `G`. By projecting the document vectors and the query vectors onto the space spanned by `G`, we find the cosine measure between the query and all of the documents.
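
The centroid, QR and projection steps can be sketched as follows. To stay self-contained, the sketch assumes the cluster labels are already known (for instance from a k-means run) and skips the clustering itself. The helpers `centroid_basis` and `projected_scores` are hypothetical and are not the package's `KMeansModel` (Julia 1.1 or later).

```julia
using LinearAlgebra

# Hypothetical helper: centroid matrix C (one column per cluster) and an
# orthonormal basis G of its column space, taken from the thin QR factor.
function centroid_basis(D::AbstractMatrix{<:Real}, labels::AbstractVector{<:Integer}, k::Integer)
    C = zeros(size(D, 1), k)
    for j in 1:k
        members = findall(l -> l == j, labels)
        isempty(members) && throw(ArgumentError("cluster $j is empty"))
        C[:, j] = vec(sum(D[:, members], dims=2)) ./ length(members)
    end
    G = Matrix(qr(C).Q)    # m x k matrix with orthonormal columns
    return C, G
end

# Hypothetical helper: cosine between the projected query and projected documents.
function projected_scores(G::AbstractMatrix{<:Real}, D::AbstractMatrix{<:Real}, q::AbstractVector{<:Real})
    Dg, qg = G' * D, G' * q
    nq = norm(qg)
    scores = zeros(size(Dg, 2))
    for (j, d) in enumerate(eachcol(Dg))
        nd = norm(d)
        scores[j] = (nq == 0 || nd == 0) ? 0.0 : dot(qg, d) / (nq * nd)
    end
    return scores
end

# Toy data: 4 terms, 4 documents, two obvious groups (labels assumed given).
D = [1.0 1.0 0.0 0.0;
     1.0 0.8 0.0 0.1;
     0.0 0.0 1.0 1.0;
     0.0 0.1 1.0 0.9]
labels = [1, 1, 2, 2]
C, G = centroid_basis(D, labels, 2)
scores = projected_scores(G, D, [1.0, 1.0, 0.0, 0.0])
```

Working in the `k`-dimensional space spanned by `G` means each comparison costs `k` multiplications instead of `m`, at the price of ignoring whatever part of a document lies outside the centroid space.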

The method `KMeansModel(A::Array{Float64,2},NumQueries::Int64,NumClusters::Int64)` gives the average *Recall* and *Precision* using `NumClusters` clusters. The method `KMeansModel(QueryNum::Int64,A::Array{Float64,2},NumQueries::Int64,Clusters::Int64)` does the same for the single query `QueryNum`. The method `KMeansModel(Q_C::Corpus,D_C::Corpus,Clusters::Int64)` finds the *Recall* and *Precision* for the query and document corpora `Q_C` and `D_C`.

The function `plot_DrDtNr(Dr::Array{Float64,1},Dt::Array{Float64,1},Nr::Array{Float64,1},qNum::Int64,tol::Array{Float64,1})` can be used to plot `Dr`, `Dt` and `Nr` for a single query `qNum` against the tolerance levels specified by `tol`.

The function `plotNew_RecPrec(Rec::Array{Float64,1},Prec::Array{Float64,1},strMethod::String)` must be used to plot the recall and precision. The `strMethod` argument specifies the model used, e.g., VSM or LSI. This function generates a new `figure()`.

Using the function `plotAdd_RecPrec(Rec::Array{Float64,1},Prec::Array{Float64,1},strMethod::String)`, a plot can be added to an already existing figure object. The plots automatically choose different colors, and the corresponding legends are created. Up to 7 plots are supported, in the following colors: `Colors=["red","blue","green","black","cyan","magenta","yellow"]`. New colors can be added in `PlotResults.jl` to enable more plots.

02/26/2014
