Txt2Vec is a toolkit to represent text as vectors. It is based on Google's word2vec project, but adds new features such as incremental training and model vector quantization. For a given term, phrase or sentence, Txt2Vec generates a corresponding vector according to its semantics in text, where each dimension of the vector represents a feature.
Txt2Vec uses a neural network to encode the model and cosine distance to measure term similarity. Furthermore, Txt2Vec fixes some issues word2vec has when encoding a model in a multi-threaded environment.
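To make "cosine distance for term similarity" concrete, here is a minimal Python sketch of cosine similarity between two word vectors. The vectors below are made up for illustration; they are not produced by Txt2Vec.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|).
    # 1.0 means the vectors point in the same direction; 0.0 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "word vectors", for illustration only.
v_king = [0.5, 0.7, 0.1]
v_queen = [0.45, 0.72, 0.12]
print(round(cosine_similarity(v_king, v_queen), 4))
```

Words whose vectors have a cosine similarity close to 1.0 are treated as semantically similar.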
The following introduces how to use the console tool to train and apply models. The API documentation will be added later.
The Txt2VecConsole tool supports several modes. Run the tool without any options and it will show usage information for each mode.
Txt2VecConsole for Text Distributed Representation
Specify the running mode:
- train a model to build vectors for words
- calculate the similarity between two words
- perform multi-word semantic analogy
- shrink down the size of a model
- dump a model to text format
- build a vector quantization model in text format
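The analogy mode relies on the classic word2vec vector-arithmetic property (e.g. king - man + woman ≈ queen). With toy vectors chosen so the arithmetic works out exactly, the idea can be sketched as follows (illustrative only, not Txt2Vec's actual implementation):

```python
def analogy_vector(a, b, c):
    # Solve "a is to b as c is to ?" by computing b - a + c, element-wise.
    return [bi - ai + ci for ai, bi, ci in zip(a, b, c)]

# Toy 2-dimensional vectors, for illustration only.
man, king = [1.0, 0.0], [1.0, 1.0]
woman = [0.0, 0.0]
queen = analogy_vector(man, king, woman)
print(queen)
```

In a real model, the resulting vector is then matched against the vocabulary by cosine similarity to find the answer word.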
In train mode, you can train a word-vector model from a given corpus. Note that before training, the text in the training corpus should already be word-segmented. The following are the parameters for train mode:
Txt2VecConsole.exe -mode train
Parameters for training:
-trainfile <file> : Use text data from <file> to train the model
-modelfile <file> : Use <file> to save the resulting word vectors / word clusters
-vector-size <int> : Set size of word vectors; default is 200
-window <int> : Set max skip length between words; default is 5
-sample <float> : Set threshold for occurrence of words. Words that appear with higher frequency in the training data will be randomly down-sampled; default is 0 (off), a useful value is 1e-5
-threads <int> : Number of threads to use; default is 1
-min-count <int> : Discard words that appear less than <int> times; default is 5
-alpha <float> : Set the starting learning rate; default is 0.025
-debug <int> : Set the debug mode; default is 2 (more info during training)
-cbow <int> : Use the continuous bag-of-words model; default is 0 (skip-gram model)
-vocabfile <file> : Save the vocabulary into <file>
-save-step <int> : Save the model after every <int> words processed; supports K, M and G suffixes for larger numbers
-iter <int> : Number of training iterations; default is 5
-negative <int> : Number of negative examples; default is 5, common values are 3 to 15
-pre-trained-modelfile <file> : Use <file> as the pre-trained model for incremental training
-only-update-corpus-word <int> : Set to 1 to update only words found in the training corpus, 0 to update all words
Txt2VecConsole.exe -mode train -trainfile corpus.txt -modelfile vector.bin -vocabfile vocab.txt -debug 1 -vector-size 200 -window 5 -min-count 5 -sample 1e-4 -cbow 1 -threads 1 -save-step 100M -negative 15 -iter 5
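For reference, the original word2vec down-samples frequent words with keep probability sqrt(t/f), where t is the -sample threshold and f is the word's fraction of the corpus. Assuming Txt2Vec follows the same scheme (an assumption, since one of several published variants of the formula may be used), its effect can be sketched as:

```python
import math

def keep_probability(word_count, total_words, threshold=1e-4):
    # Fraction of the corpus this word accounts for.
    f = word_count / total_words
    # word2vec-style subsampling: frequent words are kept with probability
    # sqrt(threshold / f); words at or below the threshold are always kept.
    if f <= threshold:
        return 1.0
    return math.sqrt(threshold / f)

# A word covering 1% of a 1M-word corpus is kept ~10% of the time at t=1e-4.
print(round(keep_probability(10_000, 1_000_000), 2))
```

This is why -sample speeds up training and improves rare-word vectors: very frequent words contribute far fewer training examples.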
After training finishes, the tool generates three files: vector.bin contains the words and their vectors in binary format, vocab.txt contains all words with their frequencies in the given training corpus, and vector.bin.syn is used for incremental model training in the future.
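The original word2vec binary format begins with a text header (vocabulary size and vector size) followed by each word and its raw float32 vector. If vector.bin is compatible with that layout — an assumption about Txt2Vec, not a documented guarantee — it can be read roughly like this:

```python
import struct

def load_word2vec_binary(path):
    """Read a word2vec-style binary model into a {word: [float, ...]} dict."""
    vectors = {}
    with open(path, "rb") as f:
        header = f.readline().split()          # b"<vocab_size> <vector_size>"
        vocab_size, vector_size = int(header[0]), int(header[1])
        for _ in range(vocab_size):
            # Each word is terminated by a single space.
            word_bytes = bytearray()
            while True:
                ch = f.read(1)
                if ch in (b" ", b""):
                    break
                if ch != b"\n":                # skip stray newlines between entries
                    word_bytes += ch
            word = word_bytes.decode("utf-8", errors="replace")
            raw = f.read(4 * vector_size)      # vector_size little-endian float32s
            vectors[word] = list(struct.unpack(f"<{vector_size}f", raw))
    return vectors
```

For guaranteed compatibility, prefer the tool's own dump mode to export the model to text format.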
Incremental Model Training
After collecting new corpus and new words, we need to re-train the existing model in incremental mode to obtain vectors for the new words or to update existing words' vectors with the new corpus. Here is an example:
Txt2VecConsole.exe -mode train -trainfile corpus_new.txt -modelfile vector_new.bin -vocabfile vocab_new.txt -debug 1 -window 10 -min-count 1 -sample 1e-4 -threads 4 -save-step 100M -alpha 0.1 -cbow 1 -iter 10 -pre-trained-modelfile vector_trained.bin -only-update-corpus-word 1
Suppose we have already trained a model "vector_trained.bin", and have now collected new corpus named "corpus_new.txt" and new words saved into "vocab_new.txt". The above command line re-trains the existing model incrementally and generates a new model file named "vector_new.bin". To get a better result, the "alpha" value should usually be larger than the one used when training on the full corpus and vocabulary.
Incremental model training is very useful when new corpus and new words arrive: in this mode, we can efficiently generate vectors for new words that are aligned with the existing words' vectors.
Calculating word similarity
In distance mode, you can calculate the similarity between two words. Here are the parameters for this mode:
Txt2VecConsole.exe -mode distance
Parameters for calculating word similarity:
-modelfile <file> : the encoded model file to load
-maxword <int> : the maximum number of words in the result; default is 40
After the model is loaded, you can enter a word in the console and the tool will return the Top-N most similar words.
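Conceptually, the Top-N lookup ranks every vocabulary word by cosine similarity against the query word's vector. A minimal sketch of that idea, using made-up vectors rather than Txt2Vec's actual implementation:

```python
import math

def top_n_similar(query, vocab_vectors, n=40):
    """Return the n (word, vector) pairs closest to `query` by cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    ranked = sorted(vocab_vectors.items(),
                    key=lambda kv: cosine(query, kv[1]), reverse=True)
    return ranked[:n]

# Toy 2-dimensional vocabulary, for illustration only.
vocab = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}
for word, _ in top_n_similar([0.85, 0.15], vocab, n=2):
    print(word)
```

A real model would normalize the vectors once up front so each lookup is a plain dot product, but the ranking is the same.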