All my codes: https://drive.google.com/drive/folders/1oRk4SwSXEHcVIe-2-RcUrYmxOcopCN0m?usp=drive_link
Lecture 1 NLP, TM and POS
Vocabulary
- disciplines /ˈdɪsəplɪnz/ 学科
- multidisciplinary /ˌmʌltiˌdɪsɪˈplɪnəri/ 多学科的
- interface /ˈɪntərfeɪs/ 接口
- generative /ˈdʒɛnərətɪv/ 生成的
- Linguistics /lɪŋˈɡwɪstɪks/ 语言学
- Morphological knowledge /ˌmɔrfəˈlɑdʒɪkəl ˈnɒlɪdʒ/ 形态学知识
- Syntactic knowledge /sɪnˈtæktɪk ˈnɒlɪdʒ/ 句法知识
- Semantic knowledge /sɪˈmæntɪk ˈnɒlɪdʒ/ 语义知识
- Phonetic knowledge /fəˈnɛtɪk ˈnɒlɪdʒ/ 语音知识
- Pragmatic knowledge /præɡˈmætɪk ˈnɒlɪdʒ/ 语用知识
- Discourse Knowledge /ˈdɪskɔrs ˈnɒlɪdʒ/ 话语知识
- pronoun resolution /ˈprəunaʊn ˌrɛzəˈluːʃən/ 代词消解
- control inflation /kənˈtroʊl ɪnˈfleɪʃən/ 控制通货膨胀
- polysemous words /ˌpɒlɪˈsiːməs wɜrdz/ 多义词
- utterance interpretation /ˈʌtərəns ɪnˌtɜːprɪˈteɪʃən/ 话语解释
- Cohesion /koʊˈhiːʒən/ 衔接
- Synonyms /ˈsɪnənɪmz/ 同义词
- Coreference /ˌkɔːrˈɛfərəns/ 共指
- computational framework /ˌkɒmpjuːˈteɪʃənəl ˈfreɪmwɜːrk/ 计算框架
- lexical connectivity patterns /ˈlɛksɪkəl kəˌnɛktɪvəti ˈpætərnz/ 词汇连接模式
- lexical Ambiguities /ˈlɛksɪkəl æmbɪˈɡwɪtiz/ 词汇歧义
- rhetorical relation /rɪˈtɔrɪkəl rɪˈleɪʃən/修辞关系
- Clauses /klɔːzɪz/ 子句
- ambiguous /æmˈbɪɡjuəs/ 模糊的
- Syntactic Parse /sɪnˈtæktɪk pɑːrs/ 句法分析
- Structural ambiguities /ˈstrʌktʃərəl æmˈbɪɡjuətiz/ 结构歧义
- Penn Treebank Corpora /pɛn ˈtriːbæŋk ˈkɔːrpərə/ Penn树库
- Corpus /ˈkɔːrpəs/ 语料库
- ADJP Adjective phrase /ˌeɪdiːdʒˈpiː ˈædʒɪktɪv freɪz/ 形容词短语
- ADVP Adverb phrase /ˌædvɜːb freɪz/ 副词短语
- NP Noun Phrase /ˌnaʊn freɪz/ 名词短语
- PP Prepositional phrase /ˌprɛpəˈzɪʃənl freɪz/ 介词短语
- S Simple declarative clause /ˈsɪmpəl dɪˈklærətɪv klɔːz/ 简单陈述句
1. Notion of NLP
- Natural language knowledge representation is context based
- NLP is multidisciplinary in nature
- Natural means the language is produced by humans.
- Text Mining uses algorithms and results developed by NLP
- TM – applied, NLP – fundamentals
2. Disciplines of NLP
- Linguistics: How words, phrases, and sentences are formed.
- Psycholinguistics: How people understand and communicate using a human language.
- Computational linguistics: Deals with models and computational aspects of NLP.
- Artificial intelligence: Issues related to knowledge representation and reasoning.
- NL Engineering: Implementation of large, realistic systems, like LLMs such as ChatGPT and Bard.
- Text Mining: Information extraction from text for a specific purpose (closely related to AI).
3. Basic Levels of Language Processing
semantics [sɪˈmæntɪks]
Semantics studies the meaning of language, including the meaning of words, phrases, and sentences. It covers lexical meaning, the truth conditions of sentences, and the logical relations between them. In NLP, semantics is used to understand what a text means and to draw inferences from it.
Word and sentence meaning, e.g.:
They saw a log.
They saw a log yesterday.
He saws a log.
phonetics [fəˈnɛtɪks]
Phonetics studies the physical properties of speech sounds and how they are articulated. It covers how sounds are produced by the vocal organs and how they are perceived and interpreted. In NLP, phonetics supports tasks such as speech recognition and speech synthesis.
how words are related to the sounds that realize them; essential for speech processing. e.g.:
He leads the team.
Lead is a heavy metal.
syntactics (syntax) [sɪnˈtæktɪks]
Syntax studies the structure of language, in particular how words are organized and how sentences are built. It is concerned with the grammatical relations between language units, such as subject-predicate and modifier relations. In NLP, syntactic knowledge is used to analyze sentence structure and to parse grammar.
how words can be put together to form correct sentences, and the role each word plays in the sentence. e.g. John ate the cake.
morphology [mɔːˈfɒlədʒi]
Morphology studies the internal structure of words and how word forms change. It covers morphological units such as roots, affixes, and stems. In NLP, morphology supports tasks such as lemmatization, stemming, and morphological analysis.
how words are constructed: e.g. friend, friendly, unfriendly, friendliness.
4. Ambiguities
Lexical ambiguity: the same word form has different meanings, e.g. "flies" as the insect (noun) versus "flies" as the verb.
Semantic ambiguity: a word keeps the same form but takes on different meanings in different contexts; e.g. "kill" can mean not only murder but also beat (defeat).
5. Discourse Interpretation
- Investigation of lexical connectivity patterns as the reflection of discourse structure
- Cohesion: Well-formed text exhibits strong lexical connectivity via use of:
Repetitions
Synonyms
Coreference
- Specification of a small set of rhetorical relation among discourse segments
- Assumption: Clauses in well-formed text are related via predefined rhetorical relations, like the discourse segments are logically connected with reasons and results.
- Adaptation of the notion of grammar
- Examination of intentions and relations among them as the foundation of discourse structure
"Cohesion" here refers to the connections and coherence between the words of a text. In text mining, one assumption is that well-formed text exhibits strong lexical connectivity through the following means:
- Repetitions: repeating the same words or phrases strengthens the coherence of a text. By reusing the same words or phrases, the author can emphasize certain points, concepts, or information, making the topic and content easier for the reader to follow.
- Synonyms: using synonyms or near-synonyms enriches the vocabulary and expressiveness of a text. By expressing similar meanings with different words, the author avoids repetition and keeps the text varied.
- Coreference: coreference is the use of pronouns or other expressions to refer back to entities or concepts mentioned earlier. Used correctly, coreference builds logical links within the text so that readers can clearly see how its parts relate to each other.
Strong lexical cohesion is therefore realized through these devices, which help build logically connected, coherent text and improve the effectiveness and accuracy of text mining tasks.
6. Parsing
Syntactic structure of a sentence:
Penn Treebank Corpora
Part-of-speech tagged
Syntactic Bracketing
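To make the Penn Treebank tags concrete, here is a minimal POS-tagging sketch with NLTK (my own illustration, not lecture code):

```python
# Illustrative only: tag a sentence with NLTK's Penn Treebank tagset.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("John ate the cake.")
print(nltk.pos_tag(tokens))
# e.g. [('John', 'NNP'), ('ate', 'VBD'), ('the', 'DT'), ('cake', 'NN'), ('.', '.')]
```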
Lecture 2 Extracting Tokens and nGrams
Vocabulary
- singular: /ˈsɪŋɡjələ(r)/ adj. 单数的
- plural: /ˈplʊərəl/ adj. 复数的
- possessive: /pəˈzɛsɪv/ adj. 所有格的
- tense of the verb: /tɛns ʌv ðə vɜrb/ 动词时态
- Regulars: /ˈrɛɡjʊlərz/ 正规的
- Irregulars: /ɪˈrɛɡjʊlərz/ 不规则的
- morphemes: /ˈmɔrfiːm/ n. 语素
- Stems: /stɛmz/ n. 词干
- Affixes: /ˈæfɪksɪz/ n. 词缀
- Inflectional Morphology: /ɪnˈflɛkʃənəl mɔrˈfɑlədʒi/ 屈折形态学
- Derivational Morphology: /dɪˈrɪveɪʃənəl mɔrˈfɑlədʒi/ 派生形态学
- punctuation: /ˌpʌŋktʃuˈeɪʃən/ n. 标点符号
- terminology: /ˌtɜːmɪˈnɒlədʒi/ n. 术语
- Wordform: /ˈwɜːrdfɔːrm/ n. 词形
- inflected: /ɪnˈflɛktɪd/ adj. 屈折的
- inflections: /ɪnˈflɛkʃənz/ n. 屈折形式
- granularity: /ɡrænjuˈlærɪti/ n. 粒度
- lemma: /ˈlɛmə/ n. 引用形式;词条
- citation: /saɪˈteɪʃən/ n. 引用;引证
1. Comparing NLTK, spaCy, and scikit-learn
These three libraries are popular tools in the Python ecosystem for different tasks in text processing and machine learning.
- NLTK (Natural Language Toolkit): This is one of the oldest and most comprehensive libraries for natural language processing (NLP) in Python. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It's great for learning and prototyping but can be slower compared to other libraries.
- spaCy: This is a modern NLP library designed to be fast, streamlined, and production-ready. It is less comprehensive than NLTK but focuses on providing only the best and most efficient methods for common NLP tasks. spaCy is well-suited for large-scale information extraction tasks and includes pre-trained statistical models and word vectors.
- scikit-learn: Unlike NLTK and spaCy, scikit-learn is primarily a machine learning library and not specialized for NLP. It includes tools for data mining and data analysis and is built on NumPy, SciPy, and matplotlib. For text data, it offers features like feature extraction from text for use in machine learning algorithms. It doesn’t provide tools for tasks like tokenization or parsing but integrates well with NLTK or spaCy for the complete pipeline.
2. Concepts of Tokenization, Stemming, Lemmatization, and Bi-grams
- Tokenization: This is the process of breaking down a string of text into pieces, called tokens, which roughly correspond to "words". It is a fundamental step for most NLP tasks because it allows the algorithm to work with individual terms. For example, the sentence "ChatGPT is helpful" could be tokenized into ["ChatGPT", "is", "helpful"].
- Stemming: This involves cutting down a word to its base or root form. It often involves chopping off the end of the word to reach a common form, which might not always be a real word. For instance, "running", "runner", and "ran" might all be stemmed to "run".
- Lemmatization: Similar to stemming, lemmatization also reduces words to a base form, but it aims to ensure that the root word (lemma) is a valid word in the language. Lemmatization uses vocabulary and morphological analysis, so it’s generally more accurate and sophisticated than stemming. For example, "better" would be lemmatized to "good".
- Bi-grams: These are a type of n-gram, where 'n' refers to the number of contiguous words grouped together. Bi-grams pair adjacent words together, which can be useful in applications like language modeling and text prediction. For example, from the sentence "Natural language processing", the bi-grams would be ["Natural language", "language processing"].
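A minimal NLTK sketch of these four operations (illustrative only, not the lecture's demo code):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

# Tokenization
tokens = nltk.word_tokenize("Natural language processing is helpful")

# Stemming: crude suffix chopping, output need not be a real word
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["running", "runner", "ran"]])  # ['run', 'runner', 'ran']

# Lemmatization: uses vocabulary + morphology, returns a valid word
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'

# Bi-grams: pairs of adjacent tokens
print(list(ngrams(tokens, 2)))
```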
3. Pre-processing Before Text Processing
Text pre-processing is crucial for cleaning and standardizing text data before performing more complex NLP tasks. Here are some common steps:
- Cleaning Text: Removing irrelevant characters such as punctuation, special characters, and numbers.
- Lowercasing: Converting all the characters in the text into lower case to maintain uniformity.
- Removing Stop Words: Filtering out common words (like "and", "the", etc.) that might be of little value in tasks like text classification.
- Handling Missing Data: Filling in or removing missing values in text data.
- Encoding: Converting text into a format that can be used by machine learning algorithms, often using techniques like Bag of Words, TF-IDF, or word embeddings.
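A small pre-processing sketch covering cleaning, lowercasing, and stop-word removal (NLTK stop-word list; encoding is covered in Lecture 4):

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

text = "The 2 cats, and the dog!"
text = re.sub(r"[^a-zA-Z\s]", " ", text)   # remove punctuation, special chars, numbers
text = text.lower()                        # lowercasing
tokens = nltk.word_tokenize(text)
tokens = [t for t in tokens if t not in stopwords.words("english")]
print(tokens)                              # ['cats', 'dog']
```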
1. Words, Token, Tokenization
Words: These are the basic units of meaning in language
Tokens: These are smaller units created during a process called tokenization. Tokenization breaks down text into a format that computers can understand better. Tokens can be individual words, but they can also be:
- Sub-words: Especially in machine learning models, words can be split into smaller meaningful units; for example, "playing" may be split into "play" and "ing". This helps handle uncommon words or account for different word forms.
- Punctuation marks: Commas, periods, etc. can all be considered tokens.
- Special characters: Symbols like @, #, etc. can also be tokens.
Relationship between Words and Tokens:
- One word can be one token (e.g., "happy").
- One word can be multiple tokens (e.g., "wouldn't" becomes "would", "n't").
- Multiple words can be combined into a single token (sometimes done for named entities like "New York").
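For example, NLTK's Penn Treebank tokenizer splits contractions (spaCy behaves similarly):

```python
import nltk

nltk.download("punkt", quiet=True)
print(nltk.word_tokenize("He wouldn't go"))  # ['He', 'would', "n't", 'go']
```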
2. Regulars and Irregulars
- Regulars...
- Walk, walks, walking, walked, walked
- Irregulars
- Eat, eats, eating, ate, eaten
- Catch, catches, catching, caught, caught
- Cut, cuts, cutting, cut, cut
3. Inflectional and Derivational Morphology
- Inflectional morphology in English is fairly straightforward. It serves a grammatical/semantic purpose, but the inflected form is nevertheless transparently related to the original. Below are different forms of a verb:
- skip, skipping, skipped
- Derivational morphology: some affixes change the meaning of the word without changing its grammatical function.
- deconstruction, unfriendly
4. Types and Tokens
In language, a type is the underlying concept behind a word; tokens are the physical occurrences of a word in a text. Tokens can also include punctuation marks, special characters, or even sub-word units (especially in machine learning)
- One type can have many tokens. For example, the type "book" can have tokens like "book," "books," or "booked."
- A single token typically belongs to one type. However, there can be ambiguity in some cases (e.g., "play" can be a noun or a verb).
5. Lemma and Wordforms
Lemma:
- Also known as the citation form, base form, or dictionary headword.
- Represents the core meaning or dictionary form of a word.
- It's the underlying form to which all inflected versions of a word relate.
- Example: "break" is the lemma for "break," "breaks," "broke," "broken," and "breaking."
Wordform:
- Represents all the various inflected or modified versions of a lemma.
- These variations reflect grammatical features like tense, plurality, or case.
- Example: "breaks" (present tense, 3rd person singular), "broke" (past tense), "broken" (past participle), and "breaking" (present participle) are all wordforms of the lemma "break."
6. Morphological Knowledge
Morphology is the study of the structure and form of words, which includes understanding how words are formed from morphemes (the smallest grammatical units in a language). In text mining, morphological analysis helps in:
- Word Segmentation: Especially important in languages like Chinese, where segmentation into morphemes can be crucial for understanding.
- Part-of-Speech Tagging: Determining whether a word is a noun, verb, adjective, etc., based on its morphological form.
- Stemming and Lemmatization: As discussed earlier, these processes reduce words to their root forms, helping in generalizing different forms of the same word.
7. Syntactic Knowledge
Syntax refers to the arrangement of words in a sentence to make grammatical sense. Syntactic knowledge is used in text mining for:
- Parsing: Analyzing the grammatical structure of sentences helps understand the relationships between parts of a sentence. This is vital for tasks that require understanding sentence structure, such as dependency parsing.
- Sentence Breaking: Determining where sentences begin and end in a large text.
- Grammar Checking: Identifying syntactic errors in text can be useful in applications like automated proofreading tools.
8. Semantic Knowledge
Semantics is the study of meaning in language. Semantic knowledge allows text mining algorithms to understand the meanings of words in context and how these meanings change in different situations:
- Word Sense Disambiguation: Determining which meaning of a word is used in a given context.
- Entity Recognition: Identifying and categorizing key pieces of information in text like names of people, organizations, locations, etc.
- Relationship Extraction: Determining relationships between entities, which can be crucial for information extraction applications.
9. Phonetic Knowledge
Phonetics deals with the sounds of human speech. While not as commonly applied in text mining as the other types of knowledge, phonetic analysis can be useful in:
- Speech Recognition: Transcribing spoken language into text.
- Phonetic Similarity: For tasks such as rhyme detection or pronunciation-based text matching.
- Speech Synthesis: Converting text to speech where the phonetic properties of words need to be considered to produce natural-sounding speech.
Lecture 3:
1. Confusion Matrix
A confusion matrix is a table that is commonly used to evaluate the performance of a classification model. It allows us to understand the performance of a classification algorithm by presenting a more detailed breakdown of correct and incorrect classifications.
In the confusion matrix:
- True Positives (TP): Instances that were correctly classified as positive.
- True Negatives (TN): Instances that were correctly classified as negative.
- False Positives (FP): Instances that were incorrectly classified as positive (false alarms).
- False Negatives (FN): Instances that were incorrectly classified as negative (misses).
Reading down a predicted-class column, the diagonal cell is the TP count for that class and the other cells are FPs.
Reading across an actual-class row, the diagonal cell is the TP count and the other cells are FNs.

Precision, Recall, Error Rate and Accuracy Rate
Precision (P):
Precision measures the proportion of correctly identified positive instances out of all instances classified as positive by the model. It reflects the model's ability to avoid misclassifying negative instances as positive.
Recall (R):
Recall, also known as sensitivity or true positive rate, measures the proportion of correctly identified positive instances out of all actual positive instances in the dataset. It reflects the model's ability to capture all positive instances, minimizing false negatives.
Error Rate:
The error rate (or misclassification rate) measures the proportion of misclassified instances (both false positives and false negatives) out of all instances in the dataset. It represents the overall rate of errors made by the model.
Accuracy Rate:
The accuracy rate measures the proportion of correctly classified instances (both true positives and true negatives) out of all instances in the dataset. It represents the overall correctness of the model's predictions.
Fβ-score
The Fβ-score is a weighted harmonic mean of precision and recall, adjusted by the parameter β to reflect the relative importance of precision and recall in the evaluation of a classification model:
Fβ = (1 + β²) · P · R / (β² · P + R)
In this formula:
- P represents precision.
- R represents recall.
- β is the parameter that controls the relative importance of precision and recall in the calculation.
- When β=1, it's the standard F1-score, giving equal weight to precision and recall.
- When β>1, more emphasis is placed on recall over precision.
- When 0<β<1, more weight is given to precision over recall.
2. Tagger Methods
- Rule-Based Taggers: Use hand-written rules to determine tags based on word definitions and contextual rules.
- Stochastic Taggers: Use statistical models such as Hidden Markov Models (HMM) or Maximum Entropy models to predict tags based on the probability of tag sequences.
- Machine Learning-Based Taggers: Employ advanced algorithms like Conditional Random Fields (CRF), Support Vector Machines (SVM), and Neural Networks (e.g., LSTM, BERT) to perform tagging with high accuracy.
- Hybrid Taggers: Combine elements from multiple tagging methods to optimize performance.
3. Calculate F1, P, R by sklearn
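A minimal sketch with scikit-learn's metrics module (the labels below are hypothetical, not lecture data):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             fbeta_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical predictions

print(confusion_matrix(y_true, y_pred))     # rows = actual, columns = predicted
print(precision_score(y_true, y_pred))      # P = TP / (TP + FP)
print(recall_score(y_true, y_pred))         # R = TP / (TP + FN)
print(f1_score(y_true, y_pred))             # F1 (beta = 1)
print(fbeta_score(y_true, y_pred, beta=2))  # beta > 1 weights recall more
print(accuracy_score(y_true, y_pred))       # (TP + TN) / all instances
```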
4. Dependency Tree
- Root:
- The root has no head of its own. It's the central node of the tree from which all other nodes (words) are dependent. It's usually the main verb or the most central semantic element of the sentence; it sits at the top of the dependency structure.
- Nodes:
- Nodes represent all the words in the sentence, including the root. In dependency trees, each word is a node.
- Head:
- Each node (except the root) has a head, which is another node that governs it. The head-node relationship defines a hierarchical structure where the head has syntactic dominance.
- Children:
- Children refer to nodes that are dependent on a particular head. If a node has dependents, those are its children in the tree.
- Edges:
- Edges are the lines that connect nodes in the tree. Each edge goes from a head to one of its children and is labeled with the type of grammatical relationship.
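A small spaCy sketch showing the root, heads, children, and edge labels (assumes the en_core_web_sm model has been downloaded):

```python
# Illustrative only; requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John drove to a town in Auckland yesterday.")

for token in doc:
    # token.dep_ = edge label, token.head = governing node, token.children = dependents
    print(token.text, token.dep_, token.head.text, [c.text for c in token.children])

roots = [t for t in doc if t.dep_ == "ROOT"]   # the root has no head of its own
print("root:", roots[0].text)
```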
5. Taggers Usage
1. Pattern Matching
Tagged patterns can be used to identify specific grammatical structures in texts, such as a sequence of part-of-speech tags that form a meaningful phrase.
2. Rule-Based NLP Systems
In rule-based systems, tagged patterns help in parsing or translating text based on predefined grammatical rules.
3. Information Extraction
Tagged patterns are crucial for extracting structured information like names, locations, or specific relationships from text.
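A small sketch of tag-pattern matching with NLTK's RegexpParser (the chunk grammar here is my own example, not from the lecture):

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tagged = nltk.pos_tag(nltk.word_tokenize("John drove to a town in Auckland"))

# Hand-written rule: optional determiner, any adjectives, then one or more nouns
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))   # subtrees labelled NP are the matched phrases
```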
session 3 Demo Code: https://colab.research.google.com/drive/1VeSnyzIhfmiYcdk1UxhGeMTESbd9t-aO?usp=share_link#scrollTo=G-fS8N0RhRYb
Lecture 4 Vector Space Model and Similarity Computations
Vocabulary
logarithm /ˈlɒɡərɪðəm/ 对数
Vectorization is the first step for Text Processing
1. Term Frequency Counting (CountVectorizer):
- Library: sklearn.feature_extraction.text.CountVectorizer
- Description: Calculates the frequency of each word in the documents. The result is a term frequency vector for each document.
- CountVectorizer is an application of Bag of Words
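A minimal CountVectorizer sketch (toy documents; get_feature_names_out assumes scikit-learn >= 1.0):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # learned vocabulary
print(X.toarray())                           # raw term counts per document
```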
2. TF-IDF (Term Frequency-Inverse Document Frequency):
- Library: sklearn.feature_extraction.text.TfidfVectorizer
- Description: Measures the importance of a word in a document, taking into account both term frequency (TF) and inverse document frequency (IDF). This method reduces the impact of common words while emphasizing key terms.
- Use models such as LogisticRegression, SVM, or RandomForest to learn the relationship between the TF-IDF matrix and the labels, then use the trained model to process new documents.
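A minimal sketch of this TF-IDF + classifier workflow (toy documents and labels are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = ["cheap car insurance deal", "meeting agenda for monday",
              "best auto insurance quote", "project meeting notes"]
train_labels = ["spam", "work", "spam", "work"]      # hypothetical labels

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

clf = LogisticRegression()
clf.fit(X_train, train_labels)

# New documents must be transformed with the SAME fitted vectorizer
print(clf.predict(vectorizer.transform(["cheap insurance quote"])))
```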
Word Embeddings:
- Libraries: gensim.models.Word2Vec, gensim.models.FastText
- Description: Uses models like Word2Vec or FastText to learn dense vector representations of words. These representations capture the complex semantic relationships between words.
- Additional: spaCy also supports using pre-trained word embedding models for text processing.
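A minimal gensim Word2Vec sketch (gensim 4.x API; the toy corpus is an assumption):

```python
from gensim.models import Word2Vec

sentences = [["the", "queen", "rules", "the", "kingdom"],
             ["the", "woman", "rules", "the", "house"]]

# sg=0 selects CBOW, sg=1 selects Skip-gram; vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["queen"][:5])                  # first few dimensions of the vector
print(model.wv.similarity("queen", "woman"))  # cosine similarity of two words
```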
One-hot Encoding:
- Library: keras.preprocessing.text.Tokenizer can be used for one-hot encoding.
- Description: Assigns a unique index to each word and represents it in binary form (1 for the presence of the word, 0 for the absence).
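A small one-hot sketch with the Keras Tokenizer (illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

docs = ["the cat sat", "the dog barked"]
tok = Tokenizer()
tok.fit_on_texts(docs)                            # assigns a unique index to each word

print(tok.word_index)                             # e.g. {'the': 1, 'cat': 2, ...}
print(tok.texts_to_matrix(docs, mode="binary"))   # 1 = word present, 0 = absent
```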
BERT Embeddings:
- Library: transformers (provided by Hugging Face)
- Description: Uses pre-trained BERT models to generate dense vector representations of text. These embeddings capture the meaning of words in context and are well-suited for complex NLP tasks.
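A minimal sketch of contextual BERT embeddings with Hugging Face transformers (bert-base-uncased is a standard public checkpoint; assumes PyTorch is installed):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("He deposited money at the bank.", return_tensors="pt")
outputs = model(**inputs)

# One contextual vector per (sub-word) token; 768 dimensions for BERT-base
print(outputs.last_hidden_state.shape)
```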
n-gram Model:
- Library: Can be configured in sklearn.feature_extraction.text.CountVectorizer or TfidfVectorizer by setting the ngram_range parameter.
- Description: Enhances text representation by considering word order (i.e., combinations of consecutive words).
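For example, word bi-grams can be added by configuring the same vectorizer (a short sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
vectorizer.fit(["natural language processing"])
print(vectorizer.get_feature_names_out())
# ['language' 'language processing' 'natural' 'natural language' 'processing']
```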
1. Bag of Words Model
- Tokenization: The first step is to tokenize the text, breaking it down into individual words or terms. Punctuation and stopwords (common words like "the", "and", "is", etc.) are often removed during this process.
- Vocabulary Creation: Next, a vocabulary is created by compiling a list of unique words found in the text corpus. Each word becomes a feature, and the vocabulary size equals the total number of unique words.
- Vectorization: Once the vocabulary is established, each document (or piece of text) is represented as a numerical vector. The length of the vector is equal to the size of the vocabulary. Each position in the vector corresponds to a word in the vocabulary, and the value at that position indicates the frequency of the corresponding word in the document.
- Normalization: Optionally, you can normalize the vectors to ensure that the length of the vectors is consistent. One common normalization technique is to divide each vector by the total number of words in the document.
Bag of Words model loses the sequence information of the words and hence disregards the context in which words appear. This limitation can be addressed by using more advanced techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (like Word2Vec, GloVe, or FastText) which capture semantic meaning and contextual information better than BoW.
One-hot representation is a 0/1 matrix representing the presence or absence of tokens.
2. Google word2vec vector space model
Word2vec aims to represent words in a continuous vector space where semantically similar words are placed close to each other.
It has two main architectures: Continuous Bag of Words (CBOW) and Skip-gram. CBOW uses the context to predict the focus word, while Skip-gram does the reverse (predicts the context from the focus word).
3. TF-IDF weighting scheme
The tf-idf weight of a term is the product of its tf weight and its idf weight; a common form is w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t)).
tf(t,d) is the term frequency of term t in document d, N is the number of documents in the collection, and df(t) is the number of documents containing the term. Several variants of the tf and idf formulas can be used.
Consider the following corpus of 3 documents and 4 terms. What is the value of inverse document frequency using the formula for the term "insurance" for Doc 3?
| Term | Doc 1 | Doc 2 | Doc 3 |
| --- | --- | --- | --- |
| car | 27 | 4 | 24 |
| auto | 3 | 33 | 0 |
| insurance | 0 | 33 | 29 |
| best | 14 | 0 | 17 |
Here N is 3 (the number of documents), not 33 + 29, and df("insurance") is 2 (it appears in Doc 2 and Doc 3), not 29, so idf("insurance") = log10(3/2) ≈ 0.176.
TF-IDF is basically tf × idf; there are different variants of the formula.
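A quick check of the "insurance" example in Python (log base 10, matching the formula above; this is my own verification, not lecture output):

```python
import math

N = 3          # number of documents in the corpus
df = 2         # "insurance" appears in Doc 2 and Doc 3
idf = math.log10(N / df)
print(round(idf, 3))                 # 0.176

tf = 29        # raw count of "insurance" in Doc 3
print((1 + math.log10(tf)) * idf)    # tf-idf weight under one common variant
```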
4. Distance Calculation Method
Each distance metric serves different purposes and is suitable for specific types of data or tasks. Here's a breakdown of when each method is commonly used:
- Euclidean Distance:
- Usage: Suitable for continuous numerical data or vector spaces.
- Applications: Clustering algorithms like K-means, nearest neighbor search, dimensionality reduction techniques such as PCA, and in general, any scenario where the straight-line distance between points in a multi-dimensional space is relevant.
- Cosine Distance:
- Usage: Appropriate for comparing documents or text data, especially when considering the direction rather than the magnitude of vectors.
- Applications: Text mining, document clustering, information retrieval, and recommendation systems where the angle between vectors is more meaningful than their magnitude.
- Jaccard Coefficient:
- Usage: Primarily used for comparing the similarity of sets, where the order or frequency of elements is not important.
- Applications: Text mining, recommendation systems, and clustering tasks where the presence or absence of items in sets is more important than their frequency or order.
- Levenshtein Distance:
- Usage: Specifically designed for measuring the edit distance between strings, representing the minimum number of single-character edits (insertions, deletions, substitutions) required to change one string into another.
- Applications: Spell checking, DNA sequence analysis, fuzzy string matching, and generally any scenario where string similarity or edit distance is relevant.
- Hamming Distance:
- Usage: Used to measure the difference between two binary strings of equal length, counting the number of positions at which the corresponding bits are different.
- Applications: Error detection and correction in digital communication, data compression, and genetic algorithms.
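A small sketch computing each of these distances (SciPy and NLTK; the toy vectors, sets, and strings are my own examples):

```python
from scipy.spatial import distance
import nltk

a, b = [1.0, 2.0, 3.0], [2.0, 0.0, 3.0]
print(distance.euclidean(a, b))     # straight-line distance
print(distance.cosine(a, b))        # 1 - cosine similarity

set1, set2 = {"car", "insurance", "best"}, {"car", "auto"}
jaccard = len(set1 & set2) / len(set1 | set2)
print(1 - jaccard)                  # Jaccard distance between the two sets

print(nltk.edit_distance("kitten", "sitting"))   # Levenshtein distance = 3

x, y = "10110", "10011"
print(sum(c1 != c2 for c1, c2 in zip(x, y)))     # Hamming distance = 2
```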
Applications:
- Clustering: Identifying similar data points by calculating distances and grouping them together.
- Classification: Finding the nearest training samples to a given data point using distance metrics.
- Recommendation Systems: Calculating similarities between users or items to make personalized recommendations.
- Anomaly Detection: Identifying outliers by measuring distances from the majority of data points.
Session 4 Demo: https://colab.research.google.com/drive/1_txuvn00YOPFfUDuYzNoObm7H3pSB5FV?usp=share_link#scrollTo=7UGpzALFV1x-
Lecture 5 Word Embedding
Embed semantics into words so that "queen" and "woman" are more closely related than "queen" and "tree".
- count-based methods: BoW models, TF, TF-IDF models
- predictive models: Word2vec, CBOW (continuous bag of words), Skip-gram
Two forms of meaning representation:
- Denotational
- Distributional
Word Embedding Models
- CBOW: predict the focus word given a context
- Skip-gram: predict the context given a focus word
Word2vec uses the above two models to train the embedding matrix. This matrix contains 300-dimensional vector representations for 3 million words.
1. Steps for n-gram Vectorization with TensorFlow Training
- Define the Range of n-grams:
- Determine whether you're using word-level or character-level n-grams and choose the value of n.
- Example: For bigrams (n=2) of words.
- Generate n-grams from Text:
- Extract all possible n-grams for each document in your corpus.
- Example code: see the combined sketch after this list.
- Build Vocabulary:
- Create a vocabulary that includes every unique n-gram extracted from the corpus.
- Example code: see the combined sketch after this list.
- Vectorize the Text:
- Convert each document to a vector where each dimension corresponds to an n-gram in the vocabulary.
- Common methods include Boolean presence, frequency count, or TF-IDF.
- Example code using CountVectorizer from sklearn: see the combined sketch after this list.
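A combined sketch of the three code steps referenced above, using word-level bigrams (illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["natural language processing", "language processing is fun"]

# Generate n-grams from text: word-level bigrams for each document
def bigrams(doc):
    words = doc.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

all_bigrams = [bigrams(doc) for doc in corpus]

# Build the vocabulary of unique n-grams
vocabulary = sorted({bg for doc in all_bigrams for bg in doc})
print(vocabulary)

# Vectorize the text, here with CountVectorizer restricted to bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```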
2. TensorFlow Training for n-gram Vectorization
X and Y come from individual words in the training data and their surrounding context. X represents the target word, and Y represents its associated context words.
These pairs train a neural network to learn word embeddings. TensorFlow uses X and Y as input and labels, adjusting the network's weights to minimize loss.
Through this unsupervised process, the model learns dense representations for each word, known as "word embeddings".
- Setting Up TensorFlow and Other Imports:
- Creating One-Hot Encoding for Words:
- Here, you create a one-hot vector for each word index. The size of the one-hot vector (ONE_HOT_DIM) equals the number of unique words in the vocabulary (len(words)).
- Preparing the Dataset:
- You then prepare your input (X) and output (Y) datasets by converting words (from a DataFrame presumably containing contexts and target words) into their respective one-hot encoded forms using the word2int mapping.
- Defining TensorFlow Placeholders:
- These placeholders will hold the input and output data during the training session.
- Building the Neural Network Model:
- A simple neural network with one hidden layer (embedding layer) and an output softmax layer.
- W1 and b1 are the weights and bias for the hidden layer, mapping input to a lower-dimensional embedding space (EMBEDDING_DIM). W2 and b2 are the weights and bias for the output layer, which predicts the context words.
- Cross entropy is used here as it is suitable for classification tasks.
- Setting Up the Training Operation:
- Gradient descent optimizer is used to minimize the cross entropy loss.
- Running the Training Session:
- You would typically run this inside a TensorFlow session where you repeatedly execute train_op to update the model's weights based on your training data (X_train, Y_train).
Lecture 6 KNN, NB and Deep Learning
1. Main steps for classifying text:
- Load Data: Load text data from files using the load_files function from sklearn.datasets.
- Preprocess Data: Preprocess the text data by:
- Removing special characters, single characters, and prefixed 'b'.
- Converting text to lowercase.
- Lemmatizing words to their base form.
- Feature Extraction: Convert preprocessed text data into numerical features using:
- CountVectorizer: Convert text documents into a matrix of token counts.
- TfidfTransformer: Transform the count matrix to a normalized term-frequency or term-frequency times inverse document-frequency representation.
- Split Data: Split the data into training and testing sets using train_test_split from sklearn.model_selection.
- Train Model: Train a classifier using the training data. In this case, a Random Forest classifier is used.
- Predictions: Make predictions on the test data using the trained classifier.
- Evaluate Model: Evaluate the performance of the classifier by calculating:
- Confusion matrix: A table showing the number of correct and incorrect predictions.
- Classification report: Precision, recall, F1-score, and support for each class.
- Accuracy score: The accuracy of the model on the test data.
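A condensed sketch of this pipeline (the full version is in the linked Colab; here a built-in dataset stands in for load_files, and the parameters are assumptions):

```python
from sklearn.datasets import fetch_20newsgroups        # stand-in for load_files(...)
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

X_counts = CountVectorizer(stop_words="english").fit_transform(data.data)
X_tfidf = TfidfTransformer().fit_transform(X_counts)

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, data.target, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```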
Session 6 Demo Code with RandomForest: https://colab.research.google.com/drive/1CwXMsymfdZjPeeDo7LLbu5pd7bEpWkEP?usp=sharing#scrollTo=b7zDF3JErgzR
Text Classification with CNN:
- Model Architecture:
- This code snippet uses a Convolutional Neural Network (CNN) as the model architecture, instead of an ensemble learning model based on random forest.
- CNN is a deep learning model suitable for tasks involving grid-structured data like images, but it can also be applied to sequential data such as text.
- Feature Extraction:
- In this code snippet, text data undergoes feature extraction and representation learning through an Embedding layer and a Convolutional layer (Conv1D), instead of using feature representations based on word frequencies or TF-IDF.
- Training Approach:
- The model is trained using different optimizers and loss functions. In this code, Adam optimizer and binary cross-entropy loss function are employed.
- Performance Evaluation:
- The method for performance evaluation may differ. Although both methods evaluate model performance by printing confusion matrices and accuracy, the results may vary due to differences in model architecture and training approach.
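A minimal Keras sketch of such a CNN text classifier (the architecture, toy data, and hyperparameters are my own assumptions, not the lecture's exact code):

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["great movie", "terrible film", "loved it", "awful plot"]   # toy data
labels = np.array([1, 0, 1, 0])

tok = Tokenizer(num_words=1000)
tok.fit_on_texts(texts)
X = pad_sequences(tok.texts_to_sequences(texts), maxlen=10)

model = models.Sequential([
    layers.Embedding(input_dim=1000, output_dim=32),          # learned word vectors
    layers.Conv1D(filters=16, kernel_size=2, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),                     # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=5, verbose=0)
print(model.predict(X).round())
```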
Demo Code with CNN:
KNN
- Multiple neighbours can be used to decide the category of the test object.
- The algorithm needs a similarity function for computation of the distance between neighbours.
- If the decision is based on weights, sum the similarity weights of the neighbours for each class.
- If the decision is based on majority voting, count how many of the K neighbours belong to each class.
- Using KNN to categorize documents requires vectorizing the documents first.
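A small KNN sketch for documents (TF-IDF vectors plus cosine distance; the data and parameters are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["cheap insurance offer", "car insurance quote",
        "meeting agenda today", "project meeting notes"]
labels = ["spam", "spam", "work", "work"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # vectorize the documents first

# weights="uniform" -> majority voting; weights="distance" -> weight by closeness
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine", weights="distance")
knn.fit(X, labels)
print(knn.predict(vectorizer.transform(["insurance quote for my car"])))
```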
Naive Bayes
- It assumes that the tokens in the texts are independent of each other.
- The probability of multiple attributes appearing together is equal to the individual probabilities of each of the attributes multiplied together.
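A minimal Multinomial Naive Bayes sketch on token counts (toy data, illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "cheap money offer",
        "lunch with the team", "team meeting notes"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# MultinomialNB multiplies per-token probabilities, assuming tokens are
# conditionally independent given the class
nb = MultinomialNB()
nb.fit(X, labels)
print(nb.predict(vectorizer.transform(["cheap money now"])))
```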
Which of the following are true about deep learning?
- It uses sliding window to select parts of the input matrix to extract selective features.
- At the end of the "deep layers" all of the input values are fed into a fully connected layer which is then fed into a softmax function.
- As you move deeper into the neural network, the dimensions of the matrix reduce.
- Kernels are used to select parts of the input matrix to construct a specific feature map.
After training a model, its performance is often compared with a baseline model performance.
- It can be any type of algorithm with a basic set of parameters, without any attempt at fine-tuning.
- It is the most basic, naive model.
Lecture 7 Named Entity Recognition
Dependency Tree
- Focuses on Relationships: Dependency Trees map the grammatical structure of a sentence by showing how words in a sentence depend on each other. Each word is connected directly to the words (nodes) that it depends on or that depend on it, forming a network of dependencies.
- Binary Relations: Each word (or token) in a sentence typically has exactly one parent (except the root, which has none) and zero or more children, which defines a binary grammatical relation between words, such as subject, object, modifier, etc.
- Examples of Dependencies: For the sentence "John drove to a town in Auckland yesterday to buy a Toyota for $5000," a dependency tree would illustrate direct dependencies like "drove to town," "town in Auckland," and "drove to buy."

Constituency Parse Tree
- Focuses on Phrase Structure: Constituency Parse Trees, also known as Phrase Structure Trees, break down a sentence into sub-phrases or constituents. These trees group words into nested constituents, typically labeled with syntactic category names like NP (noun phrase), VP (verb phrase), etc.
- Hierarchical Organization: This type of tree shows how small constituents combine to form larger constituents, ultimately encompassing the entire sentence. It provides a clear hierarchical structure of the sentence.
- Examples of Constituents: For the same sentence, a constituency parse tree would group "John" as an NP, "drove to a town in Auckland yesterday" as a VP, and within that VP, "to a town in Auckland" as a PP (prepositional phrase), and so forth.
(S (NP (NNP John)) (VP (VBD drove) (PP (IN to) (NP (NP (DT a) (NN town)) (PP (IN in) (NP (NNP Auckland))))) (NP (NN yesterday)) (S (VP (TO to) (VP (VB buy) (NP (DT a) (NNP Toyota)) (PP (IN for) (NP ($ $) (CD 5000))))))) (. .))

spaCy for NER
Advantages:
- Speed and Accuracy: spaCy is designed for production use, emphasizing speed and accuracy.
- Pre-trained Models: It provides access to well-optimized and updated pre-trained models for NER, which are more accurate and can recognize a variety of entity types out of the box.
- Ease of Use: spaCy's API is straightforward for tasks like tokenization, NER, and dependency parsing, making it easier to integrate into applications.
- Scalability: spaCy handles large volumes of text efficiently, making it suitable for real-world applications.
- Integration: It integrates well with other Python libraries and tools, providing pipelines for tasks beyond NER, such as dependency parsing, POS tagging, and more.
Using spaCy for NER: The given code snippet uses spaCy to load a pre-trained English model and apply it to the text contained in the variable sample. It then iterates through recognized entities and prints out their text, start position, end position, and label (see the combined sketch after the NLTK section below).
NLTK for NER
Advantages:
- Educational Use: NLTK is widely used in academia and education for teaching and studying NLP concepts due to its simplicity and comprehensive documentation.
- Customization: It allows for more customization in processing pipelines and algorithms, making it suitable for experimental NLP.
- Resource Variety: NLTK provides access to a vast array of resources for text processing, including corpora, lexical resources, and grammatical models.
Using NLTK for NER: NLTK’s approach to NER typically involves more manual setup compared to spaCy. You might need to train your models or use chunking strategies based on POS tags. Here’s a brief example using a simplistic chunking method:
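A combined sketch of the two approaches (assumes the en_core_web_sm model and the NLTK chunker resources are installed; the variable sample is a stand-in text):

```python
import nltk
import spacy

sample = "John drove to a town in Auckland yesterday to buy a Toyota for $5000."

# spaCy: pre-trained pipeline with built-in NER
nlp = spacy.load("en_core_web_sm")
for ent in nlp(sample).ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# NLTK: POS tagging followed by named-entity chunking
for res in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(res, quiet=True)
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sample)))
for subtree in tree:
    if hasattr(subtree, "label"):
        print(subtree.label(), " ".join(tok for tok, tag in subtree))
```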
1. Definition of Named Entity Recognition
- NER is the process of locating and classifying named entities mentioned in unstructured text into predefined categories such as the names of persons, organizations, locations, dates, quantities, monetary values, percentages, etc.
2. Categories of Named Entities
- Person: Names of individuals (e.g., "John Smith").
- Organization: Names of corporations, agencies, institutions (e.g., "United Nations", "Google").
- Location: Names of places, countries, cities, rivers, mountains (e.g., "Paris", "Mount Everest").
- Date/Time: Absolute or relative dates or periods (e.g., "1998", "next Friday").
- Monetary Values: Includes prices and monetary amounts (e.g., "$100", "ten euros").
- Percentages: Expressions of percentages (e.g., "twenty percent", "50%").
- Quantities: Measurements of quantities (e.g., "ten kilograms", "four liters").
- Others: Depending on the specific requirements, other categories like laws, nationalities, events, and products can also be identified.
3. Methods of NER
- Rule-based Systems: Use hand-crafted linguistic rules to identify entities based on their patterns and context in the text.
- Statistical Models: Utilize algorithms like Hidden Markov Models (HMM), Conditional Random Fields (CRF), or Support Vector Machines (SVM) based on large sets of labeled data.
- Deep Learning Approaches: Employ neural networks, particularly Recurrent Neural Networks (RNNs) and variants like Long Short-Term Memory networks (LSTM), typically requiring large volumes of annotated data.
- Transfer Learning Models: Use pre-trained models like BERT, RoBERTa, or GPT, which can be fine-tuned on smaller amounts of NER data to achieve high accuracy.
4. Applications of NER
- Content Classification: Helps in classifying content for news feeds, document management systems, etc.
- Customer Support: Automatically identifies important information in customer communications.
- Information Retrieval: Enhances search algorithms by focusing on specific entities.
- Knowledge Graph Construction: Used for extracting structured information to populate knowledge bases or graphs.
- Compliance and Monitoring: Identifies sensitive or regulated information in communications or documents.
5. Challenges in NER
- Ambiguity: Words or phrases that can be interpreted in multiple ways depending on the context (e.g., "Apple" can be a fruit or a company).
- Variability: Different expressions or spellings for the same entity (e.g., "USA" vs. "United States").
- Domain-Specific Entities: Entities that are highly specialized to a certain domain or industry may not be well-covered by general models.
- Author:wenyang
- URL:https://www.wenyang.xyz/article/textm
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!