May 10, 2026

Glossary & FAQ - Artificial Intelligence

Readers who want the main AI Glossary can find it here: Glossary - Artificial Intelligence.


1) Three Drivers of AI Innovation

Data Proliferation: Vast growth in available digital data (text, images, audio, logs, etc.) that AI systems can learn from.

Algorithm Advancement: Improved learning algorithms and architectures that can extract better patterns from data and train stronger AI models.

Computing Hardware Development: High-powered computing systems (especially GPU-based and advanced semiconductor hardware) that can process massive datasets quickly and efficiently.

2) NLP Foundations & Tasks (Practical Building Blocks)

Tokenization: Breaks raw text into smaller units called tokens (words, subwords, or characters). This is typically the first step in NLP pipelines such as language modeling and machine translation. Example: “Natural Language Processing” → ["Natural", "Language", "Processing"]. Note: Subword methods like Byte-Pair Encoding (BPE) balance vocabulary size and efficiency for large language models.
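As a minimal sketch, whitespace splitting illustrates the idea; real LLM tokenizers use subword algorithms such as BPE, which are more involved:

```python
def whitespace_tokenize(text):
    """Split raw text into word-level tokens on whitespace.

    Illustrative only: production tokenizers handle punctuation,
    casing, and subword merges (e.g. BPE) rather than plain splits.
    """
    return text.split()

print(whitespace_tokenize("Natural Language Processing"))
# ['Natural', 'Language', 'Processing']
```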

Embeddings: Dense numeric vectors representing words/sentences so that similar meanings lie closer together in vector space; used for search, clustering, and LLM understanding.

Semantic Similarity: Measuring meaning-based closeness between texts using embeddings (often via cosine similarity).

Vector Database: A database optimized to store embeddings and retrieve the most similar vectors quickly (used in semantic search and retrieval pipelines).
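A toy sketch in plain Python of cosine similarity and brute-force retrieval; the 3-dimensional vectors are made-up stand-ins for real embedding-model outputs, and a vector database replaces the linear scan below with indexing structures (e.g. HNSW):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query, vectors, k=2):
    """Brute-force nearest-neighbour search over stored embeddings."""
    scores = [cosine_similarity(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]

# Toy 3-d "embeddings"; real models emit hundreds or thousands of dimensions.
docs = [[1.0, 0.0, 0.0],   # doc 0
        [0.9, 0.1, 0.0],   # doc 1 (points in a similar direction to doc 0)
        [0.0, 0.0, 1.0]]   # doc 2 (orthogonal to the query)
query = [1.0, 0.05, 0.0]
print(top_k_similar(query, docs))  # [0, 1]
```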

Part-of-Speech (POS) Tagging: Assigns grammatical labels to words—such as noun, verb, adjective—helping downstream tasks like parsing and entity extraction. Methods include rule-based approaches, probabilistic approaches (e.g., Hidden Markov Models), and modern neural (context-aware) approaches.

Named Entity Recognition (NER): Identifies and classifies entities such as people, organizations, and locations within text. Example: “Steve Jobs” (Person), “Apple” (Organization). Typically involves tokenization, context analysis, entity classification, and ambiguity resolution.

Sentiment Analysis: Detects emotional tone in text—commonly positive, negative, or neutral—using NLP techniques such as tokenization and transformer-based classifiers (e.g., BERT-style models fine-tuned for sentiment).

Chatbots (NLP Chatbots): Conversational systems that combine tokenization, intent recognition, context handling, and response generation to support natural interactions. Modern chatbots can manage multi-turn conversation and improve over time using feedback and real usage data.

3) NLP Preprocessing & Features

Text Normalization: Cleaning text into a consistent format (lowercasing, removing extra spaces, handling punctuation) to reduce noise for downstream NLP tasks.
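A minimal normalization helper, one of many reasonable variants (which steps to apply depends on the downstream task):

```python
import re
import string

def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse extra whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  Hello,   WORLD!! "))  # 'hello world'
```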

Stopwords: Common words (e.g., “is”, “the”, “and”) that may be removed in traditional NLP pipelines to reduce dimensionality (depending on use case).

Stemming: Reducing words to crude base forms (e.g., “running” → “run”) using heuristic rules; fast but may produce non-words.

Lemmatization: Reducing words to dictionary base forms (e.g., “better” → “good”) using vocabulary + grammar; usually more accurate than stemming.
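A crude rule-based stemmer shows both the speed and the non-word outputs mentioned above (illustrative only; real stemmers such as Porter's apply many more rules, and lemmatization needs a vocabulary rather than suffix rules):

```python
def crude_stem(word):
    """Heuristic suffix stripping in the spirit of Porter-style stemmers."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if a reasonably long base remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(crude_stem("jumped"))   # 'jump'
print(crude_stem("running"))  # 'runn' (a non-word: real stemmers undouble consonants)
```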

N‑grams: Contiguous sequences of N tokens (e.g., bigrams/trigrams) used as features for traditional NLP modeling.
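A minimal n-gram extractor:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences (bigrams for n=2, etc.)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["natural", "language", "processing"], 2))
# [('natural', 'language'), ('language', 'processing')]
```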

TF‑IDF: A vectorization method that scores words by importance using term frequency and inverse document frequency.
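A sketch of one common TF-IDF variant, raw term frequency times log(N / document frequency); libraries differ in smoothing and normalization details:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each doc by term frequency * inverse document frequency."""
    n = len(docs)
    df = Counter()                      # in how many docs each term appears
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
weights = tf_idf(docs)
# "cat" appears in 2 of 3 docs, so it is weighted lower than "sat" (1 of 3).
```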


4) India-Focused Multilingual AI (Indic Languages & Speech)

Morni (Multimodal Representation for India) – Google DeepMind: A project targeting around 125 Indic languages and dialects to build AI models that can understand and process India’s linguistic diversity, including many under-resourced languages with limited digital content.

Project Vaani: An open-source speech data initiative supporting the creation of large-scale speech datasets for Indian languages, enabling translation, voice AI, and broader accessibility.

5) Major Model Families 

PaLM 2 (Pathways Language Model 2): Google’s large language model family built on the Pathways architecture for efficient scaling across multilingual tasks, reasoning, and code generation.

Med‑PaLM 2: A medical-domain model built on PaLM 2, fine-tuned on medical datasets for clinical question answering, summarization, and medical text insights.

Llama 2: Meta’s family of pretrained and chat-optimized models (7B to 70B parameters), trained for dialogue and widely used in open model experimentation.

Claude 2: Anthropic’s assistant model designed to be helpful and safe, known for improved reasoning, coding capability, and longer-context interactions.

BERT: A transformer-based language understanding model known for strong performance in tasks like classification, NER, and question answering.

GPT (Generative Pre-trained Transformer family): A family of large generative models designed for text creation, coding, and reasoning, known for broad general-purpose capability.

6) Open AI Ecosystem & Tooling

Hugging Face: An open-source AI platform and community hub providing access to a large collection of pretrained models, datasets, and demos across NLP, vision, audio, and multimodal AI.

Model Hub: A central repository for discovering, sharing, and collaborating on AI models; commonly used to publish model checkpoints and run inference.

Transformers Library (Hugging Face): A popular library that simplifies tokenization, model loading, fine-tuning, evaluation, and inference for many state-of-the-art transformer models.

Datasets & Tools (Hugging Face): Utilities that streamline dataset loading and experimentation, plus “Spaces” for interactive demos; also includes enterprise options like private hubs and security features.

7) Deployment & Efficiency

Quantization: Reducing numeric precision (e.g., from FP16/FP32 to INT8/INT4) to speed up inference and reduce memory usage.
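A simplified sketch of symmetric int8 quantization; real inference runtimes quantize per tensor or per channel, often with calibration data:

```python
def quantize_int8(values):
    """Map floats into [-127, 127] integers with a single shared scale."""
    scale = max(abs(v) for v in values) / 127
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [x * scale for x in q]

weights = [0.5, -1.2, 0.03, 1.27]
q, s = quantize_int8(weights)
restored = dequantize(q, s)  # close to the originals, within quantization error
```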

Distillation: Training a smaller “student” model to mimic a larger “teacher” model, improving efficiency while retaining performance.

Latency: Time taken to produce a response (often measured per request or per token).

Throughput: How many requests/tokens per second a system can process.

8) Speech + Language Stack (Audio → Text → Voice)

Speech Data (Audio): Raw voice recordings used to train speech AI systems. Speech captures acoustic features like pitch, tone, and phonemes; supervised datasets include transcripts.

Speech‑to‑Text (ASR – Automatic Speech Recognition): Converts spoken audio into written text using acoustic modeling and language modeling (increasingly neural approaches) for transcription and voice search.

Text‑to‑Speech (TTS): Converts text into natural-sounding speech using neural speech synthesis, supporting prosody and accents for voice assistants and accessibility use cases.

Spectrogram: A time–frequency visual representation of audio energy; commonly used as input features for speech models.

Mel‑Spectrogram: A spectrogram mapped to the mel scale (closer to human hearing); widely used in TTS and ASR feature extraction.

Phoneme: The smallest unit of sound in speech; useful in pronunciation modeling and TTS.

Speaker Diarization: Splitting audio by “who spoke when,” useful in meetings, call centers, and multi-speaker recordings.

9) Perplexity AI (Answer Engine)

Perplexity AI: An AI-powered search and answer engine designed to provide conversational answers with citations by combining large language models with web search.

10) LLM Generation & Decoding

Inference: Using a trained model to generate outputs (predictions) on new inputs; unlike training, weights do not change during inference.

Decoding: The method used to convert probability distributions over tokens into actual text output.

Top‑k Sampling: At each step, restrict token choices to the top k most probable tokens, then sample from them.

Top‑p (Nucleus) Sampling: Choose the smallest set of tokens whose cumulative probability exceeds p, then sample from that set (adaptive alternative to top‑k).
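The two filtering strategies can be sketched over a toy token distribution (the probabilities below are made up for illustration; one would then sample from the filtered distribution, e.g. with random.choices):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {t: p / total for t, p in top}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, cum = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cum += prob
        if cum >= p:
            break
    total = sum(kept.values())
    return {t: prob / total for t, prob in kept.items()}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_k_filter(probs, 2))    # only 'the' and 'a', renormalized
print(top_p_filter(probs, 0.9))  # 'the', 'a', 'cat' (cumulative 0.95 >= 0.9)
```

Note how top-p adapts: with a flatter distribution it keeps more tokens, while top-k always keeps exactly k.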

Beam Search: Keeps multiple best candidate sequences at once to find a higher‑probability output; common in translation and structured generation.

11) How Do LLMs Work? (High-Level Steps)

Step 1: Tokenization – Break the input text into tokens.
Step 2: Embeddings – Convert tokens into numeric vectors representing meaning.
Step 3: Self‑Attention – Identify which parts of the text matter most for context.
Step 4: Prediction – Predict the next token based on context.
Step 5: Response Generation – Repeat prediction to form a coherent response.
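The five steps above can be sketched with a toy stand-in model: a bigram lookup table (an assumption for illustration) replaces the transformer's embedding, attention, and prediction machinery, leaving only the autoregressive loop:

```python
# A toy next-token "model": a bigram lookup table standing in for the
# transformer's self-attention and prediction steps (illustrative only).
bigram_model = {
    "<s>": "the", "the": "cat", "cat": "sat", "sat": "down", "down": "<e>",
}

def generate(model, max_tokens=10):
    """Repeat next-token prediction (Step 4) until an end token (Step 5)."""
    token, output = "<s>", []
    for _ in range(max_tokens):
        token = model[token]      # 'predict' the next token from context
        if token == "<e>":
            break
        output.append(token)
    return " ".join(output)

print(generate(bigram_model))  # 'the cat sat down'
```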

12) Evaluation Metrics (NLP + Speech)

Perplexity (Metric): Measures how well a language model predicts tokens; lower perplexity generally means better predictive fit on similar text.
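A small worked example (the token probabilities are made up): perplexity is the exponential of the average negative log-probability the model assigned to each observed token.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability; lower is better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Model A assigns higher probability to the observed tokens than model B,
# so its perplexity is lower.
print(perplexity([0.5, 0.4, 0.6]))   # ~2.03
print(perplexity([0.1, 0.2, 0.05]))  # ~10.0
```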

Precision: Of the predicted positives, how many were correct.

Recall: Of the actual positives, how many were found.

F1 Score: Harmonic mean of precision and recall; common for imbalanced classification and NER.
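A set-based sketch using hypothetical entity sets, as one would score NER predictions against gold labels:

```python
def precision_recall_f1(predicted, actual):
    """Set-based precision, recall, and F1 (e.g. over NER entity spans)."""
    tp = len(predicted & actual)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pred = {"Steve Jobs", "Apple", "Paris"}       # hypothetical model output
gold = {"Steve Jobs", "Apple", "Microsoft"}   # hypothetical gold labels
print(precision_recall_f1(pred, gold))  # (0.666..., 0.666..., 0.666...)
```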

BLEU: Metric often used to evaluate machine translation by comparing overlap with reference translations.

ROUGE: Metric family often used for summarization evaluation based on overlap with reference summaries.

WER (Word Error Rate): Standard ASR metric measuring speech-to-text errors as (substitutions + deletions + insertions) divided by the number of words in the reference transcript.
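A minimal WER implementation via word-level edit distance (one standard formulation; production scoring tools add text normalization first):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat down", "the cat sat"))  # 0.25 (one deletion, 4 ref words)
```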


13) LLM Security & Operational Risks

Prompt Injection: A malicious prompt designed to override instructions or extract hidden/system information.

Data Leakage: Sensitive data appearing in outputs due to training exposure, retrieval exposure, or unsafe prompting.

Jailbreak: Prompt strategies intended to bypass safety rules or behavioral constraints.
