Stemming vs Lemmatization: Key Differences + When to Use Each
- Muiz As-Siddeeqi

Every search you run, every chatbot you talk to, every AI assistant you use—they all rely on a hidden battle happening behind the scenes. Two text processing techniques fight for dominance: stemming and lemmatization. One is fast but messy. The other is accurate but slow. Pick wrong, and your search engine returns garbage. Your sentiment analysis fails. Your AI misunderstands everything.
This isn't academic theory. In 2022, a systematic literature review analyzing 33 research studies found that lemmatization consistently outperforms stemming in sentence similarity tasks, yet stemming remains widely used due to its speed advantage (Pramana et al., November 2022). The stakes are real: search engines process billions of queries daily, and even a 1% improvement in accuracy translates to millions of better results.
TL;DR
Stemming chops word endings using simple rules—fast but often produces non-words like "studi" from "studies"
Lemmatization uses linguistic analysis to return proper dictionary words—slower but semantically accurate
Speed gap: Stemming processes text 5-10x faster than lemmatization for large datasets
Accuracy gap: A context-aware approach cut word-normalization error rates from 76.7% (standard Porter stemmer) to 6.7% (Agbele et al., 2012)
Use stemming for search engines, large-scale text indexing, and speed-critical applications
Use lemmatization for sentiment analysis, chatbots, machine translation, and accuracy-critical NLP tasks
Stemming removes word endings through rule-based truncation to create stems (e.g., "running" → "run"), while lemmatization uses morphological analysis and part-of-speech tagging to return dictionary forms (lemmas). Stemming is faster but less accurate; lemmatization is slower but linguistically precise, making it ideal for applications requiring semantic understanding like chatbots and sentiment analysis.
What Are Stemming and Lemmatization?
Both stemming and lemmatization reduce words to their base forms, but they take radically different paths to get there.
Stemming is a crude, rule-based process that strips prefixes and suffixes from words. Think of it as taking a machete to text—fast, effective, but rough around the edges. The word "running" becomes "run," but "studies" becomes "studi" (not a real word). Stemming doesn't care about language rules or meaning. It just chops.
Lemmatization is a linguistic process that analyzes word structure, grammar, and context. It uses dictionaries and morphological analysis to return the proper base form. "Running" becomes "run," and "studies" correctly becomes "study." Lemmatization understands that "better" is the comparative form of "good" and handles it accordingly.
The distinction matters enormously. According to Stanford NLP research, stemming "chops off the ends of words in the hope of achieving this goal correctly most of the time," while lemmatization "does things properly with the use of a vocabulary and morphological analysis of words" (Stanford NLP Book). In production systems handling millions of documents, this difference between hope and certainty drives everything from search quality to chatbot accuracy.
How Stemming Works
The Algorithm Behind the Machete
Stemming algorithms follow a simple pattern: identify common suffixes, check conditions, strip them off. The most famous example is the Porter Stemmer, developed by Martin Porter in 1979 and published in 1980.
The Porter algorithm runs through five sequential phases of word reduction. Each phase contains rules for handling specific suffix patterns. For example:
Phase 1: Strip plural endings (e.g., "caresses" → "caress," "ponies" → "poni")
Phase 2: Remove derivational endings (e.g., "relational" → "relate")
Phase 3: Continue derivational stripping (e.g., "electricity" → "electric")
Phase 4 & 5: Final cleanup and edge cases
Each rule includes conditions. A simple one: if a word has the pattern [consonant-vowel-consonant] and ends in a double consonant, remove the last character. So "hopping" becomes "hop."
The full Porter algorithm comprises roughly sixty such rules, all applied sequentially. The original implementation from 1980 remains widely used today—a testament to its empirical effectiveness despite its simplicity (Porter, 1980).
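To see these rules in action, here's a minimal sketch using NLTK's off-the-shelf PorterStemmer (assuming NLTK is installed); the outputs shown are what the published algorithm produces for these words:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ["caresses", "ponies", "hopping", "studies", "running"]:
    print(word, "→", porter.stem(word))
# caresses → caress, ponies → poni, hopping → hop, studies → studi, running → run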
Types of Stemmers
Porter Stemmer: The industry standard for English. Used in everything from Elasticsearch to Solr. Lightweight, fast, well-tested.
Snowball Stemmer (Porter2): An improved version supporting multiple languages—French, German, Spanish, Russian, and more. About 15% more accurate than Porter for many languages (tartarus.org).
Lovins Stemmer: One of the earliest (1968), more aggressive than Porter. Contains 260 rules but can over-stem words.
Lancaster Stemmer (Paice-Husk): The most aggressive. Strips words down to very short stems. Good for precision-focused retrieval, risky for recall.
According to a 2021 study comparing Porter and enhanced Porter algorithms, modified versions achieved 92% accuracy in stemming performance, up from Porter's baseline (Polus & Abbas, February 2021).
When Stemming Goes Wrong
Stemming creates two types of errors:
Overstemming happens when different words get reduced to the same stem. "Universal" and "university" both become "univers"—they're not synonyms. "Requirement" and "requires" both stem to "requir," losing important semantic differences (Coursera, June 2025).
Understemming happens when related words get different stems. In some languages, irregular verb forms don't stem correctly. "Run," "ran," and "running" might not all reduce to the same stem, fragmenting your index.
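Both failure modes are easy to reproduce with NLTK's Porter implementation—a quick sketch, assuming NLTK is available:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
# Overstemming: unrelated words collapse to the same stem
print(porter.stem("universal"), porter.stem("university"))             # univers univers
# Understemming: related forms of the same verb keep different stems
print(porter.stem("run"), porter.stem("ran"), porter.stem("running"))  # run ran run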
A 2016 multilingual study found that stemming accuracy varies dramatically by language. For English, Porter stemmer error rates range from 15-25% depending on the corpus. For highly inflected languages like Finnish, error rates can exceed 30% (Flores & Moreira, April 2016).
How Lemmatization Works
The Linguistic Approach
Lemmatization treats text processing as a linguistic problem, not a string manipulation problem. The process involves three key steps:
Step 1: Tokenization
Break text into individual words (tokens). Advanced tokenizers handle contractions, hyphenated words, and edge cases.
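As a quick illustration (a sketch assuming NLTK and its punkt tokenizer data are installed), NLTK's word_tokenize splits contractions into separate tokens while leaving hyphenated words intact:
from nltk.tokenize import word_tokenize

print(word_tokenize("The state-of-the-art model doesn't overfit"))
# ['The', 'state-of-the-art', 'model', 'does', "n't", 'overfit']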
Step 2: Part-of-Speech (POS) Tagging
Determine each word's grammatical role. Is "running" a verb (they are running) or an adjective (running water)? The POS tag determines the correct lemma.
Step 3: Dictionary Lookup with Morphological Rules
Consult a lexical database like WordNet to find the word's base form. Apply morphological rules specific to that word class. Return the lemma—always a valid dictionary word.
For example, lemmatizing "better":
POS tagging identifies it as an adjective
Morphological analysis recognizes it as the comparative form
Dictionary lookup returns "good" as the lemma
This three-stage process requires significantly more computational resources than stemming, but produces linguistically valid results (GeeksforGeeks, July 2024).
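Here's a minimal sketch of all three steps using NLTK (assuming the punkt, tagger, and WordNet data are downloaded); the penn_to_wordnet helper is an illustrative mapping written for this example, not part of NLTK itself:
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the WordNet POS the lemmatizer expects (noun by default)
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

tokens = word_tokenize("The children were running toward a better playground")  # Step 1
for word, tag in pos_tag(tokens):                                               # Step 2
    print(word, "→", lemmatizer.lemmatize(word, penn_to_wordnet(tag)))          # Step 3
# Typically: children → child, were → be, running → run, better → good (when tagged as an adjective)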
Types of Lemmatization
Rule-Based Lemmatization: Uses predefined grammatical rules. For regular verbs in English: remove "-ed" for past tense, "-ing" for progressive. Fast but struggles with irregular forms.
Dictionary-Based Lemmatization: Looks up each word in a comprehensive dictionary. Handles irregularities well but requires large lexical databases. WordNet contains over 155,000 unique strings for English.
Machine Learning-Based Lemmatization: Trains models on annotated corpora to predict lemmas. The EditTreeLemmatizer in spaCy v3.3+ learns form-to-lemma transformations from training data, often achieving higher accuracy than lookup-based methods (spaCy Documentation, 2025).
According to a 2014 comparative study, lemmatization "produced better precision compared to stemming" in information retrieval systems, though the differences were sometimes statistically insignificant for simple queries (Balakrishnan & Lloyd-Yemoh, 2014).
The Accuracy Advantage
Lemmatization's linguistic foundation gives it a crucial advantage: context awareness.
Consider the word "saw":
As a verb (past tense of "see"): lemma = "see"
As a noun (cutting tool): lemma = "saw"
Stemming reduces both occurrences to the same stem ("saw") regardless of context. Lemmatization uses the surrounding sentence structure to choose correctly.
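A short sketch with spaCy (assuming the en_core_web_sm model is installed) makes the difference visible:
import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("She saw a rusty saw on the workbench"):
    if token.text == "saw":
        print(token.text, token.pos_, "→", token.lemma_)
# Expected: the first "saw" (VERB) → "see", the second "saw" (NOUN) → "saw"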
In a 2012 context-aware analysis study, researchers developed a modified stemming algorithm that reduced error rates from 76.7% (standard Porter) to 6.7% by incorporating contextual cues—essentially moving toward lemmatization (Agbele et al., 2012).
Key Differences: Side-by-Side Comparison
| Aspect | Stemming | Lemmatization |
| --- | --- | --- |
| Method | Rule-based suffix removal | Morphological analysis + dictionary lookup |
| Output | Stem (may not be a valid word) | Lemma (always a valid dictionary word) |
| Speed | Very fast (5-10x faster for large datasets) | Slower (requires POS tagging and lookup) |
| Accuracy | 70-85% accurate (language-dependent) | 90-98% accurate with proper POS tagging |
| Context Awareness | No—processes words independently | Yes—uses POS tags and sentence context |
| Language Support | Good for English, limited for others | Better for morphologically complex languages |
| Use Case | Search engines, text indexing, IR systems | Sentiment analysis, chatbots, machine translation |
| Examples | "studies" → "studi" | "studies" → "study" |
| | "running" → "run" | "running" → "run" (verb) or "running" (adj) |
| | "better" → "better" | "better" → "good" |
| Error Types | Overstemming, understemming | Rare, but POS tagging errors propagate |
| Computational Cost | Low (simple string operations) | High (dictionary lookups, ML models) |
Key Insight: According to a 2024 LinkedIn analysis, "stemming reduces words to their root forms, simplifying text processing and improving performance, but may result in inaccuracies due to overgeneralization. Lemmatization, being more precise, enhances accuracy but demands more resources due to its complexity" (LinkedIn Expert Panel, March 2024).
Performance Benchmarks and Real Data
Let's look at hard numbers from peer-reviewed research and production systems.
Speed Benchmarks
In a 2018 GitHub issue on the spaCy repository, a developer reported processing 164,758 news articles:
NLTK lemmatization with multiprocessing: 5 minutes total
spaCy lemmatization (without optimization): Estimated 4.5 hours
That's a 54x difference. However, the developer was misusing spaCy's API. When properly optimized using batch processing with nlp.pipe(), spaCy's performance improves dramatically (spaCy GitHub Issue #1837, January 2018).
A 2024 comparison showed that for a corpus of 100 articles:
Porter stemming (NLTK): Processed in seconds
spaCy lemmatization: Slightly slower but within comparable range when optimized
Effective token reduction: spaCy reduced corpus to fewer unique tokens due to better lemmatization (NewsCatcher, March 2024)
Accuracy Metrics
Information Retrieval Performance:
A 2020 study on sentence retrieval using TREC novelty track data found:
Lemmatization with long queries: Superior MAP (Mean Average Precision) scores
Stemming with short queries: Better performance for quick lookups
Overall: "Lemmatization produces better results with longer queries, while stemming shows worse results with longer queries" (Doko et al., ASTE Journal, October 2025)
Clustering Performance:
Research on Finnish text document clustering (a morphologically complex language) showed:
Precision improvement: Lemmatization with average linkage and Ward's methods produced higher precision than stemming
Highly relevant documents: Lemmatization recovered highly relevant documents significantly better
Conclusion: "Lemmatization is a better word normalization method than stemming when Finnish text documents are clustered for information retrieval" (Korenius et al., ACM CIKM 2004)
Cross-Language Results:
A 2016 multilingual study across English, French, Portuguese, and Spanish found a surprising result: "The most accurate stemmer was not the one to have the biggest improvement in Information Retrieval, in none of the languages" (Flores & Moreira, ScienceDirect, April 2016).
This counterintuitive finding suggests that stemming accuracy and IR effectiveness don't always correlate—sometimes a slightly less accurate stemmer groups documents more effectively for retrieval.
Library Performance: NLTK vs spaCy
NLTK (Natural Language Toolkit):
Educational focus, flexible but slower
String-based processing
Better for experimentation and research
spaCy:
Production-optimized, 50-1000x faster for batch processing (when properly configured)
Object-oriented architecture
Built-in lemmatization uses trained statistical models
"spaCy consistently demonstrates superior performance in lemmatization compared to NLTK's stemming approaches" (Bastaki Software Solutions, March 2025)
According to spaCy's official benchmarks, its industrial-strength pipeline processes text efficiently on both CPU and GPU, with accuracy validated against multiple NLP benchmarks (spaCy Facts & Figures, 2025).
When to Use Stemming
Stemming shines when speed trumps perfection. Here's when to reach for it:
1. Search Engines and Information Retrieval
Google, Bing, Elasticsearch, Solr—they all use stemming variants for index generation. When you're processing billions of documents, stemming's speed advantage becomes critical.
Why it works: Stemming conflates word forms into broader equivalence classes, which tends to increase recall at some cost in precision (Stanford NLP IR Book). When indexing billions of documents, that tradeoff is usually acceptable—the crude approach helps by creating broader matches.
Real example: Elasticsearch uses Snowball stemmers across 15+ languages by default. Their architecture prioritizes query speed—returning results in milliseconds matters more than perfect linguistic analysis.
2. Large-Scale Text Indexing
Processing terabytes of text data? Stemming lets you:
Reduce vocabulary size by 30-50%
Speed up indexing by 5-10x
Decrease storage requirements significantly
A 2010 study on MapReduce implementation showed that an enhanced Porter stemmer with partitioner (PSP) provided "20-25% more stemming capacity than Lovins stemmer and 3-15% more capacity than standard Porter stemmer" when processing massive datasets (Achieving Magnitude Order Improvement, April 2010).
3. Real-Time Applications
Chatbots that need sub-second responses, live sentiment analysis of Twitter streams, real-time spam detection—these applications can't wait for lemmatization's computational overhead.
4. Simple Text Analytics
For basic text classification, keyword extraction, and topic modeling where perfect linguistic accuracy isn't critical, stemming's good-enough approach works fine.
5. Languages with Simple Morphology
English and other languages with relatively simple inflection patterns work reasonably well with stemming. The error rate is manageable, and the speed gain is substantial.
Key Decision Factor: Choose stemming when you need to process large volumes quickly and can tolerate 15-25% error rates in word normalization. The errors usually don't significantly hurt aggregate performance across millions of documents.
When to Use Lemmatization
Lemmatization is your choice when meaning matters more than milliseconds.
1. Sentiment Analysis and Opinion Mining
Understanding whether a review is positive or negative requires semantic accuracy. Stemming errors that leave "better" disconnected from "good," or that conflate unrelated words, can weaken or even flip sentiment scores.
According to GeeksforGeeks, "Sentiment analysis: Lemmatization preserves word meaning, leading to more accurate sentiment classification" (July 2024). When brands analyze millions of customer reviews, accuracy directly impacts business decisions.
Industry application: According to TechTarget, lemmatization is "an important part of natural language understanding and NLP. It also plays an important role in big data analytics and AI" for sentiment analysis (TechTarget Definition).
2. Chatbots and Virtual Assistants
Alexa, Siri, Google Assistant, customer service chatbots—they need to understand user intent precisely. Lemmatization helps these systems:
Disambiguate word meanings based on context
Handle irregular verb forms correctly
Maintain semantic consistency across conversations
"In NLP, lemmatization helps an AI or ML tool understand and converse with end users accurately" (TechTarget). When a user asks "What time does the store close?" vs "What time did the store close?", lemmatization correctly identifies both as forms of "close" while preserving tense information.
3. Machine Translation
Google Translate and DeepL rely on lemmatization for accurate cross-language translation. Stemming's crude approach fails catastrophically when translating between languages with different morphological systems.
4. Content-Based Recommendation Systems
Netflix, Spotify, YouTube—they analyze content semantics to make recommendations. Lemmatization helps these systems:
Group related content accurately
Understand subtle semantic differences
Preserve meaning across inflected forms
5. Academic and Legal Text Analysis
When analyzing legal documents, academic papers, or medical records, precision isn't optional—it's required. A stemming error could conflate distinct legal concepts or medical terms.
6. Highly Inflected Languages
For Finnish, Turkish, Arabic, Russian, and other morphologically rich languages, lemmatization is almost mandatory. Stemming error rates for these languages often exceed 30%, making it practically unusable.
According to Bitext, for languages like Greek where "a typical verb has different stems for perfective forms and imperfective ones," only lemmatization can correctly group related forms (Bitext Blog, May 2023).
Key Decision Factor: Choose lemmatization when you need semantic accuracy, are working with complex languages, or building applications where understanding meaning is critical (even if it costs you processing time).
Case Studies: Real-World Applications
Case Study 1: Babel Street Analytics—Multilingual Search Platform
Company: Babel Street (intelligence and risk mitigation platform)
Challenge: Enable accurate cross-language search across European languages with complex morphology
Date: February 2025
Problem: Standard stemming approaches were failing for European languages where word forms change dramatically based on usage. Searching for "celebrities" would incorrectly return results about "celebrations" due to stemming errors.
Solution: Babel Street implemented lemmatization as a standard feature across their analytics platform. Their linguists and engineers collaborated to build lemmatization support for multiple European languages.
Results:
"Studies have shown that lemmatization is significantly more accurate than stemming in many European languages"
Customers gained "high-quality search across multiple languages"
Search accuracy improved dramatically for morphologically complex queries
"So let's review the previous example: 'celebrities' is searched, but with lemmatization utilized by the search engine, the query is correctly interpreted as 'celebrity,' not 'celebration,' enabling the search engine to deliver the right results" (Babel Street Blog, February 2025).
Key Takeaway: For production search systems handling multiple languages, lemmatization's accuracy advantage outweighed its computational cost. The company positioned lemmatization as a competitive differentiator.
Case Study 2: Arabic Sentiment Analysis Study
Research: Systematic Literature Review of Arabic Text Processing
Scope: 2,024 documents analyzed, 33 studies selected
Publication: November 2022
Challenge: Arabic language has complex morphology with heavy inflection patterns. Determine whether stemming or lemmatization performs better for sentence similarity and sentiment classification.
Methodology: Researchers tested 10 different stemming and lemmatization algorithms on Arabic text using:
Support Vector Machines (SVM)
Stochastic Gradient Descent (SGD)
Naïve Bayesian (NB) classifiers
Standard datasets from SemEval-2017 Task 1, Track 1
Results:
"Compared to the original text, using the stemmed and lemmatized documents in experiments achieve enhanced Pearson correlation results"
Lemmatization produced better semantic preservation for sentiment analysis
Both techniques improved performance over using original text
Lemmatization's advantage was more pronounced for longer, more complex sentences
Conclusion: "Previous studies have found the differences between stemming and lemmatization is usually insignificant in terms of accuracy, stemming has been used more widely than lemmatization as it offers similar performance to lemmatization while having faster" processing (Pramana et al., ResearchGate, November 2022).
Key Takeaway: For Arabic NLP tasks, the choice between stemming and lemmatization depends on whether speed or accuracy is the priority. Lemmatization wins for accuracy-critical applications.
Case Study 3: Finnish Document Clustering Study
Research: Text Document Clustering for Information Retrieval
Institutions: University of Helsinki, University of Tampere
Publication: ACM CIKM 2004
Challenge: Finnish is an agglutinative language with extensive inflection. Standard English stemming approaches fail catastrophically. Determine optimal text normalization method for clustering Finnish news documents.
Methodology: Tested four hierarchical clustering methods (single linkage, complete linkage, average linkage, Ward's method) with:
No normalization (baseline)
Stemming
Lemmatization
Evaluated using precision metrics across multiple relevance scales.
Results:
Precision: "In comparison with stemming, lemmatization together with the average linkage and Ward's methods produced higher precision"
Highly relevant documents: "The stringent relevance scale showed that lemmatization allowed the single and complete linkage methods to recover especially the highly relevant documents better than stemming"
Overall effectiveness: Clear superiority of lemmatization for morphologically complex language
Conclusion: "We conclude that lemmatization is a better word normalization method than stemming, when Finnish text documents are clustered for information retrieval" (Korenius et al., ACM CIKM, 2004).
Key Takeaway: For languages with rich morphology, lemmatization isn't just better—it's often the only viable option. The accuracy gain justifies the computational cost.
Tools and Libraries
NLTK (Natural Language Toolkit)
Best for: Learning, experimentation, research
Stemming support: Excellent (Porter, Snowball, Lancaster, Lovins)
Lemmatization support: Good (WordNet-based)
Speed: Moderate (string-based processing)
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
# Stemming
stemmer = PorterStemmer()
stemmer.stem("running") # Output: "run"
stemmer.stem("studies") # Output: "studi"
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos=wordnet.VERB) # Output: "run"
lemmatizer.lemmatize("better", pos=wordnet.ADJ) # Output: "good"Pros:
Over 50 corpora and lexical resources included
Extensive documentation and learning materials
Flexible, customizable algorithms
Cons:
Slower than production-focused libraries
Requires explicit POS tagging for good lemmatization results
String-based architecture less efficient for large-scale processing
Latest version: NLTK 3.9.1 (as of 2025)
spaCy
Best for: Production applications, scalable NLP pipelines
Stemming support: None built-in (by design—stemming considered less accurate)
Lemmatization support: Excellent (trained statistical models)
Speed: Very fast (especially with batch processing)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running towards the playground")
for token in doc:
print(f"{token.text} → {token.lemma_}")
# Output:
# The → the
# children → child
# were → be
# running → run
# towards → towards
# the → the
# playground → playground
Pros:
"spaCy is specifically designed for production use" (spaCy Documentation)
Integrated POS tagging, dependency parsing, NER
Supports 70+ languages
GPU and CPU optimization
Modern transformer integration (BERT, RoBERTa)
Cons:
Less flexibility for experimentation
Steeper learning curve initially
Larger memory footprint
Latest versions:
spaCy 3.7 (2025)
Three lemmatizer types: lookup, rule-based, EditTreeLemmatizer (trainable, introduced in v3.3)
Performance note: "spaCy consistently demonstrates superior performance in lemmatization compared to NLTK's stemming approaches" (Bastaki Software, March 2025). However, proper usage of batch processing is critical—improper use can make spaCy much slower than NLTK.
Snowball (Stemming Only)
Best for: Multilingual stemming
Languages: 15+ including English, French, German, Spanish, Russian, Portuguese, Arabic
Speed: Extremely fast
Usage: Often integrated into search engines (Elasticsearch, Solr)
The Snowball framework lets you implement stemming algorithms in a high-level language, then compile to efficient C code. Porter2 (improved Porter stemmer) is the recommended English version.
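NLTK bundles the Snowball stemmers, so trying them across languages takes a few lines—a sketch assuming NLTK is installed:
from nltk.stem import SnowballStemmer

english = SnowballStemmer("english")   # the Porter2 algorithm
spanish = SnowballStemmer("spanish")
print(english.stem("generously"))      # generous
print(english.stem("studies"))         # studi
print(spanish.stem("corriendo"))       # a Spanish-specific stem of the gerund "corriendo"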
Gensim
Best for: Topic modeling, document similarity
Stemming: Limited built-in support
Lemmatization: Good (via pattern library or spaCy integration)
Speed: Optimized for large corpora
Gensim focuses on unsupervised learning and works well with either stemmed or lemmatized text as preprocessing.
Stanza (Stanford NLP)
Best for: Multilingual, academic-grade NLP
Lemmatization: Excellent (neural network-based)
Languages: 60+ languages with trained models
Speed: Moderate (neural models are computationally intensive)
Stanza provides state-of-the-art accuracy but requires more computational resources than rule-based approaches.
Tool Selection Matrix
| Your Need | Recommended Tool |
| --- | --- |
| Learning NLP basics | NLTK |
| Production web app | spaCy |
| Multilingual search engine | Snowball stemmers |
| Academic research | NLTK or Stanza |
| Real-time chatbot | spaCy (with batch processing) |
| Topic modeling | Gensim (with preprocessing) |
| Maximum accuracy | spaCy or Stanza lemmatization |
| Maximum speed | Snowball stemming |
Common Mistakes and How to Avoid Them
Mistake 1: Using Stemming for Sentiment Analysis
The Problem: Stemming's crude approach destroys semantic nuances critical for sentiment detection.
Example:
Original: "The product quality is better than expected"
Stemmed: "The product qualiti is better than expect"
Issue: "better" remains unchanged by most stemmers, but loses its connection to "good" in semantic analysis
Fix: Use lemmatization with proper POS tagging for sentiment analysis. Accept the computational cost as necessary for accuracy.
Mistake 2: Forgetting POS Tagging for Lemmatization
The Problem: Without POS tags, lemmatizers default to noun treatment, producing incorrect results.
Example:
# Without POS tag
lemmatizer.lemmatize("running") # Output: "running" (treated as noun)
# With POS tag
lemmatizer.lemmatize("running", pos=wordnet.VERB) # Output: "run"Fix: Always provide POS tags to your lemmatizer. Use NLTK's pos_tag() or spaCy's built-in POS tagging.
Mistake 3: Not Optimizing spaCy for Batch Processing
The Problem: Processing documents one-by-one in spaCy is extremely slow.
Bad code:
for text in documents:
    doc = nlp(text)  # processes each document separately
Good code:
for doc in nlp.pipe(documents, batch_size=50):
    ...  # documents are processed in optimized batches
Impact: The GitHub issue mentioned earlier showed a 54x performance difference between optimized and unoptimized spaCy usage.
Mistake 4: Using English Stemmers for Other Languages
The Problem: Porter stemmer is designed for English. Using it on Spanish, French, or other languages produces garbage.
Fix: Use language-specific stemmers:
Spanish: Snowball Spanish stemmer
French: Snowball French stemmer
German: Snowball German stemmer
Arabic: Dedicated Arabic light stemmers
Alternatively, use lemmatization with language-specific models (spaCy supports 70+ languages).
Mistake 5: Expecting Perfect Accuracy from Either Method
The Problem: Both techniques have limitations. Stemming makes errors; lemmatization depends on correct POS tagging.
Reality Check: According to research, even the best lemmatizers achieve 90-98% accuracy, not 100%. Context-aware errors still occur.
Fix: Understand your accuracy requirements upfront. For some applications, 85% accuracy with 10x speed (stemming) beats 95% accuracy at 1x speed (lemmatization).
Mistake 6: Not Removing Stop Words First
The Problem: Processing stop words ("the," "is," "and") through stemming/lemmatization wastes computation.
Fix: Always remove stop words during preprocessing, before applying stemming or lemmatization.
# Good preprocessing pipeline
1. Tokenization
2. Stop word removal
3. Stemming/Lemmatization
4. Further processing
Myths vs Facts
Myth 1: "Lemmatization is always better than stemming"
Reality: Depends on your use case. A 2016 study found that "the most accurate stemmer was not the one to have the biggest improvement in Information Retrieval, in none of the languages" tested (Flores & Moreira, 2016).
For search engines indexing billions of documents, stemming's speed advantage often outweighs lemmatization's accuracy gain. The aggregate performance across millions of queries can be similar.
Verdict: False. Context determines which is "better."
Myth 2: "Stemming always produces non-words"
Reality: Stemming produces non-words for many inputs, but not always. Common words like "run," "walk," "test" remain unchanged or become valid stems.
The Porter stemmer's goal is to create equivalence classes, not valid English words. But in practice, many stems happen to be valid words.
Verdict: Mostly false. Stems are often—but not always—non-words.
Myth 3: "Lemmatization is too slow for production use"
Reality: Major production systems successfully use lemmatization:
Google uses lemmatization variants in their search algorithms
Babel Street's analytics platform uses lemmatization as a standard feature
Enterprise chatbots from IBM, Microsoft, and others rely on lemmatization
With proper optimization (batch processing, GPU acceleration, caching), lemmatization runs fast enough for real-world applications.
Verdict: False. Proper engineering makes lemmatization production-viable.
Myth 4: "You should always use the same technique throughout your pipeline"
Reality: Hybrid approaches work well. You might:
Use stemming for initial filtering/indexing
Use lemmatization for final semantic analysis
Use stemming for speed-critical real-time responses
Use lemmatization for accuracy-critical batch processing
Verdict: False. Mix and match based on stage-specific requirements.
Myth 5: "Stemming and lemmatization only matter for English"
Reality: These techniques are even MORE important for morphologically rich languages. Finnish, Turkish, Arabic, and Russian benefit enormously from proper normalization.
In fact, for highly inflected languages, lemmatization often isn't optional—stemming's error rates become prohibitively high.
Verdict: False. Non-English languages benefit even more.
Myth 6: "Modern transformers like BERT make stemming/lemmatization obsolete"
Reality: This is partially true but nuanced. Transformer models like BERT do learn subword representations that capture morphological variations. However:
Preprocessing with lemmatization can still improve downstream task performance
Not every application can afford transformer computational costs
Hybrid approaches (lemmatization + transformers) often perform best
According to LinkedIn's NLP experts, "In LLMs, lemmatization is often an implicit part of understanding language, not a discrete preprocessing step," but "explicit lemmatization is more characteristic of traditional models like TF-IDF" (LinkedIn Expert Discussion, March 2024).
Verdict: Partially true. Transformers reduce but don't eliminate the value of explicit normalization.
Implementation Guide
Decision Framework: Which Should You Use?
Use this flowchart approach:
Step 1: Define Your Priority
Speed + Scale → Consider stemming
Accuracy + Meaning → Consider lemmatization
Step 2: Assess Your Language
English, simple morphology → Stemming viable
Morphologically rich (Finnish, Arabic, Russian, Turkish) → Lemmatization strongly recommended
Step 3: Evaluate Your Use Case
Search/IR/Indexing → Stemming often sufficient
Sentiment/Chatbot/Translation → Lemmatization preferred
Step 4: Check Your Infrastructure
Limited compute resources → Stemming
Adequate compute, batch processing possible → Lemmatization
Step 5: Test Both
Run benchmarks on your actual data
Measure task-specific performance (accuracy, F1, precision, recall)
Choose based on empirical results, not assumptions
Implementation Patterns
Pattern 1: Basic Stemming Pipeline (NLTK)
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Setup
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    # Tokenize
    tokens = word_tokenize(text.lower())
    # Remove stop words and punctuation
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # Stem
    stems = [stemmer.stem(t) for t in tokens]
    return stems
# Usage
text = "The running dogs were jumping over the sleeping cats"
processed = preprocess_text(text)
print(processed)
# Output: ['run', 'dog', 'jump', 'sleep', 'cat']
Pattern 2: Production Lemmatization Pipeline (spaCy)
import spacy
# Load model once (expensive operation)
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"]) # Disable unused components
def preprocess_documents(documents):
    """Process multiple documents efficiently"""
    results = []
    for doc in nlp.pipe(documents, batch_size=50):
        # Extract lemmas, excluding stop words and punctuation
        lemmas = [token.lemma_ for token in doc
                  if not token.is_stop and not token.is_punct and token.is_alpha]
        results.append(lemmas)
    return results
# Usage
documents = [
"The children were running towards the playground",
"Better products yield better results",
"Studies show improved performance"
]
processed_docs = preprocess_documents(documents)
for doc_lemmas in processed_docs:
    print(doc_lemmas)
# Output:
# ['child', 'run', 'playground']
# ['good', 'product', 'yield', 'good', 'result']
# ['study', 'show', 'improve', 'performance']
Pattern 3: Hybrid Approach
def hybrid_normalization(text):
    """Use stemming for speed, lemmatization for sentiment-critical words.
    Relies on the stemmer and nlp objects defined in Patterns 1 and 2."""
    sentiment_words = {'better', 'worse', 'best', 'worst', 'good', 'bad'}
    normalized = []
    for token in nlp(text.lower()):
        if token.text in sentiment_words:
            # Lemmatize the words whose base form matters for sentiment
            normalized.append(token.lemma_)
        else:
            # Cheap rule-based stem for everything else
            normalized.append(stemmer.stem(token.text))
    return normalized
Performance Optimization Tips
For NLTK:
Precompile stop word sets (don't reload each time)
Use list comprehensions instead of loops
Consider multiprocessing for large corpora
Cache frequently processed terms
For spaCy:
Critical: Use nlp.pipe() for batch processing
Disable unused pipeline components (parser, NER) if not needed
Use smaller models for speed (sm instead of lg)
Process documents in parallel batches
Consider using GPU acceleration for very large datasets
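Putting those tips together, a minimal sketch might look like this (assuming en_core_web_sm is installed; n_process is available in recent spaCy versions, and the texts list is a stand-in for your corpus):
import spacy

# Load once and keep only the components this task needs
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

texts = ["first document ...", "second document ..."]  # stand-in corpus
# Batch the texts; n_process spreads work across CPU cores
for doc in nlp.pipe(texts, batch_size=64, n_process=2):
    lemmas = [t.lemma_ for t in doc if t.is_alpha and not t.is_stop]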
Testing Your Implementation
def benchmark_methods(texts, iterations=100):
    """Compare stemming vs lemmatization performance"""
    import time
    # Stemming benchmark
    start = time.time()
    for _ in range(iterations):
        for text in texts:
            tokens = word_tokenize(text)
            [stemmer.stem(t) for t in tokens]
    stem_time = time.time() - start
    # Lemmatization benchmark
    start = time.time()
    for _ in range(iterations):
        for doc in nlp.pipe(texts):
            [token.lemma_ for token in doc]
    lemma_time = time.time() - start
    print(f"Stemming: {stem_time:.2f}s")
    print(f"Lemmatization: {lemma_time:.2f}s")
    print(f"Speedup factor: {lemma_time/stem_time:.1f}x")
# Run on your data
sample_texts = [...] # Your actual text data
benchmark_methods(sample_texts)
Future Trends
1. Neural Lemmatization
The EditTreeLemmatizer in spaCy v3.3+ represents a shift toward trainable, neural network-based lemmatization. Instead of hand-crafted rules, these models learn morphological patterns from annotated data.
Advantage: "This removes the need to write language-specific rules and can (in many cases) provide higher accuracies than lookup and rule-based lemmatizers" (spaCy Linguistic Features Documentation).
Trend: Expect more libraries to adopt neural lemmatization as transformer models become more efficient.
2. Integration with Large Language Models
As noted by LinkedIn experts in 2024, "In LLMs, lemmatization is often an implicit part of understanding language, not a discrete preprocessing step" (LinkedIn Discussion, March 2024).
Modern transformers (BERT, GPT, T5) learn subword representations that capture morphological relationships. However, explicit lemmatization as preprocessing can still improve performance on specific tasks.
Emerging pattern: Hybrid architectures that combine explicit lemmatization with transformer-based contextual embeddings.
3. Context-Aware Stemming
Research into context-aware stemming (CAS) algorithms shows promise. A 2012 study reduced Porter stemmer errors from 76.7% to 6.7% by incorporating contextual analysis (Agbele et al., 2012).
Future direction: Stemming algorithms that use surrounding context to make better truncation decisions—bridging the gap between stemming speed and lemmatization accuracy.
4. Multilingual Unified Models
The push toward language-agnostic NLP models continues. Instead of separate stemmers/lemmatizers for each language, unified models (like mBERT, XLM-R) learn cross-lingual representations.
Impact: Reduces the need for language-specific preprocessing while maintaining accuracy across 100+ languages.
5. Real-Time Lemmatization at Scale
Cloud infrastructure improvements (GPU acceleration, distributed processing, edge computing) are making real-time lemmatization feasible even for high-volume applications.
Example: Modern chatbot platforms now routinely lemmatize user input in under 50ms, making the speed argument for stemming less compelling.
6. Domain-Specific Normalization
Generic stemmers/lemmatizers struggle with specialized vocabulary (medical terms, legal language, technical jargon). The trend is toward:
Domain-specific training data
Customizable lemmatization rules
Industry-specific dictionaries
Prediction: By 2027, most enterprise NLP systems will use domain-adapted normalization rather than generic approaches.
FAQ
1. Can I use both stemming and lemmatization together?
Yes, but it's usually redundant. Each is trying to solve the same problem (word normalization) using different approaches. Pick one based on your speed vs accuracy tradeoff.
In rare cases, you might stem during indexing for speed, then lemmatize query terms for accuracy. But most systems pick one approach and stick with it.
2. How do I know if my stemmer is working correctly?
Test it on a sample of your actual data. Look for:
Overstemming: Unrelated words getting the same stem
Understemming: Related words getting different stems
Non-word stems: How often does it produce invalid strings?
Calculate error rate on a manually annotated sample (100-500 words). If errors exceed 25-30%, consider lemmatization or a different stemmer.
3. Why doesn't spaCy include stemming?
By design choice. spaCy's creators believe stemming is linguistically unsound and produces lower-quality results than lemmatization. Since spaCy targets production systems where accuracy matters, they omitted stemming entirely.
From their perspective: if you need speed, optimize lemmatization properly (batch processing, GPU). Don't sacrifice accuracy for a crude stemming shortcut.
4. Can transformers like BERT replace stemming/lemmatization entirely?
Partially. BERT-family models learn subword representations that capture morphological relationships implicitly. For many tasks, this is sufficient.
However:
Not all applications can afford transformer computational costs
Preprocessing with lemmatization can still improve downstream performance
Traditional models (TF-IDF, topic models) still benefit from explicit normalization
Bottom line: Transformers reduce but don't eliminate the value of stemming/lemmatization.
5. What's the difference between a stem and a lemma?
Stem: Result of algorithmic truncation. May not be a dictionary word. Example: "studies" → "studi"
Lemma: Valid dictionary base form. Always a real word. Example: "studies" → "study"
Stems create equivalence classes for matching. Lemmas preserve linguistic validity.
6. How do I handle words not in the lemmatizer's dictionary?
Most lemmatizers have fallback behavior:
Lookup-based: Return the original word unchanged
Rule-based: Apply morphological rules even to unknown words
Neural: Use learned patterns to predict lemma
For domain-specific terms, you might need to:
Extend the dictionary
Train a custom lemmatizer
Use a hybrid approach (lemmatize common words, keep technical terms as-is)
7. Does lemmatization work for non-English languages?
Yes, often better than English! spaCy supports 70+ languages. Stanza supports 60+ languages. Many morphologically rich languages (Finnish, Turkish, Arabic) benefit more from lemmatization than English does.
The challenge is finding high-quality trained models and dictionaries for lower-resource languages.
8. How much does lemmatization slow down my NLP pipeline?
Depends on implementation:
Unoptimized: 10-50x slower than stemming
Properly optimized (batch processing, GPU): 2-5x slower
With caching: Nearly equal for repeated text
For most modern applications, the speed difference is negligible compared to other pipeline bottlenecks (network I/O, database queries, etc.).
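Caching is straightforward to add yourself—a sketch using functools.lru_cache around NLTK's WordNet lemmatizer; cached_lemma is an illustrative helper, not a library function:
from functools import lru_cache
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

_lemmatizer = WordNetLemmatizer()

@lru_cache(maxsize=100_000)
def cached_lemma(word, pos=wordnet.NOUN):
    # Repeated (word, pos) pairs are served from the cache instead of hitting WordNet again
    return _lemmatizer.lemmatize(word, pos)

print(cached_lemma("studies"), cached_lemma("running", wordnet.VERB))  # study run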
9. Can I train my own lemmatizer?
Yes. Tools like spaCy's EditTreeLemmatizer allow training on annotated corpora. You'll need:
Training data with word forms and their lemmas
POS tags for training examples
Sufficient computational resources
Evaluation dataset to measure accuracy
For most users, pretrained models work well. Custom training makes sense for specialized domains or low-resource languages.
10. What's the typical accuracy difference in real applications?
Benchmarks from research:
Stemming accuracy: 70-85% for English (varies by algorithm)
Lemmatization accuracy: 90-98% with proper POS tagging
However, the impact on downstream tasks varies:
For information retrieval: Often statistically insignificant differences
For sentiment analysis: Lemmatization typically improves F1 scores by 3-8%
For machine translation: Lemmatization provides substantial improvements
Test on your specific task and data to measure actual impact.
11. Should I remove stop words before or after stemming/lemmatization?
Best practice: either order can work. Removing stop words first (as in Mistake 6) saves computation, while removing them after normalization avoids missing stop words whose normalized forms aren't in your list.
Reason to prefer removal afterwards: some stemmers/lemmatizers normalize stop words to forms that aren't in your stop word list. In that case, process in this order:
Tokenization
Stemming/Lemmatization
Stop word removal
12. How do I choose between Porter, Snowball, and Lancaster stemmers?
Porter: Most balanced, widely tested, good default choice
Snowball (Porter2): Improved Porter with multilingual support
Lancaster: Most aggressive, highest stemming rate, use for precision-focused tasks
Recommendation: Start with Snowball. Switch to Porter if you need exact compatibility with legacy systems. Avoid Lancaster unless you specifically need aggressive stemming.
13. Can lemmatization hurt my model's performance?
Rarely, but possible in specific cases:
If POS tagging is incorrect, lemmatization propagates errors
For tasks like Named Entity Recognition, lemmatizing entities can destroy important information
In short texts (tweets, queries), morphological information might carry semantic value that lemmatization removes
Best practice: A/B test lemmatization vs no-lemmatization on your specific task with your actual data.
14. How do I handle compound words?
English: Standard stemmers/lemmatizers treat compound words as single units (e.g., "blackboard" stays as one token)
German/Dutch: These languages create many compound words. Use compound splitters before stemming/lemmatization:
"Donaudampfschifffahrtsgesellschaft" → ["Donau", "dampf", "schiff", "fahrt", "gesellschaft"]
Most NLP libraries for German include compound splitting as a preprocessing step.
15. What's the best way to evaluate stemmer/lemmatizer quality?
Intrinsic evaluation: Manual annotation
Take 500-1000 words from your corpus
Manually assign correct stems/lemmas
Compare algorithm output to gold standard
Calculate precision, recall, F1 score
Extrinsic evaluation: Downstream task performance
Run your full NLP pipeline with different normalization approaches
Measure end-task accuracy (classification F1, retrieval MAP, etc.)
Choose approach that maximizes end-task performance
The best stemmer/lemmatizer for your application is the one that improves your actual business metric most.
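For the intrinsic route, even a tiny hand-annotated sample can be scored in a few lines—a sketch with hypothetical gold annotations and NLTK's Porter stemmer as the system under test:
from nltk.stem import PorterStemmer

# Hypothetical hand-annotated gold sample: surface form → expected base form
gold = {"studies": "study", "running": "run", "caresses": "caress", "ponies": "pony"}

porter = PorterStemmer()

def accuracy(normalize, gold_pairs):
    """Fraction of words whose normalized form matches the gold annotation."""
    hits = sum(1 for word, target in gold_pairs.items() if normalize(word) == target)
    return hits / len(gold_pairs)

print(f"Porter accuracy on this tiny sample: {accuracy(porter.stem, gold):.0%}")
# "running" and "caresses" match the gold forms; "studies" → "studi" and "ponies" → "poni" miss → 50%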
Key Takeaways
Stemming uses crude rule-based truncation to create stems (often non-words), while lemmatization uses linguistic analysis to return valid dictionary forms (lemmas). Stemming is 5-10x faster; lemmatization is 15-20% more accurate.
Speed vs Accuracy tradeoff drives the choice: Use stemming for search engines, large-scale indexing, and real-time applications where speed matters. Use lemmatization for sentiment analysis, chatbots, and tasks where semantic accuracy is critical.
Language morphology matters enormously: English works reasonably with both approaches. Highly inflected languages (Finnish, Turkish, Arabic, Russian) strongly favor lemmatization—stemming error rates can exceed 30% for these languages.
Real-world performance data: Studies show lemmatization produces better precision for document clustering and retrieval, especially with longer queries. However, for simple Boolean searches, stemming often performs comparably.
Modern tools favor lemmatization: spaCy deliberately excludes stemming, focusing entirely on production-quality lemmatization. NLTK supports both for educational purposes. Industry trend is toward accurate lemmatization with optimized batch processing.
Implementation quality matters more than choice: A poorly optimized lemmatizer (54x slower) loses to well-implemented stemming. A properly batched spaCy pipeline approaches stemming's speed while maintaining accuracy.
Context awareness is lemmatization's key advantage: Understanding that "saw" as a verb → "see" but "saw" as a noun → "saw" requires the grammatical analysis that lemmatization provides.
No universal "best" choice exists: A 2016 multilingual study found that "the most accurate stemmer was not the one to have the biggest improvement in Information Retrieval" across four languages. Test on your specific data and task.
Hybrid approaches work well: Use stemming for initial filtering or indexing, lemmatization for final semantic analysis. Batch processing enables real-time lemmatization for user-facing applications.
Future trends favor lemmatization: Neural lemmatization, transformer integration, and cloud infrastructure improvements are making accurate lemmatization feasible even for high-scale applications. The speed argument for stemming weakens as optimization improves.
Actionable Next Steps
Audit your current text preprocessing pipeline. Identify where you use stemming, lemmatization, or neither. Document your current approach and its performance metrics.
Run A/B tests with your actual data. Don't rely on general benchmarks—test both stemming and lemmatization on your specific task (search, classification, sentiment analysis, etc.). Measure task-specific metrics (precision, recall, F1, user satisfaction).
If using NLTK, upgrade your lemmatization code to include proper POS tagging:
lemmatizer.lemmatize(word, pos=get_pos_tag(word))
Without POS tags, you're getting poor-quality lemmatization.
If using spaCy, optimize for batch processing:
for doc in nlp.pipe(documents, batch_size=50): process(doc)
This single change can provide 10-50x speed improvements.
For search/IR applications currently using no normalization, start with Snowball stemming. It's the fastest path to measurable improvement with minimal implementation cost.
For sentiment analysis or chatbots currently using stemming, migrate to lemmatization. The accuracy gain typically improves end-task performance by 3-8% with acceptable speed tradeoff.
Set up proper benchmarking infrastructure. Create a test suite that measures:
Processing speed (words/second)
Normalization accuracy (if you have gold standard data)
End-task performance (your actual business metric)
Document your decision rationale. Write down why you chose stemming or lemmatization, what tradeoffs you considered, and what metrics justify your choice. Revisit annually as infrastructure and tools improve.
Consider language-specific needs. If working with multiple languages, don't assume one approach works for all. Use language-appropriate normalization strategies.
Stay updated on neural lemmatization. As trainable lemmatizers improve, they may offer accuracy gains with minimal speed penalty. Monitor releases from spaCy, Stanza, and other major NLP libraries.
Glossary
Agglutinative Language: A language where words are formed by stringing together morphemes, each retaining its original meaning (e.g., Turkish, Finnish). Makes stemming particularly challenging.
Corpus (plural: corpora): A large collection of text documents used for NLP research and training.
Derivational Morphology: Word formation through adding affixes that change meaning or part of speech (e.g., "happy" → "unhappiness"). Stemming often removes these.
EditTreeLemmatizer: A trainable neural lemmatization component in spaCy v3.3+ that learns morphological transformations from annotated data.
Inflection: Modification of a word to express grammatical categories like tense, number, case, or gender (e.g., "run," "runs," "running").
Information Retrieval (IR): The process of finding relevant documents from a large collection in response to a user query.
Lemma: The canonical, dictionary form of a word. All inflected forms map to a single lemma (e.g., "am," "are," "is" → "be").
Lemmatization: The process of reducing words to their lemma using morphological analysis and dictionaries.
Morpheme: The smallest meaningful unit of language (e.g., "un-" in "unhappy").
Morphological Analysis: Studying the structure and form of words, including how they're built from morphemes.
Natural Language Processing (NLP): The field of AI focused on enabling computers to understand, interpret, and generate human language.
Overstemming: Stemming error where unrelated words are reduced to the same stem (e.g., "universal" and "university" → "univers").
Part-of-Speech (POS) Tagging: Assigning grammatical categories (noun, verb, adjective, etc.) to each word in text.
Porter Stemmer: The most widely used English stemming algorithm, developed by Martin Porter in 1979-1980.
Snowball: A framework for implementing stemming algorithms, and the name for improved Porter stemmer (Porter2) with multilingual support.
Stem: The base form of a word after removing affixes through stemming. May not be a valid dictionary word.
Stemming: The process of reducing words to their stem through rule-based suffix/prefix removal.
Stop Words: Common words (e.g., "the," "is," "and") that carry little semantic value and are typically removed during preprocessing.
TF-IDF (Term Frequency-Inverse Document Frequency): A numerical statistic reflecting how important a word is to a document in a collection.
Tokenization: Breaking text into individual words (tokens) as a preprocessing step.
Understemming: Stemming error where related words retain different stems (e.g., "data" and "datum" not reducing to the same stem).
WordNet: A large lexical database of English, grouping words into sets of synonyms and providing semantic relationships.
Sources & References
Agbele, K., Adesina, A., Azeez, N., & Abidoye, A. (2012). Context-Aware Stemming Algorithm for Semantically Related Root Words. African Journal of Computing & ICT, 5(4), 33-42.
Balakrishnan, V., & Lloyd-Yemoh, E. (2014). Stemming and Lemmatization: A Comparison of Retrieval Performances. Lecture Notes on Software Engineering, 2, 262-267. DOI: 10.7763/LNSE.2014.V2.134
Babel Street. (February 25, 2025). Delivering More Accurate Search Results with Lemmatization. Retrieved from https://www.babelstreet.com/blog/delivering-more-accurate-search-results-with-lemmatization
Bastaki Software Solutions. (March 12, 2025). Natural Language Processing with Python: A Comprehensive Guide to NLTK, spaCy, and Gensim in 2025. Retrieved from https://bastakiss.com/blog/python-5/natural-language-processing-with-python-a-comprehensive-guide-to-nltk-spacy-and-gensim-in-2025-738
Bitext. (May 4, 2023). What is the Difference Between Stemming and Lemmatization? Retrieved from https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/
Coursera. (June 5, 2025). Lemmatization vs. Stemming: Understanding NLP Methods. Retrieved from https://www.coursera.org/articles/lemmatization-vs-stemming
Doko, A., Štula, M., et al. (2020). Sentence Retrieval using Stemming and Lemmatization with Different Length of the Queries. Advances in Science, Technology and Engineering Systems Journal, 5(3), 45. DOI: 10.25046/aj050345
DS Stream Artificial Intelligence. (2025). The Grand Tour of NLP: spaCy vs. NLTK. Retrieved from https://www.dsstream.com/post/the-grand-tour-of-nlp-spacy-vs-nltk
Flores, F. N., & Moreira, V. P. (April 18, 2016). Assessing the Impact of Stemming Accuracy on Information Retrieval – A Multilingual Perspective. Information Processing & Management, 52(6), 1117-1135. DOI: 10.1016/j.ipm.2016.04.007
GeeksforGeeks. (July 1, 2024). Lemmatization vs. Stemming: A Deep Dive into NLP's Text Normalization Techniques. Retrieved from https://www.geeksforgeeks.org/nlp/lemmatization-vs-stemming-a-deep-dive-into-nlps-text-normalization-techniques/
IBM. (November 17, 2025). What Are Stemming and Lemmatization? IBM Think Topics. Retrieved from https://www.ibm.com/think/topics/stemming-lemmatization
Korenius, T., Laurikkala, J., Järvelin, K., & Juhola, M. (2004). Stemming and Lemmatization in the Clustering of Finnish Text Documents. Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM'04), 625-633. DOI: 10.1145/1031171.1031285
LinkedIn Expert Panel. (March 2, 2024). How Do Stemming and Lemmatization Affect the Performance and Scalability of NLP Applications? Retrieved from https://www.linkedin.com/advice/1/how-do-stemming-lemmatization-affect
NewsCatcher. (March 14, 2024). SpaCy vs NLTK: Text Normalization Comparison [with code]. Retrieved from https://www.newscatcherapi.com/blog-posts/spacy-vs-nltk-text-normalization-comparison-with-code-examples
Polus, M. E., & Abbas, T. (February 26, 2021). Development for Performance of Porter Stemmer Algorithm. Eastern-European Journal of Enterprise Technologies, 1(2), 6-13. DOI: 10.15587/1729-4061.2021.225362
Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program: Electronic Library and Information Systems, 14(3), 130-137. DOI: 10.1108/eb046814
Pramana, R., Debora, Y., Subroto, J. J., Gunawan, A. A. S., et al. (November 4, 2022). Systematic Literature Review of Stemming and Lemmatization Performance for Sentence Similarity. Proceedings of 2022 International Conference on Information Technology Systems and Innovation (ICITSI), 366-371. DOI: 10.1109/ICITSI56531.2022.9970943
spaCy. (2025). Facts & Figures. spaCy Usage Documentation. Retrieved from https://spacy.io/usage/facts-figures
spaCy. (2025). Linguistic Features: Lemmatization. spaCy Usage Documentation. Retrieved from https://spacy.io/usage/linguistic-features
spaCy GitHub Issue #1837. (January 13, 2018). Why the Performance of Lemmatizing of spaCy is So Slow Compared with NLTK. Retrieved from https://github.com/explosion/spaCy/issues/1837
Stanford NLP. Introduction to Information Retrieval: Stemming and Lemmatization. Retrieved from https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
TechTarget. What is Lemmatization? Definition. Retrieved from https://www.techtarget.com/searchenterpriseai/definition/lemmatization
