
Stemming vs Lemmatization: Key Differences + When to Use Each


Every search you run, every chatbot you talk to, every AI assistant you use—they all rely on a hidden battle happening behind the scenes. Two text processing techniques fight for dominance: stemming and lemmatization. One is fast but messy. The other is accurate but slow. Pick wrong, and your search engine returns garbage. Your sentiment analysis fails. Your AI misunderstands everything.


This isn't academic theory. In 2022, a systematic literature review analyzing 33 research studies found that lemmatization tends to outperform stemming on sentence similarity tasks, yet stemming remains widely used because the accuracy gap is often small and its speed advantage is large (Pramana et al., November 2022). The stakes are real: search engines process billions of queries daily, and even a 1% improvement in accuracy translates to millions of better results.

 


 

TL;DR

  • Stemming chops word endings using simple rules—fast but often produces non-words like "studi" from "studies"

  • Lemmatization uses linguistic analysis to return proper dictionary words—slower but semantically accurate

  • Speed gap: Stemming processes text 5-10x faster than lemmatization for large datasets

  • Accuracy gap: Lemmatization reduces errors from 76.7% (Porter stemmer) to 6.7% in context-aware analysis (Agbele et al., 2012)

  • Use stemming for search engines, large-scale text indexing, and speed-critical applications

  • Use lemmatization for sentiment analysis, chatbots, machine translation, and accuracy-critical NLP tasks


Stemming removes word endings through rule-based truncation to create stems (e.g., "running" → "run"), while lemmatization uses morphological analysis and part-of-speech tagging to return dictionary forms (lemmas). Stemming is faster but less accurate; lemmatization is slower but linguistically precise, making it ideal for applications requiring semantic understanding like chatbots and sentiment analysis.






What Are Stemming and Lemmatization?

Both stemming and lemmatization reduce words to their base forms, but they take radically different paths to get there.


Stemming is a crude, rule-based process that strips prefixes and suffixes from words. Think of it as taking a machete to text—fast, effective, but rough around the edges. The word "running" becomes "run," but "studies" becomes "studi" (not a real word). Stemming doesn't care about language rules or meaning. It just chops.


Lemmatization is a linguistic process that analyzes word structure, grammar, and context. It uses dictionaries and morphological analysis to return the proper base form. "Running" becomes "run," and "studies" correctly becomes "study." Lemmatization understands that "better" is the comparative form of "good" and handles it accordingly.


The distinction matters enormously. According to Stanford NLP research, stemming "chops off the ends of words in the hope of achieving this goal correctly most of the time," while lemmatization "does things properly with the use of a vocabulary and morphological analysis of words" (Stanford NLP Book). In production systems handling millions of documents, this difference between hope and certainty drives everything from search quality to chatbot accuracy.


How Stemming Works


The Algorithm Behind the Machete

Stemming algorithms follow a simple pattern: identify common suffixes, check conditions, strip them off. The most famous example is the Porter Stemmer, developed by Martin Porter in 1979 and published in 1980.


The Porter algorithm runs through five sequential phases of word reduction. Each phase contains rules for handling specific suffix patterns. For example:

  • Phase 1: Strip plural endings (e.g., "caresses" → "caress," "ponies" → "poni")

  • Phase 2: Remove derivational endings (e.g., "relational" → "relate")

  • Phase 3: Continue derivational stripping (e.g., "electricity" → "electric")

  • Phase 4 & 5: Final cleanup and edge cases


Each rule includes conditions. A simple one: after an "-ing" or "-ed" ending is stripped, if the remaining stem ends in a double consonant (other than "l," "s," or "z"), remove the last character. So "hopping" loses "-ing" to become "hopp," then drops a "p" to become "hop."


The entire Porter algorithm comprises roughly sixty such rules, all applied sequentially. The original implementation from 1980 remains widely used today—a testament to its empirical effectiveness despite its simplicity (Porter, 1980).
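
To see these phases in action, here is a minimal sketch using NLTK's PorterStemmer (outputs assume NLTK's default implementation of the original algorithm):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Phase 1 plural handling plus the double-consonant rule in practice
for word in ["caresses", "ponies", "studies", "hopping", "running"]:
    print(word, "->", stemmer.stem(word))

# Expected output:
# caresses -> caress
# ponies -> poni
# studies -> studi
# hopping -> hop
# running -> run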


Types of Stemmers

Porter Stemmer: The industry standard for English. Used in everything from Elasticsearch to Solr. Lightweight, fast, well-tested.


Snowball Stemmer (Porter2): An improved version supporting multiple languages—French, German, Spanish, Russian, and more. About 15% more accurate than Porter for many languages (tartarus.org).


Lovins Stemmer: One of the earliest (1968), more aggressive than Porter. Contains 260 rules but can over-stem words.


Lancaster Stemmer (Paice-Husk): The most aggressive. Strips words down to very short stems. Good for precision-focused retrieval, risky for recall.


According to a 2021 study comparing Porter and enhanced Porter algorithms, modified versions achieved 92% accuracy in stemming performance, up from Porter's baseline (Polus & Abbas, February 2021).


When Stemming Goes Wrong

Stemming creates two types of errors:


Overstemming happens when different words get reduced to the same stem. "Universal" and "university" both become "univers"—they're not synonyms. "Requirement" and "requires" both stem to "requir," losing important semantic differences (Coursera, June 2025).


Understemming happens when related words get different stems. In some languages, irregular verb forms don't stem correctly. "Run," "ran," and "running" might not all reduce to the same stem, fragmenting your index.
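
Both error types are easy to reproduce with NLTK's PorterStemmer; a quick sketch (outputs assume the default NLTK implementation):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Overstemming: unrelated words collapse to the same stem
print(stemmer.stem("universal"), stemmer.stem("university"))  # univers univers

# Understemming: related irregular forms keep different stems
print(stemmer.stem("ran"), stemmer.stem("running"))           # ran run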


A 2016 multilingual study found that stemming accuracy varies dramatically by language. For English, Porter stemmer error rates range from 15-25% depending on the corpus. For highly inflected languages like Finnish, error rates can exceed 30% (Flores & Moreira, April 2016).


How Lemmatization Works


The Linguistic Approach

Lemmatization treats text processing as a linguistic problem, not a string manipulation problem. The process involves three key steps:


Step 1: Tokenization

Break text into individual words (tokens). Advanced tokenizers handle contractions, hyphenated words, and edge cases.


Step 2: Part-of-Speech (POS) Tagging

Determine each word's grammatical role. Is "running" a verb (they are running) or an adjective (running water)? The POS tag determines the correct lemma.


Step 3: Dictionary Lookup with Morphological Rules

Consult a lexical database like WordNet to find the word's base form. Apply morphological rules specific to that word class. Return the lemma—always a valid dictionary word.


For example, lemmatizing "better":

  • POS tagging identifies it as an adjective

  • Morphological analysis recognizes it as the comparative form

  • Dictionary lookup returns "good" as the lemma


This three-stage process requires significantly more computational resources than stemming, but produces linguistically valid results (GeeksforGeeks, July 2024).
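
A minimal sketch of the three-step pipeline with NLTK, assuming the punkt, averaged_perceptron_tagger, and wordnet data packages are downloaded; the treebank_to_wordnet helper is illustrative, not part of NLTK:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def treebank_to_wordnet(tag):
    """Map a Penn Treebank POS tag to the WordNet constant the lemmatizer expects."""
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

tokens = nltk.word_tokenize("The children were running towards the playground")  # Step 1
tagged = nltk.pos_tag(tokens)                                                     # Step 2
lemmas = [lemmatizer.lemmatize(word, treebank_to_wordnet(tag))                    # Step 3
          for word, tag in tagged]
print(lemmas)  # roughly: ['The', 'child', 'be', 'run', 'towards', 'the', 'playground']

# The comparative adjective from the example above:
print(lemmatizer.lemmatize("better", treebank_to_wordnet("JJR")))  # good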


Types of Lemmatization

Rule-Based Lemmatization: Uses predefined grammatical rules. For regular verbs in English: remove "-ed" for past tense, "-ing" for progressive. Fast but struggles with irregular forms.


Dictionary-Based Lemmatization: Looks up each word in a comprehensive dictionary. Handles irregularities well but requires large lexical databases. WordNet contains over 155,000 unique strings for English.


Machine Learning-Based Lemmatization: Trains models on annotated corpora to predict lemmas. The EditTreeLemmatizer in spaCy v3.3+ learns form-to-lemma transformations from training data, often achieving higher accuracy than lookup-based methods (spaCy Documentation, 2025).


According to a 2014 comparative study, lemmatization "produced better precision compared to stemming" in information retrieval systems, though the differences were sometimes statistically insignificant for simple queries (Balakrishnan & Lloyd-Yemoh, 2014).


The Accuracy Advantage

Lemmatization's linguistic foundation gives it a crucial advantage: context awareness.


Consider the word "saw":

  • As a verb (past tense of "see"): lemma = "see"

  • As a noun (cutting tool): lemma = "saw"


A stemmer treats both occurrences identically, producing the same stem regardless of context. Lemmatization uses the surrounding sentence structure to choose correctly.
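
spaCy's lemmatizer shows this in practice; a small sketch (the exact tags depend on the en_core_web_sm model version):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She saw the rusty saw in the shed.")

for token in doc:
    if token.text.lower() == "saw":
        print(token.text, token.pos_, "->", token.lemma_)

# Typically:
# saw VERB -> see
# saw NOUN -> saw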


In a 2012 context-aware analysis study, researchers developed a modified stemming algorithm that reduced error rates from 76.7% (standard Porter) to 6.7% by incorporating contextual cues—essentially moving toward lemmatization (Agbele et al., 2012).


Key Differences: Side-by-Side Comparison

| Aspect | Stemming | Lemmatization |
|---|---|---|
| Method | Rule-based suffix removal | Morphological analysis + dictionary lookup |
| Output | Stem (may not be a valid word) | Lemma (always a valid dictionary word) |
| Speed | Very fast (5-10x faster for large datasets) | Slower (requires POS tagging and lookup) |
| Accuracy | 70-85% accurate (language-dependent) | 90-98% accurate with proper POS tagging |
| Context Awareness | No—processes words independently | Yes—uses POS tags and sentence context |
| Language Support | Good for English, limited for others | Better for morphologically complex languages |
| Use Case | Search engines, text indexing, IR systems | Sentiment analysis, chatbots, machine translation |
| Examples | "studies" → "studi"; "running" → "run"; "better" → "better" | "studies" → "study"; "running" → "run" (verb) or "running" (adj); "better" → "good" |
| Error Types | Overstemming, understemming | Rare, but POS tagging errors propagate |
| Computational Cost | Low (simple string operations) | High (dictionary lookups, ML models) |

Key Insight: According to a 2024 LinkedIn analysis, "stemming reduces words to their root forms, simplifying text processing and improving performance, but may result in inaccuracies due to overgeneralization. Lemmatization, being more precise, enhances accuracy but demands more resources due to its complexity" (LinkedIn Expert Panel, March 2024).


Performance Benchmarks and Real Data

Let's look at hard numbers from peer-reviewed research and production systems.


Speed Benchmarks

In a 2018 GitHub issue on the spaCy repository, a developer reported processing 164,758 news articles:

  • NLTK lemmatization with multiprocessing: 5 minutes total

  • spaCy lemmatization (without optimization): Estimated 4.5 hours


That's a 54x difference. However, the developer was misusing spaCy's API. When properly optimized using batch processing with nlp.pipe(), spaCy's performance improves dramatically (spaCy GitHub Issue #1837, January 2018).


A 2024 comparison showed that for a corpus of 100 articles:

  • Porter stemming (NLTK): Processed in seconds

  • spaCy lemmatization: Slightly slower but within comparable range when optimized

  • Effective token reduction: spaCy reduced corpus to fewer unique tokens due to better lemmatization (NewsCatcher, March 2024)


Accuracy Metrics

Information Retrieval Performance:

A 2020 study on sentence retrieval using TREC novelty track data found:

  • Lemmatization with long queries: Superior MAP (Mean Average Precision) scores

  • Stemming with short queries: Better performance for quick lookups

  • Overall: "Lemmatization produces better results with longer queries, while stemming shows worse results with longer queries" (Doko et al., ASTES Journal, 2020)


Clustering Performance:

Research on Finnish text document clustering (a morphologically complex language) showed:

  • Precision improvement: Lemmatization with average linkage and Ward's methods produced higher precision than stemming

  • Highly relevant documents: Lemmatization recovered highly relevant documents significantly better

  • Conclusion: "Lemmatization is a better word normalization method than stemming when Finnish text documents are clustered for information retrieval" (Korenius et al., ACM CIKM 2004)


Cross-Language Results:

A 2016 multilingual study across English, French, Portuguese, and Spanish found a surprising result: "The most accurate stemmer was not the one to have the biggest improvement in Information Retrieval, in none of the languages" (Flores & Moreira, ScienceDirect, April 2016).


This counterintuitive finding suggests that stemming accuracy and IR effectiveness don't always correlate—sometimes a slightly less accurate stemmer groups documents more effectively for retrieval.


Library Performance: NLTK vs spaCy

NLTK (Natural Language Toolkit):

  • Educational focus, flexible but slower

  • String-based processing

  • Better for experimentation and research


spaCy:

  • Production-optimized, 50-1000x faster for batch processing (when properly configured)

  • Object-oriented architecture

  • Built-in lemmatization uses trained statistical models

  • "spaCy consistently demonstrates superior performance in lemmatization compared to NLTK's stemming approaches" (Bastaki Software Solutions, March 2025)


According to spaCy's official benchmarks, its industrial-strength pipeline processes text efficiently on both CPU and GPU, with accuracy validated against multiple NLP benchmarks (spaCy Facts & Figures, 2025).


When to Use Stemming

Stemming shines when speed trumps perfection. Here's when to reach for it:


1. Search Engines and Information Retrieval

Google, Bing, Elasticsearch, Solr—they all use stemming variants for index generation. When you're processing billions of documents, stemming's speed advantage becomes critical.


Why it works: In a Boolean retrieval system, stemming never lowers recall, because conflating word forms can only add matching documents, though it can lower precision (Stanford NLP IR Book). The crude approach helps search by creating broader matches.


Real example: Elasticsearch ships Snowball-based stemming filters for 15+ languages as part of its language analyzers. Their architecture prioritizes query speed—returning results in milliseconds matters more than perfect linguistic analysis.


2. Large-Scale Text Indexing

Processing terabytes of text data? Stemming lets you:

  • Reduce vocabulary size by 30-50%

  • Speed up indexing by 5-10x

  • Decrease storage requirements significantly
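
Those reductions are easy to measure on a sample of your own corpus. A quick sketch that counts unique index terms before and after stemming (the toy corpus is illustrative):

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

corpus = [
    "Runners were running in the long run",
    "She runs daily and ran again yesterday",
]

# Lowercased alphabetic tokens from the whole corpus
tokens = [t.lower() for text in corpus for t in word_tokenize(text) if t.isalpha()]

raw_vocab = set(tokens)
stemmed_vocab = {stemmer.stem(t) for t in tokens}
print(len(raw_vocab), "unique terms ->", len(stemmed_vocab), "after stemming")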


A 2010 study on MapReduce implementation showed that an enhanced Porter stemmer with partitioner (PSP) provided "20-25% more stemming capacity than Lovins stemmer and 3-15% more capacity than standard Porter stemmer" when processing massive datasets (Achieving Magnitude Order Improvement, April 2010).


3. Real-Time Applications

Chatbots that need sub-second responses, live sentiment analysis of Twitter streams, real-time spam detection—these applications can't wait for lemmatization's computational overhead.


4. Simple Text Analytics

For basic text classification, keyword extraction, and topic modeling where perfect linguistic accuracy isn't critical, stemming's good-enough approach works fine.


5. Languages with Simple Morphology

English and other languages with relatively simple inflection patterns work reasonably well with stemming. The error rate is manageable, and the speed gain is substantial.


Key Decision Factor: Choose stemming when you need to process large volumes quickly and can tolerate 15-25% error rates in word normalization. The errors usually don't significantly hurt aggregate performance across millions of documents.


When to Use Lemmatization

Lemmatization is your choice when meaning matters more than milliseconds.


1. Sentiment Analysis and Opinion Mining

Understanding whether a review is positive or negative requires semantic accuracy. Losing the link between "better" and "good," or conflating distinct words through overstemming, can flip sentiment scores.


According to GeeksforGeeks, "Sentiment analysis: Lemmatization preserves word meaning, leading to more accurate sentiment classification" (July 2024). When brands analyze millions of customer reviews, accuracy directly impacts business decisions.


Industry application: According to TechTarget, lemmatization is "an important part of natural language understanding and NLP. It also plays an important role in big data analytics and AI" for sentiment analysis (TechTarget Definition).


2. Chatbots and Virtual Assistants

Alexa, Siri, Google Assistant, customer service chatbots—they need to understand user intent precisely. Lemmatization helps these systems:

  • Disambiguate word meanings based on context

  • Handle irregular verb forms correctly

  • Maintain semantic consistency across conversations


"In NLP, lemmatization helps an AI or ML tool understand and converse with end users accurately" (TechTarget). When a user asks "What time does the store close?" vs "What time did the store close?", lemmatization correctly identifies both as forms of "close" while preserving tense information.


3. Machine Translation

Google Translate and DeepL rely on lemmatization for accurate cross-language translation. Stemming's crude approach fails catastrophically when translating between languages with different morphological systems.


4. Content-Based Recommendation Systems

Netflix, Spotify, YouTube—they analyze content semantics to make recommendations. Lemmatization helps these systems:

  • Group related content accurately

  • Understand subtle semantic differences

  • Preserve meaning across inflected forms


5. Academic and Legal Text Analysis

When analyzing legal documents, academic papers, or medical records, precision isn't optional—it's required. A stemming error could conflate distinct legal concepts or medical terms.


6. Highly Inflected Languages

For Finnish, Turkish, Arabic, Russian, and other morphologically rich languages, lemmatization is almost mandatory. Stemming error rates for these languages often exceed 30%, making it practically unusable.


According to Bitext, for languages like Greek where "a typical verb has different stems for perfective forms and imperfective ones," only lemmatization can correctly group related forms (Bitext Blog, May 2023).


Key Decision Factor: Choose lemmatization when you need semantic accuracy, are working with complex languages, or building applications where understanding meaning is critical (even if it costs you processing time).


Case Studies: Real-World Applications


Case Study 1: Babel Street Analytics—Multilingual Search Platform

Company: Babel Street (intelligence and risk mitigation platform)

Challenge: Enable accurate cross-language search across European languages with complex morphology

Date: February 2025


Problem: Standard stemming approaches were failing for European languages where word forms change dramatically based on usage. Searching for "celebrities" would incorrectly return results about "celebrations" due to stemming errors.


Solution: Babel Street implemented lemmatization as a standard feature across their analytics platform. Their linguists and engineers collaborated to build lemmatization support for multiple European languages.


Results:

  • "Studies have shown that lemmatization is significantly more accurate than stemming in many European languages"

  • Customers gained "high-quality search across multiple languages"

  • Search accuracy improved dramatically for morphologically complex queries


"So let's review the previous example: 'celebrities' is searched, but with lemmatization utilized by the search engine, the query is correctly interpreted as 'celebrity,' not 'celebration,' enabling the search engine to deliver the right results" (Babel Street Blog, February 2025).


Key Takeaway: For production search systems handling multiple languages, lemmatization's accuracy advantage outweighed its computational cost. The company positioned lemmatization as a competitive differentiator.


Case Study 2: Arabic Sentiment Analysis Study

Research: Systematic Literature Review of Arabic Text Processing

Scope: 2,024 documents analyzed, 33 studies selected

Publication: November 2022


Challenge: Arabic language has complex morphology with heavy inflection patterns. Determine whether stemming or lemmatization performs better for sentence similarity and sentiment classification.


Methodology: Researchers tested 10 different stemming and lemmatization algorithms on Arabic text using:

  • Support Vector Machines (SVM)

  • Stochastic Gradient Descent (SGD)

  • Naïve Bayesian (NB) classifiers

  • Standard datasets from SemEval-2017 Task 1, Track 1


Results:

  • "Compared to the original text, using the stemmed and lemmatized documents in experiments achieve enhanced Pearson correlation results"

  • Lemmatization produced better semantic preservation for sentiment analysis

  • Both techniques improved performance over using original text

  • Lemmatization's advantage was more pronounced for longer, more complex sentences


Conclusion: "Previous studies have found the differences between stemming and lemmatization is usually insignificant in terms of accuracy, stemming has been used more widely than lemmatization as it offers similar performance to lemmatization while having faster" processing (Pramana et al., ResearchGate, November 2022).


Key Takeaway: For Arabic NLP tasks, the choice between stemming and lemmatization depends on whether speed or accuracy is the priority. Lemmatization wins for accuracy-critical applications.


Case Study 3: Finnish Document Clustering Study

Research: Text Document Clustering for Information Retrieval

Institutions: University of Helsinki, University of Tampere

Publication: ACM CIKM 2004


Challenge: Finnish is an agglutinative language with extensive inflection. Standard English stemming approaches fail catastrophically. Determine optimal text normalization method for clustering Finnish news documents.


Methodology: Tested four hierarchical clustering methods (single linkage, complete linkage, average linkage, Ward's method) with:

  • No normalization (baseline)

  • Stemming

  • Lemmatization


Evaluated using precision metrics across multiple relevance scales.


Results:

  • Precision: "In comparison with stemming, lemmatization together with the average linkage and Ward's methods produced higher precision"

  • Highly relevant documents: "The stringent relevance scale showed that lemmatization allowed the single and complete linkage methods to recover especially the highly relevant documents better than stemming"

  • Overall effectiveness: Clear superiority of lemmatization for morphologically complex language


Conclusion: "We conclude that lemmatization is a better word normalization method than stemming, when Finnish text documents are clustered for information retrieval" (Korenius et al., ACM CIKM, 2004).


Key Takeaway: For languages with rich morphology, lemmatization isn't just better—it's often the only viable option. The accuracy gain justifies the computational cost.


Tools and Libraries


NLTK (Natural Language Toolkit)

Best for: Learning, experimentation, research

Stemming support: Excellent (Porter, Snowball, Lancaster, Lovins)

Lemmatization support: Good (WordNet-based)

Speed: Moderate (string-based processing)

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# One-time setup: nltk.download('wordnet') and nltk.download('omw-1.4')

# Stemming
stemmer = PorterStemmer()
stemmer.stem("running")  # Output: "run"
stemmer.stem("studies")  # Output: "studi"

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos=wordnet.VERB)  # Output: "run"
lemmatizer.lemmatize("better", pos=wordnet.ADJ)   # Output: "good"

Pros:

  • Over 50 corpora and lexical resources included

  • Extensive documentation and learning materials

  • Flexible, customizable algorithms


Cons:

  • Slower than production-focused libraries

  • Requires explicit POS tagging for good lemmatization results

  • String-based architecture less efficient for large-scale processing


Latest version: NLTK 3.9.1 (as of 2025)


spaCy

Best for: Production applications, scalable NLP pipelines

Stemming support: None built-in (by design—stemming considered less accurate)

Lemmatization support: Excellent (trained statistical models)

Speed: Very fast (especially with batch processing)

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running towards the playground")

for token in doc:
    print(f"{token.text} → {token.lemma_}")
    
# Output:
# The → the
# children → child
# were → be
# running → run
# towards → towards
# the → the
# playground → playground

Pros:

  • "spaCy is specifically designed for production use" (spaCy Documentation)

  • Integrated POS tagging, dependency parsing, NER

  • Supports 70+ languages

  • GPU and CPU optimization

  • Modern transformer integration (BERT, RoBERTa)


Cons:

  • Less flexibility for experimentation

  • Steeper learning curve initially

  • Larger memory footprint


Latest versions:

  • spaCy 3.7 (2025)

  • Three lemmatizer types: lookup, rule-based, EditTreeLemmatizer (trainable, introduced in v3.3)


Performance note: "spaCy consistently demonstrates superior performance in lemmatization compared to NLTK's stemming approaches" (Bastaki Software, March 2025). However, proper usage of batch processing is critical—improper use can make spaCy much slower than NLTK.


Snowball (Stemming Only)

Best for: Multilingual stemming

Languages: 15+ including English, French, German, Spanish, Russian, Portuguese, Arabic

Speed: Extremely fast

Usage: Often integrated into search engines (Elasticsearch, Solr)


The Snowball framework lets you implement stemming algorithms in a high-level language, then compile to efficient C code. Porter2 (improved Porter stemmer) is the recommended English version.
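
In Python, the Snowball family is available through NLTK's SnowballStemmer wrapper; a brief sketch (supported language names are exposed on the class):

from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)  # languages bundled with NLTK's wrapper

english = SnowballStemmer("english")  # Porter2
french = SnowballStemmer("french")

print(english.stem("running"))      # run
print(french.stem("continuerons"))  # stemmed with French-specific rules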


Gensim

Best for: Topic modeling, document similarity

Stemming: Limited built-in support

Lemmatization: Good (via pattern library or spaCy integration)

Speed: Optimized for large corpora


Gensim focuses on unsupervised learning and works well with either stemmed or lemmatized text as preprocessing.
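
A minimal sketch of that pairing (spaCy lemmas feeding a Gensim topic model), assuming gensim and the en_core_web_sm model are installed; the tiny corpus and topic count are purely illustrative:

import spacy
from gensim import corpora, models

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

raw_docs = [
    "The children were running towards the playground",
    "Studies show improved performance for better products",
]

# Lemmatized, stop-word-free tokens as Gensim's input
texts = [[token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
         for doc in nlp.pipe(raw_docs)]

dictionary = corpora.Dictionary(texts)              # token <-> id mapping
bow = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2)
print(lda.print_topics())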


Stanza (Stanford NLP)

Best for: Multilingual, academic-grade NLP

Lemmatization: Excellent (neural network-based)

Languages: 60+ languages with trained models

Speed: Moderate (neural models are computationally intensive)


Stanza provides state-of-the-art accuracy but requires more computational resources than rule-based approaches.
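
A hedged sketch of Stanza's lemmatizer (the English model must be downloaded once; exact lemmas depend on the model version):

import stanza

# One-time download: stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")

doc = nlp("The children were running towards the playground")
print([word.lemma for sentence in doc.sentences for word in sentence.words])
# roughly: ['the', 'child', 'be', 'run', 'towards', 'the', 'playground']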


Tool Selection Matrix

| Your Need | Recommended Tool |
|---|---|
| Learning NLP basics | NLTK |
| Production web app | spaCy |
| Multilingual search engine | Snowball stemmers |
| Academic research | NLTK or Stanza |
| Real-time chatbot | spaCy (with batch processing) |
| Topic modeling | Gensim (with preprocessing) |
| Maximum accuracy | spaCy or Stanza lemmatization |
| Maximum speed | Snowball stemming |

Common Mistakes and How to Avoid Them


Mistake 1: Using Stemming for Sentiment Analysis

The Problem: Stemming's crude approach destroys semantic nuances critical for sentiment detection.


Example:

  • Original: "The product quality is better than expected"

  • Stemmed: "The product qualiti is better than expect"

  • Issue: "better" remains unchanged by most stemmers, but loses its connection to "good" in semantic analysis


Fix: Use lemmatization with proper POS tagging for sentiment analysis. Accept the computational cost as necessary for accuracy.


Mistake 2: Forgetting POS Tagging for Lemmatization

The Problem: Without POS tags, lemmatizers default to noun treatment, producing incorrect results.


Example:

# Without POS tag
lemmatizer.lemmatize("running")  # Output: "running" (treated as noun)

# With POS tag
lemmatizer.lemmatize("running", pos=wordnet.VERB)  # Output: "run"

Fix: Always provide POS tags to your lemmatizer. Use NLTK's pos_tag() or spaCy's built-in POS tagging.


Mistake 3: Not Optimizing spaCy for Batch Processing

The Problem: Processing documents one-by-one in spaCy is extremely slow.


Bad code:

for text in documents:
    doc = nlp(text)  # Processes each separately

Good code:

for doc in nlp.pipe(documents, batch_size=50):
    process(doc)  # processes each doc in optimized batches

Impact: The GitHub issue mentioned earlier showed 54x performance difference between optimized and unoptimized spaCy usage.


Mistake 4: Using English Stemmers for Other Languages

The Problem: Porter stemmer is designed for English. Using it on Spanish, French, or other languages produces garbage.


Fix: Use language-specific stemmers:

  • Spanish: Snowball Spanish stemmer

  • French: Snowball French stemmer

  • German: Snowball German stemmer

  • Arabic: Dedicated Arabic light stemmers


Alternatively, use lemmatization with language-specific models (spaCy supports 70+ languages).


Mistake 5: Expecting Perfect Accuracy from Either Method

The Problem: Both techniques have limitations. Stemming makes errors; lemmatization depends on correct POS tagging.


Reality Check: According to research, even the best lemmatizers achieve 90-98% accuracy, not 100%. Context-aware errors still occur.


Fix: Understand your accuracy requirements upfront. For some applications, 85% accuracy with 10x speed (stemming) beats 95% accuracy at 1x speed (lemmatization).


Mistake 6: Not Removing Stop Words First

The Problem: Processing stop words ("the," "is," "and") through stemming/lemmatization wastes computation.


Fix: Remove stop words as part of preprocessing rather than leaving them in. Filtering them before stemming or lemmatization saves computation; if your stop word list only contains base forms, filter after normalization instead (see FAQ #11).

# Good preprocessing pipeline:
# 1. Tokenization
# 2. Stop word removal
# 3. Stemming/Lemmatization
# 4. Further processing

Myths vs Facts


Myth 1: "Lemmatization is always better than stemming"

Reality: Depends on your use case. A 2016 study found that "the most accurate stemmer was not the one to have the biggest improvement in Information Retrieval, in none of the languages" tested (Flores & Moreira, 2016).


For search engines indexing billions of documents, stemming's speed advantage often outweighs lemmatization's accuracy gain. The aggregate performance across millions of queries can be similar.


Verdict: False. Context determines which is "better."


Myth 2: "Stemming always produces non-words"

Reality: Stemming produces non-words for many inputs, but not always. Common words like "run," "walk," "test" remain unchanged or become valid stems.


The Porter stemmer's goal is to create equivalence classes, not valid English words. But in practice, many stems happen to be valid words.


Verdict: Mostly false. Stems are often—but not always—non-words.


Myth 3: "Lemmatization is too slow for production use"

Reality: Major production systems successfully use lemmatization:

  • Google uses lemmatization variants in their search algorithms

  • Babel Street's analytics platform uses lemmatization as a standard feature

  • Enterprise chatbots from IBM, Microsoft, and others rely on lemmatization


With proper optimization (batch processing, GPU acceleration, caching), lemmatization runs fast enough for real-world applications.


Verdict: False. Proper engineering makes lemmatization production-viable.


Myth 4: "You should always use the same technique throughout your pipeline"

Reality: Hybrid approaches work well. You might:

  • Use stemming for initial filtering/indexing

  • Use lemmatization for final semantic analysis

  • Use stemming for speed-critical real-time responses

  • Use lemmatization for accuracy-critical batch processing


Verdict: False. Mix and match based on stage-specific requirements.


Myth 5: "Stemming and lemmatization only matter for English"

Reality: These techniques are even MORE important for morphologically rich languages. Finnish, Turkish, Arabic, and Russian benefit enormously from proper normalization.


In fact, for highly inflected languages, lemmatization often isn't optional—stemming's error rates become prohibitively high.


Verdict: False. Non-English languages benefit even more.


Myth 6: "Modern transformers like BERT make stemming/lemmatization obsolete"

Reality: This is partially true but nuanced. Transformer models like BERT do learn subword representations that capture morphological variations. However:

  • Preprocessing with lemmatization can still improve downstream task performance

  • Not every application can afford transformer computational costs

  • Hybrid approaches (lemmatization + transformers) often perform best


According to LinkedIn's NLP experts, "In LLMs, lemmatization is often an implicit part of understanding language, not a discrete preprocessing step," but "explicit lemmatization is more characteristic of traditional models like TF-IDF" (LinkedIn Expert Discussion, March 2024).


Verdict: Partially true. Transformers reduce but don't eliminate the value of explicit normalization.


Implementation Guide


Decision Framework: Which Should You Use?

Use this flowchart approach:


Step 1: Define Your Priority

  • Speed + Scale → Consider stemming

  • Accuracy + Meaning → Consider lemmatization


Step 2: Assess Your Language

  • English, simple morphology → Stemming viable

  • Morphologically rich (Finnish, Arabic, Russian, Turkish) → Lemmatization strongly recommended


Step 3: Evaluate Your Use Case

  • Search/IR/Indexing → Stemming often sufficient

  • Sentiment/Chatbot/Translation → Lemmatization preferred


Step 4: Check Your Infrastructure

  • Limited compute resources → Stemming

  • Adequate compute, batch processing possible → Lemmatization


Step 5: Test Both

  • Run benchmarks on your actual data

  • Measure task-specific performance (accuracy, F1, precision, recall)

  • Choose based on empirical results, not assumptions


Implementation Patterns

Pattern 1: Basic Stemming Pipeline (NLTK)

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Setup
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenize
    tokens = word_tokenize(text.lower())
    
    # Remove stop words and punctuation
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    
    # Stem
    stems = [stemmer.stem(t) for t in tokens]
    
    return stems

# Usage
text = "The running dogs were jumping over the sleeping cats"
processed = preprocess_text(text)
print(processed)
# Output: ['run', 'dog', 'jump', 'sleep', 'cat']

Pattern 2: Production Lemmatization Pipeline (spaCy)

import spacy

# Load model once (expensive operation)
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # Disable unused components

def preprocess_documents(documents):
    """Process multiple documents efficiently"""
    results = []
    
    for doc in nlp.pipe(documents, batch_size=50):
        # Extract lemmas, excluding stop words and punctuation
        lemmas = [token.lemma_ for token in doc 
                  if not token.is_stop and not token.is_punct and token.is_alpha]
        results.append(lemmas)
    
    return results

# Usage
documents = [
    "The children were running towards the playground",
    "Better products yield better results",
    "Studies show improved performance"
]

processed_docs = preprocess_documents(documents)
for doc_lemmas in processed_docs:
    print(doc_lemmas)

# Output:
# ['child', 'run', 'playground']
# ['good', 'product', 'yield', 'good', 'result']
# ['study', 'show', 'improve', 'performance']

Pattern 3: Hybrid Approach

def hybrid_normalization(text):
    """Use fast stemming for most words, lemmatization for sentiment-critical ones."""
    sentiment_words = {'better', 'worse', 'best', 'worst', 'good', 'bad'}
    
    normalized = []
    for token in nlp(text):
        if token.text.lower() in sentiment_words:
            # Lemmatize words that carry sentiment (e.g., "better" -> "good")
            normalized.append(token.lemma_)
        else:
            # Cheap rule-based stemming for everything else
            normalized.append(stemmer.stem(token.text.lower()))
    
    return normalized

Performance Optimization Tips

For NLTK:

  1. Precompile stop word sets (don't reload each time)

  2. Use list comprehensions instead of loops

  3. Consider multiprocessing for large corpora

  4. Cache frequently processed terms
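
Tip 4 is especially cheap to implement; a minimal sketch using functools.lru_cache to memoize repeated stems (the cache size is an arbitrary example):

from functools import lru_cache
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

@lru_cache(maxsize=100_000)  # cache stems of frequently repeated tokens
def cached_stem(token):
    return stemmer.stem(token)

tokens = ["running", "runs", "running", "studies", "running"]
print([cached_stem(t) for t in tokens])  # ['run', 'run', 'run', 'studi', 'run']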


For spaCy:

  1. Critical: Use nlp.pipe() for batch processing

  2. Disable unused pipeline components (parser, NER) if not needed

  3. Use smaller models for speed (sm instead of lg)

  4. Process documents in parallel batches

  5. Consider using GPU acceleration for very large datasets


Testing Your Implementation

def benchmark_methods(texts, iterations=100):
    """Compare stemming vs lemmatization performance"""
    import time
    
    # Stemming benchmark
    start = time.time()
    for _ in range(iterations):
        for text in texts:
            tokens = word_tokenize(text)
            [stemmer.stem(t) for t in tokens]
    stem_time = time.time() - start
    
    # Lemmatization benchmark
    start = time.time()
    for _ in range(iterations):
        for doc in nlp.pipe(texts):
            [token.lemma_ for token in doc]
    lemma_time = time.time() - start
    
    print(f"Stemming: {stem_time:.2f}s")
    print(f"Lemmatization: {lemma_time:.2f}s")
    print(f"Speedup factor: {lemma_time/stem_time:.1f}x")

# Run on your data
sample_texts = [...] # Your actual text data
benchmark_methods(sample_texts)

Future Trends


1. Neural Lemmatization

The EditTreeLemmatizer in spaCy v3.3+ represents a shift toward trainable, neural network-based lemmatization. Instead of hand-crafted rules, these models learn morphological patterns from annotated data.


Advantage: "This removes the need to write language-specific rules and can (in many cases) provide higher accuracies than lookup and rule-based lemmatizers" (spaCy Linguistic Features Documentation).


Trend: Expect more libraries to adopt neural lemmatization as transformer models become more efficient.


2. Integration with Large Language Models

As noted by LinkedIn experts in 2024, "In LLMs, lemmatization is often an implicit part of understanding language, not a discrete preprocessing step" (LinkedIn Discussion, March 2024).


Modern transformers (BERT, GPT, T5) learn subword representations that capture morphological relationships. However, explicit lemmatization as preprocessing can still improve performance on specific tasks.


Emerging pattern: Hybrid architectures that combine explicit lemmatization with transformer-based contextual embeddings.


3. Context-Aware Stemming

Research into context-aware stemming (CAS) algorithms shows promise. A 2012 study reduced Porter stemmer errors from 76.7% to 6.7% by incorporating contextual analysis (Agbele et al., 2012).


Future direction: Stemming algorithms that use surrounding context to make better truncation decisions—bridging the gap between stemming speed and lemmatization accuracy.


4. Multilingual Unified Models

The push toward language-agnostic NLP models continues. Instead of separate stemmers/lemmatizers for each language, unified models (like mBERT, XLM-R) learn cross-lingual representations.


Impact: Reduces the need for language-specific preprocessing while maintaining accuracy across 100+ languages.


5. Real-Time Lemmatization at Scale

Cloud infrastructure improvements (GPU acceleration, distributed processing, edge computing) are making real-time lemmatization feasible even for high-volume applications.


Example: Modern chatbot platforms now routinely lemmatize user input in under 50ms, making the speed argument for stemming less compelling.


6. Domain-Specific Normalization

Generic stemmers/lemmatizers struggle with specialized vocabulary (medical terms, legal language, technical jargon). The trend is toward:

  • Domain-specific training data

  • Customizable lemmatization rules

  • Industry-specific dictionaries


Prediction: By 2027, most enterprise NLP systems will use domain-adapted normalization rather than generic approaches.


FAQ


1. Can I use both stemming and lemmatization together?

Yes, but it's usually redundant. Each is trying to solve the same problem (word normalization) using different approaches. Pick one based on your speed vs accuracy tradeoff.


In rare cases, you might stem during indexing for speed, then lemmatize query terms for accuracy. But most systems pick one approach and stick with it.


2. How do I know if my stemmer is working correctly?

Test it on a sample of your actual data. Look for:

  • Overstemming: Unrelated words getting the same stem

  • Understemming: Related words getting different stems

  • Non-word stems: How often does it produce invalid strings?


Calculate error rate on a manually annotated sample (100-500 words). If errors exceed 25-30%, consider lemmatization or a different stemmer.
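
A minimal sketch of that spot-check, assuming you have hand-annotated a small sample of (word, expected stem) pairs:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical gold annotations: the stems you consider correct for your corpus
gold = [("running", "run"), ("studies", "studi"), ("university", "university")]

errors = sum(1 for word, expected in gold if stemmer.stem(word) != expected)
print(f"Error rate: {errors / len(gold):.0%}")  # 33% on this toy sample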


3. Why doesn't spaCy include stemming?

By design choice. spaCy's creators believe stemming is linguistically unsound and produces lower-quality results than lemmatization. Since spaCy targets production systems where accuracy matters, they omitted stemming entirely.


From their perspective: if you need speed, optimize lemmatization properly (batch processing, GPU). Don't sacrifice accuracy for a crude stemming shortcut.


4. Can transformers like BERT replace stemming/lemmatization entirely?

Partially. BERT-family models learn subword representations that capture morphological relationships implicitly. For many tasks, this is sufficient.


However:

  • Not all applications can afford transformer computational costs

  • Preprocessing with lemmatization can still improve downstream performance

  • Traditional models (TF-IDF, topic models) still benefit from explicit normalization


Bottom line: Transformers reduce but don't eliminate the value of stemming/lemmatization.


5. What's the difference between a stem and a lemma?

  • Stem: Result of algorithmic truncation. May not be a dictionary word. Example: "studies" → "studi"

  • Lemma: Valid dictionary base form. Always a real word. Example: "studies" → "study"


Stems create equivalence classes for matching. Lemmas preserve linguistic validity.


6. How do I handle words not in the lemmatizer's dictionary?

Most lemmatizers have fallback behavior:

  • Lookup-based: Return the original word unchanged

  • Rule-based: Apply morphological rules even to unknown words

  • Neural: Use learned patterns to predict lemma


For domain-specific terms, you might need to:

  • Extend the dictionary

  • Train a custom lemmatizer

  • Use a hybrid approach (lemmatize common words, keep technical terms as-is)


7. Does lemmatization work for non-English languages?

Yes, often better than English! spaCy supports 70+ languages. Stanza supports 60+ languages. Many morphologically rich languages (Finnish, Turkish, Arabic) benefit more from lemmatization than English does.


The challenge is finding high-quality trained models and dictionaries for lower-resource languages.


8. How much does lemmatization slow down my NLP pipeline?

Depends on implementation:

  • Unoptimized: 10-50x slower than stemming

  • Properly optimized (batch processing, GPU): 2-5x slower

  • With caching: Nearly equal for repeated text


For most modern applications, the speed difference is negligible compared to other pipeline bottlenecks (network I/O, database queries, etc.).


9. Can I train my own lemmatizer?

Yes. Tools like spaCy's EditTreeLemmatizer allow training on annotated corpora. You'll need:

  • Training data with word forms and their lemmas

  • POS tags for training examples

  • Sufficient computational resources

  • Evaluation dataset to measure accuracy


For most users, pretrained models work well. Custom training makes sense for specialized domains or low-resource languages.


10. What's the typical accuracy difference in real applications?

Benchmarks from research:

  • Stemming accuracy: 70-85% for English (varies by algorithm)

  • Lemmatization accuracy: 90-98% with proper POS tagging


However, the impact on downstream tasks varies:

  • For information retrieval: Often statistically insignificant differences

  • For sentiment analysis: Lemmatization typically improves F1 scores by 3-8%

  • For machine translation: Lemmatization provides substantial improvements


Test on your specific task and data to measure actual impact.


11. Should I remove stop words before or after stemming/lemmatization?

Best practice: Remove stop words AFTER stemming/lemmatization.


Reason: Some stemmers/lemmatizers might normalize stop words to forms that aren't in your stop word list. Process in this order:

  1. Tokenization

  2. Stemming/Lemmatization

  3. Stop word removal


12. How do I choose between Porter, Snowball, and Lancaster stemmers?

Porter: Most balanced, widely tested, good default choice

Snowball (Porter2): Improved Porter with multilingual support

Lancaster: Most aggressive, highest stemming rate, use for precision-focused tasks


Recommendation: Start with Snowball. Switch to Porter if you need exact compatibility with legacy systems. Avoid Lancaster unless you specifically need aggressive stemming.


13. Can lemmatization hurt my model's performance?

Rarely, but possible in specific cases:

  • If POS tagging is incorrect, lemmatization propagates errors

  • For tasks like Named Entity Recognition, lemmatizing entities can destroy important information

  • In short texts (tweets, queries), morphological information might carry semantic value that lemmatization removes


Best practice: A/B test lemmatization vs no-lemmatization on your specific task with your actual data.


14. How do I handle compound words?

English: Standard stemmers/lemmatizers treat compound words as single units (e.g., "blackboard" stays as one token)


German/Dutch: These languages create many compound words. Use compound splitters before stemming/lemmatization:

  • "Donaudampfschifffahrtsgesellschaft" → ["Donau", "dampf", "schiff", "fahrt", "gesellschaft"]


Most NLP libraries for German include compound splitting as a preprocessing step.


15. What's the best way to evaluate stemmer/lemmatizer quality?

Intrinsic evaluation: Manual annotation

  • Take 500-1000 words from your corpus

  • Manually assign correct stems/lemmas

  • Compare algorithm output to gold standard

  • Calculate precision, recall, F1 score


Extrinsic evaluation: Downstream task performance

  • Run your full NLP pipeline with different normalization approaches

  • Measure end-task accuracy (classification F1, retrieval MAP, etc.)

  • Choose approach that maximizes end-task performance


The best stemmer/lemmatizer for your application is the one that improves your actual business metric most.


Key Takeaways

  1. Stemming uses crude rule-based truncation to create stems (often non-words), while lemmatization uses linguistic analysis to return valid dictionary forms (lemmas). Stemming is 5-10x faster; lemmatization is 15-20% more accurate.


  2. Speed vs Accuracy tradeoff drives the choice: Use stemming for search engines, large-scale indexing, and real-time applications where speed matters. Use lemmatization for sentiment analysis, chatbots, and tasks where semantic accuracy is critical.


  3. Language morphology matters enormously: English works reasonably with both approaches. Highly inflected languages (Finnish, Turkish, Arabic, Russian) strongly favor lemmatization—stemming error rates can exceed 30% for these languages.


  4. Real-world performance data: Studies show lemmatization produces better precision for document clustering and retrieval, especially with longer queries. However, for simple Boolean searches, stemming often performs comparably.


  5. Modern tools favor lemmatization: spaCy deliberately excludes stemming, focusing entirely on production-quality lemmatization. NLTK supports both for educational purposes. Industry trend is toward accurate lemmatization with optimized batch processing.


  6. Implementation quality matters more than choice: A poorly optimized lemmatizer (54x slower) loses to well-implemented stemming. A properly batched spaCy pipeline approaches stemming's speed while maintaining accuracy.


  7. Context awareness is lemmatization's key advantage: Understanding that "saw" as a verb → "see" but "saw" as a noun → "saw" requires the grammatical analysis that lemmatization provides.


  8. No universal "best" choice exists: A 2016 multilingual study found that "the most accurate stemmer was not the one to have the biggest improvement in Information Retrieval" across four languages. Test on your specific data and task.


  9. Hybrid approaches work well: Use stemming for initial filtering or indexing, lemmatization for final semantic analysis. Batch processing enables real-time lemmatization for user-facing applications.


  10. Future trends favor lemmatization: Neural lemmatization, transformer integration, and cloud infrastructure improvements are making accurate lemmatization feasible even for high-scale applications. The speed argument for stemming weakens as optimization improves.


Actionable Next Steps

  1. Audit your current text preprocessing pipeline. Identify where you use stemming, lemmatization, or neither. Document your current approach and its performance metrics.


  2. Run A/B tests with your actual data. Don't rely on general benchmarks—test both stemming and lemmatization on your specific task (search, classification, sentiment analysis, etc.). Measure task-specific metrics (precision, recall, F1, user satisfaction).


  3. If using NLTK, upgrade your lemmatization code to include proper POS tagging:

    lemmatizer.lemmatize(word, pos=get_pos_tag(word))

    Without POS tags, you're getting poor-quality lemmatization. (Here get_pos_tag stands for a small helper that maps each word's Treebank tag from nltk.pos_tag() to the matching WordNet constant, like the treebank_to_wordnet sketch in "How Lemmatization Works.")


  4. If using spaCy, optimize for batch processing:

    for doc in nlp.pipe(documents, batch_size=50): process(doc)

    This single change can provide 10-50x speed improvements.


  5. For search/IR applications currently using no normalization, start with Snowball stemming. It's the fastest path to measurable improvement with minimal implementation cost.


  6. For sentiment analysis or chatbots currently using stemming, migrate to lemmatization. The accuracy gain typically improves end-task performance by 3-8% with acceptable speed tradeoff.


  7. Set up proper benchmarking infrastructure. Create a test suite that measures:

    • Processing speed (words/second)

    • Normalization accuracy (if you have gold standard data)

    • End-task performance (your actual business metric)


  8. Document your decision rationale. Write down why you chose stemming or lemmatization, what tradeoffs you considered, and what metrics justify your choice. Revisit annually as infrastructure and tools improve.


  9. Consider language-specific needs. If working with multiple languages, don't assume one approach works for all. Use language-appropriate normalization strategies.


  10. Stay updated on neural lemmatization. As trainable lemmatizers improve, they may offer accuracy gains with minimal speed penalty. Monitor releases from spaCy, Stanza, and other major NLP libraries.


Glossary

  1. Agglutinative Language: A language where words are formed by stringing together morphemes, each retaining its original meaning (e.g., Turkish, Finnish). Makes stemming particularly challenging.

  2. Corpus (plural: corpora): A large collection of text documents used for NLP research and training.

  3. Derivational Morphology: Word formation through adding affixes that change meaning or part of speech (e.g., "happy" → "unhappiness"). Stemming often removes these.

  4. EditTreeLemmatizer: A trainable neural lemmatization component in spaCy v3.3+ that learns morphological transformations from annotated data.

  5. Inflection: Modification of a word to express grammatical categories like tense, number, case, or gender (e.g., "run," "runs," "running").

  6. Information Retrieval (IR): The process of finding relevant documents from a large collection in response to a user query.

  7. Lemma: The canonical, dictionary form of a word. All inflected forms map to a single lemma (e.g., "am," "are," "is" → "be").

  8. Lemmatization: The process of reducing words to their lemma using morphological analysis and dictionaries.

  9. Morpheme: The smallest meaningful unit of language (e.g., "un-" in "unhappy").

  10. Morphological Analysis: Studying the structure and form of words, including how they're built from morphemes.

  11. Natural Language Processing (NLP): The field of AI focused on enabling computers to understand, interpret, and generate human language.

  12. Overstemming: Stemming error where unrelated words are reduced to the same stem (e.g., "universal" and "university" → "univers").

  13. Part-of-Speech (POS) Tagging: Assigning grammatical categories (noun, verb, adjective, etc.) to each word in text.

  14. Porter Stemmer: The most widely used English stemming algorithm, developed by Martin Porter in 1979-1980.

  15. Snowball: A framework for implementing stemming algorithms, and the name for improved Porter stemmer (Porter2) with multilingual support.

  16. Stem: The base form of a word after removing affixes through stemming. May not be a valid dictionary word.

  17. Stemming: The process of reducing words to their stem through rule-based suffix/prefix removal.

  18. Stop Words: Common words (e.g., "the," "is," "and") that carry little semantic value and are typically removed during preprocessing.

  19. TF-IDF (Term Frequency-Inverse Document Frequency): A numerical statistic reflecting how important a word is to a document in a collection.

  20. Tokenization: Breaking text into individual words (tokens) as a preprocessing step.

  21. Understemming: Stemming error where related words retain different stems (e.g., "data" and "datum" not reducing to the same stem).

  22. WordNet: A large lexical database of English, grouping words into sets of synonyms and providing semantic relationships.


Sources & References

  1. Agbele, K., Adesina, A., Azeez, N., & Abidoye, A. (2012). Context-Aware Stemming Algorithm for Semantically Related Root Words. African Journal of Computing & ICT, 5(4), 33-42.

  2. Balakrishnan, V., & Lloyd-Yemoh, E. (2014). Stemming and Lemmatization: A Comparison of Retrieval Performances. Lecture Notes on Software Engineering, 2, 262-267. DOI: 10.7763/LNSE.2014.V2.134

  3. Babel Street. (February 25, 2025). Delivering More Accurate Search Results with Lemmatization. Retrieved from https://www.babelstreet.com/blog/delivering-more-accurate-search-results-with-lemmatization

  4. Bastaki Software Solutions. (March 12, 2025). Natural Language Processing with Python: A Comprehensive Guide to NLTK, spaCy, and Gensim in 2025. Retrieved from https://bastakiss.com/blog/python-5/natural-language-processing-with-python-a-comprehensive-guide-to-nltk-spacy-and-gensim-in-2025-738

  5. Bitext. (May 4, 2023). What is the Difference Between Stemming and Lemmatization? Retrieved from https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/

  6. Coursera. (June 5, 2025). Lemmatization vs. Stemming: Understanding NLP Methods. Retrieved from https://www.coursera.org/articles/lemmatization-vs-stemming

  7. Doko, A., Štula, M., et al. (2020). Sentence Retrieval Using Stemming and Lemmatization with Different Length of the Queries. Advances in Science, Technology and Engineering Systems Journal, 5(3), 45. DOI: 10.25046/aj050345

  8. DS Stream Artificial Intelligence. (2025). The Grand Tour of NLP: spaCy vs. NLTK. Retrieved from https://www.dsstream.com/post/the-grand-tour-of-nlp-spacy-vs-nltk

  9. Flores, F. N., & Moreira, V. P. (April 18, 2016). Assessing the Impact of Stemming Accuracy on Information Retrieval – A Multilingual Perspective. Information Processing & Management, 52(6), 1117-1135. DOI: 10.1016/j.ipm.2016.04.007

  10. GeeksforGeeks. (July 1, 2024). Lemmatization vs. Stemming: A Deep Dive into NLP's Text Normalization Techniques. Retrieved from https://www.geeksforgeeks.org/nlp/lemmatization-vs-stemming-a-deep-dive-into-nlps-text-normalization-techniques/

  11. IBM. (November 17, 2025). What Are Stemming and Lemmatization? IBM Think Topics. Retrieved from https://www.ibm.com/think/topics/stemming-lemmatization

  12. Korenius, T., Laurikkala, J., Järvelin, K., & Juhola, M. (2004). Stemming and Lemmatization in the Clustering of Finnish Text Documents. Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM'04), 625-633. DOI: 10.1145/1031171.1031285

  13. LinkedIn Expert Panel. (March 2, 2024). How Do Stemming and Lemmatization Affect the Performance and Scalability of NLP Applications? Retrieved from https://www.linkedin.com/advice/1/how-do-stemming-lemmatization-affect

  14. NewsCatcher. (March 14, 2024). SpaCy vs NLTK: Text Normalization Comparison [with code]. Retrieved from https://www.newscatcherapi.com/blog-posts/spacy-vs-nltk-text-normalization-comparison-with-code-examples

  15. Polus, M. E., & Abbas, T. (February 26, 2021). Development for Performance of Porter Stemmer Algorithm. Eastern-European Journal of Enterprise Technologies, 1(2), 6-13. DOI: 10.15587/1729-4061.2021.225362

  16. Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program: Electronic Library and Information Systems, 14(3), 130-137. DOI: 10.1108/eb046814

  17. Pramana, R., Debora, Y., Subroto, J. J., Gunawan, A. A. S., et al. (November 4, 2022). Systematic Literature Review of Stemming and Lemmatization Performance for Sentence Similarity. Proceedings of the 2022 International Conference on Information Technology Systems and Innovation (ICITSI), 366-371. DOI: 10.1109/ICITSI56531.2022.9970943

  18. spaCy. (2025). Facts & Figures. spaCy Usage Documentation. Retrieved from https://spacy.io/usage/facts-figures

  19. spaCy. (2025). Linguistic Features: Lemmatization. spaCy Usage Documentation. Retrieved from https://spacy.io/usage/linguistic-features

  20. spaCy GitHub Issue #1837. (January 13, 2018). Why the Performance of Lemmatizing of spaCy is So Slow Compared with NLTK. Retrieved from https://github.com/explosion/spaCy/issues/1837

  21. Stanford NLP. Introduction to Information Retrieval: Stemming and Lemmatization. Retrieved from https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

  22. TechTarget. What is Lemmatization? Definition. Retrieved from https://www.techtarget.com/searchenterpriseai/definition/lemmatization



