NLPgap Research
Overview
This project examines what Latin American critical cultural theory can contribute to the development of culturally situated natural language processing systems. Building on Zhou et al.'s 2025 framework for cultural NLP, I argue that achieving culturally competent language models requires rethinking data curation from the ground up: not as a post-processing correction but as a foundational methodology.
Core Argument
Current NLP data curation pipelines risk categorizing culturally significant linguistic features as "noise", excluding elements that carry unique meaning and representational value for non-dominant language variants. The problem cannot be solved at the modeling stage; it must be addressed at the curation stage, with culturally situated evaluation determining what constitutes "signal" versus "noise" rather than imposing categories from dominant frameworks.
This requires an epistemological redesign of data curation pipelines that integrates professionals trained in cultural analysis from the initial design stages, not as consultants brought in after the fact but as core participants in determining the categories themselves.
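A minimal sketch of the signal-versus-noise problem described above, under stated assumptions. It is not taken from any real curation pipeline: the vocabulary sets, the `keep` function, and the 0.8 threshold are all hypothetical, chosen only to show how a filter calibrated on a dominant variant can discard a well-formed Chilean Spanish sentence, and how the same filter behaves once the lexicon is defined with cultural input.

```python
# Hypothetical toy lexicon standing in for a "standard Spanish" word list.
STANDARD_VOCAB = {
    "no", "sé", "qué", "hacer", "con", "esto", "la", "es",
}

# Hypothetical additions contributed by cultural analysts: common
# Chilean Spanish forms ("pega" = job, "bacán" = great, "cachai" = you see?).
SITUATED_VOCAB = STANDARD_VOCAB | {"pega", "bacán", "cachai", "po"}

PUNCTUATION = ".,;:!?¿¡"


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    tokens = (piece.strip(PUNCTUATION).lower() for piece in text.split())
    return [token for token in tokens if token]


def in_vocab_ratio(text, vocab):
    """Fraction of tokens found in the given vocabulary."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in vocab for token in tokens) / len(tokens)


def keep(text, vocab, threshold=0.8):
    """Toy 'quality' filter: keep text whose in-vocabulary ratio is high enough."""
    return in_vocab_ratio(text, vocab) >= threshold


samples = [
    "No sé qué hacer con esto.",
    "La pega es bacán, ¿cachai?",
]

for sample in samples:
    print(
        sample,
        "| standard lexicon:", keep(sample, STANDARD_VOCAB),
        "| situated lexicon:", keep(sample, SITUATED_VOCAB),
    )
```

In this toy setup the second sentence is rejected as "noise" under the standard lexicon and retained under the situated one. The filtering logic is identical in both cases; only the category definition changes, which is why the argument locates the problem at the curation stage rather than the modeling stage.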
Theoretical Framework
The research draws on three interconnected bodies of work:
- Postcolonial theory (Quijano, Mignolo, Lugones, Bhabha, Spivak): Understanding anglocentrism in NLP as a tool of ongoing coloniality, where technical systems reproduce epistemic hierarchies
- Latin American cultural studies (García Canclini, Ortiz, de Andrade, Bucholtz & Hall): Theories of hybridization, transculturation, and indexicality developed from the experience of being culturally situated outside dominant centers
- Cultural NLP frameworks (Zhou et al. 2025): Contemporary computational approaches to cultural representation in language models
The convergence of these traditions is not coincidental. Latin American theorists have been thinking about cultural attenuation, hybrid identities, and epistemic violence for decades. Zhou et al. arrived at similar concerns through computational linguistics. My work connects these parallel developments, showing that theoretical traditions emerging from lived experience of marginalization offer methodological insights mainstream NLP currently lacks.
Planned Output
Academic paper for arXiv (cs.CY or cs.CL), originally written in Latin American Spanish, then translated into English as a performative gesture of the argument itself. Target: state-of-the-art review plus theoretical contribution, suitable for conference or journal submission.
Current Status
- Complete abstract and paper outline
- Introduction in progress (fragments written, requires consolidation)
- Comprehensive bibliography organized in Zotero (postcolonial theory, cultural studies, NLP research)
- Research notes and synthesis documents for key theorists
- Critical analysis of current initiatives (CENIA, LatamGPT, PatagonIA)
Note on pause: This research is paused while I establish financial stability, not abandoned. The conceptual framework is solid, the bibliography is comprehensive, and the argument is clear. I plan to resume writing in Q1 2026 once immediate financial pressures are resolved.
Foundational Work: BA Thesis (2006)
Patrimonio, Identidad e Historia: Su interacción con la institución para la formación de público
(Heritage, Identity, and History: Their Interaction with the Institution in Audience Formation)
My undergraduate thesis, directed by Chilean poet, writer, and feminist critic María Eugenia Brito Astrosa (currently faculty at Universidad de Chile's School of Visual Arts), examined how Chilean cultural institutions mediate public access to artistic heritage. Working from Bourdieu (cultural reproduction through education) and García Canclini (hybrid cultures, cultural consumption), we argued that the problem wasn't just geographic concentration of art in Santiago or economic barriers but the systematic absence of visual literacy formation.
Without teaching people how to read artistic codes, making art "available" perpetuates exclusion rather than democratizing access. We proposed the Patrimony-Identity-History triangle as a framework: these concepts are co-constitutive, and you cannot democratize access to one without addressing the others.
Why This Matters Now
This thesis establishes the conceptual foundation for all my subsequent work. The question is structurally identical to what I investigate in NLPgap Research:
2006: How do cultural institutions reproduce unequal access to artistic patrimony? Formal "availability" of art in museums doesn't democratize access without teaching visual literacy codes.
2025: How do NLP systems reproduce the cultural attenuation of Latin American Spanish speakers? Formal "inclusion" of Spanish in datasets doesn't amount to representation without cultural attunement.
Same structure: Who has access? Who is represented? How do institutions (museums then, tech companies now) reproduce hegemonies? What's needed for true democratization versus performative inclusion?
I've been working on this question for twenty years. It didn't start with AI; it started with trying to understand why my own culture felt inaccessible to me despite being "available" in official institutions.
Methodological Continuity
The thesis demonstrates skills directly transferable to current research:
- Institutional policy analysis (dissected the entire Chilean state cultural apparatus)
- Interdisciplinary theory synthesis (sociology + semiotics + cultural studies + art theory)
- Systems thinking (culture as production-circulation-consumption circuits)
- Quantitative cultural data analysis (museum visitors, school enrollment, consumption patterns)
- Explicit commitment to epistemic equity and democratization
Research Interests
- Cultural representation and attenuation in NLP systems
- Data curation methodologies for non-dominant languages and cultures
- Institutional reproduction of cultural hegemonies in technical systems
- Translation theory and its application to AI localization
- Feminist technoscience and relational approaches to human-AI interaction
- Critical evaluation of "AI ethics" discourse and corporate accountability