DigitAI for Localized TEI / XML Assistance

An experiment with Small-Scale XAI

Presentation for DH2025 in Lisbon, Portugal

Wednesday, 16 July 2025

Alexander C. Fisher, Hadleigh Jae Bills, and Elisa Beshero-Bondar

Penn State Erie, The Behrend College

Link to Slides: https://bit.ly/digitai-dh25

What DigitAI will be

A tool that assists scholars and editors in applying the TEI Guidelines to their own corpora

  • Built for users who are new to TEI or seeking deeper understanding
  • Acts as a TEI-aware tutor, NOT an auto-encoder
  • Helps scholars and editors understand best practices in TEI encoding
  • Helps TEI developers find inconsistencies in the Guidelines

Supports learning by providing relevant examples, explanations, and excerpts directly from the TEI Guidelines

Choosing a model was not easy!

What we needed in a model:

  • Small "explainable AI"
  • Obedient! (an instruction-tuned model)
  • Fairly lightweight (minimal resource footprint)
  • Multilingual: although the TEI is mostly in English, we wanted to include the multilingual glosses and descriptions in the code of the TEI
  • Interpretable: supports learning, not just answering
  • Modular: should work as part of a RAG pipeline
  • Open-source: for transparency and reproducibility

Our path so far:

  • Started with Meta's Llama models, run via Ollama
    • Very fast and lightweight
    • But too limited in multilingual support for our needs!
  • Now we are exploring Qwen2 models, which show real promise!
    • Models being explored currently
      • Qwen2:7b
      • Qwen2:7b-instruct

 

Why TEI?

  • One of us regularly edits TEI Guidelines
  • TEI Guidelines: delivered as
    • website (and pdf, etc)
    • a code schema (TEI-all.rng)
    • most importantly, built in TEI XML
  • p5.xml = a single highly structured, systematic code base and knowledge resource, containing:
    • chapters
    • examples
    • "coding specs" that define the building blocks of the TEI elements, attributes, macros, modules, datatypes, etc.
    • built from component parts when we release new versions of the Guidelines
    • one of the best-organized resources we can imagine for designing a knowledge base for a RAG system*
      * Literally, p5.xml is designed from the ground up to be the TEI community's "ground truth" about our encoding system!

p5.xml → XSLT → digitai-p5.json

TEI Guidelines p5.xml: a single XML document that contains the entire built TEI Guidelines. It's a hierarchical (tree) structure:

  • contains front matter, 24 Chapters, each with sections and nested subsections.
    • Paragraphs in chapters
      • contain encoding that defines rules for elements and attribute classes, model classes (connecting related elements and attributes together), datatype specifications.
  • We are writing XSLT to translate this XML tree into
    • a JSON structure (digitai-p5.json) for a database to ingest
    • query scripts (in Cypher) that direct this database to translate the JSON into a knowledge graph
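We do this with XSLT 3.0, but the core idea of flattening an XML tree into database-ready JSON can be sketched in Python (an illustrative sketch only: the element and key names here are simplified stand-ins, not the real p5.xml structure):

```python
import json
import xml.etree.ElementTree as ET

# A tiny stand-in for p5.xml's hierarchy (the real file is vastly larger).
sample = """
<guidelines>
  <chapter id="CO">
    <title>Core Elements</title>
    <section id="COTI"><title>Titles</title></section>
  </chapter>
</guidelines>
"""

def chapter_to_dict(chapter):
    """Flatten one chapter element into a JSON-ready dict."""
    return {
        "ID": chapter.get("id"),
        "CHAPTER": chapter.findtext("title"),
        "CONTAINS_SECTIONS": [
            {"ID": s.get("id"), "SECTION": s.findtext("title")}
            for s in chapter.findall("section")
        ],
    }

root = ET.fromstring(sample)
data = {"CONTAINS_CHAPTERS": [chapter_to_dict(c) for c in root.findall("chapter")]}
print(json.dumps(data, indent=2))
```

The resulting JSON preserves the containment hierarchy of the XML tree, which is what the database later turns into nodes and relationships.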

Designing a RAG with the TEI

Making a knowledge graph: the challenge!

[Pipeline diagram: <XML> source → <XSLT> transformation → <Cypher> scripts → Neo4j knowledge graph (Nodes and Relationships) → LLM]

Automating the Cypher scripts for Neo4j

  • XSLT mode 1 (completed):
    • reads p5.xml
    • applies XSLT 3.0 maps and arrays
    • outputs JSON data
  • XSLT mode 2 (under development):
    • reads an XML mapping of that JSON data and outputs (in text mode) a Cypher script for Neo4j to create a knowledge graph
      • Current prototype is partially complete: built by Cypher scripts we wrote by hand
      • Very tricky work -- SO much debugging!
      • Neo4j reads the Cypher script as instructions for creating the knowledge graph. Neo4j constructs the knowledge graph as a NEW JSON: neo4j.json
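The JSON-to-Cypher translation itself is done in XSLT, but the shape of the output can be sketched in Python, where each record from digitai-p5.json becomes a MERGE statement (a simplified, hypothetical sketch; the real mode-2 output is far richer):

```python
def chapter_to_cypher(chapter):
    """Emit one Cypher MERGE statement for a chapter record from digitai-p5.json."""
    return (
        f"MERGE (c:Chapter {{chapter_id: '{chapter['ID']}'}}) "
        f"SET c.title = '{chapter['CHAPTER']}', c.sequence = {chapter['SEQUENCE']}"
    )

# Toy records standing in for the JSON produced by XSLT mode 1.
records = [
    {"ID": "ST", "CHAPTER": "The TEI Infrastructure", "SEQUENCE": 1},
    {"ID": "CO", "CHAPTER": "Elements Available in All TEI Documents", "SEQUENCE": 3},
]
script = "\n".join(chapter_to_cypher(r) for r in records)
print(script)
```

MERGE (rather than CREATE) keeps the script idempotent: rerunning it against an existing graph matches existing nodes instead of duplicating them.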

The Four Beating Hearts of DigitAI

The Four Core Components of the DigitAI Retrieval Pipeline

Neo4j Graph Database

BGE-M3 Embedding Model

FAISS Vector Search

Qwen2-7B Language Model

Core 1— Neo4j: The Memory Graph

  • Stores TEI Guidelines as Nodes and Relationships
  • Enables retrieval of structured and semantically relevant text
  • Preserves relationships of the XML structure
  • Queried using Cypher, Neo4j's native query language
    • Allows us to retrieve precise, contextual excerpts

Neo4j is like a digital card catalog: it doesn't just store texts, it understands how they relate to each other

WITH doc, doc_data 
FOREACH (part_data_1 IN doc_data.CONTAINS_PARTS | 
	MERGE (part:Part {name: part_data_1.PART}) SET part.sequence = part_data_1.SEQUENCE 
	MERGE (doc)-[:HAS_PART]->(part) 
	WITH part, part_data_1 
	FOREACH (chapter_data_2 IN part_data_1.CONTAINS_CHAPTERS | 
		MERGE (chapter:Chapter {chapter_id: chapter_data_2.ID}) 
        SET chapter.sequence = chapter_data_2.SEQUENCE, 
        chapter.title = chapter_data_2.CHAPTER,
        chapter.links = [x IN chapter_data_2.RELATES_TO WHERE x IS NOT NULL | x.ID] 
		MERGE (part)-[:HAS_CHAPTER]->(chapter) 
		WITH chapter, chapter_data_2 

FOREACH (example_data_7 IN paragraph_data_6.TEI_ENCODING_DISCUSSED.CONTAINS_EXAMPLES | 
	MERGE (example:Example {example: example_data_7.EXAMPLE}) 
    SET example.language = example_data_7.LANGUAGE 
	MERGE (paragraph)-[:HAS_EXAMPLE]->(example) 
	WITH example, example_data_7 
  FOREACH (paragraph_data_8 IN example_data_7.CONTAINS_END_PARAS | 
		MERGE (paragraph:TerminalPara {parastring: paragraph_data_8.PARASTRING}) 
        SET paragraph.files_mentioned = 
        [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.FILES_MENTIONED 
        WHERE x IS NOT NULL | x.FILE], 
        paragraph.parameter_entities_mentioned = 
          [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.PES_MENTIONED WHERE x IS NOT NULL | x.PE],
          paragraph.elements_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.ELEMENTS_MENTIONED 
          WHERE x IS NOT NULL | x.ELEMENT_NAME], paragraph.attributes_mentioned = 
          [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.ATTRIBUTES_MENTIONED 
          WHERE x IS NOT NULL | x.ATTRIBUTE_NAME], 
          paragraph.sequence = paragraph_data_8.SEQUENCE, 
          paragraph.frags_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.FRAGS_MENTIONED 
          WHERE x IS NOT NULL | x.FRAG], 
          paragraph.ns_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.NSS_MENTIONED
          WHERE x IS NOT NULL | x.NS], 
          paragraph.classes_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.CLASSES_MENTIONED
          WHERE x IS NOT NULL | x.CLASS],
          paragraph.modules_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.MODULES_MENTIONED
          WHERE x IS NOT NULL | x.MODULE], 
          paragraph.macros_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.MACROS_MENTIONED 
          WHERE x IS NOT NULL | x.MACRO], 
          paragraph.speclist_links = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.CONTAINS_SPECLISTS.SPECLIST
          WHERE x IS NOT NULL | x.ID], 
          paragraph.relaxng_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.RNGS_MENTIONED 
          WHERE x IS NOT NULL | x.RNG], 
          paragraph.datatypes_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.DATATYPES_MENTIONED 
          WHERE x IS NOT NULL | x.DATATYPE], paragraph.links = [x IN paragraph_data_8.RELATES_TO.SECTION
          WHERE x IS NOT NULL | x.ID],
          paragraph.parameter_entities_mentioned_ge = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.PES_MENTIONED 
          WHERE x IS NOT NULL | x.GE], 
          paragraph.schemas_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.SCHEMAS_MENTIONED 
          WHERE x IS NOT NULL | x.SCHEMA] 
	MERGE (example)-[:HAS_END_PARAGRAPH]->(paragraph) 
    ) 

Neo4j's Graph Model JSON... Beyond Human Readability?

{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:0", "text": null, "labels": ["Document"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:1", "text": null, "labels": ["Part"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:2", "text": null, "labels": ["Part"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:3", "text": null, "labels": ["Chapter"]}
… (Chapter nodes 4–32 continue in the same form) …
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:33", "text": null, "labels": ["Section"]}
… (Section nodes continue) …

Core 2— BGE-M3: The Embedding Model

  • Multilingual embedding model from the Beijing Academy of Artificial Intelligence (BAAI)
  • Converts text (both queries and node text) into vector representations
  • Embeddings capture semantic similarity across languages and phrasing
  • We use normalized vectors to align with FAISS search requirements
  • Enables meaningful matching even if wording differs

BGE-M3 acts as a translator! "Understanding" queries: it turns ideas into coordinates so similar meanings land "near" each other in vector space
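The normalization step matters: once vectors are scaled to unit length, cosine similarity reduces to a plain dot product, which is what an inner-product FAISS index computes. A pure-Python sketch of the idea:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as we do before indexing embeddings."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

# Toy 3-dimensional "embeddings" (real BGE-M3 vectors have many more dimensions).
q = [1.0, 2.0, 3.0]
close = [1.1, 2.1, 2.9]   # similar direction -> high similarity
far = [-3.0, 0.5, -1.0]   # different direction -> low similarity
print(cosine_similarity(q, close) > cosine_similarity(q, far))  # True
```

The same comparison works regardless of which language produced the text, because the model maps similar meanings to nearby coordinates.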

BGE-M3: Ideal for Meaning-Matching

  • Developed specifically for cross-lingual search and retrieval
  • Produces dense embeddings optimized for similarity tasks
  • Performs well on academic + technical texts
  • Fast to compute and easy to normalize
  • Available via Hugging Face, free, open-source, and well maintained

 

It takes ~10 minutes to prepare embeddings from BGE-M3 on a machine with 128 GB RAM!

Core 3— FAISS: The Vector Indexer

  • Facebook AI Similarity Search
  • High-performance vector search library from Meta AI
  • Supports nearest-neighbor search on node embeddings
  • Critical for narrowing down thousands of nodes to a small candidate set
  • We retrieve the top-K most relevant nodes for a given query
  • Helps make querying fast and lightweight

A FAISS search runs in a few milliseconds!

FAISS is like a highly efficient search engine that ranks results by context and meaning, not just keywords
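Conceptually, the nearest-neighbor search FAISS performs can be written as a brute-force scan in pure Python (FAISS does the same comparison with heavily optimized indexes; this is an illustrative sketch only):

```python
def top_k(query, vectors, k=2):
    """Return the indices of the k vectors with the highest inner product
    to the query, mimicking an inner-product search on normalized embeddings."""
    scores = [(sum(q * v for q, v in zip(query, vec)), i)
              for i, vec in enumerate(vectors)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Toy vectors standing in for the embedded Neo4j node texts.
node_vectors = [
    [1.0, 0.0, 0.0],   # node 0
    [0.0, 1.0, 0.0],   # node 1
    [0.9, 0.1, 0.0],   # node 2, close in direction to node 0
]
print(top_k([1.0, 0.0, 0.0], node_vectors))  # [0, 2]
```

FAISS's value is doing this over thousands of high-dimensional vectors without scanning every one.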

Core 4— Qwen2:7B: The LLM

[Runtime flow diagram: a human prompt is embedded by BGE-M3; separately, BGE-M3 generates vector embeddings from Neo4j node text (RAG embeddings, .jsonl), which are converted to a FAISS-formatted data map (.faiss); the data map is compared to the embedded prompt; the top-matched embeddings are retrieved as text from Neo4j; the Qwen LLM constructs the query and generates the answer.]

# === Imports (setup assumed defined earlier in the full script) ===
import json

import numpy as np
import requests
from sklearn.preprocessing import normalize as sk_normalize

# Assumed from earlier in the script: model (the BGE-M3 SentenceTransformer),
# index (the loaded FAISS index), id_map (FAISS row -> Neo4j node id),
# neo4jNodes (path to the exported JSONL), llm_model, and normalize (bool).

# === Get user input ===
query = input("❓ Enter your query: ").strip()
if not query:
    print("⚠️ No query provided. Exiting.")
    exit()
# === Encode the query ===
query_embedding = model.encode(query)

# Convert to 2D array and normalize if cosine similarity is enabled
if normalize:
    query_embedding = sk_normalize([query_embedding], norm="l2")
else:
    query_embedding = np.array([query_embedding])
query_embedding = query_embedding.astype("float32")

# === Perform FAISS similarity search ===
TOP_K = 5 # Number of top matching texts to retrieve
scores, indices = index.search(query_embedding, TOP_K)

# Filter out invalid results (-1 = no match)
matched_ids = [id_map[i] for i in indices[0] if i != -1]
if not matched_ids:
    print("⚠️ No relevant matches found in the FAISS index.")
    exit()
# === Fetch matching texts from JSONL file ===
def fetch_node_texts_by_ids(node_ids):
    texts = []
    with open(neo4jNodes, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("id") in node_ids and record.get("text"):
                texts.append(record["text"])
    return texts

texts = fetch_node_texts_by_ids(matched_ids)

if not texts:
    print("❌ No node texts found for matched IDs in local file.")
    exit()
# === Construct prompt for the LLM ===
context = "\n".join(f"- {text}" for text in texts)

prompt = f"""You are a chatbot that helps people understand 
the TEI guidelines which specify how to encode machine-readable 
texts using XML.

Answer the question below in the **same language the question is asked in**.
Use examples from the provided context as needed — they can be in any language. 
Do not translate them.

Context:
{context}

Question:
{query}
"""

# === Send prompt to local Ollama LLM ===
def ask_ollama(prompt, model):
    try:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False
            }
        )
        return response.json().get("response", "[ERROR] Empty response from LLM.")
    except Exception as e:
        print("❌ Error while querying Ollama:", e)
        return "[ERROR] Could not get response from local LLM."

# === Query the model ===
print(f"🤖 Sending prompt to LLM ({llm_model})...")
answer = ask_ollama(prompt, llm_model)

# === Display the result ===
print("\n🧾 Response:\n")
print(answer)

What did we learn?

 

  • Data structures and graph modeling
  • How to write Cypher Scripts
  • Differences between "fine-tuning", "RAG augmenting" and "training"
  • This is harder than we thought!!!
  • Also accessible to learn together (students and professor alike!)

What's next for DigitAI?

 

  • Finish refining the knowledge graph
  • Fine-tuning with the text and code specs of the P5 Guidelines
  • Optimization and packaging (user interface over local network, docker container)

Questions?

GitHub: https://bit.ly/digitai-github

Slides: https://bit.ly/digitai-dh25

Jupyter Notebooks: https://bit.ly/digitai-jupyter

OUTTAKES

AI's problems and role in our process

An example problem!

Enter your query:
What should I use to mark the title of a poem in a TEI document?
Sending prompt to LLM (qwen:7b) ...

Response:


Para marcar o título de uma poesia no formato TEI, você pode usar
um elemento específico como `<title>` ou `<head>` contendo o texto do
título. Aqui está um exemplo:

'''xml
<tei>
	<front>
		<title>Onde os Ventos Cantam</title> 
		<! -- Other frontmatter elements -->
	</front>
	<! -- Rest of the TEI document -->
</tei>
'''

No codigo acima, '<titles>' foi usado para indicar o titulo da 
poesia. Lembre-se de substituir "Onde os Ventos Cantam" pelo 
título real que deseja registrar.

(Note the problems: the model answered an English question in Portuguese, and its explanation contradicts its own example: the code uses <title>, but the prose cites a nonexistent <titles> element.)

DIGIT-AI Runtime Flow 

DigitAI for Localized TEI/XML Assistance

By Elisa Beshero-Bondar

A presentation about a small explainable AI project to fine-tune a language model with a graph RAG built with the TEI Guidelines.
