Project FAQ

A fan search tool built as a serious RAG project.

Umaguedo Search was created to help Umamusume fans identify characters, explore facts, and ask grounded questions, while also serving as a portfolio project for learning and demonstrating modern RAG engineering. This FAQ explains the data sources, retrieval strategy, maintenance workflow, safeguards, and limitations behind the system.

What is Umaguedo Search?

Umaguedo Search is an independent AI-powered search assistant for the Umamusume Pretty Derby universe. It is built around retrieval-augmented generation: the system searches stored character data, retrieves relevant evidence, and uses the model to produce grounded answers instead of relying on memory alone. It helps users identify characters from clues, ask questions about profiles, appearance, relationships, media appearances, and gameplay traits, and compare characters when the available evidence supports it.

Why build this project?

Umaguedo Search started as a sandbox to learn retrieval-augmented generation in practice: testing different retrieval strategies, working with structured and unstructured data, experimenting with tools and orchestration, and understanding where a RAG system fails. Over time, it evolved into a usable search assistant for Umamusume fans, while also forcing production-oriented decisions around deployment, API usage, cost control, rate limits, database maintenance, and abuse protection.

Why use Umamusume as the domain?

The idea started from a casual fan-domain experiment, but it quickly became technically interesting. A fictional game universe is a good environment for learning RAG because it is engaging to work with, avoids private, medical, financial, or company-internal documents, and still provides enough complexity to test real retrieval problems. Umamusume is especially useful because the franchise is well-developed and extremely character-rich: it has many characters, and each one has detailed profile facts, visual traits, relationships, media appearances, gameplay attributes, variants, and alternate names. Although the domain is still relatively niche, it has an active fan community and well-maintained resources such as fan wikis and fan pages, which makes it practical to build a real retrieval system from public data. At the same time, it is niche enough that there is no obvious tutorial or standard dataset to copy, so building the system required real data modeling, ingestion, normalization, retrieval, and grounding work.

What sources are used?

The current version mainly uses public Umamusume fan resources, especially Gametora and umamusu.wiki. These sources provide structured and semi-structured information such as character data, profile facts, relationships, assets, gameplay attributes, media appearances, and descriptive text. The project ingests that information into PostgreSQL, normalizes it into searchable records, and builds retrieval units for SQL-based and semantic search. When possible, answers and character cards link back to the original source pages instead of presenting the data as owned by this project.

How are derived attributes used?

The system does not only store source records as-is. It also builds derived attributes from the ingested character data, such as normalized physical measurements, percentiles, and searchable buckets. This helps bridge the gap between exact source values and natural user queries: a user may ask for a “short”, “tall”, or visually distinctive character without knowing the exact profile numbers. These derived attributes are rebuilt from the source data and used by SQL tools and retrieval units as additional structured evidence.

Will more sources be added?

Possibly. A future version could add more public sources to improve coverage of media appearances, anime events, manga-related information, story context, or newly released Umamusume content. The goal would be to expand the searchable evidence while keeping attribution, source links, and a clear separation between retrieved facts and model inference.

How is the assistant scoped?

The assistant is scoped to the Umamusume Pretty Derby domain. Out-of-domain rejection was tested so unrelated requests should be refused instead of being answered from the model's general knowledge. Within the Umamusume scope, the system has been tested on representative query types such as identification, facts, relationships, gameplay filters, media appearances, and cautious comparisons, but it has not been exhaustively tested on every possible in-domain question.

How should I try the demo?

Good test queries include direct character questions, vague visual descriptions, relationship questions, gameplay filters, media appearance questions, and cautious comparison prompts. The most interesting tests are mixed queries, because they force the system to combine normalization, structured search, semantic retrieval, and grounded response generation instead of answering from a single keyword match.

Can it handle noisy or multilingual queries?

Yes, within limits. Users do not need to write perfectly formatted queries: the system can handle multilingual input through an LLM-based normalization step. Before retrieval, the normalizer tries to preserve character names, clean up noisy wording, clarify the user's intent, and reformulate the query for the search tools. The assistant then tries to answer in the same language as the user when possible. This improves robustness for mixed languages, vague clues, common name variations, and imperfect phrasing, but it does not guarantee perfect handling of every typo, ambiguous clue, or unsupported language.

How does a query get answered?

A user query first goes through normalization, then the system chooses the most relevant retrieval path. Depending on the request, it can use structured SQL tools for facts, gameplay, assets, and relationships, semantic retrieval for descriptive or visual clues, and candidate grounding to keep the answer tied to retrieved evidence. The final response is generated from that evidence instead of relying only on the model's prior knowledge.

How much does answer quality depend on the sources?

A large part of the system's quality depends on the quality, structure, and coverage of the underlying source data, especially umamusu.wiki. The retrieval and grounding layers can help organize, search, and combine evidence, but they cannot create reliable facts that are missing, incomplete, ambiguous, or incorrect in the source material. When the source data is strong, the assistant is much more useful; when coverage is weak, the system should be more cautious.

What is grounded inference?

Some questions are not directly answered by a source. For example, a comparison or “most likely” question may require reasoning from available evidence. In those cases, the assistant should separate explicit facts from a cautious estimate, and avoid presenting an inference as canon or as something directly stated by a source. If the retrieved evidence is too weak, the assistant should say so.

How does data maintenance work?

The project includes a daily scheduled maintenance workflow for keeping the database usable over time. In the normal path, it checks database health, looks for newly available characters, syncs only the affected records, and updates the relevant retrieval units and embeddings. If the health check detects missing or corrupted retrieval data, the system can fall back to a heavier repair rebuild that regenerates derived attributes, appearance units, and RAG retrieval units.

How are costs and abuse controlled?

Because public LLM endpoints can be abused or become expensive, the API includes server-side safeguards: request size limits, message length limits, duplicate prompt protection, per-IP and global concurrency controls, rate limits, daily caps, and restricted debug access. These checks are enforced by the backend, so they still apply even if someone bypasses the frontend and calls the API directly.

What is the technical stack?

The backend is built with FastAPI, PostgreSQL, and SQLAlchemy. The RAG pipeline uses LangChain and LangGraph to orchestrate tools, route queries, combine retrieval steps, and generate grounded answers. Retrieval combines structured SQL tools and embedding-based semantic search. LLM inference and embedding generation are handled through OpenAI-compatible API calls, used for normalization, retrieval preparation, embeddings, and final answer generation. The frontend is static HTML/CSS/JavaScript, with Docker-oriented deployment and scheduled maintenance scripts for sync, health checks, and rebuilds.

What are the limitations?

Umaguedo Search is a portfolio project and fan-made search assistant, not an official encyclopedia. It can fail if the source data is incomplete, if retrieval misses the relevant evidence, or if the model misinterprets the retrieved context. Ambiguous, speculative, or underspecified questions may produce cautious estimates rather than definitive answers. The system is designed to be grounded, but it is not guaranteed to be exhaustive or error-free.

Is it official?

No. Umaguedo Search is an independent, fan-made technical project. It is not affiliated with, endorsed by, sponsored by, or connected to the official Umamusume Pretty Derby rights holders. Source links are provided for attribution and verification, and the project should not be treated as an official encyclopedia or official game resource.

Who built it?

Umaguedo Search was built by Alexandre Aguedo, an AI and machine learning engineer. I created it as a hands-on project to learn RAG engineering in depth and to demonstrate how retrieval, tool orchestration, backend deployment, database maintenance, and API safeguards fit together in a real application.