As AI reshapes how we engage with information, Emma Hooper, head of information management strategy at RLB Digital, explores how we can refine large language models to improve accuracy, reduce bias, and uphold data integrity — without losing the essential human skill of critical thinking
In a world where AI is becoming an increasingly integral part of our everyday lives, the potential benefits are immense. However, as someone with a background in technology — having spent my career producing, managing or thinking about information — I continue to contemplate how AI will alter our relationship with information and how the integrity and quality of data will be managed.
Understanding LLMs
AI is a broad field focused on simulating human intelligence, enabling machines to learn from examples and apply this learning to new situations. As we delve deeper into its sub-types, we become more detached from the inner workings of these models, and the statistical patterns they use become increasingly complex. This is particularly relevant with large language models (LLMs), which generate new content based on training data and user instructions (prompts).
A large language model uses a transformer model, a specific type of neural network. These models learn patterns and connections between words and phrases, so the more examples they are fed, the more accurate they become. Consequently, they require vast amounts of data and significant computational power, which puts considerable pressure on the environment. These models power tools such as ChatGPT, Gemini, and Claude.
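To make this concrete, here is a minimal sketch of prompt-driven generation in Python, using the open-source Hugging Face transformers library and a small public model. These specifics are my own illustration; commercial tools such as ChatGPT, Gemini and Claude sit behind APIs and use far larger models.

```python
# A minimal, illustrative sketch of prompt-driven text generation
# with a small openly available model (gpt2).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "A knowledge graph is"
result = generator(prompt, max_new_tokens=30, do_sample=False)

# The model continues the prompt with the statistically most likely
# tokens learned from its training data; it has no real understanding.
print(result[0]["generated_text"])
```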
The case of DeepSeek-R1
DeepSeek-R1, which has recently been in the news, demonstrates how constraints can drive innovation through good old-fashioned problem-solving. This open-source LLM uses rule-based reinforcement learning, making it cheaper and less compute-intensive to train than more established models.
As an LLM, however, it still faces limitations in output quality. When it comes to accuracy, LLMs are statistical models that operate on probabilities, so their responses are limited to what they have been trained on. They perform well within their dataset, but where there are gaps, or a prompt goes out of scope, inaccuracies or hallucinations can occur.
Inaccurate information is problematic when reliability is crucial, but trust in quality isn’t the only issue. General LLMs are trained on internet content, but much domain-specific knowledge isn’t captured online or is behind downloads/paywalls, so we’re missing out on a significant chunk of knowledge.
Training LLMs: the built environment
Training LLMs is resource-intensive and requires vast amounts of data. However, data sharing in the built environment is limited, and ownership is often debated. This raises several questions in my mind: Where does the training data come from? Do trainers have permission to use it? How can organisations ensure their models’ outputs are interoperable? Are SMEs disadvantaged due to limited data access? How can we reduce bias from proprietary terminology and data structures? Will the vast variation hinder the ability to spot correct patterns?
With my information manager hat on, I worry that without proper application and understanding it isn't just rubbish in, rubbish out; it's rubbish out on a huge scale, all of it artificial, completely overwhelming us.
How do we improve the use of LLMs?
Techniques such as Retrieval Augmented Generation (RAG) use a vector database to retrieve relevant information from a specific knowledge base. That information is included in the LLM prompt, so the outputs are much more relevant and up to date. Having more control over the knowledge base ensures the sources are known and reliable.
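As a rough illustration, the retrieval step of RAG can be sketched in a few lines of Python. The embedding model, the made-up documents and the in-memory list standing in for a vector database are all assumptions for the sake of the example, not part of any particular product.

```python
# A minimal sketch of the retrieval step in RAG, using
# sentence-transformers for embeddings and a plain Python list
# in place of a real vector database.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# A tiny, controlled knowledge base of trusted documents (illustrative).
documents = [
    "ISO 19650 sets out principles for managing information on built assets.",
    "A federated model combines discipline models for design coordination.",
    "An asset information model supports the operational phase of an asset.",
]
doc_vectors = model.encode(documents)

question = "Which standard covers information management?"
query_vector = model.encode([question])[0]

# Cosine similarity between the question and each document.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
best = documents[int(np.argmax(scores))]

# The retrieved passage is placed in the prompt so the LLM answers
# from a known, reliable source rather than from memory alone.
prompt = f"Using only this context:\n{best}\n\nAnswer the question: {question}"
print(prompt)
```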
This leads to an improvement, but the machine still doesn’t fully understand what it’s being asked. By introducing more context and meaning, we might achieve better outputs. This is where returning to information science and using knowledge graphs can help.
A knowledge graph is a collection of interlinked descriptions of things or concepts. It uses a graph-structured data model within a database to create connections – a web of facts. These graphs link many ideas into a cohesive whole, allowing computers to understand real-world relationships much more quickly. They are underpinned by ontologies, which provide a domain-focused framework to give formal meaning. This meaning, or semantics, is key. The ontology organises information by defining relationships and concepts to help with reasoning and inference.
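A toy example helps show what a "web of facts" means in practice. The sketch below uses Python and the rdflib library (my choice of tooling, not something prescribed here) to build a tiny graph with a couple of classes, one relationship and some linked facts; all of the names are invented for illustration.

```python
# A minimal sketch of a knowledge graph: an ontology gives the terms
# formal meaning, and triples link individual things into a web of facts.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/built-environment#")
g = Graph()

# A tiny ontology: two classes and a relationship between them.
g.add((EX.Asset, RDF.type, RDFS.Class))
g.add((EX.Space, RDF.type, RDFS.Class))
g.add((EX.locatedIn, RDF.type, RDF.Property))

# Facts (triples) linking individual things together.
g.add((EX.Boiler_01, RDF.type, EX.Asset))
g.add((EX.PlantRoom_2, RDF.type, EX.Space))
g.add((EX.Boiler_01, EX.locatedIn, EX.PlantRoom_2))

# Query the graph: which assets are located in which spaces?
for asset, space in g.subject_objects(EX.locatedIn):
    print(asset, "is located in", space)
```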
Knowledge graphs enhance the RAG process by providing structured information with defined relationships, creating more context-enriched prompts. Organisations across various industries are exploring how to integrate knowledge graphs into their enterprise data strategies, so much so that they have even made it onto the Gartner Hype Cycle, on the Slope of Enlightenment.
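To show what a graph-enriched prompt might look like, here is a hedged sketch that reuses the same illustrative entities: the facts about an entity are pulled from the graph and placed in the context the LLM sees, giving it explicit, defined relationships rather than loose text.

```python
# A minimal sketch of enriching an LLM prompt with facts from a
# knowledge graph; entities and relationships are illustrative only.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/built-environment#")
g = Graph()
g.add((EX.Boiler_01, EX.locatedIn, EX.PlantRoom_2))
g.add((EX.Boiler_01, EX.maintainedBy, EX.FM_Team_A))

def graph_context(graph, entity):
    # Turn every fact about the entity into a readable line of text.
    return "\n".join(f"{entity} {p} {o}" for p, o in graph.predicate_objects(entity))

question = "Where is Boiler_01 and who maintains it?"
prompt = (
    "Answer using only these facts from the knowledge graph:\n"
    f"{graph_context(g, EX.Boiler_01)}\n\n"
    f"Question: {question}"
)

# The structured facts give the LLM known relationships to work from,
# rather than leaving it to guess from unstructured text alone.
print(prompt)
```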
The need for critical thinking
From an industry perspective, semantics is not just where the magic lies for AI; it is also crucial for sorting out the information chaos in the industry. The tools discussed can improve LLMs, but the results still depend on a backbone of good information management. This includes having strategies in place to ensure information meets the needs of its original purpose and implementing strong assurance processes to provide governance.
Therefore, before we move too far ahead, I believe it’s crucial for the industry to return to the theory and roots of information science. By understanding this, we can lay strong foundations that all stakeholders can work from, providing a common starting point and a sound base to meet AI halfway and derive the most value from it.
Above all, it's important not to lose sight of the fact that this begins and ends with people, and one of the greatest things we can ever do is think critically and keep questioning!