Any demo of artificial intelligence for professional, tax, legal, medical or technical practices shows the same examples. A complex document that the system summarises in thirty seconds. A draft reply to a client generated with a professional tone. A search of regulations that returns the right citations. Everything works, everything shines, the presenter smiles, the practice owner nods.
Then the demo ends, the tool goes into production, three months pass, and the reality of the practice is more complicated. Some things work really well. Others fail quietly, in ways that are difficult to pinpoint until someone pays the price. Others still change the way collaborators work in ways nobody had foreseen.
This article is about what demos do not say.
The problem of hallucinations on regulation and case law
One of the fields where AI has been presented as a turning point for professional practices is research: finding regulations, rulings, circulars and analogous cases. In theory, an AI system can consult large document bases and return in seconds what would take a collaborator hours. In practice, this promise has a documented dark side that anyone managing a practice should know.
Language models produce text. They do not necessarily consult sources. Some systems integrate dedicated search engines designed to ensure that citations come from real databases. Others do not. And even those that integrate dedicated engines sometimes produce results in which a citation is distorted: the ruling number is close but not exact, the date is off by a year, the operative part of the ruling is partially rewritten in different words. These distortions, called hallucinations, are the most serious risk in professional practices where accuracy of citations is a matter of liability.
The problem is not that AI always produces hallucinations. It produces a low percentage of them, perhaps five to ten per cent depending on the tool and context. The problem is that hallucinations are plausible. Those who do not verify do not see them. A collaborator who receives a draft client reply with three citations of rulings, one of them distorted, does not notice unless he goes through the trouble of verifying them one by one. And if he does, he has spent the time the tool was meant to save.
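A back-of-the-envelope calculation shows why a low rate still matters for a draft with several citations. The sketch below assumes, purely for illustration, that the five to ten per cent figure applies to each citation and that citations are distorted independently of one another; neither assumption comes from measured data, and the function name is hypothetical.

```python
def prob_at_least_one_distorted(per_citation_rate: float, n_citations: int) -> float:
    """Chance that a draft contains at least one distorted citation.

    Assumes (for illustration only) that every citation is distorted
    independently with the same probability.
    """
    return 1.0 - (1.0 - per_citation_rate) ** n_citations


for rate in (0.05, 0.10):
    p = prob_at_least_one_distorted(rate, 3)
    print(f"per-citation rate {rate:.0%}, 3 citations: {p:.1%}")
# per-citation rate 5%, 3 citations: 14.3%
# per-citation rate 10%, 3 citations: 27.1%
```

Under these assumptions, between roughly one in seven and one in four three-citation drafts would contain a distortion, which is why checking only a sample of citations is not enough.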
The operational solution is not to abandon the tool; it is to redesign the workflow so that source verification is part of the process, not an option. Many legal and tax practices that adopted AI moved, after some incidents, to a rigorous protocol: AI produces drafts, and the practitioner verifies every citation before anything is sent to the client. This protocol reduces the time savings promised by demos but makes them sustainable.
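One way to make verification a step rather than an option is to build it into whatever system tracks drafts, so that a draft with an unchecked citation simply cannot be sent. The following is a minimal sketch of that idea; the names `Citation`, `Draft` and `send_to_client` are hypothetical and do not refer to any existing product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Citation:
    reference: str                     # the citation exactly as written in the draft
    verified_by: Optional[str] = None  # practitioner who checked it against the source


@dataclass
class Draft:
    client: str
    text: str
    citations: List[Citation] = field(default_factory=list)

    def unverified(self) -> List[Citation]:
        """Citations nobody has checked yet."""
        return [c for c in self.citations if c.verified_by is None]

    def can_be_sent(self) -> bool:
        return not self.unverified()


def send_to_client(draft: Draft) -> str:
    """Refuse to send while any citation is still unverified."""
    pending = draft.unverified()
    if pending:
        refs = ", ".join(c.reference for c in pending)
        raise ValueError(f"Unverified citations: {refs}")
    return f"Sent to {draft.client}"
```

The design choice is that the check lives in the sending step, not in a reminder or a checklist: the practitioner can skip a reminder, but cannot skip a step the workflow refuses to bypass.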
The output at seventy per cent
This is perhaps the least discussed and most insidious risk. A language model answering a complex professional request typically produces an output that is somewhere between seventy and eighty-five per cent complete. The experienced senior, reading the output, recognises the missing fifteen to thirty per cent almost instantly. He knows that a reply to the client lacks a reference to a recent circular, that a draft legal opinion has ignored a lesser-known line of case law, that a clinical summary has underestimated a signal in the medical history.
The junior, in the second or third year of practice, does not recognise the missing fifteen to thirty per cent. For him, output at seventy per cent looks complete, or nearly so. The structure is right, the tone is professional, the topics seem covered. This is the most serious problem with introducing AI in practices: not so much the tool's errors as the junior's inability to evaluate what is missing.
In a practice where seniors have little time and increasingly delegate the first draft to juniors, AI risks becoming an amplifier of mediocrity. The junior produces seventy per cent output faster. The senior, chased by volume, reviews less. The client receives output on average inferior to what he would have received before. The practice does not notice immediately, because clients who receive seventy per cent output do not notice immediately. They notice months or years later, when a problem emerges.
The way out of this trap is the training of juniors, not technology. A junior who has spent two years writing opinions, replies and clinical documents without AI help has built the taste needed to recognise what is missing. He can then use AI as a tool of acceleration, not as a substitute. A junior who has started directly with AI has never built that taste, and is unlikely to build it later.
This implies a counter-intuitive organisational choice. For juniors, AI should be introduced as late as possible, not as early as possible. For seniors, as early as possible. Those who reverse this logic, letting juniors work with AI while seniors continue with traditional methods, are working against their own service quality.
The problem of prompt dependency
A third, little-discussed aspect is that the quality of outputs depends enormously on the quality of the initial request, the prompt. A prompt written by someone who knows how to formulate complex requests produces excellent outputs. A prompt written by someone who asks "write me a reply to client Rossi" produces generic outputs.
In professional practices, prompt quality is a new competence many underestimate. It is not learned by watching a twenty-minute video. It develops with daily use, reflecting on what has worked and what has not, building up a repertoire of ways of asking that produce reliable results.
This competence is unevenly distributed within the practice. Typically one or two collaborators get there naturally and produce very good outputs. The others use the tool mechanically and produce average outputs. Internal variance grows rather than shrinks.
Without structuring work on prompts (standardised templates, reusable request models, examples of good practice), the tool amplifies competence differences rather than smoothing them. Those already skilled become more effective; those less skilled obtain results worse than they would have obtained with traditional methods.
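A standardised template can be as simple as a shared text with named slots that every collaborator must fill in. The sketch below uses Python's standard `string.Template`; the field names and the wording of the template are illustrative assumptions, not a recommended prompt.

```python
from string import Template

# Hypothetical shared template: the wording and field names are examples only.
CLIENT_REPLY_TEMPLATE = Template(
    "You are drafting a reply for a $practice_type practice.\n"
    "Client: $client\n"
    "Question received: $question\n"
    "Relevant facts: $facts\n"
    "Tone: $tone\n"
    "List every source you rely on so that it can be verified."
)


def build_prompt(**fields: str) -> str:
    """Fill the template; substitute() raises KeyError if any slot is missing."""
    return CLIENT_REPLY_TEMPLATE.substitute(**fields)


prompt = build_prompt(
    practice_type="tax",
    client="Rossi",
    question="Can the expense be deducted this year?",
    facts="Invoice received in December, paid in January.",
    tone="formal, concise",
)
print(prompt)
```

Because a missing slot raises an error instead of producing a thinner prompt, a request like "write me a reply to client Rossi" cannot be submitted without the facts and the question that make the output specific.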
A general-purpose AI (ChatGPT, Claude, Gemini) knows a lot about everything and little about anything specific. For a professional practice, this means the tool is good at generic answers on broad topics ("what does Italian regulation say about shareholder agreements?") but less reliable on technical specifics ("how does the 2024 tax reform apply in the case of a limited partnership with a managing partner who is a Swiss citizen with Italian tax residence?").
The AI's reply on specific cases tends to be reasonably structured but to lack the details that an experienced sector practitioner would hold in mind. The senior who uses AI on cases within his own field knows how to recognise these gaps and fill them. The junior, again, does not.
The remedy here is either a specialised AI tool, trained on sector-specific corpora, or a context layer that brings the practice's specific information to the general-purpose model. Both paths exist; both require investment. Neither is "open ChatGPT and start".
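To make the idea of a context layer concrete, the sketch below picks the practice's own notes that share words with the question and places them in front of it before the request reaches the model. Keyword overlap is used here only as a stand-in for a real search index; the function names and the note format are hypothetical.

```python
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased words with surrounding punctuation removed."""
    words = (w.strip(".,;:?!()\"'").lower() for w in text.split())
    return {w for w in words if w}


def select_context(question: str, notes: Dict[str, str], max_notes: int = 2) -> List[str]:
    """Return titles of the notes sharing the most words with the question."""
    question_words = _tokens(question)
    scored = []
    for title, body in notes.items():
        overlap = len(question_words & _tokens(body))
        if overlap:
            scored.append((overlap, title))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:max_notes]]


def build_contextual_prompt(question: str, notes: Dict[str, str]) -> str:
    """Prepend the selected practice notes to the question."""
    chosen = select_context(question, notes)
    context = "\n\n".join(f"[{title}]\n{notes[title]}" for title in chosen)
    return f"Practice context:\n{context}\n\nQuestion:\n{question}"
```

The investment mentioned above lies mostly outside this code: someone has to write, maintain and date the notes, and decide which ones the model is allowed to see for which client.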
The choices a serious practice makes
A practice that takes AI seriously makes some choices that are not in the demos.
It does not introduce AI as a universal shortcut. It introduces it for specific tasks where the risk-benefit ratio is favourable. A typical list includes: summaries of long documents for internal use, initial sorting of communications, drafts of standard documents, research support with mandatory verification.
It introduces verification protocols. No AI output leaves the practice without passing through the hands of a qualified practitioner who verifies substance and form. This protocol must be formalised, not left implicit.
It structures the business context. It builds a corpus of practice knowledge that the AI can consult: style of writing, examples of previous outputs, internal procedures, client categories. Without this corpus, every output is generic. With it, the outputs start to resemble how the practice really works.
It trains juniors without AI. The first years of practice serve to build taste. AI enters when taste is consolidated, not before.
It measures. It collects data on what AI produced, what was corrected, what generated problems. This data serves to calibrate over time.
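Measurement does not need a dedicated system to start. As a minimal sketch, assuming each reviewed output is logged with its task type and whether the reviewer had to correct the substance, the correction rate per task can be computed as follows; the record fields and task names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ReviewRecord:
    task: str             # e.g. "summary" or "client_reply"
    corrected: bool       # the reviewer had to change the substance
    caused_problem: bool  # a problem surfaced after delivery


def correction_rates(records: Iterable[ReviewRecord]) -> Dict[str, float]:
    """Share of outputs per task type that needed substantive correction."""
    totals: Dict[str, int] = defaultdict(int)
    corrected: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.task] += 1
        if record.corrected:
            corrected[record.task] += 1
    return {task: corrected[task] / totals[task] for task in totals}


log = [
    ReviewRecord("summary", corrected=False, caused_problem=False),
    ReviewRecord("summary", corrected=True, caused_problem=False),
    ReviewRecord("client_reply", corrected=True, caused_problem=True),
]
print(correction_rates(log))
```

A task whose correction rate stays high over time is a candidate for removal from the list of tasks entrusted to AI; one whose rate falls is a candidate for lighter review.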
The choice for the practice reading this
If you own a professional practice and are evaluating the introduction of AI, or have already done it and are not sure how it is going, the choice before you is not "AI yes or AI no". It is "with which protocols, on which tasks, with which context infrastructure, with which training plan".
Answering these questions requires a map of your practice today: where time is consumed, who does what, how competences are distributed, which reputational risks you cannot afford. From this map a sensible plan is derived, one that does not replicate the demos but takes account of your specific reality.
In our experience, this map emerges well in a forty-five-minute conversation with someone who knows both practice work and the real limits of AI tools. The output is a document that describes the situation, the opportunities, the risks. It is useful even if you then decide to postpone any decision.