Testing for Non-Coders: How to Know If It Actually Works
You've built something with an AI tool. A dashboard, a web app, a script that processes your data. It looks right. The AI said it works. But does it? How would you know?
Testing is one of those things that professional developers spend entire careers mastering, with frameworks, methodologies, and job titles dedicated to it. But the core idea is something anyone can understand: checking that the thing you built actually does what you think it does, and doesn't do things you don't want it to do.
This is Part 2 of The Stackless Guide, a series for non-coders who are building real things with AI tools. Part 1 covered security and privacy. This part covers how to test your work without needing a computer science degree.
Disclaimer
This guide reflects my personal experience building projects with AI tools. I'm not a professional developer or QA engineer. For anything with real-world consequences (medical tools, financial calculations, anything people depend on), get proper testing from a qualified professional. You use these techniques and tools at your own risk.
Why Testing Matters More for AI-Built Projects
When a professional developer writes code, they (usually) understand every line. They know what the code does because they wrote it deliberately.
When you build with AI, something different happens. The AI generates code you didn't write and may not fully understand. It works, probably. But you're trusting an AI that is, by nature, confident about everything, even when it's wrong.
I've had Claude Code generate a dashboard that looked perfect: all the right charts, all the right labels. But it was calculating one column completely wrong. The average spending was being divided by the wrong number of months. It looked plausible. If I hadn't checked the actual numbers against my spreadsheet, I'd never have caught it.
That's the risk: AI-generated code is fluent but not always correct. It passes the "looks right" test because AI is very good at producing things that look right. Your job is to check whether it actually is right.
The Three Levels of Testing You Actually Need
Forget about unit tests, integration tests, regression tests, and all the other terminology that makes testing sound like a university course. As a non-coder, you need three things:
Level 1: Does It Open?
This sounds trivial. It's not. A surprising number of AI-generated projects have basic errors that prevent them from running at all: a missing closing bracket, a reference to a file that doesn't exist, a typo in a variable name.
Step by Step: Basic Smoke Test
1. Open the file (double-click the HTML, run the Python script, whatever your project needs).
2. Does it load without errors? Check the browser console (F12 in Chrome/Edge, click "Console" tab) for red error messages.
3. Does it look roughly like what you expected? If you asked for a dashboard with 4 charts and you see 3, something's wrong.
4. Click every button. Does every button do something? Or do some of them just sit there?
5. Try the obvious paths: if there's a search box, search for something. If there's a dropdown, change it. If there's a form, fill it in.
I call this the "click everything once" test. It takes about two minutes and catches about 50% of problems. If you do nothing else, do this.
Level 2: Does It Calculate Correctly?
This is where most non-coders stop, and it's where the most dangerous bugs hide. The app works, it loads, the buttons work, but the numbers are wrong.
Step by Step: Data Verification
1. Pick 3-5 specific data points that you can verify independently. For a finance dashboard, pick a few transactions and add them up with a calculator or spreadsheet.
2. Compare the AI's output to your manual calculation. Do the totals match? Do the percentages add up to 100%?
3. Check the edge cases: what happens with zero values? Negative numbers? Very large numbers? Missing data?
4. Look for the "off by one" error: if your data covers January to December, does the dashboard show 12 months or 11?
5. If you're processing dates, check that the AI is using the right date format. UK dates (DD/MM/YYYY) are famously confused with US dates (MM/DD/YYYY). Is 03/04/2025 the 3rd of April or the 4th of March?
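If you want to see how badly date formats can bite, here's a quick sketch using Python's standard `datetime` module. The same string becomes two different dates depending on which format you assume:

```python
from datetime import datetime

raw = "03/04/2025"

# Parsed as a UK date (DD/MM/YYYY): the 3rd of April
uk = datetime.strptime(raw, "%d/%m/%Y")

# Parsed as a US date (MM/DD/YYYY): the 4th of March
us = datetime.strptime(raw, "%m/%d/%Y")

print(uk.strftime("%d %B %Y"))  # 03 April 2025
print(us.strftime("%d %B %Y"))  # 04 March 2025
```

Same input, a month apart. If your AI-generated code guessed the wrong format, every date in your output could be silently wrong, which is exactly why this is worth a manual check.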
Real Example: Grocery Dashboard
When I built the grocery spending dashboard, I had it categorise about 400 items from Tesco order history. The totals looked right. The charts looked great. But when I checked individual items, some were in the wrong category. "Oat Milk" was in "Dairy" instead of "Plant-Based Alternatives." "Chicken Stock Cubes" was in "Fresh Meat" instead of "Cooking Essentials."
Each individual error was small. But 20 small errors meant the category breakdowns were meaningfully wrong. I had to go through and check a sample, identify the patterns in the mistakes, give the AI better rules, and re-run the categorisation.
The lesson: always verify a sample. If you're processing 1,000 items, manually check 20-30. If more than 2 or 3 are wrong, the whole batch probably needs re-doing with better instructions.
Level 3: Does It Survive Real Use?
This is what the software industry calls "UAT": User Acceptance Testing. It sounds formal, but it just means: can a real person actually use this thing for its intended purpose?
The difference between Level 2 and Level 3 is that Level 2 checks individual calculations, while Level 3 checks the whole experience. Does it make sense? Is it usable? Does it actually help?
Step by Step: UAT for Non-Coders
1. Use the thing yourself, for real. Don't just test it: actually use it for its intended purpose for a day or a week.
2. Note every time you think "that's annoying" or "I expected that to work differently." Those are bugs, even if the code is technically correct.
3. Give it to someone else. Watch them use it without helping. Where do they get confused? What do they click first? Where do they get stuck?
4. Try to break it. What happens if you put letters in a number field? What if you upload a CSV with extra columns? What if you use it on your phone?
5. Ask: does this actually solve the problem I built it for? Sometimes the answer is "sort of, but I need to change the approach."
The "What Am I Actually Checking?" Framework
Before you test anything, write down (literally, on paper or in a note) what "correct" looks like. This forces you to think about what you expect before you see the result. Once you've seen the result, your brain will convince you that's what you expected all along.
For a dashboard:
- What should the total be? (Check against your source data)
- What should the top category be? (You probably know this already)
- What date range should it cover?
- How many items/rows/entries should there be?
For a web app:
- What happens when I search for something I know exists?
- What happens when I search for something that doesn't exist?
- Does the result make sense? Not just "does it return a result" but "is this the right result?"
For a script that processes data:
- How many items went in? How many came out? Same number?
- Pick 5 items at random. Are they processed correctly?
- Are there any items that got skipped or duplicated?
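If your project is a Python script, you can ask the AI to bolt a check like this onto the end. This is a minimal sketch: the `spot_check` function and the grocery rows are made up for illustration; in practice you'd load your real input and output files (for example with `csv.DictReader`).

```python
import random

def spot_check(rows_in, rows_out, sample_size=5):
    """Compare input/output counts and pick random rows to verify by hand."""
    sample = random.sample(rows_out, min(sample_size, len(rows_out)))
    return len(rows_in), len(rows_out), sample

# Tiny made-up demo data; replace with rows loaded from your real files
before = ["Oat Milk", "Bananas", "Chicken Stock Cubes", "Bread"]
after = ["Oat Milk: Plant-Based", "Bananas: Fruit", "Bread: Bakery"]

n_in, n_out, sample = spot_check(before, after)
print(f"In: {n_in}, Out: {n_out}")  # counts differ: one item got dropped
for row in sample:
    print(row)  # check each of these against the original by eye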
The Sample-Then-Scale Approach
This is the single most useful testing technique I've learned, and it applies to everything.
Never process your full dataset first. Always test on a small sample, verify it's correct, then scale up.
Here's the exact process I follow:
- Start with 5-10 items. Process them. Check every single one manually. Are they all correct?
- If not, fix the instructions. Tell the AI what went wrong. Be specific: "You categorised oat milk as dairy, but it should be plant-based. Check for 'oat', 'soy', 'almond', 'coconut' in the name and categorise those as Plant-Based."
- Re-run the small sample. Check again. All correct now?
- Scale to 50-100 items. Spot-check 10 of them. Any new errors? New edge cases you didn't think of?
- Scale to the full dataset. Spot-check 20-30 items from different parts of the data (beginning, middle, end, different categories).
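The loop above can be sketched in Python. Here `categorise` is a toy stand-in for whatever your real processing step is (the AI call, the script, the rule set), and the batch sizes mirror the ones described above; the manual "check every one" step stays manual.

```python
def categorise(item):
    """Toy stand-in for the real processing step."""
    plant_words = ("oat", "soy", "almond", "coconut")
    if any(word in item.lower() for word in plant_words):
        return "Plant-Based"
    return "Other"

def sample_then_scale(items, batch_sizes=(10, 100)):
    """Print escalating batches for manual checking, then run everything.
    Stop and fix the rules between batches if anything looks wrong."""
    for size in batch_sizes:
        batch = items[:size]
        print(f"--- Batch of {len(batch)}: check these by hand ---")
        for item in batch:
            print(f"{item} -> {categorise(item)}")
    # Only once the small batches look right, process the full dataset
    return {item: categorise(item) for item in items}

items = ["Oat Milk", "Bread", "Soy Sauce", "Bananas", "Almond Flour"]
results = sample_then_scale(items, batch_sizes=(2,))
```

The code can't do the checking for you; its job is just to force the small-batch pause before the expensive full run.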
This approach caught errors in my email organiser (16,000 emails), my grocery dashboard (400+ items), and my finance dashboard (thousands of transactions). Every single time, the first pass had mistakes. Every single time, the second pass after corrections was dramatically better.
Real Example: Email Categorisation
Processing 16,000 emails, I started with a batch of 100. Checked 20 of them. Found that "delivery notification" emails from Amazon were being categorised as "Shopping" instead of "Delivery Updates." Also found that newsletter unsubscribe confirmations were categorised as "Newsletters" when they should have been "Account Management."
Updated the rules, re-ran 100, checked 20 more. Better, but now some Deliveroo order confirmations were in "Food" instead of "Delivery." Added more rules. Third pass: 19 out of 20 checked correctly. Good enough to scale up.
If I'd run all 16,000 first, I'd have had to re-do the entire batch multiple times. By testing small first, I only re-did 100 emails three times, which took minutes instead of hours.
"Good Enough" Is a Real Standard
Professional software has bugs. Your bank's app has bugs. Google has bugs. The question is never "is this perfect?" because nothing is perfect. The question is "is this good enough for what I need?"
Here's how I think about "good enough":
- Personal use only? If only you are using it, 90% accuracy might be fine. You'll notice the errors as you use it and fix them over time.
- Sharing with friends/family? Aim for 95%+. Check the main paths thoroughly. Accept that edge cases might be imperfect.
- Selling it or sharing publicly? This needs more rigorous testing. Get someone else to try it. Test on different devices. Check accessibility basics (does it work on a phone? can you read the text?).
- Anything involving money, health, or other people's data? "Good enough" is a much higher bar. Consider whether you should be building this at all without professional help.
When to Stop Building
If you're building something that calculates medication doses, manages other people's money, or handles sensitive personal data, "testing it myself" is not sufficient. These domains have regulations, liability implications, and safety requirements that go far beyond what any AI tool or self-taught testing can cover. Know your limits. The worst outcome isn't a bug: it's a bug that hurts someone.
The Iteration Loop: How Testing Actually Works in Practice
In reality, testing isn't a separate phase. It's woven into everything you do. Here's what a typical session looks like for me:
- Tell the AI what I want. "Build me a dashboard that shows my spending by category, with a chart and a table."
- Open it, look at it. Does it roughly match what I imagined? Usually yes, with some things I'd change.
- Check the numbers. Open my source data, pick a category, add it up manually. Does it match? If not, tell the AI: "The 'Food' category shows £340 but my spreadsheet says £312. Something is being included that shouldn't be."
- Fix and re-check. The AI fixes it. I check again. Better? Good. Next issue.
- Use it for real for a few days. Notice things. "The chart is unreadable on my phone." "The date filter doesn't work for January." "I keep wanting a button that doesn't exist."
- Tell the AI to fix those things. Re-check. Repeat until I'm happy.
This loop (build a little, check a little, fix a little) is how professional development works too. The difference is that professionals have automated tools to speed up the checking. You have your own eyes and a calculator. Both are valid.
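The "check the numbers" step can itself be a tiny script. This is a minimal sketch that adds up one category independently of the dashboard: the inline data and the `category`/`amount` column names are made up for illustration; in practice you'd open your real exported CSV.

```python
import csv
import io

# Inline sample standing in for a real export; in practice use
# open("transactions.csv") with your actual file and column names
data = io.StringIO(
    "date,category,amount\n"
    "01/06/2025,Food,42.50\n"
    "03/06/2025,Food,17.30\n"
    "04/06/2025,Travel,12.00\n"
)

# Add up one category independently, then compare to the dashboard
food_total = sum(
    float(row["amount"])
    for row in csv.DictReader(data)
    if row["category"] == "Food"
)
print(f"Food total: £{food_total:.2f}")  # £59.80 -- does the dashboard agree?
```

If this number and the dashboard's number disagree, you have something concrete to paste back to the AI: the expected figure, the displayed figure, and the rows that should make it up.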
What to Check on Different Devices
If you're building something web-based (HTML files, web apps), check it on more than one device. At minimum:
- Your computer's browser. This is where the AI built it, so it usually works here.
- Your phone. A shocking number of AI-generated websites look terrible on mobile. Text overflows, buttons are too small to tap, charts are unreadable. Ask the AI to "make it responsive" if it isn't.
- A different browser. If you normally use Chrome, check in Edge or Firefox. Most things work everywhere, but occasionally something breaks.
You don't need to test on 15 different devices. But phone + computer covers about 95% of real-world use.
The Console Is Your Friend
If you're building web things (HTML, JavaScript), learn this one thing: how to open the browser's developer console.
- Chrome/Edge: Press F12, click "Console" tab
- Firefox: Press F12, click "Console" tab
- Safari: Enable "Develop" menu in settings first, then Cmd+Option+C
Red text in the console means something is broken. You don't need to understand what the error means. Just copy it and paste it to your AI tool: "I'm getting this error in the console: [paste error]. Can you fix it?"
Yellow warnings are usually fine. Red errors are problems. No text at all is ideal.
Real Example: Ingredient Checker
My Curly Girl ingredient checker looked perfect. All the buttons worked, the search worked, everything seemed fine. But some ingredients weren't being detected. I opened the console and saw a red error: a function was failing because of a missing comma in the data file. Pasted the error to Claude Code, it fixed the comma, and everything worked.
Without the console, I would have spent ages trying to figure out why certain ingredients weren't showing results. With the console, the fix took 30 seconds.
A Testing Checklist You Can Actually Use
Print this out or save it. Run through it for every project:
- Does it open/run without errors?
- Does it look roughly like what I asked for?
- Have I clicked every button and link?
- Have I checked 3-5 specific data points against my source data?
- Do the totals add up?
- Does it work on my phone?
- Are there any red errors in the browser console?
- Have I tried searching/filtering/sorting (if those features exist)?
- What happens when I enter wrong or weird input?
- Have I used it for its actual intended purpose for at least a day?
- Has anyone else tried using it?
- Am I comfortable sharing/selling/relying on this?
When the AI Says "It Works" But It Doesn't
AI tools are very good at sounding confident. When you report a problem, the AI will often say "I've fixed it" or "that should work now." Trust, but verify. Every time.
I have a rule: if the AI says it's fixed something, I check the specific thing it claims to have fixed before moving on. Not "does the page still load?" but "does the exact problem I reported actually not happen anymore?" Because sometimes the AI's "fix" introduces a new problem, or doesn't actually fix the original one but changes something nearby that makes it look different.
The AI is not trying to deceive you. It genuinely thinks it's fixed the problem. But "thinking it's fixed" and "actually fixed" are different things, and you're the only one who can verify.
Version Everything
This connects back to the security post (Part 1 of this series): use git, or at minimum, save copies before making changes.
Here's why this matters for testing: you will have a working version, ask for a change, and the change will break something that was working before. If you saved the previous version, you can go back. If you didn't, you're trying to recreate it from memory.
Step by Step: Save Before Changing
1. You have a working version. It's not perfect, but the core thing works.
2. Before asking the AI to make changes, commit what you have (if using git) or copy the working files to a folder called "backup-YYYY-MM-DD" or "version-1-working".
3. Ask the AI for the change.
4. Test the change. If it broke something, you can go back to your saved version and try a different approach.
5. If the change works, save again. Now you have two known-good states.
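If you're not using git, even the "copy the files" step can be scripted with Python's standard library. A minimal sketch, where `my-project` is a placeholder for your actual project folder:

```python
import shutil
from datetime import date
from pathlib import Path

# "my-project" is a placeholder: point this at your actual project folder
project = Path("my-project")
backup = Path(f"backup-{date.today():%Y-%m-%d}")

# Copy the whole working folder before asking the AI for changes.
# copytree raises an error rather than overwrite an existing backup,
# so yesterday's known-good copy is never silently replaced.
if project.exists():
    shutil.copytree(project, backup)
    print(f"Saved working copy to {backup}")
```

Run it once before each change session and you always have a dated folder to fall back to.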
I've lost count of how many times this has saved me. "It was working 20 minutes ago" is only useful if 20-minutes-ago is saved somewhere.
Getting Someone Else to Test
The most valuable testing you can do is the simplest: hand it to someone who wasn't involved in building it and watch them use it.
Don't explain anything. Don't say "you click here first, then there." Just say "have a look at this, it's supposed to [whatever it does]." Then watch. Where do they hesitate? What do they click first? Do they find the main feature, or do they get lost?
Every time I've done this, I've learned something. My partner looked at my finance dashboard and immediately said "where's the monthly view?" I hadn't built one because I was thinking in terms of categories. But a regular person wants to see "what did I spend this month?" as the first thing. That's not a bug in the code. It's a bug in my thinking about what the tool should do.
The Honest Truth About Testing
Testing is not fun. It's not the exciting part. The exciting part is telling the AI what to build and watching it appear. The testing part is the slow, careful "ok but does 42.50 + 17.30 actually equal what the chart says?" part.
But skipping it is how you end up with a beautiful dashboard that shows the wrong numbers. Or a web app that works perfectly on your computer and crashes on everyone else's phone. Or a script that processed 10,000 items and got 2,000 of them wrong.
The good news: it gets faster. Once you've tested a few projects, you develop a sense for where AI tools make mistakes. You learn to check the calculations first (because that's where AI is weakest). You learn to check mobile (because AI always forgets mobile). You learn to try edge cases (because AI builds for the happy path).
And the better news: the AI is your testing partner too. You can literally say: "I'm worried the calculations might be wrong. Can you walk me through exactly how the Food category total is calculated, step by step?" The AI will explain its logic, and often in explaining it, you'll spot the mistake together.
Final Disclaimer
Testing techniques described here are based on personal experience building projects with AI tools. They are not a substitute for professional quality assurance, especially for software that handles sensitive data, financial calculations, or anything with health/safety implications. You build, test, and deploy at your own risk. When in doubt, get expert help.