Does Onsen actually work? What 18 months of data shows

We asked the hardest question a mental health app can ask about itself.

11 April 2026 · 24 min read

Most mental health apps never answer the hardest question: does this actually help?

It's easy to show testimonials. It's easy to cherry-pick a feel-good story. It's much harder to measure whether your app actually makes people feel better, and publish what you find, good or bad.

That's what we did. For 18 months, Onsen has included questionnaires that measure happiness, stress, and gratitude. The same ones used by psychologists and researchers. Over 1,600 users have taken at least one, generating more than 2,250 responses. Here's what the data shows.

A note on privacy: Everything in this article is based on aggregated, de-identified data. We looked at group averages and trends. Not individual entries, not personal conversations, not anything that could identify a specific user. No journal entries, chat messages, or personal content were read for this analysis.

Most users arrive unhappy and stressed

Onsen includes wellbeing trackers: short questionnaires used by psychologists to measure things like happiness, stress, and gratitude. The Happiness Tracker is based on the Oxford Happiness Questionnaire, the Stress Tracker on the Perceived Stress Scale, and the Gratitude Tracker on the GQ-6. Each takes about five minutes, and you can retake them over time to spot trends.

The Happiness Tracker (OHQ-8)
The Stress Tracker (PSS-10)

People don't come to Onsen because everything is fine. The average happiness score is 2.86 out of 5, and anything below 3.5 is considered low. Stress averages 3.63, just below the 3.67 cutoff for what's considered high.

Where Onsen users start

In short: our users arrive struggling, with real room to improve. Women report higher stress than men, and people in their 30s are the most stressed group, likely from the collision of career pressure, young families, and financial strain. The question is whether they get better.

54% felt happier, 51% felt less stressed

We looked at users who took the same questionnaire at least twice, about three months apart on average. These returning users tended to start with slightly higher stress than the overall population (3.72 vs 3.63). That makes sense: if you're more stressed, you have more reason to come back and check again.

Happiness went up. Users scored 6% higher on average. Of the 85 who retook it, 54% improved, 12% stayed the same, and 34% scored lower.

Stress went down. Stress scores dropped by 3% on average. Of 168 users, 51% improved, 11% stayed the same, and 38% scored higher.

Gratitude trended positive, but we don't have enough data yet. Only 37 users retook the gratitude questionnaire. The trend looks good, but we're not going to claim something we can't fully back up.

Average scores: first vs last assessment

These improvements are consistent with what research finds for mental health apps generally. Not a miracle cure, but a real, measurable shift in the right direction.

Heavy users saw 3x more stress reduction

We split users into four groups by how much they used the app, from light users (around 30 messages total) to heavy users (over 1,300 messages, roughly daily use), and compared the two extremes. The difference was dramatic.

% improvement: light vs heavy users

Heavy users became 8.8% happier (vs 4.0% for light users), saw 3x more stress reduction (7.2% vs 2.4%), and improved nearly twice as much on gratitude. The pattern held across every tracker we measured.

When we looked at which types of engagement predicted the most improvement, journaling came out on top. Ahead of chatting, meditation, and every other feature. Decades of research support the mental health benefits of writing about what you're feeling, so this wasn't a surprise. But it's encouraging to see it show up so clearly in our own data.

Stress dropped 22%, gratitude rose 19%, and the trend kept going

The most powerful visual in our data is what happens to scores over repeated check-ins.

Stress scores across repeated check-ins

Users who stuck with Onsen kept getting better. Stress scores started high and dropped steadily, with a brief plateau around the fourth and fifth check-in before continuing to fall. By the tenth check-in, stress had dropped 22% from where it started.

Gratitude followed a similar upward arc, from 3.6 at the first check-in to 4.3 by the sixth (though with a smaller sample).

Gratitude scores across repeated check-ins

To make sure this wasn't just the improvers sticking around, we tracked the same 65 users across their first three check-ins. Same people, no one dropping in or out. The pattern held.

Same 65 users tracked over 3 check-ins

We also found that showing up regularly matters more than showing up occasionally. Users who checked in with shorter gaps between assessments saw more improvement than those who dipped in and out. (The Stress Tracker and Happiness Tracker make it easy to build that habit.)

Users with the highest stress improved the most

Two findings stood out when we looked at who benefits most.

First, users who start more stressed benefit the most. If you're coming to Onsen at a really difficult time, the data says that's exactly when it helps the most.

Second, Onsen works for everyone. Women and men, under 40 and over 40. All groups improved. In fact, men who used Onsen saw nearly 3x the stress reduction of women, though our male sample is still small, so we'll need more data to confirm this.

Stress reduction by gender

The 30-something woman juggling a career and young kids, and the 50-year-old man who's never talked about his feelings. The data suggests Onsen helps them both.

What this study doesn't prove

We believe in being straightforward about what this data does and doesn't prove.

We don't have a control group. We didn't compare Onsen users against a matched group of people who didn't use the app. That means we can't rule out that improvements happened for other reasons: a change in season, starting therapy, life circumstances getting better on their own.

People who improve may be more likely to come back. If you feel worse, you might not retake the questionnaire. This could make the results look better than they are. We tested for this (see the appendix), but can't rule it out completely.

Very high scores tend to come down on their own. If you score extremely high on stress the first time, you'll probably score a bit lower the next time regardless of what you do. Some of the improvement we see is likely this effect.

Gratitude needs more data. With only 37 repeat users, we can't draw conclusions about gratitude yet. The trend is positive, but that's all we can say honestly.

We're not going to overclaim. We think this data is encouraging, genuinely encouraging. But honesty matters more than a headline. Numbers only tell part of the story. To hear what real users are saying in their own words, see our reviews. And if you want the full statistical details, the appendix below has everything: effect sizes, confidence intervals, regression models, the lot.

How to get the most out of Onsen

If you're considering Onsen, here's what the data suggests about who benefits most:

  • Show up consistently. Regular check-ins beat sporadic ones. Build it into your routine.
  • Journal. Writing was the strongest predictor of improvement we found. Try guided journaling if you're not sure where to start.
  • Track your progress. Users who checked in more often improved more. The act of measuring how you feel seems to help by itself.

Onsen isn't a replacement for professional care. If you're in crisis or dealing with a serious mental health condition, please seek help from a qualified professional. But as a daily practice, a place to check in, reflect, and build self-awareness, our data suggests it genuinely helps. Especially if you keep showing up.

We'll keep measuring. We'll keep publishing. And if the data ever shows something isn't working, we'll say that too.


Download Onsen and take your first assessment. You might be surprised what checking in with yourself regularly can do. Our data certainly surprised us.


Appendix: Full research methodology and statistical results

This section contains the complete statistical details behind the findings above. If you're a researcher, journalist, or just curious about the numbers, this is for you.

Study design

Type: Retrospective within-subjects pre-post observational study

Period: September 2024 to April 2026 (18 months)

Population: All non-internal, non-deleted Onsen users who completed at least one tracker assessment. Analysis cohort: users with 2+ completed assessments on a given tracker.

Instruments:

| Instrument | Measures | Items | Score Range | Improvement |
|---|---|---|---|---|
| Oxford Happiness Questionnaire (OHQ-8) | Happiness & life satisfaction | 8 items, 5-point Likert | 1-5 (mean) | Score increases |
| Perceived Stress Scale (PSS-10) | Perceived stress (past 2 weeks) | 10 items, 5-point frequency | 1-5 (mean) | Score decreases |
| Gratitude Questionnaire (GQ-6) | Dispositional gratitude | 6 items, 5-point Likert | 1-5 (mean) | Score increases |

All instruments include reverse-scored items. Reverse scoring is applied at the item level before computing the overall mean, consistent with each instrument's published scoring protocol.
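As a concrete sketch of that scoring step (the item numbers below are illustrative, not Onsen's actual questionnaire items), item-level reverse scoring on a 1-5 scale looks like this:

```python
# Sketch of item-level reverse scoring on a 1-5 Likert scale, applied
# before averaging. Item numbers here are hypothetical examples.

def reverse_score(value: int, scale_max: int = 5, scale_min: int = 1) -> int:
    """Flip a Likert response: on a 1-5 scale, 1 <-> 5, 2 <-> 4, 3 stays 3."""
    return scale_max + scale_min - value

def overall_mean(responses: dict[int, int], reversed_items: set[int]) -> float:
    """Reverse the flagged items first, then take the mean of all items."""
    scored = [
        reverse_score(v) if item in reversed_items else v
        for item, v in responses.items()
    ]
    return sum(scored) / len(scored)

# Hypothetical 4-item questionnaire where items 2 and 4 are reverse-scored.
answers = {1: 4, 2: 2, 3: 5, 4: 1}
print(overall_mean(answers, reversed_items={2, 4}))  # (4 + 4 + 5 + 5) / 4 = 4.5
```

Reversing before averaging (rather than after) is what keeps the overall mean on the same 1-5 scale as each item.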

Sample size

| Tracker | Unique Users | Total Assessments | Users with 2+ | Users with 3+ |
|---|---|---|---|---|
| Happiness (OHQ-8) | 480 | 663 | 85 | 25 |
| Stress (PSS-10) | 869 | 1,231 | 168 | 65 |
| Gratitude (GQ-6) | 257 | 360 | 37 | 14 |

Stress is the most popular tracker by far, likely because stress is a universal concern that people want to quantify. The average of 1.4 assessments per user means most people take a tracker once; about 14-19% return for a second assessment. The "repeat users" who form our analysis cohort are a self-selected subset who chose to check in again.

Demographics

  • Gender: 73% women, 11% men, 2% other, 15% not set
  • Largest age groups: 30-39 (25%), 40-49 (22%), 17-29 (21%)
  • Top countries: US, UK/Ireland, Canada, Australia

The gender skew toward women is typical of mental health apps and self-help tools generally. The 15% "not set" represents users who haven't completed their profile. The age distribution peaks in the 30-49 range, which aligns with the life stage where mental health support seeking is highest.

Baseline score distributions

| Tracker | n | Mean | SD | Min | P25 | Median | P75 | Max |
|---|---|---|---|---|---|---|---|---|
| Happiness | 663 | 2.861 | 0.800 | 1.00 | 2.30 | 2.80 | 3.40 | 5.00 |
| Stress | 1,231 | 3.626 | 0.716 | 1.00 | 3.20 | 3.70 | 4.10 | 5.00 |
| Gratitude | 360 | 3.759 | 0.852 | 1.70 | 3.20 | 3.80 | 4.35 | 5.00 |

Clinical reference: Happiness >4.0 = high, <3.5 = low. Stress >3.67 = high, <2.33 = low. Gratitude >=4.5 = high, <4.0 = lower disposition.

The happiness scores cluster low. The median (2.80) sits well below the "low happiness" threshold of 3.5, confirming that Onsen users arrive in a difficult place. Stress scores cluster high, near the "high stress" cutoff. This population has meaningful room for improvement, which matters for detecting change. A ceiling or floor effect would make improvements harder to observe.

Pre-post results (first vs last assessment)

| Tracker | n | Avg First | Avg Last | Avg Change | SD | % Improved | % Unchanged | Avg Days Between |
|---|---|---|---|---|---|---|---|---|
| Happiness | 85 | 2.873 | 3.051 | +0.178 | 0.560 | 54.1% | 11.8% | 90.4 |
| Stress | 168 | 3.717 | 3.605 | -0.112 | 0.593 | 50.6% | 11.3% | 88.4 |
| Gratitude | 37 | 3.827 | 3.911 | +0.084 | 0.518 | 40.5% | 27.0% | 104.7 |

All three trackers move in the expected direction. The average time between first and last assessment is approximately 3 months, a meaningful follow-up period for psychological change. About 11-12% of users show zero change (identical first and last scores), which is expected on a 5-point Likert scale with limited granularity.

Gratitude has the highest proportion of unchanged scores (27%) and the smallest improvement. This may reflect gratitude's nature as a more stable dispositional trait. It shifts slowly, if at all, compared to state-like constructs such as perceived stress.
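The first-vs-last computation itself is straightforward. A minimal sketch in pandas, assuming a hypothetical long-format table with `user_id`, `taken_at`, and `score` columns (not our production schema):

```python
import pandas as pd

# Sketch of the first-vs-last pre-post computation on toy stress data.
df = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 2, 3],
    "taken_at": pd.to_datetime(["2025-01-01", "2025-04-01", "2025-02-01",
                                "2025-03-01", "2025-05-01", "2025-06-01"]),
    "score":    [3.8, 3.4, 3.2, 3.0, 3.1, 2.9],
})

df = df.sort_values(["user_id", "taken_at"])
per_user = df.groupby("user_id")["score"].agg(first="first", last="last", n="count")
repeat = per_user[per_user["n"] >= 2]                  # users with 2+ assessments
repeat = repeat.assign(change=repeat["last"] - repeat["first"])

print(repeat["change"].mean())    # average first-to-last change
print((repeat["change"] < 0).mean())  # share of users whose stress dropped
```

Users with a single assessment (user 3 above) drop out of the cohort, which is exactly the self-selection caveat discussed in the limitations.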

Statistical tests

Paired t-test and Wilcoxon signed-rank test:

| Tracker | n | Mean Change | Cohen's d | t | p (t-test) | p (Wilcoxon) | 95% CI |
|---|---|---|---|---|---|---|---|
| Happiness | 85 | +0.178 | 0.317 | 2.923 | 0.004 | 0.003 | [0.057, 0.299] |
| Stress | 168 | -0.112 | -0.189 | -2.447 | 0.016 | 0.068 | [-0.202, -0.022] |
| Gratitude | 37 | +0.084 | 0.162 | 0.983 | 0.332 | 0.393 | [-0.089, 0.257] |

Happiness is statistically significant by both parametric and non-parametric tests (p=0.004 and p=0.003). Cohen's d of 0.32 is a small-to-medium effect. The 95% confidence interval [0.057, 0.299] excludes zero.

Stress is significant by paired t-test (p=0.016) but borderline on Wilcoxon (p=0.068). The discrepancy suggests possible outliers affecting the parametric test. The 95% CI [-0.202, -0.022] still excludes zero.

Gratitude is not statistically significant. A power analysis suggests approximately 240 users with 2+ assessments would be needed to detect an effect of d=0.16 at 80% power.

Multiple comparison note: We tested three trackers, which raises the question of correction for multiple comparisons. Under Bonferroni correction (alpha = 0.05/3 = 0.017), happiness (p=0.004) remains clearly significant. Stress (p=0.016) sits right at the corrected threshold. It passes, but should be interpreted with caution. Gratitude was already non-significant before correction.
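For readers who want to reproduce this kind of test battery, here is a sketch on synthetic data (the simulated scores are illustrative, not Onsen's raw data):

```python
import numpy as np
from scipy import stats

# Sketch: paired t-test, Wilcoxon signed-rank, paired Cohen's d, 95% CI,
# and a Bonferroni-corrected alpha, all on simulated first/last scores.
rng = np.random.default_rng(0)
first = rng.normal(3.7, 0.7, size=168)            # simulated first assessments
last = first - 0.11 + rng.normal(0, 0.6, size=168)  # simulated last assessments

diff = last - first
t, p_t = stats.ttest_rel(last, first)             # parametric paired test
w, p_w = stats.wilcoxon(last, first)              # non-parametric counterpart
d = diff.mean() / diff.std(ddof=1)                # Cohen's d on the differences
ci = stats.t.interval(0.95, len(diff) - 1,
                      loc=diff.mean(), scale=stats.sem(diff))

# Bonferroni: with three trackers, test each at alpha = 0.05 / 3.
alpha_corrected = 0.05 / 3
print(f"t={t:.3f}, p_t={p_t:.4f}, p_w={p_w:.4f}, d={d:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}); corrected alpha: {alpha_corrected:.4f}")
```

Reporting both the parametric and non-parametric p-values, as in the table above, is what surfaced the outlier-sensitivity question for stress.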

Engagement analysis

Engagement is measured as the total number of active messages sent plus journal entries written. Users were split at the median into "low" and "high" groups, and separately into quartiles (Q1-Q4). "Low" users averaged 61-114 total interactions over their time on Onsen; "High" users averaged 805-1,449. In the quartile analysis, Q1 users averaged around 30 messages (occasional use) while Q4 users averaged over 1,300 (roughly daily use).

Median split (high vs low engagement):

| Tracker | Group | n | Avg Change | Avg Engagement |
|---|---|---|---|---|
| Happiness | Low | 43 | +0.144 | 61 messages |
| Happiness | High | 42 | +0.212 | 1,120 messages |
| Stress | Low | 84 | -0.058 | 69 messages |
| Stress | High | 84 | -0.165 | 805 messages |
| Gratitude | Low | 19 | 0.000 | 114 messages |
| Gratitude | High | 18 | +0.172 | 1,449 messages |

The engagement gap between groups is enormous. High-engagement users have 10-15x more messages and entries than low-engagement users. These aren't minor variations in usage; they represent fundamentally different relationships with the app. The gratitude result is the starkest: low-engagement users showed literally zero improvement, while high-engagement users improved by +0.172.
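The median split and quartile assignment can be sketched with pandas `qcut`, using hypothetical per-user `engagement` and `change` columns (toy numbers, not our dataset):

```python
import pandas as pd

# Sketch of the engagement split: engagement = messages + journal entries
# per user; change = last score minus first score (negative = stress dropped).
users = pd.DataFrame({
    "engagement": [28, 55, 90, 240, 310, 780, 1200, 1500],
    "change":     [-0.05, -0.02, -0.10, -0.03, -0.08, -0.15, -0.25, -0.30],
})

# Median split into low/high engagement groups.
users["half"] = pd.qcut(users["engagement"], 2, labels=["low", "high"])
# Quartile split (Q1-Q4) for the dose-response view.
users["quartile"] = pd.qcut(users["engagement"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

print(users.groupby("half", observed=True)["change"].mean())
print(users.groupby("quartile", observed=True)["change"].mean())
```

Quantile-based bins keep the groups equal-sized, which is why each stress quartile in the table below has n=42.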

Dose-response by quartile (stress, n=168):

| Quartile | n | Avg Engagement | Avg Stress Change |
|---|---|---|---|
| Q1 (lowest) | 42 | 34 | -0.090 |
| Q2 | 42 | 105 | -0.026 |
| Q3 | 42 | 257 | -0.062 |
| Q4 (highest) | 42 | 1,353 | -0.269 |

The Q4 group (averaging 1,353 messages, roughly daily use) shows 3x the stress reduction of Q1. The Q2 dip is unexplained: it may reflect users who engage sporadically enough to be reminded of their stress without engaging deeply enough to process it, or it may simply be noise. We report it as we found it rather than smoothing over it.

Dose-response by quartile (happiness, n=85):

| Quartile | n | Avg Engagement | Avg Happiness Change |
|---|---|---|---|
| Q1 (lowest) | 22 | 23 | +0.114 |
| Q2 | 21 | 102 | +0.176 |
| Q3 | 21 | 299 | +0.171 |
| Q4 (highest) | 21 | 1,940 | +0.252 |

Happiness shows a cleaner dose-response than stress. Q2 and Q3 are similar, with Q4 pulling ahead. This suggests a possible threshold effect: moderate engagement provides a baseline benefit, but the biggest marginal gain comes from moving into heavy, consistent use.

Bivariate correlations (engagement vs improvement):

| Tracker | n | r (Messages) | r (Journal Entries) | r (Baseline Score) |
|---|---|---|---|---|
| Happiness | 85 | 0.297 | 0.330 | -0.212 |
| Stress | 168 | 0.278 | 0.340 | 0.189 |
| Gratitude | 37 | 0.248 | 0.221 | -0.357 |

Journal entries are the strongest correlate of improvement (r=0.33-0.34 for happiness and stress). This suggests that the reflective act of writing, not just chatting, is associated with better outcomes. The negative correlation between baseline happiness and improvement (r=-0.21) indicates that happier users at baseline have less room to grow. For stress, the positive correlation (r=0.19) means users who start more stressed improve more. The app appears to benefit those who need it most.
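A minimal sketch of the correlation step, on synthetic data with a hypothetical `journal_entries` count (skewed counts are log-transformed before correlating):

```python
import numpy as np
from scipy.stats import pearsonr

# Sketch: correlate per-user improvement with (log-scaled) engagement counts.
# The simulated relationship below is illustrative, not our actual data.
rng = np.random.default_rng(42)
journal_entries = rng.integers(0, 200, size=85)
improvement = 0.002 * journal_entries + rng.normal(0, 0.4, size=85)

# log1p handles users with zero entries; raw counts are heavily right-skewed.
r, p = pearsonr(np.log1p(journal_entries), improvement)
print(f"r={r:.2f}, p={p:.4f}")
```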

Regression analysis

We ran OLS regression models to understand which factors predict improvement while controlling for each other. The stress model is our strongest and most informative.

Stress tracker OLS regression (n=168, R²=0.172, p<0.0001):

| Predictor | Coefficient | p-value | Interpretation |
|---|---|---|---|
| Baseline score | 0.187 | 0.010 | Higher initial stress → more improvement |
| Num. assessments | 0.047 | 0.004 | Each additional assessment → +0.047 improvement |
| Days between | -0.001 | 0.038 | Longer gaps → less improvement |
| Meditation sessions | 0.072 | 0.104 | Trending positive, not significant |
| Active messages (log) | 0.033 | 0.549 | Not significant after controls |
| Journal entries (log) | -0.092 | 0.414 | Not significant after controls |

This model explains 17% of the variance in stress improvement, typical for behavioral interventions where individual variability is large. Three predictors reached significance:

  1. Baseline score (p=0.01): For every 1-point increase in initial stress, improvement increases by 0.19 points. This partially controls for regression to the mean, and the effect is still significant after that control, suggesting the benefit for highly stressed users is real, not just statistical drift.

  2. Number of assessments (p=0.004): Each additional check-in is associated with +0.047 improvement. This likely reflects both a dose effect (more tracking = more self-awareness) and selection (users who improve keep tracking). We can't fully separate these mechanisms.

  3. Days between assessments (p=0.038): Every extra day between check-ins reduces improvement by 0.001 points. Over a 90-day gap vs. a 30-day gap, that's a difference of 0.06 points. Enough to matter. This supports the "consistent practice" finding from the main body.

Meditation sessions trend positive (p=0.10). Each session is associated with +0.072 improvement. With a larger sample, this could become significant. It's a signal worth watching.

Why engagement metrics lose significance: Active messages and journal entries both correlate with improvement in bivariate analyses (r=0.28-0.34), but drop below significance when entered alongside number of assessments. This is multicollinearity: users who engage more also track more often, so the regression can't cleanly attribute improvement to one vs. the other. The bivariate correlations are more informative for the engagement story.

Pooled model (all trackers, n=290, R²=0.098, p=0.0006): Number of assessments was the only significant predictor (p<0.001). The lower R² compared to the stress-only model suggests that combining trackers introduces noise. The predictors of improvement may differ across happiness, stress, and gratitude.

Happiness tracker (n=85, R²=0.185, p=0.023): Number of assessments is significant (p=0.037) with the same pattern as stress. Baseline score trends toward significance (p=0.07, negative): users who start less happy tend to improve more. The model explains 19% of variance, comparable to stress.

Gratitude (n=37): Model not significant (p=0.26). With only 37 observations and 7 predictors, we lack the statistical power to detect effects. The R² of 0.25 is inflated by overfitting (adjusted R² drops to 0.065). No actionable conclusions.

Demographic subgroup analysis

Baseline differences (statistically significant):

| Finding | t | p |
|---|---|---|
| Women report higher stress than men (3.67 vs 3.42) | 3.67 | <0.001 |
| Under-40s report higher stress than 40+ (3.76 vs 3.50) | 5.46 | <0.0001 |

The gender difference on stress (+0.25 points) is consistent with the broader PSS literature, where women consistently score higher across cultures and age groups. Happiness and gratitude show no significant gender or age differences. The gap is specific to perceived stress.

The age finding is the strongest demographic result in the study (t=5.46). The 30-39 age group is the most stressed (mean 3.81), followed by 17-29 (3.68). The 40-49 group drops sharply to 3.45. This aligns with life-stage research: the 30s are when career demands, early parenthood, and financial obligations tend to peak simultaneously.

Improvement rates (not significant):

| Comparison | Mean Change (Group A) | Mean Change (Group B) | p |
|---|---|---|---|
| Women vs Men (stress) | -0.104 | -0.267 | >0.4 (n=12 men) |
| Under 40 vs 40+ (stress) | -0.161 | -0.110 | >0.6 |

This is an important null finding. While who arrives differs meaningfully by demographics, how much they improve does not. Both genders and both age groups show similar rates of stress reduction. The men show a numerically larger improvement (-0.267 vs -0.104) but with only 12 men in the repeat-assessment cohort, this is far too underpowered to interpret. It could easily be noise.

The practical implication: Onsen doesn't appear to work better for one demographic over another. The app seems to meet users where they are, regardless of age or gender.

Cohort-controlled trajectory

A key concern with trajectory data is survivorship bias: if users who don't improve drop out, the remaining sample looks artificially better over time. To test this, we tracked the exact same 65 users across their first 3 stress assessments. No one enters or exits the cohort.

| Assessment | n | Avg Score | SD |
|---|---|---|---|
| 1 | 65 | 3.768 | 0.507 |
| 2 | 65 | 3.615 | 0.760 |
| 3 | 65 | 3.583 | 0.747 |

The improvement holds: 3.77 → 3.62 → 3.58 for the same individuals. The decline from assessment 1 to 2 (-0.15) is larger than from 2 to 3 (-0.03), suggesting the biggest shift may happen early, consistent with an "awareness effect" where the act of first measuring stress triggers initial behavioral change.

The cohort's starting score (3.77) is slightly higher than the full population (3.69). That makes sense: users who return for a second and third assessment started with slightly more elevated stress, giving them more motivation to track and more room to improve.
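The cohort filter itself is a few lines of pandas. A sketch with hypothetical `user_id`/`score` columns and toy data:

```python
import pandas as pd

# Sketch of the survivorship control: keep only users with 3+ assessments,
# then average the score at each assessment index over that fixed cohort.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3],   # user 3 has only 2 check-ins
    "score":   [3.9, 3.6, 3.5, 3.7, 3.6, 3.6, 3.8, 3.5],
})
df["idx"] = df.groupby("user_id").cumcount() + 1   # 1st, 2nd, 3rd check-in

eligible = df.groupby("user_id")["idx"].max().loc[lambda m: m >= 3].index
cohort = df[df["user_id"].isin(eligible)]
cohort = cohort[cohort["idx"] <= 3]                # first three check-ins only

print(cohort.groupby("idx")["score"].mean())       # same users at every index
```

Because the cohort is fixed before averaging, no one can enter or exit between indices, which is what rules out the dropout explanation.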

Threats to validity

  • No control group: Cannot establish causation. Improvements may reflect natural recovery, therapy, seasonal effects, or life changes.
  • Survivorship bias: Users who improve may be more likely to retake. Cohort analysis (above) partially controls for this.
  • Regression to the mean: Users with extreme baseline scores tend to regress. Regression models include baseline as a covariate.
  • Self-selection bias: Users who choose Onsen are not randomly assigned. They may be more motivated than the general population.
  • Gratitude underpowered: n=37 is insufficient for reliable conclusions (need ~240 for 80% power at d=0.16).
  • Practice effects: Familiarity with questionnaire items on retake could influence scores.
  • Temporal confounds: Score changes could reflect seasonal effects or trends unrelated to app use.
  • Multicollinearity: Engagement metrics correlate with each other and with assessment count, making it difficult to isolate which type of engagement matters most.

Sources

  1. Hills & Argyle, 2002 — The Oxford Happiness Questionnaire
  2. Cohen, Kamarck & Mermelstein, 1983 — A Global Measure of Perceived Stress
  3. McCullough, Emmons & Tsang, 2002 — The Grateful Disposition: A Conceptual and Empirical Topography
  4. Barbosa-Leiker et al., 2013 — Measurement invariance of the PSS and latent mean differences across gender
  5. Serrano-Ripoll et al., 2022 — Impact of smartphone app-based psychological interventions for reducing depressive symptoms (SMD -0.51)
  6. Pennebaker & Beall, 1986 — Confronting a traumatic event: toward an understanding of inhibition and disease
