I am currently working on my final year dissertation on how experience may affect the reliability of the Foot Posture Index-6 (FPI-6), and I am using multi-rater and weighted kappa to analyse the data. I have no experience with kappa analysis and would appreciate any advice on good resources or general suggestions for using it.
If memory serves, kappa analysis is generally used for qualitative data. What is your methodology? Are you assessing intra-rater reliability of the FPI for both an experienced and an inexperienced user and wanting to compare them to see which is more reliable? If so, it may be just as simple to use intraclass correlation coefficients (ICCs) or limits of agreement (LoAs).
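For what it's worth, limits of agreement are quick to compute once you have paired scores. A minimal sketch in Python with made-up numbers (the two arrays below are illustrative, not real FPI data):

```python
import numpy as np

# hypothetical paired FPI-6 totals from the same rater on two occasions
trial_1 = np.array([3, 5, -2, 7, 0, 4])
trial_2 = np.array([4, 5, -1, 6, 1, 4])

diff = trial_1 - trial_2
bias = diff.mean()                       # systematic difference between trials
half_width = 1.96 * diff.std(ddof=1)     # 95% limits assume roughly normal differences
print(f"bias {bias:.2f}, 95% LoA [{bias - half_width:.2f}, {bias + half_width:.2f}]")
# bias -0.33, 95% LoA [-1.93, 1.27]
```

Whether LoA are appropriate for a discrete, bounded score like the FPI total is a separate question, but the mechanics are that simple.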
I took a convenience sample of 20 subjects with asymptomatic feet and assessed both feet over three trials. Yes, there were two groups of raters: an expert group of 4 and a novice group of 2. Both groups assessed all subjects on the same day.
I am investigating the inter-rater and intra-rater reliability. The inter-rater reliability will be analysed using multi-rater kappa, and the intra-rater reliability will be analysed using weighted kappa.
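Weighted kappa penalises disagreements according to how far apart the categories sit, which suits ordered categories like foot-type bands. A minimal sketch in Python with linear weights (the function name and data are illustrative; stats packages report the same quantity):

```python
import numpy as np

def weighted_kappa(r1, r2, categories, weights="linear"):
    """Weighted Cohen's kappa for two raters over ordered categories."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    obs = np.zeros((k, k))
    for a, b in zip(r1, r2):
        obs[idx[a], idx[b]] += 1
    obs /= obs.sum()
    # expected agreement under independence, from the marginals
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    i, j = np.indices((k, k))
    w = np.abs(i - j) if weights == "linear" else (i - j) ** 2
    return 1 - (w * obs).sum() / (w * exp).sum()

# perfect agreement gives 1.0; complete reversal of two categories gives -1.0
print(weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], categories=[0, 1, 2]))  # 1.0
```

Linear weights penalise a two-band disagreement twice as heavily as a one-band disagreement; quadratic weights (the other common choice) penalise it four times as heavily.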
As the data are categorical and did not have a normal distribution, I can't use ICCs or LoAs.
Categorical (or nominal) data is data which uses labels. When you report FPI data it is usually done qualitatively (e.g. -4 or +6) and it would therefore be interval/ratio data. The gold standard (as far as what journals would expect to see reported) for intra/inter-rater reliability is the ICC.
FPI data is categorical. Categorical data refers to data which can be sorted into categories (i.e. foot types). Ian, you state below that FPI data is reported qualitatively, what do you mean by that? Due to the boundaries of the FPI scale I wouldn't agree that this is interval or ratio data.
The gold standard for reporting reliability data is to make the correct assumptions on the data and run the appropriate statistics.
I'm glad someone with far superior research methods knowledge to me has pitched in on this, as I've been mulling it over for the last few days. By saying it was reported qualitatively I meant numerically (e.g. +12 or -8). Naturally it could be sorted into categories (or labels, as I referred to them earlier) based on one of the five 'foot types', but is this the norm with respect to a reliability study? Is it rigorous enough?
I assumed it was not, as by categorising into foot types (taking the highly supinated foot type of the FPI-6 as an example) we cover the range of -5 to -12. I could therefore score someone differently by up to 7 points, but as both scores sit in the same category it would appear to have good intra-rater reliability? I'd appreciate your thoughts on this; in the meantime I will delve back into the FPI literature and re-read.
PS I'm assuming Nikki is a student of yours at UEL so apologies if I have added to her woes or confusion!
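The boundary point above can be shown in a few lines of Python. The cut-offs below follow the commonly quoted FPI-6 reference bands; treat the exact values as assumptions and check them against the FPI manual:

```python
def fpi_category(total):
    """Map a raw FPI-6 total to a foot-type band.

    Cut-offs as commonly quoted for the FPI-6; verify against the
    published reference values before relying on them."""
    if total <= -5:
        return "highly supinated"
    if total <= -1:
        return "supinated"
    if total <= 5:
        return "normal"
    if total <= 9:
        return "pronated"
    return "highly pronated"

# two assessments 7 points apart on the raw score...
score_a, score_b = -12, -5
# ...still agree perfectly once categorised
print(fpi_category(score_a) == fpi_category(score_b))  # True
```

This is exactly how categorising can flatter a reliability estimate: large raw-score disagreements inside one band are invisible to a category-level kappa.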
I suspect it is a student of ours, although not under a name that I recognise. If that is the case, then clearly their meeting with the statistician at the University has caused them further confusion.
I'm sure there are inconsistencies in the methods for reporting the FPI, but as a tool for quantifying foot posture I'd argue that the scores on their own don't mean much. I'm not sure that I grasp what you mean when referring to the norms for a reliability study, however; the study is looking at the reliability across raters in their classification of foot type, and therefore I can't see why the design is a problem (or not rigorous enough).
I appreciate what you are saying regarding the supinated foot but, in recognition of the limitations of the index, we can't do much about that other than recognise the possibility for reliability values which may not be a true reflection. There are processes where the data can be transformed in order to run parametric tests - possibly worth pursuing - but due to the pending submission deadlines for the BSc project this isn't feasible at this stage.
Quote: "I'm not sure that I grasp what you mean when referring to the norms for a reliability study however, the study is looking at the reliability across raters in their classification of foot type and therefore I can't see why the design is a problem (or not rigorous enough)."
I meant the norm for a reliability study of the FPI (not a reliability study in general).
I am no stats whizz by any stretch, but my intuitive assumption was that you would treat the data as numerical values rather than labels/categories. My understanding is that an inter-rater reliability study is essentially trying to ascertain how repeatable a measurement is between two raters (whatever that measurement may be). So if you and I were both measuring the length of every subject's left leg, we would have a table of both our measurements for each subject (in mm, i.e. numerically) and then we would run the stats to see how 'similar' (not a stats word, I know) they were. Would this not be the same for the FPI? Are we not just seeing how 'similar' rater A's and rater B's measurements (or values) are?
As I said no stats whizz - happy to 'get my coat' if I'm way off
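For continuous data like the leg-length example, the usual statistic would be the ICC. A minimal sketch of ICC(2,1) (two-way random effects, single measure) built from the ANOVA mean squares, following the Shrout and Fleiss conventions; the demo data are made up:

```python
import numpy as np

def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.
    data is an array with rows = subjects, columns = raters."""
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)
    col_means = data.mean(axis=0)
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-raters
    ss_err = ((data - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# rows = subjects, columns = raters; identical columns give an ICC of 1.0
print(icc_2_1(np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])))  # 1.0
```

Note that a constant offset between raters lowers ICC(2,1), because it measures absolute agreement rather than just consistency.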
I've just had another read of the Cornwall et al. paper (attached above) and it partly answers my question. In part of their study they tried to determine whether reliability was better when using the raw 'score' or when classifying feet into foot types.
When analysing the raw score: Wilcoxon test used
When analysing the category: kappa coefficient used
When analysing the raw score: independent t-test used
When analysing the category: kappa coefficient used
The ICCs were calculated for all.
Conclusion: Classification of feet based on the raw FPI-6 score does not seem to improve the amount of agreement between clinicians.
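For agreement between more than two clinicians, the multi-rater statistic is Fleiss' kappa. A minimal sketch in Python (the count-matrix layout and example values are illustrative):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. counts[i][j] = number of raters who placed subject i
    in category j; every subject must be rated by the same number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_subjects = counts.shape[0]
    n_raters = counts[0].sum()
    # proportion of all assignments that fell in each category
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)
    # observed agreement, averaged over subjects
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    p_e = (p_j ** 2).sum()  # chance agreement from the category marginals
    return (p_bar - p_e) / (1 - p_e)

# three raters, perfect agreement on every subject -> kappa of 1.0
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # 1.0
```

Unlike weighted kappa, Fleiss' kappa treats all disagreements as equal, so it suits nominal categories rather than the ordered FPI bands.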
Hi Ian, I hope I haven't lost the focus with this answer. You are right that an inter-rater reliability study is comparing between raters, and I suppose you would compare the scores, but you need to look at the underlying assumptions of the data (and what is being provided by the measurement tool).

I think the leg length example is an interesting one but not directly comparable, because with leg length we are not looking at a summated score (we would be comparing the analysis of continuous data with categorical). In essence, the data generated by the FPI are discrete, being limited by the boundaries of the score (-12 to +12); this is not the case with leg length measurements.

As we are taking a sum of six criteria, we need to consider the six individual items and how much each contributes to the final score. There is an assumption that each individual item of the index, and the divisions within that item, have equal weighting; this is not necessarily the case! (And this is where the weighted kappa comes in, I think.)

If we want to run the analysis like the leg length data then we need to transform the data to logit-transformed scores. This is the process of changing raw FPI-6 scores into a form suitable for parametric analysis, but it requires large data sets (Keenan et al., 2007).
Keenan AM, Redmond AC, Horton M, Conaghan PG, Tennant A (2007). The Foot Posture Index: Rasch analysis of a novel, foot-specific outcome measure. Arch Phys Med Rehabil 88: 88-93.