Repetitive records BUT different results

Tagged: Data Collection, data repetition

This topic has 6 replies, 2 voices, and was last updated 2 years, 1 month ago by Jeremy.

Viewing 7 posts - 1 through 7 (of 7 total)

Author

Posts
May 26, 2023 at 8:05 am #10630
Chloris
Participant
Hi Jeremy,

It’s me again. I’m cleaning the results as a data quality check, and I found that for some participants the records are duplicated: there should be 216 lines per participant, but a few have 432 (it is not very frequent). The tricky part is that some of the duplicated results differ. I assume that each participant actually did the experiment only once and that this has something to do with data recording, because it is highly unlikely that anyone would spend exactly the same time on every trial twice.

Below is a sample of the erroneous results. RT is in milliseconds.
```
ID	group	score	entry	RT	Label	word_id

fr51	A	1	bateau	94.15	block_1	22
fr51	A	1	bateau	94.15	block_1	22
fr51	A	6	amour	270.6	block_1	8
fr51	A	4	amour	270.6	block_1	8
fr51	A	1	haine	139.6	block_3	119
fr51	A	7	haine	139.6	block_3	119
fr51	A	1	larme	122.55	block_3	131
fr51	A	1	larme	122.55	block_3	131
fr51	A	7	peur	206.8	block_4	165
fr51	A	7	peur	206.8	block_4	165
fr51	A	7	regret	175.3	block_4	180
fr51	A	7	regret	175.3	block_4	180
fr51	A	7	tristesse	135.35	block_4	206
fr51	A	7	tristesse	135.35	block_4	206
```
Do you know the possible reason for this? How do I tell which records are the “real” ones? And how can I keep only the correct records when cleaning the results in R?

Here’s the link for my experiment:
https://farm.pcibex.net/r/BcwcIV/

Many thanks,
Chloris
May 30, 2023 at 8:18 am #10636

Jeremy
Keymaster

Hi Chloris,

The table you include in your message presents the data after it has been transformed, so unless you provide the script or a detailed explanation of how you obtained that table, I can hardly tell how it maps to the raw results of your experiment.

That being said, judging from your table, it looks like you may have summarized the data in groups defined by the ID column, i.e. treated the results as if all the lines referencing “fr51” corresponded to the same submission. That, however, is not the case: I find five submissions in the database that reference “fr51”. Four of those five submissions report the same MD5 hash, indicating that they were taken on the same device using the same browser and the same connection; the remaining submission has a different MD5 hash. I don’t know whether that is unexpected for your collection method.

Jeremy

May 31, 2023 at 3:48 am #10644
Chloris
Participant
Hi Jeremy,

Thanks a lot for your reply! My experiment asked participants to rate French words according to certain criteria. I got the table by selecting relevant columns and then selecting certain words (rows) in R:
```
# select and rename relevant columns
tidied_val <- results_val %>%
  filter(Parameter == "Choice" | Value == "Start") %>%
  select(ID, group, entry, Label, Parameter, PennElementName, Value, EventTime, word_id) %>%
  group_by(group, ID, entry) %>%
  mutate(RT = (mean(EventTime[Parameter=="Choice"] - EventTime[Value=="Start"]))/10)  %>%
  ungroup() %>%
  filter(Parameter == "Choice")  %>%
  select(ID, group, Value, entry, RT, Label, word_id) %>%
  rename(score = Value)

# quality check (valence)
quality_val <- tidied_val %>%
  filter(entry == "amour"| 
           entry == "espoir"| 
           entry == "haine"| 
           entry == "peur"| 
           entry == "tristesse"| 
           entry == "fer"| 
           entry == "sandwich"| 
           
           entry == "cauchemar"|
           entry =="douleur"| 
           entry =="rire"|
           entry =="passion"|
           entry =="jalousie"|
           entry == "racisme"| 
           entry == "anxiété"| 
           entry == "dent"| 
           
           entry == "plaisir"| 
           entry == "souffrance"| 
           entry == "colère"| 
           entry == "désagréable"| 
           entry == "bonheur"|
           entry == "richesse"| 
           entry == "poire"| 
           entry == "moyen"| 
         
           entry == "horreur"| 
           entry == "mensonge"| 
           entry == "confort"| 
           entry == "victoire"|
           entry == "diable"|
           entry == "chagrin"|
           entry == "vertu" |
           entry == "appareil") %>%
  group_by(ID)


readr::write_excel_csv(quality_val, "quality_val.csv") 
```
The ID variable was acquired by asking participants to enter it themselves. I wondered whether it might be two different participants, one of whom accidentally entered the wrong ID. But why would the two records have the same response time in that case?

May I ask what you were referring to by “MD5 hash”? I’m wondering how to tell which four records belong to that single participant…

Best wishes,
Chloris
May 31, 2023 at 5:26 am #10646

Jeremy
Keymaster

Hi Chloris,

The first columns of each line of the results file are described in the IBEX manual: the first column is the reception time of the submission and the second is the MD5 hash. Used together, the two reliably identify the rows that come from the same submission.
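As a rough sketch of that idea, the two columns can be pasted into a single submission key. The rows below are invented, and the column names Results.reception.time and MD5.hash.of.participant.s.IP.address are assumed to match what your import produces; adjust them if your data frame names them differently.

```
library(dplyr)

# Invented rows standing in for the raw results (not real data)
results_val <- data.frame(
  Results.reception.time = c(1685088301, 1685088301, 1685091847, 1685102233),
  MD5.hash.of.participant.s.IP.address = c("aaa111", "aaa111", "aaa111", "bbb222"),
  ID = c("fr51", "fr51", "fr51", "fr51")
)

# One key per submission: reception time + MD5 hash
# ("submission" is a hypothetical column name introduced for illustration)
results_val <- results_val %>%
  mutate(submission = paste(Results.reception.time,
                            MD5.hash.of.participant.s.IP.address,
                            sep = "_"))

# Number of distinct submissions that share the ID "fr51" in this toy table
n_distinct(results_val$submission)
```

Rows that share a key come from the same submission, whatever was typed into the ID field.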

The reason why you get the same RT for different scores is that your code calculates RTs across all the scores: group_by(group, ID, entry) groups the data by group, ID and entry, but not by MD5+ReceptionTime, so a group can contain more than one choice (when multiple submissions are associated with the same ID). In those cases, mutate(RT = (mean(EventTime[Parameter=="Choice"] - EventTime[Value=="Start"]))/10) calculates one RT per group, averaged over all the choices in it. Even though you’re adding a column that contains a single RT value per group, your table at that point still contains multiple choices per group, so when you apply filter(Parameter == "Choice") and rename(score = Value) later on, you end up with multiple scores that share one RT (for those groups that contain multiple choices).
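To make this concrete, here is a small reproduction with invented rows: two submissions share the same ID and entry but have different scores and timings. The "_Trial_" values in Parameter are placeholders for the start-of-trial rows, and the /10 is simply kept from your script.

```
library(dplyr)

# Invented rows: one ID and entry, two separate submissions
toy <- data.frame(
  group     = "A",
  ID        = "fr51",
  entry     = "amour",
  Parameter = c("_Trial_", "Choice", "_Trial_", "Choice"),  # placeholder labels
  Value     = c("Start", "6", "Start", "4"),
  EventTime = c(1000, 3000, 10000, 13412)
)

# Same grouping as in the original script: no submission identifier
by_id <- toy %>%
  group_by(group, ID, entry) %>%
  mutate(RT = mean(EventTime[Parameter == "Choice"] -
                   EventTime[Value == "Start"]) / 10) %>%
  ungroup() %>%
  filter(Parameter == "Choice") %>%
  rename(score = Value)

# Two rows with different scores but one shared, averaged RT
by_id
```

The single group holds both choices, so mean() averages the two differences (2000 and 3412) and both rows receive the same RT.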

I don’t know why you got multiple submissions with the same ID: it could be that one participant took the experiment several times, or that they shared their ID with other participants, some of whom may have taken the experiment on the same browser and device. It looks like that didn’t just happen with fr51; you should double-check your results file yourself.
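One possible way to run that check, sketched on invented rows and assuming the same column names as above, is to count the distinct reception time + MD5 pairs per ID and keep the IDs that have more than one:

```
library(dplyr)

# Invented rows standing in for the raw results (not real data)
results_val <- data.frame(
  Results.reception.time = c(1685088301, 1685088301, 1685091847,
                             1685102233, 1685110412),
  MD5.hash.of.participant.s.IP.address = c("aaa111", "aaa111", "aaa111",
                                           "bbb222", "ccc333"),
  ID = c("fr51", "fr51", "fr51", "fr51", "fr52")
)

# One row per submission, then count submissions per ID
repeated_ids <- results_val %>%
  distinct(ID, Results.reception.time, MD5.hash.of.participant.s.IP.address) %>%
  count(ID, name = "n_submissions") %>%
  filter(n_submissions > 1)

repeated_ids
```

Any ID listed in repeated_ids was entered in more than one submission and deserves a closer look.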

Jeremy

May 31, 2023 at 7:54 am #10647

Chloris
Participant

The same problem also occurred in my other experiment:
https://farm.pcibex.net/r/Iwjgkg/

It happened to fr24. It is still not very frequent (only once), but perplexing.

May 31, 2023 at 8:07 am #10648

Chloris
Participant

Thanks for providing the latest tutorial! I referred to this tutorial https://www.linguisticsociety.org/sites/default/files/PCIbex_Tutorial%5B2%5D.pdf when writing my R code…

Could you suggest a more unequivocal way of calculating the RTs?

Many thanks,
Chloris

May 31, 2023 at 10:51 am #10651

Jeremy
Keymaster

The PDF you link to is indeed outdated. Please follow the latest tutorial instead.

As far as I can tell, however, your issue does not come from PCIbex, or from the way you calculate RTs per se. It comes from the fact that some values (fr51, fr24, …) were entered on several occasions, for multiple submissions. As I mentioned in my previous message, and as suggested by the IBEX manual, you could group by MD5+ReceptionTime instead of, or in addition to, ID (which turned out not to uniquely identify submissions, as we found out): group_by(group, Results.reception.time, MD5.hash.of.participant.s.IP.address, ID, entry). But at the end of the day, if you initially assumed that ID should uniquely identify submissions, you should probably try to figure out why that turns out not to be the case.
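Applied to your script, that grouping could look roughly like the sketch below. The rows are invented, the "_Trial_" values are placeholders, and the rest of the pipeline (including the /10) is kept as you wrote it; the two submission columns are also carried into the output so that rows from different submissions stay distinguishable.

```
library(dplyr)

# Invented rows: two submissions that share the ID "fr51"
results_val <- data.frame(
  Results.reception.time = c(1685088301, 1685088301, 1685091847, 1685091847),
  MD5.hash.of.participant.s.IP.address = "aaa111",
  ID = "fr51", group = "A", entry = "haine", Label = "block_3", word_id = 119,
  Parameter = c("_Trial_", "Choice", "_Trial_", "Choice"),  # placeholder labels
  Value     = c("Start", "1", "Start", "7"),
  EventTime = c(5000, 6000, 90000, 91792)
)

tidied_val <- results_val %>%
  filter(Parameter == "Choice" | Value == "Start") %>%
  # group per submission, not just per ID
  group_by(group, Results.reception.time,
           MD5.hash.of.participant.s.IP.address, ID, entry) %>%
  mutate(RT = mean(EventTime[Parameter == "Choice"] -
                   EventTime[Value == "Start"]) / 10) %>%
  ungroup() %>%
  filter(Parameter == "Choice") %>%
  select(Results.reception.time, MD5.hash.of.participant.s.IP.address,
         ID, group, score = Value, entry, RT, Label, word_id)

tidied_val
```

Each submission now gets its own RT, so the two "haine" rows no longer share a value. Deciding which of the submissions to keep is a separate question that the grouping does not answer.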

Jeremy