Just testing how the bounty works! (I'll remove this text after 24 hrs)
I have a data frame that looks like this:
USUBJID IDVAR IDVARVAL QNAM QVAL
0 Dummy-01-0001 AESEQ 1.0 AEdummy1 2012-02-15
1 Dummy-01-0002 AESEQ 1.0 AEdummy1 2012-02-23
2 Dummy-01-0004 AESEQ 1.0 AEdummy1 2012-02-06
3 Dummy-01-0004 AESEQ 2.0 AEdummy1 2012-03-10
4 Dummy-01-0005 AESEQ 1.0 AEdummy1 2012-03-10
5 Dummy-01-0001 AESPID 1 dummy2AE Gastrointestinal disorders
6 Dummy-01-0002 AESPID 1 dummy2AE Nervous system disorder
7 Dummy-01-0004 AESPID 2 dummy2AE Gastrointestinal disorders
8 Dummy-01-0004 AESPID 1 dummy2AE Nervous system disorder
9 Dummy-01-0005 AESPID 1 dummy2AE Gastrointestinal disorders
Above dataframe is just a sample, please use following code to get some more data into dataframe.
df = pd.DataFrame({'USUBJID': {0: 'Dummy-01-0001', 1: 'Dummy-01-0002', 2: 'Dummy-01-0004', 3: 'Dummy-01-0004', 4: 'Dummy-01-0005', 5: 'Dummy-01-0007', 6: 'Dummy-01-0008', 7: 'Dummy-01-0008', 8: 'Dummy-01-0010', 9: 'Dummy-01-0010', 10: 'Dummy-01-0011', 11: 'Dummy-01-0011', 12: 'Dummy-01-0013', 13: 'Dummy-01-0013', 14: 'Dummy-01-0014', 15: 'Dummy-01-0016', 16: 'Dummy-01-0016', 17: 'Dummy-01-0019', 18: 'Dummy-01-0001', 19: 'Dummy-01-0002', 20: 'Dummy-01-0004', 21: 'Dummy-01-0004', 22: 'Dummy-01-0005', 23: 'Dummy-01-0007', 24: 'Dummy-01-0008', 25: 'Dummy-01-0008', 26: 'Dummy-01-0010', 27: 'Dummy-01-0010', 28: 'Dummy-01-0011', 29: 'Dummy-01-0011', 30: 'Dummy-01-0013', 31: 'Dummy-01-0013', 32: 'Dummy-01-0014', 33: 'Dummy-01-0016', 34: 'Dummy-01-0016', 35: 'Dummy-01-0017', 36: 'Dummy-01-0017', 37: 'Dummy-01-0019'}, 'IDVAR': {0: 'AESEQ', 1: 'AESEQ', 2: 'AESEQ', 3: 'AESEQ', 4: 'AESEQ', 5: 'AESEQ', 6: 'AESEQ', 7: 'AESEQ', 8: 'AESEQ', 9: 'AESEQ', 10: 'AESEQ', 11: 'AESEQ', 12: 'AESEQ', 13: 'AESEQ', 14: 'AESEQ', 15: 'AESEQ', 16: 'AESEQ', 17: 'AESEQ', 18: 'AESPID', 19: 'AESPID', 20: 'AESPID', 21: 'AESPID', 22: 'AESPID', 23: 'AESPID', 24: 'AESPID', 25: 'AESPID', 26: 'AESPID', 27: 'AESPID', 28: 'AESPID', 29: 'AESPID', 30: 'AESPID', 31: 'AESPID', 32: 'AESPID', 33: 'AESPID', 34: 'AESPID', 35: 'AESPID', 36: 'AESPID', 37: 'AESPID'}, 'IDVARVAL': {0: '1.0', 1: '1.0', 2: '1.0', 3: '2.0', 4: '1.0', 5: '1.0', 6: '1.0', 7: '2.0', 8: '1.0', 9: '2.0', 10: '1.0', 11: '2.0', 12: '1.0', 13: '2.0', 14: '1.0', 15: '1.0', 16: '2.0', 17: '1.0', 18: '1', 19: '1', 20: '2', 21: '1', 22: '1', 23: '1', 24: '1', 25: '2', 26: '2', 27: '1', 28: '1', 29: '2', 30: '1', 31: '2', 32: '1', 33: '1', 34: '2', 35: '1', 36: '2', 37: '1'}, 'QNAM': {0: 'AEdummy1', 1: 'AEdummy1', 2: 'AEdummy1', 3: 'AEdummy1', 4: 'AEdummy1', 5: 'AEdummy1', 6: 'AEdummy1', 7: 'AEdummy1', 8: 'AEdummy1', 9: 'AEdummy1', 10: 'AEdummy1', 11: 'AEdummy1', 12: 'AEdummy1', 13: 'AEdummy1', 14: 'AEdummy1', 15: 'AEdummy1', 16: 'AEdummy1', 17: 'AEdummy1', 18: 'dummy2AE', 19: 'dummy2AE', 20: 'dummy2AE', 21: 'dummy2AE', 22: 'dummy2AE', 23: 'dummy2AE', 24: 'dummy2AE', 25: 'dummy2AE', 26: 'dummy2AE', 27: 'dummy2AE', 28: 'dummy2AE', 29: 'dummy2AE', 30: 'dummy2AE', 31: 'dummy2AE', 32: 'dummy2AE', 33: 'dummy2AE', 34: 'dummy2AE', 35: 'dummy2AE', 36: 'dummy2AE', 37: 'dummy2AE'}, 'QVAL': {0: '2012-02-15', 1: '2012-02-23', 2: '2012-02-06', 3: '2012-03-10', 4: '2012-03-10', 5: '2012-02-08', 6: '2012-03-18', 7: '2012-03-07', 8: '2012-02-01', 9: '2012-01-10', 10: '2012-01-19', 11: '2012-03-28', 12: '2012-02-19', 13: '2012-02-14', 14: '2012-03-13', 15: '2012-03-08', 16: '2012-02-05', 17: '2012-03-18', 18: 'Gastrointestinal disorders', 19: 'Nervous system disorder', 20: 'Gastrointestinal disorders', 21: 'Nervous system disorder', 22: 'Gastrointestinal disorders', 23: 'Vascular disorders', 24: 'Gastrointestinal disorders', 25: 'Vascular disorders', 26: 'Nervous system disorder', 27: 'Gastrointestinal disorders', 28: 'Nervous system disorder', 29: 'Nervous system disorder', 30: 'Nervous system disorder', 31: 'Gastrointestinal disorders', 32: 'Gastrointestinal disorders', 33: 'Vascular disorders', 34: 'Gastrointestinal disorders', 35: 'Nervous system disorder', 36: 'Nervous system disorder', 37: 'Vascular disorders'}})
Let's understand the structure a bit. IDVAR
holds the variable name and IDVARVAL
holds its values, and QNAM
holds another variable name and QVAL
holds the corresponding value. This structure, in the domain I'm working is called normalized structure, I know it's weird though.
I want to get this data in the following form:
USUBJID AESEQ AESPID AEdummy1 dummy2AE
0 Dummy-01-0001 1.0 1 2012-02-15 Gastrointestinal disorders
1 Dummy-01-0002 1.0 1 2012-02-23 Nervous system disorder
2 Dummy-01-0004 1.0 2 2012-02-06 Gastrointestinal disorders
3 Dummy-01-0004 2.0 1 2012-03-10 Nervous system disorder
4 Dummy-01-0005 1.0 1 2012-03-10 Gastrointestinal disorders
If I'd have to do it for QNAM
and QVAL
only, there is no problem, I can easily do it using pandas' pivot as follows:
df.pivot(
index=['USUBJID', 'IDVARVAL', 'IDVAR'],
columns='QNAM',
values='QVAL'
).reset_index()
Even for this problem, I can use some sort of masking, but I know that is not going to be the most efficient approach since this data is going to have thousands or sometimes millions of records.
Please be noted that: The combination of USUBJID
, and IDVARVAL
for an IDVAR
should map to the value of QVAL
for a QNAM
. That being said, some of the records for AESEQ
, AESPID
, AEdummy1
, and dummy2AE
could be NaN or empty for the output DataFrame. In other words, for the above sample, Dummy-01-0001
and AESEQ=1.0
uniquely identifies AEdummy1 = 2012-02-15
.
from How to denormalize a complex dataframe structure?
No comments:
Post a Comment