Maho Takahashi

Linguistics PhD


mtakahas[at]ucsd[dot]edu

How to generate random datasets

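import pandas as pd
# Note: pd.util.testing has been deprecated since pandas 1.0 and may be unavailable in newer releases
sample_df = pd.util.testing.makeDataFrame()
sample_df.head()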
                   A         B         C         D
57UYdLCAz9  0.143324  0.012548 -1.501348 -0.873817
wxTnNfCFsX -1.200068 -0.079628 -0.101257 -1.825547
4PYx9lgmIg  1.054539 -1.463891  0.282045 -0.203671
J7JCQvcH86  1.323521  1.695652 -0.674408  0.638206
46tUB2L5Xm  0.180413  1.887923  0.992859  0.996487

By default, the sample dataset contains 30 rows and 4 columns.
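
If `pd.util.testing` is not available in your pandas version, a similar dataset can be put together from public APIs. The following is only a rough sketch, not the helper's actual implementation: the function name `make_random_df`, the seed, and the use of NumPy's `default_rng` are assumptions made for illustration.

```python
import string

import numpy as np
import pandas as pd


def make_random_df(n_rows=30, n_cols=4, seed=0):
    """Hypothetical helper: random floats with a random 10-character string index."""
    rng = np.random.default_rng(seed)  # seed chosen arbitrarily for reproducibility
    chars = list(string.ascii_letters + string.digits)
    # Build one random 10-character label per row
    index = ["".join(rng.choice(chars, size=10)) for _ in range(n_rows)]
    # Column labels A, B, C, ... as in the output above
    columns = list(string.ascii_uppercase[:n_cols])
    values = rng.standard_normal((n_rows, n_cols))
    return pd.DataFrame(values, index=index, columns=columns)


sample_df = make_random_df()
print(sample_df.shape)
```

Changing `n_rows` and `n_cols` gives a dataset of a different size.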

include NaNs in a random dataset

sample_df = pd.util.testing.makeMissingDataframe()
sample_df.head()
                   A         B         C         D
i3ExiwBU7S -1.212042 -0.892190 -0.039163  0.010801
zFzIsE2HUx  0.842377 -1.474449  0.896692       NaN
lHoqmwixVC -0.432291 -0.140856 -0.820675 -0.876261
ZGtc7XPikQ -0.330881 -2.578911       NaN  0.627988
lXMCFHfcYz  1.095504 -0.670436  1.614122 -0.500160
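
The same kind of dataset can be approximated with public APIs by masking random cells. This is a minimal sketch under stated assumptions: the keep-probability of 0.9 and the seed are illustrative choices, and the default integer index is used instead of random string labels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily
sample_df = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))

# Keep each cell with probability 0.9 (an assumed density); the rest become NaN
keep = rng.random(sample_df.shape) < 0.9
sample_df = sample_df.where(keep)

# Count the missing values per column
print(sample_df.isna().sum())
```

Lowering the threshold in `keep` produces a sparser dataset.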

include a time series index in a random dataset

sample_df = pd.util.testing.makeTimeDataFrame()
sample_df.head()
                   A         B         C         D
2000-01-03 -1.150055  0.340213 -0.193093  1.140389
2000-01-04 -0.116650 -1.928765 -1.934851  1.065835
2000-01-05  0.436392 -1.976887  0.124077 -1.292386
2000-01-06  2.362524 -0.733541 -0.745991 -0.600506
2000-01-07 -0.537766  0.622937 -1.650008 -0.308583
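
A comparable dataset can be sketched with `pd.bdate_range`, which generates business days. The start date is taken from the output above; the 30-row length and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# Business-day index starting at 2000-01-03, the first date in the output above
index = pd.bdate_range("2000-01-03", periods=30)
sample_df = pd.DataFrame(
    rng.standard_normal((30, 4)), index=index, columns=list("ABCD")
)
print(sample_df.head())
```

Because the index holds business days only, weekends are skipped.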

include mixed variables in a random dataset

sample_df = pd.util.testing.makeMixedDataFrame()
sample_df.dtypes
A           float64
B           float64
C            object
D    datetime64[ns]
dtype: object
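
A small mixed-type dataset with the same column layout can also be written out by hand. In this sketch the values and the start date are illustrative placeholders and are not guaranteed to match what the helper returns; the exact dtype reported for the string column may also differ between pandas versions.

```python
import pandas as pd

sample_df = pd.DataFrame(
    {
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],                   # floats
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],                   # floats
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],    # strings
        "D": pd.bdate_range("2009-01-01", periods=5),     # business-day timestamps
    }
)
print(sample_df.dtypes)
```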